$par-skp
$rm6 $hds
           SGML: What It Is and Why It's Good for Braille $l
                     (Though We Still Need UBC)
$hde $rm $sk1 $ptys
$tls SGML and Braille $tle
by Joseph E. Sullivan $l $ind3
  Member, International Committee for Accessible Document Design
$l
  Member, Texas Commission on Braille Textbook Production $l
  Chairman, Committee II of BANA Unified Braille Code Research Project
$l
  President, Duxbury Systems, Inc. $l $ind1
$sk1
March 9, 1993
$sk1
This document is hereby placed in the public domain.
$ptye


$cp3:1 $ifbrl $ind5 INTRODUCTION $ind1

Many of today's technologies seem to evolve so fast that there's
no keeping up with them.  For instance, practically all of the
latest and greatest computers of the 1970's are no longer seen
outside museums and reruns of the science fiction movies of that
era.  Many computers of the early 1980's are now serving as
bookends.  Even last year's computers suffer noticeably by
comparison with this year's, and so it goes.  Compared with
computers, braille -- invented in the early nineteenth century
-- would seem to be a model of solid and mature technology, one
that does not and probably should not evolve very rapidly.  While
that is basically correct, braille too can profit from the
inevitable march of progress, and at the present time there are
two technologies being developed that are, in my opinion, of
particular importance to the entire community of persons who
produce or use braille.

Those technologies are SGML (Standard Generalized Markup
Language) and UBC (Unified Braille Code).  Once you get past the
alphabet soup, the concepts behind these technologies are not
really difficult, even though they are a bit "technical" if you
get down to some of the details.  This paper concentrates
primarily on SGML, attempting to explain in nontechnical terms
both SGML itself and its benefits for braille, for the layman who
may not know much about either markup languages or braille.  It
goes on to explain
why certain problems with braille are not solved by SGML, and it
touches on UBC as a promising complementary technology.  For
those readers who are already well versed in one or another
aspect of these subjects, it will be obvious (I hope) which
paragraphs may be skipped over.


$cp3:1 $ifbrl $ind5 WHAT IS SGML? $ind1

Since the 1960's, one of the more routine uses of computers has
been to format text, that is to produce "finished" documents from
text stored as a file on the computer.  To make this a bit more
concrete, let us assume that we wish to produce a simple document
that has several sections, each section having a heading and one
or more ordinary paragraphs.  Let us further assume that, in the
print edition, we want the section headings to be centered and
the paragraphs to be separated from each other, and from the
headings, by a skipped line.  Then we might start with a computer
file that contains the following text:
$sk1 $ind3
  [start-centering] First Heading [end-centering] [skip-line] This
  is the text of the first paragraph, which would normally be much
  longer but we do not wish to belabor this example.  [skip-line]
  This is the text of a second paragraph.  Again we will be
  unusually brief.  [skip-line] [start-centering] Second, Somewhat
  Longer Heading [end-centering] [skip-line] This is the first
  paragraph in the second section...
$sk1 $ind1
The items enclosed in brackets in the above example are obviously
not part of the text that is intended for the eventual human
reader, but rather "formatting codes" to be used by the computer
program that is to do the formatting.  The formatting program
uses them to tell what text to lay down in ordinary paragraphs,
what text to center, and where to skip lines.  Those codes, and
the text itself, are all that is important in this file.  In
particular, line endings in the original file signify nothing
more than a break between words, and so are exactly equivalent to
spaces; to emphasize this, we have deliberately shown the file as
running on in a kind of stream that is a bit hard to read (sorry
about that).

Starting with the above file, the output of the formatting
program would be something like this:
$kps $ind3 $ifbrl $sk1 $hds
                          First Heading
$sk1 $hde
  This is the text of the first paragraph, which would normally
  be much longer but we do not wish to belabor this example.
$sk1
  This is the text of a second paragraph.  Again we will be
  unusually brief.
$ifbrl $sk1 $hds
                 Second, Somewhat Longer Heading
$sk1 $hde
  This is the first paragraph in the second section...
$sk1 $ind1 $kpe
"Codes" may also be called "tags" or "markup".  In the example
given, the codes are obviously related directly to the
appearance, or format, of the resulting document, and so would be
called format codes or appearance-oriented markup.  This general
kind of markup is no doubt familiar to most users of
word-processing software.  In WordPerfect, for example, you need
only press the "Reveal Codes" function to see markup much like
our example, although the actual codes are different.  There are
a great many computerized formatting systems based on this type
of coding, some of them designed to accommodate braille
formatting as well as print, and much useful work has been and
continues to be done by them.  Even if you do not use such
computerized formatting systems, if you have had occasion to
re-type several pages of a paper just because of a few editorial
changes, you can appreciate how much labor is saved by having the
computer re-do the tedious page formatting after you have made
just the essential text changes in the original file.
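
To make the idea of a formatting program a bit more concrete, here
is a minimal sketch in Python of one that understands just the
three bracketed codes of our example.  It is an illustration only,
not any actual product: the function name, the 66-column line
width and the handling of unknown codes are all assumptions made
for this sketch.

```python
import re

def format_appearance(source, width=66):
    """Lay out text marked up with the three appearance codes."""
    # Split the stream into bracketed codes and words.  Line endings
    # are just whitespace here, exactly equivalent to spaces.
    tokens = re.findall(r"\[[a-z-]+\]|[^\s\[\]]+", source)
    lines = []          # finished output lines
    words = []          # words waiting to be laid down
    centering = False   # True between the two centering codes

    def flush():
        # Fill the waiting words into lines no longer than `width`,
        # centering each line if centering is currently in effect.
        line = ""
        for word in words:
            if line and len(line) + 1 + len(word) > width:
                lines.append(line.center(width).rstrip() if centering else line)
                line = word
            else:
                line = f"{line} {word}" if line else word
        if line:
            lines.append(line.center(width).rstrip() if centering else line)
        words.clear()

    for token in tokens:
        if token == "[start-centering]":
            flush()
            centering = True
        elif token == "[end-centering]":
            flush()
            centering = False
        elif token == "[skip-line]":
            flush()
            lines.append("")
        elif token.startswith("["):
            # Assumed behavior: reject codes this sketch does not know.
            raise ValueError(f"unknown format code: {token}")
        else:
            words.append(token)
    flush()
    return "\n".join(lines)
```

Calling format_appearance on the text of the example file gives
centered headings and paragraphs separated by skipped lines, much
as in the formatted sample shown earlier.  Note that the program
knows nothing about headings or paragraphs as such; it simply
obeys the codes.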

Superficially, SGML codes are a lot like format codes, but they
have important additional advantages.  Again to keep this in the
concrete, let us re-code our original example SGML-style:
$sk1 $ind3
  [start-heading] First Heading [end-heading] [start-paragraph] This
  is the text of the first paragraph, which would normally be much
  longer but we do not wish to belabor this example. [end-paragraph]
  [start-paragraph] This is the text of a second paragraph.  Again
  we will be unusually brief.  [end-paragraph] [start-heading]
  Second, Somewhat Longer Heading [end-heading] [start-paragraph] This
  is the first paragraph in the second section... [end-paragraph]
$sk1 $ind1
The first thing we notice is that now the codes no longer refer
to skipped lines, centering and such matters of appearance but
rather to the CONTENT of the material, that is the nature of the
text in the logical organization of the document.  This
corresponds to the way that the original author thinks about the
document.  That author may not know and may not care what
formatting devices, such as line skips and centering, may
eventually be used to format the document for presentation, but
he or she (or an allied specialist) can still enter these
content-oriented codes because they correspond to natural
divisions of the material itself.  When it comes time to publish,
a person whose speciality is document appearance can decide how
to turn these codes into formatting rules.  There are various
ways of doing this.  One way might be to specify what the SGML
tags "mean" in terms of formatting tags, for example:
$sk1 $kps $ind3 $ptys
  [start-heading] means: [skip-line] [start-centering] $l
  [end-heading] means: [end-centering] [skip-line] $l
  [start-paragraph] means: [skip-line] $l
  [end-paragraph] means: [skip-line] $l
$kpe $sk1 $ind1 $ptye

These rules aren't quite right, because if they are taken too
literally then, for example, we would wind up with two skipped
lines between paragraphs.  But we did promise to stay clear of
technical details; you get the idea.
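
For those who do want a peek at the details, the four rules can be
written as a lookup table, as in the following Python sketch.  The
names RULES and to_format_codes are hypothetical, and the
refinement that drops a skipped line when one has just been issued
is simply one assumed way of avoiding the doubled line skips
mentioned above; a real system might well handle it differently.

```python
import re

# The four rules listed above, written as a lookup table.
RULES = {
    "[start-heading]": ["[skip-line]", "[start-centering]"],
    "[end-heading]": ["[end-centering]", "[skip-line]"],
    "[start-paragraph]": ["[skip-line]"],
    "[end-paragraph]": ["[skip-line]"],
}

def to_format_codes(source, rules=RULES):
    """Rewrite content tags as format codes, one rule at a time."""
    out = []
    for token in re.findall(r"\[[a-z-]+\]|[^\s\[\]]+", source):
        if not token.startswith("["):
            out.append(token)          # ordinary text passes through
            continue
        if token not in rules:
            raise ValueError(f"no formatting rule for tag: {token}")
        for code in rules[token]:
            # Assumed refinement: never emit two skips in a row.
            if code == "[skip-line]" and out and out[-1] == "[skip-line]":
                continue
            out.append(code)
    return " ".join(out)
```

Applied to the SGML-style file, this yields a stream of format
codes with a single skipped line between paragraphs.  Changing the
document's appearance then means editing the table, not the text.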

With respect to the eventual formatting, then, SGML coding may be
said to be "indirect".  That is, you can't really tell, from the
original file alone, what the document will look like, but must
reference the rules in order to figure that out.  This is
sometimes a disadvantage, but in many important situations it is
more than offset by advantages that also derive from this
indirectness.  One is that, as we have seen, the text coding on
the one hand, and the drawing up of the formatting rules on the
other, can be done by separate people each of whom is especially
appropriate for the particular task.  Another is that it remains
possible for each to work independently of the other.  The author
may change the text to improve or update the content, while the
document designer may alter the document formatting rules, yet
neither gets in the other's way.  An important corollary is that
eventually there may come to be multiple sets of formatting
rules, each for a particular type of document, such as a magazine
article set in columns, a hard-cover book, and a paperback
-- some of which may not even have been contemplated at the time
of original composition.

That's why SGML is "generalized" markup: it does not specifically
determine the format but rather can work with many different
format styles.

People who work regularly with word processors may recognize a
closely related concept here, namely the notion of a "style", as
it is now usually called.  (Especially in older word
processors, the term "macro" may be used with much the same
meaning.)  Basically, defining a style is equivalent to giving a
name to a collection of formatting codes.  After defining a
number of styles, the codes that you use thereafter in the text
may consist only of the style names, with no further need to
reference the basic formatting codes directly. If, for example,
the styles defined had the same names as our four SGML tags, and
were equated to sets of format codes similarly to the four rules
listed above, then the coding of the file could be entirely
indistinguishable from SGML, and have the same quality of
indirectness and consequently the other, mostly beneficial,
qualities that derive from indirectness.

Thus styles, consistently used, can have many of the same
benefits as SGML.  However, with styles there is generally no
mechanism to enforce consistent or logical use.  Thus, for
example, nothing would prevent using a "start-paragraph" tag
between a "start-heading" tag and its corresponding "end-heading"
tag.  It is chiefly on this point that SGML goes beyond styles.
With SGML, the set of tags that can be used, and the ways that
they can be used, are precisely defined in a separate file called
a Document Type Definition (DTD).  For example, the DTD defined
for our simple case would not only list the four desired tags,
but would undoubtedly also specify that paragraphs may not occur
within headings, nor vice versa.  Thus the user of an SGML system
is spared the possibility of making illogical coding errors.
Consequently, from the perspective of the DTD designer, it is
possible by means of the DTD to enforce a required "structure" on
the document.  That is, he or she may specify, for example, the
order in which various kinds of headings can appear, whether they
are mandatory or not, and many other matters that ultimately
determine what authors may do when writing documents governed by
that DTD.
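
The flavor of this enforcement can be suggested by a small Python
sketch.  A real DTD is written in SGML's own declaration syntax and
is checked by an SGML parser; the table ALLOWED_CHILDREN and the
function validate below are hypothetical stand-ins that capture
only the one rule of our simple case, namely that headings and
paragraphs may not occur within each other.

```python
import re

# Hypothetical stand-in for a DTD: for each element, the elements
# that may appear directly inside it.
ALLOWED_CHILDREN = {
    "document": {"heading", "paragraph"},
    "heading": set(),      # text only
    "paragraph": set(),    # text only
}

def validate(source, model=ALLOWED_CHILDREN):
    """Return a list of structure errors; empty means the file conforms."""
    errors = []
    stack = ["document"]   # elements currently open, outermost first
    for token in re.findall(r"\[[a-z-]+\]|[^\s\[\]]+", source):
        match = re.fullmatch(r"\[(start|end)-([a-z-]+)\]", token)
        if match is None:
            if token.startswith("["):
                errors.append(f"tag {token} is not defined")
            elif stack[-1] == "document":
                errors.append(f"text {token!r} is outside any element")
            continue
        kind, name = match.groups()
        if name not in model:
            errors.append(f"tag {token} is not defined")
        elif kind == "start":
            if name not in model[stack[-1]]:
                errors.append(f"{name} may not occur within {stack[-1]}")
            stack.append(name)
        elif len(stack) > 1 and stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"{token} does not close the open {stack[-1]}")
    if len(stack) > 1:
        errors.append(f"{stack[-1]} is never closed")
    return errors
```

A correctly coded file produces an empty list, while a
start-paragraph tag placed inside a heading is reported as an
error instead of being silently accepted, which is exactly what a
style-based word processor would not do.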

With respect to what can be in an SGML document, then, we may say
that it has an "enforced structure".  As with indirectness, this
quality can sometimes be a drawback, because you can't simply
introduce a new tag into the text file, any more than you can
directly specify an imaginative new format.  Rather, you must
first define the tag (or more properly the start-end tag pair,
enclosing a "text element") in the DTD, and also specify how it
must be used; only then may you use the tag in a document
associated with that DTD.  This may be viewed as a loss of
flexibility or at least spontaneity, but it is a positive gain in
many cases where it is necessary to ensure uniformity over large
classes of documents.  Examples of such cases would be military
procedure manuals and the software user's guides within a series
for a computer system.  It is obvious, then, why SGML is most
popular in those organizations that must regularly issue such
documents.


$cp3:1 $ifbrl $ind5 IS SGML A NEW IDEA? $ind1

As computer-related technologies go, SGML is rather old. Deriving
from work by William Tunnicliffe, Charles Goldfarb and others in
the late 1960's and through the 1970's, it has become both a
standard of the International Organization for Standardization
(ISO 8879, published in 1986) and the technology behind several
commercial products.

Today, though, SGML is still not widely used for general document
production.  Rather, its use remains largely tied to those
situations where enforced structure is highly desirable. Most
documents are produced in circumstances where there is less need
for formality, where the task at hand is primarily to produce one
specific edition, and where it may even be desirable to use
innovative format techniques to enhance document content through
visual aesthetics and interest.  These characteristics run
counter to SGML's properties of indirectness and enforced
structure. Consequently, it is not surprising that the great bulk
of document production still takes place using systems that are
appearance-oriented.  All of the most popular word processors
used in business offices, such as WordPerfect, Microsoft Word and
Ami Pro, and all of the most popular page composition programs
favored by professional publishers, such as CorelDraw and
PageMaker, are of that kind.

Despite the fact that appearance-oriented methods still dominate
after all these years, SGML has remained an important technology,
and more to the point now appears to be gaining in importance as
its strengths are more recognized and needed.  As always, things
are changing.  Increasingly, publication in a specific paper
format is not the end of the story for a typical document.
Instead, it is now more likely that the information will be
retained in a database to allow for electronic reference, and
also that it will be retained for possible publication in
alternative formats, including those specifically adapted for
persons with disabilities.  Information in a database is more
useful when it is structured, so that for example you can
retrieve just those documents where a particular person is listed
as author, without retrieving those where that person is
otherwise mentioned.  As we have seen, both this kind of
structure and the indirectness that favors alternative formats
are defining strengths of SGML.  Consequently, there seems to be
increasing interest in SGML, with attendant progress in
technology, especially along the lines of combining SGML with
appearance-oriented methods so that it is easier to experience
the best of both.  For example, SoftQuad's Author/Editor, which
is firmly SGML-based, nevertheless allows viewing and working
with the document in formatted form on screen, as is typical of
appearance-oriented word processors, and also provides many other
convenience functions to reduce user involvement with the
technical aspects of markup.  We are also seeing general-purpose
DTD's being adopted as standards, so that organizations and
authors typically do not need to design DTD's unless their needs
are quite specialized.  Lastly, from the appearance-oriented end
of the spectrum, connections to SGML or at least SGML-like
facilities are beginning to appear.  Thus the direction that all
of this is going is clear, in my opinion, even if the final form
of the technology is not so clear because there is still quite a
way to go.


$cp3:1 $ifbrl $ind5 WHAT CAN SGML DO FOR BRAILLE? $ind1

As discussed above, the indirect way that SGML specifies
format makes it especially suited for cases where a document is
to be produced in multiple formats.  It is really just a
corollary of this principle that makes SGML beneficial for
braille production, for in most cases braille is an alternative
publication format for a document already produced in print.

The treatment of ordinary paragraphs provides an example that is
both simple and of practical importance.  Let us assume that the
print copy skips lines between paragraphs, as in our earlier
example.  Even though that is now probably the most
common format used in print, it is not customarily used in
braille; rather, simple paragraph breaks are usually shown only
by an indentation of the first line, without skipping a line.  On
the other hand, there are cases where a skipped line in print
could correspond to a skipped line in braille -- around headings
and tables, for example.  Consequently, a file coded for the
appearance of the print document, as in the first coding
presented above, cannot be used just as it is.  Nor can it easily
be converted by automatic means, because some of the line skips
are for paragraph breaks and some are for other purposes.  Thus,
especially in practical cases that are naturally more extensive
and complex than our example, considerable human labor, of a
tedious nature, is needed to help sort out the various purposes
of the skipped lines and other print formatting devices.  By
contrast, the SGML file contains exactly the needed information:
paragraphs and headings are clearly and separately identified and
so the appropriate braille formatting is readily automated.
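
As a rough illustration of that automation, the following Python
sketch lays out the same content-tagged file in two ways.  It deals
only with page layout, not with translation into braille cells,
and its particulars are assumptions made for the example: the
function names, the 66-column print line, the 40-cell braille line
and the two-cell paragraph indentation.

```python
import re
import textwrap

def parse_elements(source):
    """Return [(element_name, text), ...] from the tagged file."""
    pattern = r"\[start-(heading|paragraph)\](.*?)\[end-\1\]"
    return [(name, " ".join(text.split()))
            for name, text in re.findall(pattern, source, flags=re.DOTALL)]

def render(elements, width, skip_between_paragraphs, paragraph_indent):
    lines = []
    previous = None
    for name, text in elements:
        if name == "heading":
            # Headings are centered, with a skipped line on each side.
            if lines and lines[-1] != "":
                lines.append("")
            lines.extend(part.center(width).rstrip()
                         for part in textwrap.wrap(text, width))
            lines.append("")
        else:
            if previous == "paragraph" and skip_between_paragraphs:
                lines.append("")
            lines.extend(textwrap.wrap(
                text, width, initial_indent=" " * paragraph_indent))
        previous = name
    return "\n".join(lines)

def render_print(elements):
    # Print style of our example: skipped line between paragraphs.
    return render(elements, width=66,
                  skip_between_paragraphs=True, paragraph_indent=0)

def render_braille_layout(elements):
    # Assumed braille-style layout: indented first line, no skipped line.
    return render(elements, width=40,
                  skip_between_paragraphs=False, paragraph_indent=2)
```

Because the file says which text is a heading and which is a
paragraph, neither function has to guess why a line was skipped;
each simply applies its own rules to the same elements.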

Perhaps this is the place to mention that, from what I have
observed, a saving of human labor in braille production generally
translates into more and better braille, that is to more
productive and interesting jobs for those who work with braille,
not lost jobs!

Coming back to the subject, another way of looking at SGML files
is that they are more useful than appearance-coded files because
they contain more information.  That is, they not only determine
(indirectly) WHAT is done in the way of formatting but WHY, in
terms of the author's intended structural divisions.  That
additional information happens to be quite useful for braille
production.


$cp3:1 $ifbrl $ind5 IS THE APPLICATION OF SGML TO BRAILLE A NEW IDEA?
$ind1

The potential benefits of SGML have long been evident to the
community involved with automating braille production.  Back in
the early 1980's, the National Braille Press carried out a
project called POINTS, under the sponsorship of the Library of
Congress, to investigate among other things the viability of
using SGML (then called "generic coding") as a common target for
conversion of text coming from various typesetting systems and
other disparate sources, and as a common source for text to be
published in various alternative formats including braille.  By
obvious analogy, this idea was called the "hourglass" principle
at the time, with SGML serving as the narrow working center of
the hourglass.  All of us involved with that project believed in
SGML on theoretical grounds, and at the end of the project felt
that our beliefs had been substantially confirmed.

It must nevertheless be acknowledged that, when it comes down to
the production techniques developed in the POINTS project that
continued to be used, most of them actually "went around" SGML in
most practical cases.  That is, the print format coding was, and
still is, mostly converted directly to braille format coding
without ever going through an SGML stage.  Why?  Because most real
braille production is concerned with just that specific format,
and takes place under get-the-job-done constraints on the use of
personnel and other resources.  Many different kinds of
literature, some of them requiring innovative format treatment in
braille, must be processed.  In other words, the same practical
realities that have led to the dominance of appearance-oriented
methods in the print world are also operative in the braille
world.  Under those circumstances, conversion into SGML first
seems like unnecessary overhead, an extra step into an indirect
language, that is insufficiently rewarded by SGML's benefits
because additional production formats are not often contemplated.
Even when one other adaptive format, such as large print, is
regularly produced, the practical balance has seemingly not yet
tipped towards widespread use of SGML.

Despite these sobering realities, it has long been quite plain
that SGML would work well for the braille community if only it
were more widely used in the print community, that is if SGML
files, coded for a commonly understood DTD, were more generally
available as the starting-point for braille work.  As discussed
earlier, the print world for its own reasons seems to be showing
increasing interest in SGML, and so that long-awaited condition
may finally be on the horizon.  Moreover, we can point to at
least three other factors in our favor.  First, there are
initiatives, such as the recent "Texas Braille Bill", that
promote the regular transfer of electronic media from the print
publishing industry to alternative-format producers.  This has
the effect of linking the two kinds of organizations
economically, so that the efficiencies of SGML are more operative
at the time of original coding.  Secondly, the efforts of the
International Committee for Accessible Document Design (ICADD)
have led to standard DTD's that serve as a well-defined
conversion target, in effect an accepted concrete declaration as
to "this is what we want", so that the designers of systems for
print production can begin to link them to the needs of the
braille world.  Thirdly, stimulated by those other factors, both
commercial efforts and academic projects, such as the one on
mathematics braille at Bradford University, now seem more
centered on SGML technology.


$cp3:1 $ifbrl $ind5 WILL SGML SOLVE ALL BRAILLE PRODUCTION PROBLEMS?
$ind1

No.  Many of the problems associated with braille production are
more properly associated with the rules for transcribing the text
itself, as distinct from arranging the text on the page. As an
example, the current rules of English literary braille require
that acronyms and abbreviations be treated differently from
regular English words that happen to be capitalized.  It is
beyond the scope of this paper to detail the reasons behind this
distinction, though it should be noted that they are well rooted
in braille tradition and can be seen as neither arbitrary nor
foolish when considered in the entire context of that tradition.
That distinction, however, can at times be difficult even for
human transcribers.  If, for example, the following capitalized
headline were to appear in a newspaper in the United States:
$sk1 $ind3 $ptys $g1
  EUROPE JOINS US IN TRADE PACT
$sk1 $ind1 $ptye $g2
it would be impossible to tell whether "US" stood for "United
States" or was simply a pronoun, yet the braille rendering would
require such a distinction.  Automated conversion, perhaps
needless to say, runs into difficulties with this distinction
even in the much more numerous cases where human judgment is not
so challenged.

It is easy to imagine an SGML "solution" to this problem.  We could
invent a tag pair to make the required distinction; for example
we could project that the original author or someone else along
the way would enter something like
$sk1 $ind3 $ptys $g1
  EUROPE JOINS [begin-acronym] US [end-acronym] IN TRADE PACT
$sk1 $ind1 $ptye $g2
when "United States" was intended, leaving the unannotated case
to imply that the pronoun was intended.  It could even happen
that such a tag could serve some print purposes, such as in a
book where distinctive "small caps" are used for acronyms and
regular capitals for other purposes.  Realistically, though, the
need for such a distinction would not be commonly felt in
preparing print, and so could not be relied upon as a solution
for braille purposes.  This fact becomes even more obvious when
we consider other kinds of distinctions required under current
braille rules, some of which are even harder to relate to any
conceivable print need.  For example, the word "tuberose" would
be brailled differently when it is a noun (a tuberous plant) as
opposed to when it is an adjective (a variant spelling of
"tuberous").  For another example, depending on the specific
braille code being used and other judgment factors, the
expression "(b)" (not counting the quotes) might be brailled
differently in each of the following circumstances: (1) when it
is used in parallel with "(a)" etc. as an enumerator in a list or
outline; (2) when it is a parenthesized reference to the letter
b; (3) when it is an expression in a mathematical context; and
(4) when it is an excerpt from a computer program.
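
To show how the hypothetical acronym tag discussed above would
reach a braille translation program, here is a small Python
sketch.  It does not implement any braille rule; it merely labels
each word as "acronym" or "word" according to the markup, which is
the information a translator would need before choosing a
rendering.  The function name and the two labels are assumptions
made for this illustration.

```python
import re

def classify_words(source):
    """Pair each word with 'acronym' or 'word' according to the tags."""
    result = []
    in_acronym = False
    for token in re.findall(r"\[[a-z-]+\]|[^\s\[\]]+", source):
        if token == "[begin-acronym]":
            in_acronym = True
        elif token == "[end-acronym]":
            in_acronym = False
        elif token.startswith("["):
            raise ValueError(f"unknown tag: {token}")
        else:
            result.append((token, "acronym" if in_acronym else "word"))
    return result
```

With the tagged headline, "US" comes out labeled as an acronym;
with the untagged headline, it is labeled an ordinary word, so the
sketch is only as good as the markup it is given.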


$cp3:1 $ifbrl $ind5 HOW CAN UBC HELP? $ind1

These considerations, and others, have given rise to the "Unified
Braille Code" (UBC), which is not yet an official code but a
research project of the Braille Authority of North America
(BANA).  UBC aims at defining the relationship between print
symbols and braille symbols in a way that minimizes ambiguity and
judgment problems, in BOTH directions of conversion, and further
encompasses the needs of technical literature, all while
preserving all the essential characteristics of current English
literary braille.

It would take us too far afield in this paper to elaborate
further on UBC; that of course has been done elsewhere as to
motivation, and is still under development as to methodology.
Suffice it to say that SGML and UBC are both positive
developments for braille; that neither is sufficient by itself,
in that each solves problems that the other does not; and
consequently that
they are entirely complementary, which is the main point of this
paper.

As a concluding footnote, in case it may appear that SGML and UBC
taken together bid fair to solve all braille problems, it may be
worth mentioning that they do not.  When we consider all the
ramifications of foreign languages (even in English context),
music notation, and graphics, we find ourselves at the threshold
of the problems that we will be pondering for many years to come.
Human experience and judgment remain a valued part of the braille
transcription process, all the more so because an increased
volume of automated transcription can only be accompanied by an
increased incidence of the "hard" problems that only people
can solve.  SGML and UBC will, we expect, free those people to
concentrate on those kinds of problems.

$cp4:1 $hds $rm6
              FURTHER INFORMATION ON SGML AND ICADD
$hde $rm $sk1
The SGML Handbook, by Charles F. Goldfarb.  (Clarendon Press,
1990.)  This is the "bible" on the subject, a "practical aid for
people who want to understand, use and implement ISO 8879 -- the
SGML standard".  It does contain introductory
concept papers and some application examples (sample DTD's).  It
also lists additional sources, among them:
$ind3 $sk1
  The International SGML Users' Group (Secretary: Stephen G.
  Downie, c/o SoftQuad Inc., Toronto, Ontario, Canada)
$sk1
  Graphic Communications Association (GCA) (Arlington, Virginia,
  USA)
$sk1
  An introductory video by Yuri Rubinsky and Marc Giacomelli
  (available from GCA)
$ind1 $sk1
Electronic Manuscript Preparation and Markup, by National
Information Standards Organization (Bethesda, Maryland, USA).
This is a technical document, being the text of standard
Z39.59-1988, a DTD generally known as the AAP (Association of
American Publishers) DTD.
$sk1
Reference Manual on Electronic Manuscript Preparation and Markup,
by the Association of American Publishers (available from the
Electronic Publishing Special Interest Group [EPSIG], Dublin,
Ohio, USA; tel. 614-764-6000).  This explains how to use the
above AAP DTD standard, for publishers, authors, and editors, in
a format that is simpler to follow.
$sk1
Author's Guide to Electronic Manuscript Preparation and Markup,
by the Association of American Publishers (also available from
EPSIG).  This document has the same stated purpose as the
foregoing, but is less detailed; it describes only the most basic
rules and tags.
$sk1
International Committee for Accessible Document Design (ICADD)
Statement of Purpose.  (available from Recording for the Blind
R&D, Missoula, Montana, USA; tel. 406-728-7201)
