Charles F. Goldfarb's SGML Source: InFrequently Asked Questions (InFAQs)

Contents (Last Revised: Nov 15, 1997)

Why Is the SGML International Standard Important?
What Is the Most Important Technical Reason for Using SGML?
What Is An SGML Grove?
Why Is It So Hard to Hack SGML Without a Parser?
How Are Name Spaces Managed Using SGML?
What Is the SGML Entity Structure and Why Does It Matter?
Acknowledgments
Copyright & Disclaimers

Why Is the SGML International Standard Important?

SGML is designed to make your information last longer than the systems that created it. Such longevity also implies immunity to short-term changes -- such as a change from one application program to another -- so SGML is also inherently designed for re-purposing and portability. And the same technical characteristics of SGML that make these long-term benefits possible also provide near-term benefits in document production: shorter lead times, lower costs, more flexible processing, and better control.

But the real key to SGML's success -- both politically and technically -- is the fact that SGML is a bona fide International Standard, not the creation of a dominant vendor or a consortium. I say "politically" because large users feel they can safely invest millions to convert to SGML because the SGML specification is stable and is maintained by a neutral organization. I say "technically" because the concept of conformance to a standard is what makes SGML work.

Here's how conformance works. The SGML standard defines the requirements for "conforming SGML documents". These requirements are remarkably flexible. In fact, SGML isn't so much a standard for "what you have to do" as a standard for "describing what you've done and why you chose to do it". (So SGML conformance doesn't force you to be a conformist!)

The standard also sets requirements for "conforming SGML systems" -- but these are defined principally in terms of their ability to process conforming SGML documents. The objective is for the user to have a library of conforming SGML documents and be able to use any conforming SGML systems to process those documents in a multitude of ways -- regardless of how many previous processes have taken place.

How Do I Recognize a Conforming SGML System?

Such a demanding set of objectives for SGML has necessarily resulted in a non-trivial language design. SGML has some subtle details, and the implications of failing to address them properly in products are not as widely understood as they should be. So, just as it is vital to the effectiveness of SGML that conformance be defined rigorously, it is equally important that conforming products be identifiable unambiguously.

For this reason, the SGML standard requires a conforming product to be identified prominently as "An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language". The standard calls such a product a "conforming SGML system". It is required to have a description of its SGML capabilities, including its ability to support optional features, in a standardized format called a "system declaration". A conforming product isn't forced to support any of the optional facilities of SGML, but if it does, it must support them according to the requirements of the standard.

A conforming product's documentation must also meet certain requirements, designed to minimize user retraining when starting to use additional SGML products. These requirements involve consistent use of standardized terminology, accurately distinguishing features of SGML from features of the product, and so on.

Note: Just claiming conformance doesn't prove conformance. Validating a product's conformance claims is the role of "SGML conformance testing", a rigorous process governed by an International Standard of its own. If you believe that a product identified as conforming doesn't actually conform, that is a bug you can report to the vendor.

To the standard, all other products are "non-conforming". You may see them described with terms like "SGML compliant", "SGML based", "standards-based", "SGML aware", etc., but these terms are not defined in ISO 8879. There is an ever-increasing number of such products, with varying degrees of SGML support. Many products, though non-conforming, can process a large variety of conforming documents. Some may even have the necessary functionality for conformance, but don't formally claim it.

Can I Use Non-conforming Products?

Certainly.

Many popular SGML products are non-conforming, and have proven highly useful -- even essential -- in SGML environments. Not all SGML users need to achieve all of the objectives that SGML is designed for. They are willing and able to trade off the benefits of conformance in favor of cost savings or other product functionality.

Only you can decide whether the reasons for a product's non-conformance are relevant to your own expected use of the product. These reasons could include lack of support for a facility that is mandatory for conformance but not needed for your implementation, or documentation that doesn't meet the requirements of the standard.

A conforming product isn't necessarily the best product for your purposes. Many factors govern an intelligent choice of products for an SGML system. SGML conformance is only one of these, and it can never be the only one. You need to consider functionality, cost, performance, service, vendor reputation, and so on. And no product review or third-party recommendation can substitute for your own careful assessment of a product's applicability to your enterprise's unique requirements.

The Major Benefit of SGML Conformance

The major benefit of SGML conformance, to both users and vendors, has nothing to do with technicalities. It is that SGML is defined by a bona fide de jure International Standard -- maintained by a strong standards organization that is recognized by governments and whose standards have the force of law in many countries.

Every webmaster has seen what happens to a standard that does not have such stability and authority -- even one of high quality and major importance. Dominant vendors try to run away with it, competitors are reduced to playing catch-up instead of competing equally, and users have to cope with multiple conflicting variations instead of a true standard. The procedures of the ISO and the national standards bodies that belong to it have protected SGML from that sort of chaos.

What Is the Most Important Technical Reason for Using SGML?

Under Construction

Well, if you can't wait to see the answer, I'll tell you, but it will be a while yet before I can provide more of an explanation. Here it is, in brief:

"Abstraction support" -- that trait of an information representation (data format, notation, et al.) that allows its users to distinguish what they consider to be the "abstract" information content of a document from the "style information" that is used to render it, and that allows tools to enforce and preserve that distinction. Abstraction support is what allows SGML documents to be reusable and portable.

What Is An SGML Grove?

When the source code of a computer program is parsed, the result is called a "parse tree".

When an SGML document is parsed, the result is called a "grove". A grove is also a parse tree, but you can think of it as a higher form of tree that -- like all higher life forms -- has developed some specialized cells.

The SGML grove concept recognizes that there can be several kinds of relationship among the nodes of a tree, and therefore a parse results in several related distinct trees, rather than one great big one.

For example, the relationship between an element and its attributes is different from the relationship between an element and its content subelements; so different that it is not helpful to think of the attributes and the subelements as siblings. When you navigate in a tree, you don't want the "next sibling" of a sub-chapter to be an attribute of the parent chapter.

Instead, the grove recognizes a privileged relationship that defines a "content tree", consisting of the subelements and data of the root element, recursively. The attribute list trees are also part of the grove, but are not part of the content tree. An attribute list is a property of its element, but that property is not one of the privileged ones that define the content tree.
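The separation between the content tree and the attribute lists can be pictured with a minimal Python sketch. It is an illustration only, not the data model defined by the standards: the `Element` class, its field names, and the `walk_content` helper are all hypothetical. The point it tries to show is that sibling navigation and content traversal follow only the `content` property, so an attribute can never turn up as the "next sibling" of a subelement.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass(eq=False)
class Element:
    """Hypothetical grove node: attributes and content are separate properties."""
    gi: str  # element type name
    attributes: Dict[str, str] = field(default_factory=dict)
    content: List[Union["Element", str]] = field(default_factory=list)
    parent: Optional["Element"] = field(default=None, repr=False)

    def append(self, child):
        """Add a subelement or a data string to this element's content."""
        if isinstance(child, Element):
            child.parent = self
        self.content.append(child)
        return child

    def next_sibling(self):
        """Return the next node in the parent's content, or None.

        Only the content property is consulted, never the attributes.
        """
        if self.parent is None:
            return None
        siblings = self.parent.content
        for i, node in enumerate(siblings):
            if node is self:
                return siblings[i + 1] if i + 1 < len(siblings) else None
        return None


def walk_content(node):
    """Depth-first traversal of the content tree only."""
    yield node
    if isinstance(node, Element):
        for child in node.content:
            yield from walk_content(child)


# A chapter with an attribute and two sections.
chapter = Element("chapter", {"id": "ch1"})
sec1 = chapter.append(Element("section", {"id": "s1"}))
sec1.append("First section text")
sec2 = chapter.append(Element("section", {"id": "s2"}))

# The sibling of the first section is the second section, not the
# chapter's "id" attribute.
print(sec1.next_sibling().attributes["id"])
```

The attributes remain reachable through each element, but a program has to ask for them explicitly. They are a property of the element rather than part of the tree being walked.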

Every notation, when interpreted, results in the creation of a parse tree. The SGML grove is no less general: SGML groves can be created for any notation, not just SGML, and the resulting groves/trees can all be navigated and queried as a single structure.

Although the navigation and querying of groves is generic, it is also necessary for a grove to represent properties that are unique to the notation that was parsed. That information is called a "property set".

The SGML property set, which is used in the DSSSL and HyTime standards, tells how to populate a grove with SGML-related properties. These can range from the basic SGML abstractions, like the element structure, to the markup strings that were used to notate the abstractions in the first place (which would allow the original document to be recreated character-for-character).

There is also a DSSSL property set, which introduces additional constructs, like glyphs. However, an implementation does not have to buy into an entire property set. You can specify a "grove plan" that indicates which properties are available in your program's grove.
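As a rough illustration of the grove plan idea, the sketch below treats a node as a plain dictionary of properties and a grove plan as the set of property names a program chooses to see. The property names and the `apply_grove_plan` function are invented for this example and are not taken from the SGML or DSSSL property sets.

```python
# Hypothetical node carrying both abstract properties and the markup
# string that originally notated them.
node = {
    "gi": "para",
    "attributes": {"id": "p1"},
    "content": ["Hello"],
    "start_tag_markup": '<para id="p1">',
}


def apply_grove_plan(node, plan):
    """Return only the properties of the node that the plan includes."""
    return {name: value for name, value in node.items() if name in plan}


# A plan for a program that cares about the abstraction, not the markup.
abstract_plan = {"gi", "attributes", "content"}
abstract_view = apply_grove_plan(node, abstract_plan)

print(sorted(abstract_view))
```

A program that needs to recreate the source character-for-character would use a plan that also includes the markup properties.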

It is this generality of groves that allows all data and metadata to be processed consistently, even when it is represented in different notations.

Disclaimer: This is a rough and somewhat simplified summary. The complete story is in the SGML Extended Facilities annex of the HyTime standard. See A Reader's Guide to the HyTime Standard for more information.

Why Is It So Hard to Hack SGML Without a Parser?

SGML, and its derivatives, XML and HTML, are character-based notations. That means an "SGML document" is actually a character string, one that describes an abstraction (also called a "document").

In order to glean the abstraction, the notation must be interpreted -- it is not enough just to accept the characters as they are. The interpreter must "parse" the text and separate it into "data" and "markup".

Data is part of the abstraction -- the real information. Markup is information that helps the parser construct a representation of the abstraction, called a grove, so that an application can process the abstraction.

For example, when you view this Web page through a browser, the paragraphs are rendered as separate text blocks. You don't see the markup strings, such as "<P>". When you "View Source", however, you do see the uninterpreted document text that the browser would normally interpret.
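The split between data and markup can be seen with a short Python sketch. It uses `html.parser.HTMLParser` from the Python standard library, which is an HTML tokenizer and not a conforming SGML parser, so treat it only as a stand-in for the idea. The `Splitter` class and the sample text are assumptions made for the illustration.

```python
from html.parser import HTMLParser


class Splitter(HTMLParser):
    """Sort what the parser reports into markup events and data strings."""

    def __init__(self):
        super().__init__()
        self.markup = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.markup.append(("start", tag))

    def handle_endtag(self, tag):
        self.markup.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs between elements.
        if data.strip():
            self.data.append(data.strip())


source = "<P>First paragraph.</P>\n<P>Second paragraph.</P>"
splitter = Splitter()
splitter.feed(source)
splitter.close()

print(splitter.data)    # the information a reader sees
print(splitter.markup)  # the information the parser uses
```

Note that `HTMLParser` reports tag names in lower case, so the `<P>` in the source arrives as `p`.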

The power of character-based notations such as XML/SGML is that you can also interpret their text as plain-text (aka ASCII-text), which allows you to use plain-text editors and other tools that don't actually parse the text. Here enter the Perl script, the syntax-coloring text editor, the simple pattern match, and a multitude of similar programs that "scan" the text for pieces of markup, rather than parsing it and building a proper grove.

There is great value in such programs, so much so that designers of character-based notations have a tendency to compromise the design to optimize plain-text hacking. Done carelessly, such optimization can lead to duplication of function, with increased complexity for the real parsers, and possible conflicts and ambiguities.

Plain-text hackers typically rely on the "format" (in the presentation sense) of the source document. They may require some tags to be at the start of a line. They may require (or prohibit) some markup minimization facilities, such as short references and null or empty end-tags, and so on.

But those SGML features that support plain-text processing have no effect (nor should they) on the abstraction that is being described. As a result, grove-based programs, such as SGML editors, feel free to create a new source document string whenever they save an SGML document. This compounds the problems of the plain-text hacker, who finds all his line-breaking and minimization strategies rearranged, and his scripts no longer producing the same results.
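The problem can be reproduced in a few lines of Python. In this sketch, `original` and `resaved` are two hypothetical source strings for the same two-paragraph document: the second is what a grove-based editor might plausibly write out, with different case and line breaks. The line-oriented pattern stands in for a typical plain-text script. The parser is again Python's `html.parser`, used as a stand-in, and collapsing whitespace inside paragraphs is a simplifying assumption of the sketch.

```python
import re
from html.parser import HTMLParser

original = "<P>First paragraph.</P>\n<P>Second paragraph.</P>\n"
resaved = "<p>First\nparagraph.</p><p>Second paragraph.</p>"


def scan_paragraphs(text):
    """Plain-text hack: assumes one upper-case <P>...</P> per line."""
    return re.findall(r"^<P>(.*)</P>$", text, flags=re.MULTILINE)


class ParagraphCollector(HTMLParser):
    """Collect the data inside each p element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append("")
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def parse_paragraphs(text):
    """Parser-based extraction; whitespace runs are collapsed to one space."""
    collector = ParagraphCollector()
    collector.feed(text)
    collector.close()
    return [" ".join(p.split()) for p in collector.paragraphs]


print(scan_paragraphs(original), scan_paragraphs(resaved))
print(parse_paragraphs(original) == parse_paragraphs(resaved))
```

The scan finds both paragraphs in the original text and none in the re-saved text, while the parser-based function returns the same result for both.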

How Are Name Spaces Managed Using SGML?

There are two basic approaches to using conflicting names from multiple name spaces together:

Unification: Map them into a single name space with no conflicts.
Qualification: Identify them with the names of their name spaces (and, if necessary, the names of the name spaces of the name space names, and so on).

Programming languages tend to use qualification because programmers need to be conscious of which libraries they are using. SGML documents have always used unification, because it allows the document type designer to bear the brunt of name space handling while keeping things simple for authors and editors.

Examples of SGML name space unification:

Entities: Multiple name spaces of public identifiers and system identifiers are mapped into entity names.
Notations: Ditto.
Enabling architectures: Element type and attribute names from multiple meta-DTDs mapped into a single DTD.

The one place where SGML uses qualification is in formal public identifiers, but these are eventually mapped to entity names by the DTD designer and thereby hidden from the end user.
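The two approaches can be contrasted in a small Python sketch. The name spaces, the names in them, and the `check_unification` helper are all hypothetical. The sketch shows only the bookkeeping a designer takes on when unifying, not any mechanism defined by SGML.

```python
# Two hypothetical name spaces that both define "title".
book_names = {"title", "author"}
person_names = {"title", "surname"}

# Qualification: every name is carried around with its name space.
qualified = {("book", n) for n in book_names} | {("person", n) for n in person_names}

# Unification: the designer maps each qualified name, once, into a
# single name space with no conflicts. Authors see only the values.
unified = {
    ("book", "title"): "title",
    ("book", "author"): "author",
    ("person", "title"): "honorific",
    ("person", "surname"): "surname",
}


def check_unification(mapping):
    """Return the unified names, raising ValueError if any two collide."""
    seen = {}
    for qualified_name, unified_name in mapping.items():
        if unified_name in seen:
            raise ValueError(
                f"{qualified_name} and {seen[unified_name]} both map to {unified_name!r}"
            )
        seen[unified_name] = qualified_name
    return set(seen)


print(sorted(check_unification(unified)))
```

The conflict over "title" is resolved once, in the mapping, which is the sense in which the designer bears the brunt of name space handling.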

What Is the SGML Entity Structure and Why Does It Matter?

Under Construction

Acknowledgments

I'd like to thank Sarah Tourville and the rest of her team at SAGRELTO Enterprises, Inc. for building the site, with a special thank you to chief programmer Ron Picker. Peter Newcomb and the other folks at TechnoTeacher, Inc. provide the disk space, the high speed net connection, and site administration. Andrew Goldfarb, of Eye-Tech Graphics, provides artwork on demand. Sarah and Peter donate their work because they believe in SGML. Andrew has no choice because he's my son. A-link Network Services provides my personal Internet access; they get paid.

Copyright & Disclaimers

Copyright (C)1996 Charles F. Goldfarb. All rights reserved. "SGML Source", "Infrequently Asked Questions" and "InFAQs" are service marks of Charles F. Goldfarb. I take no responsibility for the accuracy of the contents of this site. I've collected and disseminated this information in an attempt to be helpful, but you use it at your own risk. If you're smart, you won't use it at all without verifying it for yourself. For these reasons, and out of respect for intellectual property (including my own), information on this site cannot be used or cited for any commercial purpose. Any questions, comments, or suggestions? Send me mail.