SGML: The Reason Why and the First Published Hint

(C)1997 Journal of the American Society for Information Science. Volume 48, Number 7 (July 1997)

Charles F. Goldfarb

Information Management Consulting

This paper is a commentary -- over a quarter-century after the fact -- on the first published paper to suggest the need for (and hint at the existence of) what is now the Standard Generalized Markup Language. It was presented at the 33rd annual meeting of the American Society for Information Science in Philadelphia, October 15, 1970, and published in Volume 7 of the ASIS Proceedings. The editors of this special issue of JASIS felt that that meeting was worth remembering here because of its hitherto unpublicized connection with the origin of SGML. I would like to add that it is also worth remembering because of its closing banquet, which featured an erudite and witty speech by a professor with two doctorates, a 70-piece balalaika orchestra, the entire Philadelphia Mummers band (replete with banjos, saxophones, and feathered headdresses), and a middle-eastern belly dancer who worked on the table tops! I've spoken at some hundred conferences since then and none of them has even come close.

An Online System for Integrated Text Processing

Charles F. Goldfarb, Edward J. Mosher, and Theodore I. Peterson

International Business Machines Corporation
Cambridge, Massachusetts

The authors of this paper were not, as one might expect, the same three people who invented IBM's non-standardized precursor Generalized Markup Language -- GML. The "G" and "M" are represented, but Ray Lorie, who made perhaps the most significant contribution to the idea, did so in a matter of minutes in the capacity of friendly consultant and rarely participated actively in GML's development. Ted Peterson, who had been with IBM for a decade or more before either Ed Mosher or myself joined the company, was our store of experience -- a guide through the maze of twisty passages (all different) of corporate and academic research.

This was my second published paper. My first, published in the Harvard Journal of International Law, was entitled "A Model Foreign Tax Credit Act for a Developing Nation". There are dryer subjects than SGML!

Abstract

An online integrated text processing system has been implemented in an experimental version known as INTIME (INteractive Textual Information Management Experiment).

The work was done at IBM's Cambridge Scientific Center, one of a half-dozen small laboratories that existed at that time near major universities. The centers were operated by IBM's marketing arm, primarily as a vehicle for bolstering relations with academe through joint projects. However, the centers also did useful research, and "technology transfer" to a product division was a highly sought objective of every project.
These dual goals led to a sort of institutional schizophrenia. On the one hand we were encouraged to build collegial relations, publish our research, and share information. On the other hand, IBM was laboring under a consent decree from an antitrust suit that, among other things, prohibited the announcement of a product until it was developed sufficiently that there was a high probability that it would ship. The company's concern for honoring the letter and spirit of the decree was such that any novel technology with a potential for becoming a product had to be kept confidential. (I suspect that commercial competitive considerations may have played a role in this policy as well.)
The product potential of GML was recognized early on, and as a result we had to be careful of what we said and how we said it. Fortunately, the primary focus of INTIME was the integration of text processing functions, so we had something we could report on.

Text processing encompasses the functions of editing, document storage and retrieval, and composition, all of which are present in varying degrees in the work of authors, publishers, and researchers. A system which integrates these functions in a conversational setting offers considerable flexibility in manipulating a textual data base.

Note the dispassionate understatement in the last sentence, the result of Ted's careful tutelage in academic style. This sentence must have read like a fantasy in 1970. "Conversational setting" may seem no great novelty today, but then computers occupied entire large air-conditioned rooms and were tended by priests. The closest anyone got to a conversation was leaving a deck of keypunched cards in the input basket. "Textual data base" was also an outlandish idea, in an era when most of the keypunches were upper-case only.

The experimental version of the system was developed on the IBM System/360 Model 67 using the CP-67 control program and CMS operating system. The initial users were a program development group whose operations required preparation and revision of manuals, searching for references to specific topics, and automatic generation of indexes. The primary objective of the experiment was to observe the effects of interrelating the various text processing functions, rather than to measure performance.

CP-67 was an amazing breakthrough. Instead of writing a complex time-sharing operating system, they wrote a simple "hypervisor" that supported multiple concurrent emulations of IBM 360 hardware, so-called "virtual machines". You could then run the operating system of your choice in its own virtual machine. The real success of the design lay in the fact that they had an unambiguous precise spec -- the "Principles of Operation" manual for the IBM 360 computer.

The INTIME experiment demonstrates the feasibility and usefulness of online integrated text processing and indicates directions for future study.

Introduction

There would be many uses for an online integrated text processing system. A university press, for example, may engage in demand publishing. Its database, originally prepared for typesetting, could be searched for material relevant to the subject of a proposed publication. The selected text could be examined and manipulated with a context editor and then composed for publication. A lawyer would use these capabilities similarly for researching and drafting a brief.

Actually, the law office application was the original motivation for the project, something I was allowed to do part-time because of my knowledge of the user requirements. My real job was to encourage the staffs of the various scientific centers to make use of the CP-67-based Wide Area Network that was centered in Cambridge.

This paper discusses the text processing functions, their integration into an online interactive system, and experience with an experimental version of such a system.

Text Processing Functions

Text Editing

In online systems, text editing is the interactive creation and modification of textual files from a terminal (1). Editing programs which can locate a line by its word content, as well as by its position in a file, are known as "context" editors. They provide a retrieval capability: the command "LOCATE /researchers/" finds the first line containing "researchers"; the global command "CHANGE /edit/edit/ *" causes printing of every line containing "edit" (Fig. 1).

"Context editors" were hot stuff in 1970. Text editing terminals were rare enough in themselves (they looked like typewriters when they didn't look like teletypes), and most editors required you to select lines to edit by their numbers. Armed with a line-numbered printout of the file, experienced users would start editing from the last line forward so that inserting or deleting lines wouldn't change the numbers of those lines remaining to be revised.

LOCATE /researchers/
researchers. A system which integrates
CHANGE /researchers/analysts/
analysts. A system which integrates
CHANGE /edit/edit/ *
In online systems, text editing is
are known as "context" editors. They
NEXT
provide a retrieval capability: e.g.,
QUIT

Fig. 1 Text Editing (CMS)
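The session in Fig. 1 can be imitated with a small sketch. This is not the CMS editor: the class name, the method names, and the reading of "*" as "every line from the pointer to the end of the file" are assumptions made for illustration only.

```python
class ContextEditor:
    """Toy line editor with a current-line pointer, loosely modeled on Fig. 1."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.current = -1  # pointer sits before the first line

    def locate(self, text):
        """Move to the first line after the pointer that contains `text`."""
        for i in range(self.current + 1, len(self.lines)):
            if text in self.lines[i]:
                self.current = i
                return self.lines[i]
        return None  # not found; the pointer stays where it was

    def change(self, old, new, everywhere=False):
        """Replace `old` with `new` on the current line (the first line if
        nothing has been located yet) or, when `everywhere` is true, on every
        line from the pointer onward. Returns the lines that contained `old`,
        which a terminal session would print."""
        start = max(self.current, 0)
        stop = len(self.lines) if everywhere else start + 1
        changed = []
        for i in range(start, stop):
            if old in self.lines[i]:
                self.lines[i] = self.lines[i].replace(old, new)
                changed.append(self.lines[i])
        return changed


if __name__ == "__main__":
    editor = ContextEditor([
        "In online systems, text editing is",
        "the interactive creation and modification",
        'are known as "context" editors. They',
        "researchers. A system which integrates",
    ])
    print(editor.locate("researchers"))
    print(editor.change("researchers", "analysts"))
```

Changing a string to itself with `everywhere=True` returns every line containing it, which is how the global CHANGE in Fig. 1 doubles as a search.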

Document Storage and Retrieval

Solutions to the problem of identifying documents relevant to a query include diverse techniques: e.g., manual or automatic selection of document descriptors for indexing, and text searching based on logical, statistical, syntactic, or semantic criteria (2). Interactive systems may improve effectiveness, perhaps with approaches different from those used in batch procedures (3, 4). They may facilitate -- and may even require -- the development of free, natural query languages.

The first sentence of this paragraph sums up several of the burning issues of the day in information retrieval, some of them the subject of near-religious warfare between believers in one approach or the other. Issues like "manual or automatic selection of document descriptors" are still with us, most recently and visibly manifested in the battles between World Wide Web search engines, like Yahoo and InfoSeek.

The operations in which storage and retrieval interact with other text processing functions -- entry of raw text and output of retrieved text -- are essentially similar even in different retrieval systems. An integrated text processing system could support a variety of retrieval approaches through standard interfaces, thereby facilitating comparative research in an interactive setting.

The idea here was to enable use of all of the diverse indexing and search strategies, without having to duplicate the input and output systems. Today we are promised this kind of "componentware" capability by the proponents of ActiveX, OpenDoc, Java, and the like, all of them focused on allowing the software to interwork. Unfortunately, they all overlook one requirement that INTIME recognized in 1969 -- the need for a Generalized Markup Language.

Text Composition

The typical composition program accepts natural text in which command codes are interspersed to specify such operations as right-justification, columnar alignment, and hyphenation. Also inserted into the output are translations of commands, such as line spacing, to be performed by the actual typesetting device. Output may be directed to an online printer or photocomposer, or to tape for off-line typesetting.

The tape was as likely to be punched-paper as magnetic. I remember flying out to visit a large aircraft manufacturer several years later and being shown its "textual data base" of maintenance manuals. It was a room full of waste baskets, employed because their shape kept the paper tape from getting tangled. (I think I went home by bus.)

An important feature of many composition programs is the ability to designate by suitable input instructions the use of specified formats. Previously stored sequences of commands or text replace the instructions, and the expanded input is then processed. In more sophisticated systems, formats may summon other formats, including themselves (5).

Users would compete to write the most elegant and bizarre applications of nested stored formats. The ultimate challenge was considered to be automated layout of supermarket ads.
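The stored-format mechanism described above, in which an instruction is replaced by a stored sequence that may itself summon other formats, amounts to recursive expansion. The sketch below assumes a hypothetical "@name" call syntax and a depth limit of 20; neither comes from any actual composition program.

```python
def expand(lines, formats, depth=0, max_depth=20):
    """Replace each "@name" line with the stored lines for `name`, recursively.

    `formats` maps a format name to a list of lines, which may themselves
    contain "@name" calls. The depth limit stops a format that summons
    itself unconditionally from expanding forever.
    """
    if depth > max_depth:
        raise RecursionError("format expansion nested too deeply")
    out = []
    for line in lines:
        name = line[1:].strip() if line.startswith("@") else None
        if name in formats:
            out.extend(expand(formats[name], formats, depth + 1, max_depth))
        else:
            out.append(line)  # ordinary text, a command, or an unknown call
    return out


if __name__ == "__main__":
    stored = {
        "heading": [".ss", ".ce"],       # single spacing, center next line
        "chapter": [".ju", "@heading"],  # a format that summons another format
    }
    for line in expand(["@chapter", "Introduction"], stored):
        print(line)
```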

Pagination may follow composition, as with newspaper page layout; it is usually an integral part of systems for in-house printing with line printers (6). This paper was prepared using one of the latter, CMS Script (Fig. 2).

CMS Script eventually evolved, with the addition of GML, to become IBM's hugely successful Document Composition Facility. In addition to serving some 4000 mainframe customers, it also supported IBM's own publishing operation, producing over 11 million master pages for what was then reputed to be the world's second largest publisher (after the U.S. government).

.ce
Introduction
.ju
.ll 39
.ss
The work of
authors, publishers, and researchers
involves, in varying
degrees, three recognizable text

Formatting commands:
.ce center next line
.ju justify right margin
.ll line length
.ss single spacing

Fig. 2 Script Input (CMS)

Integrated Text Processing

An integrated text processing system comprises three broad divisions (Fig. 3).

Fig. 3 Integrated Text Processing

Storage

Documents may be entered from the user's personal data base (after creation with a context editor) or from the output of offline creation (e.g., paper tape or punched cards). A raw concordance is generated for updating the system dictionary directly or for prompting an analyst who is selecting appropriate document descriptors; frequency counts are accumulated for statistical retrieval techniques. The text and typesetting format commands are stored in the system data base.
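A raw concordance and the accompanying frequency counts can be sketched as follows. The paper does not describe INTIME's data structures, so everything here is an assumption: documents are keyed by number, words are runs of letters, and lines beginning with a period are treated as format commands and left out of the concordance.

```python
import re
from collections import Counter, defaultdict


def build_concordance(documents):
    """Return (concordance, frequencies) for a {doc_number: text} mapping.

    The concordance maps each word to the sorted list of document numbers
    that contain it; `frequencies` counts every occurrence of each word.
    """
    concordance = defaultdict(set)
    frequencies = Counter()
    for doc_number, text in documents.items():
        for line in text.splitlines():
            if line.startswith("."):
                continue  # stored format command, not searchable text
            for word in re.findall(r"[a-z]+", line.lower()):
                concordance[word].add(doc_number)
                frequencies[word] += 1
    return {w: sorted(nums) for w, nums in concordance.items()}, frequencies


if __name__ == "__main__":
    docs = {
        1: ".ce\nText Editing\nEditing programs locate a line by its content.",
        2: "Retrieval of text based on word occurrence.",
    }
    concordance, frequencies = build_concordance(docs)
    print(concordance["text"], frequencies["editing"])
```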

Retrieval

The retrieval language processor translates the query language into an internal form for the retrieval routine. Document retrieval has two parts: determining from a query which documents are of interest, and presenting the documents in some form. The latter should be considered part of publication; retrieval produces only a set of document numbers.

The measurement and control function provides preliminary statistics about the set of documents satisfying a query. It could also select an appropriate search algorithm based on the user's statement of needs (e.g., browsing or intensive search), or based on the degree of his satisfaction with the progress of the search.

Publication

Retrieved or known document numbers provide access to the full text, either with embedded typesetting commands, as for editorial applications, or fully composed, for research or browsing. The output processor displays the text on a CRT or line printer, prepares tape or other media for offline composition, or sends output to online photocomposition or microform systems. Composed output can be transmitted to a pagination processor.

A powerful alternative is to direct output to the user's personal data base for access by the context editor. The retrieved text can be incorporated into a system document or modified and entered as a new version. The context editor can also be used for further retrieval within the document, since it is capable of locating words and phrases.

Derived information, such as indexes, word frequencies, and other statistics, can be prepared by the measurement and control function, and then processed by the output function in the same manner as text.

Commercial products still haven't achieved the degree of non-overlapping interworking envisioned by INTIME, although SGML has helped them to get much closer.

Experimental System

An experimental system was implemented to determine the feasibility and usefulness of integrated text processing, and to permit observing the effect of interrelating the basic text processing functions. High performance and a complete set of features were less important than rapid implementation. Programs were chosen primarily for convenience of incorporation into the system, regardless of whether they represented the best approaches to implementing their functions.

An online environment was created on an IBM 360/67 by using the CP-67 control program to handle all resource allocation, including time-slicing. CP-67 gives each terminal user a "virtual" computer (7), in which he can run the operating system and programs of his choice, and also permits multiple users to have concurrent access to a disk file. These features allow one to design a time-shared interactive system while writing only single-user programs.

In the early stages of the project, the system programmers who helped us were convinced that "concurrent access to a disk file" would solve the problem of having the separate programs (text editor, document retrieval, etc.) interoperate. Reality set in when they saw that each program choked on the special control codes that the other programs required. Ed, Ray, and I invented GML to eliminate that nuisance so we could get on with the important work!

The Cambridge Monitor System (CMS) was chosen as the operating system because of its interactive context editor and Script composition programs. CMS also provides such useful systems features as online debugging and extensive capabilities for file manipulation (8).

Document retrieval was implemented by converting the Document Processing System (9) from batch (Operating System/360) to online operation and modifying its query language to provide greater freedom and flexibility for conversational use. Searches are based both on reference data (bibliographic information) and word occurrence (according to relative positions or assigned weights).

I remember wading through thousands of lines of uncommented 360 Assembler code to separate the indexing and search components from the rest of the Document Processing System. I wound up throwing away 80% of it, as good an argument for integration as any we had encountered.

A program development group used the experimental system to prepare their documentation, to locate references to programs being changed, and to generate indexes for manuals. Because the indexes required page numbers, the text was composed and paginated by Script prior to entry into the system, thereby precluding any experience with handling embedded format commands.

The individual pages of each manual were then treated as separate documents, an approach also suggested by the existence of a small number of lengthy documents rather than the assortment of many documents of different sizes usually encountered in a database used for research. Although page numbers do not represent informational structure, as do paragraphs, they are needed in the preparation of printed indexes for books.

Conclusions

The implementation of INTIME demonstrates that an online integrated system for text processing is feasible. Such a system can be modular to accommodate varying needs and computer facilities.

That summed things up quickly (we were limited to four camera-ready pages) so we would have space for the really significant finding of the project, the first published hint of SGML:

The usefulness of a retrieval program can be affected by its ability to identify the structure and purpose of the parts of text (e.g., footnotes, abstracts, citations). Conventional practice uses special input formats to convey some structural information, but this is not possible in an integrated system, where input is not prepared specially for retrieval. A heuristic routine for identifying new paragraphs in normal text was developed for INTIME, but a more sophisticated facility is needed. A typesetting command language could convey such information, but present languages deal with the appearance of the text, not with the purpose which motivated it.

There it is, a new kind of "typesetting command language" that deals not with the appearance of the text, but "with the purpose which motivated it". In a word (well, three words), a Generalized Markup Language. Without it, we couldn't have made INTIME work at all.

Rice has proposed a high-level text structure language that would describe the parts of text with mnemonic codes (10). The composition program would identify the codes as calls to stored formats; the retrieval program would use them for classification.

Stanley Rice, then a New York book designer and now a California publishing consultant, provided my original inspiration for GML. The first phrase in the previous sentence was his idea, the second was mine. Unlike Stan, Bill Tunnicliffe, Norm Scharpf, and other pioneers in this area, my chief motivation was information retrieval, not typesetting. (Remember, I was trying to automate a law office.)
I recall a meeting early in the project with Steve Furth, an IBM Industry Marketing Director and an early influential and passionate campaigner for computerized information retrieval. I told him of my ideas for integration, which included removing the (procedural) typesetting markup. He said something about that being wrong because it could have other uses. I said something like "you mean figuring out something is a caption because it is centered." He said "something like that" and referred me to Stan Rice's work. The rest, as they say, is history.
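The two uses of mnemonic codes in Rice's proposal, as calls to stored formats for the composition program and as classifications for the retrieval program, can be sketched from one marked-up source. The colon-prefixed codes, the code names, and the stored formats below are invented for illustration; they are not the syntax of GML or of Rice's proposal.

```python
# Hypothetical stored formats: each descriptive code maps to Script-like commands.
STORED_FORMATS = {
    "title": [".ce"],
    "para": [".ju"],
    "fnote": [".ll 30"],
}


def parse(source):
    """Split lines of the form ":code text" into (code, text) pairs."""
    elements = []
    for line in source:
        if line.startswith(":"):
            code, _, text = line[1:].partition(" ")
            elements.append((code, text))
    return elements


def composition_input(elements):
    """Composition view: each code is a call to its stored format."""
    out = []
    for code, text in elements:
        out.extend(STORED_FORMATS.get(code, []))
        out.append(text)
    return out


def select(elements, code):
    """Retrieval view: the same codes classify the parts of the text."""
    return [text for c, text in elements if c == code]


if __name__ == "__main__":
    marked_up = [
        ":title Integrated Text Processing",
        ":para Text processing encompasses editing, retrieval, and composition.",
        ":fnote Prepared with a context editor.",
    ]
    elements = parse(marked_up)
    print(composition_input(elements))
    print(select(elements, "fnote"))
```

The source says only what each part is; how it looks is decided by the stored formats, and what can be searched is decided by the retrieval program, so neither chokes on the other's control codes.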

An online integrated text processing system can facilitate the text handling in the publishing process by supplying a common data base for all functions. The regular creation and use of such databases during the normal course of publication would be a major step toward the realization of computer-based libraries.

The paper ends as it began, with a dispassionate understatement. But in 1970 we had no idea how much of an understatement it would turn out to be. We couldn't have imagined that GML would spark the invention of SGML and become an International Standard. Or that SGML would strike such a responsive chord in thousands of creative and industrious people that they would build an industry that has forever changed the way that enterprises create and use documents.
And in 1970 I certainly couldn't have foreseen that Tim Berners-Lee would create a World Wide Web. I don't know whether I'm prouder of being the father of SGML or the grandfather of its child, HTML. (Like all grandfathers, I had nothing to do with the procreation or the painful labor, but that doesn't impair my right to be proud!) The Web has accomplished "the realization of computer-based libraries" to a fare-thee-well, and is on its way to realizing fundamental changes in human society as well. The next twenty-five years of Generalized Markup Languages should prove even more exciting than the first.

References

1. "Context Editors, Part 1: A Conversational Context-Directed Editor," Report No. 320-2041, Cambridge Scientific Center, IBM Corp., Cambridge, Massachusetts (March, 1969).

2. Gerald Salton, Automatic Information Organization and Retrieval, McGraw-Hill Book Co., New York, 1968.

3. G. Salton, "Automatic Text Analysis," Science 168, 335-343 (April 17, 1970).

4. F. W. Lancaster, "Interaction between Requesters and a Large Mechanized Retrieval System," Information Storage and Retrieval 4, 239-252 (1968).

5. "System/360 Text Processor Pagination/360, Application Description Manual," Form No. GE20-0328, IBM Corp., White Plains, New York.

6. S. E. Madnick and A. Moulton, "SCRIPT, an On-Line Manuscript Processing System," IEEE Transactions on Engineering Writing and Speech EWS-11 (2), 92-100 (1968).

7. J. N. Bairstow, "Many from One: the Virtual Machine Arrives," Computer Decisions, 29-31 (January, 1970).

8. L. H. Seawright and J. A. Kelch, "An Introduction to CP-67/CMS," Report No. 320-2032, Cambridge Scientific Center, IBM Corp., Cambridge, Massachusetts (October, 1969).

9. "Application Description, IBM System/360 Document Processing System," Form No. H20-0315, IBM Corp., White Plains, New York (1967).

10. S. Rice, "Editorial Text Structures," Memorandum to Standards Planning and Requirements Committee, American National Standards Institute, (March 17, 1970).