What Can You Do with XML Today?

by Jay Ven Eman

Interest in Extensible Markup Language (XML) rivals the press coverage the World Wide Web received at the turn of the millennium. Is it justified? Rather than answer directly, let us take a brief survey of what XML is and why it was developed, and then highlight some current XML initiatives. Whether or not the hype is justified, ASIS&T [American Society for Information Science and Technology] members will inevitably become involved in their own XML initiatives.

An alternative question to the title is, “What can’t you do with XML?” I use it to brew my coffee in the morning. Doesn’t everyone? To prove my point, the following is a well-formed XML document. (Well-formed will be defined later in the article.)

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<StartDay Attitude="Iffy">
<Sunrise Time="6:22" AM="Yes"/>
<Coffee Prepare="Or die!" Brand="Starbuck's" Type="Colombian" Roast="Dark"/>
<Water Volume="24" UnitMeasure="Ounces">Add cold water.</Water>
<Beans Grind="perc" Type="Java">Grind coffee beans. Faster, please!!</Beans>
<Grounds>Dump grounds into coffee machine.</Grounds>
<Heat Temperature="152 F">Turn on burner</Heat>
<Brew>Wait, impatiently!!</Brew>
<Dispense MugSize="24" UnitMeasure="Ounces">Pour, finally.</Dispense>
</StartDay>

This XML document instance contains sufficient information to drive the coffee making process. Given the intrinsic nature of XML, our coffee-making document instance could be used by the butler (should we be so lucky) or by a Java applet or perl script to send processing instructions to a wired or wireless coffeepot. If XML can brew coffee, it can drive commerce; it can drive R&D; it can drive the information industry; it can drive information research; it can drive the uninitiated out of business.
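
As a rough illustration of how a script might consume this document instance, here is a minimal Python sketch using the standard library's ElementTree parser. The function name brewing_steps is hypothetical, and the encoding declaration is left out because the text is already held as a Python string. It only lists the steps; actually talking to a coffeepot is left to the imagination.

```python
import xml.etree.ElementTree as ET

# The coffee document from the article (encoding declaration omitted because
# the text is already a decoded Python string).
COFFEE_XML = """<?xml version="1.0" standalone="yes"?>
<StartDay Attitude="Iffy">
<Sunrise Time="6:22" AM="Yes"/>
<Coffee Prepare="Or die!" Brand="Starbuck's" Type="Colombian" Roast="Dark"/>
<Water Volume="24" UnitMeasure="Ounces">Add cold water.</Water>
<Beans Grind="perc" Type="Java">Grind coffee beans. Faster, please!!</Beans>
<Grounds>Dump grounds into coffee machine.</Grounds>
<Heat Temperature="152 F">Turn on burner</Heat>
<Brew>Wait, impatiently!!</Brew>
<Dispense MugSize="24" UnitMeasure="Ounces">Pour, finally.</Dispense>
</StartDay>"""


def brewing_steps(xml_text):
    """Return (element name, attributes, instruction) for every child that carries text."""
    root = ET.fromstring(xml_text)
    steps = []
    for child in root:
        # Empty elements such as <Sunrise/> have no instruction text and are skipped.
        if child.text and child.text.strip():
            steps.append((child.tag, dict(child.attrib), child.text.strip()))
    return steps


for tag, attrs, instruction in brewing_steps(COFFEE_XML):
    print(f"{tag}: {instruction} {attrs}")
```

The element names and attributes do the work here: the script needs no knowledge of coffee beyond the labels in the markup.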

What is XML? To understand XML, you must understand metadata. Metadata is “data about data.” It is an abstraction, layered above the abstraction that is language. Metadata can be characterized as natural or added. To illustrate, consider the following text string, “Is MLB a sport, entertainment or business?” You, the reader, can guess with some degree of accuracy that this is an article title about major league baseball (MLB). Presented out of context, even people are only guessing. Computers have no clue, in or out of context. There are no software systems that can reliably recognize it in a meaningful way.

For this example, it is a newspaper article title. To it we will add subject terms from a controlled vocabulary, identify the author, the date and add an abstract. As a “well-formed” XML document instance, it is rendered:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<DOC Date="5/21/02" Doctype="Newspaper">
<TI> "Is MLB a sport, entertainment or business?" </TI>
<Byline> Smith </Byline>
<ST> Sports </ST>
<ST> Entertainment </ST>
<ST> Business </ST>
<AB> Text of abstract…</AB>
<Text> Start of article …</Text>
</DOC>

In this example, what is the metadata? What is natural and what is added? Natural metadata is information that enhances our understanding of the document and parts thereof and can be found in the source information. The date, the author’s name and the title are “natural” metadata. They are an abstraction layer apart from the “data” and add significantly to our understanding of the “data.”

The subject terms, document type and abstract are “added” metadata. This information also adds to our understanding, but it had to be added by an editor and/or software. The tags are “added” and are metadata. Metadata can be the element tag, the attribute tag or the values labeled by element and attribute tags. It is the collection of metadata that allows computer software to reliably deal with the data. Metadata facilitates networking across computers, platforms, software, companies, cities and countries.

What is the “data” in this example? The text, tables, charts, figures and graphs that are contained within the open <Text> and close </Text> tags.
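
To make the distinction concrete, the following Python sketch separates the metadata from the data in the newspaper example. It is only one way to do it; the function name split_metadata and the dictionary keys are assumptions chosen for illustration, and the date attribute is quoted as XML requires.

```python
import xml.etree.ElementTree as ET

# The newspaper example, held as a Python string (no encoding declaration needed).
ARTICLE_XML = """<DOC Date="5/21/02" Doctype="Newspaper">
<TI> "Is MLB a sport, entertainment or business?" </TI>
<Byline> Smith </Byline>
<ST> Sports </ST>
<ST> Entertainment </ST>
<ST> Business </ST>
<AB> Text of abstract...</AB>
<Text> Start of article ...</Text>
</DOC>"""


def split_metadata(xml_text):
    """Return (metadata dict, data string) for a DOC document instance."""
    root = ET.fromstring(xml_text)
    metadata = {
        # "Natural" metadata: found in the source itself.
        "date": root.get("Date"),
        "title": root.findtext("TI").strip(),
        "author": root.findtext("Byline").strip(),
        # "Added" metadata: supplied by an editor or software.
        "doctype": root.get("Doctype"),
        "subjects": [st.text.strip() for st in root.findall("ST")],
        "abstract": root.findtext("AB").strip(),
    }
    # The "data" is whatever sits between <Text> and </Text>.
    data = root.findtext("Text").strip()
    return metadata, data


metadata, data = split_metadata(ARTICLE_XML)
print(metadata["subjects"], "|", data)
```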

Generalized data markup has four objectives: structure, content, context and format. Markup can articulate and even dictate structure. Structure in a text object includes such rules as these: a document must have a beginning and an end; it must have a chapter and one or more subchapters; the subchapter(s) must follow the chapter; and so on. Your goal with structure is to render the hierarchical and organizational design of the data and the parts thereof.

Content markup identifies what the data means. For example, “This text string represents the title of a sidebar and not the title of the article.” Author, title, date and telephone number used as tags identify the content characteristics of data.

Context markup labels content so that additional meaning can be derived from a content markup tag such as “title.” A “document type” label or tag would add significant information about the title. In our example about Major League Baseball, a context-sensitive tag, “document type,” identifying “title” as coming from a newspaper would carry different connotations than it would if the document type were a legal journal.

Format markup allows you to determine how the document instance will be displayed. Generalized markup designers are moving to portable display markup. A portable display markup uses a style sheet that accompanies the document instance. This makes it much easier to change how a document is displayed. A document instance can have any number of style sheets. Markup indicating the title should be displayed using Times New Roman font, 44 point, bold, and centered at the top of page one could be embedded in the document instance at the point in the document where the title begins. By placing these instructions in a style sheet, the document instance is freed from the burden of carrying format information. Style changes become easier.

Generalized markup experimentation began in the late ’60s and ’70s at places like IBM and within the secondary publishing industry. For a brief period in the ’80s, the file loading format of Dialog Information Services, Dialog format b, became a de facto markup standard for secondary publishers. As generalized markup developments moved from the labs to the standards arena and started to become metalanguages, three basic parts to generalized markup language (GML) emerged. One, GML needs a declaration at the start of a document. In both XML examples above, the opening line of each denotes the markup language to be invoked, XML version 1.0. There are many GMLs in use and within any given GML there are many implementations. Providing a GML declaration gives the document user, either machine or human, the information needed to begin decoding the markup.

Two, GML relies on a document type definition (DTD) or schema. Using the rules, syntax, etc., of the metalanguage, the DTD is the set of instructions needed to mark up an actual document. The DTD describes the tags, their meaning and how to use them. It is the DTD that says the publication date in the document will be encoded using <pubdate> and </pubdate>. Question: Are the angle brackets, “<” and “>”, required? (The answer is posted to www.accessinn.com.) The DTD is necessary because the labels or tags are arbitrary. Publication date can be just as easily defined and marked up as <publicationdate> or <PD>. Many, many design issues and processing instructions are articulated in a DTD or schema. The two examples above are much easier to process by machine or human, if accompanied by a DTD.

Three, GML requires a “document instance.” GML can be used to encode any digital object, but textual oriented digital files, documents, are currently very common, so “document” is used in conjunction with “instance” to refer to all GML objects. The document instance is your newspaper article about MLB with the markup encoding it. When you have a document instance, you have an article ready for use by a GML tool such as a Web browser.

Continuing on the history of GML, the publishing industry really pushed strongly for the development of a GML. Standard Generalized Markup Language (SGML) resulted. It is an ISO standard, ISO 8879:1988 (www.iso.org). SGML helps publishers and others wrench their data from proprietary editorial and photocomposition systems. The push came about as publishers recognized new revenue streams in digital content. Extracting their data from proprietary markup, markup used only for format, was nearly impossible. As publishers implemented SGML, photocomposition vendors developed import and export routines to handle it.

SGML is a metalanguage. It does not specify how to mark up any given document. It allows for application and platform independence. It is portable. Sharing and re-packaging of information is possible.

SGML is very complex and expensive to implement and maintain. The software used to support SGML is expensive and complex. It has no mainstream browser support. Because of its complexity, it is not Web friendly.

Extensible Markup Language (XML) was developed to overcome these limitations and help foster the burgeoning Internet. SGML was used as the foundation or starting point in the development, but it is incorrect to label XML as a child of SGML. Like SGML, XML is a metalanguage. It allows implementers the flexibility to articulate their own markup strategy. By comparison, HTML is a specific application and resulting implementation of SGML. HTML is a published, supported DTD. HTML could be labeled a child of SGML, but application or implementation is a better descriptor of SGML and XML DTDs.

XML is a World Wide Web Consortium (W3C) recommendation, REC-XML-19980210 (www.w3c.org). No DTD or schema is required, but one can be designed and used in conjunction with XML document instances. Content and context tags are possible. XML can make use of style sheets. It does not have all of the features and complexity of SGML and, thus, it is easier to support, and it is easier to share XML document instances in a networked world. XML has become a family of recommendations and initiatives.

XML document instances can be generated and used without benefit of a DTD or schema. This can happen if the document instance is well-formed. An XML document instance is “well-formed” if it adheres to XML syntax. The document instance should begin with an XML declaration. XML syntax requires that each document instance have a single root, or document, element whose start-tag follows the XML declaration. The root element can be used only once, and its end-tag must close the document instance. All tags must be closed and properly nested.

Both examples above are well-formed. They start with an XML declaration. The root, or document, tag is <StartDay></StartDay> for our coffee document and <DOC></DOC> for the sports article. Be aware that XML is case sensitive regarding element tags. Using <StartDay> and </Startday> would result in a parsing error. The parser would report that the document was not well-formed because the element, <StartDay>, was not closed, that is, it had no end-tag, </StartDay>. All tags must be closed.

XML syntax requires proper nesting of tags. Here is an example of properly nested and improperly nested tags.

Properly Nested Tags

<Brew>Wait, impatiently!!</Brew>
<Dispense MugSize="24" UnitMeasure="Ounces">Pour, finally.</Dispense>

Improperly Nested Tags

<Brew>Wait, impatiently!!
<Dispense MugSize="24" UnitMeasure="Ounces">Pour, finally.</Brew></Dispense>

The second markup is not well-formed because <Dispense> opens before the closing tag – </Brew> – is encoded. The open and close tags are crossed. The coffee brewing instructions would not parse, resulting in no coffee and that would be disastrous!
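
The well-formedness rules described above can be checked with any XML parser. Here is a small Python sketch, assuming the standard library's ElementTree parser; the helper name is_well_formed is hypothetical, and the fragments are wrapped in a <StartDay> root because a document instance needs exactly one root element.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Tags closed in the reverse of the order in which they were opened.
properly_nested = (
    "<StartDay>"
    "<Brew>Wait, impatiently!!</Brew>"
    '<Dispense MugSize="24" UnitMeasure="Ounces">Pour, finally.</Dispense>'
    "</StartDay>"
)

# <Dispense> opens inside <Brew>, but </Brew> arrives first: the tags are crossed.
improperly_nested = (
    "<StartDay>"
    "<Brew>Wait, impatiently!!"
    '<Dispense MugSize="24" UnitMeasure="Ounces">Pour, finally.</Brew></Dispense>'
    "</StartDay>"
)

# Element names are case sensitive, so </Startday> does not close <StartDay>.
case_mismatch = "<StartDay></Startday>"

for label, text in [
    ("properly nested", properly_nested),
    ("improperly nested", improperly_nested),
    ("case mismatch", case_mismatch),
]:
    print(label, "->", is_well_formed(text))
```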

Validity is another important XML concept. A document instance is valid when it parses successfully against its accompanying DTD or schema. As mentioned, XML does not require the use of a DTD, but when a document instance invokes a DTD or schema, then a parser will validate the document instance against it. If it parses successfully, the document is both well-formed and valid. A document instance can be well-formed but not valid if it has a DTD and violates the rules of its DTD. For example, its schema may require an element to be numeric only. If the numeric-only element contains an alpha character, it would be invalid. Parsed without invoking its DTD, it could very likely be well-formed. If a document instance is valid, then it is always well-formed.
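
The difference between well-formed and valid can be sketched in a few lines of Python. Note the assumptions: Python's standard library does not include a validating parser, so the single rule below (the Volume attribute of <Water> must be numeric) is a hand-written stand-in for what a DTD or schema would declare, and the function name check_document is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_document(xml_text):
    """Classify a document as 'not well-formed', 'well-formed but invalid' or 'valid'.

    'Valid' here means only that it passes one hypothetical rule, standing in
    for a real DTD or schema: every <Water> element needs a numeric Volume.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not well-formed"
    for water in root.iter("Water"):
        if not water.get("Volume", "").isdigit():
            return "well-formed but invalid"
    return "valid"


print(check_document('<StartDay><Water Volume="24">Add cold water.</Water></StartDay>'))
print(check_document('<StartDay><Water Volume="lots">Add cold water.</Water></StartDay>'))
print(check_document('<StartDay><Water Volume="24">Add cold water.</StartDay>'))
```

The second document parses without complaint, which is exactly the point: syntax alone cannot tell you that “lots” is not a number.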

Early in the development of XML, the concept of namespaces was introduced to allow for the resolution of ambiguous elements and attributes that inevitably occur as information zooms around the Internet or even a corporate intranet. Extending our coffee example illustrates a simple case of namespace use.

<?xml version="1.0"?>
<StartDay
xmlns:USA="http://coffee.org/usa/measures"
xmlns:UK="http://coffee.org/uk/measures">
<USA:ounces>24</USA:ounces>
<UK:ounces>24</UK:ounces>
</StartDay>

In this example, using the namespace syntax disambiguates “ounces.” Both “ounces” elements have the value 24, but the prefix on the first binds it to the namespace named by the URL pointing to the USA subdirectory, and the prefix on the second binds it to the namespace named by the URL pointing to the UK subdirectory. The parser treats each URL as a unique name and is not required to retrieve anything from it, although a definition of the element, ounces, may be published at each URL as a schema or DTD. It is possible that both schemas define ounces similarly, but without the schema or DTD, what each “ounces” means remains ambiguous.
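
Here is how the namespace example might look to a program, sketched in Python with the standard library's ElementTree. ElementTree uses each namespace URL purely as a name and does not fetch anything from it. The conversion factors are an assumption added for illustration: they suppose the USA vocabulary means US fluid ounces and the UK vocabulary means imperial fluid ounces, which the example itself does not state.

```python
import xml.etree.ElementTree as ET

NS_XML = """<StartDay
 xmlns:USA="http://coffee.org/usa/measures"
 xmlns:UK="http://coffee.org/uk/measures">
<USA:ounces>24</USA:ounces>
<UK:ounces>24</UK:ounces>
</StartDay>"""

root = ET.fromstring(NS_XML)

# The parser replaces each prefix with its full namespace name, so the two
# "ounces" elements end up with different, unambiguous tags.
tags = [child.tag for child in root]
print(tags)

# Prefixes used for lookup are local to this program; only the URLs must match.
NAMESPACES = {
    "usa": "http://coffee.org/usa/measures",
    "uk": "http://coffee.org/uk/measures",
}

# Assumed meanings (not given in the article): US vs. imperial fluid ounces.
ML_PER_US_FLUID_OUNCE = 29.5735
ML_PER_IMPERIAL_FLUID_OUNCE = 28.4131

us_ml = float(root.find("usa:ounces", NAMESPACES).text) * ML_PER_US_FLUID_OUNCE
uk_ml = float(root.find("uk:ounces", NAMESPACES).text) * ML_PER_IMPERIAL_FLUID_OUNCE
print(round(us_ml), "ml versus", round(uk_ml), "ml")
```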

XML does not require a DTD, but DTDs are very useful. When a DTD accompanies its document instance, the recipient can do much more with the document. The DTD defines the elements, attributes and entities used within the document instance. This enhances understanding and reduces ambiguity. An XML DTD allows for much of the functionality described above.

In addition to elements, SGML and XML allow for attributes and entities. Briefly, an attribute is metadata about an element.

<Beans Grind="perc" Type="Java">

“Grind” and “Type” are attributes and provide additional descriptive and processing information about the element, “Beans.” The element “Beans” could appear in the document instance as <Beans>, but with the addition of the attributes, you gain greater understanding of the element.

Entities are shortcuts similar to keyboard macros. They are used within a DTD and within the document instance. Within a DTD they can be used to summarize a long series of elements, attributes and other entities, so they can be grouped and reused without significant typing. They are used within the document instance to represent special characters such as the Greek character beta (β). They can also be used to represent a long data or text string that is used many times within the document instance. The parser inserts the correct character or text string on rendering.
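
A short Python sketch shows the parser doing this substitution. The entity name company and the <Note> element are hypothetical, declared in a small internal DTD subset; &#946; is the numeric character reference for the Greek letter beta.

```python
import xml.etree.ElementTree as ET

# An internal DTD subset declares one entity; the document uses it twice,
# plus a numeric character reference for the Greek letter beta.
ENTITY_XML = """<!DOCTYPE Note [
  <!ENTITY company "Access Innovations">
]>
<Note>&company; uses the Greek letter &#946; here. &company; again.</Note>"""

note = ET.fromstring(ENTITY_XML)

# By the time the program sees the text, the parser has already expanded
# the entity references into the strings and characters they stand for.
print(note.text)
```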

One limitation of an XML DTD is that it is not a well-formed XML document instance. XML schemas are well-formed and are functionally equivalent to DTDs. However, schemas are more powerful than XML DTDs because their functionality has been extended. A schema can include extensive processing instructions and parameters used by software for any number of reasons. Data can be manipulated, generated, reorganized and checked for errors. Element and attribute values can be better controlled fostering greater data quality.

Given the greater functionality of XML schemas, processing initiatives have been developed. Two processing initiatives are Document Object Model (DOM) and Simple API for XML (SAX). DOM has a Level 1 W3C recommendation in circulation. DOM defines a programmatic interface for traversing the document instance hierarchy and manipulating elements and attributes. DOM places the entire document structure into memory, parsing it into a tree structure. Using a DOM implementation, programs can treat each node of the tree as an object and perform endless manipulations. SAX is being utilized to overcome the memory requirements of DOM, which can be onerous for very complex schemas applied to long documents. SAX, as the name implies, is simpler than DOM: it reports the document to the program as a stream of events rather than building a tree. Processing initiatives are the interface between your data and your software environment. DOM would layer between your XML repository, or database, and your photocomposition and fulfillment software.
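
The contrast between the two approaches can be sketched with the DOM and SAX modules that ship with Python (xml.dom.minidom and xml.sax). The shortened coffee document and the handler class name ElementNameCollector are assumptions made for brevity.

```python
import xml.sax
from xml.dom import minidom

COFFEE = (
    b"<StartDay>"
    b'<Water Volume="24">Add cold water.</Water>'
    b"<Brew>Wait, impatiently!!</Brew>"
    b"</StartDay>"
)

# DOM: the whole document is parsed into an in-memory tree of node objects,
# which can then be traversed and modified in any order.
dom = minidom.parseString(COFFEE)
water = dom.getElementsByTagName("Water")[0]
water.setAttribute("Volume", "12")  # manipulate a node in place
dom_names = [
    node.tagName
    for node in dom.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]


# SAX: the parser calls back as it reads; nothing is kept unless the handler
# chooses to keep it, so memory use stays small even for long documents.
class ElementNameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


collector = ElementNameCollector()
xml.sax.parseString(COFFEE, collector)

print("DOM children:", dom_names, "Volume now", water.getAttribute("Volume"))
print("SAX events:  ", collector.names)
```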

Transformations allow multiple renderings, or views, of your data. An example of an XML transformation initiative is Extensible Stylesheet Language (XSL). XSL transforms XML into HTML and other presentation formats including other XML. It allows extensive reordering, text generation and calculations, but does not modify the source data. Because XSL uses XML syntax, an XSL style sheet is itself well-formed XML. It can be sent with its document instance. Multiple XSL style sheets can even be sent with a single document instance. Other transformation initiatives are underway. Your data requirements and the dominant trends in your field should guide your choice of initiatives.
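
To give a feel for what a transformation does, here is a Python sketch that renders the newspaper example as HTML. It is not XSL: Python's standard library has no XSL processor, so the transformation is written by hand as a stand-in, and the function name to_html and the choice of HTML elements are assumptions. As with XSL, the output is reordered and has generated text, and the source is left untouched.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<DOC Date="5/21/02" Doctype="Newspaper">
<TI> Is MLB a sport, entertainment or business? </TI>
<Byline> Smith </Byline>
<ST> Sports </ST>
<ST> Entertainment </ST>
<ST> Business </ST>
</DOC>"""


def to_html(xml_text):
    """Build one HTML view of a DOC instance without modifying the source text."""
    source = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = source.findtext("TI").strip()
    # Generated text: "By " does not appear anywhere in the source.
    ET.SubElement(body, "p").text = "By " + source.findtext("Byline").strip()
    # Reordering: subject terms are listed alphabetically in this view.
    subject_list = ET.SubElement(body, "ul")
    for term in sorted(st.text.strip() for st in source.findall("ST")):
        ET.SubElement(subject_list, "li").text = term
    return ET.tostring(html, encoding="unicode")


html_view = to_html(ARTICLE)
print(html_view)
```

A second function with different rules would give a second view of the same data, which is the role multiple style sheets play.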

A visit to an XML Web site like www.xml.org reveals a bewildering array of XML schema, processing and transformation initiatives. Where do you start? If helpful, go to the FAQs and tutorial pages. There is even an article titled, “XML for Dummies.” Once you’re comfortable with XML jargon, visit the vertical market pages. These represent initiatives in just about every field of endeavor from agriculture to zoology, from profit to nonprofits, from large to small. Stick to what is closest to home; otherwise, such visits can be overwhelming. A danger of XML is that it could become a Tower of Babel – too many competing schema, requiring too much overhead to properly cross-reference using namespaces or anything else.

Herein lie areas of much needed research and development. What constitutes valid and useful taxonomies of metadata? Can meaningful ontologies of metadata taxonomies create archetypes and knowledge representations? Little is known about human-to-metadata interfaces.

The “Publishing/Print” page at XML.org references DocBook, Text Encoding Initiative (TEI) and Job Definition Format (JDF), among others. An important initiative here is the ONIX International DTD. Many book wholesalers and retailers, including Amazon and Ingram, have adopted it. ONIX International is the international standard for representing and communicating book industry product information in electronic form, incorporating the core content which has been specified in national initiatives such as BIC Basic and the Association of American Publishers’ ONIX Version.

Much of this information has been captured for years using AACR2 (the Anglo-American Cataloging Rules, 2nd edition) and communicated in electronic form using MARC (the standard for machine readable cataloging). The book industry has additional requirements in creating and facilitating commerce that libraries do not. Libraries have their own needs. The Library of Congress is helping to develop Metadata Object Description Schema (MODS) and Metadata Encoding & Transmission Standard (METS). MODS is an XML schema for a bibliographic element set and will carry selected data from existing MARC 21 records. METS is an XML schema for encoding descriptive, administrative and structural metadata regarding objects within a digital library. MODS deals with just bibliographic information. METS would be used to encode an entire digital book, for example, and could use MODS for the book’s bibliographic information. The Library of Congress has recently released both schemas for comment.

What can you do with XML? What can’t you do with XML? Access Innovations developed a complete database management system around XML as part of their Data Harmony software suite. The product is called XIS, for XML Intranet System. It uses a hierarchy of XML schema to drive the system. Written in Java, it employs a system level XML schema that articulates all of the operations needed for creating a structured, textual database. Each database application is defined in a project level schema. The project level schema contains the processing instructions to mark up the document instance. The document instance is stored and processed as an XML object. That is, the XIS database management system stores data in XML rather than relying on a proprietary markup as do most database management systems. This provides tremendous flexibility and yet great control over database operations.

Using ONIX, a book buyer could receive an ONIX document instance that contains information about the dimensions and weight of a book they’re considering ordering. Because ONIX is a published XML schema, the buyer can have a computer programmed to read the ONIX document instance and calculate the total weight and the volume in cubic feet for an order of one thousand units. This can be very valuable and a great time saver.
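
The arithmetic is simple once the markup is agreed upon. The Python sketch below works through it for a made-up record. Please note that the element names (<Book>, <HeightInches> and so on) and the measurements are hypothetical and are NOT actual ONIX tags or data; a real program would use the element names published in the ONIX specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: these element names and values are invented for
# illustration and are not taken from the ONIX DTD.
BOOK_XML = """<Book>
<Title>Example Title</Title>
<HeightInches>9</HeightInches>
<WidthInches>6</WidthInches>
<ThicknessInches>1</ThicknessInches>
<WeightPounds>1.5</WeightPounds>
</Book>"""

CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3  # 1,728


def order_totals(xml_text, units):
    """Return (total weight in pounds, total volume in cubic feet) for an order."""
    book = ET.fromstring(xml_text)
    height = float(book.findtext("HeightInches"))
    width = float(book.findtext("WidthInches"))
    thickness = float(book.findtext("ThicknessInches"))
    weight = float(book.findtext("WeightPounds"))

    total_weight = weight * units
    total_volume = height * width * thickness * units / CUBIC_INCHES_PER_CUBIC_FOOT
    return total_weight, total_volume


pounds, cubic_feet = order_totals(BOOK_XML, 1000)
print(f"1,000 units: {pounds} lb, {cubic_feet} cubic feet")
```

For this invented book, one thousand copies come to 1,500 pounds and 31.25 cubic feet, before allowing for packaging.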

Without standard markup, Internet commerce won’t happen. With standard markup, your ability to transmit your commercial databases will be greatly enhanced. Standard markup frees your data from legacy hardware and software. It allows you to develop efficiently and rapidly new products and services, generating new revenue streams. Lower operating costs; new revenue streams – worth thinking about!

(Article first published in the Bulletin of the American Society for Information Science and Technology, Volume 29, Issue 1, October/November 2002.)