Feature
June 2001
Universal translator
What’s all this fuss about XML? Can it make “write once, read everywhere” a reality?
Paul G Schreier, Contributing Editor
When it comes to Internet content, we seem to be building a contemporary Tower of Babel. We’ve got too many developers writing too many applications and babbling in too many dialects. And the problem only threatens to get worse with the proliferation of client devices, including cell phones, interactive TVs, and everything in between. The convergence world cries out for a way to make information neutral, so that one source can feed many different client devices.
The biblical story of the Tower of Babel ended badly, but perhaps there’s hope for us. The Web and e-commerce communities are pushing forward a mechanism that promises to clear the way for any application to exchange data with any other. This mechanism is XML, the extensible markup language. Perhaps you’ve heard of it—you’d have to be in seclusion not to have. But what is it, really? And how significant is it to you?
Before we get into a discussion of what XML is and why it carries such tremendous implications, it’s probably instructive to point out what XML is not. “First, despite its name, XML is not a language,” explains Deborah Lapeyre, vice president of Mulberry Technologies, an XML design and training consultancy. “It’s not a natural language or a programming language, so it doesn’t replace C++, Java, or anything else. Further, XML isn’t a user interface or a browser. It’s not a presentation system. It’s not a formatting or processing system. It’s not even a standard set of tags or a recommended set of tags. Rather, it gives you the tools for making your own tags.”
In other words, XML by itself doesn’t do a blessed thing. It’s simply a data format. A means toward an end, not the ultimate goal. Opting for XML is akin to choosing SQL (structured query language) for databases: you still have to build the database and define procedures for manipulating it.
“We don’t need XML per se—we need structure,” comments Andreas Pfeiffer, president of Pfeiffer Consulting, a digital-content and new-media research institute in Malakoff, France. “And XML is the best technology around for getting structure in documents in a nonproprietary way. We need XML even if we don’t yet know why.”
Give text a meaning
It’s perhaps easiest to understand what XML is by comparing it to HTML (hypertext markup language). HTML tags do just one thing. They instruct a client machine on how it should display a piece of data in terms of font, type size, and screen position. However, the browser (or any other application) reading an HTML text string has no idea of its meaning. The word “Washington” could refer to a state, a town, a street, a school, or a person. The application doesn’t know and doesn’t care.
In contrast, XML tags describe the meaning of a text string. For instance, the XML string "<lastname>Washington</lastname>" provides both the text and an assignment to a predefined datatype. Other types of XML markup go even further, letting you define the structure of a document, insert value-added information such as location/navigation data or metadata, or add rendering and processing information.
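To make the contrast concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The <state> and <street> tags and the <examples> wrapper are invented for illustration; only <lastname> comes from the example above. The same string appears three times, and the tag is what tells an application which "Washington" it is looking at.

```python
import xml.etree.ElementTree as ET

# Three elements with identical text; only the tag names differ.
# <examples>, <state>, and <street> are hypothetical tags for this sketch.
doc = ET.fromstring(
    "<examples>"
    "<lastname>Washington</lastname>"
    "<state>Washington</state>"
    "<street>Washington</street>"
    "</examples>"
)

# Pair each piece of text with the meaning its tag assigns to it.
meanings = [(element.tag, element.text) for element in doc]

for tag, text in meanings:
    print(f"{text!r} is a {tag}")
```

An HTML renderer would see three identical strings to draw; an XML-aware application can route each one to a different field.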

In short, XML adds structure. The desire for structure of this type is nothing new. In fact, ISO (the International Organization for Standardization) tried to address the need back in 1986, when it published SGML (standard generalized markup language) as standard ISO 8879; the W3C (World Wide Web Consortium, www.w3c.org) later defined XML as a simplified subset of it. However, people found SGML too complicated for the tasks it addressed, so few people outside large corporations adopted it.
On the other hand, HTML is too broad and ill defined to provide the needed structure. “Not so many years ago we had the ‘tag wars’,” recalls P G Bartlett, vice president of marketing at Arbortext, a developer of XML-based content-integration and e-commerce tools. “People wanted to add functionality, so HTML changed every few months. XML settled the war, because now anyone can create their own tags.” Further, XML is far simpler than SGML, a factor that the W3C believes will lead to its wider acceptance. It appears these hopes will be realized. Just go to any conference devoted to electronic publishing, data distribution, or e-commerce, and you’ll see how XML has come to the forefront of everyone’s mind.
Nonobvious benefits
Even with this brief review of what XML is, its benefits aren’t always immediately clear. For CommVerge readers, the critical point is that XML provides a universal data format.
Begin with the fact that XML adopts ISO 10646 (Unicode), the international 31-bit character set, as its character repertoire. As such, it can represent text in almost every language, including Chinese, Hebrew, languages written in Cyrillic, and (through private-use code points) even Star Trek’s Klingon.
In addition, a device of any type can read what it needs from an XML source and format it as desired using dedicated stylesheets based on XSL (the extensible stylesheet language). For instance, a cell phone might strip out unneeded elements and display only selected textual data. Going a step further, you could take the text fields and send them to a text-to-speech engine for audio output. Meanwhile, another device could convert the same XML source into Braille.
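The one-source, many-outputs idea can be sketched in a few lines of Python. Python's standard library has no XSLT processor, so the two functions below are hand-written stand-ins for what a pair of XSL stylesheets would declare. The <story> vocabulary and the file name tower.jpg are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the tag names are invented for this sketch.
SOURCE = """
<story>
  <headline>Universal translator</headline>
  <photo href="tower.jpg"/>
  <body>XML makes information neutral.</body>
</story>
"""

def render_for_phone(root):
    """Small screen: keep the headline, drop the photo and body."""
    return root.findtext("headline")

def render_for_speech(root):
    """Text-to-speech: collect all text in document order, skip markup."""
    parts = (text.strip() for text in root.itertext())
    return " ".join(part for part in parts if part)

root = ET.fromstring(SOURCE)
phone_view = render_for_phone(root)
speech_text = render_for_speech(root)

print(phone_view)
print(speech_text)
```

Neither function alters the source; each client reads what it needs from the same document.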
Next, through the transfer of meaningful data, XML can reduce throughput requirements. The client device can process the data, as opposed to sending packets back and forth to a central host for manipulation. For instance, a PDA might be able to keep a running total of items as you select them from an online vendor. Rather than having to interact with the host computer to repeatedly refresh a tally page, the PDA could wait until you finished shopping to inform the server of your choices.
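A rough sketch of that shopping scenario follows. The catalog, its tag names, the prices, and the <order> format are all invented for illustration. The point is only that, once the client holds meaningful data, it can keep the tally itself and contact the server once at the end.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical catalog, downloaded once from the vendor.
CATALOG = """
<catalog>
  <item sku="A1"><name>Stylus</name><price>4.50</price></item>
  <item sku="B2"><name>Case</name><price>12.00</price></item>
  <item sku="C3"><name>Cable</name><price>8.25</price></item>
</catalog>
"""

catalog = ET.fromstring(CATALOG)
prices = {item.get("sku"): Decimal(item.findtext("price")) for item in catalog}

# The shopper picks items; the total is updated locally, with no round trip.
cart = []
running_total = Decimal("0")
for sku in ["A1", "C3"]:
    cart.append(sku)
    running_total += prices[sku]

# Only when shopping is finished does the client build one order document.
order = ET.Element("order")
for sku in cart:
    ET.SubElement(order, "item", sku=sku)
order.set("total", str(running_total))
order_xml = ET.tostring(order, encoding="unicode")

print(order_xml)
```

Decimal is used instead of float so that currency amounts add up exactly.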
Another benefit? XML could dramatically improve search engines (they could hardly get worse, right?). Search on “Washington” today and you’ll see thousands of hits. But with XML, you’ll be able to search on “State = Washington,” or “President = Washington,” or even “Street In Rye, NH = Washington,” depending on the tag structure.
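The difference between a plain-text search and a tag-aware one can be sketched as follows. The <records> index and its tag names are assumptions for this example, not a real search-engine format.

```python
import xml.etree.ElementTree as ET

# Hypothetical index: each record tags the role its text plays.
INDEX = """
<records>
  <record id="1"><state>Washington</state></record>
  <record id="2"><president>Washington</president></record>
  <record id="3"><street town="Rye, NH">Washington</street></record>
  <record id="4"><state>Oregon</state></record>
</records>
"""

def search(root, tag, value):
    """Return ids of records whose <tag> child has exactly this text."""
    return [record.get("id") for record in root.iter("record")
            if record.findtext(tag) == value]

root = ET.fromstring(INDEX)

# Plain-text search: every record that mentions the word at all.
plain_hits = [record.get("id") for record in root.iter("record")
              if "Washington" in "".join(record.itertext())]

# Tag-aware search: only records where Washington is a state.
state_hits = search(root, "state", "Washington")

print(plain_hits, state_hits)
```

The plain search returns three records; the tag-aware search returns only the one where Washington is a state.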
Also not to be forgotten, XML improves hyperlinks. With XML, they can be bidirectional (allowing you to move from Page A to Page B or vice versa, making the process of surfing or searching more flexible) and can pinpoint certain locations within the target XML document. Also, links can not only point to another page but also suck in new text or images directly where you click. Further, the XLink standard enables authors to use indirect links to a central database rather than pointing to a specific page, thereby reducing link rot.
Paul’s Markup Language
Perhaps most central to XML’s power is the ability to use it to create a new “language.” Anybody can design a new, valid language by simply defining a few tags. In fact, I’d like to take this opportunity to announce the development of PML (Paul’s Markup Language). Until other people start using PML, of course, it has no real purpose. Which brings us to another key XML concept.
Beyond their commitment to a common set of tags, applications must also agree how they will employ those tags. In practice, they agree to use a common DTD (document type definition), which defines the allowed syntax in a document. A DTD provides the names of allowed elements, attributes, and entities and also defines how they fit together. Written in a declaration syntax inherited from SGML rather than in XML itself, a DTD is essentially a contract between the data producer and the consumer, so each can validate that it sent or received what it expected.
Because a DTD has no datatypes for element content (text remains just text), it can’t specify numeric ranges or define limitations or checks on text content. Addressing this requirement is XML Schema, which is itself written in XML and adds strong data typing to what a DTD offers. It also can mandate default and required values for content and establish relationships among elements. Beyond DTDs, e-commerce applications also need to agree upon protocols for sharing XML-formatted data (see the sidebar, “Exchanging envelopes”).
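The gap between the two kinds of checking can be illustrated with a toy example. Python's standard library does not include a validating parser, so the two functions below are hand-rolled stand-ins: one mimics the structural rule a DTD content model can express, the other a numeric range rule of the kind that needs a schema. The <line>, <sku>, and <quantity> names and the 1 to 99 range are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

# Toy contract: the children a <line> element must have, in order.
# This imitates a DTD content model; it is not a real DTD validator.
REQUIRED_CHILDREN = ["sku", "quantity"]

def structure_ok(element):
    """DTD-style check: the right children, in the right order, nothing else."""
    return [child.tag for child in element] == REQUIRED_CHILDREN

def quantity_ok(element, low=1, high=99):
    """Schema-style check: <quantity> must be an integer between low and high."""
    text = element.findtext("quantity", default="")
    try:
        return low <= int(text) <= high
    except ValueError:
        return False

good = ET.fromstring("<line><sku>A1</sku><quantity>3</quantity></line>")
odd = ET.fromstring("<line><sku>A1</sku><quantity>lots</quantity></line>")

# Both documents have the agreed structure...
print(structure_ok(good), structure_ok(odd))
# ...but only a typed check notices that "lots" is not a number.
print(quantity_ok(good), quantity_ok(odd))
```

Both documents pass the structural check, which is all a DTD-level contract can promise; only the typed check rejects the second one.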
“There’s a very common misconception that once we have XML, we can exchange data,” Pfeiffer explains. “That’s absolutely wrong. Before you can do anything with XML, you must agree on the implementation. Today XML is a vast array of different, competing and often incompatible standards and proposals for different market segments. However, some markets are starting to agree on an XML structure and implementation.”
Indeed, the ability to extend XML with tags and attributes of vital interest to a specific industry or branch of study is leading to the emergence of hundreds more acronyms. These range from CML (chemical markup language), which handles molecular information, to DESSERT (document encoding and structuring specification for electronic recipe transfer—no joke). Some languages reach across disciplines. MathML, for example, would allow browsers with adequate stylesheet support to render and solve mathematical expressions.
Just for handhelds
Meanwhile, the first XML application (language) you might have to deal with is XHTML. By now you’ve learned about the weaknesses of HTML and the strengths of XML. Why not combine the best of both? That’s what W3C did when it created XHTML, which expresses the HTML 4.0 standard in XML instead of using the more complicated SGML. Its initial goal is to assist in the migration of HTML into XML.
If you remember one thing about XHTML, remember that it’s designed for portability. Eventually it will become possible to develop XHTML-compliant content that’s usable by any conforming client; the server, the client, or a proxy service will decide on the subset of XHTML that is received.
Such functionality is vital to CommVerge readers, who might have trouble tracking the efforts of W3C and other groups to facilitate the display of HTML and now XML data on small devices ranging from cell phones to set-top boxes or even smart kitchen appliances. Ready for more acronyms? Not so very long ago we had HDML (handheld device markup language), which has slipped into near obsolescence thanks to WML, the wireless markup language sponsored by the WAP (wireless application protocol) Forum. At the same time, to allow the display of HTML on i-Mode handheld devices, Japan’s Access Co designed CHTML (compact HTML). Unfortunately, WAP devices can’t display CHTML, and i-Mode devices can’t display WML. And realizing that the transition to XML won’t happen overnight, W3C offered “HTML 4.0 Guidelines for Mobile Access.”
Indeed, says W3C, because there are many ways to subset HTML, various organizations and companies have developed many nearly identical subsets. But without a common base set of features, developing applications for a wide range of Web clients is difficult. To try to coalesce these various efforts around a common base, last December W3C defined the XHTML Basic subset. And refining that effort, in April the group published “The Modularization of XHTML” as a recommendation. This defines a method for separating XHTML into a collection of modules, each enabling a group of familiar and related HTML functions such as lists, forms, tables, and images. This approach allows content developers to tailor information to handheld Web devices that often don’t support such niceties. A cell-phone maker, for example, might select a module for XHTML text but leave out the one for graphics.
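The module idea can be sketched as a filter. The mapping below is a simplified, illustrative grouping of element names under module labels; the authoritative module definitions are in the W3C recommendation, and a real XHTML page would also carry the XHTML namespace, which is omitted here for brevity. The page content and the file name logo.png are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative module-to-element mapping (not the W3C lists).
MODULES = {
    "structure": {"html", "head", "title", "body"},
    "text": {"p", "h1", "em", "strong"},
    "list": {"ul", "ol", "li"},
    "image": {"img"},
}

def prune(element, allowed):
    """Remove, in place, every descendant whose tag is not in `allowed`."""
    for child in list(element):
        if child.tag in allowed:
            prune(child, allowed)
        else:
            element.remove(child)

PAGE = ("<html><head><title>Menu</title></head><body>"
        "<h1>Menu</h1><img src='logo.png'/>"
        "<ul><li>Soup</li><li>Salad</li></ul></body></html>")

# A hypothetical cell phone that supports text and lists but not images.
phone_modules = ["structure", "text", "list"]
allowed = set().union(*(MODULES[name] for name in phone_modules))

page = ET.fromstring(PAGE)
prune(page, allowed)
phone_page = ET.tostring(page, encoding="unicode")

print(phone_page)
```

The same pruning could run on the server, on the client, or in a proxy, which is the flexibility the modular approach is meant to allow.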
“We’re working actively with the W3C,” says Ron Mandel, manager of developer services at Openwave (formerly phone.com). “WAP 2.0, for which we expect a draft this summer, will be an XHTML application. It will work with XHTML Basic, plus the mobile modules, to allow us to add functions in WML using XHTML. Because it’s compatible with XML, service providers can create new tags, and if a client device doesn’t recognize one, it simply ignores it.”
Device manufacturers should also be interested to know that Openwave’s Mobile Browser Universal Edition, which is scheduled for shipment this summer, will be WAP 2.0-ready and will support XHTML, WML 1.1, and CHTML.
About output
Clearly, XML is growing in popularity as a source format, but Adobe’s PDF continues to gain followers as an output format. And its developer is making moves to carry that format into the portable domain. Specifically, Adobe has released Acrobat Reader for the Palm OS, and that free software reflows text to make it more readable on a small display. “PDF is great for the interchange and display of data pages,” says Mulberry’s Lapeyre, “and it will remain useful for document delivery, prepress applications, online repositories, and long-term archiving where the underlying data doesn’t change or shouldn’t change, such as in legal documents.
“Overall,” she adds, “both PDF and XML will exist and serve useful purposes. I see that people will use XML when more than one look and feel is necessary, and where security issues require complex processing based on data in the document, and when fine granularity is needed for searching. In fact, I predict that companies will start delivering XML content with a stylesheet, and a cottage industry will arise where people sell only the stylesheets, which can accept a certain type of XML source and make it look better on certain output devices. People will also use XML source to produce PDFs, HTML for older browsers, search services in a proprietary format, ebook formats, and CD-ROM internal formats.”
As should be obvious by now, XML is in its formative stage. Standards, their roles, and even their names are undergoing change. But when all this starts to settle down, content producers will start jumping on board in a big way. Even today, content producers are paying close attention to these trends, which they realize will be vitally important. “Most people create XML so they can deliver to multiple media—not today but in the future,” notes Arbortext’s Bartlett. “It could take years to get a critical mass” that would allow developers to safely assume that potential users can deal with XML.
In the meantime, expect continuing confusion about the role of HTML while XML ramps up. “For right now, there aren’t any ‘real’ XML browsers,” says Mulberry’s Lapeyre. “For the short term, while data suppliers will increasingly enter their data into XML-compliant programs and databases, they will generally convert their data to HTML for display, and HTML will be the predominant output of XML systems. In the longer term, browsers will start to accept XML directly, reducing the need for HTML, which will become less and less of a mission-critical format. People putting together simple display-only Web pages, such as kids or beginning Web users, will continue to use HTML. However, I don’t see that many people will convert their legacy HTML code to XML.”
Take this away
What’s important, however, is that XML and all of its offspring are being built on a solid foundation. “XML is very stable, and it has no real competition,” adds Arbortext’s Bartlett. “It’s not like other standards where there can be a morass of competing schemes. Here, it’s much less risky to pick it as a data format; in fact, it would be unsafe to make any other decision.”
Rather than being content developers, CommVerge readers generally fall more into the category of content consumers. As such, there’s not a whole lot they must do at the moment except be aware of the big trends and realize that XML is getting ready to arrive as the universal, media-neutral mechanism. When it does, it will drive not only data exchange among disparate devices but also big steps in e-commerce.
“This market and technology are still in their early phases,” Pfeiffer concludes. “I suggest people start watching and listening to industry events and trends. Analyze your needs, and start experimenting with the free or low-cost software off the Web. But don’t be in a hurry. Cross-media publishing is a glorious vision, but it will take a long time to arrive.”
XML resources
Lapeyre, Deborah A, XML for Publishing Managers, Tutorial H915, Seybold Boston, April 10, 2001.
The O’Reilly Network (www.oreillynet.com), The best of XML.com, a compilation of articles that have appeared in this online publication.
www.w3c.org, site maintained by the World Wide Web Consortium.
www.ucc.ie/xml, XML FAQ maintained by Peter Flynn of University College Cork in Cork, Ireland.
www.xml.org, site maintained by OASIS (the Organization for the Advancement of Structured Information Standards), a non-profit international consortium that creates interoperable industry specifications based on public standards such as XML and SGML.
Exchanging envelopes
The benefits of XML extend far beyond being able to display data on differing browsers or displays. Structuring data in XML’s fashion will make it far easier for devices to exchange bits of data over the Web, greatly facilitating e-commerce.
“To implement e-commerce, though, it’s not enough for two partners to exchange a set of XML documents of predefined format,” comments Ron Kleinman, chief technology evangelist for market developer relations at Sun Microsystems. The receiving application must also understand the document’s context, he explains. What’s more, the digital “envelope” carrying the message must be addressed correctly, and both ends of the transaction must employ an agreed-upon delivery service. Then, the receiving application must be able to open the envelope.
Thus, every vertical XML standard consists of a document layer sitting on top of an XML infrastructure that supplies these services. Today, Kleinman adds, rather than write this layer, a vertical standards group might select from two viable candidates. The first is SOAP (simple object access protocol), which originated with Microsoft and now, in the version submitted to W3C, is also known as XP (XML protocol). The second is part of the ebXML (electronic business XML) movement spearheaded by an industry group known as OASIS. That protocol’s TR&P (transport, routing and packaging) addresses the envelope issues, while other layers address service registry and workflow issues.
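As a rough illustration of the envelope idea, the sketch below wraps a business document in a SOAP 1.1 Envelope, with addressing information in the Header and the document in the Body, and then opens it on the receiving side. The namespace URI is the one SOAP 1.1 defines. The routing namespace, the <to> element, the <order> payload, and the address are hypothetical, and no transport is shown.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
ROUTE_NS = "urn:example:routing"                       # hypothetical, for this sketch
ET.register_namespace("soap", SOAP_NS)

def wrap(payload, recipient):
    """Seal a document in an Envelope, with the address in the Header."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    ET.SubElement(header, f"{{{ROUTE_NS}}}to").text = recipient
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return ET.tostring(envelope, encoding="unicode")

def unwrap(message):
    """Open the envelope: return the recipient and the enclosed document."""
    envelope = ET.fromstring(message)
    header = envelope.find(f"{{{SOAP_NS}}}Header")
    recipient = header.findtext(f"{{{ROUTE_NS}}}to")
    payload = envelope.find(f"{{{SOAP_NS}}}Body")[0]
    return recipient, payload

order = ET.fromstring("<order><sku>A1</sku></order>")
message = wrap(order, "orders@example.com")
recipient, payload = unwrap(message)

print(message)
print(recipient, payload.tag)
```

Both ends still have to agree on what an <order> means; the envelope only standardizes how it is addressed, carried, and opened.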
SOAP and its derivatives are part of a wider Microsoft initiative known as BizTalk, in which XML data repositories from various niche industries can package their information for transport across the Web. However, Kleinman says, while anyone can talk to a BizTalk server using XML, how services in this framework talk to each other behind the scenes is not open. Further, the backbone must exist on a Microsoft platform. So what was an OS is now becoming an enterprise OS. In contrast, ebXML is an open standard in which people can use XML to address infrastructure issues. “Because of the growing popularity of ebXML, Microsoft is now donating lots more of its technology to the standards bodies,” Kleinman adds. Initially, SOAP was a closed technology, but Microsoft has since turned it over to W3C.
“XML fits in perfectly with Sun’s philosophy,” Kleinman continues. “With Java we have a write-once, run-everywhere philosophy, and the universal applicability of XML fits in with that scheme. We want to make sure there’s only one API [application programming interface] or interface to a technology and that it can run everywhere. In contrast, BizTalk is tied up in a product, the BizTalk Server, which runs on Windows NT. Microsoft seems to have a vision of one-stop shopping for all your enterprise software. BizTalk is the moat in front of the castle, but once you get inside, it’s all Microsoft infrastructure, all part of the .NET movement. But with ebXML we’re putting functionality equivalent to BizTalk into the public domain.”
Author information
<?xml version="1.0"?><authorbio><title>Contributing editor</title> <person>Paul G Schreier</person> (<emailaddr>aa1mi@arrl.net</emailaddr>) is a writer and marketing consultant living in <homecitystate>Rye, NH</homecitystate>.</authorbio>