I’ve been thinking about alternatives to XML and SGML for marking up text. One might ask, why? Well, why not? XML is popular and has a lot of tool support, but there is always room for improvement. I’ve been thinking in terms of simplification: XML (and SGML) allow elements to have sub-elements and attributes, but one could easily get by with sub-elements alone. Similarly, XML requires the beginning and ending of an element be tagged with the element’s type, but that’s a legacy of SGML, when some elements could omit their end tags if they could be inferred from context. In XML, end tags must always be present, so the explicit labels are unnecessary.
Here’s one possibility, which I’ll call New Markup Language because the abbreviation “NML” amuses me. It consists of a tree of tagged elements, where each element has the form <tag content>. I’ll give a brief example of how HTML looks in its current form and how it would look if it were done in NML.
First, in XML-style markup:
<html>
  <head>
    <title>New Markup Language</title>
  </head>
  <body>
    <p>A lot of <abbr title="Extensible Markup Language">XML</abbr>'s cruft can be eliminated without sacrificing expressiveness.</p>
    <ul>
      <li>Get rid of attributes</li>
      <li>Get rid of entities</li>
      <li>Replace &lt;tag&gt;...&lt;/tag&gt; with &lt;tag ...&gt;</li>
    </ul>
  </body>
</html>
And now in an NML style:
<html
  <head <title New Markup Language>>
  <body
    <p A lot of <abbr <title Extensible Markup Language> XML>'s cruft can be eliminated without sacrificing expressiveness.>
    <ul
      <li Get rid of attributes>
      <li Get rid of entities>
      <li Replace <lt>tag<gt>...<lt>/tag<gt> with <lt>tag ...<gt>>
    >
  >
>
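To make the mapping between the two examples concrete, here is a rough Python sketch of a converter from an XML tree to NML text. The function names (`escape_text`, `nml_element`, `to_nml`) are made up for illustration, and the rules it follows are only what the example above suggests: attributes become leading sub-elements, and literal angle brackets become <lt> and <gt> elements. How whitespace should be handled is an open question, so the spacing here is an assumption.

```python
import xml.etree.ElementTree as ET


def escape_text(text):
    """Replace literal angle brackets with <lt> and <gt> elements.

    str.translate maps each character exactly once, so the brackets
    inside the inserted "<lt>" and "<gt>" are not escaped again.
    """
    return text.translate({ord("<"): "<lt>", ord(">"): "<gt>"})


def nml_element(tag, content=""):
    """Build one NML element; an empty element is written as <tag>."""
    return f"<{tag} {content}>" if content else f"<{tag}>"


def to_nml(elem):
    """Convert an ElementTree element (and its subtree) to NML text."""
    # Assumption: each XML attribute becomes a leading sub-element.
    pieces = [nml_element(name, escape_text(value))
              for name, value in elem.attrib.items()]

    # Text and child elements are emitted in document order.
    body = escape_text(elem.text or "")
    for child in elem:
        body += to_nml(child) + escape_text(child.tail or "")
    if body:
        pieces.append(body)

    return nml_element(elem.tag, " ".join(pieces))


if __name__ == "__main__":
    xml = ('<p>A lot of <abbr title="Extensible Markup Language">XML</abbr>'
           "'s cruft</p>")
    print(to_nml(ET.fromstring(xml)))
```

Run on that shortened paragraph, the sketch prints `<p A lot of <abbr <title Extensible Markup Language> XML>'s cruft>`, which matches the corresponding part of the NML example.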
Obviously, more work would need to go into this before it could replace XML (which is unlikely to ever happen). In particular, there are issues around whitespace normalization and language tagging that need some thought. On the other hand, it’s much easier to parse NML than XML, because it only has one construction and two special characters.
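As a rough illustration of how little machinery parsing takes, here is a minimal recursive-descent parser sketch in Python. It is not a reference implementation: the names `parse_nml` and `_parse_element` are hypothetical, the tree representation (a `(tag, children)` tuple) is my choice, and it assumes that an element name ends at the first whitespace character or angle bracket and that exactly one whitespace character separates the name from the content.

```python
def parse_nml(source):
    """Parse a single NML element and return it as (tag, children).

    Each child is either a text string or a nested (tag, children) tuple.
    """
    source = source.strip()
    node, pos = _parse_element(source, 0)
    if source[pos:].strip():
        raise ValueError(f"unexpected text after root element at offset {pos}")
    return node


def _parse_element(source, pos):
    """Parse the element starting at source[pos]; return (node, next_pos)."""
    if pos >= len(source) or source[pos] != "<":
        raise ValueError(f"expected '<' at offset {pos}")
    pos += 1

    # Assumption: the name runs up to whitespace or an angle bracket.
    start = pos
    while pos < len(source) and source[pos] not in "<> \t\r\n":
        pos += 1
    tag = source[start:pos]
    if not tag:
        raise ValueError(f"missing element name at offset {start}")

    # Assumption: one whitespace character separates name and content.
    if pos < len(source) and source[pos].isspace():
        pos += 1

    children = []
    text_start = pos
    while pos < len(source):
        ch = source[pos]
        if ch == "<":  # a nested element begins
            if pos > text_start:
                children.append(source[text_start:pos])
            child, pos = _parse_element(source, pos)
            children.append(child)
            text_start = pos
        elif ch == ">":  # the current element ends
            if pos > text_start:
                children.append(source[text_start:pos])
            return (tag, children), pos + 1
        else:
            pos += 1
    raise ValueError(f"element <{tag} is never closed")


if __name__ == "__main__":
    print(parse_nml("<p A lot of <abbr <title Extensible Markup Language> XML>'s cruft>"))
```

The whole parser is one loop that branches on the two special characters. Everything else is text.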
Furthermore, a well-formed NML document is guaranteed to contain the same number of “<” and “>” characters, and it’s pretty hard to accidentally write a malformed document. Where one occasionally saw things like <i>italics <b>and</i> bold</b> (malformed SGML) in the early days of HTML, NML’s syntax does not lend itself to such confusion. (That fragment would be correctly written <i>italics <b>and</b></i> <b>bold</b> in SGML and <i italics <b and>> <b bold> in NML.)
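The bracket-counting property suggests a very cheap sanity check. The sketch below (with the hypothetical name `is_balanced`) goes slightly further than comparing totals: equal counts are necessary but not sufficient, since a string like `><` has one of each, so it also checks that no “>” appears before its matching “<”. It only checks nesting. It does not check that every element has a name or that the document has a single root.

```python
def is_balanced(source):
    """Return True if every '<' in source is closed by a later '>'."""
    depth = 0
    for ch in source:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:  # a '>' with no open element to close
                return False
    return depth == 0  # every opened element was closed


if __name__ == "__main__":
    print(is_balanced("<i italics <b and>> <b bold>"))
    print(is_balanced("<i italics <b and> bold"))
```

The first call prints `True` and the second prints `False`, because the outer <i element is never closed.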
Since NML has no entities, we’ll use special element names for character references.
An element whose name is entirely digits (e.g., <65> instead of &#65;) represents the
Unicode character with that decimal code point. Those which begin with “#” indicate
hexadecimal Unicode code points (e.g.,
<#41> instead of