It’s dangerous to go alone

Alternative angle-brackets

By Dave Menendez
Monday, October 6, 2003, at 12:14 AM

Summary: A thought experiment about a simplified markup language like XML, but without the need for backward compatibility with SGML. NML has no attributes or entities, uses single-character end tags, but can still express pretty much everything XML can.

I’ve been thinking about alternatives to XML and SGML for marking up text. One might ask, why? Well, why not? XML is popular and has a lot of tool support, but there is always room for improvement. I’ve been thinking in terms of simplification: XML (and SGML) allow elements to have sub-elements and attributes, but one could easily get by with sub-elements alone. Similarly, XML requires that both the beginning and the end of an element be tagged with the element’s type, but that’s a legacy of SGML, where some elements could omit their end tags if those tags could be inferred from context. In XML, end tags must always be present, so the explicit labels are unnecessary.

Here’s one possibility, which I’ll call New Markup Language because the abbreviation “NML” amuses me. It consists of a tree of tagged elements, where each element has the form <tag content>. I’ll give a brief example of how HTML looks in its current form and how it would look if it were done in NML.

First, in XML-style markup:

    <head><title>New Markup Language</title></head>
    <p>A lot of <abbr title="Extensible Markup Language">XML</abbr>'s
    cruft can be eliminated without sacrificing expressiveness.</p>
      <li>Get rid of attributes</li>
      <li>Get rid of entities</li>
      <li>Replace &lt;tag&gt;...&lt;/tag&gt; with &lt;tag ...&gt;</li>

And now in an NML style:

    <head <title New Markup Language>>
    <p A lot of <abbr <title Extensible Markup Language> XML>'s
    cruft can be eliminated without sacrificing expressiveness.>
      <li Get rid of attributes>
      <li Get rid of entities>
      <li Replace <lt>tag<gt>...<lt>/tag<gt> with <lt>tag ...<gt>>

Obviously, more work would need to go into this before it could replace XML (which is unlikely to ever happen). In particular, there are issues around whitespace normalization and language tagging that need some thought. On the other hand, it’s much easier to parse NML than XML, because it only has one construction and two special characters.
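To give a feel for how little machinery the single <tag content> construction needs, here is a minimal recursive-descent sketch in Python. It is only an illustration under assumptions the description above leaves open: the name parse_nml is made up, a tag name is taken to end at the first whitespace character or angle bracket, exactly one whitespace character after the name is treated as a separator, and all other whitespace is kept verbatim (the normalization question is not addressed).

```python
def parse_nml(text):
    """Parse NML into a list of nodes: str for text, (tag, children) for elements."""
    pos = 0

    def parse_content(top_level):
        nonlocal pos
        nodes = []
        start = pos
        while pos < len(text):
            ch = text[pos]
            if ch == "<":
                if pos > start:
                    nodes.append(text[start:pos])  # text before the element
                nodes.append(parse_element())
                start = pos
            elif ch == ">":
                if top_level:
                    raise ValueError(f"unmatched '>' at offset {pos}")
                break  # the enclosing element consumes this '>'
            else:
                pos += 1
        if pos > start:
            nodes.append(text[start:pos])  # trailing text
        return nodes

    def parse_element():
        nonlocal pos
        open_at = pos
        pos += 1  # skip '<'
        name_start = pos
        while pos < len(text) and text[pos] not in "<>" and not text[pos].isspace():
            pos += 1
        tag = text[name_start:pos]
        if not tag:
            raise ValueError(f"missing tag name at offset {open_at}")
        # Assumption: one whitespace character separates the tag from its content.
        if pos < len(text) and text[pos].isspace():
            pos += 1
        children = parse_content(top_level=False)
        if pos >= len(text):
            raise ValueError(f"element <{tag} opened at offset {open_at} is never closed")
        pos += 1  # skip '>'
        return (tag, children)

    return parse_content(top_level=True)
```

Under these assumptions, parse_nml("<abbr <title Extensible Markup Language> XML>") yields a single "abbr" element whose children are a "title" element and the text " XML", and an empty element such as <lt> comes back as ("lt", []).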

Furthermore, a well-formed NML document is guaranteed to contain the same number of “<” and “>” characters, and it’s pretty hard to accidentally write a malformed document. Where one occasionally saw things like <i>italics <b>and</i> bold</b> (malformed SGML) in the early days of HTML, NML’s syntax does not lend itself to such confusion. (That fragment would be correctly written <i>italics <b>and</b></i> <b>bold</b> in SGML and <i italics <b and>> <b bold> in NML.)
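The bracket-counting property can be checked with a few lines of Python. This is a rough sketch, and the function name brackets_balanced is invented for illustration. Note that equal counts alone are necessary but not sufficient (the string "> <" has one of each), so the sketch also requires that no “>” appears before its matching “<”. It does not check that every “<” is followed by a tag name.

```python
def brackets_balanced(text):
    """Return True if every '<' has a later matching '>' and vice versa."""
    depth = 0
    for ch in text:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:
                return False  # a '>' with no element open
    return depth == 0  # anything left open means a missing '>'
```

The NML fragment <i italics <b and>> <b bold> passes this check, while dropping any one of its closing brackets makes it fail.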

Since NML has no entities, we’ll use special element names for character references. An element whose name is entirely digits (e.g., <65> instead of &#65;) represents the Unicode character with that decimal code point. An element whose name begins with “#” gives the code point in hexadecimal instead (e.g., <#41> instead of &#x41;).
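As a sketch of how a processor might resolve these names, the following Python function maps a tag name to the character it stands for. The name decode_char_ref is hypothetical, and returning None for any non-numeric name (so that ordinary elements such as <lt> are handled elsewhere) is an assumption, not something the scheme above specifies.

```python
import string


def decode_char_ref(tag):
    """Return the character a numeric tag name stands for, or None otherwise."""
    if tag.startswith("#"):
        digits = tag[1:]
        # "#" followed by hexadecimal digits: a hexadecimal code point.
        if digits and all(c in string.hexdigits for c in digits):
            return chr(int(digits, 16))
        return None
    # ASCII digits only: a decimal code point.
    if tag.isascii() and tag.isdigit():
        return chr(int(tag))
    return None  # an ordinary element name, not a character reference
```

With this sketch, both "65" and "#41" decode to "A". Code points beyond the Unicode range make chr raise ValueError, which a real processor would probably want to report as a malformed reference.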