Notes on XHTML

It is not worth an intelligent man's time to be in the majority.
By definition, there are already enough people to do that.

G. H. Hardy

Background

Many years ago, around the time of HTML version 2, I was an HTML purist, argued endlessly against putting presentation markup into the pure structural layout of HTML, and got really upset about how HTML was being transformed into a very poor markup language.

Then I went on to other things and stopped giving it a lot of thought. I also stopped updating my web pages for quite a while, apart from moving them around.

When I started updating my pages again, writing them in a macro language of my own devising and generating HTML from that with a program, one of my goals was to do something I'd always wanted to do but had never found a good way to do: generate HTML that was free of any presentational content and strictly conformant with the current HTML standard. The advent of widespread support for CSS finally made that possible without compromising the appearance of the pages on most browsers.

These are notes about my experiences.

As explained in more depth below, I wouldn't necessarily recommend this approach to other people right now, particularly if maximum accessibility of your pages is important. This was intentionally an experiment and to some degree a political statement, which I can make since these are just my personal pages and I don't really care if badly non-compliant browsers have trouble with them.

I'm also not completely done with this process. I still use tags like <strong> and even <i> far too much, and sometimes use classes like "center" rather than figuring out what page element that paragraph truly is.

XHTML Versus HTML

When I went to find the most recent W3C markup recommendation, I found that it was XHTML 1.0 (1.1 had not been finalized yet), so I started targeting that. This turned out to be a more controversial and interesting decision than I expected.

XHTML 1.0 is essentially identical to HTML 4.01, except reformulated as an XML application rather than an SGML application. I really like some of the cleanliness requirements that this brings with it (lowercase tag names, which I've always preferred anyway, required quoting of attributes, and required closing tags), so I was happy to use that. However, part of the purpose of XHTML was also to allow one to use an XML parser rather than the ad-hoc HTML parsers that have been put into browsers over the years, and it turns out that this isn't really possible when XHTML is being served as the text/html content type.

The W3C was trying to move towards application/xhtml+xml as the MIME content type for XHTML, since this clearly indicates that it is XML and since content served out under that content type can be assumed to be strictly validating XML. So much random junk from different versions of HTML and even broken pages that happen to render in popular browsers has been served out as text/html that it's almost impossible for a browser to use a strict parser for it. Because of this, serving XHTML as text/html somewhat defeats the point of XHTML, since it gets thrown into the same ad-hoc HTML parser that all the rest of the HTML is thrown into, and may frequently be subject to various quirky rendering and parsing decisions designed to maintain bug-compatibility with popular browsers.
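To make the difference between a strict XML parser and a forgiving HTML parser concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not a model of any browser: Python's html.parser is merely lenient, not a real tag-soup parser, and the sample markup and helper names (TagCollector, parses_as_xml, tags_seen_by_html_parser) are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Lenient HTML parser that just records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def tags_seen_by_html_parser(markup):
    """Return the start tags a forgiving HTML parser recovers."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# Hypothetical sample: unquoted attribute, unclosed <p> and <br>.
SLOPPY = "<div class=note><p>First<br>Second</div>"
# The same content written to XHTML rules.
CLEAN = '<div class="note"><p>First<br />Second</p></div>'
```

With these definitions, parses_as_xml(SLOPPY) is False and parses_as_xml(CLEAN) is True, while tags_seen_by_html_parser returns the same three tags (div, p, br) for both. That is the trade-off in miniature: the lenient parser shrugs at the sloppy version, so nothing forces the markup to be clean.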

Unfortunately, it's premature at this point to actually serve web content as application/xhtml+xml. A handful of the newest browsers support it, but Internet Explorer prior to version 9 doesn't, so doing so will make pages unviewable by the majority of web users. It's not clear to me that this will ever take off.
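One approach that is sometimes suggested for this situation, and not something these pages do, is to choose the content type per request based on the Accept header the client sends. The following is a minimal Python sketch of that idea. The function name choose_content_type is hypothetical, the parsing is deliberately simplified, and wildcards such as */* are treated as "does not claim XHTML support".

```python
def choose_content_type(accept_header):
    """Pick a Content-Type for an XHTML page from a request's Accept header.

    Simplified: returns application/xhtml+xml only when the client lists
    that type explicitly with a non-zero q-value. Wildcards like */* are
    ignored, so such clients fall back to text/html.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if media_type != "application/xhtml+xml":
            continue
        quality = 1.0  # default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unparseable q-value: treat as refusal
        if quality > 0:
            return "application/xhtml+xml"
    return "text/html"
```

A client that sends "text/html,application/xhtml+xml,*/*;q=0.8" would get application/xhtml+xml, while one that sends only "*/*" would get text/html. Whether this is worth the extra complexity is a separate question, since the same document then has to behave correctly under both parsers.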

I'm still using XHTML 1.0, in part because I like future-proofing my work and making it easier on myself to switch to XHTML 1.1 with the proper content type when that becomes feasible. But I'm also just doing this for my personal pages, where I can choose not to care about very old browsers that are confused by things like the <br /> that XHTML requires in place of <br>, and where I can do things just because I feel like it and not because there are any significant positive benefits. For most web page design served as text/html content, it probably still makes more sense to use HTML 4.01 (strict, using style sheets for presentation) than to use XHTML 1.0 and drop support for old browsers.

Microsoft Internet Explorer and XML Directives

When I finished my first rewrite of all of my pages, I checked them in Internet Explorer for MacOS X, Opera, and Mozilla 1.0, and they looked fine in all of those browsers. Shortly thereafter, however, I learned that Internet Explorer 6.0 for the PC pretty-printed the XHTML source of my pages rather than actually rendering them. This caused a great deal of puzzlement.

After a few experiments, I found that it would render the pages properly if the initial <?xml?> directive were removed. I wanted to keep that directive, though, both because it's recommended and because I sometimes use ISO 8859-1 characters on my web pages. What made this even more amusing was that when I included an <?xml-stylesheet?> directive, rather than pretty-printing the source, IE rendered the page as an unreadable mash of content without any links.

This happened on both IE5.5 and IE6 on the PC.
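As an aside on why the <?xml?> directive matters for ISO 8859-1 content: when an XML parser has no other encoding information, it assumes UTF-8 unless the declaration says otherwise. The Python sketch below shows that with the standard library's XML parser. The one-element document and the parse_text helper are hypothetical, and this says nothing about how a browser treats text/html, where the encoding can also come from the HTTP headers.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing e-acute encoded as ISO 8859-1.
BODY = b"<p>caf\xe9</p>"
DECLARATION = b'<?xml version="1.0" encoding="iso-8859-1"?>\n'


def parse_text(document_bytes):
    """Return the root element's text, or None if the XML parser rejects it."""
    try:
        return ET.fromstring(document_bytes).text
    except ET.ParseError:
        return None


# With the declaration, the byte 0xE9 is decoded as an ISO 8859-1 character.
with_declaration = parse_text(DECLARATION + BODY)
# Without it, the parser assumes UTF-8 and the lone 0xE9 byte is invalid.
without_declaration = parse_text(BODY)
```

Here with_declaration holds the decoded text and without_declaration is None, because the same bytes are not well-formed XML under the default encoding.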

I was finally about to give up and redo my pages so as to not use XML directives, but as a last resort I posted to comp.infosystems.www.authoring.html and asked whether anyone had any ideas. Alan Flavell pointed me at the right solution.

I had some comments at the beginning of my pages giving their last generation time, version, and the like. If there is an XML directive and "too much" between the beginning of the page and the <html> tag, IE decides that the page isn't actually HTML (it ignores the content type entirely) but is really XML and treats it like pure XML.

Boggle.

This has got to be one of the stupidest bugs that I've ever seen. I've since moved all of the comments below the <head> section of the page, and now IE users can read my pages without difficulty. Hopefully, if someone else has this same problem, they will find this page and not have to spend the weeks that I did trying to figure this out.

(Zvezdan Petkovic notes that after some experimentation with IE 6.0, he found one could put at most four lines between the <html> tag and the top of the page.)
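A generated page can be checked for this condition before it is published. The following Python sketch is a heuristic only: the four-line threshold comes from the note above, exactly how IE counted "too much" is not known from anything here, and the function names and the constant MAX_LINES_BEFORE_HTML are invented for the example.

```python
import re

# Assumption taken from the note above: about four lines is the most that
# IE 6.0 tolerated before <html>. Treat it as a rule of thumb, not a spec.
MAX_LINES_BEFORE_HTML = 4


def lines_before_html(page):
    """Count the complete lines that precede the <html> start tag.

    Returns None if no <html> tag is found.
    """
    match = re.search(r"<html[\s>]", page, re.IGNORECASE)
    if match is None:
        return None
    return page[:match.start()].count("\n")


def may_trigger_ie_xml_mode(page):
    """Heuristic: an XML declaration is present and too much precedes <html>."""
    if not page.lstrip().startswith("<?xml"):
        return False
    count = lines_before_html(page)
    return count is not None and count > MAX_LINES_BEFORE_HTML
```

A page with an XML declaration, a DOCTYPE, and then <html> passes the check. The same page with five comment lines inserted before <html> is flagged. Pages without an XML declaration are never flagged, matching the observation that removing the directive made IE render the page normally.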

XHTML 1.0 Strict and Numbered Paragraphs

For the most part, I was able to convert all of my pages to XHTML 1.0 Strict, which is as it should be since I put all presentation markup in the style sheets. Frustratingly, though, there's one exception where I've had to use XHTML 1.0 Transitional.

I maintain one FAQ on the Big Eight newsgroup creation process that consists of numbered paragraphs specifying the rules that are followed as part of the process. These numbers are part of the content of that FAQ; the various points are referred to by numbers. There are several sections, but the numbers start at one and increase throughout the document without regard to section.

Obviously, this is an ordered list, but since an ordered list cannot contain a heading, the document consists of several ordered lists. Since each ordered list would normally start at one, the number of each paragraph has to be specified.

There is no way to do this as part of the content of an XHTML 1.0 Strict document.

I think this is a ridiculous oversight. For some reason, the value attribute of the <li> tag has been considered presentation and omitted from the Strict formulation of XHTML 1.0. This is absurd. First, as in this case, the paragraph numbers are part of the content of the document, not just a presentation detail. They are used for external references to the document. Second, there is no usable replacement. One can specify complex counters using CSS Level 2 that might, with some work, duplicate the document appearance that I want, but I've yet to find a major browser that supports this section of CSS Level 2. And it makes no sense to have to use so obscure and complex a style sheet feature just to number paragraphs correctly.

I can fake numbered paragraphs by putting the number at the beginning of each paragraph in brackets or the like, but this also isn't a good solution. The <ol> tag, being a standard and commonly-used HTML tag, is something that the browser already knows how to render in an appropriate manner for the output device and to present in the most readable and most preferred form for the user. Trying to duplicate this is a waste of effort.

So I'm still using the value attribute to <li> and that page has to be marked XHTML 1.0 Transitional. Sigh. (This problem was fixed in XHTML 2.0, but that was never released. It does look like it will be fixed in XHTML 5.0, which is the XML version of HTML 5.)
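For a page generated by a program, emitting the value attribute is straightforward. The Python sketch below shows one way a generator could produce several ordered lists whose numbering continues across section headings. It is not the macro language or generator used for these pages; the function name render_numbered_sections, the heading_tag parameter, and the sample section titles are all hypothetical.

```python
from xml.sax.saxutils import escape


def render_numbered_sections(sections, heading_tag="h2"):
    """Render (heading, [paragraphs]) pairs as XHTML with continuous numbering.

    Each section gets its own <ol>, and every <li> carries an explicit
    value attribute so the numbers keep increasing across sections.
    """
    lines = []
    number = 1
    for heading, items in sections:
        lines.append(f"<{heading_tag}>{escape(heading)}</{heading_tag}>")
        lines.append("<ol>")
        for text in items:
            lines.append(f'  <li value="{number}">{escape(text)}</li>')
            number += 1
        lines.append("</ol>")
    return "\n".join(lines)


# Hypothetical sample content, standing in for the real FAQ text.
SAMPLE = render_numbered_sections([
    ("Proposal", ["First rule.", "Second rule."]),
    ("Voting", ["Third rule."]),
])
```

The second list in SAMPLE starts with <li value="3">, so external references to paragraph 3 stay correct regardless of how the sections are split. Because of the value attribute, the output is only valid as XHTML 1.0 Transitional, which is exactly the compromise described above.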

References

XHTML 1.0 Specification

The official XHTML 1.0 specification from W3C. Note that you will also need to refer to the HTML 4.01 standard; this document only explains the differences between HTML 4.01 and XHTML 1.0 and provides the formal grammar.

HTML 4.01 Specification

The current HTML standard, and still the meat of the XHTML specifications as well, as both XHTML 1.0 and XHTML 1.1 refer to this document for the meanings of all of the tags. This is probably still the best choice of language for the average web page (particularly the strict variant in combination with CSS).

CSS Level 1 Specification

The specification for level 1 of Cascading Style Sheets. At this point, nearly all of level 1 has been implemented by the major browsers (although Netscape 4.7 has a lot of flaws in its implementation), so one can treat it as the safe portion of CSS. Mostly. There is also a separate level 2 specification that has quite a bit of additional power, including the ability to insert content into the page from the style sheet, which is probably safe to use at this point (2011), but I've not ventured into it yet.

W3C Note on XHTML Media Types

The (rather well hidden) document that talks about text/html and application/xhtml+xml and lays out the intended future of content types for XHTML web pages. I wish this information had been in the XHTML 1.0 standard.

Last spun 2013-07-01 from thread modified 2013-01-04