ZedneWeb / Profile for web threads

An HTML profile for web threading

This document describes a coding convention for HTML documents which embeds much of the threading information described by the Thread Description Language without invalidating the markup or relying on comments.

Current version
<http://www.eyrie.org/~zednenem/2002/wtprofile/>
This version
<http://www.eyrie.org/~zednenem/2002/wtprofile/20020831.html>

Table of contents

  1. Introduction
  2. Profile declaration
  3. Posts
    1. Determining the address of an archived post
    2. Determining the address of a mirrored post
    3. Statements indicated by a post
    4. Post sequences
  4. Links
    1. The reference link types
    2. Implicit types
    3. Special handling for the link type “recommendation”
  5. Other metadata
    1. The “meta” and “title” elements
    2. The “span” and heading elements
    3. The “blockquote” and “q” elements
  6. Appendix: Applying this profile to your weblog
    1. Going further
  7. Change history

Introduction

An HTML profile allows software to deduce information from web pages, rather than requiring a separate resource for metadata. This particular profile is designed to embed information about threaded discussions by specifying how to indicate the identity and content of posts, and by providing methods to indicate metadata about the posts.

Agents are free to represent this information in whatever manner is convenient, but this document will describe the information in terms of the Thread Description Language, an RDF vocabulary designed to represent information about threaded discussions. Readers unfamiliar with RDF will probably want to start with the Appendix, which shows how this profile can be used to mark up a weblog.

Because RDF is structured around absolute URI references, addresses are very important to the definitions. Two addresses which will be seen frequently are the page address and the post address. The page address is simply the address of the HTML being interpreted. More specifically, it is the address which will be used for interpreting relative URI references.

Certain parts of an HTML document will be marked as posts. The post address is the address of that post, and the section on posts tells how to derive it.

An RDF statement comprises three parts: a subject, a property, and an object. The subject and property are identified by URI references, and the object may be either a URI reference or a literal value, such as a string.

Lastly, a URI reference is a URI which may or may not contain a fragment identifier. “http://example.org/” and “http://example.org/#fragid” are both examples of URI references.

Profile declaration

Pages which conform to this profile must set the profile attribute of the head element to “http://www.eyrie.org/~zednenem/2002/wtprofile/”. Doing so indicates to processing agents that the page follows the conventions described here.

<head profile="http://www.eyrie.org/~zednenem/2002/wtprofile/">
  headers
</head>

Note that the profile attribute is defined as a space-separated list of URIs, although HTML does not define what multiple profile declarations mean.

Note also that the use of a URI to identify this profile does not imply that this page should be accessed at any point during processing; it is simply being used as a unique identifier.
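
As a rough illustration of how an agent might test for this declaration, the following Python sketch uses the standard library's html.parser. The names ProfileFinder and declares_profile are hypothetical, and treating the profile as present when it appears anywhere in the space-separated list is an assumption, since HTML does not define what multiple profiles mean.

```python
from html.parser import HTMLParser

# Profile URI taken from this document; it is used only as an identifier
# and is never fetched.
PROFILE = "http://www.eyrie.org/~zednenem/2002/wtprofile/"


class ProfileFinder(HTMLParser):
    """Record the profile URIs declared on the first head element."""

    def __init__(self):
        super().__init__()
        self.profiles = None  # stays None until a head element is seen

    def handle_starttag(self, tag, attrs):
        if tag == "head" and self.profiles is None:
            value = dict(attrs).get("profile") or ""
            # The attribute is a space-separated list of URIs.
            self.profiles = value.split()


def declares_profile(html):
    """Return True if the page's head element lists this profile."""
    finder = ProfileFinder()
    finder.feed(html)
    finder.close()
    return PROFILE in (finder.profiles or [])
```

A processor built along these lines would simply skip any page for which declares_profile returns False.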

Posts

The convention for indicating posts must perform two tasks. First, it must associate a URI reference with each post. This should be done in such a way that a user agent can dereference the URI reference and display the post. This means that posts should be given addresses based on the address of the page where they are located (the archive page). If there is more than one post per archive page, then all or all but one must use fragment identifiers to distinguish themselves. Thus, if “archive” is a page containing posts, then no more than one post may have the address “archive” and the remainder will have addresses in the form “archive#fragment_id”, where fragment_id is unique to each post.

The second requirement for this convention is the ability to distinguish the content of a post from other page material, such as site-wide navigation and branding and other posts. The simplest way to accomplish this is to put the entire post inside some element. The logical choice here is “div”, because it has no inherent display properties and it can contain block-level content. The use of “div” also provides a simple way to associate posts with their URI.

A post is identified by a “div” element which has the class “post” or “mirrored-post”. The class “post” is used to indicate posts which are located in their archive page. In that case, the address of the post can be derived from the address of the page containing the post. The class “mirrored-post” is used to indicate posts which are permanently located on another page. For example, the main page of a weblog usually contains copies of the most recent posts. For mirrored posts, the address of the post cannot be derived from the page address.

Posts cannot contain other posts. If a parser encounters a “div” element with the class “post” or “mirrored-post” within a post, it should halt processing, return no data, and report an error.

Determining the address of an archived post

The address of a post in its archive depends on the address of the archive. Either it will be the same as the archive’s, or it will be the archive’s plus a fragment identifier.

The first case is indicated by a “div” element which has the class “post” and no “id” attribute:

<div class="post">
  content
</div>

A page may contain no more than one post which has the same address as its archive.

The second case is indicated by a “div” element which has the class “post” and an “id” attribute:

<div class="post" id="fragment_id">
  content
</div>

A page may contain no more than one post which has a given fragment_id. Note that in this case, the address of the post is also the address of the “div” element, thanks to the “id” attribute. This means that user agents can follow links to posts declared in this way and present them appropriately.
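
The rules for archived posts can be sketched in Python as follows. This is only an illustration under some assumptions: the names ArchivedPostFinder, NestedPostError, and archived_post_addresses are hypothetical, the page address is assumed to carry no fragment of its own, and the input is assumed to have balanced “div” tags.

```python
from html.parser import HTMLParser


class NestedPostError(Exception):
    """Raised when a post is found inside another post."""


class ArchivedPostFinder(HTMLParser):
    """Collect the addresses of div.post elements on one archive page."""

    def __init__(self, page_address):
        super().__init__()
        self.page_address = page_address
        self.addresses = []
        self.div_depth = 0      # number of currently open div elements
        self.post_depth = None  # div depth at which the open post started

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.div_depth += 1
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "post" in classes or "mirrored-post" in classes:
            if self.post_depth is not None:
                # Posts cannot contain other posts: halt and report.
                raise NestedPostError("post inside a post")
            self.post_depth = self.div_depth
            if "post" in classes:
                frag = attrs.get("id")
                address = self.page_address + ("#" + frag if frag else "")
                if address in self.addresses:
                    raise ValueError("duplicate post address: " + address)
                self.addresses.append(address)

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.post_depth == self.div_depth:
            self.post_depth = None  # the open post has just been closed
        self.div_depth -= 1


def archived_post_addresses(page_address, html):
    """Return the post addresses declared by div.post elements, in page order."""
    finder = ArchivedPostFinder(page_address)
    finder.feed(html)
    finder.close()
    return finder.addresses
```

Raising an exception is one way to satisfy the requirement that a parser halt, return no data, and report an error; mirrored posts are recognized here only so that nesting can be detected.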

Determining the address of a mirrored post

The address of a mirrored post cannot be derived from the address of the page which contains it. Thus, we will use a link with a special relation to indicate the address of the post.

The post itself is indicated by a “div” element with the class “mirrored-post”:

<div class="mirrored-post">
  content
</div>

Unlike archived posts, mirrored posts are interpreted the same whether or not the “div” element has an “id” attribute.

A mirrored post must contain at least one link with the “archive” relation. The destination of an archive link indicates the address of the post which contains it. The archive link is an “a” element with a “rel” attribute set to “archive”:

<a rel="archive" href="post_address"></a>

This sort of link is often referred to as a “permalink”. Note that archive links are indicated by the relation type, not their link text or location in the post. Thus, any weblog which includes permalinks may mark them as archive links.

If a post contains multiple archive links, they must all point to the same address. Similarly, if an archived post contains an archive link, it must point to the post’s address.
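
A processor might recover mirrored post addresses along the following lines. This is a sketch rather than a reference implementation: MirroredPostFinder and mirrored_post_addresses are hypothetical names, relative “href” values are assumed to resolve against the page address, link types are compared without regard to case, and the check for nested posts is left out for brevity.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MirroredPostFinder(HTMLParser):
    """Collect the address of each div.mirrored-post from its archive links."""

    def __init__(self, page_address):
        super().__init__()
        self.page_address = page_address
        self.addresses = []     # one address per mirrored post, in page order
        self.div_depth = 0      # number of currently open div elements
        self.post_depth = None  # div depth at which the open post started
        self.links = None       # archive link targets seen in the open post

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.div_depth += 1
            classes = (attrs.get("class") or "").split()
            if "mirrored-post" in classes and self.post_depth is None:
                self.post_depth = self.div_depth
                self.links = set()
        elif tag == "a" and self.links is not None:
            rels = (attrs.get("rel") or "").lower().split()
            if "archive" in rels and attrs.get("href"):
                self.links.add(urljoin(self.page_address, attrs["href"]))

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.post_depth == self.div_depth:
            # At least one archive link is required, and all must agree.
            if len(self.links) != 1:
                raise ValueError(
                    "expected exactly one archive address, got %d" % len(self.links)
                )
            self.addresses.append(self.links.pop())
            self.post_depth = None
            self.links = None
        self.div_depth -= 1


def mirrored_post_addresses(page_address, html):
    """Return the archive address of each mirrored post on the page."""
    finder = MirroredPostFinder(page_address)
    finder.feed(html)
    finder.close()
    return finder.addresses
```

Note that the “id” attribute of the div plays no part here, which matches the rule that mirrored posts are interpreted the same with or without it.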

Statements indicated by a post

The existence of a post asserts four RDF statements. Given a post with the address post which is located in an archive with the address archive, we can assert that post has the type tdl:Post, that archive has the type tdl:Archive, that the value of tdl:inArchive for post is archive, and that a value of tdl:hasPost for archive is post.

[Graph illustrating those statements]

These statements are true even when the post’s address is the same as its archive’s address. The resulting graph will look slightly different, but it is generated by the same rules.

[Graph illustrating a Post/Archive]
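
The four statements can be written out as simple triples. In the sketch below, post_statements is a hypothetical helper, property and class names are kept as prefixed strings rather than expanded to full URIs, and the default of deriving the archive address by dropping the post's fragment identifier is an assumption based on the addressing scheme described earlier.

```python
from urllib.parse import urldefrag


def post_statements(post, archive=None):
    """Return the four (subject, property, object) triples implied by a post."""
    if archive is None:
        # Assumption: the archive is the post address without its fragment.
        archive = urldefrag(post).url
    return [
        (post, "rdf:type", "tdl:Post"),
        (archive, "rdf:type", "tdl:Archive"),
        (post, "tdl:inArchive", archive),
        (archive, "tdl:hasPost", post),
    ]
```

When the post and archive addresses are equal, the same four triples are produced; the resource is simply both a tdl:Post and a tdl:Archive and points at itself.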

Post sequences

In many circumstances, posts will have an intrinsic order, which is represented using the tdl:next and tdl:prev properties. These properties can always be indicated with link types, but this is inconvenient when multiple posts occur on the same page. Enclosing posts in “div” elements with the class “post-sequence” or “reverse-post-sequence” allows the intrinsic order of elements on the page to imply the sequencing of posts.

Each post contained in a “div” element with the class “post-sequence” follows the post which occurred earlier in the div. No implication is made about what precedes the first post or follows the last post in the sequence. Applications are free to represent the sequence using tdl:next or tdl:prev or both.

Each post contained in a “div” element with the class “reverse-post-sequence” precedes the post which occurred earlier in the div. No implication is made about what follows the first post or precedes the last post in the sequence. Applications are free to represent the sequence using tdl:next or tdl:prev or both.

Post sequences may be nested. Where a nested sequence has the same direction as the outer sequence, processors can act as though the nested sequence were not there. Where the nested sequence has the opposite direction from the outer sequence, the items in the nested sequence are inserted into the outer sequence in the opposite order.

In an attempt to illustrate this, let Pn be a post, S(…) be a sequence of posts, and RS(…) be a reversed sequence of posts. In these examples, the posts are numbered in their intrinsic order, but are given in their presentation order.

  1. S( P1 P2 P3 )
  2. RS( P3 P2 P1 )
  3. S( S( P1 P2 ) P3 S( P4 P5 ) )
  4. RS( RS( P5 P4 ) P3 RS( P2 P1 ) )
  5. S( RS( P2 P1 ) P3 RS( P5 P4 ) )
  6. RS( S( P4 P5 ) P3 S( P1 P2 ) )
  7. S( P1 RS( P4 S( P2 P3 ) ) P5 )

Processors are, of course, free to ignore information that is unnecessary for their purposes or which their programmers feel is too complicated to deal with. Highly nested sequences such as the seventh example are probably best avoided.
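
The nesting rule can be checked against the seven examples with a short recursive function. This Python sketch models sequences as plain tuples instead of parsed HTML; S, RS, intrinsic_order, and next_pairs are hypothetical names chosen to mirror the notation above.

```python
def S(*items):
    """A "post-sequence" div, with items given in presentation order."""
    return ("S", list(items))


def RS(*items):
    """A "reverse-post-sequence" div, with items given in presentation order."""
    return ("RS", list(items))


def intrinsic_order(node):
    """Flatten a possibly nested sequence into intrinsic (earliest-first) order."""
    if isinstance(node, str):
        return [node]  # a single post
    kind, items = node
    # Put each item into intrinsic order first, then arrange the items.
    parts = [intrinsic_order(item) for item in items]
    if kind == "RS":
        parts.reverse()
    return [post for part in parts for post in part]


def next_pairs(node):
    """Return (post, following post) pairs, suitable for tdl:next."""
    order = intrinsic_order(node)
    return list(zip(order, order[1:]))
```

For the seventh example, intrinsic_order(S("P1", RS("P4", S("P2", "P3")), "P5")) yields the posts P1 through P5 in order. Nothing is produced for what precedes the first post or follows the last, in keeping with the definitions.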

Other metadata

A good deal of information can be extracted from other elements. This section describes information stored in “meta” elements, the “title” element, certain classes of “span”, headings, and the “cite” attribute of the “blockquote” and “q” elements.

The “meta” and “title” elements

The “head” element contains metadata which applies to the document as a whole. Of these, the “title” and “meta” elements indicate information which cannot be represented as a link.

The “name” attribute of a “meta” element indicates a property of the page. The object of this property is given in the “content” attribute. For “meta” elements, the object is always a literal value, even if it resembles a URI.

Meta names
Name     Property     Comment
author   dc:creator
date     dc:date      W3CDTF; see TDL
title    dc:title

If no “meta” element with the name “title” is present, then the content of the “title” element may be used for dc:title.
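
A minimal Python sketch of these rules follows. HeadMetadata and page_metadata are hypothetical names, the statements are returned as (property, literal) pairs whose implied subject is the page address, and “meta” names are matched exactly as written in the table.

```python
from html.parser import HTMLParser

# Mapping taken from the "Meta names" table.
META_PROPERTIES = {"author": "dc:creator", "date": "dc:date", "title": "dc:title"}


class HeadMetadata(HTMLParser):
    """Collect page-level literals from meta elements and the title element."""

    def __init__(self):
        super().__init__()
        self.statements = []  # (property, literal) pairs about the page
        self.in_title = False
        self.title_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = META_PROPERTIES.get(attrs.get("name") or "")
            if prop and attrs.get("content") is not None:
                # The object is always a literal, even if it looks like a URI.
                self.statements.append((prop, attrs["content"]))
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_text.append(data)


def page_metadata(html):
    """Return (property, literal) pairs describing the page."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    statements = list(parser.statements)
    title = "".join(parser.title_text).strip()
    # Fall back to the title element only when no meta title was given.
    if title and not any(prop == "dc:title" for prop, _ in statements):
        statements.append(("dc:title", title))
    return statements
```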

The “span” and heading elements

The information in the “head” element applies to the page; another method is needed to encode literal data for posts which have a URI different from the page address. Certain classes of “span” and the heading elements indicate information which cannot be represented as a link.

A “span” element which has a class equal to one of those in the following table indicates a statement. The subject is the post containing the span, and the property corresponding to the class is given in the table. If the span has a “title” attribute, then the object of the statement is the value of the “title” attribute. Otherwise, the object is the textual content of the span (that is, the content ignoring any markup).

Span classes
Name     Property     Comment
author   dc:creator
date     dc:date      W3CDTF; see TDL
title    dc:title

If a post contains no span with the class “title”, then the first heading element (h1, h2, etc.) in the post is used as the object of dc:title. If a post contains no span with the class “title” and no headings, processors may excerpt the first few words of the post. The use of heading elements rather than spans is encouraged.

Processor behavior is undefined when a post contains multiple spans of the same class. Processors should use at least one of the spans as the value for the appropriate property. Multiple spans with the same class are discouraged.
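
The span and heading rules might be implemented roughly as follows. The sketch assumes it is handed the content of a single post, uses the hypothetical names PostMetadata and post_metadata, collapses runs of whitespace in textual content, and keeps every matching span when a class is repeated (one of several behaviors the profile permits). The optional excerpt fallback is not implemented.

```python
from html.parser import HTMLParser

# Mapping taken from the "Span classes" table.
SPAN_PROPERTIES = {"author": "dc:creator", "date": "dc:date", "title": "dc:title"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PostMetadata(HTMLParser):
    """Collect literals from classed spans and the first heading of a post."""

    def __init__(self):
        super().__init__()
        self.statements = []     # (property, literal) pairs about the post
        self.first_heading = None
        self.open = []           # stack of [tag, property, title, text parts]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            classes = (attrs.get("class") or "").split()
            prop = next(
                (SPAN_PROPERTIES[c] for c in classes if c in SPAN_PROPERTIES), None
            )
            self.open.append(["span", prop, attrs.get("title"), []])
        elif tag in HEADINGS:
            self.open.append([tag, None, None, []])

    def handle_data(self, data):
        # Text counts toward every open span or heading, ignoring markup.
        for entry in self.open:
            entry[3].append(data)

    def handle_endtag(self, tag):
        if not self.open or self.open[-1][0] != tag:
            return
        _, prop, title, parts = self.open.pop()
        text = " ".join("".join(parts).split())
        if tag in HEADINGS:
            if self.first_heading is None:
                self.first_heading = text
        elif prop is not None:
            # The title attribute, when present, overrides the text.
            self.statements.append((prop, title if title is not None else text))


def post_metadata(post_html):
    """Return (property, literal) pairs describing one post."""
    parser = PostMetadata()
    parser.feed(post_html)
    parser.close()
    statements = list(parser.statements)
    has_title = any(prop == "dc:title" for prop, _ in statements)
    if not has_title and parser.first_heading is not None:
        statements.append(("dc:title", parser.first_heading))
    return statements
```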

The “blockquote” and “q” elements

Within a post, “blockquote” and “q” elements which have the “cite” attribute indicate statements. The subject of these statements is the post containing the element, the property is tdl:quotes, and the object is the resource identified by the “cite” attribute.

Processors should first check to make sure the “cite” attribute contains a URI reference and not arbitrary text. “cite” attributes which are not URI references are ignored.
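
The profile does not say how to tell a URI reference from arbitrary text, so the following Python sketch uses a deliberately simple heuristic: a value is accepted only if it is non-empty and consists solely of characters that may appear in a URI. That heuristic, the resolution of relative references against the page address, and the names QuoteFinder and quote_statements are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Heuristic: only characters permitted in URIs (no spaces, quotes, etc.).
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


class QuoteFinder(HTMLParser):
    """Collect cite values from blockquote and q elements."""

    def __init__(self):
        super().__init__()
        self.cites = []

    def handle_starttag(self, tag, attrs):
        if tag in ("blockquote", "q"):
            cite = dict(attrs).get("cite")
            if cite and _URI_CHARS.fullmatch(cite):
                self.cites.append(cite)
            # Anything else is treated as arbitrary text and ignored.


def quote_statements(post_address, page_address, post_html):
    """Return tdl:quotes triples for the quotations in one post."""
    finder = QuoteFinder()
    finder.feed(post_html)
    finder.close()
    return [
        (post_address, "tdl:quotes", urljoin(page_address, cite))
        for cite in finder.cites
    ]
```

A stricter processor could parse the value against the full URI grammar instead; the point here is only that non-URI text such as a book title is dropped rather than turned into a statement.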

Appendix: Applying this profile to your weblog

Basic conformance to this profile enables software to break your weblog (or message board, mail archive, etc.) into individual posts and determine what references you make in individual posts. Currently, all a crawler can say about a link in a weblog is that it came from a certain web page. For posts being mirrored on the weblog’s main page, all that can be said is that the weblog made the link at some point in the past. This profile provides a way to associate links with the post that makes them, and to associate posts with their permanent address even when they are being mirrored in a different location.

Achieving basic conformance requires only minor changes to your current coding practices. If your weblog is generated using templates, all you need to do is modify them to produce the appropriate code (see the example templates for some ideas). If your weblog is hand-coded, the extra effort needed to use this profile is minimal.

First, your weblog’s main page and archive pages must indicate conformance to this profile by setting the “profile” attribute of the head element to “http://www.eyrie.org/~zednenem/2002/wtprofile/”. This will produce code looking like this:

<head profile="http://www.eyrie.org/~zednenem/2002/wtprofile/">
  headers
</head>

Second, identify posts in your archives by enclosing each one in a “div” element which has the class “post”. Make sure that the div includes all of the post’s content, but as little of the surrounding code as possible. This will allow software to distinguish the links being made inside a post from links made in other posts or which are there for site navigation.

The next part depends on how your posts are identified. If the posts in your weblog have addresses which include fragment identifiers, then the div should have an “id” attribute containing the fragment ID. For example, the post “http://example.org/blog/455#p6” would look like this:

<div class="post" id="p6">
  post content
</div>

Note that this eliminates any need for an anchor, such as <a name="p6"></a>, to identify the post. Using div and id is preferable because it makes the address refer to the entire post, rather than a single point preceding the post.

Alternately, if your posts’ addresses do not have a fragment ID (making their address the same as the page address), simply omit the “id” attribute. The post “http://other.org/blog/80210” would look like this:

<div class="post">
  post content
</div>

The div is still present so that the post content can be separated from any non-post material such as site navigation.

Third, identify posts on your main page by enclosing them with “div” elements which have the class “mirrored-post”. They are called mirrored posts instead of just posts because they aren’t in their permanent location. Again, make sure to include all the post content but as little of the surrounding code as possible. Mirrored posts look like this:

<div class="mirrored-post">
  post content
</div>

Because they aren’t located on their archive pages, there’s no way to derive the post address from the page address, but we will overcome that in the next step.

Most weblogs provide “permalinks”, which give readers a simple way to determine the address of an individual post. Step four is to identify your permalinks by setting their “rel” attribute to “archive”. This tells software that this particular link identifies the permanent address of the post without requiring you to give it a special link text or location. It will look something like this:

<a rel="archive" href="post address">your link text</a>

Mirrored posts must include an archive link, or else they cannot be identified. Normal posts do not require them, but it’s generally a good idea to include them anyway.

At this point, software can interpret your weblog in terms of posts, not just in terms of pages. Each link that a post makes (except for the archive links, naturally) indicates a reference, and references form the basis of threading. Even if you go no further, your weblog can now be treated as part of a global discussion forum.

Going further

If you so choose, you may describe your weblog and its posts in greater detail. What follows are a number of optional ways to provide additional information.

As always, it’s a good idea for your weblog to be written in valid HTML. It doesn’t matter so much which version you use, as long as you abide by its rules. For existing, non-validating sites, HTML 4.01 Transitional probably requires the fewest modifications to your site’s code. For newer sites, consider XHTML.

If your weblog categorizes posts by topic, you can indicate that your posts belong to those topics, rather than merely referring to them, by using the “topic” link type. Such a link, contained within a post’s div, would look something like this:

<a rel="topic" href="topic address">topic name</a>

Similarly, if the topic’s page links to individual posts, it can indicate that they belong to the topic with a reverse link:

<a rev="topic" href="post address">post name</a>

The topic page would need to include the profile declaration for this link to be correctly interpreted.

Every page in the weblog can indicate that it is part of the weblog by linking to the main page using the “weblog” link type. Even the main page might link to itself; doing so would indicate to software that it is a weblog’s main page. These links can be made with an anchor in the page’s body:

<a rel="weblog" href="weblog address">weblog name</a>

Alternately, they can be done with a link element in the page’s head:

<link rel="weblog" href="weblog address">

It’s probably a good idea to include a link in the head even if there is also a link in the body, but there is no strong reason to prefer one to the other.

Linking your archives to your weblog does not explicitly connect your posts to your weblog. Software will presumably infer that if Post A is in Archive B and Archive B is in Weblog C, then Post A is probably in Weblog C. If you wish to explicitly connect your posts to your weblog, you must include a weblog link in each post.

If your weblog has titles for some or all of its posts, consider marking up the titles with heading tags (h1, h2, and so on). This is a good idea for a number of reasons, but the important one here is that the text content of the first heading encountered in a post will usually be interpreted as the post’s title. If your site design does not allow you to use real headings, then enclose the post’s title with a span element of the class “title”.

Similarly, you can use a span element with the class “author” to identify the author of a post or the class “date” to identify the date. Dates must be given in the W3C date and time format (W3CDTF) and must identify a specific day, minute, or second. Because that format is cumbersome to read, you may choose to supply it as the value of the “title” attribute, which overrides the textual content of the span when present. For example, the common practice of concluding a post with the line “Posted by Author at Time” might be coded like this:

<p>Posted by <span class="author" title="Author’s name">Author’s nickname</span> at <span class="date" title="Full date">Time of day</span></p>
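
If you generate dates programmatically, a quick check such as the following Python sketch can catch malformed values before they are published. The function name is_profile_date is hypothetical. The pattern accepts only the day, minute, and second forms of W3CDTF (YYYY-MM-DD, optionally followed by a time with a mandatory time zone designator), which is this sketch's reading of the requirement above.

```python
import re
from datetime import date

# Day, minute, and second granularity; a time must carry Z or an offset.
_W3CDTF = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})"
    r"(?:T(\d{2}):(\d{2})(?::(\d{2}))?(?:Z|[+-]\d{2}:\d{2}))?"
)


def is_profile_date(value):
    """Return True if value names a specific day, minute, or second."""
    match = _W3CDTF.fullmatch(value)
    if not match:
        return False
    year, month, day, hour, minute, second = match.groups()
    try:
        date(int(year), int(month), int(day))  # rejects dates like 2002-02-30
    except ValueError:
        return False
    # Missing time fields default to zero for the range checks.
    return int(hour or 0) < 24 and int(minute or 0) < 60 and int(second or 0) < 60
```

For instance, “2002-08-31” and “2002-08-31T14:05-04:00” pass, while “August 31, 2002” and a time without a zone designator do not.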

If your weblog’s main page includes a “blogroll”, a list of links to other, recommended sites, you can indicate that these links are recommendations using the “recommendation” link type. If your site puts those links on a separate page, you can link to that page using the “linksPage” link type.

Consider being more specific when making references. For example, if your post discusses a post or article, your link to that post could indicate that your post is a comment by declaring rev="comment". (Why a reverse link? It’s because your post is the comment, but rel="comment" says that the resource at the other end of the link is a comment.) Other relevant link types are “concurrence”, “rebuttal”, “pointer”, and “followup”.

When quoting from a web page, use the “cite” attribute to say where you got the quote from. If you’re quoting from another weblog, try to cite the specific post, and not the weblog or an archive. The “cite” attribute can be used with blockquote elements or the rare q element:

<blockquote cite="source of quote">
  quote
</blockquote>

If your weblog implements these conventions, a considerable amount of information will be available to software processors.

Change history

2002-09-08
Rewrite of Posts and Links, including diagrams. Everything is here now, and in pretty good shape.

Dave Menendez