
[Note: Because I still consider this document to be a draft, I can’t guarantee the presentation won’t change radically in the near future. The permanent address of this version is <http://www.eyrie.org/~zednenem/2002/web-threads/20020617.html>. The current version is located at <http://www.eyrie.org/~zednenem/2002/web-threads/>.]

Self-assembling web threads

This document proposes a convention for coding weblogs which allows individual posts to be identified and grouped into discussion threads. The threads themselves emerge from the coding of the posts rather than explicit thread declarations, hence their description as “self-assembling.” The coding convention is designed to be as unobtrusive as possible, requiring minimal change from a coding perspective and usually no change from a visual presentation perspective. It does not rely on naming conventions or external metadata declarations, and it requires no extensions to HTML.

There are four steps involved in presenting web threads to end-users.

  1. Identify posts
  2. Arrange the posts into threads
  3. Transfer threading information between systems
  4. Present threads to users

Identifying posts

For a human, identifying individual weblog posts is fairly straightforward. Each weblog provides visual cues indicating where the posts begin and end, and there are often titles and “permalinks” provided in easy-to-guess locations.

For software, this task is much more difficult, because the means of identifying posts are visual, not syntactic. This means that no two weblogs can necessarily be parsed the same way. However, it turns out that it’s extremely easy to identify weblog posts unambiguously without requiring authors to sacrifice their individual visual styles or to adopt some sort of naming convention.

The answer is to enclose each post in a div element marked with the class “post”. For an archive containing only a single post, this is sufficient. The post is identified like so:

<div class="post">
(normal post contents)
</div>

Text and links inside the div are considered part of the post. Everything outside, such as site-wide navigation, is not.

In this case, assigning an address to the post is simple: it is the same as the address of its archive page. Obviously, this particular method can be used for at most one post per archive page.

Most archive pages contain multiple posts, however, and these are usually distinguished by URLs with fragment identifiers (a “#” followed by some text). This system can be easily implemented by assigning IDs to the div elements, like so:

<div class="post" id="(something)">
(normal post contents)
</div>

This method works identically to the common practice of using a named anchor (as in, “<a name="(something)"></a>”), except that the div is able to enclose the entire post, while an anchor cannot. Assigning addresses to posts defined this way is also simple: the address of the div element is the address of the post. That is, a div element with ID “carl” on the page “http://example.org/2002” has the address “http://example.org/2002#carl”.
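As a rough illustration of how software might apply the two addressing rules above, here is a minimal Python sketch built on the standard library's html.parser module. The names PostFinder and find_post_addresses are hypothetical, and treating “class” as a space-separated list of names is an assumption rather than something this convention spells out.

```python
from html.parser import HTMLParser


class PostFinder(HTMLParser):
    """Collect the addresses of <div class="post"> elements on one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        # Assumption: "class" may hold several space-separated names.
        classes = (attrs.get("class") or "").split()
        if "post" not in classes:
            return
        post_id = attrs.get("id")
        # With an ID, the address is the page URL plus a fragment;
        # without one, the post shares the archive page's address.
        if post_id:
            self.addresses.append(f"{self.page_url}#{post_id}")
        else:
            self.addresses.append(self.page_url)


def find_post_addresses(html, page_url):
    """Return the address of every post found in the given HTML text."""
    finder = PostFinder(page_url)
    finder.feed(html)
    return finder.addresses
```

Given a page at “http://example.org/2002” containing a post div with ID “carl”, the sketch reports “http://example.org/2002#carl”. Divs with any other class are ignored.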

The two identification methods described above work well for posts found on their archive pages, but weblog posts are usually mirrored on their weblog’s front page for a few days after their publication. Because the front page is the one most frequently read, it is useful to be able to identify posts there as well, but neither method can be used there, because the post’s permanent address cannot be built from the front page’s URL.

Another common weblog convention is the “permalink”, which gives the permanent address for a post so that readers can refer to it specifically. A simple convention can allow software to make use of them as well. HTML links support an attribute called “rel”, which specifies the relationship of the linked resource to the resource making the link. This attribute can be used to identify the link to the post’s permanent address, like so:

<div class="mirrored-post">
(normal post contents)
<a rel="archive" href="(address)">(link text)</a>
(post contents, continued)
</div>

Note that the div element’s class is now “mirrored-post” instead of “post”. This signifies that the post’s address should be read from the archive link. The archive link itself is identified by the “rel="archive"”, so the link text itself is not important. Similarly, the link may be placed anywhere in the post so long as it is inside the div element.

Obviously, a mirrored post should contain only one archive link. Two or more could be ambiguous, although there’s no confusion if they all point to the same address. Having no archive link means that the post cannot be identified, defeating the purpose of marking it.

Archive links may also appear in non-mirrored posts, as long as the address they give is the same as the one derived from the post’s location.
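The rules for mirrored posts can be sketched in Python as follows. This is only one possible reading, not a reference implementation: the names MirroredPostFinder and mirrored_post_addresses are hypothetical, and the sketch assumes that a mirrored post whose archive links disagree, or which has none, is simply skipped.

```python
from html.parser import HTMLParser


class MirroredPostFinder(HTMLParser):
    """Gather the archive links found inside each mirrored-post div."""

    def __init__(self):
        super().__init__()
        self.div_stack = []  # one flag per open div: True if it opened a post
        self.current = None  # archive addresses of the post being read, if any
        self.posts = []      # one set of archive addresses per mirrored post

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            opens_post = "mirrored-post" in classes and self.current is None
            self.div_stack.append(opens_post)
            if opens_post:
                self.current = set()
        elif tag == "a" and self.current is not None:
            # Only links inside a mirrored post are considered.
            rels = (attrs.get("rel") or "").split()
            if "archive" in rels and attrs.get("href"):
                self.current.add(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            if self.div_stack.pop():
                self.posts.append(self.current)
                self.current = None


def mirrored_post_addresses(html):
    """Return the address of each post with one unambiguous archive link."""
    finder = MirroredPostFinder()
    finder.feed(html)
    # Several links to the same address collapse to one entry in the set.
    return [next(iter(links)) for links in finder.posts if len(links) == 1]
```

Tracking open divs on a stack lets the archive link sit anywhere inside the post, including inside nested divs, while links outside any mirrored post are ignored.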

To avoid confusion, weblogs and other resources following this convention should include this tag in their HTML header:

<meta name="thread-scheme"
      content="http://www.eyrie.org/~zednenem/2002/web-threads/" />

(The final slash is only necessary for XHTML documents and should be omitted from conforming HTML documents.)

The tag indicates that the document follows this convention. The URL of this document is used because it is unlikely to conflict with any existing practice and because it suggests to humans reading the source code that more information may be found in this document. (There is no reason for software to read this document while parsing a weblog.)
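Software could check for this tag before looking for posts. The following Python sketch shows one way to do so; the names SchemeDetector and follows_convention are hypothetical, and the exact string comparison on the content attribute is an assumption.

```python
from html.parser import HTMLParser

THREAD_SCHEME = "http://www.eyrie.org/~zednenem/2002/web-threads/"


class SchemeDetector(HTMLParser):
    """Note whether a page declares the thread-scheme meta tag."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Also called for XHTML-style empty tags such as <meta ... />.
        attrs = dict(attrs)
        if (tag == "meta"
                and attrs.get("name") == "thread-scheme"
                and attrs.get("content") == THREAD_SCHEME):
            self.found = True


def follows_convention(html):
    """Return True if the page declares that it follows this convention."""
    detector = SchemeDetector()
    detector.feed(html)
    return detector.found
```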

Arranging posts into threads

The method for arranging posts into threads is inspired by Usenet’s methods. In Usenet, each post contains a list of references which identify earlier posts in the thread. News clients use these reference lists to arrange the posts into threads. If two posts share an item in their reference lists, then they’re part of the same thread. If one post appears in another’s reference list, then the post making the reference comes later in the thread than the post it lists. The advantage of this method is that it allows threads to form without centralized control. In the minimal case, all that is required is for each post to refer to its parent in the thread.

This method can be applied to weblogs as well, using links instead of a special header. If a post contains a link to another post, then the first post is referring to the second post and therefore occurs after it in the thread. That’s all it takes.

There are some complexities which can arise. Because weblog posts are sometimes modified after their original publication, it’s possible for the graph of references to contain cycles. This doesn’t create problems from a software standpoint, but it does make it difficult to present an overview of the thread to users.
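To make the grouping concrete, here is a minimal Python sketch that assembles threads from a list of references. It assumes the references have already been reduced to (referring post, referred post) pairs, and it treats a thread as a connected group of posts. Both the function name assemble_threads and that definition of a thread are assumptions made for illustration.

```python
from collections import defaultdict


def assemble_threads(references):
    """Group posts into threads, given (referring, referred) address pairs."""
    # For grouping purposes, a reference connects two posts in both directions.
    neighbours = defaultdict(set)
    for source, target in references:
        neighbours[source].add(target)
        neighbours[target].add(source)

    threads = []
    seen = set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Walk outward from this post; already-visited posts are skipped,
        # so cycles in the reference graph cannot cause an endless loop.
        thread = set()
        stack = [start]
        while stack:
            post = stack.pop()
            if post in thread:
                continue
            thread.add(post)
            stack.extend(neighbours[post] - thread)
        seen |= thread
        threads.append(thread)
    return threads
```

The sketch only answers which posts belong together. Ordering the posts within a thread is the harder part once cycles are present, and it is left out here.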

Transferring threading information

Given a set of weblog posts, all that is needed to derive threading information about them is the references made between them. While any processing software is free to read the original weblogs to find this information, it makes sense for different entities to exchange their results.

While one could imagine multiple ways to present this information, it makes sense to offer one method that multiple services might use. As it happens, the Resource Description Framework (RDF) provides exactly the sort of model needed to express these relationships, while being set up in a way to allow near-infinite extensibility.

The two major resources are “Post”, which describes a weblog post, and “refersTo”, which is the relation between a post and a resource it refers to. For example, the relationship between a post “http://example.org/2002#carl” which refers to “http://other.org/2002#lenny” can be expressed like so:

<rdf:RDF xmlns="http://www.eyrie.org/~zednenem/2002/web-threads/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
         
    <Post rdf:about="http://example.org/2002#carl">
        <refersTo rdf:resource="http://other.org/2002#lenny"/>
    </Post>
</rdf:RDF>

Or, using N3, a simpler RDF syntax:

@prefix : <http://www.eyrie.org/~zednenem/2002/web-threads/>.

<http://example.org/2002#carl> a :Post;
    :refersTo <http://other.org/2002#lenny>.

Any set of references can be expressed using this vocabulary.
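A service holding a list of references could emit the N3 form with very little code. The Python sketch below is illustrative only: the function name to_n3 is hypothetical, and it assumes the addresses contain no characters that would need escaping inside angle brackets.

```python
PREFIX = "@prefix : <http://www.eyrie.org/~zednenem/2002/web-threads/>."


def to_n3(references):
    """Serialize (referring, referred) address pairs as N3 text."""
    # Group the references by the post that makes them, keeping input order.
    by_post = {}
    for post, target in references:
        by_post.setdefault(post, []).append(target)

    lines = [PREFIX, ""]
    for post, targets in by_post.items():
        lines.append(f"<{post}> a :Post;")
        # Several objects of the same property are separated by commas.
        objects = ", ".join(f"<{target}>" for target in targets)
        lines.append(f"    :refersTo {objects}.")
    return "\n".join(lines)
```

Called with the single pair (“http://example.org/2002#carl”, “http://other.org/2002#lenny”), it reproduces the N3 example above.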

Once a common interchange format is available, the exact methods used to request and retrieve threading information are less important. In one possible scenario, a service which collects information about web posts might allow clients to request information about specific posts through special URLs like “http://example.org/thread-service?post=http://other.org/2002%23lenny”, which would return the reference information about “http://other.org/2002#lenny”. However, popular posts might be referenced so frequently that more parameters would be necessary to filter the results.
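The “#” in a post address has to be escaped as “%23” when the address is placed in a query string, or it would be read as the fragment of the service URL itself. A small Python sketch of building and reading such a URL follows. The service address is the hypothetical one from the scenario above, and the function names query_url and requested_post are likewise assumptions.

```python
from urllib.parse import parse_qs, quote, urlsplit

SERVICE = "http://example.org/thread-service"  # hypothetical service address


def query_url(post_address, service=SERVICE):
    """Build a request URL asking the service about one post."""
    # Leave ":" and "/" readable; escape "#", "&", "?" and the like.
    return f"{service}?post={quote(post_address, safe=':/')}"


def requested_post(url):
    """Recover the post address from a request URL (the service's side)."""
    return parse_qs(urlsplit(url).query)["post"][0]
```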

Presenting threads to users

Once the data has been gathered and analyzed, it can be presented to the user in several fashions. A service might offer a thread overview, displaying the posts which preceded and responded to a given post. Other services might filter web pages in a manner similar to Crit, inserting a list of responses after each post. For weblogs which are generated dynamically, the response list could be inserted by the server itself. (In both cases, the list of responses would need to be outside the post itself, to avoid confusing other thread-aware software.)

Web browsers themselves could detect posts and query servers, similar to Amaya’s support for Annotea. Search engines like Google and Blogdex would be able to work with individual posts, rather than noting that a given link or keyword existed on the weblog’s front page the last time it was indexed.

Further applications

While this convention was originally developed with weblogs in mind, it applies equally well to web-based discussion boards. The concept of “post” transfers exactly, and it is a simple matter for threaded boards to provide an automatic reference to a post’s parent. Mirrored posts can be used in cases where a board offers single-post and whole-thread views, or on linear boards where different ranges of posts may be shown, depending on the URL.

Through this convention, a single discussion thread can cross between multiple message boards and weblogs, provided that authors link to the posts they refer to.

This convention can also be applied to web-based Usenet and mailing list archives. Each news post or e-mail message has its own message ID, which can be expressed as a URL. Thus, a mailing list archive could identify messages as posts:

<div class="mirrored-post">
(headers)
Message ID: <a rel="archive"
  href="mid:12345@example.org">&lt;12345@example.org&gt;</a>

(rest of message)
</div>

Similarly, a Usenet archive could identify posts using URLs with the “news:” scheme. In these cases, the information being provided is primarily for the benefit of threading services, since a web browser will not typically be able to dereference a “mid:” or “news:” address. However, when archives expose these links to outside services, threads can move among mailing lists and Usenet in addition to weblogs and message boards. If instant messaging services start defining URLs for their messages, then those discussions can be part of the larger threads as well.
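An archive generating these links needs to turn a Message-ID header value into a “mid:” address. The Python sketch below shows one plausible way, assuming the header value is wrapped in angle brackets and that percent-encoding everything except “@” and the usual unreserved characters is acceptable. The function name mid_url is hypothetical.

```python
from urllib.parse import quote


def mid_url(message_id):
    """Turn a Message-ID header value into a "mid:" address."""
    # Drop surrounding whitespace and the angle brackets used in headers.
    bare = message_id.strip().strip("<>")
    # Keep "@" readable; percent-encode other special characters.
    return "mid:" + quote(bare, safe="@")
```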

Dave Menendez