ZedneWeb / Profile for web threads
This document describes a coding convention for HTML documents which embeds much of the threading information described by the Thread Description Language without invalidating the markup or relying on comments.
An HTML profile allows software to deduce information from web pages, rather than requiring a separate resource for metadata. This particular profile is designed to embed information about threaded discussions by specifying how to indicate the identity and content of posts, and by providing methods to indicate metadata about the posts.
Agents are free to represent this information in whatever manner is convenient, but this document will describe the information in terms of the Thread Description Language, an RDF vocabulary designed to represent information about threaded discussions. Readers unfamiliar with RDF will probably want to start with the Appendix, which shows how this profile can be used to mark up a weblog.
Because RDF is structured around absolute URI references, addresses are very important to the definitions. Two addresses which will be seen frequently are the page address and the post address. The page address is simply the address of the HTML being interpreted. More specifically, it is the address which will be used for interpreting relative URI references.
Certain parts of an HTML document will be marked as posts. The post address is the address of that post, and the section on posts tells how to derive it.
An RDF statement comprises three parts: a subject, a property, and an object. The subject and property are identified by URI references, and the object may be either a URI reference or a literal value, such as a string.
Lastly, a URI reference is a URI which may or may not contain a fragment identifier. “http://example.org/” and “http://example.org/#fragid” are both examples of URI references.
Pages which conform to this profile must set the profile attribute of the head element to “http://www.eyrie.org/~zednenem/2002/wtprofile/”. Doing so indicates to processing agents that the page follows the conventions described here.
<head profile="http://www.eyrie.org/~zednenem/2002/wtprofile/"> headers </head>
Note that the profile attribute is defined as a space-separated list of URIs, although HTML does not define what multiple profile declarations mean.
Note also that the use of a URI to identify this profile does not imply that this page should be accessed at any point during processing; it is simply being used as a unique identifier.
The convention for indicating posts must perform two tasks. First, it must associate a URI reference with each post. This should be done in such a way that a user agent can dereference the URI reference and display the post. This means that posts should be given addresses based on the address of the page where they are located (the archive page). If there is more than one post per archive page, then all of them, or all but one, must use fragment identifiers to distinguish themselves. Thus, if “archive” is a page containing posts, then no more than one post may have the address “archive” and the remainder will have addresses in the form “archive#fragment_id”, where fragment_id is unique to each post.
The second requirement for this convention is the ability to distinguish the content of a post from other page material, such as site-wide navigation and branding and other posts. The simplest way to accomplish this is to put the entire post inside some element. The logical choice here is “div”, because it has no inherent display properties and it can contain block-level content. The use of “div” also provides a simple way to associate posts with their URI.
A post is identified by a “div” element which has the class “post” or “mirrored-post”. The class “post” is used to indicate posts which are located in their archive page. In that case, the address of the post can be derived from the address of the page containing the post. The class “mirrored-post” is used to indicate posts which are permanently located on another page. For example, the main page of a weblog usually contains copies of the most recent posts. For mirrored posts, the address of the post cannot be derived from the page address.
Posts cannot contain other posts. If a parser encounters a “div” element with the class “post” or “mirrored-post” within a post, it should halt processing, return no data, and report an error.
The address of a post in its archive depends on the address of the archive. Either it will be the same as the archive’s, or it will be the archive’s plus a fragment identifier.
The first case is indicated by a “div” element which has the class “post” and no “id” attribute:
<div class="post"> content </div>
A page may contain no more than one post which has the same address as its archive.
The second case is indicated by a “div” element which has the class “post” and an “id” attribute:
<div class="post" id="fragment_id"> content </div>
A page may contain no more than one post which has a given fragment_id. Note that in this case, the address of the post is also the address of the “div” element, thanks to the “id” attribute. This means that user agents can follow links to posts declared in this way and present them appropriately.
The address of a mirrored post cannot be derived from the address of the page which contains it. Thus, we will use a link with a special relation to indicate the address of the post.
The post itself is indicated by a “div” element with the class “mirrored-post”:
<div class="mirrored-post"> content </div>
Unlike archived posts, mirrored posts are interpreted the same whether or not the “div” element has an “id” attribute.
A mirrored post must contain at least one link with the “archive” relation. The destination of an archive link indicates the address of the post which contains it. The archive link is an “a” element with a “rel” attribute set to “archive”:
<a rel="archive" href="post_address">…</a>
This sort of link is often referred to as a “permalink”. Note that archive links are indicated by the relation type, not their link text or location in the post. Thus, any weblog which includes permalinks may mark them as archive links.
If a post contains multiple archive links, they must all point to the same address. Similarly, if an archived post contains an archive link, it must point to the post’s address.
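The rules for posts can be sketched as a small parser. This is only an illustration built on Python's standard `html.parser`, not a reference implementation: the names `PostFinder`, `PostError`, and `find_posts` are invented here, nesting is tracked by counting “div” elements only, and a nested post is reported by raising an exception so that no data is returned.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PostError(Exception):
    """Raised when a page breaks one of the rules for posts."""


class PostFinder(HTMLParser):
    def __init__(self, page_address):
        super().__init__()
        self.page = page_address
        self.posts = []      # addresses of finished posts, in page order
        self.local = set()   # addresses already used by class="post" divs
        self.kind = None     # "post" or "mirrored-post" while inside a post
        self.depth = 0       # nesting depth of div elements inside the post
        self.address = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        is_post = tag == "div" and ("post" in classes or "mirrored-post" in classes)
        if self.kind is None:
            if is_post:
                self.kind = "post" if "post" in classes else "mirrored-post"
                self.depth, self.address = 1, None
                if self.kind == "post":
                    # Page address, plus the id as a fragment when present.
                    frag = attrs.get("id")
                    self.address = urljoin(self.page, "#" + frag) if frag else self.page
            return
        if is_post:
            raise PostError("posts cannot contain other posts")
        if tag == "div":
            self.depth += 1
        elif tag == "a" and "archive" in (attrs.get("rel") or "").split():
            # Archive links are resolved against the page address.
            target = urljoin(self.page, attrs.get("href") or "")
            if self.address is not None and target != self.address:
                raise PostError("archive link does not match the post address")
            self.address = target

    def handle_endtag(self, tag):
        if self.kind is None or tag != "div":
            return
        self.depth -= 1
        if self.depth:
            return
        if self.address is None:
            raise PostError("mirrored post has no archive link")
        if self.kind == "post":
            if self.address in self.local:
                raise PostError("two posts share one address")
            self.local.add(self.address)
        self.posts.append(self.address)
        self.kind = None


def find_posts(page_address, html):
    finder = PostFinder(page_address)
    finder.feed(html)
    finder.close()
    return finder.posts
```

Calling `find_posts` with a page address and its markup returns the post addresses in page order. A real processor would also need to collect each post's content and links, which this sketch leaves out.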
The existence of a post asserts four RDF statements. Given a post with the address post which is located in an archive with the address archive, we can assert that post has the type tdl:Post, that archive has the type tdl:Archive, that the value of tdl:inArchive for post is archive, and that a value of tdl:hasPost for archive is post.
These statements are true even when the post’s address is the same as its archive’s address. The resulting graph will look slightly different, but it is generated by the same rules.
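As a rough illustration, the four statements can be written as (subject, property, object) tuples. The function name `post_statements` is invented for this sketch; the TDL namespace is the one given in the section on links, and the `rdf:type` URI is the standard RDF one.

```python
TDL = "http://www.eyrie.org/~zednenem/2002/web-threads/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def post_statements(post, archive):
    """Return the four statements implied by a post in an archive."""
    return [
        (post, RDF_TYPE, TDL + "Post"),
        (archive, RDF_TYPE, TDL + "Archive"),
        (post, TDL + "inArchive", archive),
        (archive, TDL + "hasPost", post),
    ]
```

When the post and the archive share an address, the same rules still yield four distinct statements; the resource simply has both types and is related to itself.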
In many circumstances, posts will have an intrinsic order, which is represented using the tdl:next and tdl:prev properties. These properties can always be indicated with link types, but this is inconvenient when multiple posts occur on the same page. Enclosing posts in “div” elements with the class “post-sequence” or “reverse-post-sequence” allows the intrinsic order of elements on the page to imply the sequencing of posts.
Each post contained in a “div” element with the class “post-sequence” follows the post which occurred earlier in the div. No implication is made about what precedes the first post or follows the last post in the sequence. Applications are free to represent the sequence using tdl:next or tdl:prev or both.
Each post contained in a “div” element with the class “reverse-post-sequence” precedes the post which occurred earlier in the div. No implication is made about what follows the first post or precedes the last post in the sequence. Applications are free to represent the sequence using tdl:next or tdl:prev or both.
Post sequences may be nested. In the case where a nested sequence has the same direction as the outer sequence, one can act as though the nested sequence was not there. In the case where the nested sequence has the opposite direction from the outer sequence, the items in the nested sequence are inserted into the outer sequence in the opposite order.
In an attempt to illustrate this, let Pn be a post, S(…) be a sequence of posts, and RS(…) be a reversed sequence of posts. In these examples, the posts are numbered in their intrinsic order, but are given in their presentation order.
Processors are, of course, free to ignore information that is unnecessary for their purposes or which their programmers feel is too complicated to deal with. Highly nested sequences such as the seventh example are probably best avoided.
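One way to read the nesting rules is as a recursive flattening: a sequence contributes its items in page order, a reversed sequence contributes them in reverse page order, and nested sequences are flattened the same way. The sketch below follows that reading, which is an interpretation rather than a definition. The tuple representation and the names `intrinsic_order` and `next_pairs` are invented for the illustration.

```python
def intrinsic_order(node):
    """Flatten a post or sequence into its intrinsic order.

    A node is either a post label (a string) or a tuple
    ("S", [children]) or ("RS", [children]), with the children
    listed in the order they appear on the page.
    """
    if isinstance(node, str):
        return [node]
    direction, children = node
    if direction == "RS":
        children = reversed(children)
    order = []
    for child in children:
        order.extend(intrinsic_order(child))
    return order


def next_pairs(node):
    """Pairs (a, b) meaning that b is the tdl:next of a."""
    order = intrinsic_order(node)
    return list(zip(order, order[1:]))
```

For example, `intrinsic_order(("S", ["P1", ("RS", ["P3", "P2"]), "P4"]))` yields the posts in the order P1, P2, P3, P4, and so does the reversed arrangement `("RS", ["P4", ("S", ["P2", "P3"]), "P1"])`.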
A link is a connection of some sort between two resources. There are two primary ways of declaring a link in HTML, the link element and the a element. Most HTML links are untyped, but HTML does provide a mechanism for declaring the type of a link: the “rel” and “rev” attributes (short for “relation” and “reverse relation”). These attributes provide a simple way to indicate the type of relationship implied by the link.
The “rel” attribute indicates a relationship between the resource making the link (the source) and the resource referenced in the link (the destination).
The “rev” attribute indicates a reverse relationship between the resource making the link and the resource referenced in the link.
This model maps very well to RDF. All that is required is a way to associate a URI to the source, destination, and relation.
If the link occurs in a post, then the source of the link is the URI of the post. (The section on posts explains how to determine the content and URI of posts.) Otherwise, the source of the link is the URI of the page.
The destination of the link is always the resource identified in the “href” attribute of the link.
The relation of the link corresponds to a property from the TDL vocabulary. These properties will be abbreviated here as tdl:name, which corresponds to the URI “http://www.eyrie.org/~zednenem/2002/web-threads/name”.
For forward relations, the source is the subject and the destination is the object. For reverse relations, the source is the object, and the destination is the subject. In both cases, the property is found by looking up the relation in the table below.
Not all links correspond to RDF statements. Some relations, such as “archive”, have a special meaning. Relations with the comment “Post context” are only meaningful when they occur inside a post.
Relation | Property | Comment |
---|---|---|
(nothing) | tdl:refersTo | Post context |
archive | See Posts | |
comment | tdl:commentedOnBy | Post context |
concurrence | tdl:agreedWithBy | Post context |
first | tdl:first | |
followup | tdl:followedUpBy | Post context |
forum | tdl:inForum | |
last | tdl:last | |
linksPage | tdl:hasLinksPageAt | Subject is tdl:Weblog |
next | tdl:next | |
pointer | tdl:pointedToBy | Post context |
prev | tdl:prev | |
rebuttal | tdl:disagreedWithBy | Post context |
recommendation | tdl:recommends | Special handling |
topic | tdl:inTopic | |
weblog | tdl:inWeblog | |
Links are permitted to declare multiple relations. The “rel” and “rev” attributes are described as having a space-separated list of relations, and both attributes can occur in a single “a” or “link” element. Each relation in a link is handled separately, as though it had occurred alone.
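The mapping from a typed link to RDF statements might be sketched as follows. The function name `link_statements` and its parameters are invented here. The sketch makes several simplifying assumptions: relation names are matched exactly, a link counts as untyped only when it has neither “rel” nor “rev”, and the special cases (“archive”, the handling of “recommendation” on links pages, and the class implications of “forum”, “topic”, and “weblog”) are not modelled.

```python
TDL = "http://www.eyrie.org/~zednenem/2002/web-threads/"

# Relation name -> TDL property name, copied from the table above.
# "archive" is left out because it identifies a post rather than
# producing a statement.
PROPERTIES = {
    "comment": "commentedOnBy", "concurrence": "agreedWithBy",
    "first": "first", "followup": "followedUpBy", "forum": "inForum",
    "last": "last", "linksPage": "hasLinksPageAt", "next": "next",
    "pointer": "pointedToBy", "prev": "prev",
    "rebuttal": "disagreedWithBy", "recommendation": "recommends",
    "topic": "inTopic", "weblog": "inWeblog",
}
POST_CONTEXT = {"comment", "concurrence", "followup", "pointer", "rebuttal"}


def link_statements(source, destination, rel="", rev="", in_post=False):
    """Return (subject, property, object) tuples for one link."""
    rels, revs = rel.split(), rev.split()
    if not rels and not revs:
        # Untyped links only carry meaning inside a post.
        return [(source, TDL + "refersTo", destination)] if in_post else []
    statements = []
    for forward, names in ((True, rels), (False, revs)):
        for name in names:
            if name not in PROPERTIES:
                continue
            if name in POST_CONTEXT and not in_post:
                continue
            prop = TDL + PROPERTIES[name]
            if forward:
                statements.append((source, prop, destination))
            else:
                statements.append((destination, prop, source))
    return statements
```

Each relation is handled on its own, so a link carrying both `rel="weblog"` and `rev="linksPage"` produces two statements, one in each direction.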
Untyped links and the “comment”, “concurrence”, “followup”, “pointer”, and “rebuttal” relations are used to indicate references when they occur inside a post.
Untyped links correspond to the tdl:refersTo property, where the post is the subject and the destination of the link is the object. (Untyped links which are not contained in a post have no special meaning.)
The other relations correspond to subproperties of tdl:refersTo, but not directly. For most normal usage, the relations given above are used in reverse. A relation such as “comment” logically points to a comment, as in this graph:
Because the link is usually located in the comment, the relation must be given in reverse in order for the arrow to point the right way. Because of that, “comment” and the other reference relations correspond to the subproperties of tdl:referredToBy, which is the inverse of tdl:refersTo.
The fact that the relations are defined this way does not mean that the information must be stored in this way. When gathering data about a post, it is generally more convenient to formulate statements where the post is the subject. Because each of the reference properties has an inverse, we can interpret these relations in terms of their inverse. This is akin to saying “This post refers to this resource” instead of “this resource is referred to by this post”.
The “comment” relation corresponds to the tdl:commentedOnBy property. The inverse of that is tdl:commentsOn, so we can view the example given above like this:
What does this mean for weblog authors? It means that when you’re writing a post which comments on, agrees with, disagrees with, or points to something, you should use the “rev” attribute, like this:
<a rev="comment" href="resource address">link text</a>
A processor will interpret this as your post commenting on the resource. It will model that in terms of tdl:commentsOn or tdl:commentedOnBy, depending on its needs.
The “forum”, “topic”, and “weblog” link types imply more than the existence of tdl:inForum, tdl:inTopic, and tdl:inWeblog properties on certain objects. They also imply that the objects of those properties belong to the appropriate class (tdl:Forum, tdl:Topic, and tdl:Weblog, respectively).
Knowing type information for resources will make certain forms of processing easier, particularly presentation of data to users.
It is reasonable to link archives to the main page of a weblog or forum using the “weblog” and “forum” link types, respectively. The main pages of weblogs, forums, and topics can also link to themselves in their header with the appropriate type. The statement that, say, a weblog is part of itself is not very interesting, but the implication that it is a weblog is useful for activities such as handling recommendations.
The link type “recommendation” corresponds to the TDL weblog-specific property “recommends”. Most weblogs have a list of other sites which their authors recommend to their readers. This list, commonly called the “blogroll”, is often but not always located on the weblog’s main page.
When a link of the type “recommendation” occurs on the main page of a weblog (but not in a post), it indicates that the weblog recommends the destination resource.
Some weblogs place their recommendations on one or more secondary pages. These require special handling, because instead of indicating a relation between the page containing the link and the destination of the link, the links indicate a relation between the weblog which contains that page and the destination of the link.
If a link with the “recommendation” relation is encountered on a page which is known to be a links page for a weblog, then the source of the relation is considered to be the weblog. The “linksPage” relation and the tdl:hasLinksPageAt property both indicate when a page is a links page for some weblog.
To avoid ambiguity, links pages should indicate their relationship with a weblog by including a link like this one in their header:
<link rel="weblog" rev="linksPage" href="address of weblog">
This indicates that the destination of the link is a weblog and that this page is a links page associated with that weblog.
The “recommendation” relation can be used in reverse to indicate a weblog which recommends this resource or weblog.
The subject of a recommends property is always a weblog. A link with the attribute “rel="recommendation"” does not have special meaning in a post, even if the post is being mirrored on the weblog’s main page. A link with the attribute “rev="recommendation"” may occur in a post, but it would indicate that the destination is a weblog which recommends this post, a rare occurrence. As a rule, links with the relation “recommendation” in forward or reverse should not occur in posts.
A good deal of information can be extracted from other elements. This section describes information stored in “meta” elements, the “title” element, certain classes of “span”, headings, and the “cite” attribute of the “blockquote” and “q” elements.
The “head” element contains metadata which applies to the document as a whole. Of these, the “title” and “meta” elements indicate information which cannot be represented as a link.
The “name” attribute of a “meta” element indicates a property of the page. The object of this property is given in the “content” attribute. For “meta” elements, the object is always a literal value, even if it resembles a URI.
Name | Property | Comment |
---|---|---|
author | dc:creator | |
date | dc:date | W3CDTF; see TDL |
title | dc:title | |
If no “meta” element with the name “title” is present, then the content of the “title” element may be used for dc:title.
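A minimal sketch of reading page-level metadata follows, using Python's standard `html.parser`. The names `HeadMetadata` and `page_metadata` are invented for the illustration. It also assumes that the `dc:` prefix refers to the Dublin Core element set at `http://purl.org/dc/elements/1.1/`, and that the first “meta” element wins when a name is repeated, neither of which is stated in this document.

```python
from html.parser import HTMLParser

DC = "http://purl.org/dc/elements/1.1/"  # assumed expansion of "dc:"
META_PROPERTIES = {"author": "creator", "date": "date", "title": "title"}


class HeadMetadata(HTMLParser):
    def __init__(self):
        super().__init__()
        self.values = {}      # property URI -> literal value
        self.in_title = False
        self.title_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if tag == "meta" and name in META_PROPERTIES and content is not None:
            # The object is always a literal, even if it looks like a URI.
            self.values.setdefault(DC + META_PROPERTIES[name], content)
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_text.append(data)


def page_metadata(html):
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    values = dict(parser.values)
    title = "".join(parser.title_text).strip()
    # Fall back to the title element only when no meta title was given.
    if DC + "title" not in values and title:
        values[DC + "title"] = title
    return values
```

The result maps property URIs to literal strings whose subject is the page address.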
The information in the “head” element applies to the page; another method is needed to encode literal data for posts which have a URI different from the page address. Certain classes of “span” and the heading elements indicate information which cannot be represented as a link.
A “span” element which has a class equal to one of those in the following table indicates a statement. The subject is the post containing the span, and the property corresponding to the class is given in the table. If the span has a “title” attribute, then the object of the statement is the value of the “title” attribute. Otherwise, the object is the textual content of the span (that is, the content ignoring any markup).
Name | Property | Comment |
---|---|---|
author | dc:creator | |
date | dc:date | W3CDTF; see TDL |
title | dc:title | |
If a post contains no span with the class “title”, then the first heading element (h1, h2, etc.) in the post is used as the object of dc:title. If a post contains no span with the class “title” and no headings, processors may excerpt the first few words of the post. The use of heading elements rather than spans is encouraged.
Processor behavior is undefined when a post contains multiple spans of the same class. Processors should use at least one of the spans as the value for the appropriate property. Multiple spans with the same class are discouraged.
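The span and heading rules might be implemented along these lines. The sketch expects the markup of a single post and uses invented names (`PostMetadata`, `post_metadata`). It assumes that `dc:` expands to `http://purl.org/dc/elements/1.1/` and, since the behaviour is undefined, simply keeps the first span of each class.

```python
from html.parser import HTMLParser

DC = "http://purl.org/dc/elements/1.1/"  # assumed expansion of "dc:"
SPAN_PROPERTIES = {"author": "creator", "date": "date", "title": "title"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PostMetadata(HTMLParser):
    def __init__(self):
        super().__init__()
        self.values = {}           # property URI -> literal value
        self.heading = None        # text of the first heading, once closed
        self._heading_tag = None   # tag name while inside the first heading
        self._heading_text = []
        self._span = None          # [class, title attribute, text parts, depth]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADINGS and self.heading is None and self._heading_tag is None:
            self._heading_tag = tag
        if tag != "span":
            return
        if self._span is not None:
            self._span[3] += 1     # nested span inside a captured span
            return
        classes = (attrs.get("class") or "").split()
        name = next((c for c in classes if c in SPAN_PROPERTIES), None)
        if name:
            self._span = [name, attrs.get("title"), [], 1]

    def handle_endtag(self, tag):
        if tag == self._heading_tag:
            self.heading = "".join(self._heading_text).strip()
            self._heading_tag = None
        if tag == "span" and self._span is not None:
            self._span[3] -= 1
            if self._span[3] == 0:
                name, title, parts, _ = self._span
                self._span = None
                # The title attribute, when present, overrides the text.
                value = title if title is not None else "".join(parts).strip()
                self.values.setdefault(DC + SPAN_PROPERTIES[name], value)

    def handle_data(self, data):
        if self._heading_tag is not None:
            self._heading_text.append(data)
        if self._span is not None:
            self._span[2].append(data)


def post_metadata(post_html):
    parser = PostMetadata()
    parser.feed(post_html)
    parser.close()
    values = dict(parser.values)
    # The first heading is used only when no span supplied a title.
    if DC + "title" not in values and parser.heading:
        values[DC + "title"] = parser.heading
    return values
```

Excerpting the first few words of a post that has neither a title span nor a heading is left out, since that behaviour is optional.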
Within a post, “blockquote” and “q” elements which have the “cite” attribute indicate statements. The subject of these statements is the post, the property is tdl:quotes, and the object is the resource identified by the “cite” attribute.
Processors should first check to make sure the “cite” attribute contains a URI reference and not arbitrary text. “cite” attributes which are not URI references are ignored.
Basic conformance to this profile enables software to break your weblog (or message board, mail archive, etc.) into individual posts and determine what references you make in individual posts. Currently, all a crawler can say about a link in a weblog is that it came from a certain web page. For posts being mirrored on the weblog’s main page, all that can be said is that the weblog made the link at some point in the past. This profile provides a way to associate links with the post that makes them, and to associate posts with their permanent address even when they are being mirrored in a different location.
Achieving basic conformance requires only minor changes to your current coding practices. If your weblog is generated using templates, all you need to do is modify them to produce the appropriate code (see the example templates for some ideas). If your weblog is hand-coded, the extra effort needed to use this profile is minimal.
First, your weblog’s main page and archive pages must indicate conformance to this profile by setting the “profile” attribute of the head element to “http://www.eyrie.org/~zednenem/2002/wtprofile/”. This will produce code looking like this:
<head profile="http://www.eyrie.org/~zednenem/2002/wtprofile/"> headers </head>
Second, identify posts in your archives by enclosing them with “div” elements which have the class “post”. Make sure that the div includes all the posts’ content, but as little of the surrounding code as possible. This will allow software to distinguish the links being made inside a post from links made in other posts or which are there for site navigation.
The next part depends on how your posts are identified. If the posts in your weblog have addresses which include fragment identifiers, then the div should have an “id” attribute containing the fragment ID. For example, the post “http://example.org/blog/455#p6” would look like this:
<div class="post" id="p6"> post content </div>
Note that this eliminates any need for an anchor, such as <a name="p6"></a>, to identify the post. Using div and id is preferable because it makes the address refer to the entire post, rather than a single point preceding the post.
Alternately, if your posts’ addresses do not have a fragment ID (making their address the same as the page address), simply omit the “id” attribute. The post “http://other.org/blog/80210” would look like this:
<div class="post"> post content </div>
The div is still present so that the post content can be separated from any non-post material such as site navigation.
Third, identify posts on your main page by enclosing them with “div” elements which have the class “mirrored-post”. They are called mirrored posts instead of just posts because they aren’t in their permanent location. Again, make sure to include all the post content but as little of the surrounding code as possible. Mirrored posts look like this:
<div class="mirrored-post"> post content </div>
Because they aren’t located on their archive pages, there’s no way to derive the post address from the page address, but we will overcome that in the next step.
Most weblogs provide “permalinks”, which give readers a simple way to determine the address of an individual post. Step four is to identify your permalinks by setting their “rel” attribute to “archive”. This tells software that this particular link identifies the permanent address of the post without requiring you to give it a special link text or location. It will look something like this:
<a rel="archive" href="post address">your link text</a>
Mirrored posts must include an archive link, or else they cannot be identified. Normal posts do not require them, but it’s generally a good idea to include them anyway.
At this point, software can interpret your weblog in terms of posts, not just in terms of pages. Each link that a post makes (except for the archive links, naturally) indicates a reference, and these references form the basis of threading. Even if you go no further, your weblog can now be treated as part of a global discussion forum.
If you so choose, you may describe your weblog and its posts in greater detail. What follows are a number of optional ways to provide additional information.
As always, it’s a good idea for your weblog to be written in valid HTML. It doesn’t matter so much which version you use, as long as you abide by its rules. For existing, non-validating sites, HTML 4.01 Transitional probably requires the fewest modifications to your site’s code. For newer sites, consider XHTML.
If your weblog categorizes posts by topic, you can indicate that your posts belong to those topics, rather than merely referring to them, by using the “topic” link type. Such a link, contained within a post’s div, would look something like this:
<a rel="topic" href="topic address">topic name</a>
Similarly, if the topic’s page links to individual posts, it can indicate that they belong to the topic with a reverse link:
<a rev="topic" href="post address">post name</a>
The topic page would need to include the profile declaration for this link to be correctly interpreted.
Every page in the weblog can indicate that it is part of the weblog by linking to the main page using the “weblog” link type. Even the main page might link to itself; doing so would indicate to software that it is a weblog’s main page. These links can be made with an anchor in the page’s body:
<a rel="weblog" href="weblog address">weblog name</a>
Alternately, they can be done with a link element in the page’s head:
<link rel="weblog" href="weblog address">
It’s probably a good idea to include a link in the head even if there is also a link in the body, but there is no strong reason to prefer one to the other.
Linking your archives to your weblog does not explicitly connect your posts to your weblog. Software will presumably infer that if Post A is in Archive B and Archive B is in Weblog C, then Post A is probably in Weblog C. If you wish to explicitly connect your posts to your weblog, you must include a weblog link in each post.
If your weblog has titles for some or all of its posts, consider marking up the titles with heading tags (h1, h2, and so on). This is a good idea for a number of reasons, but the important one here is that the text content of the first heading encountered in a post will usually be interpreted as the post’s title. If your site design does not allow you to use real headings, then enclose the post’s title with a span element of the class “title”.
Similarly, you can use a span element with the class “author” to identify the author of a post or the class “date” to identify the date. Dates must be given in the format described in the W3C note as a specific day, minute, or second. Because that format is cumbersome, you may choose to present it as the value of the “title” attribute, which overrides the textual content of the span when present. For example, the common practice of concluding a post with the line “Posted by Author at Time” might be coded like this:
<p>Posted by <span class="author" title="Author’s name">Author’s nickname</span> at <span class="date" title="Full date">Time of day</span></p>
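If you generate dates by hand, a quick check of their shape can catch mistakes. The sketch below assumes that “the W3C note” is the W3C Date and Time Formats note (W3CDTF), and it accepts only the day, minute, and whole-second forms. The name `is_profile_date` is invented, and the pattern checks the layout only, not whether the month, day, or hour values are in range.

```python
import re

# YYYY-MM-DD, optionally followed by Thh:mm or Thh:mm:ss and a required
# time zone designator (Z or +hh:mm / -hh:mm).
W3CDTF = re.compile(
    r"\d{4}-\d{2}-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2}))?"
)


def is_profile_date(value):
    """True if value is shaped like a W3CDTF day, minute, or second."""
    return W3CDTF.fullmatch(value) is not None
```

A value such as `2002-07-04T10:30-05:00` passes this check, while a free-form date such as `July 4, 2002` does not.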
If your weblog’s main page includes a “blogroll”, a list of links to other, recommended sites, you can indicate that these links are recommendations using the “recommendation” link type. If your site puts those links on a separate page, you can link to that page using the “linksPage” link type.
Consider being more specific when making references. For example, if your post discusses a post or article, your link to that post could indicate that your post is a comment by declaring rev="comment". (Why a reverse link? It’s because your post is the comment, but rel="comment" says that the resource at the other end of the link is a comment.) Other relevant link types are “concurrence”, “rebuttal”, “pointer”, and “followup”.
When quoting from a web page, use the “cite” attribute to say where you got the quote from. If you’re quoting from another weblog, try to cite the specific post, and not the weblog or an archive. The “cite” attribute can be used with blockquote elements or the rare q element:
<blockquote cite="source of quote"> quote </blockquote>
If your weblog implements these conventions, a considerable amount of information will be available to software processors.