Software which works with discussion threads may use this vocabulary
as a standard way to exchange threading information. In addition, the
vocabulary can be used to store any number of posts from any discussion
forum in a standard way. All discussion venues are treated equally,
so data from multiple media may be combined into a single data set.
This allows one to treat all weblogs and message boards as though
they were a single forum.
Background
The goal of the thread description language is to describe all forms
of on-line discussion, including weblogs, message boards, Usenet, e-mail,
instant messages, and anything else which can be described in terms of
Posts.
The basic unit of an on-line conversation is the Post. A discussion
comprises a set of posts by various authors which are related to each
other. This set of related posts is a thread. Structurally, threads
can be either implicit or explicit, and they may be linear or forked.
In an explicit thread, the posts which compose the thread are
marked as being part of the same thread. Implicit threads, however,
must be derived from the pattern of references found among a set of
posts.
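As a rough illustration of deriving implicit threads, the sketch below groups posts into threads by treating any chain of references as a connection. The function name, the dictionary layout, and the example URIs are assumptions made for this sketch; the vocabulary itself does not prescribe an algorithm.

```python
def implicit_threads(refers_to):
    """Group posts into threads: two posts share a thread if a chain of
    references (followed in either direction) connects them."""
    # Build an undirected adjacency map from the directed references.
    neighbours = {}
    for post, targets in refers_to.items():
        neighbours.setdefault(post, set())
        for target in targets:
            neighbours[post].add(target)
            neighbours.setdefault(target, set()).add(post)

    threads, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        thread, stack = set(), [start]
        while stack:
            post = stack.pop()
            if post in thread:
                continue
            thread.add(post)
            stack.extend(neighbours[post] - thread)
        seen |= thread
        threads.append(thread)
    return threads


# Hypothetical posts: each key is a post URI, each value lists the URIs
# that the post refers to.
posts = {
    "http://example.org/a/1": [],
    "http://example.org/b/7": ["http://example.org/a/1"],
    "http://example.org/c/3": ["http://example.org/b/7"],
    "http://example.org/d/9": [],
}
for thread in implicit_threads(posts):
    print(sorted(thread))
```

Other groupings are possible (for example, one thread per root post); connected components are simply the most permissive choice.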
In a linear thread, each post follows another, as one
finds in an instant message conversation or “unthreaded” message
board. In a forked thread, any given Post may be followed by
multiple posts, forming a tree of responses. For maximum
flexibility, posts in a forked thread may also follow
multiple posts, in addition to being followed by multiple
posts.
The Thread Description Language is a set of RDF classes and properties
which are used to describe discussion threads and forums. By
implementing it in RDF, we gain interoperability with other vocabularies,
extensibility, and a well-defined serialization format, RDF-XML.
Each post is identified by a URI, and relations between posts are
expressed as properties. To represent the connections between posts in a
linear thread, we use the sequence properties, which are “next” and its
variants. For a forked thread, we use the reference properties, which are
“refersTo” and its variants.
- goal is to describe all forms of discussion threads:
weblogs, msg boards, usenet, mailing lists, &c
- use RDF as a standard way to describe things, not necessarily
for internal representation
- useful to have a standard way to express "X is last post of Y", even if
it's not useful to store it
- RDF-XML is a handy way to store these facts in a file
- this means that TDV can represent any discussion forum - ThreadML
Uses and applications:
- A model for those developing discussion-thread-based software
- Exchange format for threading information extracted from blogthreads,
message boards, etc.
- File format for storing threads from message boards, usenet, IM, etc
Examples, a linear thread, a weblog.
Methods
Some examples to give an idea of how existing discussion forums may be represented.
Weblogs
While individual weblogs might not be considered a discussion forum,
one can look at the weblog community as a whole as a giant, distributed
discussion. This is enabled by the use of URI references to identify
posts and hyperlinks to make references. Naturally, the permanent address
of each post (often called the “permalink”, although that conflates the
act of pointing with the object being pointed at) would be used to identify
each post. The hyperlinks contained within the post serve as references.
(This raises the question of how one determines which hyperlinks a given
post contains. This is beyond the scope of this document, although the
associated coding convention describes one such method.)
The universe of weblogs (or “blogosphere”) is implicitly threaded,
and several proposals exist for standard ways of indicating explicit
threads. Standard hyperlinks imply the “refersTo” property. (Again,
the coding convention describes how other references may be specified.)
Any given weblog post belongs to a Weblog, although that membership
may not be derivable from the post’s encoding. Most weblog posts are
encoded in some variant of HTML,
but for minimal confusion it is recommended that the value of the
“content” property be given as an XHTML 1.1 fragment.
This avoids incompatibility with RDF-XML, and should not result in any
loss of information.
Linear message board
In a linear, or “unthreaded”, message board, each post follows another in
chronological sequence. If individual posts can be assigned URI references, then
they can be represented as explicit, linear threads.
Each thread is represented by a Topic, and the first post is
identified by the Topic’s “first” property. The posts use
“next” to indicate the following post. For an example of a Topic
encoded this way, see this
RDF-XML transcription of a Quick Topic thread.
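The way a reader of such a Topic might recover the post order can be sketched as follows. This is only an illustration: the function name, the dictionary standing in for the “next” statements, and the example URIs are all hypothetical.

```python
def posts_in_order(first, next_of):
    """Return the posts of a linear thread, starting at the Topic's "first"
    post and following "next" until no further post is given."""
    ordered, seen = [], set()
    post = first
    while post is not None and post not in seen:  # guard against loops
        ordered.append(post)
        seen.add(post)
        post = next_of.get(post)
    return ordered


# Hypothetical "next" statements for a three-post Topic.
next_of = {
    "http://example.org/topic/1#m1": "http://example.org/topic/1#m2",
    "http://example.org/topic/1#m2": "http://example.org/topic/1#m3",
}
print(posts_in_order("http://example.org/topic/1#m1", next_of))
```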
Forked message board
In a forked, or “threaded”, message board, each post either begins
a thread (represented here by a Topic) or replies to an existing post.
They are explicitly threaded, because each post is associated with a
specific thread. The threads themselves are forked, because posts may
have multiple responses. (Some forked message boards also provide
a chronological ordering for posts, allowing the thread to be viewed
as a tree or a sequence.)
As with linear boards, each thread is
represented by a Topic, and the first post in the thread is identified
by the Topic’s “first” property. The posts relate to each other through
“refersTo” (at a minimum; a message board can probably assume “commentsOn”
or better for direct replies) and possibly also through “next”. The
Topic will usually be part of a Forum. In large message boards, the
Forums may themselves be organized into larger Forums.
Weblog with comments
Many weblogs have a comments feature which allows readers to respond
to posts within the weblog itself. These comments add to the pre-existing
inter-weblog discussion, and can potentially be referenced themselves
by other weblog or message board posts. Interestingly, one could
consider news sites which feature “talkback” to be examples of this
pattern.
A set of comments to a single weblog post constitutes an explicit
thread which may be forked, linear, or both depending on the message
board setup. In this case, the original post does double duty as a
Post and a Topic. If the comments are linear, the Topic’s “first”
property identifies the first response. Otherwise, the Posts in the
Topic use “refersTo” to indicate whether they are responding to the
weblog post or to another post in the Topic.
A minimal example to illustrate a Post/Topic:
<Post rdf:about="http://example.org/blog/455">
  <dc:title>What do you think?</dc:title>
  <inWeblog rdf:resource="http://example.org/blog"/>
  <first rdf:resource="http://example.org/blog/455#m1"/>
</Post>
<Post rdf:about="http://example.org/blog/455#m1">
  <inTopic rdf:resource="http://example.org/blog/455"/>
  <next rdf:resource="http://example.org/blog/455#m2"/>
</Post>
<Post rdf:about="http://example.org/blog/455#m2">
  <inTopic rdf:resource="http://example.org/blog/455"/>
</Post>
Note that “http://example.org/blog/455” is not explicitly identified as a
Topic; this is all right, as most software would be able to figure it out
as it has the “first” property and is the value of several posts’ “inTopic”
property. However, different implementations may present different
information or the same information in different ways.
Note also that all three posts in that example are part of the same
document. This is also not required. All that is important is that the
three URI references
are different.
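One way software might “figure it out” is sketched below, assuming the statements are available as (subject, property, value) tuples. The tuple representation and the function name are assumptions of this sketch. Note that “first” may also appear on forums and weblogs, so an application may want to treat it as weaker evidence than “inTopic”.

```python
def infer_topics(triples):
    """Collect resources that look like Topics: anything that has a "first"
    property or that appears as the value of some post's "inTopic"."""
    topics = set()
    for subject, prop, value in triples:
        if prop == "first":
            topics.add(subject)
        elif prop == "inTopic":
            topics.add(value)
    return topics


# The statements from the Post/Topic example, written as tuples.
triples = [
    ("http://example.org/blog/455", "inWeblog", "http://example.org/blog"),
    ("http://example.org/blog/455", "first", "http://example.org/blog/455#m1"),
    ("http://example.org/blog/455#m1", "inTopic", "http://example.org/blog/455"),
    ("http://example.org/blog/455#m1", "next", "http://example.org/blog/455#m2"),
    ("http://example.org/blog/455#m2", "inTopic", "http://example.org/blog/455"),
]
print(infer_topics(triples))
```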
Usenet
Each Usenet message is required to have a unique message ID. This forms the basis
of part of the news: URI scheme.
A message with the ID “1998090902325900.WAA04282@example.org” has the address
“news:1998090902325900.WAA04282@example.org”. Specific Usenet newsgroups are also
given unique names such as “alt.example” or “rec.arts.tv.mst3k.misc”. These are
similarly represented by the news: URI
scheme as “news:alt.example” and “news:rec.arts.tv.mst3k.misc”. Message IDs
always contain a commercial at-sign (“@”), and newsgroup names never contain
one, so there is no possibility of confusing a message ID and a newsgroup name.
Newsgroups are implicitly threaded. Each message contains a header specifying
the posts it refers to (ie, posts earlier in its thread). This header corresponds
to the “refersTo” property, and the messages listed are its values.
The newsgroups themselves are represented as Topics, and Usenet as a whole
can be thought of as a Forum. Crossposted messages have multiple values for
“inTopic”.
An example message posted to “alt.example” (some headers omitted):
Newsgroups: alt.example
Subject: Re: Test
From: Mr Example <example@example.org>
Date: 20 July 2002 19:08:16 -0800
Message-ID: <1998090902325900.WAA04282@example.org>
References: <v03007802af318a543ec6@hypothetical.com>
    <qumwws2zd6s.fsf@apocryphal.edu>

> This message is posted to alt.example.
Yes, that appears to be true.
This can be represented like so:
<Post rdf:about="news:1998090902325900.WAA04282@example.org">
  <dc:title>Re: Test</dc:title>
  <dc:creator>Mr Example &lt;example@example.org&gt;</dc:creator>
  <dc:date>2002-07-20T19:08:16-08:00</dc:date>
  <inTopic rdf:resource="news:alt.example"/>
  <refersTo rdf:resource="news:v03007802af318a543ec6@hypothetical.com"/>
  <refersTo rdf:resource="news:qumwws2zd6s.fsf@apocryphal.edu"/>
  <content xml:space="preserve">
&gt; This message is posted to alt.example.
Yes, that appears to be true.
</content>
</Post>
Although the Usenet headers only provide for references made between
Usenet messages, applications are free to infer additional references
from URIs contained
in the message text.
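The mapping from Usenet headers to properties described above might be sketched like this. The function names, the plain dictionary of headers, and the regular expression are assumptions for illustration; a real application would probably use a proper message parser.

```python
import re

# Message IDs appear in angle brackets and always contain "@".
MESSAGE_ID = re.compile(r"<([^<>\s]+@[^<>\s]+)>")


def news_uri(name):
    """Build a news: URI from a message ID or a newsgroup name."""
    return "news:" + name


def is_message_id(name):
    """Message IDs contain "@"; newsgroup names never do."""
    return "@" in name


def usenet_properties(headers):
    """Translate the Newsgroups and References headers into
    (property, value) pairs for a Post."""
    props = []
    for group in headers.get("Newsgroups", "").split(","):
        if group.strip():
            props.append(("inTopic", news_uri(group.strip())))
    for message_id in MESSAGE_ID.findall(headers.get("References", "")):
        props.append(("refersTo", news_uri(message_id)))
    return props


headers = {
    "Newsgroups": "alt.example",
    "References": "<v03007802af318a543ec6@hypothetical.com>"
                  " <qumwws2zd6s.fsf@apocryphal.edu>",
}
for prop, value in usenet_properties(headers):
    print(prop, value)
```

A crossposted message simply yields one “inTopic” pair per newsgroup listed.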
[@@ there should be a note somewhere about MIME and content]
E-mail messages
Like Usenet messages, e-mails contain message identifiers
which are required to be unique. These form the basis of the mid:
URI scheme, as with
“mid:528FA5637F7A16419AD8FC006128E6DCBC6836@example.org”. Thus, they too
can be represented as posts. Additionally, some mail clients include a
references header when making a reply (although this is not common
practice) which allows for some “refersTo” properties to be inferred.
E-mail as a whole is implicitly threaded (to the extent that
references can be inferred from context). Mailing lists can be represented
as Topics, particularly if they include the List-URL header or otherwise
have a unique address. Because mailing lists are centrally managed, they
can have sequential and forked threading.
Instant Messages
While there do not appear to be any standards for identifying instant
messages, there are some defined URN
schemes which are sufficiently decentralized to be useful. One could, for example,
assign each instant message a UUID,
which are defined in such a way that it is highly improbable for two
items to be given the same identifier.
Instant messages are implicitly threaded and sequential, but they can be
organized into Topics. One (non-optimal) way to do it is to explicitly
identify the topic when saving an instant messaging conversation: all the
messages being saved are considered part of the Topic. The Topic is given
a UUID, and each message is declared a Post. The Posts can be given
fragment identifiers based on the Topic’s address.
An example of a very brief conversation:
<Topic rdf:about="urn:uuid:1234-5678-90abcdef1234-5678">
  <dc:title>IM conversation between Mr H and Mr A</dc:title>
  <dc:contributor>Mr H</dc:contributor>
  <dc:contributor>Mr A</dc:contributor>
  <first rdf:resource="urn:uuid:1234-5678-90abcdef1234-5678#m1"/>
</Topic>
<Post rdf:about="urn:uuid:1234-5678-90abcdef1234-5678#m1">
  <dc:creator>Mr H</dc:creator>
  <dc:date>2002-02-17T13:00:06Z</dc:date>
  <inTopic rdf:resource="urn:uuid:1234-5678-90abcdef1234-5678"/>
  <next rdf:resource="urn:uuid:1234-5678-90abcdef1234-5678#m2"/>
  <content rdf:parseType="Literal">Gosh, I <html:b>hate</html:b> the rain.</content>
</Post>
<Post rdf:about="urn:uuid:1234-5678-90abcdef1234-5678#m2">
  <dc:creator>Mr A</dc:creator>
  <dc:date>2002-02-17T13:00:48Z</dc:date>
  <inTopic rdf:resource="urn:uuid:1234-5678-90abcdef1234-5678"/>
  <content rdf:parseType="Literal">Without it, the flowers would die.</content>
</Post>
This method for representing instant messaging conversations is only
one possible way to apply the thread description language, and it does
have the disadvantage that, if both parties export the conversation,
they will assign different URIs
to the Topic and Posts. While a superior method will undoubtedly present itself
in the future, this one is good enough to put IM
on the same footing as weblogs, message boards, Usenet, and e-mail.
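The UUID-based scheme described above can be sketched as follows. The function name and the “#m1”, “#m2” fragment pattern (borrowed from the example) are assumptions; the sketch also makes the stated disadvantage visible, since every export produces a fresh identifier.

```python
import uuid


def new_conversation(message_count):
    """Assign a random UUID-based URN to a Topic and derive one fragment
    identifier per message from it."""
    topic = uuid.uuid4().urn  # a "urn:uuid:..." string, different on each call
    posts = [f"{topic}#m{n}" for n in range(1, message_count + 1)]
    return topic, posts


topic, posts = new_conversation(2)
print(topic)
print(posts)

# Two independent exports of the same conversation do not agree on URIs.
other_topic, _ = new_conversation(2)
print(topic == other_topic)
```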
Web-based archives
While weblog and message-board authors are free to link directly to
Usenet and e-mail messages, they generally will not because browsers cannot
dereference news: and mid: URIs.
Thus, a method is needed to identify some
URIs as aliases
for other URIs.
The appropriate choice here is probably daml:sameIndividualAs [@@sp?].
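How an application might use such alias statements is sketched below. The archive URL is hypothetical, and treating one URI of each pair as the preferred one is an assumption of this sketch rather than something the vocabulary requires.

```python
def canonical(uri, same_as):
    """Follow alias links until reaching a URI with no further alias."""
    seen = set()
    while uri in same_as and uri not in seen:  # stop if aliases form a loop
        seen.add(uri)
        uri = same_as[uri]
    return uri


# Hypothetical alias: a web archive page standing in for an e-mail message.
same_as = {
    "http://archive.example.org/msg/1234": "mid:1234@example.org",
}

# A weblog post links to the archive page; the reference is recorded
# against the message itself.
links = ["http://archive.example.org/msg/1234", "http://example.org/blog/455"]
print([canonical(link, same_as) for link in links])
```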
Properties
Cataloging
Rather than create a new vocabulary to describe common properties
such as titles, authors’ names, and so forth, we specify use of the
Dublin Core Metadata Element Set. [@@ namespace, use of “dc:”]
Some of the Dublin Core elements likely to be applied to posts,
topics, forums, and weblogs are:
- dc:title
- The title, name, or subject line of a post, topic, forum, or
weblog. As a rule, this should contain only information unique to
the resource, so “Re: Nixon’s dog” is fine but “SoAndSo Discussion
Forum—Re: Nixon’s dog” is probably not.
- dc:creator
- A string identifying the author or authors of a post, or the
creator of a topic, forum, or weblog. This could be a name, a
nickname, or some other identifying string. (If the intent is for
others to know what it means, don’t be too clever.)
- dc:date
- A string specifying the publication time of a post. It should
be formatted according to the ISO 8601 profile specified in
the W3C date/time note as a day, minute, or second. Times are
interpreted to mean “sometime in that period”, not “the start of
that period”. Thus, “2002-07-20” means any time during July 20,
2002. (Note that timezone information is required for minutes and
seconds.)
- dc:description
- A string or XHTML fragment describing a post, topic, forum, or weblog.
Note that this should not be used to present the content of the resource;
use “content” for that. See the description of “content” for ways of
including complex strings in RDF-XML.
- dc:contributor
- A string identifying someone who contributed to a thread, forum,
or weblog. Similar to dc:creator in syntax.
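The “sometime in that period” reading of dc:date can be illustrated by turning a value into a half-open interval. This is a sketch under assumptions: the function name and the three format strings are chosen here to cover the day, minute, and second precisions mentioned above, and the timezone handling relies on Python 3.7 or later.

```python
from datetime import datetime, timedelta

# (format, length of the period the value stands for)
FORMATS = [
    ("%Y-%m-%d", timedelta(days=1)),
    ("%Y-%m-%dT%H:%M%z", timedelta(minutes=1)),
    ("%Y-%m-%dT%H:%M:%S%z", timedelta(seconds=1)),
]


def date_period(value):
    """Return (start, end) such that the post was published at some time t
    with start <= t < end."""
    for fmt, length in FORMATS:
        try:
            start = datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next precision
        return start, start + length
    raise ValueError(f"unsupported dc:date value: {value!r}")


print(date_period("2002-07-20"))
print(date_period("2002-07-20T19:08:16-08:00"))
```

Day values carry no timezone here, so the resulting interval is a naive one; how to compare it with timezone-aware values is left to the application.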
Membership and containment
Four relations indicate that a resource is part of or belongs to a larger
resource.
- inArchive
- The archive where a post is permanently located.
- inTopic
- A topic to which a post belongs.
- inForum
- A forum to which a post, archive, topic, or smaller forum belongs.
- inWeblog
- A weblog to which a post, archive, topic, or forum belongs.
Five relations indicate smaller resources contained within a larger one.
- hasPost
- A post located in or part of an archive, topic, forum, or weblog.
- hasArchive
- An archive belonging to a forum or weblog.
- hasTopic
- A topic that is part of a forum or weblog.
- hasForum
- A forum that is part of a larger forum or weblog.
- hasWeblog
- A weblog that is part of some larger resource.
[@@ what is lost if these are reduced to just “in” and “has”?]
References
These properties apply to a post and describe the references it
makes.
- refersTo
- A resource to which this post refers.
- followsUp
- A post which this post corrects or updates.
- commentsOn
- A resource which this post discusses or responds to.
- agreesWith
- A resource which this post agrees with or amplifies.
- disagreesWith
- A resource which this post rebuts or presents evidence
contrary to.
- pointsTo
- A resource which this post refers to but does not discuss.
- quotes
- A resource which this post quotes.
These properties apply to any resource and identify a post
which refers to it in some manner.
- referredToBy
- A post which refers to this resource.
- followedUpBy
- A post which updates or corrects this post.
- commentedOnBy
- A post which discusses or responds to this resource.
- agreedWithBy
- A post which agrees with or amplifies this resource.
- disagreedWithBy
- A post which rebuts or presents evidence contrary to this resource.
- pointedToBy
- A post which refers to this resource but does not discuss it.
- quotedBy
- A post which quotes this resource.
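Reading the two lists above as inverse pairs, which their descriptions suggest, an application could fill in the second set of properties from the first. The tuple representation and function name below are assumptions of this sketch.

```python
# Each reference property paired with the property that reads it backwards.
INVERSE = {
    "refersTo": "referredToBy",
    "followsUp": "followedUpBy",
    "commentsOn": "commentedOnBy",
    "agreesWith": "agreedWithBy",
    "disagreesWith": "disagreedWithBy",
    "pointsTo": "pointedToBy",
    "quotes": "quotedBy",
}


def with_inverses(triples):
    """Return the given (subject, property, value) statements plus the
    inverse statement for every reference property found."""
    result = set(triples)
    for subject, prop, value in triples:
        if prop in INVERSE:
            result.add((value, INVERSE[prop], subject))
    return result


statements = [
    ("http://example.org/blog/456", "commentsOn", "http://example.org/blog/455"),
]
for statement in sorted(with_inverses(statements)):
    print(statement)
```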
Sequence
In addition to the graph formed by inter-post references, posts
can also be organized in an order, as occurs in a linear thread.
- first
- The first post or archive of a topic, forum, or weblog.
- last
- The last post or archive of a topic, forum, or weblog.
- next
- A post or archive which follows this post or archive in
a topic, forum, or weblog.
- prev
- A post or archive which precedes this post or archive in
a topic, forum, or weblog.
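Since “prev” and “last” can be computed from “first” and “next”, an application need not store all four. A minimal sketch, with a hypothetical function name and a dictionary standing in for the “next” statements:

```python
def derive_prev_and_last(first, next_of):
    """Compute the "prev" relation and the "last" post of a sequence from
    its "first" post and its "next" relation."""
    # "prev" is "next" read backwards.
    prev_of = {following: post for post, following in next_of.items()}
    # "last" is the post reached from "first" that has no "next".
    last, seen = first, {first}
    while last in next_of and next_of[last] not in seen:
        last = next_of[last]
        seen.add(last)
    return prev_of, last


prev_of, last = derive_prev_and_last("#m1", {"#m1": "#m2", "#m2": "#m3"})
print(prev_of)
print(last)
```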
Content
There are several useful applications which require representing
the actual content of a post, such as storing a thread in a
self-contained file. Rather than define a new file format, we
stretch the meaning of “metadata” slightly and declare the
“content” property.
- content
- An XML fragment containing the content unique to this post.
Note that the value is described as an XML fragment, not a text string.
This is because the content of many posts will be best described in XML
(or languages such as HTML which have XML equivalents).
Some guidelines are in order to avoid a situation like RSS, where HTML is
escaped and reencoded in XML.
To represent arbitrary XML content in RDF-XML, RDF defines the
rdf:parseType="Literal" attribute, which indicates to the RDF parser that
the contents of an element should not be parsed for further RDF statements.
<Post rdf:about="http://example.org/blog/455">
  <content rdf:parseType="Literal">
    <html:p>This post contains two paragraphs.</html:p>
    <html:p>This is the <html:em>second</html:em> paragraph.</html:p>
  </content>
</Post>
In this particular example, the post’s content is an XHTML fragment
(assuming that the “html” namespace prefix is defined appropriately
elsewhere in the document). Implementers should be aware of two points:
- The meaning of an XML fragment is dependent on what namespace prefixes
are declared. Thus, regular expressions and other text-based, non-parsing
approaches to working with XML will not always work as expected. Similarly,
HTML content must be expressed in well-formed XML (this can be done with no
loss of information, because XHTML includes all HTML elements).
- The content of the post must survive XML processing, so any elements
containing semantic whitespace (i.e., where spacing is important) must warn
the parser that the spacing is significant by using xml:space="preserve".
This includes the HTML pre element, as one can’t expect general XML tools
to have special knowledge of the XHTML namespace.
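The first point can be illustrated with a namespace-aware parser. In the sketch below, two fragments use different prefixes for the XHTML namespace; a text search for “html:em” would miss the second, while a lookup by namespace finds both. The helper name and the sample fragments are made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# The same content written with two different namespace prefixes.
doc_a = f'<content xmlns:html="{XHTML}"><html:em>second</html:em></content>'
doc_b = f'<content xmlns:h="{XHTML}"><h:em>second</h:em></content>'


def emphasised_text(xml_text):
    """Return the text of every XHTML em element, whatever prefix is used."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{{{XHTML}}}em")]


print(emphasised_text(doc_a))
print(emphasised_text(doc_b))
print("html:em" in doc_b)  # a plain text search depends on the prefix
```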
Character strings containing no XML markup can still be considered XML
fragments, which is useful for describing posts such as e-mail and Usenet
messages. Because the post content will undergo XML parsing, any reserved
characters (“<”, “>”, and “&”) must be escaped and the xml:space="preserve"
attribute should be used to preserve whitespace-based formatting. To produce
readable markup, applications may insert newlines before and after
the post content. (If a post begins or ends with a newline and
that newline is considered important, then an additional newline
must be inserted so that parsers will not strip it out.)
<Post rdf:about="mid:1234@example.org">
  <content xml:space="preserve">
Mr Hypothetical writes:
&gt; Where is the AT&amp;T website?

Try &lt;http://www.att.com/&gt;.

</content>
</Post>
Software which understands the content property would discard the initial and
final newlines, leaving this message (newlines are marked “\n”):
Mr Hypothetical writes:\n
> Where is the AT&T website?\n
\n
Try <http://www.att.com/>.\n
Note that the final </content>
is not indented.
This is because the last character in the element is a newline. If
it had been indented, then the two newlines and the whitespace used to
indent the tag would have been included in the post content.
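A minimal sketch of writing and reading plain-text content under these rules follows, assuming the writer always adds exactly one newline on each side and the reader always removes exactly one. The function names are hypothetical, and the element is produced without its surrounding Post for brevity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def encode_content(text):
    """Escape reserved characters and wrap the text in one extra newline
    on each side, so that the closing tag starts at the left margin."""
    return f'<content xml:space="preserve">\n{escape(text)}\n</content>'


def decode_content(xml_text):
    """Parse the element and discard the single initial and final newline
    added by encode_content."""
    text = ET.fromstring(xml_text).text or ""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


message = (
    "Mr Hypothetical writes:\n"
    "> Where is the AT&T website?\n"
    "\n"
    "Try <http://www.att.com/>.\n"
)
encoded = encode_content(message)
print(encoded)
print(decode_content(encoded) == message)
```

Because only one newline is stripped from each end, a message that itself begins or ends with a newline survives the round trip.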
The purpose of the content property is to represent the content of
a post, so Usenet and e-mail headers should not be included. If the information
in the headers is deemed important and not covered by an existing
RDF property,
then a new property should be created.
Weblog-specific
These properties identify elements found in many weblogs.
- recommends
- A resource, such as another weblog, which is linked to
in a prominent place in a weblog (often called the “blogroll”).
- linksPage
- A page which lists recommended sites, often but not always
the same as the front page of a weblog.
- currentPosts
- A sequence (rdf:Seq) of posts which are considered to be
“current”. For example, the posts currently present in the weblog’s
front page can be considered current.
- rssChannel
- An RSS feed which
may be associated with the weblog, usually to list or syndicate
current posts.
Profiles
[@@Note: this whole section is pretty experimental]
Because RDF and the
thread description language are so flexible, it is not simply enough to state
that a resource contains metadata for a weblog or thread. Different applications
will use different subsets of the universe of possible statements one could make.
For example, someone describing the current posts in a weblog can choose to
include the content of those posts or only the information about them.
Profiles provide a way to indicate what data is being specified.
Profiles are identified by URI.
The profiles described in this document are:
- http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blog-synd-content
- The “current” posts in a weblog and their content. (Similar to
RSS when used for syndication.)
- http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blog-synd-data
- The “current” posts in a weblog, but not their content. (Similar to
RSS when used as a summary.)
- http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blog-data
- General data about a weblog, such as locations of alternate “feeds”.
- http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blogroll
- The weblog’s name and the resources it recommends.
- http://www.eyrie.org/~zednenem/2002/web-threads/#prof-thread-content
- Every post in a thread or topic and their content.
- http://www.eyrie.org/~zednenem/2002/web-threads/#prof-thread-data
- Every post in a thread or topic, but not their content.
We define a “profile” property, which is used to indicate the profile of a
given resource. This can be used to distinguish multiple RDF-XML formatted
alternate versions of a resource.
For example, assume the weblog “http://example.org/blog” has three
XML feeds in addition to its default HTML representation:
an RSS feed, “http://example.org/blog.rss”;
a description of the current posts in thread description language,
“http://example.org/blog.rdf”;
and a description and the content of the current posts in thread
description language, “http://example.org/blog.synd”. These three
resources are all RDF-XML documents and all describe the current state
of the weblog.
A fifth resource, “http://example.org/blog.meta”, gives metadata about the
weblog itself, such as its name and the existence of the three available
XML versions. It does so using the concept of representations discussed in
“Generic Resources” and using the “profile” property to distinguish them.
Each is considered a representation of the weblog itself; that is, the
weblog presented in an alternate format.
<Weblog rdf:about="http://example.org/blog">
  <dc:title>Example weblog</dc:title>
  <rssChannel rdf:resource="http://example.org/blog.rss"/>
</Weblog>
<u:RepresentationInvariant rdf:about="http://example.org/blog.meta">
  <u:isRepresentationOf rdf:resource="http://example.org/blog"/>
  <dc:format>application/rdf+xml</dc:format>
  <profile rdf:resource="http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blog-data"/>
</u:RepresentationInvariant>
<u:RepresentationInvariant rdf:about="http://example.org/blog.rdf">
  <u:isRepresentationOf rdf:resource="http://example.org/blog"/>
  <dc:format>application/rdf+xml</dc:format>
  <profile rdf:resource="http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blog-synd-data"/>
</u:RepresentationInvariant>
<u:RepresentationInvariant rdf:about="http://example.org/blog.synd">
  <u:isRepresentationOf rdf:resource="http://example.org/blog"/>
  <dc:format>application/rdf+xml</dc:format>
  <profile rdf:resource="http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blog-synd-content"/>
</u:RepresentationInvariant>
(Ideally, profiles would be indicated in the MIME-type itself,
rather than requiring a separate profile property. Admittedly,
“application/rdf+xml; profile=http://www.eyrie.org/~zednenem/2002/web-threads/#prof-blog-data”
is cumbersome, but it would allow the use of profiles in
HTTP content negotiation.)
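Once the “profile” statements have been read, choosing among the alternate versions is a simple lookup. The sketch below assumes they have been collected into a dictionary; the variable and function names are hypothetical.

```python
BASE = "http://www.eyrie.org/~zednenem/2002/web-threads/"

# Representation URI -> profile URI, as stated in the example metadata.
representations = {
    "http://example.org/blog.meta": BASE + "#prof-blog-data",
    "http://example.org/blog.rdf": BASE + "#prof-blog-synd-data",
    "http://example.org/blog.synd": BASE + "#prof-blog-synd-content",
}


def find_representation(representations, wanted_profile):
    """Return the first representation declared with the wanted profile,
    or None if the weblog offers no such version."""
    for uri, profile in representations.items():
        if profile == wanted_profile:
            return uri
    return None


# A reader that wants full post content asks for the content profile.
print(find_representation(representations, BASE + "#prof-blog-synd-content"))
```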
Syndicating weblogs
RSS is commonly used to present
weblogs in a standard format, allowing for more flexible treatment of their
content. There are several software packages which display the current
headlines of a site based on its RSS
feed, for example.
Unfortunately, RSS is used
both as a way to describe the content of a web site and as a way to
present that content. It is impossible to tell which strategy a
given feed uses. Worse yet, the attempts to use
RSS for presenting content
usually involve mis-using the “description” property and encoding
HTML content as though
it were a text string.
The thread description language can also be used to syndicate weblogs,
and we provide two profiles to distinguish feeds which describe posts from
feeds which describe and contain posts.
Both profiles require a Weblog resource for each weblog being described.
These Weblogs MUST include the dc:title and currentPosts properties, and
SHOULD include dc:description, dc:creator, and dc:contributor as
appropriate. The value of currentPosts is a sequence of Posts which
are considered “current”.
Both profiles also require a Post resource for each post listed in
currentPosts. These Posts MUST include dc:title and dc:date and SHOULD
include dc:creator and dc:contributor as appropriate. Each post SHOULD also
note the resources it references. Posts are not
required to note resources which refer to them, as that information may
be unavailable or extensive.
Posts are not required to indicate sequence or
membership in Archives or Weblogs; that information is implicit.
Similarly, the Weblog should not indicate which archives, forums,
topics, or posts it contains.
For the #prof-blog-synd-content
profile, each Post MUST include the content property. For the
#prof-blog-synd-data profile,
each Post MUST NOT include the content property.
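The MUST-level rules for Posts in the two syndication profiles can be checked mechanically. The sketch below assumes each Post has been reduced to a dictionary keyed by property name; that representation, the constant names, and the wording of the messages are all assumptions made for illustration.

```python
SYND_CONTENT = "#prof-blog-synd-content"
SYND_DATA = "#prof-blog-synd-data"


def check_post(post, profile):
    """Return a list of problems for one Post under a syndication profile.
    Only the MUST and MUST NOT rules are checked, not the SHOULDs."""
    problems = []
    for required in ("dc:title", "dc:date"):
        if required not in post:
            problems.append(f"missing {required}")
    has_content = "content" in post
    if profile == SYND_CONTENT and not has_content:
        problems.append("content is required by this profile")
    if profile == SYND_DATA and has_content:
        problems.append("content is not allowed by this profile")
    return problems


post = {"dc:title": "What do you think?", "dc:date": "2002-07-20"}
print(check_post(post, SYND_DATA))
print(check_post(post, SYND_CONTENT))
```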
Describing weblogs
While the syndication profiles focus on
a subset of a weblog’s posts, the #prof-blog-data
and #prof-blogroll profiles describe
the weblog itself.
The #prof-blog-data profile is used to
describe the weblog as a whole; no information about individual posts is
given. It requires a Weblog resource for each weblog being described.
These Weblogs MUST include the dc:title property and SHOULD include
dc:description, dc:creator, dc:contributor, linksPage, rssChannel,
recommends, hasForum, and hasTopic as appropriate.
The hasTopic property is used to indicate categories which the weblog
uses to organize posts. There should be a Topic resource for each
category. Each Topic MUST include a dc:title property and SHOULD
include dc:description.
The #prof-blogroll profile is a subset
of #prof-blog-data used to describe which
resources a given weblog recommends. It requires a Weblog resource for
each weblog being described. Each Weblog MUST include the dc:title
property and a recommends property for each recommended resource.
Representing threads
[@@ not yet written]