oreilly.com

The Power of Metadata

Resource description

Until recently, the means available to content providers for describing the resources they make available on the Web have been inconsistent at best. About the only consistent metadata in an HTML document is the <title> element, which offers no more than a hint as to the content of the page. HTML's <meta> element is supposed to provide a method for embedding arbitrary metadata -- but that creates more of a problem than a solution, because applications, books, articles, tutorials, and standards bodies alike offer little guidance as to what good metadata should look like and how best to express it.

The work of the aforementioned Dublin Core offers a wonderful start. The Dublin Core Metadata Element Set is a set of 15 elements (title, description, creator, date, publisher, etc.) that are useful in describing almost any web resource. Rather than attempt to define semantics for specific instances and situations, the DCMI focused on the commonalities found in resources of various shapes and flavors. The Dublin Core may just as easily be used to describe "a journal article in PDF format," "an MPEG encoding of an episode of Buffy the Vampire Slayer recorded on a hacked TiVo," or "a healthcare speech given by the U.S. President on March 2, 2000."
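
As a rough illustration of how small the element set is, the sketch below lists the 15 element names of Dublin Core 1.1 and checks a record against them. The unknown_elements helper and the sample record are hypothetical, introduced only for this example.

```python
# The 15 elements of the Dublin Core Metadata Element Set, version 1.1.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def unknown_elements(record):
    """Return the sorted field names in `record` that are not Dublin Core elements."""
    return sorted(name for name in record if name not in DC_ELEMENTS)


# Hypothetical record describing a journal article in PDF format.
article = {
    "title": "An Example Journal Article",
    "creator": "A. N. Author",
    "format": "application/pdf",
    "isbn": "not a Dublin Core element",
}

# Only "isbn" falls outside the element set; it would need its own vocabulary.
print(unknown_elements(article))
```

A field such as an ISBN is not wrong; it simply belongs to a more specialized vocabulary than the Dublin Core.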

Example 1 shows a typical appearance of Dublin Core metadata in a fragment of HTML. Each <meta> tag contains an element of metadata defined by Dublin Core.

Example 1: Dublin Core metadata in an HTML document


<html>
  <head>
    <title>Distributed Metadata</title>
    <meta name="description" content="This article addresses...">
    <meta name="subject" content="metadata, rdf, peer-to-peer">
    <meta name="creator" content="Dan Brickley and Rael Dornfest">
    <meta name="publisher" content="O'Reilly &amp; Associates">
    <meta name="date" content="2000-10-29T00:34:00+00:00">
    <meta name="type" content="article">
    <meta name="language" content="en-us">
    <meta name="rights" content="Copyright 2000, O'Reilly &amp; Associates, Inc.">
    ...
  </head>
  ...

While useful up to a point, the original HTML mechanism for embedding metadata has proven limited. There is no built-in convention to control the names given to the various embedded metadata fields. As a consequence, HTML <meta> tags can be ambiguous: we don't know which sense of "title" or "date" is being used.
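
To make the ambiguity concrete, here is a minimal sketch that reads <meta> tags from a fragment like Example 1 using Python's standard html.parser module. The MetaCollector class and the shortened sample markup are assumptions made for illustration, not part of any real crawler.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.fields[name] = attributes.get("content")


# Shortened, hypothetical version of the HTML fragment in Example 1.
SAMPLE = """
<html>
  <head>
    <title>Distributed Metadata</title>
    <meta name="creator" content="Dan Brickley and Rael Dornfest">
    <meta name="date" content="2000-10-29T00:34:00+00:00">
  </head>
</html>
"""

collector = MetaCollector()
collector.feed(SAMPLE)
collector.close()

# The parser hands back bare strings such as "date"; nothing in the markup
# says whose definition of "date" is meant.
for name, content in collector.fields.items():
    print(name, "=", content)
```

All the application receives is the string "date". Whether that means a creation date, a publication date, or a modification date is left to guesswork.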

XML represents another evolution in web architecture, and along with XML come namespaces. Example 2 illustrates some namespaces in use. Like peer-to-peer, namespaces exemplify decentralization. We can now mix descriptive elements defined by independent communities, without fear of naming clashes, since each piece of data is tied to a URI that provides a context and definition for it.

Example 2: Dublin Core metadata in an XML document


<?xml version="1.0" encoding="iso-8859-1"?>
 
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns="http://purl.org/rss/1.0/"
>
...
  <item rdf:about="http://www.oreillynet.com/.../metadata.html">
    <title>Distributed Metadata</title>
    <link>http://www.oreillynet.com/.../metadata.html</link>
    <dc:description>This article addresses...</dc:description>
    <dc:subject>metadata, rdf, peer-to-peer</dc:subject>
    <dc:creator>Dan Brickley and Rael Dornfest</dc:creator>
    <dc:publisher>O'Reilly &amp; Associates</dc:publisher>
    <dc:date>2000-10-29T00:34:00+00:00</dc:date>
    <dc:type>article</dc:type>
    <dc:language>en-us</dc:language>
    <dc:format>text/html</dc:format>
    <dc:rights>Copyright 2000, O'Reilly &amp; Associates, Inc.</dc:rights>
    ...
  </item>
  ...

In the example above, Dublin Core elements are prefixed with the namespace prefix dc:. The prefix is associated with the URI http://purl.org/dc/elements/1.1/ by the xmlns:dc declaration at the beginning of the document. dc:subject is therefore understood to mean "the subject element in the dc namespace as defined at http://purl.org/dc/elements/1.1/."
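
A namespace-aware parser performs this expansion for us. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened version of the document above; the rdf:about URL is a placeholder on example.org, not the article's real address.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RSS = "http://purl.org/rss/1.0/"

# Shortened version of Example 2; the rdf:about value is a placeholder.
SAMPLE = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/metadata.html">
    <title>Distributed Metadata</title>
    <dc:subject>metadata, rdf, peer-to-peer</dc:subject>
  </item>
</rdf:RDF>
"""

root = ET.fromstring(SAMPLE)

# ElementTree replaces each prefix with its URI, written as {uri}localname.
item = root.find(f"{{{RSS}}}item")
tags = [child.tag for child in item]
for tag in tags:
    print(tag)

# Look the element up by namespace URI; the "dc" prefix itself is irrelevant.
subject = item.find(f"{{{DC}}}subject")
print(subject.text)
```

Because the lookup uses the URI rather than the prefix, a document that bound the same URI to a different prefix would be treated identically.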

Namespaces let each author weave additional semantics required by particular types of resources or appropriate to a specific realm with the more general resource description such as that provided by the Dublin Core. In the book world, an additional definition might be the ISBN or Library of Congress number, while in the music world, it might be some form of compact disc identifier.
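
As a sketch of such mixing, the following builds a description that combines Dublin Core elements with an ISBN from a second vocabulary. The book namespace URI, the description wrapper element, and all the values are hypothetical, chosen only to show the mechanics with Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
BOOK = "http://example.org/ns/book#"  # hypothetical vocabulary for book data

# Ask the serializer to use readable prefixes instead of generated ones.
ET.register_namespace("dc", DC)
ET.register_namespace("book", BOOK)

# Hypothetical wrapper element holding the combined description.
description = ET.Element("description")
ET.SubElement(description, f"{{{DC}}}title").text = "An Example Book"
ET.SubElement(description, f"{{{DC}}}creator").text = "A. N. Author"
# Placeholder ISBN, not a real book identifier.
ET.SubElement(description, f"{{{BOOK}}}isbn").text = "0-00-000000-0"

serialized = ET.tostring(description, encoding="unicode")
print(serialized)
```

The general elements and the book-specific one sit side by side, and each remains traceable to the community that defined it.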

Now, we're not insisting that each and every document be described using all 15 Dublin Core elements, let alone along various other lines as well. Something to keep in mind, however, is that every bit of metadata adds to the available semantics, making resources less ambiguous and easier to find. Peer-to-peer application developers may then use the descriptions provided by a resource rather than having to resort to guesswork or such extremes as sequestering resources of a certain type to their own network.

Searching

Searching is the bane of the Web's existence, despite the plethora of search tools -- Yahoo currently lists 193 registered web search engines.[2] Search engines typically suffer from a lack of semantics on both the gathering and querying ends. On the gathering side, search engines typically utilize one of two methods:

  • Internet directories typically ask content providers to register their web sites through an online form. Unfortunately, such forms don't provide slots for metadata such as publisher, author, subject keywords, etc.
  • Search engines scour the Web with armies of agents/spiders, scraping pages and following links for hints at semantics. Sadly, even if a site does embed metadata (such as HTML's <meta> tags) in its documents, this information is often ignored.

On the querying end, while some sites do make an attempt to narrow the context for particular word searches (using such categories as "all the words," "any of the words," or "in the title"), successful searching still comes down to keywords and best guess. It's virtually impossible to remove the ambiguity between concepts like "by" and "about" -- "find me all articles written by Andy Oram" versus "find me anything about Andy Oram." Queries like "find me anything on Perl written by the person whose e-mail address is larry@wall.org" are out of the question.
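
Once resources carry fielded descriptions, the "by" versus "about" distinction becomes a matter of which field is matched. The following is a minimal sketch under that assumption; the records and the find helper are hypothetical and do not model any particular search engine.

```python
# Hypothetical records described with Dublin Core style fields.
RECORDS = [
    {"title": "Hypothetical article A", "creator": "Andy Oram", "subject": "peer-to-peer"},
    {"title": "Hypothetical profile B", "creator": "Someone Else", "subject": "Andy Oram"},
]


def find(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# "Find me all articles written by Andy Oram."
written_by = find(RECORDS, creator="Andy Oram")

# "Find me anything about Andy Oram."
written_about = find(RECORDS, subject="Andy Oram")

print([record["title"] for record in written_by])
print([record["title"] for record in written_about])
```

A plain keyword search for "Andy Oram" would return both records with no way to tell them apart.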

While the needs of users clearly call for semantically rich queries, some peer-to-peer applications and systems are doing little to provide even the simplest of keyword searches. While Freenet does provide the boon of an optional metadata file to accompany any resource added to the cloud, this is currently of minimal use since no guidance exists on what this metadata file should contain, and there is currently no search functionality. Gnutella's InfraSearch allows for a wonderfully diverse interpretation and subsequent processing of search terms: while a dictionary node sees "country" as a term to be looked up, an MP3 node may see it as a music genre. Unfortunately, however, the InfraSearch user interface still provides only a simple text entry field and little chance for the user to be an active participant in defining the parameters of his or her search.
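
The idea of each node interpreting a term in its own way can be sketched in a few lines. This is only an illustration of the concept: the node functions, the sample data, and the broadcast helper are all hypothetical and say nothing about how InfraSearch is actually implemented.

```python
# Hypothetical data held by two different kinds of node.
DEFINITIONS = {"country": "a nation with its own government"}
TRACKS = [
    {"title": "Example Song", "genre": "country"},
    {"title": "Another Tune", "genre": "jazz"},
]


def dictionary_node(term):
    """Treat the term as a word to be looked up."""
    definition = DEFINITIONS.get(term)
    return [definition] if definition is not None else []


def mp3_node(term):
    """Treat the term as a music genre."""
    return [track["title"] for track in TRACKS if track["genre"] == term]


def broadcast(term, nodes):
    """Send the same raw term to every node and gather each node's answer."""
    return {name: handler(term) for name, handler in nodes.items()}


results = broadcast("country", {"dictionary": dictionary_node, "mp3": mp3_node})
for name, answers in results.items():
    print(name, answers)
```

Note that the user supplies only the bare term; nothing in the query lets him or her say which interpretation is wanted.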

Hopefully we'll see peer-to-peer applications emerging that empower both the content provider and end user by providing semantically rich environments for the description and subsequent retrieval of content. This should be reflected both in the user interface and in the engine itself.
