A contrasting evolution: MP3 and the metadata marketplace
The alternatives to erecting a rigorous metadata architecture like RDF can be illustrated by the most popular decentralized activity on the Internet today: MP3 file exchange.
How do people find out the names of songs on the CDs they're playing on their networked PCs? One immediate problem is that there is nothing resembling a URI scheme for naming CDs; this makes it difficult to agree on a protocol for querying metadata servers about the properties of those CDs. While one might imagine taking one of the various CDDB-like algorithms and proposing a URI scheme for universal adoption (for instance, cd:894120720878192091), in practice this would be time-consuming and somewhat politicized. Meanwhile, peer-to-peer developers just want to build killer apps; they don't want to spend 18 months on a standards committee specifying the identifiers for compact discs (or people or films...). Most of us can't afford the time to create metadata tags, and if we could, we'd doubtless think of more interesting ways of using that time.
What to do? Having just stressed the importance of unique names when describing content, can we get by without them? Actually, it appears so.
Every day thousands of MP3 users work around the unique identification problem without realizing it. Their CD rippers inspect the CD, compute one of several identifying properties for the CD they're digitizing, and use this uniquely identifying property to consult a networked metadata service. This is metadata in action on a massive scale. But it also smacks of the PICS problem. MP3 listeners have settled on an application-specific piece of infrastructure rather than a more useful, generalized approach.
These metadata services exist and operate very successfully today, despite the lack of any canonical "standard" identifier syntax for compact discs. The technique they use to work around the standards bottleneck is simple, being much the same as saying things like "the person whose personal mailbox is..." or "the company whose corporate homepage is...". Being simple, it can (and should) be applied in other contexts where peer-to-peer and web applications want to query networked services for metadata. There's no reason to use a different protocol when asking for a CD track list and when asking for metadata describing any other kind of thing.
The basic protocol being used in CD metadata query is both simple and general: "Tell me what you know about the resource whose CD checksum is some-huge-number" -- a protocol reminiscent of the PICS label bureau protocol. The MP3 community could build enormously useful services on top of this, even without adopting a more general framework such as that provided by RDF, but they have stopped short of the next step.
On the contrary, while MP3 CD rippers currently embed lots of descriptive information (track listings) right into the encoding, they omit the most crucial piece of data from a fan's point of view: the CD and track identifiers. The simple unique identifier for a song on a CD, while only a tiny fragment of data, could allow both peer-to-peer and web applications to hook into a marketplace of descriptive services. How could MP3 services use this information?
One application is to update the metadata inside MP3 files, either to correct errors or to add additional information. If we don't know which CD an MP3 file was derived from, it becomes hard to know which MP3 files to update when we learn more about that CD. MP3s of collected works (i.e., compilations) typically have very poor embedded metadata. Artist names often appear inside the track name, for example. This makes for difficulties in finding information: If I want to generate a browsable listing organized alphabetically by artist, I don't want half the songs filed away under "Various Artists," nor do I want to find dozens of artist names in the "By Track Title" listings. Embedding unique identifiers in MP3s would allow this mess to be fixed at a later date.
Another example can be found in the practice of sharing playlists: Given some convention for identifying songs and tracks, we can describe virtual, personalized compilation albums that another listener can re-create on his personal system by asking a peer-to-peer network for files representing those tracks. Unique identification strategies would provide the architectural glue that would allow us to reconnect fragmented information resources. Were someone to put a unique identification service in place, we could soon expect all kinds of new applications built on top:
- Collaborative filtering ("Who likes songs that I like?")
- E-commerce ("Where can I can I buy this T-shirt, CD, or book?" or "Is there a compilation album containing these tracks?")
- Discovery ("What are the words to this song?" or "Where can I find other offerings by this artist?")
The lesson for peer-to-peer metadata architecture is simple. Unique identifiers create markets. If you want to build interesting peer-to-peer applications that hook into a wide range of additional services, adopt the same strategy for uniquely identifying things that others are using.
Metadata applied at a fundamental level, early in the game, will provide rich semantics upon which innovators can build peer-to-peer applications that will amaze us with their flexibility. While the symmetry of peer-to-peer brings about a host of new and interesting ways of interacting, there's no substitute for taking the opportunity to rethink our assumptions and learned from the mistakes made on the Web. Let's not continue the screen-scraping modus operandi; rather, let's replace extrapolation with forethought and rich assertions.
To summarize with a call to action for peer-to-peer architects, project leaders, developers, and end users:
- Use a single, coherent metadata framework such as that provided by RDF. When it comes to metadata, the network becomes a poorer information resource whenever we create artificial boundaries between metadata applications.
- Work on the commonalities between seemingly disparate data sources and formats. Work in your community to agree on some sort of common descriptive concepts. If such concepts already exist, borrow them.
- Describe your resources well, in a standard way, getting involved in this standardization process itself where necessary. Be sure to make as much of this description as possible available to peer applications and end users through clear semantics and simple APIs.
- Design ways of searching for (and finding) resources on the Net that take full advantage of any exposed metadata.
Rael Dornfest is Founder and CEO of Portland, Oregon-based Values of n. Rael leads the Values of n charge with passion, unearthly creativity, and a repertoire of puns and jokes some of which are actually good. Prior to founding Values of n, he was O'Reilly's Chief Technical Officer, program chair for the O'Reilly Emerging Technology Conference (which he continues to chair), series editor of the bestselling Hacks book series, and instigator of O'Reilly's Rough Cuts early access program. He built Meerkat, the first web-based feed aggregator, was champion and co-author of the RSS 1.0 specification, and has written and contributed to six O'Reilly books. Rael's programmatic pride and joy is the nimble, open source blogging application Blosxom, the principles of which you'll find in the Values of n philosophy and embodied in Stikkit: Little yellow notes that think.
Discuss this article in the O'Reilly Network General Forum.
Return to the P2P DevCenter.