How the Wayback Machine Works
by Richard Koman
The Internet Archive made headlines back in November with the release of the Wayback Machine, a Web interface to the Archive's five-year, 100-terabyte collection of Web pages. The archive is the result of the efforts of its director, Brewster Kahle, to capture the ephemeral pages of the Web and store them in a publicly accessible library. In addition to the other millions of web pages you can find in the Wayback Machine, it has direct pointers to some of the pioneer sites from the early days of the Web, including the NCSA What's New page, The Trojan Room Coffee Pot, and Feed magazine.
How big is 100 terabytes? Kahle, who serves as archive director and president of Alexa Internet, a wholly-owned subsidiary of Amazon.com, says it's about five times as large as the Library of Congress, with its 20 million books.
"What we have on the Web is phenomenal," Kahle says. "There are more than 10 million people's voices evidenced on the Web. It's the people's medium, the opportunity for people to publish about anything -- the great, the noble, the absolute picayune, and the profane."
The existence of such an archive suggests all kinds of possibilities for research and scholarship, but in Kahle's vision, all of the streams of research commingle into a single purpose: "The idea is to build a library of everything, and the opportunity is to build a great library that offers universal access to all of human knowledge. That may sound laughable, but I'd suggest that the Internet is going exactly in that direction, so if we shoot directly for it, we should be able to get to universal access to human knowledge."
If the goal sounds lofty, the Wayback Machine itself may be the crudest imaginable tool for data-mining a 100-terabyte database. At the Archive's Web site, simply enter a URL and the Wayback Machine gives you a list of dates for which the site is available.
Clicking on an old site is like time travel. I visited a December 1996 issue of Web Review (webreview.com) and found a cover story on "Christmas Cookies," an article dismissing privacy concerns about the new-fangled Web technology. A report from Internet World featured the hottest and most promising technology of the day: Push.
But that report, and the other articles I looked at in the Wayback Machine, were truncated; links to subsequent pages and many graphics were missing. Kahle concedes the Web interface does not show the full glory of the archive, but he says it wasn't meant to. "This is a browsing interface, a wow-isn't-this-cool interface ... It's a first step, but it's technically rather interesting because it's such a huge collection."
While the Wayback Machine has received plenty of press, we were interested in going deeper into the technical workings of this audacious project. We sat down with Kahle (who previously worked at the late supercomputer maker Thinking Machines and founded WAIS, Inc.) at the Archive's offices in San Francisco's Presidio.
Consider the hardware: a computer system with close to 400 parallel processors, 100 terabytes of disk space, hundreds of gigs of RAM, all for under a half-million dollars. As you'll read in this interview, the folks at the Archive have turned clusters of PCs into a single parallel computer running the biggest database in existence -- and written their own operating system, P2, which allows programmers with no expertise in parallel systems to program the system.
Richard Koman: So how much stuff do you have here?
Brewster Kahle: In the Wayback Machine, currently there are 10 billion Web pages, collected over five years. That amounts to 100 terabytes, which is 100 million megabytes. So if a book is a megabyte, which is about what it is, and the Library of Congress has 20 million books, that's 20 terabytes. This is 100 terabytes. At that size, this is the largest database ever built. It's larger than Walmart's, American Express', the IRS. It's the largest database ever built. And it's receiving queries -- because every page request when people are surfing around is a query to this database -- at the rate of 200 queries per second. It's a fairly fast database engine. And it's built on commodity PCs, so we can do this cost-effectively. It's just using clusters of Linux machines and FreeBSD machines.
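The size comparison in this answer is simple arithmetic. Here is a minimal sketch of it in Python, using only the figures Kahle quotes. Decimal units and the one-megabyte-per-book estimate are his approximations, and the variable names are ours.

```python
# Figures quoted in the interview; decimal units (1 TB = 1,000,000 MB) assumed.
MB_PER_TB = 1_000_000

archive_tb = 100                 # size of the Wayback Machine collection
pages = 10_000_000_000           # 10 billion archived pages
books = 20_000_000               # Library of Congress holdings
mb_per_book = 1                  # Kahle's rough estimate

archive_mb = archive_tb * MB_PER_TB              # 100 million MB
library_tb = books * mb_per_book / MB_PER_TB     # 20 TB
ratio = archive_tb / library_tb                  # archive vs. library
kb_per_page = archive_mb * 1000 / pages          # average stored size per page

print(f"Archive: {archive_mb:,} MB")
print(f"Library of Congress: {library_tb:.0f} TB")
print(f"Archive is {ratio:.0f}x the library")
print(f"Average stored size per page: {kb_per_page:.0f} KB")
```

The last line is a derived figure, not one Kahle states: 100 terabytes spread over 10 billion pages works out to roughly 10 KB per page on average.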
Koman: How many machines?
Kahle: Three hundred, we may be up to 400 machines now. When we first came out, we didn't architect it for the load we wound up with, so we had to throw another 20 to 30 machines at serving the index.
Koman: You just throw more PCs at the problem?
Kahle: You can build amazing systems out of these bricks that cost only a couple hundred dollars each, and you just throw more bricks at the problem to give it more computer power, more RAM, more disk, more network bandwidth, whatever it is you need. So we build massive database systems by striping the index over tens of machines. And it's a very cost-effective system.
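Kahle does not say how the index is striped. One common way to spread an index over many machines is to hash each key and use the result to pick a machine. The sketch below assumes that approach; the host names, the machine count of 12, and the function name are all hypothetical.

```python
import hashlib

# Hypothetical host names; the interview only says "tens of machines".
INDEX_MACHINES = [f"index{n:02d}" for n in range(12)]

def stripe_for(url: str, machines=INDEX_MACHINES) -> str:
    """Pick the index machine responsible for a URL via a stable hash."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % len(machines)
    return machines[slot]

if __name__ == "__main__":
    for url in ("http://webreview.com/", "http://www.feedmag.com/"):
        print(url, "->", stripe_for(url))
```

Because the hash is stable, every front-end machine computes the same answer for a given URL without consulting a central directory. A plain modulo scheme like this one reshuffles most keys when machines are added, so a production system would likely need something more careful.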
Koman: What kind of performance do you get?
Kahle: We're getting exceptional performance. Basically to build a 100TB database costs -- in hardware costs -- less than $400,000, including all the network equipment, all the redundancy, all the backup systems. We've had to do it based on necessity, because there's not a lot of money in the library trade. Where the Library of Congress has a budget of $450 million a year, you can be sure we don't.
Koman: How does it work technically?
Kahle: How the archive works is just with stacks and stacks of computers running Solaris on x86, FreeBSD, and Linux, all of which have serious flaws, so we need to use different operating systems for different functions. The crawling machines are running Solaris; there's a dozen or possibly more.
Koman: What are the crawlers written in?
Kahle: Combinations of C and Perl. Almost everything we can, we do in Perl -- for ease of portability, maintainability, flexibility. Because there's so much horsepower we don't really require a tight system. The crawlers record pages into 100MB files in a standard archive file format, and then store it on one of the storage machines. Those are just normal PCs with four IDE hard drives, and it just writes along until it's filled up and then it goes to the next one. It goes through a couple of these machines a day: hundreds of gigabytes a day. The total gathering speed when everything is moving is about 10 terabytes a month, or half a Library of Congress a month.
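The "write along until it's filled up, then go to the next one" pattern can be sketched as a writer that appends pages to a file and starts a new one at a size cap. This is an illustration only: the Archive's crawlers are in C and Perl, the interview does not describe the archive file format, and the record layout, file names, and class name below are invented for the example.

```python
import os

class RollingArchiveWriter:
    """Append crawled pages to numbered files, starting a new file at a size cap."""

    def __init__(self, directory: str, max_bytes: int = 100 * 1000 * 1000):
        self.directory = directory
        self.max_bytes = max_bytes      # 100 MB cap, as in the interview
        self.index = -1                 # number of the file currently open
        self.size = 0                   # bytes written to the current file
        self.handle = None
        os.makedirs(directory, exist_ok=True)

    def _open_next(self):
        if self.handle:
            self.handle.close()
        self.index += 1
        # Hypothetical naming scheme.
        path = os.path.join(self.directory, f"crawl-{self.index:06d}.dat")
        self.handle = open(path, "wb")
        self.size = 0

    def write(self, url: str, body: bytes) -> str:
        """Store one page; return the path of the file it landed in."""
        # Made-up framing: a "url length" header line, the body, a newline.
        record = f"{url} {len(body)}\n".encode("utf-8") + body + b"\n"
        # Roll over when the record would not fit, unless the file is empty.
        if self.handle is None or (self.size and self.size + len(record) > self.max_bytes):
            self._open_next()
        self.handle.write(record)
        self.size += len(record)
        return self.handle.name

    def close(self):
        if self.handle:
            self.handle.close()
            self.handle = None
```

A crawler would call `write(url, body)` for each fetched page and `close()` when it shuts down. Fixed-size files keep each unit small enough to copy, index, or replace independently.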
Then they're indexed onto another set of machines -- there's a whole hierarchical indexing structure for the Wayback Machine, and that is kept up to date on an hourly basis. So when people come to the Wayback Machine, there's a load balancer that goes and distributes those queries to 12 or 20 machines that operate the front end, and those query another dozen or so machines that hold a striped version of the index, and that index allows the queries to answer what pages are available for any particular URL. So if you were to click on one of those pages, it goes back to that index machine, finds out where it is in all the hundreds of machines, retrieves that document, changing the links in it so that it points back to the path, and then hands it back to the user. And it does that at a couple hundred per second.
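The request path Kahle describes (look up the URL in the index, find where the capture is stored, fetch it, rewrite its links, return it) can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the in-memory dictionaries standing in for the index and storage machines, the `archive.example` address, the timestamp format, and the function names.

```python
import re
from urllib.parse import urljoin

ARCHIVE_PREFIX = "http://archive.example/"   # hypothetical front-end address

# Matches href="..." or src="..." with double-quoted values only (a simplification).
LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"', re.IGNORECASE)

def rewrite_links(html: str, page_url: str, timestamp: str) -> str:
    """Point every http(s) href/src in an archived page back at the archive."""
    def replace(match):
        attr, target = match.group(1), match.group(2)
        absolute = urljoin(page_url, target)          # resolve relative links
        if not absolute.startswith(("http://", "https://")):
            return match.group(0)                     # leave mailto: etc. alone
        return f'{attr}="{ARCHIVE_PREFIX}{timestamp}/{absolute}"'
    return LINK_ATTR.sub(replace, html)

def captures_for(index: dict, url: str) -> list:
    """Answer 'which dates are available for this URL?'"""
    return sorted(index.get(url, {}))

def serve(index: dict, storage: dict, url: str, timestamp: str) -> str:
    """Look up where a capture lives, fetch it, and rewrite its links."""
    location = index[url][timestamp]   # stands in for machine/file/offset
    html = storage[location]
    return rewrite_links(html, url, timestamp)

if __name__ == "__main__":
    index = {"http://webreview.com/": {"19961201": "store07:0"}}
    storage = {"store07:0": '<a href="/push.html">Push</a>'}
    print(captures_for(index, "http://webreview.com/"))
    print(serve(index, storage, "http://webreview.com/", "19961201"))
```

Rewriting the links is what keeps a visitor inside the archive: each rewritten link carries the capture date, so following it asks for the archived copy instead of the live site.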
What's amazing to me is the fact that the hardware is free. For doing things even in the hundreds of terabytes, it costs in the hundreds of thousands of dollars. When you talk to most people in IT departments, they spend a couple hundred thousand dollars just on a CPU, much less a terabyte of disk storage. You buy from EMC a terabyte for maybe $300,000. That's just the storage for 1 TB. We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch with redundancy built in. Something has changed by using these modern constructs that are heavily used at Google, Hotmail, here, Transmeta. There's a whole sector of companies that are more cost-constrained than say, banks, that just buy Oracle and Sun and EMC.
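To put the cost comparison in per-terabyte terms, here is a short sketch using only the dollar figures quoted in the interview. Both are Kahle's rough numbers, so the result is an order-of-magnitude estimate.

```python
archive_cost = 400_000       # upper bound Kahle gives for the whole 100 TB system
archive_tb = 100
emc_cost_per_tb = 300_000    # Kahle's rough figure for one terabyte from EMC

cost_per_tb = archive_cost / archive_tb
ratio = emc_cost_per_tb / cost_per_tb

print(f"Commodity cluster: about ${cost_per_tb:,.0f} per TB")
print(f"EMC figure is roughly {ratio:.0f}x higher per TB")
```

On those numbers the commodity cluster comes to about $4,000 per terabyte, CPUs and network included, against $300,000 for the storage alone.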
Copyright © 2000-2006 O'Reilly Media, Inc. All Rights Reserved.