How the Wayback Machine Works
Koman: You mentioned Perl, Linux, and FreeBSD. Do you use exclusively open source software?
Kahle: We use as much open source software as we can; we make as much of our software as we can open because we're a library. The idea is to help people make sense of the Net and we leverage all the open tools. Alexa put up a television archive called tvarchive.org, which is television news from around the world from Sept. 11 to Sept. 18. Twenty channels in Chinese, Russian, Japanese, Iraqi. Iraqi television is really interesting. So in three weeks, Alexa took all these recordings from tape, massaged them, put them online, and converted them into several different formats. The only way to do this is to cross-cluster hundreds of commodity Linux boxes and use freeware tools, all of which barely work.
Koman: This all takes a lot of brain cells; you have to have some smart people working on this.
Kahle: Yes, this is not for the faint of heart. If you're going to run 100TB databases and support hundreds of queries per second, it's going to take good folks. But on the other hand, there are good folk doing a whole lot less than that. The archive is a real vindication that you can do new and different things with these open tools. Because these open tools are available to use in ways different from those for which they were originally designed, they make it possible to strive for the biggest collection of information ever.
Koman: Does the fact you can do this at this scale suggest new possibilities for the private sector, that businesses can operate on a scale not previously imagined?
Kahle: Having the capital cost of equipment drop to effectively zero allows you to think bigger. You start thinking about the whole thing. For instance, the gutsy maneuver of saying "let's index it all," which was the breakthrough of Altavista. Altavista in 1995 was an astonishing achievement, not because of the hardware -- yes, that was interesting and important from a technical perspective -- but because of the mindset. "Let's go index every document in the world." And once you have that sort of mindset, you can get really far.
So if all books are 20 TBs, and 20 TBs are $80,000, that's the Library of Congress. Then something big has changed. All music? It's tiny. It looks like there're only one million records that have been produced over the last century. That's tiny. All movies? All theatrical releases have been estimated at 100,000, and most of those from India. If you take all the rest of ephemeral films, that's on the order of a couple hundred thousand. It's just not that big. It allows you to start thinking about the whole thing.
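The arithmetic behind that claim is easy to check. The short sketch below derives a cost per terabyte from the two figures Kahle gives (20 TB for $80,000) and, purely as an illustrative assumption, applies that same rate to the 100 TB figure mentioned elsewhere in the interview; the variable names are made up for the example.

```python
# Figures quoted in the interview.
books_tb = 20             # estimated size of all books, in terabytes
books_cost_usd = 80_000   # quoted cost of 20 TB of storage

# Implied price of one terabyte.
cost_per_tb = books_cost_usd / books_tb

# Assumption for illustration only: price 100 TB at the same flat rate.
archive_tb = 100
archive_cost_usd = cost_per_tb * archive_tb

print(f"${cost_per_tb:,.0f} per TB, ${archive_cost_usd:,.0f} for {archive_tb} TB")
```

At that rate, storage for a 100 TB collection lands in the hundreds of thousands of dollars rather than the millions.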
It will also change the relationships of corporations to their IT departments. IT spends a lot of money on this stuff; they spend millions. And if they really understood that it doesn't have to cost millions, it could cost hundreds of thousands of dollars, and they could hire a few smart people rather than large numbers of people to maintain all this equipment, we might be able to make some big steps forward. It would open it up to smaller companies to do bigger things. Where people used to think that warehouses full of mainframes were an asset, that may not be the case.
Koman: How do you mine all this stuff?
Kahle: That's where the fun begins. Datamining these materials is great fun. What Alexa does in its free toolbar is create a related-links service, and it does it based on the collaborative filtering of "other people who went to this page went to these other pages." We use the link structure of the Net and the usage trails from the Alexa users to be able to compute this. And all of these techniques require tens if not hundreds of machines to process the data.
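The "other people who went to this page went to these other pages" idea can be illustrated with a simple co-visit count. This is only a minimal sketch, not Alexa's actual algorithm: it uses usage trails alone (no link structure), and the function name `related_links` and the sample trails are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def related_links(trails, top_n=3):
    """Map each page to the pages most often seen in the same usage trail."""
    co_visits = defaultdict(Counter)
    for trail in trails:
        # Count each ordered pair of distinct pages once per trail.
        for page, other in permutations(set(trail), 2):
            co_visits[page][other] += 1
    return {
        page: [other for other, _ in counts.most_common(top_n)]
        for page, counts in co_visits.items()
    }


# Hypothetical usage trails: each list is one visitor's sequence of pages.
trails = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "b.example"],
    ["a.example", "d.example"],
]
related = related_links(trails)
print(related["a.example"][0])  # the page most often visited alongside a.example
```

With real data the trails would be far too large for one machine, which is where the cluster comes in.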
Because there are only a couple hundred gigabytes for every processor and the processor and RAM are very closely tied to the disks, you can operate this cluster as a large parallel computer. It's very inexpensive to do. We program the computer using a technology called P2, which we'll be putting out as open source for other people to be able to operate parallel clusters of Linux or FreeBSD or Solaris boxes.
Koman: What is P2?
Kahle: P2 is a Perl script that takes commands and runs them on remote boxes, splits up data to be able to run on them, and then brings back and correlates the data.
Koman: It's an operating system for a parallel cluster?
Kahle: But it sits on top. You can take people who know how to do shell scripts or Perl scripts on normal Unix boxes and within two weeks, they can be world-class parallel data miners. That's a huge step past the problems we've had with parallel computing, where you had to learn a whole new methodology. This is: no new methodology, no rocket science, no magic. And it's only because it's straightforward that we've been able to leverage normal programmers' expertise to be able to run programs on hundreds of machines.
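The pattern Kahle describes for P2 (split the data, run the same command on every box, bring the results back and combine them) can be sketched in a few lines. This is not P2 itself, which is a Perl script running on remote machines; as a stated assumption, local threads stand in for the remote boxes, and every name here (`split_records`, `count_hosts`, `run_on_cluster`) is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_records(records, n_workers):
    """Deal records round-robin into n_workers chunks, one per 'box'."""
    return [records[i::n_workers] for i in range(n_workers)]


def count_hosts(chunk):
    """The per-box job: count log lines per host in one chunk."""
    return Counter(line.split()[0] for line in chunk if line.strip())


def run_on_cluster(records, job, n_workers=4):
    """Split the data, run the same job on every chunk, merge the results."""
    chunks = split_records(records, n_workers)
    # Threads are a local stand-in for remote machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(job, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical log lines: "<host> <path>".
log_lines = [
    "example.org /index.html",
    "example.com /about.html",
    "example.org /news.html",
    "example.net /index.html",
    "example.org /contact.html",
]
totals = run_on_cluster(log_lines, count_hosts, n_workers=3)
print(totals.most_common(1))
```

The per-box job is an ordinary function that knows nothing about the cluster, which mirrors the point that a normal script writer does not have to learn a new methodology.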
Koman: It sounds quite simple.
Kahle: We've been at it for years. The first company I worked in was Thinking Machines. And we blew it. We built the fastest computer in the world that very few people could program. It required people to think in a new way. What a horrible thing to have to do to be able to attract customers. The idea is to be able to think the same and be able to do more. I think we've cracked the parallel computer problem for a very large set of problems, which is fundamentally data-mining and database-type operations.
Koman: So will people looking for more than the Wayback Machine be able to mine the Archive?
Kahle: The idea is to try to allow people to use a Web interface -- clunky, but you can step through it -- but then it would show you the command that's going to be run across the cluster. But if you say, "Yeah, that's kind of what I want, but instead of this I want to be able to go in and put in my own Perl script," then we'll allow people to do it.
We're going to try to expose what we do internally, but first put an easy interface to at least get something done, and then an easy path from novice to expert. But you'll need to know things like Perl. And then our challenge will be how to manage, say, 10 to 20 programs running at the same time over the data sets and not have people clobber each other. Kind of timesharing, but at the hundreds-of-computers level.
Koman: You have several other collections besides the Web. The ephemeral films and the television archives are not content from the Web, but content you're putting on the Web.
Kahle: We've put 1,000 films up online for people to download and use in any way that they want. What we really want is for people to make their own movies. But these, they're pretty wild films: education films, government films, propaganda films, industrial films. They're all available for download in MPEG2, which is DVD-quality, for people to do anything they want. People have made some really terrific films, and some of them are on the site as well. I really recommend "The ABCs of Happiness" and "The Consequence of War." Awesome films.
Koman: You wouldn't think with 100 terabytes of stuff already that you would need to encourage the creation of more content.
Kahle: We're trying to show how people can do it themselves. We're trying to encourage everyone to take their old content that's not online and put it online. A professor at UC Berkeley said that students use the Web as the resource of first resort, which is a huge change. But that's a little dangerous if the Web doesn't have the good stuff on it, and many people complain it doesn't. Instead of trying to whip students to go back to the physical library, let's put the good stuff on the Net. Otherwise, we could have a whole generation learning from ephemeral content collections, as opposed to learning from the books of the ancients. And a lot of materials are not there yet.
Koman: Are you working with the great libraries on digitization?
Kahle: Yes, we're working with the Library of Congress on some of these Web collections and starting to work with them on digitizing different parts of their print collections. The Prelinger Archives is digitizing films. We're working with different researchers on automatic transcription of the television materials, so we can get that to be a referenceable resource. These are the sort of things we have to get to, and get to very soon. Every year that passes, we have more and more students using not the best we have to offer and that is a tragedy. We are the establishment. We should be making tools that allow children and students to have access to it all. And we're letting them down so far.
Koman: What about the question of rights? I just wrote about Lawrence Lessig's book on intellectual property. Surely the publishers and the television networks and the record companies aren't willing to let you keep a copy of all of their stuff?
Kahle: All we collect for the Web archive are sites that are publicly accessible for free, and if there's any indication from the site owner that they don't want it in the archive, we take it out. If there's a robot exclusion, it's removed from the Wayback Machine. Over the years, people would notice these things in their logs and would say, what are you doing? And we'd explain what we're doing -- building this archive and donating a copy to the Library of Congress, etc., etc., and 90% of the time they say, "Oh, that's cool, you're crazy, but go ahead." About 10% of the time, they'd say, "I don't want any part of it," and we instruct them on how to use a robot exclusion and they're taken out of history. That seems to work for everybody at this point. People are really excited about this future that we're building together.
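The robot exclusion Kahle refers to is a `robots.txt` file at the root of a site. As a rough sketch, the snippet below uses Python's standard `urllib.robotparser` to show how such a file shuts out one crawler while leaving others unaffected. The user-agent name `ia_archiver` is an assumption (the interview does not name the crawler), and `SomeOtherBot` and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that excludes one crawler from the whole site.
# "ia_archiver" is assumed here; the interview does not name the crawler.
robots_txt = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "http://example.com/index.html"
archiver_allowed = parser.can_fetch("ia_archiver", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("archiver allowed:", archiver_allowed)
print("other bot allowed:", other_allowed)
```

A crawler that honors the file checks it before fetching, so the exclusion applies only to the agent named in it.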
Koman: The dot-bomb hasn't disillusioned you at all?
Kahle: I never predicted the capital market in the first place. I don't know where that came from, but wow, there was a lot of money there for a while. But I love the era of dreams. I loved it when people were trying to make services whose only constraint was to be popular. They didn't have to make money, they just had to do something people liked. It was amazing the ideas… I'm glad they're captured in some way, because it's those dreams when the medium is new before you realize all its faults and foibles, and the Internet is going to disappoint, it's going to be good at a few things and not good at everything else, but at least those dreams are something we should try to live up to the next time. As we refine technologies and come up with the next thing, let's see if we can live up to a few more of those dreams, not just the making a million dollars, but having the ability to get your words out, to reinvent government, whatever it is. If it doesn't happen this time, let's remember it, so the next time, let's give it another good shot.
Am I disillusioned? No. Is it depressing to see a lot of my friends out of work? Yes! But the goal of universal access to human knowledge is in many ways an original goal of the Net. It's a tremendous goal. It makes me want to jump out of bed in the morning and try to get this thing done. People working on digital divide issues want to join in, advocates for children's literacy programs want to join in. It's not about driving slick cars, it's about using this technology for the betterment of education and people. I'll take that any day over random stock option grants.
Copyright © 2000-2006 O'Reilly Media, Inc. All Rights Reserved.