Brewster Kahle on the Internet Archive and People's Technology
by Lisa Rein
The Internet Archive (IA) started out as just that -- a non-profit organization dedicated to taking snapshots of the entire Web every six months in order to create a searchable archive.
One of the main goals of the Internet Archive is to provide "Universal Access to All Human Knowledge." It sounds like a lofty task, but Brewster is firmly committed to it, and truly believes that it is achievable. Anyone in his presence for five minutes or more is likely to feel the same way, because his enthusiasm is quite contagious.
Brewster started the IA in 1996 with his own money, which he earned from the sale of two Internet companies: WAIS, which was bought by AOL, and Alexa Internet, which was bought by Amazon. He has been spending his own money to keep the institution going for the last six years. Recently, in the summer of 2003, he was fortunate enough to receive some grants and corporate sponsorship.
Newer IA projects include creating an open source movie archive, creating a rooftop-based WiFi network across San Francisco, creating an archive of the 2004 presidential candidates (offering every candidate unlimited storage and bandwidth to serve up video), and creating a non-profit documentary archive.
Let's Start with the Internet Archive
Lisa Rein: What's the story behind the birth of the Internet Archive? How did it start?
Brewster Kahle: The Internet Archive started in 1996, when the Internet had reached critical mass. By 1996, there was enough material on the Internet to show that this thing was the cornerstone for how people are going to be publishing. It is the people's library. People were using the Internet in a major way to make things available, as well as to find answers to things. And, of course, the Internet is quite fleeting. The average life of a web page is about 100 days. So if you want to have culture you can count on, you need to be able to refer to things. And if things change out from underneath you all the time, then you're in trouble. So what traditionally has happened is that there are libraries, and libraries collect up out-of-print materials and try to preserve and provide open access to materials that aren't necessarily commercially viable at the moment. The Internet Archive is just a library. It just happens to be a library that is mostly composed of bits.
LR: How did you get the funding for it?
BK: The funding for the Internet Archive came originally from the success of selling a couple of Internet companies on the path towards building a library. So the original funding was from me, based on selling WAIS, Inc., the first Internet publishing system, to America Online, and then Alexa Internet -- a company named for the Library of Alexandria -- which set out to catalog the Web. All of these were steps towards building the library, and selling those companies gave me enough money to kick-start the Internet Archive. At this point, it's funded by private foundations, government grants, and in-kind donations from corporations.
LR: So AOL bought WAIS and who bought Alexa?
BK: Amazon bought Alexa.
LR: What are some of the grants? Didn't you get some good grants lately, during the past year?
BK: Oh yes, we've been very fortunate in this phase of the Internet Archive's life. The Sloan Foundation gave us a significant grant towards helping get the materials up and able to be used by researchers all over the world, and the Hewlett Foundation also gave us a sizable grant to bring more digital materials from a lot of non-profit institutions to give them permanent access.
For instance, a lot of organizations create documentaries that may be shown once or twice but are not permanently available -- even though their general intent was for these things to be available. So by having a library able to digitize and host these materials, we hope to bring a lot of non-profit materials up and out onto the Internet so they can be leveraged and used by people all over the world.
Brewster Kahle speaking at the O'Reilly 2003 Emerging Technology Conference in Santa Clara, CA
LR: How many people work here at the Internet Archive right now?
BK: There are 12 people full-time here at the Internet Archive -- probably 20 all told, if you count everyone. There are a lot of people who come through. We've got a programmer from Norway and a programmer from Iceland here now. We had a programmer from Japan who came through as an intern, shared the technology they know, and learned what we know.
LR: What would you tell somebody that was interested in participating somehow? You're always looking for people to work on projects, right?
BK: We're always looking for help. People are helping in many, many different ways: by curating collections, by keeping good web sites, by making sure that web sites can be archived -- that's how thousands of people are helping. People are also helping curate some of the collections that are here. We have volunteers helping with things like SFLan and some of the technical work that we do. But we are also growing slowly, and we are hiring a few more people -- mostly very technical.
LR: Talk about SFLan a bit.
BK: SFLan is a wireless project based around San Francisco. The idea is to experiment with a rooftop network: to use commodity 802.11 WiFi gear to hop from roof to roof to roof to provide an alternative to DSL and cable for the last mile.
If we can make that both be open and have distributed ownership, then people would own the roadways and they would basically control their network, which is what the Internet really is.
LR: What do you mean by "the last mile," exactly?
BK: It's the last piece of the path: getting from a central location, where a fiber might come into the city, out to people's homes, so that people can not only get materials at video speeds -- 3-5 megabits per second, DVD-like speeds -- but also act as servers, making things available to others over the Internet at high speeds.
These are some of the things that are very difficult to do, if not impossible, with the current commercial DSL and cable providers. And we're looking to see how we can not only establish that baseline of video-ready Internet and make it so people can serve video over the Internet, but then, every year, make it better by a factor of two. So the technology follows Moore's Law just like the computer guys do, as opposed to how the telecoms tend to work, which is "here's the same thing, and you'll buy the same thing, and maybe we'll raise the price slightly ..."
LR: And keep paying more for it.
LR: So you're looking for people with rooftops?
BK: We're looking for people with rooftops. And especially people that can buy a node. A node costs $1,000, and that's a little Linux box with a directional antenna.
LR: Is that a node right there?
BK: This is a node right here (gestures). So this is an SFLan box. This is a directional antenna that points upstream back to a node that's closer to the Net. This is an omni antenna. So anyone who can see this can be on the Internet for free.
And this is a Linux machine that's got a CompactFlash card as its hard drive, and two radios. And you get a wire that comes down into your house, which is the way that power is brought up to this machine. And also, you get bandwidth within your house or office.
There are about 23 of these around San Francisco on rooftops now, and we're actively deploying new software. Cliff Cox up in Oregon is doing a lot of the software development and also hardware development. He's actually the guy that sells these things for $1,000. So Internet Archive's participation is to help fund the project to get it kick-started, and to try to get some active roofs up and running.
LR: How does the Internet Archive decide about implementing new technologies? What's your philosophy about implementing new technologies?
BK: The Internet Archive is extremely pragmatic about new technologies. What we tend to do is look at the least costly, both in the short term and long term. So we are frugal to the core.
We currently run about 700 computers. They're all running Linux. We don't have any dedicated routers; we just use Linux machines. We use the same Linux machine over and over and over again -- Jim Gray's model, what he calls the "brick model." So we just use Linux machines stacked up, and even though they might be storage machines, or CPU machines, or running as a router, or a load balancer, or a database machine, they're all the same machine. What we've found is that such a simple underlying hardware architecture allows just one or maybe two systems administrators to scale to many hundreds and, we hope, a few thousand machines.
Because we operate on these machines stacked up, we tend to do everything based on clusters, because our amounts of data are fairly large. We have, oh, several hundred terabytes at this point -- three, four hundred terabytes of materials -- and it's growing a lot. So it's difficult to process all of that if you have to go through just one machine, and a lot of proprietary software is licensed for just one machine, or costs per machine.
Open source has the ability that you can go and run it on as many machines as you want. Because we run things and we do data processing and conversions on ten machines or a hundred machines at once, we find that open source is often the most pragmatic, least costly way to roll. We also find that it's easiest for other people to copy our model if we use open source software, so we tend towards using open source software, because we'd like anything that we develop to be actively used by others readily and easily.
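The fan-out approach Brewster describes -- the same conversion job spread across ten or a hundred identical machines -- can be sketched roughly as follows. This is not the Archive's actual tooling; the hostnames, task names, and `convert` command are hypothetical, and the sketch only prints the commands it would dispatch rather than running them over ssh.

```python
# Hypothetical sketch of the identical-"brick" cluster approach: fan a
# batch of conversion tasks out across interchangeable Linux hosts in
# round-robin fashion. Hostnames and the convert command are made up.

def assign_tasks(hosts, tasks):
    """Return a mapping of host -> list of tasks, assigned round-robin."""
    assignment = {host: [] for host in hosts}
    for i, task in enumerate(tasks):
        assignment[hosts[i % len(hosts)]].append(task)
    return assignment

hosts = ["node01", "node02", "node03"]               # identical Linux bricks
tasks = [f"convert item{n:04d}.arc" for n in range(10)]

for host, jobs in assign_tasks(hosts, tasks).items():
    for job in jobs:
        # In practice this would be dispatched over ssh; here we only print.
        print(f"ssh {host} {job}")
```

Because every machine is the same, any brick can take any slice of the work, which is what lets a couple of administrators run hundreds of them.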
LR: How much do you test before going live with new services and things? Do you do a lot of testing?
BK: Do we do a lot of testing? I'd say we do a lot of progressive rollouts. We do testing in-house, but you can only go so far, and then you bring on some number of your users and bring things out. I'd say we're less testing-oriented. We're less service-quality oriented than a lot of places, because we're researching. We're trying to push the edge. So we try to make sure our data is safe, but if there happens to be a hiccup, we are very public about that, and we're looking for help from others to help us resolve these and find them. So I'd say we're not like a commercial company doing lots of in-house testing and rounds and rounds of beta testing, because we only have 12 people to run all of this.
LR: Can you remember a specific situation where the technology could have gone one way or the other, and you decided on a certain way over another way, and why? When there's a fork in the road, what process do you go through to decide which way to go?
BK: Boy, when there're different choices of which way to go, you find that one of the lead motivators in terms of how we decide which way to go is which way people believe it should go. People are always open to testing and pushing back and saying, "Why do you think that's true?" Especially if we've tried going down that road before.
Let's take RAID -- Redundant Arrays of Independent Disks. The idea is to run, say, four or eight disks as a cluster, so that if one fails, its information is preserved on the others; you can replace the failed disk and keep going. Every few years we think that this is the right thing to do, and every few years we find, unfortunately, that it is the wrong thing to do.
But it doesn't seem to keep us from trying again. Every so often we think, "Okay, they must have fixed the bugs" -- the software must be more reliable, or the controllers must be more reliable -- and we'll put some number of machines into this new structure and watch them for six months to a year to see, "Does it work better or worse than what we were using before?" With two major tests of RAID, we've found that it's been a loser.
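The core RAID idea Brewster describes -- a failed disk's contents can be reconstructed from the survivors -- can be illustrated in miniature with XOR parity. This is only a toy model of the principle (not the Archive's setup or any particular RAID level's full implementation): one parity "disk" holds the XOR of the data disks, so any single failed disk can be rebuilt, but two simultaneous failures cannot.

```python
# Toy illustration of the RAID parity principle: a parity "disk" stores
# the XOR of the data disks, so the contents of any SINGLE failed disk
# can be reconstructed from the others. Disk contents are made up.

from functools import reduce

def parity(disks):
    """XOR corresponding bytes of the given disks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))

disks = [b"AAAA", b"BBBB", b"CCCC"]   # three data "disks" (toy contents)
p = parity(disks)                     # the parity "disk"

# Disk 1 fails; rebuild it from the surviving disks plus parity.
rebuilt = parity([disks[0], disks[2], p])
assert rebuilt == disks[1]

# If TWO disks fail (or "hiccup slightly"), simple parity cannot recover
# either of them -- the kind of failure mode described in the interview.
```

The arithmetic is trivial; as the interview makes clear, the hard part in practice is the reliability of the controllers and software wrapped around it.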
LR: Why? What goes wrong?
BK: We're not exactly sure, but it looks like the RAID controllers are just not debugged very well. The software isn't debugged. The hardware isn't debugged. There are failure modes that fall outside what the design anticipates. Supposedly, if one disk goes completely corrupt, you can replace it and everything's fine. Well, we've found with the latest Linux release that if two disks just hiccup slightly, it gives everything up for lost and says, "You lose all your data." So we've had to spend months going back and decoding the Linux RAID file system to recover all of the data that could actually be recovered. So I think it's just bad implementations, based on not being able to get the reliability up, based on not having enough test cases.
We go along with Hillis' Law. Danny Hillis was one of the great computer designers of all time, and his approach was to use large numbers of commodity components: basically, price follows volume. If things are made in more volume, the price is lower. You can say, "Duh. Obviously." But it's amazing how many people don't follow this. You want to use things that cost less -- you might get more gigabytes per hard drive using commodity components than specialty components.
But another corollary of this is that reliability follows volume: things made in large volume have to be more reliable, at least in the long haul, or the company making them would go out of business from too many failures. Another way of saying that is that Toyotas are more reliable than Ferraris. Even though a Toyota might cost one-tenth as much as a Ferrari, it is probably on the road more often. Putting these together: if you want a system that is reliable, available, and doesn't cost that much, go for high volume. And so we find that technologies that are commodity and made in high volume work better.
LR: When you say "commodity," you mean "off the shelf," or COTS products, right?
LR: Let's talk a little bit about your philosophy now. Could you discuss what you mean when you talk about "Universal Access To All Human Knowledge?"
BK: "Universal Access To All Human Knowledge" is a motto of Raj Reddy from Carnegie Mellon. I found that if you really come to understand that statement, you see that it is possible -- technologically possible -- to take, say, all published materials -- all books, music, video, software, web sites -- and provide universal access to all of it. Some for a fee, and some for free. That realization was a life-changing event for me. It is just an inspiring goal. It's the dream of the Greeks, which they embodied, with the Egyptians, in the Library of Alexandria: the idea of having all knowledge accessible.
But, of course, in the Library of Alexandria's case, you had to actually go to Alexandria. They didn't have the Internet. Well, fortunately, we not only have the storage technology to be able to store all of these materials cost-effectively, but we can make it universally available. So that's been just a fabulous goal that causes me to spring out of bed in the morning.
And when other people catch on to the idea that we could actually do this, it helps straighten the path. You know, in life there are lots of paths that wander around. But I find that having a goal that is that far out, but also doable, helps me keep my direction -- keep our organization's direction. And I'm finding that a lot of other people like that direction as well.
LR: Do you have an overall philosophy about technology and the direction in which you'd like to see it go?
BK: I don't really have a philosophy about technology. I have a philosophy of what future I want to live in, which is probably more of a social and cultural issue than it really is a technological issue. And socially and culturally, what I want to grow up in -- and have my kids grow up in -- is a wonderful flowering of all sorts of really wild ideas coming from all sorts of people doing diverse and interesting things.
What I'd really like to see is a world where there are no limitations on getting your creative ideas out there -- where people have a platform to find their natural audience, whether that audience is one person (themselves), or a hundred people, or a thousand people. We try to make it so that the technologies and institutions we develop give people an opportunity to flower: to live a satisfying life by providing things to others that they appreciate.
And I think our technologies right now are well-suited to doing this in the information domain. In the information domain, we can offer people the ability to publish without the traditional restrictions that came before, and, with search engine technologies, help them find their natural audiences -- so that people aren't surrounded by stuff they don't want, and can find the music and video recordings they do want, even if they were made half a continent away and only a hundred other people really like that genre.
LR: What kind of projects are you working on with the Library of Congress?
BK: We've been working with the Library of Congress over the last three or four years to help archive web sites. They've got a mission to record the cultural heritage of the United States -- actually, Thomas Jefferson gave them a broader mandate: "the world." And now that a large section of publishing is moving onto the Internet, we've been working with them as a technology partner. They do the curation, and we do some special crawls.
Our first project with them was the election in the year 2000. The presidential election. And they selected a set of web sites, and we crawled them every day to try to get a historical record, and then the Internet Archive made them available to the world to see and use, to see if it was useful to people.
The Library of Congress is trying to move into the digital realm, and they just got a hundred million dollars from Congress to help do digital preservation, and we hope to be participants as that unfolds. We'll see. But the Library of Congress has a lot of money -- a 450-to-500-million-dollar-a-year budget. We hope that a growing percentage of that goes towards digital materials, whether working with us or with others -- up from what is currently, I think, probably less than one percent.
LR: Earlier you said that one way that people could help was to make their web sites "more archivable," basically. What does that really mean? How would you make your web site easily archivable?
BK: Boy. By being straightforward. I think by keeping things fairly simple. If web sites have sort of straightforward links, then that makes things a lot easier.
LR: What do you mean "straightforward?"
BK: Probably one way of finding out is to go to archive.org and see whether we got it right last time. We're continuously updating our tools and trying to make things better. But, for instance, we've been having trouble with .swf files -- Shockwave and Flash files from Macromedia. If those files have links to other pages inside of them, we're just not able to find those links, so we can't follow them. We also have trouble rewriting those .swf files so that they point to the Archive's version of the links and not the live Web's. So we're having trouble with certain complicated web sites. What we'd like to see is more straightforward use of pointers, because the hyperlink is one of the great ideas of the Internet.
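The distinction Brewster draws can be sketched with a minimal link extractor built on Python's standard-library `html.parser`. This is not the Archive's crawler -- just an illustration, with a made-up page, of why plain `<a href>` anchors are easy for an archiving crawler to find and rewrite, while links buried inside a binary .swf file are invisible to this kind of parse.

```python
# Minimal sketch of why "straightforward links" are easy to archive:
# plain <a href="..."> anchors fall out of a simple HTML parse, while
# any links embedded inside a binary .swf file do not.

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A made-up page: two ordinary anchors plus an embedded Flash movie.
page = """
<html><body>
  <a href="/about.html">About</a>
  <a href="http://example.com/page2.html">Next</a>
  <embed src="menu.swf">  <!-- links inside the .swf stay invisible -->
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)   # the two plain anchors a crawler can follow
```

The parser surfaces `/about.html` and the example.com link immediately; whatever navigation lives inside `menu.swf` never appears, which is exactly the crawling problem described above.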