JXTA Search: A look at the future of searching
by Rael Dornfest, Kelly Truelove
Searching on the Web has long been caught in a web of spiders and crawlers. From early crawlers like Lycos to the state-of-the-art Google, Web search engines have always suffered from a pretty severe lag time. The gap between the time a new document is posted on the Web and the time you can find it in Google is often a matter of weeks.
As Gene Kan discovered, though, peer to peer networks like Gnutella -- in which nodes on the network receive live queries -- are ready-made for something better than the current state of Web searching. Rather than searching an index of what was there two weeks ago, Kan's InfraSearch technology allowed for searching what is there right now.
In May, InfraSearch was sold to Sun Microsystems, and the InfraSearch team joined Sun's Project JXTA. On Monday, JXTA Search, the result of InfraSearch's work with JXTA, was released at JavaOne. OpenP2P.com contributors Rael Dornfest and Kelly Truelove talked with Kan and Steve Waterhouse, who joined Project JXTA as director of engineering when InfraSearch was acquired.
Rael Dornfest: I thought maybe we might start with a little bit of history. I know that you guys are probably sick of the history, but at the same time most people don't know where InfraSearch came from, its relationship to Gnutella, where it might have diverged, where it fits into Sun's JXTA framework, and so on. Perhaps we could start there.
Gene Kan: Well, InfraSearch started from a bunch of my friends and me sitting around and thinking of how we might prove that Gnutella was interesting for more than music sharing and the like. We saw that at bottom what Gnutella really was, was a distributed searching network, and clearly the way to demonstrate this was to build a search engine using Gnutella, and you know that would be the pure expression, right?
So we wanted to demonstrate that Gnutella doesn't care about the data that's trafficked over the network, and that the software and the protocol are completely agnostic to what information is carried. So we thought, well, wouldn't it be awesome if we could just make every node on the network participate not as a file-sharing client but rather as a distributed search client, so that we could tie together boxes from effectively hundreds of different information providers.
And I was really excited about this because the company that I was working for at the time had this problem where all of their data was stored in an application database. None of that was exposable to a traditional search engine or a traditional crawler-based search engine, because the crawlers would get scared when they see the question mark in the URL. So there was this huge mine of data out there that wasn't being accessed, and, you know, pretty much if you look in the URL field of your browser, you can tell that this is a huge problem where the question mark in the URL is preventing crawlers from accessing a large portion of the Web's data.
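The crawler behavior Gene describes can be illustrated with a minimal sketch. The policy below (skip any URL that carries a query string) is an assumption made for illustration, not the documented behavior of any particular search engine, and the `crawlable` helper and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def crawlable(url: str) -> bool:
    """Hypothetical conservative crawler policy: only fetch URLs with no query string."""
    return urlsplit(url).query == ""


urls = [
    "http://example.com/about.html",        # static page: would be crawled
    "http://example.com/catalog?item=42",   # database-backed page: would be skipped
]

# Pages generated from an application database never reach the index.
skipped = [u for u in urls if not crawlable(u)]
print(skipped)
```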
And so we threw up a prototype that was based entirely on the Gnutella protocol, and where you could type in algebraic expressions or simple arithmetic expressions and have a PC calculate the results for you, and a few others. So, you know, we thought this was all pretty cool, and we had a few Valley types that were interested in the technology, and we decided that this would be a pretty cool thing that we should develop further. And that's why we started InfraSearch, the company. Since our inception as a company, we decided that we had proved our point with Gnutella and we moved to a proprietary protocol that was specifically tailored to the task of generalized searching, and it's of course XML-based, so that we are more compatible with what's out there on the Web.
Kelly Truelove: So, Gene, there were a couple of aspects of Gnutella. One was that the queries were passed from peer to peer, and the other is that each peer evaluated the query as it saw fit.
Kelly: The InfraSearch prototype -- it's clear from the way it works that, you know, each peer evaluated the query by running it against some site search or something like that, but how was the distribution of queries handled?
Gene: In the prototype, the distribution of queries was Gnutella. You know, it was entirely a Gnutella back end, so you would go to infrasearch.com and a custom Web server would throw up this little search box. When you entered your query, the front end of that Web server would accept your query and the back end would shuttle it out over Gnutella. So the front end of it was HTTP and the back end of it was the Gnutella protocol.
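The two ideas discussed here, queries passed from peer to peer and each peer evaluating the query as it sees fit, can be sketched in a few lines. This is an in-memory toy, not the Gnutella wire protocol or InfraSearch's code: the `Peer` class, the `search` entry point (standing in for the Web front end), and the `calculator` and `catalog` evaluators are all hypothetical names. The hop limit and the duplicate check by query ID are simplified versions of mechanisms Gnutella uses.

```python
import ast
import operator
import uuid

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node):
    """Evaluate a parsed arithmetic expression; reject anything else."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def calculator(text):
    """One peer's idea of a query: arithmetic. Returns None if it cannot answer."""
    try:
        return str(_eval(ast.parse(text, mode="eval").body))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None


def catalog(text):
    """Another peer's idea of a query: a keyword lookup in its own (made-up) data."""
    return {"jxta": "Project JXTA overview"}.get(text.lower())


class Peer:
    def __init__(self, name, evaluate):
        self.name, self.evaluate = name, evaluate
        self.neighbors, self.seen = [], set()

    def connect(self, other):
        self.neighbors.append(other)
        other.neighbors.append(self)

    def handle(self, query_id, text, ttl):
        # Drop queries already seen or out of hops, so flooding terminates.
        if query_id in self.seen or ttl <= 0:
            return []
        self.seen.add(query_id)
        answer = self.evaluate(text)
        results = [(self.name, answer)] if answer is not None else []
        for neighbor in self.neighbors:  # pass the live query along
            results.extend(neighbor.handle(query_id, text, ttl - 1))
        return results


def search(entry_peer, text, ttl=3):
    """Stand-in for the front end: accept a query and push it into the network."""
    return entry_peer.handle(uuid.uuid4().hex, text, ttl)


gateway = Peer("gateway", lambda text: None)  # the gateway itself answers nothing
calc, docs = Peer("calc", calculator), Peer("docs", catalog)
gateway.connect(calc)
gateway.connect(docs)

print(search(gateway, "2 * (3 + 4)"))
print(search(gateway, "JXTA"))
```

The same query reaches every peer, but only the peers that understand it respond: the arithmetic query is answered by the calculator node and the keyword query by the catalog node.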
Rael: The public demo was rather reminiscent of the early days of Web and WAIS integration. I don't know if WAIS goes back too far for you.
Gene: I never played with WAIS.
Rael: WAIS was basically a search protocol where you could define how your node understood and responded to queries -- sound familiar? While it wasn't a distributed search like Gnutella (it was point to point), your demo was, for me, a nice return to that concept.
Gene: That's cool, didn't know that.
Rael: So, continuing on, how did you get hooked up with Sun and JXTA?
Gene: Well, we were building our technology and in the meantime the JXTA team started to scour the peer-to-peer world, talking with everybody and figuring out what they should do. And they started to talk to us and, basically, everything that Ingrid and Mike and Li said, we thought, "Well, yeah, that's pretty much what we want to do, too." And it just evolved, and eventually that turned into an acquisition.