
Gene Kan & Mike Clary on Sun's Infrasearch Buy

by Richard Koman

Sun Microsystems announced March 6 that it would acquire Infrasearch, a real-time P2P searching service invented by Gnutella lead developer Gene Kan. With big-time investors such as Marc Andreessen and former Netscape exec Mike Homer, Infrasearch might have been the next big crazed Internet IPO. But times have changed, and the prospect of making gazillions in the stock market seems remote at best right now. So being part of Sun and Bill Joy's Project Juxtapose no doubt looks good to Kan. To get a closer look at the acquisition, OpenP2P.com interviewed Kan and Sun VP Mike Clary, who heads Project Juxtapose, or JXTA.

To start with, Gene, can you describe the InfraSearch technology?

Gene Kan: OK, Infrasearch is a distributed-search engine that kind of leverages a lot of the assets behind peer-to-peer computing style. In my mind, in a peer-to-peer environment there are three critical resources: network capacity, processing power and hard disk space (essentially storage capacity). Infrasearch leverages two of those axes: processing power and storage capacity.

The basic idea is that Infrasearch is able to effectively turn all of the computers on a network into a collective brain, if you will, in disseminating the information that is available on each of those computers. And that's something that's really unique when compared to the World Wide Web. On the Web, the hosts of information are in fact treated as second-class citizens when it comes to answering requests based upon the information that is located on each Web host. And by that I mean that the information that is residing on each host must first be interpreted by a crawler and so on before any kind of questions can be answered about that information.

That doesn't work in a peer-to-peer world, for at least two reasons. The first is that peer networks are extremely transient and the information available on those networks is constantly changing, not only because the computers are appearing and disappearing all of the time, but because the information itself is changing at a much more rapid rate. And the second thing is that on a peer network it's important to treat every host as a first-class information provider, because the key idea behind peer computing is that each node in the network has the possibility to make a very important contribution to the network as a whole.


What's the impact on distributed searching when every node is a first-class citizen?

Kan: The key advantage is the ability for much finer-grain information to be found. When you ask a question, it's answered by a very localized database.

In Infrasearch's real-time searching, you can search peers not only for static data, but for a machine to calculate an equation, for example, or process some set of data. Is that right?

Kan: Right. The query is live, in a sense. It's not simply compared against a list of words. Your query is actually distributed in real time, which means that each provider of information has the capability to interpret that query and act upon it. And apart from the qualitative issues of peer searching, there's really a key quantitative issue, which is in a peer-to-peer computing world, where the footprint of the host can be extremely small, it's critical to have a scalable solution, not just for the architecture of the peer network itself but for all of the services that come along with that network, and one of those services, of course, is search.

So at the O'Reilly Conference, Bill [Joy] talked a lot about a billion-device future as did just about everyone else at the conference, and really that's something that we need to be looking at because current Web search engines have a problem keeping up with a billion pages, much less a billion hosts that are constantly popping in and out of the network. So clearly this is a problem that needs to be addressed, and we think that Infrasearch takes several key steps toward answering that problem.

It's really a question of scale, right? It's a question of whether all of your technologies scale up with the number of participants in your network, and for a peer network to be successful that has to be answered positively. On the World Wide Web, so far we've been able to get away, in many instances, without answering that positively.
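The live query Kan describes, one that each information provider interprets for itself rather than one matched against a list of words, can be illustrated with a small sketch. Everything here is hypothetical: the names `file_peer`, `calculator_peer`, and `live_search` are invented for illustration and say nothing about Infrasearch's actual protocol or code, which the interview describes as proprietary.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical model: a peer is just a function that may answer a query.
Handler = Callable[[str], Optional[str]]


def file_peer(files: List[str]) -> Handler:
    """Build a peer that answers from its own local list of file names."""
    def handle(query: str) -> Optional[str]:
        matches = [name for name in files if query.lower() in name.lower()]
        return ", ".join(matches) if matches else None
    return handle


def calculator_peer(query: str) -> Optional[str]:
    """A peer that interprets the query as simple 'a + b' arithmetic."""
    parts = query.split("+")
    if len(parts) != 2:
        return None
    try:
        return str(float(parts[0]) + float(parts[1]))
    except ValueError:
        return None


def live_search(query: str, peers: Dict[str, Handler]) -> Dict[str, str]:
    """Hand the query to every peer and keep the answers that come back."""
    answers: Dict[str, str] = {}
    for name, handler in peers.items():
        result = handler(query)
        if result is not None:
            answers[name] = result
    return answers


peers = {
    "alice": file_peer(["jxta-notes.txt", "gnutella-spec.txt"]),
    "calc": calculator_peer,
}
print(live_search("gnutella", peers))
print(live_search("2 + 3", peers))
```

The same query text reaches both peers, but only the peer that can make sense of it responds: the file peer answers the keyword, the calculator peer answers the equation.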

Is Infrasearch based on your earlier work with Gnutella?

Kan: The prototype was, yes. And we based the prototype on Gnutella because we really wanted to demonstrate unequivocally that Gnutella had a much broader appeal than just file sharing, which it was kind of consistently associated with. And we thought that a clear way to show that Gnutella is a kind of peer information interchange protocol, or technology, was by demonstrating a peer information discovery type tool using nothing but Gnutella. Since then the technology has become a proprietary thing. Rather than using Gnutella we're using network protocols and a communications architecture that is more uniquely suited to the problem of searching.

So it's a proprietary framework right now?

Kan: That's correct.

And do you intend to release parts of it through an open-source license?

Mike Clary: We had our ideas of where we would open some aspects of the search and several of those ideas have been discussed with the JXTA team. As we move forward, we'll clarify and kind of blend Infrasearch into the JXTA effort, so I think that those kind of questions will be answered as we move toward a tighter integration.

Mike, can you talk a little about your conversations with Gene and why Infrasearch seemed like such a natural fit with the JXTA project?

Clary: I think if you take a look at what we're doing with JXTA and its research set of lenses, you'll find that what we're talking about is distributed computing and making all these nodes that Gene's talking about part of a large collective, or a large collective computer, if you will. And I think the primitives that we're going to try and get established or adopted in JXTA (you know, the notion of distribution and pipes and grouping and monitoring and everything else that we're trying to do), that's really about distributing all the capability, all the functionality, or processing power or storage, as Gene was talking about, across a lot of different nodes.

And so one of the first things that we concluded after we started thinking about the primitives was search; how do you search in the context of a distributed space that is largely transitory, where things are there some days and some days not? We knew it was different than the conventional Web crawler approach, where you run a spider across a bunch of static data, you roll it into a big index, and then you pose a query against that index and you just walk across the directory.

So Gene's technology gave us the ability to say two things: One is how do we satisfy searching in this very large distributed space that has ad hoc or transitory characteristics? That was issue No. 1, so InfraSearch looked very attractive from that standpoint.

And the second thing is I think what Gene's technology does: It exposes what some people refer to as the deep Web, the stuff that's behind the interfaces on all of those nodes or those Web sites or those computers on the network. So how do you go out there and tickle those interfaces and find real-time or close-to-real-time data from those nodes that come and go? So I think it's a combination of those two things, and we were looking at this infrastructure layer, thinking about some of the first services that were going to be required in order to build interesting applications, and search definitely fell into that category, and so Gene's technology popped up and it looked good from our perspective.
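The contrast Clary draws, between a crawler that rolls static data into one big index and a search that asks transient nodes directly, can be shown with a toy model. This is only a sketch under stated assumptions: the network is a plain dictionary, and `build_index`, `query_index`, and `query_live` are hypothetical names, not part of JXTA or Infrasearch.

```python
from typing import Dict, List, Set

# Hypothetical model: each node name maps to the documents it holds right now.
Network = Dict[str, List[str]]


def build_index(network: Network) -> Dict[str, Set[str]]:
    """Crawler style: snapshot every node into one word -> nodes index."""
    index: Dict[str, Set[str]] = {}
    for node, docs in network.items():
        for doc in docs:
            for word in doc.lower().split():
                index.setdefault(word, set()).add(node)
    return index


def query_index(index: Dict[str, Set[str]], word: str) -> List[str]:
    """Answer from the snapshot, whatever the network looks like now."""
    return sorted(index.get(word.lower(), set()))


def query_live(network: Network, word: str) -> List[str]:
    """Peer style: ask only the nodes that are present at query time."""
    return sorted(
        node for node, docs in network.items()
        if any(word.lower() in doc.lower().split() for doc in docs)
    )


network: Network = {"node-a": ["stock quote feed"], "node-b": ["weather feed"]}
index = build_index(network)

# After the crawl, the network changes: node-a leaves and node-c joins.
del network["node-a"]
network["node-c"] = ["traffic feed"]

print(query_index(index, "feed"))   # still lists node-a, misses node-c
print(query_live(network, "feed"))  # reflects the nodes online now
```

The index keeps pointing at a node that has left and never learns about the one that joined, which is the staleness problem both speakers attribute to crawling a transient network.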

Will InfraSearch be migrated into the infrastructure layer, or will it remain an independent service?

Clary: I think we'd like to maintain, if you will, a pretty bright line between what infrastructure is (the JXTA stuff) and what are services on top of it. And so we always think of it as a three-layer cake: the infrastructure, peer-to-peer or network services, and then interesting applications that use those services.

I think we're going to maintain a strong demarcation between those things that are infrastructure (fundamental, if you will, plumbing) versus those things that may reach users from a service level or an application level. So we're going to try to keep those things separate. It's not a case where we're going to say that Gene's technology is the only searching mechanism that will ever be out there in existence. We think it's going to be a very popular and compelling, powerful searching capability, but I'm sure there'll be other technologies that we're not going to prohibit from sitting on top of the infrastructure. We just think it's a good first foray, if you will, into "How do we actually search in this distributed space?" And we're going to continue to invest in it and advance the technology as time goes by.
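Clary's three-layer cake, with search as one replaceable service between the infrastructure and the applications, might look roughly like the following. This is an illustrative sketch only: the classes `Infrastructure`, `SearchService`, `SubstringSearch`, and `ExactSearch` are invented here and do not correspond to any JXTA interface.

```python
from typing import Dict, List, Protocol


class Infrastructure:
    """Bottom layer: only tracks which peers exist and what they hold."""

    def __init__(self) -> None:
        self._peers: Dict[str, List[str]] = {}

    def join(self, peer: str, items: List[str]) -> None:
        self._peers[peer] = items

    def peers(self) -> Dict[str, List[str]]:
        return dict(self._peers)


class SearchService(Protocol):
    """Middle layer: any search service that sits on the infrastructure."""

    def search(self, net: Infrastructure, query: str) -> List[str]: ...


class SubstringSearch:
    """One possible service: match the query anywhere inside an item."""

    def search(self, net: Infrastructure, query: str) -> List[str]:
        return sorted(p for p, items in net.peers().items()
                      if any(query in item for item in items))


class ExactSearch:
    """An alternative service: match whole items only."""

    def search(self, net: Infrastructure, query: str) -> List[str]:
        return sorted(p for p, items in net.peers().items() if query in items)


def find_peers(net: Infrastructure, service: SearchService, query: str) -> List[str]:
    """Top layer: an application that uses whichever service it is handed."""
    return service.search(net, query)


net = Infrastructure()
net.join("peer-1", ["jxta-spec"])
net.join("peer-2", ["jxta"])
print(find_peers(net, SubstringSearch(), "jxta"))
print(find_peers(net, ExactSearch(), "jxta"))
```

Because the application depends only on the `SearchService` shape, a second search technology can sit on the same infrastructure without changing either neighbouring layer.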
