Rael: So that was on the business side. On the technical side, what changes were made along the way? Did you diverge from Gnutella completely? What did you carry along with you?
Gene: Once we incorporated as a company, and we started to really think seriously about search, we stopped looking at the Gnutella protocol because we decided that we could develop something that was much more specifically tailored to the needs that we wanted to address, that was XML-based, and everything like that. Certainly, we've learned a lot from Gnutella, and all those lessons are reflected --
Kelly: So you derived from Gnutella the inspiration of, you know, let each peer process the queries as it sees fit?
Kelly: And the notion of queries being distributed out to peers, although not necessarily the same way that Gnutella distributes them?
Kelly: Not necessarily in the same Gnutella bucket brigade distribution?
Kelly: Why did you move away from Gnutella?
Gene: Probably one of the most compelling reasons was that we thought it wasn't a good approach in a commercial endeavor. You know, suppose you're Company A: do you really want to carry competitor B's query traffic? Or its result traffic? Probably not, right?
Steve Waterhouse: When I joined the company we weren't using the Gnutella protocol at that point. Essentially, what we designed and what we've now completed while at Sun was a distributed search network or a framework for distributed search, and the first area that we targeted this at was the Web -- primarily, deep content contained inside application servers and databases connected to Web servers that is typically not accessed well by a standard crawler approach to searching. You know, all the big search engines essentially go out and crawl the pages, stick them in a big index, and then when you go to search, you're searching against an old piece of data, if you like.
So the alternative idea that the founders of InfraSearch had was to distribute the query out to the edges of the network and let the intelligence of the peer it's being sent to process it in whatever format is appropriate for that query, and respond. And the way that we did this distribution was using a system within the network which we called "hubs." We have a network of hubs, and when you post a query into a hub, or into the network, it gets picked up by one of the hubs and the hub says to itself, "Is there any provider that I know about that can handle this query?" If it doesn't know of one -- and you can have various different rules for how it should process that -- it shuttles the query off to the other hubs, and they continue to process it until, hopefully, it gets answered by one of the providers. The provider then, in the case of search, runs the query against its database, or index server, or whatever it's running, and returns the result back to the hub and ultimately back to the requester.
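The routing loop Steve describes -- check the providers a hub knows about first, otherwise shuttle the query off to neighboring hubs -- can be sketched roughly as follows. This is an illustrative Python sketch with invented class and method names, not the actual InfraSearch or JXTA Search code:

```python
class Provider:
    """A peer at the edge that can answer certain queries, e.g. by
    running them against its own database or index server."""
    def __init__(self, name, topics):
        self.name = name
        self.topics = set(topics)

    def can_handle(self, query):
        return any(t in query for t in self.topics)

    def answer(self, query):
        return f"{self.name} answers: {query}"


class Hub:
    """A hub knows some providers directly and some neighbor hubs."""
    def __init__(self):
        self.providers = []
        self.neighbors = []

    def route(self, query, seen=None):
        seen = seen if seen is not None else set()
        if id(self) in seen:          # don't loop between hubs
            return None
        seen.add(id(self))
        # "Is there any provider that I know about that can handle this query?"
        for p in self.providers:
            if p.can_handle(query):
                return p.answer(query)
        # Otherwise shuttle it off to the other hubs.
        for h in self.neighbors:
            result = h.route(query, seen)
            if result is not None:
                return result
        return None


# A query posted at hub_a is answered by a provider known only to hub_b.
hub_a, hub_b = Hub(), Hub()
hub_a.neighbors.append(hub_b)
hub_b.providers.append(Provider("flights-db", ["flight"]))
print(hub_a.route("cheap flight to Boston"))
# flights-db answers: cheap flight to Boston
```

The recursion here stands in for hub-to-hub forwarding; as Steve notes later, the real framework leaves the exact forwarding policy up to whoever runs the hub.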
At the stage when we were in discussions with Sun, we had written a purely Web-focused engine -- in other words, HTTP; we were using Java servlets -- but at the same time it was very distributed, in the sense that all the different components could be placed around the network in different ways. We also designed it to be relatively network agnostic in terms of the messaging used, and agnostic about the type of transport, and so it actually fit really neatly into what we then started working on at Sun, which is part of the JXTA project.
What the leaders of the JXTA group asked us to do was make this work not only on the Web but also to make search within a distributed network very efficient. And so those are the two sides of JXTA Search: both deep and wide. Deep in the sense of finding content at the edges, deep inside the databases at the edge of the Web network; and wide in the sense of helping peers shuttle queries around more efficiently across a varied, distributed network.
Kelly: Let me follow up with a little technical point. The query goes through the hub, the hub sends it out to the appropriate information providers, and the provider can respond directly back to the node issuing the query?
Steve: One of the things we thought about, in terms of providers being able to respond directly back to the client, is that the client can just send along the endpoint identifier for itself when it issues the query, and then the provider can respond directly to it. But having said that, we see some advantages in the hubs.
Rael: Could you have one be the default, subject to overriding?
Steve: Right now the default is that it responds back to the person that asked it, or responds back to the hub, but the protocol's actually what we call "symmetric," in the sense that it doesn't matter whether an individual client sends a request to a provider or whether the client sends a request to a hub. They all look the same to the hub, and the hub looks the same to the client and provider and so on. Instead of thinking of the hub as a server, you can think of it like an underground peer whose job is working out the best place to send these queries. If the client knows where to send the query, then it can be just as efficient to send it directly to the provider. But we're targeting the case where clients don't necessarily know the best place to send the query, and there the hub does a good job. Does that make sense?
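One way to picture the symmetric message shape and the overridable reply path from the last few exchanges: the requester may attach its own endpoint identifier, in which case a provider can answer it directly; otherwise the response defaults back through the hub. The field names and `peer://` endpoint strings here are assumptions for illustration, not the actual protocol:

```python
def make_query(text, reply_endpoint=None):
    """Build a query message. The same shape works whether it is sent
    to a hub or directly to a provider (the "symmetric" idea)."""
    msg = {"type": "query", "body": text}
    if reply_endpoint is not None:
        # Lets the provider respond directly to the requester.
        msg["reply-to"] = reply_endpoint
    return msg

def response_target(msg, hub_endpoint):
    """Default subject to overriding: reply to the requester's own
    endpoint if it supplied one, otherwise back through the hub."""
    return msg.get("reply-to", hub_endpoint)

direct = make_query("stock quote IBM", reply_endpoint="peer://client-42")
via_hub = make_query("stock quote IBM")
print(response_target(direct, "peer://hub-1"))   # peer://client-42
print(response_target(via_hub, "peer://hub-1"))  # peer://hub-1
```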
Kelly: Right, so one way of looking at it is as a variation on the super-peer concept?
Steve: Yeah, I guess so.
Kelly: It's metadata-based routing.
Kelly: Could you talk a little more about hub-to-hub communication? If I have a query and the hub doesn't recognize it --
Steve: I think one of the challenges, Kelly, in trying to design an absolutely neat framework for something like searching -- it's an essentially ambitious thing to do -- is that you're constantly fighting a battle with yourself over how much to specify and how much to leave unspecified, and we thought a lot about how to do hub-to-hub communication. Some of the issues are things like: how does a hub register with another hub? Does it take some average of all the different provider data, aggregate it together, and then publish that to the other hubs? Or does it pass along everything it knows about all the providers?
And so we've got a number of different strategies -- not only for how to describe hubs to other hubs, but also for how best to send queries from one hub to another. We definitely got some of the concepts from Gnutella -- things like time-to-live for queries, and fan-out, so you can specify how many providers (which includes hubs) to send the query to. But we've essentially left it up to the implementation, so whoever wants to be running a hub out there -- and of course the source code's available -- can run one.
Whoever's running them can do a couple of different things. You could, for example, decide to send all queries to all the hubs that you know about, or you could decide that if a hub can satisfy the query, it's not going to send it to anybody else ... So the short answer is that we have the framework and the protocol to keep the bad things from happening -- some of the routing problems Gnutella had in the early days -- but we left the rest of it to the user, or to whoever runs the hub.
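The Gnutella-derived controls Steve mentions -- time-to-live, fan-out, and a stop-forwarding-once-satisfied policy -- might look roughly like this. It's a hypothetical sketch (the `flood` function, the hub names, and the graph layout are invented for illustration), not the JXTA Search implementation:

```python
def flood(start, graph, ttl, fan_out, can_answer):
    """Forward a query breadth-first with Gnutella-style controls:
    each hop decrements the TTL, each hub forwards to at most
    fan_out neighbors, duplicates are suppressed, and a hub that
    can satisfy the query does not forward it further."""
    visited = {start}
    frontier = [(start, ttl)]
    answered = []
    while frontier:
        hub, t = frontier.pop(0)
        if can_answer(hub):
            answered.append(hub)
            continue          # satisfied here; don't send it onward
        if t <= 0:
            continue          # TTL expired; drop the query
        for nxt in graph.get(hub, [])[:fan_out]:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, t - 1))
    return answered

graph = {"h1": ["h2", "h3", "h4"], "h2": ["h5"], "h3": [], "h4": []}
# fan_out=2: h1 forwards only to h2 and h3; ttl=1: the query dies one hop out.
print(flood("h1", graph, ttl=1, fan_out=2, can_answer=lambda h: h == "h3"))
# ['h3']
```

Tuning `ttl` and `fan_out` is exactly the kind of per-hub policy the framework leaves to whoever runs the hub.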