The Hashbelt Data Structureby William Grosso, author of Java RMI
In the first article in this series, I explained why it is sometimes necessary to expire data, and discussed the standard approaches for doing so. This article focuses on a particular algorithm for data expiration that I have taken to calling a "hashbelt." (The name is a combination of "hashmap" and "conveyor belt.") Hashbelts are a generic data structure that can be easily adapted to a wide variety of problems involving time-sensitive data. They are easy to use, simple to extend, and quite efficient.
This article has two major sections. The first, "Deriving the Hashbelt Algorithm," gives an overview of the hashbelt algorithm and explores its relationship to the expiration algorithms from the first article, without paying much attention to code-level details. The second section, "The Hashbelt Framework," is a detailed discussion of one particular implementation of hashbelts (included in the source code for this article). That is, the first section explains the algorithm and the second attempts to explain my implementation.
By the end of this article, you should understand the hashbelt algorithm and have a reasonable understanding of when it is appropriate to use it.
Things You Should Know Before Reading This Article
Obviously, I am expecting that you have read the first article in this series. In addition, this article is not for people who are just starting to program, or just starting to program in Java. If you don't know how to use the collection libraries, haven't written a distributed program before, or don't have a good grasp of what the
synchronized keyword does, you'll probably have trouble reading this article. To learn more about the basics of distributed programming and writing multi-threaded code, I recommend my book on RMI. Besides covering RMI, it covers many of the basic design and coding decisions involved in building distributed applications.
In addition, the code examples for this article use generics quite heavily. Generics are a new feature in Java -- they are not in JDK 1.4, but are available from Sun on an experimental basis. The most complete source of information about generics is the JSR-014 specification. A shorter and friendlier introduction to generics can be found in the third article in my series on command objects.
About the Source Code
The source code for both the hashbelt framework and the examples used in this article can be downloaded from here. Please feel free to use and extend the source code for this article. But -- please -- give credit where credit is due, and do not remove the licensing information and my name from the source code. Moreover, please send me e-mail if you come up with any interesting ideas for using or extending the framework; I'm planning on maintaining this code and keeping it up to date for the foreseeable future.
Deriving the Hashbelt Algorithm
In the first article in this series, I discussed a wide variety of expiration mechanisms and then concluded with a list of requirements for a time-based expiration algorithm. None of the requirements are particularly elaborate or subtle, but taken as a whole, they do rule out many different potential mechanisms.
Here's a slightly more elaborate version of those requirements:
The expiration mechanism should use a small, and bounded, number of background threads. To be honest, this requirement is really "use a single background thread." But two or three threads would be an acceptable level of overhead as well, if the threads could be justified.
The expiration mechanism must be a global cache. Time-sensitive data objects frequently don't have an owner. Instead, they're owned by some sort of global cache that is also responsible for expiring them. Code that needs to use a particular object accesses it via the global cache.
The expiration mechanism cannot make other code harder to write. The hard part of expiration should be in the framework. Using the framework to expire objects should be as easy as possible. The strongest possible version of this requirement is: a programmer should to be able to expire any instance of any class, without needing to adapt the class.
The caching should be consistent. The global cache should be consistent -- objects that have been expired shouldn't be accessible from the cache.
Everything must be fast. The simplest form of a cache is simply an instance of
Hashtable. An acceptable data expiration algorithm can't be much slower than that. A constant time increase in cost is acceptable, but the expiration mechanism shouldn't get slower as the cache size increases.
The indexing scheme should make sense. If there isn't an obvious owner for an object, then clients will need to obtain references using some sort of key or logical index. The key used has to be programmatically reasonable (e.g., supported by the problem domain; session objects should be indexed by session keys and so on) and that means we have to support logical indexing (not just time-based indexing).
The expiration mechanism should be general. Whatever algorithm we come up with should be useful in all of our example cases (not just for session keys and not just for short-lived information).
The best solution we examined in the first article was the one adopted by Tomcat. Tomcat uses a single instance of
Hashtable to store session objects and expires stale session objects using a background thread. Tomcat's background thread executes an infinite loop around the following five steps:
- The thread wakes up.
- The thread locks the cache for a brief period of time, during which it builds an instance of
Vectorcontaining the session objects (it does not copy or clone the session objects; it merely inserts them all into a new data structure).
- The thread unlocks the cache, thereby allowing the main code (the servlet engine) to access the cache once more.
- The thread iterates over the vector it built, asking each session object whether it is stale. Every stale element in the copy is removed from the main data structure.
- The thread goes to sleep for a brief period of time.
In this section, we're going to explain the hashbelt algorithm by describing how hashbelt differs from what Tomcat does. By the end of this section, I'll have completely described the hashbelt approach to data expiration and explained why each change is necessary or useful.