Tales of Optimization and Troubleshooting
by Howard Feldman
Working in the field of bioinformatics has its share of excitement. Biology produces massive amounts of data to store, catalog, and analyze, and the influx grows at a rate comparable to Moore's Law, constantly pushing cutting-edge hardware and software to their limits. Thus it is not surprising that we experienced speed issues with some of our own software as hardware changed and databases grew.
An Optimizable Scenario
We have a piece of software developed in-house that serves as a data warehouse of sorts, amalgamating many different biological databases from different, often disparate sources. We call this project SeqHound. It makes our lives easier by providing us with a common API through which we access the biological data we need in our applications. If we come across a new database on the Web that we would like to use, we will often incorporate that database into the system. SeqHound has grown quite large and complex over time, on the order of several hundred gigabytes of data in several dozen tables.
Because work on this project started several years ago, before the popularity of MySQL and other freely available DBMSs, we had chosen Sequiter Software's CodeBase as our database manager. CodeBase can be built as a standalone C library that reads and writes the Microsoft FoxPro format. If you link this library to an application, the entire DBMS is incorporated right into the executable, requiring no database server. Being so low-level, this also results in extremely fast and efficient database access: there are no SQL, network, or other layers to slow it down as a standard database would have.
SeqHound can be accessed in one of two ways. The main API is in C, so you can link directly to the SeqHound local C library, in which case all database calls will take place on the machine where the executable runs. In other words, you can use the local API only if the databases are on the same machine as the program accessing them.
The remote API, on the other hand, allows a user to make database queries from a different machine by making HTTP calls to a CGI application (built with the SeqHound local library) on the database server. The remote API provides C, C++, Java, and Perl wrappers around the appropriate HTTP calls, so that the code looks the same as if the local API were used.
Because of our choice of DBMS, we have the unique choice of whether to use a local or remote API for a given application. When an application must run on a machine that does not have room for a full copy of SeqHound or when the application is not written in C, we must use the remote API. Otherwise, we use the local one. Of course, if the backend were using DB2 or MySQL, we'd need to make a connection to the server, so we'd always make a remote connection even if the server were running on the same machine.
Now that our situation is clear, I can explain what had been bothering us for a while: the local API was generally about 100 times faster than the remote API for the same task! Further, the remote API on our old 1.5 GHz Linux box outperformed our shiny new Sun V880 with RAID, especially on some of the parsers that created the database tables. Additionally, the Java remote API was painfully slow compared to the one in C. Not knowing what to do, some of us chalked it up to CodeBase, to Sun for making inferior hardware, or to Java for being slow. None of this made sense, though. Many people were happy using Sun hardware and Java, and common sense said that the overhead of an HTTP call should be much less than we were observing.
Having to use SeqHound myself, I was frustrated by the speed issues and determined to find out what was going on. How hard could it be? What I found was not at all what I expected. The story was interesting enough that I thought it worth sharing. Perhaps you'll find ways to apply some of the principles employed to your own code optimization procedures.
Looking for Problems
I decided to focus on the local vs. remote discrepancy first. The first thing to do when trying to speed up code is to turn to your trusty code profiler, which works well in most cases. These programs time your application while it is running to see how long it spends in each function call you make. I could run the same program with the remote and local APIs, on the same computer, and compare their profiles. For the remote API I had to profile the actual CGI, not the client, which just makes the HTTP call and waits.
Unfortunately, there are two problems with this. First, the profiler reports only the CPU time used executing code, so it does not account for time spent reading the database from disk or, in the case of the remote API, transferring data over the network. In fact, the CPU time used would be the same for both the local and remote APIs, since the same code was running. Second, profilers typically give only 0.01 s precision, much longer than most SeqHound calls actually take to execute! Time only starts to add up after thousands of calls.
Returning to the drawing board, my main concern was to find something with higher timing precision. After some searching, the C function gettimeofday() seemed to be the best one for the job, having microsecond precision, at least on Solaris and Linux. Now, the first thing they teach you in Computer Science about debugging is not to use printf to debug. Luckily I am not a Computer Science graduate, so that is exactly what I did.
I wanted to break down the time spent for each part of the program, for every function call, for the socket read function call in the remote API, and so on. All that required was to store the time before and after each atomic sub-unit of the programs and to record the time differences. I broke down the program into the smallest pieces that seemed reasonable, giving me about 40 timings for each of the remote and local versions of a simple program to initialize SeqHound, make two or three representative calls, and close everything.
As expected, the local API was about 100 times as fast as the remote API in total. This was a good sanity check and confirmed that my approach was sound. Looking through my table for a bottleneck, something jumped out at me. One step during the SeqHound initialization (loading a data file needed by some functions) took about 80% of the total time! To compound this, there were two versions of the file available to us, one about three times the size of the other. For this purpose the smaller one was just as good, but of course we were using the big one. Even more important, this initialization took place in the server-side CGI for every function called through the remote API! For N remote calls there were N initializations; the local API performed only one initialization for any number of calls.
There it was; the numbers did not lie. The biggest cause for the slowdown was repeating the load in the initialization phase. By rewriting that function to load that file into memory only for the API calls that needed it, and by using the smaller version of the data file, I achieved a five-fold speedup!
Of course, I said it was roughly 100 times slower, so this still left a factor of 20 to explain. I did not have to do any more work, though, as the numbers were plain in front of me. The overhead of making an HTTP call, waiting for it to reach Apache on the server, and waiting for Apache to process the request and send back a reply accounted for approximately 20 ms of waiting time for the client, on top of the time the actual call takes on the server. With the faster calls taking about 1 ms, the extra factor of 20 was clearly network overhead, and further optimization there was not possible. While the remote API was still slow, it was now five times faster than before.
Based on my observations, it was also clear that if the client could minimize the number of SeqHound calls made (for example, by making batch calls that perform several tasks at once in a single HTTP request), the speed could approach that of the local API. Similarly, using the Apache FastCGI module would avoid the initialization phase on every HTTP request and would yield roughly a further two-fold speedup.
But There's More
With this problem fixed, I felt pretty good. CodeBase was not so bad after all; a simple change had resulted in a great speedup. I thought I might be able to get back to my work now. Just as I began eating lunch, however, a coworker came by and told me his Java SeqHound program was really slow. "I fixed it," I told him. "No you didn't," he responded.
A little annoyed, I trudged over to his desk and watched his test program run like molasses. This was even more puzzling. It took a little over half a minute to make 10 simple function calls to our Solaris server. When we made the same calls to the older Linux box, it was fine, taking only a second or two. This made absolutely no sense. Java might be a little slower, but why was the Linux server giving such a fast response?
Again I sat puzzled. In a situation like this, the best thing to do is to enumerate all the possible differences that could explain the observations. The obvious difference was hardware and operating system: Linux vs. Solaris. Both used Java and both used the same C code for the remote CGI, so I could rule those out. Perhaps the Linux compiler optimized code better? No: if we switched from Java to C on the client machine, both the Solaris and Linux servers gave comparable results, finishing the same job in a second or two. The only slow combination was a Java client talking to the Solaris server.
At a loss, I still had my gettimeofday calls scattered throughout the code, so I made my timing tables again. Things got mildly interesting now. The runtimes on the Solaris and Linux boxes, using Java to make the API calls, were pretty similar. However, there was an extra delay of three seconds between each call to the Solaris machine! Remember, I had microsecond precision here, and the delay was precisely three seconds. With 10 calls in our test program, three seconds of delay per call equals 30 seconds, which seemed about right. Now this was awfully suspicious: no random network delay or mysterious Java overhead would take precisely and consistently three seconds. Perhaps it required a bit of knowledge and experience, but it was clear at this point that there was a three-second timeout delay somewhere. Where?
The delay seemed to happen at the start of each call, so the obvious place to look was the web server logs. Comparing the Apache access logs on both the Linux and Solaris servers, I could see the client HTTP requests made from our tests. The calls looked very similar; so much for that idea. Scrolling back in the log a bit, I saw the calls from when we tested using the C client instead of Java. Remember, that client did not suffer the delay that Java did.
I compared the C calls to the Java calls. Eureka! The C calls made an HTTP/1.0 GET, while the Java API made an HTTP/1.1 POST. Obviously the problem had something to do with Solaris not liking HTTP/1.1 POST calls, but we used the same version of Apache on the Linux box, so why didn't it have a similar problem? I knew I was close, but I still missed a piece of the puzzle.
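For illustration, the request lines in the two access logs differed roughly like this (the CGI path and query string here are hypothetical, not the actual SeqHound URLs):

```
GET /cgi-bin/seqrem?... HTTP/1.0      <- C client
POST /cgi-bin/seqrem HTTP/1.1         <- Java client
```

Same CGI, same parameters; only the method and the protocol version differed.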
Stumped and frustrated, I explained what I had found so far to our IT head, who promptly told me that the problem was clearly KeepAlive. This is a setting in the Apache configuration file (httpd.conf) which, if enabled, keeps connections made with HTTP/1.1 calls open for a certain amount of time. If used properly, it can greatly accelerate web browsing by holding connections open for repeated calls to the same site. In our case, it blocked each successive HTTP call until the previous one had timed out. Checking httpd.conf on both machines, KeepAlive was indeed disabled on the Linux server and set to a three-second timeout on the Solaris box! The mystery had been solved. Relieved, we switched to HTTP/1.0 calls in the Java API. I finished my lunch in peace.
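For reference, these are the httpd.conf directives involved, shown here as we effectively found them (the surrounding configuration is omitted; the three-second timeout matches the delay we measured):

```apache
# Solaris server: persistent connections enabled, held open for three seconds
KeepAlive On
KeepAliveTimeout 3

# Linux server: persistent connections disabled entirely
KeepAlive Off
```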
A Final Tale
I thought for sure I was done now. I'd fixed two independent problems, and SeqHound was behaving much better. People even stopped cursing Java and Sun for a while. There was still one problem, though. The databases on the Solaris machine were falling out of date. When I inquired as to the reason, I discovered the nightly update sometimes took half a day or more. I narrowed it down to one specific parser that rebuilt one of the tables from scratch each night. It took about 20 minutes on the Linux box, but on the SAN attached to our Solaris server, it needed 12 hours or more! My work was not quite done.
Again, I asked myself what was reasonable. What was the program doing? It was a very simple program that parses a text file and inserts one row per line of text into a database table. There were several million rows, so taking 20 minutes was quite reasonable, but hours on the SAN was not! This was an expensive, high-quality SAN; even if it were configured poorly, it should not be that slow.
Thinking about the program itself, I imagined it would spend most of its time writing to the database. Reading from the text file and parsing would be negligible. Profiling was not necessary in this case. Clearly disk writing was the problem here.
I thought back to some disk benchmarks we had done in the past. The SAN had seemed fast enough. I also remembered that in those disk benchmark programs, shrinking the block size to something very small degrades performance significantly. Some disks deal very poorly with very small writes, where "small" means anything less than the standard block size, often 2,048 bytes on a Linux ext3 file system and as much as 65,536 bytes or more on a SAN. I watched the file size of the database table grow as the program ran. It increased by around 100 bytes at a time. This confirmed my fears: the program wrote out to disk (flushing its buffers) after every single insert into the database! While Linux, with its smaller cluster size, could tolerate this, the SAN was very unhappy about it. It was not designed for that kind of abuse.
I whipped out our CodeBase manuals and went to work. With a few lines here and a few changes there, I had enabled its built-in buffering feature. This causes CodeBase to buffer writes in memory, writing to disk only when the buffer fills up or when I explicitly flush. A few directory listings taken while the program was running verified that the database file now increased in size in chunks of about 32 KB. It seemed to be moving along much faster now.

The program was still slower than on the Linux box, though. Staring at the directory listings on my screen, I noticed something else. The program still updated the database index file after every single row insertion; the caching only affected the tables, not the index. Again digging into the manuals produced a solution: disable index updates until the end of the program. The end result would be the same, but this would avoid writing to the disk after every insert into the table, which was what thrashed the SAN. I made the change and it worked, cutting the runtime from 14 hours down to 15-20 minutes. It now compared favorably to the performance on the Linux machine again. Clearly the Linux box and its RAID were much more tolerant of small writes than the SAN was. By understanding the problem, though, it was easy to work around.
What can we learn from all of this? What do the three separate-but-related case studies I have shared here have in common? For one thing, although the problems initially seemed like hardware limitations, and many people jumped to that conclusion, they all had quick and easy software fixes. I identified all of these bottlenecks with a straightforward, systematic analysis of the system. As long as you are thorough and systematic, you will find the bottleneck in your system, whatever it may be. That is only half the battle, however. Once you've identified the bottleneck, you must understand it. This is where many fail.
Without a firm understanding of hardware, software, and the underlying protocols, it can be difficult to proceed past this point in the analysis. Each of the above problems required a key piece of information or knowledge to understand why there was a bottleneck and how to fix it. In the first case, I needed to understand the SeqHound initialization code. That allowed me to pinpoint the one file that was being loaded into memory each time. I also knew when we could avoid loading that file. Without this knowledge, I could not have proceeded.
Similarly, for the second example, I have our IT head to thank for educating me about KeepAlive and what it does. In the final example, my experience with disk benchmarking and knowing that small writes can be inefficient on certain RAID environments, as well as understanding how CodeBase buffers (or doesn't buffer) its writes to disk, was invaluable in identifying and fixing the bottleneck.
There is no easy answer for performance tuning; experience is always the best teacher. This does not mean you have to be an IT whiz or a rocket scientist, either. If I learned anything from this, it is to do as much as you can by yourself and to ask for help when you're stuck. Know your coworkers' strengths and weaknesses: what stumps you could be obvious to someone else, and a fresh pair of eyes on a problem never hurts. Each problem you fix is a chance to gain a little more experience, hopefully helping you to avoid the same problem in the future.
Next time, before you walk off threatening to send your new super-duper computer to the scrap yard, remember that your coding may just possibly be to blame.
Howard Feldman is a research scientist at the Chemical Computing Group in Montreal, Quebec.