Tapping the Matrix, Part 1
by Carlos Justiniano
Editor's note: In this first part of a two-part series on harnessing the idle processing power of distributed machines, Carlos Justiniano explains the current trends in this exciting technology area, then drills down into specifics such as client/server communication, protocols, server design, databases, and testing. In part two, next week, he'll cover network failures, security, software updates, and backup.
Neo awakes amidst a vast field of towering pods. Each pod, connected to the matrix, contains a person whose bioelectric energy is being harvested as fuel for a race of machines. That was one of the early scenes in the blockbuster movie The Matrix. The concept, although a stretch of the imagination, has an inverted parallel -- humans are harvesting the processing power of the millions of machines connected to a matrix we call the Internet.
This opportunity is made possible, in part, by the realization that a staggering number of computers are vastly underutilized. This squandering of resources doesn't just involve home computers; few businesses utilize their computers the full 24 hours of any day. In fact, ubiquitous applications, such as word processing, email, and web browsing, require very few CPU resources to support their end users, so machines are largely idle even when they appear to be in use. Just think about what your computer is doing right now, as you read this article.
Modern machines are capable of executing a billion instructions in the time it takes us to blink. The fact is that many computers today spend a significant amount of their time displaying multicolored swirls generated by screen saver programs. This is in striking contrast to the golden age of the mainframe, when time-sharing systems demanded a premium for their precious CPU cycles.
This situation has not gone unnoticed. For researchers, the lure of harnessing spare computing cycles has been simply too good to pass up. The benefits, potentially faster computation at substantially lower costs, are made possible because the bulk of the overall computing power comes from remote machines -- machines that are individually owned and operated by the general public.
Distributed Computing, Clusters, Grid Computing, Parallel Computing, and P2P -- You Are Here!
Pop quiz: how do you create a really fast computer? You basically have two choices.
- Create faster processors and components.
- Use more than one processor.
Since the late 1970s, some of the fastest computers have been built using multiple processors. The goal has been to speed up execution by allowing programs to execute on multiple processors while sharing resources such as memory. This has also allowed programs to communicate with one another as they collaborate on a particular problem. This divide-and-conquer approach to computing is known as parallel computing and is fundamentally how complex problems are solved using supercomputers. The tight integration of specialized processors, memory, and the hardware to effectively connect components produces extremely fast computation, but also accounts for the exorbitant costs of supercomputers.
During the past decade, personal computers have matured into fast and capable machines. Researchers have explored ways of connecting groups of machines to form clusters capable of rivaling larger, more expensive systems. Just ten years ago, NASA researchers Thomas Sterling and Don Becker connected 16 machines to create a cluster they called Beowulf. Today, Beowulf clusters involving thousands of computers are commonplace. In fact, Google claims to operate the world's largest Linux cluster involving more than ten thousand servers.
As you can imagine, monitoring thousands of computers can be a nightmare. To address this issue, specialized software must be built to manage clusters. This is where grid computing comes in. Grid computing software treats a collection of machines as resources that can be partitioned, allocated, and monitored according to established usage guidelines. In addition, grid computing platforms provide developers with methods of building software for use in grid environments.
At the time when clusters were emerging, developers began building software that allowed personal computers to share resources by allowing machines to loosely bind to one another to form highly dynamic networks. The new networks became known as Peer-to-Peer (P2P) networks, and were initially used for instant messaging (ICQ) and later, file sharing (Napster).
The emergence of P2P networks reminded researchers that in addition to instant messaging and file sharing, computation could also be shared. As a result, an increasing number of projects have formed under the banner of distributed computing, and more specifically within a subcategory known as distributed computation. We will use the term "distributed computation" throughout this article to refer to distributed computing projects that specifically involve geographically dispersed machines all working to solve a computational problem.
The past decade has been incredibly fast-paced, and a significant cross-pollination of ideas has occurred. We're currently seeing P2P evolve to include grid-like capabilities, and grids evolving to assimilate all things distributed computing.
The O'Reilly Open Peer-to-Peer Directory and The AspenLeaf DC web site currently list several dozen active projects, ranging from analytical spectroscopy to the Zeta grid, specializing in areas such as art, cryptography, life sciences, mathematics, and game research.
One such project, Folding@Home, set out to use large-scale distributed computing to simulate the folding of proteins. Proteins, which have been described as nature's nanobots, assemble themselves in a process known as folding. When proteins fail to fold correctly, the misfolding is linked to diseases such as Alzheimer's and Parkinson's.
The Folding@Home team described their protein simulation in the form of a computational model suitable for distribution. The result is now a tool that allows them to simulate folding using timescales that are thousands to millions of times longer than were previously possible.
Perhaps the most famous distributed computation project is SETI@home (the search for extraterrestrial intelligence) at the University of California, Berkeley. SETI@home distributes data collected from the Arecibo radio telescope in Puerto Rico to millions of remote machines throughout the world, where the data is analyzed for potential evidence of extraterrestrial civilizations.
SETI@home has demonstrated the viability of harnessing remote machines, has captured the imagination of over four million people, and has easily become the largest distributed computation project in history.
It All Begins with a Problem
Distributed computation projects work under the basic premise that machines working in parallel are able to solve a certain class of problems more cost-effectively than a local group of machines.
If you consider starting a distributed computation project, one of the first challenges you are likely to encounter involves determining whether your particular problem is suitable for distribution. That is, can your problem be subdivided into discrete chunks? Furthermore, can the problem be subdivided so that there are no immediate interdependencies?
To help answer this question, you should consider writing programs to test the decomposition, distribution, collection, and assembly of processed work units. You have considerable flexibility in the choice of computer languages and other tools for building quick prototypes. It is important not to get bogged down with too many technology-related issues. The key is to keep it simple: work with data files and command-line programs, and leave the actual inter-process and network communication aspects for a later time. The goals at this stage should focus on exploration and discovery, with an emphasis on the actual problem at hand.
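As a concrete illustration, a throwaway prototype of the decompose/process/assemble cycle can fit in a few lines. The chunking scheme and the summing "work" below are placeholders for your own problem, not part of any particular framework:

```python
# A minimal decomposition/reassembly prototype.
# Summing a slice of numbers stands in for the real computation;
# the point is to verify that chunks can be processed independently
# and reassembled into the same answer a single machine would produce.

def decompose(data, chunk_size):
    """Split the problem into independent work units."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def process(work_unit):
    """Simulate what one remote machine would do with one unit."""
    return sum(work_unit)

def assemble(results):
    """Combine the partial results into a final answer."""
    return sum(results)

data = list(range(1000))
units = decompose(data, 100)            # 10 independent work units
results = [process(u) for u in units]   # in production, these run remotely
answer = assemble(results)
assert answer == sum(data)              # sanity check against the local answer
```

If the assertion fails for your real problem, the work units have hidden interdependencies and the problem needs a different decomposition.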
Once you are confident that your problem can indeed benefit from distributed computation, your next step is to consider a general architecture for the various programs you'll need.
A Common Approach
At a fundamental level, a large majority of distributed computation projects work similarly. Let's examine a typical project from a 30-thousand-foot view:
A central computer subdivides a problem into millions of smaller tasks. It then proceeds to dispatch each task to a remote machine. As each machine completes its assigned task, it returns a result back to the central computer, which may further subdivide (or create new) tasks. The process continues until a solution is found or all subtasks have been processed.
This process, while overly simplified, does reveal that a central entity dispatches work to remote entities, which later return results. For the sake of simplicity, I'll refer to the central computer as a "SuperNode," the remote machines as "PeerNodes," and subtasks as "work units." Note, however, that a SuperNode may consist of several physical servers, such as a scheduling server, database server, and other intermediate servers, to help balance network load.
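The dispatch cycle described above can be sketched as a simple queue-driven loop. This is a local simulation, not a network server: the peer_node function stands in for a remote machine, and all names here are illustrative only:

```python
from collections import deque

class SuperNode:
    """Central coordinator: hands out work units and collects results."""
    def __init__(self, work_units):
        self.pending = deque(work_units)
        self.results = []

    def dispatch(self):
        """Hand out the next work unit, or None when the queue is empty."""
        return self.pending.popleft() if self.pending else None

    def receive(self, result):
        self.results.append(result)

def peer_node(work_unit):
    # Stand-in for a remote PeerNode computing its assigned task.
    return work_unit ** 2

super_node = SuperNode(range(1, 6))
while (unit := super_node.dispatch()) is not None:
    super_node.receive(peer_node(unit))

print(super_node.results)  # [1, 4, 9, 16, 25]
```

A real SuperNode would also track which units are outstanding and re-dispatch those that never come back; that bookkeeping is where much of the production complexity lives.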
You may have realized that the relationship between a PeerNode and a SuperNode bears a striking resemblance to the relationship between web browsers and web servers. This similarity is no coincidence, because both have client/server attributes. We will examine this relationship later in this article; for now, keep in mind that PeerNodes can be considerably more dynamic than any modern-day web browser. Also, distributed computation projects are not restricted to client/server relationships and one-to-many network topologies. Indeed, some projects support the presence of multiple SuperNodes, which in turn cluster communities of PeerNodes. In addition, DC projects such as Electric Sheep are exploring the use of P2P networks using protocols such as Gnutella.
Tim Toady. Tim Who?
The Perl programming community has a motto, "There's More Than One Way To Do It," which exemplifies the notion that a program is correct so long as it accomplishes its goal. Despite being rather long, the motto rings true in the distributed computing world, where we have an abundance of technological choices.
At one extreme, we have the all-encompassing grid frameworks; at the other, we have the raw materials (tools) to "build it ourselves." In between, we have proprietary frameworks such as Microsoft's .NET, various Sun Java frameworks, and open source tools such as LAMP (Linux, Apache, MySQL, Perl/PHP/Python). We also have several dozen ready-made frameworks, such as the Berkeley Open Infrastructure for Network Computing (BOINC), which builds on open source LAMP tools, and Alchemi, a .NET-based grid computing framework designed for integration into global grids, which leans toward grid computing.
Which tools are best for you? Well, that depends on your own experience base and the unique requirements of your project. We'll continue exploring key issues so that you'll have a sense of how various tools work, and what you should consider when evaluating existing frameworks or before embarking on your own unique and custom solution.
One of your most important considerations is how your project's primary servers will communicate with remote client machines. Again, we'll simplify our discussion by referring to one or more central servers as a single SuperNode, and multiple distributed clients as PeerNodes.
Consider the following questions. If you don't understand a question, ask a friend or simply enter its key terms into your favorite search engine.
- Will the communication need to be synchronous or asynchronous?
- Will the application communicate through port 80 or will it require another port?
- Will the SuperNode need to maintain sessions, or will transactions be stateless?
- Will PeerNode clients need to communicate through intermediary firewalls and proxy servers?
- How important is network transmission speed to the application?
- How large is the data intended for transmission to clients? How large will the return result be? Should the data be compressed?
- How important is data security? Does the data have to be encrypted at the server, the client, or both?
- Will PeerNodes need to communicate with one another?
Those communication questions can help you to carefully consider your project's specific goals and requirements, in addition to helping you evaluate the use of toolkits and frameworks.
In the end, your answers to those questions may lead you down the same path taken by many distributed computing projects: toward existing open standards, well-established web-based communication methodologies, and, in particular, the HTTP protocol. While you are free to explore alternative approaches (and you should, by all means), we'll focus on the most widely accepted methodologies. This will give you at least a frame of reference as you consider the benefits of other approaches.
HTTP, XML, and SOAP?
The Hypertext Transfer Protocol (HTTP) is the underlying language that web servers and web browsers speak as they negotiate the transfer of web content. HTTP is a well-documented, well-tested, text-based protocol that uses simple verbs such as GET and POST to specify actions. A core benefit of HTTP is that it includes support for interacting with other network devices such as routers, load balancers, and proxy cache servers. Once you consider the widespread deployment of web servers, web browsers, and devices that understand HTTP, it is easy to see why HTTP has become a standard for building distributed applications.
A common criticism of HTTP is that your data may be smaller than the HTTP header itself. For example, an HTTP header may be 200 bytes long, while your actual data payload may be only 100 bytes; it then costs more to transmit the header than to send your unique application data. Depending on your choice of tools, you can control the size of the HTTP headers. However, you may reach a point of diminishing returns beyond which you cannot significantly improve the header-to-payload ratio. Keep in mind that interoperability often requires compromise, and we'll continue to explore this concept later in this article.
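The header-to-payload overhead is easy to quantify. The request header below is a plausible hand-written example (the host name and user agent are made up), not output from any particular server:

```python
# Compare a typical HTTP request header against a small payload.
header = (
    "POST /workunit HTTP/1.1\r\n"
    "Host: supernode.example.org\r\n"        # hypothetical project server
    "Content-Type: application/octet-stream\r\n"
    "Content-Length: 100\r\n"
    "User-Agent: PeerNode/1.0\r\n"           # hypothetical client name
    "\r\n"
)
payload = b"x" * 100  # a 100-byte result

print(len(header), "header bytes for", len(payload), "payload bytes")
print(f"overhead ratio: {len(header) / len(payload):.2f}")
```

For small work units the header dominates, which is why some projects batch several results into one request.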
Using HTTP verbs, software can send and request packaged data with no predefined limits on size or format. Indeed, Internet radio stations use HTTP to broadcast continuous streams of binary data over an open connection. The freedom to transmit any type of data further makes HTTP ideal; however, there is value in considering a structured approach to your HTTP payload data that will help enable interoperability. The Extensible Markup Language (XML) was created to facilitate the interchange of structured data by allowing developers to use agreed-upon keywords. Consider the XML text below, which defines a software product along with a product ID and file information (such as when the file was posted, and the size and version of the file).
<product id="CBCC" uc="low" os="win32">
  <file name="cbcc_inst.exe" size="756345" ver="3.00.00.00"
        posted="12/16/03 12:34 pm GMT">
    <desc>ChessBrain Control Center</desc>
    <changes>This release contains a fix for...</changes>
  </file>
</product>
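A structured payload like this can be consumed with any standard XML library. Here's a sketch using Python's built-in xml.etree module, with the fragment reproduced inline in well-formed form:

```python
import xml.etree.ElementTree as ET

# The product record from the article, as a well-formed fragment.
doc = """
<product id="CBCC" uc="low" os="win32">
  <file name="cbcc_inst.exe" size="756345" ver="3.00.00.00"
        posted="12/16/03 12:34 pm GMT">
    <desc>ChessBrain Control Center</desc>
    <changes>This release contains a fix for...</changes>
  </file>
</product>
"""

product = ET.fromstring(doc)
file_el = product.find("file")
print(product.get("id"))         # CBCC
print(file_el.get("name"))       # cbcc_inst.exe
print(int(file_el.get("size")))  # 756345
print(file_el.findtext("desc"))  # ChessBrain Control Center
```

Note that the file size arrives as text and is converted explicitly; the receiver, not the wire format, decides the numeric representation.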
Arguably, the XML information could have been represented in a far more compact binary format -- for example, as a C structure with precise data types, including a long filesize field.
Of immediate concern would be the use of the long filesize field, which would be interpreted differently across computers. Byte order, or "endianness," determines how multi-byte values are arranged in memory, and causes computers to interpret the long filesize number in incompatible ways. This issue affects PCs and Macs, in addition to a host of other machines. Naturally, there are well-established methods of addressing this problem. The issue is that you must take those steps, and further place a similar burden on other programmers when they attempt to interpret the same data.
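The byte-order problem is easy to demonstrate. Python's struct module can render the same 32-bit value in both orders; many network protocols sidestep the issue by agreeing on big-endian ("network byte order"):

```python
import struct

filesize = 756345  # the file size from the XML example

little = struct.pack("<I", filesize)  # how a little-endian PC stores it
big    = struct.pack(">I", filesize)  # big-endian / network byte order

print(little.hex())  # 798a0b00
print(big.hex())     # 000b8a79

# Reading little-endian bytes as if they were big-endian yields garbage:
misread = struct.unpack(">I", little)[0]
assert misread != filesize
```

XML avoids this entirely by shipping the number as text and letting each machine build its own native representation.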
Another apparent limitation is the use of "fixed" data sizes. Misunderstandings and lack of proper data validation can lead to software defects, and to one of the most common security threats in existence -- the dreaded buffer overflow. XML allows heterogeneous computers to safely receive and process text data.
XML's enforcement of well-formed data makes it straightforward to parse as text. Today, there are widely available programming libraries for working with XML in languages such as C/C++ and Java. In addition, all major scripting languages support XML and remove much of the complexity of working with XML data. Because of XML's inherent structural simplicity, you're often able to quickly parse data using standard string runtime calls such as strncpy(), rather than employing a full-fledged XML tool.
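As a rough sketch of that string-scan shortcut (shown here in Python rather than C, and safe only when you control the exact format of the input):

```python
# Naive extraction of attribute values by string scanning.
# Fine for quick prototypes on data you generate yourself;
# a real XML parser is the right tool for anything else.
xml = '<file name="cbcc_inst.exe" size="756345" ver="3.00.00.00">'

def get_attr(text, attr):
    """Return the value of attr="..." by scanning for the quotes."""
    start = text.index(attr + '="') + len(attr) + 2
    end = text.index('"', start)
    return text[start:end]

print(get_attr(xml, "name"))  # cbcc_inst.exe
print(get_attr(xml, "size"))  # 756345
```

This approach breaks on escaped quotes, attribute reordering inside tags it doesn't expect, or stray whitespace, which is exactly why a library parser is preferable once data comes from machines you don't control.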
XML has proven to be a valuable tool, and developers have created XML-RPC (XML Remote Procedure Call) and SOAP (the Simple Object Access Protocol) to formalize data exchange methods. SOAP has become the preferred method of building web-based services and now provides distributed computing applications with standardized methods of communicating with one another.
The use of HTTP, XML, and XML-based grammars helps to enable interoperability and dramatically improves the chances that your project will be scalable.