O'Reilly Book Excerpts: BGP
Traffic Engineering: Queuing, Traffic Shaping, and Policing
Editor's Note: In the fifth and final installment in this series of excerpts on Traffic Engineering from O'Reilly's BGP, learn how to increase performance for certain protocols or sessions using special queuing strategies, traffic shaping, and rate limiting.
Queuing, Traffic Shaping, and Policing
Traffic engineering works only if you have bandwidth to spare on at least one of your connections. Even the most sophisticated traffic-balancing techniques won't help you when there is simply too much traffic. When the output queues for interfaces start filling up, interactive protocols notice delays, and bulk protocols notice lower throughput. The best way to handle this is to get more bandwidth, but with some smart queuing techniques, it's possible to increase performance for some protocols or sessions without hurting others very much, or simply to give priority to "important" packets and let less important traffic suffer. There are three ways to accomplish this: special queuing strategies, traffic shaping, and rate limiting. Before choosing one, you should know how each interacts with TCP.
Nearly all applications that run over the Internet use the Transmission Control Protocol (TCP, RFC 793) on top of IP. IP can only transmit packets of a limited size, and packets may arrive corrupted by bit errors on the communications medium, out of order, or not at all. Also, IP provides no way for applications to address a specific program running on the destination host. All this missing functionality is implemented in TCP. The characteristics of TCP are:
"Stream" interface: Any and all bytes the application writes to the stream come out in the same order at the application running on the remote host. There is no packet size limit: TCP breaks up the communication into packets as needed.
Integrity and reliability: TCP performs a checksum calculation over every segment (packet) and throws away the segment if the checksum fails. It keeps resending packets until the data is received (and acknowledged) successfully by the other end, or until it becomes apparent that the communications channel is unusable, and the connection times out.
Multiplexing: TCP implements "ports" to multiplex different communication streams between two hosts, so applications can address a specific application running on the remote host. For instance, web servers usually live on port 80. When a web browser contacts a server, it also selects a source port number so that the web page can be sent back to this port, and the page will end up with the right browser process. Well-known server ports are usually (but not always) below 1024; client source ports are semirandomly selected from a range starting at 1024 or higher.
Congestion control: Finally, TCP provides congestion control: it makes sure it doesn't waste resources by sending more traffic than the network can successfully carry to the remote host.
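The port multiplexing described above can be observed with a short loopback experiment. This is a hypothetical demo (not from the book) using Python's standard socket module: two clients connect to the same server port, and the kernel assigns each connection its own source port, which is how the server tells the two streams apart.

```python
import socket
import threading

# Server socket on the loopback interface; port 0 lets the kernel pick a port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(2)
server_port = server.getsockname()[1]

def accept_two(source_ports):
    """Accept two connections and record each client's source port."""
    for _ in range(2):
        conn, (addr, src_port) = server.accept()
        source_ports.append(src_port)   # the source port identifies the stream
        conn.close()

source_ports = []
t = threading.Thread(target=accept_two, args=(source_ports,))
t.start()

# Two clients connect to the same destination port; the kernel gives each a
# distinct (usually high-numbered) source port so the 4-tuples stay unique.
c1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c1.connect(("127.0.0.1", server_port))
c2.connect(("127.0.0.1", server_port))
t.join()

print(len(set(source_ports)))  # 2: each stream has its own source port
c1.close(); c2.close(); server.close()
```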
Most of what TCP does falls outside the scope of this book, so it won't be discussed here. It's good to know about the congestion control mechanisms TCP employs, however, because they have a strong impact on the traffic patterns on the network.
TCP Congestion Control
Apart from the basic self-timing that happens because TCP uses a windowing system where only a limited amount of data may be in transit at any time, there are four additional congestion-related mechanisms in TCP: slow start, congestion avoidance, fast retransmit, and fast recovery. These algorithms are documented in RFC 2001.
When a TCP connection is initiated, the other side tells the local TCP how much data it's prepared to buffer. This is the "advertised window." Setting up a connection takes three packets: an initial packet with the SYN control bit set (a "SYN packet"), a reply from the target host with both the SYN and ACK bits set, and a final packet from the initiating host back to the target acknowledging the SYN/ACK packet. This is the three-way handshake.
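The three packets of the handshake can be sketched as plain data. This is a simplified illustration, not real TCP: the initial sequence numbers here are arbitrary constants, whereas a real implementation picks them randomly.

```python
def three_way_handshake(client_isn, server_isn):
    """Model the three segments of TCP connection setup as dictionaries."""
    # 1. Initiating host sends a SYN with its initial sequence number.
    syn = {"flags": {"SYN"}, "seq": client_isn}
    # 2. Target replies with SYN and ACK set, acknowledging client_isn + 1.
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn, "ack": syn["seq"] + 1}
    # 3. Initiating host acknowledges the SYN/ACK, completing the handshake.
    ack = {"flags": {"ACK"}, "seq": syn["seq"] + 1, "ack": syn_ack["seq"] + 1}
    return [syn, syn_ack, ack]

segments = three_way_handshake(client_isn=100, server_isn=300)
for s in segments:
    print(sorted(s["flags"]))
```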
After the three-way handshake, the local (and remote) TCP may transmit data until the advertised window is full. Then it has to wait for an acknowledgment (ACK) for some of this data before it can continue transmitting. When the remote TCP advertises a large window, the local TCP doesn't send a full window's worth of data at once: there may be a low-bandwidth connection somewhere in the path between the two hosts, and the router that terminates this connection may be unable to buffer such a large amount of data until it can traverse the slow connection. Thus, the sending TCP uses a congestion window in addition to the advertised window. The congestion window is initialized as one maximum segment size, and it doubles each time an ACK is received. If the segment size is 1460 bytes (which corresponds to a 1500-byte Ethernet packet minus IP and TCP headers), and the receiver advertises an 8192-byte window, the sending TCP initializes the congestion window to 1460 bytes, transmits the first packet, and waits for an ACK. When the first ACK is received, the congestion window is increased to 2920 bytes, and two packets are transmitted. When the first one of these is ACKed, the congestion window becomes 5840 bytes, so four packets may now be in transit. One packet is still unacknowledged, so three new packets are transmitted. After receiving the next ACK, the congestion window increases beyond the advertised window, so from then on it's the advertised window that limits the amount of unacknowledged data allowed to be in flight.
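The walkthrough above can be replayed in a few lines. This is a sketch that follows the text's per-ACK doubling with the same numbers (1460-byte segments, an 8192-byte advertised window), not a full TCP implementation:

```python
MSS = 1460         # maximum segment size (1500-byte Ethernet minus headers)
ADVERTISED = 8192  # window advertised by the receiver

cwnd = MSS                 # congestion window starts at one segment
history = [cwnd]           # effective window after each ACK
while cwnd < ADVERTISED:
    cwnd *= 2              # one more ACK received: window doubles
    # The sender may never exceed the advertised window, whichever is smaller.
    history.append(min(cwnd, ADVERTISED))

print(history)  # [1460, 2920, 5840, 8192]
```

After three ACKs the congestion window (11,680 bytes) exceeds the advertised window, so the 8192-byte advertised window becomes the limit, matching the text.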
Congestion avoidance introduces another variable: the slow start threshold size (ssthresh). When a connection is initialized, the ssthresh is set to 65,535 bytes (the maximum possible advertised window). As long as no data is lost, the slow start algorithm is used until the congestion window reaches its full size. If TCP receives an out-of-order ACK, however, congestion avoidance comes into play. An out-of-order ACK is an acknowledgment for data that was already acknowledged before. This happens when a packet gets lost: the receiving TCP sends an ACK for the data up to the lost packet, indicating, "I'm still waiting for the data following what I'm ACKing now." TCP ACKs are cumulative: it isn't possible to say "I got bytes 1000-1499, but I'm missing 500-999."
Upon receiving a duplicate ACK, the sending TCP assumes the unacknowledged data has been lost because of congestion, and both the ssthresh and the congestion window are set to half the current window size, as long as this is at least two maximum segment sizes. After this, the congestion window is allowed to grow only very slowly, to avoid an immediate return of congestion. If the sending TCP doesn't see any ACKs at all for some period of time, it assumes massive congestion, lowers the ssthresh, and triggers slow start. So as long as the congestion window is smaller than or equal to the ssthresh, slow start is executed (the congestion window doubles after each ACK); after that, congestion avoidance takes over (the congestion window grows slowly).
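How the ssthresh gates slow start versus congestion avoidance can be sketched as follows. This is my paraphrase of the text (and of RFC 2001), not production TCP code; the linear-growth formula adds roughly one segment per round trip by adding MSS*MSS/cwnd per ACK.

```python
MSS = 1460  # maximum segment size in bytes

def on_duplicate_ack(cwnd):
    """Assume congestion: halve the window, but keep at least two segments."""
    return max(cwnd // 2, 2 * MSS)

def on_ack(cwnd, ssthresh):
    """Grow the congestion window when a normal ACK arrives."""
    if cwnd <= ssthresh:
        return cwnd * 2                  # slow start: doubling per ACK
    return cwnd + MSS * MSS // cwnd      # congestion avoidance: ~1 MSS per RTT

cwnd, ssthresh = 11680, 65535            # ssthresh starts at the 65,535 maximum
cwnd = ssthresh = on_duplicate_ack(cwnd) # loss detected: both drop to 5840
cwnd = on_ack(cwnd, ssthresh)            # still <= ssthresh: one last doubling
cwnd = on_ack(cwnd, ssthresh)            # now above ssthresh: slow, linear growth
print(cwnd)
```

The last step adds only 182 bytes to an 11,680-byte window, illustrating how much more cautiously the window grows once congestion avoidance takes over.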
Fast retransmit and fast recovery
When TCP receives three out-of-order ACKs in a row, it assumes that just a single packet was lost. (One or two out-of-order ACKs are likely to be the result of packet reordering on the network.) It then retransmits the packet it thinks has been lost, without waiting for the regular retransmit timer to expire. The ssthresh is set as in congestion avoidance, but the congestion window is set to the ssthresh plus three maximum segments: this is the amount of data that was successfully received by the other end, as indicated by the out-of-order ACKs. The result is that TCP slows down a bit, but not too much, because a reasonable amount of data is obviously still coming through.
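The fast-retransmit trigger can be sketched as a single decision function. This is a hedged illustration of the rule described above, with hypothetical names, continuing the numbers from the slow-start example:

```python
MSS = 1460  # maximum segment size in bytes

def react_to_dup_acks(dup_acks, cwnd, ssthresh):
    """Return (retransmit?, new_cwnd, new_ssthresh) after dup_acks duplicates."""
    if dup_acks < 3:
        # One or two duplicates: probably just reordering, do nothing yet.
        return False, cwnd, ssthresh
    # Three in a row: assume one lost packet and retransmit it immediately.
    ssthresh = max(cwnd // 2, 2 * MSS)   # set as in congestion avoidance
    # Reopen the window to ssthresh plus the three segments the receiver
    # evidently got (one per duplicate ACK).
    return True, ssthresh + 3 * MSS, ssthresh

print(react_to_dup_acks(2, 11680, 65535))  # (False, 11680, 65535)
print(react_to_dup_acks(3, 11680, 65535))  # (True, 10220, 5840)
```

Note how mildly the sender slows down: the window drops from 11,680 to 10,220 bytes rather than collapsing to one segment as it would on a retransmit timeout.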
TCP Under Packet Loss and Delay Conditions
The result of these four mechanisms is that TCP slows down a lot when multiple packets are lost. The problem is even worse when round-trip times are long, because the use of windows limits TCP's throughput to one window size per round-trip time. This means that even with the maximum window size of just under 64 KB (without the TCP high-performance extensions enabled), TCP performance over a transcontinental circuit with a round-trip delay of 70 ms cannot exceed about 900 kilobytes (roughly 7.5 megabits) per second. When a packet is lost, this speed is nearly halved, and it takes hundreds of successfully acknowledged packets to get back up to the original window size. So even sporadic packet loss can bring down the effectively used bandwidth for a single TCP session over a high-delay path. This means that packet loss can be tolerated only on low-delay connections, and only as long as those connections are not part of a high-delay path.
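The window/RTT ceiling from the paragraph above, worked out numerically:

```python
window_bytes = 65535   # maximum window without the high-performance extensions
rtt_seconds = 0.070    # 70 ms transcontinental round trip

# At most one full window can be in flight per round trip.
bytes_per_second = window_bytes / rtt_seconds
megabits_per_second = bytes_per_second * 8 / 1_000_000

print(round(bytes_per_second / 1000), "KB/s,", round(megabits_per_second, 1), "Mbps")
```

Doubling the round-trip time halves this ceiling, which is why the same sporadic loss hurts far more on long paths than on short ones.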
The behavior of the two main categories of non-TCP applications under packet loss conditions is different. These categories are multimedia (streaming audio and video) and applications based on small transactions that don't need a lot of overhead, such as DNS. Streaming audio and video are generally not too sensitive to packet loss, although the audio/video quality will suffer slightly. For things like DNS lookups, packet loss slows down individual transactions a lot (they time out and have to be repeated), but the performance penalty doesn't carry over to transactions that didn't lose packets themselves. Because non-TCP applications don't really react to packet loss, they often exacerbate the congestion by continuing to send more traffic than the connection can handle.
Although some lost packets are the result of bit errors on the physical medium or temporary routing inconsistencies, the typical reason packets are lost is congestion: too much traffic. If a router has a single OC-3 (155 Mbps) connection to a popular destination, and 200 Mbps of traffic comes in for this destination, something has to give. The first thing the router will do is to put packets that can't be transmitted immediately in a queue. IP traffic tends to have a lot of bursts: traffic can get high for short periods of time ranging from a fraction of a second to a few seconds. The queue helps smooth out these bursts, at the expense of some additional delay for the queued packets, but at least they're not lost. If the excessive traffic volume persists, the queue fills up. The router has no other choice than to discard any additional packets that come in when the queue is full. This is called a "tail drop." The TCP anti-congestion measures are designed to avoid exactly this situation, so in most cases, all the TCP sessions will slow down so the congestion clears up for the most part. If the congestion is bad, however, this may not be enough. If a connection is used for many short-lived TCP sessions (such as web or email traffic), the sheer number of initial packets (when TCP is still in slow start) may be enough to cause congestion. Non-TCP applications can also easily cause congestion because they lack TCP's sophisticated congestion-avoidance techniques.
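Tail drop is easy to model: a bounded FIFO output queue that silently discards packets arriving while it is full. A minimal sketch (hypothetical class names, with a fixed packet-count limit rather than a byte limit):

```python
from collections import deque

class OutputQueue:
    """A bounded FIFO interface queue with tail-drop behavior."""

    def __init__(self, limit):
        self.q = deque()
        self.limit = limit       # maximum number of queued packets
        self.tail_drops = 0

    def enqueue(self, pkt):
        if len(self.q) >= self.limit:
            self.tail_drops += 1 # queue full: the packet is simply discarded
            return False
        self.q.append(pkt)       # buffered, at the cost of added delay
        return True

    def dequeue(self):
        """Transmit the oldest queued packet, if any."""
        return self.q.popleft() if self.q else None

q = OutputQueue(limit=4)
for pkt in range(6):             # a 6-packet burst hits a 4-packet queue
    q.enqueue(pkt)
print(q.tail_drops)  # 2: the last two packets of the burst are dropped
```

The queuing strategies discussed in this section differ mainly in which packet they drop (or delay) when this limit is reached, rather than always sacrificing the newest arrival.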