Transparent Proxying with Squid
by Jennifer Vesperman
Transparent proxying frees you from the hassle of setting up individual browsers to work with proxies. If you have a hundred, or a thousand, users on your network, it's a pain to set up each browser to use proxies -- or to try to convince users to go into their preferences and type in these symbols they don't understand.
Using transparent proxying, you intercept their web requests and redirect them through the proxy. Nice and simple -- on the surface.
Why not to use a transparent proxy
Transparent proxying (more commonly known as TCP hijacking) is like Network Address Translation (NAT) in some respects: It is to be avoided at all costs, and only used if there is absolutely, positively, no other way.
Why? Because transparent proxying does not work very well with certain web-browsers. With most browsers you're fine, but if even a quarter of your users are using badly behaved browsers, you can expect your help desk costs to exceed any benefits you might gain from transparent proxying. Unfortunately, these browsers are in wide use.
These browsers behave differently if they are aware of a proxy -- all other browsers follow the standard, and the only change they make with a proxy is to direct the requests to a different machine and port. Badly behaved browsers leave some of the HTTP headers out of their requests, and only add them when they know there's a proxy. Without those headers, user commands like "reload" don't work if there's a proxy between the user and the source.
Transparent proxying also introduces a layer of complexity, which can complicate otherwise simple transactions. For instance, a web-based application that requires an active server cannot test for the server by making a connection -- it will connect to the proxy, instead.
Transparent proxying theory
So how does transparent proxying work?
A firewall or other redirector catches TCP connections directed at specific ports on remote hosts (usually port 80), and directs them to the local proxy server. The proxy server uses HTTP headers to determine where it is supposed to make a connection to, and proxies the request.
System administrators are often asked to also transparently proxy FTP and SSL, but these can't be transparently proxied. FTP is a more complex protocol than HTTP, and provides fewer hints as to the original destination of the request. SSL is encrypted and contains no useful data about destinations. Decoding SSL is precisely what the protocol is designed to prevent: a transparent proxy that decoded SSL would be indistinguishable from a "true" man-in-the-middle attack.
To perform transparent proxying, we need a server between the clients and the destinations. This server must have the necessary facilities to match and redirect traffic, such as iptables. Any firewalling system capable of Network Address Translation and traffic redirection is suitable.
You will need to configure a rule to catch traffic destined for port 80 on external hosts, and redirect this traffic to the port of a proxy server on the intercepting machine.
You can have proxies which aren't on the intercepting machine, but these are more awkward. First, the source address of the request is no longer available to the proxy -- it's lost in the process of redirection. You can solve this by using destination NAT (Network Address Translation), but you then have to route the proxy traffic back through the translating server. If you attempt to have the proxy pass the HTTP response back directly, the client will be confused and (quite correctly) refuse to speak to the proxy. The proxy is not the machine the client thinks it's talking to -- the client thinks it's making the request of the destination web server. The proxy must route back through the interceptor, so it can translate the addresses back, and let the client continue to believe it's speaking directly to the web server.
HTTP/1.1 made life easier for transparent proxies, by making the host header mandatory. This header contains the name of the machine (as given in the URL) and allows virtual name-based web-hosting, by allowing the web server to use the host header to determine which page to respond with.
For transparent proxies, it provides the proxy with the host name. Having received an intercepted port 80 connection, the proxy server needs to understand that it is not receiving a fully qualified absolute URI (Uniform Resource Identifier), but a relative URI. Normally, a proxy server receives http://host/path, but if the client thinks it's talking to the server, not a proxy, it just asks for /path. The proxy server uses the Host header to reassemble the fully qualified URI, then checks its cache and does its usual proxying.
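The reassembly step can be illustrated with a minimal shell sketch. This is not how Squid implements it; the function name reassemble_uri is hypothetical, and the sketch assumes plain HTTP and a well-formed request head on standard input.

```shell
# Hypothetical helper (not part of Squid): rebuild the absolute URI from an
# intercepted request. Reads the raw request head on stdin and prints
# http://host/path on stdout.
reassemble_uri() {
    path=""
    host=""
    while IFS= read -r line; do
        # HTTP lines end in CRLF; drop the carriage return.
        line=$(printf '%s' "$line" | tr -d '\r')
        # A blank line marks the end of the headers.
        if [ -z "$line" ]; then
            break
        fi
        if [ -z "$path" ]; then
            # Request line: METHOD SP request-target SP HTTP-version
            rest=${line#* }
            path=${rest%% *}
            continue
        fi
        case "$line" in
            [Hh][Oo][Ss][Tt]:*)
                # Keep everything after the first colon, minus leading spaces.
                host=$(printf '%s' "${line#*:}" | sed 's/^[[:space:]]*//')
                ;;
        esac
    done
    # Without both pieces the URI cannot be rebuilt.
    [ -n "$host" ] && [ -n "$path" ] || return 1
    printf 'http://%s%s\n' "$host" "$path"
}
```

For example, piping a request for /index.html that carries the header "Host: www.example.com" into reassemble_uri should print http://www.example.com/index.html. A request with no Host header makes the function return a non-zero status, which mirrors why the mandatory header matters to a transparent proxy.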
Squid is suitable for transparent proxying because it is also designed as a reverse proxy (also known as an "HTTP accelerator"), and can read these abbreviated request headers. In accelerator mode, it fronts for the actual web servers and receives requests as if it were the web server, so it was designed with the ability to reassemble relative URIs. To use it as a transparent proxy, we enable this web acceleration behavior.
When using Squid as an HTTP accelerator, configure the host name and the port you want the proxy to accelerate. This prevents Squid from being used as an arbitrary HTTP relay. When using Squid in accelerator mode as a transparent proxy, set the host name to virtual and the port to whichever port we want a transparent proxy for.
Configuring a transparent proxy
Intercept and/or redirect the traffic to the chosen port. Having the proxy on the same machine as the interceptor is preferable. The code example uses iptables as the redirection mechanism, and port 8080 as the proxy's port.
To a different machine
iptables -t nat -A PREROUTING -i "$INTERFACE" -p tcp --dport 80 -j DNAT --to-destination 10.0.3.1:8080
To the same machine
iptables -t nat -A PREROUTING -i "$INTERFACE" -p tcp --dport 80 -j REDIRECT --to-ports 8080
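In the squid.conf file, configure the accelerator options described above: the host name set to virtual, and the port set to the port being intercepted, normally 80 (or whatever port you want to proxy).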
Note that you cannot transparently proxy more than one port at a time. The HTTP headers do not contain port information, so Squid cannot tell which port the request was intended for once the request has been intercepted.
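As a rough sketch of the accelerator settings described above, the fragment below assumes a Squid 2.x release that still provides the httpd_accel_* directives. That is an assumption about the version in use: later releases dropped these directives in favor of an option on the http_port line. The port numbers follow the examples in this article.

```
# squid.conf sketch, assuming Squid 2.x (httpd_accel_* directives)

# Port that the redirect rule sends intercepted traffic to
http_port 8080

# Accept requests for any host rather than a single accelerated server
httpd_accel_host virtual

# The port being transparently proxied
httpd_accel_port 80

# Keep serving explicitly configured clients as a normal proxy
httpd_accel_with_proxy on

# Rebuild the full URI from the Host header
httpd_accel_uses_host_header on
```

On Squid 2.6 and later, the equivalent is set on the listening port itself (http_port 8080 transparent, with the option renamed to intercept from 3.1), so check the documentation for the release you run before copying either form.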
Caveats and gotchas
You can lose the source address of the request if the proxy box isn't also the traffic interceptor. You can correct this by using destination NAT instead of packet redirection, and making sure the proxy routes all traffic back through the intercepting box, including traffic to its clients. (Alternatively, ensure that the proxy is the intercepting box.)
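When the proxy sits on a separate machine and routes back through the interceptor, its own outbound web requests may arrive on the interface being matched, and would then be redirected back to the proxy in a loop. The sketch below exempts the proxy's address from the DNAT rule. It is an illustration under assumptions: the interface name eth0, the wrapper function, and the dry-run IPTABLES variable are hypothetical, and 10.0.3.1 is simply the address used in the example above.

```shell
# Sketch only: interface name and addresses are assumptions.
# By default the rule is printed for review; run with IPTABLES=iptables
# (as root) to apply it.
IPTABLES=${IPTABLES:-echo iptables}
INTERFACE=${INTERFACE:-eth0}    # assumed client-facing interface
PROXY=${PROXY:-10.0.3.1}        # proxy address from the DNAT example

intercept_to_remote_proxy() {
    # Leave the proxy's own outbound requests alone; DNAT everything else
    # bound for port 80 to the proxy's listening port.
    $IPTABLES -t nat -A PREROUTING -i "$INTERFACE" ! -s "$PROXY" \
        -p tcp --dport 80 -j DNAT --to-destination "${PROXY}:8080"
}
```

If the proxy's traffic reaches the interceptor on a different interface, the exemption is harmless. Either way the proxy's default route still has to point at the intercepting box so that replies are translated back.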
Some browsers are unable to refresh content through a transparent proxy. The client fails to send cache coherence headers, assuming it's talking to the web server, and assuming there is no proxying or caching agent (including web accelerators) in between. Users of these browsers will have problems and will become IT help desk problems.
There is no known correction for this problem, other than not using these browsers with any sort of proxying or caching agent.
It is cheaper in CPU cycles and memory to have the browser explicitly configured to use a proxy, than it is to redirect traffic. It is cheaper in CPU cycles and memory to block port 80 than it is to redirect traffic. Blocking has less overhead than redirection, and can force people to use a proxy.
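A minimal sketch of the blocking alternative follows. It assumes the proxy runs on the firewall itself (so its own requests do not pass through the FORWARD chain) and that eth0 is the client-facing interface. Both are assumptions, as are the wrapper function and the dry-run IPTABLES variable.

```shell
# Sketch only: prints the rule for review unless IPTABLES=iptables is set.
IPTABLES=${IPTABLES:-echo iptables}
INTERFACE=${INTERFACE:-eth0}    # assumed client-facing interface

block_direct_web() {
    # Reject forwarded port-80 traffic so that clients must use the
    # explicitly configured proxy. A TCP reset makes browsers fail quickly
    # instead of waiting for a timeout.
    $IPTABLES -A FORWARD -i "$INTERFACE" -p tcp --dport 80 \
        -j REJECT --reject-with tcp-reset
}
```

Rejecting rather than silently dropping is a design choice here: users who have not configured the proxy get an immediate connection error, which tends to be easier to explain than a browser that hangs.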
The most common reason to use transparent proxying is to reduce the setup load for web browsers. System administrators need to be aware of the common problems of transparent proxies, and determine whether they are appropriate in their environments. If the end users are using browsers that are known to behave well with transparent proxies, and the machine designated as the proxy is capable of handling the load, a transparent proxy can be an effective solution.
Jennifer Vesperman is the author of Essential CVS. She writes for the O'Reilly Network, the Linux Documentation Project, and occasionally Linux.Com.