While working on optimizing page load times for a high-profile AJAX application, I had a chance to investigate how much I could reduce latency due to external objects. Specifically, I looked into how the HTTP client implementation in common browsers and characteristics of common Internet connections affect page load time for pages with many small objects.
I found a few things to be interesting:
IE, Firefox, and Safari ship with HTTP pipelining disabled by default; Opera is the only browser I know of that enables it. No pipelining means each request has to be answered and its connection freed up before the next request can be sent. This incurs average extra latency of the round-trip (ping) time to the user divided by the number of connections allowed. Or if your server has HTTP keepalives disabled, doing another TCP three-way handshake adds another round trip, doubling this latency.
By default, IE allows only two outstanding connections per hostname when talking to HTTP/1.1 servers or eight-ish outstanding connections total. Firefox has similar limits. Using up to four hostnames instead of one will give you more connections. (IP addresses don't matter; the hostnames can all point to the same IP.)
Most DSL or cable Internet connections have asymmetric bandwidth, at rates like 1.5Mbit down/128Kbit up, 6Mbit down/512Kbit up, etc. Ratios of download to upload bandwidth are commonly in the 5:1 to 20:1 range. This means that for your users, a request takes the same amount of time to send as it takes to receive an object of 5 to 20 times the request size. Requests are commonly around 500 bytes, so this should significantly impact objects that are smaller than maybe 2.5k to 10k. This means that serving small objects might mean the page load is bottlenecked on the users' upload bandwidth, as strange as that may sound.
Using these, I came up with a model to guesstimate the effective bandwidth of users of various flavors of network connections when loading various object sizes. It assumes that each HTTP request is 500 bytes and that the HTTP reply includes 500 bytes of headers in addition to the object requested. It is simplistic and only covers connection limits and asymmetric bandwidth, and doesn't account for the TCP handshake of the first request of a persistent (keepalive) connection, which is amortized when requesting many objects from the same connection. Note that this is best-case effective bandwidth and doesn't include other limitations like TCP slow-start, packet loss, etc. The results are interesting enough to suggest avenues of exploration but are no substitute for actually measuring the difference with real browsers.
To show the effect of keepalives and multiple hostnames, I simulated a user on net offering 1.5Mbit down/384Kbit up who is 100ms away with 0% packet loss. This roughly corresponds to medium-speed ADSL on the other side of the U.S. from your servers. Shown here is the effective bandwidth while loading a page with many objects of a given size, with effective bandwidth defined as total object bytes received divided by the time to receive them:
Interesting things to note:
For objects of relatively small size (the left-hand portion of the graph), you can see from the empty space above the plotted line how little of the user's downstream bandwidth is being used, even though the browser is requesting objects as fast as it can. This user has to be requesting objects larger than 100k before he's mostly filling his available downstream bandwidth.
For objects under roughly 8k in size, you can double his effective bandwidth by turning keepalives on and spreading the requests over four hostnames. This is a huge win.
If the user were to enable pipelining in his browser (such as setting Firefox's network.http.pipelining in about:config), the number of hostnames we use wouldn't matter, and he'd make even more effective use of his available bandwidth. But we can't control that server-side.
Perhaps more clearly, the following is a graph of how much faster pages could load for an assortment of common access speeds and latencies with many external objects spread over four hostnames and keepalives enabled. Baseline (0%) is one hostname and keepalives disabled.
Interesting things from that graph:
If you load many objects smaller than 10k, both local users and ones on the other side of the world could see substantial improvement from enabling keepalives and spreading requests over 4 hostnames.
There is a much greater improvement for users further away.
This will matter more as access speeds increase. The user on 100meg ethernet only 20ms away from the server saw the biggest improvement.
One more thing I examined was the effect of request size on effective bandwidth. The above graphs assumed 500 byte requests and 500 bytes of reply headers in addition to the object contents. How does changing that affect performance of our 1.5Mbit down/384Kbit up and 100ms away user, assuming we're already using four hostnames and keepalives?
This shows that at small object sizes, we're bottlenecked on the upstream bandwidth. The browser sending larger requests (such as ones laden with lots of cookies) seems to slow the requests down by 40% worst-case for this user.
As I've said, these graphs are based on a simulation and don't account for a number of real-world factors. But I've unscientifically verified the results with real browsers on real net and believe them to be a useful gauge. I'd like to find the time and resources to reproduce these using real data collected from real browsers over a range of object sizes, access speeds, and latencies.
Tips to reduce your page load time
After you gather some page-load times and effective bandwidth for real users all over the world, you can experiment with changes that will improve those times. Measure the difference and keep any that offer a substantial improvement.
Try some of the following:
Turn on HTTP keepalives for external objects. Otherwise you add an extra round-trip for every HTTP request. If you are worried about hitting global server connection limits, set the keepalive timeout to something short, like 5-10 seconds. Also look into serving your static content from a different webserver than your dynamic content. Having thousands of connections open to a stripped down static file webserver can happen in like 10 megs of RAM total, whereas your main webserver might easily eat 10 megs of RAM per connection.
If your users regularly load a dozen or more uncached or uncacheable objects per page, consider evenly spreading those objects over four hostnames. This usually means your users can have 4x as many outstanding connections to you. Without HTTP pipelining, this results in their average request latency dropping to about 1/4 of what it was before.
When you generate a page, evenly spreading your images over four hostnames is most easily done with a hash function, like MD5. Rather than having all tags load objects from http://static.example.com/, create four hostnames (e.g. static0.example.com, static1.example.com, static2.example.com, static3.example.com) and use two bits from an MD5 of the image path to choose which of the four hosts you reference in the tag. Make sure all pages consistently reference the same hostname for the same image URL, or you'll end up defeating caching.
Beware that each additional hostname adds the overhead of an extra DNS lookup and an extra TCP three-way handshake. If your users have pipelining enabled or a given page loads fewer than around a dozen objects, they will see no benefit from the increased concurrency and the site may actually load more slowly. The benefits only become apparent on pages with larger numbers of objects. Be sure to measure the difference seen by your users if you implement this.
Set an Expires header on everything you can, with a date days or even months into the future. This tells the browser it is okay to not revalidate on every request, which can add latency of at least one round-trip per object per page load for no reason.
Instead of relying on the browser to revalidate its cache, if you change an object, change its URL. One simple way to do this for static objects if you have staged pushes is to have the push process create a new directory named by the build number, and teach your site to always reference objects out of the current build's base URL. (Instead of you'd use . When you do another build next week, all references change to .) This also nicely solves problems with browsers sometimes caching things longer than they should -- since the URL changed, they think it is a completely different object.
For anything where you always serve the exact same content when given the same URL (e.g. static images), add "Cache-Control: public" to give proxies explicit permission to cache the result and serve it to different users. If a local cache has the content, it is likely to have much less latency than you; why not let it serve your static objects if it can?
Avoid the use of query params in image URLs, etc. At least the Squid cache refuses to cache any URL containing a question mark by default. I've heard rumors that other things won't cache those URLs at all, but I don't have more information.
On pages where your users are often sent the exact same content over and over, such as your home page or RSS feeds, implementing conditional GETs can substantially improve response time and save server load and bandwidth in cases where the page hasn't changed.
When serving a static files (including HTML) off of disk, most webservers will generate Last-Modified and/or ETag reply headers for you and make use of the corresponding If-Modified-Since and/or If-None-Match mechanisms on requests. But as soon as you add server-side includes, dynamic templating, or have code generating your content as it is served, you are usually on your own to implement these.
The idea is pretty simple: When you generate a page, you give the browser a little extra information about exactly what was on the page you sent. When the browser asks for the same page again, it gives you this information back. If it matches what you were going to send, you know that the browser already has a copy and send a much smaller 304 (Not Modified) reply instead of the contents of the page again. And if you are clever about what information you include in an ETag, you can usually skip the most expensive database queries that would've gone into generating the page.
Minimize HTTP request size. Often cookies are set domain-wide, which means they are also unnecessarily sent by the browser with every image request from within that domain. What might've been a 400 byte request for an image could easily turn into 1000 bytes or more once you add the cookie headers. If you have a lot of uncached or uncacheable objects per page and big, domain-wide cookies, consider using a separate domain to host static content, and be sure to never set any cookies in it.
Minimize HTTP response size by enabling gzip compression for HTML and XML for browsers that support it. For example, the 17k document you are reading takes 90ms of the full downstream bandwidth of a user on 1.5Mbit DSL. Or it will take 37ms when compressed to 6.8k. That's 53ms off of the full page load time for a simple change. If your HTML is bigger and more redundant, you'll see an even greater improvement.
Consider locating your small objects (or a mirror or cache of them) closer to your users in terms of network latency. For larger sites with a global reach, either use a commercial Content Delivery Network, or add a colo within 50ms of 80% of your users and use one of the many available methods for routing user requests to your colo nearest them.
Regularly use your site from a realistic net connection. Convincing the web developers on my project to use a "slow proxy" that simulates bad DSL in New Zealand (768Kbit down, 128Kbit up, 250ms RTT, 1% packet loss) rather than the gig ethernet a few milliseconds from the servers in the U.S. was a huge win. We found and fixed a number of usability and functional problems very quickly.
To implement the slow proxy, I used the netem and HTB kernel modules available in the Linux 2.6 kernel, both of which are set up with the tc command line tool. These offer the most accurate simulation I could find, but are definitely not for the faint of heart. I've not used them, but supposedly Tamper Data for Firefox, Fiddler for Windows, and Charles for OSX can all rate-limit and are probably easier to set up, but they may not simulate latency properly.
Use Google's Load Time Analyzer extension for Firefox from a realistic net connection to see a graphical timeline of what it is doing during a page load. This shows where Firefox has to wait for one HTTP request to complete before starting the next one and how page load time increases with each object loaded. The Tamper Data extension can offer a similar graph if you make use of its hidden graphing feature. And the Safari team offers a tip on a hidden feature in their browser that offers some timing data too.
Or if you are familiar with the HTTP protocol and TCP/IP at the packet level, you can watch what is going on using tcpdump, ngrep, or ethereal. These tools are indispensible for all sorts of network debugging.
Try benchmarking common pages on your site from a local network with ab, which comes with the Apache webserver. If your server is taking longer than 5 or 10 milliseconds to generate a page, you should make sure you have a good understanding of where it is spending its time.
If your latencies are high and your webserver process (or CGI if you are using that) is eating a lot of CPU during this test, it is often a result of using a scripting language that needs to recompile your scripts with every request. Software like eAccelerator for PHP, mod_perl for perl, mod_python for python, etc can cache your scripts in a compiled state, dramatically speeding up your site. Beyond that, look at finding a profiler for your language that can tell you where you are spending your CPU. If you improve that, your pages will load faster and you'll be able to handle more traffic with fewer machines.
If your site relies on doing a lot of database work or some other time-consuming task to generate the page, consider adding server-side caching of the slow operation. Most people start with writing a cache to local memory or local disk, but that starts to fall down if you expand to more than a few web server machines. Look into using memcached, which essentially creates an extremely fast shared cache that's the combined size of the spare RAM you give it off of all of your machines. It has clients available in most common languages.
(Optional) Petition browser vendors to turn on HTTP pipelining by default on new browsers. Doing so will remove some of the need for these tricks and make much of the web feel much faster for the average user. (Firefox has this disabled supposedly because some proxies, some load balancers, and some versions of IIS choke on pipelined requests. But Opera has found sufficient workarounds to enable pipelining by default. Why can't other browsers do similarly?)
The above list covers improving the speed of communication between browser and server and can be applied generally to many sites, regardless of what web server software they use or what language the code behind you site is written in. There is, unfortunately, a lot that isn't covered.
While the tips above are intended to improve your page load times, a side benefit of many of them is a reduction in server bandwidth and CPU needed for the average page view. Reducing your costs while improving your user experience seems it should be worth spending some time on.
-- Amol Kulkarni