Thursday, November 30, 2006

Cache in with JSON

Data validation is one of the most challenging and ever-changing parts of any enterprise Web application. Quite often validation metadata leaves JavaScript modules littered with server-side code. In this article, you'll learn an elegant way to cache metadata on the client side with the help of server code, which provides JSON-formatted (JavaScript Object Notation) stringified metadata. This approach also enables you to handle multivalue and multigroup attributes in a manner similar to Ajax.

Each application targets some domain problem. And each domain has its own set of rules and regulations that put constraints on data. When an application applies those constraints on data, the constraints become validations. All applications need to validate the data that users enter.

Today, applications generally use combinations of if-else statements to validate data. These statements contain validation data that developers either hard-code or insert through server-side code. Generally, developers use server-side code so that small data changes do not force changes in the JavaServer Pages (JSP) themselves.

You can use JavaScript Object Notation (JSON) to group and cache metadata, and then use JavaScript functions to access that metadata and validate the user input.

When you have metadata scattered over JavaScript, you can't control how much of it the server evaluates and sends to the client. Instead, all server-side code pieces are evaluated and sent to the client. However, when you cache data using JSON, you have full control over how much metadata you send to the client because server-side code generates the JSON-formatted metadata. This helps you send the client only the metadata that corresponds to the user who will see or enter the data.

You can also use JSON to cache data that the user inputs. Once the program caches the data, it erases the data fields rather than refreshing the screen, similar to Ajax. This way a user can enter another set of data for the same property.

Let's explore metadata caching using JSON.

JSON in brief

With JSON, or JavaScript Object Notation, you represent your JavaScript object in a specific string format. If you assign a string with such a format to any JavaScript variable, the variable will then refer to an object constructed from the string assigned to it.

For example, suppose you have a policy object that has these attributes:

* Plan Name
* Description
* Duration

You can represent this policy object in JSON format using the following string:

{"Plan":"Full Life Cover", "Description":"The best life insurance plan", "Term":"20 years"}

If you assign this string to any JavaScript variable, the variable will hold the data as an object. To access the data, give the path of the attribute you want to access. For this example, assign the above string to a variable called policy:

var policy = {"Plan":"Full Life Cover", "Description":"The best life insurance plan", "Term":"20 years"}

Place this statement in a script block in your HTML page's header section and write the following alert:

alert(policy.Plan)

If you see this page in any JavaScript-supported browser, you will see that the alert displays the policy plan.
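Putting the two statements together, a minimal test page might look like the sketch below (the page title is illustrative only):

<html>
<head>
<title>JSON test</title>
<script type="text/javascript">
    var policy = {"Plan":"Full Life Cover", "Description":"The best life insurance plan", "Term":"20 years"};
    // displays "Full Life Cover"
    alert(policy.Plan);
</script>
</head>
<body>
</body>
</html>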

The example

To demonstrate JSON's capabilities, you will take a person object that can hold one or more vehicle objects. Each vehicle has the following properties:

* Brand
* Registration Number
* CC

The browser UI should allow users to add multiple vehicles with the best possible application performance (usually an inherent requirement). Each property has some restriction or validation rule attached to it. You'll assign the following rules:

* Brand Name
o Brand Name can never contain a digit.
o Brand Name can contain a maximum of two words separated by a space.
* Registration Number
o Registration Number must be all digits.
* CC
o CC must be all digits.
o CC can be a minimum of 50 and a maximum of 5000.

You will have three input fields corresponding to the vehicle properties, where a user will enter the information. Next, you'll see how to group the validation messages in a JSON group and how to access them.

Conventional approach

Now, when a user enters 40 CC for the vehicle data, the program must display a message saying that the entry does not fall within the valid CC range. You can show the message simply as in Listing 1:

Listing 1. Conventional code

if(cc < <%= minCC %> || cc > <%= maxCC %>) {
    alert("<%= ResourceList.vehicleCCRangeMsg %>");
}


ResourceList is a server-side class that holds the internationalized messages in variables like vehicleCCRangeMsg. This approach solves the problem, but with a little mess:

1. You end up adding server-side code to all client-side validation functions in order to check conditions and show messages.
2. If you change the way you organize metadata and messages (such as the server-side classes or variables), you end up changing the client script validation functions that use them.

How JSON can help

How would you feel if you had to refer only to a JavaScript variable inside condition statements and alerts rather than server-side code? You wouldn't have server-side code in your JavaScript, and changes in how the server side holds metadata and messages would not affect the client-side script. This would be great, right? Well, that is exactly what you will do when you employ JSON-based metadata caching.

You will use a JavaScript object to group your validation data and messages in a hierarchy. And you will access these messages just like you access a hierarchical JavaScript object. That's it, and you are done!

Once you have this JSON metadata object in place, your previous piece of JavaScript will look like Listing 2.

Listing 2. Alert with JSON metadata caching object

if(cc < vehicleValidationsMetadata.CC.minCC || cc > vehicleValidationsMetadata.CC.maxCC) {
    alert(vehicleValidationsMetadata.CC.RangeMessage);
}


The question now is who or what will prepare the JSON metadata object? Well, only the server can do that. The server must produce this JSON object and provide it to the client (browser). Some Java APIs help you prepare such (in fact, any kind of) JSON objects. See Resources to find those APIs.

A typical approach to generating a JSON metadata object is:

1. Prepare a hierarchical Java object for your entities and their validation messages.
2. Call toString() on it; this will usually give you a JSON-formatted string.
3. Store that string in a request scope.
4. In the JSP, get that string and assign it as the value of a JavaScript variable (as shown in the sketch below).
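A minimal sketch of that last step might look like the following; the request attribute name vehicleMetadataJson is hypothetical and only illustrates where the server-generated string is dropped into the page:

<script type="text/javascript">
    // the server-side code is assumed to have stored the JSON-formatted
    // metadata string under the (hypothetical) attribute "vehicleMetadataJson"
    var vehicleValidationsMetadata = <%= request.getAttribute("vehicleMetadataJson") %>;
</script>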

The final vehicle metadata object might look like Listing 3.

Listing 3. Validation metadata JSON object

var vehicleValidationsMetadata = {
    "BrandName":{
        "CanContainDigits":false,
        "MaxWords":2,
        "FormatMessage":"Brand Name cannot contain digits.",
        "WordLimitMessage":"Brand Name cannot contain more than two words"
    },
    "RegistrationNumber":{
        "CanContainAlphabets":false,
        "CanContainDigits":true,
        "FormatMessage":"Registration Number can contain only digits."
    },
    "CC":{
        "minCC":50,
        "maxCC":5000,
        "FormatMessage":"can only be numeric",
        "RangeMessage":"CC can be within range of 50 and 5000"
    }
}


The server must produce the entire string, except for the first and last lines, because the messages might need to reflect the current user's locale (and only server-side code can accomplish this). One thing to note here is that this metadata object is only for validating the vehicle. Better encapsulation would make the vehicle metadata object part of a person metadata object. In that case, rather than create another JavaScript variable, you can just include the vehicle metadata object in your person metadata object.
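For illustration only, nesting the vehicle metadata inside a person metadata object might look like this (the person-level rules shown are placeholders, not part of the article's example):

var personValidationsMetadata = {
    "Name":{
        /* person-level rules and messages would go here */
    },
    "Vehicle":{
        /* the vehicle metadata object from Listing 3 nests here */
    }
};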

Once you have this metadata object ready, you can use the metadata and messages in that object to validate data input and display messages. Now your JavaScript function that validates vehicle inputs might look like Listing 4.

Listing 4. Vehicle data validation function

function validateVehicleData() {
    var brandName = //get brand name from form field
    var registrationNumber = //get Registration Number from form field.
    var CC = //get CC from form field
    var brandNameTokens = brandName.split(' ');
    if(brandNameTokens.length > vehicleValidationsMetadata.BrandName.MaxWords) {
        alert(vehicleValidationsMetadata.BrandName.WordLimitMessage);
    }
    ...
    if((!vehicleValidationsMetadata.RegistrationNumber.CanContainAlphabets) &&
            isNaN(parseInt(registrationNumber))) {
        alert(vehicleValidationsMetadata.RegistrationNumber.FormatMessage);
    }
    var ccNum = parseInt(CC);
    if(ccNum < vehicleValidationsMetadata.CC.minCC || ccNum > vehicleValidationsMetadata.CC.maxCC) {
        alert(vehicleValidationsMetadata.CC.RangeMessage);
    }
}


Doesn't this code look better? It doesn't have server code littered through the JavaScript. It doesn't need to be rewritten if the server side changes the way it stores metadata. It makes the life of a JSP programmer much easier.

Extending client-side data caching

Some Web applications require users to enter multiple data for the same property or object. As an example, the person-vehicle example requires a person to enter data for each vehicle she owns. If she owns more than one vehicle, the application must allow her to enter data for more than one vehicle. I will refer to this kind of object as a multigroup attribute. If the multigroup attribute contains any property that can hold multiple data instances, I will call that a multivalue attribute.

Now, the problem with multigroup and multivalue attributes is that you have to enter the data in the same input fields. That means before you enter data for the second vehicle, you have to first save the data you entered for the first vehicle. You can solve this problem two ways:

1. Send the first vehicle's data to the server and blank out the input fields to allow the user to enter the next vehicle's data.
2. Cache the data on the client and blank out the input fields to allow the user to enter the next vehicle's data.

The problem with the first approach is that it needs a server visit for each vehicle entered. This isn't pretty; users will become frustrated when they have to wait for a server response after they enter each vehicle's data. The second approach, by contrast, has almost zero response time: the user can enter all vehicle data quickly without waiting. But the concern here is how you cache the data on the client side. Here are two ways to store the data on the client:

1. Cache the data in some format into hidden form field(s) as the user clicks to add the next vehicle's data.
2. Cache data into a JavaScript object.

When you store data in hidden fields, you end up manipulating many hidden fields, or manipulating the hidden field data, every time the user enters new vehicle data. This is like frequently manipulating a string with string operations.

But the second form of caching data offers an object-oriented approach to caching. When the user enters new vehicle data, you create a new element in the array object. There are no clumsy string operations. When the user is done with all the vehicles, you can simply form a JSON string out of that object and send it to the server by storing it in some hidden field. This approach is much more elegant than the first one.

JSON, data caching and Ajax abilities

When you cache data on the client side using JSON, you update the data caching object every time the user clicks on the Add Vehicle button. The JavaScript function to accomplish this task might look like Listing 5.

Listing 5. Function to add vehicle data into JavaScript object for client-side caching

function addVehicleData() {
    var brand = //get vehicle brand;
    var regNo = //get registration number;
    var cc = //get cc;

    var index = vehicleData.length;
    vehicleData[index] = new Object();
    vehicleData[index].brandName = brand;
    //same way update other two properties
}


Here, vehicleData is a JavaScript variable that is initialized when the user loads the page. It is initialized to a new, empty array object, or to an array holding vehicle elements the user entered earlier.

Once this function saves the data into a JavaScript object, the program can invoke another function that will clear out the input fields to allow a user to enter new data.

In such applications, you will require the user to enter a certain minimum or maximum number of occurrences of multigroup or multivalue attributes. You can put these limits into the JSON metadata object. In this case, your earlier metadata object will look like Listing 6.

Listing 6. JSON metadata object with occurrence limits

var vehicleValidationsMetadata = {
    "MIN_OCC":0,
    "MAX_OCC":10,
    "MAX_OCC_MSG":"Your message....",
    "MIN_OCC_MSG":"Your message.....",
    //Everything else is the same
}


Then your addVehicleData() function will validate the data on occurrences first and will add data to the JavaScript object only if the total occurrences are within the allowed limits. Listing 7 shows how you check this.

Listing 7. JSON metadata object limit check


function addVehicleData() {
    if(vehicleData.length >= vehicleValidationsMetadata.MAX_OCC) {
        alert(vehicleValidationsMetadata.MAX_OCC_MSG);
        return;
    }
    //Everything else is the same
}


The function that is called when a user submits the page validates for minimum occurrences. The biggest advantage of this approach is that the screen doesn't refresh as the user enters new vehicle data. Providing such refresh-free screens was a primary objective of the Ajax approach, and you can accomplish this with JSON as well: it is all about updating the JSON data object and manipulating the HTML DOM tree through JavaScript. The user response time is minimal because everything executes on the client side. In this way, you can use JSON to provide Ajax abilities to your application.
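As a hedged sketch of that idea, the helper functions below clear the input fields and append a summary of the newly cached vehicle to a list on the page; the field and element ids (brandName, registrationNumber, cc, vehicleList) and the cached property names are assumptions, not part of the original example:

function clearVehicleFields() {
    document.getElementById("brandName").value = "";
    document.getElementById("registrationNumber").value = "";
    document.getElementById("cc").value = "";
}

function appendVehicleRow(vehicle) {
    // show the cached entry without refreshing the page
    var li = document.createElement("li");
    li.appendChild(document.createTextNode(
        vehicle.brandName + " / " + vehicle.registrationNumber + " / " + vehicle.cc));
    document.getElementById("vehicleList").appendChild(li);
}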

When a user clicks the Save button, the program calls another JavaScript function that will stringify this JSON object and store it in the hidden form field that the program submits to the server. JSON.js (see Resources) has a JSON.stringify() function that takes the JavaScript object as input and returns string output.
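A minimal sketch of that save step might look like the following, assuming json.js is loaded and the form contains a hidden input named vehicleDataJson (a hypothetical name):

function saveVehicleData() {
    // serialize the cached vehicle array into a JSON string
    var jsonString = JSON.stringify(vehicleData);
    // store it in the hidden field and submit the form to the server
    document.forms[0].vehicleDataJson.value = jsonString;
    document.forms[0].submit();
}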

The server side has to be able to understand a JSON-formatted string and produce a server-side object in order to proceed and save the data. The Web site http://www.json.org/java/index.html offers a Java API that serves most of these needs for Java-based applications.

Conclusion

You saw powerful uses for JSON in this article. To summarize:

1. JSON provides an elegant and object-oriented way to cache metadata on the client.
2. JSON helps separate validation data and logic.
3. JSON helps provide an Ajaxian nature to a Web application.

Cross Site Scripting Vulnerability in Google

Google is vulnerable to cross site scripting. While surfing around the personalization section of Google I ran across the RSS feed addition tool, which is vulnerable to XSS. The employees at Google were aware of XSS, as they protected against it as an error condition; however, if you input a valid URL (like my RSS feed) it will return a JavaScript function containing the URL.

If you append the URL of the valid feed with a query string that contains your cross site scripting exploit Google will not sanitize it upon output of the JavaScript (it will upon screen render of the main page, but at that point it is too late). The JavaScript is not intended to be rendered directly, but that’s irrelevant, and can be exploited directly. Because this lives on the http://www.google.com/ domain it is not subject to cross domain policy restrictions that have typically protected Google from these attacks in the past.

Here is a screenshot of the vulnerability:

Cross Site Scripting Vulnerability in Google

If you want to see the vulnerability for yourself click here (this is a benign proof of concept). As I said, this is using the query string from a valid feed to inject the vector. It doesn’t work if you inject it into the Add Content function on the page because the page itself sanitizes the output. Unfortunately for Google this can be intercepted far earlier than the page that does the eventual sanitization. One of the worst parts of this is that it does not require you to be logged in to exploit this cross site scripting vulnerability.

Additionally, in a few seconds of searching, I also found that Google has yet another URL redirection attack in it that can be used for phishing attacks, located here (it will redirect you to a benign site that demonstrates the attack). Google has been pretty notoriously slow at fixing these sorts of attacks in a timely manner (the last one that was actually being used by phishers was open for nearly a month), but they are really bad, because phishers can easily bounce their traffic off of these trusted domains. People are far more likely to click on a website that says www.google.com than they are to click on a site that says www.wellfsarg0.com or something equally obvious. I understand they are used for tracking purposes, but there are ways around this, like checking against whitelists, or checking against an embedded hash, etc. It’s processor intensive, but it protects the internet community.

Also, in a few minutes of checking, I found a CSRF (cross site request forgery) in Google where malicious websites can change the default map search location. This is really not a big deal as far as I can tell, besides annoying Google and its users, but it’s worth mentioning. Make sure you are logged into Google and then click on the following CSRF link to change your default location to the White House. Annoying, but I doubt there is a bigger hole here. The point is that Google definitely has not protected against CSRF, and I am sure there are additional vulnerabilities here that I have not played with, since I only spent a few minutes looking at it.

So back to the cross site scripting vector, since that is by far the most dangerous. What are the implications of this attack for Google? Well, for starters, I can put a phishing site on Google. “Sign up for Google World Beta.” I can steal cookies to log in as the user in question, I can use the credentials of the user to screen scrape any of the content off of the www cname, including changing options like adding my RSS feed to your page, or deleting them, etc… I can steal your phone number from the /sendtophone application via an XML RPC (AJAX) call via a POST method, get your address because maps.google.com is mirrored on http://www.google.com/maphp?hl=en&tab=wl&q= etc… the list of potential vulnerabilities goes on and on. The vulnerabilities only grow as Google builds out their portal experience.

Indeed this also could have massive blackhat SEO (Search Engine Optimization) implications as Google sets itself as the authority in page rank (above other sites with more traffic). Its own page is set as a 10 in page rank. Trusting yourself could actually prove to be dangerous in this case, although this is a theoretical attack. Injecting your own links and getting engines to spider it could temporarily dramatically inflate page rank as /ig/ (where the vulnerable function is stored) is not disallowed by Google’s robots.txt file (again this is a theoretical attack and it is easy for Google to de-list violators).

Ultimately, Google cannot be trusted implicitly because of these types of holes, in the same way any major site cannot be trusted implicitly for the same reason. There are too many potential issues in their applications, and your information is definitely not 100% safe when entered there.

This will become particularly relevant at Jeremiah Grossman’s talk at Blackhat next month, where he starts to go into the real issues with cross site scripting, and how dangerous these forms of attack really can be (far beyond what is currently well known). Can you tell I’m excited? I don’t particularly blame Google, as all major websites are vulnerable to this, in my experience, it’s just that with a site’s popularity it becomes exponentially more dangerous and the responsibility to find these issues before the security community increases at the same rate.

Ten Worst Internet Acquisitions Ever

As the market for acquiring fledgling Internet companies heats up, it's worth taking a look at all those acquisitions that didn't quite work out. For every Internet acquisition that's successful there seems to be dozens that die on the vine.

So what makes for a really bad Internet acquisition? First, it has to be expensive. No one's going to rake a company over the coals over a few blown $50 million acquisitions. That might sound like a lot of money to you and me, but that's a rounding error to Google.

Second, for an acquisition to be lousy it has to contribute little or no long term growth to the acquiring company. An acquisition that doesn't fit with a company's long term strategy and that is quickly forgotten - that's a bad buy.

So, here is my highly subjective list of the 10 worst Internet acquisitions of all time:

12. MySpace - acquired by News Corporation in 2005 for $580 million in cash, one of its largest bets on the Internet. News Corp. bought Intermix Media Inc., a Los Angeles-based company whose chief asset was MySpace.com, a Web site that was enjoying surging popularity with young audiences.

11. Hotmail - acquired by Microsoft (MSFT) in 1998 for about $400 million. Hotmail was a second-tier free email service when Microsoft bought it and the acquisition did little to improve Microsoft's internet portal ambitions.

10. Skype - acquired by eBay (EBAY) in September 2005 for $2.6 billion. While it's early to call this one an absolute dud, eBay does not seem to have a plan - or at least a plan that would justify the acquisition price - for how to integrate Skype's calling service with the core auction business.

9. MySimon - acquired by CNET (CNET) in 1999 for $700 million. The price comparison site mySimon was supposed to launch CNET into lots of non-tech verticals - not a bad idea at the time. Unfortunately CNET had no idea how to effectively integrate mySimon and it's now withering away, surpassed by newer, shinier price comparison engines.

8. BlueMountain.com - acquired by Excite@Home in 1999. $780 million for an online greeting card site. 'Nuff said.

7. YouTube - acquired by Google for $1.65 billion in 2006 to bolster Google's video broadcasting ambitions.

6. Lycos - acquired by Terra Networks for $4.6 billion in 2000. Yeah, I never heard of Terra either. The warning bells should have gone off when the deal was originally announced in May 2000 at a value of $12.5 billion, only to fall by more than 50% by the time it closed in October of that year because each company's stock price was plummeting.

5. Netscape - acquired by AOL (TWX) in 1998 for $4.2 billion. To be fair, this was a mercy acquisition. By the time AOL bought the company, Netscape had been humbled by Microsoft's free Internet Explorer browser. AOL clearly had no plans for Netscape and as a result the once pioneering company is now an afterthought.

4. GeoCities - acquired by Yahoo! (YHOO) in 1999 for $3.56 billion. When was the last time you visited a site with a geocities.com domain? I can't remember either. Shortly after the acquisition, innovation on GeoCities appears to have ground to a halt. GeoCities could have been MySpace, but the entire social networking revolution passed them right by.

3. Excite - acquired by @Home in 1999 for $6.7 billion. Remember Excite.com? Remember how it was the #2 or #3 portal for a while? Well, a whole year and a half after the cable company @Home acquired Excite (for $394 per user!) in January 1999, the combined entity filed for bankruptcy, never to be heard from again. Classically disastrous.

2. AOL - merged with TimeWarner in 2000. This one is obvious. While Time Warner finally seems to be turning things around at AOL six years after the fact, this merger was doomed from the start. Shortly after the merger AOL's business started falling apart fast, with TimeWarner holding the bag. There was never a coherent integration plan and all that talk of synergy is - thankfully - dead and gone.

1. Broadcast.com - acquired by Yahoo! in 1999 for $5 billion. Yahoo! paid a mind-boggling $710 per user back in the heyday of the bubble. But why does this rank higher than the AOL boondoggle? Two words: Mark Cuban. Yahoo's ludicrous overpayment for Broadcast.com gave Cuban the money to go out and buy the Dallas Mavericks basketball team and permanently implant himself on the American psyche. Unforgivable.

Dynamic HTML and XML: XMLHttprequest Object

As deployment of XML data and web services becomes more widespread, you may occasionally find it convenient to connect an HTML presentation directly to XML data for interim updates without reloading the page. Thanks to the little-known XMLHttpRequest object, an increasing range of web clients can retrieve and submit XML data directly, all in the background. To convert retrieved XML data into renderable HTML content, rely on the client-side Document Object Model (DOM) to read the XML document node tree and compose HTML elements that the user sees.

History and Support

Microsoft first implemented the XMLHttpRequest object in Internet Explorer 5 for Windows as an ActiveX object. Engineers on the Mozilla project implemented a compatible native version for Mozilla 1.0 (and Netscape 7). Apple has done the same starting with Safari 1.2.

Similar functionality is covered in a proposed W3C standard, the Document Object Model (DOM) Level 3 Load and Save Specification. In the meantime, growing support for the XMLHttpRequest object means that it has become a de facto standard that will likely be supported even after the W3C specification becomes final and starts being implemented in released browsers (whenever that might be).

Creating the Object

Creating an instance of the XMLHttpRequest object requires branching syntax to account for browser differences in the way instances of the object are generated. For Safari and Mozilla, a simple call to the object's constructor function does the job:

var req = new XMLHttpRequest();

For the ActiveX branch, pass the name of the object to the ActiveX constructor:

var req = new ActiveXObject("Microsoft.XMLHTTP");

The object reference returned by both constructors is to an abstract object that works entirely out of view of the user. Its methods control all operations, while its properties hold, among other things, various data pieces returned from the server.

Object Methods

Instances of the XMLHttpRequest object in all supported environments share a concise, but powerful, list of methods and properties. Table 1 shows the methods supported by Safari 1.2, Mozilla, and Windows IE 5 or later.

Common XMLHttpRequest Object Methods

Method - Description
abort() - Stops the current request
getAllResponseHeaders() - Returns complete set of headers (labels and values) as a string
getResponseHeader("headerLabel") - Returns the string value of a single header label
open("method", "URL"[, asyncFlag[, "userName"[, "password"]]]) - Assigns destination URL, method, and other optional attributes of a pending request
send(content) - Transmits the request, optionally with postable string or DOM object data
setRequestHeader("label", "value") - Assigns a label/value pair to the header to be sent with a request


Of the methods shown in Table 1, the open() and send() methods are the ones you'll likely use most. The first, open(), sets the scene for an upcoming operation. Two required parameters are the HTTP method you intend for the request and the URL for the connection. For the method parameter, use "GET" on operations that are primarily data retrieval requests; use "POST" on operations that send data to the server, especially if the length of the outgoing data is potentially greater than 512 bytes. The URL may be either a complete or relative URL (but see security issues below).

An important optional third parameter is a Boolean value that controls whether the upcoming transaction should be handled asynchronously. The default behavior (true) is to act asynchronously, which means that script processing carries on immediately after the send() method is invoked, without waiting for a response. If you set this value to false, however, the script waits for the request to be sent and for a response to arrive from the server. While it might seem like a good idea to wait for a response before continuing processing, you run the risk of having your script hang if a network or server problem prevents completion of the transaction. It is safer to send asynchronously and design your code around the onreadystatechange event for the request object.

The following generic function includes branched object creation, event handler assignment, and submission of a GET request. A single function argument is a string containing the desired URL. The function assumes that a global variable, req, receives the value returned from the object constructors. Using a global variable here allows the response values to be accessed freely inside other functions elsewhere on the page. Also assumed in this example is the existence of a processReqChange() function that will handle changes to the state of the request object.

var req;

function loadXMLDoc(url) {
    req = false;
    // branch for native XMLHttpRequest object
    if (window.XMLHttpRequest && !(window.ActiveXObject)) {
        try {
            req = new XMLHttpRequest();
        } catch(e) {
            req = false;
        }
    // branch for IE/Windows ActiveX version
    } else if (window.ActiveXObject) {
        try {
            req = new ActiveXObject("Msxml2.XMLHTTP");
        } catch(e) {
            try {
                req = new ActiveXObject("Microsoft.XMLHTTP");
            } catch(e) {
                req = false;
            }
        }
    }
    if (req) {
        req.onreadystatechange = processReqChange;
        req.open("GET", url, true);
        req.send("");
    }
}

Note: It is essential that the data returned from the server be sent with a Content-Type set to text/xml. Content that is sent as text/plain or text/html is accepted by the instance of the request object; however, it will only be available for use via the responseText property.

Object Properties

Once a request has been sent, scripts can look to several properties that all implementations have in common, shown in Table 2. All properties are read-only.

Common XMLHttpRequest Object Properties

Property - Description
onreadystatechange - Event handler for an event that fires at every state change
readyState - Object status integer:
    0 = uninitialized
    1 = loading
    2 = loaded
    3 = interactive
    4 = complete
responseText - String version of data returned from server process
responseXML - DOM-compatible document object of data returned from server process
status - Numeric code returned by server, such as 404 for "Not Found" or 200 for "OK"
statusText - String message accompanying the status code


Use the readyState property inside the event handler function that processes request object state change events. While the object may undergo interim state changes during its creation and processing, the value that signals the completion of the transaction is 4.

You still need more confirmation that the transaction completed successfully before daring to operate on the results. Read the status or statusText properties to determine the success or failure of the operation. Respective property values of 200 and OK indicate success.

Access data returned from the server via the responseText or responseXML properties. The former provides only a string representation of the data. More powerful, however, is the XML document object in the responseXML property. This object is a full-fledged document node object (a DOM nodeType of 9), which can be examined and parsed using W3C Document Object Model (DOM) node tree methods and properties. Note, however, that this is an XML, rather than HTML, document, meaning that you cannot count on the DOM's HTML module methods and properties. This is not really a restriction because the Core DOM module gives you ample ways of finding element nodes, element attribute values, and text nodes nested inside elements.
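For instance, a fragment like the sketch below walks the returned XML with Core DOM calls; the element names are assumptions for illustration:

// assumes the response is an RSS-like document with item and title elements
var xmlDoc = req.responseXML;
var items = xmlDoc.getElementsByTagName("item");
if (items.length > 0) {
    var titleElem = items[0].getElementsByTagName("title")[0];  // element node
    var titleText = titleElem.firstChild.nodeValue;             // nested text node
    alert(titleText);
}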

The following listing shows a skeletal onreadystatechange event handler function that allows processing of the response content only if all conditions are right.

function processReqChange() {
    // only if req shows "loaded"
    if (req.readyState == 4) {
        // only if "OK"
        if (req.status == 200) {
            // ...processing statements go here...
        } else {
            alert("There was a problem retrieving the XML data:\n" +
                req.statusText);
        }
    }
}

If you are concerned about possible timeouts of your server process, you can modify the loadXMLDoc() function to save a global time-stamp of the send() method, and then modify the event handler function to calculate the elapsed time with each firing of the event. If the time exceeds an acceptable limit, then invoke the req.abort() method to cancel the send operation, and alert the user about the failure.
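A minimal sketch of that timeout idea follows; the 10-second limit, the global sendTimestamp variable, and the wrapper function name are assumptions for illustration. It is a variant of the processReqChange() skeleton shown above.

var sendTimestamp;

function loadXMLDocTimed(url) {
    // same as loadXMLDoc(), but remember when the request was sent
    loadXMLDoc(url);
    sendTimestamp = new Date().getTime();
}

function processReqChange() {
    var elapsed = new Date().getTime() - sendTimestamp;
    if (req.readyState != 4 && elapsed > 10000) {
        // give up: detach the handler first so abort() doesn't re-enter it
        req.onreadystatechange = function() {};
        req.abort();
        alert("The request timed out.");
        return;
    }
    if (req.readyState == 4) {
        if (req.status == 200) {
            // ...processing statements go here...
        } else {
            alert("There was a problem retrieving the XML data:\n" + req.statusText);
        }
    }
}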

Security Issues

When the XMLHttpRequest object operates within a browser, it adopts the same-domain security policies of typical JavaScript activity (sharing the same "sandbox," as it were). This has some important implications that will impact your application of this feature.

First, on most browsers supporting this functionality, the page that bears scripts accessing the object needs to be retrieved via http: protocol, meaning that you won't be able to test the pages from a local hard disk (file: protocol) without some extra security issues cropping up, especially in Mozilla and IE on Windows. In fact, Mozilla requires that you wrap access to the object inside UniversalBrowserRead security privileges. IE, on the other hand, simply displays an alert to the user that a potentially unsafe activity may be going on and offers a chance to cancel.

Second, the domain of the URL request destination must be the same as the one that serves up the page containing the script. This means, unfortunately, that client-side scripts cannot fetch web service data from other sources, and blend that data into a page. Everything must come from the same domain. Under these circumstances, you don't have to worry about security alerts frightening your users.

An Example: Reading XML Data from iTunes RSS Feeds

You can play with an example that points to four static XML files for demonstration purposes. The data sources are snapshots of some iTunes Store-related RSS feeds. Because the actual feeds are hosted at a third-party domain, the mixed domains of the example file and the live RSS sources prevent a truly dynamic example.

When you choose one of the four listing categories, the script loads the associated XML file for that category. Further scripts extract various element data from the XML file to modify the options in a second select element. A click on one of the items reads a different element within that item's XML data. That data happens to be HTML content, which is displayed within the example page without reloading the page.

Note that the sample data includes some elements whose tag names contain namespace designations. Internet Explorer for Windows (at least through version 6) does not implement the DOM getElementsByTagNameNS() function. Instead it treats namespaced tag names literally. For example, the content:encoded element is treated as an element whose tag name is content:encoded, rather than a tag whose local name is encoded and whose prefix is content. The example includes a utility API function called getElementTextNS() which handles the object model disparities, while also retrieving the text node contained by the desired element.
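A hedged sketch of what such a utility might look like appears below; the real example's implementation may differ in its details:

// returns the text contained by the index-th element matching the given
// namespace prefix and local name, smoothing over the IE/Windows behavior
function getElementTextNS(prefix, local, parentElem, index) {
    var result = "";
    if (prefix && typeof parentElem.getElementsByTagNameNS != "undefined") {
        // Mozilla and Safari: search by local name in any namespace
        result = parentElem.getElementsByTagNameNS("*", local)[index];
    } else {
        // IE/Windows: the namespaced tag name is treated literally
        result = parentElem.getElementsByTagName(prefix + ":" + local)[index];
    }
    if (result) {
        // concatenate all child text nodes of the matched element
        var text = "";
        for (var i = 0; i < result.childNodes.length; i++) {
            text += result.childNodes[i].nodeValue;
        }
        return text;
    }
    return "";
}

For example, assuming itemElem refers to an item node from the feed, getElementTextNS("content", "encoded", itemElem, 0) would retrieve the HTML markup stored in that item's content:encoded element.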

If you download the examples (DMG 2.0MB), you can test them yourself by placing them in your personal web server's Sites folder and accessing the page via the http: protocol. Keep all files, including the XML samples, in the same directory.

A Cool Combo

For years, advanced client-side developers have frequently wanted a clean way to maintain a "connection" with the server so that transactions can occur in the background, and newly updated data gets inserted into the current page. Many have tortured themselves by using techniques such as hidden self-refreshing frames and "faceless" Java applets. In lieu of a W3C standard still under development, the Microsoft-born XMLHttpRequest object fills an important gap that should inspire application development creativity. The feature is a welcome addition to Safari.

Wednesday, November 22, 2006

Client Side Page Caching


This Issue

In this issue we will discuss page caching, including different browsers' cache schemes. We will also discuss how Microsoft Proxy Server page caching works, how to get your pages cached, and how not to. Examples of manipulating the Cache-Control header will be given in Active Server Pages.

Page Caching

Client side page caching is where the client (browser) caches pages to the hard drive. When a page is requested from the server, the response is written to the hard drive as a file by the browser. If the page is needed again, the client uses the page from the cache, as long as the page hasn't expired. If the page has expired, the browser asks the server if there is a newer page on the site and rewrites the cache with the response. The reason for client side page caching, as for any caching, is performance. It is much faster to read the file from the hard drive than to wait for the page or graphic to download from the web server. The theory is that pages that are accessed are more likely to be accessed again in the near future. The hard drive cache is limited in size through a setting in the browser. When the cache fills, the older files in the cache are erased to make room for the newer ones.

You can view Internet Explorer's cache by looking in this directory:

Windows NT

c:\winnt\Temporary Internet Files

Windows 95

c:\windows\Temporary Internet Files

Many browsers let you modify how they use the cache. Internet Explorer works this way as well.

Configuring the Browser

You can configure how the browser caches files as well. By default, Internet Explorer checks each file in the cache once after Explorer is started. This means that on the first viewing of the page, the browser requests the page from the server using the If-Modified-Since header, which was discussed in the last issue. If the server returns "304 Not Modified", then the page is used from the cache, and every view after that uses the cache unless the page expires or the user "refreshes." If a new page is returned, its last modified date and expiration date are written to the cache.

You can also configure the browser to never request the page from the server again once it is cached, unless it either expires or the user requests a refresh. Or you can have the page requested from the server every time. In the case where the page is requested every time, the cache is still used whenever a "304 Not Modified" status is returned. To set the Internet Explorer browser configuration:

From within Internet Explorer 3.0:

* Click on Tools | Options, and the Options Dialog will come up.
* Click on the Advanced Tab.
* Within the Temporary Internet Files group, click on Settings and the Settings Dialog will appear.
* From Check for newer version of stored file choose either Every visit to the page, Every time you start Internet Explorer, or Never.
* Click on OK to close the Settings Dialog and again to close the Options Dialog.

Caching and the Expires Header

In the last issue, we discussed the Expires header and how to set it from the server. By setting the Expires header to the current date, pages that are accessed by typing a URL into the address box, using the browser's navigation buttons, or following a link get requested again from the server. This is true even if the page is in the cache. To set the Expires header to the current date in an Active Server page, add the following line:

Example 1

Response.Expires = 0

It might seem like there is no difference between a page that has expired and a page that is not cached. If a page is not cached on the client side and the page is accessed by typing a URL into the address box, using the browser's navigation buttons, or through a link, the page is requested again from the server.

However, there is a difference. In the last issue we discussed how a cached page in Internet Explorer that has the same last modified date as the server response uses the cached page, while Netscape uses the page in the response. In the case where the page has expired and the last modified date is the same, you will see the page from the cache using Internet Explorer. However, if the page is not cached, you will always see the page in the response using Internet Explorer. Because of this difference, you might want to force the browser not to cache the page; the result will look very much like setting the Expires header to the current date.

Pragma

In HTTP 1.0 there is a Pragma header that is documented to control page caching. The Pragma is a header that is passed back from the server to the client in the response. To send the command not to cache the page in Active Server Pages, add this line to the top of the Active Server page:

Example 2

Response.AddHeader "Pragma","no-cache"

This will prevent the Netscape browser from caching the page on the disk; however, the Internet Explorer 3.x browser will continue to cache the page even with the pragma.

Proxy Caching

The following information about proxy caching is based on the functionality of the majority of proxies, including Microsoft Proxy Server. However, there is a wide range of proxies available and they do not all function the same. Also note that not all proxies cache, nor do all of them have proxy caching turned on.

Proxies will cache files in hopes of better network performance. When a response from a web server returns from a request through the proxy, the proxy takes the page and caches it. If another browser makes the same request, the proxy uses the cached file and the request never makes it to the server. Like the browser cache, the proxy has a limited amount of room so most proxies remove pages from the cache based on inactivity.

The caching mechanism for the proxy uses the Last-Modified header and the Expires header to determine when and for how long to cache the information. Because proxy caches are all programmed differently, there is no broad sweeping statement that can be made for all proxies. Instead, let us look at a specific proxy, the Microsoft Proxy Server.

If an expiration date exists, the Microsoft Proxy Server honors that expiration date. When the cached file has expired, the proxy server removes the page from the cache. The next request then passes through the proxy and the response is cached with the new expiration date. If the expiration date is equal to the current date and time, the page is not cached at all. Such is the case with Example 3:

Example 3

<% Response.Expires=0 %>
<HTML>
<BODY>
Example 3
</BODY>
</HTML>



If there is no Expires header, then the proxy server bases the expiration of the cached page on the Last-Modified header. The Microsoft Proxy Server caches the page for twenty percent of the difference between the Last-Modified date and the current date. If the page is five days old, the proxy server will set the expiration date to twenty percent of five days; the page expires in a day.

If there is neither the Last-Modified header nor the Expires header, then Microsoft Proxy Server sets the expiration date to 10 minutes.

Manipulating the Proxy

One of the problems with proxies is that they assume a response for one person is going to be the same for all people requesting the page. This isn't true in certain situations, especially when you are returning a dynamic page that contains information specific to the user requesting the page. For this reason you can tell proxies that certain responses are private and should not be shared publicly. The way to do this is to send a Cache-Control header of private, like this:

Cache-Control: private

Fortunately, Internet Information Server by default automatically adds this header to all Active Server Pages. There might be instances where you want to change the Cache-Control header in an Active Server page, for example if you are returning content that is dynamically generated but not based on the individual. IIS 3.0 does not let you change the Cache-Control header; however, IIS 4.0 does. You can set the Cache-Control header like this:

Response.CacheControl = "Public"

If you don't set the Cache-Control header in IIS 4.0, it will default to private for Active Server Pages.

Notice that there is only one way to manipulate the Last-Modified header and the Expires header to keep the proxy from caching the page: set the Expires header to the current time, in other words Response.Expires=0. However, this adversely affects the client. With the Expires header set to the current date, the page is loaded every time the user views it. For this reason, we advise you to use the Cache-Control header.

Active Server Pages

The real knowledge that you need to take away from the last issue and this issue is how to manipulate the HTTP headers to get the results that you want. There are two extreme cases, either you want the view to be "new" every time the user visits the Active Server page, or you want the browser to cache the Active Server page and save server resources and network bandwidth.

Default

Without modifying or adding any headers, the default setting for an Active Server page is only to send the Cache-Control header as private. This means that the page will be cached and will not be requested from the server again unless the user "refreshes." The exception is if the browser is closed and restarted, since by default the browser requests the page again upon the first view.

No Caching

If you are writing Active Server Pages and want the view on the browser to change every time the user navigates to the page or refreshes it, you must set the Expires header to the current date for both Internet Explorer and Netscape 4.0. You must also set the no-cache Pragma header for Netscape 3.x. You must either not send a Last-Modified header, or send a Last-Modified header that is constantly newer than the last one sent. Because the Last-Modified header is optional, we recommend that you do not use a Last-Modified header at all if you want the view to change every time the user navigates to the page. Make sure to leave the Cache-Control header set to private, the default of Internet Information Server.

Caching

If you want your Active Server page to cache on the browser, set the Last-Modified and Expires headers. Make sure not to send the no-cache Pragma command. You will also need to follow the instructions for returning the status of "304 Not Modified" documented in the last issue. If you want to share the cache on the proxy between multiple users, set the Cache-Control header to public. Remember, you cannot set the Cache-Control header in IIS 3.0, only in IIS 4.0.

Thursday, November 16, 2006

What Real People Use On The Web

At last year's Web 2.0 Conference, a much discussed panel featured a group of teenagers telling everyone what Web products they use. This year the concept has been taken a level further by inviting the parents of the teenagers as well. The panel was moderated by Safa Rashtchy.

Most of the panel has Google as their main search engine. One adult panelist uses Ask.com, because she can put questions into the search box. One (adult) panelist says she uses Google out of habit. Another adult panelist uses Yahoo for the maps and other information. One of the teenagers says she uses Google because her school wants her to. Most of the panelists did not know MSN had a search engine.

In regards to online video, one adult panelist said she spends 3-4 hours per week on YouTube. A teenager spends 2-3 hours a day at the library with his friends watching YouTube. One teenager says he uses Google Video. Another says she uses some download software (Shakespeare something??). Safa asks would any of them pay $1 to watch video, e.g. Lost. The majority opinion is no. As for free video with ads, one teenager says she would, and in fact it's better than downloading.

Website recognition

Safa rattles off some names of websites:

  • Skype - just two people know what it is and use it, both for international calls. One says "it rarely works" in terms of sound quality.
  • craigslist - yes
  • yelp - no
  • judyslist - no
  • blogs - yes, about half read them; a couple mentioned reading them on myspace, but also for their interests/hobbies

Re uploading, most have done this. One adult panelist says pictures.

Internet Companies

Safa now asks about companies:

  • Yahoo: most of the adult panelists use them - e.g. for horoscopes. a lot of them use Yahoo Mail (a few use Gmail); one teenager thinks Yahoo is "silly", as in entertaining. One adult panelist says she likes Yahoo for things like games and emails, and scans the news.
  • Google: one teenager "seems more like a friend" and he also uses Gmail because it's easy and user-friendly; another says he uses Google Video and he likes Google; all the teenagers would trust Google over Yahoo!
  • MSN: nobody has much to say about it; one teenager says he likes Xbox; one says they like the little (cartoon) characters. Word was mentioned.
  • eBay: some of them use it, e.g. for concert tickets, books etc.
  • Amazon: a few people, one says for "mostly media type things" like books, CDs.

Instant Messaging

One teenager uses AOL "all day" to talk to his friends. Same for another teenage boy. Three most mentioned by teenagers were AIM, MSN and Yahoo. One says 2-3 hours per day.

MySpace

One teenager compares MySpace to xmas presents, because he sees something new or a new friend every day - he spends around 3 hours per day. Another says 2-3 hours per day - "making sure my profile's good". One mother signed up to monitor what her child was doing - she found out her 14 year old son was 17 on MySpace.

Troubles with Asynchronous Ajax Requests and PHP Sessions


As I sit here watching “The Muppets Take Manhattan” in Spanish in the middle of a Costa Rican thunderstorm, I find my mind drifting back to a recent project where I spent a day debugging a frustratingly annoying problem: A user would visit the web application I was working on, and after a given page was loaded, all of the session data associated with their visit would be suddenly gone. The user would no longer be logged into the site, and any changes they made (which were logged in session data) were lost.

I spent tonnes of time in the debugger (while at times unreliable and frustrating on huge projects, the Zend debugger is still an invaluable aid for the PHP application developer) and kept seeing the same thing: the session data were simply being erased at some point, and the storage in the database would register ’’ as the data for the session.

It was driving me crazy. I would sit there in the debugger and go through the same sequence each time:

  • Debug the request for the page load.
  • Debug the request for the first Ajax request that the page load fired off.
  • Debug the request for the second Ajax request that the page load simultaneously fired off.
  • Debug the request for the third Ajax request that the page initiated.

In retrospect, looking at the above list, it seems blindingly obvious what I had been running into, but it was very late in a very long contract, and I blame the fatigue for missing what now seems patently obvious: A race condition.

For those unfamiliar with what exactly this is, a race condition is seen most often in applications involving multiple “threads of execution” – which include either separate processes or threads within a process – when two of these threads (which are theoretically executing at the same time) try to modify the same piece of data.

If two threads of execution that are executing more or less simultaneously (but never in exactly the same way, because of CPU load, other processes, and chance) try to write to the same variable or data storage location, the value of that storage location depends on which thread got there first. Given that it is impossible to predict which one got there first, you end up not knowing the value of the variable after the threads of execution are finished (in effect, “the last one to write, wins”) (see Figure 1).

Figure 1: Racing to destroy values

Normally, when you write web applications in PHP, this is really not an issue, as each page request gets its own execution environment, and a user is only visiting one page at a time. Each page request coming from a particular user arrives more or less sequentially and shares no data with other page requests.

Ajax changes all of this, however: suddenly, one page visit can result in a number of simultaneous requests to the server. While the separate PHP processes cannot directly share data, a solution with which most PHP programmers are familiar exists to get around this problem: sessions. The session data that the various requests modify are now susceptible to being overwritten with bad data by other requests, after a given request thinks it has written out updated and correct data (see Figure 2).

Figure 2: When requests go bad - clobbering data

In the web application I was working on, all of the Ajax requests were being routed through the same code that called session_start() and, implicitly, session_write_close() (when PHP ends and there is a running session, this function is called). One of the Ajax requests would, however, set some session data to help the application “remember” which data the user was browsing. Depending on the order in which the various requests were processed by the server, sometimes those data would overwrite other session data and the user data would be “forgotten”.

As an example of this problem, consider the following example page, which, when fully loaded, will execute two asynchronous Ajax requests to the server.

The code is divided into three main sections:
  • The first contains the call to session_start() and opens the HTML headers.
  • The second contains the JavaScript code to execute the asynchronous requests to the server. The biggest function, getNewHTTPObject(), is used to create new XMLHttpRequest objects. The onLoadFunction() is executed when the page finishes loading and starts the ball rolling, while the other two functions are simply used to wait for and handle the responses and results from the asynchronous requests.
  • In the final section, we just write out the body of the document, which contains a single div element to hold the results and an onload attribute on the body element to make sure that the onLoadFunction() is called when the document finishes loading. (A sketch of this client-side code appears below.)
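What follows is a hedged sketch of the JavaScript portion described above; the function names follow the article's description, but the request parameter names (request=req1, request=req2) and the results element id are assumptions:

var xhr1, xhr2;

function getNewHTTPObject() {
    // branch for the native object vs. the IE/Windows ActiveX versions
    if (window.XMLHttpRequest) {
        return new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        try {
            return new ActiveXObject("Msxml2.XMLHTTP");
        } catch (e) {
            return new ActiveXObject("Microsoft.XMLHTTP");
        }
    }
    return false;
}

function onLoadFunction() {
    // fire both requests at once; whichever is processed last "wins"
    // any session writes it makes on the server
    xhr1 = getNewHTTPObject();
    xhr1.onreadystatechange = handleFirstResponse;
    xhr1.open("GET", "race2.php?request=req1", true);
    xhr1.send(null);

    xhr2 = getNewHTTPObject();
    xhr2.onreadystatechange = handleSecondResponse;
    xhr2.open("GET", "race2.php?request=req2", true);
    xhr2.send(null);
}

function handleFirstResponse() {
    if (xhr1.readyState == 4 && xhr1.status == 200) {
        document.getElementById("results").innerHTML += xhr1.responseText + "<br>";
    }
}

function handleSecondResponse() {
    if (xhr2.readyState == 4 && xhr2.status == 200) {
        document.getElementById("results").innerHTML += xhr2.responseText + "<br>";
    }
}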

The asynchronous Ajax requests are then made to race2.php and are processed by the following code, which can handle two different Ajax work requests:



This PHP script handles the two request types differently, and creates the race condition by having the second request type, req1, set the session data to ’’. (In a real world application, you might have accidentally had this request set some value you thought was meaningful.)

If you install the two files race1.php and race2.php on your server, and then load race1.php into your browser, you will periodically see that the test string is set after the page is completely loaded, and other times it will be “(empty)”, indicating that the second Ajax request has clobbered the value.

Now that we are aware of this problem and how it can manifest itself, the next question is, of course, how do we solve it? Unfortunately, I think this is one of those problems best solved by avoiding it. Building logic into our web application to lock the threads of execution (i.e. individual requests) would be prohibitively expensive and eliminate much of the fun and many of the benefits of asynchronous requests via Ajax. Instead, we will avoid modifying session data when we are executing multiple simultaneous requests.

Please note that this is much more specific than saying simply that we will avoid modifying session data during any Ajax request. Indeed, that would be a disaster: in a Web 2.0 application, we are most likely using Ajax for form submission and updating the state of the user data (i.e. session data) as the data are processed and we respond to the changes. However, for those requests we are using to update parts of pages dynamically, we should be careful to avoid modifying the session data, or at least do so in a way that none of the other requests will see changes in their results depending on these session data.

Ajax requests and session data do not have to be problematic when used together: with a little bit of care and attention, we can write web applications that are powerful, dynamic, and not plagued by race condition-type bugs.

Optimize your Page Load Time

It is widely accepted that fast-loading pages improve the user experience. In recent years, many sites have started using AJAX techniques to reduce latency. Rather than making a round trip through the server to retrieve a completely new page with every click, the browser can often either alter the layout of the page instantly or fetch a small amount of HTML, XML, or javascript from the server and alter the existing page. In either case, this significantly decreases the amount of time between a user click and the browser finishing rendering the new content.

However, for many sites that reference dozens of external objects, the majority of the page load time is spent in separate HTTP requests for images, javascript, and stylesheets. AJAX probably could help, but speeding up or eliminating these separate HTTP requests might help more, yet there isn't a common body of knowledge about how to do so.

While working on optimizing page load times for a high-profile AJAX application, I had a chance to investigate how much I could reduce latency due to external objects. Specifically, I looked into how the HTTP client implementation in common browsers and characteristics of common Internet connections affect page load time for pages with many small objects.

I found a few things to be interesting:

  • IE, Firefox, and Safari ship with HTTP pipelining disabled by default; Opera is the only browser I know of that enables it. Without pipelining, each request has to be answered and its connection freed up before the next request can be sent. This incurs average extra latency of roughly the round-trip (ping) time to the user divided by the number of connections allowed; for example, a 100ms round trip spread over two connections adds about 50ms per object on average. And if your server has HTTP keepalives disabled, doing another TCP three-way handshake adds another round trip, doubling this latency.

  • By default, IE allows only two outstanding connections per hostname when talking to HTTP/1.1 servers or eight-ish outstanding connections total. Firefox has similar limits. Using up to four hostnames instead of one will give you more connections. (IP addresses don't matter; the hostnames can all point to the same IP.)

  • Most DSL or cable Internet connections have asymmetric bandwidth, at rates like 1.5Mbit down/128Kbit up, 6Mbit down/512Kbit up, etc. Ratios of download to upload bandwidth are commonly in the 5:1 to 20:1 range. This means that for your users, a request takes the same amount of time to send as it takes to receive an object of 5 to 20 times the request size. Requests are commonly around 500 bytes, so this should significantly impact objects that are smaller than maybe 2.5k to 10k. This means that serving small objects might mean the page load is bottlenecked on the users' upload bandwidth, as strange as that may sound.

Using these, I came up with a model to guesstimate the effective bandwidth of users of various flavors of network connections when loading various object sizes. It assumes that each HTTP request is 500 bytes and that the HTTP reply includes 500 bytes of headers in addition to the object requested. It is simplistic and only covers connection limits and asymmetric bandwidth, and doesn't account for the TCP handshake of the first request of a persistent (keepalive) connection, which is amortized when requesting many objects from the same connection. Note that this is best-case effective bandwidth and doesn't include other limitations like TCP slow-start, packet loss, etc. The results are interesting enough to suggest avenues of exploration but are no substitute for actually measuring the difference with real browsers.
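
The model itself is not reproduced here, but a back-of-the-envelope version of it could be computed along the lines of the sketch below (an illustration of the stated assumptions -- 500-byte requests, 500 bytes of reply headers, shared asymmetric bandwidth, a fixed number of connections, no pipelining -- not the model actually used for the graphs):

<?php
// Rough effective-bandwidth guesstimate (illustrative only): N connections
// each fetch one object per "round"; a round costs one round trip plus the
// time to push N requests upstream and pull N replies downstream over
// shared, asymmetric bandwidth.
function effective_bandwidth($objectBytes, $downBps, $upBps, $rttSeconds,
                             $connections = 2, $requestBytes = 500,
                             $replyHeaderBytes = 500)
{
    $uploadTime   = ($connections * $requestBytes * 8) / $upBps;
    $downloadTime = ($connections * ($objectBytes + $replyHeaderBytes) * 8) / $downBps;
    $roundTime    = $rttSeconds + $uploadTime + $downloadTime;

    // Useful payload bits delivered per second of round time.
    return ($connections * $objectBytes * 8) / $roundTime;
}

// Example: 1.5Mbit down / 384Kbit up, 100ms away, two connections, 10k objects.
printf("%.0f bits/sec effective\n",
       effective_bandwidth(10 * 1024, 1500000, 384000, 0.1));
?>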

To show the effect of keepalives and multiple hostnames, I simulated a user on a connection offering 1.5Mbit down/384Kbit up who is 100ms away with 0% packet loss. This roughly corresponds to medium-speed ADSL on the other side of the U.S. from your servers. Shown here is the effective bandwidth while loading a page with many objects of a given size, where effective bandwidth is defined as total object bytes received divided by the time to receive them:

[1.5megabit 100ms graph]

Interesting things to note:

  • For objects of relatively small size (the left-hand portion of the graph), you can see from the empty space above the plotted line how little of the user's downstream bandwidth is being used, even though the browser is requesting objects as fast as it can. This user has to be requesting objects larger than 100k before he's mostly filling his available downstream bandwidth.

  • For objects under roughly 8k in size, you can double his effective bandwidth by turning keepalives on and spreading the requests over four hostnames. This is a huge win.

  • If the user were to enable pipelining in his browser (such as setting Firefox's network.http.pipelining in about:config), the number of hostnames we use wouldn't matter, and he'd make even more effective use of his available bandwidth. But we can't control that server-side.

Perhaps more clearly, the following is a graph of how much faster pages could load for an assortment of common access speeds and latencies with many external objects spread over four hostnames and keepalives enabled. Baseline (0%) is one hostname and keepalives disabled.

[Speedup of 4 hostnames and keepalives on]

Interesting things from that graph:

  • If you load many objects smaller than 10k, both local users and ones on the other side of the world could see substantial improvement from enabling keepalives and spreading requests over 4 hostnames.

  • There is a much greater improvement for users further away.

  • This will matter more as access speeds increase. The user on 100meg ethernet only 20ms away from the server saw the biggest improvement.

One more thing I examined was the effect of request size on effective bandwidth. The above graphs assumed 500 byte requests and 500 bytes of reply headers in addition to the object contents. How does changing that affect performance of our 1.5Mbit down/384Kbit up and 100ms away user, assuming we're already using four hostnames and keepalives?

[Effective bandwidth at various request sizes]

This shows that at small object sizes, we're bottlenecked on the upstream bandwidth. Having the browser send larger requests (such as ones laden with lots of cookies) slows things down by up to 40% in the worst case for this user.

As I've said, these graphs are based on a simulation and don't account for a number of real-world factors. But I've unscientifically verified the results with real browsers on real net and believe them to be a useful gauge. I'd like to find the time and resources to reproduce these using real data collected from real browsers over a range of object sizes, access speeds, and latencies.

Tips to reduce your page load time

After you gather some page-load times and effective bandwidth for real users all over the world, you can experiment with changes that will improve those times. Measure the difference and keep any that offer a substantial improvement.

Try some of the following:

  • Turn on HTTP keepalives for external objects. Otherwise you add an extra round trip for every HTTP request. If you are worried about hitting global server connection limits, set the keepalive timeout to something short, like 5-10 seconds. Also look into serving your static content from a different webserver than your dynamic content. Having thousands of connections open to a stripped-down static-file webserver might take only around 10 megs of RAM in total, whereas your main webserver might easily eat 10 megs of RAM per connection.

  • Load fewer external objects. Due to request overhead, one bigger file just loads faster than two smaller ones half its size. Figure out how to globally reference the same one or two javascript files and one or two external stylesheets instead of many; if you have more, try preprocessing them when you publish them. If your UI uses dozens of tiny GIFs all over the place, consider switching to a much cleaner CSS-based design which probably won't need so many images. Or load all of your common UI images in one request using a technique called "CSS sprites".

  • If your users regularly load a dozen or more uncached or uncacheable objects per page, consider evenly spreading those objects over four hostnames. This usually means your users can have 4x as many outstanding connections to you. Without HTTP pipelining, this results in their average request latency dropping to about 1/4 of what it was before.

    When you generate a page, evenly spreading your images over four hostnames is most easily done with a hash function, like MD5. Rather than having all image tags load objects from http://static.example.com/, create four hostnames (e.g. static0.example.com, static1.example.com, static2.example.com, static3.example.com) and use two bits from an MD5 of the image path to choose which of the four hosts you reference in the tag (see the PHP sketch after this list). Make sure all pages consistently reference the same hostname for the same image URL, or you'll end up defeating caching.

    Beware that each additional hostname adds the overhead of an extra DNS lookup and an extra TCP three-way handshake. If your users have pipelining enabled or a given page loads fewer than around a dozen objects, they will see no benefit from the increased concurrency and the site may actually load more slowly. The benefits only become apparent on pages with larger numbers of objects. Be sure to measure the difference seen by your users if you implement this.

  • Possibly the best thing you can do to speed up pages for repeat visitors is to allow static images, stylesheets, and javascript to be unconditionally cached by the browser. This won't help the first page load for a new user, but can substantially speed up subsequent ones.

    Set an Expires header on everything you can, with a date days or even months into the future (see the PHP sketch after this list). This tells the browser it is okay not to revalidate on every request, which would otherwise add latency of at least one round trip per object per page load for no reason.

    Instead of relying on the browser to revalidate its cache, if you change an object, change its URL. One simple way to do this for static objects, if you have staged pushes, is to have the push process create a new directory named by the build number and to teach your site to always reference objects from the current build's base URL. (Instead of a path like /images/logo.gif you'd reference something like /build/1234/images/logo.gif; when you do another build next week, all references change to /build/1235/images/logo.gif.) This also nicely solves problems with browsers sometimes caching things longer than they should -- since the URL changed, they think it is a completely different object.

    If you conditionally gzip HTML, javascript, or CSS, you probably want to add a "Cache-Control: private" if you set an Expires header. This will prevent problems with caching by proxies that won't understand that your gzipped content can't be served to everyone. (The Vary header was designed to do this more elegantly, but you can't use it because of IE brokenness.)

    For anything where you always serve the exact same content when given the same URL (e.g. static images), add "Cache-Control: public" to give proxies explicit permission to cache the result and serve it to different users. If a local cache has the content, it is likely to have much less latency than you; why not let it serve your static objects if it can?

    Avoid the use of query params in image URLs, etc. At least the Squid cache refuses to cache any URL containing a question mark by default. I've heard rumors that other things won't cache those URLs at all, but I don't have more information.

  • On pages where your users are often sent the exact same content over and over, such as your home page or RSS feeds, implementing conditional GETs can substantially improve response time and save server load and bandwidth in cases where the page hasn't changed.

    When serving static files (including HTML) off of disk, most webservers will generate Last-Modified and/or ETag reply headers for you and make use of the corresponding If-Modified-Since and/or If-None-Match mechanisms on requests. But as soon as you add server-side includes, dynamic templating, or code that generates your content as it is served, you are usually on your own to implement these.

    The idea is pretty simple: when you generate a page, you give the browser a little extra information about exactly what was on the page you sent. When the browser asks for the same page again, it gives you this information back. If it matches what you were going to send, you know the browser already has a copy, and you can send a much smaller 304 (Not Modified) reply instead of the contents of the page again. And if you are clever about what information you include in an ETag, you can usually skip the most expensive database queries that would've gone into generating the page (see the PHP sketch after this list).

  • Minimize HTTP request size. Often cookies are set domain-wide, which means they are also unnecessarily sent by the browser with every image request from within that domain. What might've been a 400 byte request for an image could easily turn into 1000 bytes or more once you add the cookie headers. If you have a lot of uncached or uncacheable objects per page and big, domain-wide cookies, consider using a separate domain to host static content, and be sure to never set any cookies in it.

  • Minimize HTTP response size by enabling gzip compression for HTML and XML for browsers that support it. For example, the 17k document you are reading takes 90ms of the full downstream bandwidth of a user on 1.5Mbit DSL, but only 37ms when compressed to 6.8k. That's 53ms off the full page load time for a simple change. If your HTML is bigger and more redundant, you'll see an even greater improvement.

    If you are brave, you could also try to figure out which set of browsers will handle compressed Javascript properly. (Hint: IE4 through IE6 ask for their javascript compressed, then break badly if you send it that way.) Or look into Javascript obfuscators that strip out whitespace, comments, etc., and usually get it down to 1/3 to 1/2 its original size.

  • Consider locating your small objects (or a mirror or cache of them) closer to your users in terms of network latency. For larger sites with a global reach, either use a commercial Content Delivery Network, or add a colo within 50ms of 80% of your users and use one of the many available methods for routing user requests to your colo nearest them.

  • Regularly use your site from a realistic net connection. Convincing the web developers on my project to use a "slow proxy" that simulates bad DSL in New Zealand (768Kbit down, 128Kbit up, 250ms RTT, 1% packet loss) rather than the gig ethernet a few milliseconds from the servers in the U.S. was a huge win. We found and fixed a number of usability and functional problems very quickly.

    To implement the slow proxy, I used the netem and HTB kernel modules available in the Linux 2.6 kernel, both of which are set up with the tc command-line tool. These offer the most accurate simulation I could find, but are definitely not for the faint of heart. I've not used them myself, but supposedly Tamper Data for Firefox, Fiddler for Windows, and Charles for OS X can all rate-limit and are probably easier to set up, though they may not simulate latency properly.

  • Use Google's Load Time Analyzer extension for Firefox from a realistic net connection to see a graphical timeline of what it is doing during a page load. This shows where Firefox has to wait for one HTTP request to complete before starting the next one and how page load time increases with each object loaded. The Tamper Data extension can offer a similar graph if you make use of its hidden graphing feature. And the Safari team offers a tip on a hidden feature in their browser that offers some timing data too.

    Or if you are familiar with the HTTP protocol and TCP/IP at the packet level, you can watch what is going on using tcpdump, ngrep, or ethereal. These tools are indispensable for all sorts of network debugging.

  • Try benchmarking common pages on your site from a local network with ab, which comes with the Apache webserver. If your server is taking longer than 5 or 10 milliseconds to generate a page, you should make sure you have a good understanding of where it is spending its time.

    If your latencies are high and your webserver process (or CGI if you are using that) is eating a lot of CPU during this test, it is often a result of using a scripting language that needs to recompile your scripts with every request. Software like eAccelerator for PHP, mod_perl for perl, mod_python for python, etc can cache your scripts in a compiled state, dramatically speeding up your site. Beyond that, look at finding a profiler for your language that can tell you where you are spending your CPU. If you improve that, your pages will load faster and you'll be able to handle more traffic with fewer machines.

    If your site relies on doing a lot of database work or some other time-consuming task to generate the page, consider adding server-side caching of the slow operation. Most people start with writing a cache to local memory or local disk, but that starts to fall down if you expand to more than a few web server machines. Look into using memcached, which essentially creates an extremely fast shared cache that's the combined size of the spare RAM you give it off of all of your machines. It has clients available in most common languages.

  • (Optional) Petition browser vendors to turn on HTTP pipelining by default on new browsers. Doing so will remove some of the need for these tricks and make much of the web feel much faster for the average user. (Firefox has this disabled supposedly because some proxies, some load balancers, and some versions of IIS choke on pipelined requests. But Opera has found sufficient workarounds to enable pipelining by default. Why can't other browsers do similarly?)
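
To make a few of the server-side suggestions above more concrete, here are some minimal PHP sketches (referenced from the list above). The hostnames, cache lifetimes, and helper-function names are illustrative assumptions rather than code from this article:

<?php
// 1. Spread image references over four hostnames: pick one of four hosts
//    from two bits of an MD5 of the image path, so the same path always
//    maps to the same hostname and caching is not defeated.
function image_host($path)
{
    $shard = hexdec(substr(md5($path), 0, 1)) % 4;   // a stable value 0..3
    return "http://static{$shard}.example.com" . $path;
}
// Usage in a template: echo '<img src="' . image_host('/images/logo.gif') . '">';

// 2. Far-future caching headers for static objects that never change under
//    a given URL (pair this with changing the URL whenever the content changes).
function send_static_cache_headers($days = 90)
{
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $days * 86400) . ' GMT');
    header('Cache-Control: public, max-age=' . ($days * 86400));
}

// 3. Conditional GET for generated pages: derive an ETag from whatever
//    determines the page's content and answer 304 when the browser
//    already has that version.
function conditional_get($contentFingerprint)
{
    $etag = '"' . md5($contentFingerprint) . '"';
    header('ETag: ' . $etag);
    if (isset($_SERVER['HTTP_IF_NONE_MATCH'])
        && trim($_SERVER['HTTP_IF_NONE_MATCH']) == $etag) {
        header('HTTP/1.1 304 Not Modified');
        exit;   // skip regenerating and resending the page body
    }
}

// 4. When conditionally gzipping generated HTML, mark it private so shared
//    proxies do not serve the compressed copy to clients that cannot use it.
function send_gzipped_private_html()
{
    header('Cache-Control: private');
    ob_start('ob_gzhandler');   // compresses only if the client accepts gzip
}
?>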

The above list covers improving the speed of communication between browser and server and can be applied generally to many sites, regardless of what web server software they use or what language the code behind your site is written in. There is, unfortunately, a lot that isn't covered.

While the tips above are intended to improve your page load times, a side benefit of many of them is a reduction in the server bandwidth and CPU needed for the average page view. Reducing your costs while improving the user experience seems like it should be worth spending some time on.

-- Amol Kulkarni