Thursday, December 7, 2006

How the Web Works?

The anatomy of a single browser selection

If you have browsed the World Wide Web even once, you are familiar with the basic method for getting from one place to another. A Web page presents you with a number of items you may select. Once you have selected something, you leave the current Web page and enter the one that you just selected. If you have a text-based browser, such as Lynx, you made this selection by choosing a number from a menu of items. If you have a graphical browser, such as Mosaic or Netscape, you made this selection by clicking on an underlined item, or perhaps an icon or other graphic.

Suppose you are happily browsing when you come upon an item labeled Sunsite, A Vast Compendium of Information. This certainly sounds intriguing, so you click on Sunsite (hereafter I will assume that you have a graphical browser; the principles are the same for text-based systems). After a few seconds, or several hours, depending on network traffic, how you are connected to the Internet, and a number of other factors, the Web page appears.

How did this actually happen? How did your index finger get such power? In this article we will examine what happens as a result of this single mouse click, and look at all the steps which contribute to making navigation on the World-Wide Web possible.

Anatomy of the URL

While you were sitting there with your finger poised over Sunsite, your browser probably displayed something similar to http://sunsite.unc.edu/. Mosaic, for example, displays this in its status bar at the bottom of the window. This is the infamous Uniform Resource Locator, or URL, which describes the location of the page. URLEvery link to another Web page has a URL. URLs themselves are worthy of an entire article, so I won't attempt to describe all of their nuances and quirks here. For our purposes we can think of a URL as being composed of four parts: the protocol, the hostname, the port, and the path (or document name), as shown in the figure.

The hostname says which machine to contact; the port number and the protocol say how to connect to that machine, and the path says what to get once connected. The protocol is located just before the first colon, and is http, the Hypertext Transfer Protocol, in this case. The hostname is located between the double slash (//) and the next separator, which will be another colon or slash. In this case the hostname is sunsite.unc.edu. If a port were given, the hostname would have been followed by a colon, a number, and then a slash. gopher://news.csie.nctu.edu.tw:4870/7gonnrp%20-g%20 is a gopher URL with a port number, namely 4870 (you might want to brush up on your Chinese before trying this URL). The final part of the URL, the path, is given immediately after the hostname or port. In our URL the path is simply /, known as the root. The format of the path part of a URL depends on the protocol being used, among other things, and can become quite complicated.

Finding a Host

The browser uses the various parts of the URL to find the host, connect to it, retrieve the path portion of the URL, and display the page. The first step is to convert the hostname into an Internet address. The hostname may be descriptive (that is its purpose), but it is also worthless in terms of making an actual connection. Converting a hostname to an address is often referred to as the "address resolution phase". The status bar of your browser may display something like "looking up hostname" during this process.

Addresses such as sunsite.unc.edu are part of a hierarchical addressing scheme known as the Domain Naming System, or DNS. Each of the components in the address is one level of the hierarchy. The rightmost part, "edu", is the top-level domain, and consists of all educational institutions on the Internet. The next level is "unc", a designation for the University of North Carolina, which has registered itself as a member of the edu domain. The final level is "sunsite", which is the name of an actual machine at UNC.

In order to resolve the address, your browser must construct a DNS query and submit it to a DNS server. This process also happens hierarchically. Somewhere very close to you (on your Internet Service Provider, or at your corporate headquarters, for example) there is almost certainly a local DNS Server. This DNS server probably won't know the address of sunsite.unc.edu, but it will know the address of a more authoritative DNS server. Eventually, some DNS server will actually known the answer, which will then make its way back to your browser. From my site, for example, it took two hops to discover that sunsite.unc.edu is actually at 198.86.40.81. My local DNS server contacted the unc.edu server, ns.unc.edu, which provided the numerical Internet address.

The Internet address is only half the equation, however. A fully qualified Internet address consists of a host address and a port number. If the port number is provided as part of the URL, then we have a fully qualified address already. If it is not, then the browser will use the well-known port associated with the protocol. All of the common protocols which the Web supports, including http, ftp, gopher, news (NNTP), WAIS, and Telnet, have well-known port numbers. For http this port number is 80. This information is usually stored in a file that is part of the local machine's network configuration. On Unix machines, for example, this information is stored in a file known as /etc/services.

The browser is now ready to connect to the host. It does this by building a connection request using TCP (some browsers use UDP, but that's another story). If you were to look at the browser code itself, you would see that once it has the address and port number, it makes the following system calls: socket(), to open a raw network endpoint; bind(), to bind that endpoint to the local address (the address of the machine running the browser), and finally connect(), to specify the remote address and try to talk to it. Inside the operating system the guts of connect() builds a TCP packet, and then hands it to the lower-level IP layer, which wraps that packet with IP protocol. This TCP/IP packet is then given to the lowest layer (usually a device driver), which often wraps it in yet another layer of protocol. On Ethernet this would be the Ethernet frame. After all this happens, the connection packet is unleashed on the Internet and goes off in search of the sunsite host.

However, this may not happen quite as directly as you might expect. Many machines are not connected directly to the Internet. Instead, their connection is mediated by one or more intermediaries, known as proxies, gateways, or firewalls, depending on their function. Firewalls, in particular, are becoming more and more common as concerns about security increase. Outgoing and incoming packets thus might be going through one or more stages of intermediate software, often designed to permit a limited set of outgoing connections and an even more limited set of incoming ones.

The first stop for the connection request packet might be an application such as SOCKS, which will peek inside the packet, look at the port number and protocol, and approve or disapprove it based on their values. This often explains why URLs based on specific protocols will fail while all others succeed. If you are behind a firewall, and your security administrator has decreed that Telnet connections may not cross that firewall, URLs that use Telnet will always fail during the connection phase. Firewalls can even cause DNS queries to fail, although few organizations are that paranoid.

Once the connection packet actually leaves your local site, it still won't go directly to sunsite.unc.edu. Unless you are extremely lucky and are connected directly to UNC's campus-wide network, it will instead pass through a number of other sites, each of which will try to route your packet to a site that is closer to the destination. Routes are born, expand, contract, and die all within the scope of milliseconds, so that it is quite possible that two identical packets sent by the same browser as quickly as possible will travel two vastly different routes to the same destination. If you have the traceroute utility, you can have hours of fun (well, maybe minutes of fun) tracing the route between your machine and another one across the globe, over and over again.

Going All the Way

Let's be optimistic and assume that your packet actually reaches sunsite.unc.edu. If that host is up, then its operating system will intercept and decode the packet. It will ask itself many penetrating questions, such as "Do I want to accept any connections right now?"; "Am I willing to accept a connection from this particular host?"; and, most important, "Is there a user application that has declared itself willing to deal with packets on this particular port?"

Such applications are typically called servers. If the http server, httpd, is running on sunsite, then it will have registered itself as a potential recipient for packets on http's well-known port. If httpd has available connections, it will accept our browser's connection request, and the operating system on sunsite will generate an appropriate connection response packet. If httpd is overloaded or otherwise unavailable, we will get a negative response, resulting in the dreaded "connection refused" message on most browsers.

In each case, the reply will work its way back through the network to the originating machine. An IP packet always includes the sender's address as well as the destination address, so sunsite will not have to do a DNS lookup in order to generate its reply. Some modern implementations even record the route taken by a packet from one machine to another, so that a response packet need not discover the return route for itself. It is not guaranteed that the machines over which the response traverses on its way back to us is the same as the outgoing route the connection request used, however. It is often quite unlikely.

Once the browser receives a successful connection reply, it is ready to actually request the path portion of the URL. In our example, that is the top-level path /. Since it is using the http protocol, it must ask for this path using that protocol. The http protocol is a text-based conversational protocol, in which the requesting client (our browser) sends out a series of strings in 8-bit extended ASCII (Iso-Latin-1 to be precise). Our browser will send out a GET request, indicating the path it wishes to receive, and a series of Accept qualifiers, indicating the possible formats it wishes to receive it in. This request is converted into a TCP packet containing the http protocol, wrapped in IP and more, and wends it way across the Internet again, to sunsite. Many browsers will display a progress message, such as "Sent HTTP request; waiting for response."

Sunsite's OS hands the path request to httpd, which then verifies that it is indeed a valid http request, and attempts to satisfy it. Our path is a special case, since we have asked for the root, rather than asking for a specific HTML document, such as exhibit.html (a multimedia link on sunsite). The traditional response to such a root request is to look for a document named index.html and send that.

It is possible, but very unlikely, that we would get back a listing of the top-level server directory, instead. The httpd server does not simply send the contents of this HTML document, however. It wraps the response in http, sending back a status line, header information, and the actual response. The status line indicates its success or failure in getting the requested path. If the request fails, the server tries to be helpful and tell you why, such as "that URL has moved to . . ."

On success, the status line is followed by one or more lines of header information describing what is to come. One header line that is always sent is the Content-Type line, which will take the form Content-Type: text/html in our case. This tells the browser the actual format of the reply data, which could have been in any number of formats.

Even though this conversation between our browser and sunsite's httpd server is in the http protocol, several of these headers are derived from the MIME (Multipurpose Internet Mail Extensions) system. MIME was created independently of http as a uniform method for moving non-ASCII data through e-mail. The http protocol adopted it as a way of doing the same thing for Web content.

Whew!

The final portion of the server's reply is the actual HTML document, index.html, a portion of which is shown below. The httpd server gives its http packet to Sunsite's Unix, which flavors it with TCP, IP, and an Ethernet frame, and sends it back to our browser. It also closes its end of the connection. This is because http was designed to be a stateless protocol, in which each transaction would be a complete exchange.

This design decision has profound implications on how the Web works. Since TCP connections are relatively precious (most systems allow only a few dozen to be simultaneously active in any one application), this helps to conserve system resources. However, it also means that if the browser needs more information, it must once again open a connection, send a request, and then wait for the reply. In our example this will happen repeatedly. When our local browser receives the contents of index.html and begins to parse it, in order to build the actual graphics for the page, it will discover these dangling references, and go back through the connect-request-reply steps for each piece of missing information. The status bar of the browser will flash a series of messages while this is happening.

When the browser finally has resolved all references and fully parsed the contents of index.html, it uses the various HTML directives, image data, and other content to construct the page. In the case of HTML, this is relatively easy, since the HTML language was designed to be nicely viewable. If we had requested a URL based on another protocol, such as FTP (File Transfer Protocol), the browser would have had to do more work. Older protocols such as these were not designed with graphics in mind, so that the browser must construe the textual replies from an ftp server and build a reasonable Web page from them.

As you can see, a very large number of software components, representing many diverse protocols that were not originally intended to work together, had to collaborate to execute the one mouse click that brought up the Sunsite Web page.

No comments: