Open Source Technology: June 2007

Wednesday, June 27, 2007

Websites as graphs

Everyday, we look at dozens of websites. The structure of these websites is defined in HTML, the lingua franca for publishing information on the web. Your browser's job is to render the HTML according to the specs (most of the time, at least). You can look at the code behind any website by selecting the "View source" tab somewhere in your browser's menu.

HTML consists of so-called tags, like the A tag for links, IMG tag for images and so on. Since tags are nested in other tags, they are arranged in a hierarchical manner, and that hierarchy can be represented as a graph. I've written a little app that visualizes such a graph, and here are some screenshots of websites that I often look at.

I've used some color to indicate the most used tags in the following way:

blue: for links (the A tag)
red: for tables (TABLE, TR and TD tags)
green: for the DIV tag
violet: for images (the IMG tag)
yellow: for forms (FORM, INPUT, TEXTAREA, SELECT and OPTION tags)
orange: for linebreaks and blockquotes (BR, P, and BLOCKQUOTE tags)
black: the HTML tag, the root node
gray: all other tags

Here I post a couple of screenshots,

Update: Here it is: http://www.aharef.info/static/htmlgraph/

cnn.com

CNN has a complicated but typical tag structure of a portal: Lots of links, lots of images. Similar use of divs and tables for layouting purposes. (1316 tags)

boingboing.net

boingboing, my favorite blog, has a very simple tag structure: there seems to be one essential container that contains all other tags, essentially links (lots!), images, and tags to layout the text. A typical content driven website. (1056 tags)

apple.com

As always, simplicity rules at Apple's website. A few images and links, that's it. Note the large yellow cluster, representing a dropdown menu. (350 tags)

yahoo.com

Yahoo seems to be stuck in the old days of HTML style: most of the tags are tables, used for layouting - no divs. Very uncommon these days. (952 tags)

wired.com

The complete opposite of yahoo - this site uses almost no tables at all, only divs (green). It's nice to see how the div tags are holding the other elements, like links and images, together. (454 tags)

msn.com

Surprisingly, at least to me, Microsoft's portal is very much div-driven. Also of note is it's very scarce use of images. (633 tags)

google.com

Today, google is everywhere, but if somebody had asked me 5 years ago why I was using google, and wanted a visual answer, here it is (88 tags):

I finish with two of my own projects:

aharef.info

What can I say? I like it ;-) No tables, lots of links, simple structure. A typical Movable Type site, I guess. (372 tags)

onethousandpaintings.com

My personal art project. Although I programmed the site myself, I'm surprised by the simplicity of its tag structure. It shows that you can make beautiful websites with just a few tags ;-) (88 tags)

That's it. You can play around with the app, and take a fresh look at websites - here's the applet.

Friday, June 22, 2007

Search Engine Guide

Search Engines - what are they, and how do they work?Simply put, a search engine is a tool used to help find information on the Internet. No one search engine has examined or indexed the entire World Wide Web. Each only contains a partial subset of what is available and has its own way of gathering, classifying, and displaying this information to the user. Here are a few examples:

Indexing

Many of the most popular and comprehensive search engines on the Web are indexing engines - also known as crawlers or spiders. They get these names due to their particular way of finding information on the Internet - they have a program (often referred to as a "bot") that scans or "crawls" through Web pages, classifying and indexing them based on a set of pre-determined criteria. The weight given to these criteria, which may include links to your page from other sites, keywords, their positioning on a page and meta-tags, depends upon the individual indexing engine, and makes up their ranking algorithm. The information gathered during the crawling process is placed into a database, called an "index", which is then searched every time you enter a keyword query at their site. When you perform a search at an indexing engine, then, you're not actually querying the entire Web, but the portion that they have examined and included in their database.
Indexing search engines are best to use for hard to find information or very specific data, as they search through a wide and varied database of sites, returning many results. If your query is too broad, however, you risk getting an overwhelming amount of results (numbering in the hundreds of thousands or more!).
Examples of indexing search engines are: Google

, AltaVista

, and Gigablast

.

Directories

Directories are categorized groupings of sites, most often compiled and organized by human editors. They're organized into a series of categories and sub-categories, moving from the general to the specific. Each sub-category brings you to a list of additional sub-categories, until finally you reach a list of sites. While the quantity of results are usually much fewer than those returned by an indexing engine, their relevancy and quality are usually much higher.
For ease of use, most directories also have a search feature, which enables you to search through their listings - a word of caution, however: these search functions only search through the directories' categories and listings (i.e. titles, descriptions and URLs as they appear in their database) and not the sites themselves.
Directories are great to use when you don't know a lot about a subject, need help narrowing down a topic, or when you're looking for general information.
Examples of directories are: Yahoo

, The Open Directory

and LookSmart

Natural language

If you're a beginner to the Internet, or prefer to "ask" your questions (for example, "Why is the sky blue?" "What is the temperature of the sun?" etc.), rather than trying to formulate a keyword query, then a Natural-Language search engine is the way to go. These allow queries to be submitted in the form of a question, and then help you to narrow down your search by clarifying what it is you're looking for. Sometimes, they'll even provide the answer to your question directly on the search results page!
Example of natural language search engine: Subjex

and AnswerBus

.

"Pay" engines

With the increasing popularity of search engine advertising, paid inclusion and pay-for-placement services abound, and are offered by most major search engines. In a nutshell, these programs require payment in order to have your site listed with them.
Paid Inclusion Paid inclusion services require a fee in order to list a site in their database. It can take the form of a yearly fee for a directory listing, or can be a cost-per-click listing in an index - where the site owner pays every time someone clicks on their link. It could also be a combination of a flat fee and/or cost-per-click payment method (just to make things confusing!).
The most important thing to know about paid inclusion, however, is that placement or ranking within the search engines' results set is not guaranteed - i.e. a site may be included, but it will not receive preferential treatment. Some search engines that have paid inclusion programs still offer a free (slower) submission process, though these are sometimes reserved for non-commercial sites.
Examples of search engines with paid inclusion programs are: Yahoo

and Entireweb

.

Pay-for-placement

(or pay-per-click, cost-per-click) programs usually take the form of an auction-style environment in which site owners try to outbid each other to get their sites listed higher up in the results. Payment is in the form of a CPC (Cost-Per-Click) whereby the site owner pays a certain amount every time someone clicks on their link.
As pay-for-placement programs are more like advertising than search results, pay-for-placement engines no longer try to attract users to their own sites, but rather distribute their paid results to other search engines, to be displayed as "Sponsored Listings" above or alongside regular results.
Pay-for-placement results are best for when you're searching for something to purchase. The vast majority of listings are for retail sites or online services that are willing to pay for potential customers.
Examples of pay-for-placement programs include: Overture

, Google Adwords

, and Mamma Classifieds

Metasearch
Every time you type in a query at a metasearch engine, they search a series of other search sites at the same time, compile their results, and display them either by search engine employed or by integrating them in a uniform manner, eliminating duplicates, and resorting them according to relevance. It's like using multiple search engines, all at the same time.
By using a metasearch engine, you get a snapshot of the top results from a variety of search engines (including a variety of types of search engines), providing you with a good idea of what kind of information is available.
Meta-search engines are tolerant of imprecise search terms or inexact use of operators, and tend to return fewer results, but with a greater degree of relevance. They're best to use when you've got a general search, and don't know where to start - by providing you results from a series of sites, they help you to determine where to continue focusing your efforts (if this proves necessary). They also allow you to compare what kinds of results are available on different engine types (indexes, directories, pay-for-placement, etc), or to verify that you haven't missed a great resource provided by another site, other than your favorite search engine (acting as a backup). Overall, they're a great way to save time.
Examples of metasearch engines are: Mamma

, Copernic

and Dogpile

.
Additional note on metasearch sites: Because metasearch engines do not have their own database of sites, but rather pull their results from multiple outside databases, they cannot accept URL submissions.
Mamma.com, however, has created two programs in order to overcome this problem experienced by most metasearch sites! Please see Submit Your Site

for more information!

Hybrids
As the Web continues to grow, search engines are realizing that they cannot index or categorize the entire Internet. They have also realized that search is a business, and that in order to remain in existence, search engines need to be profitable. As a result, there is an increasing number of partnerships between search engines being made.
Some examples:

At present, MSN does not have its own search engine (though it is building one). MSN search results are currently a mix of Yahoo's indexing engine results, and Overture's paid listings.
Lycos results are provided by Looksmart, Yahoo's Inktomi, and the Open Directory, and they display Google's pay-for-placement Adwords.

AOL search results are powered entirely by Google (an indexing engine), and include Google's pay-for-placement program, Google Adwords
Mamma.com's "Sponsored Links" section is actually provided by Google's pay-for-placement program, Google Adwords.

These results, coming from a different source than the one you are actively searching are sometimes differentiated from each other - but sometimes they are not. It is important to always pay attention to these details and to know where your results are coming from For a chart of some of the major search engine relationships, please visithttp://www.bruceclay.com/searchenginerelationshipchart.htm

Wednesday, June 6, 2007

Testing Page Load Speed

One of the most problematic tasks when working on a Web browser is getting an accurate measurement of how long you're taking to load Web pages. In order to understand why this is tricky, we'll need to understand what exactly browsers do when you ask them to load a URL.

So what happens when you go to a URL like cnn.com? Well, the first step is to start fetching the data from the network. This is typically done on a thread other than the main UI thread.

As the data for the page comes in, it is fed to an HTML tokenizer. It's the tokenizer's job to take the data stream and figure out what the individual tokens are, e.g., a start tag, an attribute name, an attribute value, an end tag, etc. The tokenizer then feeds the individual tokens to an HTML parser.

The parser's job is to build up the DOM tree for a document. Some DOM elements also represent subresources like stylesheets, scripts, and images, and those loads need to be kicked off when those DOM nodes are encountered.

In addition to building up a DOM tree, modern CSS2-compliant browsers also build up separate rendering trees that represent what is actually shown on your screen when painting. It's important to note two things about the rendering tree vs. the DOM tree.

(1) If stylesheets are still loading, it is wasteful to construct the rendering tree, since you don't want to paint anything at all until all stylesheets have been loaded and parsed. Otherwise you'll run into a problem called FOUC (the flash of unstyled content problem), where you show content before it's ready.

(2) Image loads should be kicked off as soon as possible, and that means they need to happen from the DOM tree rather then the rendering tree. You don't want to have to wait for a CSS file to load just to kick off the loads of images.

There are two options for how to deal with delayed construction of the render tree because of stylesheet loads. You can either block the parser until the stylesheets have loaded, which has the disadvantage of keeping you from parallelizing resource loads, or you can allow parsing to continue but simply prevent the construction of the render tree. Safari does the latter.

External scripts must block the parser by default (because they can document.write). An exception is when defer is specified for scripts, in which case the browser knows it can delay the execution of the script and keep parsing.

What are some of the relevant milestones in the life of a loading page as far as figuring out when you can actually reliably display content?

(1) All stylesheets have loaded.
(2) All data for the HTML page has been received.
(3) All data for the HTML page has been parsed.
(4) All subresources have loaded (the onload handler time).

Benchmarks of page load speed tend to have one critical flaw, which is that all they typically test is (4). Take, for example, the aforementioned cnn.com. Frequently cnn.com is capable of displaying virtually all of its content at about the 350ms mark, but because it can't finish parsing until an external script that wants to load an advertisement has completed, the onload handler typically doesn't fire until the 2-3 second mark!

A browser could clearly optimize for only overall page load speed and show nothing until 2-3 seconds have gone by, thus enabling a single layout and paint. That browser will likely load the overall page faster, but feel literally 10 times slower than the browser that showed most of the page at the 300 ms mark, but then did a little more work as the remaining content came in.

Furthermore benchmarks have to be very careful if they measure only for onload, because there's no rule that browsers have to have done any layout or painting by the time onload fires. Sure, they have to have parsed the whole page in order to find all the subresources, and they have to have loaded all of those subresources, but they may have yet to lay out the objects in the rendering tree.

It's also wise to wait for the onload handler to execute before laying out anyway, because the onload handler could redirect you to another page, in which case you don't really need to lay out or paint the original page at all, or it could alter the DOM of the page (and if you'd done a layout before the onload, you'd then see the changes that the onload handler made happen in the page, such as flashy DHTML menu initialization).

Benchmarks that test only for onload are thus fundamentally flawed in two ways, since they don't measure how quickly a page is initially displayed and they rely on an event (onload) that can fire before layout and painting have occurred, thus causing those operations to be omitted from the benchmark.

i-bench 4 suffers from this problem. i-bench 5 actually corrected the problem by setting minimal timeouts to scroll the page to the offsetTop of a counter element on the page. In order to compute offsetTop browsers must necessarily do a layout, and by setting minimal timers, all browsers paint as well. This means i-bench 5 is doing an excellent job of providing an accurate assessment of overall page load time.

Because tests like i-bench only measure overall page load time, there is a tension between performing well on these sorts of tests and real-world perception, which typically involves showing a page as soon as possible.

A naive approach might be to simply remove all delays and show the page as soon as you get the first chunk of data. However, there are drawbacks to showing a page immediately. Sure, you could try to switch to a new page immediately, but if you don't have anything meaningful to show, you'll end up with a "flashy" feeling, as the old page disappears and is replaced by a blank white canvas, and only later does the real page content come in. Ideally transitions between pages should be smooth, with one page not being replaced by another until you can know reliably that the new page will be reasonably far along in its life cycle.

In Safari 1.2 and in Mozilla-based browsers, the heuristic for this is quite simple. Both browsers use a time delay, and are unwilling to switch to the new page until that time threshold has been exceeded. This setting is configurable in both browsers (in the former using WebKit preferences and in the latter using about:config).

When I implemented this algorithm (called "paint suppression" in Mozilla parlance) in Mozilla I originally used a delay of 1 second, but this led to the perception that Mozilla was slow, since you frequently didnt see a page until it was completely finished. Imagine for example that a page is completely done except for images at the 50ms mark, but that because you're a modem user or DSL user, the images aren't finished until the 1 second mark. Despite the fact that all the readable content could have been shown at the 50ms mark, this delay of 1 second in Mozilla caused you to wait 950 more ms before showing anything at all.

One of the first things I did when working on Chimera (now Camino) was lower this delay in Gecko to 250ms. When I worked on Firefox I made the same change. Although this negatively impacts page load time, it makes the browser feel substantially faster, since the user clicks a link and sees the browser react within 250ms (which to most users is within a threshold of immediacy, i.e., it makes them feel like the browser reacted more or less instantly to their command).

Firefox and Camino still use this heuristic in their latest releases. Safari actually uses a delay of one second like older Mozilla builds used to, and so although it is typically faster than Mozilla-based browsers on overall page load, it will typically feel much slower than Firefox or Camino on network connections like cable modem/modem/DSL.

However, there is also a problem with the straight-up time heuristic. Suppose that you hit the 250ms mark but all the stylesheets haven't loaded or you haven't even received all the data for a page. Right now Firefox and Camino don't care and will happily show you what they have so far anyway. This leads to the "white flash" problem, where the browser gets flashy as it shows you a blank white canvas (because it doesn't yet know what the real background color for the page is going to be, it just fills in with white).

So what I wanted to achieve in Safari was to replicate the rapid response feel of Firefox/Camino, but to temper that rapid response when it would lead to gratuitous flashing. Here's what I did.

(1) Create two constants, cMinimumLayoutThreshold and cTimedLayoutDelay. At the moment the settings for these constants are 250ms and 1000ms respectively.

(2) Don't allow layouts/paints at all if the stylesheets haven't loaded and if you're not over the minimum layout threshold (250ms).

(3) When all data is received for the main document, immediately try to parse as much as possible. When you have consumed all the data, you will either have finished parsing or you'll be stuck in a blocked mode waiting on an external script.

If you've finished parsing or if you at least have the body element ready and if all the stylesheets have loaded, immediately lay out and schedule a paint for as soon as possible, but only if you're over the minimum threshold (250ms).

(4) If stylesheets load after all data has been received, then they should schedule a layout for as soon as possible (if you're below the minimum layout threshold, then schedule the timer to fire at the threshold).

(5) If you haven't received all the data for the document, then whenever a layout is scheduled, you set it to the nearest multiple of the timed layout delay time (so 1000ms, 2000ms, etc.).

(6) When the onload fires, perform a layout immediately after the onload executes.

This algorithm completely transforms the feel of Safari over DSL and modem connections. Page content usually comes screaming in at the 250ms mark, and if the page isn't quite ready at the 250ms, it's usually ready shortly after (at the 300-500ms mark). In the rare cases where you have nothing to display, you wait until the 1 second mark still. This algorithm makes "white flashing" quite rare (you'll typically only see it on a very slow site that is taking a long time to give you data), and it makes Safari feel orders of magnitude faster on slower network connections.

Because Safari waits for a minimum threshold (and waits to schedule until the threshold is exceeded, benchmarks won't be adversely affected as long as you typically beat the minimum threshold. Otherwise the overall page load speed will degrade slightly in real-world usage, but I believe that to be well-worth the decrease in the time required to show displayable content.

Tuesday, June 5, 2007

Firefox Extensions for Web Developers

Firefox's openness and the Firefox plugin architecture means that there is little that you cannot find out about a web page with a Firefox add-on. I've tried a bunch of different Firefox extensions for web development. Here are the ones that I find most useful and that I use on a regular basis.

DOM Inspector

Yes, yes, it comes installed with Firefox, but lets not forget the basics. The DOM Inspector allows you to see what is actually going on in your web document. The DOM Inspector lets you browse DOM nodes, style sheets, or Java Script objects. You select a node by either drilling down, by searching, or by clicking on it. Although, the UI for selecting a node with your mouse is just plain lousy. Once you've chosen your subject, the DOM inspector can show you the box model information for that node, the style sheets associated with the node, the computed CSS styles, or the Javascript object.

Web Developer Extension

Chris Pederick's Web developer extension has been out for a long time and is the plugin I am most familiar with. This is really the swiss army knife of web developer tools. It is so feature packed that I am still finding new things that it does. Unfortunately, the UI is also so cluttered that I am still finding new things that it does.

This add-on can slice and dice a web page every which way. It can outline a variety DOM elements, for example drawing an outline around all block elements on a page. This can be nice for lining things up. The Display Line Guides option is also a good way to verify alignment, not to mention Display Ruler, or Display Page Magnifier for fine detail.

This extension has dozens of reports, each one geared toward diagnosing a particular kind of problem. Some of them are external, such as sending your URL to a validation service. Some are internal, such as showing a dump of all of the page's active cookies. Unfortunately, many of these option open up in a new tab, taking the focus off of the page that you are trying to work with. It can be hard to tell which options do this. There is an option for having the tabs open in the background, but this is not the default.

The View Style Information option is particularly nice. You can point to any element on the page and the extension will display the element tree along with ids and classes. If you click on an element, it will display only the style rules that apply to that element. This beats the drill down approach in the DOM inspector, although it doesn't show box model information or computed style information this way.

The web developer extension can change things as well as inspect them. You can go into a mode where you can edit your CSS or HTML in real time for immediate feedback. This is great for testing out small changes. For the PHP developer, the extension has a variety of options for manipulating cookies and forms. There are also a variety of ways to enable or disable certain elements on the page.

Install Web Developer Extension

Tamper Data

Tamper Data is live HTTP headers on steroids. Tamper data records the HTTP request headers and HTTP response headers for each request that the browser makes. Not only that, It allows you to "tamper" with the requests before they are sent out, editing headers or form values behind the scenes. Tamper data can present a graph of the requests involved with loading a web page. Tamper data is great for security testing and page loading performance tuning.

Install Tamper Data Extension

FireBug

FireBug, ah what can I say but wow! According to their web site:

Firebug integrates with Firefox to put a wealth of development tools at your fingertips while you browse. You can edit, debug, and monitor CSS, HTML, and JavaScript live in any web page.

Firebug has considerable overlap with the extensions I've mentioned so far. It doesn't necessarily duplicate all of their functions, but the ones it does, it does really well. It goes way beyond in some cases. There is really no point in me talking about Firebug's features, because the website already does such a good job at it. They've impressed this jaded old developer.

If you haven't tried this one yet, seriously, go get it right now.

Install FireBug Extension

ColorZilla

ColorZilla adds a small eyedropper tool to the bottom left corner of the window. You can use this tool to inspect colors on the current web site. Double clicking it brings up a color picker and some other color related tools.

Install ColorZilla Extension

Multiple Profiles

Ok, I lied. There are a few situations where I use FireFox for casual browsing. Some web sites just won't work with Safari, or don't work well with Safari. For these, I pull up Firefox. I don't want my casual browsing tools to clutter up my web development experience and I don't want my web development tools to clutter up my casual browsing experience. The solution is to create multiple profiles in FireFox. I have one for web development and another for normal surfing. I have safari ask me to select a profile on start up. This extra step would be annoying for a primary browser, but it doesn't seem too bad for a secondary browser.

Monday, June 4, 2007

Top 17 Search Innovations Outside Of Google

There are an abundance of new search engines (100+ at last count ) - each pioneering some innovation in search technology. Here is a list of the top 17 innovations that, in our opinion, will prove disruptivein the future. These innovations are classified into four types: Query Pre-processing; Information Sources; Algorithm Improvement; Results Visualization and Post-processing.
[Some of these innovations are present in various Google properties, but are either missing or available only in limited form in the main search page, as noted below.]

Query Pre-processing

Pre-processing Logos

The main purpose of this type of enhancement is the application of logic to try to divine the user's intent, and apply that knowledge to improve the query specification.

1. Natural Language Processing

This feature was initially pioneered by Ask.com. The best-known contemporary examples are Hakia and Powerset, both of which (in different ways) try to understand the semantics or meaning behind the user's query. The big difference from Google is that these engines consider "stopwords" to be significant - minor connecting words like by, for, about, of, in - unlike Google, which discards them.

2. Personal relevance (aka personalization)

It has long been understood that tailoring the query to the interests and requirements of a specific user provides a high level of relevance to search results. Google already supports this in their search engine, but only if you're logged in; many users are understandably reluctant to do so, since this has the potential to provide Google with a complete trail of their particular individual searches. [Even John Battelle agrees that the idea is a little scary, although Matt Cutts from Google disagrees.] What is needed is a way to provide personalization, albeit in an anonymous fashion. At a broader level, providing personalization across multiple sites would be even more useful. Collarity is one search engine that addresses this functionality.

3. Canned, specialized searches

This is a simple, yet powerful feature. The poster child for this approach is SimplyHired, a vertical search engine for jobs, which provides powerful, pre-set searches, such as "employers friendly to mature workers", "dog-friendly employers" and so on.

Information Sources

Pre-processing Logos

These enhancements focus on the underlying data sources: additional content types and support for restricting the set of data sources to improve the reliability of search results (reduced spam!).

4. New content types

It is a sign of the times that today's teens are just as comfortable exchanging photos and videos on their cell phones as text messages. On the web, rich media content is exploding - images, audio, video, TV - along with semantic information about content. Search engines increasingly need to support these additional content types to stay relevant. Some examples of search engines with rich content support are given below:
- Rich media search: Audio (odeo, podzinger ), Video (Youtube, truveo ), TV (Blinkx ), Images (Picsearch, Netvue )
- Specialized content search: Blogs (Technorati ), News (Topix ), Classifieds (oodle )

Of course, Google is heavily active in this area, with Google Blogsearch (blogs), Searchmash (images), Google Video, Google News and so on, so perhaps it's not fair to put this item into the list. What would be ideal, though, would be to integrate the different media results into a single search, as Searchmash already does ( Retrevo is another great example).

5. Restricted Data Sources

One of the biggest issues frustrating search users is spam. As Marketers get more savvy and use increasingly more aggressive SEO tactics, the quality of the results continues to degrade. (Google, as the most popular search engine, gets more than its fair share of targeting.) Restricting searches to a set of trusted sites eliminates this issue, although it also narrows the universe of content searched - this works well to provide a set of authenticated, high-quality results for certain types of searches; e.g. searching Wikipedia, National Geographic and science/educational sites when researching volcanoes for an elementary school project.

The best example of this approach is provided by A9.com, which provides content from a variety of sources and allows the user to make an explicit choice for every search. Google Co-op and Yahoo! Search Builder enable 3rd parties to build such a solution; Rollyo has been an early pioneer in this space!

6. Domain-specific search (Vertical Search)

By focusing on a single vertical, the search engine can offer a much better user experience, that's more comprehensive and tailored to a specific domain. There is an incredible variety of vertical search engines for various domains; for more information, check out Alex Iskold's article on the Read/WriteWeb or this overview on the Software Abstractions blog. [For context, Sramana Mitra's overview of online travel services provides a sense how Vertical Search fits into the overall picture.]

Algorithm Improvement

Pre-processing Logos

These enhancements focus on improving the underlying search algorithms to increase the relevance of results and provide new capabilities.

7. Parametric search

This type of search is closer to a Database query than a Text search; it answers an inherently different type of question. A parametric search helps to find problem solutions rather than text documents. For example, Shopping.com allows you to qualify clothing search with a material, brand, style or price change; job search sites like indeed let you constrain the matches to a given zip code; and GlobalSpec lets you specify a variety of parameters when searching for Engineering components (e.g. check out the parameters when searching for an industrial pipe ). Parametric search is a natural feature for Vertical Search engines.

Google has already incorporated this feature at a general level - such as the parameters on the Advanced Search page - but that waters down its usefulness. The most powerful use of this feature happens when additional parameters become available as you drill down further into standard search results or when you constrain the search to specific verticals.

8. Social Input

Yahoo!'s Bradley Horowitz believes social input to be a big differentiator of search technologies in the future (as does Microsoft ). Aggregating inputs from a large number of users enables the search engine to benefit from the wisdom of crowds to provide high quality search results. Of course, the results may not be valid if the individual inputs are not independent or can be gamed. Among the different offerings in this space, the service provided by del.icio.us seems likely to provide high-quality search capabilities based on this approach. [An earlier post offers a comparison among the different findability solutions based upon "crowd-sourcing".] Other reputation-based systems include StumbleUpon, Squidoo, About.com and of course, Wikipedia - all of which fall under the overall umbrella of findability, although they are not, strictly speaking, search engines.

Of course, Google's venerable PageRank algorithm is also implicitly based on social input. Since a large component of pagerank is based on the number and character of incoming links from different web sites, those incoming links act as implicit votes for gathering collective intelligence.

9. Human Input

This approach is included in the list for completeness. Search engines like ChaCha are experimenting with using human operators to respond to search queries. Arguably, Yahoo! Answers is another solution in this space, although the answers are provided by other users rather than by people working for the search engine.

It's difficult to see how the ChaCha-type approach would scale unless it somehow leverages community resources.

10. Semantic Search

Some of the exciting recent developments in search have to do with extracting intelligence from the Web at large. These applications are just the start - they convey the enormous potential of the Semantic Web. Early pioneers in this space include: Monitor110, which tries to extract actionable financial information from the web, that could be of interest to institutional investors; Spock, the "people-search" engine (currently in closed Beta), that plans to have a 100 million profiles in its database at launch; and Riya, a visual search engine, whose technology provides face and text recognition in photos.

11. Discovery support

Hand-in-hand with personalization and agent technology goes Discovery; this a holy grail for search. Although ad-hoc searches are the most popular at this time, most users have fairly stable interests over long periods of time. Wouldn't it be great if you could discover new sources of data - especially high-quality feeds - as they became available?

There are already some tentative steps in this direction, that combine search with the power of RSS - for example, you can already set up an RSS feed for the output of many types of searches in Google and Yahoo!. Bloglines already supports a "Recommended Feeds" feature - clearly, a feed reader should be in a great position to recommend new blogs or feeds in your area of interest, based upon the contents of your OPML file. Another player in this field is Aggregate Knowledge, which provides specialized services for retail and media by collecting information anonymously across multiple web sites. Overall, this will be an exciting area to watch in the future!

Results Visualization and Post-processing

Pre-processing Logos

These enhancements focus on improving the display of results and on "next steps" features offered post-query.

12. Classification, tag clouds and clustering

Search engines like Quintura and Clusty provide clustering of results based on tags/keywords. This allows the user not only to see the results themselves, but visualize the clusters of results and the relationships between them. This meta-information can help the user make sense of the results and discover new information on related topics.

13. Results visualization

Images are easier for the human brain to understand and remember than text results. At a more general level than clustering, specialized UI paradigms for displaying search results and the relationships between them can convey more meaning to the user and make the "big picture" easier to process. This approach works especially well within a specific context, such as a vertical search engine. The Visual Thesaurus from Thinkmap, VizServer from Inxight Software and HeatMaps from real estate search engine Trulia are examples of new ways to visualize information, although research in this field is still in its early stages. At a simpler level, HousingMaps is a mashup that displays the locations for houses available to rent or buy.

14. Results refinement and Filters

Often a natural next step after a search is to drill down into the results, by further refining the search. This is different from the "keyword-tweaking" that we've all gotten used to with Google; it's not just experimenting with keyword combinations to submit a new query, but rather, an attempt to actually refine the results set [akin to adding more conditions to the "where" clause of a SQL query] - this would allow users to narrow the results and converge on their desired solution.

Query refinement is a critical part of the search process, although it hasn't gotten the attention it deserves. One great example is the medical search engine Healia, which allows users to tweak health care search results by using demographic filters. This is important, because demographics, such as age, race and sex, can have a significant impact on search results for symptoms, diseases and the drugs used to treat them; there are also filters based on the complexity, source and type of results found.

Google has recently introduced a new button at the bottom of the Results page: "Search within results", which is a step in the right direction; results can also be refined using the existing OneBox widget and the relatively new Plusbox feature. Over time, we can expect this functionality to get increasingly more sophisticated.

15. Results platforms

As social media and online content become more popular, the number of choices available to a user to consume digital information continues to increase; accordingly, search engines must now support a variety of output platforms, including: web browser, mobile device, Rich Internet Applications, RSS, email and so on. With connectivity becoming ever more ubiquitous, users of the future are likely to connect to search engines from even more unconventional sources, for example: a TiVo system that searches for movies/programs of interest, a Nintendo system used to search for online games or even a refrigerator touch screen used for recipe search.

Some existing search engines already support additional platforms, beyond the standard web browser and mobile device. The web search engine Plazoo has provided RSS results feeds for a long time; Quintura started as a downloadable RIA application, and only now does this search engine provide a pure web interface.

The easiest way to provide support for many different result types, is to make available an open API, enabling third-party developers to create custom UIs for specialized target platforms. The Alexa Web Search platform was one of the first of these (although you use the API at your own risk ); other available APIs include oodle, zillow and trulia .

Google of course, provides APIs for several different properties - e.g. Google Base, Google Maps and the AJAX search API - although not for the main search engine. Handheld devices are supported via Google Mobile; Google Base and Blogsearch already provide RSS output.

16. Related Services

Technically, this is not exactly a part of the search function itself. However - once you finish a query, there is often a natural next step that follows the results of a search, e.g. after you search job openings, you want to apply to the postings you found. In terms of utility to the end user, this is an inherent part of overall search engine functionality.

Surprisingly, this feature has not been heavily exploited by many search engines, other than to display context-sensitive advertising. A perfect example of this approach is the interestingly-named specialized search engine: the web's too big , which enables the user to search for information on the web sites of public relations agencies based in the UK. These folks provide an interesting additional capability: users can enter details of their PR inquiry, and submit it directly to multiple PR agencies with a single click. Similarly, the real estate search engine Zillow provides the concept of Zestimate (an estimated home valuation computed by Zillow), as well as a Home Q&A feature. These types of additional services increase the value of search results offered to the user and make a site stickier.

Google provides additional services on some of its properties - such as the "Find businesses" option on Google Maps - but not in its main search engine.

17. Search agent

Closely related to the twin ideas of sustained, ongoing areas of interest and accessing search results as feeds, are search agents. Imagine a piece of software that functions as a kind of periodic search query, monitoring the web for new information on subjects of interest, collecting and collating the results, removing duplicates, and providing a regular update in summary form. This could work especially well for certain types of continuous searches that are important but not urgent: for example, monitoring for new jobs of interest as they become available, new houses for sale that fit within specific parameters, articles of clothing once they are marked down to a specific price, and so on.

Copernic is an interesting player in this space - the Copernic Agent can automatically run saved searches and provide summaries for new results, as well as track changes in web pages. The Information Agent Suite from Connotate Technologies mines the "deep web" and automates change detection. For more examples of search agents, Read/WriteWeb has an article that describes Allth.at , along with Swamii and Searchbots.net .

Conclusion

Clearly, Google is not going to take this onslaught lying down. Just as it has already introduced personalized search into its primary search engine, it will continue to integrate some of these other approaches into the mainstream as they become successful. For example, Vertical specialization is a powerful tool that Google is sure to use.

It is very likely that in the future, the simple "search box" on the Google front page will hide a variety of specialized search engines behind it. On the other hand, trying to cram in an increasing number of these sophisticated features has the potential to make the overall architecture for Google (or any mainstream web search engine) very complex and difficult to change, so the trade-offs will present an increasingly difficult challenge! In a separate article on the Software Abstractions blog, we take a look at the conceptual architecture for a mainstream search engine that incorporates most of these features.

References

Notable recent articles on "Google and Search Innovation"

Josh Kopelman/Redeye VC: Google - The next vertical search engine?
O'Reilly Rader (via Sarah Milstein): Thoughts on the State of Search
Don Dodge: What's new in search technology? Is Google it?
Rich Skrenta: How to beat Google, part 1
Information Arbitrage: Domain Expertise: The Key to Next Generation Search
David Berkowitz: The Hunt for Search Engine Innovation, Part 1
Richard MacManus/RWW: Interview with Google's Matt Cutts about Next-Generation Search
Phil Butler on Hakia and Powerset
Bob Stumpel: SEARCH 2.0 - consolidated
Google Operating System: What Has Google Done in Search Lately?