Open Source Technology: 06/04/07

There is an abundance of new search engines (100+ at last count) - each pioneering some innovation in search technology. Here is a list of the top 17 innovations that, in our opinion, will prove disruptive in the future. These innovations are classified into four types: Query Pre-processing; Information Sources; Algorithm Improvement; Results Visualization and Post-processing.
[Some of these innovations are present in various Google properties, but are either missing or available only in limited form in the main search page, as noted below.]

Query Pre-processing

Pre-processing Logos

The main purpose of this type of enhancement is the application of logic to try to divine the user's intent, and apply that knowledge to improve the query specification.

1. Natural Language Processing

This feature was initially pioneered by Ask.com. The best-known contemporary examples are Hakia and Powerset, both of which (in different ways) try to understand the semantics or meaning behind the user's query. The big difference from Google is that these engines treat "stopwords" - minor connecting words like by, for, about, of, in - as significant, whereas Google discards them.
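To see why stopwords can matter, here is a minimal sketch in Python. The stopword list and both function names are our own assumptions for illustration; this is not how Google, Hakia or Powerset actually process queries. It only shows that two queries with different meanings become identical once the connecting words are dropped.

```python
# Hypothetical stopword list, for illustration only; real engines use their own lists.
STOPWORDS = {"by", "for", "about", "of", "in", "the", "a"}


def keyword_terms(query: str) -> list[str]:
    """Keyword-style pre-processing: lowercase the query and drop stopwords."""
    return [term for term in query.lower().split() if term not in STOPWORDS]


def all_terms(query: str) -> list[str]:
    """Keep every term, so connecting words remain available for interpretation."""
    return query.lower().split()


q1 = "books by children"
q2 = "books for children"

# With stopwords removed, the two queries can no longer be told apart.
print(keyword_terms(q1), keyword_terms(q2))
# With stopwords kept, the difference in intent is still visible.
print(all_terms(q1), all_terms(q2))
```

An engine that keeps the connecting words at least has the chance to distinguish books written by children from books written for them.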

2. Personal relevance (aka personalization)

It has long been understood that tailoring the query to the interests and requirements of a specific user provides a high level of relevance to search results. Google already supports this in their search engine, but only if you're logged in; many users are understandably reluctant to do so, since this has the potential to provide Google with a complete trail of their individual searches. [Even John Battelle agrees that the idea is a little scary, although Matt Cutts from Google disagrees.] What is needed is a way to provide personalization, albeit in an anonymous fashion. At a broader level, providing personalization across multiple sites would be even more useful. Collarity is one search engine that addresses this functionality.

3. Canned, specialized searches

This is a simple, yet powerful feature. The poster child for this approach is SimplyHired, a vertical search engine for jobs, which provides powerful, pre-set searches, such as "employers friendly to mature workers", "dog-friendly employers" and so on.

Information Sources

These enhancements focus on the underlying data sources: additional content types and support for restricting the set of data sources to improve the reliability of search results (reduced spam!).

4. New content types

It is a sign of the times that today's teens are just as comfortable exchanging photos and videos on their cell phones as text messages. On the web, rich media content is exploding - images, audio, video, TV - along with semantic information about content. Search engines increasingly need to support these additional content types to stay relevant. Some examples of search engines with rich content support are given below:
- Rich media search: Audio (odeo, podzinger), Video (YouTube, truveo), TV (Blinkx), Images (Picsearch, Netvue)
- Specialized content search: Blogs (Technorati), News (Topix), Classifieds (oodle)

Of course, Google is heavily active in this area, with Google Blogsearch (blogs), Searchmash (images), Google Video, Google News and so on, so perhaps it's not fair to put this item into the list. What would be ideal, though, would be to integrate the different media results into a single search, as Searchmash already does (Retrevo is another great example).

5. Restricted Data Sources

One of the biggest issues frustrating search users is spam. As marketers get more savvy and use increasingly aggressive SEO tactics, the quality of the results continues to degrade. (Google, as the most popular search engine, gets more than its fair share of targeting.) Restricting searches to a set of trusted sites eliminates this issue, although it also narrows the universe of content searched. This works well to provide a set of authenticated, high-quality results for certain types of searches; e.g. searching Wikipedia, National Geographic and science/educational sites when researching volcanoes for an elementary school project.

The best example of this approach is provided by A9.com, which provides content from a variety of sources and allows the user to make an explicit choice for every search. Google Co-op and Yahoo! Search Builder enable 3rd parties to build such a solution; Rollyo has been an early pioneer in this space!
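The core idea of restricting results to trusted sites can be sketched in a few lines of Python. The trusted-site list, the sample URLs and the function names below are assumptions made up for this example; none of the services mentioned above is claimed to work this way internally.

```python
from urllib.parse import urlparse

# Hypothetical whitelist for a school project on volcanoes.
TRUSTED_SITES = {"wikipedia.org", "nationalgeographic.com"}


def is_trusted(url: str, trusted=TRUSTED_SITES) -> bool:
    """True if the URL's host is a trusted site or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in trusted)


def restrict(results: list[str], trusted=TRUSTED_SITES) -> list[str]:
    """Keep only results hosted on trusted sites, preserving their order."""
    return [url for url in results if is_trusted(url, trusted)]


results = [
    "https://en.wikipedia.org/wiki/Volcano",
    "http://spam.example.com/volcano-deals",
    "https://www.nationalgeographic.com/volcanoes",
    "http://wikipedia.org.evil.example/volcano",  # look-alike host, not a subdomain
]

print(restrict(results))
```

Matching on the host name rather than on the raw URL string is a deliberate choice: a simple substring test would let the look-alike host in the last entry slip through.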

6. Domain-specific search (Vertical Search)

By focusing on a single vertical, the search engine can offer a much better user experience, one that's more comprehensive and tailored to a specific domain. There is an incredible variety of vertical search engines for various domains; for more information, check out Alex Iskold's article on the Read/WriteWeb or this overview on the Software Abstractions blog. [For context, Sramana Mitra's overview of online travel services provides a sense of how Vertical Search fits into the overall picture.]

Algorithm Improvement

These enhancements focus on improving the underlying search algorithms to increase the relevance of results and provide new capabilities.

7. Parametric search

This type of search is closer to a database query than a text search; it answers an inherently different type of question. A parametric search helps to find problem solutions rather than text documents. For example, Shopping.com allows you to qualify a clothing search with a material, brand, style or price range; job search sites like Indeed let you constrain the matches to a given zip code; and GlobalSpec lets you specify a variety of parameters when searching for engineering components (e.g. check out the parameters when searching for an industrial pipe). Parametric search is a natural feature for Vertical Search engines.

Google has already incorporated this feature at a general level - such as the parameters on the Advanced Search page - but that waters down its usefulness. The most powerful use of this feature happens when additional parameters become available as you drill down further into standard search results or when you constrain the search to specific verticals.
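As a rough illustration of how a parametric search differs from a text search, the Python sketch below filters a tiny, made-up clothing catalog on structured attributes. The `Item` fields, the catalog contents and the `parametric_search` function are all hypothetical; no real shopping site's data model is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    material: str
    brand: str
    price: float


def parametric_search(items, material=None, brand=None, max_price=None):
    """Return items matching every parameter that was supplied (None means 'any')."""
    matches = []
    for item in items:
        if material is not None and item.material != material:
            continue
        if brand is not None and item.brand != brand:
            continue
        if max_price is not None and item.price > max_price:
            continue
        matches.append(item)
    return matches


# Made-up catalog for the example.
catalog = [
    Item("Crew sweater", "wool", "Acme", 80.0),
    Item("Summer shirt", "cotton", "Acme", 35.0),
    Item("Hiking socks", "wool", "Trailco", 15.0),
]

for item in parametric_search(catalog, material="wool", max_price=50.0):
    print(item.name)
```

Each extra parameter narrows the result set, which is why this kind of search gets more useful as more attributes become available within a vertical.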

8. Social Input

Yahoo!'s Bradley Horowitz believes social input to be a big differentiator of search technologies in the future (as does Microsoft). Aggregating inputs from a large number of users enables the search engine to benefit from the wisdom of crowds to provide high-quality search results. Of course, the results may not be valid if the individual inputs are not independent or can be gamed. Among the different offerings in this space, the service provided by del.icio.us seems likely to provide high-quality search capabilities based on this approach. [An earlier post offers a comparison among the different findability solutions based upon "crowd-sourcing".] Other reputation-based systems include StumbleUpon, Squidoo, About.com and of course, Wikipedia - all of which fall under the overall umbrella of findability, although they are not, strictly speaking, search engines.

Of course, Google's venerable PageRank algorithm is also implicitly based on social input. Since a large component of PageRank is based on the number and character of incoming links from different web sites, those incoming links act as implicit votes for gathering collective intelligence.
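The "links as votes" idea can be sketched with the simplified, textbook form of PageRank computed by repeated iteration. This is only an illustration and certainly not Google's production algorithm; the four-page link graph is made up, and the damping factor of 0.85 and the iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Made-up link graph: a, b and d all link to c; c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page c collects the most incoming links and ends up with the highest score, while page a benefits from being the only page that c "votes" for.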

9. Human Input

This approach is included in the list for completeness. Search engines like ChaCha are experimenting with using human operators to respond to search queries. Arguably, Yahoo! Answers is another solution in this space, although the answers are provided by other users rather than by people working for the search engine.

It's difficult to see how the ChaCha-type approach would scale unless it somehow leverages community resources.

10. Semantic Search

Some of the exciting recent developments in search have to do with extracting intelligence from the Web at large. These applications are just the start - they convey the enormous potential of the Semantic Web. Early pioneers in this space include: Monitor110, which tries to extract actionable financial information from the web that could be of interest to institutional investors; Spock, the "people-search" engine (currently in closed Beta), which plans to have 100 million profiles in its database at launch; and Riya, a visual search engine whose technology provides face and text recognition in photos.

11. Discovery support

Hand-in-hand with personalization and agent technology goes Discovery; this is a holy grail for search. Although ad-hoc searches are the most popular at this time, most users have fairly stable interests over long periods of time. Wouldn't it be great if you could discover new sources of data - especially high-quality feeds - as they became available?

There are already some tentative steps in this direction, that combine search with the power of RSS - for example, you can already set up an RSS feed for the output of many types of searches in Google and Yahoo!. Bloglines already supports a "Recommended Feeds" feature - clearly, a feed reader should be in a great position to recommend new blogs or feeds in your area of interest, based upon the contents of your OPML file. Another player in this field is Aggregate Knowledge, which provides specialized services for retail and media by collecting information anonymously across multiple web sites. Overall, this will be an exciting area to watch in the future!

Results Visualization and Post-processing

These enhancements focus on improving the display of results and on "next steps" features offered post-query.

12. Classification, tag clouds and clustering

Search engines like Quintura and Clusty provide clustering of results based on tags/keywords. This allows the user not only to see the results themselves, but visualize the clusters of results and the relationships between them. This meta-information can help the user make sense of the results and discover new information on related topics.
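A very simplified way to picture tag-based clustering is to group results by the tags attached to them and then look at which tags occur together. The Python sketch below does just that; the sample results, their tags and both function names are invented for illustration, and real clustering engines such as Quintura or Clusty presumably derive their clusters in far more sophisticated ways.

```python
from collections import defaultdict


def cluster_by_tag(results):
    """Group result titles under each tag they carry."""
    clusters = defaultdict(list)
    for title, tags in results:
        for tag in tags:
            clusters[tag].append(title)
    return dict(clusters)


def related_tags(results, tag):
    """Tags that appear together with the given tag on at least one result."""
    related = set()
    for _, tags in results:
        if tag in tags:
            related.update(t for t in tags if t != tag)
    return sorted(related)


# Made-up results for the ambiguous query "jaguar".
results = [
    ("Jaguar XK review", ["cars", "jaguar"]),
    ("Jaguar habitat", ["animals", "jaguar"]),
    ("Big cats of the Amazon", ["animals"]),
]

print(cluster_by_tag(results))
print(related_tags(results, "jaguar"))
```

Even in this tiny example, the related tags expose the two senses of the query, which is the kind of meta-information that helps a user steer toward the right cluster.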

13. Results visualization

Images are easier for the human brain to understand and remember than text results. At a more general level than clustering, specialized UI paradigms for displaying search results and the relationships between them can convey more meaning to the user and make the "big picture" easier to process. This approach works especially well within a specific context, such as a vertical search engine. The Visual Thesaurus from Thinkmap, VizServer from Inxight Software and HeatMaps from real estate search engine Trulia are examples of new ways to visualize information, although research in this field is still in its early stages. At a simpler level, HousingMaps is a mashup that displays the locations for houses available to rent or buy.

14. Results refinement and Filters

Often a natural next step after a search is to drill down into the results, by further refining the search. This is different from the "keyword-tweaking" that we've all gotten used to with Google; it's not just experimenting with keyword combinations to submit a new query, but rather, an attempt to actually refine the results set [akin to adding more conditions to the "where" clause of a SQL query] - this would allow users to narrow the results and converge on their desired solution.
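To make the SQL analogy concrete, here is a small sketch using Python's built-in sqlite3 module. The table layout, the sample rows, the `refine` function and the filter names are all assumptions invented for this example. Each filter the user applies simply adds one more condition to the "where" clause over the same result set.

```python
import sqlite3

# Hypothetical result set for a health-related query, held in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (title TEXT, source TEXT, reading_level TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("Asthma in children", "government", "basic"),
        ("Asthma drug trials", "journal", "advanced"),
        ("Managing asthma", "government", "advanced"),
        ("Asthma overview", "nonprofit", "basic"),
    ],
)

ALLOWED_FILTERS = {"source", "reading_level"}


def refine(conn, filters):
    """Narrow the result set: every filter becomes one more WHERE condition."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    sql = "SELECT title FROM results"
    if filters:
        # Column names come from the fixed whitelist; values are bound as parameters.
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    sql += " ORDER BY title"
    return [row[0] for row in conn.execute(sql, tuple(filters.values()))]


print(refine(conn, {}))
print(refine(conn, {"source": "government"}))
print(refine(conn, {"source": "government", "reading_level": "basic"}))
```

Note that the query itself never changes between calls; only the conditions accumulate, which is what separates refinement from submitting a new keyword query.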

Query refinement is a critical part of the search process, although it hasn't gotten the attention it deserves. One great example is the medical search engine Healia, which allows users to tweak health care search results by using demographic filters. This is important, because demographics, such as age, race and sex, can have a significant impact on search results for symptoms, diseases and the drugs used to treat them; there are also filters based on the complexity, source and type of results found.

Google has recently introduced a new button at the bottom of the Results page: "Search within results", which is a step in the right direction; results can also be refined using the existing OneBox widget and the relatively new Plusbox feature. Over time, we can expect this functionality to get increasingly sophisticated.

15. Results platforms

As social media and online content become more popular, the number of choices available to a user to consume digital information continues to increase; accordingly, search engines must now support a variety of output platforms, including: web browser, mobile device, Rich Internet Applications, RSS, email and so on. With connectivity becoming ever more ubiquitous, users of the future are likely to connect to search engines from even more unconventional sources, for example: a TiVo system that searches for movies/programs of interest, a Nintendo system used to search for online games or even a refrigerator touch screen used for recipe search.

Some existing search engines already support additional platforms, beyond the standard web browser and mobile device. The web search engine Plazoo has provided RSS results feeds for a long time; Quintura started as a downloadable RIA application, and only now does this search engine provide a pure web interface.

The easiest way to provide support for many different result types is to make available an open API, enabling third-party developers to create custom UIs for specialized target platforms. The Alexa Web Search platform was one of the first of these (although you use the API at your own risk); other available APIs include oodle, zillow and trulia.

Google, of course, provides APIs for several different properties - e.g. Google Base, Google Maps and the AJAX search API - although not for the main search engine. Handheld devices are supported via Google Mobile; Google Base and Blogsearch already provide RSS output.

16. Related Services

Technically, this is not exactly a part of the search function itself. However - once you finish a query, there is often a natural next step that follows the results of a search, e.g. after you search job openings, you want to apply to the postings you found. In terms of utility to the end user, this is an inherent part of overall search engine functionality.

Surprisingly, this feature has not been heavily exploited by many search engines, other than to display context-sensitive advertising. A perfect example of this approach is the interestingly-named specialized search engine: the web's too big, which enables the user to search for information on the web sites of public relations agencies based in the UK. These folks provide an interesting additional capability: users can enter details of their PR inquiry, and submit it directly to multiple PR agencies with a single click. Similarly, the real estate search engine Zillow provides the concept of Zestimate (an estimated home valuation computed by Zillow), as well as a Home Q&A feature. These types of additional services increase the value of search results offered to the user and make a site stickier.

Google provides additional services on some of its properties - such as the "Find businesses" option on Google Maps - but not in its main search engine.

17. Search agent

Closely related to the twin ideas of sustained, ongoing areas of interest and accessing search results as feeds, are search agents. Imagine a piece of software that functions as a kind of periodic search query, monitoring the web for new information on subjects of interest, collecting and collating the results, removing duplicates, and providing a regular update in summary form. This could work especially well for certain types of continuous searches that are important but not urgent: for example, monitoring for new jobs of interest as they become available, new houses for sale that fit within specific parameters, articles of clothing once they are marked down to a specific price, and so on.
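The agent described above boils down to a loop of "run the saved query, drop anything already seen, summarize what is new". Here is a minimal Python sketch of a single pass of that loop. The `fake_search` function stands in for a real search engine and, like every other name and URL here, is a hypothetical placeholder; scheduling the agent to run periodically is left out.

```python
def run_once(query, search, seen):
    """Run the saved query once and return only results not seen in earlier runs."""
    new_results = []
    for url in search(query):
        if url not in seen:  # also drops duplicates within a single run
            seen.add(url)
            new_results.append(url)
    return new_results


def summarize(query, new_results):
    """Produce a one-line update in summary form."""
    if not new_results:
        return f"No new results for '{query}'."
    return f"{len(new_results)} new result(s) for '{query}': " + ", ".join(new_results)


# Stand-in for a real search engine: returns a growing list of made-up job postings.
postings = ["http://jobs.example/1", "http://jobs.example/2", "http://jobs.example/1"]


def fake_search(query):
    return list(postings)


seen = set()
first = run_once("python developer", fake_search, seen)
print(summarize("python developer", first))

postings.append("http://jobs.example/3")  # a new posting appears before the next run
second = run_once("python developer", fake_search, seen)
print(summarize("python developer", second))
```

Keeping the set of already-seen results between runs is what turns a repeated query into an agent: the user is only told about what has changed since the last update.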

Copernic is an interesting player in this space - the Copernic Agent can automatically run saved searches and provide summaries for new results, as well as track changes in web pages. The Information Agent Suite from Connotate Technologies mines the "deep web" and automates change detection. For more examples of search agents, Read/WriteWeb has an article that describes Allth.at, along with Swamii and Searchbots.net.

Conclusion

Clearly, Google is not going to take this onslaught lying down. Just as it has already introduced personalized search into its primary search engine, it will continue to integrate some of these other approaches into the mainstream as they become successful. For example, Vertical specialization is a powerful tool that Google is sure to use.

It is very likely that in the future, the simple "search box" on the Google front page will hide a variety of specialized search engines behind it. On the other hand, trying to cram in an increasing number of these sophisticated features has the potential to make the overall architecture for Google (or any mainstream web search engine) very complex and difficult to change, so the trade-offs will present an increasingly difficult challenge! In a separate article on the Software Abstractions blog, we take a look at the conceptual architecture for a mainstream search engine that incorporates most of these features.