2010-04-30

How Do Search Engines Work?

Search engines are, at their core, computer programs that help users find the specific information they're looking for. With the web estimated to contain over a trillion pages, finding anything on the Internet without an effective search engine would be almost impossible. Different search engines work in different specific ways, but they all rely on the same basic principles.

The first thing search engines have to do in order to function is to make a local database of, basically, the Internet. Early search engines just indexed keywords and titles of pages; contemporary search engines index all of the text on every page, along with a great deal of data about each page's relation to other pages, and in some cases some or all of the media on the page as well. Search engines index all of this information so that they can run searches against it efficiently, rather than re-scanning the entire Internet every time a query comes in.
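To make that concrete, here is a minimal sketch in Python of the kind of "inverted index" behind this idea: a map from each word to the pages it appears on. The page names and texts here are invented for illustration.

# Toy inverted index: maps each word to the set of pages containing it.
# The pages and their texts are made up for this example.
pages = {
    "page1.html": "search engines crawl and index the web",
    "page2.html": "spiders crawl from page to page following links",
}

index = {}
for url, text in pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(url)

# Looking up a word is now one dictionary access, not a trawl of the web.
print(index.get("crawl"))  # {'page1.html', 'page2.html'}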
Search engines build these databases by performing periodic crawls of the Internet. Early search engines often required pages to be submitted to them before they would be crawled, but now most pages are found by following links from other pages. Programs called robots or spiders, built to index pages, flit from page to page, recording all of the data they find and following every link to new pages. Different search engines refresh their indexes at different intervals, depending on how many spiders they have crawling at once and how fast those spiders work: some work their way through the Internet every day or two, while others only refresh every week or month.
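At its core, a spider is a loop like the hypothetical sketch below: fetch a page, record it, queue up its links, repeat. Real crawlers add politeness delays, robots.txt handling, and far more robust link parsing; this is only a toy.

from collections import deque
import re
import urllib.request

def crawl(start_url, max_pages=10):
    """Toy breadth-first spider: fetch pages, record them, follow links."""
    seen, queue, store = set(), deque([start_url]), {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            raw = urllib.request.urlopen(url, timeout=5).read()
        except Exception:
            continue  # skip pages that fail to load
        html = raw.decode("utf-8", "ignore")
        store[url] = html  # record the page so it can be indexed later
        # Naive link extraction; a real spider uses a proper HTML parser.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            queue.append(link)
    return store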
As a spider goes through these pages, it records the words it finds. It notes how many times each word appears and whether the words are weighted in certain ways, perhaps based on size, position, or HTML markup, and it judges how relevant the words are based on the links that point to the page and on the page's general context.
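For example, a simple indexer might count every word but count words extra when they appear in prominent markup like the page title. The sketch below does exactly that; the triple weight for title words is an invented number, not any real engine's rule.

import re
from collections import Counter

def weighted_word_counts(html):
    """Count words, giving extra weight to words in the <title> tag."""
    counts = Counter()
    title = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if title:
        for word in title.group(1).lower().split():
            counts[word] += 3  # hypothetical weight: title words count extra
    body = re.sub(r"<[^>]+>", " ", html)  # strip tags to get visible text
    for word in body.lower().split():
        counts[word] += 1
    return counts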
Search engines must then weigh the value of each page, and the value of each page for the words that appear on it. This is the trickiest part of what a search engine does, but also the most important. At the simplest level, a search engine could just keep track of every word on a page and record that page as relevant for any search containing that word. That wouldn't do most users much good, however, since what they want is the most relevant page for their query, not merely one that mentions it. So different search engines come up with different ways of weighting importance.
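Continuing the toy example above, the simplest possible ranker just sums each page's weighted counts for the query words and sorts; real engines fold in hundreds of additional signals on top of this.

def rank(query, page_counts):
    """Score pages by summing their counts for the query words (toy model).

    page_counts maps url -> Counter of weighted word counts, as built above.
    """
    words = query.lower().split()
    scores = {
        url: sum(counts.get(w, 0) for w in words)
        for url, counts in page_counts.items()
    }
    # Highest-scoring pages first; pages scoring zero aren't returned at all.
    return sorted((u for u in scores if scores[u] > 0),
                  key=lambda u: scores[u], reverse=True)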
The algorithms the various search engines use are closely guarded, to prevent people from building pages specifically to earn better rankings, or at least to limit how far they can do so. These differences in weighting are why the same terms yield different results on different engines: Google might determine that one page is the best result for a search term, while Ask might determine that the same page isn't even in the top 50. It all comes down to how each engine values inbound and outbound links, the density of the keywords it deems important, the placement of words on the page, and any number of smaller factors.
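One piece of this that is public is the core idea behind Google's PageRank, published in 1998: a page is valuable if valuable pages link to it. Here is a toy version of that iteration over an invented three-page link graph; it is a sketch of the published idea, not Google's actual implementation.

def pagerank(links, damping=0.85, iterations=20):
    """Toy PageRank: each page's score flows to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Invented link graph: A links to B and C, B links to C, C links back to A.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))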
The newest trend in search, and likely its future, is the move away from keyword-based searches toward concept-based searches. Rather than limiting a search to the exact keywords you type, the engine tries to figure out what those keywords mean, so that it can suggest pages that may not contain the exact word but are nonetheless topical to your search. This is still a developing field, but it shows a lot of promise for making searches more relevant, and the web an even easier place to find exactly what you're looking for.
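One very rough way to picture this: before searching, the engine expands the query with related terms, as in the sketch below. The hand-written table here is invented; real systems derive these relationships statistically from enormous bodies of text rather than from a fixed list.

# Hypothetical hand-built concept table; real engines learn these
# relationships automatically from large text corpora.
RELATED = {
    "car": ["automobile", "vehicle", "sedan"],
    "buy": ["purchase", "shop", "price"],
}

def expand_query(query):
    """Expand each query word with related terms before searching."""
    words = query.lower().split()
    expanded = set(words)
    for word in words:
        expanded.update(RELATED.get(word, []))
    return expanded

print(expand_query("buy car"))
# A page about "automobile prices" can now match, even though it never
# uses the exact words "buy" or "car".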