How Google Works.


The key to Google's success lies in its sheer simplicity: just type a couple of words into a small box, hit the return key and everything the Internet knows about that precise subject is revealed, neatly arranged in order of direct relevance. Of course, behind the scenes things are a little more complex...


A brief history of Google
Google started life humbly as a search engine called BackRub, the brainchild of a collaboration between Stanford University graduate students Larry Page and Sergey Brin in January 1996. BackRub was so called because it analyzed the 'back links' that point to individual websites. By 1998, they had acquired one terabyte of cheap disks and built their own PCs around them. With little interest in setting up their own business, they tried to interest portal companies in licensing deals, to no avail.
So, with a business plan in hand, they approached Stanford faculty member and Sun Microsystems co-founder Andy Bechtolsheim, who liked the demo enough to write a cheque for $100,000. This was made payable to Google Inc, but the company didn't exist yet, so the cheque couldn't be cashed. It was, however, the catalyst for Page and Brin to take their hobby project to a whole new level.

After approaching other investors, Google was finally launched on 7 September 1998 with $1 million in funds, yet the office was still a friend's garage.


Google was soon making big waves in the search industry, handling 10,000 queries a day and grabbing ever bigger headlines. By February 1999, in new offices and with eight employees, the daily queries had risen to 500,000. On 7 June, $25 million of venture capital was secured, with one representative from each of the two VC firms taking directorships of Google. Mike Moritz of Sequoia and John Doerr of Kleiner Perkins definitely had the right CVs: they had already helped grow Amazon, Sun Microsystems and the search supremo of the time, Yahoo! Another move to bigger premises soon followed, and when AOL selected Google as its web search service, the daily traffic hit three million searches. On 21 September 1999, Google officially came out of beta, and by June 2000 it was the world's largest search engine with a billion-page index, serving up 18 million queries daily. By the end of 2000, daily online traffic hit the 100 million mark, and today Google indexes over eight billion web pages and processes about 2,300 million searches every single day.

Probably the biggest computer on the planet?
The precise details of how Google performs its searches, and what equipment is used to make it happen, are a closely guarded secret. But a lot of people have tried to fathom the Google system, and by stitching all the little pieces of information together we can build up a fairly detailed picture of how Google actually works.

We know, for instance, that Google spreads the processing load across server farms in disparate global locations. These farms are home to clusters of commodity-level computers (2,000 per cluster) running the proprietary Google File System (GFS) on a Linux OS. If you think of it as a single connected system, Google just might be the biggest computer in the world - but we don't know that for sure.

We estimate that Google consists of 1,125 racks, each containing 88 dual-processor machines. That gives 198,000 2GHz CPUs providing 396,000GHz of processing power, backed by 198,000GB of RAM. Not forgetting the 7,750TB of hard drive storage or the sustained data transfer rates of 2GB/s within a cluster.
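
The arithmetic behind that estimate can be checked in a few lines of Python. The rack, machine and clock-speed figures are the estimates quoted above; the 2GB of RAM per machine is an assumption, inferred from the 198,000GB total rather than stated anywhere.

```python
# Back-of-the-envelope check of the cluster estimate.
RACKS = 1_125
MACHINES_PER_RACK = 88
CPUS_PER_MACHINE = 2        # dual-processor machines
GHZ_PER_CPU = 2
RAM_GB_PER_MACHINE = 2      # assumption: inferred from the 198,000GB total

machines = RACKS * MACHINES_PER_RACK
cpus = machines * CPUS_PER_MACHINE
total_ghz = cpus * GHZ_PER_CPU
total_ram_gb = machines * RAM_GB_PER_MACHINE

print(f"{machines:,} machines, {cpus:,} CPUs, "
      f"{total_ghz:,}GHz, {total_ram_gb:,}GB of RAM")
```

Note that the estimate implies 99,000 individual machines, a figure the text never states directly.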

Feeling lucky?
Have you ever thought about what actually happens when you type your keywords into the Google search box and hit the enter key? It works like this. First, a 'load-balancing' system checks how busy each of the disparate Google server farms is, then sends your request to the nearest, least busy one. We know what happens next thanks to 'Web Search for a Planet: The Google Cluster Architecture' by Google Fellow Jeffrey Dean and colleagues. This complex paper reveals that the keywords in your query are checked against 'index servers' for matching documents, and the PageRank algorithm (see below) is applied to produce relevancy scores and determine where in the results each document should appear. With tens of terabytes of data to be searched, this should be a slow and painful process. It isn't, because Google divides the index into 'index shards', each containing a random subset of documents from the full index. In effect you're running the same search on countless small databases simultaneously: a highly parallelized and efficient search system.
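
The sharding idea can be sketched in Python. This is only a toy illustration under stated assumptions, not Google's implementation: the documents, the docids, the `build_shards` and `search` helpers and the modulo-based shard assignment (a simple stand-in for random assignment) are all invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy corpus: docid -> page text.
DOCUMENTS = {
    1: "google search engine history",
    2: "linux file system design",
    3: "search engine ranking with links",
    4: "cheap disks and commodity computers",
    5: "parallel search across index shards",
    6: "stanford research paper citations",
}

def build_shards(documents, shard_count):
    """Split the documents into shards, each with its own inverted index."""
    shards = [dict() for _ in range(shard_count)]
    for docid, text in documents.items():
        index = shards[docid % shard_count]  # stand-in for random assignment
        for word in set(text.split()):
            index.setdefault(word, set()).add(docid)
    return shards

def search_shard(index, keywords):
    """Return the docids in one shard that contain every keyword."""
    matches = [index.get(word, set()) for word in keywords]
    return set.intersection(*matches) if matches else set()

def search(shards, query):
    """Run the same query on every shard in parallel, then merge the results."""
    keywords = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda index: search_shard(index, keywords), shards))
    return sorted(set().union(*partial))

SHARDS = build_shards(DOCUMENTS, shard_count=3)
print(search(SHARDS, "search engine"))  # docids containing both words
```

Each shard only ever looks at its own slice of the index, so adding shards (and machines to hold them) keeps the per-shard work small as the index grows.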

Every web page that Google indexes is given an associated document ID number (or docid). The docids that match your search query are now sent to the Google document servers, where each is matched to its page's title and URL, while snippets of text are extracted from the documents to show you the context in which your keywords appear in the website. All of this usually takes less than a second - amazing when you consider you've just searched multiple low-latency copies of almost everything on the entire Internet.
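
A minimal sketch of that last step, assuming a document store keyed by docid. The `DOCUMENT_STORE` record, its URL and the `make_snippet` and `build_result` helpers are hypothetical, and real snippet selection is certainly more sophisticated than taking the first keyword hit.

```python
# Hypothetical document store: docid -> title, URL and page text.
DOCUMENT_STORE = {
    42: {
        "title": "How Google Works",
        "url": "http://example.com/how-google-works",
        "text": "Google started life humbly as a search engine "
                "called BackRub in January 1996.",
    },
}

def make_snippet(text, keyword, context_words=2):
    """Return the first occurrence of keyword with a few words either side."""
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip(".,;:!?") == keyword.lower():
            start = max(0, i - context_words)
            end = min(len(words), i + context_words + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return None  # keyword not found in this document

def build_result(docid, keyword):
    """Look up a docid and assemble the title, URL and snippet for display."""
    doc = DOCUMENT_STORE[docid]
    return {
        "title": doc["title"],
        "url": doc["url"],
        "snippet": make_snippet(doc["text"], keyword),
    }

print(build_result(42, "engine"))
```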

The relevance of PageRank
Unlike all of the other search engines in the late 1990s, and indeed unlike most search engines today, Google wasn't built around the notion that the repetition of a keyword within a document, or the density of that keyword throughout the page, should determine search relevancy. Instead, Google uses a system called PageRank to examine the entire link structure of the web, and in turn decide which websites are the most popular and useful by determining how they are referenced by their peers. It's based on the academic premise that the importance of any research paper can be accurately judged by the number of citations the paper has from other papers. In other words, Google bases its search results on democracy and a good old-fashioned popularity contest!

But this alone is not enough to determine where a website appears in the results. Instead, the PageRank score is combined with a method of determining real-world relevance. This is done using hypertext matching analysis: in other words, Google looks for keywords found deeper than just the body copy of the page. Google even scrutinizes the fonts and subdivisions, the precise location of every keyword on the page, and the content of adjacent pages.

Exactly how PageRank works is another of Google's closely guarded secrets.

The Google File System, GFS, enables the search process, while PageRank helps with the sorting of results. But it's the Googlebot that does all the groundwork. It's a web crawler, following every available link from every Internet page it finds, relaying content data back to the Google document servers for indexing. Googlebot has two components: 'Deepbot' crawls deep into a site to build a complete matrix, while 'Freshbot' is the surveillance team, hunting out newly created content. Googlebot resides on multiple computers and is capable of accessing thousands of web pages simultaneously. It never, ever sleeps.
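
The link-following behaviour of a crawler can be sketched as a breadth-first traversal. This is a generic illustration, not Googlebot's actual design: the `TOY_WEB` link map and the `crawl` function are hypothetical, and a dictionary lookup stands in for real page fetching.

```python
from collections import deque

# Hypothetical miniature web: URL -> URLs that the page links to.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
    "orphan.example/": ["a.example/"],
}

def crawl(seed, fetch_links):
    """Breadth-first crawl: visit every reachable page once, in discovery order."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would hand the content off for indexing here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

pages = crawl("a.example/", lambda url: TOY_WEB.get(url, []))
print(pages)
```

Notice that `orphan.example/` is never visited: a crawler that only follows links cannot discover a page that nothing links to.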

FOCUS: GOOGLE PAGERANK
The elusive algorithm revealed (perhaps)

In 1998 Brin and Page published a paper at Stanford which explained the original PageRank concept. The algorithm has been secretly tweaked countless times since then, but at least we know its roots were planted in this equation:

PR(A) = (1-d) + d [PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)]

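Er, in other words, a page's PageRank is essentially the sum of the PageRank of every page that links to it, with each contribution divided by the number of outgoing links on the linking page ... if you see what we mean. Here T1 to Tn are the pages that link to page A, C(T) is the number of outgoing links on page T, and d is a damping factor, which the paper says is usually set to 0.85.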
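
To see the published equation in action, here is a minimal Python sketch that applies it repeatedly to a tiny link graph until the scores settle. It implements only the 1998 formula, not whatever Google runs today; the `pagerank` function, the three-page `TOY_LINKS` graph and the iteration count are assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {page: 1.0 for page in pages}  # arbitrary starting scores
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            incoming = sum(
                ranks[source] / len(targets)   # PR(T) / C(T)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# Hypothetical three-page web: page -> pages it links to.
TOY_LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

ranks = pagerank(TOY_LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top because both other pages link to it, while B ranks lowest because its only incoming link is one of two on page A, so it receives just half of A's vote.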

PageRank isn't perfect by any means. It's vulnerable, like all search engines, to mass manipulation such as the famous Google bombing, where thousands of bloggers create links that connect keywords to a specific site. Search for 'miserable failure' and you'll see an example. (In fairness, this particular example may be a bit of cunning on the part of Google to get some free PR.)

Keyword spamming is another method of the search cheaters. This technique uses a bridging page that contains nothing but an intro and then keyword repetition by the hundreds. Alas, making the repetitive text invisible to the user by adopting the background colour may fool the site visitor, but it doesn't fool the Googlebot.

Far better, then, to take your newfound understanding of the rules of the search game and play by them. But there are still a few tricks you can apply in web design to help you get the highest possible PageRank:

First of all, keywords. Words that are in larger, bolder fonts tend to be seen by Google as more relevant when considering the overall content of a page. Ensure that your page title is relevant, but above all else that it contains those keywords that explain the core purpose of the page. Do this in fewer than 15 user-friendly words, rather than repetitive keyword title spam, and Google will notice. As an aside, the title is what gets shown as the first line in a Google result, so it should be descriptive to appeal to visitors.

Second, content. For a high Google ranking your site's content should not only be interesting for your readers, but also interesting to the Googlebot. This means ensuring that the page has just the right keyword density. There's a handy keyword density calculator at www.pagerank.com, which helps reveal keyword occurrences and densities for any given URL. If the words and phrases that you want to be associated with your site have the highest densities, then you've got it right, otherwise keep tweaking.
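
If you would rather check densities yourself, the calculation is simple enough to sketch in Python. This assumes density means a word's share of all the words on the page, which may differ from how any particular online calculator defines it; the `keyword_density` function and the sample text are invented for illustration, and no common words are filtered out.

```python
import re
from collections import Counter

def keyword_density(text, top=5):
    """Return the most common words with their share of all words, in percent."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [
        (word, round(100 * count / total, 1))
        for word, count in counts.most_common(top)
    ]

# Hypothetical page copy.
sample = "Cheap laptops. Compare cheap laptops and find laptops you love."
print(keyword_density(sample, top=2))
```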

Third, consider hyperlinks both in and out of your website. You want to get links to your website from other websites that already feature highly in relevant Google searches. Your rating will be boosted far more by just a couple of links from popular websites than by many links from lots of obscure ones. And, on the same principle, think carefully about which sites you link to. Avoid any site that tries to cheat the Google search system. If you don't, then guilt by association comes into play.

The Google File System
So that's PageRank, the brains behind Google. But the brawn - the thing that gives Google its speed and efficiency - is the GFS, the Google File System. This was written from the ground up by Google developers, and it's necessary because Google's backbone architecture is completely different from that of most of its rivals. In the early days of the Internet, while other search engines adopted hardware frameworks consisting of small groups (clusters) of big servers to do the work, Google was forced by its lack of financial clout to use big clusters of small, inexpensive computers instead. It's a bit like the concept of grid computing, where lots of small computers are all linked together to produce a combined processing power of enormous proportions.

If this sea change in search infrastructure wasn't enough to cope with, the Google developers also had to take into account that the 'small computers' in question were also very cheap (read: 'unreliable') computers. So GFS was also built to be self-monitoring, able to recover automatically from component failures by doing much the same as the Internet does and rerouting tasks from dead machines to living ones, which makes the whole system more resilient, even, than a single supercomputer.
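
The rerouting idea can be shown with a toy scheduler. This is not how GFS itself is implemented; it only illustrates the general principle of moving work off dead machines. The `reroute` function, the task and machine names, and the least-loaded placement rule are all assumptions made for the example.

```python
def reroute(assignment, alive):
    """Return a new assignment with tasks on dead machines moved to live ones."""
    live = sorted(machine for machine, is_up in alive.items() if is_up)
    if not live:
        raise RuntimeError("no live machines left")
    # Count the tasks already running on each live machine.
    load = {machine: 0 for machine in live}
    for machine in assignment.values():
        if machine in load:
            load[machine] += 1
    rerouted = {}
    for task, machine in sorted(assignment.items()):
        if machine not in load:  # the task's machine is dead
            machine = min(live, key=lambda m: load[m])  # least-loaded live machine
            load[machine] += 1
        rerouted[task] = machine
    return rerouted

# Hypothetical cluster state: machine m2 has just failed.
assignment = {"t1": "m1", "t2": "m2", "t3": "m2", "t4": "m3"}
alive = {"m1": True, "m2": False, "m3": True}
print(reroute(assignment, alive))
```

Tasks on healthy machines stay where they are; only the two stranded on m2 move, and they are spread across the survivors rather than piled onto one.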


Google's future

Google isn't the perfect search system that many think, and its Achilles heel may end up proving to be the very thing that's set it apart for so long: PageRank. PageRank often gets it right, mathematically speaking, but entirely wrong as far as real-world results go. The Google bombing that we mentioned earlier is a good example, but the flooding of results lists with shopping comparison sites is an even better one. Although Google seems to be filtering out a lot of these now, the main reason that they happened at all must also be addressed: PageRank doesn't understand context. As the web has grown, so has the size of the Google index and the number of results returned for any given search. We all know that size and performance are not common bedfellows. The next step for Google is to bring in some form of linguistic intuition, an understanding of the semantic relationships between words and their context.

It may seem obvious to state that Google will only work if the user chooses the right words to search for, the right question to ask, but that's becoming increasingly hard to do with the burgeoning of disparate online content. However, if there's one thing Google has proven over the years it's that it doesn't stand still: from purchasing the Deja.com newsgroup archive and turning it into something even better as Google Groups, through to the superb Gmail service. We've dug deep to provide an insight into how Google works today, but as for where it will be tomorrow, well, we'll just have to save that for another day.

source PC Plus Mag.

 

 