Google has been collecting searchers’ browsing and searching habits for years. They have their search logs, clickstream information for every site serving their AdSense ads and for every site using their free web analytics program, the browsing history of their toolbar users who do not opt out, and that of all the users of their Web Accelerator proxy. And unlike many other companies that collect user data, Google actually uses their data in fundamental ways. So it comes as no surprise that they’d want to find ways to employ user information in their search algorithms.

Clickstream data and folksonomy are two of the big areas that search algorithms are expected to draw on. Right now all major search engines use ranking algorithms that are primarily based on the PageRank concept that Google introduced and became famous for. They all use links on the web to establish authority, and no fundamental change has taken place in search algorithms in many years. They get better at filtering malicious manipulation and at tweaks that eke out relevancy, but nothing groundbreaking.
So authority based on clickstream analysis and social indexing seemed like good ways to use data to further diversify the effort to allocate authority to web pages. What Google learned early is that they needed scale, and their initial data efforts (things like site ratings by their toolbar users) didn’t end up in their search algorithm. Folksonomy and social indexing don’t yet have enough scale to rely on and have potential for abuse, but the clickstream has scale and is harder to game, given that traffic is essentially the authority and the people gaming the authority want traffic. If they need traffic to rank well, and need to rank well to get traffic, then those manipulating rankings face a significant challenge: they need the end for their means.
But Google is cautious with their core search product and has tweaked their algorithm very conservatively. It has been hard to tell just how much of a role clickstream data plays in their search results, and it will continue to be hard as long as it’s such a minor supplement to their algorithm. Today, Google posted a bit of information about this on their official blog in their effort to shine more light on how they use private data. You can read about it here in full, but the basics are no big surprises:
- Location - Using geolocation or your input settings, they customize results to your location slightly. My estimate is that they are mainly targeting searches with more local relevance. An example of such a search would be “pizza”. “Pizza” is more local than “hosting” and can benefit greatly from localization. Hosting, not so much.
- Recent Searches – Because many users learn to refine their searches when they don’t find what they are looking for, this session state is very relevant data. A user who’s been searching for “laptops” for the last few minutes is probably not looking for fruit when they type in “apple” next. They reveal that they store this information client side and that it is gone when you close your browser, but because they mean cookies, anyone who’s been seriously looking under the hood already knows this.
- Web History – If you have allowed them to track your web history through a Google account, they use this to personalize your results. They don’t say much of anything about what they are really doing, but this is where the most can be done, and there are far too many good ideas to list here. One example would be knowing what sites you prefer. Do you always click the Wikipedia link when it’s near the top of the search results, even if there are higher ranked pages? Then they know you like Wikipedia and may promote its pages in your personalized results. Do you always search for home decor? Then maybe they’ll take that into consideration when you search for “design” and not give you so many results about design on the web. There are a lot of ways they can use this data, and this is probably an area they will explore further.
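To make the last two ideas a bit more concrete, here is a rough sketch of what a slight, cautious re-ranking could look like. This is purely my own illustration and not anything Google has described: the `Result` type, the topic labels, the `personalize` function and the boost sizes are all hypothetical. It only shows how a recent-search topic and a preferred site could nudge a base ranking without overturning it.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Result:
    url: str
    score: float       # hypothetical base relevance from the core ranking
    topics: frozenset  # hypothetical topic labels for the page


def personalize(results, recent_topics, preferred_domains,
                topic_boost=0.2, domain_boost=0.1):
    """Re-rank results using small multiplicative boosts (assumed values)."""
    def adjusted(result):
        boost = 0.0
        # Session signal: the page matches a topic from recent searches.
        if result.topics & recent_topics:
            boost += topic_boost
        # History signal: the user tends to click results from this site.
        if urlparse(result.url).netloc in preferred_domains:
            boost += domain_boost
        return result.score * (1.0 + boost)

    return sorted(results, key=adjusted, reverse=True)


# A search for "apple" right after a run of laptop searches.
results = [
    Result("https://fruit.example/apple", 1.00, frozenset({"food"})),
    Result("https://computers.example/apple", 0.90, frozenset({"computers"})),
    Result("https://en.wikipedia.org/wiki/Apple", 0.85,
           frozenset({"food", "reference"})),
]
ranked = personalize(results, {"computers"}, {"en.wikipedia.org"})
for result in ranked:
    print(result.url)
```

With these made-up numbers, the computer page moves ahead of the fruit page, while the Wikipedia boost is too small to change its position. Keeping the boosts small relative to the base score is one way to model the "slight" personalization described above.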
In summary, right now I’d say they are mainly going with simple personalizations and not really employing aggregate data and aggregate personalization to produce aggressive differences. They are careful with their brand and will use user history with caution. After all, if the use of your history leads to less relevant results, they fail, and because personalization can be unpredictable (there must be some seriously weird browsing histories out there) they are going to be cautious and slight with this.