Month: July 2008

Microsoft launches new Live Search homepage design

Posted by – July 31, 2008

Microsoft has launched a new design on the home page of its search engine, which you can see in the screenshot to your left.

The image has squares that link to different searches (image, map, and web). For example, in the screenshot here the tooltip reads “What will you see on your Safari to Botswana?” and links to an “animals in Botswana” search.

I expect they’ll rotate in other images and searches in the future, and this has a lot more to do with the branding of Microsoft’s search assets than with any functional change. Because the cost of entry for mapping and image search is lower, Microsoft’s maps and supplemental search services sometimes have more bells and whistles than Google’s equivalent features, and Microsoft is keen to get them in front of people.

You can’t sort and scale

Posted by – July 31, 2008

If you really want to scale, you are going to have to come to terms with a basic fact: you can’t sort and scale. OK, now that I have your attention through overstatement, let me apply the requisite nuance. Sure, you can sort. But you can’t sort deep into a stack efficiently.

Now if you are working on a smaller scale, where you aren’t pushing the limits of a relational database, you’ll never notice this. Your expensive sorts will still be fast enough on small datasets, and when they grow, the first pages will still be fast enough as long as you have decent database indexing. But there’s a specific pattern that will show you the wall: “the deep page in a long sort”.

I made that jewel of jargon up just now, and since it makes precious little sense, give me a chance to elaborate. When you sort a large dataset, your database performs well early in the result set. Further in, performance degrades, because the database has to sort and skip past every preceding row before it can hand you the next results.

So say you are selecting from a table with 5 million records and sorting by date. Getting the first 10 results is easy; your server doesn’t break a sweat. OK, now get the last 10 results in that sort. Your server has to sort and discard 4,999,990 rows before it reaches the offset, and the result is a slow query, or a query that can’t complete at all.
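
To make the wall concrete, here’s a toy sketch in Python with SQLite (my own example, not the post’s actual setup, and a smaller table so it runs quickly): the same LIMIT/OFFSET query gets more expensive the deeper the page, because the engine must walk past and throw away every skipped row.

```python
import sqlite3

# Hypothetical table paged with LIMIT/OFFSET. SQLite, like most engines,
# must scan and discard every skipped row before returning the page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at INTEGER)")
conn.executemany(
    "INSERT INTO posts (created_at) VALUES (?)",
    [(i,) for i in range(100_000)],
)

def page(offset, limit=10):
    # The deeper the offset, the more rows are sorted and skipped.
    return conn.execute(
        "SELECT id FROM posts ORDER BY created_at LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()

first = page(0)       # cheap: the engine stops after 10 rows
last = page(99_990)   # expensive: it walks 100,000 rows to return 10
```

Both calls return 10 rows, but the second one does ten thousand times the work to get there; at 5 million rows the difference stops being academic.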

And that’s just the way it is. It’s one reason Google doesn’t show you every page of your search results: they cut you off at about 1,000. Go ahead and try to find the last page Google ranks for a term like “computers” and you’ll see what I mean.

Now there are ways around it, so those of you in the Google-can-do-anything crowd can simmer down. The real reason they don’t is that deep results are of decreasing relevance and therefore of little utility to the average user. But if a mere genius ;-P like myself can figure out the workarounds, I’m sure they know them all as well.

To work around it, you need to select a smaller subset of the data with WHERE, and only then apply your ORDER BY and offset/limit to that smaller set. For example, in developing custom community software for able2know (coming soon, I hope), where threads can get big (e.g. 75,000 posts), we amortized the sort expense over the writes. We did this by calculating each post’s position within its thread and storing it on the post table with the rest of the post information, instead of relying on a sort by date. The last pages of a thread would otherwise have to sort all previous posts to know what to display; this way we know that at 10 posts per page, page 5 should simply query for positions 41–50, instead of sorting the whole dataset to find out what to show.

When a user posts, we do an inexpensive check for the last position and calculate the position of the new post; by storing it at write time we avoid the expensive sorts at read time. So if you want to sort a huge dataset, save the positions or identifiers up front and filter first.
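
A minimal sketch of that write-time approach (a hypothetical schema of my own, not able2know’s real one): compute each post’s position within its thread when it’s written, then page with an indexed WHERE range instead of a deep ORDER BY … OFFSET.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    thread_id INTEGER,
    position INTEGER,
    body TEXT)""")
# The index lets the page query seek straight to the range.
conn.execute("CREATE INDEX idx_thread_pos ON posts (thread_id, position)")

def add_post(thread_id, body):
    # Inexpensive check for the last position; MAX() is fast on the index.
    (last,) = conn.execute(
        "SELECT COALESCE(MAX(position), 0) FROM posts WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO posts (thread_id, position, body) VALUES (?, ?, ?)",
        (thread_id, last + 1, body),
    )

def get_page(thread_id, page, per_page=10):
    # Page 5 at 10 per page -> positions 41..50, however deep the thread is.
    lo = (page - 1) * per_page + 1
    return conn.execute(
        "SELECT position, body FROM posts "
        "WHERE thread_id = ? AND position BETWEEN ? AND ? ORDER BY position",
        (thread_id, lo, lo + per_page - 1),
    ).fetchall()

for i in range(75):
    add_post(1, f"post {i}")

page5 = get_page(1, 5)  # rows at positions 41..50, via an index range scan
```

The ORDER BY here is harmless: it sorts only the 10 rows the WHERE clause already narrowed things down to.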

In a nutshell for the SQL crowd: use WHERE not ORDER BY to sort and scale.

Google gives details on its customized search results

Posted by – July 30, 2008

Google has been collecting searchers’ browsing and searching habits for years. They have their search logs, clickstream information from every site serving their AdSense ads and every site using their free web analytics program, the browsing history of their toolbar users who don’t opt out, and the traffic of everyone using their Web Accelerator proxy. And unlike many other companies that collect user data, Google actually uses theirs in fundamental ways. So it comes as no surprise that they’d want to find ways to employ user information in their search algorithms. Clickstream data and folksonomy are two of the big areas search algorithms are expected to draw on. Right now all major search engines use ranking algorithms primarily based on the PageRank concept Google introduced and became famous for. They all use links on the web to establish authority, and no fundamental change has taken place in search algorithms in many years. They get better at filtering malicious manipulation and at tweaks that eke out relevancy, but nothing groundbreaking.

So authority based on clickstream analysis and social indexing seemed like good ways to use data to further diversify how authority is allocated to web pages. What Google learned early is that they needed scale, and their initial data efforts (things like site ratings by their toolbar users) never made it into their search algorithm. Folksonomy and social indexing don’t yet have enough scale to rely on and have potential for abuse, but the clickstream has scale and is harder to game, given that the traffic essentially is the authority and the people gaming the authority want traffic. If they need traffic to rank well in order to get traffic, then those manipulating rankings face a significant challenge: they need the end for their means.

But Google is cautious with their core search product and has tweaked their algorithm very conservatively. It has been hard to tell just how much of a role clickstream data plays in their search results, and it will stay that way as long as it’s such a minor supplement to the algorithm. Today, Google posted a bit of information about this on their official blog, part of their effort to shine more light on how they use private data. You can read it in full, but the basics hold no big surprises:

  • Location – Using geolocation or your input settings, they customize results slightly to your location. My estimate is that they are mainly targeting searches with local relevance. An example of such a search would be “pizza”: “pizza” is more local than “hosting” and can benefit greatly from localization. Hosting, not so much.
  • Recent Searches – Because many users learn to refine their searches when they don’t find what they’re looking for, this session state is very relevant data. A user who’s been searching for “laptops” for the last few minutes is probably not looking for fruit when they type “apple” next. They reveal that this information is stored client side and is gone when you close your browser, but since that means cookies, anyone who’s seriously looked under the hood already knows this.
  • Web History – If you’ve allowed them to track your web history through a Google account, they use it to personalize your results. They don’t say much about what they’re really doing, but this is where the most can be done, and there are far too many good ideas to list here. One example is knowing what sites you prefer. Do you always click the Wikipedia link when it’s near the top of the search results, even when higher-ranked pages are above it? Then they know you like Wikipedia and may promote its pages in your personalized results. Do you always search for home decor? Then maybe they’ll take that into consideration when you search for “design” and not give you so many results about web design. There are a lot of ways they can use this data, and this is probably an area they will explore further.

In summary, right now I’d say they are mainly going with simple personalizations, not employing aggregate data and aggregate personalization to produce aggressive differences. They are careful with their brand and will use user history with caution. After all, if personalization makes your results less relevant, it fails, and because personalization can be unpredictable (there must be some seriously weird browsing histories out there), they are going to be cautious and subtle with this.

EarthLink interested in AOL’s dialup business

Posted by – July 30, 2008

Andrew Lavallee writes for the Wall Street Journal that EarthLink is interested in purchasing AOL’s dialup subscriber business. At this point it does indeed look like a promising pairing, because there aren’t many players left trying to consolidate yesterday’s markets.

For a bit of history: AOL was late to diversify its business away from the once-dominant dial-up empire it built. It missed the broadband revolution and ended up pretty much selling access to its walled garden of content and email services to broadband users who had another internet service provider but still wanted their AOL mail.

Even worse was how late AOL was to modernize the monetization of its content. While other portals built publishing empires and search advertising businesses, AOL kept its content walled off for subscribers, essentially sticking to a paid-content model: monthly fees for dial-up access plus content, or just the content while the user paid someone else for access. AOL eventually woke up and opened its content as a public portal, but it was once again too late to be a dominant player, and now it merely has a lot of fading eyeballs that need to be sold to a real internet company.

Needless to say, the web has moved on, and AOL is a declining asset that Time Warner is famously willing to part with. Its dial-up business is a declining asset within that declining asset, and it has few takers.

Cuil Search Engine launch

Posted by – July 30, 2008

Search has been the sexy web app for years, with only social networking threatening its status as the cool king of web applications. And anyone gunning for Google seems to get a lot of attention. A husband and wife team of former Google employees launched a search engine called Cuil (pronounced “cool”) this week and the web was abuzz with the drama.

“Ex-Googlers build Google killer” was the salacious angle behind the buzz, but when the search engine actually launched, the poor quality of its results and the unreliability of the service generated a backlash. That Cuil representatives seemed testy in response to the criticism didn’t help, and I’ll go ahead and predict that this search engine won’t go anywhere.

They differentiate themselves from the current search engines with a different user interface that is simply a lot less usable. And while they claimed to have the largest index at launch, they don’t, and their relevance is behind all the major search engines’ (even Microsoft’s).

There is speculation that they launched this search engine in the hope of being bought by a company like Microsoft, which has shown a willingness to spend big money this year (the Yahoo takeover attempt, the Powerset acquisition) to acquire search IP, market share, and talent.

That may be all that Cuil brings to the table. They went cheap on the backend and can’t realistically operate at Google, Yahoo, or even MSN scale. Their best hope is that Microsoft sees enough value in them to purchase them.

Extending Firebug through plugins

Posted by – July 30, 2008

It’s happened. My favorite Firefox plugin, Firebug, is so useful and well designed that developers have decided to add to its functionality through plugins. While these are still installed and managed the same as a normal Firefox plugin, they require Firebug to be installed first. In fact, they are not extending the functionality of Firefox; they are extending Firebug itself. They are plugins for a plugin.

It started (AFAIK) with YSlow. This plugin, developed by Yahoo!, analyzes a page and determines why it is loading slowly, based on Yahoo!’s own best practices for high-performance websites. It’s most likely not the type of tool you’ll use daily, but when you reach the optimization stage of a project it is very handy.

Just the other day I needed to analyze the cookies one of my sites was using, and to do so I installed Firecookie. Overall I was impressed. It provides a dead-simple way to view, edit, and delete any cookies associated with a page. The one feature I wish it had is the ability to temporarily disable cookies without deleting them. That would be handy for quickly checking what something looks like while logged out of a site, without having to actually log out and then back in.

There are other Firebug plugins I have yet to use, and I’m sure more are currently being developed. All of this added functionality strengthens my belief that Firebug is becoming less a nicety for developers and more a necessity. It is definitely one of the larger tools in my development toolbox.

Checkout by Amazon launches

Posted by – July 29, 2008

Checkout by Amazon has launched, extending Amazon’s range of web services and diversifying the payment-processing implementations they offer. With their Flexible Payments Service and Simple Pay, they now offer a wide range of tools for building applications with complex transactions, aimed at developers without the resources to be a middleman in financial transactions. This new service seems most useful for simple e-commerce use cases.

Checkout by Amazon™ is a complete ecommerce checkout solution that provides your customers with the same secure and trusted checkout experience available on Amazon.com today. It offers unique features including Amazon’s 1-Click® and tools for businesses to manage shipping charges, sales tax, promotions, and post-sale activities including refunds, cancellations, and chargebacks.

Their range of solutions can be overwhelming, so check out a comparison of the features.

The secret to SEO

Posted by – July 29, 2008

It often seems that everyone working online claims expertise in search engine optimization. That isn’t surprising given the returns (traffic and money), but because SEO is a winner-take-most game, knowing a little bit of SEO is about as useful as being a “little bit” pregnant. And no matter how much someone knows about SEO, if they don’t execute better than their competitors, they lose.

And the thing about losing in SEO is that it’s a loser-take-none game. The returns from SEO only begin once you achieve the top rankings. Moving from, say, 50th place to 20th represents virtually no traffic gain (seriously, it could be as little as a dozen visitors a month).

So SEO is winner-take-most, where only the top spots get significant traffic, and in a competitive SEO market the secret is all about execution. You are competing directly against people who understand SEO (remember, everyone’s an “expert” here), and the insight that matters isn’t which industry names you can drop or how well you can argue the pedantry of on-page optimization. It comes down to focusing your efforts more accurately and executing better than your opponent. Does your team execute more swiftly, efficiently, and accurately than others? Does it do so without running an increased risk of penalization? Does it cope with imitation and sabotage?

If not, please remember that there is no silver bullet except having the best people in a fast-paced team who can outmaneuver the competition and the search engines’ improving algorithms. If you really want to be competitive in SEO (and given the nature of the game, you shouldn’t bother if you don’t plan to be), you need good people. And if things get competitive and the other guys have good people, you need better people.

So the secret to SEO is finding these technical rock stars to lead the critical portions of your online efforts. Businesses can be built around the right online marketing gurus, and when you find one, they won’t just tell you how to conquer your niche; unlike the other “experts”, they’ll actually do it.

Microsoft’s BrowseRank alternative to Google’s PageRank

Posted by – July 25, 2008

CNET broke a story about Microsoft’s BrowseRank (PDF link), an authority-ranking algorithm proposal out of Microsoft’s Chinese R&D labs that proposes “letting web users vote for page importance”. There isn’t much new here other than the term “BrowseRank”: Microsoft has long viewed clickstream data as a potential way to outdo Google’s search algorithm, which, like every other major search engine’s, revolves around the page-ranking system Google introduced, using links on the web to determine authority.

Using user traffic has potential if you can aggregate enough scale, but a lot of the data is behind walled gardens. You can easily crawl public web pages to count links, but access to clickstream data is not as simple. Your options are to buy ISP data, to sample traffic and extrapolate, or to collect as much as you can on your own properties (and only a few companies have the scale for that data to be useful).

So ultimately this kind of algorithm is unlikely to make a groundbreaking difference and seems destined to be a supplemental part of general ranking algorithms. We’ll see more of it, and it shows promise in smoothing out link anomalies like link farms, but it isn’t likely to be the core of a major search engine any time soon.
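
As a rough illustration of the idea (my own toy model, not the paper’s actual method, which models dwell time with a continuous-time Markov process): treat observed click transitions as a Markov chain and take its stationary distribution, where a surfer following real traffic spends their time, as the authority score. The clickstream data here is made up.

```python
# Hypothetical clickstream: page -> list of observed next-page clicks.
clicks = {
    "a": ["b", "b", "c"],
    "b": ["c"],
    "c": ["a", "b"],
}
pages = sorted(clicks)

def prob(src, dst):
    # Row-normalize observed transitions into probabilities.
    outs = clicks[src]
    return outs.count(dst) / len(outs)

# Power iteration toward the stationary distribution, with a small
# teleport factor (as in PageRank) so the chain is sure to converge.
damping = 0.85
rank = {p: 1 / len(pages) for p in pages}
for _ in range(100):
    rank = {
        p: (1 - damping) / len(pages)
        + damping * sum(rank[q] * prob(q, p) for q in pages)
        for p in pages
    }
```

In this tiny example page "c" receives the most observed traffic and ends up with the highest score, which is the point of the scheme: the authority is the traffic, so gaming the score requires already having the traffic.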

Google’s bringing it out to measure again… 1 trillion URLs!

Posted by – July 25, 2008

Google announced a new milestone of 1 trillion URLs, which is impressive enough that we might as well forgive them for bringing us back to the index-measuring wars of yesteryear. In the past, search engine bragging rights were about how much of the web an index contained. Then Google stopped publishing its index total on its home page and said it was quality (of search relevance), not quantity, that mattered.

But a trillion is a bit much to keep mum about, so there you go. It doesn’t mean much, but it’s interesting that it comes right before a stealth competitor launches a search engine they will claim is the biggest (I think it’s a coincidence).