Category: Scalability

Visualizations of Sorting Algorithms

Posted by – August 6, 2008

I’m traveling today, so here’s a quick post on an old site I found interesting. It’s a visualization of various sorting algorithms that helps illustrate the efficiencies and overall speed differences between them. To start the sorting simulations, click each one to activate its animation.

There aren’t a lot of surprises here, and you’ll see the predictable results: the parallel sorting algorithms generally outperform the sequential ones. But it’s a useful demonstration that can hammer the difference home in your mind.

Check them out here and try even more here.

You can’t sort and scale

Posted by – July 31, 2008

If you really want to scale, you are going to have to come to terms with a basic fact: you can’t sort and scale. OK, now that I have your attention through overstatement, let me apply the requisite nuance. Sure, you can sort. But you can’t sort deep into a stack efficiently.

Now if you are working on a smaller scale, where you aren’t pushing the limits of a relational database, you’ll never know this. Your expensive sorts will still be fast enough on your small datasets. And when they grow, the first pages will still be fast enough as long as you have decent database indexing. But there’s a specific pattern that will show you the wall: “the deep page in a long sort”.

I made that jewel of jargon up just now, and since it makes precious little sense, give me a chance to elaborate. When you sort a large dataset, your database will perform well early in the result set. But the further into the set you go, the worse the performance gets, because the database has to work through every earlier row in the sort order before it can hand you the next ones.

So say you are selecting from a table with 5 million records and sorting by date. Getting the first 10 results is easy, and your server doesn’t break a sweat. OK, now get the last 10 results in that sort. Your server has to work through 4,999,990 rows before it reaches your offset, and the result is a slow query, or the inability to complete the query at all.
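You can watch this happen on any database; here’s a minimal sketch using SQLite from Python, with a made-up table and a smaller row count so it runs in seconds. Even with an index on the sort column, the deep page still makes the engine walk past nearly every earlier index entry before returning anything:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created REAL)")
conn.executemany("INSERT INTO posts (created) VALUES (?)",
                 [(i * 0.001,) for i in range(200_000)])
conn.execute("CREATE INDEX idx_created ON posts (created)")

# First page: the engine walks the index and stops after 10 rows.
first = conn.execute(
    "SELECT id FROM posts ORDER BY created LIMIT 10").fetchall()

# Deep page: the engine still steps past 199,990 index entries
# before it can return these 10 rows -- the cost grows with the offset.
deep = conn.execute(
    "SELECT id FROM posts ORDER BY created LIMIT 10 OFFSET 199990").fetchall()
```

Both queries return 10 rows, but the second one does tens of thousands of times more work to get there, and that work grows linearly with the offset.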

And that’s just the way it is. That’s one reason why Google doesn’t show you all the pages in your search: they limit you to about 1,000 results. Go ahead and try to find the last page Google ranks for a term like “computers” and you’ll see what I mean.

Now there are ways around it, so those of you in the Google-can-do-anything crowd can simmer down. The real reason they don’t do it is that their results are of decreasing relevance, and therefore of little utility to the average user. But if a mere genius ;-P like myself can figure out the workarounds, I’m sure they know them too.

To work around it, you need to select a smaller subset of the data using WHERE, and only then apply your ORDER BY and offset/limit to that smaller set. For example, when developing custom community software for able2know (coming soon, I hope), where threads can get big (e.g. 75,000 posts), we amortized this sort expense over the writes. We did this by calculating each post’s position within its thread and storing it on the post table along with the post information, instead of relying on a sort by date. The last pages of a thread would otherwise have to sort all the previous posts to know what to display. This way, if we display 10 posts per page, we know that page 5 should simply query for positions 41 through 50, instead of querying the whole dataset and sorting it all to find out what should be displayed.
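The able2know schema itself isn’t shown in this post, so here’s a minimal SQLite sketch of the read side of that idea, with hypothetical table and column names. Because the position is already stored, a page is a pure WHERE range on a (thread_id, position) index: no offset, no big sort.

```python
import sqlite3

PER_PAGE = 10

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL,
    position INTEGER NOT NULL,  -- the post's place in its thread, stored at write time
    body TEXT)""")
conn.execute("CREATE INDEX idx_thread_pos ON posts (thread_id, position)")

# A thread with 1,000 posts whose positions were assigned as they were written.
conn.executemany(
    "INSERT INTO posts (thread_id, position, body) VALUES (1, ?, ?)",
    [(i, "post %d" % i) for i in range(1, 1001)])

def thread_page(conn, thread_id, page_no):
    # Page 5 at 10 posts per page means positions 41 through 50: a narrow
    # index range scan, however long the thread is.
    lo = (page_no - 1) * PER_PAGE + 1
    return conn.execute(
        "SELECT position, body FROM posts "
        "WHERE thread_id = ? AND position BETWEEN ? AND ? "
        "ORDER BY position",
        (thread_id, lo, lo + PER_PAGE - 1)).fetchall()
```

The ORDER BY is still there, but it only ever touches the 10 rows the WHERE clause selected, so its cost no longer depends on the size of the thread.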

When a user posts, we do an inexpensive check for the last position and calculate the position of the new post; by storing it at write time, we avoid the expensive sorts at read time. So if you want to sort a huge dataset, save the positions or identifiers and filter first.

In a nutshell for the SQL crowd: use WHERE, not ORDER BY, to sort and scale.