If you really want to scale, you are going to have to come to terms with a basic fact: you can’t sort and scale. Ok, now that I have your attention through overstatement, let me apply the requisite nuance. Sure, you can sort. But you can’t sort deep into a big stack of data efficiently.
Now, if you are working on a smaller scale where you aren’t pushing the limits of a relational database, you’ll never know this. Your expensive sorts will still be fast enough with your small datasets. And when they grow, the first pages will still be fast enough as long as you have decent database indexing. But there’s a specific pattern that will show you the wall: “the deep page in a long sort”.
I made that jewel of jargon up just now, and since it makes precious little sense, give me a chance to elaborate. When you sort a large dataset, your database will perform well early in the result set. But the further into the set you go, the worse the performance gets, because the database has to work through every preceding row in the sort before it can hand you the next ones.
So say you are selecting from a table with 5 million records and sorting by date. Getting the first 10 results is easy, and your server doesn’t break a sweat. Ok, now get the last 10 results in that sort. Your server has to sort its way past 4,999,990 rows before it even reaches the offset, and the result is a slow query, or a query that never completes at all.
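To make that concrete, here is roughly what the two queries look like. This is just a sketch; the posts table and created_at column are made-up names, and the exact behavior depends on your database and indexes:

```sql
-- Hypothetical 5-million-row posts table with an index on created_at;
-- table and column names are made up for illustration.

-- The first page is cheap: the database walks the sort from the start
-- and stops after 10 rows.
SELECT id, title, created_at
FROM posts
ORDER BY created_at
LIMIT 10;

-- The deep page is the killer: the database still has to sort or scan
-- its way past 4,999,990 rows before it can return the last 10.
SELECT id, title, created_at
FROM posts
ORDER BY created_at
LIMIT 10 OFFSET 4999990;
```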
And that’s just the way it is. That’s one reason why Google doesn’t show you all the pages in your search; they limit you to about 1,000 results. Go ahead and try to find the last page ranked on Google for a term like “computers” and you’ll see what I mean.
Now there are ways around it, and those of you in the Google-can-do-anything crowd can simmer down. The real reason they don’t is that their results are of decreasing relevance and therefore of little utility to their average user. But if a mere genius ;-P like myself can figure out the workarounds, I’m sure they know them all as well.
To work around it, you need to select a smaller subset of the data using WHERE, and then sort that smaller dataset with your ORDER BY and offset/limit. For example, when developing custom community software for able2know (coming soon, I hope), where threads can get big (e.g. 75,000 posts), we amortized this sort expense over the writes. We did this by calculating each post’s position within its thread and storing it on the post table with the rest of the post information, instead of relying on a sort by date. The last pages of a thread would otherwise have to sort all the previous posts just to know what to display, but this way we know that if we display 10 posts per page, then page 5 should query for positions 41 through 50, instead of querying the whole dataset and sorting all of it to find out what should be displayed.
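Here is a rough sketch of what that page query looks like once positions are stored. The table and column names (posts, thread_id, position) are illustrative, not able2know’s actual schema:

```sql
-- Page 5 of a thread at 10 posts per page: a narrow WHERE range,
-- not a deep OFFSET over the whole thread.
SELECT *
FROM posts
WHERE thread_id = 123
  AND position BETWEEN 41 AND 50
ORDER BY position;   -- only the 10 selected rows are left to order
```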
When a user posts, we can do an inexpensive check for the last position in the thread and calculate the position of the new post; by storing it at write time, we save ourselves the expensive sorts at read time. So if you want to sort a huge dataset, save the positions or identifiers and filter first.
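The write-time side might look something like the following, again with illustrative names; in a real system you would also guard against two simultaneous posts grabbing the same position (with a lock or a unique constraint on thread_id and position):

```sql
-- Find the current last position in the thread and store last + 1 on the new post.
INSERT INTO posts (thread_id, author_id, body, position)
SELECT 123, 456, 'New reply text', COALESCE(MAX(position), 0) + 1
FROM posts
WHERE thread_id = 123;
```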
In a nutshell for the SQL crowd: use WHERE, not ORDER BY, to sort and scale.
Great post. I just wanted to clarify one thing. When limiting your result set using WHERE, it will return the correct items you need for the given page. However, these results themselves aren’t necessarily in the correct order for the page (you will get positions 41-50, but not in numerical order), so you still need an ORDER BY to get what you’re looking for. The point Robert is making is that by using a WHERE clause first, you are only ordering 10 posts, as opposed to all the previous posts.
Thanks for making that initial jumble clearer. I’ll do a follow-up sometime on specific use cases and how to work around the long/impossible queries.