hype-free: Efficient SQL pagination method

The usual technique for displaying data from an SQL query as multiple web-pages involves the LIMIT/OFFSET clause. For example for the first page the query would look like something like:

SELECT foo, bar, baz FROM ozz WHERE ... ORDER BY ... OFFSET 0 LIMIT 10

For the second page you would do:

SELECT foo, bar, baz FROM ozz WHERE ... ORDER BY ... OFFSET 10 LIMIT 10

And so on. There are two disadvantages to this method:

The smaller one: it can loose or duplicate data (from the user's point of view, not from the DB's point of view). This happens if data is inserted/deleted at a location which (given our ordering criteria) falls before the current offset. In this case, going a page forward can result in: (a) seeing some elements from the previous page or (b) skipping over some elements (they are not shown either in the current or the next page).

These are usually small problems because (a) mostly data paginated is fairly static (so the probability of insertion/deletion occurring is fairly small) (b) it is tolerable by users (they shouldn't use this method as a way to search the data, rather they should use the specialized search features) and (c) the likelihood of accessing a page decreases with their offset (users are more likely to access the first few pages than the last ones), which is great, since the likelihood of these errors manifesting themselves increases with the page offset (higher offset means a higher chance that the insertion/deletion will occur before it)
The second problem, which has the potential of becoming a big one given enough database entries, is the one of performance. Even if you specify a LIMIT clause, the database internally still has to fetch OFFSET + LIMIT rows. This value can become quite large.

Here the counterarguments are: most databases are quite small. These types of queries aren't a problem for well speced out servers for tens of thousands of rows. Also, as mentioned before, the likelihood of an user accessing a high-number page (thus creating a high-offset query) is fairly small.

While there are some good arguments for the fact that this isn't a big problem, I would still like to present an alternative solution which can be applied in the few cases when this is a problem. Of course, the solution itself creates and other set of problems which I'll discuss.

The presumption behind this idea is that our ordering criteria is backed by an index (there is an index which it can rely on to do the ordering). If this isn't the case, there are bigger problems. If this is the case, then we can use this index to fetch only the relevant data from the table.

The B-Tree indexes which you'll find in most current RDBMS's, support the "<" (less than) and ">" (greater than) operators. So, if we could translate the query "elements from offset 10" into "elements greater than X", we could use the index and thus avoiding fetching more rows than needed.

We can do this by transmitting the relevant field of the last element in the get query instead of the page number. IE: instead of ".../foo?page=10" the URL would look like ".../foo?after=1234". One word of caution: if we don't want to expose the ordering criteria, this solution isn't appropriate or symmetrical crypto needs to be used to protect the value.

Now we can rewrite our initial queries into:

SELECT foo, bar, baz, zed FROM ozz WHERE ... AND zed > 1234 ORDER BY zed ASC LIMIT N

But what about stepping backwards (clicking on the "previous page" link)? In this case we are interested in the biggest N elements (where N is the number of elements per page you want to display) smaller than the first element displayed on the page. Again, we transmit this using the get query, writing instead of ".../foo?page=9" the URL ".../foo?before=1200" (supposing that we clicked "previous page" from page 10 - if we would have clicked "next page" from page 8, we would still have an "after" query). Now the query becomes:

SELECT foo, bar, baz, zed FROM ozz WHERE ... AND zed < 1200 ORDER BY zed DESC LIMIT N

One observation: the results returned by this query are in reverse of the order you would like to display them in, so you need to reverse them in code before displaying. If you don't want to change your code at all, you can use the DB to do this for you like this (this will be still fairly efficient, since the final sorting is applied only to the retrieved subset):

SELECT * FROM 
  (SELECT foo, bar, baz, zed 
    FROM ozz 
    WHERE ... AND zed < 1200 
    ORDER BY zed DESC LIMIT N)
  ORDER BY zed ASC

Two observations:

The column we base our sorting criteria on got added in the rows we retrieve. This is necessary because we need to transmit the value of this column for the first and and last element from the page through the "previous" and "next" links.

Also, in all the examples I supposed that the sorting order we want to display the elements in on the page is ascending. If it is descending, things must be swapped around.

Finally, what are the advantages / disadvantage of this method?

It solves the problem of loosing/duplicating elements when they get inserted/deleted before the current elements.
On the flipside, it can't guarantee you that (a) when you go next -> next -> next and then previous -> previous -> previous you will get the same elements on the page (if insertions / deletions occurred) and even worse (b) you can end up with fewer than N elements on the first page (which might be surprising for your visitors. This can be solved by handling the first page in a "special" way.
It is more complex to implement and might not fit well with existing query abstraction solutions (ORMs for example)
There isn't really an easy way to jump X pages forward or backward or to estimate the "number of pages". It might be acceptable for you to provide only "forward" and "back" links, without direct links to specific pages. Going X pages in either direction can be implemented by repeatedly issuing the query. To estimate the number of pages, you might do a "SELECT COUNT(1) FROM ... WHERE ..." for your criteria, which might be more memory efficient than actually fetching all the rows (depending on your RDBMS)

As always, there is no clear cut "best" solution, you must look at the restrictions your particular project has and decide which solution applies best to it.

Update: See an other take on this matter on the MySQL performance blog - Four ways to optimize paginated displays.

3 comments:

izenmania7:12 PM
As with any method that utilizes an index for sorting, this also runs into trouble in circumstances where the sort column is user-selected, because it can require more indexes than is practical (assuming that there are a frequent inserts which will be slowed down by the indexes).
Cd-MaN2:12 PM
@izenmania: very true. The (silent) assumption behind this post was that you have one (or few) columns which you would like to sort/paginate by (a creation date for example). In case you have more complicated situations (like ranking results based on some - dynamically generated - relevance score), this method can not be used.
Anonymous9:48 AM
Hard to understand