We switched to cursor-based pagination

We switched to cursor-based pagination(moderntreasury.com)

120 points by qin 3 years ago | 114 comments

simonw 3 years ago |

This post is not about database cursors. It's about the style of pagination where you have a ?_next=xxx link to get to the next page, where the xxx bit encodes details about the last item on the current page such that the next page can show everything that comes after that record.

This is also sometimes known as keyset pagination. My favourite technical explanation of that is here: https://use-the-index-luke.com/no-offset

paulmd 3 years ago | |

Sounds great on paper but my experience is that key-based indexing is broken on SOLR and likely other lucene engine DBs. If you are evaluating conditions on child objects inside a parent document, the child objects get split and stored as a separate document that is joined… and if that child document is in a different block/file on disk, then it won’t necessarily be inside the range being scanned, and so you will be missing some results that meet your logical criteria.

Possibly just an implementation error in the BlockJoinParser but it did not occur with numeric pagination.

In order to work with that, you need to “flatten” your json, like with the @JsonUnwrapped annotation, and some structures (like arrays) may become problematic and/or require significant lexical mapping of queries to the dataset.

fizx 3 years ago | | |

parent-child in lucene is a hack. It only works when the children immediately follow the parent, and that never happens when it's in a different file because you updated the child. It's a minor miracle you didn't notice the error with numeric pagination.

clementmas 3 years ago | |

Most users don't care but I very much prefer to have URLs with `/?page=123` instead of `?_next=xxx`. I should get over it but meanwhile I'll stick to offset pagination with deferred join (see aarondf's comment).

jlg23 3 years ago | |

This post is about "database cursors" and "keyset pagination". In practice, these terms refer to the same thing, one seen bottom-up the other seen top-down. Implementation-wise, one saves the state of the cursor in the pagination [parameters] and resumes reading from the DB with an equivalent cursor.

nextaccountic 3 years ago | | |

> these terms refer to the same thing

No, cursor is this https://en.wikipedia.org/wiki/Cursor_(databases) https://www.postgresql.org/docs/current/plpgsql-cursors.html

I once did pagination using database cursors, which is something different than keyset pagination: The server would keep a cursor open and keep fetching more data from the same query. This enabled the system to have an interface similar to offset pagination (you get the first page, then the second page, etc) but without doing a new query for each page discarding the first n-1 pages per query

The downside is that it makes the server stateful, and doesn't scale (you would need to keep hundreds of cursors open if you had hundreds of simultaneous users)

aarondf 3 years ago |

There are ways to mitigate the (although not eliminate) the slowing down of offset/limit pagination in later pages. The technique is called a "deferred join" and it is most effective in MySQL. The basic idea is to paginate as little data as necessary, and then do a self-join to get the rest of the data for a single page.

There are libraries for Laravel (https://github.com/hammerstonedev/fast-paginate) and Rails (https://github.com/planetscale/fast_page) as well!

Cursor based pagination is wonderful, but sometimes you're stuck with offset/limit for whatever reason. Might as well make it fast.

barrkel 3 years ago | |

To be clear, this technique (which it seems I independently discovered in 2015) mostly only works in MySQL because other databases usually have planners which are smart enough to not pull everything in eagerly.

MySQL is fairly predictable, though, so when you understand that it wants to nested-loop join all your rows before evaluating predicates on the parent table, it's a predictable win to stop it doing that.

The technique is still applicable even when you have no joins, because MySQL will materialize rows with every selected column before evaluating the unindexed portion of the predicate, and the order by.

aarondf 3 years ago | | |

Hey this is a great explanation. Most people just say "this makes no sense, can never work"

hamandcheese 3 years ago |

Cursor based pagination doesn’t actually solve the described pitfalls if your results are ordered by anything mutable, though.

A result you haven’t yet seen can be mutated to sort before your current cursor, likewise a result that you’ve already seen can be mutated to sort after the current cursor, causing you to see it twice.

Cursor based pagination does minimize the issue somewhat, because only the mutated rows are possibly affected, not unrelated records that lie on page boundaries. But depending on the use case I’m not sure that is worth that added complexity of cursor based pagination (it does get a bit tricky once you have any non-trivial sort clause).

simonw 3 years ago | |

The only approach I can think of which might be able to handle mutability of the rows used for sorting would be to support paginating through a snapshot: provide a mechanism whereby a full snapshot of the query at a specific point in time is captured such that the user can then paginate through that snapshot.

Expensive to implement, so this would only work for situations where you can afford to spend significant storage and computation resources to keep a specific user happy.

kibwen 3 years ago | | |

That might be even worse (or just as bad) for users, as now they won't see any updates to the underlying data set even if they want to, and will presumbaly need to perform some explicit action to get a new snapshot.

Personally, if you care about users not missing any item in a query, you just can't use pagination at all, and you have to give them every item in the query in a single huge dump (running the query again would be the "explicit action" mentioned above that gets the user new data). Conversely, if you use pagination, users are free to assume that they might miss some items unless they already expect the underlying data to be immutable.

mikedouglas 3 years ago | | |

This theoretically should be possible with MVCC, right? It's not an area I've explored and I could immediately see some issues with resource clean-up, but I could imagine it being possible with most modern DBs.

mbreese 3 years ago | | |

You could also store timestamps as part of the records. Then your query will always be consistent if you add an extra clause of tstamp < query_tstamp. So long as you store both the query_tstamp and the last record, you’ll get consistent results without needing to store individual cursors/snapshots for each user. (You’ll still have more CPU time per query, but that’s kind of a given here).

You probably also need to switch from deleting records to adding an archive bit (or timestamp).

This gets complicated fast.

andy800 3 years ago |

Why does every web site default to aggressively paginate their information? Pagination sucks, it's a waste of time and of clicks, and should be a last resort. Sure, when Google returns millions of search results, paginate. But:

For instance, if you have 40 entries and specify that there should be 10 items per page

40 entries??? Just show them all to me. My browser has a scrollbar, CTRL-F is much faster than your search box.

"but not everyone has a reliable fast connection" -- yes, which is a good reason to deliver more data per http request than breaking it up and requiring lots of slow requests.

"but the database load" -- half the queries, each returning 2x data, is almost always going to be easier on a RDBMS. If it's not then you probably need to rethink your schema.

feike 3 years ago |

Reminds me of Markus Winand who hands out stickers on database conferences banning offset.

His site is a great resource for anyone wanting to take a deeper dive on SQL performance:

https://use-the-index-luke.com/sql/partial-results/fetch-nex...

jvolkman 3 years ago |

For anyone curious about this approach, I've posted an excerpt [1] of the shared Python + SQLAlchemy + Postgres code we use to handle pagination at my company.

The name "cursor pagination" is super confusing given that "cursor" is an overloaded term in databases. I always call this "token pagination", given that the APIs I've seen usually call the value a token.

[1] https://gist.github.com/jvolkman/b8c0e3d05929a1506c99fbc9474...

drdec 3 years ago |

I was hoping to read about how they handled cleaning up cursors, what the effect on the server was when having a bunch of open, long-running cursors, etc. Unfortunately the article only treated the subject at a superficial level.

So, anyone here implement pagination via cursors? What do you find to be the drawbacks and how do you mitigate them?

imbusy111 3 years ago | |

You probably assume they are talking about database cursors. The cursor is just a record ID in this article. There is no long term storage of database cursors on the server. Assuming you can sort all your data, next query just returns all records after the given record ID, plus the record with that ID.

One corner case would be if the cursor record is deleted. I don't see it mentioned how they handle it.

erikpukinskis 3 years ago | | |

Does that mean you have to scan the entire results set to get the right page? So if I am on page 100, I have to query pages 1-99 and discard them?

Or is there a trick here I’m missing?

erikpukinskis 3 years ago | |

Yah, another downside of cursor-based pagination is: what happens when the record the cursor refers to is deleted?

Do you just crash and ask the user to start over?

Do you have to nudge open cursors on every delete?

simonw 3 years ago | | |

My implementation of cursors works by encoding the primary ID of the last row on the page, along with additional information corresponding to the sort order if that's needed.

That way it doesn't matter if the record is deleted - I can still return the next page by showing records that come after that provided cursor value.

There's an example on this page: https://latest.datasette.io/fixtures/sortable?_sort=sortable

Since the table is sorted by the "sortable" column, the next page link includes this:

    ?_next=15%2Cg%2Cz

15 is the last value for "sortable" on the page, then g,z are the compound primary key for that last row.

hamandcheese 3 years ago | | |

Instead of the cursor being an ID, it could directly encode whatever column(s) you are sorting by. Then you don’t have to locate any record in particular, you can always return records that sort after the cursor.

christophilus 3 years ago | |

It’s not a cursor as in a relational database cursor. It’s a cursor, as in an ID of an item in a stably ordered set. There’s no long-running anything to be worried about.

fein 3 years ago |

I have a few main requirements for what I consider easy to use pagination.

1. I can set how many items per page.

2. I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around. If an item gets removed, whatever I was looking for should still be in the same vicinity.

3. As a result of #1 and #2, I can go back and find items I saw previously because their placement is at some fairly reliable position on the page I remember it being on or around.

You know, like how a physical book or catalogue with pages works.

Please stop trying to improve on this and making usability more difficult. I hate infinity scroll and I hate not being able to pick my page in an intuitive fashion.

nemothekid 3 years ago | |

>I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around

I've had quite a few heated discussions on this point. The problem is, once your dataset gets large enough, this use case is incredibly difficult to scale; not impossible (Google does it with millions of search results), but prohibitively expensive compared to how often the need of being able to jump to any specific page arises.

Now I always try to stick to cursor based pagination as the default in order to prevent people from building workflows on top of offsets.

robertknight 3 years ago | | |

> Google does it with millions of search results

Google cheats, but in a way that very few users will notice. You can only access the first 20 pages of search results. Depending on user behavior, this is one way to offer navigation via page number while limiting worst-case cost.

int0x2e 3 years ago | |

I totally agree with you from a usability standpoint, and while it is possible to make this work well, for larger-scale services it is rarely without costs, and for the most part - people don't seem to care as much as you'd think.

While it can be frustrating, I doubt much revenue (relatively speaking) was lost due to this issue, which means for most apps and services, it will likely not get done.

(I'm not disputing the merit of your arguments, just explaining why this will rarely get done well in real life...)

debrice 3 years ago |

The pagination con given in the article is wrong, switching to stream isn’t fixing the dropping issue, removing sorting is likely why no records are being dropped (you wouldn’t drop records using a numerical sorted ID or creation date)

Pagination is incredibly useful to human. If I tell you I found something on page 15 you can relate to it, something I cannot do with infinite scroll.

indecisive_user 3 years ago | |

If a user has to go to page 15 to find something useful to them then I would argue that's a bigger failure of the UX/filtering than it is a success of pagination.

Aeolun 3 years ago | | |

Books work this way. It’s a very relatable to very many people. Conversely, few people understand why they cannot just jump to page 15.

erikpukinskis 3 years ago |

Worth noting that another solution to the problems with offset cursors is to do updates. When you add/remove rows you just update the results set and then there’s no issue with missing or duplicated rows.

Not easy to do of course, but it’s one direction you can go.

wruza 3 years ago |

Why is offset-based pagination a thing at all? It sucks when a feed is very active, it sucks when you link to a page or visit search results, where it may turn out to be hundreds of pages away in a week. I always wondered why nobody does some obvious (order_col > k, order by order_col, limit n). It’s not a two years old mistake, it’s two decades and sites still do that by default.

Aeolun 3 years ago | |

Because normal people enjoy referring to ‘item 4 on page 3’ instead of ‘search for item 57533898-2’.

At least I do. Passing numbers under 10 is easy. Passing identifiers is hell.

wruza 3 years ago | | |

This makes sense for content that grows at the end.

bjourne 3 years ago |

Ime, concurrent updates to the data set isn't a problem in practice and nobody cares if they occasionally get duplicate entries. The cluttered and sometimes useless urls cursors cause are, again ime, a much bigger usability problem. Sure, naively implemented queries suffer if the user types ?page=123456 in the url but such problems are quite easy to fix.

nickkell 3 years ago | |

How do you fix those problems then? Let’s say you have a “last page” button in the UI for example

dmacedo 3 years ago | | |

Have used the conditional that if current page is greater than last page, just return the last. And same with negative just returning the first. If records are updated / deleted and the last page changed, then you'll just get the results of what the "new last page" are.

At scale you might care about the duplicate or up-to-date records. But cursor-based doesn't solve the problem if a results page is left open for "too long" and stuff was added behind your cursor (or after it, if navigating backwards).

It's as if making things less intuitive (to articles' reference to book pages), makes it any easier as long as you don't think about any pitfalls.

My suggestion is to just use pages, and optimise for the right order (I.e.: sequential IDs, or creation date, alphabetical, etc) that make sense for your data.

If you REALLY must care if results have changed, some delta being stored would be best (like some timestamp that allows the server side to indicate "hey, your results are out of date, from 7 days ago, maybe you left that page/API response unused for too long")

bjourne 3 years ago | | |

Don't have a last page button. :) Or limit the number of results to, say, 1000, which is trivial for an rdbms to handle. Or precompute the result set's rankings and transform the offset-limit query into a "where rank >= 991 and rank <= 1000" query.

mike_hock 3 years ago |

TL;DR: The headline. TFA doesn't really add any information.

G3rn0ti 3 years ago | |

Well, except that TFA explains what DB cursors are and why they are faster than page offsets (because they skip DB entries).

jsmith99 3 years ago | | |

I'm pretty sure they aren't using actual SQL cursors as that wouldn't scale well so it's probably not a 'DB cursor'. It's a shame as actual cursors are the native database way to do pagination.

thaumasiotes 3 years ago | | |

> TFA explains what DB cursors are and why they are faster than page offsets (because they skip DB entries)

Except that no such information appears in the article. Here's the entirety of the explanation:

>> Once you’ve been returned a cursor, you can provide it in your requests to act as a farther-along starting point within your data entries. The server is then able to efficiently skip all entries that come before your specified cursor value

theteapot 3 years ago | | |

TFA isn't talking about that kind of cursor.

theogravity 3 years ago | | |

It needs to talk about how to actually implement it. Most articles like this one mention its existence and why it's good, but generally stop there.

camgunz 3 years ago |

I think the challenge is always the stateless nature of HTTP. It's pretty hard to keep a connection to the API server that has your database cursor.

Everything else has big tradeoffs. You can make sure your rows are sortable by a specific column, but that means your searches must all be sorted by that column.

You can save the results of a search and page through it, but that's a lot of I/O.

You can not do any of the above, but then you run the risk of non-deterministic pagination (I went back and because someone added/deleted a row, that page is slightly different now). You also can't really link to it.

These things might all be fine. In my experience though, you run into people who really want to do search in way A, and will use the tradeoffs for way B/C/D to argue against it, as though A has no tradeoffs.

WesleyJohnson 3 years ago |

At my current job, our intranet site has lackluster performance due, in part, to limit/offset pagination. Unfortunately, the business treats the "reports" we author like glorified spreadsheets and want the ability to filter on any column and order by any column. It makes it near impossible to tune/optimize.

The legacy system we're replacing used cursor pagination in a lot of areas and was perfectly acceptable to them, but now that we're going web based - it's not. Unfortunately, really - it seems vastly superior.

yrgulation 3 years ago | |

“It makes it near impossible to tune/optimize.”

I recommend using elastic search or a nosql database to optimise performance. Relational databases can be slow for this use case.

WesleyJohnson 3 years ago | | |

Thanks, I'll have to search for some case studies on the topic to help me understand the approach. We use ElasticSearch in one small area to search for products across multiple points of metadata and I've always wondered if we were underutilizing ES in general.

jitl 3 years ago | | |

What kind of NoSQL database are you thinking about? What strategy would you take with that database to optimize this problem?

hirundo 3 years ago |

My sympathies to the coders downstream of this large change for the amount of work required. I would like to add cursor based pagination to the APIs I manage. It would not be an option to go to our clients and explain that offset-pagination will be removed. There seems to be very little cost in supporting both.

redditor98654 3 years ago | |

AWS does the pagination thing for all LIST APIs and they encode the token in a way they clients cannot deconstruct anything out of it. For them it is just an opaque string. In fact it is cryptographically encrypted and not something simple like a Base64 encoding of some Json data.

And the token has enough metadata to reconstruct the query from where it left behind in the previous call and perform additional security checks so that it cannot be used by other customers to get results from a different account.

I have actually migrated databases from Aurora to DynamoDb behind the scenes where customers continued to call these APIs with these tokens and the calls would flip to the new DB and resume from the last place. No downtime or "maintenance window" which would be a non-starter at AWS anyway.

I recommend this for any external service. For internal services, if you work closely enough with your users, may be you can opt for something more transparent.

mgkimsal 3 years ago | |

I didn't see anything specific about timelines. I would expect you'd want to offer both for some length of time, with some deprecation notice and a sunset date for the older approach. But perhaps they seem use cases in their logs that the current offset is used minimally already, and it's better to switch now vs later if/when adoption is higher?