We switched to cursor-based pagination(moderntreasury.com) |
We switched to cursor-based pagination(moderntreasury.com) |
This is also sometimes known as keyset pagination. My favourite technical explanation of that is here: https://use-the-index-luke.com/no-offset
Possibly just an implementation error in the BlockJoinParser but it did not occur with numeric pagination.
In order to work with that, you need to “flatten” your json, like with the @JsonUnwrapped annotation, and some structures (like arrays) may become problematic and/or require significant lexical mapping of queries to the dataset.
No, cursor is this https://en.wikipedia.org/wiki/Cursor_(databases) https://www.postgresql.org/docs/current/plpgsql-cursors.html
I once did pagination using database cursors, which is something different than keyset pagination: The server would keep a cursor open and keep fetching more data from the same query. This enabled the system to have an interface similar to offset pagination (you get the first page, then the second page, etc) but without doing a new query for each page discarding the first n-1 pages per query
The downside is that it makes the server stateful, and doesn't scale (you would need to keep hundreds of cursors open if you had hundreds of simultaneous users)
You can read more about it here: https://aaronfrancis.com/2022/efficient-pagination-using-def... or here https://planetscale.com/blog/fastpage-faster-offset-paginati....
There are libraries for Laravel (https://github.com/hammerstonedev/fast-paginate) and Rails (https://github.com/planetscale/fast_page) as well!
Cursor based pagination is wonderful, but sometimes you're stuck with offset/limit for whatever reason. Might as well make it fast.
MySQL is fairly predictable, though, so when you understand that it wants to nested-loop join all your rows before evaluating predicates on the parent table, it's a predictable win to stop it doing that.
The technique is still applicable even when you have no joins, because MySQL will materialize rows with every selected column before evaluating the unindexed portion of the predicate, and the order by.
A result you haven’t yet seen can be mutated to sort before your current cursor, likewise a result that you’ve already seen can be mutated to sort after the current cursor, causing you to see it twice.
Cursor based pagination does minimize the issue somewhat, because only the mutated rows are possibly affected, not unrelated records that lie on page boundaries. But depending on the use case I’m not sure that is worth that added complexity of cursor based pagination (it does get a bit tricky once you have any non-trivial sort clause).
Expensive to implement, so this would only work for situations where you can afford to spend significant storage and computation resources to keep a specific user happy.
Personally, if you care about users not missing any item in a query, you just can't use pagination at all, and you have to give them every item in the query in a single huge dump (running the query again would be the "explicit action" mentioned above that gets the user new data). Conversely, if you use pagination, users are free to assume that they might miss some items unless they already expect the underlying data to be immutable.
You probably also need to switch from deleting records to adding an archive bit (or timestamp).
This gets complicated fast.
For instance, if you have 40 entries and specify that there should be 10 items per page
40 entries??? Just show them all to me. My browser has a scrollbar, CTRL-F is much faster than your search box.
"but not everyone has a reliable fast connection" -- yes, which is a good reason to deliver more data per http request than breaking it up and requiring lots of slow requests.
"but the database load" -- half the queries, each returning 2x data, is almost always going to be easier on a RDBMS. If it's not then you probably need to rethink your schema.
His site is a great resource for anyone wanting to take a deeper dive on SQL performance:
https://use-the-index-luke.com/sql/partial-results/fetch-nex...
The name "cursor pagination" is super confusing given that "cursor" is an overloaded term in databases. I always call this "token pagination", given that the APIs I've seen usually call the value a token.
[1] https://gist.github.com/jvolkman/b8c0e3d05929a1506c99fbc9474...
So, anyone here implement pagination via cursors? What do you find to be the drawbacks and how do you mitigate them?
One corner case would be if the cursor record is deleted. I don't see it mentioned how they handle it.
Or is there a trick here I’m missing?
Do you just crash and ask the user to start over?
Do you have to nudge open cursors on every delete?
That way it doesn't matter if the record is deleted - I can still return the next page by showing records that come after that provided cursor value.
There's an example on this page: https://latest.datasette.io/fixtures/sortable?_sort=sortable
Since the table is sorted by the "sortable" column, the next page link includes this:
?_next=15%2Cg%2Cz
15 is the last value for "sortable" on the page, then g,z are the compound primary key for that last row.1. I can set how many items per page.
2. I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around. If an item gets removed, whatever I was looking for should still be in the same vicinity.
3. As a result of #1 and #2, I can go back and find items I saw previously because their placement is at some fairly reliable position on the page I remember it being on or around.
You know, like how a physical book or catalogue with pages works.
Please stop trying to improve on this and making usability more difficult. I hate infinity scroll and I hate not being able to pick my page in an intuitive fashion.
I've had quite a few heated discussions on this point. The problem is, once your dataset gets large enough, this use case is incredibly difficult to scale; not impossible (Google does it with millions of search results), but prohibitively expensive compared to how often the need of being able to jump to any specific page arises.
Now I always try to stick to cursor based pagination as the default in order to prevent people from building workflows on top of offsets.
Google cheats, but in a way that very few users will notice. You can only access the first 20 pages of search results. Depending on user behavior, this is one way to offer navigation via page number while limiting worst-case cost.
While it can be frustrating, I doubt much revenue (relatively speaking) was lost due to this issue, which means for most apps and services, it will likely not get done.
(I'm not disputing the merit of your arguments, just explaining why this will rarely get done well in real life...)
Pagination is incredibly useful to human. If I tell you I found something on page 15 you can relate to it, something I cannot do with infinite scroll.
Not easy to do of course, but it’s one direction you can go.
At scale you might care about the duplicate or up-to-date records. But cursor-based doesn't solve the problem if a results page is left open for "too long" and stuff was added behind your cursor (or after it, if navigating backwards).
It's as if making things less intuitive (to articles' reference to book pages), makes it any easier as long as you don't think about any pitfalls.
My suggestion is to just use pages, and optimise for the right order (I.e.: sequential IDs, or creation date, alphabetical, etc) that make sense for your data.
If you REALLY must care if results have changed, some delta being stored would be best (like some timestamp that allows the server side to indicate "hey, your results are out of date, from 7 days ago, maybe you left that page/API response unused for too long")
Except that no such information appears in the article. Here's the entirety of the explanation:
>> Once you’ve been returned a cursor, you can provide it in your requests to act as a farther-along starting point within your data entries. The server is then able to efficiently skip all entries that come before your specified cursor value
Everything else has big tradeoffs. You can make sure your rows are sortable by a specific column, but that means your searches must all be sorted by that column.
You can save the results of a search and page through it, but that's a lot of I/O.
You can not do any of the above, but then you run the risk of non-deterministic pagination (I went back and because someone added/deleted a row, that page is slightly different now). You also can't really link to it.
These things might all be fine. In my experience though, you run into people who really want to do search in way A, and will use the tradeoffs for way B/C/D to argue against it, as though A has no tradeoffs.
The legacy system we're replacing used cursor pagination in a lot of areas and was perfectly acceptable to them, but now that we're going web based - it's not. Unfortunately, really - it seems vastly superior.
I recommend using elastic search or a nosql database to optimise performance. Relational databases can be slow for this use case.
And the token has enough metadata to reconstruct the query from where it left behind in the previous call and perform additional security checks so that it cannot be used by other customers to get results from a different account.
I have actually migrated databases from Aurora to DynamoDb behind the scenes where customers continued to call these APIs with these tokens and the calls would flip to the new DB and resume from the last place. No downtime or "maintenance window" which would be a non-starter at AWS anyway.
I recommend this for any external service. For internal services, if you work closely enough with your users, may be you can opt for something more transparent.
You get to see more ads while flipping through pages.
Timing metrics for loading smaller pages make marketing happy.
Timing metrics for time spent on the website make marketing happy.
Because data retrieval costs money and resources. Unbounded API responses are a susceptible to denial-of-service. Well-architected APIs will always cap the number of results they return per-call.
It really depends on what kind of site we're talking about; there's loads of times I want to see loads of results. Sometimes I don't, but there's no harm is showing me more than I want, either.
In the general case this isn’t a good excuse, there’s no reason it should take that long. But you don’t always get to pick your technology.
It’ll be like Jira, pulling in literal megabytes of data and still slow.
HN is pretty minimalistic, but the point is that text data isn't really what makes requests very large. For reference, 2MB is enough to include an entire book and then some. All of Dune is about 1.7M uncompressed in epub format (and ~1.1M compressed as an epub file). HN threads, Jira discussions, etc. can get pretty large, but I have not seen any yet that have started to approach the size of Dune.
Images might add more data, but those can be loaded on-demand when scrolled into view pretty easily.
Indeed.com's search results are almost all text. Maybe a company avatar. No reason it can't provide 100+ job listings per page rather than 10-20.
I agree with OP that if you are on a powerful machine with fast internet it would be better just to load a massive HTML document of list items.
https://old.reddit.com/?count=25&after=t3_wtpvdp
I noticed Reddit's pagination has that "after" parameter, which points to the last post on the current page.
It glitches out if the last item is deleted by moderators, but otherwise it works smoothly.
https://en.wikipedia.org/w/index.php?title=Category:Living_p...
No need for row value syntax and it works with MS SQL Server
> As we saw, plain keyset pagination offers no facility to jump a certain percentage into the results except through client guesswork. However the PostgreSQL statistics collector maintains per-column histograms of value distribution. We can use these estimates in conjunction with limits and small offsets to get fast random-access pagination through a hybrid approach.
Unless you can guarantee your data is static, or that the sorting order cannot be mutated and only append later values, the concept of what data belongs in which page could be changing every millisecond.
Edit: you might think of the list results like a materialized view. It works really well with natural ordering on something like item create time and you can pass that in as a query param.
I agree, page sizes are usually far too short by default! Especially when I'm on a desktop computer and a fiber connection. I think people using phones and cellular connections probably appreciate short page sizes, so it would be nice if sites had different page sizes for different clients.
The only solution I can think of for that is to track which individual posts have been shown to which users, which is quite a lot of work to do.
database cursors are a different thing.
They never said what "cursor pagination" is or used the term at all. In this chain of comments, yours is the first one to use the term.
The comment you're replying to, like the original comment in this chain, merely said that this post is not about using "database cursors" for pagination. Which is correct, and you agreed that this is a different thing.
I would bet the top-line number of results is some sort of extrapolated estimation thing, a hand-wavey way to represent how deep the hypothetical result set is for that query.
We struggled with this at a prior job because records within the most recent ~10 minutes could be delayed for any reason. We'd be collecting data from thousands of IoT devices and any of them could have momentarily lost a network connection for a few minutes, been rebooted, etc. So it'd be totally normal for end users to see records from 30 seconds ago, but then a few minutes later, see older records appear in between those records and older records because there were minor delays in some of them getting sent and processed.
We were leaning toward just putting a bound on the range that the end user could see (for example, only show records that are at least ten minutes old) to reduce the variability, but that gave the appearance that our system took ten minutes to process all data. It was a tricky one to solve from a user expectation perspective.
I’m not actually a fan of this approach as I think it causes too many scaling issues. If I were to do this, I’d either keep a database cursor in my session or store the full List of PKs in my session for the current query. This way you’d have the order of results as they existed at the time or the query. You still have to deal with the UPDATE/DELETE issue, but o guess this just depends on how much you care about consistent query pagination.
I’m not that much of a stickler and most datasets don’t change fast enough to make it an issue.
https://www.postgresql.org/docs/current/plpgsql-cursors.html
Basically, you declare a cursor like that:
DECLARE cname CURSOR FOR <select statement>;
You can pass the cursor’s name „cname“ (and even store it on the client side, although, best encrypted) and obtain the „next“ slice of your data corresponding to the query on demand like that:
FETCH next cname;
Not sure you really gain that much performance on everyday queries but with a large number of rows it might.
What's the use case for needing the 100th page of a query result, that also doesn't allow you to cache the 100th page locally to retrieve it later?
There’s no SQL clause for “WHERE IT WOULD COME AFTER id=ah73d IN THE RESULT SET” as far as I’m aware.
Also what if there are more than $page_size records with that 15 value?
More than page_size records with that value works fine - that's why the primary key is included as a tie-breaker.
Would the created_at be a better option than the primary key? That can’t be deleted or changed, and that would give you a stable secondary sort?
> If your column is fairly uniformly distributed you can guess the index for any arbitrary page.
I don't think that'll work in a multi-tenancy situation with complex filters.
It definitely isn’t in the users’ best interest to have any method of scrolling through a lot of records.
Anyway, the gist of it is that you store data in denormalised documents whereby searchable columns become keys of a single entry. The storage is a secondary storage not the main data source. You write data in both - sync in your relational db, async via a queue or what works best for your infrastructure. Searches are then made against it. If you go for es you can rank results assuming filtering is done using free text search. I prefer es, the of flavour of nosql doesn't matter, but es is great for free text search.
And even if you do decide that it makes sense to split your analytics data into a separate database system entirely, document-oriented databases "ain't it".
I have very little experience with Elasticsearch, although I'm surprised to hear it being recommended for anything other than full text search. GP was talking about spreadsheet-style filtering, not full-text search.
But I do have a good amount of hands-on experience in production with Mongo, and I can say for sure that it is absolutely not a good option for an analytics database, and that I would much rather deal with traditional joins and filters. Even using it as a primary "application database" for a web app was a constant headache from beginning to end. I never thought I would miss MySQL so much, and I would never consider using Mongo again unless I had the very specific use case of storing a high volume of semi-structured data with no consistent schema, and even then I would convert the data to a more traditional tabular/relational format for use in a data warehouse if the business analysts or data scientists needed to work with it.
For a set of reports with consistent column names and values made of aggregate or static data what you are proposing works fine - since you can just increase counters or cache values as data comes in.
But for a use case where different types of reports have varying columns you can just dump everything into es documents and run basic aggregate queries. Or you can precompute data when making inserts.
Anyway, i am assuming too much about the use case, my bad. I’d have to hear more about it to defend my point.