What's good about offset pagination; designing parallel cursor-based web APIs (brandur.org)

100 points by clra 5 years ago | 51 comments

gampleman 5 years ago |

To point out the obvious: generally API providers don’t particularly want you to parallelize your requests (they even implement rate limiting to make it harder on purpose). If they wanted to make it easy to get all the results, they would allow you to access the data without pagination - just download all the data in one go.

eyelidlessness 5 years ago | |

A certain level of parallelism is generally within the realm of good API citizenship. Even naive rate limiting schemes tend to permit a certain number of concurrent requests (as they well should, since even browsers may perform concurrent requests without any developer intervention).

Rate limiting and pagination aren’t (necessarily) about making full data consumption more difficult. They’re more often about optimizing common use cases and general quality of service.

Edit to add: in certain circles (e.g. those of us who take REST and HATEOAS as baseline HTTP API principles), parallelism is not just expected but often encouraged. A service can provide efficient, limited subsets of a full representation and allow clients to retrieve as little or as much of the full representation as they see fit.

corty 5 years ago | | |

One thing that frequently bugs me is APIs limiting number of items per page for reasons of efficiency. I can perfectly understand low limits for other reasons, like not helping people scrape your data.

But limiting for efficiency is usually done in a way that I would call a cargo cult. First, the number of items per "page" is usually a number one would pick per displayed page, in the range of 10 to 20. This is inefficient in the general case: at that size, the amount of data transmitted is usually about the same size as the request plus response headers. So if the API isn't strictly for display purposes, pick a number of items per page that balances not transmitting too much useless data against keeping query and response overhead low. Paginate in chunks of 100kB or more.

In terms of computation and backend load, pagination can be as expensive for a 1-page query as for a full query. Usually this occurs when the query doesn't directly hit an index or similar data structure, so a full sweep over all the data cannot be avoided. So think and benchmark before you paginate, and maybe add an index here and there.

sb8244 5 years ago | |

> If they wanted to make it easy to get all the results

Speaking from experience...we want to make it easy but also want to keep it performant. Getting the data all in one go is generally not performant and is easy to abuse as an API consumer. For example, always asking for all of the data rather than maintaining a cursor and secondary index (which is so much more performant for everyone involved).

alexchamberlain 5 years ago | | |

We provide (internal) access to data where we provide interactive access via GraphQL-based APIs and bulk access via CSV or RDF dumps - I feel like dump files are grossly undervalued these days.

tshaddox 5 years ago | | |

That’s the point. Running multiple paginated queries in parallel is essentially circumventing the API provider’s intent to limit the number of items requested at one time.

felixhuttmann 5 years ago |

A few thoughts:

1) AWS DynamoDB has a parallel scanning functionality for this exact use case. https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

2) A typical database already internally maintains an approximately balanced B-tree for every index. Therefore, it should in principle be cheap for the database to return a list of keys that approximately divide the key range into N similarly large ranges, even if the key distribution is very uneven. Is somebody aware of a way this information could be obtained in a query in e.g. Postgres?

3) The term 'cursor pagination' is sometimes used for different things, either referring to an in-database concept of cursor, or sometimes as an opaque pagination token. Therefore, for the concept described in the article, I have come to prefer the term keyset pagination, as described in https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin.... The term keyset pagination makes it clear that we are paginating using conditions on a set of columns that form a unique key for the table.
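To make the keyset idea in 3) concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `items` table, its `(created, id)` key, and the helper names `fetch_page` / `fetch_all` are made up for illustration; they are not from the article or from any particular API. Each page is selected by a condition on the key columns instead of an OFFSET.

```python
import sqlite3

def fetch_page(conn, after=None, limit=3):
    """Return up to `limit` rows ordered by (created, id), strictly after key `after`."""
    if after is None:
        return conn.execute(
            "SELECT created, id, name FROM items ORDER BY created, id LIMIT ?",
            (limit,),
        ).fetchall()
    # `id` breaks ties in `created`, so (created, id) is a unique, total ordering.
    return conn.execute(
        "SELECT created, id, name FROM items "
        "WHERE created > ? OR (created = ? AND id > ?) "
        "ORDER BY created, id LIMIT ?",
        (after[0], after[0], after[1], limit),
    ).fetchall()

def fetch_all(conn, limit=3):
    """Walk the whole table page by page, carrying the last key seen as the cursor."""
    after, rows = None, []
    while True:
        page = fetch_page(conn, after, limit)
        if not page:
            return rows
        rows.extend(page)
        after = (page[-1][0], page[-1][1])

# Hypothetical data: several rows share the same `created` value.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, created INTEGER NOT NULL, name TEXT)"
)
conn.executemany(
    "INSERT INTO items (id, created, name) VALUES (?, ?, ?)",
    [(i, 1000 + i // 2, f"item-{i}") for i in range(1, 8)],
)
all_rows = fetch_all(conn)
```

Because the condition is on the key itself, rows inserted or deleted before the cursor position do not shift later pages, which is the main practical difference from OFFSET.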

ComodoHacker 5 years ago | |

>a way where this information could be obtained in a query

There's no standard way because index implementation details are hidden for a reason.

>in e.g. postgres

You can query pg_stats view (histogram_bounds column in particular) after statistics are collected.
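A rough sketch of how those bounds could be turned into N scan ranges. The helper names and the sample numbers are made up; the only Postgres-specific facts assumed are that `pg_stats.histogram_bounds` holds values dividing the column into groups of roughly equal population, and that most-common values are left out of the histogram, so the split is approximate. The query in the comment is illustrative only, with a hypothetical table and column.

```python
# Illustrative only (hypothetical table/column names):
#   SELECT histogram_bounds FROM pg_stats
#   WHERE tablename = 'items' AND attname = 'id';

def split_points(histogram_bounds, n_ranges):
    """Pick n_ranges - 1 cut points from equal-population histogram bounds."""
    if n_ranges < 1:
        raise ValueError("n_ranges must be at least 1")
    buckets = len(histogram_bounds) - 1  # each adjacent pair of bounds is one bucket
    if buckets < n_ranges:
        raise ValueError("not enough histogram buckets for that many ranges")
    # Interior bounds at evenly spaced bucket positions.
    return [histogram_bounds[(i * buckets) // n_ranges] for i in range(1, n_ranges)]

def to_ranges(cuts):
    """Turn cut points into half-open (low, high) ranges; None means unbounded."""
    edges = [None] + list(cuts) + [None]
    return list(zip(edges[:-1], edges[1:]))

# Made-up bounds for a skewed key distribution (8 buckets).
bounds = [1, 4, 9, 20, 45, 90, 200, 800, 5000]
cuts = split_points(bounds, 4)
ranges = to_ranges(cuts)
```

Each resulting range can then be scanned independently, with the open-ended first and last ranges catching keys outside the sampled bounds.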

adontz 5 years ago |

I believe data export and/or backup should be a separate API, which is low priority and ensures consistency.

Here we just see regular APIs being abused for data export. I'm rather surprised the author did not face rate limiting.

eyelidlessness 5 years ago | |

Coming from a REST perspective, I wouldn’t implement a separate API, I would use HTTP semantics (e.g. headers or, if truly necessary, query params) on the resource listing to indicate the export/sync intention. Likely with an Accept header. If pagination is still preferred/required, the service could return an ETag or some other continuation token which, when provided in subsequent requests, could be used to indicate the consistent snapshot being requested. Since this is entirely optional, clients could use this mechanism to opt into stable/parallelizable requests (as I described with less specificity in another sub thread).

At this point, if these requests are expensive, you have an opportunity to use a very simple (and optimistic) cache for good faith API users, relegate rate limiting to prevent abuse of cache creation (which should be even easier to detect than just overzealous parallelism), and even use the same or similar semantics to implement deltas for subsequent export/sync.

adontz 5 years ago | | |

I can hardly imagine a consistent, integral paginated data view without creating a snapshot. It would be a manual MVCC implementation or something. A separate API seems a much simpler solution to me.

eyelidlessness 5 years ago |

I think keeping temporal history and restricting paginated results to the data at the point in time where the first page was retrieved would be a pretty decent way to solve offset based interfaces (regardless of the complexity of making the query implementation efficient). Data with a lot of churn could churn on, but clients would see a consistent view until they return to the point of entry.

Obviously this has some potential caveats if that churn is also likely to quickly invalidate data, or revoke sensitive information. Time limits for historical data retrieval can be imposed to help mitigate this. And individual records can be revised (eg with bitemporal modeling) without altering the set of referenced records.
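A minimal sketch of that idea, assuming a table that keeps soft-delete history (`created_at` / `deleted_at` as integer logical timestamps) and a snapshot value handed back with the first page. The schema and the `page` helper are hypothetical, and this says nothing about how to make such queries efficient at scale.

```python
import sqlite3

def page(conn, offset, limit, as_of):
    """Offset page over only the rows that were visible at logical time `as_of`."""
    rows = conn.execute(
        "SELECT id FROM items "
        "WHERE created_at <= ? AND (deleted_at IS NULL OR deleted_at > ?) "
        "ORDER BY created_at DESC, id DESC LIMIT ? OFFSET ?",
        (as_of, as_of, limit, offset),
    )
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, "
    "created_at INTEGER NOT NULL, deleted_at INTEGER)"
)
conn.executemany(
    "INSERT INTO items (id, created_at) VALUES (?, ?)",
    [(i, i) for i in range(1, 7)],
)

snapshot = 6                         # point in time of the first page request
first = page(conn, 0, 3, snapshot)   # newest three items at that time

# Churn between requests: a new item arrives and an old one is soft-deleted.
conn.execute("INSERT INTO items (id, created_at) VALUES (7, 7)")
conn.execute("UPDATE items SET deleted_at = 8 WHERE id = 5")

second = page(conn, 3, 3, snapshot)  # pinned to the snapshot: continues cleanly
unpinned = page(conn, 3, 3, 7)       # same offset against newer state: item 4 repeats
```

The pinned second page picks up exactly where the first left off, while the unpinned one repeats an item because the new row shifted every offset by one.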

ako 5 years ago | |

I think for most use cases, as a user I'd rather see the newest items in a list than consistency of pagination. If I forget to manually refresh, I might miss out on important new items.

Why do you think it is important for users to have temporal consistency?

eyelidlessness 5 years ago | | |

Well I’ll use a recent example I encountered that was actually very frustrating. I was looking for a font to use for a logo for a personal project. The site I was using (won’t name and shame, and I can’t recall the site now anyway) had no sorting options, items were ordered by whatever “popularity” formula they use. As I paginated, many of the fonts I’d previously viewed would appear on subsequent pages, often in a different order. It was frustrating not just because I could tell that I was probably missing fonts that were being bumped up to previous pages, but also because it made me doubt my mental model of my own browsing history: “Did I navigate back too far? Did I forget a tangential click and end up on a different search path?”

It’s not a great UX. And in some ways I suspect that my own views were at least partially causing it, which made me more hesitant to even click on anything unless I was sure it was worth the disruption.

ppeetteerr 5 years ago |

Pagination of an immutable collection is one thing and can be parallelized. Pagination of a mutable collection (e.g. a database table), on the other hand, is risky since two requests might return intersecting data if new data was added between the requests being executed.

True result sets require relative page tokens and a synchronization mechanism if the software demands it.

simonw 5 years ago | |

Intersecting data is fine provided there's a unique ID for each result that can be used to de-duplicate them.

Ideally I'd want a system that guarantees at-least-once delivery of every item. I can handle duplicates just fine, what I want to avoid is an item being missed out entirely due to the way I break up the data.
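The de-duplication side of this is simple on the client. A small sketch, assuming every result carries a unique `id` field (the field name and the sample pages are made up):

```python
def merge_pages(pages, key=lambda item: item["id"]):
    """Concatenate pages in order, keeping only the first occurrence of each key."""
    seen = set()
    merged = []
    for page in pages:
        for item in page:
            k = key(item)
            if k in seen:
                continue  # already delivered by an earlier, overlapping page
            seen.add(k)
            merged.append(item)
    return merged

# Overlapping pages, e.g. from parallel requests whose ranges intersect.
pages = [
    [{"id": 1}, {"id": 2}, {"id": 3}],
    [{"id": 3}, {"id": 4}],
    [{"id": 2}, {"id": 5}],
]
merged = merge_pages(pages)
```

This only handles duplicates; it cannot recover an item that no page ever returned, which is why at-least-once delivery has to come from the server side.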

ppeetteerr 5 years ago | | |

It's more than just de-duplicating, tho. Imagine you query a dataset and get something like a page count and a chunk size. That page count cannot be trusted if the dataset is mutable. If an item is inserted at the beginning of the set, you're going to miss the last item.

Pagination is hard
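A tiny simulation of that failure mode, using an in-memory list as a stand-in for the dataset (the names and data are made up). The client computes the page count once, then an item is inserted at the head before the remaining pages are fetched.

```python
def offset_page(items, page_number, page_size):
    """Plain offset pagination over a list ordered newest-first."""
    start = page_number * page_size
    return items[start:start + page_size]

items = ["f", "e", "d", "c", "b", "a"]     # newest first
page_size = 2
page_count = -(-len(items) // page_size)   # ceiling division, computed once up front

seen = list(offset_page(items, 0, page_size))

# A new item arrives at the beginning of the set between requests.
items.insert(0, "g")

for n in range(1, page_count):
    seen.extend(offset_page(items, n, page_size))
```

After the loop the client has seen "e" twice and never sees "a": the duplicate can be filtered out by ID, but the stale page count means the last item is silently dropped.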

jasonhansel 5 years ago |

It's important here that "created" is an immutable attribute. Otherwise you could get issues where the same item appears on multiple lists (or doesn't appear at all) because its attributes changed during the scanning process.

arcbyte 5 years ago |

I think you could accomplish something similar with token pagination by requesting a number of items that will result in multiple "pages" for your user interface. Then as the user iterates through you can request additional items. This isn't parallelizing, but provides the same low-latency user experience.
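Roughly what that could look like on the client, as a sketch. `BufferedPager` and the `fetch(token, limit)` callback returning `(items, next_token)` are hypothetical names, and `fake_fetch` stands in for a real token-paginated API.

```python
class BufferedPager:
    """Serve small UI pages out of larger token-paginated fetches."""

    def __init__(self, fetch, ui_page_size, pages_per_fetch=3):
        self.fetch = fetch                  # fetch(token, limit) -> (items, next_token)
        self.ui_page_size = ui_page_size
        self.pages_per_fetch = pages_per_fetch
        self.buffer = []
        self.token = None
        self.exhausted = False

    def next_ui_page(self):
        # Refill only when the buffer cannot satisfy a full UI page.
        if len(self.buffer) < self.ui_page_size and not self.exhausted:
            limit = self.ui_page_size * self.pages_per_fetch
            items, self.token = self.fetch(self.token, limit)
            self.buffer.extend(items)
            self.exhausted = self.token is None
        page = self.buffer[:self.ui_page_size]
        self.buffer = self.buffer[self.ui_page_size:]
        return page

# Fake backend: the opaque token is just the next start index.
DATA = list(range(10))
calls = []

def fake_fetch(token, limit):
    start = token or 0
    calls.append((start, limit))
    end = start + limit
    return DATA[start:end], (end if end < len(DATA) else None)

pager = BufferedPager(fake_fetch, ui_page_size=2, pages_per_fetch=3)
ui_pages = [pager.next_ui_page() for _ in range(5)]
```

Five UI pages are served with two backend requests here. A real client would probably refill in the background before the buffer runs dry, which this sketch leaves out.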

gigatexal 5 years ago |

From the code sample in the article I didn’t know you could append to a slice from within a go func

mssundaram 5 years ago | |

As long as you use the mutex locks

gigatexal 5 years ago | | |

Of course. I see that now; it’s so obvious, not sure why I didn’t see that earlier.

draw_down 5 years ago |

> it uses offsets for pagination... understood to be bad practice by today’s standards. Although convenient to use, offsets are difficult to keep performant in the backend

This is funny. Using offsets is known to be bad practice because.... it’s hard to do.

Look I’m just a UI guy so what do I know. But this argument gets old because I’m sorry, but people want a paginated list and to know how many pages are in the list. Clicking “next page” 10 times instead of clicking to page 10 is bullshit, and users know it.