Efficient GraphQL Queries in Ruby on Rails and Postgres(pganalyze.com) |
Efficient GraphQL Queries in Ruby on Rails and Postgres(pganalyze.com) |
This is where the abstraction breaks down. Should you care what the query ultimately looks like? For anything beyond a toy, you sure should. But how far should you go to ensure the abstraction produces the implementation you want? Why not just skip to the implementation you want?
Sooner or later, you need to break the abstraction. I'd rather just write the SQL for all my endpoints.
I'm not bullish on GraphQL for the same reason I'm not bullish on ORMs. For some things, an abstraction suffices, but it's far from a universal solution. If I have to break out of the GraphQL abstraction, why bother in the first place? Is there value in a mixed GQL/REST API?
> GraphQL can be as efficient as REST, but requires approaching optimizations from a different angle. Instead of upfront optimizations, we lazy-load data only when required, loading it in batches to avoid excess trips to the database.
Admittedly, the author does a good job of preserving the abstraction, but I'm hesitant to expect most devs to dive this deep, and thus hesitant to expect this level of quality/performance from a GraphQL backend.
Fair, but the ORM covers 80% of what you need. Mapping to objects, autoloading associations. These are simple selects and joins and make up a large portion of most information based systems.
When something needs performance and the ORM isn't optimizing appropriately, sure, drop to raw SQL and optimize away. Even then, what did you gain? Shaving 100 seconds off a 200 second query, sure that's worth it. But if you optimize and manage to shave off a mere five seconds, was it worth it? Maybe. Maybe not.
For most systems where you're displaying information, you just need good enough, an ORM that feels productive and doesn't get in the way (e.g. ActiveRecord).
Optimizing expensive queries not only improves the user experience (be it web or api or whatever) but also can open up entire usecases that otherwise wouldn’t be practical. Five second queries are a primate candidate for DOS attacks, whether intentionally malicious or by users spamming refresh, retry, etc.
If you're the one both creating and consuming the endpoints, then sure, it doesn't matter where you write your query, and building abstractions to move the query all the way out to the client is a waste of time.
But that's almost never what you want: as the backend developer, it's one's job to care about security and the performance of queries. As far as I know there's no straightforward way of automagically translating every possible permutation of SQL query primitives into GraphQL primitives in a way that is intuitive to a frontend developer, reasonably performant and secure, so at some point the backend dev has to step in and do the work of connecting the GQL layer to the SQL layer.
You get to keep the data aggregation complexity in the data layer and then have a loader that sits above it which is effectively just choosing which subsection of the tree to load.
I’d recommend looking at Hasura which does a great job of this.
A foundational observation that GraphQL makes is that every query is different depending on the client's needs. You can do "SELECT * FROM users JOIN companies ON companies.id = users.employer_id" and call it a day, but now you're already probably overfetching, and the next evolutions of your codebase ends up accruing extra logic to add or remove features: Some queries want to fetch just a subset of fields from each of those relations. Some queries don't want to fetch any companies. Some want to paginate, filter, order in different ways.
You can do all that in REST, of course. You can use an SQL builder to build the final SQL query programmatically. You can add WHERE clauses and FULL OUTER JOIN clauses to do your N+1 queries on foreign associations, with some kind of query parameter to define what associations the client wants. Field selection is also doable with query parameter and so on. You can also centralize a lot of the boilerplate to do this (e.g. FULL OUTER JOIN clauses on associations, if you do them, need to be untangled into object graphs after you get the flat results back).
And so on.
What you end up with is pretty much GraphQL. You need code — identical for every relation, modulo client-specifiable constraints and parameters — for filtering, pagination, selection, joining, etc., all of those things. You need some notation for letting the client specify what they want to fetch, both fields and associations. Then, even when you have ended up with something approximates GraphQL, you have to write a custom client for it that only works for this exact API. And then you have to write documentation. And tests, tests which don't just exercise your "business logic", but also exercises all the REST-based query API stuff.
The benefits to GraphQL are that it's standard, so you can piggyback off existing libraries both on the client side (things like Apollo) and on the backend, libraries that are already tested and documented. GraphQL itself defines a schema, which serves as documentation and a way to machine-generate clients. As a case in point, try out Prima's GraphQL playground app [1]. It's basically a web-based IDE for GraphQL. You can point the app at any GraphQL API, inspect the schema, write queries (you can autocompletion of all fields, queries, mutations, etc.) and see live results. It's a really productive debugging tool.
That alone is worth a lot, and that's why I'm bullish on GraphQL.
> If you start without an ORM you end up writing a custom ORM eventually and often that custom ORM is inferior to the third party one you're considering
Funny enough this engineering problem becomes an HR one:
If your recruiting and training processes can consistently hire devs actually fluent in SQL and web application architecture you can completely avoid the ORM and just write good modular and testable code that avoids the chaos of poor SQL performance etc.
If you're like most organizations and have trash recruiting and training processes for developers, you don't have the luxury of trusting your teams not to fall over with technical debt, and may prefer to use an ORM for immediate short term complexity and also long term technical debt management (since the ORM effectively becomes a framework with consistent patterns that can scale across teams).
A while back, I worked with a client who's stack was predominately Java but due to company policies paid 25-50% under the going rate for decent developers. This caused two problems: not being able to hire enough developers, and some of the hires not being great.
We solved that problem with a technical one: the stack moved to Clojure powering backend services with Node.js on the front end. The coolness and newness of those technologies at the time meant they could make hires. Ironically, now the demand for those technologies means they command high salaries and they're back at square one.
Bit of a long story, but my point is that sometime you need to solve people problems with technology and technology with people.
I didn't like orms much at first but with some practice you can shape your queries. Much neater and cleaner if that matters in your case.
No. In REST, the client is responsible for resolving IDs into entities.
In GraphQL, that responsibility largely moves into the backend.
This is not "breaking the abstraction", this is exactly what GraphQL is good for.
Only a few weeks since we implemented, but happy so far. Feels like it's the best of REST and GraphQL.
This page from Graphiti[1] explains the thinking behind it. Don't be discouraged by the loud pink colors ;)
It’s a really great engine and I’m glad we took a punt on it. Aside from the expected behaviour, you get some nice bonus stuff like nested updates for free.
We mostly tried it out because we normally use python and we wanted subscriptions. The graphql story in python (especially async) isn’t amazing. When we need more business logic, we’ll build it in python and then you can get Hasura to do delivery to python (to ensure each event is run exactly once).
I’d recommend kicking the tyres to see if it works for you. The team are awesome too, and moving quickly. I have no affiliation but I think it’s going to do really well.
> By using the includes method we've been able to knock our queries down from six to two: The first to find the events, and the second to find the categories for those events.
Am I the only one that thinks rails/activerecord is crazy here for (often) preferring two trips to the db over a join on indexed primary keys?
Are there really (sane) use-cases where that is better?
> GraphQL does not provide any built-in mechanisms for filtering, sorting, pagination or other arbitrary transformations of the response
For example, from the GitHub GraphQL API you can in a single request get all open pull requests on a repository and the last 10 comments on each one. You can't get the last 10 comments on all of GitHub or the last 10 comments made by a particular user, because they haven't exposed or even stored the data that way.
In a REST API that would be dozens of separate requests, or the endpoint would just not exist because they have never had the need for that exact combination of records to be queryable.
You mean
SELECT u.* , c.* FROM users u INNER JOIN comments c
ON u.id = c.user_id WHERE u.id = <...>
?If the average user record includes a non-trivial amount of data and there are many comments, you're repeating a lot of information in the query results.
disclaimer: I'm currently doing some work for pganalyze.com (the blog host)
I'm not sure about "non-trivial amount of data" - especially if we're talking about a single row (single user) here? I suppose there might be a few blolbs of large json documents?
And I'm not sure I understand "repeating information in query results" - surely we're talking about receiving a query, transforming it and responding with the data?
In a rails+graphql ideal world, that'd maybe mean Parsing the json query, generate some simple ruby (active record) code, fire it off to the db (let ar /arel do its job; generate sql) - serialize to json and write the response?
If there truly is non-trivial data in fields not requested, it might be worth it to naemrrow the columns selected. But I doubt it, in the general case.
Modify the original sample to load two relations: fetch all events with their first 5 categories and all their users. If you use two joins then you get a combinatorial explosion because you are now getting all permutations of the (event, category, user) tuple.
(event1, category1, user1)
(event1, category1, user2)
(event1, category2, user1)
(event1, category2, user2)For instance, if you have a simple active record type:
class User has_many :comments; has_many :likes; end
User.includes(:comments, :likes).find(user_id)
Generates 3 queries, but importantly the number of records returned is appropriate.
User.joins(:comments, :likes).where(users: { id: user_id })
Generates only a single query, but the number of records returned from this join is the product of comments*likes for this particular user.
I find the includes stuff easier to reason about as the amount of associations you need loaded in memory becomes more complex.
Not "user has orders", and also "user has favorite products".
(That is, it's hard/impossible via the active record DSL to make the correct nested join?)
By the way (you may know this) when staying within active record convention, you should never need to single out the ".id" and ".foreign_model_id" fields:
User.joins(:comments, :likes).where(user: some_user)
(where some_user is an instance of User).
I'm still not quite comfortable enough with AR, but I think, leaning on this: https://www.learneroo.com/modules/137/nodes/768
I think what you want in your second case is simply?:
User.joins(:comments, :likes).where(user: some_user).uniq
(I'd have to create some models and watch what ".to_sql" creates to be certain, though)
Having said that, I don’t mind the second request model myself. It’s actually pretty clean and when you take the dB roundtrips out of it, the dB will generally do the same work anyway.
They use a really clever technique for loading where not only do they. pull all the child data for multiple parent records as a single hit, they actually run a single query for multiple connected clients.
So if you have 10,000 connected clients subscribed to the same graphql queries, they build a temp table with 10k rows (one per client) and add all the environment variables. Then they execute the queries once to get all the data for everyone at once (joining to the temp table and also joining to all the tables for access rules you set up in Hasura). Then they split it up and send back the relevant chunks to each client. It’s much more performant than Postgres RSL because the rules can be executed in the joins in a way that works for every client at the same time.
Edit mobile typos
In any case, you won't have to worry about N+1 Query problems with Hasura.
You encounter this problem whether you use an ORM or not.
I think you underestimate the HN bubble : I'm an average dev (I don't even have a Github account !) yet I'm in the HN bubble.
And in this example, the data is definitely compressible, but the Postgres wire protocol does not implement compression (it leaves that to tls right now, though CRIME makes that a problem). Adding native compression to the protocol has been on the TODO list for a while: https://wiki.postgresql.org/wiki/Todo#Wire_Protocol_Changes_...
Also from my experience the performance problems are completely divorced from ORM/not ORM. Usually the reason why you drop the ORM is that you want to take advantage of features that are not modeled by the ORM. Most query optimizations simply involve trying to do everything in as big of a batch as possible. Whether you do batching with or without the ORM isn't really relevant. The reason why people don't batch their queries when they use ORMs is because it's so trivial to not do so but there is no reason why we can't build better ORMs that make batched queries as easy to use as non batched queries which would get us both performance and convenience at the same time.
And if it is not acceptable, you can still optimize the living excrements out of your GraphQL resolvers.
Also, the burden of doing more than one round trip usually moves to the client. So there's always some trade-off to be made.
For a query like
SELECT u.* , c.* FROM users u INNER JOIN comments c
ON u.id = c.user_id WHERE u.id = <...>
above, the user information is repeated for each comment, e.g., u.id, u.name, c.id, c.text
1 Bob 1 you are right
1 Bob 2 you are wrong
1 Bob 3 you are right again
If you have a hundred comments, and many user fields, this can be a significant amount of data to pass around. Or maybe this is not what you meant? I feel like we may be talking past each other. Can you give an example of what you think the ideal sql should be for "give me user with id = 1 and all their comments"?I didn't consider the kind of expansion you're alluding to - on the other hand i can't ever remember seeing an indication that the amount of data returned is an issue with app performance.
But I've certainly seen the number of round-trips result in real issues. And complicated joins, for that matter.
But you're right, there are always trade-offs, and there are some situations where the current AR approach is not ideal, and a single join would be more efficient. However, I suspect this is the way AR does it because it's right enough often enough that it works well as the default approach.