Effectively Using Materialized Views in Ruby on Rails(pganalyze.com) |
Effectively Using Materialized Views in Ruby on Rails(pganalyze.com) |
Two things that would help:
1. Getting Incremental View Maintenance (IVM) [1] into Postgres. It looks like work is beginning on this but it's been 7+ years coming. If there are any Postgres devs looking at this, I'm cheering for you!
2. There's a commercial group doing an implementation of this, Materialize [2], but they don't have any publicly released product yet. I mention it because their interview on Data Engineering Podcast is really good [3] and I really like their focus on correctness by working from replication logs.
[1] https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...
[3] https://www.dataengineeringpodcast.com/materialize-streaming...
e.g. insert into rollup_clicks_hourly select hour, count(*) clicks from clicks where created_at >= ? and created_at < ? group by 1
Then from a reporting perspective, users/dashboards/etc read from the highest relevant table that meets their criteria.Could you recommend any classic old-school books about data warehousing that you'vre read that teach more such techniques?
Thank you!
[1] https://hashrocket.com/blog/posts/materialized-view-strategi...
There are other initiatives for plugging it on existing RDBMSs, like Noria [1] for MySQL. It allows for subscribing to changes; and also for lazy evaluation of the materializing view rows, keeping them in cache for later.
In PG you can make a lot of queries run straight from one index if you tailor an index to a query, since you can index your own plpgsql functions.
The syntactical sugar does have advantages, but comes at a cost of flexibility - if you need more control over the update process or modification of the data than a full refresh, then you should just use a table (create table whatever as select * from ...).
Can anybody here speak to that choice?
Part of the power of OP's technique is that it looks exactly like a standard Django/Rails data model with a tiny sprinkle of magic -- no surprises. It's surprising for a Django programmer to hear "all periodic tasks are handled via celery, except these table refreshes" or "signals are responsible all work in response to model changes, except these triggers".
Obviously in some cases you need a trigger for correctness, but in general I try to stick to the conventions of the ecosystem.
I personally don't like having my cron distributed in different areas. All my crons are in one place/system like easycron.com or setcronjob.com or whenever. Performance is not a consideration when deciding to run the job directly from the DB or the App.
I answered the question sentence directly, but maybe that's not what you were asking?
- postgres supports concurrently refreshing the contents of existing materialized views but there's no built in way to change the structure of the matview concurrently. Which means any `ALTER MATERIALIZED VIEW` will result in all reads to the view blocking for potentially many minutes or longer, for views over nontrivial data. Adding a column to a table is a cheap and non-blocking migration, but allowing the new column to appear in matviews can require an outage or fancy transactional view name swapping that libraries like scenic don't support.
- a column that is included in a view of any kind cannot change in any way, even ways that are binary-compatible like VARCHAR->TEXT. This comes up surprisingly often and getting around it is annoying.
There are apps to integrate PG views as django models, however is kind of trivial to do it yourself.
1. Create a migration that runs the SQL to create the view like ```create VIEW active_users AS SELECT rest of view ... `
2. Create a model mapping all the fields you want to use, add the Meta.table_name with the name of the view, and Meta.managed=False to avoid adding this to your migrations in the future.
Use it as any other normal Model.
Just that
About the updates with migrations I think if you already are thinking on using VIEWS, you could just create migrations that drop and recreate it when you needed. Is what I do, and is very little work for that.
What I have so far is support in the schema editor and very basic migration support. The underlying query is defined as a meta option on the model.
Last time I tried to do model a database view in Django it was painful, but support may have improved since.
But worst case you can drop down to a vanilla query.
One of the main issues we had run into, in addition to all the complexity that comes with managing a cache and no longer being able to count on your db queries to return the latest information, was that our views had to be deleted and re-created for nearly every schema change (at least every schema change that affected the views, but we had so many views, this ended up being probably 90% of our schema changes).
We ended up overriding the standard Rails "up" and "down" migration methods to prepend the deletion of these views, append their re-creation, clean up after failed migrations, etc. I remember us spending a good amount of time across the team dealing with weird edge cases that cropped up from this. I also remember spending a fair amount of time training all the developers on these issues since almost no one on the team had experience using or dealing with database views. I assume managing migrations and schema of your views within Rails is all included in what Scenic does for you for free, since these hassles weren't mentioned in the article.
It's really nice to see something that handles all that complexity for you. However, that still leaves the standard and age-old caching issues with stale data. Another side effect was that they started being a bandaid for inefficient queries and data structures in the database, which is probably how we ended up with so many db views in the first place.
In the end, after probably 8 months of using these views, we decided to excise them from the application entirely and go back to optimizing our actual database queries. We got to delete a lot of custom code, our schema changes and migrations became simpler, our data became fresh again, and we got to reinvest that time into improving our underlying data structures and queries.
This probably is not an argument against using db views, but rather an anecdote of what can happen when you resort to them prematurely. If you have queries that are 100ms or more, there are probably more traditional optimizations you can find, such as restructuring your relational data, adding/removing indexes, etc. If you're actually trying to eliminate those last 10s of milliseconds from your queries though, as is shown in the article, I can see them being a good option.
Views (both regular and materialized) are really useful, but remember that storing data in two places really is one of the of the hard problems in CS. Refreshing the materialized view on a timer might seem easy, but 6 months from now when something important reads from the stale view instead of the real table could become a really frustrating bug. A materialized view might still be a worthwhile optimization; just remember that you're also adding cache management, which might be more complicated than you suspect.
But yes, you're right that this is caching, with all associated pitfalls.
I've been trying to figure out a way to deal with my (growing) join of 13 rather large tables when fetching all of a user's data. Are materialized views the kind of thing that I could generate these cached results _per user_ and have, like, hundreds of thousands of them sitting around to query? And then be able to query against all of user 123456's content directly instead of filtering N tables for their content and joining them all together every time?
In the post the whenever gem (https://github.com/javan/whenever) is used, but you could use anything that is able to run things periodically.
If you'd want to stay solely within Postgres, there is also pg_cron (https://github.com/citusdata/pg_cron) which could be used to call REFRESH MATERIALIZED VIEW from within the database.
https://www.amazon.com/dp/product/1118530802 and possibly https://www.amazon.com/dp/0764567578/