How Postgres Triggers Can Simplify Your Back End Development(themythicalengineer.com) |
How Postgres Triggers Can Simplify Your Back End Development(themythicalengineer.com) |
Been first in all my teams since then to loudly voice my opposition when someone suggests it as a quick fix for something more complicated
They migrated away from it at some point, but some of the people who handled that migration were still around when I was there. Didn’t sound fun at all, sounded like a horrific nightmare.
Where Git isn't looking...
Now you have two problems.
Right?
It's obvious to look at `add_new_payment` for the code that runs when adding a new payment, but then the code isn't there, so you have to know/ask or search in either migrations, a fresh structure dump or poke at the actual db (!).
I think they're great for other, well, effects when needed. PostgreSQL is a real powerhouse.
1. Just discovered this. 2. Implemented a bunch of them. 3. Hasn’t been maintaining this solution for more than a couple of months.
This is a trade-off that only manifests itself after maybe a couple years when you’ve built a system and you have many hands maintaining it. Either there’s some performance problem that makes it worth it or you’ve just obfuscated half your code for no reason. In the latter case those many hands are going to wreak havoc on your system before you figure out how to make it maintainable.
This is basically the clickbait in the wider world affecting software development, even though it might not look like it. Boiled down to the essentials, we are telling each other the software version of "Here's How to Lose 10 Pounds in Time for Summer." way more often than "How To Balance Diet and Exercise to Remain Healthy Over A Decade".
But it's up to us to those of us who like more discussion to reverse these trends. In that vein -- do you have good sources that typically talk about tradeoffs with technology rather than promoting a specific one? Kleppmann's book on Data Intensive applications is one I have found in the past.
"Triggers should be used with caution since they can obscure critical logic and create an illusion of automatic processes. While this can be advantageous in certain scenarios, it can also pose a challenge in terms of debugging, testing, and monitoring since they are not readily visible to developers."
1. Migrations. Your schema needs to live in version control, and changes to your schema must be applied by an automated system. Django migrations are the gold standard here in my opinion, but you can stitch together a custom system if you need to, one that tracks which migrations have been run already and provides a mechanism to apply new ones.
2. Automated tests. Your triggers MUST be covered by automated tests - call them unit tests or integration tests depending on your preferences, but they need to be exercised. Don't fall into the trap of mocking your database in your tests!
With these two measures in place I'm 100% comfortable using triggers that split my business logic between my application code and my database code.
Even if they just created a generated file of the final schema, that sat in version control and errored the makemigrations check (just like a missing migration) if it was out of sync, that would be a significant improvement IMO. But I think the Django maintainers at least would say they want the ORM DSL to be that. (But it's way too incomplete, you'd be limited to a tiny subset of postgres, and even then you have to be on board with that being a reasonable description of your schema, not wanting the actual db schema anywhere.)
Intriguing. In JavaWorld it's really only Flyway and Liquibase, and the important self-imposed rule, "Thou shalt never remove a column from a table."
I’d highly recommend people avoid them, unless you feel that you really need them _and_ you have very robust development processes. As soon as you deploy your first table trigger, from that moment you have to check every DML statement for unintended side effects.
If that's the case, you have a documentation problem. It should be easy to decide if there are side effects or not just by looking at the DML and the metadata you need for it.
In fact, the case here is that the trigger on the article is an incredibly bad one. It's not a natural consequence of the table, or the database structure. It's probably not even always true to the business rules, what is the one property triggers must have no matter what.
But IMO, even migrations are a stopgap solution to this problem. In some cases, such as in ORMs like Rails' ActiveRecord, even the column names aren't present in the model by default, they are only migration files.
The real solution is putting the definition of the triggers (and all other database things too) closer to the code that operates on them. In models, for example. And models should enforce that only the defined triggers are present in the database. Of course, since most ORM users aren't really big DB users, this isn't exactly common. Until we get features like this, I agree that triggers can be confusing.
But otherwise, this fulfills the condition of triggered events residing in the codebase. Of course, this means you have to be comfortable with using an ORM in the first place.
Databases exist to manage data, not just store and retrieve it.
Instead adopt a solution for structuring your business logic in a sane way, such as using a workflow engine. Your code will become simpler and well organized that way without creating a tangled web of distributed rules, as well as exist all in one place.
Alternatively, write all the business logic in the database. This way you can better leverage the DB features and ensure that logic only needs to be written once.
It was a nightmare. Deployments were very difficult, there was little tooling, reasoning about the system was difficult, and of course running so much code in Oracle required very expensive licenses.
And it is much harder to hire hard core PL/SQL devs over Java, C#, Python or whatever.
You really don’t want significant code in the DB. It can be useful for some cases, like automating audit tables, but that is about it.
And for triggers - I feel for anyone maintaining business logic located in triggers. What a debugging hell that can be.
Rather, lets rejoice with latest C#, zig, rust, PoweShell or whatever-beautiful-language-and-ecosystem-perfected-in-latest-decade we have, instead of horrible SQL.
I think this could be great for certain projects, but there is a lot room to put yourself into a situation that's hard to maintain if you/you're team doesn't possess the right amount of discipline around documentation, developer tooling, observability etc..
what was the cause of pain?
Morale of the story, use it with caution. I know postgres is different but when you start turning you database into a ball of mud things get dangerously difficult to debug and fix.
Triggers to do something that's simple and always required, e.g. updating a primary key index for a new row (before autoincrement was available) can be OK, but use them sparingly.
I like putting business logic in the database, because you only write it once and not for each client application. Client applications and platforms and their development languages come and go a lot more frequently than databases. But I use stored procedures almost always, and rarely triggers.
Don't write business logic in the datastore.
Summarizes why in my opinion using triggers is rather risky and confusing. You introduce side effects to operations that one might suspect are CRUDlike. Your code is made non-atomic, in that you need knowledge of what happens elsewhere to guess why it's behaving a certain way. On small projects it's rather tempting, but small projects become large projects that then get given to someone else to maintain, and 4y later someone will spend a week trying to understand why the amount column gets updated to another value that they're pushing.
The only use that I find safe is for database metadata, say if you're using triggers to keep track of write origins, or schema metadata. For everything that's business logic, I'd stay away from them
But every single time this comes up, people on the engineering teams Ive been on all throw their hands up and accuse folks of overengineering or underengineering. You need rabbit or kafka. We should move to mongo. Etc.
Thats the part thats hard.
Luv Postgres
Simple as
A quick note for anyone thinking about triggers: if there's any case whatsoever that you're going to have more than one row insert into a table per transaction, please use statement level triggers -- especially if you're doing network calls inside the trigger. Triggers execute within the same transaction, they're synchronous, and will soak up resources.
Hell, if you're using network calls in your triggers... please don't. Use `LISTEN/NOTIFY` or queues to offload that to some other process (whether postgres itself or another process), so PG isn't left hanging around waiting for a network return.
What I would recommend instead is making use of CTE (Common Table Expression), because DML (modifying queries) inside `WITH` are allowed and taking leverage of `RETURNING` keyword in both `UPDATE` and `INSERT` we can execute multiple inter-dependent updates within single query.
With such approach we can trivially run multiple versions of an application in parallel (during deployment, for canary deployment etc.) and we have similar performance advantage of a single roundtrip to database. Additional advantage is the fact that there is only one statement which means that our query will see consistent database view (with very common read committed isolation level it is easy to introduce race conditions unless optimistic locking is used carefully).
Personally, it’s much more valuable to have all business logic in one place, in a single language, available at a glance. The perforance gain isn’t worth the increased complexity in codebase.
They are coupled tightly with database schema design. Making changes to them needs careful consideration and a extensive testbed environment which cannot be mocked with a fraction of the data.
- Engineers being more comfortable expressing the required business logic in the other languages they were working in then PL/pgSQL
- Challenging to write tests for the triggers
- Harder to deploy variations for testing if needed
You can make the development of your backend more simple, by shoving the complexity into the database, meaning your backend just does less. That in itself does not make your application any simpler.
How do we know who created the rule, edited the rule? How can we reason about the sequence in which these rules are executed based on larger use cases with complex interactions.
Seems like a fire waiting to happen.
You are presumably operating inside of a database, a place where the above concerns can be tracked in ~3 additional columns. More complex rule arrangements can be addressed with additional tables & relations. If you are starting from a blank schema, everything is possible. As noted by others here, you either go all-in, or all-out. The middle ground where half the logic is in the database and half is in GitHub is where things get yucky.
Consider the simplification angle. There are some techniques that allow for running entire apps directly out of the database. You might not even need Node, .NET, Go, Rust, etc. Hypothetically, if 100% of the things are in a database, you can simply record the binary log to S3 and have a perfect log of everything over time. Imagine how easy it would be to set up a snapshot of a given environment on a developer machine. Inversely, you could directly ship a developer's machine to production. You can also do some crazy shit where you merge bin logs from different timelines via marker transactions.
The other major advantage includes being able to update production while its live, even if production is running on a single box. I saw a particular PL/SQL install earlier in my career that was utilized for this exact property - production literally could not ever drop a single transaction or stop servicing them. Latency beyond 2 seconds could be catastrophic for system stability. Production could come down, but it had to be all or nothing. Think - shutting down a nuclear reactor and the amount of time you are locked out due to the subsequent safety and restart checklists. You absolutely need a way to safely & deterministically update in-between live, serialized transactions or you can't run your business effectively.
The big pro is that you no longer need to remember to update tableB, which is derived from data in tableA due to performance, in your application code every time you update tableA.
The cons are that:
* You add more state to your DB
* You can't express the logic in your backend language
Thinking more about it, I don't think the cons outweigh the pros. I would prefer this trigger logic to be part of the DBMS, so I can express the logic in my backend language and also avoid increasing the dependency of my application logic on DB state.
You have got to be kidding me.
I think a lot of the problem in industry is simply lack of rigor. We talk a big game about being software engineers but nobody takes the time to talk about values, requirements, trade-offs or business context for decisions. It’s a shame because these topics are the actual engineering of the system.
To some this is the most obvious thing to do, but you'd be surprised that some people wouldn't do this (I wouldn't have before reading a few books) and how I even got pushback at first, despite it being a 10-20 minute to wrapper logic in a class.
A good abstraction for things like this makes it really justifiable to take advantage of Postgres and Redis for things that aren't their forte for the time being until you eventually need to swap the out for a more robust solution. My experience is at startups mostly, and that ability to make complete, but small implementations to get going and being able to make them more robust over time is an essential skill.
Like everything, it depends on the application.
Now you'll never* don't need to shard your Postgres.
* Unless you work at a very rare company.
I eventually left. We pivoted to a new product and leadership agreed the old system was legacy to remain untouched. Eventually this thinking changed and the old product was to be integrated with the new. Sprocs were back baby! My battle was lost, I was done.
If your application logic needs to remember to update a denormalized column somewhere it's very easy for a bug to slip in in the future when someone adds a new piece of Python code but forgets to update the denormalized column.
With triggers you can eliminate that source of bugs entirely.
This is even more valuable if you have more than one system interacting with a single database.
But yeah having the business logic split like this is a hard sell. There needs to be an extremely good reason to do it. It adds developer overhead because not only do you have to maintain another test harness and deal with an entirely separate system encapsulating yours, every time you touch business logic you’re now wasting some brainpower deciding where it should live.
At one point Terradata claimed they had a tool that could auto convert to something else with 99% success rate, or something like that. The reality on our samples we tried was like 40%.
Plus, Pl/SQL code is often very “chatty” with the data, if you try to port it as is you will often suddenly see performance issues as you see the amount of data that has to be sent over the wire from the DB to your code tier.
I do a version of that for my own (Django-powered) blog here: https://github.com/simonw/simonwillisonblog-backup/blob/main...
> Repeatable migrations are very useful for managing database objects whose definition can then simply be maintained in a single file in version control. Instead of being run just once, they are (re-)applied every time their checksum changes.
https://pypi.org/project/DBSamizdat
In a nutshell it allows you to keep DB functions, triggers and views around as Python classes, so you can version them together with the rest of your application code. The DB state gets updated for you (in the right dependency order) whenever you change them.
It can also run without Django.
The standard tables can use the standard process while triggers, etc are run declaratively. When you change something, a tool could simply tear everything down and rebuild it. No need to worry about data since those are handled separately.
Would performance be a concern? Or is there something I'm missing?
Applying a series of migrations to get a final db schema is not much different than a version control system like git.
If you like, you can view my comment above as saying 'if they included that tooling, not just the similar tree of changes stored, it would be better'.
How to get from one state to the next is interesting to the computer, not to me (not after I've initially written it/done it in prod anyway), I'm interested in the result, where do we stand after all of the migrations (currently on disk having checked out whatever).
In Django that's "./mange.py migrate".
To do this right, you're using row level locking e.g. SELECT FOR UPDATE/SKIP LOCKED [1], and hopefully you're already using idle_in_transaction_session_timeout to deal with total consumer failures. A properly designed queue in Postgres runs more-or-less in parallel, and supports (really fantastic features like) atomic row locks across all resources needed to serve the queue request.
If you need extremely long consumer timeouts, it's also totally fine to use RLLs in addition to state on the job itself.
[0] - https://www.crunchydata.com/blog/message-queuing-using-nativ... [1] - https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...
The comment about deployment pipelines was just that some shops explicitly load test with a multiple of prod-like data to find that sort of issue before prod. Doing so for a non-negligible amount of time is important though to catch qualitative shifts in the RDBMS behavior as it approaches a steady or runaway state as a result of any software change.
> I wouldn't want an average development team to pursue that approach without good reason, especially when the alternatives are so easy to do right.
Agreed. Good reasons I’ve had in the past are:
- wanting transactional guarantees across your DB data and queue (which you don’t get if your queue is external)
- not wanting to add more stack complexity (this has been an issue when you have to support on-prem deployments that you aren’t allowed to interact with).
I’m sure there are others.
If not using such a tool, you might adopt a little meta-schema logic in your own DDL. With PostgreSQL, you can try writing idempotent and self-migrating schemas. Using syntax variants like CREATE OR REPLACE for functions and views, DROP ... IF EXISTS, CREATE ... IF NOT EXISTS, and ALTER ... ADD IF NOT EXISTS allows some of your DDL to simply work on an empty DB or a prior version. Wrapping all this within a pl/pgsql block allows conditional statements to run DDL variants. Conditionals are also useful if you need to include some custom logic for data migration alongside your schema migration, i.e. create a new structure, copy+transform data, drop old structure.
For smaller DBs, you may be able to afford some brute-force techniques to simplify the DDL. Things like DROP IF EXISTS for multiple earlier versions of views, indexes, or constraints to clear the slate and then recreating them with the latest definitions. This may add IO costs to execute it, but makes the DDL easier to read as the same DDL statements are used for the clean slate and the migration. Similarly, a little pl/pgsql logic could loop over tables and apply triggers, policies, etc. that you want to use systematically in a project.
If possible, you can also prune your code so that any specific version in your source control only has logic to handle a few cases, i.e. clean slate builds and a migration of N-1 or N-2 to latest. This minimizes the amount of branched logic you might have to understand when maintaining the DDL. The approach here depends a lot on whether you are versioning a product where you control the deployment lifecycle or versioning a product release which may encounter many different prior versions "in the wild".
In any case, you have a more complicated test requirement if you want to validate that your system works from multiple starting points. I.e. clean-slate builds as well as migrations from specific earlier versions. I think this is true whether you are using some higher level migration management tooling or just rolling your own conditional and idempotent DDL.
If you want a plain text SQL file to look at you can have that with a bit of extra automation - for example, every time you tag a release you could have a script that runs "./manage.py migrate" against that checkout and then dumps the schema out to a file somewhere (an asset attached to the release on GitHub as one example).
Ideally though I'd like a tool with first class declarative schema as the source of truth - migrations that Django can correctly autogenerate don't need to concern me at all; if I need to vary them or resolve some ambiguity in a possible path (renamed a column vs. dropped and added one, for a simple example) then there can be some way of specifying that migration. Essentially as Django targets its models as the intended result of migrations (and complains of a missing migration if they don't get there, etc.) I'd prefer instead to target SQL schema. Still check ORM consistency - but against the schema, not the migrations of it.
Starting from scratch it seems obvious that you'd have the schema, and the mapping into python models. Django does away with the former by inference, but in so doing limits its scope to the subset of possibly desired SQL that it understands.
If you have SQL and Python, then you can have DSL for the understood subset without losing migrations, tracking in a single place, etc. of the rest. You could also give Django ownership of the entire database/schema, so that any manual experiments/adjustments would be blown away on migration, they'd have to be documented in code to be persisted.