Boosting the performance of PostgreSQL’s COPY command by dropping indexes(californiacivicdata.org) |
Boosting the performance of PostgreSQL’s COPY command by dropping indexes(californiacivicdata.org) |
Of course, that's often not an option when you you're loading records into a live database that's also getting queries, you usually don't want every query to result in a full table scan.
This was well known 20+ years ago when I was an entry-level DBA, and I assumed it was still well known today.
here's another tip, at least on mysql, but possibly other databases that have memory tables. Import stuff in to memory tables, then insert from the memory table to a disk-based table. I took a process that was naively importing data via SQL commands which took close to 24 hours down to around 20 minutes by breaking it up, chunking imports to memory tables, then copying those to permanent disk. This was years ago (12?) and mysql is probably better about insert handling than it was, but that approach (plus the drop/recreate indexes) meant this was a smallish process vs a 24 hour import cycle.
Not sure where I first heard that, but it applies here. Essentially it is almost the same thing as saying that computers are often set up to exploit economies of scale.
Thus building an index all at once after a large set of changes are made is more efficient than incrementally updating an index as each change is made.
Kids, many years ago, even before jQuery, software would come with documentation that you could read and it would tell you how to use it effectively.
I know, crazy right? But to this day some of that old software, of which PostgreSQL is an example, still has this documentation that you can read, even before you use the software in a production system.
Yeah, yeah, I know Agile and Docker solved the problem of ever having to document anything, but this is the way things used to be and a few of us are stuck in our ways and still like it.
If you're getting old then I must be ancient! I remember when all the software documentation had to be printed on this white stuff made out of dead trees.
Intellisense has replaced the need to read the docs and Agile has replaced the need to understand what you're doing.
Its no surprise that basic knowledge found in the documentation is later "discovered" when the project is already running in production.
The difference is really in whether you recognize the issue and quietly hope no one finds out how dumb you really are, or whether you make a big celebratory blog post about the secret behind your "pioneering" work, making sure that your title and first and last name are clearly attached. And of course, we can't fail to highlight the further brilliance of accomplishing this marvelous feat by employing "rarely used, low-level" commands from within the framework's ORM.
Hold on to your butts, because next week he's going to learn that you can execute commands directly on the server, without even having to use the "low-level" elements of an ORM! I can't wait for the field to be revolutionized by Lead Developer James Gordon's next discovery.
The entirety of the "Why We Did It" section:
-----
> This improvement was pioneered by James Gordon, the Coalition’s lead developer.
> He drew instruction from PostgreSQL’s official documentation, which reads:
>> [snipping quoted sections from PostgreSQL manual at https://www.postgresql.org/docs/10/static/populate.html#POPU... ]
>Gordon’s code handles this task using rarely utilized, low-level tools in Django’s database manager.
-----
Sadly, in the current day and age, a developer actually taking the time to RTFM may indeed qualify as "pioneering" work!
Perhaps the rest of us need to start trumpeting our accomplishments when we find some clearly-stated performance gain in the manual, rather than hiding our heads in embarrassment for not finding out until we released version 2.2 of our mass DB import tool.
[1]: https://en.wikipedia.org/wiki/Multiversion_concurrency_contr... [2]: https://xkcd.com/1053/
If so, how does that compare, in aggregate, to the time saved in the loads?
Or are you simply not putting the application back into service until the index rebuilds have finished? How long does that take, compared to the time saved?
EDIT: I'm mostly asking these questions to nudge people to think about them in the course of trying this in their own environments. It's my day job to think about these kinds of things; I've worn the PostgreSQL DBA hat for over a decade now.
Maybe we are missing something by getting rid of the DBAs.
Our project saw SIGNIFICANTLY better performance with batched multi-threaded INSERTs. If you can run a few hundred load threads and manage the concurrency correctly (not trivial), it will chew through big loads like a monster.
If I ever have the time/excuse, I want to go back and try a multi-threaded COPY. But if you need speed and have a choice between multi-threaded INSERTs or a single-threaded COPY, go with the INSERTs every time.
On a one for one basis COPY IN will be faster than inserts:
- COPY uses a special optimization in the access method: instead of running the full insert logic (find a target page, lock it, insert, unlock page) per row, it batches all the rows that will fit on the target page.
- COPY overall has shorter code paths than regular inserts.
None of you read all the performance "tricks" to Postgres before writing your first SQL statement.
Every day, somebody's born who doesn't know how to boos the performance of COPY by dropping indexes.
Whoever wrote the Django bit didn't really do a good job on the defaults.
But that's not what they're dealing with. They're dealing with CSV, presumably from some external source. It'd also be faster if they were dealing with pre-formed database files that they could just rsync. But they're not.
I admit there’s no reason to expect this other than “so you don’t have to do the dumb drop/recreate indexes” trick... maybe I just expected COPY to be smarter than just expanding into a set of insert statements.
If I do a little of something now and a little later and yet more even later, I will probably have to deal with caches that don't have my data in them because other things happen in the intervening time. If I do it all at once, then a lot of the work benefits from already-warm caches. (This can apply to disk caches, CPU data caches, and even instruction caches.)
Not to mention that sometimes batch mode processing opens up opportunities to use a more efficient algorithm (even if sometimes just by a constant factor). For example, if you maintain an index in a balanced tree and keep adding to it piecemeal, you do extra work continually rebalancing that tree. Whereas in theory if you built an index all at once, you could collect all the data, sort it using some kind of fast sort like mergesort, and then write out a final tree which is already balanced as desired and doesn't need to be rearranged as data comes in in random order.
Just becasue one blog post making it seem that way makes it not known ?
relatedly/anecdotally, I still run in to people who aren't aware of this, as they don't understand what happens when you're inserting data in to a database or what indexes are in the first place.
One of the most dangerous things about sharing stuff with others, especially isolated items from unknown authors with a worldwide audience, is that you never really know how much of their own context the recipient will read in, or how much of the assumed / pre-requisite context they'll fail to infer (or infer differently than intended).
You only get better at this through repeated practice, but you can't ever be perfect at it. Especially in a world of complex social interactions where people don't always mean what they say or say what they mean, and the lack of body language and facial expression in written language silently corrupts the signal.
All readers should always remember, it is easy to criticize. It is much harder to do. Critics especially need to remember this, because it's very easy, automatic in fact, to fall into a pattern of judgment and criticism when we're regularly exposed to so much stuff from so many sources. But it's good to get out there and try, because it's often much harder than it looks, and especially if you've been on the sidelines judging for a long time, it can be jarring how much harder it is to do than to say (or, particularly, deride).
A great way to test this: if there's some radio program you listen to regularly where callers can share a brief anecdote or story, call in. As a regular listener, you've been silently evaluating callers on a daily basis for years. You likely have some opinion about the practices of good callers and bad callers. Call in and you'll be surprised how nerve-wracking it can be, even for someone as "experienced" as yourself, and I think you'll be disappointed in your overall performance (unless you've thoroughly rehearsed ahead of time).
This is just a tiny, irrelevant thing that most people wouldn't give any thought to, a quick ~60 second phone call. You certainly don't give much thought to it every day when you dismiss anecdotes or stories told by amateurs. But try it yourself and you'll get a great deal on some perspective for the difficulty gap between doing v. criticizing.
The peanut gallery can, and will, always find something to nitpick. Don't take it personally. Use that data to hone your interactions and get a better-tuned result next time.