Boosting the performance of PostgreSQL’s COPY command by dropping indexes

Boosting the performance of PostgreSQL’s COPY command by dropping indexes(californiacivicdata.org)

37 points by palewire 8 years ago | 46 comments

Johnny555 8 years ago |

Is it not common knowledge that dropping indexes improves database insert performance?

Of course, that's often not an option when you you're loading records into a live database that's also getting queries, you usually don't want every query to result in a full table scan.

This was well known 20+ years ago when I was an entry-level DBA, and I assumed it was still well known today.

cuchoi 8 years ago | |

I think what the article is proposing is that it can be quicker to drop the indexes and recreate them than to load a lot of data in an indexed table.

maxxxxx 8 years ago | | |

I thought this is pretty common knowledge and I don't even do much with databases.

mgkimsal 8 years ago | | |

and I think that's what what johnny555 is questioning - isn't this common knowledge already? apparently not.

here's another tip, at least on mysql, but possibly other databases that have memory tables. Import stuff in to memory tables, then insert from the memory table to a disk-based table. I took a process that was naively importing data via SQL commands which took close to 24 hours down to around 20 minutes by breaking it up, chunking imports to memory tables, then copying those to permanent disk. This was years ago (12?) and mysql is probably better about insert handling than it was, but that approach (plus the drop/recreate indexes) meant this was a smallish process vs a 24 hour import cycle.

adrianmonk 8 years ago | | |

"Batch-mode processing is always more efficient."

Not sure where I first heard that, but it applies here. Essentially it is almost the same thing as saying that computers are often set up to exploit economies of scale.

Thus building an index all at once after a large set of changes are made is more efficient than incrementally updating an index as each change is made.

C14L 8 years ago | |

https://xkcd.com/1053/

adrianmonk 8 years ago | |

It's weird but doing less work requires less time. Who'd have thought that?

drblast 8 years ago |

I must be getting old.

Kids, many years ago, even before jQuery, software would come with documentation that you could read and it would tell you how to use it effectively.

I know, crazy right? But to this day some of that old software, of which PostgreSQL is an example, still has this documentation that you can read, even before you use the software in a production system.

Yeah, yeah, I know Agile and Docker solved the problem of ever having to document anything, but this is the way things used to be and a few of us are stuck in our ways and still like it.

derefr 8 years ago | |

I second reading the Postgres docs. They're fabulous resources. Any time I'm attempting to apply a new SQL syntax feature I'm not 100% fluent with, I'll give the PG docs a read-through—not just for usage, but also for idiomatic examples, performance analysis, edge-cases to consider, and more.

slantyyz 8 years ago | |

>> even before jQuery

If you're getting old then I must be ancient! I remember when all the software documentation had to be printed on this white stuff made out of dead trees.

jeremiep 8 years ago | |

And then people wonder why Agile never produces quality results.

Intellisense has replaced the need to read the docs and Agile has replaced the need to understand what you're doing.

Its no surprise that basic knowledge found in the documentation is later "discovered" when the project is already running in production.

cookiecaper 8 years ago | | |

To be fair, the honest among us will admit that it's not unusual to miss something glaringly obvious for an embarrassingly long amount of time.

The difference is really in whether you recognize the issue and quietly hope no one finds out how dumb you really are, or whether you make a big celebratory blog post about the secret behind your "pioneering" work, making sure that your title and first and last name are clearly attached. And of course, we can't fail to highlight the further brilliance of accomplishing this marvelous feat by employing "rarely used, low-level" commands from within the framework's ORM.

Hold on to your butts, because next week he's going to learn that you can execute commands directly on the server, without even having to use the "low-level" elements of an ORM! I can't wait for the field to be revolutionized by Lead Developer James Gordon's next discovery.

gaius 8 years ago | |

This is where the rampant ageism in the industry has lead us; millennials who expect a prize for doing something totally obvious to anyone with experience, who was never considered for the job because they were “too old”

whistlerbrk 8 years ago | | |

I understand your sentiment, but I think your wording is a bit mean spirited. I don't think this person "expects a prize", nor do I think this is a trait that can be applied to millenials as a whole. My sister who has hired literally a hundred people bucketed as "millenials" has had the exact opposite experience in dealing with them than you describe. She thinks they are some of the hardest working people she has ever worked with. But we digress.

cookiecaper 8 years ago |

Wow. You've gotta love the audacity in making a big announcement like this based on the developer finally getting around to reading the docs. "Drop constraints and indexes for faster imports" is mass import 101.

The entirety of the "Why We Did It" section:

-----

> This improvement was pioneered by James Gordon, the Coalition’s lead developer.

> He drew instruction from PostgreSQL’s official documentation, which reads:

>> [snipping quoted sections from PostgreSQL manual at https://www.postgresql.org/docs/10/static/populate.html#POPU... ]

>Gordon’s code handles this task using rarely utilized, low-level tools in Django’s database manager.

-----

Sadly, in the current day and age, a developer actually taking the time to RTFM may indeed qualify as "pioneering" work!

Perhaps the rest of us need to start trumpeting our accomplishments when we find some clearly-stated performance gain in the manual, rather than hiding our heads in embarrassment for not finding out until we released version 2.2 of our mass DB import tool.

not_kurt_godel 8 years ago | |

Yeah, it's OK to not know about this, and get excited about discovering it, but to categorize it as "pioneering" is, as you say, audacious. This announcement would have been better as a minor bullet-point in the release notes. If and when someone publicly points out how much faster it is, the appropriate response would be, "heh, this is embarrassing, but we didn't know you should drop indexes before importing"

larzang 8 years ago | |

Heads are going to explode when they discover the arcane rarely-utilized tool known, nay, merely whispered of, only as "transactions".

mulmen 8 years ago | | |

Just today someone asked me "what's MVCC[1]?". One of today's lucky 10,000 [2].

[1]: https://en.wikipedia.org/wiki/Multiversion_concurrency_contr... [2]: https://xkcd.com/1053/

rosser 8 years ago |

Do you have any numbers on how much extra time (that is, time spent servicing queries above the normal query times when your tables are indexed) application queries take after the loads, but before the index rebuilds are complete?

If so, how does that compare, in aggregate, to the time saved in the loads?

Or are you simply not putting the application back into service until the index rebuilds have finished? How long does that take, compared to the time saved?

EDIT: I'm mostly asking these questions to nudge people to think about them in the course of trying this in their own environments. It's my day job to think about these kinds of things; I've worn the PostgreSQL DBA hat for over a decade now.

whistlerbrk 8 years ago | |

Yes, you would never do this unless writes were completely disallowed to your app. You will save time building the indices in one-shot.

rosser 8 years ago | | |

My point is: this is not necessarily true. The Fine Article even mentions that the benefits are situational.

jacobkg 8 years ago |

Do you also disable autovacuum while this is running? That is another good trick for speeding up large database imports.

protomyth 8 years ago | |

I would, since it would be a waste on a table loaded right after a create.

p2t2p 8 years ago |

N̶o̶ ̶s̶h̶i̶t̶,̶ ̶S̶h̶e̶r̶l̶o̶c̶k̶!̶ You don't say?

p2t2p 8 years ago | |

It was a bit abrupt but seriously it is kinda disheartening to read that a _lead_ developer _discovered_ such a basic thing.

trhway 8 years ago | | |

Reminds about that funny article from Uber on why they switched from one DB to another. Seems like they had failed to discover a number of things.

cuchoi 8 years ago | |

The article is proposing is that it can be quicker to drop the indexes and recreate them than to load a lot of data in an indexed table.

protomyth 8 years ago | | |

Yep, and this is common knowledge for most folks who do this kind of work. Heck, if you can stop the database, a whole host of things become quicker by going drops and recreating things. Most alter commands are quicker if you do drops and creates often even if a copy of a table needs to be made.

Maybe we are missing something by getting rid of the DBAs.

segmondy 8 years ago | | |

It's not a proposal, it's common knowledge. INSERTS causes the index to be rebuilt. You have to search and find the right location then insert the new pointer. If you do 1,000,000 inserts that's 1,000,000 searches and writes to the index.

karlmdavis 8 years ago |

From personal experience: PostgreSQL's COPY commands aren't really all that performant, indexes or no.

Our project saw SIGNIFICANTLY better performance with batched multi-threaded INSERTs. If you can run a few hundred load threads and manage the concurrency correctly (not trivial), it will chew through big loads like a monster.

If I ever have the time/excuse, I want to go back and try a multi-threaded COPY. But if you need speed and have a choice between multi-threaded INSERTs or a single-threaded COPY, go with the INSERTs every time.

davidgould 8 years ago | |

COPY of course is single threaded. If you can split your input data and run multiple copy operations you can get similar increases although I would never suggest hundreds of threads. Depending on your IO system something like twice as many threads as CPU threads is probably going to work better.

On a one for one basis COPY IN will be faster than inserts:

- COPY uses a special optimization in the access method: instead of running the full insert logic (find a target page, lock it, insert, unlock page) per row, it batches all the rows that will fit on the target page.

- COPY overall has shorter code paths than regular inserts.

imtringued 8 years ago | | |

Why not just use regular batched inserts with unnest to turn several array parameters into a table instead of using some arcane hard to use SQL command?

rtpg 8 years ago |

Why is everyone acting so pissy about the fact that someone happened to find a performance improvement trick by reading the docs? Isn't this what's supposed to happen?

None of you read all the performance "tricks" to Postgres before writing your first SQL statement.

Every day, somebody's born who doesn't know how to boos the performance of COPY by dropping indexes.

gaius 8 years ago | |

Most people don’t brag about being “pioneers” for having read the docs for a basic function. That’s what’s causing all the mirth.

beached_whale 8 years ago |

It looks like the article is directed at those doing bulk uploads to their system and not at developers necessarily. This may not be universally intuitive to people outside of Database admins and developers.

Theodores 8 years ago |

They should have stuck with bog-standard mySQL to get this time saving for free - if you restore a SQL dump created with phpMyAdmin or mysqldump then all the disable index commands are already in there, good to go, and in SQL.

Whoever wrote the Django bit didn't really do a good job on the defaults.

ketralnis 8 years ago | |

I don't know what makes mysql more "bog standard" than postgres, but pg_dumpall takes `-F format`, one of which is standard SQL statements.

But that's not what they're dealing with. They're dealing with CSV, presumably from some external source. It'd also be faster if they were dealing with pre-formed database files that they could just rsync. But they're not.

keredson 8 years ago |

yep

j_s 8 years ago |

Yikes https://www.xkcd.com/1053