PostgreSQL, Memory and the Cloud

PostgreSQL, Memory and the Cloud (sosna.de)

211 points by bilalhusain 4 years ago | 57 comments

aeyes 4 years ago |

Wow, the title of this post is very calm compared to what is actually happening.

CloudSQL Postgres is running with a misconfigured OS OOM killer that crashes the postmaster randomly, even when memory use is below the instance spec. GCP closed this bug report as "Won't fix".

This is a priority 1 issue. Seeing a wontfix for this has completely destroyed my trust in their judgement. The bug report states that they have been in contact with support since February.

Unbelievable attitude towards fixing production critical problems of their platform affecting all customers.

yashap 4 years ago | |

So many GCP products are surprisingly terrible. Certainly not all of them, some are really good, like GKE, Cloud Storage and Cloud Load Balancer. But Cloud SQL is pretty weak, and products like Cloud Logging, Cloud Metrics and Cloud Tracing are legitimately terrible. Cloud NAT is pretty sketchy too, and can lead to a lot of egress issues if not configured perfectly.

My current workplace uses GCP, my last workplace used AWS, and personally I’ve found AWS to have much higher average quality. At my current workplace we’ve stopped using Cloud SQL, and moved our Postgres usage to Aiven (with VPC peering). Aiven seem to do a much better job operating Postgres than GCP do.

yashap 4 years ago | | |

An example of the kinds of quality issues you run into with so many GCP products: https://github.com/googleapis/cloud-trace-nodejs/issues/1272

Basically, their Cloud Tracing product is broken for modern Node/Postgres (in terms of showing PG queries and whatnot in traces), users have found the issue (and a seemingly super simple fix), but it’s been over a year and Google still haven’t fixed it. Google’s response is “yeah, we know pretty core functionality of this product is broken, but we’re not fixing it in the near future.” Or maybe ever? Many of their products feel semi-abandoned like this, especially in their observability stack - major bugs and/or performance issues that they never fix, and extremely limited features.

Cloud SQL isn’t terrible, but at least the Postgres version is one of the weaker managed Postgres offerings out there. And their whole observability stack (Logging/Monitoring/Tracing/Error Reporting) is legit terrible compared to competing products. Compared to other products I’ve used in the space, Cloud Logging is unbelievably worse than Sumo Logic, Cloud Metrics soooo much worse than Grafana+Prometheus, Cloud Tracing way worse than offerings from Datadog or New Relic, Cloud Error Reporting is ridiculously far behind Sentry, etc.

The GCP options are often quite cheap, but it shows in their extremely limited features, poor performance and plentiful bugs. Go with GCP for the things they do well, but don’t bother adopting their solution for everything simply to stick with one platform, as so many of their products are just so poor compared to competitors.

RNCTX 4 years ago | | |

Not really surprising if you consider their likely motivations.

Google isn't in the business of selling things to end users, they're in the business of selling ads. The only thing GCP gives them (outside of getting wall streeters off their backs a few years ago when everyone and their brother was starting a cloud service) is a credit to their own infrastructure cost by selling excess to random joes.

Therefore I'm not surprised that AWS continues to be the de facto choice, they do sell things to end users. I'm not surprised that Azure is growing quickly, either, since MS also sells things to end users and they needed a way to transition their on-premise stuff to the wires.

xxorde 4 years ago | | |

It's interesting that you are satisfied with GKE. Do you rely on the k8s API to be highly available? We were using the API as our source of truth for Patroni, but we had to configure some really high timeouts in order to compensate for regular multi-minute API downtimes.

slimsag 4 years ago | |

This doesn't really surprise me. We use CloudSQL at my work (Sourcegraph) and have run into all sorts of weird issues actually putting it into production, e.g. segmentation faults when turning on Query Insights (which, lol, is supposed to give insight into why your DB might be behaving poorly.)

For the most part it works okay and is fine, but there have definitely been a fair number of quirks.

https://issuetracker.google.com/u/2/savedsearches/559773?pli...

breakingcups 4 years ago | |

It exposes a very problematic communication pattern. The engineering team doesn't respond to the support team (accidentally or deliberately). The support team then just decides to close the issue instead of prodding the engineering team for an actual response (even if it's just "Yeah, we're not fixing it").

Now the issue is just in limbo and the only one who feels the pain is the customer.

perlgeek 4 years ago | | |

Another "fun" interaction pattern: User reports a bug (or a feature request), several others subscribe to and/or vote for this to be solved, and then a service rep closes the issue because there wasn't any recent activity.

I've observed this with Atlassian, where I wanted to report a Jira bug but found that it had already been opened some years before. More than a hundred people had subscribed, and the bug was still closed as "no activity, must not be relevant". I just found the exact same bug reported for Jira Cloud (I had observed it in the on-prem version): https://jira.atlassian.com/browse/JSWCLOUD-8865 and it was closed there for the very same reason.

I didn't leave a comment because the original report described the issue perfectly, and adding a "me too" comment is just noise in the bug tracker. Guess I'll be noise in future :-(

Sytten 4 years ago | |

I concur with the other comments, Cloud SQL is a very mediocre service at best. Lots of weird issues and the engineering team doesn't seem to care. We also had the segfault due to query insight. The fact that you can't upgrade your database version without creating a new instance and restoring a backup is just bad. I also suggest aiven as an alternative that works very well and cost is reasonable.

ris 4 years ago | | |

> I also suggest aiven as an alternative that works very well and cost is reasonable.

Seconded. Responsive support too.

arcticfox 4 years ago | |

I migrated off CloudSQL even when they tried to pay me to use it (startup credits). It's not worth risking your business with GCP. Sad, but that's what I've learned...

I'd consider Aiven if I were still on GCP and looking for a solid managed Postgres provider. As it is, I'm now on DigitalOcean and fairly happy with their managed Postgres offering, but there are a few rough edges so I'm actually still looking at Aiven even though everything else I have is on DO...

Winsaucerer 4 years ago |

Are there any good/recommended books or resources for someone who wants to learn how to run PostgreSQL well? E.g., what defaults to change and when, settings for the host OS (such as in the parent linked article), overall tips/insights/recommendations.

mixmastamyk 4 years ago | |

"PostgreSQL: Up and Running" from O'Reilly is decent. Reading the docs for the config files is necessary as well.

thyrsus 4 years ago |

Are there recommendations for learning about Linux kernel memory management? Two anecdata:

* I had some compute servers that were up for 200 days. The customers noticed that they were half as fast as identical hardware just booted. Dropping the file system cache ("echo 3 | sudo dd of=/proc/sys/vm/drop_caches") brought the speed back up to that of the newly deployed servers. WTF? File system caches are supposed to be zero-cost discards as soon as processes ask for RAM, but something else is going on. I suspect the kernel is behaving badly with overpopulated RAM management data (TLB entries?), but I don't know how to measure that.

* If that is actually the problem, then a solution might be to decrease data size by using non-zero hugepages ("cat /proc/sys/vm/nr_hugepages"). I'd love to see recommendations on when to use that.
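One rough way to put numbers on the two points above is to read /proc/meminfo before and after dropping the cache. The sketch below is only that, a sketch: it assumes a Linux host, the function names are made up for illustration, and the PageTables field reports memory spent on page tables, which is not the same thing as TLB pressure, so it can hint at the suspicion above but not confirm it.

```python
from pathlib import Path
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}.

    Values keep the unit printed by the kernel: kB for most fields,
    a plain count for the HugePages_* counters.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue  # skip anything that is not "Name: value [unit]"
        fields[name.strip()] = int(rest.split()[0])
    return fields


def summarize(fields: Dict[str, int]) -> Dict[str, float]:
    """Pull out the numbers relevant to the page cache / page table question."""
    total_kb = fields.get("MemTotal", 0)
    page_tables_kb = fields.get("PageTables", 0)
    return {
        "page_cache_kb": fields.get("Cached", 0),
        "page_tables_kb": page_tables_kb,
        "page_tables_pct": 100.0 * page_tables_kb / total_kb if total_kb else 0.0,
        "hugepages_total": fields.get("HugePages_Total", 0),
        "hugepages_free": fields.get("HugePages_Free", 0),
        "hugepage_size_kb": fields.get("Hugepagesize", 0),
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # only present on Linux
        for key, value in summarize(parse_meminfo(meminfo.read_text())).items():
            print(f"{key}: {value}")
```

Running it on a slow long-lived host and on a freshly booted one would at least show whether page table memory and cache size differ as much as the slowdown suggests.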

mnahkies 4 years ago |

I recently managed to crash a GCP cloudsql postgres 12 host running an interactive query that was rather heavy (based on error logs OOM).

It surprised me because I had never executed a query that caused the whole host to crash up until that point. Now I'm wondering if this misconfiguration is the cause.

renewiltord 4 years ago |

Interesting. Also a problem with RDS: https://stackoverflow.com/questions/52148675/aws-rds-with-po...

zingar 4 years ago |

I'd like to thank the author for their clear, simple explanation. I haven't had to think about allocating memory since university and am not practiced thinking about it in my software but now I feel like I have useful ways to think about why processes just disappear sometimes.

shdh 4 years ago |

GCP CloudSQL has a lot of issues. There was one with query insights being enabled causing segfaults on `LEFT JOIN` operations. It's since been patched, but really shitty.

yjftsjthsd-h 4 years ago |

So are there problems with disabling overcommit? Or is it really that simple (at least for dedicated db hosts)?
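One thing worth checking before flipping the switch is how much memory the kernel will actually allow to be committed in strict mode. The following is a minimal sketch, assuming the formula from the kernel's overcommit accounting documentation (RAM minus huge pages, times vm.overcommit_ratio, plus swap) and ignoring vm.overcommit_kbytes; the helper names and the 16 GiB host are hypothetical.

```python
from pathlib import Path

MODES = {
    0: "heuristic overcommit",
    1: "always overcommit",
    2: "strict accounting (no overcommit)",
}


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio, hugetlb_kb=0):
    """Approximate CommitLimit under vm.overcommit_memory=2.

    (RAM - huge pages) * ratio / 100 + swap, all in kB.
    Does not model vm.overcommit_kbytes, which replaces the ratio when set.
    """
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


def read_int(path):
    """Return the first integer in a /proc/sys file, or None if it is absent."""
    p = Path(path)
    return int(p.read_text().split()[0]) if p.exists() else None


if __name__ == "__main__":
    mode = read_int("/proc/sys/vm/overcommit_memory")
    ratio = read_int("/proc/sys/vm/overcommit_ratio")
    if mode is not None:  # only present on Linux
        print(f"vm.overcommit_memory = {mode} ({MODES.get(mode, 'unknown')})")
        print(f"vm.overcommit_ratio  = {ratio}")
    # Hypothetical dedicated host: 16 GiB RAM, no swap, default ratio of 50.
    print(commit_limit_kb(16 * 1024 * 1024, 0, 50), "kB committable")
```

With the default ratio of 50 and no swap, that hypothetical host could only commit about half of its RAM, so the ratio usually needs tuning alongside the mode.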

dkersten 4 years ago |

A metacomment about the page (rather than the content): the text in the white boxes is almost unreadable for me, the contrast is crazy low.

meepmorp 4 years ago | |

I had this problem, too; there's a button to toggle the night mode theme, which fixed it for me.

dkersten 4 years ago | | |

Oh. I didn't even see that. That does indeed make the text legible, thank you. Sadly it makes the entire page too bright, but at least it's readable!