Using ClickHouse to scale an events engine

Using ClickHouse to scale an events engine (github.com)

237 points by wyndham 2 years ago | 97 comments

> Recently, the most interesting rift in the Postgres vs OLAP space is [Hydra](https://www.hydra.so), an open-source, column-oriented distribution of Postgres that was very recently launched (after our migration to ClickHouse). Had Hydra been available during our decision-making time period, we might’ve made a different choice.

There will likely be a good OLAP solution (possibly implemented as an extension) in Postgres in the next year or so. A few companies are working on it (Hydra, Parade[0], tembo, etc.).

0 - https://www.paradedb.com/

riku_iki 2 years ago | |

> 0 - https://www.paradedb.com/

this looks like repackaging of datafusion as PG extension?..

mritchie712 2 years ago | | |

yes, that's a succinct way to put it.

snihalani 2 years ago | |

Have you seen: https://benchmark.clickhouse.com/

iimblack 2 years ago | | |

That’s cool. Clickhouse and Alloy’s performances are impressive.

riku_iki 2 years ago | | |

that benchmark is very weak, they used just 100M rows which is laughable, also no joins have been tested.

philippemnoel 2 years ago | |

ParadeDB founder here. You can see how we compare to other Postgres-based analytical offerings on ClickBench here: https://blog.paradedb.com/pages/introducing_analytics

ddorian43 2 years ago | |

I don't think `tembo` is working on it though, probably just hosting an existing extension.

Mortiffer 2 years ago | |

So are ParadeDB and Hydra using the same codebase, or just a similar approach?

philippemnoel 2 years ago | | |

ParadeDB and Hydra are completely different. We're tackling the same problem of bringing analytics inside Postgres, but using different approaches.

ParadeDB integrates industry standards like Arrow, Parquet, DataFusion to offer columnar storage + vectorized processing. Hydra is building on top of Citus Columnar.

You can read about our approach here: https://blog.paradedb.com/pages/introducing_analytics

joshstrange 2 years ago |

I feel like with all the Clickhouse praise on HN that we /must/ be doing something fundamentally wrong because I hate every interaction I have with Clickhouse.

* Timeouts (only 30s???) unless I used the cli client

* Cancelling rows - Just kill me, so many bugs and FINAL/PREWHERE are massive foot-guns

* Cluster just feels annoying and fragile; don't forget "ON CLUSTER" or you'll have a bad time

Again, I feel like we must be doing something wrong but we are paying an arm and a leg for that "privilege".
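A minimal sketch of why cancel rows are easy to get wrong, assuming "cancelling rows" refers to the sign-column pattern (as in ClickHouse's CollapsingMergeTree), where a row with sign -1 cancels an earlier row with sign +1. This is a toy in-memory model, not ClickHouse code; the `rows` data and both function names are made up for illustration.

```python
from collections import defaultdict

# Each row is (key, value, sign): sign=+1 is a "state" row, sign=-1 cancels
# an earlier state row with the same key and value. Hypothetical data.
rows = [
    ("cart-1", 10, +1),   # initial state
    ("cart-1", 10, -1),   # cancel row for the initial state
    ("cart-1", 25, +1),   # replacement state
    ("cart-2", 7, +1),
]

def naive_totals(rows):
    """Sum values while ignoring the sign column (the foot-gun)."""
    totals = defaultdict(int)
    for key, value, _sign in rows:
        totals[key] += value
    return dict(totals)

def sign_aware_totals(rows):
    """Sum value * sign, keeping only keys whose signs sum to more than 0."""
    totals = defaultdict(int)
    live = defaultdict(int)
    for key, value, sign in rows:
        totals[key] += value * sign
        live[key] += sign
    return {k: v for k, v in totals.items() if live[k] > 0}

print(naive_totals(rows))
print(sign_aware_totals(rows))
```

Until the engine has merged the cancelled pairs away, any query that forgets the sign column counts both the old and the new state, which is the kind of silent error that makes this pattern feel fragile.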

HermitX 2 years ago |

Is ClickHouse a suitable engine for analyzing events? Absolutely, as long as you're analyzing a large table, its speed is definitely fast enough. However, you might want to consider the cost of maintaining an OSS ClickHouse cluster, especially when you need to scale up, as the operational costs can be quite high.

If your analysis in Postgres was based on multiple tables and required a lot of JOIN operations, I don't think ClickHouse is a good choice. In such cases, you often need to denormalize multiple data tables into one large table in advance, which means complex ETL and maintenance costs.

For these more common scenarios, I think StarRocks (www.StarRocks.io) is a better choice. It's a Linux Foundation open-source project, with single-table query speeds comparable to ClickHouse (you can check Clickbench), and unmatched multi-table join query speeds, plus it can directly query open data lakes.
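To make the denormalization step mentioned above concrete, here is a rough sketch, assuming two hypothetical normalized tables (`users` and `events`) that get joined ahead of time into one wide table. The table and column names are invented for illustration and are not from any real schema.

```python
# Hypothetical normalized tables, as they might sit in Postgres.
users = {1: {"name": "alice", "plan": "pro"}, 2: {"name": "bob", "plan": "free"}}
events = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view"},
    {"user_id": 1, "event": "purchase"},
]

def denormalize(events, users):
    """Join each event with its user up front, producing one wide row per event."""
    wide = []
    for e in events:
        u = users.get(e["user_id"], {})
        wide.append({
            "user_id": e["user_id"],
            "event": e["event"],
            "user_name": u.get("name"),
            "user_plan": u.get("plan"),
        })
    return wide

wide_rows = denormalize(events, users)

# Analytic queries now scan a single table, e.g. events per plan with no join.
events_per_plan = {}
for row in wide_rows:
    events_per_plan[row["user_plan"]] = events_per_plan.get(row["user_plan"], 0) + 1

print(events_per_plan)
```

The maintenance cost shows up when a dimension changes: if a user switches plans, every wide row already written for that user is stale until the ETL rewrites it.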

jakearmitage 2 years ago | |

> consider the cost of maintaining an OSS ClickHouse cluster

I mean... it is pretty straightforward. 40~60 line Terraform, Ansible with templates for the proper configs that get exported from Terraform so you can write the IPs so they can see each other, and you are done.

What else could you possibly need? Backing up is built into it with S3 support: https://clickhouse.com/docs/en/operations/backup#configuring...

Upgrades are a breeze: https://clickhouse.com/docs/en/operations/update

People insist that OMG MAINTENANCE I NEED TO PAY THOUSANDS FOR MANAGED is better, when in reality, it is not.

breadchris 2 years ago |

ClickHouse is awesome, but as the post shows, some code is involved in getting the data there.

I have been working on Scratchdata [1], which makes it easy to try out a column database to optimize aggregation queries (avg, sum, max). We have helped people [2] take their Postgres with 1 billion rows of information (1.5 TB) and significantly reduce their real-time data analysis query time. Because their data was stored more efficiently, they saved on their storage bill.

You can send data as a curl request and it will get batch-processed and flattened into ClickHouse:

curl -X POST "http://app.scratchdata.com/api/data/insert/your_table?api_ke..." --data '{"user": "alice", "event": "click"}'
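As an illustration of what "flattened" could mean for a nested JSON event, here is a small sketch. It is an assumption about the general idea, not a description of how the service actually does it: the `flatten` helper, the `_` separator, and the nested `device` field are all hypothetical.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Hypothetical event with a nested object; only "user" and "event" appear in the curl above.
event = {"user": "alice", "event": "click", "device": {"os": "linux", "screen": {"w": 1920}}}
flat_event = flatten(event)
print(flat_event)
```

Each flattened key can then map to one column, which is the shape a column store wants.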

The founder, Jay, is super nice and just wants to help people save time and money. If you give us a ring, he or I will personally help you [3].

[1] https://www.scratchdb.com/

[2] https://www.scratchdb.com/blog/embeddables/

[3] https://q29ksuefpvm.typeform.com/to/baKR3j0p?typeform-source...

wiredfool 2 years ago | |

My first big win for ClickHouse was replacing a 1.2 TB, billion-plus-row PostgreSQL DB with ClickHouse. It was static data with occasional full replacement loads. We got the DB down to ~60 GB, with query speeds about 45x faster.

Now, the postgres schema wasn't ideal, and we could have saved ~ 3x on it with corresponding speed increases for queries with a refactor similar to the clickhouse schema, but that wasn't really enough to move the needle to near real-time queries.

Ultimately, the entire clickhouse DB was smaller than the original postgres primary key index. The index was too big to fit in memory on an affordable machine, so it's pretty obvious where the performance is coming from.

hodgesrm 2 years ago | | |

This is a nice illustration of the effects of different choices for storage layout and use of compute. ClickHouse blows away single-threaded queries on row-based data for analytic questions. On the other hand PostgreSQL can offer far higher throughput and concurrency when updating a shopping cart.

alooPotato 2 years ago |

We use BigQuery a lot for internal analytics and we've been super happy. I don't see a lot of love for BigQuery on HN and I wonder why. Tons of features, no hassle and easy to throw a bunch of TB at it.

I guess maybe the cost?

mnahkies 2 years ago | |

I'm a big fan of BigQuery as well, but the cost can be problematic if you're not careful.

Generally speaking I've found it manageable if you make good use of partitioning and do incremental aggregation (we use dbt, though you have to do some macro gymnastics to make the partition key filter eligible for pruning due to restrictions on use of dynamic values https://docs.getdbt.com/docs/build/incremental-models)

It's also important to monitor your cost and watch for the point where switching from the per-tb queried pricing model to slots makes sense.

alooPotato 2 years ago | | |

yeah between partitioning, clustering, materialized views, and smart tuning it seems like there are enough knobs to control costs.

RadiozRadioz 2 years ago | |

Probably also because it is proprietary and only exists in one cloud platform.

wodenokoto 2 years ago | | |

No, it’s because it’s google and HN are certain it will get cancelled at any moment.

doo_daa 2 years ago | |

We are lucky enough to be able to run BigQuery with flat rate billing. It's incredibly powerful and it's a really good example of SaaS and Serverless done right. It just works.

lysecret 2 years ago | |

Yep love it too, especially with external data on GCS. Costs this way are very low. And the convenience is amazing (getting caches you can stream from for every query is a godsend)

alooPotato 2 years ago | | |

What do you mean by the streaming caches?

wodenokoto 2 years ago | |

I was quite surprised that other clouds don’t have an easy-to-get-started analytics data warehouse solution like BigQuery.

drewda 2 years ago |

This change may make sense for Lago as a hosted multi-tenant service, as offered by Lago the company.

Simultaneously this change may not make sense for Lago as an open-source project self-hosted by a single tenant.

But that may also mean that it effectively makes sense for Lago as a business... to make it harder to self host.

I don't at all fault Lago for making decisions to prioritize their multi-tenant cloud offering. That's probably just the nature of running open-source SaaS these days.

config_yml 2 years ago | |

Exactly, I've seen this at Sentry where you now have to run Kafka, Clickhouse, Redis, PG, Zookeeper, memcached and what have you. I get it, but the amount of baggage to handle is a bit difficult.

stephen123 2 years ago |

How were they doing millions of events per minute with Postgres?

I'm struggling with pg write performance ATM and want some tips.

Ozzie_osman 2 years ago | |

If you're not already doing this: remove unnecessary indices, partition the table, batch your inserts/updates, or try COPY instead of INSERT.
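A rough sketch of the batching and COPY suggestions, assuming events arrive as simple tuples. It only builds the in-memory CSV buffers that a COPY would consume, so it runs without a database; the `chunks` and `to_copy_buffer` helpers, the sample rows, and the `events(user_name, event)` table in the comment are all hypothetical.

```python
import csv
import io
from itertools import islice

def chunks(rows, size):
    """Yield lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def to_copy_buffer(chunk):
    """Serialize one batch as CSV text, the shape COPY ... WITH (FORMAT csv) reads."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(chunk)
    buf.seek(0)
    return buf

# Hypothetical event rows: (user, event).
rows = [("alice", "click"), ("bob", "view"), ("carol", "click")]
buffers = [to_copy_buffer(chunk) for chunk in chunks(rows, 2)]

for buf in buffers:
    print(repr(buf.getvalue()))

# With a real connection (assumption: psycopg2 and a matching table), each
# buffer could be sent in one round trip instead of one INSERT per row:
#   cur.copy_expert("COPY events (user_name, event) FROM STDIN WITH (FORMAT csv)", buf)
```

The batch size here is tiny for readability; in practice it would be tuned against memory and latency.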

unixhero 2 years ago | |

Turn off indexing and other optimizations done on a table level

stephen123 2 years ago | | |

What do you do to then query the data? I usually need indexes so queries are not slow. Perhaps I could insert into a staging table then bulk copy the data over to an indexed table, but that seems silly.

whalesalad 2 years ago | |

What’s your hardware? RDS? Nvme storage?

stephen123 2 years ago | | |

It's Google Cloud SQL.

mathnode 2 years ago |

And if you use MariaDB, just enable columnstore. Why not treat yourself to s3 backed storage while you are there?

It is extremely cost effective when you can scale a different workload without migrating.

hipadev23 2 years ago | |

This is no shade to postgres or maria, but they don’t hold a candle to the simplicity, speed, and cost efficiency of clickhouse for olap needs.

riku_iki 2 years ago | | |

I have tons of OOMs with clickhouse on larger than RAM OLAP queries.

Postgres, meanwhile, works fine (it is slower, but it actually returns results).

flessner 2 years ago | | |

And I mean why should they? They work great for what they are made for and that is all that matters!

silisili 2 years ago | | |

As a caveat, I'd probably say 'at large volumes.'

For a lot of what people may want to do, they'd probably notice very little difference between the three.

philippemnoel 2 years ago | | |

That's true, but we're trying to change that at ParadeDB. Postgres is still way ahead of ClickHouse in terms of operational simplicity, ease of hiring for DBAs who are used to operating it at scale, ecosystem tooling, etc. If you can patch the speed and cost efficiency of Postgres for analytics to a level comparable to ClickHouse, then you get the best of both worlds

mathnode 2 years ago | | |

For multi-tb or pb needs I would not stray from mariadb. Especially when using columnstore. I have taken the pepsi challenge, even after trying vertica and netezza. Not HANA though; one has had enough of SAP.

samber 2 years ago |

I'm curious: how many rows does Lago store in its CH cluster? Do they collect data for fighting fraud?

PG can handle a billion rows easily.

didip 2 years ago | |

OLAP databases need to be able to handle billions of rows per hour/day.

I super love PG but PG is too far away from that.

JosephRedfern 2 years ago | |

Reading between the lines, given they're talking > 1 million rows per minute, I'd guess on the order of trillions of rows rather than billions (assuming they retain data for more than a couple of weeks)
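A back-of-the-envelope check of that estimate, assuming a steady rate of exactly 1 million rows per minute, which is only the lower bound quoted above. The real ingest rate and retention period are not given, so treat the numbers as a floor.

```python
ROWS_PER_MINUTE = 1_000_000      # lower bound from the thread ("> 1 million rows per minute")
MINUTES_PER_DAY = 60 * 24

rows_per_day = ROWS_PER_MINUTE * MINUTES_PER_DAY
rows_in_two_weeks = rows_per_day * 14
days_to_one_trillion = 1_000_000_000_000 / rows_per_day

print(f"{rows_per_day:,} rows/day")
print(f"{rows_in_two_weeks:,} rows in 14 days")
print(f"{days_to_one_trillion:.0f} days to reach 1 trillion rows")
```

At that floor rate, two weeks of retention is about 20 billion rows, and a trillion takes roughly 694 days, so "trillions" implies either a much higher sustained rate or retention measured in years.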

jacobsenscott 2 years ago | |

PG can handle billions of rows for certain use cases, but not easily. Generally you can make things work but you definitely start entering "heroic effort" territory.

jackbauer24 2 years ago |

scale is becoming more and more important, not just for cost, but also as a key technology feature to help deal with unexpected traffic and reduce the cost of manual operations.

andretti1977 2 years ago |

I have a tangentially related question since I don’t use an Olap db: is deleting data so hard to perform? Is it necessarily an immutable storage?

If so, is it a GDPR-compliant storage solution? I'm asking because GDPR compliance may require data deletion (or at least anonymization).

FridgeSeal 2 years ago | |

Columnar DBs want stuff to be contiguous on disk, and deletes cause the rest of the data in that “block” to be rewritten (imagine deleting a chunk out of the middle of an Excel table: you’ve got to move everything else up).

This, in turn, creates read+write load. Modern OLAP DBs often support deletes, usually via mitigating strategies that minimise the extra work incurred: mark tainted rows, exclude them from queries, and clean up asynchronously.
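The mark, exclude, and clean-up-later strategy can be sketched with a toy in-memory column block. This is only an illustration of the idea, not how any particular OLAP engine implements it; the `ColumnBlock` class and its method names are made up.

```python
class ColumnBlock:
    """Toy columnar block: one list per column plus a per-row delete mask."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}
        row_count = len(next(iter(self.columns.values())))
        self.deleted = [False] * row_count

    def delete_where(self, column, predicate):
        """Cheap delete: only flip mask bits, no column data is rewritten."""
        for i, value in enumerate(self.columns[column]):
            if predicate(value):
                self.deleted[i] = True

    def scan(self, column):
        """Queries skip rows whose mask bit is set."""
        return [v for v, dead in zip(self.columns[column], self.deleted) if not dead]

    def compact(self):
        """Background cleanup: rewrite every column without the masked rows."""
        keep = [i for i, dead in enumerate(self.deleted) if not dead]
        self.columns = {name: [vals[i] for i in keep] for name, vals in self.columns.items()}
        self.deleted = [False] * len(keep)

block = ColumnBlock({"user": ["alice", "bob", "carol"], "amount": [10, 20, 30]})
block.delete_where("user", lambda u: u == "bob")   # e.g. an erasure request
visible = block.scan("amount")                     # bob's row is already hidden
block.compact()                                    # bob's data is physically gone only now
print(visible, block.columns)
```

Note the gap between the two steps: after `delete_where` the row is invisible to queries but still stored, and only `compact` removes it, which matters when deletion has to be physical rather than logical.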

dangoodmanUT 2 years ago |

deleting this comment because apparently jokes are not received well here

mritchie712 2 years ago | |

There will likely be a good OLAP solution (possibly implemented as an extension) in Postgres in the next year or so. Many companies are working on it (Hydra, Parade[0], etc.)

0 - https://www.paradedb.com/

kapilvt 2 years ago | | |

for others curious

ParadeDB - AGPL License https://github.com/paradedb/paradedb/blob/dev/LICENSE

Hydra - Apache 2.0 https://github.com/hydradatabase/hydra/blob/main/LICENSE

also hydra seems derived from citusdata's columnar implementation.