Using ClickHouse to scale an events engine(github.com) |
Using ClickHouse to scale an events engine(github.com) |
There will likely be a good OLAP solution (possibly implemented as an extension) in Postgres in the next year or so. There are a few companies are working on it (Hydra, Parade[0], tembo etc.).
this looks like repackaging of datafusion as PG extension?..
ParadeDB integrates industry standards like Arrow, Parquet, DataFusion to offer columnar storage + vectorized processing. Hydra is building on top of Citus Columnar.
You can read about our approach here: https://blog.paradedb.com/pages/introducing_analytics
* Timeouts (only 30s???) unless I used the cli client
* Cancelling rows - Just kill me, so many bugs and FINAL/PREWHERE are massive foot-guns
* Cluster just feels annoying and fragile don't forget "ON CLUSTER" or you'll have a bad time
Again, I feel like we must be doing something wrong but we are paying an arm and a leg for that "privilege".
If your analysis in Postgres was based on multiple tables and required a lot of JOIN operations, I don't think ClickHouse is a good choice. In such cases, you often need to denormalize multiple data tables into one large table in advance, which means complex ETL and maintenance costs.
For these more common scenarios, I think StarRocks (www.StarRocks.io) is a better choice. It's a Linux Foundation open-source project, with single-table query speeds comparable to ClickHouse (you can check Clickbench), and unmatched multi-table join query speeds, plus it can directly query open data lakes.
What else could you possibly need? Backing up is built into it with S3 support: https://clickhouse.com/docs/en/operations/backup#configuring...
Upgrades are a breeze: https://clickhouse.com/docs/en/operations/update
People insist that OMG MAINTENANCE I NEED TO PAY THOUSANDS FOR MANAGED is better, when in reality, it is not.
I have been working on Scratchdata [1], which makes it easy to try out a column database to optimize aggregation queries (avg, sum, max). We have helped people [2] take their Postgres with 1 billion rows of information (1.5 TB) and significantly reduce their real-time data analysis query time. Because their data was stored more efficiently, they saved on their storage bill.
You can send data as a curl request and it will get batch-processed and flattened into ClickHouse:
curl -X POST "http://app.scratchdata.com/api/data/insert/your_table?api_ke..." --data '{"user": "alice", "event": "click"}'
The founder, Jay, is super nice and just wants to help people save time and money. If you give us a ring, he or I will personally help you [3].
[1] https://www.scratchdb.com/ [2] https://www.scratchdb.com/blog/embeddables/ [3] https://q29ksuefpvm.typeform.com/to/baKR3j0p?typeform-source...
Now, the postgres schema wasn't ideal, and we could have saved ~ 3x on it with corresponding speed increases for queries with a refactor similar to the clickhouse schema, but that wasn't really enough to move the needle to near real-time queries.
Ultimately, the entire clickhouse DB was smaller than the original postgres primary key index. The index was too big to fit in memory on an affordable machine, so it's pretty obvious where the performance is coming from.
I guess maybe the cost?
Generally speaking I've found it manageable if you make good use of partitioning and do incremental aggregation (we use dbt, though you have to do some macro gymnastics to make the partition key filter eligible for pruning due to restrictions on use of dynamic values https://docs.getdbt.com/docs/build/incremental-models)
It's also important to monitor your cost and watch for the point where switching from the per-tb queried pricing model to slots makes sense.
Simultaneously this change may not make sense for Lago as an open-source project self-hosted by a single tenant.
But that may also mean that it effectively makes sense for Lago as a business... to make it harder to self host.
I don't at all fault Lago for making decisions to prioritize their multi-tenant cloud offering. That's probably just the nature of running open-source SaaS these days.
I'm struggling with pg write performance ATM and want some tips.
It is extremely cost effective when you can scale a different workload without migrating.
While postgres works fine (even it is slower, but actually returns results)
For a lot of what people may want to do, they'd probably notice very little difference between the three.
PG can handle a billion rows easily.
I super love PG but PG is too far away from that.
If so, is it a gdpr compliant storage solution? I am asking it since gdpr compliance may require data deletion (or at least anonimization)
This in turn, creates read+write load. Modern OLAP db’s often support it, often via mitigating strategies to minimise the amount of extra work they incur: mark tainted rows, exclude them from queries, and clean up asynchronously; etc.
There will likely be a good OLAP solution (possibly implemented as an extension) in Postgres in the next year or so. Many companies are working on it (Hydra, Parade[0], etc.)
ParadeDB - AGPL License https://github.com/paradedb/paradedb/blob/dev/LICENSE
Hydra - Apache 2.0 https://github.com/hydradatabase/hydra/blob/main/LICENSE
also hydra seems derived from citusdata's columnar implementation.
I'm sure CH shines for insert-only workloads but that doesn't cover all our needs.
Almost all clients (client libraries) allow a configurable timeout. In server settings there is a max query time settings which can be adjusted if necessary: https://clickhouse.com/docs/en/operations/settings/query-com...
> However, using FINAL is sometimes necessary in order to produce accurate results
Welp.
It’s a table design optimised for specific workloads, and the docs and design detail those tradeoffs.
We use it at work for workloads that can tolerate “retreading” over stale data, because it means they can efficiently write to the db without round tripping, or locking and row updates, and without the table growing massive. It works fantastically in our use case.
I don't disagree, I feel like we might be using it wrong. We were trying to replace ES with it but it just doesn't feel like it fits our needed usecase.
For example, I've set it up, along with many more limitations for my public playground https://play.clickhouse.com/, and it allows me to, at least, make it public and not worry much.
It could also be a configuration of a proxy if you connect through a proxy. ClickHouse has built-in HTTP API, so you can query it directly from the browser or put it behind Cloudflare, etc... Where do you host ClickHouse?
Having served as both ClickHouse and Postgres SRE, I don't agree with this statement.
- Minimal downtime major version upgrades in PostgreSQL is very challenging.
- glibc version upgrade breaks postgres indices. This basically prevents from upgrading linux OS.
And there are other things which makes postgres operationally difficult.
Any database with primary-replica architecture is operationally difficult IMO.
I gave more detail with a toy example here: https://news.ycombinator.com/item?id=39245416
I've since played around with this a little more and you can do it pretty generically (at least make the worker generic where you give it a function `Chunk[A] => Task[Chunk[Result[B]]]` to do the database logic). I don't have that handy to post right now, but probably you're not using Scala anyway so the details aren't that relevant.
I've tried out a similar thing in Rust and it's a lot more finicky but still doable there. Should be similar in go I'd think.
You could use partitions though.
You may want to consider adjusting your partition key (if feasible) as a function of datetime so you can just drop a complete partition when required, rather than needing separate delete queries.
In my experience, it has proven to be a very quick and clean way to clear out older data.
There are many applications that require extremely high insertion rates (millions of records per second), very large total number of rows (billions, trillions) and flexible/fast querying/aggregation with high read rates (100's of millions or higher rows/s) and that's sort of the sweet spot IMO for ClickHouse and where you'll be pressed to find alternatives. I'm sure it can be used in other situations but maybe there are more choices if you're in those.
Unfortunately this is not always realistic, especially in large organizations, I know where I am there is a big push from top (i.e the IT budget people) to standardize everything they want to simplify licenses, support contracts etc.
I may not be doing cutting edge stuff (I work at an Industrial plant) but we do have mixed data use cases where it could be beneficial to use different dbs but realistically I don't see it happening.
Numerous ways to achieve removal of old/stale rows.
If you just use a pre-built package, the AGPL has the exact same requirements as the GPL.
But yes, I've seen similar issues, running out of memory during query processing, it's a price you pay for higher performance. You need to know what's happening under the hood and do more work to make sure your queries will work well. I think postgres can be a thousand or more times slower, and doesn't have the horizontal scalability, so if you need to do complex queries/aggregations over billions of records then "return result" doesn't cut it. If postgres addresses your needs then great- you don't need to use ClickHouse...
but what knobs to use and what values to use in each specific case? Query just usually fails with some generic OOM message without much information.
- max_concurrent_queries, since each query uses a certain amount of memory
- max_memory_usage, which is the max per-query memory usage
Here's my full config for running clickhouse on a 2GiB server without OOMs. Some stuff in here is likely irrelevant, but it's a starting point.
diff --git a/clickhouse-config.xml b/clickhouse-config.xml
index f8213b65..7d7459cb 100644
--- a/clickhouse-config.xml
+++ b/clickhouse-config.xml
@@ -197,7 +197,7 @@
<!-- <listen_backlog>4096</listen_backlog> -->
- <max_connections>4096</max_connections>
+ <max_connections>2000</max_connections>
<!-- For 'Connection: keep-alive' in HTTP 1.1 -->
<keep_alive_timeout>3</keep_alive_timeout>
@@ -270,7 +270,7 @@
-->
<!-- Maximum number of concurrent queries. -->
- <max_concurrent_queries>100</max_concurrent_queries>
+ <max_concurrent_queries>4</max_concurrent_queries>
<!-- Maximum memory usage (resident set size) for server process.
Zero value or unset means default. Default is "max_server_memory_usage_to_ram_ratio" of available physical RAM.
@@ -335,7 +335,7 @@
In bytes. Cache is single for server. Memory is allocated only on demand.
You should not lower this value.
-->
- <mark_cache_size>5368709120</mark_cache_size>
+ <mark_cache_size>805306368</mark_cache_size>
<!-- If you enable the `min_bytes_to_use_mmap_io` setting,
@@ -981,11 +980,11 @@
</distributed_ddl>
<!-- Settings to fine tune MergeTree tables. See documentation in source code, in MergeTreeSettings.h -->
- <!--
<merge_tree>
- <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
+ <merge_max_block_size>2048</merge_max_block_size>
+ <max_bytes_to_merge_at_max_space_in_pool>1073741824</max_bytes_to_merge_at_max_space_in_pool>
+ <number_of_free_entries_in_pool_to_lower_max_size_of_merge>0</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
</merge_tree>
- -->
<!-- Protection from accidental DROP.
If size of a MergeTree table is greater than max_table_size_to_drop (in bytes) than table could not be dropped with any DROP query.
diff --git a/clickhouse-users.xml b/clickhouse-users.xml
index f1856207..bbd4ced6 100644
--- a/clickhouse-users.xml
+++ b/clickhouse-users.xml
@@ -7,7 +7,12 @@
<!-- Default settings. -->
<default>
<!-- Maximum memory usage for processing single query, in bytes. -->
- <max_memory_usage>10000000000</max_memory_usage>
+ <max_memory_usage>536870912</max_memory_usage>
+
+ <queue_max_wait_ms>1000</queue_max_wait_ms>
+ <max_execution_time>30</max_execution_time>
+ <background_pool_size>4</background_pool_size>
+
<!-- How to choose between replicas during distributed query processing.
random - choose random replica from set of replicas with minimum number of errorsOverall I’d say CH isn’t as tolerant or forgiving as BigQuery, Snowflake, or Databricks. You can write the worst SQL possible and BQ will happily charge you $5/TB for that cartesian self-join. CH meanwhile will error with memory limit or even crash.
my experience is that those are not enough, multiple algorithms will just fail saying you hit max memory limit. There are many other knobs, for example: when to start external aggregation or sorting. For some cases I couldn't figure out setup and query just hits OOM without any ideas how to fix it.
What kind of queries are you trying to do? Also, what kind of machine are you running on?