A new JSON data type for ClickHouse(clickhouse.com) |
A new JSON data type for ClickHouse(clickhouse.com) |
Snowflake released a white paper before its IPO days and mentioned this same feature (secretly exploding JSON into columns). Explains how snowflake feels faster than it should, they’ve secretly done a lot of amazing things and just offered it as a polished product like Apple.
https://www.singlestore.com/blog/json-builtins-over-columnst...
For a colstore database, dealing with json as strings is a big perf hit.
Are there any setups for reliable data ingestion into Clickhouse that don't involve spinning up Kafka & Zookeeper?
Vector is a relatively simple ingest tool that supports lots of sources and sinks. It's very simple to run — just a config file and a single binary, and you're set. But it can do a fair amount of ETL (e.g. enriching or reshaping JSON), including some more advanced pipeline operators like joining multiple streams into one. It's maybe not as advanced as some ETL tools, but it covers a lot of ground.
Since you mention Kafka, I would also mention Redpanda, which is Kafka-compatible, but much easier to run. No Java, no ZooKeeper. I think you'd still want Vector here, with Vector connecting Redpanda to ClickHouse. Then you don't need the buffering that Vector provides, and Vector would only act as the "router" than pulls from Redpanda and ingests into ClickHouse.
Another option is RudderStack, which we also use for other purposes. It's a richer tool with a full UI for setting up pipelines, and so on.
Are you looking for setups for OSS ClickHouse or managed ClickHouse services that solve it?
Both Tinybird & ClickHouse Cloud are managed ClickHouse services that include ingest connectors without needing Kafka
Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API, so you can connect it with ClickHouse as if it was Kafka, without actually having Kafka (though I'm not sure if this is in the open source Estuary Flow project or not, I have a feeling not)
If you just want to play with CH, you can always use clickhouse-local or chDB which are more like DuckDB, running without a server, and work great for just talking to local files. If you don't need streams and are just working with files, you can also use them as an in-process/serverless transform engine - file arrives, read with chDB, process it however you need, export it as CH binary format, insert directly into your main CH. Nice little pattern than can run on a VM or in Lambda's.
> Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API
This is definitely an improvement, but if it looks like kafka and sounds like kafka, I get a bit sus.
> If you just want to play with CH, you can always use clickhouse-local
I've done that, but getting from this to "streaming data" is where I get stuck.
> If you don't need streams
Afraid streams are what I'm dealing with..
That's the same stage I get stuck every time.
I have data emitters (in this example let's say my household IoT devices, feeding a MQTT broker then HomeAssistant).
I have where I want the data to end up (Clickhouse, Database, S3, whatever).
How do I get the data from A to B, so there are no duplicate rows (if the ACK for an upload isn't received when the upload succeeded), no missing rows (the data is retried if an upload fails), and some protection if the local system goes down (data isn't ephemeral)?
The easiest I've found is writing data locally to files (JSON, parquet, whatever), new file every 5 minutes and sync the older files to S3.
But then I'm stuck again. How do I continually load new files from S3 without any repetition or edge cases? And did I really need the intermediate files?
Duplicates get merged out, and errors can be handles at the http level. (Admittedly, one bad row in a big batch post is a pain, but I don’t see that much)
There are plenty of ways to do it with batching, but I assume you want to real-time "insert into table" style or a direct "ch.write(data)", then no. There is no way as far as I know without batching. This is one of the main reason we stopped CH for our last project about 3 years ago for financial data analytic tooling. CH doesn't have a transaction log like WAL, so your data producers need to be smart or you need a "queue" type service to deal with it, whether it's S3 or Kafka or Kinesis to allow batching.
Those are probably some old sources of knowledge. You need to use Kafka if you want it to handle batching for you. But Clickhouse can handle batching as well by using asynchronous inserts:
https://clickhouse.com/blog/asynchronous-data-inserts-in-cli...
What makes Clickhouse different that you're unable to load data into?
https://docs.databend.com/sql/sql-commands/dml/dml-copy-into...
I'm so excited for this! One of my major bug-bears with storing logs in Elasticsearch is the set-type-on-first-seen-occurrence headache.
Hope to see this leave experimental support soon!
For example, consider the documents {"value": 42} and {"value": "foo"}. To index this, index {"value::int": 42} and {"value::str": "foo"} instead. Now you have two distinct fields that don't conflict with each other.
To search this, the logical choice would be to first make sure that the query language is typed. So a query like value=42 would know to search the int field, while a query like value="42" would look in the string field. There's never any situation where there's any ambiguity about which data type is to be searched. KQL doesn't have this, but that's one of their many design mistakes.
You can do the same for any data type, including arrays and objects. There is absolutely no downside; I've successfully implemented it for a specific project. (OK, one downside: More fields. But the nature of the beast. These are, after all, distinct sets of data.)
But now all my queries that look for “value” don’t work. And I’ve got two columns in my report where I only want one.
[1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...
Now, it wasn't a good schema to start with, and there was about a factor of 3 or 4 size that could be pulled out, but clickhouse was a factor of 20 better for on disk size for what we were doing.
For truly big data (terabytes per month) we rely on BigQuery. For smaller data that is more OLTP write heavy we are using psql… but I think there is room in the middle.
https://clickhouse.com/blog/a-new-powerful-json-data-type-fo...
CREATE TABLE mk3
ENGINE = MergeTree
ORDER BY (account_id, resource_type)
SETTINGS allow_nullable_key = 1
AS SELECT
*,
CAST(content, 'JSON') AS content_json
FROM file('Downloads/data_snapshot.parquet')
Query id: 8ddf1377-7440-4b4d-bb8d-955cd0f2b723
↑ Progress: 239.57 thousand rows, 110.38 MB (172.49 thousand rows/s., 79.48 MB/s.) 22%
Elapsed: 4.104 sec. Processed 239.57 thousand rows, 110.38 MB (58.37 thousand rows/s., 26.89 MB/s.)
Received exception:
Code: 107. DB::ErrnoException: Cannot open file /var/folders/mc/gndsp71j6zz64pm7j2wz_6lh0000gn/T/clickhouse-local-503e1494-c3fb-4a5e-9514-be5ba7940fec/data/default/mk3/tmp_insert_all_1_1_0/content_json.plan.features.available.core/audio.dynamic_structure.bin: , errno: 2, strerror: No such file or directory. (FILE_DOESNT_EXIST)Well, if you want to do things exactly how JS does it, then storing them all as float is correct. However, The JSON standard doesn't say it needs to be done the same way as JS.
It turned out great idea which inspired other databases.
Since we've been using ClickHouse long before this JSON functionality was available (or even before the earlier version of this called `Object('json')` was avaiable) we ended up setting up a job that would materialize json fields out of a json blob and into materialized columns based on query patterns against the keys in the JSON blob. Then, once those materialized columns were created we would just route the queries to those columns at runtime if they were available. This saved us a _ton_ on CPU and IO utilization. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make the computer do less and this new JSON type does exactly that and it's so turn key!
https://posthog.com/handbook/engineering/databases/materiali...
The team over at ClickHouse Inc. as well as the community behind it moves surprisingly fast. I can't recommend it enough and excited for everything else that is on the roadmap here. I'm really excited for what is on the horizon with Parquet and Iceberg support.
It's common knowledge that 'postgres is all you need' - but if you somehow reach the stage of 'postgres isn't all I need and I have hard proof' this should be the next tech you look at.
Also, clickhouse-local is rather amazing at csv processing using sql. Highly recommended for when you are fed up with google sheets or even excel.
Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?
I have no problem with people creating supersets of JSON, but if my standard lib JSON parser can't read your "JSON" then wouldn't it be better to call it something like "CH-JSON"?
If I am wildly missing something, I'm happy to be schooled. The end result certainly sounds cool, even though I haven't needed ClickHouse yet.
This is incorrect. ClickHouse is designed for distributed setups from the beginning, including cross-DC installations. It has been used on large production clusters even before it was open-sourced. When it became open-source in June 2016, the largest cluster was 394 machines across 6 data-centers with 25 ms RTT between the most distant data-centers.
> for example if you add additional nodes it is not easy to redistribute data.
This is precisely one of the issues I predict we'll face with our cluster as we're ramping up OTEL data and it's being sent to a small cluster, and I'm deathly afraid that it will continue sending to the every shard in equal measure without moving around existing data. I can not find any good method of redistributing the load other than "use the third party backup program and pray it doesn't shit the bed".
(disclaimer: I have not used either yet)
Plus it has a MySQL-flavoured client connector where Clickhouse does its own thing, so may be easier to integrate with some existing tools.
Spark can do analysis on huge quantities of data, and so can Microsoft Fabric. What Pinot can do that those tools can't is extremely low latency (milliseconds vs. seconds), concurrency (1000s of queries per second), and ability to update data in real-time.
Excellent intro video on Pinot: https://www.youtube.com/watch?v=_lqdfq2c9cQ
If I were still there, I'd be pushing a lot harder to finally throw away the legacy system (which has lost so many people it's basically ossified, anyway) and just "rebase" it all onto clickhouse and pyspark sparksql. We would throw away so much shitty cruft, and a lot of the newer mapreduce and RDD code is pretty portable to the point that it could be plugged into RDD's pipe() method.
Anyway. My current job, we just stood up a new product that, from day 1, was ingesting billions of rows (event data) (~nothing for clickhouse, to be clear. but obviously way too much for pg). And it's just chugging along. Clickhouse is definitely in my toolbox right after postgres, as you state.
I'd throw in one more to round it out however. The three rings of power are Postgres, ClickHouse and NATS. Postgres is the most powerful ring however and lots of times all you need.
Typically when you discover the need for OLAP DB is when you reach that scale, so I'm personally not sure what the real use case for DuckDB is to be completely honest.
The first is JSON as a data encoding, ie. the particular syntax involving braces and quotes and commas and string escapes.
The second is JSON as a data type, ie. a value which may be a string, number, bool, null, array of such values, or map from string to such values. The JSON data type is the set of values which can be represented by the JSON data encoding.
The article describes an optimized storage format for storing values which have the JSON data type. It is not related to JSON the data encoding, except in that it allows input and output using that encoding.
This is the same thing as postgres' JSONB type, which is also an optimized storage format for values of the JSON data type (internally it uses a binary representation).
The data remains standard JSON and so standard JSON parsers wouldn’t be affected since the optimizations are part of the storage layer and not the JSON structure itself.
No, not really.
The blog post talks about storing JSON data in a column-oriented database.
The blog post talks about importing data from JSON docs into their database. Prior to this, they stored JSON documents in their database like any standard off-the-shelf database does. Now they parse the JSON document when importing, and they store those values in their column-oriented database as key-value pairs, and preserve type information.
The silly part is that this all sounds like a intern project who was tasked with adding support to import data stored in JSON files into a column-oriented database, and an exporter along with it. But no, it seems an ETL job now counts as inventing JSON.
[1] https://www.postgresql.org/docs/current/datatype-json.html
I had to scroll way down the article, passing over tons of what feel like astroturfing comments advertising a vendor and their product line, to see the very first comment pointing out the elephant in the room.
I agree, whatever it's described in the blog post is clearly not JSON. It's a data interchange format, and it might be mappable to JSON under the right circumstances, but JSON it is not. It's not even a superset or a subset.
I mean, by the same line of reasoning both toml, CSV, and y'all are JSON. Come on. Even BSON is described as a different format that can be transcoded to JSON.
The article reads like a cheap attempt to gather attention to a format that otherwise would not justify it.
The error messages for batch inserts are TERRIBLE, but once it’s working it just hums along beautifully.
I’d be surprised if there isn’t a similar library for C++, as I believe clickhouse itself is written in C++
Scaling by the number of replicas of a single shard is less efficient than scaling by the number of shards. For ReplicatedMergeTree tables, due to physical replication of data, it is typically less than 10 replicas per shard, where 3 replicas per shard are practical for servers with non-redundant disks (RAID-0 and JBOD), and 2 replicas per shard are practical for servers with more redundant disks. For SharedMergeTree (in ClickHouse Cloud), which uses shared storage and does not physically replicate data (but still has to replicate metadata), the practical number of replicas is up to 300, and inserts scale quite well on these setups.
If your data mixes data types, I would argue that your report (whatever that is) _should_ get two columns.
Are you arguing that since Clickhouse is a database like Postgres, there's no point for CH to exist as we already have Postgres? Column-oriented databases have their uses.
Wow, that escalated quickly. You are reading too much into my comment. You should read the comment thread from the beginning to understand which question I'm replying to.
1. Aggregate (in-memory or on cheap storage) events in the publisher application into batches.
2. Ship those batches to S3/alike, NFS that clickhouse can read, or equivalent (even a dead-simple HTTP server that just receives file POSTs and writes them to disk, running on storage that clickhouse can reach). The tool you use here needs to be idempotent (retries of failed/timed out uploads don't mangle data), and atomic to readers (partially-received data is never readable).
3. In ClickHouse, run a scheduled refresh of a materialized view pointed at the uploaded data (either "SELECT ... INFILE" for local/NFS files, or "INSERT INTO ... SELECT s3(...)" for an S3/alike): https://clickhouse.com/docs/en/materialized-view/refreshable...
This is only a valid solution given specific constraints; if you don't match these, it may not work for you:
1. You have to be OK with the "experimental" status of refreshable materialized views. My and other users' experience with the feature seems generally positive at this point, and it has been out for awhile.
2. Given your emitter data rates, there must exist a batch size of data which appropriately balances keeping up with uploads to your blob store and the potential of data loss if an emitter crashes before a batch is shipped. If you're sending e.g. financial transaction source-of-record data, then this will not work for you: you really do need a Kafka/alike in that case (if you end up here, consider WarpStream: an extremely affordable and low-infrastructure Kafka clone backed by batching accumulators in front of S3: https://www.warpstream.com/ If their status as a SaaS or recent Confluent acquisition turns you off, fair enough.)
3. Data staleness of up to emitter-flush-interval + worst-case-upload-time + materialized-view-refresh-interval must be acceptable to you.
4. Reliability wise, the staging area for shipped batches (S3, NFS, scratch directory on a clickhouse server) must be sufficiently reliable for your use case, as data will not be replicated by clickhouse while it's staged.
5. All uniqueness/transformations must be things you can express in your materialized view's query + engine settings.
> 1. Aggregate (in-memory or on cheap storage) events in the publisher application into batches.
Clickhouse's Tagline on their website is:
> Build real-time data products that scale
Except, the minute we start having to batch data to process it and stage it, we lose the "real time" part. If I'm shipping them to S3 to have clickhouse batch ingest them, I might as well be use Databricks, Snowflake, or just parquet-on-s3.
Even if you could remove all of the operational burden from Kafka or equivalent, hooking it up to Clickhouse is still, at the end of the day, going to commit in batches (of max_insert_block_size, or kafka_max_block_size, or smaller batches polled from the message broker). Even with no consumer lag, that's still going to incur a delay before your data is SELECTable.
Heck, even Kafka publishers usually don't flush (actually send over the network) after every publish by default.
That same tradeoff comes up in Snowflake and Databricks (albeit mitigated when using Continuous Processing, which is experimental and expensive computationally and monetarily). Their ingestion systems are batching as well.
At the end of the day, "real time" means different things to different people, and you'll have to choose between one of several architectures:
- Clients synchronously insert data (which is then immediately visible) into your analytics store. ClickHouse is less good at handling a barrage of single-row INSERTs than other DBs, but none of them are good at this type of workload at even medium scale. Even manually shipping single-update files to S3 gets expensive and slow fast.
- Batch your inserts and accept bounded lag in data visibility. Doesn't matter whether batching is client-side, database-side, or in an intermediate broker/service.
- Ship your data asynchronously via messaging/streaming/batching and force point-in-time queries to wait for some indication that asynchronous data for the requested point in time has arrived. For example, when batching manually you could delay queries until a batch subsequent to the time-of-query has arrived, or when using Kafka you could wait for the system of record's last-committed-kafka-message-id to pass your topic's max ID at the time of query.
Besides, it's so simple, just a single executable.
Of course if you're at a scale where you need a cluster it's not an option anymore.
Clickhouse typically runs several processes (server, clients) interacting and that already makes things more complicated (and more powerful!).
That's not to say one is good and the other bad, they're just quite different tools.
SELECT * FROM website_traffic WHERE time >= now() - INTERVAL 1 SECONDClickhouse sells itself as a DBMS for real time analytical reports, which is exactly what I want. But I can't compare the two because I've never managed to get it all stood up.
And a completely fair question is “why would I want to spin up a completely new stack when I have psql already installed?”
In my (very limited ) experience you almost never do want to, but when you do, you wish you had started 6 months ago.
All of this ignores the fact that adjusting a restaurant's prices by the customer's expected ability to pay often leads to killing demand among your most frequent and desirable clientele, but that's a different story.
At least it used to be like that a few years ago.
https://www.reddit.com/r/UberEATS/comments/nucd2x/no_tip_no_...
https://www.reddit.com/r/UberEATS/comments/rtn2xe/no_tip_no_...
https://www.reddit.com/r/UberEATS/comments/uce6cs/orders_sit...
https://www.reddit.com/r/doordash/comments/17ojre0/doordash_...
https://www.reddit.com/r/doordash/comments/np6rik/you_have_e...
https://www.reddit.com/r/doordash/comments/o32nl4/no_tip_ord...
Don’t know if the situation has improved since.
The reason this happens is because Uber Eats and DoorDash and others have/had this concept where you’d “tip” for the delivery. Which is actually not a tip, but just a shitty way of disguising delivery fees and putting customers against the people that deliver the food. But that in turn has its background in how the restaurant business treats their workers in the USA, which has been wacky even long before these food delivery apps became a thing.
Anyway, regardless of your opinion on “tipping” and these practices the point was to say that there are additional complications with how much time it will take for your order to arrive aside from just the time it takes to prepare the food and the time it takes to travel from the restaurant to your door, even when the food has been prepared and a delivery driver is right there at the restaurant. If the “tip” is too low, or zero, your order could be left sitting on the shelf with nobody willing to pick it up. At least a few years ago it was like that.
It's also kind of a bummer that the batches have to be inserted, when the tagline on Clickhouse's website is:
> Build real-time data products that scale
But, thanks for the clarification!
I've never understood why this isn't just done better by the downstream product though. It's not that hard to implement a performant write ahead log from scratch.
(Note that you can scale out the above arbitrarily, since there's no reason to limit yourself to one worker or one DBMS.)
Realistically I think ClickHouse's features count as real-time, batching or not. The thing is, there is a cost to inserting things, it's a question of what the cost is. ClickHouse has a lot of overhead for an insert, and very little overhead for large OLAP queries, so amortizing the overhead with big writes is important. That's just a design tradeoff. Let's say you have 1mil events a second and you batch at 100k. You then get 10mil. Does that mean you need 10x as long to see the data? No, you can just scale out the writes by standing up new nodes and scale them up by doing larger batches. In contrast, systems that do excellent on singular point queries and transactional inserts are probably not going to handle 10x (relative) larger inserts and 10x as many writers as well -- or, they will not handle it as well, for as long, and will need more care. For reference I have done tens and hundreds of billions of rows on a commodity homeserver with ease, something Postgres isn't going to handle as well (I have pushed Postgres to about 3 billion rows.)
In this kind of setup, losing some events occasionally isn't ideal, and you should try to stop it, but it will happen. More importantly, at large scale, you'll only be able to sample subsets of the data to get answers in a reasonable time anyway, so your answers will become increasingly approximate over time. In a system of 1 trillion rows, does 100k rows missing matter when you already sample 10% of the dataset via SELECT ... FROM xyz SAMPLE 0.1? This is an important question to ask.
Most of the time you can get data through the pipeline quickly, in seconds (more than enough to spot problems) and you can use tools like ReplacingMergeTree or AggregatingMergeTree in order to scale up your write throughput in the event of multiple writers. Again, at large scale, duplicate rows (no exactly once delivery) are mostly just statistical noise, and they are ultimately idempotent because ClickHouse will merge them together anyway. Someone else already mentioned FINAL here. There are tricky parts to running any big system at scale but, yeah.
If you actually need sub-second or millisecond-level latency and you can't stand to lose even a single event, then you need to look into streaming solutions like using https://nats.io/ combined with Materialize or Feldera, which completely reframe the problem as an incremental computation problem rather than an analytical OLAP system that addresses scale through mass parallelism.
If all of the mentioned numbers here are too big for you or overkill, or something you aren't thinking about yet -- you can just stand up ClickHouse with Vector as an insert buffer, and throw shit at it all day without worry.
> Realistically I think ClickHouse's features count as real-time, batching or not
I agree, but if you look at some of the suggestions in this thread they talk about (e.g.) writing batches to S3 and crawling it on an interval - that's not real time (even if clickhouse itself is). If clickhouse is Real-time, but can't ingest data in a sane format it's not real time.
_That said_, I work at the scale where we have to be slightly careful with what we do, but not at the level where we'd call it a "big system at scale". We operate at the scale where we're worried about the stability of our app (i.e. batching in our app has the potential to cause data loss), but we can fit the ingress management/queue on a single instance (or a very small number of instances) so if _that_ is reliable we're happy.
> If all of the mentioned numbers here are too big for you or overkill,
They are, and Vector is exactly what I want. It took me about 20 minutes from seeing this comment to have app -> vector -> clickhouse-local up and running.