A new JSON data type for ClickHouse

A new JSON data type for ClickHouse (clickhouse.com)

382 points by markhneedham 1 year ago | 122 comments

ramraj07 1 year ago |

Great to see it in ClickHouse.

Snowflake released a white paper before its IPO days that mentioned this same feature (secretly exploding JSON into columns). That explains how Snowflake feels faster than it should: they’ve quietly done a lot of amazing things and just offered it as a polished product, like Apple.

leetrout 1 year ago | |

Scratch Data does this as well, with DuckDB

https://github.com/scratchdata/scratchdata

nojvek 1 year ago | |

SingleStore has been doing JSON -> column expansion for a while as well.

https://www.singlestore.com/blog/json-builtins-over-columnst...

For a columnstore database, dealing with JSON as strings is a big perf hit.

statictype 1 year ago | |

Do you have a link to the Snowflake whitepaper?

JosephRedfern 1 year ago | | |

Perhaps this: https://event.cwi.nl/lsde/papers/p215-dageville-snowflake.pd...

maccard 1 year ago |

I've heard wonderful things about ClickHouse, but every time I try to use it, I get stuck on "how do I get data into it reliably". I search around, and inevitably end up with "by combining clickhouse and Kafka", at which point my desire to keep going drops to zero.

Are there any setups for reliable data ingestion into Clickhouse that don't involve spinning up Kafka & Zookeeper?

atombender 1 year ago | |

At my company we use Vector to ingest into ClickHouse. It works really well. Vector does buffering and retrying.

Vector is a relatively simple ingest tool that supports lots of sources and sinks. It's very simple to run — just a config file and a single binary, and you're set. But it can do a fair amount of ETL (e.g. enriching or reshaping JSON), including some more advanced pipeline operators like joining multiple streams into one. It's maybe not as advanced as some ETL tools, but it covers a lot of ground.

Since you mention Kafka, I would also mention Redpanda, which is Kafka-compatible, but much easier to run. No Java, no ZooKeeper. I think you'd still want Vector here, with Vector connecting Redpanda to ClickHouse. Then you don't need the buffering that Vector provides, and Vector would only act as the "router" that pulls from Redpanda and ingests into ClickHouse.

Another option is RudderStack, which we also use for other purposes. It's a richer tool with a full UI for setting up pipelines, and so on.

sdairs 1 year ago | |

Interesting, that's not a problem I've come across before particularly - could you share more?

Are you looking for setups for OSS ClickHouse or managed ClickHouse services that solve it?

Both Tinybird & ClickHouse Cloud are managed ClickHouse services that include ingest connectors without needing Kafka

Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API, so you can connect it with ClickHouse as if it were Kafka, without actually having Kafka (though I'm not sure whether this is in the open source Estuary Flow project; I have a feeling not)

If you just want to play with CH, you can always use clickhouse-local or chDB which are more like DuckDB, running without a server, and work great for just talking to local files. If you don't need streams and are just working with files, you can also use them as an in-process/serverless transform engine - file arrives, read with chDB, process it however you need, export it as CH binary format, insert directly into your main CH. Nice little pattern that can run on a VM or in Lambdas.

maccard 1 year ago | | |

Sure - I work in games, and we stream events from clients that we want to store in ClickHouse. We've got a native desktop application written in C++ that generates a JSON payload (we control the format of this). We don't need OSS, but we don't want a SaaS service - we want on-prem (or self-managed). ClickHouse Cloud would be fine, Tinybird not.

> Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API

This is definitely an improvement, but if it looks like kafka and sounds like kafka, I get a bit sus.

> If you just want to play with CH, you can always use clickhouse-local

I've done that, but getting from this to "streaming data" is where I get stuck.

> If you don't need streams

Afraid streams are what I'm dealing with..

pbowyer 1 year ago | |

> but every time I try to use it, I get stuck on "how do I get data into it reliably"

That's the same stage where I get stuck every time.

I have data emitters (in this example let's say my household IoT devices, feeding a MQTT broker then HomeAssistant).

I have where I want the data to end up (Clickhouse, Database, S3, whatever).

How do I get the data from A to B, so there are no duplicate rows (if the ACK for an upload isn't received when the upload succeeded), no missing rows (the data is retried if an upload fails), and some protection if the local system goes down (data isn't ephemeral)?

The easiest I've found is writing data locally to files (JSON, parquet, whatever), new file every 5 minutes and sync the older files to S3.

But then I'm stuck again. How do I continually load new files from S3 without any repetition or edge cases? And did I really need the intermediate files?

wiredfool 1 year ago | | |

Easiest way is to post csv/json/whatever through the http endpoint into a replacing merge tree table.

Duplicates get merged out, and errors can be handled at the HTTP level. (Admittedly, one bad row in a big batch post is a pain, but I don’t see that much.)
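
To make the "duplicates get merged out" part concrete, here is a toy Python model of it, not ClickHouse code. It assumes a table whose sorting key is `(device, ts)` (hypothetical column names) and mimics a ReplacingMergeTree merge by keeping the last row seen per key. In the real engine this happens during background merges, so duplicates can stay visible for a while after a retry.

```python
def merge_replacing(rows, key):
    """Toy model of a ReplacingMergeTree merge: keep the last row seen per sorting key."""
    latest = {}
    for row in rows:
        latest[tuple(row[k] for k in key)] = row
    return list(latest.values())


batch = [
    {"device": "a", "ts": 1, "temp": 20.5},
    {"device": "a", "ts": 2, "temp": 20.7},
]

# The client never saw the ACK for the first POST, so it sends the same batch again.
stored = batch + batch

# After a merge, each (device, ts) pair is present exactly once.
merged = merge_replacing(stored, key=("device", "ts"))
```

The point of the design is that the sender can retry blindly: re-sending a batch is harmless as long as every row carries a key that identifies it.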

masterj 1 year ago | | |

Cloudflare workers combined with their queues product https://developers.cloudflare.com/queues/ might be a cheap and easy way of solving this problem

maccard 1 year ago | | |

This is _exactly_ my problem, and where I've found myself.

aynyc 1 year ago | |

My experience and knowledge with CH is about 3-4 years old now, so I might be talking out of ignorance at this point.

There are plenty of ways to do it with batching, but I assume you want a real-time "insert into table" style or a direct "ch.write(data)"; then no. There is no way, as far as I know, without batching. This is one of the main reasons we stopped using CH for our last project about 3 years ago, for financial data analytics tooling. CH doesn't have a transaction log like a WAL, so your data producers need to be smart or you need a "queue" type service to deal with it, whether it's S3 or Kafka or Kinesis, to allow batching.

lossolo 1 year ago | |

> I search around, and inevitably end up with "by combining clickhouse and Kafka"

Those are probably some old sources of knowledge. You need to use Kafka if you want it to handle batching for you. But Clickhouse can handle batching as well by using asynchronous inserts:

https://clickhouse.com/blog/asynchronous-data-inserts-in-cli...
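
A rough Python sketch of what such a request could look like over the HTTP interface. The host, the table name `events` and the helper name are assumptions for illustration. `async_insert` and `wait_for_async_insert` are the settings the linked post is about, passed here as URL parameters. The function only builds the request; it does not send anything.

```python
import json
from urllib.parse import quote, urlencode


def async_insert_request(base_url, table, rows):
    """Build (url, body) for an HTTP insert that asks the server to do the batching."""
    params = {
        "query": f"INSERT INTO {table} FORMAT JSONEachRow",
        "async_insert": 1,           # let the server buffer and batch small inserts
        "wait_for_async_insert": 1,  # only acknowledge once the buffer has been flushed
    }
    url = f"{base_url}/?{urlencode(params, quote_via=quote)}"
    # JSONEachRow: one JSON object per line.
    body = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    return url, body


url, body = async_insert_request(
    "http://localhost:8123",  # hypothetical local server
    "events",                 # hypothetical table
    [{"id": 1, "name": "login"}, {"id": 2, "name": "logout"}],
)
```

Any HTTP client can then POST `body` to `url`. Waiting for the flush is slower per request, but the client gets a real success or failure to retry on.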

DeathArrow 1 year ago | |

It seems you can use JSON, CSV and Parquet: https://clickhouse.com/docs/en/integrations/data-formats

turtlebits 1 year ago | |

There is an HTTP endpoint, client database drivers, CLI tool and third party tools like Vector, Redpanda Connect?

What makes ClickHouse different that you're unable to load data into it?

BohuTANG 1 year ago | |

Yes, reliable data ingestion often involves Kafka, which can feel complex. An alternative is the transactional COPY INTO approach used by platforms like Snowflake and Databend. This command supports "exactly-once" ingestion, ensuring data is fully loaded or not at all, without requiring message queues or extra infrastructure.

https://docs.databend.com/sql/sql-commands/dml/dml-copy-into...

two_handfuls 1 year ago | |

Not sure if it's enough for you but there is Redpanda, a ZooKeeper-less Kafka.

shawabawa3 1 year ago | |

I had success loading data with vector.dev

jacobsenscott 1 year ago | | |

This is what we do - works well.

matter_and_mind 1 year ago | |

I run a fairly large ClickHouse cluster for advertising data, with millions of events streaming in every minute. We use fluentd as a buffer, which batches data for up to n records/n minutes and does batch inserts to ClickHouse. It's not real-time but close enough, and we have found it to be pretty reliable.
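
The "n records or n minutes" rule is simple enough to sketch in a few lines of Python. This is not fluentd's implementation, only the same idea under assumed names: `Batcher`, `max_rows` and `max_age_s` are made up here, and `flush` stands in for whatever performs the batch insert.

```python
import time


class Batcher:
    """Buffer rows and hand them to `flush` once the batch is big enough or old enough."""

    def __init__(self, flush, max_rows=1000, max_age_s=60.0, clock=time.monotonic):
        self.flush = flush
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.clock = clock
        self.rows = []
        self.first_at = None  # time the oldest buffered row arrived

    def add(self, row):
        if not self.rows:
            self.first_at = self.clock()
        self.rows.append(row)
        self.maybe_flush()

    def maybe_flush(self):
        if not self.rows:
            return
        too_many = len(self.rows) >= self.max_rows
        too_old = self.clock() - self.first_at >= self.max_age_s
        if too_many or too_old:
            # If flush raises, the rows stay buffered and are retried on the next call.
            self.flush(self.rows)
            self.rows = []


sent = []
now = [0.0]  # fake clock so the example is deterministic
batcher = Batcher(sent.append, max_rows=3, max_age_s=60.0, clock=lambda: now[0])

for n in range(3):
    batcher.add({"n": n})  # third row triggers a size-based flush

batcher.add({"n": 3})
now[0] = 61.0
batcher.maybe_flush()      # a minute later, the lone row is flushed by age
```

In a real process `maybe_flush` would also be called from a timer, so a quiet stream still gets flushed.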

_peregrine_ 1 year ago | |

I think Tinybird is a nice option here. It's sort of a managed service for ClickHouse with some other nice abstractions. For your streaming case, they have an HTTP endpoint that you can stream to that accepts up to 1k EPS and you can micro-batch events if you need to send more events than that. They also have some good connectors for BigQuery, Snowflake, DynamoDB, etc.

amanj41 1 year ago | |

Not sure if ClickHouse needs ZK, but FWIW Kafka has a Raft implementation which now obviates the need for ZK

dtjohnnymonkey 1 year ago | | |

ClickHouse does need ZK but they have their own implementation.

ramraj07 1 year ago | |

Where are you loading the data from? I had no trouble loading data from S3 Parquet.

maccard 1 year ago | | |

I'm streaming data from a desktop application written in C++. The hard part is the step of getting it into Parquet in the first place.

hisnameisjimmy 1 year ago | |

Fivetran has a destination for it: https://fivetran.com/docs/destinations/clickhouse

VeejayRampay 1 year ago | |

I was glad in the past few years to discover that I am not alone in finding Kafka off-putting / way too convoluted

dtjohnnymonkey 1 year ago | |

Where is your data coming from? I’m curious what prevents you from inserting the data into Clickhouse without Kafka.

barumrho 1 year ago | |

How do you do this with other DBs?

everfrustrated 1 year ago |

>Dynamically changing data: allow values with different data types (possibly incompatible and not known beforehand) for the same JSON paths without unification into a least common type, preserving the integrity of mixed-type data.

I'm so excited for this! One of my major bugbears with storing logs in Elasticsearch is the set-type-on-first-seen-occurrence headache.

Hope to see this leave experimental support soon!

atombender 1 year ago | |

I never understood why ELK/Kibana chose this method, when there's a much simpler solution: Augment each field name with the data type.

For example, consider the documents {"value": 42} and {"value": "foo"}. To index this, index {"value::int": 42} and {"value::str": "foo"} instead. Now you have two distinct fields that don't conflict with each other.

To search this, the logical choice would be to first make sure that the query language is typed. So a query like value=42 would know to search the int field, while a query like value="42" would look in the string field. There's never any situation where there's any ambiguity about which data type is to be searched. KQL doesn't have this, but that's one of their many design mistakes.

You can do the same for any data type, including arrays and objects. There is absolutely no downside; I've successfully implemented it for a specific project. (OK, one downside: More fields. But that's the nature of the beast. These are, after all, distinct sets of data.)
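
A minimal Python sketch of the scheme, to show how little is involved. The function names and the exact tag strings (`int`, `str`, `array`, ...) are my own choices for illustration, not taken from any existing tool.

```python
def type_tag(value):
    """Return the suffix used to keep differently typed values in separate fields."""
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def tag_fields(doc):
    """Rewrite {"value": 42} as {"value::int": 42}, recursing into nested objects."""
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = tag_fields(value)
        out[f"{key}::{type_tag(value)}"] = value
    return out


def field_for_query(name, literal):
    """A typed query literal picks its field: value=42 -> value::int, value="42" -> value::str."""
    return f"{name}::{type_tag(literal)}"


indexed_a = tag_fields({"value": 42})
indexed_b = tag_fields({"value": "foo"})
nested = tag_fields({"a": {"b": 1.5}})
```

Because the query literal carries its own type, the lookup side never has to guess which of the two fields was meant.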

mr_toad 1 year ago | | |

> For example, consider the documents {"value": 42} and {"value": "foo"}. To index this, index {"value::int": 42} and {"value::str": "foo"} instead. Now you have two distinct fields that don't conflict with each other.

But now all my queries that look for “value” don’t work. And I’ve got two columns in my report where I only want one.

abe94 1 year ago |

We've been waiting for more JSON support in ClickHouse. The new type looks promising, and the dynamic column, with no need to specify subtypes, is particularly helpful for us.

breadwinner 1 year ago |

If you're evaluating ClickHouse take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations, although it has been enhanced to support clusters. But this support is lacking, for example if you add additional nodes it is not easy to redistribute data. Pinot is much easier to scale horizontally. Also take a look at star-tree indexes of Pinot [1]. If you're doing multi-dimensional analysis (Pivot table etc.) there is a huge difference in performance if you take advantage of star-tree.

[1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...

notamy 1 year ago |

Clickhouse is great stuff. I use it for OLAP with a modest database (~600mil rows, ~300GB before compression) and it handles everything I throw at it without issues. I'm hopeful this new JSON data type will be better at a use-case that I currently solve with nested tuples.

jabart 1 year ago | |

Similar for us, except 700mil rows in one table, 2.5 billion total rows. That's growing quickly because we started shoving OTEL to the cluster. None of our queries seem to faze ClickHouse. It's like magic. The 48 cores per node also help.

philosopher1234 1 year ago | |

Postgres should be good enough for 300GB, no?

wiredfool 1 year ago | | |

I had a postgres database where the main index (160gb) was larger than the entire equivalent clickhouse database (60gb). And between the partitioning and the natural keys, the primary key index in clickhouse was about 20k per partition * ~ 1k partitions.

Now, it wasn't a good schema to start with, and there was about a factor of 3 or 4 size that could be pulled out, but clickhouse was a factor of 20 better for on disk size for what we were doing.

marginalia_nu 1 year ago | | |

At least in my experience, that's about when regular DBMSes kinda start to suck for ad-hoc queries. You can push them a bit farther for non-analytical use cases if you're really careful and have prepared indexes that assist every query you make, but that's rarely a luxury you have in OLAP-land.

tempest_ 1 year ago | | |

It depends. If you want to do any kind of aggregation, counts, or count distinct, PG falls over pretty quickly.

notamy 1 year ago | | |

Probably, but Clickhouse has been zero-maintenance for me + my dataset is growing at 100~200GB/month. Having the Clickhouse automatic compression makes me worry a lot less about disk space.

whalesalad 1 year ago | | |

For write heavy workloads I find psql to be a dog tbh. I use it everywhere but am anxious to try new tools.

For truly big data (terabytes per month) we rely on BigQuery. For smaller data that is more OLTP write heavy we are using psql… but I think there is room in the middle.

jacobsenscott 1 year ago | | |

Yes, but you're starting to get to the size where you need some real PG expertise to keep the wheels on. If your data is growing, CH will just work out of the box for a lot longer.

CSDude 1 year ago |

When I tried it a few weeks ago, because ClickHouse names the files based on column names, weird JSON keys resulted in very long filenames and slashes, which did not play well with the file system and gave errors. I wonder whether that is fixed?

setr 1 year ago | |

Isn’t that the issue challenge #3 addresses?

https://clickhouse.com/blog/a-new-powerful-json-data-type-fo...

CSDude 1 year ago | | |

Tried with the latest version, but it doesn't solve it.

    CREATE TABLE mk3
    ENGINE = MergeTree
    ORDER BY (account_id, resource_type)
    SETTINGS allow_nullable_key = 1
    AS SELECT
        *,
        CAST(content, 'JSON') AS content_json
    FROM file('Downloads/data_snapshot.parquet')

    Query id: 8ddf1377-7440-4b4d-bb8d-955cd0f2b723

    ↑ Progress: 239.57 thousand rows, 110.38 MB (172.49 thousand rows/s., 79.48 MB/s.) 22%
    Elapsed: 4.104 sec. Processed 239.57 thousand rows, 110.38 MB (58.37 thousand rows/s., 26.89 MB/s.)

    Received exception:
    Code: 107. DB::ErrnoException: Cannot open file /var/folders/mc/gndsp71j6zz64pm7j2wz_6lh0000gn/T/clickhouse-local-503e1494-c3fb-4a5e-9514-be5ba7940fec/data/default/mk3/tmp_insert_all_1_1_0/content_json.plan.features.available.core/audio.dynamic_structure.bin: , errno: 2, strerror: No such file or directory. (FILE_DOESNT_EXIST)
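
For what it's worth, the path in that error suggests the JSON key `core/audio` contains a slash, so the file system reads everything before it as a directory that was never created. Below is a small Python sketch of that failure mode. The shortened part directory is made up for the example, and the percent-encoding is only an illustrative way to escape the name, not a claim about how ClickHouse names its files.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# JSON path as it appears in the error message; the last key contains a slash.
json_path = "plan.features.available.core/audio"
file_name = f"content_json.{json_path}.dynamic_structure.bin"

# Hypothetical, shortened part directory (the real one is the long temp path).
part_dir = PurePosixPath("mk3/tmp_insert_all_1_1_0")

# Naive join: the slash inside the key splits the name into a directory and a file.
naive = part_dir / file_name
unexpected_dir = naive.parent.name  # a directory nobody created
leaf = naive.name

# Illustrative fix: escape reserved characters so the name stays one path component.
escaped = quote(file_name, safe="")
safe_path = part_dir / escaped
```

With the escaped name, the file lands directly inside the part directory, which is what the writer expected in the first place.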

Thorrez 1 year ago |

>For example, if we have two integers and a float as values for the same JSON path a, we don’t want to store all three as float values on disk

Well, if you want to do things exactly how JS does it, then storing them all as float is correct. However, the JSON standard doesn't say it needs to be done the same way as JS.
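
As one data point, Python's standard `json` module is a parser that does not follow JS here: it keeps integers and floats apart. A quick sketch, with the sample values chosen by me:

```python
import json

# Two integers and a float under the same path, as in the quoted example.
values = json.loads("[1, 2, 3.5]")
kinds = [type(v).__name__ for v in values]

# Why it matters: an integer just above 2**53 cannot be represented exactly as a double.
big = 2**53 + 1
parsed_big = json.loads(str(big))  # stays an exact int
as_double = float(big)             # rounds to a neighbouring value
```

So a store that forces everything to float would silently change large IDs, while one that keeps the integer type does not.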

barumrho 1 year ago | |

The new Variant type exists independently of JSON support, so it seems good that they handle it properly.

kreetx 1 year ago |

This seems similar to storing any JSON type in the column rather than one specific kind of value (int, string, array), much like "enum with fields" in Swift, Kotlin or Rust, or algebraic data types in Haskell - a feature not present in many other languages.

jojohohanon 1 year ago |

I’m a few years removed, but isn’t this how Google Capacitor stores protobufs (which are ~ equivalent to JSON in what they can express)?

jakozaur 1 year ago |

Looks like Snowflake was the first popular warehouse to have a variant type which could put JSON values into separate columns.

It turned out to be a great idea, which inspired other databases.

karsinkk 1 year ago |

Oracle 23ai also has a similar feature that "explodes" JSON into relational tables/columns for storage while still providing JSON-based access APIs: https://www.oracle.com/database/json-relational-duality/

officex 1 year ago |

Great to see! I remember checking you guys out in Q1, great team

fuziontech 1 year ago |

Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance all while allowing us to build more products on the same set of data.

Since we've been using ClickHouse long before this JSON functionality was available (or even before the earlier version of this called `Object('json')` was available) we ended up setting up a job that would materialize JSON fields out of a JSON blob and into materialized columns based on query patterns against the keys in the JSON blob. Then, once those materialized columns were created, we would just route the queries to those columns at runtime if they were available. This saved us a _ton_ on CPU and IO utilization. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make the computer do less, and this new JSON type does exactly that, and it's so turnkey!

https://posthog.com/handbook/engineering/databases/materiali...
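
A very rough Python sketch of the routing idea described above, not our actual code. The `mat_` column prefix, the `properties` blob column, the threshold and the function names are all assumptions for illustration; `JSONExtractString` is the ClickHouse function used for the fallback.

```python
import re
from collections import Counter

_SAFE_KEY = re.compile(r"[A-Za-z0-9_$.]+")


def keys_worth_materializing(queried_keys, min_hits):
    """Pick the JSON keys that show up in at least `min_hits` logged queries."""
    counts = Counter(queried_keys)
    return sorted(key for key, hits in counts.items() if hits >= min_hits)


def property_expr(key, materialized, blob_column="properties"):
    """Return the SQL expression for a property: a real column if one exists, else a JSON lookup."""
    if not _SAFE_KEY.fullmatch(key):
        raise ValueError(f"unexpected characters in key: {key!r}")
    column = materialized.get(key)
    if column is not None:
        return column  # cheap: reads one narrow column
    return f"JSONExtractString({blob_column}, '{key}')"  # expensive: parses the blob


query_log = ["plan", "country", "plan", "plan", "country", "referrer"]
hot_keys = keys_worth_materializing(query_log, min_hits=3)
materialized = {key: f"mat_{key}" for key in hot_keys}

fast = property_expr("plan", materialized)
slow = property_expr("country", materialized)
```

Queries keep working either way; the only difference is how much of the blob has to be read and parsed to answer them.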

The team over at ClickHouse Inc., as well as the community behind it, moves surprisingly fast. I can't recommend it enough and am excited for everything else that is on the roadmap here. I'm really excited for what is on the horizon with Parquet and Iceberg support.

baq 1 year ago |

Clickhouse is criminally underused.

It's common knowledge that 'postgres is all you need' - but if you somehow reach the stage of 'postgres isn't all I need and I have hard proof' this should be the next tech you look at.

Also, clickhouse-local is rather amazing at csv processing using sql. Highly recommended for when you are fed up with google sheets or even excel.

peteforde 1 year ago |

I admit that I didn't read the entire article in depth, but I did my best to meaningfully skim-parse it.

Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?

I have no problem with people creating supersets of JSON, but if my standard lib JSON parser can't read your "JSON" then wouldn't it be better to call it something like "CH-JSON"?

If I am wildly missing something, I'm happy to be schooled. The end result certainly sounds cool, even though I haven't needed ClickHouse yet.

anonygler 1 year ago |

I keep misreading this company as ClickHole and expecting some sort of satirical content.