Simdjson – Parsing Gigabytes of JSON per Second (github.com)

598 points by cmsimike 7 years ago | 196 comments

raphlinus 7 years ago |

This is very cool. Meanwhile, in the xi-editor project, we're struggling with the fact that Swift JSON parsing is very slow [1]. My benchmarking clocked in at 0.00089GB/s for Swift 4, and things don't seem to have improved much with Swift 5. I'm encouraging people on that issue to do a blog post.

[1]: https://github.com/xi-editor/xi-mac/issues/102

eridius 7 years ago | |

I wrote my own Swift JSON parser quite a while ago, https://github.com/postmates/PMJSON. In my limited benchmarking it parses slower than Foundation's JSONSerialization (by a factor of 2–2.5 IIRC) but encodes faster, and my impression was most of the time was spent constructing Dictionaries, but I didn't do too much performance work on it. It might be interesting to have someone else take a crack at improving the performance.

That said, it also includes an event-based parser (called JSONDecoder), so if you want to handle events in order to decode into your own data structure and skip the intermediate JSON data structure, you might be able to get faster than JSONSerialization that way.
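A rough illustration of skipping the intermediate structure, in Python rather than Swift. Python's standard json module has no event-based parser, so this sketch uses its object_pairs_hook parameter, a weaker relative of the idea: the decoder hands each object's key/value pairs to the hook as soon as the object closes, and the hook builds the final type directly with no dict in between. The Point type and the sample document are made up for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


def to_point(pairs):
    """Build a Point straight from the decoder's (key, value) pairs."""
    x = y = None
    for key, value in pairs:
        if key == "x":
            x = value
        elif key == "y":
            y = value
    return Point(x, y)


# Assumes every object in the document is a point; a real decoder would
# need to know which nesting level maps to which type.
points = json.loads(
    '[{"x": 1, "y": 2}, {"y": 4, "x": 3}]',
    object_pairs_hook=to_point,
)
```

The hook only removes the generic dictionary; the array is still a generic list, so this is a partial version of what a true event-based decoder allows.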

marton78 7 years ago | |

Why does Xi use JSON in the first place? It would be easier and faster to use a binary format, e.g. Protobufs, Flatbuffers or, if the semantics of JSON are needed, CBOR.

aratno 7 years ago | | |

From “Design Decisions”[1]:

> JSON. The protocol for front-end / back-end communication, as well as between the back-end and plug-ins, is based on simple JSON messages. I considered binary formats, but the actual improvement in performance would be completely in the noise. Using JSON considerably lowers friction for developing plug-ins, as it’s available out of the box for most modern languages, and there are plenty of the libraries available for the other ones.

[1]: https://github.com/xi-editor/xi-editor/blob/master/README.md...

Skinney 7 years ago | | |

Because JSON encoding/decoding was not found to be a typical performance bottleneck, and because JSON is supported in virtually every programming language (Xi allows you to write frontends in pretty much any language you want).

kragen 7 years ago | | |

After spending most of a year doing deep surgery on systems that used CBOR extensively, I can report that the common CBOR parsers are not faster than common JSON parsers; surprisingly, they are actually slower. CBOR is also not easier; it's much less widely supported, and you need a separate debugging representation. It does have three real advantages over JSON: it supports binary strings, it's a monument to Carsten Bormann's ego, and data encoded in CBOR takes slightly fewer bytes than the same data encoded in JSON. (The second is only an advantage if you're Carsten Bormann.)

mpweiher 7 years ago | | |

While theoretically true, in practice the actual character parsing tends to be a small to negligible part of the overall time. Which leads to the measurable fact that on macOS/iOS, the JSON serialization stuff is actually one of their fastest, faster than their binary stuff.

saagarjha 7 years ago | |

I ran one of the Codable benchmarks in instruments, and here's what the top functions were:

  19.98 s   swift_getGenericMetadata
  19.15 s   newJSONString
  16.17 s   objc_msgSend
  15.33 s   _swift_release_(swift::HeapObject*)
  14.45 s   tiny_malloc_should_clear
  12.81 s   _swift_retain_(swift::HeapObject*)
  11.28 s   searchInConformanceCache(swift::TargetMetadata<swift::InProcess> const*, swift::TargetProtocolDescriptor<swift::InProcess> const*)
  10.46 s   swift_dynamicCastImpl(swift::OpaqueValue*, swift::OpaqueValue*, swift::TargetMetadata<swift::InProcess> const*, swift::TargetMetadata<swift::InProcess> const*, swift::DynamicCastFlags)

So it looks like a lot of the time is going into memory management or the Swift runtime performing type checking.

raphlinus 7 years ago | | |

Yeah, I've done some analysis, it's creating a ton of objects to conform to the Codable protocol, and a lot of those objects are for codingPath, which is updated for basically every node in the tree. It's not a mystery, we just don't know the best way to fix it.

Cthulhu_ 7 years ago | | |

Can you see any differences with different levels of optimization? I recall a presentation at some point where the old obj-C style compiled code did a lot of checks before and after calling a method ("does this object listen to this message?"), while with an optimization option enabled (whole module optimization?) these calls could be optimized out. That is, with Swift they can make the resulting machine code less er, "checking for safety", so to speak.

mpweiher 7 years ago | |

Yeah, Swift-most-everything is pretty slow, but particularly parsing/generating. Pre-Swift Foundation serialisation code was already...majestic, and in the Swift conversion they've typically managed to slow things down even further. Which didn't seem possible, but they managed.

I have given a bunch of talks[1] on this topic, there's also a chapter in my iOS/macOS performance book[2], which I really recommend if you want to understand this particular topic. I did really fast XML[3][4], CSV[5] and binary plist parsers[6] for Cocoa and also a fast JSON serialiser[7]. All of these are usually around an order of magnitude faster than their Apple equivalents.

Sadly, I haven't gotten around to doing a JSON parser. One reason for this is that parsing the JSON at character level is actually the smaller problem, performance-wise, same as for XML. Performance tends to be largely determined by what you create as a result. If you create generic Foundation/Swift dictionaries/arrays/etc. you have already lost. The overhead of these generic data structures completely overwhelms the cost of scanning a few bytes.

So you need something more akin to a streaming interface, and if you create objects you must create them directly, without generic temporary objects. This is where XML is easier, because it has an opening tag that you can use to determine what object to create. With JSON, you get "{" so basically you have to know what structure level corresponds to what objects.

Maybe I should write that parser...

[1] https://www.google.com/search?hl=en&q=marcel%20weiher%20perf...

[2] https://www.amazon.com/gp/product/0321842847/

[3] https://github.com/mpw/Objective-XML

[4] https://blog.metaobject.com/2010/05/xml-performance-revisite...

[5] https://github.com/mpw/MPWFoundation/blob/master/Collections...

[6] https://github.com/mpw/MPWFoundation/blob/master/Collections...

[7] https://github.com/mpw/MPWFoundation/blob/master/Streams.sub...

gritzko 7 years ago | | |

That resonates well with my conclusions that led to the Replicated Object Notation project [1]. If the parser creates an AST or some number of dictionaries or some other bullshit... "now you have two problems", that's it.

I settled on a tabular-log format, which is streamed and immediately consumed most of the time, no intermediate object structures.

Then, that "text vs binary" distinction became mostly moot. The binary is slightly more efficient, but grossly less readable, so no big gain, unless at grand scale.

[1] http://replicated.cc

azinman2 7 years ago | |

What are you using? Have you tried NSJSONSerialization? It’s quite fast (am very curious how it shows in these benchmarks), but I don’t think it does the fancy Codable stuff.

jeremy_wiebe 7 years ago | | |

You might want to check out the benchmark I wrote to compare exactly that.

https://github.com/jeremywiebe/json-performance

eridius 7 years ago | | |

Swift has JSONEncoder and JSONDecoder types to do Codable, though internally they have to encode to/decode from the Foundation objects that JSONSerialization produces.

vlovich123 7 years ago | |

Hey Raph, have you seen https://github.com/bmkor/gason? Seems like a low-cost bridge to a high-performance C++ implementation.

raphlinus 7 years ago | | |

Hadn't seen that particular wrapper, but if we're going to take on an FFI solution, we're more likely to use Rust for this, and implement more logic than just JSON parsing.

glangdale 7 years ago |

One of the two authors here. Happy to answer questions.

The intent was to open things but not publicize them at this stage but Hacker News seems to find stuff. Wouldn't surprise me if plenty of folks follow Daniel Lemire on Github as his stuff is always interesting.

xfs 7 years ago |

If you're working with JSON objects on the larger end, quite often you don't need all of them, just a small part. For that workload, the thing to do is parse as little data as possible: skip validation, locate the relevant bits, and only then start parsing, validating and all the rest. In this case, optimizing the JSON scanner/lexer gives a much greater improvement than optimizing the parser.

Though this job is trickier than it may look. The logic to extract the "relevant" bits is often dynamic or tied to user input but for the scanner/lexer to be ultrafast it has to be tightly compiled. You can try jitting but libllvm is probably too heavyweight for parsing json.
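A minimal sketch of the "locate first, parse later" idea, assuming Python's standard json module and ASCII key names. The helper extract_value is hypothetical. It finds the key with a plain substring search and then uses JSONDecoder.raw_decode, which decodes one value starting at a given index and stops, so the rest of the document is neither parsed nor validated.

```python
import json

_DECODER = json.JSONDecoder()
_WHITESPACE = " \t\r\n"


def _skip_ws(text, i):
    while i < len(text) and text[i] in _WHITESPACE:
        i += 1
    return i


def extract_value(text, key):
    """Decode only the value of the first member named `key`, at any depth."""
    needle = json.dumps(key)  # the key as it appears in the text, quotes included
    pos = text.find(needle)
    while pos >= 0:
        i = _skip_ws(text, pos + len(needle))
        if i < len(text) and text[i] == ":":
            # raw_decode parses one value starting at the index and stops there.
            value, _end = _DECODER.raw_decode(text, _skip_ws(text, i + 1))
            return value
        pos = text.find(needle, pos + 1)
    raise KeyError(key)


# The tail of this document is not even valid JSON; it is never looked at.
doc = '{"meta": {"ids": [1, 2, 3]}, "name": {"first": "Ada"}, "rest": [tru'
name = extract_value(doc, "name")
```

The substring search is the naive part: it matches the key at any nesting depth and knows nothing about the document's structure, which is exactly the work a fast scanner would have to do properly.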

jillesvangurp 7 years ago |

Number handling looks like it would be a problem. There are test suites for JSON parsers, and lots of parsers fail many of these tests. Check e.g. https://github.com/nst/JSONTestSuite which checks compliance against RFC 8259.

Publishing results against this could be useful both for assessing how good this parser is and establishing and documenting any known issues. If correctness is not a goal, this can still be fine but finding out your parser of choice doesn't handle common json emitted by other systems can be annoying.

Regarding the numbers, I've run into a few cases where Jackson being able to parse BigIntegers and BigDecimals was very useful to me. Silently rounding to doubles or floats can be lossy and failing on some documents just because the value exceeds max long/int can be an issue as well.
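To make the lossiness concrete, here is a small sketch using Python's standard json module instead of Jackson. The document and its field names are invented. parse_int=float imitates a parser that stores every number as a double, and parse_float=Decimal plays the role of BigDecimal.

```python
import json
from decimal import Decimal

# 2**64 + 1 does not fit in 64 bits, and the price has more digits than a double keeps.
doc = '{"id": 18446744073709551617, "price": 0.1000000000000000055511151231257827}'

default = json.loads(doc)                       # Python ints are arbitrary precision
doubles_only = json.loads(doc, parse_int=float)  # every number becomes a double
exact = json.loads(doc, parse_float=Decimal)     # decimals keep every digit

id_survives = default["id"] == 2**64 + 1
id_rounded = doubles_only["id"] == 2**64         # the trailing +1 is silently lost
price_rounded = default["price"] == 0.1
price_exact = exact["price"] == Decimal("0.1000000000000000055511151231257827")
```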

baybal2 7 years ago |

> We store strings as NULL terminated C strings. Thus we implicitly assume that you do not include a NULL character within your string, which is allowed technically speaking if you escape it (\u0000).

I've lost count of the broken JSON parsers that all fall for that.

groestl 7 years ago | |

Yeah, this is unforgivable, and for me makes the whole speed argument void.

Edit: to be fair, they handle a couple of other things, which many similar libraries ignore. I particularly like the support for full 64bit integers. And at least they document their limitation on NULL bytes.

glangdale 7 years ago | | |

"Unforgivable" is a bit strong. I don't think this is something which invalidates our entire approach - nothing in the algorithm depends on this behavior as the \0 chars don't appear until quite late. Even then, we are not dependent on sighting a \0 in our string normalization and as such we can probably just store an offset+length in our 'tape' structure rather than assuming we have null terminated strings.

Please add an issue on Github.

Edit: I went ahead and added an issue. Seems like something we should fix.
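The difference between the two storage schemes discussed above can be sketched in a few lines of Python. This is only an illustration of the general idea and not simdjson's actual tape layout; the pools and helper functions are hypothetical.

```python
import json

# "\u0000" is a legal escape, so the first string really contains a NUL.
strings = [json.loads(s) for s in ('"ab\\u0000cd"', '"xyz"')]

# Variant 1: NUL-terminated storage, as in a pool of C strings.
c_pool = b"".join(s.encode("utf-8") + b"\x00" for s in strings)


def read_c_string(pool, offset):
    end = pool.index(b"\x00", offset)  # stops at the first NUL, embedded or not
    return pool[offset:end].decode("utf-8")


# Variant 2: the same bytes with an (offset, length) entry per string.
pool = bytearray()
entries = []
for s in strings:
    data = s.encode("utf-8")
    entries.append((len(pool), len(data)))
    pool += data


def read_slice(pool, entry):
    offset, length = entry
    return bytes(pool[offset:offset + length]).decode("utf-8")
```

Reading the first string back from c_pool yields only "ab", while the offset+length entry returns all five characters.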

adrianN 7 years ago |

I feel like if you need to parse Gigabytes per second of JSON, you should probably think about using a more efficient serialization format than JSON. Binary formats are not much harder to generate and can save a lot of bandwidth and CPU time.

kccqzy 7 years ago |

I guess the question is, what do you parse it to? I'm guessing definitely not turning objects into std::unordered_map and arrays into std::vector or some such. So how easy is it to use the "parsed" data structure? How easy is it to add an element to some deeply nested array for example?

Falell 7 years ago | |

The ParsedJson type is immutable and accessed through mutating iterators (up and down the tree, forward and backward through members and indices).

My immediate thought is to compare it to rapidjson, which I've used before. The paradigm of mutating iterators seems awkward at first but should be just as powerful as rapidjson's Value. For example, both approaches end up doing a linear scan to find an object member by name.

The fact that rapidjson supports mutation of Values and simdjson does not has huge implications (as mentioned in the scope section of the simdjson README). I suspect this tradeoff explains most of the performance difference, as I know rapidjson also uses SIMD internally.
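For readers unfamiliar with the linear-scan approach to object members, here is a small Python sketch. It is not the API of either library: it keeps each object as a tuple of (key, value) pairs in document order, via the standard json module's object_pairs_hook, and the hypothetical find_member walks the pairs until the name matches.

```python
import json


def find_member(obj_pairs, name):
    """Linear scan over an object's (key, value) pairs, in document order."""
    for key, value in obj_pairs:
        if key == name:
            return value
    raise KeyError(name)


# Objects become tuples of pairs instead of hash maps; arrays stay lists.
doc = json.loads('{"a": 1, "b": {"c": [10, 20]}}', object_pairs_hook=tuple)
nested = find_member(find_member(doc, "b"), "c")
```

No hash table is built per object, which keeps parsing cheap; the cost moves to lookup time, which is proportional to the number of members.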

hnaccy 7 years ago | | |

Is there a reason these fast json libraries seem to favor doing linear scan for object representation?

saagarjha 7 years ago | |

The data is put into a "ParsedJson" object: https://github.com/lemire/simdjson/blob/master/include/simdj...

scottlamb 7 years ago | | |

That header mentions a tape.md describing the format. It's really interesting:

https://github.com/lemire/simdjson/blob/master/tape.md

westurner 7 years ago |

> Requirements: […] A processor with AVX2 (i.e., Intel processors starting with the Haswell microarchitecture released 2013, and processors from AMD starting with the Rizen)

aristidb 7 years ago | |

Also noteworthy that on Intel at least, using AVX/AVX2 reduces the frequency of the CPU for a while. It can even go below base clock.

scottlamb 7 years ago | | |

iirc, it's complicated. Some instructions don't reduce the frequency; some reduce it a little; some reduce it a lot.

I'm not sure AVX2 is as ubiquitous as the README says: "We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel."

I guess "mainstream" is somewhat subjective, but some recent Chromebooks have Celeron processors with no AVX2:

https://us-store.acer.com/chromebook-14-cb3-431-c5fm

https://ark.intel.com/products/91831/Intel-Celeron-Processor...
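If you want to check a particular machine rather than a spec sheet, one option on x86 Linux is the flags line of /proc/cpuinfo. The sketch below assumes that file format; the helper name and the sample text are made up, and other operating systems need a different mechanism.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            return set(value.split())
    return set()


# Invented sample; on a real x86 Linux box, pass open("/proc/cpuinfo").read().
sample = "processor\t: 0\nmodel name\t: Example CPU\nflags\t\t: fpu sse2 avx avx2\n"
has_avx2 = "avx2" in cpu_flags(sample)
```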

ben-schaaf 7 years ago |

I wonder how this compares to fast.json: "Fastest JSON parser in the world is a D project?" (https://news.ycombinator.com/item?id=10430951), both in an implementation/approach sense and in terms of performance.

yeldarb 7 years ago |

Will this work on JSON files that are larger than the available system memory?

Firebase backups are huge JSON files and we haven’t found a good way to deal with them.

There are some “streaming JSON parsers” that we have wrestled with but they are buggy.

glangdale 7 years ago | |

Sadly it will not. Arguably we could 'stream' things, but we don't have an API or a use case for it. If you could capture your requirements and put them on an issue on Github, it would be helpful. We're not against the streaming use case, we just don't understand it very well.

nojvek 7 years ago | |

Probably not. It requires a memory allocation the size of the file for parsing.

However they have the ability to build a tape out of the json and find the interesting marks. Perhaps it can be adapted to make a fast parser that only parses the relevant stuff but zooms through the large file in blocks.

xnormal 7 years ago |

Any chance of something similar for CSV? (full RFC-4180 including quotes, escaping etc).

Terabytes of "big data" get passed around as CSV.
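For reference, this is the kind of input that "quotes, escaping etc." covers: a comma inside a quoted field, a doubled quote, and a line break inside a field. The sketch uses Python's standard csv module, not a SIMD parser, and the sample data is invented.

```python
import csv
import io

# One header row, then a record whose fields contain a comma, an escaped
# (doubled) quote and an embedded line break.
data = 'name,note\r\n"Smith, J","said ""hi""\nand left"\r\n'

# newline="" leaves line endings untouched so the csv module can interpret them.
rows = list(csv.reader(io.StringIO(data, newline="")))
```

A parser that simply splits on commas and newlines would see four fields and three lines here instead of two records of two fields each.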

glangdale 7 years ago | |

CSV is on our list; this is a simpler task than JSON due to the absence of arbitrary nesting.

imtringued 7 years ago | | |

I doubt someone using CSV for big data is going to follow that rule...

blaisio 7 years ago | |

It's probably relevant to mention https://github.com/BurntSushi/rust-csv. It uses a state machine (which seems to be the author's expertise) to parse CSVs really fast. Based on some other work, you can do better if you use some of the new SIMD instructions.

badeu 7 years ago | |

I've developed a fully RFC-compliant CSV parser with Python bindings, supporting instruction sets from SSE4 to AVX-512; however, I'm struggling with my hierarchy to open-source it at the moment.

But the goal of my message is not to tease you with unavailable code. It's just to say that it is a lot simpler to write a CSV parser than a JSON parser.

So, do not hesitate to write one yourself! It's easy and a nice way to introduce yourself to SIMD instructions.

fooyc 7 years ago |

What happens to the parsed data? Do the benchmarks account for the time to access that data after parsing?

ftp-bit 7 years ago |

Perhaps I'm misunderstanding or don't have a good enough grasp of this, but, in what circumstance would you need to parse gigabytes? I've only seen it be used in config files, so...

userbinator 7 years ago | |

What usually happens is someone creates an API, one which did not initially have to handle much data, and then it just grew over time. (I guess it's similar to how a lot of the Internet's early application-layer protocols like HTTP, SMTP, etc. are text-based --- the text format was initially more "convenient" for a variety of reasons, but obviously is not very efficient at scale.)

Or, perhaps a more common scenario today, it was designed by people who simply had no knowledge of binary protocols or efficiency at all --- not too long ago I had to deal with an API which returned a binary file, but instead of simply sending the bytes directly, it decided to send a JSON object containing one array, whose elements were strings, and each string was... a hex digit. Instead of sending "Hello world" it would send '{"data":["4","8"," ","6","5"," ","6","C"," " ... '
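To put a number on that encoding, here is a small Python reconstruction. The exact shape of the payload is inferred from the fragment quoted above (upper-case hex digits and spaces, one per array element, under a "data" key), so treat it as an assumption.

```python
import json

payload = b"Hello world"

# "48 65 6C ..." split into one-character strings, as in the API described above.
hex_chars = list(payload.hex(" ").upper())
wrapped = json.dumps({"data": hex_chars}, separators=(",", ":"))

overhead = len(wrapped) / len(payload)  # JSON characters per byte of payload
```

For this 11-byte message the JSON comes to 138 characters, a bit more than 12 times the size of the raw bytes, before any transport overhead.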

detaro 7 years ago | |

Log files? More and more places are switching to easily machine-parsable logs to run statistics and checks over, and JSON is a common format (e.g. because it's still somewhat human-readable and will work over logging infrastructure set up to transport lines of text)
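A minimal sketch of that kind of log processing, assuming one JSON object per line. The "level" and "msg" field names are invented for the example, and Python's standard json module stands in for whatever parser the pipeline actually uses.

```python
import io
import json
from collections import Counter


def count_levels(lines):
    """Count log records per level, parsing one JSON object per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        counts[record.get("level", "unknown")] += 1
    return counts


log = io.StringIO(
    '{"level": "info", "msg": "start"}\n'
    '{"level": "error", "msg": "boom"}\n'
    '{"level": "info", "msg": "done"}\n'
)
levels = count_levels(log)
```

Each line is parsed independently, so the file never has to fit in memory, and the total volume adds up quickly on a busy service.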

glangdale 7 years ago | |

There are some quite big JSON files out there; you might also be interested in parsing megabytes but not spending more than 1ms to get through it.

maliker 7 years ago |

If this kind of work is interesting to you, you might like Daniel Lemire's blog (https://lemire.me/blog/).

He's a professor, but his work is highly applied and immediately usable. He manages to find and demonstrate a lot of code where we assume the big-O performance, but the reality of modern processors and caching (etc.) means very different performance in practice.

sbr464 7 years ago |

Thanks for posting. I've been working with lidar/robotic data more recently and it's nice to work with JSON directly, when the performance is good enough.

avmich 7 years ago |

> All JSON is JavaScript, but not all JavaScript is JSON

Really? I thought the specifications diverged quite a while ago (though using those extras could be discouraged in some cases).

fulafel 7 years ago |

What's the current state of the art in doing this on GPU?

glangdale 7 years ago | |

To my knowledge, it is limited to posting "Towards JSON Parsing on a GPU" type articles. Writing that sort of article is easy and fun, without the tedious burden of implementing things.

tenken 7 years ago |

I'm curious how fast the sqlite json extension is for validation and manipulation of json data when compared to this library.

kitd 7 years ago |

OT, but I notice it can be run by #include-ing the simdjson.cpp file. How common is this in CPP projects?

Erwin 7 years ago | |

It seems like there are quite a few single-header C++ libraries: https://github.com/nothings/single_file_libs

The people complaining about dependency management in Python should try doing it in C++; there seem to be half a dozen competing ones. And three times as many build systems.

vkaku 7 years ago |

Honestly, this is a cool hack. But it's not the best way to shuttle that much data around.

It's a hammer on rocket fuel.

hrdwdmrbl 7 years ago |

Would it be possible to make a native module out of this for node?

sbr464 7 years ago | |

Here are the Node bindings for RapidJSON; I'm assuming it would be similar.

https://github.com/matthewpalmer/node-rapidjson

hrdwdmrbl 7 years ago | | |

Thank you!

Though from the readme on that module the dev says "it turns out that you’re better off using the normal Node.js/V8 implementation unless you’re operating on huge JSON.

... the bridging from V8 to C++ is a bit too costly at this stage."

iamleppert 7 years ago |

I assume this is faster than the browser’s native parsing speed?

achalkley 7 years ago |

Will this work on an Arduino?

abhorrence 7 years ago | |

This code in particular won’t, since it relies on a particular extension of the x86 instruction set. I don’t believe Arduino compatible chips have simd instructions, but if they do, a similar approach could be taken.

glangdale 7 years ago | | |

I'm not aware of any SIMD-capable Arduino chips; even when Quark was a thing, it didn't support SIMD.

It's possible to do SWAR (SIMD Within A Register) tricks to try to substitute, but on a 32-bit processor (or even a 64-bit processor) I doubt our techniques would look good. In Hyperscan, my regex project, we used SWAR for simple things (character scans) but I doubt that simdjson would work well if you tried to make it into swarjson. :-)
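For anyone who hasn't seen SWAR, here is the classic "does this word contain a given byte" trick, written in Python purely to show the arithmetic. It is a textbook bit hack and not code from Hyperscan or simdjson; a real implementation would be C on native 64-bit words.

```python
ONES = 0x0101010101010101   # 0x01 in every byte lane
HIGHS = 0x8080808080808080  # 0x80 in every byte lane


def has_byte(word, byte):
    """True if any of the 8 bytes packed in the 64-bit `word` equals `byte`."""
    x = word ^ (ONES * byte)  # lanes equal to `byte` become 0x00
    # Standard zero-byte test: a lane that was 0x00 ends up with its high bit set.
    return ((x - ONES) & ~x & HIGHS) != 0


with_quote = int.from_bytes(b'ab"cdefg', "little")
without_quote = int.from_bytes(b"abcdefgh", "little")

found = has_byte(with_quote, ord('"'))
missed = has_byte(without_quote, ord('"'))
```

This tests 8 characters with a handful of integer operations, which is fine for a simple character scan but far narrower than the 32 bytes per instruction that AVX2 gives.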