Json vs. simplejson vs. ujson (jyotiska.github.io)
However, after swapping a fairly large and json-intensive production spider over to ujson, we noticed a large increase in memory use.
When I investigated, I discovered that simplejson reused allocated string objects, so when parsing/loading you basically got string compression for repeated string keys.
The effects were pretty large for our dataset, which was all API results from various popular websites and featured lots of lists of things with repeating keys; on a lot of large documents, the loaded mem object was sometimes 100M for ujson and 50M for simplejson. We ended up switching back because of this.
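A minimal sketch of the key-sharing idea described above, using only the standard library. This is not how simplejson implements it internally; it just shows that equal keys from different objects can end up as one shared string object. The sample document and the helper name `intern_keys` are made up for illustration.

```python
import json
import sys


def intern_keys(pairs):
    # json.loads calls this for every JSON object with its (key, value) pairs.
    # sys.intern returns one shared str object per distinct key text.
    return {sys.intern(key): value for key, value in pairs}


doc = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
rows = json.loads(doc, object_pairs_hook=intern_keys)

first_keys = list(rows[0])
second_keys = list(rows[1])
# Equal keys in different objects are now the very same str object.
shared = all(a is b for a, b in zip(first_keys, second_keys))
print(shared)
```

With thousands of records that repeat the same handful of keys, each key is stored once instead of once per record, which is the kind of saving the comment describes.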
cjson's way of handling unicode is just plain wrong: it uses utf-8 bytes as unicode code points. ujson cannot handle large numbers (somewhat larger than 2^63; I've seen a service that encodes unsigned 64-bit hash values in JSON this way, and ujson fails to parse its payloads). With simplejson (when using the speedups module), a string's type depends on its value, i.e. it decodes strings as 'str' type if their characters are ASCII-only, but as 'unicode' otherwise; strangely enough, it always decodes strings as unicode (like the standard json module) when speedups are disabled.
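For comparison, a small check of how the standard json module treats an unsigned 64-bit value above 2^63. The `hash` field name is invented for the example; the point is only that Python integers are arbitrary precision, so the stdlib round trip does not truncate.

```python
import json

U64_MAX = 2**64 - 1  # largest unsigned 64-bit value, well above 2**63

payload = json.dumps({"hash": U64_MAX})
decoded = json.loads(payload)

print(payload)
print(decoded["hash"] == U64_MAX)
```

Running the same payload through any alternative parser is a quick way to find out whether it has the limitation described above before it reaches production.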
https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...
and https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...
and https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...
and many more just like those.
The worst part is the bugs that appear or disappear depending on whether simplejson's speedups module is in use or not.
Its syntax is nginx-like but can also parse strict json. It's pretty fast too.
More info here: https://github.com/vstakhov/libucl
Apart from that, though, this looks like a really good format.
[1] https://github.com/vstakhov/libucl#automatic-arrays-creation
OnTopic: I think is an unusual way of saying "the environment I use"
The default json module took close to 5 seconds to deserialize the payload once it hit the server, while ujson could do the same work in a fraction of the time (less than a second). 5 seconds might not seem like a whole lot when the import process as a whole could take 30 seconds or so, but when the user is stuck staring at their device it makes sense to cut down the response time any way you can.
We ended up using ijson.
for the typical AJAX call for some rows of data selected from a datastore and JSON encoded, then no, the JSON encoding is not the bottleneck; the network latency and database I/O time dominate the time it takes to JSON encode the data.
however, consider an alternative kind of task that might, for example, produce a big JSON dump of thousands of records. this is fairly typical of a data export of some kind. the network and database time for this request is the same as for the smaller one, but now instead of JSON encoding 50 records you're encoding 50000 records. it can start to add up. a poorly optimized JSON library will add multiple full seconds to your response time here.
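A rough way to see the difference on your own machine, sketched with the standard json module and timeit. The record shape is hypothetical, and the absolute numbers depend on hardware and Python version, so treat the output as a relative comparison only.

```python
import json
import timeit


def make_rows(n):
    # Hypothetical export rows; the field names are made up for illustration.
    return [
        {"id": i, "name": "item-%d" % i, "tags": ["a", "b"], "price": i * 0.5}
        for i in range(n)
    ]


def encode_seconds(rows, repeat=3):
    # Best-of-N wall time for a single json.dumps call over the whole list.
    return min(timeit.repeat(lambda: json.dumps(rows), number=1, repeat=repeat))


small = make_rows(50)
large = make_rows(50000)
small_s = encode_seconds(small)
large_s = encode_seconds(large)
print("50 rows:    %.6f s" % small_s)
print("50000 rows: %.6f s" % large_s)
```

Swapping `json.dumps` for another encoder inside `encode_seconds` gives a like-for-like comparison on the payload sizes that actually matter to you.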
So I can't serialize things with ultrajson that aren't serializable? I must be missing something in this statement.
> The verdict is pretty clear. Use simplejson instead of stock json in any case...
The verdict seems clear (based solely on the data in the post) that ultrajson is the winner.
This might not be what they're talking about, but I did run into what might be the same issue when looking at ujson before. The builtin JSON module lets you define custom serializations for types that aren't natively JSON-serializable; we had an application that did that with datetime objects, encoding them as ISO 8601 date strings. ujson doesn't support anything like that; you have to make sure everything is one of the JSON types already before encoding.
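A short sketch of the stdlib hook being described, plus one way to prepare data for an encoder that lacks such a hook. The helper names `encode_extra` and `to_json_types` and the sample record are hypothetical.

```python
import json
from datetime import date, datetime


def encode_extra(obj):
    # json.dumps calls this only for objects it cannot serialize natively.
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError("not JSON serializable: %r" % (obj,))


def to_json_types(value):
    # Pre-convert a structure to plain JSON types, for encoders without a hook.
    if isinstance(value, dict):
        return {k: to_json_types(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_types(v) for v in value]
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    return value


record = {"event": "import", "when": datetime(2014, 6, 1, 12, 30, 0)}
encoded = json.dumps(record, default=encode_extra)
plain = to_json_types(record)
print(encoded)
print(plain)
```

The pre-conversion pass walks the whole structure once in Python, so it gives back part of whatever speed the faster encoder gains.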
ultrajson isn't a drop-in replacement, though, because it doesn't support sort_keys.
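For reference, this is what the option does in the standard json module (the sample record is made up). Code that relies on stable output, for example to diff or hash encoded documents, depends on it.

```python
import json

record = {"b": 1, "a": {"d": 4, "c": 3}}

unsorted_out = json.dumps(record)
sorted_out = json.dumps(record, sort_keys=True)

print(unsorted_out)  # keys in insertion order
print(sorted_out)    # keys sorted at every nesting level
```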
Well-defined collections? As in, serializable? Well sure, that's requisite for the native json package as well as simplejson (as far as I can recall -- haven't used simplejson in some time.)
But does "texts" refer to strings? As in, only one data type? The source code certainly supports other types, so I wonder what this statement refers to.
What about larger dictionaries? With such a small one I would be worried that a significant proportion of the time would be simple overhead.
[Warning: Anecdote] When we were testing out the various JSON libraries we found simplejson much faster than json for dumps. We used large dictionaries.
Was the simplejson package using its optimized C library?
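The same question applies to the standard library, which also falls back to pure Python when its C extension is missing. A minimal check for the stdlib case is sketched below; the function name is hypothetical. simplejson ships its own optional C extension, so an equivalent check there would need to look at that package instead.

```python
import json.encoder
import json.scanner


def stdlib_json_accelerated():
    # Both names are set to None when the _json C extension cannot be imported.
    return (
        json.encoder.c_make_encoder is not None
        and json.scanner.c_make_scanner is not None
    )


accelerated = stdlib_json_accelerated()
print("C accelerator in use:", accelerated)
```

Recording this alongside benchmark numbers makes it easier to tell whether two results are actually comparable.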
I completely failed to read this the first time I went through. I guess this is equivalent to dumping bigger dictionaries.
> [Warning: Anecdote] When we were testing out the various JSON libraries we found simplejson much faster than json for dumps.
Turns out we were using sort_keys=True option, which apparently makes simplejson much faster than json.
(BTW: I got tempted to try ujson exactly for the original blog post, i.e. http://blog.dataweave.in/post/87589606893/json-vs-simplejson...)
Plus, AFAIK, at least in Python 3 json IS simplejson (but a few versions older). So every comparison of these libraries is going to give different results over time (likely with the difference getting smaller). Of course, simplejson is the newer version of the same thing, so it's likely to be better.
I leave this here in case it helps others.
We had other priorities, such as working well for both Python and Java.
At the time we went with msgpack. As msgpack is doing much the same work as JSON, it just shows that the magic is in the code, not the format.
JSON is a data representation, not a data model.
The speed difference between working with binary streams and parsing text is night and day.
It was a big disappointment after seeing these kinds of performance improvements.
Ansible combines YAML with jinja2 to do this type of stuff, for instance.
I'm not totally certain, but I think that might end up being simpler, more expressive and more powerful than this.
However, that hash table stores weak references to those strings. If nothing else refers to a string, the GC can and will remove it from the string table.
This gives you great memory use for strings and optimally fast string comparisons. The cost is that creating a string is probably a bit slower because you have to check the string table for the existing one first.
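As a loose Python analogy (not Lua's implementation), `sys.intern` gives the same "one object per distinct string" behaviour on request. Python does not intern every string automatically, so the lookup cost is only paid where you ask for it.

```python
import sys

# Build two equal strings at run time so they are created independently.
a = "".join(["user", "_", "name"])
b = "".join(["user", "_", "name"])

ia = sys.intern(a)
ib = sys.intern(b)

print(a == b)    # equal by value
print(ia is ib)  # one shared object after interning
```

Once both sides are interned, an equality test can be answered by comparing object identity, which is the fast comparison the comment refers to.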
It's an interesting set of trade-offs. I think it makes a lot of sense for Lua which uses hash tables for everything, including method dispatch and where string comparison must be fast. I'm not sure how much sense it would make for other languages.
You can discover what internal strings are held in a web application via a timing attack.
Better hope you never hold onto a reference to internal credentials inside the application! (Say... DB username / password? Passwords before they're hashed? Etc.)
For example, Erlang symbols are deeply ingrained into the language, and the VM doesn't even garbage collect them, so creating symbols from user data is basically giving the user a 'crash VM' button.
On the other hand, if symbols are treated as just another data type, as strings with some optimizations, no such problems should arise.