I know the HN police will cite me for no citation, so I'd say it comes with experience. The law changed at one point, and we were legally bound to be able to locate a customers record by a piece of data in a blob. The only way to fix the problem was to dump the massive (1tb+) table and reinsert them into a real schema. The engineering effort to do this took 9 months to get it right, because other people changed the way blobs were written out over the course of years.
Being clever doesnt pay, again.
I agree 100%.
I read that title and expected the post to begin with "Never. You should never serialize objects into a single field." I was disappointed.
If you need schemaless storage, use a schemaless DB. I don't understand what's so difficult about that. I wouldn't try to shove unstructured data into PostgreSQL any more than I'd try to shove relational data into MongoDB.
Background: we had been storing other, similar data in a structured way for years, so we had a system set up to do it right. I'm not sure what the rationale was for switching, but it was declared by fiat over the protests of a team of five experienced .NET developers and an experienced lead DBA. We began experiencing problems from it before we were even a month into the project, such as serialization output not agreeing between client apps (.NET serialization is NOT designed to be a shared archive format!!), implementation requiring breaking the separation of concerns between layers of our application, etc. And for what? When asked what we would do when the format changes, management cheerfully replied "oh we'll just write + run a conversion EXE to update the data in bulk. Why, we do that all the time in [other engineering team who cowboy-codes everything and operates with a level of technical debt that makes it suck to work on that codebase]."
Of course this wasn't the only reason I was resigning, but it made the decision easier!
> reinsert them into a real schema
That sounds sort of like you decided to fix a bunch of problems at the same time...
A real database with real schema involves a series of guaranteed logical relations. Each violation of these logical relations tends to result in a different kind of problem (if you have a relation requiring pairing cars and drivers, you could have the problem of cars without drivers and drivers without cars. The potential logical problems multiply as the effective schema grows, without you have explicit schema or not).
So basically the move of using a real schema fixes a wide variety of real and potential problems compared to ad-hoc solutions. There are many ad-hoc solutions but since these aren't guaranteeing logical relations, such solutions tend to have holes the appear later.
So the gp may have been forced to use a real schema based on the multiplication of problems or they may have just done it because it was the right thing.
And even then, only if there's loads of config values. If you've only got 5 or 10, that solution is bad.
The article is implicitly talking about business objects.
It all depends on the data, but if it's not simple, then serialized JSON values would generally incur performance hit for search operations. Breaking out the data into separate columns could be better indexed.
For stuff like this then I would say it's always preferable to store a json encoded representation rather than a format like pickle (python's object serialization format). If you don't and some clever chap works out a way to write input to that field then you've got an easy RCE. Plus it's easier to debug JSON, and databases like PG have a native data type for it.
Later on I joined Pinterest as their first MySQL DBA. They had copied the sharding system from FB, but instead of having a bunch of columns, they just stored a JSON blob. This saved them from learning how to perform schema change until I joined the company. This is a pretty incredible feature.
We have a new feature under development (which will be open sourced as part of Percona MySQL) which will allow column level compression with an optional predefined dictionary. During testing, this resulted in a 30% additional reduction in spaced consumed versus InnoDB page compression AND doubles our peak QPS at lower latency. This would not work well with many individuals columns, but kicks ass for JSON blobs.
http://www.slideshare.net/denshikarasu/less-is-more-novel-ap... (slides 37, 40, 41, 42)
I think a better sqlite page about the concept of serializing things in your database is the fact that sqlite has json support.
Though checking now for this, someone has packaged a later version of SQLite with this extension [1]. I wonder if there is any possible performance advantage when using a system provided SQLite vs. one installed with the application?
1. There are no use cases that would require you to SELECT on the fields in the serialized data structures.
2. You anticipate that the data structures are going to change frequently during development, so that turning them into relations is going to involve a lot of schema migrations.
Basically you give up the possibility of being able to SELECT on some of the data in return for being able to change its format rapidly and cheaply.
I worked on a project recently where this was helpful -- when I designed the database schema I didn't know the details of many of the data structures that were going to have to be stored there. From the use cases I could deduce the set of fields that would need to be SELECTed on, but the other fields were ill-defined. By storing them as blobs (actually as JSONB fields, since this was PostgreSQL) I could safely defer the decision about how to design these parts of the database, without incurring lots of schema migrations along the way.
[{
"tags": ["cell"],
"number": "1231231234"
}]
Here's a snippet demonstrating how you can do a lateral left join on the column to find the number tagged 'cell' in tags array: select * from mytable t
left join lateral (
select phone->'number' as cell_phone from
jsonb_array_elements(t.phone_numbers) phone
where phone->'tags' @> '["cell"]'
) p on true;Plus even if it did work, it involves logging into the server, activating my python virtualenv, pulling the data out via the python / Django shell an unpickling and printing it. As opposed to running a query on my local machine connecting to the database. When you are debugging a problem and just want to get an overview of what is happening, that is a hell of a lot of hassle.
So, yep, we stored everything as serialized objects in the database. We had a separate table for 'change events', and whenever someone changed the contents of one of the textareas we stored it in that table. A worker would eventually update the the serialized object, but in the meantime, we would load the serialized object and apply all of the changes that had happened since it was last updated. Basically, an Event Sourcing pattern.
So it was either that, or the EAV route, or the 'end user altering the database' route. The latter two options sounded even worse. Our solution worked out pretty well. Admittedly, it only had a few hundred concurrent users, and even the biggest document was never more than 100K.
So, it can work, but YMMV.
We also built the necessary tools to extract/re-process records easily, and the architecture worked well for us. As our needs or the schema changed we could easily accommodate those changes without undue effort.
It doesn't directly address the question, and I'm not sure that I'd use the same solution if the volume were predicted to be significantly higher, but in our case it worked beautifully. (Happily our volume was predictably within a known range, for reasons I won't go into.)
Uber moved to a similar architecture in 2014-2015, ~5 years after article and the original Friendfeed post. Being able to operate MySQL predictably at scale is extremely valuable to high-growth companies, enough to tilt in the favor of unconventional schema choices versus less proven NoSQL alternatives.
https://eng.uber.com/schemaless-part-one/
Except, of course, if the data only exists to support the application (some sort of buffer, cache, or session storage).
sure but this needs to be balanced with performance UNTIL then
Why not just use a document oriented database instead? Seems like a good use case for MongoDB for example: https://www.mongodb.com/compare/mongodb-mysql
ZODB is arguable a novel approach to persistence using Python. And certainly worth taking some time just to play with it -- the barrier of entry low, e.g. `pip install`. But for each positive there are negatives..
"You got it buddy: the large print giveth, and the small print taketh away"
For an academic project, I was calculating ~2mln RNA-RNA interactions from their sequences. Since this calculation stays a requirement for all further calculations, being the naïve kid I was, I started pickle'ing the results.
To feel like the cool kid, I wanted to involve a database somehow -- so after trying out a bunch of options, I finally settled for ZODB. As the project scaled up, soon the ZODB started being a big pain, because as I recall, it only allows a limited number of connections even in the read operations.
Lesson learned, though, it now resides as a lookup table in a PgSQL instance.
Now imagine the same program also has an internal database where these JSON objects can be imported and used. Does it make sense that, when actually in use, these objects are relational and split between 10 complicated tables? Why should someone bother writing complex import/export conversion functions, maintaining them in the future, and having worse performance. Wouldn't it be much simpler, maintainable, and faster to just plop the JSON in the database?
That's just the format du jour. Ten years ago it would have been "store an XML encoded representation" and ten years before that it would have been some delimited representation. Tomorrow it may be yaml or something even more hideous.
Blobs in the DB can make sense in some situations but they should really be blobs: images, or other binary/raw data. But beyond a certain size it's almost always easier to manage things like these in a filesystem and just store a pointer (filename) in the DB.
That's called progress. We can't store it in some future language, who wants to store it as XML, so let's use the best of what we have.
File systems are about storing files. Databases are about intelligently organizing data for retrieval and reliably delivering atomic transactions.
Any system that I've seen scale up well separated blob data to a traditional or object file system. In addition to scaling the database more effectively, this allowed the infrastructure teams to optimize delivery of blob data from a platform POV.
Yeah, better to instead deal with the overhead of the database combined with the overhead of the filesystem! Referential integrity is also very easy to deal with.
I work at a place that has been trying to undo the damage of having serialized BLOBs for about 3 years. Granted, these are especially nasty things with a custom serialization layer written in Java.
Software companies consist of two assets produced by it's employees: data and algorithms to make the data useful.
Why would you ever lock your data down and make it harder to make cool algorithms to make the data useful?
Stop right there. There are plenty of examples for storing blobs in a datastore - images and videos are 2 prime examples.
I think what you should say is in a SQL database. Google's AppEngine with their datastore makes it dead easy and with high performance to store&retrieve things like images and videos.
Experience has taught me that it's next to impossible to know what types of queries will be needed for the entire lifetime of a project. If you absolutely must store data like this, please leave a clear migration path for moving some or all of it into well structured tables when (not if) it becomes necessary.
The project is an industrial control system. Different kinds of item are processed by the system. Each kind of item needs image acquisition parameters, for example field of view. There's no use case that needs to select kinds of item whose image capture parameters specify a particular field of view, and it's clear that such a use case is very unlikely to arise — the image acquisition parameters are part of the internal specification of the system, not something that any end user is interested in.
At database design time it wasn't settled which model of camera was going to be used, and so it was far from settled which parameters were going to be needed. Using a JSONB field for the image acquisition parameters meant that I could leave the specification of these parameters to the camera programmer, without having to incur a series of database migrations as the correct set of parameters were worked out.
Even in your project, where new reporting requirements eventually arose, it doesn't mean the original decision was necessarily a bad one. Sometimes it makes sense to incur technical debt in order to bring a product to market in a timely fashion.
- Move to a fully relational schema:
- - Pros: You can leverage the power of the database to efficiently index, search, and aggregate any specific column without too much trouble.
- - Cons: Schema change may be expensive. Requires processing all data into a new structure. Requires massive changes to large swaths of your data handling routines. Must be done all at once, or you need to run both systems in parallel for a while during conversion. In this case, took 9+ months.
- Add a field on that table:
- - Pros: Simple. Requires very little change in processing routines. Efficient. Can rely on DB typing to require data be present for a record.
- - Cons: Depending on database may require excessive downtime while schema is updated.
- Add a table that tracks this field and links to the main record.
- - Pros: Simple. Requires relatively little change in processing routines. Efficient. Zero downtime. Can fill old records with background process.
- - Cons: Hard to enforce that the data exists for every record. Can be mitigated with report generated for records without this link.
Now, given those choices, I would say the correct choice in any particular situation depends quite a bit on external constraints. Do you have 9+ months and the free developer time to essentially redesign what may be large swaths of your back-end? Do you have assurance (tests, language features, etc) that you won't increase enough bugs to negate the benefits of moving to a fully relational schema? Is there a looming deadline on when the works needs to be completed by? Is downtime not really an option, and/or is building a parallel data store for the migration not feasible? How useful is normalizing the data expected to be in the short, medium and long term?
Like I said, I assume they had a reason for the full relational migration. I was just pointing out that there are easier solutions that in some cases fit business needs better than that. If your company goes under because your large migration project stole focus and manpower from other needed projects when there were simpler solutions available, then you made the wrong decision, period.
But the relational model is essentially so good an approach that if you are building something that's close to a relational model but not fully relational, you're almost certainly creating more present and future potential problems than if you moved to fully relational model (edit: and that's not saying other approaches are bad if you have a problem a ways away from the relational domain).
Neither of us know the particulars of gp's problem and I so couldn't say if you're alternatives are better tradeoffs. However, I would wager if someone wound-up adding several fields and tables to an existing blob-filled database, the result would haunt it's creator on many lonely nights at the office.
Often folks doing this try to re-invent a content management system like FileNet in the DB.