When should you store serialized objects in the database? (2010) (percona.com)

54 points by harshasrinivas 10 years ago | 68 comments

exabrial 10 years ago |

Bad ideas from 5, 10, er 20, er 30 years ago are still bad ideas.

I know the HN police will cite me for no citation, so I'd say it comes with experience. The law changed at one point, and we were legally bound to be able to locate a customer's record by a piece of data in a blob. The only way to fix the problem was to dump the massive (1 TB+) table and reinsert them into a real schema. The engineering effort took 9 months to get right, because other people had changed the way blobs were written out over the course of years.

Being clever doesn't pay, again.

LyndsySimon 10 years ago | |

> Bad ideas from 5, 10, er 20, er 30 years ago are still bad ideas.

I agree 100%.

I read that title and expected the post to begin with "Never. You should never serialize objects into a single field." I was disappointed.

If you need schemaless storage, use a schemaless DB. I don't understand what's so difficult about that. I wouldn't try to shove unstructured data into PostgreSQL any more than I'd try to shove relational data into MongoDB.

spacemanmatt 10 years ago | | |

HAHA, I tend to agree that 'never' is the only good answer.

phaedrus 10 years ago | | |

It seems the article was focused purely on the performance implications of the decision, and nothing about the maintenance and architectural impact.

phaedrus 10 years ago | |

When my previous employer forced my team to implement storage of structured data as a serialized BLOB (on top of a system which used to store the data the "right" way), I turned in my resignation.

Background: we had been storing other, similar data in a structured way for years, so we had a system set up to do it right. I'm not sure what the rationale was for switching, but it was declared by fiat over the protests of a team of five experienced .NET developers and an experienced lead DBA. We began experiencing problems from it before we were even a month into the project, such as serialization output not agreeing between client apps (.NET serialization is NOT designed to be a shared archive format!!), implementation requiring breaking the separation of concerns between layers of our application, etc. And for what? When asked what we would do when the format changes, management cheerfully replied "oh we'll just write + run a conversion EXE to update the data in bulk. Why, we do that all the time in [other engineering team who cowboy-codes everything and operates with a level of technical debt that makes it suck to work on that codebase]."

Of course this wasn't the only reason I was resigning, but it made the decision easier!

kbenson 10 years ago | |

There are so many ways to accomplish this other than what you did. I assume there was a reason for your choice, but it would have been so much easier if you could just create a separate table that matched the same primary key as the customer record, and contained a single other field, the data required to be searchable. Easy to join and search, easy to insert and update.

> reinsert them into a real schema

That sounds sort of like you decided to fix a bunch of problems at the same time...
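
A minimal sketch of the side-table idea described above, using Python's built-in `sqlite3` so it runs anywhere. The table and column names (`customer`, `customer_lookup`, `tax_id`) are made up for illustration; the original system's schema is not known.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The existing table: an id plus an opaque serialized payload.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, payload BLOB NOT NULL);
    -- Side table: same primary key as customer, plus the one searchable field.
    CREATE TABLE customer_lookup (
        customer_id INTEGER PRIMARY KEY REFERENCES customer(id),
        tax_id TEXT NOT NULL
    );
    CREATE INDEX idx_customer_lookup_tax_id ON customer_lookup(tax_id);
""")

def insert_customer(customer_id, payload, tax_id):
    # Write the blob and its searchable field in one transaction.
    with conn:
        conn.execute("INSERT INTO customer VALUES (?, ?)", (customer_id, payload))
        conn.execute("INSERT INTO customer_lookup VALUES (?, ?)", (customer_id, tax_id))

def find_by_tax_id(tax_id):
    # Search the indexed side table, then join back to the blob.
    return conn.execute(
        "SELECT c.id, c.payload FROM customer c "
        "JOIN customer_lookup l ON l.customer_id = c.id WHERE l.tax_id = ?",
        (tax_id,),
    ).fetchall()

insert_customer(1, b"opaque-bytes-1", "AB-123")
insert_customer(2, b"opaque-bytes-2", "CD-456")
print(find_by_tax_id("CD-456"))
```

Backfilling the side table for existing rows would still mean parsing every historical blob format once, which is likely where most of the effort goes.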

joe_the_user 10 years ago | | |

> There are so many ways to accomplish this other than what you did. [...] That sounds sort of like you decided to fix a bunch of problems at the same time

A real database with a real schema involves a series of guaranteed logical relations. Each violation of these relations tends to result in a different kind of problem: if you have a relation that pairs cars with drivers, you could end up with cars without drivers and drivers without cars. The potential logical problems multiply as the effective schema grows, whether you have an explicit schema or not.

So basically the move to a real schema fixes a wide variety of real and potential problems compared to ad-hoc solutions. There are many ad-hoc solutions, but since these don't guarantee logical relations, such solutions tend to have holes that appear later.

So the gp may have been forced to use a real schema based on the multiplication of problems or they may have just done it because it was the right thing.

merb 10 years ago | |

Actually, we save history data as JSON inside PostgreSQL, and article data as JSON too, and we have a price table that records its history via a PostgreSQL trigger. That's not really blob data, but it's a kind of serialization. However, we access the data regularly. The price history is exposed to the user, so it needs to work. Our system needs to work even against older versions of the table.

tigershark 10 years ago | |

If you use a hammer to kill a mosquito, obviously it's a bad idea. The hammer is useful for putting a nail in the wall. In my current job I introduced a configuration store based on serialization. Obviously, instead of storing everything in a blob, I created a table with some generic string id columns, a generic key that is an object serialized in JSON, and a value that is the configuration object serialized in JSON. In this way I can have the best of both worlds: a generic configuration store that is also indexed on the string ids and searchable on the generic key.
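
One way such a configuration store could look, sketched with Python's `sqlite3` and `json`. Everything here (the `config` table, the `app_id`/`env_id` columns, the helper names) is hypothetical; the commenter's actual design is not shown. The one detail worth noting is that a JSON-serialized key only works as a lookup key if equal objects always serialize to the same string, hence the canonical form.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE config (
        app_id TEXT NOT NULL,      -- generic string id column (indexed via the PK)
        env_id TEXT NOT NULL,      -- generic string id column (indexed via the PK)
        key_json TEXT NOT NULL,    -- generic key, serialized as canonical JSON
        value_json TEXT NOT NULL,  -- configuration object, serialized as JSON
        PRIMARY KEY (app_id, env_id, key_json)
    )
""")

def canonical(obj):
    # Sorted keys and fixed separators, so equal objects give equal strings.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def put_config(app_id, env_id, key, value):
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO config VALUES (?, ?, ?, ?)",
            (app_id, env_id, canonical(key), json.dumps(value)),
        )

def get_config(app_id, env_id, key):
    row = conn.execute(
        "SELECT value_json FROM config "
        "WHERE app_id = ? AND env_id = ? AND key_json = ?",
        (app_id, env_id, canonical(key)),
    ).fetchone()
    return json.loads(row[0]) if row else None

put_config("billing", "prod", {"region": "eu", "tier": "gold"}, {"timeout_s": 30})
# Key order in the lookup does not matter because of canonical().
print(get_config("billing", "prod", {"tier": "gold", "region": "eu"}))
```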

mattmanser 10 years ago | | |

The article is not talking about config files, which is one of the few valid reasons to do this, as with a config file you're almost always going to just want the whole thing once at initialization.

And even then, only if there's loads of config values. If you've only got 5 or 10, that solution is bad.

The article is implicitly talking about business objects.

jb613 10 years ago | | |

> and a value that is the configuration object serialized in JSON

It all depends on the data, but if it's not simple, then serialized JSON values would generally incur a performance hit for search operations. Data broken out into separate columns could be better indexed.

orf 10 years ago |

I think it's rarely a good idea to store blobs of any kind in the database. I've seen systems that store pretty large files as blobs (even base64 encoded ones once), then do 'select *' on the table and wonder why their query performance is so terrible. Use a filesystem, that's what it's for.

For stuff like this then I would say it's always preferable to store a json encoded representation rather than a format like pickle (python's object serialization format). If you don't and some clever chap works out a way to write input to that field then you've got an easy RCE. Plus it's easier to debug JSON, and databases like PG have a native data type for it.
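
To illustrate the RCE point, here is a small Python sketch. A pickle can name any importable callable and have it invoked at load time; the harmless `eval("1 + 1")` below stands in for whatever an attacker would actually run. The class name `Payload` and the sample settings are made up for the example.

```python
import json
import pickle

class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('1 + 1')".
        # A real attack would name something far less friendly.
        return (eval, ("1 + 1",))

stored_pickle = pickle.dumps(Payload())

# Loading the bytes runs the callable; we get its return value, not a Payload.
loaded = pickle.loads(stored_pickle)
print(loaded)  # 2

# JSON can only ever produce dicts, lists, strings, numbers, booleans and None.
stored_json = json.dumps({"theme": "dark", "retries": 3})
restored = json.loads(stored_json)
print(restored)
```

Anyone who can write to a pickled column can therefore make the reading application execute code, while the worst a malformed JSON value does is raise a parse error or carry unexpected data.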

rwultsch 10 years ago |

I was on the DBA team at FB and I spent the better part of a year working on the deployment system for online schema change. It was a pain. Other companies have done quite a bit of work on this as well (Shift from Square, etc...).

Later on I joined Pinterest as their first MySQL DBA. They had copied the sharding system from FB, but instead of having a bunch of columns, they just stored a JSON blob. This saved them from learning how to perform schema change until I joined the company. This is a pretty incredible feature.

We have a new feature under development (which will be open sourced as part of Percona MySQL) which will allow column level compression with an optional predefined dictionary. During testing, this resulted in a 30% additional reduction in space consumed versus InnoDB page compression AND doubled our peak QPS at lower latency. This would not work well with many individual columns, but kicks ass for JSON blobs.

http://www.slideshare.net/denshikarasu/less-is-more-novel-ap... (slides 37, 40, 41, 42)

TheSoftwareGuy 10 years ago |

For anybody using sqlite, they have good documentation about this very question: https://www.sqlite.org/intern-v-extern-blob.html

bvinc 10 years ago | |

Is this really the same thing? This page compares storing blobs in sqlite vs in a separate file.

I think a better sqlite page about the concept of serializing things in your database is the one on sqlite's json support.

https://www.sqlite.org/json1.html

distances 10 years ago | | |

That's with a loadable JSON1 extension, which one won't find e.g. in Android.

Though checking now for this, someone has packaged a later version of SQLite with this extension [1]. I wonder if there is any possible performance advantage when using a system provided SQLite vs. one installed with the application?

[1] https://github.com/requery/sqlite-android

garethrees 10 years ago |

It makes sense to store serialized data structures in the database when these conditions apply:

1. There are no use cases that would require you to SELECT on the fields in the serialized data structures.

2. You anticipate that the data structures are going to change frequently during development, so that turning them into relations is going to involve a lot of schema migrations.

Basically you give up the possibility of being able to SELECT on some of the data in return for being able to change its format rapidly and cheaply.

I worked on a project recently where this was helpful -- when I designed the database schema I didn't know the details of many of the data structures that were going to have to be stored there. From the use cases I could deduce the set of fields that would need to be SELECTed on, but the other fields were ill-defined. By storing them as blobs (actually as JSONB fields, since this was PostgreSQL) I could safely defer the decision about how to design these parts of the database, without incurring lots of schema migrations along the way.
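
A rough sketch of that trade-off, using Python's `sqlite3` with a JSON text column as a stand-in for the PostgreSQL JSONB field mentioned above. The `job` table and its fields are invented for illustration: the fields known to be SELECTed on are real, indexed columns, and everything still in flux goes into the serialized `details` column.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job (
        id INTEGER PRIMARY KEY,
        owner TEXT NOT NULL,    -- SELECTed on, so a real column
        status TEXT NOT NULL,   -- SELECTed on, so a real column
        details TEXT NOT NULL   -- everything else, serialized as JSON
    )
""")
conn.execute("CREATE INDEX idx_job_owner_status ON job(owner, status)")

def save_job(owner, status, details):
    cur = conn.execute(
        "INSERT INTO job (owner, status, details) VALUES (?, ?, ?)",
        (owner, status, json.dumps(details)),
    )
    return cur.lastrowid

def jobs_for(owner, status):
    rows = conn.execute(
        "SELECT id, details FROM job WHERE owner = ? AND status = ? ORDER BY id",
        (owner, status),
    )
    return [(job_id, json.loads(details)) for job_id, details in rows]

save_job("alice", "queued", {"priority": 2})
# A new, nested field appears mid-development: no schema migration needed.
save_job("alice", "queued", {"priority": 5, "retry_policy": {"max": 3}})
save_job("bob", "done", {"priority": 1})
print(jobs_for("alice", "queued"))
```

If a field inside `details` later turns out to need querying, that is the point at which it would be promoted to its own column.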

rhinoceraptor 10 years ago |

You can get away with it in Postgres. The app I work on stores phone numbers in a JSONB array in the following format:

    [{
      "tags": ["cell"],
      "number": "1231231234"
    }]

Here's a snippet demonstrating how you can do a lateral left join on the column to find the number tagged 'cell' in tags array:

    select * from mytable t
    left join lateral (
      select phone->'number' as cell_phone from
      jsonb_array_elements(t.phone_numbers) phone
      where phone->'tags' @> '["cell"]'
    ) p on true;

collyw 10 years ago |

I am working on a system where the data is serialized using Python's pickle and stored in the database. Absolute nightmare for debugging as it's basically unreadable.

unlinker 10 years ago | |

Can't you open a Python shell and unpickle it?

collyw 10 years ago | | |

Gave me an error the last time I tried to do that.

Plus even if it did work, it involves logging into the server, activating my python virtualenv, pulling the data out via the python / Django shell, and unpickling and printing it. As opposed to running a query on my local machine connecting to the database. When you are debugging a problem and just want to get an overview of what is happening, that is a hell of a lot of hassle.

Zikes 10 years ago |

PostgreSQL and jsonb: https://www.postgresql.org/docs/current/static/datatype-json...

andy_ppp 10 years ago | |

Yes, maybe use a database that allows you the best of both worlds, a serialised blob that happens to be queryable and generally really high performance.

Zikes 10 years ago | | |

Queryable and even indexable! Hardly a trade-off at all, really.

blowski 10 years ago |

I was working on a massive CRUD project - hundreds of end-user customisable textarea fields. Despite this sounding like a project perfect for a NoSQL database, the in-house team that would be maintaining it were MySQL experts, and didn't want to support MongoDB or anything like that.

So, yep, we stored everything as serialized objects in the database. We had a separate table for 'change events', and whenever someone changed the contents of one of the textareas we stored it in that table. A worker would eventually update the serialized object, but in the meantime, we would load the serialized object and apply all of the changes that had happened since it was last updated. Basically, an Event Sourcing pattern.

So it was either that, or the EAV route, or the 'end user altering the database' route. The latter two options sounded even worse. Our solution worked out pretty well. Admittedly, it only had a few hundred concurrent users, and even the biggest document was never more than 100K.

So, it can work, but YMMV.
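
The snapshot-plus-change-events approach described above might look roughly like this in Python. The data layout (`version`, `seq`, `field`, `value`) and the function names are assumptions made for the sketch; plain dicts and lists stand in for the two database tables.

```python
import json

# Stand-in for the row holding the serialized object, last updated at event 2.
snapshot = {"version": 2, "body": json.dumps({"title": "Draft", "notes": "old"})}

# Stand-in for the 'change events' table.
change_events = [
    {"seq": 1, "field": "title", "value": "Outline"},
    {"seq": 2, "field": "title", "value": "Draft"},
    {"seq": 3, "field": "notes", "value": "new"},
    {"seq": 4, "field": "summary", "value": "added later"},
]

def load_document(snapshot, events):
    # Deserialize the snapshot, then replay only the events it has not seen.
    doc = json.loads(snapshot["body"])
    for event in sorted(events, key=lambda e: e["seq"]):
        if event["seq"] > snapshot["version"]:
            doc[event["field"]] = event["value"]
    return doc

def compact(snapshot, events):
    # What the background worker would do: fold pending events into a new snapshot.
    doc = load_document(snapshot, events)
    latest = max([snapshot["version"]] + [e["seq"] for e in events])
    return {"version": latest, "body": json.dumps(doc)}

current = load_document(snapshot, change_events)
print(current)
new_snapshot = compact(snapshot, change_events)
print(new_snapshot["version"])
```

Readers always see the latest state, whether or not the worker has caught up, because replaying already-folded events is skipped by the version check.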

Friedduck 10 years ago |

We serialized some XML as a backup to the data we were extracting (and properly modeling), given that the vendor was prone to changing the schema without proper notification, and that there were some data elements we weren't using at that time.

We also built the necessary tools to extract/re-process records easily, and the architecture worked well for us. As our needs or the schema changed we could easily accommodate those changes without undue effort.

It doesn't directly address the question, and I'm not sure that I'd use the same solution if the volume were predicted to be significantly higher, but in our case it worked beautifully. (Happily our volume was predictably within a known range, for reasons I won't go into.)

jbyers 10 years ago |

(2010)

Uber moved to a similar architecture in 2014-2015, ~5 years after the article and the original Friendfeed post. Being able to operate MySQL predictably at scale is extremely valuable to high-growth companies, enough to tilt the balance in favor of unconventional schema choices versus less proven NoSQL alternatives.

https://eng.uber.com/schemaless-part-one/

https://eng.uber.com/schemaless-part-two/

https://eng.uber.com/schemaless-part-three/

forinti 10 years ago |

In my experience, databases outlast the applications built on top of them, so it makes no sense to cut corners on the data modelling.

Except, of course, if the data only exists to support the application (some sort of buffer, cache, or session storage).

jb613 10 years ago | |

> databases outlast the applications built on top of them

sure but this needs to be balanced with performance UNTIL then

Arwill 10 years ago |

This is the consequence of the chosen programming language not being adapted to work with a relational database. The language is made to work with objects, and the access to the DB is clumsy, through library functions, which are deeply encapsulated, and the language and the DB are two different worlds. In SAP's ABAP language relational database access is integrated into the language. In SAP, when you create a database table, the structure of that table will be automatically available to any program as a structure datatype. So if you change a table definition, you will also change the data type used by the programs. Doing table changes is supported by a database tool that will automatically copy records from the old table to the new one if necessary. It's easy to find all references to a DB table, and recompile the sources. It's actually done automatically when both program and DB structure changes are deployed. Whatever change the developer makes in the development system, that change will be automatically adjusted in the productive system on deployment. This makes any table structure change pretty easy; the development environment takes care of that. Wherever SAP applications use blobs to store data (for example HR payroll), those are the worst to develop with. Doing a non-simple change on a TB-sized table would surely cause disruption in an SAP system too, but other techniques are available for those cases.

fauria 10 years ago |

"If the application really is schema-less and has a lot of optional parameters that do not appear in every record, serializing the data in one column can be a better idea than having many extra columns that are NULL."

Why not just use a document oriented database instead? Seems like a good use case for MongoDB for example: https://www.mongodb.com/compare/mongodb-mysql

wesd 10 years ago | |

Also the assumption is that you don't need to report on the data. If you need to report on the data then you might need to create an index on those columns for performance, which you can't do on a blob.

autogn0me 10 years ago |

Agree, storing binaries in a database is generally a bad idea. It would be a really miserable idea if it were being done without a sane persistence API. In the Python world, ZODB - http://www.zodb.org/en/latest/ is tightly coupled with the language but works reasonably well in practice. The storage layer is pluggable and https://pypi.python.org/pypi/RelStorage supports storing pickles in an RDBMS.

ZODB is arguably a novel approach to persistence using Python. And certainly worth taking some time just to play with it -- the barrier to entry is low, e.g. `pip install`. But for each positive there are negatives...

"You got it buddy: the large print giveth, and the small print taketh away"

prashnts 10 years ago | |

From my personal experience, I completely agree.

For an academic project, I was calculating ~2mln RNA-RNA interactions from their sequences. Since this calculation stays a requirement for all further calculations, being the naïve kid I was, I started pickle'ing the results.

To feel like the cool kid, I wanted to involve a database somehow -- so after trying out a bunch of options, I finally settled for ZODB. As the project scaled up, soon the ZODB started being a big pain, because as I recall, it only allows a limited number of connections even in the read operations.

Lesson learned, though, it now resides as a lookup table in a PgSQL instance.

harshasrinivas 10 years ago |

Link about FriendFeed (mentioned in the blog): http://web.archive.org/web/20100314211658/http://bret.appspo...

bvinc 10 years ago |

What about this reason? What if your program is pretty much entirely used from a JSON REST service? What if these JSON objects need to also be sent between machines? What if they need to be exportable to files sometimes?

Now imagine the same program also has an internal database where these JSON objects can be imported and used. Does it make sense that, when actually in use, these objects are relational and split between 10 complicated tables? Why should someone bother writing complex import/export conversion functions, maintaining them in the future, and having worse performance? Wouldn't it be much simpler, more maintainable, and faster to just plop the JSON in the database?

WorldMaker 10 years ago | |

Certainly, so long as your database speaks JSON fluently you can even have your cake and eat (some) of it too. PostgreSQL has good JSON support. Microsoft's SQL Server is "working on it". Then there are loads of JSON friendly document databases out there such as Couch{DB, base}, Cloudant, Mongo, Redis, etc and so forth.

PretzelFisch 10 years ago | |

So, my question after storing xml in a database and using their xml features to build indexes is this. Are you looking for a database or a search engine? What will you gain from a database if you are not using its features, over saving to disk and building a query index?

bvinc 10 years ago | | |

What you would gain is the ease, simplicity, and stability of using your favorite sql engine. You get ACID transactions and the ability to add and remove from large lists using low memory, for free.

longshorej 10 years ago | | |

Replication is a big advantage that's hard to replicate to the same degree with ZFS or rsync.

joesmo 10 years ago |

Proper databases have JSON or other serialized field types. Even mysql 5.7 has some support for this. There is no reason you should hit a limit on one of your tables at a few hundred thousand records because you're an idiot and stored a ton of serialized data in mysql. It happens all the time though.

jcoffland 10 years ago |

The answer is, of course, almost never.