Document Database Transaction Models(fauna.com) |
Document Database Transaction Models(fauna.com) |
Citation needed.
Almost no software application data models are tabular.
OK, in that case, I'm citing Java spec, C++ spec, C# spec, Javascript spec, Smalltalk spec, Common Lisp spec...
I find ... debatable that building yet again one more FB or IG challenger is significantly more important than maintaining or creating new Banking, ERP, Inventory Management solutions and for that stuff "tabular data" doesn't seem to be a bad match, or at the very least, using json or xml documents will not provide much of an improvement (all IMHO, of course...).
Having used mongoDB in the past, for me it was a better day 1 experience. I didn’t need SQL, I just stored my data. The problem is that on day 2 there’s a different access pattern for your data but now it’s stored in a way that’s highly optimized only for that first application.
I find it interesting that the proposed explanation for lack of transactions in document databases is the CAP theorem which covers distributed systems, NOT document databases. Clearly such thing as a document database with transactions is not impossible if, for example, GemStone/S can support transactions just fine.
These are exactly the kind of "transactions" you get when you try to implement everything at the application level rather than the database level. Couchbase transactions (in the article) are the same. And it's not that different from Vitess cross-shard transactions either, which are not isolated (https://vitess.io/docs/reference/features/two-phase-commit/). Tandem SQL used the same scheme as well I believe.
Prior to Spanner, there were no production databases that offered ACID transactions across distributed, disjoint shards.
> Open a bunch of locks, write a bunch of updates, release the locks.
That's how transactional databases using two-phase locking generally work, isn't it?
Subtle burn on the mongo and its client-coordinated session causality model.
Also, when you say:
> On the other hand, server-side transactions use a more typical pessimistic relational lock. They open a transaction, do some work, and then commit it.
What do you mean? What are server-side transactions?
I meant transactions issued from a server ("cloud") client to the database, as opposed to a mobile client.
If they started with PostgreSQL, then they're already adapted to optimistic concurrency control, right?
Anybody who has used Couchbase transactions, or sharded Mongo transactions, and can corroborate our analysis?
As these things tend to go, when there is a place to store arbitrary data, all kinds of things get shoved into it, so the mixed model in Firestore is a compromise between the original tree-of-nodes data model and a more conventional document data model.
My assumption is the Firestore-to-Spanner mapping creates subcollections as shared tables with foreign keys to the parent documents, but I don't actually know. However, that would match the mandatory 1-to-many-to-1-to-many data layout, and makes more sense than shoving all the dependent data into the document itself or creating multiple millions of SQL tables for millions of documents.
The point is, no distributed database with a naive two-phase lock is truly transactional.
I have no idea how you can draw similarities between the two. One of them is basically a Smalltalk VM with automatic object persistence. The other is some kind of real-time Java-based distributed key/value data platform.
The GemStone transaction docs describe a scheme that would work properly on a single machine, but don't discuss anything about distributed coordination across servers or failure modes. The installation instructions don't discuss setting up a cluster. The marketing docs discuss using thousands of VMs (clients) and scaling the dataset to "hundreds of gigabytes" based on disk storage instead of memory which is not what I would expect from a distributed system. Various benchmarks and user comments refer to using a single server for GemStone.
I will update the post to clarify that we are discussing distributed document databases only. It's easy to do anything you want on a single machine.
Can you tell me how many times that users email address changed in that document over the last 3 years by executing a simple query? What was the email after the 2nd time it changed?
In 6th normal form, such a thing is trivial to manage. Discipline is the only price to pay for admission.
Point is, it's not a native representation for them. If you just wanted to serialize a portion of a program's heap, independently of the program (which is presumably the "very convenient" part of it), it would be a graph of records or objects, not a bunch of tables. For example, no word processor to my knowledge represents the document being edited as a bunch of tables outside of the language's native data model. That's why document databases don't use tables for documents.
the advantage of normalized data it that you can easily join it and transform into whatever format you want.
using objects/documents to store you data you can end up with the really annoying problems I've had in the past year at work, sub-sub-object has field/properties X/Y/Z it is assumed all objects would have it except some don't have it and some have it but it was used for something else that it remotely related but not the same.
now you who was not part of making this decision have to go hunt for patterns of correlation between subobjects, find out the illegal states of the global object, and decide wether the patterns you found were the results of an unnoticed bug, a valid edge case or just the result of poor design. keeping in mind that the some patterns are just way more popular than others and you cant just query the db "give me all the ways your objects and sub-objects varies depending on other fields".
another problematic things I saw: period 20xx/xx/xx to 20yy/yy/yy the informal schema was A, then it was changed to B, then after that to C. of course old data was not migrated to the new design each time, giving that the system work mostly in an online fashion that was not much of problem. expect the requirements changed or you found out something it not right and you need to recacultate a result that depdent of the date from the beggining up to now. now querying this data is a nightmare.
these are not problem of the document model itself but the people using it, but from my experience once you give it to people they slowly throw away all common sense on database design.
now if you want to enforce to quality into your stored data you'll find yourself hand writing part of the tools that you threw away when moving from more traditional databases.
At the end you have to ask you self was this worth it ?
I think the idea here is that your database schema is isomorphic to the restrictions on your in-memory data enforced by your application code: you can't store objects that you can't create in the first place, so that's your schema.
Of course systems like AllegroCache blur the difference between in-memory and stored data completely.
> another problematic things I saw: period 20xx/xx/xx to 20yy/yy/yy the informal schema was A, then it was changed to B, then after that to C. of course old data was not migrated to the new design each time, giving that the system work mostly in an online fashion that was not much of problem. expect the requirements changed or you found out something it not right and you need to recacultate a result that depdent of the date from the beggining up to now. now querying this data is a nightmare.
For example in AllegroCache this would be solved by usual object migrations (like in http://clhs.lisp.se/Body/f_upda_1.htm), perhaps even lazily as to not disrupt operation.
Maybe I am just not "modern" enough.
after the initial team that designed such document based system have gone to other jobs, once it got harder.
I found my self maintaining and evolving such system need to the changing needs of the business.
I've had to deal with the cases where customer were charged multiple times, some that were never charged at all, and all kind of similar problems.
I can trace almost all of them back to the original decision of lets just ditch a traditional DB for a NoSQL one, because it scales TM and geo redundancy will be easier for our infrastructure team.
This doesn't sound like it has anything to do with the data model, though. The so-called "NoSQL" databases can be deficient for many other reasons.
> Normal forms not required.
3NF (approximately where document databases live) struggles to decouple the time domain from individual facts. Let me give you an example.
Assume a Customer document has a LastModifiedUtc property. Does this tell you when their email specifically changed? No. It just says "something" in the customer document was modified.
Now, you could say "how about we just add a property per thing I want to track?" Ok - so we now have Customer.EmailLastModifiedUtc, Customer.NameLastModifiedUtc, etc. This is pretty good, but now assume we also need to know what the previous email addresses and names were. How do we go about this? Ok - no big deal, lets just add another column that is some JSON array or whatever. Customer.PreviousEmailAddresses. Cool, so now we know when the email was last modified AND we know what every last variant of it was.
What is missing? Oh right. What about when each of the previous email addresses actually changed? Who changed it? From what IP address? Certainly, we could nest a collection of documents within our document to keep track of all of this, but I hope the point is starting to come across that there may be some value in exploring higher forms of normalization. Imagine if I wanted to determine all of the email addresses that were modified by a specific IP address (across ALL customers), but only over the last 25 days. I feel like this is entirely out of the scope of a document database.
Don't get me wrong. 3NF is extremely powerful and handles many problem domains with total ease. But, once you start talking about historization of specific fields and rates of change, you may need to consider something higher order.
Map the index additions and their timestamps onto the documents themselves and you can retrieve the entire state of each record that the ip wrote at the time that it wrote it. If you want to know specifically what that ip changed, then diff it with the previous record, for example, to filter down to updates that only changed the email address.
Define "performant". Denormalizing your domain model because you feel like the database might get slow is a strong case of premature optimization, unless you have actually tried to model it this way and have measured.
You will find that most modern SQL database systems have no problem querying databases with thousands or tens of thousands of tables. In fact, having narrow tables can dramatically improve your utilization of memory bandwidth since you aren't scanning over a bunch of bytes you will never use in the result set.