How To Make An Infinitely Scalable RDBMS

How To Make An Infinitely Scalable RDBMS (highscalability.com)

154 points by jpmc 12 years ago | 89 comments

glogla 12 years ago |

There was a very interesting presentation by a professor. I'm not sure which university he was from, but he seemed to know his field.

He talked about how the database world is about to change. ACID is really expensive in terms of resources, and so are the more difficult parts of the relational schema (foreign keys, checks, etc.). And the architecture of classic RDBMSes is pretty wasteful -- they use an on-disk format but cache it in memory.

He talked about how there are basically three new paths for DBMSes to follow. 1) Some drop the restrictions to become faster. This is the NoSql stuff, because you don't really need ACID for writing to a Facebook wall.

This is called NoSql database.

2) OLAP. In data warehousing, the usual approach is to load a ridiculous amount of data into the database and then run analytical queries that lean heavily on aggregation and sometimes use just a few dimensions, while the DWH data tends to be pretty wide.

For this, a column store makes perfect sense. It is not very quick on writes, but it can do very fast aggregation and selection of just a few columns.

This is called Column store.
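
To make the aggregation point concrete, here is a minimal Python sketch of the same table in row layout and in column layout. The table, its column names, and its values are made up for illustration; real column stores add compression and vectorized execution on top of this basic idea.

```python
# Hypothetical wide fact table; only "amount" is needed for the aggregate.
rows = [
    {"order_id": 1, "region": "EU", "amount": 10.0, "note": "a"},
    {"order_id": 2, "region": "US", "amount": 20.0, "note": "b"},
    {"order_id": 3, "region": "EU", "amount": 5.5, "note": "c"},
]


def total_row_store(rows):
    # Row layout: every whole row is visited even though one field is used.
    return sum(row["amount"] for row in rows)


def to_columns(rows):
    # Column layout: one list per column instead of one dict per row.
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_column_store(columns):
    # Only the "amount" column is touched; the other columns are never read.
    return sum(columns["amount"])


columns = to_columns(rows)
```

The write penalty shows up in `to_columns`: adding one row means appending to every column list, rather than writing one record in one place.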

3) In OLTP, you need throughput, but the question is: how big is your data, and how fast does it grow? RAM capacity tends to grow exponentially, while the number of customers you have will probably grow linearly, or maybe a bit faster, but not by much. So your data could fit into memory, either now or in the future.

This allows you to make a very fast database. All you need to do is switch the architecture to memory-based: store the data in its in-memory format both in memory and on disk. You don't read the disk; you just use it to store the data on shutdown.

This is called Main memory database.
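
As a rough Python sketch of that architecture (not something from the talk): all reads and writes hit a dict in RAM, and the disk is touched only at startup and at clean shutdown. The class name `MemoryStore` and the JSON file format are assumptions chosen for illustration.

```python
import json
import os


class MemoryStore:
    """Toy key-value store: all reads and writes hit a dict in RAM."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Disk is read once, at startup, to rebuild the in-memory state.
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def put(self, key, value):
        self.data[key] = value  # no disk I/O on the write path

    def get(self, key):
        return self.data.get(key)  # no disk I/O on the read path

    def shutdown(self):
        # Disk is written once, on clean shutdown.
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```

A crash before `shutdown()` loses everything written since startup, which this sketch does not attempt to solve.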

So, that was the presentation. It was awesome, and if someone can find it, please give us a link! My search-fu was not strong enough.

...

What interests me is that we have had NoSql databases for some time already, and we have at least one huge (and very expensive) column store: Teradata. But this seems to be the first actual Main memory database.

My dream would be to switch Postgres to main memory or column store mode, but I guess that's not happening very soon :)

mbesto 12 years ago | |

> But this seems to be first actual Main memory database.

Eh, not really...

This is exactly what SAP has been doing for several years via Hasso Plattner and the Potsdam Institute: https://epic.hpi.uni-potsdam.de/Home/HassoPlattner

If you've ever worked with large scale "enterprise" database warehouses, they tend to be slow and clunky. Back in 2006ish SAP took the whole Data Warehouse (well mainly just the data cubes) and chucked it into a columnar database (at the time it was called TREX, then became BW Accelerator) - http://en.wikipedia.org/wiki/TREX_search_engine

TREX existed well before 2006. SAP also bought a Korean company called P* (IIRC) which did non-columnar (traditional relational) storage and threw it into memory. SAP also had a product called APO LiveCache - http://scn.sap.com/community/scm/apo/livecache - which lived around the same time.

This has now all evolved into a standard offering called SAP HANA - http://www.saphana.com/welcome - In its second year I believe SAP did roughly $360m in sales on HANA alone.

Also, IIRC, isn't InnoDB basically the open source version of exactly what you're talking about with "Postgres to main memory"?

edit- correction in TimesTen

army 12 years ago | | |

InnoDB isn't anything like that - it's a transactional database engine that's been around since the 90's and has since become the standard storage engine for MySQL - it competes directly with Postgres' storage layer.

JeffDClark 12 years ago | |

Is this the talk that you are referring to? http://slideshot.epfl.ch/play/suri_stonebraker

nl 12 years ago | | |

Note that Stonebraker makes some good points, but there are many ways to build scalability and Stonebraker is too quick to dismiss many of them.

In particular, his criticism of traditional databases seems based more on philosophy rather than evidence.

I'd advise reading both sides of the story:

http://lemire.me/blog/archives/2009/09/16/relational-databas...

http://lemire.me/blog/archives/2009/07/03/column-stores-and-...

http://architects.dzone.com/articles/stonebraker-talk-trigge...

http://gigaom.com/2011/07/11/amazons-werner-vogels-on-the-st...

http://dom.as/2011/07/08/stonebraker-trapped/

The date on some of those posts is interesting. 2009 is quite a while ago now, and I'd suggest that columnar datastores haven't exactly taken over. Some implementations have made some progress (eg Cassandra), but OTOH many non-traditional datastores have added traditional-database-like features (eg, Facebook's SQL front end on their NoSQL system), and traditional databases have added NoSQL features too.

ezequiel-garzon 12 years ago | | |

If I may drool a little, you guys represent the heart of Hacker News. Insightful summary, mentioning that somewhere somebody gave such a talk. As I was reading the first comment I was silently cheering for "a librarian's follow-up", and there it was!

glogla 12 years ago | | |

Yes, that's the one. Thank you! I'm bookmarking it right now.

twoodfin 12 years ago | |

That sounds very much like Michael Stonebraker's typical pitch these days.

d4mi3n 12 years ago | |

You mentioned OLTP. Erlang's Mnesia store comes to mind, but as far as I'm aware it's limited to a 4GB data set. I'm not sure if that qualifies as a main-memory db, but it might be similar.

pge 12 years ago | |

To add to your list, Vertica (HP) and Paraccel are columnar; TimesTen was a main-memory database bought a number of years ago by Oracle.

leandrod 12 years ago | |

> My dream would be to switch Postgres to main memory or column store mode, but I guess that's not happening very soon :)

If it can be done alongside the traditional architecture, whether in a fork or without touching existing code, and if you can at least start the work, it could happen soon.

ams6110 12 years ago | |

A few years ago I was tinkering with Haskell and looked at a framework called HAppS which kept all its state in memory. It doesn't look like the project has really been active lately.

donri 12 years ago | | |

HAppS is dead, long live Happstack!

What you're talking about was the HAppS-State component of the HAppS application server, a project which is indeed not active anymore. Happstack is the active fork of HAppS and had a "happstack-state" component for a while, but this was eventually rewritten from scratch, made independent of Happstack, and is now known as acid-state [1]. It's even used for the new Hackage server that powers the Haskell package ecosystem.

[1] https://github.com/acid-state/acid-state

hatchoo 12 years ago | |

And just to add to the list of c-store databases, there's also Sybase IQ. (I believe Sybase is now owned by SAP so it may have been rebranded)

flatfilefan 12 years ago | |

Teradata is not a column store afaik. Vertica would be a good example of such.

alanctgardner2 12 years ago |

I'm a little skeptical:

- a bunch of the novel components (the UPS aware persistence layer, for example) aren't actually built yet

- they're pushing for people to build businesses on it already. I would characterize it as "bleeding-edge with bits of glass glued on", so this doesn't seem entirely honest.

- there's mostly a lot of breathless talk about how great and fast and scalable it is, but no mention of CAP theorem. To boil down their feature set, it's an in-memory RDBMS using the Actor model.
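
For readers unfamiliar with the term, here is a minimal Python sketch of what "in-memory store using the Actor model" can mean in general. It is an illustration of the pattern only, not InfiniSQL's actual design; `PartitionActor`, `ask`, and `actor_for` are hypothetical names. Each actor owns one slice of the data and is the only thread that touches it, so callers send messages instead of taking locks.

```python
import queue
import threading


class PartitionActor(threading.Thread):
    """Owns one slice of the data; all access goes through its mailbox."""

    def __init__(self):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()
        self.rows = {}  # touched only by this thread, so no locks are needed

    def run(self):
        while True:
            op, key, value, reply = self.mailbox.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "put":
                self.rows[key] = value
                reply.put(True)
            elif op == "get":
                reply.put(self.rows.get(key))

    def ask(self, op, key=None, value=None):
        # Send a message and block until the actor answers.
        reply = queue.Queue(maxsize=1)
        self.mailbox.put((op, key, value, reply))
        return reply.get()


def actor_for(key, actors):
    # Route each key to the single actor that owns it.
    return actors[hash(key) % len(actors)]
```

Note that this says nothing about what happens when a transaction spans several actors, or when a node holding an actor fails, which is where CAP-style questions come in.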

yid 12 years ago |

> UPS systems will stay active for a few minutes, based on their capacity, and the manager process will gracefully shut down each daemon and write data to disk storage. This will ensure durability--even against power failure or system crash--while still maintaining in memory performance.

How does a UPS ensure durability against system or program crashes, disk corruption in large clusters, and other failures that can affect a simple write()?

> The real killer for database performance is synchronous transaction log writes. Even with the fastest underlying storage, this activity is the limiting factor for database write performance. InfiniSQL avoids this limiting factor while still retaining durability

How do you plan to implement this (since it appears it hasn't been implemented)? What is your fundamental insight about synchronous transaction logs that makes InfiniSQL capable of being durable while (presumably) not having a synchronously written transaction log? If your answer is the UPS, please see my first question.

Edit: I don't see any mention of Paxos anywhere. Could you explain what you're using for consensus?

gopalv 12 years ago |

GNU AGPL?

Clause 13 is a real pain to deal with when exposing this over the network.

I guess the developer wants to sell a license (like the mysql java client GPL'ing).

Can't blame him, he needs to get paid.

mtravis 12 years ago | |

Hi, gopalv. What samspenc said. My understanding of the AGPL is that only the modifications made to the source code itself of the covered project would need to be opened up (or have a commercial license). Meaning that merely using InfiniSQL won't require you to open source your app. MongoDB has the same license BTW, and lots of people use it without being forced to open their code. And, yes, I want to get paid somehow--but the AGPL won't stop anybody from using my work how they see fit. But if they modify it and distribute it, then they'll have to comply with the license (or contact me directly for an alternate arrangement).

gopalv 12 years ago | | |

I fully understand what this means and I hope you do get calls about alternate licensing, but remember that people like me do not make these decisions.

I thankfully don't have to - this means I don't need to talk to lawyers about this.

Because AGPL took away the most important bit of unassailable ground I had to argue from when it came to deploying GPL - "Using this code implies no criteria we have to comply with, only if we distribute it".

Clause 12 and 13 - basically took that away from me completely.

Look, I'm not going to tell you what license to use.

But leave me enough room to complain that I have had trouble convincing people that we can use AGPL code in a critical function without obtaining a previous commercial license by paying the developer.

samspenc 12 years ago | |

MongoDB does this too. I personally think it's not too bad - it's free for whoever wants to use it, but if you want to modify and use it commercially, you do have to pay.

MichaelGG 12 years ago |

In-memory distributed database? VoltDB is already way past 500Ktx/sec on a 12-node cluster.

On their site though, it says no sharding and that it can do these 500Ktx/sec even when each transaction involves data on multiple nodes. Does this performance degrade directly in relation to the number of nodes a tx needs to touch?

A simple, straightforward, wire-level description of how things work when coordinating and performing transactions across nodes would be very useful. There's a lot of excited talk about actors, but nothing that really examines why this is faster, or any sort of technical analysis.

eksith 12 years ago |

Looking at the "About and Goals" section of their docs http://www.infinisql.org/docs/overview/#idp37033600

I can't seem to find the word "Reliable" or any variation thereof anywhere in there.

In fact, that word is nowhere to be found in the blog post or on the entire InfiniSQL site (not in the Overview, Guides, Reference or even FAQ). I find this quite remarkable, since reliability is the true virtue of an RDBMS, not speed or even capacity. At least that's what PostgreSQL aims for, and since this is another open source RDBMS, I see PostgreSQL as InfiniSQL's only direct competitor.

It's nice that this is scalable, apparently, to ridiculous levels, but if I can't retrieve what I store in exactly the same shape as I stored it, then that's a bit of a buzz kill for me.

Can we have some assurance that this is the case?

There's a note on "Durability" and a shot at log file writing for transactions, and presumably InfiniSQL uses concurrency and replicas to provide it. In the Data Storage section, it mentions that InfiniSQL is still an in-memory database for the most part http://www.infinisql.org/docs/overview/#idp37053600

What they're describing is a massively redundant, UPS backed, in-memory cache.

Am I wrong?

mtravis 12 years ago | |

Hi, eksith. I talk a bit about plans for durability in that overview document.

I promise that I have every intention of making InfiniSQL a platform that does not lose data. I have a long career working in environments that demand 100% data integrity. If I de-emphasized it, it was not intentional.

PostgreSQL doesn't scale for OLTP workloads past a single node. There are a handful of products similar to InfiniSQL (google for the term NewSQL for a survey of them).

And yes, a redundant UPS-backed in-memory cache. I have some ideas on how to do regular disk backing as well (which I'm sure you've read).

And if a more traditional log-based storage layer is added, InfiniSQL will still scale nearly linearly across nodes horizontally. Multi-node scale and in-memory are not dependent on one another. Though I believe that redundant UPS systems, managed by a quorum of administrative agents, can provide durability just like writing to disk.

Are you familiar with high end storage arrays, such as from HDS or EMC? They write to redundant memory, battery backed and managed by logic in the arrays. I'm just moving that type of design to protect the database application itself, up from the block layer.

And some people trust their datacenter power--they use pure in-memory databases without UPS already, or they do things like asynchronously write transaction log, which also sacrifices durability. For those groups, InfiniSQL ought to be just fine, without UPS systems.

leif 12 years ago |

The write bottleneck for traditional databases has never been the write-ahead log; with group commit and a battery-backed RAID controller you'll have a hard time saturating the disk with log writes. The bottleneck has always been the random I/O induced by updating B-tree-based indexes in place. You don't need to be in-memory if you use better data structures. TokuDB and TokuMX are proof of that.
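
For anyone who hasn't seen group commit, here is a minimal single-threaded Python sketch of the general idea: many transactions queue their log records, and one write plus one fsync makes the whole batch durable. The class `GroupCommitLog` and its method names are hypothetical, and this is not how any particular engine implements it; a real database would have concurrent transactions block until the shared fsync returns.

```python
import os


class GroupCommitLog:
    """Toy write-ahead log that makes a whole batch durable with one fsync."""

    def __init__(self, path):
        self.file = open(path, "ab")
        self.pending = []      # records waiting for the next flush
        self.fsync_calls = 0   # counted so the saving is visible

    def append(self, record: bytes):
        # A transaction queues its log record; nothing is durable yet.
        self.pending.append(record)

    def commit_group(self):
        # One write and one fsync cover every queued transaction.
        if not self.pending:
            return 0
        self.file.write(b"".join(r + b"\n" for r in self.pending))
        self.file.flush()
        os.fsync(self.file.fileno())
        self.fsync_calls += 1
        committed = len(self.pending)
        self.pending.clear()
        return committed

    def close(self):
        self.file.close()
```

With five queued transactions, `commit_group()` pays for one fsync instead of five, which is why sequential log writes are hard to turn into the bottleneck.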

mtravis 12 years ago | |

Hi, Leif. It's not hard to get to the throughput limits of a single log device, even on a fast array. I've done it on Sybase, WebSphere MQ, Oracle, MySQL, basically on enough platforms that I assume it to be the general case. The log writes don't saturate the array itself--but the log file has a limit to how many blocks can be appended--even on fast arrays. But imagine getting rid of the transaction log entirely--the entire code path. That will be faster even than a transaction log write to memory-backed filesystem.

But I agree that other write (and read) activity going on in the background and foreground, also limits performance--and in fact, I've seen the index write bottleneck that you describe in real life, more-so than simple transaction log writes. So, you're correct.

I've read about Toku, but I really doubt that it writes faster to disk than writing to memory. Are you really trying to say that?

I think it would be great for InfiniSQL to be adapted to disk-backed storage, in addition to memory. The horizontal scalability will also apply, making for a very large group of fast disk-backed nodes.

I think your input is good.

brianberns 12 years ago |

An in-memory RDBMS hardly seems to be "infinitely scalable". How would this work with DBs in the terabyte size or larger?

MichaelGG 12 years ago | |

A terabyte of RAM is pretty cheap. Around $12K for the RAM. Last I quoted out a system for VoltDB, the total cost (complete servers with CPU, disk, RAM) came to ~$17/GB to $22/GB.
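
Working those figures through for one terabyte, as a quick Python sketch. The only assumption beyond the numbers quoted above is that a terabyte is counted as 1024 GB.

```python
GB_PER_TB = 1024  # assumption: binary terabyte


def system_cost(gb, low_per_gb=17, high_per_gb=22):
    """Range of total system cost in dollars for `gb` of capacity."""
    return gb * low_per_gb, gb * high_per_gb


low, high = system_cost(GB_PER_TB)  # complete servers for 1 TB
ram_only = 12_000                   # quoted RAM-only price for 1 TB

# Share of the total system cost that is RAM, at each end of the range.
ram_share_high = ram_only / low
ram_share_low = ram_only / high
```

That puts a complete 1 TB system at $17,408 to $22,528, with RAM making up somewhere between roughly half and two thirds of the bill.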

If you actually have transaction processing at this scale and need that performance, the RAM cost is not a major issue.

mtravis 12 years ago | |

Well, 2-way Cisco servers can hold 1TB RAM each.

It scales as long as throughput increases while new nodes are added. I've done benchmarking up to 12 nodes, and it continued to scale nearly linearly. (http://www.infinisql.org/blog). I'd like to push it further, but need $$$ for bigger benchmark environments.

glogla 12 years ago | |

Badly. But scaling in dataset size and scaling in performance are not the same thing. A busy e-shop might need no more than 5 GB of space (growing by 100 MB per month or so) but require very high speed.

amalag 12 years ago |

This is what Clustrix (YC company) claims to do.

mtravis 12 years ago | |

Hi, amalag. Yes, Clustrix is very similar to InfiniSQL (not to mention having been around longer). I believe that InfiniSQL has vastly higher performance, at least for the type of workloads that InfiniSQL is currently capable of. InfiniSQL is also open source.

I hope there's room for competition in this space still.

sergei 12 years ago | | |

What do you base your performance claim vs Clustrix on?

jeremycole 12 years ago |

What's up with the weird coding standards? Include files named infinisql_*.h and #line statements... strange.

mtravis 12 years ago | |

Oh, the infinisql_*.h is because I deploy all header files as part of "make install", when what I really should do is boil it down to just the api header. The api is for stored procedure programming. Yes, I have it on backlog to fix. I give them all that name in case somebody installs to /usr/local (which you probably oughtn't) it's clear what application they all belong to. Yes, I could create a subdirectory, too. But the fix will be when I clean up api.cc to only have to pull in the one header instead of several of them.

#line statements because I get compiler messages from time to time putting things on the wrong line after having imported headers.

flatfilefan 12 years ago |

What is the difference from Teradata or Netezza, other than this being open source and not yet carrying the burden of universality?

mtravis 12 years ago | |

Those are analytics databases, also known as data warehouses. Optimized for batch reporting. InfiniSQL is geared for operational/transactional (OLTP) kinds of workloads.