Your data fits in RAM

Your data fits in RAM (yourdatafitsinram.com)

317 points by lukegb 11 years ago | 224 comments

Smerity 11 years ago |

It's probably worth extending "Your data fits in RAM" to "Your data doesn't fit in RAM, but it does fit on an SSD". So many problems will still work with quite reasonable performance when using an SSD instead. By using a single machine with an array of SSDs, you also avoid the complexity and overhead of distributed systems.

My favourite realization of this: Frank McSherry shows how simplicity and a few optimisations can win out on graph analysis in his COST work. In his first post[1], he shows how to beat a cluster of machines with a laptop. In his second post[2], he applies even more optimizations, both space and speed, to process the largest publicly available graph dataset - terabyte sized with over a hundred billion edges - all on his laptop's SSD.

[1]: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

[2]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...

cornellphds 11 years ago | |

This is a classic case of "Algorithm/Problem Selection": if your algorithm/problem is tailored to a task such as PageRank, surely single-threaded, highly optimized code will beat a cluster designed for ETL tasks. In real organizations where there are multiple workflows/algorithms, distributed systems always win out. Systems like Hadoop take care of administration, redundancy, monitoring and scheduling in a manner that a single machine cannot. Sure, you can "grep" faster on a laptop than on AWS EMR with 4 medium instances, but in reality, where you have 12 types of jobs run by a team of 6 people, you are much better off with a distributed system.

acqq 11 years ago | | |

Ditto for computationally intensive work: if it is CPU dominated, more CPUs calculating in parallel will be an advantage, even if the data could fit in some RAM.

There's no single simple answer, but sure, whenever fewer computers are enough, fewer should be used.

The recent problem is that some people love "clouds" so much today that they push work there that could really be done locally.

Retric 11 years ago | | |

Some things take 1 second on 1 beefy machine or 6 hours on a cluster due to latency issues. And yes, I do mean 20,000 times slower, though 2,000 is far more common, due to latency inside a datacenter being around ~500,000 ns vs ~100 ns for main memory vs 0.5 ns for L1 cache.

PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.
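To put rough numbers on that gap, here is a small sketch using the ballpark latencies quoted above. The figures are the commenter's estimates rather than measurements, and the dictionary keys are names made up for this illustration.

```python
# Ballpark latencies quoted in the comment above, in nanoseconds.
# These are illustrative estimates, not measurements.
LATENCY_NS = {
    "l1_cache": 0.5,
    "main_memory": 100.0,
    "datacenter_round_trip": 500_000.0,
}


def slowdown(slow: str, fast: str) -> float:
    """Return how many times slower one kind of access is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


network_vs_ram = slowdown("datacenter_round_trip", "main_memory")
network_vs_l1 = slowdown("datacenter_round_trip", "l1_cache")
six_hours_vs_one_second = 6 * 60 * 60 / 1  # the "1 second vs 6 hours" case

print(f"network vs RAM: {network_vs_ram:,.0f}x")
print(f"network vs L1:  {network_vs_l1:,.0f}x")
print(f"6 h vs 1 s:     {six_hours_vs_one_second:,.0f}x")
```

A workload that turns every memory access into a network round trip lands somewhere between those ratios, which is roughly where the "2,000 to 20,000 times slower" range comes from.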

vardump 11 years ago | |

A step between RAM and SSD could be "Your data fits in RAM in compressed form". LZ4 compression takes 3-4x longer than memcpy. LZ4 decompression is only 50% [1] slower. 2-3 GB/s per core.

[1]: Your mileage may vary.
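A minimal sketch of the "fits in RAM in compressed form" idea. It uses zlib as a stand-in for LZ4 only because zlib ships with Python's standard library; the sample data is hypothetical and far more repetitive than most real datasets, so the ratio it prints is not representative.

```python
import zlib

# Hypothetical, highly repetitive log-like data; real data compresses less well.
raw = b"2015-05-20 GET /index.html 200\n" * 100_000

# zlib stands in for LZ4 here; level 1 is zlib's fastest setting.
compressed = zlib.compress(raw, 1)
restored = zlib.decompress(compressed)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw):,} bytes, compressed: {len(compressed):,} bytes, ratio: {ratio:.1f}x")
```

The trade is CPU time for capacity: data that would otherwise spill to SSD stays in memory, at the cost of a decompression pass on every read.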

bpicolo 11 years ago | | |

If it's slower both in and out what's the benefit?

To guy below me: Ah, thanks. I thought the guy above was trying to say it's slower than paging to disk. : )

threeseed 11 years ago | |

Scalable graph systems have been around for many years and have never taken off.

Most businesses doing big data (like ours) often have multiple disparate data sources that, at the start of the pipelines, are ETLed into some EDW. Trying to consolidate them into a single integrated view is very difficult and time/resource intensive. Having billions of disconnected nodes in the graph would be very hard to reason about.

stephengillie 11 years ago | |

I feel compelled to point out that the average SSD has an order of magnitude (or maybe 2 orders) more IOPS than a 6-disk 15k RAID6 or RAID10 array.

And that's a single, standalone, non-RAIDed SSD. When you get a 6-SSD RAID10, magic starts to happen. And if you RAID enough SSDs (10-20?), you can theoretically start to get more bandwidth than you do with RAM.

xjia 11 years ago | | |

Sounds good to me, but why aren't people doing that? Are SSD prices too high?

SwellJoe 11 years ago |

So, let's say my system is currently backed by MySQL or PostgreSQL, and that is not fungible. How would one move that data into RAM, including writes? And, how would one maintain some level of safety in the event of a crash? i.e. I don't really care if I lose X amount of time worth of data (say, five minutes), but I do care that when I reboot the system, the database comes back from disk into RAM in a consistent state.

Is there some off-the-shelf solution to this problem? And, if so, why isn't it talked about more? Every CMS ever, for example, would be very well-served by something like this. My entire website's database, all ~100k comments and pages and issues and all 60k users, is only 1.4GB, and performance is always a problem. I don't care if I lose a couple minutes worth of comments in the event of a system reboot or crash. So, why can't I just turn that feature (in-memory with eventual on-disk consistency, or whatever you'd want to call it) on and forget about it?

praseodym 11 years ago |

And if your data doesn't fit in a single server's RAM, just buy some more and run Apache Spark [1] on them. It's an in-memory computation engine that's really nice to program for: you don't have to worry about low-level clustering details (like MapReduce). And it's way (10-100x) faster than Hadoop.

[1] https://spark.apache.org

threeseed 11 years ago | |

Spark is fast becoming the default tool for big data.

The recent addition of SparkR in 1.4 means that now data scientists can leverage in memory data in the cluster that has been put there by output from either Scala or DW developers.

Combine it with Tachyon (http://tachyon-project.org) and it's not hard to imagine petabytes of data all processed in memory.

studentrob 11 years ago | | |

Can you explain what Tachyon does that's different from what Spark already provides?

I haven't used either Spark or Tachyon. I thought the Spark solution was to just put my dataset in memory. But the Tachyon page seems to say the same thing

lukegb 11 years ago |

Inspired by https://twitter.com/garybernhardt/status/600783770925420546

mosselman 11 years ago | |

Can someone explain in a bit more detail what this is about? Is the 'joke' that running data computation in RAM is faster than what? From disk?

JonnieCache 11 years ago | | |

The subtext is that running a fancy distributed system is more exciting and beneficial for one's resume than simply buying a massive bloody server and putting postgres on it, and that people are making tech decisions on this basis.

isp 11 years ago | | |

jordanthoms 11 years ago | | |

There is no point deploying a heavy, complex (and usually pretty slow due to the overheads involved) distributed database, when you could just buy a server with xTB ram, load any sql database on it, and run your queries in a fraction of the time. If your data is so large that it can't fit in the RAM of a single machine, then distributed databases make more sense (since loading data off disk is very slow, modulo SSD).

learnstats2 11 years ago | | |

Data that fits in RAM doesn't need any "Big Data" solutions.

icebraining 11 years ago | | |

I believe it's more "no, you don't need a Hadoop cluster of 20 machines, your data fits in the RAM of one machine".

feld 11 years ago | | |

People will build gigantic compute clusters with expensive storage backends when their entire dataset fits in memory.

If it fits in memory, it's going to be magnitudes faster to work with than on any other infrastructure you can build.

So the trick is, you take their "big data problem" and hand them a server where everything can be hot in memory and their problem no longer exists.

JDDunn9 11 years ago | | |

Right, RAM is an order of magnitude faster than disk, so calculations will be performed very quickly. Big data usually implies clusters of servers because the data won't fit on one server (even on the disk).

rm999 11 years ago |

Yes! As someone who frequently runs memory-intensive algorithms on large(ish) datasets, I have a hard time explaining to many technical people that moving from a single server to a cluster increases complexity and cost by an incredible amount. It affects key decisions like algorithm and language, and generally requires a lot of tweaking.

When a problem becomes big enough, moving to a cluster is absolutely the right decision. Meanwhile, RAM is cheap and follows Moore's Law.

chao- 11 years ago |

I love it. I was just doing some Fermi estimates for a friend on the data for a project he has in the pipeline. I was curious whether or not it would be cost efficient for his project's budget to go with NVMe SSDs or have to stick with traditional SATA ones, and turns out it doesn't even matter (for now) because at least the first three months of data will fit in 256GB of RAM, even allowing for a 2.5x factor stemming from some (estimated) inefficient storage or data structure use in a scripting language like Ruby or Python.

Edit: And after those first three months he'll know more about the use and performance demands of the project and will be able to make far more accurate decisions about storage categories.
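A Fermi estimate like this can be written down in a few lines. The sketch below assumes a made-up workload (5 million records a day at 200 bytes each) and treats the 256GB machine as 256 GiB; only the 2.5x overhead factor and the three-month horizon come from the comment above.

```python
GIB = 2**30  # bytes in one gibibyte


def estimated_footprint(records_per_day: int, bytes_per_record: int,
                        days: int, overhead_factor: float) -> float:
    """Raw data volume over the period, scaled by an in-memory overhead factor."""
    return records_per_day * bytes_per_record * days * overhead_factor


def fits_in_ram(footprint_bytes: float, ram_bytes: int) -> bool:
    """True if the estimated footprint is no larger than the available RAM."""
    return footprint_bytes <= ram_bytes


# Hypothetical workload: 5 million records/day at 200 bytes each.
three_months = estimated_footprint(5_000_000, 200, days=90, overhead_factor=2.5)
six_months = estimated_footprint(5_000_000, 200, days=180, overhead_factor=2.5)

ram = 256 * GIB
print(f"3 months: {three_months / GIB:.0f} GiB, fits: {fits_in_ram(three_months, ram)}")
print(f"6 months: {six_months / GIB:.0f} GiB, fits: {fits_in_ram(six_months, ram)}")
```

With these invented inputs the first three months fit and six months do not, which is the same shape of conclusion as above: decide on storage later, once real numbers exist.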

paulrosenzweig 11 years ago | |

Where's 2.5x from? I'd be curious to see any actual data on comparing memory footprint for a problem in C/Go/Rust to Python/Ruby. I'm sure it varies widely, but 2.5x might not be far off.

Dzidas 11 years ago |

Today I'm working on a dataset of 1GB, which fits in memory. But that is not enough. If a variable is a category/factor, you need to introduce dummy variables, and your dataset starts putting on weight. Next: do you want to apply an ML algorithm in parallel? Oops, you need more memory. Done that? Now please use the test dataset for prediction. My point is that "data in memory" is just the beginning...
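The dummy-variable blow-up is easy to estimate up front. This sketch assumes a dense encoding with 8-byte cells and a single categorical column with 50 levels over 10 million rows; all of those numbers are invented for illustration, and sparse encodings would behave very differently.

```python
def dense_one_hot_bytes(n_rows: int, n_categories: int, bytes_per_cell: int = 8) -> int:
    """Memory for a dense dummy-variable matrix: one cell per row per category."""
    return n_rows * n_categories * bytes_per_cell


n_rows = 10_000_000            # hypothetical row count
as_codes = n_rows * 4          # the same column stored as one 32-bit integer code per row
as_dummies = dense_one_hot_bytes(n_rows, n_categories=50)

print(f"integer codes: {as_codes / 1e6:,.0f} MB")
print(f"dense dummies: {as_dummies / 1e6:,.0f} MB")
print(f"blow-up:       {as_dummies // as_codes}x")
```

A single column going from tens of megabytes to gigabytes is why a 1GB dataset on disk can need far more than 1GB of RAM to model.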

SubuSS 11 years ago |

The problem with giant boxes (full of RAM / SSD / Disk) is giant failures and huge recovery times. This is worsened in the case of RAM because now every power blip is a full-on recovery situation. Have a big enough data set focussed on a single box (or two for backup purposes), and your customers are going to blow a gasket the moment one of them goes down, because workloads usually grow to accommodate available capacity.

FB has a nice paper that talks about this problem. https://research.facebook.com/publications/300734513398948/x...

CHY872 11 years ago | |

Well, you wouldn't run such a server without a hefty uninterruptible power supply system. On your bigger server you can expect a lower frequency of failures due to fewer points of failure, and you can make your system more resilient (redundant RAM, filesystems, power etc).

jakozaur 11 years ago |

A more accurate title would be "fits in the RAM of a single machine".

Maybe some bonus categories:

0. Spreadsheet is all you need.

1. Python script is good enough.

2. Java/Scala is the way to go.

3. Need to manage memory (GC doesn't cut it), some custom organization.

4. Actually needs a cluster.

a-saleh 11 years ago |

I am afraid that in our research lab we didn't have $10,000 up front / $200 a month to get a PC with 1TB RAM ... we did have a large computer hall and BOINC though :)

sytelus 11 years ago |

Looks like 1.5TB RAM with 15 cores costs $50K. But it shouldn't be just about RAM. The problems I'm working on require 250 cores on a similar amount of data. If there was an option to get say 150 cores with 2TB RAM, things would fly for sure.

jacquesm 11 years ago | |

Another 4 to 6 years and that should be a reality.

vegabook 11 years ago | | |

4-6 months and you'll have a Knight's Landing Xeon Phi with at least 72 cores and 288 hardware threads, with vector instructions, and you'll be able to stick 3 of them in a single blade.

falcolas 11 years ago |

Seems a bit naive, saying 2.1PB probably doesn't fit in ram, "but it could"...

I get who this is aimed at, and why, but just saying that it fits in RAM isn't as useful as it could be. This is an opportunity to teach, not just snark.

collyw 11 years ago | |

With a bit of clever reformatting of your data, 2.1 PB could probably easily be reduced in size to something that would fit in RAM. Do you actually need every byte?

lukegb 11 years ago | |

It wasn't intended to snark - I apologize if it was seen that way. I whipped this up super quick and perhaps should have expanded on my meaning.

kragen 11 years ago | |

Like, "To fit 2.1PB in RAM, you could spin up 9 r3.8xlarge EC2 instances for US$3.15 per hour"?

jkot 11 years ago |

Besides scale-out and scale-up, there is also another solution: scale-in. Optimize your memory usage, so your data occupies less space.

I work on something like that.

pedrocr 11 years ago |

So this seems to use 6.144TiB as the limit that will fit in RAM. That's 1.536TiB x 4 when using the latest Xeon I could find[1]. According to the specs though you should be able to use 8, so the total limit should actually be 1.536 x 8 = 12.288 TiB. 12TiB of RAM, that's quite amazing.

[1] http://ark.intel.com/products/84688/Intel-Xeon-Processor-E7-...

genericuser 11 years ago | |

It seemed to use 6.000000000000000444089209850...??? TiB when I tried values.

pedrocr 11 years ago | | |

It seems to use different values of cutoff depending on if you are using MiB/GiB/TiB/etc. I tested with GiB and 6144 is OK, 6145 is not.

chinpokomon 11 years ago | | |

Glad to see I wasn't the only one to test the limits.

jerven 11 years ago |

I think it's wrong. It says 64 TB does not fit in RAM, but you can get 64 TB machines from SGI as well as 32 TB ones from Oracle.

The SGI ones, with up to 2048 cores, are larger in their single system images than what most people have in their clusters.

The benefit of these systems is not really the ease of programming but the speed of interconnect.

List price of the Oracle one was 3 million a few years ago. But most of that is actually in the high density dimms. These days I think the price must be lower, but I won't waste my Oracle sales contact time in figuring out what it is today. Of course it will still be expensive, it is an Oracle product after all.

However, an equivalent cluster of simple 1U boxes (512 6C/64GB ones!) at Dell list prices will go for 1.5 million. That ignores the fact that you have to house 512 boxes, i.e. 25 racks or so, plus networking. Of course you do get 1/3rd more cores than the SGI one.

For many of us who are between "just use a single normal server" and yet too small for the Google solutions, these big memory solutions from Oracle and SGI can make sense, even if they are not the first thing that comes to mind!

iddqd 11 years ago |

Everything fits in RAM if you have the budget for it.

jacquesm 11 years ago | |

No, the point is that usually fitting things in RAM lowers the budget. So it's well worth doing proper analysis on whether or not (a) you can fit all your data in RAM and (b) a cluster of machines does not become its own reason for existence.

Replacing a large number of nodes with a single machine with a lot of RAM is usually a cost savings measure rather than a larger expense (and it saves power too!), and due to a lack of communications overhead and exploitation of the fact that you now have access to all the data in one go you may very well find that your algorithms run much faster.

A distributed solution should be a means of last resort.

bshimmin 11 years ago | | |

What does 6TB of RAM go for these days?

nodata 11 years ago | |

That's the point of the tool! To remind people to compare the cost of fitting the data in ram compared to the cost of not putting it in ram.

rootlocus 11 years ago |

Taken from the github repository:

var MAX_SENSIBLE = 6 * TB;

function doesMyDataFitInRam(dataSize) {
  return dataSize <= MAX_SENSIBLE;
}
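For anyone poking at the cutoff, here is a rough Python port of that check. The snippet above does not show how `TB` is defined, so this sketch assumes a binary terabyte (2**40 bytes), which matches the 6597069766656-byte boundary reported elsewhere in this thread.

```python
TB = 1024**4  # assumption: binary terabyte (TiB); the original definition is not shown
MAX_SENSIBLE = 6 * TB


def does_my_data_fit_in_ram(data_size: int) -> bool:
    """Return True if data_size (in bytes) is at or below the 6 TB cutoff."""
    return data_size <= MAX_SENSIBLE


print(MAX_SENSIBLE)
print(does_my_data_fit_in_ram(MAX_SENSIBLE))
print(does_my_data_fit_in_ram(MAX_SENSIBLE + 1))
```

The comparison is inclusive, so exactly 6 TiB fits and one byte more does not.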

br0s 11 years ago |

And if your data doesn't fit into the RAM of a single machine you can buy a few more and use vSMP (http://www.scalemp.com/) to create a shared memory single system image.

cornellphds 11 years ago |

In my opinion the correct answer is 255GB (i.e. AWS r3X8 High Memory instances).

While one can purchase servers with larger memory, most likely you will run into limitations on the number of cores. Also note that there is at least some overhead in processing data, so you would need at least 2X the size of the raw data.

Finally, while it's a good thing to tweet, joke about and make fun of buzzwords while trying to appear smart, the reality is that purchasing such servers (> 255 GB RAM) is a costly process. Further you would ideally need two of them to remove single point of failure. It is likely that the job is a batch job, and while it might take a terabyte of RAM, you only need to run it once a week. In all these cases you are much better off relying on a distributed system where each node has very large memory and the task can easily be split. Just because you have a cluster does not mean that each node has to be a small instance (4 processors, ~16 GB RAM).

jacquesm 11 years ago | |

> Further you would ideally need two of them to remove single point of failure.

That's assuming that everything needs to be 'high availability' and buying two of everything is a must. This is definitely not always the case. In plenty of situations buying a single item and simply repairing it when it breaks is a perfectly good strategy.

cornellphds 11 years ago | | |

It's not about having two of everything at all times, but rather about having capacity whenever you need it. At 244GB you hit a sweet spot where you can have access to large capability at a flexible price (Spot Market / On Demand / On Premise). This is what separates engineers with business acumen from run of the mill "consultants" with a search engine.

voidlogic 11 years ago |

"Your data fits in RAM on around X machines" would be better than "Your data fits in RAM". Any dataset fits in RAM... but if it's going to take more machines than I am willing to buy, it really doesn't.

karmakaze 11 years ago |

Before core, there was tape. Tape used to be the backup medium, then disk became the new tape. Bubble memory begat SSD, so memory has in some sense become the new disk.

RAM is the new disk: now for some, later for others.

yellowapple 11 years ago |

"Yes, your data fits in RAM... if you feel like buying a server at the same price as 3 Tesla Model S automobiles, a mansion in the Southern U.S., or a bachelor pad in San Francisco."

peter303 11 years ago |

HP hints its new memristor memory computer will have the cost of flash and the speed of registers, and will mostly eliminate the multi-level memory hierarchies we have today.

CHY872 11 years ago | |

Unlikely; the limiting factor is already distance - poor scaling from interconnects (wires) already means that we can't have all that much global state. This might increase the amount of state we can have, but unless you can fit gigabytes into a single chip you won't be eliminating the multi level memory hierarchy.

Like right now the L1 cache will have latencies of 1 or 2 cycles, and the L2 cache 15; this is due to the overheads of cache coherency protocols, moving the data around the chip; it's not that the memory's slower, it's all SRAM.

They are probably referring to enterprise workloads. Here you have large working sets (so caches are less useful) and you want maximum throughput. Clever multithreading (finegrained) can reduce effective latency by scheduling many (32?) processes at the same time, executing an instruction from each in round-robin fashion (see Sun Niagara). In that case, you can sometimes dump the L1 cache, and you would be able to get rid of the memory hierarchy.

There's also probably a benefit wrt hard drives/secondary storage; you can obviously make system storage very fast, which might improve random access times considerably. BUT this is probably not going to be transformative; it'll improve certain types of accesses, but current algorithms are already very highly tuned to spatial and temporal locality of reference. Furthermore, you'll still see these structures win out, because they can take advantage of hardware prefetching more easily.

eafpres 11 years ago | | |

The property of memristors having real values instead of 0 or 1, and the fact that their value can be path dependent, leads me to think that at least information density can be increased over conventional memory today.

nickbauman 11 years ago |

Cute but "Big Data" is really just data that's not in the building and isn't feasible to just move around from one machine to another in your department.

nwenzel 11 years ago |

Even if your data doesn't fit in RAM... and even if it does... when you're developing, you should be using a sample of your data that fits into RAM.
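One way to get such a sample without ever loading the full dataset is reservoir sampling, which keeps a fixed-size uniform sample from a stream in a single pass. This is a sketch of the classic "Algorithm R"; the function name, the seed parameter and the use of `range` as a stand-in data source are all illustrative choices, not something from the thread.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Stand-in for a dataset too large to hold in memory at once.
sample = reservoir_sample(range(1_000_000), k=1000)
print(len(sample))
```

Memory use is proportional to `k` rather than to the size of the input, so the same code works whether the source is a file, a database cursor or a network stream.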

swalsh 11 years ago |

This is good marketing, but you know what would be even better marketing? Give me access to that server for a week. Let me set up a demo of my biggest customer, and then run my tasks. We've started (and are in the process of) investing thousands of dollars in moving to Azure. A server this large is not something I can buy and experiment on easily. Hard numbers would convince my superiors that it's a better solution, but they're not going to give me $10k to do the experiment.

jacquesm 11 years ago | |

That's how accidents are made. If you can't spend a small fraction of the budget for the solution to experimentally verify that it is in fact the optimum solution you may very well be leading the company down a road that will cost them significantly more. It's not up to the writer of the article to provide you with the tools to run your least-cost-analysis, that's up to you and your bosses! (After all, you're the beneficiaries.)

vegabook 11 years ago | |

10k? Those sticks of RAM alone will cost you something like 75k USD. Then you'll need the processors, arguably 4 of the top of the line 18-core XEONs at 5000 USD each. Then you'll need to put it all together with software and a (properly cooled) rack, not to mention the terminal(s) to access it, plus the personnel to put this baby together for you. This box could easily cost you 150 grand.

pquerna 11 years ago | | |

It's not cost effective to use non E5-class Xeons, or go above 32GB DIMMs right now.... So you want a Dual-Processor setup, 16 DIMM slots, so 16x 32GB = 512GB w/ Dual Proc -- which you can do for about $10,000.

Aardwolf 11 years ago |

If I select 1KB, why does the link point to an HP server with up to 6TB of RAM? Linking to an 80's PC seems more appropriate :)

tempodox 11 years ago |

Wow, I wish I had the spare change for one of these beasts. I think I have enough NP-hard problems to fill any RAM to the brim :)

nwrk 11 years ago |

http://www.downloadmoreram.com/

msellout 11 years ago |

Although we can theoretically handle up to 2^64 bytes of RAM (16 exabytes), the practical limit is much lower. I think someone on Wikipedia said it's somewhere around 8TB, but I imagine the performance of random access into 8TB RAM is much worse than a motherboard designed for up to 32GB RAM.

It's not as easy as just buying more RAM. You'll have to pay more attention to how you make use of the various caches in between your CPU and RAM.

gambiting 11 years ago | |

I imagine that on a motherboard with 96x RAM slots, the access time between the first one in row and the last one will be actually quite different, due to the physical distance between them.

polite_wine 11 years ago |

Sorry for the simple question but if you store it in ram what is the strategy for when the server is turned off?

lukegb 11 years ago | |

The idea is more that when you process data, if you can fit it all in memory (and you don't need lots of CPU power, etc, etc, etc) then just use one machine and don't worry about "clusterising" it.

If you're expecting growth in the size of your dataset (beyond growth in RAM size availability), then, well, maybe don't just use a single machine. Same goes for a whole bunch of similar "it's too large for a single machine" considerations.

Stored data should probably still be persisted to disk, and backed up.

swalsh 11 years ago | |

You turn it back on, and load it back from the hard drive.

3pt14159 11 years ago | |

There are multiple strategies that are usually handled by the database that you use. For some databases a hard power off will lose the uncommitted data, for more durable ones it waits until the write is confirmed.

Generally though, these posts are geared towards machine learning people that don't really have "live" data as frequently.
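For the simple case where losing the most recent writes is acceptable, one common pattern is to keep the working state in memory and periodically write a snapshot to disk, swapping it into place atomically so a crash never leaves a half-written file. The sketch below is a toy illustration of that pattern, not how any particular database does it; the function names, the JSON format and the sample state are all assumptions.

```python
import json
import os
import tempfile


def save_snapshot(state: dict, path: str) -> None:
    """Write state to a temp file, flush it to disk, then atomically swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk before the rename
    os.replace(tmp_path, path)  # readers see either the old or the new snapshot


def load_snapshot(path: str) -> dict:
    """Load the last complete snapshot, or start empty if there is none."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


# Hypothetical in-memory state, saved and then restored as after a reboot.
snapshot_path = os.path.join(tempfile.mkdtemp(), "state.json")
state = {"comments": 100_000, "users": 60_000}
save_snapshot(state, snapshot_path)
restored = load_snapshot(snapshot_path)
print(restored)
```

Anything written after the last snapshot is lost on a hard power-off; systems that cannot tolerate that add a write-ahead log or confirm each write to disk before acknowledging it.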

jeltz 11 years ago | |

This all depends on what the data is used for. You may need to persist the data to disk on write even if all your data is in RAM.

stupidcar 11 years ago |

Damn. my data is 6597069766657 bytes. Apparently if it was 6597069766656 bytes it would have fitted in RAM.

lukegb 11 years ago | |

Well, hate to break it to you, but you probably have some overhead associated with your data, like your operating system or structures related to processing your data.

rplnt 11 years ago |

Our data fits in ram but it proved to have no speed benefit. So the ram just sits there, being empty.

starikovs 11 years ago |

Redis as a primary data store!

octatoan 11 years ago |

600 PiB "No, it probably doesn't fit in RAM (but it might)."

Well, well, well.

scblock 11 years ago |

What is the point of this site other than budget shaming?

lurkinggrue 11 years ago |

Great googly moogly! terabytes of ram!

maljx 11 years ago |

But does it fit in the L1 cache?

itamarhaber 11 years ago |

Brilliant!

smartpants 11 years ago |

6.000000000000000444 TiB

mahouse 11 years ago |

Any point on the stupidly big ass font? It does not fit in my screen.

pcthrowaway 11 years ago | |

BUT IT FITS IN RAM!

imaginenore 11 years ago |

That's like saying "you can fly first class".

If you don't have money, you can't. Very few people can afford it.

pdpi 11 years ago | |

It's more akin to saying "if you're looking at buying several economy tickets to go from A to B, a first class ticket on a direct flight might be cheaper and faster than stitching together several economy tickets"

smegel 11 years ago |

If you are programming in R, you sure better hope it does!

baldfat 11 years ago | |

After reading the title I was sure there was something about R in the comments.

You can now program in R on Spark: http://blog.revolutionanalytics.com/2015/01/a-first-look-at-...

Now you can work directly with SQL Server as announced this week by MS. http://www.computerworld.com/article/2923214/big-data/sql-se...

I have had a ton of arguments about R's "biggest weakness" being that it uses RAM. Not once in almost 3 years of working in R have I run into this roadblock, but I am sure others have. For those cases there are several good distributed choices that will keep getting better and better.

Using RAM instead of Distributed is better in R as well as really any other language in terms of complexity and flexibility.

saosebastiao 11 years ago | | |

For my workloads, R has always choked on its single thread long before it choked on memory. And the parallelism options are terrible hacks.

lessthunk 11 years ago |

or you learn about data structures and algorithms and try to need less :-); Randomized algorithms for example are intriguing.

toolslive 11 years ago |

We build object stores... so, no it most definitely does not.

josephmx 11 years ago |

I sincerely hope nobody is using a tool like this to decide which enterprise servers to buy...

lukegb 11 years ago | |

Me too. The links are mostly to back up my claim rather than as a suggestion of servers to buy (or I'd have found some affiliate links!)

> mm <- matrix(rnorm(1000000), 1000, 1000)
> system.time(eigen(mm))
   user  system elapsed
   5.26    0.00    5.25

IPy [1] >>> xx = np.random.rand(1000000).reshape(1000, 1000)
IPy [2] >>> %timeit(np.linalg.eig(xx))
1 loops, best of 3: 1.28 s per loop

> system.time(for(x in 1:1000) for(y in 1:1000) mm[x, y] <- 1)
   user  system elapsed
   1.09    0.00    1.11

IPy [7] >>> def do():
   ...:     for x in range(1000):
   ...:         for y in range(1000):
   ...:             xx[x, y] = 1
   ...:
IPy [10] >>> %timeit do()
10 loops, best of 3: 134 ms per loop