Joining a billion rows 20x faster than Apache Spark(snappydata.io) |
People keep stumbling upon the same thing over and over, which is that the ability to scale has significant overhead.
Most problems are not Big Data problems. The size a problem must be before it qualifies as a Big-Data problem grows larger every day with the availability of machines with ever-more cores and memory. `Sed`, `awk`, `grep`, `sort`, `join`, and so forth are some of the least appreciated tools in the Unix toolbox.
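To make the single-machine point concrete, here is a minimal sketch of an in-memory inner join in plain Python, roughly the job `join` does on sorted files. The `hash_join` helper and the sample rows are hypothetical, made up for illustration; nothing here comes from the benchmark in the linked article.

```python
from collections import defaultdict

def hash_join(left, right):
    """Inner-join two iterables of (key, value) pairs on key."""
    # Build phase: index the left side by key.
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    # Probe phase: look up each right-side row in the index.
    return [(key, lv, rv) for key, rv in right for lv in index.get(key, [])]

# Hypothetical sample data.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "lamp"), (3, "desk"), (4, "pen")]

joined = hash_join(users, orders)
print(joined)
# [(1, 'alice', 'book'), (3, 'carol', 'lamp'), (3, 'carol', 'desk')]
```

The build side has to fit in memory, which is exactly the constraint being argued about in this thread.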
People want to think they have Big Data problems but they probably just have plain old normal-data problems. I have had to unwind the ridiculous, heavy-weight, Big Data solutions to normal-data problems that "kids today" love.
If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.
edit: (accidentally hit the submit button early).
I don't think people should leverage highly distributed software for small workloads, for the same reason they shouldn't write highly parallelized code for things that run perfectly fine on one thread. But the test, while well-intentioned, seems to miss the mark.
Is there a clear-cut answer as to whether one should choose a distributed solution or not? It seems to me that if you're at the terabyte scale, choosing a non-distributed solution is asking for trouble. A quick search indicates the largest HDD you can buy is around 8TB.
My point is, these 'on a laptop/single machine memory' examples don't really give me an indicator of scenarios where I might actually want to use spark/etc.
Query languages like HiveQL and Spark SQL were designed to look like SQL, but they're not.
You can only stuff so much into memory, so there is a limit to how far you can scale up vertically; unless you buy a massive big-iron POWER box, you scale out horizontally. But with each of these in-memory appliances, what happens when you need to spill out to disk?
In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill out to disk, but do you take that big of a hit compared to the enormous cost of keeping everything in memory?
A friend of mine works for a company that does high-speed weather analysis to make predictions for energy brokers, to predict prices of wind / solar energy on the market. They use these kinds of systems extensively, because of the speed and volatility of the data. Fascinating stuff.
If the problem is that queries or sets of data might have to jump nodes, couldn't the data be designed in such a way where an assumption is made about what sorts of queries will happen at write?
Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.
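A rough sketch of that idea, assuming you shard on a key chosen at write time so that the rows a typical query needs land on the same node. `NUM_NODES`, `node_for`, `place`, and the customer records are all hypothetical names for illustration, not any particular system's API.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Map a partition key to a node index, deterministically."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES

def place(records):
    """Assign each (customer_id, payload) row to the node that owns its key."""
    nodes = {n: [] for n in range(NUM_NODES)}
    for customer_id, payload in records:
        nodes[node_for(customer_id)].append((customer_id, payload))
    return nodes

# Hypothetical writes: the assumption is that queries are per customer.
records = [("cust-1", "order A"), ("cust-2", "order B"), ("cust-1", "order C")]
nodes = place(records)

# A per-customer query touches exactly one node.
print(node_for("cust-1"), nodes[node_for("cust-1")])
```

Queries that match the assumed access pattern stay on one node; anything that cuts across customers still has to fan out, which is the cost being amortized.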
That always makes me chuckle.
Honestly though ... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.
Edit: on a second check, it might have to do with that nav that moves the whole page down.
$ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
10000000 loops, best of 3: 0.0867 usec per loop
(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)
In this case, the software being tested was explicitly written to manage the coordination of data on many nodes, so why is the definition of "baseline" a single laptop? Seems specious.
How many people out there (genuine question here) assume the opposite of what you know, or are simply ignorant of it? How many people, do you think, hear "multithreaded" and associate that with being faster?
Now, do the people who know there is overhead in splitting up and dividing work across threads also, because they have this knowledge, "see everything as a nail because they have a hammer"? Do they forget that sometimes the right solution is to simply run a single-threaded operation, not parallelize everything?
I think there are interesting merits to all of that, even if it means "hyperbolic" articles or clichéd, unrealistic tests. They challenge our thinking, our assumptions, our approach. And then separately, there should be articles/discussions on real-world tests and use cases.
Imagine the difference between setting up a Spark cluster and writing a for loop. For instance, for reasons, someone created a 1TB HDF5 file. Luckily, we had a computer with 500GB+ of RAM and lots of swap, so instead of having to hack the file apart and figure out how to chunk or parallelize it, we loaded it into memory for a one-time batch job and did other useful things in the meantime.
Well, ideally, yes, if we had infinite time. In reality we don't, which means that we have to choose what to do without the benefit of being able to implement-thrice-deploy-N-times[0]. In practice, what happens is that we (as "engineers"[1]) form rules and patterns in our heads which we use as guidance. I think the point being made is that "use a cluster" is almost never good guidance.
[0] How can you know what the performance is without actually giving your product to a bazillion users? This hints at why just-deliver-it-now-bugs-be-damned and continuous feedback is so valuable. There's no point optimizing a product used by 1000 people, but if your platform ends up being used by 1e9 people (e.g. Facebook), then you'll make ADJUSTMENTS ALONG THE WAY. This is a GOOD PROBLEM TO HAVE.
[1] A laughable term for most of the programmer crowd, myself included. Engineering is about tradeoffs and we still have basically no idea about tradeoffs in software development.
It means that the size of the dataset is not the only factor, you need to take into account the operations performed on each "element/document", the size of the intermediate datasets and the size of the final results and some more stuff (encoding, etc.).
But how do things change when the dataset grows to 9TB? Now we need more than one HD. Hadoop + Spark is built for this exact use case...
The problem is exactly that 8-9TB range, because running Spark on just two or three machines will be slower than on a laptop with an extra external drive. You need to scale up into potentially dozens of machines just to get the same performance you were getting on a laptop. You were OK with a laptop; add more data and now you have a not insignificant AWS bill, unless you are OK puttering around on a few machines much more slowly than on the laptop.
There is no middle-ground solution, so everyone starts with an overkill solution that scales, out of fear of getting stuck on one machine when the dataset grows. But most of these systems never grow enough to need to scale this way. So we are wasting resources running toy clusters on problems that would fit on a laptop.
Maybe I am becoming a cranky old man who yells at clouds, but I miss MPI. It has no frills, but it runs with next to no overhead and scales up to supercomputers with no donut hole in between.
That's why you see 40gigE or 56gigE used in HPC.
Upside: I had one of the highest-paying positions among the technical people. I also got to play with expensive stuff.
Downside: it was soul crushing, I was delivering no value whatsoever and had a really hard time looking at my colleagues in the eye, as they were making a third of my salary (at best).
I got out, joined a new company with that in mind and now have a very exciting job. They do have a dedicated R&D, Data Science team which is a shit show: absolutely brilliant people completely wasted as they can't build anything for lack of programming/technology experience, in an environment where their theoretical skills are mostly useless. I'm genuinely sad for them.
[edit: one of the companies still has a Hadoop/Spark cluster for their whopping 500MB of data]
However, I'm really glad to find out about SnappyData.io, that's gonna save me a lot of time waiting. It would truly be my perfect dream, if they allowed running any programming language inside an environment like Jupyter.org or BeakerNotebook.com, but with Pandoc.org Markdown. So that I can essentially work fulltime programming, while I can also document it and also be able to export my documentation to a good looking latex thesis.
I disagree over here. I have worked across multiple scenarios which warranted big data solutions and such solutions were not feasible before Apache Spark and such were available. Even our current startup (www.aihello.com) has 8.7 million products and calculating LDA + Cosine Similarity reaches trillions of matrices which is simply not feasible with traditional tools.
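For a sense of scale, and assuming "trillions" refers to pairwise cosine-similarity comparisons between products (the comment does not spell this out), the number of unordered pairs among 8.7 million items works out as follows:

```python
n = 8_700_000  # product count from the comment above

# Unordered pairs: n choose 2 (assumption: every product is compared with every other).
pairs = n * (n - 1) // 2
print(f"{pairs:,}")
# 37,844,995,650,000
```

That is roughly 38 trillion comparisons if done by brute force, before any pruning or approximate nearest-neighbor tricks.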
Telstra/Sensis, the telecom company in Australia that I consulted for, went from month-delayed reporting to near-real-time reporting thanks to Apache Spark.
Also keep in mind that the scale of data is growing exponentially for all of us, since storage is getting cheaper and big data analysis is proving to be a game changer in many scenarios.
https://aadrake.com/command-line-tools-can-be-235x-faster-th...
I'll also take this opportunity to plug Make and Drake for manipulating data in a replicable way:
https://bost.ocks.org/mike/make/
https://github.com/Factual/drake
If you're processing data using tools that cannot trace their ancestry directly to some time before 1985, you're probably wasting your own and your colleagues' time.
This reminds me of the time, way back when, that a coworker told me about how our customer was filling a rack with a terabyte of hard drives. My eyes bulged a little bit to think of it. Now I chuckle to think that the laptop I had two laptops ago had a terabyte drive in it.
> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.
Considering that https://www.supermicro.com/products/system/4U/8048/SYS-8048B... which is a plain old 4U server, not some fancy, super expensive NUMA machine, can eat up 12TB of memory, this quip and its parent have quite some merit.
6TB is not even https://memory.net/product/s26361-f3843-e618-fujitsu-1x-64gb... horrible at 57 504 dollars. That's about 48 engineering days if your engineer related expenses are 150 an hour (and it's likely they are more).
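The arithmetic behind that figure, as a quick sketch. The 8-hour working day is my assumption; the price and hourly rate are the ones quoted above.

```python
ram_cost_usd = 57_504     # quoted price for 6TB
hourly_cost_usd = 150     # quoted engineer-related cost per hour
hours_per_day = 8         # assumption: one engineering day is 8 hours

engineer_hours = ram_cost_usd / hourly_cost_usd
engineer_days = engineer_hours / hours_per_day
print(f"{engineer_hours:.1f} hours, {engineer_days:.1f} days")
# 383.4 hours, 47.9 days
```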
> SGI UV 300 now scales up to 64 CPU sockets and 64TB of cache-coherent shared memory in a single system.
This is the current limit of Linux hardware memory support so going above it is tricky. But still, 64TB.
There are other use cases other than mere size that can necessitate "big data" solutions. E.g. timeliness, resiliency, maintainability...
If you are building production data processing systems that have constraints on data size, latency, resiliency, scheduling, dependency management, etc., you might be better off with a "big data" system. Even if the data could all fit on a beefy box. This was a painful lesson for me to learn.
That leaves resiliency and etc. I can't answer etc., but—how is resilience helped with a big data solution? That seems like Lampson's distributed system: more machines, but you need k-of-n, k>1. Better to just mirror to two machines with the data in RAM.
If your scheduling involves running jobs that must wait on dependencies or events for a long time (hours, days), a hardware failure or some other anomaly can be catastrophic, whereas a "big data" framework can recover without your even knowing about it.
At the end of the day it just comes down to use cases. There are a LOT of other use cases that "big data" platforms address other than being able to fit data in RAM. Sometimes flying by the seat of your pants on one host doesn't cut it for business-critical processing.
> how is resilience helped with a big data solution?
The "R" in Spark's RDD abstraction is for "Resilient". Node failures and replication failures can be recovered without you even knowing it.
Sure, you can write all this stuff from scratch every time you encounter them (mirror data on hosts, run embarrassingly-parallel algorithms across a fleet of hosts, write your own DB-backed scheduling system, etc.), but all these are solved problems in these big data frameworks. You'll be wasting tons of time reinventing the wheel. I've been there.
ex: Hadoop << SQL, Python Scripts
I completely agree with
Mapreduce << SQL, Python Scripts
I do a lot of my processing in Spark SQL and through RDD transformations, as opposed to MapReduce's limiting, slow KV-style processing.
The SuperServer 7088B-TR4FT lists that it supports 24TB DDR4 ECC RAM (with 8x E7-8890 v4 for 192 cores / 384 threads).