Joining a billion rows 20x faster than Apache Spark

Joining a billion rows 20x faster than Apache Spark(snappydata.io)

153 points by plamb 9 years ago | 83 comments

Am I reading this correctly? The testbed was a single laptop? A big part of spark is the distributed in-memory aspect so I'm not sure I understand why any of these numbers mean anything.

quantumhobbit 9 years ago | |

This paper is a must read: https://pdfs.semanticscholar.org/6753/959eed800e9fad9e330daa...

People keep stumbling upon the same thing over and over which is that the ability to scale has significant overhead.

edw 9 years ago | | |

Back in '10, I needed a three or four node Hadoop cluster just to match the performance I was getting using a spare Mac mini in development mode when I was doing a lot of work in Cascalog, which is based on Cascading.

Most problems are not Big Data problems. The size a problem must be before it qualifies as a Big-Data problem grows larger every day with the availability of machines with ever-more cores and memory. `Sed`, `awk`, `grep`, `sort`, `join`, and so forth are some of the least appreciated tools in the Unix toolbox.

People want to think they have Big Data problems but they probably just have plain old normal-data problems. I have had to unwind the ridiculous, heavy-weight, Big Data solutions to normal-data problems that "kids today" love.

If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.

banachtarski 9 years ago | | |

The correct way to compare software A and software B is to benchmark both on the target platform/hardware they were respectively written for. Afterwards, do a cost-benefit analysis.

edit: (accidentally hit the submit button early).

I don't think people should leverage highly distributed software for small workloads, for the same reason they shouldn't write highly parallelized code for things that run perfectly fine on one thread. But the test, while well-intentioned, seems to miss the mark.

dajohnson89 9 years ago | | |

I'm not sure why you got downvoted. It's a valid point and it makes intuitive sense.

Is there a clear cut answer, as to whether one should choose a distributed solution or not? It seems to me that if you're at the Terabyte scale, choosing non-distributing seems to be asking for trouble. A quick search indicates the largest HDD you can buy is around 8TB.

jupiter90000 9 years ago | |

Different example, doing a simple 'group by' sparksql query on only about 20 million rows on a distributed phoenix/hbase table couldn't even be completed because of spark dumbly shuffling all the data around the cluster. Spark/phoenix RDD drivers apparently had no 'group by' push down support for phoenix so shuffled all the data amazingly inefficiently. Running the same query directly on phoenix took all of about a minute to finish.

My point is, these 'on a laptop/single machine memory' examples don't really give me an indicator of scenarios where I might actually want to use spark/etc.

j-m-o 9 years ago | | |

Hey, I'm the phoenix-spark author here. You're totally right, right now there is a lot of dumb shuffling around for certain operations. Hopefully some of that will get fixed up in the next release [1].

[1] https://issues.apache.org/jira/browse/PHOENIX-3600

dandermotj 9 years ago | | |

You're trying to GROUP BY on a distributed data store; your code is the problem, not Spark SQL. Use CLUSTER BY - it's distributed sibling.

Query languages like HiveQL and Spark SQL were designed to look like SQL, but they're not.

jagsr123 9 years ago | |

From what I know, this test originally written by Databricks (expanded here) is meant to tease out the optimizations in the Tungsten engine. Of course, a distributed query that is dominated by shuffle costs will produce a very different result.

sumwale 9 years ago | |

The test uses 4 partitions (4 threads) and is not single-threaded. Of course, distributing over network will have network costs which can be significant for sub-second queries but will become insignificant for larger data and more involved queries. The Spark execution plan will be exactly same on laptop or in cluster, and these specific queries will scale about linearly.

filereaper 9 years ago |

I apologize in advance, but whenever people claim to use a in-memory big-data system, how exactly does this end up working?

You can only stuff so much into memory, so you can scale up vertically in-terms of memory, unless you buy a massive big-iron POWER box, you scale out horizontally. But with each of these in-memory appliances, what happens when you need to spill out to disk?

In essence why should one bother with these in-memory appliances as opposed to buying boxes with fast SSD's instead? Sure you spill out to disk, but do you take that big of a hit compared to the enormous cost of keeping everything in memory?

stingraycharles 9 years ago | |

I think there are many use cases. Fraud detection, risk analysis in finance, weather simulations, etc. These don't need to spill out to disk and are a perfect use case for these systems.

A friend of mine works for a company that does high speed weather analysis to make predictions for energy brokers, to predict prices of wind / solar energy on the market. They use these kind of systems extensively, because of the speed and volatility of the data. Fascinating stuff.

rodionos 9 years ago | | |

You can also measure cloud oktas from satellite imagery of you want to get fancy in terms of solar energy supply side forecasting: https://axibase.com/calculating-cloud-oktas/

blhack 9 years ago | |

Maybe I'm misunderstanding the problem, but why can't you scale out horizontally?

If the problem is that queries or sets of data might have to jump nodes, couldn't the data be designed in such a way where an assumption is made about what sorts of queries will happen at write?

Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.

usgroup 9 years ago |

Lol was hoping it was a combination of awk and paste :)

That always makes me chuckle.

Honestly though ... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.

pmlnr 9 years ago | |

https://aadrake.com/command-line-tools-can-be-235x-faster-th...

Discussion is here: https://news.ycombinator.com/item?id=8908462

makapuf 9 years ago | |

Pardon my ignorance but what would you use Jenkins for ? Scheduling ?

tetha 9 years ago | | |

Jenkins in such a setting gives you two good things: i) scheduling. and ii) access control. The ability to give random dude X the ability to trigger computation Y, Z, and A, without the ability to change said computations.

usgroup 9 years ago | | |

Triggering jobs more generally (schedule or push notification) and splitting things into jobs and/or pipelines.

EGreg 9 years ago |

This seems like impressive stats about a relational database technology. But the scrolling on their website doesn't work on mobile. So in grand HN tradition, I left and now tell you all about it here, instead of the main point of their invention :)

franciscop 9 years ago | |

It worked for me but the nav of the browser didn't hide, which I recognize as messing around with absolute/fixed positioning and/or overflows. I'd recommend to use media queries to show a simple site on mobile and leave all the fancy stuff they are surely doing in the desktop for the desktop only.

Edit: on a second check, it might have to do with that nav that moves the whole page down.

plamb 9 years ago | | |

Appreciate these comments, the site did not go through much testing before being deployed. Overflowing was modified to eliminate horizontal scroll on mobile but it looks like there were some vertical issues as well. We will get this fixed

Loic 9 years ago |

What is the algorithm used to join the tables? Is it a hash join on `id` and `k` or using the fact that the ids are sorted and using a kind of galloping approach?

jagsr123 9 years ago | |

Yes, it is a hash join.

Loic 9 years ago | | |

I will need to dig into the implementation of the hash function, it must be a nice read as the speed shows that it is definitely well optimized! Thank you.

alexchamberlain 9 years ago |

Python 2.7 can do it in 0.0867 usec (Intel i7);

    $ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
    10000000 loops, best of 3: 0.0867 usec per loop

(Admittedly, I killed `n=109; sum(range(1,n+1))`.)

marknadal 9 years ago |

Great article, actually. Typical HN comments on performance optimizations are complaints like "this isn't a real world use case" or things like that. Most of which, they miss that comparing baseline performance metrics against two systems is still genuinely interesting in and of by itself, and acts as a huge learning catalyst to understanding what is going on. I think this article did a great job of making an honest comparison and discussing what is going on, so props to the team! (We did something similar as well, where we compared cached read performance against Redis, and were 50X faster - here: https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2... ).

banachtarski 9 years ago | |

The problem is what "baseline" means. For example, a multithreaded program will always run slower than a single threaded one on one thread by definition. It has to do work in order to coordinate the threads. Obviously, this doesn't mean we avoid multithreaded code.

In this case, the software being tested was explicitly written to manage the coordination of data on many nodes, so why is the definition of "baseline" a single laptop? Seems specious.

marknadal 9 years ago | | |

Yes, but that is exactly why I think these types of articles and discussions are useful - people who understand what is going on often times assume that so do others, but for many people it all looks like magic.

How many people out there (genuine question here) assume the opposite of what you know / that they are ignorant of it? How many people do you think that when they hear "multithreaded" that they associate that with being faster?

Now assume the people who know there is overhead work to split up and divide the work across threads... because they have this knowledge, also "see everything as a nail because they have a hammer"? That sometimes the right solution is to simply run a single threaded operation, not parallelize everything?

I think there are interesting merits to all of that, even if it means "hyperbolic" articles or cliche not-realistic world tests. They challenge our thinking, our assumptions, our approach. And then separately, there should be articles/discussions on real-world tests and use cases.

Bedon292 9 years ago |

I know its just a benchmark for comparison, and it is awesome. I love seeing cool comparisons like this, but why do I care that this particular benchmark is faster than Spark? What sort of analytics will be affected by this improvement, and will it actually be saving me time on real world use cases?

plamb 9 years ago | |

Our impression was that when Databricks released the billion-rows-in-one-second-on-a-laptop benchmark, readers were pretty awed by that result. We wanted to show that when you combine an in-memory database with Spark so it shares the same JVM/block manager, you can squeeze even more performance out of Spark workloads (over and above Spark 's internal columnar storage). Any analytics that require multiple trips to a database will be impacted by this design. E.g. workloads on a Spark + Cassandra analytics cluster will be significantly slower, barring some fundamental changes to Cassandra.

jagsr123 9 years ago | |

Good questions. One answer - Speed in analytics when working with large data sets matters. A lot. Think about this - several vendors seem to be claiming support for interactive analytics. i.e. i can ask an adhoc question on large volume of data and get some sort of insight in "interactive" times. Really? maybe with a thousand cores? In a competitive industry, say like in investment banking, if one can discern a pattern before the competition it simply provides an edge. Ask the question to folks involved in detecting fraud, or online portals trying to place an appropriate ad, or manufacturing plant trying to anticipate/predict faults. It isn't so much about trying to prove we are better than Spark (well other than grabbing some attention :-) ) but rather the potential to live in a world where batch processing is a thing of the past - like working with mainframes. The hope is that we can gain insight instantly. fwiw, we love Spark and expect some of these optimizations simply become part of Apache Spark.

luckydata 9 years ago | |

it's literally impossible to test every possible workload for a tool like Spark so... what's the point of asking? Stand up a testing cluster, run some of your jobs and you'll get the answer.

Bedon292 9 years ago | | |

I am not asking about every workload. I was just curious about an example workload where this benchmark matters.

supergirl 9 years ago |

why would you choose values between 1 and 1000 for the right side? why not 1000 values between 1 and 1 billion?

zzleeper 9 years ago |

In case the author reads this: I can't read well with that font, unless I zoom in all the way. Doesn't happen with anything else (Win10, 14in laptop, Chrome)

plamb 9 years ago | |

The font in the embedded gists or the font on the page?

minimaxir 9 years ago | | |

Likely the font on the page.

A web design QA note for all: thin fonts (e.g 300-400 weight) as a body font but work fine on macOS due to better font rendering, but do not work well on Windows.