Snowflake’s response to Databricks’ TPC-DS post

Snowflake’s response to Databricks’ TPC-DS post (snowflake.com)

80 points by uvdn7 4 years ago | 102 comments

pxc 4 years ago |

Can someone ELI5 what Snowflake and Databricks are? I spent a few minutes on the Databricks website once and couldn't really penetrate the marketing jargon.

There are also some technical terms I don't know at all, and when I've searched for them, the top results are all more Azure stuff. Like wtf is a datalake?

mping 4 years ago | |

A data lake is a system designed for ingesting, and possibly transforming, lots of data: a "lake" where you dump your data. This is different from, e.g., a Postgres DB (a single source of truth for a CRUD app, for example), because it captures more data (e.g. events) and it's normally not consistent with the single source of truth (the data may arrive in batches, be imported from other databases, etc.). Because the volume of data is normally huge, you need a cluster to store it, and some way of querying it.

Snowflake and Databricks are companies that operate in this space, providing ways to ingest, transform and analyze large volumes of data.

IanCal 4 years ago | |

Snowflake is (amongst other things but primarily to me) SQL database as a service, designed for analytical queries over large datasets.

It separates compute and storage, so there's just a big ol' pile of data and tables, then it spins up large machines to crunch the data on demand.

Data storage is cheap and the machines are expensive per hour but running for shorter times, and with little to no ops work required it can be a cheap overall system.

Bunch of other features that are handy or vital depending on your use case (instant data sharing across accounts, for example).

I've used it to transform terabytes of JSON into nice relational tables for analysts to use with very little effort.

Hopefully that's a useful overview of what kind of thing it is and where it sits.

legerdemain 4 years ago | |

Snowflake is a hosted database that uses SQL. Its two distinctions are that (1) it lets users pay for data storage and compute power separately and independently and (2) it takes decisions about data indexing out of your hands.

Databricks is a vendor of hosted Spark (and is operated by the creators of Spark). Spark is software for coordinating data processing jobs on multiple machines. The jobs are written using a SQL-like API that allows fairly arbitrary transformations. Databricks also offers storage using their custom virtual cloud filesystem that exposes stored datasets as DB tables.

Both vendors also offer interactive notebook functionality (although Databricks has spent more time on theirs). They're both getting into dashboarding (I think).

Ultimately, they're both selling cloud data services, and their product offerings are gradually converging.

ngc248 4 years ago | |

A data lake is a company-wide data repository. All the "data streams" from all of the different departments flow into the data lake. The aim is to use this data to get both macro and micro insights.

kevindeasis 4 years ago | |

They are a data warehouse with analytics? So data warehouse as a service in the cloud?

So they can collect data from different places like sql, images, etc. I think a better question would be what type of data can't they ingest?

Once you have your data, I guess you can run some analytics to find out what your data tells you.

tomnipotent 4 years ago | | |

A data lake can be home to many different data formats, e.g. Parquet, Avro, Thrift, protobuf, ORC, HDF5, CSV, JSON, all co-existing together. Spark lets you create a virtual abstraction over all of this, and query it as though it were a homogeneous database. There's no need to import data into a centralized format and schema.

This really all ties back to the "old" Hadoop days, and is an evolution of compute over data not in a fixed and managed format/schema.

geoduck14 4 years ago | | |

I'd like to add some points: I've used Snowflake for several years. Snowflake works with structured and semi-structured data (think spreadsheets and JSON). I've never tried working with pics or videos, and I'm not sure it would make sense to do that.

I've evaluated Databricks. It works with the above-mentioned structured and semi-structured data. I also suspect it could process unstructured data. My understanding is that it runs Python (and some others), so you can do any "Python stuff, but in the cloud, and on 1000s of computers".

jeffreygoesto 4 years ago | |

People who downvoted this, please take a minute and reflect that your world is not the whole world. There is a serious question in this comment and there are myriads of topics _you_ have no clue about.

dekhn 4 years ago | | |

sure, but if I see the term 'data lake' I'm gonna Bing it, with the first result being https://aws.amazon.com/big-data/datalakes-and-analytics/what... which explains it nicely.

ELI5 is for reddit, generally here we expect you can google it to get the ELI5 explanation before giving us your hot take in a comment

kthejoker2 4 years ago |

Snowflake conceding they have a 700% markup between Standard and Premium editions which has zero impact on query performance is ... well, it's something. I'd start squeezing my sales engineers about that, definitely not sustainable...

Also proof that lakehouse and spot compute price performance economics are here to stay, that's good for customers.

Otherwise, as a vendor blog post with nothing but self-reported performance, this is worthless.

Disclaimer: I work at Databricks but I admire Snowflake's product for what it is - iron sharpens iron.

drawturkey 4 years ago | |

How do you get 700% markup? The difference between Standard and Enterprise is 50%. Enterprise does have features which do make workloads run faster, but this benchmark didn't need them.

buttaphingas 4 years ago | |

I've used Snowflake for the past few years, and it's worth pointing out that when it comes to overall cost, there's a lot you get with Snowflake for free. For example, they have HA across 3 AZs out of the box, included in the price and with no configuration required.

If I'm reading what Databricks published correctly, it seems that they've only used 1 driver node for this benchmark, in other words it's a dev setup. If they want to compare apples-to-apples then they should configure, and price, a multi-AZ HA set-up.

I'm not sure if this is still applicable to Photon, however - can anyone confirm?

sagarm 4 years ago | | |

The _data_ should be replicated, but the compute infrastructure doesn't need to be. Many companies I suspect would be fine having to restart pipelines on driver failure (increasing tail latency, basically) if it yields a substantial cost reduction.

socaldata 4 years ago |

Take all the problems you have had with data warehousing and throw them in a proprietary cloud. That is Snowflake. They are the best today.

Databricks started with the cloud datalake, sitting natively on parquet and using cloud native tools, fully open. Recently they added SQL to help democratize the data in the data lake versus moving it back and forth into a proprietary data warehouse.

The selling point in Databricks is why move the data around when you can just have it in one place IF performance is the same or better.

This is what led to the latest benchmark which in the writing appears to be unbiased.

In Snowflake's response, however, they condemn it but then submit their own findings. Sounds a lot like Trump telling everyone he had billions of people attend his inauguration, doesn’t it?

Anyhow, I trust independent studies more than I do those coming from vendors. It cannot be argued or debated unless it was unfairly done. I think we are all smart enough to be careful with studies of any kind, but I can see why Databricks was excited about the findings.

michaelhartm 4 years ago |

* Databricks is unethical

* Nobody should benchmark anymore, just focus on customers instead

* But hey, we just did some benchmarks and we look better than what Databricks claims

* Btw, please sign up and do some benchmarks on Snowflake, we actually ship TPC-DS dataset with Snowflake

* Btw, we agree with Databricks, let's remove the DeWitt clause, vendors should be able to benchmark each other!

* Consistency is more important than anything else!!!

uvdn7 4 years ago | |

I don’t think they are saying benchmarks are not important, but rather that a public benchmark war is a distraction.

kingkongv2 4 years ago | |

If people have never heard of Databricks, now is the time, because a $100 billion company just started a war against them. Great marketing win, Databricks.

glogla 4 years ago | | |

Databricks has a $28B valuation and 2800 employees; Snowflake has a $109B valuation and 2500 employees.

They are both billion-dollar companies; we're hardly talking David and Goliath here.

geoduck14 4 years ago | | |

To be fair, I've been evaluating Databricks for a month or so. Databricks is coming after Snowflake. Snowflake doesn't care. Snowflake has a pretty solid moat with:

EASY SQL, data sharing (they have a marketplace), simple scaling

cloudbonsai 4 years ago | |

The interesting part is that Snowflake omits Databricks' performance scores in their graphs. Here is how they compare on TPC-DS benchmark, based on two companies' self-reports:

* Elapsed time: 3108s (Databricks) vs 3760s (Snowflake)

* Price/Performance: $242 (Databricks) vs $267 (Snowflake)

Needless to say, these numbers seriously need a verification by independent 3rd parties, but it seems that Databricks is still 18% faster and 10% cheaper than Snowflake?
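As a rough sanity check on those percentages, here is a minimal Python sketch using only the four self-reported figures quoted above. The helper name `pct_less` and the variable names are illustrative, not taken from either vendor's post. Note that "faster" can be read two ways (less elapsed time vs. higher speed), which is why the figure lands between 17% and 21% depending on the definition.

```python
def pct_less(a, b):
    """Percent by which a is smaller than b, relative to b."""
    return (b - a) / b * 100

# Self-reported figures quoted in the comment above.
databricks_time, snowflake_time = 3108, 3760   # elapsed seconds
databricks_cost, snowflake_cost = 242, 267     # price/performance in dollars

# Databricks' elapsed time relative to Snowflake's.
time_saved = pct_less(databricks_time, snowflake_time)

# The same gap expressed as a speed ratio instead of a time reduction.
speedup = (snowflake_time / databricks_time - 1) * 100

# Databricks' cost relative to Snowflake's.
cost_saved = pct_less(databricks_cost, snowflake_cost)

print(f"time saved: {time_saved:.1f}%")
print(f"speedup:    {speedup:.1f}%")
print(f"cost saved: {cost_saved:.1f}%")
```

Under these definitions the elapsed time is about 17% lower (or the speed about 21% higher), and the cost is about 9% lower.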

geoduck14 4 years ago | | |

The way I read this is: Databricks benchmarked against us, and they messed it up. Here is how YOU should evaluate Snowflake performance. And, by the way, it is pretty easy to do it.

maslam 4 years ago |

Databricks broke the record by 2x and is 10x more cost effective, in an audited benchmark. Snowflake should participate in the official, audited benchmark. Customers win when businesses are open and transparent…

blobbers 4 years ago |

This is the sort of FUD testing that gets thrown back and forth between companies of all kinds.

If you're in networking, it's throughput, latency or fairness. If you're in graphics, it's your shaders or polygons or hashes. If you're in CPUs, it's your clock speed. If it's cameras, it's megapixels (but nobody talks about lenses or real measures of clarity). If you're in silicon, it's your die size (none of that has mattered for years; those numbers are like versions, not the largest block on your die). If you're in finance, it's about your returns or your drawdowns or your Sharpe ratios.

I'm a little bit surprised how seriously Databricks is taking this, but maybe it's because one of the cofounders made this claim. Ultimately what you find is one company is not very good at setting up the other company's system, and the result is the benchmarks are less than ideal.

So why not have a showdown? Both founders, streamed live, running their benchmarks on the data. NETFLIX SPECIAL!

rxin 4 years ago | |

Exactly. Not sure about Netflix special, but there are experts that have dedicated their professional careers to creating fair benchmarks. Snowflake should just participate in the official TPC benchmark.

Disclaimer: Databricks cofounder who authored the original blog post.

AtlasLion 4 years ago | | |

The benchmark itself is kinda useless, so I don't see why they should. If you look at TPC-H, for years you had Exasol as top dog, but in the real world that meant nothing for them.

blobbers 4 years ago | | |

Come on, you're going to make a ton of money on the IPO. Now focus on the things that matter in life... i.e. starring in a Netflix special.

imslowbutnice 4 years ago |

I still don't get how much optimization was done for the Snowflake TPC-DS power run. This is what I am seeing so far and what I am foggy on:

DB1. Databricks generated the TPC-DS datasets from the TPC-DS kit before time started. Databricks started the timer, then generated all queries. Then Databricks loaded from CSV to Delta format (some Delta tables were also partitioned by date) and computed statistics. Then all of the queries were executed, 1-99, for TPC-DS 100TB.

SF1. Databricks generated the TPC-DS datasets from the TPC-DS kit before time started. Databricks started the timer, then generated all queries. Then it loaded from S3 to Snowflake tables by (I'm not sure about these next parts) creating external stages and then "copy into" statements, I guess? Or maybe just using copy into from an S3 bucket; that part doesn't matter much. But it's not clear whether they also allowed target tables to have partitioning/clustering keys at all. Then all of the queries were executed, 1-99, for TPC-DS 100TB.

It's just hard to say exactly what "They were not allowed to apply any optimizations that would require deep understanding of the dataset or queries (as done in the Snowflake pre-baked dataset, with additional clustering columns)" means. At a glance, though, this looks very impressive for Databricks, but I just want to be sure before I commit to an opinion.

aptxkid 4 years ago |

Personally I think it’s a great response and very well written. I didn’t jump on the congrats-Databricks wagon when the result first came out because of the weird front page comparison against snowflake. Both companies are doing great work. Focusing on building a better product for your customer is much more meaningful than making your competitor look bad.

tyingq 4 years ago | |

It is well written, but there's some sleight of hand here and there too. Like using your lowest tier product to demonstrate price/performance against a competitor's highest tier. The Snowflake lowest tier doesn't have failover, for example...or compliance features.

buttaphingas 4 years ago | | |

This is incorrect. Every edition of Snowflake is deployed across multiple availability zones with automatic failover in the case of failure or AZ outage. This is included in the price and requires no configuration by the customer. Cross-cloud/region failover requires the top edition and a few lines of SQL to configure (plus cloud egress costs for data replication).

The higher editions of Snowflake include features like materialised views, dynamic data masking, BYOK, PCI & HIPAA compliance etc., none of which are required for the benchmark.

uvdn7 4 years ago | | |

Exactly. That’s why I think a public benchmark war is just a waste of time. There will ALWAYS be some subtle differences between the two platforms, so results will never be apples to apples.

choppaface 4 years ago |

The audience for these posts are enterprise managers who don’t actually understand their compute needs.

For the more technically inclined, don’t let any corporate blog post / comms piece live in your head rent-free. If you’re a customer, make them show you value for their money. If you’re not, make them provide you tools / services for free. Just don’t help them fuel the pissing contest, you’ll end up a bag holder (swag holder?).

falaki 4 years ago |

Linking to the discussion on the follow up from Databricks: https://news.ycombinator.com/item?id=29232346

geoduck14 4 years ago |

I've been a customer/user of Snowflake. They make it simple to run SQL. There is a bunch of performance stuff that I don't need to worry about.

I'm interested in using Databricks, but I haven't done it yet. I've heard good things about their product.

throwaway984393 4 years ago |

"Posting benchmark results is bad because it quickly becomes a race to the wrong solution. But somebody showed us sucking on a benchmark, so here's our benchmark results showing we're better."

uvdn7 4 years ago | |

I disagree. It makes sense for Snowflake to respond to what they think is an unreasonably bad result published by Databricks. And they focused more on Snowflake’s result and only compared dollar cost against Databricks. It’s consistent with their philosophy that a public benchmark war is beside the point and mostly a distraction.

AtlasLion 4 years ago | |

Their cofounder was behind Vectorwise, which kicked ass in benchmarks, but died as no one had even heard of it. You can run the benchmark queries fast, that's great, but can you handle code migrated from Vertica? Will your optimiser come up with a good plan for queries built on 15 layers of views? That's what companies in the real world have, not some synthetic benchmark that you can make sure you can run for marketing purposes.

feqgmmr2 4 years ago | |

The thing is even that response doesn't show them to be better. As someone pointed out, they're comparing their cheapest offering with Databricks' most expensive one and saying they're 3% better in price-perf. What does someone read into that?

Rastonbury 4 years ago | |

I'm not familiar with this realm to comment on veracity of claims but it could very well be

"Posting benchmark results is bad because it quickly becomes a race to the wrong solution. Someone misrepresented our performance in a benchmark, here are the actual results."

AtlasLion 4 years ago |

The main question I have for DB is, how good is their query optimiser/compiler? It's fun that you can run some predefined set of queries fast. More important is how well you can run queries in the real world, with suboptimal data models, layers upon layers of badly written views, CTEs, UDFs... That is what matters in the end. Not some synthetic benchmark based on known queries you can optimise specifically for.

maslam 4 years ago | |

@AtlasLion you are right real world performance matters. We test extensively with actual workloads, and the speed up holds there too. For example: lots of real world BI queries are repeated over smallish data sets of 10 to 50 GB. We test that size factor and pattern all the time.

hiyer 4 years ago |

Performance is only one part of the story. The major advantage Snowflake (and to some extent Presto/Trino) brings to the table is it's pretty much plug and play. Spark OTOH usually requires a lot of tweaking to work reliably for your workloads.

bpaneural 4 years ago |

So much to read. TLDR; Databricks still holds the world record and they beat us on price/performance

bjornsing 4 years ago |

> At the end of the script, the overall elapsed time and the geometric mean for all the queries is computed directly by querying the history view of all TPC-DS statements that have executed on the warehouse.

The geometric mean? Really? Feels a lot easier to think in terms of arithmetic mean, and perhaps percentiles.

rxin 4 years ago | |

Geometric mean is commonly used in benchmarks when the workload consists of queries that have large (often orders of magnitude) differences in runtime.

Consider 4 queries. Two run for 1sec, and the other two 1000sec. If we look at arithmetic mean, then we are really only taking into account the large queries. But improving geometric mean would require improving all queries.
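To make the difference in sensitivity concrete, here is a small Python sketch using the four runtimes from the example above. The scenario (halving only the two short queries) and all names are illustrative assumptions, not something taken from either vendor's benchmark script.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Mean of the logs, exponentiated; equivalent to the n-th root of the product.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

baseline = [1, 1, 1000, 1000]          # seconds, from the example above
faster_small = [0.5, 0.5, 1000, 1000]  # hypothetical: only the short queries get 2x faster

# Relative improvement in each summary statistic.
am_change = 1 - arithmetic_mean(faster_small) / arithmetic_mean(baseline)
gm_change = 1 - geometric_mean(faster_small) / geometric_mean(baseline)

print(f"arithmetic mean improves by {am_change:.2%}")
print(f"geometric mean improves by {gm_change:.2%}")
```

The arithmetic mean barely moves because the 1000-second queries dominate the sum, while the geometric mean drops by roughly 29% because each query contributes through its ratio rather than its absolute runtime.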

Note that I'm on the opposite side (Databricks cofounder here), so when I say that Snowflake didn't make a mistake here, you should trust me :)

bjornsing 4 years ago | | |

> But improving geometric mean would require improving all queries.

No. Improving the geometric mean only requires reducing the product of their execution times. So if you can make the two 1 ms queries execute in 0.5 ms at the expense of the two 1000 ms queries taking 1800 ms each then that’s an improvement in terms of geometric mean.

So… kind of QED. The geometric mean is not easy to reason about.
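The counterexample above can be checked numerically. This is a minimal Python sketch with illustrative names; units are whatever the four runtimes are measured in, since the geometric mean comparison does not depend on them.

```python
import math

def geometric_mean(xs):
    # n-th root of the product of the values.
    return math.prod(xs) ** (1 / len(xs))

before = [1, 1, 1000, 1000]
after = [0.5, 0.5, 1800, 1800]  # short queries 2x faster, long queries 1.8x slower

gm_before = geometric_mean(before)
gm_after = geometric_mean(after)
total_before = sum(before)
total_after = sum(after)

print(f"geometric mean: {gm_before:.2f} -> {gm_after:.2f}")
print(f"total runtime:  {total_before} -> {total_after}")
```

The geometric mean falls from about 31.6 to about 30.0 even though the total runtime rises from 2002 to 3601, which is why a geometric mean is usually reported alongside total elapsed time rather than instead of it.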

uvdn7 4 years ago |

I genuinely think the DeWitt clause is good for the users (bad for researchers). Without it, especially in the context of corporate competition, the company with the most marketing power will win. Users can always compare different products themselves. I am likely wrong but please help me understand.

glogla 4 years ago |

What do you know, here's an article[1] from 2017 about Databricks making an unfortunate mistake that showed Spark Streaming (which they sell) as a better streaming platform than Flink (which they don't sell).

I really hope this is not the case again.

(yes, I understand my sarcasm is unneeded, I couldn't help myself)

[1]: https://www.ververica.com/blog/curious-case-broken-benchmark...