Databricks response to Snowflake's accusation of lacking integrity (databricks.com)

217 points by rxin 4 years ago | 156 comments

gnabgib 4 years ago |

Related post (2 days ago, 95 comments): [Snowflake’s response to Databricks’ TPC-DS post](https://news.ycombinator.com/item?id=29206959)

drej 4 years ago |

What I find hilarious is that companies argue who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers by both of the companies in question and used both platforms (and sadly migrated some data jobs to them).

While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.

Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).

sanketsarang 4 years ago | |

I did work on making a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works on 100TB. If you do have small data, then you have a lot of options. But if your data is large, then you have very few options. So it is correct to be competing on how fast a DB can query 100TB of data; while at the same time being slow if you have just 10GB of data. Some databases are designed only for large data, and should not be used if your data is small.

doppelganger1 4 years ago | | |

The larger your data, the more that indexing and maintaining them hurt you. This is why they do much better at larger datasets vs small data sets. It’s all about trade offs.

To overcome this, they make use of cache and if the small data is frequently accessed, the performance is generally pretty good and acceptable for most use cases.

tshanmu 4 years ago | |

Resume driven development FTW!

StephenJGL 4 years ago | |

Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.

autokad 4 years ago | |

It's my experience that if it's just tens of GBs, then use 'normal' solutions. If it's TBs, then Spark is great for that. Note I have only used Databricks & Spark, no Snowflake.

jeltz 4 years ago | | |

PostgreSQL and MySQL can handle a few TB just fine. It is when you reach over 10TB that you need something else.

scapecast 4 years ago |

The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.

Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.

In Snowflake’s case, that was separation of storage and compute.

In Databricks’ case, it’s the Lakehouse Architecture.

I think the reason why Snowflake is so nervous is that they know they can’t win this game.

avip 4 years ago |

I've used both products in production. Both are good++.

The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.

paxys 4 years ago | |

Manufactured rivalries can be a great thing for business. We have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs Burger King for decades now while these companies laugh all the way to the bank.

javajosh 4 years ago | | |

Like the post but I would add "Ford v Ferrari" there. A synthetic 100T test is much like an F1 course - not something you deal with during your commute, but it's nice to know what the limit is, and that there are people pushing that limit.

kartoonhero 4 years ago | |

It's not ridiculous at all. This is the coming of age for a brand new data architecture.

One of the biggest FUDs for a data lake architecture is performance - and this benchmark should put that concern to rest.

buttaphingas 4 years ago | | |

I actually see them as variations on the same architecture. Databricks keeps their metadata in files, Snowflake keeps theirs in a database, but they both, ultimately, are querying data stored in a columnar format on blob store (and, to be fair, Snowflake have been doing that with ACID-compliant SQL for a lot longer than Databricks). So using SQL over blob at high performance has been around for a while.

Databricks say their solution is better because it's open (though they keep the optimizations you need to run this at scale to themselves, i.e. it is ultimately proprietary). Snowflake says theirs is better because it's a fully managed service, meaning no infrastructure to procure or manage, fully HA across multiple data centers by default, etc.

Databricks push 'open' but really still want you to use their proprietary tech for first transforming into something usable (Parquet/Delta) and then querying with Photon/SQL, though you can also use other tech. With Snowflake you can just ingest and query, but it has to be through their engine.

Customers should do their own validation and see which one fits their needs best.

syntaxfree 4 years ago | | |

I don’t know, “coming of age” seems to imply that there’s some pre-maturity period out of which something is emerging.

CactusOnFire 4 years ago | |

It was inevitable.

Both Databricks and Snowflake have inflated marketing budgets, and marketing feels they have to "beat" the other one or they'll lose the market.

inetknght 4 years ago |

Snowflake accuses other companies of lacking integrity?

I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.

Fuck Snowflake for thinking it has any room to talk about integrity.

doppelganger1 4 years ago | |

What I find comical is that they accuse Databricks of lacking integrity but don’t actually call out anything except that their own benchmark run was faster than what Databricks measured on Snowflake. Databricks then reran the benchmark and said the only reason Snowflake’s run was faster was the built-in dataset they used. Databricks was able to match Snowflake’s numbers using it, but when they loaded the actual dataset, which is how a proper TPC benchmark is supposed to be run, it was much slower. They then said that Databricks’ blog doesn’t match the TPC results, but when I looked at them, they do match. I guess Snowflake just expects people to take arguments at face value. Then I saw someone on LinkedIn complaining that Databricks must have used some beta version. I didn’t see a beta version being used, but that kind of goes out the window when Databricks follows up and posts that they matched Snowflake when they used the built-in TPC dataset.

This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”

boublepop 4 years ago |

Snowflake must be kicking themselves hard now for letting a story that was “Databricks is a viable alternative” turn into “Snowflake has absolutely no integrity and will fling mud even while they are gaming the statistics”

Really can’t see what they can do now short of “bending” to Databricks and entering the competition. And naturally it’s no longer enough that they show comparable performance. They have to hit their gamed stats somehow, otherwise any news, even if they beat Databricks, will be reported as “see, we told you they were cheating”.

bloodyplonker22 4 years ago |

Databricks is trying to punch up at the market leader. Every decent marketer knows that you should never do the opposite and punch down.

djbusby 4 years ago | |

I'm crap at marketing and know the only-punch-up rule.

aliswe 4 years ago | |

what differences in size (or height) are we talking about?

jchw 4 years ago |

Before the Snowflake blog post, I did not know what Snowflake or Databricks were. I can only imagine that this rivalry is great for both of them, even if Databricks is somewhat on the advantage end, at least from a tactical standpoint; I admit though that they seem to be a bit unnecessarily defensive considering the position they're in with the exchange.

In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.

qaq 4 years ago | |

Snowflake is the $120B market cap darling of cloud data warehouses. I doubt obscurity is a problem they are trying to solve.

jchw 4 years ago | | |

Of course they’re known among their pre-existing customer base of people and entities who already solve problems using tools like this. But it’s a subset of the multi-trillion dollar cloud industry, which itself is not the entire software engineering industry.

AdamProut 4 years ago |

I would say that TPC-DS and TPC-H are really table stakes benchmarks for data warehouses at this point in time (maybe they weren't 10 years ago). How to build a database that does well on them is well documented in the literature now[1][2][3][4] (maybe a few other papers). It's not easy to build such a database, but it's "just" hard work and many companies have the $$ necessary to do that work. There isn't any magic or technical moat in the results for Databricks (or Snowflake, or Redshift, etc.).

I think Databricks is overly enthusiastic about their results as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building deltalake and their photon query engine which implement a number of standard DW features).

  [1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
  [2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
  [3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
  [4] http://sites.computer.org/debull/A12mar/vectorwise.pdf

thrtlvlmidnight 4 years ago | |

I agree with everything above. The main advantage the newer data warehouses have over the legacy on-prem incumbents is that they had the chance to build from scratch having learned from all of the challenges that the original players encountered.

The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.

The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.

redwood 4 years ago |

As much as I love seeing competition in the space and am enjoying my popcorn, I really don't understand what Databricks is doing here: this feels like a childish foodfight rather than an obsession with the customer...

saj1th 4 years ago | |

:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?

I believe the co-founders have addressed this in the blog.

> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.

I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.

These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.

kf6nux 4 years ago | |

I'd say helping customers spot fraud* is serving the customers' interests.

* I haven't executed the test suite, but fraud seems likely.

jjoonathan 4 years ago | |

All publicity is good publicity.

Both participants in a fight can win by implicitly excluding their real competitors.

glogla 4 years ago | |

Yes, the tone of those blog posts, the likelihood of fake benchmarks submitted on someone else's behalf, and especially the deluge of new accounts supporting them make me want to trust Databricks even less than I did after the PoC my company ran with them last year and the time spent with their terrible, terrible salespeople.

EDIT: I forgot lying about how open they are when all their interesting technologies (like the new sql engine and the good parts of delta) are proprietary.

mostdataisnice 4 years ago | | |

What fake benchmarks are you talking about?

vgt 4 years ago | |

I think Snowflake cultivates a very careful public image, but in private their sales people use.. how do you say.. aggressive techniques.. databricks is addressing the source of market confusion head-on

benjaminwootton 4 years ago |

I’ve been following this and it’s kind of embarrassing to watch.

I love working with Databricks and Snowflake. They both knock it out of the park for their respective use case. They’re amazing products.

It makes no sense to fall out about this though.

For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1 row dataset, Snowflake will return before the spark job has been serialised.

imslowbutnice 4 years ago | |

What are you talking about? Spark isn't even used, and TPC-DS is not a funky calculation at all. It's supposed to be a collection of typical data warehouse queries. I'm not really sure what funky means, but why would Spark trounce Snowflake on a "funky" calculation at all? Do you mean an ML algorithm, and are you implying that TPC-DS has anything close to an ML algorithm? And why would Snowflake perform better on returning one row? They use columnar storage.

nojvek 4 years ago | |

Why would Spark trounce Snowflake. What makes it inherently so much faster at 100TB jobs?

Also what kind of queries are we talking about?

saj1th 4 years ago | | |

> Why would Spark trounce Snowflake. What makes it inherently so much faster at 100TB jobs?

These are the slides from a talk one of the co-founders (@rxin) gave at Stanford. https://web.stanford.edu/class/cs245/slides/LakehouseGuestTa...

It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.

__MatrixMan__ 4 years ago |

Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.

Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:

- time

- storage

- compute

- config complexity

No need to waste time making your opponent look bad, just focus on making yourself look good, and do it on a level playing field.
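
As a rough illustration of the scoring idea above, here is a minimal sketch. Everything in it is an assumption made for the example: the `Entry` fields, the vendor names, and the numbers are invented, and summing per-criterion ranks is just one possible way to combine the four criteria, not how any real benchmark body scores submissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One hypothetical submission; lower is better for every field."""
    name: str
    seconds: float       # time to reach the goal
    storage_gb: float    # storage consumed
    compute_cost: float  # compute spent, in arbitrary units
    config_lines: int    # crude proxy for config complexity


# The four criteria from the list above, as attribute names on Entry.
METRICS = ("seconds", "storage_gb", "compute_cost", "config_lines")


def rank_sum(entries):
    """Rank entrants per metric (1 = best) and sum the ranks.

    Returns (name, total) pairs sorted from best to worst total.
    """
    totals = {e.name: 0 for e in entries}
    for metric in METRICS:
        ordered = sorted(entries, key=lambda e: getattr(e, metric))
        for rank, entry in enumerate(ordered, start=1):
            totals[entry.name] += rank
    return sorted(totals.items(), key=lambda kv: kv[1])


# Invented numbers, purely for illustration.
entries = [
    Entry("vendor_a", 120.0, 500.0, 40.0, 30),
    Entry("vendor_b", 150.0, 450.0, 35.0, 25),
    Entry("vendor_c", 200.0, 600.0, 50.0, 20),
]

leaderboard = rank_sum(entries)
for name, total in leaderboard:
    print(name, total)
```

Rank-sum avoids having to weigh seconds against gigabytes directly, at the cost of ignoring how large the gaps between entrants are.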

rxin 4 years ago | |

Isn’t that what the official TPC does?

falaki 4 years ago | |

That is exactly the role of tpc.org.

renewiltord 4 years ago | |

If you want some information like this quick, you're gonna have to pay to run it.

michaelhartm 4 years ago |

Data Wars: Snowflake vs Databricks (0 - 2)?

drawturkey 4 years ago | |

Snowflake has way more revenue, is worth 3 times more than Databricks and is growing faster. I'd say Snowflake is still in the lead. Plus, just look at Snowflake's customer list. It's a "who's who", Databricks is a "Who's that?".

thrtlvlmidnight 4 years ago | | |

I took a look at Databricks public customer case studies[1] and haven't a clue who any of these companies are:

Atlassian? Adobe? ExxonMobil? PagerDuty? McAfee? HSBC? Starbucks? AstraZeneca? GlaxoSmithKline? Comcast? FINRA? Regeneron? Riot Games? Nielsen? HP? Conde Nast? Viacom? McGraw-Hill? Cisco? NBCUniversal?

Hopefully they can scale to the enterprise soon.

[1] https://databricks.com/customers

naattee 4 years ago |

snowflake should just pony up and do a TPC-DS audited benchmark

maslam 4 years ago |

Everyone wins when data platforms submit audited benchmarks...

boringg 4 years ago |

And how soon is the S-1 for Databricks dropping?

Normal_gaussian 4 years ago |

so, alternatives?

Aside from the Azure/GCP/AWS internal offerings, I know about Snowflake and Firebolt. Databricks is new to me.

glogla 4 years ago | |

Redshift is pretty terrible, stay away. AWS is even worse at delivering promises than Databricks and that's saying something.

I heard Google BigQuery is good. It is completely SaaS (like AWS Athena that works).

Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing, not sure if they still do.

falaki 4 years ago | | |

Apple is a big Deltalake (and Databricks) customer: https://www.youtube.com/watch?v=SFeBJxI4Q98

funstuff007 4 years ago |

I guess if anyone suggests "sampling" the data in a meeting these days, they get their head blown off.

xiaodai 4 years ago |

Spark compares itself to Hadoop only on the front page. I wonder how Spark compares to Firebolt.

uvdn7 4 years ago |

Now I see that getting rid of the DeWitt clause is indeed great. Kudos to both companies.

1cvmask 4 years ago |

This reminds me of the old performance ads of Oracle where they would show you how everything ran better on Oracle. They used to put those ads at airports, business lounges and the back cover of newspapers and magazines read by non-technical executives like the FT and Economist.

Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".

Breakdown of one of those example ads:

https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...

falaki 4 years ago |

tl;dr: The data warehouse company used a pre-baked TPC-DS dataset and claimed they have similar performance to Databricks. Turns out if you use the official TPC-DS data generation scripts, you get much worse performance.

slownews45 4 years ago | |

Even worse, they claimed to have similar performance to Databricks AND claimed databricks "lacked integrity". WOW, talk about chutzpah!

tyingq 4 years ago | |

I read the original post, the Snowflake response, and this. From that I gather that both of them aren't being completely honest or fair when making comparisons. A fair amount of truth, but also some clever wording and omission on both their parts. Which is not surprising or particularly new in this space :)

slownews45 4 years ago | | |

Databricks results are available at tpc.org [1]

Snowflake has shown NOTHING close to this.

[1] http://tpc.org/results/fdr/tpcds/databricks~tpcds~100000~dat...

arnon 4 years ago | |

That's altering the methods - and generally considered a violation of the validity of the results.

xiaodai 4 years ago |

Lol

dreyfan 4 years ago |

Databricks is rapidly approaching an IPO, trying to justify their valuation with their overpriced in-memory Hadoop.

kartoonhero 4 years ago | |

Databricks is way more than hadoop or spark. A great analogy - Spark is a great engine but you need to design and build all of the other subsystems.

Databricks is an F1 car - everything is built out. You get in and drive - FAST.

dreyfan 4 years ago | | |

Databricks is a shit platform that encourages terrible data practices and accretion of technical debt.

glogla 4 years ago | | |

> Databricks is an F1 car

F1 cars are really unreliable and need a lot of engineers to keep running, are very expensive, and completely impractical in normal use. They are fast but only on very specific roads; they couldn't survive on normal roads.

What do you know, you might be right! :D

fs111 4 years ago | | |

> Databricks is an F1 car - everything is built out. You get in and drive - FAST.

found the databricks employee

hello_moto 4 years ago |

Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?

I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).

PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.

kartoonhero 4 years ago | |

Please read up on Lakehouse.

Data Lake + Merge support + DW performance is now possible.

That is the game changer.

hello_moto 4 years ago | | |

It'll take a few more years until these companies fix all the bugs and address all the scalability issues.

As of today, these companies are not good enough to take on the Data Warehouse part.

strongbond 4 years ago | | |

Do you work for Databricks?