Databricks is an F1 car - everything is built out. You get in and drive - FAST.
F1 cars are really unreliable, need a lot of engineers to keep running, are very expensive, and are completely impractical in normal use. They are fast, but only on very specific roads; they couldn't survive on normal roads.
What do you know, you might be right! :D
found the databricks employee
I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).
PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.
Data Lake + Merge support + DW performance is now possible.
That is the game changer.
As of today, these companies are not good enough to take on the Data Warehouse part.
Great for proof of concepts, but when you start to build out complete pipelines please look into how to make the pipelines more sustainable and maintainable.
Can Spark query 100Bn rows of structured data while aggregating on multiple fields (or dimensions)?
What about large-scale reads via OLAP queries (y'know, the typical measures and dimensions)?
Vendors can implement that API with their own optimizations. EMR makes optimizations in their implementation and so does Databricks. Photon is a new engine, but it implements the Apache Spark API for better performance. There's nothing to stop EMR or any other Apache Spark vendor from undertaking the same strategy.
This openness has allowed customers of Hortonworks and Cloudera to migrate their workloads to the cloud easier than if they had to refactor from something completely different, like from Oracle PL/SQL routines.
Snowflake does not have an open ETL API. If you write stored procedures in Snowflake, you can only run them on Snowflake. This is one of the reasons people choose to use dbt with Snowflake. It gives them an open ETL layer to provide future optionality.
There's no reason why you couldn't use Snowflake as the datastore and Spark as the ETL. However, it would be prohibitively expensive to do so. You would need to pay for the Spark cluster, but also a Snowflake cluster to export and import the data. Exporting a handful of terabytes from Snowflake can also take hours depending on your cluster configuration.
By storing your data on S3 in an open format, like Apache Parquet or Delta Lake, you can just use a different engine on it without needing to export / import it. In addition to Spark, Presto & Trino are popular engines to use when querying a data lake.
This optionality is ultimately good for customers. If Apache Spark is best for your use case, then you can choose to host Spark yourself, EMR, Databricks, Cloudera, etc. If Presto is best for your use case, you can choose AWS Athena, Starburst, Ahana, etc. Once you pick the best tech for your use case, you have several vendors to compare against for the best deal.
If I want to move off Snowflake to Firebolt or some other data warehouse, I need to pay both vendors to get my data out and get my data in. Snowflake wasn't around 10 years ago, and if they are not still a good option 10 years from now, I don't want to have to pay them for the privilege to export my data out. I could rectify that by keeping all my data in a data lake, but now I'm paying to store the data twice.
Open APIs enable an open ecosystem, which encourages competition.
But with Snowflake, the data never comes out. Can't use Spark/Trino/Flink... on data in SF.
Do you have to pay to export data out of Databricks? No, it's already sitting where you want it.
Which one is open? I wonder
This is the problem. Both Snowflake and Databricks are spreading FUD and otherwise smart people are falling for it.
For all intents and purposes, large amounts of data are locked into Snowflake. Is it theoretically possible to export a petabyte out of SF? Sure.
Do I want to spend money on it? Not really. That is what I mean by the "data doesn't come out".
"Exporting" a petabyte out of Databricks is a no-op. I can already read Deltalake from other open source tools.
This is just FUD.
Consider a scenario where data is coming in periodically, say daily, from some source: server logs, sensor data, whatever. The user wants to train models daily on the data and also wants to do some SQL. Maybe they ingest the data directly into SF and copy it out for training, or they do it the other way round: land it in an object store and then ingest it into SF. This is unlikely to be a humongous amount of data; it's probably not a PB. However, this adds up. Maybe for some use cases it becomes a PB in a month, maybe in a quarter, maybe it only adds up to a PB in a year.
Thing is, without a Lakehouse architecture, the user will pay to store and copy that data multiple times (at least twice) no. matter. what. They may not pay for a PB in one shot, but you can bet that eventually they'll pay multiple times to store and copy that PB.
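To put rough numbers on "this adds up", here is a back-of-the-envelope sketch in Python. The daily ingest rates and the two-copies assumption are hypothetical, picked only to land near a PB per month, quarter and year; they are not figures from any vendor or workload.

```python
# Hypothetical numbers throughout: the ingest rates and copy counts are
# illustrative assumptions, not measurements or vendor pricing.
import math

TB_PER_PB = 1000  # decimal units: 1 PB = 1000 TB


def days_to_reach(target_tb: float, daily_ingest_tb: float) -> int:
    """Whole days of ingest needed before the dataset reaches target_tb."""
    return math.ceil(target_tb / daily_ingest_tb)


def stored_tb(days: int, daily_ingest_tb: float, copies: int) -> float:
    """Total volume held after `days` when every byte is kept `copies` times."""
    return days * daily_ingest_tb * copies


if __name__ == "__main__":
    # Roughly a PB per month, per quarter and per year.
    for daily in (35.0, 11.0, 3.0):
        days = days_to_reach(TB_PER_PB, daily)
        one = stored_tb(days, daily, copies=1)
        two = stored_tb(days, daily, copies=2)
        print(f"{daily:>5} TB/day: 1 PB after {days} days; "
              f"{one:.0f} TB stored with one copy, {two:.0f} TB with two")
```

Whatever the real rates are, the second copy doubles the stored volume from day one, which is the "pay multiple times" part.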
There are different ways to lock customers in and both Databricks and Snowflake are playing the game.
Every vendor, be it Snowflake, Databricks, EMR, Athena, BQ, … charges for use of the engine. The difference with a Lakehouse is that one doesn’t have to pay the vendor for the simple ability to use the data with another offering. That’s what you have to pay for with closed systems, whether it’s data on the way in or data on the way out.
While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.
Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
To overcome this, they make use of cache and if the small data is frequently accessed, the performance is generally pretty good and acceptable for most use cases.
Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.
In Snowflake’s case, that was separation of storage and compute.
In Databricks' case, it’s the Lakehouse Architecture.
I think the reason why Snowflake is so nervous is because they know they can’t win this game.
The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.
One of the biggest FUDs for a data lake architecture is performance - and this benchmark should put that concern to rest.
Databricks say their solution is better because it's open (though they keep the optimizations you need to run this at scale to themselves, i.e. it is ultimately proprietary). Snowflake says theirs is better because it's a fully managed service, meaning no infrastructure to procure or manage, fully HA across multiple data centers by default, etc.
Databricks push 'open' but really still want you to use their proprietary tech for first transforming into something usable (Parquet/Delta) and then querying with Photon/SQL, though you can also use other tech. With Snowflake you can just ingest and query, but it has to be through their engine.
Customers should do their own validation and see which one fits their needs best.
Both Databricks and Snowflake have inflated marketing budgets, and marketing feels they have to "beat" the other one or they'll lose the market.
I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.
Fuck Snowflake for thinking it has any room to talk about integrity.
This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”
Really can’t see what they can do now short of “bending” to Databricks and entering the competition. And naturally it’s no longer enough that they show comparable performance. They have to hit their game stats somehow, otherwise any news, even if they beat Databricks, will be reported as “see, we told you they were cheating”.
In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.
I think Databricks is overly enthusiastic about their results as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building deltalake and their photon query engine which implement a number of standard DW features).
[1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
[2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
[3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
[4] http://sites.computer.org/debull/A12mar/vectorwise.pdf
The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.
The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.
I believe the co-founders have addressed this in the blog.
> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.
I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.
These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.
* I haven't executed the test suite, but fraud seems likely.
Both participants in a fight can win by implicitly excluding their real competitors.
EDIT: I forgot lying about how open they are when all their interesting technologies (like the new sql engine and the good parts of delta) are proprietary.
I love working with Databricks and Snowflake. They both knock it out of the park for their respective use case. They’re amazing products.
It makes no sense to fall out about this though.
For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1 row dataset, Snowflake will return before the spark job has been serialised.
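One way to picture this trade-off is a toy model where query time is a fixed startup overhead plus data size divided by scan throughput. The sketch below is only that: the two "engines" and all four parameters are hypothetical, and none of the numbers are measurements of Spark or Snowflake.

```python
# Toy latency model: total time = fixed startup overhead + size / throughput.
# Every parameter here is a made-up assumption, chosen to show a crossover.

def query_seconds(size_gb: float, startup_s: float, gb_per_s: float) -> float:
    """Estimated wall-clock time for scanning size_gb."""
    return startup_s + size_gb / gb_per_s


def crossover_gb(startup_a: float, rate_a: float,
                 startup_b: float, rate_b: float) -> float:
    """Data size at which engines A and B take the same time.

    Solves startup_a + x / rate_a == startup_b + x / rate_b for x.
    """
    return (startup_a - startup_b) / (1 / rate_b - 1 / rate_a)


if __name__ == "__main__":
    batch = {"startup_s": 30.0, "gb_per_s": 50.0}        # slow start, fast scan
    interactive = {"startup_s": 0.5, "gb_per_s": 5.0}    # fast start, slow scan

    x = crossover_gb(batch["startup_s"], batch["gb_per_s"],
                     interactive["startup_s"], interactive["gb_per_s"])
    print(f"break-even at about {x:.0f} GB")

    for size_gb in (0.001, 100.0, 100_000.0):
        print(size_gb,
              round(query_seconds(size_gb, **batch), 2),
              round(query_seconds(size_gb, **interactive), 2))
```

Below the break-even size the fixed overhead dominates and the low-latency engine wins; above it, throughput dominates. Which side of the line your data sits on matters more than either vendor's headline number.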
Also what kind of queries are we talking about?
These are the slides from a talk one of the co-founders (@rxin) gave at Stanford. https://web.stanford.edu/class/cs245/slides/LakehouseGuestTa...
It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.
Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:
- time
- storage
- compute
- config complexity
No need to waste time making your opponent look bad, just focus on making yourself look good, and do it on a level playing field.
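A minimal sketch of how a referee might combine those four criteria. The comment above doesn't say how to weigh them, so the rank-sum rule here is my own assumption, as are the vendor names, the units and the use of "number of config steps" as a stand-in for complexity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    time_s: float          # wall-clock time to reach the goal
    storage_gb: float      # storage consumed
    compute_units: float   # compute consumed, in whatever unit the referee picks
    config_steps: int      # crude proxy for config complexity


METRICS = ("time_s", "storage_gb", "compute_units", "config_steps")


def rank_sum(entries: list[Entry]) -> dict[str, int]:
    """Sum of per-metric ranks; 1 is best and lower is better on every metric.

    Ties are broken by input order because sorted() is stable, which is a
    simplification a real referee would want to handle explicitly.
    """
    totals = {entry.name: 0 for entry in entries}
    for metric in METRICS:
        ordered = sorted(entries, key=lambda e: getattr(e, metric))
        for rank, entry in enumerate(ordered, start=1):
            totals[entry.name] += rank
    return totals


if __name__ == "__main__":
    # Entirely made-up submissions.
    submissions = [
        Entry("vendor_a", 120.0, 400.0, 40.0, 12),
        Entry("vendor_b", 150.0, 450.0, 35.0, 30),
        Entry("vendor_c", 200.0, 600.0, 50.0, 8),
    ]
    for name, total in sorted(rank_sum(submissions).items(), key=lambda kv: kv[1]):
        print(name, total)
```

Rank-sum avoids arguing about how many seconds a gigabyte is worth, at the price of ignoring how large the gaps are.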
Atlassian? Adobe? ExxonMobil? PagerDuty? McAfee? HSBC? Starbucks? AstraZeneca? GlaxoSmithKline? Comcast? FINRA? Regeneron? Riot Games? Nielsen? HP? Conde Nast? Viacom? McGraw-Hill? Cisco? NBCUniversal?
Hopefully they can scale to the enterprise soon.
Aside from the Azure/GCP/AWS internal offerings, I know about Snowflake and Firebolt; Databricks is new to me.
I heard Google BigQuery is good. It is completely SaaS (like AWS Athena that works).
Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing, not sure if they still do.
Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".
Breakdown of one of those example ads:
https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...
Snowflake has shown NOTHING close to this.
[1] http://tpc.org/results/fdr/tpcds/databricks~tpcds~100000~dat...
Databricks was founded before Spark 1.0 released by Spark's creators.
Hadoop was created at a time when network and disk were much slower, RAM was less abundant. Bringing compute to the data made sense, but it typically doesn't anymore.
Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Couldn't Snowflake take the best parts from it and run with them?
I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.
With the lakehouse, you can use python, R and Scala, (not just SQL) to interface with your data. You can use multiple compute engines (spark, Databricks, presto) so you are not locked into one compute engine.
I recall being a junior programmer, and wishing I could talk to my MySQL database in python code to do some processing that was difficult to express in SQL, that day is finally here.
Snowflake and Databricks are multicloud. The difference is that Snowflake is more like a SaaS solution and only does SQL. Databricks is more than just SQL. It has all the data science and machine learning tooling built into it. Snowflake has Snowpark but it’s very limited, so you are more likely to have to buy more products to build out your capabilities and integrate them with Snowflake. With Databricks it is more out of the box in terms of capabilities. Databricks also runs in your cloud account, which has trade-offs. It can be harder to get going and more complex, but you end up with a lot more flexibility and you own your data and have complete control over it. While Snowflake gives you control of your data with their tools, everything has to go through Snowflake and incur their tax to get to it. You pay for simplicity, which many customers are ok with because they see value in it. On the other hand, a lot of customers see value in having more control and options. This market is big enough for everyone - it’s really just about market share.
"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."
Something like Snowflake works much better when you're building a platform that you can give to two hundred data analysts of various skills spread over fifty teams, so they can build their own stuff. The nice UI, broad feature set (materialized views, time travel, automatic backups, superfast scaling up and down, ...) and general just-work-iness make it nice for that, but you're going to pay for the privilege.
Databricks is somewhere in the middle - things are way less polished, features don't always work and you still have to figure out things like backups and partitions on S3 on your own, but some people like that. Expect to also pay a pretty penny for hundreds of Spark clusters nobody knows who uses.
* Databricks pivoted from analytics to ML and it's not just marketing. Clickhouse is all about OLAP use cases.
* Clickhouse competes with Druid/Pinot/Timescale, Spark competes with Flink.
[0] A solution to DeWitt clauses. https://danluu.com/anon-benchmark/
This is kind of understandable. Benchmarking complex software is complicated. It’s easy to give a totally wrong picture of things, either accidentally or deliberately.
The "Unbreakable" Marketing Campaign:
https://www.oreilly.com/library/view/the-oracle-hackers/9780...
https://www.zdnet.com/article/invincible-oracle-not-so-secur...
You know what, our company uses both Snowflake and Databricks.
For Databricks, there are one or two projects that someone built on it running in production. For Snowflake, there's sizeable use because we bought a smaller company that used it for reporting and warehousing. Neither of them is "the chosen tool", and neither will see any growth unless the wind changes. But we could be (F50 company) counted as a reference by both I guess.
We are now trying to scale unnamed technology running on EC2 from 100 nodes to 200 cores and the process to buy a larger license is pretty painful. If we were using Snowflake or Databricks, we could just scale it up and update our opex estimate.
If we had access to IP address of the posters, I sure would be interested in looking at correlation among them.
Does anyone else notice people questioning common sense?
disclaimer: works for databricks, but not on spark, and first time posting in this thread
> broad feature set
My experience is that the feature sets of Snowflake and Databricks are very similar. Both have time travel support. Snowflake has materialized views, but Databricks has Delta Live Tables. Databricks has a distributed Pandas API, but Snowflake recently introduced Snowpark. Databricks also has autoscaling and they recently launched a serverless offering that makes autoscaling super fast as well.
Databricks has some interesting features (we were originally interested in it as "nice UI" for our AWS data lake for citizen data scientists - using it for industrialized processing was price impractical compared to AWS Glue) but the security seems lacking - it goes just table level and only in SQL and Spark, with R you can't have security at all.
I really liked the Databricks UI and integrated visualizations, though, that's where they are better than Snowflake I think. Of course, they gained those by buying open source Redash.io and ending it.
The part that ended our PoC with them was when they gave us a price quote for expected number of users, the management was like "ok that sounds reasonable" until I told them that's just license and does not include EC2 costs - the real cost would be at least twice. That made everyone angry.
> Due to a TPC-internal error during the production of 3.2.0 of the TPC-DS kit, the benchmark execution had to use version 2.13 of the kit. It was confirmed by the TPC that the only changes between these two versions of the kit is the version number set in the tools/release.h parameter file.
How can there be that much of a delta of major/minor versions without a change? The only way that I see this happening is if 'change' is being defined in terms of the specific benchmark which was run, rather than the kit.
You can access data in Snowflake or BigQuery using JDBC or Python clients. You do pay for the compute that reads the data for you. You cannot access the data in storage directly.
You can access data in lakehouse directly, by going to cloud storage. That has two major challenges:
Lakehouse formats aren't easy to deal with. You need a smart engine (like Spark) to do that. But those engines are pretty heavy. Starting a Spark cluster to update 100 records in a table is wasteful.
The bigger challenge is security. Cloud storage can't give you granular access control. It only sees files, not tables and columns. So if you have a need for column or row-based security or data masking, you're out of luck. Cloud storage also makes it hard to assign even the non-granular access. Not sure about other clouds, but AWS IAM roles are hard to manage and don't scale for a large number of users/groups.
You can sidestep this by using a long-running engine (like Trino) and applying security there. Then you don't need to start Spark to change or query a few records. But it means you're basically implementing your own cloud warehouse.
Which honestly can be the way if that's what you want! You can also use multiple engines if you are ok with implementing security multiple times. To me, that doesn't seem to be worth it.
In the end, I don't see data that's one SELECT away as much more proprietary and "outsourced" than data that is one Spark/Trino cluster and then SELECT away, just because you can read the S3 it sits on.
Sadly, those things are mutually exclusive at the moment and with the way things are deployed here (large multi-tenant platforms), the security has to take priority.
But if that's not your situation, then obviously it makes sense to make use of that!
If you want simplicity, you can limit your engine to Databricks. You can also use JDBC/ODBC with Databricks if you want to use other tools that don’t support the delta format/parquet but piping data over JDBC/ODBC doesn’t scale with any tool to large datasets. Databricks has all the capabilities of big query/snowflake/redshift but none of those tools support python/r/scala. Their engines need to be rewritten from the ground up in order to do so.
Snowflake uses the Arrow data format with their drivers, so is plenty fast enough when retrieving data in general. But it would be way less efficient if a data scientist just does a SELECT * to bring everything back from a table to load into a notebook.
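A small illustration of the SELECT * point, using Python's built-in sqlite3 as a stand-in for a warehouse. That substitution is an assumption made so the sketch runs anywhere; nothing here is Snowflake's API, and the table and column names are made up. Both functions compute the same totals, but one ships every row to the client and the other ships one row per group.

```python
import sqlite3
from collections import defaultdict


def make_db(n_rows: int = 10_000) -> sqlite3.Connection:
    """Build an in-memory table of synthetic events (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (region TEXT, amount INTEGER)")
    rows = [(f"r{i % 4}", i % 100) for i in range(n_rows)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def totals_client_side(conn: sqlite3.Connection):
    """Pull every row to the client, then aggregate in Python."""
    rows = conn.execute("SELECT * FROM events").fetchall()
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals), len(rows)


def totals_pushed_down(conn: sqlite3.Connection):
    """Let the engine aggregate; only one row per region comes back."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM events GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = make_db()
    client_totals, client_rows = totals_client_side(conn)
    pushed_totals, pushed_rows = totals_pushed_down(conn)
    print("same answer:", client_totals == pushed_totals)
    print("rows transferred:", client_rows, "vs", pushed_rows)
```

A fast wire format helps with the first approach, but it can't make up for moving rows that never needed to leave the engine.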
Snowflake has had Scala support since earlier in the year, along with Java UDFs, and also just announced Python support - not a Python connector, but executing Python code directly on the Snowflake platform. Not GA yet though.
All leaders in a space take this approach. There's little to be gained, and a fair bit to lose, if you are ALREADY leading without having to debate / do a benchmark etc.
Anyways, the benchmark is only one part of the overall story for these solutions.
Implementations of protocols like ODBC/JDBC generally implement their custom on-wire binary protocols that must be marshalled to/from the lib - and the performance would vary a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.
A few reasons why the Databricks platform shines here:
1) Not limited to just UDFs - extensions to improve performance, including GPU acceleration in XGBoost and distributed deep learning using HorovodRunner.
2) End-to-end MLOps solution - including Feature Store, Model Registry & Model Serving
3) Open approach with https://www.mlflow.org/
4) Glass box (not black box) model for AutoML
It is a solved problem. Essentially you need a central place ( with decentralized ownership for the datamesh fans ) to specify the ACLS ( row-based, column-based, attribute-based etc.) - and an enforcement layer that understands these ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality etc., go hand in glove.
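For anyone who wants to see the shape of "central ACLs plus an enforcement layer", here is a minimal Python sketch. It is not how Databricks or any other product implements this; the roles, table, columns and the deny-by-default rule are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Policy:
    columns: frozenset[str]             # column-level: what the role may see
    row_filter: Callable[[Row], bool]   # row-level: which rows the role may see


# The central place where ACLs are declared, keyed by (role, table).
# Roles, table name and columns are hypothetical.
POLICIES: dict[tuple[str, str], Policy] = {
    ("analyst_eu", "customers"): Policy(
        columns=frozenset({"id", "country"}),
        row_filter=lambda r: r["country"] in {"DE", "FR"},
    ),
    ("admin", "customers"): Policy(
        columns=frozenset({"id", "country", "email"}),
        row_filter=lambda r: True,
    ),
}


def enforce(role: str, table: str, rows: list[Row]) -> list[Row]:
    """Enforcement layer: apply the role's row filter, then project columns.

    A role with no declared policy gets nothing back (deny by default).
    """
    policy = POLICIES.get((role, table))
    if policy is None:
        return []
    return [
        {k: v for k, v in row.items() if k in policy.columns}
        for row in rows
        if policy.row_filter(row)
    ]


if __name__ == "__main__":
    data = [
        {"id": 1, "country": "DE", "email": "a@example.com"},
        {"id": 2, "country": "US", "email": "b@example.com"},
    ]
    print(enforce("analyst_eu", "customers", data))
    print(enforce("intern", "customers", data))
```

The catch raised elsewhere in the thread still applies: this only protects data that is read through the enforcement layer, so anything that can reach the underlying files directly bypasses it.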
Security is front and centre for almost all organizations now.
This is what I've personally seen a few times - Databricks claiming they can do something and then it turns out they can't. Buyer beware lying salespeople and HN shills.
[1]: https://docs.databricks.com/administration-guide/access-cont...
Snowflake now offers Scala, Java and Python support, so it would seem their capabilities are converging even more, but both with their own strengths due to their respective histories.
Snowpark is still inferior.