Why Snowflake will ultimately beat Databricks

Having spent my career in the data and machine learning worlds, I’ve watched Snowflake and Databricks pretty closely. On the surface, they seem like symmetrical competitors. Snowflake’s core business is enterprise data warehousing, and they would very much like to expand into the ML workloads that are core to Databricks. Databricks has much more mature ML functionality, and would love to expand into Snowflake’s core data warehousing market. 

In fact, the truth is a little more subtle. Databricks was born alongside Spark, and dates to the days before cloud data warehouses could scale to fit all of a large enterprise’s data. These enterprises had to adopt a so-called “big data” or “data lake” technology like Spark to work with truly large data volumes. Databricks’s ML functionality, consisting of hosted Jupyter-style notebooks running Python with PySpark – as well as MLflow for productionizing models built in these notebooks – dates to that era and is closely tied to that technology lineage. 

It was Snowflake’s launch – and Amazon Redshift’s – that inaugurated the next era of data platforms. By separating compute from storage, and presenting a single SQL interface to compute warehouses that all read from the same shared storage layer, Snowflake allowed us to store all our enterprise data in a columnar SQL-native warehouse for the first time. This was a major improvement in performance, efficiency, and ease of use compared to the “data lake” platforms like Spark, and it effectively ended the “big data” era. 
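
To make the compute/storage split concrete, here is a toy Python sketch of the idea. It is not Snowflake's implementation or API: the class names (SharedStorage, ComputeWarehouse), the "orders" table, and its values are all made up for illustration. The only point is that several independent compute clusters can query one shared columnar store, and that a columnar layout lets a query touch just the column it needs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SharedStorage:
    """Hypothetical stand-in for a durable storage layer of columnar tables."""
    tables: dict[str, dict[str, list[float]]] = field(default_factory=dict)

    def write(self, table: str, columns: dict[str, list[float]]) -> None:
        # Each table is stored column by column, not row by row.
        self.tables[table] = columns

    def read_column(self, table: str, column: str) -> list[float]:
        return self.tables[table][column]


@dataclass
class ComputeWarehouse:
    """Hypothetical compute cluster: holds no data, only a storage reference."""
    name: str
    storage: SharedStorage

    def column_sum(self, table: str, column: str) -> float:
        # Only the requested column is read; other columns are never touched.
        return sum(self.storage.read_column(table, column))


# One storage layer, written once (illustrative data only).
storage = SharedStorage()
storage.write("orders", {"amount": [10.0, 20.0, 12.5], "quantity": [1.0, 2.0, 1.0]})

# Two separately provisioned warehouses query the same data without copying it.
etl = ComputeWarehouse("etl", storage)
bi = ComputeWarehouse("bi", storage)

etl_total = etl.column_sum("orders", "amount")
bi_total = bi.column_sum("orders", "amount")
print(etl_total, bi_total)
```

In a real system each warehouse would be sized, paused, and billed independently of the data it reads, which is where the efficiency gain comes from; the sketch only shows the shape of the arrangement.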

So Databricks is built on a data platform that’s a generation old. Its data science and ML functionality have the advantages of being mature and battle-tested by a generation of ML professionals. However, like the platform they’re built on, they’re not particularly modern. In practice, orchestration usually looks like scheduled notebooks. Putting classic ML models into MLflow, while mature and flexible, doesn’t have modern developer ergonomics. The whole thing is a more natural fit for batch jobs operating on data in Spark clusters than for ML executing on demand to serve an online product. And none of it is especially well-suited to modern neural networks or other AI technologies.

Snowflake’s data platform, by contrast, is modern. A single columnar warehouse that scales to enterprise volumes is what everyone wants for their data platform, and Snowflake delivers. But Snowflake’s disadvantage is that its ML products are all v1 (or v0.1), and are currently so limited as to be not truly production-ready. At least, not without a lot of fighting. They’re also pretty disconnected from each other: Snowpark Python scales well but only works with a small whitelist of Anaconda packages and no custom systems at all. Snowpark Container Services doesn’t scale particularly well but lets you run fully custom code. Snowflake ML is a set of Python notebook APIs and UI metaphors that aren’t very well integrated with either of these things. The picture is of a set of promising new products that have just launched, aren’t yet production-ready, and aren’t yet telling a coherent story together.

But the good news for Snowflake is they’re sitting on a modern platform and expanding. In some ways it’s an optimistic sign for Snowflake’s dev team that they are able to ship early software and get market feedback on it. They’re operating from a position of strength in having a truly modern data platform in market, and they’re trying to innovate out from there. They haven’t proven it yet with a coherent set of modern ML tools, but the opportunity is in front of them, and they do seem to be executing.

Databricks’s position is tougher. Having built on a platform that’s no longer state of the art, they are staring down the wrong end of the innovator’s dilemma. Their ML platforms work well at scale, unlike Snowflake’s, but that advantage won’t last forever. Meanwhile their data platform is obsolete, and unlikely to take share from Snowflake on that basis. 

Snowflake is well known to be a strong business. Databricks is still private, but is sometimes rumored to be an even better business at present. But Databricks has shown no sign of trying to tackle the obsolescence that’s coming for their core platform. Like BlackBerry when the iPhone launched, they can take short-term comfort that their tooling for certain use cases is more mature. But also like BlackBerry, it’s unclear how they can overcome their long-term structural disadvantage. Databricks has strengths. But if offered a choice, I’d take a share of Snowflake every time.