Learning About Databricks Coming From Snowflake
Snowflake
For the past 5 years or so, my career has been centered around using Snowflake (and I have been a huge fan of it).
At its core, Snowflake is a fully managed, cloud-native data warehouse that abstracts away most of the underlying complexity of storage, compute, and optimization. Data is stored in Snowflake-managed storage, while compute is handled through virtual warehouses that can be scaled independently.
This architecture allows users to focus almost entirely on querying and modeling data, rather than worrying about how it is physically stored or how execution is distributed. Many of the traditional concerns found in distributed systems, such as partitioning, file layout, and resource coordination, are handled automatically by the platform.
As a result, Snowflake promotes a SQL-first, analytics-driven workflow where transformations are primarily expressed declaratively, and performance optimization is largely handled behind the scenes.
Snowflake is an excellent data warehouse solution with many benefits:
Fast query times
Comprehensive query, task and price monitoring
Database objects for the most common operations are easy to create: views, tasks (closer to scheduled jobs than to triggers in other DBMSs), and procedures
Python API
Wide selection of connectors to other systems
Cortex: AI assistance for writing queries
Integrated Python notebooks
Streamlit: Python visualization applications
Scalable virtual warehouses
Built-in query optimization
Databricks
Recently, I have been curious about Databricks.
After reading Delta Lake: The Definitive Guide and completing my first project in Databricks, I feel I have at least had a decent introduction to the platform.
Databricks provides a workspace and orchestration layer that runs Spark-based compute (PySpark or SQL) against data stored in cloud object storage (such as S3 or ADLS).
Data is organized as Delta Lake tables, which consist of Parquet files for the actual data and a transaction log (JSON + checkpoints) that tracks changes, enabling ACID transactions and versioning.
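The transaction log is worth making concrete. Below is a minimal sketch in plain Python of how a Delta reader replays `add` and `remove` actions to reconstruct the table's current set of Parquet files; the hand-written log entries stand in for the real JSON files under `_delta_log/`, which carry many more fields (stats, partition values, timestamps, and so on).

```python
import json

# Simplified stand-ins for the JSON action entries Delta writes under _delta_log/.
log_entries = [
    '{"add": {"path": "part-0000.parquet"}}',
    '{"add": {"path": "part-0001.parquet"}}',
    # A later commit rewrites part-0000 (e.g. an UPDATE) and adds its replacement.
    '{"remove": {"path": "part-0000.parquet"}}',
    '{"add": {"path": "part-0002.parquet"}}',
]

def active_files(entries):
    """Replay add/remove actions in commit order to find the live Parquet files."""
    files = set()
    for line in entries:
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return sorted(files)

print(active_files(log_entries))
# -> ['part-0001.parquet', 'part-0002.parquet']
```

Versioning ("time travel") falls out of the same design: replaying the log only up to an earlier commit reconstructs the table as of that version.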
Databricks represents a shift from traditional data warehouses to a “lakehouse” architecture, where storage and compute are decoupled, and large-scale data processing is handled by distributed systems rather than a centralized query engine.
This design exposes more of the underlying mechanics to the engineer: file layout, partitioning strategies, and cluster behavior all directly impact performance and cost. In exchange, it offers significantly more flexibility and control over how data pipelines are built and executed.
Instead of relying primarily on declarative SQL transformations like in Snowflake, workflows in Databricks are often expressed as code-driven pipelines (e.g., PySpark), making the platform feel closer to a software engineering environment than a traditional analytics system.
When Should I Use Databricks?
Snowflake has been great to use over the years, but there are a couple of blind spots.
Prior to the release of Openflow in May 2025 (which I need to explore further), organizations typically relied on external tools or custom-built pipelines to get data into Snowflake. A common pattern was to replicate data from systems like SQL Server into a raw ingestion layer using tools such as Airbyte or Fivetran. In contrast, Databricks treats Python as a first-class interface, which allows teams to build their own ingestion pipelines directly within the platform. For example, you can create Python-based jobs or containers to extract data from SQL Server and load it into Delta Lake without relying on external connectors, although implementing reliable change data capture still requires additional effort. Databricks also makes it straightforward to pull data from APIs directly using its notebook environment, which adds to its flexibility as a full data engineering platform.
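To make that ingestion pattern concrete, here is a minimal sketch of an extract-to-raw step. It uses stdlib sqlite3 as a stand-in for SQL Server so the example runs anywhere; in a real Databricks job you would connect via a JDBC or pyodbc driver and land the files in cloud object storage rather than a local directory. The table and column names are illustrative only.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# sqlite3 stands in for SQL Server here (assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])

def extract_to_raw(connection, table, landing_dir):
    """Dump a source table to newline-delimited JSON in a raw landing directory."""
    landing_dir = Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [c[0] for c in cursor.description]
    out_path = landing_dir / f"{table}.jsonl"
    with out_path.open("w") as f:
        for row in cursor:
            f.write(json.dumps(dict(zip(columns, row))) + "\n")
    return out_path

path = extract_to_raw(conn, "orders", tempfile.mkdtemp())
print(path.read_text())
```

From there, a second step would read the landed files into a Delta table; incremental loads and change data capture are where the real additional effort mentioned above comes in.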
Another scenario where Databricks is a strong fit is when working with very large and complex JSON/table transformations. Because it uses distributed Spark execution, it can handle heavy joins, full refresh pipelines, and large-scale processing more efficiently than traditional warehouse patterns, especially when performance tuning and parallelism are important.
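As one small illustration of that kind of transformation, here is a sketch that flattens nested JSON records into dotted column names. In Databricks this reshaping would typically be expressed in PySpark (dotted column references for structs, `explode` for arrays) and executed in a distributed fashion; plain Python is used here so the logic is runnable anywhere.

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into a flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Descend into nested objects, extending the column name.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

event = {"user": {"id": 7, "geo": {"country": "US"}}, "amount": 12.5}
print(flatten(event))
# -> {'user.id': 7, 'user.geo.country': 'US', 'amount': 12.5}
```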
Databricks is also a better choice when your workflows extend beyond SQL into more programmatic or machine learning use cases. If your pipelines require custom logic, advanced data processing, or integration with ML models, the Python-first environment provides more flexibility than a SQL-centric system. Snowflake has built-in model libraries within Python containers, but I find the developer experience (autocomplete, flexibility in packages, etc.) lacking compared to Databricks.
That said, for most organizations, straightforward SQL-first analytics within a well-managed data warehouse like Snowflake is a significant advantage on its own. The simplicity, reliability, and speed of development that platforms like Snowflake provide are often exactly what teams need to deliver value quickly. For the companies I have partnered with, this model has been more than sufficient, allowing them to focus more on insights and business outcomes and less on the complexity of data infrastructure, such as Delta table maintenance on Databricks.
Overall, Databricks has more features, but at the cost of additional complexity and maintenance that data teams may not want to absorb. In many cases, the simplicity and reliability of a managed warehouse like Snowflake is a more practical choice.

