DuckDB In Production
Catching ELT Deduplication Issues with DuckDB
Have you ever synced data from SQL Server into Snowflake, only to find duplicate rows that didn’t exist in the source?
That’s exactly the problem I ran into.
I was using Airbyte’s CDC connector to replicate changes, and despite configuring the cursor on the primary key, Snowflake sometimes ended up with multiple rows for the same record. Subtle data issues like this don’t always crash dashboards, but they erode trust in the pipeline fast.
After failing to fix the issue inside Airbyte itself, I reached for DuckDB. It turned out to be the simplest way to validate what actually landed in Snowflake.
What’s DuckDB Again?
DuckDB is an in-process OLAP database, often described as “SQLite for analytics.” It’s fast, lightweight, and surprisingly flexible: you can query local Parquet/CSV files, but also connect to remote systems via ODBC.
If you’ve ever used AWS Athena to query S3 without loading data into a warehouse, DuckDB feels similar, except it runs locally, in-process, with zero infrastructure overhead.
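To make that concrete, here's a minimal sketch using DuckDB's Python API, assuming a hypothetical local file named events.parquet:

import duckdb

# Query a local Parquet file directly: no server, no schema setup, no load step.
duckdb.sql("SELECT count(*) AS n FROM 'events.parquet'").show()

# CSV works the same way.
duckdb.sql("SELECT * FROM 'events.csv' LIMIT 5").show()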
For a quick primer, here’s the DuckDB Fireship video.
My Pipeline Setup
Here’s the environment where the deduplication issue surfaced:
Applications write into a primary SQL Server.
A read replica keeps analytics workloads off the primary.
Airbyte uses CDC from the replica to load data into Snowflake every minute.
Tableau connects to Snowflake for BI dashboards.
DuckDB sits alongside SQL Server and Snowflake as a lightweight validation layer.
Attempts to Fix It in Airbyte
Before introducing DuckDB, I tried to solve this inside Airbyte:
Resetting and re-syncing the connector.
Rebuilding the CDC config with explicit PK and cursor fields.
Testing the same PK setup I’d used with Fivetran, where deduplication worked fine.
Inspecting raw landing tables in Snowflake to confirm CDC events looked valid.
Despite those attempts, the duplicates persisted. That’s when I realized I needed a way to independently validate the pipeline: compare SQL Server (source of truth) against Snowflake (target) without relying on Airbyte’s internal logic.
DuckDB as a Federated Query Engine
With DuckDB’s ODBC extension, I could query SQL Server and Snowflake side by side in the same SQL statement. This made it the perfect federated query engine for validation.
The validation steps I used were simple:
Compare row counts between the two systems.
Compare distinct primary key counts.
Identify duplicate PKs in either source.
The Validation Query
Here’s the SQL I used in DuckDB:
INSTALL odbc;
LOAD odbc;
-- Store the connection strings as variables (referenced below via getvariable)
SET VARIABLE ss_conn = 'Driver={ODBC Driver 18 for SQL Server};Server=tcp:sql-prod.acme.internal,1433;Database=OpsDB;UID=etl_reader;PWD=REDACTED;';
SET VARIABLE sf_conn = 'Driver=SnowflakeDSIIDriver;Server=acme_xyz123.us-east-1.snowflakecomputing.com;Warehouse=ANALYST_WH;Database=ANALYTICS;Schema=CORE;UID=ETL_READER;PWD=REDACTED;';
-- Compare row counts and distinct PKs
WITH
ss AS (SELECT * FROM odbc_query(getvariable('ss_conn'), 'SELECT order_id, status, updated_at FROM dbo.orders')),
sf AS (SELECT * FROM odbc_query(getvariable('sf_conn'), 'SELECT order_id, status, updated_at FROM CORE.ORDERS'))
SELECT
'sqlserver' AS source, count(*) AS rows_total, count(DISTINCT order_id) AS distinct_pk
FROM ss
UNION ALL
SELECT
'snowflake', count(*), count(DISTINCT order_id)
FROM sf;
-- Duplicate PKs in SQL Server
SELECT order_id, count(*) AS dup_count
FROM odbc_query(getvariable('ss_conn'), 'SELECT order_id FROM dbo.orders')
GROUP BY order_id
HAVING count(*) > 1
ORDER BY dup_count DESC, order_id
LIMIT 100;
-- Duplicate PKs in Snowflake
SELECT order_id, count(*) AS dup_count
FROM odbc_query(getvariable('sf_conn'), 'SELECT order_id FROM CORE.ORDERS')
GROUP BY order_id
HAVING count(*) > 1
ORDER BY dup_count DESC, order_id
LIMIT 100;
With just a few queries, I could confirm whether Snowflake’s data matched SQL Server’s — and exactly where things went wrong.
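When the counts disagree, the natural follow-up is a row-level diff to see exactly which keys are affected. Here's a sketch of that check driven from Python with DuckDB's API, reusing the same odbc_query pattern; the connection strings are placeholders for the ones above:

import duckdb

con = duckdb.connect()
con.execute("INSTALL odbc")
con.execute("LOAD odbc")
con.execute("SET VARIABLE ss_conn = 'Driver={ODBC Driver 18 for SQL Server};...'")  # placeholder
con.execute("SET VARIABLE sf_conn = 'Driver=SnowflakeDSIIDriver;...'")              # placeholder

# Per-key row counts on both sides; any key where the counts differ is either
# duplicated in Snowflake or missing from one system entirely.
diff = con.execute("""
    WITH ss AS (SELECT order_id, count(*) AS n
                FROM odbc_query(getvariable('ss_conn'), 'SELECT order_id FROM dbo.orders')
                GROUP BY order_id),
         sf AS (SELECT order_id, count(*) AS n
                FROM odbc_query(getvariable('sf_conn'), 'SELECT order_id FROM CORE.ORDERS')
                GROUP BY order_id)
    SELECT order_id, ss.n AS source_rows, sf.n AS target_rows
    FROM ss FULL OUTER JOIN sf USING (order_id)
    WHERE coalesce(ss.n, 0) <> coalesce(sf.n, 0)
    ORDER BY order_id
""").df()
print(diff)

An empty result means the two systems agree key-for-key; anything else is the exact set of records to investigate.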
Beyond Manual Checks
This solved my immediate problem, but it’s easy to take further:
Wrap the validation in a Python script using DuckDB’s API.
Run it across all replicated tables, not just one.
Export mismatches into Pandas DataFrames for analysis.
Package it in a Docker container to run on a schedule.
At that point, DuckDB becomes a lightweight data quality watchdog for your ELT pipeline.
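Here's a rough sketch of what that watchdog could look like: a loop over a hypothetical table map, reusing the same odbc_query function. The table names, PK columns, and connection strings below are placeholders, not my actual setup:

import duckdb

# Hypothetical map of replicated tables: (SQL Server name, Snowflake name, PK column).
TABLES = [
    ("dbo.orders", "CORE.ORDERS", "order_id"),
    ("dbo.customers", "CORE.CUSTOMERS", "customer_id"),
]

SS_CONN = "Driver={ODBC Driver 18 for SQL Server};Server=...;PWD=REDACTED;"  # placeholder
SF_CONN = "Driver=SnowflakeDSIIDriver;Server=...;PWD=REDACTED;"              # placeholder

con = duckdb.connect()
con.execute("INSTALL odbc")
con.execute("LOAD odbc")
# Naive quoting is fine here: the connection strings contain no single quotes.
con.execute(f"SET VARIABLE ss_conn = '{SS_CONN}'")
con.execute(f"SET VARIABLE sf_conn = '{SF_CONN}'")

for ss_table, sf_table, pk in TABLES:
    ss_rows, ss_pks, sf_rows, sf_pks = con.execute(f"""
        WITH ss AS (SELECT * FROM odbc_query(getvariable('ss_conn'),
                                             'SELECT {pk} FROM {ss_table}')),
             sf AS (SELECT * FROM odbc_query(getvariable('sf_conn'),
                                             'SELECT {pk} FROM {sf_table}'))
        SELECT (SELECT count(*) FROM ss), (SELECT count(DISTINCT {pk}) FROM ss),
               (SELECT count(*) FROM sf), (SELECT count(DISTINCT {pk}) FROM sf)
    """).fetchone()
    status = "OK" if (ss_rows, ss_pks) == (sf_rows, sf_pks) else "MISMATCH"
    print(f"{ss_table} -> {sf_table}: {status} "
          f"(source {ss_rows} rows / {ss_pks} keys, target {sf_rows} rows / {sf_pks} keys)")

Drop that in a container with a scheduler and you have continuous source-versus-target validation that doesn't depend on the pipeline's own bookkeeping.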
Final Thoughts
Airbyte didn’t give me the deduplication guarantees I needed, and no amount of connector tweaking fixed it. But DuckDB gave me a fast, reliable way to validate whether Snowflake matched SQL Server, without standing up new infrastructure.
It’s not a replacement for Airbyte or Snowflake. Instead, it’s a small, embeddable tool that acts as a federated query safety net. Much like Athena gives you SQL over S3 without loading a warehouse, DuckDB gives you federated visibility across your stack, right from your laptop, with nothing more than SQL.
For me, DuckDB has become the tool I reach for when I need to answer the question: “Did my pipeline actually deliver the data I expected?”