Upsert using DuckDB

ghz 8 months ago ⋅ 60 views

Let's assume I have two parquet files, main and updates, with the data shown below. I need to implement an upsert-style operation in DuckDB.

parquet file main: id, name, city

data:
   id     name     city 
    1      a         p
    2      b         q
    3      c         r
    

parquet file updates: id, name, city

data:
   id     name     city
    1      a         m
    4      b         q

desired output:

   id     name     city
    1      a         m  <----update city
    2      b         q
    3      c         r
    4      b         q  <----insert
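As a sanity check on the semantics — an update row wins on a matching id, otherwise the main row is kept — the tables above can be merged in a few lines of plain Python (dict literals standing in for the parquet rows; no DuckDB involved):

```python
# Rows keyed by id; values are (name, city). These literals mirror the
# sample tables above, not real parquet reads.
main = {1: ("a", "p"), 2: ("b", "q"), 3: ("c", "r")}
updates = {1: ("a", "m"), 4: ("b", "q")}

# Upsert: start from main, let updates overwrite or add rows.
merged = {**main, **updates}

for id_, (name, city) in sorted(merged.items()):
    print(id_, name, city)
```

This prints the four desired rows: id 1's city changed to m, and id 4 inserted.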

Currently I am doing this with the queries below:

create table main as select * from '/tmp/main.parquet';
create table stage as  select * from '/tmp/updates.parquet';
delete from main using stage where main.id=stage.id;
insert into main select * from stage;
COPY main TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

The only problem is that CREATE TABLE loads and keeps all the data in memory, which I want to avoid since the main file may contain 8-10 million records. Is there any way to achieve the same result using joins only, without creating the main and stage tables?

Answers

In DuckDB, you can achieve the upsert without creating tables for the main and updates data. One caveat first: DuckDB views are read-only, so you cannot run UPDATE or INSERT against a view over a parquet file. Instead, express the whole merge as a single read-only query — every row from updates, plus every row from main whose id is not superseded by an update — and COPY the result straight out:

-- Views only attach names to the parquet scans; no data is loaded yet
CREATE VIEW main AS SELECT * FROM '/tmp/main.parquet';
CREATE VIEW updates AS SELECT * FROM '/tmp/updates.parquet';

-- Upsert in one read-only query: all update rows, plus the main rows
-- whose id does not appear in updates
COPY (
    SELECT * FROM updates
    UNION ALL
    SELECT m.*
    FROM main m
    WHERE NOT EXISTS (SELECT 1 FROM updates u WHERE u.id = m.id)
) TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

Here the NOT EXISTS acts as an anti-join: it filters out the rows of main that have a matching id in updates, and UNION ALL then appends all the update rows — both the changed ones (id 1) and the brand-new ones (id 4). This is exactly the "joins only" formulation you asked for, and it replaces the delete-then-insert pair with one statement.

Because the parquet files are scanned as part of a single query rather than materialized as tables first, DuckDB streams the data through its executor; only the state for the NOT EXISTS check has to be kept around. That should be far more memory-friendly for a main file with 8-10 million records.
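To see why this can stay memory-friendly: only the (typically small) updates side needs to be held in memory, while main is streamed row by row. A minimal Python sketch of that streaming merge, under the assumption that rows are tuples with the id first (`upsert_stream` is a hypothetical helper; DuckDB does the equivalent internally over the parquet scans):

```python
def upsert_stream(main_iter, update_rows):
    """Yield upserted rows while holding only the updates in memory."""
    pending = {row[0]: row for row in update_rows}  # keyed by id
    for row in main_iter:
        # An update with the same id replaces the main row.
        yield pending.pop(row[0], row)
    # Ids never seen in main are brand-new rows: append them.
    yield from pending.values()

main_rows = [(1, "a", "p"), (2, "b", "q"), (3, "c", "r")]
update_rows = [(1, "a", "m"), (4, "b", "q")]
result = list(upsert_stream(iter(main_rows), update_rows))
# result == [(1, "a", "m"), (2, "b", "q"), (3, "c", "r"), (4, "b", "q")]
```

Note that `main_rows` is consumed as an iterator: nothing from the main side is accumulated, which is what makes the approach scale to millions of rows.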