Let's assume I have two parquet files, main and updates, with the data shown below. I need to implement an upsert-style operation in DuckDB.
parquet file main: id, name, city
data:
id name city
1 a p
2 b q
3 c r
parquet file updates: id, name, city
data:
id name city
1 a m
4 b q
desired output:
id name city
1 a m <----update city
2 b q
3 c r
4 b q <----insert
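(For anyone who wants to reproduce this, the two sample files can be generated with DuckDB itself; this is only a sketch using the paths and values from the tables above.)
-- create the sample parquet files used in this question
COPY (SELECT * FROM (VALUES (1, 'a', 'p'), (2, 'b', 'q'), (3, 'c', 'r')) t(id, name, city))
TO '/tmp/main.parquet' (FORMAT 'PARQUET');
COPY (SELECT * FROM (VALUES (1, 'a', 'm'), (4, 'b', 'q')) t(id, name, city))
TO '/tmp/updates.parquet' (FORMAT 'PARQUET');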
Currently I am doing this with the queries below:
create table main as select * from '/tmp/main.parquet';
create table stage as select * from '/tmp/updates.parquet';
delete from main using stage where main.id=stage.id;
insert into main select * from stage;
COPY main TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
The only problem is that CREATE TABLE loads and keeps all the data in memory, which I don't want since the main file may contain 8-10 million records. Is there any way to achieve the same result using joins only, without creating the main and stage tables?
Answers
In DuckDB you can perform this upsert without materializing the main and updates data as tables: query the parquet files directly and stream the merged result into the output file with a single COPY. Here's how you can do it:
-- Expose the parquet files as views; this only registers the queries,
-- no data is read or kept in memory at this point
CREATE VIEW main AS SELECT * FROM '/tmp/main.parquet';
CREATE VIEW updates AS SELECT * FROM '/tmp/updates.parquet';

-- Merge and export in a single pass:
-- take every row from updates (updated and new rows), append the rows
-- from main whose id has no match in updates, and stream the result
-- straight to the output file
COPY (
    SELECT * FROM updates
    UNION ALL
    SELECT m.*
    FROM main AS m
    WHERE m.id NOT IN (SELECT id FROM updates)
) TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
In this approach, the views are just named shortcuts over the parquet files, so creating them does not load any data. The COPY statement then builds the final dataset in one query: it takes every row from updates (these cover both the changed rows and the brand-new rows) and appends the rows from main whose id does not appear in updates, i.e. an anti-join on id. The merged result is written directly to /tmp/final.parquet, so main is never modified in place and never materialized as a table.
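If you want to sanity-check the result, the final parquet file can be queried by path just like the inputs; for the sample data above this should return ids 1-4, with city 'm' for id 1:
-- inspect the merged output
SELECT * FROM '/tmp/final.parquet' ORDER BY id;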
This approach avoids loading the full dataset into memory: DuckDB scans the parquet files and streams the merged rows to the output file, and essentially only the ids from the (much smaller) updates file need to be kept around for the NOT IN check. That makes it considerably more memory-friendly than creating tables first, especially with 8-10 million records in main.
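If you want to skip creating the views as well, the same merge can be written as one self-contained statement over the file paths (same assumptions about the paths and the id key as above):
COPY (
    SELECT * FROM '/tmp/updates.parquet'
    UNION ALL
    SELECT m.*
    FROM '/tmp/main.parquet' AS m
    WHERE m.id NOT IN (SELECT id FROM '/tmp/updates.parquet')
) TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');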