Optimize query to insert random rows from tableA into tableB, without duplicates
I am still new to SQL, using PostgreSQL 16.2.
create table users(
    user_id serial primary key
);

create table task_pool(
    task_id serial primary key,
    order_id integer,
    clicker_id integer
);

-- index on order_id, created separately (used by the subquery on task_pool)
create index idx_order_id on task_pool (order_id);

create table task_dispersal(
    id serial primary key,
    order_id integer,
    to_disp integer
);
There are 100,000 rows in users (so 100,000 user_ids); 100 rows in task_dispersal, each with to_disp as 10. Starting with 0 rows in task_pool.
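In case it helps to reproduce, the tables can be filled with test data along these lines (the exact values don't matter, only the row counts; generate_series is just one way to get there, not how the real data is loaded):

-- 100,000 users (explicit ids, so bump the serial sequence afterwards)
insert into users (user_id)
select g from generate_series(1, 100000) g;
select setval('users_user_id_seq', 100000);

-- 100 dispersal rows, each wanting 10 random users
insert into task_dispersal (order_id, to_disp)
select g, 10 from generate_series(1, 100) g;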
I want to iterate through the rows in task_dispersal. For each row:
- Get to_disp random user_ids from the users table that are not already in task_pool with the iterated row's order_id.
- For example, if a row in task_dispersal has order_id 1 and to_disp 10, it would check task_pool for all rows with order_id 1, get the user_ids from those rows, filter them out of the users table, and then choose 10 random user_ids to insert into task_pool.
I am using this query:
INSERT INTO task_pool(clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U;
It works and is relatively fast, but I am still hoping there is a way to optimize it, because it takes 26-27 seconds on my (very weak) hosted database and I am aiming for a few seconds.
Here is the query plan:
"Insert on public.task_pool (cost=6586.23..674875.51 rows=0 width=0) (actual time=26414.647..26414.649 rows=0 loops=1)"
" Buffers: shared hit=33860 dirtied=22 written=22"
" -> Nested Loop (cost=6586.23..674875.51 rows=500000 width=24) (actual time=1815.128..26402.217 rows=1000 loops=1)"
" Output: nextval('task_pool_task_id_seq'::regclass), td.order_id, us.user_id, CURRENT_TIMESTAMP, 0"
" Buffers: shared hit=28619"
" -> Seq Scan on public.task_dispersal td (cost=0.00..3.00 rows=100 width=8) (actual time=0.008..0.120 rows=100 loops=1)"
" Output: td.id, td.order_id, td.reset_time, td.disp_remaining, td.daily_disp, td.disp_interval, td.next_disp_time, td.expired_tasks, td.to_disp"
" Buffers: shared hit=2"
" -> Limit (cost=6586.23..6598.73 rows=5000 width=12) (actual time=251.784..251.786 rows=10 loops=100)"
" Output: us.user_id, (random())"
" Buffers: shared hit=27607"
" -> Sort (cost=6586.23..6711.23 rows=50000 width=12) (actual time=251.781..251.782 rows=10 loops=100)"
" Output: us.user_id, (random())"
" Sort Key: (random())"
" Sort Method: top-N heapsort Memory: 25kB"
" Buffers: shared hit=27607"
" -> Index Only Scan using users_pkey on public.users us (cost=2.96..2181.57 rows=50000 width=12) (actual time=3.048..139.824 rows=100000 loops=100)"
" Output: us.user_id, random()"
" Filter: (NOT (hashed SubPlan 1))"
" Heap Fetches: 0"
" Buffers: shared hit=27607"
" SubPlan 1"
" -> Index Scan using idx_order_id on public.task_pool task_pool_1 (cost=0.15..2.63 rows=16 width=4) (actual time=0.022..0.022 rows=0 loops=100)"
" Output: task_pool_1.clicker_id"
" Index Cond: (task_pool_1.order_id = td.order_id)"
" Buffers: shared hit=106"
"Settings: effective_io_concurrency = '200', random_page_cost = '1.1', effective_cache_size = '192MB', max_parallel_workers = '1', max_parallel_workers_per_gather = '1', work_mem = '1703kB'"
"Query Identifier: 3069500943293269296"
"Planning:"
" Buffers: shared hit=40 read=3"
"Planning Time: 0.317 ms"
"JIT:"
" Functions: 22"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 1.757 ms, Inlining 208.912 ms, Optimization 913.724 ms, Emission 399.368 ms, Total 1523.761 ms"
"Execution Time: 26416.272 ms"
I don't really know how to read the query plan, but I am pretty sure that the ORDER BY RANDOM() over the user_ids that are not already in task_pool for the order_id is the most costly part?
I don't think I can use tablesample bernoulli to get random users from the users table though, as some or all of the sampled rows might already be in task_pool, so it would insert fewer than the required amount (to_disp) of user_ids (this happens fairly often after running the query a few times).
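Roughly what I mean (with order_id 1 as a placeholder, and a sample fraction picked to give about 10 of the 100,000 users):

-- samples ~0.01% of users (about 10 rows), but nothing stops the sampled
-- user_ids from already being in task_pool for this order_id, so after
-- filtering, fewer than to_disp rows may be left to insert
SELECT user_id
FROM users TABLESAMPLE BERNOULLI (0.01)
WHERE user_id NOT IN (
    SELECT clicker_id FROM task_pool WHERE order_id = 1
);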
If I could use tablesample bernoulli on the RESULT of
WHERE user_id NOT IN (
    SELECT clicker_id
    FROM task_pool
    WHERE order_id = TD.order_id
)
that would be awesome; unfortunately tablesample can only be applied to tables, temp tables, or materialized views, and it samples before the WHERE clause is evaluated, not after.
I have tried making a temp table of the unused user_ids for each row and then using tablesample bernoulli on it, but that is very slow when each row has up to 100,000 unused user_ids.
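That attempt looked roughly like this for a single task_dispersal row (again with order_id 1 as a placeholder), repeated once per row:

-- materialize the unused user_ids for one order, then sample from the temp table
CREATE TEMP TABLE unused_users AS
SELECT user_id
FROM users
WHERE user_id NOT IN (
    SELECT clicker_id FROM task_pool WHERE order_id = 1
);

INSERT INTO task_pool (clicker_id, order_id)
SELECT user_id, 1
FROM unused_users TABLESAMPLE BERNOULLI (0.01);

DROP TABLE unused_users;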
At this point, I don't think there is any optimization that can be done, other than upgrading the server. Any better ideas?
Answers
Your approach using the ORDER BY RANDOM() clause is indeed inefficient: it has to generate a random value for every row in the users table and sort them all, and because of the LATERAL join it does this once per row in task_dispersal (100 times in your plan). This becomes increasingly slow as the tables grow.
One alternative approach you can consider is to assign a random row number to each eligible user per order_id in a single pass, and then keep only the first to_disp rows for each order_id. This can be more efficient because it avoids re-scanning and re-sorting the users table once for every task_dispersal row.
Here's how you can modify your query to use this approach:
WITH filtered_users AS (
    SELECT
        u.user_id,
        td.order_id,
        td.to_disp,
        ROW_NUMBER() OVER (PARTITION BY td.order_id ORDER BY random()) AS rn
    FROM task_dispersal td
    CROSS JOIN users u
    WHERE NOT EXISTS (
        SELECT 1
        FROM task_pool tp
        WHERE tp.order_id = td.order_id
          AND tp.clicker_id = u.user_id
    )
)
INSERT INTO task_pool (clicker_id, order_id)
SELECT
    user_id,
    order_id
FROM
    filtered_users
WHERE
    rn <= to_disp;
In this modified query:
- We pair every task_dispersal row with every user, excluding users that are already present in the task_pool table for that order_id.
- We assign a row number to each remaining pair using the ROW_NUMBER() window function, partitioned by order_id and ordered by random(), so the numbering is random within each order.
- Finally, we insert only the rows whose row number is at most to_disp for each order_id.
This should be faster than the original query because the users table is scanned and the anti-join against task_pool evaluated in a single pass instead of once per task_dispersal row, though the window function still has to order the eligible rows, so it is worth comparing both plans with EXPLAIN (ANALYZE, BUFFERS) on your server.
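One way to sanity-check the result (assuming task_pool started empty) is to count the inserted rows per order; every order_id should end up with exactly to_disp rows:

-- each of the 100 orders should now have 10 rows in task_pool
SELECT order_id, count(*) AS inserted
FROM task_pool
GROUP BY order_id
ORDER BY order_id;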