ghz ⋅ 8 months ago ⋅ 125 views

Optimize query to insert random rows from tableA into tableB, without duplicates

I am still a novice to SQL, using PostgreSQL 16.2.

create table users(
    user_id serial primary key
);
create table task_pool(
    task_id serial primary key,
    order_id integer,
    clicker_id integer
);
-- inline INDEX is MySQL syntax; PostgreSQL needs a separate statement
create index idx_order_id on task_pool (order_id);
create table task_dispersal(
    id serial primary key,
    order_id integer,
    to_disp integer
);

There are 100,000 rows in users (so 100,000 user_ids); 100 rows in task_dispersal, each with to_disp as 10. Starting with 0 rows in task_pool.
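For reproducing the setup, test data of that shape can be generated with something like the following (this assumes sequential ids, which is not stated in the question):

```sql
-- Seed data matching the description above (assumed: ids are simply 1..N).
INSERT INTO users (user_id)
SELECT g FROM generate_series(1, 100000) AS g;

INSERT INTO task_dispersal (order_id, to_disp)
SELECT g, 10 FROM generate_series(1, 100) AS g;

-- Keep the serial sequence in sync with the explicitly inserted ids.
SELECT setval('users_user_id_seq', 100000);
```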

I want to iterate through rows in task_dispersal. For each row...

  • Get to_disp random user_ids from the users table that are not already in task_pool with the iterated row's order_id.
  • For example, if a row in task_dispersal had order_id 1 and to_disp 10, it would check task_pool for all rows with order_id 1, filter those user_ids out of the users table, and then choose 10 random rows (user_ids) to insert into task_pool.

I am using this query:

INSERT into task_pool(clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U

It works, and is relatively fast, but I am still hoping there is a way to optimize it: it takes 26-27 seconds on my (very weak) hosted database, and I am aiming for a few seconds.

Here is the query plan:

"Insert on public.task_pool  (cost=6586.23..674875.51 rows=0 width=0) (actual time=26414.647..26414.649 rows=0 loops=1)"
"  Buffers: shared hit=33860 dirtied=22 written=22"
"  ->  Nested Loop  (cost=6586.23..674875.51 rows=500000 width=24) (actual time=1815.128..26402.217 rows=1000 loops=1)"
"        Output: nextval('task_pool_task_id_seq'::regclass), td.order_id, us.user_id, CURRENT_TIMESTAMP, 0"
"        Buffers: shared hit=28619"
"        ->  Seq Scan on public.task_dispersal td  (cost=0.00..3.00 rows=100 width=8) (actual time=0.008..0.120 rows=100 loops=1)"
"              Output: td.id, td.order_id, td.reset_time, td.disp_remaining, td.daily_disp, td.disp_interval, td.next_disp_time, td.expired_tasks, td.to_disp"
"              Buffers: shared hit=2"
"        ->  Limit  (cost=6586.23..6598.73 rows=5000 width=12) (actual time=251.784..251.786 rows=10 loops=100)"
"              Output: us.user_id, (random())"
"              Buffers: shared hit=27607"
"              ->  Sort  (cost=6586.23..6711.23 rows=50000 width=12) (actual time=251.781..251.782 rows=10 loops=100)"
"                    Output: us.user_id, (random())"
"                    Sort Key: (random())"
"                    Sort Method: top-N heapsort  Memory: 25kB"
"                    Buffers: shared hit=27607"
"                    ->  Index Only Scan using users_pkey on public.users us  (cost=2.96..2181.57 rows=50000 width=12) (actual time=3.048..139.824 rows=100000 loops=100)"
"                          Output: us.user_id, random()"
"                          Filter: (NOT (hashed SubPlan 1))"
"                          Heap Fetches: 0"
"                          Buffers: shared hit=27607"
"                          SubPlan 1"
"                            ->  Index Scan using idx_order_id on public.task_pool task_pool_1  (cost=0.15..2.63 rows=16 width=4) (actual time=0.022..0.022 rows=0 loops=100)"
"                                  Output: task_pool_1.clicker_id"
"                                  Index Cond: (task_pool_1.order_id = td.order_id)"
"                                  Buffers: shared hit=106"
"Settings: effective_io_concurrency = '200', random_page_cost = '1.1', effective_cache_size = '192MB', max_parallel_workers = '1', max_parallel_workers_per_gather = '1', work_mem = '1703kB'"
"Query Identifier: 3069500943293269296"
"Planning:"
"  Buffers: shared hit=40 read=3"
"Planning Time: 0.317 ms"
"JIT:"
"  Functions: 22"
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 1.757 ms, Inlining 208.912 ms, Optimization 913.724 ms, Emission 399.368 ms, Total 1523.761 ms"
"Execution Time: 26416.272 ms"

I don't really know how to read the query plan, but I am pretty sure that the ORDER BY RANDOM() over the user_ids not yet in task_pool for the given order_id is the most costly part?


I don't think I can use TABLESAMPLE BERNOULLI to get random users from the users table, though: some or all of the sampled rows might already be in task_pool, so it would insert fewer than the required number (to_disp) of user_ids. This happens fairly often after running the query a few times.

If I could use tablesample bernoulli on the RESULT of

WHERE user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )

That would be awesome; unfortunately TABLESAMPLE can only be applied to tables and materialized views, and cannot be used on the result of a WHERE clause.
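One workaround in the spirit of the above (a sketch, not tested against this schema): oversample with TABLESAMPLE, then do the anti-join and LIMIT on the small sample. The sample percentage here is an arbitrary assumption that would need tuning for the real data:

```sql
-- Sketch: sample a small random slice of users first, then filter and trim.
-- The 5% rate is an assumed guess; it must be high enough that the sample
-- reliably still contains at least to_disp unused users per order.
INSERT INTO task_pool (clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us TABLESAMPLE BERNOULLI (5)
    WHERE Us.user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY random()
    LIMIT TD.to_disp
) U;
```

This keeps the random sort down to roughly 5,000 rows per order instead of 100,000, at the cost of a small chance of under-filling an order when the sample happens to contain too few unused users.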

I have tried making temp tables of the unused user_ids per row and then using TABLESAMPLE BERNOULLI on those, but it is very slow when each row has up to 100,000 unused user_ids.

At this point, I don't think there is any optimization that can be done, other than upgrading the server. Any better ideas?

Answers

Your approach using ORDER BY RANDOM() is indeed expensive here: it generates a random value for every row of the users table and sorts them, and the plan shows this happens once per row of task_dispersal (loops=100 on the Sort node), so the whole users table is scanned and randomized 100 times.

One alternative is to do the work in a single pass: build all (order_id, candidate user) pairs once, assign each pair a random rank within its order_id using ROW_NUMBER(), and keep only the top to_disp rows per order.

Here's how you can modify your query to use this approach:

WITH ranked_users AS (
    SELECT
        u.user_id,
        td.order_id,
        td.to_disp,
        ROW_NUMBER() OVER (PARTITION BY td.order_id ORDER BY random()) AS rn
    FROM task_dispersal td
    CROSS JOIN users u
    WHERE NOT EXISTS (
        SELECT 1
        FROM task_pool tp
        WHERE tp.order_id = td.order_id
          AND tp.clicker_id = u.user_id
    )
)
INSERT INTO task_pool (clicker_id, order_id)
SELECT user_id, order_id
FROM ranked_users
WHERE rn <= to_disp;

In this modified query:

  • We build every (order_id, candidate user) pair, excluding users already present in task_pool for that order_id.
  • We assign each candidate a random rank within its order_id using ROW_NUMBER() ... ORDER BY random().
  • Finally, we insert only the rows whose rank is at most that order's to_disp.

This avoids re-sorting the users table once per task_dispersal row, but note that it still generates and ranks all candidate pairs in one large pass, so it is worth benchmarking both queries on your actual data before committing to either.
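Either way, a quick sanity check (assuming task_pool started empty and the insert ran once) confirms that every order received exactly to_disp users and that no duplicates were inserted:

```sql
-- Orders that received the wrong number of users (expect zero rows).
SELECT td.order_id, td.to_disp, count(tp.clicker_id) AS inserted
FROM task_dispersal td
LEFT JOIN task_pool tp ON tp.order_id = td.order_id
GROUP BY td.order_id, td.to_disp
HAVING count(tp.clicker_id) <> td.to_disp;

-- Duplicate (order_id, clicker_id) pairs (expect zero rows).
SELECT order_id, clicker_id, count(*) AS n
FROM task_pool
GROUP BY order_id, clicker_id
HAVING count(*) > 1;
```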