PostgreSQL: Why "select into" is faster than "select count" from the same sourcetable?
After upgrading PG from 11 to 15, I struggeling with the following situation:
old (working) Setup: Postgres 11 PostGIS 2.5
Setup: Postgres 15 PostGIS 3.3
table_1 (~14.000.000 rows)
When I do select count(*) from table_1
it never seems to be finished. When I do select * into table_2 from table_1
and than select count(*) from table_2
, it would finished in ~15 minutes.
The concrete statements are:
-- this never stops
select count(*)
FROM table_1 t1
INNER JOIN table_1 t2 ON (ST_Equals(t1.wkb_geometry, t2.wkb_geometry))
WHERE t1.gid > t2.gid
;
-- this works
select *
into table_2
from table_1
;
select count(*)
FROM table_2 t1
INNER JOIN table_2 t2 ON (ST_Equals(t1.wkb_geometry, t2.wkb_geometry))
WHERE t1.gid > t2.gid
;
In both cases there is an index created on wkb_geometry, like this:
CREATE INDEX table_1_geom ON table_1
USING gist (wkb_geometry)
TABLESPACE pgdata;
I already recreated the index an vacuum the table, but nothing worked for me.
Update: I figured out, that the problem is located at the pkey for column gid. After the select into... table_2 has no pkey on gid. If i remove that pkey on table_1 it would work, too. So what happend by adding a pkey different in PG version 15? Any ideas?
Update 2: DDL/SQL for table_1
SELECT ST_MakeLine(sp,ep) AS wkb_geometry,
ogc_fid,
...
INTO table_1
FROM
(SELECT
ST_PointN(geom, generate_series(1, ST_NPoints(geom)-1)) as sp,
ST_PointN(geom, generate_series(2, ST_NPoints(geom) )) as ep,
ogc_fid,
...
FROM
(SELECT (ST_Dump(ST_Boundary(wkb_geometry))).geom,
ogc_fid,
...
FROM another_table
) AS linestrings
) AS segments;
ALTER TABLE table_1 ADD COLUMN gid serial NOT NULL;
--ALTER TABLE table_1 ADD CONSTRAINT table_1_pk PRIMARY KEY (gid);
CREATE INDEX table_1_geom ON table_1 USING gist (wkb_geometry) TABLESPACE pgdata;
Answers
It seems like the issue you're encountering might be related to the use of self-joins and possibly the primary key constraint on the gid
column in PostgreSQL 15.
One thing to note is that self-joins on large tables can be computationally expensive, especially when dealing with spatial operations like ST_Equals
. Additionally, when you have a primary key constraint on a column, PostgreSQL needs to ensure that this constraint is satisfied for each row, which can add overhead, especially when combined with spatial operations.
Here are a few suggestions you can try to address the issue:
-
Analyze Query Execution Plan: Check the execution plan for your query using
EXPLAIN
to see if there are any obvious bottlenecks or inefficiencies. This can help identify if the problem lies in how PostgreSQL is executing the query. -
Tune PostgreSQL Configuration: Depending on your hardware and workload, you may need to adjust PostgreSQL configuration parameters such as
work_mem
,max_parallel_workers
,shared_buffers
, etc., to optimize performance. -
Review Spatial Index: Ensure that your spatial index (
table_1_geom
) is being used effectively in the query execution plan. You may need to tweak the index configuration or even consider re-creating it. -
Disable Primary Key Constraint: Since you mentioned that removing the primary key constraint on the
gid
column resolves the issue, you might consider temporarily disabling or dropping the constraint to see if it improves performance. However, keep in mind that this may have implications for data integrity and uniqueness constraints. -
Batch Processing: If the self-join operation is too resource-intensive, consider breaking down the operation into smaller batches using window functions or pagination.
-
Optimize Query: Review the logic of your query to see if there are opportunities for optimization, such as reducing the number of self-joins or using alternative approaches to achieve the same result.
-
Upgrade PostGIS: Ensure that your PostGIS extension is up-to-date with the latest version compatible with PostgreSQL 15. Newer versions of PostGIS may include performance improvements or bug fixes that could address your issue.
Before making any changes, it's essential to thoroughly test them in a non-production environment to ensure they have the desired effect and don't introduce any unintended consequences. Additionally, consider consulting the PostgreSQL and PostGIS documentation for more guidance on performance optimization techniques.