Intro
I am currentry running a query that was reported, by the DataEng team, to be highly resource consuming. I need some help to improve it/ reduce its cost.
Case
I have a table (status_history_table
) that shows, for all of my customers, their status and a date range that this status is valid. I need a query that returns, daily, how many customers have changed their status and how much of them didnt changed it. Only the last 365 days of data are needed.
status_history_table
example:
user_id | status | valid_from | valid_to |
---|---|---|---|
X | 'active' | 20240401 | 20240405 |
X | 'inactive' | 20240406 | 99991231 |
When we have valid_to as '99991231' it means this is the current status of the customer, i.e, it is valid untill something happens/changes it.
dim_date_table
is a dimension table with all dates;
id_date | date_ |
---|---|
20240401 | 2024-04-01 |
20240402 | 2024-04-02 |
This is the current query I am using:
with
daily_status as (
select
h.user_id
, d.id_date
, h.status
from
status_history_table h
inner join dim_date_table d
on d.id_date between h.valid_from and h.valid_to
and d.date_ <= current_date
and d.date_ > current_date - 365
)
, lag as (
select
*
, lag(status,1)over(partition by user_id order by id_date asc) as previous_status
from
daily_status
)
select
id_date
case
when previous_status = status_braze then 'No change'
else 'Changed'
end as class,
count(1) as user_qty
from
lag
group by
1,2
There are some more infos that I am extracting for this report, but this query represents the most important logic that is being used. Any help will be appreciated.
Update1:
This is the real query I am using:
with
base as (
select
d.date_,
h.subscription_status as status,
braze_id
from
curated_data.braze_subscription_history h
inner join business_layer.dim_date d
on d.pk_date between h.valid_from and h.valid_to
and d.date_ <= current_date
and d.date_ > current_date - 365
where
h.source = 'PUSH'
and h.country_code = 'BR'
),
-- ve qual foi o status anterior de cada cliente
lag as (
select
*,
lag(status,1)over(partition by braze_id order by date_ asc) as previous_status
from
base
)
select
date_
, case
when previous_status = status then 'No change'
else 'Changed'
end as class,
count(1) as user_qty
from
lag
group by
1,2
Result of running "explain"
XN HashAggregate (cost=1000679035081.81..1000679075081.81 rows=8000000 width=78)
-> XN Subquery Scan lag (cost=1000627708182.17..1000668036460.46 rows=1466482847 width=78)
-> XN Window (cost=1000627708182.17..1000649705424.87 rows=1466482847 width=45)
Partition: h.braze_id
Order: d.date_
-> XN Sort (cost=1000627708182.17..1000631374389.29 rows=1466482847 width=45)
Sort Key: h.braze_id, d.date_
-> XN Network (cost=7525.05..404438272.74 rows=1466482847 width=45)
Distribute
-> XN Nested Loop DS_BCAST_INNER (cost=7525.05..404438272.74 rows=1466482847 width=45)
Join Filter: (("inner".pk_date <= ("outer".valid_to)::bigint) AND ("inner".pk_date >= ("outer".valid_from)::bigint))
-> XN Seq Scan on braze_subscription_history h (cost=0.00..1472107.32 rows=36159851 width=49)
Filter: (((source)::text = 'PUSH'::text) AND ((country_code)::text = 'BR'::text))
-> XN Materialize (cost=7525.05..7528.70 rows=365 width=12)
-> XN Seq Scan on dim_date d (cost=0.00..224.69 rows=365 width=12)
Filter: ((date_ > '2023-04-09'::date) AND (date_ <= '2024-04-08'::date))
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----
DDLS:
CREATE TABLE IF NOT EXISTS curated_data.braze_subscription_history
(
country_code VARCHAR(2) ENCODE zstd
,braze_id VARCHAR(32) ENCODE zstd
,external_id VARCHAR(32) ENCODE zstd
,source VARCHAR(16) ENCODE zstd
,subscription_status VARCHAR(16) ENCODE zstd
,valid_from INTEGER ENCODE az64
,valid_to INTEGER ENCODE az64
)
DISTSTYLE EVEN
CREATE TABLE IF NOT EXISTS business_layer.dim_date
(
pk_date BIGINT NOT NULL ENCODE az64
,iso_date VARCHAR(40) ENCODE zstd
,"year" VARCHAR(16) ENCODE zstd
,quarter_of_year VARCHAR(8) ENCODE zstd
,month_of_year VARCHAR(8) ENCODE zstd
,date_ DATE ENCODE az64
,year_month VARCHAR(6) ENCODE zstd
,id_date BIGINT ENCODE az64
)
DISTSTYLE ALL
Answers
The query you provided seems generally well-structured, but there are a few areas where improvements can be made to potentially reduce resource consumption.
-
Join Predicate Optimization: The join condition between
braze_subscription_history
anddim_date
tables can be improved to reduce Cartesian product risk and unnecessary data retrieval. Instead of filtering thedim_date
table on every execution, you can pre-filter it to only include dates within the last 365 days. This can potentially reduce the amount of data being processed during the join. -
Materialization Consideration: The
braze_subscription_history
table is scanned multiple times in the query. If this table is large and doesn't change frequently, consider materializing it into a temporary table before running the main query. This can reduce redundant scanning of the table, especially if it's joined with multiple tables or used multiple times within the same query. -
Distribution Style Optimization: Ensure that the distribution style of the tables is appropriate for the join operations. The
braze_subscription_history
table is distributed EVEN, which might not be the most efficient distribution strategy depending on the size of the table and the join conditions. Consider distributing it based on a column used in joins or filtering predicates.
Here's an updated version of your query with these optimizations applied:
WITH base AS (
SELECT
d.date_,
h.subscription_status AS status,
h.braze_id
FROM
curated_data.braze_subscription_history h
INNER JOIN (
SELECT
date_,
pk_date
FROM
business_layer.dim_date
WHERE
date_ > CURRENT_DATE() - INTERVAL '365 DAY'
) d ON d.pk_date BETWEEN h.valid_from AND h.valid_to
WHERE
h.source = 'PUSH'
AND h.country_code = 'BR'
),
lag AS (
SELECT
*,
LAG(status, 1) OVER(PARTITION BY braze_id ORDER BY date_ ASC) AS previous_status
FROM
base
)
SELECT
date_,
CASE
WHEN previous_status = status THEN 'No change'
ELSE 'Changed'
END AS class,
COUNT(1) AS user_qty
FROM
lag
GROUP BY
1, 2;
Additionally, ensure that appropriate indexes are present on the join columns (valid_from
, valid_to
, date_
, braze_id
, etc.) to facilitate efficient data retrieval.