Reduce Redshift Query Cost


Intro

I am currently running a query that the DataEng team reported as highly resource-consuming. I need some help to improve it and reduce its cost.

Case

I have a table (status_history_table) that shows, for all of my customers, their status and the date range for which that status is valid. I need a query that returns, per day, how many customers changed their status and how many did not. Only the last 365 days of data are needed.

status_history_table example:

user_id | status     | valid_from | valid_to
--------+------------+------------+---------
X       | 'active'   | 20240401   | 20240405
X       | 'inactive' | 20240406   | 99991231

When valid_to is '99991231', it means this is the customer's current status, i.e., it is valid until something happens that changes it.
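
For example, each customer's current status can be read straight off that sentinel value (a minimal query against the example table above):

select
    user_id
    , status
from
    status_history_table
where
    valid_to = 99991231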

dim_date_table is a dimension table with all dates:

id_date  | date_
---------+-----------
20240401 | 2024-04-01
20240402 | 2024-04-02

This is the current query I am using:

with
    daily_status as (
        select
            h.user_id
            , d.id_date
            , h.status
        from
            status_history_table h
            inner join dim_date_table d
                on d.id_date between h.valid_from and h.valid_to 
                and d.date_ <= current_date
                and d.date_ > current_date - 365
    )

    , lag as (
        select 
            *
            , lag(status,1)over(partition by user_id order by id_date asc) as previous_status
        from
            daily_status
    )

    select 
        id_date
        , case
            when previous_status = status then 'No change'
            else 'Changed'
        end as class
        , count(1) as user_qty
    from 
        lag
    group by 
        1,2

There is some more info I am extracting for this report, but this query contains the most important logic being used. Any help will be appreciated.

Update 1:

This is the real query I am using:

with
    base as (
        select 
            d.date_,
            h.subscription_status as status,
            braze_id
        from 
            curated_data.braze_subscription_history h
            inner join business_layer.dim_date d 
                on d.pk_date between h.valid_from and h.valid_to 
                and d.date_ <= current_date
                and d.date_ > current_date - 365
        where
            h.source = 'PUSH'
            and h.country_code = 'BR'
    ),

    -- get each customer's previous status
    lag as (
        select 
            *,
            lag(status,1)over(partition by braze_id order by date_ asc) as previous_status
        from
            base
    )
    

    select 
        date_
        , case
            when previous_status = status then 'No change'
            else 'Changed'
        end as class,
        count(1) as user_qty
    from 
        lag
    group by 
        1,2

Result of running "explain"

XN HashAggregate  (cost=1000679035081.81..1000679075081.81 rows=8000000 width=78)
  ->  XN Subquery Scan lag  (cost=1000627708182.17..1000668036460.46 rows=1466482847 width=78)
        ->  XN Window  (cost=1000627708182.17..1000649705424.87 rows=1466482847 width=45)
              Partition: h.braze_id
              Order: d.date_
              ->  XN Sort  (cost=1000627708182.17..1000631374389.29 rows=1466482847 width=45)
                    Sort Key: h.braze_id, d.date_
                    ->  XN Network  (cost=7525.05..404438272.74 rows=1466482847 width=45)
                          Distribute
                          ->  XN Nested Loop DS_BCAST_INNER  (cost=7525.05..404438272.74 rows=1466482847 width=45)
                                Join Filter: (("inner".pk_date <= ("outer".valid_to)::bigint) AND ("inner".pk_date >= ("outer".valid_from)::bigint))
                                ->  XN Seq Scan on braze_subscription_history h  (cost=0.00..1472107.32 rows=36159851 width=49)
                                      Filter: (((source)::text = 'PUSH'::text) AND ((country_code)::text = 'BR'::text))
                                ->  XN Materialize  (cost=7525.05..7528.70 rows=365 width=12)
                                      ->  XN Seq Scan on dim_date d  (cost=0.00..224.69 rows=365 width=12)
                                            Filter: ((date_ > '2023-04-09'::date) AND (date_ <= '2024-04-08'::date))
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

DDLs:

CREATE TABLE IF NOT EXISTS curated_data.braze_subscription_history
(
    country_code VARCHAR(2)   ENCODE zstd
    ,braze_id VARCHAR(32)   ENCODE zstd
    ,external_id VARCHAR(32)   ENCODE zstd
    ,source VARCHAR(16)   ENCODE zstd
    ,subscription_status VARCHAR(16)   ENCODE zstd
    ,valid_from INTEGER   ENCODE az64
    ,valid_to INTEGER   ENCODE az64
)
DISTSTYLE EVEN


CREATE TABLE IF NOT EXISTS business_layer.dim_date
(
    pk_date BIGINT NOT NULL  ENCODE az64
    ,iso_date VARCHAR(40)   ENCODE zstd
    ,"year" VARCHAR(16)   ENCODE zstd
    ,quarter_of_year VARCHAR(8)   ENCODE zstd
    ,month_of_year VARCHAR(8)   ENCODE zstd
    ,date_ DATE   ENCODE az64
    ,year_month VARCHAR(6)   ENCODE zstd
    ,id_date BIGINT   ENCODE az64
)
DISTSTYLE ALL

Answers

The query you provided is generally well-structured, but there are a few areas where changes may reduce resource consumption.

  1. Join Predicate Optimization: The join between the braze_subscription_history and dim_date tables can be tightened to reduce the risk of a near-Cartesian product and unnecessary data retrieval. Pre-filter dim_date to the last 365 days in a subquery before the join, so only those rows participate in the range join. This can reduce the amount of data processed during the join; see the updated query below.

  2. Materialization Consideration: The braze_subscription_history table is large (the plan estimates roughly 36 million rows after filtering). If the filtered subset (source = 'PUSH', country_code = 'BR') doesn't change frequently, or is reused by other parts of the report, consider materializing it into a temporary table before running the main query, so each query scans the smaller temp table instead of the full history; see the sketch after the updated query below.

  3. Distribution Style Optimization: Ensure that the distribution style of the tables suits the join and window operations. The braze_subscription_history table is DISTSTYLE EVEN, so Redshift must redistribute rows over the network (the XN Network step in the plan) before the window function can partition by braze_id. Consider distributing it on a column used in joins or in the window partition, such as braze_id; see the sketch after the updated query below.

Here's an updated version of your query with the first optimization applied:

WITH base AS (
    SELECT 
        d.date_,
        h.subscription_status AS status,
        h.braze_id
    FROM 
        curated_data.braze_subscription_history h
    INNER JOIN (
        SELECT 
            date_, 
            pk_date 
        FROM 
            business_layer.dim_date 
        WHERE 
            date_ > CURRENT_DATE - 365
            AND date_ <= CURRENT_DATE
    ) d ON d.pk_date BETWEEN h.valid_from AND h.valid_to 
    WHERE
        h.source = 'PUSH'
        AND h.country_code = 'BR'
),

lag AS (
    SELECT 
        *,
        LAG(status, 1) OVER(PARTITION BY braze_id ORDER BY date_ ASC) AS previous_status
    FROM
        base
)

SELECT 
    date_,
    CASE
        WHEN previous_status = status THEN 'No change'
        ELSE 'Changed'
    END AS class,
    COUNT(1) AS user_qty
FROM 
    lag
GROUP BY 
    1, 2;
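
For point 2, a minimal sketch of the materialization idea (the temp table name push_br_history is illustrative, not from your schema):

-- Materialize the filtered subset once; later report queries scan this
-- smaller temp table instead of the full history table.
CREATE TEMP TABLE push_br_history AS
SELECT braze_id, subscription_status, valid_from, valid_to
FROM curated_data.braze_subscription_history
WHERE source = 'PUSH'
  AND country_code = 'BR';

-- Refresh planner statistics so joins against it are costed correctly.
ANALYZE push_br_history;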
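
For point 3, one way to test KEY distribution on braze_id (ALTER DISTSTYLE rewrites the whole table, so it is worth trying on a copy first):

-- Co-locate each customer's rows on one slice so PARTITION BY braze_id
-- in the window function no longer forces a network redistribution.
ALTER TABLE curated_data.braze_subscription_history
ALTER DISTSTYLE KEY DISTKEY braze_id;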

Additionally, note that Redshift does not support indexes; sort keys and distribution keys play that role. Defining a sort key on the filter and join columns (country_code, source, valid_from) helps the scans skip blocks and facilitates efficient data retrieval.
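
For example (a sketch; the best column order depends on which predicates are most selective):

-- Sort on the filter columns first so range-restricted scans skip blocks.
ALTER TABLE curated_data.braze_subscription_history
ALTER COMPOUND SORTKEY (country_code, source, valid_from);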