How to resample an SQL database

ghz 8months ago ⋅ 55 views

I have a dataset which looks like below:

ITEM    CITY        START_Y   START_W   FIRST_USE_Y   FIRST_USE_W   VALUE
A       NEW YORK    2023      30             2023             32    15000
A       LONDON      2024       2             2024              2    12000
A       LONDON      2024       2             2024              5    50000
B       NEW YORK    2023      49             2024              1    19540
B       MADRID      2023      10             2023             11    15444

First, the combination of ITEM and CITY need to be grouped. Then for each group, I want to resample weekly up to 5 datapoints and fill the 'VALUE' column with zero where there is no value for the combination of FIRST_USE_Y and FIRST_USE_W columns. START_W and FIRST_USE_W are the number of week of the year(value can be from 1 to 52).

I tried with pandas and a for loop; it worked. But as it is a very large dataset with many millions of rows and I am bound to do with SQL(where I am a newbie). This is the code I tried:

WITH RECURSIVE weekly_intervals AS (
    SELECT MIN(start_w) AS start_w, MAX(start_w) AS end_w
    FROM citywise_values
    UNION ALL
    SELECT start_w + INTERVAL 1 WEEK, end_w
    FROM weekly_intervals
    WHERE start_w + INTERVAL 1 WEEK <= end_w
),
filled_values AS (
    SELECT 
        w.item,
        w.city,
        w.start_y,
        w.start_w,
        COALESCE(cv.value, 0) AS value
    FROM 
        (SELECT 
            item,
            city,
            start_y,
            start_w
        FROM 
            citywise_values
        GROUP BY 
            item, city) w
    LEFT JOIN 
        citywise_values cv ON w.item = cv.item
                             AND w.city = cv.city
                             AND w.start_y = cv.start_y
                             AND w.start_w = cv.start_w
)
SELECT 
    item,
    city,
    start_y,
    start_w,
    COALESCE(value, LAG(value) OVER (PARTITION BY item, city, start_y ORDER BY start_w)) AS value
FROM 
    filled_values
RIGHT JOIN
    weekly_intervals
ON
    filled_values.start_w = weekly_intervals.start_w
ORDER BY
    item, city, start_y, start_w

Then I tried with a cross join and was able to produce the result only for one single combination of the ITEM and CITY. But I could not find how to do for the whole dataset.

I am not sure that I could explain it well or not. So, I am posting the desired output which I created manually.

ITEM    CITY        START_Y     START_W     FIRST_USE_Y     FIRST_USE_W     VALUE
A       NEW YORK    2023        30                2023              30      0
A       NEW YORK    2023        30                2023              31      0
A       NEW YORK    2023        30                2023              32      15000
A       NEW YORK    2023        30                2023              33      0
A       NEW YORK    2023        30                2023              34      0
A       LONDON      2024        2                 2024              2       12000
A       LONDON      2024        2                 2024              3       0
A       LONDON      2024        2                 2024              4       0
A       LONDON      2024        2                 2024              5       50000
A       LONDON      2024        2                 2024              6       0
B       NEW YORK    2023        49                2023              49      0
B       NEW YORK    2023        49                2023              50      0
B       NEW YORK    2023        49                2023              51      0
B       NEW YORK    2023        49                2023              52      0
B       NEW YORK    2023        49                2024              1       19540
B       MADRID      2023        10                2023              10      0
B       MADRID      2023        10                2023              11      15444
B       MADRID      2023        10                2023              12      0
B       MADRID      2023        10                2023              13      0
B       MADRID      2023        10                2023              14      0

Any help will be appreciated.

Answers

You can achieve the desired result using SQL by generating a list of weeks and then performing a LEFT JOIN with your original dataset to fill in the missing weeks with zero values. Here's how you can do it:

WITH RECURSIVE weekly_intervals AS (
    SELECT MIN(start_w) AS start_w, MAX(start_w) AS end_w
    FROM citywise_values
    UNION ALL
    SELECT start_w + 1, end_w
    FROM weekly_intervals
    WHERE start_w < end_w
),
expanded_values AS (
    SELECT 
        item,
        city,
        start_y,
        w.start_w,
        COALESCE(cv.value, 0) AS value
    FROM 
        weekly_intervals w
    CROSS JOIN (
        SELECT DISTINCT item, city, start_y
        FROM citywise_values
    ) AS base_data
    LEFT JOIN 
        citywise_values cv ON base_data.item = cv.item
                             AND base_data.city = cv.city
                             AND base_data.start_y = cv.start_y
                             AND w.start_w = cv.start_w
)
SELECT 
    item,
    city,
    start_y,
    start_w,
    COALESCE(value, LAG(value) OVER (PARTITION BY item, city, start_y ORDER BY start_w)) AS value
FROM 
    expanded_values
ORDER BY
    item, city, start_y, start_w;

Here's what each part of the query does:

  1. Recursive CTE (Common Table Expression) weekly_intervals: This generates a list of weeks starting from the minimum start_w to the maximum start_w in your dataset.

  2. CTE expanded_values: This performs a cross join between the distinct combinations of item, city, and start_y in your dataset and the generated weekly intervals. It then left joins this with your original dataset to fill in the missing weeks with zero values.

  3. Final Select: This selects the required columns from the expanded_values CTE and uses the LAG function to fill in the missing values from the previous week within each group of item, city, and start_y.

This query should give you the desired output, with missing weeks filled in with zero values.