I have a dataset which looks like below:
ITEM CITY START_Y START_W FIRST_USE_Y FIRST_USE_W VALUE
A NEW YORK 2023 30 2023 32 15000
A LONDON 2024 2 2024 2 12000
A LONDON 2024 2 2024 5 50000
B NEW YORK 2023 49 2024 1 19540
B MADRID 2023 10 2023 11 15444
First, the combination of ITEM and CITY need to be grouped. Then for each group, I want to resample weekly up to 5 datapoints and fill the 'VALUE' column with zero where there is no value for the combination of FIRST_USE_Y and FIRST_USE_W columns. START_W and FIRST_USE_W are the number of week of the year(value can be from 1 to 52).
I tried with pandas and a for loop; it worked. But as it is a very large dataset with many millions of rows and I am bound to do with SQL(where I am a newbie). This is the code I tried:
WITH RECURSIVE weekly_intervals AS (
SELECT MIN(start_w) AS start_w, MAX(start_w) AS end_w
FROM citywise_values
UNION ALL
SELECT start_w + INTERVAL 1 WEEK, end_w
FROM weekly_intervals
WHERE start_w + INTERVAL 1 WEEK <= end_w
),
filled_values AS (
SELECT
w.item,
w.city,
w.start_y,
w.start_w,
COALESCE(cv.value, 0) AS value
FROM
(SELECT
item,
city,
start_y,
start_w
FROM
citywise_values
GROUP BY
item, city) w
LEFT JOIN
citywise_values cv ON w.item = cv.item
AND w.city = cv.city
AND w.start_y = cv.start_y
AND w.start_w = cv.start_w
)
SELECT
item,
city,
start_y,
start_w,
COALESCE(value, LAG(value) OVER (PARTITION BY item, city, start_y ORDER BY start_w)) AS value
FROM
filled_values
RIGHT JOIN
weekly_intervals
ON
filled_values.start_w = weekly_intervals.start_w
ORDER BY
item, city, start_y, start_w
Then I tried with a cross join and was able to produce the result only for one single combination of the ITEM and CITY. But I could not find how to do for the whole dataset.
I am not sure that I could explain it well or not. So, I am posting the desired output which I created manually.
ITEM CITY START_Y START_W FIRST_USE_Y FIRST_USE_W VALUE
A NEW YORK 2023 30 2023 30 0
A NEW YORK 2023 30 2023 31 0
A NEW YORK 2023 30 2023 32 15000
A NEW YORK 2023 30 2023 33 0
A NEW YORK 2023 30 2023 34 0
A LONDON 2024 2 2024 2 12000
A LONDON 2024 2 2024 3 0
A LONDON 2024 2 2024 4 0
A LONDON 2024 2 2024 5 50000
A LONDON 2024 2 2024 6 0
B NEW YORK 2023 49 2023 49 0
B NEW YORK 2023 49 2023 50 0
B NEW YORK 2023 49 2023 51 0
B NEW YORK 2023 49 2023 52 0
B NEW YORK 2023 49 2024 1 19540
B MADRID 2023 10 2023 10 0
B MADRID 2023 10 2023 11 15444
B MADRID 2023 10 2023 12 0
B MADRID 2023 10 2023 13 0
B MADRID 2023 10 2023 14 0
Any help will be appreciated.
Answers
You can achieve the desired result using SQL by generating a list of weeks and then performing a LEFT JOIN
with your original dataset to fill in the missing weeks with zero values. Here's how you can do it:
WITH RECURSIVE weekly_intervals AS (
SELECT MIN(start_w) AS start_w, MAX(start_w) AS end_w
FROM citywise_values
UNION ALL
SELECT start_w + 1, end_w
FROM weekly_intervals
WHERE start_w < end_w
),
expanded_values AS (
SELECT
item,
city,
start_y,
w.start_w,
COALESCE(cv.value, 0) AS value
FROM
weekly_intervals w
CROSS JOIN (
SELECT DISTINCT item, city, start_y
FROM citywise_values
) AS base_data
LEFT JOIN
citywise_values cv ON base_data.item = cv.item
AND base_data.city = cv.city
AND base_data.start_y = cv.start_y
AND w.start_w = cv.start_w
)
SELECT
item,
city,
start_y,
start_w,
COALESCE(value, LAG(value) OVER (PARTITION BY item, city, start_y ORDER BY start_w)) AS value
FROM
expanded_values
ORDER BY
item, city, start_y, start_w;
Here's what each part of the query does:
-
Recursive CTE (Common Table Expression)
weekly_intervals
: This generates a list of weeks starting from the minimumstart_w
to the maximumstart_w
in your dataset. -
CTE
expanded_values
: This performs a cross join between the distinct combinations ofitem
,city
, andstart_y
in your dataset and the generated weekly intervals. It then left joins this with your original dataset to fill in the missing weeks with zero values. -
Final Select: This selects the required columns from the
expanded_values
CTE and uses theLAG
function to fill in the missing values from the previous week within each group ofitem
,city
, andstart_y
.
This query should give you the desired output, with missing weeks filled in with zero values.