Sorting using Oracle SQL Hierarchical Join path without inflating row count
I've written a query with an Oracle SQL hierarchical join that I think is a bit inefficient. I'm hoping someone can point me to a more efficient means of achieving what I want.
Background on the problem:
- I need to correctly sort a set of records (uniquely identified by segment_id) we are receiving from a third-party database. The records are ordered by record_time, but due to truncation of the timestamp, some records will have the exact same record_time. We cannot guarantee that the records will be sorted into the desired order when we retrieve them, but we need to sort records within the same timestamp into the correct order.
- Thankfully, we have a 'neighbors' table that tells us the expected connection between two segments. For example, if I have two segments 'A' and '1' at the same record_time, I can look at the
reprex_neighbors_table
to see that where segment_id = 'A', the next_segment_id is '1', so '1' must follow 'A'. - A few other challenges are:
- Some segments will have multiple parents and children in the neighbors table, not all of which will manifest in the segment data. I have one example of multiple children in the example below.
- Because we have many, many segments, it isn't practical to start at the original part of the hierarchy and work all the way down; we only want to use hierarchical queries to sort within a given record_time. The reprex below doesn't illustrate the workarounds for that, but you get the idea.
- Sometimes we get more than one record in successive time stamps, so you can't look back to the previous timestamp to immediately see what the desired antecedent is. I would like to avoid loops or even more extensive subqueries in my code for the sake of performance, but if that's the recommended approach, let me know.
To summarize, here's an example of the data:
This is the reprex_segments_table (for convenience, cast r_time is shown)
ORIGINAL_ORDER | DESIRED_ORDER | SEGMENT_ID | RECORD_TIME | r_time |
---|---|---|---|---|
1 | 1 | 25-52-21 | 1-Apr-24 | 7:57:18 |
2 | 2 | 45-85-44 | 1-Apr-24 | 7:57:26 |
5 | 3 | 51-1-97 | 1-Apr-24 | 8:04:12 |
4 | 4 | 81-24-87 | 1-Apr-24 | 8:04:12 |
3 | 5 | 18-11-88 | 1-Apr-24 | 8:04:12 |
7 | 6 | 79-60-16 | 1-Apr-24 | 8:04:54 |
6 | 7 | 12-80-48 | 1-Apr-24 | 8:04:54 |
8 | 8 | 11-58-36 | 1-Apr-24 | 8:05:12 |
9 | 9 | 90-49-72 | 1-Apr-24 | 8:05:36 |
This is the reprex_neighbors_table. Notably, the last record here in the reprex is a superfluous child to one record; in practice each record can have up to 3 parents and/or 3 children!
SEGMENT_ID | NEXT_SEGMENT_ID |
---|---|
25-52-21 | 45-85-44 |
45-85-44 | 51-1-97 |
51-1-97 | 81-24-87 |
81-24-87 | 18-11-88 |
18-11-88 | 79-60-16 |
79-60-16 | 12-80-48 |
12-80-48 | 11-58-36 |
11-58-36 | 90-49-72 |
81-24-87 | 69-67-78 |
Here's a bit of code to create the example tables.
CREATE TABLE reprex_segment_data (
original_order int,
desired_order int,
segment_id VARCHAR(20),
record_time DATE
);
CREATE TABLE reprex_neighbors_table (
segment_id VARCHAR(20),
next_segment_id VARCHAR(20)
);
INSERT ALL INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
1,
1,
'25-52-21',
TO_DATE('7:57:18', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
2,
2,
'45-85-44',
TO_DATE('7:57:26', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
5,
3,
'51-1-97',
TO_DATE('8:04:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
4,
4,
'81-24-87',
TO_DATE('8:04:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
3,
5,
'18-11-88',
TO_DATE('8:04:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
7,
6,
'79-60-16',
TO_DATE('8:04:54', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
6,
7,
'12-80-48',
TO_DATE('8:04:54', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
8,
8,
'11-58-36',
TO_DATE('8:05:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
original_order,
desired_order,
segment_id,
record_time
) VALUES (
9,
9,
'90-49-72',
TO_DATE('8:05:36', 'HH24:MI:SS')
)
SELECT
1
FROM
dual;
INSERT ALL INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'25-52-21',
'45-85-44'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'45-85-44',
'51-1-97'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'51-1-97',
'81-24-87'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'81-24-87',
'18-11-88'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'18-11-88',
'79-60-16'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'79-60-16',
'12-80-48'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'12-80-48',
'11-58-36'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'11-58-36',
'90-49-72'
) INTO reprex_neighbors_table (
segment_id,
next_segment_id
) VALUES (
'81-24-87',
'69-67-78'
)
SELECT
1
FROM
dual;
Here's my proposed solution. The logic is to work 'backwards' almost by making the current record the 'next' segment and looking up the prior segment. I also work record_time into the connect by statement to essentially partition the join. The resulting path is longest for the record that should be at the end within a record_time partition and shortest for the one that should be at the beginning, which is my desired outcome.
WITH add_neigh AS (
SELECT
reprex_segment_data.original_order,
reprex_segment_data.desired_order,
reprex_segment_data.segment_id,
reprex_segment_data.record_time,
to_char(reprex_segment_data.record_time, 'HH24:MI:SS') AS r_time,
reprex_neighbors_table.segment_id AS parent_segment_id
FROM
reprex_segment_data
LEFT JOIN reprex_neighbors_table
--here, we try to remove children that aren't relevant to our path in some fashion
--if there's a better way to handle this (especially with many more records)
--would appreciate advice.
ON reprex_neighbors_table.next_segment_id IN (
SELECT
reprex_neighbors_table.next_segment_id
FROM
reprex_neighbors_table
WHERE
reprex_neighbors_table.next_segment_id = reprex_segment_data.segment_id
)
), add_neigh_path AS (
SELECT
add_neigh.original_order,
add_neigh.desired_order,
add_neigh.segment_id,
add_neigh.record_time,
add_neigh.r_time,
add_neigh.parent_segment_id,
level,
connect_by_iscycle AS is_cycle,
sys_connect_by_path(add_neigh.parent_segment_id, '/')
|| '/'
|| add_neigh.segment_id AS path,
--for reasons i don't entirely understand, multiple paths are generated for each record with various levels of
--depth; we want to keep the longest one
row_number() OVER (PARTITION BY add_neigh.segment_id, add_neigh.record_time ORDER BY level DESC) AS rn_highest_level
FROM
add_neigh
CONNECT BY
-- PRIOR child -> parent relationship
NOCYCLE PRIOR add_neigh.segment_id = add_neigh.parent_segment_id
-- this is what partitions the hierarchy
AND PRIOR add_neigh.record_time = add_neigh.record_time
ORDER BY
add_neigh.original_order
)
SELECT
*
FROM
add_neigh_path
WHERE
rn_highest_level = 1
ORDER BY
record_time,
path
Which produces:
ORIGINAL_ORDER | DESIRED_ORDER | SEGMENT_ID | RECORD_TIME | R_TIME | PARENT_SEGMENT_ID | LEVEL | IS_CYCLE | PATH | RN_HIGHEST_LEVEL |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 25-52-21 | 1-Apr-24 | 7:57:18 | 1 | 0 | //25-52-21 | 1 | |
2 | 2 | 45-85-44 | 1-Apr-24 | 7:57:26 | 25-52-21 | 1 | 0 | /25-52-21/45-85-44 | 1 |
5 | 3 | 51-1-97 | 1-Apr-24 | 8:04:12 | 45-85-44 | 1 | 0 | /45-85-44/51-1-97 | 1 |
4 | 4 | 81-24-87 | 1-Apr-24 | 8:04:12 | 51-1-97 | 2 | 0 | /45-85-44/51-1-97/81-24-87 | 1 |
3 | 5 | 18-11-88 | 1-Apr-24 | 8:04:12 | 81-24-87 | 3 | 0 | /45-85-44/51-1-97/81-24-87/18-11-88 | 1 |
7 | 6 | 79-60-16 | 1-Apr-24 | 8:04:54 | 18-11-88 | 1 | 0 | /18-11-88/79-60-16 | 1 |
6 | 7 | 12-80-48 | 1-Apr-24 | 8:04:54 | 79-60-16 | 2 | 0 | /18-11-88/79-60-16/12-80-48 | 1 |
8 | 8 | 11-58-36 | 1-Apr-24 | 8:05:12 | 12-80-48 | 1 | 0 | /12-80-48/11-58-36 | 1 |
9 | 9 | 90-49-72 | 1-Apr-24 | 8:05:36 | 11-58-36 | 1 | 0 | /11-58-36/90-49-72 | 1 |
Here are some challenges:
- Joining the neighbors table has the potential to inflate record counts. With the subquery in my example it's not so bad, but in the full dataset this is not ideal.
- The add_neigh_path CTE table has many levels shown, and I don't understand why. I have to sort by path length and then filter on RN_HIGHEST_LEVEL. The example below shows what that table looks like. This isn't so bad in the reprex data but can become really big in my full table.
ORIGINAL_ORDER | DESIRED_ORDER | SEGMENT_ID | RECORD_TIME | R_TIME | PARENT_SEGMENT_ID | LEVEL | IS_CYCLE | PATH | RN_HIGHEST_LEVEL |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 25-52-21 | 1-Apr-24 | 7:57:18 | 1 | 0 | //25-52-21 | 1 | |
2 | 2 | 45-85-44 | 1-Apr-24 | 7:57:26 | 25-52-21 | 1 | 0 | /25-52-21/45-85-44 | 1 |
5 | 3 | 51-1-97 | 1-Apr-24 | 8:04:12 | 45-85-44 | 1 | 0 | /45-85-44/51-1-97 | 1 |
4 | 4 | 81-24-87 | 1-Apr-24 | 8:04:12 | 51-1-97 | 2 | 0 | /45-85-44/51-1-97/81-24-87 | 1 |
3 | 5 | 18-11-88 | 1-Apr-24 | 8:04:12 | 81-24-87 | 3 | 0 | /45-85-44/51-1-97/81-24-87/18-11-88 | 1 |
4 | 4 | 81-24-87 | 1-Apr-24 | 8:04:12 | 51-1-97 | 1 | 0 | /51-1-97/81-24-87 | 2 |
3 | 5 | 18-11-88 | 1-Apr-24 | 8:04:12 | 81-24-87 | 2 | 0 | /51-1-97/81-24-87/18-11-88 | 2 |
3 | 5 | 18-11-88 | 1-Apr-24 | 8:04:12 | 81-24-87 | 1 | 0 | /81-24-87/18-11-88 | 3 |
7 | 6 | 79-60-16 | 1-Apr-24 | 8:04:54 | 18-11-88 | 1 | 0 | /18-11-88/79-60-16 | 1 |
6 | 7 | 12-80-48 | 1-Apr-24 | 8:04:54 | 79-60-16 | 2 | 0 | /18-11-88/79-60-16/12-80-48 | 1 |
6 | 7 | 12-80-48 | 1-Apr-24 | 8:04:54 | 79-60-16 | 1 | 0 | /79-60-16/12-80-48 | 2 |
8 | 8 | 11-58-36 | 1-Apr-24 | 8:05:12 | 12-80-48 | 1 | 0 | /12-80-48/11-58-36 | 1 |
9 | 9 | 90-49-72 | 1-Apr-24 | 8:05:36 | 11-58-36 | 1 | 0 | /11-58-36/90-49-72 | 1 |
So, in short, is there a way to achieve my desired sort:
- with less row count explosion/filtering?
- without need for a subquery?
Many thanks!
Answers
Your approach using hierarchical queries is indeed a valid way to solve this problem. However, there are some optimizations and improvements that can be made to make it more efficient and easier to understand.
-
Hierarchical Query Optimization: Instead of joining the
reprex_neighbors_table
directly in the main query, you can use a hierarchical query to traverse the relationships between segments. This can reduce the need for joins and simplify the query structure. -
Path Generation Optimization: You can generate the full path of segments within each record_time partition directly within the hierarchical query using the
SYS_CONNECT_BY_PATH
function. This eliminates the need for additional processing in subsequent CTEs. -
Level Filtering: Instead of filtering by the highest level using
ROW_NUMBER()
, you can use theCONNECT_BY_ISLEAF
pseudocolumn to filter out intermediate nodes and keep only leaf nodes in the hierarchy.
Here's an optimized version of your query:
WITH add_neigh_path AS (
SELECT
r.original_order,
r.desired_order,
r.segment_id,
r.record_time,
to_char(r.record_time, 'HH24:MI:SS') AS r_time,
CONNECT_BY_ROOT r.segment_id AS root_segment_id,
SYS_CONNECT_BY_PATH(r.segment_id, '/') AS path,
LEVEL AS path_length
FROM
reprex_segment_data r
START WITH
NOT EXISTS (
SELECT 1
FROM reprex_neighbors_table n
WHERE n.segment_id = r.segment_id
)
CONNECT BY
PRIOR r.segment_id = n.segment_id
AND
PRIOR r.record_time = r.record_time
)
SELECT
original_order,
desired_order,
segment_id,
record_time,
r_time,
path,
path_length
FROM
add_neigh_path
WHERE
CONNECT_BY_ISLEAF = 1
ORDER BY
record_time,
path;
In this version:
- The hierarchical query starts with segments that do not have any children in the
reprex_neighbors_table
, avoiding unnecessary joins. - The
CONNECT_BY_ROOT
function is used to get the root segment of each path directly within the hierarchical query. - Filtering is done using the
CONNECT_BY_ISLEAF
pseudocolumn to keep only leaf nodes in the hierarchy, which represent the end of each path.
This should provide a more efficient and concise solution to your problem.