Sorting using Oracle SQL Hierarchical Join path without inflatin

ghz 8months ago ⋅ 51 views

Sorting using Oracle SQL Hierarchical Join path without inflating row count

I've written a query with an Oracle SQL hierarchical join that I think is a bit inefficient. I'm hoping someone can point me to a more efficient means of achieving what I want.

Background on the problem:

  • I need to correctly sort a set of records (uniquely identified by segment_id) we are receiving from a third-party database. The records are ordered by record_time, but due to truncation of the timestamp, some records will have the exact same record_time. We cannot guarantee that the records will be sorted into the desired order when we retrieve them, but we need to sort records within the same timestamp into the correct order.
  • Thankfully, we have a 'neighbors' table that tells us the expected connection between two segments. For example, if I have two segments 'A' and '1' at the same record_time, I can look at the reprex_neighbors_table to see that where segment_id = 'A', the next_segment_id is '1', so '1' must follow 'A'.
  • A few other challenges are:
    • Some segments will have multiple parents and children in the neighbors table, not all of which will manifest in the segment data. I have one example of multiple children in the example below.
    • Because we have many, many segments, it isn't practical to start at the original part of the hierarchy and work all the way down; we only want to use hierarchical queries to sort within a given record_time. The reprex below doesn't illustrate the workarounds for that, but you get the idea.
    • Sometimes we get more than one record in successive time stamps, so you can't look back to the previous timestamp to immediately see what the desired antecedent is. I would like to avoid loops or even more extensive subqueries in my code for the sake of performance, but if that's the recommended approach, let me know.

To summarize, here's an example of the data:

This is the reprex_segments_table (for convenience, cast r_time is shown)

ORIGINAL_ORDERDESIRED_ORDERSEGMENT_IDRECORD_TIMEr_time
1125-52-211-Apr-247:57:18
2245-85-441-Apr-247:57:26
5351-1-971-Apr-248:04:12
4481-24-871-Apr-248:04:12
3518-11-881-Apr-248:04:12
7679-60-161-Apr-248:04:54
6712-80-481-Apr-248:04:54
8811-58-361-Apr-248:05:12
9990-49-721-Apr-248:05:36

This is the reprex_neighbors_table. Notably, the last record here in the reprex is a superfluous child to one record; in practice each record can have up to 3 parents and/or 3 children!

SEGMENT_IDNEXT_SEGMENT_ID
25-52-2145-85-44
45-85-4451-1-97
51-1-9781-24-87
81-24-8718-11-88
18-11-8879-60-16
79-60-1612-80-48
12-80-4811-58-36
11-58-3690-49-72
81-24-8769-67-78

Here's a bit of code to create the example tables.

CREATE TABLE reprex_segment_data (
    original_order int,
    desired_order int,
    segment_id VARCHAR(20),
    record_time DATE
);

CREATE TABLE reprex_neighbors_table (
    segment_id VARCHAR(20),
    next_segment_id VARCHAR(20)
);

INSERT ALL INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    1,
    1,
    '25-52-21',
    TO_DATE('7:57:18', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    2,
    2,
    '45-85-44',
    TO_DATE('7:57:26', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    5,
    3,
    '51-1-97',
    TO_DATE('8:04:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    4,
    4,
    '81-24-87',
    TO_DATE('8:04:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    3,
    5,
    '18-11-88',
    TO_DATE('8:04:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    7,
    6,
    '79-60-16',
    TO_DATE('8:04:54', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    6,
    7,
    '12-80-48',
    TO_DATE('8:04:54', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    8,
    8,
    '11-58-36',
    TO_DATE('8:05:12', 'HH24:MI:SS')
) INTO reprex_segment_data (
    original_order,
    desired_order,
    segment_id,
    record_time
) VALUES (
    9,
    9,
    '90-49-72',
    TO_DATE('8:05:36', 'HH24:MI:SS')
)
    SELECT
        1
    FROM
        dual;

INSERT ALL INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '25-52-21',
    '45-85-44'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '45-85-44',
    '51-1-97'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '51-1-97',
    '81-24-87'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '81-24-87',
    '18-11-88'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '18-11-88',
    '79-60-16'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '79-60-16',
    '12-80-48'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '12-80-48',
    '11-58-36'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '11-58-36',
    '90-49-72'
) INTO reprex_neighbors_table (
    segment_id,
    next_segment_id
) VALUES (
    '81-24-87',
    '69-67-78'
)
    SELECT
        1
    FROM
        dual;

Here's my proposed solution. The logic is to work 'backwards' almost by making the current record the 'next' segment and looking up the prior segment. I also work record_time into the connect by statement to essentially partition the join. The resulting path is longest for the record that should be at the end within a record_time partition and shortest for the one that should be at the beginning, which is my desired outcome.

WITH add_neigh AS (
    SELECT
        reprex_segment_data.original_order,
        reprex_segment_data.desired_order,
        reprex_segment_data.segment_id,
        reprex_segment_data.record_time,
        to_char(reprex_segment_data.record_time, 'HH24:MI:SS') AS r_time,
        reprex_neighbors_table.segment_id                      AS parent_segment_id
    FROM
        reprex_segment_data
        LEFT JOIN reprex_neighbors_table
 --here, we try to remove children that aren't relevant to our path in some fashion
 --if there's a better way to handle this (especially with many more records)
 --would appreciate advice.
        ON reprex_neighbors_table.next_segment_id IN (
            SELECT
                reprex_neighbors_table.next_segment_id
            FROM
                reprex_neighbors_table
            WHERE
                reprex_neighbors_table.next_segment_id = reprex_segment_data.segment_id
        )
), add_neigh_path AS (
    SELECT
        add_neigh.original_order,
        add_neigh.desired_order,
        add_neigh.segment_id,
        add_neigh.record_time,
        add_neigh.r_time,
        add_neigh.parent_segment_id,
        level,
        connect_by_iscycle                                                                               AS is_cycle,
        sys_connect_by_path(add_neigh.parent_segment_id, '/')
        || '/'
        || add_neigh.segment_id                                                                          AS path,
 --for reasons i don't entirely understand, multiple paths are generated for each record with various levels of
 --depth; we want to keep the longest one
        row_number() OVER (PARTITION BY add_neigh.segment_id, add_neigh.record_time ORDER BY level DESC) AS rn_highest_level
    FROM
        add_neigh
    CONNECT BY
 -- PRIOR child -> parent relationship
        NOCYCLE PRIOR add_neigh.segment_id = add_neigh.parent_segment_id
 -- this is what partitions the hierarchy
        AND PRIOR add_neigh.record_time = add_neigh.record_time
    ORDER BY
        add_neigh.original_order
)
SELECT
    *
FROM
    add_neigh_path
WHERE
    rn_highest_level = 1
ORDER BY
    record_time,
    path

Which produces:

ORIGINAL_ORDERDESIRED_ORDERSEGMENT_IDRECORD_TIMER_TIMEPARENT_SEGMENT_IDLEVELIS_CYCLEPATHRN_HIGHEST_LEVEL
1125-52-211-Apr-247:57:1810//25-52-211
2245-85-441-Apr-247:57:2625-52-2110/25-52-21/45-85-441
5351-1-971-Apr-248:04:1245-85-4410/45-85-44/51-1-971
4481-24-871-Apr-248:04:1251-1-9720/45-85-44/51-1-97/81-24-871
3518-11-881-Apr-248:04:1281-24-8730/45-85-44/51-1-97/81-24-87/18-11-881
7679-60-161-Apr-248:04:5418-11-8810/18-11-88/79-60-161
6712-80-481-Apr-248:04:5479-60-1620/18-11-88/79-60-16/12-80-481
8811-58-361-Apr-248:05:1212-80-4810/12-80-48/11-58-361
9990-49-721-Apr-248:05:3611-58-3610/11-58-36/90-49-721

Here are some challenges:

  • Joining the neighbors table has the potential to inflate record counts. With the subquery in my example it's not so bad, but in the full dataset this is not ideal.
  • The add_neigh_path CTE table has many levels shown, and I don't understand why. I have to sort by path length and then filter on RN_HIGHEST_LEVEL. The example below shows what that table looks like. This isn't so bad in the reprex data but can become really big in my full table.
ORIGINAL_ORDERDESIRED_ORDERSEGMENT_IDRECORD_TIMER_TIMEPARENT_SEGMENT_IDLEVELIS_CYCLEPATHRN_HIGHEST_LEVEL
1125-52-211-Apr-247:57:1810//25-52-211
2245-85-441-Apr-247:57:2625-52-2110/25-52-21/45-85-441
5351-1-971-Apr-248:04:1245-85-4410/45-85-44/51-1-971
4481-24-871-Apr-248:04:1251-1-9720/45-85-44/51-1-97/81-24-871
3518-11-881-Apr-248:04:1281-24-8730/45-85-44/51-1-97/81-24-87/18-11-881
4481-24-871-Apr-248:04:1251-1-9710/51-1-97/81-24-872
3518-11-881-Apr-248:04:1281-24-8720/51-1-97/81-24-87/18-11-882
3518-11-881-Apr-248:04:1281-24-8710/81-24-87/18-11-883
7679-60-161-Apr-248:04:5418-11-8810/18-11-88/79-60-161
6712-80-481-Apr-248:04:5479-60-1620/18-11-88/79-60-16/12-80-481
6712-80-481-Apr-248:04:5479-60-1610/79-60-16/12-80-482
8811-58-361-Apr-248:05:1212-80-4810/12-80-48/11-58-361
9990-49-721-Apr-248:05:3611-58-3610/11-58-36/90-49-721

So, in short, is there a way to achieve my desired sort:

  • with less row count explosion/filtering?
  • without need for a subquery?

Many thanks!

Answers

Your approach using hierarchical queries is indeed a valid way to solve this problem. However, there are some optimizations and improvements that can be made to make it more efficient and easier to understand.

  1. Hierarchical Query Optimization: Instead of joining the reprex_neighbors_table directly in the main query, you can use a hierarchical query to traverse the relationships between segments. This can reduce the need for joins and simplify the query structure.

  2. Path Generation Optimization: You can generate the full path of segments within each record_time partition directly within the hierarchical query using the SYS_CONNECT_BY_PATH function. This eliminates the need for additional processing in subsequent CTEs.

  3. Level Filtering: Instead of filtering by the highest level using ROW_NUMBER(), you can use the CONNECT_BY_ISLEAF pseudocolumn to filter out intermediate nodes and keep only leaf nodes in the hierarchy.

Here's an optimized version of your query:

WITH add_neigh_path AS (
    SELECT
        r.original_order,
        r.desired_order,
        r.segment_id,
        r.record_time,
        to_char(r.record_time, 'HH24:MI:SS') AS r_time,
        CONNECT_BY_ROOT r.segment_id AS root_segment_id,
        SYS_CONNECT_BY_PATH(r.segment_id, '/') AS path,
        LEVEL AS path_length
    FROM
        reprex_segment_data r
    START WITH
        NOT EXISTS (
            SELECT 1
            FROM reprex_neighbors_table n
            WHERE n.segment_id = r.segment_id
        )
    CONNECT BY
        PRIOR r.segment_id = n.segment_id
    AND
        PRIOR r.record_time = r.record_time
)
SELECT
    original_order,
    desired_order,
    segment_id,
    record_time,
    r_time,
    path,
    path_length
FROM
    add_neigh_path
WHERE
    CONNECT_BY_ISLEAF = 1
ORDER BY
    record_time,
    path;

In this version:

  • The hierarchical query starts with segments that do not have any children in the reprex_neighbors_table, avoiding unnecessary joins.
  • The CONNECT_BY_ROOT function is used to get the root segment of each path directly within the hierarchical query.
  • Filtering is done using the CONNECT_BY_ISLEAF pseudocolumn to keep only leaf nodes in the hierarchy, which represent the end of each path.

This should provide a more efficient and concise solution to your problem.