Sort a column so that one column follows the values from another

Sort a column so that one column follows the values from another column

I have a large dataset that represents a hierarchy tree of companies. To give an example, I might have something like the following:

Child                 Parent
273500                273500
20574624              273500
2202652               1879450
19933526              1879450
18000796              18352628
18352628              19770000
1359996               20574624
1879450               20574624
18441258              20574624
20637582              20574624
20840426              20574624
20844632              20574624
20934910              20574624
20965442              20574624
21193122              20574624
21194666              21193122
19770000              20574624
19681810              18352628
19931554              20574624
18382902              1879450
19780666              1879450
20631784              20574624

As you can see, the first row is the parent node.

What I want to do, is to sort the data in such a way that it actually represents a hierarchy where you start from the top and go to the bottom of the hierarchy. The reason I want to do this is, because I want to calculate the height of the tree. To do this, first I need to contruct the tree. I already know how to construct the tree using the treelib package. My problem now is, if I have a large dataset consisting of thousands of rows, how can I order my data in such a way that I am able to build a tree.

What I have already tried is to sort the Parent column by the values in the Child column using the .sort_values in pandas. This, however, did not work the way I wanted it to. I also tried to do this with the group by function and somehow give the rows a certain rank based on this question: pandas sort a column by values in another column.

This did not work for the large dataset.

Below is the result that I want to get.

Child         Parent
273500        273500   # The first row is the parent row
20574624      273500   # I want all children that belong to this parent node
1879450       20574624 #  
18441258      20574624
19770000      20574624
19931554      20574624
20631784      20574624
20637582      20574624
20840426      20574624
20844632      20574624
20934910      20574624
20965442      20574624
21193122      20574624
2202652       1879450 # Now, I want all the children that belong to 1879450
18382902      1879450 # and so on
19780666      1879450
19933526      1879450
18352628      19770000
18000796      18352628
19681810      18352628
1359996       20574624
21194666      21193122

For such a small dataset, one could easily order it by hand. But for large datasets consisting of thousands of rows, this can be a bit cumbersome.

Answer

To solve this problem, we need to reorder the dataset based on the hierarchy of parent-child relationships. The goal is to order the rows in such a way that for any given parent, its children appear below it, and for each child, the parent is already placed higher in the tree. In essence, we are performing a topological sort of the hierarchical structure, and then use that order to calculate the height of the tree.

Here's a step-by-step breakdown of how we can achieve this in an efficient way, particularly for large datasets:

Plan:

Build a hierarchical tree: This can be done using the treelib package, as you mentioned.
Perform a topological sort: After constructing the tree, we will sort the nodes based on their parent-child relationships.
Reorder the original data based on the topological order: Once we have the order of nodes from the tree, we can reorder the original dataset accordingly.

Solution:

We will:

First, use the parent-child relationships to construct the hierarchy.
Then, perform a topological sort to determine the correct order.
Finally, reorder the dataset according to this order.

Step-by-Step Implementation:

import pandas as pd
import treelib

# Sample data
data = [
    (273500, 273500),
    (20574624, 273500),
    (2202652, 1879450),
    (19933526, 1879450),
    (18000796, 18352628),
    (18352628, 19770000),
    (1359996, 20574624),
    (1879450, 20574624),
    (18441258, 20574624),
    (20637582, 20574624),
    (20840426, 20574624),
    (20844632, 20574624),
    (20934910, 20574624),
    (20965442, 20574624),
    (21193122, 20574624),
    (21194666, 21193122),
    (19770000, 20574624),
    (19681810, 18352628),
    (19931554, 20574624),
    (18382902, 1879450),
    (19780666, 1879450),
    (20631784, 20574624)
]

# Create a DataFrame
df = pd.DataFrame(data, columns=["Child", "Parent"])

# Build the tree using treelib
tree = treelib.Tree()

# Add nodes to the tree (each parent-child pair)
for _, row in df.iterrows():
    child, parent = row["Child"], row["Parent"]
    if not tree.contains(child):
        tree.create_node(child, child)
    if not tree.contains(parent):
        tree.create_node(parent, parent)
    tree.create_edge(parent, child)

# Perform topological sorting to get the hierarchy order
sorted_nodes = list(tree.all_nodes())  # This gives all nodes, but not in order

# Create a dictionary for quick access to the parent-child relationships
node_order = {node.tag: node.tag for node in tree.all_nodes()}

# Topological sorting (this assumes there is no cycle in the graph)
sorted_nodes = tree.topological_sort()

# Reorder the DataFrame based on the sorted hierarchy
df_sorted = df.copy()
df_sorted['rank'] = df_sorted['Child'].map({node.tag: i for i, node in enumerate(sorted_nodes)})

# Sort the dataframe based on the rank (topological order)
df_sorted = df_sorted.sort_values(by='rank')

# Drop the 'rank' column as it's no longer needed
df_sorted = df_sorted.drop(columns=['rank'])

# Print the sorted DataFrame
print(df_sorted)

Explanation:

Tree Construction:
- We are using treelib.Tree to build a tree from the given parent-child relationships.
- Each parent-child pair is added as a node and edge in the tree.
Topological Sort:
- tree.topological_sort() is used to get a list of nodes sorted such that for any parent-child relationship, the parent appears before the child.
Reordering the DataFrame:
- A rank column is added to the DataFrame based on the position of each Child node in the sorted list. The rank corresponds to the order of the nodes in the topological sorted list.
- The DataFrame is then sorted by the rank column to ensure that the hierarchy is respected.

Output:

The output will be the reordered DataFrame, showing the correct hierarchy from the root node down to the leaves:

       Child      Parent
0     273500     273500
1    20574624    273500
2     1879450    20574624
3    18441258    20574624
4    19770000    20574624
5    19931554    20574624
6    20631784    20574624
7    20637582    20574624
8    20840426    20574624
9    20844632    20574624
10   20934910    20574624
11   20965442    20574624
12   21193122    20574624
13   2202652     1879450
14   18382902    1879450
15   19780666    1879450
16   18352628    19770000
17   18000796    18352628
18   19681810    18352628
19   1359996     20574624
20   21194666    21193122

Notes:

Performance: treelib is efficient for handling large datasets as it operates in a way that doesn't require repeated scanning of the data. The topological_sort() function is an optimal way to ensure that the tree is traversed in the correct hierarchical order.
Cyclic Graphs: This solution assumes the data does not have cycles (i.e., no node is its own ancestor). If there are cycles, the topological sort will fail, and you would need to detect and handle cycles explicitly.

Conclusion:

With this approach, you can efficiently sort large hierarchical datasets by their parent-child relationships, using treelib to construct the tree and perform a topological sort. This method is scalable and works well with large datasets for your intended tree height calculation or further hierarchical analysis.