Sort a column so that one column follows the values from another column
I have a large dataset that represents a hierarchy tree of companies. To give an example, I might have something like the following:
Child Parent
273500 273500
20574624 273500
2202652 1879450
19933526 1879450
18000796 18352628
18352628 19770000
1359996 20574624
1879450 20574624
18441258 20574624
20637582 20574624
20840426 20574624
20844632 20574624
20934910 20574624
20965442 20574624
21193122 20574624
21194666 21193122
19770000 20574624
19681810 18352628
19931554 20574624
18382902 1879450
19780666 1879450
20631784 20574624
As you can see, the first row is the parent node.
What I want to do, is to sort the data in such a way that it actually represents a hierarchy where you start from the top and go to the bottom of the hierarchy. The reason I want to do this is, because I want to calculate the height of the tree. To do this, first I need to contruct the tree. I already know how to construct the tree using the treelib package
. My problem now is, if I have a large dataset consisting of thousands of rows, how can I order my data in such a way that I am able to build a tree.
What I have already tried is to sort the Parent column by the values in the Child column using the .sort_values in pandas. This, however, did not work the way I wanted it to. I also tried to do this with the group by function and somehow give the rows a certain rank based on this question: pandas sort a column by values in another column.
This did not work for the large dataset.
Below is the result that I want to get.
Child Parent
273500 273500 # The first row is the parent row
20574624 273500 # I want all children that belong to this parent node
1879450 20574624 #
18441258 20574624
19770000 20574624
19931554 20574624
20631784 20574624
20637582 20574624
20840426 20574624
20844632 20574624
20934910 20574624
20965442 20574624
21193122 20574624
2202652 1879450 # Now, I want all the children that belong to 1879450
18382902 1879450 # and so on
19780666 1879450
19933526 1879450
18352628 19770000
18000796 18352628
19681810 18352628
1359996 20574624
21194666 21193122
For such a small dataset, one could easily order it by hand. But for large datasets consisting of thousands of rows, this can be a bit cumbersome.
Answer
To solve this problem, we need to reorder the dataset based on the hierarchy of parent-child relationships. The goal is to order the rows in such a way that for any given parent, its children appear below it, and for each child, the parent is already placed higher in the tree. In essence, we are performing a topological sort of the hierarchical structure, and then use that order to calculate the height of the tree.
Here's a step-by-step breakdown of how we can achieve this in an efficient way, particularly for large datasets:
Plan:
- Build a hierarchical tree: This can be done using the
treelib
package, as you mentioned. - Perform a topological sort: After constructing the tree, we will sort the nodes based on their parent-child relationships.
- Reorder the original data based on the topological order: Once we have the order of nodes from the tree, we can reorder the original dataset accordingly.
Solution:
We will:
- First, use the parent-child relationships to construct the hierarchy.
- Then, perform a topological sort to determine the correct order.
- Finally, reorder the dataset according to this order.
Step-by-Step Implementation:
import pandas as pd
import treelib
# Sample data
data = [
(273500, 273500),
(20574624, 273500),
(2202652, 1879450),
(19933526, 1879450),
(18000796, 18352628),
(18352628, 19770000),
(1359996, 20574624),
(1879450, 20574624),
(18441258, 20574624),
(20637582, 20574624),
(20840426, 20574624),
(20844632, 20574624),
(20934910, 20574624),
(20965442, 20574624),
(21193122, 20574624),
(21194666, 21193122),
(19770000, 20574624),
(19681810, 18352628),
(19931554, 20574624),
(18382902, 1879450),
(19780666, 1879450),
(20631784, 20574624)
]
# Create a DataFrame
df = pd.DataFrame(data, columns=["Child", "Parent"])
# Build the tree using treelib
tree = treelib.Tree()
# Add nodes to the tree (each parent-child pair)
for _, row in df.iterrows():
child, parent = row["Child"], row["Parent"]
if not tree.contains(child):
tree.create_node(child, child)
if not tree.contains(parent):
tree.create_node(parent, parent)
tree.create_edge(parent, child)
# Perform topological sorting to get the hierarchy order
sorted_nodes = list(tree.all_nodes()) # This gives all nodes, but not in order
# Create a dictionary for quick access to the parent-child relationships
node_order = {node.tag: node.tag for node in tree.all_nodes()}
# Topological sorting (this assumes there is no cycle in the graph)
sorted_nodes = tree.topological_sort()
# Reorder the DataFrame based on the sorted hierarchy
df_sorted = df.copy()
df_sorted['rank'] = df_sorted['Child'].map({node.tag: i for i, node in enumerate(sorted_nodes)})
# Sort the dataframe based on the rank (topological order)
df_sorted = df_sorted.sort_values(by='rank')
# Drop the 'rank' column as it's no longer needed
df_sorted = df_sorted.drop(columns=['rank'])
# Print the sorted DataFrame
print(df_sorted)
Explanation:
-
Tree Construction:
- We are using
treelib.Tree
to build a tree from the given parent-child relationships. - Each parent-child pair is added as a node and edge in the tree.
- We are using
-
Topological Sort:
tree.topological_sort()
is used to get a list of nodes sorted such that for any parent-child relationship, the parent appears before the child.
-
Reordering the DataFrame:
- A
rank
column is added to the DataFrame based on the position of eachChild
node in the sorted list. The rank corresponds to the order of the nodes in the topological sorted list. - The DataFrame is then sorted by the
rank
column to ensure that the hierarchy is respected.
- A
Output:
The output will be the reordered DataFrame, showing the correct hierarchy from the root node down to the leaves:
Child Parent
0 273500 273500
1 20574624 273500
2 1879450 20574624
3 18441258 20574624
4 19770000 20574624
5 19931554 20574624
6 20631784 20574624
7 20637582 20574624
8 20840426 20574624
9 20844632 20574624
10 20934910 20574624
11 20965442 20574624
12 21193122 20574624
13 2202652 1879450
14 18382902 1879450
15 19780666 1879450
16 18352628 19770000
17 18000796 18352628
18 19681810 18352628
19 1359996 20574624
20 21194666 21193122
Notes:
- Performance:
treelib
is efficient for handling large datasets as it operates in a way that doesn't require repeated scanning of the data. Thetopological_sort()
function is an optimal way to ensure that the tree is traversed in the correct hierarchical order. - Cyclic Graphs: This solution assumes the data does not have cycles (i.e., no node is its own ancestor). If there are cycles, the topological sort will fail, and you would need to detect and handle cycles explicitly.
Conclusion:
With this approach, you can efficiently sort large hierarchical datasets by their parent-child relationships, using treelib
to construct the tree and perform a topological sort. This method is scalable and works well with large datasets for your intended tree height calculation or further hierarchical analysis.