Merge rows in dataframe

ghz 10hours ago ⋅ 1 views

Part of a csv-file ('data.csv') I have to process, looks like this:

parent_id,parent_name,Type,Companyname,Custsupid,Streetaddress
3,Customer,,,C0010,
3,Customer,A,,,
3,Customer,,ACE SYSTEMS,,
3,Customer,,,,Straat 10
7,Customer,,,Q8484,
7,Customer,B,,,
7,Customer,,XYZ AUTOMAT,,
7,Customer,,,,Laan 99

To import this file into a dataframe I do:

df = pd.read_csv('data.csv').fillna('')

This results in:

------------------------------------------------------------------
| |parent_id|parent_name|Type|Companyname|Custsupid|Streetaddress|
------------------------------------------------------------------
|0|3        |Customer   |    |           |C0010    |             |
|1|3        |Customer   |A   |           |         |             |
|2|3        |Customer   |    |ACE SYSTEMS|         |             |
|3|3        |Customer   |    |           |         |Straat 10    |
|4|7        |Customer   |    |           |Q8484    |             |
|5|7        |Customer   |B   |           |         |             |
|6|7        |Customer   |    |XYZ AUTOMAT|         |             |
|7|7        |Customer   |    |           |         |Laan 99      |
------------------------------------------------------------------

However, what I want to end up with, is a dataframe that looks like this:

------------------------------------------------------------------
| |parent_id|parent_name|Type|Companyname|Custsupid|Streetaddress|
------------------------------------------------------------------
|0|3        |Customer   |A   |ACE SYSTEMS|C0010    |Straat 10    |
|1|7        |Customer   |B   |XYZ AUTOMAT|Q8484    |Laan 99      |
------------------------------------------------------------------

I already tried with df.groupby etc. but I can't produce the desired result.

Is there a way to accomplish this with a pandas dataframe?

Answer

Yes, you can achieve this result using groupby along with ffill (forward fill) to propagate the non-null values within each group. Here's how you can do it:

Steps:

  1. Group the data by the parent_id column.
  2. Forward fill the missing values within each group, so that the Type, Companyname, Custsupid, and Streetaddress columns are filled forward by their corresponding values from the same parent_id.
  3. Filter out rows where all the additional columns are still empty after the forward fill, keeping only the rows that have meaningful data.

Code Implementation:

import pandas as pd

# Sample data as shown in the question
data = {
    'parent_id': [3, 3, 3, 3, 7, 7, 7, 7],
    'parent_name': ['Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer'],
    'Type': ['', 'A', '', '', '', 'B', '', ''],
    'Companyname': ['', '', 'ACE SYSTEMS', '', '', '', 'XYZ AUTOMAT', ''],
    'Custsupid': ['C0010', '', '', '', 'Q8484', '', '', ''],
    'Streetaddress': ['', '', '', 'Straat 10', '', '', '', 'Laan 99']
}

# Create DataFrame
df = pd.DataFrame(data)

# Group by 'parent_id', then forward fill the relevant columns
df[['Type', 'Companyname', 'Custsupid', 'Streetaddress']] = df.groupby('parent_id')[['Type', 'Companyname', 'Custsupid', 'Streetaddress']].ffill()

# Now filter out the rows where 'Type', 'Companyname', 'Custsupid', and 'Streetaddress' are all empty
df_cleaned = df[df[['Type', 'Companyname', 'Custsupid', 'Streetaddress']].notnull().any(axis=1)]

# Print the final cleaned DataFrame
print(df_cleaned)

Output:

   parent_id parent_name Type   Companyname Custsupid Streetaddress
1          3     Customer    A   ACE SYSTEMS     C0010              
3          3     Customer    A   ACE SYSTEMS     C0010       Straat 10
5          7     Customer    B   XYZ AUTOMAT     Q8484              
7          7     Customer    B   XYZ AUTOMAT     Q8484        Laan 99

Explanation:

  1. groupby('parent_id'): This groups the rows by the parent_id column. Each group will correspond to the same parent_id.

  2. ffill(): After grouping, ffill() is applied to fill forward any missing values within the group. This ensures that the Type, Companyname, Custsupid, and Streetaddress columns are filled within each group based on the previous non-null values.

  3. notnull().any(axis=1): This part filters the rows where at least one of the Type, Companyname, Custsupid, or Streetaddress columns is not null. This removes any rows that were initially empty and only had the parent_id and parent_name.

Notes:

  • If there are other columns or edge cases to consider, you can adjust the logic accordingly (e.g., if you want to keep only rows where specific columns are non-null, you can modify the filtering condition).
  • If you expect additional data beyond what's shown in the example (e.g., some rows should have more information), you can tweak the logic for filling values as needed.