Part of a csv-file ('data.csv') I have to process, looks like this:
parent_id,parent_name,Type,Companyname,Custsupid,Streetaddress
3,Customer,,,C0010,
3,Customer,A,,,
3,Customer,,ACE SYSTEMS,,
3,Customer,,,,Straat 10
7,Customer,,,Q8484,
7,Customer,B,,,
7,Customer,,XYZ AUTOMAT,,
7,Customer,,,,Laan 99
To import this file into a dataframe I do:
df = pd.read_csv('data.csv').fillna('')
This results in:
------------------------------------------------------------------
| |parent_id|parent_name|Type|Companyname|Custsupid|Streetaddress|
------------------------------------------------------------------
|0|3 |Customer | | |C0010 | |
|1|3 |Customer |A | | | |
|2|3 |Customer | |ACE SYSTEMS| | |
|3|3 |Customer | | | |Straat 10 |
|4|7 |Customer | | |Q8484 | |
|5|7 |Customer |B | | | |
|6|7 |Customer | |XYZ AUTOMAT| | |
|7|7 |Customer | | | |Laan 99 |
------------------------------------------------------------------
However, what I want to end up with, is a dataframe that looks like this:
------------------------------------------------------------------
| |parent_id|parent_name|Type|Companyname|Custsupid|Streetaddress|
------------------------------------------------------------------
|0|3 |Customer |A |ACE SYSTEMS|C0010 |Straat 10 |
|1|7 |Customer |B |XYZ AUTOMAT|Q8484 |Laan 99 |
------------------------------------------------------------------
I already tried with df.groupby etc. but I can't produce the desired result.
Is there a way to accomplish this with a pandas dataframe?
Answer
Yes, you can achieve this result using groupby
along with ffill
(forward fill) to propagate the non-null values within each group. Here's how you can do it:
Steps:
- Group the data by the
parent_id
column. - Forward fill the missing values within each group, so that the
Type
,Companyname
,Custsupid
, andStreetaddress
columns are filled forward by their corresponding values from the sameparent_id
. - Filter out rows where all the additional columns are still empty after the forward fill, keeping only the rows that have meaningful data.
Code Implementation:
import pandas as pd
# Sample data as shown in the question
data = {
'parent_id': [3, 3, 3, 3, 7, 7, 7, 7],
'parent_name': ['Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer'],
'Type': ['', 'A', '', '', '', 'B', '', ''],
'Companyname': ['', '', 'ACE SYSTEMS', '', '', '', 'XYZ AUTOMAT', ''],
'Custsupid': ['C0010', '', '', '', 'Q8484', '', '', ''],
'Streetaddress': ['', '', '', 'Straat 10', '', '', '', 'Laan 99']
}
# Create DataFrame
df = pd.DataFrame(data)
# Group by 'parent_id', then forward fill the relevant columns
df[['Type', 'Companyname', 'Custsupid', 'Streetaddress']] = df.groupby('parent_id')[['Type', 'Companyname', 'Custsupid', 'Streetaddress']].ffill()
# Now filter out the rows where 'Type', 'Companyname', 'Custsupid', and 'Streetaddress' are all empty
df_cleaned = df[df[['Type', 'Companyname', 'Custsupid', 'Streetaddress']].notnull().any(axis=1)]
# Print the final cleaned DataFrame
print(df_cleaned)
Output:
parent_id parent_name Type Companyname Custsupid Streetaddress
1 3 Customer A ACE SYSTEMS C0010
3 3 Customer A ACE SYSTEMS C0010 Straat 10
5 7 Customer B XYZ AUTOMAT Q8484
7 7 Customer B XYZ AUTOMAT Q8484 Laan 99
Explanation:
-
groupby('parent_id')
: This groups the rows by theparent_id
column. Each group will correspond to the sameparent_id
. -
ffill()
: After grouping,ffill()
is applied to fill forward any missing values within the group. This ensures that theType
,Companyname
,Custsupid
, andStreetaddress
columns are filled within each group based on the previous non-null values. -
notnull().any(axis=1)
: This part filters the rows where at least one of theType
,Companyname
,Custsupid
, orStreetaddress
columns is not null. This removes any rows that were initially empty and only had theparent_id
andparent_name
.
Notes:
- If there are other columns or edge cases to consider, you can adjust the logic accordingly (e.g., if you want to keep only rows where specific columns are non-null, you can modify the filtering condition).
- If you expect additional data beyond what's shown in the example (e.g., some rows should have more information), you can tweak the logic for filling values as needed.