I have a pandas dataframe
in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that CSV are clean and need only be split on ','). For example, a
should become b
:
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [8]: b
Out[8]:
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
So far, I have tried various simple functions, but the .apply
method seems to only accept one row as return value when it is used on an axis, and I can't get .transform
to work. Any suggestions would be much appreciated!
Example data:
from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
{'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
{'var1': 'b', 'var2': 1},
{'var1': 'c', 'var2': 1},
{'var1': 'd', 'var2': 2},
{'var1': 'e', 'var2': 2},
{'var1': 'f', 'var2': 2}])
I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:
def fun(row):
letters = row['var1']
letters = letters.split(',')
out = np.array([row] * len(letters))
out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)
Answers
You can achieve this by using the str.split
method in combination with explode
. Here’s how you can do it:
- First, use
str.split
to split the comma-separated values in thevar1
column into lists. - Then, use the
explode
method to create a new row for each element in these lists.
Here is the code to do that:
import pandas as pd
# Create the initial DataFrame
a = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
{'var1': 'd,e,f', 'var2': 2}])
# Split the 'var1' column and explode it
a['var1'] = a['var1'].str.split(',')
b = a.explode('var1')
print(b)
Explanation:
a['var1'].str.split(',')
splits each string in thevar1
column into a list of strings.a.explode('var1')
transforms each element in the list to a separate row, maintaining the association with the other columns.
This code will produce the desired DataFrame b
:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
If you want to reset the index, you can add reset_index(drop=True)
:
b = a.explode('var1').reset_index(drop=True)
print(b)
This will give you:
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
This approach is efficient and leverages pandas' built-in functionality to manipulate the DataFrame directly.