Split (explode) pandas dataframe string entry to separate rows

ghz 7months ago ⋅ 125 views

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that CSV are clean and need only be split on ','). For example, a should become b:

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
Out[8]: 
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

So far, I have tried various simple functions, but the .apply method seems to only accept one row as return value when it is used on an axis, and I can't get .transform to work. Any suggestions would be much appreciated!

Example data:

from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)

Answers

You can achieve this by using the str.split method in combination with explode. Here’s how you can do it:

  1. First, use str.split to split the comma-separated values in the var1 column into lists.
  2. Then, use the explode method to create a new row for each element in these lists.

Here is the code to do that:

import pandas as pd

# Create the initial DataFrame
a = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                  {'var1': 'd,e,f', 'var2': 2}])

# Split the 'var1' column and explode it
a['var1'] = a['var1'].str.split(',')
b = a.explode('var1')

print(b)

Explanation:

  1. a['var1'].str.split(',') splits each string in the var1 column into a list of strings.
  2. a.explode('var1') transforms each element in the list to a separate row, maintaining the association with the other columns.

This code will produce the desired DataFrame b:

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

If you want to reset the index, you can add reset_index(drop=True):

b = a.explode('var1').reset_index(drop=True)
print(b)

This will give you:

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

This approach is efficient and leverages pandas' built-in functionality to manipulate the DataFrame directly.