ghz 14hours ago ⋅ 4 views

Pandas: Turn multiple variables into a single set of dummy variables

I have a column with categories (A, B, C, D) that I want to turn into dummy variables. The problem is that this column can contain multiple categories per row, like this:

DF = pd.DataFrame({'Col':['A', 'A, B', 'A, C', 'B, C, D', 'D']})

    Col
0   A
1   A, B
2   A, C
3   B, C, D
4   D

My thought at this point is to first split the variable into multiple fields using ',' as the delimiter, then dummy-code the results. Something like this:

DF2 = DF['Col'].str.split(', ', expand = True)

    0   1     2
0   A   None  None
1   A   B     None
2   A   C     None
3   B   C     D
4   D   None  None

pd.get_dummies(DF2)

    0_A 0_B 0_D 1_B 1_C 2_D
0   1   0   0   0   0   0
1   1   0   0   1   0   0
2   1   0   0   0   1   0
3   0   1   0   0   1   1
4   0   0   1   0   0   0

Finally, I would run some sort of loop across the columns to create a single set of dummy variables for A, B, C, and D. This works, but it gets quite tedious with many more variables/categories. Is there an easier way to achieve this?

Answer

Yes, there is a simpler way. Instead of splitting the column and dummy-coding each resulting column, call the str.get_dummies method on the column directly; pd.get_dummies is not needed at all. Here's how you can do it:

import pandas as pd

# Create the initial DataFrame
DF = pd.DataFrame({'Col': ['A', 'A, B', 'A, C', 'B, C, D', 'D']})

# Generate dummy variables directly
dummy_df = DF['Col'].str.get_dummies(sep=', ')

print(dummy_df)

Output:

   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  0  1  0
3  0  1  1  1
4  0  0  0  1
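
If the indicator columns need to sit next to the original column, one option is to join them back by index. This is a minimal sketch; the name combined is only illustrative and not part of the original answer.

```python
import pandas as pd

DF = pd.DataFrame({'Col': ['A', 'A, B', 'A, C', 'B, C, D', 'D']})

# Build the indicator columns, then attach them to the original frame.
# join aligns on the row index, which str.get_dummies preserves.
dummy_df = DF['Col'].str.get_dummies(sep=', ')
combined = DF.join(dummy_df)

print(combined)
```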

Explanation:

  1. str.get_dummies:

    • The str.get_dummies method splits each value in the Col column on the specified delimiter (sep=', ') and creates one dummy variable per unique category. The delimiter has to match the data exactly, including the space after the comma.

  2. Output:

    • The resulting DataFrame has one column for each unique category (A, B, C, D), with 1 or 0 in each row to indicate whether that category is present in the row.
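
To make the mechanics concrete, the same result can be built long-hand, closer to the split-then-dummy-code idea in the question. This is only a sketch of an equivalent route, not a description of how pandas implements str.get_dummies internally; the names exploded and long_hand are illustrative.

```python
import pandas as pd

DF = pd.DataFrame({'Col': ['A', 'A, B', 'A, C', 'B, C, D', 'D']})

# One-step version from the answer.
short = DF['Col'].str.get_dummies(sep=', ')

# Long-hand version:
# 1. split each cell into a list of categories,
# 2. explode so every category gets its own row (the original index repeats),
# 3. dummy-code the single resulting column,
# 4. collapse back to one row per original index, keeping a 1 wherever
#    any of that row's categories matched.
exploded = DF['Col'].str.split(', ').explode()
long_hand = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

# Compare the two results cell by cell.
print((long_hand.values == short.values).all())
```

The long-hand form is mainly useful for understanding or for customizing an intermediate step; for the plain case, the one-liner is shorter.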

This approach is straightforward, avoids the intermediate split and the loop over columns, and needs no code changes as the number of categories grows.
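
One caveat: because sep is matched literally, inconsistent spacing around the commas would produce stray columns such as 'A,B' or ' C'. A minimal sketch of one way to handle that, assuming hypothetical messy input that is not part of the original question, is to normalize the delimiter first:

```python
import pandas as pd

# Hypothetical input with inconsistent spacing around the commas.
messy = pd.Series(['A', 'A,B', 'A ,  C', ' B, C,D ', 'D'])

# Trim the ends, then collapse any whitespace around each comma
# so that a bare ',' can be used as the separator.
normalized = messy.str.strip().str.replace(r'\s*,\s*', ',', regex=True)
dummies = normalized.str.get_dummies(sep=',')

print(dummies)
```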