Pandas: Turn multiple variables into a single set of dummy variables
I have a column with categories (A, B, C, D) I want to turn into dummy variables. Problem is, this column can contain multiple categories per row, like this:
DF = pd.DataFrame({'Col':['A', 'A, B', 'A, C', 'B, C, D', 'D']})
       Col
0        A
1     A, B
2     A, C
3  B, C, D
4        D
My thought at this point is to first split the variable into multiple fields using ',' as the delimiter, then dummy-code the results. Something like this:
DF2 = DF['Col'].str.split(', ', expand = True)
   0     1     2
0  A  None  None
1  A     B  None
2  A     C  None
3  B     C     D
4  D  None  None
pd.get_dummies(DF2)
   0_A  0_B  0_D  1_B  1_C  2_D
0    1    0    0    0    0    0
1    1    0    0    1    0    0
2    1    0    0    0    1    0
3    0    1    0    0    1    1
4    0    0    1    0    0    0
Finally, run some sort of loop across the columns to create a single set of dummy variables for A, B, C, and D. This works, but it gets quite tedious with many more variables/categories. Is there an easier way to achieve this?
Answer
Yes, there is a much easier way to achieve this in pandas. Instead of splitting the column and creating dummy variables for each split column, you can call the str.get_dummies method directly on the column; pd.get_dummies is not needed. Here's how you can do it:
import pandas as pd
# Create the initial DataFrame
DF = pd.DataFrame({'Col': ['A', 'A, B', 'A, C', 'B, C, D', 'D']})
# Generate dummy variables directly
dummy_df = DF['Col'].str.get_dummies(sep=', ')
print(dummy_df)
Output:
   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  0  1  0
3  0  1  1  1
4  0  0  0  1
Explanation:
- str.get_dummies: The method splits the value in each cell of the Col column on the specified delimiter (sep=', '), then creates a dummy variable for each unique category. The delimiter is matched literally, so it has to include the space after the comma.
- Output: The resulting DataFrame contains one column for each unique category (A, B, C, D) and fills each row with 1 or 0 to indicate the presence or absence of that category in the corresponding row.
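To make the mechanism concrete, the sketch below rebuilds the same table by hand: split each cell, move every category onto its own row with explode, dummy-code, then collapse back to one row per original index. It only illustrates what str.get_dummies does conceptually, not how pandas implements it internally, and the names direct, exploded, and manual are made up for this example.

```python
import pandas as pd

DF = pd.DataFrame({'Col': ['A', 'A, B', 'A, C', 'B, C, D', 'D']})

# One-step version: split on ', ' and dummy-code in a single call
direct = DF['Col'].str.get_dummies(sep=', ')

# Manual version: split -> one category per row -> dummy-code
exploded = DF['Col'].str.split(', ').explode()
per_category = pd.get_dummies(exploded, dtype=int)

# Rows that came from the same original cell share an index label,
# so taking the max per label merges them back into one row each
manual = per_category.groupby(level=0).max()

print(manual)
```

Both direct and manual should hold the same 0/1 indicators; the manual route is mainly useful when you need to clean or filter the individual categories before dummy-coding.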
This approach is straightforward, avoids the intermediate split and loop, and the same one-liner works no matter how many categories the column contains.
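In practice you will often want the dummy columns next to the original data, and real delimiters are not always consistently spaced. The following is a minimal sketch under those assumptions; the messy sample data and the names messy, cleaned, and result are invented for illustration.

```python
import pandas as pd

# Hypothetical input with inconsistent spacing around the commas
messy = pd.DataFrame({'Col': ['A', 'A,B', 'A ,  C', 'B, C,D', ' D']})

# Normalise first: trim the ends and remove whitespace around each comma
cleaned = (
    messy['Col']
    .str.strip()
    .str.replace(r'\s*,\s*', ',', regex=True)
)

# Dummy-code on the bare comma, then attach the result to the original frame
result = messy.join(cleaned.str.get_dummies(sep=','))

print(result)
```

Without the normalisation step, values such as ' D' and 'D' would be treated as different categories and end up in separate columns.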