Changing boolean value within a DataFrame iterrows does nothing

ghz · yesterday · 1 view

The general idea is:

  1. Read a DataFrame from Excel
  2. Add a new column where I can identify valid and invalid rows (in this example, values are initialized at None, but I've also tried initializing at False and 0)
  3. Iterate through DataFrame and assign values to the new column depending on a series of tests. (unfortunately, the tests are too complex to use applymap or similar)

The expected result is to print the DataFrame and see the values of the new column properly assigned.

The actual result is that the values have not changed at all; they all remain at their initial value.

Here's a highly simplified code sample that reproduces the behavior.

import pandas as pd

df = pd.read_csv('./some_file.csv', sep='\t')

print(df)   # View outputs below

df['Valid'] = [None for _ in range(len(df))]    # New column

for n in df.iterrows():
    if pd.notnull(n[1].Name):    # Example test
        n[1].Valid = False
    else:    # else clause should ensure all values change
        n[1].Valid = True

print(df)    # Shows all df.Valid values are still None

Output 1 after read_csv()

    Name  Age
0   John   20
1  Mandy   25
2   Mike   30
3    NaN   40
4   Alex   35

Output 2 at the end of the code:

    Name  Age Valid
0   John   20  None
1  Mandy   25  None
2   Mike   30  None
3    NaN   40  None
4   Alex   35  None

In the example above, when printing df, all values are still None, even though the debugger shows the process stepping into the assignment statements.

Is there some sort of scope, reference, or mutability issue that I'm not seeing? Can anybody explain this behavior?

Running python 3.6.0

For a sample data set, I manually built a tab-separated some_file.csv file (the 4th row has an empty string for Name while Age is 40; the 5th row is blank and is skipped by read_csv for obvious reasons):

Name    Age
John    20
Mandy   25
Mike    30
    40

Alex    35

Answer

The issue you're encountering stems from how you're iterating over the DataFrame rows and trying to modify the values. When using df.iterrows(), you're working with a copy of the row, not a reference to the original row in the DataFrame. This means that modifications made to the row (inside the loop) don't affect the original DataFrame.

Here's a more detailed explanation:

Explanation:

  • iterrows(): When you call iterrows(), it returns an iterator that yields each row as a tuple of (index, Series). The Series object is a copy of the row data, not a reference to the original row. Therefore, any modification to the Series inside the loop does not alter the DataFrame itself.
  • Mutability: While a pandas.Series is mutable (you can modify its values), modifying a copy does not change the original DataFrame.

Solution 1: Modify the DataFrame directly by index

Instead of modifying the Valid column inside the loop by modifying the row object (n[1]), you should assign the values directly to the DataFrame using the index.

Here's how you can rewrite the code:

import pandas as pd

df = pd.read_csv('./some_file.csv', sep='\t')

print("Original DataFrame:")
print(df)

# Create the 'Valid' column initialized with None
df['Valid'] = None

# Iterate through DataFrame rows by index and modify the 'Valid' column
for index, row in df.iterrows():
    if pd.notnull(row['Name']):    # Example test
        df.at[index, 'Valid'] = False  # Modify the original DataFrame using 'at'
    else:
        df.at[index, 'Valid'] = True   # Modify the original DataFrame using 'at'

print("Modified DataFrame:")
print(df)

Key Points:

  • df.at[index, 'Valid']: This modifies a specific cell in the DataFrame using the row's label. .at is a fast scalar accessor, more efficient than .loc for single values, and it writes directly into the DataFrame rather than into a copy.
  • No need to create a list: You can directly update the Valid column values within the loop.
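Note that iterrows() yields the row label (not the positional offset), which is exactly what .at expects, so this pattern also works for a non-default index. A minimal sketch with hypothetical string labels:

```python
import pandas as pd

# Hypothetical frame with string labels instead of the default 0..n-1 index
df = pd.DataFrame({'Name': ['John', None]}, index=['a', 'b'])
df['Valid'] = None

# iterrows() yields the row *label*, which is exactly what .at expects
for label, row in df.iterrows():
    df.at[label, 'Valid'] = bool(pd.isnull(row['Name']))

print(df['Valid'].tolist())    # [False, True]
```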

Expected Output:

Original DataFrame:
    Name  Age
0   John   20
1  Mandy   25
2   Mike   30
3    NaN   40
4   Alex   35

Modified DataFrame:
    Name  Age Valid
0   John   20 False
1  Mandy   25 False
2   Mike   30 False
3    NaN   40  True
4   Alex   35 False

Solution 2: Use apply() for more complex logic (if possible)

If the logic for determining whether a row is valid can be expressed as a function, you can use the apply() method to build the Valid column, which is usually more concise and readable than manually iterating over rows.

import pandas as pd

df = pd.read_csv('./some_file.csv', sep='\t')

print("Original DataFrame:")
print(df)

# Define the validation logic as a function
def validate_row(row):
    return pd.isnull(row['Name'])    # True only when Name is missing

# Apply the function to each row and update the 'Valid' column
df['Valid'] = df.apply(validate_row, axis=1)

print("Modified DataFrame:")
print(df)

Expected Output:

Original DataFrame:
    Name  Age
0   John   20
1  Mandy   25
2   Mike   30
3    NaN   40
4   Alex   35

Modified DataFrame:
    Name  Age Valid
0   John   20 False
1  Mandy   25 False
2   Mike   30 False
3    NaN   40  True
4   Alex   35 False

This approach is typically more efficient and concise, especially if you have more complex validation logic.
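As a side note: the question says the real tests are too complex for this, but for the simplified Name check used throughout the example, a single vectorized expression would replace both solutions:

```python
import pandas as pd

# Same sample data as in the question, built in memory
df = pd.DataFrame({'Name': ['John', 'Mandy', 'Mike', None, 'Alex'],
                   'Age': [20, 25, 30, 40, 35]})

# One vectorized operation: True exactly where Name is missing
df['Valid'] = df['Name'].isnull()

print(df)
```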