The general idea is:
- Read a DataFrame from Excel
- Add a new column where I can identify valid and invalid rows (in this example, values are initialized to `None`, but I've also tried initializing to `False` and `0`)
- Iterate through the DataFrame and assign values to the new column depending on a series of tests (unfortunately, the tests are too complex to use `applymap` or similar)
The expected result is to print the DataFrame and see the values of the new column properly assigned.
The actual result is that the values have not changed at all; they all retain their initial value.
Here's a highly simplified code sample, where the behavior is present.
```python
import pandas as pd

df = pd.read_csv('./some_file.csv', sep='\t')
print(df)  # View outputs below

df['Valid'] = [None for _ in range(len(df))]  # New column

for n in df.iterrows():
    if pd.notnull(n[1].Name):  # Example test
        n[1].Valid = False
    else:  # else clause should ensure all values change
        n[1].Valid = True

print(df)  # Shows all df.Valid values are still None
```
Output 1, after `read_csv()`:

```
    Name  Age
0   John   20
1  Mandy   25
2   Mike   30
3    NaN   40
4   Alex   35
```
Output 2, at the end of the code:

```
    Name  Age Valid
0   John   20  None
1  Mandy   25  None
2   Mike   30  None
3    NaN   40  None
4   Alex   35  None
```
In the example above, when printing `df`, all values are still `None`, even though the debugger shows the process stepping into the assignment statements.
Is there some sort of scope, reference, or mutability issue that I'm not seeing? Can anybody explain this behavior?
Running Python 3.6.0.
For a sample data set, I manually built a tab-separated `some_file.csv` file (the 4th row has an empty string for Name while Age is 40; the 5th row is skipped by `read_csv` for obvious reasons):
```
Name	Age
John	20
Mandy	25
Mike	30
	40
Alex	35
```
Answer
The issue you're encountering stems from how you're iterating over the DataFrame rows and trying to modify the values. When using `df.iterrows()`, you're working with a copy of each row, not a reference to the original row in the DataFrame. This means that modifications made to the row inside the loop don't affect the original DataFrame.
Here's a more detailed explanation:
- `iterrows()`: When you call `iterrows()`, it returns an iterator that yields each row as a tuple of `(index, Series)`. The `Series` object is a copy of the row data, not a reference to the original row. Therefore, any modification to the `Series` inside the loop does not alter the DataFrame itself.
- Mutability: While a `pandas.Series` is mutable (you can modify its values), modifying a copy does not change the original DataFrame.
Solution 1: Modify the DataFrame directly by index
Instead of setting the `Valid` column by mutating the row object inside the loop (`n[1]`), assign the values directly to the DataFrame using the index.
Here's how you can rewrite the code:
```python
import pandas as pd

df = pd.read_csv('./some_file.csv', sep='\t')
print("Original DataFrame:")
print(df)

# Create the 'Valid' column initialized with None
df['Valid'] = None

# Iterate through DataFrame rows by index and modify the 'Valid' column
for index, row in df.iterrows():
    if pd.notnull(row['Name']):  # Example test
        df.at[index, 'Valid'] = False  # Modify the original DataFrame using 'at'
    else:
        df.at[index, 'Valid'] = True

print("Modified DataFrame:")
print(df)
```
Key Points:
- `df.at[index, 'Valid']`: This lets you modify a specific cell of the DataFrame by row index. It's fast for scalar access and guarantees that you're modifying the DataFrame directly, not a copy.
- No need to build a list: you can update the `Valid` column values directly within the loop.
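The get/set semantics of `df.at` can also be seen in isolation (a minimal sketch using hypothetical inline data):

```python
import pandas as pd

# Hypothetical two-row frame; 'Valid' starts out as None everywhere
df = pd.DataFrame({'Name': ['John', None]})
df['Valid'] = None

df.at[1, 'Valid'] = True   # writes into the DataFrame itself

print(df.at[1, 'Valid'])   # True
print(df.at[0, 'Valid'])   # None
```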
Expected Output:
```
Original DataFrame:
    Name  Age
0   John   20
1  Mandy   25
2   Mike   30
3    NaN   40
4   Alex   35
Modified DataFrame:
    Name  Age  Valid
0   John   20  False
1  Mandy   25  False
2   Mike   30  False
3    NaN   40   True
4   Alex   35  False
```
Solution 2: Use `apply()` for more complex logic (if possible)
If the logic for determining whether a row is valid can be expressed as a function, you can use the `apply()` method to fill the `Valid` column, which is often more efficient and readable than manually iterating over rows.
```python
import pandas as pd

df = pd.read_csv('./some_file.csv', sep='\t')
print("Original DataFrame:")
print(df)

# Define the validation logic as a function
def validate_row(row):
    # A row is valid exactly when Name is missing
    return pd.isnull(row['Name'])

# Apply the function to each row; the result becomes the 'Valid' column,
# so there is no need to pre-initialize it with None
df['Valid'] = df.apply(validate_row, axis=1)

print("Modified DataFrame:")
print(df)
```
Expected Output:
```
Original DataFrame:
    Name  Age
0   John   20
1  Mandy   25
2   Mike   30
3    NaN   40
4   Alex   35
Modified DataFrame:
    Name  Age  Valid
0   John   20  False
1  Mandy   25  False
2   Mike   30  False
3    NaN   40   True
4   Alex   35  False
```
This approach is typically more efficient and concise, especially if you have more complex validation logic.
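As an aside: whenever the validity test can be expressed with pandas' built-in predicates (which may not hold for the more complex tests mentioned in the question), the loop disappears entirely in favor of a single vectorized assignment. A minimal sketch using hypothetical inline data mirroring the sample file:

```python
import pandas as pd

# Hypothetical inline data mirroring the sample CSV
df = pd.DataFrame({'Name': ['John', 'Mandy', 'Mike', None, 'Alex'],
                   'Age': [20, 25, 30, 40, 35]})

# One vectorized step: a row is "valid" exactly when Name is missing
df['Valid'] = df['Name'].isnull()

print(df['Valid'].tolist())  # [False, False, False, True, False]
```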