pandas: multiple conditions while indexing data frame - unexpected behavior
I am filtering rows in a dataframe by values in two columns.
For some reason the OR operator behaves like I would expect AND operator to behave and vice versa.
My test code:
df = pd.DataFrame({'a': range(5), 'b': range(5) })
# let's insert some -1 values
df['a'][1] = -1
df['b'][1] = -1
df['a'][3] = -1
df['b'][4] = -1
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a != -1) | (df.b != -1)]
print(pd.concat([df, df1, df2], axis=1,
keys = [ 'original df', 'using AND (&)', 'using OR (|)',]))
And the result:
original df using AND (&) using OR (|)
a b a b a b
0 0 0 0 0 0 0
1 -1 -1 NaN NaN NaN NaN
2 2 2 2 2 2 2
3 -1 3 NaN NaN -1 3
4 4 -1 NaN NaN 4 -1
[5 rows x 6 columns]
As you can see, the AND
operator drops every row in which at least one value equals -1
. On the other hand, the OR
operator requires both values to be equal to -1
to drop them. I would expect exactly the opposite result. Could anyone explain this behavior?
I am using pandas 0.13.1.
Answers
The behavior you're seeing is correct and expected for the logical operators &
(AND) and |
(OR) when used with conditional filtering in pandas. However, the results may be confusing based on how the logic works.
Explanation of the Logical Conditions:
-
AND (
&
) Condition:df1 = df[(df.a != -1) & (df.b != -1)]
This means: "Keep rows where both
df.a != -1
anddf.b != -1
areTrue
."- If either
df.a == -1
ordf.b == -1
, the row will be dropped.
- If either
-
OR (
|
) Condition:df2 = df[(df.a != -1) | (df.b != -1)]
This means: "Keep rows where either
df.a != -1
ordf.b != -1
isTrue
."- A row is dropped only if both
df.a == -1
anddf.b == -1
.
- A row is dropped only if both
Walkthrough of Your Data:
Let’s go through the rows in your DataFrame:
df = pd.DataFrame({'a': [0, -1, 2, -1, 4], 'b': [0, -1, 2, 3, -1]})
| Index | a | b | Condition: df.a != -1
| Condition: df.b != -1
| AND (&
) | OR (|
) |
|-----------|-------|-------|------------------------------|-----------------------------|---------------|--------------|
| 0 | 0 | 0 | True
| True
| True
| True
|
| 1 | -1 | -1 | False
| False
| False
| False
|
| 2 | 2 | 2 | True
| True
| True
| True
|
| 3 | -1 | 3 | False
| True
| False
| True
|
| 4 | 4 | -1 | True
| False
| False
| True
|
-
AND (
&
): The row is kept only if botha
andb
are not-1
. Rows 1, 3, and 4 are dropped because eithera
orb
(or both) are-1
. -
OR (
|
): The row is kept if eithera
orb
is not-1
. Only row 1 is dropped because botha
andb
are-1
in that row.
Why It May Seem Reversed:
The confusion likely arises because you're interpreting "AND" as "keep the row if either column is valid" and "OR" as "drop the row if either column is invalid." However, the logical conditions do the opposite:
- AND keeps rows only if both conditions are true (both values are not
-1
). - OR keeps rows if at least one condition is true (at least one value is not
-1
).
Correct Logic Based on Your Expectations:
If you want:
- To drop rows where both
a
andb
are-1
(keep if either is not-1
), use theOR
(|
) operator, which is already correct. - To drop rows where either
a
orb
is-1
, use theAND
(&
) operator, which is also correct.
Summary:
- AND (
&
): Drops rows if eithera
orb
is-1
(i.e., it keeps rows where both are non--1
). - OR (
|
): Drops rows if botha
andb
are-1
(i.e., it keeps rows where at least one is non--1
).
The code you've written behaves as expected based on these logical definitions.