Skipping char-string values while comparing to a column values of mixed type to int or float in pandas Dataframe
I have a dataframe in which one column have mixed type values in it:
df
name ref
a 100
b 103.78
c own
d 108
e abc@yahoo.com
f 110.45
So the ref
col has mixed type. Now I have to query on it like:
d = df[df['ref'] > 105]
# got error can't comapre str() with int
# so i did changed the dtype to float
df['ref'] = df['ref'].astype(float)
# cant convert str to float 'abc@yahoo.com'
After googling it, I can now convert the column values to float in one go:
# using
df['ref'] = df['ref'].convert_objects(convert_numeric=True)
But demerit is, it makes the char
strings to Nan
, which is causing problem in next df operation. of-course my query d = df[df['ref'] > 105]
is working.
But is there a way to skip checking char string while comparing the column values to int and vice versa...so that my query will work and the char string will remain as it's.
Something like:
d = df[df['ref']>103].skip_charstrings=True
## then i don't have to worry about char or nan or dtype as a whole.
Answer
To achieve your goal of skipping the non-numeric (string) values while performing the comparison in the ref
column without converting the entire column to NaN
, you can apply a more selective filtering approach. This will allow you to avoid errors when comparing values in a column with mixed types.
Here's how you can handle it:
Solution
You can use pd.to_numeric
with the errors='coerce'
argument to convert values to numeric while keeping non-numeric ones as NaN
. After that, you can apply the comparison by checking for NaN
values before performing the filtering:
import pandas as pd
# Sample DataFrame
data = {
'name': ['a', 'b', 'c', 'd', 'e', 'f'],
'ref': [100, 103.78, 'own', 108, 'abc@yahoo.com', 110.45]
}
df = pd.DataFrame(data)
# Convert 'ref' column to numeric, invalid parsing will be set to NaN
df['ref_numeric'] = pd.to_numeric(df['ref'], errors='coerce')
# Now perform the query, ignoring NaN values
d = df[df['ref_numeric'] > 105]
print(d)
Explanation:
pd.to_numeric(df['ref'], errors='coerce')
: This will convert values in theref
column to numeric types, coercing any non-numeric value (e.g., strings like'own'
or'abc@yahoo.com'
) toNaN
.- Querying: You can now safely perform the query on the
ref_numeric
column, and the string values will be excluded from the comparison.
Output:
name ref ref_numeric
1 b 103.78 103.78
3 d 108 108.00
5 f 110.45 110.45
Additional Notes:
- If you don't want the
NaN
values in the output DataFrame, you can simply filter them out usingdropna()
on theref_numeric
column:d = df[df['ref_numeric'] > 105].dropna(subset=['ref_numeric'])
- This method avoids the need to change the
ref
column directly or convert it toNaN
for the entire DataFrame.
This way, you keep the original ref
column intact and only manipulate a new numeric column for your operations.