Skipping char-string values while comparing to a column values o

ghz 11hours ago ⋅ 2 views

Skipping char-string values while comparing to a column values of mixed type to int or float in pandas Dataframe

I have a dataframe in which one column have mixed type values in it:

df
name   ref
 a     100
 b     103.78
 c     own
 d     108
 e     abc@yahoo.com
 f     110.45

So the ref col has mixed type. Now I have to query on it like:

 d = df[df['ref'] > 105]

 # got error can't comapre str() with int
 # so i did changed the dtype to float
 df['ref'] = df['ref'].astype(float)
 # cant convert str to float 'abc@yahoo.com'

After googling it, I can now convert the column values to float in one go:

 # using 
 df['ref'] = df['ref'].convert_objects(convert_numeric=True)

But demerit is, it makes the char strings to Nan, which is causing problem in next df operation. of-course my query d = df[df['ref'] > 105] is working.

But is there a way to skip checking char string while comparing the column values to int and vice versa...so that my query will work and the char string will remain as it's.

Something like:

d = df[df['ref']>103].skip_charstrings=True
## then i don't have to worry about char or nan or dtype as a whole.

Answer

To achieve your goal of skipping the non-numeric (string) values while performing the comparison in the ref column without converting the entire column to NaN, you can apply a more selective filtering approach. This will allow you to avoid errors when comparing values in a column with mixed types.

Here's how you can handle it:

Solution

You can use pd.to_numeric with the errors='coerce' argument to convert values to numeric while keeping non-numeric ones as NaN. After that, you can apply the comparison by checking for NaN values before performing the filtering:

import pandas as pd

# Sample DataFrame
data = {
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'ref': [100, 103.78, 'own', 108, 'abc@yahoo.com', 110.45]
}
df = pd.DataFrame(data)

# Convert 'ref' column to numeric, invalid parsing will be set to NaN
df['ref_numeric'] = pd.to_numeric(df['ref'], errors='coerce')

# Now perform the query, ignoring NaN values
d = df[df['ref_numeric'] > 105]

print(d)

Explanation:

  1. pd.to_numeric(df['ref'], errors='coerce'): This will convert values in the ref column to numeric types, coercing any non-numeric value (e.g., strings like 'own' or 'abc@yahoo.com') to NaN.
  2. Querying: You can now safely perform the query on the ref_numeric column, and the string values will be excluded from the comparison.

Output:

  name           ref  ref_numeric
1    b        103.78       103.78
3    d           108       108.00
5    f        110.45       110.45

Additional Notes:

  • If you don't want the NaN values in the output DataFrame, you can simply filter them out using dropna() on the ref_numeric column:
    d = df[df['ref_numeric'] > 105].dropna(subset=['ref_numeric'])
    
  • This method avoids the need to change the ref column directly or convert it to NaN for the entire DataFrame.

This way, you keep the original ref column intact and only manipulate a new numeric column for your operations.