Importing a CSV containing special characters with numpy genfromtxt

ghz 11hours ago ⋅ 2 views

I have a CSV file containing special characters. Some cells are arithmetic operations (like "(10/2)"). I would like to import these cells as strings in numpy using np.genfromtxt. What I notice is that it actually imports them as UTF-8 bytes (if I understood correctly). For instance, every time I have a division symbol, I get this code in the numpy array: \xc3\xb7

How can I import these arithmetic operations as readable strings?

Answer

np.genfromtxt does not convert characters on its own. \xc3\xb7 is simply the two-byte UTF-8 encoding of the division symbol ÷ (U+00F7). You see it when the file is read as raw bytes and the array ends up holding byte strings (values shown as b'...') instead of decoded text, which is what happens when no encoding is given in NumPy versions whose default is encoding='bytes'.

To import these arithmetic expressions as readable strings rather than raw UTF-8 bytes, you can follow these steps:
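To see where \xc3\xb7 comes from, here is a small sketch using only the Python standard library (no NumPy involved). It shows the ÷ character being encoded to UTF-8 bytes and decoded again, both with the right codec and with a wrong one. The variable names are illustrative only.

```python
# The division sign is code point U+00F7; UTF-8 stores it as two bytes.
raw = "÷".encode("utf-8")
print(raw)      # b'\xc3\xb7'

# Decoding with the matching codec restores the readable character.
text = raw.decode("utf-8")
print(text)     # ÷

# Decoding with a different codec gives mojibake instead of an error.
wrong = raw.decode("latin-1")
print(wrong)    # Ã·
```

So the bytes in the array are not corrupted; they just have not been decoded as UTF-8 text yet.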

Solution:

  1. Specify dtype='str' in np.genfromtxt: This keeps every value as a string. Without it, np.genfromtxt tries to parse each cell as a float, and non-numeric cells such as "(10/2)" become nan.
  2. Specify the encoding: If your file contains non-ASCII characters (like the division symbol ÷), pass the file's encoding (e.g., 'utf-8') so the bytes are decoded into Unicode text instead of being kept as raw bytes.

Example Code:

import numpy as np

# Load the CSV with dtype as string and specify encoding if needed
data = np.genfromtxt('your_file.csv', delimiter=',', dtype='str', encoding='utf-8')

# Print the loaded data
print(data)

Explanation:

  • dtype='str': Forces np.genfromtxt to treat all data as strings, so it does not try to parse cells as numbers (which would turn the arithmetic expressions into nan).
  • encoding='utf-8': Decodes the file as UTF-8, so multi-byte characters such as the division symbol ÷ come out as a single readable character rather than as \xc3\xb7.

Example:

Let's assume the CSV file your_file.csv contains this data:

age,expression
10,"(10/2)"
12,"(20/5)"

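After reading the file with the code above, data will contain:

[['age' 'expression']
 ['10' '"(10/2)"']
 ['12' '"(20/5)"']]

The header row is included because every line is read as data; pass skip_header=1 to drop it. The arithmetic operations like (10/2) are imported as strings, without being interpreted as numbers. Note that np.genfromtxt does not interpret CSV quoting, so the double quotes remain part of each value; remove them afterwards (for example with np.char.strip(data, '"')) if you do not want them.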
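As a rough end-to-end sketch, the snippet below writes a small UTF-8 file that actually contains the ÷ character and reads it back. The file name sample.csv and its contents are made up for illustration. It assumes a NumPy version that supports the encoding argument (1.14 or later); skip_header=1 drops the header row, and np.char.strip removes the double quotes, which np.genfromtxt keeps as part of each value.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data; the second column uses the real division sign.
csv_text = 'age,expression\n10,"(10÷2)"\n12,"(20÷5)"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(csv_text)

    # Read every cell as text, decode as UTF-8, and skip the header line.
    data = np.genfromtxt(path, delimiter=",", dtype=str,
                         encoding="utf-8", skip_header=1)

# genfromtxt does not interpret CSV quoting, so strip the quotes afterwards.
clean = np.char.strip(data, '"')
print(clean)
```

If the cells can contain commas inside the quotes, splitting on delimiter=',' will break them apart; in that case a quote-aware parser such as the standard csv module is probably a better fit.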

Important Note:

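  • \xc3\xb7 is the UTF-8 byte sequence for ÷, so if you still see b'\xc3\xb7' in the array, the file is UTF-8 but the values are still byte strings: check that encoding='utf-8' is actually being passed and that your NumPy version supports the encoding argument (added in 1.14). If you see Ã· instead, the file was decoded as 'latin1' / 'ISO-8859-1' rather than UTF-8. Only try other encodings such as 'latin1' or 'cp1252' if reading with 'utf-8' raises a UnicodeDecodeError, which indicates the file is not UTF-8 encoded.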

If you still encounter issues with certain characters not being displayed correctly, you might want to inspect the file encoding explicitly before reading it, or use a tool to re-encode it to UTF-8.
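One simple way to inspect the encoding is to try a few candidate codecs on the raw bytes and keep the first one that decodes without error. The helper below is a hypothetical sketch (guess_encoding is not a NumPy or standard library function) and only a heuristic: 'latin-1' accepts any byte sequence, so it has to come last, and a successful decode does not prove the guess is right.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate codec that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings fit")


# UTF-8 bytes for the division sign decode cleanly as UTF-8.
utf8_guess = guess_encoding("10÷2".encode("utf-8"))
print(utf8_guess)      # utf-8

# A lone 0xF7 byte is invalid UTF-8 but is the division sign in cp1252.
legacy_guess = guess_encoding(b"10\xf72")
print(legacy_guess)    # cp1252
```

In practice you would pass in the bytes of your own file, for example open('your_file.csv', 'rb').read(), and then hand the resulting name to the encoding argument of np.genfromtxt.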