Python Unreproducible UnicodeDecodeError

ghz 12hours ago ⋅ 3 views

I'm trying to replace a substring in a Word file using the following Python code. On its own the code works perfectly fine, even with the exact same Word file, but when I embed it in a larger project, it throws an error at exactly that spot. I'm clueless as to what causes it, as it seemingly has nothing to do with the code, and I can't reproduce it in isolation.

Side note: I know what triggers the error: a German 'ü' in the Word file. It's needed, though, and removing it doesn't seem like the right solution if the code works standalone.

#foo.py
from bar import make_wordm
def main(uuid):
    with open('foo.docm', 'w+') as f:
        f.write(make_wordm(uuid=uuid))

main('1cb02f34-b331-4616-8d20-aa1821ef0fbd')

foo.py imports bar.py for doing the heavy lifting.

#bar.py
import tempfile
import shutil
from cStringIO import StringIO
from zipfile import ZipFile, ZipInfo

WORDM_TEMPLATE='./res/template.docm'
MODE_DIRECTORY = 0x10

def zipinfo_contents_replace(zipfile=None, zipinfo=None,
                             search=None, replace=None):
    dirname = tempfile.mkdtemp()
    fname = zipfile.extract(zipinfo, dirname)
    with open(fname, 'r') as fd:
        contents = fd.read().replace(search, replace)
    shutil.rmtree(dirname)
    return contents

def make_wordm(uuid=None, template=WORDM_TEMPLATE):
    with open(template, 'r') as f:
        input_buf = StringIO(f.read())
    output_buf = StringIO()
    output_zip = ZipFile(output_buf, 'w')

    with ZipFile(input_buf, 'r') as doc:
        for entry in doc.filelist:
            if entry.external_attr & MODE_DIRECTORY:
                continue

            contents = zipinfo_contents_replace(zipfile=doc, zipinfo=entry,
                                        search="00000000-0000-0000-0000-000000000000"
                                        , replace=uuid)
            output_zip.writestr(entry, contents)
    output_zip.close()
    return output_buf.getvalue()

The following error is thrown when embedding the same code in a larger scale context:

ERROR:root:message
Traceback (most recent call last):
  File "FooBar.py", line 402, in foo_bar
    bar = bar_constructor(bar_theme,bar_user,uuid)
  File "FooBar.py", line 187, in bar_constructor
    if(main(uuid)):
  File "FooBar.py", line 158, in main
    f.write(make_wordm(uuid=uuid))
  File "/home/foo/FooBarGen.py", line 57, in make_wordm
    search="00000000-0000-0000-0000-000000000000", replace=uuid)
  File "/home/foo/FooBarGen.py", line 24, in zipinfo_contents_replace
    contents = fd.read().replace(search, replace)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128)
INFO:FooBar:None

edit: Upon further debugging, it seems the variable 'uuid' is causing the issue. When I pass the parameter as a string literal ('1cb02f34-b331-4616-8d20-aa1821ef0fbd') instead of the value parsed from JSON, it works perfectly fine.

edit2: I had to add uuid = uuid.encode('utf-8', 'ignore') and it works perfectly fine now.

Answer

The traceback and your edits point to a Python 2 str/unicode mix-up rather than to the Word file itself (the cStringIO import shows that this code runs on Python 2). The error you're seeing (UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128)) is raised when Python 2 implicitly decodes a byte string with the default ASCII codec and hits a non-ASCII byte. Here that byte is 0xc3, the first byte of the UTF-8 encoding of ü.

Here's a breakdown of why this happens:

  1. Implicit decoding: In Python 2, fd.read() returns a plain byte string (str) and performs no decoding at all. str.replace() works byte for byte as long as search and replace are byte strings too. If either argument is a unicode object, Python first converts the file contents to unicode using the default ASCII codec, and that conversion fails on the first non-ASCII byte, here the 0xc3 of ü.

  2. The uuid variable: In Python 2, the json module returns every string as a unicode object, even when it contains only ASCII characters. A UUID parsed from JSON is therefore unicode, while the literal '1cb02f34-b331-4616-8d20-aa1821ef0fbd' in your standalone test is a byte string. That is why the standalone script works, why the larger project fails, and why the traceback points at fd.read().replace(search, replace) rather than at the file write.
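The mechanism can be reproduced without any Word file. The following is a minimal sketch, not the asker's code: the XML snippet in contents is made up for illustration, and TEXT_TYPE is a helper name introduced here so that the snippet runs on both Python 2 and Python 3.

```python
import json

# unicode on Python 2, str on Python 3 (helper name is an assumption).
TEXT_TYPE = type(u"")

# A UUID that went through the JSON parser is a text object, not bytes.
payload = json.loads('{"uuid": "1cb02f34-b331-4616-8d20-aa1821ef0fbd"}')
parsed_uuid = payload["uuid"]
is_text = isinstance(parsed_uuid, TEXT_TYPE)

# Hypothetical file contents: raw bytes containing the UTF-8 encoding of "ü".
contents = b"<w:t>f\xc3\xbcr 00000000-0000-0000-0000-000000000000</w:t>"
search = b"00000000-0000-0000-0000-000000000000"

# Mixing bytes with a text object fails on either Python version.
try:
    contents.replace(search, parsed_uuid)
    mixed_ok = True
except (UnicodeDecodeError, TypeError) as exc:
    mixed_ok = False
    failure = type(exc).__name__

# Encoding the UUID first keeps the whole operation at the byte level.
fixed = contents.replace(search, parsed_uuid.encode("utf-8"))
```

On Python 2 the caught exception is the UnicodeDecodeError from the traceback; Python 3 rejects the same mix with a TypeError instead of attempting an implicit decode.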

Solution

To fix this, make sure that every operand of the replace() call has the same type. The zip entries are raw bytes (and some entries in a .docm file are binary), so the simplest choice is to keep everything as byte strings and encode uuid before it is used.

Here are two potential fixes to address this issue:

1. Ensure the uuid is encoded correctly before processing

Your own fix, uuid.encode('utf-8', 'ignore'), works for exactly this reason: it turns the unicode object from the JSON parser into a byte string, so replace() no longer triggers an implicit ASCII decode. The 'ignore' argument is not needed, because UTF-8 can encode every character a UUID will contain. Doing the conversion inside make_wordm means callers don't have to remember it.

For example:

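def make_wordm(uuid=None, template=WORDM_TEMPLATE):
    # json.loads() returns unicode objects on Python 2; convert to a byte
    # string so that str.replace() never has to decode the file contents.
    if isinstance(uuid, unicode):
        uuid = uuid.encode('utf-8')

    # Read the template as raw bytes; a .docm file is a zip archive.
    with open(template, 'rb') as f:
        input_buf = StringIO(f.read())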
    output_buf = StringIO()
    output_zip = ZipFile(output_buf, 'w')

    with ZipFile(input_buf, 'r') as doc:
        for entry in doc.filelist:
            if entry.external_attr & MODE_DIRECTORY:
                continue

            # Replace the uuid in the Word file
            contents = zipinfo_contents_replace(zipfile=doc, zipinfo=entry,
                                                search="00000000-0000-0000-0000-000000000000", 
                                                replace=uuid)
            output_zip.writestr(entry, contents)
    output_zip.close()
    return output_buf.getvalue()

Once uuid is a byte string, the replacement is a plain byte-level operation, so non-ASCII bytes such as the UTF-8 encoding of ü in the Word file pass through untouched.

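2. Normalize the arguments where the replacement happens

You could also do the conversion in zipinfo_contents_replace itself, so that the function is safe no matter what its callers pass in. Open the extracted file in binary mode and encode search and replace if they are unicode objects. Decoding the file instead is not a good fit here: the built-in open() in Python 2 has no encoding parameter (that would require io.open), not every entry in a .docm archive is text, and errors='ignore' would silently drop bytes.

def zipinfo_contents_replace(zipfile=None, zipinfo=None, search=None, replace=None):
    dirname = tempfile.mkdtemp()
    try:
        fname = zipfile.extract(zipinfo, dirname)
        # Binary mode: the entry is returned as raw bytes, never decoded.
        with open(fname, 'rb') as fd:
            contents = fd.read()
    finally:
        # Remove the temporary directory even if reading fails.
        shutil.rmtree(dirname)

    # Make both arguments byte strings so replace() stays at the byte level.
    if isinstance(search, unicode):
        search = search.encode('utf-8')
    if isinstance(replace, unicode):
        replace = replace.encode('utf-8')
    return contents.replace(search, replace)

Because the contents are never decoded, the function behaves the same for the XML parts and the binary parts of the archive.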
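If the temporary directory is not needed for anything else, the same byte-level replacement can be done entirely in memory. The following is only a sketch under assumptions: to_bytes and replace_in_archive are hypothetical helper names that do not appear in the original code, and the snippet is written so that it runs on both Python 2.7 and Python 3.

```python
import io
from zipfile import ZipFile

PLACEHOLDER = b"00000000-0000-0000-0000-000000000000"


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding text objects if necessary."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def replace_in_archive(archive_bytes, search, replace):
    """Copy a zip archive, replacing search with replace in every entry."""
    search, replace = to_bytes(search), to_bytes(replace)
    output_buf = io.BytesIO()
    with ZipFile(io.BytesIO(archive_bytes), "r") as src:
        with ZipFile(output_buf, "w") as dst:
            for entry in src.infolist():
                # Skip directory entries; they carry no data.
                if entry.filename.endswith("/"):
                    continue
                # ZipFile.read() returns the decompressed entry as bytes.
                data = src.read(entry)
                dst.writestr(entry, data.replace(search, replace))
    return output_buf.getvalue()
```

A call would look like replace_in_archive(template_bytes, PLACEHOLDER, uuid), where template_bytes is the template read in 'rb' mode. As in the original code, the replacement is applied to every entry, including binary ones, so it relies on the placeholder not occurring by accident in binary data.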

Conclusion

To resolve the issue, the main focus should be on never mixing byte strings and unicode objects in the same operation. The contents read from the Word file are bytes, so uuid and the search string must be bytes as well. Encoding uuid once, as early as possible after it leaves the JSON parser, is enough to resolve the error you're encountering.