I'm trying to replace a substring in a Word file using the following Python code. On its own the code works perfectly fine, even with the exact same Word file, but when embedded in a larger project it throws an error at exactly that spot. I'm clueless as to what causes it, as it seemingly has nothing to do with the code and I can't reproduce it in isolation.
Side note: I know what triggers the error, it's a German 'ü' in the Word file, but it's needed, and removing it doesn't seem like the right solution if the code works standalone.
#foo.py
from bar import make_wordm

def main(uuid):
    with open('foo.docm', 'w+') as f:
        f.write(make_wordm(uuid=uuid))

main('1cb02f34-b331-4616-8d20-aa1821ef0fbd')
foo.py imports bar.py for doing the heavy lifting.
#bar.py
import tempfile
import shutil
from cStringIO import StringIO
from zipfile import ZipFile, ZipInfo

WORDM_TEMPLATE='./res/template.docm'
MODE_DIRECTORY = 0x10

def zipinfo_contents_replace(zipfile=None, zipinfo=None,
                             search=None, replace=None):
    dirname = tempfile.mkdtemp()
    fname = zipfile.extract(zipinfo, dirname)
    with open(fname, 'r') as fd:
        contents = fd.read().replace(search, replace)
    shutil.rmtree(dirname)
    return contents

def make_wordm(uuid=None, template=WORDM_TEMPLATE):
    with open(template, 'r') as f:
        input_buf = StringIO(f.read())
    output_buf = StringIO()
    output_zip = ZipFile(output_buf, 'w')
    with ZipFile(input_buf, 'r') as doc:
        for entry in doc.filelist:
            if entry.external_attr & MODE_DIRECTORY:
                continue
            contents = zipinfo_contents_replace(zipfile=doc, zipinfo=entry,
                                                search="00000000-0000-0000-0000-000000000000",
                                                replace=uuid)
            output_zip.writestr(entry, contents)
    output_zip.close()
    return output_buf.getvalue()
The following error is thrown when embedding the same code in a larger scale context:
ERROR:root:message
Traceback (most recent call last):
  File "FooBar.py", line 402, in foo_bar
    bar = bar_constructor(bar_theme,bar_user,uuid)
  File "FooBar.py", line 187, in bar_constructor
    if(main(uuid)):
  File "FooBar.py", line 158, in main
    f.write(make_wordm(uuid=uuid))
  File "/home/foo/FooBarGen.py", line 57, in make_wordm
    search="00000000-0000-0000-0000-000000000000", replace=uuid)
  File "/home/foo/FooBarGen.py", line 24, in zipinfo_contents_replace
    contents = fd.read().replace(search, replace)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128)
INFO:FooBar:None
edit: Upon further examination and debugging, it seems the variable 'uuid' is causing the issue. When I pass the parameter as a string literal ('1cb02f34-b331-4616-8d20-aa1821ef0fbd') instead of the value parsed from JSON, it works perfectly fine.
edit2: I had to add uuid = uuid.encode('utf-8', 'ignore') and it works perfectly fine now.
Answer
The issue comes from the type of the uuid variable, not from its contents. In Python 2, strings parsed from JSON are typically unicode objects, while fd.read() returns a plain byte string (str). When str.replace() receives a unicode argument, Python 2 first decodes the byte string implicitly with the default ASCII codec, and that fails on the first non-ASCII byte: 0xc3 is the first byte of the UTF-8 encoding of the ü in your file. Hence the UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128).
Here's a breakdown of why this happens:
- Implicit decoding: In Python 2, fd.read() on a file opened with the built-in open() returns raw bytes; nothing is decoded at that point. The bytes are only decoded, using ASCII, when they are combined with a unicode object. ASCII cannot represent the bytes that encode ü, so a UnicodeDecodeError is raised.
- The uuid variable: A string literal such as '1cb02f34-b331-4616-8d20-aa1821ef0fbd' is a byte string, so the standalone script only ever mixes str with str and never triggers the decode. A uuid that comes from JSON is a unicode object, even though it contains only ASCII characters. That is why the error appears only inside the larger project, and why it is raised in replace() rather than when writing the file.
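The implicit step can be made visible with a short sketch. It performs by hand the ASCII decode that Python 2 applies behind the scenes when a byte string meets a unicode object; the function name and the sample bytes are made up for illustration.

```python
# Bytes as they would sit in the extracted XML: 'ü' is 0xC3 0xBC in UTF-8.
raw = u"f\u00fcr 00000000-0000-0000-0000-000000000000".encode("utf-8")

def implicit_ascii_decode(data):
    """Mimic the coercion Python 2 applies to a byte string mixed with unicode."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        bad = ord(data[exc.start:exc.start + 1])
        return "failed at byte 0x%02x, position %d" % (bad, exc.start)

print(implicit_ascii_decode(raw))  # failed at byte 0xc3, position 1
```

The position differs from the 2722 in the traceback only because the sample string is much shorter than the real document part.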
Solution
To fix this, make sure both sides of the replacement have the same type: either turn the uuid into a byte string before calling replace(), or decode the file contents to unicode explicitly.
Here are two potential fixes:
1. Encode the uuid before processing
As you found, uuid.encode('utf-8', 'ignore') solves the issue: it turns the unicode object into a byte string, so replace() works on bytes only and no implicit decoding takes place. The 'ignore' argument is not needed here, since a UUID contains only ASCII characters.
For example:
def make_wordm(uuid=None, template=WORDM_TEMPLATE):
    # JSON parsing yields unicode objects in Python 2; convert to a byte
    # string so that replace() never mixes str and unicode.
    if isinstance(uuid, unicode):
        uuid = uuid.encode('utf-8')
    # A .docm file is a zip archive, so read it in binary mode.
    with open(template, 'rb') as f:
        input_buf = StringIO(f.read())
    output_buf = StringIO()
    output_zip = ZipFile(output_buf, 'w')
    with ZipFile(input_buf, 'r') as doc:
        for entry in doc.filelist:
            if entry.external_attr & MODE_DIRECTORY:
                continue
            # Replace the placeholder uuid in each archive member
            contents = zipinfo_contents_replace(zipfile=doc, zipinfo=entry,
                                                search="00000000-0000-0000-0000-000000000000",
                                                replace=uuid)
            output_zip.writestr(entry, contents)
    output_zip.close()
    return output_buf.getvalue()
With the uuid encoded to a UTF-8 byte string, the replacement works even if the Word file contains special characters like ü, because the file contents are never decoded.
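The same bytes-only idea can be sketched without the temporary directory, by reading each archive member straight into memory. This is a minimal sketch rather than a drop-in replacement: replace_in_archive and PLACEHOLDER are hypothetical names, and it uses io.BytesIO instead of cStringIO so that it also runs on Python 3.

```python
import io
import zipfile

PLACEHOLDER = b"00000000-0000-0000-0000-000000000000"

def replace_in_archive(template_bytes, new_uuid):
    """Return a copy of the zip archive with PLACEHOLDER replaced in every member."""
    if not isinstance(new_uuid, bytes):
        # A text (unicode) UUID, e.g. from json.loads(), is encoded first
        # so that the replacement below only ever touches bytes.
        new_uuid = new_uuid.encode("utf-8")
    output_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes), "r") as src:
        with zipfile.ZipFile(output_buf, "w") as dst:
            for entry in src.infolist():
                if entry.filename.endswith("/"):
                    continue  # skip directory entries
                data = src.read(entry)  # raw bytes, never decoded
                dst.writestr(entry, data.replace(PLACEHOLDER, new_uuid))
    return output_buf.getvalue()
```

Because nothing is decoded, binary members such as an embedded macro project pass through unchanged.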
2. Decode the file contents explicitly
You could also specify the encoding explicitly when reading the file in the zipinfo_contents_replace function, so that you don't rely on Python's implicit ASCII decoding. Note that Python 2's built-in open() does not accept an encoding argument; io.open() does.
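import io

def zipinfo_contents_replace(zipfile=None, zipinfo=None, search=None, replace=None):
    dirname = tempfile.mkdtemp()
    fname = zipfile.extract(zipinfo, dirname)
    # io.open() accepts encoding= on Python 2; the built-in open() does not.
    # Caution: only text parts (XML) can be decoded. Binary parts such as
    # vbaProject.bin must be read with 'rb' and copied unchanged.
    with io.open(fname, 'r', encoding='utf-8', newline='') as fd:
        contents = fd.read().replace(search, replace)
    shutil.rmtree(dirname)
    # Encode back to a byte string for ZipFile.writestr().
    return contents.encode('utf-8')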
Here encoding='utf-8' makes the file contents decode to unicode, so the replacement happens between two unicode strings. Be careful with errors='ignore': it silently drops every byte that is not valid UTF-8, which would corrupt binary parts of the archive such as vbaProject.bin.
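The risk of ignoring decode errors can be checked in isolation. The sketch below uses made-up sample bytes and a hypothetical helper name; it decodes with errors ignored and encodes again, the way a text-mode rewrite of an archive member would.

```python
def lossy_round_trip(data):
    """Decode with errors ignored, then encode again."""
    return data.decode("utf-8", "ignore").encode("utf-8")

text_part = u"f\u00fcr".encode("utf-8")    # valid UTF-8
binary_part = b"\xff\xfe\x00\x01binary"    # 0xFF and 0xFE are never valid UTF-8

print(lossy_round_trip(text_part) == text_part)      # True
print(lossy_round_trip(binary_part) == binary_part)  # False: bytes were dropped
```

Text parts survive the round trip, while anything that is not valid UTF-8 comes back shorter than it went in.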
Conclusion
To resolve the issue, make sure the uuid variable and the contents read from the Word file have the same string type before replace() is called. Either approach (encoding the uuid to bytes, or decoding the file contents with an explicit encoding) resolves the error you're encountering.