I have the following dataset:
'Fʀɪᴇɴᴅ',
'ᴍᴏᴍ',
'ᴍᴀᴋᴇs',
'ʜᴏᴜʀʟʏ',
'ᴛʜᴇ',
'ᴄᴏᴍᴘᴜᴛᴇʀ',
'ʙᴇᴇɴ',
'ᴏᴜᴛ',
'ᴀ',
'ᴊᴏʙ',
'ғᴏʀ',
'ᴍᴏɴᴛʜs',
'ʙᴜᴛ',
'ʟᴀsᴛ',
'ᴍᴏɴᴛʜ',
'ʜᴇʀ',
'ᴄʜᴇᴄᴋ',
'ᴊᴜsᴛ',
'ᴡᴏʀᴋɪɴɢ',
'ғᴇᴡ',
'ʜᴏᴜʀs',
'sᴏᴜʀᴄᴇ',
I want to convert them into ASCII format using a Python script, for example:
Fʀɪᴇɴᴅ - FRIEND
ᴍᴏᴍ - MOM
Answer
To convert the text from small capital Unicode letters (e.g., ᴍᴏᴍ) to regular ASCII uppercase letters (e.g., MOM), you can build a translation table that maps each small capital to its regular uppercase letter and apply it to every word. Here's how you can do it:
Python Solution
# Input dataset
dataset = [
    'Fʀɪᴇɴᴅ',
    'ᴍᴏᴍ',
    'ᴍᴀᴋᴇs',
    'ʜᴏᴜʀʟʏ',
    'ᴛʜᴇ',
    'ᴄᴏᴍᴘᴜᴛᴇʀ',
    'ʙᴇᴇɴ',
    'ᴏᴜᴛ',
    'ᴀ',
    'ᴊᴏʙ',
    'ғᴏʀ',
    'ᴍᴏɴᴛʜs',
    'ʙᴜᴛ',
    'ʟᴀsᴛ',
    'ᴍᴏɴᴛʜ',
    'ʜᴇʀ',
    'ᴄʜᴇᴄᴋ',
    'ᴊᴜsᴛ',
    'ᴡᴏʀᴋɪɴɢ',
    'ғᴇᴡ',
    'ʜᴏᴜʀs',
    'sᴏᴜʀᴄᴇ',
]
# Map of small capital Unicode characters to their ASCII counterparts
unicode_to_ascii = str.maketrans({
    'ᴀ': 'A', 'ʙ': 'B', 'ᴄ': 'C', 'ᴅ': 'D', 'ᴇ': 'E', 'ғ': 'F', 'ɢ': 'G',
    'ʜ': 'H', 'ɪ': 'I', 'ᴊ': 'J', 'ᴋ': 'K', 'ʟ': 'L', 'ᴍ': 'M', 'ɴ': 'N',
    'ᴏ': 'O', 'ᴘ': 'P', 'ǫ': 'Q', 'ʀ': 'R', 's': 'S', 'ᴛ': 'T', 'ᴜ': 'U',
    'ᴠ': 'V', 'ᴡ': 'W', 'x': 'X', 'ʏ': 'Y', 'ᴢ': 'Z'
})
# Convert the dataset
converted = {word: word.translate(unicode_to_ascii).upper() for word in dataset}
# Print the results
for original, ascii_version in converted.items():
    print(f"{original} - {ascii_version}")
Output
Fʀɪᴇɴᴅ - FRIEND
ᴍᴏᴍ - MOM
ᴍᴀᴋᴇs - MAKES
ʜᴏᴜʀʟʏ - HOURLY
ᴛʜᴇ - THE
ᴄᴏᴍᴘᴜᴛᴇʀ - COMPUTER
ʙᴇᴇɴ - BEEN
ᴏᴜᴛ - OUT
ᴀ - A
ᴊᴏʙ - JOB
ғᴏʀ - FOR
ᴍᴏɴᴛʜs - MONTHS
ʙᴜᴛ - BUT
ʟᴀsᴛ - LAST
ᴍᴏɴᴛʜ - MONTH
ʜᴇʀ - HER
ᴄʜᴇᴄᴋ - CHECK
ᴊᴜsᴛ - JUST
ᴡᴏʀᴋɪɴɢ - WORKING
ғᴇᴡ - FEW
ʜᴏᴜʀs - HOURS
sᴏᴜʀᴄᴇ - SOURCE
Explanation
- Unicode Translation:
  - The str.maketrans() function creates a mapping between Unicode characters (small capitals) and their ASCII counterparts.
  - The translate() method is then used to apply this mapping to each word in the dataset.
- Uppercase Conversion:
  - Since some words may not be fully in small capitals, applying .upper() ensures the entire string is in ASCII uppercase.
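This approach handles every small capital letter in your dataset. Any character that is not in the table passes through translate() unchanged, so extend the mapping if your data contains other look-alike characters.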
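If you want the script to tell you when it meets a character the table does not cover, a small wrapper along these lines could help. This is only a sketch: TABLE and to_ascii_upper are names I made up for illustration, and the check simply treats any code point above 127 that survives the translation as unmapped.

```python
import unicodedata

# Same table as in the answer above. TABLE and to_ascii_upper are
# illustrative names, not part of any library.
TABLE = str.maketrans({
    'ᴀ': 'A', 'ʙ': 'B', 'ᴄ': 'C', 'ᴅ': 'D', 'ᴇ': 'E', 'ғ': 'F', 'ɢ': 'G',
    'ʜ': 'H', 'ɪ': 'I', 'ᴊ': 'J', 'ᴋ': 'K', 'ʟ': 'L', 'ᴍ': 'M', 'ɴ': 'N',
    'ᴏ': 'O', 'ᴘ': 'P', 'ǫ': 'Q', 'ʀ': 'R', 's': 'S', 'ᴛ': 'T', 'ᴜ': 'U',
    'ᴠ': 'V', 'ᴡ': 'W', 'x': 'X', 'ʏ': 'Y', 'ᴢ': 'Z'
})


def to_ascii_upper(word, table=TABLE):
    """Translate small capitals, uppercase the rest, and reject leftovers."""
    result = word.translate(table).upper()
    # Anything still outside the ASCII range was not covered by the table.
    leftovers = sorted({ch for ch in result if ord(ch) > 127})
    if leftovers:
        names = ", ".join(
            f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
            for ch in leftovers
        )
        raise ValueError(f"unmapped characters in {word!r}: {names}")
    return result


print(to_ascii_upper('ᴍᴏᴍ'))     # MOM
print(to_ascii_upper('Fʀɪᴇɴᴅ'))  # FRIEND
```

The error message lists the code point and Unicode name of each leftover character, which makes it easier to decide what to add to the mapping.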