Convert unicode small capitals to their ASCII equivalents

ghz 16hours ago ⋅ 3 views

I have the following dataset

'Fʀɪᴇɴᴅ',
 'ᴍᴏᴍ',
 'ᴍᴀᴋᴇs',
 'ʜᴏᴜʀʟʏ',
 'ᴛʜᴇ',
 'ᴄᴏᴍᴘᴜᴛᴇʀ',
 'ʙᴇᴇɴ',
 'ᴏᴜᴛ',
 'ᴀ',
 'ᴊᴏʙ',
 'ғᴏʀ',
 'ᴍᴏɴᴛʜs',
 'ʙᴜᴛ',
 'ʟᴀsᴛ',
 'ᴍᴏɴᴛʜ',
 'ʜᴇʀ',
 'ᴄʜᴇᴄᴋ',
 'ᴊᴜsᴛ',
 'ᴡᴏʀᴋɪɴɢ',
 'ғᴇᴡ',
 'ʜᴏᴜʀs',
 'sᴏᴜʀᴄᴇ',

I want then into ASCII format using Python script for example:

Fʀɪᴇɴᴅ - FRIEND
ᴍᴏᴍ - MOM

Answer

To convert the text from small capital Unicode letters (e.g., ᴍᴏᴍ) to regular ASCII uppercase letters (e.g., MOM), you can use a Unicode normalization technique and then map the small capitals to regular uppercase letters. Here's how you can do it:

Python Solution

# Input dataset
dataset = [
    'Fʀɪᴇɴᴅ',
    'ᴍᴏᴍ',
    'ᴍᴀᴋᴇs',
    'ʜᴏᴜʀʟʏ',
    'ᴛʜᴇ',
    'ᴄᴏᴍᴘᴜᴛᴇʀ',
    'ʙᴇᴇɴ',
    'ᴏᴜᴛ',
    'ᴀ',
    'ᴊᴏʙ',
    'ғᴏʀ',
    'ᴍᴏɴᴛʜs',
    'ʙᴜᴛ',
    'ʟᴀsᴛ',
    'ᴍᴏɴᴛʜ',
    'ʜᴇʀ',
    'ᴄʜᴇᴄᴋ',
    'ᴊᴜsᴛ',
    'ᴡᴏʀᴋɪɴɢ',
    'ғᴇᴡ',
    'ʜᴏᴜʀs',
    'sᴏᴜʀᴄᴇ',
]

# Map of small capital Unicode characters to their ASCII counterparts
unicode_to_ascii = str.maketrans({
    'ᴀ': 'A', 'ʙ': 'B', 'ᴄ': 'C', 'ᴅ': 'D', 'ᴇ': 'E', 'ғ': 'F', 'ɢ': 'G',
    'ʜ': 'H', 'ɪ': 'I', 'ᴊ': 'J', 'ᴋ': 'K', 'ʟ': 'L', 'ᴍ': 'M', 'ɴ': 'N',
    'ᴏ': 'O', 'ᴘ': 'P', 'ǫ': 'Q', 'ʀ': 'R', 's': 'S', 'ᴛ': 'T', 'ᴜ': 'U',
    'ᴠ': 'V', 'ᴡ': 'W', 'x': 'X', 'ʏ': 'Y', 'ᴢ': 'Z'
})

# Convert the dataset
converted = {word: word.translate(unicode_to_ascii).upper() for word in dataset}

# Print the results
for original, ascii_version in converted.items():
    print(f"{original} - {ascii_version}")

Output

Fʀɪᴇɴᴅ - FRIEND
ᴍᴏᴍ - MOM
ᴍᴀᴋᴇs - MAKES
ʜᴏᴜʀʟʏ - HOURLY
ᴛʜᴇ - THE
ᴄᴏᴍᴘᴜᴛᴇʀ - COMPUTER
ʙᴇᴇɴ - BEEN
ᴏᴜᴛ - OUT
ᴀ - A
ᴊᴏʙ - JOB
ғᴏʀ - FOR
ᴍᴏɴᴛʜs - MONTHS
ʙᴜᴛ - BUT
ʟᴀsᴛ - LAST
ᴍᴏɴᴛʜ - MONTH
ʜᴇʀ - HER
ᴄʜᴇᴄᴋ - CHECK
ᴊᴜsᴛ - JUST
ᴡᴏʀᴋɪɴɢ - WORKING
ғᴇᴡ - FEW
ʜᴏᴜʀs - HOURS
sᴏᴜʀᴄᴇ - SOURCE

Explanation

  1. Unicode Translation:

    • The str.maketrans() function creates a mapping between Unicode characters (small capitals) and their ASCII counterparts.
    • The translate() method is then used to apply this mapping to each word in the dataset.
  2. Uppercase Conversion:

    • Since some words may not be fully in small capitals, applying .upper() ensures the entire string is in ASCII uppercase.

This approach ensures proper handling of all small capital Unicode letters in your dataset.