In my cosine similarity script I first need to convert Arabic strings into vectors before computing the cosine similarity, running in a terminal under Linux. The problem is that converting the Arabic strings to vectors prints the Arabic as:
[u'\u0627\u0644\u0634\u0645\u0633 \u0645\u0634\u0631\u0642\u0647 \u0646\u0647\u0627\u0631\u0627', u'\u0627\u0644\u0633\u0645\u0627\u0621 \u0632\u0631\u0642\u0627\u0621']
My script:
import numpy as np
from numpy import linalg as LA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ["السماء زرقاء", "الشمس مشرقه نهارا"] #Documents
test_set = ["الشمس التى فى السماء مشرقه","السماء زرقاء"] #Query
stopWords = set(stopwords.words('english'))
vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine
Answer
The issue you're encountering is related to how Arabic text is being handled in your script. The strings you're seeing (like [u'\u0627\u0644\u0634\u0645\u0633 \u0645\u0634\u0631\u0642\u0647 \u0646\u0647\u0627\u0631\u0627']) are just the Unicode representations of the Arabic text. You need to ensure that the Arabic text is properly encoded and handled as a string in Python before applying any text processing steps, such as vectorization and cosine similarity.
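As a quick sanity check (my own sketch, assuming Python 3, where str is Unicode by default), printing each document directly displays the readable Arabic text; the escaped output you saw is just Python 2's repr of the unicode strings, not corrupted data:

train_set = ["السماء زرقاء", "الشمس مشرقه نهارا"]

# Printing the strings themselves displays the Arabic characters; printing the
# whole list under Python 2 shows escaped reprs like u'\u0627\u0644...' instead.
for doc in train_set:
    print(doc)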
Steps to Fix the Issue
- Ensure Correct Encoding: Make sure your input text is properly decoded when reading from files or other input sources. In Python 3.x you can handle it as a normal Unicode string.
- Set Up Stop Words for Arabic: You're using stopwords.words('english'), but for Arabic text you should be using Arabic stop words. If you don't have a predefined list of Arabic stop words, you may need to create one or use an external library that supports Arabic stop words (see the sketch after this list).
- Vectorizing Arabic Text: CountVectorizer with stop_words='english' is fine for English text, but since you're working with Arabic it will cause issues. Instead, supply a custom stop word list, or skip stop words for the Arabic text entirely if necessary.
- Fixing the Cosine Similarity Calculation: The vectorization itself looks fine; the key issue is how the text is processed and encoded. Once the Arabic strings are handled correctly, the vectorization and cosine similarity will work as expected.
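If your installed NLTK data already ships an Arabic stop word list, you can load it instead of writing one by hand. This is a small sketch, assuming the stopwords corpus has been downloaded via nltk.download('stopwords'); older corpora may not include Arabic, hence the fallback:

from nltk.corpus import stopwords

# The stopwords corpus stores one word list per language; check for an Arabic file.
if 'arabic' in stopwords.fileids():
    arabic_stopwords = stopwords.words('arabic')
else:
    # Fall back to a small hand-made list and expand it as needed.
    arabic_stopwords = ["في", "من", "على", "إلى", "عن", "مع", "أن", "كان"]

print(len(arabic_stopwords))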
Full Script with Fixes
Here is an updated version of your script with these adjustments:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from numpy.linalg import norm
from nltk.corpus import stopwords
# Define Arabic stop words manually or load from external source
# Example Arabic stopwords list (use an actual list if available)
arabic_stopwords = [
"ال", "و", "في", "من", "على", "إلى", "عن", "مع", "أن", "كان", "لذلك", "قد"
# Add more stopwords as needed...
]
# Arabic text data
train_set = ["السماء زرقاء", "الشمس مشرقه نهارا"] # Documents
test_set = ["الشمس التى فى السماء مشرقه", "السماء زرقاء"] # Query
# Setup CountVectorizer for Arabic with custom stop words
vectorizer = CountVectorizer(stop_words=arabic_stopwords)
# Initialize TF-IDF transformer (kept from the original script; not actually used below)
transformer = TfidfTransformer()
# Convert training and test set into vectors
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set:', trainVectorizerArray)
print('Transform Vectorizer to test set:', testVectorizerArray)
# Cosine similarity function
cx = lambda a, b: round(np.inner(a, b) / (norm(a) * norm(b)), 3)
# Calculate cosine similarity for each pair of train and test vectors
for vector in trainVectorizerArray:
    print('Train Vector:', vector)
    for testV in testVectorizerArray:
        print('Test Vector:', testV)
        cosine = cx(vector, testV)
        print('Cosine Similarity:', cosine)
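As a cross-check (not part of the original script), you can get the same kind of train-versus-query similarity matrix from scikit-learn's built-in helpers. This sketch reuses train_set, test_set, and arabic_stopwords from the script above and applies TF-IDF weighting in one step:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit TF-IDF on the training documents, then project the queries onto the same vocabulary.
tfidf = TfidfVectorizer(stop_words=arabic_stopwords)
train_tfidf = tfidf.fit_transform(train_set)
test_tfidf = tfidf.transform(test_set)

# Rows correspond to test queries, columns to training documents.
print(cosine_similarity(test_tfidf, train_tfidf))

Note that these values use TF-IDF weights, so they will generally differ slightly from the raw-count cosines printed by the loop above.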
Key Changes:
- Stop Words: Replaced the English stop words with a basic list of Arabic stop words. You can expand the list based on your needs, or use an external library like nltk or spacy if it provides better support for Arabic stop words.
- Vectorizer: Used CountVectorizer with the Arabic stop word list. The vectorizer works the same way for Arabic text, as long as the stop words and preprocessing steps are set up properly.
- Cosine Similarity: The cosine similarity calculation using np.inner is fine; the main point is to apply it to the vector arrays generated above (a tiny worked example follows this list).
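For intuition, here is a small worked example of the same formula on two toy count vectors (values chosen arbitrarily for illustration):

import numpy as np
from numpy.linalg import norm

a = np.array([1, 1, 0])
b = np.array([1, 0, 1])

# inner(a, b) = 1 and both norms are sqrt(2), so the similarity is 1 / 2 = 0.5.
print(round(np.inner(a, b) / (norm(a) * norm(b)), 3))  # 0.5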
Important Notes:
- Arabic Text Processing: Arabic is a right-to-left language, and some preprocessing steps (such as stemming or lemmatization) may be necessary for better vectorization. You can use libraries like farasa or camel_tools for more advanced Arabic text preprocessing.
- Stop Words: If you don't have a comprehensive list of Arabic stop words, you can use existing datasets or build one manually as needed. Alternatively, you could omit stop words entirely, but that might affect the quality of the similarity calculation.
- Normalization: If you need to handle diacritics or different orthographic forms of Arabic, consider normalizing the text to improve results; a small sketch follows this list.
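As an example of what such normalization might look like, here is a minimal sketch of my own (not taken from any specific library): it strips diacritics and the tatweel character, unifies common letter variants, and applies NLTK's ISRIStemmer for light root-based stemming. You could pass a function like preprocess to CountVectorizer through its preprocessor parameter so that documents and queries are normalized consistently:

import re
from nltk.stem.isri import ISRIStemmer

def normalize_arabic(text):
    # Remove tashkeel (diacritics, U+064B-U+0652), the superscript alef, and tatweel.
    text = re.sub(r'[\u064B-\u0652\u0670\u0640]', '', text)
    # Unify alef variants, then ta marbuta and alef maqsura.
    text = re.sub(r'[أإآ]', 'ا', text)
    return text.replace('ة', 'ه').replace('ى', 'ي')

stemmer = ISRIStemmer()

def preprocess(text):
    # Normalize first, then lightly stem each whitespace-separated token.
    return ' '.join(stemmer.stem(tok) for tok in normalize_arabic(text).split())

print(preprocess("السماء زرقاء"))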
Conclusion:
- Ensure that you handle the Arabic text as normal Unicode strings and make sure your vectorizer is set up to work with Arabic text.
- Use a custom Arabic stop words list or an appropriate external library.
- The cosine similarity function should work fine once the text is vectorized correctly.
This solution should resolve the issues you're facing, but feel free to expand the stop word list and experiment with more advanced preprocessing techniques for Arabic.