tensorflow_hub to pull BERT embedding on windows machine


I would like to get BERT embeddings using tensorflow_hub. I found it very easy to get ELMo embeddings, and my steps are below. Could anyone explain how to get BERT embeddings on a Windows machine? I found this but couldn't get it to work on a Windows machine.

  1. Go to https://tfhub.dev/google/elmo/3 and download the module.

  2. Unzip it twice until you see "tfhub_module.pb", then provide the path of that folder to get the embeddings:

        import tensorflow as tf
        import tensorflow_hub as hub

        # x is the list of sentences (strings) to embed, e.g.:
        x = ["the cat is on the mat", "dogs are in the fog"]

        elmo = hub.Module("C:/Users/nnnn/Desktop/BERT/elmo/3.tar/3", trainable=True)

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            abc1 = sess.run(elmo(x, signature="default", as_dict=True)["default"])

+++++++++++++++++++++++++++++++++++++++++ Update 1

The list of problems I am facing is below; I will add them one by one. This page has the complete notebook from the same author.

  1. When I try to import tokenization, I get the error ModuleNotFoundError: No module named 'tokenization'. How do I get rid of it? Do I need to download tokenization.py and refer to it? Please clarify.

============== Update 2: I was able to get it to work. The code, with comments, is below.

#manually copy paste code from https://github.com/google-research/bert/blob/master/tokenization.py and create a file called C:\\Users\\nn\\Desktop\\BERT\\tokenization.py
#for some reason direct download doesn’t work

#https://github.com/vineetm/tfhub-bert/blob/master/bert_tfhub.ipynb 

#https://stackoverflow.com/questions/44891069/how-to-import-python-file
import sys
import os

print(sys.path)

# Add the absolute directory path containing tokenization.py to the Python path
script_dir = "C:\\Users\\nn\\Desktop\\BERT"
sys.path.append(os.path.abspath(script_dir))

import tokenization

import tensorflow_hub as hub
import tensorflow as tf

#download https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1 and unzip twice
def create_tokenizer(vocab_file='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~\\assets\\vocab.txt', do_lower_case=False):
    return tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)


tokenizer = create_tokenizer()


def convert_sentence_to_features(sentence, tokenizer, max_seq_len):
    tokens = ['[CLS]']
    tokens.extend(tokenizer.tokenize(sentence))
    if len(tokens) > max_seq_len-1:
        tokens = tokens[:max_seq_len-1]
    tokens.append('[SEP]')

    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)

    #Zero Mask till seq_length
    zero_mask = [0] * (max_seq_len-len(tokens))
    input_ids.extend(zero_mask)
    input_mask.extend(zero_mask)
    segment_ids.extend(zero_mask)

    return input_ids, input_mask, segment_ids

def convert_sentences_to_features(sentences, tokenizer, max_seq_len=20):
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []

    for sentence in sentences:
        input_ids, input_mask, segment_ids = convert_sentence_to_features(sentence, tokenizer, max_seq_len)
        all_input_ids.append(input_ids)
        all_input_mask.append(input_mask)
        all_segment_ids.append(segment_ids)

    return all_input_ids, all_input_mask, all_segment_ids



#BERT_URL = 'https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1'

BERT_URL ='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~'

module = hub.Module(BERT_URL)
sess = tf.Session()
sess.run(tf.global_variables_initializer())


input_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
input_mask = tf.placeholder(dtype=tf.int32, shape=[None, None])
segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])

bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)

bert_outputs = module(bert_inputs, signature="tokens", as_dict=True)


sentences = ['New Delhi is the capital of India', 'The capital of India is Delhi']
input_ids_vals, input_mask_vals, segment_ids_vals = convert_sentences_to_features(sentences, tokenizer, 10)#max_seq_len parameter

out = sess.run(bert_outputs, feed_dict={input_ids: input_ids_vals, input_mask: input_mask_vals, segment_ids: segment_ids_vals})

out['sequence_output'].shape

out['pooled_output'].shape

out.keys()

type(out['pooled_output'])

x1 = out['sequence_output'][0, :, :]
x2 = out['sequence_output'][1, :, :]  # Each sentence has 7 words; even with [CLS] and [SEP] that is 9 tokens. The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?

Answer

You've already made good progress toward getting BERT embeddings on a Windows machine using TensorFlow Hub. Let's go through the steps to clarify the process and address the issues you've encountered.

Overview of Steps:

  1. Download the BERT model from TensorFlow Hub.
  2. Manually create tokenization.py (the TF Hub BERT module needs this script for tokenizing text, but it is not packaged with the module).
  3. Prepare the tokenizer.
  4. Feed sentences into the BERT model and retrieve the embeddings.

Problem Clarifications

1. ModuleNotFoundError: No module named 'tokenization'

This happens because BERT requires the tokenization.py script to convert input text into WordPiece tokens, but that script is not packaged with the TensorFlow Hub module. To solve this, manually download the tokenization script and place it in a directory on your Python path (for example, your working directory).

You’ve already taken the correct approach by manually copying the code from Google's BERT GitHub.
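
If copying the file by hand is inconvenient, a small script along these lines can fetch it for you (a sketch, assuming GitHub's standard raw-content URL layout; adjust the destination path to your own folder):

import urllib.request

# Download the raw tokenization.py from the google-research/bert repository
# and save it into the folder that will be added to sys.path.
url = 'https://raw.githubusercontent.com/google-research/bert/master/tokenization.py'
destination = 'C:\\Users\\nn\\Desktop\\BERT\\tokenization.py'
urllib.request.urlretrieve(url, destination)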

2. TensorFlow Version Compatibility Issues

If you're working on Windows, it's also possible that you're encountering issues related to version mismatches between TensorFlow and TensorFlow Hub. Ensure you are using compatible versions of these libraries. A typical setup could be:

pip install tensorflow==1.15
pip install tensorflow-hub==0.7.0

hub.Module, tf.Session, and tf.placeholder are TensorFlow 1.x APIs, so this code will not run as written on TensorFlow 2.x (hence the version restrictions).
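
A quick way to confirm what is actually installed before loading the module (a minimal check):

import tensorflow as tf
import tensorflow_hub as hub

# The code below relies on TF 1.x APIs (tf.Session, tf.placeholder), so expect a 1.x version here.
print(tf.__version__)   # e.g. 1.15.x
print(hub.__version__)  # e.g. 0.7.0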

3. Manual Setup for BERT Model

It seems you've successfully downloaded the BERT model (bert_cased_L-12_H-768_A-12) and set up the tokenization.py script. Let's summarize your steps with corrections:

Full Code to Obtain BERT Embeddings

Here’s a step-by-step corrected version of your code:

# Manually copy the tokenization.py script from https://github.com/google-research/bert/blob/master/tokenization.py
import sys
import os

# Add the directory containing tokenization.py to the Python path
script_dir = "C:\\Users\\nn\\Desktop\\BERT"
sys.path.append(os.path.abspath(script_dir))

# Import tokenization module
import tokenization

import tensorflow as tf
import tensorflow_hub as hub

# Create a BERT tokenizer
def create_tokenizer(vocab_file='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~\\assets\\vocab.txt', do_lower_case=False):
    return tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer()

# Function to convert sentences to features (tokenization and padding)
def convert_sentence_to_features(sentence, tokenizer, max_seq_len):
    tokens = ['[CLS]']
    tokens.extend(tokenizer.tokenize(sentence))
    if len(tokens) > max_seq_len-1:
        tokens = tokens[:max_seq_len-1]
    tokens.append('[SEP]')

    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)

    # Zero padding to max sequence length
    zero_mask = [0] * (max_seq_len-len(tokens))
    input_ids.extend(zero_mask)
    input_mask.extend(zero_mask)
    segment_ids.extend(zero_mask)

    return input_ids, input_mask, segment_ids

def convert_sentences_to_features(sentences, tokenizer, max_seq_len=20):
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []

    for sentence in sentences:
        input_ids, input_mask, segment_ids = convert_sentence_to_features(sentence, tokenizer, max_seq_len)
        all_input_ids.append(input_ids)
        all_input_mask.append(input_mask)
        all_segment_ids.append(segment_ids)

    return all_input_ids, all_input_mask, all_segment_ids

# Specify the BERT model directory
BERT_URL = 'C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~'

# Load the BERT module from TensorFlow Hub
module = hub.Module(BERT_URL)
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Define placeholders for input data
input_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
input_mask = tf.placeholder(dtype=tf.int32, shape=[None, None])
segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])

# Define input dictionary for BERT
bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)

# Get BERT outputs (sequence_output and pooled_output)
bert_outputs = module(bert_inputs, signature="tokens", as_dict=True)

# Example sentences for testing
sentences = ['New Delhi is the capital of India', 'The capital of India is Delhi']
input_ids_vals, input_mask_vals, segment_ids_vals = convert_sentences_to_features(sentences, tokenizer, 10)

# Run the session to get BERT embeddings
out = sess.run(bert_outputs, feed_dict={input_ids: input_ids_vals, input_mask: input_mask_vals, segment_ids: segment_ids_vals})

# Print shapes of the outputs
print(out['sequence_output'].shape)
print(out['pooled_output'].shape)

# Check keys of the output
print(out.keys())

# Check type of pooled_output
print(type(out['pooled_output']))

# Example embeddings for the first sentence
x1 = out['sequence_output'][0, :, :]
x2 = out['sequence_output'][1, :, :]

Key Considerations:

  • Tokenization: BERT requires special tokenization of input sentences, including [CLS] at the start and [SEP] at the end. You've handled that correctly with convert_sentence_to_features (see the sketch after this list).
  • Padding: The tokenized sentences are padded to the max_seq_len to ensure that all input sequences are the same length.
  • Model Loading: You’ve correctly specified the path to your local BERT model and used TensorFlow Hub to load it. Ensure that the model is extracted and located at the proper path.
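
To make the tokenization and padding points concrete, here is a small sketch (an illustration using the tokenizer and helper functions defined above, not part of the original code):

# WordPiece tokens and their vocabulary ids for one of the example sentences
tokens = tokenizer.tokenize('New Delhi is the capital of India')
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))

# Padded features as produced by convert_sentence_to_features:
# [CLS] ... [SEP] followed by zeros up to max_seq_len; the mask is 1 for real tokens, 0 for padding.
ids, mask, seg = convert_sentence_to_features('New Delhi is the capital of India', tokenizer, 10)
print(ids)
print(mask)
print(seg)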

Output Shapes:

  • sequence_output: This will have the shape (batch_size, max_seq_len, hidden_size), where hidden_size is 768 for the BERT base model.
  • pooled_output: This is the output corresponding to the [CLS] token and has the shape (batch_size, hidden_size).
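
As a small usage example (an illustration, not part of the original code), the pooled [CLS] vectors can be compared directly, for instance with cosine similarity:

import numpy as np

# 'out' is the dict returned by sess.run(bert_outputs, ...) above.
v1 = out['pooled_output'][0]  # 768-dimensional vector for the first sentence
v2 = out['pooled_output'][1]  # 768-dimensional vector for the second sentence
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)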

Final Notes:

  1. BERT Tokenizer: The BERT model expects a specific vocabulary file (vocab.txt). Ensure that the path to the vocab file is correct (a quick path check is sketched below).
  2. TensorFlow Compatibility: If you face compatibility issues with TensorFlow or other dependencies, check your versions carefully. TensorFlow 1.15 is a good match for this approach.
  3. Performance: Running BERT locally might require significant memory (especially with long sequences), so ensure your machine has enough resources (RAM, GPU, etc.).
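
For the first note, a quick sanity check of the paths before building the tokenizer can save some debugging (a sketch using the paths from your code):

import os

model_dir = 'C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~'
vocab_path = os.path.join(model_dir, 'assets', 'vocab.txt')

print(os.path.exists(os.path.join(model_dir, 'tfhub_module.pb')))  # True if the module was fully extracted
print(os.path.exists(vocab_path))                                  # True if the vocab file is where the tokenizer expects it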

Let me know if you encounter any further issues!