I would like to get BERT embeddings using TensorFlow Hub. I found it very easy to get ELMo embeddings, and my steps are below. Could anyone explain how to get BERT embeddings on a Windows machine? I found this but couldn't get it to work on a Windows machine.
- Go to https://tfhub.dev/google/elmo/3 and download the module.
- Unzip it twice until you see "tfhub_module.pb", then provide the path of that folder to get the embeddings:
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.Module("C:/Users/nnnn/Desktop/BERT/elmo/3.tar/3", trainable=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # x is the list of input sentences (strings), defined elsewhere
    abc1 = sess.run(elmo(x, signature="default", as_dict=True)["default"])
Update 1
The problems I am facing are listed below; I will add them one by one. This page has the complete notebook from the same author.
- When I try import tokenization, I get the error ModuleNotFoundError: No module named 'tokenization'. How do I get rid of it? Do I need to download tokenization.py and refer to it? Please clarify.
Update 2
I was able to get it to work. The code, with comments, is below.
#manually copy paste code from https://github.com/google-research/bert/blob/master/tokenization.py and create a file called C:\\Users\\nn\\Desktop\\BERT\\tokenization.py
#for some reason direct download doesn’t work
#https://github.com/vineetm/tfhub-bert/blob/master/bert_tfhub.ipynb
#https://stackoverflow.com/questions/44891069/how-to-import-python-file
import sys
import os
print (sys.path)
script_dir = "C:\\Users\\nn\\Desktop\\BERT"
# Add the absolute directory path containing your
# module to the Python path
sys.path.append(os.path.abspath(script_dir))
import tokenization
import tensorflow_hub as hub
import tensorflow as tf
#download https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1 and unzip twice
def create_tokenizer(vocab_file='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~\\assets\\vocab.txt', do_lower_case=False):
    return tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
tokenizer = create_tokenizer()
def convert_sentence_to_features(sentence, tokenizer, max_seq_len):
    tokens = ['[CLS]']
    tokens.extend(tokenizer.tokenize(sentence))
    if len(tokens) > max_seq_len-1:
        tokens = tokens[:max_seq_len-1]
    tokens.append('[SEP]')
    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    #Zero Mask till seq_length
    zero_mask = [0] * (max_seq_len-len(tokens))
    input_ids.extend(zero_mask)
    input_mask.extend(zero_mask)
    segment_ids.extend(zero_mask)
    return input_ids, input_mask, segment_ids
def convert_sentences_to_features(sentences, tokenizer, max_seq_len=20):
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []
    for sentence in sentences:
        input_ids, input_mask, segment_ids = convert_sentence_to_features(sentence, tokenizer, max_seq_len)
        all_input_ids.append(input_ids)
        all_input_mask.append(input_mask)
        all_segment_ids.append(segment_ids)
    return all_input_ids, all_input_mask, all_segment_ids
#BERT_URL = 'https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1'
BERT_URL ='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~'
module = hub.Module(BERT_URL)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
input_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
input_mask = tf.placeholder(dtype=tf.int32, shape=[None, None])
segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)
bert_outputs = module(bert_inputs, signature="tokens", as_dict=True)
sentences = ['New Delhi is the capital of India', 'The capital of India is Delhi']
input_ids_vals, input_mask_vals, segment_ids_vals = convert_sentences_to_features(sentences, tokenizer, 10)#max_seq_len parameter
out = sess.run(bert_outputs, feed_dict={input_ids: input_ids_vals, input_mask: input_mask_vals, segment_ids: segment_ids_vals})
out['sequence_output'].shape
out['pooled_output'].shape
out.keys()
type(out['pooled_output'])
x1 = out['sequence_output'][0, :, :]
x2 = out['sequence_output'][1, :, :]
# Each sentence has 7 words; even after adding the [CLS] and [SEP] tokens the length is 9.
# max_seq_len is 10, so why are the last rows of x1 and x2 not the same?
Answer
You've already made most of the progress needed to get BERT embeddings on a Windows machine with TensorFlow Hub. The steps below clarify the process and address the issues you encountered.
Overview of Steps:
- Download the BERT model from TensorFlow Hub.
- Manually create tokenization.py (because TensorFlow BERT requires this script for tokenizing text).
- Prepare the tokenizer.
- Feed sentences into the BERT model and retrieve the embeddings.
Problem Clarifications
1. ModuleNotFoundError: No module named 'tokenization'
This happens because BERT requires the tokenization.py script to convert input text into tokens, and that script is not packaged with TensorFlow Hub. It lives in the google-research/bert GitHub repository, so import tokenization only works once the file is somewhere on Python's import path, such as your working directory or a folder you add to sys.path.
You’ve already taken the correct approach by manually copying the code from Google's BERT GitHub.
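As a quick check before importing, something along these lines can confirm that Python is able to find the file. This is a minimal sketch: module_is_importable is a hypothetical helper name, and the directory is the one from the question, so adjust it to wherever you saved tokenization.py.

```python
import importlib
import importlib.util
import os
import sys


def module_is_importable(module_name, extra_dir=None):
    """Return True if `module_name` can be found on the import path.

    If `extra_dir` is given, it is appended to sys.path first (once).
    """
    if extra_dir is not None:
        extra_dir = os.path.abspath(extra_dir)
        if extra_dir not in sys.path:
            sys.path.append(extra_dir)
        # Make sure files created after interpreter startup are picked up.
        importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# Assumed location, taken from the question; change it to your own folder.
script_dir = "C:\\Users\\nn\\Desktop\\BERT"
if module_is_importable("tokenization", script_dir):
    print("tokenization.py found; 'import tokenization' should work")
else:
    print("tokenization.py not found in", script_dir)
```

If the second message is printed, the file is either missing or saved under a different name or folder than the one added to sys.path.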
2. TensorFlow Version Compatibility Issues
If you're working on Windows, it's also possible that you're encountering issues related to version mismatches between TensorFlow and TensorFlow Hub. Ensure you are using compatible versions of these libraries. A typical setup could be:
pip install tensorflow==1.15
pip install tensorflow-hub==0.7.0
The code in this answer relies on TensorFlow 1.x APIs (hub.Module, tf.Session, tf.placeholder) that do not work out of the box in TensorFlow 2.x, hence the version pins.
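To confirm which major version is actually installed in the environment you are running, a small check like the following may help. It is only a sketch: is_tf1 is a hypothetical helper that inspects the version string, nothing more.

```python
def is_tf1(version_string):
    """Return True if a version string such as '1.15.0' is a 1.x release."""
    major = int(version_string.split(".")[0])
    return major == 1


try:
    import tensorflow as tf
except ImportError:
    print("TensorFlow is not installed in this environment")
else:
    if is_tf1(tf.__version__):
        print("TensorFlow", tf.__version__, "is a 1.x release")
    else:
        print("TensorFlow", tf.__version__, "found, but this code expects 1.x")
```

Running it inside the same interpreter or notebook kernel that raises the error is the important part, since Windows machines often have more than one Python environment.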
3. Manual Setup for BERT Model
It seems you’ve successfully downloaded the BERT model (bert_cased_L-12_H-768_A-12) and the tokenization.py script. Let’s summarize your steps with corrections:
Full Code to Obtain BERT Embeddings
Here’s a step-by-step corrected version of your code:
# Manually copy the tokenization.py script from https://github.com/google-research/bert/blob/master/tokenization.py
import sys
import os
# Add the directory containing tokenization.py to the Python path
script_dir = "C:\\Users\\nn\\Desktop\\BERT"
sys.path.append(os.path.abspath(script_dir))
# Import tokenization module
import tokenization
import tensorflow as tf
import tensorflow_hub as hub
# Create a BERT tokenizer
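# vocab.txt sits in the assets folder inside the extracted module directory (the same directory as BERT_URL below)
def create_tokenizer(vocab_file='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~\\assets\\vocab.txt', do_lower_case=False):
    return tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)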
tokenizer = create_tokenizer()
# Function to convert sentences to features (tokenization and padding)
def convert_sentence_to_features(sentence, tokenizer, max_seq_len):
    tokens = ['[CLS]']
    tokens.extend(tokenizer.tokenize(sentence))
    if len(tokens) > max_seq_len-1:
        tokens = tokens[:max_seq_len-1]
    tokens.append('[SEP]')
    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    # Zero padding to max sequence length
    zero_mask = [0] * (max_seq_len-len(tokens))
    input_ids.extend(zero_mask)
    input_mask.extend(zero_mask)
    segment_ids.extend(zero_mask)
    return input_ids, input_mask, segment_ids
def convert_sentences_to_features(sentences, tokenizer, max_seq_len=20):
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []
    for sentence in sentences:
        input_ids, input_mask, segment_ids = convert_sentence_to_features(sentence, tokenizer, max_seq_len)
        all_input_ids.append(input_ids)
        all_input_mask.append(input_mask)
        all_segment_ids.append(segment_ids)
    return all_input_ids, all_input_mask, all_segment_ids
# Specify the BERT model directory
BERT_URL = 'C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~'
# Load the BERT module from TensorFlow Hub
module = hub.Module(BERT_URL)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Define placeholders for input data
input_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
input_mask = tf.placeholder(dtype=tf.int32, shape=[None, None])
segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
# Define input dictionary for BERT
bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)
# Get BERT outputs (sequence_output and pooled_output)
bert_outputs = module(bert_inputs, signature="tokens", as_dict=True)
# Example sentences for testing
sentences = ['New Delhi is the capital of India', 'The capital of India is Delhi']
input_ids_vals, input_mask_vals, segment_ids_vals = convert_sentences_to_features(sentences, tokenizer, 10)
# Run the session to get BERT embeddings
out = sess.run(bert_outputs, feed_dict={input_ids: input_ids_vals, input_mask: input_mask_vals, segment_ids: segment_ids_vals})
# Print shapes of the outputs
print(out['sequence_output'].shape)
print(out['pooled_output'].shape)
# Check keys of the output
print(out.keys())
# Check type of pooled_output
print(type(out['pooled_output']))
# Per-token embeddings for the first and second sentence, each of shape (max_seq_len, hidden_size)
x1 = out['sequence_output'][0, :, :]
x2 = out['sequence_output'][1, :, :]
Key Considerations:
- Tokenization: BERT requires special tokenization to process input sentences, including [CLS] at the start and [SEP] at the end. You've handled that correctly with convert_sentence_to_features.
- Padding: The tokenized sentences are padded to max_seq_len so that all input sequences have the same length.
- Model Loading: You’ve correctly specified the path to your local BERT model and used TensorFlow Hub to load it. Ensure that the model is extracted and located at the proper path.
Output Shapes:
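- sequence_output: This has the shape (batch_size, max_seq_len, hidden_size), where hidden_size is 768 for the BERT base model. It holds one vector per token position, including padded positions.
- pooled_output: This is derived from the final hidden state of the [CLS] token (passed through a dense layer with tanh activation) and has the shape (batch_size, hidden_size).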
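These shapes also bear on the question of why the last rows of x1 and x2 differ when the last position is padding. BERT's attention mask stops real tokens from attending to padded positions, but a padded position still attends to the real tokens of its own sentence, so its output vector is neither zero nor shared across sentences. Those rows are normally discarded using input_mask. The sketch below shows one way to do that with NumPy and to average the remaining token vectors into a sentence vector; the random array merely stands in for out['sequence_output'], and masked_mean is a hypothetical helper, not part of the BERT module.

```python
import numpy as np


def masked_mean(sequence_output, input_mask):
    """Average token vectors over real (non-padded) positions only.

    sequence_output: float array, shape (batch_size, max_seq_len, hidden_size)
    input_mask:      0/1 values,  shape (batch_size, max_seq_len)
    """
    mask = np.asarray(input_mask, dtype=sequence_output.dtype)[:, :, np.newaxis]
    summed = (sequence_output * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid division by zero
    return summed / counts


# Stand-in data: 2 sentences, 10 positions, hidden size 768.
rng = np.random.RandomState(0)
fake_sequence_output = rng.rand(2, 10, 768).astype(np.float32)
fake_input_mask = [[1] * 9 + [0], [1] * 8 + [0] * 2]

sentence_vectors = masked_mean(fake_sequence_output, fake_input_mask)
print(sentence_vectors.shape)  # (2, 768)
```

With the real outputs, the call would be masked_mean(out['sequence_output'], input_mask_vals). Mean pooling is only one common choice; pooled_output is the alternative the module provides directly.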
Final Notes:
- BERT Tokenizer: The BERT model expects a specific vocabulary file (vocab.txt). Ensure that the path to the vocab file is correct.
- TensorFlow Compatibility: If you face compatibility issues with TensorFlow or other dependencies, check your versions carefully. TensorFlow 1.15 is a good match for this approach.
- Performance: Running BERT locally might require significant memory (especially with long sequences), so ensure your machine has enough resources (RAM, GPU, etc.).
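If memory becomes a problem, one option is to feed the sentences through the session in small batches rather than all at once. The helper below is a generic sketch: iter_batches, the third example sentence, and the batch size of 2 are illustrative choices, not something the BERT module prescribes. Each batch would be passed to convert_sentences_to_features and sess.run in turn.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = [
    "New Delhi is the capital of India",
    "The capital of India is Delhi",
    "Delhi is in India",
]
for batch in iter_batches(sentences, batch_size=2):
    # In the real pipeline, convert this batch to features and call sess.run.
    print(len(batch), batch)
```

Keeping max_seq_len as small as the data allows reduces memory further, since every sentence in a batch is padded to that length.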
Let me know if you encounter any further issues!