
Keras Text Preprocessing - Saving Tokenizer object to file for scoring

I've trained a sentiment classifier model with the Keras library by (broadly) following the steps below.

1. Convert the text corpus into sequences using the Tokenizer object/class
2. Build a model using the model.fit() method
3. Evaluate this model

Now, for scoring with this model, I was able to save the model to a file and load it back. However, I haven't found a way to save the Tokenizer object to a file. Without this, I'd have to process the corpus every time I need to score even a single sentence. Is there a way around this?


today

The most common way is to use either pickle or joblib. Here is an example of how to use pickle to save the Tokenizer:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

Do you call tokenizer.fit_on_texts again on the test set?
No. Calling fit* again could change the index. The tokenizer loaded from pickle is ready to use.
Wait. You have to save both a model and a tokenizer in order to run a model in the future?
Of course! They have two different roles: the tokenizer transforms text into vectors, and it's important to have the same vector space between training and testing.
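To show the end-to-end flow the comments describe (fit once, pickle, then score a single sentence later without the corpus), here is a runnable sketch. Since a Keras install may not be at hand, it uses a minimal stand-in class with the same fit_on_texts/texts_to_sequences interface as keras's Tokenizer; a real Tokenizer instance pickles the same way.

```python
import os
import pickle
import tempfile

# Minimal stand-in mimicking the Tokenizer interface used here, so the sketch
# runs without a Keras install; with Keras, pickle a real Tokenizer identically.
class TinyTokenizer:
    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        for text in texts:
            for word in text.lower().split():
                self.word_index.setdefault(word, len(self.word_index) + 1)

    def texts_to_sequences(self, texts):
        # Unknown words are silently dropped, as with Tokenizer's default.
        return [[self.word_index[w] for w in t.lower().split() if w in self.word_index]
                for t in texts]

tokenizer = TinyTokenizer()
tokenizer.fit_on_texts(["the movie was great", "the plot was weak"])

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pickle")
with open(path, "wb") as handle:   # save once, right after training
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as handle:   # load at scoring time -- no corpus needed
    restored = pickle.load(handle)

print(restored.texts_to_sequences(["the movie was weak"]))  # -> [[1, 2, 3, 6]]
```

The key point from the comments holds either way: the loaded object carries the training-time word_index, so scoring a new sentence never requires refitting.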
Max

The Tokenizer class has a function to save its data in JSON format:

import io
import json

tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

The data can be loaded using the tokenizer_from_json function from keras_preprocessing.text:

import json

from keras_preprocessing.text import tokenizer_from_json

with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)

tokenizer_from_json doesn't seem to be available in Keras anymore, or rather it's not listed in their docs or available in the conda package. @Max, do you still do it this way?
@benbyford I use the Keras-Preprocessing==1.0.9 package from PyPI and the function is available.
tokenizer_to_json should be available in tensorflow > 2.0.0 at some point soon, see this PR. In the meantime, from keras_preprocessing.text import tokenizer_from_json can be used.
This worked for me. Thank you
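To make the round trip concrete without requiring a Keras install, here is a stdlib-only sketch of the same idea. A real Tokenizer's to_json() returns a JSON string whose config embeds word_index (itself stored as a JSON string, per the get_config answer below on this page); the hand-made config here only imitates that shape.

```python
import io
import json
import os
import tempfile

# Hand-made stand-in for tokenizer.to_json() output: a JSON string whose
# config stores word_index as a nested JSON string. Shape is an assumption
# for illustration; a real Tokenizer adds more fields (filters, lower, ...).
word_index = {"the": 1, "cat": 2, "sat": 3}
tokenizer_json = json.dumps(
    {"class_name": "Tokenizer", "config": {"word_index": json.dumps(word_index)}},
    ensure_ascii=False)

path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(tokenizer_json)       # save once, next to the model file

with open(path) as f:
    data = json.load(f)           # config round-trips intact

restored_index = json.loads(data["config"]["word_index"])
print(restored_index["cat"])      # -> 2
```

With Keras available, tokenizer_from_json does this unwrapping for you and hands back a ready-to-use Tokenizer.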
Quetzalcoatl

The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose a list texts comprises two lists Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) than first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).

Concrete Example:

from keras.preprocessing.text import Tokenizer

docs = ["A heart that",
        "full up like",
        "a landfill",
        "no surprises",
        "and no alarms"
        "a job that slowly"
        "Bruises that",
        "You look so",
        "tired happy",
        "no alarms",
        "and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]
# EXPERIMENT 1: FIT  TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train)  # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" %(encoded_test_1,))

# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs)  # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" %(encoded_test_2,))

Results:

result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]

Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty brackets [].


Moral of the story: if using word embeddings and keras's Tokenizer, use fit_on_texts only once, on a very large corpus; or use character n-grams instead.
I don't understand the message you're trying to communicate: why would one fit on the test docs in the first place? By definition, whatever it is you're doing, the test set must be kept in a vault, as if you didn't know you had it in the first place.
@gented: you may be confusing unsupervised text parsing with supervised ML. Correct me if I'm wrong, but keras's Tokenizer does not have a loss function attached to it that is meant for generalization; hence it is not a (supervised) machine learning problem, which appears to be your assumption. The message I was trying to communicate is summarized in my first comment above ("moral of the story..."), which may be worth re-reading.
@gented good points. Sorry if the nomenclature confused you; I was keeping some consistency with the comments on the accepted answer.
I agree with @gented in that you do not want to fit your tokenizer on the test set, because then you remove the possibility of OOV tokens at test time, defeating the purpose of a test set. It's not about the tokenizer having a loss, but about data from the test set leaking into your training data.
user9170

I've created the issue https://github.com/keras-team/keras/issues/9289 in the keras repo. Until the API is changed, the issue links to a gist with code demonstrating how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (for reasons, but mainly a mixed JS/Python environment), and this allows for that, even with sort_keys=True.


the linked gist looks like a good way to "reload" a trained tokenizer. However, the original question potentially relates to "extending" a previously saved tokenizer to new (test) texts; this part still seems open (otherwise, why "save" a model if it won't be used to "score" new data?)
I think their intent is clear: "Without this I'll have to process the corpus every time I need to score even a single sentence". From this, I gather that they want to skip the tokenizing step and evaluate the trained model on other data. They don't ask anything else; that is you anticipating. Like most people, they just want to use a previously fit tokenizer on a different data set, which is skipped in most tutorials. Therefore, I think my answer 1) answers what was asked, and 2) provides working code.
fair points. the question is "Saving Tokenizer object to file for scoring" so one might assume they're asking about scoring (potentially new data), too.
Peter O.

I found the following snippet, provided by @thusv89.

Save objects:

import pickle

with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor, 
         'target_tensor': target_tensor, 
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
        }, handle, protocol=pickle.HIGHEST_PROTOCOL)

Load objects:

with open('data_objects.pickle', 'rb') as f:
    data = pickle.load(f)
    input_tensor = data['input_tensor']
    target_tensor = data['target_tensor']
    inp_lang = data['inp_lang']
    targ_lang = data['targ_lang']

chales sandy

Quite easy, because the Tokenizer class provides two functions for saving and loading:

save: Tokenizer.to_json()

load: keras.preprocessing.text.tokenizer_from_json

Internally, the to_json() method calls get_config, which handles this:

    json_word_counts = json.dumps(self.word_counts)
    json_word_docs = json.dumps(self.word_docs)
    json_index_docs = json.dumps(self.index_docs)
    json_word_index = json.dumps(self.word_index)
    json_index_word = json.dumps(self.index_word)

    return {
        'num_words': self.num_words,
        'filters': self.filters,
        'lower': self.lower,
        'split': self.split,
        'char_level': self.char_level,
        'oov_token': self.oov_token,
        'document_count': self.document_count,
        'word_counts': json_word_counts,
        'word_docs': json_word_docs,
        'index_docs': json_index_docs,
        'index_word': json_index_word,
        'word_index': json_word_index
    }
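Note that in the config above the mapping fields (word_index and friends) are stored as JSON strings inside the dict, so restoring one by hand, e.g. when inspecting a saved tokenizer.json without Keras, takes a second json.loads per field. A minimal sketch with a hand-made config (only word_index shown; the real config carries all the fields listed above):

```python
import json

# Hand-made fragment of the get_config() dict: note word_index is a JSON
# *string*, not a dict, exactly as the method above produces it.
config = {
    "num_words": None,
    "lower": True,
    "word_index": json.dumps({"heart": 1, "landfill": 2}),
}

# Second decode turns the embedded string back into a usable mapping.
word_index = json.loads(config["word_index"])
print(word_index["landfill"])  # -> 2
```

tokenizer_from_json performs this double decoding for every mapping field when it rebuilds the Tokenizer.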

