Package opennlp.tools.tokenize
Class BertTokenizer
java.lang.Object
opennlp.tools.tokenize.BertTokenizer
- All Implemented Interfaces:
Tokenizer
A Tokenizer implementation of the full BERT tokenization pipeline:
basic tokenization (text normalization) followed by wordpiece tokenization.
The basic tokenization stage reproduces the reference BERT BasicTokenizer:
- Removal of control characters and normalization of all whitespace to single spaces.
- Whitespace isolation of CJK ideographs.
- For uncased models: lower casing and accent stripping (Unicode NFD decomposition with removal of combining marks).
- Isolation of every punctuation character as its own token.
The resulting tokens are then split into wordpieces by a WordpieceTokenizer sharing the same vocabulary and special tokens.
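The basic tokenization steps listed above can be sketched in plain Java. This is an illustrative sketch under stated assumptions, not the OpenNLP or reference source: the class BasicTokenizerSketch and its helper methods are hypothetical names, Character.isIdeographic stands in for the reference's explicit CJK code point ranges, and only control (Cc) and format (Cf) characters plus U+FFFD are dropped.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of BERT-style basic tokenization; not the OpenNLP source.
final class BasicTokenizerSketch {

  static List<String> tokenize(String text, boolean lowerCase) {
    // Steps 1 and 2: drop control characters, map whitespace to a plain space,
    // and surround each CJK ideograph with spaces.
    StringBuilder cleaned = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp);
      if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
        cleaned.append(' ');
      } else if (isControl(cp)) {
        continue;
      } else if (Character.isIdeographic(cp)) { // assumption: approximates the CJK ranges
        cleaned.append(' ').appendCodePoint(cp).append(' ');
      } else {
        cleaned.appendCodePoint(cp);
      }
    }

    List<String> tokens = new ArrayList<>();
    // Splitting on runs of spaces collapses repeated whitespace.
    for (String word : cleaned.toString().trim().split(" +")) {
      if (lowerCase) {
        // Step 3 (uncased only): lower case, decompose (NFD), drop combining marks.
        word = Normalizer.normalize(word.toLowerCase(Locale.ROOT), Normalizer.Form.NFD)
            .replaceAll("\\p{Mn}", "");
      }
      splitOnPunctuation(word, tokens); // Step 4
    }
    return tokens;
  }

  // Emits every punctuation character as its own token.
  private static void splitOnPunctuation(String word, List<String> out) {
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < word.length(); ) {
      int cp = word.codePointAt(i);
      i += Character.charCount(cp);
      if (isPunctuation(cp)) {
        if (current.length() > 0) {
          out.add(current.toString());
          current.setLength(0);
        }
        out.add(new String(Character.toChars(cp)));
      } else {
        current.appendCodePoint(cp);
      }
    }
    if (current.length() > 0) {
      out.add(current.toString());
    }
  }

  private static boolean isControl(int cp) {
    int type = Character.getType(cp);
    return cp == 0xFFFD || type == Character.CONTROL || type == Character.FORMAT;
  }

  private static boolean isPunctuation(int cp) {
    // All non-alphanumeric printable ASCII counts as punctuation, e.g. '$' and '^'.
    if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
        || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
      return true;
    }
    switch (Character.getType(cp)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    // Accented Latin text, a tab, and two CJK ideographs, written as Unicode escapes.
    System.out.println(tokenize("H\u00e9llo,  W\u00f6rld!\t\u4e2d\u6587", true));
  }
}
```

With lowerCase set to true, the sample input comes out as hello, the comma, world, the exclamation mark, and each ideograph as a separate token. With lowerCase set to false, case and accents are left untouched and only the whitespace, CJK and punctuation steps apply.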
This pipeline is required for correct results with BERT-style models:
feeding raw text directly to WordpieceTokenizer maps every token
that does not literally appear in the vocabulary - for uncased models that
includes every capitalized word - to the unknown token.
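To illustrate why the normalization stage matters, the following sketch implements the greedy longest-match-first wordpiece split described in the reference BERT code and feeds it a capitalized word. It is a minimal sketch, not the OpenNLP WordpieceTokenizer: the class WordpieceSketch, the method wordpiece and the three-entry vocabulary are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical greedy longest-match-first wordpiece split; not the OpenNLP source.
final class WordpieceSketch {

  static List<String> wordpiece(String token, Set<String> vocabulary, String unknownToken) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < token.length()) {
      int end = token.length();
      String match = null;
      // Try the longest remaining substring first, then shrink from the right.
      while (start < end) {
        String candidate = (start > 0 ? "##" : "") + token.substring(start, end);
        if (vocabulary.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        // No piece matches at this position: the whole token becomes unknown.
        return List.of(unknownToken);
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  public static void main(String[] args) {
    // Assumed toy vocabulary of an uncased model: lower-case entries only.
    Set<String> vocabulary = Set.of("play", "##ing", "hello");
    System.out.println(wordpiece("playing", vocabulary, "[UNK]")); // [play, ##ing]
    System.out.println(wordpiece("Playing", vocabulary, "[UNK]")); // [[UNK]]
  }
}
```

The capitalized input fails on its first character because the toy vocabulary holds no piece starting with an upper-case letter, which is the situation the lower casing step avoids for uncased models.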
Whether to use the lower casing variant is a property of the model: uncased
models (for example bert-base-uncased and the
sentence-transformers models derived from it) require it; cased
models must not use it. Accent stripping is coupled to lower casing, as in
the reference implementation's default (strip_accents follows
do_lower_case unless overridden).
For reference see:
- https://github.com/google-research/bert (tokenization.py)
Constructor Summary
- BertTokenizer(Set<String> vocabulary)
Initializes a BertTokenizer for an uncased BERT model, with lower casing and accent stripping enabled.
- BertTokenizer(Set<String> vocabulary, boolean lowerCase)
Initializes a BertTokenizer with BERT special tokens.
- BertTokenizer(Set<String> vocabulary, boolean lowerCase, String classificationToken, String separatorToken, String unknownToken)
Initializes a BertTokenizer with custom special tokens, for models like RoBERTa that do not use the BERT defaults.
-
Method Summary
- String[] tokenize(String)
Tokenizes the given text into wordpieces, surrounded by the classification and separator tokens.
- Span[] tokenizePos(String text)
Not supported: wordpiece tokens (subwords, ## continuations and special tokens) have no faithful character spans in the original text.
-
Constructor Details
-
BertTokenizer
Initializes a BertTokenizer for an uncased BERT model, with lower casing and accent stripping enabled.
- Parameters:
vocabulary - The wordpiece vocabulary. Must not be null.
-
BertTokenizer
Initializes a BertTokenizer with BERT special tokens.
- Parameters:
vocabulary - The wordpiece vocabulary. Must not be null.
lowerCase - true for uncased models (lower casing and accent stripping), false for cased models.
-
BertTokenizer
public BertTokenizer(Set<String> vocabulary, boolean lowerCase, String classificationToken, String separatorToken, String unknownToken)
Initializes a BertTokenizer with custom special tokens, for models like RoBERTa that do not use the BERT defaults.
- Parameters:
vocabulary - The wordpiece vocabulary. Must not be null.
lowerCase - true for uncased models (lower casing and accent stripping), false for cased models.
classificationToken - The CLS token.
separatorToken - The SEP token.
unknownToken - The UNK token.
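As a rough illustration of what the classification and separator tokens do, the sketch below wraps a list of wordpieces in the two special tokens. The class SpecialTokensSketch and the method wrap are hypothetical and not part of OpenNLP. The strings [CLS] and [SEP] are the usual BERT defaults, while <s> and </s> are shown as an assumed example of RoBERTa-style replacements; the right values for a given model come from its own vocabulary file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing how special tokens frame a wordpiece sequence.
final class SpecialTokensSketch {

  static String[] wrap(List<String> pieces, String classificationToken, String separatorToken) {
    List<String> out = new ArrayList<>(pieces.size() + 2);
    out.add(classificationToken); // sequence start marker
    out.addAll(pieces);
    out.add(separatorToken);      // sequence end marker
    return out.toArray(new String[0]);
  }

  public static void main(String[] args) {
    List<String> pieces = List.of("play", "##ing");
    // BERT defaults.
    System.out.println(Arrays.toString(wrap(pieces, "[CLS]", "[SEP]")));
    // Assumed RoBERTa-style special tokens.
    System.out.println(Arrays.toString(wrap(pieces, "<s>", "</s>")));
  }
}
```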
-
-
Method Details
-
tokenize
Tokenizes the given text into wordpieces, surrounded by the classification and separator tokens.
-
tokenizePos
Not supported: wordpiece tokens (subwords, ## continuations and special tokens) have no faithful character spans in the original text.
- Specified by:
tokenizePos in interface Tokenizer
- Parameters:
text - The string to be tokenized.
- Returns:
The spans (offsets into the input string) for each token as the individual array elements.
- Throws:
UnsupportedOperationException - Always.
-