Class WordpieceTokenizer
- All Implemented Interfaces:
Tokenizer
Tokenizer implementation which performs tokenization
using word pieces.
Adapted under MIT license from https://github.com/robrua/easy-bert.
Note that this tokenizer performs only the wordpiece (subword) stage
of BERT tokenization. It does not normalize the input text: no lower casing,
no accent stripping, no control character removal. Text that does not match
the vocabulary's casing - for uncased models that includes every capitalized
word - is mapped to the unknown token. Use BertTokenizer for the
full BERT tokenization pipeline.
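To illustrate the kind of normalization this class leaves to the caller (or to BertTokenizer), the following is a minimal sketch using only the JDK. The class name UncasedNormalizationSketch and the method normalize are hypothetical and not part of OpenNLP. Replacing tab and newline characters with a space is an assumption of this sketch, not documented behavior of any OpenNLP class.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of OpenNLP: shows the three normalization
// steps that WordpieceTokenizer does NOT perform on its input.
final class UncasedNormalizationSketch {

    static String normalize(String text) {
        // Step 1: lower casing. Step 2 (first half): decompose so that each
        // accent becomes a separate combining mark.
        String decomposed = Normalizer.normalize(
                text.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            int type = Character.getType(cp);
            if (type == Character.NON_SPACING_MARK) {
                continue; // Step 2 (second half): drop the accent itself.
            }
            if (cp == '\t' || cp == '\n' || cp == '\r') {
                out.append(' '); // Assumption: whitespace controls become a space.
                continue;
            }
            if (type == Character.CONTROL || type == Character.FORMAT) {
                continue; // Step 3: remove remaining control characters.
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Café Crème"));
    }
}
```

Text normalized this way matches the casing of an uncased vocabulary, so capitalized words are no longer mapped to the unknown token only because of their case.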
As of OpenNLP 3.0.0 the behavior matches the reference BERT wordpiece
implementation in three respects that differ from earlier releases:
(1) runs of punctuation, including non-ASCII punctuation, are split into
individual single-character tokens; (2) a word that cannot be fully
represented by vocabulary pieces becomes a single unknown token instead of
the matched prefix pieces followed by the unknown token; and
(3) tokenizePos(String) throws UnsupportedOperationException instead of
returning null.
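The first two behaviors can be sketched with a small greedy longest-match routine. This is a simplified illustration under stated assumptions, not the OpenNLP implementation: the class WordpieceSketch and all of its methods are hypothetical, special tokens such as [CLS] and [SEP] are omitted, no maximum token length is applied, and the punctuation test (ASCII symbol ranges plus the Unicode punctuation categories) is an assumption modeled on the reference BERT code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the OpenNLP class.
final class WordpieceSketch {

    static final String UNK = "[UNK]";

    // Assumption: ASCII symbols and Unicode punctuation categories count as punctuation.
    static boolean isPunctuation(int cp) {
        if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
                || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.CONNECTOR_PUNCTUATION
                || type == Character.DASH_PUNCTUATION
                || type == Character.START_PUNCTUATION
                || type == Character.END_PUNCTUATION
                || type == Character.INITIAL_QUOTE_PUNCTUATION
                || type == Character.FINAL_QUOTE_PUNCTUATION
                || type == Character.OTHER_PUNCTUATION;
    }

    // Splits on whitespace and emits every punctuation character as its own word.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean punctuation = isPunctuation(cp);
            if (Character.isWhitespace(cp) || punctuation) {
                if (current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
                if (punctuation) {
                    words.add(new String(Character.toChars(cp)));
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    // Greedy longest-match-first; continuation pieces carry the "##" prefix.
    static List<String> wordpiece(String word, Set<String> vocabulary) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (end > start) {
                String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // The whole word becomes ONE unknown token; matched prefixes are discarded.
                return List.of(UNK);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    static List<String> tokenize(String text, Set<String> vocabulary) {
        List<String> tokens = new ArrayList<>();
        for (String word : splitWords(text)) {
            tokens.addAll(wordpiece(word, vocabulary));
        }
        return tokens;
    }
}
```

With the vocabulary {un, ##aff, ##able, !, ?}, this sketch turns "unaffable!?" into un, ##aff, ##able, !, ? and turns "unaffablez" into a single [UNK], because the trailing "z" has no matching piece.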
For reference see:
- See Also:
-
Field Summary
Fields
Modifier and Type, Field, Description:
- static final String BERT_CLS_TOKEN: BERT classification token: [CLS].
- static final String BERT_SEP_TOKEN: BERT separator token: [SEP].
- static final String BERT_UNK_TOKEN: BERT unknown token: [UNK].
- static final String ROBERTA_CLS_TOKEN: RoBERTa classification token: <s>.
- static final String ROBERTA_SEP_TOKEN: RoBERTa separator token.
- static final String ROBERTA_UNK_TOKEN: RoBERTa unknown token.
-
Constructor Summary
Constructors
Constructor, Description:
- WordpieceTokenizer(Set<String> vocabulary)
- WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
- WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken): Initializes a WordpieceTokenizer with a vocabulary and custom special tokens.
- WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken, int maxTokenLength): Initializes a WordpieceTokenizer with a vocabulary, custom special tokens and a custom maxTokenLength.
-
Method Summary
Modifier and Type, Method, Description:
- int getMaxTokenLength()
- String[] tokenize(String): Splits a string into its atomic parts.
- Span[] tokenizePos(String text): Not supported: wordpiece tokens (subwords, ## continuations and special tokens) have no faithful character spans in the original text.
-
Field Details
-
BERT_CLS_TOKEN
BERT classification token: [CLS].
- See Also:
-
BERT_SEP_TOKEN
BERT separator token: [SEP].
- See Also:
-
BERT_UNK_TOKEN
BERT unknown token: [UNK].
- See Also:
-
ROBERTA_CLS_TOKEN
RoBERTa classification token: <s>.
- See Also:
-
ROBERTA_SEP_TOKEN
RoBERTa separator token.
- See Also:
-
ROBERTA_UNK_TOKEN
RoBERTa unknown token.
- See Also:
-
-
Constructor Details
-
WordpieceTokenizer
- Parameters:
vocabulary - A set of tokens considered the vocabulary.
-
WordpieceTokenizer
- Parameters:
vocabulary - A set of tokens considered the vocabulary.
maxTokenLength - A non-negative number that is used as the maximum token length.
-
WordpieceTokenizer
public WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken)
Initializes a WordpieceTokenizer with a vocabulary and custom special tokens. This allows support for models like RoBERTa that use different special tokens instead of the BERT defaults.
- Parameters:
vocabulary - The vocabulary.
classificationToken - The CLS token.
separatorToken - The SEP token.
unknownToken - The UNK token.
-
WordpieceTokenizer
public WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken, int maxTokenLength)
Initializes a WordpieceTokenizer with a vocabulary, custom special tokens and a custom maxTokenLength.
- Parameters:
vocabulary - The vocabulary.
classificationToken - The CLS token.
separatorToken - The SEP token.
unknownToken - The UNK token.
maxTokenLength - A non-negative number that is used as the maximum token length.
-
-
Method Details
-
tokenizePos
Not supported: wordpiece tokens (subwords, ## continuations and special tokens) have no faithful character spans in the original text.
- Specified by:
tokenizePos in interface Tokenizer
- Parameters:
text - The string to be tokenized.
- Returns:
The spans (offsets into text) for each token as the individual array elements. This is the inherited contract; this implementation never returns normally.
- Throws:
UnsupportedOperationException - Always.
-
tokenize
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
-
getMaxTokenLength
public int getMaxTokenLength()
- Returns:
- The maximum token length.
-