Class BertTokenizer

java.lang.Object
opennlp.tools.tokenize.BertTokenizer
All Implemented Interfaces:
Tokenizer

public class BertTokenizer extends Object implements Tokenizer
A Tokenizer implementation of the full BERT tokenization pipeline: basic tokenization (text normalization) followed by wordpiece tokenization.

The basic tokenization stage reproduces the reference BERT BasicTokenizer:

  1. Removal of control characters and normalization of all whitespace to single spaces.
  2. Whitespace isolation of CJK ideographs.
  3. For uncased models: lower casing and accent stripping (Unicode NFD decomposition with removal of combining marks).
  4. Isolation of every punctuation character as its own token.
The normalized text is then split into subwords by a WordpieceTokenizer sharing the same vocabulary and special tokens.
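The four basic tokenization steps above can be sketched with the Java standard library alone. This is a minimal illustration, not the OpenNLP source: the class `BasicTokenizationSketch` and its methods are hypothetical names, the CJK check covers only the main CJK Unified Ideographs block, and the steps run in a single pass rather than in the reference implementation's order.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a hypothetical class, not part of OpenNLP.
final class BasicTokenizationSketch {

  // Step 2 helper. Assumption: only the main CJK Unified Ideographs block
  // is checked; the reference implementation covers further ranges.
  static boolean isCjk(int cp) {
    return cp >= 0x4E00 && cp <= 0x9FFF;
  }

  // Step 4 helper: ASCII symbols plus every Unicode punctuation category.
  static boolean isPunctuation(int cp) {
    if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
        || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
      return true;
    }
    switch (Character.getType(cp)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  // Moves the pending characters, if any, into the token list.
  private static void flush(StringBuilder current, List<String> tokens) {
    if (current.length() > 0) {
      tokens.add(current.toString());
      current.setLength(0);
    }
  }

  static List<String> basicTokenize(String text, boolean lowerCase) {
    // Step 3 (uncased only): lower case, then NFD-decompose so that
    // accents become separate combining marks.
    String s = lowerCase
        ? Normalizer.normalize(text.toLowerCase(Locale.ROOT), Normalizer.Form.NFD)
        : text;
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      i += Character.charCount(cp);
      int type = Character.getType(cp);
      if (Character.isWhitespace(cp) || type == Character.SPACE_SEPARATOR) {
        flush(current, tokens);  // step 1: any whitespace ends the current token
      } else if (type == Character.CONTROL || type == Character.FORMAT) {
        // step 1: control characters are dropped
      } else if (lowerCase && type == Character.NON_SPACING_MARK) {
        // step 3: combining marks are dropped (accent stripping)
      } else if (isCjk(cp) || isPunctuation(cp)) {
        flush(current, tokens);  // steps 2 and 4: the character stands alone
        tokens.add(new String(Character.toChars(cp)));
      } else {
        current.appendCodePoint(cp);
      }
    }
    flush(current, tokens);
    return tokens;
  }
}
```

With this sketch, `basicTokenize("Héllo, World!", true)` yields the tokens `hello`, `,`, `world`, `!`, while passing `false` leaves case and accents untouched and yields `Héllo`, `,`, `World`, `!`.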

This pipeline is required for correct results with BERT-style models. Feeding raw text directly to WordpieceTokenizer maps every token that does not literally appear in the vocabulary to the unknown token; for uncased models, that includes every capitalized word.
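To see why unnormalized input ends up as the unknown token, consider the greedy longest-match-first wordpiece algorithm used by the reference BERT implementation. The following is a sketch under assumptions, not the OpenNLP WordpieceTokenizer source: `WordpieceSketch`, its method, and the three-entry vocabulary are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: a hypothetical class, not part of OpenNLP.
final class WordpieceSketch {

  // Splits one word into the longest vocabulary entries, left to right.
  // Pieces after the first carry the "##" continuation prefix.
  static List<String> wordpiece(String word, Set<String> vocabulary, String unknownToken) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      while (start < end) {
        String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
        if (vocabulary.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;  // shrink the candidate by one character and retry
      }
      if (match == null) {
        // No prefix of the remainder is in the vocabulary:
        // the whole word becomes the unknown token.
        return List.of(unknownToken);
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  public static void main(String[] args) {
    Set<String> vocabulary = Set.of("hello", "token", "##izer");  // toy uncased vocabulary
    System.out.println(wordpiece("hello", vocabulary, "[UNK]"));
    System.out.println(wordpiece("Hello", vocabulary, "[UNK]"));
    System.out.println(wordpiece("tokenizer", vocabulary, "[UNK]"));
  }
}
```

Against this toy vocabulary, `hello` is found as-is and `tokenizer` splits into `token` and `##izer`, but `Hello` has no matching prefix because the lookup is case-sensitive, so it collapses to `[UNK]`. Lower casing the text beforehand avoids that loss.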

Whether to use the lower casing variant is a property of the model: uncased models (for example bert-base-uncased and the sentence-transformers models derived from it) require it, whereas cased models must not use it. Accent stripping is coupled to lower casing, as in the reference implementation's default (strip_accents follows do_lower_case unless overridden).

  • Constructor Details

    • BertTokenizer

      public BertTokenizer(Set<String> vocabulary)
      Initializes a BertTokenizer for an uncased BERT model, with lower casing and accent stripping enabled.
      Parameters:
      vocabulary - The wordpiece vocabulary. Must not be null.
    • BertTokenizer

      public BertTokenizer(Set<String> vocabulary, boolean lowerCase)
      Initializes a BertTokenizer with the default BERT special tokens for classification, separation and unknown words.
      Parameters:
      vocabulary - The wordpiece vocabulary. Must not be null.
      lowerCase - true for uncased models (lower casing and accent stripping), false for cased models.
    • BertTokenizer

      public BertTokenizer(Set<String> vocabulary, boolean lowerCase, String classificationToken, String separatorToken, String unknownToken)
      Initializes a BertTokenizer with custom special tokens, for models like RoBERTa that do not use the BERT defaults.
      Parameters:
      vocabulary - The wordpiece vocabulary. Must not be null.
      lowerCase - true for uncased models (lower casing and accent stripping), false for cased models.
      classificationToken - The CLS token.
      separatorToken - The SEP token.
      unknownToken - The UNK token.
  • Method Details

    • tokenize

      public String[] tokenize(String text)
      Tokenizes the given text into wordpieces, surrounded by the classification and separator tokens.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      text - The text to tokenize. Must not be null.
      Returns:
      The wordpiece tokens.
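The shape of the returned array can be sketched as follows. This is an assumption-laden illustration rather than the OpenNLP implementation: `SpecialTokenFramingSketch` and `frame` are hypothetical names, and the sketch assumes exactly one classification token at the start and one separator token at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical class, not part of OpenNLP.
final class SpecialTokenFramingSketch {

  // Surrounds already-computed wordpieces with the classification and
  // separator tokens. Assumption: one of each, at the start and the end.
  static String[] frame(List<String> wordpieces,
                        String classificationToken,
                        String separatorToken) {
    List<String> framed = new ArrayList<>(wordpieces.size() + 2);
    framed.add(classificationToken);
    framed.addAll(wordpieces);
    framed.add(separatorToken);
    return framed.toArray(new String[0]);
  }
}
```

Under these assumptions, framing the pieces `token` and `##izer` with `[CLS]` and `[SEP]` gives a four-element array, so callers that map tokens back to words need to account for the two extra entries.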
    • tokenizePos

      public Span[] tokenizePos(String text)
      Not supported: wordpiece tokens (subwords, ## continuations and special tokens) have no faithful character spans in the original text.
      Specified by:
      tokenizePos in interface Tokenizer
      Parameters:
      text - The string to be tokenized.
      Returns:
      The spans (offsets into text) for each token as the individual array elements. This implementation never returns normally.
      Throws:
      UnsupportedOperationException - Always.