Class WordpieceTokenizer

java.lang.Object
opennlp.tools.tokenize.WordpieceTokenizer
All Implemented Interfaces:
Tokenizer

public class WordpieceTokenizer extends Object implements Tokenizer
A Tokenizer implementation which performs tokenization using word pieces.

Adapted under MIT license from https://github.com/robrua/easy-bert.

Note that this tokenizer performs only the wordpiece (subword) stage of BERT tokenization. It does not normalize the input text: it performs no lowercasing, no accent stripping, and no control character removal. Text that does not match the vocabulary's casing (for uncased models, that includes every capitalized word) is mapped to the unknown token. Use BertTokenizer for the full BERT tokenization pipeline.
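The normalization steps that this class skips can be approximated before the text reaches the tokenizer. The following is a minimal sketch, not the BertTokenizer implementation: the class name BertStyleNormalizerSketch and its normalize method are hypothetical, and the rules are a simplified approximation of uncased BERT preprocessing.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration; not part of OpenNLP.
final class BertStyleNormalizerSketch {

  /** Lowercases, strips accents and removes control/format characters (simplified). */
  static String normalize(String text) {
    StringBuilder cleaned = new StringBuilder();
    text.codePoints().forEach(cp -> {
      if (cp == '\t' || cp == '\n' || cp == '\r') {
        // Treat these control characters as plain whitespace.
        cleaned.append(' ');
        return;
      }
      int type = Character.getType(cp);
      if (type != Character.CONTROL && type != Character.FORMAT) {
        cleaned.appendCodePoint(cp);
      }
    });
    String lower = cleaned.toString().toLowerCase(Locale.ROOT);
    // NFD splits accented letters into base letter + combining mark (category Mn).
    String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{Mn}+", "");
  }

  public static void main(String[] args) {
    System.out.println(normalize("Caf\u00e9")); // cafe
  }
}
```

With an uncased vocabulary, passing normalize(text) rather than the raw text avoids capitalized words being mapped to the unknown token.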

As of OpenNLP 3.0.0, the behavior matches the reference BERT wordpiece implementation in three respects that differ from earlier releases. First, runs of punctuation, including non-ASCII punctuation, are split into individual single-character tokens. Second, a word that cannot be fully represented by vocabulary pieces becomes a single unknown token, instead of the matched prefix pieces followed by the unknown token. Third, tokenizePos(String) throws UnsupportedOperationException instead of returning null.
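The first two behaviors can be illustrated with a self-contained sketch of greedy longest-match-first wordpiece splitting, modeled on the reference BERT algorithm. This is not the OpenNLP source: the class WordpieceSketch, its method names, and the "[UNK]" constant are hypothetical, and treating maxLength as a per-word character limit is an assumption borrowed from the reference implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the OpenNLP implementation.
final class WordpieceSketch {

  static final String UNK = "[UNK]"; // assumed unknown token for this sketch

  /** ASCII symbol ranges plus the Unicode punctuation categories. */
  static boolean isPunctuation(int cp) {
    if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
        || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
      return true;
    }
    switch (Character.getType(cp)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  /** Splits on whitespace and emits every punctuation character as its own token. */
  static List<String> splitWords(String text) {
    List<String> words = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp);
      boolean punct = isPunctuation(cp);
      if (Character.isWhitespace(cp) || punct) {
        if (current.length() > 0) {
          words.add(current.toString());
          current.setLength(0);
        }
        if (punct) {
          words.add(new String(Character.toChars(cp)));
        }
      } else {
        current.appendCodePoint(cp);
      }
    }
    if (current.length() > 0) {
      words.add(current.toString());
    }
    return words;
  }

  /** Greedy longest-match-first; any unmatched remainder turns the whole word into UNK. */
  static List<String> wordToPieces(String word, Set<String> vocab, int maxLength) {
    if (word.length() > maxLength) {
      return Collections.singletonList(UNK);
    }
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      while (start < end) {
        // Continuation pieces carry the "##" prefix.
        String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        return Collections.singletonList(UNK); // discard any prefix pieces already matched
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  static List<String> tokenize(String text, Set<String> vocab, int maxLength) {
    List<String> out = new ArrayList<>();
    for (String word : splitWords(text)) {
      out.addAll(wordToPieces(word, vocab, maxLength));
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able", ".", "!"));
    System.out.println(tokenize("unaffable...!", vocab, 50)); // [un, ##aff, ##able, ., ., ., !]
    System.out.println(tokenize("unaffablex", vocab, 50));    // [[UNK]]
  }
}
```

In the second call the prefix pieces un, ##aff and ##able all match, but the trailing x does not, so the whole word collapses to one unknown token. The sketch omits special tokens and any other details of the real class.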

  • Field Details

  • Constructor Details

    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary)
      Initializes a WordpieceTokenizer with a vocabulary and a default maxTokenLength of 50.
      Parameters:
      vocabulary - A set of tokens considered the vocabulary.
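A vocabulary set is often built from a plain-text file. The sketch below assumes the file lists one token per line, as in the vocab.txt files distributed with BERT models; the class VocabularyLoaderSketch and its load methods are hypothetical helpers, not part of OpenNLP, and skipping blank lines is a choice made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of OpenNLP.
final class VocabularyLoaderSketch {

  /** Reads one token per line, trimming whitespace and skipping blank lines. */
  static Set<String> load(Reader reader) {
    return new BufferedReader(reader).lines()
        .map(String::trim)
        .filter(line -> !line.isEmpty())
        .collect(Collectors.toCollection(LinkedHashSet::new));
  }

  /** Convenience overload for a UTF-8 encoded vocabulary file. */
  static Set<String> load(Path vocabFile) throws IOException {
    try (Reader reader = Files.newBufferedReader(vocabFile, StandardCharsets.UTF_8)) {
      return load(reader);
    }
  }
}
```

The resulting Set<String> can then be passed as the vocabulary argument of any of the constructors listed here.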
    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
      Initializes a WordpieceTokenizer with a vocabulary and a custom maxTokenLength.
      Parameters:
      vocabulary - A set of tokens considered the vocabulary.
      maxTokenLength - A non-negative number used as the maximum token length.
    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken)
      Initializes a WordpieceTokenizer with a vocabulary and custom special tokens. This allows support for models like RoBERTa that use different special tokens instead of the BERT defaults.
      Parameters:
      vocabulary - The vocabulary.
      classificationToken - The CLS token.
      separatorToken - The SEP token.
      unknownToken - The UNK token.
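To make the role of the three special tokens concrete, the sketch below pairs the BERT defaults with RoBERTa-style values and wraps a piece sequence in the classification and separator tokens. The class SpecialTokensSketch is hypothetical, and whether tokenize(String) itself adds these markers to its output is an assumption here, not something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for illustration; not part of OpenNLP.
final class SpecialTokensSketch {

  static final SpecialTokensSketch BERT = new SpecialTokensSketch("[CLS]", "[SEP]", "[UNK]");
  static final SpecialTokensSketch ROBERTA = new SpecialTokensSketch("<s>", "</s>", "<unk>");

  final String classificationToken;
  final String separatorToken;
  final String unknownToken;

  SpecialTokensSketch(String classificationToken, String separatorToken, String unknownToken) {
    this.classificationToken = classificationToken;
    this.separatorToken = separatorToken;
    this.unknownToken = unknownToken;
  }

  /** Returns a new list with the classification token first and the separator token last. */
  List<String> wrap(List<String> pieces) {
    List<String> wrapped = new ArrayList<>(pieces.size() + 2);
    wrapped.add(classificationToken);
    wrapped.addAll(pieces);
    wrapped.add(separatorToken);
    return wrapped;
  }

  public static void main(String[] args) {
    System.out.println(ROBERTA.wrap(Arrays.asList("hello", "world"))); // [<s>, hello, world, </s>]
  }
}
```

The same three strings would be supplied as classificationToken, separatorToken and unknownToken to the constructor above, alongside a vocabulary that contains them.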
    • WordpieceTokenizer

      public WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken, int maxTokenLength)
      Initializes a WordpieceTokenizer with a vocabulary, custom special tokens and a custom maxTokenLength.
      Parameters:
      vocabulary - The vocabulary.
      classificationToken - The CLS token.
      separatorToken - The SEP token.
      unknownToken - The UNK token.
      maxTokenLength - A non-negative number used as the maximum token length.
  • Method Details

    • tokenizePos

      public Span[] tokenizePos(String text)
      Not supported: wordpiece tokens (subwords, ## continuations and special tokens) have no faithful character spans in the original text.
      Specified by:
      tokenizePos in interface Tokenizer
      Parameters:
      text - The string to be tokenized.
      Returns:
      Nothing; this implementation never returns normally because it always throws UnsupportedOperationException.
      Throws:
      UnsupportedOperationException - Always.
    • tokenize

      public String[] tokenize(String text)
      Description copied from interface: Tokenizer
      Splits a string into its atomic parts.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      text - The string to be tokenized.
      Returns:
      The String[] with the individual tokens as the array elements.
    • getMaxTokenLength

      public int getMaxTokenLength()
      Returns:
      The maximum token length.