Class StopwordFilteringTokenizer

java.lang.Object
opennlp.tools.stopword.StopwordFilteringTokenizer
All Implemented Interfaces:
opennlp.tools.tokenize.Tokenizer

@ThreadSafe public final class StopwordFilteringTokenizer extends Object implements opennlp.tools.tokenize.Tokenizer
A Tokenizer decorator which delegates tokenization to a wrapped Tokenizer and then removes any tokens identified as stopwords by the supplied StopwordFilter.

Both tokenize(String) and tokenizePos(String) apply the filter using the same greedy longest-match window scan, so single-token (1-gram) and multi-token (n-gram) stopword entries are dropped identically by both methods and by StopwordFilterStream. For tokenizePos(String), the Spans covering a matched entry are dropped, while the offsets of the remaining Spans are left unchanged: they continue to refer to positions in the original input string.
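The greedy longest-match window scan described above can be illustrated with a small, self-contained sketch. It does not use the OpenNLP classes: the class name GreedyTokenScan, the maxWindow parameter, the representation of entries as space-joined strings, and the case-sensitive exact matching are all assumptions made for illustration, and the actual StopwordFilter may store and match entries differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of OpenNLP): greedy longest-match stopword scan over tokens. */
final class GreedyTokenScan {

  /**
   * Drops every window of consecutive tokens that forms a stopword entry.
   * Assumption: entries are stored as space-joined token sequences, e.g. "as well as",
   * and maxWindow is the length (in tokens) of the longest entry.
   */
  static String[] filter(String[] tokens, Set<String> entries, int maxWindow) {
    List<String> kept = new ArrayList<>();
    int i = 0;
    while (i < tokens.length) {
      int matched = 0;
      // Try the longest window first so an n-gram entry wins over its 1-gram prefix.
      for (int len = Math.min(maxWindow, tokens.length - i); len >= 1; len--) {
        String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
        if (entries.contains(candidate)) {
          matched = len;
          break;
        }
      }
      if (matched > 0) {
        i += matched;        // skip the whole matched entry
      } else {
        kept.add(tokens[i]); // no entry starts here: keep the token, advance by one
        i++;
      }
    }
    return kept.toArray(new String[0]);
  }
}
```

With the tokens cats, as, well, as, dogs, and, the, birds, the entries "as well as", "as" and "the", and a maxWindow of 3, this sketch keeps cats, dogs, and, birds: the three-token entry is removed as a unit rather than as two separate "as" matches.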

Instances hold no mutable state of their own, so they are safe for concurrent use provided that both the wrapped Tokenizer and the StopwordFilter are thread-safe. DictionaryStopwordFilter is unconditionally thread-safe; combined with a thread-safe delegate tokenizer (e.g. SimpleTokenizer.INSTANCE), the resulting decorator is thread-safe with no further synchronization required.

  • Constructor Summary

    Constructors
    Constructor
    Description
    StopwordFilteringTokenizer(opennlp.tools.tokenize.Tokenizer delegate, opennlp.tools.stopword.StopwordFilter filter)
  • Method Summary

    Modifier and Type
    Method
    Description
    String[]
    tokenize(String s)
    Tokenizes the supplied string with the wrapped Tokenizer and then removes any tokens which the StopwordFilter considers a stopword.
    opennlp.tools.util.Span[]
    tokenizePos(String s)
    Computes token spans with the wrapped Tokenizer and then drops the spans covering any stopword entry according to the StopwordFilter.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • StopwordFilteringTokenizer

      public StopwordFilteringTokenizer(opennlp.tools.tokenize.Tokenizer delegate, opennlp.tools.stopword.StopwordFilter filter)
      Parameters:
      delegate - The underlying Tokenizer that produces the raw tokens. Must not be null.
      filter - The StopwordFilter which decides whether a token is a stopword. Must not be null.
      Throws:
      IllegalArgumentException - if delegate or filter is null.
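The constructor contract (two non-null collaborators, rejected with IllegalArgumentException otherwise) and the decorator shape can be sketched as follows. This is a minimal illustration, not the OpenNLP source: the interfaces Tok and Filter are hypothetical stand-ins for Tokenizer and StopwordFilter, and the assumption that the filter returns a String[] is made only for this sketch.

```java
/** Hypothetical sketch (not the OpenNLP implementation) of a null-checking tokenizer decorator. */
final class FilteringDecoratorSketch {

  /** Stand-in for opennlp.tools.tokenize.Tokenizer (only the method needed here). */
  interface Tok {
    String[] tokenize(String s);
  }

  /** Stand-in for StopwordFilter; the String[] return type is an assumption. */
  interface Filter {
    String[] filter(String[] tokens);
  }

  static final class FilteringTok implements Tok {
    // Final fields assigned once: the decorator itself carries no mutable state.
    private final Tok delegate;
    private final Filter filter;

    FilteringTok(Tok delegate, Filter filter) {
      if (delegate == null || filter == null) {
        throw new IllegalArgumentException("delegate and filter must not be null");
      }
      this.delegate = delegate;
      this.filter = filter;
    }

    @Override
    public String[] tokenize(String s) {
      // Delegate first, then filter the raw tokens.
      return filter.filter(delegate.tokenize(s));
    }
  }
}
```

Because the sketch only reads its two final fields, its thread-safety depends entirely on the collaborators passed in, which matches the condition stated in the class description.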
  • Method Details

    • tokenize

      public String[] tokenize(String s)
      Tokenizes the supplied string with the wrapped Tokenizer and then removes any tokens which the StopwordFilter considers a stopword.
      Specified by:
      tokenize in interface opennlp.tools.tokenize.Tokenizer
      Parameters:
      s - The string to be tokenized.
      Returns:
      The remaining tokens in their original order.
    • tokenizePos

      public opennlp.tools.util.Span[] tokenizePos(String s)
      Computes token spans with the wrapped Tokenizer and then drops the spans covering any stopword entry according to the StopwordFilter. A greedy left-to-right window scan mirrors StopwordFilter.filter(String[]): at each position the longest window of consecutive spans whose covered texts form a registered entry is removed; otherwise the current span is kept and the scan advances by one. This way multi-word (n-gram) entries are dropped here exactly as they are by tokenize(String). The relative order and the offsets of the surviving spans are preserved.
      Specified by:
      tokenizePos in interface opennlp.tools.tokenize.Tokenizer
      Parameters:
      s - The string to be tokenized.
      Returns:
      The remaining Spans in their original order.
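The span-based variant of the scan can be sketched in the same way. The example below is self-contained and hypothetical: Sp is a minimal stand-in for opennlp.tools.util.Span with half-open [start, end) offsets, and entries are again assumed to be space-joined, case-sensitive strings. It shows how surviving spans keep offsets into the original input even when the spans between them are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of OpenNLP): greedy longest-match stopword scan over spans. */
final class GreedySpanScan {

  /** Minimal stand-in for opennlp.tools.util.Span: half-open [start, end) offsets. */
  static final class Sp {
    final int start;
    final int end;

    Sp(int start, int end) {
      this.start = start;
      this.end = end;
    }

    /** Returns the text this span covers in the given input. */
    String text(CharSequence s) {
      return s.subSequence(start, end).toString();
    }
  }

  static List<Sp> filter(String input, List<Sp> spans, Set<String> entries, int maxWindow) {
    List<Sp> kept = new ArrayList<>();
    int i = 0;
    while (i < spans.size()) {
      int matched = 0;
      // Longest window first; stop as soon as one window matches an entry.
      for (int len = Math.min(maxWindow, spans.size() - i); len >= 1 && matched == 0; len--) {
        StringBuilder key = new StringBuilder();
        for (int k = i; k < i + len; k++) {
          if (k > i) {
            key.append(' ');
          }
          key.append(spans.get(k).text(input));
        }
        if (entries.contains(key.toString())) {
          matched = len;
        }
      }
      if (matched > 0) {
        i += matched;           // drop every span covering the matched entry
      } else {
        kept.add(spans.get(i)); // keep the span object untouched, offsets included
        i++;
      }
    }
    return kept;
  }
}
```

For the input "cats as well as dogs" with spans at [0,4), [5,7), [8,12), [13,15) and [16,20) and the single entry "as well as", the sketch returns the spans [0,4) and [16,20). The second span still yields "dogs" when applied to the original string, since no offsets are recomputed.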