Interface StopwordFilter


public interface StopwordFilter
A pluggable filter that decides whether a token (or a sequence of tokens) is a stopword that should be removed during downstream text processing.

Implementations may be backed by a static bundled list, a user-supplied file, an in-memory data structure, or any other source. Both single-token and multi-token (n-gram) membership tests are supported.

  • Method Summary

    Modifier and Type
    Method
    Description
    String[]
    filter(String[] tokens)
    Returns a copy of tokens with stopword matches removed, preserving the input order.
    boolean
    isCaseSensitive()
    Reports whether this filter performs case-sensitive matching.
    boolean
    isStopword(CharSequence token)
    Checks whether the given token is a single-token stopword.
    boolean
    isStopword(String... tokens)
    Checks whether the given sequence of tokens is a multi-token stopword (n-gram).
    Set<String>
    stopwords()
    Returns an unmodifiable snapshot of the registered single-token stopwords.
  • Method Details

    • isStopword

      boolean isStopword(CharSequence token)
      Checks whether the given token is a single-token stopword. Equivalent to isStopword(new String[] { token.toString() }) when token is non-null.
      Parameters:
      token - The token to test. May be null, in which case implementations should return false.
      Returns:
      true if token is registered as a single-token stopword, false otherwise.
    • isStopword

      boolean isStopword(String... tokens)
      Checks whether the given sequence of tokens is a multi-token stopword (n-gram). For a single token this is equivalent to isStopword(CharSequence).
      Parameters:
      tokens - The tokens to test as one entry. May be null or empty, in which case implementations should return false.
      Returns:
      true if the sequence is registered as a stopword, false otherwise.
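
      The relationship between the two isStopword overloads can be sketched with a small in-memory class. This is only an illustration under stated assumptions: the class name InMemoryStopwords, its add method, and the choice to store every entry as a token list are hypothetical and not part of this interface.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory sketch of the two membership tests (not part of the API).
final class InMemoryStopwords {
    // Every entry is stored as a token list, so 1-grams and n-grams share one set.
    private final Set<List<String>> entries = new HashSet<>();

    // Hypothetical registration method; a single token becomes a 1-element list.
    InMemoryStopwords add(String... tokens) {
        entries.add(List.of(tokens));
        return this;
    }

    boolean isStopword(CharSequence token) {
        // A null token is never a stopword; otherwise delegate to the n-gram test.
        return token != null && isStopword(new String[] { token.toString() });
    }

    boolean isStopword(String... tokens) {
        // A null or empty sequence is never a stopword.
        if (tokens == null || tokens.length == 0) {
            return false;
        }
        return entries.contains(Arrays.asList(tokens));
    }
}
```

      Note that a bare null argument is ambiguous between the two overloads at compile time, so callers passing null need a cast such as (CharSequence) null or (String[]) null.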
    • filter

      String[] filter(String[] tokens)
      Returns a copy of tokens with stopword matches removed, preserving the input order.

      Implementations should honor both 1-gram and n-gram entries. A recommended strategy is a greedy left-to-right window scan: at each position, try the longest registered window first and then progressively shorter ones. If a window matches, skip all of its tokens and continue after it; if no window matches, keep the current token and advance by one. Implementations that do not support n-gram entries may fall back to 1-gram filtering.

      Parameters:
      tokens - The input token array. Must not be null. Individual array elements may be null and are kept as-is.
      Returns:
      A new array containing the surviving tokens. Never null.
      Throws:
      IllegalArgumentException - if tokens is null.
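
      The greedy left-to-right window scan recommended above might look roughly like the following. It is a sketch, not a reference implementation: the class GreedyStopwordScan, its add method, and the maxWindow bookkeeping are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the greedy window scan (not part of the API).
final class GreedyStopwordScan {
    private final Set<List<String>> entries = new HashSet<>();
    private int maxWindow = 0; // length of the longest registered entry

    // Hypothetical registration method for 1-gram and n-gram entries.
    GreedyStopwordScan add(String... tokens) {
        entries.add(List.of(tokens));
        maxWindow = Math.max(maxWindow, tokens.length);
        return this;
    }

    String[] filter(String[] tokens) {
        if (tokens == null) {
            throw new IllegalArgumentException("tokens must not be null");
        }
        List<String> all = Arrays.asList(tokens);
        List<String> kept = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest window that still fits, then shrink down to one token.
            for (int w = Math.min(maxWindow, tokens.length - i); w >= 1; w--) {
                if (entries.contains(all.subList(i, i + w))) {
                    matched = w;
                    break;
                }
            }
            if (matched > 0) {
                i += matched;        // drop the whole matched window
            } else {
                kept.add(tokens[i]); // keep the token as-is, null elements included
                i++;
            }
        }
        return kept.toArray(new String[0]);
    }
}
```

      Trying the longest window first means an n-gram entry such as "in order to" takes precedence over a shorter entry such as "in" that starts at the same position.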
    • isCaseSensitive

      boolean isCaseSensitive()
      Returns:
      true if this filter performs case-sensitive matching; false if matching is case-insensitive.
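
      One way an implementation could back this flag is to normalize both registered entries and queried tokens when matching is case-insensitive. The sketch below assumes a hypothetical class CaseAwareStopwords and lower-casing with Locale.ROOT; the interface itself does not prescribe how case folding is done.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of case handling for single-token entries (not part of the API).
final class CaseAwareStopwords {
    private final boolean caseSensitive;
    private final Set<String> singles = new HashSet<>();

    CaseAwareStopwords(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Assumption: case-insensitive matching is done by lower-casing with Locale.ROOT.
    private String normalize(String token) {
        return caseSensitive ? token : token.toLowerCase(Locale.ROOT);
    }

    // Hypothetical registration method; entries are normalized on the way in.
    CaseAwareStopwords add(String token) {
        singles.add(normalize(token));
        return this;
    }

    boolean isCaseSensitive() {
        return caseSensitive;
    }

    boolean isStopword(CharSequence token) {
        // Queries are normalized the same way as stored entries.
        return token != null && singles.contains(normalize(token.toString()));
    }
}
```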
    • stopwords

      Set<String> stopwords()
      Returns an unmodifiable snapshot of the registered single-token stopwords. Multi-token (n-gram) entries are not included in this view and must be tested via isStopword(String...).

      Attempts to mutate the returned Set will fail.

      Returns:
      An unmodifiable Set of stopwords. Never null.
      Throws:
      UnsupportedOperationException - thrown by the returned Set, not by this method itself, if a caller attempts to add to, remove from, or otherwise mutate it.
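
      The snapshot semantics described for stopwords() can be illustrated with Set.copyOf (available since Java 10), which returns an unmodifiable copy. The class StopwordSnapshotDemo and its add method are hypothetical; an implementation is free to produce the snapshot in another way.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an unmodifiable snapshot view (not part of the API).
final class StopwordSnapshotDemo {
    private final Set<String> singles = new HashSet<>();

    // Hypothetical registration method for single-token entries.
    void add(String token) {
        singles.add(token);
    }

    Set<String> stopwords() {
        // Set.copyOf returns an unmodifiable copy, so later registrations
        // do not show up in a snapshot that was already handed out.
        return Set.copyOf(singles);
    }
}
```

      A caller holding the returned Set sees the entries registered at the time of the call; mutator methods on it, such as add or remove, throw UnsupportedOperationException.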