Package opennlp.tools.stopword
Interface StopwordFilter
public interface StopwordFilter
A pluggable filter that decides whether a token (or a sequence of tokens)
is a stopword that should be removed during downstream text processing.
Implementations may be backed by a static bundled list, a user-supplied file, an in-memory data structure, or any other source. Both single-token and multi-token (n-gram) membership tests are supported.
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| String[] | filter(String[] tokens) | Returns a copy of tokens with stopword matches removed, preserving the input order. |
| boolean | isCaseSensitive() | |
| boolean | isStopword(CharSequence token) | Checks whether the given token is a single-token stopword. |
| boolean | isStopword(String... tokens) | Checks whether the given sequence of tokens is a multi-token stopword (n-gram). |
| Set | stopwords() | Returns an unmodifiable snapshot of the registered single-token stopwords. |
Method Details
isStopword
boolean isStopword(CharSequence token)
Checks whether the given token is a single-token stopword. Equivalent to isStopword(new String[] { token.toString() }) when token is non-null.
- Parameters:
  token - The token to test. May be null, in which case implementations should return false.
- Returns:
  true if token is registered as a single-token stopword, false otherwise.
isStopword
boolean isStopword(String... tokens)
Checks whether the given sequence of tokens is a multi-token stopword (n-gram). For a single token this is equivalent to isStopword(CharSequence).
- Parameters:
  tokens - The tokens to test as one entry. May be null or empty, in which case implementations should return false.
- Returns:
  true if the sequence is registered as a stopword, false otherwise.
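The relationship between the two isStopword overloads can be illustrated with a minimal sketch. It assumes every entry is stored as a token sequence in a hash set, so a single-token stopword is simply a sequence of length one. The class name MembershipSketch and its add method are hypothetical and are not part of the documented API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of the documented package.
final class MembershipSketch {
    // Each entry is a token sequence; single-token stopwords have length 1.
    private final Set<List<String>> entries = new HashSet<>();

    // Hypothetical helper for registering an entry (1-gram or n-gram).
    void add(String... tokens) {
        entries.add(List.of(tokens));
    }

    boolean isStopword(CharSequence token) {
        // A null token is never a stopword; otherwise delegate to the
        // sequence overload with a one-element array.
        return token != null && isStopword(new String[] { token.toString() });
    }

    boolean isStopword(String... tokens) {
        if (tokens == null || tokens.length == 0) {
            return false;
        }
        // Arrays.asList tolerates null elements, unlike List.of.
        return entries.contains(Arrays.asList(tokens));
    }
}
```

Note that a literal null argument is ambiguous between the two overloads and needs a cast, for example isStopword((CharSequence) null) or isStopword((String[]) null).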
filter
String[] filter(String[] tokens)
Returns a copy of tokens with stopword matches removed, preserving the input order.
Implementations should honor both 1-gram and n-gram entries. A recommended strategy is a greedy left-to-right window scan: at each position, try the longest registered window first; if it matches, skip those tokens; otherwise try shorter windows, and if none matches, keep the current token and advance by one. Implementations that do not support n-gram entries may fall back to 1-gram filtering.
- Parameters:
  tokens - The input token array. Must not be null. Individual array elements may be null and are kept as-is.
- Returns:
  A new array containing the surviving tokens. Never null.
- Throws:
  IllegalArgumentException - if tokens is null.
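The greedy left-to-right window scan recommended above might look roughly like the following. This is only a sketch under stated assumptions: entries are kept as token sequences in a hash set, and the length of the longest entry is tracked so the scan knows which window size to try first. The class name GreedyFilterSketch, the add method, and the maxWindow field are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; names are not from the documented API.
final class GreedyFilterSketch {
    private final Set<List<String>> entries = new HashSet<>();
    private int maxWindow = 1; // length of the longest registered entry

    // Hypothetical helper for registering an entry (1-gram or n-gram).
    void add(String... tokens) {
        entries.add(List.of(tokens));
        maxWindow = Math.max(maxWindow, tokens.length);
    }

    String[] filter(String[] tokens) {
        if (tokens == null) {
            throw new IllegalArgumentException("tokens must not be null");
        }
        List<String> all = Arrays.asList(tokens);
        List<String> kept = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest window that still fits first, then shrink.
            for (int n = Math.min(maxWindow, tokens.length - i); n >= 1; n--) {
                if (entries.contains(all.subList(i, i + n))) {
                    matched = n;
                    break;
                }
            }
            if (matched > 0) {
                i += matched;        // skip the whole matched window
            } else {
                kept.add(tokens[i]); // keep the token, null elements included
                i++;
            }
        }
        return kept.toArray(new String[0]);
    }
}
```

With the entries "the" and "as well as" registered, filtering the tokens the, cat, as, well, as, dog under this sketch leaves cat, dog. Each position costs at most maxWindow set lookups, so the scan stays linear in the input length for a fixed maximum entry length.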
isCaseSensitive
boolean isCaseSensitive()
- Returns:
  true if this filter performs case-sensitive matching; false if matching is case-insensitive.
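One possible way for an implementation to honor this flag is to normalize both the registered stopwords and the queried tokens when matching is case-insensitive. The sketch below assumes lower-casing with Locale.ROOT as the normalization; the documentation does not prescribe any particular approach, and the class name CaseSketch and its constructor are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of the documented package.
final class CaseSketch {
    private final boolean caseSensitive;
    private final Set<String> words = new HashSet<>();

    CaseSketch(boolean caseSensitive, String... stopwords) {
        this.caseSensitive = caseSensitive;
        for (String w : stopwords) {
            words.add(normalize(w));
        }
    }

    // Assumption: case-insensitive matching is done by lower-casing
    // with Locale.ROOT so results do not depend on the default locale.
    private String normalize(String s) {
        return caseSensitive ? s : s.toLowerCase(Locale.ROOT);
    }

    boolean isCaseSensitive() {
        return caseSensitive;
    }

    boolean isStopword(CharSequence token) {
        return token != null && words.contains(normalize(token.toString()));
    }
}
```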
stopwords
Returns an unmodifiable snapshot of the registered single-token stopwords. Multi-token (n-gram) entries are not included in this view and must be tested via isStopword(String...). Attempts to mutate the returned Set will fail.
- Returns:
  An unmodifiable Set of stopwords. Never null.
- Throws:
  UnsupportedOperationException - if a caller attempts to add to, remove from, or otherwise mutate the returned Set.
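The snapshot behavior described here could be obtained with Set.copyOf (Java 10 or later), as in the sketch below. The element type String is an assumption, since the page does not show the generic type of the returned Set. The class name SnapshotSketch and its add method are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of the documented package.
final class SnapshotSketch {
    private final Set<String> words = new HashSet<>();

    // Hypothetical helper for registering a single-token stopword.
    void add(String word) {
        words.add(word);
    }

    Set<String> stopwords() {
        // Set.copyOf returns an unmodifiable copy, so later changes to
        // 'words' are not visible through an earlier snapshot, and any
        // mutator call on the result throws UnsupportedOperationException.
        return Set.copyOf(words);
    }
}
```

Collections.unmodifiableSet would give a read-only live view rather than a snapshot; copying is what decouples the returned set from later registrations.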