Class StopwordFilteringTokenizer
- All Implemented Interfaces:
opennlp.tools.tokenize.Tokenizer
Tokenizer decorator which delegates tokenization to a wrapped
Tokenizer and then removes any tokens identified as stopwords by
the supplied StopwordFilter.
Both tokenize(String) and tokenizePos(String) apply the
filter using the same greedy longest-match window scan, so single-token
(1-gram) and multi-token (n-gram) stopword entries are dropped identically
across tokenize(String), tokenizePos(String) and
StopwordFilterStream. For tokenizePos(String) the
Spans covering a matched entry are dropped while the offsets of
the remaining spans are kept intact (they continue to refer to positions in
the original input string).
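The greedy longest-match window scan described above can be sketched on plain token arrays. This is a minimal stand-alone illustration, not the OpenNLP implementation: the class `GreedyStopwordScan`, its `filter` method, the `maxLen` parameter, and the exact (case-sensitive) matching are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-alone sketch; names and matching rules are assumptions. */
final class GreedyStopwordScan {

  /**
   * Drops stopword entries from tokens. Each entry is a list of tokens
   * (size 1 for a 1-gram, size n for an n-gram); maxLen is the size of
   * the longest entry.
   */
  static String[] filter(String[] tokens, Set<List<String>> entries, int maxLen) {
    List<String> all = Arrays.asList(tokens);
    List<String> kept = new ArrayList<>();
    int i = 0;
    while (i < tokens.length) {
      int matched = 0;
      // Try the longest window first so a multi-token entry wins over a shorter one.
      for (int len = Math.min(maxLen, tokens.length - i); len >= 1; len--) {
        if (entries.contains(all.subList(i, i + len))) {
          matched = len;
          break;
        }
      }
      if (matched > 0) {
        i += matched;        // drop the whole matched window
      } else {
        kept.add(tokens[i]); // keep the current token and advance by one
        i++;
      }
    }
    return kept.toArray(new String[0]);
  }

  public static void main(String[] args) {
    Set<List<String>> entries = Set.of(List.of("the"), List.of("as", "well", "as"));
    String[] input = {"the", "cat", "as", "well", "as", "dog"};
    System.out.println(Arrays.toString(filter(input, entries, 3))); // prints [cat, dog]
  }
}
```

In this sketch a lone "as" is kept because only the three-token entry "as well as" is registered, which is the behaviour the longest-match rule is meant to give.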
Instances are immutable and therefore safe for concurrent use provided that
both the wrapped Tokenizer and the StopwordFilter are
thread-safe. DictionaryStopwordFilter is unconditionally
thread-safe; combined with a thread-safe delegate tokenizer
(e.g. SimpleTokenizer.INSTANCE) the resulting decorator is
thread-safe with no further synchronization required.
Constructor Summary
Constructors
Constructor: StopwordFilteringTokenizer(opennlp.tools.tokenize.Tokenizer delegate, opennlp.tools.stopword.StopwordFilter filter)
Description: Initializes a StopwordFilteringTokenizer.
Method Summary
Modifier and Type: String[]
Method: tokenize(String s)
Description: Tokenizes the supplied string with the wrapped Tokenizer and then removes any tokens which the StopwordFilter considers a stopword.

Modifier and Type: opennlp.tools.util.Span[]
Method: tokenizePos(String s)
Description: Computes token spans with the wrapped Tokenizer and then drops the spans covering any stopword entry according to the StopwordFilter.
Constructor Details
StopwordFilteringTokenizer
public StopwordFilteringTokenizer(opennlp.tools.tokenize.Tokenizer delegate, opennlp.tools.stopword.StopwordFilter filter)
Initializes a StopwordFilteringTokenizer.
- Parameters:
delegate - The underlying Tokenizer that produces the raw tokens. Must not be null.
filter - The StopwordFilter which decides whether a token is a stopword. Must not be null.
- Throws:
IllegalArgumentException - if delegate or filter is null.
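The constructor contract (two non-null collaborators, IllegalArgumentException otherwise) follows the usual decorator shape. The sketch below illustrates that shape only; the nested `Tokenizer` and `StopwordFilter` interfaces are simplified, hypothetical stand-ins, and the assumption that `filter(String[])` returns a `String[]` is not confirmed by this page.

```java
import java.util.Arrays;

/** Hypothetical sketch of a null-checking, immutable tokenizer decorator. */
final class FilteringDecoratorSketch {

  /** Simplified stand-ins for the real interfaces (assumed signatures). */
  interface Tokenizer { String[] tokenize(String s); }
  interface StopwordFilter { String[] filter(String[] tokens); }

  static final class FilteringTokenizer implements Tokenizer {
    // Final fields: the decorator holds no mutable state of its own.
    private final Tokenizer delegate;
    private final StopwordFilter filter;

    FilteringTokenizer(Tokenizer delegate, StopwordFilter filter) {
      if (delegate == null) {
        throw new IllegalArgumentException("delegate must not be null");
      }
      if (filter == null) {
        throw new IllegalArgumentException("filter must not be null");
      }
      this.delegate = delegate;
      this.filter = filter;
    }

    @Override
    public String[] tokenize(String s) {
      // Delegate first, then filter the raw tokens.
      return filter.filter(delegate.tokenize(s));
    }
  }

  public static void main(String[] args) {
    Tokenizer whitespace = s -> s.split(" ");
    StopwordFilter dropThe = tokens ->
        Arrays.stream(tokens).filter(t -> !t.equals("the")).toArray(String[]::new);
    Tokenizer tokenizer = new FilteringTokenizer(whitespace, dropThe);
    System.out.println(Arrays.toString(tokenizer.tokenize("the quick fox"))); // prints [quick, fox]
  }
}
```

Because the decorator itself is stateless beyond its two final references, its thread safety in this sketch depends entirely on the collaborators passed in.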
Method Details
tokenize
public String[] tokenize(String s)
Tokenizes the supplied string with the wrapped Tokenizer and then removes any tokens which the StopwordFilter considers a stopword.
- Specified by:
tokenize in interface opennlp.tools.tokenize.Tokenizer
- Parameters:
s - The string to be tokenized.
- Returns:
The remaining tokens in their original order.
tokenizePos
public opennlp.tools.util.Span[] tokenizePos(String s)
Computes token spans with the wrapped Tokenizer and then drops the spans covering any stopword entry according to the StopwordFilter. A greedy left-to-right window scan mirrors StopwordFilter.filter(String[]): at each position, the longest window of consecutive spans whose covered texts form a registered entry is removed; if no window matches, the current span is kept and the scan advances by one. This way multi-word (n-gram) entries are dropped here exactly as they are by tokenize(String). The relative order and the offsets of the surviving spans are preserved.
- Specified by:
tokenizePos in interface opennlp.tools.tokenize.Tokenizer
- Parameters:
s - The string to be tokenized.
- Returns:
The remaining Spans in their original order.
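The span-level scan can be sketched in the same way as the token-level one. The example below is an illustration under stated assumptions rather than the actual implementation: the nested `Span` class is a hypothetical stand-in for opennlp.tools.util.Span (start inclusive, end exclusive), and `SpanStopwordScan`, `filter`, and `maxLen` are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of dropping stopword spans while keeping offsets. */
final class SpanStopwordScan {

  /** Stand-in for opennlp.tools.util.Span: start inclusive, end exclusive. */
  static final class Span {
    final int start;
    final int end;
    Span(int start, int end) { this.start = start; this.end = end; }
    @Override public String toString() { return "[" + start + ".." + end + ")"; }
  }

  static List<Span> filter(String text, List<Span> spans,
                           Set<List<String>> entries, int maxLen) {
    // Read the covered text of every span from the original input.
    List<String> covered = new ArrayList<>();
    for (Span s : spans) {
      covered.add(text.substring(s.start, s.end));
    }
    List<Span> kept = new ArrayList<>();
    int i = 0;
    while (i < spans.size()) {
      int matched = 0;
      // Longest window of consecutive spans first.
      for (int len = Math.min(maxLen, spans.size() - i); len >= 1; len--) {
        if (entries.contains(covered.subList(i, i + len))) {
          matched = len;
          break;
        }
      }
      if (matched > 0) {
        i += matched;           // drop every span in the matched window
      } else {
        kept.add(spans.get(i)); // same span object, offsets untouched
        i++;
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    String text = "the cat as well as dog";
    List<Span> spans = List.of(new Span(0, 3), new Span(4, 7), new Span(8, 10),
        new Span(11, 15), new Span(16, 18), new Span(19, 22));
    Set<List<String>> entries = Set.of(List.of("the"), List.of("as", "well", "as"));
    System.out.println(filter(text, spans, entries, 3)); // prints [[4..7), [19..22)]
  }
}
```

The surviving spans still index into the original string, so `text.substring(4, 7)` yields "cat" and `text.substring(19, 22)` yields "dog" in this example.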