Class SpellCorrectingTokenStream

java.lang.Object
opennlp.tools.util.FilterObjectStream<String,String>
opennlp.spellcheck.stream.SpellCorrectingTokenStream
All Implemented Interfaces:
AutoCloseable, opennlp.tools.util.ObjectStream<String>

public class SpellCorrectingTokenStream extends opennlp.tools.util.FilterObjectStream<String,String>
A FilterObjectStream for tokenized data: each element read from the wrapped ObjectStream is a string of tokens separated by a known delimiter (whitespace by default). Every token is spell-corrected independently and the tokens are re-joined with the same delimiter.

This is the shape produced by OpenNLP tokenizers / token-sample formats and is what the trainable components consume: a fixed sequence of tokens per element. Unlike SpellCorrectingObjectStream in compound mode, this stream is token-count preserving – it never splits or merges tokens, so the corrected element stays aligned with any parallel annotation (tags, spans).
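The token-count guarantee described above can be illustrated with a minimal, library-free sketch. It is not the class's actual implementation: `TokenLineSketch`, `correctLine`, and the lookup-table corrector are hypothetical stand-ins chosen for illustration. It only shows the split / correct-each / re-join shape and why the result stays aligned with a parallel tag array.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TokenLineSketch {

    // Splits on the literal delimiter, corrects each token on its own, and re-joins.
    // The -1 limit keeps trailing empty tokens, so the number of tokens never changes.
    static String correctLine(String line, String delimiter, UnaryOperator<String> corrector) {
        if (line == null) {
            return null; // end of stream is passed through untouched
        }
        String[] tokens = line.split(Pattern.quote(delimiter), -1);
        return Arrays.stream(tokens)
                .map(corrector)
                .collect(Collectors.joining(delimiter));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real spell checker: a fixed lookup table.
        Map<String, String> fixes = Map.of("teh", "the", "quikc", "quick");
        UnaryOperator<String> corrector = t -> fixes.getOrDefault(t, t);

        String line = "teh quikc fox";
        String[] tags = {"DT", "JJ", "NN"}; // parallel annotation, one tag per token

        String corrected = correctLine(line, " ", corrector);
        System.out.println(corrected); // the quick fox
        System.out.println(corrected.split(" ", -1).length == tags.length); // true
    }
}
```

Because the corrector maps one token to exactly one token, the i-th tag still describes the i-th token after correction.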

Correction always runs in per-token mode and reuses the normalizer's guards (minimum length, skip numbers/URLs, never change a word the dictionary already contains) and its casing preservation.
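As a rough sketch of what such guards might look like for a single token, consider the following. Everything here is an assumption made for illustration: the threshold `MIN_LENGTH`, the tiny `DICTIONARY` and `SUGGESTIONS` tables, the digit and `://` checks, and the casing rules are not taken from the normalizer, whose real behavior may differ.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class GuardedCorrectionSketch {

    static final int MIN_LENGTH = 4; // hypothetical minimum-length guard
    static final Set<String> DICTIONARY = Set.of("the", "quick", "hello");
    static final Map<String, String> SUGGESTIONS = Map.of("helo", "hello", "quikc", "quick");

    static String correctToken(String token) {
        if (token.length() < MIN_LENGTH) return token;                  // too short to correct safely
        if (token.chars().anyMatch(Character::isDigit)) return token;   // number-like token
        if (token.contains("://")) return token;                        // URL-like token
        String lower = token.toLowerCase(Locale.ROOT);
        if (DICTIONARY.contains(lower)) return token;                   // known word, never changed
        String suggestion = SUGGESTIONS.get(lower);
        if (suggestion == null) return token;                           // nothing better to offer

        // Carry the original casing over to the suggestion.
        if (token.equals(token.toUpperCase(Locale.ROOT))) {
            return suggestion.toUpperCase(Locale.ROOT);
        }
        if (Character.isUpperCase(token.charAt(0))) {
            return Character.toUpperCase(suggestion.charAt(0)) + suggestion.substring(1);
        }
        return suggestion;
    }

    public static void main(String[] args) {
        System.out.println(correctToken("Helo"));        // Hello
        System.out.println(correctToken("teh"));         // teh (below the length guard)
        System.out.println(correctToken("http://helo")); // http://helo (URL guard)
    }
}
```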

null (end of stream) is forwarded unchanged; FilterObjectStream.reset() and FilterObjectStream.close() delegate to the wrapped stream.

  • Field Details

    • DEFAULT_DELIMITER

      public static final String DEFAULT_DELIMITER
      The default delimiter used to split and re-join tokens (a single space).
  • Constructor Details

    • SpellCorrectingTokenStream

      public SpellCorrectingTokenStream(opennlp.tools.util.ObjectStream<String> samples, SpellChecker spellChecker)
      Wraps samples with a default, space-delimited corrector backed by a SpellChecker.
      Parameters:
      samples - the source token-line stream; must not be null
      spellChecker - the engine used to correct tokens; must not be null
    • SpellCorrectingTokenStream

      public SpellCorrectingTokenStream(opennlp.tools.util.ObjectStream<String> samples, SymSpellModel model)
      Wraps samples with a default, space-delimited corrector backed by a loaded SymSpellModel.
      Parameters:
      samples - the source token-line stream; must not be null
      model - the loaded model whose engine is used; must not be null
    • SpellCorrectingTokenStream

      public SpellCorrectingTokenStream(opennlp.tools.util.ObjectStream<String> samples, SpellCheckingCharSequenceNormalizer normalizer, String delimiter)
      Wraps samples with an explicitly configured corrector and delimiter.

      The corrector is forced into per-token mode regardless of how it was built, so the token count is always preserved.

      Parameters:
      samples - the source token-line stream; must not be null
      normalizer - the corrector whose guards/config are reused; must not be null
      delimiter - the literal token delimiter to split and re-join on; must not be null or empty
      Throws:
      NullPointerException - if normalizer or delimiter is null
      IllegalArgumentException - if delimiter is empty
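The delimiter contract stated above (null rejected with a NullPointerException, empty rejected with an IllegalArgumentException, and the value treated as a literal rather than a pattern) can be sketched without the library. `DelimiterCheckSketch`, `requireDelimiter`, and `splitLiteral` are hypothetical helper names, not part of the documented API.

```java
import java.util.Objects;
import java.util.regex.Pattern;

public class DelimiterCheckSketch {

    // Mirrors the documented contract: null -> NullPointerException, empty -> IllegalArgumentException.
    static String requireDelimiter(String delimiter) {
        Objects.requireNonNull(delimiter, "delimiter must not be null");
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        return delimiter;
    }

    // Treats the delimiter as literal text, so regex metacharacters such as '|' are safe.
    static String[] splitLiteral(String line, String delimiter) {
        return line.split(Pattern.quote(requireDelimiter(delimiter)), -1);
    }

    public static void main(String[] args) {
        System.out.println(splitLiteral("a|b|c", "|").length); // 3
        try {
            splitLiteral("a b", "");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Quoting matters for delimiters such as `|` or `.`: passed unquoted to `String.split`, they would be interpreted as regular-expression syntax rather than as literal text.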
  • Method Details