Class SpellCheckingCharSequenceNormalizer
- All Implemented Interfaces:
Serializable, opennlp.tools.util.normalizer.CharSequenceNormalizer
CharSequenceNormalizer that corrects spelling in text using a
SpellChecker (typically a SymSpell engine).
The normalizer works in one of two modes:
- PER_TOKEN (default) – the input is split into whitespace-delimited tokens and each token is corrected independently with SpellChecker.lookup(java.lang.String, opennlp.spellcheck.Verbosity, int). The original whitespace runs between tokens are preserved verbatim, so the shape of the line is kept. Tokens the dictionary already contains (best suggestion at edit distance 0) are left untouched.
- COMPOUND – the whole input is passed to SpellChecker.lookupCompound(java.lang.String, int), which additionally repairs wrongly inserted or omitted spaces (word splits and merges). This collapses runs of whitespace to single spaces, as the compound corrector re-tokenizes the input.
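The whitespace-preserving behaviour of PER_TOKEN mode can be illustrated with a small standalone sketch. It does not use the OpenNLP classes: the class name PerTokenSketch, the method correctPerToken and the toy corrector are hypothetical stand-ins, and the real implementation may split and reassemble the input differently.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerTokenSketch {

  // Group 1 is a run of whitespace, group 2 a run of non-whitespace (a token).
  private static final Pattern RUNS = Pattern.compile("(\\s+)|(\\S+)");

  /**
   * Applies the corrector to every token and copies every whitespace run
   * through unchanged, so the spacing of the line is kept.
   */
  public static String correctPerToken(CharSequence text, UnaryOperator<String> corrector) {
    StringBuilder out = new StringBuilder(text.length());
    Matcher m = RUNS.matcher(text);
    while (m.find()) {
      if (m.group(1) != null) {
        out.append(m.group(1));                  // whitespace: verbatim
      } else {
        out.append(corrector.apply(m.group(2))); // token: corrected independently
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Hypothetical stand-in for a dictionary lookup: fixes exactly one misspelling.
    UnaryOperator<String> toyCorrector = t -> t.equals("teh") ? "the" : t;
    // Only "teh" changes; the double space and the tab are copied through as-is.
    System.out.println(correctPerToken("teh  quick\tfox", toyCorrector));
  }
}
```

A COMPOUND-style corrector would not give this guarantee, since it re-tokenizes the whole input and emits single spaces between words.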
Several guards keep the corrector from "fixing" tokens that should be left as
they are (configurable through the SpellCheckingCharSequenceNormalizer.Builder):
- tokens shorter than minTokenLength are skipped;
- numeric tokens are skipped (skipNumbers, on by default);
- URL- and email-like tokens are skipped (skipUrls, on by default);
- a token whose lower-cased form is already in the dictionary is never changed (the engine returns it at edit distance 0).
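As a rough illustration of the first three guards, the sketch below decides whether a token should bypass correction. Everything in it is an assumption made for the example: the class GuardSketch, the method shouldSkip and both regular expressions are hypothetical, and the rules the library actually uses to recognise numbers, URLs and email addresses may well differ.

```java
import java.util.regex.Pattern;

public class GuardSketch {

  // Assumed pattern: optional sign, digits, optional '.' or ',' separated digit groups.
  private static final Pattern NUMERIC = Pattern.compile("[+-]?\\d+(?:[.,]\\d+)*");

  // Assumed pattern: http(s) URLs, www-prefixed hosts, or something@something.tld.
  private static final Pattern URL_OR_EMAIL = Pattern.compile(
      "(?i)(?:https?://\\S+|www\\.\\S+|[^\\s@]+@[^\\s@]+\\.[^\\s@]+)");

  /** Returns true when the token should be emitted unchanged. */
  public static boolean shouldSkip(String token, int minTokenLength,
                                   boolean skipNumbers, boolean skipUrls) {
    if (token.length() < minTokenLength) {
      return true; // too short to correct reliably
    }
    if (skipNumbers && NUMERIC.matcher(token).matches()) {
      return true; // numeric token
    }
    if (skipUrls && URL_OR_EMAIL.matcher(token).matches()) {
      return true; // URL- or email-like token
    }
    return false;
  }

  public static void main(String[] args) {
    int minLen = 3; // hypothetical value, not the library's DEFAULT_MIN_TOKEN_LENGTH
    for (String t : new String[] {"ab", "3.14", "user@example.com", "speling"}) {
      System.out.println(t + " -> skip=" + shouldSkip(t, minLen, true, true));
    }
  }
}
```

The fourth guard needs no separate check in such a sketch, because a dictionary word comes back from the engine at edit distance 0 and is therefore left as it is.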
Casing. Dictionaries are normally lower-cased, so lookups are performed on the lower-cased token, and the original casing pattern is re-applied to the correction: an all-upper token yields an all-upper correction, a leading-capital token yields a leading-capital correction, otherwise the suggestion's own casing is used. When no correction applies, the original token (including its casing and any surrounding punctuation) is emitted unchanged.
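The casing rule can be sketched as a small helper that takes the original token and the lower-cased suggestion. The names CasingSketch and reapplyCasing are hypothetical, and edge cases such as single-letter tokens or punctuation handling are simplified here; this is not the library's own code.

```java
import java.util.Locale;

public class CasingSketch {

  /** Re-applies the casing pattern of the original token to the suggestion. */
  public static String reapplyCasing(String original, String suggestion) {
    if (original.isEmpty() || suggestion.isEmpty()) {
      return suggestion;
    }
    boolean hasLetter = false;
    boolean allUpper = true;
    for (int i = 0; i < original.length(); i++) {
      char c = original.charAt(i);
      if (Character.isLetter(c)) {
        hasLetter = true;
        if (!Character.isUpperCase(c)) {
          allUpper = false;
        }
      }
    }
    if (hasLetter && allUpper) {
      // All-upper token: all-upper correction.
      return suggestion.toUpperCase(Locale.ROOT);
    }
    if (Character.isUpperCase(original.charAt(0))) {
      // Leading-capital token: leading-capital correction.
      return suggestion.substring(0, 1).toUpperCase(Locale.ROOT) + suggestion.substring(1);
    }
    // Anything else: keep the suggestion's own casing.
    return suggestion;
  }

  public static void main(String[] args) {
    System.out.println(reapplyCasing("SPELING", "spelling")); // SPELLING
    System.out.println(reapplyCasing("Speling", "spelling")); // Spelling
    System.out.println(reapplyCasing("speling", "spelling")); // spelling
  }
}
```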
This normalizer composes cleanly inside an
AggregateCharSequenceNormalizer; place it after noise-removing normalizers
(URL, emoji, shrink) so it sees clean tokens.
Serialization. CharSequenceNormalizer is Serializable,
but the backing SpellChecker usually is not; it is therefore held in a
transient field and is null after Java deserialization. A deserialized
instance is inert until a checker is re-attached: obtain a working copy with the same
settings via withSpellChecker(SpellChecker) (this matches how the engine is
rebuilt from a model rather than Java-serialized). Calling normalize(java.lang.CharSequence) on an
instance with no checker throws IllegalStateException.
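The transient-field pattern described above can be reproduced with plain Java serialization. The sketch below uses hypothetical stand-ins (Checker, Holder, withChecker) rather than the OpenNLP classes, and only mirrors the described behaviour: settings survive a round trip, the engine does not, and a copy with a fresh engine restores a working instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Objects;

public class TransientCheckerSketch {

  /** Hypothetical stand-in for a non-serializable spell-checking engine. */
  public interface Checker {
    String correct(String token);
  }

  /** Hypothetical holder: one serializable setting plus a transient engine. */
  public static final class Holder implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int minTokenLength;        // setting: survives serialization
    private final transient Checker checker; // engine: null after deserialization

    public Holder(Checker checker, int minTokenLength) {
      this.checker = Objects.requireNonNull(checker);
      this.minTokenLength = minTokenLength;
    }

    /** Returns a copy with the same settings, backed by the given engine. */
    public Holder withChecker(Checker newChecker) {
      return new Holder(newChecker, minTokenLength);
    }

    public String normalize(String token) {
      if (checker == null) {
        throw new IllegalStateException("no checker attached");
      }
      return token.length() < minTokenLength ? token : checker.correct(token);
    }
  }

  /** Serializes and deserializes the holder in memory. */
  public static Holder roundTrip(Holder holder) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(holder);
      }
      try (ObjectInputStream in =
               new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (Holder) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new RuntimeException("in-memory round trip failed", e);
    }
  }

  public static void main(String[] args) {
    Checker toy = t -> t.equals("teh") ? "the" : t;
    Holder restored = roundTrip(new Holder(toy, 2));
    try {
      restored.normalize("teh");
    } catch (IllegalStateException e) {
      System.out.println("inert after deserialization: " + e.getMessage());
    }
    // Re-attaching an engine yields a working copy with the original settings.
    System.out.println(restored.withChecker(toy).normalize("teh"));
  }
}
```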
-
Nested Class Summary
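Nested Classes
Modifier and Type | Class | Description
static final class | SpellCheckingCharSequenceNormalizer.Builder | A mutable builder for SpellCheckingCharSequenceNormalizer.
static enum | SpellCheckingCharSequenceNormalizer.Mode | The correction mode.
-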
Field Summary
Fields
Modifier and Type | Field | Description
static final int | DEFAULT_MIN_TOKEN_LENGTH | The default minimum token length below which tokens are left untouched.
-
Constructor Summary
Constructors
Constructor | Description
SpellCheckingCharSequenceNormalizer(SymSpellModel model) | Creates a normalizer in SpellCheckingCharSequenceNormalizer.Mode.PER_TOKEN mode with default guards from a loaded SymSpellModel (uses the model's engine).
SpellCheckingCharSequenceNormalizer(SpellChecker spellChecker) | Creates a normalizer in SpellCheckingCharSequenceNormalizer.Mode.PER_TOKEN mode with default guards from a SpellChecker.
-
Method Summary
Method | Description
builder(SymSpellModel model) |
builder(SpellChecker spellChecker) |
normalize(CharSequence text) |
withSpellChecker(SpellChecker checker) | Returns a copy of this normalizer carrying the same settings but backed by the given checker.
-
Field Details
-
DEFAULT_MIN_TOKEN_LENGTH
public static final int DEFAULT_MIN_TOKEN_LENGTH
The default minimum token length below which tokens are left untouched.
-
-
Constructor Details
-
SpellCheckingCharSequenceNormalizer
Creates a normalizer in SpellCheckingCharSequenceNormalizer.Mode.PER_TOKEN mode with default guards from a SpellChecker.
- Parameters:
spellChecker - the engine used to correct tokens; must not be null
-
SpellCheckingCharSequenceNormalizer
Creates a normalizer in SpellCheckingCharSequenceNormalizer.Mode.PER_TOKEN mode with default guards from a loaded SymSpellModel (uses the model's engine).
- Parameters:
model - the loaded model whose engine is used; must not be null
-
-
Method Details
-
builder
- Parameters:
spellChecker - the engine to wrap; must not be null
- Returns:
a new SpellCheckingCharSequenceNormalizer.Builder seeded with sensible defaults
-
builder
- Parameters:
model - the loaded model whose engine to wrap; must not be null
- Returns:
a new SpellCheckingCharSequenceNormalizer.Builder seeded with sensible defaults
-
withSpellChecker
Returns a copy of this normalizer carrying the same settings but backed by the given checker. This is the supported way to re-attach an engine to an instance restored by Java deserialization (whose transient checker is null).
- Parameters:
checker - the engine to attach; must not be null
- Returns:
a new, ready-to-use normalizer with this instance's settings
-
normalize
- Specified by:
normalize in interface opennlp.tools.util.normalizer.CharSequenceNormalizer
-