Class FrequencyDictionaryLoader
Loads frequency dictionaries into a SymSpell engine.
Two text formats are supported, both consumed line-by-line through an
ObjectStream (a PlainTextByLineStream over a caller-supplied
InputStreamFactory):
- unigram dictionary – word<sep>count per line, fed to SymSpell.add(String, long);
- bigram dictionary (optional) – w1<sep>w2<sep>count per line, fed to SymSpell.addBigram(String, String, long).
Columns are separated by whitespace – a TAB or one or more spaces – so
the canonical space-delimited SymSpell reference dictionaries (e.g.
frequency_dictionary_en_82_765.txt) load as-is, as do TAB-delimited files.
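The column rule above can be illustrated with a minimal sketch. It assumes that splitting on runs of whitespace is an adequate model of the separator handling; the class and method names (DictionaryLineSketch, columns, unigramCount) are hypothetical and are not part of this API.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of FrequencyDictionaryLoader. */
final class DictionaryLineSketch {

    /** Splits a line into columns on runs of whitespace (a TAB or one or more spaces). */
    static String[] columns(String line) {
        return line.trim().split("\\s+");
    }

    /** Reads the count column of a unigram line shaped like word<sep>count. */
    static long unigramCount(String line) {
        String[] cols = columns(line);
        if (cols.length < 2) {
            // Assumption: the real loader reports this case through its own exception type.
            throw new IllegalArgumentException("expected word and count: " + line);
        }
        return Long.parseLong(cols[1]);
    }

    public static void main(String[] args) {
        // A space-delimited unigram line and a TAB-delimited bigram line.
        System.out.println(Arrays.toString(columns("the 23135851162")));
        System.out.println(Arrays.toString(columns("in\tthe\t134")));
    }
}
```

Because the same split handles both delimiters, a space-delimited file and a TAB-delimited file yield identical columns.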
The loader is encoding-aware (UTF-8 by default) and tolerant of input noise: a
leading UTF-8 byte-order mark is stripped; blank lines, lines that are entirely
whitespace, and lines starting with # (comments) are skipped. A line that does
not match the expected shape (too few columns, unparsable count) is reported through
MalformedDictionaryLineException.
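The skipping rules can be sketched as a small filter over already-decoded text. This is an illustration under stated assumptions, not the loader's actual code: NoiseFilterSketch and entryLines are hypothetical names, the byte-order mark is assumed to appear as U+FEFF after decoding, and the comment check is assumed to apply to the first character of the raw line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of FrequencyDictionaryLoader. */
final class NoiseFilterSketch {

    /** Returns the lines that would be treated as dictionary entries. */
    static List<String> entryLines(String text) {
        // A UTF-8 byte-order mark decodes to U+FEFF at the very start of the text.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        List<String> entries = new ArrayList<>();
        for (String line : text.split("\\R")) {
            // Skip blank lines, whitespace-only lines, and '#' comment lines.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            entries.add(line);
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = "\uFEFF# sample dictionary\nthe 10\n\n   \nof 5\n";
        System.out.println(entryLines(text));
    }
}
```

Only the surviving lines would go on to column parsing, which is consistent with the load methods counting entries after blank and comment lines are skipped.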
This class performs only parsing and dispatch; it never mutates the engine's
configuration. Build the SymSpell with the desired SymSpellConfig
first, then load one or more dictionaries into it.
Field Summary
Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final Charset | DEFAULT_CHARSET | The default character set used when none is supplied. |
Constructor Summary
Constructors

| Constructor | Description |
| --- | --- |
| FrequencyDictionaryLoader() | Creates a loader using the default UTF-8 charset. |
| FrequencyDictionaryLoader(Charset charset) | Creates a loader using the supplied charset. |
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| long | loadBigrams(SymSpell target, opennlp.tools.util.InputStreamFactory factory) | Loads a bigram frequency dictionary (w1<sep>w2<sep>count) into target. |
| long | loadUnigrams(SymSpell target, opennlp.tools.util.InputStreamFactory factory) | Loads a unigram frequency dictionary (word<sep>count) into target. |
Field Details
DEFAULT_CHARSET
The default character set used when none is supplied.
Constructor Details
FrequencyDictionaryLoader
public FrequencyDictionaryLoader()
Creates a loader using the default UTF-8 charset.
FrequencyDictionaryLoader
Creates a loader using the supplied charset.
Parameters:
charset - the character set used to decode the dictionary text; must not be null
Method Details
loadUnigrams
public long loadUnigrams(SymSpell target, opennlp.tools.util.InputStreamFactory factory) throws IOException
Loads a unigram frequency dictionary (word<sep>count) into target.
Parameters:
target - the engine to populate; must not be null
factory - the source of the dictionary text; must not be null
Returns:
the number of dictionary entries that were read (after skipping blank and comment lines)
Throws:
IOException - thrown on I/O errors or on a malformed line.
loadBigrams
public long loadBigrams(SymSpell target, opennlp.tools.util.InputStreamFactory factory) throws IOException
Loads a bigram frequency dictionary (w1<sep>w2<sep>count) into target.
Parameters:
target - the engine to populate; must not be null
factory - the source of the dictionary text; must not be null
Returns:
the number of bigram entries that were read (after skipping blank and comment lines)
Throws:
IOException - thrown on I/O errors or on a malformed line.