Class SymSpellModelSerializer
- All Implemented Interfaces:
opennlp.tools.util.model.ArtifactSerializer<SymSpellModel>
ArtifactSerializer for SymSpellModel.
What is serialized, and why
The serializer writes the model's source dictionary (unigram and bigram
counts), its configuration, and its metadata — not
the derived delete index. On load the engine is rebuilt
by replaying the source through SymSpell.add(java.lang.String, long) /
SymSpell.addBigram(java.lang.String, java.lang.String, long).
Rationale (per OPENNLP-1832 guidance):
- Size. The delete index of a real dictionary is roughly an order of magnitude larger than the source word list (each term expands to many delete keys), so persisting the source yields a far smaller artifact to ship and load.
- Forward compatibility. The index layout, prefixLength and maxDictionaryEditDistance can change between releases without invalidating already-published artifacts; the index is a pure function of the source and config and is regenerated on load.
- API surface. The engine intentionally exposes only build hooks and no index getters, so serializing the index would require widening internal state.
Index rebuild cost is linear in the dictionary and small compared to artifact IO, which is why this trade-off is preferred over persisting the index.
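To make the size argument concrete, here is a rough sketch of how a single term fans out into delete keys. The class and method names (DeleteKeySketch, deleteKeys) are hypothetical, and the real key generation inside SymSpell may differ in detail; the sketch only assumes the general symmetric-delete idea, driven by the prefixLength and maxDictionaryEditDistance settings named above.

```java
import java.util.HashSet;
import java.util.Set;

public class DeleteKeySketch {

    /**
     * Collects the term's prefix plus every string reachable from it by
     * deleting up to maxDistance characters (hypothetical helper, not the
     * engine's actual code).
     */
    static Set<String> deleteKeys(String term, int maxDistance, int prefixLength) {
        String prefix = term.length() > prefixLength ? term.substring(0, prefixLength) : term;
        Set<String> keys = new HashSet<>();
        keys.add(prefix);
        expand(prefix, 0, maxDistance, keys);
        return keys;
    }

    private static void expand(String s, int depth, int maxDistance, Set<String> keys) {
        if (depth >= maxDistance || s.isEmpty()) {
            return;
        }
        for (int i = 0; i < s.length(); i++) {
            // Remove the character at position i.
            String shorter = s.substring(0, i) + s.substring(i + 1);
            // A string's depth is fixed by its length, so a key that was
            // already seen has already been expanded (or needs no expansion).
            if (keys.add(shorter)) {
                expand(shorter, depth + 1, maxDistance, keys);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> keys = deleteKeys("example", 2, 7);
        System.out.println("one source term -> " + keys.size() + " index keys");
    }
}
```

One source entry (a word and a count) therefore corresponds to many index entries, which is the expansion the size rationale refers to.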
Binary layout (big-endian, DataOutputStream)
int magic = 0x53594D53 ("SYMS")
int formatVersion = 1
UTF language
UTF name
UTF version
int maxDictionaryEditDistance
int prefixLength
long countThreshold
UTF editDistanceId ("damerau-osa" | "levenshtein")
long corpusWordCount (0 = derive N from the dictionary; see SymSpellConfig)
int unigramCount
repeat unigramCount times: UTF word, vlong count
int bigramCount
repeat bigramCount times: UTF w1, UTF w2, vlong count
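As an illustration of the layout above, the following sketch writes a header with empty unigram and bigram sections and reads part of it back. It is not the serializer's own code: HeaderLayoutSketch and its methods are hypothetical, and the field values ("en", "demo-dictionary", 2, 7, and so on) are made up for the example. Only the field order, the magic number and the format version are taken from the layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class HeaderLayoutSketch {

    static final int MAGIC = 0x53594D53; // ASCII "SYMS"
    static final int FORMAT_VERSION = 1;

    /** Writes the documented fields in order, using illustrative values. */
    static byte[] writeHeader() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(MAGIC);
            out.writeInt(FORMAT_VERSION);
            out.writeUTF("en");              // language
            out.writeUTF("demo-dictionary"); // name
            out.writeUTF("1.0");             // version
            out.writeInt(2);                 // maxDictionaryEditDistance
            out.writeInt(7);                 // prefixLength
            out.writeLong(1L);               // countThreshold
            out.writeUTF("damerau-osa");     // editDistanceId
            out.writeLong(0L);               // corpusWordCount (0 = derive from dictionary)
            out.writeInt(0);                 // unigramCount, no entries follow
            out.writeInt(0);                 // bigramCount, no entries follow
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Validates magic and version, skips the config fields, returns editDistanceId. */
    static String readEditDistanceId(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            if (in.readInt() != MAGIC) {
                throw new IOException("not a SYMS artifact");
            }
            int version = in.readInt();
            if (version != FORMAT_VERSION) {
                throw new IOException("unsupported format version " + version);
            }
            in.readUTF();  // language
            in.readUTF();  // name
            in.readUTF();  // version
            in.readInt();  // maxDictionaryEditDistance
            in.readInt();  // prefixLength
            in.readLong(); // countThreshold
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = writeHeader();
        System.out.println(new String(data, 0, 4, java.nio.charset.StandardCharsets.US_ASCII));
        System.out.println(readEditDistanceId(data));
    }
}
```

Checking the magic number and format version before touching any other field lets a reader reject foreign or newer files early, which is the usual reason for placing them first.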
Counts use an unsigned variable-length encoding (writeVLong(java.io.DataOutputStream, long)) to keep the
common (small) counts compact while still representing the full long range.
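The exact byte format produced by writeVLong is not spelled out here. As a sketch of what an unsigned variable-length encoding can look like, the code below uses one common scheme: seven payload bits per byte, low-order group first, with the high bit marking "more bytes follow". That scheme, and the names VLongSketch, encode and decode, are assumptions for illustration and may not match the serializer's actual format.

```java
import java.io.ByteArrayOutputStream;

public class VLongSketch {

    /** Encodes value as an unsigned varint: 7 bits per byte, low group first (assumed scheme). */
    static byte[] encode(long value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // While more than 7 significant bits remain, emit a continuation byte.
        while ((value & ~0x7FL) != 0) {
            bytes.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7; // unsigned shift, so the full 64-bit range terminates
        }
        bytes.write((int) value);
        return bytes.toByteArray();
    }

    /** Decodes a value produced by encode. */
    static long decode(byte[] data) {
        long result = 0;
        int shift = 0;
        for (byte b : data) {
            if (shift > 63) {
                throw new IllegalArgumentException("vlong longer than 10 bytes");
            }
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated vlong");
    }

    public static void main(String[] args) {
        for (long count : new long[] {5L, 300L, Long.MAX_VALUE}) {
            System.out.println(count + " -> " + encode(count).length + " byte(s)");
        }
    }
}
```

Under this scheme, counts below 128 take a single byte instead of the eight that a fixed-width long would need, while large counts still round-trip.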
Charset. The UTF fields use DataOutputStream.writeUTF(java.lang.String) /
DataInputStream.readUTF(), i.e. Java modified UTF-8 (each string prefixed by
an unsigned 16-bit byte length, so a single encoded term may not exceed 65,535 bytes). This
is internally consistent for round-tripping but is not interchangeable with the
standard UTF-8 used by the plain-text FrequencyDictionaryLoader.
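The difference between the two encodings is easy to observe with the JDK alone. The sketch below (ModifiedUtf8Sketch is a hypothetical name) compares the payload length that DataOutputStream.writeUTF produces with the length of the same string in standard UTF-8. Plain ASCII words encode identically, but U+0000 and characters outside the Basic Multilingual Plane do not.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Sketch {

    /** Payload bytes written by writeUTF, excluding the 2-byte length prefix. */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size() - 2;
    }

    /** Bytes needed for the same string in standard UTF-8. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "word",          // ASCII: same bytes in both encodings
            "\u0000",        // NUL: two bytes in modified UTF-8, one in standard
            "\uD83D\uDE00"   // U+1F600: surrogate pair, six bytes vs. four
        };
        for (String s : samples) {
            System.out.println(modifiedUtf8Length(s) + " vs " + standardUtf8Length(s));
        }
    }
}
```

Bytes written by writeUTF should therefore be read back with readUTF rather than decoded as standard UTF-8.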
The serializer has a public no-argument constructor so it can be referenced from
SymSpellModel.getArtifactSerializerClass() and registered with OpenNLP model
containers.
Field Summary
Fields
Modifier and Type | Field | Description
static final String | EDIT_DISTANCE_DAMERAU_OSA | Stable identifier for DamerauOSADistance.
static final String | EDIT_DISTANCE_LEVENSHTEIN | Stable identifier for LevenshteinDistance.
Constructor Summary
Constructors
Constructor | Description
SymSpellModelSerializer() | Public no-arg constructor required by the ArtifactSerializer contract.
Method Summary
Modifier and Type | Method | Description
SymSpellModel | create(InputStream in) |
void | serialize(SymSpellModel model, OutputStream out) |
Field Details
-
EDIT_DISTANCE_DAMERAU_OSA
Stable identifier for DamerauOSADistance.
-
EDIT_DISTANCE_LEVENSHTEIN
Stable identifier for LevenshteinDistance.
Constructor Details
-
SymSpellModelSerializer
public SymSpellModelSerializer()
Public no-arg constructor required by the ArtifactSerializer contract.
Method Details
-
create
- Specified by:
create in interface opennlp.tools.util.model.ArtifactSerializer<SymSpellModel>
- Throws:
IOException
-
serialize
- Specified by:
serialize in interface opennlp.tools.util.model.ArtifactSerializer<SymSpellModel>
- Throws:
IOException