Class SymSpellModelSerializer

java.lang.Object
opennlp.spellcheck.dictionary.SymSpellModelSerializer
All Implemented Interfaces:
opennlp.tools.util.model.ArtifactSerializer<SymSpellModel>

public final class SymSpellModelSerializer extends Object implements opennlp.tools.util.model.ArtifactSerializer<SymSpellModel>
Binary ArtifactSerializer for SymSpellModel.

What is serialized, and why

The serializer writes the model's source dictionary (unigram and bigram counts), its configuration, and its metadata, but not the derived delete index. On load, the engine is rebuilt by replaying the source through SymSpell.add(java.lang.String, long) and SymSpell.addBigram(java.lang.String, java.lang.String, long).

Rationale (per OPENNLP-1832 guidance):

  • Size. The delete index of a real dictionary is roughly an order of magnitude larger than the source word list (each term expands to many delete keys), so persisting the source yields a far smaller artifact to ship and load.
  • Forward compatibility. The index layout, prefixLength and maxDictionaryEditDistance can change between releases without invalidating already-published artifacts; the index is a pure function of the source and config and is regenerated on load.
  • API surface. The engine intentionally exposes only build hooks and no index getters, so serializing the index would require widening internal state.

Index rebuild cost is linear in the dictionary and small compared to artifact IO, which is why this trade-off is preferred over persisting the index.
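To illustrate why the delete index is so much larger than the source, the following is a minimal sketch of delete-key expansion for a single term. It is not the engine's implementation: the class DeleteKeySketch and its methods are hypothetical names, and the sketch ignores prefixLength and every other SymSpellConfig setting.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; not part of the OpenNLP API. */
public class DeleteKeySketch {

  /** Returns every string obtainable from term by deleting 1..maxDistance characters. */
  static Set<String> deleteKeys(String term, int maxDistance) {
    Set<String> keys = new HashSet<>();
    collect(term, maxDistance, keys);
    return keys;
  }

  private static void collect(String word, int remaining, Set<String> keys) {
    // Stop when the edit budget is used up or the word cannot shrink further.
    if (remaining == 0 || word.length() <= 1) {
      return;
    }
    for (int i = 0; i < word.length(); i++) {
      String shorter = word.substring(0, i) + word.substring(i + 1);
      // Recurse only on newly seen keys; a key of a given length is always
      // reached with the same remaining budget, so nothing is missed.
      if (keys.add(shorter)) {
        collect(shorter, remaining - 1, keys);
      }
    }
  }

  public static void main(String[] args) {
    // A 4-letter word with distinct letters yields 4 + 6 = 10 keys at distance 2.
    System.out.println(deleteKeys("word", 2).size());
  }
}
```

Because the key set depends only on the term and the configuration, it can be regenerated from the persisted source at load time instead of being stored.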

Binary layout (big-endian, DataOutputStream)

   int    magic            = 0x53594D53 ("SYMS")
   int    formatVersion    = 1
   UTF    language
   UTF    name
   UTF    version
   int    maxDictionaryEditDistance
   int    prefixLength
   long   countThreshold
   UTF    editDistanceId   ("damerau-osa" | "levenshtein")
   long   corpusWordCount  (0 = derive N from the dictionary; see SymSpellConfig)
   int    unigramCount
   repeat unigramCount times: UTF word, vlong count
   int    bigramCount
   repeat bigramCount  times: UTF w1, UTF w2, vlong count
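
As a rough sketch of how the fixed-size header above maps onto DataOutputStream and DataInputStream calls, the example below writes and re-reads the fields up to corpusWordCount. The class SymsHeaderSketch, its methods, and all field values are placeholders chosen for illustration; the real serializer's internals may differ, and the unigram and bigram sections are omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical illustration of the documented header; not the actual serializer. */
public class SymsHeaderSketch {

  static final int MAGIC = 0x53594D53; // ASCII "SYMS"
  static final int FORMAT_VERSION = 1;

  /** Writes the header fields in the documented order, using placeholder values. */
  static byte[] writeHeader() throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeInt(MAGIC);
      out.writeInt(FORMAT_VERSION);
      out.writeUTF("en");                 // language (placeholder)
      out.writeUTF("example-dictionary"); // name (placeholder)
      out.writeUTF("1.0");                // version (placeholder)
      out.writeInt(2);                    // maxDictionaryEditDistance (placeholder)
      out.writeInt(7);                    // prefixLength (placeholder)
      out.writeLong(1L);                  // countThreshold (placeholder)
      out.writeUTF("damerau-osa");        // editDistanceId
      out.writeLong(0L);                  // corpusWordCount (0 = derive from dictionary)
    }
    return buf.toByteArray();
  }

  /** Validates magic and version, skips the intermediate fields, returns editDistanceId. */
  static String readEditDistanceId(byte[] bytes) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      if (in.readInt() != MAGIC) {
        throw new IOException("Not a SYMS artifact");
      }
      int version = in.readInt();
      if (version != FORMAT_VERSION) {
        throw new IOException("Unsupported format version: " + version);
      }
      in.readUTF(); // language
      in.readUTF(); // name
      in.readUTF(); // version
      in.readInt(); // maxDictionaryEditDistance
      in.readInt(); // prefixLength
      in.readLong(); // countThreshold
      return in.readUTF();
    }
  }
}
```

Checking the magic number and format version before reading anything else lets a reader reject foreign or newer files early.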
 

Counts use an unsigned variable-length encoding (writeVLong(java.io.DataOutputStream, long)) to keep the common (small) counts compact while still representing the full long range.
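
The exact byte layout of writeVLong is not spelled out here. The sketch below assumes a common scheme (7 payload bits per byte, low-order group first, high bit set when more bytes follow) to show how small counts stay compact while the full 64-bit range remains representable. The class VLongSketch and its helper methods are hypothetical, and the real encoding may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical variable-length long codec; the real writeVLong may differ. */
public class VLongSketch {

  /** Emits 7 bits per byte, lowest group first; the value is treated as unsigned. */
  static void writeVLong(DataOutputStream out, long value) throws IOException {
    while ((value & ~0x7FL) != 0) {
      out.writeByte((int) ((value & 0x7F) | 0x80)); // continuation bit set
      value >>>= 7;                                 // unsigned shift
    }
    out.writeByte((int) value);                     // final byte, high bit clear
  }

  /** Reads up to 10 bytes; rejects input that never clears the continuation bit. */
  static long readVLong(DataInputStream in) throws IOException {
    long result = 0;
    for (int shift = 0; shift < 64; shift += 7) {
      int b = in.readUnsignedByte();
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
    }
    throw new IOException("Malformed vlong: more than 10 bytes");
  }

  static byte[] encode(long value) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      writeVLong(out, value);
    }
    return buf.toByteArray();
  }

  static long decode(byte[] bytes) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      return readVLong(in);
    }
  }
}
```

Under this assumed scheme a count below 128 takes one byte, while a value with all 64 bits set takes ten, compared with a fixed eight bytes for writeLong.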

Charset. The UTF fields use DataOutputStream.writeUTF(java.lang.String) / DataInputStream.readUTF(), i.e. Java modified UTF-8. Each string is prefixed by an unsigned 16-bit byte length, so a single encoded term may not exceed 65,535 bytes; writeUTF throws UTFDataFormatException for longer strings. This is internally consistent for round-tripping but is not interchangeable with the standard UTF-8 used by the plain-text FrequencyDictionaryLoader: the two encodings differ for U+0000 and for supplementary characters.
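
The difference between the two encodings can be observed with standard JDK calls alone. In the sketch below, the class ModifiedUtf8Sketch and its methods are hypothetical helpers; only DataOutputStream.writeUTF and String.getBytes are real APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper comparing modified UTF-8 with standard UTF-8 byte lengths. */
public class ModifiedUtf8Sketch {

  /** Payload size produced by writeUTF, excluding its 2-byte length prefix. */
  static int modifiedUtf8Length(String s) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeUTF(s);
    }
    return buf.size() - 2;
  }

  /** Size of the same string in standard UTF-8. */
  static int standardUtf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) throws IOException {
    String nul = String.valueOf((char) 0); // U+0000
    String emoji = "\uD83D\uDE00";         // U+1F600 as a surrogate pair
    // U+0000: 2 bytes in modified UTF-8, 1 byte in standard UTF-8.
    System.out.println(modifiedUtf8Length(nul) + " vs " + standardUtf8Length(nul));
    // Supplementary character: 6 bytes (two 3-byte surrogates) vs 4 bytes.
    System.out.println(modifiedUtf8Length(emoji) + " vs " + standardUtf8Length(emoji));
  }
}
```

For terms made only of non-NUL characters in the Basic Multilingual Plane, both encodings produce identical bytes, so the mismatch affects only unusual input.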

The serializer has a public no-argument constructor so it can be referenced from SymSpellModel.getArtifactSerializerClass() and registered with OpenNLP model containers.