Class DocumentCategorizerDL

java.lang.Object
opennlp.dl.AbstractDL
opennlp.dl.doccat.DocumentCategorizerDL
All Implemented Interfaces:
AutoCloseable, opennlp.tools.doccat.DocumentCategorizer

@ThreadSafe public class DocumentCategorizerDL extends AbstractDL implements opennlp.tools.doccat.DocumentCategorizer
An implementation of DocumentCategorizer that performs document classification using ONNX models.

Before wordpiece tokenization, the input goes through BERT basic tokenization (text normalization); see BertTokenizer. By default, input text is lowercased and stripped of accents, matching the uncased models commonly used for classification. For cased models, call InferenceOptions.setLowerCase(boolean) with false.

This class is thread-safe and may be shared across threads, provided the supplied ClassificationScoringStrategy is thread-safe (the built-in AverageClassificationScoringStrategy is stateless). Three properties make this possible: inference holds no per-call instance state; the relevant InferenceOptions values are snapshotted into final fields at construction, so mutating the passed options afterwards does not affect a shared instance; and the underlying OrtSession supports concurrent execution. The guarantee holds until AbstractDL.close() is called; callers must not race close() with inference methods.
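
The snapshot-at-construction idea described above can be illustrated with a minimal, self-contained sketch. The classes Options and Normalizer below are hypothetical stand-ins written for this example; they are not the OpenNLP InferenceOptions or DocumentCategorizerDL classes, and the real implementation may differ.

```java
import java.util.Locale;

// Hypothetical sketch: none of these types belong to OpenNLP.
final class SnapshotSketch {

    // Mutable options object, loosely analogous to an options holder.
    static final class Options {
        private boolean lowerCase = true;
        boolean isLowerCase() { return lowerCase; }
        void setLowerCase(boolean lowerCase) { this.lowerCase = lowerCase; }
    }

    // Reads the options once and keeps the value in a final field.
    static final class Normalizer {
        private final boolean lowerCase;

        Normalizer(Options options) {
            this.lowerCase = options.isLowerCase(); // snapshot taken here
        }

        // Touches only the final field and local variables, so concurrent
        // calls on a shared instance do not interfere with each other.
        String normalize(String text) {
            return lowerCase ? text.toLowerCase(Locale.ROOT) : text;
        }
    }

    public static void main(String[] args) {
        Options options = new Options();
        Normalizer shared = new Normalizer(options);
        options.setLowerCase(false); // later mutation does not reach 'shared'
        System.out.println(shared.normalize("Hello World")); // prints "hello world"
    }
}
```

Because the field is final and assigned in the constructor, every thread that obtains the instance after construction sees the same value, regardless of what happens to the options object later.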

  • Constructor Details

    • DocumentCategorizerDL

      public DocumentCategorizerDL(File model, File vocabulary, Map<Integer,String> categories, ClassificationScoringStrategy classificationScoringStrategy, InferenceOptions inferenceOptions) throws IOException, ai.onnxruntime.OrtException
      Instantiates a document categorizer using ONNX models.
      Parameters:
      model - The ONNX model file.
      vocabulary - The vocabulary file for the model.
      categories - The categories, as a map from category index to category name.
      classificationScoringStrategy - Implementation of ClassificationScoringStrategy used to calculate the classification scores given the score of each individual document part.
      inferenceOptions - InferenceOptions to control the inference.
      Throws:
      ai.onnxruntime.OrtException - Thrown if the model cannot be loaded.
      IOException - Thrown if an error occurs while loading the model or vocabulary.
    • DocumentCategorizerDL

      public DocumentCategorizerDL(File model, File vocabulary, File config, ClassificationScoringStrategy classificationScoringStrategy, InferenceOptions inferenceOptions) throws IOException, ai.onnxruntime.OrtException
      Instantiates a document categorizer using ONNX models.
      Parameters:
      model - The ONNX model file.
      vocabulary - The vocabulary file for the model.
      config - The model's config file. The file will be used to determine the classification categories.
      classificationScoringStrategy - Implementation of ClassificationScoringStrategy used to calculate the classification scores given the score of each individual document part.
      inferenceOptions - InferenceOptions to control the inference.
      Throws:
      ai.onnxruntime.OrtException - Thrown if the model cannot be loaded.
      IOException - Thrown if an error occurs while loading the model or vocabulary.
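
Both constructors take a ClassificationScoringStrategy that combines the scores of the individual document parts into one result. As a rough illustration of what an averaging strategy might compute, the sketch below takes the element-wise mean of per-part score arrays. The class and method names are hypothetical and the code is not the library's AverageClassificationScoringStrategy; it only assumes that each part yields one score per category.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an averaging strategy; not OpenNLP code.
final class AverageScoringSketch {

    // Returns the element-wise mean of the per-part score arrays.
    static double[] average(List<double[]> partScores) {
        if (partScores == null || partScores.isEmpty()) {
            throw new IllegalArgumentException("At least one part score is required");
        }
        int categories = partScores.get(0).length;
        double[] mean = new double[categories];
        for (double[] part : partScores) {
            if (part.length != categories) {
                throw new IllegalArgumentException("All parts must score the same categories");
            }
            for (int i = 0; i < categories; i++) {
                mean[i] += part[i];
            }
        }
        for (int i = 0; i < categories; i++) {
            mean[i] /= partScores.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        // Two document parts, two categories each.
        double[] result = average(List.of(new double[] {1.0, 0.0}, new double[] {0.5, 0.5}));
        System.out.println(Arrays.toString(result));
    }
}
```

The method keeps all working state in local variables, which is one way a strategy can satisfy the thread-safety requirement placed on it by this class.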
  • Method Details

    • categorize

      public double[] categorize(String[] strings)
      Categorizes the document, failing loudly rather than returning an invalid distribution: malformed input is rejected with IllegalArgumentException, and any failure executing the model is surfaced as an IllegalStateException (cause preserved).
      Specified by:
      categorize in interface opennlp.tools.doccat.DocumentCategorizer
      Parameters:
      strings - The document to categorize; strings[0] is classified.
      Returns:
      The per-category probabilities.
      Throws:
      IllegalArgumentException - If strings is null or empty.
      IllegalStateException - If inference fails or the model returns an unexpected output.
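
The error contract described for categorize can be sketched in isolation. In the example below, the model is replaced by a plain Function so the code runs without ONNX Runtime; the class name, the Function parameter, and the exception messages are all assumptions made for illustration and do not reflect the actual implementation.

```java
import java.util.function.Function;

// Hypothetical sketch of a fail-loud contract; not OpenNLP code.
final class FailLoudSketch {

    static double[] categorize(String[] strings, Function<String, double[]> model) {
        // Malformed input is rejected before any inference is attempted.
        if (strings == null || strings.length == 0) {
            throw new IllegalArgumentException("strings must be non-null and non-empty");
        }
        double[] scores;
        try {
            scores = model.apply(strings[0]); // only the first element is classified
        } catch (RuntimeException e) {
            // Surface the failure with the original cause attached.
            throw new IllegalStateException("Inference failed", e);
        }
        if (scores == null || scores.length == 0) {
            throw new IllegalStateException("Model returned an unexpected output");
        }
        return scores;
    }

    public static void main(String[] args) {
        try {
            categorize(new String[] {"some text"}, text -> {
                throw new RuntimeException("simulated runtime failure");
            });
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + ", cause: " + e.getCause().getMessage());
        }
    }
}
```

A caller can therefore treat IllegalArgumentException as a bug in its own input handling and IllegalStateException as a runtime or model problem, inspecting getCause() for details.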
    • categorize

      public double[] categorize(String[] strings, Map<String,Object> map)
      Specified by:
      categorize in interface opennlp.tools.doccat.DocumentCategorizer
    • getBestCategory

      public String getBestCategory(double[] doubles)
      Specified by:
      getBestCategory in interface opennlp.tools.doccat.DocumentCategorizer
    • getIndex

      public int getIndex(String s)
      Specified by:
      getIndex in interface opennlp.tools.doccat.DocumentCategorizer
    • getCategory

      public String getCategory(int i)
      Specified by:
      getCategory in interface opennlp.tools.doccat.DocumentCategorizer
    • getNumberOfCategories

      public int getNumberOfCategories()
      Specified by:
      getNumberOfCategories in interface opennlp.tools.doccat.DocumentCategorizer
    • getAllResults

      public String getAllResults(double[] doubles)
      Specified by:
      getAllResults in interface opennlp.tools.doccat.DocumentCategorizer
    • scoreMap

      public Map<String,Double> scoreMap(String[] strings)
      Specified by:
      scoreMap in interface opennlp.tools.doccat.DocumentCategorizer
    • sortedScoreMap

      public SortedMap<Double,Set<String>> sortedScoreMap(String[] strings)
      Specified by:
      sortedScoreMap in interface opennlp.tools.doccat.DocumentCategorizer
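
The return type of sortedScoreMap suggests that categories are grouped by score, with tied categories sharing one entry and keys iterated in ascending order. The sketch below shows how a map of that shape could be built from a score array and matching category names. It is an assumption-based illustration with hypothetical names, not the code used by this class.

```java
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of the SortedMap<Double, Set<String>> shape; not OpenNLP code.
final class ScoreMapSketch {

    // Groups category names under their score; TreeMap keeps keys ascending.
    static SortedMap<Double, Set<String>> sortedScoreMap(double[] scores, String[] categories) {
        if (scores.length != categories.length) {
            throw new IllegalArgumentException("One score per category is required");
        }
        SortedMap<Double, Set<String>> sorted = new TreeMap<>();
        for (int i = 0; i < scores.length; i++) {
            sorted.computeIfAbsent(scores[i], key -> new TreeSet<>()).add(categories[i]);
        }
        return sorted;
    }

    public static void main(String[] args) {
        SortedMap<Double, Set<String>> result = sortedScoreMap(
                new double[] {0.25, 0.5, 0.25},
                new String[] {"negative", "positive", "neutral"});
        // lastKey() is the highest score; its set holds the best categories.
        System.out.println(result.lastKey() + " -> " + result.get(result.lastKey()));
    }
}
```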