Class DocumentCategorizerSVM

java.lang.Object
opennlp.tools.ml.libsvm.doccat.DocumentCategorizerSVM
All Implemented Interfaces:
opennlp.tools.doccat.DocumentCategorizer

public class DocumentCategorizerSVM extends Object implements opennlp.tools.doccat.DocumentCategorizer
An implementation of DocumentCategorizer that uses Support Vector Machines (SVM) via the zlibsvm library for document classification.

The following aspects of this categorizer are configurable:

  • Term weighting (binary, TF, TF-IDF, log-normalized TF)
  • Feature selection (information gain, chi-square, term frequency, document frequency)
  • Feature scaling to a configurable range (e.g., [0, 1])
  • SVM classifier parameters (kernel, cost, gamma, etc.) via SvmConfiguration
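The four term-weighting schemes listed above can be illustrated with a minimal, JDK-only sketch. The class name `TermWeightingSketch` and its methods are hypothetical, and the formulas are the common textbook definitions; the library's own implementation may use different variants (for example, a smoothed IDF).

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative term-weighting formulas; NOT the library's implementation. */
final class TermWeightingSketch {

    /** Raw term frequency (TF): how often each token occurs in one document. */
    static Map<String, Integer> termCounts(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Binary weighting: 1 if the term occurs at all, otherwise 0. */
    static double binary(int count) {
        return count > 0 ? 1.0 : 0.0;
    }

    /** Log-normalized TF: 1 + ln(count) for count > 0, otherwise 0. */
    static double logNormalizedTf(int count) {
        return count > 0 ? 1.0 + Math.log(count) : 0.0;
    }

    /** Inverse document frequency: ln(N / df); 0 if no document contains the term. */
    static double idf(int totalDocs, int docsContainingTerm) {
        if (docsContainingTerm <= 0) {
            return 0.0;
        }
        return Math.log((double) totalDocs / docsContainingTerm);
    }

    /** TF-IDF: raw term frequency multiplied by inverse document frequency. */
    static double tfIdf(int count, int totalDocs, int docsContainingTerm) {
        return count * idf(totalDocs, docsContainingTerm);
    }
}
```

Under these definitions, a term that appears in every document gets a TF-IDF weight of zero, which is why TF-IDF tends to suppress uninformative, very common terms.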
  • Constructor Details

    • DocumentCategorizerSVM

      public DocumentCategorizerSVM(SvmDoccatModel model, opennlp.tools.doccat.FeatureGenerator... featureGenerators)
      Instantiates a DocumentCategorizerSVM with a trained model and feature generators.
      Parameters:
      model - The trained SvmDoccatModel. Must not be null.
      featureGenerators - The FeatureGenerator instances used to extract features. Must not be null or empty.
  • Method Details

    • categorize

      public double[] categorize(String[] text, Map<String,Object> extraInformation)
      Specified by:
      categorize in interface opennlp.tools.doccat.DocumentCategorizer
    • categorize

      public double[] categorize(String[] text)
      Specified by:
      categorize in interface opennlp.tools.doccat.DocumentCategorizer
    • getBestCategory

      public String getBestCategory(double[] outcome)
      Specified by:
      getBestCategory in interface opennlp.tools.doccat.DocumentCategorizer
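This page does not describe how the best category is chosen. A plausible reading, assuming the `double[]` returned by `categorize` is index-aligned with the categories exposed through `getIndex` and `getCategory`, is a plain arg-max over the scores. The sketch below uses only the JDK; `BestCategorySketch` and its methods are hypothetical names, and the tie-breaking rule (lowest index wins) is an assumption, not documented behavior.

```java
/** Illustrative arg-max over category scores; NOT the library's implementation. */
final class BestCategorySketch {

    /** Returns the index of the largest score; ties resolve to the lowest index. */
    static int argMax(double[] scores) {
        if (scores == null || scores.length == 0) {
            throw new IllegalArgumentException("scores must not be null or empty");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Maps the winning index back to its category name. */
    static String bestCategory(String[] categories, double[] scores) {
        if (categories.length != scores.length) {
            throw new IllegalArgumentException("categories and scores must have equal length");
        }
        return categories[argMax(scores)];
    }
}
```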
    • getIndex

      public int getIndex(String category)
      Specified by:
      getIndex in interface opennlp.tools.doccat.DocumentCategorizer
    • getCategory

      public String getCategory(int index)
      Specified by:
      getCategory in interface opennlp.tools.doccat.DocumentCategorizer
    • getNumberOfCategories

      public int getNumberOfCategories()
      Specified by:
      getNumberOfCategories in interface opennlp.tools.doccat.DocumentCategorizer
    • getAllResults

      public String getAllResults(double[] results)
      Specified by:
      getAllResults in interface opennlp.tools.doccat.DocumentCategorizer
    • scoreMap

      public Map<String,Double> scoreMap(String[] text)
      Specified by:
      scoreMap in interface opennlp.tools.doccat.DocumentCategorizer
    • sortedScoreMap

      public SortedMap<Double,Set<String>> sortedScoreMap(String[] text)
      Specified by:
      sortedScoreMap in interface opennlp.tools.doccat.DocumentCategorizer
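The return type `SortedMap<Double,Set<String>>` implies that categories are grouped by score, so that several categories with an identical score share one key. A minimal JDK-only sketch of building such a map follows; `SortedScoreMapSketch` is a hypothetical name, and the use of `TreeMap` and `TreeSet` is an assumption made for illustration rather than a statement about the library's internals.

```java
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative grouping of categories by score; NOT the library's implementation. */
final class SortedScoreMapSketch {

    /** Groups category names under their score, with keys in ascending order. */
    static SortedMap<Double, Set<String>> group(String[] categories, double[] scores) {
        if (categories.length != scores.length) {
            throw new IllegalArgumentException("categories and scores must have equal length");
        }
        SortedMap<Double, Set<String>> result = new TreeMap<>();
        for (int i = 0; i < scores.length; i++) {
            // Categories that share a score end up in the same set.
            result.computeIfAbsent(scores[i], key -> new TreeSet<>()).add(categories[i]);
        }
        return result;
    }
}
```

With natural ordering of the keys, `lastKey()` yields the highest score, so `result.get(result.lastKey())` returns the top-scoring categories.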
    • train

      public static SvmDoccatModel train(String lang, opennlp.tools.util.ObjectStream<opennlp.tools.doccat.DocumentSample> samples, opennlp.tools.doccat.FeatureGenerator... featureGenerators) throws IOException
      Trains an SVM-based document categorization model using default configuration (TF-IDF weighting, no feature selection, scaling to [0, 1]).
      Parameters:
      lang - The ISO-conformant language code.
      samples - The ObjectStream of DocumentSample used as input for training.
      featureGenerators - The FeatureGenerator instances used to extract features.
      Returns:
      A trained SvmDoccatModel.
      Throws:
      IOException - Thrown if an I/O error occurs during training.
    • train

      public static SvmDoccatModel train(String lang, opennlp.tools.util.ObjectStream<opennlp.tools.doccat.DocumentSample> samples, SvmDoccatConfiguration config, opennlp.tools.doccat.FeatureGenerator... featureGenerators) throws IOException
      Trains an SVM-based document categorization model with a custom configuration.
      Parameters:
      lang - The ISO-conformant language code.
      samples - The ObjectStream of DocumentSample used as input for training.
      config - The SvmDoccatConfiguration controlling term weighting, feature selection, scaling, and SVM parameters.
      featureGenerators - The FeatureGenerator instances used to extract features.
      Returns:
      A trained SvmDoccatModel.
      Throws:
      IOException - Thrown if an I/O error occurs during training.
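Both train variants refer to scaling features to a range such as [0, 1]. The usual technique for this is min-max scaling, applied per feature across the training documents. The sketch below shows that technique with the JDK only; `MinMaxScalingSketch` is a hypothetical name, and mapping a constant feature to the lower bound is an assumption, since this page does not say how the library handles that case.

```java
/** Illustrative min-max scaling of one feature; NOT the library's implementation. */
final class MinMaxScalingSketch {

    /**
     * Linearly rescales the values of a single feature (one entry per document)
     * so that the smallest maps to lower and the largest maps to upper.
     */
    static double[] scale(double[] values, double lower, double upper) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double value : values) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant feature carries no information; map it to the lower bound.
            scaled[i] = range == 0.0
                    ? lower
                    : lower + (values[i] - min) * (upper - lower) / range;
        }
        return scaled;
    }
}
```

When scaling is used, the minimum and maximum observed during training have to be reused at prediction time, so that new documents are mapped into the same range as the training data.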