Package opennlp.tools.ml.libsvm.doccat
Enum Class FeatureSelectionStrategy
- All Implemented Interfaces:
Serializable, Comparable<FeatureSelectionStrategy>, Constable
Defines strategies for selecting the most informative features for
SVM-based text classification.
Feature selection reduces the dimensionality of the feature space by retaining only the features that are most useful for distinguishing between categories.
-
Nested Class Summary
Nested classes/interfaces inherited from class java.lang.Enum
Enum.EnumDesc<E extends Enum<E>>
-
Enum Constant Summary
Enum Constants
Enum Constant and Description
CHI_SQUARE
Chi-Square based feature selection: features are ranked by the maximum chi-square statistic across all categories, and only the top-k features are retained.
DOCUMENT_FREQUENCY
Document Frequency based feature selection: features are ranked by the number of documents they appear in, and only the top-k features are retained.
INFORMATION_GAIN
Information Gain based feature selection: features are ranked by their information gain score, and only the top-k features are retained.
NONE
No feature selection: all features from the vocabulary are used.
TERM_FREQUENCY
Term Frequency based feature selection: features are ranked by their total occurrence count across all documents in the corpus, and only the top-k features are retained.
-
Method Summary
Modifier and Type, Method, and Description
static FeatureSelectionStrategy valueOf(String name)
Returns the enum constant of this class with the specified name.
static FeatureSelectionStrategy[] values()
Returns an array containing the constants of this enum class, in the order they are declared.
Methods inherited from class java.lang.Enum
compareTo, describeConstable, equals, getDeclaringClass, hashCode, name, ordinal, toString, valueOf
-
Enum Constant Details
-
NONE
No feature selection: all features from the vocabulary are used.
-
INFORMATION_GAIN
Information Gain based feature selection: features are ranked by their information gain score, and only the top-k features are retained.
Information gain measures the reduction in entropy of the class variable achieved by observing the presence or absence of a feature.
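The entropy reduction described above can be illustrated with a minimal sketch. It assumes the textbook definition IG = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)] computed from per-category document counts; the class InformationGainSketch and its methods are hypothetical and are not part of OpenNLP, and this page does not state how the library computes the score internally.

```java
/** Hypothetical helper (not part of OpenNLP): information gain of one feature. */
final class InformationGainSketch {

    /** Shannon entropy, in bits, of a distribution given as raw counts. */
    static double entropy(long... counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0; // an empty distribution carries no uncertainty
        }
        double h = 0.0;
        for (long c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * @param withFeature    withFeature[i] = documents of category i that contain the feature
     * @param withoutFeature withoutFeature[i] = documents of category i that lack the feature
     *                       (assumed to have the same length as withFeature)
     */
    static double informationGain(long[] withFeature, long[] withoutFeature) {
        long[] classTotals = new long[withFeature.length];
        long present = 0;
        long absent = 0;
        for (int i = 0; i < withFeature.length; i++) {
            classTotals[i] = withFeature[i] + withoutFeature[i];
            present += withFeature[i];
            absent += withoutFeature[i];
        }
        long n = present + absent;
        if (n == 0) {
            return 0.0;
        }
        // Entropy of the class variable after observing presence/absence of the feature.
        double conditional = (double) present / n * entropy(withFeature)
                + (double) absent / n * entropy(withoutFeature);
        return entropy(classTotals) - conditional;
    }
}
```

With two equally sized categories, a feature that occurs in every document of one category and in none of the other removes all uncertainty about the class (a gain of one bit), while a feature spread evenly over both categories yields a gain of zero.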
-
CHI_SQUARE
Chi-Square based feature selection: features are ranked by the maximum chi-square statistic across all categories, and only the top-k features are retained.
Chi-square measures the statistical dependence between a feature and a class label. A high chi-square value indicates that the feature and the class are not independent.
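As an illustration of the statistic described above, the following sketch computes chi-square from a 2x2 contingency table of document counts and takes the maximum across categories. It assumes the standard formula N(ad - cb)^2 / ((a+c)(b+d)(a+b)(c+d)); the class ChiSquareSketch and its methods are hypothetical, not part of OpenNLP, and the library's internal computation may differ.

```java
/** Hypothetical helper (not part of OpenNLP): chi-square for a feature/category pair. */
final class ChiSquareSketch {

    /**
     * @param a documents in the category that contain the feature
     * @param b documents outside the category that contain the feature
     * @param c documents in the category that lack the feature
     * @param d documents outside the category that lack the feature
     */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denominator = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denominator == 0.0) {
            return 0.0; // a degenerate table gives no evidence of dependence
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denominator;
    }

    /**
     * Ranking score for one feature: the maximum chi-square over all categories.
     * Each row of perCategoryCounts holds {a, b, c, d} for one category.
     */
    static double maxChiSquare(long[][] perCategoryCounts) {
        double max = 0.0;
        for (long[] t : perCategoryCounts) {
            max = Math.max(max, chiSquare(t[0], t[1], t[2], t[3]));
        }
        return max;
    }
}
```

A feature distributed identically inside and outside a category scores zero; the more its presence is concentrated in (or excluded from) a category, the higher the score.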
-
TERM_FREQUENCY
Term Frequency based feature selection: features are ranked by their total occurrence count across all documents in the corpus, and only the top-k features are retained.
This is a simple baseline strategy that favors frequent terms. It can be useful to filter out very rare features that may be noise.
-
DOCUMENT_FREQUENCY
Document Frequency based feature selection: features are ranked by the number of documents they appear in, and only the top-k features are retained.
Unlike TERM_FREQUENCY, this counts each feature at most once per document, regardless of how often it occurs within that document.
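The difference between the two frequency-based strategies can be sketched as follows. The class FrequencyCountSketch, its methods, and the alphabetical tie-breaking rule in topK are assumptions made for illustration; none of them is part of OpenNLP, and documents are modeled simply as lists of feature strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of OpenNLP): term vs. document frequency. */
final class FrequencyCountSketch {

    /** Counts every occurrence of a feature across all documents. */
    static Map<String, Integer> termFrequency(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String feature : doc) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Counts each feature at most once per document. */
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String feature : new HashSet<>(doc)) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps the k highest-scoring features; ties are broken alphabetically (an assumption). */
    static List<String> topK(Map<String, Integer> scores, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((x, y) -> {
            int byScore = Integer.compare(y.getValue(), x.getValue());
            return byScore != 0 ? byScore : x.getKey().compareTo(y.getKey());
        });
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            selected.add(entries.get(i).getKey());
        }
        return selected;
    }
}
```

For a corpus of two documents, [spam, spam, spam, offer] and [offer, meeting], term frequency ranks "spam" first (three occurrences), whereas document frequency ranks "offer" first (it appears in both documents, "spam" in only one).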
-
-
Method Details
-
values
Returns an array containing the constants of this enum class, in the order they are declared.
- Returns:
- an array containing the constants of this enum class, in the order they are declared
-
valueOf
Returns the enum constant of this class with the specified name. The string must match exactly an identifier used to declare an enum constant in this class. (Extraneous whitespace characters are not permitted.)
- Parameters:
name - the name of the enum constant to be returned.
- Returns:
- the enum constant with the specified name
- Throws:
IllegalArgumentException - if this enum class has no constant with the specified name
NullPointerException - if the argument is null
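Because valueOf requires an exact, case-sensitive match, callers that read a strategy name from configuration often normalize the input first. The sketch below shows one way to do that. To stay self-contained it declares a local stand-in enum with the constant names documented on this page (real code would import opennlp.tools.ml.libsvm.doccat.FeatureSelectionStrategy instead); the class StrategyLookupSketch, the parse method, and its fallback behavior are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper (not part of OpenNLP): lenient lookup of a strategy by name. */
final class StrategyLookupSketch {

    /** Local stand-in that mirrors the documented constant names. */
    enum FeatureSelectionStrategy {
        NONE, INFORMATION_GAIN, CHI_SQUARE, TERM_FREQUENCY, DOCUMENT_FREQUENCY
    }

    /**
     * Normalizes case, surrounding whitespace, and hyphens before calling valueOf,
     * and returns the fallback instead of propagating an exception.
     */
    static FeatureSelectionStrategy parse(String raw, FeatureSelectionStrategy fallback) {
        if (raw == null) {
            return fallback; // valueOf(null) would throw NullPointerException
        }
        String normalized = raw.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        try {
            return FeatureSelectionStrategy.valueOf(normalized);
        } catch (IllegalArgumentException e) {
            return fallback; // no constant with that name
        }
    }
}
```

For example, parse(" chi-square ", FeatureSelectionStrategy.NONE) resolves to CHI_SQUARE, whereas calling valueOf("chi_square") directly throws IllegalArgumentException because the lowercase name does not match any declared identifier.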
-