News for Package 'tm'

Changes in tm version 0.7-16

BUG FIXES

Improvements for Rd cross-references.

Changes in tm version 0.7-15

BUG FIXES

Improvements for Rd cross-references.

Changes in tm version 0.7-14

BUG FIXES

Use R_Calloc/R_Free instead of the long-deprecated Calloc/Free.

Changes in tm version 0.7-13

BUG FIXES

Improvements for Rd cross-references.

Changes in tm version 0.7-12

BUG FIXES

Add missing S3 method registration.

Changes in tm version 0.7-11

BUG FIXES

Use the default C++ standard instead of C++11.

Changes in tm version 0.7-10

NEW FEATURES

All built-in pGetElem() methods now use tm_parLapply().

Changes in tm version 0.7-9

BUG FIXES

Compilation fixes.

Changes in tm version 0.7-8

BUG FIXES

Fix invalid counting in stemCompletion() with type "prevalent". Reported by Bernard Chang.
tm_index() now interprets all non-TRUE logical values returned by the filter function as FALSE. This fixes corner cases where filter functions return logical(0) or NA. Reported by Tom Nicholls.
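
The tm_index() fix above amounts to treating anything other than a single TRUE as FALSE. A minimal base R sketch of that rule follows; it is not the package's actual implementation, and the results list is a hypothetical stand-in for values returned by a filter function.

```r
# Hypothetical return values of a filter function, one per document.
results <- list(TRUE, FALSE, NA, logical(0))

# isTRUE() is TRUE only for a single, non-missing TRUE, so NA and
# logical(0) are mapped to FALSE instead of propagating or failing.
keep <- vapply(results, isTRUE, logical(1))
print(keep)
```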

Changes in tm version 0.7-6

NEW FEATURES

TermDocumentMatrix.SimpleCorpus() now also honors a logical removePunctuation control option (default: FALSE).

BUG FIXES

Sync encoding fixes in TermDocumentMatrix.SimpleCorpus() with Boost_tokenizer().

Changes in tm version 0.7-5

BUG FIXES

Handle NAs consistently in tokenizers.

Changes in tm version 0.7-4

BUG FIXES

Keep document names in tm_map.SimpleCorpus().
Fix encoding problems in scan_tokenizer() and Boost_tokenizer().

Changes in tm version 0.7-3

BUG FIXES

scan_tokenizer() now works with character vectors and character strings.
removePunctuation() now works again in latin1 locales.
Handle empty term-document matrices gracefully.

Changes in tm version 0.7-2

SIGNIFICANT USER-VISIBLE CHANGES

DataframeSource now only processes data frames with the two mandatory columns "doc_id" and "text". Additional columns are used as document level metadata. This implements compatibility with Text Interchange Formats corpora (https://github.com/ropenscilabs/tif).
readTabular() has been removed. Use DataframeSource instead.
removeNumbers() and removePunctuation() now have an argument ucp to check for Unicode general categories Nd (decimal digits) and P (punctuation), respectively. Contributed by Kurt Hornik.
The package xml2 is now imported for XML functionality instead of the (CRAN maintainer orphaned) package XML.
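
As an illustration of the DataframeSource layout and the ucp argument described above, a small sketch. It assumes tm 0.7-2 or later is installed; the sample texts and the extra column topic are made up for this example.

```r
library(tm)

# Columns 1 and 2 must be named doc_id and text; further columns
# (here the made-up column topic) become document level metadata.
docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("Text mining in R.", "Costs rose by 12% in 2017!"),
  topic = c("software", "finance"),
  stringsAsFactors = FALSE
)
corpus <- VCorpus(DataframeSource(docs))
meta(corpus)  # data frame holding the topic column

# ucp = TRUE matches the Unicode general categories Nd and P.
no_digits <- removeNumbers(docs$text[2], ucp = TRUE)
no_punct <- removePunctuation(docs$text[2], ucp = TRUE)
```

Data previously read with readTabular() would need to be reshaped into this two-column layout first.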

NEW FEATURES

Boost_tokenizer provides a tokenizer based on the Boost (https://www.boost.org) Tokenizer.

BUG FIXES

Correctly handle the dictionary argument when constructing a term-document matrix from a SimpleCorpus (reported by Joe Corrigan) or from a VCorpus (reported by Mark Rosenstein).

Changes in tm version 0.7-1

BUG FIXES

Compilation fixes for Clang's libc++.

Changes in tm version 0.7

SIGNIFICANT USER-VISIBLE CHANGES

inspect.TermDocumentMatrix() now displays a sample instead of the full matrix. The full dense representation is available via as.matrix().
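
A short sketch of the difference between the two calls, assuming tm 0.7 or later; the two toy documents are invented for illustration.

```r
library(tm)

corpus <- VCorpus(VectorSource(c("apples and pears", "pears and plums")))
tdm <- TermDocumentMatrix(corpus)

inspect(tdm)         # prints summary information and a sample of entries
m <- as.matrix(tdm)  # full dense matrix: terms in rows, documents in columns
m["pears", ]
```

For large corpora the dense matrix can be very big, so converting only a subset of the term-document matrix is usually the safer choice.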

NEW FEATURES

SimpleCorpus provides a corpus which is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. The aim is to boost performance and minimize memory pressure. It loads all documents into memory, and is designed for medium-sized to large data sets.
inspect() on text documents as a shorthand for writeLines(as.character()).
findMostFreqTerms() finds most frequent terms in a document-term or term-document matrix, or a vector of term frequencies.
tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).
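
Registering a parallelization engine might look like the sketch below. It assumes tm 0.7 or later and the parallel package; the two-worker cluster and the toy input are arbitrary choices for illustration.

```r
library(tm)

# Register a two-worker cluster from package parallel as the engine.
cl <- parallel::makeCluster(2)
tm_parLapply_engine(cl)

# tm_parLapply() dispatches to the registered engine.
n_chars <- tm_parLapply(c("a", "bb", "ccc"), nchar)

# Restore the default (no parallelization) and release the workers.
tm_parLapply_engine(NULL)
parallel::stopCluster(cl)
```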

Changes in tm version 0.6-2

BUG FIXES

format.PlainTextDocument() now reports only one character count for a whole document.

Changes in tm version 0.6-1

SIGNIFICANT USER-VISIBLE CHANGES

format.PlainTextDocument() now displays a compact representation instead of the content. Use as.character() to obtain the character content (which in turn can be applied to a corpus via lapply()).

NEW FEATURES

ZipSource() for processing ZIP files.
Sources now provide open() and close().
termFreq() now accepts Span_Tokenizer and Token_Tokenizer (both from package NLP) objects as tokenizers.
readTagged(), a reader for text documents containing POS-tagged words.

BUG FIXES

The function removeWords() now correctly processes words being truncations of others. Reported by Александр Труфанов.

Changes in tm version 0.6

SIGNIFICANT USER-VISIBLE CHANGES

DirSource() and URISource() now use the argument encoding for conversion via iconv() to "UTF-8".
termFreq() now uses words() as the default tokenizer.
Text documents now provide the functions content() and as.character() to access the (possibly raw) document content and the natural language text in a suitable (not necessarily structured) form.
The internal representation of corpora, sources, and text documents changed. Saved objects created with older tm versions are incompatible and need to be rebuilt.

NEW FEATURES

DirSource() and URISource() now have a mode argument specifying how elements should be read (no read, binary, text).
Improved high-level documentation on corpora (?Corpus), text documents (?TextDocument), sources (?Source), and readers (?Reader).
Integration with package NLP.
Romanian stopwords. Suggested by Cristian Chirita.
words.PlainTextDocument() delivers word tokens in the document.

BUG FIXES

The function stemCompletion() now avoids spurious duplicate results. Reported by Seong-Hyeon Kim.

DEPRECATED & DEFUNCT

The following functions have been removed:
- Author(), DateTimeStamp(), CMetaData(), content_meta(), DMetaData(), Description(), Heading(), ID(), Language(), LocalMetaData(), Origin(), prescindMeta(), sFilter() (use meta() instead).
- dissimilarity() (use proxy::dist() instead).
- makeChunks() (use [ and [[ manually).
- summary.Corpus() and summary.TextRepository() (print() now gives a more informative but succinct overview).
- TextRepository() and RepoMetaData() (use e.g. a list to store multiple corpora instead).
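
To illustrate the meta() replacement for the removed accessors listed above, a minimal sketch assuming tm 0.6 or later; the document content and metadata values are invented.

```r
library(tm)

doc <- PlainTextDocument("Some text.", author = "Jane Doe", id = "doc-1")

meta(doc, "author")                # replaces Author()
meta(doc, "id")                    # replaces ID()
meta(doc, "heading") <- "A title"  # replaces Heading()
meta(doc, "heading")
```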

Changes in tm version 0.5-10

SIGNIFICANT USER-VISIBLE CHANGES

License changed to GPL-3 (from GPL-2 | GPL-3).
The following functions have been renamed:
- tm_tag_score() to tm_term_score().

DEPRECATED & DEFUNCT

The following functions have been removed:
- Dictionary() (use a character vector instead; use Terms() to extract terms from a document-term or term-document matrix),
- GmaneSource() (but still available via an example in XMLSource()),
- preprocessReut21578XML() (moved to package tm.corpus.Reuters21578),
- readGmane() (but still available via an example in readXML()),
- searchFullText() and tm_intersect() (use grep() instead).
The following S3 classes are no longer registered as S4 classes:
- VCorpus and PlainTextDocument.
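
To illustrate the grep() replacement for searchFullText() listed above, a minimal base R sketch. The texts vector is hypothetical; with a corpus, the document texts would first need to be extracted as character data.

```r
# Hypothetical document texts, keyed by document ID.
texts <- c(
  doc_1 = "Crude oil prices rose sharply.",
  doc_2 = "Wheat exports declined.",
  doc_3 = "OPEC discussed oil output."
)

# Indices of the documents whose text contains the search string.
hits <- grep("oil", texts, fixed = TRUE)
names(texts)[hits]
```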

Changes in tm version 0.5-9

SIGNIFICANT USER-VISIBLE CHANGES

Stemming functionality is now provided by the package SnowballC replacing packages Snowball and RWeka.
All stopword lists (besides Catalan and SMART) available via stopwords() now come from the Snowball stemmer project.
Transformations, filters, and term-document matrix construction now use mclapply (package parallel). Packages snow and Rmpi are no longer used.

DEPRECATED & DEFUNCT

The following functions have been removed:
- tm_startCluster() and tm_stopCluster().

Changes in tm version 0.5-8

SIGNIFICANT USER-VISIBLE CHANGES

The function termFreq() now processes the tolower and tokenize options first.

NEW FEATURES

Catalan stopwords. Requested by Xavier Fernández i Marín.

BUG FIXES

The function termFreq() now correctly accepts user-provided stopwords. Reported by Bettina Grün.
The function termFreq() now correctly handles the lower bound of the option wordLength. Reported by Steven C. Bagley.

Changes in tm version 0.5-7

SIGNIFICANT USER-VISIBLE CHANGES

The function termFreq() provides two new arguments for generalized bounds checking of term frequencies and word lengths. This replaces the arguments minDocFreq and minWordLength.
The function termFreq() is now sensitive to the order of control options.

NEW FEATURES

Weighting schemata for term-document matrices in SMART notation.
Local and global options for term-document matrix construction.
SMART stopword list was added.

Changes in tm version 0.5-5

NEW FEATURES

Access documents in a corpus by names (fallback to IDs if names are not set), i.e., allow a string for the corpus operator '[['.

BUG FIXES

The function findFreqTerms() now checks bounds on a global level (to comply with the manual page) instead of per document. Reported and fixed by Thomas Zapf-Schramm.

Changes in tm version 0.5-4

SIGNIFICANT USER-VISIBLE CHANGES

Use IETF language tags for language codes (instead of ISO 639-2).

NEW FEATURES

The function tm_tag_score() provides functionality to score documents based on the number of tags found. This is useful for sentiment analysis.
The weighting function for term frequency-inverse document frequency, weightTfIdf(), now has an option for term normalization.
Plotting functions to test for Zipf's and Heaps' law on a term-document matrix were added: Zipf_plot() and Heaps_plot(). Contributed by Kurt Hornik.

Changes in tm version 0.5-3

NEW FEATURES

The reader function readRCV1asPlain() was added and combines the functionality of readRCV1() and as.PlainTextDocument().
The function stemCompletion() has a set of new heuristics.

Changes in tm version 0.5-2

SIGNIFICANT USER-VISIBLE CHANGES

The function termFreq(), which is used for building a term-document matrix, now uses a whitespace-oriented tokenizer by default.

NEW FEATURES

A combine method for merging multiple term-document matrices was added (c.TermDocumentMatrix()).
The function termFreq() now has an option to remove punctuation characters.

DEPRECATED & DEFUNCT

The following functions have been removed:
- CSVSource() (use DataframeSource(read.csv(..., stringsAsFactors = FALSE)) instead), and
- TermDocMatrix() (use DocumentTermMatrix() instead).

BUG FIXES

removeWords() no longer skips words at the beginning or the end of a line. Reported by Mark Kimpel.

Changes in tm version 0.5-1

BUG FIXES

preprocessReut21578XML() no longer generates invalid file names.

Changes in tm version 0.5

SIGNIFICANT USER-VISIBLE CHANGES

All classes, functions, and generics are reimplemented using the S3 class system.
The following functions have been renamed:
- activateCluster() to tm_startCluster(),
- asPlain() to as.PlainTextDocument(),
- deactivateCluster() to tm_stopCluster(),
- tmFilter() to tm_filter(),
- tmIndex() to tm_index(),
- tmIntersect() to tm_intersect(), and
- tmMap() to tm_map().
Mail handling functionality is factored out to the tm.plugin.mail package.

DEPRECATED & DEFUNCT

The following functions have been removed:
- tmTolower() (use tolower() instead), and
- replacePatterns() (use gsub() instead).

Changes in tm version 0.4

SIGNIFICANT USER-VISIBLE CHANGES

The Corpus class is now virtual providing an abstract interface.
VCorpus, the default implementation of the abstract corpus interface (by subclassing), provides a corpus with volatile (= standard R object) semantics. It loads all documents into memory, and is designed for small to medium-sized data sets.
PCorpus, an implementation of the abstract corpus interface (by subclassing), provides a corpus with permanent storage semantics. The actual data is stored in an external database (file) object (as supported by the filehash package), with automatic (un-)loading into memory. It is designed for systems with small memory.
Language codes are now in ISO 639-2 (instead of ISO 639-1).
Reader functions no longer have a load argument for lazy loading.

NEW FEATURES

The reader function readReut21578XMLasPlain() was added and combines the functionality of readReut21578XML() and asPlain().

BUG FIXES

weightTfIdf() no longer applies a binary weighting to an input matrix in term frequency format (which happened only in 0.3-4).

Changes in tm version 0.3-4

SIGNIFICANT USER-VISIBLE CHANGES

.onLoad() no longer tries to start an MPI cluster (which often failed in misconfigured environments). Use activateCluster() and deactivateCluster() instead.
DocumentTermMatrix (the improved reimplementation of defunct TermDocMatrix) does not use the Matrix package anymore.

NEW FEATURES

The DirSource() constructor now accepts the two new (optional) arguments pattern and ignore.case. With pattern one can define a regular expression for selecting only matching files, and ignore.case specifies whether pattern-matching is case-sensitive.
The readNewsgroup() reader function can now be configured for custom date formats (via the DateFormat argument).
The readPDF() reader function can now be configured (via the PdfinfoOptions and PdftotextOptions arguments).
The readDOC() reader function can now be configured (via the AntiwordOptions argument).
Sources now can be vectorized. This allows faster corpus construction.
New XMLSource class for arbitrary XML files.
The new readTabular() reader function allows creating a custom reader configured via mappings from a tabular data structure.
The new readXML() reader function allows reading arbitrary XML files that are described by a specification.
The new tmReduce() transformation allows combining multiple maps into one transformation.

DEPRECATED & DEFUNCT

CSVSource is defunct (use DataframeSource instead).
weightLogical is defunct.
TermDocMatrix is defunct (use DocumentTermMatrix or TermDocumentMatrix instead).

Changes in tm version 0.3-3

NEW FEATURES

The abstract Source class gets a default implementation for the stepNext() method. It increments the position counter by one, a reasonable value for most sources. For special purposes custom methods can be created via overloading stepNext() of the subclass.
New URISource class for a single document identified by a Uniform Resource Identifier.
New DataframeSource for documents stored in a data frame. Each row is interpreted as a single document.

BUG FIXES

Fix off-by-one error in convertMboxEml() function. Reported by Angela Bohn.
Sort row indices in sparse term-document matrices. Kudos to Martin Mächler for his suggestions.
Sources and readers no longer evaluate calls in a non-standard way.

Changes in tm version 0.3-2

NEW FEATURES

Weighting functions now have an Acronym slot containing abbreviations of the weighting functions' names. This is useful when generating tables that indicate which weighting scheme was actually used in your experiments.
The functions tmFilter(), tmIndex(), tmMap() and TermDocMatrix() can now use an MPI cluster (via the snow and Rmpi packages) if available. Use (de)activateCluster() to manually override cluster usage settings. Special thanks to Stefan Theussl for his constructive comments.
The Source class receives a new Length slot. It contains the number of elements provided by the source. In the rare cases where the number cannot be determined in advance, it should be set to zero.