tokenizers: Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for
shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs,
characters, shingled characters, lines, Penn Treebank tokens, and
regular-expression matches, as well as functions for counting characters,
words, and sentences, and a function for splitting longer texts into
separate documents, each with the same number of words. The tokenizers
have a consistent interface, and the package is built on the 'stringi'
and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
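A minimal sketch of the consistent interface the description mentions: each `tokenize_*()` function accepts a character vector and returns a list of character vectors, one element per input document (assumes the package is installed from CRAN; the sample sentence is illustrative only):

```r
library(tokenizers)

text <- "The quick brown fox jumps over the lazy dog. It barked."

# Words: lowercased by default, punctuation stripped
tokenize_words(text)

# Sentences: one string per sentence
tokenize_sentences(text)

# Shingled n-grams: here, overlapping word bigrams
tokenize_ngrams(text, n = 2)

# Counting helper with the same vectorized interface
count_words(text)
```

Because every tokenizer returns the same list-of-character-vectors shape, downstream packages can swap one tokenizer for another without changing the surrounding code.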
Version: 0.3.0
Depends: R (≥ 3.1.3)
Imports: stringi (≥ 1.0.1), Rcpp (≥ 0.12.3), SnowballC (≥ 0.5.1)
LinkingTo: Rcpp
Suggests: covr, knitr, rmarkdown, stopwords (≥ 0.9.0), testthat
Published: 2022-12-22
DOI: 10.32614/CRAN.package.tokenizers
Author: Lincoln Mullen [aut, cre], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb]
Maintainer: Lincoln Mullen <lincoln at lincolnmullen.com>
BugReports: https://github.com/ropensci/tokenizers/issues
License: MIT + file LICENSE
URL: https://docs.ropensci.org/tokenizers/, https://github.com/ropensci/tokenizers
NeedsCompilation: yes
Materials: README, NEWS
In views: NaturalLanguageProcessing
Reverse dependencies:
Reverse imports: covfefe, deeplr, DeepPINCS, DramaAnalysis, pdfsearch, proustr, rslp, textrecipes, tidypmc, tidytext, ttgsea, wactor, WhatsR
Reverse suggests: edgarWebR, torchdatasets
Reverse enhances: quanteda
Linking: Please use the canonical form https://CRAN.R-project.org/package=tokenizers to link to this page.