arXiv is a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics. The aRxiv package provides an R interface to the arXiv API.
Note that the arXiv API does not require an API key.
You can install the aRxiv package via CRAN:
Or use remotes::install_github() to get the (possibly more recent) version at GitHub:
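Both installation routes can be sketched as follows. The GitHub repository path ropensci/aRxiv is an assumption here (the text above does not name it), and the remotes package must already be installed for the second call.

```r
# Released version from CRAN
install.packages("aRxiv")

# Development version from GitHub.
# The repository path "ropensci/aRxiv" is an assumption, not stated in the text.
# install.packages("remotes")  # uncomment if remotes is not yet installed
remotes::install_github("ropensci/aRxiv")
```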
Use arxiv_search() to search arXiv, arxiv_count() to get a simple count of manuscripts matching a query, and arxiv_open() to open the abstract pages for a set of results from arxiv_search().
We’ll get to the details in a moment. For now, let’s look at a few examples.
Suppose we wanted to identify all arXiv manuscripts with “Peter Hall” as an author. It is best to first get a count, so that we have a sense of how many records the search will return. (Peter Hall was “among the world’s most prolific and highly cited authors in both probability and statistics.”) We first use library() to load the aRxiv package and then arxiv_count() to get the count.
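A call along these lines would produce such a count. This is a sketch that assumes the aRxiv package is installed and the arXiv API is reachable; the variable name n_hall is only illustrative, and the number returned will change as new manuscripts are posted.

```r
library(aRxiv)

# au: restricts the search to the author field;
# the double quotes make "Peter Hall" a phrase search.
n_hall <- arxiv_count('au:"Peter Hall"')
n_hall
```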
## [1] 67
The au: prefix restricts the search to the author field; we use double quotes to search for a phrase.
To obtain the actual records matching the query, use arxiv_search().
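For instance, the search and a check of how many rows came back might look like this (a sketch assuming the package is installed and the API is reachable; rec is an illustrative name):

```r
library(aRxiv)

# Same query as for the count, but now retrieving the records themselves.
rec <- arxiv_search('au:"Peter Hall"')

# Number of records actually returned
nrow(rec)
```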
## [1] 10
The default is to grab no more than 10 records; this limit can be changed with the limit argument. But note that the arXiv API will not let you download more than 50,000 or so records, and even in that case it’s best to do so in batches; more on this below.
Also note that the result of arxiv_search() has an attribute "total_results" containing the total count of search results; this is the same as what arxiv_count() provides.
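The attribute can be read with base R's attr(). A minimal sketch, again assuming network access and using the illustrative name rec:

```r
library(aRxiv)

rec <- arxiv_search('au:"Peter Hall"')

# Total number of matches for the query, regardless of how many rows were returned
attr(rec, "total_results")
```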
## [1] 67
The following will get us all 67 records.
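One way to do this is to raise limit above the count; any value of at least 67 should work, and 100 is used here only as an example. The sketch assumes network access.

```r
library(aRxiv)

# limit caps the number of records returned; the default is 10.
rec <- arxiv_search('au:"Peter Hall"', limit = 100)
nrow(rec)
```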
## [1] 67
arxiv_search() returns a data frame with each row being a single manuscript. The columns are the different fields (e.g., authors, title, abstract, etc.).
Fields like authors that contain multiple items will be a single character string with the multiple items separated by a vertical bar (|).
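To work with the individual items, such a field can be split on the vertical bar. The sketch below uses two hand-typed author strings rather than a live search result; note that fixed = TRUE is needed because | is a regular-expression metacharacter.

```r
# Two author fields in the bar-separated form described above
authors <- c("Peter Hall|Alexander Meister",
             "Aurore Delaigle|Peter Hall|Alexander Meister")

# Split each string into a character vector of individual names
author_list <- strsplit(authors, "|", fixed = TRUE)

# Number of authors per manuscript
lengths(author_list)
```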
We might be interested in a more restrictive search, such as for Peter Hall’s arXiv manuscripts that have deconvolution in the title. We use ti: to search the title field, and combine the two with AND.
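A sketch of such a combined query, assuming network access (deconv is an illustrative name):

```r
library(aRxiv)

# Author phrase AND a title word
deconv <- arxiv_search('au:"Peter Hall" AND ti:deconvolution')
nrow(deconv)
```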
## [1] 4
Let’s display just the authors and title for the results.
## title
## 1 A ridge-parameter approach to deconvolution
## 2 On deconvolution with repeated measurements
## 3 Estimation of distributions, moments and quantiles in deconvolution\n problems
## 4 Kernel methods and minimum contrast estimators for empirical\n deconvolution
## authors
## 1 Peter Hall|Alexander Meister
## 2 Aurore Delaigle|Peter Hall|Alexander Meister
## 3 Peter Hall|Soumendra N. Lahiri
## 4 Aurore Delaigle|Peter Hall
We can open the abstract pages for these 4 manuscripts using arxiv_open(). It takes, as input, the output of arxiv_search().
The two basic arguments to arxiv_count() and arxiv_search() are query, a character string representing the search, and id_list, a list of arXiv manuscript identifiers.
If only query is provided, manuscripts matching that query are returned.
If only id_list is provided, manuscripts in the list are returned.
If both are provided, manuscripts in id_list that match query will be returned.
query may be a single character string or a vector of character strings. If it is a vector, the elements are pasted together with AND.
id_list may be a vector of character strings or a single comma-separated character string.
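The two input forms can be illustrated in plain R, without calling the API. The paste() calls below only mimic the combination described above (they are not the package's internal code), and the identifiers are placeholders chosen for illustration.

```r
# A query given as a vector of pieces ...
query_parts <- c('au:"Peter Hall"', "ti:deconvolution")

# ... is equivalent to the pieces joined with AND
combined <- paste(query_parts, collapse = " AND ")
combined

# Identifiers as a vector (placeholder values) ...
ids <- c("0710.3491", "0804.0713")

# ... or as a single comma-separated string
id_string <- paste(ids, collapse = ",")
id_string
```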
Generally, one would ignore id_list and focus on forming the query argument. The aRxiv package includes a dataset query_terms that lists the terms (like au) that you can use.
## term description
## 1 ti Title
## 2 au Author
## 3 abs Abstract
## 4 co Comment
## 5 jr Journal Reference
## 6 cat Subject Category
## 7 rn Report Number
## 8 all All of the above
## 9 submittedDate Date/time of initial submission, as YYYYMMDDHHMM
## 10 lastUpdatedDate Date/time of last update, as YYYYMMDDHHMM
Use a colon (:) to separate the query term from the actual query. Multiple queries can be combined with AND, OR, and ANDNOT. If no operator is given, the default is OR.
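Four counts like the following make the comparison concrete: if OR is the default, the first two should agree. This is a sketch assuming network access; the variable names are illustrative and the counts will differ from any shown in this document as arXiv grows.

```r
library(aRxiv)

n_default <- arxiv_count("Peter Hall")         # no operator given
n_or      <- arxiv_count("Peter OR Hall")      # explicit OR
n_and     <- arxiv_count("Peter AND Hall")     # both words required
n_andnot  <- arxiv_count("Hall ANDNOT Peter")  # Hall without Peter

c(default = n_default, or = n_or, and = n_and, andnot = n_andnot)
```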
## [1] 38112
## [1] 38112
## [1] 161
## [1] 2764
It appears that in the author field (and many other fields) you must search full words, and that wild cards are not allowed.
## [1] 0
## [1] 1236
## [1] 44
arXiv has a set of 155 subject classifications, searchable with the prefix cat:. The aRxiv package contains a dataset arxiv_cats containing the categories, short and long descriptions, as well as field (and, for Physics, subfield). Here are the column names.
## [1] "category" "field" "subfield"
## [4] "short_description" "long_description"
Here are the statistics categories.
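One way to pull out these rows is to match on the stat. prefix of the category code. This is a sketch that assumes the package is installed; selecting by prefix is a choice made here, not necessarily how the listing in this document was produced.

```r
library(aRxiv)

# Rows whose category code begins with "stat."
stat_rows <- grepl("^stat\\.", arxiv_cats$category)

arxiv_cats[stat_rows, c("category", "short_description")]
```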
## category short_description
## 150 stat.AP Applications
## 151 stat.CO Computation
## 152 stat.ME Methodology
## 153 stat.ML Machine Learning
## 154 stat.OT Other Statistics
## 155 stat.TH Statistics Theory
To search these categories, you need to include either the full term or use the * wildcard.
## [1] 0
## [1] 17806
## [1] 116194
The terms submittedDate (date/time of first submission) and lastUpdatedDate (date/time of last revision) are particularly useful for limiting a search with many results, so that you may combine multiple searches together, each within some window of time, to get the full results.
The date/time information is of the form YYYYMMDDHHMMSS, for example 20071018122534 for 2007-10-18 12:25:34. You can use * as a wildcard for the times. For example, to get all manuscripts with initial submission on 2007-10-18:
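The query string for a given day can be built from a Date object, as in this sketch (day and query are illustrative names). Passing the result to arxiv_count() should then give the count for that day.

```r
day <- as.Date("2007-10-18")

# Date as YYYYMMDD, followed by * to match any time of day
query <- paste0("submittedDate:", format(day, "%Y%m%d"), "*")
query
```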
## [1] 196
But you can’t use the wildcard within the dates.
## [1] 0
To get a count of all manuscripts with original submission in 2007, use a date range, like [from_date TO to_date]. (If you give a partial date, it’s treated as the earliest date/time that matches, and the range appears to be up to but not including the second date/time.)
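Under that reading of partial dates, a range from one year to the next should cover exactly the first year. The helper below is hypothetical (it is not part of aRxiv) and only builds the query string, which can then be passed to arxiv_count().

```r
# Hypothetical helper: range query covering a single calendar year,
# assuming the upper end of the range is excluded as described above.
year_query <- function(year) {
  sprintf("submittedDate:[%d TO %d]", year, year + 1)
}

year_query(2007)
```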
## [1] 55749
The output of arxiv_search() is a data frame with the following columns.
## [1] "id" "submitted" "updated" "title"
## [5] "abstract" "authors" "affiliations" "link_abstract"
## [9] "link_pdf" "link_doi" "comment" "journal_ref"
## [13] "doi" "primary_category" "categories"
The columns are described in the help file for arxiv_search(). Try ?arxiv_search.
A few short notes:
The columns authors, link_doi, and categories may contain multiple items, separated by a vertical bar (|).
Missing entries are returned as an empty character string ("").
The categories column may contain not just the arXiv categories (e.g., stat.AP) but also codes for the Mathematical Subject Classification (MSC) (e.g., 14J60) and the ACM Computing Classification System (e.g., F.2.2). These are not searchable with cat: but are searchable with a general search.
## [1] 0
## [1] 880
The arxiv_search() function has two arguments for sorting the results, sort_by (taking values "submitted", "updated", or "relevance") and ascending (TRUE or FALSE). If id_list is provided, these sorting arguments are ignored and the results are presented according to the order in id_list.
Here’s an example, to sort the results by the date the manuscripts were last updated, in descending order.
res <- arxiv_search('au:"Peter Hall" AND ti:deconvolution',
                    sort_by="updated", ascending=FALSE)
res$updated
## [1] "2010-03-01 11:33:37" "2008-10-27 14:27:52" "2008-04-04 12:19:05"
## [4] "2007-10-18 12:25:34"
The arXiv metadata has a number of limitations, the key issue being that it is author-supplied and so not necessarily consistent between records.
Authors’ names may vary between records (e.g., Peter Hall vs. Peter G. Hall vs. Peter Gavin Hall vs. P Hall). Further, arXiv provides no way to distinguish multiple individuals with the same name (cf. ORCID).
Authors’ institutional affiliations are mostly missing. The arXiv submission form does not include an affiliation field; affiliations are entered within the author field, in parentheses. The metadata instructions may not be widely read.
There are no key words; you are stuck with searching the free text in the titles and abstracts.
Subject classifications are provided by the authors and may be incomplete or inappropriate.
Care should be taken to avoid multiple requests to the arXiv API in a short period of time. The arXiv API user manual states:
In cases where the API needs to be called multiple times in a row, we encourage you to play nice and incorporate a 3 second delay in your code.
The aRxiv package institutes a delay between requests, with the time period for the delay configurable with the R option "aRxiv_delay" (in seconds). The default is 3 seconds.
To reduce the delay to 1 second, use:
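Since "aRxiv_delay" is an ordinary R option, it can be set with base R's options(), as in this sketch:

```r
# Set the delay between API requests to 1 second
options(aRxiv_delay = 1)

# Confirm the current setting
getOption("aRxiv_delay")
```

Keep in mind that the arXiv API user manual recommends 3 seconds, so a shorter delay is best reserved for a small number of requests.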
Don’t do searches in parallel (e.g., via the parallel package). You may be locked out from the arXiv API.
The arXiv API returns only complete records (including the entire abstracts); searches returning large numbers of records can be very slow.
It’s best to use arxiv_count() before arxiv_search(), so that you have a sense of how many records you will receive. If the count is large, you may wish to refine your query.
arXiv has a hard limit of around 50,000 records; for a query that matches more than 50,000 manuscripts, there is no way to receive the full results. The simplest solution to this problem is to break the query into smaller pieces, for example using slices of time, with a range of dates for submittedDate or lastUpdatedDate.
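As an illustration of slicing by time, the hypothetical helper below (not part of aRxiv) builds twelve monthly submittedDate ranges for one year. It assumes the YYYYMMDDHHMM form for range endpoints and that each range excludes its upper end; if the upper end turns out to be included, records falling exactly on a boundary could appear twice and should be de-duplicated by id.

```r
# Hypothetical helper: one submittedDate range query per month of a year
month_slices <- function(year) {
  # First day of each month, plus the first day of the following year
  starts <- seq(as.Date(sprintf("%d-01-01", year)), by = "month", length.out = 13)
  # Midnight timestamps in YYYYMMDDHHMM form
  stamps <- format(starts, "%Y%m%d0000")
  sprintf("submittedDate:[%s TO %s]", stamps[-13], stamps[-1])
}

queries <- month_slices(2007)
queries[c(1, 12)]
```

Each element can be combined with the rest of a query using AND and passed to arxiv_search() in turn, with the resulting data frames stacked afterwards.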
The limit argument to arxiv_search() (with default limit=10) limits the number of records to be returned. If you wish to receive more than 10 records, you must specify a larger limit (e.g., limit=100).
To avoid accidental searches that may return a very large number of records, arxiv_search() uses an R option, aRxiv_toomany (with a default of 15,000), and refuses to attempt a search that will return results above that limit.
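Like the delay, this threshold is an ordinary R option and can be changed with options(). The value 20000 below is arbitrary, chosen only to show the mechanism.

```r
# Raise the threshold above which arxiv_search() refuses to run
options(aRxiv_toomany = 20000)

getOption("aRxiv_toomany")
```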
Even for searches that return a moderate number of records (say 2,000), it may be best to make the requests in batches: Use a smaller value for the limit argument (say 100), and make multiple requests with different offsets, indicated with the start argument, for the initial record to return.
This is done automatically with the batchsize argument to arxiv_search(). A search is split into multiple calls, with no more than batchsize records to be returned by each, and then the results are combined.
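To make the batching arithmetic concrete, the hypothetical helper below (not the package's internal code) computes the offsets that a batched search would step through, assuming offsets are zero-based as in the arXiv API.

```r
# Hypothetical helper: starting offsets for fetching `total` records
# in batches of `batchsize`, assuming zero-based offsets.
batch_starts <- function(total, batchsize = 100) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = batchsize)
}

# 250 records in batches of 100 need three requests
batch_starts(250, 100)
```

In practice there is no need to loop by hand: a call such as arxiv_search(query, limit = 2000, batchsize = 100) lets the package do the splitting and combining.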