The R {arrow} package provides access to many of the features of the Apache Arrow C++ library for R users. The goal of arrow is to provide an Arrow C++ backend to {dplyr}, and access to the Arrow C++ library through familiar base R and tidyverse functions, or {R6} classes.
To learn more about the Apache Arrow project, see the parent documentation of the Arrow Project. The Arrow project provides functionality for a wide range of data analysis tasks to store, process and move data fast. See the read/write article to learn about reading and writing data files, data wrangling to learn how to use dplyr syntax with arrow objects, and the function documentation for a full list of supported functions within dplyr queries.
The latest release of arrow can be installed from CRAN. In most cases installing the latest release should work without requiring any additional system dependencies, especially if you are using Windows or macOS.
install.packages("arrow")
Alternatively, if you are using conda you can install arrow from conda-forge:
conda install -c conda-forge --strict-channel-priority r-arrow
There are some special cases to note:
On macOS, the R you use with Arrow should match the architecture of the machine you are using. If you’re using an ARM (aka M1, M2, etc.) processor, use R compiled for arm64. If you’re using an Intel-based Mac, use R compiled for x86. Using R and Arrow compiled for Intel-based Macs on an ARM-based Mac will result in segfaults and crashes.
On Linux the installation process can sometimes be more involved because CRAN does not host binaries for Linux. For more information please see the installation guide.
If you are compiling arrow from source, please note that as of version 10.0.0, arrow requires C++17 to build. This has implications on Windows and CentOS 7. For Windows users it means you need to be running an R version of 4.0 or later. On CentOS 7, it means you need to install a newer compiler than the default system compiler gcc. See the installation details article for guidance.
Development versions of arrow are released nightly. For information on how to install nightly builds, please see the installing nightly builds article.
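After installing by any of the routes above, it can help to check what the installed build supports. The following is a minimal sketch, assuming arrow is already installed and loadable. Which features are reported as enabled depends on how your particular build was compiled.

```r
library(arrow)

# Summarise the installed package: version, enabled capabilities, build details.
info <- arrow_info()
print(info)

# Individual feature checks return TRUE or FALSE for this build.
arrow_with_parquet()
arrow_with_dataset()
arrow_with_s3()
```

If a feature you need is reported as unavailable, the installation guide describes how to obtain or build a version of the package with more features enabled.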
The Arrow C++ library comprises different parts, each of which serves a specific purpose. The arrow package provides bindings to the C++ functionality for a wide range of data analysis tasks.
It allows users to read and write data in a variety of formats:
It provides access to remote filesystems and servers:
Additional features include:
Apache Arrow is a cross-language development platform for in-memory and larger-than-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming, messaging, and interprocess communication.
This package exposes an interface to the Arrow C++ library, enabling access to many of its features in R. It provides low-level access to the Arrow C++ library API and higher-level access through a dplyr backend and familiar R functions.
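As a rough illustration of the higher-level interface, the sketch below writes a data frame to a Parquet file, reads it back as an Arrow Table, and runs a dplyr query against it. It assumes the dplyr package is installed and that your arrow build includes Parquet support. The built-in mtcars data and the temporary file path are chosen only for the example.

```r
library(arrow)
library(dplyr)

# Write a built-in data frame to a temporary Parquet file.
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Read it back as an Arrow Table rather than an R data frame.
cars <- read_parquet(path, as_data_frame = FALSE)

# Build a dplyr query on the Arrow Table. Arrow evaluates the query,
# and collect() returns the result to R as a tibble.
result <- cars %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(cyl) %>%
  collect()

print(result)
```

Only the small summarised result is converted to an R data frame at the end; the filtering and aggregation steps run on the Arrow data.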
There are a few additional resources that you may find useful for getting started with arrow:
We welcome questions, discussion, and contributions from users of the arrow package. For information about mailing lists and other venues for engaging with the Arrow developer and user communities, please see the Apache Arrow Community page.
If you encounter a bug, please file an issue with a minimal reproducible example on GitHub issues. Log in to your GitHub account, click on New issue and select the type of issue you want to create. Add a meaningful title prefixed with [R] followed by a space and the issue summary, then select the R component from the dropdown list. For more information, see the Report bugs and propose features section of the Contributing to Apache Arrow page in the Arrow developer documentation.
Please note that all participation in the Apache Arrow project is governed by the Apache Software Foundation’s code of conduct.