nanoarrow

The goal of nanoarrow is to provide minimal useful bindings to the Arrow C Data and Arrow C Stream interfaces using the nanoarrow C library.

Installation

You can install the released version of nanoarrow from CRAN with:

install.packages("nanoarrow")

You can install the development version of nanoarrow from GitHub with:

# install.packages("remotes")
remotes::install_github("apache/arrow-nanoarrow/r")

If you can load the package, you’re good to go!

library(nanoarrow)

Example

The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the ArrowSchema which represents a data type of an array, the ArrowArray which represents the values of an array, and an ArrowArrayStream, which represents zero or more ArrowArrays with a common ArrowSchema. All three can be wrapped by R objects using the nanoarrow R package.

Schemas

Use infer_nanoarrow_schema() to get the ArrowSchema object that corresponds to a given R vector type; use as_nanoarrow_schema() to convert an object from some other data type representation (e.g., an arrow R package DataType like arrow::int32()); or use na_XXX() functions to construct them.

infer_nanoarrow_schema(1:5)
#> <nanoarrow_schema int32>
#>  $ format    : chr "i"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 2
#>  $ children  : list()
#>  $ dictionary: NULL
as_nanoarrow_schema(arrow::schema(col1 = arrow::float64()))
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_schema double>
#>   .. ..$ format    : chr "g"
#>   .. ..$ name      : chr "col1"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL
na_int64()
#> <nanoarrow_schema int64>
#>  $ format    : chr "l"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 2
#>  $ children  : list()
#>  $ dictionary: NULL

Arrays

Use as_nanoarrow_array() to convert an object to an ArrowArray object:

as_nanoarrow_array(1:5)
#> <nanoarrow_array int32[5]>
#>  $ length    : int 5
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 2
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#>  $ dictionary: NULL
#>  $ children  : list()
as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
#> <nanoarrow_array struct[2]>
#>  $ length    : int 2
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_array double[2]>
#>   .. ..$ length    : int 2
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL

You can use as.vector() or as.data.frame() to get the R representation of the object back:

array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
as.data.frame(array)
#>   col1
#> 1  1.1
#> 2  2.2

Even though at the C level the ArrowArray is distinct from the ArrowSchema, at the R level we attach a schema wherever possible. You can access the attached schema using infer_nanoarrow_schema():

infer_nanoarrow_schema(array)
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_schema double>
#>   .. ..$ format    : chr "g"
#>   .. ..$ name      : chr "col1"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL

Array Streams

The easiest way to create an ArrowArrayStream is from a list of arrays or objects that can be converted to an array using as_nanoarrow_array():

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

You can pull batches from the stream using the $get_next() method. The last batch will return NULL.

stream$get_next()
#> <nanoarrow_array struct[2]>
#>  $ length    : int 2
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_array double[2]>
#>   .. ..$ length    : int 2
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL
stream$get_next()
#> <nanoarrow_array struct[2]>
#>  $ length    : int 2
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_array double[2]>
#>   .. ..$ length    : int 2
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `3.3 4.4`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL
stream$get_next()
#> NULL

You can pull all the batches into a data.frame() by calling as.data.frame() or as.vector():

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

as.data.frame(stream)
#>   col1
#> 1  1.1
#> 2  2.2
#> 3  3.3
#> 4  4.4

After consuming a stream, you should call the release method as soon as you can. This lets the implementation of the stream release any resources (like open files) it may be holding in a more predictable way than waiting for the garbage collector to clean up the object.

Integration with the arrow package

The nanoarrow package implements as_nanoarrow_schema(), as_nanoarrow_array(), and as_nanoarrow_array_stream() for most arrow package types. Similarly, it implements arrow::as_arrow_array(), arrow::as_record_batch(), arrow::as_arrow_table(), arrow::as_record_batch_reader(), arrow::infer_type(), arrow::as_data_type(), and arrow::as_schema() for nanoarrow objects such that you can pass equivalent nanoarrow objects into many arrow functions and vice versa.