In data science workflows, dealing with large datasets can be both challenging and resource-intensive. Recently, I came across Apache Arrow, a promising solution that significantly optimizes data storage for large data frames. In this post, I'll share a simple example of using Apache Arrow in R and how it helped me reduce data storage requirements and costs.
To get started, we’ll need to load the required libraries and download the Titanic dataset, a common benchmark dataset in data science. We’ll then read it into R and compare its size with the size of the Arrow table.
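Here is a minimal setup sketch. I'm assuming the Titanic data comes from the titanic package's titanic_train (891 rows, 12 columns); a CSV copy of the same data works just as well.
# Load arrow for columnar tables and titanic for the example data
# (assumption: Titanic data comes from the titanic package's titanic_train)
library(arrow)
library(titanic)

# The classic Kaggle training set: 891 rows, 12 columns
titanic_data <- titanic_train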
Attaching package: 'arrow'
The following object is masked from 'package:utils':
timestamp
Now, let’s compare the size of the original data frame with the size of the Arrow table after conversion.
# Display the size of the original data frame
cat("Size of the original data frame:", object.size(titanic_data), "bytes\n")
Size of the original data frame: 193176 bytes
# Convert to Arrow table
titanic_data_arrow <- arrow_table(titanic_data)
# Display the size of the Arrow table
cat("Size of the Arrow table:", object.size(titanic_data_arrow), "bytes\n")
Size of the Arrow table: 488 bytes
After reading the Titanic dataset, we can convert it to an Arrow table using arrow_table(). Note that object.size() only measures the small R object that wraps the Arrow table; the data itself lives in Arrow's columnar memory outside R's heap, which is why the reported size is so small. The real savings come from Arrow's columnar format and its compact on-disk storage, which make it a more efficient option for large datasets.
One of the most remarkable features of Apache Arrow is its integration with cloud storage services such as AWS S3. With Arrow’s functions (see its documentation), you can directly read data from S3 buckets without downloading large files to your local machine/scheduling cluster. This feature can be incredibly beneficial for projects that deal with massive datasets stored in cloud-based data buckets.
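For instance, a dataset stored as Parquet in S3 can be opened in place. The bucket and path below are placeholders, and this assumes your arrow build includes S3 support and that AWS credentials are configured (for example via environment variables).
# Open a Parquet dataset directly from S3 (bucket and path are placeholders)
remote_ds <- open_dataset("s3://my-bucket/titanic-parquet/")

# Row count comes from file metadata; nothing is downloaded wholesale
nrow(remote_ds)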
To take advantage of the reduced on-disk footprint, we can write the Arrow table to disk using write_dataset(), which by default writes a Parquet-backed dataset (a directory of Parquet files).
# Write the Arrow table to disk as a Parquet dataset (the path becomes a directory)
write_dataset(titanic_data_arrow, "titanic_data_arrow.arrow")
Later on, if we want to work with the data again, we can easily open it back up with open_dataset() as an Arrow dataset and then convert it to a data frame when needed.
# Open the Arrow dataset from disk
dat1 <- open_dataset("titanic_data_arrow.arrow")
# Display the dimensions and size of the Arrow table
cat("Dimensions of the Arrow table:", dim(dat1), "\n")
Dimensions of the Arrow table: 891 12
cat("Size of the Arrow table:", object.size(dat1), "bytes\n")
Size of the Arrow table: 504 bytes
# Convert the Arrow dataset to a data frame
dat2 <- as.data.frame(dat1)
# Display the dimensions and size of the data frame
cat("Dimensions of the data frame:", dim(dat2), "\n")
Dimensions of the data frame: 891 12
cat("Size of the data frame:", object.size(dat2), "bytes\n")
Size of the data frame: 193176 bytes
After opening the Arrow dataset again, we can access and manipulate the data with ease. As before, object.size() reports only the small R-side handle, since the data itself stays on disk as compact Parquet files (and in Arrow's own memory when scanned) rather than in R's heap.
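To give a flavour of what "manipulate with ease" means in practice, Arrow datasets can be queried with dplyr verbs, and only the result is pulled into R with collect(). A quick sketch, assuming the usual Kaggle column names such as Sex and Pclass:
# Query the Arrow dataset lazily with dplyr, then collect the result into R
# (assumption: column names follow the standard Kaggle Titanic training data)
library(dplyr)
dat1 %>%
  filter(Sex == "female") %>%
  group_by(Pclass) %>%
  summarise(passengers = n()) %>%
  collect()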
In conclusion, Apache Arrow’s efficiency in terms of data storage and retrieval makes it a game-changer for data-intensive applications like genomics. As you delve deeper into this powerful library, you’ll likely discover even more advantages that will help you unlock the true potential of your data.
This was a basic example to get the ball rolling. Happy data wrangling with Apache Arrow!
Further reading: the Apache Arrow Cheatsheet, DJ Navarro's blog, and the R Arrow Cookbook.