Big Data: Apache Arrow for Faster and More Efficient Management of Columnar Data

By Dick Weisinger

The Apache Software Foundation recently announced Arrow as a new top-level project. Arrow is embeddable software that enables columnar in-memory processing. Structured data is usually managed with SQL operating on the rows of a database. Column-oriented operations are often tricky to set up, but done right, they allow larger data sets to be processed, and certain columnar operations run as much as 100 times faster than their row-based equivalents.
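To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It does not use Arrow itself; the sample records, the field names (id, price, qty) and the to_columns helper are all made up for illustration. It only shows the general idea of storing each column as one contiguous typed buffer instead of interleaving fields record by record.

```python
from array import array

# Row-oriented layout: one record per row, with the fields interleaved.
# (Hypothetical sample data, not from the article.)
rows = [
    {"id": 1, "price": 9.5, "qty": 3},
    {"id": 2, "price": 4.0, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 1},
]


def to_columns(records):
    """Pivot row records into one contiguous typed array per column."""
    return {
        "id": array("q", (r["id"] for r in records)),        # 64-bit ints
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
        "qty": array("q", (r["qty"] for r in records)),
    }


columns = to_columns(rows)

# Row-based aggregate: visits every record, then picks out one field.
row_total = sum(r["price"] for r in rows)

# Columnar aggregate: scans a single contiguous buffer of doubles.
col_total = sum(columns["price"])
```

Both totals are the same; the difference is in memory layout. In the columnar form, the values of one column sit next to each other, which is the property that engines built on formats like Arrow exploit for cache-friendly and vectorized execution.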

Jacques Nadeau, chairman of the Apache Drill project, described Apache Arrow as “an accelerator for processing and storage systems.  It’s a set of data representations that are much more CPU-efficient…  Doing in-memory columnar is hard. Doing rows is easy.”

Arrow provides a common data format that lets multiple systems share and exchange data. Nadeau said that using Arrow can eliminate the repeated serialization and deserialization of data, which can consume 70 to 80 percent of the processing cycles in some workloads.
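The cost being described can be illustrated with a rough analogy using only the Python standard library. This is not Arrow's API or its wire format: pickle stands in for any serialization step between two systems, and a memoryview stands in for a consumer that understands the producer's in-memory layout. The function names below are hypothetical.

```python
import pickle
from array import array

# Producer-side column of 64-bit floats (hypothetical sample data).
prices = array("d", [9.5, 4.0, 7.25])


def exchange_by_copy(column):
    """Serialize on one side and deserialize on the other: two full copies."""
    wire_bytes = pickle.dumps(column.tolist())
    return pickle.loads(wire_bytes)


def exchange_by_view(column):
    """Hand the consumer a view over the same buffer: no copy is made."""
    return memoryview(column)


copied = exchange_by_copy(prices)
view = exchange_by_view(prices)

# The view reads the producer's memory directly, so a later update by the
# producer is visible through it, while the copied list is unaffected.
prices[0] = 10.0
view_sees_update = view[0] == 10.0
copy_sees_update = copied[0] == 10.0
```

When both sides agree on the layout in advance, the consumer can skip the encode and decode steps entirely, which is the saving Nadeau describes for systems that adopt a shared columnar format.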

Nadeau added, “Cache locality, pipelining and superword operations frequently provide 10-100x faster execution performance. Since many analytical workloads are CPU bound, these benefits translate into performance gains, or more plainly, the potential for faster answers and higher levels of user concurrency.”

Ted Dunning, Vice President of the Apache Incubator, said that “an industry-standard columnar in-memory data layer enables users to combine multiple systems, applications and programming languages in a single workload without the usual overhead.”



