(Note: This was an old internal POC we created; the code hasn’t been cleaned up, so we removed it from the public article. If you are interested in seeing the code or in having us implement a Big Data Pivot Table for you, please write to us at info@synkre.com)

Years ago, I was assigned a task to build a pivot table to show financial information. In finance, pivot tables are often useful because they allow arbitrary rotations of datasets. Building a pivot table is never trivial, but the constraints we had back then were unusually challenging: we had to use relational databases. Since pivot tables require arbitrary and efficient manipulation of columns, not just rows, this constraint ran directly against the limitations of relational database technology. We succeeded. We built the product. Then we rebuilt it. Then we rebuilt it again. But ever since then I have been wondering what it would be like to build one using the right technology.

Since then, we developers have gained a lot of new tools, including goodies like Spark and Cassandra: a columnar distributed processing engine and a columnar distributed datastore, respectively. Now we have the tools we need. I created a POC to show how easy it would be to construct a pivot table with these tools. Here it is.

Let’s first set our goals for this POC. What are we trying to accomplish?

  • Show how the columnar nature of Spark data frames allows easy creation of columns, which is what creates the “pivoting” illusion that makes pivot tables so useful in visualization
  • Show that an application can be constructed which uses Spark for ad-hoc computing. Spark was originally designed to be used in batch mode. Here we need to build up state inside the cluster and run an app which connects to this state and displays it. The cluster will be used as our processing engine for an interactive application.
  • Show that the pivot table we construct will be able to work with continually changing data. When viewing financial information, the data is often streamed to us. In this example we will use EOD stock prices of a single company as our datasource. If we used intraday data, say with minute granularity, our pivot table would have to update itself from that stream every minute. We will show how Spark’s structured streaming makes this relatively straightforward to achieve (see the sketch after this list).
  • Show that our table can grow to nearly arbitrary length. In finance we usually don’t have true “Big Data”; we really only have medium-sized data. But even that is big enough that putting all of it into memory for pivoting becomes problematic. Spark can use the memory of many machines in the cluster, so it is much less likely that we will run out of memory. It also has connectors to other technologies for storing its data frames, such as Cassandra, which is a database. When using a database to store our data, we make our technology scale to whatever fits into our storage. But unlike a relational database, this database supports our columnar manipulations directly instead of fighting us, which makes our lives easier as developers. The good news is that the hard limit set by memory size is removed, without having to resort to clumsy database technologies that are incompatible with our use case. But we still need to prove that these datasets, as they are rotated, can be efficiently paged out onto the Spark nodes’ memories and beyond into an HDFS-compliant storage, in our example Cassandra. Will it be able to handle the load?
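
The streaming goal is easier to picture with a small sketch. Below is a minimal, hypothetical structured-streaming job that keeps an aggregate up to date as new price rows arrive; the directory path, schema, and column names are illustrative assumptions, not the POC’s actual code.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("pivot-stream-sketch").getOrCreate()

    // Schema of the incoming intraday price rows (assumed for illustration).
    val priceSchema = new StructType()
      .add("time", TimestampType)
      .add("symbol", StringType)
      .add("open", DoubleType)
      .add("close", DoubleType)

    // New minute bars are dropped into this directory as CSV files.
    val prices = spark.readStream
      .schema(priceSchema)
      .csv("/data/intraday-prices")

    // Re-aggregate as data arrives; the pivot table UI reads the latest result.
    val latestByMinute = prices
      .groupBy(window(col("time"), "1 minute"), col("symbol"))
      .agg(avg(col("close")).as("avg_close"))

    latestByMinute.writeStream
      .outputMode("complete")
      .format("memory")              // exposed as an in-memory table to the app
      .queryName("latest_prices")
      .start()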

I have created a POC that achieves many of these goals either fully or partially. Unfortunately, not all the nice features of Spark and Cassandra work at the same time with recent versions.

The pivoting illusion

All versions of Spark have supported columnar manipulation of data. The Spark APIs come in three flavors: RDDs, DataFrames, and Datasets. DataFrames directly support pivoting on columns, but so far only a single pivot column is supported. The only way multi-column pivoting can be accomplished is by combining multiple pivot columns into one, as sketched below. The following change in the Spark GitHub repository promises to make multi-column pivoting easier. For the time being though, our POC will only allow single-column pivots.

https://github.com/apache/spark/pull/22030/commits/90fc82b59ae50e6a2a1548a0756a40c6325354ec
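
A minimal sketch of the single-column workaround mentioned above, assuming a DataFrame df with two label columns group_a and group_b that we would like to pivot on together; all of these names are placeholders, not the POC’s actual columns.

    import org.apache.spark.sql.functions._

    // Fold the two would-be pivot columns into one synthetic key...
    val combined = df.withColumn(
      "pivot_key",
      concat_ws("_", col("group_a"), col("group_b")))

    // ...and pivot on that single key, which current Spark allows.
    val pivoted = combined
      .groupBy(col("date"))
      .pivot("pivot_key")
      .agg(first(col("close")))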

The Cassandra connector only supports Spark 2.3

When saving the pivoted dataset in Cassandra, we also have some limitations: there doesn’t seem to be a connector for the most recent version of Spark. This is not ideal for us because, in order to pivot multiple columns, we want to use the latest version of Spark, 2.5, which at the time of this writing is not even out. Here’s the connector version compatibility chart:

https://github.com/datastax/spark-cassandra-connector (Version Compatibility)

The dataset

The data for the pivot table will be represented as a Spark data frame. We will use daily stock data for IBM. We manipulate the dataset using Spark transformations and we save the data frame as a Cassandra table to demonstrate how well this connector works for persisting data frames. Once saved in Cassandra, this data can be further worked on by any tool that can interface with Cassandra.
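
As a rough illustration of the connector’s DataFrame API, the snippet below writes the daily IBM data frame to Cassandra and reads it back; the keyspace, table, and host names are assumptions, and ibmDaily stands in for the POC’s actual data frame.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("pivot-persist-sketch")
      .config("spark.cassandra.connection.host", "cassandra")  // assumed host name
      .getOrCreate()

    // Persist the data frame as a Cassandra table via the DataStax connector
    // (assumes the pivot.ibm_daily table already exists in Cassandra).
    ibmDaily.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "pivot", "table" -> "ibm_daily"))
      .mode("append")
      .save()

    // Any tool that speaks Cassandra, or another Spark session, can read it back.
    val restored = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "pivot", "table" -> "ibm_daily"))
      .load()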

Calculated columns and derived datapoints

Because we use Spark, a columnar data processing framework, adding columns on the fly is very efficient. This accommodates well how people usually want to work with the data when performing data-science-style tasks like enriching the data. We show that this can be easily accomplished by adding a Moving Average (MA) column, as well as open and close group columns such as o41+, c41+, o40+, c40+, which are just labels for trades whose open or close falls in a range such as 40-41. We need these group label columns because, in order to demonstrate how grouping works, we need repeated values.
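
A window function and a couple of label expressions are all it takes. The sketch below assumes a DataFrame ibmDaily with date, open, and close columns and uses a 5-day window and ceiling-based bucket labels; these are illustrative assumptions rather than the POC’s exact parameters.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // 5-day trailing window over the daily rows, ordered by date.
    val w = Window.orderBy(col("date")).rowsBetween(-4, 0)

    val enriched = ibmDaily
      // Moving average of the closing price over the window.
      .withColumn("ma_close_5d", avg(col("close")).over(w))
      // Label each row by the dollar bucket its open/close falls into,
      // e.g. an open of 40.75 gets the label "o41+" (the 40-41 range).
      .withColumn("open_group",
        concat(lit("o"), ceil(col("open")).cast("string"), lit("+")))
      .withColumn("close_group",
        concat(lit("c"), ceil(col("close")).cast("string"), lit("+")))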

Grouping and pivoting horizontally

We use Spark’s groupBy to group the values of the data frame, and we use DataFrame.pivot to rotate the grouped row and make it into a column. When we pivot, the grouped column that is moved to the top to become a new column creates child columns for each of the existing columns. This is how pivoting always works: each value column is really computed per grouping, even when the group is shown at the top.
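
In code this is a single chained call. The example below reuses the hypothetical enriched data frame and label columns from the previous sketch.

    import org.apache.spark.sql.functions._

    val rotated = enriched
      .groupBy(col("close_group"))     // Selected Rows: classic groupBy
      .pivot("open_group")             // Selected Pivot Column: values become headers
      .agg(avg(col("close")), avg(col("ma_close_5d")))

    // Each aggregate now appears once per open_group value, producing child
    // columns such as "o41+_avg(close)" and "o41+_avg(ma_close_5d)".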

The UI

Our sample pivot table app has as its main pane the current version of the dataset. On the right there are 4 control panes: Available Columns, Selected Columns, Selected Rows, and Selected Pivot Columns.

Available Columns is just the list of the columns in the data frame that are not shown in the grid but are available to be shown. Drag an available column to the Selected Columns pane below to make it show up in the grid. This is how we control what is visible.

The two panes below control what we group on. Drag a column to the Selected Rows pane to group the rows for that column. This is classic groupBy functionality. Now, if you drag the column to the Selected Pivot Columns pane, the grouping will be shown at the top, and this is what gives the pivoting illusion.

Please keep in mind that, due to the limitations of the current versions of Spark and the Cassandra connector, we can only pivot one column at this point. Hopefully, this limitation will soon disappear.

Here’s how the Pivot Table looks initially: https://github.com/kornelc/spark-pivot-table/blob/master/pivot-images/PivotInitial.png

Then, when you drag a column into the Selected Rows pane, you can see classic row grouping: https://github.com/kornelc/spark-pivot-table/blob/master/pivot-images/OneColumnGroupped.png

Now drag two columns and you can see two-level grouping: https://github.com/kornelc/spark-pivot-table/blob/master/pivot-images/TwoColumnsGrouped.png

When you drag one of the columns into the Selected Pivot Columns pane, the headings appear at the top, and all columns are shown in the context of the dragged dimension: https://github.com/kornelc/spark-pivot-table/blob/master/pivot-images/SecondGroupedColumnPivotted.png

You can get the source code here: https://github.com/kornelc/spark-pivot-table. Make sure you have Docker installed and that you have lots of memory (4G+) available to run the app. The demo uses docker-compose. All you should have to do is run ./pivot-service on.