Algorithms and Stock Tickers with Apache Spark – My Summer Internship with Basho
In January of 2015, I had just graduated with a Masters Degree in Particle Physics. Usually after finishing a masters in physics the next step it to pursue a PhD, however, the prospect of another 5-8 years of schooling did not appeal to me. While in college I did robotics research as well as mathematical finance research, and even built a rudimentary algorithmic trading program using just Java and Yahoo Finance, which worked very well. I really enjoyed the algorithmic trading project and it always stayed in the back of my mind as something I would like to pursue further.
From Isotopes to Algorithms
After some job searching, I realized that my educational choices did not lead to a clear career path. Although my Masters degree made me very knowledgeable in radiating dipoles, those are not the "practical" skills that I would need to be a competitive job candidate. So I enrolled in a 12 week data science "bootcamp". After the bootcamp, I began looking for an interesting job. I remembered calling Basho Technologies a couple months prior, inquiring about any data science related opportunities, and was told to call back when my bootcamp was completed. So I decided to follow up. After speaking with Matt Digan, an Engineering Manager at Basho, it seemed there was a need for a real-world use case to test the newly created Basho Data Platform (BDP). We discussed possible use cases and realized that I could expand on my previous algorithmic trading program as a summer internship.
For this project I first had to learn: a lot. Before the internship started, I began reading the 'Little Riak Book' which is quite dense. The book's journey through distributed systems thinking and the implementation of Riak prepared me for what's next. Once I had a feel for how Riak KV works, I began to design the project in detail. I knew the program would be best designed when kept quite simple. It begins with an efficient pull all of the raw data from Riak KV using Python in the format of a Spark Resilient Distributed Dataset (RDD). From there, I would format and analyze the data using the parallel nature of Spark, and then write the results back to Riak.
Analysis of the NYSE
And the project began. Before we get into the code, it's worth noting that this project is only an example and is, of course, not supported in production by Basho. Now, to the process: I analyzed all possible pairs on the New York Stock Exchange (NYSE). Naively, that means roughly 9 millions pairs created from 3000 stocks. I used historical stock time series data from Google Finance to fill the raw data Riak bucket. Then I pulled and sorted all 9 million pairs for data points in the last N days. Now each pair consists of two time series of length N. To generate a signal, I had to run an algorithm on each pair. The algorithm is called a Engle-Granger Test of Cointegration. Engle and Granger won the 2003 Nobel Prize in Economics for formulating this test. The algorithm tests whether or not the differences between two time series are mean reverting. If the differenced time series is mean reverting, then the original two time series are "cointegrated." This property of cointegration can be exploited by noticing that the differenced time series is currently trading far from the mean, which suggests one of the stocks is over-valued and the other under-valued. The trick is designing a reliable "signal" to indicate the optimal time to get into a trade, and another signal for exiting the trade. The beauty of this strategy is that it works in all markets and has very low market exposure. The main risk involved with pair trading is that the cointegrated relationship between two stocks will break. This can be caused by many events such as a scandal, management shuffling, lawsuits, earnings, geopolitics, etc. Basically, no news is good news.
So for my project, I had to create the signal generator. The program would analyze all 9 million pairs, which on a high-end laptop would take almost half a day, spitting out all pairs that were cointegrated and outside a certain deviation from the mean. The program was to do this every night after the markets closed and have a small Amazon Web Services t2.micro "wake up" five t2.mediums, tie together a BDP cluster, run code on the t2.mediums to pull raw data from Google Finance, load N days worth of each stock into a Spark RDD, create all possible pairs, analyze in parallel with Spark, then write back the results to Riak.
Learning More Development Tools
Once I began writing the core of the project, Matt introduced me to Python's
virtualenv, and at the time I didn't understand the importance of this tool. It turned out that having this separate environment was crucial to testing and deploying the project down the line. I was now able to quarantine the Python libraries I used so that deploying library dependencies to production would be as easy as running
pip install requirements.txt. Also, during the initial code writing phase, I was introduced to Vagrant. Vagrant amazed me at how quickly and seamlessly multiple systems can be booted within a single OS. Vagrant became vital to testing my project, and I'm glad I spent the time to really get to know it's functionality. We have many Vagrantfiles for Riak that personify this ease of use.
After about a month into my internship I had coded out the first working version of my project. I wrote a Python library to download data from Google Finance, store it in a useable format in Riak, retrieve the data from Riak, sort and order the data, create all possible pairs of stocks, analyze the pairs, write back the results to Riak. In order to quickly pull data from Riak, I created a Spark RDD with a list of all stocks and would parallelize the RDD so that I could pull data in parallel from Riak. If I were using Java or Scala, the BDP Spark Connector would have been well suited for this task.. Once the data is in the RDD, I wanted to analyze it in parallel as much as possible in order to cut computation time down from 12 hours. Pruning the data was very important, as I only wanted to use "good" stocks (low risk, with sufficient historical data available) in a parallel Spark RDD to create all possible pairs. Once the data is ready for analysis, each pair was mapped to an anonymous function that handled all of the analysis. The results of the analysis either returns a zero if the pair is not ready to be traded or a one accompanied by relevant data if the pair is ready to be traded. Again the results are mapped to an anonymous function that would write the results in parallel back to Riak. While the analytics is relatively straightforward, getting acquainted with Spark and parallel analytics took quite some time as I had no previous experience with Spark.
At this point, I had a working data pipeline in an IPython notebook, but this was not the full project. I still needed to figure out how to make the pipeline fully automated, instead of manually running IPython notebook cells one after the other. I had no previous exposure to automated code deployment, so this seemed like a fun issue to figure out. Matt pointed me towards several technologies that I had never heard of, including: Fabric, Crontab, Ansible, Terraform, Boto and Docker. I looked over all of them to find which fit my needs best. I decided to use Boto, Fabric, and Crontab, and began learning them inside and out. For example, cron takes care of running code at predetermined times in a fairly straightforward manner. Simply put into a file the time, days, and path to the code, and its runs it at that time.
The next part was a bit more difficult. I wanted to have an AWS EC2 t2.micro (a very inexpensive EC2 machine) always on, acting as a cluster launcher. Each morning at one o'clock the launcher would then boot up a five-node BDP cluster of t2.mediums (not so inexpensive, more computing power). These t2.mediums would act as my analytics machine.
I created an Amazon Machine Image with the BDP and my project code already installed, that would act as the base for the five-node cluster. Then, I wrote some code using Boto to boot and shutdown the five-node cluster. At this point IPython Notebook was no longer of any value, so I turned instead to Fabric in order to quickly boot EC2 machines, setup a cluster, and debug my code I wrote a fabfile library to simulate the launcher before putting the project into production. I was able to write a fabric script that would create a working five-node BDP cluster out of thin air on AWS in under three minutes – quite a time saver. Then, I deployed the necessary code to the launcher and cluster using Fabric and waited until the next morning. Each morning I checked the cluster's Riak database for possible trades to see if it worked properly, and it did. So there you have it: a single t2.micro that would launch a pairs trading signal generation program automatically each night at 1 am to a standby, five-node BDP cluster. Running the program on the cluster instead of a single laptop cut run time down from 12 hours to around two hours, and potentially lower if more BDP nodes are added to the cluster.
More to Build
In the future I hope to add much more to this project. By adding streaming support and machine learning algorithms on top of the generated data and signal, this project could be turned into a fully functioning, cutting edge, trading system. There is a lot of potential left to fulfill, but only so much time in one summer.
My internship with Basho has been an amazing experience. I had the opportunity to create something interesting from scratch, and learned a lot about software development along the way. I couldn't have done it without the guidance and support of Matt Digan and the rest of the Data Platform team. If you'd like to check out my work, you can find it on GitHub at https://github.com/basho-labs/spark-bdp-stockticker-demo
Shared via my feedly reader
Sent from my iPhone