Stake Cryptocurrency for Profit, Competing against 10K Data Scientists

Background

Numerai is a weekly data science competition. Predictions submitted by users collectively steer Numerai’s hedge fund. A weekly prize pool is paid out to the top-performing users who stake. I have been watching Richard Craib at Numerai for a few years now. The company has some smart investors and names on its board; for example, it was seed funded by Howard L. Morgan, the co-founder of Renaissance Technologies. Numerai's profits are not public, but its Series A round raised 6M.

Data provides an edge for investors, so it is predictably expensive. Numerai has the novel idea of democratizing data access and rewarding accurate predictions. Using laypeople to crowdsource expertise is not a new approach. Crowdsourced predictions are catching headlines by outperforming industry experts in medicine (protein folding) and in policing pirates at sea (US Navy). Each week, Numerai releases new anonymized data, and data scientists can submit forecasts against it. Payouts, however, are based on the cryptocurrency stake submitted with each set of forecasts. Poor forecasts have their stake burned, which removes those tokens from circulation and so supports the value of Numeraire, an Ethereum token.

Investigating Data

Without knowing anything about the data, my first 3 tasks are almost always:

  • Are there missing values, and what is the missing data profile?
  • What does the categorical frequency for each discrete variable look like?
  • What is the distribution of each continuous variable?

The plot of missing data is not displayed because there is no missing data.
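The three checks above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code behind this post; the toy rows, the column names and the helper functions are all hypothetical, and one value is deliberately set to missing so the first check has something to report.

```python
import statistics
from collections import Counter

# Hypothetical toy rows (assumption): the real data set has 50 features
# and, as noted above, no missing values.
rows = [
    {"era": "era1", "feature1": 0.42, "feature2": 0.55, "target": 1},
    {"era": "era1", "feature1": 0.61, "feature2": None, "target": 0},
    {"era": "era2", "feature1": 0.50, "feature2": 0.47, "target": 1},
    {"era": "era2", "feature1": 0.38, "feature2": 0.52, "target": 0},
]

def missing_profile(rows):
    """Count missing (None) values per column."""
    return {col: sum(1 for r in rows if r[col] is None) for col in rows[0]}

def category_frequency(rows, col):
    """Frequency of each level of a discrete column."""
    return Counter(r[col] for r in rows if r[col] is not None)

def distribution_summary(rows, col):
    """Quartile summary of a continuous column, ignoring missing values."""
    values = sorted(r[col] for r in rows if r[col] is not None)
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": values[0], "q1": q1, "median": median,
            "q3": q3, "max": values[-1], "mean": statistics.fmean(values)}

print(missing_profile(rows))
print(category_frequency(rows, "era"))
print(distribution_summary(rows, "feature1"))
```

On a real data set the same three summaries would normally be plotted rather than printed, but the questions being asked are the same.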

Plotting predictors against the response shows that Numerai does a good job of providing clean data. Box plots show the predictor variables are balanced around the response variable. More feature engineering can and should be done, too; multicollinearity exists among all 50 predictors.
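One simple way to spot multicollinearity is to scan for highly correlated predictor pairs. The sketch below is an assumption about how that could be done, not the method used for this post: the feature values are made up, the 0.8 threshold is arbitrary, and pairwise correlation is only a proxy (it will miss a predictor that is a combination of several others).

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Hypothetical feature columns (assumption, not Numerai data).
columns = {
    "feature1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "feature2": [0.2, 0.4, 0.6, 0.8, 1.0],  # exact multiple of feature1
    "feature3": [0.5, 0.1, 0.4, 0.2, 0.3],
}

print(correlated_pairs(columns))
```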

[Figure: box plots of each predictor against the response (chunk plt_boxPlot)]

The plot shows the response variable, target_bernie, on the y-axis, with each feature named in its panel title. A subset of predictors, features 1:44, are cut from this plot, but the normalized appearance remains consistent. The dots indicate some heaviness in the tails, a departure from normality, so some transformations may be in order. The response variable is also well balanced.

Predictive Power

Next I check for zero-variance predictors, outliers (Tukey's rule) and overall predictive power.
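The first two checks are easy to sketch. The helpers below are hypothetical and use only the standard library; the outlier test applies Tukey's fences with the conventional multiplier k = 1.5, which is an assumption since the post does not state the setting used. The predictive power score is not reproduced here because its definition is not given.

```python
import statistics

def is_zero_variance(values):
    """A predictor that takes a single value carries no information."""
    return len(set(values)) <= 1

def tukey_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up values (assumption), with one obvious outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(is_zero_variance([0.5, 0.5, 0.5]))
print(tukey_outliers(sample))
```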


A single red bar indicates that the predictors provide low predictive power. That plot is not an accident. Predictive power is usually spread across high, medium and low bands; here every predictor falls into the low band. Modeling for Numerai is very hard indeed.

Below is the same predictive power plot from a separate real-world data set (not from Numerai). To put this uphill battle into perspective, I reached an AUC of just over 90% on that poor data, which had mixed predictive power. Hoping to get a 90% AUC with Numerai is somewhere between optimistic and naive.
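For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case, so 50% is a coin flip and 100% is a perfect ranking. A minimal sketch of that definition follows; the function name and the labels and scores are made up for illustration, and the quadratic pairwise loop is only suitable for small samples.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores (assumption): three of four pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```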


Fit Model to Data

In the next blog post I will walk through the process of model fitting using H2O. My initial models have produced a logloss of 0.68 on the training data. To avoid having my stake burned, I am tuning the model to reach a logloss at or below 0.67 on the test data.
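To see how thin those margins are, it helps to compute the logloss of a model that knows nothing. For a binary target, always predicting 0.5 scores ln 2, roughly 0.693, so 0.68 is only slightly better than guessing. The sketch below uses the standard binary logloss formula; the function name, labels and probabilities are made up for illustration.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Made-up labels (assumption), balanced like the target described above.
labels = [1, 0, 1, 0]

baseline = log_loss(labels, [0.5, 0.5, 0.5, 0.5])        # equals ln 2
slightly_better = log_loss(labels, [0.55, 0.45, 0.55, 0.45])

print(round(baseline, 4))
print(round(slightly_better, 4))
```

Nudging every prediction just five points in the right direction is enough to move the score noticeably below the baseline, which is why small logloss differences matter here.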
