Numerai is a weekly data science competition. Predictions submitted by users are combined to steer Numerai’s hedge fund. A weekly prize pool is paid out to the top-performing users who stake. I have been watching Richard Craib at Numerai for a few years now. They have some smart investors and board members; for example, they were seed funded by Howard L. Morgan, co-founder of Renaissance Technologies. Public information is not available on Numerai’s profits, but their last Series A raised $6M.
Data provides an edge for investors, so it is predictably expensive. Numerai has the novel idea of democratizing data access and rewarding accurate predictions. Using laypeople to crowdsource expertise is not a new approach: crowdsourced predictions are catching headlines by out-performing industry experts in medicine (protein folding) and policing pirates at sea (US Navy). Each week, Numerai releases new anonymized data, and data scientists can submit forecasts against it. Payouts, however, are based on the cryptocurrency stake submitted with each set of forecasts. Poor forecasts have their stake burned, effectively supporting the value of Numeraire, an Ethereum token.
Without knowing anything about the data, my first 3 tasks are almost always:
- Are there missing values, and what is the missing data profile?
- What does the categorical frequency of each discrete variable look like?
- What is the distribution of each continuous variable?
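The three checks above can be sketched in a few lines of pandas. A synthetic frame stands in here for the Numerai training file (which actually arrives as a CSV), and the column names are illustrative only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Numerai training data; column names are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature1": rng.normal(size=100),
    "feature2": rng.normal(size=100),
    "era": rng.choice(["era1", "era2"], size=100),
    "target": rng.integers(0, 2, size=100),
})

# 1. Missing-data profile: NaN count per column (all zeros here, as with Numerai data)
missing = df.isna().sum()

# 2. Categorical frequency for each discrete variable
freqs = {c: df[c].value_counts() for c in df.select_dtypes(include="object")}

# 3. Distribution of each continuous variable
summary = df.select_dtypes(include="number").describe().T
print(summary[["mean", "std", "min", "max"]])
```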
The plot of missing data is not displayed because there is no missing data.
Plot predictors vs response
Numer.ai does a good job of providing clean data. Box-plots show the predictor variables are balanced around the response variable. Still, more feature engineering can and should be done: multicollinearity exists among all 50 predictors.
The plot shows the response variable, target bernie, on the y-axis, with each feature's name as its panel title. A subset of predictors (features 1:44) is cut from this plot, but the normalized appearance remains consistent. The dots indicate some heaviness in the tails, departing from normality, so some transformations may be in order. The response variable is also well balanced.
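The multicollinearity claim is easy to verify from the correlation matrix: flag any feature pair whose absolute correlation exceeds a threshold (0.8 here, which is a judgment call). This sketch uses synthetic features with collinearity built in, standing in for Numerai's 50 predictors:

```python
import numpy as np
import pandas as pd

# Five synthetic features sharing a common latent factor, so they are collinear by design.
rng = np.random.default_rng(1)
base = rng.normal(size=(500, 1))
X = pd.DataFrame(
    base + 0.1 * rng.normal(size=(500, 5)),
    columns=[f"feature{i}" for i in range(1, 6)],
)

# Absolute pairwise correlations; keep only the upper triangle to avoid double counting
corr = X.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.8]
print(f"{len(high_pairs)} highly correlated feature pairs")
```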
Next I check for zero-variance predictors, outliers (Tukey fences), and overall predictive power.
A single red bar indicates the predictors provide low predictive power. That plot is no accident: predictive power is usually spread across high, medium, and low. Modeling for Numerai is very hard indeed.
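The first two screens above are mechanical. A minimal sketch, on a toy frame with one constant column planted so the zero-variance check has something to find:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Toy frame: one informative column and one constant (zero-variance) column.
df = pd.DataFrame({
    "feature1": rng.normal(size=200),
    "feature2": np.ones(200),  # zero variance, should be dropped
})

# Zero-variance screen: columns with at most one distinct value
zero_var = [c for c in df.columns if df[c].nunique() <= 1]

# Tukey fences: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["feature1"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["feature1"] < q1 - 1.5 * iqr) | (df["feature1"] > q3 + 1.5 * iqr)]
print(zero_var, len(outliers))
```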
Below is the same predictive-power plot from a separate real-world data set (not from Numerai). To put this uphill battle into perspective, I generated just over 90% AUC from that poor data with mixed predictive power. Hoping for 90% AUC with Numerai is somewhere between optimistic and naive.
Fit Model to Data
In the next blog post I will walk through the process of model fitting using H2O. My initial models have been generating a logloss of 0.68 on the training data. To avoid having my stake burned, I am optimizing my model to reach 0.67 logloss or below on the test data.
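For context on why 0.68 is only a modest start: a model that predicts 0.5 for every row scores ln(2) ≈ 0.6931, so that is the bar any submission must beat. A minimal sketch of the metric (the `logloss` helper name is mine, not Numerai's or H2O's):

```python
import numpy as np

def logloss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy, the metric Numerai scores submissions on."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([0, 1, 1, 0, 1, 0])

# A constant 0.5 guess scores ln(2) -- the uninformed baseline
print(round(logloss(y, np.full(6, 0.5)), 4))  # 0.6931
```

Seen this way, 0.68 on training data is only a sliver better than guessing, which is exactly what the low-predictive-power plot above would lead you to expect.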