Nimbus is a Ruby gem that implements Random Forest in a genome-wide prediction context (ref.). Nimbus trains the algorithm based on an input file (learning sample) containing the phenotypes of the individuals and their respective list of genotype markers (i.e. SNPs). A random forest is created and stored in a .yml file for future use.
Nimbus can also be run to make predictions in a validation set or in a data set whose response variable has not yet been observed. In this case, the predictions can be obtained using the random forest created with a learning sample or with a previously stored random forest.
If a learning sample is provided, the gem will create a file with the variable importance of each feature (marker) in the data. The higher the importance is, the more relevant the marker is to correctly predict the response variable in new data.
Nimbus can be used for both classification and regression problems, and the user may provide different parameter values in a configuration file to tune the performance of the algorithm.
Prerequisites: Ruby and Rubygems installed in your system.
The random forest algorithm was first proposed by Breiman (2001). It can be classified as a massively non-parametric machine-learning algorithm. RF makes use of bagging and randomization, constructing many decision trees (ref) on bootstrapped samples of a given data set. The predictions from the trees are averaged to make final predictions. The algorithm is robust to over-fitting and able to capture complex interaction structures in the data, which may alleviate the problems of analyzing genome-wide data.
In machine learning terms, it is an ensemble algorithm that uses multiple models to obtain better predictive performance than that obtained from any of the single models (trees).
Let y (nx1) be the data vector consisting of discrete observations for the outcome of a given trait, and X = {xi} where xi is a (px1) vector representing the genotype of each animal (0, 1 or 2) for p SNPs, on which T decision trees are built (see classification and regression tree theory).
Note that main SNP effects, SNP interactions, environmental factors or combinations thereof may be also included in xi. This ensemble can be described as an additive expansion of the form:
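Assuming the ensemble simply averages its trees, as described for the prediction step, the expansion can be written as follows. The symbol \hat{y} for the forest prediction is an assumption introduced here, and any weighting or intercept terms used in the reference paper are not reproduced:

```latex
\hat{y} = \frac{1}{T} \sum_{t=1}^{T} h_t(y, X)
```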
Training: Each tree (ht(y,X) for t ∈ (1,T)) in the forest is constructed using the following algorithm:
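As a rough illustration of the core step in standard random forest training, the sketch below picks the best split at a single node from a random subset of `mtry` SNPs. The method names, the squared-error loss and the genotype threshold scheme are assumptions made for illustration; this is not Nimbus's actual implementation.

```ruby
# Hypothetical sketch: choose the best split at one tree node.
# rows: array of genotype arrays (values 0, 1 or 2); ys: numeric phenotypes.

# Sum of squared deviations from the mean (0.0 for an empty group).
def sum_squared_error(values)
  return 0.0 if values.empty?
  mean = values.sum.to_f / values.size
  values.sum { |v| (v - mean)**2 }
end

# Try every sampled SNP and both genotype thresholds; keep the lowest loss.
def best_split(rows, ys, mtry, rng = Random.new)
  snp_count = rows.first.size
  candidates = (0...snp_count).to_a.sample(mtry, random: rng)
  best = nil
  candidates.each do |snp|
    [0, 1].each do |threshold| # left: genotype <= threshold, right: the rest
      left, right = ys.each_index.partition { |i| rows[i][snp] <= threshold }
      next if left.empty? || right.empty?
      loss = sum_squared_error(left.map { |i| ys[i] }) +
             sum_squared_error(right.map { |i| ys[i] })
      best = { snp: snp, threshold: threshold, loss: loss } if best.nil? || loss < best[:loss]
    end
  end
  best
end

rows = [[0, 2], [0, 0], [2, 1], [2, 2]]
ys   = [1.0, 1.2, 3.0, 3.2]
split = best_split(rows, ys, 2, Random.new(42))
puts split.inspect
```

In a full tree this step would be repeated on the left and right groups until a node holds too few individuals to split further, which is the role a parameter such as `node_min_size` typically plays.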
Generalization error: This error is calculated using the out-of-bag (OOB) sample. Observations from the OOB sample are passed down the tree and assigned the label of the node they end up in. The error (mean squared error for regression, misclassification rate for classification) is calculated by comparing the predictions with the observed phenotypes in the OOB sample.
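To make the OOB idea concrete, here is a minimal sketch of drawing a bootstrap sample, deriving the out-of-bag indices, and scoring predictions. All method names are hypothetical and the code does not reflect Nimbus's internals.

```ruby
# Hypothetical helpers illustrating bootstrap sampling and OOB error.

# Draw n indices with replacement; indices never drawn form the OOB sample.
def bootstrap_indices(n, rng = Random.new)
  in_bag = Array.new(n) { rng.rand(n) }
  out_of_bag = (0...n).to_a - in_bag
  [in_bag, out_of_bag]
end

# Mean squared error, for regression.
def mean_squared_error(predicted, observed)
  predicted.zip(observed).sum { |p, o| (p - o)**2 } / predicted.size.to_f
end

# Fraction of wrong labels, for classification.
def misclassification_rate(predicted, observed)
  predicted.zip(observed).count { |p, o| p != o } / predicted.size.to_f
end

in_bag, oob = bootstrap_indices(10, Random.new(1))
puts "in bag: #{in_bag.inspect}, OOB: #{oob.inspect}"
puts mean_squared_error([1.0, 2.0], [1.0, 4.0])         # 2.0
puts misclassification_rate([0, 1, 1, 0], [0, 1, 0, 0]) # 0.25
```

Because each tree sees only its in-bag individuals, the OOB individuals act as a small validation set for that tree at no extra cost.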
Testing: For prediction, a sample is pushed down each tree and assigned the label of the training sample in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the average vote of all trees is reported as the random forest prediction.
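A minimal sketch of the aggregation step, assuming each tree has already produced its prediction for one individual. Averaging suits regression; for classification, a majority vote is the usual counterpart and is shown here as an assumption. Both method names are hypothetical.

```ruby
# Regression: average the per-tree predictions.
def forest_average(tree_predictions)
  tree_predictions.sum.to_f / tree_predictions.size
end

# Classification: report the label predicted by most trees.
def forest_majority(tree_labels)
  tree_labels.group_by(&:itself).max_by { |_label, votes| votes.size }.first
end

puts forest_average([1.0, 2.0, 3.0])  # 2.0
puts forest_majority([0, 1, 1, 1, 0]) # 1
```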
Nimbus can be used both with regression and classification problems.
Regression: the default mode.
Classification: activated by the user by declaring `classes` in the configuration file.
By default, Nimbus will estimate SNP importances every time a training file is run to create a forest.
You can disable this behaviour (and speed up the training process) by setting the parameter
in the configuration file.
Once you have nimbus installed in your system, you can run the gem using the nimbus executable:
It will look for these files:
To train a forest, a training file is needed. To run the testing, you need two files: the testing file and one of the other two (the training file OR the random_forest file), because nimbus needs a forest from which to obtain the predictions.
The values for the input data files and the forest can be specified in the config.yml file, which should be located in the directory where you are running `nimbus`.
The config.yml has the following structure and parameters:
    #Input files
    input:
      training: training_classification.data
      testing: testing_classification.data
      forest: my_forest.yml
      classes: [0, 1]

    #Forest parameters
    forest:
      forest_size: 10 #how many trees
      SNP_sample_size_mtry: 60 #mtry
      SNP_total_count: 200
      node_min_size: 5
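For illustration, a configuration with this structure can be read with Ruby's standard YAML library. This is a sketch of parsing the format, not Nimbus's own loading code; the inline string stands in for a `config.yml` on disk, and the `mode` variable is a hypothetical name.

```ruby
require 'yaml'

# Stand-in for the contents of config.yml.
config_text = <<~CONFIG
  input:
    training: training_classification.data
    testing: testing_classification.data
    forest: my_forest.yml
    classes: [0, 1]
  forest:
    forest_size: 10
    SNP_sample_size_mtry: 60
    SNP_total_count: 200
    node_min_size: 5
CONFIG

config = YAML.safe_load(config_text)

# Declaring `classes` signals classification; otherwise regression is assumed.
mode = config['input'].key?('classes') ? :classification : :regression

puts mode
puts config['forest']['forest_size']
```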
Options under the input chapter:
Options under the forest chapter:
The three input files you can use with Nimbus must be properly formatted:
The training file has any number of rows, each representing data for an individual, with these columns:
The testing file has any number of rows, each representing data for an individual, similar to the training file but without the phenotype column:
The forest file contains the structure of a forest in YAML format. It is the output file of a nimbus training run.
Nimbus will generate the following output files:
After training:
After testing:
Issues, bugs and feature requests
Genome-wide prediction of discrete traits using Bayesian regressions and machine learning
Nimbus was developed by Juanjo Bazán in collaboration with Oscar González-Recio.
Copyright © Juanjo Bazán, released under the MIT license
Web template by Fernando Guillén