Nimbus

Implementing Random Forest in a genome-wide prediction context


Overview

Nimbus is a Ruby gem that implements Random Forest in a genome-wide prediction context (ref.). Nimbus trains the algorithm using an input file (the learning sample) containing the phenotypes of the individuals and their respective genotype markers (i.e. SNPs). A random forest is created and stored in a .yml file for future use.

Nimbus can also be run to make predictions on a validation set, or on data for which the response variable has not yet been observed. In this case, the predictions can be obtained either with a random forest created from a learning sample or with a previously stored random forest.

If a learning sample is provided, the gem will create a file with the variable importance of each feature (marker) in the data. The higher the importance, the more relevant the marker is for correctly predicting the response variable in new data.

Nimbus can be used for both classification and regression problems, and the user may provide different parameter values in a configuration file to tune the performance of the algorithm.

Installation

$ gem install nimbus

Prerequisites: Ruby and RubyGems installed on your system.

Random Forest

The random forest algorithm was first proposed by Breiman (2001). It can be classified as a massively non-parametric machine-learning algorithm. RF makes use of bagging and randomization, constructing many decision trees (ref) on bootstrapped samples of a given data set. The predictions from the trees are averaged to make the final prediction. The algorithm is robust to over-fitting and able to capture complex interaction structures in the data, which may alleviate the problems of analyzing genome-wide data.

In machine learning terms, it is an ensemble algorithm that uses multiple models to obtain better predictive performance than that obtained from any of the single models (trees).
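As a minimal illustration of the averaging step for a regression problem (the values below are made up, and this is not Nimbus's internal code):

    # Illustration only: the forest prediction is the average of the per-tree predictions.
    tree_predictions = [10.2, 11.8, 9.7, 10.9]                  # e.g. one value per tree
    forest_prediction = tree_predictions.sum / tree_predictions.size.to_f
    puts forest_prediction                                       # => 10.65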

Learning algorithm

Let y (n×1) be the data vector of observations for the outcome of a given trait, and X = {xi}, where xi is a (p×1) vector representing the genotype of each animal (0, 1 or 2) for p SNPs, from which T decision trees are built (see classification and regression tree theory).

Note that main SNP effects, SNP interactions, environmental factors or combinations thereof may also be included in xi. This ensemble can be described as an additive expansion of the form ŷ = (1/T) Σ_{t=1..T} ht(y, X), i.e. the final prediction is the average of the predictions of the T individual trees ht.

Training: Each tree (ht(y, X) for t ∈ {1, ..., T}) in the forest is constructed using the following algorithm:

Let the number of training cases be N, and the number of variables (SNPs) in the classifier be M. Then:

1. A bootstrapped sample of N cases is drawn at random with replacement from the original data and used to grow the tree. The cases left out of the sample (on average roughly one third of the cases) form the out-of-bag (OOB) sample for that tree.
2. At each node, a number m of SNPs (the mtry parameter, with m much smaller than M) is selected at random, and the best split among those m SNPs is used to split the node.
3. The tree is grown as far as possible (until a minimum node size is reached), without pruning.
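As an illustration of these steps, here is a minimal Ruby sketch under assumed data structures (each sample is a Hash with :snps, an array of 0/1/2 genotypes, and a numeric :phenotype). All method names are hypothetical and this is not Nimbus's actual implementation:

    # Illustrative sketch only: generic random-forest tree construction for regression.
    def build_tree(samples, mtry:, node_min_size:)
      # 1. Bootstrap: draw N cases with replacement; cases never drawn form the OOB sample.
      bootstrap = Array.new(samples.size) { samples.sample }
      oob = samples - bootstrap
      [grow_node(bootstrap, mtry: mtry, node_min_size: node_min_size), oob]
    end

    def grow_node(samples, mtry:, node_min_size:)
      # 3. Stop splitting when the node is small enough (regression: predict the mean phenotype).
      return leaf(samples) if samples.size <= node_min_size
      # 2. Pick mtry SNP indices at random and keep the best split among those candidates.
      candidates = (0...samples.first[:snps].size).to_a.sample(mtry)
      split = best_split_on(samples, candidates)
      return leaf(samples) if split.nil?
      snp, threshold, left, right = split
      { snp: snp, threshold: threshold,
        left:  grow_node(left,  mtry: mtry, node_min_size: node_min_size),
        right: grow_node(right, mtry: mtry, node_min_size: node_min_size) }
    end

    def leaf(samples)
      { prediction: samples.sum { |s| s[:phenotype] } / samples.size.to_f }
    end

    # Score candidate splits by the total squared error of the two resulting nodes.
    def best_split_on(samples, candidate_snps)
      best = nil
      candidate_snps.each do |snp|
        [0, 1].each do |threshold|                          # genotypes are coded 0, 1 or 2
          left, right = samples.partition { |s| s[:snps][snp] <= threshold }
          next if left.empty? || right.empty?
          score = sse(left) + sse(right)
          best = [score, snp, threshold, left, right] if best.nil? || score < best[0]
        end
      end
      best && best[1..-1]
    end

    def sse(samples)
      mean = samples.sum { |s| s[:phenotype] } / samples.size.to_f
      samples.sum { |s| (s[:phenotype] - mean)**2 }
    end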

Generalization error: This error is calculated using the OOB sample. Observations from the OOB sample are passed down the tree and assigned the label of the terminal node they end up in. The MSE (or the misclassification rate, in classification) is then calculated by comparing these predictions to the observed phenotypes in the OOB sample.

Testing: For prediction, a sample is pushed down each tree and assigned the label of the training samples in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the average (or the majority vote) over all trees is reported as the random forest prediction.
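Continuing the same hypothetical sketch from the training step above (again, an illustration under assumed data structures, not Nimbus's internal code), prediction and the OOB error could look like this:

    # Push one sample down a single tree and return the value of its terminal node.
    def tree_predict(node, sample)
      return node[:prediction] if node.key?(:prediction)
      branch = sample[:snps][node[:snp]] <= node[:threshold] ? node[:left] : node[:right]
      tree_predict(branch, sample)
    end

    # Random forest prediction: average of the predictions of all trees (regression).
    def forest_predict(trees, sample)
      trees.sum { |tree| tree_predict(tree, sample) } / trees.size.to_f
    end

    # Generalization error of one tree: mean squared error over its out-of-bag samples.
    def oob_mse(tree, oob_samples)
      errors = oob_samples.map { |s| (tree_predict(tree, s) - s[:phenotype])**2 }
      errors.sum / errors.size.to_f
    end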

Regression and Classification

Nimbus can be used with both regression and classification problems.

Regression: the default mode.

Classification: activated by declaring `classes` in the configuration file (see the config.yml example below).

Variable importances

By default, Nimbus will estimate SNP importances every time a training file is run to create a forest.

You can disable this behaviour (and speed up the training process) by setting the parameter `var_importances: No` in the configuration file.
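The page does not detail how the importances are computed. Purely as a conceptual illustration, permutation importance (a common approach in random forests, not necessarily the one Nimbus implements) measures how much a tree's OOB error increases when the values of a single SNP are shuffled. The sketch below reuses the hypothetical tree_predict / oob_mse helpers from the Learning algorithm section:

    # Conceptual sketch of permutation importance for one SNP (illustration only).
    def permutation_importance(tree, oob_samples, snp_index)
      baseline = oob_mse(tree, oob_samples)
      shuffled = oob_samples.map { |s| s[:snps][snp_index] }.shuffle
      permuted = oob_samples.each_with_index.map do |s, i|
        snps = s[:snps].dup
        snps[snp_index] = shuffled[i]
        s.merge(snps: snps)
      end
      oob_mse(tree, permuted) - baseline   # increase in OOB error = importance of this SNP
    end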

Getting Started

Once you have nimbus installed in your system, you can run the gem using the nimbus executable:

$ nimbus

It will look for these files:

training.data: If found, it will be used to build a random forest.
testing.data: If found, it will be pushed down the forest to obtain predictions for every individual in the file.
random_forest.yml: If found, it will be used as the forest for the testing.

Thus, in order to train a forest, a training file is needed. To do the testing you need two files: the testing file plus one of the other two (the training file OR the random_forest file), because Nimbus needs a forest from which to obtain the predictions.
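For example, a run that trains a forest and then predicts the individuals of a testing set could look like this (a sketch assuming the default file names and a config.yml as described in the next section):

    $ ls
    config.yml  testing.data  training.data
    $ nimbus

Nimbus then generates the output files described in the Output files section.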

Configuration (config.yml)

The values for the input data files and the forest can be specified in the config.yml file, which should be located in the directory where you are running `nimbus`.

The config.yml has the following structure and parameters:

    #Input files
    input:
      training: training_classification.data
      testing: testing_classification.data
      forest: my_forest.yml
      classes: [0, 1]

    #Forest parameters
    forest:
      forest_size: 10 #how many trees
      SNP_sample_size_mtry: 60 #mtry
      SNP_total_count: 200
      node_min_size: 5
        

Options under the input section:

training: path to the file containing the learning sample (training data).
testing: path to the file containing the individuals to be predicted.
forest: path to a previously stored random forest (.yml) to be used for the testing.
classes: list of the class labels; declaring it activates classification mode (omit it for regression problems).

Options under the forest section:

forest_size: number of trees in the forest.
SNP_sample_size_mtry: number of SNPs sampled at random as split candidates at every node (the mtry parameter).
SNP_total_count: total number of SNPs per individual in the data files.
node_min_size: minimum number of individuals a node must contain to be split further.

Input files

The three input files you can use with Nimbus should follow these formats:

The training file has any number of rows, each one representing the data of one individual (its phenotype and its SNP genotypes).

The testing file has any number of rows, each one representing the data of one individual, similar to the training file but without the phenotype column.

The forest file contains the structure of a forest in YAML format. It is the output file of a nimbus training run.
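The exact column layout of the data files is not reproduced on this page. Purely as a hypothetical illustration (the identifier column, the column order and the separators are assumptions, not taken from the Nimbus documentation), a training row could combine an individual ID, its phenotype and its SNP genotypes coded as 0, 1 or 2, while a testing row would omit the phenotype:

    # hypothetical training row: ID, phenotype, SNP genotypes
    ind001 12.4 0 1 2 0 0 1 2 1 0 ...
    # hypothetical testing row: ID, SNP genotypes
    ind002 0 1 2 0 0 1 2 1 0 ...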

Output files

Nimbus will generate the following output files:

After training:

random_forest.yml: A file defining the structure of the computed Random Forest. It can be used as input forest file.
generalization_errors.txt: A file with the generalization error for every tree in the forest.
training_file_predictions.txt: A file with predictions for every individual from the training file.
snp_importances.txt: A file with the computed importance of every SNP (unless var_importances is set to 'No' in the config file).

After testing:

testing_file_predictions.txt: A file with the genomic predicted merit of the individuals in the testing set. In classification problems, it describes the probability of each individual belonging to each class.

Resources

Source code

Issues, bugs and feature requests

Online rdocs

Nimbus at rubygems.org

Random Forest at Wikipedia

RF Leo Breiman page

Genome-wide prediction of discrete traits using Bayesian regressions and machine learning

Credits

Nimbus was developed by Juanjo Bazán in collaboration with Oscar González-Recio.

Copyright © Juanjo Bazán, released under the MIT license

Web template by Fernando Guillén