Concurrent Learning of Large-Scale Random Forests

Bostr&#246;m, Henrik

doi:10.3233/978-1-60750-754-3-20

Abstract

The random forest algorithm belongs to the class of ensemble learning methods that are embarassingly parallel, i.e., the learning task can be straightforwardly divided into subtasks that can be solved independently by concurrent processes. A parallel version of the random forest algorithm has been implemented in Erlang, a concurrent programming language originally developed for telecommunication applications. The implementation can be used for generating very large forests, or handling very large datasets, in a reasonable time frame. This allows for investigating potential gains in predictive performance from generating large-scale forests. An empirical investigation on 34 datasets from the UCI repository shows that forests of 1000 trees significantly outperform forests of 100 trees with respect to accuracy, area under ROC curve (AUC) and Brier score. However, increasing the forest sizes to 10 000 or 100 000 trees does not give any further significant performance gains.

This website uses cookies

This website uses cookies