Feature selection (FS) is essential for the analysis of genomic datasets with millions of features. In such contexts, Big Data tools are paramount, but the use of standard machine learning models is limited for data with such low instance-to-feature ratios. Apache Spark is a distributed in-memory big data system with the potential to overcome this bottleneck. This study analyzes genomic data related to the prediction of human obesity. Since Apache Spark is unable to cope with our dataset of ≈ 0.74 million features, we propose a pipeline that solves this problem using partitioning strategies, both vertical, by dividing the data based on gender, and horizontal, by splitting each chromosome into 5,000-instance subsets. For each subset, Minimum Redundancy and Maximum Relevance (mRMR) FS is used to rank the most relevant features. The challenge, thus, is to make accurate obesity predictions with parsimonious subsets of features selected from millions of them. We tackle it by defining a two-phase pipeline: first learning with individual chromosomes, and then learning with the features selected from all 22 chromosomes joined together.
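The partition-then-rank idea described above can be illustrated with a minimal sketch. It is not the study's Spark implementation: it runs in plain Python on a toy genotype table, uses the common "relevance minus mean redundancy" form of the mRMR score with discrete mutual information, and omits both the gender split and the downstream classifier. The feature names (`snp_a` to `snp_d`), the helper names, and the parameters `chunk_size` and `k_per_chunk` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr_rank(columns, labels, k):
    """Greedy mRMR: repeatedly add the feature with the best
    relevance-minus-mean-redundancy score (an assumed score form)."""
    relevance = {j: mutual_information(col, labels)
                 for j, col in columns.items()}
    selected, remaining = [], set(columns)

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(mutual_information(columns[j], columns[s])
                         for s in selected) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (first name wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def select_features(columns, labels, chunk_size, k_per_chunk):
    """Split the feature set into fixed-size blocks, rank each block
    independently, and join the per-block winners."""
    ids = sorted(columns)
    kept = []
    for start in range(0, len(ids), chunk_size):
        block = {j: columns[j] for j in ids[start:start + chunk_size]}
        kept.extend(mrmr_rank(block, labels, k_per_chunk))
    return kept


# Toy data: 8 individuals, binary phenotype, 4 hypothetical features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "snp_a": [0, 0, 0, 1, 1, 1, 1, 1],  # strongly informative
    "snp_b": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of snp_a
    "snp_c": [0, 1, 0, 0, 1, 1, 0, 1],  # weakly informative
    "snp_d": [0, 1, 0, 1, 0, 1, 0, 1],  # uninformative
}

print(mrmr_rank(toy, labels, 2))           # ['snp_a', 'snp_c']
print(select_features(toy, labels, 2, 1))  # ['snp_a', 'snp_c']
```

In this toy case the duplicate `snp_b` ties with `snp_a` on relevance but is never chosen once `snp_a` is in the selected set, because its redundancy with `snp_a` outweighs its relevance. Ranking each block separately keeps the number of pairwise redundancy computations bounded by the block size rather than by the full feature count, which is the motivation for partitioning in the first place.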