Big Data Analytics for Obesity Prediction

Bilal, Ahsan; Vellido, Alfredo; Ribas, Vicent

doi:10.3233/978-1-61499-918-8-141

Abstract

Feature selection (FS) is essential for the analysis of genomic datasets with millions of features. In this context, Big Data tools are paramount, but the use of standard machine learning models is limited for data with such low instance-to-feature ratios. Apache Spark is a distributed in-memory big data system with the potential to overcome this bottleneck. This study analyzes genomic data related to the prediction of human obesity. Since Apache Spark is unable to cope with our dataset of ≈ 0.74 million features, we propose a pipeline that solves this problem using partitioning strategies, both vertical, by dividing the data based on gender, and horizontal, by splitting each chromosome into 5,000-instance subsets. For each subset, Minimum Redundancy Maximum Relevance (mRMR) FS is used to rank the most relevant features. The challenge, thus, is to make accurate obesity predictions with parsimonious subsets of features selected from millions of candidates. We tackle it by defining a two-phase pipeline: first learning with individual chromosomes, and then learning with the features selected from all 22 chromosomes joined together.
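The partition-then-rank idea in the abstract can be illustrated with a small, single-machine sketch. It is not the authors' Spark pipeline: it assumes discrete features (for example genotype codes), uses the difference form of mRMR (relevance minus mean redundancy, both measured by mutual information), and treats each subset as a group of features. The names `mutual_information`, `mrmr_rank`, `chunked`, and `two_phase_selection`, as well as the toy data, are hypothetical and introduced only for illustration.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def mrmr_rank(features, target, k):
    """Greedy mRMR ranking (difference form), an illustrative assumption.

    features: dict mapping feature name -> list of discrete values.
    Returns up to k feature names in the order they were selected.
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    redundancy_sum = {name: 0.0 for name in features}
    remaining = set(features)
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            # Mean redundancy with the features already selected (0 for the first pick).
            mean_red = redundancy_sum[name] / len(selected) if selected else 0.0
            return relevance[name] - mean_red

        # Sorting makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        for name in remaining:
            redundancy_sum[name] += mutual_information(features[name], features[best])
    return selected


def chunked(names, size):
    """Yield consecutive slices of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def two_phase_selection(features, target, chunk_size, k_per_chunk, k_final):
    """Phase 1: rank each subset separately. Phase 2: rank the pooled survivors."""
    survivors = []
    for chunk in chunked(sorted(features), chunk_size):
        subset = {name: features[name] for name in chunk}
        survivors.extend(mrmr_rank(subset, target, k_per_chunk))
    pooled = {name: features[name] for name in survivors}
    return mrmr_rank(pooled, target, k_final)


# Toy data (hypothetical): 8 individuals, binary obesity label, 3 discrete features.
target = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 0],      # informative
    "snp_a_dup": [0, 0, 0, 0, 1, 1, 1, 0],  # exact copy of snp_a, fully redundant
    "snp_c": [0, 0, 1, 1, 0, 1, 1, 1],      # weakly informative, little overlap with snp_a
}
top_two = mrmr_rank(toy_features, target, k=2)
pooled_top = two_phase_selection(toy_features, target, chunk_size=2, k_per_chunk=1, k_final=2)
```

In this toy case the duplicated feature is as relevant as the original, but its redundancy penalty pushes it below the weaker yet complementary feature, which is the behaviour mRMR is designed to produce. In a distributed setting, each subset's ranking could in principle be computed independently before the pooled second phase.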

IOS Press Copyright 2024
