Feature selection (FS) is essential for the analysis of genomic datasets with millions of features. In such a context, Big Data tools are paramount, but the use of standard machine learning models is limited for data with such low instance-to-feature ratios. Apache Spark is a distributed in-memory Big Data system with the potential to overcome this bottleneck. This study analyzes genomic data related to the prediction of human obesity. Since Apache Spark is unable to cope with our dataset of ≈ 0.74 million features, we propose a pipeline that solves this problem with two partitioning strategies: vertical, dividing the data by gender, and horizontal, splitting each chromosome into 5,000-instance subsets. For each subset, Minimum Redundancy Maximum Relevance (mRMR) FS was used to rank the most relevant features. The challenge, thus, is to make accurate obesity predictions with parsimonious subsets of features selected from millions of them. We tackle it by defining a two-phase pipeline: first learning on individual chromosomes, and then learning on the features selected from all 22 chromosomes joined together.
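The partition-then-rank idea described above can be illustrated with a minimal, single-machine sketch. It is not the authors' Spark implementation: the function names (`mrmr_rank`, `chunked`, `two_phase_select`) and all parameters are hypothetical, and absolute Pearson correlation is swapped in as a simple stand-in for the mutual-information terms that mRMR normally uses.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mrmr_rank(columns, target, k):
    """Greedy mRMR-style ranking of up to k features.

    `columns` maps a feature name to its list of values. At each step the
    feature with the highest (relevance - mean redundancy) is selected.
    Assumption of this sketch: |Pearson correlation| replaces mutual information.
    """
    relevance = {f: abs(pearson(v, target)) for f, v in columns.items()}
    selected = []
    remaining = list(columns)
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(
                abs(pearson(columns[f], columns[s])) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def chunked(names, size):
    """Split a list of feature names into consecutive subsets of at most `size`."""
    return [names[i:i + size] for i in range(0, len(names), size)]


def two_phase_select(columns, target, chunk_size, k_per_chunk, k_final):
    """Phase 1: rank features within each subset. Phase 2: rank the joined survivors."""
    survivors = []
    for chunk in chunked(list(columns), chunk_size):
        subset = {f: columns[f] for f in chunk}
        survivors.extend(mrmr_rank(subset, target, k_per_chunk))
    joined = {f: columns[f] for f in survivors}
    return mrmr_rank(joined, target, k_final)
```

Ranking inside small subsets keeps each redundancy computation cheap, at the cost of ignoring redundancy between features in different subsets until the second phase. In a setting like the one described, `chunk_size` would play the role of the 5,000-element subsets, and the second phase would correspond to joining the features selected per chromosome.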