Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time
As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, my proposal, developed under the mentorship of In Kee Kim and Martin Putra together with collaborator Charis Christopher Hulu (another OSRE fellow), aims to analyze large-scale sequencing datasets to gain insight into how ‘input quality’ affects the execution times of genomic workflows.
Recent advances in Next-Generation Sequencing (NGS) technologies have produced massive amounts of nucleotide sequence data, along with automated genomic workflows that streamline analysis and data interpretation. The success of NGS-driven research has also led to a rapid influx of data of varying size and complexity, making it increasingly time-consuming for researchers to test hypotheses. Analyzing high-throughput genomic data requires the step-by-step execution of dedicated tools - a pipeline known as a workflow. The first step in a typical genomic analysis workflow is quality control of the raw data - a crucial step that removes low-quality data instances which could significantly distort downstream analysis.

Prior work suggests that qualitative differences in the input data affect the runtimes of genomic workflows, yet there is little consensus on what constitutes “input quality” for data from large genomic experiments. In this proposal, we hypothesize that genomic data quality significantly impacts workflow execution time. We aim to leverage machine learning techniques to extract predictive features from the output of quality control tools and use them to robustly predict workflow execution time.
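To make the idea concrete, here is a minimal sketch of the prediction setup: quality features per dataset (read count, mean per-base quality, GC content, duplication rate - the kind of metrics a QC tool such as FastQC reports) are regressed against workflow execution time. The feature set, the synthetic data, and the choice of a random forest are illustrative assumptions for this post, not the proposal’s finalized methodology.

```python
# Illustrative sketch only: synthetic QC-style features -> runtime regression.
# In the real project, features would be parsed from QC reports on actual
# sequencing datasets, and targets from measured workflow runtimes.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_samples = 200

# Hypothetical per-dataset quality features (assumed, not from the proposal).
X = np.column_stack([
    rng.integers(100_000, 100_000_000, n_samples),  # total reads
    rng.uniform(20, 40, n_samples),                 # mean Phred quality
    rng.uniform(0.3, 0.6, n_samples),               # GC fraction
    rng.uniform(0.0, 0.5, n_samples),               # duplication rate
])

# Synthetic target: runtime grows with read count and worsens with low
# quality and high duplication - a stand-in for real measurements.
y = (X[:, 0] / 1e5) * (1 + 2 * X[:, 3]) * (45 / X[:, 1]) \
    + rng.normal(0, 5, n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, pred):.1f} (time units)")
print("Feature importances:", model.feature_importances_.round(3))
```

The feature importances from a model like this are one way to probe which aspects of ‘input quality’ actually drive execution time, which is exactly the question the project sets out to answer on real workflows and datasets.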