<aside> <img src="/icons/mail_gray.svg" alt="/icons/mail_gray.svg" width="40px" /> Email
</aside>
<aside> <img src="/icons/follow_gray.svg" alt="/icons/follow_gray.svg" width="40px" /> @KlieAdam
</aside>
<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/911fee6a-b4c8-408c-bab9-97a18ece4d72/github.png" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/911fee6a-b4c8-408c-bab9-97a18ece4d72/github.png" width="40px" /> GitHub
</aside>
<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7e8fa7d2-6aab-431e-994a-e9fcf34b7b12/scholar.png" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7e8fa7d2-6aab-431e-994a-e9fcf34b7b12/scholar.png" width="40px" /> Scholar
</aside>
Bioinformatics & Systems Biology
I’m a fifth-year Ph.D. candidate in the Bioinformatics and Systems Biology (BISB) program at the University of California, San Diego.
My research focuses on understanding how gene expression is regulated (gene regulation). See below for my specific research aims.
The completion of the Human Genome Project in 2003 provided a nearly complete map of human DNA, setting the stage for the field of genomics to blossom. This milestone enabled scientists to explore differences in our DNA sequence (genetic variation) by comparing genomes from many different individuals.
Today, we have genome sequences for over half a million people, and studies of these genomes have identified hundreds of thousands of genetic variants linked to common diseases like heart disease, Alzheimer’s, and diabetes. Yet understanding the biological mechanisms behind these links remains a challenge. A significant portion of these genetic variants fall in regions of the genome that do not produce proteins (called the non-coding genome) but instead play a pivotal role in regulating protein production in cells. Despite its importance, the non-coding genome is far less understood than the regions that code for proteins.
In my thesis, I use machine learning to analyze large-scale genomics datasets, aiming to predict how genetic variation in the non-coding genome impacts biological function.
Building a software ecosystem for machine learning in genomics
When I started my Ph.D. in 2019, machine learning (ML) had already made a substantial impact in genomics. However, turning a new dataset into biological insight using ML was, and remains, way harder than the publications make it seem.
Steep learning curves are everywhere. Genomics is full of specialized assays, file formats, and software. The data is big and complex, and there are often many things to analyze and to look out for. It’s crucial to be able to explore the data, but getting to that point often requires several steps.
Reproducing results is hard. Reproducibility, a cornerstone of scientific research, is particularly challenging in genomics studies that employ machine learning techniques.
Applying machine learning in genomics is full of potential pitfalls [1]. I would hazard a guess that most projects centered on applying ML in genomics end in one of two ways: (1) a pitfall makes some poor grad student or postdoc believe that their model isn’t nearly good enough to cut it, or (2) a pitfall makes some poor grad student or postdoc believe they have trained the model that will solve biology.
It can be hard to interpret what a model is doing.
Best practices are a work in progress. Simplifying the way we communicate and use tools and resources is key for linking our methods to actual biology.
There is no unified software framework for it. scikit-learn, PyTorch, and Keras are great and essential for widespread usage of ML, but they were not principally designed for genomics. Tools that do exist specifically for genomics can be hard to get working, difficult to use, and often are not easily interoperable or extendable.
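As a small illustration of the glue code that general-purpose ML libraries leave to the user, here is a minimal sketch of one-hot encoding a DNA sequence with NumPy. The function name `one_hot_encode` and the choice to map ambiguous bases to all-zero rows are assumptions made for this example, not the conventions of any particular genomics tool.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed base ordering for the four columns


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) array; bases outside ACGT (e.g. N) become all-zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:
            encoded[pos, col] = 1.0
    return encoded


# Hypothetical five-base sequence ending in an ambiguous base.
example = one_hot_encode("ACGTN")
```

Every project tends to rewrite a step like this slightly differently (base order, handling of `N`, array layout), which is one reason results can be hard to compare across tools.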
Tools
Docs | Preprint | Publication | GitHub
Talks
https://www.youtube.com/watch?v=47wbTR9yUpg
Learning the grammar of enhancer function
An exciting direction of ML in genomics is the study of genetic switches called enhancers: short DNA fragments that, when activated, signal the cell to produce certain proteins. Several ML models have been developed and interpreted in an effort to learn more about the sequence features and their interdependencies (collectively termed syntax or grammar) that drive enhancer activity [], including ML models that can be used to design cell-type-specific enhancers []. Defining the mechanistic roles of enhancer features during development and in tissues with complex patterns of expression remains a substantial challenge.
I leverage data from several massively parallel reporter assays (MPRAs), which offer a high-throughput methodology that directly tests the regulatory potential of regulatory elements (REs). Predicting outcomes of MPRAs using interpretable ML represents a powerful approach to understanding the sequence features within REs that encode function. Such models can help us find and validate functional genomic enhancers, prioritize enhancer features to test, and uncover dependencies that exist between features. We are hoping for a set of unifying principles or rules that govern enhancer function. ML models trained on MPRA data, especially in combination with other data types, will play a key role in helping us determine if, when, and where such principles exist in biology.
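To make the idea of an interpretable model of MPRA activity concrete, here is a toy sketch on purely synthetic data. It is not the modeling approach used in my work. The motif list, the simulated effect sizes, and the linear form of the model are all assumptions chosen for illustration. Each sequence is summarized by counts of a few candidate motifs, and an ordinary least-squares fit recovers one weight per motif that can be read directly as its estimated contribution to activity.

```python
import numpy as np

rng = np.random.default_rng(0)
MOTIFS = ["GATA", "TTCC", "CACG"]  # hypothetical candidate sequence features


def count_motif(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, allowing overlaps."""
    k = len(motif)
    return sum(seq[i:i + k] == motif for i in range(len(seq) - k + 1))


def featurize(seqs):
    """Build an (n_sequences, n_motifs) matrix of motif counts."""
    return np.array([[count_motif(s, m) for m in MOTIFS] for s in seqs], dtype=float)


# Synthetic stand-in for an MPRA library: 500 random 200-bp sequences.
seqs = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(500)]
X = featurize(seqs)

# Simulated activity: an assumed activator, a neutral motif, and a repressor, plus noise.
true_weights = np.array([1.5, 0.0, -0.5])
y = X @ true_weights + rng.normal(scale=0.1, size=len(seqs))

# Ordinary least squares with an intercept column.
design = np.column_stack([X, np.ones(len(seqs))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
learned = dict(zip(MOTIFS, coef[:-1]))  # one interpretable weight per motif
```

In this toy setting, `learned` should land close to the simulated weights. Real MPRA data is far messier, and dependencies between features are exactly what a purely additive model like this cannot capture.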
Building models that capture both cis and trans gene regulation
Enhancers act in cis. In many tissues, many enhancers act in concert.
To take the above aim a step further, we can also ask what roles these enhancers and other regulatory elements (REs) play in coordinating what a given cell does. Much work has gone into developing gene regulatory network (GRN) models of biological systems, but linking these GRNs to the CPs, and other phenotypes, remains a challenge.