
Home


<aside> <img src="/icons/mail_gray.svg" alt="/icons/mail_gray.svg" width="40px" /> Email

</aside>

<aside> <img src="/icons/follow_gray.svg" alt="/icons/follow_gray.svg" width="40px" /> @KlieAdam

</aside>

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/911fee6a-b4c8-408c-bab9-97a18ece4d72/github.png" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/911fee6a-b4c8-408c-bab9-97a18ece4d72/github.png" width="40px" /> GitHub

</aside>

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7e8fa7d2-6aab-431e-994a-e9fcf34b7b12/scholar.png" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7e8fa7d2-6aab-431e-994a-e9fcf34b7b12/scholar.png" width="40px" /> Scholar

</aside>

Ph.D. Candidate

Bioinformatics & Systems Biology

I’m a fifth-year Ph.D. candidate in the Bioinformatics and Systems Biology (BISB) program at the University of California, San Diego.

My research focuses on understanding how gene expression is regulated (see below for my specific research aims).

🎓 Education

B.S. in Bioengineering: Bioinformatics, 2017

Curriculum Vitae

Research Aims

The completion of the Human Genome Project in 2003 provided a nearly complete map of human DNA, setting the stage for the field of genomics to blossom. This milestone enabled scientists to explore differences in our DNA sequence (genetic variation) by comparing genomes from many different individuals.

Today, genome sequences are available for over half a million people, and studies of these genomes have identified hundreds of thousands of genetic variants linked to common diseases like heart disease, Alzheimer’s, and diabetes. Yet understanding the biological mechanisms behind these links remains a challenge. A significant portion of these variants lies in regions of the genome that do not produce proteins (called the non-coding genome) but instead play a pivotal role in regulating protein production in cells. Despite its importance, the non-coding genome is far less understood than the regions that code for proteins.

In my thesis, I use machine learning to analyze large-scale genomics datasets, aiming to predict how genetic variation in the non-coding genome impacts biological function.

Building a software ecosystem for machine learning in genomics

When I started my Ph.D. in 2019, machine learning (ML) was already making a substantial impact in genomics. However, applying ML successfully is consistently harder than the latest papers suggest, for several reasons.

Steep learning curves are everywhere. Genomics involves specialized assays, unique file formats, and buggy software. Rapid exploration of data and ideas, essential for effective analysis, often requires extensive experience, making it challenging for newcomers.
Pitfalls are prevalent [1]. ML pitfalls, though becoming increasingly well-documented, can be difficult to detect. If not recognized, they often lead to undervalued datasets or overestimated model capability.
Model interpretation is difficult. Understanding a model's predictions is essential in biomedical research. Despite advances in interpretability techniques [2], the complexity and steep learning curve associated with these methods often keep models effectively opaque.
Reproducing results is challenging. Reproducibility is fundamental to scientific research. However, in genomics studies that utilize ML, the issues listed above often make replication difficult, if not impossible.
Best practices are evolving. Best practices for applying ML in genomics are still under development. While it is crucial to consolidate, reconcile, and document existing methodologies for broader adoption and discourse, this process is often undervalued.
No unified framework currently exists. While general machine learning frameworks like scikit-learn, PyTorch, and Keras are widely used, they are not specifically tailored for genomics. Genomics-focused tools, although available, often present challenges in usability, interoperability, and extensibility.

Tools

Docs | Publication | Preprint | GitHub

Talks

https://www.youtube.com/watch?v=47wbTR9yUpg

Learning the grammar of enhancer function


A promising direction for ML in genomics is the study of genetic switches called enhancers. Enhancers are short DNA fragments that, when activated, signal to the cell to create certain proteins. Several ML models have been developed and interpreted in an effort to learn more about the sequence features and their inter-dependencies (collectively termed syntax or grammar) that drive enhancer activity. These include models that can be used to design cell-type- or tissue-specific enhancers, a particularly exciting application with implications for synthetic biology. Despite this progress, defining the mechanistic roles of enhancer features during development or in tissues with complex patterns of expression remains a substantial challenge.

Building models that capture both cis and trans gene regulation (WIP)

Enhancers act in cis. In many tissues, many enhancers act in concert.

To take the above aim a step further, we can also ask what roles these enhancers and other regulatory elements (REs) play in coordinating what a given cell does. Much work has gone into developing gene regulatory network (GRN) models of biological systems, but linking these GRNs to the CPs, and other phenotypes, remains a challenge.

My journey

“The road goes ever on and on…”