GBI | My Site

Flint's Engineering Portfolio

Bioinformatics of Rare and Endangered California Plants

Green Biome Institute

As a consultant for the Green Biome Institute (GBI) at California State University East Bay, I have owned a variety of projects that advance its mission of preserving and studying all of the rare and endangered plants of California using their genetic information. These projects have included setting up an Amazon Web Services (AWS) environment for bioinformatics work, performing bioinformatics analysis to support the research of many faculty and students, teaching an introduction to bioinformatics class multiple times, finding and vetting new software, analyzing and monitoring AWS costs for our analysis work and student use, and storing and organizing sensitive data and results.

Annotated chloroplast genome

(top) Genome size estimation; (bottom) ploidy estimation

AWS EC2 instance with vetted bioinformatics software and pipelines
The GBI is currently producing substantial amounts of sequencing data for the rare and endangered plants of California (now with more than 115 plants sequenced, more rare and endangered plants sequenced and analyzed than any other organization). Because we do not have the same access to large computational environments as larger institutions with greater research funding, we use AWS EC2 instances to do the work that we cannot do locally. Certain algorithms require large amounts of RAM, and many CPUs if they are to finish in a timely manner. Genome assembly, the nontrivial problem of piecing billions of sequencing reads together based on overlapping sequences, is one example. Along with the faculty of the GBI, I did a literature review of genome assembly methodologies and software to identify the main tools currently used for genome assembly and annotation. I then built an AWS EC2 image with these ~60 tools installed, essentially a ready-to-use bioinformatics environment. I used a variety of these tools, testing certain ones against each other and against the literature. From this we created our current genome assembly and annotation pipeline, which looks like the following:

The left column above lists the types of analyses done and some of the software used; the right column lists the metrics we use to determine the quality of our analysis results.
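As a toy illustration of the overlap idea behind genome assembly described earlier, the following is a minimal sketch that greedily merges short reads by their longest suffix-prefix overlap. It is not the pipeline or any of the assemblers used at the GBI; the function names, the `min_len` threshold, and the three example reads are hypothetical. Real assemblers compare billions of reads, which is where the large RAM and CPU requirements come from.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical reads, already in left-to-right order for simplicity.
reads = ["ATGGCGTAC", "GTACCTTGA", "TTGAGGC"]

contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is not None:
        contig = merged

print(contig)
```

With unordered reads, every read would have to be compared against every other read, so the work grows roughly with the square of the number of reads unless indexing structures are used.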

Using this bioinformatics pipeline, we are in the final stages of preparing a paper for publication.

My current main project with the GBI is the bioinformatics analysis of 101+ rare and endangered California plant genomes. As of November 2020, 1,139 plant genomes had been submitted to public archives (1). With this project we are excited to contribute a significant amount of data and information to the world of plant genomics.

To do this bioinformatics work, I host the majority of our analysis on AWS EC2 instances. This requires using and monitoring rented compute capacity in AWS data centers, which costs between $1,000 and $10,000 in a given month.
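The arithmetic behind this kind of cost monitoring can be sketched as hourly rate times hours running, summed over instances and compared with a monthly budget. This is only an illustrative sketch: the instance names, hourly rates, and running hours below are hypothetical placeholders, not actual AWS prices or GBI usage figures.

```python
def monthly_cost(hourly_rate_usd: float, hours_running: float) -> float:
    """Estimated cost of one instance for the month, rounded to cents."""
    return round(hourly_rate_usd * hours_running, 2)


def total_cost(instances: dict) -> float:
    """Sum the estimated monthly cost over all instances.

    `instances` maps a label to a (hourly_rate_usd, hours_running) pair.
    """
    return round(
        sum(monthly_cost(rate, hours) for rate, hours in instances.values()), 2
    )


# Hypothetical instances: (hourly rate in USD, hours running this month).
instances = {
    "assembly-large": (6.00, 300),   # placeholder rate for a high-memory machine
    "teaching-small": (0.10, 730),   # placeholder rate, running all month
}

budget_usd = 10_000  # upper end of the monthly range mentioned above
total = total_cost(instances)
print(f"Estimated total: ${total:,.2f}")
print("Within budget" if total <= budget_usd else "Over budget")
```

The estimate shows why stopping large instances when they are idle matters: the high-memory machine dominates the total even though it runs for less than half the month.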

Population analysis of a particular set of rare and endangered California plant species

Separate analysis work that I did is being used in a project to determine the population structure of a particular series of rare and endangered California species. The analysis involved Principal Component Analysis (PCA) to identify the organization of genetic variation and Randomized Axelerated Maximum Likelihood (RAxML) to do phylogenetic analysis of these species. Labels and names in the following images are redacted because the results are currently with our collaborators, with some fairly interesting conclusions that are not yet published.

RAxML analysis

PCA Analysis

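To show how PCA can reveal the organization of genetic variation, here is a minimal sketch using NumPy on a made-up genotype matrix. It assumes genotypes are coded as 0, 1, or 2 copies of an alternate allele per site. The matrix, the two-group structure, and the `pca` helper are hypothetical and are not the data or the software used in the actual analysis.

```python
import numpy as np


def pca(genotypes: np.ndarray, n_components: int = 2):
    """PCA via singular value decomposition of the mean-centered matrix.

    Rows are samples, columns are variant sites.
    Returns (scores, explained_variance_ratio) for the leading components.
    """
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Hypothetical data: 6 samples x 5 sites, two loosely defined populations.
genotypes = np.array(
    [
        [0, 0, 1, 2, 2],
        [0, 1, 0, 2, 2],
        [0, 0, 0, 2, 1],
        [2, 2, 1, 0, 0],
        [2, 1, 2, 0, 0],
        [2, 2, 2, 0, 1],
    ],
    dtype=float,
)

scores, explained = pca(genotypes)
for sample, (pc1, pc2) in enumerate(scores):
    print(f"sample {sample}: PC1={pc1:+.2f} PC2={pc2:+.2f}")
print("explained variance ratio:", np.round(explained, 3))
```

In this toy matrix the first three samples and the last three samples fall on opposite sides of the first principal component, which is the kind of grouping a PCA plot makes visible.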

Finding new ways to present data - web scraping example

Often there is no clean, organized way to describe the results we get. For example, the tool BUSCO searches for highly conserved orthologs in our genome assemblies. This tells us which gene IDs are present, duplicated, fragmented, or missing, but it gives no information about the genes themselves. To generate that information, I learned the basics of the Selenium Python package and used it to scrape the relevant OrthoDB database, which contains information about these genes. It is important to me to find or come up with solutions to problems like this on the fly to keep research progress moving forward. This quick work allowed me to programmatically turn unreadable results (just a list of numbers) into a spreadsheet where the results are obvious (the name of each gene found and a URL linking to further information for that gene), which will be shared with the scientific community. The code is available at https://github.com/Green-Biome-Institute/AWS/blob/master/BUSCO_gene_list/web_scrape_BUSCO/busco_web_scraper.py (scroll down below the gene lists to see the code).
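The first half of this workflow, turning a BUSCO result table into a spreadsheet-friendly list, can be sketched with the standard library alone. This is not the linked scraper and it does not use Selenium. It assumes a tab-separated BUSCO table whose first two columns are the BUSCO ID and its status, with comment lines starting with `#`. The example IDs, the contig names, and the `ORTHODB_URL` pattern are placeholders for illustration; gene names would still have to be looked up in OrthoDB.

```python
import csv
import io
from collections import defaultdict

# Assumed URL pattern for looking up an ID in OrthoDB (placeholder).
ORTHODB_URL = "https://www.orthodb.org/?query={busco_id}"


def parse_busco_table(handle):
    """Group BUSCO IDs by status (Complete, Duplicated, Fragmented, Missing)."""
    statuses = defaultdict(set)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        busco_id, status = fields[0], fields[1]
        statuses[status].add(busco_id)  # a set, so duplicated IDs count once
    return statuses


def write_report(statuses, out_handle):
    """Write one CSV row per BUSCO ID with its status and a lookup URL."""
    writer = csv.writer(out_handle)
    writer.writerow(["busco_id", "status", "orthodb_url"])
    for status in sorted(statuses):
        for busco_id in sorted(statuses[status]):
            url = ORTHODB_URL.format(busco_id=busco_id)
            writer.writerow([busco_id, status, url])


# Hypothetical table contents with made-up IDs.
example_table = "\n".join(
    [
        "# Busco id\tStatus\tSequence\tGene Start\tGene End",
        "1001at33090\tComplete\tcontig_1\t100\t900",
        "1002at33090\tDuplicated\tcontig_2\t50\t700",
        "1002at33090\tDuplicated\tcontig_7\t10\t660",
        "1003at33090\tFragmented\tcontig_3\t5\t120",
        "1004at33090\tMissing",
    ]
)

statuses = parse_busco_table(io.StringIO(example_table))
report = io.StringIO()
write_report(statuses, report)
print(report.getvalue())
```

Writing to a file instead of `io.StringIO` gives a CSV that opens directly in a spreadsheet program.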

PAG 2023 Poster:

PAG 2024 Poster:

Conference attendance: State of the World's Plants and Fungi Symposium, Royal Botanic Gardens, Kew, London, England, October 2023.

(1) Kress, W. J., Soltis, D. E., Kersey, P. J., Wegrzyn, J. L., Leebens‐Mack, J. H., Gostel, M. R., Liu, X., & Soltis, P. S. (2022). Green plant genomes: What we know in an era of rapidly expanding opportunities. Proceedings of the National Academy of Sciences of the United States of America, 119(4). https://doi.org/10.1073/pnas.2115640118
