Genome-wide de novo prediction of transcription factor binding sites in prokaryotic genomes using comparative genomics approaches
The growing availability of thousands, and even tens of thousands, of sequenced prokaryotic genomes presents unprecedented challenges as well as opportunities for the computational prediction of transcription factor binding sites (TFBSs) in these genomes from sequence alone. We are developing algorithms and tools to predict all possible TFBSs in the sequenced genomes using traditional high-performance computing and cloud computing.
Determination of the complexity of prokaryotic transcriptomes under various conditions
Recent applications of directional RNA-seq techniques to transcriptome profiling in a few prokaryotes have revealed that their transcriptomes are more complex and dynamic than previously thought. Alternative and dynamic transcripts within operons have revolutionized the classic operon definition, and the pervasive expression of non-coding RNAs (ncRNAs) and antisense RNAs (asRNAs) further indicates that prokaryotes might have highly complex transcriptomes comparable to those of eukaryotes. However, the ubiquity of these observations has not been surveyed across a wide range of prokaryotes, and inconsistent and contradictory results are often reported even for the same strains, owing to technical challenges in directional RNA-seq library preparation and the subsequent bioinformatics analysis. We are developing algorithms and tools for assembling full-length prokaryotic transcripts and detecting varying (dynamic) expression levels along the assembled transcripts using RNA-seq short reads. Using these tools, we will determine alternative and dynamic operons, as well as ncRNA and asRNA expression patterns, under various culture conditions and at various time points in a few genomes of interest.
Deciphering the genetic code of cell fate/type determination during cell differentiation
One of the most important questions in biomedical research is how genetic programs are encoded in an animal’s genome, and how they are executed during embryogenesis to guide the development of a single-celled zygote into a mature embryo consisting of different types of cells, tissues and organs. We are developing a bottom-up approach to this problem using C. elegans as the model animal, taking advantage of two of its unusual properties: its mature embryo contains exactly 558 cells, and each cell is produced in an essentially invariant way. We first characterize the molecular signatures that determine the types of cells produced at different stages of embryogenesis using single-cell RNA-seq techniques, and then use these molecular signatures to reconstruct the gene transcriptional regulatory programs in each cell.
Prediction of cis-regulatory modules in eukaryotes by integrating various types of NGS datasets
Although cis-regulatory elements (CREs) and their clusters, termed cis-regulatory modules (CRMs), are at least as important as coding sequences in eukaryotes, our general understanding of the cis-regulatory systems in most sequenced eukaryotic genomes is very limited, owing to the difficulty of characterizing them. The recent development of ChIP-seq, RNA-seq, DNase-seq and Hi-C methods based on next-generation sequencing (NGS) technologies has greatly accelerated the characterization of CREs in genomes. However, analyzing and integrating these datasets to derive the maximum meaningful information about the CRMs and gene transcriptional regulatory networks in cells remains a highly challenging computational problem. We are developing a suite of algorithms and tools for inferring CRMs by integrating these various types of NGS datasets.