Representative Achievements

Established Tools and Databases

mSignatureDB: a database for deciphering mutational signatures in human cancer. Nucleic Acids Res. 2018 Jan 4;46(D1):D964-D970
Cancer is a genetic disease caused by somatic mutations; however, the understanding of the causative biological processes generating these mutations is limited. A cancer genome bears the cumulative effects of mutational processes during tumor development. Deciphering mutational signatures in cancer is a new topic in cancer research. The Wellcome Trust Sanger Institute (WTSI) has categorized 30 reference signatures in the COSMIC database based on the analyses of 10 000 sequencing datasets from TCGA and ICGC. Large cohorts and bioinformatics skills are required to perform the same analysis as WTSI. The quantification of known signatures in custom cohorts is not possible under the current framework of the COSMIC database, which motivates us to construct a database for mutational signatures in cancers and make such analyses more accessible to general researchers. mSignatureDB integrates R packages and in-house scripts to determine the contributions of the published signatures in 15 780 individual tumors from 73 TCGA/ICGC cancer projects, making it possible to compare signature patterns within and between projects. mSignatureDB also allows users to perform signature analysis on their own datasets, quantifying contributions of signatures at sample resolution, which is a unique feature of mSignatureDB not available in other related databases.
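The per-sample quantification of known signatures can be illustrated with a small sketch. mSignatureDB itself relies on R packages and in-house scripts that are not detailed here, so the Python code below swaps in a generic technique: a non-negative least-squares fit computed with multiplicative updates. The function names, the 4-channel catalog, and the two signatures are all hypothetical toy values (real COSMIC signatures are defined over 96 substitution channels).

```python
def fit_signatures(catalog, signatures, iterations=500, eps=1e-12):
    """Estimate non-negative exposures h so that sum_k h[k] * signatures[k] ~ catalog.

    Uses multiplicative updates for non-negative least squares:
    h[k] <- h[k] * (S^T v)[k] / (S^T S h)[k]
    """
    n_sig = len(signatures)
    n_chan = len(catalog)
    # Pre-compute S^T v and S^T S; both stay fixed during the iterations.
    stv = [sum(sig[c] * catalog[c] for c in range(n_chan)) for sig in signatures]
    sts = [[sum(a[c] * b[c] for c in range(n_chan)) for b in signatures]
           for a in signatures]
    # Start from a uniform, strictly positive guess.
    h = [sum(catalog) / n_sig] * n_sig
    for _ in range(iterations):
        denom = [sum(sts[k][j] * h[j] for j in range(n_sig)) for k in range(n_sig)]
        h = [h[k] * stv[k] / (denom[k] + eps) for k in range(n_sig)]
    return h


def to_proportions(exposures):
    """Convert absolute exposures into relative contributions that sum to 1."""
    total = sum(exposures)
    return [e / total for e in exposures] if total > 0 else [0.0] * len(exposures)


# Hypothetical toy data: two made-up signatures over four mutation channels.
TOY_SIGNATURES = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Catalog built as 70 mutations from the first signature plus 30 from the second.
toy_catalog = [52, 10, 10, 28]

exposures = fit_signatures(toy_catalog, TOY_SIGNATURES)
proportions = to_proportions(exposures)
print([round(p, 3) for p in proportions])
```

Because the toy catalog is an exact mixture of the two signatures, the fitted proportions approach 0.7 and 0.3. Real catalogs are noisy, so the fit would only approximate the underlying contributions.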

circlncRNAnet: an integrated web-based resource for mapping functional networks of long or circular forms of noncoding RNAs. GigaScience, Volume 7, Issue 1, 1 January 2018
Despite their lack of protein-coding potential, long noncoding RNAs (lncRNAs) and circular RNAs (circRNAs) have emerged as key determinants in gene regulation, acting to fine-tune transcriptional and signaling output. These noncoding RNA transcripts are known to affect expression of messenger RNAs (mRNAs) via epigenetic and post-transcriptional regulation. Given their widespread target spectrum, as well as extensive modes of action, a complete understanding of their biological relevance will depend on integrative analyses of systems data at various levels.
While a handful of publicly available databases have been reported, existing tools do not fully capture, from a network perspective, the functional implications of lncRNAs or circRNAs of interest. Through an integrated and streamlined design, circlncRNAnet aims to broaden the understanding of ncRNA candidates by testing in silico several hypotheses of ncRNA-based functions, on the basis of large-scale RNA-seq data. This web server is implemented with several features that represent advances in the bioinformatics of ncRNAs: (1) a flexible framework that accepts and processes user-defined next-generation sequencing–based expression data; (2) multiple analytic modules that assign and productively assess the regulatory networks of user-selected ncRNAs by cross-referencing extensively curated databases; (3) an all-purpose, information-rich workflow design that is tailored to all types of ncRNAs. Outputs on expression profiles, co-expression networks and pathways, and molecular interactomes, are dynamically and interactively displayed according to user-defined criteria.
Users may apply circlncRNAnet to obtain, in real time, multiple lines of functionally relevant information on circRNAs/lncRNAs of interest. In summary, circlncRNAnet provides a “one-stop” resource for in-depth analyses of ncRNA biology. circlncRNAnet is freely available at
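As a rough illustration of the co-expression idea mentioned among the outputs, the sketch below ranks mRNAs by their Pearson correlation with a selected ncRNA across samples. This is not circlncRNAnet's implementation; the function names, the correlation cutoff, and the expression values are assumptions chosen for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / math.sqrt(var_x * var_y)


def coexpressed_genes(ncrna_profile, mrna_profiles, min_abs_r=0.8):
    """Return (gene, r) pairs with |r| >= min_abs_r, strongest first."""
    hits = []
    for gene, profile in mrna_profiles.items():
        r = pearson(ncrna_profile, profile)
        if abs(r) >= min_abs_r:
            hits.append((gene, r))
    return sorted(hits, key=lambda hit: abs(hit[1]), reverse=True)


# Hypothetical expression values across five samples.
ncrna = [1.0, 2.0, 3.0, 4.0, 5.0]
mrnas = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with the ncRNA
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the ncRNA rises
    "GENE_C": [3.0, 1.0, 4.0, 1.0, 5.0],   # weakly related
}

network_edges = coexpressed_genes(ncrna, mrnas)
for gene, r in network_edges:
    print(gene, round(r, 3))
```

Keeping both positive and negative correlations reflects that an ncRNA may be associated with either activation or repression of its partners; a real analysis would also need multiple-testing control, which is omitted here.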

VAReporter: Variant reporter for cancer research of massive parallel sequencing. BMC Genomics 2018 19 (Suppl 2):86
High-throughput sequencing technologies have become an increasingly critical aspect of precision medicine owing to better identification of disease targets, which contributes to improved health care costs and clinical outcomes. In particular, disease-oriented targeted enrichment sequencing is becoming a widely accepted application for diagnostic purposes, which can interrogate known diagnostic variants as well as identify novel biomarkers from panels covering the entire human coding exome or disease-associated genes.
We introduce a workflow named VAReporter to facilitate the management of variant assessment in disease-targeted sequencing, the identification of pathogenic variants, the interpretation of biological effects and the prioritization of clinically actionable targets. State-of-the-art algorithms that account for mutation phenotypes are used to rank the importance of mutated genes through visual analytic strategies. We established an extensive annotation source by integrating a wide variety of biomedical databases and followed the American College of Medical Genetics and Genomics (ACMG) guidelines for interpretation and reporting of sequence variations.
In summary, VAReporter is the first web server designed to provide a “one-stop” resource for individual diagnosis and large-scale cohort studies, and is freely available at

Whole-exome sequencing, which centres on the protein coding regions of disease/cancer associated genes, represents the most cost-effective method to date for deciphering the association between genetic alterations and diseases. Large-scale whole exome/genome sequencing projects have been launched by various institutions, such as NCI, Broad Institute and TCGA, to provide a comprehensive catalogue of coding variants in diverse tissue samples and cell lines. Further functional and clinical interrogation of these sequence variations must rely on extensive cross-platform integration of sequencing information and a proteome database that explicitly and comprehensively archives the corresponding mutated peptide sequences. While such a data resource is critical for the mass spectrometry-based proteomic analysis of exomic variants, no database is currently available for the collection of mutant protein sequences that correspond to recent large-scale genomic data. To address this issue and serve as a bridge to integrate genomic and proteomic datasets, CMPD collected over 2 million genetic alterations, which not only facilitates the confirmation and examination of potential cancer biomarkers but also provides an invaluable resource for translational medicine research and opportunities to identify mutated proteins encoded by mutated genes.
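To make the link between a coding variant and a mutated peptide concrete, the following sketch applies a missense substitution to a protein sequence and returns the tryptic peptide that carries the altered residue. It is an illustration only, not CMPD's pipeline; the function names are hypothetical, the cleavage rule is the common simplification (cut after K or R unless followed by P), and the sequence is a short toy fragment corresponding to the N-terminus of KRAS.

```python
def apply_missense(protein, position, ref, alt):
    """Substitute one residue; position is 1-based, as in protein HGVS notation."""
    if protein[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}, "
                         f"found {protein[position - 1]}")
    return protein[:position - 1] + alt + protein[position:]


def tryptic_peptides(protein):
    """In-silico trypsin digest: cleave after K or R unless the next residue is P.

    Returns a list of (start, peptide) pairs with 1-based start coordinates.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (residue in "KR" and protein[i + 1] != "P"):
            peptides.append((start + 1, protein[start:i + 1]))
            start = i + 1
    return peptides


def mutant_peptide(protein, position, ref, alt):
    """Return the tryptic peptide of the mutated protein that covers the variant."""
    mutated = apply_missense(protein, position, ref, alt)
    for start, peptide in tryptic_peptides(mutated):
        if start <= position < start + len(peptide):
            return peptide
    raise ValueError("variant position lies outside the protein")


# Toy fragment (first 22 residues of KRAS) with the G12D substitution.
fragment = "MTEYKLVVVGAGGVGKSALTIQ"
peptide = mutant_peptide(fragment, 12, "G", "D")
print(peptide)  # LVVVGADGVGK
```

Peptides produced this way could be appended to a search database so that mass spectrometry spectra can be matched against variant sequences; missed cleavages and other variant types (frameshifts, indels) would need additional handling.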

Vanno: A Visualization‐Aided Variant Annotation Tool. Human Mutation, 36 (2), 2015, pp. 167-174.
Next‐generation sequencing (NGS) technologies have revolutionized the field of genetics and are trending toward clinical diagnostics. Exome and targeted sequencing in a disease context represent a major NGS clinical application, considering their utility and cost‐effectiveness. With the ongoing discovery of disease‐associated genes, various gene panels have been launched for both basic research and diagnostic tests. However, the fundamental inconsistencies among the diverse annotation sources, software packages, and data formats have complicated the subsequent analysis. To manage disease‐associated NGS data, we developed Vanno, a Web‐based application for in‐depth analysis and rapid evaluation of disease‐causative genome sequence alterations. Vanno integrates information from biomedical databases, functional predictions from available evaluation models, and mutation landscapes from TCGA cancer types. A highly integrated framework that incorporates filtering, sorting, clustering, and visual analytic modules is provided to facilitate exploration of oncogenomics datasets at different levels, such as gene, variant, protein domain, or three‐dimensional structure. Such a design is crucial for extracting knowledge from sequence alterations and translating biological insights into clinical applications. Taken together, Vanno supports almost all disease‐associated gene tests and exome sequencing panels designed for NGS, providing a complete solution for targeted and exome sequencing analysis. Vanno is freely available at

CPAP: Cancer Panel Analysis Pipeline. Human Mutation, 34 (10), 2013, pp. 1340-1346.
Targeted sequencing using next‐generation sequencing technologies is being rapidly adopted for clinical sequencing and cancer marker tests. However, no existing bioinformatics tool is available for the analysis and visualization of multiple targeted sequencing datasets. In the present study, we use cancer panel targeted sequencing datasets generated by the Life Technologies Ion Personal Genome Machine Sequencer as an example to illustrate how to develop an automated pipeline for the comparative analyses of multiple datasets. Cancer Panel Analysis Pipeline (CPAP) uses standard output files from variant calling software to generate a distribution map of SNPs among all of the samples in a circular diagram generated by Circos. The diagram is hyperlinked to a dynamic HTML table that allows the users to identify target SNPs by using different filters. CPAP also integrates additional information about the identified SNPs by linking to an integrated SQL database compiled from SNP‐related databases, including dbSNP, 1000 Genomes Project, COSMIC, and dbNSFP. CPAP takes only 17 min to complete a comparative analysis of 500 datasets. CPAP not only provides an automated platform for the analysis of multiple cancer panel datasets but can also serve as a model for any customized targeted sequencing project. CPAP is freely available at
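The comparative step, a distribution of SNPs across samples followed by filtering, can be sketched as below. This is a simplified stand-in rather than CPAP's code: the record layout (chromosome, position, reference allele, alternate allele), the function names, the recurrence filter, and all coordinates are illustrative assumptions, and the Circos rendering and database lookups are left out.

```python
from collections import defaultdict


def snp_distribution(calls_by_sample):
    """Map each SNP (chrom, pos, ref, alt) to the set of samples carrying it."""
    distribution = defaultdict(set)
    for sample, calls in calls_by_sample.items():
        for chrom, pos, ref, alt in calls:
            distribution[(chrom, pos, ref, alt)].add(sample)
    return distribution


def filter_snps(distribution, min_samples=2):
    """Keep SNPs seen in at least min_samples samples, most recurrent first."""
    rows = [(snp, sorted(samples))
            for snp, samples in distribution.items()
            if len(samples) >= min_samples]
    return sorted(rows, key=lambda row: (-len(row[1]), row[0]))


# Hypothetical variant calls for three samples (placeholder coordinates).
calls = {
    "sample_1": [("chr12", 25398284, "C", "T"), ("chr17", 7577120, "C", "T")],
    "sample_2": [("chr12", 25398284, "C", "T")],
    "sample_3": [("chr7", 55259515, "T", "G"),
                 ("chr12", 25398284, "C", "T"),
                 ("chr17", 7577120, "C", "T")],
}

recurrent = filter_snps(snp_distribution(calls), min_samples=2)
for snp, samples in recurrent:
    print(snp, samples)
```

A table like `recurrent` is the kind of structure that could back a filterable HTML view; each row could then be joined against annotation sources by its SNP key.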

DSAP is an automated multiple-task web service designed to provide a total solution to analyzing deep-sequencing small RNA datasets generated by next-generation sequencing technology. DSAP uses a tab-delimited file as an input format, which holds the unique sequence reads (tags) and their corresponding number of copies generated by the Solexa sequencing platform. The input data will go through four analysis steps in DSAP: (i) cleanup: removal of adaptors and poly-A/T/C/G/N nucleotides; (ii) clustering: grouping of cleaned sequence tags into unique sequence clusters; (iii) non-coding RNA (ncRNA) matching: sequence homology mapping against a transcribed sequence library from the ncRNA database Rfam; and (iv) known miRNA matching: detection of known miRNAs in miRBase based on sequence homology. The expression levels corresponding to matched ncRNAs and miRNAs are summarized in multi-color clickable bar charts linked to external databases. DSAP is also capable of displaying miRNA expression levels from different jobs using a log2-scaled color matrix. Furthermore, a cross-species comparative function is also provided to show the distribution of identified miRNAs in different species as deposited in miRBase. DSAP is available at
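Steps (i) and (ii) of the workflow above can be sketched in a few lines. The code is a minimal approximation, not DSAP's implementation: the adaptor string, the minimum lengths, the function names, and the rule of discarding tags dominated by a single nucleotide are all assumptions made for the example. Steps (iii) and (iv) are omitted because they depend on the Rfam and miRBase sequence libraries.

```python
from collections import Counter


def parse_tags(text):
    """Parse tab-delimited lines of 'sequence<TAB>copy number' into (seq, count)."""
    tags = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        seq, count = line.split("\t")
        tags.append((seq.upper(), int(count)))
    return tags


def trim_adaptor(seq, adaptor, min_match=6):
    """Remove a 3' adaptor: a full copy anywhere, or a partial copy at the read end."""
    idx = seq.find(adaptor)
    if idx != -1:
        return seq[:idx]
    for k in range(len(adaptor) - 1, min_match - 1, -1):
        if seq.endswith(adaptor[:k]):
            return seq[:-k]
    return seq


def is_poly_nucleotide(seq, max_fraction=0.8):
    """Assumed rule: flag tags where one of A/T/C/G/N makes up most of the sequence."""
    if not seq:
        return True
    return max(Counter(seq).values()) / len(seq) >= max_fraction


def clean_and_cluster(tags, adaptor, min_length=15):
    """Step (i): clean each tag. Step (ii): merge identical cleaned tags, summing copies."""
    clusters = Counter()
    for seq, count in tags:
        cleaned = trim_adaptor(seq, adaptor)
        if len(cleaned) < min_length or is_poly_nucleotide(cleaned):
            continue
        clusters[cleaned] += count
    return clusters


ADAPTOR = "TCGTATGCCGTC"  # placeholder adaptor sequence for this example
raw = "\n".join([
    "TGAGGTAGTAGGTTGTATAGTTTCGTATGCCGTC\t10",  # insert followed by the full adaptor
    "TGAGGTAGTAGGTTGTATAGTTTCGTATGC\t5",       # same insert, partial adaptor
    "AAAAAAAAAAAAAAAAAAAAAA\t7",               # poly-A tag, discarded
    "ACGTTCGTATGCCGTC\t3",                     # too short after trimming
])

clusters = clean_and_cluster(parse_tags(raw), ADAPTOR)
for seq, count in clusters.items():
    print(seq, count)
```

The two tags that share the same insert collapse into one cluster with a combined copy number of 15, which is the quantity that later matching steps would report as an expression level.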