Precision Medicine Big Data Management and Sharing Technology Platform

Project Source

National Key Research and Development Program of China (NKRD)

Principal Investigator

伯晓晨

Funded Institution

Sun Yat-sen University (中山大学)

Project Number

2016YFC0901604

Year of Approval

2016

Approval Date

Not disclosed

Research Period

Unknown / Unknown

Project Level

National

Funding Amount

CNY 5.00 million

Discipline

Precision Medicine Research

Discipline Code

Not disclosed

Funding Category

Key Special Project on Precision Medicine Research

Keywords

Precision medicine; big data; integration; annotation; workflow; data analysis and mining

Participants

李伟忠;谢志

Participating Institutions

Shanghai Center for Bioinformation Technology (上海生物信息技术研究中心)

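Project Abstract: This project aimed to meet the data sharing and analysis needs of precision medicine by integrating precision medicine data with basic molecular biology data, developing software workflows for efficiently annotating precision medicine life omics data and clinical information, and establishing an automatic, seamless, and efficient system for integrating and annotating life omics big data and clinical information. For life omics data tools, the project developed an iterative search method for proteomics data and a reverse search scheme for metagenomics data, devised the sequencing data compression algorithm Leon-RC and the GDS-Huffman algorithm for gene variation data, and developed a graphical visualization method for multiple sequence alignment, meeting the needs of omics big data retrieval, compression, and analysis. For omics big data annotation and analysis workflows, the project established multiple software pipelines for annotating and analyzing omics and clinical data, and built the Preci workflow platform and iMAC, an integrated cloud platform for microbiome data analysis, so that users can make better use of big data workflows for data analysis. For disease-oriented omics data annotation, we constructed a group of databases linking omics data to disease, including a microbiome and disease phenotype database, a noncoding RNA and disease phenotype association database, and a noncoding gene variation and disease association database, providing a new model for annotating omics data with respect to human disease. The workflows, analysis systems, algorithms, and databases established by the project provide bioinformatics tools for medical research and have promoted the sharing and study of related medical big data.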

Application Abstract: The goal of this project was to meet the needs of data sharing and analysis for precision medicine. The tasks included integrating precision medicine data and molecular biology data, developing automatic and efficient analysis workflows, and establishing an automatic, seamless, and efficient integration and annotation system for life omics data and clinical information. For life omics data tools, the project built an iterative remote search method for proteomics data and a reverse search method for metagenomics data, and devised compression algorithms (such as Leon-RC for NGS data and GDS-Huffman for gene variation data) and a visualization method for multiple sequence alignment. These tools met the needs of omics big data retrieval, compression, and analysis. For omics big data annotation and analysis workflows, we established a number of omics and clinical data annotation and analysis workflows, and built the Preci workflow platform and the integrated microbiome analysis cloud platform (iMAC). These platforms help users make better use of the data analysis workflows. For disease-oriented data annotation, this study constructed a database warehouse of omics data with disease associations, including the microbiome and disease phenotype database (MicroPhenoDB), the noncoding RNA and disease phenotype database (ncRPheno), and the noncoding gene variation and disease association database (ncRNAVar). These databases provide a new model for human disease-oriented annotation of omics data. The workflows, analysis platforms, novel algorithms, and databases established by this project provide bioinformatics tools for biomedical research and have promoted the sharing and study of precision medicine big data.

Funded Province

Guangdong Province

  • 1. Benchmarking variant callers in next-generation and third-generation sequencing analysis

    • Keywords:
    • variant callers; germline variant; somatic variant

    DNA variants represent an important source of genetic variation among individuals. Next-generation sequencing (NGS) is the most popular technology for genome-wide variant calling. Third-generation sequencing (TGS) has also recently been used in genetic studies. Although many variant callers are available, no single caller can call both types of variants on NGS or TGS data with high sensitivity and specificity. In this study, we systematically evaluated 11 variant callers on 12 NGS and TGS datasets. For germline variant calling, we tested DNAseq and DNAscope modes from Sentieon, HaplotypeCaller mode from GATK, and WGS mode from DeepVariant. All four callers had comparable performance on NGS data, and 30x coverage was recommended for WGS data. For germline variant calling on TGS data, we tested DNAseq mode from Sentieon, HaplotypeCaller mode from GATK, and PACBIO mode from DeepVariant. All three callers had similar performance in SNP calling, while DeepVariant outperformed the others in InDel calling. TGS detected more variants than NGS, particularly in complex and repetitive regions. For somatic variant calling on NGS, we tested TNscope and TNseq modes from Sentieon, Mutect2 mode from GATK, NeuSomatic, VarScan2, and Strelka2. TNscope and Mutect2 outperformed the other callers. Higher tumor sample purity (from 10 to 20%) significantly increased the recall of calling. Finally, the computational costs of the callers were compared, and Sentieon required the least. These results suggest that careful selection of a tool and parameters is needed for accurate SNP or InDel calling under different scenarios.

    ...
  • 4. A survey and evaluation of Web-based tools/databases for variant analysis of TCGA data

    • Keywords:
    • The Cancer Genome Atlas; cancer; bioinformatics tools; databases; survey; INTEGRATED GENOMIC ANALYSIS; GENE-EXPRESSION; WHOLE-GENOME; MUTATIONAL PROCESSES; DNA METHYLATION; CANCER GENOME; OPEN PLATFORM; LANDSCAPE; SURVIVAL; SIGNATURE

    The Cancer Genome Atlas (TCGA) is a publicly funded project that aims to catalog and discover major cancer-causing genomic alterations with the goal of creating a comprehensive 'atlas' of cancer genomic profiles. The availability of this genome-wide information provides an unprecedented opportunity to expand our knowledge of tumourigenesis. Computational analytics and mining are frequently used as effective tools for exploring this byzantine series of biological and biomedical data. However, some of the more advanced computational tools are often difficult to understand or use, thereby limiting their application by scientists who do not have a strong computational background. Hence, it is of great importance to build user-friendly interfaces that allow both computational scientists and life scientists without a computational background to gain greater biological and medical insights. To that end, this survey was designed to systematically present available Web-based tools and facilitate the use of TCGA data for cancer research.

    ...