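Precision Medicine Big Data Management and Sharing Technology Platform

Project source

National Key Research and Development Program of China (NKRD)

Principal investigator

Bo Xiaochen (伯晓晨)

Funded institution

Sun Yat-sen University

Year of approval

2016

Approval date

Not disclosed

Project number

2016YFC0901604

Project level

National

Research period

Unknown / Unknown

Funding amount

5.00 million yuan

Discipline

Precision medicine research

Discipline code

Not disclosed

Fund category

Key Special Project on Precision Medicine Research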

Keywords

Precision medicine; big data; integration; annotation; workflow; data analysis and mining

Participants

Li Weizhong (李伟忠); Xie Zhi (谢志)

Participating institution

Shanghai Center for Bioinformation Technology

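Project proposal abstract: The goal of this project was to meet the data sharing and analysis needs of precision medicine by integrating precision medicine data with basic molecular biology data, developing software workflows for efficiently annotating precision medicine omics data and clinical information, and establishing an automatic, seamless, and efficient system for integrating and annotating omics big data and clinical information. For omics data tool development, the project built an iterative search method for proteomics data and a reverse search scheme for metagenomics, devised the Leon-RC compression algorithm for gene sequencing data and the GDS-Huffman algorithm for gene variation data, and developed an image-based visualization method for multiple sequence alignment, meeting the needs of omics big data retrieval, compression, and analysis. For omics big data annotation and analysis workflows, the project established multiple software pipelines for annotating and analyzing omics and clinical data, and built the Preci workflow platform and the integrated microbiome data analysis cloud platform iMAC, making it easier for users to analyze data with big data workflows. For disease-oriented omics data annotation, we built a group of databases linking omics data with disease, including a microbiome and disease phenotype database, a noncoding RNA and disease phenotype association database, and a noncoding gene variation and disease association database, providing a new model for annotating omics data with respect to human disease. The workflows, analysis systems, algorithms, and databases established by the project provide bioinformatics tools for research on medical questions and have promoted the sharing and study of related medical big data.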

Application Abstract: The goal of this project was to meet the needs of data sharing and analysis for precision medicine. The tasks included integrating precision medicine data and molecular biology data, developing automatic and efficient analysis workflows, and establishing an automatic, seamless, and efficient integration and annotation system for life omics data and clinical information. In terms of the development of life omics data tools, this project has built an iterative remote search method for proteomics data and a reverse search for metagenomics data, and devised compression algorithms (such as Leon-RC for NGS data and GDS-Huffman for gene variation data) and a visualization method for multiple sequence alignment. These tools met the needs of omics big data retrieval, compression, and analysis. Regarding the construction of omics big data annotation and analysis workflows, we have established a number of omics and clinical data annotation and analysis workflows, and built the Preci workflow platform and the integrated microbiome analysis cloud platform (iMAC). These platform systems met the needs of users to better use the data analysis workflows. For disease-oriented data annotation, this study has constructed a database warehouse of omics data with disease association, including the microbiome and disease phenotype database (MicroPhenoDB), the noncoding RNA and disease phenotype database (ncrPheno), and the noncoding gene variation and disease association database (ncRNAVar). These databases provided an innovative model for human disease-oriented annotation of omics data. The workflows, the analysis platforms, the novel algorithms, and the multiple databases established by this project have provided the bioinformatics means for biomedical research, and promoted the sharing and research of precision medicine big data.
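
The abstract names GDS-Huffman as a compression algorithm for gene variation data but does not describe how it works. As a rough illustration of the general idea that the name suggests, the sketch below implements textbook Huffman coding over a list of symbols. It is not the project's algorithm: the function names and the genotype strings in the usage note are hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 / 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the code words of all symbols."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert encode(); relies on the code being prefix-free."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

For example, with a hypothetical genotype column `["0/0"] * 6 + ["0/1"] * 3 + ["1/1"]`, the most frequent genotype receives the shortest code word, and `decode(encode(column, code), code)` returns the original column.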

Funded province

Guangdong Province

  • 1.Systematic mining and quantification reveal the dominant contribution of non-HLA variations to acute graft-versus-host disease

    • Keywords:
    • Alloreactivity; Genetic risk; Acute graft-versus-host disease; Whole-genome sequencing; Machine-learning model; STEM-CELL TRANSPLANTATION; HAPLOIDENTICAL BONE-MARROW; SINGLE NUCLEOTIDE POLYMORPHISMS; NF-KAPPA-B; RISK; REGRESSION; INHIBITOR; DISPARITY; TESTS; DONOR
    • Liang, Shuang; Kang, Yu-Jian; Huo, Mingrui; Yang, De-Chang; Ling, Min; Yue, Keli; Wang, Yu; Xu, Lan-Ping; Zhang, Xiao-Hui; Xia, Chen-Rui; Li, Jing-Yi; Wu, Ning; Liu, Ruoyang; Dong, Xinyu; Liu, Jiangying; Gao, Ge; Huang, Xiao-Jun
    • 《CELLULAR & MOLECULAR IMMUNOLOGY》
    • 2025
    • Journal article

    Human leukocyte antigen (HLA) disparity between donors and recipients is a key determinant triggering intense alloreactivity, leading to a lethal complication, namely, acute graft-versus-host disease (aGVHD), after allogeneic transplantation. Moreover, aGVHD remains a cause of mortality after HLA-matched allogeneic transplantation. Protocols for HLA-haploidentical hematopoietic cell transplantation (haploHCT) have been established successfully and widely applied, further highlighting the urgency of performing panoramic screening of non-HLA variations correlated with aGVHD. On the basis of our time-consecutive large haploHCT cohort (with a homogenous discovery set and an extended confirmatory set), we first delineated the genetic landscape of 1366 samples to quantitatively model aGVHD risk by assessing the contributions of HLA and non-HLA genes together with clinical factors. In addition to identifying multiple loss-of-function (LoF) risk variations in non-HLA coding genes, our data-driven study revealed that non-HLA genetic variations, independent of HLA disparity, contributed the most to the occurrence of aGVHD. This unexpected major effect was verified in an independent cohort that received HLA-identical sibling HCT. Subsequent functional experiments further revealed the roles of a representative non-HLA LoF gene and LoF gene pair in regulating the alloreactivity of primary human T cells. Our findings highlight the importance of non-HLA genetic risk in the new era of transplantation and propose a new direction to explore the immunogenetic mechanism of alloreactivity and to optimize donor selection strategies for allogeneic transplantation.

    ...
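  • 2.Research on feature recognition and recommendation algorithms based on feature association

    • Keywords:
    • Feature recognition; feature association; recommendation algorithm; deep learning; implicit feedback
    • Sun Mingrui (孙明瑞)
    • Advisor: Zang Tianyi (臧天仪), Harbin Institute of Technology
    • Dissertation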

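    With the development of the Internet of Things and the arrival of the big data era, explosive data growth has caused problems such as information overload, pushing traditional recommender systems to evolve into personalized recommender systems. Personalized recommendation algorithms provide information filtering and recommendation services by building user profiles and predicting user behavior. In recommender systems set against a big data background, the data used by domain recommendation techniques are increasingly complex, showing new characteristics such as massive heterogeneous data, missing features, anomalous features, and feature associations. These characteristics pose new requirements and challenges for recommendation algorithms in terms of problem scale, degree of feature missingness, anomalous feature states, and association relationships. This dissertation therefore studies feature recognition, prediction, and recommendation algorithms based on feature association. The main contributions are as follows. (1) A heuristic algorithm for mining classification association rules among features, and a feature matching algorithm. Based on the associations hidden in massive data, the work focuses on the classification association rules implicit in the data used by recommendation algorithms. Categorical and continuous feature attributes are introduced and discretized, extending the binary representation of data features and preserving the diversity of feature attributes. To mine associated features in the data, a heuristic feature mining method based on minimum support is studied to find frequently associated features and build an optimal feature subset. Building on the frequent feature items, a heuristic classification association rule mining algorithm based on minimum confidence is studied, enabling feature matching based on classification association rules under different scenario modes. Experiments on healthcare-scenario data from a machine learning repository verify the effectiveness of the proposed algorithms. (2) An implicit-feedback feature recognition and prediction algorithm. To address the sparsity and missingness of features in application-domain data, the recognition and predictive classification of missing features is studied systematically. Based on a systematic analysis of missing features in domain data, a collaborative filtering feature recognition method based on weighted users is studied. By shifting from supervised to unsupervised learning, a feature recognition method for implicit associations among feature attributes in recommender systems is studied. An implicit-feedback collaborative filtering algorithm for feature recognition and prediction based on implicit feature extraction is studied; randomly generated degrees of feature missingness simulate missing features in real-world data, and experiments verify the algorithm's effectiveness. Experiments on healthcare-scenario data from a machine learning repository verify the effectiveness and prediction accuracy of the proposed algorithm. (3) An anomalous-feature recognition and prediction algorithm. To overcome the limitation of considering only discrete features, a feature recognition algorithm based on the interdependence of continuous-attribute time series data is studied and used for anomalous feature recognition and prediction. A feature recognition method for continuous time series data based on deep learning network models is studied; dimensionality reduction through complex graph patterns, together with time-frequency series analysis, yields a deep model of temporal associations and anomalous feature recognition that improves the validity of the predictions. Experiments on electroencephalogram (EEG) healthcare-scenario data verify the effectiveness and prediction accuracy of the proposed algorithm. (4) A domain-oriented cascaded weighted hybrid personalized recommendation method. To meet domain-specific recommendation needs, hybrid recommendation methods under different scenario modes are studied, and the domain problem is abstracted as a personalized recommendation process over ontology recommendation items. A user feature information model (profile) is built; a similar-user discovery algorithm using classification trees and content similarity finds similar users, and the recommendation is obtained by weighted computation with the association-rule-based feature matching algorithm. For the cold-start problem, a similar-user discovery algorithm based on a domain-knowledge classification tree is studied, with offline computation to improve efficiency. A formal decision recommendation method based on a multi-user analytic hierarchy process is used for decision recommendation, improving user satisfaction and recommendation quality. Experiments and algorithm comparisons on real healthcare data from stroke patients verify the effectiveness of the proposed hybrid recommendation algorithm.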

    ...
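
The dissertation abstract above relies on two standard thresholds from association rule mining, minimum support and minimum confidence. The sketch below shows how those two thresholds filter classification association rules of the form "feature items imply class label". It enumerates candidates exhaustively, so it is not the heuristic algorithm the dissertation proposes; the function name, the `max_size` parameter, and the toy records in the usage note are all assumptions made for illustration.

```python
from itertools import combinations


def mine_class_rules(records, labels, min_support, min_confidence, max_size=2):
    """Return rules (antecedent, label, support, confidence).

    records: list of sets of discretized "feature=value" items.
    labels:  class label of each record, in the same order.
    support    = fraction of all records matching antecedent AND label
    confidence = fraction of antecedent-matching records carrying the label
    """
    n = len(records)
    items = sorted({item for record in records for item in record})
    rules = []
    for size in range(1, max_size + 1):
        for antecedent in combinations(items, size):
            wanted = set(antecedent)
            # Labels of every record that contains the whole antecedent.
            covered = [lab for rec, lab in zip(records, labels) if wanted <= rec]
            if not covered:
                continue
            for label in sorted(set(covered)):
                hits = covered.count(label)
                support = hits / n
                confidence = hits / len(covered)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, label, support, confidence))
    return rules
```

With five hypothetical records such as `{"bp=high", "age=old"}` labelled `"risk"` or `"ok"`, calling `mine_class_rules(records, labels, 0.4, 0.9)` keeps only rules that are both frequent enough and reliable enough, for instance `("bp=high",) -> "risk"` when every high-blood-pressure record carries that label.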
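  • 3.Prognosis prediction for patients with primary liver cancer based on cluster analysis

    • Keywords:
    • Cluster analysis; prognosis prediction; primary liver cancer; clinical subtype
    • Li Lin (李琳); Zhang Xueliang (张学良); Wang Zhe (王哲); Yang Ridong (杨日东); Zhou Yi (周毅)
    • 《新疆医科大学学报》 (Journal of Xinjiang Medical University)
    • 2018
    • Issue 12
    • Journal article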

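    Objective: To explore the preoperative clinical information of patients with primary liver cancer, assess their clinical phenotype characteristics, and predict prognosis after radical resection of liver cancer, providing a clinical basis for individualized diagnosis and treatment plans and therapeutic strategies. Methods: Principal component analysis was performed on 34 baseline clinical variables from 386 patients with primary liver cancer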

    ...
  • 4.An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles

    • Deng, Yongjie; Yao, Yao; Wang, Yanni; Yu, Tiantian; Cai, Wenhao; Zhou, Dingli; Yin, Feng; Liu, Wanli; Liu, Yuying; Xie, Chuanbo; Guan, Jian; Hu, Yumin; Huang, Peng; Li, Weizhong
    • 《NATURE COMMUNICATIONS》
    • 2024
    • Vol. 15
    • Issue 1
    • Journal article

    Untargeted metabolomic analysis using mass spectrometry provides comprehensive metabolic profiling, but its medical application faces challenges of complex data processing, high inter-batch variability, and unidentified metabolites. Here, we present DeepMSProfiler, an explainable deep-learning-based method, enabling end-to-end analysis on raw metabolic signals with output of high accuracy and reliability. Using cross-hospital 859 human serum samples from lung adenocarcinoma, benign lung nodules, and healthy individuals, DeepMSProfiler successfully differentiates the metabolomic profiles of different groups (AUC 0.99) and detects early-stage lung adenocarcinoma (accuracy 0.961). Model flow and ablation experiments demonstrate that DeepMSProfiler overcomes inter-hospital variability and effects of unknown metabolites signals. Our ensemble strategy removes background-category phenomena in multi-classification deep-learning models, and the novel interpretability enables direct access to disease-related metabolite-protein networks. Further applying to lipid metabolomic data unveils correlations of important metabolites and proteins. Overall, DeepMSProfiler offers a straightforward and reliable method for disease diagnosis and mechanism discovery, enhancing its broad applicability. Untargeted metabolomic analysis provides comprehensive metabolic profiling but faces challenges in medical application. Here, the authors present an explainable deep learning method for end-to-end analysis on raw metabolic signals to differentiate metabolomic profiles of cancers with high accuracy.

    ...
  • 5.A clinical consensus-compliant deep learning approach to quantitatively evaluate human in vitro fertilization early embryonic development with optical microscope images

    • Keywords:
    • Deep learning; Image enhancement; Image segmentation; Microscopes; Statistical tests; Blastomere segmentation; Crowd-NMS; Embryonic development; Embryonic development evaluation; In vitro fertilization; In-vitro; Microscope images; Optical microscope image; Optical microscopes; Vitro fertilization
    • Liao, Zaowen; Yan, Chaoyu; Wang, Jianbo; Zhang, Ningfeng; Yang, Huan; Lin, Chenghao; Zhang, Haiyue; Wang, Wenjun; Li, Weizhong
    • 《Artificial Intelligence in Medicine》
    • 2024
    • Vol. 149
    • Journal article

    The selection of embryos is a key for the success of in vitro fertilization (IVF). However, automatic quality assessment on human IVF embryos with optical microscope images is still challenging. In this study, we developed a clinical consensus-compliant deep learning approach, named Esava (Embryo Segmentation and Viability Assessment), to quantitatively evaluate the development of IVF embryos using optical microscope images. In total 551 optical microscope images of human IVF embryos of day-2 to day-3 were collected, preprocessed, and annotated. Using the Faster R-CNN model as baseline, our Esava model was constructed, refined, trained, and validated for precise and robust blastomere detection. A novel algorithm Crowd-NMS was proposed and employed in Esava to enhance the object detection and to precisely quantify the embryonic cells and their size uniformity. Additionally, an innovative GrabCut-based unsupervised module was integrated for the segmentation of blastomeres and embryos. Independently tested on 94 embryo images for blastomere detection, Esava obtained the high rates of 0.9940, 0.9121, and 0.9531 for precision, recall, and mAP respectively, and gained significant advances compared with previous computational methods. Intraclass correlation coefficients indicated the consistency between Esava and three experienced embryologists. Another test on 51 extra images demonstrated that Esava surpassed other tools significantly, achieving the highest average precision 0.9025. Moreover, it also accurately identified the borders of blastomeres with mIoU over 0.88 on the independent testing dataset. Esava is compliant with the Istanbul clinical consensus and compatible to senior embryologists. Taken together, Esava improves the accuracy and efficiency of embryonic development assessment with optical microscope images.

    ...
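
The Esava abstract above refers to a non-maximum suppression (NMS) variant called Crowd-NMS and reports segmentation quality as mIoU, but it does not spell out either computation. As background only, the sketch below shows the conventional greedy NMS that such variants typically start from, together with the intersection-over-union (IoU) measure it uses. It is not Crowd-NMS; the box format `(x1, y1, x2, y2)`, the function names, and the default threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In crowded scenes such as touching blastomeres, a fixed threshold can suppress true neighbouring detections, which is presumably the kind of failure a crowd-aware variant is meant to reduce.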
  • 6.KGE-UNIT: toward the unification of molecular interactions prediction based on knowledge graph and multi-task learning on drug discovery

    • Keywords:
    • molecular interactions prediction; KGE; multi-task learning; DDIs; DTIs; NETWORK; ACID
    • Zhang, Chengcheng; Zang, Tianyi; Zhao, Tianyi
    • 《BRIEFINGS IN BIOINFORMATICS》
    • 2024
    • Vol. 25
    • Issue 2
    • Journal article

    The prediction of molecular interactions is vital for drug discovery. Existing methods often focus on individual prediction tasks and overlook the relationships between them. Additionally, certain tasks encounter limitations due to insufficient data availability, resulting in limited performance. To overcome these limitations, we propose KGE-UNIT, a unified framework that combines knowledge graph embedding (KGE) and multi-task learning, for simultaneous prediction of drug-target interactions (DTIs) and drug-drug interactions (DDIs) and enhancing the performance of each task, even when data availability is limited. Via KGE, we extract heterogeneous features from the drug knowledge graph to enhance the structural features of drug and protein nodes, thereby improving the quality of features. Additionally, employing multi-task learning, we introduce an innovative predictor that comprises the task-aware Convolutional Neural Network-based (CNN-based) encoder and the task-aware attention decoder, which can better fuse multimodal features, capture the contextual interactions of molecular tasks and enhance task awareness, leading to improved performance. Experiments on two imbalanced datasets for DTIs and DDIs demonstrate the superiority of KGE-UNIT, achieving high areas under the receiver operating characteristic curve (AUROCs) (0.942, 0.987) and areas under the precision-recall curve (AUPRs) (0.930, 0.980) for DTIs and high AUROCs (0.975, 0.989) and AUPRs (0.966, 0.988) for DDIs. Notably, on the LUO dataset where the data were more limited, KGE-UNIT exhibited a more pronounced improvement, with increases of 4.32% in AUROC and 3.56% in AUPR for DTIs and 6.56% in AUROC and 8.17% in AUPR for DDIs. The scalability of KGE-UNIT is demonstrated through its extension to protein-protein interaction prediction; ablation studies and case studies further validate its effectiveness.

    ...
  • 7.Computational Assessment of the Expression-Modulating Potential for Non-Coding Variants

    • Keywords:
    • Non-coding variant; Expression-modulating variant; Gene regulation; Algorithm; Web server; TRANSCRIPTION FACTOR-BINDING; GENE-EXPRESSION; REGULATORY VARIANTS; DIABETES RISK; GENOME; IDENTIFICATION; ASSOCIATION; CHROMATIN; SNPS; PATHOGENICITY
    • Shi, Fang-Yuan; Wang, Yu; Huang, Dong; Liang, Yu; Liang, Nan; Chen, Xiao-Wei; Gao, Ge
    • 《GENOMICS PROTEOMICS & BIOINFORMATICS》
    • 2023
    • Vol. 21
    • Issue 3
    • Journal article

    Large-scale genome-wide association studies (GWAS) and expression quantitative trait locus (eQTL) studies have identified multiple non-coding variants associated with genetic diseases by affecting gene expression. However, pinpointing causal variants effectively and efficiently remains a serious challenge. Here, we developed CARMEN, a novel algorithm to identify functional non-coding expression-modulating variants. Multiple evaluations demonstrated CARMEN's superior performance over state-of-the-art tools. Applying CARMEN to GWAS and eQTL datasets further pinpointed several causal variants other than the reported lead single-nucleotide polymorphisms (SNPs). CARMEN scales well with the massive datasets, and is available online as a web server at http://carmen.gao-lab.org.

    ...
  • 8.Genome-Wide Identification of Gene Loss Events Suggests Loss Relics as a Potential Source of Functional lncRNAs in Humans

    • Keywords:
    • gene loss; long noncoding RNA; lncRNA origin; comparative genomics; EVOLUTION; MIR-106A-5P; RESISTANCE
    • Wen, Zheng-Yang; Kang, Yu-Jian; Ke, Lan; Yang, De-Chang; Gao, Ge
    • 《MOLECULAR BIOLOGY AND EVOLUTION》
    • 2023
    • Vol. 40
    • Issue 5
    • Journal article

    Gene loss is a prevalent source of genetic variation in genome evolution. Calling loss events effectively and efficiently is a critical step for systematically characterizing their functional and phylogenetic profiles genome wide. Here, we developed a novel pipeline integrating orthologous inference and genome alignment. Interestingly, we identified 33 gene loss events that give rise to evolutionarily novel long noncoding RNAs (lncRNAs) that show distinct expression features and could be associated with various functions related to growth, development, immunity, and reproduction, suggesting loss relics as a potential source of functional lncRNAs in humans. Our data also demonstrated that the rates of protein gene loss are variable among different lineages with distinct functional biases.

    ...
  • 9.Recurrent RNA edits in human preimplantation potentially enhance maternal mRNA clearance

    • Keywords:
    • ACCURATE IDENTIFICATION; WHOLE-GENOME; TRANSCRIPTOME; ADENOSINE; ALU; EXPRESSION; LANDSCAPE; REVEALS; EMBRYOS; TARGETS
    • Ding, Yang; Zheng, Yang; Wang, Junting; Li, Hao; Zhao, Chenghui; Tao, Huan; Li, Yaru; Xu, Kang; Huang, Xin; Gao, Ge; Chen, Hebing; Bo, Xiaochen
    • 《COMMUNICATIONS BIOLOGY》
    • 2022
    • Vol. 5
    • Issue 1
    • Journal article

    Posttranscriptional modification plays an important role in key embryonic processes. Adenosine-to-inosine RNA editing, a common example of such modifications, is widespread in human adult tissues and has various functional impacts and clinical consequences. However, whether it persists in a consistent pattern in most human embryos, and whether it supports embryonic development, are poorly understood. To address this problem, we compiled the largest human embryonic editome from 2,071 transcriptomes and identified thousands of recurrent embryonic edits (>=50% chances of occurring in a given stage) for each early developmental stage. We found that these recurrent edits prefer exons consistently across stages, tend to target genes related to DNA replication, and undergo organized loss in abnormal embryos and embryos from elder mothers. In particular, these recurrent edits are likely to enhance maternal mRNA clearance, a possible mechanism of which could be introducing more microRNA binding sites to the 3'-untranslated regions of clearance targets. This study suggests a potentially important, if not indispensable, role of RNA editing in key human embryonic processes such as maternal mRNA clearance; the identified editome can aid further investigations.

    ...
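
The abstract above defines recurrent embryonic edits as those with at least a 50% chance of occurring in a given developmental stage. One simple way to read that definition is sketched below: estimate the chance as the fraction of samples of a stage in which an edit site is detected, and keep sites at or above the cutoff. The paper's actual estimator is not given here, so this reading, the function name, and the site and stage labels in the usage note are assumptions.

```python
from collections import Counter, defaultdict


def recurrent_edits(samples, min_fraction=0.5):
    """Return {stage: set of sites seen in >= min_fraction of its samples}.

    samples: iterable of (stage, edit_sites) pairs, one pair per sample,
             where edit_sites is a collection of site identifiers.
    """
    site_counts = defaultdict(Counter)  # stage -> site -> number of samples
    sample_counts = Counter()           # stage -> number of samples
    for stage, sites in samples:
        sample_counts[stage] += 1
        # set() guards against a site listed twice within one sample.
        site_counts[stage].update(set(sites))
    return {
        stage: {
            site
            for site, count in counts.items()
            if count / sample_counts[stage] >= min_fraction
        }
        for stage, counts in site_counts.items()
    }
```

For instance, a hypothetical site `"chr1:100"` detected in two of three zygote samples would be kept for that stage, while a site detected in only one of the three would not.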
  • 10.Identification of a cytokine-dominated immunosuppressive class in squamous cell lung carcinoma with implications for immunotherapy resistance

    • Keywords:
    • Immunogenomics; LUSC; T cell exhaustion; Immunosuppressive cytokine; Immune checkpoint blockade resistance; Tumour microenvironment; CANCER; EXPRESSION; TUMOR; THERAPY; PEMBROLIZUMAB; EPIDEMIOLOGY; GUIDELINES; DISCOVERY; BLOCKADE; FEATURES
    • Yang, Minglei; Lin, Chenghao; Wang, Yanni; Chen, Kang; Zhang, Haiyue; Li, Weizhong
    • 《GENOME MEDICINE》
    • 2022
    • Vol. 14
    • Issue 1
    • Journal article

    Background: Immune checkpoint blockade (ICB) therapy has revolutionized the treatment of lung squamous cell carcinoma (LUSC). However, a significant proportion of patients with high tumour PD-L1 expression remain resistant to immune checkpoint inhibitors. To understand the underlying resistance mechanisms, characterization of the immunosuppressive tumour microenvironment and identification of biomarkers to predict resistance in patients are urgently needed. Methods: Our study retrospectively analysed RNA sequencing data of 624 LUSC samples. We analysed gene expression patterns from tumour microenvironment by unsupervised clustering. We correlated the expression patterns with a set of T cell exhaustion signatures, immunosuppressive cells, clinical characteristics, and immunotherapeutic responses. Internal and external testing datasets were used to validate the presence of exhausted immune status. Results: Approximately 28 to 36% of LUSC patients were found to exhibit significant enrichments of T cell exhaustion signatures, high fraction of immunosuppressive cells (M2 macrophage and CD4 Treg), co-upregulation of 9 inhibitory checkpoints (CTLA4, PDCD1, LAG3, BTLA, TIGIT, HAVCR2, IDO1, SIGLEC7, and VISTA), and enhanced expression of anti-inflammatory cytokines (e.g. TGF beta and CCL18). We defined this immunosuppressive group of patients as exhausted immune class (EIC). Although EIC showed a high density of tumour-infiltrating lymphocytes, these were associated with poor prognosis. EIC had relatively elevated PD-L1 expression, but showed potential resistance to ICB therapy. The signature of 167 genes for EIC prediction was significantly enriched in melanoma patients with ICB therapy resistance. EIC was characterized by a lower chromosomal alteration burden and a unique methylation pattern. We developed a web application (http://lilab2.sysu.edu.cn/tex & http://liwzlab.cn/tex) for researchers to further investigate potential association of ICB resistance based on our multi-omics analysis data. Conclusions: We introduced a novel LUSC immunosuppressive class which expressed high PD-L1 but showed potential resistance to ICB therapy. This comprehensive characterization of immunosuppressive tumour microenvironment in LUSC provided new insights for further exploration of resistance mechanisms and optimization of immunotherapy strategies.

    ...