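Precision Medicine Big Data Management and Sharing Technology Platform

Project source

National Key Research and Development Program (NKRD)

Principal investigator

Bo Xiaochen (伯晓晨)

Funded institution

Sun Yat-sen University

Year approved

2016

Approval date

Not disclosed

Project number

2016YFC0901604

Research period

Unknown / Unknown

Project level

National

Funding amount

5.00 million CNY (500.00万元)

Discipline

Precision medicine research

Discipline code

Not disclosed

Fund category

Key Special Project on Precision Medicine Research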

Keywords

Precision medicine; big data; integration; annotation; workflow; data analysis and mining

Participants

Li Weizhong (李伟忠); Xie Zhi (谢志)

Participating institution

Shanghai Center for Bioinformation Technology

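Project proposal abstract: This project aims to meet the needs of precision medicine data sharing and analysis by integrating precision medicine data with basic molecular biology data, developing software workflows that efficiently annotate precision medicine life omics data and clinical information, and establishing an automatic, seamless, and efficient system for integrating and annotating life omics big data and clinical information. For life omics data tool development, the project built an iterative search method for proteomics data and a reverse search scheme for metagenomics, devised the sequencing data compression algorithm Leon-RC and the GDS-Huffman algorithm for gene variation data, and developed a graphical visualization method for multiple sequence alignment, meeting the needs of omics big data retrieval, compression, and analysis. For omics big data annotation and analysis workflows, the project established multiple software pipelines for annotating and analyzing omics and clinical data, and built the Preci workflow platform and the integrated microbiome analysis cloud platform iMAC, helping users make better use of big data workflows for data analysis. For disease-oriented omics data annotation, we constructed a group of databases linking omics data to disease, including a microbiome and disease phenotype database, a noncoding RNA and disease phenotype association database, and a noncoding gene variation and disease association database, providing a new model for annotating omics data against human disease. The workflows, analysis systems, algorithms, and databases established by the project provide bioinformatics tools for research on medical problems and promote the sharing and study of related medical big data.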

Application Abstract: The goal of this project was to meet the needs of data sharing and analysis for precision medicine. The tasks included integrating precision medicine data and molecular biology data, developing automatic and efficient analysis workflows, and establishing an automatic, seamless, and efficient integration and annotation system for life omics data and clinical information. For life omics data tool development, this project built an iterative remote search method for proteomics data and a reverse search method for metagenomics data, and devised compression algorithms (such as Leon-RC for NGS data and GDS-Huffman for gene variation data) and a visualization method for multiple sequence alignment. These tools met the needs of omics big data retrieval, compression, and analysis. For omics big data annotation and analysis workflows, we established a number of omics and clinical data annotation and analysis workflows, and built the Preci workflow platform and the integrated microbiome analysis cloud platform (iMAC). These platforms helped users make better use of the data analysis workflows. For disease-oriented data annotation, this study constructed a collection of databases of omics data with disease associations, including the microbiome and disease phenotype database (MicroPhenoDB), the noncoding RNA and disease phenotype database (ncrPheno), and the noncoding gene variation and disease association database (ncRNAVar). These databases provided an innovative model for human disease-oriented annotation of omics data. The workflows, analysis platforms, novel algorithms, and databases established by this project have provided bioinformatics means for biomedical research and promoted the sharing and study of precision medicine big data.

Province of funded institution

Guangdong Province

  • 1. Cross-Linked Unified Embedding for cross-modality representation learning

    • Keywords:
    • Benchmarking;Cells;Cytology;Data integration;Genome;Modal analysis;Cross modality;Embeddings;Genomics;Global integration;Incomplete observation;Multi-modal;Multi-modal data;Multi-modal learning;Real-world;Single cells
    • Tu, Xinming;Cao, Zhi-Jie;Xia, Chen-Rui;Mostafavi, Sara;Gao, Ge
    • 36th Conference on Neural Information Processing Systems, NeurIPS 2022
    • 2022
    • November 28, 2022 - December 9, 2022
    • New Orleans, LA, United States
    • Conference

    Multi-modal learning is essential for understanding information in the real world. Jointly learning from multi-modal data enables global integration of both shared and modality-specific information, but current strategies often fail when observations from certain modalities are incomplete or missing for part of the subjects. To learn comprehensive representations based on such modality-incomplete data, we present a semi-supervised neural network model called CLUE (Cross-Linked Unified Embedding). Extending from multi-modal VAEs, CLUE introduces the use of cross-encoders to construct latent representations from modality-incomplete observations. Representation learning for modality-incomplete observations is common in genomics. For example, human cells are tightly regulated across multiple related but distinct modalities such as DNA, RNA, and protein, jointly defining a cell's function. We benchmark CLUE on multi-modal data from single cell measurements, illustrating CLUE's superior performance in all assessed categories of the NeurIPS 2021 Multimodal Single-cell Data Integration Competition. While we focus on analysis of single cell genomic datasets, we note that the proposed cross-linked embedding strategy could be readily applied to other cross-modality representation learning problems. © 2022 Neural information processing systems foundation. All rights reserved.

  • 2. Multi-label classification of fundus images based on graph convolutional network

    • Keywords:
    • Diabetic retinopathy; Fundus images; GCN; Multi-label; DIABETIC-RETINOPATHY; FLUORESCEIN ANGIOGRAPHY; PREVALENCE; LESIONS
    • Cheng, Yinlin;Ma, Mengnan;Li, Xingyu;Zhou, Yi
    • International Conference on Health Big Data and Artificial Intelligence
    • 2021
    • OCT 29-NOV 01, 2020
    • Guangzhou, People's Republic of China
    • Conference

    Background: Diabetic retinopathy (DR) is the most common and serious microvascular complication in the diabetic population. Computer-aided diagnosis from fundus images has become a method of detecting retinal diseases, but detecting multiple lesions remains a difficult point in current research. Methods: This study proposed a multi-label classification method based on the graph convolutional network (GCN) to detect 8 types of fundus lesions in color fundus images. We collected 7459 fundus images (1887 left eyes, 1966 right eyes) from 2282 patients (1283 women, 999 men) and labeled 8 types of lesions: laser scars, drusen, cup disc ratio (C/D > 0.6), hemorrhages, retinal arteriosclerosis, microaneurysms, hard exudates, and soft exudates. We constructed a specialized corpus of the related fundus lesions, proposed a multi-label classification algorithm for fundus images based on that corpus, and trained the model on the collected data. Results: The average overall F1 score (OF1) and the average per-class F1 score (CF1) of the model were 0.808 and 0.792, respectively. The area under the ROC curve (AUC) of the proposed model reached 0.986, 0.954, 0.946, 0.957, 0.952, 0.889, 0.937, and 0.926 for detecting laser scars, drusen, cup disc ratio, hemorrhages, retinal arteriosclerosis, microaneurysms, hard exudates, and soft exudates, respectively. Conclusions: The results demonstrated that the proposed model can detect a variety of lesions in color fundus images, which lays a foundation for assisting doctors in diagnosis and makes rapid, efficient large-scale screening of fundus lesions possible.
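The overall and per-class F1 scores reported for the fundus model can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so this assumes OF1 pools true positives over all classes (micro-averaging) and CF1 is the plain mean of the per-class F1 values; some multi-label papers instead derive CF1 from averaged per-class precision and recall. The function name `multilabel_f1` and the toy label matrices are hypothetical.

```python
from typing import List, Tuple


def _f1(tp: int, n_pred: int, n_true: int) -> float:
    """F1 from counts; equals 2*TP / (predicted positives + actual positives)."""
    denom = n_pred + n_true
    return 2.0 * tp / denom if denom else 0.0


def multilabel_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> Tuple[float, float]:
    """Return (OF1, CF1) for 0/1 label matrices shaped [samples][classes].

    Assumption: OF1 pools counts over all classes, CF1 averages per-class F1.
    """
    n_classes = len(y_true[0])
    tp = [0] * n_classes      # true positives per class
    n_pred = [0] * n_classes  # predicted positives per class
    n_true = [0] * n_classes  # actual positives per class
    for truth, pred in zip(y_true, y_pred):
        for c in range(n_classes):
            tp[c] += truth[c] * pred[c]
            n_pred[c] += pred[c]
            n_true[c] += truth[c]
    of1 = _f1(sum(tp), sum(n_pred), sum(n_true))
    cf1 = sum(_f1(tp[c], n_pred[c], n_true[c]) for c in range(n_classes)) / n_classes
    return of1, cf1


if __name__ == "__main__":
    # Toy data: 4 images, 3 hypothetical lesion classes.
    truth = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
    pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
    print(multilabel_f1(truth, pred))
```

OF1 is dominated by frequent lesion types, while CF1 weights every lesion type equally, which is why the two values can differ on imbalanced data.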

  • 3. Research on epileptic EEG recognition based on improved residual networks of 1-D CNN and indRNN

    • Keywords:
    • Epilepsy; Residual network; CNN; indRNN; RCNN; CLASSIFICATION; PREDICTION; TERM
    • Ma, Mengnan;Cheng, Yinlin;Wei, Xiaoyan;Chen, Ziyi;Zhou, Yi
    • International Conference on Health Big Data and Artificial Intelligence
    • 2021
    • OCT 29-NOV 01, 2020
    • Guangzhou, People's Republic of China
    • Conference

    Background: Epilepsy is a disease of the nervous system that affects a large population worldwide. Traditional diagnosis has mostly depended on professional neurologists reading the electroencephalogram (EEG), which is time-consuming, inefficient, and subjective. In recent years, automatic epilepsy diagnosis from EEG by deep learning has attracted more and more attention, but the potential of deep neural networks in seizure detection has not been fully developed. Methods: In this article, we used a one-dimensional convolutional neural network (1-D CNN) to replace the traditional convolutional neural network (CNN) in the residual network architecture. Moreover, we combined the independent recurrent neural network (indRNN) and CNN to form a new residual network architecture, the independent convolutional recurrent neural network (RCNN). Our model can achieve automatic diagnosis of epilepsy from EEG. First, the important features of the EEG were learned using the residual network architecture of 1-D CNN. Then the relationship between the sequences was learned using the recurrent neural network. Finally, the model output the classification results. Results: On the small-sample data sets of Bonn University, our method was superior to the baseline methods and achieved 100% classification accuracy and 100% classification specificity. On noisy real-world data, our method also performed strongly. Conclusion: The proposed model can quickly and accurately identify the different periods of EEG in both ideal and real-world conditions. The model can provide automatic detection capabilities for clinical epilepsy EEG detection. We hope it will contribute to the prediction of epileptic seizures from EEG.
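The abstract does not give the layer equations, so the following is only a minimal sketch of the recurrence that distinguishes an independently recurrent network as it is commonly defined: each hidden unit keeps its own scalar recurrent weight, h_t = relu(W x_t + u * h_{t-1} + b) with an elementwise product. The function name `indrnn_forward` and all weights are hypothetical, and the residual 1-D CNN front end described above is not modeled.

```python
from typing import List


def relu(x: float) -> float:
    return max(0.0, x)


def indrnn_forward(inputs: List[List[float]],
                   w: List[List[float]],
                   u: List[float],
                   b: List[float]) -> List[List[float]]:
    """Run one IndRNN-style layer over a sequence.

    inputs: T time steps, each a feature vector of length D
    w:      H x D input weights
    u:      H recurrent weights, one scalar per hidden unit (elementwise)
    b:      H biases
    Returns the hidden state at every time step (T x H).
    """
    hidden = [0.0] * len(u)
    states = []
    for x in inputs:
        hidden = [
            relu(sum(w_ij * x_j for w_ij, x_j in zip(w[i], x)) + u[i] * hidden[i] + b[i])
            for i in range(len(u))
        ]
        states.append(hidden)
    return states


if __name__ == "__main__":
    # Toy sequence: 3 time steps, 1 input feature, 2 hidden units.
    seq = [[1.0], [1.0], [1.0]]
    print(indrnn_forward(seq, w=[[1.0], [-1.0]], u=[0.5, 2.0], b=[0.0, 0.0]))
```

Because unit i only sees its own previous value, the recurrent weight is a vector rather than a matrix, which is the design choice that separates this update from a standard recurrent layer.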

  • 6. DRACP: a novel method for identification of anticancer peptides

    • Keywords:
    • Anticancer peptides; Deep belief network; Relevance vector machine; Random forest; Cancer; AMINO-ACID-COMPOSITION; TOOL
    • Zhao, Tianyi;Hu, Yang;Zang, Tianyi
    • Biological Ontologies and Knowledge Bases Workshop
    • 2020
    • NOV 18-21, 2019
    • San Diego, CA
    • Conference

    Background: Millions of people are suffering from cancers, but accurate early diagnosis and effective treatment remain difficult for doctors. Common ways of fighting cancer include surgery, radiotherapy, and chemotherapy. However, they are all very harmful to patients. Recently, anticancer peptides (ACPs) have been discovered as a potential way to treat cancer. Since ACPs are natural biologics, they are safer than other methods. However, finding ACPs experimentally is expensive, so we propose a new machine learning method to identify ACPs. Results: First, we extracted the features of ACPs in two aspects: sequence and chemical characteristics of amino acids. For sequence, the average composition of the 20 amino acids was extracted. For chemical characteristics, we classified amino acids into six groups based on the patterns of hydrophobic and hydrophilic residues. Then, a deep belief network was used to encode the features of ACPs. Finally, we proposed Random Relevance Vector Machines to identify the true ACPs. We call this method 'DRACP' and tested its performance on two independent datasets. Its AUC and AUPR are higher than 0.9 on both datasets. Conclusion: We developed a novel method named 'DRACP' and compared it with some traditional methods. The cross-validation results showed its effectiveness in identifying ACPs.
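The sequence feature mentioned above, the composition of the 20 standard amino acids, can be sketched as follows. This is an illustration under assumptions, not the DRACP code: the function name `amino_acid_composition`, the alphabetical residue ordering, and the choice to ignore non-standard characters are all hypothetical. The six hydrophobicity groups are not reproduced because the abstract does not list them.

```python
from typing import List

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in a peptide.

    Characters outside the 20-letter alphabet are ignored (an assumption).
    An empty or fully non-standard sequence yields a zero vector.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    valid = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            valid += 1
    if valid == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / valid for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical peptide used only to show the output shape.
    features = amino_acid_composition("AAKK")
    print(dict(zip(AMINO_ACIDS, features)))
```

The result is a fixed-length 20-dimensional vector regardless of peptide length, which is what allows peptides of different sizes to be fed to the same downstream encoder.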

  • 7. LncDisAP: A computation model for LncRNA-disease association prediction based on multiple biological datasets    (Open Access)

    • Keywords:
    • Statistical tests;Diagnosis;Collaborative filtering;Large dataset;Computation theory;Computational methods;RNA;Network coding;Computational model;Disease associations;Functional associations;Functional similarity;Non-coding RNAs;Random walking with restart;Receiver operating characteristic curves;Recommendation strategies
    • Wang, Yongtian;Juan, Liran;Peng, Jiajie;Zang, Tianyi;Wang, Yadong
    • BMC Bioinformatics
    • 2019
    • Conference

    Background: Over the past decades, a large number of long non-coding RNAs (lncRNAs) have been identified. Growing evidence has indicated that the mutation and dysregulation of lncRNAs play a critical role in the development of many complex human diseases. Consequently, identifying potential disease-related lncRNAs is an effective means to improve the quality of disease diagnostics and treatment, which is the motivation of this work. Here, we propose a computational model (LncDisAP) for potential disease-related lncRNA identification based on multiple biological datasets. First, the associations between lncRNA and different data sources are collected from different databases. With these data sources as dimensions, we calculate the functional associations between lncRNAs by the recommendation strategy of collaborative filtering. Subsequently, a disease-associated lncRNA functional network is built with functional similarities between lncRNAs as the weight. Ultimately, potential disease-related lncRNAs can be identified based on ranked scores derived by random walking with restart (RWR). Then, training sets and testing sets are extracted from two different versions of a disease-lncRNA dataset to assess the performance of LncDisAP on 54 diseases. Results: A lncRNA functional network is built based on the proposed computational model, and it contains 66,060 associations among 364 lncRNAs associated with 182 diseases in total. We extract 218 known disease-lncRNA pairs associated with 54 diseases to assess the network. As a result, the average AUC (area under the receiver operating characteristic curve) of LncDisAP is 78.08%. Conclusion: In this article, a computational model integrating multiple lncRNA-related biological datasets is proposed for identifying potential disease-related lncRNAs. The result shows that LncDisAP is successful in predicting novel disease-related lncRNA signatures. In addition, with several common cancers taken as case studies, we found some unknown lncRNAs that could be associated with these diseases through our network. These results suggest that this method can be helpful in improving the quality for disease diagnostics and treatment. © 2019 The Author(s).
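The ranking step in the LncDisAP abstract, random walking with restart (RWR) over a weighted functional network, can be sketched in general form. This is not the authors' implementation: the function name `random_walk_with_restart`, the restart probability of 0.5, the convergence tolerance, and the three-node toy network are assumptions, since the abstract gives no parameters.

```python
from typing import List, Sequence


def random_walk_with_restart(adjacency: Sequence[Sequence[float]],
                             seeds: Sequence[int],
                             restart: float = 0.5,
                             tol: float = 1e-10,
                             max_iter: int = 1000) -> List[float]:
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change drops below tol.

    adjacency: symmetric non-negative weight matrix (e.g. functional similarity)
    seeds:     indices of nodes known to be associated with the disease
    W is the column-normalized adjacency; p0 is uniform over the seeds.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    w = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    # Toy path network 0 - 1 - 2 with unit weights; node 0 is the seed.
    network = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(random_walk_with_restart(network, seeds=[0]))
```

Nodes are then ranked by their steady-state score, so candidates closer to the seed set in the weighted network rank higher.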
  • 9. Human mitochondrial genome compression using machine learning techniques

    • Keywords:
    • Compression; Human mitochondrial genomes; Machine learning
    • Wang, Rongjie;Zang, Tianyi;Wang, Yadong
    • IEEE International Conference on Bioinformatics and Biomedicine - Human Genomics
    • 2019
    • DEC 03-06, 2018
    • Madrid, Spain
    • Conference

    Background: In recent years, with the development of high-throughput genome sequencing technologies, a large amount of genome data has been generated, which has caused widespread concern about data storage and transmission costs. However, how to effectively compress genome sequence data remains an unsolved problem. Results: In this paper, we propose a compression method using machine learning techniques (DeepDNA) for compressing human mitochondrial genome data. The experimental results show the effectiveness of the proposed method compared with other methods on the human mitochondrial genome data. Conclusions: The proposed compression method can be classified as a non-reference-based method, but its compression effect is comparable to that of reference-based methods. Moreover, our method compresses well not only population genomes with large redundancy but also single genomes with small redundancy. The codes of DeepDNA are available at .

  • 10. Mining Pharmaceutical Product Data Related to Payment Pattern from the CMS Open Payments Data: A Case Study in Thoracic Surgery

    • Keywords:
    • Health expenditures; drug industry; medical informatics
    • Na, Xu;Guo, Haihong;Wu, Sizhu;Li, Jiao
    • 17th World Congress of Medical and Health Informatics
    • 2019
    • AUG 25-30, 2019
    • Int Med Informat Assoc, Lyon, France
    • Conference

    This study used descriptive statistical analyses to investigate the payment characteristics and to discuss the regularity of highest paying industries. Payments by 4.70% of highest paying industries (N=446) accounted for 85% of the total (US $72,458,304) in 2014-2016. A tiny minority of highest paying industries control the majority of payments. Large payments from these industries are highly associated with few specific products. Furthermore, payment patterns among the industries include concentration and diversification.
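The kind of concentration statistic quoted above (a small share of payers accounting for most of the total) can be computed with a short sketch. The function name `top_payer_share`, the 85% default target, and the payment amounts are hypothetical illustrations, not values from the CMS Open Payments data.

```python
from typing import Sequence


def top_payer_share(payments: Sequence[float], target: float = 0.85) -> float:
    """Return the smallest fraction of payers whose payments, taken from the
    largest downward, reach at least `target` of the total amount."""
    if not payments:
        raise ValueError("payments must not be empty")
    total = sum(payments)
    running = 0.0
    for count, amount in enumerate(sorted(payments, reverse=True), start=1):
        running += amount
        if running >= target * total:
            return count / len(payments)
    return 1.0


if __name__ == "__main__":
    # Hypothetical totals per paying company, in US dollars.
    amounts = [900.0, 40.0, 30.0, 20.0, 10.0]
    print(top_payer_share(amounts))
```

A low return value indicates concentrated payments; a value near the target itself indicates payments spread evenly across payers.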
