EarthCube Capabilities: OpenMindat - Open Access and Interoperable Mineralogy Data to Broaden Community Access and Advance Geoscience Research
1. Enhancing neuro-symbolic AI for mineral prediction via LLM-guided knowledge integration
- Keywords: Decision trees; Deposits; Domain knowledge; Economic geology; Forecasting; Learning systems; Lithology; Geosciences; Knowledge integration; Language model; Large language model; Machine learning algorithms; Machine learning models; Mineral prediction; Neuro-symbolic AI; Scientific fields
- Chen, Weilin; Zhang, Jiyin; Li, Chenhao; Ma, Xiaogang
- 《Applied Computing and Geosciences》
- 2026
- Vol. 29
- Journal
Integrating domain knowledge into machine learning (ML) models is critical for achieving reliable and interpretable predictions in complex scientific fields such as geoscience. In several recent studies centered on Neuro-Symbolic AI (NSAI) frameworks, symbolic geological knowledge was successfully combined with traditional ML algorithms to improve the prediction of mineral deposit types. The rapid development of Large Language Models (LLMs) brings new opportunities to further enhance NSAI applications. In this study, to construct the symbolic component of NSAI, we used an LLM to automatically extract, structure, and transform descriptive knowledge from authoritative geoscience textbooks into a machine-readable format. The result captures geochemical signatures, lithological settings, and alteration features associated with various mineral systems. The structured knowledge was integrated into a decision tree classifier by embedding each sample with a vectorized representation of its corresponding deposit type. Compared to conventional ML models trained solely on geochemical data, our NSAI model achieved significantly higher accuracy on the test sets, indicating improved generalization. Moreover, the NSAI model demonstrated consistent performance across a broader set of deposit types, including those with extremely limited training samples. In particular, the NSAI framework improved predictive stability and accuracy even for minority classes with only 3 to 5 samples, where traditional ML models tend to overfit or fail. This robustness underscores the value of incorporating expert-level geological knowledge into data-driven pipelines. In our result assessment, the SHAP (SHapley Additive exPlanations) analysis further revealed that symbolic knowledge vectors contributed substantially to the model's decision-making process, confirming their importance in enhancing interpretability and predictive power. Our work demonstrates that LLM-guided knowledge extraction offers an effective and scalable way to integrate structured domain knowledge into mineral prediction tasks. We hope the work can also provide insights for other geoscientific applications of NSAI. © 2025 The Authors
2. Fine-tuning small and open LLMs to automate geoscience data analysis workflows: A scalable approach
- Keywords: Information analysis; Open Data; Open systems; Scalability; Tuning; Analysis workflow; Data analytics; Fine tuning; Geoscience data; Language model; Mindat; Open large language model; Open-source model; Performance; Work-flows
- Zhang, Jiyin; Li, Wenjia; Que, Xiang; Chen, Weilin; Li, Chenhao; Ma, Xiaogang
- 《Applied Computing and Geosciences》
- 2025
- Vol. 28
- Journal
With the recent integration of Large Language Models (LLMs) into geoscience applications, agentic LLM-driven workflows have emerged as an innovative approach to streamline automated data analysis processes. Advanced proprietary LLMs like ChatGPT demonstrate strong performance in customized workflows due to their substantial computational resources and extensive pretraining on diverse datasets. However, deploying such workflows with commercial LLMs can incur significant costs, especially in terms of token consumption, necessitating a shift toward open-source models. In this study, we fine-tuned an open-source LLM (Llama 3.1) to handle geoscience data analysis tasks, leveraging the self-instruct method to generate synthetic training datasets. The proposed pipeline for designing LLM-driven workflows and fine-tuning open-source models using synthetic datasets enables scalability, allowing the integration of additional LLM agents to accommodate more complex tasks. Furthermore, this workflow serves as a template for researchers in other domains to develop similar solutions tailored to their specific needs. Our experimental evaluation compares the performance of ChatGPT-4o with the fine-tuned Llama 3.1 in the context of the proposed geoscience data analysis workflow. Results demonstrate that the fine-tuned open-source model achieves performance comparable to proprietary models, extending the applicability of open LLMs to domain-specific agentic workflows in data analysis. © 2025 The Authors
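3. OneMineralogy mobile application: Establishing a new channel for the OpenMindat information infrastructure
- Keywords: Information infrastructure; OneMineralogy; OpenMindat; Mobile application
- 付权利; 阙翔; 马小刚; 张继吟
- 《The 11th National Symposium on Metallogenic Theory and Prospecting Methods》
- 2025
- Fuzhou, Fujian, China
- Conference
Mindat is currently the world's largest open online mineral database. The related OpenMindat service project (Ma et al., 2024), funded by the U.S. National Science Foundation (NSF), is built on the RESTful Mindat application programming interface (API) and provides corresponding R and Python packages. Together these make knowledge exploration and discovery driven by Mindat mineral data that follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles (Wilkinson et al., 2016) more efficient and convenient. However, tools for mobile data access and analysis are still lacking, and this limited tool availability may delay or restrict some data-driven mineral knowledge discovery.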
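4. Fine-tuning Stable Diffusion with the Mindat database to generate mineral images from text
- Keywords: Text description; Text generation; Mindat; Stable Diffusion; Database
- 付权利; 阙翔; 林铭杰; 翁岳鹏; 陈明贤; 游舒羽
- 《The 11th National Symposium on Metallogenic Theory and Prospecting Methods》
- 2025
- Fuzhou, Fujian, China
- Conference
Mindat ("mindat.org") is a comprehensive mineral database and a valuable mineral data resource that brings together a large volume of text and images from around the world. The OpenMindat project (Ma et al., 2024), funded by the U.S. National Science Foundation (NSF), has accelerated the discovery of mineralogical knowledge, but many relationships between images and text in Mindat remain underexplored. Studying these relationships could open a new and engaging discussion on converting text descriptions into mineral images, which may be of significant value.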
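5. OneMineralogy mobile application: Establishing a new channel for the OpenMindat information infrastructure and promoting mineralogical knowledge discovery
- Keywords: Mobile application; Mindat; Mineral informatics; Data-driven knowledge discovery
- 付权利; 阙翔; 马小刚; 张继吟
- 《Bulletin of Mineralogy, Petrology and Geochemistry》
- 2025
- Journal
Mindat is currently the world's largest open online mineral database, but it still lacks tools for mobile data access and analysis. This limited tool availability may delay or restrict some data-driven mineral knowledge discovery. This study developed an open-source JavaScript mobile application (OneMineralogy) based on the Vue and Uni-app frameworks. It provides initial support for mobile Mindat connection, data retrieval, format conversion, data export, exploratory analysis, and inference. Mobile users can obtain Mindat online data simply and efficiently through OneMineralogy, which extends the OpenMindat information infrastructure ecosystem. Three use cases further demonstrate its scientific and practical value: (1) interactive retrieval of mineral and attribute data; (2) exploratory analysis of co-occurrence relationships among mineral species and ore-forming elements under attribute constraints; and (3) prediction of new minerals and mineral assemblages at target localities. This mobile application offers geoscience researchers and related users a new channel for accessing Mindat data, and it is expected to increase the number of Mindat community users and to advance mineral knowledge discovery and mineral informatics.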
6. Integrating neuro-symbolic AI and knowledge graph for enhanced geochemical prediction in copper deposits
- Keywords: Deep learning; Deposits; Economic geology; Forecasting; Geochemistry; Knowledge graph; Knowledge management; Learning systems; Mineralogy; Prediction models; Expert knowledge; Interpretability; Knowledge graphs; Language model; Large language model; Machine learning models; Mineral prediction; Mineralisation; Model knowledge; Neuro-symbolic AI
- Chen, Weilin; Zhang, Jiyin; Li, Wenjia; Que, Xiang; Li, Chenhao; Ma, Xiaogang
- 《Applied Computing and Geosciences》
- 2025
- Vol. 27
- Journal
The integration of machine learning (ML) and deep learning (DL) in geoscience has demonstrated great promise for mineral prediction. However, existing approaches are predominantly data-driven and often overlook expert geological knowledge, limiting their interpretability, accuracy, and practical applicability. This study introduces a new method that combines Large Language Models (LLMs), knowledge graphs (KGs), and Neuro-Symbolic AI (NSAI) models to predict mineralization systems in diverse copper deposits, significantly increasing the precision in prediction results. We utilize LLMs to generate KGs from geological literature, extracting symbolic rules that encode domain-specific insights about copper mineralization. These rules, derived dynamically from expert knowledge, are integrated into ML models as guidance during the training and prediction phases. By fusing symbolic reasoning with ML's computational power, our approach overcomes the limitations of black-box models, offering both improved accuracy and transparency in mineral prediction. To validate this method, we apply it to a comprehensive geochemical dataset of global copper deposits. The results show that rule-guided ML models achieve notable performance improvements, outperforming traditional ML methods in accuracy, precision, and robustness. Interpretability is further enhanced by using tools such as SHAP values, which explain the influence of individual geochemical features within the rule-based framework. This combination not only identifies critical geochemical elements like Cu, Fe, and S but also provides coherent, domain-aligned explanations for the predicted mineralization patterns. Our findings demonstrate the transformative potential of combining LLMs, KGs, and ML models for mineral prediction. This hybrid approach enables geoscientists to leverage both computational and expert knowledge, achieving a deeper understanding of mineralization systems. © 2025 The Authors
7. Identifier Service in the Mindat Database: Persistent and Structured Access to Massive Records of Minerals and Other Natural Materials
- Keywords: Geology; Mineral resources; Minerals; Open Data; Contextual information; Data intensive; Data-source; Datatypes; FAIR principle; Geomaterials; Geosciences research; Multiple source; Natural materials; Persistent identifier
- Ralph, Jolyon; Martynov, Pavel; Ma, Xiaogang; Von Bargen, David; Li, Wenjia; Huang, Jingyi; Golden, Joshua; Profeta, Lucia; Prabhu, Anirudh; Morrison, Shaunna; Que, Xiang; Zhang, Jiyin
- 《Data Intelligence》
- 2025
- Vol. 7
- Issue 3
- Journal
Minerals, like many other natural materials of geological origin (i.e., geomaterials), face the challenge of name variations. This in turn hinders data-intensive geoscience research, which often needs to integrate data from multiple sources. It is clear that a mineral name is not an appropriate identifier to connect records within and amongst data sources. The Mindat database, as one of the biggest resources for open data in mineralogy, has received a significant volume of feedback on the heterogeneity of mineral and rock names. To address that issue, we established a persistent identifier service on Mindat to provide persistent and meaningful access to the records of geomaterials (mineral/rock/variety), localities, mineral occurrences, references, photos, and specimens. A key development was the long-form identifier, which adds contextual information such as identifier authorities and data types into the identifier structure. Moreover, a UUID service was built along with the long-form identifier to further increase interoperability. The identifier service has been successfully implemented to mint millions of identifiers for different types of data objects on Mindat. Several use case scenarios were developed to illustrate the utility of the identifiers in the real world. We believe the persistent identifier will help address the challenges caused by name variations, and we welcome Mindat users to test the identifiers and send feedback to us for future extensions. © 2025 Chinese Academy of Sciences.
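The idea of a long-form identifier that carries its authority and data type can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `authority:data_type:local_id` layout, the type labels, the record number, and the use of a name-based UUID are hypothetical and are not taken from the Mindat service, whose actual identifier syntax and UUID minting are not described in the abstract.

```python
import uuid
from dataclasses import dataclass

# Hypothetical data-type labels, loosely following the record kinds named in the abstract.
DATA_TYPES = {"geomaterial", "locality", "occurrence", "reference", "photo", "specimen"}


@dataclass(frozen=True)
class LongFormId:
    """Illustrative long-form identifier: authority + data type + local record number."""
    authority: str
    data_type: str
    local_id: int

    def __str__(self) -> str:
        # Assumed layout; the real Mindat identifier structure may differ.
        return f"{self.authority}:{self.data_type}:{self.local_id}"

    @classmethod
    def parse(cls, text: str) -> "LongFormId":
        authority, data_type, local_id = text.split(":")
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type}")
        return cls(authority, data_type, int(local_id))

    def to_uuid(self) -> uuid.UUID:
        # Assumption: derive a deterministic, name-based UUID (version 5) from the string form.
        return uuid.uuid5(uuid.NAMESPACE_URL, str(self))


# Made-up record number, used only to show the round trip.
example = LongFormId.parse("mindat:geomaterial:1234")
```

Because the context is part of the identifier itself, two records that share a name but differ in type (for example a mineral and a locality) cannot be confused, which is the problem with name-based linking that the abstract describes.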
8. An Approach to Trustworthy Article Ranking by NLP and Multi-Layered Analysis and Optimization
- Keywords: Artificial intelligence; Classification (of information); Learning algorithms; Learning systems; Search engines; Factors analysis; Impact factor; Multi-layered; Multi-layered factor analyze; Optimisations; Rapid growth; Scientific publications; Similarity computation; Three-layer; Trustworthiness ranking
- Li, Chenhao; Zhang, Jiyin; Chen, Weilin; Ma, Xiaogang
- 《Algorithms》
- 2025
- Vol. 18
- Issue 7
- Journal
The rapid growth of scientific publications, coupled with rising retraction rates, has intensified the challenge of identifying trustworthy academic articles. To address this issue, we propose a three-layer ranking system that integrates natural language processing and machine learning techniques for relevance and trust assessment. First, we apply BERT-based embeddings to semantically match user queries with article content. Second, a Random Forest classifier is used to eliminate potentially problematic articles, leveraging features such as citation count, Altmetric score, and journal impact factor. Third, a custom ranking function combines relevance and trust indicators to score and sort the remaining articles. Evaluation using 16,052 articles from Retraction Watch and Web of Science datasets shows that our classifier achieves 90% accuracy and 97% recall for retracted articles. Citations emerged as the most influential trust signal (53.26%), followed by Altmetric and impact factors. This multi-layered approach offers a transparent and efficient alternative to conventional ranking algorithms, which can help researchers discover not only relevant but also reliable literature. Our system is adaptable to various domains and represents a promising tool for improving literature search and evaluation in the open science environment. © 2025 by the authors.
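The three layers described above (semantic matching, filtering of problematic articles, combined scoring) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' system: hand-written two-dimensional vectors stand in for BERT embeddings, a simple citation threshold stands in for the Random Forest classifier, and the weighted sum with `alpha` is a hypothetical ranking function. All article data are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_articles(query_vec, articles, trust_filter, alpha=0.7):
    """Return (id, score) pairs sorted from highest to lowest score."""
    ranked = []
    for art in articles:
        # Layer 1: semantic relevance between the query and the article.
        relevance = cosine(query_vec, art["embedding"])
        # Layer 2: drop articles flagged as potentially problematic.
        if not trust_filter(art):
            continue
        # Layer 3: combine relevance and trust into one score (hypothetical weighting).
        score = alpha * relevance + (1 - alpha) * art["trust"]
        ranked.append((art["id"], round(score, 3)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def enough_citations(article, minimum=10):
    """Stand-in for the classifier: keep articles with at least `minimum` citations."""
    return article["citations"] >= minimum


# Made-up articles; "trust" is an assumed score in [0, 1].
ARTICLES = [
    {"id": "A", "embedding": [1.0, 0.0], "citations": 120, "trust": 0.9},
    {"id": "B", "embedding": [0.6, 0.8], "citations": 2, "trust": 0.1},
    {"id": "C", "embedding": [0.0, 1.0], "citations": 45, "trust": 0.6},
]

RESULT = rank_articles([1.0, 0.0], ARTICLES, enough_citations)
```

Here article B is fairly relevant but is removed by the filter before scoring, so a highly relevant but untrustworthy article never reaches the final ranking.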
9. Mindat.org: The open access mineralogy database to accelerate data-intensive geoscience research
- Keywords: Mineralogy; open data; data bias; data science; data-driven discovery; Mineral Informatics: Revolutionizing Mineralogy, Petrology, and Geochemistry; NOMENCLATURE; EVOLUTION; DIVERSITY
- Ralph, Jolyon; Von Bargen, David; Martynov, Pavel; Zhang, Jiyin; Que, Xiang; Prabhu, Anirudh; Morrison, Shaunna M.; Li, Wenjia; Chen, Weilin; Ma, Xiaogang
- 《AMERICAN MINERALOGIST》
- 2025
- Vol. 110
- Issue 6
- Journal
The mindat.org website (Mindat) has been operating since October 2000 as a free, crowd-sourced, and expert-curated database particularly focused on mineral species and their occurrences worldwide. The project has transformed from a hobbyist site in the beginning into a resource that has found use in various scientific research projects and educational programs. Together with other open data resources, Mindat has helped accelerate scientific discoveries in many fields, such as mineral evolution, mineral ecology, and the co-evolution of the geosphere and biosphere. Recently, through open data efforts, machine interfaces and software packages have been established to enable flexible data discovery and download from Mindat. We assume that the data access and usage will further scale up in the next years. Although Mindat is curated by a team of geoscience and database experts across the world, the crowd-sourced records in Mindat possess some bias. In this paper, we first present an overview of the primary data subjects in Mindat and then give extensive details about the characteristics and partiality of three of the most popular data subjects: locality, mineral species, and mineral occurrence. In the discussion, we also give an outlook on appropriate data usage and future extension of data records. We hope users can obtain a more comprehensive view of the Mindat database through this paper and thus better plan their data use. We also hope more people will be inspired to contribute to the data curation work to make Mindat a sustained data ecosystem for geoscience research.
10. Streamlining geoscience data analysis with an LLM-driven workflow
- Keywords: Autonomous agents; Network security; Problem oriented languages; AI agent; Fine tuning; Geoscience data; Geoscience data analyze; Language model; Large language model; Mindat; Model-driven; Prompt engineering; Work-flows
- Zhang, Jiyin; Clairmont, Cory; Que, Xiang; Li, Wenjia; Chen, Weilin; Li, Chenhao; Ma, Xiaogang
- 《Applied Computing and Geosciences》
- 2025
- Vol. 25
- Journal
Large Language Models (LLMs) have made significant advancements in natural language processing and human-like response generation. However, training and fine-tuning an LLM to fit the strict requirements in the scope of academic research, such as geoscience, still requires significant computational resources and human expert alignment to ensure the quality and reliability of the generated content. The challenges highlight the need for a more flexible and reliable LLM workflow to meet domain-specific analysis needs. This study proposes an LLM-driven workflow that addresses the challenges of utilizing LLMs in geoscience data analysis. The work was built upon the open data API (application programming interface) of Mindat, one of the largest databases in mineralogy. We designed and developed an open-source LLM-driven workflow that processes natural language requests and automatically utilizes the Mindat API, mineral co-occurrence network analysis, and locality distribution heat map visualization to conduct geoscience data analysis tasks. Using prompt engineering techniques, we developed a supervisor-based agentic framework that enables LLM agents not only to interpret context information but also to autonomously address complex geoscience analysis tasks, bridging the gap between automated workflows and human expertise. This agentic design emphasizes autonomy, allowing the workflow to adapt seamlessly to future advancements in LLM capabilities without requiring additional fine-tuning or domain-specific embedding. By providing the comprehensive context of the task in the workflow and the professional tool, we ensure the quality of LLM-generated content without the need to embed geoscience knowledge into LLMs through fine-tuning or human alignment. Our approach integrates LLMs into geoscience data analysis, addressing the need for specialized tools while reducing the learning curve through LLM-driven interactions between users and APIs. This streamlined workflow enhances the efficiency of exploratory data analysis, as demonstrated by the several use cases presented. In our future work we will explore the scalability of this workflow through the integration of additional agents and diverse geoscience data sources. © 2024 The Authors
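One of the analysis steps named in this abstract, mineral co-occurrence network analysis, can be sketched as counting how often two minerals are reported at the same locality. The sketch below is an assumption-laden illustration: it makes no Mindat API call, the locality records are made up, and the function name and data layout are hypothetical rather than taken from the published workflow.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(localities):
    """Count, for each unordered mineral pair, the localities where both occur.

    `localities` maps a locality key to a list of mineral names. The returned
    Counter can be read as a weighted edge list of a co-occurrence network.
    """
    counts = Counter()
    for minerals in localities.values():
        # Sort and deduplicate so each pair has one canonical form per locality.
        for pair in combinations(sorted(set(minerals)), 2):
            counts[pair] += 1
    return counts


# Made-up locality records, used only to exercise the function.
LOCALITIES = {
    "loc-1": ["Quartz", "Pyrite", "Chalcopyrite"],
    "loc-2": ["Quartz", "Pyrite"],
    "loc-3": ["Calcite", "Quartz"],
}

EDGES = cooccurrence_counts(LOCALITIES)
```

In an agentic workflow of the kind described, a function like this would be one tool that an agent calls after retrieving locality data, with its edge weights passed on to a visualization step.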
