Application of Intelligent Data Mining in Soil Environmental Science: Analysis Based on Literature Knowledge Graph
Abstract: The rapid growth of soil environmental data and the fast development of artificial intelligence have brought new ideas and opportunities to soil environmental research. This study systematically reviews the status and frontiers of research on applying intelligent data mining (IDM) technologies in soil environmental science, summarizes the main research hotspots, and identifies the challenges ahead. Relevant literature published up to 2021 in the China National Knowledge Infrastructure (CNKI) and Web of Science databases was analyzed and visualized using bibliometric and knowledge graph methods. The results show that research on IDM applications in the soil environmental field began around 2000 and is now in a stage of rapid growth. Chinese scholars have contributed substantially and have become one of the most important research forces internationally. The literature knowledge graphs indicate that prediction and assessment of soil pollution and predictive mapping of the spatial distribution of soil organic carbon are hotspots shared by Chinese and international scholars, while Chinese scholars hold a leading position in two application areas: soil pollution source identification, and site soil pollution assessment with remediation and risk-control decision support. As problems such as data silos, data privacy protection, and model interpretability are gradually overcome, AI-based data mining is expected to have a profound influence on real-time soil environmental monitoring, assessment, prediction and early warning, and management decision-making.
Key words:
- Soil environment
- Artificial intelligence
- Data mining
- Machine learning
- Big data
Table 1. Top 10 most productive countries in soil environmental science-related research ranked by total number of publications

| Ranking | Country | Number of publications | Citation frequency (WOS Core Collection) | Citation frequency per paper |
| --- | --- | --- | --- | --- |
| 1 | China | 267 | 2950 | 11.0 |
| 2 | USA | 95 | 2217 | 23.3 |
| 3 | Germany | 43 | 1720 | 40.0 |
| 4 | Australia | 42 | 1349 | 32.1 |
| 5 | Iran | 23 | 217 | 9.4 |
| 6 | Brazil | 22 | 155 | 7.0 |
| 7 | France | 22 | 735 | 33.4 |
| 8 | India | 22 | 211 | 9.6 |
| 9 | Canada | 21 | 429 | 20.4 |
| 10 | Czech Republic | 16 | 279 | 17.4 |
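The last column of Table 1 is the citation count divided by the publication count. The following is a minimal sketch, not part of the original study, that recomputes that ratio from the values transcribed from Table 1. The names `TABLE_1`, `citations_per_paper`, and `recompute` are hypothetical and introduced only for illustration; rounding to one decimal place is an assumption based on how the table is printed.

```python
# Rows transcribed from Table 1:
# (country, publications, WOS citations, citations per paper as listed)
TABLE_1 = [
    ("China", 267, 2950, 11.0),
    ("USA", 95, 2217, 23.3),
    ("Germany", 43, 1720, 40.0),
    ("Australia", 42, 1349, 32.1),
    ("Iran", 23, 217, 9.4),
    ("Brazil", 22, 155, 7.0),
    ("France", 22, 735, 33.4),
    ("India", 22, 211, 9.6),
    ("Canada", 21, 429, 20.4),
    ("Czech Republic", 16, 279, 17.4),
]


def citations_per_paper(citations: int, publications: int) -> float:
    """Average citations per paper, rounded to one decimal (assumed table precision)."""
    return round(citations / publications, 1)


def recompute(rows):
    """Return (country, recomputed value, listed value) for every row."""
    return [
        (country, citations_per_paper(cites, pubs), listed)
        for country, pubs, cites, listed in rows
    ]


if __name__ == "__main__":
    for country, computed, listed in recompute(TABLE_1):
        print(f"{country:15s} computed={computed:5.1f} listed={listed:5.1f}")
```

Read this way, the table shows that the ranking by publication count differs from the ranking by citations per paper: China leads in volume, while Germany, France, and Australia have the highest per-paper values.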
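Table 2. Research topics and keywords

| Literature database | Research topic | Keywords |
| --- | --- | --- |
| WOS | Prediction and analysis of soil heavy metal pollution | Heavy Metal, Soil Pollution, ANN, Random Forest, Geostatistics, Kriging, Contamination Prediction, Source Analysis |
| WOS | Soil spectral data mining | Soil, Spectral, VIS-NIR, Feature Extraction, Regression, SVM, Genetic Algorithm |
| WOS | Soil organic carbon prediction and digital mapping | SOC, Climate Change, Soil Mapping, Soil pH, Microbial, Data Mining, Machine Learning, Deep Learning, Remote Sensing, Hyperspectral |
| CNKI | Soil pollution prediction and analysis | Heavy metals, soil, soil environmental quality assessment, spectroscopy, early warning and prediction, artificial neural network, support vector machine, genetic algorithm, multiple regression |
| CNKI | Site pollution assessment and remediation | Soil pollution, soil pollution risk assessment, site pollution, polycyclic aromatic hydrocarbons, remediation, influencing factors, big data, knowledge graph, GIS |
| CNKI | Soil organic carbon prediction and digital mapping | Soil organic carbon, soil organic matter, digital soil mapping, remote sensing, machine learning, data mining, random forest, decision tree |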