Design of a Hive-based system for efficient analysis of massive alpine grassland data

李亮丹; 晔沙; 谢夏; 胡月明; 谢健文; 周悟; 游小敏

Article abstract

High-efficient analysis system for massive data of alpine grassland based on Hive

Submitted: 2021-08-17

DOI: 10.13254/j.jare.2021.0530

Keywords: alpine grassland, massive data, storage and analysis, Hive

Funding: National Key Research and Development Program of China (2020YFD1100204); Sichuan Science and Technology Program (2020YFS0499)

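Author	Affiliation	E-mail
李亮丹	College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642
晔沙	College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642
谢夏	College of Tropical Crops, Hainan University, Haikou 570228
胡月明	College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642; College of Tropical Crops, Hainan University, Haikou 570228; Qinghai-Guangdong Joint Key Laboratory of Natural Resources Monitoring and Evaluation, Xining 810016; South China Academy of Natural Resources Science and Technology, Guangzhou 510642
谢健文	College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642; Qinghai-Guangdong Joint Key Laboratory of Natural Resources Monitoring and Evaluation, Xining 810016; South China Academy of Natural Resources Science and Technology, Guangzhou 510642
周悟	College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642
游小敏	South China Academy of Natural Resources Science and Technology, Guangzhou 510642	285865210@qq.com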

Abstract views: 1110

Full-text downloads: 790

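Abstract (translated from the Chinese):

Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, which in turn depends on supporting data. This study designed and implemented a Hive-based system for efficient analysis of massive alpine grassland data, capable of storing and analyzing such data reliably and efficiently. First, the platform was built on a Hadoop, Hive, and Sqoop environment through steps including node and cluster configuration. Next, data extraction, transformation, and loading (ETL) and data storage were completed through data filling with the expectation-maximization (EM) algorithm, data import, and partitioned data storage. Finally, the system implemented fuzzy queries through mixed function coding, and experimental tests showed that the system achieved the intended results. As file size and overall data volume grew, the system's total storage and read times increased steadily, but the average running time (the average time taken to process 1 MB of data) trended downward, indicating that the system's capacity for parallel processing of massive data became apparent as data volume increased. Query efficiency was compared between the Hive cluster and the relational database SQL Server using 2014 alpine grassland quadrat monitoring data from Chengduo County, Qinghai Province, together with some virtual (simulated) data, about 39.58 million records in total (7.56 GB). When the queried data volume was 39.58 million records, the Hive cluster's query time was 67.8% of SQL Server's, indicating that the system queries more efficiently than SQL Server at large data volumes. Alpine grassland ecological data were analyzed with HiveQL and corresponding control experiments were run; the comparison showed that the Hive analysis produced the same results as the control experiments. In summary, applying distributed data warehouse technology to the storage and analysis of massive alpine grassland data is a clear advance over traditional data storage and analysis techniques. The system processes massive data efficiently, has strong developability, and can meet the storage and analysis requirements of massive alpine grassland data well.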
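
The average running time used above (time spent per 1 MB of data) is simple arithmetic, and a short sketch may make the reported trend easier to picture. The function name and the timings below are hypothetical placeholders chosen only for illustration; they are not measurements from the study.

```python
def avg_time_per_mb(total_seconds: float, size_mb: float) -> float:
    """Return the average running time: seconds spent per 1 MB of data."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return total_seconds / size_mb


# Hypothetical (size in MB, total seconds) pairs; NOT measurements from the study.
hypothetical_runs = [(64, 16.0), (256, 40.0), (1024, 96.0)]

for size_mb, total_seconds in hypothetical_runs:
    per_mb = avg_time_per_mb(total_seconds, size_mb)
    print(f"{size_mb:>5} MB  total {total_seconds:6.1f} s  average {per_mb:.5f} s/MB")
```

In these made-up numbers the total time rises with file size while the per-MB figure falls, which is the pattern the abstract describes.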

English abstract:

Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, and this requires relevant data as support. This paper designs and implements a Hive-based high-performance system for analyzing massive alpine grassland data, which can store and analyze such data reliably and efficiently. First, the platform was built on the Hadoop, Hive, and Sqoop environments through steps such as node and cluster configuration. Then, data ETL (extract-transform-load) and data storage were completed through data filling with the EM (expectation-maximization) algorithm, data import, and partitioned data storage. Finally, the system realized a fuzzy query function through mixed function coding, and tests showed that it achieved the intended effect. As file size and overall data size increased, the system's overall storage and reading times kept increasing, whereas the average running time (the average time for processing 1 MB of data) decreased, reflecting the system's ability to process large amounts of data in parallel as data volume grows. The query efficiency of the Hive cluster and the relational database SQL Server was compared using 2014 alpine grassland quadrat monitoring data from Chengduo County, Qinghai Province, plus some virtual data, with a total volume of approximately 39.58 million records (7.56 GB). When the queried data volume was 39.58 million records, the Hive cluster's query time was 67.8% of that of SQL Server, indicating that at large data volumes the system's query efficiency is higher than that of SQL Server. The ecological data of alpine grassland were analyzed and processed with HiveQL, and a corresponding control experiment was carried out. The comparison found that the Hive data analysis produced the same results as the control experiment. In summary, applying distributed data warehouse technology to the storage and analysis of massive alpine grassland data is a significant improvement over traditional data storage and analysis technologies. The system processes massive data efficiently, has strong developability, and can well meet the storage and analysis requirements of massive alpine grassland data.
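
The abstract names the EM algorithm for data filling but does not state the underlying model. As an illustration only, the sketch below assumes the numeric fields follow a multivariate normal distribution and fills missing values (NaN) by alternating conditional-mean imputation (E-step) with re-estimation of the mean and covariance (M-step). The function name `em_impute`, the ridge term, and the sample values are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def em_impute(X, n_iter=100, tol=1e-6, ridge=1e-6):
    """Fill NaNs in a 2-D array by EM under an assumed multivariate normal model."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)
    # Start from column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    mu = filled.mean(axis=0)
    centered = filled - mu
    sigma = centered.T @ centered / n + ridge * np.eye(d)

    for _ in range(n_iter):
        previous = filled.copy()
        correction = np.zeros((d, d))  # accumulated conditional covariances
        # E-step: replace each missing block by its conditional mean.
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():  # row with nothing observed
                filled[i] = mu
                correction += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            gain = np.linalg.solve(s_oo, s_mo.T).T  # s_mo @ inv(s_oo)
            filled[i, m] = mu[m] + gain @ (filled[i, o] - mu[o])
            correction[np.ix_(m, m)] += sigma[np.ix_(m, m)] - gain @ s_mo.T
        # M-step: re-estimate mean and covariance from the completed data.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = (centered.T @ centered + correction) / n + ridge * np.eye(d)
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled


# Hypothetical two-column sample (second column = 2 * first + 1) with one gap.
sample = np.array([[1, 3], [2, 5], [3, np.nan], [4, 9], [5, 11], [6, 13]], dtype=float)
completed = em_impute(sample)
print(completed[2])
```

Observed entries are left unchanged; only the NaN cells are replaced. A real pipeline would run such a step before loading the cleaned records into Hive, though how the authors integrated it is not described in the abstract.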
