Weekly Report #11564

ladventure updated on 2017-02-21 00:27

Current version:

In recent years, with the rapid development and popularization of computer science and information technology, industry application systems have expanded rapidly and application data has grown explosively. For example, the New York Stock Exchange generates about 1 TB of transaction data every day, and the Internet Archive stores about 2 PB of data, growing at a rate of at least 20 TB every month [1]. Data from different periods has different value for applications with massive data: newly generated data is accessed much more frequently than older data [2]. According to statistics, 80% of disk data is rarely accessed, yet it remains important and must be stored in full. The challenge is how to store such data so as to achieve high-speed access at minimum storage cost.

At present, SSDs (Solid State Drives) and HDDs (Hard Disk Drives) are the main storage devices. SSDs provide high data access speed, but their price per unit of storage is high and their lifespan is short. HDDs are much cheaper per unit of storage and last longer, but their low access speed cannot meet the performance requirements of applications with massive data. Hierarchical storage [3] offers a good balance between storage cost and access speed, as applications require: data that has not been accessed for a long period is stored on HDDs, while frequently accessed data is stored on SSDs. Since only a small amount of data needs to be accessed frequently in any given period, only a few SSDs are required and most of the data resides on HDDs. A hierarchical storage strategy thus delivers good data access performance at close to the cost of HDDs alone.
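To make the placement policy concrete, here is a minimal sketch of a frequency-based tiering decision. The threshold, access counters, and file paths are hypothetical illustrations, not values taken from this work.

```python
from dataclasses import dataclass

# Hypothetical threshold: files accessed at least this many times per day
# are treated as "hot" and placed on the SSD tier. The value is an
# assumption for illustration only.
HOT_THRESHOLD = 10

@dataclass
class FileStats:
    path: str
    accesses_last_day: int

def choose_tier(stats: FileStats) -> str:
    """Place frequently accessed files on SSD, rarely accessed ones on HDD."""
    return "ssd" if stats.accesses_last_day >= HOT_THRESHOLD else "hdd"

print(choose_tier(FileStats("/data/trades_today.db", 120)))  # -> ssd
print(choose_tier(FileStats("/data/trades_2015.db", 1)))     # -> hdd
```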

For applications with massive data, distributed storage is necessary because the capacity of a single disk is insufficient.

GlusterFS [4], proposed by Z RESEARCH, is an open-source distributed file system that is widely used in cloud storage systems. GlusterFS achieves PB-scale cluster storage by linking many kinds of inexpensive x86 hosts over Infiniband RDMA [5] or TCP/IP [6] into a large-scale parallel network file system. Given the large differences in access frequency between different data, GlusterFS is used to build a hierarchical distributed storage cluster: a few SSDs and many HDDs, with their different access speeds, are connected into a single cluster. An online volume with high access speed is created on the SSDs, and an offline volume with low unit storage cost is created on the HDDs. In this way, the storage cluster provides good data access performance under the constraints of storage capacity and hardware cost.
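As a sketch of how such a two-tier cluster might be assembled, the fragment below drives the standard gluster command line from Python to create one volume on SSD-backed bricks and one on HDD-backed bricks. The host names and brick paths are hypothetical, and a working GlusterFS installation with already-probed peers is assumed.

```python
import subprocess

def gluster(*args: str) -> None:
    """Run a gluster CLI command, raising if it fails."""
    subprocess.run(["gluster", *args], check=True)

# Hypothetical hosts and brick paths; adjust to the actual cluster layout.
# Online volume on SSD bricks: the fast tier for frequently accessed data.
gluster("volume", "create", "online",
        "ssd-host1:/bricks/ssd/online", "ssd-host2:/bricks/ssd/online")
gluster("volume", "start", "online")

# Offline volume on HDD bricks: the cheap tier for rarely accessed data.
gluster("volume", "create", "offline",
        "hdd-host1:/bricks/hdd/offline", "hdd-host2:/bricks/hdd/offline")
gluster("volume", "start", "offline")
```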

In this distributed hierarchical storage system, data access frequency is dynamic. To keep the system performing well at all times, data must be migrated frequently between the different storage devices. Among the many methods of migrating data between the offline and online volumes, the most common is to migrate the data files of the offline database. Statistics show, however, that the migration of large data files may fail, or the migrated data may be unusable, for a variety of reasons: an unexpected machine crash during migration, for example, or data loss of varying degrees will leave the data inconsistent [7]. Therefore, to guarantee data availability before and after the migration of large data files, the files must be verified once migration completes.

At present, there are two main approaches to ensuring data availability after migration. The first compares selected data before and after migration using spot checks and partial statistics [8]. This is a sampling approach: it can guarantee consistency only for the sampled portion of the data, not for all of it. The second ensures consistency during transfer using SHA1 [9], MD5 checksums [10], or another verification method. Because every bit is checked during transfer, this approach guarantees the integrity of all the data after migration, but it takes considerably more time than migration alone.
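As a minimal sketch of the checksum-based approach, the fragment below streams both copies of a file through MD5 and compares the digests. The mount points and file name are hypothetical; hashlib is Python's standard hashing module.

```python
import hashlib

def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 in 1 MiB chunks and return the hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical source (HDD offline volume) and destination (SSD online volume).
src = "/mnt/offline/bigtable.db"
dst = "/mnt/online/bigtable.db"

# Verification here is a full extra pass over both copies, which is why it
# adds substantial time on top of the migration itself.
assert file_digest(src) == file_digest(dst), "migrated data is inconsistent"
```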

In this paper, we use a parallel method that pipelines data verification with data migration. The goal is to verify data availability and integrity with very little extra time beyond the migration itself. While guaranteeing data availability and integrity, our method significantly improves migration efficiency.
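This is not the algorithm proposed in the paper, only a rough sketch, under our own assumptions (chunk size, queue depth, and MD5 are illustrative choices), of the general pipelining idea: hash each chunk while it is being copied, so that most of the verification work overlaps with the migration I/O instead of requiring a separate pass over the source.

```python
import hashlib
import queue
import threading

def migrate_and_verify(src: str, dst: str, chunk_size: int = 1 << 20) -> bool:
    """Copy src to dst while hashing the source in parallel with the writes.

    Illustrative sketch only: a writer thread drains a bounded queue of
    chunks while the main thread reads and hashes, so hashing and write
    I/O are pipelined rather than sequential.
    """
    chunks: "queue.Queue[bytes]" = queue.Queue(maxsize=8)

    def writer() -> None:
        with open(dst, "wb") as fout:
            while (chunk := chunks.get()) != b"":
                fout.write(chunk)

    t = threading.Thread(target=writer)
    t.start()
    src_hash = hashlib.md5()
    with open(src, "rb") as fin:
        while chunk := fin.read(chunk_size):
            chunks.put(chunk)       # hand the chunk to the writer thread...
            src_hash.update(chunk)  # ...and hash it concurrently
    chunks.put(b"")                 # sentinel: end of stream
    t.join()

    # One final pass over the destination confirms the copy landed intact.
    dst_hash = hashlib.md5()
    with open(dst, "rb") as f:
        while chunk := f.read(chunk_size):
            dst_hash.update(chunk)
    return src_hash.hexdigest() == dst_hash.hexdigest()
```

Because CPython's hashlib releases the GIL while digesting large buffers, the hashing in this sketch genuinely overlaps with the writer thread's I/O.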
