English Literature Translation for the Graduation Design of a Java-Based Warehouse Scheduling System
Graduation Design Thesis: English Literature and Chinese Translation
Student Name:
Student ID:
College:
Major:
Advisor:

Deduping Storage: Deduplication

I think everyone can agree that data storage is exploding at a fairly fast, some say alarming, rate. This means that administrators are having to work overtime to keep everything humming so that users don't even see the hard work going on behind the scenes. That work includes quota management, snapshots, backups, replication, preparing disaster recovery backups, off-site copies of data, restoration of user data that has been erased, monitoring data growth and data usage, and a thousand other tasks that keep things running smoothly (picture synchronized swimmers who look graceful above the water while underneath the surface their legs and hands are moving at a furious rate).

Now that I have equated storage experts to synchronized swimmers and probably upset all of them (my apologies), let's look at a new technology that is trying to make their life easier while also saving money. This technology is called data deduplication. While it is something of a new technology, I hope to show that it is really an older technology with a new twist that can be used to great effect on many storage systems. Without further ado, let's examine data deduplication.

Introduction

Data deduplication is, quite simply, removing copies (duplicates) of data and replacing them with pointers to the first (unique) copy of the data. Fundamentally, this technology helps reduce the total amount of storage. This can result in many things:

- Saving money (no need to buy additional capacity)
- Reducing the size of backups, snapshots, etc. (saves money, time, etc.)
- Reducing power requirements (less disk, less tape, etc.)
- Reducing network requirements (less data to transmit)
- Saving time
- Making disk backups more feasible, since the amount of storage is reduced

These results are the fundamental reason that data deduplication technology is all the rage at the moment. Who doesn't like saving money, time, network bandwidth, etc.? But as with everything, the devil is always in the details. This article presents the concepts and issues in data deduplication.

Deduplication is really not a new technology; it is an outgrowth of compression. Compression searches a single file for repeated binary patterns and replaces duplicates with pointers to the original, unique piece of data. Data deduplication extends this concept to include deduplication:

- Within files (just like compression)
- Across files
- Across applications
- Across clients
- Over time

A quick illustration of deduplication versus compression: if you have two files that are identical, compression removes duplicates within each file independently, while data deduplication recognizes that the files are duplicates and only stores the first one. In addition, it can also search the first file for duplicate data, further reducing the size of the stored data (a la compression).

A very simple example of data deduplication is derived from an EMC video.

Figure 1: Data Deduplication Example

In this example there are three files. The first file, document1.docx, is a simple Microsoft Word file that is 6MB in size. The second file, document2.docx, is just a copy of the first file but with a different file name. And finally, the last file, document_new.docx, is derived from document1.docx but with some small changes to the data and is also 6MB in size.

Let's assume that a data deduplication process divides the files into 6 pieces (this is a very small number and is for illustrative purposes only).
The first file has pieces A, B, C, D, E, and F. The second file, since it is a copy of the first file, has the exact same pieces. The third file has one piece changed, labeled G, and is also 6MB in size. Without data deduplication, a backup of the files would have to back up 18MB of data (6MB times 3). But with data deduplication, only the first file and the new piece G in the third file are backed up, a total of 7MB of data. One additional feature that data deduplication offers is that after the backup, the pieces A, B, C, D, E, F, and G are typically stored in a list (sometimes called an index). Then, when new files are backed up, their pieces are compared to the ones that have already been backed up. This is a feature of doing data deduplication over time.

One of the first questions asked after "what is data deduplication?" is "what level of deduplication can I expect?" The specific answer depends upon the details of the situation and the dedup implementation, but EMC quotes a range of 20:1 to 50:1 over a period of time.

Devilish Details

Data deduplication is not a "standard" in any sense, so all of the implementations are proprietary. Therefore, each product does things differently. Understanding the fundamental differences is important for determining when and if they fit into your environment. Typically deduplication technology is used in conjunction with backups, but it is not necessarily limited to that function. With that in mind, let's examine some of the ways deduplication can be done.

There are really two main types of deduplication with respect to backups: target-based and source-based. The difference is fairly simple. Target-based deduplication dedups the data after it has been transferred across the network for backup. Source-based deduplication dedups the data before it is backed up. This difference is important for understanding the typical ways that deduplication is deployed.

With target-based deduplication, the work is typically done by a device such as a Virtual Tape Library (VTL). When using a VTL, the data is passed to the backup server and then to the VTL, where it is deduped. So the data is sent across the network without being deduped, increasing the amount of data transferred. But the target-based approach does allow you to continue to use your existing backup tools and processes. Alternatively, in a remote backup situation where you communicate over the WAN, network bandwidth is important; if you still want target-based deduplication, the VTL is placed near the servers to dedup the data before it is sent over the network to the backup server.

The opposite of target-based dedup is source-based deduplication. In this case the deduplication is done by the backup software. The backup software on the clients talks to the backup software on the backup server to dedup the data before it is transmitted to the backup server. In essence, the client sends the pieces of each file that are to be backed up to the backup software, which compares them to pieces that have already been backed up. If a duplicate is found, a pointer is created to the unique piece of data that has already been stored.

Source-based dedup can greatly reduce the amount of data transmitted over the network, although there is some traffic from the clients to the backup server for deduping the data. In addition, since the dedup takes place in software, no additional hardware is needed.
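To make the piece-and-index bookkeeping described above a little more concrete, here is a minimal sketch in Java. The class name, the fixed 1MB piece size, and the in-memory index are illustrative choices of mine, not anything from the article; real products use proprietary (often variable-size) chunking, persistent indexes, and the hash fingerprints discussed in the next section.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy illustration of the piece/index idea: files are cut into fixed-size
 * pieces, a piece already present in the index is replaced by a pointer
 * (its ID), and only new pieces are actually stored.
 */
public class ToyDedupStore {
    private static final int PIECE_SIZE = 1 << 20;                  // 1 MB pieces, illustrative only
    private final Map<ByteBuffer, Integer> index = new HashMap<>(); // piece content -> piece ID
    private final List<byte[]> pieces = new ArrayList<>();          // unique pieces actually stored

    /** "Backs up" one file and returns its recipe: the list of piece IDs. */
    public List<Integer> backup(Path file) throws IOException {
        byte[] data = Files.readAllBytes(file);
        List<Integer> recipe = new ArrayList<>();
        for (int off = 0; off < data.length; off += PIECE_SIZE) {
            byte[] piece = Arrays.copyOfRange(data, off, Math.min(off + PIECE_SIZE, data.length));
            // If this piece was seen before (in this file or any earlier one),
            // reuse its ID instead of storing the bytes again.
            Integer id = index.get(ByteBuffer.wrap(piece));
            if (id == null) {
                id = pieces.size();
                pieces.add(piece);
                index.put(ByteBuffer.wrap(piece), id);
            }
            recipe.add(id);
        }
        return recipe;
    }

    /** Bytes actually stored after deduplication. */
    public long storedBytes() {
        return pieces.stream().mapToLong(p -> p.length).sum();
    }
}
```

Running the three files from Figure 1 through something like this would store pieces A through G exactly once, roughly 7MB instead of 18MB, and each file would be reduced to a short recipe of pointers into the index.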
The catch with source-based dedup is that you have to use specialized backup software, so you may have to give up your existing backup tools to gain the dedup capability.

So far deduplication looks pretty easy, and the fundamental concepts are fairly simple, but many details have been left out. There are many parts of the whole deduplication technology that have to be developed, integrated, and tested for reliability (it is your data, after all). Deduplication companies differentiate themselves by these details. Is the deduplication technology target-based or source-based? What is the nature of the device and/or software? At what level is the deduplication performed? How are the data pieces compared to find duplicates? And so on.

Before diving into a discussion about deduplication deployment, let's talk about dedup algorithms. Recall that deduplication can happen on a file basis, a block basis (the definition of a block is up to the specific dedup implementation), or even at the bit level. It is extremely inefficient to perform deduplication by taking raw pieces of data and comparing them to everything in an index. To make things easier, dedup algorithms produce a hash of the data piece being deduped using something like MD5 or SHA-1. This hash should produce a practically unique number for the specific piece of data that can be easily compared to the hashes stored in the dedup index.

One of the problems with using these hash algorithms is hash collisions. A hash collision is something of a "false positive": the hash for a piece of data may actually correspond to a different piece of data (i.e., the hash is not unique). Consequently, a piece of data may not be backed up because it has the same hash as one stored in the index, even though the data itself is different. Obviously this can lead to data corruption. So what data dedup companies do is use several hash algorithms, or combinations of them, to make sure a piece of data truly is a duplicate. In addition, some dedup vendors will use metadata to help identify and prevent collisions.

Getting an idea of the likelihood of a hash collision requires a little bit of math. An article on the subject does a pretty good job of explaining the odds; the basic conclusion is that the odds are about 1 in 2^160, which is a huge number. Put another way, if you have 95 EB (exabytes; 1 EB is 1,000 petabytes) of data, then you have a 0.00000000000001110223024625156540423631668090820313% chance of getting a false positive in the hash comparison and throwing away a piece of data you should have kept. Given the size of 95 EB, it is not likely you will encounter this even over an extended period of time. But never say never (after all, someone once predicted we'd only need 640KB of memory).

Implementation

Choosing one solution over another is a bit of an art and requires careful consideration of your environment and processes. The previously mentioned video has a couple of rules of thumb based on the fundamental difference between source-based and target-based deduplication. Source-based dedup approaches are good for situations where network bandwidth is at a premium, such as file systems (you don't want to transfer the entire file system just to deduplicate it and pass back the results), VMware storage, and remote or branch offices (network bandwidth to a central backup server may be rather limited). Don't forget that for source-based dedup you will likely have to switch backup tools to get the dedup features.

On the other hand, target-based deduplication works well for SANs, LANs, and possibly databases.
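Before returning to those deployment trade-offs, here is a minimal sketch of the fingerprinting idea from the Devilish Details section above. It assumes, as a plausible illustration rather than any vendor's actual scheme, that SHA-1 and MD5 (the two algorithms named in the article) are combined with the piece length as a bit of metadata to guard against collisions. For a 160-bit hash, the standard birthday approximation puts the chance of any collision among n unique pieces at roughly n^2 / 2^161, which is where vanishingly small figures like the one quoted above come from.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Fingerprints a piece of data so it can be looked up in a dedup index
 * without comparing raw bytes. Two hashes plus the piece length are
 * combined into one key, making an accidental match between two
 * different pieces even less likely than with a single hash.
 */
public final class PieceFingerprint {

    /** Returns a key such as "sha1:...:md5:...:len:1048576". */
    public static String of(byte[] piece) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return "sha1:" + hex(sha1.digest(piece))
                 + ":md5:" + hex(md5.digest(piece))
                 + ":len:" + piece.length;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // both algorithms ship with the JDK
        }
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] first = "piece A".getBytes(StandardCharsets.UTF_8);
        byte[] copy = "piece A".getBytes(StandardCharsets.UTF_8);
        // Identical content produces identical fingerprints, so the copy
        // would be replaced by a pointer instead of being stored again.
        System.out.println(PieceFingerprint.of(first).equals(PieceFingerprint.of(copy)));
    }
}
```

In a source-based setup, the clients and the backup server can exchange short fingerprints like these rather than the pieces themselves, which is why so little data has to cross the network.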
Target-based dedup fits SANs and LANs because moving the data around a local network is not very expensive, and you may already have your backup packages chosen and in production.

Finally, the video also claims that source-based dedup can achieve a deduplication ratio of 50:1 and that target-based dedup can achieve 20:1. Both levels are very impressive. There are a number of articles that discuss how to estimate the deduplication ratio you can achieve; a ratio of 20:1 certainly seems possible.

There are many commercial deduplication products. Any list in this article is incomplete and is not meant as a slight toward any particular company. Nevertheless, here is a quick list of companies providing deduplication capabilities:

- Exagrid
- NEC HydraStor
- Symantec NetBackup PureDisk
- Data Domain (owned by EMC)
- FalconStor
- EMC Avamar
- Sepaton
- Commvault
- NetApp (deduplication is available in their products)
- Quantum

These are a few of the solutions that are available. There are some smaller companies that offer deduplication products as well.

Deduplication and Open-Source

There are not very many (any?) deduplication projects in the open-source world. However, you can use a target-based deduplication device, because it allows you to keep your existing backup software, which could be open-source. It is suggested you talk to the vendor to make sure they have tested it with Linux.

The only deduplication project that could be found is called LessFS. It is a FUSE-based file system that has built-in deduplication. It is still early in the development process, but it has demonstrated deduplication capabilities and has incorporated encryption (ah, the beauty of FUSE).

Summary

This has been a fairly short introductory article on deduplication technology, one of the hot technologies in storage right now. It holds the promise of saving money through a reduction in the hardware needed to store data, as well as a reduction in network bandwidth.

This article is intended to whet your appetite for examining data deduplication and how it might (or might not) be applicable to your environment. Take a look at the various articles on the net (there has been some hype around the technology) and judge for yourself whether this is something that might work for you. If you want to try an open-source project, there aren't very many (any) at all. The only one that could be found is LessFS, the FUSE-based file system that incorporates deduplication.
But it might be worth investigating, perhaps using it for secondary storage rather than as your primary file storage.

存储重复数据删除

我相信所有人都会同意,数据存储正在以飞快的、甚至是令人震惊的速度增长。这意味着为了不影响普通用户的正常使用,存储管理员们不得不加班加点地在幕后工作。他们鲜为人知的工作包括:配额管理、快照(snapshot)、数据备份、数据复制(replication)、为灾难恢复而做的数据备份、离线数据拷贝、已删除用户数据的恢复、监测数据增长和数据使用率,以及其他为确保应用平稳运行所做的数以千计的工作(正如花样游泳,水面上看起来非常优雅美观,而在水下,运动员的腿和手臂不得不飞快地摆动)。

这只是关于重复数据删除的简单介绍。重复数据删除是现在存储业界最热门的技术之一。它承诺能够减少备份数据所需要的物理空间和传输数据所需要的带宽,因而可以为企业或组织节约IT花费。

简介

重复数据删除其实很简单:遇到重复数据时不再保存重复数据的副本,取而代之的是增加一个指向第一份(也是唯一一份)数据的指针。从根本上讲,它能减少存储数据所占用的空间。这会带来如下好处:

- 节约IT经费(不需要为额外空间增加投资)
- 减小备份数据、数据快照等的大小(节约经费、时间等)
- 减少电力消耗(因为需要更少的硬盘、更少的磁带等)
- 节约网络带宽(因为只需要传输更少的数据)
- 节约时间
- 因为所需存储空间减少,磁盘备份变得更加可行

上面这些好处也正是重复数据删除技术风靡当前的根本原因。又有谁会不喜欢节约经费、时间和网络带宽呢?但是像很多美好的东西一样,魔鬼存在于细节之中。本文将会介绍重复数据删除方面的概念及存在的问题。

重复数据删除绝不是新事物,事实上它只是数据压缩的衍生品。数据压缩在单个文件范围内查找重复的数据,并代之以指向第一份数据的指针。重复数据删除把这个概念进行如下扩展:

- 单个文件范围内(跟数据压缩完全一致)
- 跨文件
- 跨应用
- 跨客户端
- 跨时间

重复数据删除与数据压缩的主要区别在于:假如你有两个完全相同的文件,数据压缩会对每个文件独立地进行重复数据的排除并代之以指向第一份数据的指针;而重复数据删除则能分辨出两个文件完全相同,从而只保存第一个文件。而且,它还能像数据压缩一样排除掉第一个文件里的重复数据,进一步减小所存储数据的大小。

下面是一个简单的重复数据删除的例子,来自EMC视频。该例中一共有三个文件。第一个文件document1.docx是一个大小为6MB的简单Word文档。第二个文件document2.docx是第一个文件的拷贝,只是文件名不同。最后一个文件document_new.docx在document1.docx的基础上进行了某些小的修改,其大小仍旧为6MB。

假设重复数据删除程序会把每个文件分割成6个部分(这个数字在实际应用中太小,这里只是为了说明用)。第一个文件有A、B、C、D、E和F六个部分。第二个文件既然是第一个文件的拷贝,所以会被分成完全相同的六个部分。第三个文件相比前两个文件只有一个部分发生了变化(标记为G),其大小仍旧是6MB。在不使用重复数据删除的情况下,备份这些文件需要18MB的空间(6MB乘以3);而使用重复数据删除技术,只需要备份第一个文件和第三个文件的G部分,总共大约7MB。重复数据删除技术还有一个特性:备份后,A、B、C、D、E、F、G各部分存储在一个列表中(有时也称为索引)。当新的文件被备份到同一系统时,新文件的各部分会与索引中已备份的部分进行比较。这就是跨时间的重复数据删除。

关于重复数据删除,在“什么是重复数据删除?”之后的下一个问题通常是“它能带来多大程度的数据缩减?”这个问题的答案依赖于使用场合和重复数据删除技术的具体实现。据EMC的数据,经过一段时间的稳定运行后,重复数据删除率在20:1到50:1之间。

细节之魔

重复数据删除技术无论从哪一种角度看都不是一种