生物信息导论英文论文Practical Suffix Tree Construction 生物信息导论英文论文Practical Suffix Tree Construction.doc
《生物信息导论英文论文Practical Suffix Tree Construction 生物信息导论英文论文Practical Suffix Tree Construction.doc》由会员分享,可在线阅读,更多相关《生物信息导论英文论文Practical Suffix Tree Construction 生物信息导论英文论文Practical Suffix Tree Construction.doc(24页珍藏版)》请在三一办公上搜索。
1、Practical Suffix Tree ConstructionSandeep Tata Richard A. Hankins Jignesh M. PatelUniversity of MichiganAbstractLarge string datasets are common in a numberof emerging text and biological database applications.Common queries over such datasets includeboth exact and approximate string matches. Theseq
2、ueries can be evaluated very efficiently by usinga suffix tree index on the string dataset. Althoughsuffix trees can be constructed quickly in memoryfor small input datasets, constructing persistenttrees for large datasets has been challenging.In this paper, we explore suffix tree constructionalgori
3、thms over a wide spectrum of data sourcesand sizes. First, we show that on modern processors,a cache-efficient algorithm with O(n2) complexityoutperforms the popular O(n) Ukkonenalgorithm, even for in-memory construction. Forlarger datasets, the disk I/O requirement quicklybecomes the bottleneck in
4、each algorithms performance.To address this problem, we present abuffer management strategy for the O(n2) algorithm,creating a new disk-based construction algorithmthat scales to sizes much larger than havebeen previously described in the literature. Ourapproach far outperforms the best known diskba
5、sedconstruction algorithms.1 IntroductionQuerying large string datasets is becoming increasinglyimportant in a number of emerging text and life sciencesapplications. Life science researchers are often interestedin explorative querying of large biological sequencedatabases, such as genomes and large
6、sets of protein sequences.Many of these biological datasets are growingat exponential rates for example, the sizes of the sequencedatasets in GenBank have been doubling every six-Permission to copy without fee all or part of this material is granted providedthat the copies are not made or distribute
7、d for direct commercialadvantage, the VLDB copyright notice and the title of the publication andits date appear, and notice is given that copying is by permission of theVery Large Data Base Endowment. To copy otherwise, or to republish,requires a fee and/or special permission from the Endowment.Proc
8、eedings of the 30th VLDB Conference,Toronto, Canada, 2004teen months 31. Consequently, methods for efficientlyquerying large string datasets are critical to the success ofthese emerging database applications.Suffix trees are versatile data structures that can helpexecute such queries very efficientl
9、y. In fact, suffix treesare useful for solving a wide variety of string based problems17. For instance, the exact substring matching problemcan be solved in time proportional to the length of thequery, once the suffix tree is built on the database string.Suffix trees can also be used to solve approx
10、imate stringmatching problems efficiently. Some bioinformatics applicationssuch as MUMmer 10, 11, 22, REPuter 23,and OASIS 25 exploit suffix trees to efficiently evaluatequeries on biological sequence datasets. However, suffixtrees are not widely used because of their high cost of construction.As we
11、 show in this paper, building a suffix treeon moderately sized datasets, such as a single chromosomeof the human genome, takes over 1.5 hours with the bestknown existing disk-based construction technique 18. Incontrast, the techniques that we develop in this paper reducethe construction time by a fa
12、ctor of 5 on inputs of thesame size.Even though suffix trees are currently not in widespreaduse, there is a rich history of algorithms for constructingsuffix trees. A large focus of previous research has been onlinear-time suffix tree construction algorithms 24, 32, 33.These algorithms are well suit
13、ed for small input stringswhere the tree can be constructed entirely in main memory.The growing size of input datasets, however, requires thatwe construct suffix trees efficiently on disk. The algorithmsproposed in 24, 32, 33 cannot be used for disk-based constructionas they have poor locality of re
14、ference. This poorlocality causes a large amount of random disk I/O once thedata structures no longer fit in main memory. If we naivelyuse these main-memory algorithms for on-disk suffix treeconstruction, the process may take well over a day for asingle human chromosome.Large (and rapidly growing) s
15、ize of many string datasetsunderscores the need for fast disk-based suffix tree constructionalgorithms. A few recent research efforts havealso considered this problem 4,18, though neither of theseapproaches scales well for large datasets (such as a largechromosome, or an entire eukaryotic genome).In
16、 this paper, we present a new approach to efficiently36construct suffix trees on disk. We use a philosophy similarto the one in 18. We forgo the use of suffix links in returnfor a much better memory reference pattern, which translatesto better scalability and performance for large trees.The main con
17、tributions of this paper are as follows:1. We introduce the “Top Down Disk-based” (TDD)approach to building suffix trees efficiently for awide range of sizes and input types. This technique,includes a suffix tree construction algorithmcalled PWOTD, and a sophisticated buffer managementstrategy.2. We
18、 compare the performance of TDD with the popularUkkonens algorithm 32 for the in-memory case,where all the data structures needed for building thesuffix trees are memory resident (i.e. the datasets are“small”). Interestingly, we show that even thoughUkkonen has a better worst case theoretical comple
19、xity,TDD outperforms Ukkonen on modern cachedprocessors, since TDD incurs significantly fewer processorcache misses.3. We systematically explore the space of data sizes andtypes, and highlight the advantages and disadvantagesof TDD with respect to other construction algorithms.4. We experimentally d
20、emonstrate that TDD scalesgracefully with increasing input size. Using the TDDprocess, we are able to construct a suffix tree on theentire human genome in 30 hours (on a single processormachine)! To our knowledge, suffix tree constructionon an input string of this size (3 billion symbolsapprox.) has
21、 yet to be reported in literature.The remainder of this paper is organized as follows:Section 2 discusses related work. The TDD technique isdescribed in Section 3, and we analyze the behavior of thisalgorithm in Section 4 . Section 5, presents the experimentalresults, and Section 6 presents our conc
22、lusions.2 Related WorkLinear time algorithms for constructing suffix trees havebeen described byWeiner 33, McCreight 24, and Ukkonen32. Ukkonens is a popular algorithm because itis easier to implement than the other algorithms. It isan O(n), in-memory construction algorithm based on theclever observ
23、ation that constructing the suffix tree can beperformed by iteratively expanding the leaves of a partiallyconstructed suffix tree. Through the use of suffix links,which provide a mechanism for quickly traversing acrosssub-trees, the suffix tree can be expanded by simply addingthe i+1 character to th
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- 生物信息导论英文论文Practical Suffix Tree Construction 生物 信息 导论 英文 论文 Practical
链接地址:https://www.31ppt.com/p-2386387.html