Introduction to Bioinformatics, English-language paper: Practical Suffix Tree Construction (Practical Suffix Tree Construction.doc)

Practical Suffix Tree Construction

Sandeep Tata, Richard A. Hankins, Jignesh M. Patel
University of Michigan

Abstract

Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging.

In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.

1 Introduction

Querying large string datasets is becoming increasingly important in a number of emerging text and life sciences applications.
Life science researchers are often interested in explorative querying of large biological sequence databases, such as genomes and large sets of protein sequences. Many of these biological datasets are growing at exponential rates; for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months [31]. Consequently, methods for efficiently querying large string datasets are critical to the success of these emerging database applications.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

Suffix trees are versatile data structures that can help execute such queries very efficiently. In fact, suffix trees are useful for solving a wide variety of string-based problems [17]. For instance, the exact substring matching problem can be solved in time proportional to the length of the query, once the suffix tree is built on the database string. Suffix trees can also be used to solve approximate string matching problems efficiently. Some bioinformatics applications such as MUMmer [10, 11, 22], REPuter [23], and OASIS [25] exploit suffix trees to efficiently evaluate queries on biological sequence datasets. However, suffix trees are not widely used because of their high cost of construction. As we show in this paper, building a suffix tree on moderately sized datasets, such as a single chromosome of the human genome, takes over 1.5 hours with the best known existing disk-based construction technique [18].
In contrast, the techniques that we develop in this paper reduce the construction time by a factor of 5 on inputs of the same size.

Even though suffix trees are currently not in widespread use, there is a rich history of algorithms for constructing suffix trees. A large focus of previous research has been on linear-time suffix tree construction algorithms [24, 32, 33]. These algorithms are well suited for small input strings where the tree can be constructed entirely in main memory. The growing size of input datasets, however, requires that we construct suffix trees efficiently on disk. The algorithms proposed in [24, 32, 33] cannot be used for disk-based construction as they have poor locality of reference. This poor locality causes a large amount of random disk I/O once the data structures no longer fit in main memory. If we naively use these main-memory algorithms for on-disk suffix tree construction, the process may take well over a day for a single human chromosome.

The large (and rapidly growing) size of many string datasets underscores the need for fast disk-based suffix tree construction algorithms. A few recent research efforts have also considered this problem [4, 18], though neither of these approaches scales well for large datasets (such as a large chromosome, or an entire eukaryotic genome).

In this paper, we present a new approach to efficiently construct suffix trees on disk. We use a philosophy similar to the one in [18]. We forgo the use of suffix links in return for a much better memory reference pattern, which translates to better scalability and performance for large trees.

The main contributions of this paper are as follows:

1. We introduce the “Top Down Disk-based” (TDD) approach to building suffix trees efficiently for a wide range of sizes and input types. This technique includes a suffix tree construction algorithm called PWOTD, and a sophisticated buffer management strategy.
2. We compare the performance of TDD with the popular Ukkonen's algorithm [32] for the in-memory case, where all the data structures needed for building the suffix trees are memory resident (i.e., the datasets are “small”). Interestingly, we show that even though Ukkonen has a better worst-case theoretical complexity, TDD outperforms Ukkonen on modern cached processors, since TDD incurs significantly fewer processor cache misses.
3. We systematically explore the space of data sizes and types, and highlight the advantages and disadvantages of TDD with respect to other construction algorithms.
4. We experimentally demonstrate that TDD scales gracefully with increasing input size. Using the TDD process, we are able to construct a suffix tree on the entire human genome in 30 hours (on a single processor machine)! To our knowledge, suffix tree construction on an input string of this size (approximately 3 billion symbols) has yet to be reported in the literature.

The remainder of this paper is organized as follows: Section 2 discusses related work. The TDD technique is described in Section 3, and we analyze the behavior of this algorithm in Section 4. Section 5 presents the experimental results, and Section 6 presents our conclusions.

2 Related Work

Linear time algorithms for constructing suffix trees have been described by Weiner [33], McCreight [24], and Ukkonen [32]. Ukkonen's is a popular algorithm because it is easier to implement than the other algorithms. It is an O(n), in-memory construction algorithm based on the clever observation that constructing the suffix tree can be performed by iteratively expanding the leaves of a partially constructed suffix tree. Through the use of suffix links, which provide a mechanism for quickly traversing across sub-trees, the suffix tree can be expanded by simply adding the (i+1)th character to the leaves of the suffix tree built on the previous i characters.
The algorithm thus relies on suffix links to traverse through all of the sub-trees in the main tree, expanding the outer edges for each input character. However, algorithms that rely on suffix links have poor locality of reference, since they traverse the suffix tree nodes in a random fashion. This leads to poor performance on cached architectures and when used to construct on-disk suffix trees.

Recently, Bedathur et al. developed a buffering strategy, called TOP-Q, which improves the performance of Ukkonen's algorithm (which uses suffix links) when constructing on-disk suffix trees [4]. A different approach was suggested by Hunt et al. [18], where the authors drop the use of suffix links and use an O(n^2) algorithm with a better locality of reference. In one pass over the string, they index all suffixes with the same prefix by inserting them into an on-disk subtree managed by PJama [3], a Java-based object store. Construction of each independent subtree requires a full pass over the string.

Several O(n^2) and O(n log n) algorithms for constructing suffix trees are described in [17]. A top-down approach has been suggested in [1, 14, 16]. In [15], the authors explore the benefits of using a lazy implementation of suffix trees. In this approach, the authors argue that one can avoid paying the full construction cost by constructing the subtree only when it is accessed for the first time. This approach is useful only when a small number of queries are posed against a string dataset. When executing a large number of queries, most of the tree must be materialized, and in this case, this approach will perform poorly.

Previous research has also produced theoretical results on understanding the average sizes of suffix trees [5, 30], and theoretical complexity of using sorting to build suffix trees for different computational models such as RAM, PRAM, and various other external memory models [12]. Suffix arrays have also been used as an alternative to suffix trees for specific string matching tasks [8, 9, 26].
However, in general, suffix trees are more versatile data structures. The focus of this paper is only on suffix trees.

Our solution uses a simple partitioning strategy. However, a more sophisticated partitioning method has been proposed recently [6], which can complement our existing partitioning method.

3 The TDD Technique

Most suffix tree construction algorithms do not scale due to the prohibitive disk I/O requirements. The high per-character overhead quickly causes the data structures to outgrow main memory, and the poor locality of reference makes efficient buffer management difficult.

We now present a new disk-based construction technique called the “Top-Down Disk-based” technique, hereafter referred to simply as TDD. TDD scales much more gracefully than existing techniques by reducing the main-memory requirements through strategic buffering of the largest data structures. The TDD technique consists of a suffix tree construction algorithm, called PWOTD, and the related buffer management strategy described in the following sections.

3.1 PWOTD Algorithm

The first component of the TDD technique is our suffix tree construction algorithm, called PWOTD (Partition and Write Only Top Down). This algorithm is based on the wotdeager algorithm suggested by Kurtz [15]. We improve on this algorithm by using a partitioning phase which allows one to immediately build larger, independent sub-trees in memory. Before we explain the details of the algorithm, we briefly discuss the representation of the suffix tree.

The suffix tree is represented by a linear array, as in wotdeager. This is a compact representation using an average of 8.5 bytes per symbol indexed. Figure 1 illustrates a suffix tree on the string ATTAGTACA$ and the tree's corresponding array representation in memory. Shaded entries in the array represent leaf nodes, with all other entries representing non-leaf nodes. An R in the lower right-hand corner of an entry denotes a rightmost child. A branching node is represented by two integers.
The first is an index into the input string; the character at that index is the starting character of the incoming edge's label. The length of the label can be deduced by examining the children of the current node. The second entry points to the first child. Note that the leaf nodes do not have a second entry. The leaf node requires only the starting index of the label; the end of the label is the string's terminating character. See [15] for a more detailed explanation.

The PWOTD algorithm consists of two phases. In phase one, we partition the suffixes of the input string into |A|^prefixlen partitions, where |A| is the alphabet size of the string and prefixlen is the depth of the partitioning. The partitioning step is executed as follows. The input string is scanned from left to right. At each index position i, the prefixlen subsequent characters are used to determine one of the |A|^prefixlen partitions. This index i is then written to the calculated partition's buffer. At the end of the scan, each partition will contain the suffix pointers for suffixes that all have the same prefix of size prefixlen.

To further illustrate the partition step, consider the following example. Partitioning the string ATTAGTACA$ using a prefixlen of 1 would create four partitions of suffixes, one for each symbol in the alphabet. (We ignore the final partition consisting of just the string terminator symbol $.) The suffix partition for the character A would be 0, 3, 6, 8, representing the suffixes ATTAGTACA$, AGTACA$, ACA$, and A$. The suffix partition for the character T would be 1, 2, 5, representing the suffixes TTAGTACA$, TAGTACA$, and TACA$. In phase two, we use the wotdeager algorithm to build the suffix tree on each partition using a top-down construction.

The pseudo-code for the PWOTD algorithm is shown in Figure 2. While the partitioning in phase one of PWOTD is simple enough, the algorithm for wotdeager in phase two warrants further discussion. We now illustrate the wotdeager algorithm using an example.

Algorithm PWOTD(String, prefixlen)
Phase 1: Scan the String and partition Suffixes based on the first prefixlen symbols of each suffix
Phase 2: Do for each partition:
  1. START BuildSuffixTree
  2. Populate Suffixes from current partition
  3. Sort Suffixes on first symbol using Temp
  4. Output branching and leaf nodes to the Tree
  5. Push the nodes pointing to an unevaluated range onto the Stack
     While Stack is not empty
  6.   Pop a node
  7.   Find the Longest Common Prefix (LCP) of all the suffixes in this range by checking the String
  8.   Sort the range in Suffixes on the first symbol using Temp
  9.   Write out branching nodes or leaf nodes to Tree
  10.  Push the nodes pointing to an unevaluated range onto the Stack
  11. END

Figure 2: The TDD Algorithm

3.1.1 Example Illustrating the wotdeager Algorithm

The PWOTD algorithm requires four data structures for constructing suffix trees: an input string array, a suffix array, a temporary array, and the suffix tree. For the discussion that follows, we name each of these structures String, Suffixes, Temp, and Tree, respectively.

The Suffixes array is first populated with suffixes from a partition after discarding the first prefixlen characters. Using the same example string as before, ATTAGTACA$, consider the construction of the Suffixes array for the T partition. The suffixes in this partition are at positions 1, 2, and 5. Since all these suffixes share the same prefix, T, we add one to each offset to produce the new Suffixes array 2, 3, 6. The next step involves sorting this array of
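
The phase-one partitioning of Section 3.1, and the offset adjustment that populates Suffixes for the T partition, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names partition_suffixes and suffixes_for_partition are hypothetical, partitions are held in an in-memory dictionary rather than in disk-backed buffers, and suffixes whose first prefixlen symbols include the terminator are simply skipped, which is one possible reading of "we ignore the final partition".

```python
from collections import defaultdict


def partition_suffixes(string, prefixlen, terminator="$"):
    """Phase one: group suffix start positions by their first prefixlen symbols."""
    partitions = defaultdict(list)
    # Scan the input from left to right, as the text describes.
    for i in range(len(string)):
        prefix = string[i:i + prefixlen]
        # Assumption: suffixes that reach the terminator inside the prefix
        # are set aside; the text only says the "$" partition is ignored.
        if terminator in prefix:
            continue
        partitions[prefix].append(i)
    return dict(partitions)


def suffixes_for_partition(positions, prefixlen):
    """Populate Suffixes: drop the shared prefix by advancing each offset."""
    return [p + prefixlen for p in positions]


parts = partition_suffixes("ATTAGTACA$", 1)
print(parts["A"])                             # [0, 3, 6, 8]
print(parts["T"])                             # [1, 2, 5]
print(suffixes_for_partition(parts["T"], 1))  # [2, 3, 6]
```

The printed values match the worked example in the text: the A partition holds positions 0, 3, 6, 8, the T partition holds 1, 2, 5, and discarding the shared one-symbol prefix of the T partition yields 2, 3, 6.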
