Phylogenetic Tree Construction (PPT courseware)
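Fundamentals of Bioinformatics and Its Applications
Li Yuqiang (李裕强), 2009.09
Part [number missing]: Analysis of Biomolecular Information
Chapter 8: Molecular Evolution Analysis: Phylogenetic Tree Construction

Chapter contents:
Introduction to molecular evolution analysis
Methods for constructing phylogenetic trees
Worked examples of phylogenetic tree construction

Section 1: Introduction to Molecular Evolution Analysis

Basic concepts:
Phylogeny is the history of how organisms arose and evolved.
Phylogenetics studies the evolutionary relationships among species.
A phylogenetic tree is a representation that describes the evolutionary relationships among species.

Purpose of molecular evolution studies:
Starting from molecular characteristics of species, infer the phylogenetic relationships among them.
For protein and nucleic acid sequences, compare sequence homology to understand how genes evolve and the underlying regularities of phylogeny.

Basis of molecular evolution studies:
Basic theory: across different lineages and over sufficiently long evolutionary time scales, the evolutionary rate of many sequences is nearly constant (molecular clock theory, 1965).

Main assumptions:
Using molecular data to reconstruct evolutionary history requires a number of reasonable assumptions:
1) The molecular sequences used in phylogenetic construction are homologous, meaning that they share a common origin and subsequently diverged through time.
2) Phylogenetic divergence is bifurcating, meaning that a parent branch splits into two daughter branches at any given point.
3) Each position in a sequence evolved independently.
4) The variability among sequences is sufficiently informative for constructing unambiguous phylogenetic trees.

In practice:
Although results are often still disputed, molecular evolution does reveal some of the underlying regularities of phylogeny.

Orthologs and paralogs:
Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; they may or may not be responsible for a similar function.
Paralogs: homologous sequences within a single species that arose by gene duplication.
These two concepts correspond to two different evolutionary events. Sequences used in molecular evolution analysis must be orthologous if the result is to reflect the evolutionary process faithfully.

Phylogenetic tree, also called evolutionary tree:
The study of phylogenetic trees has grown into an interdisciplinary field. It draws on evolutionary theory, genetics, taxonomy, molecular biology, biochemistry, biophysics, and ecology from the life sciences, and on probability and statistics, graph theory, computer science, and group theory from the mathematical sciences. In 1987 the internationally renowned Cold Spring Harbor Symposium on Quantitative Biology devoted a special session to evolutionary trees, which marked the field as one of the frontiers of modern biology. It remains active today.

Structure of a phylogenetic tree:
The lines in the tree are called branches.
At the tips of the branches are present-day species or sequences known as taxa (singular: taxon) or operational taxonomic units (OTUs).
The connecting point where two adjacent branches join is called a node, which represents an inferred ancestor of extant taxa.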
The bifurcating point at the very bottom of the tree is the root node, which represents the common ancestor of all members of the tree.
A group of taxa descended from a single common ancestor is defined as a clade or monophyletic group.
The branching pattern in a tree is called the tree topology.

Rooted and unrooted trees:
The root represents the common ancestor of a set of taxa.

How to determine the root:
By outgroup: one approach is to use an outgroup, which is a sequence that is homologous to the sequences under consideration but separated from those sequences at an early evolutionary time.
By midpoint: in the absence of a good outgroup, a tree can be rooted using the midpoint rooting approach, in which the midpoint of the two most divergent groups, judged by overall branch lengths, is assigned as the root.
[Figure: tree rooted by outgroup]

Phylograms: show both the branching pattern and branch lengths.
Cladograms: show the branching pattern only, without branch lengths.

Section 2: Methods for Constructing Phylogenetic Trees

Molecular phylogenetic tree construction can be divided into five steps:
(1) choosing molecular markers;
(2) performing multiple sequence alignment;
(3) choosing a model of evolution;
(4) determining a tree building method;
(5) assessing tree reliability.

(1) Choosing molecular markers
For studying very closely related organisms, nucleotide sequences, which evolve more rapidly than proteins, can be used.
For studying the evolution of more widely divergent groups of organisms, one may choose either slowly evolving nucleotide sequences, such as ribosomal RNA, or protein sequences.

(2) Performing multiple sequence alignment
This is probably the most critical step in the procedure. Only a correct alignment produces a correct phylogenetic inference.
Multiple state-of-the-art alignment programs (such as T-Coffee) should be used.
Manual editing is often critical in ensuring alignment quality.
It is also often necessary to decide whether to use the full alignment or to extract parts of it.
Truly ambiguously aligned regions have to be removed from consideration prior to phylogenetic analysis.

Automatic approaches to improving alignment quality:
Rascal (ftp://ftp-igbmc.u-strasbg.fr/pub/RASCAL) and NorMD (ftp://ftp-igbmc.u-strasbg.fr/pub/NORMD) can help to improve an alignment by correcting alignment errors and removing potentially unrelated or highly divergent sequences.
The program Gblocks (http://woody.embl-heidelberg.de/phylo/) can help to detect and eliminate poorly aligned positions and divergent regions, making the alignment more suitable for phylogenetic analysis.

(3) Choosing a model of evolution
What is an evolutionary model?
The statistical models used to correct homoplasy (similarity that is not due to common ancestry, such as parallel evolution) are called substitution models or evolutionary models.

Why consider an evolutionary model?
The observed number of substitutions may not represent the true evolutionary events that actually occurred. For instance:
Observed: A replaced by C. Actually: A → T → G → C.
A back mutation could have occurred: G → C → G.
Parallel mutations: both sequences mutate into T.
Such multiple substitutions and convergence at individual positions obscure the estimation of the true evolutionary distances between sequences. This effect is known as homoplasy, which, if not corrected, can lead to the generation of incorrect trees.
To correct homoplasy, statistical models are needed to infer the true evolutionary distances between sequences.

For protein sequences, the evolutionary distances from an alignment can be corrected using a PAM or JTT amino acid substitution matrix, whose construction already takes multiple substitutions into account.
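The gap between observed and actual substitutions described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the slides: every site is equally likely to be hit, every base change is equally likely, and the names `evolve` and `observed_differences` are hypothetical helpers introduced here.

```python
import random

BASES = "ACGT"

def evolve(seq, n_substitutions, rng):
    """Apply n_substitutions random point substitutions to seq.

    A site may be hit more than once, so multiple substitutions and
    back mutations at the same position can occur.
    """
    sites = list(seq)
    for _ in range(n_substitutions):
        i = rng.randrange(len(sites))
        # Replace the base with one of the three other bases.
        sites[i] = rng.choice([b for b in BASES if b != sites[i]])
    return "".join(sites)

def observed_differences(a, b):
    """Count aligned positions at which the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(100))
for actual in (10, 50, 100, 200):
    descendant = evolve(ancestor, actual, rng)
    print(actual, observed_differences(ancestor, descendant))
```

The observed count can never exceed the number of substitutions actually applied, and it falls further behind as substitutions accumulate, which is the effect the substitution models are meant to correct.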
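For constructing DNA phylogenies, there are a number of nucleotide substitution models available:
Jukes-Cantor model
Kimura model

Jukes-Cantor model
Distance function: $d_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{AB}\right)$, where $p_{AB}$ is the observed proportion of sites that differ between sequences A and B.
Example: two 20-nucleotide sequences A and B differ at 6 positions, so $p_{AB} = 0.3$ and $d_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\times 0.3\right) \approx 0.38$.
The Jukes-Cantor model can only handle reasonably closely related sequences.

Kimura model
The Kimura model additionally accounts for the different effects of transitions and transversions.
Distance function: $d_{AB} = -\frac{1}{2}\ln\left(1 - 2p_{ti} - p_{tv}\right) - \frac{1}{4}\ln\left(1 - 2p_{tv}\right)$, where $p_{ti}$ and $p_{tv}$ are the observed proportions of transitions and transversions.
Example: two sequences A and B differ at 30% of sites, of which 20% are transitions and 10% are transversions, so $d_{AB} = -\frac{1}{2}\ln\left(1 - 2\times 0.2 - 0.1\right) - \frac{1}{4}\ln\left(1 - 2\times 0.1\right) \approx 0.40$.

Among-site variation
Both the Jukes-Cantor model and the Kimura model assume that every site evolves at the same rate. In reality, different codon positions vary at different frequencies (for example, substitutions tend to be more frequent at the third codon position). Tree-building programs therefore correct for this among-site variation, for example by including a correction factor.

(4) Determining a tree building method
Algorithms for building phylogenetic trees:
Molecular data fall broadly into two classes:
1) discrete character data
2) similarity and distance data
From the data-processing point of view, tree-building methods are divided into:
distance-based methods
character-based methods (based on discrete character data)

Distance methods, also called distance matrix methods
First, the sequences of the species are compared pairwise, and the evolutionary distances between taxa are derived under a chosen assumption (an evolutionary distance model) to build an evolutionary distance matrix.
The tree is then built from the distance relationships in this matrix.
The tree topology is determined by the pairwise evolutionary distances between sequences.
Branch lengths represent evolutionary distances.
Distance methods include:
Clustering-based methods:
unweighted pair group method using arithmetic mean (UPGMA)
neighbor joining (NJ)
Optimality-based methods, which apply an algorithm that optimizes over the distance data:
minimum evolution (ME)
[Figure: a simple distance matrix]

Character-based methods
These methods build the tree directly from the character changes at each site of the sequences. They include:
maximum parsimony (MP)
maximum likelihood (ML)

Unweighted pair group method using arithmetic mean (UPGMA)
[Figure: UPGMA worked example; only the branch-length calculations survive]
d = e = 10/2 = 5
c = 19/2 = 9.5
g = c - d = 9.5 - 5 = 4.5
a = b = 22/2 = 11
f1 + a = f2 + c = 40.5/2 = 20.25
f1 = 20.25 - 11 = 9.25, f2 = 20.25 - 9.5 = 10.75

Neighbor joining (NJ)
Compute, for every pair of nodes, the score for choosing them as neighbors.
The pair with the smallest score is taken as neighbors.
The remaining branches are added in turn.
Formula for scoring a pair of nodes as neighbors: [formula not preserved in the source]
Treat A and B as a new composite sequence, build a new distance table, and repeat the process above.

Minimum evolution (ME)
An algorithm compares all possible topologies and selects the topology with the shortest total branch length as the ME tree.

Generalized NJ method
Multiple NJ trees with different initial taxon groupings are generated. The best tree is then selected from this pool of regular NJ trees as the one that best fits the actual evolutionary distances.

Evaluation of distance methods
All distance-based methods:
Advantage: the ability to make use of a large number of substitution models to correct distances.
Drawback: the actual sequence information is lost when all the sequence variation is reduced to a single value. Hence, ancestral sequences at internal nodes cannot be inferred.
Clustering-based methods (NJ and UPGMA):
Advantage: computationally fast, and therefore capable of handling data sets that are deemed too large for any other phylogenetic method.
Disadvantage: not guaranteed to find the best tree.
Optimality-based methods (ME):
Advantage: better accuracy.
Drawback: computationally prohibitive when the number of taxa is large (e.g., 12).
A compromise between the two types of algorithm is a hybrid approach such as the generalized NJ method, with a performance similar to that of ME but computationally much faster.

Maximum parsimony (MP)
The theoretical basis of maximum parsimony is the philosophical principle of William of Ockham (c. 1287-1347): the best theory to explain a process is the one that requires the fewest assumptions.
Maximum parsimony evaluates all possible topologies and takes as the optimal tree the topology that requires the smallest number of substitutions (the shortest tree length).

How maximum parsimony works:
1. Searching the possible trees
Problem: the number of possible trees grows exponentially with the number of taxa.
Solution: the branch-and-bound method. First build a distance tree (for example with NJ), then compute its minimum number of substitutions (its tree length) and use it as an upper limit (upper bound) on sequence variation to restrict the growth of candidate trees.
The rationale is that a maximally parsimonious tree must be equal to or shorter than the distance-based tree.
2. Determining the most parsimonious tree
Only the informative sites are used to count substitutions, which reduces the amount of computation.
Informative sites: those sites that have at least two different kinds of characters, each occurring at least twice.
For a given tree, compute the minimum number of substitutions at each informative site and add them up to obtain the minimum number of substitutions for that topology.
The topology with the fewest changes is the maximum parsimony tree.
Computing the minimum number of substitutions:
first going from the leaves to the internal nodes and to the common root to determine all possible ancestral character states;
then going back from the common root to the leaves to assign ancestral sequences that require the minimum number of substitutions.

Evaluation of maximum parsimony:
If the analyzed sites contain no back mutations or parallel mutations and the number of sites examined is large, maximum parsimony can infer a very good tree.
If the analyzed sequences contain many back mutations or parallel mutations and the number of sites examined is small, maximum parsimony may give an unreasonable or incorrect tree.

Improving maximum parsimony: weighted parsimony
Unweighted parsimony treats all mutations as equivalent. A weighting scheme that takes into account the different kinds of mutations helps to select tree topologies more accurately. The MP method that incorporates a weighting scheme is called weighted parsimony.

Maximum likelihood (ML)
For a given set of sequence data, a specific substitution model is chosen to compute the probability (likelihood) of every possible topology, and the topology with the highest likelihood is then selected as the optimal tree.
Every possible topology is searched exhaustively.
The likelihood is computed for every site, not only for the informative sites.

How maximum likelihood works:
1. Use the substitution model to compute the likelihood of each branch, including the likelihood of each internal branch.
For DNA sequences under the Jukes-Cantor model, the probability that a nucleotide remains unchanged after time t is $P(t) = \frac{1}{4} + \frac{3}{4}e^{-\alpha t}$, where $\alpha$ denotes the substitution rate of the Jukes-Cantor model.
The probability that a nucleotide changes from X to Y is $P(t) = \frac{1}{4} - \frac{1}{4}e^{-\alpha t}$.
In practice the likelihoods are converted to natural logarithms and rounded to integers.
The formulas for other substitution models are more complex.
[Figure: general model of base substitution]
2. The log-likelihood of a topology is the sum of the log-likelihoods of its branches.
3. The topology with the highest likelihood score is the maximum likelihood tree.
[Figure: the topology with the largest likelihood, that is, the largest SUM, is the optimal tree]

Evaluation of maximum likelihood:
Building a tree by maximum likelihood is very time-consuming because the analysis involves a large amount of computation: every step has to consider all possible states of the internal nodes.
Maximum likelihood is a mature statistical method of parameter estimation with a sound statistical foundation. When the sample is large, it yields parameter estimates with minimum variance.
Provided that a reasonable and correct substitution model is used, maximum likelihood can infer a very good tree.

Improving maximum likelihood:
Because maximum likelihood analysis is slow, many optimized search methods have been developed to speed up the search for the optimal tree, such as heuristic search and branch-swapping search.

Strategy for choosing an algorithm
[Figure: strategy for choosing a tree-building algorithm]

(5) Assessing tree reliability
Reliability analysis of the tree: the bootstrap method
Randomly draw columns, with replacement, from the multiple sequence alignment to form a new alignment of the same length.
Repeat this process to obtain many new alignments.
Build a tree from each new alignment and check whether these trees differ from the original tree. This is used to assess the reliability of the tree.

Section 3: Worked Examples of Phylogenetic Tree Construction

Commonly used software for phylogenetic analysis:
(1) PHYLIP
(2) PAUP
(3) TREE-PUZZLE
(4) MEGA
(5) PAML
(6) TreeView
(7) VOSTORG
(8) Fitch programs
(9) Phylo_win
(10) ARB
(11) DAMBE
(12) PAL
(13) Bionumerics
For other programs see: http://evolution.genetics.washington.edu/phylip/software.html

MEGA 3 download address: http:/

Discrete character data: the data take two or more discrete values. For example, a position in a DNA sequence either is or is not a cleavage site (a binary character); a position in a sequence may hold any of the four bases A, T, G, C (a multistate character).
Similarity and distance data: the relationships among taxa are expressed through their pairwise similarities or distances.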
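To connect the two data types just defined with the distance corrections in Section 2, the sketch below turns a pair of aligned sequences (discrete character data) into Jukes-Cantor and Kimura distances (distance data). It rests on stated assumptions rather than anything prescribed by the slides: the two sequences are hypothetical, they are assumed to be gap-free and of equal length, and the function names are invented for illustration. The sequences were chosen to have 4 transitions and 2 transversions over 20 sites, matching the proportions in the two worked examples.

```python
import math

PURINES = {"A", "G"}

def site_proportions(a, b):
    """Return (p_ti, p_tv): proportions of transitions and transversions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    n = len(a)
    return transitions / n, transversions / n

def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(p_ti, p_tv):
    """Kimura distance from transition and transversion proportions."""
    return (-0.5 * math.log(1 - 2 * p_ti - p_tv)
            - 0.25 * math.log(1 - 2 * p_tv))

# Hypothetical 20-nt pair: sites 1-4 are transitions, sites 5-6 transversions.
seq_a = "ATGCATGCATGCATGCATGC"
seq_b = "GCATCGGCATGCATGCATGC"

p_ti, p_tv = site_proportions(seq_a, seq_b)
print("p =", p_ti + p_tv)
print("Jukes-Cantor:", round(jukes_cantor(p_ti + p_tv), 2))
print("Kimura:", round(kimura_2p(p_ti, p_tv), 2))
```

Both corrected distances come out larger than the raw proportion of differing sites, since each model adds back an estimate of the substitutions hidden by multiple hits.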