基因组测序的原理与方法.ppt (Principles and Methods of Genome Sequencing)
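Principles and Methods of Large-Scale Genome Sequencing
Hu Songnian

The discovery of the periodic table of the elements laid the foundation for research and development in physics and chemistry in the twentieth century.
The periodic table of the elements
The "genome sequence map" will lay the foundation for life science research and the bio-industry in the twenty-first century!
The "genome": the "periodic table" of the life sciences
The atlas of human anatomy laid the foundation for modern medicine.
The mysteries of life are hidden in a "book written in four letters":
GCTTCTTCCTCATTTTCTCTTGCCGCCACCATGCCGCCACCA
TCATTTTCTCTTGCCGCCACCATGCTTCTTCCTCATTTTCTCT
CCACCATGCCGCCACCACGCCACCATGCTTCTTCCTCATCTC
GCTTTCTTGCCGCCACCATGCCGCCACCGCTTCTTCCtTCTCT

Basic theoretical research in genomics
Genomics seeks to reveal the relationships among four integrated systems:
The genome as an information carrier (the relationship between overall conservation and local imbalance of base pairs and repeated sequences).
The genome as an integrated body of genetic material (the relationship between genes as functional and structural units and genetic mechanisms).
The genome as an integrated body of biochemical molecules (the relationship between gene products as functional molecules and molecular and cellular mechanisms).
The genome as an integrated record of species evolution (natural selection of species in their geographic and atmospheric environments).

Genomics is a broad discipline
Across kingdom, phylum, class, order, family, genus and species, nearly 100 million species exist on Earth today; every organism that has lived and died, without exception, has a genome.
As an information carrier, the genome stores some of the most fundamental biological information; it is both a starting point for studying the nature of life and the place where biological information ultimately converges.
Genomics research includes systems-biology studies of gene products (the transcriptome and the proteome).
Large-scale study of gene polymorphism amounts to the study of genome polymorphism.
Genomics research must inevitably rise to the level of cellular mechanisms, molecular mechanisms and systems biology.
The origin and evolution of genomes, like the origin and evolution of species, is a new scientific field.
Genomic information is accumulating at scale and in astronomical quantities, and its in-depth study is bound to give rise to an entirely new discipline.

Genomics is big science
Genomic information is used to discover and explain universally significant life phenomena, together with their variation, underlying laws and interrelationships.
The information content of a genome is high.
Genomics research also depends on comparison between genomes.
The complexity of genomics inevitably draws in many disciplines (the biological sciences, medicine, pharmacy, computer science, chemistry, mathematics, physics, electronic engineering, archaeology and others).
The methods and technologies of genomics are already at the forefront of life science research.
Genomic information comes from experimental data produced efficiently and at scale.
The Human Genome Project demonstrated both the urgency and the feasibility of genome research.

The genome and the mysteries of life
The origin and evolution of genomes.
Variation in genomic DNA composition, GC percentage, and purine:pyrimidine conservation.
The origin, development and evolution of the genetic code.
Transport and degradation of introns after splicing (especially large introns of more than 100,000 nucleotides).
The biological significance of minimal introns.
Shared and distinct features of gene distribution in animal and plant genomes.
Genome-level changes during species divergence.
The relationship between changes in genome size and genetic, molecular and cellular mechanisms.
The origin, classification, evolution and function of "junk DNA".

Monopoly and rapid turnover of sequencing instruments

Current state of sequencing instrument development
First generation (stable demand)
ABI: 3130xL, 3730xL, 3500xL
Second generation (rapid development)
Roche: Genome Sequencer FLX System, GS Junior System
Illumina: Genome Analyzer IIx, MiSeq, HiSeq 1000, HiSeq 2000
Life Technologies (ABI): 5500 SOLiD System, 5500xL SOLiD System, Ion Torrent PGM
Danaher Motion: Polonator G.007
Complete Genomics
无锡艾吉因生物信息技术有限公司 (Wuxi): AG-100
深圳华因康基因科技有限公司 (Shenzhen): Pstar-1
Beijing Institute of Genomics / Institute of Semiconductors, Chinese Academy of Sciences: BIGIS-1, BIGIS-4
Third generation (about to reach the market)
Helicos BioSciences: Helicos Genetic Analysis System
Pacific Biosciences: RS System

Supporting technologies for large-scale genome sequencing
The Sanger dideoxy chain-termination method
PCR
The development of automated DNA sequencers
Bioinformatics software and hardware infrastructure

What "dideoxy chain termination" means

Principle of PCR (polymerase chain reaction)
Required reagents: DNA template, primers, DNA polymerase, dNTPs, buffer.
Each cycle consists of denaturation (90 °C), annealing (54 °C) and extension (72 °C).

Principle of Sanger dideoxy chain-termination sequencing

Two strategies for large-scale genome sequencing
Clone-by-clone approach
Whole-genome shotgun approach
Overlapping reads: ATGCCGTAGGCCTAGC and TAGGCCTAGCTCGGA
Merged sequence: ATGCCGTAGGCCTAGCTCGGA

Genomic DNA
BAC library
Physical-map-positioned BAC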
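or contig
Candidate clones for shotgun sequencing
Subclones for shotgun sequencing
Sequencing and assembly
Complete genome sequence
Clone-by-clone approach

Whole-genome shotgun approach
Genomic DNA
Shotgun clones
Sequencing and whole-genome sequence assembly
Complete genome sequence

Comparison of the two large-scale genome sequencing strategies
BAC by BAC
Whole Genome Shotgun
"The sequencing of the human genome is likely to be the only large sequencing project carried to completion by the methods described in this issue." Maynard V. Olson, The maps: Clone by clone by clone, Nature 409, 816-818 (2001)

Working draft (framework map) versus finished map
"Working Draft" (90%; 4X)
Finished Genome (99.99%; 8X)
Figure labels: Gap1, Gap2, Chromosome
BAC by BAC
The sequence of the human genome. J. C. Venter et al., Science 291:1304-1351 (16 Feb. 2001)

The main achievements and progress of the Human Genome Project are captured in four maps
Genetic map: also called the linkage map; gives the relative positions of genes or DNA markers on a chromosome and the genetic distances between them.
Physical map: a genome map that uses positioned DNA marker sequences such as STSs as landmarks and measures map distance in actual DNA length (bp, kb, Mb).
Transcript map: a molecular genetic map built using ESTs (expressed sequence tags) as markers.
Sequence map: the genomic DNA sequence obtained by genome sequencing, with A, T, G and C as its units.

Clone-by-clone approach
Construction of the physical map
Screening of large-insert clones
Shotgun sequencing and construction of the "working draft"
Full sequence assembly and construction of the "finished map"

Making the physical map: sequence-tagged site (STS) mapping
A physical map is a chromosome map whose landmarks are specific DNA sequences. Distances between landmarks are expressed as physical distances in base pairs (bp, kb, Mb). The finest physical map is the nucleotide sequence itself; the coarsest is the karyotype.
The STS map is one of the most basic and useful physical maps of a chromosome. An STS (sequence-tagged site) is a short, specific sequence of about 200-300 bp selected at random from the human genome; each STS is unique in the genome. An STS map uses STSs as landmarks (on average one per 100 kb) to place cloned DNA fragments on the genome in order.

Sources of STSs
Random genomic sequences
Expressed gene sequences, such as ESTs
Genetic marker sequences, such as microsatellite markers
Information on STSs can be found in the Genome Database (GDB): http://gdbwww.gdb.org

Steps in constructing a physical map
Determine each STS sequence and its position in the genome.
Construct a large-insert genomic library (BAC library).
Screen and position clones using specific STSs as markers.
Order the STS-containing clones along the genome.
The Genome Database (GDB) contains information on at least 24,568 STS landmarks.

About libraries
Library: a population of recombinant DNA clones containing random fragments that together cover all the genes of an organism.
Vector: a vehicle that carries foreign DNA into a host cell; common vectors include plasmids, bacteriophage vectors and bacterial artificial chromosomes (BACs).
Host: an organism that can harbor foreign DNA fragments; commonly Escherichia coli or yeast.
Basic requirements for a vector:
Replicates independently in the host cell.
Has a multiple cloning site into which foreign DNA fragments can be inserted.
Carries a suitable selectable marker, such as drug resistance.
Is of suitable size and easy to isolate and purify.
Has a high copy number.

Construction of a BAC library
NotI, SacI
Pulsed-field gel electrophoresis yields large DNA fragments of about 200 kb, which are purified and ligated to the vector.
Electroporation introduces the ligation products into competent E. coli cells.
BAC vector carrying a foreign DNA insert
Cells are cultured on solid medium containing chloramphenicol; each colony is a single clone carrying the same foreign DNA fragment.

Screening of BAC clones
Seed clones are screened with the "STS-PCR pooling" scheme.
Specific STS markers
BAC clones with mutually overlapping fragments are assembled into contigs using STS information and positioned on the genome.
Contig
Regional mapping
Minimal tiling path selected for sequencing.
Beijing Map

48 in total, 8 per group
Every eight 96-well plates form one superpool; 384 96-well plates form 48 superpools.
48 superpools
Column pools, row pools
1 2 3 4 5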
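6 7 8 9 10 11 12
Plate pools: plate 1, plate 2, plate 8
Composition of plate pools, row pools and column pools

The "STS-PCR pooling" scheme (Pooling Protocol)
Superpool: eight 96-well plates, 768 clones in total
Plate pool: 96 clones
Row pool: 12 clones
Column pool: 8 clones
Pooling greatly reduces the screening workload and cost, and the screening results are accurate and reliable.
28 vs. 768: one superpool is resolved with 8 plate pools + 8 row pools + 12 column pools = 28 reactions instead of 768 single-clone reactions.
Sheet of superpools, plate pools, row pools, column pools

I. BAC screening
The first 48 samples are the results of screening the superpools (sp) with primer OGG1.51; the last 48 samples are the results with primer OGG1.52.
Screening results for the plate, row and column pools of sp #27, 34 and 45 with primer OGG1.52
Identification of BAC clones (+ marks a positive clone)
Colony PCR with primer OGG1.52

Screening of extension clones
The density of STSs is not yet high enough for a high-resolution physical map, and STSs are unevenly distributed across the genome, so many regions are not covered by any positive clone and gaps form. Seed clones must therefore be extended, by fingerprinting (the FPC method) or by walking by end sequence, to form contiguous clone sets. Clones obtained by these extension methods are called extension clones.
Figure labels: Contig 1, Contig 2, overlapping sequence, extension primer, extension clone identified by screening
Molecular weight marker every 5th lane
BAC clones are cultured in 96-deep-well plates, digested to completion with HindIII, and run on a 1% agarose gel.

Fingerprinting method (walking by fingerprinting database)
Pick a seed clone near the gap, digest it to build its fingerprint, and compare the fingerprint against the FPC database to find the overlapping clone set that contains this clone. From that set, identify the clones that cover the gap region, thereby extending the contig.
Figure labels: complete HindIII digestion; comparison against the FPC database; Clone A, Clone B, Clone C; C, A, B
Misplacement of clones during contig construction

Walking by end sequence
Pick a seed clone near the gap and sequence its ends, then compare the end sequences against the genome database to identify a unique sequence fragment that can serve as a new STS landmark. Finally, design PCR primers for the new landmark and screen for new clones with the STS-PCR pooling scheme, thereby extending the contig.
Query results from submitting the sequence of clone 350A18 to the end sequence database

IV. Clone identification
1. STS-PCR
2. BAC end sequencing
3. Fingerprinting
4. FISH
Gel labels: CK2, CK1, 13f06, 267l16, 481o07, 250a15, 204c23, 340j13
Electrophoresis of 15 clones after HindIII digestion

Drawing the "working draft"
Clones are positioned using the results of blastn comparison of their sequences against the STS database. Comparison of end sequences identifies the end that extends beyond the contig, so that walking can be carried out promptly to screen for new clones.

Shotgun sequencing, assembly and finishing
Workflow diagram
Shotgun Sequencing I: RANDOM PHASE
BAC clone: 100-200 kb
Sheared DNA: 1.0-2.0 kb
Sequencing templates
Random reads
Shotgun Sequencing II: ASSEMBLY
Consensus
Shotgun Sequencing III: FINISHING
Consensus
Consed interface showing the sequence assembly result
1. Filling "intraclone gaps": gap filling by end sequences
2. Filling "interclone gaps"
The actual and predicted fingerprints of R-260J13 digested with HindIII. Lane 1: marker; lane 2: R-260J13 digested with HindIII; lane 3: the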
predicted fingerprint.
The assembled sequence of clone 211B19 has an error rate of zero.

Whole Genome Shotgun
This bacterium has a circular genome of 2,689,445 base pairs, the second largest among the thermophiles completely decoded to date.
Circular representation of the genome of T. tengcongensis

"What is under heaven is for all" (天下为公). Sun Yat-sen, the father of modern China
DDBJ/EMBL/GenBank: AAAA01000000
World-class sequencing production line: 70,000 clones and 30 million bases per day
High output, low cost: $/bp, ¥/bp, cents/bp, fen/bp
Genomics is a data-driven big science: having data is what counts.
Nothing in the world is difficult for one who is willing to climb.
Contigs: 127,550 (N50 = 6,688 bp)
Scaffolds: 102,444 (N50 = 11,764 bp)
Quality: 546 bp at Q20

De Novo Sequencing the Genome in BIG
Hu Songnian, Beijing Institute of Genomics, Chinese Academy of Sciences

Next Generation Sequencing (NGS) Technology
Second generation sequencers: Solexa (3), SOLiD (5)
De novo sequencing, RNA-seq
Re-sequencing, ChIP-seq, Meth-seq, Metagenomics
De novo sequencing, RNA-seq
Re-sequencing, ChIP-seq, RNA-seq
"Known" genome, novel genome(s), both types
Instruments: 1 x 454, 5 x SOLiD 4.0, 2 x 5500xl, 3 x Solexa, 2 x HiSeq 2000, 3 x 3730xl, 1 x Sequenom
Data center: 1000 CPU cores, 800 TB storage
A complete experimental and sequencing system and workflow
A powerful computing, storage and database support system
Mature bioinformatics data processing and analysis pipelines
2023/8/10

Second generation sequencers in BIG
10 high-throughput sequencers, two 3730XL sequencers, one Sequenom instrument, more than 100 high-performance blade servers, 4 large-memory servers, and about 800 TB of storage.

Sequencing Glossary
Reads. Sequences obtained from a collection of clones that over-sample the target genome.
Pair-end reads. Sequence reads derived from both ends of a sequencing-library clone.
Mate-pair reads. Sequence reads derived from both ends of a mate-pair library clone, whose insert size is usually at least 1 kb.
Insert size. The size of the clone insert from which a clone-end pair is taken.
Contig. The result of joining an overlapping collection of sequence reads.
Scaffold. The result of connecting non-overlapping contigs by using pair-end reads.
N50 size. As applied to contigs or scaffolds, the size above which 50% of the assembled sequence can be found.

Genome assembly strategy
Contig assembly
Scaffolding
Internal gap closing
http:/
Recent whole genome sequencing projects
Flowchart of the WGS de novo assembly
Fill in intra-scaffold gaps and get the final
scaffolds
Solexa part, 454 part
Hybrid assembly and scaffolding
454 part: 454 read processing, assembly, hybrid scaffolding
Solexa part: Solexa read processing, assembly, mapping to 454 contigs, hybrid scaffolding
Cov/Comp

Hybrid assembly
Long reads, assembly, contigs
Short reads
Scaffolding (figure labels: A+, C, B)
Scaffolds (figure labels: A+, B, C)
Fix gaps

EST-based assembly with NGS short reads: construct BIGer scaffolds
Figure labels: EST, Unigene, Scaf A, Scaf C, Scaf B, Scaf D, New Scaf (A, B, C, D)

Raw sequencing reads pre-processing I
Significance and purpose
Sequencing library quality control
Sequencing bias analysis
Inherent properties of particular second generation sequencers
Genome sequencing "black hole" effect
Transcriptome sampling and quantification bias
Ready for mapping
Ready for de novo assembly

Raw sequencing reads pre-processing II
Number of sequencing reads
Duplicate detection, regional distribution analysis and trimming
Adapter detection and trimming
Read quality analysis and low-quality read filtering
Average quality density distribution
Average quality positional distribution
Regional distribution
F-R correlation
GC content-quality correlation
Insert length distribution

Pipeline
Raw data pre-processing
Image analysis and base calling
GOAT pipeline (OLB 1.6)
CASAVA
Quality control
GERALD Summary.htm

Fastq and Quality
Solexa reads in FASTQ format
s_1_1_sequence.txt
@HWI-EAS724_0001:8:32:374:374#0/1
GAGCTGTATATGAATAATAGTTCGTTTTTCATTATCCAAGATGGATCGGTATAAAGTCTGCTAAAATAAAGGTACAACG
+HWI-EAS724_0001:8:32:374:374#0/1
fcfcfggdfggggfggggcggggggggfgggggcgggfWgggggggggfgcggdgcgcggggfacbbbbgcgggggd
s_1_2_sequence.txt
@HWI-EAS724_0001:8:32:374:374#0/2
TACCGTTAATAGCAGTAATATCATAATAGTAATAGCATCATAACGGTAGTCCCATAAAAGTGTGTCAGTAGTAGTAGTA
+HWI-EAS724_0001:8:32:374:374#0/2
ggggfgggggd_adcggggeggfggeggegfgeececdegggggfegcfegggegggfgacacedbd_cYb
The Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII characters 64 to 104.
Error probability (p): Phred Q = -10 log10(p); Solexa Q = -10 log10(p / (1 - p)).
For Solexa: p = 0.01, Q = 19.96; p = 0.05, Q = 12.8; p = 0.10, Q = 9.5.
For Phred: p = 0.01, Q = 20; p = 0.05, Q = 13; p = 0.10, Q = 10.

Data assessment I: Read quality
distribution
Low quality, high quality
Trim: 3' end trim if QN 30)60
Assessment: distance distribution between two low-quality (Q20) bases
454 dinucleotide proportion check
454 raw read quality

Data assessment II: Library insert size
Numbers of reads with non-insert DNA (full-length adapter) in libraries of different insert sizes

Data assessment III: Mapping rate
Solexa sequencing data usage in the 500 bp library

Data assessment IV: Duplication assessment
Duplicate detection and filtering
Figure labels: F, R, N, N, 2N
Qaverage 20?
Lane data usage in different Solexa libraries: filtering duplicate reads
Average reads per start point

Read correction
Correct Illumina GA short reads, Kmer = 17
Genome size prediction: M = N * (L - K + 1) / L, where N = total length (bp) / genome size (the base-level sequencing depth), L = average read length (bp), K = k-mer length, and M = k-mer depth.

Genome size estimation using k-mers
Before estimating the genome size, we make one assumption: the k-mers sampled from the reads cover the whole genome sequence. Under the Lander-Waterman model, the estimate is:
G = Knum / Kdepth
Here G is the genome size, Knum is the total number of k-mers, and Kdepth is the expected k-mer depth. Once the expected k-mer depth is known, the genome size follows directly. Because the k-mer frequency distribution follows a Poisson distribution, the peak of the k-mer distribution curve can be taken as the expected k-mer depth.
Note: with a total of 15,437,084,746 k-mers and a peak at depth 8 in the right-hand figure, the genome size is estimated as 15,437,084,746 / 8 = 1.93 Gb.

High-quality read rate after pre-processing
Assembly: raw data vs. pre-processed data?

Questions
Genome size estimation methods (K-mer & Cov)
Assembly optimization (parameters)
Assembly evaluation (454_Solexa EST)
Reuse of unmappable Solexa reads (filter-assemble)
Scaffolding comparison (ABI & BIG & Bambus & blat)
Solexa to SOLiD: feasible?
Assembly assessment (BAC, 3730: necessary?)

Sequencing strategy for Solexa
Sample preparation: fragment, paired-end or mate-pair
Sequencing different libraries: data
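coverage (500 bp): 10X, 20X.
Larger genome size: larger libraries needed. 10 kb?
Length of Solexa reads? 100 bp? F + R = one read?
Other data: 3730, 454, SOLiD, EST.

OVERVIEW OF TESTED ASSEMBLERS
Depth vs. coverage
EST-based scaffolding
Workflow for validating hybrid genome assembly and detecting structural variation
Repeat annotation workflow
Repeat analysis
Technical roadmap for gene structure and function annotation

Gene prediction
De novo prediction
GenScan: 16,609 (3,775 UniProt hits)
Augustus: 19,378 (10,245 hits)
Homology-based prediction: alignment, gene scaffold, GeneWise
Reference gene set
tRNA scan
CpG island
miRNA prediction: the miRNA database FASTA is used as the query and searched with BLAST against our masked scaffolds.

Gene function annotation
Gene Ontology (local UniProt database)
KEGG (online)
GO annotation: GenScan, UniProt annotation, Gene Ontology
KEGG pathway overview
Schistosoma (blood fluke)
Technical roadmap for gene family evolution analysis and comparative biology

Application-driven genomics will stride into the future
Toward human health and daily life
Toward the material basis of human survival
Toward the environment on which humans depend
Onto the grand stage of human social and economic development

Genomics research will come closer to human health and daily life
Discovery of disease-related genes, characterization of their functions, and investigation of their molecular mechanisms
Breakthroughs in gene-level research on common (complex) diseases
Gene-based disease diagnosis, prediction and prevention
Combination of gene therapy and cell therapy
"Individualized" medicines based on genetic polymorphism
"Personal health plans" based on genetic polymorphism
A natural return to traditional medicines, biological drugs and "organic drugs"

Toward the material basis of human survival
GM crops resistant to disease, pests and extreme environments
New GM breeds of livestock, poultry and aquatic products with high fertility, fast growth and high nutritional value
Fruits and vegetables enriched in vitamins and nutrients
Biological insecticides, herbicides and disease-control agents
Organic food produced in micro-ecological environments

Toward the environment on which humans depend
Genomic information records the history of the origin and evolution of species through hundreds of millions of years of environmental change.
Research, conservation and development of biodiversity resources: an estimated 100 million species on Earth.
Research, conservation and development of the ecological environment: the vast oceans (71% of the Earth's total area), extensive forests (listed as 40%), and numerous lakes and rivers.

Thank you!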
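
Supplementary sketch: the k-mer genome size estimate from the read-correction slides (G = Knum / Kdepth, with the peak of the k-mer depth distribution taken as Kdepth) can be illustrated in a few lines of Python. This is a minimal sketch, not the pipeline used in the talk. The function names, the in-memory counting, and the min_depth cutoff for skipping error k-mers are all assumptions made for illustration; a real data set of billions of k-mers would need a dedicated k-mer counter.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


def peak_depth(histogram, min_depth=2):
    """Return the depth shared by the most distinct k-mers.

    Assumption (not from the slides): depths below min_depth are ignored,
    because very rare k-mers are usually dominated by sequencing errors.
    """
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mer depth at or above min_depth")
    # Ties are resolved toward the smaller depth.
    return max(candidates, key=lambda d: (candidates[d], -d))


def estimate_genome_size(total_kmers, kmer_depth):
    """G = Knum / Kdepth, as on the slide."""
    return total_kmers / kmer_depth


def base_depth_from_kmer_depth(kmer_depth, read_length, k):
    """Invert M = N * (L - K + 1) / L to recover the base-level depth N."""
    return kmer_depth * read_length / (read_length - k + 1)


if __name__ == "__main__":
    # Numbers quoted on the slide: 15,437,084,746 k-mers, peak depth 8.
    size = estimate_genome_size(15_437_084_746, 8)
    print(f"{size / 1e9:.2f} Gb")  # prints 1.93 Gb
```

With the slide's figures, estimate_genome_size reproduces the quoted 1.93 Gb. The base_depth_from_kmer_depth helper simply rearranges the slide's relation M = N * (L - K + 1) / L, so for K = 17 and 100 bp reads a k-mer depth of 8 corresponds to a base depth of about 9.5X.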