Doctoral thesis: Analysis and Prediction of Transcription Factor Binding Sites and Animal Toxins
10126-20709042
Classification number:    Security level:
UDC:    Serial number:

Thesis title: Analysis and Prediction of Transcription Factor Binding Sites and Animal Toxins
Graduate student:
Supervisor:    Professor
Major: Biophysics
Research direction: Theoretical biophysics
March 30, 2010

Declaration of Originality

I declare that the thesis submitted here is the result of research I carried out under the guidance of my supervisor. Except for content already cited and acknowledged in the text, the thesis contains no research results published or written by others, and no material used to obtain a degree or certificate from Inner Mongolia University or any other educational institution. Every contribution made to this research by colleagues who worked with me has been clearly stated and acknowledged in the thesis.

Signature of thesis author:    Signature of supervisor:
Date:    Date:

Commitment on the Use of Research Results Obtained During Enrollment

The author of this thesis fully understands the university's regulations on retaining and using theses, namely: Inner Mongolia University has the right to retain all or part of the thesis and to submit photocopies and disk copies of it to the relevant national institutions and departments, may include it in relevant databases for retrieval, and may preserve and compile it by photocopying, reduced-size printing, or other means of reproduction. To protect the intellectual property of the school and the supervisor, research results obtained by the author during enrollment belong to Inner Mongolia University. Any later use by the author of the main research content or results from the period of enrollment requires the consent of the supervisor at Inner Mongolia University during that period; if they are used in a published paper, Inner Mongolia University must be named as the affiliation before the paper is submitted or published.

Signature of thesis author:    Signature of supervisor:
Date:    Date:

Abstract

Identifying transcription factor binding sites is an important step toward understanding the mechanisms of transcriptional regulation. Reliable prediction of binding sites helps to identify the target genes of transcription factors and to study how the position of a site within the upstream regulatory region affects regulation. However, the specificity achieved by current prediction algorithms is generally low, so a new and effective prediction algorithm is needed. Animal toxins act directly on a wide variety of pharmacological targets, which makes them important tools for studying those targets. They are also widely used in ion channel research, drug discovery, and the synthesis of insecticides. Predicting animal toxins is therefore important, and a computational method that identifies them accurately is needed.

This thesis studies six subjects: transcription factor binding sites, animal toxins, neurotoxins, cytotoxins, presynaptic neurotoxins, and postsynaptic neurotoxins. They are predicted with four algorithms: the position correlation scoring function (PCSF), the increment of diversity (ID), the support vector machine (SVM), and the Naive Bayes classifier (NB). The main contributions are summarized as follows.

First, binding-site data for 8 experimentally verified, non-redundant transcription factors were selected from the JASPAR database. Combining positional conservation with pseudo-counts, a position correlation scoring function (PCSF) was constructed. An optimal cutoff was defined for the PCSF so that the false positive rate at that cutoff is low. To assess its predictive ability, the PCSF was compared with the position weight matrix (PWM) used in MATCH™; the results show that the PCSF predicts binding sites better than the PWM.

Second, all animal toxin sequences were downloaded from the Animal Toxin Database (ATDB), and the non-toxin protein sequences provided in the work of Saha and Raghava were used as the negative set. Both sets were culled by sequence identity with the PISCES software, giving datasets with sequence identity below 25%, 40%, 60%, 80%, and 90%. Four feature sets, namely the 20 amino acid compositions, the 400 dipeptide compositions, the 6 amino acid hydropathy compositions, and the 36 hydropathy dipeptide compositions, were each used as input parameters of the ID algorithm to predict animal toxins and non-toxins in these datasets. The results show that the ID algorithm performs best with dipeptide compositions as input, and that the results for the 5 datasets vary little with sequence identity. To improve prediction accuracy further, the 4 kinds of ID values were combined and used as the input parameters of an SVM; the SVM results are better than those of the ID algorithm. Neurotoxins and cytotoxins were also predicted. To compare the SVM with other approaches, it was applied to the neurotoxin dataset built by Saha and Raghava, where it achieved higher success rates than the algorithm they proposed.

Finally, the protein sequences of presynaptic and postsynaptic neurotoxins were downloaded from Swiss-Prot, and the distribution of disulfide bond types and disulfide bond numbers was tallied according to the annotation provided by the database. From ATDB and Swiss-Prot, two presynaptic and postsynaptic neurotoxin datasets, dataset 1 and dataset 2, were constructed, each with sequence identity below 80%. Five feature extraction methods were used:
(1) the dipeptide compositions of the protein sequence;
(2) 50 dipeptide features selected by the MRMR software;
(3) motif features discovered by MEME;
(4) motif features discovered by Prosite;
(5) motif features discovered by Interpro.
Combining these five kinds of features gave 12 classes of parameters in total, which were used as the input of the ID algorithm and the NB classifier to predict datasets 1 and 2 under the jackknife test. The results show that: (1) adding motif features gives better results than using the 400 dipeptide features alone; (2) the best results for presynaptic and postsynaptic neurotoxins are obtained with motif features together with the 50 selected dipeptide features.

Keywords: transcription factor binding sites; animal toxins; motif features; increment of diversity; Naive Bayes classifier

Contents

Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
  1.1 Introduction
  1.2 Background and significance of the research topic
  1.3 Current status and progress of research in China and abroad
    1.3.1 Current status and progress of research on transcription factor binding sites
    1.3.2 Current status and progress of research on animal toxins
  1.4 Introduction to databases and software
  1.5 Structure of the thesis
Chapter 2 Theoretical research methods
  2.1 Position weight matrix algorithm
  2.2 Increment of diversity algorithm
    2.2.1 Measure of diversity and increment of diversity
    2.2.2 Minimum increment of diversity algorithm
  2.3 Support vector machine algorithm
  2.4 Naive Bayes classifier
    2.4.1 Bayes' theorem
    2.4.2 Naive Bayes classifier
    2.4.3 Calculation of conditional probabilities
  2.5 Feature selection algorithms
    2.5.1 Amino acid composition information
    2.5.2 Dipeptide composition information of amino acid sequences
    2.5.3 Amino acid hydropathy distribution information
  2.6 Mutual-information-based feature parameters
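
Sections 2.2 and 2.5.2 of the outline name the increment of diversity (ID) algorithm and dipeptide composition features, which the abstract pairs for toxin prediction. The sketch below is one minimal reading of that pairing. It assumes the commonly used definitions D(n) = N ln N - sum_i n_i ln n_i for the measure of diversity and ID(X, S) = D(X + S) - D(X) - D(S) for the increment, with a sequence assigned to the class whose profile gives the smallest ID. The thesis's own formulation is not shown in this excerpt, so the choice of natural logarithm, all function names, and the toy sequences are assumptions made for illustration only.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_counts(seq):
    """Count overlapping dipeptides; pairs with non-standard letters are skipped."""
    counts = [0] * len(DIPEPTIDES)
    for a, b in zip(seq, seq[1:]):
        i = INDEX.get(a + b)
        if i is not None:
            counts[i] += 1
    return counts


def diversity(counts):
    """Measure of diversity D(n) = N ln N - sum n_i ln n_i (assumed definition)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, s):
    """ID(X, S) = D(X + S) - D(X) - D(S); small values mean similar sources."""
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)


def class_profile(sequences):
    """Sum the dipeptide counts of all training sequences of one class."""
    profile = [0] * len(DIPEPTIDES)
    for seq in sequences:
        for i, n in enumerate(dipeptide_counts(seq)):
            profile[i] += n
    return profile


def predict(seq, profiles):
    """Assign seq to the class whose profile gives the minimum ID."""
    x = dipeptide_counts(seq)
    return min(profiles, key=lambda label: increment_of_diversity(x, profiles[label]))


if __name__ == "__main__":
    # Toy, made-up sequences; real use would train on ATDB and non-toxin sets.
    profiles = {
        "toxin": class_profile(["CCKCC", "CKCCK"]),
        "non_toxin": class_profile(["AGLAG", "GLAGL"]),
    }
    print(predict("CCKC", profiles))
```

In this reading, the ID values computed against each class profile could also be collected into a feature vector for a downstream classifier, which is how the abstract describes feeding 4 kinds of ID values to the SVM.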