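position correlation scoring function (PCSF), increment of diversity (ID), Support Vector Machine (SVM), and Naive Bayes Classifier (NB): these four types of algorithms were used to predict them. The work of this thesis is as follows. First, data for 8 experimentally verified, non-redundant transcription factor binding sites were selected from the transcription factor binding site database JASPAR. Combining positional conservation with pseudo-counts, a position correlation scoring function was constructed. An optimal threshold was defined for the position correlation scoring function so that the results obtained at this threshold have a low false positive rate. Meanwhile, in order to compare the scoring function on transcription factor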
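annotation information, the distributions of disulfide bond types and disulfide bond numbers in presynaptic and postsynaptic neurotoxins were tallied. Protein sequences of presynaptic and postsynaptic neurotoxins were downloaded from the ATDB and Swiss-Prot databases, and dataset 1 and dataset 2, each with sequence similarity below 80%, were constructed. Five methods were used to select parameters: (1) dipeptide parameters of the protein sequences; (2) 50 dipeptide parameters extracted by the MRMR software; (3) motif features found by MEME; (4) motif features found by Prosite; (5) motif features found by Interpro. These 5 kinds of parameters were also combined, giving 12 classes of parameters in total, which were used as inputs to the increment of diversity and the Naive Bayes classifier to predict datasets 1 and 2 under the Jackknife test. The prediction results show that: (1) adding motif parameters gives better prediction results than using dipeptide parameters; (2) the prediction results for presynaptic and postsynaptic neurotoxins are best when motif parameters and the 50 dipeptide parameters are used.

Keywords: transcription factor binding sites; animal toxins; motif features; increment of diversity; Naive Bayes classifier

Analysis and prediction of transcription factor binding sites and animal toxins

Abstract

The identification of transcription factor binding sites is an important step towards understanding transcription regulation. Reliable prediction of transcription factor binding sites can help to identify the target genes of transcription factors and to infer the relationship between the positions of binding sites and the regulatory activity of transcription factors. But the specificity of recognition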
results achieved by the current algorithms is quite low; therefore, algorithms that can identify binding sites more efficiently are required. Animal toxins are directed against a wide variety of pharmacological targets, making them good tools for studying the properties of these targets. Animal toxins are used in studies of ion channels, drug discovery, and the formulation of insecticides. Predicting animal toxins has therefore become very important, and a computational method to identify them is needed. In this thesis, six targets, namely transcription factor binding sites, animal toxins, neurotoxins, cytotoxins, presynaptic neurotoxins and postsynaptic neurotoxins, are predicted using the position correlation scoring function (PCSF), increment of diversity (ID), support vector machine (SVM) and Naive Bayes classifier (NB). The main contributions are summarized as follows:

First, 8 non-redundant, experimentally verified transcription factor binding sites are extracted from the JASPAR database. Based on pseudo-counts and a conservation analysis of the transcription factor binding sites, a novel position correlation scoring function algorithm (PCSF) is
proposed. To reduce false positives, optimal cutoffs are defined for the position correlation scoring function (PCSF). Testing is performed to compare the recognition accuracy of the PCSF algorithm with that of the position weight matrix (PWM) used in MATCH™; the results indicate that the PCSF algorithm performs better than the PWM algorithm.

Second, the animal toxin sequences are downloaded from the Animal Toxin Database (ATDB), and the non-toxin dataset described in the work of Saha and Raghava is used as the negative dataset. Both the animal toxin and non-toxin datasets are culled with the PISCES software, and the datasets with less than 25%, 40%, 60%, 80% and 90% sequence identity are used. Based on 20 amino acid compositions, 400 dipeptide compositions, 6 amino acid hydropathy compositions and 36 hydropathy dipeptide compositions, the ID algorithm is applied to predict animal toxins and non-toxins. The results indicate that the best predictions are obtained with dipeptide compositions as the input parameters. To improve the success rates for animal toxins, the 4 kinds of ID values are combined as the input parameters of the SVM, and the overall prediction accuracy of the SVM is better than that of the ID algorithm. In addition, neurotoxins and cytotoxins are also predicted. To compare the SVM with other approaches, it is also used to predict the neurotoxins described in the work of Saha and Raghava, where it achieves higher prediction success rates than the previous algorithms.

Finally, the protein sequences of presynaptic and postsynaptic neurotoxins are obtained from Swiss-Prot. The distribution of disulfide bond numbers and classes is studied according to the annotation information provided by Swiss-Prot. Based on ATDB and Swiss-Prot, two neurotoxin datasets with sequence identity below 80% are obtained. Five feature extraction methods are used in this thesis: (1) dipeptide compositions; (2) 50 features extracted by the MRMR software; (3) motif features discovered by MEME; (4) motif features discovered by Prosite; (5) motif features discovered by Interpro. With 12 kinds of hybrid parameters as the input parameters of the ID algorithm and the NB classifier, the two datasets are predicted. The results of the jackknife tests show that: (1) the predictive results based on extracted motif features
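are better than those based on the 400 dipeptide features; (2) the best predictive results are obtained by using motif features together with the 50 extracted features.

Keywords: transcription factor binding sites; animal toxins; motif features; increment of diversity; Naive Bayes classifier

Contents

Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
1.1 Preface
1.2 Background and significance of the research topic
1.3 Current status and progress of research in China and abroad
1.3.1 Current status and progress of research on transcription factor binding sites
1.3.2 Current status and progress of research on animal toxins
1.4 Introduction to databases and software
1.5 Organization of the thesis
Chapter 2 Introduction to the theoretical research methods
2.1 Position weight matrix algorithm
2.2 Increment of diversity algorithm
2.2.1 Measure of diversity and increment of diversity
2.2.2 Minimum increment of diversity algorithm
2.3 Support vector machine algorithm
2.4 Naive Bayes classifier
2.4.1 Bayes' theorem
2.4.2 Naive Bayes classifier
2.4.3 Calculation of conditional probabilities
2.5 Feature selection algorithms
2.5.1 Amino acid composition information
2.5.2 Dipeptide composition information of amino acid sequences
2.5.3 Amino acid hydropathy distribution information
2.6 Feature parameters based on mutual information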
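
The 400 dipeptide composition features and the increment of diversity (ID) named in the abstract and in items 2.2 and 2.5.2 of the outline can be illustrated with a minimal Python sketch. It assumes the commonly cited definition of the diversity measure, D(X) = N ln N - sum_i n_i ln n_i with N = sum_i n_i, and ID(X, S) = D(X + S) - D(X) - D(S). The excerpt itself does not state the formula, so the logarithm base, the function names, and the toy class profiles used here are assumptions made for illustration and not the author's implementation.

```python
import math
from itertools import product

# The 20 standard amino acids give 20 * 20 = 400 ordered dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_counts(sequence):
    """Count overlapping dipeptides in a protein sequence (400-dim vector)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[d] for d in DIPEPTIDES]


def diversity(counts):
    """Assumed diversity measure: D(X) = N ln N - sum(n_i ln n_i)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, s):
    """Assumed ID(X, S) = D(X + S) - D(X) - D(S); small when X resembles S."""
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)


def predict(query_counts, class_profiles):
    """Assign the query to the class profile with the minimum ID."""
    return min(
        class_profiles,
        key=lambda label: increment_of_diversity(query_counts, class_profiles[label]),
    )
```

In this sketch a class profile is simply the summed count vector of the training sequences of that class, and a query sequence is assigned to the class with the smallest ID, in the spirit of the minimum increment of diversity algorithm listed in the outline. Under a jackknife test, the query sequence's own counts would be removed from its class profile before scoring.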
