生物信息学数据库.ppt
Databases for Bioinformatics
陈艳炯, Department of Immunology and Pathogen Biology, School of Medicine

Fundamentals of Database Systems
Basic concepts of databases
Development of data management systems
Development of database technology
Components of a database system
Architectures of database application systems

Data
Definition: symbolic records that describe objective things (objects).
Kinds of data: text, graphics, images, sound.
Characteristic: data cannot be separated from their semantics.

Data
The term data means groups of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data (plural of datum, which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.

How the concept of data has changed
In nature: from simple to integrated; from private to shared.
In quantity: from small to large to massive volumes.
In position: from a subordinate role within software to a dominant one.

Information
Information is an abstract reflection, carried by data, of the things, events, and concepts that actually exist in the objective world.
Information = data + data processing

Data processing
Computer data processing is any process that uses a computer program to enter data and to summarise, analyse, or otherwise convert data into usable information. The process may be automated and run on a computer. It involves recording, analysing, sorting, summarising, calculating, disseminating, and storing data. Because data are most useful when well presented and actually informative, data-processing systems are often referred to as information systems.

Data analysis
When the domain from which the data are harvested is a science or an engineering discipline, "data processing" and "information systems" are considered too broad, and the more specialized term "data analysis" is typically used. It focuses on the highly specialized and highly accurate algorithmic derivations and statistical calculations that are less often seen in a typical general business environment. Data analysis packages such as DAP, gretl, or PSPP are often used.

Elements of data processing
In order to be processed by a computer, data must first be converted into a machine-readable format. Once data are in digital format, various procedures can be applied to them to obtain useful information. Data processing may involve various processes, including: data acquisition, data entry, data cleaning, data validation, data
tabulation, statistical analysis, computer graphics, data warehousing, and data mining.

Data acquisition
In computer data processing, data acquisition is the sampling of real-world physical conditions and the conversion of the resulting samples into digital numeric values that can be manipulated by a computer. The components of data acquisition systems include:
Sensors that convert physical parameters to electrical signals.
Signal conditioning circuitry that brings sensor signals into a form that can be converted to digital values.
Analog-to-digital converters, which convert conditioned sensor signals to digital values.
Depending on the application, acquired data may be displayed, analyzed, or recorded, or some combination thereof. Data acquisition applications may be controlled by commercial DAQ software or by custom programs developed in general-purpose programming languages such as BASIC or C. Specialized programming languages used for data acquisition include EPICS, for building large-scale data acquisition systems; LabVIEW, which offers a graphical programming environment; and MATLAB, which provides graphical tools and libraries for data acquisition and analysis.

Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been caused originally by different data dictionary definitions of similar entities in different stores, by user entry errors, or by corruption in transmission or storage. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is
performed at entry time, rather than on batches of data. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).

A data entry clerk is a member of staff who reads hand-written or printed records and types them into a computer. Such clerks are sometimes employed on a temporary basis, but most large companies with large amounts of data hire them on a near-permanent basis.

In computer science, data validation is the process of ensuring that a program operates on clean, correct, and useful data. It uses routines, often called validation rules or check routines, that check the correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary or by including explicit validation logic in the application program. Incorrect data validation can lead to data corruption or a security vulnerability. Data validation checks that data are valid, sensible, reasonable, and secure before they are processed.

Computer graphics are graphics created using computers and, more generally, the representation and manipulation of pictorial data by a computer. The development of computer graphics, often abbreviated CG, has made computers easier to interact with and better for understanding and interpreting many types of data. Developments in computer graphics have had a profound impact on many types of media and have revolutionized the animation and video game industries.

Data mining is the process of extracting patterns from data. As more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming these data into information. It is commonly used in a wide range of profiling
practices, such as marketing, surveillance, fraud detection, and scientific discovery.

Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently. The logical representation and physical storage of a data structure appear as the logical structure of the data, the storage structure, the methods for processing the data (algorithms), and the results of processing.

The two main structures of a database are TABLES and INDEXES. Tables are the structures that store your data in the database. Each table is composed of a number of FIELDS, also known as COLUMNS in some database engines. Indexes do not store data, and you do not use them directly. They are used internally by the database engine to speed up certain search operations.

Field names and types are defined when you create a table.

In order to create an index you have to define the table and the field to be indexed, and the indexing order (ascending or descending). Indexes can also be UNIQUE, in which case the indexed field does not allow duplicate data to be inserted in different records or rows (for example, you could not have two employees with the same userid value if the userid field is indexed as UNIQUE).

Data Manipulation
Data manipulation covers inserting, deleting, and updating data. More broadly it includes classification, merging, sorting, access, retrieval, input, output, and updating (insertion, deletion, and modification).
A data manipulation language (DML) is a family of syntax elements, similar to a computer programming language, used for inserting, deleting, and updating data in a database. A widely used example is Structured Query Language (SQL), which is used to retrieve and manipulate data in a relational database. Other forms of DML are those used by IMS/DL/I and by CODASYL databases such as IDMS.

Data management
Data management means classifying, organizing, encoding, storing, retrieving, and maintaining data; it is the central problem of data processing.
Stages in the development of data management technology:
Manual management (before the mid-1950s)
File systems (late 1950s to mid-1960s)
Database systems (late 1960s to the present)

Comparison of the stages of data management technology

Database technology is a method of computer-assisted data management. It studies how to organize and store data and how to acquire and process data efficiently, and it is an important branch of computer science. It studies the basic theory and implementation methods of database structure, storage, design, management, and application, and it uses that theory to process, analyze, and understand the data in a database.

Database technology includes theory and experimental methodology for building computer systems that handle large data volumes. Central to it is the development of concepts, languages, software, and methods for
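describing, storing, searching, analyzing, and distributing data, and for other data processing, so that access to data is simple, efficient, scalable, reliable, and adaptable to new application areas.

Database
Definition of a database (DB)
A database is "a repository that organizes, stores, and manages data according to a data structure." A database is a computerized record-keeping system. It can be regarded as an electronic filing cabinet, a place for storing computerized files, in which users can add or delete files and can insert, retrieve, update, and delete the data in those files. A database is a large, organized, shareable collection of data stored in a computer over the long term.

Basic characteristics of a database
Data are organized, described, and stored according to a data model.
Data can be shared by many kinds of users.
Redundancy is low.
Data independence is high.
The database is easy to extend.

Main features of a database
1. Compactness: there is no need for bulky paper files.
2. Speed: the computer can retrieve and update stored data far faster than a person can by hand.
3. Less drudgery: the computer does the routine work for you.
4. Currency: accurate, up-to-date information is available whenever you ask for it.
5. Simplicity: an easy way to collect, access, connect, and display information.
6. Stability: prevents unnecessary loss of data.
7. Security: protects against unauthorized access to private data.

Architectures
A number of database architectures exist, and many databases use a combination of strategies. Databases are software-based containers that are structured to collect and store information so that it can be retrieved, added to, updated, or removed in an automatic fashion. Database programs are designed so that users can add or delete any information needed. The basic structure of a database is the table, which consists of rows and columns of information.

Main characteristics of a database
(1) Data sharing. All users can access the data in the database at the same time, and users can also use the database in various ways through interfaces.
(2) Reduced data redundancy. Eliminating large amounts of duplicated data reduces redundancy and maintains the consistency of the data.
(3) Data independence. The logical structure of the database and the application programs are independent of each other, and changes to the physical structure of the data do not affect its logical structure.
(4) Centralized control of data. The database controls and manages data centrally and uses a data model to represent how the data are organized and how they are related.
(5) Data consistency and maintainability, to ensure the security and reliability of the data. Security control prevents data loss, erroneous updates, and unauthorized use. Integrity control ensures that data are correct, valid, and compatible. Concurrency control allows multiple accesses to the data within the same period while preventing abnormal interactions between users. Failure detection and recovery: the database management system provides methods for detecting and repairing failures promptly, so that data are not destroyed.

The position of the database in a computer system
[Diagram; labels: hardware platform; basic software platform; software infrastructure platform; application software platform; software products; collaboration software, office software; database system, operating system; middleware, application server]

Database system (Database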
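System, DBS): components
Hardware system; database; database management system (DBMS); personnel.

Components of a database system
Database: a collection of related data, organized according to a certain structure and stored on tape, magnetic disk, optical disc, or other external storage media.
Database management system (DBMS): a set of programs that describe, manage, and maintain the database. It inserts new data and modifies and retrieves existing data in a common, controlled way.
Personnel: end users, database designers, systems analysts and application programmers, and database administrators (DBAs).

Database management system
A database management system (DBMS) is a layer of data management software located between the user and the operating system. It is basic software and a large, complex software system.
Purpose of a DBMS: to organize and store data systematically and to acquire and maintain data efficiently. A DBMS manages and shares data in a unified way.
The data model is the core and foundation of a database system; every DBMS is based on some data model. Traditional database systems are therefore usually classified by their data model into three types: network databases, hierarchical databases, and relational databases.

A database management system (DBMS) is a set of computer programs that controls the creation, maintenance, and use of a database.

Main functions of a DBMS
Data definition: provides a data definition language (DDL) for defining the data objects in the database.
Data organization, storage, and management: classifies, organizes, stores, and manages all kinds of data; determines the file structures and access methods used to organize data; implements the relationships among data; provides multiple access methods to improve access efficiency.
Data manipulation: provides a data manipulation language (DML) for the basic operations on the database (query, insert, delete, and update).
Transaction management and runtime management: the DBMS manages and controls the database in a unified way while it is created, run, and maintained; it ensures data security, data integrity, and concurrent use of data by multiple users; and it recovers the system after a failure.
Database creation and maintenance (utilities): loading and converting initial data, dumping the database, recovering from media failure, reorganizing the database, monitoring and analyzing performance, and so on.
Other functions: communication between the DBMS and other software systems on the network; data conversion between two DBMSs; mutual access and interoperation among heterogeneous databases.

Some of the more popular relational database management systems include Microsoft Access, FileMaker, Microsoft SQL Server, MySQL, and Oracle.

Microsoft SQL Server

Microsoft Access

The SQL language is divided into four categories: data query language (DQL), data manipulation language (DML), data definition language (DDL), and data control language (DCL).

The interdisciplinary nature of bioinformatics requires the use of a variety of discipline-specific databases.

Oracle Database Architecture on Windows

A database is an integrated collection of logically related records or files consolidated into a common pool that provides data for one or more uses. The data in a database are organized according to a database model, such as the relational model, the hierarchical model, or the network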
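model.

Architectures of database application systems
Master-slave database systems
Distributed database systems
Client/server (C/S) database systems
Browser/server database systems

Master-slave database systems
A master-slave database system is a multi-user structure in which one host serves multiple terminals. In this structure the whole database system, including the application programs, the DBMS, and the data, resides on the host, and the host performs all processing. Users access the database concurrently through the host's terminals and share its data resources.
Advantage: the data are easy to manage and maintain.
Disadvantages: the host can become overloaded and turn into a bottleneck, sharply degrading system performance; and when the host fails the entire system becomes unusable, so reliability is low.

Distributed database systems
In a distributed database system the data form a logical whole but are physically distributed across different nodes of a computer network. Each node can process the data in its local database independently and run local applications; it can also access and process data in several remote databases at the same time and run global applications.
Advantage: distributed database systems are an inevitable product of the development of computer networks, and they meet the database needs of geographically dispersed companies, groups, and organizations.
Disadvantages: distributing the data makes them harder to process, manage, and maintain; and when users frequently access remote data, system efficiency is clearly limited by network traffic.

Client/server (C/S) database systems
Server: a computer on one or more nodes of the network that is dedicated to running the DBMS; it is called the database server.
Client: a computer on another node that runs the DBMS's peripheral application development tools and supports the users' applications.
How it works: in the client/server structure, a user request from the client is sent to the database server, which processes it and returns only the result to the user (not the entire data set).
Advantages: the volume of data transferred over the network drops significantly, which improves system performance, throughput, and load capacity. Client/server databases also tend to be more open (supporting many different hardware and software platforms and database application development tools), so applications are more portable and software maintenance costs can be reduced.

Browser/server database systems
A browser/server (B/S) database system on the Internet or an intranet is, in essence, the same as a traditional C/S system: both run applications through the same request-and-response mechanism. However, the traditional C/S model concentrates a large amount of application software on the client, whereas B/S is a three-tier or multi-tier C/S structure based on Hyperlink, HTML, and Java in which the client needs only a browser. It is a new kind of architecture.

Data models
In a database, the data model is the tool used to abstract, represent, and process data and information from the real world. A data model should meet three requirements:
It models the real world fairly faithfully.
It is easy for people to understand.
It is easy to implement on a computer.

1. Conceptual data model: also called the conceptual model, this is a model of the real world oriented toward database users. It mainly describes the conceptual structure of the world. In the initial stage of design it frees database designers from the technical details of the computer system and the DBMS so that they can concentrate on analyzing the data and the relationships among them. It is independent of any particular database management system (Database Management System, DBMS). A conceptual data model must be converted into a logical data model before it can be implemented in a DBMS.

2. Logical data model: often called simply the data model, this is the model that users see from the database and the data model that a specific DBMS supports, such as the network data model or the hierarchical data model. It must serve both the user and the system, and it is mainly used in implementing the DBMS.

3. Physical data model: also called the physical model, this is a model oriented toward the physical representation in the computer. It describes how data are organized on storage media, and it depends not only on the specific DBMS but also on the operating system and hardware. Every logical data model has a corresponding physical data model when it is implemented. To preserve independence and portability, the DBMS carries out most of the implementation of the physical data model automatically; the designer only designs special structures such as indexes and clusters.

The most commonly used data models
Non-relational models: hierarchical model, network model
Relational model
Object-oriented model
Object-relational model

Database model
A database model or database schema is the structure or format of a database, described in a formal language supported by the database management system.

(1) Hierarchical model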
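The hierarchical model is essentially a directed, ordered tree with a root node (in mathematics, a "tree" is defined as a connected graph without cycles). An example is the organization chart of a university. The chart resembles a tree: the university administration is the root (the root node); the departments, majors, teachers, students, and so on are the branch points (nodes); and the connections between the root and the branch points are called edges. The ratio of root to branches is 1:N, that is, there is one root and N branches. A database system built on the hierarchical model is called a hierarchical database system. IMS (Information Management System) is a typical example.

In a hierarchical model, data is organized into a tree-like structure, implying a single upward link in each record to describe the nesting, and a sort field to keep the records in a particular order in each same-level list.

(2) Network model
A database system built on a network data structure is called a network database system; its typical representative is DBTG (Data Base Task Group). A network data structure can be converted into a hierarchical data structure by mathematical methods.

The network model (defined by the CODASYL specification) organizes data using two fundamental constructs, called records and sets. Records contain fields (which may be organized hierarchically, as in the programming language COBOL). Sets (not to be confused with mathematical sets) define one-to-many relationships between records: one owner, many members.

(3) Relational model
The relational data structure reduces complex data structures to simple relations, that is, two-dimensional tables. The employee records of an organization, for example, form such a relation. Relational database systems rest on the solid theoretical foundation of relational algebra, and after decades of development and practical use the technology has become increasingly mature. A database system built from relational data structures is called a relational database system.

The relational model was introduced by E. F. Codd in 1970 as a way to make database management systems more independent of any particular application. It is a mathematical model defined in terms of predicate logic and set theory.

Although relational database technology is mature, its limitations are also obvious: it handles so-called "tabular data" well but copes poorly with the growing number of complex data types that are emerging in the technical world.

(4) Object-oriented database systems
Object orientation is a methodology for understanding the world and also a new programming methodology. Combining the object-oriented approach with database technology makes the analysis and design of database systems match, as closely as possible, the way people understand the objective world. Object-oriented database systems are a new generation of database systems created to meet the needs of new database applications.

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These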