Web Data Mining Course Slides (WEB数据挖掘课件.ppt)

Resource ID: 5576448; file size: 3.71 MB; length: 92 pages; format: PPT

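Web Mining: Introduction
Liu Jun (刘均), Institute of System Architecture and Networks, School of Electronic and Information Engineering; office: West 1, Room 436

About This Course
- Course objectives
- Course content and schedule
- Reference books, exam, assignments, and slides
- Position of the discipline

Course Objectives
- Lay a foundation for in-depth research in Web mining, data mining, text mining, and related fields; be able to apply the theory and techniques learned to practical Web mining problems.
- Master the basic concepts of Web mining; understand its background, the current state of research, research directions, and main application areas.
- Master the basic concepts and the more mature algorithms of data mining and text mining.
- Master the basic concepts and the more mature algorithms of Web content mining, Web structure mining, and Web usage mining, with some ability to analyze and apply them.

Course Content
- Web structure mining
- Web content mining
- Web log mining
- Data mining
- Text mining

Course Content and Schedule
- Introduction (2 class hours)
- Data mining and text mining theory and techniques (20 class hours)
- Web structure mining (4 class hours)
- Web content mining (4 class hours)
- Web usage mining (4 class hours)
- Web mining application examples (2 class hours)

Textbook and Reference Books
- Zheng Qinghua, Liu Jun, Tian Feng, et al. Web知识挖掘：理论、方法与应用 (Web Knowledge Mining: Theory, Methods, and Applications). Science Press, 2010.
- Soumen Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann, 2002.
- Jiawei Han, Micheline Kamber, et al. 数据挖掘：概念与技术 (Data Mining: Concepts and Techniques). Translated by Fan Ming and Meng Xiaofeng. China Machine Press, 2001.

Exam and Assignments
- Exam: the grade is the weighted sum of the assignment grades.
- Assignment 1: experiment. Submit the experiment report, programs, data, and so on (60%, per [...] people).
- Assignment 2: translation of technical literature (40%, per person).
- Submission method and deadline: within two weeks of the start of next semester.

Slide Download
- u: web; p: web

Position of the Discipline
- Science: a worldview; understanding the world; a complete and rigorous system.
- Technology: a methodology; changing the world.
- Web mining (data mining) is a technology-oriented discipline.

Note on Citations
- Parts of these slides draw on slides and other materials by colleagues in China and abroad.

Topics of This Lecture
- Definition of Web mining
- Background of Web mining
- Classification
- Research status and applications of Web structure mining, Web content mining, and Web usage mining

Definition of Web Mining
- Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996).
- Data mining (knowledge discovery from databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Some DM Tasks
- Classification: mining patterns that can classify future data into known classes.
- Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.
- Clustering: identifying a set of similarity groups in the data.
- Sequential pattern mining: a sequential rule A → B says that event A will be followed by event B with a certain confidence.
- Deviation detection: discovering the most significant changes in data.

Other Definitions of Web Mining
- Jaideep Srivastava, borrowing from the definition of data mining, defines Web mining as "extracting interesting, potentially useful patterns and hidden information from Web documents and Web activities."
- Wikipedia: Web mining is defined as "the use of data mining techniques to discover patterns from the Web."

Understanding the Definition of Web Mining (Five Aspects)
- Information and knowledge
- Data analysis techniques
- Supporting technologies: data mining (DM), text mining (TM), multimedia mining (MM)
- Goal: obtain useful information or knowledge (rules, patterns, constraints)
- Data sources: Web documents and services
  - Patterns and data entities hidden in semi-structured data
  - Hyperlink relations
  - Web logs

Our Definition
- Using data mining, text mining, machine learning, and related techniques to discover interesting, latent, and useful rules, patterns, and domain knowledge from Web page data, log data, and hyperlink relations.

Information and Knowledge
- Information is data that has been organized into a meaningful context.
- Negentropy (负熵) versus entropy (熵): Schrödinger, "What Is Life?", 1944.
  - Negentropy is a measure of how ordered and organized a material system is.
  - Information is negentropy: it measures the degree of order of a system.
  - Information is used to eliminate uncertainty.
- Knowledge is defined as re-usable information in a specific context.

Data Pyramid
- Layers shown in the figure: data, information, knowledge.
- Data is a special format that describes facts, concepts, or instructions in a computer.
- Data that has been given semantics is called information.
- Knowledge is information with a wider scope of application.
- Wisdom is the ability to make decisions by integrating past knowledge with new information.

Data Analysis Evolution
- [Figure slide; no text recovered.]

Confluence of Multiple Disciplines
- Web mining draws on databases, information retrieval, data mining and text mining, natural language processing, machine learning, and the Web or Internet.
- Web mining research integrates research from several research communities (Kosala and Blockeel, July 2000).

How Web Mining Differs from Data Mining, Information Retrieval, and Information Extraction
- Web mining and data mining
  - The objects being mined differ: structured data versus unstructured or semi-structured data.
- Web mining and information retrieval
  - Information retrieval returns, from a given document collection, the documents relevant to a query.
  - It includes document modeling, classification, indexing, result ranking, and visualization; Web mining techniques are generally used for classification, indexing, and result ranking.
  - The results of information retrieval are often themselves the objects of Web mining.
- Web mining and information extraction
  - Information extraction pulls specific categories of information, such as metadata, out of given documents.
  - Information extraction methods can build extraction patterns automatically or semi-automatically.
  - Information extraction can produce a compressed version of a document, which improves mining efficiency.

Background of Web Mining

History of the Web
- 1965: Ted Nelson proposed hypertext, a way to write and publish nonsequential text (the idea he later set out in "Literary Machines").
- Late 1960s: Doug Engelbart at SRI developed the oNLine System (NLS), software for the about-to-be ARPANET that allowed hyperlinking between files on different computers.
- 1965: Doug Engelbart, the mouse.
- 1989-90: Berners-Lee, "global hypertext system."
  - The first Web server:
  - Three supporting technologies: HTML (Hyper Text Markup Language), which links information to information; URL (Uniform Resource Locator), which locates information; HTTP (Hyper Text Transfer Protocol), which enables distributed information sharing.
- 10/90: TBL writes the first browser program and names it "World Wide Web."
- 8/91: software released on the Internet.
- 9/93: "Mosaic" browser for PC; Web traffic measures 1% of traffic on the NSFnet backbone.

Development of Web Technology
- Client side
  - Technologies integrated into the Web browser, involving HTML, Java, CSS (Cascading Style Sheets), DHTML (Dynamic HTML), and browser plug-ins.
  - A gradual shift from static to dynamic.
- Server side
  - A gradual shift from static to dynamic.
  - NCSA: CGI (Common Gateway Interface), moving from executable programs to script programs.
  - PHP (Personal Home Page Tools).
  - Microsoft: ASP.
  - Servlet and JSP, which combine the centralized processing of CGI programs with the HTML embedding of PHP.
- Toward semantics
  - The Semantic Web is the next-generation way of organizing and expressing information on the Web. Building on today's linking and sharing of Web resources, it uses XML and the RDF framework to describe and manage the semantics of Web data so that machines can understand the data and integrate it into different applications, which better supports human-machine collaboration.

The Web
- The numbers of Web sites and Web pages are growing exponentially.
  - The number of Web sites grew from 1 in 1990 to more than 10^8 in 2006, with a doubling period of only 23 weeks.
  - The number of Web pages doubles every 6 months on average; after 2004 it reached the order of 10^10, with more than 8 million new pages added every day.
- The Deep Web contains 400 to 550 times as many pages as the PIW and accounts for about 99.7% of all Web pages.

Motivation for Web Mining
- Information explosion
  - Information grows by 2 exabytes (2 x 10^18 bytes) per year and doubles every [...] years.
  - Acquiring knowledge versus acquiring information.
  - "We are buried in information, but looking for knowledge" (John Naisbitt).
  - "The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all." (W. H. Auden)
- Responses to the information explosion: DM, KDD, KDT, TM, MM.
- Mining interesting patterns and knowledge leads to
  - better information and knowledge acquisition,
  - business intelligence,
  - more efficient organization of Web resources.

Web Mining: The User-Centric View (Client Side)
- Discovery of documents on a subject
- Discovery of semantically related documents or document segments
- Extraction of relevant knowledge about a subject

Web Mining: The Owner-Centric View (Server Side)
- Targeted ads for goods, services, and products
- Measuring the effectiveness of site content and structure
- Providing dynamic personalized services or content

Challenges Facing Web Mining
The complexity of the data itself and the limits on how it can be obtained give Web mining new challenges compared with traditional data mining.
- High complexity of Web data
  - Heterogeneity: the objects being mined are heterogeneous, and so is the way information is organized.
  - Semi-structured nature.
  - Dynamic nature.
  - Noisy data: information in a page that is irrelevant to the mining application, and low-quality page content.
- Limitations of Web data retrieval
  - Abundance problem. Proposed by Professor Jon Kleinberg of Cornell University: although the total amount of Web information is huge, the portion that interests any particular user is relatively small, that is, "99% of Web information is useless to 99% of Web users." The abundance problem causes severe information overload.
  - Limited coverage problem.
  - Limitations of the search interface. Mainstream search engines generally take keywords, or logical combinations of keywords, as the query. Such an interface can hardly express the user's intent precisely, mainly because natural-language vocabulary exhibits polysemy (one word, many meanings) and synonymy (one meaning, many words). Example: 病毒 & 电脑 (virus & computer).
  - Lack of personalized search. Existing search engines present every user with the same undifferentiated, one-size-fits-all search interface and result display.

Classification of Web Mining

Web Data
- Web pages
  - Intra-page structures
  - Inter-page structures
  - Internet structures
- Usage data
  - Click stream
- Supplemental data
  - Profiles
  - Registration information
  - Cookies

Basis of the Classification (Comprehensive Information Theory)
- Outline of "comprehensive information theory" (全信息理论): syntactic, semantic, pragmatic.

Web Content Mining

Taxonomy of Web Mining
- Web structure mining
- Web content mining
  - Web page content mining
  - Search result mining
- Web usage mining
  - General access pattern tracking
  - Customized usage tracking

Web Page Content Mining
1. Web page summarization
2. Can identify information within given Web pages
- Uses heuristics to distinguish personal home pages from other Web pages
- Looks for product prices within Web pages

Search Result Mining (Text Mining and Knowledge Discovery from Text)
- Search engine result summarization
- Clustering of search results: categorizes documents using phrases in titles and snippets

Web Content Mining
- Discovery of useful information from Web contents.
- Information retrieval view (structured + semi-structured)
  - Assist or improve information finding
  - Filter information for users based on user profiles
- Database view
  - Model data on the Web
  - Integrate them for more sophisticated queries

Web Contents
- Unstructured text documents
- Semi-structured HTML documents (hyperlinks)
- Textual, image, audio, video, and metadata
- Multimedia data mining

What Has Been Done in Web Content Mining?
1. Developing intelligent tools for IR
- Finding keywords and key phrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting key phrases from text documents
- Hierarchical clustering
2. Developing Web query systems
- Many applications, such as WebLog (Lakshmanan et al., 1996)
3. Mining multimedia data
- Fayyad et al. (1996): mining satellite images
- Smyth et al. (1996): mining images to identify small volcanoes on Venus

Server-Based Approaches
- Multilevel databases: the main idea behind these proposals is that the lowest level of the database contains primitive semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from lower levels and organized in structured collections such as relational or object-oriented databases.
- Web query systems

A Multiple Layered Meta-Web Architecture
- [Figure: Layer 0, Layer 1, ..., Layer n, with "Generalized Descriptions" and "More Generalized Descriptions" as the layers rise.]

Multilevel Databases: Meta-Web
- Layer 0: the Web itself
- Layer 1: the lowest layer of the ML-Web
  - An entry: a Web page summary, including class, time, URL, contents, keywords, popularity, rank, links, etc.
- Layer 2 and up:
- Benefits of the multi-layer Meta-Web
  - Multi-dimensional Web info summary analysis
  - Intelligent query
  - Web high-level query (WebSQL)

Applications of Web Content Mining
- Information retrieval (search) on the Web
- Automated generation of topic hierarchies
- Knowledge base
- Document classification
- Our own work
- E-commerce
  - Generate user profiles: improve customization and provide users with pages and advertisements of interest.
  - Targeted advertising: ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites. Internet advertising is probably the "hottest" Web mining application today.
  - Fraud: maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items). If the buying pattern changes significantly, signal fraud.

Web Structure Mining

Web Structure Mining in the Taxonomy
- Using links: use interconnections between Web pages to give weight to pages.
- Using generalization: Web. Counters (popularity).

Internet and Web Structures
- The fundamental difference between Internet and Web structures:
  - Internet structure is controlled by wiring.
  - Web structure is controlled by hyperlinks.
- Web structure
  - Structure of the hyperlinks within the Web itself
  - Structure of a page

Finding Authoritative Web Pages
- Retrieve pages that are not only relevant, but also of high quality, or authoritative on the topic.
- Hyperlinks can be used to infer the notion of authority.
  - These hyperlinks contain an enormous amount of latent human annotation.
  - A hyperlink pointing to another Web page can be considered the author's endorsement of that page.

PageRank and HITS
- PageRank: Stanford project. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Used by Google.
- HITS (Hyperlink-Induced Topic Search, also written Hypertext-Induced Topic Search): Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632 (1999). Developed by Jon Kleinberg while visiting IBM Almaden. IBM expanded HITS into Clever.

Mining Macroscopic Properties of the Internet
- Examples include scale-free and small-world properties and the bow-tie theory; these can be used to improve the efficiency and quality of mining.

Applications of Web Structure Mining
- Guiding page crawling (crawling "high-quality" pages)
- Helping result ranking (the PageRank or HITS algorithm)
- Finding related pages (query by example)
- Determining the Web impact factor

Web Usage Mining

General Access Pattern Tracking
- Web log mining
- Uses DM techniques to understand general access patterns and trends.
- Can shed light on better structure and grouping of resource providers.

Customized Usage Tracking
- Adaptive sites
- Analyzes the access patterns of one user at a time.
- The Web site restructures itself automatically by learning from user access patterns.

Web Usage Mining
- Web usage mining, also known as Web log mining, is the process of discovering interesting patterns in Web access logs.
- Commonly used approaches (Borges and Levene, 1999):
  - Map the log data into relational tables before an adapted data mining technique is performed.
  - Use the log data directly by utilizing special pre-processing techniques.
- Perform data mining on Web log records.
  - Find association patterns, sequential patterns, and trends of Web accessing.
- Conduct studies to
  - analyze system performance and improve system design through Web caching and Web page prefetching.

W3C Extended Log File Format
- Server logs:
  123.456.78.9 - 24/Oct/1999:19:13:44 0400 "GET /Images/tagline.gif HTTP/1.0" 200 1449 "Mozilla/4.51 en (Win98; I)"

Problems with Web Logs
- Identifying users
  - Clients may have multiple streams.
  - Clients may access the Web from multiple hosts.
  - Proxy servers: many clients, one address.
  - Proxy servers: one client, many addresses.
- Data not in the log
  - POST data (i.e., CGI requests) are not recorded.
  - Cookie data are stored elsewhere.
  - Pages may be cached.
  - Use of forward and backward pointers.

Applications of Web Usage Mining
- System improvement
  1) Site improvement: adjust the link structure and content of the site's pages according to how real users browse, so as to serve users better. The extreme case is adaptive Web sites.
  2) Caching and network transmission: for example, users' access patterns can be analyzed from a proxy's access records to predict which pages users will request next and so improve Web caching performance.
- Personalization
  - Most direct form: recommender.
  - Benefits: 1) makes searching and browsing easier for users; 2) strengthens the effect of advertising; 3) promotes online sales; 4) increases user loyalty.
  - Based on the user preferences discovered, content is dynamically customized for the user, or browsing suggestions are offered.

Standards, Specifications, and Languages Related to Web Mining

Data Mining Standards
- CRISP-DM (Cross Industry Standard Process for Data Mining). SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.
- PMML (Predictive Model Markup Language). Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be inherited, reused, and integrated. For this purpose the Data Mining Group (DMG) proposed PMML. The latest version cited here is PMML 3.2, which supports 12 kinds of data mining models:
  - AssociationModel (association rules)
  - ClusteringModel (clustering)
  - GeneralRegressionModel (general regression)
  - MiningModel (composite model)
  - NaiveBayesModel (naive Bayes)
  - NeuralNetwork (neural network)
  - RegressionModel (linear, polynomial, and logistic regression)
  - RuleSetModel (rule set)
  - SequenceModel (sequential patterns)
  - SupportVectorMachineModel (support vector machine)
  - TextModel (text model)
  - TreeModel (decision tree)
- JDM (Java Data Mining API). Aims to provide a standard API for accessing data mining tools. It supports building and using data mining models and creating, storing, accessing, and maintaining data and metadata, so that Java applications can easily integrate data mining technology.

Semantic Web Standards
- Tim Berners-Lee first presented the layered model of the Semantic Web (the "Layer Cake") in his talk at the XML 2000 conference. Its key feature: ontologies and logical inference rules are built on XML and RDF/RDFS to support semantics-based knowledge representation and reasoning, so that computers can understand and process the data.
- Layer 1: Unicode and URI (Uniform Resource Identifier).
  - In 1993 the universal character set became the ISO international standard ISO/IEC 10646; its aim is a single encoding for all of the world's writing systems.
  - A URI has three parts: the naming scheme used to access the resource, the name of the machine hosting the resource, and the name of the resource in path form.
  - Unlike a URL, a URI is only an identifier and does not directly provide a way to access the resource.
- Layer 2: XML (Extensible Markup Language).
  - XML is simple, self-describing, and extensible, and it separates content, structure, and presentation, which makes it well suited to data representation and exchange.
  - Constraints in XML Schema are mainly used to validate the structure of XML documents.
- Layer 3: RDF (Resource Description Framework), the metadata layer.
  - RDF is a framework for describing and exchanging metadata built on XML. It describes network resources in the form "resource, property, property value."
- Layer 4: RDF-S (RDF Schema).
  - RDF-S extends RDF. It is RDF's vocabulary description language (W3C, 2002), used to define the vocabulary that appears in RDF resource descriptions.
- Layer 5: ontology and rules, the domain knowledge layer.
  - OWL is used to state explicitly the terms in a vocabulary and the relations among them; it has stronger expressive power for word meaning and semantics.
  - Rules describe premises and conclusions in domain knowledge.
  - SPARQL (SPARQL Protocol and RDF Query Language) is the W3C-recommended language and protocol for querying RDF data.

Main Web Mining Tools
- Web content mining tools
  - Intelligent Miner for Text (IBM)
  - Semio Map (Semio)
  - Text Analyst (Megaputer)
- Web log mining tools
  - Analog (.ac.uk)
  - WUM (Web Utilization Miner)
  - Commerce Trends ()

Related International Journals and Conferences
- [Figure slides; no text recovered.]

Research Directions in Web Mining
- Ten major future research directions in data mining:
  - Developing a Unifying Theory of Data Mining
  - Scaling Up for High Di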
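
The slides define an association rule X → Y and a sequential rule A → B that holds "with a certain confidence," but they do not show how that confidence is computed. The following is a minimal sketch under the usual textbook definitions (support as the fraction of transactions containing an itemset, confidence as support(X ∪ Y) / support(X)). The function names and the `SESSIONS` data are hypothetical illustrations, not material from the course.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate of P(Y | X): support(X union Y) divided by support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, frozenset(x) | frozenset(y)) / support_x


# Hypothetical browsing sessions: each one is the set of pages a visitor opened.
SESSIONS = [
    frozenset({"home", "products", "cart"}),
    frozenset({"home", "products"}),
    frozenset({"home", "about"}),
    frozenset({"products", "cart"}),
]

if __name__ == "__main__":
    print("support({products})      =", support(SESSIONS, {"products"}))
    print("conf({products}->{cart}) =", confidence(SESSIONS, {"products"}, {"cart"}))
```

A real miner such as Apriori would enumerate candidate itemsets and prune those below a minimum support; this sketch only scores a rule that is already given.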
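
The Web structure mining slides treat a hyperlink as an endorsement and name PageRank as one way to turn links into page weights, without giving the computation. As a rough illustration only, here is a small power-iteration sketch of the commonly described PageRank formulation. The damping factor 0.85, the iteration count, the uniform handling of pages without out-links, and the toy `LINKS` graph are all assumptions made for this example, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a graph given as {page: [linked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            # Each out-link passes on an equal share of the source page's rank.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: a, b, and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(LINKS)

if __name__ == "__main__":
    for page, score in sorted(RANKS.items(), key=lambda item: -item[1]):
        print(page, round(score, 4))
```

Page c collects endorsements from three pages and ends up with the highest score, which matches the intuition on the slide that links carry latent human judgments of authority.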

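
The Web usage mining slides describe mapping server log data into relational tables before mining. A first step is usually splitting each log line into named fields. The sketch below assumes the widely used Common Log Format layout (host, identity, user, bracketed timestamp, quoted request, status, size) rather than the W3C Extended Log File Format named on the slide, which declares its fields in a header. The pattern, the function name, and the sample line are illustrative assumptions.

```python
import re

# Assumed Common Log Format layout; trailing fields such as the user agent are ignored.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no body was sent; store it as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample line modeled on the request shown in the slides.
SAMPLE = ('192.0.2.7 - - [24/Oct/1999:19:13:44 -0400] '
          '"GET /Images/tagline.gif HTTP/1.0" 200 1449')
RECORD = parse_log_line(SAMPLE)

if __name__ == "__main__":
    print(RECORD)
```

Each parsed record can then be inserted as one row of a relational table, after which sessions can be grouped by host and time. As the slides note, proxies and caching make the host field an unreliable user identifier.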