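Graduation project thesis, foreign literature translation (Chinese-English parallel text), Computer Science and Technology: Preprocessing and Mining Web Log Data for Web Personalization.doc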
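Nanjing University of Science and Technology, Taizhou Institute of Science and Technology
Graduation Project (Thesis): Translation of Foreign-Language Material

Department: Computer Science and Technology
Major: Computer Science and Technology
Name:
Student ID:
Source of the foreign text: Dipartimento di Informatica, Università di Pisa
Attachments: 1. Translation of the foreign-language material; 2. Original foreign-language text.
Supervisor's comments:
Signature:
Date (year / month / day):
Note: Please bind this cover page together with the attachments.

Attachment 1: Translation of the foreign-language material

Preprocessing and Mining Web Log Data for Web Personalization

Abstract: We describe the web usage mining activities of an ongoing project, called ClickWorld, which aims at extracting models of the navigational behaviour of the users of a web site. The models are inferred from the access logs of a web server by means of data mining and web mining techniques. The extracted knowledge is deployed to offer users a personalized and proactive view of the web services. We first describe the preprocessing steps on the access logs that are needed to select and prepare the data for knowledge extraction. We then present two sets of experiments: the first tries to predict the sex of a user from the web pages visited; the second tries to predict whether a user might be interested in visiting a given section of the site.

Keywords: knowledge discovery, web mining, classification.

1 Introduction

Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services. A common taxonomy of web mining distinguishes three main research lines: content mining, structure mining and usage mining. The boundary between these categories is not clear-cut, and approaches often combine techniques from different categories.

Content mining covers data mining techniques that extract models from the contents of web objects, including plain text, semi-structured documents (e.g., HTML or XML), structured documents (digital libraries), dynamic documents and multimedia documents. The extracted models are used to classify web objects, to extract keywords for information retrieval, and to infer the structure of semi-structured or unstructured objects.

Structure mining aims at discovering the underlying topology of the interconnections between web objects. The resulting model can be used to categorize and rank web sites, and to find similarities between them.

Usage mining is the application of data mining techniques to discover usage patterns from web data. The data are usually collected from users' interactions with the web, e.g., web/proxy server logs, user queries and registration data. Usage mining tools discover and predict user behaviour in order to help designers improve the web site, attract visitors, or offer regular users a personalized and adaptive service.

In this paper, we describe the web usage mining activities of an ongoing project, called ClickWorld, which aims at extracting models of user behaviour for the purpose of web site personalization. We collected and preprocessed access logs from a medium-large national web portal, vivacity.it, over a period of five months. The portal includes a national area (www.vivacity.it) with news, forums, jokes and so on, and more than 30 local areas (e.g., www.roma.vivacity.it) with city-specific information such as local news, restaurant addresses, theatre programmes and bus timetables.

The preprocessing steps include data selection, cleaning and transformation, and the identification of users and of user sessions. The result of preprocessing is a data mart of web accesses and registration information. Starting from the preprocessed data, web mining aims at pattern discovery by adapting methods from statistics, data mining, machine learning and pattern recognition. Among the basic data mining techniques we mention: association rules, which discover groups of objects that users frequently request together; clustering, which groups users with similar browsing patterns, or objects with similar content or access patterns; classification, in which a profile is built for the users belonging to a given class or category; and sequential patterns, namely sequences of requests that are common to many users.

Several of these methods are currently being used in the ClickWorld project to extract useful information for the proactive personalization of web sites. In this paper we describe two sets of classification experiments. The first aims at extracting a classification model able to discriminate the sex of a user from the set of web pages visited. The second aims at extracting a classification model able to discriminate the users who visit pages on a given topic, such as sport or finance, from those who typically do not.

2 Preprocessing for Web Personalization

We have developed a data mart of web logs specifically to support web personalization analysis. The data mart is populated from a web log data warehouse or, more simply, from raw web/proxy server log files. In this section we describe a number of preprocessing and coding steps performed for data selection, comprehension, cleaning and transformation. Although some of them are general data preparation steps for web usage mining, it is worth noting that many of them require some form of domain knowledge in order to clean, correct and complete the input data according to the requirements of web personalization.

2.1 User registration data

In addition to the web access logs, our input includes personal data on a subset of users, namely those who are registered with the vivacity.it web site (registration is not mandatory). For a registered user the system records the following information: sex, city, province, marital status and date of birth. This information is entered by the user in a web form at registration time and, as one might expect, its quality depends on the honesty of the user. As a preprocessing step, implausible data are detected and removed, such as dates of birth in the future or in the remote past. In addition, some other input fields were not imported into the data mart because almost all of their values had been left at the default choice of the web form. In other words, those fields were judged not to be useful for discriminating user choices and preferences.

To spare users from typing their login name and password at every visit, the vivacity.it web site adopts cookies. If the user's browser provides a cookie, authentication is not required. Otherwise, after authentication, a new cookie is sent to the user's browser. With this mechanism a user can be tracked until she deletes the cookies on her system. Moreover, if the user is registered, the login-cookie association is available in the input data, so the user can still be tracked after she deletes the cookies. This mechanism also makes it possible to detect non-human users, such as system diagnosis and monitoring programs. By checking the number of cookies assigned to each user, we discovered that the user login test009 had been assigned more than 24,000 distinct cookies. This is possible only if the user is a program that automatically deletes the cookies assigned to it, e.g., a system diagnosis program.

2.2 Web URL

On the one hand, a number of normalizations must be performed on URLs in order to remove irrelevant syntactic differences. For example, the host can be given in IP format or by name: 131.114.2.91 is the same host as kdd.di.unipi.it. On the other hand, some web server programs adopt non-standard formats for passing parameters, and the vivacity.it web server program is one of them. For instance, in the URL

http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html

the file name 1,3478,|DX,00 contains a code for the local web site, the web page id (3478) and its specific parameters (DX). This form was designed for efficient machine processing. For example, the web page id is a key into a database table where the page template is found, while the parameters allow the page content to be retrieved from some other table. Unfortunately, it is a nightmare when mining clickstreams of URLs. Syntactic features of URLs are of little help: we need some semantic information, or ontology, assigned to URLs.

At best, we can expect an application-level log to be available, i.e., a log of accesses to semantically relevant objects. An example of an application-level log is one recording that the user entered the site from the home page, then visited a sports page with news on a football team, and so on. This would require a system module that monitors user steps at a semantic level of granularity. In the ClickWorld project such a module is called ClickObserve. Unfortunately, the module is a deliverable of the project and was not available for collecting data at the beginning of the project. We therefore decided to extract both syntactic and semantic information from URLs via a semi-automatic approach. The approach consists in reverse-engineering the URLs, starting from the web site designers' description of the meaning of each URL path, web page id and web page parameter. Using a Perl script and the designers' description, we extracted the following information from the original URLs:

- the local web server (i.e., vivacity.it, roma.vivacity.it, etc.), which gives us some spatial information about user interests;
- a first-level classification of URLs into 24 types, some of which are: home, news, finance, photo galleries, jokes, shopping, forum, pubs;
- a second-level classification of URLs depending on the first-level one, e.g., URLs classified as shopping may be further classified as book shopping, PC shopping and so on;
- a third-level classification of URLs depending on the second-level one, e.g., URLs classified as book shopping may be further classified as programming book shopping, narrative book shopping and so on;
- parameter information, further detailing the three-level classification, e.g., URLs classified as programming book shopping may have the ISBN book code as a parameter;
- the depth of the classification, i.e., 1 if the URL has only a first-level classification, 2 if it has a first- and second-level classification, and so on.

Of course, the adopted approach was mainly a heuristic one, with the hierarchical ontology designed a posteriori. Moreover, the ontology does not exploit any content-based classification: the description of an elementary object, such as the sport news item with id 12345, is just its code (i.e., first level is news), with no reference to the content of the news.

Attachment 2: Original foreign-language text

Preprocessing and Mining Web Log Data for Web Personalization

M. Baglioni(1), U. Ferrara(2), A. Romei(1), S. Ruggieri(1), and F. Turini(1)

(1) Dipartimento di Informatica, Università di Pisa, Via F. Buonarroti 2, 56125 Pisa, Italy
{baglioni,romei,ruggieri,turini}@di.unipi.it
(2) KSolutions S.p.A., Via Lenin 132/26, 56017 S. Martino Ulmiano (PI), Italy
ferrara@ksolutions.it

Abstract. We describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users. The models are inferred from the access logs of a web server by means of data and web mining techniques.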
The extracted knowledge is deployed to offer users a personalized and proactive view of the web services. We first describe the preprocessing steps on access logs necessary to clean, select and prepare data for knowledge extraction. Then we show two sets of experiments: the first one tries to predict the sex of a user based on the visited web pages, and the second one tries to predict whether a user might be interested in visiting a section of the site.

Keywords: knowledge discovery, web mining, classification.

1 Introduction

According to [10], Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services. A common taxonomy of web mining defines three main research lines: content mining, structure mining and usage mining. The distinction between those categories is not clear-cut, and very often approaches use a combination of techniques from different categories.

Content mining covers data mining techniques to extract models from web object contents including plain text, semi-structured documents (e.g., HTML or XML), structured documents (digital libraries), dynamic documents and multimedia documents. The extracted models are used to classify web objects, to extract keywords for use in information retrieval, and to infer the structure of semi-structured or unstructured objects.

Structure mining aims at finding the underlying topology of the interconnections between web objects. The model built can be used to categorize and to rank web sites, and also to find similarities between them.

Usage mining is the application of data mining techniques to discover usage patterns from web data. Data is usually collected from users' interaction with the web, e.g. web/proxy server logs, user queries, registration data.
Usage mining tools [3,4,9,15] discover and predict user behavior, in order to help the designer to improve the web site, to attract visitors, or to give regular users a personalized and adaptive service.

In this paper, we describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behavior of users for the purpose of web site personalization [6]. We have collected and preprocessed access logs from a medium-large national web portal, vivacity.it, over a period of five months. The portal includes a national area (www.vivacity.it) with news, forums, jokes, etc., and more than 30 local areas (e.g., www.roma.vivacity.it) with city-specific information, such as local news, restaurant addresses, theatre programming, bus timetables, etc.

The preprocessing steps include data selection, cleaning and transformation, and the identification of users and of user sessions [2]. The result of preprocessing is a data mart of web accesses and registration information. Starting from preprocessed data, web mining aims at pattern discovery by adapting methods from statistics, data mining, machine learning and pattern recognition. Among the basic data mining techniques [7], we mention association rules, discovering groups of objects that are frequently requested together by users; clustering, grouping users with similar browsing patterns, or grouping objects with similar content or access patterns; classification, where a profile is built for users belonging to a given class or category; and sequential patterns, namely sequences of requests which are common to many users.

In the ClickWorld project, several of the mentioned methods are currently being used to extract useful information for proactive personalization of web sites. In this paper, we describe two sets of classification experiments. The first one aims at extracting a classification model able to discriminate the sex of a user based on the set of web pages visited.
The second experiment aims at extracting a classification model able to discriminate those users that visit pages regarding e.g. sport or finance from those that typically do not.

2 Preprocessing for Web Personalization

We have developed a data mart of web logs specifically to support web personalization analysis. The data mart is populated starting from a web log data warehouse (such as those described in [8,16]) or, more simply, from raw web/proxy server log files. In this section, we describe a number of preprocessing and coding steps performed for data selection, comprehension, cleaning and transformation. While some of them are general data preparation steps for web usage mining [2,16], it is worth noting that in many of them a form of domain knowledge must necessarily be included in order to clean, correct and complete the input data according to the web personalization requirements.

2.1 User registration data

In addition to web access logs, our given input includes personal data on a subset of users, namely those who are registered with the vivacity.it website (registration is not mandatory). For a registered user, the system records the following information: sex, city, province, civil status, date of birth. This information is provided by the user in a web form at the time of registration and, as one could expect, the quality of the data depends on the user's honesty. As a preprocessing step, improbable data are detected and removed, such as dates of birth in the future or in the remote past. Also, some additional input fields were not imported into the data mart since almost all values were left as the default choice in the web form. In other words, the fields were considered not to be useful in discriminating user choices and preferences.

In order to spare users from typing their login and password at each visit, the vivacity.it web site adopts cookies. If a cookie is provided by the user's browser, then authentication is not required.
Otherwise, after authentication, a new cookie is sent to the user's browser. With this mechanism, it is possible to track any user until she deletes the cookies on her system. In addition, if the user is registered, the login-cookie association is available in the input data, and then it is possible to track the user also after she deletes the cookies. This mechanism allows for detecting non-human users, such as system diagnosis and monitoring programs. By checking the number of cookies assigned to each user, we discovered that the user login 'test009' was assigned more than 24,000 distinct cookies. This is possible only if the user is some program that automatically deletes assigned cookies, e.g. a system diagnosis program.

2.2 Web URL

Resources in the World Wide Web are uniformly identified by means of URLs (Uniform Resource Locators). The syntax of an http URL is:

'http://' host.domain [':' port] abs_path ['?' query]

where host.domain:port is the name of the server site. The TCP/IP port is optional (the default port is 80), and abs_path is the absolute path of the requested resource in the server filesystem. We further consider abs_path of the form path '/' filename '.' extension, i.e. consisting of the filesystem path, filename and file extension. query is an optional collection of parameters, to be passed as an input to a resource that is actually an executable program, e.g. a CGI script.

On the one side, there are a number of normalizations that must be performed on URLs, in order to remove irrelevant syntactic differences (e.g., the host can be in IP format or host-name format: 131.114.2.91 is the same host as kdd.di.unipi.it). On the other side, there are some web server programs that adopt non-standard formats for passing parameters. The vivacity.it web server program is one of them.
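The decomposition and normalization just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the preprocessing code used in the project: the function name decompose, the HOST_ALIASES table and the example path /docs/index.html are hypothetical, and the only alias listed is the IP/host-name pair mentioned in the text.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical alias table (an assumption for illustration): it holds only
# the IP/host-name pair given in the text.
HOST_ALIASES = {"131.114.2.91": "kdd.di.unipi.it"}


def decompose(url):
    """Split an http URL into the components named in the text,
    normalizing the host name and the default port."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()   # host names are case-insensitive
    host = HOST_ALIASES.get(host, host)     # map a known IP to its host name
    port = parts.port or 80                 # the default http port is 80
    directory, filename = posixpath.split(parts.path)
    stem, extension = posixpath.splitext(filename)
    return {
        "host": host,
        "port": port,
        "path": directory,
        "filename": stem,
        "extension": extension.lstrip("."),
        "query": parts.query,
    }


# Two spellings of the same (made-up) resource yield the same record.
a = decompose("http://131.114.2.91:80/docs/index.html?lang=en")
b = decompose("http://KDD.DI.UNIPI.IT/docs/index.html?lang=en")
print(a == b)
```

Mapping both spellings to one record is what keeps irrelevant syntactic differences from being counted as distinct pages in later mining steps.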
As an example of such a non-standard format, in the following URL:

http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html

the file name 1,3478,|DX,00 contains a code for the local web site (1 stands for roma.vivacity.it), a web page id (3478) and its specific parameters (DX). The form above has been designed for efficient machine processing. For instance, the web page id is a key for a database table where the page template is found, while the parameters allow for retrieving the web page content in some other table. Unfortunately, this is a nightmare when mining clickstreams of URLs.

Syntactic features of URLs are of little help: we need some semantic information, or ontology [5,13], assigned to URLs. At best, we can expect that an application-level log is available, i.e. a log of accesses to semantically relevant objects. An example of an application-level log is one recording that the user entered the site from the home page, then visited a sport page with news on a soccer team, and so on. This would require a system module monitoring user steps at a semantic level of granularity. In the ClickWorld project such a module is called ClickObserve. Unfortunately, however, the module is a deliverable of the project, and it was not available for collecting data at the beginning of the project. Therefore, we decided to extract both syntactic and semantic information from URLs via a semi-automatic approach. The adopted approach consists in reverse-engineering URLs, starting from the web site designer's description of the meaning of each URL path, web page id and web page parameters.
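To make the file-name convention of the example URL concrete, the following Python sketch splits such a name into its parts. It is an illustration under stated assumptions rather than the project's script: the field layout is inferred from the single example in the text, the names SITE_CODES and decode_filename are hypothetical, and the site-code table contains only the one mapping the text states (1 for roma.vivacity.it). The source does not explain the leading '|' or the trailing field 00, so the sketch simply strips the former and returns the latter unparsed.

```python
import posixpath
from urllib.parse import urlsplit

# Only the mapping stated in the text; other site codes are unknown here.
SITE_CODES = {"1": "roma.vivacity.it"}


def decode_filename(url):
    """Decode a file name such as '1,3478,|DX,00' (layout assumed from the
    single example in the text) into site, page id and parameters."""
    filename = posixpath.basename(urlsplit(url).path)
    stem, _extension = posixpath.splitext(filename)
    fields = stem.split(",")
    if len(fields) < 3:
        raise ValueError("unexpected file name layout: %r" % filename)
    site_code, page_id, params = fields[0], fields[1], fields[2]
    return {
        "site": SITE_CODES.get(site_code),  # None for codes not in the table
        "page_id": int(page_id),
        "params": params.lstrip("|"),       # meaning of '|' is not documented
        "rest": fields[3:],                 # trailing fields, left unparsed
    }


info = decode_filename(
    "http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html"
)
print(info)
```

A decoder of this kind recovers only syntactic fields; attaching a meaning to page id 3478 still requires the designer's description discussed in the text.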
Using a Perl script, starting from the designer's description, we extracted from the original URLs the following information:

- the local web server (i.e., vivacity.it or roma.vivacity.it etc.), which provides us with some spatial information about user interests;
- a first-level classification of URLs into 24 types, some of which are: home, news, finance, photo galleries, jokes, shopping, forum, pubs;
- a second-level classification of URLs depending on the first-level one, e.g. URLs classified as shopping may be further classified as book shopping or PC shopping and so on;
- a third-level classification of URLs depending on the second-level one, e.g. URLs classified as book shopping may be further classified as programming book shopping or narrative book shopping and so on;
- parameter information, further detailing the three-level classification, e.g. URLs classified as programming book shopping may have the ISBN book code as a parameter;
- the depth of the classification, i.e. 1 if the URL has only a first-level classification, 2 if the URL has a first- and second-level classification, and so on.

Of course, the adopted approach was mainly a heuristic one, with the hierarchical ontology designed a posteriori. Also, the designed ontology does not exploit any content-based classification, i.e. the description of an elementary object such as sport news with id 12345 is its code (i.e., first level is news, second level is sport, parameter information 12345), with no reference to the content of the