Introduction to Text Mining
  Dr. Quan Zou (邹权), Assistant Professor

Outline
  Introduction
  TF-IDF
  Similarity

Introduction
  Why?
    Text mining
    Web mining
  How?
    Classification or clustering
    Retrieval

General Process of Text Classification
  Preprocessing: represent the document collection in a form that a computer can process easily
  Feature representation, feature selection, and dimensionality reduction: express the importance of each term in a document using a suitable weighting method
  Learning and modeling: build the classifier

Preprocessing for Text Classification
  Remove punctuation, extra whitespace, and (optionally) digits
  Normalize letter case
  Remove stop words: words that carry no real meaning, such as "and", "you", "have"
  Reduce words to a common stem: PorterStemmer
  Tokenization (word segmentation): English? Chinese?

Feature Representation
  Vector space model: terms are the features that make up a high-dimensional feature vector
  Weights are obtained with TF-IDF

TF-IDF
  TF (Term Frequency): how often a term occurs
  IDF (Inverse Document Frequency): the inverse of the document frequency
  The weight is the product TF*IDF

Similarity Applications
  Many Web-mining problems can be expressed as finding "similar" sets:
    Plagiarism / mirror pages / articles from the same source / duplicate removal
  Collaborative filtering as a similar-sets problem:
    Recommend to users items that were liked by other users who have exhibited similar tastes

Measurement
  Edit distance
    Short text, words
    For personal text
  Jaccard distance
    Long text, ignoring the word similarity
    For government text

Microsoft Academic Search
  PK: http:/, http:/

Real-world Data is Rather Dirty!
  Kenneth De Jong / Kenneth Dejong
  2023/7/18, Trie-Join VLDB2010, 10/38
  Typo in "author": Argyrios Zymnis / Argyris Zymnis
  Typo in "title": relaxed / related
  DBLP Complete Search

Similarity Joins
  The similarity join is an essential operation for data integration and cleaning
  Perform a similarity join on the Name attribute of R (find all record pairs whose Name attributes are similar)
  Output: (2037349, 3054641)

Near Duplicate Data
  On one end, a winded Pete Sampras tried to summon enough energy to give the New York fans another memorable win to talk about it on the subway ride home. On the other side, Roger Federer wore a sly grin like he knew age was about to catch up to the former world No. 1 - the man who owns the record of 14 Grand Slams he wants.
  03/11/2008 | 11:28 AM
  By JAY COHEN, AP Sports Writer, Mar 11, 4:23 am EDT

Similarity Join
  Tokenize: each record is a set of tokens from a finite universe
  Suppose each record is a single text document
    x = "yes as soon as possible"
    y = "as soon as possible please"
    x = {A, B, C, D, E}
    y = {B, C, D, E, F}

References
  Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
  Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. Pass-Join: A Partition-based Method for Similarity Joins. VLDB 2012.
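
The TF-IDF slide names the two factors but does not give a formula. The sketch below assumes one common variant, tf = count / document length and idf = ln(N / df); the function name `tf_idf` and the three sample documents are illustrative and not taken from the slides.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (a common variant, not specified in the slides):
    tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Illustrative, already preprocessed documents (lowercased, split on spaces).
docs = [
    "text mining finds patterns in text".split(),
    "web mining finds patterns in web pages".split(),
    "clustering groups similar documents".split(),
]
weights = tf_idf(docs)
```

With this variant, a term that appears in every document gets weight 0, while a term that is frequent in one document and rare elsewhere (here "text" in the first document) gets the largest weight.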
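
The two measures on the Measurement slide and the tokenized x/y example on the Similarity Join slide can be sketched as follows. This is a brute-force illustration that compares every pair of records, not the Trie-Join or Pass-Join algorithms; the helper names, the record ids 1 to 4, and the threshold of 2 are assumptions made for the example.

```python
from itertools import combinations


def tokenize(text):
    """Split on whitespace; number repeated words so that each occurrence
    is a distinct token (the second "as" differs from the first)."""
    seen = {}
    tokens = set()
    for word in text.lower().split():
        seen[word] = seen.get(word, 0) + 1
        tokens.add((word, seen[word]))
    return tokens


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute
        prev = curr
    return prev[-1]


def naive_similarity_join(records, max_distance):
    """Return id pairs whose strings are within max_distance edits.
    Compares all pairs, so the cost is quadratic in the number of records."""
    return [(id_a, id_b)
            for (id_a, a), (id_b, b) in combinations(records.items(), 2)
            if edit_distance(a, b) <= max_distance]


x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
similarity = jaccard(x, y)  # 4 shared tokens out of 6 distinct tokens

# Hypothetical record ids; the names come from the "dirty data" slides.
names = {1: "Kenneth De Jong", 2: "Kenneth Dejong",
         3: "Argyrios Zymnis", 4: "Argyris Zymnis"}
pairs = naive_similarity_join(names, max_distance=2)
```

Edit distance suits short strings such as the author names, where a typo changes one or two characters; Jaccard over token sets suits longer text such as the two near-duplicate news excerpts.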