Intelligent Information Retrieval Course, Lecture 2
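Intelligent Information Retrieval
Prof. Xiaoyong Du (杜小勇), Renmin University of China
Prof. Ji-Rong Wen (文继荣), Microsoft Research Asia

Overview of Key Techniques in IR
Prof. Xiaoyong Du

Core Techniques
- Unstructured data
- Media-related interpretation: text operator, image operator, video operator
- Metadata-level techniques: indexing, search, compression

English Text Operation
- Word tokenization
  - Handling of "."
  - Handling of "/apostrophe"
  - Handling of "-"
  - Open source
- Results reported by G. Grefenstette (1994), measured on 52,511 sentences from the Brown corpus:
  - Treating "." simply as a sentence delimiter gives an accuracy of 93.20%.
  - Using simple regular-expression rules gives an accuracy of 97.66%.
  - Using a word list can improve accuracy further.
  - Source: Proceedings of 3rd conf. on computational lexicography and text research, 1994

English Text Operation
- Stemming
  - Table lookup: list the stem of every word in advance. This wastes storage space.
  - The rule-based Porter algorithm
  - Open source
  - Other methods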
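
The gap between the two accuracy figures above comes from periods that do not end a sentence, such as those in abbreviations and decimal numbers. The following is a minimal sketch of that idea, not a reconstruction of Grefenstette's actual rules: the abbreviation list, the boundary pattern, and the function names `split_naive` and `split_with_rules` are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list (an assumption, not taken from the lecture).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "e.g", "i.e", "vs"}

# Candidate boundary: a period, then whitespace, then an uppercase letter.
# A period inside a number such as "3.5" never matches this pattern.
_BOUNDARY = re.compile(r"\.\s+(?=[A-Z])")


def split_naive(text):
    """Treat every '.' as a sentence delimiter."""
    return [s.strip() for s in text.split(".") if s.strip()]


def split_with_rules(text):
    """Split on candidate boundaries unless the period follows an abbreviation."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Prof. Du teaches IR. The fee is 3.5 dollars. Dr. Wen agrees."
    print(split_naive(sample))
    print(split_with_rules(sample))
```

On the sample text the naive splitter breaks after "Prof", "3" and "Dr", while the rule-based version keeps those periods inside their sentences. A real system would need a much larger abbreviation list, which is where the word list mentioned above comes in.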

Chinese Lexical Analysis
- Word segmentation
  - What is a "word" in Chinese?
  - Dictionary-based (word list) maximum matching
    - Forward Maximum Matching (FMM)
    - Reverse Maximum Matching (RMM)
    - Bi-Directional MM
    - If FMM = RMM, the segmentation can be considered correct; otherwise further disambiguation is needed.

Chinese Lexical Analysis
- Segmentation ambiguities
  - Types of ambiguity
    - Overlapping ambiguity: A+X+B = AX, XB. Example: 苏副教授 ("Associate Professor Su")
    - Combinational ambiguity: A+B = A, B, AB. Example: 马上 ("at once" or "on horseback")
  - Disambiguation based on statistical language models

Chinese Lexical Analysis
- Out-of-vocabulary (OOV) word recognition
  - OOV words are new words that do not appear in the word list.
  - Types of OOV words
    - Person names: 张朝阳, 哈里.波特 (Harry Potter)
    - Place names: 海淀区, 李家庄
    - Organization names: 中国人民大学 (Renmin University of China)
    - Proper nouns: 道-琼斯 (Dow Jones)
    - Technical terms: 非典 (SARS), 线性回归 (linear regression)
    - Numerals, time expressions, etc.: 1992年 (the year 1992)
  - Named entity recognition
  - Information Extraction

Image Operation
- OCR
- Color

Indexing
- Unstructured data
- Object representation

Indexing
- Inverted Files
- Suffix Trees
- Signatures

Inverted Files
- Characteristics
  - A word-oriented mechanism based on a sorted list of keywords, with each keyword having links to the documents containing that keyword.
- Preprocessing
  - Each document is assigned a list of keywords or attributes.
  - Each keyword (attribute) is associated with relevance weights.

1. The input text is parsed into a list of words along with their location in the text. (This is a time- and storage-consuming operation.)
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.

Inversion of Word List

Structure and Construction
- Structure (split the index into two files)
  - Vocabulary: O(n^β) according to Heaps' Law
  - Occurrences: depends on the addressing granularity (document or block?)
- Construction
  - Dictionary file: the vocabulary is stored in lexicographical order and points to the posting list.
  - Posting file: the lists of occurrences are stored co
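
The three-step inversion and the dictionary/posting split described above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the construction the lecture prescribes: the tokenizer pattern and the names `parse`, `build_inverted_index`, and `search` are hypothetical, positions are word offsets within each document, and step 3 (term weights, reorganization, compression) is left out.

```python
import re
from collections import defaultdict


def parse(doc_id, text):
    """Step 1: list the words of one document with their locations."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [(word.lower(), doc_id, pos) for pos, word in enumerate(words)]


def build_inverted_index(docs):
    """Build (vocabulary, postings) from a {doc_id: text} mapping."""
    entries = []
    for doc_id, text in docs.items():
        entries.extend(parse(doc_id, text))

    # Step 2: invert from location order to alphabetical term order.
    entries.sort()

    # Posting lists: term -> [(doc_id, position), ...]
    postings = defaultdict(list)
    for term, doc_id, pos in entries:
        postings[term].append((doc_id, pos))

    # Dictionary: the vocabulary in lexicographical order; each term is
    # the key that points to its posting list.
    vocabulary = sorted(postings)
    return vocabulary, dict(postings)


def search(postings, term):
    """Return the sorted ids of documents containing the term."""
    return sorted({doc_id for doc_id, _ in postings.get(term.lower(), [])})


if __name__ == "__main__":
    docs = {1: "Inverted files index text", 2: "Suffix trees index text too"}
    vocabulary, postings = build_inverted_index(docs)
    print(vocabulary)
    print(postings["index"])
    print(search(postings, "suffix"))
```

Storing full word positions corresponds to the finest addressing granularity; keeping only document or block ids in the posting lists would shrink the occurrences file at the cost of a scan inside the matched unit.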