文本信息检索相关处理技术.ppt
文本信息检索相关处理技术,武港山Tel:83594243Office:蒙民伟楼608BEmail:,2023/7/3,Wu Gangshan:Modern Information Retrieval,2,信息检索系统的体系结构,文本数据库,数据库管理,建索引,索引,提问处理,搜索,排序,排序后的文档,用户反馈,文档处理,用户界面,检出的文档,用户需求,文档,提问,逻辑视图,倒排文档,查询语言和查询处理,索引和检索,文本处理,具体应用系统(clir,QA,Web),查询语言及查询处理,武港山Tel:83594243Office:蒙民伟楼608BEmail:,2023/7/3,Wu Gangshan:Modern Information Retrieval,4,内容提要,查询语言基于关键词的查询基于模式匹配的查询结构查询查询协议查询处理用户相关反馈查询扩展,1、基于关键词的查询,2023/7/3,Wu Gangshan:Modern Information Retrieval,6,1.基于关键词的查询,单词查询最基本的查询方式.需要分词处理上下文查询短语查询查询一个词的序列临近词查询给定一个查询序列,并指定词或短语间的最大允许距离。布尔查询自然语言查询,2023/7/3,Wu Gangshan:Modern Information Retrieval,7,-短语查询,根据指定的短语查询文档(ordered list of contiguous words)“information theory”有时需要考虑停用词处理技术“buy camera”matches:“buy a camera”“buying the cameras”etc.,2023/7/3,Wu Gangshan:Modern Information Retrieval,8,基于倒排索引实现短语检索,条件:必须是记录关键词位置的倒排索引机制。处理顺序:查询包含短语中每个词的文档计算文档交集最后在结果文档中进行词序检测最好从生僻词开始进行词序检测,2023/7/3,Wu Gangshan:Modern Information Retrieval,9,短语查询的伪码,Find set of documents D in which all keywords(k1km)in phrase occur(using AND query processing).Intitialize empty set,R,of retrieved documents.For each document,d,in D:Get array,Pi,of positions of occurrences for each ki in d Find shortest array Ps of the Pis For each position p of keyword ks in Ps For each keyword ki except ks Use binary search to find a position(p s+i)in the array Pi If correct position for every keyword found,add d to RReturn R,2023/7/3,Wu Gangshan:Modern Information Retrieval,10,-临近词查询,给定一串词并规定检索文档中词间的最大距离。举例:“dogs”and“race”within 4 wordsmatch“dogs will begin the race”也可以结合词根处理和停用词处理。,2023/7/3,Wu Gangshan:Modern Information Retrieval,11,基于倒排索引实现临近词查询,检索的方法和短语检索相似:不同之处在于词间距离的约束比短语要松,但还有最大距离限制。在进行位置检测时,是要查找待检测文档中,关键词间的最近距离是否满足检索一定范围内的词汇是否存在。,2023/7/3,Wu Gangshan:Modern Information Retrieval,12,-基于关键词的布尔查询,查询请求用布尔表达式的形式表达:OR:(e1 OR e2)AND:(e1 AND e2)BUT:(e1 BUT e2)Satisfy e1 but not e2非逻辑用BUT表示,是一个双操作数的运算。可以很方便地用倒排技术来实现。问题:初学者不容易掌握布尔逻辑。,2023/7/3,Wu Gangshan:Modern Information Retrieval,13,用倒排索引实现基于关键词的布尔检索,关键词:基于倒排索引,检索包含这些关键词的文档。OR:将两个操作项的检索结果进行联合运算。AND:将两个操作项的检索结果进行交叉运算。BUT:求两个操作项的检索结果之间的差,前者减去后者。,2023/7/3,Wu Gangshan:Modern Information Retrieval,14,-基于关键词实现“自然语言”查询,是一种面向任意字符串的全文检索技术。通常会被当作一种基于“bag-of-words”的形式进行基于向量空间模式的检索。将自然表达的字符串,抽取其中的关键词(索引项)。应该有词序、词根、停用词等处理。用查询关键词组成的向量,基于向量空间模式进行检索。倒排的词频统计可以简单地看成是点积运算。,2023/7/3,Wu Gangshan:Modern Information Retrieval,15,1.基于关键词的查询方式总结,单词查询:基本的检索技术,是其他方式的基础。短语查询增加了严格的距离约束。临近词查询增加了比较宽泛的距离约束。布尔查询增加了严格的布尔逻辑约束。自然语言查询增加了关键词间语义关系的约束。但是,2、基于模式匹配的检索表达,2023/7/3,Wu Gangshan:Modern Information Retrieval,17,2.模式匹配,是一种字符串检索而不是简单的单词检索。无法基于倒排索引技术实现模式匹配检索,需要更为复杂的数据结构和计算算法。,2023/7/3,Wu Gangshan:Modern Information Retrieval,18,模式举例,前缀(Prefixes):匹配词或字符串的前面部分:“anti”matches“antiquity”,“antibody”,etc.后缀(Suffixes):匹配词或字符串的后面部分:“ix”matches“fix”,“matrix”,etc.子串(Substrings):匹配词或字符串的任意子串:“rapt”matches“enrapture”,“velociraptor”etc.范围(Ranges):给出两个字符串,匹配所有词典顺序在两者之间的词:“tin”to“tix”matches“tip”,“tire”,“title”,etc.,2023/7/3,Wu Gangshan:Modern Information Retrieval,19,基本处理,文档和查询中都有可能出现错误,这会给检索带来麻烦。判断词或任意字符串间的相似性方法:编辑距离(Levenstein distance)最长共同子串(Longest Common Subsequence,LCS)基于字符串相似性进行信息检索。,2023/7/3,Wu Gangshan:Modern Information Retrieval,20,编辑距离(Levenstein Distance),只需要作最少数量的字符删除,增加或者 替换 就可以完全匹配两个字符串,这个数量就是编辑距离。“misspell”to“mispell”is distance 1“misspell”to“mistell”is distance 2“misspell”to“misspelling”is distance 3比较算法的计算复杂度是 O(mn)其中m 和 n 是两个比较字符串的长度。,2023/7/3,Wu Gangshan:Modern Information Retrieval,21,最长共同子串(LCS),两个字符串最长的共同子串长度。所谓子串是指可通过删除多个字符得到的字符串。没有规定删除的一定是连续的。举例:“misspell”to“mispell”is 7“misspelled”to“misinterpretted”is 7“mispeed”,2023/7/3,Wu Gangshan:Modern Information Retrieval,22,正则表达式,它是一种可以用简单模式构造复杂模式的描述语言。一个字符是一个 regex.联合:If e1 and e2 are regexes,then(e1|e2)is a regex that matches whatever either e1 or e2 matches.串联:If e1 and e2 are regexes,then e1 e2 is a regex that matches a string that consists of a substring that matches e1 immediately followed by a substring that matches e2 循环:(Kleene closure):If e1 is a regex,then e1*is a regex that matches a sequence of zero or more strings that match e1,2023/7/3,Wu Gangshan:Modern Information Retrieval,23,正则表达式例,(u|e)nabl(e|ing)matchesunableUnablinggswuskdjflenableenabling(un|en)*able matchesableunableunenableenununenable,2023/7/3,Wu Gangshan:Modern Information Retrieval,24,Perl的增强型正则表达式,用了一些常用的字符集作为特殊的操作符。Special repetition operator(+)for 1 or more occurrences.Special optional operator(?)for 0 or 1 occurrences.Special repetition operator for specific range of number of occurrences:min,max.A1,5 One to five As.A5,Five or more AsA5 Exactly five As,2023/7/3,Wu Gangshan:Modern Information Retrieval,25,Perl Regexs,Character classes:w(word char)Any alpha-numeric(not:W)d(digit char)Any digit(not:D)s(space char)Any whitespace(not:S).(wildcard)AnythingAnchor points:b(boundary)Word boundary Beginning of string$End of string,2023/7/3,Wu Gangshan:Modern Information Retrieval,26,Perl Regex Examples,U.S.phone number with optional area code:/b(d3)s?)?d3-d4b/Email address:/bS+S+(.com|.edu|.gov|.org|.net)b/Note:Packages available to support Perl regexs in Java,2023/7/3,Wu Gangshan:Modern Information Retrieval,27,小结,不适合做大规模的文件检索处理。实时性比较差。但非常适合做模式提取字符串的模式提取,2023/7/3,Wu Gangshan:Modern Information Retrieval,28,补:通配符查询,对某些查询词记忆不是非常精确的情况下需要使用通配符来定义查询请求。Sydney or sidney?S*dney*表示可以不匹配或者匹配任意数量的字符串。通常的做法:先从词典中查找出全部匹配请求格式的词。基于这些词来进行倒排索引的查询。两种实现方法。,2023/7/3,Wu Gangshan:Modern Information Retrieval,29,方法1、General wildcard queries,Permuterm indexesFirst,introduce a special symbol$into our character set,to mark the end of a term;hello hello$.Next,we construct a permuterm index,in which the dictionary consists of all rotations of each term.Ll0$hehelloLo$helhello将所有这些索引词构成一个索引词典。B树查询。,2023/7/3,Wu Gangshan:Modern Information Retrieval,30,方法1、General wildcard queries,通配符检索请求改写方法:将查询请求单词的通配符循环移位到最后。M*nn$m*这样通配问题转换成了前缀匹配问题了。在前述的B树结构上进行前缀匹配处理。所有匹配的词都是符合通配符请求的单词。多个统配符的情况:忽略中间部分,处理单个通配符,然后再过滤。,2023/7/3,Wu Gangshan:Modern Information Retrieval,31,方法2、k-gram indexes,A k-gram is a sequence of k characters.cas,ast and stl are all 3-grams occurring in the term castle.use a special character$to denote the beginning or end of a term,so the full set of 3-grams generated for castle is:$ca,cas,ast,stl,tle,le$.A k-gram index is an index in which the dictionary consists of all k-grams k-GRAM INDEX that occur in any term in the lexicon.,2023/7/3,Wu Gangshan:Modern Information Retrieval,32,方法2、k-gram indexes,查询处理Consider the wildcard query re*ve.run the Boolean query$re AND ve$.This is looked up in the 3-gram index and yields a list of matching re*ve.such as relive,remove and retrieve.Red*$re and red,然后再过滤。,3、结构化查询,2023/7/3,Wu Gangshan:Modern Information Retrieval,34,3.结构化查询,文档都会有一定的结构信息,这些信息可以用来辅助检索。结构信息有:特定的域名,e.g.title,author,abstract,etc.层次化的树型结构(recursive):,chapter,title,section,title,section,title,subsection,chapter,book,2023/7/3,Wu Gangshan:Modern Information Retrieval,35,3.1 固定结构查询,有些文档具有非常稳定的结构描述,很象表的形式。(email archive.)可以通过查询某些域是否是特定词来检索:“nuclear fusion”appearing in a chapter titleSFQL:在关系数据库查询语言SQL基础上,进行扩充,以实现全文检索的需要。Select abstract from journal.papers where author contains“Teller”and title contains“nuclear fusion”and date 1/1/1950,2023/7/3,Wu Gangshan:Modern Information Retrieval,36,3.2 Hypertext,超文本是一种 directed graph,其中节点含有内容文字,超链用来链接节点。无结构。It is not possible to query the hypertext based on its structure.没有起点WebGlimpse:classical navigation+search by content in the neighborhood of current node.首先确定参照点,然后再查询其相邻等结构关系节点。,2023/7/3,Wu Gangshan:Modern Information Retrieval,37,3.3 层次结构Hierarchical Structure,是介于结构和无结构之间的文档结构形态。Hierarchical ModelsPAT ExpressionsOverlapped ListsLists of ReferencesProximal NodesTree Matching,3、查询协议,2023/7/3,Wu Gangshan:Modern Information Retrieval,39,4.查询协议,有些查询语言被推荐用来检索光盘、查询图书馆系统的等。它们不是为人类用户设计的,我们还是应该把它叫做通信协议,而不是查询语言。主要的查询协议:Z39.50:1995 成为 ANSI和NISO的基础.Query bibliographical information using a standard interface between client and hostDoes not special the way how to do.WAIS:Wide Area Information ServicePopular at the beginning of the 1990sA network publishing protocol,query database through the internet.,2023/7/3,Wu Gangshan:Modern Information Retrieval,40,4.查询协议,Google的Web Service接口。,2023/7/3,Wu Gangshan:Modern Information Retrieval,41,小结,查询语言实际上是检索系统中的非常重要的一环。某种程度上反映了检索系统的技术方案。现在常用的还是关键词检索。不得已还有用它。结构化检索应该是未来的一个方向。加入语法分析后,可以分析到段落内容/句子。目前的热点问题。QA。,查询处理,武港山Tel:83594243Office:蒙民伟楼608BEmail:,2023/7/3,Wu Gangshan:Modern Information Retrieval,43,主要内容,用户相关反馈基于字典的查询扩展 全局自动分析技术局部自动分析技术拼写纠正语音纠正,2023/7/3,Wu Gangshan:Modern Information Retrieval,44,1.相关反馈,检索出初步结果后,允许用户对检索结果文档进行反馈。利用相关反馈信息再调整检索请求。根据新的检索请求,得到新的检索结果。多次重复上述过程。相关反馈的意图是:弥补用户请求表达缺陷,2023/7/3,Wu Gangshan:Modern Information Retrieval,45,相关反馈的架构,Rankings,IRSystem,Documentcorpus,2023/7/3,Wu Gangshan:Modern Information Retrieval,46,查询更新,根据相关反馈更新查询的方式:查询扩展:从相关文档中扩展新的查询检索词。权重调整:增加相关文档中词的权重,降低不相关文档中词的权重,2023/7/3,Wu Gangshan:Modern Information Retrieval,47,查询更新,基于向量模型进行查询更新:Add the vectors for the relevant documents to the query vector.Subtract the vectors for the irrelevant docs from the query vector.这种方法不仅可以扩展正面和负面的新检索词,而且可以调整它们的初始权重。,2023/7/3,Wu Gangshan:Modern Information Retrieval,48,理想的查询表达式,Assume that the relevant set of documents Cr are known.Then the best query that ranks all and only the relevant queries at the top is:,Where N is the total number of documents.,2023/7/3,Wu Gangshan:Modern Information Retrieval,49,传统的Rochio方法,Since all relevant documents unknown,just use the known relevant(Dr)and irrelevant(Dn)sets of documents and include the initial query q.,:Tunable weight for initial query.:Tunable weight for relevant documents.:Tunable weight for irrelevant documents.,2023/7/3,Wu Gangshan:Modern Information Retrieval,50,Rochio方法的一种改进,Since more feedback should perhaps increase the degree of reformulation,do not normalize for amount of feedback:,:Tunable weight for initial query.:Tunable weight for relevant documents.:Tunable weight for irrelevant documents.,2023/7/3,Wu Gangshan:Modern Information Retrieval,51,Rochio方法的进一步改进,Bias towards rejecting just the highest ranked of the irrelevant documents:,:Tunable weight for initial query.:Tunable weight for relevant documents.:Tunable weight for irrelevant document.,2023/7/3,Wu Gangshan:Modern Information Retrieval,52,Comparison of Methods,Overall,experimental results indicate no clear preference for any one of the specific methods.All methods generally improve retrieval performance(recall&precision)with feedback.Generally just let tunable constants equal 1.,2023/7/3,Wu Gangshan:Modern Information Retrieval,53,相关反馈的性能评价,By construction,reformulated query will rank explicitly-marked relevant documents higher and explicitly-marked irrelevant documents lower.Method should not get credit for improvement on these documents,since it was told their relevance.In machine learning,this error is called“testing on the training data.”Evaluation should focus on generalizing to other un-rated documents.,2023/7/3,Wu Gangshan:Modern Information Retrieval,54,Fair Evaluation of Relevance Feedback,Remove from the corpus any documents for which feedback was provided.Measure recall/precision performance on the remaining residual collection.Compared to complete corpus,specific recall/precision numbers may decrease since relevant documents were removed.However,relative performance on the residual collection provides fair data on the effectiveness of relevance feedback.,2023/7/3,Wu Gangshan:Modern Information Retrieval,55,为什么相关反馈没有大规模使用?,Users sometimes reluctant to provide explicit feedback.Results in long queries that require more computation to retrieve,and search engines process lots of queries and allow little time for each one.,2023/7/3,Wu Gangshan:Modern Information Retrieval,56,伪反馈处理机制,Use relevance feedback methods without explicit user input.Just assume the top m retrieved documents are relevant,and use them to reformulate the query.Allows for query expansion that includes terms that are correlated with the query terms.,2023/7/3,Wu Gangshan:Modern Information Retrieval,57,伪反馈的处理架构,Rankings,IRSystem,Documentcorpus,2023/7/3,Wu Gangshan:Modern Information Retrieval,58,Pseudo Feedback Results,Found to improve performance on TREC competition ad-hoc retrieval task.Works even better if top documents must also satisfy additional boolean constraints in order to be used in feedback.,查询处理基于词典的查询扩展,2023/7/3,Wu Gangshan:Modern Information Retrieval,60,词典(Thesaurus),A thesaurus provides information on synonyms and semantically related words and phrases.Example:physician【内科医生】syn:|croaker,doc,doctor,MD,medical,mediciner,medico,|sawbones rel:medic,general practitioner,surgeon,2023/7/3,Wu Gangshan:Modern Information Retrieval,61,Thesaurus-based Query Expansion,For each term,t,in a query,expand the query with synonyms and related words of t from the thesaurus.May weight added terms less than original query terms.Generally increases recall.May significantly decrease precision,particularly with ambiguous terms.“interest rate”“interest rate fascinate evaluate”,2023/7/3,Wu Gangshan:Modern Information Retrieval,62,通用词典:WordNet,A more detailed database of semantic relationships between English words.Developed by famous cognitive psychologist George Miller and a team at Princeton University.About 144,000 English words.Nouns,adjectives,verbs,and adverbs grouped into about 109,000 synonym sets called synsets.,2023/7/3,Wu Gangshan:Modern Information Retrieval,63,WordNet Synset Relationships,Antonym:front backAttribute:benevolence good(noun to adjective)Pertainym:alphabetical alphabet(adjective to noun)Similar:unquestioning absoluteCause:kill dieEntailment:breathe inhaleHolonym:chapter text(part-of)Meronym:computer cpu(whole-of)Hyponym:tree plant(specialization)Hypernym:fruit apple(generalization),2023/7/3,Wu Gangshan:Modern Information Retrieval,64,WordNet Query Expansion,Add synonyms in the same synset.Add hyponyms to add specialized terms.Add hypernyms to generalize a query.Add other related terms to expand query.,2023/7/3,Wu Gangshan:Modern Information Retrieval,65,非通用词典:Statistical Thesaurus,Existing human-developed thesauri are not easily available in all languages.Human thesuari are limited in the type and range of synonymy and semantic relations they represent.Semantically related terms can be discovered from statistical analysis of corpora.,2023/7/3,Wu Gangshan:Modern Information Retrieval,66,1、Automatic Global Analysis,Determine term similarity through a pre-computed statistical analysis of the complete corpus.Compute association matrices which quantify term correlations in terms of how frequently they co-occur.Expand queries with statistically most similar terms.,2023/7/3,Wu Gangshan:Modern Information Retrieval,67,方法1:Association Matrix,cij:Correlation factor between term i and term j,fik:Frequency of term i in document k,2023/7/3,Wu Gangshan:Modern Information Retrieval,68,Normalized Association Matrix,Frequency based correlation factor favors more frequent terms.Normalize association scores:Normalized score is 1 if two terms have the same frequency in all documents.,2023/