Intelligent Information Retrieval (智能信息检索): lecture slides (excerpt from a 97-slide deck)
Intelligent Information Retrieval
Prof. Du Xiaoyong, Renmin University of China
Prof. Wen Jirong, Microsoft Research Asia

Course Introduction

Course history:
- On September 27, 2007, we jointly held an "Internet Data Management" academic seminar with Microsoft Research Asia, at which Dr. Wen Jirong was appointed as an adjunct researcher.
- In the spring semester of 2008, the course Intelligent Information Retrieval was offered for the first time in cooperation with Microsoft; nine researchers from MSRA gave eleven guest lectures.
- Reference: "Microsoft's Forward-Looking Courses as Seen through Intelligent Information Retrieval" (从智能信息检索看微软前瞻性课程), Computer Education (计算机教育).

Teaching style:
- IR fundamentals plus topical lectures; the topical lectures, given by Microsoft researchers, are very information-dense.

Assessment:
(1) Pick one topic and write a survey-style report covering: What problem is being studied? What is the theoretical foundation of the area? Where are the technical difficulties? What approaches and methods are currently used to attack the problem? What experimental methodologies and evaluation frameworks are used in this research? Conclude with your own views.
(2) Hand the printed report to the teaching assistant in the last class; it will then be distributed to the relevant instructors for review.
(3) Regular assessment, based mainly on participation in discussion.

Course Content:
- Fundamentals: basic models, basic implementation techniques, evaluation methods
- Core techniques and systems: Ranking, Information Extraction, Log Mining, System Implementation
- Applications: Image search, Multi-language search, Knowledge Management

Reading Materials:
- R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999
- TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005
- K. S. Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997
- Proceedings of SIGIR, SIGMOD, WWW

Course Schedule: http:/
Contact: Du Xiaoyong (Information Building, Room 0459); Wen Jirong

Introduction to IR
Prof. Xiaoyong Du

What is Information Retrieval?
- Definition from comparison.
- Definition by examples: library systems such as CALIS; search engines such as Google and Baidu.
- Definition by content: IR = <D, Q, R>, where D is the document collection, Q is the set of user queries, and R is the relevance/similarity degree between a query qi and a document dj.

IR System Architecture (diagram):
- Classical IR: unstructured data is fed to an Indexer, which builds the Index; a Ranker serves results through the User Interface.
- Web IR adds: a Crawler over the Web, query logs and user feedback, an Extractor, and a Data Miner.

Lecture Content:
- IR models
- System architecture
- Evaluation and benchmarks
- Key techniques: media-related operators, indexing, information extraction (IE), classification and clustering, link analysis, relevance evaluation

Related Areas:
- Natural Language Processing
- Large-scale distributed computing
- Databases
- Data mining
- Information science
- Artificial intelligence

IR Models
- Representation: how to represent a document or query: bag of words, sequence of words, links between documents, semantic network.
- Similarity/relevance evaluation: sim(dj, q) = ?

Two broad families of models:
- Retrieval models based on textual content: the Boolean model, the vector space model, the probabilistic model, and statistical language models.
- Content-independent retrieval models: collaboration-based models, link-analysis models, association-based models.

Classical IR Models: Basic Concepts
- Bag-of-words model: each document is represented by a set of representative keywords, or index terms.
- The importance of an index term is captured by a weight associated with it.
- Notation: ki is an index term; dj is a document; t is the total number of index terms; K = {k1, k2, ..., kt} is the set of all index terms.
- wij ≥ 0 is the weight associated with the pair (ki, dj); it quantifies the importance of the index term for describing the document's contents. wij = 0 indicates that the term does not occur in the document.
- vec(dj) = (w1j, w2j, ..., wtj) is the weighted vector
associated with the document dj.
- gi(vec(dj)) = wij is a function that returns the weight of term ki in document dj.

Classical IR Models: Basic Concepts
- A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query.
- A ranking is based on fundamental premises regarding the notion of relevance, such as: common sets of index terms; sharing of weighted terms; likelihood of relevance.
- Each set of premises leads to a distinct IR model.

Outline:
- Boolean Model (BM)
- Vector Space Model (VSM)
- Probabilistic Model (PM)
- Language Model (LM)

The Boolean Model
- A simple model based on set theory; queries are specified as Boolean expressions, giving precise semantics and a neat formalism.
- Example: q = ka ∧ (kb ∨ kc).
- Terms are either present or absent, so wij ∈ {0, 1}.
- In disjunctive normal form: vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), where each component, e.g. vec(qcc) = (1,1,0), is a conjunctive component.
- sim(q, dj) = 1 if ∃ vec(qcc) such that vec(qcc) ∈ vec(qdnf) and ∀ki, gi(vec(dj)) = gi(vec(qcc)); otherwise sim(q, dj) = 0.

Drawbacks of the Boolean Model
- Exact matching only; no ranking.
- Awkward: the information need has to be translated into a Boolean expression.
- Too simple: the Boolean queries actually formulated by users are most often too simplistic.
- Unsatisfying results: the model frequently returns either too few or too many documents in response to a query.

Outline:
- Boolean Model (BM)
- Vector Space Model (VSM)
- Probabilistic Model (PM)
- Language Model (LM)

The Vector Model
- Non-binary weights allow partial matches; these term weights are used to compute a degree of similarity between a query and each document.
- Returning a ranked set of documents provides better matching.
- Definitions: wij > 0 whenever ki ∈ dj; wiq ≥ 0 is the weight associated with the pair (ki, q); vec(dj) = (w1j, w2j, ..., wtj) and vec(q) = (w1q, w2q, ..., wtq).
- Index terms are assumed to occur independently within documents; that is, the vector space is orthonormal: the t terms form an orthonormal basis for a t-dimensional space, in which queries and documents are represented as weighted vectors.
- sim(q, dj) = cos(θ) = vec(dj) · vec(q) / (|dj| * |q|) = Σi wij * wiq / (|dj| * |q|).
- Since wij ≥ 0 and wiq ≥ 0, we have 0 ≤ sim(q, dj) ≤ 1.
- A document is
retrieved even if it matches the query terms only partially.

The Vector Model: Term Weighting
- sim(q, dj) = Σi wij * wiq / (|dj| * |q|); the KEY question is how to compute the weights wij and wiq.
- A good weight must account for two effects: quantification of intra-document content (similarity), the tf factor, i.e. the term frequency within a document; and quantification of inter-document separation (dissimilarity), the idf factor, i.e. the inverse document frequency.
- TF*IDF formula: wij = tf(i, j) * idf(i).
- Let N be the total number of documents in the collection, ni the number of documents that contain ki, and freq(i, j) the raw frequency of ki within dj.
- A normalized tf factor is given by tf(i, j) = freq(i, j) / max_l freq(l, j), where the maximum is taken over all terms kl ∈ dj.
- The idf factor is computed as idf(i) = log(N / ni); the log makes the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
- The tf-idf weighting scheme, wij = tf(i, j) * log(N / ni), is among the best term-weighting schemes.
- For query term weights, wiq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / ni), or the weights may be specified by the user.
- The vector model with tf-idf weights is a good ranking strategy for general collections; it is usually as good as the known ranking alternatives, and it is simple and fast to compute.

The Vector Model: Summary
- Advantages: term weighting improves the quality of the answer set; partial matching allows retrieval of documents that approximate the query conditions; the cosine ranking formula sorts documents by degree of similarity to the query.
- Disadvantage: assumes independence of index terms.

Outline:
- Boolean Model (BM)
- Vector Space Model (VSM)
- Probabilistic Model (PM)
- Language Model (LM)

Probabilistic Model
- Objective: capture the IR problem in a probabilistic framework.
- Given a user query, there is an ideal answer set; querying is a specification of the properties of this ideal answer set (clustering).
- But what are these properties? Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set), then improve by iteration.
- Basic ideas: an initial set of documents is retrieved somehow; the user inspects these documents looking for the relevant ones (in practice, only the top 10-20 need to be inspected); the IR system uses this information to refine the description of the ideal answer set; by repeating this process, the description of the ideal answer set is expected to improve.
- The description of the ideal answer set is modeled in probabilistic terms.
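Before moving on to probabilistic ranking, the vector-model formulas above (max-normalized tf, idf = log(N/ni), the 0.5-smoothed query weights, and cosine similarity) can be put together in a minimal Python sketch. The toy collection and query are illustrative, not from the slides:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = tf(i, j) * log(N / n_i), with tf normalized by the most
    frequent term in the document, as defined on the slides."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))  # n_i for each term
    vectors = []
    for d in docs:
        freq = Counter(d)
        max_freq = max(freq.values())
        vectors.append({t: (f / max_freq) * math.log(N / df[t])
                        for t, f in freq.items()})
    return vectors

def query_weights(query, docs):
    """w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i)."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    freq = Counter(query)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * math.log(N / df[t])
            for t, f in freq.items() if df[t]}  # skip terms absent from the collection

def cosine(v, w):
    """sim(q, dj) = sum_i w_ij * w_iq / (|dj| * |q|)."""
    dot = sum(v[t] * w[t] for t in v.keys() & w.keys())
    norm = (math.sqrt(sum(x * x for x in v.values()))
            * math.sqrt(sum(x * x for x in w.values())))
    return dot / norm if norm else 0.0

# Toy collection (illustrative only)
docs = [["gold", "silver", "truck"],
        ["shipment", "of", "gold", "damaged"],
        ["delivery", "of", "silver", "arrived"]]
vecs = tfidf_weights(docs)
wq = query_weights(["gold", "silver"], docs)
ranking = sorted(range(len(docs)),
                 key=lambda j: cosine(vecs[j], wq), reverse=True)
```

Note that the base of the logarithm does not affect the ranking: changing it rescales every document and query weight by the same constant, and cosine similarity is invariant under uniform scaling of either vector.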
Probabilistic Ranking Principle
- The probabilistic model tries to estimate the probability that the user will find document dj interesting (i.e., relevant).
- The model assumes that this probability of relevance depends only on the query and the document representations.
- Let R be the ideal answer set. But how do we compute the probabilities? What is the sample space?

The Ranking
- The probabilistic ranking is computed as: sim(q, dj) = P(dj relevant to q) / P(dj non-relevant to q).
- Definition: wij ∈ {0, 1}.
- P(R | vec(dj)): the probability that the given document is relevant.
- P(¬R | vec(dj)): the probability that the document is not relevant.
- sim(dj, q) = P(R | vec(dj)) / P(¬R | vec(dj))
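The excerpt breaks off at this ratio. As a hedged sketch of how the iteration described above is typically bootstrapped before any user feedback exists: under a term-independence assumption, documents can be scored by a sum of per-term log-odds, with the assumed initial guesses P(ki|R) = 0.5 and P(ki|¬R) ≈ ni/N (plus standard +0.5/+1 smoothing). These estimates are not stated in the excerpt; the corpus is illustrative:

```python
import math
from collections import Counter

def bim_initial_scores(docs, query):
    """Score each document by the sum, over query terms it contains, of
    log[ P(k|R)(1 - P(k|~R)) / (P(k|~R)(1 - P(k|R))) ], using the assumed
    initial guesses P(k|R) = 0.5 and P(k|~R) = (n_i + 0.5) / (N + 1)."""
    N = len(docs)
    n = Counter(t for d in docs for t in set(d))  # n_i: docs containing k_i
    scores = []
    for d in docs:
        s = 0.0
        for t in set(query) & set(d):
            p_r = 0.5                      # initial guess for P(k_i | R)
            p_nr = (n[t] + 0.5) / (N + 1)  # smoothed n_i / N
            s += math.log(p_r * (1 - p_nr) / (p_nr * (1 - p_r)))
        scores.append(s)
    return scores

# Toy collection (illustrative only)
docs = [["gold", "silver", "truck"],
        ["shipment", "of", "cargo", "damaged"],
        ["delivery", "of", "parcels", "arrived"],
        ["truck", "of", "cargo"]]
scores = bim_initial_scores(docs, ["gold", "silver"])
```

With P(ki|R) fixed at 0.5, each term contributes log((1 − p_nr)/p_nr), an idf-like quantity; once the user has inspected the top results, P(ki|R) would be re-estimated from the documents judged relevant, which is the refinement loop the slides describe.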