Large-Scale Distributed Information Retrieval on the Web
Weiyi Meng 孟卫一
Department of Computer Science
State University of New York at Binghamton
July 9, 2012
萨师煊国际大数据分析与研究中心 (Sa Shixuan International Center for Big Data Analysis and Research)
Summer Research Camp Seminar

About SUNY Binghamton
- Founded in 1946, after WWII.
- Located in Binghamton, a city in the Southern Tier of New York State.
- About 15,000 students (3,000 graduate students).
- IBM was founded in the Binghamton area (Endicott).
- One of the 4 University Centers of the SUNY system; the other three are SUNY at Stony Brook, SUNY at Buffalo, and SUNY at Albany.
- For more information, see http://www2.binghamton.edu/features/premier/index.html

What is Information Retrieval?
- Information retrieval (IR) is a computer science discipline for finding unstructured data (usually text documents) that satisfy an information need from within large collections stored on computers.
- In this seminar, we extend this definition to include both unstructured and structured data.

What is Distributed Information Retrieval (DIR)?
- It is a special branch of information retrieval in which the data of the IR system are stored in multiple distributed locations/collections.
- In the Web environment, DIR deals with data that are distributed across many websites or Web servers.
- Related terms for
  DIR: metasearch engine, federated search, Web DB integration system.

The Scale: How Large?
- It can be as large as the number of data sources on the Web.
- A 2007 survey (Madhavan et al. 2007) indicates there were about 50 million searchable Web data sources in 2007:
  - 25 million for unstructured or less structured data (web pages, weibo, ...)
  - 25 million for structured data (web databases)

Where do Web data reside?
Iceberg structure:
- A small fraction is on the Surface Web, with mostly static web pages that are crawlable by following hyperlinks.
  - Publicly indexable portion: 40-60 billion pages
- Most are in the Deep Web, with both structured data and less structured text documents hidden behind numerous search interfaces.
  - About 1 trillion pages/records

Two paradigms to provide integrated access to Web data
- Crawling-based: Gather Web data from various Web servers and/or search engines and build a search index for the gathered data.
  - Surface Web crawling
  - Deep Web crawling
- Metasearching-based (DIR-based): Integrate existing search engines into federated systems.
  - Metasearching text documents
  - Metasearching structured data by domain

Advantages of each approach
- Crawling-based:
  - Complete control over crawled data:
    - Can add metadata
    - Can link data from different sources in advance
    - Can create an archive gradually
  - Complete control over retrieval techniques and ranking functions
  - Fast response time
- Metasearching-based:
  - Capabilities of search engines can be leveraged
  - Natural clustering of the data by individual search engines can be utilized
  - Three-level query evaluation process (SE selection, SE retrieval, result merging) can lead to better effectiveness
  - More likely to obtain fresher results

Disadvantages of each approach
- Crawling-based:
  - Deep Web crawling is difficult
  - Often incomplete
  - Many sites are not crawlable
  - Loses semantics/structure of the data
  - Cannot leverage search engines' capabilities
  - Crawling delay leads to less up-to-date results
  - Copyright and privacy issues
- Metasearching-based:
  - Performance depends on the quality of the search engines used
  - May cause search engines to crash
  - Access could be blocked by search engines
  - No direct control of the data
  - Slower response time

Conclusions?
- Both technologies (crawling-based and metasearching-based) have unique value, and they should co-exist.
- They actually complement each other!
- Question: Is there an effective way to combine both technologies into a single platform?
- Our seminar will focus on the metasearching (DIR)-based approach.

Two types of metasearching systems
- Because structured and unstructured data have very different characteristics, they are often handled separately with different technologies:
  - Metasearching systems for text documents (metasearch engines or DIR systems).
  - Metasearching systems for structured data, each for a given domain (Web database integration systems).
- We will first introduce large-scale metasearch engines and then large-scale Web database integration systems.
- Due to limited time, we will focus on challenges and remaining challenges, not on current solutions.

Large-Scale Metasearch Engines (MSE)

[Figure: A simple MSE architecture. Labels: user, user interface, query dispatcher, result merger, search engines 1 to n, text sources 1 to n, query, result.]

What is a large-scale MSE?
A large-scale metasearch engine needs to satisfy ALL of the following requirements:
1. It is a metasearch engine.
2. It is connected to a large number (thousands or more) of component search engines.
3. The component search engines are special-purpose search engines:
   - Covering a specific domain: news, sports, medicine, ...
   - Covering a specific organization: RenDa, IBM, ACM, ...
Why the third requirement? To retain the advantages in freshness and in searching the Deep Web.

Technical challenges with large-scale MSE
Scalable and accurate search engine selection
- Most search engines are useless for a given user query.
  - Best 10 results, 10,000 search engines: at least 9,990 are useless.
- Using useless search engines is bad:
  - Unnecessary network traffic
  - Wastes resources of local search engines
  - Incurs higher cost at the metasearch engine
  - Leads to poor effectiveness
- How to identify the most appropriate search engines for any given query accurately and in a timely manner?
  - How to summarize a search engine's content (its representative)?
  - How to collect the representatives?
  - How to use the representatives to perform selection?

Technical challenges (cont.)
Automatic search engine inclusion into the metasearch engine
- Automatic connection to search engines (automatic connection wrapper generation)
  - Submit queries and receive result pages via program
- Automatic
  search result record (SRR) extraction (automatic extraction wrapper generation)
- Automatic wrapper maintenance
  - Search engines may change their connection parameters and result presentation at any time

Technical challenges (cont.)
Effective and efficient result merging
- Autonomous component search engines likely employ different techniques for matching queries and documents (indexing techniques, weighting schemes, similarity functions, link-based ranking, etc.).
- Local scores and ranks are generally not comparable.
- How to re-rank the results returned from different search engines into a single ranked list such that high effectiveness can be achieved in a speedy manner?

Large-scale MSE architecture
[Figure: Large-scale MSE architecture. Labels: Metasearch Engine Construction Module; Query Processing Module; World Wide Web; Search Engine Discovery; SE List; SE Incorporation; Automatic connection and result extraction; Search Engine Representatives Generation; Search Engine Representatives; Search Engine Selector; Query Dispatcher; Search Engine 1 to Search Engine m; Result Collector and Extractor; Result Merger; User query; Result.]

Two Recent Books (Monographs)
W. Meng and C. Yu. Advanced Metasearch Engine Technology. Morgan & Claypool Publishers, December 2010. http:/
Table
of contents:
- Introduction
- Metasearch engine architecture
- Search engine selection
- Search engine incorporation
- Result merging
- Summary and Future Research

Two Recent Books (Monographs)
M. Shokouhi and L. Si. Federated Search. Foundations and Trends in Information Retrieval, 5(1), pp. 1-102, 2011.
Table of contents:
- Introduction
- Collection representation
- Collection selection
- Result merging
- Federated search testbeds
- Conclusion and Future Research Challenges

Search Engine Selection (1)
- Problem: Given any user query and a set of search engines (or document collections), determine the search engines that best match the user query.
- Basic solution:
  - Summarize the content of each search engine in advance.
  - For each user query, compare it with the search engine summaries and compute a matching score for each search engine.
  - Rank the search engines in descending order of their matching scores with the query and select the top-ranked search engines.

Search Engine Selection (2)
- Question 1: How to summarize the content of each search engine?
- Advanced solutions are statistics-based: one or more statistics for each term in the documents of a search engine.
- Some statistics used for a term t:
  - document frequency (df): the number of documents in the search engine that contain t.
  - collection frequency (cf): the number of search engines in a metasearch engine that contain t.
  - average normalized weight (anw): the average of the weights of t in all documents containing t in a search engine.
  - maximum normalized weight (mnw): the maximum of the weights of t in all documents in a search engine.

Search Engine Selection (3)
- Question 2: How to obtain the summaries of search engines?
- Two general scenarios:
  - Straightforward computation if the documents of the search engine are available.
  - Query-based sampling if the documents of the search engine are not directly available (i.e., a deep Web search engine).
- Many published
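
The basic selection procedure from the Search Engine Selection slides (summarize each engine in advance, score each summary against the query, rank engines by score) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the three toy engines, the whitespace tokenizer, and the scoring rule (sum of df divided by collection size over the query terms) are not taken from the seminar, and real selection algorithms use richer statistics.

```python
from collections import Counter


def build_summary(documents):
    """Return {term: document frequency (df)} for one engine's documents."""
    df = Counter()
    for doc in documents:
        # Count each term at most once per document.
        df.update(set(doc.lower().split()))
    return df


def rank_engines(query, summaries, sizes):
    """Rank engines by a toy matching score (an illustrative assumption):
    the sum over query terms of df(term) / number of documents."""
    terms = query.lower().split()
    scores = {
        name: sum(df[t] / sizes[name] for t in terms)
        for name, df in summaries.items()
    }
    # Highest score first; ties are broken by engine name.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical component search engines and their documents.
engines = {
    "news": ["election results today", "storm hits coast", "election debate recap"],
    "sports": ["football final score", "tennis open results"],
    "medicine": ["flu vaccine trial results", "heart surgery recovery"],
}

# Summaries are computed in advance, not at query time.
summaries = {name: build_summary(docs) for name, docs in engines.items()}
sizes = {name: len(docs) for name, docs in engines.items()}

ranking = rank_engines("election results", summaries, sizes)
print(ranking)
```

Only the summaries are consulted at query time, which is what makes selection cheap compared with querying every engine; the metasearch engine would then dispatch the query to the top-ranked engines only.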

  solutions, but they are still not scalable to large-scale metasearch engines.

Search Engine Selection (4)
- Question 3: How to rank search engines for each user query?
- Sub-questions:
  - How to define a measure of the usefulness of a search engine with respect to a query?
  - How to compute the measure very quickly (highly efficiently) in a large-scale metasearch engine?
- A large number of search engine selection algorithms have been proposed, but most are not very scalable.

Automatic Search Engine Incorporation
- Automatic connection to any search engine given its URL
  - Pass queries to the search engine programmatically.
  - Receive results from the search engine programmatically.
- Automatic extraction of retrieved search results
  - Extract the URLs and snippets of retrieved pages.
  - Extract the number of hits.
  - Extract the URL pattern of the next-page button.
- Automatic connection and extraction maintenance
  - Automatic failure detection

Automatic Search Engine Connection
- Extract connection parameters from the HTML form tag of each search engine.
- Apply the HTTP request method (GET or POST) to perform the connection.

Search form extraction: Difficulties
- Complex search forms with many control elements
- Ill-formatted HTML search forms
- Multiple search forms on the same page
- Search forms with JavaScript and/or CSS (Cascading Style Sheets)
- Search forms that have action redirections
- Search forms that utilize sessions/cookies
- Search engines that do not allow metasearching

Automatic Search Result Records (SRRs) Extraction (1)
- A search result record (SRR) consists of the returned information associated with a retrieved Web page:
  - URL of the page
  - Title of the page
  - A short summary of the page
  - Other misc.: size, date, category, ...
- Result pages often contain irrelevant information, such as advertisements and information about the hosting organization, in addition to SRRs.

Automatic SRR Extraction (2)
[Figure: WebScales: Wrapper Generation, with two regions each labeled "an SRR"]
- Extract correct SRRs from returned response pages while discarding irrelevant information.
- The problem is to identify the rules (often called a wrapper) that can extract the correct SRRs.

Automatic SRR Extraction (3)
- General methodology:
  - Utilize the tag strings/DOM trees/visual information on one or more result pages from the same search engine to mine extraction patterns.
  - Identify the minimal data-rich region/subtree that likely contains the SRRs.
  - Identify separator(s) that separate different SRRs.
- More recent solutions use more visual information on result pages.
- Still cannot handle complex result pages well (JavaScript, multiple columns, multiple sections, multiple SRR formats).

Result Merging (1)
- Problem: Merge returned documents from multiple sources into a single ranked list.
- Difficulties:
  - Full documents of search results are not available or are too expensive to download and analyze on the fly.
  - Local similarities (thus local ranks) are usually not comparable due to:
    - different similarity functions
    - different term weighting schemes
    - different statistical values, e.g., global idf vs. local idf

Result Merging (2)
A large number of solutions have been proposed to perform result merging.
- Some use local similarities associated with each result (modern search engines no longer provide this information).
- Some use local ranks of search results.
- Some analyze downloaded full documents.
- Some use the titles and snippets of the search results.
- Some consider the quality of the search engine used.
- Some consider whether a result is retrieved from multiple search engines.
- Some use a sample set of documents from each search engine.

Result Merging (3)
Information that could be utilized for result merging:
- Local similarity or local rank of each result
- Title of each result
- Snippet of each result
- Publication time of each result
- Organization/person who published the result (from URL)
- Size of each result
- Number of search engines that returned the result
- Ranking scores of the search engines that returned the result
- Full content of each result (or some of the results)
- PageRank or number of backlinks of each result
- A sample set of documents from each search engine

Remaining Research Challenges (1)
Search engine summary generation and maintenance
- Query-based sampling methods have not been shown to be practically viable for a large number of truly autonomous search engines.
- Certain statistics used by some search engine selection algorithms, such as the maximum normalized weight, are still too expensive to collect, as doing so may require submitting a substantial number of queries to cover a significant portion of the vocabulary of a search engine.
- The important issue of how to effectively maintain the
  quality of summaries for search engines whose contents may change over time has started to get attention only recently, and more investigation into this issue is needed.

Remaining Research Challenges (2)
Automatic search engine connection with complex search forms
- More and more search engines are employing more advanced tools to program their search forms. For example, more and more search forms now use JavaScript.
- Some search engines also include cookies and session IDs in their connection mechanism.
- These complexities make it significantly more difficult to automatically extract all needed connection information.

Remaining Research Challenges (3)
Automatic maintenance
- Search engines used by metasearch engines may make various changes due to upgrades or other reasons.
- Possible changes include search form changes, query format changes, and result display format changes.
- These changes can cause the
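
For the simple case, the connection information carried by a search form (its action URL, its HTTP method, and its named input fields) can be read with a standard HTML parser. The sketch below is a minimal illustration under stated assumptions: the sample page, the `example.com` URL, the `FormExtractor` and `build_request` names, and the choice of `q` as the query field are all hypothetical. It does not handle JavaScript-driven forms, cookies, sessions, or redirections, which are exactly the cases the seminar names as hard.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collect the action, method, and named <input> fields of each <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attributes.get("action") or "",
                # HTML forms default to GET when no method is given.
                "method": (attributes.get("method") or "get").upper(),
                "fields": {},
            }
        elif tag == "input" and self._current is not None:
            name = attributes.get("name")
            if name:  # unnamed controls are not submitted
                self._current["fields"][name] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def build_request(form, page_url, query_field, query):
    """Return (method, url, body) for submitting `query` through `form`.
    Which field carries the query is supplied by the caller (an assumption)."""
    fields = dict(form["fields"])
    fields[query_field] = query
    target = urljoin(page_url, form["action"])
    encoded = urlencode(fields)
    if form["method"] == "POST":
        return "POST", target, encoded
    return "GET", target + "?" + encoded, None


# Hypothetical search page of a component search engine.
SAMPLE_PAGE = """
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
"""

parser = FormExtractor()
parser.feed(SAMPLE_PAGE)
method, url, body = build_request(
    parser.forms[0], "http://example.com/index.html", "q", "metasearch engine"
)
print(method, url)
```

The sketch only constructs the request and sends nothing over the network. Deciding automatically which field is the query box, and which of several forms on a page is the search form, is part of what a connection wrapper generator has to work out.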

  search engines to become unusable in the metasearch engine unless the necessary adjustments are made automatically.
- Automatic metasearch engine maintenance is critical for the smooth operation of a large-scale metasearch engine, but this important problem remains largely unsolved.
- There are mainly two issues:
  - Detect and differentiate various changes automatically.
  - Fix the problem for each type of change automatically.

Remaining Research Challenges (4)
More advanced result merging algorithm
- No existing solution has explored all of the following factors:
  - Local similarity or local rank of each result
  - Title of each result
  - Snippet of each result
  - Publication time of each result
  - Organization/person who published the result (from URL)
  - Size of each result
  - Number of search engines that returned the result
  - Ranking scores of the search engines that returned the result
  - Full content of each result (or some of the results)
  - PageRa
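
As a small illustration of how a few of these factors can be combined, the sketch below merges ranked lists using three of them: the local rank of each result, the ranking score of the engine that returned it, and (implicitly, through summation) the number of engines that returned it. The scoring rule, a weighted reciprocal-rank sum, is a simple stand-in chosen for illustration and is not the seminar's method; the engine names, URLs, and engine scores are hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists, engine_scores):
    """Merge per-engine ranked URL lists into a single ranked list.

    Each occurrence of a URL contributes engine_score / local_rank, so a
    result gains from a high local rank, from a highly ranked engine, and
    from being returned by several engines.
    """
    scores = defaultdict(float)
    for engine, urls in result_lists.items():
        weight = engine_scores.get(engine, 1.0)
        for rank, url in enumerate(urls, start=1):
            scores[url] += weight / rank
    # Highest merged score first; ties are broken by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical ranked results from three component search engines.
result_lists = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
    "engine_c": ["u3", "u2"],
}
# Hypothetical engine scores, e.g. produced by search engine selection.
engine_scores = {"engine_a": 1.0, "engine_b": 0.8, "engine_c": 0.5}

merged = merge_results(result_lists, engine_scores)
print(merged)
```

Because it uses only ranks and engine scores, this kind of merging needs no downloaded documents and no local similarity values, which modern search engines no longer expose; the trade-off is that it ignores titles, snippets, and the other content-based factors listed.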