An Introduction of Big Data (PPT, 34 slides)


An Introduction of Big Data
WEB GROUP
2011.9.24

Outline
- What is Big Data
- The Framework of Big Data
- The Applications of Big Data
- The Challenges of Big Data
- Research Works Related to Big Data
- Conclusions

What is Big Data

Information Explosion
- Data grows 57% every year (IDC) and doubles every 1.5 years
- 988 EB (1 EB = 1024 PB) of data will be produced in 2010 (IDC): 18 million times all the information in books
- IT: 850 million photos and 8 million videos per day (Facebook); 50 PB of web pages and 500 PB of logs (Baidu)
- Telco: logs, multimedia data
- Enterprise storage
- Public utilities
  - Health care: medical images, photos
  - Public traffic: surveillance videos

Definition
- Big data is the confluence of three trends: Big Transaction Data, Big Interaction Data and Big Data Processing
  - Transaction data: structured and semi-structured
  - Interaction data: unstructured
- Question: Big Data = Large-Scale Data (Massive Data)?

The Properties of Big Data
- Huge
- Distributed: dispersed over many servers
- Dynamic: items are added, deleted and modified continuously
- Heterogeneous: many agents access and update the data
- Noisy: inherent, unintentional or malicious
- Unstructured/semi-structured: no database schema
- Complex interrelationships

The Framework of Big Data

[Figure: the framework of Big Data]

The Applications of Big Data
- Celestial bodies: exobiology
- Inheritance: sequence of cancer
- Advertisement: finding communities
- SNA: finding communities
- Data mining: consuming habits
- Changing router

The Challenges of Big Data

Efficiency Requirements for Algorithms
- Traditionally, "efficient" algorithms
  - run in (small) polynomial time: O(n log n)
  - use linear space: O(n)
- For large data sets, efficient algorithms
  - must run in linear or even sub-linear time: o(n)
  - must use up to poly-logarithmic space: (log n)^2

Mining Big Data
- Association rules and frequent patterns
  - Two parameters: support, confidence
- Clustering
  - Distance measure (L1, L2, L∞, edit distance, etc.)
- Graph structure
  - Social networks, degree distribution (heavy tail)

Clean Big Data
- Noise in data distorts
  - computation results
  - search results
- Automatic methods are needed for "cleaning" the data
  - Duplicate elimination
  - Quality evaluation

Computing Model
- Accuracy and approximation
- Efficiency

Computing Model of Big Data

Abstract Model of Computing
- Data x (n is very large) -> computer program -> approximation of f(x)
- An approximation of f(x) is sufficient
- The program can be randomized
- Examples: mean, parity

Random Sampling
- The program queries only a few data items
- Examples
  - Mean: O(1) queries
  - Parity: n queries
- Advantages
  - Ultra-efficient
  - Sub-linear running time and space (could even be independent of the data set size)
- Disadvantages
  - May require random access
  - Doesn't fit many problems

Data Streams
- The program streams through the data and uses limited memory
- Examples
  - Mean: O(1) memory
  - Parity: 1 bit of memory
- Advantages
  - Sequential access
  - Limited memory
- Disadvantages
  - Running time is at least linear
  - Too restricted for some problems

Sketching
- Compress each data segment (Data1, Data2) into a small "sketch" (Sketch1, Sketch2)
- Compute over the sketches
- Examples
  - Equality: O(1) size sketch
  - Hamming distance: O(1) size sketch
  - Lp distance (p > 2): Ω(n^(1-2/p)) size sketch

Research Works Related to Big Data
- Finding Maximal Cliques in Massive Networks by H*-Graph (SIGMOD 2010)
- Large-Scale Collective Entity Matching (VLDB 2011)
- Estimating Sizes of Social Networks via Biased Sampling (WWW 2011)
- The Anatomy of a Large-Scale Social Search Engine (WWW 2010)

Finding Maximal Cliques in Massive Networks by H*-Graph

Massive graph data
- A graph is a powerful modeling tool for analyzing massive networks.
- Graph data is everywhere (e.g. chemistry, biology, image, vision, social networks, the Web, etc.).
- The outstanding property of graph data is its massive size.

Motivation
- Graph size has become a serious concern in view of the massive volume of today's fast-growing network graphs.
  - The Web graph has over 1 trillion web pages (Google)
  - Social networks have millions to billions of users (Facebook, LinkedIn)
- Maximal Clique Enumeration (MCE) is very useful for analyzing massive graph data.
- How can MCE be performed on a massive graph? The best algorithms require memory space linear in the size of the input graph, which is clearly infeasible on a massive graph.

Challenges & Methods
- Clique: a subset of vertices such that every two vertices are connected. The clique problem is NP-complete.
- Maximal clique: a clique to which no more vertices can be added.
- The example graph has 5 maximal cliques: {1,2,5}, {2,3}, {3,4}, {4,5} and {4,6}.
- Because the graph is massive, the authors provide an external-memory algorithm for MCE (ExtMCE).
- One critical problem must be handled: what portion of the graph should be chosen at each recursive step, and how?
- The H*-graph is a core of the graph that can stand for the massive graph.
- Maximal cliques are enumerated only in the H*-graph, which can fit into memory.
- The H*-graph is the largest set of h vertices in G that have degree at least h.
- The authors therefore maintain and update the MCE result in the H*-graph.

Large-Scale Collective Entity Matching

Motivation: two kinds of approaches
- Pair-wise entity matching
  - Labels pairs as match/non-match independently
  - Ignores the relational information
  - Low accuracy
- Collective entity matching
  - Labels all pairs collectively
  - High accuracy
  - Often scales only to a few thousand entities
- How can we scale collective entity matching to millions of entities?

Method
- The scalable EM framework consists of three key components:
  - Modeling an entity matcher as a black box
  - Running multiple instances of the matcher on small subsets of entities
  - Using message passing across the instances to control the interaction between different runs of the matcher

Vibhor Rastogi, Nilesh Dalvi, Minos Garofalakis. Large-Scale Collective Entity Matching. Proceedings of the VLDB Endowment, Vol. 4, No. 4.

Estimating Sizes of Social Networks via Biased Sampling

Motivation
- Social networks have become pretty big:
  - Facebook (650,000,000)
  - Qzone (200,000,000)
  - Twitter (175,000,000)
- There is no public API for population size queries.
- An exhaustive crawl is time/space/communication intensive and violates "politeness".
- Goal: obtain estimates of population sizes in a social network with a limited number of public API calls.

Method
- Biased sampling: random walk on a directed graph
- Construct 4 statistics:
  - the number of collisions (C)
  - the number of non-unique elements
  - the sum of the sampled nodes' degrees
  - the sum of the inverse degrees of the sampled nodes
- Two ways to estimate the number of nodes:
  - collision-based estimator
  - non-unique-element-based estimator
- At least (...) samples are needed to guarantee the accuracy of the estimate.

[Figure: worked example of a random walk starting from a seed node, with a table of the running statistics]

Conclusions
- Data on today's scales requires scientific and computational intelligence.
- Big Data is a challenge and an opportunity for us.

Thank You
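The three computing models named in the Computing Model of Big Data slides (random sampling, data streams, sketching) can be illustrated with a minimal Python sketch. The function names, the bit-valued input and the polynomial-fingerprint construction are assumptions chosen for illustration; none of them come from the slides. The fingerprint also assumes both sequences have the same length and that the two parties share the random value r.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the fingerprint (illustrative choice)


def sampled_mean(data, k, rng):
    """Random sampling: estimate the mean from k random queries."""
    # Only k items are read, so the cost does not depend on len(data).
    picks = [data[rng.randrange(len(data))] for _ in range(k)]
    return sum(picks) / k


def stream_mean_and_parity(stream):
    """Data stream: one sequential pass that keeps only three counters."""
    count = 0
    total = 0.0
    parity = 0  # a single bit is enough for parity
    for bit in stream:
        count += 1
        total += bit
        parity ^= bit & 1
    return total / count, parity


def equality_sketch(data, r, p=MERSENNE_PRIME):
    """Sketching: compress a sequence into one number for equality tests."""
    # Horner's rule for sum(data[i] * r**i) mod p.
    acc = 0
    for value in reversed(data):
        acc = (acc * r + value) % p
    return acc


if __name__ == "__main__":
    rng = random.Random(0)
    bits = [rng.randint(0, 1) for _ in range(100_000)]

    exact_mean, parity = stream_mean_and_parity(bits)
    print("stream mean:", exact_mean, "parity:", parity)
    print("sampled mean (1000 queries):", sampled_mean(bits, 1000, rng))

    r = rng.randrange(1, MERSENNE_PRIME)  # shared random evaluation point
    copy = list(bits)
    print("sketches equal:", equality_sketch(bits, r) == equality_sketch(copy, r))
```

Sampling suits the mean because a small random subset already concentrates around the true value. It does not suit parity: flipping any single unread bit flips the answer, so all n items must be queried. The stream version handles both with constant state, at the price of reading every item once.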

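The clique definitions and the "h vertices with degree at least h" rule from the H*-graph slides can be checked on the small example graph. This is a sketch under stated assumptions: the edge list is reconstructed from the five maximal cliques listed on the slide, the enumeration uses the basic in-memory Bron-Kerbosch procedure (not the external-memory ExtMCE algorithm of the paper), and `h_vertices` implements only the selection rule quoted on the slide. All function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def h_vertices(adj):
    """Return (h, H): the largest h such that h vertices have degree >= h."""
    # Sort by degree, highest first; ties are broken by vertex id.
    ranked = sorted(adj, key=lambda v: (-len(adj[v]), v))
    h = 0
    for rank, v in enumerate(ranked, start=1):
        if len(adj[v]) >= rank:
            h = rank
        else:
            break
    return h, ranked[:h]


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices.
        if not p and not x:
            found.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(found)


# Edges implied by the maximal cliques {1,2,5}, {2,3}, {3,4}, {4,5}, {4,6}.
EXAMPLE_EDGES = [(1, 2), (1, 5), (2, 5), (2, 3), (3, 4), (4, 5), (4, 6)]

if __name__ == "__main__":
    graph = build_adjacency(EXAMPLE_EDGES)
    print("maximal cliques:", maximal_cliques(graph))
    print("h and h-vertices:", h_vertices(graph))
```

On this graph the in-memory enumeration is trivial. The point of the H*-graph approach is that only a small, high-degree core needs this kind of in-memory treatment when the full graph does not fit.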