An Introduction of Big Data[大数据的介绍](PPT34).ppt
An Introduction of Big Data,WEB GROUP2011.9.24,1,2,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,3,Information Explosion57%every year(IDC)Double every 1.5 years988EB(1EB=1024PB)data will be produced in 2010(IDC)18 million times of all info in books IT850 million photos&8 million videos/day(Facebook)50PB web pages,500PB log(Baidu)Telco(Log,multimedia data)Enterprise Storage Public UtilitiesHealth Care(medical images-photos)Public Traffic(surveillance-videos),What is Big Data,4,DefinitionBig data is the confluence of the three trends consisting of Big Transaction Data,Big Interaction and Big Data ProcessingQuestions?Big Data=Large-Scale Data(Massive Data),What is Big Data,Structural and Semi-Structural Transaction Data,.Unstructured dataInteraction Data,5,The properties of Big DataHugeDistributedDispersed over many serversDynamicItems add/deleted/modified continuouslyHeterogeneousMany agents access/update dataNoisyInherentUnintentionalMaliciousUnstructured/semi-structuredNo database schemaComplex interrelationships,What is Big Data,6,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,7,The Framework of Big Data,8,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,9,The Applications of Big Data,Celestial bodyExobiology,Inheritance Sequence of cancer,AdvertisementFinding communities,SNAFinding communities,Data MiningConsuming habit,Changing router,10,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,11,Efficiency requirements for AlgorithmTraditionally,“efficient”algorithmsRun in(small)polynomial time:O(nlogn)Use linear space:O(n)For large data sets,efficient algorithmsMust run in linear or even sub-linear time:o(n)Must use up to poly-logarithmic space:(logn)2Mining Big DataAssociation Rule and Frequent PatternsTwo parameters:support,confidenceClusteringDistance measure(L1,L2,L,Edit Distance,etc,.)Graph structureSocial Networks,Degree distribution(heavy trail),The Challenges of Big Data,12,Clean Big DataNoise in data distorts Computation resultsSearch resultsNeed automatic methods for“cleaning”the dataDuplicate eliminationQuality evaluationComputing ModelAccuracy and ApproximationEfficiency,The Challenges of Big Data,13,Abstract Model of Computing,Computing Model of Big Data,13,Approximation of,Data,(n is very large),Approximation of f(x)is sufficient Program can be randomized,Computer Program,Examples,Mean,Parity,Random Sampling,Computing Model of Big Data,14,Query a few data items,Data,Examples,MeanO(1)queries,Parityn queries,Approximation of,(n is very large),Computer Program,15,AdvantagesUltra-efficientSub-linear running time&space(could even be independent of data set size)DisadvantagesMay require random accessDoesnt fit many problems,Random Sampling,Data Streams,Computing Model of Big Data,16,Data,Computer Program,Stream through the data;Use limited memory,Examples,MeanO(1)memory,Parity1 bit of memory,Approximation of,(n is very large),17,AdvantagesSequential accessLimited memoryDisadvantagesRunning time is at least linearToo restricted for some problems,Random Sampling,Sketching,Computing Model of Big Data,18,Data1,Data2,Data1,Data2,Sketch2,Sketch1,Compress eachdata segment intoa small“sketch”Compute overthe sketches,Examples,EqualityO(1)size sketch,Hamming distanceO(1)size sketch,Lp distance(p 2)(n1-2/p)size sketch,Approximation of,(n is very large),19,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,20,Finding Maximal Cliques in Massive Networks by H*-Graph(Sigmod 2010)Large-Scale Collective Entity Matching(VLDB2011)Estimating Sizes of Social Networks via Biased Sampling(WWW 2011),Research works related with Big Data,21,Massive graph dataGraph is a powerful modeling tool for analyzing massive networks.Graph data is everywhere(e.g.chemistry,biology,image,vision,social networks,the Web,etc.).The outstanding property of graph data is massive.,Finding Maximal Cliques in Massive Networks by H*-Graph,22,MotivationThis has become a serious concern in view of the massive volume of todays fast-growing network graphs.Web graph has over 1 trillion webpages(google)Social networks have millions to billions of users(Facebook,Linkedin)Maximal Clique Enumeration(MCE)is very useful and helpful for analyzing massive graph data.How to find MCE in massive graph?The best algorithm require memory space linear in the size of the input graph,which is clearly infeasible on massive graph.,Finding Maximal Cliques in Massive Networks by H*-Graph,23,Challenges&MethodsClique:A subset of vertices such that every two vertices are connected.Clique problem is NP-Complete.Maximal Clique:If no more vertices can be added to the clique.The Graph has 5 maximal cliques1,2,5,2,3,3,4,4,5 and 4,6Due to Massive graph,authors provide an External-memory algorithm for MCE(ExtMCE).One critical problem must be handled.What portion should be chosen at each recursive step and how?H*-graph is a core of graph,which can be stand for the massive graph.Only finding MCE in H*-graph that can fit into memory.H*-graph is the largest set of h vertices in G that have degree at least h.Therefore,authors maintain and update MCE in H*-graph.,Finding Maximal Cliques in Massive Networks by H*-Graph,24,Finding Maximal Cliques in Massive Networks by H*-Graph(Sigmod 2010)Large-Scale Collective Entity Matching(VLDB2011)Estimating Sizes of Social Networks via Biased Sampling(WWW 2011)The Anatomy of a Large-Scale Social Search Engine(WWW2010),Research works related with Big Data,25,MotivationTwo kinds of ApproachesPair-wise Entity MatchingLabel pairs as match/non-match independentlyIgnoring the relational informationLow accuracyCollective Entity MatchingLabel all pairs collectively High accuracyOften scale only to a few 1000 entities,Large-Scale Collective Entity Matching,How can we scale Collective Entity Matching to millions of entities?,26,MotivationTwo kinds of ApproachesPair-wise Entity MatchingLabel pairs as match/non-match independentlyIgnoring the relational informationLow accuracyCollective Entity MatchingLabel all pairs collectively High accuracyOften scale only to a few 1000 entities,Large-Scale Collective Entity Matching,How can we scale Collective Entity Matching to millions of entities?,27,MethodThe scalable EM framework consists of three key componentsModeling an entity matcher as a block boxRunning multiple instances of the matcher on small subsets of entitiesUsing message passing across the instances to control the interaction between different runs of the matcher.,Large-Scale Collective Entity Matching,Vibhor Rastogi,Nilesh Dalvi,Minos Garofalakis,Large-Scale Collective Entity Matching.Proceedings of the VLDB Endowment,Vol.4,No.4,28,Finding Maximal Cliques in Massive Networks by H*-Graph(Sigmod 2010)Large-Scale Collective Entity Matching(VLDB2011)Estimating Sizes of Social Networks via Biased Sampling(WWW 2011),Research works related with Big Data,29,MotivationSocial network have become pretty big:Facebook(650,000,000)Qzone(200,000,000)Twitter(175,000,000)No public API for population size queries.Exhaustive crawl is time/space/communication intensive and violates“politeness”Goal:Obtaining estimates for sizes of populations in social network with limit public API calls.,Estimating Sizes of Social Networks via Biased Sampling,MethodBiased sampling random walk on directed graphConstruct 4 statistics:C the number of collisions.C the number of non-unique elements the sum of the sampled nodes degrees.the sum of the inverse sampled nodes degrees.Two way to estimate the number of nodes:At least samples are needed to guarantee the accuracy of the estimate.,Large-Scale Collective Entity Matching,collision based estimator,non-unique element based estimator,Example,1,2,3,5,4,c,3,0,1/3,-,3,f,c,b,d,d,3,3,2,4,4,0,0,0,1,2,5,9,12,15,19,5/6,13/12,17/12,21/12,2,-,-,-,13,9,seed,32,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataSome research works related Big DataConclusions,33,Conclusions,Data on todays scales require scientific and computational intelligence.Big Data is a challenge and an opportunity for us.,34,Thank You,