Social Network Platforms and Technologies in the Cloud Computing Era_Google China.ppt
Slide 1. Social Network Platforms and Technologies in the Cloud Computing Era
  2/25/2023, Ed Chang
  张智威 (Ed Chang), Deputy Director, Research Institute, Google China
  Professor, Department of Electrical Engineering, University of California

Slide 2. China Opportunity: China & U.S. in 2006-07
                          China                U.S.
  Internet Population     180 million (25%)    208 million (3%)
  Broadband Users         60 million (90%)     60 million (29%)
  Mobile Phones           500 million          180 million
  Engineering Graduates   600 k                72 k

Slide 3. Google China
  - Size (700): 200 engineers, 400 other employees, almost 100 interns
  - Locations: Beijing (2005), Taipei (2006), Shanghai (2007)

Slide 4. Organizing the World's Information, Socially
  - Social Platform
  - Cloud Computing
  - Concluding Remarks

Slide 5. Web 1.0
  (diagram; labels: .htm, .jpg, .doc, .msg)

Slide 6. Web with People (2.0)
  (diagram; labels: .htm, .jpg, .doc, .xls, .msg)

Slide 7. + Social Platforms
  (diagram; labels: .htm, .jpg, .doc, .xls, .msg, App (Gadget))

Slide 12. Open Social Platform

Slide 17. Open Social Platform; Social Platform

Slide 20. Open Social Platform; Social Platform

Slide 22. Social Graph

Slide 24. What Do Users Want?
  - People care about other people
    - care about people they know
    - connect to people they do not know
  - Discover interesting information
    - based on other people
    - about who other people are
    - about what other people are doing

Slide 25. Information Overflow Challenge
  - Too many people, too many choices of forums and apps
  - "I soon need to hire a full-time to manage my online social networks"
  - Users want a social network recommendation system

Slide 26. Recommendation System
  - Friend recommendation
  - Community/forum recommendation
  - Application suggestion
  - Ads matching

Slide 27. Organizing the World's Information, Socially
  - Social Platform
  - Cloud Computing
  - Concluding Remarks

Slide 28.
  Picture
  source: http://www.sis.pitt.edu
  Industry Trend: The Arrival of the Cloud Computing Era
  (1) Data in the cloud: no fear of loss, no need for backups
  (2) Software in the cloud: no downloads, automatic upgrades
  (3) Ubiquitous cloud computing: any device is yours once you log in
  (4) Infinitely powerful cloud computing: unlimited space, unlimited speed

Slide 29. Internet Search: An Example of Cloud Computing
  (figure label: Cloud Computing)
  1. The user enters query keywords
  2. Data is preprocessed in a distributed fashion to serve searches:
     - Google infrastructure (thousands of commodity servers around the world)
     - MapReduce for mass data processing
     - Google File System
  3. Search results are returned

Slide 30. Collaborative Filtering
  Given a matrix that "encodes" data

Slide 31. Given a matrix that "encodes" data
  - Matrix axes: Users, Communities
  - Many applications (collaborative filtering): User-Community, User-User, Ads-User, Ads-Community, etc.

Slide 32. Collaborative Filtering (CF) [Breese, Heckerman and Kadie 1998]
  - Memory-based
    - Given user u, find "similar" users (k nearest neighbors)
    - Bought similar items, saw similar movies, similar profiles, etc.
    - Different similarity measures yield different techniques
    - Make predictions based on the preferences of these "similar" users
  - Model-based
    - Build a model of the relationship between subject matters
    - Make predictions based on the constructed model

Slide 33. Memory-Based Model [Goldberg et al. 1992; Resnick et al. 1994; Konstan et al. 1997]
  - Pros
    - Simplicity: avoids the model-building stage
  - Cons
    - Memory- and time-consuming: uses the entire database every time to make a prediction
    - Cannot make a prediction if the user has no items in common with other users

Slide 34. Model-Based Model [Breese et al. 1998; Hofmann 1999; Blei et al. 2004]
  - Pros
    - Scalability: the model is much smaller than the actual dataset
    - Faster prediction: query the model instead of the entire dataset
  - Cons
    - Model building takes time

Slide 35. Algorithm Selection Criteria
  - Near-real-time recommendation
  - Scalable training
  - Incremental training is desirable
  - Can deal with data scarcity
  - Cloud computing!

Slide 36. Model-based Prior Work
  - Latent Semantic Analysis (LSA)
  - Probabilistic LSA (PLSA)
  - Latent Dirichlet Allocation (LDA)

Slide 37. Latent Semantic Analysis (LSA) [Deerwester et al. 1990]
  - Map high-dimensional count vectors to a lower
    dimensional representation called the latent semantic space
  - By SVD decomposition: $A = U \Sigma V^T$
    - $A$ = word-document co-occurrence matrix
    - $U_{ij}$ = how likely word $i$ belongs to topic $j$
    - $\Sigma_{jj}$ = how significant topic $j$ is
    - $V^T_{ij}$ = how likely topic $i$ belongs to doc $j$

Slide 38. Latent Semantic Analysis (cont.)
  - LSA keeps the $k$ largest singular values
    - Low-rank approximation to the original matrix
    - Saves space, removes noise, and reduces sparsity
  - Make recommendations using (with $\hat{A}$ the rank-$k$ approximation of $A$)
    - Word-word similarity: $\hat{A}\hat{A}^T$
    - Doc-doc similarity: $\hat{A}^T\hat{A}$
    - Word-doc relationship: $\hat{A}$

Slide 39. Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 1999; Hofmann 2004]
  - A document is viewed as a bag of words
  - A latent semantic layer is constructed between documents and words
  - $P(w,d) = P(d)P(w|d) = P(d)\sum_z P(w|z)P(z|d)$
  - Probability delivers explicit meaning: $P(w|w)$, $P(d|d)$, $P(d,w)$
  - Model learning via the EM algorithm

Slide 40. PLSA Extensions
  - PHITS [Cohn & Chang 2000]
    - Models document-citation co-occurrence
  - A linear combination of PLSA and PHITS [Cohn & Hofmann 2001]
    - Models contents (words) and inter-connectivity of documents
  - LDA [Blei et al. 2003]
    - Provides a complete generative model with a Dirichlet prior
  - AT [Griffiths & Steyvers 2004]
    - Includes authorship information
    - A document is categorized by authors and topics
  - ART [McCallum 2004]
    - Includes the email recipient as additional information
    - An email is categorized by author, recipients, and topics

Slide 41. Combinational Collaborative Filtering (CCF)
  - Fuse multiple sources of information
    - Alleviates the information sparsity problem
  - Hybrid training scheme
    - Gibbs sampling as initialization for the EM algorithm
  - Parallelization
    - Achieves linear speedup with the number of machines

Slide 42. Notations
  - Given a collection of co-occurrence data
    - Community: $C = \{c_1, c_2, \ldots, c_N\}$
    - User: $U = \{u_1, u_2, \ldots, u_M\}$
    - Description: $D = \{d_1, d_2, \ldots, d_V\}$
    - Latent aspect: $Z = \{z_1, z_2, \ldots, z_K\}$
  - Models
    - Baseline models
      - Community-User (C-U) model
      - Community-Description (C-D) model
    - CCF: Combinational Collaborative Filtering
      - Combines both baseline models

Slide 43. Baseline Models: Community-User (C-U) model and Community-Description (C-D) model
  C-U model: a community
  is viewed as a bag of users
  - c and u are rendered conditionally independent by introducing z
  - Generative process, for each user u:
    1. A community c is chosen uniformly
    2. A topic z is selected from P(z|c)
    3. A user u is generated from P(u|z)
  C-D model: a community is viewed as a bag of words
  - c and d are rendered conditionally independent by introducing z
  - Generative process, for each word d:
    1. A community c is chosen uniformly
    2. A topic z is selected from P(z|c)
    3. A word d is generated from P(d|z)

Slide 44. Baseline Models (cont.)
  C-U model
  - Pros
    1. Personalized community suggestion
  - Cons
    1. The C-U matrix is sparse and may suffer from the information sparsity problem
    2. Cannot take advantage of content similarity between communities
  C-D model
  - Pros
    1. Clusters communities based on community content (description words)
  - Cons
    1. No personalized recommendation
    2. Does not consider users shared between communities

Slide 45. CCF Model: Combinational Collaborative Filtering (CCF)
  - CCF combines both baseline models
  - A community is viewed as a bag of users AND a bag of words
  - By adding C-U, CCF can perform personalized recommendation, which C-D alone cannot
  - By adding C-D, CCF can perform better personalized recommendation than C-U alone, which may suffer from sparsity
  - Things CCF can do that C-U and C-D cannot:
    - P(d|u), relating a user to a word
    - Useful for user-targeted ads

Slide 46. Algorithm Requirements
  - Near-real-time recommendation
  - Scalable training
  - Incremental training is desirable

Slide 47. Parallelizing CCF
  Details omitted

Slide 48. Industry Trend: The Arrival of the Cloud Computing Era
  Picture source: http://www.sis.pitt.edu
  (1) Data in the cloud: no fear of loss, no need for backups
  (2) Software in the cloud: no downloads, automatic upgrades
  (3) Ubiquitous cloud computing: any device is yours once you log in
  (4) Infinitely powerful cloud computing: unlimited space, unlimited speed

Slide 49. Experiments on Orkut Dataset
  - Data description
    - Collected on July 26, 2007
    - Two types of data were extracted: community-user, community-description
    - 312,385 users
    - 109,987 communities
    - 191,034 unique English words
  - Community recommendation
  - Community
    similarity/clustering
  - User similarity
  - Speedup

Slide 50. Community Recommendation
  - Evaluation method
    - No ground truth, no user clicks available
    - Leave-one-out: randomly delete one community for each user
    - Check whether the deleted community can be recovered
  - Evaluation metric
    - Precision and recall

Slide 51. Results
  Observations:
  - CCF outperforms C-U
  - For the top 20, precision/recall of CCF are twice as high as those of C-U
  - The more communities a user has joined, the better CCF/C-U can predict

Slide 52. Runtime Speedup
  - The Orkut dataset enjoys a linear speedup with up to 100 machines
  - Training time drops from one day to less than 14 minutes
  - But what makes the speedup slow down beyond 100 machines?

Slide 53. Runtime Speedup (cont.)
  Training time consists of two parts:
  - Computation time (Comp)
  - Communication time (Comm)

Slide 54. CCF Summary
  - Combinational Collaborative Filtering
    - Fuses bag-of-words and bag-of-users information
    - Hybrid training provides better initializations for EM than random seeding
    - Parallelized to handle large-scale datasets

Slide 55. China's Contributions to Cloud Computing
  - Parallel CCF
  - Parallel SVMs (Kernel Machines)
  - Parallel SVD
  - Parallel Spectral Clustering
  - Parallel Expectation Maximization
  - Parallel Association Mining
  - Parallel LDA

Slide 56. Speeding up SVMs [NIPS 2007]
  - Approximate matrix factorization
  - Parallelization
  - Open source downloads since December 07
  - A task that takes 7 days on 1 machine takes 1 hour on 500 machines

Slide 57. Incomplete Cholesky Factorization (ICF)
  - $(n \times n) \approx (n \times p)(p \times n)$
  - $p \ll n$: conserves storage

Slide 58. Matrix Product
  - $(p \times n)(n \times p) = (p \times p)$

Slide 59. Organizing the World's Information, Socially
  - Social Platform
  - Cloud Computing
  - Concluding Remarks

Slide 60. Web With People
  (diagram; labels: .htm, .jpg, .doc, .xls, .msg)

Slide 61. What Next for Web Search?
  - Personalization
    - Return query results considering
      personal preferences
    - Example: disambiguate an ambiguous term like "fuji"
  - Oops: several tried, the problem is hard
    - Enough training data is difficult to collect (for collaborative filtering)
    - Personalization is computationally intensive (e.g., for personalizing PageRank)
    - User profiles may be incomplete or erroneous

Slide 62. Personalized Search, Intelligent Search
  - A search for "富士" (Fuji) may return Mount Fuji, Fuji apples, or Fuji cameras

Slide 67. Organizing the World's Information, Socially
  - The Web is a collection of documents and people
  - Recommendation is a personalized, push model of search
  - Collaborative filtering requires dense information to be effective
  - Cloud computing is essential

Slide 68. References
  [1] Alexa internet. http:/
  [2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
  [3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
  [4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the Seventeenth International Conference on Machine Learning, pages 167-174, 2000.
  [5] D. Cohn and T. Hofmann. The missing link: a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
  [6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
  [7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1-38, 1977.
  [8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
  [9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial
      Intelligence, pages 289-296, 1999.
  [10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
  [11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
  [12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
  [13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.

Slide 69. References (cont.)
  [14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
  [15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
  [16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
  [17] A. Strehl and J. Ghosh. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, 2002.
  [18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
  [19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
  [20] L. Adamic and E. Adar. How to search a social network. 2004.
  [21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
  [22] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining
       social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
  [23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207-216, 1993.
  [24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
  [25] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.

Slide 70. References (cont.)
  [26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
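
Supplementary sketch for slide 43 (C-U baseline model). The generative process on that slide (choose a community c, draw a topic z from P(z|c), draw a user u from P(u|z)) can be fitted with the EM algorithm that slide 39 names for PLSA. The Python below is a minimal illustration under assumptions that are not in the slides: the function names `train_cu_model` and `user_given_community`, the toy count matrix, plain random initialization (the talk's hybrid scheme initializes EM with Gibbs sampling instead), and a single-machine loop with no parallelization. It is not the authors' implementation.

```python
import math
import random


def _normalize(row, fallback):
    """Scale row to sum to 1; keep the previous estimate if row is all zero."""
    total = sum(row)
    return [x / total for x in row] if total > 0 else list(fallback)


def train_cu_model(counts, num_topics, num_iters=50, seed=0):
    """EM for the C-U model, where counts[c][u] = n(c, u).

    Returns (p_z_c, p_u_z, log_liks) with p_z_c[c][z] = P(z|c),
    p_u_z[z][u] = P(u|z), and log_liks[i] = log-likelihood of the
    observed pairs measured at the start of iteration i.
    """
    rng = random.Random(seed)
    num_comms, num_users = len(counts), len(counts[0])

    def random_dist(size):
        # Assumption: random seeding, not the Gibbs-based initialization.
        return _normalize([rng.random() + 0.1 for _ in range(size)], None)

    p_z_c = [random_dist(num_topics) for _ in range(num_comms)]
    p_u_z = [random_dist(num_users) for _ in range(num_topics)]
    log_liks = []

    for _ in range(num_iters):
        acc_zc = [[0.0] * num_topics for _ in range(num_comms)]
        acc_uz = [[0.0] * num_users for _ in range(num_topics)]
        log_lik = 0.0
        for c in range(num_comms):
            for u in range(num_users):
                n = counts[c][u]
                if n == 0:
                    continue
                # E-step: P(z|c,u) is proportional to P(u|z) * P(z|c).
                joint = [p_u_z[z][u] * p_z_c[c][z] for z in range(num_topics)]
                norm = sum(joint)
                log_lik += n * math.log(norm)
                for z in range(num_topics):
                    resp = n * joint[z] / norm
                    acc_zc[c][z] += resp
                    acc_uz[z][u] += resp
        log_liks.append(log_lik)
        # M-step: renormalize the expected counts into distributions.
        p_z_c = [_normalize(row, old) for row, old in zip(acc_zc, p_z_c)]
        p_u_z = [_normalize(row, old) for row, old in zip(acc_uz, p_u_z)]
    return p_z_c, p_u_z, log_liks


def user_given_community(p_z_c, p_u_z, c, u):
    """P(u|c) = sum over z of P(u|z) * P(z|c)."""
    return sum(p_u_z[z][u] * p_z_c[c][z] for z in range(len(p_u_z)))


if __name__ == "__main__":
    # Toy data (hypothetical): 4 communities x 6 users, two user groups.
    toy_counts = [
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1, 1],
    ]
    p_z_c, p_u_z, _ = train_cu_model(toy_counts, num_topics=2)
    for u in range(6):
        print(u, round(user_given_community(p_z_c, p_u_z, 0, u), 3))
```

Ranking communities for a user by P(u|c), with the uniform community prior from the slide, is one way such a model could produce the personalized community suggestions listed under the C-U pros; the sparsity issue listed under its cons shows up here as zero counts that the latent topics must smooth over.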