Mining User Interests from Microblog Data (PPT courseware)
Mining Users' Interest from MicroBlogs
Yinglin Wang, Professor
Department of Computer Science & Technology, Shanghai University of Finance and Economics
08 August 2014
2014 Sino-Finnish Summer School on Social-Media Data Analysis
Shanghai Univ. of Finance & Economics

Motivation
- Large amount of content
- Explosive growth
- Huge number of users
- For normal users: find useful information, products, collaborators
- For information providers: send information to the target users accurately
- Challenges: user interest modeling is the core of personalized services

Motivation (cont.)
- Recommendation systems: for selling products (retailers, news agencies, movie websites) and for finding people who share the same interests (in social activities or academic research)
- Information search / retrieval: to help information providers know what information you really need
- Examples: movie website, news agency, retailers, social networks

How to analyze user interests?
- What kind of data can we use?
- How can we represent user interests?
- What is an effective algorithm to calculate user interests (UIs)?

The data that can be used for analyzing user interests:
- Historical activities: purchase records, search records
- Explicit input: a list of keywords supplied by users; ratings of documents, movies, music, etc.
- Implicit feedback: online browsing behavior of users, e.g., mouse movement, reading time, print, bookmark, copy & paste, scroll, visiting links on a page

Problems faced:
- Cold-start problem: little or nothing is known about a new user
- Data are updated slowly and do not fully reflect user interests
- Information provided by third parties, such as microblogs or Wikipedia, can help, by integrating the results obtained from multiple sources

What kind of data can we use?
Social media data can be used:
- Microblogs are a new resource for obtaining users' interests
- Real time: the data are updated quickly
- Huge volume of data that is easy to obtain
- Facebook had 500 million users in 2010 and 901 million users in April 2012
- Survey by GlobalWebIndex:
- Tencent Qzone (腾讯QQ空间): 286 million users, 66% of China's internet population
- Sina Weibo (新浪微博): 264 million users

The general view of microblog data

How can we represent user interests?
1. VSM - Vector Space Model (vector, bag of words)
- Vector: each dimension of the vector corresponds to a separate term. If a term reflects the user's interest, its value in the vector is non-zero. The value can be Boolean, indicating for instance that a user has visited the item or understood the concept, or it can be an integer or fractional value indicating the degree of concern about the concept.
- Bag of words: another widely used, similar approach is the keyword-based user model, which holds a bag of words that (usually) represents the user's interests.
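As a minimal sketch of the vector space representation described above, the following Python snippet builds a bag-of-words interest vector for one user and compares it with two candidate items. The posts, the item texts, the whitespace tokenizer, and the choice of cosine similarity as the matching score are all illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms (a toy tokenizer)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical posts by one user, merged into a single interest vector.
posts = ["football match tonight", "great football goal", "new phone review"]
user_vector = bag_of_words(" ".join(posts))

# Boolean variant: 1 if the user has mentioned the term at all.
boolean_vector = {term: 1 for term in user_vector}

# Hypothetical items to match against the user's interests.
item_a = bag_of_words("football goal highlights")
item_b = bag_of_words("stock market news")
sim_a = cosine(user_vector, item_a)
sim_b = cosine(user_vector, item_b)
print(user_vector["football"], round(sim_a, 3), sim_b)
```

Because every distinct term is its own dimension, the vector grows with the vocabulary, and two texts that share no words always score zero.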
User Interests / User Model Representations
We need to reduce the dimension of words.
Keyword extraction methods: TF-IDF
- TFIDF = TF * IDF
- TF (term frequency): f(w, d), the number of times that term w occurs in document d
- IDF (inverse document frequency): a measure of how much information the word w provides, that is, whether the term is common or rare across all documents
- IDF is obtained by dividing the total number of documents by the number of documents containing the term w, and then taking the logarithm of that quotient: idf(w) = log(N / n_w), where N is the total number of documents and n_w is the number of documents containing w
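The TF and IDF definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the three example documents are made up, tokenization is a plain whitespace split, TF is the raw count f(w, d), and the logarithm is the natural log (the slides do not specify a base).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf(w, d) = f(w, d), the raw count of w in d
    idf(w)   = log(N / df(w)), where N is the number of documents and
               df(w) is the number of documents containing w
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

# Hypothetical mini-corpus: one merged document per user.
docs = [
    "weibo user likes football football",
    "weibo user likes music",
    "weibo user reads news",
]
weights = tf_idf(docs)
top = max(weights[0], key=weights[0].get)
print(top, round(weights[0][top], 3))
```

Terms that occur in every document ("weibo", "user") get weight zero, while a term that is frequent in one document and absent elsewhere gets the highest weight, so keeping only the top-weighted terms reduces the dimension of the keyword vector.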
TF-IDF intuition: if a word or phrase appears many times in one document and rarely appears in other documents, it has good class-discriminating ability and is suitable for classification.

Another keyword extraction method:
- TextRank: inspired by the PageRank method, it calculates a weight for each word.

2. Concept-based model (ontology-based model, network-based model)
- Figure: an illustration of the user ontology
- Figure: a partial domain ontology for the Italian soccer teams
- Excerpted from Xing Jiang and Ah-Hwee Tan, "Learning and Inferencing in User Ontology for Personalized Semantic Web Services", WWW 2006, May 22-26, 2006, Edinburgh, UK.

3. Topic model
The most common topic model is Latent Dirichlet Allocation (LDA), a type of statistical model for discovering the abstract topics that occur in a collection of documents.
- A topic is an abstract concept characterized by a distribution over words: P(w1, w2, ..., wn | t)
- A document is represented as a random mixture over latent topics: P(t1, t2, ..., tn | d)

Advantages:
- A low-dimensional representation of the document
- The semantic information hidden behind the words can be discovered, e.g., for computer and microcomputer, or automobile and car
(Words with the same meaning belong to the same topic.)

The following two sentences would be regarded as completely unrelated under the VSM, but a topic model can discover the relationship between them.
- Doc1: If we went back to 2006, would Yun Ma and Zhiyuan Yang cooperate?
- Doc2: Alibaba Group and Yahoo signed a share repurchase agreement.
- Topic 1: Yun Ma (马云), Alibaba (阿里巴巴), Taobao (淘宝)
- Topic 2: Zhiyuan Yang (杨致远), Yahoo (雅虎), Portal (门户网站)
- Topic 3: cooperate, agreement, contract
Thus, doc1 and doc2 are closely related based on the topics!
An example to show the advantage of the topic model
Note: Yun Ma is the founder of Taobao; Zhiyuan Yang is a co-founder of Yahoo.

We can use topics to model users' interests:
- User interest model: P(t1, t2, ..., tm | u), the probability distribution over topics for users
- Topics: the probability distribution over words for topics
An example of the representation of user interests by a topic model

Mining interest from microblogs:
Most approaches use a keyword representation and differ only in the way they choose the keywords, e.g., Wen uses labels to construct interests, while Wu uses TF-IDF and TextRank instead.
- Wen (WSDM '10): microblogs released by one
user are merged into one large document, and the standard LDA model is then applied.
- Hong (SOMA '10): compared different ways to train a topic model from microblogs: 1) regard each microblog as one document; 2) merge all the microblogs of a user into one document; 3) merge all the microblogs with the same labels into one document.
- Li Jun (ISCTCS '12, 2013): developed an explainable user interest model based on topic models.
Some studies using LDA:
Banerjee (2009), Jilin Chen (2010), Wen (ICDM '10), Wu (2010)
Some related research on user interest modelling

Introduction to Latent Dirichlet Allocation

History of topic models
- Latent class models in statistics (late 60s)
- "Aspect model", Hofmann (1999): original application to documents
- LDA model: Blei, Ng, and Jordan (2001, 2003): variational methods
- Topics model: Griffiths and Steyvers (2003, 2004): Gibbs sampling approach (very efficient)
- More recent work on alternative (but similar) models, e.g., by Max Welling (ICS), Buntine, McCallum, and others

Topic models
- Documents are mixtures of topics.
- A topic is a probability distribution over words.
- A topic model is a generative model for documents.
- To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic.
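The generative view of a topic model can be sketched as follows. This is a minimal sketch under stated assumptions, not code from the lecture: the two topics and their word probabilities are hypothetical, the document's topic distribution is drawn from a symmetric Dirichlet prior (as in LDA) with an arbitrarily chosen concentration of 0.5, and the document length is fixed in advance.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document by the two-step generative process."""
    # Step 1: choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic at random according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 2b: draw a word from that topic's word distribution.
        vocab = list(topics[k])
        probs = list(topics[k].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each one is a probability distribution over words.
topics = [
    {"broccoli": 0.4, "banana": 0.3, "spinach": 0.3},     # a "food" topic
    {"kitten": 0.5, "hamster": 0.3, "chinchilla": 0.2},   # an "animals" topic
]
rng = random.Random(0)  # fixed seed so that runs are repeatable
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 2) for p in theta], words)
```

Fitting a topic model runs this process in reverse: given only the observed words, it infers the topic distributions per document and the word distributions per topic that most plausibly generated them.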
A toy example that can help
Suppose you have the following set of sentences:
(1) I like to eat broccoli and bananas.
(2) I ate a banana and spinach smoothie for breakfast.
(3) Chinchillas and kittens are cute.
(4) My sister adopted a kitten yesterday.
(5) Look at this cute hamster munching on a piece of broccoli.
Labels on the slide: chinchillas, broccoli, spinach, smoothie, hamster
What is latent Dirichlet allocation? It's a way of automatically discovering topics that these sentences contain. For example
Link: https://www.31ppt.com/p-1325253.html