User Interest Mining Based on Microblog Data (PPT courseware).ppt
Mining Users' Interests from MicroBlogs
Yinglin Wang, Professor, Department of Computer Science & Technology, Shanghai University of Finance and Economics
08 August 2014
2014 Sino-Finnish Summer School on Social-Media Data Analysis, Shanghai Univ. of Finance & Economics

Motivation
- Large amount of content, explosive growth, huge number of users
- For normal users: find useful information, products, collaborators
- For information providers: send information to the target users accurately
- Challenge: user interest modeling, the core of personalized services

Motivation (cont.)
- Recommendation systems: for selling products (retailers, news agencies, movie websites) and for finding people who share the same interests (in social activities or academic research)
- Information search / retrieval: to help information providers know what information you really need
(Diagram labels: movie website, news agency, retailers, social networks)

How to analyze user interests?
- What kind of data can we use?
- How can we represent user interests?
- What is an effective algorithm to calculate user interests (UIs)?

The data that can be used for analyzing user interests:
- Historical activities: purchase records, search records
- Explicit input: a list of keywords supplied by users; ratings of documents, movies, music, etc.
- Implicit feedback: online browsing behavior of users, e.g., mouse movement, reading time, printing, bookmarking, copy & paste, scrolling, visiting links on a page

Problems faced:
- Cold-start problem: little or nothing is known about a new user.
- Data are renewed slowly and do not fully reflect user interests.
Information provided by third parties, such as microblogs or Wikipedia, can help when the results obtained from multiple sources are integrated.

What kind of data can we use?
Social media data can be used. Microblogs are a new resource for obtaining user interests:
- Real time; data are renewed quickly
- Huge amount of data, easy to obtain
Facebook had 500 million users in 2010 and 901 million users in April 2012.
GlobalWebIndex survey:
- Tencent Qzone (腾讯QQ空间): 286 million users, 66% of China's internet population
- Sina Weibo (新浪微博): 264 million users

The general view of microblog data

How can we represent user interests?
1. VSM - Vector Space Model (vector, bag of words)
Vector: Each dimension in the vector corresponds to a separate term. If a term reflects the user's interest, its value in the vector is non-zero. The value can be Boolean, indicating for instance that a user has visited the item or understood the concept, or it can be an integer or fractional value indicating the degree of concern about the concept.
Bag of words.
Another similar, widely used approach is the keyword-based user model, which holds a bag of words that (usually) represents user interests.

User Interests / User Model Representations
We need to reduce the dimension of the word space.
Keyword extraction methods: TF-IDF
TF-IDF = TF × IDF
- TF: term frequency. $f(w, d)$ is the number of times that term $w$ occurs in document $d$.
- IDF: inverse document frequency, a measure of how much information the word $w$ provides, that is, whether the term is common or rare across all documents. IDF is obtained by dividing the total number of documents $N$ by the number of documents containing the term $w$, and then taking the logarithm of that quotient: $\mathrm{idf}(w) = \log \frac{N}{|\{d : w \in d\}|}$.

User Interests / User Model Representations (cont.)
TF-IDF intuition: If a word or phrase appears many times in one document and rarely in other documents, it discriminates well between classes and is suitable for classification.
Another keyword extraction method is TextRank: inspired by the PageRank method, it calculates a weight for each word.

2. Concept-based model (ontology-based model, network-based model)
Figure: An illustration of the user ontology
Figure: A partial domain ontology for the Italian soccer teams
Excerpted from Xing Jiang and Ah-Hwee Tan, “Learning and Inferencing in User Ontology for Personalized Semantic Web Services,” WWW 2006, May 22-26, 2006, Edinburgh, UK.

User Interests / User Model Representations (cont.)
3. Topic model
The most common topic model is Latent Dirichlet Allocation (LDA), a type of statistical model for discovering the abstract topics that occur in a collection of documents.
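As an aside on the keyword-extraction step, the TF-IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the cited studies: the function name `tf_idf` and the three toy "microblogs" are hypothetical, TF is taken as the raw count $f(w, d)$, and IDF as the plain logarithm of the document ratio with no smoothing.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. TF is the raw count f(w, d);
    IDF is log(N / number of documents containing w).
    """
    n_docs = len(docs)
    # df[w] = number of documents that contain term w at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

# Hypothetical toy microblogs, already tokenized (an assumption)
docs = [
    ["stock", "market", "stock", "bank"],
    ["football", "match", "bank"],
    ["football", "league", "match"],
]
weights = tf_idf(docs)
# Highest-weighted term of each document, a crude "interest keyword"
top_terms = [max(w, key=w.get) for w in weights]
```

A term that occurs in every document gets weight zero under this definition, which is why practical systems often add smoothing to the IDF term.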
A topic is an abstract concept, characterized by a distribution over words: $P(w_1, w_2, \ldots, w_n \mid t)$. A document is represented as a random mixture over latent topics: $P(t_1, t_2, \ldots, t_n \mid d)$.
Advantages:
- A low-dimensional representation of the document
- The semantic information hidden behind the words can be discovered; e.g., “computer” and “microcomputer”, or “automobile” and “car”, have the same meaning and hence belong to the same topic.

User Interests / User Model Representations (cont.)
An example to show the advantage of the topic model
The following two sentences would be regarded as completely unrelated under the VSM, but a topic model can discover the relationship between them.
Doc1: If we went back to 2006, would Yun Ma and Zhiyuan Yang cooperate?
Doc2: Alibaba Group and Yahoo signed a share repurchase agreement.
- Topic 1: Yun Ma (马云), Alibaba (阿里巴巴), Taobao (淘宝)
- Topic 2: Zhiyuan Yang (杨致远), Yahoo (雅虎), Portal (门户网站)
- Topic 3: cooperate, agreement, contract
Thus, Doc1 and Doc2 are closely related based on the topics.
Note: Yun Ma is the founder of Taobao; Zhiyuan Yang is the founder of Yahoo.

We can use topics to model users' interests: $P(t_1, t_2, \ldots, t_m \mid u)$
User interest model: the probability distribution over topics for users, combined with the probability distribution over words for topics.
Figure: An example of representing user interests with a topic model

Mining interest from MicroBlogs:
Most studies use a keyword representation and differ only in how they choose the keywords; e.g., Wen uses labels to construct interests, while Wu uses TF-IDF and TextRank.
- Wen (WSDM '10): the microblogs released by one user are merged into one large document, and the standard LDA model is then applied.
- Hong (SOMA '10): compared different ways to train a topic model from microblogs: 1) regard each microblog as one document; 2) merge all the microblogs of a user into one document; 3) merge all the microblogs with the same labels into one document.
- Li Jun (ISCTCS '12, 2013)
developed an explainable user interest model based on topic models.
Some studies using LDA:
Banerjee (2009), Jilin Chen (2010), Wen (ICDM '10), Wu (2010)
Some related research on user interest modelling

Introduction to Latent Dirichlet Allocation

History of topic models
- Latent class models in statistics (late 1960s)
- “Aspect model”, Hofmann (1999): original application to documents
- LDA model: Blei, Ng, and Jordan (2001, 2003): variational methods
- Topics model: Griffiths and Steyvers (2003, 2004): Gibbs sampling approach (very efficient)
- More recent work on alternative (but similar) models, e.g., by Max Welling (ICS), Buntine, McCallum, and others

Topic models
Documents are mixtures of topics. A topic is a probability distribution over words. A topic model is a generative model for documents.
To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution and draws a word from that topic.

A toy example that can help
Suppose you have the following set of sentences:
(1) I like to eat broccoli and bananas.
(2) I ate a banana and spinach smoothie for breakfast.
(3) Chinchillas and kittens are cute.
(4) My sister adopted a kitten yesterday.
(5) Look at this cute hamster munching on a piece of broccoli.
(Pictured: chinchillas, broccoli, spinach, smoothie, hamster)
What is latent Dirichlet allocation? It's a way of automatically discovering the topics that these sentences contain.
For example, given these sentences and asked for 2 topics, LDA might produce something like:
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret Topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret Topic B to be about cute animals)
Each topic is a probability distribution over words. Each document is modeled as a mixture of topics.
The questions: How are the probability distributions used in the generation of sentences? How does LDA perform this discovery?

How are the probability distributions used in the generation of sentences / documents?
Sentence / document generation process with the LDA model
LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you
1. Decide on the number of words $N$ the document will have (say, according to a Poisson distribution).
2. Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of $K$ topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
3. Generate each word $w_i$ in the document by:
   a. First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
   b. Using the topic to generate the word itself (according to the topic's multinomial distribution).
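The generative steps listed above can be sketched in Python using only the standard library. This is an illustrative sketch under stated assumptions: the word probabilities are the ones quoted for Topics A and B, the probability mass the slides leave unspecified is lumped into placeholder tokens, the mean length of 5 and the symmetric Dirichlet parameter `alpha=1.0` are arbitrary choices, and all function names are hypothetical.

```python
import math
import random

# Word probabilities quoted in the slides; the remaining mass is lumped
# into a placeholder token (an assumption, since the slides elide it).
TOPICS = {
    "food": {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.10,
             "munching": 0.10, "<other-food>": 0.35},
    "cute animals": {"chinchillas": 0.20, "kittens": 0.20, "cute": 0.20,
                     "hamster": 0.15, "<other-animal>": 0.25},
}

def sample_poisson(rng, mean):
    """Draw from a Poisson distribution with Knuth's multiplication method."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, mean_length=5, alpha=1.0):
    names = list(TOPICS)
    n_words = sample_poisson(rng, mean_length)              # step 1: length N
    mixture = sample_dirichlet(rng, [alpha] * len(names))   # step 2: topic mixture
    words = []
    for _ in range(n_words):                                # step 3: each word
        topic = rng.choices(names, weights=mixture)[0]      # 3a: pick a topic
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]  # 3b
        words.append(word)
    return dict(zip(names, mixture)), words

rng = random.Random(0)
mixture, words = generate_document(rng)
```

Because each word is drawn independently given the mixture, word order carries no information, which is the bag-of-words property of the model.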
For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.

An example of generating a sentence from the LDA model
According to the above process, when generating some particular document D, you might
1. Pick 5 to be the number of words in D.
2. Decide that D will be 1/2 about food and 1/2 about cute animals.
3. Pick the first word to come from the food topic, which then gives you the word “broccoli”.
4. Pick the second word to come from the cute animals topic, which gives you “panda”.
5. Pick the third word to come from the cute animals topic, giving you “adorable”.
6. Pick the fourth word to come from the food topic, giving you “cherries”.
7. Pick the fifth word to come from the food topic, giving you “eating”.
So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model).

The Generative Model
(Figure: a topic mixture produces a topic for each word slot, and each topic produces a word.)
For each document, choose a mixture of topics. For every word slot, sample a topic from 1..T from the mixture, then sample a word from the topic.

Example of generating words / documents
(Figure: topics, mixtures with weights .4, 1.0, .6, 1.0, and documents with topic assignments; the arrow from topics to words is labeled P(w | z). The digit after each word is the topic it was sampled from.)
- MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ...
- RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ...
- RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2 ...

Document generation as a probabilistic process
Each document is a mixture of topics. Each topic is a distribution over words. Each word is chosen from a single topic.
The model can be described formally as:
$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$
- $P(z_i = j)$: the probability that the $j$-th topic was sampled for the $i$-th word token (from parameters $\theta^{(d)}$)
- $P(w_i \mid z_i = j)$: the probability of word $w_i$ under topic $j$ (from parameters $\phi^{(j)}$)
- $T$: the number of topics
- $\phi^{(j)} = P(w \mid z = j)$: the multinomial distribution over words for topic $j$
- $\theta^{(d)} = P(z)$: the multinomial distribution over topics for document $d$
Assume that the text collection consists of $D$ documents and each
document $d$ consists of $N_d$ word tokens.

Example topics from TASA: an educational corpus
37K docs, 26K-word vocabulary, 300 topics, e.g.: (figure)
Advantages: Each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster of correlated terms.

Example topics from TASA: an educational corpus (cont.)
For example, by giving equal probability to the first two topics, one could construct a document about a person who has taken too many drugs and how that affected color perception.
By giving equal probability to the last two topics, one could construct a document about a person who experienced a loss of memory, which required a visit to the doctor.
P(w | z): P(“Mind” | topic 43)

How is it possible to discover the topics that appear in a set of documents?
Inference / learning task
So now suppose you have a set of documents. You've chosen some fixed number K of topics to discover, and want to use LDA to learn the topic representation of each document and the words associated with each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following.

Inference - Gibbs Sampling
$$P(z_i = j \mid z_{-i}, w_i, d_i, \cdot) \propto \frac{C^{WT}_{w_i j} + \beta}{\sum_{w=1}^{W} C^{WT}_{w j} + W\beta} \cdot \frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha}$$
where $C^{WT}$ and $C^{DT}$ are matrices of counts with dimensions $W \times T$ and $D \times T$ respectively; $C^{WT}_{wj}$ contains the number of times word $w$ is assigned to topic $j$, not including the current instance (word) $i$, and $C^{DT}_{dj}$ contains the number of times topic $j$ is assigned to some word token in document $d$, not including the current instance $i$. Here $\alpha$ and $\beta$ are the Dirichlet smoothing hyperparameters.

Learning Algorithm (inference)
Go through each document, and randomly assign each word in the document to one of the K topics.
Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
So to improve on them, for each document d:
- Go through each word w in d.
- For each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w.
- Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t). According to our generative model, this is essentially the probability that topic t generated word w, so it makes sense to resample the current word's topic with this probability.
In other words, in this step we're assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.

Learning Algorithm - cont.
After repeating the previous step a large number of times, you'll eventually reach a roughly steady state where your assignments are pretty good. Use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated with each topic (by counting the proportion of words assigned to each topic overall).

Artificial documents for illustration of the learning process
Can we recover the original topics and topic mixtures from this data?
(Figure: the documents) 16 documents, 2 topics, 5 words (RIVER, STREAM, BANK, MONEY, LOAN)
The Gibbs sampling algorithm can be illustrated by generating artificial data from a known topic model and applying the algorithm to check whether it is able to infer the original generative structure.
Suppose topic 1 gives equal probability to the words MONEY, LOAN, and BANK, i.e., P(MONEY | topic 1) = P(LOAN | topic 1) = P(BANK | topic 1) = 1/3.
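The learning procedure described above (random initial assignment, then repeatedly resampling each word's topic from the remaining counts) can be sketched as a small collapsed Gibbs sampler. This is a minimal sketch, not a tuned implementation: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, the iteration count, and the four-document corpus over the five-word vocabulary are all assumptions made for illustration. The plain-language description uses raw proportions; the smoothing terms are added here so that no topic ever has probability zero.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, n_iters=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as word-id lists."""
    rng = random.Random(seed)
    cwt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic counts
    cdt = [[0] * n_topics for _ in docs]               # document-topic counts
    topic_totals = [0] * n_topics                      # tokens per topic
    z = []                                             # topic of every token
    # Random initial assignment of every word token to a topic
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            cwt[w][t] += 1
            cdt[d][t] += 1
            topic_totals[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts
                t = z[d][i]
                cwt[w][t] -= 1
                cdt[d][t] -= 1
                topic_totals[t] -= 1
                # Smoothed p(word | topic) * p(topic | document); the
                # document-side denominator is the same for every topic,
                # so it is dropped from the unnormalized weights.
                probs = [
                    (cwt[w][k] + beta) / (topic_totals[k] + vocab_size * beta)
                    * (cdt[d][k] + alpha)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=probs)[0]
                z[d][i] = t
                cwt[w][t] += 1
                cdt[d][t] += 1
                topic_totals[t] += 1
    # Estimate topic mixtures per document and word distributions per topic
    theta = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
             for row in cdt]
    phi = [[(cwt[w][k] + beta) / (topic_totals[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(n_topics)]
    return theta, phi

# Hypothetical mini-corpus over the five-word vocabulary (an assumption)
VOCAB = ["river", "stream", "bank", "money", "loan"]
text_docs = [
    "money bank loan bank money loan bank money",
    "river stream bank river bank stream river",
    "money loan bank money river stream bank loan",
    "river bank stream stream river bank",
]
docs = [[VOCAB.index(w) for w in line.split()] for line in text_docs]
theta, phi = gibbs_lda(docs, n_topics=2, vocab_size=len(VOCAB))
```

Each row of `theta` is an estimated topic mixture for one document, and each row of `phi` is an estimated word distribution for one topic; which topic index ends up corresponding to which theme depends on the random seed.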
Topic 2 gives equal probability to the words RIVER, STREAM, and BANK, i.e., P(RIVER | topic 2) = P(STREAM | topic 2) = P(BANK | topic 2) = 1/3.

Starting the Gibbs Sampling
Assign word tok