协同过滤外文文献翻译.docx
精选优质文档-倾情为你奉上外文:Introduction to Recommender SystemApproaches of Collaborative Filtering: Nearest Neighborhood and Matrix Factorization“We are leaving the age of information and entering the age of recommendation.”Like many machine learning techniques, a recommender system makes prediction based on users historical behaviors. Specifically, its to predict user preference for a set of items based on past experience. To build a recommender system, the most two popular approaches are Content-based and Collaborative Filtering.Content-based approach requires a good amount of information of items own features, rather than using users interactions and feedbacks. For example, it can be movie attributes such as genre, year, director, actor etc., or textual content of articles that can extracted by applying Natural Language Processing. Collaborative Filtering, on the other hand, doesnt need anything else except users historical preference on a set of items. Because its based on historical data, the core assumption here is that the users who have agreed in the past tend to also agree in the future. In terms of user preference, it usually expressed by two categories. Explicit Rating, is a rate given by a user to an item on a sliding scale, like 5 stars for Titanic. This is the most direct feedback from users to show how much they like an item. Implicit Rating, suggests users preference indirectly, such as page views, clicks, purchase records, whether or not listen to a music track, and so on. In this article, I will take a close look at collaborative filtering that is a traditional and powerful tool for recommender systems.Nearest NeighborhoodThe standard method of Collaborative Filtering is known as Nearest Neighborhood algorithm. There are user-based CF and item-based CF. Lets first look at User-based CF. We have an n × m matrix of ratings, with user u, i = 1, .n and item p, j=1, m. Now we want to predict the rating r if target user i did not watch/rate an item j. The process is to calculate the similarities between target user i and all other users, select the top X similar users, and take the weighted average of ratings from these X users with similarities as weights.While different people may have different baselines when giving ratings, some people tend to give high scores generally, some are pretty strict even though they are satisfied with items. To avoid this bias, we can subtract each users average rating of all items when computing weighted average, and add it back for target user, shown as below.Two ways to calculate similarity are Pearson Correlation and Cosine Similarity.Basically, the idea is to find the most similar users to your target user (nearest neighbors) and weight their ratings of an item as the prediction of the rating of this item for target user.Without knowing anything about items and users themselves, we think two users are similar when they give the same item similar ratings . Analogously, for Item-based CF, we say two items are similar when they received similar ratings from a same user. Then, we will make prediction for a target user on an item by calculating weighted average of ratings on most X similar items from this user. One key advantage of Item-based CF is the stability which is that the ratings on a given item will not change significantly overtime, unlike the tastes of human beings.There are quite a few limitations of this method. It doesnt handle sparsity well when no one in the neighborhood rated an item that is what you are trying to predict for target user. Also, its not computational efficient as the growth of the number of users and products.Matrix FactorizationSince sparsity and scalability are the two biggest challenges for standard CF method, it comes a more advanced method that decompose the original sparse matrix to low-dimensional matrices with latent factors/features and less sparsity. That is Matrix Factorization.Beside solving the issues of sparsity and scalability, theres an intuitive explanation of why we need low-dimensional matrices to represent users preference. A user gave good ratings to movie Avatar, Gravity, and Inception. They are not necessarily 3 separate opinions but showing that this users might be in favor of Sci-Fi movies and there may be many more Sci-Fi movies that this user would like. Unlike specific movies, latent features is expressed by higher-level attributes, and Sci-Fi category is one of latent features in this case. What matrix factorization eventually gives us is how much a user is aligned with a set of latent features, and how much a movie fits into this set of latent features. The advantage of it over standard nearest neighborhood is that even though two users havent rated any same movies, its still possible to find the similarity between them if they share the similar underlying tastes, again latent features.To see how a matrix being factorized, first thing to understand is Singular Value Decomposition(SVD). Based on Linear Algebra, any real matrix R can be decomposed into 3 matrices U, , and V. Continuing using movie example, U is an n × r user-latent feature matrix, V is an m × r movie-latent feature matrix. is an r × r diagonal matrix containing the singular values of original matrix, simply representing how important a specific feature is to predict user preference.To sort the values of by decreasing absolute value and truncate matrix to first k dimensions( k singular values), we can reconstruct the matrix as matrix A. The selection of k should make sure that A is able to capture the most of variance within the original matrix R, so that A is the approximation of R, A R. The difference between A and R is the error that is expected to be minimized. This is exactly the thought of Principle Component Analysis.When matrix R is dense, U and V could be easily factorized analytically. However, a matrix of movie ratings is super sparse. Although there are some imputation methods to fill in missing values , we will turn to a programming approach to just live with those missing values and find factor matrices U and V. Instead of factorizing R via SVD, we are trying find U and V directly with the goal that when U and V multiplied back together the output matrix R is the closest approximation of R and no more a sparse matrix. This numerical approximation is usually achieved with Non-Negative Matrix Factorization for recommender systems since there is no negative values in ratings.See the formula below. Looking at the predicted rating for specific user and item, item i is noted as a vector q, and user u is noted as a vector p such that the dot product of these two vectors is the predicted rating for user u on item i. This value is presented in the matrix R at row u and column i.How do we find optimal q and p? Like most of machine learning task, a loss function is defined to minimize the cost of errors.r is the true ratings from original user-item matrix. Optimization process is to find the optimal matrix P composed by vector p and matrix Q composed by vector q in order to minimize the sum square error between predicted ratings r and the true ratings r. Also, L2 regularization has been added to prevent overfitting of user and item vectors. Its also quite common to add bias term which usually has 3 major components: average rating of all items , average rating of item i minus (noted as b), average rating given by user u minus u(noted as b).OptimizationA few optimization algorithms have been popular to solve Non-Negative Factorization. Alternative Least Square is one of them. Since the loss function is non-convex in this case, theres no way to reach a global minimum, while it still can reach a great approximation by finding local minimums. Alternative Least Square is to hold user factor matrix constant, adjust item factor matrix by taking derivatives of loss function and setting it equal to 0, and then set item factor matrix constant while adjusting user factor matrix. Repeat the process by switching and adjusting matrices back and forth until convergence. If you apply Scikit-learn NMF model, you will see ALS is the default solver to use, which is also called Coordinate Descent. Pyspark also offers pretty neat decomposition packages that provides more tuning flexibility of ALS itself.Some ThoughtsCollaborative Filtering provides strong predictive power for recommender systems, and requires the least information at the same time. However, it has a few limitations in some particular situations.First, the underlying tastes expressed by latent features are actually not interpretable because there is no content-related properties of metadata. For movie example, it doesnt necessarily to be genre like Sci-Fi in my example. It can be how motivational the soundtrack is, how good the plot is, and so on. Collaborative Filtering is lack of transparency and explainability of this level of information.On the other hand, Collaborative Filtering is faced with cold start. When a new item coming in, until it has to be rated by substantial number of users, the model is not able to make any personalized recommendations . Similarly, for items from the tail that didnt get too much data, the model tends to give less weight on them and have popularity bias by recommending more popular items.Its usually a good idea to have ensemble algorithms to build a more comprehensive machine learning model such as combining content-based filtering by adding some dimensions of keywords that are explainable, but we should always consider the tradeoff between model/computational complexity and the effectiveness of performance improvement.中文翻译推荐系统介绍协同过滤的方法:最近邻域和矩阵分解“我们正在离开信息时代,而进入推荐时代。”像许多机器学习技术一样,推荐系统根据用户的历史行为进行预测。具体来说,是根据过去的经验来预测用户对一组商品的偏好。要构建推荐系统,最流行的两种方法是基于内容的过滤和协作过滤。基于内容的方法需要大量项目自身功能的信息,而不是使用用户的交互和反馈。例如,它可以是电影属性(例如流派,年份,导演,演员等)或可以通过应用自然语言处理提取的文章的文本内容。另一方面,协作过滤除了用户对一组项目的历史偏好之外,不需要任何其他操作。因为它是基于历史数据的,所以这里的核心假设是,过去已经同意的用户将来也会倾向于也同意。就用户偏好而言,它通常由两类表示。明确评分,是用户按滑动比例对某项商品的价格,例如泰坦尼克号的评分为5星。这是用户最直接的反馈,表明他们对商品的喜爱程度。隐含评价,间接建议用户偏好,例如页面浏览量,点击次数,购买记录,是否收听音乐曲目等等。在本文中,我将仔细研究协作过滤,它是推荐系统的传统且功能强大的工具。最近的邻居协作过滤的标准方法称为“ 最近邻算法”。有基于用户的CF和基于项目的CF。让我们先来看看基于用户的CF。我们有一个n×m的评分矩阵,用户u,i = 1,. n,项目p,j = 1,. m。现在,如果目标用户i没有对项目j进行观看/评分,我们现在要预测评分r。该过程将计算目标用户i与所有其他用户之间的相似度,选择排名靠前的X个相似用户,并将来自这X个具有相似性的用户的评分的加权平均值作为权重。尽管不同的人给出评分时可能会有不同的基准,但是有些人通常会给出高分,有些人即使对项目感到满意也很严格。为了避免这种偏差,我们可以在计算加权平均值时减去每个用户对所有项目的平均评分,然后将其加回到目标用户身上,如下所示。计算相似度的两种方法是皮尔森相关和余弦相似度。基本上,该想法是找到与您的目标用户(最接近的邻居)最相似的用户,并权衡他们对某项的评价,以此作为对目标用户的评价。在不了解商品和用户本身的情况下,我们认为两个用户在给同一个商品相似的评分时是相似的。类似地,对于基于项目的CF,我们说两个项目在收到来自同一用户的相似评分时是相似的。然后,我们将通过计算来自该用户的大多数X个类似商品的评分的加权平均值,来预测该商品的目标用户。基于项目的CF的一个关键优势是稳定性,即与人类的口味不同,给定项目的评级不会随着时间的推移而发生显着变化。此方法有很多限制。当附近没有人对您要为目标用户预测的商品进行评分时,它不能很好地处理稀疏性。而且,随着用户和产品数量的增长,它的计算效率也不高。矩阵分解由于稀疏性和可伸缩性是标准CF方法的两个最大挑战,因此出现了一种更高级的方法,该方法将原始稀疏矩阵分解为具有潜在因子/特征且稀疏性较低的低维矩阵。那就是矩阵分解。除了解决稀疏性和可伸缩性问题之外,还有一个直观的解释,说明为什么我们需要低维矩阵来表示用户的偏好。用户对电影阿凡达,重力和盗梦空间给予了很高的评价。它们不一定是3个独立的意见,而是表明该用户可能更喜欢科幻电影,并且该用户可能想要更多的科幻电影。与特定电影不同,潜在功能由更高级别的属性表示,在这种情况下,科幻类别是潜在功能之一。矩阵分解最终给我们的是用户与一组潜在特征对齐的程度,以及一部电影在这组潜在特征中的适应程度。 与标准最近邻区相比,它的优势在于,即使两个用户未对任何一部电影进行评级,但如果他们共享相似的基本口味(又是潜在特征),仍然有可能找到它们之间的相似性。要了解矩阵如何分解,首先要了解的是奇异值分解(SVD)。基于线性代数,可以将任何实矩阵R分解为3个矩阵U,和V。以电影示例为例,U是n×r用户潜伏特征矩阵,V是m×r电影潜伏特征矩阵。是一个r×r对角矩阵,包含原始矩阵的奇异值,仅表示特定功能对预测用户偏好的重要性。为了通过减少绝对值对的值进行排序并将矩阵截断为前k个维(k个奇异值),我们可以将矩阵重构为矩阵A。选择k应该确保A能够捕获最大的方差在原始矩阵R内,A是R的近似值,AR。A和R之间的差是期望最小化的误差。这正是主成分分析的思想。当矩阵R是致密的时,U和V可以很容易地解析分解。但是,电影分级矩阵超级稀疏。尽管存在一些填补缺失值的插补方法,但我们将转向一种编程方法,以仅使用那些缺失值并找到因子矩阵U和V。我们尝试通过以下方法直接找到U和V,而不是通过SVD对R进行因子分解。目的是当U和V相乘时,输出矩阵R'是R的最近似值,而不再是稀疏矩阵。对于推荐系统,通常使用非负矩阵分解实现此数值近似,因为评级中没有负值。请参阅下面的公式。查看特定用户和项目的预测评级,将项目i记为向量q,将用户u标记为向量p,以使这两个向量的点积为用户u对项目i的预测评级。该值显示在矩阵R'中的第u行和第i列。我们如何找到最优q和p?像大多数机器学习任务一样,定义了损失函数以最大程度地减少错误的成本。r是原始用户项目矩阵的真实评级。优化过程是找到由向量p组成的最优矩阵P和由向量q组成的矩阵Q,以使预测等级r与真实等级r之间的平方和误差最小。此外,还添加了L2正则化以防止用户和项目向量的过拟合。添加偏差项也很常见,该偏差项通常包含3个主要成分:所有项目的平均评级,项目i的平均评级减去(表示为b),用户给定的平均评级为u减去u(表示为b)。优化一些优化算法已广泛用于解决非负因式分解。替代最小二乘是其中之一。由于损失函数在这种情况下是非凸的,因此无法达到全局最小值,而通过找到局部最小值仍可以达到很大的近似值。备选最小二乘保持用户因子矩阵恒定,通过采用损失函数的导数并将其设置为0来调整项目因子矩阵,然后在调整用户因子矩阵时将项目因子矩阵设为恒定。通过来回切换和调整矩阵来重复此过程,直到收敛为止。如果应用Scikit-learn NMF模型,您将看到ALS是要使用的默认求解器,也称为坐标下降。Pyspark还提供了非常简洁的分解包,可为ALS本身提供更大的调整灵活性。一些想法协作过滤为推荐系统提供强大的预测能力,并且同时需要最少的信息。但是,它在某些特定情况下有一些限制。首先,由于没有元数据的与内容相关的属性,因此潜在特征表达的基本口味实际上是无法解释的。对于电影示例,在我的示例中不一定像科幻小说那样。可能是配乐的动机,情节的好坏等等。协作式筛选缺乏此级别信息的透明度和可解释性。另一方面,协作过滤面临冷启动。当有新项目进入时,直到必须由大量用户对其进行评分,该模型才能提出任何个性化建议。类似地,对于没有获得太多数据的尾部项目,该模型往往会给予较少的权重,并通过推荐更多受欢迎的项目而具有受欢迎度偏差。通常,拥有集成算法来构建更全面的机器学习模型是一个好主意,例如通过添加一些可解释的关键字维度来组合基于内容的过滤,但是我们应该始终在模型/计算复杂度与有效性之间进行权衡性能改进。专心-专注-专业