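An Information Filtering Method Based on SVD and Rough Sets*

* This research was supported by the Ministry of Education, the Ministry of Science and Technology, the National Natural Science Foundation of China, and the National 973 Program (project no. G19980306).

Chen Cai-yun, Li Zhi-guo
Center for Combinatorics, Nankai University, Tianjin 300071

Abstract
This paper proposes an information filtering method that applies rough set theory on top of singular value decomposition (SVD). Applying SVD to the term-document matrix yields an approximation matrix in which the importance of some terms in the corresponding documents is changed, so that the terms reflect document content better. Rule inference on decision tables, taken from rough set theory, is then used to generate a rule base for the information of interest, and the condition attributes of an unknown document are matched by similarity against the rules in this rule base to carry out the filtering. Experiments show that the method achieves better precision than the traditional VSM and LSI.

Keywords: singular value decomposition; rough sets; information filtering; rule extraction

1. Introduction
As the amount of information on the Internet grows rapidly, people often spend a great deal of time and effort finding what they need. Finding the information of interest more effectively and more accurately, and filtering out what is irrelevant to one's needs, has become an urgent task in Internet information processing. The information filtering techniques that arose in response are receiving more and more attention. An information filtering system filters a dynamic information stream according to the user's information needs and delivers only the documents the user is interested in, which improves the efficiency of information acquisition. The main requirement on information filtering is that the judgment of relevance between a document and the user's information need be accurate, while recall must also be raised.

This paper proposes an information filtering method that, on the basis of SVD, uses rule inference from rough set theory to build a rule base for filtering. For any unknown document, we only need to match its condition attributes by similarity against the rules in the rule base in order to filter it. Experiments show that the method performs better than both the traditional vector space method and LSI.

2. Rough Set Theory
Rough sets are a data reasoning method proposed by Professor Z. Pawlak of Poland [1]. The theory provides a strong foundation for discovering important structures in data and for classifying complex objects. We first describe the rough set concepts relevant to this paper. (The concepts and notation below follow [2].)

2.1 Information System
An information system is a 4-tuple, written $S = \langle U, Q, V, f \rangle$, where:
U: a non-empty set of objects under study, called the closed universe;
Q: a finite non-empty set of n attributes;
$V = \bigcup_{q \in Q} V_q$: the value domain of all attributes in Q, where $V_q$ is the domain of attribute q;
$f: U \times Q \to V$: the total decision function, such that $f(x, q) \in V_q$ for every $q \in Q$ and $x \in U$.
Through f, the information system S can be represented by a finite data table in which the entry in row i (object $x_i$) and column j (attribute $q_j$) is the value $f(x_i, q_j)$.

2.2 Decision Tables
If the attribute set Q of an information system can be split into a set of condition attributes C and a set of decision attributes D that are disjoint, that is, $Q = C \cup D$ and $C \cap D = \emptyset$, the information system is called a decision table, written $DT = \langle U, C \cup D, V, f \rangle$. In general D contains several decision attributes, but for the purposes of this paper it contains a single decision attribute d, that is, $D = \{d\}$. A decision table allows rule inference over a data set, and the filtering method below performs its rule inference on a decision table.

3. Singular Value Decomposition (SVD)
A given m×n matrix M can be factored into a product of three matrices, $M = U S V^T$, where U and V are orthogonal matrices of sizes m×m and n×n respectively and S is a diagonal matrix. The non-zero diagonal entries of S are called the singular values of M, and r is the number of non-zero diagonal entries. Define the m×n matrix $M_k = U_k S_k V_k^T$, where $U_k$ is the m×k matrix formed by the first k ($k \le r$) columns of U, $S_k$ is the k×k diagonal matrix formed by the k largest singular values of S, and $V_k$ is the n×k matrix formed by the first k columns of V. Among all matrices of rank k, the matrix $M_k$ constructed in this way is the one closest to M, and it is called the best rank-k approximation of M [3].

4. Constructing the Information Filtering Method

Step 1: Prepare the data and build the term-document matrix M [4]
We first collect a document data set and split it into a training set and a test set. Typically 60% to 80% of all documents are used for training and the rest for testing. Suppose there are m documents and n selected keywords. We build the term-document matrix M in which each row represents a document and each column gives the frequency of one term in the documents, that is, $M = (m_{ij})$, where $m_{ij}$ is the frequency of the j-th term in the i-th document.

Step 2: Decompose M by SVD and construct the best rank-k approximation $M_k$
We apply SVD to M to estimate the term structure used by the documents. Decomposing M gives $M = U S V^T$, from which we construct the best rank-k approximation $M_k = U_k S_k V_k^T$, where $k \le r$ and r is the number of non-zero singular values. The amount of data is usually very large, and SVD gives us the best rank-k approximation $M_k$ of M, which lowers the dimension of the term-document space. The transformation turns the originally sparse term-document matrix into a dense one and changes the relative weight of different terms in different documents, so that the terms express document content better. Likewise, for any new article we count the frequencies of the n keywords in it to obtain a 1×n vector P, which a transformation formula then maps to a vector in the term-document vector space.

Step 3: Construct the decision table DT and generate decision rules
We use the preprocessed document data above to construct the decision table. Let $DT = \langle U, C \cup D, V, f \rangle$ denote a decision table in which the closed universe U consists of the m documents of the term-document matrix, the condition attribute set C consists of the n terms of M, and the decision attribute set $D = \{d\}$ consists of the class attribute of the documents. In the value domain V, the condition attributes take their values directly from the best approximation $M_k$ of M, that is, $f(x_i, c_j) = (M_k)_{ij}$ for document $x_i$ and term $c_j$. The values of the decision attribute are determined by the attributes of the documents that are of interest or of value to us. They can be class labels such as military, finance, or sports, or simply a Boolean 0/1 variable, where 0 marks a document we are not interested in and 1 marks a document we are interested in. In short, the values of d are chosen according to need.

With this decision table we can carry out rule inference. First, the similarity of any two documents u and v in the closed universe is defined by formula (1) on the two vectors formed by the n condition attributes of u and v. A constant is fixed as the similarity threshold: if the similarity of u and v is greater than or equal to this threshold, u and v are considered similar, and otherwise they are not.

For any $x_i \in U$, let $X_i$ be the set of documents in U that are similar to $x_i$ (including $x_i$ itself). These sets $X_i$ together form a family of subsets of U. The decision attribute D partitions U into g classes $Y_1, \dots, Y_g$. The i-th decision rule on the decision table is defined as

$Des(X_i) \Rightarrow Des(Y_j)$, with $X_i \cap Y_j \neq \emptyset$,

where $Des(X_i)$ and $Des(Y_j)$ are the unique descriptions of what the members of $X_i$ and of $Y_j$ have in common. For example, $Des(X_i)$ is the common property of $X_i$, which in this paper means being a document whose similarity to $x_i$ is greater than or equal to the similarity threshold.

For a class $Y_j$ of information that we are interested in, the rule set consists of all rules $Des(X_i) \Rightarrow Des(Y_j)$ with $X_i \cap Y_j \neq \emptyset$. The accuracy of a decision rule is computed as $|X_i \cap Y_j| / |X_i|$. The rules whose accuracy exceeds an accuracy threshold are selected to form the rule base G used for filtering.

A decision rule can be expressed as an if-then statement: if some condition holds, then some conclusion holds. In this paper a decision rule reads as follows: for a newly arrived document, if its similarity to the document corresponding to some rule in the rule base is greater than or equal to the similarity threshold, then the decision attribute of the new document is the decision attribute of that rule. For example, if a new document's similarity to $x_2$ reaches the threshold and the rule corresponding to $x_2$ belongs to the rule base G, then the document is one we are interested in.

It follows that we only need to mark the training documents that correspond to rules in the rule base; a new document then only has to be compared with these marked documents. The marks are stored in a 1×m vector R, each entry of which is either 0 or 1. A 1 means that the document corresponds to a rule in the rule base, and a 0 means that no rule corresponds to it. In other words, a document similar to some document marked 1 is of interest to us, and any other document is not.

The rule extraction algorithm can be described as follows:
Step 1: Select the class $Y_j$ of information that we are interested in, and repeat Steps 2 to 4 for every document $x_i$ in the closed universe U of the decision table.
Step 2: Compute the similarity of $x_i$ to every document in U (including $x_i$ itself).
Step 3: Collect all documents whose similarity to $x_i$ is greater than or equal to the similarity threshold into the set $X_i$, and compute the accuracy $|X_i \cap Y_j| / |X_i|$.
Step 4: If the accuracy exceeds the accuracy threshold, set the corresponding position of R to 1; otherwise set it to 0.

Step 4: Derive the decision attribute of any unknown document
For any unknown document, compute its transformed vector as in Step 2 above, and then use formula (1) to compute its similarity to each document marked 1. If its similarity to any of these documents is greater than or equal to the similarity threshold, the document is one we are interested in; otherwise it is not.

The experiment below compares this method with the traditional methods.

5. Experimental Procedure and Analysis of Results
We ran the experiment on news articles, taking somewhat more than 1,100 articles from each of three categories, sports, finance, and military (1,197 sports, 1,121 finance, and 1,190 military articles, 3,508 in total), as experimental data. All articles were taken from the Qianlong news site. Of these data, 60% were used as the training set and 40% as the test set. Rules were first derived from the training set and then evaluated on the test set. Finally the three methods, VSM, LSI, and the rough set method, were compared and the results analysed.

5.1 Keyword Selection
We first collected all two-character words, 98,558 in total, computed term frequencies on part of the news articles, and removed function words, high-frequency words, and low-frequency words. From the remainder we selected the 170 most representative words.

5.2 Building the Term-Document Matrix M
We computed the frequency of each selected keyword in each document to form the term-document matrix $M = (m_{ij})$, where $m_{ij}$ is the frequency of the j-th term in the i-th document.

5.3 Generating the Rule Base
Following Step 2 of the method, we applied SVD to M and took k = 20 to form the new matrix $M_k$. Following Step 3, we then extracted rules from the decomposed matrix to generate the rule base.

5.4 Comparison of Experimental Results
We ran multiple experiments with the three methods, VSM, LSI, and the rough set method, and obtained the P-R (precision-recall) graph shown in Figure 1.

Figure 1. P-R graph of the three algorithms

The figure shows that our method is somewhat better than VSM in both precision and recall. Its precision is also better than that of LSI on average, although its recall is slightly worse than that of LSI.

The time and space complexity of the three algorithms is as follows:

Algorithm            Training             Filtering    Storage
VSM                  O(m)                 none         intermediate vector V
LSI                  O(mr^2)              O(m^2)       U_k, V_k, S_k
Rough set method     O(m^2) + O(mr^2)     O(m)         U_k, V_k, S_k, flag vector R

Table 1. Complexity comparison of the three methods

Here the complexity of SVD is O(mr^2). Table 1 shows that although the rough set method has a higher time complexity than LSI during training, its filtering cost is an order of magnitude lower than that of LSI. In storage it only needs one additional 1×m flag vector compared with LSI, which adds little burden. In terms of complexity, the rough set method is therefore still preferable to LSI.

5.5 Analysis of Causes
A document can be represented by a vector. Figure 2 shows two different distributions of the vectors of the documents we need and the documents we do not need, drawn with two different markers; the task is to filter out the documents we need. In case (a), VSM can easily filter out the needed documents with fairly high precision. In case (b), neither the precision nor the recall of VSM will be high, whereas the rough set method and LSI can still reach high precision. Because LSI only performs similarity comparison and ranking during filtering and does not discard vectors whose accuracy is low, our results are better than those of LSI in precision. For the same reason, LSI is better than our method in recall.

(a)    (b)
Figure 2. Two commonly occurring cases

6. Conclusion
The experimental system described in this paper was implemented as a prototype in Visual C++ 6.0 and Delphi 6.0 with the help of Matcom 4.5. We have made a preliminary study of an information filtering method that combines SVD with rough set theory. An algebraic method readjusts the term-document matrix, and rule inference from rough set theory then builds a rule base. Filtering in this way expresses document content better and avoids the blindness of the traditional vector space method, which makes it a useful experiment in information filtering. Our experiments also confirm that the method is acceptable in both complexity and precision, and that it is practical.

References:
1. Pawlak Z. Rough Sets. International Journal of Computer and Information Sciences, 1982, 11, pp. 341-356.
2. Cios K.J., Pedrycz W., Swiniarski R.W. Data Mining Methods for Knowledge Discovery. Kluwer Academic Publishers, 1998.
3. Azar Y., Fiat A., Karlin A., et al. Spectral Analysis for Data Mining. Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, 2001, pp. 619-626.
4. Papadimitriou C.H., Raghavan P., Tamaki H., et al. Latent Semantic Indexing: A Probabilistic Analysis. In Proceedings of the ACM Symposium on Principles of Database Systems, 1997.

A New Method for Information Filtering Based on SVD and Rough Sets
Chen Cai-yun, Li Zhi-guo
Center for Combinatorics, Nankai University, Tianjin 300071

Abstract
This paper proposes a new method for information filtering based on rough set theory and SVD. The importance of terms in the corresponding documents is changed by singular value decomposition (SVD). Rules that are useful to us are then generated from the decision tables of rough set theory. When an unknown document is input, its condition attributes are matched approximately against these rules and the useful information is retained.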
Experiments show that the method is better than the traditional VSM and LSI in precision.

Key words: SVD; Rough Sets; Information Filtering; Rule Generation

About the authors: Chen Cai-yun, female, born November 1975, Ph.D. student (class of 2001) at the Center for Combinatorics, Nankai University; research interests: combinatorics and data mining; chencaiyun. Li Zhi-guo, male, born September 1977, master's student (class of 2000) at the Center for Combinatorics, Nankai University; research interests: combinatorics and its applications; sbickle.
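The rule-extraction procedure of Section 4 (Steps 1 to 4 of the algorithm) and the final filtering step can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the similarity measure is assumed to be cosine similarity, since formula (1) is not legible in this copy, and the function names `cosine`, `extract_flags`, and `is_interesting`, the two thresholds, and the toy vectors are all hypothetical. The input vectors stand in for rows of the approximation matrix $M_k$.

```python
import math


def cosine(u, v):
    """Cosine similarity (an assumed stand-in for formula (1)); 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def extract_flags(docs, labels, target, sim_threshold, acc_threshold):
    """Build the flag vector R: R[i] = 1 if the rule built around document i is kept."""
    flags = []
    for x_i in docs:
        # X_i: indices of all training documents similar to x_i (x_i included).
        similar = [j for j, x_j in enumerate(docs)
                   if cosine(x_i, x_j) >= sim_threshold]
        # Rule accuracy |X_i ∩ Y_j| / |X_i| for the class of interest.
        hits = sum(1 for j in similar if labels[j] == target)
        accuracy = hits / len(similar) if similar else 0.0
        flags.append(1 if accuracy > acc_threshold else 0)
    return flags


def is_interesting(doc, docs, flags, sim_threshold):
    """A new document is kept if it is similar to any training document flagged 1."""
    return any(flag == 1 and cosine(doc, x) >= sim_threshold
               for x, flag in zip(docs, flags))


# Toy data: four training vectors, class 1 is the class of interest.
train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
R = extract_flags(train, labels, target=1, sim_threshold=0.9, acc_threshold=0.8)
keep_a = is_interesting([0.8, 0.2], train, R, sim_threshold=0.9)
keep_b = is_interesting([0.2, 0.8], train, R, sim_threshold=0.9)
```

On this toy data the first two training documents receive flag 1, so a new vector close to them is kept and one close to the other two is rejected. Training compares every pair of documents, while filtering compares the new document with at most m flagged documents, which matches the O(m^2) and O(m) entries for the rough set method in Table 1.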