English Language and Culture Thesis: Chinese and English Mixed Segmentation Method and Applied Research.doc




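Chinese and English Mixed Segmentation Method and Applied Research

【Chinese abstract】
With the rapid development of science and technology, computers have come into unprecedentedly wide use in every field, progressing from data processing and information processing to knowledge processing and the processing of natural-language text. Since automatic word segmentation was proposed for Chinese information processing in the early 1980s, many experts and scholars have made encouraging progress in this area, and Chinese word segmentation algorithms have been continually upgraded, improved, and refined as information has become more diverse and complex. Segmentation algorithms are widely used in information retrieval, automatic archiving, and related areas. However, China's rapid economic development has tied the country more closely to the rest of the world, and in frontier fields, or fields that Chinese researchers have only just entered, it is hard to avoid drawing on and citing the research results and new theories of developed countries. Information is therefore inevitably expressed in a mixture of Chinese and foreign languages, and mixed Chinese-English text in particular will become increasingly common. Information processing systems must accordingly segment not only Chinese text correctly but also mixed Chinese-English text. At present there is relatively little research on Chinese-English mixed word segmentation: no mature theory has formed, and neither a standard nor an evaluation system for it has been established. Mixed Chinese-English fields are generally handled by simply splitting Chinese characters from English letters, Chinese characters from Arabic numerals, and English letters from Arabic numerals, with no word identification or disambiguation. On this basis, this work first studies the new characteristics of mixed Chinese-English text and then focuses on algorithms for mixed segmentation, especially the disambiguation problem.

This thesis studies the forms and structure of mixed Chinese-English text and the habits with which people use it, analyzes existing Chinese word segmentation algorithms, and proposes a practical Chinese-English mixed segmentation algorithm. Disambiguation, one of the main difficulties of segmentation, is studied in depth: building on existing disambiguation algorithms, the thesis analyzes why further disambiguation is needed and gives a concrete implementation method. For the maximum word length problem, with segmentation speed fully taken into account, the thesis proposes determining the maximum word length for RMM by comparing the word lengths in the Hash dictionary entries that begin with the first two characters of the string to be segmented against the length of the text to be segmented. To verify the algorithm's efficiency, a Chinese-English mixed segmentation system was developed, and the algorithm was tested on the China Wind Energy Information Center system. The experiments show that the algorithm segments mixed Chinese-English documents correctly and effectively, that its disambiguation rate reaches a fairly high level, and that it also recognizes personal names among unknown words well. Finally, the segmentation results were used to achieve, in preliminary form, automatic classification and archiving of articles.

【English abstract】
With the rapid development of science and technology, computers are more widely used than ever, and their use has progressed from data processing to knowledge processing. Since automatic segmentation was proposed for Chinese information processing in the early 1980s, many experts and scholars have made great progress in this field. Segmentation algorithms also have a wide range of applications in information retrieval, automatic archiving, and other areas. The rapid development of China's economy has linked China more closely to the world, and we unavoidably draw on the experience of other countries. Such information must therefore be expressed in a mixture of Chinese and foreign languages, especially Chinese and English. This places higher demands on information management systems. At present, research on Chinese and English mixed word segmentation is relatively scarce, and no mature theory has formed. A standard and an evaluation system for Chinese and English mixed word segmentation have not been established. On this basis, the paper studies the new features of the Chinese and English mixed form and proposes a new algorithm.

This paper mainly studies the Chinese and English mixed form, its structure, and usage habits. It also presents a practical segmentation algorithm for Chinese and English mixed text. Removing ambiguity is one of the difficulties of segmentation; this paper analyzes it thoroughly and proposes an implementation method for further disambiguation. To solve the maximum word length problem, it proposes determining the maximum word length of RMM by comparing the word lengths in the Hash dictionary entries that begin with the first two characters of the string to be segmented with the length of the text. The experiments indicate that the proposed method segments Chinese and English mixed text effectively. The method not only maintains a high level of disambiguation but also performs well in unknown word identification. Finally, automatic classification of articles was achieved based on the algorithm's segmentation results.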
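
The abstract describes the maximum-word-length idea only in one sentence, so the following is a minimal sketch of one possible reading, not the thesis's actual implementation. It assumes RMM stands for reverse maximum matching, that the Hash dictionary is keyed by the first two characters of each word, and that each bucket records its longest word so that over-long candidates can be skipped without a lookup. The function names (`build_index`, `rmm_segment`), the toy word list, and the fallback that keeps an unmatched run of ASCII letters and digits together are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(words):
    """Group words by their first two characters (hypothetical layout).

    Each bucket stores the words sharing that prefix and the length of the
    longest one, which later bounds the candidate length during matching.
    """
    index = defaultdict(lambda: {"max_len": 0, "words": set()})
    for word in words:
        if len(word) < 2:
            continue  # single characters are produced by the fallback path
        bucket = index[word[:2]]
        bucket["words"].add(word)
        bucket["max_len"] = max(bucket["max_len"], len(word))
    return dict(index)


def is_ascii_alnum(ch):
    return ch.isascii() and ch.isalnum()


def rmm_segment(text, index):
    """Reverse maximum matching over a two-character-prefix index."""
    global_max = max((b["max_len"] for b in index.values()), default=1)
    tokens = []
    end = len(text)
    while end > 0:
        # The candidate can never be longer than the text still unsegmented.
        limit = min(global_max, end)
        match_start = None
        for length in range(limit, 1, -1):
            start = end - length
            bucket = index.get(text[start:start + 2])
            # Skip candidates longer than any word sharing their prefix.
            if bucket is None or length > bucket["max_len"]:
                continue
            if text[start:end] in bucket["words"]:
                match_start = start
                break
        if match_start is None:
            # Assumed fallback: keep an ASCII letter/digit run together,
            # otherwise emit a single character.
            match_start = end - 1
            if is_ascii_alnum(text[end - 1]):
                while match_start > 0 and is_ascii_alnum(text[match_start - 1]):
                    match_start -= 1
        tokens.append(text[match_start:end])
        end = match_start
    tokens.reverse()
    return tokens


index = build_index(["卡拉OK", "我们", "喜欢", "词典", "使用"])
print(rmm_segment("我们喜欢唱卡拉OK", index))  # ['我们', '喜欢', '唱', '卡拉OK']
print(rmm_segment("使用Hash词典", index))      # ['使用', 'Hash', '词典']
```

Because dictionary matches are tried before the fallback, a mixed entry such as 卡拉OK stays one token, whereas splitting purely by script would separate 卡拉 from OK. The sketch performs no disambiguation or name recognition, both of which the abstract treats as separate steps.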
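【Keywords】Chinese and English mixed word segmentation; Hash; RMM; Removing Ambiguity; Unknown word

【Table of contents】
Abstract (Chinese)
Abstract (English)
1 Introduction
1.1 Research background and significance
1.1.1 Research background
1.1.2 Research significance
1.2 Current state of research
1.3 Organization of the thesis
2 Overview of segmentation algorithms
2.1 Basic algorithms for automatic Chinese word segmentation
2.1.1 String-matching-based algorithms
2.1.2 Statistics-based methods
2.1.3 Understanding-based methods
2.2 Ambiguity handling
2.2.1 Definition of ambiguity
2.2.2 Ambiguity detection
2.2.3 Disambiguation algorithms
2.3 Unknown word recognition
2.3.1 Methods of unknown word recognition
2.3.2 Current state of unknown word recognition
2.4 Evaluation of Chinese word segmentation
3 Evaluation system for Chinese-English mixed segmentation
3.1 Reasons for mixed Chinese-English usage
3.2 Characteristics of mixed Chinese-English text
3.2.1 Ambiguity arising from imported English
3.2.2 Heavy use of Internet language
3.2.3 Ambiguity arising from domain specificity
3.2.4 The missing-letter problem
3.3 Evaluation system for Chinese-English mixed segmentation
3.3.1 Adding an evaluation of word error-correction ability
3.3.2 Changes to the unknown-word standard
4 Implementation of the Chinese-English mixed segmentation algorithm
4.1 Common segmentation dictionaries
4.1.1 Dictionary mechanism based on whole-word binary search
4.1.2 Dictionary mechanism based on a TRIE index tree
4.1.3 Dictionary mechanism based on character-by-character binary search
4.2 Experimental results for the three dictionary mechanisms
4.3 Dictionary mechanism adopted in this thesis
4.3.1 Improved dictionary mechanism
4.4 Dictionary implementation
4.4.1 Dictionary composition
4.4.2 Dictionary definition
4.4.3 Loading the basic dictionary
4.4.4 Loading the stop-word dictionary and the surname dictionary
4.5 Chinese-English mixed segmentation algorithm
4.5.1 Initial segmentation algorithm
4.5.2 Segmentation process
4.6 Ambiguity handling
4.6.1 Forms of ambiguity in mixed Chinese-English text
4.6.2 Ambiguity detection
4.6.3 Improved disambiguation algorithm
4.6.4 And disambiguation results
4.7 System function implementation
5 Application of Chinese-English mixed segmentation at the China Wind Energy Information Center
5.1 System overview
5.2 Implementation of segmentation and automatic text classification
5.2.1 Technical categories
5.2.2 Automatic text archiving
6 Conclusion
6.1 Summary of the thesis
6.2 Work remaining to be improved
References
Academic papers published during the period of study
About the author
Acknowledgements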

