Corpora and Foreign Language Learning
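Corpus-Based Vocabulary Learning (Methods and Resources)

Outline
  Introduction to the concept of a corpus
  Major corpora in China and abroad
  Applications of corpora in foreign language teaching and learning
  Free online corpora (COCA, BNC, Lextutor)
  Software tools
  Resource sharing

What is a corpus?
  Corpus = “a body of naturally occurring text”
  The texts were produced without the creator knowing that they would be used for linguistic analysis
  Newspapers, magazine articles, short stories, academic journals, etc.
  Good genre balance (spoken, fiction, magazines, newspaper, academic)
  Current: not 100-year-old novels
  Large: at least 100 million words
  More words than you would see/hear in a whole lifetime
  Annotated: tagged for part of speech and lemma (e.g. the beat, they beat, and beat as)

Definitions of a corpus
  Language data (a corpus) refers to collected language material that has not been organized or processed. (Dai Weidong, 1999)
  Language data (a corpus), also called source material, is a collection of naturally occurring language material, both written and spoken. It can serve as a starting point for describing a language or as a means of verifying hypotheses about a language. (Chen Jiansheng, 1989)
  A corpus is a “warehouse” for storing language material, built for a specific purpose and by a specific method.
  A corpus is a large electronic text collection of a certain size, built according to explicit linguistic principles by random sampling of naturally occurring continuous language, in the form of texts or discourse fragments. In essence, a corpus randomly samples natural language use so that a language sample of a given size represents the population of language use defined for a particular study. (Yang Huizhong, 2002)

Major corpora abroad
  Brown (1963-64): The Brown University Standard Corpus of Present-Day American English. Contains 1 million words of written English from around 1961. Compiled under the direction of Francis and Kucera.
  COBUILD: directed by John Sinclair; one of the largest corpora to date, with more than 500 million words.
  COCA: Corpus of Contemporary American English; more than 400 million words, 1990-2010.
  BNC: British National Corpus; more than 100 million words; Oxford University / Longman / Chambers Harrap.
  ICE: International Corpus of English; one spoken and one written sub-corpus; 1 million words.
  The Bank of English: 250 million words; Longman / Collins / University of Birmingham.

English learner corpora in China
  Name     Type             Built by                                             L1 background   Size (million words)
  HKUST    written          Hong Kong University of Science and Technology       Cantonese       25
  TSLC     written          University of Hong Kong                              Cantonese       3
  CLEC     written          Guangdong University of Foreign Studies and others   Chinese         1
  COLSEC   spoken           Shanghai Jiao Tong University and others             Chinese         0.5
  MSEE     written/spoken   South China Normal University                        Chinese         0.876
  SWECCL   written/spoken   Nanjing University                                   Chinese         2

Chinese Learner English Corpus, CLEC (Gui Shichun and Yang Huizhong, 2003)
  A written English corpus of more than 1 million words, covering secondary school students, College English Band 4 and Band 6 students, and lower-year and upper-year English majors in China. It is a learner corpus annotated for language errors.

Spoken and Written English Corpus of Chinese Learners, SWECCL
  Consists of two sub-projects: the Spoken English Corpus of Chinese Learners (SECCL) and the Written English Corpus of Chinese Learners (WECCL).
  Total size: 2 million words.
  Hosted by Nanjing University (Wen Qiufang, Wang Lifei, and Liang Maocheng 2005: 2).

JDEST
  1980s; the first corpus in China; Shanghai Jiao Tong University; Gui Shichun, Yang Huizhong; academic texts.

Applications of corpora in foreign language teaching and learning
  Rule-based and probability-based practical applications, e.g. automated essay scoring and machine translation
  Research on the target language and on interlanguage
  Dictionary compilation, e.g. Collins Cobuild Advanced Learner's English Dictionary
  Testing
  Textbook writing
  Translation studies
  Language learning: autonomous, research-oriented learning based on large amounts of authentic language input, e.g. synonym discrimination, semantic prosody, colligation, collocation studies, syntactic analysis, discourse analysis

Application examples

Quiz: order by frequency
  vigilant
  flabbergasted
  lost
  rinky-dink
  miserable

Quiz: order by frequency (answers)
  lost (#2691)
  miserable (#5841, “sad, hopeless”)
  vigilant (#11831, “watching over”)
  flabbergasted (#21701, “extremely surprised”)
  rinky-dink (#44681; “small, cheap, worthless”)

Obvious errors: not in corpus
  Corpus of Contemporary American English (COCA): “fall down carefully”: no occurrences

“unrecycling”
  Google: unrecycling (100 hits: lot/little?; they refer to that trashcan picture)
  Corpus of Contemporary American English (COCA): no occurrences
  COCA: other words with *recycl* (recycling, nonrecyclable, etc.)
  x*recyclable: negative words before recyclable

Problems: civilized visitor | set up the ecosystem | ecosystem scenery
  *set up the ecosystem: verbs with ecosystem as an object
  no virtuous near duck

Word meaning: collocates
  slippery near crafty: no occurrences
  adjectives near slippery: dangerous
  arouse: collocates (nearby words) near arouse: suspicions, sexually, anger

Four major difficulties in foreign language learning
  native-like pronunciation
  native way of thinking
  discrimination of synonyms
  idiomatic collocation

Synonym discrimination
  Synonyms can be distinguished by looking at different types of meaning:
  grammatical meaning
  lexical meaning
  denotative meaning
  associative meaning
  connotative meaning
  stylistic meaning
  affective meaning
  collocative meaning

Examples of corpus methods in teaching
  A corpus approach to autonomous learning of advanced English vocabulary
  Using the SketchEngine tool to teach lexical collocation and synonym discrimination
  An empirical study of teaching verb-noun collocations with online corpora

Introduction to free online corpora
  COCA
  BNC
  Lextutor

Corpus of Contemporary American English (COCA; www.americancorpus.org)
  410+ million words (cf. British National Corpus, 100m)
  More words than the average speaker will hear in a lifetime
  From more than 160,000 texts
  20 million words each year from 1990-2010
  Balanced across spoken, fiction, popular magazines, newspapers, and academic journals (20% in each genre each year)
  Freely available online since March 2008
  60,000-70,000 unique users each month
  Complete, context-sensitive help files online

A good article to learn about COCA (in Chinese)
  Wang, Xingfu, Liu Guohui, Mark Davies (2008) The Corpus of Contemporary American English: A Useful Tool for English Teaching and Research. Computer-Assisted Foreign Language Education in China. 5: 24-31.

Composition of COCA
  410+ million words (1990-present): same composition each year
  Spoken (83 million words): Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.).
  Fiction (79 million words): Short stories and plays from literary magazines, children's magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
  Popular Magazines (84 million words): Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.
  Newspapers (79 million words): Ten newspapers from across the US, including USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.
  Academic Journals (79 million words): Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year.

Searching the free online corpus COCA
  COCA online search: home page
  COCA search page
  COCA online search for "seldom"
  "seldom": search results (list view)
  "seldom": search results (chart view)
  COCA online search for "seldom": expanded context example

Searching the free online corpus BNC
  BNC home page
  BNC search page
  BNC online search for "outcome"
  BNC online search for "outcome": search results (list)
  BNC online search for "outcome": search results (chart)
  BNC online search for "outcome": expanded context of a concordance line

Searching the free online corpus Lextutor
  Lextutor multi-corpus online search home page (http)
  Lextutor search for "consequence"
  Lextutor search for "consequence": search results
  Lextutor search for "consequence": refined search results
  Lextutor search for "consequence": expanded context example

Software tools
  Corpus search tools retrieve keywords together with their contexts, so that language learners can see the features discussed above directly and in concentrated form.

Searching for words with AntConc and Wordsmith
  Wordsmith: concordancing software. Provides keyword search, chunk search, etc. Keywords are displayed together with their context. Developed in the UK; paid.
  AntConc: search software developed by Professor Laurence Anthony of Waseda University, Japan; free.
  Other tools: MicroConcord, ConcApp 6.0, VocabProfile, PowerGrep

Key terms
  Key Word in Context (KWIC)
  Concordance lines

AntConc: steps for use
  Open AntConc
  Choose Open Files and load the selected corpus
  Select the tab you need: Concordance, Word List, Keyword, Collocation, etc.
  Enter the word or phrase to search for in the box below
  Run the search
  View the results
  Analyze the results (Reading Concordance)

Thank you!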
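
Supplementary sketch: KWIC in code
  The Key Word in Context display and the collocate searches described in the software-tools and word-meaning slides can be illustrated with a few lines of Python. This is a minimal sketch and not how AntConc, Wordsmith, or COCA work internally. The function names (tokenize, kwic, collocates), the window size, and the two sample sentences are all assumptions made up for illustration; the sentences are not taken from any corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Hyphenated words and contractions (rinky-dink, isn't) stay in one piece.
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def kwic(tokens, keyword, window=3):
    """Return one (left context, keyword, right context) tuple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, keyword, window=3):
    """Count the words that occur within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented sample text (hypothetical, not corpus data).
SAMPLE = (
    "His late arrival began to arouse suspicions among the staff. "
    "The speech did not arouse anger, but it did arouse suspicions."
)

if __name__ == "__main__":
    toks = tokenize(SAMPLE)
    # Concordance lines: keyword aligned in the middle, context on both sides.
    for left, kw, right in kwic(toks, "arouse"):
        print(f"{left:>30}  [{kw}]  {right}")
    # Most frequent nearby words, a rough stand-in for a collocate list.
    print(collocates(toks, "arouse").most_common(5))
```

  Reading the aligned lines top to bottom is the same activity as the "Reading Concordance" step in AntConc. A real tool adds sorting, part-of-speech tags, and statistical measures of collocation strength, none of which this sketch attempts.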