Introduction to Information Retrieval: Chinese Character Processing
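Chinese Processing

So Far, What We Have
- Document similarity
  - "Bag of words" model + term weighting (TF-IDF)
  - Vector Space Model (VSM)
- Co-occurrence
  - Association
  - Link analysis: co-citation and coupling
- Classification
  - Naïve Bayes
  - k Nearest Neighbors
  - Support Vector Machine

Problems in Chinese Processing
- "小明日記:今日王叔叔來我家玩媽媽,說我做完作業後,可以吃點心。然後,王叔叔誇我作業做的好,於是抱起了我媽,媽叫叔叔小心一點,之後叔叔又親了我媽媽,也親了我。"
  (Xiao Ming's diary, with misplaced commas: "Today Uncle Wang came to our house to play with Mom, and said I could have a snack after finishing my homework. Then Uncle Wang praised my homework and so picked up my mom; Mom told Uncle to be careful. After that Uncle kissed my mom, and kissed me too.")
- "老師批復:拿回家讓你爸看看,是標點符號有問題,還是你王叔叔和你媽媽有問題。"
  (Teacher's comment: "Take this home and let your dad see whether the problem is the punctuation, or your Uncle Wang and your mom.")

Problems in Chinese Processing
- New words (out-of-vocabulary, OOV)
  - 九把刀拍了部新電影叫等一個人咖啡 ("九把刀 made a new film called 等一個人咖啡"; both the pen name and the film title are new words)
- Term segmentation (斷詞)
- Disambiguation (消除歧義)
  - 我國代表現在正面臨很大的壓力 ("our country's representative is now facing great pressure"; 代表/現在 competes with 表現)
  - 全臺大停電 (全臺/大停電 "island-wide blackout" vs. 全/臺大/停電 "power outage across all of 臺大")
  - 不可以營利為目的 (不可/以營利為目的 "must not have profit as its purpose" vs. 不/可以 …)
  - 他才能非凡;他才能勝任 (才能 as the noun "talent" vs. 才/能 "only then can")

What's the Difference in Chinese?
- Algorithms in English are based on the "term"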
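  - The main algorithms covered so far all operate on terms
  - Document → Paragraph → Sentence → Term
  - Some expand to phrases
  - Some change to n-grams
- The major differences in Chinese
  - The character set is much larger than in other languages
  - There is no obvious boundary between characters or terms

What's the Difference in Chinese?

              Chinese                   English
  Units       Character (字, 字元)      Letter (字母)
              Term (詞)                 Word (字)
              Phrase (詞組)             Phrase
              Sentence (句子)           Sentence
              Paragraph (段落)          Paragraph
              Document (文件)           Document

  Statistics (Chinese)
  - BIG5: about 5,000 frequently used characters and about 8,000 less frequently used characters
  - Unicode: about 40,000 Chinese characters
  - Zhuyin (注音): 376 syllables in total (not counting the four tones)
  - CKIP: about 130,000 terms of two or more characters
  Statistics (English)
  - Webster Dictionary: 470,000

Problem in Chinese Processing (1): Term Segmentation (斷詞), i.e., the problem of terms competing for characters (搶詞問題)
Example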
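  - 我國代表現在正面臨很大的壓力
  - 我到達文西博物館 (我/到/達文西/博物館 "I went to the Da Vinci museum" vs. 我/到達/文西/博物館)
Solution
  1. Dictionary-based: e.g., longest-word-first matching (長詞優先法)
  2. Rule-based: e.g., grammar rules, word-formation rules, ambiguity-resolution rules
  3. Trained statistical: e.g., word-frequency methods (choose the segmentation with the maximum combined word frequency)
  4. Automatic classification: recast segmentation as a classification problem
Result
  - Methods of types 3 and 4 dominate today; accuracy can exceed 90%

Problem in Chinese Processing (2): Part-of-Speech Tagging (詞性標定)
Example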
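  - 我國代表現在正面臨很大的壓力
  [Figure: candidate part-of-speech tags listed for each term of the sentence, including Nc, Na, Nd, Neqb, Nv, Dfa, VC, VK, VA, VH, VJ, Di and Da; the per-term alignment is not recoverable.]
  Legend: N = noun, V = verb, D = adverb, A = adjective, T = particle
Solution
  1. Trained statistical: e.g., Markov probability models
  2. Automatic classification: recast part-of-speech tagging as a classification problem
Result
  - Accuracy can exceed 90%
  - Many applications can be derived from it
Table: part-of-speech tag set of the Academia Sinica Balanced Corpus (中研院平衡語料庫)

Problem in Chinese Processing (3): Unknown Terms (未知詞), also called Out-of-Vocabulary
Example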
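  - 新鮮人倪安東見面簽唱會歌迷熱情喊凍蒜 ("newcomer 倪安東 holds a meet-and-greet autograph concert; fans enthusiastically shout 凍蒜")
  - 國際運動仲裁庭祕書長瑞伯表示世跆盟可拒仲裁 ("瑞伯, secretary-general of the international sports arbitration court, says 世跆盟 may refuse arbitration")
Solution
  1. Segment first, then handle the unknown portions with word-formation rules or n-gram statistical learning
  2. Skip segmentation and apply a trained statistical method directly
Result
  - Accuracy can reach 70-80% (including part-of-speech tagging)

Tool for Chinese Processing (1)
- Yahoo 斷章取義 API  /cas/
  - Obtain an application account to use the API (currently suspended)
  - Segmentation and part-of-speech tagging
  - Keyword extraction from articles

Tool for Chinese Processing (2)
- eLand ETool
  - Full open API
  - Main functions
    - Automatic keywords
    - Automatic summarization
    - Segmentation and part-of-speech tagging
    - Sentiment detection
  - Trial demo

Automatic Keywords
- Use n-grams to find the longest strings whose characters most often occur together
- A frequency threshold must be specified to define "most often"
- Example: for BACDBCDABACD with threshold T = 1, the final list is
    CD: 3
    BACD: 2
  that is, two keywords are extracted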
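The BACDBCDABACD example can be reproduced with a short Python sketch. This is a simplified stand-in, not necessarily the merging procedure the tool itself uses: it counts every substring of two or more characters, keeps those occurring more than T times, and drops any string that is contained in a longer kept string with the same count. The function name `extract_keywords` and its parameters are hypothetical.

```python
from collections import Counter


def extract_keywords(text, threshold=1, min_len=2):
    """Return {string: count} for the longest frequently repeated strings.

    Assumption: "occurs most often" means count > threshold, and a string
    is dropped when a longer frequent string containing it has the same
    count (the shorter string never appears on its own).
    """
    # Count every substring of length >= min_len (overlaps included).
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, len(text) + 1)
        for i in range(len(text) - n + 1)
    )
    frequent = {s: c for s, c in counts.items() if c > threshold}

    final = {}
    for s, c in frequent.items():
        absorbed = any(
            len(t) > len(s) and s in t and frequent[t] == c
            for t in frequent
        )
        if not absorbed:
            final[s] = c
    return final


print(extract_keywords("BACDBCDABACD", threshold=1))
```

Counting all substrings is quadratic in the text length, which is fine for a slide-sized example; an incremental bigram-merging pass would scale better on real documents.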
Automatic Summarization
- Recombine the important sentences
  - The sentence is the unit
  - Score each sentence by the keywords it contains
  - Select a fixed proportion of sentences by score as the summary

HMM Segmentation
- Hidden Markov Model
  - A statistical model
  - Describes sequence data
[Figure: states S0, S1, S2 emitting observation values O(t-1), O(t), O(t+1), O(t+2), with a Gaussian distribution over observations.]
- Transition probabilities:
          S0    S1    S2
    S0    P00   P01   P02
    S1    P10   P11   P12
    S2    P20   P21   P22
- Observation probabilities

HMM Segmentation: application to Chinese segmentation
- Ex. 缺乏(V) 耐性(N) 是(SHI) 一(N) 項(N) 莫大(A) 的(D) 致命傷(N)
- State: part of speech
- Observation value: term
[Figure: states V, N, SHI, N emitting the terms 缺乏, 耐性, 是, 一.]
- Take the path with the highest probability
[Figure: lattice of candidate states (V, N, A, ...) over the terms 缺乏, 耐性, ...]

Sentiment Detection
- A bag-of-words approach
  - Positive terms: 上好, 一流, 公道, 引人入勝, 方便, 主流, 叫好, 卓越, …
  - Negative terms: 引誘, 太過, 出錯, 失常, 白目, 丟臉, 劣質, …
- Okapi BM25
[Figure: Okapi BM25 formula, annotated with term set, document, document length and avg. document length.]

Sentiment Detection: Associated Attitude
- Build a lexicon of attitude words associated with a target
  - 服務 (service): 傲慢 (arrogant, -1.0), 親切 (friendly, 1.0), 周全 (thorough, 1.0), 敷衍 (perfunctory, -1.0)
  - …這家代理商的服務一點也不周全… ("…this dealer's service is not thorough at all…")
- Attitude reversal
  - In the same sentence, the negation 一點也不 ("not at all") reverses the polarity of 周全

Discussions

Hsin-Hsi Chen 9-21
Chinese Text Retrieval without Using a Dictionary (Chen et al., SIGIR 97)
- Segmentation
  - Break a string of characters into words
- Chinese characters and words
  - Most Chinese words consist of two characters (趙元任)
  - 26.7% unigrams, 69.8% bigrams, 2.7% trigrams (北京, 現代漢語頻率辭典)
  - 5% unigrams, 75% bigrams, 14% trigrams, 6% others (Liu)
- Word segmentation
  - Statistical methods, e.g., mutual information statistics
  - Rule-based methods, e.g., morphological rules, longest-match rules, ...
  - Hybrid methods

Indexing Techniques
- Unigram indexing
  - Break a sequence of Chinese characters into individual ones.
  - Regard each individual character as an indexing unit.
  - GB2312-80: 6763 characters
- Bigram indexing
  - Regard all adjacent pairs of hanzi characters in text as indexing terms.
- Trigram indexing
  - Regard all consecutive sequences of three hanzi characters as indexing terms.

Indexing Techniques (Continued)
- Statistical indexing
  - Collect the occurrence frequency in the collection for all Chinese characters occurring at least once in the collection.
  - Collect the occurrence frequency in the collection for all Chinese bigrams occurring at least once in the collection.
  - Compute the mutual information for all Chinese bigrams.
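\[
\begin{aligned}
I(x,y) &= \log_2 \frac{p(x,y)}{p(x)\,p(y)} \\
       &= \log_2 \frac{f(x,y)/N}{\bigl(f(x)/N\bigr)\,\bigl(f(y)/N\bigr)} \\
       &= \log_2 \frac{f(x,y)\,N}{f(x)\,f(y)}
\end{aligned}
\]
- Strongly related: much larger value
- Not related: close to 0
- Negatively related: negative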
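The statistical indexing steps (character frequencies, bigram frequencies, then mutual information) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `bigram_mutual_information` is hypothetical, N is taken to be the total number of non-whitespace characters, and the same N is used for both character and bigram probabilities, as in the last line of the formula.

```python
import math
from collections import Counter


def bigram_mutual_information(text):
    """Return {bigram: I(c1, c2)} using I = log2(f(c1c2) * N / (f(c1) * f(c2)))."""
    chars = [ch for ch in text if not ch.isspace()]
    n = len(chars)

    # f(c): occurrence frequency of each character.
    char_freq = Counter(chars)
    # f(c1c2): occurrence frequency of each adjacent character pair.
    bigram_freq = Counter(a + b for a, b in zip(chars, chars[1:]))

    return {
        bigram: math.log2(f * n / (char_freq[bigram[0]] * char_freq[bigram[1]]))
        for bigram, f in bigram_freq.items()
    }


# Toy input: "AB" repeats, so A and B are more strongly associated than B and A.
scores = bigram_mutual_information("ABABCD")
for bigram, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(bigram, round(value, 3))
```

On very small inputs, rare pairs such as a bigram whose characters each occur once receive a high score, so in practice a minimum frequency is usually required before a bigram is trusted as an indexing term.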
Strongly related case, where p(x,y) = p(x):
\[
I(x,y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}
       = \log_2 \frac{p(x)}{p(x)\,p(y)}
       = \log_2 \frac{1}{p(y)}
\]
Unrelated case, where p(x|y) = p(x):
\[
I(x,y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}
       = \log_2 \frac{p(x \mid y)}{p(x)}
       = 0
\]

- f(c1): the occurrence frequency of the first Chinese character of a bigram
- f(c2): the occurrence frequency of the second Chinese character
- f(c1c2): the occurrence frequency of the bigram
- I(c1,c2): mutual information
- I(c1,c2) >> 0: c1 and c2 have a strong relationship
- I(c1,c2) ~ 0: c1 and c2 have no relationship
- I(c1,c2) << 0: c1 and c2 have a complementary relationship

Segmentation as Classification
Each character is labeled B (begins a term), I (inside a term), E (ends a term), or S (single-character term):
  我/B 國/E 代/B 表/E 現/B 在/E 正/S 面/B 臨/E 很/B 大/E 的/S 壓/B 力/E
  九/B 把/I 刀/E 不/S 同/B 意/E

Training data: input features and the target

  C-2  T-2  C-1  T-1  C1   T1   C2   T2   C0   T0 (target field)
  我   B    國   E    表   E    現   B    代   B
  國   E    代   B    現   B    在   E    表   E
  …

(from "Chinese Word Segmentation by Classification of Characters", 2005)

Appendix:
Other Language Issues
楊立偉教授 (Prof. 楊立偉), wyang@.tw

Tokenization
- Input: "Friends, Romans and Countrymen"
- Output: tokens
  - Friends
  - Romans
  - Countrymen
- Each such token is now a candidate for further processing: normalization and language processing
- Which information is discarded (or kept) at this stage?
- Processing must be consistent between indexing and query analysis

Tokenization
- Issues in tokenization:
  - Finland's capital → Finland? Finlands? Finland's?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
  - San Francisco: one token or two? How do you decide it is one token?

Numbers
- 3/12/91, Mar. 12, 1991
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e44
- Often, don't index as text
- But often very useful mixed with text, e.g., the product model number Nikon D700 (one answer is using n-grams)

Tokenization: Language Issues
- L'ensemble: one token or two?
  - L? L'? Le?
  - Want l'ensemble to match with un ensemble
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter: 'life insurance company employee'

Tokenization: Language Issues
- Chinese and Japanese have no spaces between words:
  - 莎拉波娃現在居住在美國東南部的佛羅里達。
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled and dates/amounts in multiple formats
  - フォーチュン500社は情報不足のため時間あた$500K(約6,000萬円)
  - Katakana (片假名), Hiragana (平假名), Kanji (漢字), Romaji (羅馬拼音)
- The segmentation problem (斷詞問題)

Normalization
- Need to "normalize" terms in indexed text as well as query terms into the same form
  - We want to match U.S.A. and USA
  - Processing must be consistent between indexing and query analysis
- The alternative is to have multiple tokenizations: mixed language processing and an n-gram approach

Normalization: Other Languages
- Accents: résumé vs. resume
  - Most important criterion: even in languages that standardly have accents, users often may not type them
  - How would you like to present them in the final result?
- German: Tuebingen vs. Tübingen should be equivalent
- 7月30日 vs. 7/30

Case Folding
- Reduce all letters to lower case
  - Exception: upper case (in mid-sentence?), e.g., General Motors, Fed vs. fed, SAIL vs. sail
- One approach is to lowercase everything during analysis while keeping the original form for presentation

Stop Words
- With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
  - They have little semantic content: the, a, and, to, be
  - They take a lot of space: ~30% of postings for the top 30
- But the trend is away from doing this. You need them for:
  - Phrase queries: "King of Denmark"
  - Various song titles, etc.: "Let it be", "To be or not to be"
  - "Relational" queries: "flights to London"

Thesauri and Soundex
- Handle synonyms (同義字) and homonyms (同音字)
  - Hand-constructed equivalence classes, e.g., car = automobile, color = colour
- Rewrite to form equivalence classes. Principle: there are two options, handling them at indexing time or at query time
  1. Index such equivalences. Ex.: when the document contains automobile, index it under car as well (usually, also vice versa)
  2. Expand the query. Ex.: when the query contains automobile, look under car as well

Soundex
- Traditional class of heuristics to expand a query into phonetic equivalents
  - Language specific, mainly for names
  - E.g., chebyshev → tchebycheff

Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
  - the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing "proper" reduction to dictionary headword form

Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping (roughly stripping prefixes and suffixes)
  - Language dependent
  - e.g., automate(s), automatic, automation all
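The normalization steps discussed in this appendix (matching U.S.A. with USA, accent removal, case folding, and an optional stop list) can be sketched in Python. This is an illustrative pipeline under stated assumptions, not a prescribed one: the helper names `strip_accents`, `normalize_token` and `analyze` are hypothetical, the stop list is just the five sample words from the slide, and tokens are split on whitespace and commas only.

```python
import unicodedata

# Sample stop list taken from the slide's examples; real lists are longer.
STOP_WORDS = {"the", "a", "and", "to", "be"}


def strip_accents(text):
    """Remove combining accent marks, e.g. 'résumé' becomes 'resume'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_token(token):
    """Accent removal, case folding, and dropping periods (U.S.A. -> usa)."""
    return strip_accents(token).lower().replace(".", "")


def analyze(text, use_stop_list=False):
    """Apply the same steps at indexing time and at query time."""
    tokens = [normalize_token(t) for t in text.replace(",", " ").split()]
    tokens = [t for t in tokens if t]
    if use_stop_list:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(analyze("Friends, Romans and Countrymen"))
print(normalize_token("U.S.A."), normalize_token("USA"))
print(normalize_token("résumé"))
```

Running the same `analyze` function on documents and on queries is what keeps the two sides consistent. Note that accent stripping maps Tübingen to tubingen, not to Tuebingen, so the German equivalence mentioned above would need an extra rewrite rule, and blanket case folding loses distinctions such as Fed vs. fed.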