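[...] chose [...] as the object of study. In this study, we take Sina Weibo, China's largest microblogging service provider, as the object of study. With eight times more users than Twitter, Sina Weibo [...]. Some of the new features are effective in classification, and even the features considered in previous studies [...] for Sina Weibo and Twitter.

As microblogging platforms keep growing, information is being generated and propagated at an unprecedented rate. [...] Most previous studies have taken Twitter as the premise of research. Considering the popularity of Sina Weibo and [...]. Sina Weibo is China's largest microblogging platform, launched by Sina Corporation in 2009. It now has more than 300 million users (eight times the Twitter figure as of May 2011), who generate 100 million microblogs per day. More than 30% of Internet users use it, and Sina Weibo has become [...] the most popular [...].

[...] widely on Sina Weibo. For example, a rumor stating that "the state announced that per capita income has reached 9000" was forwarded on a massive scale, and more than 200 thousand microblogs related to it appeared on Sina Weibo. With respect to rumor analysis and detection, there are several important differences between Sina Weibo and Twitter that must [...]. The types of topics that get forwarded differ between the two platforms: on Sina Weibo, most topics are formed by retweets of media content such as jokes [...], whereas on Twitter, topics mostly [...]. [...] it clarifies widely spread rumors, whereas Twitter has no such service. For example (see Figure 1), this is a microblog about [...] declaring war on Iran on January 23, 2012 [...].

[...] to build the corpus, together with the microblogs related to these rumors. In total, 19 features can be extracted from each microblog, including the content, the client, the user account, the location, and the number of replies and retweets. We [...]

The rest of the paper is organized as follows: Section 2 gives an overview of related work, Section 3 [...] 5 [...], and Section 6 concludes.

[...] reflects a "[...] process". In the past, rumors had to spread by word of mouth, but the rise of social media provides [...]. Recently there have been some studies on rumors and information credibility on Twitter. Castillo et al. focus on automatically assessing the credibility of a given set of tweets. They analyze the collected microblogs related to "trending topics" and use a supervised learning method (J48 decision tree) to [...]. [...] focus on two tasks: [...] different classifiers are built on different feature sets, and then [...]. Mendoza et al. [...] analyze users' retweet network and find the difference between Twitter and traditional platforms.

[...] (2) [...] aggregates computed from [...] features; (4) propagation-based features, which consider [...] the propagation tree (which can be built from [...]). Qazvinian et al. use three types of features: content-based features, network-based features, and microblog-specific memes. For content-based features, they follow Hassan et al. and classify [...]. [...] user i is under a positive user model. The other feature is the log-likelihood ratio, where the tweet is retweeted from a user j who is under a positive user model. Finally, the Twitter-specific meme features studied in [12] are extracted from content particular to Twitter: hashtags and URLs.

Qazvinian et al. use regular expression queries [...] from 2009 to 2010. Each query corresponds to a widely circulated rumor that is listed as "false" [...] to analyze which tweets match the regular expression query submitted to the API but are unrelated to the rumor. Then [...] the annotated rumor-related data [...], the tweet is marked "12". They use the second annotated set to detect how strongly users believe the rumors. Castillo et al. collect data with keyword-based queries through the interface provided by [...]. [...] use Amazon Mechanical Turk, a crowdsourcing website that enables netizens to co-ordinate the use of [...].

As of February 2011, Sina Weibo reports that its users post more than 100 million microblogs per day. Sina Weibo has a rumor-busting account, which provides a service that other providers do not offer. It [...] event, we use the keyword-based query form defined by Twitter Monitor. The query has the form A ∧ B, where A is a conjunction of event participants and B is a disjunction of descriptive information about the event. For example, the query (US ∧ Iran) ∧ (declare ∨ war) refers to the rumor about the U.S. declaring war on Iran on January 23, 2012. [...] the number of rumor-related microblogs is quite high.

[...] "1", and with "-1" otherwise. We manually processed 5,144 microblogs, only 7 of which match the keywords but are unrelated to the rumor topics. Among the microblogs related to rumors, about 18.3% are labeled "1".

\[ \kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \]

[...] Content-based features consider factors related to the microblog content, including whether it contains [...] and URLs, the sentiment of the microblog (measured by the number of positive or negative emoticons used), [...] (82.3%) [...] client (for example the web version or timed posting tools), then this microblog is very likely a rumor. For example, on January 22, 2012 (the Chinese Lunar New Year), a microblog about [...] declaring war on Iran was forwarded 949 times in less than 12 hours, 77.77% of which were posted through [...].

[...] the hypothesis of independence between the client feature and truthfulness. Table 3 lists the formulas and notation used in the test. The null hypothesis is that the client feature and the truthfulness of a microblog are statistically independent. The observed frequency O_{i,j} corresponds to the i-th value of the client feature and the j-th value of truthfulness. E_{i,j} is the expected frequency under the assumption that they are independent. As shown in Table 4, the degrees of freedom is d = 1. [...] In our case, with α = 0.05 and d = 1, the χ² value computed through the inverse function is 3.[...]. In Table 4 we calculate the expected frequency of each cell. The test statistic (chi-square value) [...] has a significant relationship with truthfulness, so the client feature can be used in the classification task.

[...] the impact on the experimental results after adding the two new features. Precision, recall and F-score are [...] evaluation measures. For example, in information retrieval, precision is the fraction of retrieved documents that are relevant to the user's search, recall is the number of relevant documents retrieved divided by the number of all existing relevant documents, and F-score is a trade-off between precision and recall.

[...] the experimental results of the SVM classifier [...]. The terms positive and negative refer to the prediction of the SVM classifier (for example, positive means that a related microblog is assigned to the non-[...] class, and negative means that it is assigned to the [...] class). True and false refer to whether the classification result is consistent with [...]. To facilitate understanding, we use T_ti for true positive and F_ti for false [...] the judgment of the [...] service, where ACTUAL ti (fi) denotes microblogs proven true (false) by [...] or by known facts, and PREDICTED denotes the SVM classifier's classification of the related microblogs. For example, T_fi [...] the SVM classifier assigns a related microblog to the false information class, [...]. Precision and recall are defined in Table 6. Since we do not weight precision and recall differently, we use the traditional F-score, i.e. the harmonic mean of precision and recall:

\[ F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

[...] 7 shows the experimental results. The results show that, among the features considered, the account-based features [...]. [...] the dataset we crawled from the API contains only two levels of retweet relationship, namely the original microblog and the last [...]. So through user features, such as whether the user's account is verified, the number of the user's friends, [...] has passed verification and has many friends (followers), then [...] posted by this account [...]. [...] is not a real person, or has not been verified, then if it posts a microblog related to a controversial [...], that microblog is very likely [...]. [...] features, the classification accuracies are 72.5780%, 72.6252% and 72.3444% respectively.

We add the client feature and the event location feature respectively for classification (with the content-based, account-based and propagation-based features still present) to test the effectiveness of the newly added features. To illustrate the effect on classification accuracy, [...] shows [...] after adding the client feature and [...] the clear advantage of the two newly added features in the classification task.

The huge volume and rapid spread of [...] make it important to find a tool that can automatically assess credibility. In this paper, based on the information provided by [...], we collect [...] the event location feature. These two features can be extracted from [...] and used for classi[...].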
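The independence test described earlier in this section (observed frequencies O_{i,j}, expected frequencies E_{i,j} under the null hypothesis, d = 1, α = 0.05) can be illustrated with a minimal sketch. The counts below are hypothetical placeholders rather than the paper's data, the function names are chosen for illustration only, and the critical value 3.841 is the standard upper 5% point of the chi-squared distribution with one degree of freedom.

```python
def chi_squared_statistic(observed):
    """Pearson's chi-squared statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count E_ij under the null hypothesis of independence.
            e_ij = row_totals[i] * col_totals[j] / total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    return statistic


def degrees_of_freedom(observed):
    """d = (r - 1) * (c - 1) for a table with r rows and c columns."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Hypothetical counts, NOT the paper's data.
# Rows: two values of the client feature; columns: true info, false info.
observed = [[30, 10],
            [20, 40]]

# Standard critical value for alpha = 0.05 and d = 1.
CRITICAL_VALUE = 3.841

statistic = chi_squared_statistic(observed)
d = degrees_of_freedom(observed)
reject_null = d == 1 and statistic > CRITICAL_VALUE
print(f"chi2 = {statistic:.4f}, d = {d}, reject H0: {reject_null}")
```

With these placeholder counts the statistic exceeds the critical value, so the null hypothesis of independence would be rejected. A statistic below the critical value would give no grounds for rejecting it.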
[...] Talent Program, an NSERC Discovery Grant, and the Independent Innovation Foundation of Shandong University (2012Z012, 200T016, [...]).

[1] P. Bordia and N. DiFonzo. Problem solving in social interactions on the internet: Rumor and social cognition. Social Psychology Quarterly, 67(1):33-49, 2004.
[2] J. Carletta. Assessing agreement on classification tasks: The kappa statistic. [...]
[3] C. Castillo, M. Mendoza, and B. Poblete. Information credibility on Twitter. In WWW, [...] 684, 2011.
[4] [...]
[5] T. S. Ferguson. A Course in Large Sample Theory. Chapman and [...]
[6] A. Hassan, V. Qazvinian, and D. R. Radev. What's with the attitude? Identifying sentences with attitude in online discussions. In EMNLP, pages 1245-1255, 2010.
[7] M. Mathioudakis and N. Koudas. TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 1155-1158, New York, NY, USA, 2010. ACM.
[8] M. Mendoza, B. Poblete, and C. Castillo. Twitter under crisis: can we trust what we [...]? In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, pages 71-79, New [...]
[9] M. R. Morris, S. Counts, A. Roseway, A. Hoff, and J. Schwarz. Tweeting is believing?: understanding microblog credibility perceptions. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW '12, pages 441-450, New York, NY, USA, 2012. ACM.
[10] W. Peterson and N. Gist. Rumor and public opinion. American Journal of [...]
[11] V. Qazvinian, E. Rosengren, D. R. Radev, and Q. Mei. Rumor has it: Identifying misinformation in microblogs. In EMNLP, pages 1589-1599, 2011.
[12] J. Ratkiewicz, M. Conover, M. Meiss, B. Goncalves, S. Patil, A. Flammini, and F. Menczer. Detecting and tracking the spread of astroturf memes in microblog [...]

Automatic Detection of Rumor on Sina Weibo
[...] Yang
Xiaohui Yu 1,2,3,* Min Y[...]
[...] Jinan, [...]
3 Shandong Provincial Key Laboratory of Software [...]

The problem of gauging information credibility on social networks has received considerable attention in recent years. Most previous work has chosen Twitter, the world's largest micro-blogging platform, as the premise of research. In this work, we shift the premise and study the problem of information credibility on Sina Weibo, China's leading micro-blogging service provider. With eight times more users than Twitter, Sina Weibo is more of a [...] than a pure Twitter clone, and exhibits several important characteristics that distinguish it from Twitter. We collect an extensive set of microblogs which have been confirmed to be false rumors based on information from the official rumor-busting service provided by Sina Weibo. Unlike previous studies on Twitter, where the labeling of rumors is done manually by the participants of the experiments, the official nature of this service ensures the high quality of the dataset. We then examine an extensive set of features that can be extracted from the microblogs, and train a classifier to automatically detect the rumors from a mixed set of true information and false information. The experiments show that some of the new features we propose are indeed effective in the classification, and even the features considered in previous studies have different implications with Sina Weibo than with Twitter. To the best of our knowledge, this is the first study on rumor analysis and detection on Sina Weibo.

H.2.8 [Database Management]: Database [...]

* Corresponding [...]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MDS '12, August 12, 2012, [...]
Rumor Detection, Sina Weibo, [...]

With the rise of micro-blogging platforms, information is generated and propagated at an unprecedented rate. The automatic assessment of information credibility therefore becomes a critical problem, because there are often not enough resources to manually identify the misinformation about a controversial, widely spreading news item within the volume of fast evolving [...]. Whereas most previous work has used Twitter as the premise of study, in this work we choose to study the problem of automatic rumor detection on Sina Weibo, due to its wide popularity and unique characteristics.

Sina Weibo is China's largest micro-blogging service. Launched by Sina Corporation in late 2009, Sina Weibo now has more than 300 million registered users (eight times more than Twitter as of May 2011), generating 100 million microblogs per day^1. Sina Weibo is used by more than 30% of Internet users, and is one of the most popular websites in China.

Rumors present a serious concern for Sina Weibo. Statistics show that at least one rumor is widely spread on Sina Weibo every day. For example, at the end of April 2011 a rumor stating that "the National Statistics Bureau announced that China's urban per capita income has reached the 9000 RMB mark"^2 was forwarded on a large scale. There are about 200 thousand microblogs about that [...].

There are some major differences between Sina Weibo and Twitter with respect to rumor analysis and detection, which must be taken into consideration: (1) Some linguistic features that are studied in previous work for English tweets, such as the case sensitivity of English words, repeated letters, and word lengthening, do not apply to the Chinese language that dominates Sina Weibo. (2) The types of trending microblogs retweeted (forwarded) are different in Sina Weibo than in Twitter. In Sina Weibo, most trends are created by retweets of media content such as jokes, images and videos, whereas on Twitter, the trends tend to have more to do with current global events and news stories. (3) Sina Weibo has an official service for rumor busting (with the user name "Weibo Rumor-Busting" if translated into English), which focuses on busting widespread rumors, while Twitter does not have this type of service. For instance (see Figure 1), this is a rumor-related microblog about the United States officially declaring war on Iran on January 23, 2012. The original message was forwarded 3607 times and received 1572 comments. The bottom left of the post shows the program used to post the microblog, which in this case is the web-based program. As Sina Weibo provides this authoritative source for verifying information, almost all of the data we collected relates to widely spread rumors. We classify these rumor-related microblogs into two sets and label each microblog as true information (the orientation of the microblog is not in accordance with the rumor) or false information (the orientation of the microblog is in accordance with the rumor).

Figure 1: Instance of Sina Weibo Rumor-[...]

1 The Sina Corporation annual report 2011 is available (in Chinese) at ht[...]
2 (In Chinese) [...]

In this paper, we formulate the problem of rumor detection as a classification problem, and build classifiers based on a set of features related to the specific characteristics of the Sina Weibo micro-blogging service. The corpus is built by collecting the rumors that are announced by Sina Weibo's official rumor-busting service, along with the microblogs related to those rumors. In total, 19 features are extracted from each microblog, including the content, the micro-blogging program used, the user account, the location, the number of replies and retweets, etc. We find that the program used for microblogging and the event location, two features that have not been previously studied, are particularly useful in classifying rumors on Sina Weibo. Our experiments also show some interesting results with respect to the effectiveness of various features.

The rest of the paper is organized as follows: in Section 2 [...] how we collect and annotate data. In Section 4 we show how to analyze and extract features based on the rumor-related topics announced by Sina Weibo's rumor-busting account, and provide a description of two new features, the client program used and the event location. In Section 5 we present the experimental results. Section 6 concludes this paper.

There is an extensive body of related work on misinformation detection. In this section, we focus on providing a brief review of the work most closely related to our study. We outline related work in three main areas: rumor
analysis, features for classification, and data collection and annotation.

Analyzing Rumor

Rumor has been a research subject in psychology and social cognition for a long time. It is often viewed as an unverified [...] public concern [10]. Bordia et al. [1] propose that transmission of rumor is probably reflective of a "collective explanation process". In the past, rumors could only spread by word of mouth. The rise of social media provides an even better platform for spreading rumors.

Some recent studies analyze rumors and information credibility on Twitter, the world's largest micro-blogging platform. Castillo et al. [3] focus on automatically assessing the credibility of a given set of tweets. They analyze the collected microblogs that are related to "trending topics", and use a supervised learning method (J48 decision tree) to classify them as credible or not credible. Qazvinian et al. [11] focus on two tasks. The first task is classifying those rumor-related tweets that match the regular expression of the keyword query used to collect tweets on Twitter Monitor. The second task is analyzing users' believing behaviour about those rumor-related tweets. They build different Bayesian classifiers on various subsets of features and then learn a linear function of these classifiers for retrieval of those two sets. Mendoza et al. [8] use tweets to analyze the behavior of users under [...] events such as the Chile earthquake in 2010. They analyze users' retweeting topology network and find that the rumor diffusion pattern in the Twitter environment differs from that on traditional news platforms.

Feature extraction is an important step in a classification task. Generally speaking, various sets of features are extracted from different corpora. Castillo et al. [3] use four types of features: (1) message-based features, which consider characteristics of the tweet content and can be categorized as Twitter-independent and Twitter-dependent; (2) user-based features, which consider characteristics of Twitter users, such as registration age, number of followers, number of friends, and number of user posted tweets; (3) topic-based features, which are aggregates computed from [...] and user-based features; and (4) propagation-based features, which consider attributes related to the propagation tree that can be built from the retweets of a specific [...].

Qazvinian et al. [11] use three sets of features: content-based features, network-based features, and Twitter-specific memes. For content-based features, they follow Hassan et al. [6], and classify tweets with two different patterns: lexical patterns and part-of-speech patterns. For network-based features, they build two features to capture four types of network-based properties. One is the log-likelihood that user i is under a positive user model, and the other is the log-likelihood ratio that the tweet is retweeted from a user j who is under a positive user model rather than a negative user model. Finally, the Twitter-specific meme features that have been studied in [12] are extracted from memes which are particular to Twitter: hashtags and URLs.

For our work, we consider some features that have been proposed in previous work, such as the number of posted microblogs or retweeted microblogs. We also propose new features, the location of the event and the client program used for posting the microblog, which have not been studied in previous work.

Methods For Data Collection and Annotation

Qazvinian et al. [11] use Twitter's search API with regular expression queries, and collect data from the period of 2009 to 2010. Each query corresponds to a popular rumor that is listed as "false" or only "partly true" on [...] Urban Legends reference site. During the annotation process, they let two annotators scan the dataset and label each tweet with a "1" if it is related to any of the rumors, and with a "0" otherwise. They use this annotation to analyze which tweets match the regular expression query posed to the API but are not related to the rumor. They then ask the annotators to mark each tweet in the previously annotated rumor-related dataset with "11" if the user believes the rumor and with "12" if the user does not believe it or remains neutral. They use the second annotated dataset to detect users' beliefs in rumors.

Castillo et al. [3] use the keyword-based query interface provided by Twitter Monitor to collect data. They separate the collected topics into two broad types: news and conversation. For annotation, they use Amazon Mechanical Turk, a crowdsourcing website that enables netizens to co-ordinate the use of human intelligence to perform tasks that computers are unable to do yet.

As of February 2011, Sina Weibo reports that its registered users post more than 100 million microblogs per day. This makes Sina Weibo an excellent case for analyzing disinformation in online social networks. We first build a high quality dataset by using Sina Weibo's official rumor busting service. The microblogs we collected consist of true infor[...] events, and almost all of them are relevant to the rumor topics announced by the rumor busting service, and also [...] the work of labeling the dataset. Therefore, in this work, the labeling is done by an authoritative source, avoiding the errors in judgment that arise when human participants annotate. This section describes how we collected a set of messages related to rumor events from Sina Weibo.

Data Collection

Sina Weibo has an official rumor busting account, a unique function that other microblogging services do not have. The topics it announces as rumors are all confirmed false information that is related to controversial events and has been widely spread. For every event considered, we use the form of keyword-based query defined by Twitter Monitor [7]. The form of the query is A ∧ B, where A is a conjunction of event participants and B is a disjunction of some descriptive information about the event. For example, the query (US ∧ Iran) ∧ (declare ∨ war) refers to the rumor about the U.S. officially declaring war on Iran on January 23, 2012. We collect microblogs matching
the [...] in the topics published by the rumor busting account from March 1, 2010 to February 2, 2012. The dataset thus collected can be divided into two subsets: one that contains microblogs related to the rumors, and the other that contains those microblogs that match the querying keywords but are not directly related to the specific rumor. As the querying keywords are based on the topics announced by the official account, the number of rumor-related microblogs in the collected dataset is quite high.

Data Annotation

We ask two annotators to go through all microblogs in the dataset independently and eliminate microblogs that are not related to any rumor topics published by Sina Weibo's official rumor-busting account. We also ask the annotators to label each microblog kept with "1" if the orientation of the microblog is in accordance with the rumor, and with "-1" otherwise. We manually processed 5,144 microblogs, only 7 of which match the querying keywords but are not related to the rumor topics. Moreover, among those microblogs that are related to rumors, about 18.3% are labeled with "1".

We calculate the κ statistic to measure the inter-rater agreement. The κ statistic is defined as

\[ \kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \]

where Pr(a) is the relative observed agreement among annotators, and Pr(e) is the probability of chance agreement [2][4]. In our case, we have κ = 0.95 with confidence interval C.I. = 95%, demonstrating that the two annotators can reach a high level of agreement in identifying rumors.

We identify a set of features that can be extracted from the microblogs for the purpose of classification. These include several features that are specific to the Sina Weibo platform, but most of them are quite general and can be applied to other platforms. Some of the features have been studied in previous works [3][11][9]. In addition, we propose two new features that have not been studied in previous works. The set of features is listed in Table 1. We divide these features into five types: content-based features, client-based features, account-based features, propagation-based features, and location-based features. In what follows, we first describe the features that have been proposed in previous work and are adopted in our study, and then provide a detailed description of the newly proposed features.

Content-based features consider attributes related to the microblog content, which include whether it contains a picture or URL, the sentiment of a microblog (measured by the number of positive/negative emoticons used), and the time interval between the microblog's time of posting and the user's registration time. Account-based features consider the characteristics of users, which can be personal dependent or personal independent. Personal dependent features include whether the user's identity is verified, whether the user has a personal description, the gender of the user, the age of the user, the type of user name and the user's logo. We found that among the confirmed rumor topics, the proportion of microblogs posted by non-organizational users that have the default or a cartoon logo is particularly high. Personal independent features include the number of followers, the number of friends, and the number of microblogs which have been posted by the user. Propagation-based features [...] propagation of the rumor, such as whether the microblog is an original post or a retweet of another microblog, the number of comments, and the number of retweets it has re[...].

Table 1: Description of [...]

Content-based features
  HAS MULTIMEDIA: Whether the microblog contains pictures, videos, or [...]
  [...]: The numbers of positive and negative emoticons used in the [...]
  HAS [...]: Whether the microblog includes a URL pointing to an external source
  TIME [...]: The time interval between the time of posting and user registration
Client-based features
  [...] PROGRAM USED: The type of program used to post a microblog: [...]
Account-based features
  IS [...]: Whether the user's identity is verified by Sina Weibo
  HAS DESCRIPTION: Whether the user has personal descriptions
  GENDER OF USER: The user's [...]
  USER AVATAR TYPE: Personal, organization, and [...]
  NUMBER OF FOLLOWERS: The number of the user's [...]
  NUMBER OF FRIENDS: The number of users who have a mutual following relationship with this user
  NUMBER OF MICROBLOGS POSTED: The number of microblogs posted by this user
  REGISTRATION TIME: The actual time of user [...]
  USER NAME [...]: Personal real name, organization name, and [...]
  [...]: The location information taken at the user's [...]
Location-based features
  EVENT [...]: The location where the event mentioned by rumor-related microblogs [...]
Propagation-based features
  IS [...]: Whether the microblog is original or is a retweet of another microblog
  NUMBER OF COMMENTS: The number of comments on the microblog
  NUMBER OF RETWEETS: The number of retweets of the [...]

New [...]

The client-based feature refers to the client program that the user has used to post a microblog. It has two types: non-mobile client programs and mobile client programs. The non-mobile type includes the Sina Weibo web app, timed-posting tools and embedded Sina Weibo third-party applications. The mobile type includes mobile phone based [...] and Tablet PC based [...].

The location-based feature refers to the actual place where the event mentioned by the rumor-related microblogs has happened. We distinguish between two types of locations, domestic (in China) and foreign.

For the aforementioned microblog dataset, the distributions of values of the two features, the client program used and the event location, are shown in Figure 2 and Figure 3 respectively. As shown in Figure 2, about 71.8% of false information is posted by non-mobile client programs. In our collected rumor-related microblogs, there is a significant difference in the proportion between domestic and foreign events for true and false information, as shown in Figure 3. For microblogs containing false information, about 56.1% of the events occurred abroad. For those containing true information, on the other hand, the majority of the events (82.3%) are domestic.

In addition, we find that if a microblog describes an event that happened abroad and the client program used is non-mobile (such as Web-based or timed posting tools), then
it is a rumor with high probability. For example, on January 22, 2012 (the Chinese New Year), there appeared a microblog about the United States formally declaring war against Iran. It was forwarded (retweeted) 949 times in less than 12 hours, among which 77.77% were posted by Web-based or other timed-posting [...], a much higher percentage than the average usage frequency of those [...].

As the content-based features, account-based features, and propagation-based features have been studied in previous works [3][11], here we only assess the effectiveness of the two new features that we propose. In order to test whether the two proposed features are significant indicators of the truthfulness of microblogs, we use Pearson's chi-squared test (χ²) to perform the test of independence between the client program feature and the truthfulness; the same is done for the event location feature as well.

For the client program feature, we state the null hypothesis and the alternative hypothesis about the independence between the client program feature and the microblogs' truthfulness in Table 2. The formulas and notation used for the test are summarized in Table 3. The null hypothesis is that the client program used is statistically independent of the truthfulness of a microblog. The observed frequency, O_{i,j}, is the frequency of a microblog taking the i-th value of the client program and the j-th value of truthfulness. E_{i,j} is the expected frequency of this combination assuming they are independent. As shown in Table 4, the degrees of freedom is d = 1.

For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as grounds for rejecting the null hypothesis [5]. For our case, the χ² value is calculated by the inverse function with α = 0.05 and d = 1, which results in 3.[...]. We calculate the expected frequency of each cell, as shown in Table 4. The test statistic (chi-square value) is greater than the threshold: (χ² = [...]) > (χ²_{α=0.05,d=1} = [...]). Therefore, we reject the null hypothesis H0: the client program used feature is independent of the truthfulness of a microblog. This clearly indicates that the client program feature has a nontrivial relationship with the truthfulness, and can be used as a feature in the rumor classification task.

We can similarly perform the independence test between the event location feature and the microblog's truthfulness; the result is shown in Table 5. The test also confirms that the event location is not independent of the truthfulness, and can be used as a good indicator for classification.

Table 2: Hypothesis Test of the Independence between [...] Program Feature and Microblogs' [...]
  Null hypothesis (H0): The program used feature is independent of the truthfulness of a microblog
  Alternative hypothesis: The program used feature is not independent of the truthfulness of a microblog

Table 3: Summary of Notations Used in the Independence χ² [...]

\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}, \qquad E_{i,j} = \frac{\left(\sum_{n=1}^{c} O_{i,n}\right) \cdot \left(\sum_{n=1}^{r} O_{n,j}\right)}{N} \]

  d: The degrees of freedom, whose value is equal to (r − 1) · (c − 1)
  α: The degree of [...]
  r: The number of the table's rows
  c: The number of the table's columns
  O_{i,j}: An observed [...]
  E_{i,j}: An expected frequency, asserted by the null hypothesis
  n: The number of cells in the [...]
  N: The total sample size (the sum of all cells in the [...])

Figure 2: The distribution of [...] program used on Sina [...]

Table 4: Test of Independence between [...] Program Used and Truthfulness
  Observed (True Info., False Info.): [...]
  Expected: [...] 2472.164688, 1301.835312 [...]
  Test statistic: χ² = Σ (O_{i,j} − E_{i,j})² / E_{i,j} = [...]

Figure 3: [...]

In order to better understand the impact of various categories of features on identifying the truthfulness of rumor-related microblogs, we conduct two sets of experiments on the features mentioned above to measure their effect. In the first set of experiments, we train a classifier using specific [...]
well those subsets of features perform in rumor detection on Sina Weibo. In the second set of experiments, we study the impact of incorporating the two newly proposed features.

Effect of Previously Proposed [...]

We first consider the three subsets of features that have been proposed in the literature: content-based features, account-based features, and propagation-based features. We train an SVM classifier with an RBF kernel function (γ = 0.313, obtained through a 10-fold cross validation strategy) using each of the three above-mentioned subsets of features in turn, to measure the impact of those features on the classification performance for the rumor-related corpus. For example, in the first experiment, we only use the content-based features; [...]

Table 5: Test of Independence between Event Location and [...]
  Observed (True, False): [...]
  Expected: [...] 1052.047499 [...]
  Test statistic: χ² = Σ (O_{i,j} − E_{i,j})² / E_{i,j} = [...]

Table 6: The Notation of Evaluation Measure Used in the [...]
  Predicted Class / Actual [...]

Table 7: The Evaluation Measure of Different Subsets of Features on the Classification Performance
  Content-based [...]

[...] related microblog is consistent with the rumor as identified by the Sina Weibo rumor busting service. In order to facilitate understanding, we use T_ti [...]
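The evaluation notation of Table 6 compares predicted and actual classes, which is the input needed for precision, recall and F-score. The sketch below shows how those three measures follow from the counts of true positives, false positives and false negatives for one class. The function name and the counts are hypothetical and serve only as an illustration; they are not results from the experiments.

```python
def precision_recall_f_score(true_pos, false_pos, false_neg):
    """Precision, recall and their harmonic mean (F-score) for one class."""
    predicted_pos = true_pos + false_pos   # everything the classifier labeled positive
    actual_pos = true_pos + false_neg      # everything that really is positive
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts, NOT taken from the paper's tables.
precision, recall, f_score = precision_recall_f_score(
    true_pos=80, false_pos=20, false_neg=40)
print(f"precision = {precision:.4f}, recall = {recall:.4f}, F = {f_score:.4f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a classifier cannot obtain a high F-score by maximizing precision at the expense of recall, or the reverse.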