Search Engine
Zhu, Tingshao (朱廷劭), Ph.D

Examples of search engines
- Conventional (library catalog). Search by keyword, title, author, etc.
- Text-based (Lexis-Nexis, Google, Yahoo!). Search by keywords. Limited search using queries in natural language.
- Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
- Question answering systems (Ask, NSIR, Answerbus). Search in (restricted) natural language.
- Clustering systems (Vivisimo, Clusty)
- Research systems (Lemur, Nutch)

What does it take to build a search engine?
- Decide what to index
- Collect it
- Index it (efficiently)
- Keep the index up to date
- Provide user-friendly query facilities

What else?
- Understand the structure of the web for efficient crawling
- Understand user information needs
- Preprocess text and other unstructured data
- Cluster data
- Classify data
- Evaluate performance

How Search Engines Work
Three main parts:
- Gather the contents of all web pages (using a program called a crawler or spider)
- Organize the contents of the pages in a way that allows efficient retrieval (indexing)
- Take in a query, determine which pages match, and show the results (ranking and display of results)

Standard Web Search Engine Architecture
[Diagram: crawler machines crawl the web; check for duplicates, store the documents (DocIds); create an inverted index; search engine servers take the user query against the inverted index and show results to the user]

Spiders (crawlers)
How to find web pages to visit and copy?
- Can start with a list of domain names and visit the home pages there.
- Look at the hyperlinks on the home page, and follow those links to more pages.
- Use HTTP commands to GET the pages.
- Keep a list of URLs visited, and those still to be visited.
- Each time the program loads a new HTML page, add the links in that page to the list to be crawled.

Four Laws of Crawling
- A crawler must show identification
- A crawler must obey the robots exclusion standard /wc/norobots.html
- A crawler must not hog resources
- A crawler must report errors

Example robots.txt file
/robots.txt (just the first few lines)

Lots of tricky aspects
- Servers are often down or slow
- Hyperlinks can get the crawler into cycles
- Some websites have junk in the web pages
- Many pages now have dynamic content
  - The "hidden" web
  - E.g., you don't see the course schedules until you run a query.
- The web is HUGE

"Freshness"
- Need to keep checking pages
  - Pages change (25%, 7% large changes)
    - At different frequencies
    - Who is the fastest changing?
  - Pages are removed
- Many search engines cache the pages (store a copy on their own servers)

What really gets crawled?
- A small fraction of the Web that search engines know about; no search engine is exhaustive
- Not the "live" Web, but the search engine's index
- Not the "Deep Web"
- Mostly HTML pages but other file types too: PDF, Word, PPT, etc.

Index (the database)
Record information about each page:
- List of words
  - In the title?
  - How far down in the page?
  - Was the word in boldface?
- URLs of pages pointing to this one
- Anchor text on pages pointing to this one

Inverted Index
How to store the words for fast lookup.
Basic steps:
- Make a "dictionary" of all the words in all of the web pages
- For each word, list all the documents it occurs in.
- Often omit very common words ("stop words")
- Sometimes stem the words (also called morphological analysis)
  - cats -> cat
  - running -> run

Inverted Index Example
Image from /documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html

Inverted Index
- In reality, this index is HUGE
- Need to store the contents across many machines
- Need to do optimization tricks to make lookup fast.

Results ranking
The search engine receives a query, then
- looks up the words in the index and retrieves many documents, then
- rank-orders the pages and extracts "snippets" or summaries containing query words.
- Most web search engines assume the user wants all of the words (Boolean AND, not OR).
These are complex and highly guarded algorithms unique to each search engine.

Some ranking criteria
For a given candidate result page, use:
- Number of matching query words in the page
- Proximity of matching words to one another
- Location of terms within the page
- Location of terms within tags e.g. , , link text, body text
- Anchor text on pages pointing to this one
- Frequency of terms on the page and in general
- Link analysis of which pages point to this one
- (Sometimes) Click-through analysis: how often the page is clicked on
- How "fresh" is the page
Complex formulae combine these together.

Machine Learned Ranking
Goal: automatically construct a ranking function
- Input:
  - Large number of training examples
  - Features that predict relevance
  - Relevance metrics
- Output:
  - Ranking function
Enables a rapid experimental cycle
- Scientific investigation of
  - Modifications to existing features
  - New features

Ranking Features (from Jan Pedersen's lecture)
  A0 - A4     anchor text score per term
  W0 - W4     term weights
  L0 - L4     first occurrence location (encodes hostname and title match)
  SP          spam index: logistic regression of 85 spam filter variables (against relevance scores)
  F0 - F4     term occurrence frequency within document
  DCLN        document length (tokens)
  ER          Eigenrank
  HB          extra-host unique inlink count
  ERHB        ER*HB
  A0W0 etc.   A0*W0
  QA          site factor: logistic regression of 5 site link and url count ratios
  SPN         proximity
  FF          family friendly rating
  UD          url depth

Ranking Decision Tree (from Jan Pedersen's lecture)
[Figure: decision tree. Internal nodes test A0w0 against 22.3, L0 and L1 against 18+1, L1 against 509+1, W0 against 856, F0 against 2 + 1, and F2 against 1 + 1, branching Y/N; leaves carry scores R = 0.0015, -0.0545, -0.2368, -0.0199, -0.1185, -0.0039, -0.1604, -0.0790.]

The importance of anchor text
i141, "A terrific course on search engines"
The anchor text summarizes what the website is about.

Measuring Importance of Linking
PageRank Algorithm
- Idea: important pages are pointed to by other important pages
- Method:
  - Each link from one page to another is counted as a "vote" for the destination page
  - But the importance of the starting page also influences the importance of the destination page.
  - And those pages' scores, in turn, depend on those linking to them.
Image and explanation from

Measuring Importance of Linking
Example: each page starts with 100 points.
- Each page's score is recalculated by adding up the score from each incoming link.
  - This is the score of the linking page divided by the number of outgoing links it has.
  - E.g., the page in green has 2 outgoing links and so its "points" are shared evenly by the 2 pages it links to.
- Keep repeating the score updates until no more changes.
Image and explanation from
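
The crawling loop described under "Spiders (crawlers)" (a list of URLs visited, a list still to be visited, and new links added each time a page is loaded) might be sketched as follows. This is a minimal illustration, not any particular engine's crawler: the `crawl` function, the `LinkExtractor` class, and the three-page `PAGES` site are all hypothetical, and `fetch` stands in for an HTTP GET so that the example runs without a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; returns fetched URLs in visiting order."""
    to_visit = deque(seed_urls)   # URLs still to be visited
    visited = set()               # URLs already visited
    order = []
    while to_visit and len(order) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue              # guards against link cycles
        visited.add(url)
        html = fetch(url)         # stands in for an HTTP GET
        if html is None:
            continue              # server down or page missing
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in visited:
                to_visit.append(link)
    return order


# Hypothetical site with a cycle (a -> b -> a) and one dead link.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/a">A</a> <a href="/missing">?</a>',
}

crawl_order = crawl(["http://example.test/"], PAGES.get)
```

The `visited` set is what keeps the cycle between the two pages from trapping the crawler, and the `max_pages` cap is one simple way to avoid hogging resources. A real crawler would also identify itself and check robots.txt before each fetch, which this sketch leaves out.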
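
The basic steps listed under "Inverted Index" (build a dictionary of words, list the documents each word occurs in, omit stop words, stem) and the Boolean AND assumption from "Results ranking" can be illustrated with a small sketch. Everything here is an assumption made for illustration: the stop-word list is arbitrary, `stem` is a toy suffix stripper rather than a real morphological analyzer, and `DOCS` is made-up data.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to"}


def stem(word):
    """Toy suffix stripper (not a real stemmer): cats -> cat, running -> run."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]      # runn -> run
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


DOCS = {
    1: "The cat is running in the garden",
    2: "Cats and dogs",
    3: "A dog runs on the road",
}
index = build_index(DOCS)
```

With this data, `search_and(index, "running cats")` returns `{1}`, because only document 1 contains both stemmed terms, while `search_and(index, "dogs")` returns `{2, 3}`. Ranking the matching documents is a separate step that this sketch does not attempt.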
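
The point-sharing example above (each page starts with 100 points, gives its score in equal shares to the pages it links to, and the update repeats until nothing changes) can be sketched as below. This follows the simplified description in the slides only: it omits the damping factor that full PageRank uses, and it assumes every linked page also appears as a key in `links`. The function name, the tolerance, the round limit, and the three-page `LINKS` graph are hypothetical.

```python
def simple_pagerank(links, start=100.0, tol=1e-9, max_rounds=1000):
    """links maps each page to the list of pages it links to."""
    scores = {page: start for page in links}
    for _ in range(max_rounds):
        new_scores = {page: 0.0 for page in links}
        for page, targets in links.items():
            if not targets:
                continue          # no outgoing links: nothing to pass on
            share = scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share   # one "vote" per link
        change = max(abs(new_scores[p] - scores[p]) for p in links)
        scores = new_scores
        if change < tol:
            break                 # "no more changes", up to the tolerance
    return scores


# Hypothetical graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = simple_pagerank(LINKS)
```

For this graph the scores settle near A = 120, B = 60, C = 120: C collects points from both A and B, and A inherits all of C's points, so both end up ahead of B. The round limit is there because this simplified update is not guaranteed to settle on every graph.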