chapter02-part03-search-engine_搜索引擎.ppt

Uploaded: 2020-08-03 · Format: PPT · Pages: 33


Search Engine
朱廷劭 (Zhu, Tingshao), Ph.D.

Examples of search engines
- Conventional (library catalog): search by keyword, title, author, etc.
- Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords; limited search using queries in natural language.
- Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
- Question answering systems (Ask, NSIR, Answerbus): search in (restricted) natural language.
- Clustering systems (Vivisimo, Clusty)
- Research systems (Lemur, Nutch)

What does it take to build a search engine?
- Decide what to index
- Collect it
- Index it (efficiently)
- Keep the index up to date
- Provide user-friendly query facilities

What else?
- Understand the structure of the web for efficient crawling
- Understand user information needs
- Preprocess text and other unstructured data
- Cluster data
- Classify data
- Evaluate performance

How Search Engines Work
Three main parts:
- Gather the contents of all web pages (using a program called a crawler or spider)
- Organize the contents of the pages in a way that allows efficient retrieval (indexing)
- Take in a query, determine which pages match, and show the results (ranking and display of results)

Standard Web Search Engine Architecture
[Diagram labels: crawl the web (crawler machines); check for duplicates, store the documents (DocIds); create an inverted index; inverted index; search engine servers; user query; show results to user]

Spiders (crawlers)
How to find web pages to visit and copy?
- Can start with a list of domain names and visit the home pages there.
- Look at the hyperlinks on the home page, and follow those links to more pages.
- Use HTTP commands to GET the pages.
- Keep a list of URLs visited, and those still to be visited.
- Each time the program loads a new HTML page, add the links in that page to the list to be crawled.

Four Laws of Crawling
- A crawler must show identification
- A crawler must obey the robots exclusion standard (/wc/norobots.html)
- A crawler must not hog resources
- A crawler must report errors

Example robots.txt file
/robots.txt (just the first few lines)

Lots of tricky aspects
- Servers are often down or slow
- Hyperlinks can get the crawler into cycles
- Some websites have junk in the web pages
- Many pages now have dynamic content
- The “hidden” web, e.g., you don't see the course schedules until you run a query
- The web is HUGE

“Freshness”
- Need to keep checking pages
- Pages change (25%, 7% large changes), at different frequencies. Who is the fastest changing?
- Pages are removed
- Many search engines cache the pages (store a copy on their own servers)

What really gets crawled?
- A small fraction of the Web that search engines know about; no search engine is exhaustive
- Not the “live” Web, but the search engine's index
- Not the “Deep Web”
- Mostly HTML pages but other file types too: PDF, Word, PPT, etc.

Index (the database)
Record information about each page:
- List of words: In the title? How far down in the page? Was the word in boldface?
- URLs of pages pointing to this one
- Anchor text on pages pointing to this one

Inverted Index
How to store the words for fast lookup. Basic steps:
- Make a “dictionary” of all the words in all of the web pages
- For each word, list all the documents it occurs in
- Often omit very common words (“stop words”)
- Sometimes stem the words (also called morphological analysis): cats -> cat, running -> run

Inverted Index Example
[Image from /documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html]

Inverted Index
- In reality, this index is HUGE
- Need to store the contents across many machines
- Need to do optimization tricks to make lookup fast

Results ranking
- The search engine receives a query, then
- looks up the words in the index and retrieves many documents, then
- rank-orders the pages and extracts “snippets” or summaries containing query words.
- Most web search engines assume the user wants all of the words (Boolean AND, not OR).
- These are complex and highly guarded algorithms unique to each search engine.

Some ranking criteria
For a given candidate result page, use:
- Number of matching query words in the page
- Proximity of matching words to one another
- Location of terms within the page
- Location of terms within tags e.g. , , link text, body text
- Anchor text on pages pointing to this one
- Frequency of terms on the page and in general
- Link analysis of which pages point to this one
- (Sometimes) click-through analysis: how often the page is clicked on
- How “fresh” the page is
- Complex formulae combine these together.

Machine Learned Ranking
- Goal: automatically construct a ranking function
- Input: a large number of training examples; features that predict relevance; relevance metrics
- Output: ranking function
- Enables a rapid experimental cycle: scientific investigation of modifications to existing features and of new features

Ranking Features (from Jan Pedersen's lecture)
A0 - A4     anchor text score per term
W0 - W4     term weights
L0 - L4     first occurrence location (encodes hostname and title match)
SP          spam index: logistic regression of 85 spam filter variables (against relevance scores)
F0 - F4     term occurrence frequency within document
DCLN        document length (tokens)
ER          Eigenrank
HB          extra-host unique inlink count
ERHB        ER*HB
A0W0 etc.   A0*W0
QA          site factor: logistic regression of 5 site link and url count ratios
SPN         proximity
FF          family friendly rating
UD          url depth

Ranking Decision Tree (from Jan Pedersen's lecture)
[Figure: decision tree with split nodes on A0w0, L0, L1, W0, F0 and F2, Y/N branches, and leaf scores R ranging from -0.2368 to 0.0015]

The importance of anchor text
- Example link: i141, “A terrific course on search engines”
- The anchor text summarizes what the website is about.

Measuring Importance of Linking
PageRank Algorithm
- Idea: important pages are pointed to by other important pages
- Method: each link from one page to another is counted as a “vote” for the destination page
- But the importance of the starting page also influences the importance of the destination page.
- And those pages' scores, in turn, depend on those linking to them.
[Image; source reference missing]

Measuring Importance of Linking
Example: each page starts with 100 points.
- Each page's score is recalculated by adding up the score from each incoming link. This is the score of the linking page divided by the number of outgoing links it has.
- E.g., the page in green has 2 outgoing links, so its “points” are shared evenly by the 2 pages it links to.
- Keep repeating the score updates until no more changes.
[Image; source reference missing]
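
The crawling loop described under "Spiders (crawlers)" and the robots rule from "Four Laws of Crawling" can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine does it: the `crawl` function, the `demo-crawler` agent name, and the in-memory `FAKE_WEB` pages are all hypothetical, and a dictionary lookup stands in for the real HTTP GET so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, robots, agent="demo-crawler", max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page fetched."""
    to_visit = deque(seeds)   # URLs still to be visited
    seen = set(seeds)         # URLs already queued or visited
    pages = {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        if not robots.can_fetch(agent, url):
            continue          # obey the robots exclusion standard
        html = fetch(url)     # stand-in for an HTTP GET
        if html is None:
            continue          # server down or page missing
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if link not in seen:   # the seen set keeps us out of cycles
                seen.add(link)
                to_visit.append(link)
    return pages


# Hypothetical in-memory "web" so the sketch needs no network.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/private/x.html">X</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.test/b.html": "<p>no links</p>",
    "http://example.test/private/x.html": "<p>secret</p>",
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

pages = crawl(["http://example.test/"], FAKE_WEB.get, robots)
print(sorted(pages))
```

A real crawler would also rate-limit its requests and log failures, in line with the "must not hog resources" and "must report errors" laws; both are left out here for brevity.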
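
The "Inverted Index" steps (build a dictionary of words, list the documents for each word, drop stop words, stem) and the Boolean AND lookup from "Results ranking" might look like this in a toy form. The stop-word list, the crude `stem` rules, and the three sample documents are assumptions made up for illustration; production stemmers and indexes are far more elaborate.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def stem(word):
    """Very crude suffix stripping, only enough for the slide's examples."""
    if word.endswith("ing") and len(word) > 5:
        base = word[:-3]
        if len(base) > 2 and base[-1] == base[-2]:
            base = base[:-1]       # running -> runn -> run
        return base
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]           # cats -> cat
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to the set of document ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Boolean AND: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "The cat is running",
    2: "Cats sleep on the mat",
    3: "Dogs run in the park",
}
index = build_index(docs)
print(search(index, "cats running"))   # documents containing both terms
print(search(index, "cat"))
```

Because both the documents and the query pass through the same `tokenize` function, "cats running" matches a document that literally says "cat is running".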
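
The point-sharing example from "Measuring Importance of Linking" can be simulated directly. The sketch below follows the slide's simplified description only: every page starts with 100 points, splits them evenly over its outgoing links, and the update repeats until scores stop changing. The published PageRank algorithm also includes a damping factor, which is omitted here. The function name `share_points`, the tolerance, and the three-page graph are hypothetical, and the code assumes every page has at least one outgoing link.

```python
def share_points(links, start=100.0, tol=1e-9, max_rounds=1000):
    """Iterate the slide's score update until no score changes by more than tol.

    links maps each page to the list of pages it links to. Every page is
    assumed to appear as a key and to have at least one outgoing link.
    """
    scores = {page: start for page in links}
    for _ in range(max_rounds):
        new_scores = {page: 0.0 for page in links}
        for page, targets in links.items():
            share = scores[page] / len(targets)   # split points evenly
            for target in targets:
                new_scores[target] += share       # one "vote" per link
        change = max(abs(new_scores[p] - scores[p]) for p in links)
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = share_points(links)
for page in sorted(scores):
    print(page, round(scores[page], 3))
```

In this graph A and C settle at equal scores and B at half of that, since B receives only half of A's points while C receives the other half plus all of B's. The total stays at 300 points throughout because each page hands out exactly what it holds.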
