Micro-course Edition
Python Crawler Project Tutorial, 2nd Edition

Chapter 4 Contents
01 Travel website project task
02 Crawl paths through a website tree
03 Crawling multi-page website data
04 Multithreading in Python
05 Crawling travel website images
06 Crawling the simulated travel website data
07 Crawling a real travel website's data

01 Travel Website Project Task

A complex crawler usually collects a large amount of data, and related data is often spread across many different pages, or even across several related websites; the crawler must be able to follow links back and forth among them automatically. Fetching hundreds or thousands of records with one crawler is routine, so designing an efficient crawler is the focus of this chapter.

The China Daily website carries many introductions to tourist attractions. Browsing the site you can see the travel items, each usually accompanied by a fine image; there are many such pages, and clicking the "Next" button moves to the next one, as shown in Figure 4-1-1 (the China Daily website). The goal of this project is to crawl all the travel items and their images from these pages: after crawling one page the program automatically moves on to the next, stores the text data in the database travels.db, and stores the images in the download folder. Before crawling the real website we first practice on a simulated one.

1.1 Travel Website Project Task

Create a project folder project4 containing a travels.csv file with many travel records; its first lines are:

ID,Title,Date,Ext
000001,Cool sports give city healthy options,2024-07-20 12:07,jpeg
000002,Belgians harvest the good life in Guizhou,2024-07-02 07:50,jpeg
000003,NE China's Changbai Mountain seeking to become top-level outdoor sports destination,2024-06-17 15:12,jpeg
000004,A journey through the tracks of time,2024-06-10 10:35,jpeg
000005,China Focus: Easier travel fuels Chinese people's interest in global destinations,2024-06-07 10:13,jpeg
000006,A place with a sense of history,2024-06-04 06:55,jpeg
000007,Discovering Zhangzhou: A cultural and ecological gem in Fujian,2024-05-29 14:56,jpeg
000008,Beijing unveils plans to enhance Great Wall tourism,2024-05-17 15:36,jpeg

The fields are separated by ","; the first line gives the field names, where ID is the record number, Title the item title, Date the date, and Ext the extension of the image file; an image's file name is the ID combined with Ext. The static folder holds each item's image and text content: for example item 1's image is 000001.jpeg and its text is 000001.html, as shown in Figure 4-1-2 (item image and content). Using this data we build a simulated travel website, shown in Figure 4-1-3: it displays 71 travel items split over many pages, and the "第一頁(yè)" (first), "前一頁(yè)" (previous), "下一頁(yè)" (next) and "末一頁(yè)" (last) buttons on each page switch between pages.

02 Crawl Paths Through a Website Tree
2.1 Web server website
2.2 Crawling data with a recursive program
2.3 Depth-first crawling
2.4 Breadth-first crawling

2.1 Web Server Website

We prepare the pages books.html, program.html, database.html, network.html, mysql.html, java.html and python.html, stored with UTF-8 encoding in a folder (for example c:\demo). Their contents are:

(1) books.html

<h3>計(jì)算機(jī)</h3>
<ul>
<li><a href="database.html">數(shù)據(jù)庫(kù)</a></li>
<li><a href="program.html">程序設(shè)計(jì)</a></li>
<li><a href="network.html">計(jì)算機(jī)網(wǎng)絡(luò)</a></li>
</ul>

(2) database.html

<h3>數(shù)據(jù)庫(kù)</h3>
<ul>
<li><a href="mysql.html">MySQL數(shù)據(jù)庫(kù)</a></li>
</ul>
<a href="books.html">Home</a>

(3) program.html

<h3>程序設(shè)計(jì)</h3>
<ul>
<li><a href="python.html">Python程序設(shè)計(jì)</a></li>
<li><a href="java.html">Java程序設(shè)計(jì)</a></li>
</ul>
<a href="books.html">Home</a>

(4) network.html

<h3>計(jì)算機(jī)網(wǎng)絡(luò)</h3>
<a href="books.html">Home</a>

(5) mysql.html

<h3>MySQL數(shù)據(jù)庫(kù)</h3>
<a href="books.html">Home</a>

(6) python.html

<h3>Python程序設(shè)計(jì)</h3>
<a href="books.html">Home</a>

(7) java.html

<h3>Java程序設(shè)計(jì)</h3>
<a href="books.html">Home</a>

Then we use Flask to write a web program server.py that serves them:

import flask
import os

app = flask.Flask(__name__)

def getFile(fileName):
    data = b""
    if os.path.exists(fileName):
        fobj = open(fileName, "rb")
        data = fobj.read()
        fobj.close()
    return data

@app.route("/")
def index():
    return getFile("books.html")

@app.route("/<section>")
def process(section):
    data = ""
    if section != "":
        data = getFile(section)
    return data

if __name__ == "__main__":
    app.run()

The default address of this web program is http://127.0.0.1:5000. Visiting it runs the index() function and returns the books.html page, as shown in Figure 4-2-1 (the web site). Clicking each hyperlink leads to the database, programming and computer-network pages; the structure of these pages is in fact a tree, as shown in Figure 4-2-2 (site structure).

2.2 Crawling Data with a Recursive Program

Now we design a client program client.py that crawls the <h3> heading of every page on this site. The design is:

(1) Keep a urls list recording the pages already visited;
(2) Start from books.html;
(3) Visit a page, record its address in urls, and extract its <h3> heading;
(4) Collect the href values of all <a> hyperlinks on the page into a links list;
(5) Loop over links; each link points to another page, so recurse back to step (3);
(6) Continue with the next link until every link has been traversed.

from bs4 import BeautifulSoup
import urllib.request

def spider(url):
    global urls
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        print(soup.find("h3").text)
        links = soup.select("a")
        for link in links:
            href = link["href"]
            url = start_url + "/" + href
            if url not in urls:
                urls.append(url)
                spider(url)
    except Exception as err:
        print(err)

start_url = "http://127.0.0.1:5000"
urls = [start_url + "/books.html"]
spider(urls[0])

Program output:

計(jì)算機(jī)
數(shù)據(jù)庫(kù)
MySQL數(shù)據(jù)庫(kù)
程序設(shè)計(jì)
Python程序設(shè)計(jì)
Java程序設(shè)計(jì)
計(jì)算機(jī)網(wǎng)絡(luò)
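The urls list above answers "have we visited this page?" with a linear scan; a Python set answers it in constant time. Below is a minimal offline sketch of the same recursive traversal, where the site is replaced by an illustrative PAGES dict (PAGES and crawl are stand-ins for this example, not part of the book's code):

```python
# Recursive crawl with a set as the visited record.
# PAGES mocks the site: each page maps to the pages it links to,
# including the "Home" back-links to books.html.
PAGES = {
    "books.html": ["database.html", "program.html", "network.html"],
    "database.html": ["mysql.html", "books.html"],
    "program.html": ["python.html", "java.html", "books.html"],
    "network.html": ["books.html"],
    "mysql.html": ["books.html"],
    "python.html": ["books.html"],
    "java.html": ["books.html"],
}

def crawl(start):
    visited = set()          # O(1) membership tests
    order = []               # records the visiting order
    def spider(url):
        visited.add(url)
        order.append(url)
        for link in PAGES.get(url, []):
            if link not in visited:
                spider(link)
    spider(start)
    return order

print(crawl("books.html"))
```

The visiting order is the same as the recursive crawler's output above; only the bookkeeping container changed.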
2.3 Depth-First Crawling

If we do not use recursion to crawl the site in depth-first order, we can use a stack instead. Implementing a stack in Python is very simple, because a Python list already behaves like one; it is easy to wrap it in a Stack class of our own:

class Stack:
    def __init__(self):
        self.st = []
    def pop(self):
        return self.st.pop()
    def push(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

Here push() pushes an element, pop() pops one, and empty() tests whether the stack is empty. With the Stack class, the depth-first crawler is designed as follows:

(1) Keep a urls list recording the addresses already visited;
(2) Push the first url onto the stack;
(3) If the stack is empty the program ends; otherwise pop a url and crawl its <h3> heading;
(4) Collect the href values of all the page's <a> hyperlinks into a links list; for each resulting url not already in urls, push it onto the stack;
(5) Go back to (3).

Based on this idea the crawler client.py is written as follows (note that the links are pushed in reverse order, so that they are popped in their original order):

from bs4 import BeautifulSoup
import urllib.request

class Stack:
    def __init__(self):
        self.st = []
    def pop(self):
        return self.st.pop()
    def push(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

def spider(url):
    global urls
    stack = Stack()
    stack.push(url)
    while not stack.empty():
        url = stack.pop()
        try:
            data = urllib.request.urlopen(url)
            data = data.read()
            data = data.decode()
            soup = BeautifulSoup(data, "lxml")
            print(soup.find("h3").text)
            links = soup.select("a")
            for i in range(len(links) - 1, -1, -1):
                href = links[i]["href"]
                url = start_url + "/" + href
                if url not in urls:
                    urls.append(url)
                    stack.push(url)
        except Exception as err:
            print(err)

start_url = "http://127.0.0.1:5000"
urls = [start_url + "/books.html"]
spider(urls[0])

Program output:

計(jì)算機(jī)
數(shù)據(jù)庫(kù)
MySQL數(shù)據(jù)庫(kù)
程序設(shè)計(jì)
Python程序設(shè)計(jì)
Java程序設(shè)計(jì)
計(jì)算機(jī)網(wǎng)絡(luò)
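The standard library's collections.deque can also serve as the stack, without defining a class of our own. The following sketch runs offline by mocking the site with a PAGES dict (an illustrative stand-in, not part of the book's code):

```python
from collections import deque

# The mocked site: each page maps to the pages it links to,
# including the "Home" back-links to books.html.
PAGES = {
    "books.html": ["database.html", "program.html", "network.html"],
    "database.html": ["mysql.html", "books.html"],
    "program.html": ["python.html", "java.html", "books.html"],
    "network.html": ["books.html"],
    "mysql.html": ["books.html"],
    "python.html": ["books.html"],
    "java.html": ["books.html"],
}

def dfs(start):
    stack = deque([start])
    visited = {start}
    order = []
    while stack:
        url = stack.pop()                  # LIFO pop: depth-first
        order.append(url)
        for link in reversed(PAGES[url]):  # reversed, as in the book's loop
            if link not in visited:
                visited.add(link)
                stack.append(link)
    return order

print(dfs("books.html"))
```

The visiting order matches the Stack-based crawler above, since deque.append/deque.pop behave exactly like push()/pop().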
2.4 Breadth-First Crawling

A site tree can also be traversed in breadth-first order, which requires a queue. Implementing a queue in Python is just as simple, because a list can serve as one; it is easy to design a Queue class of our own:

class Queue:
    def __init__(self):
        self.st = []
    def fetch(self):
        return self.st.pop(0)
    def enter(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

Here enter() appends an element to the queue, fetch() removes the front element, and empty() tests whether the queue is empty. With the Queue class, the breadth-first crawler is designed as follows:

(1) Keep a urls list recording the addresses already visited;
(2) Enter the first url into the queue;
(3) If the queue is empty the program ends; otherwise fetch a url and crawl its <h3> heading;
(4) Collect the href values of all the page's <a> hyperlinks into a links list; for each resulting url not already in urls, enter it into the queue;
(5) Go back to (3).

Based on this idea the crawler client.py is:

from bs4 import BeautifulSoup
import urllib.request

class Queue:
    def __init__(self):
        self.st = []
    def fetch(self):
        return self.st.pop(0)
    def enter(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

def spider(url):
    global urls
    queue = Queue()
    queue.enter(url)
    while not queue.empty():
        url = queue.fetch()
        try:
            data = urllib.request.urlopen(url)
            data = data.read()
            data = data.decode()
            soup = BeautifulSoup(data, "lxml")
            print(soup.find("h3").text)
            links = soup.select("a")
            for link in links:
                href = link["href"]
                url = start_url + "/" + href
                if url not in urls:
                    urls.append(url)
                    queue.enter(url)
        except Exception as err:
            print(err)

start_url = "http://127.0.0.1:5000"
urls = [start_url + "/books.html"]
spider(urls[0])

Program output:

計(jì)算機(jī)
數(shù)據(jù)庫(kù)
程序設(shè)計(jì)
計(jì)算機(jī)網(wǎng)絡(luò)
MySQL數(shù)據(jù)庫(kù)
Python程序設(shè)計(jì)
Java程序設(shè)計(jì)

03 Crawling Multi-Page Website Data
3.1 Travel website server
3.2 Analysing the site's data
3.3 Writing the crawler

3.1 Travel Website Server

1. The travel data

The project4 folder contains travels.csv with the record number ID, the title Title, the date Date and the image file extension Ext, in the same format shown in Section 1.1:

ID,Title,Date,Ext
000001,Cool sports give city healthy options,2024-07-20 12:07,jpeg
000002,Belgians harvest the good life in Guizhou,2024-07-02 07:50,jpeg
(omitted)
000070,Chongqing a city full of surprises,2023-06-07 12:03,jpeg
000071,Tourists enjoy kiteboarding in Hainan,2023-06-05 10:02,jpeg

The data set has 71 travel items; each item has an image file named ID+Ext and a content file named ID+".html", both stored in project4/static, as shown in Figure 4-3-1 (image files and content files).

2. The site template

Design a template file travel.html under project4/templates:

<style>
.first {display:inline-block; width:150px; height:90px;}
.second {display:inline-block;}
</style>
{% for item in items %}
<div>
<span class="first">
<a target="_blank" shape="rect" href="/static/{{item.ID}}.html">
<img width="120" height="80" src="/static/{{item.Image}}"></a>
</span>
<span class="second">
<div><b>{{item.ID}}</b></div>
<div><a shape="rect" href="/static/{{item.ID}}.html">{{item.Title}}</a></div>
<div>{{item.Date}}</div>
</span>
</div>
{% endfor %}
<div id="pagnation" style="text-align:center;">
<a href="?pageIndex=1">第一頁(yè)</a>
{% if pageIndex > 1 %}
<a href="?pageIndex={{pageIndex-1}}">上一頁(yè)</a>
{% endif %}
{% if pageIndex < pageCount %}
<a href="?pageIndex={{pageIndex+1}}">下一頁(yè)</a>
{% endif %}
<a href="?pageIndex={{pageCount}}">最末頁(yè)</a>
<span>第{{pageIndex}}頁(yè)/共{{pageCount}}頁(yè)</span>
</div>

3. The site server

Design the site server:

import flask
import os

app = flask.Flask(__name__)

@app.route("/")
def index():
    pageIndex = flask.request.values.get("pageIndex", "1")
    pageIndex = int(pageIndex)
    pageSize = 4
    startIndex = (pageIndex - 1) * pageSize + 1
    endIndex = startIndex + pageSize
    fobj = open("travels.csv", "rt", encoding="utf-8")
    data = fobj.readlines()
    count = len(data)
    items = []
    j = 0
    for i in range(1, count):
        s = data[i].strip().split(",")
        if len(s) == 4:
            j += 1
            if j >= startIndex and j < endIndex:
                item = {"ID": s[0], "Title": s[1], "Date": s[2], "Image": s[0] + "." + s[3]}
                items.append(item)
    fobj.close()
    pageCount = j // pageSize
    if j % pageSize != 0:
        pageCount += 1
    return flask.render_template("travel.html", items=items, pageIndex=pageIndex, pageCount=pageCount)

if __name__ == "__main__":
    app.run(debug=True)

Run the server and browse the site; the effect is shown in Figure 4-3-2 (the travel item pages). Use the first/previous/next/last page links to jump between pages; clicking a travel item jumps to that item's content page, as shown in Figure 4-3-3 (travel item content).

3.2 Analysing the Site's Data

Inspecting the site's HTML makes it easy to see that each page contains several <span class="second"> elements; each of them holds the item's ID, Title and Date, as well as the address of its Content page. So we first find all the <span class="second"> elements, then search within each <span> in a loop.
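The selector logic described above can be tried on a small HTML fragment before pointing the crawler at the live server (requires the bs4 package; the fragment mirrors the template's structure, and the stdlib "html.parser" is used here so lxml is not needed):

```python
from bs4 import BeautifulSoup

# A fragment shaped like one item rendered by travel.html.
html = """
<div>
<span class="first"><a href="/static/000001.html"><img src="/static/000001.jpeg"></a></span>
<span class="second">
<div><b>000001</b></div>
<div><a href="/static/000001.html">Cool sports give city healthy options</a></div>
<div>2024-07-20 12:07</div>
</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.select_one("div span[class='second']")   # the data-carrying span
ID = span.select_one("div b").text                   # number in the first inner div
title = span.select_one("div a").text                # first <a> under a div: the title link
date = span.select("div")[-1].text                   # last inner div holds the date
print(ID, title, date)
```

The same three selectors are what the crawler in the next section applies to every <span class="second"> on every page.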
3.3 Writing the Crawler

Since the site's pages are chained together by the "下一頁(yè)" (next page) link, we design spider(url) to crawl the page at url, then find the address of the next page and recursively call spider(url) again. Based on this analysis the program is written as follows:

from bs4 import BeautifulSoup
import urllib.request

def getContent(url):
    data = urllib.request.urlopen(url)
    data = data.read()
    data = data.decode()
    soup = BeautifulSoup(data, "lxml")
    return soup.select_one("pre").text

def spider(url):
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        spans = soup.select("div span[class='second']")
        for span in spans:
            link = span.select_one("div a")
            title = link.text
            ID = span.select_one("div b").text
            date = span.select("div")[-1].text
            url = urllib.request.urljoin(start_url, link["href"])
            content = getContent(url)
            print(ID)
            print(title)
            print(date)
            print(content[:20])
        links = soup.select("div[id='pagnation'] a")
        nextUrl = ""
        for link in links:
            if link.text == "下一頁(yè)":
                nextUrl = urllib.request.urljoin(start_url, link["href"])
                break
        if nextUrl != "":
            spider(nextUrl)
    except Exception as err:
        print(err)

start_url = "http://127.0.0.1:5000"
spider(start_url)
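Following the "下一頁(yè)" link by recursion adds one call-stack frame per page; the same pattern can be written as a loop. Here is a sketch with the downloading and parsing stubbed out by an illustrative PAGES dict (the dict and its keys are assumptions for this example, not the book's code):

```python
# Iterative version of the "crawl page, then follow the next-page link" pattern.
# PAGES maps a page URL to (payload, next_url_or_None) — a stand-in for
# downloading the page, scraping its items, and locating the 下一頁(yè) link.
PAGES = {
    "?pageIndex=1": ("items 1-4", "?pageIndex=2"),
    "?pageIndex=2": ("items 5-8", "?pageIndex=3"),
    "?pageIndex=3": ("items 9-12", None),
}

def spider(url):
    seen = []
    while url:                      # stop when no next-page link is found
        payload, url = PAGES[url]   # real crawler: fetch, parse, find 下一頁(yè)
        seen.append(payload)
    return seen

print(spider("?pageIndex=1"))
```

The loop visits the pages in the same order as the recursive version, but uses constant stack depth no matter how many pages the site has.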
04 Multithreading in Python
4.1 Python daemon threads
4.2 Waiting for a thread

4.1 Python Daemon Threads

To start a thread in Python, create an object with the Thread class of the threading package. Its basic form is:

t = threading.Thread(target=func, args=())

where target is the function the thread will execute and args is a tuple or list supplying the arguments of target; calling t.start() starts the thread. The following program starts a child thread in the main thread to run reading():

import threading
import time
import random

def reading():
    for i in range(10):
        print("reading", i)
        time.sleep(random.randint(1, 2))

r = threading.Thread(target=reading)
r.setDaemon(False)
r.start()
print("The End")

Program output:

reading 0
The End
reading 1
reading 2
reading 3
reading 4
...

From the output we can see that the main thread finishes right after starting the child thread r, but the child thread has not finished: it keeps printing the remaining lines until its loop completes. The call r.setDaemon(False) marks r as a non-daemon thread, and such a thread does not end just because the main thread has ended.

Starting a daemon thread:

import threading
import time
import random

def reading():
    for i in range(5):
        print("reading", i)
        time.sleep(random.randint(1, 2))

r = threading.Thread(target=reading)
r.setDaemon(True)
r.start()
print("The End")

Program output:

reading 0
The End

Evidently the child thread ends as soon as the main thread ends: that is a daemon thread.

4.2 Waiting for a Thread

In a multithreaded program one thread (for example the main thread) often has to wait for other threads to finish before continuing, which is done with the join() function:

thread_object.join()

When a thread executes this statement it stops and waits until the thread of the given thread object has finished, i.e. the statement performs a blocking wait. For example:

import threading
import time
import random

def reading():
    for i in range(5):
        print("reading", i)
        time.sleep(random.randint(1, 2))

t = threading.Thread(target=reading)
t.setDaemon(False)
t.start()
t.join()
print("The End")

Program output:

reading 0
reading 1
reading 2
reading 3
reading 4
The End

The main thread starts the child thread t running reading(); t.join() blocks the main thread until t has finished, after which t.join() returns and execution continues, printing The End.

05 Crawling Travel Website Images
5.1 Redesigning the travel website
5.2 Single-threaded image crawling
5.3 Multithreaded image crawling

5.1 Redesigning the Travel Website

To simulate a real network environment, where downloading an image is a fairly slow process, we rework how the travel site serves its images.

1. The template file

In the travel.html template in project4/templates, the image display line is changed to:

<img width="120" height="80" src="/getImage/{{item.Image}}">

The rest of the template (the item layout and the pagination links) stays the same as in Section 3.1.

2. The site server

The server keeps the index() function of Section 3.1, additionally imports random and time, and gains a new route that serves the images:

@app.route("/getImage/<name>")
def getImage(name):
    img = b""
    if os.path.exists("static/" + name):
        fobj = open("static/" + name, "rb")
        img = fobj.read()
        fobj.close()
    time.sleep(random.random() * 3)
    return img

The getImage(name) function returns the binary data of an image file; to simulate a slow download it executes

time.sleep(random.random() * 3)

which delays each image by 0 to 3 seconds, so the images appear rather slowly when browsing the site.
5.2 Single-Threaded Image Crawling

We now write a crawler that fetches all the images of the travel site one by one, i.e. every download runs in the main thread:

import urllib.response
from bs4 import BeautifulSoup
import urllib.request
import os
import time

def download(src):
    try:
        p = src.rfind("/")
        name = src[p + 1:]
        data = urllib.request.urlopen(src, timeout=10)
        data = data.read()
        if len(data) > 0:
            fobj = open("downloadImages/" + name, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded", src)
    except Exception as err:
        print(err)

def spider(url):
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        images = soup.select("div span[class='first'] img")
        for image in images:
            src = urllib.request.urljoin(start_url, image["src"])
            download(src)
        links = soup.select("div[id='pagnation'] a")
        url = ""
        for link in links:
            if link.text == "下一頁(yè)":
                url = urllib.request.urljoin(start_url, link["href"])
                break
        if url != "":
            spider(url)
    except Exception as err:
        print(err)

if os.path.exists("downloadImages") == False:
    os.mkdir("downloadImages")
start_url = "http://127.0.0.1:5000"
startTime = time.time()
spider(start_url)
endTime = time.time()
print("time used: %.2f seconds" % (endTime - startTime))

The download(src) function downloads the image at address src and stores it in the downloadImages folder. Start the travel site server and run the crawler; part of the output is:

http://127.0.0.1:5000
downloaded http://127.0.0.1:5000/getImage/000001.jpeg
downloaded http://127.0.0.1:5000/getImage/000002.jpeg
downloaded http://127.0.0.1:5000/getImage/000003.png
downloaded http://127.0.0.1:5000/getImage/000004.png
http://127.0.0.1:5000?pageIndex=2
downloaded http://127.0.0.1:5000/getImage/000005.jpeg
downloaded http://127.0.0.1:5000/getImage/000006.jpeg
downloaded http://127.0.0.1:5000/getImage/000007.jpeg
downloaded http://127.0.0.1:5000/getImage/000008.jpeg
http://127.0.0.1:5000?pageIndex=3
(omitted)
http://127.0.0.1:5000?pageIndex=17
downloaded http://127.0.0.1:5000/getImage/000065.jpeg
downloaded http://127.0.0.1:5000/getImage/000066.jpeg
downloaded http://127.0.0.1:5000/getImage/000067.jpeg
downloaded http://127.0.0.1:5000/getImage/000068.jpeg
http://127.0.0.1:5000?pageIndex=18
downloaded http://127.0.0.1:5000/getImage/000069.jpeg
downloaded http://127.0.0.1:5000/getImage/000070.jpeg
downloaded http://127.0.0.1:5000/getImage/000071.jpeg
time used: 204.31 seconds

As we can see, each time the crawler enters a page it downloads that page's images one after another in strict sequence, taking about 204 seconds in total.
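Besides starting one Thread per image as the next section does, the standard library's concurrent.futures provides a thread pool that caps how many downloads run at once. A sketch, with the real download replaced by a simulated delay (the names and timings here are illustrative, not the book's code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def download(name):
    # Simulated slow download; a real version would call urllib.request.urlopen.
    time.sleep(0.05)
    return name + " done"

names = ["%06d.jpeg" % i for i in range(1, 9)]   # eight fake image names

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # at most 4 downloads in flight
    results = list(pool.map(download, names))    # preserves input order
elapsed = time.time() - start

print(results[0], "%.2f" % elapsed)
```

With 8 tasks of 0.05 s each and 4 workers, the pool finishes in roughly two rounds instead of eight, while never creating more than 4 threads — a useful middle ground between the single-threaded crawler above and the one-thread-per-image version that follows.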
5.3 Multithreaded Image Crawling

If the downloads are to run in multiple threads, we keep a list TS of all the threads; each time an image is to be downloaded we start a thread T that runs download() and append T to TS:

src = urllib.request.urljoin(start_url, image["src"])
T = threading.Thread(target=download, args=(src,))
TS.append(T)
T.start()

Then, before the main thread ends, we wait for every thread T to finish:

for T in TS:
    T.join()

This guarantees that all the threads, and therefore all the image downloads, are complete before the main program ends. Following this idea the multithreaded crawler is:

import urllib.response
from bs4 import BeautifulSoup
import urllib.request
import os
import time
import threading

def download(src):
    try:
        p = src.rfind("/")
        name = src[p + 1:]
        data = urllib.request.urlopen(src, timeout=10)
        data = data.read()
        if len(data) > 0:
            fobj = open("downloadImages/" + name, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded", src)
    except Exception as err:
        print(err)

def spider(url):
    global TS
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        images = soup.select("div span[class='first'] img")
        for image in images:
            src = urllib.request.urljoin(start_url, image["src"])
            T = threading.Thread(target=download, args=(src,))
            T.start()
            TS.append(T)
        links = soup.select("div[id='pagnation'] a")
        url = ""
        for link in links:
            if link.text == "下一頁(yè)":
                url = urllib.request.urljoin(start_url, link["href"])
                break
        if url != "":
            spider(url)
    except Exception as err:
        print(err)

if os.path.exists("downloadImages") == False:
    os.mkdir("downloadImages")
TS = []
start_url = "http://127.0.0.1:5000"
startTime = time.time()
spider(start_url)
for T in TS:
    T.join()
endTime = time.time()
print("time used: %.2f seconds" % (endTime - startTime))

Part of the output of this program:

http://127.0.0.1:5000
http://127.0.0.1:5000?pageIndex=2
http://127.0.0.1:5000?pageIndex=3
http://127.0.0.1:5000?pageIndex=4
http://127.0.0.1:5000?pageIndex=5
http://127.0.0.1:5000?pageIndex=6
http://127.0.0.1:5000?pageIndex=7
http://127.0.0.1:5000?pageIndex=8
http://127.0.0.1:5000?pageIndex=9
http://127.0.0.1:5000?pageIndex=10
downloaded http://127.0.0.1:5000/getImage/000015.jpeg
http://127.0.0.1:5000?pageIndex=11
http://127.0.0.1:5000?pageIndex=12
http://127.0.0.1:5000?pageIndex=13
http://127.0.0.1:5000?pageIndex=14
http://127.0.0.1:5000?pageIndex=15
http://127.0.0.1:5000?pageIndex=16
http://127.0.0.1:5000?pageIndex=17
http://127.0.0.1:5000?pageIndex=18
downloaded http://127.0.0.1:5000/getImage/000026.jpeg
downloaded http://127.0.0.1:5000/getImage/000039.jpeg
downloaded http://127.0.0.1:5000/getImage/000051.jpeg
downloaded http://127.0.0.1:5000/getImage/000056.jpeg
downloaded http://127.0.0.1:5000/getImage/000059.jpeg
downloaded http://127.0.0.1:5000/getImage/000049.jpeg
downloaded http://127.0.0.1:5000/getImage/000045.jpeg
time used: 2.09 seconds

06 Comprehensive Project: Crawling the Simulated Travel Website Data
6.1 Storing the travel data
6.2 Writing the crawler

6.1 Storing the Travel Data

Based on the data in travels.csv we built a travel website modelled on the real one, as shown in Figure 4-6-1 (the travel website). Now we write a crawler that fetches all of its data and images and stores the data in a database; this project combines the knowledge and skills learned so far and prepares for crawling the real travel website afterwards. The crawled data are stored in a database travels.db containing a travels table, whose structure is shown in Table 4-6-1:

Table 4-6-1 Database table structure
Field    Type                     Description
ID       varchar(8), primary key  record number
Title    varchar(256)             title
Date     varchar(256)             date
Ext      varchar(256)             image extension
Content  text                     content

6.2 Writing the Crawler

import urllib.response
from bs4 import BeautifulSoup
import urllib.request
import os
import time
import threading
from database import Database

def download(src):
    try:
        p = src.rfind("/")
        name = src[p + 1:]
        data = urllib.request.urlopen(src, timeout=100)
        data = data.read()
        if len(data) > 0:
            fobj = open("downloadImages/" + name, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded", src)
    except Exception as err:
        print(err)

def getContent(url):
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        return soup.select_one("pre").text
    except Exception as err:
        print(err)
        return ""

def initializeDownload():
    # initialize the downloadImages folder
    if not os.path.exists("downloadImages"):
        os.mkdir("downloadImages")
    fs = os.listdir("downloadImages")
    for f in fs:
        os.remove("downloadImages\\" + f)

def spider(url):
    global TS, db
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        spans = soup.select("div span[class='second']")
        for span in spans:
            link = span.select_one("div a")
            title = link.text
            ID = span.select_one("div b").text
            date = span.select("div")[-1].text
            url = urllib.request.urljoin(start_url, link["href"])
            content = getContent(url)
            # the span's parent is a div whose first span holds the image
            div = span.parent
            image = div.select_one("span[class='first'] img")
            src = urllib.request.urljoin(start_url, image["src"])
            p = src.rfind(".")
            ext = src[p:]
            db.insert(ID, title, date, ext, content)
            print(ID, title)
            T = threading.Thread(target=download, args=(src,))
            T.start()
            TS.append(T)
        links = soup.select("div[id='pagnation'] a")
        url = ""
        for link in links:
            if link.text == "下一頁(yè)":
                url = urllib.request.urljoin(start_url, link["href"])
                break
        if url != "":
            spider(url)
    except Exception as err:
        print(err)

if os.path.exists("downloadImages") == False:
    os.mkdir("downloadImages")
TS = []
db = None
start_url = "http://127.0.0.1:5000"
while True:
    print("1. Spider")
    print("2. Show")
    print("3. Exit")
    choice = input("Enter your choice (1,2,3): ")
    if choice == "1":
        startTime = time.time()
        initializeDownload()
        db = Database()
        db.open()
        db.initialize()
        spider(start_url)
        db.close()
        for T in TS:
            T.join()
        endTime = time.time()
        print("time used: %.2f seconds" % (endTime - startTime))
    elif choice == "2":
        db = Database()
        db.open()
        db.show()
        db.close()
    elif choice == "3":
        break
    else:
        print("Invalid choice")

07 Hands-On Project: Crawling a Real Travel Website's Data
7.1 Analysing the site's pages
7.2 Storing the site's data
7.3 Writing the crawler

7.1 Analysing the Site's Pages

To crawl the text and images of these pages we must first analyse the page structure. Browse the site with Chrome, right-click the page and choose "Inspect"; the page structure is shown in Figure 4-7-2 (site structure). Each item sits inside a <div> element; one of them, copied out, looks like this:

<div class="mb10 tw3_01_2">
<span class="tw3_01_2_p">
<a target="_blank" shape="rect" href="http:///a/202407/23/WS669f294da31095c51c50f70c.html">
<img width="200" height="130" src="http:///images/202407/23/669f294da31095c551b59e96.jpeg"></a>
</span>
<span class="tw3_01_2_t">
<h4><a shape="rect" href="http:///a/202407/23/WS669f294da31095c51c50f70c.html">Summer travel heats up in China's 'ice city'</a></h4>
<b>2024-07-23 11:53</b>
</span>
</div>

All the items are contained in a <div class='lft_art lf'> element, under which there is a series of div[class='mb10 tw3_01_2'] elements — these are the items. Clicking one of the items shows its detailed content, as in Figure 4-7-3 (detailed content). The detail usually contains both text and images and often spans several pages; to keep things simple we only fetch the text of the first page, which is contained in the <p> elements under a <div id="Content"> element.

7.2 Storing the Site's Data

The data can be stored in a SQLite3 database travels.db. We design a table items with a primary-key number ID, the title tTitle, the date tDate, the content tContent and the image extension tExt; the structure is shown in Table 4-7-1:

Table 4-7-1 Database table field structure
Field     Type                     Description
ID        varchar(6), primary key  number
tDate     varchar(16)              date
tTitle    varchar(1024)            title
tContent  text                     content
tExt      varchar(8)               image extension

7.3 Writing the Crawler

from bs4 import BeautifulSoup
import urllib.request
import sqlite3
import os
import time
import threading

class Database:
    def open(self):
        # open the database
        self.con = sqlite3.connect("travels.db")
        self.cursor = self.con.cursor()

    def close(self):
        # close the database
        self.con.commit()
        self.con.close()

    def initialize(self):
        # initialize the database: (re)create the items table
        try:
            self.cursor.execute("drop table items")
        except:
            pass
        self.cursor.execute("create table items (ID varchar(8) primary key, tDate varchar(16), tTitle varchar(1024), tContent text, tExt varchar(8))")

    def insert(self, ID, tDate, tTitle, tContent, tExt):
        # insert one record
        try:
            self.cursor.execute("insert into items (ID,tDate,tTitle,tContent,tExt) values (?,?,?,?,?)", [ID, tDate, tTitle, tContent, tExt])
        except Exception as err:
            print(err)

    def show(self):
        # display the stored data
        self.cursor.execute("select ID,tDate,tTitle,tContent,tExt from items order by ID")
        rows = self.cursor.fetchall()
        for row in rows:
            print(row[0])
            print(row[1])
            print(row[2])
            print(row[3])
            print(row[4])
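The items schema above can be exercised directly with the sqlite3 module before wiring it into the Database class; this sketch uses an in-memory database as a stand-in for travels.db, and the inserted row is sample data, not crawled output:

```python
import sqlite3

# In-memory stand-in for travels.db, using the same items schema.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table items (ID varchar(8) primary key, tDate varchar(16),"
            " tTitle varchar(1024), tContent text, tExt varchar(8))")
cur.execute("insert into items (ID,tDate,tTitle,tContent,tExt) values (?,?,?,?,?)",
            ["000001", "2024-07-20 12:07", "Cool sports give city healthy options",
             "sample content", ".jpeg"])
con.commit()
cur.execute("select ID,tTitle,tExt from items order by ID")
rows = cur.fetchall()   # each row comes back as a tuple
print(rows)
con.close()
```

Because ID is the primary key, inserting the same ID twice raises an IntegrityError — which is exactly why the insert() method above wraps its execute() call in try/except.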