Micro-course Edition
Python Crawler Project Tutorial, 2nd Edition

Chapter 4 Contents
01 Travel website project task
02 Crawl paths through a website tree
03 Crawling multi-page website data
04 Multithreading in Python
05 Crawling travel website images
06 Crawling the simulated travel website data
07 Crawling a real travel website's data

01 Travel Website Project Task

A complex crawler usually collects a large amount of data, and related data is often spread across many different pages, or even across several related websites; the crawler must be able to follow links back and forth among them automatically. Fetching hundreds or thousands of records with one crawler is routine, so designing an efficient crawler is the focus of this chapter.

The China Daily website carries many introductions to tourist attractions. Browsing the site you can see the travel items, each usually accompanied by a fine image; there are many such pages, and clicking the "Next" button moves to the next one, as shown in Figure 4-1-1 (the China Daily website). The goal of this project is to crawl all the travel items and their images from these pages: after crawling one page the program automatically moves on to the next, stores the text data in the database travels.db, and stores the images in the download folder. Before crawling the real website we first practice on a simulated one.

1.1 Travel Website Project Task

Create a project folder project4 containing a travels.csv file with many travel records; its first lines are:

ID,Title,Date,Ext
000001,Cool sports give city healthy options,2024-07-20 12:07,jpeg
000002,Belgians harvest the good life in Guizhou,2024-07-02 07:50,jpeg
000003,NE China's Changbai Mountain seeking to become top-level outdoor sports destination,2024-06-17 15:12,jpeg
000004,A journey through the tracks of time,2024-06-10 10:35,jpeg
000005,China Focus: Easier travel fuels Chinese people's interest in global destinations,2024-06-07 10:13,jpeg
000006,A place with a sense of history,2024-06-04 06:55,jpeg
000007,Discovering Zhangzhou: A cultural and ecological gem in Fujian,2024-05-29 14:56,jpeg
000008,Beijing unveils plans to enhance Great Wall tourism,2024-05-17 15:36,jpeg

The fields are separated by ","; the first line gives the field names, where ID is the record number, Title the item title, Date the date, and Ext the extension of the image file; an image's file name is the ID combined with Ext. The static folder holds each item's image and text content: for example item 1's image is 000001.jpeg and its text is 000001.html, as shown in Figure 4-1-2 (item image and content). Using this data we build a simulated travel website, shown in Figure 4-1-3: it displays 71 travel items split over many pages, and the "第一頁(yè)" (first), "前一頁(yè)" (previous), "下一頁(yè)" (next) and "末一頁(yè)" (last) buttons on each page switch between pages.

02 Crawl Paths Through a Website Tree
2.1 Web server website
2.2 Crawling data with a recursive program
2.3 Depth-first crawling
2.4 Breadth-first crawling

2.1 Web Server Website

We prepare the pages books.html, program.html, database.html, network.html, mysql.html, java.html and python.html, stored with UTF-8 encoding in a folder (for example c:\demo). Their contents are:

(1) books.html

<h3>計(jì)算機(jī)</h3>
<ul>
<li><a href="database.html">數(shù)據(jù)庫(kù)</a></li>
<li><a href="program.html">程序設(shè)計(jì)</a></li>
<li><a href="network.html">計(jì)算機(jī)網(wǎng)絡(luò)</a></li>
</ul>

(2) database.html

<h3>數(shù)據(jù)庫(kù)</h3>
<ul>
<li><a href="mysql.html">MySQL數(shù)據(jù)庫(kù)</a></li>
</ul>
<a href="books.html">Home</a>

(3) program.html

<h3>程序設(shè)計(jì)</h3>
<ul>
<li><a href="python.html">Python程序設(shè)計(jì)</a></li>
<li><a href="java.html">Java程序設(shè)計(jì)</a></li>
</ul>
<a href="books.html">Home</a>

(4) network.html

<h3>計(jì)算機(jī)網(wǎng)絡(luò)</h3>
<a href="books.html">Home</a>

(5) mysql.html

<h3>MySQL數(shù)據(jù)庫(kù)</h3>
<a href="books.html">Home</a>

(6) python.html

<h3>Python程序設(shè)計(jì)</h3>
<a href="books.html">Home</a>

(7) java.html

<h3>Java程序設(shè)計(jì)</h3>
<a href="books.html">Home</a>

Then we use Flask to write a web program server.py that serves them:

import flask
import os

app = flask.Flask(__name__)

def getFile(fileName):
    data = b""
    if os.path.exists(fileName):
        fobj = open(fileName, "rb")
        data = fobj.read()
        fobj.close()
    return data

@app.route("/")
def index():
    return getFile("books.html")

@app.route("/<section>")
def process(section):
    data = ""
    if section != "":
        data = getFile(section)
    return data

if __name__ == "__main__":
    app.run()

The default address of this web program is http://127.0.0.1:5000. Visiting it runs the index() function and returns the books.html page, as shown in Figure 4-2-1 (the web site). Clicking each hyperlink leads to the database, programming and computer-network pages; the structure of these pages is in fact a tree, as shown in Figure 4-2-2 (site structure).

2.2 Crawling Data with a Recursive Program

Now we design a client program client.py that crawls the <h3> heading of every page on this site. The design is:

(1) Keep a urls list recording the pages already visited;
(2) Start from books.html;
(3) Visit a page, record its address in urls, and extract its <h3> heading;
(4) Collect the href values of all <a> hyperlinks on the page into a links list;
(5) Loop over links; each link points to another page, so recurse back to step (3);
(6) Continue with the next link until every link has been traversed.

from bs4 import BeautifulSoup
import urllib.request

def spider(url):
    global urls
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        print(soup.find("h3").text)
        links = soup.select("a")
        for link in links:
            href = link["href"]
            url = start_url + "/" + href
            if url not in urls:
                urls.append(url)
                spider(url)
    except Exception as err:
        print(err)

start_url = "http://127.0.0.1:5000"
urls = [start_url + "/books.html"]
spider(urls[0])

Program output:

計(jì)算機(jī)
數(shù)據(jù)庫(kù)
MySQL數(shù)據(jù)庫(kù)
程序設(shè)計(jì)
Python程序設(shè)計(jì)
Java程序設(shè)計(jì)
計(jì)算機(jī)網(wǎng)絡(luò)
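The urls list above answers "have we visited this page?" with a linear scan; a Python set answers it in constant time. Below is a minimal offline sketch of the same recursive traversal, where the site is replaced by an illustrative PAGES dict (PAGES and crawl are stand-ins for this example, not part of the book's code):

```python
# Recursive crawl with a set as the visited record.
# PAGES mocks the site: each page maps to the pages it links to,
# including the "Home" back-links to books.html.
PAGES = {
    "books.html": ["database.html", "program.html", "network.html"],
    "database.html": ["mysql.html", "books.html"],
    "program.html": ["python.html", "java.html", "books.html"],
    "network.html": ["books.html"],
    "mysql.html": ["books.html"],
    "python.html": ["books.html"],
    "java.html": ["books.html"],
}

def crawl(start):
    visited = set()          # O(1) membership tests
    order = []               # records the visiting order
    def spider(url):
        visited.add(url)
        order.append(url)
        for link in PAGES.get(url, []):
            if link not in visited:
                spider(link)
    spider(start)
    return order

print(crawl("books.html"))
```

The visiting order is the same as the recursive crawler's output above; only the bookkeeping container changed.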
2.3 Depth-First Crawling

If we do not use recursion to crawl the site in depth-first order, we can use a stack instead. Implementing a stack in Python is very simple, because a Python list already behaves like one; it is easy to wrap it in a Stack class of our own:

class Stack:
    def __init__(self):
        self.st = []
    def pop(self):
        return self.st.pop()
    def push(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

Here push() pushes an element, pop() pops one, and empty() tests whether the stack is empty. With the Stack class, the depth-first crawler is designed as follows:

(1) Keep a urls list recording the addresses already visited;
(2) Push the first url onto the stack;
(3) If the stack is empty the program ends; otherwise pop a url and crawl its <h3> heading;
(4) Collect the href values of all the page's <a> hyperlinks into a links list; for each resulting url not already in urls, push it onto the stack;
(5) Go back to (3).

Based on this idea the crawler client.py is written as follows (note that the links are pushed in reverse order, so that they are popped in their original order):

from bs4 import BeautifulSoup
import urllib.request

class Stack:
    def __init__(self):
        self.st = []
    def pop(self):
        return self.st.pop()
    def push(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

def spider(url):
    global urls
    stack = Stack()
    stack.push(url)
    while not stack.empty():
        url = stack.pop()
        try:
            data = urllib.request.urlopen(url)
            data = data.read()
            data = data.decode()
            soup = BeautifulSoup(data, "lxml")
            print(soup.find("h3").text)
            links = soup.select("a")
            for i in range(len(links) - 1, -1, -1):
                href = links[i]["href"]
                url = start_url + "/" + href
                if url not in urls:
                    urls.append(url)
                    stack.push(url)
        except Exception as err:
            print(err)

start_url = "http://127.0.0.1:5000"
urls = [start_url + "/books.html"]
spider(urls[0])

Program output:

計(jì)算機(jī)
數(shù)據(jù)庫(kù)
MySQL數(shù)據(jù)庫(kù)
程序設(shè)計(jì)
Python程序設(shè)計(jì)
Java程序設(shè)計(jì)
計(jì)算機(jī)網(wǎng)絡(luò)
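The standard library's collections.deque can also serve as the stack, without defining a class of our own. The following sketch runs offline by mocking the site with a PAGES dict (an illustrative stand-in, not part of the book's code):

```python
from collections import deque

# The mocked site: each page maps to the pages it links to,
# including the "Home" back-links to books.html.
PAGES = {
    "books.html": ["database.html", "program.html", "network.html"],
    "database.html": ["mysql.html", "books.html"],
    "program.html": ["python.html", "java.html", "books.html"],
    "network.html": ["books.html"],
    "mysql.html": ["books.html"],
    "python.html": ["books.html"],
    "java.html": ["books.html"],
}

def dfs(start):
    stack = deque([start])
    visited = {start}
    order = []
    while stack:
        url = stack.pop()                  # LIFO pop: depth-first
        order.append(url)
        for link in reversed(PAGES[url]):  # reversed, as in the book's loop
            if link not in visited:
                visited.add(link)
                stack.append(link)
    return order

print(dfs("books.html"))
```

The visiting order matches the Stack-based crawler above, since deque.append/deque.pop behave exactly like push()/pop().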
2.4 Breadth-First Crawling

A site tree can also be traversed in breadth-first order, which requires a queue. Implementing a queue in Python is just as simple, because a list can serve as one; it is easy to design a Queue class of our own:

class Queue:
    def __init__(self):
        self.st = []
    def fetch(self):
        return self.st.pop(0)
    def enter(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

Here enter() appends an element to the queue, fetch() removes the front element, and empty() tests whether the queue is empty. With the Queue class, the breadth-first crawler is designed as follows:

(1) Keep a urls list recording the addresses already visited;
(2) Enter the first url into the queue;
(3) If the queue is empty the program ends; otherwise fetch a url and crawl its <h3> heading;
(4) Collect the href values of all the page's <a> hyperlinks into a links list; for each resulting url not already in urls, enter it into the queue;
(5) Go back to (3).

Based on this idea the crawler client.py is:

from bs4 import BeautifulSoup
import urllib.request

class Queue:
    def __init__(self):
        self.st = []
    def fetch(self):
        return self.st.pop(0)
    def enter(self, obj):
        self.st.append(obj)
    def empty(self):
        return len(self.st) == 0

def spider(url):
    global urls
    queue = Queue()
    queue.enter(url)
    while not queue.empty():
        url = queue.fetch()
        try:
            data = urllib.request.urlopen(url)
            data = data.read()
            data = data.decode()
            soup = BeautifulSoup(data, "lxml")
            print(soup.find("h3").text)
            links = soup.select("a")
            for link in links:
                href = link["href"]
                url = start_url + "/" + href
                if url not in urls:
                    urls.append(url)
                    queue.enter(url)
        except Exception as err:
            print(err)

start_url = "http://127.0.0.1:5000"
urls = [start_url + "/books.html"]
spider(urls[0])

Program output:

計(jì)算機(jī)
數(shù)據(jù)庫(kù)
程序設(shè)計(jì)
計(jì)算機(jī)網(wǎng)絡(luò)
MySQL數(shù)據(jù)庫(kù)
Python程序設(shè)計(jì)
Java程序設(shè)計(jì)

03 Crawling Multi-Page Website Data
3.1 Travel website server
3.2 Analysing the site's data
3.3 Writing the crawler

3.1 Travel Website Server

1. The travel data

The project4 folder contains travels.csv with the record number ID, the title Title, the date Date and the image file extension Ext, in the same format shown in Section 1.1:

ID,Title,Date,Ext
000001,Cool sports give city healthy options,2024-07-20 12:07,jpeg
000002,Belgians harvest the good life in Guizhou,2024-07-02 07:50,jpeg
(omitted)
000070,Chongqing a city full of surprises,2023-06-07 12:03,jpeg
000071,Tourists enjoy kiteboarding in Hainan,2023-06-05 10:02,jpeg

The data set has 71 travel items; each item has an image file named ID+Ext and a content file named ID+".html", both stored in project4/static, as shown in Figure 4-3-1 (image files and content files).

2. The site template

Design a template file travel.html under project4/templates:

<style>
.first {display:inline-block; width:150px; height:90px;}
.second {display:inline-block;}
</style>
{% for item in items %}
<div>
<span class="first">
<a target="_blank" shape="rect" href="/static/{{item.ID}}.html">
<img width="120" height="80" src="/static/{{item.Image}}"></a>
</span>
<span class="second">
<div><b>{{item.ID}}</b></div>
<div><a shape="rect" href="/static/{{item.ID}}.html">{{item.Title}}</a></div>
<div>{{item.Date}}</div>
</span>
</div>
{% endfor %}
<div id="pagnation" style="text-align:center;">
<a href="?pageIndex=1">第一頁(yè)</a>
{% if pageIndex > 1 %}
<a href="?pageIndex={{pageIndex-1}}">上一頁(yè)</a>
{% endif %}
{% if pageIndex < pageCount %}
<a href="?pageIndex={{pageIndex+1}}">下一頁(yè)</a>
{% endif %}
<a href="?pageIndex={{pageCount}}">最末頁(yè)</a>
<span>第{{pageIndex}}頁(yè)/共{{pageCount}}頁(yè)</span>
</div>

3. The site server

Design the site server:

import flask
import os

app = flask.Flask(__name__)

@app.route("/")
def index():
    pageIndex = flask.request.values.get("pageIndex", "1")
    pageIndex = int(pageIndex)
    pageSize = 4
    startIndex = (pageIndex - 1) * pageSize + 1
    endIndex = startIndex + pageSize
    fobj = open("travels.csv", "rt", encoding="utf-8")
    data = fobj.readlines()
    count = len(data)
    items = []
    j = 0
    for i in range(1, count):
        s = data[i].strip().split(",")
        if len(s) == 4:
            j += 1
            if j >= startIndex and j < endIndex:
                item = {"ID": s[0], "Title": s[1], "Date": s[2], "Image": s[0] + "." + s[3]}
                items.append(item)
    fobj.close()
    pageCount = j // pageSize
    if j % pageSize != 0:
        pageCount += 1
    return flask.render_template("travel.html", items=items, pageIndex=pageIndex, pageCount=pageCount)

if __name__ == "__main__":
    app.run(debug=True)

Run the server and browse the site; the effect is shown in Figure 4-3-2 (the travel item pages). Use the first/previous/next/last page links to jump between pages; clicking a travel item jumps to that item's content page, as shown in Figure 4-3-3 (travel item content).

3.2 Analysing the Site's Data

Inspecting the site's HTML makes it easy to see that each page contains several <span class="second"> elements; each of them holds the item's ID, Title and Date, as well as the address of its Content page. So we first find all the <span class="second"> elements, then search within each <span> in a loop.
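The selector logic described above can be tried on a small HTML fragment before pointing the crawler at the live server (requires the bs4 package; the fragment mirrors the template's structure, and the stdlib "html.parser" is used here so lxml is not needed):

```python
from bs4 import BeautifulSoup

# A fragment shaped like one item rendered by travel.html.
html = """
<div>
<span class="first"><a href="/static/000001.html"><img src="/static/000001.jpeg"></a></span>
<span class="second">
<div><b>000001</b></div>
<div><a href="/static/000001.html">Cool sports give city healthy options</a></div>
<div>2024-07-20 12:07</div>
</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.select_one("div span[class='second']")   # the data-carrying span
ID = span.select_one("div b").text                   # number in the first inner div
title = span.select_one("div a").text                # first <a> under a div: the title link
date = span.select("div")[-1].text                   # last inner div holds the date
print(ID, title, date)
```

The same three selectors are what the crawler in the next section applies to every <span class="second"> on every page.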
3.3 Writing the Crawler

Since the site's pages are chained together by the "下一頁(yè)" (next page) link, we design spider(url) to crawl the page at url, then find the address of the next page and recursively call spider(url) again. Based on this analysis the program is written as follows:

from bs4 import BeautifulSoup
import urllib.request

def getContent(url):
    data = urllib.request.urlopen(url)
    data = data.read()
    data = data.decode()
    soup = BeautifulSoup(data, "lxml")
    return soup.select_one("pre").text

def spider(url):
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        spans = soup.select("div span[class='second']")
        for span in spans:
            link = span.select_one("div a")
            title = link.text
            ID = span.select_one("div b").text
            date = span.select("div")[-1].text
            url = urllib.request.urljoin(start_url, link["href"])
            content = getContent(url)
            print(ID)
            print(title)
            print(date)
            print(content[:20])
        links = soup.select("div[id='pagnation'] a")
        nextUrl = ""
        for link in links:
            if link.text == "下一頁(yè)":
                nextUrl = urllib.request.urljoin(start_url, link["href"])
                break
        if nextUrl != "":
            spider(nextUrl)
    except Exception as err:
        print(err)

start_url = "http://127.0.0.1:5000"
spider(start_url)
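Following the "下一頁(yè)" link by recursion adds one call-stack frame per page; the same pattern can be written as a loop. Here is a sketch with the downloading and parsing stubbed out by an illustrative PAGES dict (the dict and its keys are assumptions for this example, not the book's code):

```python
# Iterative version of the "crawl page, then follow the next-page link" pattern.
# PAGES maps a page URL to (payload, next_url_or_None) — a stand-in for
# downloading the page, scraping its items, and locating the 下一頁(yè) link.
PAGES = {
    "?pageIndex=1": ("items 1-4", "?pageIndex=2"),
    "?pageIndex=2": ("items 5-8", "?pageIndex=3"),
    "?pageIndex=3": ("items 9-12", None),
}

def spider(url):
    seen = []
    while url:                      # stop when no next-page link is found
        payload, url = PAGES[url]   # real crawler: fetch, parse, find 下一頁(yè)
        seen.append(payload)
    return seen

print(spider("?pageIndex=1"))
```

The loop visits the pages in the same order as the recursive version, but uses constant stack depth no matter how many pages the site has.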
04 Multithreading in Python
4.1 Python daemon threads
4.2 Waiting for a thread

4.1 Python Daemon Threads

To start a thread in Python, create an object with the Thread class of the threading package. Its basic form is:

t = threading.Thread(target=func, args=())

where target is the function the thread will execute and args is a tuple or list supplying the arguments of target; calling t.start() starts the thread. The following program starts a child thread in the main thread to run reading():

import threading
import time
import random

def reading():
    for i in range(10):
        print("reading", i)
        time.sleep(random.randint(1, 2))

r = threading.Thread(target=reading)
r.setDaemon(False)
r.start()
print("The End")

Program output:

reading 0
The End
reading 1
reading 2
reading 3
reading 4
...

From the output we can see that the main thread finishes right after starting the child thread r, but the child thread has not finished: it keeps printing the remaining lines until its loop completes. The call r.setDaemon(False) marks r as a non-daemon thread, and such a thread does not end just because the main thread has ended.

Starting a daemon thread:

import threading
import time
import random

def reading():
    for i in range(5):
        print("reading", i)
        time.sleep(random.randint(1, 2))

r = threading.Thread(target=reading)
r.setDaemon(True)
r.start()
print("The End")

Program output:

reading 0
The End

Evidently the child thread ends as soon as the main thread ends: that is a daemon thread.

4.2 Waiting for a Thread

In a multithreaded program one thread (for example the main thread) often has to wait for other threads to finish before continuing, which is done with the join() function:

thread_object.join()

When a thread executes this statement it stops and waits until the thread of the given thread object has finished, i.e. the statement performs a blocking wait. For example:

import threading
import time
import random

def reading():
    for i in range(5):
        print("reading", i)
        time.sleep(random.randint(1, 2))

t = threading.Thread(target=reading)
t.setDaemon(False)
t.start()
t.join()
print("The End")

Program output:

reading 0
reading 1
reading 2
reading 3
reading 4
The End

The main thread starts the child thread t running reading(); t.join() blocks the main thread until t has finished, after which t.join() returns and execution continues, printing The End.

05 Crawling Travel Website Images
5.1 Redesigning the travel website
5.2 Single-threaded image crawling
5.3 Multithreaded image crawling

5.1 Redesigning the Travel Website

To simulate a real network environment, where downloading an image is a fairly slow process, we rework how the travel site serves its images.

1. The template file

In the travel.html template in project4/templates, the image display line is changed to:

<img width="120" height="80" src="/getImage/{{item.Image}}">

The rest of the template (the item layout and the pagination links) stays the same as in Section 3.1.

2. The site server

The server keeps the index() function of Section 3.1, additionally imports random and time, and gains a new route that serves the images:

@app.route("/getImage/<name>")
def getImage(name):
    img = b""
    if os.path.exists("static/" + name):
        fobj = open("static/" + name, "rb")
        img = fobj.read()
        fobj.close()
    time.sleep(random.random() * 3)
    return img

The getImage(name) function returns the binary data of an image file; to simulate a slow download it executes

time.sleep(random.random() * 3)

which delays each image by 0 to 3 seconds, so the images appear rather slowly when browsing the site.
5.2 Single-Threaded Image Crawling

We now write a crawler that fetches all the images of the travel site one by one, i.e. every download runs in the main thread:

import urllib.response
from bs4 import BeautifulSoup
import urllib.request
import os
import time

def download(src):
    try:
        p = src.rfind("/")
        name = src[p + 1:]
        data = urllib.request.urlopen(src, timeout=10)
        data = data.read()
        if len(data) > 0:
            fobj = open("downloadImages/" + name, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded", src)
    except Exception as err:
        print(err)

def spider(url):
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        images = soup.select("div span[class='first'] img")
        for image in images:
            src = urllib.request.urljoin(start_url, image["src"])
            download(src)
        links = soup.select("div[id='pagnation'] a")
        url = ""
        for link in links:
            if link.text == "下一頁(yè)":
                url = urllib.request.urljoin(start_url, link["href"])
                break
        if url != "":
            spider(url)
    except Exception as err:
        print(err)

if os.path.exists("downloadImages") == False:
    os.mkdir("downloadImages")
start_url = "http://127.0.0.1:5000"
startTime = time.time()
spider(start_url)
endTime = time.time()
print("time used: %.2f seconds" % (endTime - startTime))

The download(src) function downloads the image at address src and stores it in the downloadImages folder. Start the travel site server and run the crawler; part of the output is:

http://127.0.0.1:5000
downloaded http://127.0.0.1:5000/getImage/000001.jpeg
downloaded http://127.0.0.1:5000/getImage/000002.jpeg
downloaded http://127.0.0.1:5000/getImage/000003.png
downloaded http://127.0.0.1:5000/getImage/000004.png
http://127.0.0.1:5000?pageIndex=2
downloaded http://127.0.0.1:5000/getImage/000005.jpeg
downloaded http://127.0.0.1:5000/getImage/000006.jpeg
downloaded http://127.0.0.1:5000/getImage/000007.jpeg
downloaded http://127.0.0.1:5000/getImage/000008.jpeg
http://127.0.0.1:5000?pageIndex=3
(omitted)
http://127.0.0.1:5000?pageIndex=17
downloaded http://127.0.0.1:5000/getImage/000065.jpeg
downloaded http://127.0.0.1:5000/getImage/000066.jpeg
downloaded http://127.0.0.1:5000/getImage/000067.jpeg
downloaded http://127.0.0.1:5000/getImage/000068.jpeg
http://127.0.0.1:5000?pageIndex=18
downloaded http://127.0.0.1:5000/getImage/000069.jpeg
downloaded http://127.0.0.1:5000/getImage/000070.jpeg
downloaded http://127.0.0.1:5000/getImage/000071.jpeg
time used: 204.31 seconds

As we can see, each time the crawler enters a page it downloads that page's images one after another in strict sequence, taking about 204 seconds in total.
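Besides starting one Thread per image as the next section does, the standard library's concurrent.futures provides a thread pool that caps how many downloads run at once. A sketch, with the real download replaced by a simulated delay (the names and timings here are illustrative, not the book's code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def download(name):
    # Simulated slow download; a real version would call urllib.request.urlopen.
    time.sleep(0.05)
    return name + " done"

names = ["%06d.jpeg" % i for i in range(1, 9)]   # eight fake image names

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # at most 4 downloads in flight
    results = list(pool.map(download, names))    # preserves input order
elapsed = time.time() - start

print(results[0], "%.2f" % elapsed)
```

With 8 tasks of 0.05 s each and 4 workers, the pool finishes in roughly two rounds instead of eight, while never creating more than 4 threads — a useful middle ground between the single-threaded crawler above and the one-thread-per-image version that follows.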
5.3 Multithreaded Image Crawling

If the downloads are to run in multiple threads, we keep a list TS of all the threads; each time an image is to be downloaded we start a thread T that runs download() and append T to TS:

src = urllib.request.urljoin(start_url, image["src"])
T = threading.Thread(target=download, args=(src,))
TS.append(T)
T.start()

Then, before the main thread ends, we wait for every thread T to finish:

for T in TS:
    T.join()

This guarantees that all the threads, and therefore all the image downloads, are complete before the main program ends. Following this idea the multithreaded crawler is:

import urllib.response
from bs4 import BeautifulSoup
import urllib.request
import os
import time
import threading

def download(src):
    try:
        p = src.rfind("/")
        name = src[p + 1:]
        data = urllib.request.urlopen(src, timeout=10)
        data = data.read()
        if len(data) > 0:
            fobj = open("downloadImages/" + name, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded", src)
    except Exception as err:
        print(err)

def spider(url):
    global TS
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        images = soup.select("div span[class='first'] img")
        for image in images:
            src = urllib.request.urljoin(start_url, image["src"])
            T = threading.Thread(target=download, args=(src,))
            T.start()
            TS.append(T)
        links = soup.select("div[id='pagnation'] a")
        url = ""
        for link in links:
            if link.text == "下一頁(yè)":
                url = urllib.request.urljoin(start_url, link["href"])
                break
        if url != "":
            spider(url)
    except Exception as err:
        print(err)

if os.path.exists("downloadImages") == False:
    os.mkdir("downloadImages")
TS = []
start_url = "http://127.0.0.1:5000"
startTime = time.time()
spider(start_url)
for T in TS:
    T.join()
endTime = time.time()
print("time used: %.2f seconds" % (endTime - startTime))

Part of the output of this program:

http://127.0.0.1:5000
http://127.0.0.1:5000?pageIndex=2
http://127.0.0.1:5000?pageIndex=3
http://127.0.0.1:5000?pageIndex=4
http://127.0.0.1:5000?pageIndex=5
http://127.0.0.1:5000?pageIndex=6
http://127.0.0.1:5000?pageIndex=7
http://127.0.0.1:5000?pageIndex=8
http://127.0.0.1:5000?pageIndex=9
http://127.0.0.1:5000?pageIndex=10
downloaded http://127.0.0.1:5000/getImage/000015.jpeg
http://127.0.0.1:5000?pageIndex=11
http://127.0.0.1:5000?pageIndex=12
http://127.0.0.1:5000?pageIndex=13
http://127.0.0.1:5000?pageIndex=14
http://127.0.0.1:5000?pageIndex=15
http://127.0.0.1:5000?pageIndex=16
http://127.0.0.1:5000?pageIndex=17
http://127.0.0.1:5000?pageIndex=18
downloaded http://127.0.0.1:5000/getImage/000026.jpeg
downloaded http://127.0.0.1:5000/getImage/000039.jpeg
downloaded http://127.0.0.1:5000/getImage/000051.jpeg
downloaded http://127.0.0.1:5000/getImage/000056.jpeg
downloaded http://127.0.0.1:5000/getImage/000059.jpeg
downloaded http://127.0.0.1:5000/getImage/000049.jpeg
downloaded http://127.0.0.1:5000/getImage/000045.jpeg
time used: 2.09 seconds

06 Comprehensive Project: Crawling the Simulated Travel Website Data
6.1 Storing the travel data
6.2 Writing the crawler

6.1 Storing the Travel Data

Based on the data in travels.csv we built a travel website modelled on the real one, as shown in Figure 4-6-1 (the travel website). Now we write a crawler that fetches all of its data and images and stores the data in a database; this project combines the knowledge and skills learned so far and prepares for crawling the real travel website afterwards. The crawled data are stored in a database travels.db containing a travels table, whose structure is shown in Table 4-6-1:

Table 4-6-1 Database table structure
Field    Type                     Description
ID       varchar(8), primary key  record number
Title    varchar(256)             title
Date     varchar(256)             date
Ext      varchar(256)             image extension
Content  text                     content

6.2 Writing the Crawler

import urllib.response
from bs4 import BeautifulSoup
import urllib.request
import os
import time
import threading
from database import Database

def download(src):
    try:
        p = src.rfind("/")
        name = src[p + 1:]
        data = urllib.request.urlopen(src, timeout=100)
        data = data.read()
        if len(data) > 0:
            fobj = open("downloadImages/" + name, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded", src)
    except Exception as err:
        print(err)

def getContent(url):
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        return soup.select_one("pre").text
    except Exception as err:
        print(err)
        return ""

def initializeDownload():
    # initialize the downloadImages folder
    if not os.path.exists("downloadImages"):
        os.mkdir("downloadImages")
    fs = os.listdir("downloadImages")
    for f in fs:
        os.remove("downloadImages\\" + f)

def spider(url):
    global TS, db
    print(url)
    try:
        data = urllib.request.urlopen(url)
        data = data.read()
        data = data.decode()
        soup = BeautifulSoup(data, "lxml")
        spans = soup.select("div span[class='second']")
        for span in spans:
            link = span.select_one("div a")
            title = link.text
            ID = span.select_one("div b").text
            date = span.select("div")[-1].text
            url = urllib.request.urljoin(start_url, link["href"])
            content = getContent(url)
            # the span's parent is a div whose first span holds the image
            div = span.parent
            image = div.select_one("span[class='first'] img")
            src = urllib.request.urljoin(start_url, image["src"])
            p = src.rfind(".")
            ext = src[p:]
            db.insert(ID, title, date, ext, content)
            print(ID, title)
            T = threading.Thread(target=download, args=(src,))
            T.start()
            TS.append(T)
        links = soup.select("div[id='pagnation'] a")
        url = ""
        for link in links:
            if link.text == "下一頁(yè)":
                url = urllib.request.urljoin(start_url, link["href"])
                break
        if url != "":
            spider(url)
    except Exception as err:
        print(err)

if os.path.exists("downloadImages") == False:
    os.mkdir("downloadImages")
TS = []
db = None
start_url = "http://127.0.0.1:5000"
while True:
    print("1. Spider")
    print("2. Show")
    print("3. Exit")
    choice = input("Enter your choice (1,2,3): ")
    if choice == "1":
        startTime = time.time()
        initializeDownload()
        db = Database()
        db.open()
        db.initialize()
        spider(start_url)
        db.close()
        for T in TS:
            T.join()
        endTime = time.time()
        print("time used: %.2f seconds" % (endTime - startTime))
    elif choice == "2":
        db = Database()
        db.open()
        db.show()
        db.close()
    elif choice == "3":
        break
    else:
        print("Invalid choice")

07 Hands-On Project: Crawling a Real Travel Website's Data
7.1 Analysing the site's pages
7.2 Storing the site's data
7.3 Writing the crawler

7.1 Analysing the Site's Pages

To crawl the text and images of these pages we must first analyse the page structure. Browse the site with Chrome, right-click the page and choose "Inspect"; the page structure is shown in Figure 4-7-2 (site structure). Each item sits inside a <div> element; one of them, copied out, looks like this:

<div class="mb10 tw3_01_2">
<span class="tw3_01_2_p">
<a target="_blank" shape="rect" href="http:///a/202407/23/WS669f294da31095c51c50f70c.html">
<img width="200" height="130" src="http:///images/202407/23/669f294da31095c551b59e96.jpeg"></a>
</span>
<span class="tw3_01_2_t">
<h4><a shape="rect" href="http:///a/202407/23/WS669f294da31095c51c50f70c.html">Summer travel heats up in China's 'ice city'</a></h4>
<b>2024-07-23 11:53</b>
</span>
</div>

All the items are contained in a <div class='lft_art lf'> element, under which there is a series of div[class='mb10 tw3_01_2'] elements — these are the items. Clicking one of the items shows its detailed content, as in Figure 4-7-3 (detailed content). The detail usually contains both text and images and often spans several pages; to keep things simple we only fetch the text of the first page, which is contained in the <p> elements under a <div id="Content"> element.

7.2 Storing the Site's Data

The data can be stored in a SQLite3 database travels.db. We design a table items with a primary-key number ID, the title tTitle, the date tDate, the content tContent and the image extension tExt; the structure is shown in Table 4-7-1:

Table 4-7-1 Database table field structure
Field     Type                     Description
ID        varchar(6), primary key  number
tDate     varchar(16)              date
tTitle    varchar(1024)            title
tContent  text                     content
tExt      varchar(8)               image extension

7.3 Writing the Crawler

from bs4 import BeautifulSoup
import urllib.request
import sqlite3
import os
import time
import threading

class Database:
    def open(self):
        # open the database
        self.con = sqlite3.connect("travels.db")
        self.cursor = self.con.cursor()

    def close(self):
        # close the database
        self.con.commit()
        self.con.close()

    def initialize(self):
        # initialize the database: (re)create the items table
        try:
            self.cursor.execute("drop table items")
        except:
            pass
        self.cursor.execute("create table items (ID varchar(8) primary key, tDate varchar(16), tTitle varchar(1024), tContent text, tExt varchar(8))")

    def insert(self, ID, tDate, tTitle, tContent, tExt):
        # insert one record
        try:
            self.cursor.execute("insert into items (ID,tDate,tTitle,tContent,tExt) values (?,?,?,?,?)", [ID, tDate, tTitle, tContent, tExt])
        except Exception as err:
            print(err)

    def show(self):
        # display the stored data
        self.cursor.execute("select ID,tDate,tTitle,tContent,tExt from items order by ID")
        rows = self.cursor.fetchall()
        for row in rows:
            print(row[0])
            print(row[1])
            print(row[2])
            print(row[3])
            print(row[4])
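The items schema above can be exercised directly with the sqlite3 module before wiring it into the Database class; this sketch uses an in-memory database as a stand-in for travels.db, and the inserted row is sample data, not crawled output:

```python
import sqlite3

# In-memory stand-in for travels.db, using the same items schema.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table items (ID varchar(8) primary key, tDate varchar(16),"
            " tTitle varchar(1024), tContent text, tExt varchar(8))")
cur.execute("insert into items (ID,tDate,tTitle,tContent,tExt) values (?,?,?,?,?)",
            ["000001", "2024-07-20 12:07", "Cool sports give city healthy options",
             "sample content", ".jpeg"])
con.commit()
cur.execute("select ID,tTitle,tExt from items order by ID")
rows = cur.fetchall()   # each row comes back as a tuple
print(rows)
con.close()
```

Because ID is the primary key, inserting the same ID twice raises an IntegrityError — which is exactly why the insert() method above wraps its execute() call in try/except.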