Web Scraping

Crawling data from web pages

The urllib library

The four main modules of urllib:

urllib.request – sends requests

urllib.error – handles exceptions

urllib.parse – parses URLs

urllib.robotparser – parses robots.txt files
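
The request module does the actual fetching; as a quick, minimal sketch of the two helper modules (urllib.parse and urllib.robotparser), here they are pointed at Baidu's public robots.txt, which is just an example target:

from urllib import parse, robotparser

# urllib.parse: split a URL into components and build query strings
parts = parse.urlparse("https://www.baidu.com/s?wd=python")
print(parts.netloc, parts.path, parts.query)  # www.baidu.com /s wd=python
print(parse.urlencode({"wd": "python"}))      # wd=python

# urllib.robotparser: check whether robots.txt allows a crawler to fetch a URL
rp = robotparser.RobotFileParser("https://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))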

Crawling web pages with urllib

The urlopen() method

urlopen(url, data, timeout)

data defaults to None, which sends a GET request; passing a value for data turns the request into a POST

timeout – how many seconds to wait before giving up on the request
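
A minimal sketch of both parameters (httpbin.org is assumed here purely as a test endpoint): URL-encoding a dict and passing the resulting bytes as data makes urlopen() send a POST.

import urllib.parse
import urllib.request

# Passing bytes as `data` switches the request from GET to POST
payload = urllib.parse.urlencode({"wd": "python"}).encode("utf-8")
response = urllib.request.urlopen("https://httpbin.org/post", data=payload, timeout=10)
print(response.read().decode("utf-8"))  # httpbin echoes the posted form data back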

A basic GET example:

import urllib.request as request

# Fetch the page and decode the raw bytes as UTF-8
response = request.urlopen("https://baidu.com")
html = response.read().decode("utf-8")
print(html)
The response object

geturl() – returns the URL that was actually fetched (after any redirects)

info() – returns the response headers

getcode() – returns the HTTP status code
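
For example (a small sketch; the printed values depend on the page fetched):

import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
print(response.geturl())   # the URL that was actually fetched
print(response.getcode())  # HTTP status code, e.g. 200
print(response.info())     # response headers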

The Request object
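
The notes stop at the heading here; a minimal sketch of wrapping a URL in a Request so that custom headers (e.g. a User-Agent) can be attached before calling urlopen():

import urllib.request

req = urllib.request.Request(
    "https://www.baidu.com",
    headers={"User-Agent": "Mozilla/5.0"},  # pretend to be a browser
)
response = urllib.request.urlopen(req, timeout=10)
print(response.read().decode("utf-8"))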

Crawling web pages with the Requests library

Fetch Baidu search results with a GET request, using "python" as the search term

import requests
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random  # random browser User-Agent, to avoid being blocked
}
url = "https://www.baidu.com/s"
param = {"wd": "python"}  # query string: ?wd=python
response = requests.get(url, params=param, headers=headers)
print(response.text)

# Save the page content to a local file
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

Parsing web page data

XPath syntax
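
The core pieces of XPath syntax, shown on a throwaway HTML snippet (the tag names and attributes below are made up for illustration): / selects a direct child, // matches at any depth, [@attr="v"] filters by attribute, [n] indexes from 1, and text() extracts text nodes.

from lxml import etree

html = etree.HTML('<div id="a"><p class="t">hello</p><p>world</p></div>')
print(html.xpath('//p/text()'))                  # ['hello', 'world']  -- // matches at any depth
print(html.xpath('//p[@class="t"]/text()'))      # ['hello']           -- filter by attribute
print(html.xpath('//div[@id="a"]/p[2]/text()'))  # ['world']           -- [n] is 1-based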

Scraping Baidu trending news with XPath

import requests
from fake_useragent import UserAgent
from lxml import etree

tag_url = 'https://top.baidu.com/board?tab=realtime'

headers = {
    'User-Agent': UserAgent().random,
    # 'Cookie': '...',  # add your own cookie here if the page requires one
}
response = requests.get(url=tag_url, headers=headers)
page_text = response.text
tree = etree.HTML(page_text)

# XPaths copied from browser devtools; only the row index div[i] changes between items:
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[1]/div[2]/a/div[1]/text()  <- title of item 1
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[2]/div[2]/a/div[1]/text()  <- title of item 2
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[1]/div[1]/div[2]           <- hot index of item 1
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[2]/div[1]/div[2]           <- hot index of item 2

def hot_new():
    data_list = []
    for i in range(1, 51):
        data_dict = {}
        title = tree.xpath('//*[@id="sanRoot"]/main/div[2]/div/div[2]/div[' + str(i) + ']/div[2]/a/div[1]/text()')[0]
        hot_index = tree.xpath('//*[@id="sanRoot"]/main/div[2]/div/div[2]/div[' + str(i) + ']/div[1]/div[2]/text()')[0]
        data_dict['title'] = title
        data_dict['hot_index'] = hot_index
        data_list.append(data_dict)
    return data_list

data = hot_new()
print(data[0]["title"])

# Save the scraped titles to a local file; call hot_new() once instead of once per loop iteration
with open('test.csv', "w", encoding="utf-8") as f:
    for item in data:
        f.write(item["title"] + "\n")
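
Since the output file is a .csv, the standard csv module would also handle quoting for titles that contain commas; a sketch building on hot_new() above (the file name hot_news.csv is just an example):

import csv

with open('hot_news.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'hot_index'])  # header row
    for item in hot_new():
        writer.writerow([item['title'], item['hot_index']])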