Web Scraping

Crawling data from web pages

The urllib library

The four main modules of urllib:

urllib.request – sends requests

urllib.error – handles exceptions

urllib.parse – parses URLs

urllib.robotparser – parses robots.txt files
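
The request module does the actual fetching; as a quick, minimal sketch of the two helper modules (urllib.parse and urllib.robotparser), here they are pointed at Baidu's public robots.txt, which is just an example target:

from urllib import parse, robotparser

# urllib.parse: split a URL into components and build query strings
parts = parse.urlparse("https://www.baidu.com/s?wd=python")
print(parts.netloc, parts.path, parts.query)  # www.baidu.com /s wd=python
print(parse.urlencode({"wd": "python"}))      # wd=python

# urllib.robotparser: check whether robots.txt allows a crawler to fetch a URL
rp = robotparser.RobotFileParser("https://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))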

Crawling web pages with urllib

The urlopen() method

urlopen(url, data, timeout)

data defaults to None, which sends a GET request; passing a value for data turns the request into a POST

timeout – how many seconds to wait before giving up on the request
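
A minimal sketch of both parameters (httpbin.org is assumed here purely as a test endpoint): URL-encoding a dict and passing the resulting bytes as data makes urlopen() send a POST.

import urllib.parse
import urllib.request

# Passing bytes as `data` switches the request from GET to POST
payload = urllib.parse.urlencode({"wd": "python"}).encode("utf-8")
response = urllib.request.urlopen("https://httpbin.org/post", data=payload, timeout=10)
print(response.read().decode("utf-8"))  # httpbin echoes the posted form data back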

A basic GET example:

import urllib.request as request

# Fetch the page and decode the raw bytes as UTF-8
response = request.urlopen("https://baidu.com")
html = response.read().decode("utf-8")
print(html)
The response object

geturl() – returns the URL that was actually fetched (after any redirects)

info() – returns the response headers

getcode() – returns the HTTP status code
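
For example (a small sketch; the printed values depend on the page fetched):

import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
print(response.geturl())   # the URL that was actually fetched
print(response.getcode())  # HTTP status code, e.g. 200
print(response.info())     # response headers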

The Request object
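
The notes stop at the heading here; a minimal sketch of wrapping a URL in a Request so that custom headers (e.g. a User-Agent) can be attached before calling urlopen():

import urllib.request

req = urllib.request.Request(
    "https://www.baidu.com",
    headers={"User-Agent": "Mozilla/5.0"},  # pretend to be a browser
)
response = urllib.request.urlopen(req, timeout=10)
print(response.read().decode("utf-8"))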

Crawling web pages with the Requests library

Fetch Baidu search results with a GET request, using "python" as the search term

import requests
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random  # random browser User-Agent, to avoid being blocked
}
url = "https://www.baidu.com/s"
param = {"wd": "python"}  # query string: ?wd=python
response = requests.get(url, params=param, headers=headers)
print(response.text)

# Save the page content to a local file
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

Parsing web page data

XPath syntax
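
The core pieces of XPath syntax, shown on a throwaway HTML snippet (the tag names and attributes below are made up for illustration): / selects a direct child, // matches at any depth, [@attr="v"] filters by attribute, [n] indexes from 1, and text() extracts text nodes.

from lxml import etree

html = etree.HTML('<div id="a"><p class="t">hello</p><p>world</p></div>')
print(html.xpath('//p/text()'))                  # ['hello', 'world']  -- // matches at any depth
print(html.xpath('//p[@class="t"]/text()'))      # ['hello']           -- filter by attribute
print(html.xpath('//div[@id="a"]/p[2]/text()'))  # ['world']           -- [n] is 1-based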

Scraping Baidu trending news with XPath

import requests
from fake_useragent import UserAgent
from lxml import etree

tag_url = 'https://top.baidu.com/board?tab=realtime'

headers = {
    'User-Agent': UserAgent().random,
    # 'Cookie': '...',  # add your own cookie here if the page requires one
}
response = requests.get(url=tag_url, headers=headers)
page_text = response.text
tree = etree.HTML(page_text)

# XPaths copied from browser devtools; only the row index div[i] changes between items:
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[1]/div[2]/a/div[1]/text()  <- title of item 1
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[2]/div[2]/a/div[1]/text()  <- title of item 2
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[1]/div[1]/div[2]           <- hot index of item 1
# //*[@id="sanRoot"]/main/div[2]/div/div[2]/div[2]/div[1]/div[2]           <- hot index of item 2

def hot_new():
    data_list = []
    for i in range(1, 51):
        data_dict = {}
        title = tree.xpath('//*[@id="sanRoot"]/main/div[2]/div/div[2]/div[' + str(i) + ']/div[2]/a/div[1]/text()')[0]
        hot_index = tree.xpath('//*[@id="sanRoot"]/main/div[2]/div/div[2]/div[' + str(i) + ']/div[1]/div[2]/text()')[0]
        data_dict['title'] = title
        data_dict['hot_index'] = hot_index
        data_list.append(data_dict)
    return data_list

data = hot_new()
print(data[0]["title"])

# Save the scraped titles to a local file; call hot_new() once instead of once per loop iteration
with open('test.csv', "w", encoding="utf-8") as f:
    for item in data:
        f.write(item["title"] + "\n")
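
Since the output file is a .csv, the standard csv module would also handle quoting for titles that contain commas; a sketch building on hot_new() above (the file name hot_news.csv is just an example):

import csv

with open('hot_news.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'hot_index'])  # header row
    for item in hot_new():
        writer.writerow([item['title'], item['hot_index']])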