A Detailed Guide to the urllib Library

tuan_liu

2018-04-01

python, web-scraping

It has become appallingly obvious that our technology has exceeded our humanity.
-- Albert Einstein

First stop on the Python web-scraping journey~~

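Let's start with urllib, Python's built-in HTTP request library. It has four main modules:

urllib.request: the request module, which we use to send a (simulated) request
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module, for splitting, joining and otherwise manipulating URLs
urllib.robotparser: the robots.txt parsing module, used to parse a site's robots.txt file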

Below we walk through the common ways of using urllib:

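The urlopen function:
urllib.request.urlopen(url, data=None, [timeout])
The commonly used parameters of urlopen are url, data and timeout. url is the address of the site to request, data carries the request body to send (supplying it turns the request into a POST), and timeout is the number of seconds to wait before giving up.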

Here is the code:

import urllib.request
response=urllib.request.urlopen("http://aisleep.xyz")
print(response.read().decode('utf-8'))

Here .read() fetches the content of the response. It comes back as bytes, so .decode('utf-8') is needed to turn it into a string. This is a GET request made through urllib.request.
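As a small offline sketch of the same read-then-decode pattern, the example below uses a data: URL as a stand-in for a real site (an assumption made here so that no network is needed). It also takes the charset from the response headers rather than hard-coding it, falling back to UTF-8 when none is declared.

```python
import urllib.request

# Assumption: a data: URL stands in for a real site so the example runs offline.
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url) as response:
    raw = response.read()  # the body always arrives as bytes
    # Use the charset declared in the headers, or fall back to UTF-8
    charset = response.headers.get_content_charset() or "utf-8"
    text = raw.decode(charset)

print(type(raw).__name__, charset, text)
```

Using urlopen in a with statement closes the response automatically once the block ends.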

Next:

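import urllib.parse
import urllib.request

# Encode the form fields as a query string, then convert it to bytes
data=bytes(urllib.parse.urlencode({'word': 'hello'}),encoding='utf8')
# httpbin's /post endpoint accepts POST requests and echoes the submitted data
response=urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())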

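Here we pass in data, which must be of type bytes. urlencode converts the dictionary into a query string, and the encoding argument then turns that string into bytes. Because data is supplied, this is a POST request.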
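To make the two steps visible without touching the network, here is a minimal sketch that prints the intermediate values and then builds (but does not send) two Request objects. The httpbin.org URLs are only placeholders; get_method() shows that supplying data is what switches the request from GET to POST.

```python
from urllib import parse, request

form = {'word': 'hello'}
encoded = parse.urlencode(form)   # str: 'word=hello'
data = encoded.encode('utf-8')    # bytes: b'word=hello'
print(encoded, data)

# Nothing is sent here; we only inspect which HTTP method would be used.
get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=data)
print(get_req.get_method(), post_req.get_method())
```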

Now let's look at code that uses the timeout parameter:

import urllib.request
import socket
import urllib.error

try:
	response=urllib.request.urlopen('http://httpbin.org/get',timeout=1)
	print(response.read())
except urllib.error.URLError as e:
	if isinstance(e.reason,socket.timeout):
		print("time out")

This wraps the request in try-except to handle exceptions and sets the timeout parameter. It also touches on urllib.error, which is explained later.
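One detail worth sketching: a timeout while connecting is usually wrapped in URLError, but a timeout while waiting for the response can surface as a bare socket.timeout. The helper below, fetch_or_none, is a hypothetical name introduced for this illustration; it handles both cases and re-raises anything that is not a timeout.

```python
import socket
import urllib.error
import urllib.request

def fetch_or_none(url, timeout=1.0):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout wrapped by urllib (typically while connecting)
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # some other URL or HTTP error: let the caller see it
    except socket.timeout:
        # Timeout raised directly (typically while reading the response)
        return None
```

A call such as fetch_or_none('http://httpbin.org/get', timeout=1) then returns either the body or None, so the caller can decide whether to retry.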

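Next, let's look at the response returned by urlopen (the parts used most often are the status code, the response headers and the response body):
Code 1: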

import urllib.request
response=urllib.request.urlopen('http://baidu.com')
print(type(response))

Code 2:

import urllib.request
response=urllib.request.urlopen('https://baidu.com')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
print(response.read().decode('utf-8'))

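Code 1 prints the type of the response object. In Code 2, line 3 prints the status code (a successful request generally returns 200), line 4 prints all of the response headers, line 5 prints the value of the Server header, and line 6 uses the familiar read method to get the response body.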

The Request object

Code 1:

import urllib.request
request=urllib.request.Request('http://aisleep.xyz')
response=urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Code 2:

from urllib import request,parse
url="http://httpbin.org/post"
form_data={
    'name':'lifan'
}
data=bytes(parse.urlencode(form_data),encoding='utf8')
req=request.Request(url=url,data=data,method='POST')
req.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response=request.urlopen(req)
print(response.read().decode('utf-8'))

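Code 2 passes the url, data and headers to a Request object together and then calls urlopen on it. Line 8 adds a header: you can use the dedicated add_header method, or build a dictionary of headers (in the same way as the form-data dictionary in this example) and pass it to Request.
A Request object lets us bundle the request method, request headers, request body and other parameters into a single unit that is sent to the server. Compared with calling urlopen directly on a URL, building a Request first and then calling urlopen lets you pass more kinds of parameters than just url, timeout and data.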
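The dictionary alternative to add_header can be sketched as follows. Nothing is sent here; the Request is only built and inspected. The URL and header value are illustrative. Note that Request stores header names capitalized, so the lookup key is 'User-agent'.

```python
from urllib import parse, request

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
data = parse.urlencode({'name': 'lifan'}).encode('utf-8')

# Pass method, body and headers to the constructor in one go
req = request.Request('http://httpbin.org/post', data=data,
                      headers=headers, method='POST')

print(req.get_method())
print(req.full_url)
print(req.get_header('User-agent'))  # header names are stored capitalized
```

Passing req to request.urlopen(req) would then send it exactly as in Code 2.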

Proxies (handlers)

import urllib.request
proxy_handler=urllib.request.ProxyHandler({
	'https': 'https://127.0.0.1:9743',
	'http': 'http://127.0.0.1:9743'
	})
opener=urllib.request.build_opener(proxy_handler)
response=opener.open('http://www.google.com')
print(response.read())

More on this later.

Cookies

Getting cookies

import urllib.request,http.cookiejar
# The CookieJar collects the cookies set by the server's response
cookie=http.cookiejar.CookieJar()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Saving cookies:

# Saving cookies, method 1: Mozilla cookies.txt format
import http.cookiejar,urllib.request
filename="cookie.txt"
cookie=http.cookiejar.MozillaCookieJar(filename)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True,ignore_expires=True)

# Saving cookies, method 2: LWP (libwww-perl) format
import http.cookiejar,urllib.request
filename ='cookie.txt'
cookie=http.cookiejar.LWPCookieJar(filename)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

# Reading the cookie file (cookie.txt must be in LWP format, as written by method 2)
import http.cookiejar,urllib.request
cookie=http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Using cookies is covered in more detail later.
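The save and load steps can also be tried without any network access. The sketch below builds one cookie by hand (normally a server response sets it; the cookie name, value and domain are made up for illustration, and make_cookie is a hypothetical helper), saves it with LWPCookieJar, and loads it back into a fresh jar. Session cookies have no expiry and are marked discard, which is why ignore_discard=True is needed on both sides.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie by hand (illustrative values only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookie.txt')

    # Save: without ignore_discard=True the session cookie would be skipped
    jar = http.cookiejar.LWPCookieJar(filename)
    jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    jar.save(ignore_discard=True, ignore_expires=True)

    # Load into a fresh jar, using the same cookie jar class that wrote the file
    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(filename, ignore_discard=True, ignore_expires=True)

restored = {c.name: c.value for c in loaded}
print(restored)
```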

The error module: exception handling

from urllib import request,error
try:
	response=request.urlopen('http://www.baidu.com')
except error.HTTPError as e:
	print(e.reason,e.code,e.headers,sep='\n')
except error.URLError as e:
	print(e.reason)
else:
    print('Request successful')
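The order of the except clauses above matters because HTTPError is a subclass of URLError: if URLError were caught first, the HTTPError branch would never run. The following sketch shows the relationship offline by constructing the two exceptions by hand (the URL, status code and messages are made up for illustration, and describe is a hypothetical helper).

```python
import io
from email.message import Message
from urllib import error

def describe(exc):
    """Summarize an exception, checking the more specific class first."""
    if isinstance(exc, error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, error.URLError):
        return f"URL error: {exc.reason}"
    return f"other error: {exc!r}"

# Hand-built exceptions standing in for what urlopen would raise
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found',
                            Message(), io.BytesIO())
unreachable = error.URLError('connection refused')

print(issubclass(error.HTTPError, error.URLError))
print(describe(not_found))
print(describe(unreachable))
```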

The urlparse function: URL parsing

from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

urlparse breaks a URL into its components. This code prints:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
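Since ParseResult is a named tuple, each component can be read by name or by index. As a brief sketch, the example below also passes the query string to parse_qs (another function in urllib.parse, not used in the original) to turn it into a dictionary.

```python
from urllib.parse import urlparse, parse_qs

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# Access by attribute name or by position
print(result.scheme, result.netloc, result.path)
print(result[0] == result.scheme)

# parse_qs maps each query key to a list of values
print(parse_qs(result.query))  # {'id': ['5']}
```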

Next:

from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)

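This code prints:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
As you can see, the scheme argument is only a default: if the URL already contains a scheme, the one we specify does not override it.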
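For contrast, here is a short sketch of the case where the default does apply: a URL written without any scheme. Note a side effect that is easy to miss: without a leading '//', the host name is treated as part of the path, so netloc comes back empty.

```python
from urllib.parse import urlparse

# No scheme in the URL, so the scheme argument is used as the default
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')

print(result.scheme)  # https
print(repr(result.netloc), result.path)
```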

Next:

from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)
print(result)

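This code prints:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
So when we pass allow_fragments=False, the fragment text is left attached to the query in front of it. And if there is no query either? Then the fragment text stays attached to whichever component comes before it.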
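The no-query case can be checked with a minimal sketch: with allow_fragments=False and no '?' in the URL, the '#comment' text ends up in the path.

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)

print(result.path)            # /index.html#comment
print(repr(result.query), repr(result.fragment))
```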

The urlunparse function: building a URL

from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

This code prints:
http://www.baidu.com/index.html;user?a=6#comment
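urlunparse is the inverse of urlparse, and it expects exactly six components in the order scheme, netloc, path, params, query, fragment. A small sketch of a common pattern under that assumption: parse a URL, change one component with the named tuple's _replace method, and put it back together.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'
parts = urlparse(url)

# Round trip: parsing and unparsing gives back the original URL
print(urlunparse(parts) == url)

# Swap out only the query, keeping every other component
updated = urlunparse(parts._replace(query='a=7'))
print(updated)  # http://www.baidu.com/index.html;user?a=7#comment
```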

The urljoin function: joining URLs

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','https://www.google.com/faq.html'))
print(urljoin('http://www.baidu.com','https://www.baidu.com'))
print(urljoin('http://www.baidu.com','http://www.baidu.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com','?category=1'))

Running this code gives:

https://www.google.com/faq.html
https://www.baidu.com
http://www.baidu.com/FAQ.html?question=2
http://www.baidu.com?category=1

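Roughly speaking, urljoin works like this: where both the first (base) URL and the second URL supply a component, the second overrides the first; where only the second supplies it, the second's value is used; and where the first has it but the second does not, the first's value is kept.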
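In scraping, urljoin is most often used to resolve relative links found on a page against that page's own URL. The sketch below illustrates this with made-up paths on the same example host; the second argument is resolved the way a browser would resolve an href.

```python
from urllib.parse import urljoin

# A bare file name replaces the last path segment of the base
a = urljoin('http://www.baidu.com/about.html', 'FAQ.html')
# A base ending in '/' is treated as a directory
b = urljoin('http://www.baidu.com/docs/', 'page.html')
# '..' climbs one directory up from the base's directory
c = urljoin('http://www.baidu.com/docs/a.html', '../img/b.png')

print(a)  # http://www.baidu.com/FAQ.html
print(b)  # http://www.baidu.com/docs/page.html
print(c)  # http://www.baidu.com/img/b.png
```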

The urlencode function: converting a dictionary into GET request parameters

from urllib.parse import urlencode
params={
    'name':'lifan',
    'age':23
}
# The trailing '?' separates the base URL from the query string
base_url='http://www.baidu.com?'
url=base_url+urlencode(params)
print(url)

The output is: http://www.baidu.com?name=lifan&age=23
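urlencode also takes care of escaping, which matters as soon as a value contains spaces or characters such as '&'. A brief sketch with made-up parameter names, using parse_qs (also from urllib.parse) to confirm that the original values survive the round trip:

```python
from urllib.parse import urlencode, parse_qs

params = {'q': 'hello world', 'tag': 'a&b'}
query = urlencode(params)

# Spaces become '+', and '&' inside a value is percent-encoded
print(query)  # q=hello+world&tag=a%26b

# Decoding the query string recovers the original values
print(parse_qs(query))
```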