Overview
When people hear the word "crawler", their first impression is usually that it sounds cool and powerful. For most people the understanding stops there, without any deeper thought about what crawlers are actually for.
The way I see it, a crawler is the most effective tool we have for gathering information in an information-driven society.
The era of big data has arrived, and fields built on it, such as data mining and personalized recommendation, are booming. For an individual, a crawler is a very effective way to collect large amounts of data.
My undergraduate thesis project was a text-understanding system for stock-related content: given a piece of text (a stock commentary, a financial review, etc.), the system outputs whether the text recommends buying, holding, or selling. With a large set of sample data, I trained models using three different machine-learning algorithms (decision tree, random forest, and support vector machine) and used the trained models to make predictions.
This is essentially a prototype of the personalized recommendation systems widely used by Taobao, Baidu, and other large Internet companies. The crucial ingredient is a large amount of sample data to train the prediction model, and collecting that data is exactly what a crawler is for.
The production pipeline of the big-data era:
Data acquisition (crawlers, etc.) -> Data processing (data mining, machine learning, etc.)
How a Crawler Works
First of all, what is a crawler?
A web crawler (also called a web spider or web robot, and in the FOAF community more often a "web chaser") is a program or script that automatically fetches information from the World Wide Web according to certain rules.
The Internet we use is a web: individual pages are linked to one another in all sorts of ways, like the strands of a spider web. The program or script we write crawls from one page to the next like a bug, harvesting information along the way, which is why such a tool is vividly called a crawler (or spider).
A crawler generally needs to implement the following (a minimal skeleton is sketched after the list):
- Fetching pages: sending and receiving HTTP (or HTTPS) requests
- Extracting data: parsing the HTML that was fetched and pulling out the data we need
- Crawl scope: deciding which pages the program visits, including how to move from one page to the next and when to stop crawling
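To make these three parts concrete, here is a minimal single-threaded skeleton. It is only an illustrative sketch: the seed URL, the regex-based link extraction, and the MAX_PAGES limit are placeholders, not part of any real crawler in this article.
# skeleton of the three parts above: fetch a page, extract links, control the crawl scope
import re
import urllib2

link_pattern = re.compile(r'href="(http://[^"]+)"')

seed = "http://www.example.com/"   # placeholder start page
frontier = [seed]                  # pages waiting to be crawled
visited = set()                    # pages already crawled
MAX_PAGES = 20                     # termination condition

while frontier and len(visited) < MAX_PAGES:
    url = frontier.pop(0)
    if url in visited:
        continue
    visited.add(url)
    try:
        html = urllib2.urlopen(url, timeout=10).read()   # fetch the page
    except Exception:
        continue
    # "extract data" would go here; this skeleton only extracts links
    for link in link_pattern.findall(html):
        if link not in visited:                          # crawl scope control
            frontier.append(link)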
Python Crawler in Practice
I chose Python to implement the crawler rather than C++, which I know better, because Python ships with a rich set of networking-related standard libraries that make the network side of things much more convenient.
In Python the HTTP work of fetching pages is handled mainly with the urllib, urllib2, and httplib libraries (the code here is Python 2; in Python 3 these were consolidated into urllib.request and http.client).
Extracting the data we need from the HTML is done mainly with regular expressions; Python's regular-expression library is re.
As for the crawl boundary, which pages get visited, that is up to the user's own logic and needs no special library support.
The following code automatically fetches the author, joke text, image, number of laughs, and number of comments from Qiushibaike (糗事百科):
# author huqijun, 2016/05/23
import urllib
import urllib2
import re

page = 1
# url for qiushibaike
url = "http://www.qiushibaike.com/8hr/page/" + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
# http header
headers = {'User-Agent': user_agent}
# Regular expression to extract data from HTML
pattern = re.compile('<div class="author clearfix">.*?target="_blank" title="(.*?)">.*?<div class="content">(.*?)' +
                     '</div>.*?<img src="(.*?)" alt=.*?class="number">(.*?)</i>.*?class="number">(.*?)</i>', re.S)
try:
    # open url
    request = urllib2.Request(url, headers=headers)
    # get response
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # regular match
    items = re.findall(pattern, content)
    # print outcome
    for item in items:
        print item[0], item[1], item[2], item[3], item[4]
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
The result of running the program (only one record shown here):
暖男找暖女
老妹结婚前,妹夫的四个死党商量着闹洞房砸夯,结果结婚那天,在外地工作的老妹弄来了十九个伴娘,最轻的也一百六十多斤,个个又高又壮,妹夫那四个死党准备闹洞房,一进门傻了,十九个女汉子把门一关,拳一握,硬把四个老爷们闹了,过程中,一个想跑,被拎小鸡一样又拎回去了。
http://pic.qiushibaike.com/system/avtnew/3061/30613416/medium/20160514003803.jpg 415 4
There you go: we successfully scraped jokes from Qiushibaike. The principle behind a crawler really is that simple!
The complete, cleaned-up implementation is in QSBK_Crawler.py.
Here is another example for reference: it crawls the detailed profiles of every woman matching certain criteria on a local dating forum.
# -*- coding:utf-8 -*-
# author huqijun, 2016/05/26
# Crawler for 19lou
# 19lou (www.19lou.com) is a local life site for Hangzhou; this uses its dating board
# Crawl the profiles of women on 19lou who match the search criteria, decide via
# machine learning whether they are the liked type, and if so send a site message automatically
import re
import urllib
import urllib2
import cookielib
import zlib


class _19floor(object):
    def __init__(self):
        # base url
        self.url_base1 = "http://www.19lou.com/love/list-164-"
        self.url_base2 = ".html"
        # page index 1
        self.pageIndex = 1
        # search condition
        self.SearchConditionparameters = {
            # looking for female
            "sex": "0",
            # 18 to 30 years old
            "startAge": "18",
            "endAge": "30",
            # not married
            "marry": "1",
            # currently located in Hangzhou, Zhejiang
            "locationProvince": "31",
            "locationCity": "383"
        }
        self.SearchConditiondata = urllib.urlencode(self.SearchConditionparameters)
        # login data
        # replace with real username and passwd
        self.loginParam = {
            "userName": "username",
            "userPass": "password"
        }
        self.loginData = urllib.urlencode(self.loginParam)
        self.loginURL = "http://www.19lou.com/login"
        self.cookie = cookielib.CookieJar()
        # login
        self.login()

    # login to the website
    def login(self):
        request = urllib2.Request(self.loginURL, self.loginData)
        handler = urllib2.HTTPCookieProcessor(self.cookie)
        opener = urllib2.build_opener(handler)
        response = opener.open(request)
        #print self.cookie

    # get the search results for page pageIndex; no login is needed for this information
    def getSearchResult(self, pageIndex):
        url = self.url_base1 + str(pageIndex) + self.url_base2
        request = urllib2.Request(url, self.SearchConditiondata)
        # HTML webpage
        response = urllib2.urlopen(request)
        # 19lou is encoded in gb2312
        _response = response.read().decode('gb2312', 'ignore')
        pattern = re.compile(
            '<div class="list-details">.*?<a href="(.*?)" target="_blank" class="user-details" ttname="bbs_love_yhxx">', re.S)
        items = re.findall(pattern, _response)
        return items

    # login is needed to get the detailed profile of each girl
    def getDetailInfo(self, url):
        #print url
        # cookie information
        cookieInfo = {}
        for item in self.cookie:
            cookieInfo[item.name] = item.value
            print item
        _sbs_auth_id = cookieInfo["_sbs_auth_id"]
        _sbs_auth_uid = cookieInfo["_sbs_auth_uid"]
        dm_ui = cookieInfo["dm_ui"]
        sbs_auth_id = cookieInfo["sbs_auth_id"]
        sbs_auth_uid = cookieInfo["sbs_auth_uid"]
        JSESSIONID = cookieInfo["JSESSIONID"]
        f8big = cookieInfo["f8big"]
        #print _sbs_auth_id,_sbs_auth_uid,dm_ui,sbs_auth_id,sbs_auth_uid,JSESSIONID,f8big
cookieHead = r"bdshare_firstime=1460288971964;_Z3nY0d4C_=37XgPK9h-%3D1920-1920-1920-949;"+\
"f8big=%s; _DM_S_=7f26ea80bdc38e887ed6eef907b1c333; JSESSIONID=%s; " % (f8big,JSESSIONID) +\
r"fr_adv_last=thread-top-reg; fr_adv=bbs_top_20160529_12651464335062915;checkin__40459849_0529=9_12_12;" + \
"sbs_auth_uid={0:s}; sbs_auth_id={1:s}; sbs_auth_remember=1; _sbs_auth_uid=&s;_sbs_auth_id={2:s}; dm_ui={3:s};".format(
sbs_auth_uid, sbs_auth_id, _sbs_auth_uid, _sbs_auth_id, dm_ui)
cookieEnd = r"_dm_userinfo=%7B%22uid%22%3A%2240459849%22%2C%22category%22%3A%22%E6%97%85%E6%B8%B8%2C%E7%BE%8E%E9%A3%9F%2C%E6%83%85%E6%84%9F%2C%E5%A9%9A%E5%BA%86%22%2C%22" + \
r"sex%22%3A%221%22%2C%22frontdomain%22%3A%22www.19lou.com%22%2C%22stage%22%3A%22%22%2C%22ext%22%3A%22%22%2C%22ip%22%3A%2239.189.203.105%22%2C%22city%22%3A%22%E6%B5%99%E6%B1%9F%3A%E6%9D%AD%E5%B7%9E%22%7D; " + \
r"loginwin_opened_user=40459849; Hm_lvt_5185a335802fb72073721d2bb161cd94=1464346870,1464364960,1464437399,1464485996; Hm_lpvt_5185a335802fb72073721d2bb161cd94=1464487611; " + \
r"_DM_SID_=762d93b17978f46a069f43205ebdfb01; screen=1903; _dm_tagnames=%5B%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A267%7D%2C%7B%22k%22%3A%22%E7%9B%" + \
r"B8%E4%BA%B2%E8%AE%BA%E5%9D%9B%22%2C%22c%22%3A39%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E7%9B%B8%E4%BA%B2%E7%BD%91%22%2C%22c%22%3A39%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E5%BE%81%E5%A9%9A%E7%BD%91%22%2C%22c" + \
r"%22%3A39%7D%2C%7B%22k%22%3A%22%E5%85%B6%E4%BB%96%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%22%2C%22c%22%3A134%7D%2C%7B%22k%22%3A%22%E8%90%A7%E5%B1%B1%22%2C%22c%22%3A6%7D%2C%7B%22k%22%3A%22%E5%8D%95" + \
r"%E8%BA%AB%E7%94%B7%E5%A5%B3%E7%9B%B8%E5%86%8C%22%2C%22c%22%3A3%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%89%BE%E7%94%B7%E5%8F%8B%22%2C%22c%22%3A613%7D%2C%7B%22k%22%3A%22%E7%9B" + \
r"%B8%E4%BA%B2%22%2C%22c%22%3A4%7D%2C%7B%22k%22%3A%22%E7%BE%8E%E9%A3%9F%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E7%BB%93%E5%A9%9A%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E5%BA%8A%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%" + \
r"E5%AE%B6%E5%BA%AD%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%83%85%E6%84%9F%E8%AF%9D%E9%A2%98%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E6%83%85%E6%84%9F%E6%97%A5%E8%AE%B0%22%2C%22c%22%3A6%7D%2C%7B%22k%22%3A%22%E5%89%8" + \
r"D%E7%94%B7%E5%8F%8B%22%2C%22c%22%3A3%7D%2C%7B%22k%22%3A%22%E6%9A%A7%E6%98%A7%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E8%BD%AF%E5%A6%B9%E5%AD%90%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9E%97%E5%BF%97%E7%8E%B2%22%2C%" + \
r"22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%97%85%E8%A1%8C%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E4%BA%A4%E5%8F%8B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E4%BA%A4%E5%8F%8B%E7%BD%91%22%2C%22c%22%3A4%7D%2C" + \
r"%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E4%BA%A4%E5%8F%8B%E8%AE%BA%E5%9D%9B%22%2C%22c%22%3A4%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A4%7D%2C%7B%22k%22%3A%22%E6%89%BE%E5%A5%B3%E5%8F%8B%22%2C%22c%22%3A22%7D" + \
r"%2C%7B%22k%22%3A%22%E8%90%A7%E5%B1%B1%E7%94%B7%22%2C%22c%22%3A5%7D%2C%7B%22k%22%3A%22%E5%81%A5%E8%BA%AB%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%81%A5%E8%BA%AB%E6%88%BF%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%B7" + \
r"%B1%E6%83%85%22%2C%22c%22%3A6%7D%2C%7B%22k%22%3A%22%E8%90%9D%E8%8E%89%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E7%A7%A4%E5%A5%B3%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9C%AC%E5%A1%98%E5%A5%B3%22%2C%22c%22" + \
r"%3A1%7D%2C%7B%22k%22%3A%22%E9%97%AA%E5%A9%9A%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E9%85%B1%E6%B2%B9%22%2C%22c%22%3A4%7D%2C%7B%22k%22%3A%22%E9%97%BA%E8%9C%9C%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E4%B8%BD%E6%B0%B4" + \
r"%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E4%B8%BE%E6%8A%A5%E8%BF%9D%E7%AB%A0%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E5%85%AB%E5%8D%A6%E7%BB%AF%E9%97%BB%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E7%89%B5%E6%89%8B%22%2C%" + \
r"22c%22%3A2%7D%2C%7B%22k%22%3A%22%E9%BB%84%E6%99%93%E6%98%8E%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E9%A2%86%E8%AF%81%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%8E%E6%98%93%E5%B3%B0%22%2C%22c%22%3A1%7D%2C%7B%22k%22" + \
r"%3A%22%E6%96%87%E5%A8%B1%E8%BD%AE%E6%92%AD%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%97%B6%E5%B0%9A%22%2C%22c%22%3A3%7D%2C%7B%22k%22%3A%22%E6%B4%BB%E5%8A%A8%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%96%B0%E5%A8%9" + \
r"8%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A9%9A%E7%A4%BC%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A9%9A%E7%BA%B1%22%2C%22c%22%3A1%7D%5D; pm_count=%7B%22pc_hangzhou_cityEnterMouth_advmodel_adv_210x200_2%22%3A21" + \
r"%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_210x200_1%22%3A21%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_210x200_4%22%3A21%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_210x200_3%22%3A21%2C%22" + \
r"pc_hangzhou_cityEnterMouth_advmodel_adv_210x200_6%22%3A21%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_210x401_1%22%3A21%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_210x200_7%22%3A21%2C%22pc_hangzhou_city" + \
r"EnterMouth_advmodel_adv_330x401_3%22%3A21%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_330x200_1%22%3A21%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_330x401_1%22%3A21%2C%22pc_hangzhou_cityEnterMouth_advmodel_adv_" + \
r"330x401_2%22%3A21%2C%22pc_hangzhou_forumthread_button_adv_180x180_4%22%3A261%2C%22pc_hangzhou_forumthread_button_adv_180x180_3%22%3A261%2C%22pc_hangzhou_forumthread_button_adv_180x180_2%22%3A261%2C%22pc_hangzhou" + \
r"_forumthread_button_adv_180x180_1%22%3A261%2C%22pc_hangzhou_forumthread_button_adv_180x180_5%22%3A15%2C%22pc_allCity_threadView_advmodel_adv_360x120_1%22%3A64%2C%22pc_allCity_threadView_button_adv_190x205_" + \
r"1%22%3A3%2C%22pc_hangzhou_sbs_20_advmodel_adv_300x250_2%22%3A2%2C%22pc_hangzhou_sbs_20_advmodel_adv_200x200_1%22%3A2%7D; dayCount=%5B%7B%22id%22%3A7971%2C%22count%22%3A1%7D%2C%7B%22id%22%3A8099%2C%22count" + \
r"%22%3A2%7D%2C%7B%22id%22%3A7238%2C%22count%22%3A2%7D%2C%7B%22id%22%3A8101%2C%22count%22%3A2%7D%2C%7B%22id%22%3A7421%2C%22count%22%3A1%7D%2C%7B%22id%22%3A8338%2C%22count%22%3A2%7D%2C%7B%22id%22%3A7107%2C%22count" + \
r"%22%3A1%7D%2C%7B%22id%22%3A7724%2C%22count%22%3A2%7D%2C%7B%22id%22%3A8097%2C%22count%22%3A2%7D%2C%7B%22id%22%3A8758%2C%22count%22%3A2%7D%2C%7B%22id%22%3A6718%2C%22count%22%3A2%7D%2C%7B%22id%22%3A7119%2C%22count" + \
r"%22%3A5%7D%2C%7B%22id%22%3A7919%2C%22count%22%3A5%7D%2C%7B%22id%22%3A8294%2C%22count%22%3A3%7D%2C%7B%22id%22%3A8325%2C%22count%22%3A1%7D%2C%7B%22id%22%3A8662%2C%22count%22%3A3%7D%2C%7B%22id%22%3A6700%2C%22count" + \
r"%22%3A1%7D%5D"
        cookie = cookieHead + cookieEnd
        #print cookie
        header = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, sdch",
            "Accept-Language": "zh-CN,zh;q=0.8",
            "Cache-Control": " max-age=0",
            "Connection": "close",
            "Cookie": cookie,
            "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
            "Upgrade-Insecure-Requests": "1",
            "DNT": "1"
        }
        request = urllib2.Request(url, headers=header)
        response = urllib2.urlopen(request)
        html = response.read()
        encodeType = response.headers.get('Content-Encoding')
        if encodeType == "gzip":
            html = zlib.decompress(html, 16 + zlib.MAX_WBITS)
        _response = html.decode('gb2312', 'ignore')
        #print _response
        pattern = re.compile('<p class="mt10">(.*?)</p>.*?data-uname="(.*?)" data-uid=".*?<td colspan="2">(.*?) </td>.*?' +
                             '<td colspan="2">(.*?)</td>.*?<td colspan="2">(.*?)</td>.*?<td colspan="2">(.*?)</td>.*?<td>' +
                             '(.*?)</td>.*?<td style="word-wrap:break-word;word-break:break-all;">(.*?)</td>.*?<img src="(.*?)"/>', re.S)
        # item is a list that contains a tuple
        item = re.findall(pattern, _response)
        girlInfo = []
        for temp in item[0]:
            # remove \r\n
            temp = temp.replace('\r\n', '')
            # remove spaces
            temp = temp.replace(' ', '')
            girlInfo.append(temp)
        return girlInfo

    # write the girl's info to a file, e.g. an Excel file
    def saveTofile(self, girlInfo):
        pass

    def test(self):
        pageIndex = 1
        items = self.getSearchResult(pageIndex)
        girlInfo = self.getDetailInfo(items[0])
        for temp in girlInfo:
            print temp


spider = _19floor()
spider.test()
The test output of the program:
女,22岁,魔羯座,年薪6-9万,本科,身高162CM
欣芭比
从事自由职业工作,汉族
无住房,已买车,未婚
浙江杭州江干区
浙江温州
比较安静独立喜欢电影美食狗。
没有条条框框彼此合适可以从朋友做起。
http://att3.citysbs.com/600xhangzhou/2016/05/29/13/780x780-135556_v2_14711464501356634_a2e12819608e252958e4ec4e20234a30.jpg
The overall idea is to crawl the personal profiles of every woman on the forum who matches the search criteria, use machine learning to decide whether she is the type I like, and automatically send her a private message on the site if she is (a rough sketch of the classification step follows).
The complete code is in 19lou_Crawler.py, covering the crawler, the machine-learning, and the messaging parts.
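As a rough sketch of the classification step only: a classifier trained on hand-labeled profiles could look like the snippet below. scikit-learn, the toy features, and the labels are all assumptions for illustration; they are not necessarily what 19lou_Crawler.py does.
# hypothetical "is she my type" classifier; features and labels are toy examples
from sklearn.ensemble import RandomForestClassifier

# toy training data: [age, height_cm, likes_movies], hand-labeled 1 = liked, 0 = not liked
X_train = [[22, 162, 1], [28, 170, 0], [25, 158, 1], [30, 165, 0]]
y_train = [1, 0, 1, 0]

model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, y_train)

new_profile = [[23, 160, 1]]    # features extracted from getDetailInfo() output
if model.predict(new_profile)[0] == 1:
    print "send a private message"   # hook up the site-message sender here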
Difficulties and Solutions
- Regular expressions are quite unfriendly to humans and hard to write
Solution:
RegexBuddy: a handy tool for debugging regular expressions;
PyQuery: a Python implementation of jQuery-style selectors; jQuery-style CSS selectors are very convenient for pulling information out of HTML;
XPath and BeautifulSoup: other tools for extracting information from HTML (a short BeautifulSoup sketch follows)
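To show what these alternatives look like, here is roughly how the Qiushibaike author/content extraction above could be written with BeautifulSoup instead of a regular expression. This is only a sketch: it requires "pip install beautifulsoup4", and the class names ("author clearfix", "content") are assumed from the regex above, not verified against the current site.
# sketch of regex-free extraction with BeautifulSoup
from bs4 import BeautifulSoup

def extract_items(content):
    soup = BeautifulSoup(content, "html.parser")
    items = []
    for author_div in soup.find_all("div", class_="author clearfix"):
        link = author_div.find("a", target="_blank")
        author = link.get("title") if link else None
        content_div = author_div.find_next("div", class_="content")
        text = content_div.get_text(strip=True) if content_div else None
        items.append((author, text))
    return items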
- Requesting a page from Python returns a piece of JavaScript instead of the actual HTML
Solution:
Capturing the browser's interaction in Chrome showed that the browser's GET request does receive an HTML response;
capturing the crawler's request with Wireshark and comparing it against the browser's showed that the Cookie field was missing from the HTTP headers. Adding it fixed the problem, since this site sends a Cookie with every request even when the user is not logged in (a minimal sketch follows).
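A minimal sketch of that fix, assuming the cookie string has been copied out of a real browser capture (the value below is a placeholder, not a working cookie):
import urllib2

url = "http://www.19lou.com/love/list-164-1.html"
# placeholder; paste the real Cookie header from a browser capture here
cookie = "f8big=...; JSESSIONID=...; _DM_S_=..."
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
    "Cookie": cookie,
}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
print response.read()[:200]   # should now be HTML rather than JavaScript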
- Encoding/decoding problem: the capture showed that HTML was being returned, but the program's output was still garbled
Solution:
The page source declares meta charset="gb2312", yet decoding with gb2312 still produced garbage;
it turned out the HTML body was gzip-compressed, and the HTTP response headers contained Content-Encoding: gzip.
Decompressing with zlib solved the problem:
response = urllib2.urlopen(request)
html = response.read()
encodeType = response.headers.get('Content-Encoding')
if encodeType == "gzip":
    html = zlib.decompress(html, 16 + zlib.MAX_WBITS)
print html
Lessons Learned
When you cannot get the HTML page you want, the fundamental fix is to construct a request identical to the one a real browser sends.
Once the program behaves exactly like a browser, most anti-crawling measures can no longer single it out.
Building such a request is simple enough: for HTTP, capture both requests with Wireshark and adjust your headers field by field until they match; for HTTPS, use Fiddler to capture and compare (this requires configuring its SSL certificate).
Going Further
The crawler examples above only fetch one or a few pages, so they are simple. But once we want to crawl an entire site (Douban, say, whose page count is at least in the tens of millions), such a naive implementation no longer works; crawling all of Douban with the code above might not finish in a hundred years.
At that point the crawler's performance needs to be improved along the following lines (a sketch of the multithreaded approach follows the list):
- Multithreading
- Clustering (spreading the crawl across multiple machines)
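As a sketch of the multithreaded direction (an assumption-level example, not taken from any code above): the single-threaded crawl loop can be split into a shared URL queue plus a pool of worker threads. The regex-based link extraction, the seed URL, and the hard MAX_PAGES limit are illustrative simplifications.
# minimal multithreaded crawler sketch (Python 2; the Queue module is named "queue" in Python 3)
import re
import threading
import urllib2
import Queue

link_pattern = re.compile(r'href="(http://[^"]+)"')
task_queue = Queue.Queue()
seen = set()
seen_lock = threading.Lock()
MAX_PAGES = 50

def worker():
    while True:
        url = task_queue.get()
        try:
            html = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            html = ""
        for link in link_pattern.findall(html):
            with seen_lock:
                if link in seen or len(seen) >= MAX_PAGES:
                    continue
                seen.add(link)
            task_queue.put(link)
        task_queue.task_done()

# start a small pool of daemon worker threads
for _ in range(8):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

seed = "http://www.qiushibaike.com/8hr/page/1"
seen.add(seed)
task_queue.put(seed)
task_queue.join()   # blocks until every discovered page has been processed
print "crawled %d pages" % len(seen)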
References and Useful Tools
A quick Python introduction for people with experience in another language: Learn Python The Hard Way
A Python tutorial for complete beginners: 廖雪峰Python教程 (Liao Xuefeng's Python tutorial)
A good Python IDE: PyCharm
Python crawler tutorial: Python爬虫学习系列教程 (a Python crawler tutorial series)
A very handy regular-expression debugging tool: RegexBuddy
Packet-capture tool: Wireshark
HTTPS capture and analysis tool: Fiddler
Analyzing HTTPS with Fiddler: 配置fiddler查看HTTPS (configuring Fiddler to inspect HTTPS)