Big Red
From CloudWiki
Chapter 2: Building a Web Crawler
Your First Web Crawler
Numeric types: int, float, and complex; plus the string type.
import requests

r = requests.get("http://www.baidu.com")  # open a web page
print(r.status_code)                      # print the response status code
r.encoding = 'utf-8'
print(r.text)                             # inspect the returned content
- Constant: a data object whose value never changes
- Variable: the id() function returns a variable's memory address
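As a quick illustration of the points above (the variable names here are just examples), type() reports which of the listed types a value has, and id() reports the memory address of the object a variable refers to:

a = 7          # int
b = 7.5        # float
c = 1 + 2j     # complex
s = "Python"   # str

print(type(a), type(b), type(c), type(s))
print(id(a))   # the memory address of the object that a currently refers to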
Basic Commands
For example:
print("Hello World!")
1. Strings
For example:
string1 = 'Python Web Scrappy'
string2 = "by Santos"
string3 = string1 + " " + string2
print(string3)
Indexing starts at 0:
print("list1[0]:",list1[0]) print("list2[1:3]:",list2[1:3]) list1[0]: python list2[1:3]:[2,3]
Modifying a value in the list:
list1[1]="new" print(list1)
- Strings are the most common data type. They generally store sentence-like data and are enclosed in single or double quotes.
2. Numbers
For example:
int1 = 7
float1 = 7.5
trans_int = int(float1)
print(trans_int)
- Numbers store numeric values. There are two commonly used numeric types: integers and floating-point numbers; a float consists of an integer part and a decimal part.
3. Lists
For example:
list1 = ['python', 'web', 'scrappy']
list2 = [1, 2, 3, 4, 5]
list3 = ["a", 2, "o", 4]
- If you need to hold strings and numbers together, use a list.
4. Dictionaries
For example:
namebook={"Name":"Alex","Age":7,"Class":"First"} print(namebook["Name"]) print(namebook)
Iterating over every key-value pair in the dictionary:
for key, value in namebook.items():
    print(key, value)
- A dictionary is a mutable container. Every stored value corresponds to a key; keys must be unique, but values need not be, and a value can be of any data type.
Conditional and Loop Statements
- A conditional statement executes a block of code only when its condition is met.
For example:
book="Python" if book= ="Python": print("You are studying python.") else: print("Wrong.")
If there are several conditions to check, use elif.
For example:
book="java" if book= ="Python": print("You are studying python.") elif:book= ="java": print("You are studying java.") else: print("Wrong.")
- Loop statements let us execute a piece of code multiple times. There are two kinds of loops: for loops and while loops.
A for loop repeats over a given sequence.
For example:
citylist=["Beijing","Shanghai","Guangzhou"] for eachcity in citylist: print(eachcity)
A while loop keeps repeating as long as a condition holds.
For example:
count = 0
while count < 3:
    count += 1
    print(count)
Functions
Defining a function:
For example:
def calulus(x):
    y = x + 1
    return y

result = calulus(2)
print(result)
Arguments must be passed to a function correctly; a function can also take multiple parameters.
For example:
def fruit_function(fruit1, fruit2):
    fruits = fruit1 + " " + fruit2
    return fruits

result = fruit_function("apple", "banana")
print(result)
- When there is very little code, writing it straight through according to the logic works well.
Object-Oriented Programming
- Procedural programming means writing code from top to bottom following the business logic.
For example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        print(self.name)
        print(self.age)
Written with plain functions instead, this could be:
def detail(name, age):
    print(name)
    print(age)
- If the functions are independent of one another and share no data, function-based programming is a good fit; if the functions are related to one another, object-oriented programming is the better choice.
The two main features of object orientation: encapsulation and inheritance
1. Encapsulation
- Encapsulation, as the name suggests, means packing content into an object and then accessing it through that object. It takes two steps:
Step 1: encapsulate the content. Step 2: access the encapsulated content.
(1) Encapsulating the content
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = Person('santos', 18)
(2) Accessing the encapsulated content
- There are two ways to access the encapsulated content: directly through the object, or indirectly through self.
Accessing the name and age attributes directly through the obj1 object:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = Person('santos', 18)
print(obj1.name)
print(obj1.age)
- When accessing indirectly through self, Python automatically passes obj1 as the self parameter, i.e. obj1.detail() becomes Person.detail(obj1). Inside the method, self is obj1, so self.name is 'santos' and self.age is 18.
The code is as follows:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        print(self.name)
        print(self.age)
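For comparison with the direct access above, a minimal sketch of the indirect call through self, using the Person class just defined:

obj1 = Person('santos', 18)
obj1.detail()   # Python passes obj1 in as self, so this prints 'santos' and then 18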
2. Inheritance
- Inheritance builds specialized classes on top of an ordinary base class. Inheritance in object-oriented programming works much like inheritance in real life: the child inherits certain traits of the parent.
For example:
class 猫:
    def 喵喵叫(self):
        print('喵喵叫')
    def 吃(self):
        pass   # method bodies were left empty in the notes; pass added so the sketch runs
    def 喝(self):
        pass
    def 拉(self):
        pass
    def 撒(self):
        pass

class 狗:
    def 汪汪叫(self):
        print('汪汪叫')
    def 吃(self):
        pass
    def 喝(self):
        pass
    def 拉(self):
        pass
    def 撒(self):
        pass
Using inheritance, this can be rewritten as:
class Animal:
    def eat(self):
        print("%s 吃" % self.name)
        print("%s 喝" % self.name)
        print("%s 拉" % self.name)
        print("%s 撒" % self.name)

class Cat(Animal):
    def __init__(self, name):
        self.name = name
    def cry(self):
        print('喵喵叫')

class Dog(Animal):
    def __init__(self, name):
        self.name = name
    def cry(self):
        print('汪汪叫')

c1 = Cat('小白家的小黑猫')
c1.eat()
c1.cry()
d1 = Dog('胖子家的小瘦狗')
d1.eat()
Writing Your First Simple Crawler
Step 1: Fetch the page
import requests link="https://user.qzone.qq.com/328911422/main" headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'} r=requests.get(link,headers=headers) print(r.text)
- The headers argument of requests disguises the request as coming from a browser.
- r is the requests Response object, from which the information we want can be extracted; r.text holds the fetched page content.
Step 2: Extract the data you need
import requests
from bs4 import BeautifulSoup

link = "https://user.qzone.qq.com/328911422/main"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)
Step 3: Store the data
import requests link="https://user.qzone.qq.com/328911422/main" headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'} r=requests.get(link,headers=headers) soup=BeautifulSoup(r.text,"lxml") title=soup.find("h1", class_="post-title").a.text.strip() print(title) with open('title.txt',"a+") as f: f.write(title) f.close()
Chapter 3: Scraping Static Web Pages
Getting the Response Content
Example: fetching the QQ Zone home page content
import requests

r = requests.get('https://user.qzone.qq.com/328911422/infocenter')
print("文本编码:", r.encoding)
print("响应状态码:", r.status_code)
print("字符串方式的响应体:", r.text)
Customizing Requests
Passing URL parameters
Example: pass key1=value1 and key2=value2 to https://user.qzone.qq.com/1150117452/main
import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://user.qzone.qq.com/1150117452/main', params=key_dict)
print("URL已经正确编码:", r.url)
print("字符串方式的响应体:\n", r.text)
Customizing Request Headers
For example:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Host': 'movie.douban.com'
}
r = requests.get('https://movie.douban.com/subject/1292052/', headers=headers)
print("响应状态码:", r.status_code)
Sending a POST Request
import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('https://user.qzone.qq.com/1150117452/main', data=key_dict)
print(r.text)
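The QQ Zone page above does not echo the submitted data, so to see that the dictionary really travels in the request body rather than in the URL, here is a small sketch against the public echo service httpbin.org (not part of the original example):

import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=key_dict)
print(r.url)    # the URL carries no parameters
print(r.text)   # the echoed JSON lists key1 and key2 under "form"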
Requests Crawler in Practice: Top 250 Movie Data
Project practice
import requests

def get_movies():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host': 'movie.douban.com'
    }
    for i in range(0, 10):
        link = 'https://movie.douban.com/top250?start=' + str(i * 25)
        r = requests.get(link, headers=headers, timeout=10)
        print(str(i + 1), "页响应状态码:", r.status_code)
        print(r.text)

get_movies()
Chapter 4: Scraping Dynamic Web Pages
Approaches to Dynamic Scraping
- Parse the real address through the browser's developer tools (inspect element)
- Simulate a browser with Selenium to scrape the page
Scraping by Resolving the Real Address
For example:
import requests link = """https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2 Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285""" headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} r = requests.get(link, headers= headers) print (r.text)
import requests
import json

def single_page_comment(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    # get the JSON string from the response
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)

for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)
Scraping by Simulating a Browser with Selenium
Using Selenium to get all the comments on an article
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your own computer
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)

driver.get("http://www.santostang.com/2017/03/02/hello-world/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)
Restricting image loading:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your own computer
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile=fp, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
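The same profile-preference approach can in principle be pushed further to skip stylesheets and JavaScript as well; a sketch assuming the Firefox preference names permissions.default.stylesheet and javascript.enabled (these are not shown in the original notes):

fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)        # do not load images
fp.set_preference("permissions.default.stylesheet", 2)   # do not load CSS (assumed preference name)
fp.set_preference("javascript.enabled", False)           # disable JavaScript (assumed preference name)
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile=fp, capabilities=caps)

Note that disabling JavaScript also stops dynamically loaded content (such as the livere comment iframe) from appearing, so it only suits pages whose content is already in the static HTML.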
- This approach lets the browser itself parse the HTML, apply the CSS styles, and execute the JavaScript statements when rendering the page.
Chapter 5: Parsing Web Pages
Parsing Web Pages with Regular Expressions
- The re.match method
import re

m = re.match('www', 'www.santostang.com')
print("匹配的结果:", m)
print("匹配的起始与终点:", m.span())
print("匹配的起始位置:", m.start())
print("匹配的终点位置:", m.end())
- The re.search method
import re

m_match = re.match('com', 'www.santostang.com')
m_search = re.search('com', 'www.santostang.com')
print(m_match)
print(m_search)
- The re.findall method
import re

m_match = re.match('[0-9]+', '12345 is the first number, 23456 is the second')
m_search = re.search('[0-9]+', 'The first number is 12345, 23456 is the second')
m_findall = re.findall('[0-9]+', '12345 is the first number, 23456 is the second')
print(m_match.group())
print(m_search.group())
print(m_findall)
- Unlike match and search, findall finds all matching results and returns them as a list. When scraping blog post titles, if you want every title rather than just one, findall is the tool to use, as sketched below.
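As a hedged illustration of that idea (the HTML pattern below is an assumption about the page's markup, not taken from it), findall can pull every title out of a fetched page in one call:

import re
import requests

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r = requests.get(link, headers=headers)

# assumed pattern: each title sits inside <h1 class="post-title"><a ...>title</a>;
# adjust it to the real markup of the page being scraped
title_list = re.findall('<h1 class="post-title"><a.*?>(.*?)</a>', r.text, re.S)
print(title_list)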
Parsing Web Pages with BeautifulSoup
- Installing BeautifulSoup is straightforward with pip. In cmd, enter:
pip install bs4
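A quick way to confirm that the installation worked is to parse a tiny hand-written HTML snippet (the snippet itself is only an illustration):

from bs4 import BeautifulSoup

html = "<html><body><h1 class='post-title'><a href='#'>Hello BeautifulSoup</a></h1></body></html>"
soup = BeautifulSoup(html, "html.parser")   # html.parser ships with Python, so no extra parser is needed
print(soup.h1.a.text)                       # prints: Hello BeautifulSoup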
- Using BeautifulSoup to get the blog titles
import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "html.parser")
first_title = soup.find("h1", class_="post-title").a.text.strip()
print("第一篇文章的标题是:", first_title)

title_list = soup.find_all("h1", class_="post-title")
for i in range(len(title_list)):
    title = title_list[i].a.text.strip()
    print('第%s篇文章的标题是:%s' % (i + 1, title))

Every node of the parsed tree is a Python object, and getting content from the page is a matter of extracting content from these objects. The ways to do so fall into three groups:
- 1. Traversing the document tree
soup.header.h3
- 2. Searching the document tree
for tag in soup.find_all(re.compile("^h")):
    print(tag.name)
- 3. CSS selectors
soup.select('a[href^="http://www.santostang.com/"]')
Parsing Web Pages with lxml
- Using lxml to get the blog titles
import requests
from lxml import etree

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

html = etree.HTML(r.text)
title_list = html.xpath('//h1[@class="post-title"]/a/text()')
print(title_list)
BeautifulSoup Crawler in Practice: Housing Price Data
- Getting the first page of Anjuke's Beijing second-hand housing listings
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'http://beijing.anjuke.com/sale/'
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
house_list = soup.find_all('li', class_="list-item")
for house in house_list:
    name = house.find('div', class_='house-title').a.text.strip()
    price = house.find('span', class_='price-det').text.strip()
    price_area = house.find('span', class_='unit-price').text.strip()
    no_room = house.find('div', class_='details-item').span.text
    area = house.find('div', class_='details-item').contents[3].text
    floor = house.find('div', class_='details-item').contents[5].text
    year = house.find('div', class_='details-item').contents[7].text
    broker = house.find('span', class_='brokername').text
    broker = broker[1:]
    address = house.find('span', class_='comm-address').text.strip()
    address = address.replace('\xa0\xa0\n ', ' ')
    tag_list = house.find_all('span', class_='item-tage')
    tags = [i.text for i in tag_list]
    print(name, price, price_area, no_room, area, floor, year, broker, address, tags)
Chapter 6: Data Storage
Basic Storage: Saving to TXT or CSV
- Saving to TXT
title = "This is a test sentence." with open('C:\\you\\desktop\\title.txt',"a+") as f: f.write(title) f.close()
- In with open('C:\\you\\desktop\\title.txt', "a+") as f:, a+ is the file read/write mode: the file is opened for appending and reading, and a new file is created if it does not exist.
- Joining several variables into one string with '\t'.join():
output = '\t'.join(['name', 'title', 'age', 'gender'])
with open('C:\\you\\desktop\\title.txt', "a+") as f:
    f.write(output)
- Sometimes you also need to read data back out of a TXT file. This is very similar to writing: simply change write to read, as in the sketch below.
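A minimal sketch of the reading side, assuming the title.txt written above:

with open('C:\\you\\desktop\\title.txt', "r") as f:
    result = f.read()
    print(result)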
- Saving to CSV
- Try reading the data in test.csv with Python.
import csv

with open('test.csv', 'r', encoding='UTF-8') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)
        print(row[0])
- Writing data to a CSV file
import csv

output_list = ['1', '2', '3', '4']
with open('test2.csv', 'a+', encoding='UTF-8', newline='') as csvfile:
    w = csv.writer(csvfile)
    w.writerow(output_list)
Saving to a MySQL Database
- Operating a MySQL database from Python
- Install the mysqlclient library with pip to connect Python to MySQL. In the command line, enter:
pip install mysqlclient
- After the installation finishes, try operating MySQL from Python by inserting data into the database:
#coding=UTF-8
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='scraping')
cur = conn.cursor()
cur.execute("insert into urls (url, content) values('www.baidu.com', 'This is content.')")
cur.close()
conn.commit()
conn.close()
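To check that the row actually arrived, a small follow-up sketch (assuming the same scraping database and urls table as above) reads it back:

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='scraping')
cur = conn.cursor()
cur.execute("SELECT * FROM urls")
for row in cur.fetchall():
    print(row)   # each row comes back as a tuple of column values
cur.close()
conn.close()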
Store the blog titles and URLs scraped earlier into the MySQL database with Python; the code is as follows:
import requests
from bs4 import BeautifulSoup
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='scraping', charset="utf8")
cur = conn.cursor()

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    cur.execute("insert into urls (url, content) values (%s, %s)", (url, title))

cur.close()
conn.commit()
conn.close()
Saving to a MongoDB Database
- Operating a MongoDB database from Python
- Install the PyMongo library with pip to connect Python to MongoDB. In the command line, enter:
pip install pymongo
- After the installation finishes, try operating MongoDB from Python and check whether you can connect to the database normally.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog
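Because pymongo connects lazily, simply creating the client does not prove the server is reachable; one way to verify, sketched here with a throwaway document (the field name is just a placeholder):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

collection.insert_one({"check": "connection test"})        # write a test document
print(collection.find_one({"check": "connection test"}))   # read it back; printing it confirms the connection works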
- Scrape all the article titles from the blog home page and store them in the MongoDB database; the code is as follows:
import requests
import datetime
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    post = {"url": url,
            "title": title,
            "date": datetime.datetime.utcnow()}
    collection.insert_one(post)
MongoDB Crawler in Practice: The Hupu Forum
- The code to get the first page of data is as follows:
import requests
import datetime
from bs4 import BeautifulSoup

def get_page(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    html = r.content
    html = html.decode('UTF-8')
    soup = BeautifulSoup(html, 'lxml')
    return soup

def get_data(post_list):
    data_list = []
    for post in post_list:
        title_td = post.find('td', class_='p_title')
        title = title_td.find('a', id=True).text.strip()
        post_link = title_td.find('a', id=True)['href']
        post_link = 'https://bbs.hupu.com' + post_link

        author = post.find('td', class_='p_author').text.strip()
        author_page = post.find('td', class_='p_author').a['href']
        start_date = post.find('td', class_='p_author').contents[2]
        start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d').date()

        reply_view = post.find('td', class_='p_re').text.strip()
        reply = reply_view.split('/')[0].strip()
        view = reply_view.split('/')[1].strip()

        reply_time = post.find('td', class_='p_retime').a.text.strip()
        last_reply = post.find('td', class_='p_retime').contents[2]
        if ':' in reply_time:
            # replies from today only show a time, so prepend today's date
            date_time = str(datetime.date.today()) + ' ' + reply_time
            date_time = datetime.datetime.strptime(date_time, '%Y-%m-%d %H:%M')
        else:
            date_time = datetime.datetime.strptime('2017-' + reply_time, '%Y-%m-%d').date()

        data_list.append([title, post_link, author, author_page, start_date, reply, view, last_reply, date_time])
    return data_list

link = "https://bbs.hupu.com/bxj"
soup = get_page(link)
post_list = soup.find_all('tr', mid=True)
data_list = get_data(post_list)
for each in data_list:
    print(each)
Chapter 7: Concurrency and Parallelism, Synchronous and Asynchronous
7.1.1 Concurrency and Parallelism
Concurrency means several events take place within the same period of time. Parallelism means several events take place at the same instant.
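As a rough Python illustration of the difference (a sketch, not part of the original notes): threads inside one process typically take turns within the same period of time (concurrency), while separate processes can run on different CPU cores at the same instant (parallelism).

import threading
import multiprocessing

def task(name):
    print(name, "running")

if __name__ == "__main__":
    # Concurrency: two threads share one interpreter and interleave over a period of time.
    t1 = threading.Thread(target=task, args=("thread-1",))
    t2 = threading.Thread(target=task, args=("thread-2",))
    t1.start(); t2.start(); t1.join(); t2.join()

    # Parallelism: two processes can execute on different CPU cores at the same moment.
    p1 = multiprocessing.Process(target=task, args=("process-1",))
    p2 = multiprocessing.Process(target=task, args=("process-2",))
    p1.start(); p2.start(); p1.join(); p2.join()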
7.1.2 Synchronous and Asynchronous
Synchronous means the concurrent or parallel tasks do not run independently: there is a fixed order between them, and one task may only start after another has finished and produced its result. Asynchronous means the concurrent or parallel tasks can run independently; the execution of one task is not affected by another.
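A small asyncio sketch of the contrast (an illustration, not from the original notes): run synchronously, the second task starts only after the first finishes; run asynchronously, the two tasks overlap and the total time roughly halves.

import asyncio
import time

async def fetch(name, delay):
    await asyncio.sleep(delay)   # stands in for a slow network request
    print(name, "done")

async def main():
    start = time.time()
    # Synchronous order: wait for one task before starting the next (about 2 seconds in total).
    await fetch("task-1", 1)
    await fetch("task-2", 1)
    print("sequential:", round(time.time() - start, 1), "seconds")

    start = time.time()
    # Asynchronous order: the two tasks run independently and overlap (about 1 second in total).
    await asyncio.gather(fetch("task-3", 1), fetch("task-4", 1))
    print("concurrent:", round(time.time() - start, 1), "seconds")

asyncio.run(main())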