Big Red
From CloudWiki
Chapter 2: Building a Web Crawler
Your First Web Crawler
Numeric types: int, float, and complex; plus the string type.
import requests

r = requests.get("http://www.baidu.com")  # open a web page
print(r.status_code)                      # print the response status code
r.encoding = 'utf-8'
print(r.text)                             # inspect the returned content
- Constant: a data object whose value never changes
- Variable: the id() function returns a variable's memory address
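As a quick illustration of the points above (the variable names here are just examples), type() reports which of the listed types a value has, and id() reports the memory address of the object a variable refers to:

a = 7          # int
b = 7.5        # float
c = 1 + 2j     # complex
s = "Python"   # str

print(type(a), type(b), type(c), type(s))
print(id(a))   # the memory address of the object that a currently refers to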
Basic Commands
For example:
print("Hello World!")
1. Strings
For example:
string1 = 'Python Web Scrappy'
string2 = "by Santos"
string3 = string1 + " " + string2
print(string3)
Indexing starts at 0:
print("list1[0]:",list1[0]) print("list2[1:3]:",list2[1:3]) list1[0]: python list2[1:3]:[2,3]
Modifying a value in the list:
list1[1]="new" print(list1)
- Strings are the most common data type. They generally store sentence-like data and are enclosed in single or double quotes.
2. Numbers
For example:
int1 = 7
float1 = 7.5
trans_int = int(float1)
print(trans_int)
- Numbers store numeric values. There are two commonly used numeric types: integers and floating-point numbers; a float consists of an integer part and a decimal part.
3. Lists
For example:
list1 = ['python', 'web', 'scrappy']
list2 = [1, 2, 3, 4, 5]
list3 = ["a", 2, "o", 4]
- If you need to hold strings and numbers together, use a list.
4. Dictionaries
For example:
namebook={"Name":"Alex","Age":7,"Class":"First"} print(namebook["Name"]) print(namebook)
Iterating over every key-value pair in the dictionary:
for key, value in namebook.items():
    print(key, value)
- A dictionary is a mutable container. Every stored value corresponds to a key; keys must be unique, but values need not be, and a value can be of any data type.
Conditional and Loop Statements
- A conditional statement executes a block of code only when its condition is met.
For example:
book="Python" if book= ="Python": print("You are studying python.") else: print("Wrong.")
If there are several conditions to check, use elif.
For example:
book="java" if book= ="Python": print("You are studying python.") elif:book= ="java": print("You are studying java.") else: print("Wrong.")
- Loop statements let us execute a piece of code multiple times. There are two kinds of loops: for loops and while loops.
A for loop repeats over a given sequence.
For example:
citylist=["Beijing","Shanghai","Guangzhou"] for eachcity in citylist: print(eachcity)
A while loop keeps repeating as long as a condition holds.
For example:
count = 0
while count < 3:
    count += 1
    print(count)
Functions
Defining a function:
For example:
def calulus(x):
    y = x + 1
    return y

result = calulus(2)
print(result)
Arguments must be passed to a function correctly; a function can also take multiple parameters.
For example:
def fruit_function(fruit1, fruit2):
    fruits = fruit1 + " " + fruit2
    return fruits

result = fruit_function("apple", "banana")
print(result)
- When there is very little code, writing it straight through according to the logic works well.
Object-Oriented Programming
- Procedural programming means writing code from top to bottom following the business logic.
For example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        print(self.name)
        print(self.age)
Written with plain functions instead, this could be:
def detail(name, age):
    print(name)
    print(age)
- If the functions are independent of one another and share no data, function-based programming is a good fit; if the functions are related to one another, object-oriented programming is the better choice.
The two main features of object orientation: encapsulation and inheritance
1. Encapsulation
- Encapsulation, as the name suggests, means packing content into an object and then accessing it through that object. It takes two steps:
Step 1: encapsulate the content. Step 2: access the encapsulated content.
(1) Encapsulating the content
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = Person('santos', 18)
(2) Accessing the encapsulated content
- There are two ways to access the encapsulated content: directly through the object, or indirectly through self.
Accessing the name and age attributes directly through the obj1 object:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = Person('santos', 18)
print(obj1.name)
print(obj1.age)
- When accessing indirectly through self, Python automatically passes obj1 as the self parameter, i.e. obj1.detail() becomes Person.detail(obj1). Inside the method, self is obj1, so self.name is 'santos' and self.age is 18.
The code is as follows:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        print(self.name)
        print(self.age)
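For comparison with the direct access above, a minimal sketch of the indirect call through self, using the Person class just defined:

obj1 = Person('santos', 18)
obj1.detail()   # Python passes obj1 in as self, so this prints 'santos' and then 18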
2. Inheritance
- Inheritance builds specialized classes on top of an ordinary base class. Inheritance in object-oriented programming works much like inheritance in real life: the child inherits certain traits of the parent.
For example:
class 猫:
    def 喵喵叫(self):
        print('喵喵叫')
    def 吃(self):
        pass   # method bodies were left empty in the notes; pass added so the sketch runs
    def 喝(self):
        pass
    def 拉(self):
        pass
    def 撒(self):
        pass

class 狗:
    def 汪汪叫(self):
        print('汪汪叫')
    def 吃(self):
        pass
    def 喝(self):
        pass
    def 拉(self):
        pass
    def 撒(self):
        pass
Using inheritance, this can be rewritten as:
class Animal:
    def eat(self):
        print("%s 吃" % self.name)
        print("%s 喝" % self.name)
        print("%s 拉" % self.name)
        print("%s 撒" % self.name)

class Cat(Animal):
    def __init__(self, name):
        self.name = name
    def cry(self):
        print('喵喵叫')

class Dog(Animal):
    def __init__(self, name):
        self.name = name
    def cry(self):
        print('汪汪叫')

c1 = Cat('小白家的小黑猫')
c1.eat()
c1.cry()
d1 = Dog('胖子家的小瘦狗')
d1.eat()
Writing Your First Simple Crawler
Step 1: Fetch the page
import requests link="https://user.qzone.qq.com/328911422/main" headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'} r=requests.get(link,headers=headers) print(r.text)
- The headers argument of requests disguises the request as coming from a browser.
- r is the requests Response object, from which the information we want can be extracted; r.text holds the fetched page content.
Step 2: Extract the data you need
import requests
from bs4 import BeautifulSoup

link = "https://user.qzone.qq.com/328911422/main"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)
Step 3: Store the data
import requests link="https://user.qzone.qq.com/328911422/main" headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'} r=requests.get(link,headers=headers) soup=BeautifulSoup(r.text,"lxml") title=soup.find("h1", class_="post-title").a.text.strip() print(title) with open('title.txt',"a+") as f: f.write(title) f.close()
Chapter 3: Scraping Static Web Pages
Getting the Response Content
Example: fetching the QQ Zone home page content
import requests

r = requests.get('https://user.qzone.qq.com/328911422/infocenter')
print("文本编码:", r.encoding)
print("响应状态码:", r.status_code)
print("字符串方式的响应体:", r.text)
Customizing Requests
Passing URL parameters
Example: pass key1=value1 and key2=value2 to https://user.qzone.qq.com/1150117452/main
import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://user.qzone.qq.com/1150117452/main', params=key_dict)
print("URL已经正确编码:", r.url)
print("字符串方式的响应体:\n", r.text)
Customizing Request Headers
For example:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Host': 'movie.douban.com'
}
r = requests.get('https://movie.douban.com/subject/1292052/', headers=headers)
print("响应状态码:", r.status_code)
Sending a POST Request
import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('https://user.qzone.qq.com/1150117452/main', data=key_dict)
print(r.text)
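The QQ Zone page above does not echo the submitted data, so to see that the dictionary really travels in the request body rather than in the URL, here is a small sketch against the public echo service httpbin.org (not part of the original example):

import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=key_dict)
print(r.url)    # the URL carries no parameters
print(r.text)   # the echoed JSON lists key1 and key2 under "form"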
Requests Crawler in Practice: Top 250 Movie Data
Project practice
import requests

def get_movies():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host': 'movie.douban.com'
    }
    for i in range(0, 10):
        link = 'https://movie.douban.com/top250?start=' + str(i * 25)
        r = requests.get(link, headers=headers, timeout=10)
        print(str(i + 1), "页响应状态码:", r.status_code)
        print(r.text)

get_movies()
Chapter 4: Scraping Dynamic Web Pages
Approaches to Dynamic Scraping
- Parse the real address through the browser's developer tools (inspect element)
- Simulate a browser with Selenium to scrape the page
Scraping by Resolving the Real Address
For example:
import requests link = """https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2 Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285""" headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} r = requests.get(link, headers= headers) print (r.text)
import requests
import json

def single_page_comment(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    # get the JSON string from the response
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)

for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)
Scraping by Simulating a Browser with Selenium
Using Selenium to get all the comments on an article
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your own computer
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)

driver.get("http://www.santostang.com/2017/03/02/hello-world/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)
Restricting image loading:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your own computer
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile=fp, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
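The same profile-preference approach can in principle be pushed further to skip stylesheets and JavaScript as well; a sketch assuming the Firefox preference names permissions.default.stylesheet and javascript.enabled (these are not shown in the original notes):

fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image", 2)        # do not load images
fp.set_preference("permissions.default.stylesheet", 2)   # do not load CSS (assumed preference name)
fp.set_preference("javascript.enabled", False)           # disable JavaScript (assumed preference name)
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile=fp, capabilities=caps)

Note that disabling JavaScript also stops dynamically loaded content (such as the livere comment iframe) from appearing, so it only suits pages whose content is already in the static HTML.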
- This approach lets the browser itself parse the HTML, apply the CSS styles, and execute the JavaScript statements when rendering the page.
Chapter 5: Parsing Web Pages
Parsing Web Pages with Regular Expressions
- The re.match method
import re

m = re.match('www', 'www.santostang.com')
print("匹配的结果:", m)
print("匹配的起始与终点:", m.span())
print("匹配的起始位置:", m.start())
print("匹配的终点位置:", m.end())
- The re.search method
import re

m_match = re.match('com', 'www.santostang.com')
m_search = re.search('com', 'www.santostang.com')
print(m_match)
print(m_search)
- The re.findall method
import re

m_match = re.match('[0-9]+', '12345 is the first number, 23456 is the second')
m_search = re.search('[0-9]+', 'The first number is 12345, 23456 is the second')
m_findall = re.findall('[0-9]+', '12345 is the first number, 23456 is the second')
print(m_match.group())
print(m_search.group())
print(m_findall)
- Unlike match and search, findall finds all matching results and returns them as a list. When scraping blog post titles, if you want every title rather than just one, findall is the tool to use, as sketched below.
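As a hedged illustration of that idea (the HTML pattern below is an assumption about the page's markup, not taken from it), findall can pull every title out of a fetched page in one call:

import re
import requests

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r = requests.get(link, headers=headers)

# assumed pattern: each title sits inside <h1 class="post-title"><a ...>title</a>;
# adjust it to the real markup of the page being scraped
title_list = re.findall('<h1 class="post-title"><a.*?>(.*?)</a>', r.text, re.S)
print(title_list)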
Parsing Web Pages with BeautifulSoup
- Installing BeautifulSoup is straightforward with pip. In cmd, enter:
pip install bs4
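A quick way to confirm that the installation worked is to parse a tiny hand-written HTML snippet (the snippet itself is only an illustration):

from bs4 import BeautifulSoup

html = "<html><body><h1 class='post-title'><a href='#'>Hello BeautifulSoup</a></h1></body></html>"
soup = BeautifulSoup(html, "html.parser")   # html.parser ships with Python, so no extra parser is needed
print(soup.h1.a.text)                       # prints: Hello BeautifulSoup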
- Using BeautifulSoup to get the blog titles
import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "html.parser")
first_title = soup.find("h1", class_="post-title").a.text.strip()
print("第一篇文章的标题是:", first_title)

title_list = soup.find_all("h1", class_="post-title")
for i in range(len(title_list)):
    title = title_list[i].a.text.strip()
    print('第%s篇文章的标题是:%s' % (i + 1, title))

Every node of the parsed tree is a Python object, and getting content from the page is a matter of extracting content from these objects. The ways to do so fall into three groups:
- 1. Traversing the document tree
soup.header.h3
- 2. Searching the document tree
for tag in soup.find_all(re.compile("^h")):
    print(tag.name)
- 3. CSS selectors
soup.select('a[href^="http://www.santostang.com/"]')
Parsing Web Pages with lxml
- Using lxml to get the blog titles
import requests
from lxml import etree

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

html = etree.HTML(r.text)
title_list = html.xpath('//h1[@class="post-title"]/a/text()')
print(title_list)
BeautifulSoup Crawler in Practice: Housing Price Data
- Getting the first page of Anjuke's Beijing second-hand housing listings
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'http://beijing.anjuke.com/sale/'
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
house_list = soup.find_all('li', class_="list-item")
for house in house_list:
    name = house.find('div', class_='house-title').a.text.strip()
    price = house.find('span', class_='price-det').text.strip()
    price_area = house.find('span', class_='unit-price').text.strip()
    no_room = house.find('div', class_='details-item').span.text
    area = house.find('div', class_='details-item').contents[3].text
    floor = house.find('div', class_='details-item').contents[5].text
    year = house.find('div', class_='details-item').contents[7].text
    broker = house.find('span', class_='brokername').text
    broker = broker[1:]
    address = house.find('span', class_='comm-address').text.strip()
    address = address.replace('\xa0\xa0\n ', ' ')
    tag_list = house.find_all('span', class_='item-tage')
    tags = [i.text for i in tag_list]
    print(name, price, price_area, no_room, area, floor, year, broker, address, tags)
Chapter 6: Data Storage
Basic Storage: Saving to TXT or CSV
- Saving to TXT
title = "This is a test sentence." with open('C:\\you\\desktop\\title.txt',"a+") as f: f.write(title) f.close()
- In with open('C:\\you\\desktop\\title.txt', "a+") as f:, a+ is the file read/write mode: the file is opened for appending and reading, and a new file is created if it does not exist.
- Joining several variables into one string with '\t'.join():
output = '\t'.join(['name', 'title', 'age', 'gender'])
with open('C:\\you\\desktop\\title.txt', "a+") as f:
    f.write(output)
- Sometimes you also need to read data back out of a TXT file. This is very similar to writing: simply change write to read, as in the sketch below.
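A minimal sketch of the reading side, assuming the title.txt written above:

with open('C:\\you\\desktop\\title.txt', "r") as f:
    result = f.read()
    print(result)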
- Saving to CSV
- Try reading the data in test.csv with Python.
import csv

with open('test.csv', 'r', encoding='UTF-8') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)
        print(row[0])
- Writing data to a CSV file
import csv

output_list = ['1', '2', '3', '4']
with open('test2.csv', 'a+', encoding='UTF-8', newline='') as csvfile:
    w = csv.writer(csvfile)
    w.writerow(output_list)
Saving to a MySQL Database
- Operating a MySQL database from Python
- Install the mysqlclient library with pip to connect Python to MySQL. In the command line, enter:
pip install mysqlclient
- After the installation finishes, try operating MySQL from Python by inserting data into the database:
#coding=UTF-8
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='scraping')
cur = conn.cursor()
cur.execute("insert into urls (url, content) values('www.baidu.com', 'This is content.')")
cur.close()
conn.commit()
conn.close()
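To check that the row actually arrived, a small follow-up sketch (assuming the same scraping database and urls table as above) reads it back:

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='scraping')
cur = conn.cursor()
cur.execute("SELECT * FROM urls")
for row in cur.fetchall():
    print(row)   # each row comes back as a tuple of column values
cur.close()
conn.close()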
Store the blog titles and URLs scraped earlier into the MySQL database with Python; the code is as follows:
import requests
from bs4 import BeautifulSoup
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='scraping', charset="utf8")
cur = conn.cursor()

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    cur.execute("insert into urls (url, content) values (%s, %s)", (url, title))

cur.close()
conn.commit()
conn.close()
Saving to a MongoDB Database
- Operating a MongoDB database from Python
- Install the PyMongo library with pip to connect Python to MongoDB. In the command line, enter:
pip install pymongo
- After the installation finishes, try operating MongoDB from Python and check whether you can connect to the database normally.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog
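Because pymongo connects lazily, simply creating the client does not prove the server is reachable; one way to verify, sketched here with a throwaway document (the field name is just a placeholder):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

collection.insert_one({"check": "connection test"})        # write a test document
print(collection.find_one({"check": "connection test"}))   # read it back; printing it confirms the connection works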
- Scrape all the article titles from the blog home page and store them in the MongoDB database; the code is as follows:
import requests
import datetime
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    post = {"url": url,
            "title": title,
            "date": datetime.datetime.utcnow()}
    collection.insert_one(post)
MongoDB Crawler in Practice: The Hupu Forum
- The code to get the first page of data is as follows:
import requests
import datetime
from bs4 import BeautifulSoup

def get_page(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    html = r.content
    html = html.decode('UTF-8')
    soup = BeautifulSoup(html, 'lxml')
    return soup

def get_data(post_list):
    data_list = []
    for post in post_list:
        title_td = post.find('td', class_='p_title')
        title = title_td.find('a', id=True).text.strip()
        post_link = title_td.find('a', id=True)['href']
        post_link = 'https://bbs.hupu.com' + post_link

        author = post.find('td', class_='p_author').text.strip()
        author_page = post.find('td', class_='p_author').a['href']
        start_date = post.find('td', class_='p_author').contents[2]
        start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d').date()

        reply_view = post.find('td', class_='p_re').text.strip()
        reply = reply_view.split('/')[0].strip()
        view = reply_view.split('/')[1].strip()

        reply_time = post.find('td', class_='p_retime').a.text.strip()
        last_reply = post.find('td', class_='p_retime').contents[2]
        if ':' in reply_time:
            # replies from today only show a time, so prepend today's date
            date_time = str(datetime.date.today()) + ' ' + reply_time
            date_time = datetime.datetime.strptime(date_time, '%Y-%m-%d %H:%M')
        else:
            date_time = datetime.datetime.strptime('2017-' + reply_time, '%Y-%m-%d').date()

        data_list.append([title, post_link, author, author_page, start_date, reply, view, last_reply, date_time])
    return data_list

link = "https://bbs.hupu.com/bxj"
soup = get_page(link)
post_list = soup.find_all('tr', mid=True)
data_list = get_data(post_list)
for each in data_list:
    print(each)
Chapter 7: Concurrency and Parallelism, Synchronous and Asynchronous
7.1.1 Concurrency and Parallelism
Concurrency means several events take place within the same period of time. Parallelism means several events take place at the same instant.
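As a rough Python illustration of the difference (a sketch, not part of the original notes): threads inside one process typically take turns within the same period of time (concurrency), while separate processes can run on different CPU cores at the same instant (parallelism).

import threading
import multiprocessing

def task(name):
    print(name, "running")

if __name__ == "__main__":
    # Concurrency: two threads share one interpreter and interleave over a period of time.
    t1 = threading.Thread(target=task, args=("thread-1",))
    t2 = threading.Thread(target=task, args=("thread-2",))
    t1.start(); t2.start(); t1.join(); t2.join()

    # Parallelism: two processes can execute on different CPU cores at the same moment.
    p1 = multiprocessing.Process(target=task, args=("process-1",))
    p2 = multiprocessing.Process(target=task, args=("process-2",))
    p1.start(); p2.start(); p1.join(); p2.join()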
7.1.2 Synchronous and Asynchronous
Synchronous means the concurrent or parallel tasks do not run independently: there is a fixed order between them, and one task may only start after another has finished and produced its result. Asynchronous means the concurrent or parallel tasks can run independently; the execution of one task is not affected by another.
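A small asyncio sketch of the contrast (an illustration, not from the original notes): run synchronously, the second task starts only after the first finishes; run asynchronously, the two tasks overlap and the total time roughly halves.

import asyncio
import time

async def fetch(name, delay):
    await asyncio.sleep(delay)   # stands in for a slow network request
    print(name, "done")

async def main():
    start = time.time()
    # Synchronous order: wait for one task before starting the next (about 2 seconds in total).
    await fetch("task-1", 1)
    await fetch("task-2", 1)
    print("sequential:", round(time.time() - start, 1), "seconds")

    start = time.time()
    # Asynchronous order: the two tasks run independently and overlap (about 1 second in total).
    await asyncio.gather(fetch("task-3", 1), fetch("task-4", 1))
    print("concurrent:", round(time.time() - start, 1), "seconds")

asyncio.run(main())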