Big Red

Chapter 2: Building a Web Crawler

The First Web Crawler

Numeric types: integer, floating-point, and complex. String type.

import requests
r = requests.get("http://www.baidu.com")  # open a web page
print(r.status_code)  # print the response status code
r.encoding = 'utf-8'
print(r.text)  # inspect the returned content

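Constant: a data object whose value does not change.
Variable: a name bound to an object; the id() function shows the memory address (in CPython) of the object a variable refers to.

Basic Commands

For example: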
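As a small illustration of id(), the sketch below shows that two names bound to the same object report the same id, and that rebinding a name to a new object changes it. The variable names are made up for this example.

```python
# Two names bound to the same list share one identity.
a = [1, 2, 3]
b = a
same_object = id(a) == id(b)       # b refers to the same object as a

# Rebinding b to a new list gives it a different identity,
# even though the contents are equal.
b = [1, 2, 3]
different_object = id(a) != id(b)

print(same_object, different_object)
```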

print("Hello World!")

1. Strings

For example:

string1='Python Web Scrappy'
string2="by Santos"
string3=string1+" "+string2  # join the two strings with a space
print(string3)

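List indices start at 0 (list1 and list2 are the lists defined in section 3 below):

list1=['python','web','scrappy']
list2=[1,2,3,4,5]
print("list1[0]:",list1[0])
print("list2[1:3]:",list2[1:3])

Output:

list1[0]: python
list2[1:3]: [2, 3]

To change a value in a list:

list1[1]="new"
print(list1)

Strings are the most common data type. They typically store sentence-like data and are enclosed in single or double quotes.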

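2. Numbers

For example:

int1=7
float1=7.5
trans_int=int(float1)  # int() drops the fractional part, giving 7
print(trans_int)

Numbers store numeric values. The two commonly used numeric types are integers and floating-point numbers; a floating-point number has an integer part and a fractional part.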
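The conversions between the two numeric types can be sketched as follows. This is only an illustration with made-up variable names; note that int() truncates while round() rounds.

```python
value = 7.5

truncated = int(value)       # drops the fractional part
negative = int(-7.5)         # truncation goes toward zero, not downward
rounded = round(value)       # rounds to the nearest integer (ties go to even)
as_float = float(7)          # integer converted to a floating-point number
from_text = int("42") + 1    # numeric text must be converted before arithmetic

print(truncated, negative, rounded, as_float, from_text)
```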

3. Lists

For example:

list1=['python','web','scrappy']
list2=[1,2,3,4,5]
list3=["a",2,"o",4]

When you need to hold strings and numbers together in one collection, use a list.

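4. Dictionaries

For example:

namebook={"Name":"Alex","Age":7,"Class":"First"}
print(namebook["Name"])
print(namebook)

To iterate over every key and value in a dictionary:

for key,value in namebook.items():
    print(key,value)

A dictionary is a mutable container. Each stored value is associated with a key; keys must be unique, but values need not be, and a value can be of any data type.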
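The uniqueness of keys can be seen in a short sketch: assigning to an existing key replaces its value instead of adding a second entry. The "Scores" and "Address" keys are made up for this illustration.

```python
namebook = {"Name": "Alex", "Age": 7, "Class": "First"}

namebook["Age"] = 8              # existing key: the old value is replaced
namebook["Scores"] = [90, 85]    # new key: a value can be any type, e.g. a list

size = len(namebook)             # still one entry per key
missing = namebook.get("Address", "unknown")   # .get() avoids a KeyError

print(namebook)
print(size, missing)
```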

Conditional and Loop Statements

A conditional statement runs a block of code only when a condition is met.

For example:

book="Python"
if book=="Python":
    print("You are studying python.")
else:
    print("Wrong.")

When there are several conditions to check, use elif.

For example:

book="java"
if book=="Python":
    print("You are studying python.")
elif book=="java":
    print("You are studying java.")
else:
    print("Wrong.")

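Loop statements let us run a block of code multiple times. There are two kinds: for loops and while loops.

A for loop repeats once for each item of a given sequence, in order.

For example:

citylist=["Beijing","Shanghai","Guangzhou"]
for eachcity in citylist:
    print(eachcity)

A while loop keeps repeating as long as its condition holds.

For example:

count=0
while count<3:
    count+=1
print(count)  # runs once after the loop ends and prints 3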

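Functions

Defining a function:

For example:

def calulus(x):
    y=x+1
    return y
result=calulus(2)  # call the function outside its body
print(result)

Arguments must be passed to the function correctly. A function can also take more than one parameter.

For example:

def fruit_function(fruit1,fruit2):
    fruits=fruit1+" "+fruit2  # join the two names with a space
    return fruits
result=fruit_function("apple","banana")
print(result)

When there is little code, writing it out step by step following the logic works well.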

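Object-Oriented Programming

Procedural programming means writing code from top to bottom following the business logic. Object-oriented programming instead groups data and the functions that operate on it into classes.

For example:

class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
    def detail(self):
        print(self.name)
        print(self.age)

With plain functions, the same thing can be written as:

def detail(name,age):
    print(name)
    print(age)

If the functions are independent and share no data, plain functions are the better choice; if the functions are related to one another, object-oriented programming is the better choice.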

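The two main features of object-oriented programming: encapsulation and inheritance

1. Encapsulation

Encapsulation, as the name suggests, means wrapping content up and then accessing the wrapped content. It takes two steps:

Step one is to encapsulate the content. Step two is to access the encapsulated content.

(1) Encapsulating content

class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
obj1=Person('santos',18)  # name and age are now stored inside obj1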

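(2) Accessing the encapsulated content

There are two ways to access encapsulated content: directly through the object, and indirectly through self.

Accessing the name and age attributes of obj1 directly through the object looks like this:

class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
obj1=Person('santos',18)
print(obj1.name)
print(obj1.age)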

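When access goes indirectly through self, Python passes obj1 as the self argument automatically, so obj1.detail() is equivalent to Person.detail(obj1). Inside the method, self is obj1, so self.name is 'santos' and self.age is 18.

The code is as follows:

class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
    def detail(self):
        print(self.name)
        print(self.age)
obj1=Person('santos',18)
obj1.detail()  # Python passes obj1 as self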
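To make the implicit self argument concrete, the sketch below calls one method both ways and compares the results. The describe method and its return format are additions for this illustration and are not part of the class in these notes.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def describe(self):
        # Hypothetical helper: returns the data instead of printing it.
        return "%s is %d" % (self.name, self.age)

obj1 = Person('santos', 18)

via_object = obj1.describe()         # Python fills in self=obj1
via_class = Person.describe(obj1)    # the same call with self passed explicitly

print(via_object)
print(via_object == via_class)
```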

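2. Inheritance

Inheritance builds specialized classes on top of a general class. Inheritance in object-oriented programming resembles inheritance in real life: a child inherits some traits from its parent.

For example, without inheritance a cat class and a dog class each have to define the same methods:

class 猫:  # Cat
    def 喵喵叫(self):  # meow
        print('meow')
    def 吃(self):  # eat
        pass
    def 喝(self):  # drink
        pass
    def 拉(self):  # poop
        pass
    def 撒(self):  # pee
        pass
class 狗:  # Dog
    def 汪汪叫(self):  # bark
        print('woof')
    def 吃(self):
        pass
    def 喝(self):
        pass
    def 拉(self):
        pass
    def 撒(self):
        pass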

With inheritance, this can be written as:

class Animal:
    def eat(self):
        print("%s eats"%self.name)
        print("%s drinks"%self.name)
        print("%s poops"%self.name)
        print("%s pees"%self.name)
class Cat(Animal):
    def __init__(self,name):
        self.name=name
    def cry(self):
        print('meow')
class Dog(Animal):
    def __init__(self,name):
        self.name=name
    def cry(self):
        print('woof')
c1=Cat("Xiaobai's little black cat")
c1.eat()
c1.cry()
d1=Dog("Fatty's skinny little dog")
d1.eat()
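One possible refinement, sketched under the assumption that every animal has a name: the duplicated __init__ can also move into the base class, so each subclass keeps only its own behavior. The speak method and the sample names are made up for this sketch.

```python
class Animal:
    def __init__(self, name):
        self.name = name           # shared by every subclass

    def eat(self):
        return "%s eats" % self.name

class Cat(Animal):
    def speak(self):               # only the cat-specific behavior lives here
        return "meow"

class Dog(Animal):
    def speak(self):
        return "woof"

c1 = Cat("kitty")
d1 = Dog("puppy")

print(c1.eat(), c1.speak())
print(d1.eat(), d1.speak())
# A Cat is an Animal, but it is not a Dog.
print(isinstance(c1, Animal), isinstance(c1, Dog))
```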

Writing the First Simple Crawler

Step 1: Fetch the page

import requests
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
print(r.text)

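The headers argument of requests makes the request look like it comes from a browser.
r is the Response object returned by requests, from which the desired information can be read; r.text is the source of the fetched page.

Step 2: Extract the required data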

import requests
from bs4 import BeautifulSoup
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.text,"lxml")
title=soup.find("h1", class_="post-title").a.text.strip()
print(title)

Step 3: Store the data

import requests
from bs4 import BeautifulSoup
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.text,"lxml")
title=soup.find("h1", class_="post-title").a.text.strip()
print(title)
with open('title.txt',"a+") as f:
    f.write(title)  # the with block closes the file automatically
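The storage step can be tried without any network access. The sketch below is only an illustration: the titles are placeholders standing in for scraped results, and the file goes to a temporary directory rather than the working directory.

```python
import os
import tempfile

# Placeholder titles standing in for scraped results.
titles = ["First post", "Second post"]

path = os.path.join(tempfile.mkdtemp(), "title.txt")

for title in titles:
    # "a+" appends to the file; the with block closes it on exit,
    # so no explicit f.close() is needed.
    with open(path, "a+", encoding="utf-8") as f:
        f.write(title + "\n")

# Read the file back to confirm both writes were kept.
with open(path, encoding="utf-8") as f:
    saved = f.read().splitlines()

print(saved)
```

Because the mode is "a+", running the write loop again adds more lines instead of overwriting the earlier ones.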

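Chapter 3: Scraping Static Web Pages

Getting the Response Content

Example: fetch the content of a QQ Zone (Qzone) home page

import requests
r=requests.get('https://user.qzone.qq.com/328911422/infocenter')
print("Text encoding:",r.encoding)
print("Response status code:",r.status_code)
print("Response body as a string:",r.text)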

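Customizing Requests

Passing URL Parameters

Example: pass key1=value1 and key2=value2 to https://user.qzone.qq.com/1150117452/main

import requests
key_dict={'key1':'value1','key2':'value2'}
r=requests.get('https://user.qzone.qq.com/1150117452/main',params=key_dict)  # params is appended to the URL as a query string
print("URL correctly encoded:",r.url)
print("Response body as a string:\n",r.text)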
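What such a query string looks like can be checked offline. The sketch below uses the standard library's urllib.parse.urlencode rather than requests, so it illustrates the encoding only and makes no claim about how requests builds the URL internally. The "q" key and its value are made up.

```python
from urllib.parse import urlencode

base_url = "https://user.qzone.qq.com/1150117452/main"
key_dict = {"key1": "value1", "key2": "value2"}

# Each key=value pair is joined with "&" and attached after "?".
query = urlencode(key_dict)
full_url = base_url + "?" + query
print(full_url)

# Special characters are percent-encoded, and spaces become "+".
encoded = urlencode({"q": "web scraping & python"})
print(encoded)
```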

Customizing Request Headers

For example:

import requests
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
         'Host':'movie.douban.com'  # should match the host in the URL below
         }
r = requests.get('https://movie.douban.com/subject/1292052/',headers=headers)
print("Response status code:",r.status_code)

Sending a POST Request

import requests
key_dict={'key1':'value1','key2':'value2'}
r=requests.post('https://user.qzone.qq.com/1150117452/main',data=key_dict)  # data is sent as a form in the request body
print(r.text)

Requests Crawler in Practice: Top 250 Movie Data

Project Practice

import requests
def get_movies():
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host':'movie.douban.com'
    }
    for i in range(0,10):
        # each page lists 25 movies, so page i starts at offset i*25
        link='https://movie.douban.com/top250?start='+str(i*25)
        r=requests.get(link,headers=headers,timeout=10)
        print("Page",str(i+1),"response status code:",r.status_code)
        print(r.text)
get_movies()
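The paging logic can be checked on its own, without sending any requests. The sketch below only builds the ten page URLs; the constant names are made up for this illustration.

```python
BASE_URL = "https://movie.douban.com/top250?start="
PAGE_SIZE = 25      # movies per page
TOTAL_PAGES = 10    # 250 movies in total

# Page i (counting from 0) starts at offset i * PAGE_SIZE.
links = [BASE_URL + str(i * PAGE_SIZE) for i in range(TOTAL_PAGES)]

for page_number, link in enumerate(links, start=1):
    print(page_number, link)
```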

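Chapter 4: Scraping Dynamic Web Pages

An Example of Dynamic Scraping

There are two ways to scrape dynamically loaded content:

Parse the real request address by inspecting elements in the browser
Simulate a browser with Selenium

Scraping by Parsing the Real Address

For example:

import requests
link = """https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"""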
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} 
 
r = requests.get(link, headers= headers)
print (r.text)

import requests
import json
 
def single_page_comment(link):
    headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} 
    r = requests.get(link, headers= headers)
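    # get the response text, which wraps the JSON in a callback
    json_string = r.text
    # keep only the JSON object: from the first '{' up to the trailing ');'
    json_string = json_string[json_string.find('{'):-2]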
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    
    for eachone in comment_list:
        message = eachone['content']
        print (message)
 
for page in range(1,4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print (link)
    single_page_comment(link)
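The unwrapping step can be tried offline. In the sketch below, the sample text is made up and only mimics the callback({...}); shape that the slicing above assumes; the real response may differ. It also uses rfind('}') to find the end of the JSON object, which is a slightly more defensive variant than the fixed -2 offset.

```python
import json

# Made-up JSONP text standing in for r.text.
sample = 'jQuery123_456({"results": {"parents": [{"content": "Nice post"}, {"content": "Thanks"}]}});'

start = sample.find('{')        # first character of the JSON object
end = sample.rfind('}') + 1     # one past its last character
json_string = sample[start:end]

json_data = json.loads(json_string)
messages = [item['content'] for item in json_data['results']['parents']]

print(messages)
# For text ending in ");" the fixed -2 slice gives the same result.
print(sample[start:-2] == json_string)
```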

Scraping by Simulating a Browser with Selenium

Getting All Comments of an Article with Selenium

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time
 
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your computer
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
 
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print (content.text)

Disabling image loading:

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
 
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your computer
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image",2)
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile = fp, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")

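This approach uses the browser directly: when displaying the page, the browser parses the HTML, applies the CSS styles, and executes the JavaScript statements.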

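Chapter 5: Parsing Web Pages

Parsing Web Pages with Regular Expressions

The re.match method

import re
m = re.match('www','www.santostang.com')
print("Match result:",m)
print("Match span (start, end):",m.span())
print("Match start position:",m.start())
print("Match end position:",m.end())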

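The re.search method

import re
m_match = re.match('com','www.santostang.com')  # match only tries the start of the string, so this is None
m_search = re.search('com','www.santostang.com')  # search scans the whole string
print (m_match)
print (m_search)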

The re.findall method

import re
m_match = re.match ('[0-9]+','12345 is the first number,23456 is the second')
m_search = re.search ('[0-9]+','The first number is 12345 ,23456 is the second')
m_findall = re.findall('[0-9]+','12345 is the first number,23456 is the second')
print  (m_match.group())
print  (m_search.group())
print  (m_findall)

Unlike match and search, findall finds every match and returns the results as a list. When scraping blog article titles, use findall if you want all of the titles rather than just one.
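The title case mentioned above can be sketched offline. The HTML below is made up and only imitates the h1.post-title markup used elsewhere in these notes; a real page would need a pattern adapted to its actual source.

```python
import re

# Made-up HTML standing in for r.text from a blog page.
html = """
<h1 class="post-title"><a href="/post/1">Hello World</a></h1>
<h1 class="post-title"><a href="/post/2">Second Post</a></h1>
"""

# The parentheses form a capture group, so findall returns only the text
# between <a ...> and </a>; .*? matches as little as possible.
pattern = r'<h1 class="post-title"><a href=".*?">(.*?)</a>'

title_list = re.findall(pattern, html)       # every title, as a list
first = re.search(pattern, html).group(1)    # search stops at the first match

print(title_list)
print(first)
```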

Parsing Web Pages with BeautifulSoup

Installing BeautifulSoup is simple: just use pip. In cmd, enter:

pip install bs4

Getting Blog Titles with BeautifulSoup

import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers= headers)

soup = BeautifulSoup(r.text,"html.parser")
# find() returns the first matching element
first_title = soup.find("h1",class_="post-title").a.text.strip()
print ("The title of the first article is:",first_title)

# find_all() returns every matching element
title_list = soup.find_all("h1",class_="post-title")
for i in range(len(title_list)):
    title = title_list[i].a.text.strip()
    print ('The title of article %s is: %s' %(i+1,title))

