
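==Chapter 2: Building a Web Crawler==
===The First Web Crawler===
Numeric types: integer, floating-point, and complex.
String type
 <nowiki>import requests
r = requests.get("http://www.baidu.com")  # open a web page
print(r.status_code)  # print the response status code
r.encoding = 'utf-8'
print(r.text)  # inspect the returned content</nowiki>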

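*Constant: a data object whose value does not change.
*Variable: a name bound to a data object; the id() function shows the object's identity, which in CPython is its memory address.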
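As a minimal sketch of what id() reports (the names a, b, and c are illustrative and not part of the notes): two names bound to the same object share one identity, while an equal but separately created object has its own.

```python
# Illustrative names only; a, b and c do not appear in the notes above.
a = [1, 2, 3]
b = a            # b is bound to the same list object as a
c = [1, 2, 3]    # equal contents, but a separate object

same_object = id(a) == id(b)        # one object, two names
different_object = id(a) != id(c)   # two distinct live objects

print(same_object, different_object)
```

Comparing identities this way is what the `is` operator does, so `a is b` is the more usual spelling.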

Basic commands

For example:
 <nowiki>print("Hello World!")</nowiki>

1. Strings

For example:
 <nowiki>string1='Python Web Scrappy'
string2="by Santos"
string3=string1+" "+string2  # join the two strings with a space between them
print(string3)</nowiki>
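List indexing starts at 0:
 <nowiki>list1=['python','web','scrappy']
list2=[1,2,3,4,5]
print("list1[0]:",list1[0])      # prints: list1[0]: python
print("list2[1:3]:",list2[1:3])  # prints: list2[1:3]: [2, 3]</nowiki>
Changing a value in a list:
 <nowiki>list1[1]="new"
print(list1)</nowiki>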
*Strings are the most common data type. They generally store sentence-like data and are enclosed in single or double quotes.
2. Numbers

For example:
 <nowiki>int1=7
float1=7.5
trans_int=int(float1)
print(trans_int)</nowiki>
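*Numbers store numeric values. The two commonly used numeric types are integers and floating-point numbers; a floating-point number consists of an integer part and a fractional part.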
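A short sketch of how conversion between the two types behaves; the values are illustrative. Note that int() discards the fractional part instead of rounding.

```python
# Illustrative values only.
truncated_positive = int(7.5)    # drops the fractional part
truncated_negative = int(-7.5)   # truncates toward zero, not downward
rounded = round(7.4)             # rounds to the nearest integer
back_to_float = float(7)         # an integer converted to a float

print(truncated_positive, truncated_negative, rounded, back_to_float)
```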
3. Lists

For example:
 <nowiki>list1=['python','web','scrappy']
list2=[1,2,3,4,5]
list3=["a",2,"o",4]</nowiki>
*A list can be used when strings and numbers need to be grouped together.
4. Dictionaries

For example:
 <nowiki>namebook={"Name":"Alex","Age":7,"Class":"First"}
print(namebook["Name"])
print(namebook)</nowiki>
Iterating over every key-value pair in the dictionary:
 <nowiki>for key,value in namebook.items():
    print(key,value)</nowiki>
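*A dictionary is a mutable container in which every stored value corresponds to a key. Keys must be unique, but values need not be, and a value can be of any data type.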
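The key rules above can be sketched as follows. The data reuses the namebook shape from the example; the "Scores" and "Address" keys are made up for illustration.

```python
# Hypothetical example data, reusing the namebook shape from above.
namebook = {"Name": "Alex", "Age": 7, "Class": "First"}

# Assigning to an existing key overwrites its value; no duplicate key is created.
namebook["Age"] = 8

# Assigning to a new key adds an entry; the value may be of any type.
namebook["Scores"] = [90, 85]

# get() returns a default instead of raising KeyError for a missing key.
missing = namebook.get("Address", "unknown")

print(len(namebook), namebook["Age"], missing)
```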
===Conditional and Loop Statements===
*A conditional statement runs a block of code only when a condition is met.

For example:
 <nowiki>book="Python"
if book=="Python":
   print("You are studying python.")
else:
   print("Wrong.")</nowiki>
When there are several conditions to test, use elif.

For example:
 <nowiki>book="java"
if book=="Python":
   print("You are studying python.")
elif book=="java":
   print("You are studying java.")
else:
   print("Wrong.")</nowiki>
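*Loop statements let us execute a piece of code multiple times. There are two kinds: for loops and while loops.
A for loop repeats over the items of a given sequence, in order.

For example: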
 <nowiki>citylist=["Beijing","Shanghai","Guangzhou"]
for eachcity in citylist:
    print(eachcity)</nowiki>
A while loop keeps repeating as long as its condition holds.

For example:
 <nowiki>count=0
while count<3:
    count+=1
print(count)  # runs after the loop ends and prints 3</nowiki>

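===Functions===
Defining a function:

For example:
 <nowiki>def calulus(x):
    y=x+1
    return y
# Call the function outside its body; code placed after return would never run.
result=calulus(2)
print(result)</nowiki>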
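Arguments must be passed to the function correctly. A function can also take multiple parameters.

For example:
 <nowiki>def fruit_function(fruit1,fruit2):
    fruits=fruit1+" "+fruit2  # join the two names with a space between them
    return fruits
result=fruit_function("apple","banana")
print(result)</nowiki>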
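*When there is little code, writing it out in logical order is enough for it to run well.
===Object-Oriented Programming===
*Procedural programming means writing code from top to bottom following the business logic.
For example, with a class:
 <nowiki>class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
    def detail(self):
        print(self.name)
        print(self.age)</nowiki>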
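Using plain functions instead, this can be written as:
 <nowiki>def detail(name,age):
    print(name)
    print(age)</nowiki>
*If the functions are independent and share no data, choose function-based programming; if the functions are related to one another, object-oriented programming is the better choice.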

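====Two Major Features of Object-Oriented Programming: Encapsulation and Inheritance====
1. Encapsulation
*Encapsulation, as the name suggests, means wrapping content up and then calling the wrapped content. It takes two steps:
Step 1: encapsulate the content.
Step 2: call the encapsulated content.

(1) Encapsulating the content
 <nowiki>class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
obj1=Person('santos',18)</nowiki>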
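(2) Calling the encapsulated content
*Encapsulated content can be called in two ways: directly through the object, or indirectly through self.
Calling the name and age attributes of obj1 directly through the object looks like this:
 <nowiki>class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
obj1=Person('santos',18)
print(obj1.name)
print(obj1.age)</nowiki>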
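*When the call goes indirectly through self, Python passes obj1 as the self argument by default, so obj1.detail() is equivalent to Person.detail(obj1). Inside the method self is obj1, so self.name is 'santos' and self.age is 18.

The code is as follows:
 <nowiki>class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
    def detail(self):
        print(self.name)
        print(self.age)
obj1=Person('santos',18)
obj1.detail()  # Python passes obj1 as self</nowiki>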
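The equivalence between the two call forms can be checked directly. In this sketch detail returns a string instead of printing, which is a change made only so the two results can be compared.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        # Return the text rather than printing it so the two calls can be compared.
        return "%s %s" % (self.name, self.age)

obj1 = Person('santos', 18)

via_instance = obj1.detail()        # Python supplies obj1 as self
via_class = Person.detail(obj1)     # the same call with self passed explicitly

print(via_instance == via_class)
```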
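2. Inheritance
*Inheritance builds a specialized class on top of a general class. Inheritance in object-oriented programming resembles inheritance in real life: a child inherits certain traits from its parent.
For example:
 <nowiki>class 猫:  # Cat
    def 喵喵叫(self):  # meow
        print('meow')
    def 吃(self): pass  # eat
    def 喝(self): pass  # drink
    def 拉(self): pass  # poop
    def 撒(self): pass  # pee
class 狗:  # Dog
    def 汪汪叫(self):  # bark
        print('woof')
    def 吃(self): pass  # eat
    def 喝(self): pass  # drink
    def 拉(self): pass  # poop
    def 撒(self): pass  # pee</nowiki>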
Using inheritance, this can be written as:
 <nowiki>class Animal:
    def eat(self):
        print("%s eats"%self.name)
        print("%s drinks"%self.name)
        print("%s poops"%self.name)
        print("%s pees"%self.name)
class Cat(Animal):
    def __init__(self,name):
        self.name=name
    def cry(self):
        print('meow')
class Dog(Animal):
    def __init__(self,name):
        self.name=name
    def cry(self):
        print('woof')
c1=Cat("Xiaobai's little black cat")
c1.eat()
c1.cry()
d1=Dog("Pangzi's skinny little dog")
d1.eat()</nowiki>
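Both subclasses above repeat the same initialisation. One possible variation, sketched here with made-up names ("Kitty", "Rex", the owner attribute), moves it into the parent class and lets a subclass extend it with super(). The methods return strings instead of printing so the result is easy to check.

```python
class Animal:
    def __init__(self, name):
        # Shared initialisation lives in the parent class.
        self.name = name

    def eat(self):
        return "%s eats" % self.name

class Cat(Animal):
    # No __init__ here: Cat inherits Animal.__init__ unchanged.
    def cry(self):
        return "meow"

class Dog(Animal):
    def __init__(self, name, owner):
        super().__init__(name)   # reuse the parent's initialisation
        self.owner = owner       # hypothetical extra attribute

    def cry(self):
        return "woof"

c1 = Cat("Kitty")
d1 = Dog("Rex", "Alex")
print(c1.eat(), c1.cry())
print(d1.eat(), d1.cry(), isinstance(d1, Animal))
```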

===Writing the First Simple Crawler===
====Step 1: Fetch the Page====
 <nowiki>import requests
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
print(r.text)</nowiki>
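*The headers argument of requests makes the request look as if it comes from a browser.
*r is the requests Response object, from which the desired information can be obtained; r.text is the source code of the fetched page.
====Step 2: Extract the Required Data====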
 <nowiki>import requests
from bs4 import BeautifulSoup
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.text,"lxml")
title=soup.find("h1", class_="post-title").a.text.strip()
print(title)</nowiki>
====Step 3: Store the Data====
 <nowiki>import requests
from bs4 import BeautifulSoup
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.text,"lxml")
title=soup.find("h1", class_="post-title").a.text.strip()
print(title)
# The with statement closes the file automatically, so no f.close() is needed.
with open('title.txt',"a+") as f:
    f.write(title)</nowiki>

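==Chapter 3: Scraping Static Web Pages==
===Getting the Response Content===
Example: fetch the content of a QQ Zone home page
 <nowiki>import requests
r=requests.get('https://user.qzone.qq.com/328911422/infocenter')
print("Text encoding:",r.encoding)
print("Response status code:",r.status_code)
print("Response body as a string:",r.text)</nowiki>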
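===Customizing Requests===
====Passing URL Parameters====
Example: pass key1=value1 and key2=value2 to https://user.qzone.qq.com/1150117452/main
 <nowiki>import requests
key_dict={'key1':'value1','key2':'value2'}
# Pass the dictionary through params so that it is appended to the URL as a query string.
r=requests.get('https://user.qzone.qq.com/1150117452/main',params=key_dict)
print("Correctly encoded URL:",r.url)
print("Response body as a string:\n",r.text)</nowiki>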
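To see what an encoded query string looks like without sending a request, the standard library's urllib.parse.urlencode can build one. This is a sketch of the encoding step only, not of what Requests does internally; the base URL and parameters are the ones from the example above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'https://user.qzone.qq.com/1150117452/main'
key_dict = {'key1': 'value1', 'key2': 'value2'}

# urlencode turns the dictionary into "key1=value1&key2=value2",
# percent-encoding any characters that are not safe in a URL.
url = base + '?' + urlencode(key_dict)
print(url)

# Parsing the URL back recovers the parameters (each value comes back as a list).
query = parse_qs(urlsplit(url).query)
print(query)
```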
====Customizing Request Headers====
For example:
 <nowiki>import requests
# The Host header should match the host in the requested URL.
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
         'Host':'movie.douban.com'
    }
r = requests.get('https://movie.douban.com/subject/1292052/',headers=headers)
print("Response status code:",r.status_code)</nowiki>
====Sending a POST Request====
 <nowiki>import requests
key_dict={'key1':'value1','key2':'value2'}
# Send the dictionary as form data in the request body.
r=requests.post('https://user.qzone.qq.com/1150117452/main',data=key_dict)
print(r.text)</nowiki>

===Requests in Practice: TOP250 Movie Data===
====Project Practice====
 <nowiki>import requests
def get_movies():
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host':'movie.douban.com'
    }
    for i in range(0,10):
        # Page i starts at offset i*25, so the ten pages cover offsets 0 to 225.
        link='https://movie.douban.com/top250?start='+str(i*25)
        r=requests.get(link,headers=headers,timeout=10)
        print("Page",str(i+1),"response status code:",r.status_code)
        print(r.text)
get_movies()</nowiki>
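The paging logic can be checked on its own, without any network access. In this sketch top250_links is a hypothetical helper, and the page size of 25 is taken from the loop above.

```python
def top250_links(pages=10, page_size=25):
    # Hypothetical helper: builds the page URLs without sending any request.
    base = 'https://movie.douban.com/top250?start='
    return [base + str(i * page_size) for i in range(pages)]

links = top250_links()
print(len(links))
print(links[0])
print(links[-1])
```

Building the list first makes it easy to confirm that the offsets run from 0 to 225 before any request is sent.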

==Chapter 4: Scraping Dynamic Web Pages==
===Examples of Dynamic Scraping===
*Resolve the real address by inspecting elements in the browser
*Simulate a browser with Selenium
===Scraping by Resolving the Real Address===
For example:
 <nowiki>import requests
# The URL must be one unbroken string; a line break inside it would corrupt the request.
link = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} 
 
r = requests.get(link, headers= headers)
print (r.text)</nowiki>

 <nowiki>import requests
import json
 
def single_page_comment(link):
    headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} 
    r = requests.get(link, headers= headers)
    # get the JSON as a string, then strip the callback wrapper around it
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    
    for eachone in comment_list:
        message = eachone['content']
        print (message)
 
for page in range(1,4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print (link)
    single_page_comment(link)</nowiki>
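The slicing step in single_page_comment removes the callback wrapper around the JSON. A sketch of that step on a made-up response string (the sample text and its two comments are invented for illustration) is shown below. It cuts at the last "}" instead of using a fixed -2 offset, which is a small variation on the code above.

```python
import json

# Made-up response text in the same shape as the example above:
# a callback name, an opening parenthesis, the JSON, then ");".
sample = 'jQuery1124_1506267778283({"results": {"parents": [{"content": "first"}, {"content": "second"}]}});'

# Keep everything from the first "{" to the last "}" inclusive.
start = sample.find('{')
end = sample.rfind('}')
json_data = json.loads(sample[start:end + 1])

messages = [item['content'] for item in json_data['results']['parents']]
print(messages)
```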

===Scraping by Simulating a Browser with Selenium===
====Getting All Comments on an Article with Selenium====
 <nowiki>from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time
 
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
#Change the path above to the location of Firefox on your computer
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
 
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print (content.text)</nowiki>
====Limiting Image Loading====
 <nowiki>from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
 
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
#Change the path above to the location of Firefox on your computer
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image",2)
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile = fp, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")</nowiki>
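*This method uses the browser directly: when displaying the page, the browser parses the HTML, applies the CSS styles, and executes the JavaScript statements.
==Chapter 5: Parsing Web Pages==
===Parsing Web Pages with Regular Expressions===
*The re.match method
  <nowiki>import re
m = re.match('www','www.santostang.com')
print("Match result:",m)
print("Match start and end:",m.span())
print("Match start position:",m.start())
print("Match end position:",m.end())</nowiki>
*The re.search method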
  <nowiki>import re
m_match = re.match('com','www.santostang.com')
m_search = re.search('com','www.santostang.com')
print (m_match)
print (m_search)</nowiki>
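The difference between the two calls matters in practice because re.match returns None when the pattern is not at the start of the string. A minimal sketch of checking for that before using the result:

```python
import re

text = 'www.santostang.com'

m_match = re.match('com', text)     # None: 'com' is not at the start
m_search = re.search('com', text)   # a match object for the first 'com'

# Check for None before calling methods on the result.
if m_match is None:
    print("re.match found nothing")
if m_search is not None:
    print(m_search.span())
```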
*The re.findall method
  <nowiki>import re
# [0-9]+ matches one or more digits; the range needs an ASCII hyphen.
m_match = re.match ('[0-9]+','12345 is the first number,23456 is the second')
m_search = re.search ('[0-9]+','The first number is 12345 ,23456 is the second')
m_findall = re.findall('[0-9]+','12345 is the first number,23456 is the second')
print  (m_match.group())
print  (m_search.group())
print  (m_findall)</nowiki>
*Unlike match and search, findall finds every match and returns the results as a list. When scraping blog post titles, for example, findall can extract all of the titles instead of only one.
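As a sketch of the title case, here is findall applied to a small hand-written HTML fragment. The markup is an assumption made for illustration and is not taken from a real page.

```python
import re

# A small hand-written HTML fragment; the markup is an assumption for illustration.
html = '''
<h1 class="post-title"><a href="/a">First post</a></h1>
<h1 class="post-title"><a href="/b">Second post</a></h1>
'''

# The parenthesised group captures only the link text; .*? is non-greedy.
titles = re.findall(r'<h1 class="post-title"><a href=".*?">(.*?)</a></h1>', html)
print(titles)
```

When the pattern contains one capturing group, findall returns the captured text for each match and omits the rest of the matched markup.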
===Parsing Web Pages with BeautifulSoup===
*Installing BeautifulSoup is very simple: just use pip. In cmd, enter:
pip install bs4
*Getting blog titles with BeautifulSoup
  <nowiki>import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers= headers)

soup = BeautifulSoup(r.text,"html.parser")
first_title = soup.find("h1",class_="post-title").a.text.strip()
print ("The title of the first post is:",first_title)

title_list = soup.find_all("h1",class_="post-title")
for i in range(len(title_list)):
    title = title_list[i].a.text.strip()
    print ('The title of post %s is: %s' %(i+1,title))</nowiki>
In the parsed document tree, every node is a Python object, so getting content from a page is a matter of extracting content from those objects. The ways of extracting objects can be grouped into three:
*1. Traversing the document tree
 <nowiki>soup.header.h3</nowiki>
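*2. Searching the document tree
  <nowiki>import re
# Find every tag whose name starts with "h" (for example h1, h3, head, html).
for tag in soup.find_all(re.compile("^h")):
    print(tag.name)</nowiki>
*3. CSS selectors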
  <nowiki>soup.select('a[href^="http://www.santostang.com/"]')</nowiki>
===Parsing Web Pages with lxml===
*Getting blog titles with lxml
   <nowiki>import requests
from lxml import etree

link = "http://www.santostang.com/"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers= headers)
html = etree.HTML(r.text)
title_list = html.xpath('//h1[@class="post-title"]/a/text()')
print (title_list)</nowiki>
===BeautifulSoup in Practice: Housing Price Data===
*Fetch the first page of Anjuke's second-hand housing results for Beijing
  <nowiki>import requests
from bs4 import BeautifulSoup

headers ={'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'http://beijing.anjuke.com/sale/'
r = requests.get(link,headers = headers)

soup = BeautifulSoup(r.text,'lxml')
house_list = soup.find_all('li',class_="list-item")

for house in house_list:
    name = house.find('div',class_='house-title').a.text.strip()
    price = house.find('span',class_='price-det').text.strip()
    price_area = house.find('span',class_='unit-price').text.strip()

    # The details row holds rooms, area, floor and year in alternating child nodes.
    no_room = house.find('div',class_='details-item').span.text
    area = house.find('div',class_='details-item').contents[3].text
    floor = house.find('div',class_='details-item').contents[5].text
    year = house.find('div',class_='details-item').contents[7].text
    broker = house.find('span',class_='brokername').text
    broker = broker[1:]
    address = house.find('span',class_='comm-address').text.strip()
    address = address.replace('\xa0\xa0\n         ','   ')
    tag_list = house.find_all('span',class_='item-tags')
    tags = [i.text for i in tag_list]
    print (name,price,price_area,no_room,area,floor,year,broker,address,tags)</nowiki>

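==Chapter 6: Data Storage==
===Basic Storage: Saving to TXT or CSV===
*Saving to TXT
  <nowiki>title = "This is a test sentence."
# The with statement closes the file automatically, so no f.close() is needed.
with open('C:\\you\\desktop\\title.txt',"a+") as f:
   f.write(title)</nowiki>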
*Here, in with open('C:\\you\\desktop\\title.txt',"a+") as f:, a+ is the Python file mode for appending and reading: writes are added at the end of the file, and a new file is created if it does not exist.
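The behaviour of a+ can be sketched with a temporary file, so that nothing depends on a real desktop path. The temporary directory is an assumption made for the sketch.

```python
import os
import tempfile

# Use a temporary directory instead of the desktop path from the example.
path = os.path.join(tempfile.mkdtemp(), 'title.txt')
existed_before = os.path.exists(path)    # nothing has been written yet

# Opening with "a+" twice: the first call creates the file,
# the second appends to it instead of overwriting it.
for _ in range(2):
    with open(path, 'a+', encoding='utf-8') as f:
        f.write("This is a test sentence.\n")

with open(path, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

print(existed_before, len(lines))
```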

*The code for joining variables into a single string with '\t'.join() is as follows:
  <nowiki>output = '\t'.join(['name','title','age','gender'])
with open('C:\\you\\desktop\\title.txt',"a+") as f:
   f.write(output)</nowiki>
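*Sometimes data also needs to be read from a TXT file. This works much like writing: open the file in a read mode such as "r" and call read instead of write.
*Saving to CSV
*Try reading the data in test.csv with Python.
  <nowiki>import csv
with open('test.csv','r',encoding='UTF-8') as csvfile:
   csv_reader = csv.reader(csvfile)
   for row in csv_reader:
      print(row)     # the whole row as a list
      print(row[0])  # the first column of the row</nowiki>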
*Writing data to CSV
  <nowiki>import csv
output_list = ['1','2','3','4']
with open('test2.csv','a+',encoding='UTF-8',newline='') as csvfile:
   w = csv.writer(csvfile)
   w.writerow(output_list)</nowiki>
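Writing and reading can be combined into a round trip to confirm that the rows come back unchanged. This sketch uses a temporary file and made-up rows; the second row contains a comma inside a field to show that the csv module quotes it when writing and restores it when reading.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test2.csv')
rows_in = [['1', '2', '3', '4'], ['a,b', 'c', 'd', 'e']]   # made-up rows

# newline='' lets the csv module control line endings itself.
with open(path, 'a+', encoding='UTF-8', newline='') as csvfile:
    w = csv.writer(csvfile)
    for row in rows_in:
        w.writerow(row)

with open(path, 'r', encoding='UTF-8', newline='') as csvfile:
    rows_out = [row for row in csv.reader(csvfile)]

print(rows_out)
```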
===Saving to a MySQL Database===
*Operating a MySQL database from Python
*Install the mysqlclient library with pip to connect Python to MySQL. On the command line, enter:
  <nowiki>pip install mysqlclient</nowiki>
*Once it is installed, we can try operating MySQL from Python by inserting a row into the database:
  <nowiki>#coding=UTF-8
import MySQLdb

conn= MySQLdb.connect(host='localhost',user='root',passwd='root',db='scraping')
cur = conn.cursor()
cur.execute("insert into urls (url,content) values('www.baidu.com','This is content.')")
cur.close()
conn.commit()
conn.close()</nowiki>
*Scrape the titles and URLs of the blog posts as before and store them in the MySQL database with Python. The code is as follows:
  <nowiki>import requests
from bs4 import BeautifulSoup
import MySQLdb

conn=MySQLdb.connect(host='localhost',user='root',passwd='root',db='scraping',charset="utf8")
cur = conn.cursor()

link = "http://www.santostang.com/"
headers = {'User-Agent' :'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r=requests.get(link,headers= headers)

soup = BeautifulSoup(r.text,"lxml")
title_list = soup.find_all("h1",class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    # Let the driver substitute the %s placeholders so the values are escaped safely.
    cur.execute("insert into urls (url,content) values (%s,%s)",(url,title))

cur.close()
conn.commit()
conn.close()</nowiki>
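The parameterized insert can be tried without a MySQL server by swapping in sqlite3 from the standard library. This is a stand-in and not the MySQLdb code above: sqlite3 uses "?" as its placeholder where MySQLdb uses "%s", and the table definition and rows here are made up for illustration.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs without a database server.
# Note the placeholder syntax: sqlite3 uses "?", whereas MySQLdb uses "%s".
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("create table urls (url text, content text)")

# Made-up rows; the second title contains a quote that naive string
# formatting would turn into broken SQL.
rows = [
    ("http://www.santostang.com/a", "Hello world"),
    ("http://www.santostang.com/b", "It's a test"),
]
for url, title in rows:
    cur.execute("insert into urls (url, content) values (?, ?)", (url, title))
conn.commit()

cur.execute("select content from urls order by url")
stored = [r[0] for r in cur.fetchall()]
print(stored)

cur.close()
conn.close()
```

Passing the values as a separate tuple, as in both versions, leaves the quoting to the database driver.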