Chapter 2: Building a Web Crawler

The First Web Crawler

Numeric types: integer, float, and complex; plus the string type (a quick illustration follows below).
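
As a quick illustration (not in the original notes), the built-in type() function shows which type a value has:

# check the basic value types with type()
print(type(7))         # <class 'int'>
print(type(7.5))       # <class 'float'>
print(type(3 + 4j))    # <class 'complex'>
print(type("hello"))   # <class 'str'>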

import requests
r = requests.get("http://www.baidu.com")  # open a web page
print(r.status_code)  # the response status code
r.encoding = 'utf-8'
print(r.text)  # inspect the returned content
  • Constant: a data object whose value does not change.
  • Variable: the id() function can be used to view a variable's memory address (see the sketch below).
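
A minimal sketch (not in the original notes) of using id() to see which object a variable currently points to:

x = 100
print(id(x))    # identity (in CPython, the memory address) of the object bound to x
y = x
print(id(y))    # same object, so the same id
y = y + 1
print(id(y))    # y is rebound to a new object, so the id changes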

Basic commands

For example:

print("Hello World!")

1. Strings

For example:

string1 = 'Python Web Scrappy'
string2 = "by Santos"
string3 = string1 + " " + string2
print(string3)

Indexing starts at 0 (list1 and list2 are the lists defined in section 3 below):

print("list1[0]:",list1[0])
print("list2[1:3]:",list2[1:3])
list1[0]: python
list2[1:3]:[2,3]

Modifying a value in the list:

list1[1]="new"
print(list1)
  • Strings are the most common data type; they are generally used to store sentence-like data and are enclosed in single or double quotes.

2. Numbers

For example:

int1=7
float1=7.5
trans_int=int(float1)
print(trans_int)
  • Numbers store numeric values. There are two common numeric types: integers and floats; a float consists of an integer part and a fractional part.

3. Lists

For example:

list1=['python','web','scrappy']
list2=[1,2,3,4,5]
list3=["a",2,"o",4]
  • If you need to group strings and numbers together, you can use a list.

4. Dictionaries

For example:

namebook={"Name":"Alex","Age":7,"Class":"First"}
print(namebook["Name"])
print(namebook)

Iterating over every key-value pair in the dictionary:

for key,value in namebook.items():
    print(key,value)
  • A dictionary is a mutable container. Every stored value corresponds to a key; keys must be unique, but values need not be, and a value can be of any data type.

Conditional and loop statements

  • A conditional statement runs a piece of code only when its condition is met.

For example:

book="Python"
if book= ="Python":
   print("You are studying python.")
else:
   print("Wrong.")

If there are several conditions to check, you need elif.

For example:

book="java"
if book= ="Python":
   print("You are studying python.")
elif:book= ="java":
   print("You are studying java.")
else:
   print("Wrong.")
  • Loop statements let us run a block of code multiple times. There are two kinds of loops: for loops and while loops.

A for loop repeats over a given sequence.

For example:

citylist=["Beijing","Shanghai","Guangzhou"]
for eachcity in citylist:
    print(eachcity)

A while loop keeps repeating as long as its condition holds.

For example:

count = 0
while count < 3:
    count += 1
print(count)

Functions

Defining a function:

For example:

def calulus(x):
    y = x + 1
    return y

result = calulus(2)
print(result)

Arguments must be passed into the function correctly, and a function can also take several parameters.

For example:

def fruit_function(fruit1,fruit2):
    fruits=fruit1+" "+fruit2
    return fruits
result=fruit_function("apple","banana")
print(result)
  • When there is only a little code, simply writing it out in logical order works well.

Object-oriented programming

  • Procedure-oriented programming means writing code from top to bottom according to the business logic.

For example:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def detail(self):
        print(self.name)
        print(self.age)

If we used function-style programming instead, we could write:

def detail(name, age):
    print(name)
    print(age)
  • If the functions are independent of each other and share no data, functional programming is a good choice; if the functions are related to one another, object-oriented programming is the better choice.

The two main features of object orientation: encapsulation and inheritance

1. Encapsulation

  • Encapsulation, as the name suggests, means packaging the content up and then calling the packaged content. It takes two steps:

Step 1: encapsulate the content. Step 2: call the encapsulated content.

(1) Encapsulating the content

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = Person('santos', 18)

(2) Calling the encapsulated content

  • There are two ways to call encapsulated content: directly through the object, or indirectly through self.

Calling the name and age attributes of obj1 directly through the object:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = Person('santos', 18)
print(obj1.name)
print(obj1.age)
  • When calling indirectly through self, Python automatically passes obj1 to the self parameter, i.e. obj1.detail() behaves like detail(obj1). Inside the method self = obj1, so self.name = 'santos' and self.age = 18.

The code is as follows:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def detail(self):
        print(self.name)
        print(self.age)

obj1 = Person('santos', 18)
obj1.detail()  # Python passes obj1 as self automatically

2. Inheritance

  • Inheritance builds specialized classes on top of an ordinary base class. Inheritance in object-oriented programming is similar to inheritance in real life: the child inherits certain traits of the parent.

For example:

# without inheritance, every class repeats the same methods
class 猫:
    def 喵喵叫(self):
        print('喵喵叫')
    def 吃(self):
        pass  # placeholder body
    def 喝(self):
        pass
    def 拉(self):
        pass
    def 撒(self):
        pass

class 狗:
    def 汪汪叫(self):
        print('汪汪叫')
    def 吃(self):
        pass
    def 喝(self):
        pass
    def 拉(self):
        pass
    def 撒(self):
        pass

Using inheritance, this can be rewritten as:

class Animal:
    def eat(self):
        print("%s 吃" % self.name)
        print("%s 喝" % self.name)
        print("%s 拉" % self.name)
        print("%s 撒" % self.name)

class Cat(Animal):
    def __init__(self, name):
        self.name = name
    def cry(self):
        print('喵喵叫')

class Dog(Animal):
    def __init__(self, name):
        self.name = name
    def cry(self):
        print('汪汪叫')

c1 = Cat('小白家的小黑猫')
c1.eat()
c1.cry()
d1 = Dog('胖子家的小瘦狗')
d1.eat()

Writing a first simple crawler

Step 1: Fetch the page

import requests
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
print(r.text)
  • Use the headers argument of requests to disguise the request as a browser visit.
  • r is the requests Response object, from which the desired information can be extracted; r.text is the HTML source of the fetched page (a few other useful attributes are sketched below).
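
As a small supplement (not from the original notes), a few other Response attributes that are often useful alongside r.text:

import requests

r = requests.get("http://www.baidu.com")
print(r.status_code)    # HTTP status code, e.g. 200
print(r.encoding)       # text encoding guessed by requests
print(r.headers)        # response headers sent by the server
print(len(r.content))   # length of the raw response body in bytes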

Step 2: Extract the data you need

import requests
from bs4 import BeautifulSoup
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.text,"lxml")
title=soup.find("h1", class_="post-title").a.text.strip()
print(title)

Step 3: Store the data

import requests
from bs4 import BeautifulSoup
link="https://user.qzone.qq.com/328911422/main"
headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.text,"lxml")
title=soup.find("h1", class_="post-title").a.text.strip()
print(title)
with open('title.txt',"a+") as f:
    f.write(title)
    f.close()

Chapter 3: Scraping Static Web Pages

Getting the response content

Example: fetching the QQ Zone homepage content

import requests
r = requests.get('https://user.qzone.qq.com/328911422/infocenter')
print("Text encoding:", r.encoding)
print("Response status code:", r.status_code)
print("Response body as a string:", r.text)

Customizing Requests

Passing URL parameters

Example: pass key1=value1 and key2=value2 to https://user.qzone.qq.com/1150117452/main

import requests
key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://user.qzone.qq.com/1150117452/main', params=key_dict)
print("Correctly encoded URL:", r.url)
print("Response body as a string:\n", r.text)

Customizing request headers

For example:

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
           'Host': 'movie.douban.com'
           }
r = requests.get('https://movie.douban.com/subject/1292052/', headers=headers)
print("Response status code:", r.status_code)

Sending a POST request

import requests
key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('https://user.qzone.qq.com/1150117452/main', data=key_dict)
print(r.text)

Requests crawler project: Top 250 movie data

Project practice

import requests

def get_movies():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host': 'movie.douban.com'
    }
    for i in range(0, 10):
        # each page lists 25 movies, so the offset for page i+1 is i * 25
        link = 'https://movie.douban.com/top250?start=' + str(i * 25)
        r = requests.get(link, headers=headers, timeout=10)
        print("Page", i + 1, "response status code:", r.status_code)
        print(r.text)

get_movies()

Chapter 4: Scraping Dynamic Web Pages

An example of dynamic scraping

  • Parse the real address through the browser's developer tools (inspect element)
  • Scrape by simulating a browser with Selenium

Scraping by resolving the real address

For example:

import requests
link = """https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2
Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"""
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} 
 
r = requests.get(link, headers= headers)
print (r.text)
import requests
import json
 
def single_page_comment(link):
    headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} 
    r = requests.get(link, headers= headers)
    # get the JSON string out of the jsonp response
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    
    for eachone in comment_list:
        message = eachone['content']
        print (message)
 
for page in range(1,4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print (link)
    single_page_comment(link)

Scraping by simulating a browser with Selenium

Using Selenium to get all the comments on an article

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time
 
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your own machine
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
 
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print (content.text)

Restricting image loading:

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
 
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
binary = FirefoxBinary(r'D:\Program Files\Mozilla Firefox\firefox.exe')
# change the path above to the location of Firefox on your own machine
fp = webdriver.FirefoxProfile()
fp.set_preference("permissions.default.image",2)
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile = fp, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
  • Selenium uses the browser itself to parse the HTML, apply the CSS styles, and execute the JavaScript statements while the page is displayed (a sketch of waiting for such dynamically rendered content follows below).
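
Because the JavaScript on a dynamic page needs time to run, it is often necessary to wait for the rendered element before reading it. A minimal sketch (not from the original notes) using Selenium's explicit waits, assuming the same comment page as above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.santostang.com/2017/03/02/hello-world/")

# wait up to 10 seconds for the comment iframe, then switch into it
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.CSS_SELECTOR, "iframe[title='livere']")))

# wait for the comment blocks that JavaScript renders inside the iframe
comments = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, 'div.reply-content')))
print(len(comments), "comments loaded")
driver.quit()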

Chapter 5: Parsing Web Pages

Parsing web pages with regular expressions

  • The re.match method
 import re
m = re.match('www', 'www.santostang.com')
print("Match result:", m)
print("Start and end of the match:", m.span())
print("Start position of the match:", m.start())
print("End position of the match:", m.end())
  • The re.search method
 import re
m_match = re.match('com','www.santostang.com')
m_search = re.search('com','www.santostang.com')
print (m_match)
print (m_search)
  • The re.findall method
 import re
m_match = re.match('[0-9]+', '12345 is the first number, 23456 is the second')
m_search = re.search('[0-9]+', 'The first number is 12345, 23456 is the second')
m_findall = re.findall('[0-9]+', '12345 is the first number, 23456 is the second')
print(m_match.group())
print(m_search.group())
print(m_findall)
  • Unlike match and search, findall finds all matching results and returns them as a list. When scraping blog post titles, if you want every title rather than just one, findall is the method to use (see the sketch below).
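
A minimal sketch (not from the original notes) of using findall to pull every title out of a page; the HTML snippet and the post-title pattern are assumptions modelled on the blog markup used elsewhere in these notes:

import re

html = '''
<h1 class="post-title"><a href="http://www.santostang.com/post1/">First post</a></h1>
<h1 class="post-title"><a href="http://www.santostang.com/post2/">Second post</a></h1>
'''

# findall returns a list with one entry per match of the captured group
title_list = re.findall('<h1 class="post-title"><a href=".*?">(.*?)</a></h1>', html)
print(title_list)   # ['First post', 'Second post']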

Parsing web pages with BeautifulSoup

  • Installing BeautifulSoup is very simple; just install it with pip. In cmd, type:

pip install bs4

  • Using BeautifulSoup to get the blog titles
 import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "html.parser")
first_title = soup.find("h1", class_="post-title").a.text.strip()
print("The title of the first post is:", first_title)

title_list = soup.find_all("h1", class_="post-title")
for i in range(len(title_list)):
    title = title_list[i].a.text.strip()
    print('The title of post %s is: %s' % (i+1, title))

Every node in the BeautifulSoup tree is a Python object, so getting content from a page is a process of extracting content from those objects. The ways of extracting objects fall into three categories (a combined sketch follows the list):
  • 1. Navigating the document tree
 soup.header.h3
  • 2. Searching the document tree
 for tag in soup.find_all(re.compile("^h")):
     print(tag.name)
  • 3. CSS selectors
 soup.select('a[href^="http://www.santostang.com/"]')
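
A combined sketch (not from the original notes) showing the three approaches on the soup object built above; the header/h3 structure on the page is an assumption:

import re

# 1. Navigating the document tree: step from tag to tag by attribute access
print(soup.header.h3)

# 2. Searching the document tree: find every tag whose name starts with "h"
for tag in soup.find_all(re.compile("^h")):
    print(tag.name)

# 3. CSS selectors: every link that points at the blog's own domain
for link_tag in soup.select('a[href^="http://www.santostang.com/"]'):
    print(link_tag.get('href'))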

Parsing web pages with lxml

  • Using lxml to get the blog titles
  import requests
from lxml import etree

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
html = etree.HTML(r.text)
title_list = html.xpath('//h1[@class="post-title"]/a/text()')
print (title_list)

BeautifulSoup crawler project: housing price data

  • Getting page 1 of the Anjuke Beijing second-hand housing listings
 import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'http://beijing.anjuke.com/sale/'
r = requests.get(link,headers = headers)

soup = BeautifulSoup(r.text,'lxml')
house_list = soup.find_all('li',class_="list-item")

for house in house_list:
    name = house.find('div',class_='house-title').a.text.strip()
    price = house.find('span',class_='price-det').text.strip()
    price_area = house.find('span',class_='unit-price').text.strip()


    no_room = house.find('div',class_='details-item').span.text
    area = house.find('div',class_='details-item').contents[3].text
    floor = house.find('div',class_='details-item').contents[5].text
    year = house.find('div',class_='details-item').contents[7].text
    broker = house.find('span',class_='brokername').text
    broker = broker[1:]
    address = house.find('span',class_='comm-address').text.strip()
    address = address.replace('\xa0\xa0\n         ','   ')
    tag_list = house.find_all('span',class_='item-tage')
    tags = [i.text for i in tag_list]
    print(name, price, price_area, no_room, area, floor, year, broker, address, tags)

Chapter 6: Data Storage

Basic storage: saving to TXT or CSV

  • Saving to TXT
 title = "This is a test sentence."
with open('C:\\you\\desktop\\title.txt',"a+") as f:
   f.write(title)
   f.close()
  • In with open('C:\\you\\desktop\\title.txt',"a+") as f:, a+ is the Python file mode, meaning the file is opened for appending and reading; if the file does not exist, a new one is created.
  • Joining the variables into a single string with '\t'.join():
 output = '\t'.join(['name','title','age','gender'])
with open('C:\\you\\desktop\\title.txt',"a+") as f:
   f.write(output)
   f.close()
  • Sometimes you also need to read data from a TXT file; this is very similar to writing, essentially changing write to read (a minimal sketch follows).
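
A minimal sketch (not from the original notes) of reading the file written above back in; the path is the same assumed desktop path:

with open('C:\\you\\desktop\\title.txt', "r") as f:
    result = f.read()    # read the whole file into a single string
    print(result)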
  • Saving to CSV
  • Try reading the data in test.csv with Python.
 import csv
with open('test.csv', 'r', encoding='UTF-8') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)
        print(row[0])
  • Writing data to CSV
 import csv
output_list = ['1','2','3','4']
with open('test2.csv','a+',encoding='UTF-8',newline='') as csvfile:
   w = csv.writer(csvfile)
   w.writerow(output_list)

Saving to a MySQL database

  • Operating a MySQL database from Python
  • Install the mysqlclient library with pip to connect Python and MySQL. In the command line, type:
 pip install mysqlclient
  • After the installation, we can try operating MySQL from Python and insert a row into the database:
 #coding=UTF-8
import MySQLdb

conn= MySQLdb.connect(host='localhost',user='root',passwd='root',db='scraping')
cur = conn.cursor()
cur.execute("insert into urls (url,content) values('www.baidu.com','This is content.')")
cur.close()
conn.commit()
conn.close()

The following code scrapes the titles and URLs from the blog used earlier and stores them in the MySQL database with Python:

 import requests
from bs4 import BeautifulSoup
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='scraping', charset="utf8")
cur = conn.cursor()

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    cur.execute("insert into urls (url,content) values (%s,%s)", (url, title))

cur.close()
conn.commit()
conn.close()

Saving to a MongoDB database

  • Operating a MongoDB database from Python
  • Install the PyMongo library with pip to connect Python and MongoDB. In the command line, type:
 pip install pymongo
  • After the installation, try operating MongoDB from Python to check that we can connect to the database normally (a small connectivity check is sketched after the code).
 from pymongo import MongoClient
client = MongoClient('localhost',27017)
db = client.blog_database
collection = db.blog
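
A small sketch (not from the original notes) of actually checking the connection; MongoDB's ping command simply asks the server to respond:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
# command() raises an exception if the server cannot be reached
print(client.admin.command('ping'))    # {'ok': 1.0} when the connection works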
  • The following code scrapes all the article titles from the blog homepage and stores them in the MongoDB database:
 import requests
import datetime
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    post = {"url": url,
            "title": title,
            "date": datetime.datetime.utcnow()}
    collection.insert_one(post)

MongoDB crawler project: Hupu forum

  • The code for getting the first page of data is as follows:
 import requests
from bs4 import BeautifulSoup
import datetime

def get_page(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    html = r.content
    html = html.decode('UTF-8')
    soup = BeautifulSoup(html, 'lxml')
    return soup

def get_data(post_list):
    data_list = []
    for post in post_list:
        title_td = post.find('td', class_='p_title')
        title = title_td.find('a', id=True).text.strip()
        post_link = title_td.find('a', id=True)['href']
        post_link = 'https://bbs.hupu.com' + post_link

        author = post.find('td', class_='p_author').text.strip()
        author_page = post.find('td', class_='p_author').a['href']
        start_date = post.find('td', class_='p_author').contents[2]
        start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d').date()

        reply_view = post.find('td', class_='p_re').text.strip()
        reply = reply_view.split('/')[0].strip()
        view = reply_view.split('/')[1].strip()

        reply_time = post.find('td', class_='p_retime').a.text.strip()
        last_reply = post.find('td', class_='p_retime').contents[2]
        if ':' in reply_time:
            # a time like "12:34" means the last reply was today
            date_time = str(datetime.date.today()) + ' ' + reply_time
            date_time = datetime.datetime.strptime(date_time, '%Y-%m-%d %H:%M')
        else:
            # otherwise the forum only shows month-day, so prepend the year
            date_time = datetime.datetime.strptime('2017-' + reply_time, '%Y-%m-%d').date()
        data_list.append([title, post_link, author, author_page, start_date, reply, view, last_reply, date_time])
    return data_list

link = "https://bbs.hupu.com/bxj"
soup = get_page(link)
post_list = soup.find_all('tr', mid=True)
data_list = get_data(post_list)
for each in data_list:
    print(each)
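
Since this section is about MongoDB practice, here is a short sketch (not from the original notes) of how the scraped rows could be written into MongoDB, following the pymongo pattern shown earlier; the hupu_database and bxj_post names are assumptions:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.hupu_database       # assumed database name
collection = db.bxj_post        # assumed collection name

# turn each scraped row into a document and insert it
for each in data_list:
    post = {"title": each[0], "post_link": each[1], "author": each[2],
            "author_page": each[3], "start_date": str(each[4]),
            "reply": each[5], "view": each[6],
            "last_reply": each[7], "last_reply_time": str(each[8])}
    collection.insert_one(post)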