A First Look at Python Web Crawlers

Setting Up the Environment

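Python 2 vs. Python 3: the two versions are broadly similar, and most of the differences are syntactic. Python 2 was only maintained until 2020 (it reached end of life on January 1, 2020), so this course uses Python 3 as its programming environment. To learn how the two versions differ, read "Should I learn the latest Python 3 or the older Python 2?".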
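A few of these differences can be seen in a short sketch. It runs under Python 3, and the comments note what Python 2 would do instead. The variable names are illustrative only.

```python
import sys

# print is a function in Python 3; in Python 2 it was a statement (print "hi").
print("running on Python", sys.version_info.major)

# / is true division in Python 3; Python 2 would return 3 for 7 / 2.
true_div = 7 / 2
floor_div = 7 // 2

# Text (str) and binary data (bytes) are separate types in Python 3.
# Web pages arrive as bytes and must be decoded before use as text.
raw = "爬虫".encode("utf-8")   # str -> bytes
text = raw.decode("utf-8")     # bytes -> str
```

The str/bytes split matters most for crawlers, because the encoding of a downloaded page has to be handled explicitly.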

Downloading Python: see "Overview of the Python Development Environment".

Downloading Other Tools

Chrome browser: download from https://www.google.cn/chrome/browser/desktop/index.html

Recommended Chrome extensions: https://www.zhihu.com/question/20054116

PyCharm IDE: download from http://www.jetbrains.com/pycharm/download/#section=windows

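Creating the First Example

The urllib package: read the official urllib documentation to learn how to use the urllib library that ships with Python.

Fetching the Baidu home page with urllib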

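import urllib.request
# Import urllib.request

f = urllib.request.urlopen('http://www.baidu.com/')
# Open the URL; this returns a file-like response object

data = f.read(500)
# Read the first 500 bytes (not characters) once; a second read() would continue from byte 500

print(data)
# Print the raw bytes
print("Decoded as utf-8:")
print(data.decode('utf-8', errors='replace'))
# errors='replace' avoids a UnicodeDecodeError if the 500-byte cut falls inside a multi-byte character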
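The urllib example can also be wrapped in a small helper. This is a minimal sketch, not the course's own code: `fetch_text`, `limit` and `fallback` are hypothetical names, and the helper assumes the server declares its charset in the Content-Type header (falling back to UTF-8 when it does not).

```python
import urllib.request


def fetch_text(url, limit=500, fallback="utf-8"):
    """Return up to `limit` bytes of the response body, decoded to str.

    fetch_text, limit and fallback are illustrative names, not part of urllib.
    """
    # The with-block closes the response even if reading fails.
    with urllib.request.urlopen(url) as response:
        # Charset declared in the Content-Type header, or the fallback if absent.
        charset = response.headers.get_content_charset() or fallback
        data = response.read(limit)
    # errors="replace" keeps a multi-byte character cut at the byte limit
    # from raising UnicodeDecodeError.
    return data.decode(charset, errors="replace")
```

Calling `fetch_text('http://www.baidu.com/')` should return roughly the same text as the example above, with the response closed automatically.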

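The Requests package: requests is a third-party Python library, so it has to be installed first. Read "Installing Requests" to learn how to install it quickly.

Then read the Requests "Quickstart" to learn how to use the library.

Fetching the Baidu home page with Requests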

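import requests      # Import the requests library

r = requests.get('https://www.baidu.com/')
# Fetch the page with requests.get

print(r.text)      # Print the result; Chinese text may be garbled if the encoding was guessed wrongly

r.encoding = 'utf-8'      # Override the encoding

print(r.text)       # Print the result again, now decoded as UTF-8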
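Why the encoding has to be overridden can be shown without any network access. When a server sends a text content type with no charset, Requests typically falls back to ISO-8859-1, and decoding UTF-8 bytes with that codec yields garbled text. The sketch below reproduces the effect with plain strings; the sample sentence and variable names are illustrative.

```python
original = "百度一下，你就知道"

# The bytes a server would send for a UTF-8 encoded page.
raw = original.encode("utf-8")

# Decoding with the wrong codec does not fail, but every byte becomes
# its own character, so the Chinese text turns into mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec recovers the original text.
correct = raw.decode("utf-8")
```

Setting `r.encoding = 'utf-8'` tells Requests to take the second path when it builds `r.text`.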

The Three Steps of a Crawler

Step 1: fetch the data with requests

1. Import requests

2. Use requests.get to fetch the page source

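import requests

# Send a browser-like User-Agent, since some sites may reject the default one used by requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
url = 'https://book.douban.com/subject/1084336/comments/'
# .text is the response body as a str, so r holds the page source
r = requests.get(url, headers=headers).text

print(r)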
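The request above assumes the server answers quickly and successfully. A slightly more defensive version might look like the sketch below. `fetch_html`, `DEFAULT_HEADERS`, the shortened User-Agent string and the 10-second timeout are all illustrative choices rather than part of the course code.

```python
import requests

# Hypothetical defaults; replace the User-Agent with a full browser string if needed.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, headers=DEFAULT_HEADERS, timeout=10):
    """Fetch a page and return its source as text, or raise on an HTTP error."""
    # timeout stops the call from hanging forever on an unresponsive server.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of silently
    # handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

With this helper, `r = fetch_html(url)` would replace the `requests.get(...).text` line, and a blocked or failed request would surface as an exception instead of an unexpected page.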

Step 2: parse the data with BeautifulSoup4

1. Import bs4

2. Parse the page source

3. Find the data

4. Print it in a for loop

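from bs4 import BeautifulSoup

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(r, 'lxml')
# Find every <span> whose class is "short"
pattern = soup.find_all('span', 'short')
for item in pattern:
    print(item.string)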
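The parsing step can be tried offline on a small piece of markup. The sketch below uses a made-up HTML sample (not the real Douban page) and the built-in html.parser in place of lxml, and it shows one pitfall of `.string`: it is None when a tag contains nested tags, whereas `get_text()` still returns the text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded comments page.
sample_html = """
<ul>
  <li><span class="short">Worth reading twice.</span></li>
  <li><span class="short">Great <b>book</b></span></li>
  <li><span class="other">not a comment</span></li>
</ul>
"""

# html.parser ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(sample_html, "html.parser")

# class_="short" is the keyword form of find_all('span', 'short').
spans = soup.find_all("span", class_="short")

# .string is None when a tag has more than one child node.
strings = [span.string for span in spans]

# get_text() joins all nested text; the separator keeps words apart.
texts = [span.get_text(" ", strip=True) for span in spans]
```

If some comments on a real page come back as None, switching from `item.string` to `item.get_text()` is one possible fix.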

Step 3: save the data with pandas

1. Import pandas

2. Create a list object

3. Write it out with to_csv

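import pandas

# Collect the text of each comment in a list
comments = []
for item in pattern:
    comments.append(item.string)
df = pandas.DataFrame(comments)
# Write the DataFrame to comments.csv in the current directory
df.to_csv('comments.csv')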
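The saving step can be refined a little. The sketch below works on a hand-written list (the sample comments and the column name `comment` are assumptions), drops empty entries, names the column, and leaves out the numeric index that to_csv writes by default.

```python
import pandas as pd

# Hypothetical sample data; None stands for a comment whose text was missing.
comments = ["Worth reading twice.", None, "值得一读"]

# A named column gives the CSV a meaningful header instead of "0".
df = pd.DataFrame({"comment": comments})

# Drop rows where the comment is missing.
df = df.dropna()

# With no path, to_csv returns the CSV as a string; index=False omits
# the row-number column.
csv_text = df.to_csv(index=False)
```

To write a file instead, pass a path, for example `df.to_csv('comments.csv', index=False, encoding='utf-8-sig')`. The utf-8-sig encoding adds a byte order mark, which helps some spreadsheet programs display Chinese text correctly.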

The Complete Crawler

import pandas
import requests
from bs4 import BeautifulSoup

# Step 1: fetch the page source
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
url = 'https://book.douban.com/subject/1084336/comments/'
r = requests.get(url, headers=headers).text

# Step 2: parse the page and find every <span class="short">
soup = BeautifulSoup(r, 'lxml')
pattern = soup.find_all('span', 'short')
for item in pattern:
    print(item.string)

# Step 3: collect the comments and save them to comments.csv
comments = []
for item in pattern:
    comments.append(item.string)
df = pandas.DataFrame(comments)
df.to_csv('comments.csv')

Result of running the code:

PS:

All the code used in this course can be viewed or downloaded from the instructor Xiaowai's GitHub: https://github.com/zhangslob