Differences between revisions of "丝路通: Distributed Crawler Task Assignment"
From CloudWiki
Revision as of 14:31, 17 September 2020 (Thursday)
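Task splitting
The original table of category pages to crawl is split into many small files, and each file is handled as one small task.
DHgate (敦煌网)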
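task_header = '../../task/dh_task/dh_task_'  # path prefix of the task files

def assign_task():
    task_content = ""  # lines buffered for the current task file
    count = 0  # number of lines read so far
    num = 0    # number of task files written so far
    # Columns: category name, category level, parent category level
    with open('dh_sub_category.csv', "rt") as fp:  # open the category CSV file
        for line in fp:  # a file object can be iterated line by line
            count += 1
            task_content += line
            if count % 100 == 0:  # every 100 lines become one task file
                num += 1
                with open(task_header + str(num) + ".csv", "w", encoding="utf-8") as fw:
                    fw.write(task_content)
                task_content = ""
    # Append the leftover lines (fewer than 100) to the most recent task file;
    # this is dh_task_0.csv if the input had fewer than 100 lines.
    with open(task_header + str(num) + ".csv", "a", encoding="utf-8") as fw:
        fw.write(task_content)

if __name__ == '__main__':
    assign_task()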
Alibaba
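task_header = '../../task/ali_task/ali_task_'  # path prefix of the task files

def assign_task():
    task_content = ""  # lines buffered for the current task file
    count = 0  # number of lines read so far
    num = 0    # number of task files written so far
    # Columns: category name, category level, parent category level
    with open('alibaba_categary.csv', "rt") as fp:  # open the category CSV file
        for line in fp:  # a file object can be iterated line by line
            count += 1
            task_content += line
            if count % 100 == 0:  # every 100 lines become one task file
                num += 1
                with open(task_header + str(num) + ".csv", "w", encoding="utf-8") as fw:
                    fw.write(task_content)
                task_content = ""
    # Leftover lines: with at most 50 task files they are appended to the
    # last file; with more than 50 they go into a new file numbered num + 1.
    if num <= 50:
        target = task_header + str(num) + ".csv"
    else:
        target = task_header + str(num + 1) + ".csv"
    with open(target, "a", encoding="utf-8") as fw:
        fw.write(task_content)

if __name__ == '__main__':
    assign_task()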
Made-in-China (中国制造网)
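task_header = '../../task/mc_task/mc_task_'  # path prefix of the task files

def assign_task():
    task_content = ""  # lines buffered for the current task file
    count = 0  # number of lines read so far
    num = 0    # number of task files written so far
    with open('made_in_china_sub_cat.csv', "rt") as fp:  # open the category CSV file
        for line in fp:  # a file object can be iterated line by line
            count += 1
            task_content += line
            if count % 200 == 0:  # every 200 lines become one task file
                num += 1
                with open(task_header + str(num) + ".csv", "w", encoding="utf-8") as fw:
                    fw.write(task_content)
                task_content = ""
    # Leftover lines: with at most 100 task files they are appended to the
    # last file; with more than 100 they go into a new file numbered num + 1.
    if num <= 100:
        target = task_header + str(num) + ".csv"
    else:
        target = task_header + str(num + 1) + ".csv"
    with open(target, "a", encoding="utf-8") as fw:
        fw.write(task_content)

if __name__ == '__main__':
    assign_task()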
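The three scripts above differ only in the input file, the output prefix, and the chunk size, so they could be folded into one helper. The following is a minimal sketch of that idea, not the project's own code: the name `split_tasks` and its parameters are hypothetical, the input is assumed to be UTF-8, and, unlike the scripts above, leftover lines always go into their own smaller task file instead of being appended to the last full one.

```python
from pathlib import Path


def split_tasks(source_csv, task_prefix, chunk_size):
    """Split source_csv into task files of at most chunk_size lines each.

    Files are named <task_prefix>1.csv, <task_prefix>2.csv, and so on.
    Returns the list of task file paths that were written.
    """
    written = []  # paths of the task files created so far
    buffer = []   # lines collected for the next task file

    def flush():
        # Write the buffered lines to the next numbered task file.
        path = Path(f"{task_prefix}{len(written) + 1}.csv")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("".join(buffer), encoding="utf-8")
        written.append(path)
        buffer.clear()

    with open(source_csv, "rt", encoding="utf-8") as fp:
        for line in fp:
            buffer.append(line)
            if len(buffer) == chunk_size:
                flush()
    if buffer:  # leftover lines get their own, smaller task file
        flush()
    return written
```

Under these assumptions, the DHgate split would be `split_tasks('dh_sub_category.csv', '../../task/dh_task/dh_task_', 100)`, and the Made-in-China split would pass its own file, prefix, and a chunk size of 200.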
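Building the task table
Install MySQL
- Installing Python 3 on CentOS 7 (this project uses Python 3.6)
- Installing MySQL on CentOS 7
Create the data table
Model design and resource import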
MariaDB [crawler]> CREATE TABLE IF NOT EXISTS `task`(
    -> `id` INT UNSIGNED AUTO_INCREMENT,
    -> `site_title` VARCHAR(100) NOT NULL,
    -> `task_name` VARCHAR(40) NOT NULL,
    -> `task_status` INT UNSIGNED NOT NULL,
    -> `availability zones` VARCHAR(60) NOT NULL,
    -> `start_date` DATE,
    -> PRIMARY KEY ( `id` ))ENGINE=InnoDB DEFAULT CHARSET=utf8;
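Once the table exists, each task file produced by the splitting step needs a row in it. The sketch below shows one way this might be done from Python; it is an illustration under stated assumptions, not the project's code. The helper names, the status value `0` for "pending", and the zone label are all hypothetical, since the page does not define what `task_status` or `availability zones` contain. The cursor is assumed to come from a DB-API driver that accepts `%s` placeholders, such as PyMySQL.

```python
import datetime

# Hypothetical status code: the page does not define the values of task_status.
STATUS_PENDING = 0

# Column names match the CREATE TABLE statement; `id` is filled by AUTO_INCREMENT.
INSERT_TASK_SQL = (
    "INSERT INTO `task` "
    "(`site_title`, `task_name`, `task_status`, `availability zones`, `start_date`) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def build_task_rows(site_title, task_names, zone, start_date=None):
    """Return one parameter tuple per task, in the column order of INSERT_TASK_SQL."""
    start_date = start_date or datetime.date.today()
    return [(site_title, name, STATUS_PENDING, zone, start_date) for name in task_names]


def register_tasks(cursor, site_title, task_names, zone, start_date=None):
    """Insert the tasks through a DB-API cursor and return how many rows were sent."""
    rows = build_task_rows(site_title, task_names, zone, start_date)
    cursor.executemany(INSERT_TASK_SQL, rows)
    return len(rows)
```

Passing the values as parameters rather than formatting them into the SQL string leaves quoting to the driver. The caller is still responsible for committing the transaction on the connection that owns the cursor.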