丝路通: Crawler Data Cleaning
From CloudWiki
Revision as of 08:52, 26 September 2020 (Saturday)
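Initial crawler data
The raw data returned by the crawler looks like this. It is not normalized: the price field is a range rather than a single number: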
DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/albu_2374808775_00-1.260x260/wildlebend-miss-belt-waist-thinner-and-shaper.jpg,Wildlebend Miss Belt Waist Thinner and Shaper Corset - 2 Sizes Thin - BLACK HB000005S9FE,https://www.dhgate.com/product/wildlebend-miss-belt-waist-thinner-and-shaper/539295710.html?d1_page_num=1#s1-0-1;searl|3904672528:1,US $5.10 - 5.32 / Piece,hepsi_fashion,https://www.dhgate.com/store/21394830?dspm=pcen.sp.storerun.1.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-1,1,20200830
DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/260x260s/f2-albu-g10-M00-D9-71-rBVaVl7pm5OAEGrmAAg-BbCZ0ec590.jpg/lyra-large-size-women-039-s-half-sleeve-dress.jpg,Lyra Large Size Women's Half Sleeve Dress Gray Ship from Turkey L1621 2599,https://www.dhgate.com/product/lyra-large-size-women-s-half-sleeve-dress/548531772.html?d1_page_num=1#s1-1-1;searl|3904672528:2,US $17.41 - 18.14 / Piece,pianoluce,https://www.dhgate.com/store/21356604?dspm=pcen.sp.storerun.2.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-2,1,20200830
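The price field is the part that needs cleaning, so it may help to look at it in isolation first. The following is a minimal sketch of pulling the lower and upper bounds out of a string such as `US $5.10 - 5.32 / Piece`. The helper name `parse_price_range` and the regular expression are assumptions made for illustration and are not part of the original script. The sketch also assumes that a listing with a single price (no range) should be treated as a range whose two ends are equal.

```python
import re

# Hypothetical helper, not part of the original script.
# Assumes prices look like "US $5.10 - 5.32 / Piece" or "US $5.10 / Piece".
_PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?")


def parse_price_range(text):
    """Return (min_price, max_price) as floats, or None if no price is found."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    low = float(match.group(1))
    # A single price is treated as a range whose two ends coincide.
    high = float(match.group(2)) if match.group(2) else low
    return low, high


print(parse_price_range("US $5.10 - 5.32 / Piece"))
print(parse_price_range("US $17.41 / Piece"))
```

Returning `None` for an unparseable field lets the caller decide whether to skip the row or stop, instead of failing in the middle of a file.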
Data cleaning program
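goods_file = 'dh_goods_data.csv'  # output file for the cleaned data


def write_goods(goods_list):
    # Append the buffered lines to the output file
    with open(goods_file, "a", encoding="utf-8") as fw:
        fw.write(goods_list)


def read_goods_file():
    count = 0
    goods_list = ""  # buffer of cleaned lines not yet written
    # Open the raw CSV file; a file object can be iterated directly
    with open('DG_gate_20200830153515.csv', "rt") as fp:
        for line in fp:
            line_info = line.strip()
            if not line_info:
                continue  # skip blank lines
            count += 1
            data = line_info.split(',')
            # data[7] looks like "US $5.10 - 5.32 / Piece"
            price = data[7].split("/")[0]
            price_range = price.split("$")[1]
            min_p, max_p = price_range.split(" - ")
            # float() parses the number without evaluating the text as code
            min_p = float(min_p)
            max_p = float(max_p)
            print(min_p, max_p)
            avg_p = round((min_p + max_p) / 2, 2)
            goods_list += line_info + "," + str(avg_p) + "\n"  # append the average price to the end of each line
            if count % 10 == 0:  # flush the buffer every 10 lines
                write_goods(goods_list)
                goods_list = ""
    if goods_list:
        # Write the last batch when the line count is not a multiple of 10
        write_goods(goods_list)
    return count  # number of lines processed


if __name__ == '__main__':
    count = read_goods_file()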
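Splitting each line on `,` works for the sample rows, but it would shift the price out of position 7 if a product title ever contained a comma. As one possible alternative, here is a sketch that uses the standard `csv` module, which handles quoted fields. The function name `add_average_price`, the constant `PRICE_COLUMN`, and the demo row are all made up for illustration; the demo row is not real crawler output. The sketch assumes the price stays in column 7 and raises `ValueError` on a price it cannot parse.

```python
import csv
import io

PRICE_COLUMN = 7  # assumed position of the price field, as in the sample rows


def add_average_price(src, dst):
    """Copy CSV rows from src to dst, appending the mean of the price range.

    Returns the number of rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    written = 0
    for row in reader:
        if len(row) <= PRICE_COLUMN:
            continue  # skip blank or truncated rows
        # "US $10.00 - 20.00 / Piece" -> "10.00 - 20.00 "
        amount = row[PRICE_COLUMN].split("/")[0].split("$")[-1]
        bounds = [float(part) for part in amount.split("-")]
        average = round(sum(bounds) / len(bounds), 2)
        writer.writerow(row + [average])
        written += 1
    return written


# Made-up demo row (not real crawler output) with a comma inside the quoted title.
sample = ('DG_gate_,1,Apparel,Apparel from Turkey,img.jpg,"Dress, gray",'
          'item.html,US $10.00 - 20.00 / Piece,shop,store.html,1,20200830\n')
out = io.StringIO()
print(add_average_price(io.StringIO(sample), out))
print(out.getvalue(), end="")
```

Because the function takes file-like objects, the same code can be pointed at real files opened with `open(..., newline="")`, and it writes each row as it goes rather than buffering batches in a string.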
Cleaned data
DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/albu_2374808775_00-1.260x260/wildlebend-miss-belt-waist-thinner-and-shaper.jpg,Wildlebend Miss Belt Waist Thinner and Shaper Corset - 2 Sizes Thin - BLACK HB000005S9FE,https://www.dhgate.com/product/wildlebend-miss-belt-waist-thinner-and-shaper/539295710.html?d1_page_num=1#s1-0-1;searl|3904672528:1,US $5.10 - 5.32 / Piece,hepsi_fashion,https://www.dhgate.com/store/21394830?dspm=pcen.sp.storerun.1.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-1,1,20200830,5.21
DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/260x260s/f2-albu-g10-M00-D9-71-rBVaVl7pm5OAEGrmAAg-BbCZ0ec590.jpg/lyra-large-size-women-039-s-half-sleeve-dress.jpg,Lyra Large Size Women's Half Sleeve Dress Gray Ship from Turkey L1621 2599,https://www.dhgate.com/product/lyra-large-size-women-s-half-sleeve-dress/548531772.html?d1_page_num=1#s1-1-1;searl|3904672528:2,US $17.41 - 18.14 / Piece,pianoluce,https://www.dhgate.com/store/21356604?dspm=pcen.sp.storerun.2.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-2,1,20200830,17.77
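A quick way to sanity-check the cleaned file is to confirm that the appended last column really is the midpoint of the price range in the same row. The sketch below does this for the two price/average pairs shown above. The function `check_cleaned_row` and its `tolerance` parameter are hypothetical names introduced here. The tolerance of 0.01 is an assumption chosen to absorb rounding to two decimal places.

```python
def check_cleaned_row(fields, price_index=7, tolerance=0.01):
    """Return True if the last field is the midpoint of the row's price range."""
    # "US $5.10 - 5.32 / Piece" -> "5.10 - 5.32 "
    amount = fields[price_index].split("/")[0].split("$")[-1]
    bounds = [float(part) for part in amount.split("-")]
    midpoint = sum(bounds) / len(bounds)
    return abs(float(fields[-1]) - midpoint) <= tolerance


# Only the price and average columns of the two cleaned rows are reproduced,
# so the price sits at index 0 here instead of index 7.
rows = [
    ["US $5.10 - 5.32 / Piece", "5.21"],
    ["US $17.41 - 18.14 / Piece", "17.77"],
]
for row in rows:
    print(row[-1], check_cleaned_row(row, price_index=0))
```

For full rows split from the cleaned file, the default `price_index=7` matches the column layout of the sample data.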