丝路通: Cleaning Crawler Data

From CloudWiki

Raw crawler data

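The data first produced by the crawler looks like this. It is not normalized: the price field (the eighth field) holds a range such as "US $5.10 - 5.32 / Piece" rather than a single number: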

DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/albu_2374808775_00-1.260x260/wildlebend-miss-belt-waist-thinner-and-shaper.jpg,Wildlebend Miss Belt Waist Thinner and Shaper Corset - 2 Sizes Thin - BLACK HB000005S9FE,https://www.dhgate.com/product/wildlebend-miss-belt-waist-thinner-and-shaper/539295710.html?d1_page_num=1#s1-0-1;searl|3904672528:1,US $5.10 - 5.32 / Piece,hepsi_fashion,https://www.dhgate.com/store/21394830?dspm=pcen.sp.storerun.1.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-1,1,20200830
DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/260x260s/f2-albu-g10-M00-D9-71-rBVaVl7pm5OAEGrmAAg-BbCZ0ec590.jpg/lyra-large-size-women-039-s-half-sleeve-dress.jpg,Lyra Large Size Women's Half Sleeve Dress Gray Ship from Turkey L1621 2599,https://www.dhgate.com/product/lyra-large-size-women-s-half-sleeve-dress/548531772.html?d1_page_num=1#s1-1-1;searl|3904672528:2,US $17.41 - 18.14 / Piece,pianoluce,https://www.dhgate.com/store/21356604?dspm=pcen.sp.storerun.2.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-2,1,20200830
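
The price field can be broken down on its own before a whole row is processed. The following is a minimal sketch, not part of the original program: the helper name parse_price_range and the regular expression are illustrative, and the single-price form ("US $9.99 / Piece") is an assumption, since the sample rows only show ranges.

```python
import re

# Matches "US $5.10 - 5.32 / Piece"; the second number is optional so that a
# single price such as "US $9.99 / Piece" (an assumed form) is also accepted.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?")


def parse_price_range(text):
    """Return (min_price, max_price) as floats, or None if no price is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    low = float(match.group(1))
    # With a single price, use it as both ends of the range
    high = float(match.group(2)) if match.group(2) else low
    return low, high


print(parse_price_range("US $5.10 - 5.32 / Piece"))    # (5.1, 5.32)
print(parse_price_range("US $17.41 - 18.14 / Piece"))  # (17.41, 18.14)
```

Returning None for an unparseable field lets the caller decide whether to skip the row or log it, instead of stopping on the first bad row.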

Data cleaning program

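goods_file = 'dh_goods_data.csv'  # cleaned rows are appended to this file


def write_rows(rows):
    # Append a batch of cleaned rows to the output file
    with open(goods_file, "a", encoding="utf-8") as fw:
        fw.write(rows)


def read_goods_file():
    count = 0
    goods_list = ""  # buffer of cleaned rows that have not been written yet
    # Open the raw CSV file (assumed here to be UTF-8, like the output file)
    with open('DG_gate_20200830153515.csv', "rt", encoding="utf-8") as fp:
        for line in fp:  # a file object can be iterated line by line
            count += 1
            line_info = line.strip()
            # A plain split assumes that no field contains a comma
            data = line_info.split(',')

            # data[7] looks like "US $5.10 - 5.32 / Piece"
            price = data[7].split("/")[0]
            price_range = price.split("$")[1]
            min_p, max_p = price_range.split(" - ")
            # float() parses the numbers without executing the text as code, unlike eval()
            min_p = float(min_p)
            max_p = float(max_p)
            print(min_p, max_p)
            avg_p = round((min_p + max_p) / 2, 2)

            line_info += "," + str(avg_p) + "\n"  # append the average price to the end of the row
            goods_list += line_info

            if count % 10 == 0:  # write to disk every 10 rows
                write_rows(goods_list)
                goods_list = ""

    # Write the rows left over when the row count is not a multiple of 10
    if goods_list:
        write_rows(goods_list)
    return count


if __name__ == '__main__':
    count = read_goods_file()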
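
Splitting each line on "," works for the sample rows, but it would shift the field positions if a product title ever contained a comma. As a hedged alternative, the sketch below uses Python's standard csv module, which understands quoted fields. The names PRICE_COLUMN, add_average_price and clean are illustrative, and the sample row is made up (shortened URLs, a title with a comma) purely to show the quoting.

```python
import csv
import io

PRICE_COLUMN = 7  # position of the price field in the sample rows


def add_average_price(rows):
    """Yield each row with the average of its price range appended."""
    for row in rows:
        # "US $5.10 - 5.32 / Piece" -> "5.10 - 5.32 "
        price_range = row[PRICE_COLUMN].split("/")[0].split("$")[1]
        min_p, max_p = (float(p) for p in price_range.split(" - "))
        yield row + [round((min_p + max_p) / 2, 2)]


def clean(src, dst):
    """Read CSV rows from src and write the extended rows to dst."""
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerows(add_average_price(csv.reader(src)))


# Made-up row: the quoted title contains a comma
sample = ('DG_gate_,1,Apparel,Apparel from Turkey,img.jpg,"Dress, Gray",'
          'item.html,US $5.10 - 5.32 / Piece,shop,store.html,1,20200830\n')
out = io.StringIO()
clean(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

With real files, src and dst would be objects returned by open(..., newline=""), as the csv module documentation recommends.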
    

Cleaned data

After cleaning, the average price of each product is computed from its price range and appended to the end of its row:

DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/albu_2374808775_00-1.260x260/wildlebend-miss-belt-waist-thinner-and-shaper.jpg,Wildlebend Miss Belt Waist Thinner and Shaper Corset - 2 Sizes Thin - BLACK HB000005S9FE,https://www.dhgate.com/product/wildlebend-miss-belt-waist-thinner-and-shaper/539295710.html?d1_page_num=1#s1-0-1;searl|3904672528:1,US $5.10 - 5.32 / Piece,hepsi_fashion,https://www.dhgate.com/store/21394830?dspm=pcen.sp.storerun.1.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-1,1,20200830,5.21
DG_gate_,1,Apparel,Apparel from Turkey,https://www.dhresource.com/260x260s/f2-albu-g10-M00-D9-71-rBVaVl7pm5OAEGrmAAg-BbCZ0ec590.jpg/lyra-large-size-women-039-s-half-sleeve-dress.jpg,Lyra Large Size Women's Half Sleeve Dress Gray Ship from Turkey L1621 2599,https://www.dhgate.com/product/lyra-large-size-women-s-half-sleeve-dress/548531772.html?d1_page_num=1#s1-1-1;searl|3904672528:2,US $17.41 - 18.14 / Piece,pianoluce,https://www.dhgate.com/store/21356604?dspm=pcen.sp.storerun.2.stfdxXd0GCBMZUZ02FDm&resource_id=#listing_store-2,1,20200830,17.77
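
One detail in the second row is worth noting: the exact mean of 17.41 and 18.14 is 17.775, and the cleaned row shows 17.77 rather than 17.78. This is most likely because binary floating-point numbers cannot represent these values exactly, so round() does not see an exact tie. If half-up rounding is wanted, one option is the standard decimal module. The sketch below is an alternative to the float arithmetic in the program, and the helper name average_price is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP


def average_price(min_text, max_text):
    """Average two prices given as strings, rounded half-up to 2 decimals."""
    mean = (Decimal(min_text) + Decimal(max_text)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(average_price("5.10", "5.32"))    # 5.21
print(average_price("17.41", "18.14"))  # 17.78
```

Whether 17.77 or 17.78 is the better value depends on how the averages are used later. The point is only that the rounding rule should be a deliberate choice.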