PySpark in Action: Saving DataFrame Data as CSV

From CloudWiki

Introduction

This example reads employee records from a JSON file into a DataFrame and then saves the DataFrame as CSV.

user.json

 {"deptId":"01","name":"张三","gender":"男","age":32,"salary":5000},
 {"deptId":"01","name":"李四","gender":"男","age":33,"salary":6000}, 
 {"deptId":"01","name":"王五","gender":"女","age":38,"salary":5500}, 
 {"deptId":"02","name":"Jack","gender":"男","age":42,"salary":7000}, 
 {"deptId":"02","name":"Smith","gender":"女","age":27,"salary":6500}, 
 {"deptId":"02","name":"Lily","gender":"女","age":45,"salary":9500}

Code


import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("RDD Demo") \
        .getOrCreate()
sc = spark.sparkContext
#############################################
# Note: the JSON file must contain one object per line (JSON Lines); it must not be wrapped in a [] array
df = spark.read.format('json') \
        .load('user.json')
# Available save modes include overwrite and append
# user.csv is not a single file but a directory of part files
df.write.csv("user.csv", mode="append")
spark.read.csv("user.csv").show()
##############################################
"""
+---+---+---+-----+----+
|_c0|_c1|_c2|  _c3| _c4|
+---+---+---+-----+----+
| 32| 01| 男| 张三|5000|
| 33| 01| 男| 李四|6000|
| 38| 01| 女| 王五|5500|
| 42| 02| 男| Jack|7000|
| 27| 02| 女|Smith|6500|
| 45| 02| 女| Lily|9500|
+---+---+---+-----+----+
"""

Output

+---+---+---+-----+----+
|_c0|_c1|_c2|  _c3| _c4|
+---+---+---+-----+----+
| 32| 01| 男| 张三|5000|
| 33| 01| 男| 李四|6000|
| 38| 01| 女| 王五|5500|
| 42| 02| 男| Jack|7000|
| 27| 02| 女|Smith|6500|
| 45| 02| 女| Lily|9500|
+---+---+---+-----+----+
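
Because the CSV was written without a header, the column names are lost and the read-back DataFrame gets the default names _c0 through _c4; as the output shows, the columns follow the alphabetical field order age, deptId, gender, name, salary that Spark infers from the JSON. A minimal sketch of keeping the names, reusing spark and df from above with a hypothetical output directory:

# Writing with header=True stores the column names in each part file, and reading with
# header=True restores them, so the result keeps age/deptId/gender/name/salary
# instead of _c0.._c4. The directory name user_with_header.csv is hypothetical.
df.write.csv("user_with_header.csv", mode="overwrite", header=True)
spark.read.csv("user_with_header.csv", header=True).show()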