PySpark in Practice: Renaming DataFrame Columns and Deriving New Columns

From CloudWiki

Introduction

A DataFrame also supports renaming columns, and it can derive new columns from a column expression that you supply.

Code


import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("RDD Demo") \
        .getOrCreate()
sc = spark.sparkContext
#############################################
a = [   ('01','张三', '男',32,5000),
        ('01','李四', '男',33,6000),
        ('01','王五', '女',38,5500),
        ('02','Jack', '男',42,7000),
        ('02','Smith', '女',27,6500),
        ('02','Lily', '女',45,9500)
]
rdd = sc.parallelize(a)
peopleDf = spark.createDataFrame(rdd,\
   "deptId:string,name:string,gender:string,age:int,salary:int")
# Convert each row to a JSON string
print(peopleDf.toJSON().collect())
peopleDf.summary().show()
# Add an age2 column (age + 1) and rename "name" to "姓名" ("name" in Chinese)
peopleDf.withColumn("age2",peopleDf.age+1) \
        .withColumnRenamed("name","姓名") \
        .show()
# Select computed columns using SQL expressions
peopleDf.selectExpr("age+1","salary*1.2").show()


Output

JSON format

['{"deptId":"01","name":"张三","gender":"男","age":32,"salary":5000}', '{"deptId":"01","name":"李四","gender":"男","age":33,"salary":6000}', '{"deptId":"01","name":"王五","gender":"女","age":38,"salary":5500}', '{"deptId":"02","name":"Jack","gender":"男","age":42,"salary":7000}', '{"deptId":"02","name":"Smith","gender":"女","age":27,"salary":6500}', '{"deptId":"02","name":"Lily","gender":"女","age":45,"salary":9500}']
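Each element returned by toJSON().collect() is a JSON string rather than a Python dict. The following is a minimal sketch of turning such strings back into dicts with the standard json module; the two sample strings are copied from the output above, and the variable names are illustrative only.

```python
import json

# Two of the strings from the output above; toJSON() yields one JSON document per row.
json_rows = [
    '{"deptId":"01","name":"张三","gender":"男","age":32,"salary":5000}',
    '{"deptId":"02","name":"Lily","gender":"女","age":45,"salary":9500}',
]

# Parse each string into a dict so the fields can be accessed by key.
records = [json.loads(s) for s in json_rows]

names = [r["name"] for r in records]
total_salary = sum(r["salary"] for r in records)
print(names, total_salary)
```

Parsing on the driver like this is only reasonable for small results, since collect() brings every row into local memory.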

SUMMARY

+-------+------------------+----+------+------------------+------------------+
|summary|            deptId|name|gender|               age|            salary|
+-------+------------------+----+------+------------------+------------------+
|  count|                 6|   6|     6|                 6|                 6|
|   mean|               1.5|null|  null|36.166666666666664| 6583.333333333333|
| stddev|0.5477225575051661|null|  null| 6.735478206235001|1594.2605391424158|
|    min|                01|Jack|    女|                27|              5000|
|    25%|               1.0|null|  null|                32|              5500|
|    50%|               1.0|null|  null|                33|              6000|
|    75%|               2.0|null|  null|                42|              7000|
|    max|                02|王五|    男|                45|              9500|
+-------+------------------+----+------+------------------+------------------+

Output with the new column

+------+-----+------+---+------+----+
|deptId| 姓名|gender|age|salary|age2|
+------+-----+------+---+------+----+
|    01| 张三|    男| 32|  5000|  33|
|    01| 李四|    男| 33|  6000|  34|
|    01| 王五|    女| 38|  5500|  39|
|    02| Jack|    男| 42|  7000|  43|
|    02|Smith|    女| 27|  6500|  28|
|    02| Lily|    女| 45|  9500|  46|
+------+-----+------+---+------+----+


Output of the expressions

+---------+--------------+
|(age + 1)|(salary * 1.2)|
+---------+--------------+
|       33|        6000.0|
|       34|        7200.0|
|       39|        6600.0|
|       43|        8400.0|
|       28|        7800.0|
|       46|       11400.0|
+---------+--------------+