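PySpark in Practice: Renaming DataFrame Columns and Generating New Columns
From CloudWiki
Introduction
A DataFrame object also supports renaming columns (withColumnRenamed),
and it can generate new columns from a given column expression (withColumn, selectExpr).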
Code
import findspark
findspark.init()

##############################################
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("RDD Demo") \
    .getOrCreate()
sc = spark.sparkContext
#############################################

# Sample rows: department id, name, gender, age, salary
a = [
    ('01', '张三', '男', 32, 5000),
    ('01', '李四', '男', 33, 6000),
    ('01', '王五', '女', 38, 5500),
    ('02', 'Jack', '男', 42, 7000),
    ('02', 'Smith', '女', 27, 6500),
    ('02', 'Lily', '女', 45, 9500)
]
rdd = sc.parallelize(a)
peopleDf = spark.createDataFrame(rdd,
    "deptId:string,name:string,gender:string,age:int,salary:int")

# Convert each row to a JSON string
print(peopleDf.toJSON().collect())
peopleDf.summary().show()

# Add an age2 column, then rename the name column to "姓名"
peopleDf.withColumn("age2", peopleDf.age + 1) \
    .withColumnRenamed("name", "姓名") \
    .show()

# Select computed expressions
peopleDf.selectExpr("age+1", "salary*1.2").show()
Output
JSON format
['{"deptId":"01","name":"张三","gender":"男","age":32,"salary":5000}',
 '{"deptId":"01","name":"李四","gender":"男","age":33,"salary":6000}',
 '{"deptId":"01","name":"王五","gender":"女","age":38,"salary":5500}',
 '{"deptId":"02","name":"Jack","gender":"男","age":42,"salary":7000}',
 '{"deptId":"02","name":"Smith","gender":"女","age":27,"salary":6500}',
 '{"deptId":"02","name":"Lily","gender":"女","age":45,"salary":9500}']
SUMMARY
+-------+------------------+----+------+------------------+------------------+
|summary|            deptId|name|gender|               age|            salary|
+-------+------------------+----+------+------------------+------------------+
|  count|                 6|   6|     6|                 6|                 6|
|   mean|               1.5|null|  null|36.166666666666664| 6583.333333333333|
| stddev|0.5477225575051661|null|  null| 6.735478206235001|1594.2605391424158|
|    min|                01|Jack|    女|                27|              5000|
|    25%|               1.0|null|  null|                32|              5500|
|    50%|               1.0|null|  null|                33|              6000|
|    75%|               2.0|null|  null|                42|              7000|
|    max|                02|王五|    男|                45|              9500|
+-------+------------------+----+------+------------------+------------------+
Output with the new column
+------+-----+------+---+------+----+
|deptId| 姓名|gender|age|salary|age2|
+------+-----+------+---+------+----+
|    01| 张三|    男| 32|  5000|  33|
|    01| 李四|    男| 33|  6000|  34|
|    01| 王五|    女| 38|  5500|  39|
|    02| Jack|    男| 42|  7000|  43|
|    02|Smith|    女| 27|  6500|  28|
|    02| Lily|    女| 45|  9500|  46|
+------+-----+------+---+------+----+
Output of the expressions
+---------+--------------+
|(age + 1)|(salary * 1.2)|
+---------+--------------+
|       33|        6000.0|
|       34|        7200.0|
|       39|        6600.0|
|       43|        8400.0|
|       28|        7800.0|
|       46|       11400.0|
+---------+--------------+