PySpark in Practice: Dropping Columns from a DataFrame

From CloudWiki

Introduction

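In some scenarios a DataFrame carries columns that are no longer needed.

They can be removed with the DataFrame's drop operation, which returns a new DataFrame without the named columns.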

Code


import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession
# Start (or reuse) a local Spark session with a single worker thread
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("RDD Demo") \
        .getOrCreate()
sc = spark.sparkContext
#############################################
# Sample rows: (deptId, name, gender, age, salary); '男' = male, '女' = female
a = [   ('01','张三', '男',32,5000),
        ('01','李四', '男',33,6000),
        ('01','王五', '女',38,5500),
        ('02','Jack', '男',42,7000),
        ('02','Smith', '女',27,6500),
        ('02','Lily', '女',45,9500),
]
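rdd = sc.parallelize(a)
# Build the DataFrame from the RDD; the schema is given as a "name:type" string
peopleDf = spark.createDataFrame(rdd,
   "deptId:string,name:string,gender:string,age:int,salary:int")

# drop() returns a new DataFrame without the listed columns; peopleDf is unchanged
peopleDf.drop("gender","age").show()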
# +------+-----+------+
# |deptId| name|salary|
# +------+-----+------+
# |    01| 张三|  5000|
# |    01| 李四|  6000|
# |    01| 王五|  5500|
# |    02| Jack|  7000|
# |    02|Smith|  6500|
# |    02| Lily|  9500|
# +------+-----+------+
##############################################


Result

+------+-----+------+
|deptId| name|salary|
+------+-----+------+
|    01| 张三|  5000|
|    01| 李四|  6000|
|    01| 王五|  5500|
|    02| Jack|  7000|
|    02|Smith|  6500|
|    02| Lily|  9500|
+------+-----+------+