PySpark in Practice: Dropping Columns from a DataFrame
From CloudWiki
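Introduction
In some scenarios a dataset carries columns that are not needed and should be removed.
On a DataFrame this is done with the drop operation, which returns a new DataFrame without the named columns.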
Code
import findspark
findspark.init()  # make the local Spark installation importable

from pyspark.sql import SparkSession

# Start a local, single-threaded Spark session
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("RDD Demo") \
    .getOrCreate()
sc = spark.sparkContext

# Sample rows: (deptId, name, gender, age, salary)
a = [
    ('01', '张三', '男', 32, 5000),
    ('01', '李四', '男', 33, 6000),
    ('01', '王五', '女', 38, 5500),
    ('02', 'Jack', '男', 42, 7000),
    ('02', 'Smith', '女', 27, 6500),
    ('02', 'Lily', '女', 45, 9500),
]
rdd = sc.parallelize(a)
peopleDf = spark.createDataFrame(
    rdd, "deptId:string,name:string,gender:string,age:int,salary:int")

# Drop the gender and age columns and print what remains
peopleDf.drop("gender", "age").show()
Result
+------+-----+------+
|deptId| name|salary|
+------+-----+------+
|    01| 张三|  5000|
|    01| 李四|  6000|
|    01| 王五|  5500|
|    02| Jack|  7000|
|    02|Smith|  6500|
|    02| Lily|  9500|
+------+-----+------+