PySpark in Practice: Conditional Selection on DataFrames

From CloudWiki

Introduction

Reading JSON data with Spark

user.json

 {"deptId":"01","name":"张三","gender":"男","age":32,"salary":5000},
 {"deptId":"01","name":"李四","gender":"男","age":33,"salary":6000}, 
 {"deptId":"01","name":"王五","gender":"女","age":38,"salary":5500}, 
 {"deptId":"02","name":"Jack","gender":"男","age":42,"salary":7000}, 
 {"deptId":"02","name":"Smith","gender":"女","age":27,"salary":6500}, 
 {"deptId":"02","name":"Lily","gender":"女","age":45,"salary":9500}
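The sample file is laid out with one object per line rather than as a bracketed JSON array. As a rough plain-Python illustration of that layout (not Spark's own reader), the sketch below parses two of the records line by line. Stripping the trailing commas is an assumption made here so that the standard json module accepts each line, and parse_json_lines is a hypothetical helper name.

```python
import json

# Two records copied from user.json, kept in the same one-object-per-line layout.
SAMPLE = '''
{"deptId":"01","name":"张三","gender":"男","age":32,"salary":5000},
{"deptId":"02","name":"Smith","gender":"女","age":27,"salary":6500}
'''

def parse_json_lines(text):
    """Parse text holding one JSON object per line (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        # Drop surrounding whitespace and the trailing comma used in the sample.
        line = line.strip().rstrip(",").strip()
        if line:
            records.append(json.loads(line))
    return records

records = parse_json_lines(SAMPLE)
# Sorted field names, for comparison with the column order in the output section.
columns = sorted(records[0])
print(columns)
print([r["name"] for r in records])
```

Sorting the field names gives age, deptId, gender, name, salary, which is the same order as the columns in the output section of this page.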


Code


import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("RDD Demo") \
        .getOrCreate()
sc = spark.sparkContext
#############################################
# Note: user.json must hold one JSON object per line; do not wrap the objects in [] as a JSON array
df = spark.read.format('json') \
        .load('user.json')
# Print the column types
print(df.dtypes)
df.show()

# Pivot: one row per deptId, one column per gender value, cells hold the summed salary
df2 = df.groupBy("deptId") \
        .pivot("gender") \
        .sum("salary")

df2.show()

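# Conditional selection
# between() is inclusive; inside select() it yields a boolean column and keeps every row
df.select("name",df.salary.between(6000,9500)).show()
# where() keeps only the rows whose name starts with "Smi"
df.select("name","age").where(df.name.like("Smi%")).show()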

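  • Pivot: df2 = df.groupBy("deptId") \
       .pivot("gender") \
       .sum("salary")
The rows are first grouped by deptId; the distinct values of the gender column then become columns, and salary is summed for each deptId and gender pair, which produces the pivot table.
  • Conditional selection: df.select("name","age").where(df.name.like("Smi%"))
where() keeps only the rows whose name starts with "Smi", and select() returns their name and age. By contrast, df.select("name",df.salary.between(6000,9500)) does not filter anything: between() is inclusive at both ends, and inside select() it only adds a boolean column showing whether each salary lies in the range.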
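The code above uses two kinds of condition: between(6000,9500) inside select(), and like("Smi%") inside where(). As a rough plain-Python model of what those conditions test (not Spark's implementation), the sketch below applies them to the names, ages and salaries from user.json. The helpers between and like are hypothetical stand-ins written for this example; like handles only the % and _ wildcards and ignores escape characters.

```python
import re

# (name, age, salary) taken from user.json.
people = [
    ("张三", 32, 5000),
    ("李四", 33, 6000),
    ("王五", 38, 5500),
    ("Jack", 42, 7000),
    ("Smith", 27, 6500),
    ("Lily", 45, 9500),
]

def between(value, lower, upper):
    """Inclusive range test: lower <= value <= upper."""
    return lower <= value <= upper

def like(text, pattern):
    """SQL-style LIKE: % matches any run of characters, _ matches exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), text, flags=re.DOTALL) is not None

# Used inside select(): every row is kept and gains a boolean value.
flags = [(name, between(salary, 6000, 9500)) for name, _, salary in people]

# Used inside where(): only the matching rows are kept.
matches = [(name, age) for name, age, _ in people if like(name, "Smi%")]

print(flags)
print(matches)
```

The first list still has six entries, one per person, while the second holds only the Smith row. This mirrors the two result tables in the output section.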

Output

Column types

[('age', 'bigint'), ('deptId', 'string'), ('gender', 'string'), ('name', 'string'), ('salary', 'bigint')]

All data

+---+------+------+-----+------+
|age|deptId|gender| name|salary|
+---+------+------+-----+------+
| 32|    01|    男| 张三|  5000|
| 33|    01|    男| 李四|  6000|
| 38|    01|    女| 王五|  5500|
| 42|    02|    男| Jack|  7000|
| 27|    02|    女|Smith|  6500|
| 45|    02|    女| Lily|  9500|
+---+------+------+-----+------+


Pivot table

+------+-----+-----+
|deptId|   女|   男|
+------+-----+-----+
|    01| 5500|11000|
|    02|16000| 7000|
+------+-----+-----+
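The figures in the pivot table can be checked by hand. The following is a minimal plain-Python sketch of the same group, pivot and sum steps, assuming the six rows from user.json; pivot_sum is a hypothetical helper written for this illustration and is not part of PySpark.

```python
from collections import defaultdict

# Rows copied from user.json, reduced to the three fields the pivot uses.
rows = [
    {"deptId": "01", "gender": "男", "salary": 5000},
    {"deptId": "01", "gender": "男", "salary": 6000},
    {"deptId": "01", "gender": "女", "salary": 5500},
    {"deptId": "02", "gender": "男", "salary": 7000},
    {"deptId": "02", "gender": "女", "salary": 6500},
    {"deptId": "02", "gender": "女", "salary": 9500},
]

def pivot_sum(rows, group_key, pivot_key, value_key):
    """Sum value_key for every (group_key, pivot_key) pair (hypothetical helper)."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[group_key]][row[pivot_key]] += row[value_key]
    # Convert to plain dicts so the result prints and compares cleanly.
    return {group: dict(cells) for group, cells in table.items()}

pivoted = pivot_sum(rows, "deptId", "gender", "salary")
for dept_id in sorted(pivoted):
    print(dept_id, pivoted[dept_id])
```

In this sketch a department with no rows for some gender would simply lack that key, whereas Spark shows null in the corresponding cell.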

Conditional selection

+-----+---------------------------------------+
| name|((salary >= 6000) AND (salary <= 9500))|
+-----+---------------------------------------+
| 张三|                                  false|
| 李四|                                   true|
| 王五|                                  false|
| Jack|                                   true|
|Smith|                                   true|
| Lily|                                   true|
+-----+---------------------------------------+
+-----+---+
| name|age|
+-----+---+
|Smith| 27|
+-----+---+