PySpark in Action: The sortBy Operation
From CloudWiki
Introduction
sortBy is a transformation with the signature
sortBy(keyfunc, ascending=True, numPartitions=None)
and provides flexible sorting.
It sorts the elements of an RDD by the key computed by keyfunc and returns a new RDD; ascending controls the sort direction, and numPartitions sets the number of partitions of the result.
Code
import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("RDD Demo") \
    .getOrCreate()
sc = spark.sparkContext
#############################################
data = [('a', 6), ('f', 2), ('c', 7), ('d', 4), ('e', 5)]

# Sort by key (first element of each tuple), ascending by default
rdd2 = sc.parallelize(data).sortBy(lambda x: x[0])
print(rdd2.collect())  # [('a', 6), ('c', 7), ('d', 4), ('e', 5), ('f', 2)]

# Sort by value (second element), ascending
rdd3 = sc.parallelize(data).sortBy(lambda x: x[1])
print(rdd3.collect())  # [('f', 2), ('d', 4), ('e', 5), ('a', 6), ('c', 7)]

# Sort by value, descending (ascending=False)
rdd3 = sc.parallelize(data).sortBy(lambda x: x[1], False)
print(rdd3.collect())  # [('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]

# Same descending sort, but with the result spread over 2 partitions
rdd3 = sc.parallelize(data).sortBy(lambda x: x[1], False, 2)
print(rdd3.collect())  # [('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]
##############################################
sc.stop()
Output
[('a', 6), ('c', 7), ('d', 4), ('e', 5), ('f', 2)]
[('f', 2), ('d', 4), ('e', 5), ('a', 6), ('c', 7)]
[('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]
[('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]