PySpark in Action: The sortBy Operation

Introduction

sortBy is a transformation operator with the signature

sortBy(keyfunc, ascending=True, numPartitions=None)

It sorts the elements of an RDD by the key that keyfunc computes for each element and returns a new, sorted RDD. The ascending flag controls the sort direction, and numPartitions sets how many partitions the resulting RDD has, which together make sortBy a flexible sorting tool.
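
For reference, PySpark implements sortBy as a thin wrapper that keys each element, sorts by that key, and then drops the key again. The sketch below shows this equivalence; it assumes an already-created SparkContext named sc and is illustrative rather than a verbatim copy of the library source.

# sortBy(keyfunc) behaves like keyBy + sortByKey + values:
pairs = sc.parallelize([('a', 6), ('f', 2), ('c', 7)])
by_value = pairs.keyBy(lambda x: x[1]).sortByKey().values()
print(by_value.collect())  # [('f', 2), ('a', 6), ('c', 7)]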

Code

import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("RDD Demo") \
        .getOrCreate()
sc = spark.sparkContext
#############################################
data = [('a', 6), ('f', 2), ('c', 7), ('d', 4), ('e', 5)]
# Sort by the key (first element of each pair), ascending by default
rdd2 = sc.parallelize(data).sortBy(lambda x: x[0])
#[('a', 6), ('c', 7), ('d', 4), ('e', 5), ('f', 2)]
print(rdd2.collect())
# Sort by the value (second element of each pair)
rdd3 = sc.parallelize(data).sortBy(lambda x: x[1])
#[('f', 2), ('d', 4), ('e', 5), ('a', 6), ('c', 7)]
print(rdd3.collect())
# Sort by the value in descending order
rdd4 = sc.parallelize(data).sortBy(lambda x: x[1], ascending=False)
#[('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]
print(rdd4.collect())
# Same descending sort, distributed over 2 partitions
rdd5 = sc.parallelize(data).sortBy(lambda x: x[1], ascending=False, numPartitions=2)
#[('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]
print(rdd5.collect())
##############################################
sc.stop()

Output

[('a', 6), ('c', 7), ('d', 4), ('e', 5), ('f', 2)]

[('f', 2), ('d', 4), ('e', 5), ('a', 6), ('c', 7)]

[('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]

[('c', 7), ('a', 6), ('e', 5), ('d', 4), ('f', 2)]
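
Because keyfunc can return any comparable value, composite sorts are also possible. The snippet below is a minimal sketch (assuming an active SparkContext sc, created as above and used before sc.stop()) that sorts by value descending and, for ties, by key ascending, by negating the numeric field inside a tuple key:

data = [('a', 6), ('b', 6), ('f', 2), ('c', 7)]
# Tuple key: (-value) makes the value sort descending; the key breaks ties ascending
result = sc.parallelize(data).sortBy(lambda x: (-x[1], x[0]))
print(result.collect())
# [('c', 7), ('a', 6), ('b', 6), ('f', 2)]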