PySpark in Action: Creating an RDD from a Python Script

Install findspark

pip3 install findspark
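
As a quick sanity check (a minimal sketch; it assumes SPARK_HOME is already set as described in the next step), confirm that findspark can locate Spark and that pyspark becomes importable:

import findspark
findspark.init()            # raises an error if no Spark installation is found
import pyspark              # only importable after findspark.init() succeeds
print(pyspark.__version__)  # e.g. 3.1.3 for the installation used here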

Set environment variables

vi /etc/profile

Append the following line and save:

export SPARK_HOME=/root/wmtools/spark-3.1.3-bin-hadoop2.7

Reload the profile so the variable takes effect:

source /etc/profile
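
Alternatively, findspark.init() accepts the Spark installation path directly instead of relying on the SPARK_HOME variable. A minimal sketch, using the same path as above:

import findspark
findspark.init("/root/wmtools/spark-3.1.3-bin-hadoop2.7")  # pass the Spark home explicitly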

Run the Spark code

python3 demo20.py

demo20.py contains the following:

# pip install findspark
# fixes: ModuleNotFoundError: No module named 'pyspark'
import findspark
findspark.init()

#############################
from pyspark import SparkConf, SparkContext

# Create the SparkContext
conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Build an RDD from a Python list, split each line into words,
# map each word to (word, 1), then sum the counts per word
rdd = sc.parallelize(["hello world", "hello spark"])
rdd2 = rdd.flatMap(lambda line: line.split(" "))
rdd3 = rdd2.map(lambda word: (word, 1))
rdd4 = rdd3.reduceByKey(lambda a, b: a + b)

# collect() and print() are needed to display the result, e.g.
# [('spark', 1), ('hello', 2), ('world', 1)]
print(rdd4.collect())

# Stop the SparkContext so it is not created more than once
sc.stop()

Output:

[('world', 1), ('hello', 2), ('spark', 1)]
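
The same word count also works on an RDD built from a text file instead of an in-memory list. A minimal sketch, assuming a hypothetical local file words.txt with one sentence per line:

import findspark
findspark.init()

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("WordCountFile").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.textFile("words.txt")  # hypothetical input file; one RDD element per line
counts = (rdd.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
print(counts.collect())
sc.stop()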