PySpark in Action: Setting Up a Spark Environment on Linux
Preparation
Installing and Running Spark
Extract the package
mkdir /root/wmtools
cd wmtools
tar -zxvf spark-2.4.8-bin-hadoop2.7.tgz -C ~/wmtools
Verify the installation
cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin
./spark-submit ../examples/src/main/python/pi.py
This example estimates the value of pi; the output looks like this:
...
21/06/27 16:34:26 INFO DAGScheduler: Job 0 finished: reduce at /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin/../examples/src/main/python/pi.py:44, took 1.080727 s
Pi is roughly 3.145780
21/06/27 16:34:26 INFO SparkUI: Stopped Spark web UI at http://10.0.0.30:4040
21/06/27 16:34:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/06/27 16:34:26 INFO MemoryStore: MemoryStore cleared
21/06/27 16:34:26 INFO BlockManager: BlockManager stopped
21/06/27 16:34:26 INFO BlockManagerMaster: BlockManagerMaster stopped
21/06/27 16:34:26 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/06/27 16:34:26 INFO SparkContext: Successfully stopped SparkContext
21/06/27 16:34:27 INFO ShutdownHookManager: Shutdown hook called
21/06/27 16:34:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-12d3e207-636d-48df-ac07-1906b8ec0b83
21/06/27 16:34:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-12d3e207-636d-48df-ac07-1906b8ec0b83/pyspark-52f38583-70e0-4417-b37e-0880b1f319c3
21/06/27 16:34:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-3caa03ed-ee35-4b64-9827-493bd631a7a7
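For reference, the bundled pi.py estimates pi with a simple Monte Carlo simulation over an RDD: random points are sampled in the unit square, and the fraction falling inside the unit circle approximates pi/4. The snippet below is a simplified sketch of that idea against the public PySpark API, not the exact bundled script; the sample count and partition number are arbitrary.

# Simplified Monte Carlo pi estimate (a sketch, not the bundled pi.py)
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-sketch").getOrCreate()
n = 100000  # number of random samples (arbitrary)

def inside(_):
    # Sample one point in the unit square; report whether it falls in the unit circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), 2).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()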
Modify the configuration file
The default output is very verbose, so let's adjust the logging configuration.
In the conf directory, rename log4j.properties.template to log4j.properties:
[root@localhost conf]# pwd
/root/wmtools/spark-2.4.8-bin-hadoop2.7/conf
[root@localhost conf]# ls
docker.properties.template   metrics.properties.template   spark-env.sh.template
fairscheduler.xml.template   slaves.template
log4j.properties.template    spark-defaults.conf.template
[root@localhost conf]# mv log4j.properties.template log4j.properties
In log4j.properties, change
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
Run the same spark-submit command again, and the output becomes much shorter:
21/06/27 16:44:14 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.0.30 instead (on interface ens33)
21/06/27 16:44:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/06/27 16:44:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Pi is roughly 3.131460
Note: the warning "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable" appears because Hadoop is not installed and configured. In general, it does not affect the use of Spark.
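If you prefer not to edit log4j.properties, the log level can also be lowered per application from code via SparkContext.setLogLevel. A minimal sketch (the application name here is made up for illustration):

# Lower the log level for this application only, without touching log4j.properties
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-logs").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")   # suppress INFO/WARN messages from this point on
print(spark.range(10).count())            # prints 10, without the INFO chatter
spark.stop()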
Installing and Running PySpark
Run pyspark
cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin
./pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 2.7.5 (default, Aug 7 2019 00:51:29)
SparkSession available as 'spark'.
>>>
To exit:
>>> exit()
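Before going further, you can sanity-check the shell with the prebuilt sc (SparkContext) and spark (SparkSession) objects it creates for you, for example:

>>> # Quick sanity check using the prebuilt SparkContext (sc) and SparkSession (spark)
>>> sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * x).collect()
[1, 4, 9, 16, 25]
>>> spark.range(100).count()
100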
Modify the pyspark runtime environment
Although Python 3.7 is installed, pyspark still loads Python 2.7.5 when it starts.
To have Spark run Python tasks with 3.7, a small configuration change is needed:
cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin
vi pyspark and add:
export PYSPARK_PYTHON=/usr/local/Python3/bin/python3
Now run pyspark again:
cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin
./pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 3.7.5 (default, May 25 2021 14:04:16)
SparkSession available as 'spark'.
>>>
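You can also confirm the interpreter version from inside the shell:

>>> # Verify which Python interpreter the shell is running on
>>> import sys
>>> sys.version_info[:2]
(3, 7)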
At this point, a Spark environment with Python 3.x is up and running.
Note: Spark 2.4.8 is not yet fully compatible with Python 3.8, so installing Python 3.8 is not recommended for now.