PySpark in Practice: Setting Up a Spark Environment on Linux


Preparation

Installing and Running Spark

Extract the package

mkdir /root/wmtools

cd /root/wmtools

tar -zxvf spark-2.4.8-bin-hadoop2.7.tgz -C ~/wmtools

Verify the installation

cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin

./spark-submit ../examples/src/main/python/pi.py

This computes an approximate value of π; the output looks like the following:

...
21/06/27 16:34:26 INFO DAGScheduler: Job 0 finished: reduce at /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin/../examples/src/main/python/pi.py:44, took 1.080727 s
Pi is roughly 3.145780
21/06/27 16:34:26 INFO SparkUI: Stopped Spark web UI at http://10.0.0.30:4040
21/06/27 16:34:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/06/27 16:34:26 INFO MemoryStore: MemoryStore cleared
21/06/27 16:34:26 INFO BlockManager: BlockManager stopped
21/06/27 16:34:26 INFO BlockManagerMaster: BlockManagerMaster stopped
21/06/27 16:34:26 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/06/27 16:34:26 INFO SparkContext: Successfully stopped SparkContext
21/06/27 16:34:27 INFO ShutdownHookManager: Shutdown hook called
21/06/27 16:34:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-12d3e207-636d-48df-ac07-1906b8ec0b83
21/06/27 16:34:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-12d3e207-636d-48df-ac07-1906b8ec0b83/pyspark-52f38583-70e0-4417-b37e-0880b1f319c3
21/06/27 16:34:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-3caa03ed-ee35-4b64-9827-493bd631a7a7
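
For reference, pi.py estimates π with a simple Monte Carlo method: random points are drawn in the square [-1, 1] × [-1, 1], and the fraction that falls inside the unit circle approximates π/4. A minimal sketch of the same idea (not the exact example file shipped with Spark) looks like this:

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def inside(_):
        # draw a random point in [-1, 1] x [-1, 1] and check whether it lies in the unit circle
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # distribute n samples over the partitions, count the hits, and scale up to pi
    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()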

Modify the configuration file

The default output is far too verbose, so adjust the logging configuration.

In the conf directory, rename log4j.properties.template to log4j.properties:

[root@localhost conf]# pwd
/root/wmtools/spark-2.4.8-bin-hadoop2.7/conf
[root@localhost conf]# ls
docker.properties.template  metrics.properties.template   spark-env.sh.template
fairscheduler.xml.template  slaves.template
log4j.properties.template   spark-defaults.conf.template
[root@localhost conf]# mv log4j.properties.template log4j.properties

In log4j.properties, change

log4j.rootCategory=INFO, console

to

log4j.rootCategory=ERROR, console

Run the spark-submit command from above again, and the output becomes much shorter:

21/06/27 16:44:14 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.0.30 instead (on interface ens33)
21/06/27 16:44:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/06/27 16:44:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Pi is roughly 3.131460

Note: the warning "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable" appears because Hadoop has not been installed and configured. In general, it does not affect the use of Spark.
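
Editing log4j.properties changes the default for every application. As a per-application alternative, the log level can also be lowered on a running SparkContext via setLogLevel; a minimal sketch (the appName here is arbitrary):

# sketch: suppress INFO/WARN output for a single application
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("QuietApp").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # from this point on, only ERROR and above are logged
# ... application code ...
spark.stop()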

Installing and Running PySpark

Run pyspark

cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin

./pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 2.7.5 (default, Aug  7 2019 00:51:29)
SparkSession available as 'spark'.
>>>
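
At the >>> prompt, the shell has already created a SparkSession (spark) and a SparkContext (sc), so they can be used directly, for example:

>>> sc.parallelize(range(10)).sum()
45
>>> spark.range(5).count()
5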

To exit:

>>> exit()

Switch the Python runtime used by pyspark

Although Python 3.7 is installed, the Spark shell still loads Python 2.7.5 on startup.

To switch the Python runtime to 3.7, a little configuration is needed:

cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin

Open the pyspark launcher script with vi and add:

export PYSPARK_PYTHON=/usr/local/Python3/bin/python3

Now run pyspark again:

cd /root/wmtools/spark-2.4.8-bin-hadoop2.7/bin

./pyspark

 
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
      /_/

Using Python version 3.7.5 (default, May 25 2021 14:04:16)
SparkSession available as 'spark'.
>>>
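
As an extra check, the interpreter version can be queried from inside the shell and should match the banner above:

>>> import sys
>>> sys.version.split()[0]
'3.7.5'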

At this point, a Spark environment with Python 3.x is ready.

Note: Spark 2.4.8 is not yet fully compatible with Python 3.8, so installing Python 3.8 is currently not recommended.