PySpark in Practice: Understanding the Data Format
From CloudWiki
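Introduction
Kaggle is a platform where developers and data scientists can host machine learning competitions, host datasets, and write and share code.
This chapter uses the Titanic survivor prediction dataset as its subject.
It is one of the competitions with the most participants on the site.
The competition is at https://www.kaggle.com/c/titanic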
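Machine learning steps
- Inspect the data and get to know its format.
- Compute descriptive statistics and explore the data.
- Select features: drop irrelevant data, or derive new data, and so on.
- Transform and process the data format to improve prediction quality.
- From the many machine learning algorithms, choose one or more suitable ones to build the model.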
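Titanic training set
PassengerId: integer (nullable = true), passenger ID; not useful for prediction
Survived: integer (nullable = true), whether the passenger survived (0 = no, 1 = yes)
Pclass: integer (nullable = true), ticket class, used as a proxy for socio-economic status: 1 = upper, 2 = middle, 3 = lower
Name: string (nullable = true), passenger name
Sex: string (nullable = true), passenger sex
Age: double (nullable = true), passenger age
SibSp: integer (nullable = true), number of siblings and spouses aboard
Parch: integer (nullable = true), number of parents or children aboard
Ticket: string (nullable = true), ticket number; not useful for prediction
Fare: double (nullable = true), ticket fare
Cabin: string (nullable = true), the passenger's cabin
Embarked: string (nullable = true), port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
The test set has the same fields except Survived, which is the label to be predicted.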
PySpark code
# Locate the local Spark installation so that pyspark can be imported.
import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession

# Start a local session that uses all available cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("PySpark ML") \
    .getOrCreate()
#############################################
print("Titanic train.csv Info")
# header=True takes column names from the first row;
# inferSchema=True lets Spark infer each column's type from the data.
df_train = spark.read.csv('./data/titanic-train.csv', header=True, inferSchema=True).cache()
df_train.printSchema()
# Number of rows and number of columns.
print(df_train.count(), len(df_train.columns))
df_train.show()
print("#############################################")
print("Titanic test.csv Info")
df_test = spark.read.csv('./data/titanic-test.csv', header=True, inferSchema=True).cache()
df_test.printSchema()
print(df_test.count(), len(df_test.columns))
df_test.show()
#############################################
# Stop the session and its underlying SparkContext.
spark.stop()
Output
root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

891 12
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|          330877| 8.4583| null|       Q|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|           17463|51.8625|  E46|       S|
|          8|       0|     3|Palsson, Master. ...|  male| 2.0|    3|    1|          349909| 21.075| null|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333| null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|          237736|30.0708| null|       C|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1|         PP 9549|   16.7|   G6|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|          113783|  26.55| C103|       S|
|         13|       0|     3|Saundercock, Mr. ...|  male|20.0|    0|    0|       A/5. 2151|   8.05| null|       S|
|         14|       0|     3|Andersson, Mr. An...|  male|39.0|    1|    5|          347082| 31.275| null|       S|
|         15|       0|     3|Vestrom, Miss. Hu...|female|14.0|    0|    0|          350406| 7.8542| null|       S|
|         16|       1|     2|Hewlett, Mrs. (Ma...|female|55.0|    0|    0|          248706|   16.0| null|       S|
|         17|       0|     3|Rice, Master. Eugene|  male| 2.0|    4|    1|          382652| 29.125| null|       Q|
|         18|       1|     2|Williams, Mr. Cha...|  male|null|    0|    0|          244373|   13.0| null|       S|
|         19|       0|     3|Vander Planke, Mr...|female|31.0|    1|    0|          345763|   18.0| null|       S|
|         20|       1|     3|Masselmani, Mrs. ...|female|null|    0|    0|            2649|  7.225| null|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 20 rows

#############################################
Titanic test.csv Info
root
 |-- PassengerId: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

418 11
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|        892|     3|    Kelly, Mr. James|  male|34.5|    0|    0|          330911| 7.8292| null|       Q|
|        893|     3|Wilkes, Mrs. Jame...|female|47.0|    1|    0|          363272|    7.0| null|       S|
|        894|     2|Myles, Mr. Thomas...|  male|62.0|    0|    0|          240276| 9.6875| null|       Q|
|        895|     3|    Wirz, Mr. Albert|  male|27.0|    0|    0|          315154| 8.6625| null|       S|
|        896|     3|Hirvonen, Mrs. Al...|female|22.0|    1|    1|         3101298|12.2875| null|       S|
|        897|     3|Svensson, Mr. Joh...|  male|14.0|    0|    0|            7538|  9.225| null|       S|
|        898|     3|Connolly, Miss. Kate|female|30.0|    0|    0|          330972| 7.6292| null|       Q|
|        899|     2|Caldwell, Mr. Alb...|  male|26.0|    1|    1|          248738|   29.0| null|       S|
|        900|     3|Abrahim, Mrs. Jos...|female|18.0|    0|    0|            2657| 7.2292| null|       C|
|        901|     3|Davies, Mr. John ...|  male|21.0|    2|    0|       A/4 48871|  24.15| null|       S|
|        902|     3|    Ilieff, Mr. Ylio|  male|null|    0|    0|          349220| 7.8958| null|       S|
|        903|     1|Jones, Mr. Charle...|  male|46.0|    0|    0|             694|   26.0| null|       S|
|        904|     1|Snyder, Mrs. John...|female|23.0|    1|    0|           21228|82.2667|  B45|       S|
|        905|     2|Howard, Mr. Benjamin|  male|63.0|    1|    0|           24065|   26.0| null|       S|
|        906|     1|Chaffee, Mrs. Her...|female|47.0|    1|    0|     W.E.P. 5734| 61.175|  E31|       S|
|        907|     2|del Carlo, Mrs. S...|female|24.0|    1|    0|   SC/PARIS 2167|27.7208| null|       C|
|        908|     2|   Keane, Mr. Daniel|  male|35.0|    0|    0|          233734|  12.35| null|       Q|
|        909|     3|   Assaf, Mr. Gerios|  male|21.0|    0|    0|            2692|  7.225| null|       C|
|        910|     3|Ilmakangas, Miss....|female|27.0|    1|    0|STON/O2. 3101270|  7.925| null|       S|
|        911|     3|"Assaf Khalil, Mr...|female|45.0|    0|    0|            2696|  7.225| null|       C|
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 20 rows