PySpark实战:认识数据格式

来自CloudWiki
跳转至: 导航搜索

介绍

Kaggle网站主要为开发商和数据科学家提供举办机器学习竞赛、托管数据、编写和分享代码的平台

本章选用泰坦尼克号幸存者预测这个数据集作为研究对象,

这是该网站上参赛人数最多的竞赛之一

该竞赛的网址为 https://www.kaggle.com/c/titanic

机器学习步骤

  • 对数据进行观察,并认识数据格式
  • 对数据进行一些统计描述和探索分析
  • 对数据特征进行选择,删除一些无关的数据,或者派生新的数据等。
  • 对数据格式进行变换和处理,以提高预测效果
  • 从众多机器学习算法中,选择一个或者多个合适的算法进行构建

Titanic训练集

PassengerId: integer,乘客ID,对预测无帮助

Survived: integer (nullable = true),是否幸存

Pclass: integer (nullable = true),社会经济状态,1代表upper,2代表middle,3代表lower

Name: string (nullable = true) ,乘客姓名

Sex: string (nullable = true),乘客性别

Age: double (nullable = true),乘客年龄

SibSp: integer (nullable = true),兄弟姐妹及配偶个数

Parch: integer (nullable = true) 父母或子女的个数

Ticket: string (nullable = true) 船票号,对预测无帮助

Fare: double (nullable = true) 船票价

Cabin: string (nullable = true) 乘客所在舱位

Embarked: string (nullable = true) 乘客登船口岸

同时,测试集比训练集少一个Survived字段

PySpark代码


import findspark
findspark.init()
##############################################
from pyspark.sql import SparkSession
from pyspark.sql.context import SQLContext
spark = SparkSession.builder \
        .master("local[*]") \
        .appName("PySpark ML") \
        .getOrCreate()
sc = spark.sparkContext
#############################################
print("Titanic train.csv Info")
df_train = spark.read.csv('./data/titanic-train.csv',header=True,inferSchema=True).cache()
df_train.printSchema()
print(df_train.count(),len(df_train.columns))
df_train.show()
print("#############################################")
print("Titanic test.csv Info")
df_test = spark.read.csv('./data/titanic-test.csv',header=True,inferSchema=True).cache()
df_test.printSchema()
print(df_test.count(),len(df_test.columns))
df_test.show()
#############################################
sc.stop()

输出

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

891 12
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|          330877| 8.4583| null|       Q|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|           17463|51.8625|  E46|       S|
|          8|       0|     3|Palsson, Master. ...|  male| 2.0|    3|    1|          349909| 21.075| null|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333| null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|          237736|30.0708| null|       C|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1|         PP 9549|   16.7|   G6|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|          113783|  26.55| C103|       S|
|         13|       0|     3|Saundercock, Mr. ...|  male|20.0|    0|    0|       A/5. 2151|   8.05| null|       S|
|         14|       0|     3|Andersson, Mr. An...|  male|39.0|    1|    5|          347082| 31.275| null|       S|
|         15|       0|     3|Vestrom, Miss. Hu...|female|14.0|    0|    0|          350406| 7.8542| null|       S|
|         16|       1|     2|Hewlett, Mrs. (Ma...|female|55.0|    0|    0|          248706|   16.0| null|       S|
|         17|       0|     3|Rice, Master. Eugene|  male| 2.0|    4|    1|          382652| 29.125| null|       Q|
|         18|       1|     2|Williams, Mr. Cha...|  male|null|    0|    0|          244373|   13.0| null|       S|
|         19|       0|     3|Vander Planke, Mr...|female|31.0|    1|    0|          345763|   18.0| null|       S|
|         20|       1|     3|Masselmani, Mrs. ...|female|null|    0|    0|            2649|  7.225| null|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 20 rows

#############################################
Titanic test.csv Info
root
 |-- PassengerId: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

418 11
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|        892|     3|    Kelly, Mr. James|  male|34.5|    0|    0|          330911| 7.8292| null|       Q|
|        893|     3|Wilkes, Mrs. Jame...|female|47.0|    1|    0|          363272|    7.0| null|       S|
|        894|     2|Myles, Mr. Thomas...|  male|62.0|    0|    0|          240276| 9.6875| null|       Q|
|        895|     3|    Wirz, Mr. Albert|  male|27.0|    0|    0|          315154| 8.6625| null|       S|
|        896|     3|Hirvonen, Mrs. Al...|female|22.0|    1|    1|         3101298|12.2875| null|       S|
|        897|     3|Svensson, Mr. Joh...|  male|14.0|    0|    0|            7538|  9.225| null|       S|
|        898|     3|Connolly, Miss. Kate|female|30.0|    0|    0|          330972| 7.6292| null|       Q|
|        899|     2|Caldwell, Mr. Alb...|  male|26.0|    1|    1|          248738|   29.0| null|       S|
|        900|     3|Abrahim, Mrs. Jos...|female|18.0|    0|    0|            2657| 7.2292| null|       C|
|        901|     3|Davies, Mr. John ...|  male|21.0|    2|    0|       A/4 48871|  24.15| null|       S|
|        902|     3|    Ilieff, Mr. Ylio|  male|null|    0|    0|          349220| 7.8958| null|       S|
|        903|     1|Jones, Mr. Charle...|  male|46.0|    0|    0|             694|   26.0| null|       S|
|        904|     1|Snyder, Mrs. John...|female|23.0|    1|    0|           21228|82.2667|  B45|       S|
|        905|     2|Howard, Mr. Benjamin|  male|63.0|    1|    0|           24065|   26.0| null|       S|
|        906|     1|Chaffee, Mrs. Her...|female|47.0|    1|    0|     W.E.P. 5734| 61.175|  E31|       S|
|        907|     2|del Carlo, Mrs. S...|female|24.0|    1|    0|   SC/PARIS 2167|27.7208| null|       C|
|        908|     2|   Keane, Mr. Daniel|  male|35.0|    0|    0|          233734|  12.35| null|       Q|
|        909|     3|   Assaf, Mr. Gerios|  male|21.0|    0|    0|            2692|  7.225| null|       C|
|        910|     3|Ilmakangas, Miss....|female|27.0|    1|    0|STON/O2. 3101270|  7.925| null|       S|
|        911|     3|"Assaf Khalil, Mr...|female|45.0|    0|    0|            2696|  7.225| null|       C|
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 20 rows