sterlingalston · January 29, 2019 21:06
diff --git a/Tutorial - Spark in Python.ipynb b/Tutorial - Spark in Python.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial - Spark in Python\n",
    "This notebook is designed to introduce some basic concepts and help get you familiar with using Spark in Python.   \n",
    "\n",
    "In this notebook, we will load and explore the mtcars dataset. Specifically, this tutorial covers:\n",
    "\n",
    "1. Loading data in memory\n",
    "1. Creating SQLContext\n",
    "1. Creating Spark DataFrame\n",
    "1. Group data by columns \n",
    "1. Operating on columns\n",
    "1. Running SQL Queries from a Spark DataFrame\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading in a DataFrame\n",
    "To create a Spark DataFrame we load an external DataFrame, called `mtcars`. This DataFrame includes 32 observations on 11 variables.\n",
    "\n",
    "[, 1]\tmpg\tMiles/(US) --> gallon  \n",
    "[, 2]\tcyl\t--> Number of cylinders  \n",
    "[, 3]\tdisp\t--> Displacement (cu.in.)  \n",
    "[, 4]\thp -->\tGross horsepower  \n",
    "[, 5]\tdrat -->\tRear axle ratio  \n",
    "[, 6]\twt -->\tWeight (lb/1000)  \n",
    "[, 7]\tqsec -->\t1/4 mile time  \n",
    "[, 8]\tvs -->\tV/S  \n",
    "[, 9]\tam -->\tTransmission (0 = automatic, 1 = manual)  \n",
    "[,10]\tgear -->\tNumber of forward gears  \n",
    "[,11]\tcarb -->\tNumber of carburetors  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "mtcars = pd.read_csv('https://ibm.box.com/shared/static/f1dhhjnzjwxmy2c1ys2whvrgz05d1pui.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "mtcars.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize SQLContext\n",
    "To work with dataframes we need a SQLContext which is created using `SQLContext(sc)`. SQLContext uses SparkContext which has been already created, named `sc`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sqlContext = SQLContext(sc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Spark DataFrames\n",
    "With SQLContext and a loaded local DataFrame, we create a Spark DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sdf = sqlContext.createDataFrame(mtcars) \n",
    "sdf.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Displays the content of the DataFrame \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sdf.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sdf.select('mpg').show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering Data\n",
    "Filter the DataFrame to only retain rows with `mpg` less than 18"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sdf.filter(sdf['mpg'] < 18).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Operating on Columns\n",
    "SparkR also provides a number of functions that can directly applied to columns for data processing and aggregation. The example below shows the use of basic arithmetic functions to convert lb to metric ton."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sdf.withColumn('wtTon', sdf['wt'] * 0.45).show(6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sdf.show(6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouping, Aggregation\n",
    "Spark DataFrames support a number of commonly used functions to aggregate data after grouping. For example we can compute the average weight of cars by their cylinders as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sdf.groupby(['cyl'])\\\n",
    ".agg({\"wt\": \"AVG\"})\\\n",
    ".show(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We can also sort the output from the aggregation to get the most common cars\n",
    "car_counts = sdf.groupby(['cyl'])\\\n",
    ".agg({\"wt\": \"count\"})\\\n",
    ".sort(\"count(wt)\", ascending=False)\\\n",
    ".show(5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running SQL Queries from Spark DataFrames\n",
    "A Spark DataFrame can also be registered as a temporary table in Spark SQL and registering a DataFrame as a table allows you to run SQL queries over its data. The `sql` function enables applications to run SQL queries programmatically and returns the result as a DataFrame.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Register this DataFrame as a table.\n",
    "sdf.registerTempTable(\"cars\")\n",
    "\n",
    "# SQL statements can be run by using the sql method\n",
    "highgearcars = sqlContext.sql(\"SELECT gear FROM cars WHERE cyl >= 4 AND cyl <= 9\")\n",
    "highgearcars.show(6)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "NOTE: This tutorial draws heavily from the original \n",
    "[Spark Quick Start Guide](http://spark.apache.org/docs/latest/quick-start.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Want to learn more?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Free courses on [Big Data University](https://bigdatauniversity.com/courses/what-is-spark/?utm_source=tutorial-spark-python&utm_medium=dswb&utm_campaign=bdu):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://bigdatauniversity.com/courses/what-is-spark/?utm_source=tutorial-spark-python&utm_medium=dswb&utm_campaign=bdu\"><img src = \"https://ibm.box.com/shared/static/pxb2n9airrzrfola21zj5ssj7kndcp8m.png\"> </a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Authors:</h3>\n",
    "<br>\n",
    "<a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">\n",
    "    <div class=\"teacher-image\" style=\"    float: left;\n",
    "        width: 115px;\n",
    "        height: 115px;\n",
    "        margin-right: 10px;\n",
    "        margin-bottom: 10px;\n",
    "        border: 1px solid #CCC;\n",
    "        padding: 3px;\n",
    "        border-radius: 3px;\n",
    "        text-align: center;\"><img class=\"alignnone wp-image-2258 \" src=\"https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg\" alt=\"Saeed Aghabozorgi\" width=\"178\" height=\"178\"/>\n",
    "    </div>\n",
    "</a>\n",
    "\n",
    "<h4>Saeed Aghabozorgi</h4>\n",
    "<p><a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a>, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients' ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.</p>\n",
    "\n",
    "<br>\n",
    "\n",
    "<a href=\"https://ca.linkedin.com/in/polonglin\">\n",
    "    <div class=\"teacher-image\" style=\"    float: left;\n",
    "        width: 115px;\n",
    "        height: 115px;\n",
    "        margin-right: 10px;\n",
    "        margin-bottom: 10px;\n",
    "        border: 1px solid #CCC;\n",
    "        padding: 3px;\n",
    "        border-radius: 3px;\n",
    "        text-align: center;\"><img class=\"alignnone size-medium wp-image-2177\" src=\"https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg\" alt=\"Polong Lin\" width=\"300\" height=\"300\"/>\n",
    "    </div>\n",
    "</a>\n",
    "<h4>Polong Lin</h4>\n",
    "<p>\n",
    "<a href=\"https://ca.linkedin.com/in/polonglin\">Polong Lin</a> is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker in conferences and meetups, and holds a M.Sc. in Cognitive Psychology.</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr>\n",
    "Copyright &copy; 2016 [Big Data University](https://bigdatauniversity.com). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Tutorial - Spark in Python\n",
	"This notebook is designed to introduce some basic concepts and help get you familiar with using Spark in Python. \n",
	"\n",
	"In this notebook, we will load and explore the mtcars dataset. Specifically, this tutorial covers:\n",
	"\n",
	"1. Loading data in memory\n",
	"1. Creating SQLContext\n",
	"1. Creating Spark DataFrame\n",
	"1. Group data by columns \n",
	"1. Operating on columns\n",
	"1. Running SQL Queries from a Spark DataFrame\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Loading in a DataFrame\n",
	"To create a Spark DataFrame we load an external DataFrame, called `mtcars`. This DataFrame includes 32 observations on 11 variables.\n",
	"\n",
	"[, 1]\tmpg\tMiles/(US) --> gallon \n",
	"[, 2]\tcyl\t--> Number of cylinders \n",
	"[, 3]\tdisp\t--> Displacement (cu.in.) \n",
	"[, 4]\thp -->\tGross horsepower \n",
	"[, 5]\tdrat -->\tRear axle ratio \n",
	"[, 6]\twt -->\tWeight (lb/1000) \n",
	"[, 7]\tqsec -->\t1/4 mile time \n",
	"[, 8]\tvs -->\tV/S \n",
	"[, 9]\tam -->\tTransmission (0 = automatic, 1 = manual) \n",
	"[,10]\tgear -->\tNumber of forward gears \n",
	"[,11]\tcarb -->\tNumber of carburetors "
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import pandas as pd\n",
	"mtcars = pd.read_csv('https://ibm.box.com/shared/static/f1dhhjnzjwxmy2c1ys2whvrgz05d1pui.csv')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false,
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"mtcars.head()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Initialize SQLContext\n",
	"To work with dataframes we need a SQLContext which is created using `SQLContext(sc)`. SQLContext uses SparkContext which has been already created, named `sc`. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"sqlContext = SQLContext(sc)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Creating Spark DataFrames\n",
	"With SQLContext and a loaded local DataFrame, we create a Spark DataFrame:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"sdf = sqlContext.createDataFrame(mtcars) \n",
	"sdf.printSchema()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Displays the content of the DataFrame \n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false,
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"sdf.show(5)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Selecting columns"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false,
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"sdf.select('mpg').show(5)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Filtering Data\n",
	"Filter the DataFrame to only retain rows with `mpg` less than 18"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"sdf.filter(sdf['mpg'] < 18).show(5)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Operating on Columns\n",
	"SparkR also provides a number of functions that can directly applied to columns for data processing and aggregation. The example below shows the use of basic arithmetic functions to convert lb to metric ton."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"sdf.withColumn('wtTon', sdf['wt'] * 0.45).show(6)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"sdf.show(6)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Grouping, Aggregation\n",
	"Spark DataFrames support a number of commonly used functions to aggregate data after grouping. For example we can compute the average weight of cars by their cylinders as shown below:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"sdf.groupby(['cyl'])\\\n",
	".agg({\"wt\": \"AVG\"})\\\n",
	".show(5)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# We can also sort the output from the aggregation to get the most common cars\n",
	"car_counts = sdf.groupby(['cyl'])\\\n",
	".agg({\"wt\": \"count\"})\\\n",
	".sort(\"count(wt)\", ascending=False)\\\n",
	".show(5)\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Running SQL Queries from Spark DataFrames\n",
	"A Spark DataFrame can also be registered as a temporary table in Spark SQL and registering a DataFrame as a table allows you to run SQL queries over its data. The `sql` function enables applications to run SQL queries programmatically and returns the result as a DataFrame.\n",
	"\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Register this DataFrame as a table.\n",
	"sdf.registerTempTable(\"cars\")\n",
	"\n",
	"# SQL statements can be run by using the sql method\n",
	"highgearcars = sqlContext.sql(\"SELECT gear FROM cars WHERE cyl >= 4 AND cyl <= 9\")\n",
	"highgearcars.show(6)\n",
	" "
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"collapsed": true
	},
	"source": [
	"NOTE: This tutorial draws heavily from the original \n",
	"[Spark Quick Start Guide](http://spark.apache.org/docs/latest/quick-start.html)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Want to learn more?"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Free courses on [Big Data University](https://bigdatauniversity.com/courses/what-is-spark/?utm_source=tutorial-spark-python&utm_medium=dswb&utm_campaign=bdu):"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<a href=\"https://bigdatauniversity.com/courses/what-is-spark/?utm_source=tutorial-spark-python&utm_medium=dswb&utm_campaign=bdu\"><img src = \"https://ibm.box.com/shared/static/pxb2n9airrzrfola21zj5ssj7kndcp8m.png\"> </a>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<h3>Authors:</h3>\n",
	"<br>\n",
	"<a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">\n",
	" <div class=\"teacher-image\" style=\" float: left;\n",
	" width: 115px;\n",
	" height: 115px;\n",
	" margin-right: 10px;\n",
	" margin-bottom: 10px;\n",
	" border: 1px solid #CCC;\n",
	" padding: 3px;\n",
	" border-radius: 3px;\n",
	" text-align: center;\"><img class=\"alignnone wp-image-2258 \" src=\"https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg\" alt=\"Saeed Aghabozorgi\" width=\"178\" height=\"178\"/>\n",
	" </div>\n",
	"</a>\n",
	"\n",
	"<h4>Saeed Aghabozorgi</h4>\n",
	"<p><a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a>, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients' ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.</p>\n",
	"\n",
	"<br>\n",
	"\n",
	"<a href=\"https://ca.linkedin.com/in/polonglin\">\n",
	" <div class=\"teacher-image\" style=\" float: left;\n",
	" width: 115px;\n",
	" height: 115px;\n",
	" margin-right: 10px;\n",
	" margin-bottom: 10px;\n",
	" border: 1px solid #CCC;\n",
	" padding: 3px;\n",
	" border-radius: 3px;\n",
	" text-align: center;\"><img class=\"alignnone size-medium wp-image-2177\" src=\"https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg\" alt=\"Polong Lin\" width=\"300\" height=\"300\"/>\n",
	" </div>\n",
	"</a>\n",
	"<h4>Polong Lin</h4>\n",
	"<p>\n",
	"<a href=\"https://ca.linkedin.com/in/polonglin\">Polong Lin</a> is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker in conferences and meetups, and holds a M.Sc. in Cognitive Psychology.</p>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<hr>\n",
	"Copyright © 2016 [Big Data University](https://bigdatauniversity.com). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 2",
	"language": "python",
	"name": "python2"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.11"
	},
	"widgets": {
	"state": {},
	"version": "1.1.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}
No results found