Created
March 11, 2020 04:12
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "This is the last assignment for the Coursera course \"Advanced Machine Learning and Signal Processing\"\n", | |
| "\n", | |
| "Just execute all cells one after the other and you are done. Just note that in the last one you should update your email address (the one you've used for Coursera) and obtain a submission token; you get this from the programming assignment page directly on Coursera.\n", | |
| "\n", | |
| "Please fill in the sections labelled with \"###YOUR_CODE_GOES_HERE###\"\n", | |
| "\n", | |
| "The purpose of this assignment is to learn how feature engineering boosts model performance. You will apply a Discrete Fourier Transformation to the accelerometer sensor time series, thereby transforming the dataset from the time domain to the frequency domain. \n", | |
| "\n", | |
| "After that, you’ll use a classification algorithm of your choice to create a model and submit the new predictions to the grader. Done.\n", | |
| "\n", | |
| "Please make sure you run this notebook from an Apache Spark 2.3 notebook.\n", | |
| "\n", | |
| "So the first thing we need to ensure is that we are on the latest version of SystemML, which is 1.3.0 (as of 20th March 2019). Please use the code block below to check whether you are already on 1.3.0 or higher. 1.3 contains a necessary fix, which is why we are running against the SNAPSHOT build.\n" | |
| ] | |
| }, | |
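As a side note before diving in: the value of moving to the frequency domain can be seen with a tiny NumPy sketch (illustrative only, not part of the graded notebook). Two sinusoids with identical mean and variance are hard to tell apart sample-by-sample, but their FFT magnitude spectra expose the dominant frequency in a single bin.

```python
import numpy as np

n = np.arange(256)
slow = np.sin(2 * np.pi * 4 * n / 256)   # 4 cycles over the window
fast = np.sin(2 * np.pi * 32 * n / 256)  # 32 cycles over the window

# Each magnitude spectrum has one dominant bin, at the signal's frequency.
peak_slow = np.argmax(np.abs(np.fft.rfft(slow)))
peak_fast = np.argmax(np.abs(np.fft.rfft(fast)))
print(peak_slow, peak_fast)  # 4 32
```

A classifier fed these spectral features gets a near-trivial decision boundary, which is exactly the effect this assignment exploits.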
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "\n", | |
| "\n", | |
| "If you are below version 1.3.0, or you got the error message \"No module named 'systemml'\", please execute the next two code blocks and then\n", | |
| "\n", | |
| "# PLEASE RESTART THE KERNEL !!!\n", | |
| "\n", | |
| "Otherwise your changes won't take effect. Just double-check, every time you run this notebook, that you are on SystemML 1.3.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Collecting https://github.com/IBM/coursera/blob/master/systemml-1.3.0-SNAPSHOT-python.tar.gz?raw=true\n", | |
| " Using cached https://github.com/IBM/coursera/blob/master/systemml-1.3.0-SNAPSHOT-python.tar.gz?raw=true\n", | |
| "Collecting numpy>=1.8.2 (from systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/62/20/4d43e141b5bc426ba38274933ef8e76e85c7adea2c321ecf9ebf7421cedf/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl\n", | |
| "Collecting scipy>=0.15.1 (from systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/dc/29/162476fd44203116e7980cfbd9352eef9db37c49445d1fec35509022f6aa/scipy-1.4.1-cp36-cp36m-manylinux1_x86_64.whl\n", | |
| "Collecting pandas (from systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/08/ec/b5dd8cfb078380fb5ae9325771146bccd4e8cad2d3e4c72c7433010684eb/pandas-1.0.1-cp36-cp36m-manylinux1_x86_64.whl\n", | |
| "Collecting scikit-learn (from systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/5e/d8/312e03adf4c78663e17d802fe2440072376fee46cada1404f1727ed77a32/scikit_learn-0.22.2.post1-cp36-cp36m-manylinux1_x86_64.whl\n", | |
| "Collecting Pillow>=2.0.0 (from systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/19/5e/23dcc0ce3cc2abe92efd3cd61d764bee6ccdf1b667a1fb566f45dc249953/Pillow-7.0.0-cp36-cp36m-manylinux1_x86_64.whl\n", | |
| "Collecting python-dateutil>=2.6.1 (from pandas->systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl\n", | |
| "Collecting pytz>=2017.2 (from pandas->systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl\n", | |
| "Collecting joblib>=0.11 (from scikit-learn->systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/28/5c/cf6a2b65a321c4a209efcdf64c2689efae2cb62661f8f6f4bb28547cf1bf/joblib-0.14.1-py2.py3-none-any.whl\n", | |
| "Collecting six>=1.5 (from python-dateutil>=2.6.1->pandas->systemml==1.3.0)\n", | |
| " Using cached https://files.pythonhosted.org/packages/65/eb/1f97cb97bfc2390a276969c6fae16075da282f5058082d4cb10c6c5c1dba/six-1.14.0-py2.py3-none-any.whl\n", | |
| "Building wheels for collected packages: systemml\n", | |
| " Building wheel for systemml (setup.py) ... \u001b[?25ldone\n", | |
| "\u001b[?25h Stored in directory: /home/spark/shared/.cache/pip/wheels/aa/bf/28/4344dd13abd8b9b6cbd4032baf4b851873d2e2288a65631fd2\n", | |
| "Successfully built systemml\n", | |
| "\u001b[31mtensorflow 1.13.1 requires tensorboard<1.14.0,>=1.13.0, which is not installed.\u001b[0m\n", | |
| "\u001b[31mibm-cos-sdk-core 2.4.3 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.8 which is incompatible.\u001b[0m\n", | |
| "\u001b[31mbotocore 1.12.82 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.8 which is incompatible.\u001b[0m\n", | |
| "Installing collected packages: numpy, scipy, six, python-dateutil, pytz, pandas, joblib, scikit-learn, Pillow, systemml\n", | |
| "Successfully installed Pillow-7.0.0 joblib-0.14.1 numpy-1.18.1 pandas-1.0.1 python-dateutil-2.8.1 pytz-2019.3 scikit-learn-0.22.2.post1 scipy-1.4.1 six-1.14.0 systemml-1.3.0\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/six.py already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/systemml-1.3.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/python_dateutil-2.8.1.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/dateutil already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/pytz already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/scikit_learn-0.22.2.post1.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/joblib already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/numpy already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/systemml already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/Pillow-7.0.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/numpy-1.18.1.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/scipy already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/__pycache__ already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/scipy-1.4.1.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/sklearn already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/pytz-2019.3.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/pandas-1.0.1.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/six-1.14.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/PIL already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/joblib-0.14.1.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/pandas already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/bin already exists. Specify --upgrade to force replacement.\u001b[0m\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "!pip install https://github.com/IBM/coursera/blob/master/systemml-1.3.0-SNAPSHOT-python.tar.gz?raw=true" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "\n", | |
| "\n", | |
| "Now we need to create two symlinks so that the newest version is picked up. This is a workaround and will be removed as soon as SystemML 1.3 is officially released and pre-installed on Watson Studio.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 14, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!ln -s -f ~/user-libs/python3.6/systemml/systemml-java/systemml-1.3.0-SNAPSHOT-extra.jar ~/user-libs/spark2/systemml-1.3.0-SNAPSHOT-extra.jar\n", | |
| "!ln -s -f ~/user-libs/python3.6/systemml/systemml-java/systemml-1.3.0-SNAPSHOT.jar ~/user-libs/spark2/systemml-1.3.0-SNAPSHOT.jar" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Please now restart the kernel and start from the beginning to make sure you've installed SystemML 1.3\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 15, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "1.3.0-SNAPSHOT\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "from systemml import MLContext\n", | |
| "ml = MLContext(spark)\n", | |
| "print(ml.version())\n", | |
| " \n", | |
| "if not ml.version() == '1.3.0-SNAPSHOT':\n", | |
| " raise ValueError('please upgrade to SystemML 1.3.0, or restart your Kernel (Kernel->Restart & Clear Output)')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Let's download the test data. Since it's so small, we don't use COS (IBM Cloud Object Storage) here." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 16, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "--2020-03-11 02:50:04-- https://github.com/IBM/coursera/blob/master/coursera_ml/shake.parquet?raw=true\n", | |
| "Resolving github.com (github.com)... 140.82.114.3\n", | |
| "Connecting to github.com (github.com)|140.82.114.3|:443... connected.\n", | |
| "HTTP request sent, awaiting response... 302 Found\n", | |
| "Location: https://github.com/IBM/coursera/raw/master/coursera_ml/shake.parquet [following]\n", | |
| "--2020-03-11 02:50:04-- https://github.com/IBM/coursera/raw/master/coursera_ml/shake.parquet\n", | |
| "Reusing existing connection to github.com:443.\n", | |
| "HTTP request sent, awaiting response... 302 Found\n", | |
| "Location: https://raw.githubusercontent.com/IBM/coursera/master/coursera_ml/shake.parquet [following]\n", | |
| "--2020-03-11 02:50:04-- https://raw.githubusercontent.com/IBM/coursera/master/coursera_ml/shake.parquet\n", | |
| "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.8.133\n", | |
| "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.8.133|:443... connected.\n", | |
| "HTTP request sent, awaiting response... 200 OK\n", | |
| "Length: 74727 (73K) [application/octet-stream]\n", | |
| "Saving to: 'shake.parquet?raw=true'\n", | |
| "\n", | |
| "100%[======================================>] 74,727 --.-K/s in 0.003s \n", | |
| "\n", | |
| "2020-03-11 02:50:04 (22.6 MB/s) - 'shake.parquet?raw=true' saved [74727/74727]\n", | |
| "\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "!wget https://github.com/IBM/coursera/blob/master/coursera_ml/shake.parquet?raw=true\n", | |
| "!mv shake.parquet?raw=true shake.parquet" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now it’s time to read the sensor data and create a temporary query table." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "df=spark.read.parquet('shake.parquet')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 18, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "+-----+---------+-----+-----+-----+\n", | |
| "|CLASS| SENSORID| X| Y| Z|\n", | |
| "+-----+---------+-----+-----+-----+\n", | |
| "| 2| qqqqqqqq| 0.12| 0.12| 0.12|\n", | |
| "| 2|aUniqueID| 0.03| 0.03| 0.03|\n", | |
| "| 2| qqqqqqqq|-3.84|-3.84|-3.84|\n", | |
| "| 2| 12345678| -0.1| -0.1| -0.1|\n", | |
| "| 2| 12345678|-0.15|-0.15|-0.15|\n", | |
| "| 2| 12345678| 0.47| 0.47| 0.47|\n", | |
| "| 2| 12345678|-0.06|-0.06|-0.06|\n", | |
| "| 2| 12345678|-0.09|-0.09|-0.09|\n", | |
| "| 2| 12345678| 0.21| 0.21| 0.21|\n", | |
| "| 2| 12345678|-0.08|-0.08|-0.08|\n", | |
| "| 2| 12345678| 0.44| 0.44| 0.44|\n", | |
| "| 2| gholi| 0.76| 0.76| 0.76|\n", | |
| "| 2| gholi| 1.62| 1.62| 1.62|\n", | |
| "| 2| gholi| 5.81| 5.81| 5.81|\n", | |
| "| 2| bcbcbcbc| 0.58| 0.58| 0.58|\n", | |
| "| 2| bcbcbcbc|-8.24|-8.24|-8.24|\n", | |
| "| 2| bcbcbcbc|-0.45|-0.45|-0.45|\n", | |
| "| 2| bcbcbcbc| 1.03| 1.03| 1.03|\n", | |
| "| 2|aUniqueID|-0.05|-0.05|-0.05|\n", | |
| "| 2| qqqqqqqq|-0.44|-0.44|-0.44|\n", | |
| "+-----+---------+-----+-----+-----+\n", | |
| "only showing top 20 rows\n", | |
| "\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "df.show()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 19, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Collecting pixiedust\n", | |
| "Collecting astunparse (from pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/2b/03/13dde6512ad7b4557eb792fbcf0c653af6076b81e5941d36ec61f7ce6028/astunparse-1.6.3-py2.py3-none-any.whl\n", | |
| "Collecting colour (from pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/74/46/e81907704ab203206769dee1385dc77e1407576ff8f50a0681d0a6b541be/colour-0.1.5-py2.py3-none-any.whl\n", | |
| "Collecting markdown (from pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/ab/c4/ba46d44855e6eb1770a12edace5a165a0c6de13349f592b9036257f3c3d3/Markdown-3.2.1-py2.py3-none-any.whl\n", | |
| "Collecting requests (from pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/1a/70/1935c770cb3be6e3a8b78ced23d7e0f3b187f5cbfab4749523ed65d7c9b1/requests-2.23.0-py2.py3-none-any.whl\n", | |
| "Collecting lxml (from pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/dd/ba/a0e6866057fc0bbd17192925c1d63a3b85cf522965de9bc02364d08e5b84/lxml-4.5.0-cp36-cp36m-manylinux1_x86_64.whl\n", | |
| "Collecting geojson (from pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/e4/8d/9e28e9af95739e6d2d2f8d4bef0b3432da40b7c3588fbad4298c1be09e48/geojson-2.5.0-py2.py3-none-any.whl\n", | |
| "Collecting mpld3 (from pixiedust)\n", | |
| "Collecting wheel<1.0,>=0.23.0 (from astunparse->pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/8c/23/848298cccf8e40f5bbb59009b32848a4c38f4e7f3364297ab3c3e2e2cd14/wheel-0.34.2-py2.py3-none-any.whl\n", | |
| "Collecting six<2.0,>=1.6.1 (from astunparse->pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/65/eb/1f97cb97bfc2390a276969c6fae16075da282f5058082d4cb10c6c5c1dba/six-1.14.0-py2.py3-none-any.whl\n", | |
| "Collecting setuptools>=36 (from markdown->pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/70/b8/b23170ddda9f07c3444d49accde49f2b92f97bb2f2ebc312618ef12e4bd6/setuptools-46.0.0-py3-none-any.whl\n", | |
| "Collecting idna<3,>=2.5 (from requests->pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/89/e3/afebe61c546d18fb1709a61bee788254b40e736cff7271c7de5de2dc4128/idna-2.9-py2.py3-none-any.whl\n", | |
| "Collecting certifi>=2017.4.17 (from requests->pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/b9/63/df50cac98ea0d5b006c55a399c3bf1db9da7b5a24de7890bc9cfd5dd9e99/certifi-2019.11.28-py2.py3-none-any.whl\n", | |
| "Collecting chardet<4,>=3.0.2 (from requests->pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl\n", | |
| "Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests->pixiedust)\n", | |
| " Using cached https://files.pythonhosted.org/packages/e8/74/6e4f91745020f967d09332bb2b8b9b10090957334692eb88ea4afe91b77f/urllib3-1.25.8-py2.py3-none-any.whl\n", | |
| "\u001b[31mtensorflow 1.13.1 requires tensorboard<1.14.0,>=1.13.0, which is not installed.\u001b[0m\n", | |
| "\u001b[31mibm-cos-sdk-core 2.4.3 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.8 which is incompatible.\u001b[0m\n", | |
| "\u001b[31mbotocore 1.12.82 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.8 which is incompatible.\u001b[0m\n", | |
| "Installing collected packages: wheel, six, astunparse, colour, setuptools, markdown, idna, certifi, chardet, urllib3, requests, lxml, geojson, mpld3, pixiedust\n", | |
| "Successfully installed astunparse-1.6.3 certifi-2019.11.28 chardet-3.0.4 colour-0.1.5 geojson-2.5.0 idna-2.9 lxml-4.5.0 markdown-3.2.1 mpld3-0.3 pixiedust-1.1.18 requests-2.23.0 setuptools-46.0.0 six-1.14.0 urllib3-1.25.8 wheel-0.34.2\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/mpld3-0.3.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/urllib3 already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/urllib3-1.25.8.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/Markdown-3.2.1.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/six.py already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/idna already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/chardet already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/markdown already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/lxml-4.5.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/astunparse-1.6.3.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/wheel already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/setuptools-46.0.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/colour.py already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/pixiedust-1.1.18.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/chardet-3.0.4.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/astunparse already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/certifi already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/setuptools already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/certifi-2019.11.28.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/__pycache__ already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/requests-2.23.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/lxml already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/install already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/six-1.14.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/requests already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/colour-0.1.5.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/wheel-0.34.2.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/pixiedust already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/idna-2.9.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/geojson-2.5.0.dist-info already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/easy_install.py already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/geojson already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/mpld3 already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/pkg_resources already exists. Specify --upgrade to force replacement.\u001b[0m\n", | |
| "\u001b[33mTarget directory /home/spark/shared/user-libs/python3.6/bin already exists. Specify --upgrade to force replacement.\u001b[0m\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "!pip install pixiedust" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "pixiedust": { | |
| "displayParams": { | |
| "handlerId": "tableView" | |
| } | |
| } | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "import pixiedust\n", | |
| "display(df)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 21, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "df.createOrReplaceTempView(\"df\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "We’ll use Apache SystemML to implement the Discrete Fourier Transformation. This way, all computation continues to happen on the Apache Spark cluster for scalability and performance." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 22, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from systemml import MLContext, dml\n", | |
| "ml = MLContext(spark)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "As you’ve learned in the lecture, implementing a Discrete Fourier Transformation in a linear algebra programming language is simple. Apache SystemML DML is such a language, and as you can see, the implementation is straightforward and doesn’t differ much from the mathematical definition (just note that the sum operator has been replaced by a matrix multiplication, using the %*% syntax borrowed from R):\n", | |
| "\n", | |
| "<img style=\"float: left;\" src=\"https://wikimedia.org/api/rest_v1/media/math/render/svg/1af0a78dc50bbf118ab6bd4c4dcc3c4ff8502223\">\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 23, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "dml_script = '''\n", | |
| "PI = 3.141592654\n", | |
| "N = nrow(signal)\n", | |
| "\n", | |
| "n = seq(0, N-1, 1)\n", | |
| "k = seq(0, N-1, 1)\n", | |
| "\n", | |
| "M = (n %*% t(k))*(2*PI/N)\n", | |
| "\n", | |
| "Xa = cos(M) %*% signal\n", | |
| "Xb = sin(M) %*% signal\n", | |
| "\n", | |
| "DFT = cbind(Xa, Xb)\n", | |
| "'''" | |
| ] | |
| }, | |
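The DML script above can be sanity-checked outside SystemML with a minimal NumPy sketch of the same linear algebra (NumPy is assumed here purely for verification; it is not part of the graded pipeline). `Xa` is the real part of the DFT and `Xb` is the negated imaginary part, which the comparison against `np.fft.fft` confirms:

```python
import numpy as np

N = 16
signal = np.random.default_rng(0).standard_normal(N)

# Same construction as the DML script: M = n k^T * 2*pi/N
n = np.arange(N)
k = np.arange(N)
M = np.outer(n, k) * (2 * np.pi / N)

Xa = np.cos(M) @ signal  # real part of the DFT
Xb = np.sin(M) @ signal  # minus the imaginary part of the DFT

ref = np.fft.fft(signal)
assert np.allclose(Xa, ref.real)
assert np.allclose(Xb, -ref.imag)
```

The sign flip on `Xb` comes from the DFT kernel e^{-2πikn/N} = cos(2πkn/N) − i·sin(2πkn/N); for classification features only the magnitudes matter, so the DML script can ignore it.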
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now it’s time to create a function which takes a single-column Apache Spark DataFrame as its argument (the one containing the accelerometer measurement time series for one axis) and returns its Fourier transformation. In addition, we add an index column for later joining all axes together, and we rename the columns to appropriate names. The result of this function is an Apache Spark DataFrame containing the Fourier transformation of its input in two columns. \n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 24, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from pyspark.sql.functions import monotonically_increasing_id\n", | |
| "\n", | |
| "def dft_systemml(signal,name):\n", | |
| " prog = dml(dml_script).input('signal', signal).output('DFT')\n", | |
| " \n", | |
| " return (\n", | |
| "\n", | |
| " #execute the script inside the SystemML engine running on top of Apache Spark\n", | |
| " ml.execute(prog) \n", | |
| " \n", | |
| " #read result from SystemML execution back as SystemML Matrix\n", | |
| " .get('DFT') \n", | |
| " \n", | |
| " #convert SystemML Matrix to ApacheSpark DataFrame \n", | |
| " .toDF() \n", | |
| " \n", | |
| " #rename default column names\n", | |
| " .selectExpr('C1 as %sa' % (name), 'C2 as %sb' % (name)) \n", | |
| " \n", | |
| " #add unique ID per row for later joining\n", | |
| " .withColumn(\"id\", monotonically_increasing_id())\n", | |
| " )\n", | |
| " \n", | |
| "\n", | |
| "\n" | |
| ] | |
| }, | |
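The join-on-id pattern this function enables can be illustrated with pandas as a lightweight stand-in for Spark (the column values below are hypothetical, chosen only for the demo): each per-axis result carries a row id, and the axes are recombined with an inner join. Note that `monotonically_increasing_id` only guarantees unique, increasing ids, so the pattern relies on the per-axis DataFrames being produced from the same ordered input.

```python
import pandas as pd

# Hypothetical per-axis DFT outputs (two columns each) plus a row id,
# mirroring what dft_systemml returns for the x and y axes.
xf = pd.DataFrame({"id": [0, 1], "xa": [1.0, 2.0], "xb": [0.1, 0.2]})
yf = pd.DataFrame({"id": [0, 1], "ya": [3.0, 4.0], "yb": [0.3, 0.4]})

# Inner join on the shared id column, as in the Spark version.
features = xf.merge(yf, on="id", how="inner")
print(list(features.columns))  # ['id', 'xa', 'xb', 'ya', 'yb']
```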
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now it’s time to create individual DataFrames containing only a subset of the data. We filter simultaneously for each accelerometer sensor axis and for each class. This means you’ll get six DataFrames (three axes times two classes). Please implement this using the relational API of DataFrames or SparkSQL. Please use classes 1 and 2, not 0 and 1.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# select only the numeric axis column, so SystemML receives a matrix rather\n", | |
| "# than a frame with string columns, and filter on classes 1 and 2 as instructed\n", | |
| "x0 = spark.sql(\"SELECT X as x from df where CLASS = 1\")\n", | |
| "y0 = spark.sql(\"SELECT Y as y from df where CLASS = 1\")\n", | |
| "z0 = spark.sql(\"SELECT Z as z from df where CLASS = 1\")\n", | |
| "x1 = spark.sql(\"SELECT X as x from df where CLASS = 2\")\n", | |
| "y1 = spark.sql(\"SELECT Y as y from df where CLASS = 2\")\n", | |
| "z1 = spark.sql(\"SELECT Z as z from df where CLASS = 2\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Since we’ve created this cool DFT function before, we can now just call it for each of the six DataFrames. And since the result of each call is again a DataFrame, we can follow the PySpark best practice of simply chaining method calls. So what we are doing is the following:\n", | |
| "\n", | |
| "- Calling DFT for each class and accelerometer sensor axis.\n", | |
| "- Joining them together on the ID column.\n", | |
| "- Re-adding a column containing the class index.\n", | |
| "- Stacking the DataFrames for both classes together.\n", | |
| "\n" | |
| ] | |
| }, | |
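Those four steps can be sketched end to end in pandas/NumPy as a stand-in for the Spark pipeline (`dft_features` below is a hypothetical analogue of `dft_systemml`, and the signal is toy data):

```python
import numpy as np
import pandas as pd

def dft_features(sig, name):
    # Hypothetical pandas analogue of dft_systemml: cosine/sine DFT parts
    # named "<axis>a"/"<axis>b", plus a row id for joining.
    N = len(sig)
    M = np.outer(np.arange(N), np.arange(N)) * (2 * np.pi / N)
    out = pd.DataFrame({name + "a": np.cos(M) @ sig, name + "b": np.sin(M) @ sig})
    out["id"] = np.arange(N)
    return out

sig = np.array([1.0, 0.0, -1.0, 0.0])

# Per class: DFT each axis, join on id, then re-add the class label.
df_class_0 = dft_features(sig, "x").merge(dft_features(sig, "y"), on="id").assign(**{"class": 0})
df_class_1 = dft_features(sig, "x").merge(dft_features(sig, "y"), on="id").assign(**{"class": 1})

# Stack both classes, the analogue of a Spark union.
stacked = pd.concat([df_class_0, df_class_1], ignore_index=True)
print(len(stacked))  # 8
```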
| { | |
| "cell_type": "code", | |
| "execution_count": 50, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "ename": "Py4JJavaError", | |
| "evalue": "An error occurred while calling o179.execute.\n: org.apache.sysml.api.mlcontext.MLContextException: Exception when executing script\n\tat org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:346)\n\tat org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:319)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:819)\nCaused by: org.apache.sysml.api.DMLException: org.apache.sysml.parser.LanguageException: Invalid Parameters : ERROR: [line 11:5] -> cos(M) -- Invalid Datatypes for operation MATRIX FRAME\n\tat org.apache.sysml.api.ScriptExecutorUtils.compileRuntimeProgram(ScriptExecutorUtils.java:236)\n\tat org.apache.sysml.api.mlcontext.ScriptExecutor.compile(ScriptExecutor.java:195)\n\tat org.apache.sysml.api.mlcontext.ScriptExecutor.compile(ScriptExecutor.java:168)\n\tat org.apache.sysml.api.mlcontext.ScriptExecutor.execute(ScriptExecutor.java:234)\n\tat org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:342)\n\t... 
12 more\nCaused by: org.apache.sysml.parser.LanguageException: Invalid Parameters : ERROR: [line 11:5] -> cos(M) -- Invalid Datatypes for operation MATRIX FRAME\n\tat org.apache.sysml.parser.Expression.raiseValidateError(Expression.java:549)\n\tat org.apache.sysml.parser.Expression.computeDataType(Expression.java:427)\n\tat org.apache.sysml.parser.Expression.computeDataType(Expression.java:399)\n\tat org.apache.sysml.parser.BinaryExpression.validateExpression(BinaryExpression.java:114)\n\tat org.apache.sysml.parser.StatementBlock.validateAssignmentStatement(StatementBlock.java:849)\n\tat org.apache.sysml.parser.StatementBlock.validate(StatementBlock.java:792)\n\tat org.apache.sysml.parser.DMLTranslator.validateParseTree(DMLTranslator.java:147)\n\tat org.apache.sysml.parser.DMLTranslator.validateParseTree(DMLTranslator.java:110)\n\tat org.apache.sysml.api.ScriptExecutorUtils.compileRuntimeProgram(ScriptExecutorUtils.java:158)\n\t... 16 more\n", | |
| "output_type": "error", | |
| "traceback": [ | |
| "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", | |
| "\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)", | |
| "\u001b[0;32m<ipython-input-50-620a69afb45b>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mpyspark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msql\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfunctions\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mlit\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mdf_class_0\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdft_systemml\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'x'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdft_systemml\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'y'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'id'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'inner'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdft_systemml\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mz0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'z'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'id'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'inner'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mwithColumn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'class'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mdf_class_1\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mdft_systemml\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'x'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdft_systemml\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'y'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'id'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'inner'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdft_systemml\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mz1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'z'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'id'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'inner'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mwithColumn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'class'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", | |
| "\u001b[0;32m<ipython-input-24-d4d07b673c1b>\u001b[0m in \u001b[0;36mdft_systemml\u001b[0;34m(signal, name)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;31m#execute the script inside the SystemML engine running on top of Apache Spark\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mml\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprog\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;31m#read result from SystemML execution back as SystemML Matrix\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", | |
| "\u001b[0;32m/home/spark/shared/user-libs/python3.6/systemml/mlcontext.py\u001b[0m in \u001b[0;36mexecute\u001b[0;34m(self, script)\u001b[0m\n\u001b[1;32m 726\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdefault_jvm_stdout\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 727\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mjvm_stdout\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparallel_flush\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdefault_jvm_stdout_parallel_flush\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 728\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mMLResults\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_ml\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mscript_java\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 729\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 730\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mMLResults\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_ml\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mscript_java\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", | |
| "\u001b[0;32m/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1284\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1285\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1286\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1287\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1288\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", | |
| "\u001b[0;32m/opt/ibm/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 61\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdeco\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0ms\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjava_exception\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoString\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", | |
| "\u001b[0;32m/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 326\u001b[0m raise Py4JJavaError(\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 330\u001b[0m raise Py4JError(\n", | |
| "\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o179.execute.\n: org.apache.sysml.api.mlcontext.MLContextException: Exception when executing script\n\tat org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:346)\n\tat org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:319)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:819)\nCaused by: org.apache.sysml.api.DMLException: org.apache.sysml.parser.LanguageException: Invalid Parameters : ERROR: [line 11:5] -> cos(M) -- Invalid Datatypes for operation MATRIX FRAME\n\tat org.apache.sysml.api.ScriptExecutorUtils.compileRuntimeProgram(ScriptExecutorUtils.java:236)\n\tat org.apache.sysml.api.mlcontext.ScriptExecutor.compile(ScriptExecutor.java:195)\n\tat org.apache.sysml.api.mlcontext.ScriptExecutor.compile(ScriptExecutor.java:168)\n\tat org.apache.sysml.api.mlcontext.ScriptExecutor.execute(ScriptExecutor.java:234)\n\tat org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:342)\n\t... 
12 more\nCaused by: org.apache.sysml.parser.LanguageException: Invalid Parameters : ERROR: [line 11:5] -> cos(M) -- Invalid Datatypes for operation MATRIX FRAME\n\tat org.apache.sysml.parser.Expression.raiseValidateError(Expression.java:549)\n\tat org.apache.sysml.parser.Expression.computeDataType(Expression.java:427)\n\tat org.apache.sysml.parser.Expression.computeDataType(Expression.java:399)\n\tat org.apache.sysml.parser.BinaryExpression.validateExpression(BinaryExpression.java:114)\n\tat org.apache.sysml.parser.StatementBlock.validateAssignmentStatement(StatementBlock.java:849)\n\tat org.apache.sysml.parser.StatementBlock.validate(StatementBlock.java:792)\n\tat org.apache.sysml.parser.DMLTranslator.validateParseTree(DMLTranslator.java:147)\n\tat org.apache.sysml.parser.DMLTranslator.validateParseTree(DMLTranslator.java:110)\n\tat org.apache.sysml.api.ScriptExecutorUtils.compileRuntimeProgram(ScriptExecutorUtils.java:158)\n\t... 16 more\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "from pyspark.sql.functions import lit\n", | |
| "\n", | |
| "df_class_0 = dft_systemml(x0,'x') \\\n", | |
| " .join(dft_systemml(y0,'y'), on=['id'], how='inner') \\\n", | |
| " .join(dft_systemml(z0,'z'), on=['id'], how='inner') \\\n", | |
| " .withColumn('class', lit(0))\n", | |
| " \n", | |
| "df_class_1 = dft_systemml(x1,'x') \\\n", | |
| " .join(dft_systemml(y1,'y'), on=['id'], how='inner') \\\n", | |
| " .join(dft_systemml(z1,'z'), on=['id'], how='inner') \\\n", | |
| " .withColumn('class', lit(1))\n", | |
| "\n", | |
| "df_dft = df_class_0.union(df_class_1)\n", | |
| "\n", | |
| "df_dft.show()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Please create a VectorAssembler which consumes the newly created DFT columns and produces a column “features”.\n", | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from pyspark.ml.feature import VectorAssembler" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "vectorAssembler = ###YOUR_CODE_GOES_HERE###" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Please instantiate a classifier from the SparkML package and assign it to the classifier variable. Make sure to set the “class” column as the target.\n", | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from pyspark.ml.classification import GBTClassifier" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "classifier = ###YOUR_CODE_GOES_HERE###" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Let’s train and evaluate…\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from pyspark.ml import Pipeline\n", | |
| "pipeline = Pipeline(stages=[vectorAssembler, classifier])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "model = pipeline.fit(df_dft)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "prediction = model.transform(df_dft)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "prediction.show()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", | |
| "binEval = MulticlassClassificationEvaluator().setMetricName(\"accuracy\").setPredictionCol(\"prediction\").setLabelCol(\"class\")\n", | |
| "\n", | |
| "binEval.evaluate(prediction)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "If you are happy with the result (I’m happy with > 0.8), please submit your solution to the grader by executing the following cells. Don’t forget to obtain an assignment submission token (secret) from the grader page on Coursera and paste it into the “submission_token” variable below, along with the email address you’ve used for Coursera.\n", | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!rm -Rf a2_m4.json" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "prediction = prediction.repartition(1)\n", | |
| "prediction.write.json('a2_m4.json')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!rm -f rklib.py\n", | |
| "!wget https://raw.githubusercontent.com/IBM/coursera/master/rklib.py" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from rklib import zipit\n", | |
| "zipit('a2_m4.json.zip','a2_m4.json')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!base64 a2_m4.json.zip > a2_m4.json.zip.base64" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from rklib import submit\n", | |
| "key = \"-fBiYHYDEeiR4QqiFhAvkA\"\n", | |
| "part = \"IjtJk\"\n", | |
| "email = ###YOUR_CODE_GOES_HERE###\n", | |
| "submission_token = ###YOUR_CODE_GOES_HERE### # (have a look here if you need more information on how to obtain the token https://youtu.be/GcDo0Rwe06U?t=276)\n", | |
| "\n", | |
| "with open('a2_m4.json.zip.base64', 'r') as myfile:\n", | |
| " data=myfile.read()\n", | |
| "submit(email, submission_token, key, part, [part], data)" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3.6 with Spark", | |
| "language": "python3", | |
| "name": "python36" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.8" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 4 | |
| } |