{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the second assignment for the Coursera course \"Advanced Machine Learning and Signal Processing\"\n",
"\n",
"\n",
"Just execute all cells one after the other and you are done - just note that in the last one you have to update your email address (the one you've used for coursera) and obtain a submission token, you get this from the programming assignment directly on coursera.\n",
"\n",
"Please fill in the sections labelled with \"###YOUR_CODE_GOES_HERE###\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Waiting for a Spark session to start...\n",
"Spark Initialization Done! ApplicationId = app-20200308135741-0000\n",
"KERNEL_ID = c44da1e5-8cb5-4a81-97d0-ccf2a05e6c41\n",
"--2020-03-08 13:57:44-- https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet\n",
"Resolving github.com (github.com)... 140.82.113.4\n",
"Connecting to github.com (github.com)|140.82.113.4|:443... connected.\n",
"HTTP request sent, awaiting response... 302 Found\n",
"Location: https://raw.githubusercontent.com/IBM/coursera/master/coursera_ml/a2.parquet [following]\n",
"--2020-03-08 13:57:44-- https://raw.githubusercontent.com/IBM/coursera/master/coursera_ml/a2.parquet\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.8.133\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.8.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 59032 (58K) [application/octet-stream]\n",
"Saving to: 'a2.parquet'\n",
"\n",
"100%[======================================>] 59,032 --.-K/s in 0.003s \n",
"\n",
"2020-03-08 13:57:44 (19.6 MB/s) - 'a2.parquet' saved [59032/59032]\n",
"\n"
]
}
],
"source": [
"!wget https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it’s time to have a look at the recorded sensor data. You should see data similar to the one exemplified below….\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+-----------+-------------------+-------------------+-------------------+\n",
"|CLASS| SENSORID| X| Y| Z|\n",
"+-----+-----------+-------------------+-------------------+-------------------+\n",
"| 0| 26| 380.66434005495194| -139.3470983812975|-247.93697521077704|\n",
"| 0| 29| 104.74324299209692| -32.27421440203938|-25.105013725863852|\n",
"| 0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|\n",
"| 0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|\n",
"| 0|17179869241|-190.32584900181487| 234.7849657520335|-206.34483804019288|\n",
"| 0|25769803830| 178.62396382387422| -47.07529438881511| 84.38310769821979|\n",
"| 0|25769803831| 85.03128805189493|-4.3024316644854546|-1.1841857567516714|\n",
"| 0|34359738411| 26.786262674736566| -46.33193951911338| 20.880756008396055|\n",
"| 0| 8589934592|-16.203752396859194| 51.080957032176954| -96.80526656416971|\n",
"| 0|25769803852| 47.2048142440404| -78.2950899652916| 181.99604091494786|\n",
"| 0|34359738369| 15.608872398939273| -79.90322809181754| 69.62150711098005|\n",
"| 0| 19|-4.8281721129789315| -67.38050508399905| 221.24876396496404|\n",
"| 0| 54| -98.40725712852762|-19.989364074314732| -302.695196085276|\n",
"| 0|17179869313| 22.835845394816594| 17.1633660118843| 32.877914832011385|\n",
"| 0|34359738454| 84.20178070080324| -32.81572075916947| -48.63517643958031|\n",
"| 0| 0| 56.54732521345129| -7.980106018032676| 95.05162719436447|\n",
"| 0|17179869201| -57.6008655247749| 5.135393798773895| 236.99158698947267|\n",
"| 0|17179869308| -65.59264738389012| -48.92660057215126| -61.58970715383383|\n",
"| 0|25769803790| 34.82337351291005| 9.483542084393937| 197.6066372962772|\n",
"| 0|25769803825| 39.80573823439121|-0.7955236412785212| -79.66652640650325|\n",
"+-----+-----------+-------------------+-------------------+-------------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"df=spark.read.load('a2.parquet')\n",
"\n",
"df.createOrReplaceTempView(\"df\")\n",
"spark.sql(\"SELECT * from df\").show()\n"
]
},
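{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not required by the assignment), the temporary view registered above can be used to see how many rows each class contains before training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the class distribution using the registered temporary view \"df\".\n",
"spark.sql(\"SELECT CLASS, count(*) AS cnt FROM df GROUP BY CLASS\").show()"
]
},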
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please create a VectorAssembler which consumes columns X, Y and Z and produces a column “features”\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.ml.feature import VectorAssembler\n",
"vectorAssembler = VectorAssembler( inputCols=[\"X\",\"Y\",\"Z\"], outputCol=\"CLASS\")"
]
},
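{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, the assembler can be applied on its own to verify that a new “features” vector column is produced from X, Y and Z before it is used inside a pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check: the \"features\" column should contain a vector built from X, Y and Z.\n",
"vectorAssembler.transform(df).select(\"CLASS\", \"features\").show(5, truncate=False)"
]
},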
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please instantiate a classifier from the SparkML package and assign it to the classifier variable. Make sure to either\n",
"1.\tRename the “CLASS” column to “label” or\n",
"2.\tSpecify the label-column correctly to be “CLASS”\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.ml.classification import LogisticRegression\n",
"\n",
"classifier = LogisticRegression(labelCol='CLASS', maxIter=10, regParam=0.3, elasticNetParam=0.8 )\n"
]
},
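{
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, a minimal sketch of the first option (renaming “CLASS” to “label” and keeping the classifier defaults) is shown below. It is illustrative only and is not used by the pipeline in the following cells:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative alternative (option 1): rename the label column instead of setting labelCol.\n",
"# The variables are deliberately named differently so they don't interfere with the cells below.\n",
"df_renamed = df.withColumnRenamed(\"CLASS\", \"label\")\n",
"classifier_alt = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)"
]
},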
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s train and evaluate…\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.ml import Pipeline\n",
"pipeline = Pipeline(stages=[vectorAssembler, classifier])"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"ename": "IllegalArgumentException",
"evalue": "'Output column CLASS already exists.'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/opt/ibm/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o105.transform.\n: java.lang.IllegalArgumentException: Output column CLASS already exists.\n\tat org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:172)\n\tat org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\tat org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:86)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:819)\n",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mIllegalArgumentException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-21-5a6ceed6e361>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/opt/ibm/spark/python/pyspark/ml/base.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, dataset, params)\u001b[0m\n\u001b[1;32m 130\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_fit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 131\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 132\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_fit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 133\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 134\u001b[0m raise ValueError(\"Params must be either a param map or a list/tuple of param maps, \"\n",
"\u001b[0;32m/opt/ibm/spark/python/pyspark/ml/pipeline.py\u001b[0m in \u001b[0;36m_fit\u001b[0;34m(self, dataset)\u001b[0m\n\u001b[1;32m 105\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstage\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mTransformer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 106\u001b[0m \u001b[0mtransformers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstage\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 107\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mstage\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 108\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;31m# must be an Estimator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mstage\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/ibm/spark/python/pyspark/ml/base.py\u001b[0m in \u001b[0;36mtransform\u001b[0;34m(self, dataset, params)\u001b[0m\n\u001b[1;32m 171\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_transform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 172\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 173\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_transform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 174\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 175\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Params must be a param map but got %s.\"\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/ibm/spark/python/pyspark/ml/wrapper.py\u001b[0m in \u001b[0;36m_transform\u001b[0;34m(self, dataset)\u001b[0m\n\u001b[1;32m 310\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_transform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 311\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_transfer_params_to_java\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 312\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_java_obj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdataset\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msql_ctx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 313\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 314\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1284\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1285\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1286\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1287\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1288\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/ibm/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 77\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mQueryExecutionException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 78\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'java.lang.IllegalArgumentException: '\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 79\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mIllegalArgumentException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 80\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 81\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdeco\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIllegalArgumentException\u001b[0m: 'Output column CLASS already exists.'"
]
}
],
"source": [
"model = pipeline.fit(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prediction = model.transform(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prediction.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n",
"binEval = MulticlassClassificationEvaluator().setMetricName(\"accuracy\") .setPredictionCol(\"prediction\").setLabelCol(\"CLASS\")\n",
" \n",
"binEval.evaluate(prediction) "
]
},
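{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the accuracy above is computed on the same data the model was trained on, which tends to be optimistic. An optional refinement (not required for the grader) is to evaluate on a held-out split, for example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: estimate accuracy on data the model has not seen during training.\n",
"train_df, test_df = df.randomSplit([0.8, 0.2], seed=1)\n",
"holdout_model = pipeline.fit(train_df)\n",
"binEval.evaluate(holdout_model.transform(test_df))"
]
},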
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are happy with the result (I’m happy with > 0.55) please submit your solution to the grader by executing the following cells, please don’t forget to obtain an assignment submission token (secret) from the Coursera’s graders web page and paste it to the “secret” variable below, including your email address you’ve used for Coursera. (0.55 means that you are performing better than random guesses)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!rm -Rf a2_m2.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prediction = prediction.repartition(1)\n",
"prediction.write.json('a2_m2.json')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!rm -f rklib.py\n",
"!wget https://raw.githubusercontent.com/IBM/coursera/master/rklib.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import zipfile\n",
"\n",
"def zipdir(path, ziph):\n",
" for root, dirs, files in os.walk(path):\n",
" for file in files:\n",
" ziph.write(os.path.join(root, file))\n",
"\n",
"zipf = zipfile.ZipFile('a2_m2.json.zip', 'w', zipfile.ZIP_DEFLATED)\n",
"zipdir('a2_m2.json', zipf)\n",
"zipf.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!base64 a2_m2.json.zip > a2_m2.json.zip.base64"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rklib import submit\n",
"key = \"J3sDL2J8EeiaXhILFWw2-g\"\n",
"part = \"G4P6f\"\n",
"email = None###YOUR_CODE_GOES_HERE###\"\n",
"token = None###YOUR_CODE_GOES_HERE###\" # (have a look here if you need more information on how to obtain the token https://youtu.be/GcDo0Rwe06U?t=276)\n",
"\n",
"with open('a2_m2.json.zip.base64', 'r') as myfile:\n",
" data=myfile.read()\n",
"submit(email, token, key, part, [part], data)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 with Spark",
"language": "python3",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}