@ldruizsan
Created December 24, 2020 04:46
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "# **Hands-on Lab : Web Scraping**\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Estimated time needed: **30 to 45** minutes\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Objectives\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this lab you will perform the following:\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "- Extract information from a given web site \n- Write the scraped data into a csv file.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Extract information from the given web site\n\nYou will extract the data from the below web site: <br> \n"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": "#this url contains the data you need to scrape\nurl = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html\""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The data you need to scrape is the **name of the programming language** and **average annual salary**.<br> It is a good idea to open the url in your web broswer and study the contents of the web page before you start to scrape.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the required libraries\n"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "# Your code here\nfrom bs4 import BeautifulSoup\nimport requests"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Download the webpage at the url\n"
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": "#your code goes here\ndata = requests.get(url).text"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a soup object\n"
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <title>\n Salary survey results of programming languages\n </title>\n <style>\n table, th, td {\n border: 1px solid black;\n}\n </style>\n </head>\n <body>\n <hr/>\n <h2>\n Popular Programming Languages\n </h2>\n <hr/>\n <p>\n Finding out which is the best language is a tough task. A programming language is created to solve a specific problem. A language which is good for task A may not be able to properly handle task B. Comparing programming language is never easy. What we can do, however, is find which is popular in the industry.\n </p>\n <p>\n There are many ways to find the popularity of a programming languages. Counting the number of google searchs for each language is a simple way to find the popularity. GitHub and StackOverflow also can give some good pointers.\n </p>\n <p>\n Salary surveys are a way to find out the programmings languages that are most in demand in the industry. Below table is the result of one such survey. When using any survey keep in mind that the results vary year on year.\n </p>\n <hr/>\n <table>\n <tbody>\n <tr>\n <td>\n No.\n </td>\n <td>\n Language\n </td>\n <td>\n Created By\n </td>\n <td>\n Average Annual Salary\n </td>\n <td>\n Learning Difficulty\n </td>\n </tr>\n <tr>\n <td>\n 1\n </td>\n <td>\n Python\n </td>\n <td>\n Guido van Rossum\n </td>\n <td>\n $114,383\n </td>\n <td>\n Easy\n </td>\n </tr>\n <tr>\n <td>\n 2\n </td>\n <td>\n Java\n </td>\n <td>\n James Gosling\n </td>\n <td>\n $101,013\n </td>\n <td>\n Easy\n </td>\n </tr>\n <tr>\n <td>\n 3\n </td>\n <td>\n R\n </td>\n <td>\n Robert Gentleman, Ross Ihaka\n </td>\n <td>\n $92,037\n </td>\n <td>\n Hard\n </td>\n </tr>\n <tr>\n <td>\n 4\n </td>\n <td>\n Javascript\n </td>\n <td>\n Netscape\n </td>\n <td>\n $110,981\n </td>\n <td>\n Easy\n </td>\n </tr>\n <tr>\n <td>\n 5\n </td>\n <td>\n Swift\n </td>\n <td>\n Apple\n </td>\n <td>\n $130,801\n </td>\n <td>\n Easy\n </td>\n </tr>\n <tr>\n <td>\n 6\n </td>\n <td>\n C++\n 
</td>\n <td>\n Bjarne Stroustrup\n </td>\n <td>\n $113,865\n </td>\n <td>\n Hard\n </td>\n </tr>\n <tr>\n <td>\n 7\n </td>\n <td>\n C#\n </td>\n <td>\n Microsoft\n </td>\n <td>\n $88,726\n </td>\n <td>\n Hard\n </td>\n </tr>\n <tr>\n <td>\n 8\n </td>\n <td>\n PHP\n </td>\n <td>\n Rasmus Lerdorf\n </td>\n <td>\n $84,727\n </td>\n <td>\n Easy\n </td>\n </tr>\n <tr>\n <td>\n 9\n </td>\n <td>\n SQL\n </td>\n <td>\n Donald D. Chamberlin, Raymond F. Boyce.\n </td>\n <td>\n $84,793\n </td>\n <td>\n Easy\n </td>\n </tr>\n <tr>\n <td>\n 10\n </td>\n <td>\n Go\n </td>\n <td>\n Robert Griesemer, Ken Thompson, Rob Pike.\n </td>\n <td>\n $94,082\n </td>\n <td>\n Difficult\n </td>\n </tr>\n </tbody>\n </table>\n <hr/>\n </body>\n</html>\n"
}
],
"source": "#your code goes here\nsoup = BeautifulSoup(data,\"html.parser\")\nprint(soup.prettify())"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Scrape the `Language name` and `annual average salary`.\n"
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "['Language', 'Python', 'Java', 'R', 'Javascript', 'Swift', 'C++', 'C#', 'PHP', 'SQL', 'Go']\n['Average Annual Salary', '$114,383', '$101,013', '$92,037', '$110,981', '$130,801', '$113,865', '$88,726', '$84,727', '$84,793', '$94,082']\n"
}
],
"source": "#your code goes here\n\ntable = soup.find(\"table\")\nlanguage = []\nsalary = []\nfor row in table.find_all(\"tr\"):\n cells = row.find_all(\"td\")\n language_name = language.append(cells[1].getText())\n avg_annual_salary = salary.append(cells[3].getText())\nprint(language)\nprint(salary)\n "
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Language ---- Average Annual Salary\nPython ---- $114,383\nJava ---- $101,013\nR ---- $92,037\nJavascript ---- $110,981\nSwift ---- $130,801\nC++ ---- $113,865\nC# ---- $88,726\nPHP ---- $84,727\nSQL ---- $84,793\nGo ---- $94,082\n"
}
],
"source": "#your code goes here\n\ntable = soup.find(\"table\")\nfor row in table.find_all(\"tr\"):\n cells = row.find_all(\"td\")\n language_name = cells[1].getText()\n avg_annual_salary = cells[3].getText()\n print(\"{} ---- {}\".format(language_name, avg_annual_salary))\n "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Save the scrapped data into a file named _popular-languages.csv_\n"
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": "# your code goes here\nimport pandas as pd\n\nlang_salary = []\nfor tr in table.find_all(\"tr\"):\n cells = tr.find_all(\"td\")\n row = [tr.text for tr in cells]\n lang_salary.append(row)\n\ndf = pd.DataFrame(lang_salary)\nnew_header = df.iloc[0]\ndf.columns = new_header\ndf = df[1:]\ndf.head()\ndf.to_csv(\"popular-languages.csv\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Authors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Ramesh Sannareddy\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Other Contributors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Rav Ahuja\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Change Log\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-10-17 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": " Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.7",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}