Skip to content

Instantly share code, notes, and snippets.

@lundeen-bryan
Last active March 25, 2024 19:57
Show Gist options
  • Select an option

  • Save lundeen-bryan/32495d61ca9bfab82fb89bbe8d8e096c to your computer and use it in GitHub Desktop.

Select an option

Save lundeen-bryan/32495d61ca9bfab82fb89bbe8d8e096c to your computer and use it in GitHub Desktop.
Boolean_Indexing_Masking_ChatGPT_Tutorial
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# UNDERSTANDING HOW BOOLEAN INDEXING WORKS\n",
"\n",
"Boolean Indexing or masking is a technique used to select a subset of the data based on some criteria. This is a difficult concept to grasp, but it is a powerful tool for data analysis. The following exercises were created by ChatGPT to help better understand how boolean indexing works."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"URL = \"https://gist.githubusercontent.com/sh7ata/e075ff35b51ebb0d2d577fbe1d19ebc9/raw/b966d02c7c26bcca60703acb1390e938a65a35cb/drinks.csv\"\n",
"df = pd.read_csv(URL)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1: Filtering Countries with Low Wine Servings\n",
"#### Task:\n",
"Filter the dataset to identify countries with wine servings below a certain threshold.\n",
"\n",
"#### Instructions:\n",
"1. Set a threshold for the minimum number of wine servings.\n",
"2. Use boolean indexing to create a mask that identifies countries with wine servings below the threshold.\n",
"3. Apply the mask to filter the dataset and display the countries with low wine servings.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Set the threshold for minimum wine servings\n",
"threshold = 50"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"I think the starting point here is to create a threashold value that we can test against in our filter. So for example, if we want to find wine servings below 50, we would set the threshold to 50.\n",
"\n",
"ChatGPT gives the following definition of a \"threshold\":\n",
"\n",
"A threshold is a value that is used to determine whether a value is \"above\" or \"below\" a certain value. For example, a threshold of 50 means that if a value is 50 or greater, it is considered to be \"above\" the threshold. If a value is less than 50, it is considered to be \"below\" the threshold...The threshold is a specific value or condition used to define the criteria for filtering the data. For example, in Exercise 1, the threshold could represent the minimum number of wine servings below which countries are considered to have low wine consumption."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Create a boolean mask for countries with wine servings below the threshold\n",
"low_wine_mask = df['wine_servings'] < threshold\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"Here we create a variable called low_wine_mask that is a boolean mask that identifies countries with wine servings below the threshold."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Apply the mask to filter the dataset\n",
"countries_with_low_wine = df[low_wine_mask]\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# What is the data type of countries_with_low_wine?\n",
"type(countries_with_low_wine)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>country</th>\n",
" <th>beer_servings</th>\n",
" <th>spirit_servings</th>\n",
" <th>wine_servings</th>\n",
" <th>total_litres_of_pure_alcohol</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Algeria</td>\n",
" <td>25.0</td>\n",
" <td>NaN</td>\n",
" <td>14.0</td>\n",
" <td>0.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Angola</td>\n",
" <td>217.0</td>\n",
" <td>57.0</td>\n",
" <td>45.0</td>\n",
" <td>5.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Antigua &amp; Barbuda</td>\n",
" <td>12.0</td>\n",
" <td>128.0</td>\n",
" <td>45.0</td>\n",
" <td>4.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Armenia</td>\n",
" <td>21.0</td>\n",
" <td>179.0</td>\n",
" <td>11.0</td>\n",
" <td>3.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Azerbaijan</td>\n",
" <td>21.0</td>\n",
" <td>46.0</td>\n",
" <td>5.0</td>\n",
" <td>1.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"2 Algeria 25.0 NaN 14.0 \n",
"4 Angola 217.0 57.0 45.0 \n",
"5 Antigua & Barbuda 12.0 128.0 45.0 \n",
"7 Armenia 21.0 179.0 11.0 \n",
"10 Azerbaijan 21.0 46.0 5.0 \n",
"\n",
" total_litres_of_pure_alcohol \n",
"2 0.7 \n",
"4 5.9 \n",
"5 4.9 \n",
"7 3.8 \n",
"10 1.3 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"countries_with_low_wine.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"After creating the threashold mask, we apply the mask to the dataset in a variable called countries_with_low_wine. I was curious to verify the data type of countries_with_low_wine which turns out to be a dataframe.\n",
"\n",
"I noticed that there are a few `NaN` values in the resulting dataframe. Since the focus of this exercise is to filter the dataset, we can safely ignore these `NaN` values."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Countries with wine servings below 50 :\n",
" country beer_servings spirit_servings wine_servings \\\n",
"2 Algeria 25.0 NaN 14.0 \n",
"4 Angola 217.0 57.0 45.0 \n",
"5 Antigua & Barbuda 12.0 128.0 45.0 \n",
"7 Armenia 21.0 179.0 11.0 \n",
"10 Azerbaijan 21.0 46.0 5.0 \n",
".. ... ... ... ... \n",
"187 Vanuatu 21.0 18.0 11.0 \n",
"188 Venezuela 333.0 1.0 3.0 \n",
"189 Vietnam 111.0 2.0 1.0 \n",
"191 Zambia 32.0 19.0 4.0 \n",
"192 Zimbabwe 64.0 18.0 4.0 \n",
"\n",
" total_litres_of_pure_alcohol \n",
"2 0.7 \n",
"4 5.9 \n",
"5 4.9 \n",
"7 3.8 \n",
"10 1.3 \n",
".. ... \n",
"187 0.9 \n",
"188 7.7 \n",
"189 2.0 \n",
"191 2.5 \n",
"192 4.7 \n",
"\n",
"[114 rows x 5 columns]\n"
]
}
],
"source": [
"# Display the filtered dataset\n",
"print(\n",
" \"Countries with wine servings below\", \n",
" threshold, \n",
" \":\\n\", \n",
" countries_with_low_wine\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2: Finding Countries with High Alcohol Consumption\n",
"#### Task:\n",
"Identify countries with high total alcohol consumption based on the sum of beer, spirit, and wine servings.\n",
"\n",
"#### Instructions:\n",
"1. Calculate the total alcohol consumption for each country by summing the servings of beer, spirit, and wine.\n",
"2. Set a threshold for the minimum total alcohol consumption.\n",
"3. Create a boolean mask to identify countries with total alcohol consumption above the threshold.\n",
"4. Apply the mask to filter the dataset and display the countries with high alcohol consumption.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Calculate total alcohol consumption for each country\n",
"df['total_alcohol'] = df['beer_servings'] + df['spirit_servings'] + df['wine_servings']\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Set the threshold for minimum total alcohol consumption\n",
"threshold = 400\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"Again we'll set a threshold value that we can test against in our filter. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Create a boolean mask for countries with total alcohol consumption above the threshold\n",
"high_alcohol_mask = df['total_alcohol'] > threshold\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Apply the mask to filter the dataset\n",
"countries_with_high_alcohol = df[high_alcohol_mask]\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Countries with total alcohol consumption above 400 :\n",
" country beer_servings spirit_servings wine_servings \\\n",
"3 Andorra 245.0 138.0 312.0 \n",
"6 Argentina 193.0 25.0 221.0 \n",
"8 Australia 261.0 72.0 212.0 \n",
"9 Austria 279.0 75.0 191.0 \n",
"15 Belarus 142.0 373.0 42.0 \n",
"16 Belgium 295.0 84.0 212.0 \n",
"23 Brazil 245.0 145.0 16.0 \n",
"25 Bulgaria 231.0 252.0 94.0 \n",
"44 Cyprus 192.0 154.0 113.0 \n",
"45 Czech Republic 361.0 17.0 134.0 \n",
"48 Denmark 224.0 81.0 278.0 \n",
"57 Estonia 224.0 194.0 59.0 \n",
"60 Finland 263.0 133.0 97.0 \n",
"62 Gabon 347.0 98.0 59.0 \n",
"65 Germany 346.0 117.0 175.0 \n",
"67 Greece 133.0 112.0 218.0 \n",
"68 Grenada 199.0 438.0 28.0 \n",
"75 Hungary 234.0 215.0 185.0 \n",
"81 Ireland 313.0 118.0 165.0 \n",
"93 Latvia 281.0 216.0 62.0 \n",
"98 Lithuania 343.0 244.0 56.0 \n",
"99 Luxembourg 236.0 133.0 271.0 \n",
"132 Paraguay 213.0 117.0 74.0 \n",
"135 Poland 343.0 215.0 56.0 \n",
"136 Portugal 194.0 67.0 339.0 \n",
"140 Romania 297.0 122.0 167.0 \n",
"141 Russian Federation 247.0 326.0 73.0 \n",
"144 St. Lucia 171.0 315.0 71.0 \n",
"151 Serbia 283.0 131.0 127.0 \n",
"155 Slovakia 196.0 293.0 116.0 \n",
"160 Spain 284.0 157.0 112.0 \n",
"182 United Kingdom 219.0 126.0 195.0 \n",
"184 USA 249.0 158.0 84.0 \n",
"\n",
" total_litres_of_pure_alcohol total_alcohol \n",
"3 12.4 695.0 \n",
"6 8.3 439.0 \n",
"8 1.4 545.0 \n",
"9 9.7 545.0 \n",
"15 14.4 557.0 \n",
"16 1.5 591.0 \n",
"23 7.2 406.0 \n",
"25 1.3 577.0 \n",
"44 8.2 459.0 \n",
"45 11.8 512.0 \n",
"48 1.4 583.0 \n",
"57 9.5 477.0 \n",
"60 1.0 493.0 \n",
"62 8.9 504.0 \n",
"65 11.3 638.0 \n",
"67 8.3 463.0 \n",
"68 11.9 665.0 \n",
"75 11.3 634.0 \n",
"81 11.4 596.0 \n",
"93 1.5 559.0 \n",
"98 12.9 643.0 \n",
"99 11.4 640.0 \n",
"132 7.3 404.0 \n",
"135 1.9 614.0 \n",
"136 11.0 600.0 \n",
"140 1.4 586.0 \n",
"141 11.5 646.0 \n",
"144 1.1 557.0 \n",
"151 9.6 541.0 \n",
"155 11.4 605.0 \n",
"160 1.0 553.0 \n",
"182 1.4 540.0 \n",
"184 8.7 491.0 \n"
]
}
],
"source": [
"# Display the filtered dataset\n",
"print(\"Countries with total alcohol consumption above\", threshold, \":\\n\", countries_with_high_alcohol)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"Note that we are only using alcohol consumption as a measurement for all other columns. It looks like columns of data in a dataset are called 'measures' or 'features' in the context of this exercise. For comparison, Power BI uses the term 'measures' to refer to columns of data that are used to define a 'feature' in a Python dataset."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment