@dannguyen
Last active June 13, 2022 20:04
A Bash script, using the jq JSON parser, to scrape all the NHTSA 5-star vehicle ratings from its API: http://www.nhtsa.gov/webapi/Default.aspx?SafetyRatings/API/5
## Note: this is deprecated. jq is still awesome, so now we just get JSON all the way
# jq JSON parser is awesome:
# http://stedolan.github.io/jq/
# The NHTSA API is pretty clunky: you have to get a list of all the years, then all the makes in each year,
# then all the models per make, and only then do you get the vehicle IDs needed to query the endpoint
# for one vehicle at a time.
#
# I query for JSON for most of the loop, and in the end I get the vehicle data in CSV format.
# Note: there are a lot of errors in the API, because the NHTSA doesn't properly escape the "/" in a car's name,
# along with many other whitespace-related errors.
BURL='http://www.nhtsa.gov/webapi/api/SafetyRatings'
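# The year -> make -> model -> vehicle ID walk described above boils down to appending path
# segments to the base URL. A minimal sketch: api_url is a hypothetical helper name (it is not
# part of the NHTSA API or of this script), and it assumes the segments are already URL-safe.

```shell
# Base URL as used in this script; an already-set BURL is kept.
BURL=${BURL:-'http://www.nhtsa.gov/webapi/api/SafetyRatings'}

# api_url SEGMENT... : join path segments onto $BURL and ask for JSON.
api_url() {
  url=$BURL
  for seg in "$@"; do
    url="$url/$seg"
  done
  printf '%s?format=json\n' "$url"
}

# Example: build the URL that lists models for one year and make.
api_url modelyear 2014 make ACURA
```

# With no arguments, api_url returns the same URL the script uses to list the years.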
# get all the years first
curl -s "$BURL?format=json" | jq -r '.Results[] .ModelYear' | \
while read -r year; do
  echo "$year"
  echo "######"
  # Get the makes for this year
  curl -s "$BURL/modelyear/$year?format=json" | jq -r '.Results[] .Make' | sed 's/ /%20/g' | sed 's/&/_/g' | \
  while read -r carmake; do
    echo " $carmake"
    echo " ======="
    # Get the models for this year and make
    curl -s "$BURL/modelyear/$year/make/$carmake?format=json" | jq -r '.Results[] .Model' | sed 's/ /%20/g' | sed 's/&/_/g' | \
    while read -r model; do
      echo " $model"
      echo " -------"
      # Get the vehicle IDs for this year, make, and model
      curl -s "$BURL/modelyear/$year/make/$carmake/model/$model?format=json" | jq -r '.Results[] .VehicleId' | \
      while read -r id; do
        echo " $id: $year - $carmake - $model"
        curl -s "$BURL/VehicleId/$id?format=csv" -o "$id.csv"
      done
      echo ' '
    done
  done
done
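# The sed calls in the loop only handle spaces and "&". A more general option is to percent-encode
# each name before it goes into the URL. A minimal POSIX-shell sketch: urlencode is a hypothetical
# helper, and whether the NHTSA endpoints accept an encoded "/" or "&" is an untested assumption.

```shell
# urlencode STRING : percent-encode every byte except the RFC 3986 unreserved characters.
urlencode() {
  # od prints each byte as a 3-digit octal number; tr puts one number per line.
  printf '%s' "$1" | od -An -v -to1 | tr -s ' ' '\n' | while read -r oct; do
    [ -n "$oct" ] || continue
    ch=$(printf "\\$oct")          # turn the octal code back into the character
    case "$ch" in
      [A-Za-z0-9._~-]) printf '%s' "$ch" ;;
      *) printf '%%%02X' "0$oct" ;;   # the leading 0 makes printf read the value as octal
    esac
  done
  echo
}

urlencode 'LAND ROVER'
```

# In the loop this would replace the two sed calls, e.g. carmake=$(urlencode "$carmake").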
# As it turns out, the CSV produced by NHTSA is broken.
# So now, let's just iterate through all possible vehicle IDs and fetch JSON (assuming no ID goes past 10000),
# then use jq to collect all possible keys (which vary widely)
# and map every result to that array of keys.
mkdir -p json/vehicles
for id in $(seq 1 10000); do
echo "$id.json"
curl -sS "http://www.nhtsa.gov/webapi/api/SafetyRatings/VehicleId/$id?format=json" -o "json/vehicles/$id.json"
done
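# Ten thousand requests take a while, so it can help to make the download loop restartable.
# A minimal sketch, not what this script does: fetch_vehicle and OUT_DIR are hypothetical names,
# and it assumes the API answers missing IDs with an HTTP error status, which may not hold.

```shell
BURL=${BURL:-'http://www.nhtsa.gov/webapi/api/SafetyRatings'}
OUT_DIR=${OUT_DIR:-json/vehicles}

# fetch_vehicle ID : download one vehicle record unless it is already on disk.
fetch_vehicle() {
  id=$1
  out="$OUT_DIR/$id.json"
  mkdir -p "$OUT_DIR"
  # Skip IDs that were already saved, so the loop can be re-run after an interruption.
  [ -s "$out" ] && return 0
  # --fail makes curl exit non-zero on HTTP errors instead of saving the error page.
  if curl -sS --fail "$BURL/VehicleId/$id?format=json" -o "$out.part"; then
    mv "$out.part" "$out"
  else
    rm -f "$out.part"
    return 1
  fi
}

# Usage (not run here):
#   for id in $(seq 1 10000); do fetch_vehicle "$id" || echo "failed: $id" >&2; done
```

# Writing to a .part file first means a half-finished download never looks like a complete one.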
# remove bad json, i.e. responses that are actually HTML pages
# (rm -f stays quiet if no file matched)
find ./json/vehicles -name "*.json" | xargs grep -l '</html>' | xargs rm -f
# get the keys
allkeys=$(find ./json/vehicles -name "*.json" | xargs cat | jq --sort-keys -r 'select(.Count == 1) .Results[0] | keys | @csv' | grep -oE '[[:alnum:]]+' | sort | uniq | sed -E 's/^/./' | paste -s -d ',' -)
# write the header row (the key names without their leading dots)
echo "$allkeys" | tr -d '.' > all-vehicles.csv
find ./json/vehicles -name "*.json" | xargs cat | jq --sort-keys -r "select(.Count == 1) .Results | map($allkeys) | @csv" | csvfix echo >> all-vehicles.csv
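# The collect-the-keys-then-map step can also be done in a single jq program, with no shell
# interpolation and no csvfix. A minimal sketch, not a drop-in replacement: vehicles_to_csv is a
# hypothetical helper, it assumes only valid JSON responses are left, and @csv will reject any
# record whose values are nested arrays or objects.

```shell
# vehicles_to_csv : read concatenated API responses on stdin, write CSV on stdout.
vehicles_to_csv() {
  jq -rs '
    # Keep only responses that describe exactly one vehicle.
    map(select(.Count == 1) | .Results[0]) as $rows
    # Union of every key seen in any record, sorted.
    | ($rows | map(keys) | add // [] | unique) as $cols
    # Header row, then one row per vehicle; missing keys become empty cells.
    | ($cols | @csv), ($rows[] | [.[$cols[]]] | @csv)
  '
}

# Usage (not run here):
#   find ./json/vehicles -name "*.json" | xargs cat | vehicles_to_csv > all-vehicles.csv
```

# The -s flag slurps all responses into one array, which is what lets jq see every key before writing the header.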