Created
December 8, 2020 19:27
-
-
Save burch-cm/1fadc652a842ec3465fab9c9be6feb9a to your computer and use it in GitHub Desktop.
Extracting Vehicle Weights (or any numbers, really) from Text Descriptions
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # Vehicles from the GSA have a gross vehicle weight rating (GVWR), but this is stored in the vehicle description in inconsistent | |
| # text formats (sometimes spaces as separators, sometimes commas, sometimes no separators). One consistent thing is that the weights | |
| # always end with 'LBS', which means it's easy enough to write a regex that can extract them from the text. | |
| # Same GSA vehicle description | |
| sample_text <- c("4x4 SUV 10,000 LBS GVWR", "2x4 PICKUP 7 001 LBS GVWR", "SEDAN 6000LBS") | |
| # using {stringr} and regex: | |
| library(stringr) | |
| library(magrittr) | |
| # First, extract digits from the text where the string of digits ends with 'LBS'. | |
| # Allow for optional spaces and commas as well. | |
| sample_text %>% | |
| str_extract(regex("[\\d(,?) ]+(?= *LBS)")) | |
| # this leaves me with c(" 10,000 ", " 7 001 ", " 6000") | |
| # first, remove spaces and commas | |
| # stringr::str_remove_all() takes the regex() argument as well, and makes this easy | |
| sample_text %>% | |
| str_extract(regex("[\\d(,?) ]+(?= *LBS)")) %>% | |
| str_remove_all(regex("[ ,]")) | |
| # which yields c("10000", "7001", "6000"). Much better. | |
| # Finally, convert to numeric so that it can be used in calculations. | |
| sample_text %>% | |
| str_extract(regex("[\\d(,?) ]+(?= *LBS)")) %>% | |
| str_remove_all(regex("[ ,]")) %>% | |
| as.numeric() | |
| # c(10000, 7001, 6000) | |
| # Done! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment