AutoScraper Examples
| ### Grouping results and removing unwanted ones | |
| Here we want to scrape the product name, price, and rating from eBay product pages: | |
| ```python | |
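| from autoscraper import AutoScraper | |
| url = 'https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/203084236670' | |
| wanted_list = ['Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black', 'US $349.99', '4.8'] | |
| # Learn scraping rules from one sample page and the values we expect to find on it | |
| scraper = AutoScraper() | |
| scraper.build(url, wanted_list) | |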
| ``` | |
| The items we wanted appear in multiple sections of the page, and the scraper tries to catch them all, so it may retrieve more information than we had in mind. | |
| Let's run it on a different page: | |
| ```python | |
| scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') | |
| ``` | |
| The result: | |
| ```python | |
| [ | |
| "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti", | |
| 'ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB', | |
| 'US $1,229.49', | |
| '5.0' | |
| ] | |
| ``` | |
| As we can see, there is one extra item here. Passing `grouped=True` to `get_result_exact` or `get_result_similar` groups the results by the scraping rule that produced them: | |
| ```python | |
| scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523', grouped=True) | |
| ``` | |
| Output: | |
| ```python | |
| { | |
| 'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"], | |
| 'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'], | |
| 'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'], | |
| 'rule_2ydq': ['US $1,229.49'], | |
| 'rule_buhw': ['5.0'], | |
| 'rule_vpfp': ['5.0'] | |
| } | |
| ``` | |
| Now we can use the `keep_rules` or `remove_rules` method to prune unwanted rules: | |
| ```python | |
| scraper.keep_rules(['rule_sks3', 'rule_2ydq', 'rule_buhw']) | |
| scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') | |
| ``` | |
| Now the result contains only the items we want: | |
| ```python | |
| [ | |
| "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti", | |
| 'US $1,229.49', | |
| '5.0' | |
| ] | |
| ``` |
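
The rule IDs passed to `keep_rules` in this example were picked by hand from the grouped output. When several rules return the same value, that choice could also be made in code. The sketch below is one possible approach, not part of AutoScraper: `select_rules` is a hypothetical helper that, for each value we care about, keeps the first rule whose results contain it. It works on a copy of the grouped output shown earlier, so it runs without network access.

```python
def select_rules(grouped, wanted_values):
    """Return, for each wanted value, the first rule ID whose results contain it.

    Hypothetical helper for illustration; it is not an AutoScraper method.
    """
    selected = []
    for value in wanted_values:
        for rule_id, results in grouped.items():
            if value in results:
                if rule_id not in selected:
                    selected.append(rule_id)
                break  # later rules returning the same value are duplicates
    return selected


# Grouped output copied from the example in this document.
grouped = {
    'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"],
    'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_2ydq': ['US $1,229.49'],
    'rule_buhw': ['5.0'],
    'rule_vpfp': ['5.0'],
}

# The values we actually want from this page: title, price, rating.
wanted = [
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'US $1,229.49',
    '5.0',
]

rules_to_keep = select_rules(grouped, wanted)
print(rules_to_keep)  # ['rule_sks3', 'rule_2ydq', 'rule_buhw']
```

The resulting list can then be passed to `scraper.keep_rules(rules_to_keep)`. Keeping the first matching rule is an arbitrary tie-break; two rules that agree on this page may diverge on another, so it is worth checking the pruned scraper against a few more pages.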