
@alan-ho
Last active January 4, 2018 01:27
Creating scrapers and using them
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed the random generator so each run takes a different walk.
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Collect only internal article links: hrefs that begin with /wiki/
    # and contain no colon (this excludes File:, Category:, Talk: pages).
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

# Random walk: start at the Kevin Bacon article, repeatedly pick a
# random internal link from the current page and follow it.
links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
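The regular expression in getLinks does the filtering work: it keeps only internal article paths and drops namespace pages and external URLs. A minimal standalone sketch of that behavior (the sample hrefs below are illustrative, not taken from a live page):

```python
import re

# Same pattern as the scraper: paths beginning with /wiki/ that
# contain no colon (filters out File:, Category:, Talk: pages).
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kevin_Bacon",           # internal article -> kept
    "/wiki/File:Kevin_Bacon.jpg",  # file page (colon) -> dropped
    "/wiki/Category:Actors",       # category page -> dropped
    "http://example.com",          # external link -> dropped
]

matches = [href for href in samples if ARTICLE_LINK.match(href)]
print(matches)  # ['/wiki/Kevin_Bacon']
```

The negative lookahead ((?!:).)* consumes characters only while the next one is not a colon, so any namespaced path fails to reach the end-of-string anchor and is rejected.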