@jaredhirsch
Last active January 22, 2021 20:55
Revisions

  1. jaredhirsch revised this gist Dec 21, 2012. 1 changed file with 1 addition and 3 deletions.
    4 changes: 1 addition & 3 deletions howto-detumblrize.mkd
    @@ -106,14 +106,12 @@ to the config file when they seem to be really working.
    Well, yes and no. Recursion is hard. It followed a bunch of links off of my
    website, which is linked from my tumblr. Dammit.

    I do like excluding quantserve, though. I'll add that to the config file.

    What I actually want is not a recursive download. What I want is to download
    some-site.tumblr.com/page/1 up to /page/104. So let's just do that.

    ## One page plus some bash

    We just need a little for-loop at the bash prompt to get this done.
    We need a little for-loop at the bash prompt to get this done.

    A bit of googling turns up the one-liner to iterate over a for-loop n times:

  2. jaredhirsch created this gist Dec 21, 2012.
    153 changes: 153 additions & 0 deletions howto-detumblrize.mkd
    @@ -0,0 +1,153 @@
    TL;DR skip to the summary for the bash one-liner if so inclined :-)

    ## Why & wherefore

    I have an old tumblr that I want to back up. I don't use it regularly, but
    I want to have a backup of the whole site, design as well as content, just
    in case; it's not just the pictures I love, but the whole thing. I'm not
    particularly worried about downloading the streaming audio--can't help you
    there, gentle reader.

    This is the story of a man, a man page, and a page. And lots of other pages.


    ## Downloading just one page (and all the images, fonts, JS, CSS)

    Before trying to download the whole site, I started with just one page. I want
    all the assets, not just the images or underlying HTML, and I want the links
    converted, so I can look at the files offline.

    I found this swell incantation in the wget man page:

    wget -E -H -k -K -p http://<site>/<document>

    It worked, but not perfectly:

    * Unfortunately, the -E (--adjust-extension) option renamed my webfonts from
    Blah.eot to Blah.eot.html. Not cool, bro. The image names came out goofy as
    well, so it didn't really help at all.

    * This didn't do any rate-limiting. AWS hosts all of tumblr's images, so
    I'd guess they block an IP that goes bananas with downloading. We can tell
    wget to pause between downloads using --wait=<time>, randomize the pauses
    using --random-wait, and limit overall download speed using --limit-rate.
    This is all 'good citizen' stuff, and should help keep my IP from being
    throttled or blocked by AWS or tumblr :-)

    * By default, wget is ridiculously verbose. The -nv option makes it less so,
    but not totally silent. (You can use -q to make it perfectly quiet.)

    Adjusting the options (but keeping that shit alphabetical, so I don't go
    crazy trying to find options as this gets longer) we have:

    wget -H -k -K --limit-rate=500k -nv -p --random-wait --wait=0.5 http://<site>/<document>

    That's better. Rate limiting and waiting increased download time by a factor
    of three, from 6 seconds to 18 seconds. I can deal with that. So far, so good.
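As a rough sketch, the same options can be wrapped in a small shell function with one option per line, which makes it easier to see what each flag is for. The name `fetch_page` is made up for illustration; the flags are the ones from the command above.

```shell
# fetch_page is a hypothetical helper name, not part of wget.
fetch_page() {
  # -H   span hosts, so requisites on other domains get fetched
  # -k   convert links so the saved page works offline
  # -K   keep a .orig backup of each file before converting it
  # -nv  less chatty output
  # -p   fetch page requisites (images, CSS, JS, fonts)
  wget \
    -H -k -K \
    --limit-rate=500k \
    -nv \
    -p \
    --random-wait \
    --wait=0.5 \
    "$1"
}

# Usage (placeholder URL, as in the text):
# fetch_page "http://<site>/<document>"
```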

    It might be a little more fun to let 'em know I'm not just some Chinese
    botnet instance of wget, but a friendly hacker taking care of business.
    So maybe a custom user-agent with a nod to Firefox and a reference to some
    made-up thing called Sunflower (with the golden ratio its version number ;-).

    --user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398"

    and a custom header:

    --header="Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff."

    and, of course, a From header:

    --header="From: Jared 'the dragon' Hirsch <ohai@6a68.net>"

    At this point, the command line is getting ridiculously long. Some of wget's
    options can be stuffed in a config file--happily, all the ones I care about:

    # moving stuff into wgetrc
    header = From: 'Jared "the dragon" Hirsch' <ohai@6a68.net>
    header = Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff.
    user_agent = Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398
    # same as -H
    span_hosts = on
    # same as -k
    convert_links = on
    # same as -K
    backup_converted = on
    limit_rate = 500k
    # same as -nv
    verbose = off
    # same as -p
    page_requisites = on
    random_wait = on
    wait = 0.5

    And calling wget is as simple as

    wget --config=my-wgetrc some-site.tumblr.com
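If you'd rather create the config file from the prompt than from an editor, a here-document is one way to do it. This is only a sketch: it writes `my-wgetrc` into the current directory, and it leaves out the custom headers and user agent to stay short.

```shell
# Write the settings to my-wgetrc in the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside.
cat > my-wgetrc <<'EOF'
# same as -H
span_hosts = on
# same as -k
convert_links = on
# same as -K
backup_converted = on
limit_rate = 500k
# same as -nv
verbose = off
# same as -p
page_requisites = on
random_wait = on
wait = 0.5
EOF
```

After that, the `wget --config=my-wgetrc ...` call picks the settings up from the file.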

    ## A few pages

    What I want next is to try downloading a few linked pages, without leaving the
    site and going to some other site (like the links to Bill Israel's site, which
    I left in the page, as I modified his design).

    Let's start with "--recursive" and "--level=2", to go two clicks deep.

    Also, I'm sick of downloading tracking JS code, so let's exclude quantcast with
    a little "--exclude-domains=quantserve.com". Bill Israel is a great guy (I
    imagine anyway), but I don't want his stuff: exclude cubicle17.com as well.

    I'm going to try these things at the command line, at first, and promote them
    to the config file when they seem to be really working.

    wget --config=my-wgetrc --recursive --level=2 \
    --exclude-domains=quantserve.com,cubicle17.com some-site.tumblr.com

    Did it work? Well, yes and no. Recursion is hard. wget followed a bunch of
    links off of my website, which is linked from my tumblr. Dammit.

    I do like excluding quantserve, though. I'll add that to the config file.
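Promoting the exclusion to the config file could look something like this. It's a sketch that assumes the config file is called `my-wgetrc` and lives in the current directory; `exclude_domains` is the wgetrc spelling of `--exclude-domains`.

```shell
# Append the exclusion to the config file (the file is created if missing).
printf '%s\n' \
  '# same as --exclude-domains' \
  'exclude_domains = quantserve.com' >> my-wgetrc
```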

    What I actually want is not a recursive download. What I want is to download
    some-site.tumblr.com/page/1 up to /page/104. So let's just do that.

    ## One page plus some bash

    We just need a little for-loop at the bash prompt to get this done.

    A bit of googling turns up the one-liner for running a loop body n times:

    for i in $(seq 1 n); do echo $i; done

    In my case, then, where n goes up to 104, and with the config file from
    earlier, it should be something like:

    for i in $(seq 1 104); do wget --config=my-wgetrc some-site.tumblr.com/page/$i; done
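Before hitting the server 104 times, it may be worth a dry run that just prints the URLs the loop would visit. A minimal sketch, where `list_pages` is a made-up helper name and the page count of 3 is only for the demo:

```shell
# Dry run: print the URLs instead of fetching them.
# list_pages is a hypothetical helper; args are the site and the last page number.
list_pages() {
  for i in $(seq 1 "$2"); do
    echo "$1/page/$i"
  done
}

list_pages some-site.tumblr.com 3
```

If the list looks right, swap the echo for the real wget call.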

    And that should do it. This isn't going to download flash streaming music, but
    I can live with that.

    ## Summary

    If you want to download some-site.tumblr.com/page/1 to some-site.tumblr.com/page/N,
    you can set up a wget config file:

    # stuff in my-wgetrc
    # same as -H
    span_hosts = on
    # same as -k
    convert_links = on
    # same as -K
    backup_converted = on
    limit_rate = 500k
    # same as -nv
    verbose = off
    # same as -p
    page_requisites = on
    random_wait = on
    wait = 0.5

    Then at a Linux/Mac command line, do:

    for i in $(seq 1 N); do wget --config=my-wgetrc some-site.tumblr.com/page/$i; done

    And that'll fetch your precious.
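With a hundred-odd pages, one of them may well fail along the way. Here is one possible variation on the loop that notes failed pages on stderr and keeps going. The function name `fetch_pages` and the `WGET` override are assumptions made for this sketch, not wget features; setting `WGET=echo` prints the commands instead of running them.

```shell
# fetch_pages SITE LAST: fetch SITE/page/1 .. SITE/page/LAST.
# Hypothetical helper. WGET is an assumed override hook for dry runs.
fetch_pages() {
  for i in $(seq 1 "$2"); do
    "${WGET:-wget}" --config=my-wgetrc "$1/page/$i" \
      || echo "page $i failed" >&2
  done
}

# Usage: fetch_pages some-site.tumblr.com N
# Dry run: WGET=echo fetch_pages some-site.tumblr.com N
```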