jaredhirsch · January 22, 2021 20:55 · Dec 21, 2012 · Dec 21, 2012
diff --git a/howto-detumblrize.mkd b/howto-detumblrize.mkd
@@ -106,14 +106,12 @@ to the config file when they seem to be really working.
 Well, yes and no. Recursion is hard. It followed a bunch of links off of my
 website, which is linked from my tumblr. Dammit.
 
-I do like excluding quantserve, though. I'll add that to the config file.
-
 What I actually want is not a recursive download. What I want is to download
 some-site.tumblr.com/page/1 up to /page/104. So let's just do that.
 
 ## One page plus some bash
 
-We just need a little for-loop at the bash prompt to get this done.
+We need a little for-loop at the bash prompt to get this done.
 
 A bit of googling turns up the one-liner to iterate over a for-loop n times:
 

diff --git a/howto-detumblrize.mkd b/howto-detumblrize.mkd
@@ -0,0 +1,153 @@
+TL;DR skip to the summary for the bash one-liner if so inclined :-)
+
+## Why & wherefore
+
+I have an old tumblr that I want to back up. I don't use it regularly, but
+I want to have a backup of the whole site, design as well as content, just
+in case; it's not just the pictures I love, but the whole thing. I'm not
+particularly worried about downloading the streaming audio--can't help you
+there, gentle reader.
+
+This is the story of a man, a man page, and a page. And lots of other pages.
+
+
+## Downloading just one page (and all the images, fonts, JS, CSS)
+
+Before trying to download the whole site, I started with just one page. I want
+all the assets, not just the images or underlying HTML, and I want the links
+converted, so I can look at the files offline.
+
+I found this swell incantation in the wget man page:
+
+    wget -E -H -k -K -p http://<site>/<document> 
+
+It worked, but not perfectly:
+
+* Unfortunately, the -E (--adjust-extension) option renamed my webfonts from
+Blah.eot to Blah.eot.html. Not cool, bro. The image names came out goofy as
+well, so it didn't really help at all.
+
+* This didn't do any rate-limiting. AWS hosts all of tumblr's images, so
+I'm sure they block an IP that goes bananas with downloading. We can tell it
+to pause between downloads using --wait=<time>, randomize the pauses
+using --random-wait, and limit overall download speed using --limit-rate.
+This is all 'good citizen' stuff, and prevents my IP being throttled or blocked
+by AWS or tumblr :-)
+
+* By default, wget is ridiculously verbose. The -nv option makes it less so,
+but not totally silent. (You can use -q to make it perfectly quiet.)
+
+Adjusting the options (but keeping that shit alphabetical, so I don't go
+crazy trying to find options as this gets longer) we have:
+
+    wget -H -k -K --limit-rate=500k -nv -p --random-wait --wait=0.5 http://<site>/<document>
+
+That's better. Rate limiting and waiting increased download time by a factor
+of three, from 6 seconds to 18 seconds. I can deal with that. So far, so good.
+
+It might be a little more fun to let 'em know I'm not just some Chinese
+botnet instance of wget, but a friendly hacker taking care of business.
+So maybe a custom user-agent with a nod to Firefox and a reference to some
+made-up thing called Sunflower (with the golden ratio its version number ;-).
+
+    --user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398"
+
+and a custom header: 
+
+    --header="Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff."
+
+and, of course, a From header:
+
+    --header="From: Jared 'the dragon' Hirsch <ohai@6a68.net>"
+
+At this point, the command line is getting ridiculously long. Some of wget's
+options can be stuffed in a config file--happily, all the ones I care about:
+
+    # moving stuff into wgetrc
+    header = From: 'Jared "the dragon" Hirsch' <ohai@6a68.net>
+    header = Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff.
+    user_agent = Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398
+    # same as -H  
+    span_hosts = on         
+    # same as -k
+    convert_links = on      
+    # same as -K
+    backup_converted = on   
+    limit_rate = 500k       
+    # same as -nv
+    verbose = off           
+    # same as -p
+    page_requisites = on    
+    random_wait = on
+    wait = 0.5
+
+And calling wget is as simple as
+
+    wget --config=my-wgetrc some-site.tumblr.com
+
+## A few pages
+
+What I want next is to try downloading a few linked pages, without leaving the
+site and going to some other site (like the links to Bill Israel's site, which
+I left in the page, as I modified his design).
+
+Let's start with "--recursive" and "--level 2", to go two clicks deep.
+
+Also, I'm sick of downloading tracking JS code, so let's exclude quantcast with
+a little "--exclude-domains=quantserve.com". Bill Israel is a great guy (I 
+imagine anyway), but I don't want his stuff: exclude cubicle17.com as well.
+
+I'm going to try these things at the command line, at first, and promote them
+to the config file when they seem to be really working.
+
+    wget --config=my-wgetrc --recursive --level=2 \
+        --exclude-domains=quantserve.com,cubicle17.com some-site.tumblr.com
+
+Well, yes and no. Recursion is hard. It followed a bunch of links off of my
+website, which is linked from my tumblr. Dammit.
+
+I do like excluding quantserve, though. I'll add that to the config file.
+
+What I actually want is not a recursive download. What I want is to download
+some-site.tumblr.com/page/1 up to /page/104. So let's just do that.
+
+## One page plus some bash
+
+We just need a little for-loop at the bash prompt to get this done.
+
+A bit of googling turns up the one-liner to iterate over a for-loop n times:
+
+    for i in $(seq 1 n); do echo $i; done
+
+In my case, then, where n goes up to 104, it should be something like:
+
+    for i in $(seq 1 104); do wget some-site.tumblr.com/page/$i; done
+
+And that should do it. This isn't going to download flash streaming music, but
+I can live with that.
+
+## Summary
+
+If you want to download some-site.tumblr.com/page/1 to some-site.tumblr.com/page/N,
+you can set up a wget config file:
+
+    # stuff in my-wgetrc
+    # same as -H  
+    span_hosts = on         
+    # same as -k
+    convert_links = on      
+    # same as -K
+    backup_converted = on   
+    limit_rate = 500k       
+    # same as -nv
+    verbose = off           
+    # same as -p
+    page_requisites = on    
+    random_wait = on
+    wait = 0.5
+
+Then at a Linux/Mac command line, do:
+
+    for i in $(seq 1 N); do wget --config=my-wgetrc some-site.tumblr.com/page/$i; done
+
+And that'll fetch your precious.
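
A possible refinement of the summary loop, offered as a sketch and not as part of the committed file: wrapping it in a function turns the site, page count, and config file into parameters, and a dry-run switch lets you eyeball the commands before hitting tumblr. The names `fetch_pages` and `DRY_RUN` are made up for illustration; `my-wgetrc` is the config file from the summary.

```shell
# Hypothetical helper (not in the original howto): fetch /page/1 .. /page/N.
# Usage: fetch_pages SITE LAST_PAGE [WGETRC]
fetch_pages() {
    site=$1
    last=$2
    config=${3:-my-wgetrc}   # defaults to the config file from the summary
    i=1
    while [ "$i" -le "$last" ]; do
        # With DRY_RUN set to a non-empty value, print the command instead of running it.
        ${DRY_RUN:+echo} wget --config="$config" "$site/page/$i"
        i=$((i + 1))
    done
}
```

At a bash prompt, `DRY_RUN=1 fetch_pages some-site.tumblr.com 3` should only print the three wget commands; dropping `DRY_RUN=1` and passing 104 does the real download.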