@icco
Last active August 29, 2015 14:22

Revisions

  1. icco renamed this gist Jun 11, 2015. 1 changed file with 12 additions and 2 deletions.
    14 changes: 12 additions & 2 deletions scrap.rb → scrape.rb
    @@ -25,10 +25,20 @@
       request = Typhoeus::Request.new(url, followlocation: true)
       request.on_complete do |response|
         # https://unsplash.com/photos/TXG9VLN1J9U/download
    -    file_name = response.effective_url.split("/").last
    +    url_name = response.effective_url.split("/").last
    +
    +    ext = File.extname(url_name)
    +    name = File.basename(url_name, ext)
    +
    +    # Get the correct ext for this file
    +    content_type = response.headers["Content-Type"]
    +    new_ext = MIME::Types[content_type].first.extensions.first
    +
    +    file_name = "#{name.downcase.gsub(/[^a-z0-9]/, '')}.#{new_ext.downcase}"
    +    p file_name
         File.write(file_name, response.body)
       end
       hydra.queue request
     end

    -hydra.run
    +hydra.run
  2. icco created this gist Jun 11, 2015.
    34 changes: 34 additions & 0 deletions scrap.rb
    @@ -0,0 +1,34 @@
    +#! /usr/bin/env ruby
    +
    +require 'nokogiri'
    +require 'open-uri'
    +require 'net/http'
    +require 'typhoeus'
    +
    +urls = []
    +(1...20).each do |i|
    +  doc = Nokogiri::HTML(open("https://unsplash.com/grid?page=#{i}"))
    +  doc.css('a').each do |a|
    +    h = a.attributes["href"]
    +    if h and h.value.include? "/download"
    +      urls.push "https://unsplash.com#{h.value}"
    +    end
    +  end
    +end
    +
    +thread_count = 10
    +urls = urls.sort.uniq
    +
    +hydra = Typhoeus::Hydra.new(max_concurrency: thread_count)
    +
    +urls.each do |url|
    +  request = Typhoeus::Request.new(url, followlocation: true)
    +  request.on_complete do |response|
    +    # https://unsplash.com/photos/TXG9VLN1J9U/download
    +    file_name = response.effective_url.split("/").last
    +    File.write(file_name, response.body)
    +  end
    +  hydra.queue request
    +end
    +
    +hydra.run
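
The rename revision derives each file name from the final URL and the `Content-Type` header. That chain raises if the header is missing, carries parameters such as `; charset=...`, or names a type the lookup does not know. Below is a minimal sketch of a more defensive version, not the author's code. `safe_file_name`, `FALLBACK_EXTENSIONS`, and `default_ext` are hypothetical names, and the small hash is a stdlib-only stand-in for the `MIME::Types` lookup used in the gist.

```ruby
# Hypothetical helper, not part of the gist: builds a safe local file name
# from the final (post-redirect) URL and the Content-Type header value.

# Assumption: a tiny stand-in table. The gist itself looks the extension up
# through MIME::Types from the mime-types gem.
FALLBACK_EXTENSIONS = {
  "image/jpeg" => "jpg",
  "image/png"  => "png",
  "image/gif"  => "gif"
}.freeze

def safe_file_name(effective_url, content_type, default_ext: "bin")
  # Drop any query string, then keep the last path segment.
  url_name = effective_url.to_s.split("?").first.to_s.split("/").last.to_s
  name = File.basename(url_name, File.extname(url_name))

  # The header may be nil or carry parameters ("image/jpeg; charset=binary").
  media_type = content_type.to_s.split(";").first.to_s.strip.downcase
  ext = FALLBACK_EXTENSIONS.fetch(media_type, default_ext)

  # Same normalisation as the gist: lowercase, alphanumerics only.
  slug = name.downcase.gsub(/[^a-z0-9]/, "")
  slug = "download" if slug.empty?

  "#{slug}.#{ext}"
end
```

Inside the `on_complete` block this could be called as `safe_file_name(response.effective_url, response.headers["Content-Type"])`. Since the response bodies are images, `File.binwrite` may also be a safer choice than `File.write` for saving them.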