@dkubb
Created June 27, 2014 23:19

Revisions

  1. dkubb created this gist Jun 27, 2014.
    35 changes: 35 additions & 0 deletions sitemap.rb
    @@ -0,0 +1,35 @@
    #!/usr/bin/ruby

    require 'rubygems'
    require 'mechanize'
    require 'addressable/uri'

    ROOT_URL = Addressable::URI.parse(ARGV.fetch(0)).freeze

    # Disable SSL certificate verification
    I_KNOW_THAT_OPENSSL_VERIFY_PEER_EQUALS_VERIFY_NONE_IS_WRONG = nil
    OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

    agent = Mechanize.new
    uris = File.readlines(ARGV.fetch(1)).map(&:chomp).uniq
    seen = {}

    while uri = uris.pop
      seen[uri] = true
      begin
        agent.get(uri) do |page|
          next unless page.respond_to?(:content_type) &&
            page.content_type =~ %r{\Atext/html\b}

          puts page.uri

          page.links_with(href: %r{\A/}).each do |link|
            link_uri = ROOT_URL.join(link.href).merge!(fragment: nil).normalize!
            next if link_uri.host != ROOT_URL.host || link_uri.fragment || seen.key?(link_uri.to_s)
            uris << link_uri.to_s unless uris.include?(link_uri.to_s)
          end
        end
      rescue Mechanize::UnauthorizedError, Mechanize::ResponseCodeError
        # do nothing
      end
    end