Created
June 27, 2014 23:19
dkubb created this gist
Jun 27, 2014
#!/usr/bin/ruby

require 'rubygems'
require 'mechanize'
require 'addressable/uri'

ROOT_URL = Addressable::URI.parse(ARGV.fetch(0)).freeze

# Disable SSL certificate verification
I_KNOW_THAT_OPENSSL_VERIFY_PEER_EQUALS_VERIFY_NONE_IS_WRONG = nil
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

agent = Mechanize.new
uris  = File.readlines(ARGV.fetch(1)).map(&:chomp).uniq
seen  = {}

while uri = uris.pop
  seen[uri] = true

  begin
    agent.get(uri) do |page|
      next unless page.respond_to?(:content_type) && page.content_type =~ %r{\Atext/html\b}

      puts page.uri

      page.links_with(href: %r{\A/}).each do |link|
        link_uri = ROOT_URL.join(link.href).merge!(fragment: nil).normalize!
        next if link_uri.host != ROOT_URL.host || link_uri.fragment || seen.key?(link_uri.to_s)
        uris << link_uri.to_s unless uris.include?(link_uri.to_s)
      end
    end
  rescue Mechanize::UnauthorizedError, Mechanize::ResponseCodeError
    # do nothing
  end
end
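
The link-filtering step in the script can be tried in isolation, without Mechanize or network access. The following is a minimal sketch, assuming Ruby's standard-library `URI` as a stand-in for Addressable; `same_site_uri` is a hypothetical helper name that does not appear in the gist. It also shows why the host check matters: an href such as `//other.example/x` begins with `/` and so passes the `%r{\A/}` filter, yet resolves to a different host.

```ruby
require 'uri'

# Hypothetical helper (not part of the gist): resolve href against root,
# drop the fragment, normalize, and return the absolute URI as a string.
# Returns nil when the resolved URI points at a different host.
def same_site_uri(root, href)
  uri = URI.join(root, href)
  uri.fragment = nil   # "#top" and friends identify the same page
  uri.normalize!       # lowercases scheme and host in place
  return nil unless uri.host == URI(root).host

  uri.to_s
end

puts same_site_uri('https://example.com/', '/docs/page.html#top').inspect
puts same_site_uri('https://example.com/', '//other.example/x').inspect
```

Storing the normalized string rather than the URI object keeps the `seen` hash and the `uris` queue comparable by plain string equality, which is what the script relies on.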