@tienhv
Created April 6, 2016 15:35
Concatenate files and remove duplicates at the same time

Source: http://stackoverflow.com/questions/16873669/combine-multiple-text-files-and-remove-duplicates

First off, you're not using the full power of cat. The loop can be replaced by just
cat data/* > dnsFull
assuming that file is initially empty.
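As a minimal sketch (the sample file names and contents below are invented for illustration, and the loop shown is a guess at what the original script looked like), the per-file loop and the single cat call produce the same concatenation:

```shell
# Hypothetical sample data; 'data/one' and 'data/two' are made up.
mkdir -p data
printf 'a.example\nb.example\n' > data/one
printf 'b.example\nc.example\n' > data/two

# Presumed original loop: one cat process per file.
: > dnsFull_loop                          # truncate first, since >> appends
for f in data/*; do cat "$f" >> dnsFull_loop; done

# Single call: cat handles all files itself.
cat data/* > dnsFull

cmp -s dnsFull_loop dnsFull && echo "identical"
```

Both versions write the files in the same (lexicographic glob) order, so the outputs are byte-for-byte identical.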
Then there are all those temporary files, which force programs to wait on the hard disk (commonly the slowest part of a modern computer system). Use a pipeline:
cat data/* | sort | uniq > dnsOut
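A tiny self-contained run (sample data invented for illustration) shows the pipeline at work: sort brings duplicate lines together, and uniq then collapses the adjacent repeats:

```shell
# Invented sample data for demonstration.
mkdir -p data
printf 'b.example\na.example\n' > data/one
printf 'a.example\nc.example\n' > data/two

cat data/* | sort | uniq > dnsOut
cat dnsOut
# a.example
# b.example
# c.example
```

Note that uniq only removes *adjacent* duplicates, which is why the sort step must come first.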
This is still wasteful since sort alone can do what you're using cat and uniq for; the whole script can be replaced by
sort -u data/* > dnsOut
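A quick equivalence check (again with invented sample data): `sort -u` produces the same output as the three-stage pipeline, with one fewer process and no extra pass over the data:

```shell
# Invented sample data for demonstration.
mkdir -p data
printf 'b.example\na.example\n' > data/one
printf 'a.example\nc.example\n' > data/two

cat data/* | sort | uniq > dnsOut_pipe
sort -u data/* > dnsOut

cmp -s dnsOut_pipe dnsOut && echo "same output"
```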
If this is still not fast enough, realize that sorting takes O(n log n) time, while deduplication with a hash table, as Awk's associative arrays provide, runs in roughly linear time:
awk '{if (!a[$0]++) print}' data/* > dnsOut
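One behavioral difference worth knowing: the Awk version keeps the first occurrence of each line in input order, whereas `sort -u` reorders everything. A small demonstration with invented data:

```shell
# Invented sample data for demonstration.
mkdir -p data
printf 'b.example\na.example\n' > data/one
printf 'a.example\nb.example\nc.example\n' > data/two

# a[$0]++ is 0 (false) the first time a line is seen, so !a[$0]++
# is true exactly once per distinct line.
awk '{if (!a[$0]++) print}' data/* > dnsOut
cat dnsOut
# b.example
# a.example
# c.example
```

If the original ordering of the data matters, the Awk approach is the only one of the three that preserves it.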