Created
April 6, 2016 15:35
Concatenate files and remove duplicates at the same time
Source: http://stackoverflow.com/questions/16873669/combine-multiple-text-files-and-remove-duplicates

First off, you're not using the full power of cat. The loop can be replaced by just

    cat data/* > dnsFull

assuming that file is initially empty.

Then there are all those temporary files, which force programs to wait for hard disks (commonly the slowest part of a modern computer system). Use a pipeline:

    cat data/* | sort | uniq > dnsOut

This is still wasteful, since sort alone can do what you're using cat and uniq for; the whole script can be replaced by

    sort -u data/* > dnsOut

If this is still not fast enough, then realize that sorting takes O(n lg n) time, while deduplication can be done in linear time with Awk:

    awk '{if (!a[$0]++) print}' data/* > dnsOut
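
One practical difference between the last two commands is worth seeing on real input: sort -u emits the unique lines in sorted order, while the Awk approach keeps each line at the position of its first occurrence (and holds every distinct line in memory while it runs). The following is a minimal sketch on made-up sample data; the temporary directory, the file names `one` and `two`, the output names `dnsSorted` and `dnsAwk`, and the array name `seen` are all assumptions chosen for the demo, not part of the original script. The Awk one-liner here is the common shorthand for the same test as above.

```shell
# Hypothetical sample data: directory layout and file names are invented for this demo.
workdir=$(mktemp -d)
mkdir "$workdir/data"
printf 'b.example\na.example\n' > "$workdir/data/one"
printf 'a.example\nc.example\nb.example\n' > "$workdir/data/two"

# sort -u: duplicates removed, output in sorted order.
sort -u "$workdir"/data/* > "$workdir/dnsSorted"

# awk: duplicates removed, lines kept in first-seen order.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is true
# only for the first occurrence, and awk's default action prints the line.
awk '!seen[$0]++' "$workdir"/data/* > "$workdir/dnsAwk"

echo "sort -u:"; cat "$workdir/dnsSorted"
echo "awk:";     cat "$workdir/dnsAwk"
```

With this input, the sort -u result lists a.example, b.example, c.example, whereas the Awk result lists b.example, a.example, c.example. If downstream tools expect sorted input, sort -u is likely the safer choice; if the original order matters, the Awk version preserves it.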