Couple of my hobbies are travelling and photography. I love to take pictures and experiment with photography. Usually after my trips, I just copy the photos to either my iPad or couple of my external hard disks. After 10 years, I have over 200K photos distributed across several disks and machines. I had to find a way to organize these photos and create a workflow for future maintenance. In this post I want to address one of the issues I had to solve: ** finding duplicate images **.
First, I needed to find out what exactly is a duplicate image. Analysing my photos, I found couple of interesting things:
Identical photos (1) are easy to find. These files are exactly the same so their bit patterns are the same. We could simply compute the checksum (or md5) of the photos and compare to find identical images.
Finding similar photos (2) is a little more challenging. The photos are visually the same, but they have enough differences that their checksum do not compare. Some of the differences that I found:
I had to find a way to detect all these changes to a photo and mark all these as duplicates of the original.
Researching "image similarity detection algorithms" on the web yields several techniques and tools. The primary theme of all these techniques was to calculate a fingerprint or hash of the image based on the perseptual content, not the raw bits in the image. A perceptual fingerprint/hash is derived from the visual features of the image. Unlike cryptographic hash functions like MD5 which rely on the fact that small changes in input leads to drastically different hash values, perceptual hashes produce hash values close to one another if the visual features of images are similar.
There are several algorithms that can calculate an image fingerprint, some were based on heuristics while others had a solid mathematical backing. Here is a rundown of some of the techniques and tools I came across while researching this topic.
The simplest of these fingerprinting techniques is using color histograms. Essentially a color histogram will capture the color distribution of the image. By comparing the normalized color histogram of images we can see if the color distributions match.
This techniques is apparently used in image retrieval/matching systems and are a standard way of matching images that is very reliable, relatively fast and very easy to implement. This type of matching is pretty resiliant to scaling (once the histogram is normalised), and rotation/shifting/movement etc.
Here is an implementation of an histogram indexing as described in Indexing via colour histograms - Swain, Ballard - 1990. See the table below to compare color histogram based similarity with other techniques.
GQview is an image viewer on Linux. It was one of the earliest programs that supported image similarity detection. Reading the code, this is what it does to detect similar images:
To compare two images, compute the signatures of both the images and calculate the average of the corresponding array differences. i.e. similarity = 1 - (abs(red[img1] - red[img2]) + abs(blue[img1] - blue[img2]) + abs(green[img1] - green[img2])) / 255 * 1024 * 3
1.0 is considered an exact match, while 0.0 for exact opposite images. Generally only a match of > 0.85 are significant at all, and >.95 is useful to find images that have been re-saved to other formats, dimensions, or compression.
A pearl script written by rob kudla in 2001. It basically creates a fingerprint of the photos using Image Magick after applying a series of reductions. From the code, this is what findimagedupes does:
When comparing images, findimagedups uses each images fingerprint and XORs them. The count of 1 bits in the result gives an approximate similarity score.
This method is described by Dr. Neal Krawetz on the HackerFactor blog. The basic idea was to filter out high frequencies in an image and just keep the low frequencies. With pictures, high frequencies give you detail, while low frequencies shows the structure. A large, detailed picture has lots of high frequencies. A very small picture lacks details, so it is all low frequencies.
Here is the algorithm Dr. Neal described:
According to Dr. Neal, the resulting hash won't change if the image is scaled or the aspect ratio changes. Increasing or decreasing the brightness or contrast, or even altering the colors won't dramatically change the hash value.
Here is an implementation of average hash using CImg library in C++. The similarity is computed as 1 - distance/64, so 1.0 means the images are similar. See the results of Average hash in the table below.
According to Dr. Neal, a better approach is using pHash. Here is how pHash is computed:
pHash is the most promising of all (see table below). There is an opensource version of pHash library that can be easily used in Ruby.
ImgSeek is an opensource tool that computes image features based on an algorithm described in the paper Fast Multiresolution Image Querying. These features consist of 41 numbers for each colour channel (the code currently works in the YIQ colourspace).
40 of these numbers will correspond to the 40 most significant wavelets found in a Haar wavelet decomposition of the image. The final number is based on the average luminosity of the image, and is basically a compensation factor. The image similarity is given by the sum of the weights for the most significant wavelet features, minus a component based on the average luminosity.
The table below shows the similarity scores for the following 4 images compared to the original.
|  |  |  |  |  | |:---------------------------------------:|:----------------------------------------:|:--------------------------------------:|:---------------------------------------------------:|:--------------------------------------:| | Original Image | Duplicate Image | Resized Image | Exposure Compensated Image | Cropped Image || Original vs Duplicate | Original vs Resized | Original vs Exposure Compensated | Original vs Cropped | |
|---|---|---|---|---|
| Color histogram intersection | 1.0 | 0.00099 | 0.8104 | 0.00099 |
| Average Hash | 1.0 | 0.625 | 0.6875 | 0.6875 |
| pHash | 1.0 | 0.875 | 0.890 | 0.718 |
| wavelet decomposition | 1.0 | 0.99 | 0.886 | 0.298 |
Conclusion:
Based on this, my choice was ImgSeek or pHash. I would have preferred to use imgseek, but I decided to use pHash as there was an open source implementation of pHash that can be easily used in a ruby script.
We know we can easily compute the hamming distance of two pHashes and see how similar they are. But when we have 200K images, we need to compare an image with a very large set of pHashes to compute similarity. Linear comparison is out of question. A hash can only give us exact matches or identical images but not similar images. We need a data structure that can find near matches to hashes within a threshold.
A stack overflow entry pointed me in the right direction.
Question: What do we know about the Hamming distance d(x,y)?
Answer:
This means that the Hamming distance is a metric for a metric space. There are good algorithms and data structures for indexing metric spaces: Metric tree, BK-tree, M-tree, VP-tree, Cover tree, etc.
In fact the open source pHash library has an implementation of Multi Vantage Point (MVP) trees. I needed more control over the tree nodes, so I decided to use a Burkhard Keller tree (BK-tree) in Ruby.
For a really good introduction to BK Trees see Damn Cool Algorithms.
Finally, I wrote a general purpose photo organizer tool that reads the exif data in an image and copies it to a folder structure based on the data-time when the photo was taken. It uses pHash for generating image signatures and skips identical images and marks similar images as "DUPLICATE_OF" the original image.
Check it out: https://github.com/HackerLabs/PhotoOrganizer.