Recently I was given a ticket to implement a “near-duplicate” image detector. Look at these three images:
The original image files have different byte sizes and different image sizes, but they show essentially the same thing. This is what we call a “near-duplicate”. The problem was that when displaying an automatically generated image gallery for a given subject, we sometimes showed the same picture more than once because the files differed slightly.
Obviously we can’t use something like an MD5 or SHA1 fingerprint, because those change completely when a single byte changes. We have to create a fingerprint based on the content of the image, not the exact bytes. This is what the pHash library does. A “perceptual hash” is a 64-bit value based on the discrete cosine transform of the image, which captures its frequency spectrum. Similar images will have hashes that are close in terms of Hamming distance, that is, the number of bit positions in which two hashes differ. For example, a binary hash value of 1000 is closer to 0000 than 0011 is: 1000 differs from 0000 in only one bit, whereas 0011 differs in two.
The duplicate threshold defines how many bits must be different between two hashes for the two associated images to be considered different images. Our testing showed that 15 bits is a good value to start with: it detected all duplicates with a minimum of false positives.
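To make the distance comparison concrete, here is a minimal sketch in plain Ruby. It does not use pHash or Phashion at all: the hash values are ordinary integers, and `hamming_distance` and `near_duplicate?` are hypothetical helper names chosen for illustration. Treating a distance equal to the threshold as “different” follows the wording above, but that boundary choice is an assumption.

```ruby
# Count the bit positions in which two integer hashes differ.
# XOR leaves a 1 wherever the inputs disagree; we then count the 1s.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Hypothetical helper: two hashes are near-duplicates when fewer than
# `threshold` bits differ (the boundary handling is an assumption).
def near_duplicate?(hash_a, hash_b, threshold = 15)
  hamming_distance(hash_a, hash_b) < threshold
end

puts hamming_distance(0b1000, 0b0000) # one bit differs
puts hamming_distance(0b0011, 0b0000) # two bits differ
puts near_duplicate?(0b1000, 0b0000)
```

With real 64-bit perceptual hashes the same XOR-and-count approach applies; only the inputs change.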
Phashion is my new Ruby wrapper for the pHash library and wraps just enough of the pHash API to implement the described functionality. Here’s the test in the test suite which verifies that Phashion considers the images to be duplicates:
def assert_duplicate(a, b)
  assert a.duplicate?(b), "#{a.filename} not dupe of #{b.filename}"
end

def test_duplicate_detection
  files = %w(86x86-0a1e.jpeg 86x86-83d6.jpeg 86x86-a855.jpeg)
  images = files.map {|f| Phashion::Image.new("#{File.dirname(__FILE__) + '/../test/'}#{f}")}
  assert_duplicate images[0], images[1]
  assert_duplicate images[1], images[2]
  assert_duplicate images[0], images[2]
end
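For the gallery use case described earlier, a pairwise check like this can drive a simple filter. The following is only a sketch under stated assumptions: the fingerprints are made-up integers rather than real pHash output, the file names are hypothetical, and `bits_different` and `unique_images` are illustrative helpers that are not part of Phashion.

```ruby
# Made-up 64-bit fingerprints keyed by hypothetical file names.
FINGERPRINTS = {
  "a.jpg" => 0xF0F0_F0F0_F0F0_F0F0,
  "b.jpg" => 0xF0F0_F0F0_F0F0_F0F1, # differs from a.jpg in 1 bit
  "c.jpg" => 0x0F0F_0F0F_0F0F_0F0F  # differs from a.jpg in all 64 bits
}

# Number of bit positions in which two fingerprints differ.
def bits_different(a, b)
  (a ^ b).to_s(2).count("1")
end

# Greedy filter: keep an image only if its fingerprint is not within
# `threshold` bits of any fingerprint that has already been kept.
def unique_images(fingerprints, threshold = 15)
  kept = {}
  fingerprints.each do |name, hash|
    dupe = kept.values.any? { |other| bits_different(hash, other) < threshold }
    kept[name] = hash unless dupe
  end
  kept.keys
end

p unique_images(FINGERPRINTS) # => ["a.jpg", "c.jpg"]
```

This compares every candidate against every kept image, which is fine for a small gallery but grows quadratically with the number of images.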
pHash does have much more functionality, including video and audio support and persistent MVP tree support for similarity searches based on previously processed files, but I have not wrapped any of those APIs. Try it out and let me know what you think!



7 responses so far
1 Art // May 22, 2010 at 3:59 am
This looks awesome Mike! Exactly what I’m going to need next week on some new features. I’ll try it out and let you know. I’m sure it’s gonna rock.
2 Glenn West // Jun 5, 2010 at 5:28 am
This is a great addition. Would you consider adding a wrapper so the video side will work? I’ve written a Ruby-based HD CCTV application, and it would be nice to look for “hashes” that are “different”.
3 Jonathan Spooner // Jun 28, 2010 at 6:26 pm
Thanks for the gem, it looks great. Do you have any hints for compiling the dependencies on OS X?
Should the initial gem install compile pHash? Or do I run extconf.rb manually?
Thanks!
4 Mike Perham // Jun 28, 2010 at 6:36 pm
Have you tried gem install phashion? The latest version vastly simplified the install.
5 Jonathan Spooner // Jun 28, 2010 at 6:54 pm
I did, but I get “NameError: uninitialized constant Phashion” on line 2.
require 'phashion'
img1 = Phashion::Image.new("a.jpg")
img2 = Phashion::Image.new("b.jpg")
img1.duplicate?(img2)
I’m on 10.6.4
6 Mike Perham // Jun 28, 2010 at 7:00 pm
That’s bizarre. Is it possible you installed with one RVM setup and are currently using another setup? require 'phashion' creates the Phashion module, so if the require works, that constant should exist.
7 Nilesh // Jun 29, 2010 at 8:53 am
Jonathan, can you try adding
require 'rubygems'
at the top of your Ruby file?