Detecting Duplicate Images with Phashion

Recently I was given a ticket to implement a “near-duplicate” image detector. Look at these three images:

The original image files have different byte sizes and different dimensions, but they show essentially the same thing. This is what we call a “near-duplicate”. The problem was that when displaying an automatically generated image gallery for a given subject, we sometimes showed what was effectively the same picture more than once, because the files differed slightly.

Obviously we can’t use something like an MD5 or SHA1 fingerprint – we have to create a fingerprint based on the content of the image, not the exact bytes. This is what the pHash library does. A “perceptual hash” is a 64-bit value based on the discrete cosine transform of the image, that is, on its frequency spectrum data. Similar images will have hashes that are close in terms of Hamming distance. That is, a binary hash value of 1000 is closer to 0000 than 0011 is, because 1000 differs from 0000 in only one bit whereas 0011 differs in two. The duplicate threshold defines how many bits must be different between two hashes for the two associated images to be considered different images. Our testing showed that 15 bits is a good value to start with: it detected all duplicates with a minimum of false positives.
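To make the Hamming distance comparison concrete, here is a minimal sketch in plain Ruby. The method names `hamming_distance` and `near_duplicate?` are hypothetical helpers written for illustration, not part of the Phashion or pHash APIs, and the strict "fewer than threshold bits" comparison is my reading of the threshold description above rather than a confirmed detail of the library.

```ruby
# Hypothetical helpers for illustration only; not the Phashion or pHash API.

# Count the bit positions in which two non-negative integer hashes differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Assumption: two hashes count as duplicates when fewer than `threshold`
# bits differ. The default of 15 is the starting value mentioned above.
def near_duplicate?(a, b, threshold = 15)
  hamming_distance(a, b) < threshold
end

hamming_distance(0b1000, 0b0000) # => 1
hamming_distance(0b0011, 0b0000) # => 2
```

Raising the threshold makes the check more forgiving (more pairs count as duplicates, with more false positives); lowering it makes the check stricter.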

Phashion is my new Ruby wrapper for the pHash library and wraps just enough of the pHash API to implement the described functionality. Here’s the test in the test suite which verifies that Phashion considers the images to be duplicates:

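  # Fails with a readable message unless Phashion judges the two images to be duplicates.
  def assert_duplicate(a, b)
    assert a.duplicate?(b), "#{a.filename} not dupe of #{b.filename}"
  end

  def test_duplicate_detection
    files = %w(86x86-0a1e.jpeg 86x86-83d6.jpeg 86x86-a855.jpeg)
    # Load each fixture image from the test directory.
    images = files.map {|f| Phashion::Image.new("#{File.dirname(__FILE__) + '/../test/'}#{f}")}
    assert_duplicate images[0], images[1]
    assert_duplicate images[1], images[2]
    assert_duplicate images[0], images[2]
  end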

pHash does have much more functionality, including video and audio support and persistent MVP tree support for similarity searches based on previously processed files, but I have not wrapped any of those APIs. Try it out and let me know what you think!

9 thoughts on “Detecting Duplicate Images with Phashion”

  1. This looks awesome Mike! Exactly what I’m going to need next week on some new features. I’ll try it out and let you know. I’m sure it’s gonna rock.

  2. This is a great addition. Would you consider adding a wrapper so the video side will work? I’ve written a Ruby-based HD CCTV application, and it would be nice to look for “hashes” that are “different”.

  3. Thanks for the gem it looks great. Do you have any hints for compiling the dependencies for OSX?

    Should the initial gem install compile pHash? Or do I run extconf.rb manually?


  4. I did but I get “NameError: uninitialized constant Phashion” on line 2.

    require 'phashion'
    img1 = Phashion::Image.new("a.jpg")
    img2 = Phashion::Image.new("b.jpg")

    I’m on 10.6.4

  5. That’s bizarre, is it possible you installed with one RVM setup and are currently using another setup? require ‘phashion’ creates the Phashion module so if the require works, that constant should exist.

  6. require 'phashion'
    LoadError: /usr/local/ruby/lib/ruby/gems/1.8/gems/phashion-1.0.2/lib/ undefined symbol: _ZTVN10__cxxabiv120__si_class_type_infoE – /usr/local/ruby/lib/ruby/gems/1.8/gems/phashion-1.0.2/lib/

  7. This looks awesome but when is someone going to write a Windows GUI program (that scans a drive) for dummies like me?
