Fighting against duplicate images
With this post I start a new category, “Behind the magic”, where I will share some details about the VisualizeUs internals.
One of the first things I noticed when I released VisualizeUs (almost 9 months ago now… whoa!) was the duplication of posted images. Someone finds a fancy image on one site and posts it to her account; then someone else finds the same image on another website and posts it too. Both files are the same image, but they have different addresses, so there’s no easy way to identify them as the same image. That happens a lot when people post from sites like ffffound, flickr, and so on. In fact, it’s one of the things I hate most when browsing ffffound as a spectator.
Here’s one clear example: just check out the number of reference URLs. Without a system to control the duplicates, that would mean the same image repeated six times.
So… how to deal with it? The approach of a lot of sites with this problem is… non-existent :D No, really, it’s a hard battle to fight, and probably one you will never win (unless you have a lot of money like the Digg guys and can borrow some fancy image recognition technology to deal with it :P). So the most usual approach is “why bother”, which, I should say, now seems pretty logical to me.
I still don’t know why I started dealing with this duplicate issue, but it was clear to me that displaying repeated results wasn’t very good for the spectator (foolish me!). So I started to mark duplicated pictures, based simply on my own visual memory. As you can guess, that was a tough, time-wasting, painful task, and not a very productive one either (although I did exercise my visual memory, like some sort of Brain Training game!).
After some deep research on different approaches to make it painless and more efficient, I finally came up with an algorithm based on color analysis of the images. It’s not a silver bullet, but it does the job pretty well. Basically, it gets a 3×3 color matrix from each image and, given a threshold, compares it with the matrices of the rest of the images. Unfortunately, it wasn’t that easy, and I had to tweak things a lot to make it useful. Things like borders, different crops, embedded text, different file quality and so on do not help. Of course it’s not a fully automated process and it requires human intervention, but it helps a lot with finding possible duplicate candidates.
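To make the idea more concrete, here is a minimal sketch in Python. It is not the actual VisualizeUs code: the post only says that a 3×3 color matrix is taken from each image and compared against a threshold, so everything else here is an assumption. In particular, averaging the pixels of each grid cell, using the mean absolute per-channel difference as the distance, the default threshold of 10, and the names `color_signature`, `signature_distance` and `find_candidates` are all hypothetical. Images are plain nested lists of (r, g, b) tuples so the sketch stays self-contained.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # rows of (r, g, b) pixels
Signature = List[Tuple[float, float, float]]


def color_signature(image: Image, grid: int = 3) -> Signature:
    """Average color of each cell in a grid x grid partition (assumed method)."""
    height, width = len(image), len(image[0])
    if height < grid or width < grid:
        raise ValueError("image is smaller than the grid")
    signature = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            count = (y1 - y0) * (x1 - x0)
            sums = [0, 0, 0]
            for row in image[y0:y1]:
                for r, g, b in row[x0:x1]:
                    sums[0] += r
                    sums[1] += g
                    sums[2] += b
            signature.append((sums[0] / count, sums[1] / count, sums[2] / count))
    return signature


def signature_distance(a: Signature, b: Signature) -> float:
    """Mean absolute per-channel difference, in the range 0..255 (assumed metric)."""
    total = sum(
        abs(ca - cb)
        for cell_a, cell_b in zip(a, b)
        for ca, cb in zip(cell_a, cell_b)
    )
    return total / (len(a) * 3)


def find_candidates(images: Dict[str, Image], threshold: float = 10.0):
    """Return pairs of image ids whose signatures differ by at most the threshold."""
    sigs = {image_id: color_signature(img) for image_id, img in images.items()}
    ids = sorted(sigs)
    return [
        (p, q)
        for i, p in enumerate(ids)
        for q in ids[i + 1:]
        if signature_distance(sigs[p], sigs[q]) <= threshold
    ]
```

Because each cell is an average, the same picture saved at a different size or quality yields a similar signature, while borders and crops shift the cells and throw the comparison off. The pairs returned are only candidates for a human to confirm.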
With all that said, don’t think that you won’t find a single duplicated image on the site. I wish, but unfortunately that’s next to impossible. At least, I hope that minimizing the number of duplicates will help to improve the VisualizeUs experience for everyone when browsing and watching tons of pictures. To give you an idea, there are about 70K images posted so far; among them, 50K are “unique”, while 20K have been found to be duplicates or were posted from the VisualizeUs “i like it” link.

