Over the course of 2009 and 2010 I wrote an automated web crawler, as many programmers do as a side project, just to say I could. My crawler targeted images, specifically photographs that still contain their camera Exif information. The Exif header in a JPEG contains all the interesting camera data: make, model, exposure, aperture, focal length, lens info, sometimes GPS coordinates (depending on the camera), et cetera.
I targeted the major photo sharing sites, Flickr et al., and built a searchable database of these images. The site was live for a while, though the front-end design was never really finished, and it came down when I moved providers from Slicehost (now part of Rackspace) to Linode.
There were a number of challenges typical of a web crawler. I did everything I was supposed to do: kept it kosher with robots.txt and didn't flood any site with requests.
The challenge I will address here is bandwidth. I wasn't storing full images, but I did have to download them and keep at least thumbnail versions, and downloading meant massive amounts of bandwidth over my home 10 Mbit (at the time) connection.
What saved me was a brief inspection of the JPEG header before downloading the rest of the image: read a minimal amount of the header, and if it does not contain Exif information, close the connection without downloading the entire file. If it does contain an Exif header, download and save the image. Here is the meat & potatoes of the code:
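The idea can be sketched in Python as follows (a reconstruction, not the original crawler code; `jpeg_has_exif`, `probe_url`, and the 4 KB probe size are illustrative choices of mine). A JPEG is a sequence of marker-delimited segments, and Exif data lives in an APP1 segment near the front of the file, so a decision can be made after reading only the first few kilobytes:

```python
import struct
import urllib.request


def jpeg_has_exif(head: bytes) -> bool:
    """Walk the JPEG segment markers in the first bytes of the file.

    Returns True as soon as an APP1/Exif segment is seen, False if the
    scan reaches image data (SOS) or runs out of probed bytes.
    """
    if head[:2] != b"\xff\xd8":          # not a JPEG (missing SOI marker)
        return False
    i = 2
    while i + 4 <= len(head):
        if head[i] != 0xFF:              # corrupt or unexpected byte
            return False
        marker = head[i + 1]
        if marker == 0xE1 and head[i + 4:i + 10] == b"Exif\x00\x00":
            return True                  # APP1 segment carrying Exif
        if marker == 0xDA:               # SOS: compressed data begins, no Exif found
            return False
        seg_len = struct.unpack(">H", head[i + 2:i + 4])[0]
        i += 2 + seg_len                 # skip marker bytes + segment body
    return False


def probe_url(url: str, probe_bytes: int = 4096) -> bool:
    """Read only the first few KB of the response, decide, then close
    the connection instead of downloading the whole image."""
    with urllib.request.urlopen(url) as resp:
        return jpeg_has_exif(resp.read(probe_bytes))
```

Because cameras write the APP1/Exif segment immediately after the start-of-image marker, a 4 KB probe is almost always enough to decide one way or the other.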
The above code is sanitized from the original; the database calls that record the image URL found by the crawler have been removed. The original crawler code would call exiv2 to parse the Exif data, store it in the database, then resize the image to a thumbnail (stripping the Exif data from the thumbnail).
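For flavor, the resize-and-strip step might look something like the following (an illustration only: the original used exiv2, Pillow stands in here, and `make_thumbnail` is a hypothetical helper):

```python
import io

from PIL import Image  # Pillow, standing in for the original tooling


def make_thumbnail(jpeg_bytes: bytes, size=(200, 200)) -> bytes:
    """Resize to a bounded thumbnail and re-encode the JPEG.

    Pillow does not carry Exif over on save unless it is passed
    explicitly, so the stored thumbnail ends up with no camera metadata.
    """
    im = Image.open(io.BytesIO(jpeg_bytes))
    im.thumbnail(size)                   # in-place, preserves aspect ratio
    out = io.BytesIO()
    im.convert("RGB").save(out, format="JPEG")  # no exif= kwarg: metadata dropped
    return out.getvalue()
```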
Run it for yourself
I have considered open-sourcing the entire web crawler project, but time is a limiting factor at the moment. I may revisit the idea if time permits and enough people bug me to do it.