fetchifjpegexif.py: Or a script to download a jpeg image only if it contains an Exif header

Motivation

Over the course of 2009 and 2010 I wrote an automated web crawler, as many programmers decide to do as a side project, just to say I could. My crawler targeted images, specifically photographs that still contain their camera Exif information. The Exif header in a jpeg image contains all the interesting camera data: make, model, exposure, aperture, focal length, lens info, sometimes GPS info (depending on the camera), et cetera.

I targeted the major photo sharing sites, Flickr et al., and built a searchable database of these images. There was a site in the wild for a while, but it was taken down when I moved providers from Slicehost (now part of Rackspace) to Linode, and it never got far enough to have a usable front-end design.

Challenges

There were a number of challenges typical of a web crawler. I did everything I was supposed to do: kept it kosher with robots.txt and didn't flood any site with requests.

The challenge I will address here is bandwidth. I wasn't storing full images, but I had to download them to at least keep thumbnail versions. Downloading meant massive amounts of bandwidth over my home 10 Mbit (at the time) connection.

What saved me was a brief inspection of the jpeg header before downloading the rest of the image.

The Result

Read a minimal amount of the jpeg header; if it does not contain Exif information, close the connection without downloading the entire file. If it does contain an Exif header, download and save the image. Here is the meat & potatoes of the code:

# The first 12 bytes cover the SOI marker (FF D8), the first segment's
# marker (2 bytes), its big-endian length (2 bytes) and its 4-byte
# identifier (e.g. 'JFIF' or 'Exif').
data = fh.read(12)
raw_data = data
if data[0:2] != '\xFF\xD8':
  print 'Invalid JPEG'
  fh.close()
  return True

# Skip over non-Exif application segments until something else turns up.
while data[2] == '\xFF' and data[6:10] in ('JFIF', 'JFXX', 'OLYM', 'Phot'):
  # The length field counts everything after the marker; 8 of those bytes
  # (length, identifier and two more) are already in hand, so read the
  # remainder of this segment.
  length = ord(data[4])*256 + ord(data[5])
  tmp = fh.read(length-8)
  raw_data = raw_data + tmp
  # Pull in the start of the next segment and re-pad 'data' so the same
  # offsets (marker at [2], identifier at [6:10]) keep working.
  tmp = fh.read(10)
  raw_data = raw_data + tmp
  data = '\xFF\x00'+tmp

if data[2] == '\xFF' and data[6:10] == 'Exif':
  # Exif header found: download the rest of the file and save it.
  raw_data = raw_data + fh.read()
  filename = url.rsplit('/',1)[1]
  print 'Exif found: Saving ' + filename
  ft = open(filename,'wb')   # binary mode, this is image data
  ft.write(raw_data)
  ft.close()
  fh.close()
else:
  print 'No Exif header info.'
  fh.close()
  return True
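
For context, the snippet assumes fh is an already-open, file-like handle on the HTTP response and url is the image address. One minimal way to set those up, as a sketch only (the URL is a placeholder and urllib2 from the Python 2 standard library is just one option):

import urllib2

# Hypothetical setup: the URL below is a placeholder.
url = 'http://example.com/photo.jpg'
fh = urllib2.urlopen(url)   # file-like object; read() pulls bytes as asked
# ...the header check above then reads from fh and closes it early
# when the image turns out to have no Exif data.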

The header-checking code above is sanitized from the original, with the database calls that record the image URL removed. The original code in the web crawler project would call exiv2 to parse the Exif data, store it in the database, then resize the image to a thumbnail (removing the Exif data from the thumbnail).
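
That post-processing step is out of scope here, but a rough sketch of the idea, assuming exiv2 is on the PATH and PIL is installed (the function name, thumbnail size and output name are arbitrary, and the database step is omitted):

import subprocess
from PIL import Image

def process_saved_image(filename):
  # Print a summary of the image metadata via the exiv2 command-line tool;
  # the real crawler parsed output like this into the database instead.
  proc = subprocess.Popen(['exiv2', 'pr', filename], stdout=subprocess.PIPE)
  exif_summary, _ = proc.communicate()
  print exif_summary

  # Shrink to a thumbnail. PIL does not carry the Exif block over when it
  # re-saves, so the thumbnail ends up without the camera data.
  img = Image.open(filename)
  img.thumbnail((200, 200), Image.ANTIALIAS)
  img.save('thumb_' + filename)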

Run it for yourself

I have added the above example to fetchifjpegexif.py in my GitHub repository steves-python-utils so you can play with it yourself.

I have considered open sourcing the entire web crawler project but time is a limiting factor at the moment. I may revisit this idea if time permits and enough people bug me to do it.

