Guardian PDF newspaper downloader

Script for automated downloading of PDFs from the Guardian's subscription service (guardian.newspaperdirect.com, formerly digital.guardian.co.uk).


Requirements

Currently works only on OS X; Linux and other Unix systems should need only a bug fix or two. Windows is currently unsupported. Requires dos2unix, Netcat and Python (used for MD5 hashing).
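As a rough illustration of how these requirements fit into a Bash script, here is a minimal sketch. The function names (missing_tools, md5_hex) are hypothetical and not taken from the actual script, and it assumes a python3 binary on the PATH, whereas the original may well call a different Python.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the name of every required tool that is
# not found on the PATH, one per line. Prints nothing if all are present.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: hex MD5 digest of a string, computed by Python.
# printf '%s' avoids hashing a trailing newline along with the string.
md5_hex() {
    printf '%s' "$1" |
        python3 -c 'import hashlib, sys; print(hashlib.md5(sys.stdin.buffer.read()).hexdigest())'
}
```

A script could then bail out early with something like `[ -z "$(missing_tools dos2unix nc python3)" ] || exit 1`. Note that MD5 is a one-way hash, so the digest cannot be turned back into the original string.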

It's just a big fat Bash script (only because I'm more comfortable with Bash than Perl or Python), which I will tidy up and rewrite using something else in the future (see below).

History

The Guardian's website is probably the most popular newspaper website in the UK. As well as online news, it has long offered a digital subscription to the full newspaper edition. The service has recently undergone a significant facelift. Fancy as it is, the redesign doesn't change the form of the service, which is an online JavaScript viewer/reader running in a web browser. This approach has several drawbacks (see below). However, there has always been the option of downloading individual pages as PDFs, which solves all of these issues.

Problems with web-browser approach

  1. web browser's font rendering is awful compared to printed newspaper
  2. no support for e-readers and other non-web-enabled electronic media
  3. cannot archive newspapers or go back to a newspaper that's a few months old
  4. cannot make decent-quality screenshots or printouts for the reader's own use
  5. active Internet connection required in order to read
  6. ...

There are some subjective issues too, such as the web browser's general clumsiness and awkwardness, plus my own laziness and preference for desktop applications like a decent PDF viewer.

Development

When I signed up in early 2009, I could use Bash-scripted Lynx to log in and download the PDFs for a whole issue of the Guardian. This worked OK until some day in summer 2009, when the Guardian changed the website and introduced s*** loads of JavaScript that made Lynx solutions impossible.

I was generally pissed off with that, so after discovering that the PDF option still existed in the redesigned website, I sat down, painfully went through the sources and traffic dumps of logging in and downloading a PDF, and wrote a very-very-very-very crude Netcat-based downloader, which still works to my 100% satisfaction as of writing this on 29 Jun 2009.
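For readers unfamiliar with the technique, the general shape of talking HTTP through Netcat can be sketched as follows. This is not the actual downloader: the function names, the example host and path, and the cookie value are all made up for illustration, and it assumes plain HTTP on port 80 with no TLS.

```shell
#!/usr/bin/env bash

# Hypothetical helper: write a minimal HTTP/1.0 GET request to stdout.
# HTTP header lines end in CRLF, hence the explicit \r\n.
build_request() {
    local host="$1" path="$2" cookie="${3:-}"
    printf 'GET %s HTTP/1.0\r\n' "$path"
    printf 'Host: %s\r\n' "$host"
    if [ -n "$cookie" ]; then
        printf 'Cookie: %s\r\n' "$cookie"
    fi
    printf 'Connection: close\r\n\r\n'
}

# Hypothetical helper: read an HTTP response on stdin and print the
# numeric status code from its first line. Stripping the carriage
# return here is the same job dos2unix does on a saved header file.
status_code() {
    head -n 1 | tr -d '\r' | awk '{print $2}'
}

# Example use (not run here; host and path are placeholders):
#   build_request example.com /index.html | nc example.com 80 > response.bin
#   status_code < response.bin
```

The session cookie is passed in as a third argument because a logged-in download would have to replay whatever cookie the login step returned; how that cookie is obtained is specific to the site and not shown.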

Recently I have also written an automated script that periodically checks the date and downloads a newspaper every day using the downloader script. So basically, with this running in the background, if I switch on my computer for a while every day, I will have a newspaper to read any time I want.
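The idea behind the daily check can be sketched like this. Everything named here is an assumption made for illustration: the archive layout (one directory per date), the ARCHIVE_DIR, DOWNLOADER and CHECK_INTERVAL variables, the guardian-download.sh file name, and the downloader's arguments (date, then target directory).

```shell
#!/usr/bin/env bash

# Assumed layout: one directory per issue, named YYYY-MM-DD.
ARCHIVE_DIR="${ARCHIVE_DIR:-$HOME/guardian}"

# Print the directory that holds the issue for the given date.
issue_dir() {
    printf '%s/%s\n' "$ARCHIVE_DIR" "$1"
}

# Succeed (exit status 0) when no PDF exists yet for the given date.
needs_download() {
    local dir
    dir="$(issue_dir "$1")"
    ! ls "$dir"/*.pdf >/dev/null 2>&1
}

# Run the downloader for the given date unless the issue is already there.
# DOWNLOADER is a placeholder for the real downloader script.
fetch_if_missing() {
    local day="$1"
    if needs_download "$day"; then
        mkdir -p "$(issue_dir "$day")"
        "${DOWNLOADER:-./guardian-download.sh}" "$day" "$(issue_dir "$day")"
    fi
}

# Check once per interval (default: hourly) and fetch today's issue if
# it is missing. A failed download is simply retried on the next pass.
watch_daily() {
    while true; do
        fetch_if_missing "$(date +%Y-%m-%d)" ||
            echo "download failed, will retry" >&2
        sleep "${CHECK_INTERVAL:-3600}"
    done
}
```

Running `watch_daily &` from a login script would keep it in the background. Checking for existing PDFs rather than remembering the last run date means a machine that was off at the usual time still picks up the day's issue as soon as it is switched on.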

TODOs


Hosted by SourceForge.

Copyright (C) 2009, Ladislav Snizek