
A Python script to import a complete blog to plain text


In today’s post I’ll show you a script I wrote yesterday to import an entire blog based on its (XML) sitemap. It also converts the posts from HTML to plain text. Although it was mainly Python practice, it could actually be useful for exporting/backing up a blog to a single text file for easy consumption in a text editor or offline.

How it works / some comments:

  • The class gets instantiated with a couple of arguments:
# create instance
blog = ImportBlogPosts("http://bobbelderbos.com", '<div class="entry-content">', '<div><br /><h4><strong>You might also like:')
# get all URLs with /20 in them (in my blog's case these are the real post URLs, so the about, archive and home pages are ignored)
blog.import_post_urls("/20")
# ... or just a single URL: 
blog.import_post_urls('http://bobbelderbos.com/2012/09/how-to-grow-craft-programming/')

The first arg is clear: the blog URL. The 2nd and 3rd args are the HTML start and end snippets that delimit the actual post content (and not all the fluff around it). The sitemap has to be present at URL/sitemap.xml, or it can be a file on the local filesystem. I did see quite a few blogs without a sitemap, so an RFE would be to let the class generate the sitemap if it is not present …
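
For reference, the script only reads the <url>, <loc> and <lastmod> elements, so it expects a standard sitemap roughly like this (a trimmed-down sketch; a real sitemap has one entry per post and the date shown here is just a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://bobbelderbos.com/2012/09/how-to-grow-craft-programming/</loc>
    <lastmod>2012-09-25</lastmod>
  </url>
</urlset>

Note that the parsing code assumes every <url> entry carries a <lastmod> tag.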

  • Importing web content: this is easy in Python with urllib.urlopen, but in this case I had issues with the body content not being imported, so as a workaround I shell out to “wget -q ..” via the subprocess module
  • I find the standard-library module “xml.dom.minidom” pretty convenient for parsing XML
  • I use a nice module for the html-to-text conversion: html2text – it delivers Markdown output. One tricky thing is encodings (a subject I need to study on its own one day): to get the script printing to stdout as well as redirecting its output to a file (with ‘>’ on Unix), I had to use this magic:
      postContent = postContent.decode('utf-8')
      print html2text.html2text(postContent).encode('ascii', 'ignore')
    

    Yes, that is right: decode from UTF-8, then encode to ASCII. If anybody has a better way or an explanation, please comment.
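
    One possible alternative (a rough, untested sketch, not what the script below uses): in Python 2 you can wrap sys.stdout in a UTF-8 writer once, so that printing unicode works both in a terminal and when redirecting with ‘>’, without throwing away non-ascii characters:

    import sys, codecs, html2text

    # wrap stdout once so unicode output also survives redirection with '>'
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    html = '<p>caf\xc3\xa9 &amp; code</p>'            # raw utf-8 bytes, as wget returns them
    print html2text.html2text(html.decode('utf-8'))   # html2text wants unicode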

    The code

    Here is a copy of the code; see also here on Github.

    #!/usr/bin/env python                                                                                                                                     
    # -*- coding: utf-8 -*-
    # Author: Bob Belderbos / written: Dec 2012
    # Purpose: import all blog posts to one file, converting them in (markdown) text
    # Thanks to html2text for doing the actual conversion ( http://www.aaronsw.com/2002/html2text/ ) 
    # 
    import os, sys, pprint, xml.dom.minidom, urllib, html2text, subprocess
    
    class ImportBlogPosts(object):
      """ Import all blog posts and create one big text file (pdf would increase size too much,
          and I like searching text files with Vim).  It uses the blog's sitemap to get all URLs. """
    
      def __init__(self, url, poststart, postend, sitemap="sitemap.xml"):
        """ Specify blog url, where post html starts/ stops, what urls in sitemap are valid, and sitemap """
        self.sitemap = sitemap
        self.sitemapUrl = "%s/%s" % (url, self.sitemap)
        self.postStartMark = poststart # where does post content html start?
        self.postEndMark = postend # where does post content html stop?
        if not os.path.isfile(self.sitemap):
          cmd = "wget -q %s" % self.sitemapUrl
          if subprocess.call(cmd.split()) != 0:
            sys.exit("No 0 returned from %s, exiting ..." % cmd)
        self.blogUrls = self.parse_sitemap(self.sitemap)
    
          
      def parse_sitemap(self, sitemap):
        """ Parse blog's specified xml sitemap """
        posts = {}
        dom = xml.dom.minidom.parse(sitemap)
        for element in dom.getElementsByTagName('url'):
          url = self.getText(element.getElementsByTagName("loc")[0].childNodes)
          mod = self.getText(element.getElementsByTagName("lastmod")[0].childNodes)
          posts[url] = mod # there can be identical mods, but urls are unique
        urls = []
        # return urls ordered desc on last mod. date
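        # (the lambda below uses Python 2 tuple-parameter unpacking, which was removed in Python 3)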
        for key, value in sorted(posts.iteritems(), reverse=True, key=lambda (k,v): (v,k)):
          urls.append(key)
        return urls
    
    
      def getText(self, nodelist):
        """ Helper method for parsing XML childnodes (see parse_sitemap) """
        rc = ""
        for node in nodelist:
          if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
        return rc
    
      
      def import_post_urls(self, urlCriteria="http"):
        """ Loop over blog URL getting each one's content, default 'http' practically results in importing all links """
        for i, url in enumerate(self.blogUrls):
          if urlCriteria in url: 
            html = self.get_url(url)
            if html is not None:
              self.print_banner(i, url)
              self.print_content(url)
      
    
      def get_url(self, url):
        """ Import html from specified url """
        try:
          f = urllib.urlopen(url)
          html = f.read()
          f.close()
          return html
        except: 
          print "Problem getting url %s" % url
          return None
    
    
      def print_banner(self, i, url):
        """ print a banner for a specified URL (to seperate from content) """
        divider = "+"*120
        print "\n\n"
        print divider
        print "%i) %s" % (i, url)
        print divider
        print "\n"
    
    
      def print_content(self, url): 
        """ Get blog post's content, get relevant html, then convert to plain text """
        try:
          # I know, I probably should have used urllib.urlopen, but somehow it
          # doesn't import the body html, so using good ol' wget as a workaround
          cmd = "wget -q -O - %s" % url
          html = subprocess.check_output(cmd.split())
        except subprocess.CalledProcessError as e:
          print "Something went wrong importing %s, error: %s" % (url, e)
          return False
        postContent = self.filter_post_content(html)
        if postContent is None:
          print "postContent == None, something went wrong in filter_post_content?"
        else:
          try:
            # decoding to utf-8 is needed to print in a terminal; encoding to ascii
            # is needed when redirecting the script's output to a file with >
            postContent = postContent.decode('utf-8')
            print html2text.html2text(postContent).encode('ascii', 'ignore') 
          except:
            print "Cannot convert this post's html to plain text"
    
    
      def filter_post_content(self, textdata):
        """ Takes the post page html and return the post html body """
        try:
          post = textdata.split(self.postStartMark)
          post = "".join(post[1:]).split(self.postEndMark)
          return post[0]
        except:
          print "Cannot split post content based on specified start- and endmarks"
          return None
    
    # end class
    
    
    ### run this program from cli
    import optparse
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', help='specify a blog url', dest='url')
    parser.add_option('-b', '--beginhtml', help='first html (div) tag of a blog post', dest='beginhtml')
    parser.add_option('-e', '--endhtml', help='first html after the post content', dest='endhtml')
    parser.add_option('-s', '--sitemap', help='sitemap name, default = sitemap.xml', dest='sitemap', default="sitemap.xml")
    parser.add_option('-p', '--posts', help='url string to filter on, e.g. "/2012" for all 2012 posts', dest='posts', default="http")
    (opts, args) = parser.parse_args()
    
    # Making sure all mandatory options appeared.
    mandatories = ['url', 'beginhtml', 'endhtml']
    for m in mandatories:
      if not opts.__dict__[m]:
        print "Mandatory option is missing\n"
        parser.print_help()
        exit(-1)
    
    # Execute program with given cli options: 
    blog = ImportBlogPosts(opts.url, opts.beginhtml, opts.endhtml, opts.sitemap)
    blog.import_post_urls(opts.posts)
    
    
    
    ### example class instantiation syntax, and using it for other blogs
    # + instantiate class
    # blog = ImportBlogPosts("http://bobbelderbos.com", '<div class="entry-content">', '<div><br /><h4><strong>You might also like:')
    # + all posts on my blog:
    # blog.import_post_urls("/20")
    # + only one post on my blog:
    # blog.import_post_urls('http://bobbelderbos.com/2012/09/how-to-grow-craft-programming/')
    # + another single post on my blog:
    # blog.import_post_urls('http://bobbelderbos.com/2012/10/php-mysql-novice-to-ninja/')
    # 
    # + other blogs:
    # blog = ImportBlogPosts("http://zenhabits.net", '<div class="entry">', '<div class="home_bottom">', "zenhabits.xml") 
    # blog = ImportBlogPosts("http://blog.extracheese.org/", '<div class="post content">', '<div class="clearfix"></div>', "/Users/bbelderbos/Downloads/gary.xml") 
    # + import all urls
    # blog.import_post_urls()
    # blog = ImportBlogPosts("http://programmingzen.com", '<div class="post-wrapper">', 'related posts', "/Users/bbelderbos/Downloads/programmingzen.xml") 
    # + supposedly all posts
    # blog.import_post_urls("/20")
    

    Running the script / example outputs

    I ran the following command to get all my 2012 blog posts in a text file:

    $ python import_blog_posts.py  -u http://bobbelderbos.com \
      -b '<div class="entry-content">' -e '<div><br /><h4><strong>You might also like:' \
      -p "http://bobbelderbos.com/2012" > import_blog_posts_bb_2012.txt
    

    You can see the result here. Run it without the -p option (no post filter) and you get over 140 posts == 14,000 lines … useful for quickly vi-ing through anything and copying code snippets ;)

    Another example: run it with -p “git” to get my 3 posts that had “git” in the url:

    $ python import_blog_posts.py  -u http://bobbelderbos.com \
      -b '<div class="entry-content">' -e '<div><br /><h4><strong>You might also like:' \
      -p "git" > git.txt
    
