Friday, October 2, 2015

Extracting structured data (in a table) from HTML5 using BeautifulSoup / Python

I recently ripped a CD that was unknown to my CDDB server. I found a web page that contained a track list, but found it very cumbersome to copy and paste the information due to the formatting of the web page.

Consequently, I opened up the page in the Firefox DOM inspector and noticed that each track title sat in a div element with the HTML class 'title'. Surely, the data elements of interest could be extracted using some higher-level language!

I did some research and discovered that I could solve this problem easily using Python 2.7 and BeautifulSoup (plus the requests and html5lib packages, which the script below also relies on).

I had never used BeautifulSoup before, but this is the unbelievably simple script that I came up with:

from requests import get
from bs4 import BeautifulSoup

url = 'https://rainforroots.bandcamp.com/album/the-kingdom-of-heaven-is-like-this'
# Download the page and parse it with the html5lib parser
htmlString = get(url).text
html = BeautifulSoup(htmlString, 'html5lib')
# Collect every <div> whose class is 'title' and pull out its text
tags = html.find_all('div', {'class': 'title'})
text = [t.get_text() for t in tags]
# A parenthesized single-argument print works in Python 2.7 and Python 3
print(str(len(text)) + ' items matched:\n')
# join(j.split()) is a quick hack to remove excess whitespace
for j in text:
    print(' '.join(j.split()))
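
The whitespace hack in the last line of the script is worth seeing on its own. Here is a minimal sketch; the helper name squeeze_whitespace and the sample string are made up for illustration and are not part of the original script. Calling split() with no arguments breaks the string on any run of spaces, tabs, or newlines and discards the empty pieces, so joining the result with single spaces collapses everything into one tidy line.

```python
def squeeze_whitespace(s):
    """Collapse runs of whitespace to single spaces and trim both ends.

    Hypothetical helper: it just wraps the ' '.join(s.split()) idiom.
    """
    # split() with no separator splits on any whitespace run and
    # drops leading/trailing empty strings
    return ' '.join(s.split())


# Invented example resembling text pulled out of indented HTML
raw = '\n      Track   One\t(live)  \n'
print(repr(squeeze_whitespace(raw)))
```

The same idiom returns an empty string for input that contains only whitespace, which is handy when a matched tag turns out to be empty.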

WOW! Clearly, this is a useful library.
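
To experiment with the same selector without hitting the network, something like the following sketch may help. It is only an illustration under a few assumptions: the HTML fragment is invented (it is not the real page's markup), the names SAMPLE_HTML and extract_titles are hypothetical, and it uses Python's built-in 'html.parser' backend instead of html5lib so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented fragment: a small table whose cells wrap titles in
# <div class="title">, loosely imitating a track list
SAMPLE_HTML = """
<table>
  <tr><td><div class="title">  First   Track </div></td><td>3:05</td></tr>
  <tr><td><div class="title">Second
      Track</div></td><td>4:10</td></tr>
  <tr><td><div class="other">Not a title</div></td><td>0:00</td></tr>
</table>
"""


def extract_titles(markup):
    """Return the cleaned text of every <div class="title"> in markup."""
    # 'html.parser' ships with Python; the original script used html5lib
    soup = BeautifulSoup(markup, 'html.parser')
    # class_ is BeautifulSoup's keyword for matching the class attribute
    tags = soup.find_all('div', class_='title')
    # Same whitespace clean-up as in the original script
    return [' '.join(t.get_text().split()) for t in tags]


titles = extract_titles(SAMPLE_HTML)
for title in titles:
    print(title)
```

The class_='title' keyword is just another spelling of the {'class': 'title'} dictionary used in the script; both ask find_all for div elements carrying that class.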
