Text Processing in Python (1)

For a quick guide to installing Beautiful Soup, see here. In this post, I will briefly go over some code for processing text when we want to use text as data. The Stanford NLP Group has some fantastic resources and packages for processing text, available here.

Preliminary Step:

Launch IDLE and type:

from bs4 import BeautifulSoup

# Python 3 location of urlopen; in Python 2, use: from urllib import urlopen
from urllib.request import urlopen

import os, re

TASK 1: Grabbing Basic Data from Wikipedia

soup = BeautifulSoup(urlopen('[type in your html]'), 'html.parser')

bday = soup.find('span', {'class': 'bday'}).text

bplace = soup.find('span', {'class': 'birthplace'}).text
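
A page that lacks one of these spans makes soup.find return None, and calling .text on None raises an error. Here is a minimal sketch of one way to guard against that. The helper name span_text and the sample HTML (with made-up values) are my own assumptions for illustration, not part of any real page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page (made-up values).
SAMPLE_HTML = """
<table class="infobox">
  <tr><td>Born: <span class="bday">1900-01-01</span>
      <span class="birthplace">Example Town</span></td></tr>
</table>
"""

def span_text(soup, css_class):
    """Return the stripped text of the first <span> with this class, or None."""
    tag = soup.find('span', {'class': css_class})
    return tag.get_text(strip=True) if tag is not None else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
bday = span_text(soup, 'bday')
bplace = span_text(soup, 'birthplace')
missing = span_text(soup, 'deathplace')  # no such span in the sample, so None
print(bday, bplace, missing)
```

With a real page, you would build soup from urlopen as in the lines above and call span_text in the same way.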




TASK 2: Processing Texts from HTML

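soup = BeautifulSoup(urlopen('[type in your html]').read(), 'html.parser')

# I use "Extension of Military and Economic Aid" as the example

# Take the first child of the first <p> tag (assumed here to be plain text)
data = soup.p.contents[0]

# Convert everything to lowercase
data1 = data.lower()

# Replace every non-word character, such as punctuation, with a space
data2 = re.sub(r'\W', ' ', data1)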
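
Once the text is lowercased and stripped of punctuation, a natural next step is to split it into words and count them. The following is a rough sketch using only the standard library. The function names clean_text and word_counts and the sample sentence are hypothetical, chosen only to illustrate the idea:

```python
import re
from collections import Counter

def clean_text(raw):
    """Lowercase the text and replace every non-word character with a space."""
    lowered = raw.lower()
    return re.sub(r'\W', ' ', lowered)

def word_counts(raw):
    """Split the cleaned text on whitespace and count each word."""
    return Counter(clean_text(raw).split())

# Made-up sample sentence, not taken from the actual document.
sample = "Aid, aid -- and more AID for Greece & Turkey!"
counts = word_counts(sample)
print(counts.most_common(3))
```

Note that \W treats digits and underscores as word characters, so they survive the cleaning step. If you want letters only, a pattern such as [^a-z] applied after lowercasing is one alternative.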

Meaningful Practice:

Patrick Perry at NYU Stern has processed the raw text of the Federalist Papers, and the resulting JSON data file is available here.

