Text Processing in Python (1)

For a quick guide to installing Beautiful Soup, see here. In this post, I will briefly go over the code for processing text when we want to use text as data. The Stanford NLP Group has some fantastic resources and packages available here for processing text.

Preliminary Step:

Launch IDLE and type:

from bs4 import BeautifulSoup

# In Python 3, urlopen lives in urllib.request (Python 2 used: from urllib import urlopen)
from urllib.request import urlopen

import os, re

TASK 1: Grabbing Basic Data from Wikipedia

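# Replace the placeholder with the address of the Wikipedia page; naming the parser avoids a bs4 warning
soup = BeautifulSoup(urlopen('[type in your html]'), 'html.parser')

# Birth date, taken from the first <span class="bday"> element
bday = soup.find('span', {'class': 'bday'}).text

# Birthplace, taken from the first <span class="birthplace"> element
bplace = soup.find('span', {'class': 'birthplace'}).text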
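Not every page has these spans, and soup.find returns None when nothing matches, so calling .text on the result fails. Here is a minimal sketch of a safer lookup. The helper name span_text and the sample HTML are made up for illustration; the values are not taken from any real page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a downloaded page, so the example runs offline
SAMPLE_HTML = """
<table class="infobox">
  <tr><td>Born <span class="bday">1900-01-01</span>
  <span class="birthplace">Springfield</span></td></tr>
</table>
"""

def span_text(soup, css_class):
    """Return the stripped text of the first <span> with this class, or None."""
    tag = soup.find('span', {'class': css_class})
    return tag.get_text(strip=True) if tag is not None else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
bday = span_text(soup, 'bday')
bplace = span_text(soup, 'birthplace')
missing = span_text(soup, 'deathplace')  # no such span in the sample

print(bday, bplace, missing)
```

Checking for None before reading the text keeps a loop over many pages from stopping at the first page that lacks one of the fields.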




TASK 2: Processing Texts from HTML

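soup = BeautifulSoup(urlopen('[type in your html]').read(), 'html.parser')

# I use Extension of Military and Economic Aid as the example

# First piece of content inside the first <p> tag (assumed here to be plain text)
data = soup.p.contents[0]

# Convert everything to lowercase
data1 = data.lower()

# Replace every non-word character (punctuation, whitespace) with a space
data2 = re.sub(r'\W', ' ', data1)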
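Once the text is lowercased and stripped of punctuation, a natural next step is to split it into words and count them. The following is only a sketch of that idea: the function name tokenize and the sample sentence are my own, not part of the original steps.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text, replace non-word characters with spaces, split into words."""
    lowered = text.lower()
    cleaned = re.sub(r'\W', ' ', lowered)
    # split() with no argument also collapses the runs of spaces left by re.sub
    return cleaned.split()

# Made-up sentence used only to demonstrate the steps
sample = "Aid to Greece, and aid to Turkey!"
tokens = tokenize(sample)
counts = Counter(tokens)

print(tokens)
print(counts['aid'], counts['to'])
```

Note that \W treats the underscore as a word character, so a string such as "foo_bar" stays in one piece.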

Meaningful Practice:

Patrick Perry at NYU Stern has processed the raw text of the Federalist Papers, and a JSON data file is available here.
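To practice on a JSON file, the same cleaning steps can be applied to each record after loading it. This is a rough sketch only: I am assuming a list of records with "title" and "text" fields, which may not match the layout of the actual file, and the sample records below are invented.

```python
import json
import re

# Invented records; the real file's layout and field names may differ
SAMPLE_JSON = '''
[
  {"title": "Paper A", "text": "First sample text."},
  {"title": "Paper B", "text": "Second, slightly longer sample text."}
]
'''

papers = json.loads(SAMPLE_JSON)

# Number of words per record after lowercasing and removing punctuation
word_counts = {}
for paper in papers:
    cleaned = re.sub(r'\W', ' ', paper["text"].lower())
    word_counts[paper["title"]] = len(cleaned.split())

print(word_counts)
```

For a file on disk, json.load on an open file object would replace json.loads on the string.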
