Wikipedia articles contain a lot of text – in some cases so much that many language-processing APIs won’t accept it. AlchemyAPI (now seemingly marketed as part of “IBM Watson”) has an endpoint that parses text from a website, but it only accepts pages up to 600KB (with 50KB of output text). Consequently, it quickly becomes easier to just extract the text yourself.
To do this, I recommend Apache Tika, which includes one of the better libraries available for extracting text, and has every imaginable interface – Java, command line, REST, and a GUI(!).
You only need the Java jar for this:
curl http://apache.spinellicreations.com/tika/tika-app-1.13.jar > tika-app.jar
Tika has a sophisticated system for detecting content types, but it also keys off file extensions, and when I was testing this I found it was more reliable when I specified them:
curl https://en.wikipedia.org/wiki/Barack_Obama > Barack_Obama.html
Invoking Tika is simple:
java -jar tika-app.jar -t data/$1.html > out/$1.txt
Problem is, Wikipedia wraps the article text in a ton of extra content. You could handle this in a few ways – pre-process the HTML to pull out just what you want, customize Tika so it parses out the bits you need (probably a good option if you want just captions or headings), or hack at the output.
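As a sketch of the first option, here is a minimal pre-processor using only Python's standard library. The ParagraphExtractor class and extract_paragraphs helper are my own names, not part of any library, and a real Wikipedia page needs more care (infoboxes, nested markup) than this handles:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text inside <p> tags, ignoring everything else."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting count of currently open <p> tags
        self.paragraphs = []  # finished paragraph strings
        self.chunks = []      # text pieces of the paragraph being built

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        # Keep text only while we are inside a paragraph.
        if self.depth > 0:
            self.chunks.append(data)

def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    return [p for p in parser.paragraphs if p]
```

Run extract_paragraphs() over the saved HTML and navigation menus, sidebars, and link lists fall away, since Wikipedia keeps body prose in paragraph tags.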
For my case I chose the last option. The following script will remove the table of contents, most captions, and the bogus header / footer information that shows up at the end of the file. Tune to your liking (I removed the references as well).
import fileinput
import re

# Article body starts after the "Jump to: navigation, search" line
# and ends at the "Notes and references" heading.
start = re.compile(r"Jump to:.*navigation,.*search")
end = re.compile(r"^Notes and references$")

# Lines to drop: "Main article(s):" / "See also:" cross-references and
# numbered table-of-contents entries like "3.2 Early life".
ignore = re.compile(
    r"^(Main article: .*|"
    r"Main articles: .*|"
    r"See also: .*|"
    r"\s*[0-9]+.[0-9]+ .*|"
    r"\s*[0-9]+.[0-9]+.[0-9]+ .*)\s*$")

# Footnote markers like "[42]".
footnote = re.compile(r"\[[0-9]+\]")

started = False
ended = False
blank = False  # True if the previous line was blank (collapses blank runs)

for line in fileinput.input():
    if re.match(end, line.strip()):
        ended = True
    if started and not ended:
        if not blank or line.strip() != "":
            if not re.match(ignore, line):
                # Keep lines that look like prose (contain a period or are
                # long) plus blank separators; strip footnote markers.
                if "." in line or len(line) > 150 or len(line.strip()) == 0:
                    print(re.sub(footnote, "", line.strip()))
        blank = (line.strip() == "")
    if not started:
        if re.search(start, line) is not None:
            started = True
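To see what the footnote pattern actually strips, here is a quick standalone demo (the sample sentence is made up for illustration):

```python
import re

# The same footnote pattern the script uses: bracketed digits like "[7]".
footnote = re.compile(r"\[[0-9]+\]")

line = "Obama was born in Honolulu, Hawaii.[7][8]"
print(re.sub(footnote, "", line))
# -> Obama was born in Honolulu, Hawaii.
```

Note that only numeric markers match; bracketed text such as "[citation needed]" passes through untouched.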
If you're looking for a Python book, Natural Language Processing with Python is a great way to learn the language while building some really interesting projects.