How can I tokenize a sentence with Python?

Datetime: 2017-04-18 05:49:14 · Topic: Python, Natural Language Processing


How do you tokenize a sentence?

Tokenization breaks a sentence into words and punctuation, and it is the first step in processing text. We will do tokenization in both NLTK and spaCy.

First, we will do tokenization in the Natural Language Toolkit (NLTK).

The result of tokenization is a list of tokens.
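To see what a tokenizer has to do, here is a minimal stdlib-only sketch (the regex and the name naive_tokenize are a toy of my own, not part of NLTK or spaCy; real tokenizers handle contractions, abbreviations, and Unicode far better):

```python
import re

def naive_tokenize(text):
    # Grab runs of word characters, or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note that this toy splits "It's" into ['It', "'", 's'], which is one reason to reach for a real tokenizer.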

from nltk.tokenize import word_tokenize

# If this raises a LookupError, download the tokenizer models first:
# import nltk; nltk.download('punkt')
text1 = "It's true that the chicken was the best bamboozler in the known multiverse."
tokens = word_tokenize(text1)
print(tokens)

Next, we will do tokenization in spaCy (spaCy is a newish Python NLP library with great features).

# spaCy 1.x API; in spaCy 2+ this is: import spacy; parser = spacy.load('en_core_web_sm')
from spacy.en import English
parser = English()
print(parser)

spaCy keeps whitespace tokens, so you have to filter them out.

text1 = "I like statements that are both true and absurd."
tokens = parser(text1)
tokens = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(tokens)

Note for Python 2: spaCy requires Unicode

In Python 2 you need to prefix string literals with u, like this: u"bob".

Or you can convert an existing byte string my_string like this:

my_string_u = my_string.decode('utf-8', errors='ignore')
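In Python 3 the same conversion is a bytes-to-str decode; a small stdlib-only example:

```python
# Decode raw bytes to text, silently dropping sequences that are not valid UTF-8.
raw = b"caf\xc3\xa9 tokens \xff"          # UTF-8 bytes plus one invalid byte
text = raw.decode('utf-8', errors='ignore')
print(text)
```

The invalid trailing byte is simply dropped, so the result is the clean text "café tokens ".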

As an exercise, you can try to generate edge cases like the one below.

textu = "I'm Mr. O'Malley, and I love things, i.e., tacos etc."
tokens = parser(textu)
tokens = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(tokens)

Article image: How can I tokenize a sentence with Python? (source:OReilly).

Tags: Questions
