How to do it...
- Import the sentence tokenizer:
from nltk.tokenize import sent_tokenize
- Tokenize the text into sentences:
tokenize_list_sent = sent_tokenize(text)
print("\nSentence tokenizer:")
print(tokenize_list_sent)
- Tokenize the text into words:
from nltk.tokenize import word_tokenize
print("\nWord tokenizer:")
print(word_tokenize(text))
- Create a WordPunct tokenizer, which splits punctuation into separate tokens:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print("\nWord punct tokenizer:")
print(word_punct_tokenizer.tokenize(text))
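To make the WordPunct behavior concrete, the following is a minimal sketch that imitates it with the standard library only. It assumes the rule that NLTK's WordPunctTokenizer applies, namely the regular expression `\w+|[^\w\s]+`. The helper name `word_punct_tokenize` and the sample sentence are hypothetical and are not part of the recipe.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (that is, punctuation).
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_PUNCT_PATTERN.findall(text)


# Hypothetical sample sentence, not taken from the recipe
sample = "Don't stop, it's fun!"
tokens = word_punct_tokenize(sample)
print(tokens)
# ['Don', "'", 't', 'stop', ',', 'it', "'", 's', 'fun', '!']
```

Note how the apostrophes become tokens of their own, so a contraction such as "Don't" is split into three pieces. This is the main difference to keep in mind when choosing between the word tokenizer and the WordPunct tokenizer.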
The output of the three NLTK tokenizers used in this recipe is shown here. The sentence tokenizer divides the text into sentences, and the word tokenizers divide it into individual tokens:
