Text preprocessing using NLTK in Python

We have learned several string operations in our previous blogs. Proceeding further, we are going to work on some very interesting and useful concepts of text preprocessing using NLTK in Python. Let us first understand the text-processing thought process by observing the following sample text.

sample_text = '''A nuclear power plant is a thermal power station in which
the heat source is a nuclear reactor. As is typical of thermal power 
stations, heat is usssded to generate steam that drives a steam turbine 
connected to aaaaa generator that produces electricity. As of 2018, the 
International Atomic Enertgy Agency reported ther were 450 nuclear power 
reactors in operation in 30 countries.'''

If you want to do text preprocessing using NLTK in Python, what steps come to mind? How do you think a text is processed? Is the whole document processed at once, or is it broken down into individual words? Do you think words like “of”, “the”, and “to” add any value to our text analysis? Do these words provide us with any information? What can be done about them? Can you spot some incorrectly spelled words? Would you like to correct them to improve your text analysis? Give these questions some thought. Now let us work on them one by one.

A machine can process text by breaking it down into smaller structures. Hence, in Text Analytics, we have the Term Document Matrix (TDM) and TF-IDF techniques to process text at the individual-word level. We will deal with TDM, TF-IDF, and many more advanced NLP concepts in future articles. For now, we are going to start our text preprocessing using NLTK in Python with tokenization.

Tokenization –

Tokenization is the process of splitting textual data into smaller and more meaningful components called tokens. The most useful tokenization techniques include sentence and word tokenization. In this, we break down a text document (or corpus) into sentences and each sentence into words.

Sentence Tokenization –

A text corpus can be a collection of paragraphs, where each paragraph can be further split into sentences. We call this sentence segmentation. We can split a sentence by specific delimiters like a period (.) or a newline character (\n) and sometimes even a semicolon (;). Next, we will explain the various techniques the NLTK library provides for sentence tokenization.
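Before turning to NLTK, here is a minimal sketch of delimiter-based splitting using only Python's standard re module; the sample sentence is made up for illustration. Notice how the naive pattern wrongly splits after the abbreviation "Dr." — exactly the kind of case NLTK's trained sentence tokenizers handle better.

```python
import re

# Naive sentence segmentation: split wherever ., ! or ? is followed by whitespace.
text = "Dr. Smith arrived. He sat down! Then he spoke."
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# The abbreviation "Dr." incorrectly starts its own "sentence".
```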

  1. Using the default sentence tokenizer sent_tokenize. The nltk.sent_tokenize(…) function internally uses an instance of the PunktSentenceTokenizer class. Run the following commands and notice how the text is split into sentences in the output.
import nltk
nltk.download('punkt')
nltk.sent_tokenize(sample_text)
  2. We can also use the PunktSentenceTokenizer class directly.
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
punkt_st.tokenize(sample_text)
  3. There is another tokenizer class, RegexpTokenizer, that allows regular expression-based patterns to segment sentences.
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(
                   pattern=SENTENCE_TOKENS_PATTERN,
                   gaps=True)
regex_st.tokenize(sample_text)
All the above-mentioned sentence tokenizers return the same output for our sample text. We can use whichever of them suits a given scenario.

Word Tokenization –

Another form of tokenization is word tokenization. The NLTK library provides many different ways to perform word tokenization on a given text. It is important because word tokenization enables further text cleaning: we can remove stopwords, apply stemming, lemmatization, etc., and thus perform text preprocessing. Let us understand the various word tokenization options the NLTK library provides.
1. The default word_tokenize function internally uses an instance of the TreebankWordTokenizer class. Run the following simple command and observe the output.
nltk.word_tokenize(sample_text)
2. TreebankWordTokenizer uses various regular expressions to tokenize the text. It behaves much like the tokenizer above, since both use the same mechanism. Some of its main features are:
  1. Splits off periods that appear at the end of a sentence
  2. Splits off commas and single quotes when followed by whitespace
  3. Separates most punctuation characters into independent tokens
  4. Splits standard contractions, such as don’t into do and n’t

We are going to use a different text for the next few word tokenizers.

text = ("I stepped out of the house and saw Mother’s grand-uncle, the "
        "white-haired Saifuddin, sitting by his grocery counter and "
        "scanning the market.")
nltk.TreebankWordTokenizer().tokenize(text)
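To build intuition for the contraction and punctuation splitting described above, here is a toy, hypothetical two-rule tokenizer using only the standard re module — not NLTK's actual implementation, which applies many more rules:

```python
import re

def toy_treebank(text):
    # Rule 1: separate common punctuation into its own tokens.
    text = re.sub(r"([.,!?;])", r" \1 ", text)
    # Rule 2: split n't contractions, e.g. don't -> do + n't.
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    return text.split()

print(toy_treebank("Don't stop, it works."))
# -> ['Do', "n't", 'stop', ',', 'it', 'works', '.']
```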
3. The RegexpTokenizer class can also be used to tokenize text based on a desired pattern.
regex_wt = nltk.RegexpTokenizer(pattern=r'\w+', gaps=False)
regex_wt.tokenize(text)
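Since gaps=False keeps the regex matches themselves as the tokens, the same result can be reproduced with the standard library's re.findall — a useful sanity check, shown here on a made-up fragment of our sample sentence:

```python
import re

text = "white-haired Saifuddin, sitting by his grocery counter."
tokens = re.findall(r'\w+', text)
print(tokens)
# -> ['white', 'haired', 'Saifuddin', 'sitting', 'by', 'his', 'grocery', 'counter']
# Note that "white-haired" is split in two, since '-' is not a word character.
```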
4. The WhitespaceTokenizer tokenizes sentences into words based on whitespace characters like tabs, newlines, and spaces. The following snippet demonstrates it.
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(text)
print(words)
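WhitespaceTokenizer behaves much like Python's built-in str.split() with no arguments, which also splits on runs of spaces, tabs, and newlines; the sample string below is made up for illustration:

```python
text = "white-haired\tSaifuddin,\nsitting by   his counter"
tokens = text.split()
print(tokens)
# -> ['white-haired', 'Saifuddin,', 'sitting', 'by', 'his', 'counter']
# Unlike word_tokenize, punctuation stays attached to words ("Saifuddin,").
```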
There are many other libraries, like spaCy and Gensim, that allow us to perform tokenization on text data. Once the words are separated, we need to remove useless words from our corpus.

Stopwords –

Words like a, an, the, of, it, you, you’re, and many more such words are classified as “stopwords”. These words do not add any information to the text and unnecessarily add to computation during text processing. So we proceed to remove these words, hence the name stopwords.

In Python, we have got a couple of libraries that provide us with a list of stop words.

Stopwords text preprocessing using NLTK –

The list of stopwords the NLTK library provides can be printed using the lines of code below.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english'))

So what does our text look like after we remove all these words? Follow the code below and notice in the output which words are left.

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english')) 
word_tokens = word_tokenize(text) 

filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens) 
print(filtered_sentence)
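One caveat worth knowing: NLTK's stopword list is all lowercase, so the filter above is case-sensitive and would keep a capitalized “The”. A minimal sketch with a tiny stand-in stopword set shows the usual fix of lowercasing tokens before comparison:

```python
# Tiny stand-in for the full NLTK stopword list.
stop_words = {"the", "of", "and"}

tokens = ["The", "house", "of", "cards"]
naive = [w for w in tokens if w not in stop_words]          # misses "The"
fixed = [w for w in tokens if w.lower() not in stop_words]  # catches it
print(naive)  # -> ['The', 'house', 'cards']
print(fixed)  # -> ['house', 'cards']
```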
Are you wondering how to remove words in addition to the default list of stopwords? If so, you are thinking in the right direction: we can build our own customized stopword list. This is how we do it.

Customized Stopwords –

As you might have observed, we converted the list of NLTK stopwords to a set. That means we can apply all the available set operations to our stop_words variable. Now I am sure you have an idea of what we are going to do to create our customized list of stopwords. Follow the code below, where we add a couple of words from the text to the stopword set so that they get removed.

new_stopwords = {"Saifuddin", "grocery", "market"}
stop_words.update(new_stopwords)
print("length of new stop_words:", len(stop_words))

Now, if you filter your text with the new set of stopwords, you will get a new output list of words. Do give it a try.
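Here is what that try could look like as a self-contained sketch; the small stopword set below is a hand-picked stand-in for NLTK's full English list, and the tokens are lowercased before matching:

```python
# Hand-picked stand-in for NLTK's English stopword list.
stop_words = {"i", "out", "of", "the", "and", "by", "his"}
# Our custom additions, as in the article.
stop_words.update({"saifuddin", "grocery", "market"})

tokens = "I stepped out of the house and saw the market".lower().split()
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # -> ['stepped', 'house', 'saw']
```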

In the next article, we are going to talk about other text preprocessing concepts using NLTK in Python, like spelling correction, expanding contractions, and removing accented characters. These techniques will help us refine our text before we feed it to machine learning models. Stay tuned. Happy Learning!!!
