Step by Step Guide of Python Regular Expressions – Part 1

Cover image

The very first article in the NLP series introduced Natural Language Processing. We then explained String Operations that can be performed on text using Python. Further, in the quest to strengthen our text pre-processing, we must have a good understanding of Regular Expressions. This article is the first part of a step by step guide of Python Regular expressions.

What is Regular Expression?

Regular Expressions, also called Reg Exs, are a special sequence of characters often denoted using the raw string notation. Reg Ex helps create string patterns and use them for searching and substituting specific pattern matches in textual data. Many other programming languages like Java, Perl, Unix also support regular expression functionality.

Why do we need Regular expression?

It is pertinent if you are wondering what is the need for regular expressions. In our previous article, we explained several basic string operations like split, join, strip, etc. Should not these suffice? Let us look at the below scenarios before answering this question.

Consider a sample text – “N123tur123l l123ngu123ge Processing”. We want to replace the string ‘123’ with ‘a’. A simple string operation code to achieve the task looks like this.

string_sample = "N123tur123l l123ngu123ge Processing"
string_sample.replace("123","a")
Replaced. Done. Great. Now consider another scenario.
The sample text is “N123tur345l l789ngu567ge Processing” where we want to replace numbers with ‘a’ just like the last example. Will the task be possible with a single application of replace method on the string variable? As a solution, one will want to fulfill this task by successively applying the replace method on the string variable. Surely not a good coding practice. What if there is a large volume of text requiring the numeric to alphabetic replacement of some sort? This is where we will need Regular Expression.

Python Regular Expression Package:

The re package in Python provides us with a lot of options to perform text pre-processing. The regularly used methods are compiled, search, match, split, sub, subn, escape, findall, finditer. There is an optional flag that can be provided as an argument in these methods. For example, re.IGNORECASE, re.ASCII, re.Unicode, re.DOTALL, re.MULTILINE, etc. To specify regular expressions, metacharacters are used. We can also utilize special sequences and sets that make commonly used patterns easy to write.

Metacharacters – These are characters with a special meaning. For example  . ^ $ * + ? {} () \ |

Metacharacters - step by step guide of Python Regular Ex[ressions

Special Sequence – A special sequence is a \ followed by one of the characters and has a special meaning. For example \w, \W, \d, \D

Special Sequence of Regular Expressions

Sets – It is a set of characters inside a pair of square brackets []with a special meaning. For example [a-zA-Z], [0-9]

Sets of Regular Expression

We are going to provide some practical scenarios, apply various regex commands, and explain various results.

Example Scenarios to apply Regular Expressions:

I am considering two scenarios related to applying Natural Language Processing. The first is the Sentiment Analysis of COVID19 Tweets. Another scenario is Topic Modelling on movie reviews posted on IMDB. One needs to perform some text cleaning before applying NLP techniques to identify sentiment or for Topic Modelling. Let us now define two string variables that will be used throughout for applying the regular expression commands.

movie_review = """Troy is loosely based on Homer's 'Iliad', Wolfgang Petersen directs this epic war film, with an ensemble cast
including Brad Pitt, Eric Bana, Orlando Bloom, Diane Kruger, Troy Sean Bean, Brian Cox, Rose Byrne, Garrett Hedlund, 
Peter O' Toole, Brendan Gleeson, & Tyler Mane. Troy is about love, power, deceit, valour, glory."""

covid_tweet = """As of today, the cumulative number of confirmed #COVID19 cases is 97 302, 
total number of deaths is 1930 and the total number of recoveries is 51 608."""

The two strings defined above are multi-line strings. Do check the output of movie_review and COVID tweet. First of all, we are going to understand the various methods and the optional flag that can be used in the arguments wherever possible.

re.compile():

This method compiles a specified regular expression pattern into a regular expression object, which can be used for matching and searching. We will create two regular expression objects and use them by applying the other methods.

import re
regex_movie = re.compile('Troy')
regex_covid = re.compile('deaths')
regex_covid

What output do you notice for regex_covid? Make a note of this!!

re.match():

This method is used to match patterns at the beginning of strings. If you pull tweets from Twitter into an excel sheet, you will notice that the twitter handle of the person replied to comes at the beginning of the tweet. This method match can be very useful to match a pattern in that case. In continuation of the above-provided code, we execute the following commands on regex_movie to find a pattern match.

regex_movie.match(movie_review)
re.match(regex_movie, movie_review)
#Both the above commands do the same task. 
Output - <_sre.SRE_Match object; span=(0, 7), match='Loosely'>

#Observe the difference in output for the below given commands. 
(re.match('troy', movie_review, flags= re.IGNORECASE))
(re.match(regex_movie, movie_review, flags= re.IGNORECASE))
The flag IGNORECASE indicates that the match needs to be performed case-insensitively. We can also use re.I also. Further, if we apply match on regex_covid, we get None as an output because the pattern is not available at the beginning of the string. You can try this out yourself.

re.search():

This method will search the regular expression pattern and return the first occurrence.

re.search('DeaThS', covid_tweet, flags= re.IGNORECASE)
Output is <_sre.SRE_Match object; span=(107, 113), match='deaths'>

re.findall():

This method returns all non-overlapping matches of the specified regex pattern in the string.

re.findall(r"^\w", movie_review, flags = re.MULTILINE)
re.findall(r"\d{2}", covid_tweet, flags = re.MULTILINE)

\w, \d, {}, ^ are metacharacters. The flag MULTILINE makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string). We can also use re.M. You might revisit this paragraph once we explain these special characters in our metacharacters section in our second part.

re.finditer():

This method returns all matched instances in the form of an iterator, for a specific pattern in a string when scanned from left to right.

for match in re.finditer('troy', movie_review, flags = re.MULTILINE|re.IGNORECASE):
    s = match.start()
    e = match.end()
    g = match.group()
    print('Found match "{}" ranging from index {} - {}'.format(g,s,e))
Found match "Troy" ranging from index 0 - 4
Found match "Troy" ranging from index 171 - 175
Found match "Troy" ranging from index 275 - 279

The match variable contains a match object. The group()method returns the part of the string where there is a match.
The start()function returns the index of the start of the matched substring.
Similarly, end()returns the end index of the matched substring.

re.sub():

This method substitutes a specified regex pattern in a string with a replacement string. It only substitutes the left-most occurrence of the pattern in the string.

re.sub("deaths", 'deaths 🐍', covid_tweet, flags=re.UNICODE)
re.sub("[0-9]", 'x', covid_tweet, flags=re.IGNORECASE)

The flag UNICODE specifies the Unicode encoding for character classification. We can also use re.U.

re.subn():

This method is similar to re.sub() except it returns a tuple of 2 items containing the new string and the number of substitutions made. re.subn() is very useful for tasks where we require to know how many replacements are done.

re.subn("[0-9]", 'x', covid_tweet, flags=re.IGNORECASE)

re.split():

The re.split method splits the string where there is a match and returns a list of strings where the splits have occurred. If the pattern from where the split is to happen is not found, re.split() returns a list containing the original string. We can pass the maxsplit argument to the re.split() method. It’s the maximum number of splits that will occur, 16 in the example command below.

re.split(',',movie_review,maxsplit=16,flags=re.MULTILINE)

Conclusion:

We have thoroughly explained several regular expression methods and a few useful flag arguments in regular expressions for string manipulations. In the next article, we are going to cover the remaining portions of metacharacters, special sequences, and sets to cover the regular expressions entirely. We will also talk about string formatting to mark the end of basic string operations required for text analytics. Do let us know how helpful these tutorials are turning out for you in the comments section. Stay Tuned. Happy Learning!!!

Leave a Reply

Your email address will not be published. Required fields are marked *