Step By Step Guide of Python Regular Expressions – Part 2



This article continues the first part of our Step by Step guide of Python Regular Expressions. Metacharacters, special sequences, and sets are the essential parts of a regular expression. We are going to walk through working examples of these on the two sample texts considered earlier. The movie review text is for readers working on Topic Modelling. The COVID tweet sample is for NLP enthusiasts working on Sentiment Analysis.

movie_review = """Troy is loosely based on Homer's 'Iliad', Wolfgang Petersen directs this epic war film, with an 
ensemble cast including Brad Pitt, Eric Bana, Orlando Bloom, Diane Kruger, Troy Sean Bean, Brian Cox, Rose Byrne, 
Garrett Hedlund, Peter O' Toole, Brendan Gleeson, & Tyler Mane.Troy is about love, power, deceit, valour, glory."""

covid_tweet = """As of today, the cumulative number of confirmed #COVID19 cases is 97 302, 
total number of #COVID19INDIA deaths is 1930 and the total number of recoveries is 51 608."""

Special Sequence –

A special sequence is a backslash (\) followed by a single character, and the pair has a special meaning. For example \w, \W, \d, and \D.

Special Sequence

Try to observe the information given in the snapshot provided above. Are you able to imagine some use cases for applying these special sequences in creating regular expressions?
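As a quick illustration of the four sequences named above, here is a minimal sketch. The `sample` string is made up for demonstration and is not part of the article's data.

```python
import re

# Hypothetical sample string, used only for this illustration
sample = "Room 42, floor 7!"

digits = re.findall(r"\d", sample)        # every single digit
non_digits = re.findall(r"\D+", sample)   # runs of non-digit characters
words = re.findall(r"\w+", sample)        # runs of letters, digits, or underscores
non_words = re.findall(r"\W+", sample)    # runs of everything else

print(digits)      # ['4', '2', '7']
print(non_digits)  # ['Room ', ', floor ', '!']
print(words)       # ['Room', '42', 'floor', '7']
print(non_words)   # [' ', ', ', ' ', '!']
```

Each upper-case sequence is the complement of its lower-case partner, which is why \d and \D together cover every character of the string.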

Use case 1:

Assume every movie review we pull carries the movie name at its beginning, and we want to get rid of it. The special sequence \A matches only at the start of the string, so we can use it to remove the leading name. Conversely, \Z matches only at the end of the string and is useful when we are looking for a pattern there.

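import re
re.sub(r"\ATroy", "", movie_review).strip()  # remove the movie name at the start of the review
re.findall(r"Troy\Z", movie_review)  # [] here, because the review does not end with "Troy"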
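To see that \A and \Z anchor to the two ends of the string only, the sketch below uses a shortened stand-in for the review (the `review` string is an assumption for illustration, not the article's full text).

```python
import re

# Shortened, hypothetical stand-in for the movie review
review = "Troy is about love. Troy is about glory."

# \A anchors to the very start, so only the first "Troy" is removed
without_title = re.sub(r"\ATroy", "", review).strip()

# \Z anchors to the very end, so this finds "glory." only as the last token
ending = re.findall(r"glory\.\Z", review)

print(without_title)  # is about love. Troy is about glory.
print(ending)         # ['glory.']
```

The second "Troy" survives because it is not at the start of the string, which is the behaviour we want when only the leading movie name should go.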

Use case 2:

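In covid_tweet, there are hashtags that start with #COVID19. If we need to check whether this pattern is present in the tweet, we can apply \B, which matches only where there is no word boundary. The pattern COVID\B therefore finds COVID only when it is immediately followed by another letter, digit, or underscore, as in #COVID19.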

re.findall(r"COVID\B",covid_tweet)
re.sub(r"COVID\B","COVID",covid_tweet)  # the replacement equals the match, so the tweet is returned unchanged
If you are wondering about the letter ‘r’ before the pattern strings, it marks a raw string, so Python passes backslashes such as the one in \B to the regex engine unchanged. Please visit this blog to know more. Now let us proceed with use cases for other special sequences.
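The following sketch shows why the raw-string prefix matters and how \B differs from its counterpart \b. The `text` string is a made-up example.

```python
import re

plain = "\b"   # without the prefix, Python reads this as one backspace character
raw = r"\b"    # raw string: a backslash followed by the letter b

print(len(plain), len(raw))  # 1 2

# Hypothetical text containing COVID both inside a hashtag and as a standalone word
text = "#COVID19 cases and COVID deaths"

# \B: no word boundary after COVID, so more word characters must follow
inside_word = re.findall(r"COVID\B\w+", text)
# \b: a word boundary must follow COVID
whole_word = re.findall(r"COVID\b", text)

print(inside_word)  # ['COVID19']
print(whole_word)   # ['COVID']
```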

Use case 3:

Numbers in a tweet usually carry little weight in Sentiment Analysis. To remove them, we can use \d, which matches any digit, as in the code below. To find every non-word character (anything other than a letter, a digit, or an underscore) at any stage of text cleaning, we can use \W.

re.sub(r"\d","",covid_tweet).strip()
re.findall(r"\W",covid_tweet)
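Here is a small sketch of both calls on a shortened, hypothetical tweet, so the results are easy to check by eye. Note that the underscore counts as a word character and is therefore not matched by \W.

```python
import re

# Shortened, hypothetical tweet used only for illustration
tweet = "Confirmed #COVID19 cases: 97 302"

no_digits = re.sub(r"\d", "", tweet).strip()   # drop every digit, then trim the ends
non_word = re.findall(r"\W", tweet)            # spaces and punctuation

print(no_digits)  # Confirmed #COVID cases:
print(non_word)   # [' ', '#', ' ', ':', ' ', ' ']

# The underscore is a word character, so only the hyphen is reported here
print(re.findall(r"\W", "a_b-c"))  # ['-']
```

Removing digits also turns #COVID19 into #COVID, so in practice it may be worth handling hashtags before stripping numbers.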

Metacharacters and Sets –

These are characters with a special meaning that make regular expressions even more powerful. For example ^ $ * + ? {} [] () \ | .

Metacharacters - Step by Step Guide of Python Regular Expressions

Metacharacters impart even more flexibility and greater control over pattern identification. We will need to implement sets as well while generating regular expressions using metacharacters. Let us look at some interesting scenarios.

Use case 1:

If you look at a collection of tweets related to COVID, you will find many hashtags, such as #COVID19, #COVID2020, #COVID19INDIA, #COVID19PANDEMIC, and many more. To identify the various hashtags a tweet might contain, use the metacharacters [], {}, and + together with the sets 0-9 and A-Z. A snapshot of sets is provided below.

Sets of Regular Expression

re.sub(r"(#COVID[0-9]{2}[A-Z]+)|(#COVID[0-9]{2})","",covid_tweet)
The + sign matches one or more occurrences of the preceding expression and is greedy, meaning it consumes as many characters as it can. In the above code, the + applies to [A-Z], so it matches a run of capital letters. The {2} specifies that the preceding expression, i.e. a digit, must occur exactly two times. Now that you have understood the above use case, can you come up with a better regular expression for it?
re.findall(r"(#[A-Z0-9a-z]+)",covid_tweet)
Interesting and powerful, isn't it? But how does it work? The [A-Z0-9a-z] part is a set that matches any single digit or upper- or lower-case letter, and you already know what the + sign does. Overall, the regular expression searches for a # followed by one or more digits or letters of either case.
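To compare the two approaches side by side, the sketch below runs them on a made-up string of hashtags. The narrow pattern is a simplified variant of the one above: it uses [A-Z]* instead of the alternation, which is an assumption made here to keep findall returning plain strings.

```python
import re

# Hypothetical collection of hashtags, used only for illustration
tweets = "#COVID19 #COVID2020 #COVID19INDIA #StaySafe"

# Exactly two digits after #COVID, then zero or more capital letters
narrow = re.findall(r"#COVID[0-9]{2}[A-Z]*", tweets)

# A # followed by one or more letters or digits
broad = re.findall(r"#[A-Za-z0-9]+", tweets)

print(narrow)  # ['#COVID19', '#COVID20', '#COVID19INDIA']
print(broad)   # ['#COVID19', '#COVID2020', '#COVID19INDIA', '#StaySafe']
```

Because {2} demands exactly two digits, the narrow pattern captures only #COVID20 from #COVID2020, while the broad pattern picks up every hashtag, including ones unrelated to COVID.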

Use case 2:

In the movie_review string, there are various names like “Wolfgang Petersen”. Is there a way these names can be extracted? Give it a try before looking at the code below. Do let us know if you have a better solution.

re.findall("[A-Z][a-z]+ [A-Z][a-z]+",movie_review)
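The same pattern can be tried on a short excerpt to see both what it catches and where it falls short. The `excerpt` string below is a condensed, hypothetical version of the review.

```python
import re

# Condensed, hypothetical excerpt of the movie review
excerpt = "Wolfgang Petersen directs a cast including Brad Pitt, Troy Sean Bean, Peter O' Toole."

# Two capitalised words separated by a single space
names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", excerpt)

print(names)  # ['Wolfgang Petersen', 'Brad Pitt', 'Troy Sean']
```

The pattern is a rough heuristic: it picks up "Troy Sean" instead of "Sean Bean", and it misses "Peter O' Toole" because of the apostrophe, so the results usually need a manual check.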

Use case 3:

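There are several special characters in movie_review, such as ' and ,. To remove them from the text, we can use a negated set, following the [^arn] example in the sets snapshot provided above: a ^ right after the opening bracket matches any character not listed in the set. The code below replaces every character that is not a digit or a letter with a space.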
re.sub("([^0-9A-Za-z])", " ", movie_review)
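Since every special character becomes a space, the result can contain runs of spaces. The sketch below, on a made-up `text` string, adds an optional second step that collapses them; that extra step is a suggestion here, not part of the original code.

```python
import re

# Hypothetical fragment with apostrophes, a comma, an ampersand, and a full stop
text = "Homer's 'Iliad', & more."

# Replace every character that is not a digit or a letter with a space
cleaned = re.sub(r"[^0-9A-Za-z]", " ", text)

# Optional extra step: collapse repeated whitespace and trim the ends
collapsed = re.sub(r"\s+", " ", cleaned).strip()

print(collapsed)  # Homer s Iliad more
```

Note that the apostrophe in "Homer's" is also replaced, leaving a stray "s", so this kind of cleaning trades some accuracy for simplicity.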

Format Method –

Another useful string operation in Python is the format() method. It formats the specified value(s), inserts them into the string's placeholders, and returns the formatted string. A placeholder is defined using curly brackets: {}. The following are some types of formatting we can perform.

Single Formatting:

The {} will be replaced with the string “Enthusiast” in the below command.

# default arguments
print("Hello {}, apply Decision Tree algorithm.".format("Enthusiast"))

Multiple Formatting:

We also have the flexibility to provide several substitutions in a string. This is done by using multiple curly braces and mapping them with the required values. The following are some example commands.

# default arguments
print("Hello {}, apply {} algorithm.".format("Enthusiast", "Decision Tree"))

By default, the braces are replaced with the values in the order in which they are passed.

# positional arguments
print("Hello {0}, apply {1} algorithm.".format("Enthusiast", "Decision Tree"))

The positional arguments give us better control over where we want to replace the values. We are using positions 0 and 1 in this example.
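As a small sketch of that extra control, positional indices can appear in any order and can be repeated. The `template` string below is made up for illustration.

```python
# Hypothetical template: index 1 is used before index 0, and index 0 is used twice
template = "Apply {1} algorithm, {0}. Good luck, {0}!"
message = template.format("Enthusiast", "Decision Tree")

print(message)  # Apply Decision Tree algorithm, Enthusiast. Good luck, Enthusiast!
```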

# keyword arguments
print("Hello {name}, apply {algo} algorithm.".format(name="Enthusiast", algo="Decision Tree"))

Keyword arguments are matched by the names used inside the braces, as shown above.

# Mixed arguments
print("Hello {0}, apply {algo} algorithm.".format("Enthusiast", algo="Decision Tree"))

We can also combine positional and keyword arguments in a single format command, provided the positional arguments come first.

Many other string operations are explained in our article Basic String Operations. We hope that by now you are comfortable and confident with regular expressions and string operations. We highly encourage you to practice these as much as possible. In our next article, we will cover the NLP concept of text tokenization, comprising sentence and word tokenization. Stay tuned. Happy Learning!!!

