Ivy Jul 18, 2020 No Comments
This article is in continuation of our first part of a Step by Step guide of Python Regular Expressions. The meta-characters, special sequences, and sets are the essential parts of a regular expression. We are going to talk about working examples of these on the two-sample text data considered earlier. The movie review text is considered if someone is working on Topic Modelling. The Covit tweet sample is for NLP enthusiasts working on Sentiment Analysis.
movie_review = """Troy is loosely based on Homer's 'Iliad', Wolfgang Petersen directs this epic war film, with an ensemble cast including Brad Pitt, Eric Bana, Orlando Bloom, Diane Kruger, Troy Sean Bean, Brian Cox, Rose Byrne, Garrett Hedlund, Peter O' Toole, Brendan Gleeson, & Tyler Mane.Troy is about love, power, deceit, valour, glory.""" covid_tweet = """As of today, the cumulative number of confirmed #COVID19 cases is 97 302, total number of #COVID19INDIA deaths is 1930 and the total number of recoveries is 51 608."""
A special sequence is a \
followed by one of the characters and has a special meaning. For example \w, \W, \d, \D
Try to observe the information given in the snapshot provided above. Are you able to imagine some use cases for applying these special sequences in creating regular expressions?
Assume every movie review pulled is carrying the movie name at the beginning of the review. The obvious choice would be to get rid of the movie’s name. We will apply a special sequence \A to write the code. On the contrary, \Z can be used if we are looking for a pattern at the end of our string.
re.sub("\ATroy","",movie_review).strip() re.findall("Troy\Z","",movie_review).strip()
In the covid_tweet, there are hashtags with #COVID19. If there is a requirement of identifying the presence of this pattern in the tweet, we can apply \B
re.findall(r"COVID\B",covid_tweet) re.sub(r"COVID\B","COVID",covid_tweet)
Usually, any number present in a tweet is of no importance while performing Sentiment Analysis of tweets. To remove those, we can apply the below code using \d. To find all the non-numerical and non-alphabetical characters at any stage in text cleaning, we can use \W.
re.sub(r"\d","",covid_tweet).strip() re.findall(r"\W",covid_tweet)
These are characters with a special meaning and makes the application of regular expressions even more powerful. For example ^ $ * + ? {} () \ | .
Metacharacters impart even more flexibility and greater control over pattern identification. We will need to implement sets as well while generating regular expressions using metacharacters. Let us look at some interesting scenarios.
If you notice a collection of tweets related to COVID, you will find many hashtags. Like #COVID19, #COVID2020, #COVID19INDIA, #COVID19PANDEMIC, and many more. To identify all the various types of such hashtags a tweet might comprise of, use [],{},+ metacharacters including sets 0-9, A-Z. A snapshot of sets is provided below.
re.sub(r"(#COVID[0-9]{2}[A-Z]+)|(#COVID[0-9]{2})","",covid_tweet)
re.findall(r"(#[A-Z0-9a-z]+)",covid_tweet)
In the movie_review string, there are various names like “Wolfgang Petersen”. Is there a way these names can be extracted? Give it a try before looking at the code below. Do let us know if you have a better solution.
re.findall("[A-Z][a-z]+ [A-Z][a-z]+",movie_review)
re.sub("([^0-9A-Za-z])", " ", movie_review)
Another interesting and useful string operation used in Python is the Format method applied to strings. It is used to format the specified value(s), insert them inside the string’s placeholder, and return the formatted string. The placeholder is defined using curly brackets: {}. The following are some types of formatting we can perform.
The {} will be replaced with the string “Enthusiast” in the below command.
# default arguments print("Hello {}, apply Decision Tree algorithm.".format("Enthusiast"))
We also have the flexibility to provide several substitutions in a string. This is done by using multiple curly braces and mapping them with the required values. The following are some example commands.
# default arguments print("Hello {}, apply {} algorithm.".format("Enthusiast", "Decision Tree"))
By default, the command will replace braces with values in the sequence.
# positional arguments print("Hello {0}, apply {1} algorithm.".format("Enthusiast", "Decision Tree"))
The positional arguments give us better control over where we want to replace the values. We are using positions 0 and 1 in this example.
# keyword arguments print("Hello {name}, apply {algo} algorithm.".format(name="Enthusiast", algo="Decision Tree"))
keyword arguments go by variable names as shown above.
# Mixed arguments print("Hello {0}, apply {algo} algorithm.".format("Enthusiast", algo="Decision Tree"))
Needs no explanation. We can use positional as well as keyword arguments in a single format command.
Many other string operations are explained in our article Basic String Operations. We hope by now you are comfortable and confident with Regular expressions and String Operations. It is highly encouraged that you practice these as much as possible. We are going to cover the NLP concept named Text tokenization comprising of the sentence and word tokenization in our next article. Stay tuned. Happy Learning!!!