Rounak Jain, Jun 17, 2020
Hello NLP learner! I hope you have read our last article, Introduction to Natural Language Processing using Python. From it, you will have gathered that Natural Language Processing essentially deals with text analytics. As a next step, we need to learn how to work with text, in other words strings, before we plunge into advanced NLP techniques. This article covers useful string operations in Python for NLP beginners. Get ready for some facts about strings and some interesting string operations!
In the below code, we assign some value to the String1 variable. Let us see what happens when we try to assign a new value to a particular character index in the variable.
String1 = 'Natural Language Processing'
String1[0] = 'X'  # the index 0 is illustrative; any valid index raises the same error
The code throws the following error.
TypeError: 'str' object does not support item assignment
This means we cannot modify a string in place: Python strings are immutable. What happens when we assign a new value to the variable? Even then, Python creates a completely new string object internally and points the variable at it. We can observe this through the different id values produced when a new value is assigned to the same variable. Follow the code below and note the different id values (the exact numbers will differ on your machine).
String1 = 'Natural Language Processing'
print(id(String1))
output is 140713608669024
String1 = 'NLP'
print(id(String1))
output is 140713609660712
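The immutability described above can be seen from another angle with a minimal sketch. The variable names text, modified and assignment_failed are illustrative and do not come from the article; the idea is simply that a "changed" string is always a new object built from pieces of the old one.

```python
# Strings are immutable: "changing" one really builds a new string object.
text = 'Natural Language Processing'

# Item assignment is rejected by Python with a TypeError.
try:
    text[0] = 'X'
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print('TypeError:', error)

# To get a modified string, build a new one from pieces of the old one.
modified = 'X' + text[1:]
print(modified)          # Xatural Language Processing
print(text)              # Natural Language Processing (the original is unchanged)
print(modified is text)  # False, they are two separate objects
```

If you need many such edits, collecting the pieces in a list and joining them at the end is usually the tidier approach.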
The following are the common types of string literals used in Python.
We can define strings enclosed in single quotes (') as well as double quotes ("). For example, "Python" or 'Python'.
We can also define strings enclosed in three single quotes (''') or three double quotes ("""). For example, """Python 3.8 is the latest version""" or '''I am a very long string'''. Triple-quoted strings may span several lines, and you will come across this way of defining strings at times.
Escape sequences start with a backslash (\) followed by a character. Commonly used escape sequences are '\n' for a newline character and '\t' for a tab.
We can create objects of the bytes data type using byte strings. For example, bytes('Python', 'utf-8') or b'Python'. Note that in Python 3, bytes() needs an encoding when it is given a string. You might come across byte strings when converting a data frame to a csv.
Unicode strings are denoted by the u'….' notation. They can hold non-ASCII characters as well as ASCII ones. In Python 3.x versions, all strings are Unicode strings by default, so the u prefix is optional. For example, the string 'Hèllo' can be written as u'H\xe8llo'.
We can create raw strings to keep a string in its native form. In a raw string, backslashes are treated as literal characters, so escape sequences are not interpreted. An example is r'Python'.
#Understanding how escape sequences behave in a normal string
escape_strings = "C:\new_directory\temporary_folder\File.txt"
print(escape_strings)
The output is
C:
ew_directory    emporary_folder\File.txt
(\n became a line break and \t became a tab)
#Understanding how the raw string makes a difference in the interpretation of the string
escape_strings = r"C:\new_directory\temporary_folder\File.txt"
print(escape_strings)
The output is
C:\new_directory\temporary_folder\File.txt
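To tie the literal types above together, here is a small sketch that compares them side by side. The sample values and variable names (normal, raw, multi_line, encoded) are assumptions chosen for illustration, not taken from the article.

```python
# Escape sequences count as single characters in a normal string.
normal = "a\tb\nc"
raw = r"a\tb\nc"
print(len(normal))  # 5: a, tab, b, newline, c
print(len(raw))     # 7: each backslash is kept as its own character

# A triple-quoted string can span several lines.
multi_line = """first line
second line"""
print(multi_line.splitlines())  # ['first line', 'second line']

# Byte strings: encoding a str gives bytes, decoding gives the str back.
word = 'H\u00e8llo'             # the same text as 'Hèllo'
encoded = word.encode('utf-8')
print(encoded)                  # b'H\xc3\xa8llo'
print(encoded.decode('utf-8'))  # Hèllo
print(bytes('Python', 'utf-8') == b'Python')  # True
```

Comparing the lengths is a quick way to check whether Python interpreted a backslash sequence or kept it literally.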
Let us begin string operations by working on the reversal of strings as our first interesting operation. It is very easy to reverse a string in Python. Have a look at the code below.
string_variable = "I love Pizza 🍕!"
string_reverse = string_variable[::-1]
print(string_reverse)
Output is !🍕 azziP evol I
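Slicing with [::-1] is not the only way to reverse text. As a rough sketch, the snippet below uses the built-in reversed function together with join, and also reverses the order of words rather than characters. The sentence and variable names are made up for this example.

```python
sentence = "I love Pizza"

# reversed() yields the characters from last to first; join glues them back together.
reversed_chars = ''.join(reversed(sentence))
print(reversed_chars)  # azziP evol I

# Reverse the order of the words instead of the characters.
reversed_words = ' '.join(sentence.split()[::-1])
print(reversed_words)  # Pizza love I
```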
Further, the operations most frequently performed on strings fall into a few categories, which the rest of this article walks through.
The common operations we would like to explain here are string concatenation, finding substrings, and finding lengths and characters. We often need to concatenate strings or string variables when manipulating URLs during web scraping, and Python's string concatenation operations come in handy there. Let us check some code for it.
#concatenating various strings
'https://www.skillenable.com/' + 'SkillEnableLoanPortal/' + 'isa'
output is 'https://www.skillenable.com/SkillEnableLoanPortal/isa'
#concatenating various strings without a '+' operator
'https://www.skillenable.com/' 'SkillEnableLoanPortal/' 'isa'
output is 'https://www.skillenable.com/SkillEnableLoanPortal/isa'
We can also perform the concatenation of string variables with string literals. During web scraping, we generate additional pages by adding a string literal to the base page link. An example scenario and its code are provided below.
Assume we are browsing the Ivy Professional School official blog site. The web link is https://ivyproschool.com/blog/. This is our base page. If we scroll down the page, we find pagination where we can move to the next page by clicking the '2' button. The click takes us to the next page, whose URL becomes https://ivyproschool.com/blog/page/2/. We can generate the URLs of further pages by performing concatenation as shown in the code below.
base_url = 'https://ivyproschool.com/blog/'
#base_url is a string variable, 'page/2/' is a string literal. We concatenate these with a + operator in between
new_url = base_url + 'page/2/'
print(new_url)
output is https://ivyproschool.com/blog/page/2/
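The same concatenation can be repeated in a loop to build several page URLs at once. This is only a sketch: the page range 2 to 4 is an assumption made for illustration, and the code does not check whether those pages actually exist on the site.

```python
base_url = 'https://ivyproschool.com/blog/'

# Build the URL for each page number; str() is needed because
# a string cannot be concatenated with an integer directly.
page_urls = []
for page_number in range(2, 5):  # pages 2, 3 and 4 (assumed range)
    page_urls.append(base_url + 'page/' + str(page_number) + '/')

for url in page_urls:
    print(url)
```

Forgetting the str() call is a common slip here; Python raises a TypeError when a string and an integer are joined with +.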
Time for some exercise. Tell us in the comment section the outputs when you run the below codes.
1) new_url = base_url 'page/2/'
2) String1 = 'Ivy Professional School'
   String2 = 'Data Science'
   (String1 + String2)*3
3) len(String1)
A very simple way to check whether a substring is present in a string variable is the in operator, as in the code below. The output is a Boolean True or False; it is True in this example.
'Ivy' in String1
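Beyond a plain membership test, a few related checks are often useful. The sketch below assumes String1 holds 'Ivy Professional School' as in the exercise above; the other variable names are illustrative. It shows that the in operator is case-sensitive, and uses the find and count string methods to locate and count substrings.

```python
String1 = 'Ivy Professional School'

is_present = 'Ivy' in String1                    # True
case_sensitive = 'ivy' in String1                # False, the check is case-sensitive
case_insensitive = 'ivy' in String1.lower()      # True after lowercasing the text

# find returns the index where the substring starts, or -1 if it is absent.
school_position = String1.find('School')         # 17
missing_position = String1.find('Python')        # -1

# count returns the number of non-overlapping occurrences.
letter_o_count = String1.count('o')              # 4

print(is_present, case_sensitive, case_insensitive)
print(school_position, missing_position, letter_o_count)
```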
Other useful string operations in Python for NLP beginners are indexing and slicing. We know that strings are sequences of characters and are iterable, just like lists. Accessing a single character using a specific position, or index, in the string is called indexing. Accessing a part of a string, i.e., a substring, using a start and end index is called slicing.
Please Note: Python provides 2 different ways to index the characters in a string. One starts from 0 at the beginning of the string and increases by 1 for each character until the end. The other starts from -1 at the end of the string and decreases by 1 for each character until the beginning.
Let us take a look at an indexing code snippet for a string variable. It uses the built-in enumerate function to print each character along with its index.
String1 = 'Natural Language Processing'
for index, character in enumerate(String1):
    print(character, 'has index', index)
Experiment with other ways of indexing and understand the outputs from the codes given below.
1) String1[-5]
2) String1
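The two indexing directions described above can be compared directly. In this sketch the chosen positions (8 and -10) and the variable names are illustrative; the point is that a negative index -k refers to the same character as the positive index len(String1) - k, and that an index past the end raises an IndexError.

```python
String1 = 'Natural Language Processing'
length = len(String1)                    # 27 characters

first_of_language = String1[8]           # 'L', counting from 0 at the start
first_of_processing = String1[-10]       # 'P', counting from -1 at the end
same_character = String1[length - 10]    # 'P', the positive equivalent of -10

# An index outside the string is an error, not an empty result.
try:
    String1[length]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first_of_language, first_of_processing, same_character, out_of_range)
```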
What if someone just wants to fetch the word 'Processing' from the String1 variable, or just a part of it? How can this be done? This is where slicing comes into the picture. Some slicing code snippets are provided below. We encourage you to run them and observe the outputs.
String1[:]
String1[::1]
String1[0::1]
String1[::]
String1[8:16]
String1[2::2]
String1[-2::]
String1[-2::-1]
String1[-2:] + String1[5:]
String1[-2] + String1
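To answer the question of fetching the word 'Processing', here is one possible sketch. The index values are worked out for this particular string, and the variable names are illustrative. The last part uses the find method so that the position does not have to be counted by hand.

```python
String1 = 'Natural Language Processing'

last_word = String1[17:]        # from index 17 to the end
same_word = String1[-10:]       # the last 10 characters
part_of_word = String1[17:24]   # indices 17 to 23; the end index is excluded

# Avoid hard-coded positions: find where the word starts, then slice.
word = 'Processing'
start = String1.find(word)
found = String1[start:start + len(word)]

print(last_word)      # Processing
print(same_word)      # Processing
print(part_of_word)   # Process
print(found)          # Processing
```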
A huge list of methods that we can apply on strings is available in this link. We have also explained a few in our previous article, Top 15 interesting tricks every Python Beginner must know, for example split, join, capitalize, reverse, swapcase, replace, and title. Additionally, some more methods commonly applied in text analytics are provided below.
We use the strip method to remove leading and trailing whitespace from a string. As a practical scenario, this is very helpful when cleaning the text of tweets collected from Twitter.
String1 = " Natural Language Processing "
String1.strip()
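The strip method has two one-sided relatives, lstrip and rstrip, and all three accept an optional set of characters to remove. The sketch below uses a made-up tweet-like string to show the differences; the sample text and variable names are assumptions for illustration.

```python
tweet = "  \n#NLP is fun!!  \t"

both_sides = tweet.strip()     # removes spaces, tabs and newlines at both ends
left_only = tweet.lstrip()     # removes them at the start only
right_only = tweet.rstrip()    # removes them at the end only

# Passing characters to strip removes those characters instead of whitespace.
no_symbols = both_sides.strip('#!')

print(repr(both_sides))   # '#NLP is fun!!'
print(repr(left_only))    # '#NLP is fun!!  \t'
print(repr(right_only))   # '  \n#NLP is fun!!'
print(repr(no_symbols))   # 'NLP is fun'
```

Note that strip only touches the two ends of the string; whitespace in the middle is left alone.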
We often need to check whether a string variable or literal consists only of alphabetic characters, alphanumeric characters, or decimal digits. We apply isalpha, isalnum, or isdecimal respectively to find that out. The output is a Boolean True or False value.
'11111'.isdecimal()
'123ab'.isalnum()
'NLP'.isalpha()
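In text analytics these checks are typically applied to many tokens at once. As a rough sketch, the snippet below runs all three methods over a small list of made-up tokens and stores the results; the token list and the results dictionary are assumptions for illustration.

```python
tokens = ['NLP', '2020', 'Python3', 'e-mail', '']

# Map each token to its (isalpha, isalnum, isdecimal) results.
results = {}
for token in tokens:
    results[token] = (token.isalpha(), token.isalnum(), token.isdecimal())

for token, checks in results.items():
    print(repr(token), checks)

# Keep only the purely alphabetic tokens.
words_only = [token for token in tokens if token.isalpha()]
print(words_only)  # ['NLP']
```

The hyphen makes 'e-mail' fail all three checks, and an empty string also returns False for each of them.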