Rounak Jain Feb 28, 2020
Web scraping is a widely used technique that has been around for a couple of decades. We typically use Python with libraries such as Beautiful Soup, Selenium and Scrapy, or R with the rvest package; other web scraping tools are also available. A detailed explanation of web scraping with Python is provided in the article Web Scraping using Beautiful Soup. Building on that, this article explains XPath and how it can be used with Selenium in Python to navigate an HTML document and reach the target nodes.
Let us consider the website of The Telegraph, a very popular Indian English-language daily newspaper: https://www.telegraphindia.com/
We use the Selenium WebDriver to navigate through the Telegraph website using XPath.
The full form of XPath is XML Path Language. It is a widely used web technology and a W3C standard. Concise XPath expressions let us access precise information in an XML/HTML document.
First of all, let us see how we can find the XPath of any element. We are using Mozilla Firefox as our browser. Our target element is the “Opinion” section on the web page. The steps to select the XPath for this element are as follows:
The path shown above is called the absolute path: the full path across the nodes to reach the target node. An absolute path starts with a single ‘/’, which means the search starts from the root element. Another way to represent the location is a relative path, a shorter path to the target node that starts with ‘//’ and can begin matching anywhere in the document. The relative path for the ‘Opinion’ element is //section/div/div/div/div/a. Here we have replaced the leading ‘/html/body’ steps with ‘//’ to shorten the path.
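The difference between the two forms can be checked offline. The following is a minimal sketch that assumes the third-party lxml package (not used in the article) and a made-up page fragment standing in for the real Telegraph markup; Selenium would evaluate the same expressions inside the browser.

```python
from lxml import html

# Hypothetical fragment; the real Telegraph markup is not reproduced here.
SAMPLE = """
<html><body>
  <section><div><div><div><div>
    <a href="/opinion">Opinion</a>
  </div></div></div></div></section>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Absolute path: every step from the root element is spelled out.
absolute = doc.xpath("/html/body/section/div/div/div/div/a")

# Relative path: '//' lets the match start at any depth in the document.
relative = doc.xpath("//section/div/div/div/div/a")

absolute_text = [a.text for a in absolute]
relative_text = [a.text for a in relative]
print(absolute_text, relative_text)
```

Both expressions reach the same link in this sample. The relative form keeps working if wrapper elements are later added above the section, which is one reason it tends to be preferred for scraping.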
There are other ways to represent an XPath using the Selenium selector. Let us take a look at them one by one.
If you take a look at the screenshot above, we aim at identifying all the ‘a’ elements whose attribute class = ‘fourLine’. The XPath command for these elements is as follows.
//a[@class='fourLine']
The command means we are looking for the element ‘a’ whose attribute class has exactly the value ‘fourLine’.
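To illustrate that an attribute test with ‘=’ compares the whole attribute value, here is a small sketch. It assumes the third-party lxml package and an invented set of links; the class name ‘fourLine bold’ is hypothetical and only there to show a near miss.

```python
from lxml import html

# Hypothetical links; only the first has class exactly equal to 'fourLine'.
SAMPLE = """
<html><body>
  <a class="fourLine" href="/a">First</a>
  <a class="threeLine" href="/b">Second</a>
  <a class="fourLine bold" href="/c">Third</a>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# '=' compares the full attribute string, so 'fourLine bold' is not a match.
matches = doc.xpath("//a[@class='fourLine']")
exact_titles = [a.text for a in matches]
print(exact_titles)
```

When an element carries several class names, the exact comparison fails, which is where the ‘contains’ function becomes useful.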
Syntax: //tag[contains(@attribute, ‘value’)]
Another XPath Selenium locator which is very useful in identifying attributes is ‘contains’. Let us see how to use this command with an example.
In the above screenshot, we want to locate all the ‘ul’ tags whose class attribute contains ‘headImage’ in its value. The XPath command is as follows. (We are going to use this scenario at other places in the article as well.)
//ul[contains(@class,'headImage')]
Syntax: //tag[starts-with(@attribute, ‘value’)]
This XPath Selenium locator is very useful when the attribute value is too long or its ending changes dynamically. Using the same scenario as in the previous point, we can create an XPath with ‘starts-with’ that matches only the beginning of the class value.
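The two string functions can be compared side by side. This is a rough sketch under stated assumptions: it uses the third-party lxml package, the list markup is invented, and the prefix ‘listing’ passed to starts-with is chosen here for illustration rather than taken from the article.

```python
from lxml import html

# Hypothetical lists with different class values.
SAMPLE = """
<html><body>
  <ul class="listing-withImage headImage"><li>A</li></ul>
  <ul class="listing-plain"><li>B</li></ul>
  <ul class="sidebar headImage-small"><li>C</li></ul>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# contains(): the substring may appear anywhere in the attribute value.
with_contains = doc.xpath("//ul[contains(@class,'headImage')]")

# starts-with(): the attribute value must begin with the given prefix.
with_prefix = doc.xpath("//ul[starts-with(@class,'listing')]")

contains_classes = [ul.get("class") for ul in with_contains]
prefix_classes = [ul.get("class") for ul in with_prefix]
print(contains_classes)
print(prefix_classes)
```

Note that contains() is a plain substring test, so it also matches ‘headImage-small’; whether that is desirable depends on the page being scraped.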
Syntax: //tag[XPath Statement-1]//tag[XPath Statement-2]
Multiple XPath statements can be chained with // (double slash) to reach deeper into nested nodes and attributes with more precision. Going back to the same scenario, we want to identify all the links (‘a’ elements that have an href attribute) nested inside the ul element. The XPath command is as follows.
//ul[@class ='listing-withImage headImage']//a[@href]
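As a sketch of how the chained expression narrows the result, the example below runs it against an invented fragment using the third-party lxml package (an assumption; the article itself uses Selenium). The story URLs are hypothetical.

```python
from lxml import html

# Hypothetical markup: one target list, one unrelated list.
SAMPLE = """
<html><body>
  <ul class="listing-withImage headImage">
    <li><a href="/india/story-1">Story 1</a></li>
    <li><a href="/world/story-2">Story 2</a></li>
    <li><a name="anchor-only">No link</a></li>
  </ul>
  <ul class="other">
    <li><a href="/sports/story-3">Story 3</a></li>
  </ul>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# First step selects the list; second step selects any nested <a> with an href.
anchors = doc.xpath("//ul[@class ='listing-withImage headImage']//a[@href]")
links = [a.get("href") for a in anchors]
print(links)
```

The anchor without an href and the link in the unrelated list are both filtered out.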
Syntax: //tag[XPath Statement-1 or XPath Statement-2]
Another flexibility provided by the XPath Selenium locator is that we can combine two XPath statements with an ‘or’ operator. An element is matched when at least one of the two statements is true. An example command to identify ‘a’ elements whose class attribute has the value ‘fourLine’ or ‘threeLine’ is as follows.
//a[@class='fourLine' or @class='threeLine']
Syntax: //tag[XPath Statement-1 and XPath Statement-2]
With the ‘and’ operator, an element is matched only when both XPath statements are true. An example command prepared for this website is as follows.
//a[@class='threeLine' and contains(@href,'/india/')]
With this, we are trying to identify all the ‘a’ elements that contain an attribute named class with value ‘threeLine’ and another attribute href which contains a text ‘/india/’.
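The behaviour of the two boolean operators can be seen on a small invented sample. The sketch assumes the third-party lxml package; the links and the class name ‘twoLine’ are hypothetical.

```python
from lxml import html

# Hypothetical links covering the combinations of class and href.
SAMPLE = """
<html><body>
  <a class="fourLine" href="/india/one">One</a>
  <a class="threeLine" href="/india/two">Two</a>
  <a class="threeLine" href="/world/three">Three</a>
  <a class="twoLine" href="/india/four">Four</a>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# 'or': keep the element if at least one condition holds.
either = doc.xpath("//a[@class='fourLine' or @class='threeLine']")

# 'and': keep the element only if both conditions hold.
both = doc.xpath("//a[@class='threeLine' and contains(@href,'/india/')]")

either_text = [a.text for a in either]
both_text = [a.text for a in both]
print(either_text)
print(both_text)
```

Only the second link satisfies both conditions, while the first three satisfy at least one of the class tests.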
We can locate nodes based on numerical functions such as ‘position’. For the Telegraph website, we can use any tag and ask for the desired position (provided that an element exists at that position). For example, for the ul tag that we saw in a previous example, we can point to every ul tag that is the first ul child of its parent div tag in the tree structure of the HTML document.
We can use another function, ‘last’, to identify all the ‘ul’ tags that are positioned last among the ul children of their parent.
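Since the original commands for this section are not shown, the following is only a sketch of how position() and last() might be written. It assumes the third-party lxml package, and both the expressions and the id values are illustrative choices, not the author's.

```python
from lxml import html

# Hypothetical layout: one div with three lists, one div with a single list.
SAMPLE = """
<html><body>
  <div id="top">
    <ul id="u1"></ul><ul id="u2"></ul><ul id="u3"></ul>
  </div>
  <div id="bottom">
    <ul id="u4"></ul>
  </div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# First <ul> child of each <div>.
first_lists = doc.xpath("//div/ul[position()=1]")

# Last <ul> child of each <div>.
last_lists = doc.xpath("//div/ul[last()]")

first_ids = [ul.get("id") for ul in first_lists]
last_ids = [ul.get("id") for ul in last_lists]
print(first_ids)
print(last_ids)
```

The position is counted per parent, so the single list in the second div is both first and last.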
‘count’ is a very useful function that returns the number of nodes selected by an expression; applied to @*, it counts the attributes a tag has. For example, if we want to identify the element ‘a’ inside all the div tags that have more than 5 attributes, we can use the command //div[count(./@*)>5]//a.
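A quick way to see count() at work is to run it on two div tags with different numbers of attributes. This sketch assumes the third-party lxml package and invented markup; the attribute names are hypothetical.

```python
from lxml import html

# Hypothetical markup: the first div has 6 attributes, the second has 2.
SAMPLE = """
<html><body>
  <div id="rich" class="card" data-a="1" data-b="2" data-c="3" data-d="4">
    <a href="/inside">Inside</a>
  </div>
  <div id="plain" class="card">
    <a href="/outside">Outside</a>
  </div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# ./@* selects every attribute of the current div; count() returns how many.
anchors = doc.xpath("//div[count(./@*)>5]//a")
anchor_text = [a.text for a in anchors]
print(anchor_text)
```

Only the link inside the attribute-rich div is returned.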
Another interesting way to create an XPath is by using node relations, which follow the family tree analogy. A relation is written as node1/relation::node2. In other words, any node maintains several relations with other nodes, for example ancestor, child, descendant, parent and sibling. We are going to mention commands for some of these relations.
With the child relation, we point to all the ‘ul’ nodes that are children of the ‘div’ tags having more than 5 attributes.
With the ancestor relation, the command //div[count(./@*)>5]//ancestor::div points to the div tags that are ancestors of nodes found at or under the ‘div’ tags having more than 5 attributes.
With the preceding relation, the command //div//preceding::li locates the ‘li’ tags that appear before nodes under a div tag in document order.
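The three relations can be tried on one small tree. This is a hedged sketch assuming the third-party lxml package; the markup and id values are invented, and the expressions use a single ‘/’ before the relation and an explicit id for the preceding example, which are simplifications of the commands discussed in the article.

```python
from lxml import html

# Hypothetical tree: the inner div has 6 attributes, the outer div has 1.
SAMPLE = """
<html><body>
  <div id="outer">
    <ul id="menu"><li id="first">One</li><li id="second">Two</li></ul>
    <div id="inner" class="c" title="t" lang="en" dir="ltr" data-x="1">
      <ul id="inner-list"><li id="third">Three</li></ul>
    </div>
  </div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# child:: selects direct <ul> children of the attribute-rich div.
children = doc.xpath("//div[count(@*)>5]/child::ul")

# ancestor:: selects <div> elements that enclose the attribute-rich div.
ancestors = doc.xpath("//div[count(@*)>5]/ancestor::div")

# preceding:: selects <li> elements that end before the inner div starts.
preceding = doc.xpath("//div[@id='inner']/preceding::li")

child_ids = [el.get("id") for el in children]
ancestor_ids = [el.get("id") for el in ancestors]
preceding_ids = sorted(el.get("id") for el in preceding)
print(child_ids, ancestor_ids, preceding_ids)
```

The preceding relation skips ancestors of the context node, so it returns the earlier list items but not the enclosing div.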
Selenium is a Web Browser Automation Tool. With the use of Selenium, we can browse a website just as a human would. We can click buttons, automate logins, give search text inputs and perform automation of several testing tasks as well.
We have discussed many different ways to prepare an XPath command pointing to specific nodes and attributes. However, the picture is incomplete without knowing which command extracts the elements the XPath points to. The Selenium package provides commands to extract either a single element or multiple elements, in the form of a list, from the XPath location.
The commands of interest here are find_element_by_xpath, which returns the first matching element, and find_elements_by_xpath, which returns a list of all matching elements. Note that these helper methods were deprecated in Selenium 4 and removed in later 4.x releases; the equivalent calls there are driver.find_element(By.XPATH, ...) and driver.find_elements(By.XPATH, ...), which are used below. Assuming we are working with the Chrome browser and have ChromeDriver installed, the complete commands for most of the scenarios discussed above look like this.
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.telegraphindia.com/")
driver.find_elements(By.XPATH, "//section/div/div/div/div/a")
driver.find_elements(By.XPATH, "//a[@class='fourLine']")
driver.find_elements(By.XPATH, "//ul[contains(@class,'headImage')]")
driver.find_elements(By.XPATH, "//ul[@class ='listing-withImage headImage']//a[@href]")
driver.find_elements(By.XPATH, "//a[@class='threeLine' and contains(@href,'/india/')]")
driver.find_elements(By.XPATH, "//div[count(./@*)>5]//a")
driver.find_elements(By.XPATH, "//div[count(./@*)>5]//ancestor::div")
driver.find_elements(By.XPATH, "//div//preceding::li")
With Selenium WebDriver, Python gives us a powerful way to automate web scraping. We hope this explanation of how to use XPath for web scraping with Selenium is easy to understand and proves useful. Do let us know in the comment section if there is anything specific you are looking for related to Python, Selenium or web scraping, and we can cover it in another blog. On a separate note, in case you are wondering which is the better programming language between R and Python, we suggest you visit another blog, R vs Python, to get a better understanding of the two languages.