How to use XPath for Web Scraping with Selenium




Web scraping is a widely used technique that has been around for a couple of decades. It is typically done in Python with libraries such as Beautiful Soup, Selenium and Scrapy, or in R with the rvest package; other web scraping tools are also available. The article Web Scraping using Beautiful Soup explains in detail how to scrape the web with Python. Building on that, this article explains XPath and how to use it with Selenium in Python to navigate an HTML document and reach the target nodes.

Our Target Website –

Let us consider the website of The Telegraph, a very popular Indian English-language daily newspaper: https://www.telegraphindia.com/

What Library are we using –

We are using the robust Selenium web driver to navigate through the Telegraph website using XPath.

What is XPath –

XPath stands for XML Path Language. It is a widely used web technology and a W3C standard. With concise XPath expressions we can pinpoint precise information in an XML or HTML document.

Understanding the structure of XPath –

First of all, let us see how we can find the XPath of any element. We are using Mozilla Firefox as our browser. Our target element is the ‘Opinion’ section on the web page. The steps to obtain the XPath for this element are as follows –

  1. Right-click on the ‘Opinion’ section on the web page
  2. Select Inspect Element
  3. The HTML code for this element will be highlighted in blue
  4. Right-click on the highlighted portion again
  5. Click on Copy and select the XPath option
  6. Paste this into a text file. The path reads /html/body/section/div[1]/div[2]/div[2]/div/a

XPath Opinion element

The path shown above is called the absolute path: the detailed path across every node from the root to the target node. An absolute path starts with a single ‘/’, which means the search starts from the root element. Another way to represent it is a relative path, in which we construct a shorter path to the target node. The relative path for ‘Opinion’ is //section/div[1]/div[2]/div[2]/div/a. Here we have replaced the leading ‘/html/body’ with ‘//’, which matches a section element at any depth in the document, to shorten the path.
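To make the difference concrete, here is a minimal sketch using Python's standard-library ElementTree on a tiny, made-up stand-in for the page (the markup below is an assumption, not the real Telegraph HTML). ElementTree supports only a subset of XPath and expects paths relative to an element, so the full path is written from the root element with a leading ‘.’.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the real page structure.
PAGE = """
<html>
  <body>
    <section>
      <div>
        <div>Home</div>
        <div><a href="/opinion">Opinion</a></div>
      </div>
    </section>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Full path, step by step from the root element
# (the ElementTree spelling of /html/body/section/div[1]/div[2]/a).
by_full_path = root.find("./body/section/div[1]/div[2]/a")

# Shorter path: look for a <section> at any depth, then walk down
# (the ElementTree spelling of //section/div[1]/div[2]/a).
by_short_path = root.find(".//section/div[1]/div[2]/a")

# Both expressions reach the very same node.
print(by_full_path.text, by_full_path is by_short_path)
```

The shorter form keeps working if wrapper elements are later added above the section, which is one reason relative paths tend to be preferred for scraping.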

There are other ways to represent an XPath using the Selenium selector. Let us take a look at them one by one.

Tag – Attribute – Value Trio

Syntax: //tag[@attribute='value']

Fourlines

If you take a look at the screenshot above, we aim at identifying all the ‘a’ elements with the attribute class = ‘fourLine’. The XPath command for these elements is as follows.

[php] //a[@class='fourLine'] [/php]

The command means we are looking for ‘a’ elements whose class attribute has exactly the value ‘fourLine’.
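The same expression can be tried offline with the standard-library ElementTree, which understands this tag-attribute-value form. The snippet below is a sketch on invented markup; the hrefs and link texts are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Made-up markup with the class names used in this article.
SNIPPET = """
<div>
  <a class="fourLine" href="/india/story-1">Story one</a>
  <a class="threeLine" href="/world/story-2">Story two</a>
  <a class="fourLine" href="/sports/story-3">Story three</a>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# ElementTree paths are relative to an element, hence the leading '.'.
four_line_links = root.findall(".//a[@class='fourLine']")
hrefs = [a.get("href") for a in four_line_links]
print(hrefs)
```

Note that the comparison is an exact string match on the whole attribute value, so an element with class="fourLine big" would not be selected.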

Contains

Syntax: //tag[contains(@attribute, 'value')]

Another XPath Selenium locator which is very useful in identifying attributes is ‘contains’. Let us see how to use this command with an example.

XPath Selenium locator contains example

In the above screenshot, we want to locate all the ‘ul’ tags whose class attribute contains ‘headImage’ in its value. The XPath command is as follows. (We are going to use this scenario in other places in the article as well.)

[php]  //ul[contains(@class,'headImage')]  [/php]

Starts With

Syntax: //tag[starts-with(@attribute, 'value')]

This XPath Selenium locator is very useful when the attribute value is too long or its ending changes dynamically. Using the same scenario as in the previous point, we can create an XPath with ‘starts-with’ as follows.

[php]  //ul[starts-with(@class,'listing')]  [/php]
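The standard-library ElementTree does not implement contains() or starts-with(), so the sketch below emulates the two predicates in plain Python to show which attribute values each one accepts. The markup and class names other than ‘listing-withImage headImage’ are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup; only the first class value comes from the article.
SNIPPET = """
<div>
  <ul class="listing-withImage headImage"><li>one</li></ul>
  <ul class="listing-plain"><li>two</li></ul>
  <ul class="footer headImage-small"><li>three</li></ul>
</div>
""".strip()

uls = ET.fromstring(SNIPPET).findall(".//ul")

# Emulates //ul[contains(@class,'headImage')]: substring anywhere in the value.
contains_match = [
    ul.get("class") for ul in uls if "headImage" in ul.get("class", "")
]

# Emulates //ul[starts-with(@class,'listing')]: value must begin with the text.
starts_with_match = [
    ul.get("class") for ul in uls if ul.get("class", "").startswith("listing")
]

print(contains_match)
print(starts_with_match)
```

Because contains() is a plain substring test, it also accepts ‘footer headImage-small’; that is worth keeping in mind when class names share a common fragment.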

Chained Declarations

Syntax: //tag[XPath Statement-1]//tag[XPath Statement-2]

Multiple XPath expressions can be chained with // (double slash) to narrow the search and reach deeper into nested nodes and attributes. Going back to the same scenario, we want to identify all the links (‘a’ elements carrying an href attribute) anywhere inside the ul element. The XPath command is as follows.

[php]  //ul[@class='listing-withImage headImage']//a[@href]  [/php]
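This chained form is also within the XPath subset that ElementTree supports, so it can be checked offline. The following is a sketch on invented markup (the list items and hrefs are assumptions), showing that the second // reaches links at any depth below the matching ul and that [@href] skips anchors without a target.

```python
import xml.etree.ElementTree as ET

# Made-up markup: links at different depths, one without href, one in another list.
SNIPPET = """
<div>
  <ul class="listing-withImage headImage">
    <li><a href="/india/story-1">Story one</a></li>
    <li><span><a href="/world/story-2">Story two</a></span></li>
    <li><a>No link target</a></li>
  </ul>
  <ul class="other">
    <li><a href="/ads/promo">Promo</a></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# First step picks the <ul>, second step searches everything beneath it.
links = root.findall(".//ul[@class='listing-withImage headImage']//a[@href]")
hrefs = [a.get("href") for a in links]
print(hrefs)
```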

Operator Or

Syntax: //tag[XPath Statement-1 or XPath Statement-2]

A      B      Result
False  False  No Element
True   False  Returns A
False  True   Returns B
True   True   Returns Both

Another flexibility provided by the XPath Selenium locator is that we can combine two XPath conditions with an ‘or’ operator. The result follows the table above. An example command to identify ‘a’ elements whose class attribute has the value ‘fourLine’ or ‘threeLine’ is as follows.

[php]  //a[@class='fourLine' or @class='threeLine']  [/php]

Operator And

Syntax: //tag[XPath Statement-1 and XPath Statement-2]

A      B      Result
False  False  No Element
True   False  No Element
False  True   No Element
True   True   Returns Both

The output for each combination of the two conditions’ boolean values is given in the table above. An example command prepared for this website is as follows.

[php]  //a[@class='threeLine' and contains(@href,'/india/')]  [/php]

With this, we are trying to identify all the ‘a’ elements that have a class attribute with the value ‘threeLine’ and an href attribute that contains the text ‘/india/’.
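ElementTree cannot evaluate ‘or’ and ‘and’ inside a predicate, so this sketch emulates both operators with ordinary Python conditions on invented markup. It is only meant to show which elements each combination lets through; the link texts A to D are hypothetical labels.

```python
import xml.etree.ElementTree as ET

# Made-up links covering the combinations of the two conditions.
SNIPPET = """
<div>
  <a class="fourLine" href="/india/story-1">A</a>
  <a class="threeLine" href="/india/story-2">B</a>
  <a class="threeLine" href="/world/story-3">C</a>
  <a class="twoLine" href="/india/story-4">D</a>
</div>
""".strip()

anchors = ET.fromstring(SNIPPET).findall(".//a")

# Emulates //a[@class='fourLine' or @class='threeLine']
either = [
    a.text
    for a in anchors
    if a.get("class") == "fourLine" or a.get("class") == "threeLine"
]

# Emulates //a[@class='threeLine' and contains(@href,'/india/')]
both = [
    a.text
    for a in anchors
    if a.get("class") == "threeLine" and "/india/" in a.get("href", "")
]

print(either)
print(both)
```

Link D satisfies only the href condition and link C only the class condition, so neither passes the ‘and’ filter.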

XPath And operator

Numerical Predicate

Syntax: //tag[position()=1]

We can locate nodes with numerical functions such as ‘position’, shown in the syntax above. For the Telegraph website, we can use any tag and ask for the desired position (provided an element exists at that position). For example, for the ul tag from the previous examples, we can select every ul that is the first ul child of its parent div, wherever that occurs in the HTML tree.

[php]  //div/ul[position()=1]  [/php]

We can use another function, ‘last’, to identify every ‘ul’ tag that is the last ul child of its parent.

[php]  //ul[last()]  [/php]
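Positions are counted per parent, not across the whole document, which the following sketch on invented markup illustrates. ElementTree accepts the shorthand [1], which in XPath means the same as [position()=1], as well as [last()]; the id values are assumptions added so the results are easy to read.

```python
import xml.etree.ElementTree as ET

# Made-up structure: two parent divs with a different number of lists each.
SNIPPET = """
<section>
  <div>
    <ul id="a"/>
    <ul id="b"/>
  </div>
  <div>
    <ul id="c"/>
  </div>
</section>
""".strip()

root = ET.fromstring(SNIPPET)

# [1] is shorthand for [position()=1]: the first <ul> under each <div>.
first_uls = [ul.get("id") for ul in root.findall(".//div/ul[1]")]

# [last()]: the last <ul> under each parent.
last_uls = [ul.get("id") for ul in root.findall(".//ul[last()]")]

print(first_uls)
print(last_uls)
```

The list with id ‘c’ appears in both results because it is the only ul under its parent, so it is first and last at the same time.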

‘Count’ is a very useful function that counts the nodes in a node set; here we use it in a predicate to count the attributes a tag carries. For example, if we want to identify the ‘a’ elements inside all the div tags that have more than 5 attributes, we can use the command below.

[php]  //div[count(./@*)>5]//a  [/php]
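ElementTree has no count() function, so the sketch below emulates the predicate with len() on each element's attribute dictionary. The markup and attribute names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup: the first div carries six attributes, the second only two.
SNIPPET = """
<body>
  <div id="hero" class="row" role="banner" lang="en" dir="ltr" title="Top story">
    <p><a href="/india/story-1">Story one</a></p>
  </div>
  <div id="footer" class="row">
    <a href="/about">About</a>
  </div>
</body>
""".strip()

root = ET.fromstring(SNIPPET)

# Emulates //div[count(./@*)>5]//a
# (a real XPath engine would also de-duplicate links found under nested matching divs).
hrefs = [
    a.get("href")
    for div in root.iter("div")
    if len(div.attrib) > 5      # count(./@*) > 5
    for a in div.iter("a")      # //a anywhere below that div
]
print(hrefs)
```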

Node Relations –

Another interesting way to create an XPath is by using node relations (called axes in the XPath standard), which follow the family tree analogy. They are written as node1/relation::node2. In other words, any node maintains several relations with other nodes, for example ancestor, child, descendant, parent and sibling. Below are commands for some of these relations.

[php]  //div[count(./@*)>5]/child::ul  [/php]

This command points to all the ‘ul’ nodes that are a direct child of a ‘div’ tag having more than 5 attributes. Note the single slash before the relation: with a double slash the relation would also be applied to every descendant of the div.

[php]  //div[count(./@*)>5]/ancestor::div  [/php]

The above command points to all the div tags that are an ancestor of a ‘div’ tag having more than 5 attributes.

[php]  //div/preceding::li  [/php]

Here we locate all the ‘li’ (list item) tags that appear before a div tag in the document.
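To see what these relations mean on a concrete tree, the sketch below walks a small invented document by hand. ElementTree does not support the ancestor or preceding relations, so they are emulated with a child-to-parent lookup table and the document order; the helper name ancestors and all id values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Made-up tree: an outer div with a list, and an inner div with another list.
SNIPPET = """
<div id="outer">
  <ul id="menu"><li>first</li></ul>
  <div id="inner">
    <ul id="news"><li>second</li></ul>
  </div>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# ElementTree nodes do not know their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Yield parent, grandparent, ... up to the root (nearest first)."""
    while node in parent_of:
        node = parent_of[node]
        yield node


# child::ul of the outer div: direct children only.
child_uls = [ul.get("id") for ul in root.findall("./ul")]

# ancestor::div of the 'news' list.
news = root.find(".//ul[@id='news']")
ancestor_divs = [n.get("id") for n in ancestors(news) if n.tag == "div"]

# preceding::li of the inner div: earlier in the document, ancestors excluded.
inner = root.find(".//div[@id='inner']")
doc_order = list(root.iter())
inner_ancestors = set(ancestors(inner))
preceding_li = [
    n.text
    for n in doc_order[: doc_order.index(inner)]
    if n.tag == "li" and n not in inner_ancestors
]

print(child_uls, ancestor_divs, preceding_li)
```

The ‘news’ list is not a child of the outer div, only a descendant, which is why it is missing from the first result.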

What is Selenium –

Selenium is a Web Browser Automation Tool. With the use of Selenium, we can browse a website just as a human would. We can click buttons, automate logins, give search text inputs and perform automation of several testing tasks as well.

Selenium XPath Commands –

We have discussed many different ways to build an XPath command pointing to specific nodes and attributes. However, this is incomplete without knowing which command extracts the elements the XPath points to. The Selenium package provides commands to extract a single element, or multiple elements in the form of a list, from the XPath location. To name a few, the commands are

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_elements_by_name
  • find_elements_by_xpath

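The commands of interest to us are find_element_by_xpath and find_elements_by_xpath. Note that these find_element_by_* helpers were removed in Selenium 4.3; on current versions the same lookups are written as driver.find_element(By.XPATH, ...) and driver.find_elements(By.XPATH, ...). Assuming we are working with the Chrome browser and have ChromeDriver installed, our complete commands for most of the scenarios discussed above look like this.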

[php]

# driver is a Selenium WebDriver instance with the Telegraph page already loaded

driver.find_elements_by_xpath("//section/div[1]/div[2]/div[2]/div/a")

driver.find_elements_by_xpath("//a[@class='fourLine']")

driver.find_elements_by_xpath("//ul[contains(@class,'headImage')]")

driver.find_elements_by_xpath("//ul[@class='listing-withImage headImage']//a[@href]")

driver.find_elements_by_xpath("//a[@class='threeLine' and contains(@href,'/india/')]")

driver.find_elements_by_xpath("//div[count(./@*)>5]//a")

driver.find_elements_by_xpath("//div[count(./@*)>5]/ancestor::div")

driver.find_elements_by_xpath("//div/preceding::li")

[/php]
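On Selenium 4.3 and later the find_elements_by_xpath helper no longer exists, and the lookup goes through driver.find_elements with a locator strategy instead. The sketch below shows one possible way to wrap that call; collect_hrefs and main are hypothetical names, and it assumes Chrome plus a matching driver are available on the machine. The page structure may have changed since this article was written, so the XPath may need adjusting.

```python
def collect_hrefs(driver, xpath):
    """Return the href of every element matched by `xpath`."""
    # "xpath" is the string value behind Selenium's By.XPATH constant,
    # so this is the same as driver.find_elements(By.XPATH, xpath).
    elements = driver.find_elements("xpath", xpath)
    return [element.get_attribute("href") for element in elements]


def main():
    # Imported here so collect_hrefs can be reused without a browser installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are set up
    try:
        driver.get("https://www.telegraphindia.com/")
        xpath = "//a[@class='threeLine' and contains(@href,'/india/')]"
        for href in collect_hrefs(driver, xpath):
            print(href)
    finally:
        driver.quit()  # always close the browser, even if the lookup fails

# Call main() to run the lookup against the live site.
```

Keeping the lookup in a small function that only needs an object with a find_elements method makes it easy to try the logic with a stand-in driver before pointing it at a real browser.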

Conclusion –

Python, together with the powerful Selenium WebDriver, lets us automate web scraping. We hope this explanation of how to use XPath for web scraping with Selenium is easy to understand and proves useful. Do let us know in the comment section if there is anything specific you are looking for related to Python, Selenium or web scraping; we can come up with another blog post on the topic. On a separate note, in case you are wondering which is the better programming language between R and Python, we suggest you visit our other blog post, R vs Python, to get a better understanding of the two languages.

