Decathlon Product Capturing using Python (Beautifulsoup)

Hi everyone, in this article I want to show you how to capture products or other details from any website. There are many ways to do this, but as you know Python is a very popular and really simple programming language, so I'll use Python. Before I start, I should say that the operations I will do touch on the more advanced parts of Python. You may also need to know some basic HTML.

In this example, I will take the products on Decathlon and turn them into a table. You can use any Python IDE; I will use Jupyter.

Let's add the 4 different libraries that we will use in this example. These libraries allow us to retrieve data from Decathlon and send HTTP requests to the website.

from urllib.request import urlopen    # standard-library URL opener
import requests                       # sends HTTP requests to the website
from bs4 import BeautifulSoup as bts  # parses the HTML we get back
import pandas as pd                   # builds the final table (DataFrame)

Let’s go to the decathlon.com website and see how to get the data. https://www.decathlon.com/collections/camp-hike-day-packs


Let's right-click on any product and then click the area indicated by the arrow. A panel will then open on the right or left side, as in the picture below.

The place indicated by the red arrow is the HTML tag of each product indicated by the blue arrow. These HTML tags are common to every product on the page, so we do not have to change tags constantly while transferring the data to the table.
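Before writing the full scraper, it can help to print the HTML of a single product tile and confirm the class names. Below is a minimal sketch that assumes the page markup still matches the screenshots, with de-ProductTile-info marking each product's info block:

import requests
from bs4 import BeautifulSoup as bts

url = "https://www.decathlon.com/collections/camp-hike-day-packs"
soup = bts(requests.get(url).text, "html.parser")

# print the first product tile's HTML to verify the tag and class names
tile = soup.find("section", {"class": "de-ProductTile-info"})
if tile is not None:
    print(tile.prettify())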

In the simplest case, we want to get the names of all the products on the first page, so let's find the HTML tag around the title.

We need to fetch all the HTML code of the website with a request. Let's write a small function for this and parse the page with BeautifulSoup.

def getAndParseURL(url):
    # request the page and parse the returned HTML with BeautifulSoup
    result = requests.get(url)
    soup = bts(result.text, 'html.parser')
    return soup

Get all the headings with BeautifulSoup's findAll function. The part we want to access is the heading text, so the findAll call must be placed inside a for loop to read the text of every match.

a = getAndParseURL("https://www.decathlon.com/collections/camp-hike-day-packs")
for i in a.findAll("h4", {"class": "de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2"}):
    print(i.text)
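If the long class string feels fragile, BeautifulSoup also accepts CSS selectors through its select method. The sketch below assumes that the de-ProductTile-title class on its own is enough to identify the product headings:

# same idea with a shorter CSS selector; assumes de-ProductTile-title
# alone uniquely marks the product title headings
for h in a.select("h4.de-ProductTile-title"):
    print(h.text.strip())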

Now let's write the code that collects the details of every product and transfers them to a DataFrame with pandas.

TITLE = []
RATING = []
REVIEW = []
PRICE = []
DETAILS = []

for i in a.findAll("section", {"class": "de-ProductTile-info"}):
    TITLE.append(i.find("h4", {"class": "de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2"}).text.strip())
    RATING.append(i.find("span", {"class": "de-u-hiddenVisually"}).text.split(":")[-1].split()[0])
    REVIEW.append(i.find("span", {"class": "de-u-textMedium de-u-textSelectNone de-u-textDarkGray de-u-textShrink2 de-u-md-textShrink1"}).text.split()[0])
    PRICE.append(i.find("span", {"class": "js-de-ProductTile-currentPrice"}).text)
    DETAILS.append(i.find("footer", {"class": "de-ProductTile-footer"}).text.strip())


In this section, we have appended the details of all the products to lists. Next, we combine these lists into a DataFrame.

zipped = list(zip(TITLE, RATING, REVIEW, PRICE, DETAILS))
df = pd.DataFrame(zipped, columns=["TITLE", "RATING", "REVIEW", "PRICE", "DETAILS"])
df["DETAILS"] = df["DETAILS"].replace('\n', '', regex=True)

In the last line we removed the newline characters left over in the details column with a regex replace.

The final DataFrame will look as in the picture. If we want, we can export it with the to_csv command and open the resulting file in Excel.
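As a minimal example, the export could look like the sketch below; the file name decathlon_day_packs.csv is just an illustrative choice, and to_excel is an alternative if a native Excel workbook is preferred (it requires the openpyxl package):

# write the table to a CSV file that Excel can open; index=False drops the row index
df.to_csv("decathlon_day_packs.csv", index=False)

# alternative: write a real Excel workbook instead (needs openpyxl installed)
# df.to_excel("decathlon_day_packs.xlsx", index=False)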

See you in another article.

About Deniz Parlak

Hi, I'm a Security Data Scientist & Data Engineer at My Security Analytics. I have experience with advanced Python, machine learning, and big data tools. I have also worked on Oracle database administration, migration, and upgrade projects. For your questions: [email protected]
