Web scraping is an automatic method for extracting large amounts of data from websites. Its applications include email gathering, price comparison, job listings, research and development, and dataset collection.
The data on websites is unstructured. Web scraping collects this unstructured data and gives it a structured form.
To find out whether a website allows web scraping, check the website’s “robots.txt” file.
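The robots.txt check can also be done in code. Below is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the helper name `is_allowed` are made up for illustration; for a live site you would instead point the parser at the real file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /search
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/search"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/checkout/cart"))
```

A robots.txt file only states the site's wishes for crawlers; it is still worth reading the website's terms of use before scraping it.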
Let us see how to extract data from the Flipkart website using Python. We will use Selenium, BeautifulSoup, and Pandas.
!apt update
!apt install chromium-chromedriver
!pip install selenium
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
driver = webdriver.Chrome(options=options)
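# open the search results URL in the headless browser
driver.get("https://www.flipkart.com/search?q=best%20laptops%20under%2080000&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off")
content = driver.page_source  # HTML of the rendered page
driver.quit()  # close the browser once the HTML has been captured
soup = bs(content, 'html.parser')  # name the parser explicitly to avoid a bs4 warning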
products = []           # product names
discounted_prices = []  # prices after discount
discounts = []          # discount offered on each product
# each product card is a div of class _2kHMtA; the name, price and discount
# are nested inside it in divs with their own class names
for a in soup.find_all('div', attrs={'class': '_2kHMtA'}):
    name = a.find('div', attrs={'class': '_4rR01T'})
    discounted_price = a.find('div', attrs={'class': '_30jeq3 _1_WHN1'})
    discount = a.find('div', attrs={'class': '_3Ay6Sb'})
    # skip cards missing any field; find() returns None in that case
    if name is None or discounted_price is None or discount is None:
        continue
    products.append(name.text)
    discounted_prices.append(discounted_price.text)
    discounts.append(discount.text)
df = pd.DataFrame({'Product Name':products,'Discounted_price':discounted_prices,'Discounts':discounts})
df.to_csv('products.csv', index=False, encoding='utf-8')
df.head()
Here the code was run in Google Colab, which is why we had to configure the webdriver first. On a local machine, using the webdriver is a simpler procedure.
The code above extracts data from the website. The data we want is nested in tags, so we find the div tags with the respective class names, extract their text, and store it in lists.
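To see the nested lookup in isolation, here is a small sketch that runs BeautifulSoup on a hand-written HTML snippet instead of a live page. The markup, the class names (`card`, `title`, `price`, `offer`), and the helper `text_or_none` are all made up for illustration; a real page will use different class names, and they can change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with made-up class names; the second card has no offer.
SAMPLE_HTML = """
<div class="card">
  <div class="title">Laptop A</div>
  <div class="price">₹74,990</div>
  <div class="offer">12% off</div>
</div>
<div class="card">
  <div class="title">Laptop B</div>
  <div class="price">₹65,490</div>
</div>
"""


def text_or_none(parent, class_name):
    """Return the stripped text of the first matching child div, or None."""
    tag = parent.find("div", attrs={"class": class_name})
    return tag.get_text(strip=True) if tag else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
# the outer loop finds each card; the inner lookups search only inside that card
for card in soup.find_all("div", attrs={"class": "card"}):
    rows.append({
        "name": text_or_none(card, "title"),
        "price": text_or_none(card, "price"),
        "discount": text_or_none(card, "offer"),
    })

print(rows)
```

Returning `None` for a missing tag keeps every row the same shape, so one product without a discount does not stop the whole run.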
We can then store the extracted data in a CSV file using the following code:
df = pd.DataFrame({'Product Name':products,'Discounted_price':discounted_prices,'Discounts':discounts})
df.to_csv('products.csv', index=False, encoding='utf-8')
In the saved CSV file, we can see each product’s name, discounted price, and discount.
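The scraped values are saved as text, so a further step towards structured data is to turn them into numbers. The sketch below assumes prices look like `₹74,990` (whole amounts with comma separators) and discounts look like `12% off`. Both formats and the helper names `parse_price` and `parse_discount` are assumptions for illustration, so check them against your own CSV before relying on them.

```python
import re


def parse_price(text: str) -> int:
    """Convert a price string such as '₹74,990' to the integer 74990."""
    # drop everything that is not a digit (currency symbol, commas, spaces)
    digits = re.sub(r"[^\d]", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


def parse_discount(text: str) -> int:
    """Convert a discount string such as '12% off' to the integer 12."""
    match = re.search(r"(\d+)\s*%", text)
    if match is None:
        raise ValueError(f"no percentage found in {text!r}")
    return int(match.group(1))


print(parse_price("₹74,990"))
print(parse_discount("12% off"))
```

With a DataFrame like the one above, these helpers could be applied to a column with `df['Discounted_price'].map(parse_price)`.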