I'm considering buying a new notebook for my data science projects, so I decided to turn the idea into yet another project!
My web scraper searches, compares, and analyzes different laptops, then lists them in a table for easy comparison. It crawls through Amazon search results, collecting details on each laptop: price, user reviews, and hardware configuration.
The official Amazon website: https://amazon.com
# Imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
import lxml  # parser backend used by BeautifulSoup
The web scraper was built by inspecting the source code at: https://www.amazon.com/s?k=laptops
The relevant information was extracted from the first five result pages.
# Function that defines a header for the connection with amazon.com and makes a request to extract the data
def make_request(num_page):
    # Browser-like headers, so amazon.com doesn't reject the connection as insecure
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
               "Accept-Encoding": "gzip, deflate",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
               "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
    # Request the desired page of results
    request = requests.get('https://www.amazon.com/s?k=laptops&page=' + str(num_page), headers=headers)
    # Then extract the content
    content = request.content
    return content
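Amazon frequently answers scrapers with a 503 or a captcha page, which would otherwise be parsed silently as an empty result. A minimal sketch of a more defensive variant (the helper names `build_search_url` and `make_request_checked` are mine, not from the scraper above; the URL builder is split out so it can be checked without a network call):

```python
import requests

def build_search_url(num_page):
    # Pure URL construction, separated from the network call
    return 'https://www.amazon.com/s?k=laptops&page=' + str(num_page)

def make_request_checked(num_page, timeout=10):
    # Hypothetical wrapper: same idea as make_request, but it fails loudly
    # on blocked responses instead of returning an error page to the parser
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) "
                             "Gecko/20100101 Firefox/66.0"}
    response = requests.get(build_search_url(num_page), headers=headers, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx instead of parsing junk
    return response.content

print(build_search_url(2))  # https://www.amazon.com/s?k=laptops&page=2
```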
# Function to extract the laptop's name
def extract_name(div):
    # Find the span tag, class 'a-size-medium a-color-base a-text-normal', where the laptop name will be
    span_name = div.find('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
    # If the value found isn't empty, return the value (name)
    if span_name is not None:
        return span_name.text
    else:
        return 'no-info'
# Function to extract the laptop's price
def extract_price(div):
    # Find the span tag, class 'a-offscreen', where the laptop price will be
    span_price = div.find('span', attrs={'class': 'a-offscreen'})
    # If the value found isn't empty, return the value (price)
    if span_price is not None:
        return span_price.text
    else:
        return 'no-info'
# Function to extract the laptop's reviews
def extract_reviews(div):
    # Find the span tag, class 'a-icon-alt', where the laptop reviews will be
    span_reviews = div.find('span', attrs={'class': 'a-icon-alt'})
    # If the value found isn't empty, return the value (reviews)
    if span_reviews is not None:
        return span_reviews.text
    else:
        return 'no-info'
# List to store the data
laptop_data = []
# Loop through the desired number of pages
for num_page in range(1, 6):
    # Make request and get content
    content = make_request(num_page)
    # Format the content with BeautifulSoup
    soup = BeautifulSoup(content, 'lxml')
    # Loop through the result divs
    for info in soup.findAll('div', attrs={'class': 'sg-col-4-of-12 sg-col-4-of-16 sg-col sg-col-4-of-20'}):
        # Extract name
        name = extract_name(info)
        # Extract price
        price = extract_price(info)
        # Extract reviews
        reviews = extract_reviews(info)
        # Add the data to the list
        laptop_data.append([name, price, reviews])
# Convert the list to a dataframe
df_laptop_data = pd.DataFrame(laptop_data, columns = ['Name', 'Price', 'Reviews'])
df_laptop_data.shape
df_laptop_data.head()
Seems like we got lots of null rows... let's do some cleaning!
# Remove rows without laptop name (copy to avoid SettingWithCopyWarning on later assignments)
df = df_laptop_data[df_laptop_data.Name != 'no-info'].copy()
df.shape
# Remove rows without price info
df = df[df.Price != 'no-info']
df.shape
df
# Check data on the column
df['Price'].value_counts(dropna = False)
# Function to clean the data on the price column
def clean_price(price):
    # Remove the $
    price = price.replace('$', '')
    # Remove the comma ',' on numbers over 1000 (1,000)
    price = price.replace(',', '')
    # Convert to numeric data type
    price = pd.to_numeric(price)
    return price
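A quick check of `clean_price` on a couple of sample strings (the prices are made up) confirms that both the currency symbol and the thousands separator are handled:

```python
import pandas as pd

def clean_price(price):
    # Strip the currency symbol and the thousands separator, then parse
    price = price.replace('$', '').replace(',', '')
    return pd.to_numeric(price)

print(clean_price('$1,299.99'))  # 1299.99
print(clean_price('$599.00'))    # 599.0
```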
# Create a new column to store the clean price
df['Price_Clean'] = df['Price'].apply(clean_price)
df.shape
df.head()
# Check data on the column
df['Price_Clean'].value_counts(dropna = False)
# Check data on the column
df['Reviews'].value_counts(dropna = False)
# Function to clean the reviews
def clean_reviews(review):
    # Split the string by spaces
    review = review.split()
    # Extract the first item (the stars received)
    review = review[0]
    # Set 'no-info' as 0
    if review == 'no-info':
        review = 0
    # Convert to numeric data type
    review = pd.to_numeric(review)
    return review
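Amazon's review span reads like "4.5 out of 5 stars", so taking the first whitespace-separated token yields the star rating. A small check (the example strings are made up):

```python
import pandas as pd

def clean_reviews(review):
    review = review.split()[0]   # '4.5 out of 5 stars' -> '4.5'
    if review == 'no-info':      # rows with no reviews get 0 stars
        review = 0
    return pd.to_numeric(review)

print(clean_reviews('4.5 out of 5 stars'))  # 4.5
print(clean_reviews('no-info'))             # 0
```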
# Create a new column to store the clean review
df['Review_Clean'] = df['Reviews'].apply(clean_reviews)
df.shape
# Check data on the column
df['Review_Clean'].value_counts(dropna = False)
df.head()
The Name column holds various data about each laptop, including brand, hardware specifications, and screen size. To put it to use, those details need to be extracted into their own columns, which might require some experimenting.
df.Name
Pandas is truncating the Name strings due to their size, which makes it hard to come up with a data extraction method, so let me adjust the pandas display settings first.
pd.set_option('display.max_colwidth', None)
df.Name
Getting what I want will require some regular expressions...
re_find_processor = r'\b([iI][\d])\b'
# Create a column for the processor data
df['Processor'] = df['Name'].str.extract(re_find_processor)
# Standardize processor name by setting all characters to lowercase
df['Processor'] = df['Processor'].str.replace('I', 'i')
# Check result
df['Processor'].value_counts(dropna = False)
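The pattern grabs a standalone `i` or `I` followed by a single digit, so it catches "i5"/"I7" but ignores model numbers. Exercising it with the `re` module on a few made-up titles (`Series.str.extract` applies the same capture group row by row):

```python
import re

re_find_processor = r'\b([iI][\d])\b'

# Made-up listing titles to exercise the pattern
titles = ['Dell Laptop Intel Core i5 8GB',
          'HP Laptop Intel Core I7 16GB',
          'Acer Chromebook Celeron N4020']
for title in titles:
    match = re.search(re_find_processor, title)
    print(match.group(1).lower() if match else 'no-info')
# i5, i7, no-info
```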
re_find_ram = r'\b([\d]+)[GB]+[ ][\+ LlPpDdRrAaMmEeOoYy\d]+\b'
# Create a new column for the RAM data
df['RAM_Memory'] = df['Name'].str.extract(re_find_ram)
# And convert to numeric data type
df['RAM_Memory'] = pd.to_numeric(df['RAM_Memory'])
# Check result
df['RAM_Memory'].value_counts(dropna = False)
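The RAM pattern requires the "GB" to be followed by a space and a memory-type word (DDR4, RAM, LPDDR, ...), spelled out in the trailing character class. That is what lets it skip storage sizes: "SSD" starts with an `S`, which is not in the class. A check on a made-up title containing both:

```python
import re

re_find_ram = r'\b([\d]+)[GB]+[ ][\+ LlPpDdRrAaMmEeOoYy\d]+\b'

title = 'Acme Laptop i5 16GB DDR4 RAM 512GB SSD'  # made-up title
match = re.search(re_find_ram, title)
print(match.group(1))  # 16 -- the 512GB SSD is not matched ('S' is not in the class)
```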
re_find_screen = r'\b([1][\d]+[\.IiNnCcHh\d]*)[ \d]*\b'
# Create a new column for the screen size data
df['Screen_Size'] = df['Name'].str.extract(re_find_screen)
# Check result
df['Screen_Size'].value_counts(dropna = False)
There are some weird values that don't seem to be screen sizes...
# Cleaning function
def clean_screen(size):
    # Guard against non-string values (e.g. NaN from rows where the regex found nothing)
    if not isinstance(size, str):
        return 'no-info'
    size = size.replace('Inch', '')
    size = size.replace('inch', '')
    size = size.replace('in', '')
    try:
        size = pd.to_numeric(size)
    except ValueError:
        size = 'no-info'
    return size
# Apply function
df['Screen_Size'] = df['Screen_Size'].apply(clean_screen)
# Check result
df['Screen_Size'].value_counts(dropna = False)
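Putting the screen-size regex and cleaner together on a made-up title shows the two-step flow: the pattern captures the raw token ("Inch" suffix and all), then the cleaner strips the unit and parses the number, falling back to 'no-info' when parsing fails:

```python
import re
import pandas as pd

re_find_screen = r'\b([1][\d]+[\.IiNnCcHh\d]*)[ \d]*\b'

def clean_screen(size):
    if not isinstance(size, str):      # e.g. NaN when the regex found nothing
        return 'no-info'
    for token in ('Inch', 'inch', 'in'):
        size = size.replace(token, '')
    try:
        return pd.to_numeric(size)
    except ValueError:
        return 'no-info'

raw = re.search(re_find_screen, 'Acme Laptop 15.6Inch i5 16GB').group(1)
print(raw)                # 15.6Inch
print(clean_screen(raw))  # 15.6
print(clean_screen('abc'))  # no-info
```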
df_final = df[['Name', 'Price_Clean', 'Review_Clean', 'Processor', 'RAM_Memory', 'Screen_Size']]
df_final.columns = ['Name', 'Price', 'Review', 'Processor', 'RAM_Memory', 'Screen_Size']
df_final
Creating a filter to search the dataframe and return only laptops with certain specifications.
# Dictionary with filters
filters = {'processor': {'min': 'i5', 'max': 'i7'},
           'ram': 16,
           'max_price': 1500,
           'screen': {'min': 14, 'max': 16},
           'min_stars': 4}
# Filter Processor
df_filtered = df_final[(df_final['Processor'] == filters['processor']['min']) | (df_final['Processor'] == filters['processor']['max'])].copy()
# Filter RAM Memory
df_filtered = df_filtered[(df_filtered['RAM_Memory'] >= filters['ram'])]
# Filter Price
df_filtered = df_filtered[(df_filtered['Price'] <= filters['max_price'])]
# Filter Screen size
df_filtered = df_filtered[(df_filtered['Screen_Size'] >= filters['screen']['min']) & (df_filtered['Screen_Size'] <= filters['screen']['max'])]
# Filter Reviews
df_filtered = df_filtered[(df_filtered['Review'] >= filters['min_stars'])]
print(f'Number of findings: {df_filtered.shape[0]}')
# Filtered results
df_filtered
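The five filtering steps above can be folded into one reusable function. This is a sketch under the assumption that the numeric columns are fully clean (it would mis-compare rows where Screen_Size is still the string 'no-info'); `filter_laptops` and the sample rows are mine, not part of the scraper:

```python
import pandas as pd

def filter_laptops(df, filters):
    # Combine all five conditions into one boolean mask
    mask = (
        df['Processor'].isin([filters['processor']['min'], filters['processor']['max']])
        & (df['RAM_Memory'] >= filters['ram'])
        & (df['Price'] <= filters['max_price'])
        & (df['Screen_Size'] >= filters['screen']['min'])
        & (df['Screen_Size'] <= filters['screen']['max'])
        & (df['Review'] >= filters['min_stars'])
    )
    return df[mask]

# Tiny illustrative frame (made-up rows, not real listings)
sample = pd.DataFrame({
    'Name': ['Laptop A', 'Laptop B'],
    'Price': [1200, 1800],
    'Review': [4.5, 4.8],
    'Processor': ['i7', 'i7'],
    'RAM_Memory': [16, 32],
    'Screen_Size': [15.6, 17.3],
})
filters = {'processor': {'min': 'i5', 'max': 'i7'}, 'ram': 16,
           'max_price': 1500, 'screen': {'min': 14, 'max': 16}, 'min_stars': 4}
print(filter_laptops(sample, filters).shape[0])  # 1 -- only Laptop A passes
```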
There it is! Now I know where to start my laptop search!
Since products on Amazon are frequently updated, I can re-run this later and see if new interesting options appeared!