Web Scraping Amazon Laptops

Problem Definition

I'm considering buying a new notebook for my data science projects, so I decided to turn the search into yet another project!

My web scraper searches, compares and analyses different laptops and lists them in a table for easy comparison. It crawls Amazon search results and collects details on each laptop: price, user reviews and hardware configuration.

Data Source

The official Amazon website: https://amazon.com

Loading Packages

In [1]:
# Imports
import bs4
import requests
import pandas as pd
from bs4 import BeautifulSoup
import lxml

Web Scraping

The web scraper was built by inspecting the page source at: https://www.amazon.com/s?k=laptops

The relevant information was extracted from the first five result pages.

In [2]:
# Function which defines a header for the connection with amazon.com and makes a request to extract the data
def make_request(num_page):
    
    # Header to avoid the problem of non-secure connection with amazon.com
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", 
               "Accept-Encoding":"gzip, deflate", 
               "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
               "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"} 

    # Request the desired results page
    request = requests.get('https://www.amazon.com/s?k=laptops&page='+str(num_page), headers = headers)
    
    # Then extract the content
    content = request.content
    
    return content
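
The page URL above is assembled by string concatenation. As a small sketch (the helper name build_search_url and the constant BASE_URL are illustrative and not part of the original notebook), the standard library can build and escape the query string, which also keeps working for search terms that contain spaces:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.amazon.com/s'  # same search endpoint used above

def build_search_url(query, num_page):
    # urlencode escapes the values and joins them as k=...&page=...
    return BASE_URL + '?' + urlencode({'k': query, 'page': num_page})

# URLs for the first five result pages, as scraped in this notebook
urls = [build_search_url('laptops', num_page) for num_page in range(1, 6)]
print(urls[0])
```

requests.get also accepts such a dictionary through its params argument, so either approach should work inside make_request.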
In [3]:
# Function to extract the laptop's name
def extract_name(div):
    
    # Find the span tag, class 'a-size-medium a-color-base a-text-normal', where the laptop name will be
    span_name = div.find('span', attrs = {'class':'a-size-medium a-color-base a-text-normal'})
    
    # If the value found isn't empty, return the value (name)
    if span_name is not None:
        return span_name.text
    else:
        return 'no-info'
In [4]:
# Function to extract the laptop's price
def extract_price(div):
    
    # Find the span tag, class 'a-offscreen', where the laptop price will be
    span_price = div.find('span', attrs = {'class':'a-offscreen'})

    # If the value found isn't empty, return the value (price)
    if span_price is not None:
        return span_price.text
    else:
        return 'no-info'
In [5]:
# Function to extract the laptop's reviews
def extract_reviews(div):
    
    # Find the span tag, class 'a-icon-alt', where the laptop reviews will be
    span_reviews = div.find('span', attrs = {'class':'a-icon-alt'})

    # If the value found isn't empty, return the value (reviews)
    if span_reviews is not None:
        return span_reviews.text
    else:
        return 'no-info'
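
The three extractors above differ only in the CSS class they look up. A possible refactoring, shown here as a sketch rather than the notebook's actual code, moves the shared logic into one helper (extract_text is a hypothetical name) and keeps the three functions as thin wrappers:

```python
def extract_text(div, css_class, default='no-info'):
    # div is expected to be a BeautifulSoup Tag (any object with a compatible .find works)
    span = div.find('span', attrs={'class': css_class})
    # Return the span's text, or the placeholder when the span is missing
    return span.text if span is not None else default

def extract_name(div):
    return extract_text(div, 'a-size-medium a-color-base a-text-normal')

def extract_price(div):
    return extract_text(div, 'a-offscreen')

def extract_reviews(div):
    return extract_text(div, 'a-icon-alt')
```

With this layout, a change in Amazon's class names only needs to be updated in one place per field.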
In [6]:
# List to store the data
laptop_data = []

# Loop through the desired number of pages
for num_page in range(1,6):
    
    # Make request and get content
    content = make_request(num_page)
    
    # Format the content with BeautifulSoup
    soup = BeautifulSoup(content, 'lxml')
    
    # Loop through the content
    for info in soup.find_all('div', attrs = {'class': 'sg-col-4-of-12 sg-col-4-of-16 sg-col sg-col-4-of-20'}):
        
        # Extract name
        name = extract_name(info)
        
        # Extract price
        price = extract_price(info)
        
        # Extract reviews
        reviews = extract_reviews(info)
        
        # Add the data to the list
        laptop_data.append([name, price, reviews])
In [7]:
# Convert the list to a dataframe
df_laptop_data = pd.DataFrame(laptop_data, columns = ['Name', 'Price', 'Reviews'])
df_laptop_data.shape
Out[7]:
(693, 3)
In [15]:
df_laptop_data.head()
Out[15]:
Name Price Reviews
0 no-info no-info no-info
1 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 4 Stars & Up
2 no-info no-info no-info
3 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 3.9 out of 5 stars
4 no-info no-info no-info

Seems like we got lots of null rows... let's do some cleaning!

Data Cleaning

Delete Null Rows

In [31]:
# Remove rows without laptop name
df = df_laptop_data[df_laptop_data.Name != 'no-info']
df.shape
Out[31]:
(230, 3)
In [32]:
# Remove rows without a price (copy so new columns can be added without a SettingWithCopyWarning)
df = df[df.Price != 'no-info'].copy()
df.shape
Out[32]:
(90, 3)
In [33]:
df
Out[33]:
Name Price Reviews
1 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 4 Stars & Up
3 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 3.9 out of 5 stars
8 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 3.9 out of 5 stars
14 CHUWI Herobook Pro 14.1 inch Windows 10 Intel ... $339.00 4.0 out of 5 stars
44 iProda 14" Stream Laptop, Intel i3 Notebook (u... $378.98 4.5 out of 5 stars
... ... ... ...
662 Newest HP 15.6inch Lightweight Laptop, Intel Q... $562.00 4.5 out of 5 stars
668 Goldengulf Windows 10 Computer Laptop Mini 10.... $218.86 3.9 out of 5 stars
674 2020 HP 14-inch HD Touchscreen Premium Laptop ... $523.00 4.6 out of 5 stars
680 Fusion5 14.1inch A90B+ Pro 64GB Windows 10 Lap... $264.95 4.2 out of 5 stars
686 CHUWI Herobook Pro 14.1 inch Windows 10 Intel ... $339.00 4.0 out of 5 stars

90 rows × 3 columns

Clean the Price Column

In [51]:
# Check data on the column
df['Price'].value_counts(dropna = False)
Out[51]:
$339.00      12
$259.90       9
$378.98       9
$439.00       5
$264.95       5
$269.90       2
$599.00       2
$369.99       2
$912.46       1
$585.99       1
$589.99       1
$944.99       1
$527.00       1
$799.99       1
$218.86       1
$309.00       1
$256.99       1
$292.00       1
$809.49       1
$1,299.00     1
$549.99       1
$399.00       1
$2,399.00     1
$849.00       1
$599.99       1
$787.04       1
$725.00       1
$288.00       1
$699.00       1
$845.54       1
$324.00       1
$847.00       1
$340.00       1
$854.01       1
$523.00       1
$419.99       1
$667.02       1
$529.00       1
$698.90       1
$299.99       1
$2,599.00     1
$425.00       1
$1,138.21     1
$334.99       1
$269.00       1
$247.70       1
$279.99       1
$299.00       1
$449.99       1
$660.00       1
$439.99       1
$562.00       1
Name: Price, dtype: int64
In [36]:
# Function to clean the data on the price column
def clean_price(price):
    
    # Remove the $
    price = price.replace('$', '')

    # Remove the comma ',' on numbers over 1000 (1,000)
    price = price.replace(',', '')
    
    # Convert to numeric data type
    price = pd.to_numeric(price)
    
    return price
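
clean_price assumes every value looks like '$1,299.00'; pd.to_numeric raises an error on anything else, which is fine here because the 'no-info' rows were removed first. As a hedged alternative (parse_price is a hypothetical helper, not part of the notebook), a regular expression can validate the format and return None for values it does not recognise:

```python
import re

# Optional '$', then digits with optional thousands separators, then optional decimals
PRICE_PATTERN = re.compile(r'\$?(\d{1,3}(?:,\d{3})*|\d+)(\.\d+)?')

def parse_price(raw):
    match = PRICE_PATTERN.fullmatch(raw.strip())
    if match is None:
        return None  # e.g. 'no-info' or any unexpected format
    # Drop the currency symbol and the separators before converting
    return float(match.group(0).replace('$', '').replace(',', ''))

print(parse_price('$1,299.00'))
print(parse_price('no-info'))
```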
In [42]:
# Create a new column to store the clean price
df['Price_Clean'] = df['Price'].apply(clean_price)
df.shape
Out[42]:
(90, 4)
In [43]:
df.head()
Out[43]:
Name Price Reviews Price_Clean
1 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 4 Stars & Up 339.00
3 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 3.9 out of 5 stars 339.00
8 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 3.9 out of 5 stars 339.00
14 CHUWI Herobook Pro 14.1 inch Windows 10 Intel ... $339.00 4.0 out of 5 stars 339.00
44 iProda 14" Stream Laptop, Intel i3 Notebook (u... $378.98 4.5 out of 5 stars 378.98
In [53]:
# Check data on the column
df['Price_Clean'].value_counts(dropna = False)
Out[53]:
339.00     12
259.90      9
378.98      9
439.00      5
264.95      5
369.99      2
599.00      2
269.90      2
309.00      1
399.00      1
660.00      1
439.99      1
425.00      1
725.00      1
787.04      1
849.00      1
549.99      1
292.00      1
340.00      1
269.00      1
562.00      1
699.00      1
299.00      1
527.00      1
847.00      1
324.00      1
288.00      1
529.00      1
912.46      1
523.00      1
667.02      1
419.99      1
698.90      1
299.99      1
2599.00     1
2399.00     1
585.99      1
247.70      1
449.99      1
256.99      1
334.99      1
799.99      1
599.99      1
1299.00     1
279.99      1
1138.21     1
218.86      1
944.99      1
589.99      1
809.49      1
854.01      1
845.54      1
Name: Price_Clean, dtype: int64

Clean the Reviews Column

In [45]:
# Check data on the column
df['Reviews'].value_counts(dropna = False)
Out[45]:
4.5 out of 5 stars    15
4.0 out of 5 stars    14
4.2 out of 5 stars     8
4.1 out of 5 stars     8
3.9 out of 5 stars     8
4.6 out of 5 stars     7
no-info                6
4.3 out of 5 stars     6
4 Stars & Up           5
4.4 out of 5 stars     4
4.7 out of 5 stars     3
5.0 out of 5 stars     3
3.7 out of 5 stars     1
4.9 out of 5 stars     1
3.8 out of 5 stars     1
Name: Reviews, dtype: int64
In [46]:
# Function to clean the reviews
def clean_reviews(review):
    
    # Split the string by spaces
    review = review.split()
    
    # Extract the first item (the stars received)
    review = review[0]
    
    # Set 'no-info' as 0
    if review == 'no-info':
        review = 0
    
    # Convert to numeric data type
    review = pd.to_numeric(review)
    
    return review
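
Because clean_reviews keeps only the first token, '4 Stars & Up' (which looks like a filter label rather than a product rating) becomes 4.0, and 'no-info' becomes 0, which cannot be told apart from a genuinely low rating. A stricter sketch, assuming ratings always follow the 'X out of 5 stars' wording seen above (parse_rating is a hypothetical name), returns None for everything else:

```python
import re

# Matches strings such as '4.5 out of 5 stars' and captures the rating
RATING_PATTERN = re.compile(r'^(\d(?:\.\d)?) out of 5 stars$')

def parse_rating(raw):
    match = RATING_PATTERN.match(raw)
    # None marks a missing or unrecognised rating instead of a fake 0
    return float(match.group(1)) if match else None

print(parse_rating('4.5 out of 5 stars'))
print(parse_rating('4 Stars & Up'))
```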
In [48]:
# Create a new column to store the clean review
df['Review_Clean'] = df['Reviews'].apply(clean_reviews)
df.shape
Out[48]:
(90, 5)
In [50]:
# Check data on the column
df['Review_Clean'].value_counts(dropna = False)
Out[50]:
4.0    19
4.5    15
4.1     8
4.2     8
3.9     8
4.6     7
4.3     6
0.0     6
4.4     4
4.7     3
5.0     3
3.7     1
3.8     1
4.9     1
Name: Review_Clean, dtype: int64
In [54]:
df.head()
Out[54]:
Name Price Reviews Price_Clean Review_Clean
1 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 4 Stars & Up 339.00 4.0
3 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 3.9 out of 5 stars 339.00 3.9
8 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop... $339.00 3.9 out of 5 stars 339.00 3.9
14 CHUWI Herobook Pro 14.1 inch Windows 10 Intel ... $339.00 4.0 out of 5 stars 339.00 4.0
44 iProda 14" Stream Laptop, Intel i3 Notebook (u... $378.98 4.5 out of 5 stars 378.98 4.5

Clean the Name Column

The name column holds various data about the laptops, including hardware specifications, brand, screen size, and such. To clean it effectively, that data needs to be extracted into separate columns, which might require some experimenting.

In [55]:
df.Name
Out[55]:
1      CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop...
3      CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop...
8      CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop...
14     CHUWI Herobook Pro 14.1 inch Windows 10 Intel ...
44     iProda 14" Stream Laptop, Intel i3 Notebook (u...
                             ...                        
662    Newest HP 15.6inch Lightweight Laptop, Intel Q...
668    Goldengulf Windows 10 Computer Laptop Mini 10....
674    2020 HP 14-inch HD Touchscreen Premium Laptop ...
680    Fusion5 14.1inch A90B+ Pro 64GB Windows 10 Lap...
686    CHUWI Herobook Pro 14.1 inch Windows 10 Intel ...
Name: Name, Length: 90, dtype: object

Pandas is truncating the Name strings due to their length, which makes it hard to come up with a data extraction method, so let me change the pandas display settings.

In [56]:
pd.set_option('display.max_colwidth', None)
In [57]:
df.Name
Out[57]:
1                                                     CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop Computer, 8G RAM / 256GB SSD with Intel Gmini Lake N4000 Notebook, Thin and Light
3                                                     CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop Computer, 8G RAM / 256GB SSD with Intel Gmini Lake N4000 Notebook, Thin and Light
8                                                     CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop Computer, 8G RAM / 256GB SSD with Intel Gmini Lake N4000 Notebook, Thin and Light
14                       CHUWI Herobook Pro 14.1 inch Windows 10 Intel N4000 Dual Core 8GB RAM 256GB ROM Notebook,Thin and Lightweight Laptop,BT4.0 (Herobook Pro (Herobook Pro(2020))
44     iProda 14" Stream Laptop, Intel i3 Notebook (up to 2.4GHz), 8GB Memory, 256GB SSD, Full HD IPS 19201080 Display, Windows 10 Pro, Perfect PC for Student and Home use (Intel i3)
                                                                                            ...                                                                                       
662         Newest HP 15.6inch Lightweight Laptop, Intel Quad-Core i5-1035G1 Processor Up to 3.60 GHz, 8GB DDR4 RAM, 256GB SSD + 16GB Optane, HDMI, Bluetooth, Win 10-Silver (Renewed)
668                                 Goldengulf Windows 10 Computer Laptop Mini 10.1 Inch 32GB Ultra Thin and Light Netbook Intel Quad Core CPU PC HDMI WiFi USB Netflix YouTube (Blue)
674                                           2020 HP 14-inch HD Touchscreen Premium Laptop PC, AMD Ryzen 3 3200U Processor, 8GB DDR4 Memory, 256GB SSD, Bluetooth, Windows 10, Silver
680                                               Fusion5 14.1inch A90B+ Pro 64GB Windows 10 Laptop - 4GB RAM, 64GB Storage, Full HD IPS, Bluetooth, 2MP Webcam, Dual Band WiFi Laptop
686                      CHUWI Herobook Pro 14.1 inch Windows 10 Intel N4000 Dual Core 8GB RAM 256GB ROM Notebook,Thin and Lightweight Laptop,BT4.0 (Herobook Pro (Herobook Pro(2020))
Name: Name, Length: 90, dtype: object

Getting what I want will require some regular expressions...

Source: https://docs.python.org/3/library/re.html

Find Processor

In [63]:
re_find_processor = r'\b([iI][\d])\b'
In [101]:
# Create a column for the processor data
df['Processor'] = df['Name'].str.extract(re_find_processor)

# Standardize the processor name by lowercasing the leading 'I'
df['Processor'] = df['Processor'].str.replace('I', 'i')

# Check result
df['Processor'].value_counts(dropna = False)
Out[101]:
NaN    55
i5     13
i3     12
i7     10
Name: Processor, dtype: int64
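
To see why so many rows end up as NaN, the same pattern can be tried on a few (shortened) titles from the listing with the standard re module. This is only an illustration; find_processor is a hypothetical helper:

```python
import re

re_find_processor = r'\b([iI][\d])\b'

def find_processor(name):
    # Return the lowercase 'iN' label, or None when the title has no such token
    match = re.search(re_find_processor, name)
    return match.group(1).lower() if match else None

sample_names = [
    'iProda 14" Stream Laptop, Intel i3 Notebook (up to 2.4GHz), 8GB Memory, 256GB SSD',
    'Newest HP 15.6inch Lightweight Laptop, Intel Quad-Core i5-1035G1 Processor Up to 3.60 GHz',
    '2020 HP 14-inch HD Touchscreen Premium Laptop PC, AMD Ryzen 3 3200U Processor',
]

for name in sample_names:
    print(find_processor(name))
```

The pattern only recognises Intel Core style labels (i3, i5, i7, ...), so titles that mention other processors, such as AMD Ryzen or Intel N4000, produce no match.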

Find RAM Memory

In [84]:
re_find_ram = r'\b([\d]+)[GB]+[ ][\+ LlPpDdRrAaMmEeOoYy\d]+\b'
In [100]:
# Create a new column for the RAM data
df['RAM_Memory'] = df['Name'].str.extract(re_find_ram)

# And convert to numeric data type
df['RAM_Memory'] = pd.to_numeric(df['RAM_Memory'])

# Check result
df['RAM_Memory'].value_counts(dropna = False)
Out[100]:
8.0     39
4.0     25
12.0    10
NaN     10
16.0     6
Name: RAM_Memory, dtype: int64
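
A note on how this pattern behaves: [GB]+ is a character class, so it accepts 'G' or 'GB' after the number, and the final class only contains digits, spaces, '+' and the letters found in words like RAM, DDR, LPDDR and Memory. The amount is therefore captured only when it is followed by one of those words. The sketch below checks this on shortened titles from the listing (find_ram is a hypothetical helper):

```python
import re

re_find_ram = r'\b([\d]+)[GB]+[ ][\+ LlPpDdRrAaMmEeOoYy\d]+\b'

def find_ram(name):
    # Return the captured amount as a string, or None when nothing matches
    match = re.search(re_find_ram, name)
    return match.group(1) if match else None

sample_names = [
    'CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop Computer, 8G RAM / 256GB SSD',
    'Fusion5 14.1inch A90B+ Pro 64GB Windows 10 Laptop - 4GB RAM, 64GB Storage',
    'Goldengulf Windows 10 Computer Laptop Mini 10.1 Inch 32GB Ultra Thin and Light Netbook',
]

for name in sample_names:
    print(find_ram(name))
```

In the second title '64GB Windows' and '64GB Storage' are skipped because 'W' and 'S' are not in the final class, and the third title has no RAM-like word after '32GB', which is how rows end up as NaN.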

Find Screen Size

In [92]:
re_find_screen = r'\b([1][\d]+[\.IiNnCcHh\d]*)[ \d]*\b' 
In [113]:
# Create a new column for the screen size data
df['Screen_Size'] = df['Name'].str.extract(re_find_screen)

# Check result
df['Screen_Size'].value_counts(dropna = False)
Out[113]:
14          23
14.1        14
15.6        14
13.3        10
15           7
17.3         5
14.1inch     5
11.6         4
10750H       1
14inch       1
15.6inch     1
11           1
14.0         1
12.4         1
10           1
10300H       1
Name: Screen_Size, dtype: int64

There are some weird values that don't seem to be screen sizes...
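
Values such as 10750H look like CPU model numbers: the pattern accepts any token that starts with '1' followed by digits and optional 'inch' letters (which include 'H'), so a model number that appears before the screen size wins. The sketch below reproduces this on a hypothetical title (not taken from the scraped data) and compares it with a stricter pattern, re_screen_strict, which is an assumption of mine and requires an explicit inch marker or a double quote:

```python
import re

re_find_screen = r'\b([1][\d]+[\.IiNnCcHh\d]*)[ \d]*\b'

# Stricter variant: two digits starting with 1, optional decimal,
# then 'inch' (any case, optionally after a space or hyphen) or a double quote
re_screen_strict = re.compile(r'\b(1\d(?:\.\d)?)[\s-]?(?:inch|")', re.IGNORECASE)

hypothetical_name = 'Gaming Laptop Intel i7-10750H 15.6 inch'  # made-up title for illustration

print(re.search(re_find_screen, hypothetical_name).group(1))
print(re_screen_strict.search(hypothetical_name).group(1))
```

The trade-off is that the stricter pattern returns nothing for titles that state the size without any unit, so it may leave more rows empty.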

In [114]:
# Cleaning function
def clean_screen(size):
    size = size.replace('Inch', '')
    size = size.replace('inch', '')
    size = size.replace('in', '')
    try:
        size = pd.to_numeric(size)
    except (ValueError, TypeError):
        size = 'no-info'
    return size
In [115]:
# Apply function
df['Screen_Size'] = df['Screen_Size'].apply(clean_screen)

# Check result
df['Screen_Size'].value_counts(dropna = False)
Out[115]:
14         25
14.1       19
15.6       15
13.3       10
15          7
17.3        5
11.6        4
no-info     2
12.4        1
11          1
10          1
Name: Screen_Size, dtype: int64
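
After this step the column mixes numbers with the string 'no-info'. In Python a comparison such as 'no-info' >= 14 raises a TypeError, so a numeric range filter only works if no such row reaches it. A hedged alternative (to_screen_inches is a hypothetical helper) uses NaN as the missing value, which keeps everything numeric:

```python
import math
import re

def to_screen_inches(raw):
    # Drop a trailing 'inch' or 'in' marker regardless of case
    text = re.sub(r'\s*(inch|in)$', '', str(raw).strip(), flags=re.IGNORECASE)
    try:
        return float(text)
    except ValueError:
        # Not a screen size (e.g. a CPU model number): mark as missing
        return math.nan

print(to_screen_inches('14.1inch'))
print(to_screen_inches('10750H'))
```

Comparisons against NaN evaluate to False, so such rows simply drop out of a range filter. In pandas, pd.to_numeric with errors='coerce' gives a similar result for a whole column.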

Final Dataset

In [132]:
df_final = df[['Name', 'Price_Clean', 'Review_Clean', 'Processor', 'RAM_Memory', 'Screen_Size']]
df_final.columns = ['Name', 'Price', 'Review', 'Processor', 'RAM_Memory', 'Screen_Size']
df_final
Out[132]:
Name Price Review Processor RAM_Memory Screen_Size
1 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop Computer, 8G RAM / 256GB SSD with Intel Gmini Lake N4000 Notebook, Thin and Light 339.00 4.0 NaN 8.0 14.1
3 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop Computer, 8G RAM / 256GB SSD with Intel Gmini Lake N4000 Notebook, Thin and Light 339.00 3.9 NaN 8.0 14.1
8 CHUWI HeroBook Pro 14.1 inch Windows 10 Laptop Computer, 8G RAM / 256GB SSD with Intel Gmini Lake N4000 Notebook, Thin and Light 339.00 3.9 NaN 8.0 14.1
14 CHUWI Herobook Pro 14.1 inch Windows 10 Intel N4000 Dual Core 8GB RAM 256GB ROM Notebook,Thin and Lightweight Laptop,BT4.0 (Herobook Pro (Herobook Pro(2020)) 339.00 4.0 NaN 8.0 14.1
44 iProda 14" Stream Laptop, Intel i3 Notebook (up to 2.4GHz), 8GB Memory, 256GB SSD, Full HD IPS 19201080 Display, Windows 10 Pro, Perfect PC for Student and Home use (Intel i3) 378.98 4.5 i3 8.0 14
... ... ... ... ... ... ...
662 Newest HP 15.6inch Lightweight Laptop, Intel Quad-Core i5-1035G1 Processor Up to 3.60 GHz, 8GB DDR4 RAM, 256GB SSD + 16GB Optane, HDMI, Bluetooth, Win 10-Silver (Renewed) 562.00 4.5 i5 8.0 15.6
668 Goldengulf Windows 10 Computer Laptop Mini 10.1 Inch 32GB Ultra Thin and Light Netbook Intel Quad Core CPU PC HDMI WiFi USB Netflix YouTube (Blue) 218.86 3.9 NaN NaN 10
674 2020 HP 14-inch HD Touchscreen Premium Laptop PC, AMD Ryzen 3 3200U Processor, 8GB DDR4 Memory, 256GB SSD, Bluetooth, Windows 10, Silver 523.00 4.6 NaN 8.0 14
680 Fusion5 14.1inch A90B+ Pro 64GB Windows 10 Laptop - 4GB RAM, 64GB Storage, Full HD IPS, Bluetooth, 2MP Webcam, Dual Band WiFi Laptop 264.95 4.2 NaN 4.0 14.1
686 CHUWI Herobook Pro 14.1 inch Windows 10 Intel N4000 Dual Core 8GB RAM 256GB ROM Notebook,Thin and Lightweight Laptop,BT4.0 (Herobook Pro (Herobook Pro(2020)) 339.00 4.0 NaN 8.0 14.1

90 rows × 6 columns

Search the DataFrame with Filters

Creating a filter to search the dataframe and return only laptops with certain specifications.

In [133]:
# Dictionary with filters
filters = {'processor': {'min':'i5', 'max':'i7'},
            'ram': 16,
            'max_price': 1500,
            'screen': {'min':14, 'max':16},
            'min_stars': 4}
In [134]:
# Filter Processor
df_filtered = df_final[(df_final['Processor'] == filters['processor']['min']) | (df_final['Processor'] == filters['processor']['max'])].copy()

# Filter RAM Memory
df_filtered = df_filtered[(df_filtered['RAM_Memory'] >= filters['ram'])]

# Filter Price
df_filtered = df_filtered[(df_filtered['Price'] <= filters['max_price'])]

# Filter Screen size
df_filtered = df_filtered[(df_filtered['Screen_Size'] >= filters['screen']['min']) & (df_filtered['Screen_Size'] <= filters['screen']['max'])]

# Filter Reviews
df_filtered = df_filtered[(df_filtered['Review'] >= filters['min_stars'])]

print(f'Number of findings: {df_filtered.shape[0]}')
Number of findings: 4
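
The processor filter above keeps rows whose label equals either the 'min' or the 'max' value. That covers i5 to i7, but it would not behave as a range for something like i3 to i7. One way to turn it into a true range, sketched here with hypothetical helper names, is to compare the digit of the label:

```python
def processor_rank(proc):
    # Return the digit of an 'iN' label as an int, or None for missing/unknown values
    if isinstance(proc, str) and len(proc) == 2 and proc[0] in 'iI' and proc[1].isdigit():
        return int(proc[1])
    return None

def processor_in_range(proc, low, high):
    rank = processor_rank(proc)
    if rank is None:
        return False  # NaN or unrecognised labels never pass the filter
    return processor_rank(low) <= rank <= processor_rank(high)

print(processor_in_range('i5', 'i3', 'i7'))
print(processor_in_range('i3', 'i5', 'i7'))
```

Applied to the DataFrame, this could be used as a boolean mask, for example through df_final['Processor'].apply(processor_in_range, args=('i5', 'i7')).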
In [135]:
# Filtered results
df_filtered
Out[135]:
Name Price Review Processor RAM_Memory Screen_Size
92 2020 Asus TUF 15.6" FHD Premium Gaming Laptop, 10th Gen Intel Quad-Core i5-10300H, 16GB RAM, 1TB SSD, NVIDIA GeForce GTX 1650Ti 4GB GDDR6, RGB Backlit Keyboard, Windows 10 Home 854.01 4.1 i5 16.0 15.6
392 HP EliteBook 840 G3 Laptop 14" FHD Display, Intel Core i5-6300U 2.4Ghz, 256GB SSD, 16GB DDR4 RAM, Webcam, WiFi, Windows 10 Pro (Renewed) 449.99 4.2 i5 16.0 14
470 Samsung Notebook 9 Pro NP940X5N-X01US 15" FHD 2-in-1 Touch Screen Laptop, 8th Gen Intel Quad-Core i7-8550U Up To 4GHz, 16GB DDR4, 256GB SSD, Backlit Keyboard, Windows 10, Built-in S Pen, Titan Silver 912.46 4.2 i7 16.0 15
536 2019 Newest HP Pavilion 15 15.6" HD Touchscreen Business Laptop Intel Quad-Core i5-8250U, 16GB DDR4, 512GB SSD, Type-C, HDMI, WiFi AC, UHD, Windows 10 725.00 4.0 i5 16.0 15

There it is! Now I know where to start my laptop review from!

Since products on Amazon are frequently updated, I can re-run this later and see if interesting new options have appeared!

End

Matheus Schmitz
