I'm considering buying a new notebook for my data science projects, so I decided to turn the idea into yet another project!
My web scraper searches, compares, and analyzes different laptops, then lists them in a table for easy comparison. It crawls through Amazon search results, collecting details on each laptop: price, user reviews, and hardware configuration.
The official Amazon website: https://amazon.com
# Imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
import lxml  # parser backend used by BeautifulSoup
The web scraper was built by inspecting the source code at: https://www.amazon.com/s?k=laptops
The relevant information was extracted from the first five result pages.
# Function that defines a header for the connection with amazon.com and makes a request to extract the data
def make_request(num_page):
    # Browser-like headers, so amazon.com doesn't reject the connection as insecure
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
               "Accept-Encoding": "gzip, deflate",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
               "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
    # Request the desired page of results
    request = requests.get('https://www.amazon.com/s?k=laptops&page=' + str(num_page), headers=headers)
    # Then extract the content
    content = request.content
    return content
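Amazon frequently answers scrapers with a 503 or a captcha page, which would otherwise be parsed silently as an empty result. A minimal sketch of a more defensive variant (the helper names `build_search_url` and `make_request_checked` are mine, not from the scraper above; the URL builder is split out so it can be checked without a network call):

```python
import requests

def build_search_url(num_page):
    # Pure URL construction, separated from the network call
    return 'https://www.amazon.com/s?k=laptops&page=' + str(num_page)

def make_request_checked(num_page, timeout=10):
    # Hypothetical wrapper: same idea as make_request, but it fails loudly
    # on blocked responses instead of returning an error page to the parser
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) "
                             "Gecko/20100101 Firefox/66.0"}
    response = requests.get(build_search_url(num_page), headers=headers, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx instead of parsing junk
    return response.content

print(build_search_url(2))  # https://www.amazon.com/s?k=laptops&page=2
```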
# Function to extract the laptop's name
def extract_name(div):
    # Find the span tag, class 'a-size-medium a-color-base a-text-normal', where the laptop name will be
    span_name = div.find('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
    # If the value found isn't empty, return the value (name)
    if span_name is not None:
        return span_name.text
    else:
        return 'no-info'
# Function to extract the laptop's price
def extract_price(div):
    # Find the span tag, class 'a-offscreen', where the laptop price will be
    span_price = div.find('span', attrs={'class': 'a-offscreen'})
    # If the value found isn't empty, return the value (price)
    if span_price is not None:
        return span_price.text
    else:
        return 'no-info'
# Function to extract the laptop's reviews
def extract_reviews(div):
    # Find the span tag, class 'a-icon-alt', where the laptop reviews will be
    span_reviews = div.find('span', attrs={'class': 'a-icon-alt'})
    # If the value found isn't empty, return the value (reviews)
    if span_reviews is not None:
        return span_reviews.text
    else:
        return 'no-info'
# List to store the data
laptop_data = []
# Loop through the desired number of pages
for num_page in range(1, 6):
    # Make request and get content
    content = make_request(num_page)
    # Format the content with BeautifulSoup
    soup = BeautifulSoup(content, 'lxml')
    # Loop through the result divs
    for info in soup.findAll('div', attrs={'class': 'sg-col-4-of-12 sg-col-4-of-16 sg-col sg-col-4-of-20'}):
        # Extract name
        name = extract_name(info)
        # Extract price
        price = extract_price(info)
        # Extract reviews
        reviews = extract_reviews(info)
        # Add the data to the list
        laptop_data.append([name, price, reviews])
# Convert the list to a dataframe
df_laptop_data = pd.DataFrame(laptop_data, columns = ['Name', 'Price', 'Reviews'])
df_laptop_data.shape
df_laptop_data.head()
Seems like we got lots of null rows... let's do some cleaning!
# Remove rows without laptop name (copy to avoid SettingWithCopyWarning on later assignments)
df = df_laptop_data[df_laptop_data.Name != 'no-info'].copy()
df.shape
# Remove rows without price info
df = df[df.Price != 'no-info']
df.shape
df
# Check data on the column
df['Price'].value_counts(dropna = False)
# Function to clean the data on the price column
def clean_price(price):
    # Remove the $
    price = price.replace('$', '')
    # Remove the comma ',' on numbers over 1000 (1,000)
    price = price.replace(',', '')
    # Convert to numeric data type
    price = pd.to_numeric(price)
    return price
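A quick check of `clean_price` on a couple of sample strings (the prices are made up) confirms that both the currency symbol and the thousands separator are handled:

```python
import pandas as pd

def clean_price(price):
    # Strip the currency symbol and the thousands separator, then parse
    price = price.replace('$', '').replace(',', '')
    return pd.to_numeric(price)

print(clean_price('$1,299.99'))  # 1299.99
print(clean_price('$599.00'))    # 599.0
```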
# Create a new column to store the clean price
df['Price_Clean'] = df['Price'].apply(clean_price)
df.shape
df.head()
# Check data on the column
df['Price_Clean'].value_counts(dropna = False)
# Check data on the column
df['Reviews'].value_counts(dropna = False)
# Function to clean the reviews
def clean_reviews(review):
    # Split the string by spaces
    review = review.split()
    # Extract the first item (the stars received)
    review = review[0]
    # Set 'no-info' as 0
    if review == 'no-info':
        review = 0
    # Convert to numeric data type
    review = pd.to_numeric(review)
    return review
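Amazon's review span reads like "4.5 out of 5 stars", so taking the first whitespace-separated token yields the star rating. A small check (the example strings are made up):

```python
import pandas as pd

def clean_reviews(review):
    review = review.split()[0]   # '4.5 out of 5 stars' -> '4.5'
    if review == 'no-info':      # rows with no reviews get 0 stars
        review = 0
    return pd.to_numeric(review)

print(clean_reviews('4.5 out of 5 stars'))  # 4.5
print(clean_reviews('no-info'))             # 0
```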
# Create a new column to store the clean review
df['Review_Clean'] = df['Reviews'].apply(clean_reviews)
df.shape
# Check data on the column
df['Review_Clean'].value_counts(dropna = False)
df.head()
The Name column holds various data about each laptop, including brand, hardware specifications, and screen size. To put it to use, those details need to be extracted into their own columns, which might require some experimenting.
df.Name
Pandas is truncating the Name strings due to their size, which makes it hard to come up with a data extraction method, so let me adjust the pandas display settings first.
pd.set_option('display.max_colwidth', None)
df.Name
Getting what I want will require some regular expressions...
re_find_processor = r'\b([iI][\d])\b'
# Create a column for the processor data
df['Processor'] = df['Name'].str.extract(re_find_processor)
# Standardize processor name by setting all characters to lowercase
df['Processor'] = df['Processor'].str.replace('I', 'i')
# Check result
df['Processor'].value_counts(dropna = False)
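The pattern grabs a standalone `i` or `I` followed by a single digit, so it catches "i5"/"I7" but ignores model numbers. Exercising it with the `re` module on a few made-up titles (`Series.str.extract` applies the same capture group row by row):

```python
import re

re_find_processor = r'\b([iI][\d])\b'

# Made-up listing titles to exercise the pattern
titles = ['Dell Laptop Intel Core i5 8GB',
          'HP Laptop Intel Core I7 16GB',
          'Acer Chromebook Celeron N4020']
for title in titles:
    match = re.search(re_find_processor, title)
    print(match.group(1).lower() if match else 'no-info')
# i5, i7, no-info
```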
re_find_ram = r'\b([\d]+)[GB]+[ ][\+ LlPpDdRrAaMmEeOoYy\d]+\b'
# Create a new column for the RAM data
df['RAM_Memory'] = df['Name'].str.extract(re_find_ram)
# And convert to numeric data type
df['RAM_Memory'] = pd.to_numeric(df['RAM_Memory'])
# Check result
df['RAM_Memory'].value_counts(dropna = False)
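The RAM pattern requires the "GB" to be followed by a space and a memory-type word (DDR4, RAM, LPDDR, ...), spelled out in the trailing character class. That is what lets it skip storage sizes: "SSD" starts with an `S`, which is not in the class. A check on a made-up title containing both:

```python
import re

re_find_ram = r'\b([\d]+)[GB]+[ ][\+ LlPpDdRrAaMmEeOoYy\d]+\b'

title = 'Acme Laptop i5 16GB DDR4 RAM 512GB SSD'  # made-up title
match = re.search(re_find_ram, title)
print(match.group(1))  # 16 -- the 512GB SSD is not matched ('S' is not in the class)
```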
re_find_screen = r'\b([1][\d]+[\.IiNnCcHh\d]*)[ \d]*\b'
# Create a new column for the screen size data
df['Screen_Size'] = df['Name'].str.extract(re_find_screen)
# Check result
df['Screen_Size'].value_counts(dropna = False)
There are some weird values that don't seem to be screen sizes...
# Cleaning function
def clean_screen(size):
    # Guard against non-string values (e.g. NaN from rows where the regex found nothing)
    if not isinstance(size, str):
        return 'no-info'
    size = size.replace('Inch', '')
    size = size.replace('inch', '')
    size = size.replace('in', '')
    try:
        size = pd.to_numeric(size)
    except ValueError:
        size = 'no-info'
    return size
# Apply function
df['Screen_Size'] = df['Screen_Size'].apply(clean_screen)
# Check result
df['Screen_Size'].value_counts(dropna = False)
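Putting the screen-size regex and cleaner together on a made-up title shows the two-step flow: the pattern captures the raw token ("Inch" suffix and all), then the cleaner strips the unit and parses the number, falling back to 'no-info' when parsing fails:

```python
import re
import pandas as pd

re_find_screen = r'\b([1][\d]+[\.IiNnCcHh\d]*)[ \d]*\b'

def clean_screen(size):
    if not isinstance(size, str):      # e.g. NaN when the regex found nothing
        return 'no-info'
    for token in ('Inch', 'inch', 'in'):
        size = size.replace(token, '')
    try:
        return pd.to_numeric(size)
    except ValueError:
        return 'no-info'

raw = re.search(re_find_screen, 'Acme Laptop 15.6Inch i5 16GB').group(1)
print(raw)                # 15.6Inch
print(clean_screen(raw))  # 15.6
print(clean_screen('abc'))  # no-info
```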
df_final = df[['Name', 'Price_Clean', 'Review_Clean', 'Processor', 'RAM_Memory', 'Screen_Size']]
df_final.columns = ['Name', 'Price', 'Review', 'Processor', 'RAM_Memory', 'Screen_Size']
df_final
Creating a filter to search the dataframe and return only laptops with certain specifications.
# Dictionary with filters
filters = {'processor': {'min': 'i5', 'max': 'i7'},
           'ram': 16,
           'max_price': 1500,
           'screen': {'min': 14, 'max': 16},
           'min_stars': 4}
# Filter Processor
df_filtered = df_final[(df_final['Processor'] == filters['processor']['min']) | (df_final['Processor'] == filters['processor']['max'])].copy()
# Filter RAM Memory
df_filtered = df_filtered[(df_filtered['RAM_Memory'] >= filters['ram'])]
# Filter Price
df_filtered = df_filtered[(df_filtered['Price'] <= filters['max_price'])]
# Filter Screen size
df_filtered = df_filtered[(df_filtered['Screen_Size'] >= filters['screen']['min']) & (df_filtered['Screen_Size'] <= filters['screen']['max'])]
# Filter Reviews
df_filtered = df_filtered[(df_filtered['Review'] >= filters['min_stars'])]
print(f'Number of findings: {df_filtered.shape[0]}')
# Filtered results
df_filtered
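The five filtering steps above can be folded into one reusable function. This is a sketch under the assumption that the numeric columns are fully clean (it would mis-compare rows where Screen_Size is still the string 'no-info'); `filter_laptops` and the sample rows are mine, not part of the scraper:

```python
import pandas as pd

def filter_laptops(df, filters):
    # Combine all five conditions into one boolean mask
    mask = (
        df['Processor'].isin([filters['processor']['min'], filters['processor']['max']])
        & (df['RAM_Memory'] >= filters['ram'])
        & (df['Price'] <= filters['max_price'])
        & (df['Screen_Size'] >= filters['screen']['min'])
        & (df['Screen_Size'] <= filters['screen']['max'])
        & (df['Review'] >= filters['min_stars'])
    )
    return df[mask]

# Tiny illustrative frame (made-up rows, not real listings)
sample = pd.DataFrame({
    'Name': ['Laptop A', 'Laptop B'],
    'Price': [1200, 1800],
    'Review': [4.5, 4.8],
    'Processor': ['i7', 'i7'],
    'RAM_Memory': [16, 32],
    'Screen_Size': [15.6, 17.3],
})
filters = {'processor': {'min': 'i5', 'max': 'i7'}, 'ram': 16,
           'max_price': 1500, 'screen': {'min': 14, 'max': 16}, 'min_stars': 4}
print(filter_laptops(sample, filters).shape[0])  # 1 -- only Laptop A passes
```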
There it is! Now I know where to start my laptop search!
Since products on Amazon are frequently updated, I can re-run this later and see if new interesting options appeared!