A practical introduction to Web Scraping with Python

We'll learn to scrape with Python by pulling RTX inventory, price and more from Newegg. First we'll cover how to crawl the content, scrape the data we need and then save the output to a CSV file.

Introduction

In this post, we'll cover how to scrape Newegg using Python, lxml and requests. Python is a great language that anyone can pick up quickly, and I believe it's also one of the more readable languages: you can quickly scan the code to determine what it is doing.

Just look at this loop with an auto-incrementing index:

for index, element in enumerate(href_elements):

We'll scrape Newegg with the use case of monitoring prices and inventory, specifically for the RTX 3080 and RTX 3090.

Setting up

We're going to work in a virtual Python environment, which lets us manage dependencies and versions separately for each application/project. Let's create a virtual environment in our home directory and install the dependencies we need.

Make sure you are running at least Python 3.6; Python 3.5 has reached end of life.
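If you're not sure which version you have, here's a quick, optional check you can run in a Python shell before continuing (just a sanity check, nothing project-specific):

import sys

# abort early if the interpreter is older than what this guide assumes
assert sys.version_info >= (3, 6), "this guide assumes Python 3.6+"
print(sys.version)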

mkdir ~/intro-web-scraping
cd ~/intro-web-scraping
virtualenv env
. env/bin/activate # activate the environment which populates the shell's PATH
pip install lxml requests

Let's create the following folders and files.

|-- env # auto generated by virtualenv
|-- core
| |-- crawler.py
| |-- scraper.py
| |-- utils.py
|-- newegg
| |-- __main__.py

We created a __main__.py file, which lets us run the Newegg scraper with the following command (nothing should happen right now):

python -m newegg

Crawling the content

We need to write code that can crawl the content; by crawl I mean fetch or download the HTML from the target website. Our first target is Newegg, which doesn't seem to require JavaScript for the data we need. We'll get into rendering JavaScript in a future post that covers headless scraping of Google Places using requests-html.

Open core/crawler.py, which we created earlier. We'll begin by requesting the HTML content from Newegg's domain.

import requests
newegg = "https://newegg.com"
response = requests.get(newegg)
print(response.status_code)

In newegg/__main__.py we can import crawler and the code above will execute.

from core import crawler

Remember, you can execute and test your code at any time with the previous python command in your terminal (it must be run from the root folder, ~/intro-web-scraping).

python -m newegg

It looks like the request succeeded; the status code printed to your terminal should be a successful 200. Let's clean up the code to make it reusable and define a function that returns the response content.

In core/crawler.py we'll define a crawl_html function (we want to reuse it, and this lets us change where the HTML comes from in the future).

import requests
def crawl_html(url):
    response = requests.get(url)
    return response.content  # returns the content in bytes (required later for lxml)

In newegg/__main__.py we'll use the function; you can run it and see the HTML being printed. We use an uppercase variable, NEWEGG_URL, to define a constant - something that shouldn't change.

from core import crawler
NEWEGG_URL = "https://newegg.com"
html = crawler.crawl_html(NEWEGG_URL)
print(html)

Scraping the data we need

Now that we have access to the HTML content from Newegg, we want a way to pull out stock information and prices for the RTX 3080 and RTX 3090. First, let's find the Newegg page that has that information.

Navigate to https://www.newegg.com/p/pl?N=100007709%20601357282 in your browser and you'll see we have filters applied for RTX 30 series.

Newegg Category Page

We'll take that path and append it to our NEWEGG_URL. We do this using f-strings in Python, which are a way to interpolate variables into strings.

from core import crawler
NEWEGG_URL = "https://newegg.com"
NEWEGG_RTX_PATH = "/p/pl?N=100007709%20601357282"
crawl_url = f"{NEWEGG_URL}{NEWEGG_RTX_PATH}"
html = crawler.crawl_html(crawl_url)
print(html)

From this URL we can start scraping the data we need. Let's start by creating a few useful functions in the file core/scraper.py. These functions wrap lxml and handle some of the type conversions to make the data easier to work with.

from lxml import html
def get_tree(html_content):
    return html.fromstring(html_content)

def get_text(tree, xpath_selector):
    elements = tree.xpath(xpath_selector)
    return list(map(lambda element: element.text_content(), elements))

def get_attributes(tree, xpath_selector, attribute):
    elements = tree.xpath(xpath_selector)
    return list(map(lambda element: element.get(attribute), elements))
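As a quick sanity check, here's how these helpers behave on a tiny, made-up HTML fragment (the markup below is purely illustrative, not Newegg's actual HTML):

sample = b"<ul><li class='price-current'><a href='/item-1'>$699.99</a></li></ul>"
tree = get_tree(sample)
print(get_text(tree, "//li[contains(@class, 'price-current')]"))  # ['$699.99']
print(get_attributes(tree, "//li/a", "href"))  # ['/item-1']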

Finding the data

We'll first try to get the prices with XPath. I highly recommend using XPath over CSS selectors; it's more declarative and more expressive, and you can use this simple cheat sheet to quickly look up how to specify selectors. A more in-depth guide can be found at librarycarpentry.
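If XPath is new to you, here's a tiny, self-contained example of how declarative it reads (the HTML fragment is made up for illustration):

from lxml import html

fragment = html.fromstring("<div class='item-info'><a href='/rtx'>RTX 3080</a></div>")
# "find every <a> that is a direct child of a div whose class is exactly 'item-info'"
print(fragment.xpath("//div[@class='item-info']/a/text()"))  # ['RTX 3080']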

Open your Chrome browser and visit the crawl URL we defined earlier: https://www.newegg.com/p/pl?N=100007709%20601357282.

Press F12 on your keyboard, or open the developer console by right-clicking one of the prices on the page and selecting Inspect.

Newegg Inspect

Using XPath

We'll use the inspector and practice our XPath to figure out how to get all the prices on the page (there are 29 items listed). This selector, //li[contains(@class, 'price-current')], grabs all the relevant prices.

Newegg Inspector
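
Before wiring the selector into our code, we can sanity-check it from a Python shell. This hits the live page, so the exact count will depend on what Newegg is listing at the time:

from lxml import html
import requests

page = requests.get("https://www.newegg.com/p/pl?N=100007709%20601357282")
tree = html.fromstring(page.content)
prices = tree.xpath("//li[contains(@class, 'price-current')]")
print(len(prices))  # roughly 29 at the time of writing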

With the selector in hand, let's modify our newegg/__main__.py entry file by adding a new function to grab the prices.

from core import crawler, scraper
NEWEGG_URL = "https://newegg.com"
NEWEGG_RTX_PATH = "/p/pl?N=100007709%20601357282"
def get_rtx_prices(tree):
    price_selector = "//li[contains(@class, 'price-current')]"
    return scraper.get_text(tree, price_selector)
crawl_url = f"{NEWEGG_URL}{NEWEGG_RTX_PATH}"
html = crawler.crawl_html(crawl_url)
tree = scraper.get_tree(html)
prices = get_rtx_prices(tree)
print(prices)

We should see output like the following.

['$1,499.99\xa0–', '$749.99\xa0–', '$809.99\xa0–', '$1,619.99\xa0–', '$1,549.99\xa0–', '$1,549.99\xa0–', '$729.99\xa0–', '$759.99\xa0–', '$1,589.99\xa0–', '$699.99\xa0–', '$749.99\xa0–', '$749.99\xa0–', '$1,799.99\xa0–', '$1,499.99\xa0–', '$1,799.99\xa0–', '$1,599.99\xa0–', 'COMING SOON', '$739.99\xa0–', '$699.99\xa0–', '$1,579.99\xa0–', '$1,499.99\xa0–', '$699.99\xa0–', '$699.99\xa0–', '$729.99\xa0–', '$1,499.99\xa0–', '$1,729.99\xa0–', '$789.99\xa0–', 'COMING SOON', '$1,499.99\xa0–']

Let's clean up the extra HTML entity appearing at the end of our prices with a utility function. We'll make use of re for regex and unescape from the html module to clean up our data. We need to check whether the input contains numbers in order to account for the COMING SOON labels. We'll keep this logic encapsulated in get_rtx_prices by mapping over each item and then converting the result back to a list (map returns an iterator).

from core import crawler, scraper
from html import unescape
import re
NEWEGG_URL = "https://newegg.com"
NEWEGG_RTX_PATH = "/p/pl?N=100007709%20601357282"
def clean_price(price):
    price_contains_numbers = bool(re.search(r'[\d+,]+(\d+)', price))
    if price_contains_numbers:
        # split the price to remove the empty space and pick the first item
        price = unescape(price).split()[0]
    return price

def get_rtx_prices(tree):
    price_selector = "//li[contains(@class, 'price-current')]"
    price_text = scraper.get_text(tree, price_selector)
    return list(map(lambda price: clean_price(price), price_text))
crawl_url = f"{NEWEGG_URL}{NEWEGG_RTX_PATH}"
html = crawler.crawl_html(crawl_url)
tree = scraper.get_tree(html)
prices = get_rtx_prices(tree)
print(prices)
Run it again and the prices now come out clean:

['$1,499.99', '$749.99', '$809.99', '$1,619.99', '$1,549.99', '$1,549.99', '$729.99', '$759.99', '$1,589.99', '$699.99', '$749.99', '$749.99', '$1,799.99', '$1,499.99', '$1,799.99', '$1,599.99', 'COMING SOON', '$739.99', '$699.99', '$1,579.99', '$1,499.99', '$699.99', '$699.99', '$729.99', '$1,499.99', '$1,729.99', '$789.99', 'COMING SOON', '$1,499.99']

Let's grab the item names.

def get_rtx_names(tree):
    name_selector = "//div[@class='item-info']/a"
    return scraper.get_text(tree, name_selector)

We also want the link to the item.

def get_rtx_links(tree):
    link_selector = "//div[@class='item-info']/a"
    return scraper.get_attributes(tree, link_selector, "href")

More complex XPath

Next we want the stock information (out of stock or in stock). To do this we need to add another function, get_children_text, to core/scraper.py. It lets us specify a parent selector and a child selector, and returns the first matching child for each parent. If the parent selector has many matches, it will try to find a matching child under each one, and if it does not find one it will return None for that parent. In our case we have many parent matches, but some of them may not contain the OUT OF STOCK element.

In core/scraper.py add the new function.

def get_children_text(tree, xpath_parent_selector, xpath_child_selector):
    parent_elements = tree.xpath(xpath_parent_selector)
    children_texts = []
    for element in iter(parent_elements):
        # for each parent, try to find 1 child with that selector
        child = element.xpath(xpath_child_selector)
        if child:
            children_texts.append(child[0].text_content())
        else:
            # we add None to indicate the item at this index had no match
            children_texts.append(None)
    return children_texts

Back in newegg/__main__.py we can add the stock selector.

def get_rtx_stock_information(tree):
    item_selector = "//div[@class='item-container']"
    child_selector = "div[@class='item-info']/p[contains(., 'OUT OF STOCK')]"
    stock_details = scraper.get_children_text(tree, item_selector, child_selector)
    # set None to in stock, handles case when item has no "out of stock" label
    return list(map(lambda element: element or "IN STOCK", stock_details))

We also want the product ID; having it can help us track changes to the product in the future. Here's how we can find the item ID on the page.

Item id selector

Notice in the highlighted lines below that we added another function to our scraper. Because we are using XPath's text() function, we are asking for the bare text node, which ignores the strong label node in the tree seen in the screenshot above.

def get_rtx_ids(tree):
    item_id_selector = "//ul[@class='item-features']/li[contains(., 'Item #')]/text()"
    return scraper.get_nodes(tree, item_id_selector)

Let's add get_nodes to our core/scraper.py module.

def get_nodes(tree, xpath_selector):
    return tree.xpath(xpath_selector)
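To see why text() matters here, here's a small illustration with a made-up item-features entry (the markup mimics the structure in the screenshot, it is not copied from Newegg):

from lxml import html

tree = html.fromstring("<ul class='item-features'><li><strong>Item #: </strong>N82E16814137597</li></ul>")
li = tree.xpath("//li")[0]
print(li.xpath("text()"))     # ['N82E16814137597'] - only the bare text node
print(li.text_content())      # 'Item #: N82E16814137597' - includes the <strong> label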

Our final output structure

Let's put it all together now to generate the final structure for our output, which will contain basic stock information, price, product name, product ID and product link.

def get_rtx_items(tree):
    prices = get_rtx_prices(tree)
    names = get_rtx_names(tree)
    links = get_rtx_links(tree)
    ids = get_rtx_ids(tree)
    stock_details = get_rtx_stock_information(tree)
    items = []
    for index, price in enumerate(prices):
        name = names[index]
        link = links[index]
        stock = stock_details[index]
        id = ids[index]
        items.append({
            'name': name,
            'link': link,
            'stock': stock,
            'price': price,
            'id': id
        })
    return items

This is what our newegg/__main__.py should look like now.

from core import crawler, scraper
from html import unescape
import re
NEWEGG_URL = "https://newegg.com"
NEWEGG_RTX_PATH = "/p/pl?N=100007709%20601357282"
def clean_price(price):
    price_contains_numbers = bool(re.search(r'[\d+,]+(\d+)', price))
    if price_contains_numbers:
        # split the price to remove the empty space and pick the first item
        price = unescape(price).split()[0]
    return price

def get_rtx_prices(tree):
    price_selector = "//li[contains(@class, 'price-current')]"
    price_text = scraper.get_text(tree, price_selector)
    return list(map(lambda price: clean_price(price), price_text))

def get_rtx_names(tree):
    name_selector = "//div[@class='item-info']/a"
    return scraper.get_text(tree, name_selector)

def get_rtx_links(tree):
    link_selector = "//div[@class='item-info']/a"
    return scraper.get_attributes(tree, link_selector, "href")

def get_rtx_stock_information(tree):
    item_selector = "//div[@class='item-container']"
    child_selector = "div[@class='item-info']/p[contains(., 'OUT OF STOCK')]"
    stock_details = scraper.get_children_text(
        tree, item_selector, child_selector)
    # set None to in stock, handles case when item has no "out of stock" label
    return list(map(lambda element: element or "IN STOCK", stock_details))

def get_rtx_ids(tree):
    item_id_selector = "//ul[@class='item-features']/li[contains(., 'Item #')]/text()"
    return scraper.get_nodes(tree, item_id_selector)

def get_rtx_items(tree):
    prices = get_rtx_prices(tree)
    names = get_rtx_names(tree)
    links = get_rtx_links(tree)
    ids = get_rtx_ids(tree)
    stock_details = get_rtx_stock_information(tree)
    items = []
    for index, price in enumerate(prices):
        name = names[index]
        link = links[index]
        stock = stock_details[index]
        id = ids[index]
        items.append({
            'name': name,
            'link': link,
            'stock': stock,
            'price': price,
            'id': id
        })
    return items
crawl_url = f"{NEWEGG_URL}{NEWEGG_RTX_PATH}"
html = crawler.crawl_html(crawl_url)
tree = scraper.get_tree(html)
rtx_items = get_rtx_items(tree)
print(rtx_items)

I've omitted some of the results for readability, but the output should total 29 products as of this post.

[{'name': 'MSI GeForce RTX 3080 DirectX 12 RTX 3080 GAMING X TRIO 10G 10GB 320-Bit GDDR6X PCI Express 4.0 HDCP Ready Video Card', 'link': 'https://www.newegg.com/msi-geforce-rtx-3080-rtx-3080-gaming-x-trio-10g/p/N82E16814137597', 'stock': 'OUT OF STOCK', 'price': '$759.99', 'id': 'N82E16814137597'}, {'name': 'ASUS TUF Gaming NVIDIA GeForce RTX 3080 TUF-RTX3080-10G-GAMING Video Card', 'link': 'https://www.newegg.com/asus-geforce-rtx-3080-tuf-rtx3080-10g-gaming/p/N82E16814126453', 'stock': 'OUT OF STOCK', 'price': '$699.99', 'id': 'N82E16814126453'}]

Saving our data

With our data in hand, we can quickly save it for analysis later - it's not hard to imagine what else is possible once you have the data you want. We could monitor price changes for these items, their stock status, or when new items are added.

Let's add two CSV utility functions to our core/utils.py file. We will write one to transform our scraped output into proper CSV lines and another to write the CSV file.

def dict_to_csv_lines(data_rows):
    lines = []
    for row in iter(data_rows):
        columns = row.keys()
        column_values = []
        for key in iter(columns):
            column_value = row[key]
            # account for commas in csv column value
            if "," in column_value:
                column_value = f"\"{column_value}\""
            column_values.append(column_value)
        lines.append(','.join(column_values))
    return lines

def write_to_csv(file_name, data_rows):
    column_headers = data_rows[0].keys()
    lines = dict_to_csv_lines(data_rows)
    with open(f"{file_name}.csv", 'w') as file:
        # first write headers
        headers = ','.join(column_headers)
        file.write(f"{headers}\n")
        for csv_line in iter(lines):
            file.write(f"{csv_line}\n")

We can use it in our newegg/__main__.py file to save the output we receive from get_rtx_items. First, import the utils at the top of the file.

from core import crawler, scraper, utils

Now let's call our utility function at the bottom of our Newegg scraper to save the results and complete the full web scraping cycle - crawling, scraping and saving the output.

crawl_url = f"{NEWEGG_URL}{NEWEGG_RTX_PATH}"
html = crawler.crawl_html(crawl_url)
tree = scraper.get_tree(html)
rtx_items = get_rtx_items(tree)
utils.write_to_csv("rtx_output", rtx_items)

Checking the output

We can open the CSV file to view the output; it's saved in the folder we created at the beginning, ~/intro-web-scraping.

Newegg output
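
If you'd rather check it from Python than a spreadsheet, here's a quick way to read it back (this assumes the rtx_output.csv file from the previous step exists in the current directory):

import csv

with open("rtx_output.csv", newline='') as file:
    for row in csv.DictReader(file):
        print(row["name"], row["price"], row["stock"])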

Wrapping up

From this guide we should have learned most of what I believe are the web scraping basics:

  1. Crawling content (using requests)
  2. Scraping relevant data (lxml and XPath)
  3. Saving the output (writing to a csv file)

What we didn't cover:

  1. Headers
  2. Proxies (residential, data center, tor)
  3. Headless browsers
  4. Bot detection (fingerprinting)
  5. Throttling
  6. Captcha (recaptcha, image based input)

In a future post, we will scrape a website that requires JavaScript rendering, and we'll make use of the requests-html Python library to render the page and execute JavaScript.

Hopefully you'll find this post enlightening; web scraping has some really creative use cases that are not so obvious. Till next time!
