Web Scraping
Doing any sort of data analysis requires, well, ... data. You can get this data a couple different ways.
Sometimes you'll have structured, ready-to-consume data directly available in a spreadsheet or a database. Other times you'll have to go out and get it yourself.
In this article, we'll learn how to build a Python program that gets it ("scrapes it") for you using the library BeautifulSoup.
Learn to Code with Fantasy Football
This is adapted from my book Learn to Code with Fantasy Football. If you like it, check it out.
In it, I also cover basic Python, Pandas, SQL, data visualization, machine learning and more. I also cover alternatives to scraping, like connecting to a public API, and some places you can find ready-made, already-scraped datasets.
Motivation
There are two situations where webscrapers are especially useful.
First is if you're dealing with some data that's changing over time and you want to take snapshots of it at regular intervals. For example, you could build a program that gets the current weather at every NFL stadium every week.
Second, webscraping is useful when you want to grab a lot of similarly structured data. Scrapers scale really well.
Want annual fantasy point leaders by position for the past 10 years? Write a Python function that's flexible enough to get data given any arbitrary position and year. Then run it 60 times (10 years * six positions — QB, RB, WR, TE, K, DST) and you'll have the data. Trying to manually copy and paste data that many times from a website would be tedious and error-prone.
These two use cases — collecting snapshots of data over time and grabbing a bunch of similarly structured data — aren't the only times you'll ever want to get data from the internet. There might be situations where you find a single table online that you just need the one time. In that case you might be better off copying and pasting it into a csv instead of taking the time to write code to get it automatically.
HTML
Building a webscraper involves understanding some basic HTML + CSS, which are two of the main building blocks used to build websites. We'll learn the minimum required for scraping here. So while you won't necessarily be able to build your own website after this, it will make getting data from websites a lot easier.
HTML is a markup language, which means it includes both the content you see on the screen (the text) along with built-in instructions (the markup) for how the browser should show it.
These instructions come in tags, which are wrapped in angle brackets (<>) and look like this:
<p>
<b>
<div>
Most tags come in pairs, with the start one like <p> and the end like </p>.
We say any text in between is wrapped in the tags. For example, the p tag stands for paragraph, and so:
<p>This text is a paragraph.</p>
Tags themselves aren't visible to regular users, though the text they wrap is. You can see them by right-clicking on a website and selecting 'view source'.
Tags can be nested. The i tag stands for italics, and so:
<p><i>This text is an italic paragraph.</i></p>
Tags can also have one or more attributes, which are just optional data and are also invisible to the user. Two common attributes are id and class:
<p id="intro">This is my intro paragraph.</p>
<p id="2" class="main-body">This is is my second paragraph.</p>
The ids and classes are there so web designers can specify — separately in what's called a CSS file — more rules about how things should look. For example, maybe I want all my intro text (everything wrapped in the id=intro tags) purple. Or maybe I want my main-body text to be in a different font, etc.
As webscrapers, we don't really care how things look, and we don't care about CSS itself. But these tags, ids, and classes are a good way to tell our program what parts of the page we want to get, so we need to be aware of them.
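Jumping ahead a bit: once we get to BeautifulSoup below, filtering on these takes one line of code each. A minimal sketch (soup here is hypothetical; it stands for an already-parsed page):
soup.find_all('p', class_='main-body')  # every p tag with class main-body
soup.find(id='intro')                   # the single tag with id intro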
Let's cover a few common HTML tags for reference:
- <p> - paragraph
- <div> - doesn't do anything directly, but it's a way for web designers to divide up their HTML however they want so they can assign classes and ids to particular sections
- <table> - specifies the start of a table
- <th> - header cell in a table
- <tr> - table row
- <td> - table data
- <a> - link; always includes the attribute href, which specifies where the browser should go when you click on it
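To see a few of these together, here's a small made-up snippet (the class, id, and link text are just for illustration):
<div class="main-body">
<p id="intro">See current ADP at <a href="https://fantasyfootballcalculator.com">Fantasy Football Calculator</a>.</p>
</div>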
HTML Tables
As analysts, we're often interested in tables of data, and it's worth exploring how HTML tables work in more depth.
The table tag is a good example of how HTML tags are nested. Everything in the table (all the data, the column names, everything) is between the <table> and </table> tags.
We already know conceptually that a table is just a collection of rows. Each row in an HTML table is between a pair of <tr> and </tr> tags.
Inside a row, we can denote individual columns with <td> and </td> or — if we're inside the header (column name) row — <th> and </th>. Usually you'll see header tags in the first row.
To summarize:
- everything in an HTML table is between <table> and </table> tags
- within each table, individual rows are between <tr> tags
- within each row, specific columns are between <td> tags (<th> for the header row)
So if we had a table with two rows of fantasy point data, it might look something like this:
<table>
<tr>
<th>Name</th>
<th>Pos</th>
<th>Week</th>
<th>Pts</th>
</tr>
<tr>
<td>Todd Gurley</td>
<td>RB</td>
<td>1</td>
<td>22.6</td>
</tr>
<tr>
<td>Christian McCaffrey</td>
<td>RB</td>
<td>1</td>
<td>14.8</td>
</tr>
</table>
BeautifulSoup
The library BeautifulSoup (abbreviated BS4) is the Python standard for working with HTML data like this. It lets you manipulate, search, and work with HTML and put it into normal Python data structures (like lists and dicts), which we can then put into Pandas.
The key Python type in BeautifulSoup is called tag.
Like lists and dicts (see the Python chapter of LTCWFF for more), BeautifulSoup tags are containers (they hold things). One thing tags can hold is other tags, though they don't have to.
If we're working with an HTML table, we could have a BS4 tag that represents the whole table. We could also have a tag that represents just the one <td>Todd Gurley</td> element. They're both tags. In fact, the entire web page (all the HTML) is just one zoomed out HTML tag.
Let's mentally divide tags into two types.¹
Simple Tags
The first type is a tag with just text inside it, not another tag. We'll call this a "simple" tag. Examples of HTML elements that would be simple tags:
<td>Todd Gurley</td>
<td>1</td>
<th>Name</th>
These are three separate simple tags, and calling .name on them would give td, td, and th respectively. But they don't hold any more tags inside of them, just the text (Todd Gurley, 1, Name).
On simple tags, the key attribute is string. That gives you access to the data inside. So running string on these three tags would give 'Todd Gurley', '1', and 'Name' respectively. These are BeautifulSoup strings, which carry around a bunch of extra data, so it's always good to convert them to regular Python strings with str(mytag.string).
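To make that concrete, here's a minimal sketch (the variable name td_tag and the choice of Python's built-in 'html.parser' are mine, not anything required):
from bs4 import BeautifulSoup as Soup

# parse a single cell and grab the td tag
td_tag = Soup('<td>Todd Gurley</td>', 'html.parser').td

print(td_tag.name)         # td
print(str(td_tag.string))  # Todd Gurley, as a regular Python string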
Nested Tags
As opposed to simple tags, nested tags contain other tags. An example would be the tr or table tags above.
The most important method for nested tags is find_all(tag_name), where tag_name is the name of an HTML tag like 'tr', 'p' or 'td'. The method searches your nested tag for tag_name tags and returns them all in a list.
So, in the example above, mytable.find_all('tr') would give back a list like [tr, tr, tr], where each tr is a BeautifulSoup tr tag.
Since each of these tr tags holds a handful of td (or th) tags, they themselves are nested tags. That means we can call find_all() on them to get back a list of the td tags in each row. These td tags are simple tags, so we can call .string on each of them to (finally) get the data out.
The constant nesting (tags inside of tags inside of tags) can make this hard to wrap your mind around, but it's not too bad with code.
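Here's a minimal sketch putting simple and nested tags together on the two-row fantasy table from above (the variable names and the 'html.parser' argument are my choices):
from bs4 import BeautifulSoup as Soup

# the two-row fantasy point table from above, as one string
html_table = """
<table>
<tr><th>Name</th><th>Pos</th><th>Week</th><th>Pts</th></tr>
<tr><td>Todd Gurley</td><td>RB</td><td>1</td><td>22.6</td></tr>
<tr><td>Christian McCaffrey</td><td>RB</td><td>1</td><td>14.8</td></tr>
</table>
"""

table = Soup(html_table, 'html.parser').table  # nested tag

rows = table.find_all('tr')            # [tr, tr, tr]
first_data_row = rows[1]               # rows[0] is the header
cells = first_data_row.find_all('td')  # four simple td tags
print([str(x.string) for x in cells])  # ['Todd Gurley', 'RB', '1', '22.6']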
Fantasy Football Calculator ADP - Web Scraping Example
In this section, we'll build a scraper to get all the ADP data off fantasyfootballcalculator.com and put it in a DataFrame. The full code for this example is available at github.com/nathanbraun/ffc-adp-scraping-example.
Note, for help with basic setup and installing and running Python code, see my article on Installing Python and Getting Set Up with Spyder.
We can import what we need like this (see the file bs4_demo.py):
In [1]: from bs4 import BeautifulSoup as Soup
In [2]: import requests
In [3]: from pandas import DataFrame
Besides BeautifulSoup, we're also importing the requests library to visit the page and store the raw HTML we get back. We're also going to want to put our final result in a DataFrame (again, see the chapter on Pandas and DataFrames in Learn to Code with Fantasy Football), so we've imported that too.
The first step is using requests to visit the page we want to scrape and storing the raw HTML we get back.
In [4]: ffc_response = requests.get('https://fantasyfootballcalculator.com/adp/ppr/12-team/all/2017')
Let's look at (part of) the returned HTML (note I'm omitting the print(ffc_response.text) call and just showing you part of what we get back).
<tr class='PK'>
<td align='right'>178</td>
<td align='right'>14.07</td>
<td class="adp-player-name"><a style="color: rgb(30, 57, 72); font-weight: 300;" href="/players/mason-crosby">Mason Crosby</a></td>
<td>PK</td>
<td>GB</td>
<td class="d-none d-sm-table-cell" align='right'>162.6</td>
<td class="d-none d-sm-table-cell" align='right'>5.9</td>
<td class="d-none d-sm-table-cell" align='right'>13.06</td>
<td class="d-none d-sm-table-cell" align='right'>15.10</td>
<td class="d-none d-sm-table-cell" align='right'>44</td>
<td align='center'> </td>
</tr>
The text attribute on ffc_response turns it into a string. This is just a small snippet of the HTML we get back — there are almost 250k lines of HTML here — but you get the picture.
Now let's parse it, i.e. turn it into BeautifulSoup data. Note: depending on your version of BeautifulSoup, you might see a warning here about not specifying a parser; you can make it explicit with Soup(ffc_response.text, 'html.parser').
In [5]: adp_soup = Soup(ffc_response.text)
Remember, this top level adp_soup object is a giant nested tag, which means we can run find_all() on it.
We can never be 100% sure when dealing with sites that we didn't create, but looking at fantasyfootballcalculator.com, it's probably safe to assume the data we want is in an HTML table.
In [6]: tables = adp_soup.find_all('table')
Remember, find_all() always returns a list, even if there's just one table. Let's see how many tables we got back.
In [7]: len(tables)
Out[7]: 1
It's just the one here, which is easy, but it's not at all uncommon for sites to have multiple tables, in which case we'd have to pick out the one we wanted from the list. Technically we still have to pick this one out, but since there's only one, it's just the first (i.e. the 0th) item in the list.
In [8]: adp_table = tables[0]
Looking at it in the REPL, we can see it has the same <tr>, <th> and <td> structure we talked about above, though — being a real website — the tags have other attributes (class, align, style, etc.).
adp_table is still a nested tag, so let's run another find_all().
In [9]: rows = adp_table.find_all('tr')
This gives us a list of all the tr tags inside our table. Let's look at the first one.
In [10]: rows[0]
Out[10]:
<tr>
<th>#</th>
<th>Pick</th>
<th>Name</th>
<th>Pos</th>
<th>Team</th>
<th class="d-none d-sm-table-cell">Overall</th>
<th class="d-none d-sm-table-cell">Std.<br/>Dev</th>
<th class="d-none d-sm-table-cell">High</th>
<th class="d-none d-sm-table-cell">Low</th>
<th class="d-none d-sm-table-cell">Times<br/>Drafted</th>
<th></th>
</tr>
It's the header row, good. That'll be useful later for knowing what columns are what.
Now how about some data.
In [11]: first_data_row = rows[1]
In [12]: first_data_row
Out[12]:
<tr class="RB">
<td align="right">1</td>
<td align="right">1.01</td>
<td class="adp-player-name"><a href="/players/david-johnson" style="color: rgb(30, 57, 72); font-weight: 300;">David Johnson</a></td>
<td>RB</td>
<td>ARI</td>
<td align="right" class="d-none d-sm-table-cell">1.3</td>
<td align="right" class="d-none d-sm-table-cell">0.6</td>
<td align="right" class="d-none d-sm-table-cell">1.01</td>
<td align="right" class="d-none d-sm-table-cell">1.04</td>
<td align="right" class="d-none d-sm-table-cell">310</td>
<td align="center"></td>
</tr>
It's the first, lowest-ADP pick — David Johnson in the summer of 2017 — nice.
Note this is still a nested tag, so we'll be using find_all() again, but it holds a bunch of simple td tags, so the end is in sight.
In [13]: first_data_row.find_all('td')
Out[13]:
[<td align="right">1</td>,
<td align="right">1.01</td>,
<td class="adp-player-name"><a href="/players/david-johnson" style="color: rgb(30, 57, 72); font-weight: 300;">David Johnson</a></td>,
<td>RB</td>,
<td>ARI</td>,
<td align="right" class="d-none d-sm-table-cell">1.3</td>,
<td align="right" class="d-none d-sm-table-cell">0.6</td>,
<td align="right" class="d-none d-sm-table-cell">1.01</td>,
<td align="right" class="d-none d-sm-table-cell">1.04</td>,
<td align="right" class="d-none d-sm-table-cell">310</td>,
<td align="center"></td>]
This returns a list of simple td tags, finally. Now we can call .string on each of them (wrapped in str) to get the data out.
In [14]: [str(x.string) for x in first_data_row.find_all('td')]
Out[14]:
['1',
'1.01',
'David Johnson',
'RB',
'ARI',
'1.3',
'0.6',
'1.01',
'1.04',
'310',
'\n']
Note the list comprehension. Scraping is an area where comprehensions really become valuable.
Now that that's done, we have to do it to every row. Whenever you have to do something a bunch of times, you should automatically think about putting it in a function so you don't repeat yourself. Let's make a function that'll parse some row (a BeautifulSoup tr tag).
def parse_row(row):
    """
    Take in a tr tag and get the data out of it in the form of a list of
    strings.
    """
    return [str(x.string) for x in row.find_all('td')]
Now we have to apply parse_row to every row and put the results in a DataFrame. To make a new DataFrame we do df = DataFrame(data), where data can be one of various data structures.
Looking at the pandas documentation, one type of data we can use is a "structured or record array". In other words, a list of lists (rows) — as long as they're all the same shape — should work.
In [15]: list_of_parsed_rows = [parse_row(row) for row in rows[1:]]
Remember, the first row (rows[0]) is the header, so rows[1:] will give us our data, from row 1 to the end. Then we have to put it in a DataFrame:
In [16]: df = DataFrame(list_of_parsed_rows)
In [17]: df.head()
Out[17]:
0 1 2 3 4 5 ... 8 9 10
0 1 1.01 David Johnson RB ARI 1.3 ... 1.04 310 \n
1 2 1.02 LeVeon Bell RB PIT 2.3 ... 1.06 303 \n
2 3 1.04 Antonio Brown WR PIT 3.7 ... 1.07 338 \n
3 4 1.06 Julio Jones WR ATL 5.7 ... 2.03 131 \n
4 5 1.06 Ezekiel Elliott RB DAL 6.2 ... 2.05 180 \n
Great, it just doesn't have column names. Let's fix that. We could parse them out of the header row, but it's easiest to just name them what we want.
In [18]: df.columns = ['ovr', 'pick', 'name', 'pos', 'team', 'adp',
'std_dev', 'high', 'low', 'drafted', 'graph']
We're almost there. Our one remaining issue is that all of the data is in string form at the moment, just like numbers stored as text in Excel. This is because we explicitly told Python to do this (via str) in our parse_row function, but converting types is an easy fix.
In [19]: float_cols = ['adp', 'std_dev']
In [20]: int_cols = ['ovr', 'drafted']
In [21]: df[float_cols] = df[float_cols].astype(float)
In [22]: df[int_cols] = df[int_cols].astype(int)
It's debatable whether pick, high, and low should be numbers or strings. Technically they're numbers, but I kept them as strings because the decimal in the picks (e.g. 1.12) doesn't mean what it normally means and wouldn't behave the way we'd expect if we tried to do math on it.
Also, let's get rid of the graph column, which (as we can see on the website) is a UI checkbox on Fantasy Football Calculator, not real data.
In [23]: df.drop('graph', axis=1, inplace=True)
And we're done:
In [24]: df.head()
Out[24]:
ovr pick name pos team adp std_dev high low drafted
0 1 1.01 David Johnson RB ARI 1.3 0.6 1.01 1.04 310
1 2 1.02 LeVeon Bell RB PIT 2.3 0.8 1.01 1.06 303
2 3 1.04 Antonio Brown WR PIT 3.7 1.0 1.01 1.07 338
3 4 1.06 Julio Jones WR ATL 5.7 3.2 1.01 2.03 131
4 5 1.06 Ezekiel Elliott RB DAL 6.2 2.8 1.01 2.05 180
There we go, we've built our first webscraper!
Exercises
Note: these are some of the end-of-chapter exercises from this part of my book. I've included them here so you can test yourself on what you've read.
The book includes fully coded up Python solutions if you want to check your answers.
Problem 1
Put everything we did in the Fantasy Football Calculator ADP example inside a function scrape_ffc that takes in scoring (a string that's one of: 'ppr', 'half', 'std'), nteams (number of teams), and year, and returns a DataFrame with the scraped data.
Your function should work for all three scoring systems; for 8, 10, 12, or 14 team leagues; and for any year from 2010 to 2019.
Caveats:
- Fantasy Football Calculator only has half-ppr data going back to 2018, so you only need to be able to get std and ppr before that.
- All of Fantasy Football Calculator's archived data is for 12 teams. URLs with other numbers of teams will work (return data), but if you look closely the data is always the same.
Problem 2
On the Fantasy Football Calculator page we're scraping, you'll notice that each player's name is a clickable link. Modify scrape_ffc to store that as well, adding a column named 'link' for it in your DataFrame.
Hint: bs4 tags include the get method, which takes the name of an HTML attribute and returns its value.
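For instance, a minimal sketch of get in action (name_td is hypothetical; it stands for whatever adp-player-name td tag you've already pulled out of a row):
link_tag = name_td.find('a')  # the a tag inside the cell
url = link_tag.get('href')    # e.g. '/players/david-johnson'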
Problem 3
Write a function ffc_player_info that takes one of the urls we stored in (2), scrapes it, and returns the player's team, height, weight, birthday, and draft info (team and pick) as a dict.
Learn to Code with Fantasy Football
Again, if you like this article and want more like it, check out my book, Learn to Code with Fantasy Football. In it, you'll learn what to do with this data now that you've scraped it.
In it, I also cover basic Python, Pandas, SQL, data visualization, machine learning and more. I also cover alternatives to scraping, like connecting to a public API, and some places you can find ready-made, already-scraped datasets.
¹ BeautifulSoup doesn't really distinguish between these, but I think it makes things easier to understand.