Scraping ADP Data from Fantasy Football Calculator in Python

Published: Tue 17 March 2020


Doing any sort of data analysis requires, well, ... data. You can get this data a couple different ways.

Sometimes you'll have structured, ready-to-consume data directly available in a spreadsheet or a database. Other times you'll have to go out and get it yourself.

In this article, we'll learn how to build a Python program that gets it ("scrapes it") for you using the library BeautifulSoup.

Learn to Code with Fantasy Football

This is adapted from my book Learn to Code with Fantasy Football. If you like it, check it out.

In it, I also cover basic Python, Pandas, SQL, data visualization, machine learning and more. I also cover alternatives to scraping, like connecting to a public API, and some places you can find ready-made, already-scraped datasets.

Motivation

There are two situations where webscrapers are especially useful.

First is if you're dealing with some data that's changing over time and you want to take snapshots of it at regular intervals. For example, you could build a program that gets the current weather at every NFL stadium every week.

Second, webscraping is useful when you want to grab a lot of similarly structured data. Scrapers scale really well.

Want annual fantasy point leaders by position for the past 10 years? Write a Python function that's flexible enough to get data for any arbitrary position and year. Then run it 60 times (10 years * six positions — QB, RB, WR, TE, K, DST) and you'll have the data. Trying to manually copy and paste data that many times from a website would be tedious and error prone.
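To make that concrete, here's a minimal sketch of what that might look like. Note that get_leaders is a hypothetical function standing in for whatever scraping code you'd actually write:

# sketch: get_leaders is a hypothetical scraping function
positions = ['QB', 'RB', 'WR', 'TE', 'K', 'DST']
years = range(2010, 2020)

# 6 positions * 10 years = 60 calls
all_data = [get_leaders(pos, year) for pos in positions for year in years]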

These two use cases — collecting snapshots of data over time and grabbing a bunch of similarly structured data — aren't the only times you'll ever want to get data from the internet. There might be situations where you find a single table online that you just need the one time. In that case you might be better off copying and pasting it into a csv instead of taking the time to write code to get it automatically.

HTML

Building a webscraper involves understanding some basic HTML and CSS, two of the main building blocks used to make websites. We'll learn the minimum required for scraping here. So while you won't necessarily be able to build your own website after this, it will make getting data from websites a lot easier.

HTML is a markup language, which means it includes both the content you see on the screen (the text) along with built in instructions (the markup) for how the browser should show it.

These instructions come in tags, which are wrapped in arrow brackets (<>) and look like this:

<p>
<b>
<div>

Most tags come in pairs, with an opening tag like <p> and a closing tag like </p>. We say any text in between is wrapped in the tags. For example, the p tag stands for paragraph, and so:

<p>This text is a paragraph.</p>

Tags themselves aren't visible to regular users, though the text they wrap is. You can see them by right clicking on a website and selecting 'view source'.

Tags can be nested. The i tag stands for italics, and so:

<p><i>This text is an italic paragraph.</i></p>

Tags can also have one or more attributes, which are just optional data and are also invisible to the user. Two common attributes are id and class:

<p id="intro">This is my intro paragraph.</p>
<p id="2" class="main-body">This is is my second paragraph.</p>

The ids and classes are there so web designers can specify — separately in what's called a CSS file — more rules about how things should look. For example, maybe I want all my intro text (everything wrapped in the id="intro" tag) to be purple. Or maybe I want my main-body text to be in a different font, etc.

As webscrapers, we don't really care how things look, so we don't care about CSS itself. But these tags, ids, and classes are a good way to tell our program what parts of the page we want, so we need to be aware of them.

Let's cover a few common HTML tags for reference:

  • <p>: paragraph
  • <div>: doesn't really do anything directly, but gives web designers a way to divide up their HTML however they want so they can assign classes and ids to particular sections
  • <table>: specifies the start of a table
  • <th>: header cell in a table
  • <tr>: denotes a table row
  • <td>: table data (an individual cell)
  • <a>: link; always includes the attribute href, which specifies where the browser should go when you click on it

HTML Tables

As analysts, we're often interested in tables of data, and it's worth exploring how HTML tables work in more depth.

The table tag is a good example of how HTML tags are nested. Everything in the table (all the data, the column names, everything) is between the <table> </table> tags.

We already know conceptually that a table is just a collection of rows. Each row in an HTML table is between a pair of <tr> and </tr> tags.

Inside a row, we can denote individual columns with <td> and </td> or — if we're inside the header (column name) row — <th> and </th>. Usually you'll see header tags in the first row.

To summarize:

  • everything in an html table is between <table> and </table> tags
  • within each table, individual rows are between <tr> tags
  • within each row, specific columns are between <td> tags (<th> for the header row)

So if we had a table with two rows of fantasy point data, it might look something like this:

<table>
  <tr>
   <th>Name</th>
   <th>Pos</th>
   <th>Week</th>
   <th>Pts</th>
  </tr>
  <tr>
   <td>Todd Gurley</td>
   <td>RB</td>
   <td>1</td>
   <td>22.6</td>
  </tr>
  <tr>
   <td>Christian McCaffrey</td>
   <td>RB</td>
   <td>1</td>
   <td>14.8</td>
  </tr>
</table>

BeautifulSoup

The library BeautifulSoup (abbreviated BS4) is the Python standard for working with HTML data like this. It lets you manipulate, search, and work with HTML and put it into normal Python data structures (like lists and dicts), which we can then put into Pandas.

The key Python type in BeautifulSoup is called a tag.

Like lists and dicts (see the Python chapter of LTCWFF for more), BeautifulSoup tags are containers (they hold things). One thing tags can hold is other tags, though they don't have to.

If we're working with an HTML table, we could have a BS4 tag that represents the whole table. We could also have a tag that represents just the one <td>Todd Gurley</td> element. They're both tags. In fact, the entire web page (all the HTML) is just one zoomed-out tag.

Let's mentally divide tags into two types¹.

Simple Tags

The first type is a tag with just text inside it, not another tag. We'll call this a "simple" tag. Examples of HTML elements that would be simple tags:

<td>Todd Gurley</td>
<td>1</td>
<th>Name</th>

These are three separate simple tags; doing .name on them would give td, td, and th respectively. But they don't hold any more tags inside of them, just the text (Todd Gurley, 1, Name).

On simple tags, the key attribute is string, which gives you access to the data inside. So accessing string on these three tags would give 'Todd Gurley', '1', and 'Name' respectively. These are BeautifulSoup strings, which carry around a bunch of extra data, so it's always good to convert them to regular Python strings with str(mytag.string).
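For example, here's a quick, self-contained illustration (building a simple tag directly from a snippet of HTML):

from bs4 import BeautifulSoup as Soup

td = Soup('<td>Todd Gurley</td>', 'html.parser').td  # a simple tag

td.string       # a BeautifulSoup string, with extra baggage
str(td.string)  # 'Todd Gurley', a regular Python string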

Nested Tags

As opposed to simple tags, nested tags contain other tags. An example would be the tr or table tags above.

The most important method for nested tags is find_all(tag_name), where tag_name is the name of an HTML tag like 'tr', 'p' or 'td'. The method searches your nested tag for tag_name tags and returns them all in a list.

So, in the example above mytable.find_all('tr') would give back a list like [tr, tr, tr], where each tr is a BeautifulSoup tr tag.

Since each of these tr tags holds a bunch of other tags (four td or th tags each in our example), they themselves are nested tags. That means we can call find_all() on them to get back a list of the td tags in each row. These td tags are simple tags, so we can call .string on each of them to (finally) get the data out.

All this nesting (tags inside of tags inside of tags) can make it hard to wrap your mind around, but it's not too bad with code.
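For instance, here's a quick, self-contained sketch that pulls the data out of the little fantasy points table from above:

from bs4 import BeautifulSoup as Soup

html = """
<table>
  <tr><th>Name</th><th>Pos</th><th>Week</th><th>Pts</th></tr>
  <tr><td>Todd Gurley</td><td>RB</td><td>1</td><td>22.6</td></tr>
  <tr><td>Christian McCaffrey</td><td>RB</td><td>1</td><td>14.8</td></tr>
</table>
"""

soup = Soup(html, 'html.parser')
table = soup.find_all('table')[0]  # a nested tag: the whole table
rows = table.find_all('tr')        # a list of nested tr tags

# each data row holds simple td tags, so .string gets the data out
data = [[str(td.string) for td in row.find_all('td')] for row in rows[1:]]
# [['Todd Gurley', 'RB', '1', '22.6'],
#  ['Christian McCaffrey', 'RB', '1', '14.8']]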

Fantasy Football Calculator ADP - Web Scraping Example

In this section, we'll build a scraper to get all the ADP data off fantasyfootballcalculator.com and put it in a DataFrame. The full code for this example is available at github.com/nathanbraun/ffc-adp-scraping-example.

Note, for help with basic setup and installing and running Python code, see my article on Installing Python and Getting Set Up with Spyder.
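If you have Python going but are missing the libraries, all three are pip-installable (note the BeautifulSoup package is named beautifulsoup4):

pip install beautifulsoup4 requests pandas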

We can import what we need like this (see the file bs4_demo.py):

In [1]: from bs4 import BeautifulSoup as Soup
In [2]: import requests
In [3]: from pandas import DataFrame

Besides BeautifulSoup, we're importing the requests library to visit the page and store the raw HTML we get back. We're also going to want to put our final result in a DataFrame (again, see the chapter on Pandas and DataFrames in Learn to Code with Fantasy Football), so we've imported that too.

The first step is using requests to visit the page we want to scrape and store the raw HTML we get back.

In [4]: ffc_response = requests.get('https://fantasyfootballcalculator.com/adp/ppr/12-team/all/2017')

Let's look at (part of) the returned HTML (note I'm omitting the print(ffc_response.text) call and just showing part of what we get back).

<tr class='PK'>
    <td align='right'>178</td>
    <td align='right'>14.07</td>
    <td class="adp-player-name"><a style="color: rgb(30, 57, 72); font-weight: 300;" href="/players/mason-crosby">Mason Crosby</a></td>
    <td>PK</td>
    <td>GB</td>

    <td class="d-none d-sm-table-cell" align='right'>162.6</td>
    <td class="d-none d-sm-table-cell" align='right'>5.9</td>
    <td class="d-none d-sm-table-cell" align='right'>13.06</td>
    <td class="d-none d-sm-table-cell" align='right'>15.10</td>
    <td class="d-none d-sm-table-cell" align='right'>44</td>
    <td align='center'> </td>
</tr>

The text attribute on ffc_response gives us the raw HTML as a string. This is just a small snippet of what we get back — there are almost 250k lines of HTML here — but you get the picture.
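One aside: before parsing, it's good practice to check that the request actually succeeded. The requests library makes this easy:

ffc_response.status_code         # 200 means everything went OK
ffc_response.raise_for_status()  # raises an error on a 4xx or 5xx response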

Now let's parse it, i.e. turn it into BeautifulSoup data. Passing 'html.parser' as the second argument tells BeautifulSoup which parser to use (leaving it out works, but raises a warning).

In [5]: adp_soup = Soup(ffc_response.text, 'html.parser')

Remember, this top level adp_soup object is a giant nested tag, which means we can run find_all() on it.

We can never be 100% sure when dealing with sites we didn't create, but looking at fantasyfootballcalculator.com, it's probably safe to assume the data we want is in an HTML table.

In [6]: tables = adp_soup.find_all('table')

Remember find_all() always returns a list, even if there's just one table. Let's see how many tables we got back.

In [7]: len(tables)
Out[7]: 1

It's just the one here, which is easy, but it's not at all uncommon for sites to have multiple tables, in which case we'd have to pick out the one we wanted from the list. Technically we still have to pick ours out, but since there's just one it's the first (i.e. the 0th) item in the list.

In [8]: adp_table = tables[0]

Looking at it in the REPL, we can see it has the same <tr>, <th> and <td> structure we talked about above, though — being a real website — the tags have other attributes (class, align, style etc).

adp_table is still a nested tag, so let's run another find_all().

In [9]: rows = adp_table.find_all('tr')

This gives us a list of all the tr tags inside our table. Let's look at the first one.

In [10]: rows[0]
Out[10]:
<tr>
    <th>#</th>
    <th>Pick</th>
    <th>Name</th>
    <th>Pos</th>
    <th>Team</th>
    <th class="d-none d-sm-table-cell">Overall</th>
    <th class="d-none d-sm-table-cell">Std.<br/>Dev</th>
    <th class="d-none d-sm-table-cell">High</th>
    <th class="d-none d-sm-table-cell">Low</th>
    <th class="d-none d-sm-table-cell">Times<br/>Drafted</th>
    <th></th>
</tr>

It's the header row, good. That'll be useful later for knowing what columns are what.

Now how about some data.

In [11]: first_data_row = rows[1]

In [12]: first_data_row
Out[12]:
<tr class="RB">
    <td align="right">1</td>
    <td align="right">1.01</td>
    <td class="adp-player-name"><a href="/players/david-johnson" style="color: rgb(30, 57, 72); font-weight: 300;">David Johnson</a></td>
    <td>RB</td>
    <td>ARI</td>
    <td align="right" class="d-none d-sm-table-cell">1.3</td>
    <td align="right" class="d-none d-sm-table-cell">0.6</td>
    <td align="right" class="d-none d-sm-table-cell">1.01</td>
    <td align="right" class="d-none d-sm-table-cell">1.04</td>
    <td align="right" class="d-none d-sm-table-cell">310</td>
    <td align="center"></td>
</tr>

It's the first, lowest-ADP pick — David Johnson, in the summer of 2017 — nice. Note this is still a nested tag, so we'll be using find_all() again, but it holds a bunch of simple td tags, so the end is in sight.

In [13]: first_data_row.find_all('td')
Out[13]:
[<td align="right">1</td>,
 <td align="right">1.01</td>,
 <td class="adp-player-name"><a href="/players/david-johnson" style="color: rgb(30, 57, 72); font-weight: 300;">David Johnson</a></td>,
 <td>RB</td>,
 <td>ARI</td>,
 <td align="right" class="d-none d-sm-table-cell">1.3</td>,
 <td align="right" class="d-none d-sm-table-cell">0.6</td>,
 <td align="right" class="d-none d-sm-table-cell">1.01</td>,
 <td align="right" class="d-none d-sm-table-cell">1.04</td>,
 <td align="right" class="d-none d-sm-table-cell">310</td>,
 <td align="center"></td>]

This returns a list of simple td tags, finally. Now we can call str(x.string) on each tag x to get the data out.

In [14]: [str(x.string) for x in first_data_row.find_all('td')]
Out[14]:
['1',
 '1.01',
 'David Johnson',
 'RB',
 'ARI',
 '1.3',
 '0.6',
 '1.01',
 '1.04',
 '310',
 '\n']

Note the list comprehension. Scraping is an area where comprehensions really become valuable.

Now that that's done, we have to do it to every row. Whenever you have to do something a bunch of times, you should automatically think about putting it in a function so you don't repeat yourself. Let's make a function that'll parse some row (a BeautifulSoup tr tag).

def parse_row(row):
    """
    Take in a tr tag and get the data out of it in the form of a list of
    strings.
    """
    return [str(x.string) for x in row.find_all('td')]

Now we have to apply parse_row to every row and put the results in a DataFrame. To make a new DataFrame we do df = DataFrame(data), where data can be various data structures.

Looking at the pandas documentation, one type of data we can use is a "structured or record array". In other words, a list of lists (rows) — as long as they're all the same shape — should work.

In [15]: list_of_parsed_rows = [parse_row(row) for row in rows[1:]]

Remember the first row (rows[0]) is the header, so rows[1:] will give us our data (rows 1 to the end). Then we have to put it in a DataFrame:

In [16]: df = DataFrame(list_of_parsed_rows)

In [17]: df.head()
Out[17]:
   0     1                2   3    4    5  ...     8    9  10
0  1  1.01    David Johnson  RB  ARI  1.3  ...  1.04  310  \n
1  2  1.02      LeVeon Bell  RB  PIT  2.3  ...  1.06  303  \n
2  3  1.04    Antonio Brown  WR  PIT  3.7  ...  1.07  338  \n
3  4  1.06      Julio Jones  WR  ATL  5.7  ...  2.03  131  \n
4  5  1.06  Ezekiel Elliott  RB  DAL  6.2  ...  2.05  180  \n

Great, it just doesn't have column names. Let's fix that. We could parse them, but it's easiest to just name them what we want.

In [18]: df.columns = ['ovr', 'pick', 'name', 'pos', 'team', 'adp',
            'std_dev', 'high', 'low', 'drafted', 'graph']

We're almost there. Our one remaining issue is that all of the data is in string form at the moment, just like numbers stored as text in Excel. This is because we explicitly told Python to do this in our parse_row function, but it's an easy fix: we just convert the types.

In [19]: float_cols = ['adp', 'std_dev']

In [20]: int_cols = ['ovr', 'drafted']

In [21]: df[float_cols] = df[float_cols].astype(float)

In [22]: df[int_cols] = df[int_cols].astype(int)

It's debatable whether pick, high, and low should be numbers or strings. Technically they're numbers, but I kept them as strings because the decimal in the picks (e.g. 1.12) doesn't mean what it normally means and wouldn't behave the way we'd expect if we tried to do math on it.
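To see why, here's a quick illustration: in a 12-team draft, the pick after 1.12 is 2.01, which ordinary decimal math knows nothing about. If you ever do need picks as numbers, converting to an overall pick count behaves better. Note pick_to_overall below is just a sketch, not part of our scraper:

def pick_to_overall(pick, nteams=12):
    """Convert a 'round.pick' string like '1.12' to an overall pick number."""
    rnd, slot = pick.split('.')
    return (int(rnd) - 1)*nteams + int(slot)

pick_to_overall('1.12')  # 12
pick_to_overall('2.01')  # 13, i.e. the very next pick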

Also, let's get rid of the graph column, which we can see on the website is just a UI checkbox on fantasyfootballcalculator.com, not real data.

In [23]: df.drop('graph', axis=1, inplace=True)

And we're done:

In [24]: df.head()
Out[24]:
   ovr  pick             name pos team  adp  std_dev  high   low  drafted
0    1  1.01    David Johnson  RB  ARI  1.3      0.6  1.01  1.04      310
1    2  1.02      LeVeon Bell  RB  PIT  2.3      0.8  1.01  1.06      303
2    3  1.04    Antonio Brown  WR  PIT  3.7      1.0  1.01  1.07      338
3    4  1.06      Julio Jones  WR  ATL  5.7      3.2  1.01  2.03      131
4    5  1.06  Ezekiel Elliott  RB  DAL  6.2      2.8  1.01  2.05      180

There we go, we've built our first webscraper!

Exercises

Note: these are some of the end-of-chapter exercises from this part of my book. I've included them here so you can test yourself on what you've read.

The book includes fully coded up Python solutions if you want to check your answers.

Problem 1

Put everything we did in the Fantasy Football Calculator ADP example inside a function scrape_ffc that takes in scoring (a string that's one of: 'ppr', 'half', 'std'), nteams (the number of teams), and year, and returns a DataFrame with the scraped data.

Your function should work for all three scoring systems; 8, 10, 12, or 14 team leagues; and any year from 2010 to 2019.

Caveats:

  • Fantasy Football Calculator only has half-ppr data going back to 2018, so you only need to be able to get std and ppr before that.
  • All of Fantasy Football Calculator's archived data is for 12 teams. URLs with other numbers of teams will work (they return data), but if you look closely the data is always the same.

Problem 2

On the Fantasy Football Calculator page we're scraping, you'll notice that each player's name is a clickable link. Modify scrape_ffc to store that as well, adding a column named 'link' for it in your DataFrame.

Hint: bs4 tags include the get method, which takes the name of an html attribute and returns the value.
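For instance, with the first_data_row tag from earlier, something like this should pull the link out of the name cell:

name_td = first_data_row.find_all('td')[2]  # the player name cell
a_tag = name_td.find_all('a')[0]            # the <a> tag inside it
a_tag.get('href')                           # '/players/david-johnson'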

Problem 3

Write a function ffc_player_info that takes one of the URLs we stored in (2), scrapes it, and returns the player's team, height, weight, birthday, and draft info (team and pick) as a dict.

Learn to Code with Fantasy Football

Again, if you like this article and want more like it, check out my book, Learn to Code with Fantasy Football. In it, you'll learn what to do with this data now that you've scraped it.

In it, I also cover basic Python, Pandas, SQL, data visualization, machine learning and more. I also cover alternatives to scraping, like connecting to a public API, and some places you can find ready-made, already-scraped datasets.


  1. BeautifulSoup doesn't really distinguish between these, but I think it makes things easier to understand.