What You Should Use to Scrape and Clean Data

I am currently teaching a short module for a class at MIT called CMS.622: Applying Media Technologies in Arts and Humanities.  My module focuses on Data Scraping and Civic Visualization.  Here are a few of the tools I introduce related to scraping and cleaning.

Tools for Scraping Data

As much as possible, avoid writing code!  Many of these tools can help you avoid writing software to do the scraping.  New tools are being built all the time, but these are the ones I recommend:

  • Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
  • Chrome Scraper Extension: This bare-bones plugin for the Chrome web browser gives you a right-click option to “scrape similar” and export the results in a number of spreadsheet formats.
  • Import.io: This is a radical re-thinking of how you scrape.  Point and click to train their scraper.  It’s buggy, but on many simple webpages it works well!
  • jQuery in the browser: Install the bookmarklet, and you can add the jQuery JavaScript library to any webpage you are viewing.  From there, a basic understanding of JavaScript and the JavaScript console (in most browsers) lets you pull parts of a webpage into an array.
  • Software Development: If you are a coder, and the website you need to scrape has JavaScript, logins, and the like, then you might need to go this route (ugh).  If so, here are some example Jupyter notebooks that show how to use Requests and Beautiful Soup to scrape and parse a webpage (there’s also a quick sketch right after this list).  If your source material is more complicated, try Mechanize (or Watir if you want to do this in Ruby).
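
If you do end up coding, the core Requests + Beautiful Soup pattern is short.  Here is a minimal sketch, not pulled from the notebooks above: the URL, the CSS selector, and the output filename are all placeholders you would swap for your own.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL; point this at the page you actually want to scrape.
URL = "https://example.com/some-table-page"

response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail loudly if the page didn't load

soup = BeautifulSoup(response.text, "html.parser")

# Pull the text out of every row of every table on the page.
rows = []
for tr in soup.select("table tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)

# Save what we found as a CSV so it is ready for cleaning.
with open("scraped.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Once a page needs logins or JavaScript-rendered content, this simple fetch-and-parse approach stops working, which is when the heavier tools mentioned above earn their keep.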

Tools for Cleaning Data

If you start with garbage, you end with garbage.  This is why clean data is such a big deal. I’ve written before about what clean data means to me, but here are some tools I introduce to help you clean your data:

  • Typing: Seriously.  If you don’t have much data to clean, just do it by hand.
  • Find/Replace: Again, I’m serious.  Don’t underestimate the power of 30 minutes of find/replace… it’s a lot easier than programming or using some tool.
  • Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”).  They let you define a pattern and find/replace it in any large document.  Sure, the pattern syntax is cryptic, but learning it is totally worth it (here’s an online playground, and there’s a small sketch right after this list).
  • Data Science Toolkit: This Swiss-army knife is a virtual machine you can install and use via APIs to do tons of data science things.  Go from address to lat/lng, quantify the sentiment of some text, pull the content from a webpage, extract people mentioned in text, and more.
  • CLIFF-CLAVIN: Our geo-parsing tool can identify places, people, and organizations mentioned in plain text.  You give it text and it spits out JSON, taking special care to resolve place names to lat/lngs that make sense.
  • Tabula: Need to extract a table from a PDF? Use Tabula to do it.  Try pdftables if you want to do the same in Python.  A web-based option is PDFTables (made by the ScraperWiki people).
  • OpenRefine: It has a bit of a learning curve, but OpenRefine can handle large datasets and do great things like clustering similar values to eliminate typos.
  • Programming: If you must, programming can help you clean data.  csvkit is a handy set of libraries and command-line tools for managing and changing CSV files.  messytables can help you parse CSV files that aren’t quite formatted correctly (see the sketch after this list).
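
To make the regular expressions bullet concrete: the pattern language is the same whether you use it in Sublime Text or in code.  Here is a tiny sketch using Python’s re module, with made-up messy dates standing in for real data.

```python
import re

# Made-up messy values: the same dates typed as M/D/YYYY or MM-DD-YYYY.
messy = ["3/7/2014", "03-07-2014", "11/22/2013"]

# One pattern captures month, day, and year no matter which separator was used.
pattern = re.compile(r"(\d{1,2})[/-](\d{1,2})[/-](\d{4})")

# Rewrite everything as YYYY-MM-DD, zero-padding as we go.
clean = [
    pattern.sub(lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}", value)
    for value in messy
]

print(clean)  # ['2014-03-07', '2014-03-07', '2013-11-22']
```

The same pattern pasted into Sublime Text’s find/replace (with “regular expression” turned on) does the same normalization, just with $1/$2/$3 instead of m.group().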
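
And to show what the programming route looks like in practice, here is a sketch of the messytables workflow, closely following the pattern in its documentation; the filename is a placeholder, so treat this as illustrative rather than copy-paste-ready.

```python
from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess, types_processor)

# Placeholder filename; point this at your own not-quite-right CSV.
with open("messy.csv", "rb") as fh:
    table_set = CSVTableSet(fh)
    row_set = table_set.tables[0]

    # Guess which row holds the headers and what the column names are.
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))

    # Guess column types (dates, integers, and so on) and apply them while iterating.
    types = type_guess(row_set.sample, strict=True)
    row_set.register_processor(types_processor(types))

    for row in row_set:
        print([cell.value for cell in row])
```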

I hope those are helpful in your data scraping and data cleaning adventures!

“Tidying” Your Data

Recently I’ve been giving more workshops about cleaning data.  This step in the data cycle often takes 80% of the time, but is seldom focused on in a systematic way.  I want to address one topic that keeps coming up – what is clean data?

When I ask, I usually get answers all over the map.  I tend to approach it from four topics:

  • consistency: are observations always entered the same way?
  • completeness: do you have full coverage of the topic?
  • usability: is your data human readable, or machine readable, in the ways you need it to be?
  • atomicity: do the rows hold the correct basic units for your analysis?

The last topic, atomicity, is one I need a better name for.  In any case, I want to tease it apart a bit more because it is critical.  Wickham’s Tidy Data paper has a great way of talking about this:

each variable is a column, each observation is a row, and each type of observational unit is a table

Yes, someone wrote a whole 24-page paper on how to make sure your columns are right.  And yes, I read it and enjoyed it.  You should go read it too (at least the first few pages).  The key point is that far too many tabular datasets have column headers that are, in fact, part of the data.  For instance, if you are keeping track of how many times something happens each year, each year shouldn’t be a column header; “year” should be a column, and you should have one row for each year.  For you Excel junkies, this means your raw data shouldn’t be in cross-tab format.

This process of cleaning your data to make it tidy can be annoying, but luckily there are tools that can help.  Tableau has a handy plugin for Excel that “reshapes” your data to prep it for analysis.  If you are an R wizard, here is a presentation on how to do tidying operations in R.  If you use Google Sheets, there is a Stack Overflow post with details on a plugin someone wrote to normalize data in Google Sheets.  If you’d rather stay in Python, see the pandas sketch below.
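
For the Python crowd, pandas covers the same reshaping ground.  Here is a minimal sketch with a made-up counts-per-city-per-year cross-tab (the column names and numbers are hypothetical).

```python
import pandas as pd

# Made-up cross-tab data: one column per year, which is exactly the problem.
wide = pd.DataFrame({
    "city": ["Boston", "Cambridge"],
    "2012": [10, 4],
    "2013": [12, 6],
    "2014": [15, 9],
})

# Melt the year columns into a single "year" column: one row per observation.
tidy = pd.melt(wide, id_vars="city", var_name="year", value_name="count")

print(tidy)
#         city  year  count
# 0     Boston  2012     10
# 1  Cambridge  2012      4
# 2     Boston  2013     12
# ...
```

After the melt, “year” is a real column you can group, filter, and plot on, instead of a set of headers you have to fight with.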

I hope that helps you in your next data-cleaning task.  Hooray for tidy data!