What You Should Use to Scrape and Clean Data

I am currently teaching a short module for a class at MIT called CMS.622: Applying Media Technologies in Arts and Humanities.  My module focuses on Data Scraping and Civic Visualization.  Here are a few of the tools I introduce related to scraping and cleaning.

Tools for Scraping Data

As much as possible, avoid writing code!  Many of these tools can help you avoid writing software to do the scraping.  There are constantly new tools being built, but I recommend these:

  • Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
  • Chrome Scraper Extension: This bare-bones plugin for the Chrome web browser gives you a right-click option to “scrape similar” and export in a number spreadsheet formats.
  • Import.io: This is a radical re-thinking of how you scrape.  Point and click to train their scraper.  It’s buggy, but on many simple webpages it works well!
  • Jquery in the browser: Install the bookmarklet, and you can add the JQuery javascript library to any webpage you are viewing.  From there you can use a basic understanding of javascript and the Javascript console (in most browsers) to pull parts of a webpage into an array.
  • Software Development: If you are a coder, and the website you need to scrape has javascript and logins and such, then you might need to go this route (ugh).  If so, here are some example Jupyter notebooks that show how to use Requests and Beautiful Soup to scrape and parse a webpage.  If your source material is more complicated, try using Mechanize (or Watir if you want to do this in Ruby).

Tools for Cleaning Data

If you start with garbage, you end with garbage.  This is why clean data is such a big deal. I’ve written before about what clean data means to me, but here are some tools I introduce to help you clean your data:

  • Typing: Seriously.  If you don’t have much data to clean, just do it by hand.
  • Find/Replace: Again, I’m serious.  Don’t underestimate the power of 30 minutes of find/replace… it’s a lot easier than programing or using some tool.
  • Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”).  It lets you define a pattern and find/replace it in any large document.  Sure the pattern definition is cryptic, but learning it is totally worth it (here’s an online playground).
  • Data Science Toolkit: This Swiss-army knife is a virtual machine you can install and use via APIs to do tons of data science things.  Go from address to lat/lng, quantify the sentiment of some text, pull the content from a webpage, extract people mentioned in text, and more.
  • CLIFF-CLAVIN: Our geo-parsing tool can identify places, people, and organizations mentioned in plain text.  You give it text and it spits out JSON, taking special effort to resolve the places to lat/lngs that makes sense.
  • Tabula: Need to extra a table from a PDF? Use Tabula to do it.  Try pdftables if you want to do the same in Python.  A web-based option is PDFTables (made by the ScraperWiki people).
  • OpenRefine: It has a little bit of a learning curve, but OpenRefine can handle large sets of data and do great things like cluster and eliminate typos.
  • Programming: If you must programming can help you clean data.  CSVKit is a handy set of libraries and command line tools for managing and changing CSV files.  Messytables can help you parse CSV files that aren’t quite formatted correctly.

I hope those are helpful in your data scraping and data cleaning adventures!