Over the last few weeks I co-taught a short course on data scraping and data presentation. It was a pleasure to get a chance to teach with Ethan Zuckerman (my boss) and interact with the creative group of students! You can peruse the syllabus outline if you like.
In my Data Therapy work I don’t usually introduce tools, because there are loads of YouTube tutorials and written tutorials. However, while co-teaching a short course for incoming students in the Comparative Media Studies program here at MIT, I led two short “lab” sessions on tools for data scraping, interrogation, and visualization.
There are a myriad of tools that support these efforts, so I was forced to pick just a handful to introduce to these students. I wanted to share the short lists of tools I chose.
Data Scraping:
As much as possible, avoid writing code! Many of these tools can help you avoid writing software to do the scraping. There are constantly new tools being built, but I recommend these:
- Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
- Import.io: Still nascent, but this is a radical re-thinking of how you scrape. Point and click to train their scraper. It’s very early, and buggy, but on many simple webpages it works well!
- Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”). It lets you define a pattern and find it in any large document. Sure, the pattern definition is cryptic, but learning it is totally worth it (here’s an online playground).
- jQuery in the browser: Install the bookmarklet, and you can add the jQuery JavaScript library to any webpage you are viewing. From there you can use a basic understanding of JavaScript and the JavaScript console (in most browsers) to pull parts of a webpage into an array.
- ScraperWiki: There are a few things this makes really easy – getting recent tweets, getting Twitter followers, and a few others. Otherwise this is a good engine for software coding.
- Software Development: If you are a coder, and the website you need to scrape has javascript and logins and such, then you might need to go this route (ugh). If so, here’s a functioning example of a scraper built in Python (with Beautiful Soup and Mechanize). I would use Watir if you want to do this in Ruby.
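To make the “Super Find and Replace” idea from the regular expressions item a bit more concrete, here is a minimal sketch in Python rather than in a text editor. The sample text, the two patterns, and the function names are all made up for illustration; the email pattern in particular is deliberately simple and will not catch every valid address.

```python
import re

# Hypothetical sample text; in practice this would be something you copied
# and pasted from a webpage or document.
SAMPLE = """
Contact Ana at ana@example.org or call 617-555-0101.
Press inquiries: press@example.com, 617-555-0199.
"""

# A deliberately simple pattern for things that look like email addresses.
# This is an illustration, not a complete email validator.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Phone numbers shaped like 617-555-0101; the parentheses capture each part
# so the replacement can rearrange them.
PHONE_PATTERN = re.compile(r"(\d{3})-(\d{3})-(\d{4})")


def find_emails(text):
    """The "find" half: return every substring matching the email pattern."""
    return EMAIL_PATTERN.findall(text)


def reformat_phones(text):
    """The "replace" half: rewrite 617-555-0101 as (617) 555-0101."""
    return PHONE_PATTERN.sub(r"(\1) \2-\3", text)


if __name__ == "__main__":
    print(find_emails(SAMPLE))
    print(reformat_phones(SAMPLE))
```

The same patterns should work in the find-and-replace box of most editors that support regular expressions, though the syntax for referring to captured groups in the replacement (here `\1`, `\2`, `\3`) varies from tool to tool.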
Data Interrogation and Visualization:
There are even more tools that help you here. I picked a handful of single-purpose tools, and some generic ones to share.
- Tabula: There are few PDF-cleaning tools, but this one has worked particularly well for me. If your data is in a PDF, and selectable, then I recommend this! (disclosure: the Knight Foundation funds much of my paycheck, and contributed to Tabula’s development as well)
- OpenRefine: This data cleaning tool lets you do things like cluster rows in your data that are spelled similarly, look for correlations at a high level, and more! The School of Data has written well about this – read their OpenRefine handbook.
- Wordle: As maligned as word clouds have been, I still believe in their role as a proxy for deep text analysis. They give a nice visual representation of how frequently words appear in quotes, writing, etc.
- Quartz ChartBuilder: If you need to make clean and simple charts, this is the tool for you. Much nicer than the output of Excel.
- TimelineJS: Need an online timeline? This is an awesome tool. Disclosure: another Knight-funded project.
- Google Fusion Tables: This tool has empowered loads of folks to create maps online. I’m not a big user, but lots of folks recommend it to me.
- TileMill: Google Maps isn’t the only way to make a map. TileMill lets you create beautiful interactive maps that fit your needs. Disclosure: another Knight-funded project.
- Tableau Public: Tableau is a much nicer way to explore your data than Excel pivot tables. You can drag and drop columns onto a grid and it suggests visualizations that might be revealing in your attempts to find stories.
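If you are curious what a word cloud is actually counting, here is a rough sketch of the idea in Python. It is not how Wordle itself is implemented; the stop-word list, the function name, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def word_counts(text, top_n=5):
    """Return the top_n most frequent words, ignoring case and stop words."""
    # Lowercase everything, then pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    return Counter(kept).most_common(top_n)


if __name__ == "__main__":
    quote = "The data is in the PDF, and the PDF is the data."
    for word, count in word_counts(quote):
        print(word, count)
```

A word cloud then simply draws each word at a size proportional to its count.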
I hope those are helpful in your data scraping and story-finding adventures!
Curious for More Tools?
Keep your eye on the School of Data and Tactical Technology Collective.
For data interrogation, you may want to add json-csv.com. It can turn a JSON feed or JSON text into a downloadable CSV spreadsheet of data.
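For those who do end up writing code, the same kind of JSON-to-CSV conversion can be sketched with Python's standard library. This is not how json-csv.com works internally; it is a minimal illustration that assumes the JSON is a list of flat objects (no nesting), and the function name and sample data are hypothetical.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so no field is dropped
    # when some records have keys that others lack.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty cells.
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '[{"name": "Ana", "followers": 10}, {"name": "Ben", "city": "Boston"}]'
    print(json_to_csv(sample))
```

Nested JSON would need to be flattened first, which is the part a dedicated tool usually handles for you.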