Web scraping with Ruby

Writing a web scraper seems to be almost a rite of passage as a programmer, and recently I found it was my turn to write one. Judging from the libraries I looked at, the term “web scraping” seems to cover a spectrum of functionality, from web crawling on one end to processing the results on the other, and often some combination of the two.

I looked at a few libraries and all of them seemed to either do too much or not enough relative to what I had in mind, so, as millions of programmers before me have done, I wrote my own.

The idea was very much oriented around extracting a set of structured information from a given page, and to that end I wrote a little DSL to get it done.
The library is called (very creatively) Skrape.

Assuming you have a page like this at the address example.com:

<html><body><h1>I am a title</h1></body></html>

You can scrape the title off the page with this:

results = Skrape::Page.new("http://example.com").extract do
  extract_title with: 'h1'
end

The results would be:

{title: "I am a title"}

The calls to “extract_*” are caught with method_missing, and whatever follows the “extract_” is used as the key in the hash of results that is returned.
To deal with the inevitable edge cases that come up so often in scraping, you can also pass a block (via and_run), which will be handed whatever the CSS selector found so you can do some further processing.

I’ve needed to use it for picking out the href attribute of a link:

results = Skrape::Page.new(url).extract do
  extract_link_href with: 'a', and_run: proc {|link| link.attr('href').value }
end

And also removing problematic <br> tags from elements:

results = Skrape::Page.new(url).extract do
  extract_locations with: '.address', and_run: proc {|address| address.map{|a| a.inner_html.gsub('<br>', ', ')} }
end
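
For anyone curious how the “extract_*” trick can work, here is a rough sketch of the idea. It is not Skrape's actual source: the TinyExtractor class, the finder lambda, and the fake_page hash are all made up for illustration, and a plain hash stands in for a parsed page so the example runs without any gems.

```ruby
# Hypothetical sketch of the method_missing idea; not Skrape's real code.
class TinyExtractor
  PREFIX = "extract_"

  attr_reader :results

  # `finder` is any callable that maps a selector string to whatever it matched.
  def initialize(finder)
    @finder = finder
    @results = {}
  end

  # Catch calls such as `extract_title with: 'h1'`.
  def method_missing(name, *args, &block)
    return super unless name.to_s.start_with?(PREFIX)

    options  = args.first || {}
    key      = name.to_s.delete_prefix(PREFIX).to_sym # "extract_title" becomes :title
    found    = @finder.call(options[:with])
    callback = options[:and_run]

    # Hand the match to the optional callback for further processing.
    @results[key] = callback ? callback.call(found) : found
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?(PREFIX) || super
  end
end

# Stand-in for a parsed page: selector => matched text (an assumption for the demo).
fake_page = { "h1" => "I am a title", "a" => "/about" }

extractor = TinyExtractor.new(->(selector) { fake_page[selector] })
extractor.instance_eval do
  extract_title with: 'h1'
  extract_link with: 'a', and_run: proc { |href| href.upcase }
end

sketch_results = extractor.results
```

After this runs, sketch_results maps :title to "I am a title" and :link to "/ABOUT". Any call that does not start with “extract_” still raises NoMethodError as usual, which is the main reason for the early `return super`.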

While there are still some improvements I would like to make, so far I am pretty happy with the library. It feels readable and does not do too much. If you have some scraping to do, check it out. Pull requests welcome.
