Getting started with Nokogiri on Ubuntu

I am working on a project at the moment that requires me to pick specific elements out of an HTML page. This was the first time that had come up for me, and initially I thought I might be able to do it with REXML. After trying it in IRB, I quickly realised that it was not the tool for parsing potentially dirty HTML. It turns out that Nokogiri is designed for exactly this sort of thing, and getting started with it was easy as well.

First thing is installing the gem:

mike@sleepycat:~/Desktop$ sudo gem install nokogiri

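And then the dependencies (if the gem install fails while building its native extension, install these first and then run the gem install again):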

mike@sleepycat:~$ sudo aptitude install libxml2-dev libxslt-dev

And then the fun of forgetting that I need to require rubygems BEFORE trying to run this in IRB (that part is optional for everyone except me, and Ruby 1.9 and later load RubyGems automatically, so it only bites on 1.8):

mike@sleepycat:~/Desktop$ irb
irb(main):001:0> require 'nokogiri'
LoadError: no such file to load -- nokogiri
from (irb):1:in `require'
from (irb):1
irb(main):002:0> require 'rubygems'
=> true
irb(main):003:0> require 'nokogiri'
=> true
irb(main):004:0> require 'open-uri'
=> true
irb(main):005:0> test = Nokogiri::HTML(open('test.html'))

And out comes parsed XML goodness. After a little further twiddling I had my XPaths working and Nokogiri doing what I needed. I'm pretty impressed by how much I could do with it in just a few minutes of fooling around for the first time. I have a feeling I am going to end up using this a lot.

