I am working on a project at the moment that requires me to pick specified elements out of an HTML page. This was the first time that had come up for me, and initially I thought I might be able to do it with REXML, but after trying it in IRB I quickly realised that for parsing potentially dirty HTML it was not the tool for the job. It turns out that Nokogiri is designed for exactly this sort of thing. Getting started with it turned out to be easy as well.
First thing is installing the gem:
mike@sleepycat:~/Desktop$ sudo gem install nokogiri
And then the dependencies, the libxml2 and libxslt development headers that Nokogiri builds against:
mike@sleepycat:~$ sudo aptitude install libxml2-dev libxslt-dev
And then the fun of forgetting that I need to require rubygems BEFORE trying to run this in IRB (that part is optional for everyone except me):
irb(main):001:0> require 'nokogiri'
LoadError: no such file to load -- nokogiri
from (irb):1:in `require'
irb(main):002:0> require 'rubygems'
irb(main):003:0> require 'nokogiri'
irb(main):004:0> require 'open-uri'
irb(main):005:0> test = Nokogiri::HTML(open('test.html'))
And out comes parsed XML goodness. After a little further twiddling I had gotten my XPaths to work and had it doing what I needed. I’m pretty impressed by how much I could do with Nokogiri in just a few minutes of fooling around for the first time. I have a feeling I am going to end up using this a lot.