Pick Your RoR HTML Parsing Poison

Scott Rippee @ 4:07 pm July 5th, 2007

Ruby HTML parsing has been keeping me quite frustrated lately, so I thought I'd share some thoughts. There are a couple of instances in your Rails app when you'll want to parse HTML:

  1. Automated functional/controller testing
  2. Screen scraping

Functional Testing

The standard way to verify aspects of the resulting HTML in your functional tests is HTML::Selector, which is what assert_select uses under the hood. It's simple, powerful, and baked in. Agile Web Development with Rails (2nd edition) does a great job of explaining how it's used in functional tests.

  def test_add_no_name
    post :add, :color => { :name => '', :hex => '#123456' }
    assert_template 'add'
    assert_select "div[id=errorExplanation]" do
      assert_select "ul" do
        assert_select "li", 'Name is not present'
      end
    end
  end



Screen Scraping

Several options are available, but the most popular by far is why's Hpricot. It's fast and enjoyable (although I experienced no joy while learning how to use it). It's also used by some of the other scraping/navigating libraries (WWW::Mechanize and scRUBYt!).


Some Thoughts...

So if you're just concerned with testing, use HTML::Selector and the built-in asserts. If you only need very basic screen scraping and speed is not an issue, I would also suggest going with HTML::Selector, with open-uri or curb for fetching the pages.

For more serious screen scraping, bust out Hpricot, and if you need to navigate pages via automation, use WWW::Mechanize. Mechanize also uses Hpricot, so all of that Hpricot knowledge you've absorbed is directly applicable: Mechanize is Hpricot with the ability to click. Don't worry about scRUBYt!. It's more of a pain to figure out than it's worth (but maybe I'm wrong about it; any good examples or write-ups?).

Hpricot with CSS selector

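  # Find styled divs whose text contains a dollar sign
  divs = doc/"div[@style*='font-weight:'][text()*='$']"
  divs.each do |div|
    # Keep only text that looks like a price, e.g. $9.99 or $19.99
    if div.inner_text =~ /\$([0-9]?[0-9]\.[0-9][0-9])/
      self.price = $1  # the captured price without the leading '$'
    end
  end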
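The price check in that loop can be tried on plain strings, without Hpricot or a live page. This is only a minimal sketch: the `extract_price` helper and the `PRICE_PATTERN` constant are hypothetical names made up for illustration, and the pattern is just the one from the loop above written with `\d`.

```ruby
# Hypothetical helper (not part of Hpricot): pulls a price such as
# "19.99" out of a chunk of text, using the same pattern as the loop above.
PRICE_PATTERN = /\$(\d{1,2}\.\d{2})/

def extract_price(text)
  match = PRICE_PATTERN.match(text)
  # Return the captured digits without the leading '$', or nil if no price
  match && match[1]
end

puts extract_price("Now only $19.99!").inspect
puts extract_price("Call for pricing").inspect
```

Note that the pattern only allows one or two digits before the decimal point, so a price of $100 or more would need a wider pattern.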

Hpricot search with XPath

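  require 'rubygems'
  require 'hpricot'
  require 'open-uri'
  doc = Hpricot(URI.parse("http://google.com/").read)

  doc.search("/html/body//p")              # every p anywhere under body
  doc.search("//p")                        # every p in the document
  doc.search("//p/a")                      # links that are direct children of a p
  doc.search("//a[@href]")                 # links that have an href attribute
  doc.search("//a[@href='google.com']")    # links whose href is exactly 'google.com'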
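To get a feel for what those XPath expressions select without hitting a live site, here is a rough sketch that runs the same kinds of queries against a small inline page. It assumes REXML from Ruby's standard library as a stand-in for Hpricot (REXML only accepts well-formed markup, so it is no substitute for real-world scraping), and the sample page is made up. The attribute tests use href, since that is the attribute links carry.

```ruby
require 'rexml/document'

# Made-up, well-formed sample page (REXML cannot parse sloppy HTML)
SAMPLE_XHTML = <<~XHTML
  <html>
    <body>
      <p>First <a href="http://google.com/">Google</a></p>
      <div><p>Second <a name="top">anchor</a></p></div>
    </body>
  </html>
XHTML

doc = REXML::Document.new(SAMPLE_XHTML)

# Each query returns an array of matching elements
all_paragraphs      = REXML::XPath.match(doc, "//p")
body_paragraphs     = REXML::XPath.match(doc, "/html/body//p")
links_in_paragraphs = REXML::XPath.match(doc, "//p/a")
links_with_href     = REXML::XPath.match(doc, "//a[@href]")
google_links        = REXML::XPath.match(doc, "//a[@href='http://google.com/']")

puts "paragraphs: #{all_paragraphs.size}"
puts "links with href: #{links_with_href.size}"
puts "google link text: #{google_links.first.text}"
```

The second link has only a name attribute, so it is matched by "//p/a" but not by "//a[@href]".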

Using Mechanize to do a search on Google

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  agent.user_agent_alias = 'Mac Safari'
  page = agent.get("http://www.google.com/")
  search_form = page.forms.with.name("f").first
  search_form.q = "Hello"
  search_results = agent.submit(search_form)
  puts search_results.body

Note that Hpricot lets you select elements with either CSS selectors or XPath. Use XPath if you already have experience with it; otherwise the CSS method is more intuitive.

If you go with XPath, grab the XPather Firefox plugin and use it with the DOM Inspector. It also works with the Firebug Firefox plugin. I'm still in awe that it worked when I tried. :) To do this, use Firebug to "inspect", choose an element, right-click on the page, and select "Show in XPather". XPather will open with the selected element locked and loaded.

Finally, if you're an Hpricot wiz, forget about HTML::Selector and put Hpricot to work for view validation in your functional tests. See this great write-up, Testing your Rails views with Hpricot, which demonstrates this elegant solution.

  assert_equal "My Funky Website", tag('title')
  assert_equal 20, tags('div.boxout').size
  assert_equal 'visible', element('div#site_container').attributes['class']

