Scraping with Typhoeus and Nokogiri

2009-06-12

I’ve been working on some cool new functionality at OneSpot. We want to provide a widget that gives the reader more context about a given article. Zemanta takes the article text and hands us back a set of semantic entities, including links to their Wikipedia pages. We also wanted a nice blurb about each entity, and figured the opening paragraph of its Wikipedia page would be a reasonable choice.

To do this, we use Typhoeus to fetch the Wikipedia pages in parallel and Nokogiri to pull the relevant content using a custom XPath expression for Wikipedia’s page layout.

Here’s the class:

require 'typhoeus'
require 'nokogiri'

class Wikipedia
  include Typhoeus
  #self.cache = Rails.cache.instance_variable_get(:@data)

  remote_defaults :cache_responses => 7*24*60*60,
      :user_agent => 'typhoeus crawler',
      :timeout => 5

  define_remote_method :extract,
      :on_success => lambda {|response| Wikipedia.extract_first_paragraph(response.body) }

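  # Returns the inner HTML of the first paragraph in Wikipedia's
  # bodyContent div, with relative /wiki links made absolute.
  def self.extract_first_paragraph(content)
    nh = Nokogiri::HTML(content)
    str = nh.xpath("//div[@id='bodyContent']/p[1]").inner_html
    # %r{} avoids having to escape the slash inside the pattern.
    str.gsub(%r{href="/wiki}, 'href="http://en.wikipedia.org/wiki')
  end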
end
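
The link rewriting at the end of extract_first_paragraph is easy to try on its own. Here is a minimal sketch using only the standard library; the helper name absolutize_wiki_links, the WIKIPEDIA_BASE constant, and the sample snippet are made up for illustration and are not part of the class above.

```ruby
# Hypothetical constant and helper, named for this example only.
WIKIPEDIA_BASE = 'http://en.wikipedia.org'

def absolutize_wiki_links(html)
  # Prefix every href that starts with /wiki/ with the Wikipedia host.
  # Links that are already absolute do not match and are left alone.
  html.gsub(%r{href="/wiki/}, "href=\"#{WIKIPEDIA_BASE}/wiki/")
end

# Made-up sample input: one relative wiki link, one absolute link.
snippet = 'A <a href="/wiki/Bus_error">bus error</a> and an ' \
          '<a href="http://example.org/">external link</a>.'
puts absolutize_wiki_links(snippet)
```

Only the relative link changes, so the blurb can be dropped into a page outside Wikipedia without its links breaking.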

And here’s how you use the class:

entities = %w(
  http://en.wikipedia.org/wiki/Garth_Marenghi's_Darkplace
  http://en.wikipedia.org/wiki/Bus_error
  http://en.wikipedia.org/wiki/Washington
)
content = entities.map do |url|
  Wikipedia.extract(:base_uri => url)
end
p content
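
Typhoeus runs these requests concurrently through libcurl. To show the same fan-out-and-collect shape without it, here is a rough sketch using plain Ruby threads and Net::HTTP instead. This is a stand-in, not how Typhoeus works internally; fetch_all and the stubbed fetcher block are names invented for this example.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch every URL concurrently, one thread per URL,
# and return the results in the same order as the input.
# An optional block replaces the default Net::HTTP fetch.
def fetch_all(urls, &fetcher)
  fetcher ||= lambda { |url| Net::HTTP.get(URI(url)) }
  threads = urls.map do |url|
    Thread.new { fetcher.call(url) }
  end
  # Thread#value waits for the thread and returns its block's result.
  threads.map(&:value)
end

urls = %w(
  http://en.wikipedia.org/wiki/Bus_error
  http://en.wikipedia.org/wiki/Washington
)

# Stubbed fetcher so the sketch runs without network access.
pages = fetch_all(urls) { |url| "<html>stub for #{url}</html>" }
puts pages
```

Dropping the block makes fetch_all hit the network for real. One thread per URL is fine for a handful of pages, but a longer list would want a bounded pool.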