Web crawling in Ruby with Capybara


In a Rails project, we use Capybara for feature (end-to-end) testing. However, Capybara also works well for crawling pages outside of Rails: any data that follows a specific pattern on a page can be extracted with ease.

We can build a simple web crawler in a single file using the Capybara DSL.

How To

Create a folder with a Gemfile in it, since we need multiple gems.

$ mkdir crawler
$ cd crawler

Create a Gemfile:

source "https://rubygems.org"

gem 'capybara'
gem 'selenium-webdriver'

Run bundle install after creating the Gemfile:
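
$ bundle install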

Set up your crawler:

require 'capybara'

Capybara.run_server = false
Capybara.current_driver = :selenium
Capybara.app_host = "https://google.com.tw"

You can pick other drivers from the list in the Capybara repo; just install the relevant gems if a driver requires them.
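
For example, switching to headless Chrome might look like the sketch below. This assumes reasonably recent versions of Capybara and selenium-webdriver, with Chrome and chromedriver installed locally; the :headless_chrome driver name is arbitrary:

require 'capybara'
require 'selenium-webdriver'

# Register a custom driver that runs Chrome without a visible window.
Capybara.register_driver :headless_chrome do |app|
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
end

Capybara.current_driver = :headless_chrome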

After setup, create a class and include the DSL from Capybara:

module MyCapybara
  class Crawler
    include Capybara::DSL
  end
end

crawler = MyCapybara::Crawler.new
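
With Capybara::DSL included, session methods such as visit, fill_in, find, and page are available directly on the instance. As a quick sanity check (assuming the app_host configured above is reachable), you could do:

crawler.visit("/")
puts crawler.page.title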

Once the instance is created, add a method, fill in your own selectors and patterns, and process the data. The following is a complete example:

require 'capybara'

Capybara.run_server = false
Capybara.current_driver = :selenium
Capybara.app_host = "https://google.com"

module MyCapybara
  class Crawler
    include Capybara::DSL
    # "search" and "#result" are placeholder locators; adapt them to your target page.
    # Note that fill_in matches fields by id, name, or label text, so no CSS "#" prefix is needed.
    def query(params)
      visit("/")
      fill_in "search", with: params
      click_button "search"
      find("#result").text
    end
  end
end

crawler = MyCapybara::Crawler.new
result = crawler.query("capybara") # any search term works here
File.open("query.txt", "a") { |file|
  file.write("#{result}\n")
}

More complex operations can be performed with the other methods Capybara offers.
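
For example, Capybara's all method returns every node matching a selector, which is handy when scraping lists. The sketch below reopens the crawler class and uses a hypothetical a.result selector; replace it with whatever your target page actually uses:

module MyCapybara
  class Crawler
    include Capybara::DSL

    # Collect the href attribute of every link matching the (made-up) a.result selector.
    def collect_links(path)
      visit(path)
      all("a.result").map { |link| link[:href] }
    end
  end
end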