
The Scraper’s Dilemma
Testing a normal Rails app is straightforward: you control the database, you control the code. But with scrapers, the "database" is a third-party website that can change its HTML structure at 3:00 AM on a Sunday.
If you don't test your scrapers, you end up with Silent Failures: your script runs perfectly, but it starts saving nil into your database because .product-price was renamed to .price-current.
To survive, you need a Three-Tier Testing Strategy.
Tier 1: Record and Replay with VCR
You should never hit the live network during your standard test runs. It’s slow, it’s flaky, and it's bad etiquette. Instead, use the VCR gem to "record" a successful interaction once and replay it forever.
First, configure VCR in your test helper:
# test/test_helper.rb
require "vcr"
require "webmock/minitest"

VCR.configure do |config|
  config.cassette_library_dir = "test/vcr_cassettes"
  config.hook_into :webmock
end
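Two settings are worth adding once you record anything real: VCR can scrub secrets out of cassettes before they land in Git, and you can control when re-recording happens. A small sketch, assuming your scraper sends an API_TOKEN environment variable (swap in whatever credentials you actually use):

# Scrub secrets from cassettes and only record when no cassette exists yet.
VCR.configure do |config|
  config.filter_sensitive_data("<API_TOKEN>") { ENV["API_TOKEN"] }
  config.default_cassette_options = { record: :once }
end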
Then write a test that replays the cassette. It verifies that your parsing logic is correct against a specific version of the HTML, with no internet connection needed:
# test/services/price_scraper_test.rb
require "test_helper"

class PriceScraperTest < ActiveSupport::TestCase
  test "correctly extracts the price from the cached HTML" do
    VCR.use_cassette("amazon_product") do
      result = PriceScraper.call("https://amazon.com/item")

      assert_equal 29.99, result.price
      assert_equal "USD", result.currency
    end
  end
end
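When the site legitimately changes and you want a fresh recording, delete the cassette file or temporarily switch the record mode. For example, using the cassette name above:

# Force a re-record against the live site, then revert to the default mode
# and commit the refreshed cassette.
VCR.use_cassette("amazon_product", record: :all) do
  PriceScraper.call("https://amazon.com/item")
end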
Tier 2: Validate the Data Contract
Don't just trust that the parser found what it was looking for. Treat the scraped data like untrusted user input. Use ActiveModel::Validations to ensure the "Contract" between the website and your app is still valid.
# app/models/scraped_product.rb
class ScrapedProduct
  include ActiveModel::Validations

  attr_accessor :name, :price

  validates :name, presence: true, length: { minimum: 5 }
  validates :price, presence: true, numericality: { greater_than: 0 }

  def initialize(attributes = {})
    @name = attributes[:name]
    @price = attributes[:price]
  end
end
In your scraper:
product = ScrapedProduct.new(name: doc.at_css('.title')&.text, price: parsed_price)

unless product.valid?
  raise "Scraping Contract Broken: #{product.errors.full_messages.join(', ')}"
end
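For context, here is one way the PriceScraper service from the Tier 1 test might tie fetching, parsing, and the contract check together. Treat this as a sketch: the Result struct, the selectors, and the hardcoded currency are assumptions, not the one true implementation.

# app/services/price_scraper.rb (minimal sketch, assumed structure)
class PriceScraper
  Result = Struct.new(:price, :currency, keyword_init: true)

  def self.call(url)
    doc = Nokogiri::HTML(HTTP.get(url).to_s)

    product = ScrapedProduct.new(
      name:  doc.at_css('.title')&.text,
      price: doc.at_css('.price')&.text&.gsub(/[^\d.]/, '')&.to_f
    )

    unless product.valid?
      raise "Scraping Contract Broken: #{product.errors.full_messages.join(', ')}"
    end

    # Currency is hardcoded for brevity; scrape or configure it in real code.
    Result.new(price: product.price, currency: "USD")
  end
end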
Tier 3: Production Smoke Tests
VCR tests (Tier 1) only prove your code works with yesterday's HTML. They don't tell you if the website changed today.
For this, you need a Production Smoke Test. This is a small script that runs every few hours in your production environment (via Sidekiq or a Cron job). It hits the live site and checks if the key selectors still exist.
# lib/tasks/scraper_monitor.rake
namespace :scraper do
  task monitor: :environment do
    sample_url = "https://example.com/product/1"
    html = HTTP.get(sample_url).to_s
    doc = Nokogiri::HTML(html)

    required_selectors = ['.price', '.title', '.description']
    missing = required_selectors.reject { |s| doc.at_css(s).present? }

    if missing.any?
      # Send an alert to Slack, Discord, or Email
      SlackNotifier.alert("Scraper broken on #{sample_url}! Missing: #{missing.join(', ')}")
    end
  end
end
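SlackNotifier is not a built-in; it stands in for whatever alerting you use. A minimal sketch that posts to a Slack incoming webhook, assuming a SLACK_WEBHOOK_URL environment variable:

# app/services/slack_notifier.rb (minimal sketch)
require "net/http"
require "json"

class SlackNotifier
  def self.alert(message)
    uri = URI(ENV.fetch("SLACK_WEBHOOK_URL"))
    Net::HTTP.post(uri, { text: message }.to_json, "Content-Type" => "application/json")
  end
end

Any scheduler can trigger the task. With the whenever gem, for example:

# config/schedule.rb
every 3.hours do
  rake "scraper:monitor"
end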
Separate Fetching from Parsing
The biggest mistake developers make is putting the network request and the Nokogiri logic in the same method. This makes testing a nightmare.
Do this instead:
# Simple to test with any HTML string!
class AmazonParser
  def initialize(html)
    @doc = Nokogiri::HTML(html)
  end

  def price
    # Safe navigation and cleanup
    @doc.at_css('.price')&.text&.gsub(/[^\d.]/, '')&.to_f
  end
end
# Your test now needs NO network and NO VCR:
class AmazonParserTest < ActiveSupport::TestCase
  test "extracts price from raw string" do
    html = "<div class='price'>$29.99</div>"
    parser = AmazonParser.new(html)

    assert_equal 29.99, parser.price
  end
end
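Because the parser takes a plain string, you can also pin down tricky real-world markup by saving a page into test/fixtures/files and loading it with Rails' file_fixture helper (the filename is just an example):

test "extracts price from a saved page" do
  parser = AmazonParser.new(file_fixture("amazon_product.html").read)

  assert_equal 29.99, parser.price
end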
Cassettes (Tier 1) lock your parsing logic to a known version of the HTML, validations (Tier 2) catch nil or malformed data before it hits your DB, and smoke tests (Tier 3) warn you the moment the live site changes. Web scraping is a game of "When, not If." By building a test suite that acknowledges the instability of the web, you stop being a firefighter and start being an architect.
How do you handle site changes? Do you have an "early warning system" or do you wait for the bug reports? Let's discuss in the comments! 👇