This repository was archived by the owner on Nov 14, 2024. It is now read-only.

flexible threaded web crawler based on hpricot and anemone

dealerignition/Scrapy


# Flexible Web Crawler designed for Product Scraping #

## Dependencies ##

* open-uri
* openssl
* hpricot
* anemone
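The first two dependencies ship with Ruby's standard library; `hpricot` and `anemone` are gems that must be installed separately. A minimal setup sketch (assuming a standard RubyGems installation):

```ruby
require 'open-uri'   # stdlib: lets open() fetch URLs
require 'openssl'    # stdlib: SSL support for https pages

# hpricot and anemone are third-party gems; install them first:
#   gem install hpricot anemone
# then require them in your script:
# require 'hpricot'
# require 'anemone'
```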

## Usage ##

1. Create a list of hashes describing the objects to search for:

   ```ruby
   [{
     :location => "class/id name | html location (e.g. ul[@class='vehicleDescription']/li[9])",
     :type     => "custom | class | id",
     :name     => "Name"
   }]
   ```
2. Instantiate the scraper:

   ```ruby
   scraper = Scrapy::Crawler.new(<website url>, options)
   ```
3. Start the crawl session:

   ```ruby
   scraper.crawl
   ```

   Note: this call returns immediately, but crawling will take some time.
4. You can poll the crawler to see whether it has finished:

   ```ruby
   scraper.crawling_complete? # => true or false
   ```
5. Once crawling has finished, use `retrieve_products` to get the list of matched products:

   ```ruby
   products = scraper.retrieve_products
   ```

   Note: each product hash holds the matched attributes under the names you supplied (e.g. `products[0][:Name]`), and the source page URL under `products[0][:page_url]`.
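The steps above follow a start-then-poll pattern. A self-contained sketch of that calling pattern, using a hypothetical `StubCrawler` stand-in (the real `Scrapy::Crawler` crawls a live site; the stand-in fakes one matched product so the example runs without network access):

```ruby
# StubCrawler mimics the interface described above: crawl returns
# immediately and runs in a background thread, crawling_complete? is
# polled, and retrieve_products returns the matched product hashes.
class StubCrawler
  def initialize(url, attributes)
    @url = url
    @attributes = attributes   # the list of search hashes from step 1
    @done = false
    @products = []
  end

  def crawl
    Thread.new do
      # a real crawler would visit pages here; we fake one match
      @products << { :Name => "Example Vehicle", :page_url => @url }
      @done = true
    end
    nil   # returns immediately, like scraper.crawl
  end

  def crawling_complete?
    @done
  end

  def retrieve_products
    @products
  end
end

attributes = [{ :location => "ul[@class='vehicleDescription']/li[9]",
                :type     => "custom",
                :name     => "Name" }]

scraper = StubCrawler.new("http://example.com", attributes)
scraper.crawl
sleep 0.05 until scraper.crawling_complete?   # poll, as in step 4

products = scraper.retrieve_products
puts products[0][:Name]       # => "Example Vehicle"
puts products[0][:page_url]   # => "http://example.com"
```

Polling keeps the caller decoupled from the crawler's threading; an application can do other work between checks instead of blocking on the crawl.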
