regexp_crawler -- a crawler which uses regular expressions to catch data from websites

RegexpCrawler is a crawler which uses regular expressions to catch data
from websites. It is easy to use and needs little code if you are
familiar with regular expressions.
The project site is: http://github.com/flyerhzm/regexp_crawler/tree
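
The basic idea can be sketched in plain Ruby without the gem: a regular expression with capture groups is matched against a page body, and each group is paired with a field name. The page body, the pattern, and the `capture_fields` helper below are hypothetical illustrations, not RegexpCrawler's actual API or implementation.

```ruby
# Hypothetical helper (not part of RegexpCrawler): pair the capture groups
# of a regexp match with a list of field names.
def capture_fields(body, regexp, names)
  match = regexp.match(body)
  return nil unless match
  # Zip each name with the capture group at the same position.
  Hash[names.map(&:to_sym).zip(match.captures)]
end

# Made-up page body, for illustration only.
SAMPLE_BODY = <<-HTML
<h1>bullet</h1>
<p class="description">A rails plugin/gem to kill N+1 queries</p>
HTML

# The m flag lets "." match newlines, so one pattern can span several lines.
SAMPLE_REGEXP = %r{<h1>(.*?)</h1>.*<p class="description">(.*?)</p>}m

result = capture_fields(SAMPLE_BODY, SAMPLE_REGEXP, ['title', 'description'])
puts result[:title]
puts result[:description]
```

Keeping the names separate from the pattern means the same helper can be reused with a different regular expression for each site.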

Here is an example: a script that synchronizes your GitHub projects,
excluding forked projects. Please check example/github_projects.rb

require 'rubygems'
require 'regexp_crawler'

crawler = RegexpCrawler::Crawler.new(
  :start_page => "http://github.com/flyerhzm",
  :continue_regexp => %r{

  }m,
  :named_captures => ['title', 'description', 'body'],
  :save_method => Proc.new do |result, page|
    puts '============================='
    puts page
    puts result[:title]
    puts result[:description]
    puts result[:body][0...100] + "..."
  end,
  :need_parse => Proc.new do |page, response_body|
    page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.?/)
  end)
crawler.start

The results are as follows:

=============================

bullet
A rails plugin/gem to kill N+1 queries and unused eager loading

Bullet

The Bullet plugin/gem is designed to help you increase your...
=============================
http://github.com/flyerhzm/regexp_crawler/tree/master
regexp_crawler
A crawler which use regular expression to catch data.

RegexpCrawler

RegexpCrawler is a crawler which use regex expressi...
=============================
http://github.com/flyerhzm/sitemap/tree/master
sitemap
This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb

Sitemap

This plugin will generate a sitemap.xml or sitemap.xml.gz ...
=============================
http://github.com/flyerhzm/visual_partial/tree/master
visual_partial
This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.

VisualPartial

This plugin provides a way that you can see all the ...
=============================
http://github.com/flyerhzm/chinese_regions/tree/master
chinese_regions
provides all chinese regions, cities and districts

ChineseRegions

Provides all chinese regions, cities and districts<...
=============================
http://github.com/flyerhzm/chinese_permalink/tree/master
chinese_permalink
This plugin adds a capability for ar model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.

ChinesePermalink

This plugin adds a capability for ar model to cre...
=============================
http://github.com/flyerhzm/codelinestatistics/tree/master
codelinestatistics
The code line statistics takes files and directories from GUI, counts the total files, total sizes of files, total lines, lines of codes, lines of comments and lines of blanks in the files, displays the results and can also export results to html file.

codelinestatistics README file:

Wha…
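
To see why forked projects are missing from the results, the condition passed as :need_parse in the script can be tried on its own. The sketch below wraps the same kind of condition in a lambda and calls it with made-up URLs and page bodies. The lambda name, the sample inputs, and the simplified /Fork of/ pattern are assumptions for illustration; it does not call RegexpCrawler at all.

```ruby
# Stand-alone version of the filter condition, wrapped in a lambda.
# A page is parsed only if its URL looks like a flyerhzm project page
# and its body does not contain the text "Fork of".
need_parse = lambda do |page, response_body|
  !!(page =~ %r{http://github.com/flyerhzm/\w+}) && !response_body.index(/Fork of/)
end

# Made-up inputs: an own project, a forked project, and the profile page.
own_project  = need_parse.call('http://github.com/flyerhzm/bullet', '<h1>Bullet</h1>')
forked       = need_parse.call('http://github.com/flyerhzm/some_fork', 'Fork of someone/some_fork')
profile_page = need_parse.call('http://github.com/flyerhzm', '<h1>flyerhzm</h1>')

puts own_project
puts forked
puts profile_page
```

Only the first call returns true: the forked page is rejected because of the "Fork of" text in its body, and the profile page because its URL has no project name after the user name.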