Ruby global regex question


#1

For the life of me, I can't figure out a Ruby equivalent to Perl's /g modifier.

Basically, I want to do the following:

while htmlSource=~m/<table>(.*?)<\/table>/g do
  tableSource=$1
  tableSource=~m/Index (\d+)/
  indexNumber=$1

  while tableSource=~m/<tr>(.*?)<\/tr>/g do
    tableRowSource=$1
    doSomethingWith(tableRowSource, indexNumber)
  end#while tableSource

end#while htmlSource

I will actually need to pull multiple variables from the regex, not
just a single one. I will need to run the outer loop an unknown number
of times per document (0-20) and the inner loop an unknown number of
times (0-29).

Thread safe would be a plus.

any suggestions?


#2

On Tue, Nov 18, 2008 at 4:06 PM, knohr removed_email_address@domain.invalid wrote:

while tableSource=~m/<tr>(.*?)<\/tr>/g do
document (0-20) and i will need to loop the inner an unknown amount of
times (0-29)

Thread safe would be a plus.

any suggestions?

I think this does what you want, although I don't think gsub was really
made for this purpose.

def doSomethingWith(s)
  print s, "\n"
end

htmlSource = '<table><tr>1,1</tr><tr>1,2</tr></table>'
htmlSource << '<table><tr>2,1</tr><tr>1,2</tr></table>'

htmlSource.gsub(/<table>(.*?)<\/table>/) do |t|
  tableRowSource = $1
  tableRowSource.gsub(/<tr>(.*?)<\/tr>/) do |r|
    doSomethingWith $1
  end
end

#3

On 2008.11.19., at 1:06, knohr wrote:

while tableSource=~m/<tr>(.*?)<\/tr>/g do
document (0-20) and i will need to loop the inner an unknown amount of
times (0-29)

Thread safe would be a plus.

any suggestions?

While I can’t answer your original question, I could possibly help you
with the scraping if you are willing to reveal the page you are trying
to scrape and the data bits on it which should be scraped.

Cheers,
Peter


http://www.rubyrailways.com
http://scrubyt.org


#4

On Nov 18, 7:08 pm, knohr removed_email_address@domain.invalid wrote:

  tableRowSource=$1

Thread safe would be a plus.

Would fast be a plus? No nested loop?

require 'nokogiri'
doc = Nokogiri::HTML(htmlSource)
doc.search('//tr').each do |row|
  # contains() takes the text to search first, then the substring to look for
  index = row.xpath('ancestor::table/*[contains(., "Index")]')
  doSomethingWith(row.text, index.text[/\d+/])
end

The location of the element containing the index may have to be
modified.

– Mark.


#5

On 19.11.2008, at 00:37 , Alan Johnson wrote:

tableSource=~m/Index (\d+)/
I will actually need to pull multiple vars, not just a single one,
any suggestions?
htmlSource = '<table><tr>1,1</tr><tr>1,2</tr></table>'

Alan
That is pretty much how, except globals are hardly thread safe I
think. Use scan instead of gsub:
Here’s something I wrote to extract information from data structured
like this:
- tablename
  + field1
  + field2:string
- table2name
  + field1 : string
  + field2
Table = Struct.new(:name, :fields)
Field = Struct.new(:name, :type)

def extract_db_spec(file)
  tables = []
  doc = File.read(file)
  table_name = /- (\w*)\s*?\n/
  # a field line is: indentation, a literal "+", the name, and an optional ": type"
  field_name = /(\s+\+ (\w+)\s*(:\s*(\w*))?\n)/
  doc.scan(/#{table_name}(#{field_name}+)/) do |tablename, fields|
    t = Table.new tablename, []
    fields.scan(field_name) do |junk, fieldname, junk2, type|
      if type.nil? || type == ""
        # untyped fields default to int for *_id names, string otherwise
        if /\w+_id/ === fieldname
          type = "int"
        else
          type = "string"
        end
      end
      t.fields << Field.new(fieldname, type)
    end
    tables << t
  end
  tables
end

einarmagnus


#6

On 19.11.2008 07:08, Einar Magnús Boson wrote:

That is pretty much how, except globals are hardly thread safe I
think.

$1 and the like are thread local:

robert@fussel ~
$ ruby -e '2.times{|i|Thread.new(i){|ii|4.times{/(\d+)/=~ii.to_s;puts $1;sleep 1}}};sleep 5'
0
1
1
0
1
0
1
0

robert@fussel ~
$
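A slightly longer version of the same check, as a rough sketch: each thread matches its own string, sleeps so the other threads get to run their match too, and only then reads $1. The input strings and the results hash are made up for illustration.

```ruby
# Sketch: if $1 were shared between threads, the sleeping threads
# would see each other's captures. The inputs below are invented.
results = {}
lock = Mutex.new

threads = %w[alpha-1 beta-2 gamma-3].map do |input|
  Thread.new(input) do |text|
    text =~ /(\d+)/   # sets $~ and $1 for this thread
    sleep 0.05        # let the other threads run their own match
    captured = $1     # still this thread's capture
    lock.synchronize { results[text] = captured }
  end
end
threads.each(&:join)

p results
```

Every key should end up paired with its own digit, whatever order the threads finish in.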

Use scan instead of gsub:

Right, as far as I can see no replacements should be done. Just read
only access.

html_source.scan %r{<table>(.*?)</table>}i do
  table_source = $1
  index_number = table_source[%r{Index\s+(\d+)}, 1].to_i

  table_source.scan %r{<tr>(.*?)</tr>}i do
    do_something_with $1, index_number
  end
end
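
On the "multiple vars" part of the original question, here is a minimal sketch of the same nested scan with more than one capture group. The sample HTML, the two-column row layout and the rows array are assumptions for illustration, since the real page is not shown in this thread. The /m flag is added so that "." also matches newlines.

```ruby
# Hypothetical sample input; the real page layout is unknown.
html_source = <<~HTML
  <table>Index 7
    <tr><td>alice</td><td>30</td></tr>
    <tr><td>bob</td><td>25</td></tr>
  </table>
  <table>Index 12
    <tr><td>carol</td><td>41</td></tr>
  </table>
HTML

rows = []

# With capture groups, scan yields an array of the groups for each match;
# |(table_source)| unpacks the single group of the outer pattern.
html_source.scan(%r{<table[^>]*>(.*?)</table>}im) do |(table_source)|
  index_number = table_source[/Index\s+(\d+)/, 1].to_i

  # Two groups arrive as two block parameters, so $1/$2 are not needed.
  table_source.scan(%r{<tr[^>]*>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>}im) do |name, age|
    rows << [index_number, name, age.to_i]
  end
end

p rows
```

Taking the captures as block parameters keeps the inner loop independent of the globals, which also sidesteps the thread safety worry.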

But a proper HTML parser is probably much better. :-)

Kind regards

robert


#7

I use this as an equivalent to global match:

class Regexp
  def global_match(str, &proc)
    retval = nil
    loop do
      res = str.sub(self) do |m|
        proc.call($~) # pass MatchData obj
        ''
      end
      break retval if res == str
      str = res
      retval ||= true
    end
  end
end

re = /…/
re.global_match(…) do |m|

end
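
A possible variation on global_match, sketched here without reopening Regexp: it walks the string with Regexp#match(str, pos) instead of cutting each match out with sub, so the string is never copied and anchors or lookbehinds still see the original text. The helper name each_match and the sample string are hypothetical, not something from this thread.

```ruby
# Hypothetical helper: yields a MatchData for every non-overlapping
# match of re in str, leaving str untouched.
def each_match(re, str)
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    yield m
    # After an empty match, step one character forward so the loop terminates.
    pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
  end
end

pairs = []
each_match(/(\w+)=(\d+)/, "a=1, bb=22, ccc=333") do |m|
  pairs << [m[1], m[2].to_i]   # all groups are available on the MatchData
end

p pairs
```

Because the block receives the MatchData itself, nothing here reads $~ or $1.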