Forum: Ruby ruby global regex question.

knohr (Guest)
on 2008-11-19 01:10
(Received via mailing list)
For the life of me, I can't figure out a Ruby equivalent to Perl's /g.

Basically, I want to do the following:


while htmlSource=~m/<table>(.*?)<\/table>/g do
   tableSource=$1
   tableSource=~m/Index (\d+)/
   indexNumber=$1

   while tableSource=~m/<tr>(.*?)<\/tr>/g do
      tableRowSource=$1
      doSomethingWith(tableRowSource, indexNumber)
   end#while tableSource

end#while htmlSource


I will actually need to pull multiple vars from the regex, not just a
single one. The outer loop has to run an unknown number of times per
document (0-20), and the inner loop an unknown number of times (0-29).


Thread safe would be a plus.


any suggestions?
Alan Johnson (Guest)
on 2008-11-19 01:41
(Received via mailing list)
On Tue, Nov 18, 2008 at 4:06 PM, knohr <just_a_techie200x@yahoo.com>
wrote:

>   while tableSource=~m/<tr>(.*?)<\/tr>/g do
> document (0-20) and i will need to loop the inner an unknown amount of
> times (0-29)
>
>
> Thread safe would be a plus.
>
>
> any suggestions?
>
>
I think this does what you want, although I don't think gsub was really
made for this purpose.

def doSomethingWith(s)
    print s, "\n"
end

htmlSource =  '<table><tr>1,1</tr><tr>1,2</tr></table>'
htmlSource << '<table><tr>2,1</tr><tr>1,2</tr></table>'

htmlSource.gsub(/<table>(.*?)<\/table>/) do |t|
    tableSource = $1  # save the capture before the inner gsub overwrites $1
    tableSource.gsub(/<tr>(.*?)<\/tr>/) do |r|
        doSomethingWith $1
    end
end
Peter Szinek (Guest)
on 2008-11-19 03:04
(Received via mailing list)
On 2008.11.19., at 1:06, knohr wrote:

>   while tableSource=~m/<tr>(.*?)<\/tr>/g do
> document (0-20) and i will need to loop the inner an unknown amount of
> times (0-29)
>
>
> Thread safe would be a plus.
>
>
> any suggestions?

While I can't answer your original question, I could possibly help you
with the scraping if you are willing to reveal the page you are trying
to scrape and the data bits on it which should be scraped.

Cheers,
Peter
___
http://www.rubyrailways.com
http://scrubyt.org
Mark Thomas (Guest)
on 2008-11-19 04:22
(Received via mailing list)
On Nov 18, 7:08 pm, knohr <just_a_techie2...@yahoo.com> wrote:
>       tableRowSource=$1
>
> Thread safe would be a plus.

Would fast be a plus? No nested loop?

require 'nokogiri'
doc = Nokogiri::HTML(htmlSource)
doc.search('//tr').each do |row|
  # XPath contains() takes the string to search first, then the substring
  index = row.xpath('ancestor::table/*[contains(., "Index")]')
  doSomethingWith(row.text, index.text[/\d+/])
end

The location of the element containing the index may have to be
modified.

-- Mark.
Einar Magnús Boson (Guest)
on 2008-11-19 07:12
(Received via mailing list)
On 19.11.2008, at 00:37 , Alan Johnson wrote:

>>  tableSource=~m/Index (\d+)/
>> I will actually need to pull multiple vars, not just a single one,
>> any suggestions?
> htmlSource =  '<table><tr>1,1</tr><tr>1,2</tr></table>'
> Alan

That is pretty much how to do it, except that globals are hardly thread
safe, I think. Use scan instead of gsub.

Here's something I wrote to extract information from data structured
like this:

- tablename
     + field1
     + field2:string

- table2name
     +field1 : string
     +field2

Table = Struct.new(:name, :fields)
Field = Struct.new(:name, :type)

  def extract_db_spec(file)
    tables = []
    doc = open(file, File::RDONLY) {|f|f.read}
    table_name = /\- (\w*)\s*?\n/
    field_name = /(\s+\+ (\w+)\s*(\:\s*(\w*))?\n)/
    doc.scan /#{table_name}(#{field_name}+)/ do |tablename, fields|
      t = Table.new tablename, []
      fields.scan field_name do |junk, fieldname, junk2, type|
        if type.nil? || type == ""
          if /\w+_id/ === fieldname
            type = "int"
          else
            type = "string"
          end
        end

        t.fields <<  Field.new(fieldname, type)

      end
      tables << t
    end
    tables
  end


einarmagnus
Robert Klemme (Guest)
on 2008-11-19 08:50
(Received via mailing list)
On 19.11.2008 07:08, Einar Magnús Boson wrote:

>
> That is pretty much how, except globals are hardly thread safe I
> think.

$1 and the like are thread safe: they are local to the current thread
and method.

robert@fussel ~
$ ruby -e '2.times{|i|Thread.new(i){|ii|4.times{/(\d+)/=~ii.to_s;puts $1;sleep 1}}};sleep 5'
0
1
1
0
1
0
1
0

robert@fussel ~
$
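
The same point can be checked without relying on interleaved output. The following is only a sketch: the sample strings and the names seen and main_capture are made up for illustration. Each thread runs its own match, waits long enough for the other thread to run, and then reads $1 again.

```ruby
# Match in the main thread first, so we can check it is left untouched.
/(\w+)/ =~ "main"

threads = 2.times.map do |i|
  Thread.new(i) do |n|
    /(\d+)/ =~ "row #{n}"   # sets $~ and $1 for this thread only
    sleep 0.05              # let the other thread run its own match
    [n, $1]                 # still this thread's capture
  end
end

seen = threads.map(&:value) # wait for each thread and collect its result
main_capture = $1           # the main thread's capture is unchanged

seen.each { |n, captured| puts "thread #{n} captured #{captured}" }
puts "main captured #{main_capture}"
```

If the match variables were shared between threads, one thread could read the other's capture after the sleep, and the main thread would no longer see "main".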

> Use scan instead of gsub:

Right, as far as I can see no replacements should be done.  Just read
only access.

html_source.scan %r{<table>(.*?)</table>}i do
   table_source = $1
   index_number = table_source[%r{Index\s+(\d+)}, 1].to_i

   table_source.scan %r{<tr>(.*?)</tr>}i do
     do_something_with $1, index_number
   end
end
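
Since the original question asks for several values per match, here is a hedged variation on the scan approach that takes the captures as block parameters instead of reading $1. The sample markup (a caption holding the index, two cells per row) and the rows array are assumptions for illustration only, not the poster's real page.

```ruby
# Hypothetical sample input: two tables, each with an "Index N" caption.
html_source = <<~HTML
  <table><caption>Index 7</caption><tr><td>a</td><td>1</td></tr><tr><td>b</td><td>2</td></tr></table>
  <table><caption>Index 12</caption><tr><td>c</td><td>3</td></tr></table>
HTML

rows = []

# With capture groups, scan yields an array of the groups for each match.
# |(table_source)| unpacks the single group of the outer pattern.
html_source.scan(%r{<table>(.*?)</table>}im) do |(table_source)|
  index_number = table_source[/Index\s+(\d+)/, 1].to_i

  # Two groups in the inner pattern arrive as two block parameters.
  table_source.scan(%r{<tr><td>(.*?)</td><td>(.*?)</td></tr>}im) do |name, value|
    rows << [index_number, name, value]
  end
end

rows.each { |row| p row }
```

Block parameters avoid the question of when $1 gets overwritten by a nested match, at the cost of having to keep the parameter list in step with the groups in the pattern.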

But a proper HTML parser is probably much better. :-)

Kind regards

  robert
Gustavo Carvalho (Guest)
on 2008-11-19 13:57
(Received via mailing list)
I use this as an equivalent to global match:

class Regexp
  def global_match(str, &proc)
    retval = nil
    loop do
      res = str.sub(self) do |m|
        proc.call($~) # pass MatchData obj
        ''
      end
      break retval if res == str
      str = res
      retval ||= true
    end
  end
end

re = /.../
re.global_match(...) do |m|
    ...
end
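
For comparison, a sketch of a variant that does not rebuild the string on every pass: it walks the string with Regexp#match and a start offset, and hands each MatchData to the block. The helper name each_match and the sample string are hypothetical, and this is not a drop-in replacement for the method above.

```ruby
# Yield a MatchData for every non-overlapping match of re in str.
def each_match(str, re)
  pos = 0
  while (m = re.match(str, pos))
    yield m
    # Continue after this match; step one character past an empty match
    # so the loop cannot get stuck at the same position.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
end

found = []
each_match("Index 3, Index 14", /Index (?<n>\d+)/) do |m|
  found << [m[:n].to_i, m.begin(0)]   # named capture and match offset
end

p found
```

Because the original string is left intact, offsets such as m.begin(0) refer to positions in the real input, which the sub-based loop cannot offer once it has started deleting matches.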