Forum: Ruby Record-separator is a regular expression

w_a_x_man (Guest)
on 2005-11-29 14:24
(Received via mailing list)
=begin

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be obtained by RecSep#terminator.

Typical usage:

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /^\d+\.\n/ )
  reader.each {|x| p x }
}

=end


class RecSep

  def initialize( file_handle, record_separator )
    @handle = file_handle
    @buffer = ""
    @rec_sep = record_separator
    @terminator = nil
  end

  def get_rec
    ## The record-separator may be something like /\n\s*\n/,
    ## so we read enough to let it match as much as possible.
    loop  do
      @rec_sep.match( @buffer )
      break  if  $~  &&  $~.post_match.size > 0
      s = @handle.gets( "\n" )
      break  if not s
      @buffer << s
    end

    if $~
      @buffer = $~.post_match
      @terminator = $~.to_s
      $~.pre_match
    else
      @terminator = nil
      return nil  if "" == @buffer
      s, @buffer = @buffer, ""
      s
    end
  end

  def each
    while s = self.get_rec
      yield s
    end
  end

  def terminator
    @terminator
  end

end
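
For readers tracing get_rec, the three pieces it takes from each match can be seen on a plain string. This is only a minimal sketch; the sample buffer contents are made up for illustration.

```ruby
# What get_rec extracts from one successful match of the separator.
buffer = "alpha\n1.\nbeta\n2.\ngamma\n"
m = /^\d+\.\n/.match(buffer)

record     = m.pre_match   # text before the separator: the record
terminator = m.to_s        # the separator itself, kept for #terminator
rest       = m.post_match  # what stays in the buffer for the next call
```

Here record is "alpha\n", terminator is "1.\n", and rest still holds "beta\n2.\ngamma\n" for the following calls.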
gavin (Guest)
on 2005-11-29 15:32
(Received via mailing list)
On Nov 29, 2005, at 6:22 AM, William James wrote:
> Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
> record-separator.

Er, huh?

data = <<ENDDATA
name-----age-----size
Gavin    32      33
ENDDATA

p data.split( /-+| +|\n/ )
#=> ["name", "age", "size", "Gavin", "32", "33"]
James Gray (bbazzarrakk)
on 2005-11-29 15:32
(Received via mailing list)
On Nov 29, 2005, at 7:51 AM, Gavin Kistner wrote:

>
> p data.split( /-+| +|\n/ )
> #=> ["name", "age", "size", "Gavin", "32", "33"]

William is talking about the separator used by IO objects, $/.

James Edward Gray II
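
To make the distinction concrete: String#split takes a Regexp, but the separator argument of IO#gets (like $/) has to be a String. A small sketch using StringIO as a stand-in for a file; the sample data is an assumption.

```ruby
require "stringio"

io = StringIO.new("one;two;three")

# A String separator is accepted: reads up to and including ";".
first = io.gets(";")

# A Regexp separator is rejected with a TypeError instead of being used.
regexp_error =
  begin
    io.gets(/;/)
    nil
  rescue TypeError => e
    e.class
  end
```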
gavin (Guest)
on 2005-11-29 16:21
(Received via mailing list)
On Nov 29, 2005, at 7:08 AM, James Edward Gray II wrote:
> On Nov 29, 2005, at 7:51 AM, Gavin Kistner wrote:
>> On Nov 29, 2005, at 6:22 AM, William James wrote:
>>> Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
>>> record-separator.
>> p data.split( /-+| +|\n/ )
>> #=> ["name", "age", "size", "Gavin", "32", "33"]
>
> William is talking about the separator used by IO objects, $/.

Ah...thanks :)
bob.news (Guest)
on 2005-11-29 19:23
(Received via mailing list)
William James <w_a_x_man@yahoo.com> wrote:
> File.open("stuff.txt"){|handle|
>   reader = RecSep.new( handle, /^\d+\.\n/ )
>   reader.each {|x| p x }
> }

I'd prefer something integrated with IO, e.g.

File.open("foo") {|io| io.each_chunk(/:/) {|ch| p ch}}

module RegularIOChunks
  def each_chunk(rx, read_buffer = 1024)
    buff = ""
    loop do
      until ( match = ( rx.match( buff ) ) )
        part = read(read_buffer)

        if part.nil?
          yield buff
          return self
        end

        buff << part
      end

      yield match.pre_match
      buff = match.post_match
    end
  end
end

class IO
  include RegularIOChunks
end

Kind regards

    robert
Daniel.Berger (Guest)
on 2005-11-29 20:40
(Received via mailing list)
Robert Klemme wrote:
> module RegularIOChunks
>   [...]
> end
>
> class IO
>   include RegularIOChunks
> end

This would *not* be easy to implement.  Consider backtracking (do we
put it back in the stream?) and greediness (how much do we read?).
Unless you want to forbid greedy regular expressions and ignore
backtracking (not to mention certain switches), this gets real ugly,
real quick.
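
The greediness problem shows up as soon as a separator straddles a read boundary. The sketch below follows the each_chunk loop from the previous post, rewritten as a standalone method so it runs on its own; the name chunks_of, the sample data and the buffer sizes are made up for illustration.

```ruby
require "stringio"

# Same read-and-match loop as each_chunk, but collecting the chunks
# into an array instead of yielding them (chunks_of is a made-up name).
def chunks_of(io, rx, read_buffer)
  out = []
  buff = +""
  loop do
    until (match = rx.match(buff))
      part = io.read(read_buffer)
      if part.nil?          # end of input: emit the remainder and stop
        out << buff
        return out
      end
      buff << part
    end
    out << match.pre_match
    buff = match.post_match
  end
end

data = "a----b"
big_reads   = chunks_of(StringIO.new(data), /-+/, 1024) # separator read whole
small_reads = chunks_of(StringIO.new(data), /-+/, 3)    # separator cut in two
```

With the large buffer the result is ["a", "b"]. With 3-character reads the run of dashes is matched twice, once per half, so an empty chunk appears: ["a", "", "b"].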

This has come up wrt Perl as well on p5p.  Take a look here for one
thread in midstream:

http://www.nntp.perl.org/group/perl.perl5.porters/64830

Rumor has it that setting $/ to a regex will be legal in Perl 6, but I
think there will be several restrictions.

Regards,

Dan
bob.news (Guest)
on 2005-11-30 09:04
(Received via mailing list)
Daniel Berger wrote:
> Unless you want to forbid greedy regular expressions and ignore
> backtracking (not to mention certain switches), this gets real ugly,
> real quick.

Right!  My main point was that I'd prefer a solution that is integrated
with IO, i.e. no extra instance needs to be created (at least not
explicitly).  Just a question of usability.

One implementation option would be to keep reading not just until the
first match, but until the match no longer changes as more input
arrives.  That would at least deal with cases like /a{3,10}/ where a
read boundary falls in the middle of a run of 10 "a"s: otherwise you
get a match for the first half when you wanted to match the whole run.
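
The "read until the match stops changing" idea can be sketched roughly as follows. The helper name stable_match, the chunk size and the sample input are assumptions for illustration, not code from this thread. A match that stays the same across one extra read is treated as final, which is only a heuristic.

```ruby
require "stringio"

# Hypothetical helper: keep appending chunks until two successive match
# attempts cover the same span of the buffer, or the input runs out.
def stable_match(io, rx, chunk_size)
  buffer = +""
  previous = nil
  loop do
    part = io.read(chunk_size)
    buffer << part if part
    current = rx.match(buffer)
    span = current && [current.begin(0), current.end(0)]
    return [current, buffer] if part.nil? || (span && span == previous)
    previous = span
  end
end

rx = /a{3,10}/
first_try = rx.match("xaaaaa").to_s  # what a single 6-character read sees
match, _buffer = stable_match(StringIO.new("x" + "a" * 10 + "y-rest"), rx, 6)
```

The single read matches only "aaaaa", while stable_match keeps going until the full run of ten "a"s is matched. The cost is at least one extra read per record.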

> This has come up wrt Perl as well on p5p.  Take a look here for one
> thread in midstream:
>
> http://www.nntp.perl.org/group/perl.perl5.porters/64830
>
> Rumor has it that setting $/ to a regex will be legal in Perl 6, but
> I think there will be several restrictions.

As you mention, the general problem with applying regexps is a
conceptual one: because of greedy quantifiers, in the worst case the
whole file is read into memory (just consider using /.+/ as delimiter),
which doesn't fit well with the streaming approach. :-)
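
A quick way to see the worst case. Note the assumption here: in Ruby a plain /.+/ stops at the first newline, so the sketch uses the multiline form /.+/m, where the dot also matches newlines. The input sizes are arbitrary.

```ruby
# With /.+/m the match always runs to the end of whatever has been read,
# so nothing is ever left over after it, no matter how much input there is.
greedy = /.+/m
leftovers = [10, 1_000, 100_000].map do |lines|
  data = "line\n" * lines
  greedy.match(data).post_match.size
end
```

Every entry in leftovers is 0, so a reader that waits for text to remain after the match has to consume the entire input first.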

Kind regards

    robert
w_a_x_man (Guest)
on 2005-11-30 13:55
(Received via mailing list)
This version reads farther ahead in an attempt to cope
with greedy regular expressions.

=begin

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be obtained by RecSep#terminator.

Typical usage:

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /^\d+\.\n/ )
  reader.each {|x| p x }
}

=end


class RecSep

  def initialize( file_handle, record_separator, chunk_size=10_000 )
    @handle = file_handle
    @rec_sep = record_separator
    @chunk_size = chunk_size
    @buffer = ""
    @terminator = nil
  end

  attr_reader :terminator, :buffer

  def get_rec
    ## The record-separator may be something like /\n\s*\n/,
    ## so we read until there's something left over in the buffer
    ## after the match.
    loop  do
      @rec_sep.match( @buffer )
      break  if  $~  &&  $~.post_match.size > 0
      s = @handle.read( @chunk_size )
      break  if not s
      @buffer << s
    end

    if $~
      @buffer = $~.post_match
      @terminator = $~.to_s
      $~.pre_match
    else
      @terminator = nil
      return nil  if "" == @buffer
      s, @buffer = @buffer, ""
      s
    end
  end

  def each
    while s = self.get_rec
      yield s
    end
  end

end
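
The change from gets( "\n" ) to read( @chunk_size ) relies on how IO#read behaves with a length argument: it returns at most that many bytes and nil at end of input, which is what ends the loop in get_rec. A minimal sketch with StringIO standing in for the file; the data and chunk size are made up.

```ruby
require "stringio"

io = StringIO.new("abcdefgh")
chunks = []
# read(3) returns up to 3 bytes per call and nil once the input is exhausted.
while (part = io.read(3))
  chunks << part
end
```

The last chunk is shorter than requested ("gh"), and the call after it returns nil.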
w_a_x_man (Guest)
on 2005-12-05 14:24
(Received via mailing list)
Third version.  And here's an example of using it to remove
all HTML tags from a file:

File.open("data1.htm"){|handle|
  reader = RecSep.new( handle, /<.*?>/m )
  reader.each {|x| print x }
}
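
For input small enough to hold in memory, the same pattern can be tried with String#gsub to check what the streaming version should print. This is only a sketch with a made-up HTML snippet; a regular expression like this is not a real HTML parser.

```ruby
html = "<p>Hello <b>world</b>,\n<a\nhref='x'>link</a></p>"

# /m lets "." match newlines, so a tag that spans two lines is removed too.
text = html.gsub(/<.*?>/m, "")
```

The result is "Hello world,\nlink".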

-----------------------------------------------------------

=begin

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be obtained by RecSep#terminator.

Typical usage:

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /^\d+\.\n/ )
  reader.each {|x| p x }
}

Sometimes it may be necessary to keep the regular expression from
matching less than it should by increasing the look-ahead distance
(measured in characters):

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /(^.*\n)\1+/m, 4096 )
  reader.each {|x| p x }
}

=end

class RecSep

  def initialize( file_handle, record_separator,
      minimal_look_ahead = 1024 )
    @handle = file_handle
    @rec_sep = record_separator
    @min_look_ahead = minimal_look_ahead
    @buffer = ""
    @terminator = nil
    @count = 0
  end

  attr_reader :terminator, :count, :buffer

  def get_rec
    ##  Make sure the buffer has a reasonable amount of material.
    if @buffer.size < (3 * @min_look_ahead / 2)  &&  !@handle.eof?
      @buffer << @handle.read( 2 * @min_look_ahead - @buffer.size)
    end
    ##  To cope with all kinds of greedy regular expressions,
    ##  we read until there are at least @min_look_ahead bytes
    ##  left over in the buffer after the match.
    loop  do
      @rec_sep.match( @buffer )
      break  if  $~  &&  $~.post_match.size >= @min_look_ahead
      s = @handle.read( @min_look_ahead )
      break  if not s
      @buffer << s
    end

    if $~
      @buffer = $~.post_match
      @terminator = $~.to_s
      @count += 1
      $~.pre_match
    else
      @terminator = nil
      return nil  if "" == @buffer
      @count += 1
      s, @buffer = @buffer, ""
      s
    end
  end

  def each
    while s = self.get_rec
      yield s
    end
  end

end
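
Why the look-ahead matters for a pattern like /(^.*\n)\1+/m can be seen without any file at all: matched against a buffer that ends too early, the separator covers fewer repeated lines than it would on the full input. A small sketch; the sample text is an assumption.

```ruby
rx = /(^.*\n)\1+/m              # a line followed by one or more copies of itself

whole = "a\n" + "b\n" * 4 + "c\n"
cut   = whole[0, 6]             # buffer that ends after the second "b" line

short_match = rx.match(cut).to_s    # only the two copies read so far
full_match  = rx.match(whole).to_s  # all four copies
```

With the truncated buffer the separator is "b\nb\n". With the whole text it is four "b" lines, which is the result a large enough minimal_look_ahead is meant to guarantee.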