Hi
I have a regex as follows
REG1 = Regexp.new('dodgydata')
Basically, I need to read a file in line by line, and on each line
that’s a match for my regex, remove everything from the point of the
regex to the end of that line. Is there a more efficient way to do it
than what I have at the moment? I have some HUGE files to process, and
many more regex matches to capture… :-/
ARGF.each_line do |line|
  if line.index(REG1)
    a = line.index(REG1)
    b = line.length
    line.slice!(a...b)
    line << "\n" # Add a newline back to the end of the line
  end
  print line
end
(As you can see I’m no ruby expert!). I’d like to be able to make this
more efficient, just doesn’t seem a very Ruby way to do things… All
help gratefully accepted!
Regards
Eddie
i’d use a regex and the string split method:
ruby-1.9.2-p180 :004 > line = "abcXXdef"
=> "abcXXdef"
ruby-1.9.2-p180 :005 > line.split(/XX/).first
=> "abc"
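A couple of edge cases are worth checking with the split approach. The sketch below is only an illustration: before_match is a made-up helper name, not something from this thread. It passes a limit of 2 to split so that a line beginning with the match yields an empty string rather than nil, and it shows that the trailing newline disappears only on lines that matched.

```ruby
# Hypothetical helper (not from the thread): keep only the text before the first match.
def before_match(line, regex)
  # A limit of 2 stops splitting after the first match and keeps empty fields,
  # so a line that starts with the match gives "" instead of nil.
  line.split(regex, 2).first
end

p "XX".split(/XX/).first            # => nil
p before_match("XX", /XX/)          # => ""
p before_match("abcXXdef\n", /XX/)  # => "abc"
p before_match("abcdef\n", /XX/)    # => "abcdef\n"
```

Because matched lines lose their newline and unmatched lines keep it, printing with puts rather than print keeps the output consistent.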
You probably want something like:
# this won't remove trailing newlines
regex = /thing_to_match.*$/
file.each do |line|
  processed_line = line.gsub regex, ''
  print processed_line
end
~ jf
John F.
Principal Consultant, BitsBuilder
LI: http://www.linkedin.com/in/johnxf
SO: User John Feminella - Stack Overflow
On Fri, Jul 15, 2011 at 09:54:48AM +0900, Chad P. wrote:
I’d probably do it something like this:
ARGF.each_line do |line|
  line.sub!(/#{reg1}.*/, '')
  puts line
end
Actually, I did too literal a translation of your code. If I was writing
this from scratch, I probably would have done this instead:
ARGF.each_line {|line| puts line.sub(/#{reg1}.*/, '') }
. . . or this:
ARGF.each_line do |line|
  puts line.sub(/#{reg1}.*/, '')
end
Which I would choose would depend on whether I felt it looked better
amidst my other code in the file on one line or three, and on whether I
expected I might need to add more lines later.
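For anyone who wants to try the sub approach without a data file, here is a minimal self-contained sketch. The sample text, the STRIP_TAIL constant and the strip_from_match helper are all made up for illustration; the regex is built once outside the loop on the assumption that this may avoid rebuilding it for every line.

```ruby
require "stringio"

# Stand-in for the original poster's REG1.
REG1 = /dodgydata/

# Built once, outside the loop (hypothetical name, not from the thread).
STRIP_TAIL = /#{REG1}.*/

# Return the line with everything from the first match onward removed.
# "." does not match "\n" by default, so a trailing newline survives.
def strip_from_match(line, regex = STRIP_TAIL)
  line.sub(regex, "")
end

# Made-up sample input standing in for ARGF.
input = StringIO.new("keep this dodgydata drop this\nclean line\n")
input.each_line do |line|
  # puts does not add a second newline when the string already ends in one.
  puts strip_from_match(line)
end
```

Swapping input for ARGF should give the same behaviour on real files.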
Awesome stuff. Thanks to all that replied. It really is appreciated.
Regards
On Fri, Jul 15, 2011 at 09:37:14AM +0900, Eddie Catflap wrote:
many more regex matches to capture… :-/
I’d probably do it something like this:
ARGF.each_line do |line|
  line.sub!(/#{reg1}.*/, '')
  puts line
end
. . . unless I could be sure the end of the regex in reg1 included .*
after the match you want to find, in which case the second line of that
would look more like this:
line.sub!(reg1, '')
This, of course, assumes that reg1 actually contains a regex object. If
it contains a string, go back to my first example, because strings don’t
deal well with regex special characters.
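Since the original question mentions many more patterns to come, one way to handle a mix of strings and regex objects is Regexp.union, which escapes plain strings and leaves Regexp objects alone. The patterns below are invented for illustration (only 'dodgydata' appears in the thread), and clean is a hypothetical helper name.

```ruby
# Hypothetical patterns; the thread only shows 'dodgydata'.
patterns = [/dodgydata/, "literal.text", /bad\d+/]

# Regexp.union escapes plain strings and keeps Regexp objects as they are,
# producing a single alternation.
combined = Regexp.union(patterns)

# Build the "match through end of line" regex once, outside any loop.
strip_tail = /(?:#{combined}).*/

# Hypothetical helper: remove everything from the first match onward.
def clean(line, regex)
  line.sub(regex, "")
end

p clean("abc bad42 rest\n", strip_tail)       # => "abc \n"
p clean("x literal.text y\n", strip_tail)     # => "x \n"
p clean("x literalZtext y\n", strip_tail)     # => "x literalZtext y\n"
```

The last line is left alone because the "." in the string pattern was escaped and so only matches a literal dot.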
(As you can see I’m no ruby expert!).
We all start somewhere. I wouldn’t exactly call myself an “expert”
either.
I’d like to be able to make this more efficient, just doesn’t seem a
very Ruby way to do things… All help gratefully accepted!
Note: I have not tested for execution time of my solution. I just wrote
it off the top of my head to conform to my sense of good, clean Ruby
code.
On Thu, Jul 14, 2011 at 8:12 PM, Chad P. wrote:
–
Chad P. [ original content licensed OWL: http://owl.apotheon.org ]
Depending on how the string is to be interpreted, you may need to escape it
before creating it:
str = "a.b"
/#{str}/ # => /a.b/
Regexp.new(str) # => /a.b/
Regexp.new(Regexp.escape(str)) # => /a\.b/
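To make the difference concrete, here is a small sketch showing how the escaped and unescaped versions behave against a string that has some other character where the dot would be. The variable names are just for illustration.

```ruby
str = "a.b"

unescaped = Regexp.new(str)                 # "." matches any character
escaped   = Regexp.new(Regexp.escape(str))  # "." matches only a literal dot

p "axb" =~ unescaped   # => 0
p "axb" =~ escaped     # => nil
p "a.b" =~ escaped     # => 0
p escaped.source       # => "a\\.b"
```

So if the patterns are meant to be literal text, escaping first avoids accidental matches like "axb".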
Depending on how big the files are, efficiency may be very important. In
that case, you might benchmark a few different approaches and see which is
best. Here is another possibility:
ARGF.each_line do |line|
  index = line.index regex
  line.slice!(index..-1) if index # also removes the newline; puts adds it back
  puts line
end
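A rough way to compare the approaches is the Benchmark module. The sketch below is untested against real data: the pattern and the generated lines are made up, and the relative timings will probably depend on the Ruby version and on how often lines actually match.

```ruby
require "benchmark"

regex = /dodgydata/     # hypothetical pattern
tail  = /#{regex}.*/    # "match through end of line", built once

# Made-up sample data; real files would be read via ARGF or File.foreach.
lines = Array.new(50_000) { |i| "row #{i} dodgydata trailing junk\n" }

Benchmark.bm(14) do |bm|
  bm.report("sub") do
    lines.each { |l| l.sub(tail, "") }
  end

  bm.report("index + slice!") do
    lines.each do |l|
      copy = l.dup                 # slice! mutates, so work on a copy
      i = copy.index(regex)
      copy.slice!(i..-1) if i
    end
  end

  bm.report("slice!(regex)") do
    lines.each { |l| l.dup.slice!(tail) }
  end
end
```

Note that the variants are not byte-for-byte identical: sub and slice!(regex) leave the trailing newline in place, while slice!(i..-1) removes it.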
Another possible option is to use String#[]=
irb(main):001:0> s="foobarbaz"
=> "foobarbaz"
irb(main):002:0> line = "foobarbaz\n"
=> "foobarbaz\n"
irb(main):003:0> line[/bar.*$/] = ''
=> ""
irb(main):004:0> line
=> "foo\n"
Kind regards
robert
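One thing to watch with String#[]= when used this way: it raises IndexError if the regex does not match, so lines without the pattern need a guard. A minimal sketch of that behaviour, with String#slice! as a non-raising alternative:

```ruby
line = "no match here\n"

begin
  line[/bar.*$/] = ""   # raises IndexError because the regexp did not match
rescue IndexError => e
  puts "unmatched line: #{e.class}"
end

# String#slice! with a regexp returns nil instead of raising when nothing matches.
line.slice!(/bar.*$/)
print line              # the line is unchanged

matched = "foobarbaz\n"
matched.slice!(/bar.*$/)
print matched           # the newline is preserved because "." stops before it
```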
On Fri, Jul 15, 2011 at 10:52:54AM +0900, Josh C. wrote:
Depending on how the string is to be interpreted, you may need to escape it
before creating it:
str = "a.b"
/#{str}/ # => /a.b/
Regexp.new(str) # => /a.b/
Regexp.new(Regexp.escape(str)) # => /a\.b/
That’s a very good point. Things get tricky when you start stuffing
strings into regexen and so on. If you (in general – not Josh C.,
per se) are not yet familiar with the various unit testing libraries
available for Ruby, now might be a good time to familiarize yourself at
least with the basics of Test::Unit, the Ruby standard library’s
equivalent to NUnit.
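As a starting point, a test file for this problem might look something like the sketch below. It assumes the test-unit library is available (recent Ruby versions ship it as a bundled gem rather than in the standard library proper), and strip_from_match is a hypothetical helper wrapping the sub-based approach, not code from this thread.

```ruby
require "test/unit"

# Hypothetical helper wrapping the sub-based approach discussed above.
def strip_from_match(line, pattern)
  line.sub(/#{pattern}.*/, "")
end

class StripFromMatchTest < Test::Unit::TestCase
  def test_removes_match_and_rest_of_line
    assert_equal "abc\n", strip_from_match("abcXXdef\n", /XX/)
  end

  def test_leaves_unmatched_line_alone
    assert_equal "abcdef\n", strip_from_match("abcdef\n", /XX/)
  end

  def test_string_pattern_is_treated_as_regex_source
    # An unescaped "." in a string pattern matches any character...
    assert_equal "x", strip_from_match("xaZb tail", "a.b")
    # ...while an escaped one only matches a literal dot.
    assert_equal "xaZb tail", strip_from_match("xaZb tail", Regexp.escape("a.b"))
  end
end
```

Running the file with ruby should execute the tests automatically when the script exits.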
Depending on how big the files are, efficiency may be very important. In
that case, you might benchmark a few different approaches and see which is
best. Here is another possibility:
ARGF.each_line do |line|
  index = line.index regex
  line.slice!(index..-1) if index # also removes the newline; puts adds it back
  puts line
end
This looks to me like it is likely to be more resource-intensive than my
example, thanks to the sheer number of operations involved, and the fact
that I do not think .* would impose too much additional overhead on the
regex match – though maybe the interpolation in my version would be
somewhat expensive. Then again, I don’t know much about the
implementation of methods that come with the language, so don’t just take
my word for it. As Josh suggests, you should at least loosely benchmark
performance of various implementations if you find your first attempt is
not performing well enough for your needs.