Is there a more efficient way to remove data from a string?

dubstep · July 15, 2011, 2:37am

Hi

I have a regex as follows

REG1 = Regexp.new('dodgydata')

Basically, I need to read a file in line by line, and on each line
that’s a match for my regex, remove everything from the point of the
regex to the end of that line. Is there a more efficient way to do it
than what I have at the moment? I have some HUGE files to process, and
many more regex matches to capture… :-/

ARGF.each_line do |line|
  if line.index(REG1)
    a = line.index(REG1)
    b = line.length
    line.slice!(a...b)
    line << "\n" # Add a newline back to the end of the line
  end

  print line
end

(As you can see I’m no ruby expert!). I’d like to be able to make this
more efficient, just doesn’t seem a very Ruby way to do things… All
help gratefully accepted!

Regards

Eddie

Eddie_Catflap · July 15, 2011, 2:54am

I’d use a regex and the String#split method:

ruby-1.9.2-p180 :004 > line = "abcXXdef"
=> "abcXXdef"
ruby-1.9.2-p180 :005 > line.split(/XX/).first
=> "abc"
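A caveat on the split approach: when the match is followed by the line's newline, the newline is dropped along with the rest, and splitting an empty string returns an empty array, so `.first` is `nil`. A minimal sketch that guards against both, assuming a hypothetical helper named `strip_from` (not from the thread):

```ruby
# Sketch: keep only the text before the first match of a pattern.
# `strip_from` is a hypothetical helper name used for illustration.
def strip_from(line, pattern)
  # A limit of 2 stops splitting after the first match; `|| ""` covers
  # the empty-array case (e.g. splitting an empty string).
  head = line.split(pattern, 2).first || ""
  # Normalise: matching and non-matching lines both come back without "\n".
  head.chomp
end

puts strip_from("abcXXdef\n", /XX/)
puts strip_from("no match here\n", /XX/)
```

Because the helper always chomps, the caller can use `puts` for every line without doubling newlines.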

Eddie_Catflap · July 15, 2011, 2:55am

You probably want something like:

# this won't remove trailing newlines
regex = /thing_to_match.*$/

file.each do |line|
  processed_line = line.gsub(regex, '')
  print processed_line
end

~ jf

John F.
Principal Consultant, BitsBuilder
LI: http://www.linkedin.com/in/johnxf
SO: User John Feminella - Stack Overflow

Eddie_Catflap · July 15, 2011, 3:15am

On Fri, Jul 15, 2011 at 09:54:48AM +0900, Chad P. wrote:

I’d probably do it something like this:
ARGF.each_line do |line|
  line.sub!(/#{reg1}.*/, '')
  puts line
end

Actually, I did too literal a translation of your code. If I was writing
this from scratch, I probably would have done this instead:

ARGF.each_line {|line| puts line.sub(/#{reg1}.*/, '') }

. . . or this:

ARGF.each_line do |line|
  puts line.sub(/#{reg1}.*/, '')
end

Which I would choose would depend on whether I felt it looked better
amidst my other code in the file on one line or three, and on whether I
expected I might need to add more lines later.

Eddie_Catflap · July 15, 2011, 3:21am

Awesome stuff. Thanks to all that replied. It really is appreciated.

Regards

Eddie_Catflap · July 15, 2011, 2:58am

On Fri, Jul 15, 2011 at 09:37:14AM +0900, Eddie Catflap wrote:

many more regex matches to capture… :-/
I’d probably do it something like this:

ARGF.each_line do |line|
  line.sub!(/#{reg1}.*/, '')
  puts line
end

. . . unless I could be sure the end of the regex in reg1 included .*
after the match you want to find, in which case the second line of that
would look more like this:

  line.sub!(reg1, '')

This, of course, assumes that reg1 actually contains a regex object. If
it contains a string, go back to my first example, because strings don’t
deal well with regex special characters.
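To illustrate the regex-versus-string distinction, here is a rough sketch that accepts either one. The helper name `to_eol_pattern` and the sample line are assumptions for illustration only:

```ruby
# Sketch: build a "pattern to end of line" regex from a Regexp or a String.
# `to_eol_pattern` is a hypothetical helper name, not from the thread.
def to_eol_pattern(pattern)
  # Strings are escaped so characters like "." match literally;
  # a Regexp is interpolated as-is and keeps its own options.
  source = pattern.is_a?(Regexp) ? pattern : Regexp.escape(pattern)
  /#{source}.*/
end

line = "keep this a.b drop this\n"
puts line.sub(to_eol_pattern("a.b"), "")   # string is treated literally
puts line.sub(to_eol_pattern(/a.b/), "")   # regex: "." matches any character
```

Since `.` does not match a newline by default, the trailing "\n" survives the substitution in both cases.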

(As you can see I’m no ruby expert!).

We all start somewhere. I wouldn’t exactly call myself an “expert”
either.

I’d like to be able to make this more efficient, just doesn’t seem a
very Ruby way to do things… All help gratefully accepted!

Note: I have not tested for execution time of my solution. I just wrote
it off the top of my head to conform to my sense of good, clean Ruby
code.
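Since the original question mentions many more patterns, one option (not proposed in the thread) is to combine them with `Regexp.union`, so each line needs a single substitution. A minimal sketch; the pattern list and the names `PATTERNS`, `CUT`, and `cut_line` are made up for illustration:

```ruby
# Sketch: combine several patterns into one "match to end of line" regex.
# The patterns below are invented examples; String entries are escaped
# automatically by Regexp.union.
PATTERNS = [/dodgydata/, /baddata/, "literal.text"]
CUT = /(?:#{Regexp.union(PATTERNS)}).*/

def cut_line(line)
  # Removes everything from the first match of any pattern up to
  # (but not including) the newline.
  line.sub(CUT, "")
end

puts cut_line("good stuff baddata and the rest\n")
puts cut_line("nothing to remove here\n")
```

Whether one combined regex is faster than several separate ones depends on the patterns and the data, so it is worth measuring.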

Eddie_Catflap · July 15, 2011, 3:53am

On Thu, Jul 14, 2011 at 8:12 PM, Chad P. wrote:

this from scratch, I probably would have done this instead:
amidst my other code in the file on one line or three, and on whether I
expected I might need to add more lines later.

–
Chad P. [ original content licensed OWL: http://owl.apotheon.org ]

Depending on how the string is to be interpreted, you may need to escape it
before creating it:

str = "a.b"
/#{str}/ # => /a.b/
Regexp.new str # => /a.b/
Regexp.new Regexp.escape str # => /a\.b/

Depending on how big the files are, efficiency may be very important. In
that case, you might benchmark a few different approaches and see which is
best. Here is another possibility:

ARGF.each_line do |line|
  index = line.index regex
  line.slice!(index...-1) if index
  puts line
end
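The benchmarking suggestion could be tried with the standard `benchmark` library along these lines. This is only a sketch on synthetic data: `PATTERN`, `PRECOMPILED`, `LINES`, and the helper names are assumptions, and real input may behave quite differently.

```ruby
require "benchmark"

# Sketch: compare approaches from this thread on made-up input lines.
PATTERN = /dodgydata/
PRECOMPILED = /#{PATTERN}.*/
LINES = Array.new(20_000) { |i| "field#{i} good stuff dodgydata trailing junk\n" }

def with_interpolated_sub(line)
  # The interpolated literal is rebuilt on every call.
  line.sub(/#{PATTERN}.*/, "")
end

def with_precompiled_sub(line)
  line.sub(PRECOMPILED, "")
end

def with_slice(line)
  copy = line.dup                      # avoid mutating the shared test data
  index = copy.index(PATTERN)
  copy.slice!(index...-1) if index     # exclusive range keeps the newline
  copy
end

Benchmark.bm(14) do |bm|
  bm.report("interpolated") { LINES.each { |l| with_interpolated_sub(l) } }
  bm.report("precompiled")  { LINES.each { |l| with_precompiled_sub(l) } }
  bm.report("index+slice")  { LINES.each { |l| with_slice(l) } }
end
```

Timings vary by Ruby version and machine, so treat any single run as a rough indication rather than a verdict.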

Eddie_Catflap · July 15, 2011, 9:11am

Another possible option is to use String#[]=

irb(main):001:0> s="foobarbaz"
=> "foobarbaz"
irb(main):002:0> line = "foobarbaz\n"
=> "foobarbaz\n"
irb(main):003:0> line[/bar.*$/] = ''
=> ""
irb(main):004:0> line
=> "foo\n"

Kind regards

robert
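One thing to watch with `String#[]=`: when the regex does not match, the assignment raises `IndexError`, which matters here because most lines in a file may not match. A small sketch contrasting it with `String#slice!`, which returns `nil` instead; the helper names are hypothetical:

```ruby
# Sketch: two in-place ways to drop a match, differing on lines without one.
# `cut_tail_strict!` and `cut_tail!` are hypothetical helper names.
def cut_tail_strict!(line, pattern)
  line[pattern] = ""     # raises IndexError when pattern does not match
  line
end

def cut_tail!(line, pattern)
  line.slice!(pattern)   # returns nil and leaves line untouched on no match
  line
end

puts cut_tail!("foobarbaz\n".dup, /bar.*$/)
puts cut_tail!("nothing to cut\n".dup, /bar.*$/)

begin
  cut_tail_strict!("nothing to cut\n".dup, /bar.*$/)
rescue IndexError => e
  puts "no match: #{e.message}"
end
```

The `.dup` calls are only there so the sketch also works when string literals are frozen.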

Eddie_Catflap · July 15, 2011, 5:17am

On Fri, Jul 15, 2011 at 10:52:54AM +0900, Josh C. wrote:

Depending on how the string is to be interpreted, you may need to escape it
before creating it:

str = "a.b"
/#{str}/ # => /a.b/
Regexp.new str # => /a.b/
Regexp.new Regexp.escape str # => /a\.b/

That’s a very good point. Things get tricky when you start stuffing
strings into regexen and so on. If you (in general – not Josh C.,
per se) are not yet familiar with the various unit testing libraries
available for Ruby, now might be a good time to familiarize yourself at
least with the basics of Test::Unit, the Ruby standard library’s
equivalent to NUnit.
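As a rough idea of what such a test might look like, here is a sketch using Test::Unit. It assumes the `test-unit` library is available (it ships with standard Ruby installs), and `strip_from_match` is a hypothetical helper wrapping the `sub` approach from earlier in the thread:

```ruby
require "test/unit"

# Hypothetical helper under test; the name is not from the thread.
def strip_from_match(line, pattern)
  line.sub(/#{pattern}.*/, "")
end

class StripFromMatchTest < Test::Unit::TestCase
  def test_removes_match_to_end_of_line
    assert_equal "keep \n", strip_from_match("keep dodgydata junk\n", /dodgydata/)
  end

  def test_leaves_non_matching_line_alone
    assert_equal "clean line\n", strip_from_match("clean line\n", /dodgydata/)
  end

  def test_escaped_string_matches_literally
    # Without Regexp.escape the "." would also match the "x" in "axb".
    pattern = Regexp.escape("a.b")
    assert_equal "x axb \n", strip_from_match("x axb a.b y\n", pattern)
  end
end
```

Running the file with `ruby` executes the test case automatically when the script exits.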

Depending on how big the files are, efficiency may be very important. In
that case, you might benchmark a few different approaches and see which is
best. Here is another possibility:
ARGF.each_line do |line|
  index = line.index regex
  line.slice!(index...-1) if index
  puts line
end

This looks to me like it is likely to be more resource-intensive than my
example, thanks to the sheer number of operations involved, and the fact
that I do not think .* would impose too much additional overhead on the
regex match – though maybe the interpolation in my version would be
somewhat expensive. Then again, I don’t know much about the
implementation of methods that come with the language, so don’t just take
my word for it. As Josh suggests, you should at least loosely benchmark
performance of various implementations if you find your first attempt is
not performing well enough for your needs.