Best practices resource/guidance for strings

Hello,

I am working with scraping quite a bit of data and I would like to make
sure that I’m following some best practices for string manipulation. I
would like to be sure to take into account any speed and garbage
collection issues.

Does anyone know of any posts, websites, books or other resources that
provide “do this, not that” types of guidance?

For example, my understanding is that globbing everything into one line
when manipulating a string is not the best use of resources.

not good
"string+var".gsub('+','').strip.capitalize

better
s = "string+var"
s.gsub('+','')
s.strip!
s.capitalize
s => 'String Var'

Are there resources that explain why one is better than the other that
also provides more best practices like this?

Thanks.

Personally, I don’t think doing things on one line is a sin in itself,
only when it is overdone! What counts as overdone depends on your
reading ability.

Splitting things onto individual lines allows you to insert logging at
various points without fear of breaking the code, which the one-line
approach does not.

However, the multiline approach can make an insignificant part of the
code take up lots of screen real estate, which can make the larger code
harder to read.

For example, x.downcase.gsub(/\s+/, ' ').strip.capitalize is a fairly
easy-to-read cleanup of a string, but if it goes multiline

x.downcase!
x.gsub!(/\s+/, ' ')
x.strip!
x.capitalize!

not only does it take up more of the screen but it has also altered x,
something that the single line version did not.

Of course, if things get really silly you could just create a function
and stuff all the code in there.
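
As a rough sketch of that last suggestion, the cleanup chain could be wrapped in a small method. The name clean_title is made up for illustration, and the sample input is an assumption. Because it uses the non-bang methods, the caller's string is left alone:

```ruby
# Hypothetical helper (the name clean_title is an assumption, not from the thread).
def clean_title(text)
  # Each non-bang call returns a new string, so `text` itself is never modified.
  text.downcase.gsub(/\s+/, ' ').strip.capitalize
end

original = "  hELLO   wORLD \n"
cleaned  = clean_title(original)

puts cleaned.inspect    # the cleaned-up copy
puts original.inspect   # the original, unchanged
```

Logging can then go inside the method without touching every place that calls it.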

Cs Webgrl wrote:

better
s = "string+var"
s.gsub('+','')
s.strip!
s.capitalize
s => 'String Var'

(You need gsub! and capitalize! of course)

Are there resources that explain why one is better than the other that
also provides more best practices like this?

Methods like capitalize! work on the existing string buffer in memory.
The non-bang methods create a whole new string, which involves work
copying it, and then later garbage-collecting the original.

Most of the non-bang methods are implemented as a dup followed by
calling the bang method on the copy. They’re written in C, but are
effectively like this:

class String
  def capitalize
    copy = dup
    copy.capitalize!   # returns nil when nothing changed, so return the copy explicitly
    copy
  end

  def capitalize!
    # scan the string and modify it in place
  end
end

Of course, in most apps the original chained code you wrote will be just
fine, and it’s easy to write and understand. If you will be processing
files which are hundreds of megabytes long then it may be worthwhile
rewriting to the second form.
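
One way to see the difference for yourself is to compare object identity. This is only a small sketch with a made-up sample string. The .dup is there so it also runs where string literals are frozen:

```ruby
s = "hello world".dup   # dup gives a mutable string even if literals are frozen

copy = s.capitalize     # non-bang: allocates and returns a new String
same = s.capitalize!    # bang: modifies s in place and returns s itself

puts copy.equal?(s)     # false, copy is a different object
puts same.equal?(s)     # true, same object as s
puts s                  # Hello world
```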

Other thoughts:

  • for large files, process them in chunks or lines rather than reading
    them all in at once

  • use block form when opening a file, to ensure it’s closed as soon as
    you’ve finished with it

File.open("/path/to/file", "rb") do |f|
  f.each_line do |line|
    # process each line here
  end
end
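
Putting the two suggestions together, a self-contained sketch might look like the following. The Tempfile and its contents are stand-ins (assumptions) for a real data file, and the cleanup steps are only examples:

```ruby
require 'tempfile'

cleaned = []

# Tempfile stands in for a real data file so the sketch can run on its own.
Tempfile.create('scrape') do |tmp|
  tmp.write("  first   line \nsecond line\n")
  tmp.flush

  # Block form: the file is closed as soon as the block finishes.
  File.open(tmp.path, 'rb') do |f|
    # each_line yields one line at a time instead of loading the whole file.
    f.each_line do |line|
      line.strip!               # in place; the return value is ignored
      line.gsub!(/\s+/, ' ')    # in place; collapses runs of whitespace
      cleaned << line
    end
  end
end

p cleaned
```

Each line is a fresh string from the file, so mutating it with bang methods does not affect anything else.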

Thanks so much for the help and guidance. Most of my data is parsed
from mechanize and broken into smaller chunks that will be manipulated to
get the final format. From my understanding, I should be ok. I
definitely agree that the conciseness of fewer lines of code is easier
to read. Just wanted to make sure that I’m not compromising speed or
garbage collection for readability on these types of methods.


On Wed, Jun 30, 2010 at 8:32 AM, Cs Webgrl wrote:

For example, my understanding is that globbing everything into one line
[...]
s.capitalize
s => 'String Var'

Are there resources that explain why one is better than the other that
also provides more best practices like this?

Thanks.

Posted via http://www.ruby-forum.com/.

I don’t know of a specific site, but if you do not need to keep the
value of string, then string << var is better than string + var, since
it mutates string rather than creating a new object. I once read
benchmarks about this, but I can’t remember where, and I can’t seem to
recreate them, so maybe I am wrong.

# plus returns a new String
string, var = 'abc', 'def'
string + var # => "abcdef"
string # => "abc"

# << mutates the receiver
string << var # => "abcdef"
string # => "abcdef"
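
The same idea applied to building up a string in a loop, as a sketch only. The parts array is made-up sample data, and String.new is used so the buffer is mutable even where string literals are frozen:

```ruby
parts = %w[alpha beta gamma]

# << appends to the same String object on every pass.
buffer = String.new
id_before = buffer.object_id
parts.each { |part| buffer << part << ' ' }
buffer.strip!
puts buffer.object_id == id_before   # true, still the same object

# += builds a brand new String each time and rebinds the variable to it.
joined = ''
parts.each { |part| joined += part + ' ' }
puts joined.strip == buffer          # true, same text either way
```

For a handful of short strings the difference is unlikely to matter. It may start to show when appending many times in a tight loop.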

You can use s.delete('+') instead of s.gsub('+','') and it will be
faster, prettier, and more expressive.

I expect the reason you heard that it is better to do it on multiple
lines is that it then lets you use the bang methods, which, for whatever
reason, return nil if they don’t mutate the object. In general, it is
faster to say s.capitalize! than s.capitalize because the bang version
mutates s itself, whereas the non-bang version creates a new, modified
object. If we are not interested in keeping the original value of s,
creating all these objects adds up.

# capitalize returns the capitalized version regardless of the original
# string, so you can use it in the middle of a method chain
'Abc'.capitalize # => "Abc"
'abc'.capitalize # => "Abc"

# don't use capitalize! in the middle of a method chain because it can
# return nil
'Abc'.capitalize! # => nil
'abc'.capitalize! # => "Abc"

# capitalize creates a new string, so is less efficient if you don't care
# about the original
# also does not modify the receiver, so you have to capture its result
s = 'abc'
s.capitalize # => "Abc"
s # => "abc"

# capitalize! mutates the original string, so is more efficient if you
# don't care about the original
# does modify the receiver, so don't have to capture its result
# in fact, don't capture its result, because as shown above, result could
# be nil
s = 'abc'
s.capitalize! # => "Abc"
s # => "Abc"
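
To make the nil pitfall concrete, here is a small sketch with made-up sample strings. The .dup calls are only there so it runs where string literals are frozen:

```ruby
s = 'Abc'.dup

# s is already capitalized, so capitalize! changes nothing and returns nil.
result = s.capitalize!
p result   # nil

# Chaining on that result, e.g. s.capitalize!.strip, would raise
# NoMethodError because nil has no strip method.

# Safer: call each bang method as its own statement and keep using the variable.
t = " string var ".dup
t.strip!
t.capitalize!
p t        # "String var"
```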

On Wed, Jun 30, 2010 at 4:10 PM, Josh C. wrote:

You can use s.delete('+') instead of s.gsub('+','') and it will be faster,
prettier, and more expressive.

This is wrong: delete removes the intersection of characters, so you do
need to use gsub. I guess the speed comparison is not relevant, but it is
still uglier and less expressive, though more correct.
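
It may be worth checking this directly before settling on either call. As far as the String#delete documentation goes, the intersection only matters when several arguments are passed. With a single argument, every character listed in it is removed. A quick sketch with a made-up sample string:

```ruby
s = 'string+var+more'

by_gsub   = s.gsub('+', '')   # replace every "+" with nothing
by_delete = s.delete('+')     # one argument: remove every "+"

puts by_gsub
puts by_delete
puts by_gsub == by_delete

# With several arguments, only characters present in ALL of them are removed.
puts 'hello'.delete('l', 'lo')   # only "l" is in both sets
```

So for stripping a single known character the two appear interchangeable. gsub is still the one to reach for when the match is a pattern or a multi-character substring.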

Awesome guidelines. Thank you so much for taking the time to write this
up and help me understand how everything works.

Much appreciated Josh!