String splitting and inclusion

sclarke · July 21, 2009, 5:52pm

Hi all,

I am having trouble working out some logic for my problem. I basically
have a long string (320 characters) and I want to split into smaller
strings no longer than 50 characters in length. At present I have the
following regex:

data = "big long string"

puts data.scan(/{50}/)

This nicely breaks up the string; however, there are a few problems with
it, including:

It only outputs 50-character chunks, so when it gets to the end and
only 20 characters remain, it leaves them out of the output (it outputs
six 50-character strings and ignores the remaining 20).

This regex also splits up words, which is something I don’t want. I want
a script to count to 50 and, when it gets there, go backwards to find
some white space and split at that point, so that no word is broken up.
As a result, a number of substrings of various sizes will be created,
none longer than 50 characters.

I hope this makes sense. To summarise, I want to break up a string into
chunks of at most 50 characters without breaking up words.

Thanks in advance

Stuart
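A minimal sketch of the logic described above (count to the limit, step back to the nearest space, cut there). The method name wrap_at_spaces and its width parameter are made up for illustration. It assumes words are separated by plain spaces, and it hard-cuts any single word that is longer than the limit.

```ruby
# Hypothetical helper: split text into chunks of at most `width` characters,
# breaking at the last space found at or before the limit.
def wrap_at_spaces(text, width = 50)
  chunks = []
  rest = text.strip
  until rest.empty?
    # Whatever is left fits in one chunk: emit it and stop.
    if rest.length <= width
      chunks << rest
      break
    end
    # Search backwards from index `width` for a space to break on.
    cut = rest.rindex(" ", width)
    # No space in range means one word exceeds `width`: cut it hard.
    cut = width if cut.nil?
    chunks << rest[0...cut].rstrip
    rest = rest[cut..-1].lstrip
  end
  chunks
end

p wrap_at_spaces("Some words are made of letters! Some are not!", 10)
# => ["Some words", "are made", "of", "letters!", "Some are", "not!"]
```

Because the remainder is picked up by the final short chunk, nothing is dropped at the end of the string.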

sclarke · July 21, 2009, 6:12pm

s = "a bad day in the office today, " * 3
puts "Attention, some backtracking here:"
puts s.scan( /.{,20}\b/ )
puts "I cannot come up with a non backtracking solution right now :("

HTH
Robert

On 7/21/09, Stuart C. wrote:

word. As a result a number of sub strings of various sizes will be

–
Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]

sclarke · July 21, 2009, 10:16pm

Stuart C. wrote:

Hi all,

I am having trouble working out some logic for my problem. I basically
have a long string (320 characters) and I want to split into smaller
strings no longer than 50 characters in length. At present I have the
following regex:

data = "big long string"

puts data.scan(/{50}/)

error: invalid regular expression; there’s no previous pattern, to which
‘{’ would define cardinality
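Assuming the pattern was meant to be /.{50}/ (with the dot lost somewhere along the way), the missing tail comes from demanding exactly 50 characters per match. A range quantifier such as {1,50} also picks up a shorter final chunk. A small sketch with a made-up 120-character string:

```ruby
data = "a" * 120   # stand-in for the real text (an assumption)

exact = data.scan(/.{50}/)     # only full 50-character chunks
upto  = data.scan(/.{1,50}/)   # full chunks plus the shorter remainder

p exact.map(&:length)   # => [50, 50]
p upto.map(&:length)    # => [50, 50, 20]
```

This only addresses the dropped remainder; it still cuts words in half.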

data =<<ENDOFSTRING
Hello world. Hello moon.
Goodbye world. Goodbye moon.

Hello world. Hello moon.
Goodbye world. Goodbye moon.
The end.
ENDOFSTRING

chunks = []
curr_chunk = []
curr_length = 0

data.scan(/.+?\b/m) do |word|
  wlen = word.length

  if curr_length + wlen <= 50
    curr_chunk << word
    curr_length += wlen
  else
    chunks << curr_chunk.join()
    curr_chunk = [word]
    curr_length = wlen
  end
end

if curr_chunk.length > 0
  chunks << curr_chunk.join()
end

p chunks

chunks.each do |chunk|
  puts chunk.length
end

sclarke · July 21, 2009, 10:34pm

--output:--
["Hello world. Hello moon.\nGoodbye world. Goodbye ",
 "moon.\n\nHello world. Hello moon.\nGoodbye world. ",
 "Goodbye moon.\nThe end"]
49
48
21

Hmmm…I’m having a problem getting the ending period while using the
word boundary in the regex. I guess that’s because there is no start of
a word after the ending period for the regex to match. \s works:

data.scan(/.+?\s/m) do |word|
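A sketch of the same accumulate-until-full idea wrapped in a method. The name chunk_words and the token pattern /\S+\s*/ are my own choices, not part of the code above. Each token is a word plus whatever whitespace follows it, so the last word and its period are kept even when nothing follows them. A single token longer than the limit would still end up as an oversized chunk.

```ruby
# Hypothetical helper: group whitespace-delimited tokens into chunks
# whose length does not exceed `limit`.
def chunk_words(text, limit = 50)
  chunks = []
  current = ""
  # A token is a run of non-whitespace plus any whitespace after it.
  text.scan(/\S+\s*/) do |token|
    # Start a new chunk when adding this token would overflow the limit.
    if !current.empty? && current.length + token.length > limit
      chunks << current
      current = ""
    end
    current += token
  end
  chunks << current unless current.empty?
  chunks
end

data = <<ENDOFSTRING
Hello world. Hello moon.
Goodbye world. Goodbye moon.

Hello world. Hello moon.
Goodbye world. Goodbye moon.
The end.
ENDOFSTRING

chunks = chunk_words(data)
p chunks
p chunks.map(&:length)
```

Since the whitespace stays attached to the tokens, joining the chunks gives back the original text.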

sclarke · July 22, 2009, 2:44am

I hope this makes sense, to summarise I want to break up a string into a
max of 50 characters without breaking up words.

Thanks in advance

Stuart

str = "I did not test this completely so you may need to make some
adjustments to this, but give it a try. This cuts on twenty instead of
fifty characters."

(str.length/20).times do
  arr = str.split(//)
  ess = arr.zip((0...arr.length).to_a)
  tee = ess.reverse.detect{|y| y[0] == " " and y[1] <= 20}
  p str.slice!(0...tee[1]).strip
end
p str

Harry

sclarke · July 22, 2009, 3:03am

Hi –

On Wed, 22 Jul 2009, Stuart C. wrote:

word. As a result a number of sub strings of various sizes will be
created all less than 50 chars.

I hope this makes sense, to summarise I want to break up a string into a
max of 50 characters without breaking up words.

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

David

sclarke · July 22, 2009, 6:59am

David A. Black wrote:

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

Oh my.

sclarke · July 22, 2009, 10:14am

On 7/22/09, David A. Black wrote:

strings no longer than 50 characters in length. At present I have the
and only 20 characters remain it misses them off the output (it outputs

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

Hmm, the \b at the end of my solution might have been a problem in some
edge cases; however, I would suggest using \z instead of $ and
the m switch. I fail to see why you put a \b at the beginning, David.
Would you mind explaining?

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/

BTW it seems that {n,m} does not have a “non greedy” and “possessive”
variant, or did I miss it?

Cheers
Robert

sclarke · July 22, 2009, 11:06am

On 7/22/09, Robert D. wrote:

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

s.split( /(.{,10}\S)\s/ ).reject( &:empty? )

sclarke · July 22, 2009, 10:56am

On 7/22/09, Robert D. wrote:

I am having trouble working out some logic for my problem. I basically

I hope this makes sense, to summarise I want to break up a string into a
would you mind to explain?

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/
Nahh that leaves us with spaces at the beginning of the line, of
course we could do
scan(…).map( &:lstrip ) but that hurts my regex pride

This seems to work (but does not really):

s = "Some words are made of letters! Some are not!"
puts s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ )

Replace the puts with p and you will see trailing whitespace now :(.

This is a little bastard of a problem indeed. Simplest I could come up
with so far:

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

HTH
Robert

BTW it seems that {n,m} does not have a “non greedy” and “possessive”
variant, or did I miss it?
Yes I did, they are there {n,m}? and {n,m}+, sorry.

Cheers
Robert
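For reference, a small sketch of how the greedy and lazy forms of {n,m} behave, using made-up strings. One caveat that I am fairly sure of but have not checked against every Ruby version: in Ruby's own regex syntax, {n,m}+ is read as a repeated group rather than as a possessive quantifier. The sketch therefore uses an atomic group (?>...) to show matching without backtracking.

```ruby
s = "aaaa"

p s[/a{1,3}/]    # greedy: takes as many as allowed => "aaa"
p s[/a{1,3}?/]   # lazy: takes as few as allowed    => "a"

# Plain greedy quantifier: gives one "a" back so the final a can match.
p "aaa".match?(/\Aa{1,3}a\z/)       # => true

# Atomic group: keeps all three, leaving nothing for the final a.
p "aaa".match?(/\A(?>a{1,3})a\z/)   # => false
```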


sclarke · July 22, 2009, 11:40am

On 7/22/09, Robert D. wrote:

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

s.split( /(.{,10}\S)\s/ ).reject( &:empty? )

good enough? Certainly not
s.split( /(.{,10}\S)\s+/ ).reject( &:empty? )

Robert


sclarke · July 22, 2009, 12:49pm

Hi –

On Wed, 22 Jul 2009, Robert D. wrote:

have a long string (320 characters) and I want to split into smaller
It only outputs 50 character chunks, therefore when it gets to the end
max of 50 characters without breaking up words.

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)
Hmm my \b at the end of my solution might have been a problem in some
edge cases, however I would suggest the usage of \z instead of $ and
the m switch. I fail to see why you put a \b at the beginning David,
would you mind to explain?

The idea was to start every scan at a \b. It’s definitely not an
all-purpose solution to the problem anyway. For one thing, it doesn’t
handle words of more than 50 characters – which probably doesn’t
matter, unless you’re using it with a number less than 50:

str
=> "this is a string and i intend to split it up into little strings"

str.scan(/\b.{0,5}(?:$|\b)/m)
=> ["this ", "is a ", "", " and ", "i ", "", " to ", "split", " it ",
"up ", "into ", "", " ", "", ""]

Without the first \b you get:

["this ", "is a ", "", "tring", " and ", "i ", "", "ntend", " to ",
"split", " it ", "up ", "into ", "", "ittle", " ", "", "rings", ""]

So… further tweaking required

David
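One possible tweak for the long-word case, as a sketch only. The pattern and the name scan_wrap are mine, not from the messages above. Each chunk must start and end on a non-space character and be followed by whitespace or the end of the string. A word that cannot fit is emitted whole by the second alternative, so such chunks exceed the width instead of being mangled. It assumes a width of at least 2. Without the m switch, a chunk never spans a newline.

```ruby
# Hypothetical helper; assumes width >= 2.
def scan_wrap(str, width)
  # First alternative: up to `width` characters, starting and ending on a
  # non-space, with whitespace or end of string right after.
  # Second alternative: a whole word that is too long to fit.
  str.scan(/\S.{0,#{width - 2}}\S(?=\s|\z)|\S+/)
end

str = "this is a string and i intend to split it up into little strings"
p scan_wrap(str, 5)
# => ["this", "is a", "string", "and i", "intend", "to", "split", "it up",
#     "into", "little", "strings"]
```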

sclarke · July 22, 2009, 12:51pm

Robert D. wrote:

On 7/22/09, David A. Black wrote:

strings no longer than 50 characters in length. At present I have the
and only 20 characters remain it misses them off the output (it outputs

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)
I fail to see why you put a \b at the beginning David,
would you mind to explain?

Yes. Please explain that. Also please explain why you don’t have
{1,50}?

Or, will you claim the 5th under the robustness disclaimer?

sclarke · July 22, 2009, 12:59pm

On Wed, 22 Jul 2009, 7stud – wrote:

would you mind to explain?

Yes. Please explain that. Also please explain why you don’t have
{1,50}?

Or, will you claim the 5th under the robustness disclaimer?

I won’t claim anything. Feel free to experiment with the code, which
I’ve already said repeatedly isn’t a full solution, and see what you
come up with.

David

sclarke · July 22, 2009, 1:59pm

On 7/22/09, David A. Black wrote:

I won’t claim anything. Feel free to experiment with the code, which
I’ve already said repeatedly isn’t a full solution, and see what you
come up with.

Indeed this is very tricky. I had some doubts about your leading \b
example, but I experimented with lots of solutions and they were
covering it up. Thx for explaining. Unless the OP says what he really
wants, I shall stop so as not to make too much noise; e.g. there is the
issue of more than one space, and of course punctuation.
R.

sclarke · July 22, 2009, 12:50pm

Hi –

On Wed, 22 Jul 2009, 7stud – wrote:

David A. Black wrote:

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

Oh my.

It’s got some problems; see the message I just posted about word
length.

David

sclarke · July 23, 2009, 11:19am

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/
Nahh that leaves us with spaces at the beginning of the line

p str.scan(/\s*(.{1,50})(?!\S)/)

Or, if there are consecutive spaces between words,
squeeze them out first.

p str.squeeze(" ").scan(/\s*(.{1,50})(?!\S)/)

Harry
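Two things worth knowing when trying the pattern above, shown in a short sketch with a limit of 10 and made-up input. Because the regex contains a capture group, scan returns an array of one-element arrays, so flatten gives plain strings. Also, a word longer than the limit is not split: the scan skips ahead until the rest of the word fits, so its beginning is silently lost.

```ruby
width = 10
re = /\s*(.{1,#{width}})(?!\S)/

s = "Some words are made of letters! Some are not!"
p s.scan(re)           # nested: [["Some words"], ["are made"], ...]
p s.scan(re).flatten   # => ["Some words", "are made", "of", "letters!", "Some are", "not!"]

# A 16-character word with a limit of 10: only the last 10 characters survive.
p "abcdefghijklmnop".scan(re).flatten   # => ["ghijklmnop"]
```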