String splitting and inclusion

Hi all,

I am having trouble working out some logic for my problem. I basically
have a long string (320 characters) and I want to split into smaller
strings no longer than 50 characters in length. At present I have the
following regex:

data = "big long string"

puts data.scan(/{50}/)

This nicely breaks up the string however there are a few problems with
it, including:

It only outputs 50-character chunks, so when it gets to the end and
only 20 characters remain, it leaves them out of the output (it outputs
six 50-character strings and ignores the remaining 20).

This regex also splits up words, which is something I don’t want. I want
a script to count to 50 and, when it gets there, go backwards to find
some whitespace and split at that point, so that no word is broken up.
The result will be a number of substrings of various sizes, none longer
than 50 characters.
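
The first problem comes from the fixed quantifier: a pattern that demands exactly 50 characters cannot match a 20-character tail. A minimal sketch, assuming the intended pattern was /.{50}/ (with the dot) and using a made-up 320-character string, suggests that a range quantifier such as {1,50} keeps the remainder. It still cuts through words, so it only addresses the first point.

```ruby
# Made-up input: 320 characters, as in the description above.
data = "x" * 320

# Assumption: the pattern meant was /.{50}/ (any 50 characters).
fixed  = data.scan(/.{50}/)    # exact count: the 20-character tail cannot match
ranged = data.scan(/.{1,50}/)  # range: the tail matches as a shorter chunk

puts fixed.length        # number of chunks with the exact quantifier
puts ranged.length       # number of chunks with the range quantifier
puts ranged.last.length  # length of the final, shorter chunk
```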

I hope this makes sense. To summarise, I want to break up a string into
chunks of at most 50 characters without breaking up words.

Thanks in advance

Stuart
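
The "count to 50, then go backwards to the nearest whitespace" idea can be sketched without a clever regex. This is only one possible reading of the request: the method name wrap_at_whitespace is made up for illustration, and hard-splitting a word that is longer than the limit is an assumption, since the question does not say what should happen in that case.

```ruby
# Hypothetical helper (name and behaviour are assumptions, not from the thread).
# Splits text into chunks of at most max characters, cutting at whitespace.
def wrap_at_whitespace(text, max = 50)
  chunks = []
  rest = text.strip
  until rest.empty?
    if rest.length <= max
      chunks << rest
      break
    end
    # Look at the first max + 1 characters and find the last whitespace there.
    cut = rest[0, max + 1].rindex(/\s/)
    # Assumption: a word longer than max is simply cut at max characters.
    cut = max if cut.nil? || cut.zero?
    chunks << rest[0, cut].rstrip
    rest = rest[cut..-1].lstrip
  end
  chunks
end

sample = "a bad day in the office today, " * 3
puts wrap_at_whitespace(sample, 20)
```

Whitespace at each cut point is dropped, so the chunks do not carry leading or trailing spaces.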

s = "a bad day in the office today, " * 3
puts "Attention, some backtracking here:"
puts s.scan( /.{,20}\b/ )
puts "I cannot come up with a non backtracking solution right now :("

HTH
Robert

On 7/21/09, Stuart C. wrote:

word. As a result a number of sub strings of various sizes will be


Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]

Stuart C. wrote:

data = "big long string"

puts data.scan(/{50}/)

error: invalid regular expression; there’s no previous pattern, to which
‘{’ would define cardinality

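data = <<ENDOFSTRING
Hello world. Hello moon.
Goodbye world. Goodbye moon.

Hello world. Hello moon.
Goodbye world. Goodbye moon.
The end.
ENDOFSTRING

chunks = []
curr_chunk = []
curr_length = 0

# Walk the string one token at a time (a word, or the run between two words).
data.scan(/.+?\b/m) do |word|
  wlen = word.length

  if curr_length + wlen <= 50
    # The token still fits in the current chunk.
    curr_chunk << word
    curr_length += wlen
  else
    # The token would overflow: close the current chunk and start a new one.
    chunks << curr_chunk.join
    curr_chunk = [word]
    curr_length = wlen
  end
end

# Flush whatever is left in the last, partially filled chunk.
if curr_chunk.length > 0
  chunks << curr_chunk.join
end

p chunks

chunks.each do |chunk|
  puts chunk.length
end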

--output:--
["Hello world. Hello moon.\nGoodbye world. Goodbye ", "moon.\n\nHello world. Hello moon.\nGoodbye world. ", "Goodbye moon.\nThe end"]
49
48
21

Hmmm…I’m having a problem getting the ending period while using the
word boundary in the regex. I guess that’s because there is no start of
a word after the ending period for the regex to match. \s works:

data.scan(/.+?\s/m) do |word|
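
The difference between the two tokenizers can be seen on a shorter string. This is a small sketch with a made-up sample, not the data above: with \b the final ".\n" is never followed by a word boundary, so it is dropped, while with \s every token ends in whitespace and nothing is lost as long as the string itself ends in whitespace.

```ruby
# Made-up sample that ends with a period and a newline.
sample = "Hello world. The end.\n"

by_boundary = sample.scan(/.+?\b/m)  # tokens end at a word boundary
by_space    = sample.scan(/.+?\s/m)  # tokens end with one whitespace character

p by_boundary.join  # the trailing ".\n" is missing
p by_space.join     # the whole string is reproduced
```

If the input does not end in whitespace, the \s version would in turn lose the last word, so something like /.+?(?:\s|\z)/m may be worth trying in that case.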

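str = "I did not test this completely so you may need to make some " \
      "adjustments to this, but give it a try. This cuts on twenty instead of " \
      "fifty characters."

(str.length/20).times do
  arr = str.split(//)
  # Pair every character with its index.
  ess = arr.zip((0...arr.length).to_a)
  # Find the last space at or before index 20.
  tee = ess.reverse.detect{|y| y[0] == " " and y[1] <= 20}
  # Cut everything before that space off the front of str and print it.
  p str.slice!(0...tee[1]).strip
end
p str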

Harry

Hi –

On Wed, 22 Jul 2009, Stuart C. wrote:

word. As a result a number of sub strings of various sizes will be
created all less than 50 chars.

I hope this makes sense, to summarise I want to break up a string into a
max of 50 characters without breaking up words.

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

David

David A. Black wrote:

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

Oh my.

On 7/22/09, David A. Black wrote:

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)
Hmm, my \b at the end of my solution might have been a problem in some
edge cases; however, I would suggest using \z instead of $ and
the m switch. I fail to see why you put a \b at the beginning, David;
would you mind explaining?
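
The difference between $ and \z matters as soon as the string contains newlines: in Ruby, $ matches at the end of every line, while \z matches only at the very end of the string. A short sketch with a made-up two-line string:

```ruby
# Made-up two-line sample.
text = "one\ntwo"

p text =~ /one$/   # $ matches before the newline, so this finds "one" at index 0
p text =~ /one\z/  # \z needs the end of the whole string, so no match (nil)
p text =~ /two\z/  # "two" really is at the end of the string, index 4
```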

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/

BTW it seems that {n,m} does not have a “non greedy” and “possessive”
variant, or did I miss it?

Cheers
Robert

On 7/22/09, Robert D. wrote:

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/
Nahh, that leaves us with spaces at the beginning of the line. Of
course we could do
scan(...).map( &:lstrip ), but that hurts my regex pride ;)

This seems to work (but does not really):

s = "Some words are made of letters! Some are not!"
puts s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ )

Replace the puts with p and you will see trailing whitespace now :(.

This is a little bastard of a problem indeed. Simplest I could come up
with so far:

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )
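
Traced by hand, the last version gives the chunks below. One thing to watch, as far as I can tell: the pattern allows up to 10 arbitrary characters plus one more graphic character, so a stripped chunk can be 11 characters long, not 10. The variable names chunks and longest are only for illustration.

```ruby
s = "Some words are made of letters! Some are not!"

# Up to 10 of anything, then one visible character, then a separator or the end.
chunks = s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

p chunks
# The longest chunk shows the real upper bound of this pattern.
longest = chunks.map( &:length ).max
p longest
```

If the limit has to be exact, using {,9} in place of {,10} would be one way to compensate.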

HTH
Robert

BTW it seems that {n,m} does not have a “non greedy” and “possessive”
variant, or did I miss it?
Yes I did, they are there {n,m}? and {n,m}+, sorry.
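
A quick check of the greedy and the non greedy form on a made-up string. The {n,m}+ form is left out of this sketch because what it means depends on the syntax mode of the regex engine.

```ruby
# Made-up sample: four times the same letter.
sample = "aaaa"

greedy = sample[/a{1,3}/]   # takes as many as allowed
lazy   = sample[/a{1,3}?/]  # takes as few as allowed

p greedy
p lazy
```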

Cheers
Robert



On 7/22/09, Robert D. wrote:

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

s.split( /(.{,10}\S)\s/ ).reject( &:empty? )

good enough? Certainly not :(
s.split( /(.{,10}\S)\s+/ ).reject( &:empty? )
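
The reason for \s+ shows up when two words are separated by more than one space. A small sketch with a made-up string (gappy, one_space and many_spaces are names chosen here for illustration):

```ruby
# Made-up sample with two spaces between the words.
gappy = "Some  words"

# \s consumes only one separator, so the second space stays with the next piece.
one_space   = gappy.split( /(.{,10}\S)\s/ ).reject( &:empty? )
# \s+ consumes the whole run of separators.
many_spaces = gappy.split( /(.{,10}\S)\s+/ ).reject( &:empty? )

p one_space
p many_spaces

s = "Some words are made of letters! Some are not!"
p s.split( /(.{,10}\S)\s+/ ).reject( &:empty? )
```

As with the scan version, the captured piece can be 11 characters long, because \S adds one character to the 10 allowed by {,10}.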

Robert



Hi –

On Wed, 22 Jul 2009, Robert D. wrote:

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)
Hmm my \b at the end of my solution might have been a problem in some
edge cases, however I would suggest the usage of \z instead of $ and
the m switch. I fail to see why you put a \b at the beginning David,
would you mind to explain?

The idea was to start every scan at a \b. It’s definitely not an
all-purpose solution to the problem anyway. For one thing, it doesn’t
handle words of more than 50 characters – which probably doesn’t
matter, unless you’re using it with a number less than 50:

str
=> "this is a string and i intend to split it up into little strings"

str.scan(/\b.{0,5}(?:$|\b)/m)
=> ["this ", "is a ", "", " and ", "i ", "", " to ", "split", " it ",
"up ", "into ", "", " ", "", ""]

Without the first \b you get:

["this ", "is a ", "", "tring", " and ", "i ", "", "ntend", " to ",
"split", " it ", "up ", "into ", "", "ittle", " ", "", "rings", ""]

So... further tweaking required :)

David

Robert D. wrote:

On 7/22/09, David A. Black wrote:

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)
I fail to see why you put a \b at the beginning David,
would you mind to explain?

Yes. Please explain that. Also please explain why you don’t have
{1,50}?

Or, will you claim the 5th under the robustness disclaimer?

On Wed, 22 Jul 2009, 7stud – wrote:

would you mind to explain?

Yes. Please explain that. Also please explain why you don’t have
{1,50}?

Or, will you claim the 5th under the robustness disclaimer?

I won’t claim anything. Feel free to experiment with the code, which
I’ve already said repeatedly isn’t a full solution, and see what you
come up with.

David

On 7/22/09, David A. Black wrote:

I won’t claim anything. Feel free to experiment with the code, which
I’ve already said repeatedly isn’t a full solution, and see what you
come up with.
Indeed this is very tricky. I had some doubts about your leading \b
example, but I experimented with lots of solutions and they were
covering it up. Thanks for explaining. Unless the OP says what he really
wants, I shall stop here so as not to make too much noise; e.g. there is
the issue of more than one space between words and, of course, punctuation.
R.

Hi –

On Wed, 22 Jul 2009, 7stud – wrote:

David A. Black wrote:

Try this. I don’t guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

Oh my.

It’s got some problems; see the message I just posted about word
length.

David

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/
Nahh that leaves us with spaces at the beginning of the line

p str.scan(/\s*(.{1,50})(?!\S)/)

Or, if there are consecutive spaces between words,
squeeze them out first.

p str.squeeze(" ").scan(/\s*(.{1,50})(?!\S)/)

Harry