Hi all,
I am having trouble working out the logic for my problem. I have a
long string (320 characters) and I want to split it into smaller
strings no longer than 50 characters each. At present I have the
following regex:
data = "big long string"
puts data.scan(/{50}/)
This breaks up the string nicely, but there are a few problems with it:
- It only outputs 50-character chunks, so when it gets to the end and
  only 20 characters remain, it leaves them out (it outputs six
  50-character strings and ignores the remaining 20).
- It also splits words, which is something I don’t want. I want the
  script to count to 50 and, when it gets there, go backwards to find
  some whitespace and split at that point, so that no word is broken.
  The result will be a number of substrings of various sizes, none
  longer than 50 characters.
I hope this makes sense. To summarise: I want to break a string into
chunks of at most 50 characters without breaking up words.
Thanks in advance
Stuart
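For the first problem (the dropped remainder), a bounded quantifier may be enough on its own. This is only a minimal sketch, using a made-up 120-character sample string rather than the original data, and it still splits words, so it addresses the first point only:

```ruby
# Sample data invented for illustration: 120 characters, not the original string.
data = "x" * 120

fixed    = data.scan(/.{50}/)     # exactly 50 characters per match
flexible = data.scan(/.{1,50}/)   # between 1 and 50 characters per match

p fixed.map(&:length)     # → [50, 50]
p flexible.map(&:length)  # → [50, 50, 20]
```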
s = "a bad day in the office today, " * 3
puts "Attention, some backtracking here:"
puts s.scan( /.{,20}\b/ )
puts "I cannot come up with a non backtracking solution right now :("
HTH
Robert
On 7/21/09, Stuart C. [email protected] wrote:
word. As a result a number of sub strings of various sizes will be
–
Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.
All adults have been children first, but not many remember.
[Antoine de Saint-Exupéry]
Stuart C. wrote:
Hi all,
I am having trouble working out some logic for my problem. I basically
have a long string (320 characters) and I want to split into smaller
strings no longer than 50 characters in length. At present I have the
following regex:
data = "big long string"
puts data.scan(/{50}/)
error: invalid regular expression; there’s no previous pattern, to which
‘{’ would define cardinality
data =<<ENDOFSTRING
Hello world. Hello moon.
Goodbye world. Goodbye moon.
Hello world. Hello moon.
Goodbye world. Goodbye moon.
The end.
ENDOFSTRING
chunks = []
curr_chunk = []
curr_length = 0
data.scan(/.+?\b/m) do |word|
  wlen = word.length
  if curr_length + wlen <= 50
    curr_chunk << word
    curr_length += wlen
  else
    chunks << curr_chunk.join()
    curr_chunk = [word]
    curr_length = wlen
  end
end
if curr_chunk.length > 0
  chunks << curr_chunk.join()
end
p chunks
chunks.each do |chunk|
  puts chunk.length
end
–output:–
["Hello world. Hello moon.\nGoodbye world. Goodbye ", "moon.\n\nHello
world. Hello moon.\nGoodbye world. ", "Goodbye moon.\nThe end"]
49
48
21
Hmmm… I’m having trouble capturing the final period when using the
word boundary in the regex. I guess that’s because no word starts after
the final period, so there is no boundary left for \b to match. \s works:
data.scan(/.+?\s/m) do |word|
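The same accumulate-words idea can also be written without a regex. This is only a minimal sketch, assuming that runs of whitespace (including newlines) may be collapsed to single spaces; wrap_words is a hypothetical helper name, not something from this thread:

```ruby
# Hypothetical helper: greedy word wrap, joining words with single spaces.
def wrap_words(text, max = 50)
  lines = []
  current = ""
  text.split.each do |word|                  # split on any run of whitespace
    if current.empty?
      current = word                         # an overlong word gets its own line
    elsif current.length + 1 + word.length <= max
      current = "#{current} #{word}"         # the word still fits on this line
    else
      lines << current                       # line is full, start a new one
      current = word
    end
  end
  lines << current unless current.empty?
  lines
end

data = "Hello world. Hello moon.\nGoodbye world. Goodbye moon.\nThe end."
p wrap_words(data, 20)
# → ["Hello world. Hello", "moon. Goodbye world.", "Goodbye moon. The", "end."]
```

Because it works on whole words, the final period is kept and a word longer than the limit simply ends up on a line of its own.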
I hope this makes sense, to summarise I want to break up a string into a
max of 50 characters without breaking up words.
Thanks in advance
Stuart
str = "I did not test this completely so you may need to make some " \
      "adjustments to this, but give it a try. This cuts on twenty instead of " \
      "fifty characters."
(str.length/20).times do
  arr = str.split(//)
  ess = arr.zip((0...arr.length).to_a)
  tee = ess.reverse.detect{|y| y[0] == " " and y[1] <= 20}
  p str.slice!(0...tee[1]).strip
end
p str
Harry
Hi –
On Wed, 22 Jul 2009, Stuart C. wrote:
word. As a result a number of sub strings of various sizes will be
created all less than 50 chars.
I hope this makes sense, to summarise I want to break up a string into a
max of 50 characters without breaking up words.
Try this. I don’t guarantee robustness.
str.scan(/\b.{0,50}(?:$|\b)/m)
David
On 7/22/09, David A. Black [email protected] wrote:
Try this. I don’t guarantee robustness.
str.scan(/\b.{0,50}(?:$|\b)/m)
Hmm my \b at the end of my solution might have been a problem in some
edge cases, however I would suggest the usage of \z instead of $ and
the m switch. I fail to see why you put a \b at the beginning David,
would you mind to explain?
In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:
/.{,50}(?!\B)/
BTW it seems that {n,m} does not have a “non greedy” and “possessive”
variant, or did I miss it?
Cheers
Robert
On 7/22/09, Robert D. [email protected] wrote:
In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:
/.{,50}(?!\B)/
Nahh that leaves us with spaces at the beginning of the line, of
course we could do
scan(…).map( &:lstrip ) but that hurts my regex pride 
This seems to work (but does not really):
s = "Some words are made of letters! Some are not!"
puts s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ )
Replace the puts with p and you will see trailing whitespace now :(.
This is a little bastard of a problem indeed. Simplest I could come up
with so far:
s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )
HTH
Robert
BTW it seems that {n,m} does not have a “non greedy” and “possessive”
variant, or did I miss it?
Yes I did, they are there {n,m}? and {n,m}+, sorry.
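A quick way to see the difference between the greedy and lazy forms is to run them on a string of repeated characters. This is a minimal sketch with invented sample strings. The atomic group (?>...) is included as another way to get "no giving back" behaviour. One caveat I am not fully sure about: as I understand the Oniguruma documentation, in Ruby's own syntax a + after {n,m} may be read as an ordinary repetition of the group rather than as a possessive marker, so that form is worth checking on your Ruby version before relying on it.

```ruby
s = "aaaaaa"

p s[/a{2,4}/]     # greedy: takes as many as allowed → "aaaa"
p s[/a{2,4}?/]    # lazy: takes as few as allowed    → "aa"

# Atomic group: once a{2,4} has matched, it will not give characters back.
p "aaaa"[/a{2,4}a/]       # backtracks to leave one "a" for the tail → "aaaa"
p "aaaa"[/(?>a{2,4})a/]   # no backtracking into the group           → nil
```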
Cheers
Robert
On 7/22/09, Robert D. [email protected] wrote:
s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )
s.split( /(.{,10}\S)\s/ ).reject( &:empty? )
good enough? Certainly not 
s.split( /(.{,10}\S)\s+/ ).reject( &:empty? )
Robert
Hi –
On Wed, 22 Jul 2009, Robert D. wrote:
Try this. I don’t guarantee robustness.
str.scan(/\b.{0,50}(?:$|\b)/m)
Hmm my \b at the end of my solution might have been a problem in some
edge cases, however I would suggest the usage of \z instead of $ and
the m switch. I fail to see why you put a \b at the beginning David,
would you mind to explain?
The idea was to start every scan at a \b. It’s definitely not an
all-purpose solution to the problem anyway. For one thing, it doesn’t
handle words of more than 50 characters – which probably doesn’t
matter, unless you’re using it with a number less than 50:
str
=> "this is a string and i intend to split it up into little strings"
str.scan(/\b.{0,5}(?:$|\b)/m)
=> ["this ", "is a ", "", " and ", "i ", "", " to ", "split", " it ",
"up ", "into ", "", " ", "", ""]
Without the first \b you get:
["this ", "is a ", "", "tring", " and ", "i ", "", "ntend", " to ",
"split", " it ", "up ", "into ", "", "ittle", " ", "", "rings", ""]
So… further tweaking required 
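One possible tweak for the long-word case is to give the regex a fallback alternative that swallows a whole overlong word instead of chopping it. This is a sketch under stated assumptions, not a tested all-purpose solution: chunk_words is a hypothetical name, overlong words are allowed to exceed the limit, and because of the m switch a chunk may still contain newlines or repeated spaces.

```ruby
# Hypothetical helper: chunks of at most `max` characters that start and end
# on a non-space character; a single word longer than `max` is kept whole.
def chunk_words(text, max)
  raise ArgumentError, "max must be at least 2" if max < 2
  # First alternative: up to max characters, ending just before whitespace or
  # the end of the string. Second alternative: one whole (overlong) word.
  pattern = /\S(?:.{0,#{max - 2}}\S)?(?!\S)|\S+/m
  text.scan(pattern)
end

str = "this is a string and i intend to split it up into little strings"
p chunk_words(str, 5)
# → ["this", "is a", "string", "and i", "intend", "to", "split", "it up",
#    "into", "little", "strings"]
```

Since the pattern has no capturing groups, scan returns a flat array of strings, with no empty matches and no leading or trailing spaces.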
David
Robert D. wrote:
On 7/22/09, David A. Black [email protected] wrote:
Try this. I don’t guarantee robustness.
str.scan(/\b.{0,50}(?:$|\b)/m)
I fail to see why you put a \b at the beginning David,
would you mind to explain?
Yes. Please explain that. Also please explain why you don’t have
{1,50}?
Or, will you claim the 5th under the robustness disclaimer?
On Wed, 22 Jul 2009, 7stud – wrote:
would you mind to explain?
Yes. Please explain that. Also please explain why you don’t have
{1,50}?
Or, will you claim the 5th under the robustness disclaimer?
I won’t claim anything. Feel free to experiment with the code, which
I’ve already said repeatedly isn’t a full solution, and see what you
come up with.
David
On 7/22/09, David A. Black [email protected] wrote:
I won’t claim anything. Feel free to experiment with the code, which
I’ve already said repeatedly isn’t a full solution, and see what you
come up with.
Indeed this is very tricky, I had some doubts about your leading \b
example, but I experimented with lots of solutions and they were
covering it up. Thx for explaining. Unless the OP says what he really
wants, I shall stop so as not to make too much noise. For example, there
is the issue of more than one space and, of course, punctuation.
R.
Hi –
On Wed, 22 Jul 2009, 7stud – wrote:
David A. Black wrote:
Try this. I don’t guarantee robustness.
str.scan(/\b.{0,50}(?:$|\b)/m)
Oh my.
It’s got some problems; see the message I just posted about word
length.
David
In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:
/.{,50}(?!\B)/
Nahh that leaves us with spaces at the beginning of the line
p str.scan(/\s*(.{1,50})(?!\S)/)
Or, if there are consecutive spaces between words,
squeeze them out first.
p str.squeeze(" ").scan(/\s*(.{1,50})(?!\S)/)
Harry
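Note that scan returns an array of one-element arrays when the pattern has a capturing group, so a flatten may be wanted at the end. This is a minimal sketch using the sample sentence from earlier in the thread (with an extra space added) and an arbitrary limit of 10:

```ruby
# Sample sentence from earlier in the thread, with a doubled space added.
str = "Some words are made of letters!  Some are not!"

# squeeze collapses repeated spaces; flatten unwraps the capture groups.
chunks = str.squeeze(" ").scan(/\s*(.{1,10})(?!\S)/).flatten
p chunks
# → ["Some words", "are made", "of", "letters!", "Some are", "not!"]
```

It would be worth testing this with a word longer than the limit as well, since the pattern has nothing that matches such a word as a whole.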