– Matma R.
2011/11/16 Intransition [email protected]:
You know what though. I did some benchmarking and discovered that manually
parsing the text line by line is much faster than using a regular expression
(I used a close approx re). I was kind of surprised by this, since the
regular expression engine is written in C, whereas my line-by-line parser
is in Ruby.
We’re talking about this regex, right?
s.scan(/(.*?(\s+)\s+[^\n]+?\n(?=\2\S|\z))/m)
Well, I wouldn’t be surprised at all that it’s slower. Regex engines
are crazy complicated beasts, and regex matching itself can be, in the
worst case, exponential in complexity (due to backtracking). This one
regex is kind of complicated too; it has multiple nested matching
groups, it has backreferences to them, it has lazy quantifiers, it has
lookahead… this can make it expensive to match, even more so on a
long text.
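To make the backtracking cost concrete, here’s a toy benchmark (my own example, not your regex above): /\A(a+)+\z/ against a near-miss string forces the engine to try every way of splitting the run of a’s between the inner and outer +. One caveat: Ruby 3.2+ added memoization to the regex engine that tames simple pathological patterns like this one, so the blowup may not reproduce on newer Rubies.

```ruby
require "benchmark"

# Classic pathological pattern: the inner and outer + can divide a run
# of a's between them in exponentially many ways.
PATTERN = /\A(a+)+\z/

# On a matching input the answer is found quickly.
p PATTERN.match?("a" * 10)   # => true

# On a near-miss input every split must be rejected before failing.
# On a plain backtracking engine the time roughly doubles with each
# extra "a"; Ruby 3.2+ memoizes this particular shape of pattern.
[14, 18, 22].each do |n|
  input = "a" * n + "!"
  seconds = Benchmark.realtime { PATTERN.match?(input) }
  printf("n=%2d  %.4fs\n", n, seconds)
end
```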
A naive line-parsing algorithm (as far as my understanding of what
you’re trying to achieve goes) just has to split the text on newlines,
look for lines starting with whitespace, and group the array items we
got when splitting - the entire ordeal has just linear complexity, a
piece of cake.
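For concreteness, that linear pass might look something like this (the method name and the grouping shape are my own guesses at what you’re after - indented lines attach to the preceding unindented one):

```ruby
# Split on newlines; start a new group at every line that does not
# begin with whitespace, and append indented lines to the current group.
def group_indented(text)
  groups = []
  text.split("\n").each do |line|
    if line =~ /\A\s/ && !groups.empty?
      groups.last << line   # continuation of the previous block
    else
      groups << [line]      # a new top-level block
    end
  end
  groups
end

p group_indented("foo\n  bar\nbaz")  # => [["foo", "  bar"], ["baz"]]
```

One pass, no backtracking - each line is looked at exactly once.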
Despite that, I still find it curious that there isn’t a more obvious
regular expression for parsing a document in this way. It makes me wonder if
a C.S. PhD could go back to the drawing board, and come up with a better
alternative to REs.
Regexps are hardly ever good for any kind of “parsing”; they were
created for, and are better suited for, pattern matching and
replacing. Here you might be better off with some kind of automated
grammar parser (possibly Perl’s grammars - a Ruby port, anyone? [1]).
Perl guys are also trying to completely reinvent regular expression
syntax for Perl 6, and most of the ideas are really good stuff. [2]
[1] http://en.wikipedia.org/wiki/Perl_6_rules#Grammars
[2] http://dev.perl.org/perl6/doc/design/apo/A05.html (a long read,
but worth it)
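To show what I mean by a grammar-based parser without pulling in any gem (and without pretending this is Perl 6 grammars - it’s just the underlying recursive-descent technique, on a toy arithmetic grammar of my own choosing):

```ruby
# Recursive-descent parser for the grammar:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
# Each rule becomes one method; no regex backtracking anywhere.
class Calc
  def parse(src)
    @tokens = src.scan(%r{\d+|[-+*/()]})
    value = expr
    raise "trailing input: #{@tokens.first}" unless @tokens.empty?
    value
  end

  private

  def expr
    value = term
    while @tokens.first == "+" || @tokens.first == "-"
      op = @tokens.shift
      value = op == "+" ? value + term : value - term
    end
    value
  end

  def term
    value = factor
    while @tokens.first == "*" || @tokens.first == "/"
      op = @tokens.shift
      value = op == "*" ? value * factor : value / factor
    end
    value
  end

  def factor
    tok = @tokens.shift
    if tok == "("
      value = expr
      raise "expected )" unless @tokens.shift == ")"
      value
    else
      Integer(tok)
    end
  end
end

p Calc.new.parse("2 + 3 * (4 - 1)")  # => 11
```

The point is that the grammar is explicit and readable, and the parser’s structure mirrors it one rule per method - which is exactly what regexes lose once they grow nested groups and lookaheads.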