Best Practice for Multiline Regexps

Hey,

What is the recommended way to walk a tree of files and replace
multiline pattern matches, when you have, say, 20 regular expressions
you’re looking for? I understand how to traverse directories,
read/write files, and use complex regular expressions; the question is,
what’s the optimal/recommended way to process the files (find/replace
“def method_name …(some lines)… end” with some string, for instance)?
Is it:

  1. Read file to String, match string against first pattern, read next
    file into String, match with same pattern… once I’ve gone through all
    the files with the first pattern, start over with the next pattern.
  2. Read file to String, match whole string against all 20 patterns, go
    to next file, match against 20 patterns…
  3. Read file to String, match each line by 20 patterns…
  4. Something with a Tokenizer which I don’t yet understand (if so, could
    you shed some light on it for me :slight_smile: )

I basically want to write a few patterns to replace multiline text
patterns in lots of files, and I need a consistent/fast way to do it in
Ruby, without learning C or anything.

Thanks so much for the help,
Lance

I have read through the TextMate docs and have looked extensively at
their Language Grammar source files (Ruby, JavaScript, ActionScript,
etc.), which suggest using “begin” and “end” patterns, so that makes
sense; I’m just not sure how to run through the string when I have so
many begins and ends. The simplest solution would be to just look for
one pattern in a file at a time, but that seems like it’d be really
slow.
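For what it’s worth, one way to experiment with TextMate-style begin/end
scanning in plain Ruby is StringScanner from the standard library. This
is only a sketch; the `regions` helper and the def/end patterns in the
usage note are invented for illustration, not real grammar rules:

```ruby
require 'strscan'

# Collect every region lying between a begin pattern and an end pattern.
# Patterns are supplied by the caller; nesting is not handled.
def regions(source, begin_re, end_re)
  s = StringScanner.new(source)
  found = []
  while s.scan_until(begin_re)
    from = s.pos - s.matched.size   # start of the begin match
    break unless s.scan_until(end_re)
    found << source[from...s.pos]   # begin match through end match, inclusive
  end
  found
end
```

For example, `regions(src, /^def \w+/, /^end$/)` would pull out simple
top-level def…end blocks from Ruby-like text in a single pass.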

This’ll be nice to know for code parsing, and for code generation.

Thanks again.

On 01.09.2009 05:35, Lance P. wrote:

  1. Read file to String, match string against first pattern, read next
    file into String, match with same pattern… once I’ve gone through all
    the files with the first pattern, start over with the next pattern.

That’s the worst you can do.

  2. Read file to String, match whole string against all 20 patterns, go
    to next file, match against 20 patterns…

Most efficient of the simple approaches.

  3. Read file to String, match each line by 20 patterns…

How do you want to do that with multiline patterns? Is there an easy
way to convert them into several line-based patterns? If not, you can
forget this option.

  4. Something with a Tokenizer which I don’t yet understand (if so, could
    you shed some light on it for me :slight_smile: )

Which Tokenizer are you referring to? Can you give more detail about
your patterns or the kind of replacement you want to do? If not (e.g.
because they must be generic), the simplest and most efficient approach
seems to be option 2, unless we are talking about GB file sizes.
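For concreteness, option 2 can be sketched in a few lines of Ruby. The
directory walk uses the standard `find` library; the two rules below are
made-up placeholders for the real 20 patterns:

```ruby
require 'find'

# Placeholder rules; substitute your real multiline patterns here.
RULES = {
  /foo/m  => 'bar',
  /baz+/m => 'qux'
}

def rewrite_tree(dir)
  Find.find(dir) do |path|
    next unless File.file?(path)
    contents = File.read(path)              # read each file once...
    changed = false
    RULES.each do |pattern, replacement|    # ...run every pattern over it...
      changed = true if contents.gsub!(pattern, replacement)
    end
    File.write(path, contents) if changed   # ...write back only if something matched
  end
end
```

Each file is read and written at most once, which is what makes this
variant cheap on IO.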

Kind regards

robert

Thanks a lot Robert for your help.

Which Tokenizer are you referring to? Can you give more detail about
your patterns or the kind of replacement you want to do?

I would like to be able to do code generation, in a kind of preprocessor
way, for ActionScript, but I don’t want to use an actual preprocessor
because it seems like too much work (especially in ActionScript or
Java), and I’d have to do pattern matching anyway :).

For example, I’d like to convert this:

[Bindable]
public function get myProperty():String {
  return _myProperty;
}
public function set myProperty(value:String):void {
  _myProperty = value;
}

… or this:

[Bindable] public var myProperty:String;

… or anything in between (different formatting…), to this:

[Bindable(event="myPropertyChange")]
public function get myProperty():String {
  return _myProperty;
}
public function set myProperty(value:String):void {
  _myProperty = value;
  dispatchEvent(new Event("myPropertyChange"));
}

… That little accessor snippet can be anywhere from 1 to 20+ lines, and
I’d like to be able to just add that line in there without having to run
through the file a bunch of times.

There are a few other examples similar to that that I’d like to do too,
but that’s the gist of it.

Lance

2009/9/1 spiralofhope [email protected]:

…cycling through each pattern you’re searching for, for every file. It
would be horribly taxing on disk access.

Exactly!

But if we imagine that all the files are on a ramdisk, then could the
first method possibly be better? Something in the back of my mind
says method 2 is still better.

Yes, and here’s why: you save the effort of transferring file data
from the disk into Ruby’s address space (memory-mapped IO is OS
dependent and might not be available; also, I believe it’s not a core
library functionality). Additionally, it is unlikely that all files
regularly reside on a ramdisk, so you would have the additional effort
of moving/copying files there.

Kind regards

robert

On Tue, 1 Sep 2009 14:55:08 +0900
Robert K. [email protected] wrote:

  1. Read file to String, match string against first pattern, read
    next file into String, match with same pattern… once I’ve gone
    through all the files with the first pattern, start over with the
    next pattern.

That’s the worst you can do.

  2. Read file to String, match whole string against all 20 patterns,
    go to next file, match against 20 patterns…

Most efficient of the simple approaches.

Robert is right from a hard drive perspective.

To understand why method 2 works well, just remember that when a file
is read from your disk, it is cached. Your first solution would force
the system to cache file 1, process it, then cache file 2 to process
it… through to caching file n and then back to file 1 again, cycling
through each pattern you’re searching for, for every file. It would be
horribly taxing on disk access.

But if we imagine that all the files are on a ramdisk, then could the
first method possibly be better? Something in the back of my mind
says method 2 is still better.

(incomplete)

contents.gsub! %r{
  (\[Bindable)(\]
  \s*
  public
  \s+
  function
  \s+
  get
  \s+)
  (\w+)    # property name $3
  \s*
  ([^)]*)

}x, '\1(event="\3Change")\2\3…'

Kind regards

robert

Nice! I’ll try that out. That looks real clean. Thanks a lot for your
guys’ help.

Lance.

2009/9/1 Lance P. [email protected]:

For example, I’d like to convert this:

[…]

dispatchEvent(new Event("myPropertyChange"));
}

… That little accessor snippet can be anywhere from 1 to 20+ lines, and
I’d like to be able to just add that line in there without having to run
through the file a bunch of times.

There are a few other examples similar to that that I’d like to do too,
but that’s the gist of it.

Ah, I see. I assume your files are not big (it looks like program code,
which typically does not come in GBs). You might be able to make do
with single-line regular expressions, but I’d first try String#gsub!
because that’s typically the most efficient thing to do. Maybe you can
even do this without the block form, e.g.

(incomplete)

contents.gsub! %r{
  (\[Bindable)(\]
  \s*
  public
  \s+
  function
  \s+
  get
  \s+)
  (\w+)    # property name $3
  \s*
  ([^)]*)

}x, '\1(event="\3Change")\2\3…'
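To make the idea concrete, here is one hypothetical way the pattern
could be completed for the getter/setter case. The `BINDABLE` regexp
and `add_change_event` helper below are my own guess at a full version,
not the poster’s original, and they assume the setter’s closing brace is
the first `}` after its opening one:

```ruby
# Hypothetical completed pattern for the [Bindable] getter/setter form.
BINDABLE = /
  (\[Bindable)                        # $1: annotation, minus its closing ]
  (\]\s*public\s+function\s+get\s+
   (\w+)                              # $3: property name
   .*?public\s+function\s+set\s+\3    # the setter for the same property...
   \([^)]*\):\w+\s*\{.*?)             # $2: ...up to just before its last brace
  (\})                                # $4: the setter's closing brace
/mx

def add_change_event(src)
  src.gsub(BINDABLE) do
    "#{$1}(event=\"#{$3}Change\")#{$2}" \
      "dispatchEvent(new Event(\"#{$3}Change\"));\n#{$4}"
  end
end
```

A real parser would be more robust; this sketch will misfire on setters
that contain nested braces.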

Kind regards

robert

Lance P.:

  1. Read file to String, match string against first pattern, read next
    file into String, match with same pattern… once I’ve gone through
    all the files with the first pattern, start over with the next
    pattern.
  2. Read file to String, match whole string against all
    20 patterns, go to next file, match against 20 patterns…

The only advantage of 1) over 2) is that if you have a situation where
a combination of pattern 18 with file X breaks (because the pattern
didn’t foresee some peculiar corner case which manifests only in file
X), then – assuming you take snapshots of the operation after every
pattern replace is over – the first approach will let you keep the work
done by the first 17 pattern replacements and re-run the process from
pattern 18 on, while the second approach will mean you have to roll back
everything and redo all of the work that succeeded one time already.
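That resumable per-pattern variant might look roughly like this; the
rules, directory layout, and snapshot naming below are all invented for
illustration:

```ruby
require 'fileutils'
require 'find'

# Placeholder rules; in practice these would be the real 20 patterns.
SNAP_RULES = [[/foo/m, 'bar'], [/baz/m, 'qux']].freeze

def rewrite_with_snapshots(dir, snapshot_root)
  SNAP_RULES.each_with_index do |(pattern, replacement), i|
    Find.find(dir) do |path|
      next unless File.file?(path)
      text = File.read(path)
      File.write(path, text) if text.gsub!(pattern, replacement)
    end
    # Snapshot the whole tree after each pattern, so a pattern that
    # misbehaves only costs one step, not the entire run.
    FileUtils.cp_r(dir, File.join(snapshot_root, "after-pattern-#{i + 1}"))
  end
end
```

The price is one full copy of the tree per pattern, which is exactly the
IO cost the thread is weighing against restartability.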

It really depends on the number of patterns, the number of
files and whether you need IO/time efficiency or not. :slight_smile:

— Shot

2009/9/2 Shot (Piotr S.) [email protected]:

The only advantage of 1) over 2) is that if you have a situation where
a combination of pattern 18 with file X breaks (because the pattern
didn’t foresee some peculiar corner case which manifests only in file
X), then – assuming you take snapshots of the operation after every
pattern replace is over – the first approach will let you keep the work
done by the first 17 pattern replacements and re-run the process from
pattern 18 on, while the second approach will mean you have to roll back
everything and redo all of the work that succeeded one time already.

Excellent point! With a complex operation like this I would keep
backups of all original files anyway. Maybe even copy the whole
directory tree before changing anything. That way you can check with
diff what changed, roll back etc.
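A minimal sketch of that backup step, assuming the tree fits on disk
twice; the `backup_tree` name and the `.orig` suffix are arbitrary:

```ruby
require 'fileutils'

# Copy the whole tree aside before touching anything, so you can
# `diff -ru project.orig project` afterwards and roll back if needed.
def backup_tree(src)
  backup = "#{src}.orig"
  FileUtils.cp_r(src, backup)
  backup
end
```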

Kind regards

robert
