Can't control regular expressions

unknown · July 29, 2008, 12:35pm

Hello guys,

I need to extract all the scripts from an HTML file, so I have written
the following regular expression as a first test:

%r|<script(.+)script>|m

The problem I am having is that the expression takes the first <script
and the very last script>. So it matches the beginning of the first script
in the document and the end of the last script in the document, with
everything in the middle. I want to extract just the scripts one by one.
How do I do it?

Thanks for your help,

Guillermo

unknown · July 29, 2008, 12:50pm

On Jul 29, 12:34 pm, [email protected] wrote:

I need to extract from an html file all the scripts. So I have written the
following regular expression for a first test:

%r|<script(.+)script>|m

The problem I am having is that the expression takes the first <script
and the very last script>. So it matches the beginning of the first script
in the document and the end of the last script in the document, with
everything in the middle. I want to extract just the scripts one by one.
How do I do it?

You can put the ‘?’ regexp operator after the quantifier to make the match
lazy rather than greedy, so it stops at the first closing script> instead
of the last one:

%r|<script(.+?)script>|m

However, I suggest trying Hpricot for more robust HTML parsing.
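
To make the difference concrete, here is a minimal sketch that runs both patterns through String#scan. The sample document and the helper name extract_with are made up for illustration; only the two regular expressions come from this thread.

```ruby
# Greedy pattern from the original question and the lazy variant suggested above.
GREEDY_SCRIPT = %r|<script(.+)script>|m
LAZY_SCRIPT   = %r|<script(.+?)script>|m

# Hypothetical helper: returns what group 1 captured for every match.
def extract_with(pattern, html)
  # With a single capture group, String#scan returns one-element arrays,
  # so flatten gives a plain array of captured strings.
  html.scan(pattern).flatten
end

# Made-up sample document, for illustration only.
SAMPLE_HTML = <<HTML
<html>
  <head>
    <script type="text/javascript">var a = 1;</script>
    <script src="app.js"></script>
  </head>
  <body>
    <script>
      alert("hi");
    </script>
  </body>
</html>
HTML

greedy = extract_with(GREEDY_SCRIPT, SAMPLE_HTML)
lazy   = extract_with(LAZY_SCRIPT, SAMPLE_HTML)

puts "greedy matches: #{greedy.length}"
puts "lazy matches:   #{lazy.length}"
lazy.each { |captured| puts captured.inspect }
```

On this sample the greedy pattern produces a single match spanning all three scripts, while the lazy one produces three. Note that the capture group still includes the rest of the opening tag and the leading `</` of the closing tag, so the result needs further trimming before it is just the script body.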

Lars

unknown · July 29, 2008, 12:59pm

On Jul 29, 2008, at 12:34 PM, [email protected] wrote:

the very last script>. So it matches the beginning of the first script in
the document and the end of the last script in the document with
everything in the middle. I want to extract just the scripts one by one.
How do I do it?

Thanks for your help,

Guillermo

Hi, regexps are not the right tool for this. You can find some
explanation of why that is in this topic:

http://groups.google.com/group/ruby-talk-google/browse_thread/thread/2d86d106b5c8797a
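
As a small illustration (my own sketch, not taken from the linked thread), here are two cases where the lazy pattern from earlier in this discussion gives a different answer than an HTML parser would. The sample markup and the helper name naive_scripts are hypothetical.

```ruby
# The lazy pattern suggested earlier in this thread.
NAIVE_SCRIPT = %r|<script(.+?)script>|m

# Hypothetical helper: returns the captured text of every regexp match.
def naive_scripts(html)
  html.scan(NAIVE_SCRIPT).flatten
end

# Made-up markup with two awkward cases:
#  - an upper-case tag, which is still a script element in HTML
#  - a script that has been commented out and should be ignored
TRICKY_HTML = <<HTML
<SCRIPT>upper();</SCRIPT>
<!-- <script>disabled();</script> -->
<script>live();</script>
HTML

found = naive_scripts(TRICKY_HTML)
found.each { |captured| puts captured.inspect }
```

The pattern is case-sensitive, so it skips the upper-case script, and it knows nothing about comments, so it returns the disabled one. Adding the i flag would fix the first case, but the comment case needs some understanding of the document structure, which is what a parser provides.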

Regards,
Florian G.