Img (regular expressions)

newb · August 21, 2008, 8:32am

Hi all,
I need to extract img tags from an HTML page using regular expressions:
<\s*img [^>]*src\s*=\s*(["'])(.*?)\1
Would this pattern be OK?
If so, how can it be implemented?
Any ideas?

newb · August 21, 2008, 9:49am

2008/8/21 Newb N.:

I need to extract img tags from an HTML page using regular expressions:
<\s*img [^>]*src\s*=\s*(["'])(.*?)\1
Would this pattern be OK?

I would choose a different regexp.

If so, how can it be implemented?

What exactly?

Any ideas?

http://code.whytheluckystiff.net/hpricot/

Cheers

robert

newb · August 21, 2008, 10:19am

Newb N. wrote:

I need to extract img tags from an HTML page using regular expressions:
<\s*img [^>]*src\s*=\s*(["'])(.*?)\1
Would this pattern be OK?
If so, how can it be implemented?
Any ideas?

Regexp is not a parser; it strongly resists matching well-formed syntax,
such as HTML.
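
For what it is worth, here is a rough sketch of how the pattern from the question could be applied with String#scan, and where it falls over. The `*` quantifiers are restored on the assumption that the forum formatting swallowed them, and `img_sources` is a made-up helper name, not anything from a library.

```ruby
# Pattern from the question, with the quantifiers restored (an assumption
# about what the original poster intended) and made case-insensitive.
IMG_SRC = /<\s*img [^>]*src\s*=\s*(["'])(.*?)\1/i

# Hypothetical helper: returns the src value of every <img> the pattern
# manages to match. scan yields one [quote, src] pair per match.
def img_sources(html)
  html.scan(IMG_SRC).map { |_quote, src| src }
end

html = %q{<p><img src="a.png" alt="A"> <IMG class='x' src='b.gif'></p>}
p img_sources(html)               # => ["a.png", "b.gif"]

# An unquoted attribute value is legal HTML, but the pattern never sees it.
p img_sources('<img src=c.jpg>')  # => []
```

The second call is the kind of input that makes a regexp a shaky tool here: the markup is fine, the match silently fails.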

You need to write unit tests so you can “see” what you are doing. They
will feed samples of input to your parser, and assert the output
contains no <img tags.
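
A minimal sketch of such tests, assuming Minitest (bundled with current Rubies). `strip_images` is a hypothetical stand-in for whatever scrubber you end up writing; the naive gsub inside it is only there so the tests have something to call.

```ruby
require 'minitest/autorun'

# Hypothetical stand-in for the real scrubber. Replace the body with your
# own implementation; the tests below should keep passing.
def strip_images(html)
  html.gsub(/<img\b[^>]*>/i, '')
end

class StripImagesTest < Minitest::Test
  # Sample inputs: plain tag, upper-case tag, tag split across lines.
  SAMPLES = [
    '<p><img src="a.png"> text</p>',
    "<IMG SRC='b.gif'>",
    "<img\n  src=\"c.png\" />",
  ].freeze

  def test_output_contains_no_img_tags
    SAMPLES.each do |html|
      refute_match(/<img/i, strip_images(html),
                   "leaked an image from #{html.inspect}")
    end
  end

  def test_other_markup_survives
    assert_equal '<p> text</p>',
                 strip_images('<p><img src="a.png"> text</p>')
  end
end
```

Every time you find an input that slips through, add it to SAMPLES first, watch the test fail, then fix the scrubber.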

I would load these strings into libxml-ruby or Hpricot documents, then
use XPath to seek ‘//img’, then delete their nodes from the document,
then write the documents back. But note HTML supports several other ways
to inject images, including CSS styles, tags, etc.
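
The load, seek, delete, write-back steps could look roughly like this. Note the swap: this sketch uses REXML instead of libxml-ruby or Hpricot, purely because REXML ships with Ruby, and REXML only accepts well-formed (XHTML-style) markup. `strip_images` is a made-up name.

```ruby
require 'rexml/document'

# Hypothetical helper: parse the markup, remove every <img> element found
# by the XPath query, and serialise the document again.
# Assumes the input is well-formed XML; REXML raises a parse error otherwise.
def strip_images(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, '//img').each { |img| img.remove }
  doc.to_s
end

page = '<div><p>Hi <img src="a.png"/> there</p><img src="b.png"/></div>'
puts strip_images(page)
```

With a forgiving HTML parser the shape stays the same; only the parse and serialise calls change.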

You need to consult with your client how clean you need your HTML. If
they say to only allow , , , or tags, for example,
you could use XPath to seek ‘//*’, meaning all nodes, then replace their
tag names with , delete all their attributes, and write the
document back.
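
A sketch of that whitelist pass, again with REXML standing in for libxml-ruby or Hpricot. The allowed tag names and the fallback tag below are placeholders I picked for illustration; the real list is whatever your client signs off on.

```ruby
require 'rexml/document'

# Both constants are hypothetical examples, not a recommendation.
ALLOWED_TAGS = %w[p em strong].freeze
FALLBACK_TAG = 'span'

# Visit every element ('//*'), rename anything not on the whitelist, and
# strip all attributes. Assumes well-formed input, as REXML requires.
def scrub(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, '//*').each do |el|
    el.name = FALLBACK_TAG unless ALLOWED_TAGS.include?(el.name)

    # Collect the names first so we are not deleting while iterating.
    names = []
    el.attributes.each_attribute { |attr| names << attr.expanded_name }
    names.each { |name| el.delete_attribute(name) }
  end
  doc.to_s
end

dirty = '<div class="x"><b onclick="evil()">Hi</b> <em>there</em>' \
        '<script>alert(1)</script></div>'
puts scrub(dirty)
```

Renaming keeps the text inside a disallowed element; if you would rather drop things like script bodies entirely, remove those nodes instead of renaming them.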

Next, there might be gems out there to do this (or plugins), so you
could google
for [rails scrub html], to just find one, and either raid its source, or
install
and use it.

newb · August 21, 2008, 10:20am

You need to consult with your client how clean you need your HTML. If
they say to only allow , , , or tags, for example,
you could use XPath to seek ‘//*’, meaning all nodes, then replace their
tag names with , delete all their attributes, and write the
document back.

Another way to scrub input is to disallow raw HTML altogether. Only
allow a wiki markup, such as RedCloth. Some wikis allow ''italic'' and
'''bold''' content, and very little else. Then you don’t need to scrub
it; you simply let the wiki engine convert it to harmless read-only
HTML.
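
To make the idea concrete, here is a toy converter for just the ''italic'' and '''bold''' markers. It is a hand-rolled stand-in for a real wiki engine (it is not RedCloth and does not use RedCloth's syntax), and `wiki_to_html` is a made-up name. The point is the order: escape everything first, then turn the few permitted markers into tags.

```ruby
# Characters that must never reach the page unescaped.
HTML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;'
}.freeze

# Hypothetical mini "wiki engine": any raw HTML in the input is escaped,
# so the only tags in the output are the ones generated here.
def wiki_to_html(text)
  html = text.gsub(/[&<>"]/, HTML_ESCAPES)
  html = html.gsub(/'''(.+?)'''/, '<strong>\1</strong>')  # bold first
  html.gsub(/''(.+?)''/, '<em>\1</em>')                   # then italic
end

puts wiki_to_html("'''Bold''' and ''italic'' <img src=x>")
# prints: <strong>Bold</strong> and <em>italic</em> &lt;img src=x&gt;
```

The injected img tag comes out as visible text rather than markup, which is the "harmless read-only HTML" effect without any scrubbing step.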