Forum: Ruby Regex html

K. R. (Guest)
on 2007-05-15 19:50
Hi everybody

I want to filter the content of a body tag in HTML. How can I do this
with a regular expression?


@h = Net::HTTP.new(url, 80)
@response = @h.get(file, nil)

if response.message == "OK"
  @body_content = response.scan(/..................../).to_s
end


thx for your solution!
ribit
Robert D. (Guest)
on 2007-05-15 20:12
(Received via mailing list)
On 5/15/07, M. R. <removed_email_address@domain.invalid> wrote:
>   @body_content = response.scan(/..................../).to_s
First, a general remark: you should use multiline mode, as tags can
span several lines.
Now I do not really see what you want from your page.
If you want the body tag only, the following should do

@response.body.gsub(/\s+/," ").scan(/<body[^>]*>/)
This might be slow for long pages.

Regexen are probably not your best choice if you want to analyze the
tag. Still, something like this might work fine for a body tag with a
single attribute:
@response.body.gsub(/\s+/," ").scan(/<body(\s+(\w+)="([^"]*)")>/)
except for escaped " characters.
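
A minimal sketch of the same idea on a literal string, so it can be tried without a network request. The sample HTML and the variable names are made up for illustration, and it grabs the opening tag first and then scans its attributes, which is a slight variation on the single regexp above:

```ruby
# Hypothetical sample page with a body tag spread over several lines.
html = "<html><body\n  class=\"main\"\n  id=\"top\"><p>Hi</p></body></html>"

# Collapse runs of whitespace so the tag sits on one line.
flat = html.gsub(/\s+/, " ")

# First grab the opening tag, then scan its name="value" pairs.
body_tag   = flat[/<body[^>]*>/i]
attributes = body_tag.scan(/(\w+)="([^"]*)"/).to_h

puts body_tag
p attributes
```

This still assumes double-quoted attribute values without escaped quotes inside them.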

HTH
Robert
Gavin K. (Guest)
on 2007-05-15 21:16
(Received via mailing list)
On May 15, 9:50 am, "M. R." <removed_email_address@domain.invalid> wrote:
> I want to filter the content of a body tag in HTML. How can I do this
> with a regular expression?
>
> @h = Net::HTTP.new(url, 80)
> @response = @h.get(file, nil)
>
> if response.message == "OK"
>   @body_content = response.scan(/..................../).to_s
> end

Assuming your HTML is valid, then simply:
@body_content = @response.body[ /<body[^>]*>(.+?)<\/body>/m, 1 ]
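
A minimal sketch of that one-liner on a literal string, assuming the page text is already in hand. The sample HTML and variable names are made up for illustration. Note that the slash in the closing tag has to be escaped inside a /.../ literal:

```ruby
# Hypothetical sample page; in the thread this would be the HTTP response body.
html = <<~HTML
  <html>
    <head><title>Demo</title></head>
    <body class="main">
      <p>Hello</p>
    </body>
  </html>
HTML

# String#[] with a regexp and a capture index returns that capture, or nil.
# The /m flag lets "." match newlines, so the body may span several lines.
body_content = html[/<body[^>]*>(.+?)<\/body>/m, 1]

puts body_content.strip
```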
Jörg W Mittag (Guest)
on 2007-05-16 08:20
(Received via mailing list)
Phrogz wrote:
> Assuming your HTML is valid, then simply:
> @body_content = @response.body[ /<body[^>]*>(.+?)<\/body>/m, 1 ]

Whenever someone asks me how to parse HTML with regular expressions, I
usually tell them: don't.  HTML is an extremely complex language; if
you want to parse HTML, use an HTML parser.  For example, the
following snippet is a perfectly well-formed and valid HTML document,
but none of the regexps posted in this thread so far are able to
correctly parse it:

  <HTML/
    <HEAD/
      <TITLE/>/
      <P/>

Oh, and, no, there is nothing missing there (well, except for the
DOCTYPE declaration, I left that out for brevity -- this snippet is
valid HTML 2.0, HTML 3.2 and HTML 4.01), that is actually a complete,
well-formed and valid HTML document.

The content of the above document's body element, flattened to a
string, should be something like this: '<P>></P>'.

Using an actual HTML parser like Hpricot might be a much better
choice.  Actually, I just checked and Hpricot doesn't seem to work
either and neither does RubyfulSoup.  Strange.  What other Ruby HTML
parsers are there that I could try?

jwm
Dan Z. (Guest)
on 2007-05-16 08:30
(Received via mailing list)
> Oh, and, no, there is nothing missing there (well, except for the
> DOCTYPE declaration, I left that out for brevity -- this snippet is
> valid HTML 2.0, HTML 3.2 and HTML 4.01), that is actually a complete,
> well-formed and valid HTML document.
>

True, but most web sites are more likely to be malformed than they are
to be unparsably complex. If a regex will work predictably for one type
of page on one web site, perhaps a parser might be overkill.

Dan
Daniel -. (Guest)
on 2007-05-16 08:41
(Received via mailing list)
Is there a reason why Hpricot would not be suitable?

 require 'hpricot'
 require 'open-uri'

 body_content = (Hpricot(open(url))/"body").to_html

should find the body content of a page I think.

cheers
Daniel
Stephane W. (Guest)
on 2007-05-16 10:45
(Received via mailing list)
M. R. wrote:
>   @body_content = response.scan(/..................../).to_s
> end
>
>
> thx for your solution!
> ribit
>

Question: why don't you use Hpricot?
Gavin K. (Guest)
on 2007-05-16 21:45
(Received via mailing list)
On May 15, 10:18 pm, Jörg W Mittag <removed_email_address@domain.invalid> wrote:
> For example, the
> following snippet is a perfectly well-formed and valid HTML document,
> but none of the regexps posted in this thread so far are able to
> correctly parse it:
>
>   <HTML/
>     <HEAD/
>       <TITLE/>/
>       <P/>

Wow. I was all fired up to call you out on this, and ask you what
insane cocaine you were smoking when you made this claim.

I was a web developer for many many years and standards were very,
very important to me. I thought I knew the specs.

And then I ran that by validator.w3.org along with an HTML 4.01 strict
DTD, and - to my utter shock and surprise and horror - it turns out
you were correct.

Thanks for sharing.
Jörg W Mittag (Guest)
on 2007-05-17 01:26
(Received via mailing list)
Phrogz wrote:
> Wow. I was all fired up to call you out on this, and ask you what
> insane cocaine you were smoking when you made this claim.

Well, keep in mind that this is a very contrived, extreme, exaggerated
example that you will never find in the wild, simply because not only
the regexps in this thread but also the browsers cannot parse it --
although I heard rumors that Emacs/w3 actually supports some of the
features used by that snippet.  I just wanted to demonstrate that
there are a lot of weird things in HTML that are much better left to
the people that write HTML parsers rather than writing the same
incomplete HTML regexps over and over and over and over again.

> I was a web developer for many many years and standards were very,
> very important to me. I thought I knew the specs.

The above example mainly draws upon one simple fact: the HTML
designers decided to make HTML an application of SGML without actually
having a *beep*ing clue about SGML, thus creating some "interesting"
interactions with SGML's parsing rules.  And who can blame them?  The
reason they created HTML in the first place, was that SGML is so
mind-bogglingly complex that *nobody* has a *beep*ing clue!

So, you can read all the W3C specs you want, but what makes HTML so
weird isn't actually in there; it's buried somewhere in the thousands
of pages of ISO SGML specs.

> And then I ran that by validator.w3.org along with an HTML 4.01 strict
> DTD, and - to my utter shock and surprise and horror - it turns out
> you were correct.

Well, let's see what actually happens.  We start out with this:

  <html>
    <head>
      <title>&gt;</title>
    </head>
    <body>
      <p>&gt;</p>
    </body>
  </html>

First, SGML is case-insensitive and HTML inherits that property.  This
already fools about 99% of all HTML regexps that you can find on the
web:

  <HTML>
    <HEAD>
      <TITLE>&gt;</TITLE>
    </HEAD>
    <BODY>
      <P>&gt;</P>
    </BODY>
  </HTML>
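
To illustrate the point about case, here is a small Ruby sketch; the sample string is made up. A pattern written in lowercase misses the uppercase tags unless the /i flag is set:

```ruby
# Hypothetical uppercase document, flattened to one line.
html = "<HTML><HEAD><TITLE>&gt;</TITLE></HEAD><BODY><P>&gt;</P></BODY></HTML>"

case_sensitive   = html[/<body[^>]*>(.+?)<\/body>/m, 1]   # lowercase pattern only
case_insensitive = html[/<body[^>]*>(.+?)<\/body>/mi, 1]  # /i ignores letter case

p case_sensitive    # nil, because "BODY" does not match "body"
p case_insensitive  # "<P>&gt;</P>"
```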

We don't need to escape closing/right angle brackets (&gt;), only
opening/left ones (&lt;):

  <HTML>
    <HEAD>
      <TITLE>></TITLE>
    </HEAD>
    <BODY>
      <P>></P>
    </BODY>
  </HTML>

Next, we use a feature that HTML inherited from SGML (without anybody
noticing), called Null End Tags (NET), which allows you, basically, to
DRY out (in Rails speak) the end tags.  If you close the start tag
with a slash instead of an angle bracket, you can replace the end tag
with another slash, so

  <tag>some content</tag>

becomes

  <tag/some content/

That looks like this:

  <HTML/
    <HEAD/
      <TITLE/>/
    /
    <BODY/
      <P/>/
    /
  /

Quite weird, huh?  But we are not done yet!  End tags are optional if
they can be inferred from the context (and if the DTD specifically
allows this).  So, for example, since BODY cannot occur inside of
HEAD, the opening BODY tag implies a closing HEAD tag:

  <HTML/
    <HEAD/
      <TITLE/>/
    <BODY/
      <P/>

And one last step: actually, not only are end tags optional, you can
even lose the tags entirely if they can be inferred.  P can only occur
inside a BODY, so the BODY can be inferred from P and we can get rid
of it:

  <HTML/
    <HEAD/
      <TITLE/>/
      <P/>
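
As a quick check of why the regexps give up here, a small Ruby sketch that runs a body-extracting pattern of the kind posted earlier (with /i added) against the final document as a plain string:

```ruby
# The minimized document from above, as a string.
minimized = <<~HTML
  <HTML/
    <HEAD/
      <TITLE/>/
      <P/>
HTML

# Even a case-insensitive, multiline pattern finds nothing: there is no
# literal BODY tag in the text, although a BODY element is implied.
body_content = minimized[/<body[^>]*>(.+?)<\/body>/mi, 1]

p body_content  # nil
```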

> Thanks for sharing.

My pleasure.  BTW: this is not so useless as it might first seem.
It's actually quite important to know that the W3C Validator uses an
SGML parser to validate your documents, because that means it's
worthless for

 a) XHTML, because XHTML is an application of XML, not SGML and

 b) HTML, too, because browsers don't parse HTML as SGML, they parse
      it as Tag Soup.  (To be more precise: if the validator tells you
      your HTML is invalid, then you know it's broken; however, if it
      tells you it's valid, that doesn't necessarily mean it'll
      actually work in a browser.)

XHTML is much better validated with an XML Schema Validator such as
Christoph Schneegans' Schema Validator at <http://Schneegans.de/sv/>
or the Validome validator at <http://Validome.org/>.

It's crucial to remember that the W3C Validator and the browser parse
HTML quite differently and that neither of those has necessarily
anything to do with how *you* might actually parse it (-;  I once
found a cute little snippet on a website that I unfortunately can no
longer locate, that demonstrated this quite nicely.  That snippet had
a little typo in it that fooled the human reader, the W3C Validator
and the browser into reading that exact same snippet in three
radically different ways, although what was *really* meant was
actually a *fourth* thing.

Just one quick example: HTML allows you to leave out the quotation
marks around attribute contents.  So,

  <A HREF=search.html>Search</A>

is perfectly fine, however

  <A HREF=http://google.com/>Search</A>

isn't, because, as we now know, the first slash closes the start tag and
the second is read as a Null End Tag, so the above snippet would actually
be parsed as something like the following:

  <A HREF="http:"></A>google.com/&gt;Search</A>

And the validator will complain about an extra closing </A> tag, while
the browser will quietly fix that up to mean

  <A HREF="http://google.com/">Search</A>

which is obviously what was intended.  However, if you don't know
about Null End Tags you can stare at the Validator's Error Message:

  Line X, Column Y: end tag for element "A" which is not open

for hours and still not realize that your problem has nothing to do
with an extra end tag, Line X or Column Y but that you are actually
missing some quotation marks somewhere else in your document.
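
One way to guard against this is to check a value before writing it without quotes. Below is a minimal Ruby sketch; the helper name is hypothetical, and the character set is the one HTML 4.01 lists for unquoted attribute values (letters, digits, hyphens, periods, underscores and colons):

```ruby
# Characters HTML 4.01 permits in an unquoted attribute value.
UNQUOTED_SAFE = /\A[A-Za-z0-9._:-]+\z/

# Hypothetical helper: true if the value may be written without quotes.
def unquoted_ok?(value)
  UNQUOTED_SAFE.match?(value)
end

p unquoted_ok?("search.html")         # true
p unquoted_ok?("http://google.com/")  # false, contains slashes
```

When in doubt, simply quoting every attribute value avoids the question entirely.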

BTW: the W3C gave up on SGML long ago and developed XML as a much
simpler subset of SGML and XHTML as an application of XML.  Now, the
WHAT-WG followed by basically giving up any pretenses that HTML5 was
actually an application of SGML; rather it is a language in its own
right, totally separate from both XML and SGML.  And now we know why!

One last goodie: you can actually specify an alternate root element in
the DOCTYPE declaration:

  <!DOCTYPE p PUBLIC "-//W3C//DTD HTML 4.01//EN"
                     "http://www.w3.org/TR/html4/strict.dtd">
  <P/

Although I have no friggin' clue how a browser is actually supposed
to display this.

Anyway, that concludes today's off-topic SGML rant, let's now get back
to our regularly scheduled Smalltalk and Lisp threads, please (-;

jwm
John J. (Guest)
on 2007-05-17 01:41
(Received via mailing list)
On May 17, 2007, at 2:45 AM, Phrogz wrote:

>
> Thanks for sharing.
This is indeed technically correct for HTML, but what good is it? We
could all sit down and write stupid code in any language that is
technically valid, but useless.
Daniel -. (Guest)
on 2007-05-17 03:44
(Received via mailing list)
On 5/17/07, John J. <removed_email_address@domain.invalid> wrote:
> >>   <HTML/
> > And then I ran that by validator.w3.org along with an HTML 4.01 strict
> > DTD, and - to my utter shock and surprise and horror - it turns out
> > you were correct.
> >
> > Thanks for sharing.
> This is indeed technically correct for HTML, but what good is it? We
> could all sit down and write stupid code in any language that is
> technically valid, but useless.



Well I found it interesting.