Hi everybody,

I want to filter the content of the body tag in HTML. How can I do this with a regular expression?

    @h = Net::HTTP.new(url, 80)
    @response = @h.get(file, nil)

    if response.message == "OK"
      @body_content = response.scan(/..................../).to_s
    end

thx for your solution!
ribit
on 2007-05-15 19:50
on 2007-05-15 20:12
On 5/15/07, M. R. <email@example.com> wrote:
> @body_content = response.scan(/..................../).to_s

First, a general remark: you should use multiline mode, as tags can span several lines.

Now, I do not really see what you want from your page. If you want the body tag only, the following should do:

    @response.gsub(/\s+/," ").scan(/<body[^>]*>/)

This might be slow for long pages.

Regexen are probably not your best choice if you want to analyze the tag. Still, something like this might work fine:

    @response.gsub(/\s+/," ").scan(/<body(\s+(\w+)="([^"]*)")>/)

save for escaped ".

HTH
Robert
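A minimal sketch of the whitespace-collapsing approach suggested above, run against a made-up page. The sample HTML and the variable names `flat`, `body_tag` and `body_attrs` are assumptions for illustration, not from the thread. It only handles double-quoted attribute values and assumes no ">" appears inside a value.

```ruby
# Hypothetical sample page; in the thread the HTML would come from Net::HTTP.
html = <<~HTML
  <html>
  <head><title>Demo</title></head>
  <body class="main"
        id="top">
  <p>Hello</p>
  </body>
  </html>
HTML

# Collapse runs of whitespace (including newlines) so the tag sits on one line.
flat = html.gsub(/\s+/, " ")

# First <body ...> opening tag, matched case-insensitively.
body_tag = flat[/<body\b[^>]*>/i]

# Collect double-quoted attributes as name => value pairs.
# Assumption: no unquoted or single-quoted values.
body_attrs = body_tag.to_s.scan(/(\w+)="([^"]*)"/).to_h

puts body_tag
p body_attrs
```

Scanning the already-isolated tag for attributes avoids having to repeat a capture group inside one big pattern, which in Ruby would keep only the last repetition.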
on 2007-05-15 21:16
On May 15, 9:50 am, "M. R." <firstname.lastname@example.org> wrote:
> I want to filter the content of a body-Tag in html. How can I do this
> with regular expression?
>
> @h = Net::HTTP.new(url, 80)
> @response = @h.get(file, nil)
>
> if response.message == "OK"
> @body_content = response.scan(/..................../).to_s
> end

Assuming your HTML is valid, then simply:

    @body_content = response[ /<body[^>]*>(.+?)<\/body>/m, 1 ]
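A small, self-contained sketch of the same idea, for anyone who wants to try it without a network connection. The helper name `body_content` and the sample page are assumptions made up for this example. With Net::HTTP you would pass the response's body string rather than the response object itself. The pattern here differs slightly from the one above: it uses `.*?` so an empty body yields an empty string, and the `i` flag so uppercase tags match.

```ruby
# Hypothetical helper (not from the thread): returns the text between the
# first <body ...> tag and the first </body> tag, or nil when there is none.
def body_content(html)
  # %r{...} avoids having to escape the "/" in "</body>".
  # m: "." also matches newlines; i: tag names match in any letter case.
  html[%r{<body\b[^>]*>(.*?)</body\s*>}mi, 1]
end

# Made-up sample page with an uppercase BODY tag spread over several lines.
page = "<html><head><title>t</title></head>\n<BODY class=\"x\">\n<p>Hi</p>\n</BODY></html>"

puts body_content(page)
```

This still assumes explicit `<body>` and `</body>` tags are present, which is exactly the assumption questioned later in the thread.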
on 2007-05-16 08:20
Phrogz wrote:
> Assuming your HTML is valid, then simply:
> @body_content = response[ /<body[^>]*>(.+?)<\/body>/m, 1 ]

Whenever someone asks me how to parse HTML with regular expressions, I usually tell them: don't. HTML is an extremely complex language; if you want to parse HTML, use an HTML parser.

For example, the following snippet is a perfectly well-formed and valid HTML document, but none of the regexps posted in this thread so far are able to correctly parse it:

    <HTML/
    <HEAD/
    <TITLE/>/
    <P/>

Oh, and, no, there is nothing missing there (well, except for the DOCTYPE declaration, I left that out for brevity -- this snippet is valid HTML 2.0, HTML 3.2 and HTML 4.01), that is actually a complete, well-formed and valid HTML document.

The content of the above document's body element, flattened to a string, should be something like this: '<P>></P>'. Using an actual HTML parser like Hpricot might be a much better choice.

Actually, I just checked and Hpricot doesn't seem to work either, and neither does RubyfulSoup. Strange. What other Ruby HTML parsers are there that I could try?

jwm
on 2007-05-16 08:30
> Oh, and, no, there is nothing missing there (well, except for the
> DOCTYPE declaration, I left that out for brevity -- this snippet is
> valid HTML 2.0, HTML 3.2 and HTML 4.01), that is actually a complete,
> well-formed and valid HTML document.

True, but most web sites are more likely to be malformed than they are to be unparsably complex. If a regex will work predictably for one type of page on one web site, perhaps a parser might be overkill.

Dan
on 2007-05-16 08:41
Is there a reason why Hpricot would not be suitable?

    require 'hpricot'
    require 'open-uri'

    body_content = (Hpricot( open( url ) )/"body").to_html

should find the body content of a page, I think.

cheers
Daniel
on 2007-05-16 10:45
M. R. wrote:
> @body_content = response.scan(/..................../).to_s
> end
>
> thx for your solution!
> ribit

Question: why don't you use Hpricot?
on 2007-05-16 21:45
On May 15, 10:18 pm, Jörg W Mittag <email@example.com> wrote:
> For example, the
> following snippet is a perfectly well-formed and valid HTML document,
> but none of the regexps posted in this thread so far are able to
> correctly parse it:
>
> <HTML/
> <HEAD/
> <TITLE/>/
> <P/>

Wow. I was all fired up to call you out on this, and ask you what insane cocaine you were smoking when you made this claim.

I was a web developer for many, many years and standards were very, very important to me. I thought I knew the specs.

And then I ran that by validator.w3.org along with an HTML 4.01 strict DTD, and - to my utter shock and surprise and horror - it turns out you were correct.

Thanks for sharing.
on 2007-05-17 01:26
Phrogz wrote:
> Wow. I was all fired up to call you out on this, and ask you what
> insane cocaine you were smoking when you made this claim.

Well, keep in mind that this is a very contrived, extreme, exaggerated example that you will never find in the wild, simply because not only the regexps in this thread but also the browsers cannot parse it -- although I heard rumors that Emacs/w3 actually supports some of the features used by that snippet.

I just wanted to demonstrate that there are a lot of weird things in HTML that are much better left to the people who write HTML parsers, rather than writing the same incomplete HTML regexps over and over and over and over again.

> I was a web developer for many many years and standards were very,
> very important to me. I thought I knew the specs.

The above example mainly draws upon one simple fact: the HTML designers decided to make HTML an application of SGML without actually having a *beep*ing clue about SGML, thus creating some "interesting" interactions with SGML's parsing rules. And who can blame them? The reason they created HTML in the first place was that SGML is so mind-bogglingly complex that *nobody* has a *beep*ing clue!

So, you can read all the W3C specs you want, but what makes HTML so weird isn't actually in there; it's buried somewhere in the thousands of pages of ISO SGML specs.

> And then I ran that by validator.w3.org along with an HTML 4.01 strict
> DTD, and - to my utter shock and surprise and horror - it turns out
> you were correct.

Well, let's see what actually happens. We start out with this:

    <html>
    <head>
    <title>&gt;</title>
    </head>
    <body>
    <p>&gt;</p>
    </body>
    </html>

First, SGML is case-insensitive and HTML inherits that property.
This already fools about 99% of all HTML regexps that you can find on the web:

    <HTML>
    <HEAD>
    <TITLE>&gt;</TITLE>
    </HEAD>
    <BODY>
    <P>&gt;</P>
    </BODY>
    </HTML>

We don't need to escape closing/right angle brackets (>), only opening/left ones (<):

    <HTML>
    <HEAD>
    <TITLE>></TITLE>
    </HEAD>
    <BODY>
    <P>></P>
    </BODY>
    </HTML>

Next, we use a feature that HTML inherited from SGML (without anybody noticing), called Null End Tags (NET), which allows you, basically, to DRY out (in Rails speak) the end tags. If you close the start tag with a slash instead of an angle bracket, you can replace the end tag with another slash, so

    <tag>some content</tag>

becomes

    <tag/some content/

That looks like this:

    <HTML/
    <HEAD/
    <TITLE/>/
    /
    <BODY/
    <P/>/
    /
    /

Quite weird, huh? But we are not done yet! End tags are optional if they can be inferred from the context (and if the DTD specifically allows this). So, for example, since BODY cannot occur inside of HEAD, the opening BODY tag implies a closing HEAD tag:

    <HTML/
    <HEAD/
    <TITLE/>/
    <BODY/
    <P/>

And one last step: actually, not only are end tags optional, you can even lose the tags entirely if they can be inferred. P can only occur inside a BODY, so the BODY can be inferred from P and we can get rid of it:

    <HTML/
    <HEAD/
    <TITLE/>/
    <P/>

> Thanks for sharing.

My pleasure.

BTW: this is not as useless as it might first seem. It's actually quite important to know that the W3C Validator uses an SGML parser to validate your documents, because that means it's worthless for a) XHTML, because XHTML is an application of XML, not SGML, and b) HTML, too, because browsers don't parse HTML as SGML, they parse it as Tag Soup. (To be more precise: if the validator tells you your HTML is invalid, then you know it's broken; however, if it tells you it's valid, that doesn't necessarily mean it'll actually work in a browser.)
XHTML is much better validated with an XML Schema Validator such as Christoph Schneegans' Schema Validator at <http://Schneegans.de/sv/> or the Validome validator at <http://Validome.org/>.

It's crucial to remember that the W3C Validator and the browser parse HTML quite differently and that neither of those has necessarily anything to do with how *you* might actually parse it (-;

I once found a cute little snippet on a website that I unfortunately can no longer locate, that demonstrated this quite nicely. That snippet had a little typo in it that fooled the human reader, the W3C Validator and the browser into reading that exact same snippet in three radically different ways, although what was *really* meant was actually a *fourth* thing.

Just one quick example: HTML allows you to leave out the quotation marks around attribute contents. So,

    <A HREF=search.html>Search</A>

is perfectly fine, however

    <A HREF=http://google.com/>Search</A>

isn't, because as we now know, the double slash actually gets interpreted as a Null End Tag, so the above snippet would actually be parsed as something like the following:

    <A HREF="http:"></A>google.com/>Search</A>

And the validator will complain about an extra closing </A> tag, while the browser will quietly fix that up to mean

    <A HREF="http://google.com/">Search</A>

which is obviously what was intended. However, if you don't know about Null End Tags you can stare at the Validator's Error Message:

    Line X, Column Y: end tag for element "A" which is not open

for hours and still not realize that your problem has nothing to do with an extra end tag, Line X or Column Y, but that you are actually missing some quotation marks somewhere else in your document.

BTW: the W3C gave up on SGML long ago and developed XML as a much simpler subset of SGML and XHTML as an application of XML.
Now, the WHAT-WG followed by basically giving up any pretenses that HTML5 was actually an application of SGML; rather it is a language in its own right, totally separate from both XML and SGML. And now we know why!

One last goodie: you can actually specify an alternate root element in the DOCTYPE declaration:

    <!DOCTYPE p PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <P/

Although I have no friggin' clue how a browser is actually supposed to display this.

Anyway, that concludes today's off-topic SGML rant, let's now get back to our regularly scheduled Smalltalk and Lisp threads, please (-;

jwm
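The step-by-step reduction earlier in this post can be checked directly against the regexp suggested upthread. The following is only a sketch: the labels and the variable names `body_re`, `variants` and `results` are made up for illustration, and the documents are condensed versions of three of the stages described above.

```ruby
# The pattern proposed earlier in the thread, with the "/" escaped so that it
# is a valid Ruby regexp literal.
body_re = /<body[^>]*>(.+?)<\/body>/m

# Three equivalent documents from the walk-through (labels are illustrative).
variants = {
  "lowercase, explicit tags" =>
    "<html><head><title>&gt;</title></head><body><p>&gt;</p></body></html>",
  "uppercase tag names" =>
    "<HTML><HEAD><TITLE>&gt;</TITLE></HEAD><BODY><P>&gt;</P></BODY></HTML>",
  "null end tags, BODY inferred" =>
    "<HTML/\n<HEAD/\n<TITLE/>/\n<P/>"
}

# Apply the regexp to each document; nil means "no body found".
results = variants.transform_values { |doc| doc[body_re, 1] }

results.each { |label, body| puts "#{label}: #{body.inspect}" }
```

Only the first variant yields a body. Adding the `i` flag would rescue the uppercase one, but nothing short of SGML-aware parsing recovers a body from the last form, since the BODY tag is never written out at all.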
on 2007-05-17 01:41
On May 17, 2007, at 2:45 AM, Phrogz wrote:
>
> Thanks for sharing.

This is indeed technically correct for HTML, but what good is it? We could all sit down and write stupid code in any language that is technically valid, but useless.
on 2007-05-17 03:44
On 5/17/07, John J. <firstname.lastname@example.org> wrote:
> >> <HTML/
> > And then I ran that by validator.w3.org along with an HTML 4.01 strict
> > DTD, and - to my utter shock and surprise and horror - it turns out
> > you were correct.
> >
> > Thanks for sharing.
> This is indeed technically correct for HTML, but what good is it? We
> could all sit down and write stupid code in any language that is
> technically valid, but useless.

Well, I found it interesting.