Wow. I was all fired up to call you out on this, and ask you what
insane cocaine you were smoking when you made this claim.
Well, keep in mind that this is a very contrived, extreme, exaggerated
example that you will never find in the wild, simply because not only
the regexps in this thread but also the browsers cannot parse it –
although I heard rumors that Emacs/w3 actually supports some of the
features used by that snippet. I just wanted to demonstrate that
there are a lot of weird things in HTML that are much better left to
the people that write HTML parsers rather than writing the same
incomplete HTML regexps over and over and over and over again.
I was a web developer for many many years and standards were very,
very important to me. I thought I knew the specs.
The above example mainly draws upon one simple fact: the HTML
designers decided to make HTML an application of SGML without actually
having a beeping clue about SGML, thus creating some “interesting”
interactions with SGML’s parsing rules. And who can blame them? The
reason they created HTML in the first place was that SGML is so
mind-bogglingly complex that nobody has a beeping clue!
So, you can read all the W3C specs you want, but what makes HTML so
weird isn’t actually in there; it’s buried somewhere in the thousands
of pages of ISO SGML specs.
And then I ran that by validator.w3.org along with an HTML 4.01 strict
DTD, and - to my utter shock and surprise and horror - it turns out
you were correct.
Well, let’s see what actually happens. We start out with this:
First, SGML is case-insensitive and HTML inherits that property. This
already fools about 99% of all HTML regexps that you can find on the web.
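To make the case-insensitivity point concrete, here is a minimal Python sketch. The sample markup and both patterns are made up for illustration; they are not taken from the regexps in this thread.

```python
import re

# Hypothetical sample markup: tag names may be written in any case.
html = "<P>first</P><p>second</p>"

naive = re.compile(r"<p>")                     # only matches lowercase <p>
tolerant = re.compile(r"<p>", re.IGNORECASE)   # matches <p> and <P>

naive_hits = naive.findall(html)
tolerant_hits = tolerant.findall(html)

print(len(naive_hits), len(tolerant_hits))  # prints: 1 2
```

Even the tolerant pattern only addresses case; it still knows nothing about the other SGML features discussed here.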
We don’t need to escape closing/right angle brackets (>), only
opening/left ones (<):
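As a rough illustration of the unescaped right angle bracket, the following sketch feeds such text to Python's standard `html.parser`. That module is a lenient tag-soup style parser, not an SGML parser, so this only shows that a bare `>` in text is unproblematic there too; the helper name `text_of` and the sample string are my own.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_of(markup):
    """Return all character data in `markup` as one string."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# The ">" is written literally, the "<" as an entity reference.
print(text_of("<p>1 > 0 but 0 &lt; 1</p>"))  # prints: 1 > 0 but 0 < 1
```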
Next, we use a feature that HTML inherited from SGML (without anybody
noticing), called Null End Tags (NET), which allows you, basically, to
DRY out (in Rails speak) the end tags. If you close the start tag
with a slash instead of an angle bracket, you can replace the end tag
with another slash, so
That looks like this:
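For readers who want to play with the idea, here is a deliberately simplified sketch that rewrites the Null End Tag form into the long form. It uses a regexp, which is exactly the kind of incomplete tool this thread warns about, so treat it as a toy: it assumes no attributes, no nesting, and no slashes in the content. The function name `expand_net` is hypothetical.

```python
import re

# Toy pattern: "<name/" then content without "/" or "<", then a closing "/".
NET = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/")


def expand_net(markup):
    """Rewrite <name/content/ as <name>content</name> (simplified sketch)."""
    return NET.sub(r"<\1>\2</\1>", markup)


print(expand_net("<p/Hello, world/"))  # prints: <p>Hello, world</p>
```

Ordinary markup such as `<p>plain</p>` passes through unchanged, because the pattern requires a slash directly after the tag name.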
Quite weird, huh? But we are not done yet! End tags are optional if
they can be inferred from the context (and if the DTD specifically
allows this). So, for example, since BODY cannot occur inside of
HEAD, the opening BODY tag implies a closing HEAD tag:
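The inference of end tags can be sketched with a small stack. The rule table below is a hypothetical, heavily reduced stand-in for what a real parser would derive from the DTD; it only encodes the BODY/HEAD case mentioned above.

```python
# Hypothetical rule table: a start tag (key) implicitly closes
# any of the listed elements if one of them is currently open.
IMPLIED_CLOSE = {"body": {"head"}}


def close_implied(start_tags):
    """Return the tag events with implied end tags made explicit."""
    open_stack, events = [], []
    for tag in (t.lower() for t in start_tags):
        # Close every open element that cannot contain the new tag.
        while open_stack and open_stack[-1] in IMPLIED_CLOSE.get(tag, ()):
            events.append("</%s>" % open_stack.pop())
        open_stack.append(tag)
        events.append("<%s>" % tag)
    return events


print(close_implied(["HTML", "HEAD", "BODY"]))
# prints: ['<html>', '<head>', '</head>', '<body>']
```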
And one last step: actually, not only are end tags optional, you can
even lose the tags entirely if they can be inferred. P can only occur
inside a BODY, so the BODY can be inferred from P and we can get rid
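Inferring omitted start tags works the other way around: when an element shows up without its required ancestors, those are opened first. Again a sketch under stated assumptions; the `REQUIRED_PARENT` table is hypothetical and far smaller than anything a DTD would give you.

```python
# Hypothetical table: element -> the parent it needs to be inside.
REQUIRED_PARENT = {"p": "body", "body": "html", "head": "html"}


def open_implied(tag, open_stack):
    """Return the start tags to emit for `tag`, inferred ancestors first."""
    tag = tag.lower()
    missing = []
    parent = REQUIRED_PARENT.get(tag)
    # Walk up the chain of required parents until one is already open.
    while parent is not None and parent not in open_stack and parent not in missing:
        missing.append(parent)
        parent = REQUIRED_PARENT.get(parent)
    return ["<%s>" % name for name in reversed(missing)] + ["<%s>" % tag]


print(open_implied("P", []))                # prints: ['<html>', '<body>', '<p>']
print(open_implied("P", ["html", "body"]))  # prints: ['<p>']
```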
Thanks for sharing.
My pleasure. BTW: this is not so useless as it might first seem.
It’s actually quite important to know that the W3C Validator uses an
SGML parser to validate your documents, because that means it’s not a
reliable check for
a) XHTML, because XHTML is an application of XML, not SGML and
b) HTML, too, because browsers don’t parse HTML as SGML, they parse
it as Tag Soup. (To be more precise: if the validator tells you
your HTML is invalid, then you know it’s broken; however, if it
tells you it’s valid, that doesn’t necessarily mean it’ll
actually work in a browser.)
XHTML is much better validated with an XML Schema Validator such as
Christoph Schneegans’ Schema Validator at http://Schneegans.de/sv/
or the Validome validator at http://Validome.org/.
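Since XHTML is XML, even a plain XML parser tells you more about it than an SGML parser does. The sketch below only checks well-formedness with Python's standard library, which is much weaker than the schema validation the services above perform; the helper name is my own.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if `markup` parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed_xml("<p>one<br/>two</p>"))  # prints: True
print(is_well_formed_xml("<p>one<br>two</p>"))   # prints: False
```

The second snippet is fine as HTML, where BR has no end tag, but it is not well-formed XML and therefore cannot be XHTML.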
It’s crucial to remember that the W3C Validator and the browser parse
HTML quite differently, and that neither necessarily has
anything to do with how you might actually parse it (-; I once
found a cute little snippet on a website that I unfortunately can no
longer locate, that demonstrated this quite nicely. That snippet had
a little typo in it that fooled the human reader, the W3C Validator
and the browser into reading that exact same snippet in three
radically different ways, although what was really meant was
actually a fourth thing.
Just one quick example: HTML allows you to leave out the quotation
marks around attribute contents. So,
is perfectly fine, however
isn’t, because as we now know, the double slash actually gets
interpreted as a Null End Tag, so the above snippet would actually be
parsed as something like the following:
And the validator will complain about an extra closing tag, while
the browser will quietly fix that up to mean
which is obviously what was intended. However, if you don’t know
about Null End Tags you can stare at the Validator’s Error Message:
Line X, Column Y: end tag for element “A” which is not open
for hours and still not realize that your problem has nothing to do
with an extra end tag, Line X or Column Y but that you are actually
missing some quotation marks somewhere else in your document.
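The two readings of an unquoted attribute can be compared in a few lines of Python. This is a sketch, not a reimplementation of either the Validator or a browser: `sgml_unquoted_value` applies the HTML 4 rule that an unquoted value may only contain letters, digits, hyphens, periods, underscores and colons, while `tag_soup_attrs` uses the lenient standard-library `html.parser` as a stand-in for a browser. Both function names and the example.com URL are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Characters allowed in an unquoted attribute value in HTML 4.
NAME_CHARS = re.compile(r"[A-Za-z0-9._:-]*")


def sgml_unquoted_value(text):
    """Split `text` into (value, rest) at the first disallowed character."""
    end = NAME_CHARS.match(text).end()
    return text[:end], text[end:]


class AttrCollector(HTMLParser):
    """Records the attributes of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def tag_soup_attrs(markup):
    """Return the attributes a lenient parser finds in `markup`."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


# Strict reading: the value stops before the first slash.
print(sgml_unquoted_value("http://example.com/page"))
# prints: ('http:', '//example.com/page')

# Lenient reading: the whole URL is taken as the value.
print(tag_soup_attrs("<a href=http://example.com/page>link</a>"))
# prints: [('href', 'http://example.com/page')]
```

In the strict reading, the leftover `//` is what ends up being treated as a Null End Tag.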
BTW: the W3C gave up on SGML long ago and developed XML as a much
simpler subset of SGML and XHTML as an application of XML. Now, the
WHAT-WG followed suit by basically giving up any pretense that HTML5 was
actually an application of SGML; rather it is a language in its own
right, totally separate from both XML and SGML. And now we know why!
One last goodie: you can actually specify an alternate root element in
the DOCTYPE declaration:
Although I have no friggin’ clue how a browser is actually supposed
to display this.
Anyway, that concludes today’s off-topic SGML rant, let’s now get back
to our regularly scheduled Smalltalk and Lisp threads, please (-;