Valid XML PIs for ERB

unknown · January 26, 2006, 12:16pm

Hi. Why doesn’t ERB work with valid XML processing instructions, as
suggested by lots of people[1]? Is there any good reason we’re stuck
with invalid, nonstandard, Microsoft-inspired “<%…%>” tags? It
shouldn’t be difficult to allow both syntaxes (someone even made a
patch[2]), and .rhtml files could finally be valid XHTML.

References:
[1]
http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/936cf51a2b8b01c0/aed5dbbdac99cdfb?q=erb+xml&rnum=9#aed5dbbdac99cdfb
[2]
http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/602c5f90e980dffb/9f9c6229a3313e73?q=erb+xml&rnum=2

unknown · January 27, 2006, 9:57am

On Thu, 2006-01-26 at 12:13, [email protected] wrote:

Hi. Why doesn’t ERB work with valid XML processing instructions, as
suggested by lots of people[1]? Is there any good reason we’re stuck
with invalid, nonstandard, Microsoft-inspired “<%…%>” tags? It
shouldn’t be difficult to allow both syntaxes (someone even made a
patch[2]), and .rhtml files could finally be valid XHTML.

Historical reasons, mostly. HTML started out as inspired by SGML, rather
than compliant with SGML. The people who built the first web servers
didn’t know much about SGML, and so they reinvented processing
instructions in an annoyingly incompatible manner. Ever since, web
frameworks have been built on a solid foundation of
don’t-bother-me-with-the-basics-of-SGML/XML-processing. Quite
successfully too, which really grates my cheese.

In addition to the non-compliant syntax there are problems with how the
<%…%> tags are used. It is very common for them to contain either
procedural instructions, such as in PHP and Rails, or to be tightly
coupled to procedures, such as ASP.NET and Java tag libraries. None of
this is good from an SGML/XML point of view.

The SGML/XML paradigm says “separate information from processing so that
the same information can be processed by many different applications”.
This is quite different from the OO idea of “bunch data and operations
together in an object”. (Apologies for the vast oversimplification.)

A processing instruction, when used as originally intended, indicates
that something should be done, and what that something is. It should not
indicate which piece of software will perform the action.

A well written processing system (again, from an SGML/XML point of view)
uses templates that are independent of the templating system, so that
they can be processed by other software if need be, just as you suggest.

I am no Rails expert, but as I understand it, Rails is one of the few
MVC-based web application frameworks that can easily be taught to
understand processing instructions. It is possible to add new templating
systems fairly easily. If someone added an XML compliant templating
system, Rails would have a strong competitive advantage with SGML/XML
aware organizations, for example in the automotive, telecommunications,
and electronics industries.

An alternative is to skip the MVC framework bit altogether, and build
Transformation View based applications instead. Not much framework
support that I know of, but a paradigm that fits the problem better than
MVC in many cases. With a Transformation View based application, you
would most likely deal with XML parsers and an XML transformation system
directly, and the problem of non-compliant tagging never arises.
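
As a rough illustration of the Transformation View idea, here is a minimal sketch that parses XML data and builds the HTML output with an XML library, so no template tags are involved at all. It assumes REXML is available; the input format (`people`/`person` with a `name` attribute) and the method name `people_to_html` are made up for this example.

```ruby
require 'rexml/document'

# Hypothetical input format: <people><person name="..."/></people>
# Walks the parsed input and builds the output tree element by element.
def people_to_html(xml)
  input = REXML::Document.new(xml)
  list = REXML::Element.new('ul')
  input.elements.each('people/person') do |person|
    # Assigning text lets REXML handle escaping of special characters.
    list.add_element('li').text = person.attributes['name']
  end
  list.to_s
end

puts people_to_html('<people><person name="Ada"/><person name="Grace"/></people>')
```

Because the output is built as a tree rather than as text, it is well-formed by construction. In practice an XSLT processor would probably play this role.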

/Henrik

–
http://kallokain.blogspot.com/ - Blogging from the trenches of software
development
http://www.henrikmartensson.org/ - Reflections on software development
http://testunitxml.rubyforge.org/ - The Test::Unit::XML Home Page
http://declan.rubyforge.org/ - The Declan Home Page

unknown · January 27, 2006, 4:17pm

Henrik M. wrote:

Historical reasons, mostly. HTML started out as inspired by SGML, rather
than compliant with SGML. The people who built the first web servers
didn’t know much about SGML, and so they reinvented processing
instructions in an annoyingly incompatible manner. Ever since, web
frameworks have been built on a solid foundation of
don’t-bother-me-with-the-basics-of-SGML/XML-processing. Quite
successfully too, which really grates my cheese.

Do you have any references for this? I’m pretty sure Tim Berners-Lee,
Marc Andreessen, etc. knew about SGML, and I do not believe that HTML
ever had PIs.

Also, the xml-dev list is a good place to read varying, but informed,
opinions on the use of PIs.

For example:

http://lists.xml.org/archives/xml-dev/200505/msg00159.html

Thanks,

unknown · January 27, 2006, 10:23pm

On Fri, 2006-01-27 at 16:15, James B. wrote:

Historical reasons, mostly. HTML started out as inspired by SGML, rather
than compliant with SGML. The people who built the first web servers
didn’t know much about SGML, and so they reinvented processing
instructions in an annoyingly incompatible manner. Ever since, web
frameworks have been built on a solid foundation of
don’t-bother-me-with-the-basics-of-SGML/XML-processing. Quite
successfully too, which really grates my cheese.

Do you have any references for this?

Tim Berners-Lee certainly knew about SGML. The original specification
explicitly mentions it. See
Tags used in HTML.

However, HTML was not fully SGML compliant. For example, there is a
sentence in the original spec that says:

“Currently HTML documents are transmitted without the normal SGML
framing tags, but if these are included parsers will ignore them.”

There was also an original test dataset, including this file:
Hypertext HTML formatting example.
If you look at the source, you can see that it is not fully SGML
compliant. For starters, there is no Doctype. Also, there are tags that
contain formatted text that is not wrapped in a CDATA section. Neither
is allowed in SGML.

An interesting thing to note is that the <P> tag was used to indicate
the end of a paragraph in the test document, though the original spec
said <P> was a paragraph start tag. I remember that all my early HTML
books said <P> was an end tag. Unfortunately, I threw those books away
years ago.

Also, I believe the first HTML DTD was for version 2.0, written in 1995.
Here is the link: http://www.w3.org/MarkUp/html-spec/html.dtd. Since
there was no DTD for version 1.0, it could not have been SGML compliant.
In all fairness, I could be wrong about there being no HTML 1.0 DTD.
There are notes from 1992 that talk about the future of HTML and “a new
DTD”, which indicates the existence of an old one. It’s just that I
haven’t found it. Even so, the lack of a requirement for a Doctype would
be enough to render HTML non-compliant. More importantly, it would not
be parseable by SGML parsers.

At the time, losing the Doctype and CDATA sections, and not supporting
hierarchical chapter and section structures, was probably the right
decision. HTML had to be very simple, or people would not have used it.
If the design had been “better”, we might not have had a web today.

I’m pretty sure Tim Berners-Lee,
Marc Andreessen, etc. knew about SGML, and I do not believe that HTML
ever had PIs.

I have never seen an HTML spec that mentions processing instructions. Nor
is there any need to. Processing instructions can be defined by anyone
who designs a processing application; they are not tied to a specific
DTD or SGML application. (Well, except that some specifications
explicitly define some PIs, but there is nothing that prevents users of
the DTD from specifying more of them.)

I can’t prove that the people who wrote the first web servers did not
know about PIs, but I think it is likely. If they had known, what
possible reason could they have had for deliberately doing something
that was not SGML compliant? (Browser wars and vendor lock in didn’t
become major issues until later.)

Also, the xml-dev list is a good place to read varying, but informed,
opinions on the use of PIs.

For example:

xml-dev - Well-established uses of processing instructions?

I follow the list, though not as carefully now as I did a couple of
years ago. In addition to the applications mentioned in the thread you
refer to, XML editors like XMetaL and Arbortext Editor make use of
processing instructions. So do many proprietary SGML/XML processing
systems.

/Henrik


unknown · January 27, 2006, 11:36pm

Henrik M. wrote:
…

I’m pretty sure Tim Berners-Lee,
Marc Andreessen, etc. knew about SGML, and I do not believe that HTML
ever had PIs.

…

I can’t prove that the people who wrote the first web servers did not
know about PIs, but I think it is likely. If they had known, what
possible reason could they have had for deliberately doing something
that was not SGML compliant? (Browser wars and vendor lock in didn’t
become major issues until later.)

Tim Berners-Lee arguably wrote the first Web server (if Wikipedia is to
be believed). I’m doubtful that it had anything to do with processing
instructions; Web servers generally don’t worry if the pages they serve
up are SGML compliant or otherwise; that tends to be left to the client.

The issues with funky page-generation syntax, at least for the Ruby Web
tools I’ve used, arise prior to interaction with any Web server.

There seem to be (again, generally speaking) two camps: Those who want
to treat the page templating or page generation source as a consistent
set of a single markup language (e.g. SGML or XML), and those who want a
special syntax such that an editor or other tool might readily
distinguish between document markup and programming language markup.

I’m guessing the latter describes the PHP/ASP/Erb path. Meanwhile,
Nitro, Amrita, Sean Russell’s xml-tmpl, some others I can’t recall,
allow for the use of either XML elements or processing instructions or
both.

I tend to prefer the PI-only approach, as it is (for me) less intrusive
for various XML tools such as tidy, and seems to make it easier to check
for certain page validation errors prior to full rendering (though at
some point that needs to be checked as well). Luckily, Erb lends itself
to some basic hacking to allow the use of PI syntax.
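
One way such a hack might look is sketched below: rewrite PI-style tags into ERB's native tags before handing the template to stock ERB. This is not the patch referenced earlier in the thread. The `rb` PI target and the method names `pi_to_erb` and `render_pi_template` are assumptions made for the example.

```ruby
require 'erb'

# Rewrite <?rb ... ?> and <?rb= ... ?> into ERB's <% ... %> and <%= ... %>.
# The "rb" target name is an assumption made for this sketch.
def pi_to_erb(source, target: 'rb')
  pattern = /<\?#{Regexp.escape(target)}(=?)\s(.*?)\?>/m
  source.gsub(pattern) { "<%#{$1} #{$2.strip} %>" }
end

# Convert the PI-style template, then render it with unmodified ERB.
def render_pi_template(source, vars = {})
  ERB.new(pi_to_erb(source)).result_with_hash(vars)
end

template = <<~XML
  <ul>
  <?rb items.each do |item| ?>
    <li><?rb= item ?></li>
  <?rb end ?>
  </ul>
XML

puts render_pi_template(template, items: %w[one two])
```

Other PIs, such as the XML declaration, pass through untouched. Two limitations of this sketch: Ruby code containing a literal `?>` would end the PI early, and XML does not recognize PIs inside attribute values, so something like an interpolated `href` would need a different mechanism.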

But I can see why people might want a syntax that was orthogonal to any
particular output format; Erb need not only be used to create Web pages.
It can generate Ruby or PostScript or whatever, and a PI syntax would
mean little then.
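
To make that point concrete, here is a small sketch of ERB emitting Ruby source instead of markup. The method name `generate_struct_source` and the generated `Struct` line are invented for this example.

```ruby
require 'erb'

# Render a one-line Ruby source file from a template. There is no XML
# anywhere in the output, so PI syntax would gain nothing here.
def generate_struct_source(class_name, fields)
  template = <<~'TEMPLATE'
    <%= class_name %> = Struct.new(<%= fields.map(&:inspect).join(', ') %>)
  TEMPLATE
  ERB.new(template).result_with_hash(class_name: class_name, fields: fields)
end

puts generate_struct_source('Point', [:x, :y])
```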

–
James B.

“Blanket statements are over-rated”