ClothRed (HTML to Textile)

Phillip_G · April 10, 2007, 7:39pm

I’m pleased to announce that I’ve begun working on a small library to
convert HTML into Textile.

Please forgive me that this announcement doesn’t yet follow the
community’s standards; I’m slowly getting there.

For the curious, the website and project on RubyForge have gone online
and have some content[0].

For the impatient:
ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted
into Textile’s markup from the text, making it, hopefully, usable for
sanitizing HTML.

I hope to have an Alpha release out by the end of next month.

Links:
[0] http://clothred.rubyforge.org/

–
Phillip “CynicalRyan” Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #5:

A project is never finished.

Phillip_G · April 10, 2007, 8:49pm

On 4/10/07, Phillip G. [email protected] wrote:

I’m pleased to announce, that I’ve begun working on a small library to
convert HTML into Textile.
…
ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted
into Textile’s markup from the text, making it, hopefully, usable for
sanitizing HTML.

I hope to have an Alpha release out by the end of next month.

Awesome, Phillip. I really look forward to using this!

Jacob F.

Phillip_G · April 11, 2007, 9:03pm

From: Phillip G. [mailto:[email protected]]
Sent: Tuesday, April 10, 2007 8:39 PM

For the impatient:
ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted
into Textile’s markup from the text, making it, hopefully, usable for
sanitizing HTML.

I hope to have an Alpha release out by the end of next month.

Awesome!

x x x

A bit OT, but I’ve been dreaming about (and planning, for a long time)
a library that can handle the “greatest common divisor” of all the
simple text formats and perform conversions uniformly, like

Textile   <=>         <=> HTML
Markdown  <=>         <=> PDF
Mediawiki <=>   gcd   <=> PS
RDOC      <=>         <=> OpenOffice

There are several projects that perform only [some markup]=>HTML
conversions. There is also Maruku[1], which seems to handle virtually
any Markdown=>[rich format] conversion (and seems to embody some common
intermediate format). And now there is your project.

Isn’t it time now to do something more generic?

V.

1: maruku.rubyforge.org

Phillip_G · April 11, 2007, 10:53pm

From: Phillip G. [mailto:[email protected]]
Sent: Wednesday, April 11, 2007 10:26 PM

Mediawiki <=> gcd <=> PS

Well, once there are libraries to do any one of these tasks, you can
build a tool chain, similar to DBI, for example.

I know that there’s a PDF generator written in Ruby, but I don’t know
about the other file formats. Creating markup parsers isn’t that much of
a challenge, so that could be done quite easily.

Once ClothRed is feature-complete in the HTML → Textile area, I’d be
happy to write an API to integrate ClothRed into other tools.

My point was to have some intermediate format, with a couple of parsers
TO this format and generators FROM it.

Right now, the authors of all these libraries are solving two problems:
parse and generate. It would be nice to have one common HTML parser that
could be used for either HTML->Textile or HTML->Markdown (only the
generators would differ).

From some point of view, we can use Textile as the intermediate: your
library would be the “parser” and RedCloth the “generator”. But this
leaves Markdown and RDoc “out of the game”, since we have no
Markdown->Textile or similar converters.
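The "one parser, several generators" idea can be sketched in a few lines of Ruby. This is only an illustration under loud assumptions: the intermediate format here is a made-up flat list of [kind, text] pairs, and `parse_html`, `to_textile`, `to_markdown` and `TAG_KINDS` are hypothetical names, not part of ClothRed, RedCloth or Maruku. The regex "parser" handles only a handful of non-nested tags.

```ruby
# Hypothetical tag-to-node mapping for a tiny HTML subset.
TAG_KINDS = {
  "h1" => :heading, "strong" => :strong, "em" => :emphasis, "p" => :text
}.freeze

# Parser: HTML -> intermediate format (an array of [kind, text] pairs).
# Only flat, non-nested tags are recognised; everything else is ignored.
def parse_html(html)
  html.scan(%r{<(h1|strong|em|p)>(.*?)</\1>}m).map do |tag, text|
    [TAG_KINDS.fetch(tag), text]
  end
end

# Generator: intermediate format -> Textile.
def to_textile(nodes)
  nodes.map do |kind, text|
    case kind
    when :heading  then "h1. #{text}"
    when :strong   then "*#{text}*"
    when :emphasis then "_#{text}_"
    else text
    end
  end.join("\n\n")
end

# Generator: intermediate format -> Markdown.
def to_markdown(nodes)
  nodes.map do |kind, text|
    case kind
    when :heading  then "# #{text}"
    when :strong   then "**#{text}**"
    when :emphasis then "*#{text}*"
    else text
    end
  end.join("\n\n")
end

nodes = parse_html("<h1>Title</h1><p>Hello</p><strong>bold</strong>")
puts to_textile(nodes)
puts to_markdown(nodes)
```

The parser is written once; supporting another output format means adding one more generator over the same node list.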

V.

Phillip_G · April 11, 2007, 9:26pm

Victor “Zverok” Shepelev wrote:

There is also Maruku[1], which seems to handle virtually any Markdown=>[rich
format] conversion (and seems to embody some common intermediate format).
There is now your project.

Isn’t now a time to do something more generic?

Well, once there are libraries to do any one of these tasks, you can
build a tool chain, similar to DBI, for example.

I know that there’s a PDF generator written in Ruby, but I don’t know
about the other file formats. Creating markup parsers isn’t that much of
a challenge, so that could be done quite easily.

Once ClothRed is feature-complete in the HTML → Textile area, I’d be
happy to write an API to integrate ClothRed into other tools.

–
Phillip “CynicalRyan” Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #33:

Don’t waste time on writing test cases and test scripts - your users are
your best testers.

Phillip_G · April 11, 2007, 11:26pm

From: Gary W. [mailto:[email protected]]
Sent: Thursday, April 12, 2007 12:10 AM

On Apr 11, 2007, at 4:52 PM, Victor Zverok S. wrote:

My point was, to have some intermediate format, and have couple of
parsers TO this format and generators FROM it.

Seems like XHTML would be the obvious choice for the intermediate
format, no?
Unless you want to reinvent that particular wheel.

We Russians say “invent the bike”. On Russian programming forums my usual origin is “Bikes forever!” :)

The question, I think, is like XML vs. YAML/JSON. Just keep it simpler.

I mean, for conversions like Markdown <=> Textile, XHTML as intermediate
is slightly too funny.

The overall thought was “conversion of basic logical formatting”, thus,
intermediate format should only handle basic features (at the level of
Textile-like formats, not bloated XHTML-like).

V.

Phillip_G · April 11, 2007, 11:11pm

On Apr 11, 2007, at 4:52 PM, Victor Zverok S. wrote:

My point was, to have some intermediate format, and have couple of
parsers TO this format and generators FROM it.

Seems like XHTML would be the obvious choice for the intermediate
format, no?
Unless you want to reinvent that particular wheel.

Gary W.

Phillip_G · April 11, 2007, 11:40pm

From: Gary W. [mailto:[email protected]]
Sent: Thursday, April 12, 2007 12:33 AM

Textile exists to generate XHTML.

If you’ve got reverse translations also, then XHTML is already working
as the intermediate. Why do you need yet another format?

Maybe you’re right. I can only say that when generating other “rich
formats” (like PDF or OpenOffice), a textile->pdf conversion can be
simpler than XHTML->pdf (thus it breaks the rule of “the single
intermediate format” through chains like
Markdown->XHTML->Textile->PDF). What do you think?

V.

Phillip_G · April 11, 2007, 11:33pm

On Apr 11, 2007, at 5:26 PM, Victor “Zverok” Shepelev wrote:

I mean, for conversions like Markdown <=> Textile, XHTML as
intermediate is
slightly too funny.

Yes, but that is because XHTML is too funny.

Markdown exists to generate XHTML.
Textile exists to generate XHTML.

If you’ve got reverse translations also, then XHTML is already working
as the intermediate. Why do you need yet another format?

Gary W.

Phillip_G · April 12, 2007, 2:46am

Valid, but not the same. Human languages leave lots of implicit
information that isn’t easily machine parsed. That comparison is out.
But rich formats don’t translate to formats that lack certain
capabilities, particularly PDF to XHTML or Markdown, or almost anything
else.
PDF is a pretty broad format. Layouts don’t translate so easily.
Adobe would love to have such a capability reliably. InDesign could
then produce layouts for print and the web. Not likely.

Phillip_G · April 12, 2007, 12:00am

On Apr 11, 2007, at 5:40 PM, Victor “Zverok” Shepelev wrote:

May be you’re right. Only can I say that for other “rich
formats” (like PDF
or OpenOffice) generation conversions textile->pdf can be simpler than
XHTML->pdf (thus, it breaks rule for “the single intermediate format”
through chains like Markdown->XHTML->Textile->PDF). What do you think?

I think that a translator that is designed specifically for X to Y will
always do better than a translator that goes through an intermediate
language. A Russian to Spanish translation is going to be better than
a Russian to English to Spanish translation. The benefit of an
intermediate language is that you don’t need n^2 translators, only n.
That doesn’t mean that in some special or common cases a direct
translation can’t be available and make the better choice.
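To put rough numbers on the n^2-versus-n argument, here is a small counting sketch. It assumes one-way converters and a separate intermediate format, so each format needs one parser into it and one generator out of it; the method names are made up for the illustration. Under those assumptions the exact counts are n * (n - 1) direct converters against 2 * n via the hub, which is the same order of growth as the n^2 and n above.

```ruby
# One-way converters needed so that every format converts to every other.
def direct_converters(n)
  n * (n - 1)
end

# One parser into and one generator out of the intermediate format, per format.
def hub_converters(n)
  2 * n
end

[2, 3, 4, 8].each do |n|
  puts "#{n} formats: #{direct_converters(n)} direct, #{hub_converters(n)} via intermediate"
end
```

For the eight formats named earlier in the thread, that is 56 direct converters against 16. The two approaches break even at three formats, and below that the direct route needs fewer converters, which fits the point about special cases.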

Gary W.

Phillip_G · April 12, 2007, 4:09am

Phillip G. wrote:

ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted
into Textile’s markup from the text, making it, hopefully, usable for
sanitizing HTML.

Looks interesting, but I hope there would be a mode to preserve unknown
HTML in addition to the “lossy” mode. Sanitizing HTML is good but if you
convert the resulting Textile to HTML and it doesn’t look like the
original, that’s not too good IMHO.

Daniel

Phillip_G · April 12, 2007, 8:49am

Daniel DeLorme wrote:

Looks interesting, but I hope there would be a mode to preserve unknown
HTML in addition to the “lossy” mode. Sanitizing HTML is good but if you
convert the resulting Textile to HTML and it doesn’t look like the
original, that’s not too good IMHO.

To do that, there’ll probably be two different modes of HTML stripping:

One “strict”: Everything that cannot be parsed by ClothRed will be
thrown out.
One “loose”: All HTML that ClothRed cannot convert will be kept, and
warnings will be emitted (either to stdout, or stderr, or both).

The latter will not be usable for sanitizing HTML, as “unknown” HTML
should be treated as malicious (specifically, as there is no “unknown”
HTML in the W3C specs).
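A minimal sketch of how the two modes could differ, assuming a regex-based approach with no dependencies. This is not ClothRed's actual code or API: `html_to_textile`, the `mode:` keyword and the `TEXTILE_MARKS` table are hypothetical, and only four inline tags are handled.

```ruby
# Hypothetical mapping of a few inline tags to their Textile markers.
TEXTILE_MARKS = { "strong" => "*", "b" => "*", "em" => "_", "i" => "_" }.freeze

# mode: :strict removes tags that have no Textile equivalent.
# mode: :loose  keeps them in the output and warns on stderr.
def html_to_textile(html, mode: :strict)
  html.gsub(%r{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    name = Regexp.last_match(1).downcase
    if TEXTILE_MARKS.key?(name)
      TEXTILE_MARKS[name]          # known tag: replace with Textile marker
    elsif mode == :loose
      warn "unconverted tag kept: #{tag}"
      tag                          # unknown tag: keep as-is
    else
      ""                           # unknown tag: drop the tag itself
    end
  end
end

puts html_to_textile("<b>hi</b> <script>x</script>")               # strict
puts html_to_textile("<b>hi</b> <script>x</script>", mode: :loose) # loose
```

Note that the strict branch in this sketch removes only the tags and leaves the text between them, so the `x` inside `<script>` survives. A version meant for sanitizing would presumably also have to drop the content of such elements.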

–
Phillip “CynicalRyan” Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #33:

Don’t waste time on writing test cases and test scripts - your users are
your best testers.

Phillip_G · April 12, 2007, 3:31pm

On Thu, Apr 12, 2007 at 03:56:50PM +0900, Phillip G. wrote:

Victor “Zverok” Shepelev wrote:

Now, authors of all libraries are solving 2 problems - parse & generate. It
should be nice to have one common HTML parser, which could be used either
for HTML->Textile, or for HTML->Markdown (only generators will differ).

Well, ClothRed has to parse HTML to output Textile. It does no more
than that. If you plug it into a converter suite, you can use HTML as an
intermediary format (RedCloth can parse Textile and Markdown into HTML,
so you’d already have a little part of such a converter).

Are you using Hpricot for your parsing? If so, it should be pretty easy
to do the conversion. If not, why not? (Disclaimer: I’ve been following
the thread but haven’t looked at or even installed/run the code.)

From some point of view, we can use Textile as intermediate, your library
would be “parser”, RedCloth would be “generator”. But this leaves Markdown,
Rdoc “off the game”, while we have no Markdown->Textile and similar
converters.

Granted, the scope of my library is limited, but purposefully so, to
keep it a) manageable, and b) keep it in line with my skills. Once
ClothRed is feature-complete, I can add to its functionality, but not
sooner, if I can avoid it.

I understand where you are with this. At the same time, I have an actual
need to do something very much like this in my own work. I suspect there
are others out there in a similar situation. We’re hoping that this will
become useful to us sooner rather than later and that we can avoid
rolling our own.

Phillip “CynicalRyan” Gawlowski
–Greg

Phillip_G · April 12, 2007, 8:57am

Victor “Zverok” Shepelev wrote:

Now, authors of all libraries are solving 2 problems - parse & generate. It
should be nice to have one common HTML parser, which could be used either
for HTML->Textile, or for HTML->Markdown (only generators will differ).

Well, ClothRed has to parse HTML to output Textile. It does no more
than that. If you plug it into a converter suite, you can use HTML as an
intermediary format (RedCloth can parse Textile and Markdown into HTML,
so you’d already have a little part of such a converter).

From some point of view, we can use Textile as intermediate, your library
would be “parser”, RedCloth would be “generator”. But this leaves Markdown,
Rdoc “off the game”, while we have no Markdown->Textile and similar
converters.

Granted, the scope of my library is limited, but purposefully so, to
keep it a) manageable, and b) keep it in line with my skills. Once
ClothRed is feature-complete, I can add to its functionality, but not
sooner, if I can avoid it.

–
Phillip “CynicalRyan” Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #8:

Open-Source is not a panacea.

Phillip_G · April 12, 2007, 3:48pm

Gregory S. wrote:

Are you using Hpricot for your parsing? If so, it should be pretty easy to
do the conversion. If not, why not? (Disclaimer: I’ve been following the
thread but haven’t looked at or even installed/run the code.)

No, I’m not. I want to avoid dependencies as much as I can, so that
ClothRed can stand on its own as much as possible.

I understand where you are with this. At the same time, I have an actual
need to do something very much like this in my own work. I suspect there
are others out there in a similar situation. We’re hoping that this will
become useful to us sooner rather than later and that we can avoid rolling
our own.

Regarding the time frame, I’m trying to make it feature-complete as soon
as I can. There isn’t much left to do for the core engine, and after
that I can pretty it up a bit (with rule sets and the like).

If all goes well, ClothRed will hit the big 1.0.0 at the weekend, as a
full HTML to Textile parser. After that, I’m very open to ideas
regarding its future.

–
Phillip “CynicalRyan” Gawlowski
http://cynicalryan.110mb.com/
http://clothred.rubyforge.org

Rules of Open-Source Programming:

Backward compatibility is your worst enemy.
Backward compatibility is your users’ best friend.