Forum: Ruby ClothRed (HTML to Textile)

Phillip G. (Guest)
on 2007-04-10 21:39
(Received via mailing list)
I'm pleased to announce that I've begun working on a small library to
convert HTML into Textile.

Please forgive me that this announcement doesn't yet follow the
community's standards; I'm slowly getting there.

For the curious, the website and project on RubyForge have gone online
*and* have some content[0].

For the impatient:
ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted
into Textile's markup from the text, making it, hopefully, usable for
sanitizing HTML.
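
To make the idea concrete, here is a minimal sketch of that kind of
conversion. It is not ClothRed's code: the method name and the tag table
are made up for illustration, it only knows a handful of inline tags, and
it assumes tags without attributes.

```ruby
# Hypothetical sketch, not ClothRed's actual code or API.
# Maps a few inline HTML tags to their Textile phrase markers.
INLINE_MARKERS = {
  "strong" => "*",
  "b"      => "**",
  "em"     => "_",
  "i"      => "__",
  "code"   => "@"
}.freeze

def html_to_textile_sketch(html)
  text = html.dup
  # Replace known opening and closing tags with the Textile marker.
  INLINE_MARKERS.each do |tag, marker|
    text.gsub!(%r{</?#{tag}>}i, marker)
  end
  # Strip every tag that is still left. Only the tags are removed;
  # the text between them stays.
  text.gsub(/<[^>]+>/, "")
end

puts html_to_textile_sketch('<p>Some <strong>bold</strong> and <em>stressed</em> <span class="x">text</span></p>')
```

The last `gsub` is what gives the "sanitizing" side effect: anything the
table does not know about is simply dropped.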

I hope to have an Alpha release out by the end of next month.

Links:
[0] http://clothred.rubyforge.org/

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #5:

A project is never finished.
Jacob F. (Guest)
on 2007-04-10 22:49
(Received via mailing list)
On 4/10/07, Phillip G. <removed_email_address@domain.invalid> wrote:
> I'm pleased to announce, that I've begun working on a small library to
> convert HTML into Textile.
...
> ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
> string, and convert it into Textile.
>
> As a bonus, ClothRed will strip all HTML that is not being converted
> into Textile's markup from the text, making it, hopefully, usable for
> sanitizing HTML.
>
> I hope to have an Alpha release out by the end of next month.

Awesome, Phillip. I really look forward to using this!

Jacob F.
Victor "Zverok" Shepelev (Guest)
on 2007-04-11 23:03
(Received via mailing list)
From: Phillip G. [mailto:removed_email_address@domain.invalid]
Sent: Tuesday, April 10, 2007 8:39 PM
>For the impatient:
>ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
>string, and convert it into Textile.
>
>As a bonus, ClothRed will strip all HTML that is not being converted
>into Textile's markup from the text, making it, hopefully, usable for
>sanitizing HTML.
>
>I hope to have an Alpha release out by the end of next month.
>

Awesome!

x x x

A bit OT, but I've been dreaming about (and planning, for a long time) a
library which can handle the "greatest common divisor" of all simple text
formats and perform conversions uniformly, like

Textile   <=>     <=> HTML
Markdown  <=>     <=> PDF
Mediawiki <=> gcd <=> PS
RDOC      <=>     <=> OpenOffice

There are several projects performing only [some markup]=>html
conversions.
There is also Maruku[1], which seems to handle virtually any
Markdown=>[rich format] conversion (and seems to embody some common
intermediate format).
There is now your project.

Isn't it time now to do something more generic?

V.

1: maruku.rubyforge.org
Phillip G. (Guest)
on 2007-04-11 23:26
(Received via mailing list)
Victor "Zverok" Shepelev wrote:

> There is also Maruku[1], which seems to handle virtually any Markdown=>[rich
> format] conversion (and seems to embody some common intermediate format).
> There is now your project.
>
> Isn't now a time to do something more generic?

Well, once there are libraries to do any one of these tasks, you can
build a tool chain, similar to DBI, for example.

I know that there's a PDF generator written in Ruby, but I don't know
about the other file formats. Creating markup parsers isn't that much of
a challenge, so that could be done quite easily.

Once ClothRed is feature-complete in the HTML -> Textile area, I'd be
happy to write an API to integrate ClothRed into other tools.

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #33:

Don't waste time on writing test cases and test scripts - your users are
your best testers.
Victor "Zverok" Shepelev (Guest)
on 2007-04-12 00:53
(Received via mailing list)
From: Phillip G. [mailto:removed_email_address@domain.invalid]
Sent: Wednesday, April 11, 2007 10:26 PM
>> Mediawiki <=> gcd <=> PS
>
>Well, once there are libraries to do any one of these tasks, you can
>build a tool chain, similar to DBI, for example.
>
>I know that there's a PDF generator written in Ruby, but i don't know
>about the other file formats. Creating markup parsers isn't that much of
>a challenge, so that could be done quite easily.
>
>I'd be happy to, once ClothRed is feature-complete in the HTML ->
>Textile area, to write an API to integrate ClothRed into other tools.
>

My point was to have some intermediate format, and have a couple of
parsers TO this format and generators FROM it.

Now, authors of all libraries are solving 2 problems - parse & generate.
It would be nice to have one common HTML parser, which could be used
either for HTML->Textile or for HTML->Markdown (only the generators will
differ).

From some point of view, we can use Textile as the intermediate: your
library would be the "parser", RedCloth would be the "generator". But this
leaves Markdown and RDoc "out of the game", as we have no
Markdown->Textile and similar converters.
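
A rough sketch of what I mean, with one parser and two generators sharing
an intermediate format. Everything here is an assumption for illustration:
the intermediate format is just an array of [type, text] pairs, the method
names are invented, and only headings and paragraphs are handled (no
escaping, no inline markup).

```ruby
# Hypothetical intermediate format: an array of [type, text] pairs,
# e.g. [[:h1, "Title"], [:p, "Some text."]].

# Parser: a tiny subset of Textile -> intermediate nodes.
def parse_textile_subset(source)
  source.split(/\n{2,}/).map do |block|
    if (m = block.match(/\Ah([1-6])\.\s+(.*)\z/m))
      [:"h#{m[1]}", m[2]]   # "h2. Foo" becomes [:h2, "Foo"]
    else
      [:p, block]           # anything else is a paragraph
    end
  end
end

# Generator: intermediate nodes -> HTML (no escaping in this sketch).
def generate_html(nodes)
  nodes.map { |type, text| "<#{type}>#{text}</#{type}>" }.join("\n")
end

# Generator: intermediate nodes -> Markdown.
def generate_markdown(nodes)
  nodes.map do |type, text|
    if type == :p
      text
    else
      level = type.to_s[1..-1].to_i   # :h2 -> 2
      "#" * level + " " + text
    end
  end.join("\n\n")
end

nodes = parse_textile_subset("h1. Title\n\nSome text.")
puts generate_html(nodes)
puts generate_markdown(nodes)
```

Adding a new source format means writing one more parser; adding a new
target means one more generator. Neither touches the other side.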

V.
Gary W. (Guest)
on 2007-04-12 01:11
(Received via mailing list)
On Apr 11, 2007, at 4:52 PM, Victor Zverok S. wrote:
> My point was, to have some intermediate format, and have couple of
> parsers
> TO this format and generators FROM it.

Seems like XHTML would be the obvious choice for the intermediate format,
no? Unless you want to reinvent that particular wheel.

Gary W.
Victor "Zverok" Shepelev (Guest)
on 2007-04-12 01:26
(Received via mailing list)
From: Gary W. [mailto:removed_email_address@domain.invalid]
Sent: Thursday, April 12, 2007 12:10 AM
>
>On Apr 11, 2007, at 4:52 PM, Victor Zverok S. wrote:
>> My point was, to have some intermediate format, and have couple of
>> parsers
>> TO this format and generators FROM it.
>
>Seems like XHTML would be the obvious choice for the intermediate
>format, no?
>Unless you want to reinvent that particular wheel.
>

<OT>
We Russians say "invent the bicycle". On Russian programming forums my
usual origin line is "Bikes forever!" :)
</OT>

The question, I think, is like XML vs. YAML/JSON. Just keep it simpler.

I mean, for conversions like Markdown <=> Textile, XHTML as intermediate
is slightly too funny.

The overall thought was "conversion of basic logical formatting"; thus,
the intermediate format should only handle basic features (at the level of
Textile-like formats, not bloated XHTML-like ones).

V.
Gary W. (Guest)
on 2007-04-12 01:33
(Received via mailing list)
On Apr 11, 2007, at 5:26 PM, Victor "Zverok" Shepelev wrote:
> I mean, for conversions like Markdown <=> Textile, XHTML as
> intermediate is
> slightly too funny.

Yes, but that is because XHTML is too funny.

Markdown exists to generate XHTML.
Textile exists to generate XHTML.

If you've got reverse translations also, then XHTML is *already* working
as the intermediate.  Why do you need yet another format?

Gary W.
Victor "Zverok" Shepelev (Guest)
on 2007-04-12 01:40
(Received via mailing list)
From: Gary W. [mailto:removed_email_address@domain.invalid]
Sent: Thursday, April 12, 2007 12:33 AM
>Textile exists to generate XHTML.
>
>If you've got reverse translations also, then XHTML is *already* working
>as the intermediate.  Why do you need yet another format?
>

Maybe you're right. I can only say that when generating other "rich
formats" (like PDF or OpenOffice), a textile->pdf conversion can be
simpler than XHTML->pdf (thus, it breaks the rule of "the single
intermediate format" through chains like
Markdown->XHTML->Textile->PDF). What do you think?

V.
Gary W. (Guest)
on 2007-04-12 02:00
(Received via mailing list)
On Apr 11, 2007, at 5:40 PM, Victor "Zverok" Shepelev wrote:
> May be you're right. Only can I say that for other "rich
> formats" (like PDF
> or OpenOffice) generation conversions textile->pdf can be simpler than
> XHTML->pdf (thus, it breaks rule for "the single intermediate format"
> through chains like Markdown->XHTML->Textile->PDF). What do you think?

I think that a translator that is designed specifically for X to Y will
always do better than a translator that goes through an intermediate
language.  A Russian to Spanish translation is going to be better than
a Russian to English to Spanish translation.  The benefit of an
intermediate language is that you don't need on the order of n^2
translators, only on the order of n. That doesn't mean that in some
special/common cases a direct translation might not be available and
might not make a better choice.
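
The arithmetic can be sketched quickly. This assumes every conversion is
one-way and that the intermediate format is separate from the n formats
themselves; the method names are only for illustration.

```ruby
# Number of one-way converters needed to connect n formats.

def direct_converters(n)
  n * (n - 1)   # one converter per ordered pair of distinct formats
end

def hub_converters(n)
  2 * n         # one parser into and one generator out of the hub per format
end

[3, 5, 8].each do |n|
  puts "#{n} formats: #{direct_converters(n)} direct, #{hub_converters(n)} via an intermediate"
end
```

With 3 formats the two approaches tie at 6 converters; at 8 formats it is
56 against 16.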

Gary W.
John J. (Guest)
on 2007-04-12 04:46
(Received via mailing list)
Valid, but not the same. Human languages leave lots of implicit
information that isn't easily machine-parsed, so that comparison doesn't
hold. But rich formats don't translate to formats that lack certain
capabilities, particularly PDF to XHTML or Markdown or almost anything
else. PDF is a pretty broad format, and layouts don't translate so easily.
Adobe would love to have such a capability working reliably: InDesign
could then produce layouts for both print and the web. Not likely.
Daniel DeLorme (Guest)
on 2007-04-12 06:09
(Received via mailing list)
Phillip G. wrote:
> ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
> string, and convert it into Textile.
>
> As a bonus, ClothRed will strip all HTML that is not being converted
> into Textile's markup from the text, making it, hopefully, usable for
> sanitizing HTML.

Looks interesting, but I hope there would be a mode to preserve unknown
HTML in addition to the "lossy" mode. Sanitizing HTML is good but if you
convert the resulting Textile to HTML and it doesn't look like the
original, that's not too good IMHO.

Daniel
Phillip G. (Guest)
on 2007-04-12 10:49
(Received via mailing list)
Daniel DeLorme wrote:

> Looks interesting, but I hope there would be a mode to preserve unknown
> HTML in addition to the "lossy" mode. Sanitizing HTML is good but if you
> convert the resulting Textile to HTML and it doesn't look like the
> original, that's not too good IMHO.

To do that, there'll probably be two different modes of HTML stripping:
* One "strict": Everything that cannot be parsed by ClothRed will be
thrown out.
* One "loose": All HTML that ClothRed cannot convert will be kept, and
warnings will be emitted (either to stdout, or stderr, or both).

The latter will not be usable for sanitizing HTML, as "unknown" HTML
*should* be treated as malicious (specifically, as there is no "unknown"
HTML in the W3C specs).
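
Roughly, the two modes could look like the sketch below. This is not
ClothRed's real interface: the method name, the `mode:` keyword and the
list of known tags are assumptions, and "thrown out" is taken to mean that
only the tag is removed while its text content stays.

```ruby
# Hypothetical sketch; not ClothRed's real interface.
KNOWN_TAGS = %w[p b strong i em].freeze

def strip_unknown_html(html, mode: :strict)
  html.gsub(%r{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    name = Regexp.last_match(1).downcase
    if KNOWN_TAGS.include?(name)
      tag                               # convertible tag: leave it for the converter
    elsif mode == :strict
      ""                                # strict: throw the tag away
    else
      warn "unknown tag kept: #{tag}"   # loose: keep it and warn on stderr
      tag
    end
  end
end

puts strip_unknown_html("<p>Hi <blink>there</blink></p>")
puts strip_unknown_html("<p>Hi <blink>there</blink></p>", mode: :loose)
```

In strict mode the `<blink>` pair disappears; in loose mode the input
comes back unchanged and two warnings go to stderr.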

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #33:

Don't waste time on writing test cases and test scripts - your users are
your best testers.
Phillip G. (Guest)
on 2007-04-12 10:57
(Received via mailing list)
Victor "Zverok" Shepelev wrote:

> Now, authors of all libraries are solving 2 problems - parse & generate. It
> should be nice to have one common HTML parser, which could be used either
> for HTML->Textile, or for HTML->Markdown (only generators will differ).

Well, ClothRed has to parse HTML to output Textile. It does no more
than that. If you plug it into a converter suite, you can use HTML as an
intermediary format (RedCloth can parse Textile and Markdown into HTML,
so you'd already have a small part of such a converter).

> From some point of view, we can use Textile as intermediate, your library
> would be "parser", RedCloth would be "generator". But this leaves Markdown,
> Rdoc "off the game", while we have no Markdown->Textile and similar
> convertors.

Granted, the scope of my library is limited, but purposefully so, to
keep it a) manageable and b) in line with my skills. Once
ClothRed is feature-complete, I can add to its functionality, but not
sooner, if I can avoid it.

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #8:

Open-Source is not a panacea.
Gregory S. (Guest)
on 2007-04-12 17:31
(Received via mailing list)
On Thu, Apr 12, 2007 at 03:56:50PM +0900, Phillip G. wrote:
> Victor "Zverok" Shepelev wrote:
> >Now, authors of all libraries are solving 2 problems - parse & generate. It
> >should be nice to have one common HTML parser, which could be used either
> >for HTML->Textile, or for HTML->Markdown (only generators will differ).
>
> Well, ClothRed has to parse HTML to output Textile. It does not more
> than that. If you plug it into a converter suit, you can use HTML as an
> intermediary format (RedCloth can parse Textile and Markdown into HTML,
> so you'd have already a little part of such a converter).

Are you using Hpricot for your parsing? If so, it should be pretty easy
to do the conversion. If not, why not? (Disclaimer: I've been following
the thread but haven't looked at or even installed/run the code.)

> > From some point of view, we can use Textile as intermediate, your library
> >would be "parser", RedCloth would be "generator". But this leaves Markdown,
> >Rdoc "off the game", while we have no Markdown->Textile and similar
> >convertors.
>
> Granted, the scope of my library is limited, but purposefully so, to
> keep it a) manageable, and b) keep it in line with my skills. Once
> ClothRed is feature-complete, I can add to its functionality, but not
> sooner, if I can avoid it.

I understand where you are with this. At the same time, I have an actual
need to do something very much like this in my own work. I suspect there
are others out there in a similar situation. We're hoping that this will
become useful to us sooner rather than later and that we can avoid
rolling our own.

> Phillip "CynicalRyan" Gawlowski
--Greg
Phillip G. (Guest)
on 2007-04-12 17:48
(Received via mailing list)
Gregory S. wrote:

> Are you using Hpricot for your parsing? If so, it should be pretty easy to
> do the conversion. If not, why not? (Disclaimer: I've been following the
> thread but haven't looked at or even installed/run the code.)

No, I'm not. I want to avoid dependencies as much as I can, so that
ClothRed can stand on its own as much as possible.

> I understand where you are with this. At the same time, I have an actual
> need to do something very much like this in my own work. I suspect there
> are others out there in a similar situation. We're hoping that this will
> become useful to us sooner rather than later and that we can avoid rolling
> our own.

Regarding the time frame, I'm trying to make it feature-complete as soon
as I can. There isn't much left to do for the core engine, and after
that I can pretty it up a bit (with rule sets and the like).

If all goes well, ClothRed will hit the big 1.0.0 at the weekend, as a
full HTML to Textile parser. After that, I'm very open to ideas
regarding its future.

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/
http://clothred.rubyforge.org

Rules of Open-Source Programming:

22. Backward compatibility is your worst enemy.

23. Backward compatibility is your users' best friend.