Truncating html text


#1

I’ve got a fairly basic problem here that I’m hoping there is an easy
solution for.

I have a chunk of html code that I want to truncate to a given length…
say 20 characters or so.

If I use the ‘truncate’ helper function I end up with unbalanced tags.

For example.

A really long string of words

becomes

A really long…

This is what happens when the string is run through the ‘truncate’
function: the closing tag is left off, causing untold trouble and chaos.
On top of that, the truncate function counts the characters in the tags,
so you end up getting somewhat less text than you asked for.
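
To make the two symptoms concrete, here is a minimal sketch in plain Ruby. The `<b>` tag and the sample string are assumptions chosen for illustration, and plain string slicing stands in for the Rails ‘truncate’ helper, which behaves similarly but also appends an omission string.

```ruby
# Hypothetical input: the <b> tag is an assumption, not taken from the post.
html = '<b>A really long string of words</b>'

# Naive cut at 20 characters, counting tag characters like any other.
cut = html[0, 20]

# Remove whole tags to see how much visible text actually survived.
visible = cut.gsub(/<[^>]*>/, '')

puts cut            # <b>A really long str
puts visible.length # 17
```

The cut string has lost its closing tag, and only 17 of the 20 requested characters are visible text.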

So… is there a way to truncate html text properly?

By this I mean a function or set of functions that returns a chunk of
html with the tags properly closed and where the length of the text
outside the tags is the specified amount.


#2

Kevin-

How about this:

truncate(html_text.gsub(/(<[^>]+>)/, ''), 20)

That uses a naive regex to remove the HTML tags from html_text and
passes the result to truncate with a length of 20.

Cheers-

-Ezra

On Dec 22, 2005, at 8:00 PM, Kevin O. wrote:


-Ezra Z.
WebMaster
Yakima Herald-Republic Newspaper
removed_email_address@domain.invalid
509-577-7732


#3

Try this:
http://www.bigbold.com/snippets/posts/show/295


#4

Try this:
http://www.bigbold.com/snippets/posts/show/295

@Ezra… I would like to retain the HTML formatting if possible.
Stripping them out would work, but then the formatting gets lost. Not
ideal, but functional.

Closing the broken tags might work. I need to see how this works if a
tag gets chopped in half.

Something like “My tag link</a…” might make
that algorithm upset. I’m still stuck with the fact that the truncated
length will be totally wrong.

Something to work with anyway, Thanks guys.

_Kevin


#5

Kevin O. wrote:

Try this:
http://www.bigbold.com/snippets/posts/show/295

@Ezra… I would like to retain the HTML formatting if possible.
Stripping them out would work, but then the formatting gets lost. Not
ideal, but functional.

Closing the broken tags might work. I need to see how this works if a
tag gets chopped in half.

Something like “My tag link</a…” might make
that algorithm upset.

A regex for removing open tags from the end should be quite trivial.
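
A minimal sketch of such a regex, assuming the cut string ends in an unfinished tag. The helper name `strip_partial_tag` and the sample link are hypothetical.

```ruby
# Hypothetical helper: drop an unfinished tag left at the end of a cut string.
def strip_partial_tag(fragment)
  # Match "<" followed by any run of non-">" characters up to the end of
  # the string, i.e. a tag that was opened but never finished.
  fragment.sub(/<[^>]*\z/, '')
end

puts strip_partial_tag('My tag <a href="/x">link</a')
# My tag <a href="/x">link
```

Note that this also removes a literal "<" near the end of the text if it was not escaped as `&lt;`, and it removes any omission marker that follows the broken tag.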

I’m still stuck with the fact that the truncated
length will be totally wrong.

You’ll probably have to write your own truncate function with
String#scan and make it count only non-tag characters.
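
One way this might look, as a rough sketch rather than a finished helper. The name `truncate_visible` and the default omission string are assumptions; the function only counts and cuts, so elements open at the cut point stay open.

```ruby
# Hypothetical helper, not part of Rails: cut HTML after `limit` visible
# characters, copying tags through without counting them.
def truncate_visible(html, limit, omission = '...')
  out = +''
  remaining = limit
  # Each token is either a complete tag or a run of text between tags.
  html.scan(/<[^>]*>|[^<]+/) do |token|
    if token.start_with?('<')
      out << token                  # tags are copied but never counted
    elsif token.length <= remaining
      out << token                  # the whole text run still fits
      remaining -= token.length
    else
      out << token[0, remaining] << omission
      break                         # budget used up, stop scanning
    end
  end
  out
end

puts truncate_visible('<p>A <b>really long</b> string of words</p>', 13)
# <p>A <b>really long</b>...
```

Entities such as `&amp;` are counted as several characters here, and the result above still has an unclosed `<p>`, so a separate step is needed to balance the tags.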


#6

Justin F. wrote:

Perhaps if you explained why you want to truncate an HTML string, that
would help…

regards

Justin

@Justin:
The goal is to have an ‘article’ model. I would like to have the ‘list’
view generate a brief excerpt of the article body as a teaser. For now
the text itself is being generated from text using textile. I have
considered simply truncating the textile source and then generating html
from that, but you run into similar problems with unbalanced decorations
(sort of like my Christmas tree).

@Andreas, yes, removing the malformed tag at the end is easy. The rest
of it is a bit tricky, but I am making progress. It is a good learning
exercise for regex judo.

_Kevin


#7

Kevin O. wrote:

The goal is to have an ‘article’ model. I would like to have the ‘list’
view generate a brief excerpt of the article body as a teaser. For now
the text itself is being generated from text using textile. I have
considered simply truncating the textile source and then generating html
from that, but you run into similar problems with unbalanced decorations
(sort of like my Christmas tree).

Thanks, that’s useful. Have you looked at the feasibility of altering
the textile-to-html conversion, so that it works with a bound on the
number of content characters? On reaching the bound, it would just need
to emit closing tags for all currently unclosed HTML tags.
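
The closing-tag step can be sketched on its own. This does not modify the Textile converter; it post-processes an already cut HTML fragment instead, which is a swapped-in technique. The helper name `close_open_tags` and the short list of void elements are assumptions.

```ruby
# Hypothetical helper: append closing tags for elements still open in `fragment`.
def close_open_tags(fragment)
  void_tags = %w[br hr img input meta link]   # elements that never get a closing tag
  open_tags = []                              # stack of currently open element names

  fragment.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>}) do |closing, name, self_closing|
    name = name.downcase
    next if void_tags.include?(name) || self_closing == '/'

    if closing == '/'
      # Pop back to the most recent matching opener, if there is one.
      idx = open_tags.rindex(name)
      open_tags.slice!(idx..-1) if idx
    else
      open_tags.push(name)
    end
  end

  # Close whatever is still open, innermost element first.
  fragment + open_tags.reverse.map { |tag| "</#{tag}>" }.join
end

puts close_open_tags('<p>A <b>really...')
# <p>A <b>really...</b></p>
```

The stack keeps the nesting order, so the closing tags come out innermost first. A regex-based scan like this is only a sketch and will be confused by comments, scripts, or a ">" inside an attribute value.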

regards

Justin


#8

Justin F. wrote:

Thanks, that’s useful. Have you looked at the feasibility of altering
the textile-to-html conversion, so that it works with a bound on the
number of content characters? On reaching the bound, it would just need
to emit closing tags for all currently unclosed HTML tags.

regards

Justin

Thanks, that’s a good suggestion. This may solve my immediate problem
so long as I continue to use textile. However, I’m still interested in
finding a more general solution to the problem.

_Kevin


#9

Kevin O. wrote:

Something like “My tag link</a…” might make
that algorithm upset. I’m still stuck with the fact that the truncated
length will be totally wrong.

Something to work with anyway, Thanks guys.

Perhaps if you explained why you want to truncate an HTML string, that
would help…

regards

Justin