Parsing html tags to ruby characters

karthikeyan · May 20, 2008, 7:03am

Hi,

I have a string containing some ruby code and html tags in-between.
For example,

str = “require 'my_class.rb’
require
‘your_class.rb’
:key=>‘hello’”

I want these HTML tags and entities ('<br>', '&nbsp;', '&gt;', '&lt;', etc.)
to be replaced by the equivalent Ruby characters ("\n", " ", ">", "<", etc.).

These html tags can change dynamically according to the inputs.

Is there any way to parse these html tags to equivalent ruby characters?

Thanks in advance…

karthikeyan · May 20, 2008, 7:23am

string = “
><

”
string.gsub!("tag", "replacement")

I think you get the idea.


–
Appreciated my help?
Recommend me on Working With Rails
http://workingwithrails.com/person/11030-ryan-bigg

karthikeyan · May 20, 2008, 7:58am

Thanks Ryan. But I can't predict which tags I will be getting, because
they are dynamic: any possible tag can appear. If I have to use the
'gsub' method, I will have to write a call for each and every HTML tag,
and that will get big.

So I am looking for an easier way to implement this (something like an
HTML parser).

karthikeyan · May 20, 2008, 8:05am

You never specified what you wanted the

and tags replaced
with
either.


karthikeyan · May 20, 2008, 10:16am

On 20 May 2008, at 07:12, Karthi kn wrote:

Sorry. That's my mistake. The final thing I want from the string is
runnable Ruby code. So

and tags can be removed from the
string without any replacement.

Now I think the only way to implement this is to use the 'gsub' method
for each and every possible tag.

Well, assuming the only tag with special meaning is <br>, you can
just convert entities to their respective characters (there are tables
of these), convert <br> to "\n", and then replace every other tag with ''.
No need for one regexp per tag for that!

Fred
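
A minimal sketch of the approach Fred describes, in Ruby. The helper name html_to_ruby_source and the sample input are made up for illustration, and only four named entities are handled. Note one deliberate difference: this sketch strips tags before decoding entities, so that a decoded "<" from Ruby code is not mistaken for the start of a tag.

```ruby
# Hypothetical helper (name not from the thread): turn HTML-wrapped Ruby
# source back into plain text.
def html_to_ruby_source(markup)
  text = markup.gsub(/<br\s*\/?>/i, "\n")    # <br>, <br/> and <br /> become newlines
  text = text.gsub(/<\/?[a-z][^>]*>/i, '')   # every other tag is dropped, attributes included
  # Decode a few named entities; &amp; goes last so "&amp;lt;" ends up as "&lt;".
  text.gsub('&nbsp;', ' ').gsub('&gt;', '>').gsub('&lt;', '<').gsub('&amp;', '&')
end

puts html_to_ruby_source("<p>require 'my_class.rb'<br>{:key&nbsp;=&gt;&nbsp;'hello'}</p>")
```

Two regular expressions cover all tags here, so there is no need for one gsub call per tag name.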

karthikeyan · May 20, 2008, 10:42am

But "&gt;" and "&lt;" need to be replaced with ">" and "<" respectively,
because I will have some Ruby hash code in the string.

Also I need to find out all the html tags in that string. Is there any
way to find that?
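
One possible way to list the tags, sketched with String#scan. The sample string is made up, and the pattern is a rough approximation of HTML tag syntax rather than a real parser.

```ruby
markup = "<p>a = {:key=&gt;'hello'}<br>b = 1 &lt; 2</p>"  # made-up sample input

# Capture the name of every opening or closing tag, then de-duplicate.
tag_names = markup.scan(/<\/?([A-Za-z][A-Za-z0-9]*)[^>]*>/).flatten.map(&:downcase).uniq

p tag_names  # prints ["p", "br"]
```

Entities such as &lt; are not reported, because the pattern only looks for a literal "<".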

karthikeyan · May 20, 2008, 8:12am

Sorry. That's my mistake. The final thing I want from the string is
runnable Ruby code. So

and tags can be removed from the
string without any replacement.

Now I think the only way to implement this is to use the 'gsub' method
for each and every possible tag.

karthikeyan · May 20, 2008, 11:59am

On 20 May 2008, at 09:42, Karthi kn wrote:

But "&gt;" and "&lt;" need to be replaced with ">" and "<" respectively,
because I will have some Ruby hash code in the string.

I'm not seeing the problem. Replace entities and then look for
everything between < and >. Change it to a newline if it's a br, or
just replace it with blank and add it to your list of html tags.
Fred
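
The tag-handling half of this suggestion could be done in a single pass with the block form of gsub. This is only a sketch: strip_tags_and_collect is a hypothetical name, the sample input is made up, and entity decoding is left out.

```ruby
# Hypothetical helper: returns the cleaned text plus the list of tag names seen.
def strip_tags_and_collect(markup)
  seen = []
  text = markup.gsub(/<\/?([A-Za-z][A-Za-z0-9]*)[^>]*>/) do
    name = Regexp.last_match(1).downcase
    seen << name unless seen.include?(name)  # remember each tag name once
    name == 'br' ? "\n" : ''                 # br becomes a newline, anything else vanishes
  end
  [text, seen]
end

text, tags = strip_tags_and_collect("<div>puts 1<br/>puts 2</div>")
puts text
p tags
```

The block is called once per matched tag, so the replacement and the bookkeeping happen together.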

karthikeyan · May 21, 2008, 1:30pm

Thanks for your replies. I have done what I wanted. The following is the
code for that.

markup = markup.gsub('<br>', "\n")              # line breaks become newlines
markup = markup.gsub(/<\/*[A-Za-z0-9]*>/, '')   # strip remaining simple tags (no attributes)
markup = markup.gsub('&gt;', ">")
markup = markup.gsub('&lt;', "<")
markup = markup.gsub('&nbsp;', " ")
markup = markup.gsub('&amp;', "&")              # decoded last, so "&amp;lt;" stays "&lt;"

It’s working fine now. But I am not sure whether I have covered all the
tags and characters or not.

karthikeyan · May 21, 2008, 1:53pm

On 21 May 2008, at 12:30, Derbee Don wrote:

markup = markup.gsub('&amp;', "&")

It’s working fine now. But I am not sure whether I have covered all
the
tags and characters or not.

It depends on what you are trying to do. There are far more HTML entities
than that (a partial list is here:
http://www.w3schools.com/tags/ref_entities.asp),
and of course there are the Unicode-style ones
(http://theorem.ca/~mvcorks/code/charsets/auto.html).

Fred
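
To illustrate the point about the wider range of entities, here is a rough sketch that decodes numeric character references (decimal and hexadecimal) alongside a small named-entity table. The constant NAMED_ENTITIES and the method decode_entities are hypothetical names, and the table is deliberately incomplete.

```ruby
# Deliberately incomplete table (assumption: extend it from an entity reference as needed).
# &nbsp; is mapped to a plain space because the goal here is runnable Ruby source.
NAMED_ENTITIES = { 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'nbsp' => ' ' }.freeze

def decode_entities(text)
  text.gsub(/&(?:#x([0-9A-Fa-f]+)|#(\d+)|([A-Za-z]+));/) do |match|
    hex, dec, name = Regexp.last_match.captures
    if hex
      [hex.to_i(16)].pack('U')            # e.g. &#x3E; becomes ">"
    elsif dec
      [dec.to_i].pack('U')                # e.g. &#60; becomes "<"
    else
      NAMED_ENTITIES.fetch(name, match)   # unknown names are left untouched
    end
  end
end

puts decode_entities("{:key&nbsp;=&gt;&nbsp;'hi'} &#60; &#x3E; &unknown;")
```

Because everything is decoded in one pass, input such as "&amp;lt;" comes out as "&lt;" rather than being decoded twice.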

karthikeyan · May 21, 2008, 4:57pm

I have a very, very strong suspicion that the need is only to
translate character encoding (e.g., '&amp;' => '&').

It might be worth considering iterating over an array of hashes rather
than repeating the same code with different parameters:

[{:regex=>/<br>/, :decoded=>"\n"},
 {:regex=>/<\/*[A-Za-z0-9]*>/, :decoded=>''},
 {:regex=>/&gt;/, :decoded=>'>'}
 # …
].each do |decoding_hash|
  markup.gsub!(decoding_hash[:regex], decoding_hash[:decoded])
end

The advantage is in keeping the code DRY and making the intentions of
the block a bit clearer.
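
A complete, non-mutating variant of the same idea might look like the sketch below. DECODING_RULES and decode are hypothetical names, and the rule list is only an assumption based on the entities mentioned earlier in the thread.

```ruby
# Ordered [pattern, replacement] pairs; order matters (tags first, &amp; last).
DECODING_RULES = [
  [/<br\s*\/?>/i, "\n"],
  [/<\/?[a-z][^>]*>/i, ''],
  ['&nbsp;', ' '],
  ['&gt;', '>'],
  ['&lt;', '<'],
  ['&amp;', '&']
].freeze

# Apply each rule in turn, threading the result through reduce.
def decode(markup)
  DECODING_RULES.reduce(markup) { |text, (pattern, replacement)| text.gsub(pattern, replacement) }
end

puts decode("<b>x&nbsp;=&gt;&nbsp;1</b><br>")
```

Adding support for another entity then means adding one row to the table rather than another gsub line.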
