Replace string between xml tags that contains special characters

luislavena · July 15, 2011, 12:41pm

I am trying to get rid of a string in an XML file that contains some
special characters. I want it to be transformed from the following:

<message value="teststr: %F0wt^϶b%99%90%94%D4N%8D%FA%8A%EE%81_
ޢg%9B@I%E3%F6%FCp%AFX%BD%80%91%B5pEK%C9!j%D3%F3SY%C3%F6B~%C8%FC
^%87%C4%F2]! %B9%DF=%E7Y%B9element:
%F0wt^϶b%99%90%94%D4N%8D%FA%8A%EE%81_
ޢg%9B@I%E3%F6%FCp%AFX%BD%80%91%B5pEK%C9!j%D3%F3SY%C3%F6B~%C8%FC
^%87%C4%F2]! %B9%DF=%E7Y%B9"

to

<message value="Validating element:"

I tried using gsub() with regex but so far haven’t been successful.

Can someone help me with this?

Thanks in Advance,
Rousan

rousan · July 15, 2011, 5:23pm

Rousan M. wrote in post #1010924:

> I tried using gsub() with regex but so far haven’t been successful.

Can you show what you tried, i.e. the code?

> Can someone help me with this?

Certainly.

Cheers

robert

rousan · July 15, 2011, 5:33pm

On Fri, Jul 15, 2011 at 07:41:17PM +0900, Rousan M. wrote:

> to
>
> <message value="Validating element:"
>
> I tried using gsub() with regex but so far haven’t been successful.

It seems to me you should use the non-greedy modifier, ?, after .* so
that it matches as few characters as possible up to a particular
matching string. Why don’t you share your attempt at a regex, so we can
offer modifications to yours rather than just providing a complete
solution from scratch?

This page offers some information about special characters in regexen:

http://www.zenspider.com/Languages/Ruby/QuickRef.html#12

The non-greedy modifier can be found easily by doing a text search on
that page for “non-greedy”.
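
To illustrate the difference between greedy and non-greedy matching, here
is a minimal sketch. The sample line and the replacement text are made up
for the example and are not taken from the original file.

```ruby
# Hypothetical sample line with two <message> elements on it.
line = '<message>Validating element: %F0wt%99</message>' \
       '<message>Validating element: %B9%DF</message>'

# Greedy: .* runs to the LAST </message>, so both elements collapse into one.
greedy = line.gsub(/<message>.*<\/message>/, '<message>TEXT REMOVED</message>')

# Non-greedy: .*? stops at the FIRST </message>, so each element is replaced separately.
lazy = line.gsub(/<message>.*?<\/message>/, '<message>TEXT REMOVED</message>')

puts greedy
puts lazy
```

The greedy version leaves a single element on the line, while the
non-greedy version leaves two.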

rousan · July 16, 2011, 2:18am

Hi Rousan,

An example of what you’ve tried would make things much better and
easier (no one wants to do your work for you!). I had a similar issue
dealing with parsing HTML pages, and I wound up writing a method for the
String class which replaces the text between one marker within a string
and another. It takes 3 or 4 arguments: the 1st is a sub-string that is
the starting point marker, the 2nd a sub-string that is the end point
marker, the 3rd the new text to go between the markers, and an optional
4th which makes the method global. It happens to work with your example.

Some things to maybe think about:

Convert the first two arguments to Regexps. The =~ operator will give
you the index of your Regexp within the main string, which can be very
useful. I make a range between the index of the first marker and the
second (actually the index of the end of the first and the beginning of
the second, but you get the idea), iterate through each index of the
string between them to build the string to be replaced, and then use
#sub! (or #gsub! if global is true) to replace it with the 3rd argument
of the method.

I’m sure that other folks have come up with better ways to do this as
well… show us what you’re working with!
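
A rough sketch of that kind of helper might look like the following. The
name replace_between, its signature, and the sample line are assumptions
made for illustration. It is written as a standalone method rather than
a String method, and it builds one non-greedy Regexp from the two markers
instead of walking the indexes by hand, so it is not the method described
above, only the same idea.

```ruby
# Hypothetical helper: replaces the text between start_marker and end_marker
# with new_text and keeps both markers. Only the first occurrence is changed
# unless global is true.
def replace_between(text, start_marker, end_marker, new_text, global = false)
  # Escape the markers so characters such as [ or . are matched literally.
  # The m flag lets . match newlines, so the span may cover several lines.
  pattern = /#{Regexp.escape(start_marker)}.*?#{Regexp.escape(end_marker)}/m
  replacement = "#{start_marker}#{new_text}#{end_marker}"
  # The block form stops backslashes in new_text being read as backreferences.
  if global
    text.gsub(pattern) { replacement }
  else
    text.sub(pattern) { replacement }
  end
end

# Made-up sample line in the shape of the poster's data.
line = '<message>Validating element: %F0wt^%99%90</message>'
puts replace_between(line, '<message>', '</message>', 'TEXT REMOVED')
```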

j

rousan · July 17, 2011, 9:34pm

Thank you all for the responses.
I will try to elaborate on this. All I want is to parse an XML file
which contains special characters. But my parser fails because it cannot
open the file correctly (because of the special characters). So I
decided to write the file contents to a new file, removing the special
characters, and then parse that. My sample input file (input.xml) with
special characters is as follows:

Validating element: %F0wt^϶b%99%90%94%D4N%8D%FA%8A%EE%81_
ޢg%9B@I%E3%F6%FCp%AFX%BD%80%91%B5pEK%C9!j%D3%F3SY%C3%F6B~%C8%FC
^%87%C4%F2]! %B9%DF=%E7Y%B9

My desired output file (output.xml) should be something like this:

TEXT REMOVED

I have the following code in place in an attempt to do this:

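fin = File.new("input.xml", "rb")
fout = File.new("output.xml", "w")
# Read the input line by line and write the cleaned line to the output.
while (line = fin.gets)
  # Replaces the whole element, tags included, with the fixed text.
  temp = line.gsub(/<message>Validating element:(.*?)<\/message>/,
                   'TEXT REMOVED')
  fout.puts temp
end
fin.close
fout.close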

But the above code replaces the “…” content altogether
and my output.xml file is:

My problem is solved, as I am not using the message tag in my parser. But
ideally I want to remove only the content between the message tags
without removing the tags altogether. If anyone knows how to do it
(preferably in a single line), please share it with me.

Thanks in Advance,
Rousan

rousan · July 17, 2011, 9:54pm

On Sun, Jul 17, 2011 at 12:34 PM, Rousan M. wrote:

ޢg%9B@I%E3%F6%FCp%AFX%BD%80%91%B5pEK%C9!j%D3%F3S Y%C3%F6B~%C8%FC
^%87%C4%F2]! %B9%DF=%E7Y%B9

This sample is not like your original example, which wasn’t even
valid XML. However, if you’re working with XML you shouldn’t be
wasting time with any regex-based approach. Use nokogiri, which
can parse the above example just fine, and with which you can
easily accomplish your goal.
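
As a rough illustration of the parser-based approach, here is a minimal
sketch. It uses REXML, which ships with Ruby, instead of nokogiri, only
so the example has no extra dependencies. The input string, the log
wrapper element and the other element are made up for the example.

```ruby
require 'rexml/document'

# Made-up input; the <log> wrapper and <other> element are assumptions.
xml = '<log><message>Validating element: %F0wt^%99%90</message>' \
      '<other>keep</other></log>'

doc = REXML::Document.new(xml)

# Replace the text of every <message> element; the tags themselves stay in place.
REXML::XPath.each(doc, '//message') { |element| element.text = 'TEXT REMOVED' }

result = doc.to_s
puts result
```

With nokogiri the same idea would be Nokogiri::XML(xml), then setting
content= on each node returned by xpath('//message'). Either way, this
assumes the file is well-formed XML that the parser can read. Raw invalid
bytes may still need cleaning up first.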
