Forum: Ruby parse XML string

Posted by Prog Rammer (proggrammer)
on 2012-10-01 23:32
How do I parse a string like this?

Suppose the input is str="<a>1</a><a>22</a><a>3</a>"
I want str_a=["<a>1</a>","<a>22</a>","<a>3</a>"]
How can I get str_a from str?
Posted by Sam Duncan (Guest)
on 2012-10-01 23:52
(Received via mailing list)
On 10/02/2012 10:32 AM, ajay paswan wrote:
> How to parse a string like:
>
> suppose input is str="<a>1</a><a>22</a><a>3</a>"
> I want str_a=["<a>1</a>","<a>22</a>","<a>3</a>"]
> How can I get str_a from str?
>
1.9.3p125 :002 > str.scan /<a>\d+<\/a>/
  => ["<a>1</a>", "<a>22</a>", "<a>3</a>"]

Sam
Posted by Prog Rammer (proggrammer)
on 2012-10-01 23:58
Sam Duncan wrote in post #1078251:
> On 10/02/2012 10:32 AM, ajay paswan wrote:
>> How to parse a string like:
>>
>> suppose input is str="<a>1</a><a>22</a><a>3</a>"
>> I want str_a=["<a>1</a>","<a>22</a>","<a>3</a>"]
>> How can I get str_a from str?
>>
> 1.9.3p125 :002 > str.scan /<a>\d+<\/a>/
>   => ["<a>1</a>", "<a>22</a>", "<a>3</a>"]
>
> Sam

What if: str="<a>kl1</a><a>22ik</a><a>3o</a>" ?
Posted by Sam Duncan (Guest)
on 2012-10-02 00:04
(Received via mailing list)
On 10/02/2012 10:58 AM, ajay paswan wrote:
>>
>> Sam
> What if: str="<a>kl1</a><a>22ik</a><a>3o</a>" ?
>
1.9.3p125 :002 > str.scan /<a>[[:alnum:]]+<\/a>/
  => ["<a>kl1</a>", "<a>22ik</a>", "<a>3o</a>"]
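(A minimal sketch of the same idea as a script rather than an irb session. The helper name alnum_a_tags is hypothetical and not from the thread; it only wraps the scan call above so the two patterns can be compared side by side.)

```ruby
# Hypothetical helper: return every <a>...</a> element whose content
# consists only of letters and digits.
def alnum_a_tags(str)
  # [[:alnum:]] is the POSIX bracket expression for letters and digits;
  # the + requires at least one such character between the tags.
  str.scan(/<a>[[:alnum:]]+<\/a>/)
end

str = "<a>kl1</a><a>22ik</a><a>3o</a>"

# Matches all three elements.
puts alnum_a_tags(str).inspect

# The digit-only pattern from the earlier reply finds nothing in this input,
# because none of the elements contains digits only.
puts str.scan(/<a>\d+<\/a>/).inspect
```

Note that this pattern still skips any element whose content includes spaces or punctuation.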

Ping pong

I think perhaps you should read up on regular expressions in Ruby =]

Sam
Posted by Prog Rammer (proggrammer)
on 2012-10-02 00:11
Sam Duncan wrote in post #1078254:
> On 10/02/2012 10:58 AM, ajay paswan wrote:
>>>
>>> Sam
>> What if: str="<a>kl1</a><a>22ik</a><a>3o</a>" ?
>>
> 1.9.3p125 :002 > str.scan /<a>[[:alnum:]]+<\/a>/
>   => ["<a>kl1</a>", "<a>22ik</a>", "<a>3o</a>"]
>
> Ping pong
>
> I think perhaps you should read up on regular expressions in Ruby =]
>
> Sam

I have gone through it previously but could not figure it out. Does '.'
denote any character?
Posted by Sam Duncan (Guest)
on 2012-10-02 00:27
(Received via mailing list)
On 10/02/2012 11:11 AM, ajay paswan wrote:
>> I think perhaps you should read up on regular expressions in Ruby =]
>>
>> Sam
> I have gone through it previously but could not figure it out. Does '.'
> denote any character?
>
There are at least three articles on the Internet about regular
expressions. Have a hunt for one that makes sense to you specifically =]

Sam
Posted by Robert Klemme (robert_k78)
on 2012-10-02 23:04
(Received via mailing list)
On Mon, Oct 1, 2012 at 11:32 PM, ajay paswan <lists@ruby-forum.com> 
wrote:
> How to parse a string like:
>
> suppose input is str="<a>1</a><a>22</a><a>3</a>"
> I want str_a=["<a>1</a>","<a>22</a>","<a>3</a>"]
> How can I get str_a from str?

I'd use a proper XML or HTML processing tool for that.  Then you can
search with XPath '//a'.

$ irb19 -r nokogiri
irb(main):001:0> str = "<a>1</a><a>22</a><a>3</a>"
=> "<a>1</a><a>22</a><a>3</a>"
irb(main):004:0> dom = Nokogiri.HTML str
=> #<Nokogiri::HTML::Document:0x439c00c name="document"
children=[#<Nokogiri::XML::DTD:0x439bdd2 name="html">,
#<Nokogiri::XML::Element:0x439b940 name="html"
children=[#<Nokogiri::XML::Element:0x439b80a name="body"
children=[#<Nokogiri::XML::Element:0x439b6b6 name="a"
children=[#<Nokogiri::XML::Text:0x439b53a "1">]>,
#<Nokogiri::XML::Element:0x439b3be name="a"
children=[#<Nokogiri::XML::Text:0x439b27e "22">]>,
#<Nokogiri::XML::Element:0x439b152 name="a"
children=[#<Nokogiri::XML::Text:0x439b030 "3">]>]>]>]>
irb(main):005:0> str_a = dom.xpath '//a'
=> [#<Nokogiri::XML::Element:0x439b6b6 name="a"
children=[#<Nokogiri::XML::Text:0x439b53a "1">]>,
#<Nokogiri::XML::Element:0x439b3be name="a"
children=[#<Nokogiri::XML::Text:0x439b27e "22">]>,
#<Nokogiri::XML::Element:0x439b152 name="a"
children=[#<Nokogiri::XML::Text:0x439b030 "3">]>]
irb(main):006:0> str_a.size
=> 3

Kind regards

robert5
Posted by Ryan Davis (Guest)
on 2012-10-03 01:01
(Received via mailing list)
On Oct 1, 2012, at 15:11 , ajay paswan <lists@ruby-forum.com> wrote:

>>
>> I think perhaps you should read up on regular expressions in Ruby =]
>>
>> Sam
>
> I have gone through it previously but could not figure it out. Does '.'
> denote any character?

http://www.zenspider.com/Languages/Ruby/QuickRef.h...
Posted by Brian Candler (candlerb)
on 2012-10-04 08:18
ajay paswan wrote in post #1078256:
> Sam Duncan wrote in post #1078254:
>> On 10/02/2012 10:58 AM, ajay paswan wrote:
>>>>
>>>> Sam
>>> What if: str="<a>kl1</a><a>22ik</a><a>3o</a>" ?
>>>
>> 1.9.3p125 :002 > str.scan /<a>[[:alnum:]]+<\/a>/
>>   => ["<a>kl1</a>", "<a>22ik</a>", "<a>3o</a>"]
>>
>> Ping pong
>>
>> I think perhaps you should read up on regular expressions in Ruby =]
>>
>> Sam
>
> I have gone through it previously but could not figure it out. Does '.'
> denote any character?

Yes ('.' matches any character except a newline, unless the /m flag is
used), and .* means zero or more of any character, so you might think of
<a>.*</a> to match an opening tag, followed by any text, followed by a
closing tag.

However this won't work the way you expect, because .* will match the 
largest amount of text it can while still matching the rest of the 
pattern.

>> str="<a>kl1</a><a>22ik</a><a>3o</a>"
=> "<a>kl1</a><a>22ik</a><a>3o</a>"
>> str.scan /<a>.*<\/a>/
=> ["<a>kl1</a><a>22ik</a><a>3o</a>"]

That is: the opening tag is <a>, the content is kl1</a><a>22ik</a><a>3o, 
and the closing tag is </a>. You probably hadn't thought of it like that 
:-)

You can fix this using .*?, which will consume the smallest amount of 
text it can while still matching the rest of the pattern.

>> str.scan /<a>.*?<\/a>/
=> ["<a>kl1</a>", "<a>22ik</a>", "<a>3o</a>"]
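Another option, sketched here as an alternative rather than something already suggested in this thread, is a negated character class: [^<]* matches any run of characters other than '<', so the match cannot run past the start of the next tag. The helper name a_tags_negated is hypothetical.

```ruby
# Hypothetical helper: split a run of flat <a>...</a> elements using a
# negated character class instead of a non-greedy quantifier.
def a_tags_negated(str)
  # [^<]* matches zero or more characters that are not '<', so the content
  # part stops at the first '<', which must then begin the closing </a>.
  str.scan(/<a>[^<]*<\/a>/)
end

puts a_tags_negated("<a>kl1</a><a>22ik</a><a>3o</a>").inspect
```

Unlike .*?, this pattern will not match an element that contains another tag inside it, which is one more sign that regular expressions only fit the simplest cases.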

But as has been pointed out, regular expressions are not the right way 
to parse XML. Use a library specifically designed for XML parsing.
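As a rough sketch of what that could look like without installing Nokogiri, here is a version using REXML, assuming the rexml gem that ships with standard Ruby installations is available. The helper name a_elements and the wrapper element name "root" are hypothetical; the wrapper is needed because an XML document must have a single root element and the input here is a run of siblings.

```ruby
require "rexml/document"

# Hypothetical helper: parse a run of <a> elements and return each one
# serialized back to a string.
def a_elements(str)
  # Wrap the fragment in an arbitrary root element so it is well-formed XML.
  doc = REXML::Document.new("<root>#{str}</root>")
  # '//a' is the same XPath used in the Nokogiri example earlier in the thread.
  REXML::XPath.match(doc, "//a").map(&:to_s)
end

puts a_elements("<a>1</a><a>22</a><a>3</a>").inspect
```

Because a real parser does the work, the content of each element can be anything that is valid XML, and malformed input raises a parse error instead of being silently skipped.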