Forum: Ruby Using XPath: find last text node of each paragraph under the root node.

Diego (Guest)
on 2008-11-03 06:50
(Received via mailing list)
I want to trim trailing whitespace at the end of all XHTML paragraphs.
I am using the REXML library.

Say I have the following in a valid XHTML file:

<p>hello <span>world</span> a </p>
<p>Hi there </p>
<p>The End </p>

I want to end up with this:

<p>hello <span>world</span> a</p>
<p>Hi there</p>
<p>The End</p>

So I was thinking that I could use XPath to get just the text nodes
that I want and then trim their text, which would give me the result
shown above.

I started with the following XPath: //root/p/child::text()

Of course, the problem here is that it returns all text nodes that are
children of all p-tags, namely:

'hello '
' a '
'Hi there '
'The End '

Trying the following XPath gives me only the last text node of the last
paragraph, not the last text node of each paragraph that is a child of
the root node.

//root/p/child::text()[last()]

This only returns: 'The End '

What I would like to get from the XPath is therefore:

' a '
'Hi there '
'The End '

I have tried //root//p/child::text()[last()] on other XPath parsers,
and it works. Could just be a bug (or different interpretation of the
rules) by REXML?

Cheers, Diego
Robert K. (Guest)
on 2008-11-03 12:41
(Received via mailing list)
2008/11/3 Diego <removed_email_address@domain.invalid>:
>
> Of course, the problem here is that it returns all text nodes that are
>
> I have tried //root//p/child::text()[last()] on other XPath parsers,
> and it works. Could just be a bug (or different interpretation of the
> rules) by REXML?

Could well be both.  When I try '//p/text()[last()]' I get only the
last node of the whole document. The issue seems to be the binding of
last() i.e. which collection it references or when it is applied. I
lean towards the bug variant.

One workaround would be to use a two step approach, i.e. first select
all <p> and then the last text:

irb(main):062:0> doc.elements.each('//p') { |x| REXML::XPath.each(x, 'text()[last()]') { |t| p t } }
" a "
"Hi there "
"The End "
=> [<p> ... </>, <p> ... </>, <p> ... </>]
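
The two-step idea above can also be written as a small standalone script that performs the trimming. This is only a sketch under some assumptions: the `<root>` wrapper and the helper name `trim_paragraph_tails` are made up for illustration, and the last text child is picked with REXML's `Element#texts` rather than `text()[last()]`, so the result does not depend on how REXML evaluates `last()`.

```ruby
require 'rexml/document'

# Sample input from the question, wrapped in an assumed <root> element.
XHTML = <<~XML
  <root>
  <p>hello <span>world</span> a </p>
  <p>Hi there </p>
  <p>The End </p>
  </root>
XML

# Hypothetical helper: for each <p> directly under the root, strip trailing
# whitespace from the last text node that is a direct child of that <p>.
def trim_paragraph_tails(doc)
  doc.elements.each('/root/p') do |para|
    last_text = para.texts.last        # direct text children only
    next if last_text.nil?             # paragraph without any text child
    last_text.value = last_text.value.rstrip
  end
  doc
end

doc = REXML::Document.new(XHTML)
trim_paragraph_tails(doc)

# Print each paragraph after trimming.
doc.root.elements.to_a('p').each { |para| puts para }
```

With the sample input this is meant to print the three trimmed paragraphs from the original post. Note that it trims the last text child even when an element follows it inside the paragraph, which may or may not be what is wanted.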

Kind regards

robert
Mark T. (Guest)
on 2008-11-03 18:02
(Received via mailing list)
> I have tried //root//p/child::text()[last()] on other XPath parsers,
> and it works. Could just be a bug (or different interpretation of the
> rules) by REXML?

Sounds like a bug to me too. Is there a reason you don't want to use a
parser like libxml-ruby, which is fully XPath 1.0 compliant (and will
give you a speed boost as well)?

-- Mark.
Robert K. (Guest)
on 2008-11-03 19:15
(Received via mailing list)
On 03.11.2008 16:59, Mark T. wrote:
> Is there a reason you don't want to use a
> parser like libxml-ruby, which is fully XPath 1.0 compliant (and will
> give you a speed boost as well)?

I can't speak for Diego but I use REXML because it's there and I do not
have to satisfy serious performance requirements when doing XML
processing.

Cheers

  robert
Diego (Guest)
on 2008-11-05 01:05
(Received via mailing list)
@Robert: As you suggest, a workaround is in order. I had a look at
other XPath implementations and they were returning what I originally
expected. Just not REXML. At least now I am sure it's not just me. :)

@Mark: I would consider something other than REXML if, for example, I
had performance issues using a workaround, or if there were no easy
workaround. As Robert commented, I use it because it's there, and I
had not run into any real problems prior to this, so I was happy with
it. Certainly if it were a show-stopper I would switch to
something else. But thanks for the recommendation. Something to keep
in mind.

Cheers.