Using XPath: find last text node of each paragraph under the root node

Diego · November 3, 2008, 5:50am

I want to trim trailing whitespace at the end of all XHTML paragraphs.
I am using the REXML library.

Say I have the following in a valid XHTML file:

hello world a

Hi there

The End

I want to end up with this:

hello world a

Hi there

The End

So I was thinking that I could use XPath to get just the text nodes
that I want, then trim the text, which would give me the result shown
above.

I started with the following XPath: //root/p/child::text()

Of course, the problem here is that it returns all text nodes that are
children of all p-tags. Which is this:

'hello '
' a '
'Hi there '
'The End '

Trying the following XPath gives me the last text node of the last
paragraph. Not the last text node of each paragraph that is a child of
the root node.

//root/p/child::text()[last()]

This only returns: 'The End '

What I would like to get from the XPath is therefore:

' a '
'Hi there '
'The End '
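
For reference, the same selection can be written in plain Ruby without a positional predicate. This is only a sketch: the sample markup is hypothetical (the `<b>` element in particular is an assumption, since the original markup is not shown here), and it relies on REXML's `Element#texts` to list an element's direct text children.

```ruby
require 'rexml/document'

# Hypothetical sample input; the <b> element is an assumption.
doc = REXML::Document.new(
  '<root><p>hello <b>world</b> a </p><p>Hi there </p><p>The End </p></root>'
)

# For each <p> directly under the root, take its last direct text child.
# compact drops paragraphs that have no text children at all.
last_texts = doc.root.elements.to_a('p').map { |para| para.texts.last }.compact

last_texts.each { |t| p t.value }
# prints:
# " a "
# "Hi there "
# "The End "
```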

I have tried //root//p/child::text()[last()] on other XPath parsers,
and it works. Could just be a bug (or different interpretation of the
rules) by REXML?

Cheers, Diego

Diego · November 3, 2008, 11:41am

2008/11/3 Diego:

Of course, the problem here is that it returns all text nodes that are

I have tried //root//p/child::text()[last()] on other XPath parsers,
and it works. Could just be a bug (or different interpretation of the
rules) by REXML?

Could well be both. When I try ‘//p/text()[last()]’ I get only the
last node of the whole document. The issue seems to be the binding of
last() i.e. which collection it references or when it is applied. I
lean towards the bug variant.

One workaround would be to use a two-step approach, i.e. first select
all <p> elements and then the last text node of each:

irb(main):062:0> doc.elements.each('//p'){|x|
REXML::XPath.each(x,'text()[last()]'){|t|p t}}
" a "
"Hi there "
"The End "
=> [<p> ... </>, <p> ... </>, <p> ... </>]
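
Building on the two-step approach above, a minimal sketch of the actual trimming could look like the following. The sample markup and the helper name `trim_paragraph_ends` are hypothetical (the `<b>` element is an assumption), and the sketch assumes that setting `REXML::Text#value=` is an acceptable way to rewrite the node in place.

```ruby
require 'rexml/document'

# Hypothetical sample input; the <b> element is an assumption.
XHTML = '<root><p>hello <b>world</b> a </p><p>Hi there </p><p>The End </p></root>'

# Hypothetical helper: strip trailing whitespace from the last direct
# text node of every <p>, using the two-step selection shown above.
def trim_paragraph_ends(doc)
  doc.elements.each('//p') do |para|
    # Evaluated relative to one <p>, so last() refers to that paragraph only.
    REXML::XPath.each(para, 'text()[last()]') do |text|
      text.value = text.value.rstrip
    end
  end
  doc
end

doc = REXML::Document.new(XHTML)
trim_paragraph_ends(doc)
puts doc
```

Only the last text node of each paragraph is touched, so the space after "hello" in the first paragraph is preserved.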

Kind regards

robert

Diego · November 3, 2008, 5:02pm

I have tried //root//p/child::text()[last()] on other XPath parsers,
and it works. Could just be a bug (or different interpretation of the
rules) by REXML?

Sounds like a bug to me too. Is there a reason you don’t want to use a
parser like libxml-ruby, which is fully XPath 1.0 compliant (and will
give you a speed boost as well)?

– Mark.

Diego · November 3, 2008, 6:15pm

On 03.11.2008 16:59, Mark T. wrote:

Is there a reason you don’t want to use a
parser like libxml-ruby, which is fully XPath 1.0 compliant (and will
give you a speed boost as well)?

I can’t speak for Diego but I use REXML because it’s there and I do not
have to satisfy serious performance requirements when doing XML
processing.

Cheers

robert

Diego · November 5, 2008, 12:05am

@Robert: As you suggest, a workaround is in order. I had a look at
other XPath implementations and they returned what I originally
expected. Just not REXML. At least now I am sure it’s not just me.

@Mark: I would consider something other than REXML if, for example,
the workaround caused performance issues, or if there were no easy
workaround. As Robert commented, I use it because it’s there, and I
had not run into any real problems prior to this, so I was happy with
it. Certainly if it were a show-stopper I would switch to
something else. But thanks for the recommendation. Something to keep
in mind.

Cheers.