Processing mixed content with REXML


#1

Element.each_element gives the element children
And Element.texts gives the text nodes.

But how do you process the complete list of
children, in order?


#2

2006/6/14, Eric A. removed_email_address@domain.invalid:

Element.each_element gives the element children
And Element.texts gives the text nodes.

But how do you process the complete list of
children, in order?

If I’m not mistaken each_element works recursively - so you get all
children.

robert


#3

On 6/14/06, Eric A. removed_email_address@domain.invalid wrote:

Element.each_element gives the element children
And Element.texts gives the text nodes.

But how do you process the complete list of
children, in order?

Element#to_a will give you an array with everything. Element#each will
iterate everything.
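
A minimal sketch of the above, assuming a small mixed-content paragraph as input (the sample XML and the variable names are illustrative only, not from the REXML docs):

```ruby
require 'rexml/document'

# A small mixed-content paragraph: text, an element, then more text.
doc = REXML::Document.new('<p>A <b>bold</b> word</p>')
para = doc.root

# Element#to_a returns every child node (text and elements) in document order.
sequence = para.to_a.map do |node|
  case node
  when REXML::Element then [:element, node.name, node.text]
  when REXML::Text    then [:text, node.value]
  else                     [node.node_type]
  end
end

sequence.each { |entry| p entry }
```

Element#each yields the same nodes one at a time, so the case statement could equally sit inside a para.each block.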

Pedro.


#4

Robert K. wrote:

If I’m not mistaken each_element works recursively - so you get all
children.

You’re thinking of each_recursive, which does
that. But both methods return elements /only/,
not text nodes.

To get the text, you use Element.text, which
only returns the /first/ text element, or
Element.texts, which returns all of them,
but without any way I can see to sequence them
with elements that are mixed in (XML mixed
content model).

Example

<p>A <b>bold</b> word</p>

Structure

elem: <p>
   text: "A "
   elem: <b>
      text: "bold"
   text: " word"

From <p>:

  • text() gives you "A "
  • each_element gives you <b> (b.text => "bold")
  • texts() gives you ["A ", " word"]

But there is no way I can see to determine
whether the element comes before or after
" word" in the node sequence.
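
To make the problem concrete, here is a rough sketch of the three calls on the example paragraph (variable names are just for illustration). Each result is correct on its own, but none of them records where <b> sits relative to the text:

```ruby
require 'rexml/document'

para = REXML::Document.new('<p>A <b>bold</b> word</p>').root

first_text = para.text                 # first text node only
all_texts  = para.texts.map(&:value)   # every text node, elements skipped

child_names = []
para.each_element { |el| child_names << el.name }  # elements only

p first_text
p all_texts
p child_names
```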

Surely Element has a getChildren or getNodes method?
I don’t see an inclusion or inheritance in the API
docs that show me one, but perhaps I missed it.

On the other hand, it could be that this capability
is simply missing from REXML, and I should be using
a different package.


#5

Pedro Côrte-Real wrote:

Element#to_a will give you an array with everything. Element#each will
iterate everything.

Interesting. You’re sure they operate on all nodes,
and not just elements? And I take it there is some way
to use node_type() to tell which is which?

The API docs for Element don’t even mention to_a or each.
The node_type() method is listed, but has no commentary.

Assuming that your suggestion solves the problem, what’s
the best way to feed back the documentation for those
methods?


#6

2006/6/14, Eric A. removed_email_address@domain.invalid:

You did indeed speak truly. Thank you very much.

Notes for the Element API docs:

node_type --returns a symbol
   :comment, :element, :text (I've seen these)
   :cdata, ???      (I expect these)

I don’t think you will see them. CDATA is really just a way to
encapsulate text.

each --iterates over all child nodes

That’s written in the docs.

to_a --returns an array of child nodes

That’s also written in the docs. Remember that by including modules
multiple inheritance is at work. Granted that RI isn’t as good at
pointing out all the methods as JavaDoc but then again these can
change at runtime anyway. It usually helps to look at things in IRB
or #inspect them.

Where is a good place to add these?
(In other words, is there a way to do it without
checking out the project?)

IMHO not needed.

Have you been on the REXML homepage? There’s pretty good docs and
tutorials there
http://www.germane-software.com/software/rexml/

Cheers

robert


#7

You did indeed speak truly. Thank you very much.

Notes for the Element API docs:

node_type --returns a symbol
   :comment, :element, :text (I've seen these)
   :cdata, ???      (I expect these)

each --iterates over all child nodes

to_a --returns an array of child nodes
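
A small sketch of those notes in action, assuming a paragraph with a comment added so that three of the symbols show up (the sample XML is made up for illustration):

```ruby
require 'rexml/document'

para = REXML::Document.new('<p>A <!-- note --><b>bold</b> word</p>').root

# node_type on each child, in document order.
types = para.to_a.map(&:node_type)
p types

# Dispatch on the symbol while iterating over all child nodes.
para.each do |node|
  case node.node_type
  when :text    then puts "text:    #{node.value.inspect}"
  when :element then puts "element: <#{node.name}>"
  when :comment then puts "comment: #{node.string.inspect}"
  end
end
```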

Where is a good place to add these?
(In other words, is there a way to do it without
checking out the project?)

thanks again
eric


#8

Thanks for the comments, Robert. I don’t mind that
RI isn’t as good as javadoc at displaying
inherited methods. That’s a javadoc feature that
happened to be implemented at my request.

But it bugs me that it doesn’t even name
superclasses. (It /seems/ to name modules, but
I have no way of knowing if it’s complete.)

I started from the REXML home page, which took
me here for the APIs:
http://www.germane-software.com/software/XML/rexml/doc/

When you click on Element, you’ll see no mention
of each or to_a, or any mention of a class that might
have defined them. There is no comment on node_type,
and no pointers to code, in lieu of commentary.

I like Ruby. A lot. But it's murderous trying to figure out how to get anything done. I've never been much of a code reader. (Character flaw, I admit.) But I guess I'll have to become one.

But if that’s the case, what’s the point of publishing
API documents? Why should I read them, if needed APIs are
quietly ignored? How would I even /know/ that an API was
absent?

On the other hand, you do seem to have given me two
great tips:

  • Use IRB to find out what an object is capable of
  • Use a Class method, #inspect

IRB has never been proposed from that perspective before.
That’s a new idea. I got this far:

require 'rexml/document'
include REXML

I tried Element#inspect, but all that gave me was
=> REXML::Element

That’s not very helpful. What’s the final part of this
very helpful trick?

thanks again
eric


#9

Arggh. It turns out that a list of behaviors
really doesn’t solve anything.

I see to_a in the list, but so what? That’s a
pretty standard method. I have no way of knowing
what it does. Of course, I can try out all 50-some
methods and see if I can figure out what they do.
But that’s a long way to go to answer a pretty
simple question.

I guess there is just no alternative. At this
point, API documents are beginning to look pretty
useless, when it comes to learning how to use a
class.

That’s a shame, because the formatting makes them
a lot easier to read. But if they’re going to
silently ignore the very existence of important APIs,
how can I begin to trust them?

Frankly, I see this as a pretty big deal when it
comes to language acceptance. Alternative opinions
would be very welcome.


#10

Found it!

% irb
require 'rexml/document'
include REXML
puts Element.methods

or x = SomeClass.new
puts x.methods


#11

First things first:

When you click on Element, you’ll see no mention
of each or to_a, or any mention of a class that might
have defined them. There is no comment on node_type,
and no pointers to code, in lieu of commentary.

This is related to this:

But it bugs me that it doesn’t even name
superclasses. (It /seems/ to name modules, but
I have no way of knowing if it’s complete.)

Which isn’t actually the case. If you look at the blue title bar,
you see this:

Class: REXML::Element
In: temp/element.rb
Parent: Parent

‘Parent’ is the superclass. If you click on that, you can see that
the Parent class includes Enumerable (which would account for a
default each method that you seem to be looking for).

You can find the entire source in ‘temp/element.rb’, which is a
pointer to code, and you can get the source for a method by clicking
on its signature (you get a pop-up).

Reading API docs is a pain, I know, but it does help to know how they
work. Hope this helps a bit.

matthew smillie.


#12

On Jun 14, 2006, at 7:35 PM, Matthew S. wrote:

superclasses. (It /seems/ to name modules, but
‘Parent’ is the superclass. If you click on that, you can see that
matthew smillie.

In irb you can also narrow down the list of methods on an object.

Try this:

puts Element.instance_methods(false)
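
As a rough illustration of what the false argument does (assuming a Ruby version where method names come back as symbols), the narrowed list holds only what Element itself defines, so inherited methods such as to_a have to be looked up on the superclass:

```ruby
require 'rexml/document'

own = REXML::Element.instance_methods(false)  # defined in Element itself
all = REXML::Element.instance_methods         # including inherited methods

puts "defined in Element: #{own.size}, total: #{all.size}"
puts "each_element defined in Element? #{own.include?(:each_element)}"
puts "to_a available on Element?       #{all.include?(:to_a)}"
puts "to_a defined in Parent?          #{REXML::Parent.instance_methods(false).include?(:to_a)}"
```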

-Ezra


#13

I wish what you were saying were true!

I’m here:
http://www.germane-software.com/software/XML/rexml/doc/

The title bar reads:
rexml.rb
Path: temp/rexml.rb
Last Update: Thu Apr 13 20:03:06 PDT 2006

There is no ‘parent’ link.
Are you looking at docs at some other location?
If so, that is clearly the location I need!
(You give me hope.)

eric


#14

/Excellent/ tips. Thanks much. I’ll digest and apply.


#15

On Jun 15, 2006, at 23:17, Eric A. wrote:

There is no ‘parent’ link.
Are you looking at docs at some other location?
If so, that is clearly the location I need!
(You give me hope.)

We’re looking at the documentation for the REXML::Element class
(http://www.germane-software.com/software/XML/rexml/doc/classes/REXML/Element.html
if you want to bypass the frames). The documentation
that comes up initially is for the rexml.rb file (which doesn’t have
a parent, since it’s a file).

matthew smillie.


#16

2006/6/15, Eric A. removed_email_address@domain.invalid:

Thanks for the comments, Robert. I don’t mind that
RI isn’t as good as javadoc at displaying
inherited methods. That’s a javadoc feature that
happened to be implemented at my request.

I didn’t even notice that your email address ends in @sun.com when I
wrote the last message. Funny.

But it bugs me that it doesn’t even name
superclasses. (It /seems/ to name modules, but
I have no way of knowing if it’s complete.)

While I’ll readily agree that RI and documentation is not too good in
Ruby I have to disagree here. If you closely look at Element’s
documentation you’ll see that the superclass Parent is mentioned in
the header as well as included modules (Namespace in this case). If
you go to Parent which is hyperlinked you’ll see that it includes
Enumerable - a std lib module that provides a lot of methods based on
iteration (#each) and #to_a is one of them.
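
The same chain can be checked from IRB without the HTML docs. A quick sketch, using only standard reflection calls:

```ruby
require 'rexml/document'

# The direct superclass named in the documentation header.
p REXML::Element.superclass

# Everything Element inherits from or mixes in, up to (not including) Object.
p REXML::Element.ancestors.take_while { |mod| mod != Object }

# The two mix-ins mentioned above.
p REXML::Parent.include?(Enumerable)
p REXML::Element.include?(REXML::Namespace)
```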

I started from the REXML home page, which took
me here for the APIs:
http://www.germane-software.com/software/XML/rexml/doc/

When you click on Element, you’ll see no mention
of each or to_a, or any mention of a class that might
have defined them. There is no comment on node_type,
and no pointers to code, in lieu of commentary.

(see above)

I like Ruby. A lot. But it's murderous trying to figure out how to get anything done. I've never been much of a code reader. (Character flaw, I admit.) But I guess I'll have to become one.

Not necessarily. If you really love Ruby that much it’s sure no
problem for you to adapt your approach in obtaining information to
Ruby’s dynamic nature. :-) These are the things I usually do when
confronted with an unknown interface:

  • Go to IRB and evaluate obj.methods.sort or obj.methods.grep
    /expected_name/. You can also use inspect on obj.class.ancestors to
    print out methods by class. Try this in IRB

[].class.ancestors.inject {|cl,su| p cl, cl.instance_methods -
su.instance_methods;su}

  • In IRB: Evaluate obj.class.ancestors to see the chain of
    superclasses and supermodules

  • In IRB: do obj.method :a_name to see where it is defined. See
    http://groups.google.com/group/comp.lang.ruby/msg/a6400366930941bc

  • Start out with a script and use any of the above and / or #inspect
    at the point where you want to derive information; in your case that
    would mean to fill the iteration block with something like

doc.each do |element|
  p element
  p element.methods.sort
  # maybe exit to keep the output readable
end

That’s fairly easy to do with Ruby because you don’t need the compile
cycle.
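
A short sketch of the first and third tips applied to a REXML element (the sample XML is illustrative; Method#owner is used here as one way to see where a method is defined):

```ruby
require 'rexml/document'

para = REXML::Document.new('<p>A <b>bold</b> word</p>').root

# Narrow the method list by name instead of reading all of it.
p para.methods.grep(/text/).sort

# Method#owner names the class or module that defines each method.
%i[to_a each map node_type].each do |name|
  puts "#{name}: #{para.method(name).owner}"
end
```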

But if that’s the case, what’s the point of publishing
API documents? Why should I read them, if needed APIs are
quietly ignored? How would I even /know/ that an API was
absent?

You are so right. But remember that even the best API documentation
cannot deal with methods added at runtime. Sometimes even complete
modules are included at runtime and you never know until you see the
object. So I’d say the incomplete state of the Ruby API documentation
partly reflects this basic problem of API documentation for a dynamic
language like Ruby.

include REXML

I tried Element#inspect, but all that gave me was
=> REXML::Element

That’s not very helpful. What’s the final part of this
very helpful trick?

See above.

thanks again

You’re welcome.

Kind regards

robert


#17

Eric A. wrote:

But here’s a page that sets the bar for how API
docs should read:
http://builder.rubyforge.org/

There are links to source code for each API. And
clicking on the link displays the code inline.
/Very/ nice. Anyone know how that was created?

RDoc. Nice template. But similar layout for REXML is here:

http://www.ruby-doc.org/stdlib/libdoc/rexml/rdoc/

has links to inline source.


James B.

“A language that doesn’t affect the way you think about programming is
not worth knowing.”

  • A. Perlis

#18

Kindly ignore previous. The problem with frames.
I was at the home page, not at the Element page.

Observations:

  1. Wow. Parent is a link. Who would have guessed?

  2. Visiting that page, it looks familiar. As I
    recall, I found it before by selecting “Parent”
    in the class list.

  3. each, and to_a are indeed listed there–totally
    without comments. I still have to visit the
    source code to find out what they do. (Or
    depend on the kindness of the folks who inhabit
    this forum–which generates a lot of traffic.)

  4. Clicking on the file link shows me a page that
    lists required files, and that’s all.

But here’s a page that sets the bar for how API
docs should read:
http://builder.rubyforge.org/

There are links to source code for each API. And
clicking on the link displays the code inline.
/Very/ nice. Anyone know how that was created?

Note:
Seeing the method without any context turns out
not to help me very much. But it’s way better
than nothing. For example, class! shows
_start_container, _css_block, and _unify_block,
none of which are in the method list!

Implication:
If I had seen that to_a returned the value of
“children” for example, I would still need to
know what that variable contains. Alternatively,
I need to search for occurrences of “children”
in the source code, in order to build up that
understanding. But that takes me back to the
need to examine source code, once again.

I don’t /like/ coming to the conclusion that API
documents have too many gaps to be useful, but I
find myself being forced in that direction–and
thinking about how to solve the problem.


#19

On 6/15/06, Robert K. removed_email_address@domain.invalid wrote:

You are so right. But remember that even the best API documentation
cannot deal with methods added at runtime. Sometimes even complete
modules are included at runtime and you never know until you see the
object.

rdoc should probably have some way to mark methods/modules that can’t
be seen by analysing code but that the programmer knows will
exist at runtime. In xmlcodec I add some methods to the class at
runtime that I know will always exist and it would be great if I could
still make them appear in the rdoc.
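
For anyone wondering what that looks like, here is a made-up example (the Record class is hypothetical and not taken from xmlcodec). The accessors only exist after the loop has run, so there is no "def title" in the source for a static tool to pick up:

```ruby
# Hypothetical class, for illustration only.
class Record
  FIELDS = %i[title author].freeze

  # Accessors are generated when the class body is executed.
  FIELDS.each do |field|
    define_method(field) { @values[field] }
    define_method("#{field}=") { |value| @values[field] = value }
  end

  def initialize
    @values = {}
  end
end

record = Record.new
record.title = 'Mixed content'
p record.title
p Record.instance_methods(false).sort
```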

Pedro.


#20

On 6/15/06, Eric A. removed_email_address@domain.invalid wrote:

Kindly ignore previous. The problem with frames.
I was at the home page, not at the Element page.

Observations:

  1. Wow. Parent is a link. Who would have guessed?

True, this is something that might be made more obvious with a
different stylesheet. It’s mostly a matter of taste and acclimation.

  2. Visiting that page, it looks familiar. As I
    recall, I found it before by selecting “Parent”
    in the class list.

Yes, that link will take you to the documentation for the parent
class, which in this case is REXML::Parent. It’s not always (or even
usually) going to be a class named “Parent”, though. That was just
coincidence in this case.

  3. each, and to_a are indeed listed there–totally
    without comments. I still have to visit the
    source code to find out what they do. (Or
    depend on the kindness of the folks who inhabit
    this forum–which generates a lot of traffic.)

This is not the fault of API documentation as a concept, but of this
specific documentation. Those methods should be better documented. But
unwritten documentation is the bane of users no matter what
documentation style you choose.

  4. Clicking on the file link shows me a page that
    lists required files, and that’s all.

Not sure about what is required to make this page show/link to the
full source, but I know I have seen it available (not for this
package, but others) before. Must be some rdoc setting. You’ll notice
that the example you mention below of Builder doesn’t have the source
here either, however.

But here’s a page that sets the bar for how API
docs should read:
http://builder.rubyforge.org/

There are links to source code for each API. And
clicking on the link displays the code inline.
/Very/ nice. Anyone know how that was created?

It’s just a setting on RDoc. In fact, if you look at the authoritative
documentation for REXML at
http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html, you’ll see it
has this feature as well. The documentation you were looking at at
germane-software.com was built with an older rdoc and had a different
linking mechanism – you clicked on the method name and it opens the
source in a popup. I too prefer the newer method, and if you look for
documentation on standard libraries at ruby-doc.org rather than
elsewhere, you’ll have it available.

Seeing the method without any context turns out
not to help me very much. But it’s way better
than nothing. For example, class! shows
_start_container, _css_block, and _unify_block,
none of which are in the method list!

They’re not in the list because they’re private/protected methods,
which aren’t included in RDoc output by default, because they’re not
part of the public API. You shouldn’t need to know about them.
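
The same split shows up in Ruby's own reflection. A minimal sketch with a hypothetical class (loosely modeled on the underscore-prefixed helpers mentioned above, not on Builder's actual code):

```ruby
# Hypothetical class, for illustration only.
class Markup
  def tag(name)
    _start_tag(name)
  end

  private

  # Helper used internally; not part of the public interface.
  def _start_tag(name)
    "<#{name}>"
  end
end

p Markup.instance_methods(false)          # public (and protected) methods
p Markup.private_instance_methods(false)  # private helpers
p Markup.new.tag('b')
```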

If I had seen that to_a returned the value of
“children” for example, I would still need to
know what that variable contains. Alternatively,
I need to search for occurrences of “children”
in the source code, in order to build up that
understanding. But that takes me back to the
need to examine source code, once again.

What this really comes down to is that you shouldn’t need to
understand what the @children variable contains. You shouldn’t need to
know where else it is used in the code. You shouldn’t need to inspect
the source code at all; that link to the source of the method is only
a convenience for 1) if you’re curious, or 2) if the documenter failed
at his job.

In this case, the to_a method was poorly documented; you’ll find that
a lot with any language and any documentation method.

I don’t /like/ coming to the conclusion that API
documents have too many gaps to be useful, but I
find myself being forced in that direction–and
thinking about how to solve the problem.

It doesn’t sound like your problem is actually with API docs in
general, but with the specific API docs for REXML itself. Like you
said, you like the Builder documentation – it’s also API
documentation built by RDoc.

The best thing any of us, you and me included, can do about this
situation is not to try to engineer a better solution than RDoc, but
to go in and write documentation. I’ve started with the webrick
library myself – hopefully another couple months down the road I can
get it finished and added to ruby-doc.org.

Jacob F.