Libxml: is it possible not to use doctype declaration?

ruud_grosmann · July 29, 2008, 3:45pm

hi all,

I have tried to find elements in XML documents with xpath expression
support in libxml:

require 'xml/libxml'
doc = XML::Document.file(file)
node = doc.find_first('doc/p[@att]/@att')

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

Is there a way to tell the Document-object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud

ruud_grosmann · July 29, 2008, 4:15pm

ruud grosmann wrote:

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

If the doctype is an HTML one, open the document like this:

 xp = XML::HTMLParser.new()
 xp.string = xhtml
 XML::Parser.default_pedantic_parser = false
 doc = xp.parse

My assertxpath gem shows how, in the method assert_libxml.

ruud_grosmann · July 29, 2008, 4:45pm

hi Phlip,

thanks for the suggestion. The document is not an HTML document. It is
an XML document. It is something like this:

<?xml version="1.0" encoding="utf-8"?>

this is a test

I don’t want XML::Document to resolve the URL and wait for a
timeout. I couldn’t find anything in the documentation on this.

regards, Ruud

ruud_grosmann · July 29, 2008, 5:50pm

ruud grosmann wrote:

I don’t want XML::Document to resolve the URL and wait for a
timeout. I couldn’t find anything in the documentation on this.

Use string surgery to yank out the DOCTYPE.
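
A minimal sketch of that string surgery, in plain Ruby. The helper name
strip_doctype and the sample document are made up for illustration. The
pattern assumes a single DOCTYPE whose identifiers contain no ">" or "["
and whose internal subset (if any) has no "]>" inside a quoted value; a
document that breaks those assumptions needs a real parser.

```ruby
# Hypothetical helper: remove the first DOCTYPE declaration from an XML string.
# Matches "<!DOCTYPE name ... >" with an optional "[ ... ]" internal subset.
DOCTYPE_PATTERN = /<!DOCTYPE\s[^>\[]*(?:\[.*?\]\s*)?>\s*/m

def strip_doctype(xml)
  xml.sub(DOCTYPE_PATTERN, '')
end

# Made-up sample; the public and system identifiers are placeholders.
xml = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD example//EN" "http://example.invalid/doc.dtd">
  <doc><p att="x">this is a test</p></doc>
XML

cleaned = strip_doctype(xml)
puts cleaned
```

The cleaned string can then be handed to whatever parser you use, so it
never sees the system identifier at all.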

ruud_grosmann · July 29, 2008, 7:49pm

On 29 jul 2008, at 15.44, ruud grosmann wrote:

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

Is there a way to tell the Document-object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud

Check whether your XML processor supports XML catalog files. They
provide a mapping from web-based paths to local file names.
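
A rough sketch of what such a catalog could look like, written out from
Ruby. The helper name write_catalog, the URL and the local path are all
placeholders. libxml2 itself looks at the XML_CATALOG_FILES environment
variable; whether the Ruby binding leaves catalog support switched on
is an assumption you would have to verify.

```ruby
require 'tmpdir'

# Hypothetical helper: write an OASIS XML catalog that maps one system
# identifier (the URL in the DOCTYPE) to a local copy of the DTD.
def write_catalog(catalog_path, system_id, local_dtd_uri)
  # String#encode(xml: :attr) escapes the value and adds the double quotes.
  File.write(catalog_path, <<~XML)
    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <system systemId=#{system_id.encode(xml: :attr)} uri=#{local_dtd_uri.encode(xml: :attr)}/>
    </catalog>
  XML
  catalog_path
end

catalog = write_catalog(
  File.join(Dir.tmpdir, 'catalog.xml'),
  'http://example.invalid/dtd/doc.dtd',  # placeholder system identifier
  'file:///usr/local/share/dtd/doc.dtd'  # placeholder local copy of the DTD
)

# libxml2 reads this variable when it initialises its catalogs, so it is
# safest to set it before the parser is first used (or in the shell).
ENV['XML_CATALOG_FILES'] = catalog
```

With a mapping like this in place, a catalog-aware parser loads the DTD
from disk instead of going out to the network.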

ruud_grosmann · July 29, 2008, 9:01pm

ruud grosmann wrote:

Is using libxml the right thing to do, or are there smarter alternatives?

Libxml-ruby is the most complete & accurate parser of the big three
(REXML, Libxml-ruby, and Hpricot), and its documentation can be very
challenging. How much of the original C Libxml documentation have you
been able to read?
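
For comparison, a small sketch of the same kind of XPath lookup with
REXML. The sample document and its identifiers are made up. As far as I
know REXML is non-validating and does not fetch the external DTD named
in the DOCTYPE, so the system identifier is simply carried along; it is
also much slower than libxml on large documents.

```ruby
require 'rexml/document'

# Made-up sample; the public and system identifiers are placeholders.
xml = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD example//EN" "http://example.invalid/doc.dtd">
  <doc><p att="x">this is a test</p></doc>
XML

doc = REXML::Document.new(xml)

# Same expression as in the original question; the result is an
# REXML::Attribute, or nil when nothing matches.
attr = REXML::XPath.first(doc, 'doc/p[@att]/@att')
value = attr && attr.value
puts value
```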

ruud_grosmann · July 29, 2008, 7:32pm

hi Phlip,

thank you for the hint. I had already done that, but I was wondering
whether there is some hidden option that would do it for me.

Is my assumption correct that the class is not documented very well?
After googling for some time I only found something that appeared to
be outdated. That is why I eventually posted my question here.

Is using libxml the right thing to do, or are there smarter
alternatives?

thanks, Ruud

ruud_grosmann · July 29, 2008, 11:39pm

I tried to reply to this via the ruby-talk mailing list and it didn’t
work. Not sure why not, maybe someone can fill me in on that. Anyway,
here’s my take:

To start, the rdoc documentation can be found at
http://libxml.rubyforge.org/rdoc/index.html. Now I don’t know this for
sure, but the DOCTYPE in your example doesn’t look like a real doctype
definition, so if you can pull it out of your xml (by hand, not
programmatically) before trying to parse it, I’d say that would be a
good idea. That being said, there are two attributes of the XML::Parser
class that look like they may be of interest: default_load_external_dtd
and default_validity_checking. Try setting both of those to false,
unless you have a real dtd to validate against and the example above
was fake. Of course, since this is using XML::Parser instead of
XML::Document I think you would need to do e.g.:
parser = XML::Parser.file()
parser.default_load_external_dtd = false
parser.default_validity_checking = false
doc = parser.parse

… and then go from there.
Phill D.

ruud_grosmann · July 30, 2008, 2:54am

Whoops, those were supposed to be class variables. What you really want
to do (I think) is more like:

LibXML::XML::Parser.default_load_external_dtd = false
LibXML::XML::Parser.default_validity_checking = false

And then:
parser = LibXML::XML::Parser.file(file)  # file: path to your XML document
doc = parser.parse

That seems to work with your example.

ruud_grosmann · July 30, 2008, 10:39am

hi Phill,

I’ve tried it right away. I ended up with the following:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

parser = XML::Parser.file(file)
#parser.default_substitute_entities = false
#parser.default_load_external_dtd = false
#parser.default_validity_checking = false
doc = parser.parse
node = doc.find(xpath).first

But the script still tries to resolve the entity. The doctype
definition is a slightly changed real one. The message I get with the
above code is:

Operation in progress./tmp/ut21.uit:3: I/O warning : failed to load
external entity “http://ruud.grosmann.nl/op/dtd/publicatie.dtd”
e publicaties 1.0//NL" “http://ruud.grosmann.nl/op/dtd/publicatie.dtd”

You were right that the methods are not instance methods, although I
am not sure how to conclude that from the documentation.

Did I do something wrong in the script?

regards, Ruud

ruud_grosmann · July 30, 2008, 1:16pm

ruud grosmann wrote:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

Did I do something wrong in the script?

When I was researching the difference between the normal XML parser and
the HTML parser, I also observed those variables not working. That’s
why I didn’t bring them up.

ruud_grosmann · July 31, 2008, 8:52am

Give fastxml a try. It’s also a ruby interface to libxml.

http://fastxml.rubyforge.org/
–mg

ruud_grosmann · July 30, 2008, 5:43pm

Hey Ruud,
Nope, I can’t see that you’re doing anything wrong. I guess all I
can say is that if you can send the actual XML, I can give it a try
(because when I use your original example it seems to work fine as long
as I set those class variables). Also, the error message you sent was
broken up; if you could please try to send that again, it would probably
help. Here’s what I’m using:

<?xml version="1.0" encoding="utf-8"?>

this is a test

And here’s the error I get when I don’t set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC “-//FARAWAY//DTD-verweg//NL”
“Site.nl: Alles-in-één oplossing voor jouw site”

Thanks,
Phill

ruud_grosmann · July 31, 2008, 9:52am

hi Mark,

thanks for this hint. I had decided libxslt was not for me because of
a problem with garbage collection that appeared after I started using
it (see other post).
So a good alternative is welcome. I’ll check it out later this week.

regards, Ruud

ruud_grosmann · August 1, 2008, 11:51am

On 30/07/2008, Phill D. wrote:

And here’s the error I get when I don’t set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC “-//FARAWAY//DTD-verweg//NL”
“Site.nl: Alles-in-één oplossing voor jouw site”

Hm, Java XML parsers I know have a special callback that you can set
that will deal with resolving external entities. I could not find
anything similar in libxml documentation but maybe I just looked in
the wrong places. With that you could load the file just once (or
even fetch it from some internal memory or file system). Also, I find
it a bit strange that those flags are global - this can introduce
weird bugs when using an application which parses XML concurrently and
needs different flags for each process…
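
One way to limit the damage from global flags, sketched in plain Ruby.
Everything here is illustrative: with_class_defaults is a made-up
helper and FakeParser is only a stand-in, not the real XML::Parser. It
assumes the class exposes a reader as well as a writer for each flag,
and the lock only protects threads in the same process that all go
through this helper.

```ruby
# Serialises access to the global settings within this process.
PARSER_DEFAULTS_LOCK = Mutex.new

# Hypothetical helper: temporarily set class-level attributes on klass,
# run the block, then restore the previous values (even on error).
def with_class_defaults(klass, settings)
  PARSER_DEFAULTS_LOCK.synchronize do
    saved = settings.keys.to_h { |name| [name, klass.public_send(name)] }
    begin
      settings.each { |name, value| klass.public_send("#{name}=", value) }
      yield
    ensure
      saved.each { |name, value| klass.public_send("#{name}=", value) }
    end
  end
end

# Stand-in for a parser class with global flags (not the real XML::Parser).
class FakeParser
  class << self
    attr_accessor :default_load_external_dtd, :default_validity_checking
  end
  self.default_load_external_dtd = true
  self.default_validity_checking = true
end

# The flag is false only while the block runs.
inside = with_class_defaults(FakeParser, default_load_external_dtd: false) do
  FakeParser.default_load_external_dtd
end
```

Separate processes each have their own copy of the flags, so this only
matters for threads; a per-parser option or a resolver callback would
still be the cleaner design.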

Kind regards

robert

ruud_grosmann · August 2, 2008, 7:40pm

thanks everybody,

I think I’d rather do a system call to saxon. There are just too many
little bugs and uncertainties for me. Thanks anyway for your efforts
and for helping me.

Regards, Ruud