Libxml: is it possible not to use doctype declaration?

hi all,

I have tried to find elements in XML documents with xpath expression
support in libxml:

require 'xml/libxml'
doc = XML::Document.file(file)
node = doc.find_first('doc/p[@att]/@att')

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, leading to significant performance issues.

Is there a way to tell the Document-object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud

ruud grosmann wrote:

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, leading to significant performance issues.

If the doctype is an HTML one, open the document like this:

 xp = XML::HTMLParser.new()
 xp.string = xhtml
 XML::Parser.default_pedantic_parser = false
 doc = xp.parse

My assertxpath gem shows how, in the method assert_libxml.

hi Phlip,

thanks for the suggestion. The document is not an HTML document. It is
an XML document. It is something like this:

<?xml version="1.0" encoding="utf-8"?>

this is a test

I don’t want XML::Document to resolve the URL and then wait for a
timeout. I couldn’t find anything in the documentation on this.

regards, Ruud

ruud grosmann wrote:

I don’t want XML::Document to resolve the URL and then wait for a
timeout. I couldn’t find anything in the documentation on this.

Use string surgery to yank out the DOCTYPE.

On 29 jul 2008, at 15.44, ruud grosmann wrote:

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, leading to significant performance issues.

Is there a way to tell the Document-object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud

Check whether your XML processor supports XML catalog files. They
provide a mapping from web-based paths to local file names.
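
A rough sketch of how such a catalog could be set up from Ruby. The C libxml2 library reads the XML_CATALOG_FILES environment variable when it initialises its catalogs; whether the Ruby binding picks this up has not been verified here. The local DTD path and the helper name catalog_xml are hypothetical; the system identifier is the one from the error message later in this thread.

```ruby
require 'tmpdir'

# Hypothetical location of a local copy of the DTD; adjust as needed.
LOCAL_DTD = '/usr/local/share/dtd/publicatie.dtd'
# System identifier as it appears in the error message in this thread.
REMOTE_DTD = 'http://ruud.grosmann.nl/op/dtd/publicatie.dtd'

# Hypothetical helper: builds an OASIS XML catalog with a single mapping
# from a system identifier to a local file.
def catalog_xml(system_id, local_path)
  <<~XML
    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <system systemId="#{system_id}" uri="file://#{local_path}"/>
    </catalog>
  XML
end

catalog_path = File.join(Dir.tmpdir, 'catalog.xml')
File.write(catalog_path, catalog_xml(REMOTE_DTD, LOCAL_DTD))

# Must be set before libxml2 initialises its catalogs, i.e. before the
# first document is parsed in this process.
ENV['XML_CATALOG_FILES'] = catalog_path
```

Exporting XML_CATALOG_FILES in the shell before starting Ruby is probably the safer variant, since it does not depend on when the library first looks at the variable.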

ruud grosmann wrote:

Is using libxml the right thing to do, or are there smarter alternatives?

Libxml-ruby is the most complete & accurate parser of the big three
(REXML, Libxml-ruby, and Hpricot), and its documentation can be very
challenging. How much of the original C Libxml documentation have you
been able to read?
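
For comparison, a small sketch using REXML, which ships with Ruby. REXML is a non-validating parser and, as far as I know, does not fetch external DTDs, so a DOCTYPE with a system identifier should not trigger a network lookup. The sample document and its identifiers are made up; the XPath is modelled on the one in the original question.

```ruby
require 'rexml/document'

# Made-up sample; the system identifier is a placeholder, not a real DTD.
xml = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD example//EN" "http://example.invalid/example.dtd">
  <doc><p att="value">this is a test</p></doc>
XML

doc = REXML::Document.new(xml)

# Select the first <p> that carries an "att" attribute, then read the attribute.
element = REXML::XPath.first(doc, '/doc/p[@att]')
value = element.attributes['att']
puts value
```

REXML is considerably slower than libxml on large documents, so this is a trade-off rather than a drop-in replacement.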

hi Phlip,

thank you for the hint. I did that already, but I was wondering if there
is some hidden option that would do it for me.

Is my assumption correct that the class is not documented very well?
After googling for some time I only found something that appeared to
be outdated. That's why I eventually posted my question here.

Is using libxml the right thing to do, or are there smarter
alternatives?

thanks, Ruud

I tried to reply to this via the ruby-talk mailing list and it didn’t
work. Not sure why not, maybe someone can fill me in on that. Anyway,
here’s my take:

To start, the rdoc documentation can be found at
http://libxml.rubyforge.org/rdoc/index.html. Now I don’t know this for
sure, but

doesn’t look like a real doctype definition, so if you can pull it out
of your xml (by hand, not programmatically) before trying to parse it,
I’d say that would be a good idea. That being said, there are two
attributes of the XML::Parser class that look like they may be of
interest: default_load_external_dtd and default_validity_checking. Try
setting both of those to false, unless you have a real dtd to validate
against and the example above was fake. Of course, since this is using
XML::Parser instead of XML::Document I think you would need to do e.g.:
parser = XML::Parser.file()
parser.default_load_external_dtd = false
parser.default_validity_checking = false
doc = parser.parse

… and then go from there.
Phill D.

Whoops, those were supposed to be class variables. What you really want
to do (I think) is more like:

LibXML::XML::Parser.default_load_external_dtd = false
LibXML::XML::Parser.default_validity_checking = false

And then:
parser = LibXML::XML::Parser.file()
doc = parser.parse

That seems to work with your example.

hi Phill,

I’ve tried it right away. I ended up with the following:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

parser = XML::Parser.file(file)
# parser.default_substitute_entities = false
# parser.default_load_external_dtd = false
# parser.default_validity_checking = false
doc = parser.parse
node = doc.find(xpath).first

But the script still tries to resolve the entity. The doctype
definition is a slightly changed real one. The message I get with the
above code is:

Operation in progress./tmp/ut21.uit:3: I/O warning : failed to load
external entity “http://ruud.grosmann.nl/op/dtd/publicatie.dtd
e publicaties 1.0//NL" “http://ruud.grosmann.nl/op/dtd/publicatie.dtd

You were right that the methods are not instance methods, although I
am not sure how to conclude that from the documentation.

Did I do something wrong in the script?

regards, Ruud

ruud grosmann wrote:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

Did I do something wrong in the script?

When I was researching the difference between the normal XML parser and
the HTML parser, I also observed those variables not working. That’s why
I didn’t bring them up.

Give fastxml a try. It’s also a ruby interface to libxml.

http://fastxml.rubyforge.org/
–mg

Hey Ruud,
Nope, I can’t see that you’re doing anything wrong. I guess all I
can say is: if you can send the actual XML, I can give it a try
(because when I use your original example it seems to work fine as long
as I set those class variables). Also, the error message you sent was
broken up; if you could please try to send that again, it would probably
help. Here’s what I’m using:

<?xml version="1.0" encoding="utf-8"?>

this is a test

And here’s the error I get when I don’t set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"

Thanks,
Phill

hi Mark,

thanks for this hint. I had decided libxslt was not for me because of
a problem with garbage collection after starting to use it (see other
post).
So a good alternative is welcome. I’ll check it out later this week.

regards, Ruud

On 30/07/2008, Phill D. wrote:

And here’s the error I get when I don’t set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"

Hm, Java XML parsers I know have a special callback that you can set
that will deal with resolving external entities. I could not find
anything similar in libxml documentation but maybe I just looked in
the wrong places. With that you could load the file just once (or
even fetch it from some internal memory or file system). Also, I find
it a bit strange that those flags are global - this can introduce
weird bugs when using an application which parses XML concurrently and
needs different flags for each process…
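
One way to limit the damage from global flags within a single process is to change them only inside a guarded block and restore them afterwards. The sketch below uses a stand-in class, StubParser, which is purely hypothetical and only mimics the setter names mentioned in this thread. It assumes the real class would also expose matching readers, which has not been checked. The helper name with_flags is made up as well.

```ruby
# Stand-in for a parser class with global (class-level) flags. Hypothetical:
# it only mimics the setter names used in this thread.
class StubParser
  class << self
    attr_accessor :default_load_external_dtd, :default_validity_checking
  end
  self.default_load_external_dtd = true
  self.default_validity_checking = true
end

# Serialises every flag change so threads cannot see each other's settings.
FLAG_LOCK = Mutex.new

# Hypothetical helper: temporarily override class-level flags, run the
# block, then restore the previous values even if the block raises.
def with_flags(klass, overrides)
  FLAG_LOCK.synchronize do
    saved = overrides.keys.map { |name| [name, klass.public_send(name)] }.to_h
    begin
      overrides.each { |name, value| klass.public_send("#{name}=", value) }
      yield
    ensure
      saved.each { |name, value| klass.public_send("#{name}=", value) }
    end
  end
end

inside = with_flags(StubParser, default_load_external_dtd: false) do
  StubParser.default_load_external_dtd
end
after = StubParser.default_load_external_dtd
puts "inside block: #{inside}, afterwards: #{after}"
```

This serialises all parsing that goes through with_flags, so it trades concurrency for predictability. The lock is not reentrant, so nested calls would raise an error.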

Kind regards

robert

thanks everybody,

I think I’d rather do a system call to Saxon. There are just too many
little bugs and uncertainties for me. Thanks anyway for your efforts and
for helping me.

Regards, Ruud