Parse XML that isn't well formed

milo · September 19, 2007, 12:05pm

I have some XML like the following, except that the real files are much
larger (some are up to 2GB):

<?xml version="1.0" encoding="UTF-8"?>

    <server_url>http://myserver.edu/data/</server_url>
    <server_name>myserver.edu</server_name>
    <uploads>
            <result>
                    <dir>/storage/data/results/</dir>
                    <result_name>hadcm3l_00012_00000118_0</result_name>
                    <file_info>
                    <name>hadcm3l_00012_00000118_0_6.zip</name>
                    <nbytes>5154055</nbytes>
                    <md5_checksum>485600296bb601ab4a3d1d49a9fb1c86</md5_checksum>
                    </file_info>
                    <file_info>
                    <name>hadcm3l_00012_00000118_0_7.zip</name>
                    <nbytes>5153055</nbytes>
                    <md5_checksum>36a600296cb60229a3d1d49a9fb1a10</md5_checksum>
                    </file_info>
            </result>
    </uploads>

I’ve tried a few xml parsers such as xml-simple, libxml and quixml, but
all reject this data as badly formed. One answer would, of course, be
for the data to be re-generated using properly formed xml. Meanwhile, is
there anything that could be done with the existing files? Is it a case
of having to write regexps to parse this sort of thing?

milo · September 19, 2007, 8:54pm

On 9/19/07, Milo T. wrote:

                    <file_info>
    </uploads>

Note that there should be no closing tag at the very end - the <?xml ... ?>
line at the top is a declaration, not an opening tag. Where did that closing
tag come from? What happens if you remove it from the data?

-A

milo · September 20, 2007, 3:27pm

Alex LeDonne wrote:

Note that there should be no closing tag at the very end - the <?xml ... ?>
line at the top is a declaration, not an opening tag. Where did that closing
tag come from? What happens if you remove it from the data?

Good point about the XML. Unfortunately, these are the files I have
received, and I have to deal with them for now.

Removing the final tag gives:

.file.xml:3: parser error : Extra content at the end of the document
<server_name>myserver.edu</server_name>
^
rake aborted!

milo · September 21, 2007, 11:41am

Jano S. wrote:

You should have done two things: 1. add a root node (with its closing tag
just before the stray trailing tag) AND 2. remove the stray trailing tag

Great, thanks.
That should sort out the “legacy” files, and future ones can be
corrected.

I have also been parsing each line with IO.foreach and
/<(.+)[^>]*>(.+?)<(\/.+)>/, which, though not as nice as a proper XML
parser, does avoid loading huge files into memory in one go.
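
A minimal sketch of that line-by-line approach, in case it helps anyone else. It assumes each leaf element sits on a single line, as in the sample above. The method name each_leaf and the slightly stricter pattern (the closing tag must match the opening one) are my own choices for illustration, not the exact code described in this post.

```ruby
require "stringio"

# Matches a single-line leaf element such as <nbytes>5154055</nbytes>.
# The backreference \1 requires the closing tag name to match the opening one.
LEAF = %r{<([^\s>/]+)[^>]*>(.+?)</\1>}

# Yields [tag_name, text] for each leaf element, reading one line at a time
# so the whole input is never held in memory.
def each_leaf(io)
  return enum_for(:each_leaf, io) unless block_given?

  io.each_line do |line|
    m = LEAF.match(line)
    yield m[1], m[2] if m
  end
end

# Small in-memory stand-in for a real file.
sample = StringIO.new(<<~XML)
  <result>
  <name>hadcm3l_00012_00000118_0_6.zip</name>
  <nbytes>5154055</nbytes>
  </result>
XML

each_leaf(sample) { |tag, text| puts "#{tag}: #{text}" }
```

For a real file, the same method can be given an open File object, for example inside a File.open block. Lines holding only an opening or closing container tag (such as <result>) are skipped, so the nesting has to be tracked separately if it matters.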

milo · September 21, 2007, 11:31am

On 9/20/07, Milo T. wrote:

.file.xml:3: parser error : Extra content at the end of the document
<server_name>myserver.edu</server_name>
^
rake aborted!

You should have done two things: 1. add a root node (with its closing tag
just before the stray trailing tag) AND 2. remove the stray trailing tag

Then it’ll be fine.

In your case it’s easy:

data.gsub('?>', '?>').gsub('', '')
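
A minimal sketch of the two-step fix described above. The tag names are assumptions: the stray trailing tag is taken to be `</xml>` and the wrapper element is called `root`, neither of which is confirmed anywhere in this thread. The method name wrap_in_root is hypothetical as well.

```ruby
# Assumed names, not confirmed by the thread.
STRAY_CLOSE = "</xml>"  # assumed stray trailing tag in the legacy files
ROOT = "root"           # hypothetical wrapper element name

# Returns a copy of data with a root element opened right after the XML
# declaration and the last stray closing tag turned into the root's close.
def wrap_in_root(data)
  # Step 1: open the root element directly after the first "?>".
  fixed = data.sub("?>", "?>\n<#{ROOT}>")

  # Step 2: replace the last occurrence of the stray tag with the root's
  # closing tag; if it is absent, just append the closing tag.
  idx = fixed.rindex(STRAY_CLOSE)
  if idx
    fixed[idx, STRAY_CLOSE.length] = "</#{ROOT}>"
  else
    fixed << "\n</#{ROOT}>\n"
  end
  fixed
end

legacy = %(<?xml version="1.0" encoding="UTF-8"?>\n<server_name>myserver.edu</server_name>\n</xml>\n)
puts wrap_in_root(legacy)
```

This works on a whole string, so for the 2GB files it would need a lot of memory. Since only the line after the declaration and the final line change, the same idea could be applied while copying the file line by line instead.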