[FYI] Pure Java Nokogiri 1.5.0.beta.4

luislavena · January 28, 2011, 2:56am

For those who are struggling to get pure Java Nokogiri to work:

Pure Java Nokogiri 1.5.0.beta.4 is out. This version fixes the problem
of not working in Rack-based environments. Other reported bugs
were fixed as well. Please give it a try. I’m looking forward to
your feedback.

-Yoko

Yoko_H · January 28, 2011, 10:10am

Glad to see pure Java Nokogiri advancing.

Have you looked at performance at all? The main reason I cannot use
it currently is performance.

For my small test (which I’m happy to send you), I get the following
in JRuby with pure Java Nokogiri:

: [erik@zyrgelkwyt3]$ ; jruby footest-perf.rb
Took 5797 ms

With non-Java (FFI-based) Nokogiri I get:

: [erik@zyrgelkwyt3]$ ; jruby footest-perf.rb
Took 2648 ms

Somewhat unrelated: in this example JRuby is way slower than MRI Ruby.

: [erik@zyrgelkwyt3]$ ; /usr/local/bin/ruby footest-perf.rb
Took 178 ms
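
For readers wondering how a "Took N ms" line like the ones above might be produced, here is a minimal sketch. It is not the actual footest-perf.rb, which is not shown in this thread; the `time_ms` helper and the workload are hypothetical placeholders.

```ruby
# Hypothetical helper (not from the thread): run the block once and
# return the elapsed wall-clock time in whole milliseconds.
def time_ms
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round
end

# Placeholder workload; a real test would parse and map XML here.
elapsed = time_ms { 10_000.times.map { |i| i.to_s }.join }
puts "Took #{elapsed} ms"
```

Note that a single cold run like this on JRuby also includes JIT warm-up, so repeating the block a few times and looking at the later runs may give a fairer comparison.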

Regards,
Erik

Yoko_H · January 28, 2011, 3:08pm

On Fri, Jan 28, 2011 at 3:10 AM, Erik B. [email protected] wrote:

With non-Java (FFI-based) Nokogiri I get:

: [erik@zyrgelkwyt3]$ ; jruby footest-perf.rb
Took 2648 ms

Somewhat unrelated: in this example JRuby is way slower than MRI Ruby.

: [erik@zyrgelkwyt3]$ ; /usr/local/bin/ruby footest-perf.rb
Took 178 ms

Is this test available? One user had commented that the Java adapter
was quite a bit faster for what they were doing.

-Tom

–
blog: http://blog.enebo.com twitter: tom_enebo
mail: [email protected]

Yoko_H · January 28, 2011, 6:46pm

On 28 janv. 2011, at 15:07, Thomas E Enebo wrote:

Is this test available? One user had commented that the Java adapter
was quite a bit faster for what they were doing.

That might be me

I did tweet about some massive speed improvements when switching from
the FFI-based Nokogiri to the Java-based one in my specific case, which
involves stream parsing. In this case it basically means removing the
shitload of boundary crossings caused by the parsing callbacks, which the
FFI layer overhead makes painfully slow. The speedup is substantial in
this case: between 50x and 100x faster, depending on the dataset size.
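
To illustrate the callback-heavy shape of stream parsing described above, here is a minimal sketch using Nokogiri's SAX interface. The `ElementCounter` class and the sample XML are made up for illustration and are not the code from this post. Every element start triggers a Ruby callback, which is where a per-call boundary cost would add up on large inputs.

```ruby
begin
  require 'nokogiri'
rescue LoadError
  warn 'nokogiri gem not installed; skipping the SAX sketch'
end

if defined?(Nokogiri)
  # Illustrative handler: counts how often each element name starts.
  class ElementCounter < Nokogiri::XML::SAX::Document
    attr_reader :counts

    def initialize
      super
      @counts = Hash.new(0)
    end

    # Called by the parser once per opening tag.
    def start_element(name, attrs = [])
      @counts[name] += 1
    end
  end

  sample_xml = '<feed><entry/><entry/><entry/></feed>'
  counter = ElementCounter.new
  Nokogiri::XML::SAX::Parser.new(counter).parse(sample_xml)
  p counter.counts
end
```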

I have a few issues with API incompatibilities, but other than that
Nokogiri Java is a godsend for me.

–
Luc H. - [email protected]

Yoko_H · January 31, 2011, 9:44pm

On Fri, Jan 28, 2011 at 12:45 PM, Luc H. [email protected]
wrote:

On 28 janv. 2011, at 15:07, Thomas E Enebo wrote:

Is this test available? One user had commented that the Java adapter
was quite a bit faster for what they were doing.

That might be me

I did tweet about some massive speed improvements when switching from FFI
based nokogiri to the Java based one in my specific case which involves stream
parsing. In this case it basically means removing the shitload of boundary
crossings caused by the parsing callbacks which the FFI layer overhead makes
painfully slow. The speedup is substantial in this case. And when I say
substantial I mean between 50x and 100x faster depending on the dataset size.

Thanks for reporting your case. That’s awesome!

I have a few issues with API incompatibilities, but other than that Nokogiri Java
is a godsend for me.

There are tools to test “semantic equality of the document structure.”

(see the discussion
http://groups.google.com/group/nokogiri-talk/browse_thread/thread/a16e0554158965ea?hl=en)

These might help you?

-Yoko

Yoko_H · February 1, 2011, 7:51am

Yes. I can send you the test later. I just need to look through it first.


Yoko_H · January 31, 2011, 9:33pm

Hi,

On Fri, Jan 28, 2011 at 4:10 AM, Erik B. [email protected] wrote:

With non-Java (FFI-based) Nokogiri I get:

: [erik@zyrgelkwyt3]$ ; jruby footest-perf.rb
Took 2648 ms

As you pointed out, the pure Java version is slower than the CRuby version.
I’ve attempted various ways to improve performance, but
there’s a limit. However, performance of the pure Java version is expected
to improve when JVM or JIT options are applied. So, would you share the code
used to measure the performance, as Tom asked?

-Yoko

Yoko_H · February 10, 2011, 12:27pm

On Fri, Jan 28, 2011 at 2:56 AM, Yoko H. [email protected] wrote:

For those who are struggling to get pure Java Nokogiri to work:

Pure Java Nokogiri 1.5.0.beta.4 is out. This version fixed the problem
of not working on rack based environment. Also, other reported bugs
were fixed. Please give it a try. I’m looking forward to having
feedback from you.

Hi Yoko,

It seems 1.5.0.beta.4 does not validate XML against XSD anymore…

I have a case where an XSD type is actually defined in another XSD file
(header.xsd included), and it no longer works.

I can provide test cases separately… Meanwhile I reverted to
1.5.0.beta.3.
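
For anyone trying to reproduce this kind of case, here is a rough sketch. It assumes that "included" means an `xs:include` with a relative `schemaLocation`; the schema contents, the `CodeType` name, and the helper name are invented for illustration and are not the actual test case. It also relies on `Dir.chdir` so the relative location resolves, which is the usual idiom with the libxml2-backed Nokogiri; the pure Java version may behave differently.

```ruby
require 'tmpdir'

begin
  require 'nokogiri'
rescue LoadError
  warn 'nokogiri gem not installed; the sketch below will return nil'
end

# Hypothetical reproduction: main.xsd uses a type defined in header.xsd.
# Returns [valid_document_ok, error_count_for_invalid_document].
def validate_with_included_type
  return nil unless defined?(Nokogiri)

  Dir.mktmpdir do |dir|
    File.write(File.join(dir, 'header.xsd'), <<~XSD)
      <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:simpleType name="CodeType">
          <xs:restriction base="xs:string">
            <xs:maxLength value="3"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:schema>
    XSD
    File.write(File.join(dir, 'main.xsd'), <<~XSD)
      <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:include schemaLocation="header.xsd"/>
        <xs:element name="code" type="CodeType"/>
      </xs:schema>
    XSD

    # Change directory so the relative schemaLocation can be resolved.
    Dir.chdir(dir) do
      schema = Nokogiri::XML::Schema(File.read('main.xsd'))
      good = schema.valid?(Nokogiri::XML('<code>abc</code>'))
      errors = schema.validate(Nokogiri::XML('<code>abcdef</code>'))
      [good, errors.size]
    end
  end
end

p validate_with_included_type
```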

–
Christian

Yoko_H · February 10, 2011, 8:42pm

Hello,
Thank you for using pure Java Nokogiri!

On Thu, Feb 10, 2011 at 6:27 AM, Christian MICHON
[email protected] wrote:

it seems 1.5.0.beta.4 does not validate xml versus xsd anymore…

I’ve a case where a xsd type is actually defined in another xsd file
(header.xsd included), and it does not work anymore.

I can provide test cases separately… Meanwhile I reverted back to 1.5.0.beta.3

In response to a bug report, I changed how the schema specified by
schemaLocation is fetched. That bug was fixed, but the fix seems to
have broken another case.

I filed this issue:
https://github.com/tenderlove/nokogiri/issues/issue/417

-Yoko

Yoko_H · February 14, 2011, 3:08am

Pure Java Nokogiri 1.5.0.beta.4 is out. This version fixed the problem
of not working on rack based environment.

I’m not familiar with this problem. Is there some information
available about this?

I’m switching a parsing function in a JRuby on Rails app from REXML to
Nokogiri. It parses an XML file and runs a few XPaths against it. I
ran a benchmark comparing the two, and for good measure I ran it again
using 1.5.0.beta.4.

Times below are for 1000 iterations, after warm-up:

#-----------------------------------------------------
JRuby 1.6.0RC2 + nokogiri-1.4.4.2-java

               user     system      total        real
Nokogiri   6.503000   0.000000   6.503000 (  6.503000)
REXML     36.334000   0.000000  36.334000 ( 36.334000)

#-----------------------------------------------------
JRuby 1.6.0RC2 + nokogiri-1.5.0.beta.4-java

               user     system      total        real
Nokogiri  24.952000   0.000000  24.952000 ( 24.952000)
REXML     35.646000   0.000000  35.646000 ( 35.646000)
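
A comparison in this style could be set up roughly as follows. This is a sketch, not the benchmark that produced the numbers above: the sample document, the XPath, and the iteration count are invented, and each library is only measured if it can be loaded. `Benchmark.bmbm` runs a rehearsal pass before the measured pass, which serves as a simple warm-up.

```ruby
require 'benchmark'

# Invented sample document: 50 <book> elements with a <title> each.
sample_xml = '<catalog>' +
             (1..50).map { |i| "<book id='#{i}'><title>Book #{i}</title></book>" }.join +
             '</catalog>'
iterations = 20

candidates = {}

begin
  require 'rexml/document'
  candidates['REXML'] = lambda do
    doc = REXML::Document.new(sample_xml)
    REXML::XPath.match(doc, '//book/title').size
  end
rescue LoadError
  warn 'rexml not available; skipping'
end

begin
  require 'nokogiri'
  candidates['Nokogiri'] = lambda do
    Nokogiri::XML(sample_xml).xpath('//book/title').size
  end
rescue LoadError
  warn 'nokogiri not available; skipping'
end

# Rehearsal pass first (warm-up), then the measured pass.
Benchmark.bmbm do |bm|
  candidates.each do |label, work|
    bm.report(label) { iterations.times { work.call } }
  end
end
```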

Yoko_H · March 31, 2011, 1:32am

Erik,

Thanks for sharing the sample code and data.
After beta.4 was released, I tried various possible performance
improvements. The current master branch is now a little better,
though it’s still slow compared to the FFI/C versions, especially when
parsing a large XML document. For example, with JRuby 1.6.0 on my MacBook,
the results were:

Nokogiri beta.4

$ jruby footest-perf.rb
Took 7924 ms

master branch

$ jruby footest-perf.rb
Took 7300 ms

I tried various JRuby and JVM options, too. Among them, a combination
of jruby.compile.mode=FORCE and -server worked for me.

$ jruby -J-server -J-Djruby.compile.mode=FORCE footest-perf.rb
Took 6780 ms

Including JRuby’s default settings, the options passed to the process
were:

-Xserver -Djruby.memory.max=500m -Djruby.stack.max=2048k -Xmx500m
-Xss2048k
-Djffi.boot.library.path=/Users/yoko/Tools/jruby-1.6.0/lib/native/Darwin
-Djruby.compile.mode=FORCE -Dfile.encoding=UTF-8
-Xbootclasspath/a:/Users/yoko/Tools/jruby-1.6.0/lib/jruby.jar
-Djruby.home=/Users/yoko/Tools/jruby-1.6.0
-Djruby.lib=/Users/yoko/Tools/jruby-1.6.0/lib -Djruby.script=jruby
-Djruby.shell=/bin/sh
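
If you want to check which JVM options your own JRuby process actually received, something like the following sketch may help. The `jvm_input_arguments` helper is a made-up name; it uses the standard `java.lang.management` runtime bean and simply returns an empty list on non-JRuby interpreters.

```ruby
# Hypothetical helper: list the arguments the JVM was started with.
# Only meaningful under JRuby; other Rubies get an empty array.
def jvm_input_arguments
  return [] unless RUBY_PLATFORM == 'java'

  require 'java'
  bean = Java::JavaLangManagement::ManagementFactory.getRuntimeMXBean
  bean.getInputArguments.to_a
end

puts jvm_input_arguments
```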

As far as I could tell from watching the process in jconsole, memory
usage was not the cause of the slowness. The process used less than
150m of heap. Judging from the output of -Xprof (JVM) and --profile
(JRuby), gist:895486 on GitHub, the main cause of the slowness
seems to be the creation of the DOM tree by Xerces. However, I can’t
replace the XML parser at this moment. That would be a huge change and
need a lot of work. So, you’d be better off using the C version of
Nokogiri when you parse large XML files.

Sorry I can’t be of more help.
-Yoko

Yoko_H · February 16, 2011, 8:33am

On Fri, Jan 28, 2011 at 3:07 PM, Thomas E Enebo [email protected]
wrote:

Took 5797 ms

Is this test available? One user had commented that the Java adapter
was quite a bit faster for what they were doing.

Attached are the tests I was using. They run roxml for mapping,
which in turn uses Nokogiri.

Current performance:

: [erik@zyrgelkwyt3]$ ; rvm use 1.8.7
: [erik@zyrgelkwyt3]$ ; ruby -I. footest-perf.rb
Took 258 ms

: [erik@zyrgelkwyt3]$ ; rvm use 1.9.2
: [erik@zyrgelkwyt3]$ ; ruby -I. footest-perf.rb
Took 185 ms

: [erik@zyrgelkwyt3]$ ; rvm jruby-1.6.0.RC2@global
: [erik@zyrgelkwyt3]$ ; ruby -I. footest-perf.rb
Took 6042 ms

: [erik@zyrgelkwyt3]$ ; rvm jruby-1.6.0.RC2@nokogiri-ffi
: [erik@zyrgelkwyt3]$ ; ruby -I. footest-perf.rb
Took 2616 ms

/Erik

Yoko_H · March 31, 2011, 1:59am

As far as I watched the process over jconsole, memory usage was not
the factor of slowness. The process used less than 150m of heap. From
the result of -Xprof (JVM) and --profile (JRuby),
gist:895486 · GitHub, the biggest reason of the slowness
seems to come from creating DOM tree by Xerces. However, I can’t
replace XML parser at this moment. That will be a huge change and need
a lot of work. So, you’d better use C version of Nokogiri when you
parse large xml files.

Yoko, have you evaluated which XML parser would be a good fit for
replacing Xerces? I know it would involve hard work, but if Xerces is
the culprit I don’t see a better option than to start organizing the effort.

Douglas Campos (qmx)
[email protected]

Yoko_H · March 31, 2011, 2:37am

On Wed, Mar 30, 2011 at 7:58 PM, Douglas Campos (qmx) [email protected]
wrote:

As far as I watched the process over jconsole, memory usage was not
the factor of slowness. The process used less than 150m of heap. From
the result of -Xprof (JVM) and --profile (JRuby),
gist:895486 · GitHub, the biggest reason of the slowness
seems to come from creating DOM tree by Xerces. However, I can’t
replace XML parser at this moment. That will be a huge change and need
a lot of work. So, you’d better use C version of Nokogiri when you
parse large xml files.
Yoko, have you evaluated what xml parser would be a good fit for replacing
xerces? I know it would involve hard work, but if xerces is the culprit I don’t
see better options than start organizing the effort

I have just looked at which parsers could cover which features of
Nokogiri. It’s not only that; the software license matters, too. So far,
there is no replacement for Xerces.

-Yoko