[ANN] Rubydoop 1.0, write Hadoop jobs in JRuby

Hi,

I’ve just released v1.0.0 of Rubydoop. It’s a gem that makes it possible
to write Hadoop jobs in Ruby without using the streaming APIs. It
configures the Hadoop runtime to run your Ruby code in an embedded JRuby
runtime, and it provides a configuration DSL that’s way nicer to use
than Hadoop’s ToolRunner.
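
To give a taste of the DSL, a minimal word count setup looks something
like this (the details are in the README; the class and type names below
are just for illustration):

    Rubydoop.configure do |input_path, output_path|
      job 'word_count' do
        input input_path
        output output_path
        mapper WordCount::Mapper      # a plain Ruby class with a map(key, value, context) method
        reducer WordCount::Reducer    # a plain Ruby class with a reduce(key, values, context) method
        output_key Hadoop::Io::Text
        output_value Hadoop::Io::IntWritable
      end
    end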

Check out the code, some instructions and examples at

If you’re into Hadoop and JRuby, I really hope this will help you, and
if you have any suggestions or feature requests, please let me know.

yours,
Theo, @iconara

hi –

i’ve been looking at amazon elastic mapreduce + jruby myself. how does
your approach differ from using something like warbler to compile ruby
sources + gem dependencies into a standalone jar then using that jar
when invoking hadoop?

thanks.

thanks. i’ve been playing around with rubydoop today, and have a
question: it looks to me like the packaging won’t honor gemspecs, it
only scans each gem’s lib/ directory and packages up whatever it finds
in there. is that correct, or am i misconfiguring something? (one of my
local gems happens to have something in app/ as well as lib/)

Rubydoop tries to solve the problem of how to make Hadoop talk to your
Ruby code, without forcing you to write lots of annotations to make your
Ruby classes look like Java classes. You could say that it's more or less
the same thing as using Warbler and all that, but doing that is a lot of
work.

I'm currently running a Rubydoop-based job on EMR on 640 CPUs, and it
works like a charm. If you're going to run it on lots of data, you should
use v1.1.0.pre2; it contains a few things I've had to fix when scaling up
from small jobs to massive jobs.

T#

i think i see what’s going on: it’s respecting the require_paths
directive, but not the files directive.
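
to make it concrete, here's a made-up gemspec showing the two directives
i mean; from what i can tell rubydoop packages whatever is under
require_paths and ignores the files list:

    Gem::Specification.new do |s|
      s.name          = 'my_gem'
      s.version       = '0.0.1'
      s.require_paths = %w[lib]                # this is what gets scanned and packaged
      s.files         = Dir['lib/**/*.rb'] +
                        Dir['vendor/**/*.rb']  # ...while this list seems to be ignored
    end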

also it looks like it flattens out the folder hierarchy such that if i
have:

my_gem
|- lib/
|  |- one.rb
|- vendor/
   |- two.rb

the jar will contain

/one.rb
/two.rb

a problem with this is if one.rb has a relative path, e.g.

here = File.expand_path(File.dirname(__FILE__))
require "#{here}/../vendor/two.rb"

this is now broken. also if you have two gems containing the same
filename, it seems like you’d have a problem.

i should mention that this is for a locally installed gem (i.e.
gem 'blah', :path => '…/…/'; not sure if it makes a difference).

"I'd love to find a safer way of packaging the dependencies, but short of
putting the whole gem directories into the jar and doing some magic to
make sure that they all are on the load path I think that this works
reasonably well."

that doesn’t sound so bad, i was in the middle of playing around with
something like:

def load_gem_files
  Bundler.definition.specs_for(@options[:gem_groups]).flat_map do |spec|
    if spec.full_name !~ /^(?:bundler|rubydoop|jruby-openssl)-\d+/
      spec.files.select { |f| f =~ /\.rb$/ }.map do |f|
        [spec.full_gem_path, f]
      end
    else
      []
    end
  end
end

then in the ant jar definition, adding:

    bundled_gem_files.each do |path|
      fileset :dir => path[0], :includes => "#{path[1]}"
    end

it’s a little ugly since we could unify files with common parent
folders, but it works in that the correct files with the correct folder
structure are added to the jar.
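
for example, the unification could look something like this (untested,
and it reuses the bundled_gem_files pairs and the fileset helper from the
snippets above):

    # group the [gem_path, relative_file] pairs by gem root so each root
    # contributes a single fileset with a comma-separated include list
    bundled_gem_files.group_by(&:first).each do |root, files|
      fileset :dir => root, :includes => files.map(&:last).join(',')
    end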

the thing i haven’t figured out yet is how to ensure that all those
files are in the load path. i’m probably missing something though,
because that seems like it should be straightforward – can’t you just
add all those directories to the manifest file’s classpath (since my
understanding is that the jruby load_path and classpath are unified)?
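
e.g. i was imagining something along these lines in the jar step (a
guess, using the same ant builder style as the fileset call above, and
the Class-Path entries are just placeholders):

    jar :destfile => 'build/my_job.jar' do
      manifest do
        attribute :name => 'Class-Path', :value => 'lib/ classes/'
      end
      # ...filesets as above...
    end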

if you get something working, I'd love to merge it in.

T#

yes, it only looks at the require paths. there are a lot of gems that are
set up very differently; I tried a few things and using require paths
worked the best. there are a lot of gems that don't have anything, or
very little, in the files list. many gems are packaged really badly, and
one of the worst offenders is jruby-openssl, which has caused me no end
of trouble (it went so far that I had to put in special handling for it,
basically rewriting its structure before adding it to the jar).

the require path thing works because these are the paths put on the load
path, and even if you have two files with the same name on the load path,
require will only find the first of them anyway – so it doesn't matter
that only one gets packaged into the jar. that is, unless your gem tries
to be clever, like jruby-openssl, and loads things that are neither in
any require path nor in the files list.
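
to spell that out with a toy example (made-up paths): if two gems both
ship a lib/util.rb, only the first one on the load path is ever loaded,
so packaging just one copy doesn't change anything:

    $LOAD_PATH.unshift 'gem_a/lib', 'gem_b/lib'  # both directories contain util.rb
    require 'util'   # loads gem_a/lib/util.rb
    require 'util'   # no-op; gem_b's copy is never seen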

it gets really hairy when the code is running from inside a jar and you
get paths like blah/blah/job.jar!/some/gem/…/…/…/doblidoo/file.rb, since
one of those … takes you outside of the jar file.

I'd love to find a safer way of packaging the dependencies, but short of
putting the whole gem directories into the jar and doing some magic to
make sure that they all are on the load path, I think that this works
reasonably well. I did some work on the packaging in redstorm and it
wasn't very pretty. I wanted to get away from the complicated things that
happen when you just stuff all the gems into the jar and try to set up
the correct load path. putting all the require paths in the root of the
jar makes all well-behaved gems, and most slightly weird gems, work out
of the box.

maybe bundler can be coaxed into helping, but it behaves strangely when
it’s not in control of things.

T#

which version of Hadoop are you running? I’ve targeted 1.0.3.

from the error it looks like it's trying to run another job after the
first has finished and that it's that job that is complaining.

T#

i ended up adding this line to my MR:

$LOAD_PATH << File.expand_path(File.dirname(__FILE__)) + '/lib'

as a way of forcing the load path to be what i want. i couldn’t get
jruby to respect the manifest classpath.

my mapreduce actually completes now (i get the reducer output file), but
at the end it complains about the output path existing (which it
definitely does not before i run the command). have you run into this?

Ilyas-MacBook-Pro:word_count ilya$ hadoop jar build/word_count.jar
word_count Gemfile /tmp/hadoop3
12/12/19 18:03:55 INFO mapred.Task: Task:attempt_local_0001_r_000000_0
is done. And is in the process of commiting
12/12/19 18:03:55 INFO mapred.LocalJobRunner:
12/12/19 18:03:55 INFO mapred.Task: Task attempt_local_0001_r_000000_0
is allowed to commit now
12/12/19 18:03:55 INFO output.FileOutputCommitter: Saved output of task
‘attempt_local_0001_r_000000_0’ to /tmp/hadoop3
12/12/19 18:03:55 INFO mapred.LocalJobRunner: reduce > reduce
12/12/19 18:03:55 INFO mapred.Task: Task ‘attempt_local_0001_r_000000_0’
done.
12/12/19 18:03:56 INFO mapred.JobClient: map 100% reduce 100%
12/12/19 18:03:56 INFO mapred.JobClient: Job complete: job_local_0001
12/12/19 18:03:56 INFO mapred.JobClient: Counters: 18
12/12/19 18:03:56 INFO mapred.JobClient: Rubydoop
12/12/19 18:03:56 INFO mapred.JobClient: JRuby runtimes created=2
12/12/19 18:03:56 INFO mapred.JobClient: File Output Format Counters
12/12/19 18:03:56 INFO mapred.JobClient: Bytes Written=217
12/12/19 18:03:56 INFO mapred.JobClient: FileSystemCounters
12/12/19 18:03:56 INFO mapred.JobClient: FILE_BYTES_READ=47005944
12/12/19 18:03:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=47444829
12/12/19 18:03:56 INFO mapred.JobClient: File Input Format Counters
12/12/19 18:03:56 INFO mapred.JobClient: Bytes Read=258
12/12/19 18:03:56 INFO mapred.JobClient: Map-Reduce Framework
12/12/19 18:03:56 INFO mapred.JobClient: Map output materialized
bytes=342
12/12/19 18:03:56 INFO mapred.JobClient: Map input records=10
12/12/19 18:03:56 INFO mapred.JobClient: Reduce shuffle bytes=0
12/12/19 18:03:56 INFO mapred.JobClient: Spilled Records=50
12/12/19 18:03:56 INFO mapred.JobClient: Map output bytes=286
12/12/19 18:03:56 INFO mapred.JobClient: Total committed heap usage
(bytes)=507133952
12/12/19 18:03:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
12/12/19 18:03:56 INFO mapred.JobClient: Combine input records=0
12/12/19 18:03:56 INFO mapred.JobClient: Reduce input records=25
12/12/19 18:03:56 INFO mapred.JobClient: Reduce input groups=20
12/12/19 18:03:56 INFO mapred.JobClient: Combine output records=0
12/12/19 18:03:56 INFO mapred.JobClient: Reduce output records=20
12/12/19 18:03:56 INFO mapred.JobClient: Map output records=25
12/12/19 18:03:56 WARN mapred.JobClient: No job jar file set. User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/12/19 18:03:56 INFO mapred.JobClient: Cleaning up the staging area
file:/tmp/hadoop-ilya/mapred/staging/ilya1409933347/.staging/job_local_0002
12/12/19 18:03:56 ERROR security.UserGroupInformation:
PriviledgedActionException as:ilya
cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output
directory /tmp/hadoop3 already exists
Exception in thread “main”
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
/tmp/hadoop3 already exists
at
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:949)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:31)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

i was running 1.1.1. via a combination of downgrading to 1.0.3 plus
some other changes i’ve been making, i successfully ran a mapreduce
locally using my entire rails app packaged up in a single jar (a few
hundred gems, roughly 4000 files).

trying the same jar with elastic mapreduce, however, gives me an error
right now:

Exception in thread “main” rubydoop.RubydoopRunnerException: Could not
load job setup script “word_count”
at rubydoop.RubydoopJobRunner.configureJobs(RubydoopJobRunner.java:63)
at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:30)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.jruby.exceptions.RaiseException: (LoadError) no such file
to load – rubygems/path_support

the $LOAD_PATH as reported by my mapper is:

[“file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/site_ruby/1.9”,
“file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared”,
“file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/1.9”,
“/mnt/var/lib/hadoop/tmp/hadoop-unjar1366878146839692659”,
“/mnt/var/lib/hadoop/tmp/hadoop-unjar1366878146839692659/lib”]

some quick googling led me to this:
http://techpolesen.blogspot.com/2007/07/jruby-runtime-initialization-and.html,
but replacing

    Ruby runtime = Ruby.newInstance(config);

with

    Ruby runtime = JavaEmbedUtils.initialize(new ArrayList(), config);

in InstanceContainer didn't seem to make any difference. i plan on
tackling this next, any pointers would be appreciated.

Theo Hultberg wrote in post #1089694:

which version of Hadoop are you running? I’ve targeted 1.0.3.

from the error it looks like it's trying to run another job after the
first has finished and that it's that job that is complaining.

T#

good catch, i didn't notice hadoop's jruby 1.6 jar had snuck in there.
you mentioned that you're running rubydoop on EMR – is there a setting
that allows you to use 1.7.x there? i assume you have to change hadoop's
classpath before starting up, via something like this AWS forums post?

JRuby 1.6.x has some bugs in the embedding that I've run into. That
particular error will resolve if you upgrade to JRuby 1.7.x, but if you
can't you might have to fiddle with the load path a bit (I forget exactly
where rubygems/path_support is located).

Rubydoop removes the 1.8 stdlib from the load path, since JRuby 1.6.x
adds both the 1.8 and 1.9 stdlibs for some reason, even when the runtime
is explicitly running in 1.9 mode.

T#

I have no idea why there’s a jruby-complete.jar on the default EMR
installation, but so far I haven’t noticed any bad effects from removing
it.

T#

oooooh, now I remember. when you run on EMR you need to run this as a
bootstrap action:

#!/bin/bash
rm /home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar

I’ve just copy and pasted that so many times now I’ve forgotten about
it. I
should write a guide and stick all these things there for posterity.

if you launch your flow using the aws-sdk gem you can add this:

remove_old_jruby_action = {
    :name => 'remove_old_jruby',
    :script_bootstrap_action => {
      :path => "s3://your-bucket/remove_old_jruby.sh"
    }
}

and then add that to the :bootstrap_actions option to
JobFlowCollection#create. don’t forget to upload the script to s3 before
you run the flow.
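
for completeness, the whole thing hangs together roughly like this with
the v1 aws-sdk gem (I'm writing this from memory, so double-check the
option names; the bucket names and the step layout are just examples):

    require 'aws-sdk'

    emr = AWS::EMR.new
    emr.job_flows.create('rubydoop-job',
      :bootstrap_actions => [remove_old_jruby_action],
      :steps => [{
        :name => 'word_count',
        :hadoop_jar_step => {
          :jar => 's3://your-bucket/word_count.jar',
          :args => ['word_count', 's3://your-bucket/input', 's3://your-bucket/output']
        }
      }]
    )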

T#

thanks, that helped me make progress. the thing i’m struggling with now
is that it seems like EMR does not unjar my jar before running it, which
is what my local hadoop does (i.e. everything gets loaded from a
/tmp/… directory that my jar gets uncompressed to locally). in EMR, i
get stacktraces like:

org.jruby.exceptions.RaiseException: (ENOENT) No such file or directory

jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!
at org.jruby.RubyFile.realpath(org/jruby/RubyFile.java:760)
at
RUBY.realpath(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/lib/jruby-complete-1.7.1.jar!/META-INF/jruby.home/lib/ruby/1.9/pathname.rb:446)
at
RUBY.find_root_with_flag(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:637)
at
RUBY.config(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:511)
at
org.jruby.RubyBasicObject.send(org/jruby/RubyBasicObject.java:1659)
at RUBY.config(jar:0)
at
RUBY.Engine(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/cs-base/engine.rb:11)

from which it looks like EMR hadoop is loading directly from that jar
file, which makes any attempts to call Pathname.new().realpath (for
example) fail. some gems try to do that (to load their configs, etc).
i’m trying to figure out how to get EMR to stage my jar similarly to
what my local hadoop installation does.
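
here's a stripped-down version of the failing pattern, in case anyone
else hits it (paths are just examples):

    require 'pathname'

    # fine when hadoop has unpacked the jar to a real directory:
    Pathname.new('/tmp/hadoop-unjar123/lib').realpath

    # but when __FILE__ points inside the job jar, e.g.
    # "jar:file:/mnt/.../job.jar!/lib/something.rb", there is no real
    # filesystem path to resolve and realpath raises Errno::ENOENT
    Pathname.new(File.dirname(__FILE__)).realpath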

okay, after wrestling with rails, jruby and hadoop for a few days, i'm
finally able to use my rails app in the mapper class. the key sticking
point turned out to be the fact that rails searches for its configuration
by walking up the directory tree using File.dirname, but if running from
a jar, jruby's RubyFile implementation refuses to treat the "root"
directory as a valid path entry. the easy solution was to modify
rubydoop's package.rb to jar up my rails app code under a /classes
directory instead of at the top level. i then needed to add /classes and
/classes/lib to the $LOAD_PATH. (it has to be /classes due to hadoop's
classpath when the job jar is run by the workers.)
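
the load path part of the change boils down to something like this in the
job setup script (a sketch; the exact paths depend on how hadoop puts the
job jar on the classpath):

    %w[classes classes/lib].each do |dir|
      $LOAD_PATH.unshift(dir) unless $LOAD_PATH.include?(dir)
    end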

i'm attaching my patched-up version of package.rb, which is the only
change to rubydoop.rb i needed to make. if it looks interesting, let me
know and i can send you a pull request on github.

thanks for all your help.

Thanks for the patch. I'm not near my computer at the moment so I haven't
looked at it, but I'd rather you submitted a pull request anyway; it's
much easier to work with and discuss.

T#

sure
