Forum: JRuby [ANN] Rubydoop 1.0, write Hadoop jobs in JRuby

Posted by Theo (Guest)
on 2012-10-02 07:46
(Received via mailing list)
Hi,

I've just released v1.0.0 of Rubydoop. It's a gem that makes it possible 
to write Hadoop jobs in Ruby without using the streaming APIs. It 
configures the Hadoop runtime to run your Ruby code in an embedded JRuby 
runtime, and it provides a configuration DSL that's way nicer to use 
than Hadoop's ToolRunner.

Check out the code, some instructions and examples at 
https://github.com/iconara/rubydoop

If you're into Hadoop and JRuby, I really hope this will help you, and 
if you have any suggestions or feature requests, please let me know.

yours,
Theo, @iconara
Posted by Ilya K. (ilya_k)
on 2012-12-17 21:33
hi --

i've been looking at amazon elastic mapreduce + jruby myself.  how does
your approach differ from using something like warbler to compile ruby
sources + gem dependencies into a standalone jar then using that jar
when invoking hadoop?

thanks.
Posted by Theo Hultberg (Guest)
on 2012-12-18 19:33
(Received via mailing list)
Rubydoop tries to solve the problem of how to make Hadoop talk to your Ruby
code, without forcing you to write lots of annotations to make your Ruby
classes look like Java classes. You could say that it's more or less the
same thing as using Warbler and all that, but doing that is a lot of work.

I'm currently running a Rubydoop-based job on EMR on 640 CPUs, and it works
like a charm. If you're going to run it on lots of data, you should use
v1.1.0.pre2; it contains a few things I've had to fix when scaling up from
small jobs to massive jobs.

T#
Posted by Ilya K. (ilya_k)
on 2012-12-19 02:41
thanks.  i've been playing around with rubydoop today, and have a 
question: it looks to me like the packaging won't honor gemspecs, it 
only scans each gem's lib/ directory and packages up whatever it finds 
in there.  is that correct, or am i misconfiguring something? (one of my 
local gems happens to have something in app/ as well as lib/)
Posted by Ilya K. (ilya_k)
on 2012-12-19 03:50
i think i see what's going on: it's respecting the require_paths 
directive, but not the files directive.

also it looks like it flattens out the folder hierarchy such that if i 
have:

my_gem
|
|- lib/
   |-one.rb
|-vendor/
   |-two.rb

the jar will contain

/one.rb
/two.rb

a problem with this is if one.rb has a relative path, e.g.

here=File.expand_path(File.dirname(__FILE__))
require "#{here}/../vendor/two.rb"

this is now broken.  also if you have two gems containing the same 
filename, it seems like you'd have a problem.

i should mention that this is for a locally installed gem (i.e.
gem 'blah', :path => '../../'
not sure if it makes a difference)
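
To make the collision concrete, here is a minimal sketch in plain Ruby of what copying every require path into the jar root implies. It does not use Rubydoop's packaging code; `flatten_require_paths`, the gem names and the file layouts are all hypothetical and only model the behaviour described in this post.

```ruby
# Hypothetical model of a packager that copies the contents of every
# require path into the root of the jar. Not Rubydoop's actual code.
def flatten_require_paths(gems)
  entries = {}    # jar entry => gem that supplied it
  collisions = [] # [jar entry, gem that won, gem that lost]
  gems.each do |gem_name, require_paths|
    require_paths.each_value do |files|
      files.each do |file|
        # The require path prefix (e.g. "lib/") is dropped, so the file
        # lands directly in the jar root.
        if entries.key?(file)
          collisions << [file, entries[file], gem_name]
        else
          entries[file] = gem_name
        end
      end
    end
  end
  [entries, collisions]
end

gems = {
  'my_gem'    => { 'lib' => ['one.rb'], 'vendor' => ['two.rb'] },
  'other_gem' => { 'lib' => ['one.rb', 'other/three.rb'] }
}

entries, collisions = flatten_require_paths(gems)
puts entries.keys.sort.inspect # ["one.rb", "other/three.rb", "two.rb"]
puts collisions.inspect        # [["one.rb", "my_gem", "other_gem"]]
```

In this model one.rb and two.rb end up as siblings, so a require that goes through "../vendor" no longer points at anything, and the second one.rb is silently dropped.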
Posted by Theo Hultberg (Guest)
on 2012-12-19 07:17
(Received via mailing list)
yes, it only looks at the require paths. there are a lot of gems that are
set up very differently, I tried a few things and it looked like using
require paths worked the best. there are a lot of gems that don't have
anything, or very little, in the files list. many gems are packaged really
badly, and one of the worst offenders is jruby-openssl, which has caused me
no end of trouble (it went so far that I had to put in special handling for
it, basically rewriting its structure before adding it to the jar).

the require path thing works because these are the paths put on the load
path, and even if you have two files with the same name on the load path,
require will only find the first of them anyway -- so it doesn't matter
that only one gets packaged into the jar. that is, unless your gem tries to
be clever, like jruby-openssl, and loads things that are neither in any
require path nor in the files list.
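
The first-match behaviour of require can be checked with a small plain Ruby sketch. The two temporary directories stand in for the require paths of two gems, and the file name clash_demo.rb and the constant are made up for the illustration.

```ruby
require 'tmpdir'
require 'fileutils'

# Two directories stand in for the require paths of two gems that both
# ship a file called "clash_demo.rb" (a made-up name for this sketch).
first_dir  = Dir.mktmpdir('first_gem')
second_dir = Dir.mktmpdir('second_gem')
File.open(File.join(first_dir, 'clash_demo.rb'), 'w') do |f|
  f.puts "CLASH_DEMO_SOURCE = 'first'"
end
File.open(File.join(second_dir, 'clash_demo.rb'), 'w') do |f|
  f.puts "CLASH_DEMO_SOURCE = 'second'"
end

# Both directories go on the load path, first_dir ahead of second_dir.
$LOAD_PATH << first_dir << second_dir
require 'clash_demo'

# Only the copy from the earlier load path entry was loaded.
puts CLASH_DEMO_SOURCE

# Clean up the load path and the temporary directories.
$LOAD_PATH.delete(first_dir)
$LOAD_PATH.delete(second_dir)
FileUtils.remove_entry(first_dir)
FileUtils.remove_entry(second_dir)
```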

it gets really hairy when the code is running from inside a jar and you get
paths like blah/blah/job.jar!/some/gem/../../../doblidoo/file.rb, since one
of those .. takes you outside of the jar file.
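
As an illustration of why such a path is a problem, the sketch below resolves the ".." segments of the part after "jar!/" and reports when they climb above the jar root. `resolve_inside_jar` is a hypothetical helper written for this example; it is plain string handling and not something JRuby or Rubydoop provides.

```ruby
# Resolves ".." segments in the part of a path that comes after "jar!/".
# Returns the normalized path, or nil if the ".." segments climb above
# the jar root. Hypothetical helper, plain string handling only.
def resolve_inside_jar(path)
  jar, inner = path.split('!/', 2)
  return nil if inner.nil?
  stack = []
  inner.split('/').each do |segment|
    case segment
    when '', '.'
      next
    when '..'
      return nil if stack.empty? # this ".." would step outside the jar
      stack.pop
    else
      stack << segment
    end
  end
  "#{jar}!/#{stack.join('/')}"
end

# Stays inside the jar: prints "blah/blah/job.jar!/some/other/file.rb"
puts resolve_inside_jar('blah/blah/job.jar!/some/gem/../other/file.rb').inspect
# The path from the post: the third ".." leaves the jar, so this prints nil
puts resolve_inside_jar('blah/blah/job.jar!/some/gem/../../../doblidoo/file.rb').inspect
```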

I'd love to find a safer way of packaging the dependencies, but short of
putting the whole gem directories into the jar and doing some magic to make
sure that they all are on the load path I think that this works reasonably
well. I did some work on the packaging in redstorm and it wasn't very
pretty. I wanted to get away from the complicated things that happen when
you just stuff all the gems into the jar and try to set up the correct load
path. putting all the require paths in the root of the jar makes all
well-behaved gems, and most slightly weird gems work out of the box.

maybe bundler can be coaxed into helping, but it behaves strangely when
it's not in control of things.

T#
Posted by Ilya K. (ilya_k)
on 2012-12-19 09:09
"I'd love to find a safer way of packaging the dependencies, but short
of putting the whole gem directories into the jar and doing some magic to
make sure that they all are on the load path I think that this works
reasonably well."

that doesn't sound so bad, i was in the middle of playing around with
something like:

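    # collect [gem_dir, relative_path] pairs for every .rb file that the
    # bundled gems list in the files directive of their gemspec
    def load_gem_files
      Bundler.definition.specs_for(@options[:gem_groups]).flat_map do |spec|
        # skip bundler, rubydoop and jruby-openssl
        if spec.full_name !~ /^(?:bundler|rubydoop|jruby-openssl)-\d+/
          spec.files.select {|f| f =~ /\.rb$/}.map do |f|
            [spec.full_gem_path, f]
          end
        else
          []
        end
      end
    end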

then in the ant jar definition, adding:

          bundled_gem_files.each { |path| fileset :dir => path[0], :includes => "#{path[1]}" }

it's a little ugly since we could unify files with common parent
folders, but it works in that the correct files with the correct folder
structure are added to the jar.

the thing i haven't figured out yet is how to ensure that all those
files are in the load path.  i'm probably missing something though,
because that seems like it should be straightforward -- can't you just
add all those directories to the manifest file's classpath (since my
understanding is that the jruby load_path and classpath are unified)?
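
The "unify files with common parent folders" idea could look roughly like the sketch below: group the [gem_dir, relative_file] pairs by gem directory so that each gem needs only one fileset. `group_by_gem_dir` and the example paths are hypothetical, and the sketch is plain Ruby with no Bundler or Ant involved.

```ruby
# Takes [gem_dir, relative_file] pairs, the shape returned by the
# load_gem_files method above, and groups the files per gem directory.
def group_by_gem_dir(pairs)
  grouped = Hash.new { |hash, dir| hash[dir] = [] }
  pairs.each { |dir, file| grouped[dir] << file }
  grouped
end

# Made-up example data.
pairs = [
  ['/gems/foo-1.0', 'lib/foo.rb'],
  ['/gems/foo-1.0', 'lib/foo/bar.rb'],
  ['/gems/baz-2.1', 'lib/baz.rb']
]

group_by_gem_dir(pairs).each do |dir, files|
  # Ant's fileset takes a comma-separated list of include patterns, so
  # one fileset per gem directory would be enough.
  puts "#{dir}: #{files.join(',')}"
end
```

Each group could then be passed to a single fileset with :dir => dir and :includes => files.join(','), though that has not been tried against Rubydoop's Ant setup here.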
Posted by Theo Hultberg (Guest)
on 2012-12-19 09:27
(Received via mailing list)
if you get something working it would be something I'd love to merge in.

T#
Posted by Ilya K. (ilya_k)
on 2012-12-20 04:01
i ended up adding this line to my MR:

$LOAD_PATH << File.expand_path(File.dirname(__FILE__)) + '/lib'

as a way of forcing the load path to be what i want.  i couldn't get 
jruby to respect the manifest classpath.

my mapreduce actually completes now (i get the reducer output file), but 
at the end it complains about the output path existing (which it 
definitely does not before i run the command).  have you run into this?


Ilyas-MacBook-Pro:word_count ilya$ hadoop jar build/word_count.jar word_count Gemfile /tmp/hadoop3
12/12/19 18:03:55 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/12/19 18:03:55 INFO mapred.LocalJobRunner:
12/12/19 18:03:55 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/12/19 18:03:55 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /tmp/hadoop3
12/12/19 18:03:55 INFO mapred.LocalJobRunner: reduce > reduce
12/12/19 18:03:55 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/12/19 18:03:56 INFO mapred.JobClient:  map 100% reduce 100%
12/12/19 18:03:56 INFO mapred.JobClient: Job complete: job_local_0001
12/12/19 18:03:56 INFO mapred.JobClient: Counters: 18
12/12/19 18:03:56 INFO mapred.JobClient:   Rubydoop
12/12/19 18:03:56 INFO mapred.JobClient:     JRuby runtimes created=2
12/12/19 18:03:56 INFO mapred.JobClient:   File Output Format Counters
12/12/19 18:03:56 INFO mapred.JobClient:     Bytes Written=217
12/12/19 18:03:56 INFO mapred.JobClient:   FileSystemCounters
12/12/19 18:03:56 INFO mapred.JobClient:     FILE_BYTES_READ=47005944
12/12/19 18:03:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=47444829
12/12/19 18:03:56 INFO mapred.JobClient:   File Input Format Counters
12/12/19 18:03:56 INFO mapred.JobClient:     Bytes Read=258
12/12/19 18:03:56 INFO mapred.JobClient:   Map-Reduce Framework
12/12/19 18:03:56 INFO mapred.JobClient:     Map output materialized bytes=342
12/12/19 18:03:56 INFO mapred.JobClient:     Map input records=10
12/12/19 18:03:56 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/12/19 18:03:56 INFO mapred.JobClient:     Spilled Records=50
12/12/19 18:03:56 INFO mapred.JobClient:     Map output bytes=286
12/12/19 18:03:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=507133952
12/12/19 18:03:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=125
12/12/19 18:03:56 INFO mapred.JobClient:     Combine input records=0
12/12/19 18:03:56 INFO mapred.JobClient:     Reduce input records=25
12/12/19 18:03:56 INFO mapred.JobClient:     Reduce input groups=20
12/12/19 18:03:56 INFO mapred.JobClient:     Combine output records=0
12/12/19 18:03:56 INFO mapred.JobClient:     Reduce output records=20
12/12/19 18:03:56 INFO mapred.JobClient:     Map output records=25
12/12/19 18:03:56 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/12/19 18:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-ilya/mapred/staging/ilya1409933347/.staging/job_local_0002
12/12/19 18:03:56 ERROR security.UserGroupInformation: PriviledgedActionException as:ilya cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/hadoop3 already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/hadoop3 already exists
  at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:949)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
  at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:31)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:82)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Posted by Theo Hultberg (Guest)
on 2012-12-20 08:13
(Received via mailing list)
which version of Hadoop are you running? I've targeted 1.0.3.

from the error it looks like it's trying to run another job after the first
has finished and that it's that job that is complaining.

T#
Posted by Ilya K. (ilya_k)
on 2012-12-20 10:39
i was running 1.1.1.  via a combination of downgrading to 1.0.3 plus 
some other changes i've been making, i successfully ran a mapreduce 
locally using my entire rails app packaged up in a single jar (a few 
hundred gems, roughly 4000 files).

trying the same jar with elastic mapreduce, however, gives me an error 
right now:

Exception in thread "main" rubydoop.RubydoopRunnerException: Could not load job setup script "word_count"
  at rubydoop.RubydoopJobRunner.configureJobs(RubydoopJobRunner.java:63)
  at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:30)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:82)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.jruby.exceptions.RaiseException: (LoadError) no such file to load -- rubygems/path_support

the $LOAD_PATH as reported by my mapper is:

["file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/site_ruby/1.9", 
"file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared", 
"file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/1.9", 
"/mnt/var/lib/hadoop/tmp/hadoop-unjar1366878146839692659", 
"/mnt/var/lib/hadoop/tmp/hadoop-unjar1366878146839692659/lib"]

some quick googling led me to this: 
http://techpolesen.blogspot.com/2007/07/jruby-runt..., 
but replacing

        Ruby runtime = Ruby.newInstance(config);

with
        Ruby runtime = JavaEmbedUtils.initialize(new ArrayList(), config);

in InstanceContainer didn't seem to make any difference.  i plan on 
tackling this next, any pointers would be appreciated.


Posted by Theo Hultberg (Guest)
on 2012-12-20 13:04
(Received via mailing list)
JRuby 1.6.x has some bugs in the embedding that I've run into. That
particular error will resolve if you upgrade to JRuby 1.7.x, but if you
can't you might have to fiddle with the load path a bit (I forget exactly
where rubygems/path_support is located).

Rubydoop removes the 1.8 stdlib from the load path, since JRuby 1.6.x adds
both the 1.8 and 1.9 stdlibs for some reason, even when the runtime is
explicitly running in 1.9 mode.
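
For anyone stuck on 1.6.x, filtering the load path could look something like the sketch below. It is only a guess at the idea: `without_18_stdlib` is a hypothetical helper, not Rubydoop's actual implementation, the pattern also drops site_ruby/1.8 (an assumption), and the 1.8 entries in the example are made up by analogy with the 1.9 entries in the EMR load path posted above.

```ruby
# Drops load path entries that point at the 1.8 standard library (and,
# as an assumption of this sketch, the matching site_ruby/1.8 entry).
# Hypothetical helper; Rubydoop's real implementation may differ.
def without_18_stdlib(load_path)
  load_path.reject { |path| path =~ %r{/lib/ruby/(?:site_ruby/)?1\.8\z} }
end

# Entries modelled on the EMR load path posted earlier in the thread;
# the two 1.8 entries are made up for the example.
jar = 'file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby'
load_path = [
  "#{jar}/site_ruby/1.9",
  "#{jar}/site_ruby/1.8",
  "#{jar}/site_ruby/shared",
  "#{jar}/1.9",
  "#{jar}/1.8"
]

filtered = without_18_stdlib(load_path)
puts filtered
```

Inside a running job the same filter could be applied in place with $LOAD_PATH.replace(without_18_stdlib($LOAD_PATH)).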

T#
Posted by Ilya K. (ilya_k)
on 2012-12-20 15:43
good catch, i didn't notice hadoop's jruby 1.6 jar had snuck in there. 
you mentioned that you're running rubydoop on EMR -- is there a setting 
that allows you to use 1.7.x there?  i assume you have to change 
hadoop's classpath before starting up, via something like this 
https://forums.aws.amazon.com/thread.jspa?messageID=347690?

Posted by Theo Hultberg (Guest)
on 2012-12-20 16:12
(Received via mailing list)
oooooh, now I remember. when you run on EMR you need to run this as a
bootstrap action:

    #!/bin/bash
    rm /home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar

I've just copied and pasted that so many times now I've forgotten about it. I
should write a guide and stick all these things there for posterity.

if you launch your flow using the aws-sdk gem you can add this:

    remove_old_jruby_action = {
      :name => 'remove_old_jruby',
      :script_bootstrap_action => {
        :path => "s3://your-bucket/remove_old_jruby.sh"
      }
    }

and then add that to the :bootstrap_actions option to
JobFlowCollection#create. don't forget to upload the script to s3 before
you run the flow.
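
Put together, the options could be assembled roughly as in the sketch below. The helper `script_bootstrap_action` and the flow name are made up for the example, the other job flow options are left out, and the commented-out call is only an assumed shape for JobFlowCollection#create in the aws-sdk gem, so check it against the gem's documentation before relying on it.

```ruby
# Builds one bootstrap action entry in the shape shown above.
# Hypothetical helper, not part of the aws-sdk gem.
def script_bootstrap_action(name, s3_path)
  { :name => name, :script_bootstrap_action => { :path => s3_path } }
end

options = {
  :bootstrap_actions => [
    script_bootstrap_action('remove_old_jruby', 's3://your-bucket/remove_old_jruby.sh')
  ]
  # the remaining job flow options (instances, steps, ...) are omitted
}

# With the aws-sdk gem loaded, the hash would then be handed to
# JobFlowCollection#create. Assumed call shape, not run in this sketch:
#   AWS::EMR.new.job_flows.create('my-flow', options)
puts options.inspect
```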

T#
Posted by Theo Hultberg (Guest)
on 2012-12-20 16:15
(Received via mailing list)
I have no idea why there's a jruby-complete.jar on the default EMR
installation, but so far I haven't noticed any bad effects from removing it.

T#
Posted by Ilya K. (ilya_k)
on 2012-12-21 23:24
thanks, that helped me make progress.  the thing i'm struggling with now 
is that it seems like EMR does not unjar my jar before running it, which 
is what my local hadoop does (i.e. everything gets loaded from a 
/tmp/... directory that my jar gets uncompressed to locally).  in EMR, i 
get stacktraces like:

org.jruby.exceptions.RaiseException: (ENOENT) No such file or directory - jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!
  at org.jruby.RubyFile.realpath(org/jruby/RubyFile.java:760)
  at RUBY.realpath(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/lib/jruby-complete-1.7.1.jar!/META-INF/jruby.home/lib/ruby/1.9/pathname.rb:446)
  at RUBY.find_root_with_flag(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:637)
  at RUBY.config(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:511)
  at org.jruby.RubyBasicObject.__send__(org/jruby/RubyBasicObject.java:1659)
  at RUBY.config(jar:0)
  at RUBY.Engine(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/cs-base/engine.rb:11)


from which it looks like EMR hadoop is loading directly from that jar 
file, which makes any attempts to call Pathname.new().realpath (for 
example) fail.  some gems try to do that (to load their configs, etc). 
i'm trying to figure out how to get EMR to stage my jar similarly to 
what my local hadoop installation does.
Posted by Ilya K. (ilya_k)
on 2012-12-24 08:13
Attachment: package.rb (6,73 KB)
okay, after wrestling with rails, jruby and hadoop for a few days, i'm 
finally able to use my rails app in the mapper class.  the key sticking 
point turned out to be the fact that rails searches for its 
configuration by walking up the directory tree by using File.dirname. 
but if running from a jar, jruby's RubyFile implementation refuses to 
treat the "root" directory as a valid path entry.  the easy solution was 
to modify rubydoop's package.rb to jar up my rails app code under a 
/classes directory instead of at the top level.  i then needed to add 
/classes and /classes/lib to the LOAD_PATH.  (it has to be /classes due to 
hadoop's classpath when the job jar is run by the workers.)
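
The kind of upward walk described here can be sketched in plain Ruby as below. This is a simplified illustration and not Rails' actual code: `find_root` is a hypothetical helper, config.ru is just the flag file picked for the example, and it runs against a temporary directory on the normal filesystem, so it does not reproduce the jar-specific failure itself.

```ruby
require 'tmpdir'
require 'fileutils'

# Walks up from start_dir with File.dirname until a directory that
# contains the flag file is found. Returns nil when the walk reaches
# the top of the tree without finding it.
def find_root(start_dir, flag)
  dir = start_dir
  loop do
    return dir if File.exist?(File.join(dir, flag))
    parent = File.dirname(dir)
    return nil if parent == dir # reached the top, nothing left to try
    dir = parent
  end
end

result = Dir.mktmpdir do |tmp|
  # Mimic the layout from the post: the app lives under classes/ rather
  # than at the very top, so the walk stops before reaching the root.
  app_root = File.join(tmp, 'classes')
  nested   = File.join(app_root, 'lib', 'my_engine')
  FileUtils.mkdir_p(nested)
  FileUtils.touch(File.join(app_root, 'config.ru'))
  find_root(nested, 'config.ru') == app_root
end
puts result
```

With the app code under /classes the walk finds its flag one level below the top of the jar, which fits the observation that the trouble started only once the walk reached the jar's "root" entry.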

i'm attaching my patched up version of package.rb, which is the only change 
to rubydoop i needed to make.  if it looks interesting, let me know 
and i can send you a pull request on github.

thanks for all your help.

Posted by Theo Hultberg (Guest)
on 2012-12-25 11:41
(Received via mailing list)
Thanks for the patch, not near my computer at the moment so I haven’t 
looked at it, but I'd rather that you submitted a pull request anyway, 
it's much easier to work with and discuss.

T#
Posted by Ilya K. (ilya_k)
on 2012-12-27 20:54
sure
