Hi,

I've just released v1.0.0 of Rubydoop. It's a gem that makes it possible to write Hadoop jobs in Ruby without using the streaming APIs. It configures the Hadoop runtime to run your Ruby code in an embedded JRuby runtime, and it provides a configuration DSL that's way nicer to use than Hadoop's ToolRunner.

Check out the code, some instructions and examples at https://github.com/iconara/rubydoop

If you're into Hadoop and JRuby, I really hope this will help you, and if you have any suggestions or feature requests, please let me know.

yours,
Theo, @iconara
on 2012-10-02 07:46
on 2012-12-17 21:33
hi -- i've been looking at amazon elastic mapreduce + jruby myself. how does your approach differ from using something like warbler to compile ruby sources + gem dependencies into a standalone jar then using that jar when invoking hadoop? thanks.
on 2012-12-18 19:33
Rubydoop tries to solve the problem of how to make Hadoop talk to your Ruby code, without forcing you to write lots of annotations to make your Ruby classes look like Java classes. You could say that it's more or less the same thing as using Warbler and all that, but doing that is a lot of work.

I'm currently running a Rubydoop-based job on EMR on 640 CPUs, and it works like a charm. If you're going to run it on lots of data, you should use v1.1.0.pre2, which contains a few things I've had to fix when scaling up from small jobs to massive jobs.

T#
on 2012-12-19 02:41
thanks. i've been playing around with rubydoop today, and have a question: it looks to me like the packaging won't honor gemspecs, it only scans each gem's lib/ directory and packages up whatever it finds in there. is that correct, or am i misconfiguring something? (one of my local gems happens to have something in app/ as well as lib/)
on 2012-12-19 03:50
i think i see what's going on: it's respecting the require_paths directive, but not the files directive.

also it looks like it flattens out the folder hierarchy such that if i have:

    my_gem
    |
    |- lib/
    |   |- one.rb
    |- vendor/
        |- two.rb

the jar will contain

    /one.rb
    /two.rb

a problem with this is if one.rb has a relative path, e.g.

    here = File.expand_path(File.dirname(__FILE__))
    require "#{here}/../vendor/two.rb"

this is now broken. also if you have two gems containing the same filename, it seems like you'd have a problem.

i should mention that this is for a locally installed gem (i.e.

    gem 'blah', :path => '../../'

not sure if it makes a difference)
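
To make the two concerns above concrete, here is a minimal sketch in plain Ruby. It does not use Rubydoop's packaging code: `flatten_into_jar`, `other_gem` and the hash layout are hypothetical, and the function only models the idea of copying the contents of every require path into the jar root.

```ruby
# Hypothetical model of flattening: every file under a gem's require paths
# ends up at the jar root, so the require path directory itself is dropped.
def flatten_into_jar(gems)
  entries = Hash.new { |hash, key| hash[key] = [] }
  gems.each do |gem_name, files_by_require_path|
    files_by_require_path.each do |require_path, files|
      files.each do |file|
        # Remember which source files map to each jar entry.
        entries["/#{file}"] << "#{gem_name}/#{require_path}/#{file}"
      end
    end
  end
  entries
end

gems = {
  'my_gem'    => { 'lib' => ['one.rb'], 'vendor' => ['two.rb'] },
  'other_gem' => { 'lib' => ['one.rb'] } # assumed second gem with the same filename
}

entries = flatten_into_jar(gems)

# Jar entries that more than one source file maps to.
collisions = entries.select { |_jar_path, sources| sources.size > 1 }

# The relative require in one.rb expects a vendor/ directory next to lib/,
# but no such entry exists after flattening.
vendor_path_survives = entries.key?('/vendor/two.rb')
```

In this model `collisions` contains `/one.rb`, and `vendor_path_survives` is false, which matches the broken relative require described in the post.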
on 2012-12-19 07:17
yes, it only looks at the require paths. there are a lot of gems that are set up very differently, I tried a few things and it looked like using require paths worked the best. there are a lot of gems that don't have anything, or very little, in the files list. many gems are packaged really badly, and one of the worst offenders is jruby-openssl, which has caused me no end of trouble (it went so far that I had to put in special handling for it, basically rewriting its structure before adding it to the jar).

the require path thing works because these are the paths put on the load path, and even if you have two files with the same name on the load path, require will only find the first of them anyway, so it doesn't matter that only one gets packaged into the jar. that is, unless your gem tries to be clever, like jruby-openssl, and loads things that are neither in any require path nor in the files list. it gets really hairy when the code is running from inside a jar and you get paths like blah/blah/job.jar!/some/gem/../../../doblidoo/file.rb, since one of those .. takes you outside of the jar file.

I'd love to find a safer way of packaging the dependencies, but short of putting the whole gem directories into the jar and doing some magic to make sure that they all are on the load path I think that this works reasonably well. I did some work on the packaging in redstorm and it wasn't very pretty. I wanted to get away from the complicated things that happen when you just stuff all the gems into the jar and try to set up the correct load path. putting all the require paths in the root of the jar makes all well-behaved gems, and most slightly weird gems, work out of the box.

maybe bundler can be coaxed into helping, but it behaves strangely when it's not in control of things.

T#
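
Two of the points above can be illustrated with a short, self-contained sketch. It uses only Ruby's standard `Pathname`; `first_on_load_path`, `gem_a`, `gem_b` and `util` are hypothetical names, and the lookup is a simplified model of how `require` walks the load path, not JRuby's implementation.

```ruby
require 'pathname'

# The path from the post: only two directories sit below "job.jar!",
# so the third ".." steps over the jar itself.
inside_jar = 'blah/blah/job.jar!/some/gem/../../../doblidoo/file.rb'

# cleanpath resolves ".." lexically, without touching the file system.
resolved = Pathname.new(inside_jar).cleanpath.to_s # => "blah/blah/doblidoo/file.rb"
escapes_jar = !resolved.include?('job.jar!')

# Simplified model of require: the first load path entry that has the file wins.
def first_on_load_path(load_path, feature, packaged_files)
  load_path
    .map { |dir| File.join(dir, "#{feature}.rb") }
    .find { |candidate| packaged_files.include?(candidate) }
end

load_path = ['gem_a/lib', 'gem_b/lib']
packaged  = ['gem_a/lib/util.rb', 'gem_b/lib/util.rb']
winner = first_on_load_path(load_path, 'util', packaged)
```

Here `winner` is the copy under `gem_a/lib`, so the second `util.rb` would never have been loaded even if both had been packaged.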
on 2012-12-19 09:09
"I'd love to find a safer way of packaging the dependencies, but short of putting the whole gem directories into the jar and doing some magic to make sure that they all are on the load path I think that this works reasonably well."
that doesn't sound so bad, i was in the middle of playing around with something like:

    def load_gem_files
      Bundler.definition.specs_for(@options[:gem_groups]).flat_map do |spec|
        if spec.full_name !~ /^(?:bundler|rubydoop|jruby-openssl)-\d+/
          spec.files.select { |f| f =~ /\.rb$/ }.map do |f|
            [spec.full_gem_path, f]
          end
        else
          []
        end
      end
    end

then in the ant jar definition, adding:

    bundled_gem_files.each { |path| fileset :dir => path[0], :includes => "#{path[1]}" }

it's a little ugly since we could unify files with common parent folders, but it works in that the correct files with the correct folder structure are added to the jar.
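
The "unify files with common parent folders" idea could be sketched roughly as follows. This is plain Ruby over the `[gem_path, file]` pairs that `load_gem_files` returns; `group_by_gem_path` and the sample paths are hypothetical, and the result is only a list of hashes, not a call into ant. It relies on ant's `includes` attribute accepting a comma-separated list of patterns.

```ruby
# Collect all files that share a gem directory under one key.
def group_by_gem_path(pairs)
  pairs.each_with_object({}) do |(gem_path, file), grouped|
    (grouped[gem_path] ||= []) << file
  end
end

# Made-up sample data in the same shape as the output of load_gem_files.
pairs = [
  ['/gems/foo-1.0', 'lib/foo.rb'],
  ['/gems/foo-1.0', 'lib/foo/bar.rb'],
  ['/gems/baz-2.1', 'lib/baz.rb']
]

grouped = group_by_gem_path(pairs)

# One fileset description per gem instead of one per file.
filesets = grouped.map do |gem_path, files|
  { :dir => gem_path, :includes => files.join(',') }
end
```

Each hash in `filesets` could then be passed to a single `fileset` call, so a gem with hundreds of files produces one fileset rather than hundreds.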
the thing i haven't figured out yet is how to ensure that all those
files are in the load path. i'm probably missing something though,
because that seems like it should be straightforward -- can't you just
add all those directories to the manifest file's classpath (since my
understanding is that the jruby load_path and classpath are unified)?
on 2012-12-20 04:01
i ended up adding this line to my MR:

    $LOAD_PATH << File.expand_path(File.dirname(__FILE__)) + '/lib'

as a way of forcing the load path to be what i want. i couldn't get jruby to respect the manifest classpath.

my mapreduce actually completes now (i get the reducer output file), but at the end it complains about the output path existing (which it definitely does not before i run the command). have you run into this?

    Ilyas-MacBook-Pro:word_count ilya$ hadoop jar build/word_count.jar word_count Gemfile /tmp/hadoop3
    12/12/19 18:03:55 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
    12/12/19 18:03:55 INFO mapred.LocalJobRunner:
    12/12/19 18:03:55 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
    12/12/19 18:03:55 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /tmp/hadoop3
    12/12/19 18:03:55 INFO mapred.LocalJobRunner: reduce > reduce
    12/12/19 18:03:55 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
    12/12/19 18:03:56 INFO mapred.JobClient: map 100% reduce 100%
    12/12/19 18:03:56 INFO mapred.JobClient: Job complete: job_local_0001
    12/12/19 18:03:56 INFO mapred.JobClient: Counters: 18
    12/12/19 18:03:56 INFO mapred.JobClient: Rubydoop
    12/12/19 18:03:56 INFO mapred.JobClient: JRuby runtimes created=2
    12/12/19 18:03:56 INFO mapred.JobClient: File Output Format Counters
    12/12/19 18:03:56 INFO mapred.JobClient: Bytes Written=217
    12/12/19 18:03:56 INFO mapred.JobClient: FileSystemCounters
    12/12/19 18:03:56 INFO mapred.JobClient: FILE_BYTES_READ=47005944
    12/12/19 18:03:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=47444829
    12/12/19 18:03:56 INFO mapred.JobClient: File Input Format Counters
    12/12/19 18:03:56 INFO mapred.JobClient: Bytes Read=258
    12/12/19 18:03:56 INFO mapred.JobClient: Map-Reduce Framework
    12/12/19 18:03:56 INFO mapred.JobClient: Map output materialized bytes=342
    12/12/19 18:03:56 INFO mapred.JobClient: Map input records=10
    12/12/19 18:03:56 INFO mapred.JobClient: Reduce shuffle bytes=0
    12/12/19 18:03:56 INFO mapred.JobClient: Spilled Records=50
    12/12/19 18:03:56 INFO mapred.JobClient: Map output bytes=286
    12/12/19 18:03:56 INFO mapred.JobClient: Total committed heap usage (bytes)=507133952
    12/12/19 18:03:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
    12/12/19 18:03:56 INFO mapred.JobClient: Combine input records=0
    12/12/19 18:03:56 INFO mapred.JobClient: Reduce input records=25
    12/12/19 18:03:56 INFO mapred.JobClient: Reduce input groups=20
    12/12/19 18:03:56 INFO mapred.JobClient: Combine output records=0
    12/12/19 18:03:56 INFO mapred.JobClient: Reduce output records=20
    12/12/19 18:03:56 INFO mapred.JobClient: Map output records=25
    12/12/19 18:03:56 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    12/12/19 18:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-ilya/mapred/staging/ilya1409933347/.staging/job_local_0002
    12/12/19 18:03:56 ERROR security.UserGroupInformation: PriviledgedActionException as:ilya cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/hadoop3 already exists
    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/hadoop3 already exists
      at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:949)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
      at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
      at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:31)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
      at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:82)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
on 2012-12-20 08:13
which version of Hadoop are you running? I've targeted 1.0.3. from the error it looks like it's trying to run another job after the first has finished and that it's that job that is complaining. T#
on 2012-12-20 10:39
i was running 1.1.1. via a combination of downgrading to 1.0.3 plus some other changes i've been making, i successfully ran a mapreduce locally using my entire rails app packaged up in a single jar (a few hundred gems, roughly 4000 files).

trying the same jar with elastic mapreduce, however, gives me an error right now:

    Exception in thread "main" rubydoop.RubydoopRunnerException: Could not load job setup script "word_count"
      at rubydoop.RubydoopJobRunner.configureJobs(RubydoopJobRunner.java:63)
      at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:30)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
      at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:82)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
    Caused by: org.jruby.exceptions.RaiseException: (LoadError) no such file to load -- rubygems/path_support

the $LOAD_PATH as reported by my mapper is:

    ["file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/site_ruby/1.9",
     "file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared",
     "file:/home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar!/META-INF/jruby.home/lib/ruby/1.9",
     "/mnt/var/lib/hadoop/tmp/hadoop-unjar1366878146839692659",
     "/mnt/var/lib/hadoop/tmp/hadoop-unjar1366878146839692659/lib"]

some quick googling led me to this: http://techpolesen.blogspot.com/2007/07/jruby-runt..., but replacing

    Ruby runtime = Ruby.newInstance(config);

with

    Ruby runtime = JavaEmbedUtils.initialize(new ArrayList(), config)

in InstanceContainer didn't seem to make any difference. i plan on tackling this next, any pointers would be appreciated.
on 2012-12-20 13:04
JRuby 1.6.x has some bugs in the embedding that I've run into. That particular error will resolve if you upgrade to JRuby 1.7.x, but if you can't you might have to fiddle with the load path a bit (I forget exactly where rubygems/path_support is located). Rubydoop removes the 1.8 stdlib from the load path, since JRuby 1.6.x adds both the 1.8 and 1.9 stdlibs for some reason, even when the runtime is explicitly running in 1.9 mode. T#
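
As an illustration of the load path cleanup described above, here is a minimal sketch. It is not Rubydoop's actual code: `without_18_stdlib` is a hypothetical helper, the jar name is shortened, and the 1.8 entries in the sample load path are assumed, modeled on the 1.9 entries printed earlier in the thread.

```ruby
# Drop every load path entry that points at a 1.8 stdlib or site_ruby directory.
def without_18_stdlib(load_path)
  load_path.reject { |path| path =~ %r{/lib/ruby/(?:site_ruby/)?1\.8(?:/|\z)} }
end

# Sample load path with both stdlib versions present (the 1.8 entries are assumed).
load_path = [
  'file:/jruby-complete.jar!/META-INF/jruby.home/lib/ruby/site_ruby/1.9',
  'file:/jruby-complete.jar!/META-INF/jruby.home/lib/ruby/site_ruby/1.8',
  'file:/jruby-complete.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared',
  'file:/jruby-complete.jar!/META-INF/jruby.home/lib/ruby/1.9',
  'file:/jruby-complete.jar!/META-INF/jruby.home/lib/ruby/1.8'
]

cleaned = without_18_stdlib(load_path)
```

In a running job the same filter could be applied to `$LOAD_PATH` in place with `reject!`, before any stdlib file is required.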
on 2012-12-20 15:43
good catch, i didn't notice hadoop's jruby 1.6 jar had snuck in there. you mentioned that you're running rubydoop on EMR. is there a setting that allows you to use 1.7.x there? i assume you have to change hadoop's classpath before starting up, via something like what's described here: https://forums.aws.amazon.com/thread.jspa?messageID=347690
on 2012-12-20 16:12
oooooh, now I remember. when you run on EMR you need to run this as a bootstrap action:

    #!/bin/bash
    rm /home/hadoop/lib/jruby-complete-no-joda-1.6.5.jar

I've just copy and pasted that so many times now I've forgotten about it. I should write a guide and stick all these things there for posterity.

if you launch your flow using the aws-sdk gem you can add this:

    remove_old_jruby_action = {
      :name => 'remove_old_jruby',
      :script_bootstrap_action => {
        :path => "s3://your-bucket/remove_old_jruby.sh"
      }
    }

and then add that to the :bootstrap_actions option to JobFlowCollection#create. don't forget to upload the script to s3 before you run the flow.

T#
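
For reference, the options hash described above might be assembled like this. Only the `:bootstrap_actions` key and the action itself come from the post; `job_flow_options` is a hypothetical variable name, the other options a job flow needs are deliberately left out, and nothing here calls AWS.

```ruby
# The bootstrap action from the post: run a script stored in S3 on every node.
remove_old_jruby_action = {
  :name => 'remove_old_jruby',
  :script_bootstrap_action => {
    :path => 's3://your-bucket/remove_old_jruby.sh'
  }
}

# Bootstrap actions are passed as a list, so the action goes into an array.
# The remaining job flow options are omitted in this sketch.
job_flow_options = {
  :bootstrap_actions => [remove_old_jruby_action]
}
```

The hash would then be merged with the rest of the job flow configuration before being handed to JobFlowCollection#create.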
on 2012-12-20 16:15
I have no idea why there's a jruby-complete.jar on the default EMR installation, but so far I haven't noticed any bad effects from removing it. T#
on 2012-12-21 23:24
thanks, that helped me make progress. the thing i'm struggling with now is that it seems like EMR does not unjar my jar before running it, which is what my local hadoop does (i.e. everything gets loaded from a /tmp/... directory that my jar gets uncompressed to locally). in EMR, i get stacktraces like:

    org.jruby.exceptions.RaiseException: (ENOENT) No such file or directory - jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!
      at org.jruby.RubyFile.realpath(org/jruby/RubyFile.java:760)
      at RUBY.realpath(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/lib/jruby-complete-1.7.1.jar!/META-INF/jruby.home/lib/ruby/1.9/pathname.rb:446)
      at RUBY.find_root_with_flag(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:637)
      at RUBY.config(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/rails/engine.rb:511)
      at org.jruby.RubyBasicObject.__send__(org/jruby/RubyBasicObject.java:1659)
      at RUBY.config(jar:0)
      at RUBY.Engine(jar:file:/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201212212148_0001/jars/job.jar!/lib/cs-base/engine.rb:11)

from which it looks like EMR hadoop is loading directly from that jar file, which makes any attempts to call Pathname.new().realpath (for example) fail. some gems try to do that (to load their configs, etc). i'm trying to figure out how to get EMR to stage my jar similarly to what my local hadoop installation does.
on 2012-12-24 08:13
okay, after wrestling with rails, jruby and hadoop for a few days, i'm finally able to use my rails app in the mapper class.

the key sticking point turned out to be the fact that rails searches for its configuration by walking up the directory tree using File.dirname. but if running from a jar, jruby's RubyFile implementation refuses to treat the "root" directory as a valid path entry.

the easy solution was to modify rubydoop's package.rb to jar up my rails app code under a /classes directory instead of at the top level. i then needed to add /classes and /classes/lib to the LOAD_PATH. (it has to be /classes due to hadoop's classpath when the job jar is run by the workers.)

i'm attaching my patched-up version of package.rb, which is the only change to rubydoop.rb i needed to make. if it looks interesting, let me know and i can send you a pull request on github.

thanks for all your help.
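
The directory walk described in this post can be sketched as follows. This is a simplified stand-in, not Rails's actual `find_root_with_flag`: the flag file name `config.ru` and the directory names are assumptions, and the demo runs on a temporary directory on a normal file system, not inside a jar.

```ruby
require 'tmpdir'
require 'fileutils'

# Walk up from start_dir with File.dirname until a directory contains the flag.
def find_root_with_flag(start_dir, flag)
  dir = start_dir
  loop do
    return dir if File.exist?(File.join(dir, flag))
    parent = File.dirname(dir)
    return nil if parent == dir # reached the top without finding the flag
    dir = parent
  end
end

found_root = nil
expected_root = nil
missing_root = :not_run

Dir.mktmpdir do |tmp|
  # Mirror the layout from the post: app code lives below a classes/ directory.
  expected_root = File.join(tmp, 'classes')
  nested = File.join(expected_root, 'lib', 'my_engine')
  FileUtils.mkdir_p(nested)
  FileUtils.touch(File.join(expected_root, 'config.ru'))

  found_root = find_root_with_flag(nested, 'config.ru')
  missing_root = find_root_with_flag(nested, 'no_such_flag_file_for_this_sketch')
end
```

With the app under `classes/`, the walk stops one level above the code instead of at the top of the tree, which is the situation the repackaging in the post sets up inside the jar.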
on 2012-12-25 11:41
Thanks for the patch, not near my computer at the moment so I haven’t looked at it, but I'd rather that you submitted a pull request anyway, it's much easier to work with and discuss. T#
on 2012-12-27 20:54
sure