(note: i just pushed 0.0.2 up and the gem mirrors typically take ~ 30
minutes to sync - be sure you get 0.0.2!)
NAME
bj
SYNOPSIS
bj (migration_code|generate_migration|migrate|setup|run|submit|
list|set|config|pid) [options]+
DESCRIPTION
________________________________
Overview
--------------------------------
Backgroundjob (Bj) is a simple to use background priority queue
for rails.
Although not yet tested on windows, the design of bj is such
that operation
should be possible on any operating system, including M$.
Jobs can be submitted to the queue directly using the api or
from the
commandline using the 'bj' script. For example
code:
Bj.submit 'cat /etc/passwd'
cli:
bj submit cat /etc/passwd
When used from inside a rails application bj arranges that
another process
will always be running in the background to process the jobs
that you submit.
By using a separate process to run jobs bj does not impact the
resource
utilization of your rails application at all and enables several
very cool
features:
1) Bj allows you to submit jobs to any of your configured databases and,
in each case, spawns a separate background process to run jobs from that
queue
Bj.in :production do
Bj.submit 'production_job.exe'
end
Bj.in :development do
Bj.submit 'development_job.exe'
end
2) Although bj ensures that a process is always running to process
your jobs, you can start a process manually. This means that any machine
capable of seeing your RAILS_ROOT can run jobs for your application, allowing
one to set up a cluster of machines doing the work of a single front end rails
application.
________________________________
Install
--------------------------------
Bj can be installed two ways: as a gem or as a plugin.
gem:
1) $ sudo gem install bj
2) add "require 'bj'" to config/environment.rb
3) bj setup
plugin:
1) ./script/plugin install http://codeforpeople.rubyforge.org/svn/rails/plugins/bj
2) ./script/bj setup
________________________________
Api
--------------------------------
submit jobs for background processing. 'jobs' can be a string or array of
strings. options are applied to each job in 'jobs', and the list of
submitted jobs is always returned. options (string or symbol) can be

  :rails_env => production|development|key_in_database_yml
    when given, this keyword causes bj to submit jobs to the
    specified database. default is RAILS_ENV.

  :priority => any number, including negative ones. default is zero.

  :tag => a tag added to the job. simply makes searching easier.

  :env => a hash specifying any additional environment vars the background
    process should have.

  :stdin => any stdin the background process should have.
eg:
jobs = Bj.submit 'echo foobar', :tag => 'simple job'
jobs = Bj.submit '/bin/cat', :stdin => 'in the hat', :priority => 42
jobs = Bj.submit './script/runner ./scripts/a.rb', :rails_env => 'production'
jobs = Bj.submit './script/runner /dev/stdin',
                 :stdin => 'p RAILS_ENV',
                 :tag => 'dynamic ruby code'
jobs = Bj.submit array_of_commands, :priority => 451
when jobs are run, they are run in RAILS_ROOT. various attributes are
available *only* once the job has finished. you can check whether or not a
job is finished by using the #finished? method, which simply does a reload
and checks to see if the exit_status is non-nil.
eg:
jobs = Bj.submit list_of_jobs, :tag => 'important'
...
jobs.each do |job|
if job.finished?
p job.exit_status
p job.stdout
p job.stderr
end
end
See lib/bj/api.rb for more details.
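
The finished? check above lends itself to a small polling loop. What follows is only a sketch: wait_for_jobs and FakeJob are hypothetical names that are not part of bj, and FakeJob exists purely so the example runs without bj or a database. The only thing assumed about a real job is that it responds to finished? as described above.

```ruby
# Hypothetical stand-in for a submitted job, so the sketch runs without bj.
# It reports finished after a fixed number of polls.
class FakeJob
  attr_reader :exit_status

  def initialize(polls_until_done)
    @polls_left = polls_until_done
    @exit_status = nil
  end

  def finished?
    @polls_left -= 1
    @exit_status = 0 if @polls_left <= 0
    !@exit_status.nil?
  end
end

# Poll until every job reports finished?, or give up after `timeout` seconds.
# Returns two arrays: the jobs that finished and the jobs still pending.
def wait_for_jobs(jobs, timeout: 60, interval: 1)
  deadline = Time.now + timeout
  pending = jobs.dup
  finished = []
  until pending.empty? || Time.now > deadline
    done, pending = pending.partition { |job| job.finished? }
    finished.concat(done)
    sleep interval unless pending.empty?
  end
  [finished, pending]
end

finished, pending = wait_for_jobs([FakeJob.new(1), FakeJob.new(3)],
                                  timeout: 2, interval: 0.01)
puts "finished: #{finished.size}, pending: #{pending.size}"
```

With real jobs the list returned by Bj.submit would take the place of the FakeJob array, and exit_status, stdout and stderr could be read from each entry of the finished list.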
________________________________
Sponsors
--------------------------------
http://www.engineyard.com/
http://quintess.com/
http://eparklabs.com/
PARAMETERS
--rails_root=rails_root, -R (0 ~> rails_root=/Users/ahoward/rails_root)
the rails_root will be guessed unless you set this
--rails_env=rails_env, -E (0 ~> rails_env=development)
set the rails_env
--log=log, -l (0 ~> log=STDERR)
set the logfile
--help, -h
AUTHOR
ara.t.howard@gmail.com
URIS
http://codeforpeople.com/lib/ruby/
http://rubyforge.org/projects/codeforpeople/
http://codeforpeople.rubyforge.org/svn/rails/plugins/
a @ http://codeforpeople.com/
on 13.12.2007 02:15
on 13.12.2007 16:44
But why this instead of BackgrounDRb?

--
Giles Bowkett
Podcast: http://hollywoodgrit.blogspot.com
Blog: http://gilesbowkett.blogspot.com
Portfolio: http://www.gilesgoatboy.org
Tumblelog: http://giles.tumblr.com
on 13.12.2007 17:54
On Dec 13, 2007, at 8:34 AM, Giles Bowkett wrote:
> But why this instead of BackgrounDRb?

well, backgrounddrb was originally written by ezra on top of my slave lib, and ezra is one of the sponsors of bj, so hopefully he'll chime in with his reasons, but here are mine

1) much better name. gem install bj? require 'bj'? seriously giles...

2) backgrounddrb, afaik, has proven to be a bit tricky for *non-experts* to manage and use in a production environment.

3) backgrounddrb aims to provide a 'rubyish' environment for code to execute in. in other words you call methods on objects, serialize ruby objects over the wire, etc. this makes entire classes of problems easier to reason about, but it also comes with a price and that price is complexity. for example, most (all?) people have to think about methods like this when using drb

  remote_object.each do |thang|
    thang.intense_computation
  end

now, on which cpu does 'intense' run? in which process? the answer is that it entirely depends on how the objects were set up and how DRbUndumped may or may not have been used. as the maintainer of slave.rb i can tell you that the list of people who understand this is eric hodel and, um, eric hodel. the point is that drb is not an rpc mechanism but a toolset for building servants. using drb every process is potentially either a client or a server and generally both. it's the block passing mechanism that gets people into trouble - blocks cannot go across the wire so drb does some magic to make them work.

the other issue with having a 'rubyish' environment to execute code in, in the case of using backgrounddrb with a rails app, is that rails' ruby code tends to do all sorts of nasty things like leak memory like a row boat full of hair trigger shotguns.

4) bj, on the other hand, simply provides a way to fire and forget system calls. these system calls just may happen to use ./script/runner to run some code from within your rails environment, but that's up to you. it may even contact a long running daemon like backgrounddrb to avoid loading your rails app over and over, but again that's up to you. bj does *not* load your rails app or make that code available in any way. all it does is connect to the db and run jobs from a queue - which is another big difference: bj is a priority queue, you can submit 100,000 jobs and forget about it, they will run serially in the background until they are complete.

another result of the design is that you can easily fire up runners on other hosts using bj - thereby creating a *cluster* of machines that run jobs on behalf of your front end(s) rails application. and, of course, it's easy for development to submit jobs into a production queue and vice versa.

the last major difference is that bj is queuing jobs in the database whereas backgrounddrb is dealing with memory/context/closures - if you have backgrounded 100k credit card sales and your application crashes you can probably guess where having the jobs live would be best ;-) with bj the act of submitting a job is a db transaction that's submitted a job which can run on its own two feet so you *know* once submission is complete that, no matter what happens next, that job is recoverable - at least to the extent your database/fs are.

backgrounddrb, bj, and spawn (http://wiki.rubyonrails.org/rails/pages/Tom+Anderson) all serve totally different purposes. i think bj provides the lowest barrier of entry into doing background rails processing and, in cases where the user requires a rails_env and needs to wrap the methodology in a ./script/runner capable script, it makes up for making the user do a little work by promising that the application will not start leaking memory or having network issues in production once the script is working from the commandline.

i have not looked at the backgrounddrb code for some time - since the dependency on slave.rb was removed - so i'm positive i've made a few errors in the above explanation - but i'm sure ezra can correct any serious mistakes i've made.

kind regards.

a @ http://codeforpeople.com/
on 13.12.2007 18:34
Hi,

On Fri, 2007-12-14 at 01:53 +0900, ara.t.howard wrote:
> 2) backgrounddrb, afaik, is has proven to be a bit tricky for *non-
> thang.intense_computation
> - blocks cannot go across the wire to drb does some magic to make
> again that's up to you. bj does *not* load your rails app or make
> memory/context/closures - if you have backgrounded 100k credit card
> provides the lowest barrier of entry into doing background rails

I am maintaining backgroundrb these days. And things have changed. It's written on top of an event driven networking lib (packet) that I wrote. There are no threads anywhere. Everything is event driven; it still has real processes, but those have a reactor loop of their own. I wrote a custom protocol for internal communication between workers and it works reasonably well.

I agree that bj and spawn all have different purposes. If you find time, please look into the code base and point out any problems that you find.

--
Let them talk of their oriental summer climes of everlasting conservatories; give me the privilege of making my own summer with my own coals.
http://gnufied.org
on 13.12.2007 18:37
ara.t.howard wrote:
<snip>
> fire up runners on other hosts using bj - thereby creating a *cluster*
> database/fs are.

My word. I think you've just saved me a ton of work. Yet again.

A quick question, though. How difficult is it to set up parallel job queues, so that a cluster node can pick up jobs from one queue, process them, and submit them to the next in a chain? Take a search engine's spider as an example - from 20,000 feet you've got a job that fetches a page, a job to parse the contents, followed by a third to index the parsed structure. Chances are that you want different types of cluster node to work on each type of job, and there's different data that you might want to attach at each stage. Is that easy to set up?
on 13.12.2007 18:44
On Dec 13, 2007 11:33 AM, hemant kumar <gethemant@gmail.com> wrote:
>
> I agree that, bj , spawn they all have different purpose. If you find
> time, please look into code base and suggest any problems that you find.

A comparison between them would be enlightening. I knew about backgrounDRb only. I'm prototyping a payroll system and I need to trigger long-running processes, both reports and processes that modify the state of the database.
on 13.12.2007 21:10
On Dec 13, 2007, at 10:33 AM, hemant kumar wrote:
>
> I agree that, bj , spawn they all have different purpose. If you find
> time, please look into code base and suggest any problems that you
> find.

hey hemant - didn't know you'd taken that over.

so it seems everyone is in agreement here and, to be entirely clear, i'm not suggesting there is anything wrong with either spawn or backgrounddrb. i thought i'd take a quick stab at use cases for each and hopefully you can clarify so people can understand

1) spawn

this just forks your rails app and, as such, is the simplest way to get a background process that has context, etc. it's not going to easily allow, say, 1000 incoming requests because you are duping your entire rails process on the fork, so you have to be careful with this and about collecting the children so as not to create zombies.

2) backgrounddrb

this is going to be good where you have a medium number of background processes, esp if you want to interact with them. for example, a task spawned with an ajax request, like a video conversion, could then be polled using periodically_call_remote to display progress to the user. these processes are going to be living in memory and, unless you code it, there is no concept of queueing.

3) bj

this is for fire and forget standalone processes that may or may not load the rails_env. examples would include adding users based on an uploaded csv file and emailing each one, or updating 100 rss feeds in the background. the queueing effect is going to save your butt if many requests come in at once, and tasks are going to be durable across application restart and even reboot (since they live in the db). bj is good where you want to be able to track which jobs succeeded or failed, possibly by an external sweeper process, taking appropriate actions as needed. bj tasks are going to be slower to start if you require the rails_env, since you'll need to load the rails app for each process, but the memory is going to be freed on each transient task so leaks are of no concern.

maybe others can add to this stab at a summary...

cheers.

a @ http://codeforpeople.com/
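
Since bj jobs are plain command strings, the feed-updating case above comes down to building one command per feed. A minimal sketch, assuming a hypothetical ./scripts/update_feed script (not something bj ships) and using Ruby's standard Shellwords library so that characters like ? and & in a url survive the trip through the shell:

```ruby
require 'shellwords'

# Hypothetical script path, used only for illustration.
UPDATE_SCRIPT = './scripts/update_feed'

# Build one shell-safe command string per feed url.
def feed_commands(urls)
  urls.map { |url| "#{UPDATE_SCRIPT} #{Shellwords.escape(url)}" }
end

commands = feed_commands(['http://example.com/feed?a=1&b=2'])
puts commands

# With bj loaded, the whole array could then be handed over in one call,
# using the array form of submit described in the Api section:
#   Bj.submit commands, :tag => 'rss'
```

Escaping matters here because the url is data, not shell syntax; an unescaped & would otherwise split the command in two.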
on 13.12.2007 21:54
On Dec 13, 2007, at 10:36 AM, Alex Young wrote:
> stage. Is that easy to set up?

not exactly but this would be quite close:

1) i'd forget about having specialized nodes unless you have a very good reason - the death of one node will halt the entire processing chain otherwise. it's nice if nodes are dumb from the perspective of robustness. that said i'll add a feature where you can say

  Bj.submit 'job.exe', :runner => 'some.hostname'

to specify which host to run on. this'll be two lines of code so i don't mind adding it.

2) bj supports priorities so here is what i would do. say you've got a three stage job: a, b, c and 1000 initial 'a' tasks. furthermore let's say you make a ./scripts/ directory in your rails_root (bj runs all jobs from the rails_root). so you'll have something like

  ./scripts/task_a
  ./scripts/task_b
  ./scripts/task_c

then you'd do something like this in your rails app

  jobs = inputs.map{|input| "./scripts/task_a #{ input }"}
  Bj.submit jobs, :priority => 10

now task_a is going to do this

  #! /usr/bin/env ruby
  input = ARGV.shift
  output = process_for_task_a input
  system "./script/bj submit ./scripts/task_b #{ output } --priority=20"

(of course, if your processing needs to be run through ./script/runner you'll just be able to use the api directly instead of the cli... i'll be adding a feature shortly to allow for running ruby code through script runner directly)

task_b, for its part, runs and submits task_c at priority=30.

so think about that for a minute and imagine you have three processing nodes - each will consume a task_a, run it, and then submit a priority=20 job. therefore each node will probably then get one of those higher priority jobs, run that, and then find the priority=30 task_c job in the queue. when those are done there will be nothing left except priority=10 task_a jobs and another batch will start.

so this will give you parallel processing of a host of tasks.

make sense?

a @ http://codeforpeople.com/
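
The ordering described above can be checked with a toy, in-memory model. This is only a sketch and not bj's implementation: ToyQueue, NEXT_STAGE and drain are hypothetical names, there is a single worker instead of three nodes so the output is deterministic, and it assumes, as the explanation implies, that a larger priority number runs first.

```ruby
# Toy queue: highest priority first, submission order within a priority.
class ToyQueue
  Job = Struct.new(:command, :priority, :seq)

  def initialize
    @jobs = []
    @seq = 0
  end

  def submit(command, priority: 0)
    @jobs << Job.new(command, priority, @seq += 1)
  end

  # Remove and return the job that should run next.
  def pop
    job = @jobs.min_by { |j| [-j.priority, j.seq] }
    @jobs.delete(job)
    job
  end

  def empty?
    @jobs.empty?
  end
end

# Each stage submits the following stage at a higher priority.
NEXT_STAGE = { 'task_a' => ['task_b', 20], 'task_b' => ['task_c', 30] }

# Run jobs one at a time, the way a single runner would, and record the order.
def drain(queue)
  order = []
  until queue.empty?
    job = queue.pop
    order << job.command
    stage, input = job.command.split(' ', 2)
    following, priority = NEXT_STAGE[stage]
    queue.submit("#{following} #{input}", priority: priority) if following
  end
  order
end

queue = ToyQueue.new
%w[x y].each { |input| queue.submit("task_a #{input}", priority: 10) }
puts drain(queue)
```

Each input is carried through task_a, task_b and task_c before the next task_a is picked up, which is the batching effect the post describes.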
on 13.12.2007 23:13
ara.t.howard wrote:
>> index the parsed structure. Chances are that you want different types
> otherwise.

I should have been a little clearer - I'm not thinking of one node per task, it's one *class* of nodes per task. I might want 21 processes across 3 machines (for example) all working on the first stage in the chain.

> it's nice if nodes are dumb from the perspective of
> robustness. that said i'll add a feature where you can say
>
> Bj.submit 'job.exe', :runner => 'some.hostname'
>
> to specify which host to run on. this'll be two lines of code so i
> don't mind adding it.

I can see how that'd be a handy thing to have anyway :-)

> 2) bj supports priorities so here is what i would do. say you've got a
> three stage job: a, b, c and 1000 initial 'a' tasks. furthermore let's
> say you make a ./scripts/ directory in your rails_root (bj runs all jobs
> from the rails_root). so you'll have something like
>
> ./scripts/task_a
> ./scripts/task_b
> ./scripts/task_c
>
> then you'd do something like this in your rails app

(I'm not using Rails for the app I'm thinking of using this in, but that's not important)

> (of course, if your processing needs to be run through ./script/runner
> task_c job in the queue. when those are done there will nothing left
> except priority=10 task_a jobs and another batch will start.
>
>
> so this will give you parallel processing of a host of tasks.
>
> make sense?

It does, but it's not *quite* what I'm after. I've got a few other requirements that this strains against - the most pertinent being that I'd like to be able to use priority independently within each task queue. I've got a bit of spare time coming up in the next couple of weeks (Holiday! What a concept! :-), so I'll try hacking something together based on your code.
on 13.12.2007 23:44
On Dec 13, 2007, at 2:32 PM, Alex Young wrote: > (I'm not using Rails for the app I'm thinking of using this in, but > that's not important) maybe just use rq then? a @ http://codeforpeople.com/
on 14.12.2007 00:22
ara.t.howard wrote: > > On Dec 13, 2007, at 2:32 PM, Alex Young wrote: > >> (I'm not using Rails for the app I'm thinking of using this in, but >> that's not important) > > maybe just use rq then? Maybe. It's another option to check out :-)
on 07.07.2008 21:06
I'm trying out bj in a rails environment running 6 mongrel instances. I'm seeing 6 bj processes running, one for each mongrel. The pid of the mongrel is included in the running bj command: --ppid=99999.

I've tried to start a bj process by hand. Following the docs I did:

  ruby script/bj run --forever --rails_env=qa --rails_root=/mnt/app/current

I restarted my mongrel servers. They just ignore the running bj process and start their own. I know this isn't by design as you state that there will be only one bj process for each machine. Is there some configuration I'm missing?

Thanks.
on 07.07.2008 22:56
On Jul 7, 2008, at 1:02 PM, Colin Shield wrote:
> I know this isn't by design as you state that there will be only one
> bj process for each machine.

it's a documentation flaw - sorry. if you want to run only one instance run it by hand via cron as the docs show - this is vastly easier to monitor and allows the background process to run on a different host to boot.

cheers.

a @ http://codeforpeople.com/
on 09.07.2008 20:05
ara.t.howard wrote:
> On Jul 7, 2008, at 1:02 PM, Colin Shield wrote:
> if you want to run only one
> instance run it by hand via cron as the docs show

Thanks for the reply. I've got it all working nicely in production now.

We use monit in production. The config is under source control so we can deploy changes with cap. I wrote some monit config to run a bj process. This may help somebody.

  ##### BJ ####
  # BJ is a ruby script that manages background processes.
  check process bj with pidfile /mnt/app/shared/pids/bj.pid
    start program = "/bin/bash /mnt/app/current/script/bj.sh start production &"
    stop program = "/bin/bash /mnt/app/current/script/bj.sh stop"

I wrote a bash wrapper (bj.sh) as monit uses pidfiles.

  #!/bin/bash
  case $1 in
    start)
      echo $$ > /mnt/app/shared/pids/bj.pid;
      exec 2>&1 /usr/bin/ruby1.8 /mnt/app/current/script/bj run --forever \
        --redirect=/mnt/app/shared/log/bj.log --rails_env=$2 \
        --rails_root=/mnt/app/current 1>/mnt/app/shared/log/bj.log
      ;;
    stop)
      kill `cat /mnt/app/shared/pids/bj.pid`;
      rm /mnt/app/shared/pids/bj.pid
      ;;
    *)
      echo "usage: bj.sh {start <stage>|stop}"
      ;;
  esac
on 29.07.2008 19:15
ara.t.howard wrote:
> On Jul 7, 2008, at 1:02 PM, Colin Shield wrote:
>
>> I know this isn't by design as you state that there will be only one
>> bj process for each machine.
>
> it's a documentation flaw - sorry. if you want to run only one
> instance run it by hand via cron as the docs show - this is vastly
> easier to monitor and allows the background process to run on a
> different host to boot.

Ara,

I'd like to suggest mentioning this in the README. Everything I've read aside from this post (which I was lucky to find) makes it seem as though only one instance of Bj should ever be running at a time. This caused me quite a lot of frustration trying to track down the problem - only to find that there wasn't one!

From /bin/bj:

>> Bj ensures that only one background process is running for your application -
>> firing up three mongrels or fcgi processes will result in only one background
>> runner being started. Note that the number of background runners does not
>> determine throughput - that is determined primarily by the nature of the jobs
>> themselves and how much work they perform per process.

Thanks,
- Trevor
on 29.07.2008 19:40
On Jul 9, 1:01 pm, Colin Shield <colin_shi...@hotmail.com> wrote:
> kill `cat /mnt/app/shared/pids/bj.pid`; rm /mnt/app/shared/pids/bj.pid
I'm trying to get bj set up with Monit as well, but I can't seem to
find the pid file in the usual places (including app/shared/pids/
bj.pid). Is there any way I can specify a location for the pid file?
Ideally I'd want to stick it outside the virtual file system which is
shared across slices.
Thanks,
Scott