Bj-0.0.2


#1

(note: i just pushed 0.0.2 up and the gem mirrors typically take ~ 30
minutes to sync - be sure you get 0.0.2!)

NAME
bj

SYNOPSIS
bj (migration_code|generate_migration|migrate|setup|run|submit|
list|set|config|pid) [options]+

DESCRIPTION


Overview

 Backgroundjob (Bj) is a simple to use background priority queue for rails.
 Although not yet tested on windows, the design of bj is such that operation
 should be possible on any operating system, including M$.

 Jobs can be submitted to the queue directly using the api or from the
 commandline using the 'bj' script. For example

   code:
     Bj.submit 'cat /etc/passwd'

   cli:
     bj submit cat /etc/passwd

 When used from inside a rails application bj arranges that another process
 will always be running in the background to process the jobs that you submit.
 By using a separate process to run jobs bj does not impact the resource
 utilization of your rails application at all and enables several very cool
 features:

   1) Bj allows you to submit jobs to any of your configured databases and,
   in each case, spawns a separate background process to run jobs from that
   queue

     Bj.in :production do
       Bj.submit 'production_job.exe'
     end

     Bj.in :development do
       Bj.submit 'development_job.exe'
     end

   2) Although bj ensures that a process is always running to process
   your jobs, you can start a process manually. This means that any machine
   capable of seeing your RAILS_ROOT can run jobs for your application, allowing
   one to set up a cluster of machines doing the work of a single front end rails
   application.


Install

 Bj can be installed two ways: as a gem or as a plugin.

   gem:
     1) $ sudo gem install bj
     2) add "require 'bj'" to config/environment.rb
     3) bj setup

   plugin:
     1) ./script/plugin install http://codeforpeople.rubyforge.org/svn/rails/plugins/bj
     2) ./script/bj setup


Api

 submit jobs for background processing.  'jobs' can be a string or array of
 strings. options are applied to each job in the 'jobs', and the list of
 submitted jobs is always returned. options (string or symbol) can be

   :rails_env => production|development|key_in_database_yml
                 when given this keyword causes bj to submit jobs to the
                 specified database. default is RAILS_ENV.

   :priority => any number, including negative ones.  default is zero.

   :tag => a tag added to the job.  simply makes searching easier.

   :env => a hash specifying any additional environment vars the background
           process should have.

   :stdin => any stdin the background process should have.

 eg:

   jobs = Bj.submit 'echo foobar', :tag => 'simple job'

   jobs = Bj.submit '/bin/cat', :stdin => 'in the hat', :priority => 42

   jobs = Bj.submit './script/runner ./scripts/a.rb', :rails_env => 'production'

   jobs = Bj.submit './script/runner /dev/stdin',
                    :stdin => 'p RAILS_ENV',
                    :tag => 'dynamic ruby code'

   jobs = Bj.submit array_of_commands, :priority => 451

 when jobs are run, they are run in RAILS_ROOT. various attributes are
 available only once the job has finished. you can check whether or not a
 job is finished by using the #finished? method, which simply does a reload and
 checks to see if the exit_status is non-nil.

 eg:

   jobs = Bj.submit list_of_jobs, :tag => 'important'
   ...

   jobs.each do |job|
     if job.finished?
       p job.exit_status
       p job.stdout
       p job.stderr
     end
   end

See lib/bj/api.rb for more details.
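
The polling pattern described above can be sketched as follows. This is only an illustration: wait_for and FakeJob are hypothetical names that are not part of bj, and the sketch assumes nothing about a job beyond the #finished? and exit_status methods mentioned in the text. FakeJob stands in for a real job so the example runs without rails.

```ruby
# Poll until every job reports finished?, or give up after `timeout` seconds.
# `jobs` only needs to respond to #finished? (hypothetical helper, not bj api).
def wait_for(jobs, timeout: 60, interval: 1)
  deadline = Time.now + timeout
  until jobs.all?(&:finished?)
    return false if Time.now >= deadline
    sleep interval
  end
  true
end

# Stand-in for a real job: it reports finished after a fixed number of polls.
class FakeJob
  attr_reader :exit_status

  def initialize(polls_needed)
    @polls_left = polls_needed
    @exit_status = nil
  end

  def finished?
    @polls_left -= 1
    @exit_status = 0 if @polls_left <= 0
    !@exit_status.nil?
  end
end

jobs = [FakeJob.new(1), FakeJob.new(3)]
done = wait_for(jobs, timeout: 5, interval: 0.01)
puts "all finished: #{done}"
```

With real jobs the same helper would presumably be called on the list returned by Bj.submit, keeping in mind that each #finished? call reloads the job from the database, so the polling interval should not be too small.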


Sponsors

 http://www.engineyard.com/
 http://quintess.com/
 http://eparklabs.com/

PARAMETERS
--rails_root=rails_root, -R (0 ~> rails_root=/Users/ahoward/rails_root)
the rails_root will be guessed unless you set this
--rails_env=rails_env, -E (0 ~> rails_env=development)
set the rails_env
--log=log, -l (0 ~> log=STDERR)
set the logfile
--help, -h

AUTHOR
removed_email_address@domain.invalid

URIS
http://codeforpeople.com/lib/ruby/
http://rubyforge.org/projects/codeforpeople/
http://codeforpeople.rubyforge.org/svn/rails/plugins/

a @ http://codeforpeople.com/


#2

But why this instead of BackgrounDRb?

On 12/12/07, ara.t.howard removed_email_address@domain.invalid wrote:

share your knowledge. it’s a way to achieve immortality.
h.h. the 14th dalai lama


Giles B.

Podcast: http://hollywoodgrit.blogspot.com
Blog: http://gilesbowkett.blogspot.com
Portfolio: http://www.gilesgoatboy.org
Tumblelog: http://giles.tumblr.com


#3

On Dec 13, 2007, at 8:34 AM, Giles B. wrote:

But why this instead of BackgrounDRb?

well, backgrounddrb was originally written by ezra on top of my slave
lib, and ezra is one of the sponsors of bj so hopefully he’ll chime
in with his reasons, but here are mine

  1. much better name. gem install bj? require ‘bj’? seriously giles…

  2. backgrounddrb, afaik, has proven to be a bit tricky for non-experts
    to manage and use in a production environment.

  3. backgrounddrb aims to provide a ‘rubyish’ environment for code to
    execute in. in other words you call methods on objects, serialize
    ruby objects over the wire, etc. this makes entire classes of
    problems easier to reason about, but it also comes with a price and
    that price is complexity. for example, most (all?) people have to
    think about methods like this when using drb

    remote_object.each do |thang|
      thang.intense_computation
    end

    now, on which cpu does ‘intense’ run? in which process? the answer
    is that it entirely depends on how the objects were set up and how
    DRbUndumped may or may not have been used. as the maintainer of
    slave.rb i can tell you that the list of people who understand this
    is eric hodel and, um, eric hodel. the point is that drb is not an
    rpc mechanism but a toolset for building servants. using drb every
    process is potentially either a client or a server and generally
    both. it’s the block passing mechanism that gets people into trouble
    - blocks cannot go across the wire so drb does some magic to make
    them work. the other issue with having a ‘rubyish’ environment to
    execute code in, in the case of using backgrounddrb with a rails app,
    is that rails’ ruby code tends to do all sorts of nasty things like
    leak memory like a row boat full of hair trigger shotguns.

  4. bj, on the other hand, simply provides a way to fire and forget
    system calls. these system calls just may happen to use
    ./script/runner to run some code from within your rails environment, but
    that’s up to you. it may even contact a long running daemon like
    backgrounddrb to avoid loading your rails app over and over, but
    again that’s up to you. bj does not load your rails app or make
    that code available in any way. all it does is connect to the db and
    run jobs from a queue - which is another big difference: bj is a
    priority queue, you can submit 100,000 jobs and forget about it, they
    will run serially in the background until they are complete. another
    result of the design is that you can easily fire up runners on other
    hosts using bj - thereby creating a cluster of machines that run
    jobs on behalf of your front end rails application(s). and, of
    course, it’s easy for development to submit jobs into a production
    queue and vice versa. the last major difference is that bj is
    queuing jobs in the database whereas backgrounddrb is dealing with
    memory/context/closures - if you have backgrounded 100k credit card
    sales and your application crashes you can probably guess where
    having the jobs live would be best ;-) with bj the act of submitting
    a job is a db transaction that submits a job which can stand on
    its own two feet so you know once submission is complete that, no
    matter what happens next, that job is recoverable - at least to the
    extent your database/fs are.
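
to make the "priority queue of system calls" idea concrete, here is a rough sketch of such a runner. it is not bj's implementation: TinyQueue and Job are hypothetical names, an in-memory array stands in for the database table, and it assumes (as the pipeline example later in this thread suggests) that a larger priority number runs first. commands are run one at a time with Open3.capture3 from ruby's standard library.

```ruby
require 'open3'

# In-memory stand-in for the jobs table; bj itself keeps jobs in the database.
Job = Struct.new(:id, :command, :priority, :stdin, :stdout, :stderr, :exit_status)

class TinyQueue
  attr_reader :done

  def initialize
    @jobs = []
    @done = []
  end

  def submit(command, priority: 0, stdin: nil)
    job = Job.new(@jobs.size + 1, command, priority, stdin)
    @jobs << job
    job
  end

  # Highest priority number first (assumption); ties go to the oldest job.
  def next_job
    @jobs.select { |j| j.exit_status.nil? }
         .min_by { |j| [-j.priority, j.id] }
  end

  # Run pending jobs serially until none are left, recording their results.
  def run
    while (job = next_job)
      out, err, status = Open3.capture3(job.command, stdin_data: job.stdin.to_s)
      job.stdout = out
      job.stderr = err
      job.exit_status = status.exitstatus || -1 # -1 if killed by a signal
      @done << job
    end
  end
end

queue = TinyQueue.new
low  = queue.submit('echo low')
high = queue.submit('echo high', priority: 10)
cat  = queue.submit('cat', stdin: 'in the hat', priority: 5)
queue.run
puts queue.done.map(&:command).inspect
```

the durability argument above is exactly what this sketch lacks: because its queue lives in memory, a crash loses every pending job, whereas a queue stored in the database survives it.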

backgrounddrb, bj, and spawn (http://wiki.rubyonrails.org/rails/pages/Tom+Anderson)
all serve totally different purposes. i think bj
provides the lowest barrier of entry into doing background rails
processing and, in cases where the user requires a rails_env and
needs to wrap the methodology in a ./script/runner capable script,
makes up for making the user do a little work by promising that the
application will not start leaking memory or having network issues in
production once the script is working from the commandline.

i have not looked at the backgrounddrb code for some time - since the
dependency on slave.rb was removed - so i’m positive i’ve made a few
errors in the above explanation - but i’m sure ezra can correct any
serious mistakes i’ve made.

kind regards.

a @ http://codeforpeople.com/


#4

ara.t.howard wrote:

fire up runners on other hosts using bj - thereby creating a cluster
[...]
database/fs are.

My word. I think you’ve just saved me a ton of work. Yet again.

A quick question, though. How difficult is it to set up parallel job
queues, so that a cluster node can pick up jobs from one queue, process
them, and submit them to the next in a chain? Take a search engine’s
spider as an example - from 20,000 feet you’ve got a job that fetches a
page, a job to parse the contents, followed by a third to index the
parsed structure. Chances are that you want different types of cluster
node to work on each type of job, and there’s different data that you
might want to attach at each stage. Is that easy to set up?


#5

Hi,

On Fri, 2007-12-14 at 01:53 +0900, ara.t.howard wrote:

[...]

I am maintaining backgroundrb these days, and things have
changed. It’s written on top of an event-driven networking lib (packet)
that I wrote.

There are no threads anywhere. Everything is event driven. It still has
real processes, but those have a reactor loop of their own. I wrote a
custom protocol for internal communication between workers and it works
reasonably well.

I agree that backgroundrb, bj and spawn all have different purposes. If you find
time, please look into the code base and report any problems that you find.


Let them talk of their oriental summer climes of everlasting
conservatories; give me the privilege of making my own summer with my
own coals.

http://gnufied.org


#6

On Dec 13, 2007 11:33 AM, hemant kumar removed_email_address@domain.invalid wrote:

I agree that, bj , spawn they all have different purpose. If you find
time, please look into code base and suggest any problems that you find.

A comparison between them would be enlightening. I only knew about
backgrounDRb. I’m prototyping a payroll system and I need to
trigger long-running processes, both reports and processes that modify
the state of the database.


#7

On Dec 13, 2007, at 10:33 AM, hemant kumar wrote:

I agree that, bj , spawn they all have different purpose. If you find
time, please look into code base and suggest any problems that you
find.

hey hemant - didn’t know you’d taken that over. so it seems everyone
is in agreement here and to be entirely clear, i’m not suggesting
there is anything wrong with either spawn or backgrounddrb. i
thought i’d take a quick stab at use cases for each and hopefully you
can clarify so people can understand

  1. spawn

this just forks your rails app and, as such, is the simplest way to
get a background process that has context, etc. it’s not going to
easily allow say, 1000 incoming requests because you are duping your
entire rails process on the fork so you have to be careful with this
and about collecting the children so as not to create zombies.

  2. backgrounddrb

this is going to be good where you have a medium number of background
processes, esp if you want to interact with them. for example,
a task spawned by an ajax request, like a video conversion, could
then be polled using periodically_call_remote to display progress to
the user. these processes are going to be living in memory and,
unless you code it, there is no concept of queueing.

  3. bj

this is for fire and forget standalone processes that may or may not
load the rails_env. examples would include adding users based on an
uploaded csv file and emailing each one or updating 100 rss feeds in
the background. the queueing effect is going to save your butt if
many requests come in at once and tasks are going to be durable across
application restart and even reboot (since they live in the db). bj
is good where you want to be able to track which jobs succeeded or
failed, possibly by an external sweeper process, taking appropriate
actions as needed. bj tasks are going to be slower to start if you
require the rails_env, since you’ll need to load the rails app for
each process but the memory is going to be freed on each transient
task so leaks are of no concern.
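
the point about collecting children in the spawn case can be illustrated with plain ruby. this is a sketch under assumptions, not the spawn plugin's code: in_background is a hypothetical helper, and it relies on fork being available (so not windows). Process.detach starts a thread that waits on the child, which is what keeps it from lingering as a zombie.

```ruby
# Run a block in a forked child and make sure the child gets reaped.
# Returns the read end of a pipe carrying the block's result and the
# waiter thread created by Process.detach.
def in_background
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield.to_s)
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end
  writer.close
  waiter = Process.detach(pid) # reaps the child when it exits
  [reader, waiter]
end

reader, waiter = in_background { 6 * 7 }
result = reader.read
status = waiter.value # a Process::Status once the child has exited
puts "child said #{result}, exit status #{status.exitstatus}"
```

note that the child here is a full copy of the parent process, which is the memory cost mentioned for spawn above.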

maybe others can add to this stab at a summary…

cheers.

a @ http://codeforpeople.com/


#8

ara.t.howard wrote:

index the parsed structure. Chances are that you want different types
[...]
otherwise.

I should have been a little clearer - I’m not thinking of one node per
task, it’s one class of nodes per task. I might want 21 processes
across 3 machines (for example) all working on the first stage in the
chain.

it’s nice if nodes are dumb from the perspective of
robustness. that said i’ll add a feature where you can say

Bj.submit 'job.exe', :runner => 'some.hostname'

to specify which host to run on. this’ll be two lines of code so i
don’t mind adding it.

I can see how that’d be a handy thing to have anyway :-)

  2. bj supports priorities so here is what i would do. say you’ve got a
    three stage job: a, b, c and 1000 initial ‘a’ tasks. furthermore let’s
    say you make a ./scripts/ directory in your rails_root (bj runs all jobs
    from the rails_root). so you’ll have something like

    ./scripts/task_a
    ./scripts/task_b
    ./scripts/task_c

then you’d do something like this in your rails app

(I’m not using Rails for the app I’m thinking of using this in, but
that’s not important)

(of course, if your processing needs to be run through ./script/runner
[...]
task_c job in the queue. when those are done there will be nothing left
except priority=10 task_a jobs and another batch will start.

so this will give you parallel processing of a host of tasks.

make sense?

It does, but it’s not quite what I’m after. I’ve got a few other
requirements that this strains against - the most pertinent being that
I’d like to be able to use priority independently within each task
queue. I’ve got a bit of spare time coming up in the next couple of
weeks (Holiday! What a concept! :-), so I’ll try hacking something
together based on your code.


#9

On Dec 13, 2007, at 10:36 AM, Alex Y. wrote:

stage. Is that easy to set up?

not exactly but this would be quite close:

  1. i’d forget about having specialized nodes unless you have a very
    good reason - the death of one node will halt the entire processing
    chain otherwise. it’s nice if nodes are dumb from the perspective of
    robustness. that said i’ll add a feature where you can say

    Bj.submit 'job.exe', :runner => 'some.hostname'

to specify which host to run on. this’ll be two lines of code so i
don’t mind adding it.

  2. bj supports priorities so here is what i would do. say you’ve got
    a three stage job: a, b, c and 1000 initial ‘a’ tasks. furthermore
    let’s say you make a ./scripts/ directory in your rails_root (bj runs
    all jobs from the rails_root). so you’ll have something like

    ./scripts/task_a
    ./scripts/task_b
    ./scripts/task_c

then you’d do something like this in your rails app

jobs = inputs.map{|input| "./scripts/task_a #{ input }"}
Bj.submit jobs, :priority => 10

now task_a is going to do this

#! /usr/bin/env ruby
input = ARGV.shift
output = process_for_task_a input
system "./script/bj submit ./scripts/task_b #{ output } --priority=20"

(of course, if your processing needs to be run through ./script/runner
you’ll just be able to use the api directly instead of the
cli… i’ll be adding a feature shortly to allow for running ruby
code through script runner directly)

task_b, for its part, runs and submits task_c at priority=30.

so think about that for a minute and imagine you have three processing
nodes - each will consume a task_a, run it, and then submit a
priority=20 job. therefore each node will probably then get one of
those higher priority jobs, run that, and then find the priority=30
task_c job in the queue. when those are done there will be nothing left
except priority=10 task_a jobs and another batch will start.

so this will give you parallel processing of a host of tasks.
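
the batching behaviour described above can be checked with a small simulation. this is a sketch only: it does not use bj, the Job struct and simulate method are hypothetical, and it idealizes the nodes as working in lock step (every node takes one job per round, and follow-up jobs become visible at the end of the round). it assumes, as above, that larger priority numbers are picked first.

```ruby
Job = Struct.new(:priority, :seq, :stage, :input)

# stage => [follow-up stage, priority of the follow-up job]
NEXT_STAGE = { a: [:b, 20], b: [:c, 30] }.freeze

def simulate(inputs, workers: 3)
  seq = 0
  queue = inputs.map { |input| Job.new(10, seq += 1, :a, input) }
  rounds = []
  until queue.empty?
    # each worker takes the highest priority job currently queued
    batch = queue.sort_by { |j| [-j.priority, j.seq] }.first(workers)
    batch.each { |j| queue.delete(j) }
    rounds << batch.map { |j| "#{j.stage}:#{j.input}" }
    # every finished job submits its follow-up stage, if it has one
    batch.each do |j|
      stage, priority = NEXT_STAGE[j.stage]
      queue << Job.new(priority, seq += 1, stage, j.input) if stage
    end
  end
  rounds
end

rounds = simulate(%w[x1 x2 x3 x4 x5 x6])
rounds.each_with_index { |r, i| puts "round #{i + 1}: #{r.join(' ')}" }
```

with six inputs and three nodes the first three inputs pass through stages a, b and c before any node picks up the fourth input, which is the batch effect described above.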

make sense?

a @ http://codeforpeople.com/


#10

ara.t.howard wrote:

On Dec 13, 2007, at 2:32 PM, Alex Y. wrote:

(I’m not using Rails for the app I’m thinking of using this in, but
that’s not important)

maybe just use rq then?

Maybe. It’s another option to check out :-)


#11

On Dec 13, 2007, at 2:32 PM, Alex Y. wrote:

(I’m not using Rails for the app I’m thinking of using this in, but
that’s not important)

maybe just use rq then?

a @ http://codeforpeople.com/


#12

On Jul 7, 2008, at 1:02 PM, Colin S. wrote:

I know this isn’t by design as you state that there will be only one
bj process for each machine.

it’s a documentation flaw - sorry. if you want to run only one
instance run it by hand via cron as the docs show - this is vastly
easier to monitor and allows the background process to run on a
different host to boot.

cheers.

a @ http://codeforpeople.com/


#13

ara.t.howard wrote:

On Jul 7, 2008, at 1:02 PM, Colin S. wrote:
if you want to run only one
instance run it by hand via cron as the docs show

Thanks for the reply. I’ve got it all working nicely in production now.
We use monit in production.
The config is under source control so we can deploy changes with cap.
I wrote some monit config to run a bj process. This may help somebody.

# BJ
# BJ is a ruby script that manages background processes.

check process bj with pidfile /mnt/app/shared/pids/bj.pid
  start program = "/bin/bash /mnt/app/current/script/bj.sh start production &"
  stop program = "/bin/bash /mnt/app/current/script/bj.sh stop"

I wrote a bash wrapper (bj.sh) as monit uses pidfiles.

#!/bin/bash

case $1 in
  start)
    # record this shell's pid; exec replaces the shell, so bj keeps the same pid
    echo $$ > /mnt/app/shared/pids/bj.pid
    exec 2>&1 /usr/bin/ruby1.8 /mnt/app/current/script/bj run --forever \
      --redirect=/mnt/app/shared/log/bj.log --rails_env=$2 \
      --rails_root=/mnt/app/current 1>/mnt/app/shared/log/bj.log
    ;;
  stop)
    # kill the recorded pid, then remove the stale pidfile
    kill `cat /mnt/app/shared/pids/bj.pid`; rm /mnt/app/shared/pids/bj.pid
    ;;
  *)
    echo "usage: bj.sh {start|stop}" ;;
esac


#14

I’m trying out bj in a rails environment running 6 mongrel instances.
I’m seeing 6 bj processes running, one for each mongrel.
The pid of the mongrel is included in the running bj command
(--ppid=99999).

I’ve tried to start a bj process by hand. Following the docs I did:

ruby script/bj run --forever --rails_env=qa --rails_root=/mnt/app/current

I restarted my mongrel servers. They just ignore the running bj process
and start their own.

I know this isn’t by design as you state that there will be only one bj
process for each machine.

Is there some configuration I’m missing? Thanks.


#15

On Jul 9, 1:01 pm, Colin S. removed_email_address@domain.invalid wrote:

kill `cat /mnt/app/shared/pids/bj.pid`; rm /mnt/app/shared/pids/bj.pid

I’m trying to get bj set up with Monit as well, but I can’t seem to
find the pid file in the usual places (including app/shared/pids/bj.pid).
Is there any way I can specify a location for the pid file?
Ideally I’d want to stick it outside the virtual file system which is
shared across slices.

Thanks,
Scott


#16

ara.t.howard wrote:

On Jul 7, 2008, at 1:02 PM, Colin S. wrote:

I know this isn’t by design as you state that there will be only one
bj process for each machine.

it’s a documentation flaw - sorry. if you want to run only one
instance run it by hand via cron as the docs show - this is vastly
easier to monitor and allows the background process to run on a
different host to boot.

Ara, I’d like to suggest mentioning this in the README. Everything I’ve
read aside from this post (which I was lucky to find) makes it seem as
though only one instance of Bj should ever be running at a time. This
caused me quite a lot of frustration trying to track down the problem -
only to find that there wasn’t one!

From /bin/bj:

Bj ensures that only one background process is running for your application -
firing up three mongrels or fcgi processes will result in only one background
runner being started. Note that the number of background runners does not
determine throughput - that is determined primarily by the nature of the jobs
themselves and how much work they perform per process.

Thanks,

- Trevor