Forking job scheduler

Hi all,
If anyone is willing, I’d be grateful for some advice on the forking
job scheduler I’ve written. It works fine in simple tests, but does
not feel elegant. On IRC kbrooks recommended an asynchronous main
loop, but I don’t understand how to implement that in this situation.
The first version I wrote used threads, but several sources
recommended fork instead. I have also considered just using the shell
command ‘ps’ to see how many jobs are running, launching more as
needed.

The basic requirements:

  • Each job is a long-running external process (taking a day or more)
    and all jobs require a different amount of time to run (so
    asynchronous launching will be needed).
  • I want to keep N jobs running at all times (N = 4 in the example
    below)

Thanks,
Krishna

##############################################

@jobs = (1..10).to_a   # an imaginary list of jobs

# any time a job finishes, launch another
Signal.trap("CLD") { start_job unless @jobs.empty? }

def start_job
  my_job = @jobs.pop
  puts "starting job #{my_job}"
  # launch a job. in reality it would run for a day or more
  exec("sleep 2") if fork.nil?
end

for num in 1..4   # I want to keep 4 jobs running at all times
  start_job
end

# this doesn't wait for the last jobs to finish
while @jobs.size > 0
  Process.wait
end

# this waits for the last jobs, but if I only had this line, it
# wouldn't wait for all the jobs to start!
Process.wait
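
(For illustration only, and not necessarily what kbrooks had in mind: one way to avoid the signal handler is to block in Process.wait and top the pool back up after every exit. A rough sketch, with "sleep 2" again standing in for the day-long command:)

# alternative sketch (not from the original post): no signal handler,
# just a blocking wait loop that keeps the pool topped up
jobs = (1..10).to_a
max_workers = 4
running = 0

def launch(job)
  fork do
    puts "starting job #{job}"
    exec("sleep 2")          # stand-in for the real day-long command
  end
end

until jobs.empty? && running.zero?
  # launch until max_workers children are alive or no work remains
  while running < max_workers && !jobs.empty?
    launch(jobs.pop)
    running += 1
  end
  Process.wait               # block until any child exits
  running -= 1
end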

On 9/29/06, Krishna D. [email protected] wrote:

The basic requirements:

  • Each job is a long-running external process (taking a day or more)
    and all jobs require a different amount of time to run (so
    asynchronous launching will be needed).
  • I want to keep N jobs running at all times (N = 4 in the example below)

You say nothing about the coordination requirements of the external
processes with the “watchdog” process. Is your requirement really just to
ensure that four jobs are running at all times? If so, I would avoid using a
long-running watchdog process, because you’re making an assumption that it
will never crash, catch a signal, etc. Why not run a cron job every five
minutes or so that checks the running processes (via pgrep or ps as you
suggested), starts more if necessary, writes status to syslog, and then
quits? Much, much easier.
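
(For illustration only: a minimal sketch of the kind of cron-driven checker
described above, assuming all jobs are launched via one command whose name
pgrep can match; the command name and counts are made-up examples.)

#!/usr/bin/env ruby
# check_jobs.rb -- hypothetical cron-driven checker, not from the thread.
# run from cron, e.g.:  */5 * * * * /usr/local/bin/check_jobs.rb
require 'syslog'

TARGET  = 4                 # keep this many jobs running
COMMAND = "my_long_job"     # assumed name of the worker command

running = `pgrep -f #{COMMAND}`.split.size
missing = TARGET - running

Syslog.open("job_checker") do |log|
  log.info("%d running, launching %d", running, [missing, 0].max)
end

missing.times do
  fork { exec(COMMAND) }    # children are reparented to init when this script exits
end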

Hi Francis,

I actually had considered the cron approach, but wasn’t sure if it was
the best way to do things. What you say makes a lot of sense (I was
already nervous about the watchdog running into trouble, and there are
no coordination requirements), so I will go with your suggestion.

Thanks!
Krishna

On Fri, 29 Sep 2006, Krishna D. wrote:

The basic requirements:

  • Each job is a long-running external process (taking a day or more)
    and all jobs require a different amount of time to run (so
    asynchronous launching will be needed).
  • I want to keep N jobs running at all times (N = 4 in the example below)

no need to reinvent the wheel! ;-)

Linux Clustering with Ruby Queue: Small Is Beautiful | Linux Journal
http://codeforpeople.com/lib/ruby/rq/
http://codeforpeople.com/lib/ruby/rq/rq-2.3.4/README

download ruby queue (rq) and run it locally. it does this and much,
much more

setup a work q

harp:~ > rq ./q create

q: /home/ahoward/q
db: /home/ahoward/q/db
schema: /home/ahoward/q/db.schema
lock: /home/ahoward/q/lock
bin: /home/ahoward/q/bin
stdin: /home/ahoward/q/stdin
stdout: /home/ahoward/q/stdout
stderr: /home/ahoward/q/stderr

start a daemon process that will run 4 jobs at a time

harp:~ > rq ./q start --max_feed=4

submit a job

harp:~ > rq ./q submit echo foobar

jid: 1
priority: 0
state: pending
submitted: 2006-09-29 08:49:46.814603
started:
finished:
elapsed:
submitter: jib.ngdc.noaa.gov
runner:
stdin: stdin/1
stdout:
stderr:
pid:
exit_status:
tag:
restartable:
command: echo foobar

wait a bit

check the status

jib:~ > rq ./q list 2

jid: 2
priority: 0
state: finished
submitted: 2006-09-29 08:49:50.839391
started: 2006-09-29 08:50:09.282754
finished: 2006-09-29 08:50:09.798060
elapsed: 0.515306
submitter: jib.ngdc.noaa.gov
runner: jib.ngdc.noaa.gov
stdin: stdin/2
stdout: stdout/2
stderr: stderr/2
pid: 721
exit_status: 0
tag:
restartable:
command: echo barfoo

view the stdout

jib:~ > rq ./q stdout 2
barfoo

there is a command-line interface plus programming api - so you can almost
certainly accomplish whatever it is you need to do with zero or very little
coding on your part.

kind regards.

-a

On 9/29/06, [email protected] [email protected] wrote:

it simply starts it if it’s not running, otherwise it does nothing.

Sounds cool, Ara. How does it keep two copies of itself from running? Does
it flock a file in /var/run or something like that?

On Fri, 29 Sep 2006, Francis C. wrote:

You say nothing about the coordination requirements of the external
processes with the “watchdog” process. Is your requirement really just to
ensure that four jobs are running at all times? If so, I would avoid using a
long-running watchdog process, because you’re making an assumption that it
will never crash, catch a signal, etc. Why not run a cron job every five
minutes or so that checks the running processes (via pgrep or ps as you
suggested), starts more if necessary, writes status to syslog, and then
quits? Much, much easier.

this is exactly how rq works - except it does both: the feeder process is a
daemon, but one which refuses to start two copies of itself. therefore a
crontab entry can be used to make it ‘immortal’. basically, the crontab
simply starts it if it’s not running, otherwise it does nothing.

cheers.

-a
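
(Aside, for illustration: the general “refuse to start a second copy” trick
can be done with a lock file and a non-blocking flock. This is only a sketch
of the pattern, not rq’s actual code; the lock-file name is invented.)

#!/usr/bin/env ruby
# feeder.rb -- hypothetical sketch of the one-copy-per-user-per-host pattern.
lockfile = File.join(ENV["HOME"], ".feeder.lock")   # per-user lock, no root needed

File.open(lockfile, File::RDWR | File::CREAT, 0644) do |f|
  unless f.flock(File::LOCK_EX | File::LOCK_NB)
    exit 0          # another copy already holds the lock; let cron try again later
  end
  f.truncate(0)
  f.puts(Process.pid)
  f.flush

  loop do
    # ... feed jobs to workers here ...
    sleep 60
  end
end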

On Sat, 30 Sep 2006, Francis C. wrote:

Sounds cool, Ara. How does it keep two copies of itself from running? Does
it flock a file in /var/run or something like that?

yeah - basically. it’s under the user’s home dir though, named after the
queue. the effect is ‘one feeder per host per user’ by default. it really
works nicely because you can have a daemon process totally independent of
system space and without root privs. dirwatch works the same way. here’s my
crontab on our nrt system:

mussel:~ > crontab -l
leader = /dmsp/reference/bin/leader
worker = /dmsp/reference/bin/worker
env = /dmsp/reference/bin/bashenv
shush = /dmsp/reference/bin/shush
dirwatch = /dmsp/reference/bin/dirwatch
nrt = /dmsp/reference/bin/nrt
nrtq = /dmsp/reference/bin/nrtq
nrtw = /dmsp/reference/bin/nrtw
nrts = /dmsp/reference/bin/nrts
beveldevil = /dmsp/reference/bin/beveldevil
sfctmp1p0 = /dmsp/reference/bin/sfctmp1p0
afwa_watch = /dmsp/nrt/dirwatches/data/incoming/afwa/dirwatch
subscriptions_watch = /dmsp/nrt/dirwatches/subscriptions/dirwatch
dmsp_watch = /dmsp/nrt/dirwatches/data/incoming/dmsp/dirwatch
night_files_watch = /dmsp/nrt/dirwatches/data/incoming/night_files/dirwatch
mosaic_watch = /dmsp/nrt/dirwatches/data/incoming/mosaic/dirwatch
www = /dmsp/nrt/www/root/
qdb = /dmsp/nrt/queues/q/db
show_failed = /dmsp/reference/bin/show_failed

# mussel is the current leader

*/15 * * * * $leader $env $shush $afwa_watch start
*/15 * * * * $leader $env $shush $subscriptions_watch start
*/15 * * * * $leader $env $shush $dmsp_watch start
*/15 * * * * $leader $env $shush $night_files_watch start
*/15 * * * * $leader $env $shush $mosaic_watch start
*/15 * * * * $leader $env $shush $beveldevil
59 23 * * * $leader $env $shush $nrtq rotate

# clam, oyster, bismarck, scallop, shrimp are current workers

*/15 * * * * $worker $env $shush $nrtq start

this same crontab is installed across our nrt cluster. basically one node
runs a bunch of dirwatches which trigger submits to the master queue. the
workers, for their part, are completely stupid: all they have is a user
account and the ‘$worker’ crontab entry that keeps a feeding process running
at all times, even after reboot. it’s a simple way to set up durable
userland daemons.

($leader and $worker are xargs-style programs - $leader obviously only
executes its command line if run on the leader, vice versa for worker)
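
(For illustration, a leader-style wrapper could look roughly like this; the
role file and path are invented, and this is not the actual script.)

#!/usr/bin/env ruby
# leader -- hypothetical xargs-style wrapper: exec its arguments only when run
# on the leader host; a matching "worker" script would invert the test.
require 'socket'

ROLE_FILE = "/usr/local/etc/leader_host"   # invented path holding the leader's hostname
leader = File.read(ROLE_FILE).strip rescue ""

exit 0 if ARGV.empty?
exit 0 unless Socket.gethostname == leader   # not the leader: quietly do nothing

exec(*ARGV)   # e.g. exec("env", "shush", "/path/to/dirwatch", "start")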

regards.

-a