Rake dependencies unknown prior to running tasks

Joe_WSSSSlfel · September 24, 2008, 5:02pm

Say I don’t know what all the dependencies are until I’ve already
begun executing tasks? To what extent can I add new tasks and
dependencies on the fly? At first I thought that adding tasks
during task execution didn’t seem to be a safe thing to do. Then I
made a few toy examples that seem to confirm this. So I started
thinking that I need to have Rake call Rake, which seemed a bit
clumsy. But then I read this discussion mentioned by Jim W.
that seemed to imply that I ought to be able to make a single rake
file do the job (Re: Recursive Rake - rubikitch - org.ruby-lang.ruby-talk - MarkMail
+page:1+mid:zlc5qjj5r6abfcse+state:results). But I couldn’t find
any details on how to accomplish this. Are there better ways of
handling this problem that don’t involve multiple rake files and rake
calling rake?

Joe_WSSSSlfel · September 24, 2008, 5:46pm

Joe Wölfel wrote:

Say I don’t know what all the dependencies are until I’ve already
begun executing tasks? To what extent can I add new tasks and
dependencies on the fly?

This is what ‘import’ is for,

import 'moretasks.rb'

moretasks.rb is run after the Rakefile has been loaded but before any
tasks are invoked.

Actually I see no reason why it has to be a file. It looks like
‘import’ should take an optional block.

Joe_WSSSSlfel · September 24, 2008, 6:02pm

Mike G. wrote:

Joe Wölfel wrote:

Say I don’t know what all the dependencies are until I’ve already
begun executing tasks? To what extent can I add new tasks and
dependencies on the fly?

This is what ‘import’ is for

Sorry I just realized you meant that you actually create new tasks after
the task invocations have begun.

In this case, are you certain those things creating tasks should be
tasks? It seems like you should have normal ruby classes/methods which
determine which tasks to create, then create them. That is what I do.

I think this strategy covers all cases, even though you may need to
restructure your code. But in the end it’s a cleaner approach, IMO.

Joe_WSSSSlfel · September 24, 2008, 6:41pm

Cleaner, maybe. But inefficient in my case. That would mean a lot
of unnecessary rebuilding. Unfortunately, efficiency matters in
this case. It can take days or weeks even with parallel builds. And
it needs to be done often.

It seems like the wrong way to do it, but the only efficient solution
I’ve come up with so far is to have Rake call itself with a different
task. So basically I have dependency graph 1, which is known at the
outset and dependency graph 2 which is only known after running tasks
in dependency graph 1, and dependency graph 2 is itself dependent on
dependency graph 1.

It seems like a common problem. I’ve run into a number of build
systems that needed to be restarted several times to get around
similar issues. But if there’s a better solution already out there
I’d like to use it.

Joe_WSSSSlfel · September 24, 2008, 7:27pm

Joe Wölfel wrote:

Cleaner, maybe. But inefficient in my case. That would mean a lot
of unnecessary rebuilding. Unfortunately, efficiency matters in
this case. It can take days or weeks even with parallel builds. And
it needs to be done often.

It seems like the wrong way to do it, but the only efficient solution
I’ve come up with so far is to have Rake call itself with a different
task. So basically I have dependency graph 1, which is known at the
outset and dependency graph 2 which is only known after running tasks
in dependency graph 1, and dependency graph 2 is itself dependent on
dependency graph 1.

It seems like a common problem. I’ve run into a number of build
systems that needed to be restarted several times to get around
similar issues. But if there’s a better solution already out there
I’d like to use it.

I don’t see why it would be inefficient or require unnecessary
rebuilding.

If you follow the strategy I mentioned, making your changes to the graph
before the first invoke, and avoiding tasks creating tasks (which is
forbidden anyway with the new parallel -j support in Drake), then you’ve
removed the dependency between graph 1 and graph 2 you describe.

By removing that dependency, it becomes more efficient because more
tasks can be parallelized, whereas before graph 1 and graph 2 had to be
executed sequentially (this may not be significant in your case, but is
very much so in other cases).

Any build system in which the only entry point is a task (that is, you
must make a graph in order to make a graph) would have to be re-run
to compensate for its lack of dynamic support. Makefiles, for example.
That is why Rake is different – you have the whole ruby language to
define your tasks, and then you say “go”. This two-step approach is the
solution you seek.
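The two-step approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `discover_targets` and the `build_*` task names are hypothetical, and the tasks are driven through Rake's programmatic API (`Rake::Task.define_task`, `Rake::Task[]`, `invoke`) rather than from a Rakefile, so the snippet can be run as an ordinary script.

```ruby
require 'rake'

# Hypothetical helper: ordinary Ruby works out which targets exist.
# In a real build this might scan a directory or parse a manifest.
def discover_targets
  %w[alpha beta gamma]
end

executed = []

# Step 1: build the whole graph with plain Ruby, before anything is invoked.
task_names = discover_targets.map { |name| "build_#{name}" }
task_names.each do |task_name|
  Rake::Task.define_task(task_name) { executed << task_name }
end
Rake::Task.define_task(:build_all => task_names)

# Step 2: say "go".
Rake::Task[:build_all].invoke
puts executed.inspect
```

The limitation raised elsewhere in this thread still applies: this only helps when the list of targets can be computed before the first task runs.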

Joe_WSSSSlfel · September 24, 2008, 8:37pm

Joe Wölfel wrote:

I don’t see why it would be inefficient or require unnecessary
rebuilding.

The reason is because I have to build things before I know (or can
even determine programmatically) what other things need to be built.

If you can’t determine programmatically what is built, then how does a
program build it?

Even C/C++ dependencies, where you have no clue what g++ -MM is going to
spit out, can be handled with ‘import’ and the makefile loader.

If you are executing some other program which generates stuff, perhaps
you can add a flag where the program outputs what it would generate.
Capture that and ‘import’ it.

And if you can’t add that flag, or if you otherwise don’t know what is
being generated, then your hands are tied anyway. You can’t know what’s
going to happen, so you can’t do anything about it. The two graphs are
worlds apart, and never the twain shall meet. In this case I wonder
what solution you could have expected.

Joe_WSSSSlfel · September 24, 2008, 7:52pm

I don’t see why it would be inefficient or require unnecessary
rebuilding.

The reason is because I have to build things before I know (or can
even determine programmatically) what other things need to be built.

Joe_WSSSSlfel · September 24, 2008, 9:57pm

I didn’t say what was being built couldn’t be determined
programmatically. I said it couldn’t be determined until certain
portions were already built. To build those initial things I
need a build tool, such as Rake. If the suggestion is that I
shouldn’t actually execute any Rake tasks until after I’ve determined
all possible tasks then the catch-22 you’re talking about actually
occurs. The only practical solution I’ve come up with so far is to
have Rake build the initial targets and then call itself again to
determine the rest of the dependency graph and build the remaining
targets. If there were a way to augment the initial dependency
graph dynamically then this wouldn’t be necessary. I just don’t
happen to know of one.

Joe_WSSSSlfel · September 25, 2008, 2:29am

If only the Internet came with an Undo button…

Since in the previous example Rake complains unless main_a and main_b
are defined, it sort of defeats the whole purpose. This works:

task :setup_a do
  puts "setup_a"
end

task :setup_b do
  puts "setup_b"
end

task :setup => [:setup_a, :setup_b] do
  puts "setup phase complete. defining new tasks..."

  task :main_a do
    puts "main_a"
  end

  task :main_b do
    puts "main_b"
  end

  puts "restarting..."
  throw :restart
end

task :main => [:main_a, :main_b] do
  puts "main phase complete."
end

task :default => [:setup, :main] do
  puts "all done."
end

However this defeats Drake, which I suppose is another matter.

Joe_WSSSSlfel · September 25, 2008, 2:19am

Joe Wölfel wrote:

I didn’t say what was being built couldn’t be determined
programmatically. I said it couldn’t be determined until certain
portions were already built. To build those initial things I
need a build tool, such as Rake. If the suggestion is that I
shouldn’t actually execute any Rake tasks until after I’ve determined
all possible tasks then the catch-22 you’re talking about actually
occurs. The only practical solution I’ve come up with so far is to
have Rake build the initial targets and then call itself again to
determine the rest of the dependency graph and build the remaining
targets. If there were a way to augment the initial dependency
graph dynamically then this wouldn’t be necessary. I just don’t
happen to know of one.

If you really cannot know what is going to be built, for example if a
program generates files whose names are taken from /dev/random and then
other tasks depend on those files, then you are in a pickle. Normally
this kind of thing is handled by ‘import’, but this assumes tasks can be
determined (for example examining the makedepend output).

What do you think of this:

task :setup_a do
  puts "setup_a"
end

task :setup_b do
  puts "setup_b"
end

task :setup => [:setup_a, :setup_b] do
  puts "setup phase complete. defining new tasks..."

  task :main_a do
    puts "main_a"
  end

  task :main_b do
    puts "main_b"
  end

  puts "restarting..."
  throw :restart
end

task :main => [:main_a, :main_b] do
  puts "main phase complete."
end
task :main_a => :setup
task :main_b => :setup

task :default => :main do
  puts "all done."
end

% rake -f test/Rakefile.restart-flag
(in /Users/jlawrence/work/rake)
setup_a
setup_b
setup phase complete. defining new tasks...
restarting...
main_a
main_b
main phase complete.
all done.

I may be inflicting hardship on myself since this would complicate drake
(http://drake.rubyforge.org), but anyway… This patch is for regular
rake; the git branch is the same thing.

% git clone git://github.com/quix/rake.git
% cd rake
% git checkout -b restart-flag origin/restart-flag

```diff
diff --git a/lib/rake.rb b/lib/rake.rb
index 7c84f57..3010261 100755
--- a/lib/rake.rb
+++ b/lib/rake.rb
@@ -560,8 +560,15 @@ module Rake
 
     # Invoke the task if it is needed.  Prerequites are invoked first.
     def invoke(*args)
-      task_args = TaskArguments.new(arg_names, args)
-      invoke_with_call_chain(task_args, InvocationChain::EMPTY)
+      catch(:done) {
+        loop {
+          catch(:restart) {
+            task_args = TaskArguments.new(arg_names, args)
+            invoke_with_call_chain(task_args, InvocationChain::EMPTY)
+            throw :done
+          }
+        }
+      }
     end
 
     # Same as invoke, but explicitly pass a call chain to detect
@@ -573,8 +580,8 @@ module Rake
           puts "** Invoke #{name} #{format_trace_flags}"
         end
         return if @already_invoked
-        @already_invoked = true
         invoke_prerequisites(task_args, new_chain)
+        @already_invoked = true
         execute(task_args) if needed?
       end
     end
```

Joe_WSSSSlfel · September 25, 2008, 7:05pm

Thanks for the patch. Here’s a clunkier variation on your
suggestion that seems to work with Drake. Stage 1 serializes an
unpredictable set of tasks. Stage 2 creates instances of them and
runs them if necessary. There might be a better way that involves
making the dependency tree modifiable dynamically. I think allowing
all possible dependency changes would get complicated. Maybe that
would require reevaluating the entire tree constantly and there’s no
way to un-execute a task anyway. But most of the real world problems
I can think of seem to involve adding tasks that wouldn’t have been
exercised yet anyway. Could this be solved with an improved
dependency tree walking algorithm?

require 'rake/clean'

# Stage 1 puts a random set of numbers in a file
STAGE_ONE_RESULTS = "s1.txt"
file STAGE_ONE_RESULTS do
  open(STAGE_ONE_RESULTS, 'wb') do |file|
    (1..5).map { |i| rand 10 }.uniq.each do |i|
      puts "stage1 creating dependency #{i}"
      file.puts i
    end
  end
end
task :stage1 => STAGE_ONE_RESULTS

# Stage 2 creates tasks based on those random numbers
task :stage2 => :stage1
if File.exist? STAGE_ONE_RESULTS
  IO.readlines(STAGE_ONE_RESULTS).each do |task_info|
    task task_info do
      puts "stage2 executing #{task_info}"
    end
    task :stage2 => task_info
  end
end

# Relaunch the build in a second process once stage 1 has written its results
task :all => :stage1 do
  puts `drake -j4 stage2`
end

CLEAN.include STAGE_ONE_RESULTS
task :default => :all

Joe_WSSSSlfel · September 26, 2008, 1:23am

The following is a better implementation which could be made to work
with Drake. A patch for regular Rake follows.

task :setup_a do
  puts "setup_a"
end

task :setup_b do
  puts "setup_b"
end

task :setup => [:setup_a, :setup_b] do
  puts "setup phase complete. defining new tasks..."

  task :main_a do
    puts "main_a"
  end

  task :main_b do
    puts "main_b"
  end

  task :main => [:main_a, :main_b] do
    puts "main phase complete."
  end

  puts "restarting..."
  throw :restart
end

task :main

task :default => [:setup, :main] do
  puts "all done."
end

```diff
diff --git a/lib/rake.rb b/lib/rake.rb
index 36c2734..1e360a6 100755
--- a/lib/rake.rb
+++ b/lib/rake.rb
@@ -573,8 +573,8 @@ module Rake
           puts "** Invoke #{name} #{format_trace_flags}"
         end
         return if @already_invoked
-        @already_invoked = true
         invoke_prerequisites(task_args, new_chain)
+        @already_invoked = true
         execute(task_args) if needed?
       end
     end
@@ -1994,7 +1994,14 @@ module Rake
         elsif options.show_prereqs
           display_prerequisites
         else
-          top_level_tasks.each { |task_name| invoke_task(task_name) }
+          catch(:done) {
+            loop {
+              catch(:restart) {
+                top_level_tasks.each { |task_name| invoke_task(task_name) }
+                throw :done
+              }
+            }
+          }
         end
       end
     end
```

Joe Wölfel wrote:

Thanks for the patch. Here’s a clunkier variation on your
suggestion that seems to work with Drake. Stage 1 serializes an
unpredictable set of tasks. Stage 2 creates instances of them and
runs them if necessary. There might be a better way that involves
making the dependency tree modifiable dynamically. I think allowing
all possible dependency changes would get complicated. Maybe that
would require reevaluating the entire tree constantly and there’s no
way to un-execute a task anyway. But most of the real world problems
I can think of seem to involve adding tasks that wouldn’t have been
exercised yet anyway. Could this be solved with an improved
dependency tree walking algorithm?

I think the best strategy is to cache the dynamic changes and update
only when the :restart flag is thrown. Fortunately, Drake is already
structured to work this way.

Drake does a dry run to collect all tasks to be executed, then passes
the dependency graph to my CompTree package which executes it in
parallel (CompTree is a kind of modest Erlang-in-Ruby).

It would be safe to add tasks during execution, as CompTree is running a
shallow copy of the dependency graph and will be unaware of any new
tasks or dependencies. I don’t foresee any serious issues with simply
restarting the computation with a new shallow copy of the appended
graph.

Though Drake copies the dependency tree for unrelated reasons, it turns
out to be coincidentally useful here because it acts as a cache while
the user can append to the original.

Though this is mostly brainstorming, I do see a need for a restart
feature, whether or not these ideas pan out. The :setup phase may
execute non-trivial tasks which will be obliviously re-executed by :main
in the separate process. We could assume the two stages comprise
disjoint sets, however it would be difficult to enforce. It’s an
artificial restriction which will eventually fail.

In the example above, the :main gets executed in Rake and Drake for
entirely different reasons. In single-threaded Rake, the restart
happens before it even gets to :main, so :main did not get marked as
@already_invoked. In multi-threaded Drake, :main does get marked,
however after the restart its newly-created child nodes will still be
executed because CompTree will not even consider a node until all its
children have been computed.

One difference: in single-threaded Rake you must be careful to add tasks
“in the future”, some place ahead in the sequential order of execution.
In my example :setup modifies :main, which is fine since the order given
is [:setup, :main]. In multi-threaded Drake you don’t have to worry
about it, for reasons mentioned in the previous paragraph.

James M. Lawrence
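The interplay between the restart loop and the point at which a task is marked as invoked can be tried out without Rake. What follows is a minimal sketch, not Rake's or Drake's code: `MiniRunner`, its `task`, `invoke` and `run` methods, and the `done` flag are hypothetical names invented for illustration. It keeps the two ideas from the patch: a task is marked only after its prerequisites have finished, and a `throw :restart` makes the runner walk the graph again from the top.

```ruby
# Toy single-threaded task runner (hypothetical; not Rake's implementation).
class MiniRunner
  Task = Struct.new(:name, :prereqs, :action, :done)

  attr_reader :log

  def initialize
    @tasks = {}
    @log = []
  end

  # Define a task, or add prerequisites/an action to an existing one.
  def task(name, prereqs = [], &action)
    t = (@tasks[name] ||= Task.new(name, [], nil, false))
    t.prereqs |= prereqs
    t.action = action if action
    t
  end

  # Prerequisites run first. The task is marked only afterwards, so a
  # restart thrown by a prerequisite leaves this task unmarked.
  def invoke(name)
    t = @tasks.fetch(name)
    return if t.done
    t.prereqs.each { |p| invoke(p) }
    t.done = true
    @log << name
    t.action.call(self) if t.action
  end

  # Walk the graph again from the top whenever a task throws :restart.
  def run(top)
    catch(:done) do
      loop do
        catch(:restart) do
          invoke(top)
          throw :done
        end
      end
    end
    log
  end
end

runner = MiniRunner.new
runner.task(:setup_a)
runner.task(:setup_b)
runner.task(:setup, [:setup_a, :setup_b]) do |r|
  # New tasks are added "in the future": :main has not been reached yet.
  r.task(:main_a)
  r.task(:main_b)
  r.task(:main, [:main_a, :main_b])
  throw :restart
end
runner.task(:main)
runner.task(:default, [:setup, :main])

order = runner.run(:default)
puts order.inspect
```

In this toy, `:default` is still unmarked when the restart unwinds, so the second pass reaches `:main` and its new prerequisites, while the already finished setup tasks are skipped. If `done` were set before the prerequisites ran, the second pass would return immediately at `:default`.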

Joe_WSSSSlfel · September 29, 2008, 6:22pm

I’ve had a chance to play with your solution a bit. It is a much
better solution than my earlier crude solution that relaunched
Rake. With your solution it seems all you need to do is throw
:restart at the end of a task that creates new tasks. It’s very
simple. Everything just seems to work and previously executed tasks
aren’t executed twice. In my solution they are executed twice, which
is bad (or at least expensive). Also, in my solution I have to hard
code Rake parameters and it gets especially hairy when I have more
than one task that creates other tasks.

I noticed that your patch doesn’t seem to work with multitask for
some reason. I’m not sure why. Also, as simple as it is to throw
restart I’m wondering if it’s possible that this could be done
automatically - maybe with a warning for users who do it
inadvertently. If a task is defined at any point while a task is
executing then restart could be thrown automatically when the
executing task completes. Then Rake could automatically support
dynamic task creation. Would that make sense?
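The automatic-restart idea can be sketched in the same toy style. This is an assumption-laden illustration of the proposal, not a patch to Rake: `AutoRestartRunner` and all of its methods and instance variables are hypothetical. The runner notes whether any task was defined while an action was executing and, once that action has completed, prints a warning and throws `:restart` itself.

```ruby
# Toy runner that restarts by itself when a task changes the graph
# (hypothetical; not Rake's implementation).
class AutoRestartRunner
  Task = Struct.new(:name, :prereqs, :action, :done)

  attr_reader :log, :restarts

  def initialize
    @tasks = {}
    @log = []
    @restarts = 0
    @executing = false
    @graph_changed = false
  end

  # Define or extend a task; remember if this happened inside an action.
  def task(name, prereqs = [], &action)
    t = (@tasks[name] ||= Task.new(name, [], nil, false))
    t.prereqs |= prereqs
    t.action = action if action
    @graph_changed = true if @executing
    t
  end

  def invoke(name)
    t = @tasks.fetch(name)
    return if t.done
    t.prereqs.each { |p| invoke(p) }
    t.done = true
    @log << name
    execute(t)
  end

  def run(top)
    catch(:done) do
      loop do
        catch(:restart) do
          invoke(top)
          throw :done
        end
        @restarts += 1 # reached only when :restart was thrown
      end
    end
    log
  end

  private

  # Run the action, then restart if it defined anything.
  def execute(t)
    @executing = true
    t.action.call(self) if t.action
    @executing = false
    return unless @graph_changed
    @graph_changed = false
    warn "task #{t.name} defined new tasks; restarting"
    throw :restart
  end
end

runner = AutoRestartRunner.new
runner.task(:setup_a)
runner.task(:setup_b)
runner.task(:setup, [:setup_a, :setup_b]) do |r|
  # No explicit throw here: the runner notices the new definitions.
  r.task(:main_a)
  r.task(:main_b)
  r.task(:main, [:main_a, :main_b])
end
runner.task(:main)
runner.task(:default, [:setup, :main])

order = runner.run(:default)
puts order.inspect
```

The sketch ignores nested invocations and parallel execution, both of which a real implementation would have to consider.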
