On Tue, Jan 21, 2014 at 9:08 PM, Martin H. wrote:
Robert, I appreciate your effort. I still have a hard time wrapping my
head around fork and closing dangling file handles - just as you pointed
out.
Did you look at the code I provided (the gist)? Basically, each process
needs to close all file handles that it does not need, which
includes all previously opened pipes and the write end of the current
pipe. I also added the optimization that the head of the pipeline is
executed in the current process.
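A minimal sketch of that rule, not the code from the gist: `run_pipeline` is a hypothetical helper, each stage is assumed to be a lambda taking `(io_in, io_out)`, and for brevity every stage (including the head) is forked. The parent closes each pipe end right after handing it to a child, so later children never inherit stale handles.

```ruby
# Hypothetical helper: runs each stage (a lambda taking io_in, io_out)
# in its own forked process, connected by anonymous pipes.
def run_pipeline(stages, final_out = $stdout)
  pids    = []
  prev_rd = nil # read end of the pipe that feeds the next stage

  stages.each_with_index do |stage, i|
    last   = (i == stages.size - 1)
    rd, wr = last ? [nil, nil] : IO.pipe

    pids << fork do
      rd.close if rd # this stage never reads its own output pipe
      input  = prev_rd || File.open(File::NULL)
      output = last ? final_out : wr
      stage.call(input, output)
      output.flush
      exit!(0) # skip at_exit handlers inherited from the parent
    end

    # The parent drops both ends it just handed to the child. Later
    # children therefore never inherit them, and a reader sees EOF as
    # soon as the writing stage exits.
    wr.close if wr
    prev_rd.close if prev_rd
    prev_rd = rd
  end

  pids.each { |pid| Process.wait(pid) }
end
```

If the parent kept `wr` open, the downstream stage would block forever waiting for EOF, which is the "dangling file handle" symptom discussed here.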
Also, lambdas, smart as they may be, are less readable to me, and I
would think also my target audience (biologists). I have decided to
stick to my proposed syntax for setting up and executing pipes, which is
most flexible and easy to grasp (and will work excellently in irb):
I’ll try to convince you to reconsider. First, with the approach you
have taken, you need to define a method inside the Pipe class for
every piece of functionality that you want to have. That makes the
Pipe class multi-purpose (executing pipelines in multiple processes
AND providing a lot of individual functionality). As a consequence the
provider of the Pipe class (you) has to provide all the algorithms in
the form of methods. Of course users of the class can open it again and add
methods (e.g. their own version of “cat” or whatnot). But that opens
the door for name conflicts, issues with instance variables etc.
In software engineering it is usually considered good practice to give
a class only one purpose. For class Pipe (which should
probably rather be called “Pipeline”) it is executing multiple
operations in various processes (or even threads as my class offers).
By using lambda as abstraction for an anonymous method you have
maximum flexibility to provide any functionality. The only contract
is that the lambda will be invoked with two IOs for reading and
writing.
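To illustrate the contract, here is a small sketch with a hypothetical stage named `UPCASE` (not part of either library). Because a stage only needs "something readable" and "something writable", it can be exercised on its own with in-memory `StringIO` objects, without any processes or pipes.

```ruby
require "stringio"

# Hypothetical stage: upcases every line. It relies only on being
# called with a readable IO and a writable IO.
UPCASE = lambda do |io_in, io_out|
  io_in.each_line { |line| io_out.puts line.upcase }
end

# Try the stage in isolation with in-memory IO objects.
input  = StringIO.new("acgt\nttga\n")
output = StringIO.new
UPCASE.call(input, output)
output.string # => "ACGT\nTTGA\n"
```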
Even if you dislike lambdas, it is not too hard to come up with a
syntax that resembles yours! And that is probably the best news for
you. Here are some examples (for “cat” only) and what they look like
when creating the pipeline:
CAT = lambda do |*args|
  lambda do |io_in, io_out|
    args.each do |file|
      File.foreach(file) { |line| io_out.puts line }
    end
  end
end

pipe.add(CAT["test_in"]).add(GREP[/foo.*bar/])
pipe << CAT["test_in"] << GREP[/foo.*bar/]
pipe | CAT["test_in"] | GREP[/foo.*bar/]
(Note, I added operator “|”.)
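For readers who want to try the three spellings, here is a rough sketch. `MiniPipeline` is a hypothetical stand-in, not the class from the gist: it runs the stages one after another in the current process and buffers in memory, whereas the real class uses processes or threads. `GREP` is not defined in this message, so the version below is one plausible definition following the same closure pattern, and `LINES` is an invented source stage that avoids needing a file.

```ruby
require "stringio"

# Hypothetical stand-in for the pipeline class discussed above.
class MiniPipeline
  def initialize
    @stages = []
  end

  # Append a stage and return self so calls can be chained.
  def add(stage)
    @stages << stage
    self
  end
  alias_method :<<, :add
  alias_method :|, :add

  # Run the stages sequentially, feeding each stage's buffered output
  # to the next one. Returns the final output as a String.
  def run(input = "")
    @stages.reduce(input) do |data, stage|
      out = StringIO.new
      stage.call(StringIO.new(data), out)
      out.string
    end
  end
end

# Assumed definition of GREP: keeps only lines matching the pattern.
GREP = lambda do |pattern|
  lambda do |io_in, io_out|
    io_in.each_line { |line| io_out.puts line if line =~ pattern }
  end
end

# Invented source stage: emits the given lines and ignores its input.
LINES = lambda do |*lines|
  lambda { |_io_in, io_out| lines.each { |l| io_out.puts l } }
end

pipe = MiniPipeline.new
pipe | LINES["foo and bar", "baz"] | GREP[/foo.*bar/]
pipe.run # => "foo and bar\n"
```

All three spellings end up in `add`; the operators are only aliases, which works because `add` returns `self`.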
Other forms to create those lambdas - basically always following the
same approach to use a closure to capture arguments for later
execution:
def cat(*args)
  lambda do |io_in, io_out|
    args.each do |file|
      File.foreach(file) { |line| io_out.puts line }
    end
  end
end

pipe.add(cat("test_in")).add(grep(/foo.*bar/))
pipe << cat("test_in") << grep(/foo.*bar/)
pipe | cat("test_in") | grep(/foo.*bar/)
def Cat(*args)
  lambda do |io_in, io_out|
    args.each do |file|
      File.foreach(file) { |line| io_out.puts line }
    end
  end
end

pipe.add(Cat("test_in")).add(Grep(/foo.*bar/))
pipe << Cat("test_in") << Grep(/foo.*bar/)
pipe | Cat("test_in") | Grep(/foo.*bar/)
require 'pipe'

p1 = Pipe.new.add(:cat, input: "test_in").add(:grep, pattern: "foo")
p2 = Pipe.new.add(:save, output: "test_out")
(p1 + p2).add(:dump).run
Easily done.
Also, I am aiming for some 100+ commands here, where some will be quite
advanced - and I am afraid of lambdas for this.
Why? Your lambda can even be a simple adapter if you want to make use
of multiple other classes.
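A sketch of what such an adapter could look like. `Complementer` and `complement_stage` are hypothetical names invented for this illustration; the point is only that the class knows nothing about pipelines and the lambda translates between the two-IO contract and the class's own API.

```ruby
require "stringio"

# Hypothetical, pre-existing class that knows nothing about pipelines.
class Complementer
  # Returns the complement of a DNA sequence (A<->T, C<->G).
  def complement(seq)
    seq.tr("ACGT", "TGCA")
  end
end

# Adapter: wraps any object responding to #complement in a lambda that
# honours the (io_in, io_out) contract.
def complement_stage(worker = Complementer.new)
  lambda do |io_in, io_out|
    io_in.each_line { |line| io_out.puts worker.complement(line.chomp) }
  end
end

out = StringIO.new
complement_stage.call(StringIO.new("ACGT\n"), out)
out.string # => "TGCA\n"
```

With this shape, an advanced command can live in its own class with its own tests, and the pipeline only ever sees the thin lambda.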
In fact, I am
experimenting to come up with the next generation of Biopieces
(www.biopieces.org).
Here is a version with named pipes that works (though still with
dangling file handles):
named_pipes2.rb · GitHub
I have updated my version but there is the opposite problem: one of
the IOs is closed too early.
You can see the current state of affairs
in the gist. I have to go to bed now.
Named pipes don’t have the parent/child binding of IO.pipe, so they work
with the parallel gem. However, the stream terminator “\0” I use is
quite hacky. I could possibly keep track of the number of records
passed around instead. Or get those dangling file handles sorted?
Better that.
Finally, I wonder about performance of IO.pipe versus named pipes - I
want to do a benchmark. Actually, I am concerned about overall
performance; this is of course not C or even well written Ruby code
optimized for a specific task, but rather a general purpose way of
setting up pipes and executing them as simply as possible. I figure that
spending 30 minutes writing a script that runs for 1 minute is often less
appealing than spending 1 minute on a script that runs for 30 minutes.

There is another disadvantage of named pipes: you need to make sure
that names do not collide. Also, there might be security
implications. The approach with nameless pipes is certainly more robust.
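If named pipes are used anyway, one way to reduce both risks is to create each FIFO inside a fresh private directory. This is only a sketch: `with_private_fifo` is a hypothetical helper, and it assumes a Unix-like system and Ruby 2.3 or later (for `File.mkfifo`).

```ruby
require "tmpdir"

# Hypothetical helper: create a FIFO inside a freshly made private
# directory, yield its path, and remove everything afterwards.
def with_private_fifo(name = "stage.fifo")
  Dir.mktmpdir("pipeline") do |dir| # unique directory, created with mode 0700
    path = File.join(dir, name)
    File.mkfifo(path, 0o600)        # owner-only read/write (before umask)
    yield path
  end
end

with_private_fifo do |path|
  pid = fork do
    File.open(path, "w") { |f| f.puts "hello" } # blocks until a reader opens
    exit!(0) # skip the inherited cleanup so only the parent removes the dir
  end
  puts File.read(path) # blocks until the writer opens, reads until EOF
  Process.wait(pid)
end
```

The unique directory name avoids collisions between concurrent pipelines, and the restrictive modes keep other users from opening the FIFO.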
Kind regards
robert