Marshal Pipe

I’ve just re-discovered pipes.
Using Linux bash… stuff like grep zip.89433 addresses.csv | sort | head
Bash pipes work very well for many problems, such as mass downloads and
data filtering.
But they’re simplest to implement on line-by-line text data.
This is not a true limitation of pipe architectures.

You can implement data pipes with Marshal.
Within your class, you can define a puts method for the source’s
$stdout:

def self.puts(data)
  data = Marshal.dump( data )
  # tell the sink how many bytes to read
  $stdout.print [data.length].pack('l')
  # then print out data
  $stdout.print data
end

and then the sink reads from $stdin:

while data = $stdin.read(4) do
  data = data.unpack('l').shift # bytes to read
  data = $stdin.read( data ) # marshal'ed dump from stdin
  data = Marshal.load( data ) # restored data structure
  # what you do here.........
end
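To try both halves in one script, the same length-prefixed framing can be wired up through IO.pipe. This is my own self-contained sketch; the names write_record and read_record are illustrative, not from the original post:

```ruby
# A sketch of the length-prefixed framing above: a 4-byte record length
# packed as a 32-bit integer ('l'), followed by the Marshal payload.
def write_record(io, data)
  dump = Marshal.dump(data)
  io.print [dump.length].pack('l')  # 4-byte length header for the sink
  io.print dump                     # then the Marshal payload itself
end

def read_record(io)
  header = io.read(4) or return nil # nil at end of stream
  Marshal.load(io.read(header.unpack('l').first))
end

rd, wr = IO.pipe
[rd, wr].each(&:binmode)            # binary mode; matters on Windows
write_record(wr, [1, 2.5, "three"])
wr.close
restored = read_record(rd)          # => [1, 2.5, "three"]
```

In a real pipeline the writer would use $stdout and the reader $stdin, exactly as in the snippets above.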

I don’t think this is implemented in a standard way anywhere in Ruby (or
any other language), but it looks to me like a really, really good idea.

-Carlos

On Jan 5, 2008, at 15:37 PM, Carlos J. Hernandez wrote:

I don’t think this is implemented in a standard way anywhere in Ruby (or
any other language), but it looks to me like a really, really good idea.

You’ve written the core of DRb, which is these data pipes expanded to
a multi-process, multi-machine distributed programming tool.

Eric, thanks for your comment.
I’ll look again, but I don’t think I saw in DRb the simplicity achieved
by bash as in:

cat source.txt | filter | sort > result.txt

I’m saying cat, filter, and sort could be ruby programs piping Marshal
data structures.
-Carlos

Robert:
Thanks for your performance improvement suggestion.
I did not think of giving Marshal $stdout.
But the problem remains that I don’t know ahead of time how many bytes
the Marshal data will have, and
I can no longer use "\n", the input line separator, as a record
separator.

As for general usefulness:
If you already have general-purpose cat, filter, transform, and sort
programs…
And just want to see the results of manipulating the contents of some
source file…
Then just say
cat source.txt | transform | filter | sort > result.txt
I do this kind of thing all the time; I just have not programmed that
way before.
I just started because the model is useful in my data downloads, where
I download history CSVs from Finance.Yahoo.com and transform the data
along the way before appending it to my data files.
There is an impedance problem, though,
in having to flatten a data structure that contains floats,
integers, and dates
back to a CSV line every time you go through the pipe, and then restore
it in the receiver.
Marshal solves this, except that "\n" can no longer be used as a record
separator.
Marshal is more efficient; that’s why someone wrote it.
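To make the impedance point concrete, here is my own minimal comparison: round-tripping typed data through CSV text flattens everything to strings, while Marshal restores the original types:

```ruby
# Round-trip the same row through CSV text and through Marshal.
require 'csv'

row = [42, 3.14, "AAPL"]

csv_back     = CSV.parse_line(row.join(','))   # everything comes back a String
marshal_back = Marshal.load(Marshal.dump(row)) # types survive intact
```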

Lastly, computers will be multi-processing from here on…
Faster chips are finding their physical limits.

BTW, I have an implementation of Marshal Pipes, just as I described in
my opening email.
It works great.

-Carlos

2008/1/8, Carlos J. Hernandez [email protected]:

Eric, thanks for your comment.
I’ll look again, but I don’t think I saw in DRb the simplicity achieved
by bash as in:

cat source.txt | filter | sort > result.txt

That line makes you eligible for a “useless cat award”.

I’m saying cat, filter, and sort could be ruby programs piping Marshal
data structures.

Your solution is still too complicated: you do not need the byte
transfer - in fact, it may be disadvantageous because you need the
full marshaled representation in memory before you can send it. This
is not very nice for streaming processing. Instead, simply directly
marshal data into the pipe:

$ ruby -e '10.times {|i| Marshal.dump(i, $stdout) }' | ruby -e 'until
$stdin.eof?; p Marshal.load($stdin) end'
0
1
2
3
4
5
6
7
8
9

The question is: how often do you actually need the processing power
of two processes? On a single-core machine the code is probably as
efficient with a single Ruby process (probably using multiple threads),
and you do not need the piping complexity and marshaling overhead.
For tasks that involve IO, Ruby threads work pretty well. So I’d be
interested to hear: what is the use case for your solution?

Kind regards

robert

2008/1/8, Carlos J. Hernandez [email protected]:

Robert:
Thanks for your performance improvement suggestion.
I did not think of giving Marshal $stdout.
But the problem remains that I don’t know ahead of time how many bytes

No, this is not a problem because Marshal.load will take care of this
(as you can see from the command line example I posted).

the Marshal data will have and
I can no longer use “\n”, the input line separator, as a record
separator.

Not needed as said before.

As for general usefulness.
If you already have a general purpose cat, filter, transform, and sort
programs…
And just want to see the results of manipulating the contents of some
source file…
Then just say
cat source.txt | transform | filter | sort > result.txt

… and get another “useless cat award”. :-)

it back in the receiver.
Marshal solves this, except that “\n” can no longer be used as record
separators.

Marshal basically just hides the conversion and makes it faster. The
conversion is still there: you have a data structure (say an array),
transform it into a sequence of bytes (either CSV or Marshal format),
send it through a pipe, transform byte sequence back (either from CSV
or Marshal format) and get out the array again. That’s why I say it’s
more efficient to not use two processes but do it in one Ruby process
most of the time (i.e., on a single-core machine or with IO-bound stuff).
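The in-process alternative Robert describes (threads and a Queue) can be sketched as follows; the doubling stage and the :done sentinel are my own illustrative choices:

```ruby
# Producer and consumer threads joined by a Queue: no pipe framing or
# marshaling needed, since objects are passed by reference in-process.
queue   = Queue.new   # thread-safe FIFO from Ruby's core library
results = []

producer = Thread.new do
  5.times { |i| queue << i }
  queue << :done                 # sentinel marking end of the stream
end

consumer = Thread.new do
  while (item = queue.pop) != :done
    results << item * 2          # the "transform" stage, done in-process
  end
end

[producer, consumer].each(&:join)
# results is now [0, 2, 4, 6, 8]
```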

Marshal is more efficient, that’s why someone wrote it.

Not only that. Marshal serves a slightly different purpose, namely
converting object graphs, which can contain loops, into a byte stream
and resurrecting the graph from the byte stream.

Lastly, computers will be multi-processing from here on…
Faster chips are finding their physical limits.

But OTOH Ruby will rather sooner than later use native threads, and a
multithreaded application is easier and, in this particular case, also
more efficient (unless you use tons of memory per processing step)
because you do not need the conversion for IPC. Do you actually
/need/ that processing power?

BTW, I have an implementation of Marshal Pipes, just as I described in
my opening email.
It works great.

That’s nice for you. But you proposed a general solution in your
original posting. At least that’s what I picked up from your last
statements. With this (public!) discussion we are trying to find out
whether it is actually a good idea for the general audience. So far
I haven’t been convinced that it is indeed.

Kind regards

robert

On Jan 7, 2008, at 10:09 PM, Carlos J. Hernandez wrote:

I’ll look again, but I don’t think I saw in DRb the simplicity
achieved
by bash as in:

cat source.txt | filter | sort > result.txt

I’m saying cat, filter, and sort could be ruby programs piping Marshal
data structures.

check out ruby queue (rq) - it uses that paradigm but, instead of
marshal’d data, it uses yaml which accomplishes the same goal without
giving up human readability. for instance one might do (simplified)

rq q query tag==foobar

jid: 1
tag: foobar
command: processing_stage_a input

so query is dumping a job object, as yaml. then you do

!! | rq q update priority=42 -

which is to say use the output of the last command, a ruby object, and
input that into the next command, which takes a job, or jobs, on stdin
when ‘-’ is given, and update that job in the queue

you can also do things like

rq q query priority=42 tag=foobar | rq q resubmit -

etc.

the pattern is a good one - but i wouldn’t touch marshal data over
yaml for the commandline with a ten foot pole: one slip and you’ll
blast out chars that will hose the display or disconnect your ssh
session. also, yaml provides natural document separators so you can
embed more than one set in a stream separated by --- which allows for
chunking of huge output streams
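the document-separator point can be seen directly with ruby's yaml library; a minimal sketch (the 'jid' key just echoes the rq example above):

```ruby
# Each to_yaml document begins with its own "---" line, so concatenating
# documents yields a valid multi-document stream; load_stream splits it
# back into an array of objects.
require 'yaml'

stream = [{ 'jid' => 1 }, { 'jid' => 2 }].map(&:to_yaml).join
docs   = YAML.load_stream(stream)  # => [{"jid"=>1}, {"jid"=>2}]
```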

food for thought.

kind regards.

a @ http://codeforpeople.com/

Ara:

YAML is fine over an internet connection, where transmission time is
high compared to CPU time and where human readability is a plus.
For my case, separate programs/processes on the same machine working
very closely, as if a single program in a pipe architecture, Marshal is
better.
In fact, if Marshal is a bit of a hybrid (I don’t know the details), then
what I really want is pure binary, I think.

Anyways, for a bit more detail on my implementation,
taking out the specifics of my application and including Robert’s
comments,
I now have:

class MarshalPipe
  def self.puts(data)
    Marshal.dump( data, $stdout )
  end

  def _pipe
    data = nil
    while data = Marshal.load($stdin) do
      pipe(data)
      break if $stdin.eof?
    end
  end
end

I don’t know why this did not work:

until $stdin.eof do
  data = Marshal.load($stdin)
  pipe( data )
end

Robert:

ruby -e '10.times {|i| Marshal.dump(i, $stdout) }' | ruby -e 'until
$stdin.eof?; p Marshal.load($stdin) end'

THANKS!!!
I did not recognize it at first read, because it’s a bit cryptic.
-Carlos

On Jan 7, 2008, at 4:58 PM, Eric H. wrote:

On Jan 5, 2008, at 15:37 PM, Carlos J. Hernandez wrote:

I don’t think this is implemented in a standard way anywhere in
Ruby (or
any other language), but
looks to me like a really, really good idea.

You’ve written the core of DRb, which is these data pipes expanded
to a multi-process, multi-machine distributed programming tool.

I’m really looking to get into DRb, but its DSL and stuff is a
little… daunting… Is there a slightly toned-down wrapper for it
or an alternative?
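For simple cases, DRb can be used without much ceremony at all: a front object and a URI are enough. This is my own hedged sketch; the Doubler class is illustrative, and port 0 asks DRb to pick a free port, with DRb.uri reporting the actual address:

```ruby
# A minimal DRb round trip: serve one object, call a method on it remotely.
require 'drb/drb'

class Doubler
  def double(x)
    x * 2
  end
end

DRb.start_service('druby://localhost:0', Doubler.new)
remote = DRbObject.new_with_uri(DRb.uri)
result = remote.double(21)  # the call travels over the wire and back
DRb.stop_service
```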

On 08.01.2008 20:01, Carlos J. Hernandez wrote:

def _pipe
data = nil
while data = Marshal.load($stdin) do
pipe(data)

What does #pipe do? Why don’t you use a block for the processing of the
data? For a general (aka library) solution it would also be much better
to pass the IO as an argument, in case there are more pipes to work
with.

end

Probably because this is not the same as my code (hint: punctuation
matters).

BTW, I am still interested to learn the use case where your solution is
significantly better than an in-process solution with Threads and a
Queue…

Regards

robert

On Wed, 9 Jan 2008 07:00:04 +0900, “Robert K.”
[email protected] said:…

What does #pipe do? Why don’t you use a block for the processing of the
data? For a general (aka library) solution it would also be much better
to pass the IO as an argument, in case there are more pipes to work
with.

Yep! Like a yield statement you mean. I agree.

As for multiple pipe sources and your question of general usefulness…
(I read somewhere that lack of multiple IO is a known issue with UNIX pipes)
I’m just thinking bash, shell scripting.
I don’t mean to ignite a language war.
I just think Bash, Ruby, and C make a terrific team.
Also, setting up the pipes seems to be best done from outside,
which makes it best fitted for shell scripting.

Anyways, the missing “?” was a typo.
The following, which as I read it should work,

until $stdin.eof? do
  data = Marshal.load($stdin) # <= Error here
  pipe( data )
end

still gives the following error:

buffer already filled with text-mode content

$stdin.eof? is necessary, though, as
a different error is triggered if Marshal tries to load at EOF.

-Carlos

class MarshalPipe
  def self.puts(data)
    Marshal.dump(data, $stdout)
  end

  def self.each
    data = nil
    begin
      while data = Marshal.load($stdin) do
        yield data
      end
    rescue EOFError
      # rudely ignore
    end
  end
end

I guess a class/module like the above is as clean and simple as I can
get it. An mp2mp-type pipe would be…

require 'marshal_pipe'
MarshalPipe.each { |data|
  transformed_data = transform( data ) # <= do something
  MarshalPipe.puts transformed_data
}

A quick csv2mp could be…

require 'MarshalPipe.rb'
require 'csv'
$stdin.each { |line|
  data = []
  CSV.parse_line( line.strip ).each {|item|
    case item
    when /^-?\d+(\.\d+)?$/
      data.push( ($1)? item.to_f: item.to_i )
      # maybe add date handling or any other data type…
    else
      # simple string
      data.push( item )
    end
  }
  MarshalPipe.puts data
}

and a mp2txt

require 'MarshalPipe.rb'
MarshalPipe.each { |data|
  puts data.join("\t")
}

A hastily written csv2mp needs cat (not knowing how to read files)…

cat source.csv | csv2mp | mp2mp | mp2txt > result.txt

But one could argue for making MarshalPipe a template for making pipes
in general.
That’d be more like what I’m actually using, except
without the much nicer MarshalPipe.puts and MarshalPipe.each.

Carlos Hernandez wrote:

A quick csv2mp could be…

require 'MarshalPipe.rb'
require 'csv'
$stdin.each { |line|

A hastily written csv2mp needs cat (not knowing how to read files)…

cat source.csv | csv2mp | mp2mp | mp2txt > result.txt

Use ARGF instead of $stdin, and you read files for free.

On Thu, 10 Jan 2008 06:16:35 +0900, “Joel VanderWerf”
[email protected] said:

Use ARGF instead of $stdin, and you read files for free.

vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Cool!!! Thanks! So

csv2mp < source.txt | …

the filehandle is in ARGF.
I guess I don’t need to explain the use of pipes to a Berkeley man, home
of Berkeley Unix.

Carlos J. Hernandez wrote:

csv2mp < source.txt | …

the filehandle is in ARGF.

Or just this:

csv2mp source.txt | …

For example:

$ cat test.txt
This is
a test
$ ruby -e 'puts ARGF.read' test.txt
This is
a test
