Forum: Ruby-core Bytecode handling (compilation) extensions to Ruby 1.9

Adam Strzelecki (Guest)
on 2008-05-31 23:19
(Received via mailing list)
Hello,

Since 1.9 is powered by the YARV VM, it is technically possible to dump
VM code (instructions) to a file and restore it later. This means that
we could load already compiled Ruby code directly into the 1.9 VM
without the related source code.

The most important application would be the deployment of commercial
Ruby programs. Often when we deploy our programs on our clients'
servers, or at some web hosting provider, we would love to keep the
source code and the ideas behind it secret. Since Ruby up to 1.8 was a
dynamic interpreter, this wasn't possible before, but it is now
possible with YARV & 1.9.

I'm aware that it would also be possible to use JRuby to get a Java
JAR with compiled Java classes, but I feel that 1.9 may have more
potential and is more lightweight than Java; moreover, it sets a kind
of standard for other distributions.
I know that 1.9 VM bytecode could be reverse-engineered; however, due
to optimizations, removal of comments, etc., it would not be as
valuable as the original source code.

So I wish to propose an extension for Ruby 1.9 that emits VM bytecode
to a file and reads/loads VM bytecode back into the VM.
My main aim is NOT to change the current Ruby behavior, NOR to add a
new file type for "compiled" Ruby, but instead to make a few light
additions:

1. "ruby" command line additions:
  -o outputfile      emit VM bytecode into outputfile instead of
                     executing (-o - for stdout)
  --omit-bootstrap   omit VM loading bootstrap in outputfile (output
                     file won't be valid Ruby program)
  -f[level]          follow statically required files and emit their
                     bytecode too; 0=don't follow, 1=only current and
                     subdirs, 2=all except system, 3=all

2. System API additions:

VM#exec(input)   Loads VM bytecode from input (IO) and executes it
VM#emit(output)  Emits all current VM instructions as bytecode into
                 output (IO)

So:

$ ruby -o myprogram_dist.rb myprogram.rb

Will emit VM bytecode with bootstrap "VM.exec DATA" as below:

--------------- myprogram_dist.rb -----------------
#!/usr/bin/env ruby
VM.exec DATA
__END__
RubyVM-1.9.0!@#!@!@%&()*@$*!)@$!@$....
---------------------------------------------------

The produced myprogram_dist.rb will be a valid Ruby program. On any
distribution or version that doesn't implement VM#exec, it raises a
class-not-found or method-not-found exception. On a version that has
VM#exec but is passed a bytecode format that is invalid or
incompatible, I believe it should raise another exception (maybe by
checking whether the magic is RubyVM-1.9.0). On a version with a
matching VM#exec, it will successfully load and execute the code in
the VM.
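
The magic check suggested above could be sketched roughly as follows. This is only an illustration in plain Ruby: the helper name `read_bytecode_payload` and the error class are hypothetical, and the magic string is simply the one from the example bundle, not a format that any Ruby release defines.

```ruby
require "stringio"

# Hypothetical magic prefix, copied from the example bundle in this post.
MAGIC = "RubyVM-1.9.0".freeze

# Hypothetical error raised when the bundle does not start with MAGIC.
class IncompatibleBytecodeError < StandardError; end

# Reads the magic prefix from +io+ and returns the remaining payload.
# Raises IncompatibleBytecodeError if the prefix does not match.
def read_bytecode_payload(io)
  header = io.read(MAGIC.bytesize)
  unless header == MAGIC
    raise IncompatibleBytecodeError,
          "expected #{MAGIC.inspect}, got #{header.inspect}"
  end
  io.read
end

payload = read_bytecode_payload(StringIO.new("RubyVM-1.9.0<payload>"))
puts payload
```

A check like this only tells the loader that the file claims to be in the right format; it says nothing about whether the instructions inside are well formed.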

Also we could have an option to omit bootstrap i.e. --omit-bootstrap,
so we get instead:

--------------- myprogram_dist.rb -----------------
RubyVM-1.9.0!@#!@!@%&()*@$*!)@$!@$....
---------------------------------------------------

Of course myprogram_dist.rb then won't be a valid Ruby program
anymore, but this may be useful for storing compiled code chunks in a
database, producing more advanced bundles or bootstraps, or maybe
downloading code on demand and executing it with VM#exec.

Note that VM#exec should load and execute the loaded code and, once it
is done, return and continue with the code that follows VM#exec, so it
is possible to stack calls:

VM.exec my_library
VM.exec my_magic_routine
# rest of the code
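
For readers on a current CRuby, the stacking behaviour can be tried out with a small sketch. It assumes Ruby 2.3 or later, where `RubyVM::InstructionSequence#to_binary` and `.load_from_binary` exist; `emit_bytecode` and `exec_bytecode` are hypothetical stand-ins for the proposed VM#emit and VM#exec, not real API.

```ruby
require "stringio"

# Hypothetical stand-in for the proposed VM#emit: compiles +source+ and
# writes its binary form to +output+ (an IO).
def emit_bytecode(source, output)
  iseq = RubyVM::InstructionSequence.compile(source)
  output.write(iseq.to_binary)
end

# Hypothetical stand-in for the proposed VM#exec: loads a binary
# instruction sequence from +input+ (an IO), runs it, then returns.
def exec_bytecode(input)
  RubyVM::InstructionSequence.load_from_binary(input.read).eval
end

# Binary-encoded in-memory buffers play the role of bytecode files.
my_library = StringIO.new("".b)
my_magic_routine = StringIO.new("".b)
emit_bytecode("def double(x); x * 2; end", my_library)
emit_bytecode("$answer = double(21)", my_magic_routine)
[my_library, my_magic_routine].each(&:rewind)

exec_bytecode(my_library)        # defines double
exec_bytecode(my_magic_routine)  # calls it
puts $answer                     # the rest of the code runs after both calls
```

The binary form is tied to the interpreter build that produced it, so this is a same-process demonstration rather than a distribution format.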

Finally, it would be nice to add an option to follow Ruby 'requires'
that are static (i.e. flatten the code).

Let's take the following file structure for our sample project, which
we want to "compile" prior to deploying:

myprogram/
   main.rb
   db.rb
   scripts.rb
   lib/
     extra.rb

We want to make a single-file bytecode bundle, myprogram.rb, instead
of several "compiled" files (one per source file).

Our entry point is here:
-------------- myprogram/main.rb ------------------
require 'db' # this is static require, we can follow it
require 'scripts' # this is static require, we can also follow it
require 'lib/extra' # this is static require, we can also follow it

require 'sequel' # this is static require, but not in the current folder

# (...)
---------------------------------------------------

We do:

$ ruby -f -o myprogram.rb myprogram/main.rb

And get myprogram.rb, which has the VM code for main.rb and all its
dependencies from the myprogram/ folder.

Our -f new option will work as follows:
  -f0 would mean 'don't follow',
  -f1 'follow only current dir and subdirs',
  -f2 'follow all except system',
  -f3 'follow all'.

When -f is not specified, the default behavior is 'don't follow'; when
-f is given without a level, the default is 'follow current dir and
subdirs'.

$ ruby -f2 -o myprogram.rb myprogram/main.rb

Will also include 'sequel' VM bytecode.

Obviously -f will not work for requires such as:
require SomeModule.getdependencies
require Config.getpath + 'tests'

They can only be evaluated at runtime, but that's not a problem. It
would be nice, though, if ruby reported what it is compiling:

$ ruby -f -o myprogram.rb myprogram/main.rb
Compiling:
   main.rb
   db.rb
   scripts.rb
   lib/extra.rb
myprogram.rb done in 0.005 second(s)
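
To make the static/dynamic distinction concrete, here is a rough sketch of how a -f1 style pass might pick out followable requires. It is a line-based approximation, not a real parser; the names `STATIC_REQUIRE`, `static_requires` and `followable_requires` are hypothetical, and the project file list is passed in by hand instead of being read from disk.

```ruby
# Matches only requires whose argument is a plain string literal,
# optionally followed by a trailing comment.
STATIC_REQUIRE = /^\s*require\s+(['"])([^'"#]+)\1\s*(?:#.*)?$/

# Returns the feature names of all static requires found in +source+.
def static_requires(source)
  source.each_line.map { |line| line[STATIC_REQUIRE, 2] }.compact
end

# Keeps only the static requires that resolve to a file in the project
# (the -f1 behaviour: current dir and subdirs).
def followable_requires(source, project_files)
  static_requires(source).select do |name|
    project_files.include?("#{name}.rb")
  end
end

main_rb = <<~'RUBY'
  require 'db' # static, in the project
  require 'scripts'
  require 'lib/extra'
  require 'sequel' # static, but outside the project
  require SomeModule.getdependencies
RUBY

project = ["main.rb", "db.rb", "scripts.rb", "lib/extra.rb"]
puts static_requires(main_rb).inspect
puts followable_requires(main_rb, project).inspect
```

A real implementation would have to work on the parsed program, since a regular expression cannot tell a require inside a string or a conditional from a top-level one.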

That's all. I'd appreciate any comments on my proposal.

I'm new to this mailing list, so I hope you don't find me too
impudent :) I fell in love with Ruby a few months ago, while
programming regularly in C/C++ & PHP.
I've used PHP bytecode compilers for some of my past projects, and I
miss this possibility in Ruby.

Best regards,
Adam Strzelecki (Guest)
on 2008-06-05 12:08
(Received via mailing list)
Hello again,

Since there's no answer to my post :(, I just want to make sure that
emit/load of bytecode is possible with Ruby 1.9.

I know we can disasm any ruby program code with:

$ ruby-1.9 --dump=insns hello.rb
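
The same listing can also be produced from inside a program. A minimal sketch, assuming a current CRuby where the class is spelled `RubyVM::InstructionSequence` rather than the `VM::InstructionSequence` used in this thread:

```ruby
# Compile a one-line program and print its human-readable instruction
# listing, much like the --dump=insns command line option does.
iseq = RubyVM::InstructionSequence.compile("puts 'hello'")
listing = iseq.disasm
puts listing
```

The exact instructions in the listing vary between Ruby versions, so it is best treated as a debugging aid and not as a stable format.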

I'd like to try implementing VM seq serialization in iseq.c:

VALUE ruby_iseq_serialize(VALUE self);   /* VM seq -> bytecode */
VALUE ruby_iseq_unserialize(VALUE self); /* bytecode -> VM seq */

Then I'd use those new functions to implement compilation and loading
of VM bytecode.

So if someone can tell me whether this is a waste of time because of
something I may not know about YARV, please do.

Regards,
ts (Guest)
on 2008-06-05 12:24
(Received via mailing list)
Adam Strzelecki wrote:
> Since there's no answer to my post :(, I just want to make sure that
> emit/load of bytecode is possible with Ruby 1.9.

 Have you looked at VM::InstructionSequence#to_a,
 VM::InstructionSequence::load ?

    /* disable this feature because there is no verifier. */
    /* rb_define_singleton_method(rb_cISeq, "load", iseq_s_load, -1); */
    (void)iseq_s_load;


Guy Decoux
SASADA Koichi (Guest)
on 2008-06-05 12:30
(Received via mailing list)
Hi,

Adam Strzelecki wrote:
> Since there's no answer to my post :(, I just want to make sure that
> emit/load of bytecode is possible with Ruby 1.9.

I missed your post.

> bytecode.
>
> So if someone can tell me if it is waste of time because of something I
> may not know about YARV, please do.

In fact, there is:
   VALUE iseq_data_to_ary(rb_iseq_t *iseq)
   VALUE iseq_load(VALUE self, VALUE data, VALUE parent, VALUE opt)

to_ary() converts an ISeq object to an Array made up of well-known
objects such as Numbers, Regexps, Strings, and so on.  iseq_load()
loads such a data structure.  The point of this feature is that you
can write YARV assembler in a Ruby script, YAML, or something similar.
YASM, the YARV bytecode assembler, uses this feature.

ISeq#to_a is supported, but ISeq.load is not supported because there
is no bytecode verifier.  You can cause a SEGV easily.

This feature is convenient to use, but it is not the optimal one.
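
As a quick way to see what the array form looks like, the following sketch can be run on a current CRuby, where the class is spelled `RubyVM::InstructionSequence`. The variable names are only illustrative, and the exact layout of the array is an implementation detail that may differ between versions.

```ruby
# Compile a tiny expression and convert it to its plain-object form.
iseq = RubyVM::InstructionSequence.compile("1 + 2")
data = iseq.to_a

# The first four fields form a header: a magic string, two version
# numbers and a format type.
magic, major, minor, format_type = data.first(4)
puts magic
puts [major, minor, format_type].inspect

# The last field is the instruction list: line numbers, labels and
# one small array per instruction.
puts data.last.inspect
```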

We are now developing the following features (as practical coursework
at our university; one student is working on this topic):

  (1) ISeq -> packed external file
  (2) ISeq -> external file -> .c file

(1) is like the Java class file approach.  (2) lets us embed Ruby code
in the Ruby core or in extension libraries written in C.
Adam Strzelecki (Guest)
on 2008-06-05 13:54
(Received via mailing list)
> to_ary() convert ISeq object to Array and well known objects such as
> Numbers, Regexps, Strings, and so on.  iseq_load() loads such data
> structure.  Point of this feature is you can write down YARV
> assembler with Ruby Script or YAML or something.  YASM, the YARV
> bytecode assembler use this feature.

> Have you looked at VM::InstructionSequence#to_a,
> VM::InstructionSequence::load ?

Ahh, there it is :) Thank you both for pointing that out to me. I'll
start playing around there.

> ISeq#to_a is supported, but ISeq.load is not supported because of
> absence of bytecode verifier.  You can cause SEGV easily.

I see, it is commented out in the source. Would CRC + magic cookie +
version checking be enough?

> Now, we are developing following features (on a time of practical
> work in our university, 1 student is working on this topic):
>
> (1) ISeq -> external file that packed
> (2) ISeq -> external file -> .c file
>
> (1) is like Java class file approach.  (2) make us happy with
> implanting Ruby codes to Ruby core or extension libraries written in
> C.

Great!

Have you thought about using LLVM for ISeq -> LLVM -> native? This
would be a way towards a HotSpot-like YARV, or a native Ruby compiler
producing a .o file that is then linked with the Ruby runtime into a
single ELF or MSCOFF binary.

Evan Phoenix from Rubinius is testing LLVM for integrating with their VM
http://blog.fallingsnow.net/2008/05/23/simple-vm-j...

I think it may be interesting for mainstream Ruby too.

BTW, what about following every "require" that is not dynamically
referenced, instead of generating a Kernel.require call? Could
triggering rb_load at the core parsing stage be an option? (See my 1st
mail.)
I don't know if it makes sense, but if we want to compile a whole
project that starts from "start.rb" and requires other files from the
current folder & subfolders into one single binary form, we'd need
such a trick.
Or, asking it the other way round: can Kernel.require check whether
the target file was already loaded by ruby and just call the relevant
ISeq chunk?
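
The second idea, a require that prefers an already loaded ISeq chunk, might look roughly like this on a current CRuby. Everything here is hypothetical: `bundle_feature`, `bundled_require` and the two registries are made-up names, and Kernel.require itself is left untouched.

```ruby
BUNDLED_ISEQS = {}    # feature name => compiled instruction sequence
BUNDLED_LOADED = {}   # feature name => true once it has been evaluated

# Registers +source+ as a precompiled chunk under the name +feature+.
def bundle_feature(feature, source)
  BUNDLED_ISEQS[feature] =
    RubyVM::InstructionSequence.compile(source, feature)
end

# Evaluates the bundled chunk the first time it is asked for, and
# falls back to the normal require for anything not in the bundle.
def bundled_require(feature)
  iseq = BUNDLED_ISEQS[feature]
  return require(feature) if iseq.nil?
  return false if BUNDLED_LOADED[feature]
  BUNDLED_LOADED[feature] = true
  iseq.eval
  true
end

bundle_feature("hypothetical_db", "$db_loads = ($db_loads || 0) + 1")
first  = bundled_require("hypothetical_db")
second = bundled_require("hypothetical_db")
puts [first, second, $db_loads].inspect
```

Like require, the helper returns true on the first load and false afterwards, so bundled code runs only once.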

Best regards,
Adam Strzelecki (Guest)
on 2008-06-05 16:51
(Received via mailing list)
Just wanted to say thanks again, it works like a charm. I've reenabled
rb_define_singleton_method(rb_cISeq, "load", iseq_s_load, -1) and it
rocks:

--- test_vm_save_restore.rb ----------------------------
#!/usr/bin/env ruby-1.9
require 'zlib'

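# Compile the source file to a VM instruction sequence.
code = VM::InstructionSequence.compile_file 'sample.rb'

# Marshal the array form, compress it and write it out in binary mode.
File.open "simple.rbvm", "wb" do |file|
   file << Zlib::Deflate.deflate(Marshal.dump(code.to_a))
end

code = nil # Make sure we don't work with old code anymore

# Read it back, decompress, unmarshal and rebuild the instruction sequence.
File.open "simple.rbvm", "rb" do |file|
   code = VM::InstructionSequence.load(
     Marshal.restore(Zlib::Inflate.inflate(file.read)))
end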

code.eval
--------------------------------------------------------

So would iseq_load checking magic, version1, version2 and format_type,
and raising an exception if they don't match, be enough validation to
reenable VM::InstructionSequence.load in the official repository?
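
In Ruby terms, the header check I have in mind would be something like the sketch below. It assumes a current CRuby (`RubyVM::InstructionSequence`) and that the first four fields of `to_a` are the magic, the two version numbers and the format type; `check_header!` and `BytecodeFormatError` are hypothetical names.

```ruby
# Hypothetical error raised when the header fields do not match.
class BytecodeFormatError < StandardError; end

# Header fields [magic, major, minor, format_type] as produced by the
# interpreter that is running this code.
EXPECTED_HEADER =
  RubyVM::InstructionSequence.compile("").to_a.first(4).freeze

# Returns +data+ if its header matches this interpreter, raises otherwise.
def check_header!(data)
  unless data.is_a?(Array) && data.first(4) == EXPECTED_HEADER
    raise BytecodeFormatError,
          "expected header #{EXPECTED_HEADER.inspect}"
  end
  data
end

good = RubyVM::InstructionSequence.compile("1 + 2").to_a
check_header!(good)
puts "header ok"
```

This only rejects data from a different format or version; it does not look at the instructions themselves.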

Cheers,
SASADA Koichi (Guest)
on 2008-06-05 17:42
(Received via mailing list)
Hi,

Adam Strzelecki wrote:
> So iseq_load checking magic, version1, version2, format_type and raising
> exception if they don't match would be enough validation to reenable
> VM::InstructionSequence.load in official repository?

If the bytecode consists only of "leave", it will cause a SEGV (the
"leave" instruction needs 1 stack operand).  A bytecode verifier must
guarantee this consistency.
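
The kind of consistency check meant here can be illustrated with a toy stack-depth verifier. This is only a sketch: the `STACK_EFFECT` table covers a handful of instructions with simplified, hypothetical [pops, pushes] values, it handles straight-line code only, and it is not how any real YARV verifier is implemented.

```ruby
# Hypothetical stack effects [pops, pushes] for a tiny instruction subset.
STACK_EFFECT = {
  putnil:    [0, 1],
  putobject: [0, 1],
  dup:       [1, 2],
  pop:       [1, 0],
  leave:     [1, 0],
}.freeze

# Hypothetical error raised when the instruction list is inconsistent.
class VerifyError < StandardError; end

# Walks +instructions+ (arrays like [:putobject, 1]) and raises if any
# instruction would pop more values than the stack holds at that point.
def verify_stack!(instructions)
  depth = 0
  instructions.each_with_index do |(name, *_operands), index|
    pops, pushes = STACK_EFFECT.fetch(name) do
      raise VerifyError, "unknown instruction #{name} at #{index}"
    end
    if depth < pops
      raise VerifyError,
            "#{name} at #{index} needs #{pops} operand(s), stack has #{depth}"
    end
    depth += pushes - pops
  end
  true
end

puts verify_stack!([[:putobject, 1], [:leave]])
```

A bare `[[:leave]]` is rejected by this check, which is the case described above. A real verifier would also have to follow branches, check operand types and validate catch tables.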
SASADA Koichi (Guest)
on 2008-06-05 18:02
(Received via mailing list)
Hi,

Adam Strzelecki wrote:
> I see - commented-out in the source. CRC + magic cookie + version
> checking would be enough?

An evil person can craft strange bytecode.

> Have you thought about using LLVM for:
> ISeq -> LLVM -> native, this would be a way for HotSpot YARV or native
> Ruby compiler producing .o file then linked with Ruby runtime into
> single ELF or MSCOFF.

Yes, of course.  We are considering it.

> BTW. What about following all non dynamically referenced "require" and
> instead generating Kernel.require call. Triggering rb_load at core
> parsing stage could be an option ? (See my 1st mail)

I can't understand it.  What is the option?

> Or maybe asking it upside down, can Kernel.require check whether target
> file was already loaded by ruby and just call relative ISeq chunk?

You can do it if you implement it.