Ruby Drops

Jake_McArthur · April 27, 2006, 9:49pm

This is the idea I am thinking of proposing for my Google Summer of
Code project. This is from the Google Summer of Code thread:

[…] factoring code out of their projects which they can all share.
[…] accordingly if they desire to do so.
[…] takes advantage of this be unit tested.) This would also provide
[…] development, simplify and stabilize Ruby programs, and bring a
collaborative atmosphere even to individual projects.

I’m making a thread for it because I’m looking for input (ideas,
suggestions, etc.).

Jake McArthur

Jake_McArthur · April 28, 2006, 6:46am

Come on, it can’t be that bad of an idea! Is this really going to go
over that badly if I submit this as a project proposal?

Jake McArthur

Jake_McArthur · April 28, 2006, 7:25am

On Apr 27, 2006, at 12:46 PM, Jake McArthur wrote:

My idea is to create an open source code repository, web site, and
set of tools designed to help people to automate the process of
factoring code out of their projects which they can all share.
First, it helps them to find instances of code that need to be
DRYed or DROPed by comparing lines of code across the entire code
base in the repository and pointing out lines that are similar to
things that have already been done before.

Your idea sounds good to me. It’d have the additional benefit of
helping people notice that they are writing bad code. Many people who
understand the DRY principle in abstract haven’t made the connections
to realise all the situations it applies to. And repeating other
people is easy, and it’s hard to know that you are.

I’m new here, but I’ll try to give feedback.

How will it find similar code? One simple issue is that people will
name their variables and methods differently, so you’ll want to
somehow see the structure of a section of code and ignore a lot of
details. But you can’t ignore the details too much. Maybe (trivial
example) someone wrote a “max” function and someone else in-lined it,
and otherwise their code blocks are the same.
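
One way to "see the structure and ignore the details" is to compare token streams with identifiers and literals replaced by placeholders. The following is only a rough sketch, assuming a Ruby that ships the `ripper` standard library; the `shape` helper and the placeholder names are made up for illustration and are not part of the proposal.

```ruby
require "ripper"

# Token types that get replaced by a placeholder (illustrative choice).
PLACEHOLDERS = {
  on_ident: "ID",
  on_int: "LIT",
  on_float: "LIT",
  on_tstring_content: "LIT"
}.freeze

# Token types that carry no structure: spaces, newlines, comments.
IGNORED = %i[on_sp on_nl on_ignored_nl on_comment].freeze

# Hypothetical helper: reduce Ruby source to its "shape".
def shape(source)
  Ripper.lex(source)
        .reject { |_pos, type, _text| IGNORED.include?(type) }
        .map { |_pos, type, text| PLACEHOLDERS.fetch(type, text) }
end

# Same structure, different names and numbers: the shapes are equal.
puts shape("total = price * 2\n") == shape("sum = cost * 7\n")

# An inlined "max" and a call to max have different shapes, which is
# exactly the case where ignoring details is not enough.
puts shape("big = a > b ? a : b\n") == shape("big = max(a, b)\n")
```

The second comparison shows the limit raised above: a purely token-level shape cannot tell that the two lines compute the same thing.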

If the programmer finds things within his program which he
repeated, then it should be a simple matter of factoring out to
another function or class within his code to DRY it.

I don’t think all code is simple to refactor like that. But maybe
enough is for this to be useful. Maybe most is? I don’t know.

external projects up to date in their own program. As it does
this, it runs all the programmer’s tests to make sure that it
doesn’t break something and pulls back to a previous revision if
necessary. (As such, it would practically be a requirement that
all code that takes advantage of this be unit tested.)

I don’t have much experience with unit tests. How well can they
usually withstand arbitrary changes to code with subtle bugs?

This would also provide the benefit that factored out projects can
be edited by anyone, like a wiki,

It’s a bit off-topic, but I’m not sure how good an idea wikis are.
Wikipedia gets a lot of vandalism. But worse: what happens when
people have a legitimate disagreement about how some code should be
written? “anyone can post anything” doesn’t provide a way to resolve
disagreement.

There could also be a risk of malicious code that people auto-update.

collaborative atmosphere even to individual projects.

I wonder how well the code-similarity algorithm would work for non-Ruby
code. Just curious how Ruby-specific the tests would be vs how
general.

I’m making a thread for it because I’m looking for input (ideas,
suggestions, etc.).

Hope that helped

– Elliot T.

Jake_McArthur · April 28, 2006, 7:34am

On 4/28/06, Jake McArthur wrote:

Come on, it can’t be that bad of an idea! Is this really going to go
over that badly if I submit this as a project proposal?

I think that this idea would work better as a social network which
tied together Ruby projects via tagging and rss feeds and all that
other yummy stuff.

Basically, rather than attempting the very hard task of comparing
Ruby code, where there are often many ways to do
things (and let’s not even consider meta-programming!), we tie
projects together through a bunch of community-maintained meta-data
which helps support DROP.

This might include open commenting systems on the source code, so that
people can review the code live, things which automatically sense
similar projects or similar dependencies and things like that, and
just the general cool stuff that a well set up social network could
provide.
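
As a rough illustration of "sensing similar projects or similar dependencies", projects can be compared by the overlap of their dependency (or tag) sets. This is a minimal sketch using Jaccard similarity; the project names and dependency lists in `PROJECTS` are invented sample data, and `jaccard` and `most_similar` are hypothetical helpers.

```ruby
require "set"

# Jaccard similarity: size of the intersection over size of the union.
def jaccard(a, b)
  a = a.to_set
  b = b.to_set
  union = a | b
  return 0.0 if union.empty?

  (a & b).size.to_f / union.size
end

# Invented sample data: project name => declared dependencies.
PROJECTS = {
  "blog_engine" => %w[rails redcloth rake],
  "wiki"        => %w[rails redcloth bluecloth],
  "plotter"     => %w[narray gnuplot]
}.freeze

# Name of the project whose dependencies overlap most with `name`.
def most_similar(name, projects)
  own = projects.fetch(name)
  others = projects.reject { |other, _deps| other == name }
  others.max_by { |_other, deps| jaccard(own, deps) }.first
end

puts most_similar("blog_engine", PROJECTS) # wiki shares 2 of 4 distinct dependencies
```

The same function works on community-supplied tags instead of dependencies, since it only needs two lists of strings.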

I think it would be technically impossible to implement a good
automated solution to find duplicated code. This is a more human
oriented option which could be helpful and a lot of fun.

Plus… you still have a project here… it could be implemented
nicely in Rails or Nitro or something

Jake_McArthur · April 28, 2006, 7:43am

things (and let’s not even consider meta-programming!) we tie
automated solution to find duplicated code. This is a more human
oriented option which could be helpful and a lot of fun.

Plus… you still have a project here… it could be implemented
nicely in Rails or Nitro or something

It is a very good point (I just tried to write something like this).
Experience shows that “dumb social” systems work faster and better than
“smart intellectual analysis”.
At least, it sounds like something we can try to do (while the
first idea sounded like “It would be nice if somebody had already done
this, but personally I would never even try”).

Victor.

Jake_McArthur · April 28, 2006, 4:33pm

How will it find similar code? One simple issue is that people will
name their variables and methods differently, so you’ll want to
somehow see the structure of a section of code and ignore a lot of
details. But you can’t ignore the details too much. Maybe (trivial
example) someone wrote a “max” function and someone else in-lined
it, and otherwise their code blocks are the same.

I’ve already been working on this. Right now, I’m making a simple
algorithm that works on arbitrary text and returns a number
reflecting how similar two strings are. Even this alone has been
giving fairly good results on code, even code that was written rather
differently, but my plan is to use this algorithm to compare symbols
and literals. A similar algorithm, working on a slightly larger
scale, would compare entire lines of code for similar syntax,
augmented by data from the first algorithm.
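
The post does not say which algorithm is used, so the following is only one possible stand-in: a similarity score between 0 and 1 derived from Levenshtein edit distance. The names `levenshtein` and `similarity` are illustrative, not the author's.

```ruby
# Edit distance between two strings (insertions, deletions, substitutions),
# computed row by row to keep memory use small.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# 1.0 means identical, 0.0 means nothing lines up.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

puts similarity("user_name", "username")
puts similarity("user_name", "total_price")
```

Applied to symbols and literals, a score like this could feed the line-level comparison described above; the threshold for "similar enough" would have to be tuned by experiment.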

I’m still thinking about this. Suggestions, anybody?

I don’t think all code is simple to refactor like that. But maybe
enough is for this to be useful. Maybe most is? I don’t know.

Far from all of it is, but all I meant is that there is no need to
involve the system to do it.

I don’t have much experience with unit tests. How well can they
usually withstand arbitrary changes to code with subtle bugs?

Well-tested code will not break unless a test was missed, and if a
bug is found, writing a test to cover it will practically squish that
particular bug permanently.

It’s a bit off-topic, but I’m not sure how good an idea wikis are.
Wikipedia gets a lot of vandalism. But worse: what happens when
people have a legitimate disagreement about how some code should be
written? “anyone can post anything” doesn’t provide a way to
resolve disagreement.

There could also be a risk of malicious code that people auto-update.

Disagreements could be resolved by simply forking off another
project. Everybody is happy. And anyway, if everybody agrees on
tests, and those tests pass, everybody should be happy anyway.

Well-tested projects will not be affected by malicious code because
the system would see that tests fail and revert back to the last
working version.
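
The update-then-revert behaviour can be sketched roughly as follows. This is one reading of the idea, not a design from the proposal: `newest_passing_revision` is a hypothetical helper, revisions are plain values, and the block stands in for "check out this revision and run the programmer's test suite".

```ruby
# Try candidate revisions from newest to oldest and return the first one
# whose tests pass. If none pass, stay on the current revision.
#
# current    - the revision in use now (known to pass)
# candidates - newer revisions, oldest first
# block      - returns true when the test suite passes for a revision
def newest_passing_revision(current, candidates)
  candidates.reverse_each do |revision|
    return revision if yield(revision)
  end
  current
end

# Pretend revision 6 breaks the tests; the system settles on revision 5.
chosen = newest_passing_revision(3, [4, 5, 6]) { |rev| rev != 6 }
puts chosen
```

In a real tool the block might run something like `system("rake", "test")` in a checkout of that revision, which returns true only when the command exits successfully.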

I wonder how well the code-similarity algorithm would work for non-
Ruby code. Just curious how Ruby-specific the tests would be vs how
general.

The algorithm I’m currently working on is language-agnostic, but it
doesn’t benefit from syntax parsing and the like, as my later plans would.

Jake McArthur

Jake_McArthur · April 28, 2006, 4:39pm

On Fri, 2006-04-28 at 23:30 +0900, Jake McArthur wrote:

giving fairly good results on code, even code that was written rather
differently, but my plan is to use this algorithm to compare symbols
and literals. A similar algorithm, working on a slightly larger
scale, would compare entire lines of code for similar syntax,
augmented by data from the first algorithm.

CPD uses the Burrows-Wheeler transform to find exact matches. It has
some options to ignore identifiers and literals, although that results
in false positives sometimes…

Yours,

Tom

Jake_McArthur · April 28, 2006, 4:36pm

On Fri, 2006-04-28 at 14:31 +0900, Gregory B. wrote:

I think it would be technically impossible to implement a good
automated solution to find duplicated code. This is a more human
oriented option which could be helpful and a lot of fun.

There’s always the copy/paste detector CPD for that:

http://pmd.sf.net/cpd.html

It’s got a (very basic) Ruby parser.

Yours,

tom

Jake_McArthur · April 28, 2006, 4:45pm

That’s exactly what I’m going for… something that nobody else really
wants to try. If nobody else will make something like this, then I
want to make it, or at least try; we would never reap the benefits
otherwise.

What better chance is there for this than Summer of Code? It is the
kind of project everybody secretly wants to have, but nobody would
realistically find the time for it unless they could do it as a kind
of job, and who would pay for something this risky? Really, the only
way this could happen is if I propose it for Summer of Code.

Jake McArthur

Jake_McArthur · April 28, 2006, 4:48pm

Excellent find! Gives me some good algorithms to look up at the
least, but maybe even some code to use? (I can’t see right now if it
is open source. No time right now. Gotta study for exams.)

Jake McArthur

Jake_McArthur · April 28, 2006, 4:42pm

[… social network …]

I do like the idea, but it’s a different spirit from what I’m going
for. There are tons of developer communities, and tons of source code
repositories. While I don’t think I’ve seen a social network for
developers before, I don’t really see it kicking off so well unless
it happens to be language agnostic, which is fine except that it
would only become that much more difficult to find anything that
applies to your particular project.

I’m trying to do this as a learning experience, and as something that
hasn’t been done before. While I like some social networks, there are
only so many that people can be active in. A web site like this would
have to compete with other social networks (with overlapping
functionality) in order to have anything worth using, and I know I
can’t do that. That is why I have to try something new.

This might include open commenting systems on the source code, so that
people can review the code live, things which automatically sense
similar projects or similar dependencies and things like that, and
just the general cool stuff that a well set up social network could
provide.

I thought of this, but I don’t like the idea of having to go through
so much extra trouble just to get rigged up for the system. And as
for automatically sensing similar projects, that’s just a larger
scale version of what I’m trying to do already.

Jake McArthur

Jake_McArthur · April 28, 2006, 5:19pm

Jake,

I am working on something uncannily similar to what you describe. I
imagined it to also be wiki-integrated and to present a cleaned-up
version of test code that is human readable.

Have a look at
http://liquiddevelopment.blogspot.com/2005/12/social-evolution-of-software.html

and tell me what you think.

Cheers

–
Chiaroscuro

Liquid Development: http://liquiddevelopment.blogspot.com/

Jake_McArthur · April 28, 2006, 5:28pm

The cleaned-up code (which I call ‘prozed’) looks like the coloured
bits of code at:
http://liquiddevelopment.blogspot.com/2006/01/software-roadmap.html

proze and intent (my rSpec-like framework) produce stuff like:

###############################################
people = [
  "john",
  "mike",
  "Sam"
]

story "Turning people names to uppercase"
people to upcase should be ["JOHN","MIKE","SAM"]
###############################################

and

###############################################
story "Using Cashflows"
Let’s now deal with cashflows

    with  Conventional discount = continuously compounded yield

    we  expect about 271.67748, as the
        present value of
          Cashflow  [
             100 at 1 year ,
             100 at 2 years ,
             100 at 3 years
          ],
          with interest rate at   5 percent

###############################################

This code is easily readable, and the intentional code could be
automatically imported within the editor (FreeRIDE has a nice system
to write plugins to connect to the repository) while the appropriate
gems get downloaded in the background.

What do you say? Shall we join forces?

Jake_McArthur · April 28, 2006, 4:51pm

On Fri, 28 Apr 2006, Jake McArthur wrote:

I’ve already been working on this. Right now, I’m making a simple algorithm
that works on arbitrary text and returns a number reflecting how similar two
strings are. Even this alone has been giving fairly good results on code,
even code that was written rather differently, but my plan is to use this
algorithm to compare symbols and literals. A similar algorithm, working on a
slightly larger scale, would compare entire lines of code for similar
syntax, augmented by data from the first algorithm.

I’m still thinking about this. Suggestions, anybody?

it seems down at the moment, but this is close/perfect for your needs

http://complearn.org/

google cache (until site up)

http://72.14.207.104/search?q=cache:bmlzYI4W39sJ:www.complearn.org/+complearn&hl=en&gl=us&ct=clnk&cd=1

more links

http://www.newscientist.com/article.ns?id=dn3602
http://homepages.cwi.nl/~cilibrar/musicart/trnmag.com/Stories/2003/042303/Software_sorts_tunes_042303.html

i’ve played with it and, since there are command line tools and a ruby
api, i would think you could categorize text quite easily.
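
For anyone who cannot reach the site: as far as I know, CompLearn is built around the normalized compression distance (NCD), which scores two texts by how much better they compress together than apart. The sketch below is not the CompLearn API; it only illustrates the idea with Ruby's standard `zlib`, and `compressed_size` and `ncd` are made-up names.

```ruby
require "zlib"

# Size in bytes of the deflate-compressed text.
def compressed_size(text)
  Zlib::Deflate.deflate(text, Zlib::BEST_COMPRESSION).bytesize
end

# Normalized compression distance: near 0 for very similar texts,
# near 1 for unrelated ones.
#   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
def ncd(x, y)
  cx = compressed_size(x)
  cy = compressed_size(y)
  cxy = compressed_size(x + y)
  (cxy - [cx, cy].min).to_f / [cx, cy].max
end

code = <<~RUBY
  def total(items)
    items.inject(0) { |sum, item| sum + item.price * item.quantity }
  end
RUBY
prose = "The quick brown fox jumps over the lazy dog near the frozen lake.\n"

puts ncd(code, code)  # small: the second copy compresses almost for free
puts ncd(code, prose) # larger: the texts share very little
```

On very short inputs the compressor's fixed overhead dominates, so scores are only meaningful for snippets of a reasonable length.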

we are actually playing with this to identify spatial/temporal trends in
nighttime lights satellite imagery.

cheers.

-a

Jake_McArthur · April 28, 2006, 5:28pm

Jake McArthur wrote:

giving fairly good results on code, even code that was written rather
differently, but my plan is to use this algorithm to compare symbols
and literals. A similar algorithm, working on a slightly larger
scale, would compare entire lines of code for similar syntax,
augmented by data from the first algorithm.

I’m still thinking about this. Suggestions, anybody?

(my 1st ruby-talk post not from Google groups )
cyclomatic complexity may have some value as another input
http://saikuro.rubyforge.org/

also look at gonzui and doing some kind of vector space-LSI modelling
based on ruby keywords, core and std lib methodnames, etc.
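
A very crude version of the vector-space idea: count Ruby keywords in each snippet and compare the count vectors by cosine similarity. This sketch covers only the plain vector-space step (no LSI, which would need a matrix decomposition on top), the keyword list is a partial, hand-picked one, and it naively counts keywords inside strings and comments too.

```ruby
# Partial, hand-picked keyword list used as the vector's dimensions.
RUBY_KEYWORDS = %w[
  begin case class def do else elsif end ensure if module rescue
  return unless until when while yield
].freeze

# One count per keyword, in the order of RUBY_KEYWORDS.
def keyword_vector(source)
  counts = Hash.new(0)
  source.scan(/\b[a-z_]+\b/) do |word|
    counts[word] += 1 if RUBY_KEYWORDS.include?(word)
  end
  RUBY_KEYWORDS.map { |keyword| counts[keyword] }
end

# Cosine of the angle between two vectors; 0.0 if either is all zeros.
def cosine(u, v)
  dot = u.zip(v).sum { |a, b| a * b }
  norm = Math.sqrt(u.sum { |a| a * a }) * Math.sqrt(v.sum { |b| b * b })
  norm.zero? ? 0.0 : dot / norm
end

loop_a = "while i < 10 do\n  i += 1\nend\n"
loop_b = "while n > 0 do\n  n -= 1\nend\n"
branch = "if ok\n  return 1\nelse\n  return 2\nend\n"

puts cosine(keyword_vector(loop_a), keyword_vector(loop_b))
puts cosine(keyword_vector(loop_a), keyword_vector(branch))
```

Adding core and standard library method names as further dimensions, as suggested above, only means extending the word list.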

Jake_McArthur · April 28, 2006, 5:41pm

Just had a thought… Something similar has been done before, but
targeted at a different application. Take a look at services like
MOSS http://www.cs.berkeley.edu/~aiken/moss.html, which was
designed for plagiarism detection – e.g. finding similar code across
projects. There is a nice paper linked that talks about how they go
about doing it.

Matt

On 28 Apr 2006, at 10:42 AM, Jake McArthur wrote:

propose it for Summer of Code.

–
Matt Long
University of South Florida, CRASAR
GnuPG public key: http://www.csee.usf.edu/~mtlong/public_key.html

“In mathematics you don’t understand things, you just get used to them.”

John von Neumann

Jake_McArthur · April 28, 2006, 6:05pm

Just had a thought… Something similar has been done before, but
targeted at a different application. Take a look at services like
MOSS http://www.cs.berkeley.edu/~aiken/moss.html, which was
designed for plagiarism detection – e.g. finding similar code across
projects. There is a nice paper linked that talks about how they go
about doing it.

That is an interesting paper, thanks for the link! He lists a couple of
requirements:

whitespace independence - CPD has this for several languages (C, C++,
Java) since for those languages it uses JavaCC-generated parsers that
discard whitespace. For other languages (Ruby, PHP) it also discards
whitespace but does it a bit more clunkily, at a higher level in the
framework.

noise suppression - yup, this is important since you don’t want to
catch little matches like “x.each do |y|”. CPD allows you to set the
minimum match size; I usually start at about 100 tokens and work my way
lower from there.

position independence - since you don’t want moving things around to
affect the analysis. CPD mostly has this, I think.
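
The three requirements above can be seen in a small fingerprinting sketch in the style of winnowing, which I believe is the technique the linked MOSS paper describes. This is neither MOSS nor CPD code: the function names, the CRC32 hash, and the values `k = 5` and `w = 4` are all illustrative assumptions.

```ruby
require "zlib"

# Whitespace independence: drop all whitespace (and case) up front.
def normalize(text)
  text.downcase.gsub(/\s+/, "")
end

# Hash every run of k consecutive characters (a "k-gram").
def kgram_hashes(text, k)
  chars = normalize(text)
  return [] if chars.length < k

  (0..chars.length - k).map { |i| Zlib.crc32(chars[i, k]) }
end

# From every window of w hashes keep the smallest (rightmost on ties).
# Noise suppression: text shorter than a full window yields nothing.
def winnow(hashes, w)
  picked = []
  hashes.each_cons(w).with_index do |window, start|
    smallest = window.min
    position = start + window.rindex(smallest)
    picked << [smallest, position] unless picked.last == [smallest, position]
  end
  picked.map(&:first).uniq
end

# Position independence: only hash values are kept, not where they occurred.
def fingerprints(text, k: 5, w: 4)
  winnow(kgram_hashes(text, k), w)
end

a = "x = 1\nitems.inject(0) { |sum, item| sum + item.price }\n"
b = "puts 'hi'\nitems.inject(0) {\n  |sum, item|   sum + item.price\n}\n"
puts (fingerprints(a) & fingerprints(b)).size # shared fingerprints
```

Any passage of at least `w + k - 1` characters (after normalization) that appears in both inputs contributes at least one shared fingerprint, so raising `k` and `w` plays the same role as raising the minimum match size.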

Of course, the hard problem is fixing the duplicates once you find them;
that can be a delicate job.

[SHAMELESS PLUG] If you’re interested in reading more about CPD, I’ve
got a chapter on it in my book:

Yours,

Tom

Jake_McArthur · April 28, 2006, 5:53pm

Excellent find! Gives me some good algorithms to look up at the
least, but maybe even some code to use? (I can’t see right now if it
is open source.

Yup, it’s BSD-licensed; code is here:

http://pmd.sourceforge.net/xref/net/sourceforge/pmd/cpd/package-summary.html

No time right now. Gotta study for exams.)

Best of luck!

Yours,

Tom

Jake_McArthur · April 28, 2006, 6:15pm

On Apr 28, 2006, at 7:30 AM, Jake McArthur wrote:

Well-tested projects will not be affected by malicious code because
the system would see that tests fail and revert back to the last
working version.

What if the code functioned exactly the same plus some nasty side
effects like a rootkit? Could that get through tests?
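
A toy illustration of the concern, with invented names throughout: tests that only check return values cannot tell these two implementations apart, even though one of them does something extra. Appending to `AUDIT_LOG` is a harmless stand-in for a harmful side effect.

```ruby
# Records that the hidden side effect ran (stand-in for real damage).
AUDIT_LOG = []

def honest_max(a, b)
  a > b ? a : b
end

def tampered_max(a, b)
  AUDIT_LOG << "side effect ran" # invisible to the caller
  a > b ? a : b
end

# A behavioural test suite that only looks at return values.
def passes_tests?(impl)
  impl.call(1, 2) == 2 && impl.call(5, 3) == 5 && impl.call(4, 4) == 4
end

puts passes_tests?(method(:honest_max))   # true
puts passes_tests?(method(:tampered_max)) # true as well
puts AUDIT_LOG.empty?                     # false: the side effect happened
```

So passing tests shows the expected behaviour is present, not that unexpected behaviour is absent.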

– Elliot T.

Jake_McArthur · April 28, 2006, 6:12pm

Quote: “To date, the main application of Moss has been in detecting
plagiarism in programming classes.”

Really… I’ve been known to do other things.

– Chiaroscuro
