Forum: Ruby Ruby Drops

Jake McArthur (Guest)
on 2006-04-27 23:49
(Received via mailing list)
This is the idea I am thinking of proposing for my Google Summer of
Code project. This is from the Google Summer of Code thread:

> factoring code out of their projects which they can all share.
> accordingly if they desire to do so.
> takes advantage of this be unit tested.) This would also provide
> development, simplify and stabilize Ruby programs, and bring a
> collaborative atmosphere even to individual projects.

I'm making a thread for it because I'm looking for input (ideas,
suggestions, etc.).

- Jake McArthur
Jake McArthur (Guest)
on 2006-04-28 08:46
(Received via mailing list)
Come on, it can't be that bad of an idea! Is this really going to go
over that badly if I submit this as a project proposal?

- Jake McArthur
Elliot T. (Guest)
on 2006-04-28 09:25
(Received via mailing list)
On Apr 27, 2006, at 12:46 PM, Jake McArthur wrote:

>> My idea is to create an open source code repository, web site, and
>> set of tools designed to help people to automate the process of
>> factoring code out of their projects which they can all share.
>> First, it helps them to find instances of code that need to be
>> DRYed or DROPed by comparing lines of code across the entire code
>> base in the repository and pointing out lines that are similar to
>> things that have already been done before.

Your idea sounds good to me. It'd have the additional benefit of
helping people notice that they are writing bad code. Many people who
understand the DRY principle in abstract haven't made the connections
to realise all the situations it applies to. And repeating other
people is easy, and it's hard to know that you are.

I'm new here, but I'll try to give feedback.


How will it find similar code? One simple issue is that people will
name their variables and methods differently, so you'll want to
somehow see the structure of a section of code and ignore a lot of
details. But you can't ignore the details too much. Maybe (trivial
example) someone wrote a "max" function and someone else in-lined it,
and otherwise their code blocks are the same.

>> If the programmer finds things within his program which he
>> repeated, then it should be a simple matter of factoring out to
>> another function or class within his code to DRY it.

I don't think all code is simple to refactor like that. But maybe
enough is for this to be useful. Maybe most is? I don't know.

>> external projects up to date in their own program. As it does
>> this, it runs all the programmer's tests to make sure that it
>> doesn't break something and pulls back to a previous revision if
>> necessary. (As such, it would practically be a requirement that
>> all code that takes advantage of this be unit tested.)

I don't have much experience with unit tests. How well can they
usually withstand arbitrary changes to code with subtle bugs?

>> This would also provide the benefit that factored out projects can
>> be edited by anyone, like a wiki,

It's a bit off-topic, but I'm not sure how good an idea wikis are.
Wikipedia gets a lot of vandalism. But worse: what happens when
people have a legitimate disagreement about how some code should be
written? "anyone can post anything" doesn't provide a way to resolve
disagreement.

There could also be a risk of malicious code that people auto-update.

>> collaborative atmosphere even to individual projects.
I wonder how well the code-similarity algorithm would work for non-
Ruby code. Just curious how Ruby-specific the tests would be vs how
general.

>
> I'm making a thread for it because I'm looking for input (ideas,
> suggestions, etc.).

Hope that helped :)

-- Elliot T.
http://www.curi.us/blog/
Gregory B. (Guest)
on 2006-04-28 09:34
(Received via mailing list)
On 4/28/06, Jake McArthur <removed_email_address@domain.invalid> wrote:
> Come on, it can't be that bad of an idea! Is this really going to go
> over that badly if I submit this as a project proposal?

I think that this idea would work better as a social network which
tied together Ruby projects via tagging and rss feeds and all that
other yummy stuff.

Basically, rather than trying to accomplish the very hard task of
trying to compare ruby code, where there are often many ways to do
things (and let's not even consider meta-programming!)  we tie
projects together through a bunch of community maintained meta-data
which helps support DROP.

This might include open commenting systems on the source code, so that
people can review the code live, things which automatically sense
similar projects or similar dependencies and things like that, and
just the general cool stuff that a well set up social network could
provide.

I think it would be technically impossible to implement a good
automated solution to find duplicated code.  This is a more human
oriented option which could be helpful and a lot of fun.

Plus... you still have a project here... it could be implemented
nicely in Rails or Nitro or something :)
Victor S. (Guest)
on 2006-04-28 09:43
(Received via mailing list)
> things (and let's not even consider meta-programming!)  we tie
> automated solution to find duplicated code.  This is a more human
> oriented option which could be helpful and a lot of fun.
>
> Plus... you still have a project here... it could be implemented
> nicely in Rails or Nitro or something :)

It is a very good point (I just tried to write something like this).
Life shows that "dumb social" systems work faster and better than
"smart intellectual analysis".
At least, it sounds like something we can at minimum *try* to do (while
the first idea sounded like "It would be nice if somebody had already
done this, but personally I would never even try").

Victor.
Jake McArthur (Guest)
on 2006-04-28 18:33
(Received via mailing list)
> How will it find similar code? One simple issue is that people will
> name their variables and methods differently, so you'll want to
> somehow see the structure of a section of code and ignore a lot of
> details. But you can't ignore the details too much. Maybe (trivial
> example) someone wrote a "max" function and someone else in-lined
> it, and otherwise their code blocks are the same.

I've already been working on this. Right now, I'm making a simple
algorithm that works on arbitrary text and returns a number
reflecting how similar two strings are. Even this alone has been
giving fairly good results on code, even code that was written rather
differently, but my plan is to use this algorithm to compare symbols
and literals. A similar algorithm, working on a slightly larger
scale, would compare entire lines of code for similar syntax,
augmented by data from the first algorithm.

I'm still thinking about this. Suggestions, anybody?
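
For illustration only, here is a minimal sketch of one well-known way to get "a number reflecting how similar two strings are": the Dice coefficient over character bigrams. This is not necessarily the algorithm described above; the method names `bigrams` and `similarity` are made up for the example.

```ruby
# Hypothetical helper: split a string into overlapping two-character chunks,
# after lowercasing and collapsing runs of whitespace.
def bigrams(text)
  text.downcase.gsub(/\s+/, " ").strip.chars.each_cons(2).map(&:join)
end

# Dice coefficient over bigrams: 0.0 means nothing shared, 1.0 means identical.
def similarity(a, b)
  x = bigrams(a)
  y = bigrams(b)
  return 1.0 if x.empty? && y.empty?
  return 0.0 if x.empty? || y.empty?

  # Count shared bigrams as a multiset intersection.
  remaining = Hash.new(0)
  y.each { |gram| remaining[gram] += 1 }
  shared = 0
  x.each do |gram|
    next unless remaining[gram] > 0
    remaining[gram] -= 1
    shared += 1
  end
  (2.0 * shared) / (x.size + y.size)
end

puts similarity("total = a + b", "sum = a + b")
puts similarity("total = a + b", "puts 'hello'")
```

A score like this works on arbitrary text, which is why it is language agnostic, but it knows nothing about syntax.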

> I don't think all code is simple to refactor like that. But maybe
> enough is for this to be useful. Maybe most is? I don't know.

By far it is not, but all I meant is that there is no need to mess
with the system to do it.

> I don't have much experience with unit tests. How well can they
> usually withstand arbitrary changes to code with subtle bugs?

Well-tested code will not break unless a test was missed, and if a
bug is found, writing a test to cover it will practically squish that
particular bug permanently.

> It's a bit off-topic, but I'm not sure how good an idea wikis are.
> Wikipedia gets a lot of vandalism. But worse: what happens when
> people have a legitimate disagreement about how some code should be
> written? "anyone can post anything" doesn't provide a way to
> resolve disagreement.
>
> There could also be a risk of malicious code that people
> auto-update.

Disagreements could be resolved by simply forking off another
project. Everybody is happy. And anyway, if everybody agrees on
tests, and those tests pass, everybody should be happy anyway.

Well-tested projects will not be affected by malicious code because
the system would see that tests fail and revert back to the last
working version.

> I wonder how well the code-similarity algorithm would work for non-
> Ruby code. Just curious how Ruby-specific the tests would be vs how
> general.

The algorithm I'm currently working on is language agnostic, but it
doesn't yet benefit from syntax parsing and the like, as my plans call for.

- Jake McArthur
Tom C. (Guest)
on 2006-04-28 18:36
(Received via mailing list)
On Fri, 2006-04-28 at 14:31 +0900, Gregory B. wrote:
> I think it would be technically impossible to implement a good
> automated solution to find duplicated code.  This is a more human
> oriented option which could be helpful and a lot of fun.

There's always the copy/paste detector CPD for that:

http://pmd.sf.net/cpd.html

It's got a (very basic) Ruby parser.

Yours,

tom
Tom C. (Guest)
on 2006-04-28 18:39
(Received via mailing list)
On Fri, 2006-04-28 at 23:30 +0900, Jake McArthur wrote:
> giving fairly good results on code, even code that was written rather
> differently, but my plan is to use this algorithm to compare symbols
> and literals. A similar algorithm, working on a slightly larger
> scale, would compare entire lines of code for similar syntax,
> augmented by data from the first algorithm.

CPD uses the Burrows-Wheeler transform to find exact matches.  It has
some options to ignore identifiers and literals, although that results
in false positives sometimes...
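
To make the "ignore identifiers and literals" idea concrete, here is a rough sketch of the general technique. It is not CPD's code: the regex tokenizer, the `KEYWORDS` list and the `normalize` method are assumptions made up for the example, and a real tool would use a proper lexer.

```ruby
# Hypothetical, deliberately tiny keyword list; real Ruby has many more.
KEYWORDS = %w[def end if else elsif unless while do return class module].freeze

# Replace every identifier with ID, every number with NUM and every
# quoted string with STR, so only the "shape" of the code remains.
def normalize(source)
  tokens = source.scan(/"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\d+(?:\.\d+)?|[A-Za-z_]\w*|\S/)
  tokens.map do |tok|
    case tok
    when /\A["']/ then "STR"
    when /\A\d/ then "NUM"
    when /\A[A-Za-z_]/ then KEYWORDS.include?(tok) ? tok : "ID"
    else tok
    end
  end
end

p normalize("total = price * 2")
p normalize("sum = cost * 10")
```

Both lines normalize to the same token list, which is the point, and also where the false positives come from: unrelated statements with the same shape look identical.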

Yours,

Tom
Jake McArthur (Guest)
on 2006-04-28 18:42
(Received via mailing list)
> [... social network ...]

I do like the idea, but it's a different spirit from what I'm going
for. There are tons of developer communities, and tons of source code
repositories. While I don't think I've seen a social network for
developers before, I don't really see it kicking off so well unless
it happens to be language agnostic, which is fine except that it
would only become that much more difficult to find anything that
applies to your particular project.

I'm trying to do this as a learning experience, and something that
hasn't been done before. While I like some social networks, there are
only so many that people can be active in. A web site like this would
have to compete with other social networks (with overlapping
functionality) in order to have anything worth using, but I know I
can't do that. I have to try something new because of that.

> This might include open commenting systems on the source code, so that
> people can review the code live, things which automatically sense
> similar projects or similar dependencies and things like that, and
> just the general cool stuff that a well set up social network could
> provide.

I thought of this, but I don't like the idea of having to go through
so much extra trouble just to get rigged up for the system. And as
for automatically sensing similar projects, that's just a larger
scale version of what I'm trying to do already.

- Jake McArthur
Jake McArthur (Guest)
on 2006-04-28 18:45
(Received via mailing list)
That's exactly what I'm going for... something that nobody else really
wants to try. If nobody else will make something like this, then I
want to make it, or at least try; we would never reap the benefits
otherwise.

What better chance is there for this than Summer of Code? It is the
kind of project everybody secretly really wants to have, but would
never realistically be able to find the time for it unless they could
do it as a kind of job, but who would pay for something this risky?
The only way this could really happen is if I propose it for
Summer of Code.

- Jake McArthur
Jake McArthur (Guest)
on 2006-04-28 18:48
(Received via mailing list)
Excellent find! Gives me some good algorithms to look up at the
least, but maybe even some code to use? (I can't see right now if it
is open source. No time right now. Gotta study for exams.)

- Jake McArthur
unknown (Guest)
on 2006-04-28 18:51
(Received via mailing list)
On Fri, 28 Apr 2006, Jake McArthur wrote:

> I've already been working on this. Right now, I'm making a simple algorithm
> that works on arbitrary text and returns a number reflecting how similar two
> strings are. Even this alone has been giving fairly good results on code,
> even code that was written rather differently, but my plan is to use this
> algorithm to compare symbols and literals. A similar algorithm, working on a
> slightly larger scale, would compare entire lines of code for similar
> syntax, augmented by data from the first algorithm.
>
> I'm still thinking about this. Suggestions, anybody?

it seems down at the moment, but this is close/perfect for your needs

   http://complearn.org/

google cache (until site up)

   http://72.14.207.104/search?q=cache:bmlzYI4W39sJ:w...

more links

   http://www.newscientist.com/article.ns?id=dn3602
   http://homepages.cwi.nl/~cilibrar/musicart/trnmag....


i've played with it and, since there are command line tools and a ruby
api, i would think you could categorize text quite easily.

we are actually playing with this to identify spatial/temporal trends in
nighttime lights satellite imagery.
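
as far as i understand it, the idea behind complearn is the normalized compression distance: two texts are similar if compressing them together costs little more than compressing one alone. a minimal sketch with ruby's bundled zlib, not the complearn api itself (`compressed_size` and `ncd` are names made up for this example):

```ruby
require "zlib"

# Size in bytes of the deflate-compressed text.
def compressed_size(text)
  Zlib::Deflate.deflate(text, Zlib::BEST_COMPRESSION).bytesize
end

# Normalized compression distance: near 0 for very similar inputs,
# near 1 for unrelated ones (with a real compressor it is only approximate).
def ncd(a, b)
  ca = compressed_size(a)
  cb = compressed_size(b)
  cab = compressed_size(a + b)
  (cab - [ca, cb].min).to_f / [ca, cb].max
end

a = "def add(a, b)\n  a + b\nend\n" * 20
c = "class Stack; def push(x); @items << x; end; end\n" * 20
puts ncd(a, a)
puts ncd(a, c)
```

short inputs give noisy numbers because the compressor's fixed overhead dominates, so this is better suited to whole files than to single lines.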

cheers.

-a
Chiaro S. (Guest)
on 2006-04-28 19:19
(Received via mailing list)
Jake,

I am working on something uncannily similar to what you describe. I
imagined it to also be wiki-integrated and to present a cleaned-up
version of test code that is human readable.

have a look at
http://liquiddevelopment.blogspot.com/2005/12/soci...

and tell me what you think

Cheers

--
Chiaroscuro
---
Liquid Development: http://liquiddevelopment.blogspot.com/
Chiaro S. (Guest)
on 2006-04-28 19:28
(Received via mailing list)
The cleaned-up code (which I call 'prozed') looks like the coloured
bits of code at:
http://liquiddevelopment.blogspot.com/2006/01/soft...

proze and intent (my rSpec-like framework) produce stuff like:

###############################################
people = [
    "john",
    "mike",
    "Sam"
]

story "Turning people names to uppercase"
    people to upcase should be ["JOHN","MIKE","SAM"]
###############################################

and

###############################################
story "Using Cashflows"
        Let's now deal with cashflows

        with  Conventional discount = continuously compounded yield

        we  expect about 271.67748, as the
            present value of
              Cashflow  [
                 100 at 1 year ,
                 100 at 2 years ,
                 100 at 3 years
              ],
              with interest rate at   5 percent
###############################################


this code is easily readable and the intentional code could be
automatically imported within the editor (FreeRIDE has a nice system
to write plugins to connect to the repository) while the appropriate
gems get downloaded in background.


What do you say? Shall we join forces?
(Guest)
on 2006-04-28 19:28
(Received via mailing list)
Jake McArthur wrote:
> giving fairly good results on code, even code that was written rather
> differently, but my plan is to use this algorithm to compare symbols
> and literals. A similar algorithm, working on a slightly larger
> scale, would compare entire lines of code for similar syntax,
> augmented by data from the first algorithm.
>
> I'm still thinking about this. Suggestions, anybody?
>

(my 1st ruby-talk post not from Google groups )
cyclomatic complexity may have some value as another input
http://saikuro.rubyforge.org/

also look at gonzui and doing some kind of vector space-LSI modelling
based on ruby keywords, core and std lib methodnames, etc.
Matt Long (Guest)
on 2006-04-28 19:41
(Received via mailing list)
Just had a thought...  Something similar _has_ been done before, but
targeted at a different application.  Take a  look at services like
MOSS <http://www.cs.berkeley.edu/~aiken/moss.html>, which was
designed for plagiarism detection -- e.g. finding similar code across
projects.  There is a nice paper linked that talks about how they go
about doing it.

Matt


On 28 Apr , 2006, at 10:42 AM, Jake McArthur wrote:

> propose it for Summer of Code.
>
--
Matt Long 
removed_email_address@domain.invalid /
removed_email_address@domain.invalid
University of South Florida, CRASAR
GnuPG public key: http://www.csee.usf.edu/~mtlong/public_key.html

"In mathematics you don't understand things, you just get used to them."
- John von Neumann
Tom C. (Guest)
on 2006-04-28 19:53
(Received via mailing list)
> Excellent find! Gives me some good algorithms to look up at the
> least, but maybe even some code to use? (I can't see right now if it
> is open source.

Yup, it's BSD-licensed; code is here:

http://pmd.sourceforge.net/xref/net/sourceforge/pm....
html

> No time right now. Gotta study for exams.)

Best of luck!

Yours,

Tom
Tom C. (Guest)
on 2006-04-28 20:05
(Received via mailing list)
> Just had a thought...  Something similar _has_ been done before, but
> targeted at a different application.  Take a  look at services like
> MOSS <http://www.cs.berkeley.edu/~aiken/moss.html>, which was
> designed for plagiarism detection -- e.g. finding similar code across
> projects.  There is a nice paper linked that talks about how they go
> about doing it.

That is an interesting paper, thanks for the link!  He lists a couple of
requirements:

1) whitespace independence - CPD has this for several languages (C, C++,
Java) since for those languages it uses JavaCC-generated parsers that
discard whitespace.  For other languages (Ruby, PHP) it also discards
whitespace but does it a bit more clunkily, at a higher level in the
framework.

2) noise suppression - yup, this is important since you don't want to
catch little matches like "x.each do |y|".  CPD allows you to set the
minimum match size; I usually start at about 100 and work my way lower
from there.

3) position independence - since you don't want moving things around to
affect the analysis.  CPD mostly has this, I think :-)
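
For the curious, here is a rough sketch of how fingerprinting can meet those three requirements, along the lines of the winnowing approach as I understand it. It is neither MOSS's nor CPD's actual code; `fingerprints`, `overlap` and the default values of `k` and `window` are assumptions for illustration.

```ruby
require "zlib"

# Hash every k-character chunk of the whitespace-free text, then keep only
# the smallest hash in each sliding window of `window` consecutive hashes.
def fingerprints(text, k: 5, window: 4)
  cleaned = text.downcase.gsub(/\s+/, "")          # whitespace independence
  hashes = cleaned.chars.each_cons(k).map { |gram| Zlib.crc32(gram.join) }
  windows = hashes.size < window ? [hashes] : hashes.each_cons(window).to_a
  windows.reject(&:empty?).map(&:min).uniq
end

# Fraction of the smaller fingerprint set that also appears in the other one.
def overlap(a, b, **opts)
  fa = fingerprints(a, **opts)
  fb = fingerprints(b, **opts)
  return 0.0 if fa.empty? || fb.empty?
  (fa & fb).size.to_f / [fa.size, fb.size].min
end

area = "def area(w, h)\n  w * h\nend\n"
perimeter = "def perimeter(w, h)\n  2 * (w + h)\nend\n"
puts overlap(area + perimeter, perimeter + area)
```

Matches shorter than `k` characters never produce a hash at all (noise suppression), and because fingerprints are compared as a set, moving a method elsewhere in the file still leaves shared fingerprints (position independence).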

Of course, the hard problem is fixing the duplicates once you find them;
that can be a delicate job.

[SHAMELESS PLUG] If you're interested in reading more about CPD, I've
got a chapter on it in my book:

http://pmdapplied.com/

Yours,

Tom
Matthew M. (Guest)
on 2006-04-28 20:12
(Received via mailing list)
Quote: "To date, the main application of Moss has been in detecting
plagiarism in programming classes."

Really...  I've been known to do other things.
Elliot T. (Guest)
on 2006-04-28 20:15
(Received via mailing list)
On Apr 28, 2006, at 7:30 AM, Jake McArthur wrote:

> Well-tested projects will not be affected by malicious code because
> the system would see that tests fail and revert back to the last
> working version.

What if the code functioned exactly the same plus some nasty side
effects like a root kit? Could that get through tests?

-- Elliot T.
http://www.curi.us/blog/
Matt Long (Guest)
on 2006-04-28 20:31
(Received via mailing list)
Must keep you busy, checking all those submissions... :-)

Matt

On 28 Apr , 2006, at 12:09 PM, Matthew M. wrote:

> Quote: "To date, the main application of Moss has been in detecting
> plagiarism in programming classes."
>
> Really...  I've been known to do other things.
>

--
Matt Long 
removed_email_address@domain.invalid /
removed_email_address@domain.invalid
University of South Florida, CRASAR
GnuPG public key: http://www.csee.usf.edu/~mtlong/public_key.html

The wars of the future will not be fought on the battlefield or at
sea. They will be fought in space, or possibly on top of a very tall
mountain. In either case, most of the actual fighting will be done by
small robots. And as you go forth today remember always your duty is
clear: To build and maintain those robots. Thank you.

-The Simpsons
Jake McArthur (Guest)
on 2006-04-28 22:06
(Received via mailing list)
You have a point. This would not be caught. I see only a few
potential solutions, but they are kind of a hassle:

a) No world editing. The creator of a branch can give access to other
people at will, but only explicitly.
b) No automatic running of tests. Updating to recent revisions of
dependencies prompts the developer to review the changes to the code
before running tests, and then gives an option to run the tests and/
or revert to an older version that works.
c) Both of the above.
d) A sort of karma system in which a dependency update will update
and test code automatically that is edited by "trusted" developers,
but uses method B for any code that has been edited ("tainted") by
untrusted developers. A developer gains karma in the network whenever
somebody decides to keep their revisions in their own projects. A
developer loses karma whenever somebody updates a dependency,
examines their code, and decides not to use it. You can define the
karma threshold at which an update is automatically tested and
integrated. A dependency itself may also be marked as editable only
to those with a high enough karma. This is my favorite idea here, but
also the most difficult to implement.
e) Some programmatic way of checking for or eliminating the
possibility of malicious code. (???)
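
A quick sketch of how option (d) might look in code. Everything here is hypothetical: the `Developer` struct, the method names, and the choice of plus or minus one karma point per decision are assumptions, not a design.

```ruby
# Hypothetical record of a developer and their karma in the network.
Developer = Struct.new(:name, :karma)

# Decide what to do with an updated dependency, given everyone who edited it.
# One editor below the threshold "taints" the update and forces method B.
def update_action(editors, threshold:)
  trusted = editors.all? { |dev| dev.karma >= threshold }
  trusted ? :auto_test_and_integrate : :manual_review
end

# Adjust karma after somebody keeps or rejects a developer's revision.
def record_decision(dev, kept:)
  dev.karma += kept ? 1 : -1
end

alice = Developer.new("alice", 12)
bob = Developer.new("bob", 2)
puts update_action([alice], threshold: 10)
puts update_action([alice, bob], threshold: 10)
```

The threshold is passed in by the caller, matching the idea that each user defines the karma level at which updates are tested and integrated automatically.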

- Jake McArthur
Giles B. (Guest)
on 2006-04-28 22:18
(Received via mailing list)
I'm working on something similar also, although it's hardly even
started, and it isn't optimized for code sharing.

Also, somebody mentioned they were working on techniques to identify
similar strings. I think what you're looking for may be Levenshtein
distance.
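
In case it helps, a minimal sketch of the standard dynamic-programming version of edit distance (the method name `levenshtein` is just for this example):

```ruby
# Number of single-character insertions, deletions and substitutions
# needed to turn string a into string b.
def levenshtein(a, b)
  prev = (0..b.length).to_a            # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                         # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,            # delete ca
               curr[j - 1] + 1,        # insert cb
               prev[j - 1] + cost].min # substitute (or keep) the character
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("kitten", "sitting")  # → 3
```

Dividing the result by the length of the longer string gives a rough 0 to 1 dissimilarity score if you want a normalized number.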

--
Giles B.
http://www.gilesgoatboy.org
Jake McArthur (Guest)
on 2006-04-28 22:40
(Received via mailing list)
You're right. It is very similar. We do, however, have slightly
different ideas. Yours seems to be a repository based around nuggets
of code which is searched by comparing the nuggets' "intent" with
your own. I really like it too, but it's not the same.

Mine is to compare code directly, even code that normally wouldn't be
classified as a stand-alone "nugget," like inline code inside large
projects, code that is a bit interspersed with other code, etc. In
this way, similarities within individual projects can be located and
factored out. This approach seems to focus less on explicitly sharing
_everything_ (and trying to make code work for _everybody_) and more
on getting your own project done, with the improvement of the
collective code base for everybody coming almost as a side-effect.

- Jake McArthur
Victor S. (Guest)
on 2006-04-29 02:24
(Received via mailing list)
[skip]

> > accordingly if they desire to do so.
I am afraid the problem here is: what would I do even if I find that
some of my code repeats code in the repository? I have already
*written* the code, so where is my benefit? To understand "Oops, I'm
an idiot"? I already knew :))

A much more useful repository would help me find the code I only
*intend* to write - and here code comparison isn't necessary, because
I have nothing to compare yet.

What do you think about this?

> - Jake McArthur
>

Victor.
Jake McArthur (Guest)
on 2006-04-29 05:47
(Received via mailing list)
There are many benefits:

a) You help others to not repeat themselves (obvious).
b) You open parts of your code up so that others have reason to find
and fix your bugs.
c) It creates a much more useful repository of code than ordinarily
because this is code that people actually are using and maintaining,
not just things people figured might be useful later.

The point isn't to see that you repeated somebody when you shouldn't
have. The point is to see the repeat and save everybody the trouble
and factor it into one central location. It's not too difficult; like
you said, you already wrote the code. It's just a matter of branching
it off and making it shiny. Code that has been factored out would, of
course, be tagged and searchable so that people (or you) _can_ look
for the code they "intend" to write.

- Jake McArthur
Alex Y. (Guest)
on 2006-05-03 21:08
(Received via mailing list)
Jake McArthur wrote:
> There are many benefits:
>
> a) You help others to not repeat themselves (obvious).
> b) You open parts of your code up so that others have reason to find and
> fix your bugs.
> c) It creates a much more useful repository of code than ordinarily
> because this is code that people actually are using and maintaining, not
> just things people figured might be useful later.
>
d) You find not only the bit you've already written, but also the bit
that goes with it that you were about to write.