How will it find similar code? One simple issue is that people will
name their variables and methods differently, so you’ll want to
somehow see the structure of a section of code and ignore a lot of
details. But you can’t ignore too many details. Maybe (trivial
example) someone wrote a “max” function and someone else inlined
it, and otherwise their code blocks are the same.
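To make the naming problem concrete, here is a rough sketch of one way to hide variable and method names while keeping the shape of the code. It is only an illustration: the `KEYWORDS` list and the `normalize_identifiers` helper are made up for this example, and the regex approach is an assumption, not anything a real parser would do (it does not understand strings, comments, or a ternary written without spaces).

```ruby
require "set"

# Hypothetical, deliberately incomplete list of Ruby keywords to keep verbatim.
KEYWORDS = Set.new(%w[def end if else elsif unless while return do class module nil true false])

# Replace each distinct identifier with a placeholder numbered by first
# appearance (v0, v1, ...), so two snippets that differ only in naming
# normalize to the same text. Regex-based, so strings and comments are
# not treated specially.
def normalize_identifiers(source)
  names = {}
  source.gsub(/\b[A-Za-z_]\w*[?!]?/) do |word|
    next word if KEYWORDS.include?(word)
    names[word] ||= "v#{names.size}"
  end
end

first  = "def biggest(x, y)\n  x > y ? x : y\nend"
second = "def max_of(a, b)\n  a > b ? a : b\nend"
puts normalize_identifiers(first) == normalize_identifiers(second)
```

This handles renaming only. The inlined “max” case would still look different after normalization, which is exactly the harder part of the question.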
I’ve already been working on this. Right now, I’m making a simple
algorithm that works on arbitrary text and returns a number
reflecting how similar two strings are. Even this alone has been
giving fairly good results on code, even code that was written rather
differently, but my plan is to use this algorithm to compare symbols
and literals. A similar algorithm, working on a slightly larger
scale, would compare entire lines of code for similar syntax,
augmented by data from the first algorithm.
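For readers who want something to play with, here is a minimal sketch of a text-similarity score of the general kind described above. The actual algorithm is not shown in this thread, so this swaps in a well-known stand-in, the Sørensen-Dice coefficient over character bigrams. The names `bigrams` and `similarity` are made up for the example.

```ruby
# Character bigrams of a string, e.g. "night" -> ["ni", "ig", "gh", "ht"].
def bigrams(text)
  text.chars.each_cons(2).map(&:join)
end

# Sørensen-Dice coefficient over bigram multisets.
# Returns a Float from 0.0 (nothing shared) to 1.0 (identical strings).
def similarity(a, b)
  return 1.0 if a == b
  left = bigrams(a)
  right = bigrams(b)
  return 0.0 if left.empty? || right.empty?

  # Count the bigrams on the right so each one can be matched only once.
  counts = Hash.new(0)
  right.each { |pair| counts[pair] += 1 }

  shared = left.count do |pair|
    if counts[pair] > 0
      counts[pair] -= 1
      true
    end
  end

  2.0 * shared / (left.size + right.size)
end

puts similarity("user_count", "users_count") > similarity("user_count", "render_page")
```

Applied to symbols and literals, a score like this rates `user_count` much closer to `users_count` than to `render_page`. A line-level comparison could then combine such scores, but how to weight them is left open here.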
I’m still thinking about this. Suggestions, anybody?
I don’t think all code is simple to refactor like that. But maybe
enough is for this to be useful. Maybe most is? I don’t know.
Far from all of it is, but all I meant is that there is no need to
mess with the system to do it.
I don’t have much experience with unit tests. How well can they
usually withstand arbitrary changes to code with subtle bugs?
Well-tested code will not break unless a test was missed, and if a
bug is found, writing a test to cover it will all but squash that
particular bug for good.
It’s a bit off-topic, but I’m not sure how good an idea wikis are.
Wikipedia gets a lot of vandalism. But worse: what happens when
people have a legitimate disagreement about how some code should be
written? “anyone can post anything” doesn’t provide a way to
resolve disagreement.
There could also be a risk of malicious code that people
auto-update to.
Disagreements could be resolved by simply forking off another
project, and then everybody is happy. Besides, if everybody agrees on
the tests, and those tests pass, everybody should be happy anyway.
Well-tested projects will not be affected by malicious code because
the system would see that tests fail and revert to the last working
version.
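As a rough sketch of that revert rule: the system does not exist yet, so everything here is hypothetical, including the `Version` struct, its `tests` field, and the `latest_working` helper. The idea is just to walk back from the newest version until one passes its tests.

```ruby
# Hypothetical model: a version is a label plus a callable that runs its
# test suite and returns true when every test passes.
Version = Struct.new(:label, :tests)

# A version counts as working only if its tests run and return true;
# a raised error is treated as a failure.
def passing?(version)
  version.tests.call == true
rescue StandardError
  false
end

# Given versions ordered oldest to newest, return the newest one whose
# tests pass, or nil if none do.
def latest_working(versions)
  versions.reverse.find { |version| passing?(version) }
end

history = [
  Version.new("1.0", -> { true }),
  Version.new("1.1", -> { true }),
  Version.new("1.2", -> { false })  # e.g. a malicious or broken update
]
puts latest_working(history).label
```

This only protects against changes that the tests actually notice. Malicious code that leaves every test passing would go straight through.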
I wonder how well the code-similarity algorithm would work for non-
Ruby code. Just curious how Ruby-specific the tests would be vs how
general.
The algorithm I’m currently working on is language-agnostic, but it
doesn’t yet benefit from syntax parsing and the like, as my plans call for.