More Summer of Code goodness (please forward)


This may be long so I’ll write first things first :

  • Can I apply ?
  • Can RubyCentral be my mentor ?
  • If so, should I take a real Ruby project rather than my own ?
  • Does my project sound good ?

I’ve already posted this on the ruby-talk-google, but I am duplicating
it here. Please tell me if this is forbidden, but since I don’t see my
message appearing I wouldn’t want it to be lost again ^^;
OK, so I had free time on my hands and I am posting it again :wink:


I don’t know whether I’ll have enough time to do GSoC because I have
school work to do until mid-July. But it may still be possible if
school work does not take too much time. Last year was hardcore and
I am used to working 70+ hours a week. So if my school only takes me
30 to 40 hours per week in June-July I can still do a great project.

Are GSoC students supposed to be working harder than basic guys in
companies (50h/week) or are they considered as students during summer
that will go to parties every night or even have weekends (25h/week) ?
If more than 40h a week are a minimum then I guess I’ll just have to
forget about it.

There is still the possibility that I ask my school to count GSoC as
part of my studies, which would give me more time to work on it : a
grade for my GSoC would replace a project my friends are going to do,
provided GSoC gives me skills in the same field as that project. But
the school is a bit slow to decide and applications are due soon.


My project is not a project for the Ruby community, but merely one
that uses Ruby, which is my favorite language and is quite good at
manipulating text. Does it still qualify for RubyCentral to be my
mentor ?

I like the spirit and philosophy of Ruby and its community has always
looked great, so it would really motivate me if I could work with
Ruby-guys :slight_smile:


There are projects I’d be glad to work on improving for the community,
such as ZenSpider’s tools (RubyInline, Ruby2C) or some ambitious VM
projects (YARV, rubinius).

The question is : I think I’m quite good, but when I look at these
they seem too hard. But I like challenges and I’ve done complicated
stuff before, so which kind are they :

  • super hard, but when you really look at it you find that you can do it
  • super hard, only semi-gods can even understand it (which I’m not) !

Sorry for forgetting to be humble but I think I’ve done quite hard
stuff before. I have a good understanding of some subjects : my school
made us recode many parts of the C standard lib with only a very small
set of available functions (ex: recoding malloc with only brk/sbrk,
the other allowed functions being assert, perror, exit, write and
getenv). I value this because when shit happens, I can understand what
I’ve done wrong :slight_smile:

I also had opportunities to discover many things : I’m no expert but I
tried fun stuff such as distributed programming, image processing,
functional languages (OCaml) or just stuff that is interesting and
challenging to do (Objective-C, tiny tiny bits of Lisp, …) which made
me curious and allows me to quickly relate what I’m learning now to
things I’ve already heard of.

Now that you semi-gods know me, is your project still too hard for me
or do you think I just have to read a lot of docs before I can join in ?

That said, I don’t think the other ideas are bad :slight_smile: I just don’t know
them all, and I am asking about the projects that motivate me the
most, that’s only natural.


OK, now my project idea : when given a report to write, I want to help
the teacher spot the cheaters (massive copy-paste from Wikipedia or
other docs).

This is really resource-consuming, so I want to avoid having to diff
every file against every other file. Instead I’d like to implement
heuristics that produce a “signature” of each document, one that is
easy to compare with many other “signatures”, so that I can show the
teacher parts of documents that are highly

I think the “extract-the-docs-signature” part can be slow and
complicated, but I’d really like the signature comparison to be super
fast. I would also rather let some cheaters go unsuspected than
overwhelm the teacher with many cheat warnings (or my tool would defeat
its purpose, which is making the teachers’ life easier), but that can
just be a parameter of the heuristics.

There are some sub-parts around it, such as asking the teacher for the
other students’ documents, asking for keywords and fetching the first
few docs from Google : cheaters are lazy :wink:
I also don’t want my program to be too “google-heavy”.

What I am thinking of as a first heuristic is using word lengths. If
two documents contain the same sequence of 20 words with the exact same
lengths, this really seems suspicious. Of course, since I have said
that I want to greatly reduce the number of suspicious parts, I can
spend time on the ‘critical’ parts and run some other algorithm on
them, so I can see whether the signature resemblance was only chance
(ie. thinking this mail has been copied-pasted from
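
A minimal Ruby sketch of the word-length heuristic described above. The
helper names (word_lengths, signature, suspicious?) and the word-matching
regexp are assumptions made for illustration, not part of any existing
tool. Each document becomes a set of runs of consecutive word lengths,
so comparing two documents is only a set intersection.

```ruby
require "set"

# Hypothetical helper: the length of every word in the text, in order.
# Assumption: a "word" is a run of letters, so "don't" counts as two words.
def word_lengths(text)
  text.scan(/[[:alpha:]]+/).map(&:length)
end

# Hypothetical helper: the "signature" of a document, i.e. the set of all
# runs of `window` consecutive word lengths found in it.
def signature(text, window = 20)
  word_lengths(text).each_cons(window).to_set
end

# Two documents are flagged when they share at least one run.
def suspicious?(doc_a, doc_b, window = 20)
  signature(doc_a, window).intersect?(signature(doc_b, window))
end

a = "The quick brown fox jumps over the lazy dog near the river"
b = "Yesterday the quick brown fox jumps over the lazy dog again"
c = "Completely unrelated sentences rarely share long runs of identical word lengths"

# A small window is used here only because the sample texts are short.
puts suspicious?(a, b, 5)  # => true
puts suspicious?(a, c, 5)  # => false
```

The window size plays the role of the "parameter in the heuristics" :
a larger window gives fewer warnings, at the cost of missing shorter
copied passages. Signatures can be computed once per document and kept.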

In my school there is such a tool for comparing students’ source code
(we are not allowed to look at it of course, and maybe that’s just
bluff ^^). That’s easy to do with code since there is a strict
grammar, preprocessor tools and so on : a basic attempt at concealing
cheating, such as renaming the variables, does not work. Of course
two source files with the same AST would be VERY suspicious.
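
To illustrate why renaming variables does not hide copied code, here is
a rough sketch using Ripper, the parser shipped with Ruby's standard
library. This is not the school's tool (which I have not seen); the
helper names ast_shape and erase_names are hypothetical. The idea is to
parse both sources, erase identifier names and source positions, and
compare what is left of the tree.

```ruby
require "ripper"

# Hypothetical helper: recursively erase identifier names and positions.
# Ripper represents tokens as [:@type, "text", [line, column]].
def erase_names(node)
  return node unless node.is_a?(Array)
  if node.first.is_a?(Symbol) && node.first.to_s.start_with?("@")
    type, token = node
    # Crude on purpose: every identifier (variables AND method names)
    # becomes "_", and the position is dropped.
    return [type, type == :@ident ? "_" : token]
  end
  node.map { |child| erase_names(child) }
end

# Hypothetical helper: the "shape" of a piece of Ruby source.
def ast_shape(source)
  sexp = Ripper.sexp(source)
  raise ArgumentError, "source does not parse" if sexp.nil?
  erase_names(sexp)
end

original  = "total = 0\n[1, 2, 3].each { |n| total += n }\n"
renamed   = "sum = 0\n[1, 2, 3].each { |value| sum += value }\n"
different = "total = 1\n[1, 2, 3].each { |n| total *= n }\n"

puts ast_shape(original) == ast_shape(renamed)    # => true
puts ast_shape(original) == ast_shape(different)  # => false
```

A real tool would need to be smarter (reordered statements, extra dead
code and so on), but it shows why a strict grammar makes the job much
easier than with natural language.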

I’m willing to try this approach during my GSoC if this is necessary,
but I know natural language processing is hard (impossible ?) even for
researchers that are far more intelligent than I am :wink: But hey, maybe
I’ll even be able to catch people that are merely paraphrasing
Wikipedia !

As you see my thinking is not complete and I have many points to
study. If some people are ever interested in that, even not for GSoC,
please feel free to contact me ! I may not have lots of free time but
hey, let’s try !

= THANKS ! =

Thanks to anybody who has read this far :slight_smile:
Now you understand why I did not want to write it down again, but
don’t worry I made some copies elsewhere :wink:

I’m looking forward to your answers and I’m beginning to enjoy
Ruby-talk, but alas that’s quite time-consuming and my school is
forcing me to do an awful J2EE project due very soon. I miss Rails so
much :’(

Thanks again everyone !


Sylvain Abélard wrote:

companies (50h/week) or are they considered as students during summer
that will go to parties every night or even have weekends (25h/week) ?
If more than 40h a week are a minimum then I guess I’ll just have to
forget about it.

My 2 cents here would say that I doubt if anyone cares if you work 7
hours a week or 70 hours a week. What impresses me (and I’ll wager
Google and others) is the results. If you can do a great project then we
would all benefit and would appreciate your efforts. If the project
sucks and fails miserably then it doesn’t matter how much work you’ve
done - nobody benefits, no fame, and no Google job offer for you. There
is always the personal benefit of what you learned but the GSoC projects
I read about in Dr. Dobbs highlighted the contributions to the
community. I don’t remember any mention about the hours clocked.