Cucover: coverage-aware 'lazy' Cucumber runs

Hi Folks,

This is an idea I’ve been playing with for some time, and I think I’ve
finally got it into a shape where it might be useful to other people.

Cucover is a thin wrapper for Cucumber which makes it lazy.

What does it mean for Cucumber to be lazy? It will only run a feature
if it needs to.

How does it decide whether it needs to run a feature? Every time you
run a feature using Cucover, it watches the code in your application
that is executed, and remembers. The next time you run Cucover, it
skips a feature if the source files (or the feature itself) have not
been changed since it was last run.
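
Roughly speaking, the check looks something like this (just a sketch to show the idea, not Cucover’s actual code):

# Re-run a feature only if the feature file itself, or any source file
# recorded as covered on the last run, has changed since that run.
def needs_rerun?(feature_file, last_run_time, covered_files)
  ([feature_file] + covered_files).any? do |path|
    !File.exist?(path) || File.mtime(path) > last_run_time
  end
end

# e.g. run_feature(feature) if needs_rerun?(feature, recorded_at, recorded_files)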

If this sounds useful to you, please take a look, try it out, and let
me know what you think.

cheers,

Matt W.
http://blog.mattwynne.net

On 10 Apr 2009, at 02:12, Matt W. wrote:

If this sounds useful to you, please take a look, try it out, and
let me know what you think.

Hi Matt

Cucover sounds very interesting! However I fell at the first hurdle:

“Anything that runs out of process will not be covered, and therefore
cannot trigger a re-run, so if you use Cucumber to drive Selenium, for
example, you’re out of luck.”

As I use Cucumber with Celerity in a separate JRuby process, I can’t
use Cucover. Is it feasible to make it work cross-process?

Ashley


http://www.patchspace.co.uk/
http://www.linkedin.com/in/ashleymoran

http://twitter.com/ashleymoran

On 10 Apr 2009, at 18:34, Ashley M. wrote:

“Anything that runs out of process will not be covered, and
therefore cannot trigger a re-run, so if you use Cucumber to drive
Selenium, for example, you’re out of luck.”

As I use Cucumber with Celerity in a separate JRuby process, I can’t
use Cucover. Is it feasible to make it work cross-process?

It’s feasible I think, and something I’d definitely like to add for my
own purposes eventually. I think RCov works with JRuby too, though
I’ve not tried it myself.

To be honest though, the next feature in my queue is probably to make
more granular re-runs that work per scenario rather than per feature,
as it currently does.

The code is really pretty simple so if you want to pull it down and
take a look, maybe we can have a chat directly about how it would work.
I think the problem would be around how the external process gets
started (and its coverage observed), but your situation should be much
easier than a Selenium setup where the process could be on a remote box.

Just to be clear, are you calling Ruby to call Cucumber to call JRuby
to call Celerity, as you seem to be suggesting?

Matt W.

http://blog.mattwynne.net

On 10 Apr 2009, at 19:39, Matt W. wrote:

It’s feasible I think, and something I’d definitely like to add for
my own purposes eventually. I think RCov works with JRuby too,
though I’ve not tried it myself.

Hmmm - the JRuby process is just running Cucumber, whereas my Merb code
is all in MRI. It’s the MRI process that needs to run RCov, right?
(I’ve never used it beyond inspecting its coverage reports.)

To be honest though, the next feature in my queue is probably to
make more granular re-runs that work per scenario rather than per
feature, as it currently does.

Sounds like a good idea!

The code is really pretty simple so if you want to pull it down and
take a look, maybe we can have a chat directly about how it would work.
I think the problem would be around how the external process gets
started (and its coverage observed), but your situation should be
much easier than a Selenium setup where the process could be on a
remote box.

Just to be clear, are you calling Ruby to call Cucumber to call
JRuby to call Celerity, as you seem to be suggesting?

Currently what I’ve got is a rake file running in MRI, that calls
jrake, which runs the Cucumber task in the JRuby process. The
indirection there is just so I can type rake features, as Merb won’t
(currently) start up in JRuby, which prevents me from typing jrake features.

Part of my env.rb involves some code I wrote that wraps
daemon_controller to start my webapps. I have a separate “features”
environment that is started on demand, so it’s only available when
Cucumber is running.
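
For illustration, the daemon_controller glue looks roughly like this - the commands, ports and paths here are made up rather than my real config:

require 'rubygems'
require 'socket'
require 'daemon_controller'

# Boot the webapp in the "features" environment only if it isn't already up.
webapp = DaemonController.new(
  :identifier    => 'Features webapp',
  :start_command => 'merb -e features -p 4100 -d',
  :ping_command  => lambda { TCPSocket.new('127.0.0.1', 4100) },
  :pid_file      => 'log/merb.4100.pid',
  :log_file      => 'log/merb.4100.log',
  :start_timeout => 30
)

webapp.start             # no-op if the app is already running
at_exit { webapp.stop }  # shut it down when the Cucumber run finishes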

Cheers
Ashley


http://www.patchspace.co.uk/
http://www.linkedin.com/in/ashleymoran

http://twitter.com/ashleymoran

On 10 Apr 2009, at 02:12, Matt W. wrote:

What does it mean for Cucumber to be lazy? It will only run a
feature if it needs to.

While I have yet to do more than skim the full articles, I wondered if
you’d seen “Integration Tests Are A Scam” on InfoQ[1]? It was the
following that caught my attention:

the hypothetical programmers with the integration-based test suite
choose to worry only about “the most important [x]% of the tests”

Towards the end of the second article he seems to say it’s excessive
integration testing, rather than integration testing as such, that is
the problem. (Even if he does end with “Stop writing them.”) Strikes me
as rather along the lines of, albeit in the opposite direction to, “we
don’t mock because it makes the tests fragile”.

I was just idly thinking, could a code-coverage-based system be
combined with some sort of failure (fragility) history to balance the
time cost of heavy feature runs against the benefits of having something
run end-to-end? We’ve had reverse-modification-time spec ordering for
ages, which is a useful start.

On a more ranty note - I have very little time for these “XXX BDD/
development technique is always bad, don’t do it” articles. (But hey,
maybe I was guilty of this myself and have forgotten since…) Pretty
much every technique I’ve seen has some benefit - if you use it
selectively. I wish people would stop writing these inflammatory
articles, and (a) figure out how to use these techniques like razors
not shotguns and (b) go and improve the tools that apply them.
Otherwise they’re just making everyone’s life harder. Gah!!!

Ashley

[1] J.B. Rainsberger: "Integration Tests Are A Scam"


http://www.patchspace.co.uk/
http://www.linkedin.com/in/ashleymoran

http://twitter.com/ashleymoran

Stephen E. wrote:

I’ve had it in my head for a while now that someday (yes, that
mythical ‘someday’) I want to write a better autotest.
A proper design would let you plug in your own file-change discovery
strategy, plug in multiple runners (RSpec, Cucumber, yadda yadda) with
true modularity, specify lists of observers on directories or files,
and allow different output views. An ideal design would also let
you set priority rules like you’re describing here, so you get instant
feedback only on the stuff you’re working with, and do end-to-end runs
in the background.

That would be very cool, you have lots of good ideas there. Being able
to plug in your own file-change strategy would be killer. Another cool
idea I ran across the other day is being able to specify in your
examples which ones are “focussed”. Meaning, autotest will only
run the focussed ones and not bother running the entire suite. Once you
have solved the problem at hand you remove the focussed tag and the
whole suite is then run. This idea, which is already implemented,
comes from Micronaut[1]. The idea is very similar to Cucumber’s and
RSpec’s[2] tagging feature (yet to come for RSpec). The cool thing
about Micronaut is that they have tied it into autotest. Ideally, we
could be able to tell autotest, or whatever program, to only run tests
that are tagged a certain way-- and then you could override that with
the “focused” tag. So, we can add that to our list of cool things to
have. :)

-Ben

[2] https://rspec.lighthouseapp.com/projects/5645/tickets/682-conditional-exclusion-of-example-groups

On Sat, Apr 11, 2009 at 2:02 PM, Ashley M.
[email protected] wrote:

I was just idly thinking, could a code-coverage-based system be
combined with some sort of failure (fragility) history to balance the time
cost of heavy feature runs against the benefits of having something run
end-to-end? We’ve had reverse-modification-time spec ordering for ages,
which is a useful start.

I’ve had it in my head for a while now that someday (yes, that
mythical ‘someday’) I want to write a better autotest. Maybe this is
heresy, but I am a huge fan of the idea behind autotest and totally
annoyed by its implementation. It’s extensible only in strange ways
(hence wrappers like autospec), and its fundamental strategy is too
static. I once lost most of a day trying to fix merb_cucumber so the
features would run when they should, and was ready to hurl cats when I
realized autotest’s idea of context chaining was to make you list them
all in the classname in alphabetical order. Look at the files in the
Cucumber gem’s ‘lib/autotest’ directory and you’ll see what I mean.

A proper design would let you plug in your own file-change discovery
strategy, plug in multiple runners (RSpec, Cucumber, yadda yadda) with
true modularity, specify lists of observers on directories or files,
and allow different output views. An ideal design would also let
you set priority rules like you’re describing here, so you get instant
feedback only on the stuff you’re working with, and do end-to-end runs
in the background.

Right now this is just a pipe dream, but I don’t think it would be
hard. It’s just finding the time to do it vs. actual public-facing
applications that’s the challenge. If anybody wants to have a
conversation about this, maybe get some collaboration going, feel free
to drop me a line.

On a more ranty note - I have very little time for these “XXX
BDD/development technique is always bad, don’t do it” articles. (But hey,
maybe I was guilty of this myself and have forgotten since…)

“Declaring absolutes is always bad, don’t do it?” >8->

Oh – one other thought I had from reflecting upon your e-mail. This
is totally unrelated to the above, but since we’re being Big Thinkers
I might as well write it down before I forget. You mentioned
fragility/failure history, in relation to coverage, and I started
thinking… I wonder if everyone’s going about test coverage from the
wrong direction, simply trying to anticipate failure? What if we
extended something like exception_notifier or Hoptoad as well, and
brought real exceptions from the application’s staging and production
environments into our test tools? We know from the stack traces where
failures occur, so it’d be pretty straightforward to write an
RCov-like utility that nagged you: “You dingbat, your specs totally
fail to cover line 119 of hamburgers.rb. It threw a MeatNotFound
exception last Tuesday. Gonna test for that? Ever?”

What do you think? Decent idea? Or does something like this already
exist and I don’t know about it?


Have Fun,
Steve E. ([email protected])
ESCAPE POD - The Science Fiction Podcast Magazine
http://www.escapepod.org

On 10 Apr 2009, at 23:40, Ashley M. wrote:

On 10 Apr 2009, at 19:39, Matt W. wrote:

It’s feasible I think, and something I’d definitely like to add for
my own purposes eventually. I think RCov works with JRuby too,
though I’ve not tried it myself.

Hmmm - the JRuby process is just running Cucumber, whereas my Merb
code is all in MRI. It’s the MRI process that needs to run RCov,
right? (I’ve never used it beyond inspecting its coverage reports.)

Ah right, OK, that might make things a little simpler then :)

After a few iterations, I settled on Cucover using Rcov as a
library[1], slipping a call to Rcov::CallSiteAnalyzer#run_hooked
into the Cucumber AST. This makes it easy for me to pull out just the
data I need from the Rcov result objects, rather than trying to parse
the Rcov binary’s output.
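
In rough terms the hook looks like this - a simplified sketch rather than Cucover’s actual code, with the Rcov API paraphrased from memory:

require 'rubygems'
require 'rcov'

# A stand-in for application code that a scenario would exercise.
class Hamburger
  def cook; 'sizzle'; end
end

analyzer = Rcov::CallSiteAnalyzer.new

# run_hooked installs the tracing hooks just for the duration of the block -
# in Cucover's case, the block is the scenario executing inside the Cucumber AST.
analyzer.run_hooked do
  Hamburger.new.cook
end

# Pull out the files that define the methods which were actually called,
# so they can be recorded against the feature and checked for changes later.
covered_files = []
analyzer.analyzed_classes.each do |klass|
  analyzer.methods_for_class(klass).each do |meth|
    site = analyzer.defsite(klass, meth)
    covered_files << site.file if site
  end
end

puts covered_files.uniq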

Currently what I’ve got is a rake file running in MRI, that calls
jrake, which runs the Cucumber task in the JRuby process. The
indirection there is just so I can type rake features, as Merb won’t
(currently) start up in JRuby, which prevents me from typing jrake features.

Part of my env.rb involves some code I wrote that wraps
daemon_controller to start my webapps. I have a separate “features”
environment that is started on demand, so it’s only available when
Cucumber is running.

So it sounds like what would need to happen is for those
daemon_controller-spawned webapps to be run with coverage, and that
coverage passed back to Cucover, right? This sounds like an
interesting challenge :D
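
Speculatively - none of this exists yet, and the dump-file path and the merge on the Cucover side are entirely made up - the spawned app might record its own coverage and hand back a list of touched files when it shuts down:

require 'rubygems'
require 'rcov'

# Inside the out-of-process webapp (assuming it's Ruby): record coverage for
# the life of the process and dump the touched source files on exit, so the
# parent Cucover process can merge them into its own records.
analyzer = Rcov::CodeCoverageAnalyzer.new
analyzer.install_hook

at_exit do
  analyzer.remove_hook
  File.open(ENV['COVERAGE_DUMP'] || 'coverage_files.txt', 'w') do |f|
    analyzer.analyzed_files.each { |path| f.puts path }
  end
end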

[1] manalang.com

Matt W.

http://blog.mattwynne.net

Stephen,
Regarding the exception nagger, would a simple script that grepped the
log file for exceptions and produced a list of failing lines in your
code be a start?

Steve

On Sun, Apr 12, 2009 at 6:47 PM, Steve M. [email protected]
wrote:

Regarding the exception nagger, would a simple script that grepped the log
file for exceptions and produced a list of failing lines in your code be a
start?

Hi Steve,

I think so. If it said which class-or-module and method they were
defined in (almost always determinable by backwards regexing) even
better. Then a dimwit like me could just glance at it and say “Oh,
duh! I forgot to spec Planet.destroy!”
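
Something as dumb as this might be a start - a rough sketch, assuming Rails-style logs where backtrace lines look like app/models/hamburger.rb:119:in `cook':

# Tally the app-code lines that show up in logged backtraces, most frequent
# first, so you can see what your specs ought to be covering.
hits = Hash.new(0)

ARGF.each_line do |line|
  # Only count frames from our own code, not gems or the framework.
  if line =~ %r{(app/[\w/]+\.rb):(\d+):in `([\w?!=\[\]]+)'}
    hits["#{$1}:#{$2} (#{$3})"] += 1
  end
end

hits.sort_by { |location, count| -count }.each do |location, count|
  puts '%4d  %s' % [count, location]
end

# Usage: ruby exception_nagger.rb log/production.log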

Also I must confess: after I sent that last e-mail, it occurred to me
why a past-exception-based coverage tool wouldn’t work very well in
the long term. It’d be fine for immediate use, but if you didn’t do
the specs right away, the code would evolve and the line numbers of
those old exceptions would slowly go out of sync with current reality.
Since the only reasonable answer to that is “When something breaks,
write a test for it immediately to catch it next time” I’d say the
simpler script you’re talking about is probably close to ideal.


Have Fun,
Steve E. ([email protected])
ESCAPE POD - The Science Fiction Podcast Magazine
http://www.escapepod.org

could be able to tell autotest, or whatever program, to only run tests
that are tagged a certain way-- and then you could override that with
the “focused” tag. So, we can add that to our list of cool things to
have. :)

Are you saying you want multiple tags and let autotest do logic on them?

  • run everything “focus and current_feature”
  • run everything “current_feature or related_to_it”

Or are you saying you should explicitly specify which test went wrong?
(Would it be nice if autotest just ran all your recently failed features
first and the rest later, compared to the last failure only, as it is
supposed to do now, afaik; perhaps after running previously failed
features, it can run newest features first, finally followed by the
other features.)

Or are you saying it’s going to be separate from tags completely?

I’m feeling my first idea will get out of hand (adding complexity,
while not solving a specific need), and the third sounds bad from a
technical point of view. The second one comes from the observation
that “focussing” is a process, not a property. You state
“you remove the focussed tag”
yourself. “important” could be a property, if you don’t remove it.

Bye,
Kero.


How can I change the world if I can’t even change myself?
– Faithless, Salva Mea

On Sun, Apr 12, 2009 at 6:47 AM, Stephen E. [email protected] wrote:

I’ve had it in my head for a while now that someday (yes, that
mythical ‘someday’) I want to write a better autotest.
A proper design would let you plug in your own file-change discovery
strategy, plug in multiple runners (RSpec, Cucumber, yadda yadda) with
true modularity, specify lists of observers on directories or files,
and allow different output views. An ideal design would also let
you set priority rules like you’re describing here, so you get instant
feedback only on the stuff you’re working with, and do end-to-end runs
in the background.

A couple of years ago I was on a project that had fallen into the trap
of too many integration tests (exactly the horror scenario that J.B.
Rainsberger describes in “Integration Tests Are a Scam”). The whole
suite had hundreds of slow Watir tests and took several hours to run.
If there was a failure, it was usually just in a couple of them.

We ended up improving this a lot with a homegrown distributed RSpec
runner (based on DRb) and employing techniques from pairwise testing
(http://www.pairwise.org/).

At the time I also explored a third technique that we never put in use:
Heuristics.

If we could establish relationships between arbitrary files and failing
tests over time, then we would be able to calculate the probability that
a certain commit would break certain tests. We could then choose to run
the tests that had a high probability of breakage, and exclude the others.
A neural network could potentially be used to implement this. The input
neurons would be files in the codebase (on if it’s changed, off if not)
and the output neurons would be the tests to run.
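
As a toy sketch - assuming the ruby-fann gem as the binding to FANN, with made-up file names, sizes and thresholds:

require 'rubygems'
require 'ruby-fann'   # older versions: require 'ruby_fann/neural_network'

num_files = 3   # input neurons, e.g. cart.rb, checkout.rb, user.rb
num_tests = 2   # output neurons, e.g. checkout_spec.rb, user_spec.rb

fann = RubyFann::Standard.new(
  :num_inputs     => num_files,
  :hidden_neurons => [4],
  :num_outputs    => num_tests
)

# Training data from history: which changed files (inputs) went with which
# failing tests (desired outputs).
train = RubyFann::TrainData.new(
  :inputs          => [[1, 1, 0], [0, 0, 1], [1, 0, 0]],
  :desired_outputs => [[1, 0],    [0, 1],    [1, 0]]
)
fann.train_on_data(train, 1000, 100, 0.01)

# For a new commit that only touches the second file, ask which tests look risky.
fann.run([0, 1, 0]).each_with_index do |score, i|
  puts "test #{i}: #{score > 0.5 ? 'run it' : 'skip for now'} (#{score})"
end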

So if someone develops a better AutoTest with a plugin architecture, and
that doesn’t have to run as a long-lived process, then I’d be very
interested in writing the neural network part - possibly backed by FANN
(http://leenissen.dk/fann/)

It’s so crazy it has to be tried!

Aslak

On 11 Apr 2009, at 19:02, Ashley M. wrote:

I was just idly thinking, could a code-coverage-based system be
combined with some sort of failure (fragility) history to balance
the time cost of heavy feature runs against the benefits of having
something run end-to-end? We’ve had reverse-modification-time spec
ordering for ages, which is a useful start.

I believe this is roughly what Kent Beck’s new venture, JUnit Max,
does. I think it’s pretty much essential to start thinking about doing
this - dumbly running all the tests just doesn’t make sense and won’t
scale on a bigger project. Cucover is my first attempt to dip my toe
into this water.

I blogged about this the other day:
http://blog.mattwynne.net/2009/04/06/the-future-of-automated-acceptance-testing/

Matt W.

http://blog.mattwynne.net

On 12 Apr 2009, at 23:51, Ben M. wrote:

An ideal design would also let you set priority rules like you’re
describing here, so you get instant feedback only on the stuff you’re
working with, and do end-to-end runs in the background.

+1 to this Stephen, I am with you 100%.

A direct email about the pipe-dream is on its way.

similar to Cucumber’s and RSpec’s[2] tagging feature (yet to come
for RSpec). The cool thing about Micronaut is that they have tied
it into autotest. Ideally, we could be able to tell autotest, or
whatever program, to only run tests that are tagged a certain way–
and then you could override that with the “focused” tag. So, we can
add that to our list of cool things to have. :)

I actually don’t think it should be necessary to tell the tool where
to focus. As long as it understands the relationship between your
tests and your source code, and you’re making changes test-first, the
tool should be able to know which parts of your code are unstable and
likely to need re-testing after a change.

Matt W.

http://blog.mattwynne.net

Kero van Gelder wrote:

could be able to tell autotest, or whatever program, to only run tests
(would it be nice if autotest just ran all your recently failed features
first and the rest later; compared to the last failure only, as it is
supposed to do now, afaik; perhaps after running previously failed
features, it can run newest features first, finally followed by the other
features).

The way Micronaut currently works is that when you tag an example or
example group as “focussed”, autotest will only run those. So you are
overriding its default behaviour and not running the entire suite.

Or are you saying it’s going to be separate from tags completely?

I’m feeling my first idea will get out of hand (adding complexity,
while not solving a specific need), and the third sounds bad from a
technical point of view. The second one comes from the observation
that “focussing” is a process, not a property. You state
“you remove the focussed tag”
yourself. “important” could be a property, if you don’t remove it.

Yeah, it is a process and that is what I like about it. When working on
a large suite it is impractical to run the entire suite in conjunction
with autotest. So, you end up running them by hand. I really like
using autotest though, and so by providing this feature one can still use
autotest in their workflow/process when working on new functionality in
a large project.

I haven’t used it too much, but it seems like a really useful thing to
have. Having autotest act on multiple tags may get too complicated, but
I think the “focussed” tag is pretty straightforward.

-Ben

On 13 Apr 2009, at 12:46, aslak hellesoy wrote:

So if someone develops a better AutoTest with a plugin architecture,
and that doesn’t have to run as a long-lived process, then I’d be
very interested in writing the neural network part - possibly backed
by FANN (http://leenissen.dk/fann/)

It’s so crazy it has to be tried!

Aslak

Aslak as usual you have stepped it up another seventy levels!

I’m really keen to write a generic test runner that works on both
RSpec and Cucumber examples alike (and could extend for other testing
libraries). Allowing for pluggable strategies for test selection as
Stephen has described is a terrific idea.

Aslak do you have any ideas in the pipeline for building a fancier
test runner for Cucumber? I know there was some brain-storming on the
list a while back about a ‘thick client’.

Matt W.

http://blog.mattwynne.net

On Mon, Apr 13, 2009 at 10:09 PM, Matt W. [email protected] wrote:

Aslak do you have any ideas in the pipeline for building a fancier
test runner for Cucumber? I know there was some brain-storming on the
list a while back about a ‘thick client’.

Maybe ;)

Aslak

One simple thing I asked about the other day was running multiple
instances of autotest to do different things. Currently I’d like to run
one for my specs and one for my features, but you could easily extend
this idea. Creating several profiles that run at the same time, with the
long running ones having a low priority, would give a range of feedback
that eventually would be fairly complete (on a big project it might
fully catch up overnight, or at the weekend) whilst providing sufficient
feedback to be able to iterate quickly with reasonable confidence.

2009/4/14 Stephen E. [email protected]

On Mon, Apr 13, 2009 at 7:46 AM, aslak hellesoy
[email protected] wrote:

So if someone develops a better AutoTest with a plugin architecture, and
that doesn’t have to run as a long-lived process, then I’d be very
interested in writing the neural network part - possibly backed by FANN
(http://leenissen.dk/fann/)

In the immortal words of Socrates, “That rocks.”

The nice thing about separating concerns like this – the reason why
design patterns appeal so much to me – is that pieces can rather
easily be built up incrementally. As Matt said, thinking ‘neural
networks’ with this is way beyond the level of anything I’d had in my
head. But it’s a damn cool idea.

I can think of a few ways to handle this particular chunk, with levels
of complexity that would scale up:

1.) DUMB REACTION: Just re-run the tests that failed in this run
cycle. Then, periodically, run tests that succeeded in the background
as a regression check. (This isn’t much beyond what Autotest does
now.)

2.) OBSERVERS: Allow handlers to register themselves against certain
files, so that when a file is changed, that handler gets run.
Multiple handlers can observe any given file, and a handler can
declare multiple rules, including directories or pattern matches.
(Again, Autotest has something sort of like this, but not nearly as
flexible. There’s a rough sketch of what I mean just after this list.)

3.) PERSISTENCE: Track the history of tests and the times they were
created, edited, last run, and last failed. Also track file
modification times. When a file changes, run the tests first that are
either new or have failed since the last time the file was changed.
Then run the tests that went from failing to passing in that time.
(This could certainly be improved – I haven’t sat down to figure out
the actual rule set – but you get the gist. Know when things changed
and set priorities accordingly.)

4.) INTELLIGENCE: Aslak’s neural network. Let the system figure out
which tests matter to which files, and run what it thinks it ought to
run. Maybe use code coverage analysis. It can ‘learn’ and improve
when the full suite is run and it discovers new failures.
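
Here’s that rough sketch of the registry idea from #2 - all the names are hypothetical, it’s just the shape of the thing:

# Handlers register interest in a path prefix or a glob pattern; a change
# notification fans out to every handler whose rule matches.
class ChangeDispatcher
  def initialize
    @rules = []   # [pattern, handler] pairs; a handler is anything callable
  end

  def observe(pattern, &handler)
    @rules << [pattern, handler]
  end

  # Called by whatever watches the filesystem (polling mtimes, FSEvents, etc.)
  def file_changed(path)
    @rules.each do |pattern, handler|
      handler.call(path) if File.fnmatch(pattern, path) || path.index(pattern) == 0
    end
  end
end

dispatcher = ChangeDispatcher.new
dispatcher.observe('features/*.feature') { |path| puts "re-run feature #{path}" }
dispatcher.observe('lib/')               { |path| puts "re-run specs touching #{path}" }

dispatcher.file_changed('lib/cart.rb')            # matches the lib/ prefix rule
dispatcher.file_changed('features/cart.feature')  # matches the glob rule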

In all four of these cases, I still think it’s imperative to run the
full suite. None of these methods are foolproof, and code is tricky
and makes weird things happen in weird crevices. That’s why testing
must be done. But running the suite doesn’t have to be a ‘blocking’
activity, like it is with Autotest now. It can happen in bits and
pieces, when nothing else is going on, and it can be configured to
only grab your attention when something failed unexpectedly.

(That’s one of the prime reasons for the ‘multiple output views’
feature, by the way. When I said output views I wasn’t just thinking
of the console or a window. I’m also thinking Dashboard widgets, or
gauges in the toolbar, or RSS feeds, or dynamic wallpaper, or whatever
else anyone can think of. Stuff that stays out of your way until it
either needs you or you choose to look at it.)

Still making sense? This is starting to sound pretty big and pretty
complex – but I don’t think it strictly needs to be. #1 and #2 above
are pretty easy. The others don’t have to be built before releasing
something. And, of course, you wouldn’t have to pick just one
selection module. You could run or disable all of these depending on
your project’s needs, or come up with your own system involving Tarot
cards and Linux running on a dead badger.(*)

I just want to build a core that detects changes in stuff, tells other
stuff about it, and passes on what that stuff says about it to a third
set of stuff. The rest is implementation-specific details. >8->

(* http://www.strangehorizons.com/2004/20040405/badger.shtml )


Have Fun,
Steve E. ([email protected])
ESCAPE POD - The Science Fiction Podcast Magazine
http://www.escapepod.org

Yeah this is all good stuff.

We’re basically talking about ‘personal CI’ here, I think[1]. Nothing
wrong with that, but we must remember some of this can be deferred
until check-in.

[1] http://silkandspinach.net/2009/01/18/a-different-use-for-cruisecontrol/


Matt W.

http://blog.mattwynne.net