I’ve been looking at software metrics recently as a way to track how we’re doing with our code quality. Of course, not all metrics are equal, and they are often about as helpful and accurate as a 7-day weather report… Neal Ford from Thoughtworks recently gave a presentation with his take on software metrics, which he said was something like “reading tea leaves”. A single metric on its own, out of context, may not tell you anything at all (what does a cyclomatic complexity rating of 25 mean, anyway? Is that good or bad? Hell, this rating doesn’t even have UNITS!). It’s only when you group it with other metrics and compare the results against other pieces of code that you start to get some sort of idea.
I consider the CRAP4J crappiness rating to be something of an exception. The people who came up with it had a great idea: what if we take two metrics that offset one another, combine them and boil the result down into a new value? In other words, what if we do some of the tea reading for you? The two metrics they are using are McCabe’s cyclomatic complexity rating, which gives you a measure more or less of how many different paths there are through a given method (the more paths, the more complex), and a code coverage measure of their own, something like Cobertura’s, which measures how many of those paths are covered by unit tests. This makes perfect sense: if you have a lot of paths, that’s bad, but if you’re testing all the paths, that should be perfectly OK. If you were trying to read these leaves on your own, you would have to paw through all the parts of the code that have the highest complexity rating, then try to look up a corresponding report on test coverage…
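For the curious, the formula as I understand it from the CRAP4J authors’ write-up is CRAP(m) = comp(m)^2 * (1 - cov(m)/100)^3 + comp(m), where comp is the cyclomatic complexity of method m and cov is its test coverage as a percentage. Here is a minimal sketch of that calculation. The class and method names are my own invention for illustration and are not part of CRAP4J.

```java
/** Sketch of the CRAP score for one method; not CRAP4J's actual code. */
final class CrapScore {

    private CrapScore() {
    }

    /**
     * @param complexity      cyclomatic complexity of the method (at least 1)
     * @param coveragePercent test coverage of the method, from 0 to 100
     * @return complexity^2 * (1 - coverage/100)^3 + complexity
     */
    static double of(int complexity, double coveragePercent) {
        if (complexity < 1 || coveragePercent < 0 || coveragePercent > 100) {
            throw new IllegalArgumentException(
                    "complexity must be >= 1 and coverage within 0..100");
        }
        // Fraction of the method that the tests do not exercise.
        double uncovered = 1.0 - coveragePercent / 100.0;
        // Untested complexity is punished hard; full coverage leaves only
        // the complexity itself.
        return (double) complexity * complexity * Math.pow(uncovered, 3) + complexity;
    }
}
```

With this sketch, a method with a complexity of 25 and no tests scores 650, while the same method with 100% coverage scores 25, which is the offsetting effect described above.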
So my hat’s off to the guys that did CRAP4J.
I’ve also been considering coming up with a metric of my own that I call “Legaciness”. I get my inspiration from Michael Feathers’ fantastic book “Working Effectively with Legacy Code”. In his book, he defines legacy code as code which isn’t covered by unit tests. That in itself is a great definition. But the focus of my metric wouldn’t be on the test coverage itself (which we can already measure). The majority of the book focuses on all the tricky issues that can make “legacy” code hard to test in the first place. He lists all sorts of bad practices, like static method calls, doing Bad Things in the constructor, and so on. Then the second half of the book enumerates a series of patterns that can be used in Java, C++ and other languages to safely and slowly refactor your code JUST ENOUGH to be able to slip it into some unit tests, after which you should be able to redesign your code as you wish.
The key thing I’d like to measure here is exactly HOW HARD it would be to get some unit tests around your code. In other words, what would be the amount of effort necessary to refactor a class for unit testing; how many times would the Legacy Patterns have to be invoked in order to get 100% test coverage for that class? This measure is what I would call “Legaciness”.
I’ve gone through a lot of the thinking process as to how this sort of thing would be measured, and a number of complications have cropped up:
- Not all static calls are an obstacle to unit testing. For example, a utility method that converts a String to all caps wouldn’t get in the way. So, the metric ideally should take that sort of thing into account.
- Explicit calls to “new” are generally a bad thing, since there’s no way to substitute the object being instantiated with a mock object. But, again, if the class being instantiated is a simple Data Transfer Object (DTO) with a bunch of getters and setters, what’s the problem?
- Even if a call to “new” is a problem, one of the patterns to remove the obstacle is to simply extract the line of code into a protected method which can then be overridden in a test class. So the metric would have to somehow take into account when one of the legacy code patterns has been implemented to surmount these obstacles.
- The Legaciness score would have to be crafty enough to take into account certain levels of indirection. For example, it might be slightly bad that a method instantiates a class via “new”. But it should be considered MUCH WORSE if the class it instantiates also has a really bad Legaciness rating via, say, direct calls to I/O methods and threading and so on and so forth. So, there seems to be some need for an “inheritance” of Legaciness through these dependencies.
- When you get down to it, a lot of these bad dependencies are due to I/O calls, threading, container dependencies and so on. These dependencies are very likely outside your own code base (for example, in the base Java APIs). The only realistic measure I could think of is to come up with a base Legaciness score for all known API methods. This seems like more work than it’s worth (actually, Neal Ford mentioned to me that someone did this very thing with Ruby: assigned a score to each base API method for some nefarious metric of their own). Also, I don’t know if the same would be needed for other 3rd-party libraries, or if it would be enough to cover just the java.* and javax.* libraries, and then analyze their Legaciness based on their use of the Java APIs.
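To make the “extract into a protected method and override it” pattern from the list above concrete, here is a minimal sketch. All the class names (MailGateway, InvoiceNotifier and so on) are hypothetical and invented purely for illustration.

```java
// Hypothetical collaborator that would talk to a real mail server.
class MailGateway {
    void send(String to, String body) {
        throw new UnsupportedOperationException("needs a real mail server");
    }
}

// Class under test. The direct "new MailGateway()" used to sit inside
// notifyCustomer(); it has been extracted into createGateway().
class InvoiceNotifier {
    String notifyCustomer(String email, int amount) {
        MailGateway gateway = createGateway();
        String body = "You owe " + amount;
        gateway.send(email, body);
        return body;
    }

    // The seam: the only place that still calls "new".
    protected MailGateway createGateway() {
        return new MailGateway();
    }
}

// Test-side fake that records the message instead of sending it.
class RecordingGateway extends MailGateway {
    String lastMessage;

    @Override
    void send(String to, String body) {
        lastMessage = to + ": " + body;
    }
}

// Test-side subclass that overrides the seam to hand out the fake.
class TestableInvoiceNotifier extends InvoiceNotifier {
    final RecordingGateway gateway = new RecordingGateway();

    @Override
    protected MailGateway createGateway() {
        return gateway;
    }
}
```

A test can now call notifyCustomer() on a TestableInvoiceNotifier and inspect gateway.lastMessage without any mail server. A Legaciness tool would have to recognize that the “new” inside createGateway() is no longer an obstacle, because it sits behind an overridable method.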
It seems to me that Legaciness would be another very useful metric to have, much like CRAP4J. If the latter tells you where your code is complex and untested, the former could tell you just how hard it would be to give it the test coverage it needs. I think this metric would also be useful to help educate developers as to some of the same best practices that TDD is supposed to encourage.
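Just to make the idea a bit more tangible, here is a toy sketch of how such a score might be “inherited” through dependencies. Everything in it is an assumption made up for illustration: the class name, the cost of 1 per hard dependency, the inheritance factor of 0.5, and the base score given to an API class. A real tool would need to derive the dependency graph from bytecode or source and would need far better numbers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy Legaciness model; all names and weights are invented. */
final class LegacinessSketch {
    // Arbitrary weights, chosen only so the example is easy to follow.
    private static final double COST_PER_HARD_DEPENDENCY = 1.0;
    private static final double INHERITANCE_FACTOR = 0.5;

    // Hand-assigned scores for API classes that do I/O, threading, etc.
    private final Map<String, Double> baseScores = new HashMap<>();
    // Dependencies that do not hinder testing (DTOs, pure utilities).
    private final Set<String> harmless = new HashSet<>();
    // Dependencies reached through "new" or static calls, per class.
    private final Map<String, List<String>> hardDependencies = new HashMap<>();

    void setBaseScore(String name, double score) {
        baseScores.put(name, score);
    }

    void markHarmless(String name) {
        harmless.add(name);
    }

    void addHardDependencies(String name, String... dependencies) {
        hardDependencies.put(name, Arrays.asList(dependencies));
    }

    double score(String name) {
        return score(name, new HashSet<String>());
    }

    private double score(String name, Set<String> inProgress) {
        if (!inProgress.add(name)) {
            return 0.0; // already being scored: break the dependency cycle
        }
        double total = baseScores.getOrDefault(name, 0.0);
        List<String> dependencies =
                hardDependencies.getOrDefault(name, Collections.<String>emptyList());
        for (String dependency : dependencies) {
            if (harmless.contains(dependency)) {
                continue; // e.g. a DTO or a string utility costs nothing
            }
            // A fixed cost for the obstacle itself, plus a share of
            // whatever makes the dependency hard to test.
            total += COST_PER_HARD_DEPENDENCY
                    + INHERITANCE_FACTOR * score(dependency, inProgress);
        }
        inProgress.remove(name);
        return total;
    }
}
```

For example, give java.io.FileInputStream a base score of 10, let a hypothetical ReportWriter depend on it, and let a hypothetical ReportService depend on ReportWriter plus a harmless CustomerDto. ReportWriter then scores 6.0 and ReportService scores 4.0, whereas ReportService would score only 1.0 if ReportWriter had no problematic dependencies of its own. A dependency hidden behind an overridable factory method would simply not be registered as a hard dependency.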
Unfortunately, I don’t have the time to work out the details of such a metric. If anyone out there’s interested, or has any ideas, let me know!