Manage Duplicated Code with Sonar

If you use Sonar already, I am sure that you know already the worse of all 7 developer's deadly sins:

And if you don't, I would assume you know about duplicated / cloned / similar code when you talk about quality of code and that you have heard of tools such PMD CPD or Simian.

But why does copy paste matters from a code quality point of view? How can you benefit from Sonar to improve this? Let’s try to figure this out.

What is duplicated code?

Let's try to answer what sounds like a pretty simple question: what does "duplicated code" mean? Let's consider following four code fragments. They all will print 34, but are they duplicated?

int[] a = new int[10];
a[9] = 0;
a[8] = 1;
for (int i = 7; i >= 0; i--) {
a[i] = a[i+2] + a[i+1];
}
System.out.println(a[0]);

int f(int i) {
if (i == 0 || i == 1) return i;
return f(i - 2) + f(i - 1);
}
System.out.println(f(9));

int[] a = {34, 21, 13, 8, 5, 3, 2, 1, 1, 0};
System.out.println(a[0]);

int[] b = {0, 1, 1, 2, 3, 5, 8, 13, 21, 34};
System.out.println(b[9]);

Well, this is not entirely a fair question, because there is no agreement in the research community on the exact notion of duplication. Ira Baxter’s definition (2002) of clones expresses this vagueness:

Clones are segments of code that are similar according to some definition of similarity.

We can rephrase this more formally:

Code fragment (code region / portion / segment) is any sequence of code. It can be of any granularity, e.g., function definition, begin-end block, sequence of statements or sequence of tokens.
A code fragment CF2 is a duplication of another code fragment CF1 if they are similar by some given definition of similarity, that is, f(CF1) = f(CF2) where f is the similarity function.
Two fragments that are similar to each other form a duplication pair (CF1, CF2) and when many fragments are similar, they form a duplication group.
We prefer term "duplicate" ("duplication") over "clone", because at least it doesn't clash with a name of clone method in Java.

According to this definition, we still can have different notions of duplications depending on definition of similarity function. Moreover - human judgment of duplications is an issue and varies among experts. In one of experiments, for more than 60% of automatically detected duplications, three experts disagreed whether the fragments are really duplication or not. Even in SonarSource from time to time we have disputes about some code fragments. Let's make our life a bit simpler by using the following less formal definitions of a similarity function (in literature those definitions typically called types of duplications):

Identical code fragments except for variations in whitespace (may be also variations in layout) and comments.
Structurally / syntactically identical fragments except for variations in identifiers, literals, types, layout and comments. The reserved words and the sentence structures are essentially the same.
As previous, but with further modifications - statements can be changed, added and / or deleted in addition to variations in identifiers, literals, types, layout and comments.
Code fragments that perform the same computation but implemented through different syntactic variants.

You can try to use those definitions for given fragments by yourself to see difference ;)

How is such code typically created?

Here are some examples of why duplication occurs:

Reusing existing code by copying and pasting (with or without minor modifications) is the simplest form of reuse mechanism in the development process, which results in duplicated code.
Code may be borrowed from another system, which may not be modified. In such situations, the only way of reusing the existing code is to copy and paste with required changes.
Generating code with a tool using generative programming may produce huge duplications because these tools often use the same template to generate the same or similar logic.
Sometimes programming languages do not have sufficient abstraction mechanisms, e.g., inheritance, generic types or parameter passing, thus developers forced to repeatedly implement these as idioms, which leads to small and frequent duplications.
Duplications may be introduced by accidents: side effect of developers memories; coincidentally implementing the same logic by different developers.
I really want to believe that this is not about readers of this blog, however sometimes productivity of a developer is measured by the number of lines produced per day. In such circumstances, developers focused on reuse of the same code again and again by copying and pasting, instead of following a proper development strategy.

Why you should pay attention on such code fragments?

Propagation of bugs: if a code fragment contains a bug and this fragment is copied, then the bug will exist in all pasted fragments. More generally, duplicating code will also duplicate the associated technical debt.
Increased maintenance cost: any maintenance required on a copied code fragments will certainly need to be applied on the pasted ones, i.e. duplication multiplies the work to be done.
Increased time to understand and thus to improve/modify existing system if it contains a lot of duplications, because differences must be studied by developers before modifications.
As an indicator of a bad design, lack of good inheritance structure or abstraction.
As an indicator about copyright infringement.
Last, but not least - you can't manage what you don't measure.

In summary, this practice is evil !

What can Sonar offer to you?

Prior to version 2.11, Sonar was relying on PMD-CPD to detect duplicated code. PMD-CPD is a good tool with a great history which uses Karp-Rabin algorithm for list of tokens and is able to detect duplications of type 2 and partially type 3. But it also has some drawbacks:

It requires a lot of resources, especially on the memory side and thus is not hardly scalable on a large code base with millions of lines of code.
As a consequence of the previous point, the copy-paste detection is limited to boundaries of a single module / project.
Impossible to tune underlying algorithm to prevent false-positives and to increase precision.
No easy way to cover new languages without having a full lexer.
We observed that results may slightly vary depending on operating system where analysis was done.

Because of those drawbacks, we decided to implement our own library sonar-cpd to detect duplicated code. The first brick was created during Google Summer of Code 2011. And we should say a big thank you to the participants for their ideas, suggestions and efforts to help us. This first baby step already gave us a good feedback with Sonar 2.11:

We noted comparable performances:

We noted also lower memory peak :

With sonar-cpd, results are more accurate, controlled and predictable. The detection is based on "statements" and therefore we are able to detect duplications of type 2 and partially of type 3 (maybe one day we will go further) and we can reduce amount of false-positives (like repeated blocks of import statements for example).

And last but not least : because of the significant improvement in terms of performances, we are now providing the ability to detect cross-project duplications. This feature is for now only available for Java but this limitation will disappear in Sonar 2.14.

In summary many more opportunities for abstracting and mutualizing code, how cool is this!