Source code clone

from Wikipedia, the free encyclopedia

In the source code cloning (also code duplicates , software cloning or simply cloning ) is similar sections in the source code of a computer program . Such clones are a form of redundant code and are mainly created by copying and pasting . Because they have a negative impact on maintenance , clones are a strong indicator of poor quality .

similarity

Clones are generally similar sections of code. The definition of similarity is thus a central point that is still being discussed intensively in research. However, there is agreement that there is no universally applicable definition, but that it is always based on the specific application. When defining similarity, the following aspects, among others, must be taken into account:

Minimum length
Since every program is made up of the atomic elements (e.g. keywords, operators, identifiers, ...) of the respective programming language, there is always similarity between different code fragments at this level. However, there are seldom meaningful abstraction options at this level and even viewing these atomic elements as clones is rarely helpful. The definition of similarity therefore includes the definition of a minimum length, which specifies the minimum length of code sections must be in order for them to be regarded as clones. The unit in which this minimum length is specified depends on the recognition method. Examples of minimum lengths are values ​​such as 10 lines, 100 tokens, 7 instructions, etc.
normalization
In addition to the minimum length, it has proven useful to normalize the code with regard to the definition of similarity. In many cases , for example, white space , indentation style, and comments in the code related to clones are ignored. Another common normalization is the abstraction of identifiers , i. H. two loops are considered to be clones, even if i is used as a loop variable in one and j in the other . Literals can be normalized in a similar way . More complex normalization could, for example, consist in making the order of the operands unambiguous in commutative operations or in abstracting the type of loop in the case of loop constructs. Which normalization makes sense generally depends heavily on the application. However, since normalization has a great influence on the results, it should be carried out conscientiously. This also requires a certain amount of experience in clone recognition.
Unequal sections ("gaps")
Even if the definition of similarity can be taken very broadly with normalization, it must still be considered whether clones may also contain smaller, unequal sections ("gaps"). The advantage is that several small clones merge to form larger clones, which makes the redundancy visible at a higher level. The gaps also indicate differences that may be unintentional. However, not all tools support the detection of clones with gaps. In addition, it must be specified how many gaps a clone may contain.

Differentiation from redundant literals

Clones are defined by the fact that code sections of a certain size are similar to one another. For this reason, single duplicated tokens are usually not referred to as clones. This mainly affects redundant literals in the source text, which should be represented better as symbolic constants (for example all occurrences of the literal 3.14159 by a constant with the identifier PI ). In order to avoid the repetition of expressions in the source texts, categorical symbolic constants should be used for reasons of maintainability and more efficient troubleshooting.

Terms

There are a number of terms that have become established in the field of software clones over time.

clone
A section of code that is similar to at least one other section of code. Alternatively, the term fragment is also used. The similarity is represented by either a pair of clones or a class of clones .
Type-1 clones
Identical source text sections except for layout and comments.
Type 1 clone pair
int a = 0;
int b = a * a;
String str = "Peter";
int a = 0; // Comment
int b = a * a;
String str = "Peter";
Type 2 clones
Type-1 clones with additional differences in the identifiers or literals.
Type 2 clone pair
int a = 0;
int b = a * a;
String str = "Peter";
int a = 0; // Comment
int x = a * a;
String str = "Peter";
Type-3 clones
Type-2 clones that contain unequal sections (gaps). The term near-miss clone is also used in the English-language literature in particular .
Type-3 clone pair
int a = 0;
int b = a * a;

String str = "Peter";
int a = 0; // Comment
int x = a * a;
print(b);
String str = "Peter";
Type-4 clones
Clones that are semantically similar but not necessarily syntactically similar. The syntactic similarity is typically less than 16% and other differences lie in the algorithms used, data structures, libraries used, input / output handling and the object-oriented design. Because of the impossibility of determining semantic similarity, this type of clone has little practical use.
Clone pair
A clone pair represents the relation between exactly two similar source code sections (clones).
Clone class
A clone class represents the relation between two or more similar source code sections (clones).
Clone coverage
The proportion of the system that is part of a clone. Often used to quantify the amount of redundancy in a system.

reasons

Clones are created almost exclusively through "copy & paste programming". There are a number of reasons why clones occur, which can be divided into the following categories:

  • Development strategy : Existing functionality is used as a template for new functionality to be created. The existing code is copied and adapted as required. This also includes situations in which code is duplicated in order to test changes without endangering the function of the system (branching).
  • Maintenance advantages : The reuse of existing code is intended to facilitate future maintenance. For example, proven code is reused to minimize the risk of new mistakes. In addition, copying code can avoid undesirable dependencies between components and enable separate maintenance.
  • Bypassing restrictions : Programming languages ​​offer different types of abstraction mechanisms. If no suitable abstraction option is available in a particular situation, this problem is often solved by cloning. However, clones can also arise due to time pressure and a lack of knowledge on the part of the developer.
  • Independent implementation : Due to a lack of knowledge of the existing functionality, this is implemented multiple times. In addition, clones can result from the use of libraries that require a specific protocol.

Negative impacts

Clones are generally considered anti-patterns because they contradict the " Don't repeat yourself " (DRY) principle and are considered the most common trait of bad code. For the reasons mentioned, they top the list of so-called code smells . And while there are a variety of reasons for creating them, clones can have serious negative effects on software and its maintenance.

  • Increased maintenance effort: Clones increase the maintenance effort significantly. Identical code may have to be read and understood several times. Additional effort is required to understand whether small differences are intentional or unintentional. In the event of a change, this must be carried out and tested several times. For each copy, it must at least be checked whether the change must also be made here - even if the copy is not changed at the end.
  • Inconsistent changes : When making changes to clones, there is always the risk of overlooking individual copies. The more developers are involved in a project and unknowingly copy from each other, the worse this problem is. Even if existing code is used as a template and has to be adapted, there is always the risk that this adaptation will not be carried out correctly. Unintentionally inconsistent changes can be particularly problematic if the change is a bug fix. The error is corrected at one point and it is assumed that the problem is solved, although the error remains in at least one other place in the system. These unintentional inconsistent changes have been shown to exist in large numbers in productive systems.
  • Increased memory requirements : Another consequence of cloning is that the code size is larger than it should be. This affects both the storage space of the source code files and the size of the compiled system. This can lead to problems, especially in the area of embedded systems , and make more expensive hardware necessary. In addition, an increased amount of code also means that the compiler needs more time .
  • Copying errors : There is a risk that the source text containing errors has been copied. As a result, the error must later be found in various places in the source text and corrected. There is a risk of overlooking individual occurrences.

Clone detection

There are a number of clone detection methods that can be roughly categorized based on the program representation with which they work. The procedures differ in terms of runtime , complexity , quality of the results and available normalizations.

  • Text : These methods have no language-specific knowledge and interpret the source text as pure text. As a result, these methods can be implemented comparatively quickly and easily. The disadvantage is that they depend on the layout of the source text and offer very few possibilities for normalization.
  • Tokens : These procedures are based on a lexical analysis of the source text and work on a sequence of tokens. These methods are still comparatively quick and easy to implement. With a certain knowledge of the analyzed language, normalizations such as B. ignoring comments or abstracting identifiers. In addition, these methods are not susceptible to differences in the layout of the source text.
  • Tree : These procedures work on the abstract syntax tree (AST) of the program and search for similar subtrees in it. As a result, they offer further normalization options such as B. the abstraction from the kind of a loop or the order of the operands in commutative operations. Tree-based methods are, however, more complex to implement and require significantly more runtime. They also require that the source text is syntactically correct and can be parsed . This is a clear disadvantage compared to text-based and token-based methods.
  • Graph : These procedures mostly work on the Program Dependency Graph (PDG) and search for similar sub-graphs. This makes them the most complex to implement and at the same time the slowest process. Their advantage is that they can recognize clones with little syntactic similarity that cannot be found by any of the other methods.
  • Metrics : These methods calculate for certain entities in the source text (eg. Methods ) metrics and identify clones by comparing these metrics. They are quite easy to implement and also comparatively quick. The disadvantage is that clones can only be found at the predetermined granularity level. So z. B. no clones are found within methods if the metrics are calculated on the basis of whole methods. In practice, these methods are rarely used, since many entities happen to have similar metric values ​​without any syntactic or semantic similarity.

Not every procedure can be clearly assigned to one of the categories, as some procedures combine elements from different categories. So exist z. B. hybrid methods that first create an abstract syntax tree, then serialize it and use a token-based method for recognizing clones in the sequence of the serialized nodes.

Furthermore, clone detection methods can be differentiated according to whether they work incrementally or not. An incremental procedure recognizes clones in several successive versions of the source code. Compared to the repeated application of a non-incremental method for each individual version, incremental methods make use of the analysis results of the previous version. As a result, incremental methods offer a clear speed advantage for clone detection across multiple versions of the source code.

Tools

There are various tools for the static analysis of program code that can also find source text clones. This includes numerous free tools such as the PMD plug-in CPD (Copy / Paste Detector) , Clone Digger (for Python and Java ), Cppcheck (for C ++ and C ) and ConQAT (for Ada , ABAP , C # , C , C ++, COBOL , Java , Visual Basic , PL / I ) as well as proprietary tools such as CCFinder (Code Clone Finder) or Simian (Similarity Analyzer) .

Clone management

Clone management summarizes all activities in dealing with clones. This includes both the tool-supported detection of clones and the selection and implementation of appropriate countermeasures. Possible countermeasures include:

Remove

The clones are removed by creating a suitable abstraction with which the redundant source code sections can be standardized. This can be done, for example, by outsourcing recurring algorithms to procedures or methods . Extract method refactoring is often used for this .

In the object-oriented language, there is also the possibility of removing clones by using inheritance and combining the copies in a common base class (pull-up method refactoring ).

In languages ​​that contain a preprocessor , clones can be removed using appropriate macros. In the case of clones that comprise several files or entire subsystems (“structural clones”), it is advisable to remove the clones by relocating them to modules or libraries .

Observe

Clones are not always easy to remove. An alternative approach is to observe the clones with an appropriate tool. When changing a clone, the developer is notified of the existence of additional copies. This avoids the risk of inadvertently inconsistent changes.

Clones outside the source code

Clone detection focuses heavily on similar sections in the source code of a program. In addition, clones also exist in other artifacts in software development such as B. Models or requirement specifications . The reasons for the clones and their negative effects are largely transferrable. The definition of similarity, however, has to be adapted. While there are already established procedures for the detection of clones in the source code, intensive research is still being carried out on the detection of clones in other artifacts.

example

The following example in Java shows how clones can be removed by an Extract-Method refactoring . The source code calculates the sum of the values ​​in two arrays.

public int sum(int[] values1, int[] values2) {
    int sum1 = 0;
    int sum2 = 0;

    for (int i = 0; i < values1.length; i++)
    {
        sum1 += values1[i];
    }

    for (int i = 0; i < values2.length; i++)
    {
        sum2 += values2[i];
    }

    return sum1 + sum2;
}

The loop that does the actual calculation can be extracted into its own function.

public int sum(int[] values)
{
   int sum = 0;
   for (int i = 0; i < values.length; i++)
   {
       sum += values[i];
   }
   return sum;
}

The new function can be called once for each array. This removed the clones and reduced redundancy.

public int sum(int[] values1, int[] values2) {
    return sum(values1) + sum(values2);
}

The calculation of the sum of the values ​​of an array represents a clone outside of the source code, since from Java 8 onwards there are streams for such requirements. The public int sum(int[] values)method can therefore be deleted and the second method changed as follows:

public int sum(int[] values1, int[] values2) {
    return IntStream.of(values1).sum() + IntStream.of(values2).sum();
}

literature

  • Martin Fowler : Refactoring. How to improve the design of existing software. Addison-Wesley, Munich 2000 (original title: Refactoring. Improving The Design Of Existing Code , translated by Bernd Kahlbrandt), ISBN 3-8273-1630-8 .
  • Nils Göde: Clone Evolution. Dissertation, University of Bremen, Logos Verlag Berlin GmbH, Berlin 2011, ISBN 978-3-8325-2920-8 .
  • Elmar Juergens, Florian Deissenboeck, Benjamin Hummel: A Workbench for Clone Detection Research . (PDF, 359 kB) In: Proceedings of the 31 st International Conference on Software Engineering , IEEE Computer Society, in 2009.
  • Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, Stefan Wagner: Do code clones matter? (PDF, 259 kB) In: Proceedings of the 31 st International Conference on Software Engineering , pp 485-495. IEEE Computer Society, 2009.
  • Rainer Koschke: Survey of Research on Software Clones . (PDF; 222 kB) Dagstuhl Seminar: Duplication, Redundancy, and Similarity in Software, 2006.
  • Dhavleesh Rattan, Rajesh Bhatia, Maninder Singh: Software clone detection: A systematic review. Information and Software Technology, Volume 55, 2013, Issue 7, pp. 1165-1199.
  • Chanchal Kumar Roy, James R. Cordy: A survey on software clone detection research . (PDF; 577 kB) Technical report, Queens University at Kingston, Ontario (Canada) 2007.

Web links

Individual evidence

  1. Markus Bautsch: Code repetition - symbolic constants . In: Structured Programming . Wikibook
  2. Stefan Wagner, Asim Abdulkhaleq, Ivan Bogicevic, Jan-Peter Ostberg, Jasmin Ramandani. How are functionally similar code clones different? . PeerJ Preprints, 2015.
  3. Chanchal Kumar Roy, James R. Cordy: A survey on software clone detection research . (PDF; 577 kB) Technical report, Queens University at Kingston, Ontario (Canada) 2007, pp. 3–7.
  4. ^ Martin Fowler : Refactoring. How to improve the design of existing software. Addison-Wesley, Munich 2000 (original title: Refactoring. Improving The Design Of Existing Code , translated by Bernd Kahlbrandt), ISBN 3-8273-1630-8 , pp. 67–82
  5. Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, Stefan Wagner: Do code clones matter? (PDF, 259 kB) In: Proceedings of the 31 st International Conference on Software Engineering . IEEE Computer Society, 2009, pp. 485-495.
  6. Nils Göde: Clone Evolution. Dissertation, University of Bremen, 2011, pp. 15–21.
  7. Nils Göde, Rainer Koschke: Incremental clone detection. In: Proceedings of the 13 th European Conference on Software Maintenance and Reengineering . IEEE Computer Society, 2009, pp. 219-228.
  8. ^ Benjamin Hummel, Elmar Juergens, Lars Heinemann, Michael Conradt: Index-based code clone detection. Incremental, distributed, scalable . (PDF, 440 kB) In: Proceedings of the 26 th International Conference on Software Maintenance . IEEE Computer Society, 2010.
  9. Rainer Koschke: Frontiers of software clone management . In: Proceedings of Frontiers of Software Maintenance . IEEE Computer Society, 2008, pp. 119-128.
  10. Markus Bautsch: Code repetition - method calls . In: Structured Programming . Wikibook
  11. Markus Bautsch: Avoiding code repetition through inheritance . In: Structured Programming . Wikibook
  12. Florian Deissenboeck, Benjamin Hummel, Elmar Jürgens, Bernhard Schätz, Stefan Wagner, Jean-François Girard, Stefan Teuchert: Clone detection in automotive model-based development . (PDF, 409 kB) In: Proceedings of the 30 th International Conference on Software Engineering . ACM, 2008, pp. 603-612.
  13. Florian Deissenboeck, Benjamin Hummel, Elmar Juergens, Michael Pfaehler: Model clone detection in practice . (PDF, 291 kB) In: Proceedings of the 4 th International Workshop on Software Clones . ACM, 2010, pp. 57-64.
  14. Harald Störrle: Towards clone detection in UML domain models. In: Proceedings of the 4 th European Conference on Software Architecture . ACM, 2010, Companion Volume, pp. 285-293.
  15. Christoph Domann, Elmar Juergens, Jonathan Streit: The curse of copy & paste cloning in requirements specifications . (PDF; 107 kB) In: Proceedings of the 3 rd International Symposium on Empirical Software Engineering and Measurement . IEEE Computer Society, 2009, pp. 443-446.
  16. Elmar Juergens, Florian Deissenboeck, Martin Feilkas, Benjamin Hummel, Bernhard Schaetz, Stefan Wagner, Christoph Domann, Jonathan Streit: Can clone detection support quality assessments of requirements specifications? (PDF, 400 kB) In: Proceedings of the 32 nd International Conference on Software Engineering . ACM, 2010, pp. 79-88.