Overview
Duplicate code can be hard to find, especially in a large project. But PMD’s Copy/Paste Detector (CPD) can find it for you!
CPD works with Java, JSP, C/C++, C#, Go, Kotlin, Ruby, Swift and many more languages.
It can be used from the command line or via an Ant task.
It can also be run with Maven by using the cpd-check goal on the Maven PMD Plugin.
Your own language is missing? See how to add it here.
Why should you care about duplicates?
It’s certainly important to know where to get CPD and how to call it, but it’s worth stepping back for a moment and asking yourself why you should care about duplicate code blocks at all.
Assuming duplicated blocks of code are supposed to do the same thing, any refactoring, even a simple one, must be duplicated too. That is unrewarding grunt work, and it puts pressure on the developer to find every place in which to perform the refactoring. Automated tools like CPD can help with that to some extent.
However, failure to keep the code in sync may mean automated tools will no longer recognise these blocks as duplicates. This means the task of finding duplicates to keep them in sync when doing subsequent refactorings can no longer be entrusted to an automated tool – adding more burden on the maintainer. Segments of code initially supposed to do the same thing may grow apart undetected upon further refactoring.
Now, if the code will never change in the future, then this is not a problem.
Otherwise, the most viable solution is to not duplicate. If the duplicates are already there, then they should be refactored out. We thus advise developers to use CPD to help remove duplicates, not to help keep duplicates in sync.
Refactoring duplicates
Once you have located some duplicates, several refactoring strategies may apply depending on the scope and extent of the duplication. Here’s a quick summary:
- If the duplication is local to a method or single class:
  - Extract a local variable if the duplicated logic is not prohibitively long
  - Extract the duplicated logic into a private method
- If the duplication occurs in siblings within a class hierarchy:
  - Extract a method and pull it up in the class hierarchy, along with common fields
  - Use the Template Method design pattern
- If the duplication occurs consistently in unrelated hierarchies:
  - Introduce a common ancestor to those class hierarchies
Novice and advanced readers alike may want to read on at Refactoring Guru for more in-depth strategies, use cases and explanations.
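As an illustration of one strategy from the summary above (extracting duplicated logic into a private method), here is a minimal Java sketch. The class PriceFormatter and its methods are hypothetical and exist only for this example; they are not part of PMD or CPD.

```java
import java.util.Locale;

// Hypothetical example class, used only to illustrate "extract method".
final class PriceFormatter {

    // Before the refactoring, each of the two methods below carried its own
    // copy of the formatting expression that now lives in format().
    static String netPrice(double amount) {
        return "net: " + format(amount);
    }

    static String grossPrice(double amount, double taxRate) {
        return "gross: " + format(amount * (1 + taxRate));
    }

    // The extracted private method: the single place to change the formatting.
    private static String format(double value) {
        return String.format(Locale.ROOT, "%.2f EUR", value);
    }
}
```

After such a change, a later adjustment to the number format is made once in format() rather than in every copy, which is the situation the advice above aims for.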
CLI Usage
CLI options reference
Option | Description | Default | Applies to |
---|---|---|---|
--minimum-tokens | Required. The minimum token length which should be reported as a duplicate. | | |
--files, --dir, -d | Required. List of files and directories to process. | | |
--file-list | Path to a file containing a comma-delimited list of files to analyze. If this is given, you don't need to provide --dir. | | |
--language | Source code language. | java | |
--debug, --verbose, -v, -D | Debug mode. Prints more log output. | | |
--encoding, -e | Character encoding to use when processing files. If not specified, CPD uses the system default encoding. | | |
--skip-duplicate-files | Ignore multiple copies of files of the same name and length in comparison. | false | |
--exclude | Files to be excluded from the CPD check. | | |
--non-recursive | Don't scan subdirectories. | false | |
--skip-lexical-errors | Skip files which can't be tokenized due to invalid characters instead of aborting CPD. | false | |
--format | Report format. | text | |
--fail-on-violation <bool> | By default CPD exits with status 4 if code duplications are found. Disable this option with --fail-on-violation false to exit with 0 instead and just write the report. | true | |
--ignore-literals | Ignore number values and string contents when comparing text. | false | Java |
--ignore-identifiers | Ignore constant and variable names when comparing text. | false | Java |
--ignore-annotations | Ignore language annotations (Java) or attributes (C#) when comparing text. | false | C#, Java |
--ignore-literal-sequences | Ignore sequences of literals (common e.g. in list initializers). | false | C#, C++, Lua |
--ignore-usings | Ignore using directives in C# when comparing text. | false | C# |
--no-skip-blocks | Do not skip code blocks matched by --skip-blocks-pattern. | false | C++ |
--skip-blocks-pattern | Pattern to find the blocks to skip. It is a string property and consists of two parts, separated by \|. The first part is the start pattern, the second part is the ending pattern. | #if 0\|#endif | C++ |
--uri | URI to process. | | PLSQL |
--help, -h | Print help text. | false | |
Examples
Note: The following examples use the Linux start script. For Windows, just replace “./run.sh cpd” with “cpd.bat”.
Minimum required options: Just give it the minimum duplicate size and the source directory:
$ ./run.sh cpd --minimum-tokens 100 --dir /usr/local/java/src/java
You can also specify the language:
$ ./run.sh cpd --minimum-tokens 100 --dir /path/to/c/source --language cpp
You may wish to check sources that are stored in different directories:
$ ./run.sh cpd --minimum-tokens 100 --dir /path/to/other/source --dir /path/to/other/source --dir /path/to/other/source --language fortran
There should be no limit to the number of --dir options you may add… But if you stumble upon one, please tell us!
And if you’re checking a C source tree with duplicate files in different architecture directories, you can skip those using --skip-duplicate-files:
$ ./run.sh cpd --minimum-tokens 100 --dir /path/to/c/source --language cpp --skip-duplicate-files
You can also specify the encoding to use when parsing files:
$ ./run.sh cpd --minimum-tokens 100 --dir /usr/local/java/src/java --encoding utf-16le
You can also specify a report format - here we’re using the XML report:
$ ./run.sh cpd --minimum-tokens 100 --dir /usr/local/java/src/java --format xml
The default format is a text report, and there’s also a csv report.
Note that CPD is pretty memory-hungry; you may need to give Java more memory to run it, like this:
$ export PMD_JAVA_OPTS=-Xmx512m
$ ./run.sh cpd --minimum-tokens 100 --dir /usr/local/java/src/java
In order to change the heap size under Windows, you’ll need to edit the batch file cpd.bat or set the environment variable PMD_JAVA_OPTS prior to starting CPD:
C:\ > cd C:\pmd-bin-6.55.0\bin
C:\...\bin > set PMD_JAVA_OPTS=-Xmx512m
C:\...\bin > .\cpd.bat --minimum-tokens 100 --dir c:\temp\src
If you specify a source directory but don’t want to scan the sub-directories, you can use the non-recursive option:
$ ./run.sh cpd --minimum-tokens 100 --non-recursive --dir /usr/local/java/src/java
Exit status
Please note that if CPD detects duplicated source code, it will exit with status 4 (since 5.0). This behavior has been introduced to ease CPD integration into scripts or hooks, such as SVN hooks.
0 | Everything is fine, no code duplications found |
1 | Couldn't understand command line parameters or CPD exited with an exception |
4 | At least one code duplication has been detected unless `--fail-on-violation false` is used. |
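When CPD is called from a build script or hook, the exit codes in the table above can be turned into a decision. The following is a minimal sketch in Java; the class CpdExitStatus and its methods are hypothetical names introduced for this example, and only the codes 0, 1 and 4 come from the table.

```java
// Hypothetical helper for interpreting CPD's exit status in a calling program.
final class CpdExitStatus {

    // Maps an exit code to the meaning documented in the table above.
    static String describe(int exitCode) {
        switch (exitCode) {
            case 0:
                return "no code duplications found";
            case 1:
                return "invalid command line parameters or CPD exited with an exception";
            case 4:
                return "at least one code duplication detected";
            default:
                // Not documented in the table; reported as-is.
                return "unexpected exit code " + exitCode;
        }
    }

    // Assumption for this sketch: the caller treats any non-zero code as a failure.
    static boolean shouldFailBuild(int exitCode) {
        return exitCode != 0;
    }
}
```

In a real script the exit code could be obtained from Process.waitFor() after starting CPD with ProcessBuilder; that part is omitted here because it depends on the local installation.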
Supported Languages
- C#
- C/C++
- Dart
- EcmaScript (JavaScript)
- Fortran
- Gherkin (Cucumber)
- Go
- Groovy
- HTML
- Java
- JSP
- Kotlin
- Lua
- Matlab
- Modelica
- Objective-C
- Perl
- PHP
- PL/SQL
- Python
- Ruby
- Salesforce.com Apex
- Scala
- Swift
- Visualforce
- XML
Available report formats
- text : Default format
- xml
- csv
- csv_with_linecount_per_file
- vs
For details, see CPD Report Formats.
Ant task
Andy Glover wrote an Ant task for CPD; here’s how to use it:
<target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>
Attribute reference
Attribute | Description | Default | Applies to |
---|---|---|---|
minimumtokencount | Required. A positive integer indicating the minimum duplicate size. | | |
encoding | The character set encoding (e.g., UTF-8) to use when reading the source code files, and also when producing the report. A word of warning: even if you set the encoding value properly, say to UTF-8, but run CPD on source files encoded with CP1252, you may end up with a report that is not valid UTF-8. CPD copies pieces of source code directly into its report, so the copied text keeps the encoding of the source files. If not specified, CPD uses the system default encoding. | | |
format | The format of the report (e.g. csv, text, xml). | text | |
ignoreLiterals | If true, CPD ignores literal value differences when evaluating a duplicate block. This means that foo=42; and foo=43; will be seen as equivalent. You may want to run CPD with this option off to start with and then switch it on to see what it turns up. | false | Java |
ignoreIdentifiers | Similar to ignoreLiterals but for identifiers, i.e. variable names, method names, and so forth. | false | Java |
ignoreAnnotations | Ignore annotations. More and more modern frameworks use annotations on classes and methods, which can be very redundant and trigger CPD matches. With J2EE (CDI, transaction handling, etc.) and Spring (everything), annotations become very redundant. Often classes or methods have the same 5-6 lines of annotations. This causes false positives. | false | Java |
ignoreUsings | Ignore using directives in C#. | false | C# |
skipDuplicateFiles | Ignore multiple copies of files of the same name and length in comparison. | false | |
skipLexicalErrors | Skip files which can't be tokenized due to invalid characters instead of aborting CPD. | false | |
skipBlocks | Enables or disables skipping of blocks, like a pre-processor. See also the skipBlocksPattern option. | true | C++ |
skipBlocksPattern | Configures the pattern used to find the blocks to skip. It is a string property and consists of two parts, separated by \|. The first part is the start pattern, the second part is the ending pattern. | #if 0\|#endif | C++ |
language | Flag to select the appropriate language (e.g. c, cpp, cs, java, jsp, php, ruby, fortran, ecmascript, and plsql). | java | |
outputfile | The destination file for the report. If not specified, the console will be used instead. | | |
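To picture what ignoreLiterals does, the sketch below replaces every literal token with a placeholder before comparing two snippets, so that foo=42; and foo=43; yield the same token sequence. This is only an illustration under simplifying assumptions: the class LiteralNormalizer, its token pattern and the placeholder are hypothetical and are not CPD's actual tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of literal normalization; not CPD's implementation.
final class LiteralNormalizer {

    // Simplified token pattern: a string literal, a number, an identifier,
    // or any other single non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile(
            "\"(?:[^\"\\\\]|\\\\.)*\"|\\d+(?:\\.\\d+)?|[A-Za-z_]\\w*|\\S");

    static List<String> tokens(String source, boolean ignoreLiterals) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(source);
        while (matcher.find()) {
            String token = matcher.group();
            // In this sketch a token is a literal if it starts with a quote or a digit.
            boolean literal = token.startsWith("\"") || Character.isDigit(token.charAt(0));
            result.add(ignoreLiterals && literal ? "<literal>" : token);
        }
        return result;
    }
}
```

With ignoreLiterals set to true, the token lists for foo=42; and foo=43; compare as equal; with false they differ in the third token. The same idea applies to ignoreIdentifiers, with identifier tokens replaced instead.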
Also, you can get verbose output from this task by running ant with the -v flag; i.e.:
ant -v -f mybuildfile.xml cpd
Also, you can get an HTML report from CPD by using the XSLT script in pmd/etc/xslt/cpdhtml.xslt. Just run the CPD task as usual and right after it invoke the Ant XSLT script like this:
<xslt in="cpd.xml" style="etc/xslt/cpdhtml.xslt" out="cpd.html" />
GUI
CPD also comes with a simple GUI. You can start it via some scripts in the bin folder:
For Windows:
cpdgui.bat
For Linux:
./run.sh cpd-gui
[Screenshot: the CPD GUI after running on the JDK 8 java.lang package]
Suppression
Arbitrary blocks of code can be ignored through comments in Java, C/C++, Dart, Go, JavaScript, Kotlin, Lua, Matlab, Objective-C, PL/SQL, Python, Scala, Swift and C# by including the keywords CPD-OFF and CPD-ON.
public Object someParameterizedFactoryMethod(int x) throws Exception {
    // some unignored code
    // tell cpd to start ignoring code - CPD-OFF
    // mission critical code, manually loop unroll
    goDoSomethingAwesome(x + x / 2);
    goDoSomethingAwesome(x + x / 2);
    goDoSomethingAwesome(x + x / 2);
    goDoSomethingAwesome(x + x / 2);
    goDoSomethingAwesome(x + x / 2);
    goDoSomethingAwesome(x + x / 2);
    // resume CPD analysis - CPD-ON
    // further code will *not* be ignored
}
Additionally, Java allows toggling suppression with the annotations @SuppressWarnings("CPD-START") and @SuppressWarnings("CPD-END"); all code between them will be ignored by CPD.
This approach, however, is limited to the locations where @SuppressWarnings is accepted.
It’s legacy, and the newer comment-based approach should be favored.
// enable suppression
@SuppressWarnings("CPD-START")
public Object someParameterizedFactoryMethod(int x) throws Exception {
    // any code here will be ignored for the duplication detection
}
// disable suppression
@SuppressWarnings("CPD-END")
public void nextMethod() {
}
Other languages currently have no support for suppressing CPD reports. In the future, the comment-based approach will be extended to those that can support it.
Credits
CPD has been through three major incarnations:
- First we wrote it using a variant of Michael Wise’s Greedy String Tiling algorithm (our variant is described here).
- Then it was completely rewritten by Brian Ewins using the Burrows-Wheeler transform.
- Finally, it was rewritten by Steve Hawkins to use the Karp-Rabin string matching algorithm.