First of all, thanks for the contribution!
Happily for you, adding CPD support for a new language is now easier than ever!
All you need to do is follow these few steps:
- Create a new module for your language; you can take Go as an example
- Create a Tokenizer
  - For Antlr grammars you can take the grammar from here and extend `AntlrTokenizer`, taking Go as an example:

    ```java
    public class GoTokenizer extends AntlrTokenizer {

        @Override
        protected AntlrTokenManager getLexerForSource(SourceCode sourceCode) {
            CharStream charStream = AntlrTokenizer.getCharStreamFromSourceCode(sourceCode);
            return new AntlrTokenManager(new GolangLexer(charStream), sourceCode.getFileName());
        }
    }
    ```

  - For JavaCC grammars you should subclass `JavaCCTokenizer`, which has many examples you could follow; you should also take the Python implementation as a reference
  - For any other scenario you can use `AnyTokenizer`
- Create your Language class:

  ```java
  public class GoLanguage extends AbstractLanguage {

      public GoLanguage() {
          super("Go", "go", new GoTokenizer(), ".go");
      }
  }
  ```

  Pro tip: yes, keep looking at Go! You are almost there!
- Update the list of supported languages
  - Write the fully-qualified name of your Language class to the file `src/main/resources/META-INF/services/net.sourceforge.pmd.cpd.Language`
  - Update the test that asserts the list of supported languages by updating the `SUPPORTED_LANGUAGES` constant in `BinaryDistributionIT`
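As a sketch of the services file's contents: it holds one fully-qualified class name per line. Assuming the `GoLanguage` class above lives in the `net.sourceforge.pmd.cpd` package (the actual package depends on your module), the line you add would look like:

```
net.sourceforge.pmd.cpd.GoLanguage
```

This is the standard Java `ServiceLoader` provider-configuration mechanism, which is how CPD discovers languages at runtime.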
- Please don't forget to add some tests; you can, again, look at the Go implementation ;)

If you read this far, I'm inclined to think you would also love to support some extra CPD configuration (ignoring imports, or crazy things like that). If that's your case, you came to the right place!

- You can add your custom properties using a token filter
  - For Antlr grammars all you need to do is implement your own `AntlrTokenFilter`. And by now, I know where you are going to look… WRONG! Why do you want Go to solve all your problems? You should take a look at the Kotlin token filter implementation
  - For non-Antlr grammars you can use `BaseTokenFilter` directly or take a peek at Java's token filter
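To make the idea concrete before you dive into the PMD classes: a token filter sits between the tokenizer and CPD, discarding tokens that should not count towards duplication (for example, entire import statements). The following is a self-contained sketch of that idea only; it does not use the actual `AntlrTokenFilter`/`BaseTokenFilter` API, and the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of what a token filter does: drop every token that
// belongs to an import statement, keep everything else. NOT the PMD API.
public class ImportDiscardingFilter {

    public static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean discarding = false;
        for (String token : tokens) {
            if (token.equals("import")) {
                discarding = true;   // start of an import statement
            }
            if (!discarding) {
                out.add(token);      // token participates in duplication detection
            }
            if (discarding && token.equals(";")) {
                discarding = false;  // import statement ends at ';'
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("import", "foo", ";", "class", "A", "{", "}");
        System.out.println(filter(tokens)); // [class, A, {, }]
    }
}
```

The real filters work the same way in spirit, but operate on the token objects produced by your lexer and are driven by CPD's configuration flags (such as ignoring imports).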
Testing your implementation
Add a Maven dependency on `pmd-lang-test` (scope `test`) in your `pom.xml`. This contains utilities to test your Tokenizer.
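As a sketch, the dependency declaration would look like the following (the `${pmd.version}` property is assumed to be defined in your build; check the current PMD version):

```xml
<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-lang-test</artifactId>
    <version>${pmd.version}</version>
    <scope>test</scope>
</dependency>
```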
For simple tests, create a test class extending from `CpdTextComparisonTest`. That class is written in Kotlin, but you can extend it in Java as well.

To add tests, you need to write regular JUnit `@Test`-annotated methods, and call the method `doTest` with the name of the test file.
For example, for the Dart language:

```java
public class DartTokenizerTest extends CpdTextComparisonTest {

    /**********************************
      Implementation of the superclass
     **********************************/

    public DartTokenizerTest() {
        super(".dart"); // the file extension for the dart language
    }

    @Override
    protected String getResourcePrefix() {
        // If your class is in src/test/java/some/package,
        // you need to place the test files in src/test/resources/some/package/cpdData
        return "cpdData";
    }

    @Override
    public Tokenizer newTokenizer() {
        // Override this abstract method to return the correct tokenizer
        return new DartTokenizer();
    }

    /**************
      Test methods
     **************/

    @Test // don't forget the JUnit annotation
    public void testLiterals() {
        // This will look for a file named literals.dart
        // in the directory identified by getResourcePrefix,
        // tokenize it, then compare the result against a baseline
        // literals.txt file in the same directory.
        // If the baseline file does not exist, it is created automatically.
        doTest("literals");
    }
}
```