Thoughts on Design and Computation

Customize Identifier Parsing in a Groovy DSL

I had just created what I thought was the perfect domain-specific language. It had fluent method chaining and closure-based configuration, all the usual tricks for building a natural DSL. One sticking point kept bothering me: user-defined names had to be valid Groovy identifiers. At first I thought this was no problem. Who would name things with weird characters like < or -? It turns out my customers in the genetics business do exactly that! They have assays with wonderful names such as RS123A->C. Thus began my quest to support those identifiers.

My Ideal DSL

The idea of a DSL is to let the user describe their domain in a natural way that doesn’t feel like they’re writing code.

assay RS123A->C

compositeAssay RS456-789 with {
    assay RS123A->C
}

Ideally, I want RS123A->C to be an identifier in its own right so I can take advantage of static typing. It would be great if I could modify the token parsing phase so that characters normally treated as operators are considered part of the identifier. One of Groovy’s selling points is its hooks for modifying the compilation process. Most of the documentation on the Internet covers abstract syntax tree (AST) transformations, which run too late for this kind of change. What I need to modify is the concrete syntax tree, and Groovy allows that through a custom ParserPlugin. Since Groovy migrated to ANTLR, however, it seems much harder to intervene during tokenization, before the AST is generated. The most basic grammar rules kick in first, such as treating - as the subtraction of two identifiers, and the script above lights up with errors. With that in mind, I opted to modify the source text before tokenization.

Rewriting your Code

The goal of modifying the source is to recognize certain strings as identifiers. To do this without modifying Groovy itself, we need to rewrite those strings into valid identifiers. I created a parser plugin that lets you register regular expressions, each built from a method name and one or more end tokens.

Essentially, the parser plugin rewrites the source above into the following.

assay __var1

compositeAssay __var2 with {
    assay __var1
}
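To make the rewrite step concrete, here is a minimal sketch of how such a substitution could work using plain regular expressions. It is not the plugin's actual code: the class name IdentifierRewriter, its addPattern and rewrite methods, and the rule that a registered method must be the first token on its line are all assumptions made for illustration.

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

// Hypothetical helper, not the plugin's real API: replaces the argument that
// follows a registered method name with a generated placeholder identifier.
class IdentifierRewriter {
    // original string -> placeholder, kept so the names can be decoded later
    final Map<String, String> placeholders = [:]
    private final List<Pattern> patterns = []

    // endTokens are words that terminate the identifier (non-inclusive);
    // the end of the line always terminates it.
    void addPattern(String method, List<String> endTokens = []) {
        String stop = '[ \\t]*$'
        if (endTokens) {
            String words = endTokens.collect { Pattern.quote(it) }.join('|')
            stop = '[ \\t]+(?:' + words + ')\\b|' + stop
        }
        // group 1: indentation, method name, separator; group 2: the raw identifier
        patterns << Pattern.compile(
            '(?m)^([ \\t]*' + Pattern.quote(method) + '[ \\t]+)(\\S.*?)(?=' + stop + ')')
    }

    String rewrite(String source) {
        String result = source
        for (Pattern p : patterns) {
            Matcher m = p.matcher(result)
            StringBuffer sb = new StringBuffer()
            while (m.find()) {
                String original = m.group(2)
                String name = placeholders[original]
                if (name == null) {
                    // First time this string is seen: allocate the next placeholder
                    name = '__var' + (placeholders.size() + 1)
                    placeholders[original] = name
                }
                m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + name))
            }
            m.appendTail(sb)
            result = sb.toString()
        }
        result
    }
}

def rewriter = new IdentifierRewriter()
rewriter.addPattern('assay')
rewriter.addPattern('compositeAssay', ['with'])
def rewritten = rewriter.rewrite(
    'assay RS123A->C\n\ncompositeAssay RS456-789 with {\n    assay RS123A->C\n}\n')
```

The same original string always maps to the same placeholder, so both uses of RS123A->C end up referring to one name. The placeholders map is what a later step would need to translate the generated names back.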

Now that we have valid Groovy identifiers, the next piece is to provide the original strings to the calling methods. To translate back to the original string, I used Groovy’s propertyMissing method to catch the unknown variable and return the original string for each translated extended identifier.

ExtendedIdentifierScript.groovy
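def propertyMissing(String propertyName) {
    // Placeholder names produced by the rewrite map back to the original string
    if (isExtendedIdentifier(propertyName))
        return decodeExtendedIdentifier(propertyName)
    // Anything else really is an unknown property
    throw new MissingPropertyException(propertyName, Void)
}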
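The following self-contained sketch shows the propertyMissing hook resolving a placeholder when a script runs. The base class DecodingScript, its static ORIGINALS lookup table, and the bodies of isExtendedIdentifier and decodeExtendedIdentifier are assumptions made for illustration. Only the shape of propertyMissing follows the listing above, and the exception here reports the script class rather than Void.

```groovy
import org.codehaus.groovy.control.CompilerConfiguration

// Hypothetical stand-in for ExtendedIdentifierScript. The lookup table and the
// two helper bodies are assumptions, not the original implementation.
abstract class DecodingScript extends Script {
    // placeholder -> original string, filled in before the script runs
    static final Map<String, String> ORIGINALS = [:]

    boolean isExtendedIdentifier(String name) { ORIGINALS.containsKey(name) }

    String decodeExtendedIdentifier(String name) { ORIGINALS[name] }

    def propertyMissing(String propertyName) {
        if (isExtendedIdentifier(propertyName)) {
            return decodeExtendedIdentifier(propertyName)
        }
        throw new MissingPropertyException(propertyName, getClass())
    }

    // A DSL method that receives the decoded string
    def assay(String name) { 'assay:' + name }
}

DecodingScript.ORIGINALS['__var1'] = 'RS123A->C'

def conf = new CompilerConfiguration()
conf.scriptBaseClass = DecodingScript.name
def shell = new GroovyShell(DecodingScript.classLoader, new Binding(), conf)

// __var1 is not in the binding, so the lookup falls through to propertyMissing
def decoded = shell.evaluate('assay __var1')

// A name that is not a known placeholder still fails as usual
def failed = false
try {
    shell.evaluate('assay somethingElse')
} catch (MissingPropertyException ignored) {
    failed = true
}
```

The hook is reached because Script first asks its Binding for the variable and only falls back to regular property resolution when the binding has no such variable.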

Setting up the Parser

Before you can parse any DSL scripts, you need to tell the parser which methods accept extended identifiers. Suppose you had the following DSL script.

Script.dsl
assay RS123A->C

compositeAssay RS456-789 with {
    assay RS123A->C
}

compositeAssay *&<>WeirdName-_>

manuallySpecifiedPatternMethod RS123A->C

We want calls to assay, compositeAssay, and manuallySpecifiedPatternMethod to treat their arguments as extended identifiers. For this scenario, the Groovy shell could be configured as follows.

DSL Parsing Configuration
import org.codehaus.groovy.control.CompilerConfiguration

def conf = new CompilerConfiguration()
conf.scriptBaseClass = Dsl.name
conf.pluginFactory = {
        def parser = new ExtendedIdentifierParser()
        parser.scanPackage("my.dsl") // (1)
        parser.addPattern("manuallySpecifiedPatternMethod") // (2)
        parser
}
def binding = new ExtendedIdentifierBinding() // (3)
def shell = new GroovyShell(this.class.classLoader, binding, conf)
  1. Instructs the parser to register method translation patterns based on annotations. How these work is discussed below.

  2. Instructs the parser to treat the rest of the line after "manuallySpecifiedPatternMethod" as an extended identifier.

  3. Uses our custom Binding implementation to register variables named with extended identifiers.
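The post does not show ExtendedIdentifierBinding itself. One plausible reading is that it translates a placeholder back to the original name before looking up the variable, so an object stored under RS123A->C can be found through __var1. The sketch below illustrates that idea only. The class DecodingBinding and its registerPlaceholder method are hypothetical names, and the real implementation may work differently.

```groovy
// Hypothetical sketch of a Binding that decodes placeholder names on lookup.
// It is an assumption about the approach, not the actual ExtendedIdentifierBinding.
class DecodingBinding extends Binding {
    // placeholder -> original extended identifier
    private final Map<String, String> originals = [:]

    void registerPlaceholder(String placeholder, String original) {
        originals[placeholder] = original
    }

    // Names that are not placeholders pass through unchanged
    private String decode(String name) {
        originals.containsKey(name) ? originals[name] : name
    }

    @Override
    Object getVariable(String name) {
        // Throws MissingPropertyException when nothing is stored under the decoded name
        super.getVariable(decode(name))
    }

    @Override
    boolean hasVariable(String name) {
        super.hasVariable(decode(name))
    }
}

def binding = new DecodingBinding()
binding.registerPlaceholder('__var1', 'RS123A->C')
// Variables are stored under the original name, as the DSL methods do with setVariable
binding.setVariable('RS123A->C', 'assay object placeholder')
def found = binding.getVariable('__var1')
```

With a binding like this, the first mention of a name would miss the binding and fall back to propertyMissing, which yields the plain string. Later mentions would find the object that a DSL method registered under the original name.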

Annotating the DSL

The parser can be configured to scan a package for annotations and register the patterns for identifier translation from them. One annotation is provided, @ExtendedIdentifierPattern, which registers the name of the annotated method as a marker for accepting extended identifiers.

abstract class Dsl extends ExtendedIdentifierScript {
    Set<Assay> assays = new HashSet<>()
    Set<CompositeAssay> compositeAssays = new HashSet<>();

    @ExtendedIdentifierPattern // (1)
    def assay(String name) {
        def assay = new Assay(name: name)
        assays.add(assay)
        getBinding().setVariable(name, assay)
        assay
    }

    @ExtendedIdentifierPattern(endTokens = [EOL, "with"]) // (2)
    def compositeAssay(String name) {
        def compositeAssay = new CompositeAssay(name: name)
        compositeAssays.add(compositeAssay)
        getBinding().setVariable(name, compositeAssay)
        compositeAssay
    }

    def manuallySpecifiedPatternMethod(Assay assay) { // (3)
        assays.add(assay);
    }
}

class CompositeAssay {
    String name
    private Assay assay

    def assay(Assay assay) { // (4)
        this.assay = assay
    }
}
  1. By default, @ExtendedIdentifierPattern captures the rest of the line as the string argument.

  2. Alternatively, you can specify which tokens mark the end of the identifier, non-inclusive.

  3. This method pattern was registered with the parser directly.

  4. Normally this method would have to be annotated, but Dsl already has an annotated method with the same name.

Thanks for reading! You can check out the full source and provided tests.