A Better R Programming Experience Thanks to Tree-sitter

I read this article a week or so ago and immediately implemented a VS Code extension that I've always wanted: a static analysis tool for targets pipelines. targets is an R package which provides Make-like pipelines for data science and analysis work. You write your pipeline as a DAG and targets orchestrates the analysis and only re-runs downstream nodes if upstream ones are invalidated and the output changes. Fantastic tool, but at a certain level of complexity the DAG becomes a bit hard to navigate and reason about ("wait, what targets are downstream of this one again?"). This isn't really a targets problem, as this will happen with any analysis of decent complexity, but the structure targets adds to the analysis actually allows for a decent amount of static analysis of the environment/code. Enter tree-sitter.

I wrote a VS Code extension that analyzes the pipeline and provides useful hover information (like size, time last invalidated, computation time for that target, and children/parent info) as well as links to quickly jump to different targets and their children/parents. I've dogfooded the hell out of it and it's already vastly improved my targets workflow within a week. Things like providing better error hints in the IDE for targets-specific malformed inputs and showing which targets are emitting errors really take lots of the friction out of an analysis.

All that to say: nice work on extending tree-sitter to R!

tarborist: targets + tree-sitter https://open-vsx.org/extension/tylermorganwall/tarborist

GH: https://github.com/tylermorganwall/tarborist

The article makes out like auto completion and help on hover are new things, but RStudio IDE has had them for years and years.

R/RStudio was my first language/IDE. I was horribly shocked when moving into other languages to discover they didn't have things you got out of the box with R/RStudio. "You mean I have to look up documentation for a function/method!?! - that's supposed to be automatic!".

R has a bunch of features which other languages lack to the degree that it's a rude shock to learn that other ecosystems lack them. One is the REPL with extremely convenient RStudio keyboard shortcuts to run lines of code (to achieve similar with ruby, I have an elaborate neovim/slime setup that took hours to configure and still isn't as good as RStudio gives out of the box).

A sign of a brilliant tool is when an idiot can get more done with it than an expert can with alternatives.

Tree-sitter is one of the finer engineering products out there, it enables so much. Thanks to its creator and everyone who has contributed to this project and its many grammars!

Do the tools built on this understand dplyr pipelines and columns in the data frames appearing as bare variables in the code? If so, I’m really impressed. R does some unusual stuff.

I moved to tree-sitter inside Emacs a while ago and I'd say tree-sitter is much easier than it looks like.

I had a first little use case... For whatever reason the options to align let bindings in Clojure code, no matter if I tried the "semantic" or Tonsky's semi-standard way of formatting Clojure code (several tools adopted Tonsky's suggestion) and no matter which option/knob I turned on, I couldn't align like I wanted.

I really, really, really hate the pure horrible chaos of this:

    (let [abc (+ a 2)
          d (inc b)
          vwxyz (+ abc d)]
      ...

But I love the perfection of this [1]:

    (let [abc     (+ a 2)
          d       (inc b)
          vwxyz   (+ abc d)]
      ...

And the cljfmt is pretty agnostic about it: I can both use cljfmt from Emacs and have a hook forcing cljfmt and it'll align everything but it won't mess with those nice vertical alignments.

Now, I know, I know: it is supposed to work directly from cljfmt but many options are, still in the latest version, labelled as experimental and I simply couldn't make it work on my setup, no matter which knob I turned on.

So what did I do? Claude Code CLI, tree-sitter, and three elisp functions.

And I added my own vertical indenting to Clojure let bindings. And it's compatible with cljfmt (as in: if I run cljfmt it doesn't remove my vertical alignments).

I'd say the tree-sitter syntax tree is incredibly verbose (and has to be) but it's not that hard to use tree-sitter.

P.S: and I'm not alone in liking this kind of alignment and, no, we're not receptive to the "but then you modify one line and several lines are detected as modified". And we're less receptive by the day now that we begin to had tools like diff'ing tools that are indentation-agnostic and only do AST diffs.

All that to say: nice work on extending tree-sitter to R!

tarborist: targets + tree-sitter https://open-vsx.org/extension/tylermorganwall/tarborist

GH: https://github.com/tylermorganwall/tarborist

Tree-sitter is one of the finer engineering products out there, it enables so much. Thanks to its creator and everyone who has contributed to this project and its many grammars!

Do the tools built on this understand dplyr pipelines and columns in the data frames appearing as bare variables in the code? If so, I’m really impressed. R does some unusual stuff.

I moved to tree-sitter inside Emacs a while ago and I'd say tree-sitter is much easier than it looks like.

I really, really, really hate the pure horrible chaos of this:

    (let [abc (+ a 2)
          d (inc b)
          vwxyz (+ abc d)]
      ...

But I love the perfection of this [1]:

    (let [abc     (+ a 2)
          d       (inc b)
          vwxyz   (+ abc d)]
      ...

And the cljfmt is pretty agnostic about it: I can both use cljfmt from Emacs and have a hook forcing cljfmt and it'll align everything but it won't mess with those nice vertical alignments.

So what did I do? Claude Code CLI, tree-sitter, and three elisp functions.

And I added my own vertical indenting to Clojure let bindings. And it's compatible with cljfmt (as in: if I run cljfmt it doesn't remove my vertical alignments).

I'd say the tree-sitter syntax tree is incredibly verbose (and has to be) but it's not that hard to use tree-sitter.

Can you move the closing ) to also be vertically aligned?

And the first +/inc in parenthesis?

The article makes out like auto completion and help on hover are new things, but RStudio IDE has had them for years and years.

A sign of a brilliant tool is when an idiot can get more done with it than an expert can with alternatives.

What if you want to share something outside of your precious IDE?

- Merge request on GitHub - Presentation with reveal.js (kind of like PowerPoint)

You'd be stuck with either bland, uncoloured, text-only characters, OR with a fuzzy PNG screenshot where you can't zoom or copy. Or maybe you "parse R" with Regex.

tree-sitter integrates into any web-based technology, allowing you to _share_ code.

Maybe that explains why I was confused about this article. I kept wondering what exactly on offer, and that it couldn't be as simple as help on hover and auto-complete, because those seemed pretty basic and prevalent. It took me a few years to move to RStudio, but at this point, I literally don't know anyone who doesn't use it. To the point that I once had to explain to a labmate that R and RStudio were, in fact, not the same thing.

So either this is not that exciting, or else the additional things that are on offer are not very clearly explained to the point that I missed them.

Can you move the closing ) to also be vertically aligned?

And the first +/inc in parenthesis?

What if you want to share something outside of your precious IDE?

- Merge request on GitHub - Presentation with reveal.js (kind of like PowerPoint)

You'd be stuck with either bland, uncoloured, text-only characters, OR with a fuzzy PNG screenshot where you can't zoom or copy. Or maybe you "parse R" with Regex.

tree-sitter integrates into any web-based technology, allowing you to _share_ code.

Yes, your comment really should be the focus of article, i.e. genuinely new capabilities and improvements, not existing capabilities done a slightly different way. In any case it’s a minor nitpick and it’s awesome progress for the language and tooling

So either this is not that exciting, or else the additional things that are on offer are not very clearly explained to the point that I missed them.

I suspect the main benefits are portability (since tree-sitter uses wasm and javascript it can run in any webpage - compared to the previous way of parsing R code which needed an R runtime, so not just any old website could do it; e.g. a shiny app probably could because it has an R runtime available but a standard HTML page couldn't). And the other is tree-sitter is a widely used tool so now anything that uses tree-sitter can now work with R, since the R grammar is available.

Looks like R's tree-sitter grammar has been in use for GitHub search for a while (since 2024), so it's a nice improvement due to R/tree-sitter, although we've probably been benefitting from it for a while already, perhaps without knowing exactly how it worked!

https://github.com/orgs/community/discussions/120397#discuss...

I believe this should let you do syntax highlighting for R in vim for example.

A little bit less than two years ago, building on work by Jim Hester and Kevin Ushey, Davis Vaughan completed a very impactful JavaScript file for the R community: an R grammar for the Tree-sitter parsing generator. He even got a round of applause for it during a talk at the useR! 2024 conference! So, did he get cheered for… grammatical rules in a JavaScript file? 😅

No, the audience was excited about the improved developer experience for R that this file unlocked. R tooling around Tree-sitter is how you get

reformatting through Air and linting through Jarl;
auto-completion or help on hover in the Positron IDE;
better search for R on GitHub;
and more!

In this post, we’ll explain what Tree-sitter is, and how tools built on Tree-sitter can benefit your R development workflow.

🔗 Code parsing: what is Tree-sitter?

Tree-sitter is a code parsing generator written in C, with bindings existing in several languages including Rust (and R!).

Let’s rewind a little bit. What does it mean to parse code?

Basically, given a string of code like

a <- mean(x, na.rm = TRUE)

How do you know that mean is a function name, na.rm an argument name, TRUE a logical? You have to parse that code into what’s called a parse tree. You do that in your head when reading R code. 😸

R itself can obviously parse R code, thanks to its grammar. See for instance the commit that introduced R’s native pipe, which necessitated extending R’s syntax thus modifying its grammar.

You can use parse() and getParseData() to parse R code.

parse(
  text = "a <- mean(x, na.rm = TRUE)",
  keep.source = TRUE
) |>
  getParseData()
#>    line1 col1 line2 col2 id parent                token terminal  text
#> 23     1    1     1   26 23      0                 expr    FALSE      
#> 1      1    1     1    1  1      3               SYMBOL     TRUE     a
#> 3      1    1     1    1  3     23                 expr    FALSE      
#> 2      1    3     1    4  2     23          LEFT_ASSIGN     TRUE    <-
#> 21     1    6     1   26 21     23                 expr    FALSE      
#> 4      1    6     1    9  4      6 SYMBOL_FUNCTION_CALL     TRUE  mean
#> 6      1    6     1    9  6     21                 expr    FALSE      
#> 5      1   10     1   10  5     21                  '('     TRUE     (
#> 7      1   11     1   11  7      9               SYMBOL     TRUE     x
#> 9      1   11     1   11  9     21                 expr    FALSE      
#> 8      1   12     1   12  8     21                  ','     TRUE     ,
#> 13     1   14     1   18 13     21           SYMBOL_SUB     TRUE na.rm
#> 14     1   20     1   20 14     21               EQ_SUB     TRUE     =
#> 15     1   22     1   25 15     16            NUM_CONST     TRUE  TRUE
#> 16     1   22     1   25 16     21                 expr    FALSE      
#> 17     1   26     1   26 17     21                  ')'     TRUE     )

Or you could transform that same data into XML using Gábor Csárdi’s {xmlparsedata}:

parse(
  text = "a <- mean(x, na.rm = TRUE)",
  keep.source = TRUE
) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  xml2::read_xml() |>
  as.character() |>
  cat()
#> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
#> <exprlist>
#>   <expr line1="1" col1="1" line2="1" col2="26" start="28" end="53">
#>     <expr line1="1" col1="1" line2="1" col2="1" start="28" end="28">
#>       <SYMBOL line1="1" col1="1" line2="1" col2="1" start="28" end="28">a</SYMBOL>
#>     </expr>
#>     <LEFT_ASSIGN line1="1" col1="3" line2="1" col2="4" start="30" end="31">&lt;-</LEFT_ASSIGN>
#>     <expr line1="1" col1="6" line2="1" col2="26" start="33" end="53">
#>       <expr line1="1" col1="6" line2="1" col2="9" start="33" end="36">
#>         <SYMBOL_FUNCTION_CALL line1="1" col1="6" line2="1" col2="9" start="33" end="36">mean</SYMBOL_FUNCTION_CALL>
#>       </expr>
#>       <OP-LEFT-PAREN line1="1" col1="10" line2="1" col2="10" start="37" end="37">(</OP-LEFT-PAREN>
#>       <expr line1="1" col1="11" line2="1" col2="11" start="38" end="38">
#>         <SYMBOL line1="1" col1="11" line2="1" col2="11" start="38" end="38">x</SYMBOL>
#>       </expr>
#>       <OP-COMMA line1="1" col1="12" line2="1" col2="12" start="39" end="39">,</OP-COMMA>
#>       <SYMBOL_SUB line1="1" col1="14" line2="1" col2="18" start="41" end="45">na.rm</SYMBOL_SUB>
#>       <EQ_SUB line1="1" col1="20" line2="1" col2="20" start="47" end="47">=</EQ_SUB>
#>       <expr line1="1" col1="22" line2="1" col2="25" start="49" end="52">
#>         <NUM_CONST line1="1" col1="22" line2="1" col2="25" start="49" end="52">TRUE</NUM_CONST>
#>       </expr>
#>       <OP-RIGHT-PAREN line1="1" col1="26" line2="1" col2="26" start="53" end="53">)</OP-RIGHT-PAREN>
#>     </expr>
#>   </expr>
#> </exprlist>

In both cases, you recognize words such as LEFT_ASSIGN or SYMBOL_FUNCTION_CALL. Parsing is an essential step before the code is actually executed, but parsed code can also be used for other purposes, such as analyzing code without brittle regular expressions (does it call a particular function?), navigating code (going from a function call to the definition of that function), or modifying code (replacing all occurrences of a function with another one).

Now, Tree-sitter performs this same code parsing but faster especially thanks to its support of incremental parsing – which is key to updating the syntax tree as you are typing in your editor for instance! Tree-sitter is agnostic in that it can parse any code as long as there is a grammar for it (think, Rosetta Stone plugins). It’s been used for many languages which means many tools have been built around it.

To have Tree-sitter “learn” a new language you need to give it a file containing the definition of the syntax of that language, what’s called a grammar. This is where the aforementioned JavaScript file by Davis Vaughan and collaborators comes into play! The treesitter-r repo, which provides a translation of the R grammar in the format expected by Tree-sitter, is the base of all tools presented in this post which use R code as their input.

Here’s how to use the {treesitter} R package for the same code as earlier. The {treesitter} R package allows us to use Tree-sitter from R. To parse R code with it, we need the language() function from {treesitter.r}1.

library(treesitter)
#> 
#> Attaching package: 'treesitter'
#> The following object is masked from 'package:base':
#> 
#>     range
language <- treesitter.r::language()
parser <- parser(language)
text <- "a <- mean(x, na.rm = TRUE)"
parser_parse(parser, text)
#> <tree_sitter_tree>
#> 
#> ── Text ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> a <- mean(x, na.rm = TRUE)
#> 
#> ── S-Expression ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> (program [(0, 0), (0, 26)]
#>   (binary_operator [(0, 0), (0, 26)]
#>     lhs: (identifier [(0, 0), (0, 1)])
#>     operator: "<-" [(0, 2), (0, 4)]
#>     rhs: (call [(0, 5), (0, 26)]
#>       function: (identifier [(0, 5), (0, 9)])
#>       arguments: (arguments [(0, 9), (0, 26)]
#>         open: "(" [(0, 9), (0, 10)]
#>         argument: (argument [(0, 10), (0, 11)]
#>           value: (identifier [(0, 10), (0, 11)])
#>         )
#>         (comma [(0, 11), (0, 12)])
#>         argument: (argument [(0, 13), (0, 25)]
#>           name: (identifier [(0, 13), (0, 18)])
#>           "=" [(0, 19), (0, 20)]
#>           value: (true [(0, 21), (0, 25)])
#>         )
#>         close: ")" [(0, 25), (0, 26)]
#>       )
#>     )
#>   )
#> )

Tree-sitter is the workhorse of many tools, that are mentioned in the diagram below. All of them are dependent on tree-sitter and the R grammar provided to it. Some of them are command-line interfaces (CLIs), while others are R packages.

🔗 Browsing code interactively: Positron IDE, GitHub

The real reason why the audience applauded Davis Vaughan is that he explained how the R grammar for Tree-sitter had been deployed to GitHub so that we get almost as good experience browsing R code on GitHub as browsing, say, JS code. If we search for a function name in a repository for instance, its definition will be indicated in the search results. See Davis’ slides (also available in PDF), or refer to the video below showing how typing vetiver_model in the search bar from the R vetiver repo makes the function definition the first result, on which one can click to land into the definition.

Also very useful is the use of Tree-sitter by Ark, the R kernel used in the Positron IDE. Ark is how you get autocompletion and help on hover in Positron. The video below shows how you can extend the selection to further steps of a pipeline in Positron.

This use case of Tree-sitter is also featured in Davis’ slides. See also Lionel Henry’s and Davis Vaughan’s talk about Ark at posit conf 2024, especially the part about code assistance.

Other development environments such as Emacs also have support for Tree-sitter.

🔗 Searching/browsing code

You can parse and search R code using the {treesitter} R package and treesitter query syntax. The {treesitter} R package is a dependency of the {gander} package by Simon Couch, that is meant to be used for a better experience with LLMs when writing R code. Another use case of the {treesitter} R package is the {igraph.r2cdocs} extension to {roxygen2} for the {igraph} package, that parses all of igraph R code to then be able to identify, for each exported function, whether it (in)directly calls a function whose name ends with _impl, indicating a wrapper to a C igraph function whose docs can be then be linked from the manual of the R function.

The {pkgdepends} package calls Tree-sitter (C) to detect dependencies in files. Below we run it on the source of the saperlipopette R package.

pkgdepends::scan_deps(
  "../../../../../CHAMPIONS/saperlipopette",
  "../../../../../CHAMPIONS"
)
#> 
#> Dependencies:
#> + brio           @ R/blame.R, R/check-editor.R, R/clean-dir.R, R/committed-to-main.R, R/committed-to-wrong-branch.R, R/conflict…
#> + cli            @ inst/exo_bisect-Rprofile.en.R, inst/exo_bisect-Rprofile.es.R, inst/exo_bisect-Rprofile.fr.R, inst/exo_blame-…
#> + devtools       @ saperlipopette.Rproj
#> + fs             @ R/blame.R, R/check-editor.R, R/clean-dir.R, R/committed-to-main.R, R/committed-to-wrong-branch.R, R/conflict…
#> + gert           @ inst/exo_check_editor-Rprofile.en.R, inst/exo_check_editor-Rprofile.es.R, inst/exo_check_editor-Rprofile.fr.…
#> + knitr          @ README.Rmd
#> + parsedate      @ R/utils-git.R
#> + purrr          @ R/create-all.R, R/debug.R, R/log-deleted-file.R, R/log-deleted-line.R, R/revparse.R, R/roxygen2.R, R/worktre…
#> + rlang          @ R/create-all.R, R/roxygen2.R, R/utils-fs.R, R/utils-usethis.R, R/zzz.R
#> + rmarkdown      @ README.Rmd, vignettes/saperlipopette.qmd
#> + roxygen2       @ R/roxygen2.R, saperlipopette.Rproj
#> + saperlipopette @ README.Rmd, vignettes/saperlipopette.qmd
#> + tibble         @ R/roxygen2.R
#> + usethis        @ R/blame.R, R/check-editor.R, R/clean-dir.R, R/committed-to-main.R, R/committed-to-wrong-branch.R, R/conflict…
#> + vctrs          @ R/roxygen2.R
#> + withr          @ R/blame.R, R/check-editor.R, R/clean-dir.R, R/committed-to-main.R, R/committed-to-wrong-branch.R, R/conflict…
#> 
#> Test dependencies:
#> + fs             @ tests/testthat/test-blame.R, tests/testthat/test-check-editor.R, tests/testthat/test-clean-dir.R, tests/test…
#> + gert           @ tests/testthat/test-blame.R, tests/testthat/test-clean-dir.R, tests/testthat/test-committed-to-main.R, tests…
#> + rlang          @ tests/testthat/test-blame.R, tests/testthat/test-check-editor.R, tests/testthat/test-clean-dir.R, tests/test…
#> + saperlipopette @ tests/testthat.R
#> + testthat       @ tests/testthat.R
#> + withr          @ tests/testthat/test-blame.R, tests/testthat/test-check-editor.R, tests/testthat/test-clean-dir.R, tests/test…

ast-grep is a useful tool built on Tree-sitter for searching and re-writing code, with a clearer query syntax than Tree-sitter’s. Its name is reminiscent of grep, but with ast-grep we do not need to write brittle regular expressions 😸. {astgrepr} by Etienne Bacher is an R wrapper to the Rust bindings of ast-grep, and is used in Etienne’s {flir} package for refactoring code.

The ast-grep command-line interface (CLI) itself is featured in a useful blog post by Emil Hvitfeldt where he explains how to document the usage of ast-grep for Claude.

🔗 Formatting and linting: Air, Jarl

Speaking of CLIs…

Cute kitten attacked by robots. The text says 'Everytime you use Claude for something a CLI can do, a kitten dies'.

Air, by Davis Vaughan and Lionel Henry, is a CLI built on Tree-sitter, in Rust. It reformats code blazingly fast.

Jarl, by Etienne Bacher, is a CLI built on Air, therefore also on Tree-sitter, in Rust. It lints and fixes code, also blazingly fast. It can even detect unreachable code, unused functions and duplicated function definitions.

In both of these examples, the creation of CLIs wrapping Rust bindings was more efficient than the creation of R packages wrapping the {treesitter} R package, for several reasons:

Rust CLIs can edit code very fast2;
CLIs are integrated in extensions for popular IDEs (for instance Positron);
a CLI is easier to install on CI than an R package that needs, well, an R installation.

🔗 More tools

A brief mention of some other interesting tools we’ve explored a bit less.

🔗 Configuring: {ts} for parsing JSON and TOML (not R!)

The {ts} package by Gábor Csárdi is the backbone of two R packages used for editing and manipulating:

TOML {tstoml};
JSON {tsjson}.

Compared to existing parsers in R for those formats, these two packages preserve comments.

🔗 Testing code: {muttest}

Mutation testing is a kind of testing where you, say, randomly swap + with - in your code (you mutate it) and you run your tests to see whether they catch the mutant. The {muttest} package by Jakub Sobolewski is an R package for mutation testing, that depends on the {treesitter} R package.

🔗 Diffing code: difftastic

The difftastic CLI by Wilfred Hughes is “a structural diff tool that understands syntax”. ✨ This means that difftastic doesn’t only compare line or “words” but actual syntax by looking at lines around the lines that changed (by default, 3). Even better, it understands R out of the box. See this blog post with examples of R code diffing.

🔗 Conclusion: more to come?

In this post, we’ve presented an overview of Tree-sitter based tooling for R or in R.

Note that this ecosystem of tools is very actively developed, so some tools might come and go. However, the idea that plugging the R grammar into a general parsing generator brings cool features to us R developers, will remain true. Maybe you will contribute to this ecosystem, either through an existing tool or by creating a new one?

We could also parse C code with it using {treesitter.c}. ↩︎
Rust is a lower level language than R so has less overhead; furthermore this kind of Rust code can be easily parallelized. ↩︎

Hacker Times