content: treesitter-on-the-web

This commit is contained in:
Maciej Jur 2024-02-14 22:28:20 +01:00
parent 21dd3e6045
commit dd0ee174d0
Signed by: kamov
GPG key ID: 191CBFF5F72ECAFD
4 changed files with 457 additions and 2 deletions

View file

@ -0,0 +1,400 @@
---
title: Bringing treesitter to the Internet
date: 2024-02-14T18:32:41.645Z
---
Recently, there has been a complete rewrite of [Shiki](https://github.com/shikijs/shiki),
a rather nice syntax highlighter that you can employ to accentuate a plethora of code on your online blog.
It's built upon the very same system utilized within VS Code - [TextMate grammars](https://github.com/microsoft/vscode-textmate).
Additionally, a side effect of this setup is its reliance on the [Oniguruma](https://en.wikipedia.org/wiki/Oniguruma) regex library,
which is not so nice for a number of reasons. It works by using glorified regexes to highlight the syntax.
A crazy thought struck me at some point. What if I used [Treesitter](https://tree-sitter.github.io/tree-sitter/) to do all the
heavy-lifting of finding out how to color the syntax of the code snippets?
Treesitter is a novel approach to syntax highlighting which is utilized in editors
like Neovim, Helix or Zed. Being a Neovim user myself, I was quite enthusiastic
about the thought of being able to use the same tool used by Neovim.
This website is generated statically by Astro, which in turn runs on Node.js,
at least that's the setup for the time being. If I wanted to use Treesitter
I would have to somehow plug Treesitter into the Node.js process...
My language of choice for doing this is Rust, because it's a systems language,
it has some of the best bindings to Treesitter, and it has pretty good interop
with the JavaScript ecosystem.
## WASM is too hard
My first intuition was to try to compile Treesitter into a WASM module,
however this proved to be much harder than I anticipated at first.
The main problem is compiling `tree-sitter` crate to `wasm32-unknown-unknown`.
This is simply impossible to do without resorting to hacks, because there's
a C header in the source code which cannot be compiled while passing that target.
Another problem is that there is currently an [ABI mismatch between C and Rust](https://github.com/rust-lang/rust/issues/71871)
when it comes to the `wasm32-unknown-unknown` target. The `wasm32-wasi` target
is not affected by this issue.
## The native approach
I've found that there are pretty good [Rust bindings to the Native API for Node](https://napi.rs/),
so I decided to try loading Treesitter compiled as a dynamic library.
A lot of the work related to the setup, as well as the building can be automated
using a [CLI tool](https://napi.rs/docs/introduction/getting-started),
which I definitely recommend checking out.
For this type of project we have to set the crate type as a dynamic library respecting the C ABI.
```toml
[lib]
crate-type = ["cdylib"]
```
Next we need to import all the external libraries required to compile for Node.
```toml
[dependencies]
# Default enable napi4 feature, see https://nodejs.org/api/n-api.html#node-api-version-matrix
napi = { version = "2.12.2", default-features = false, features = ["napi4"] }
napi-derive = "2.12.2"
```
In this section we define the dependencies required for the project. We need to
use the `napi` crate. The feature flag indicates compatibility with Node.js N-API version 4.
The comment provides a link to the Node.js documentation explaining the N-API version matrix.
```toml
[build-dependencies]
napi-build = "2.0.1"
```
In this section, we define build dependencies. Build dependencies are dependencies that are only needed during the build process, such as compiler plugins or code generation tools.
```toml
[profile.release]
lto = true
strip = "symbols"
```
We can add Treesitter dependencies like this.
```toml
[dependencies]
# Treesitter
tree-sitter = "0.20.10"
tree-sitter-highlight = "0.20.1"
# Languages
tree-sitter-astro = { git = "https://github.com/virchau13/tree-sitter-astro.git", rev = "e924787e12e8a03194f36a113290ac11d6dc10f3" }
tree-sitter-css = "0.20.0"
tree-sitter-html = "0.20.0"
tree-sitter-javascript = "0.20.3"
```
In Rust we need to create a function which will be callable from the JavaScript
side, it needs to be marked by the `#[napi]` procedural macro. It makes the function
visible to Node.js. By the way, there are only some types you can use in such a function.
Check the documentation for the available types.
```rust
#[macro_use]
extern crate napi_derive;
```
In the entry function I load the configuration for a language, if there's no
such language then we need to return early. Next, we create a highlighter and pass
the config, a source code we want to highlight, as well as a callback used for
retrieving additional language configs. This is needed for handling injections.
Last we convert events into a type which can be converted into a JavaScript object.
```rust
#[napi]
pub fn hl(lang: String, src: String) -> Vec<HashMap<String, String>> {
let config = match configs::get_config(&lang) {
Some(c) => c,
None => return vec![
HashMap::from([
("kind".into(), "text".into()),
("text".into(), src.into())
])
]
};
let mut hl = Highlighter::new();
let highlights = hl.highlight(
&config,
src.as_bytes(),
None,
|name| configs::get_config(name)
).unwrap();
let mut out = vec![];
for event in highlights {
let event = event.unwrap();
let obj = map_event(event, &src);
out.push(obj);
}
out
}
```
The events we get from Treesitter need to be converted into something serializable, e.g. `HashMap`.
```rust
fn map_event(event: HighlightEvent, src: &str) -> HashMap<String, String> {
match event {
HighlightEvent::Source {start, end} => HashMap::from([
("kind".into(), "text".into()),
("text".into(), src[start..end].into())
]),
HighlightEvent::HighlightStart(s) => HashMap::from([
("kind".into(), "open".into()),
("name".into(), captures::NAMES[s.0].into())
]),
HighlightEvent::HighlightEnd => HashMap::from([
("kind".into(), "close".into())
]),
}
}
```
On the JavaScript side we need to load the Rust library like this, assuming we
are writing ES modules that is. This `require` is *required* (heh) to be able to import the
library, but we have to create it ourselves.
```javascript
import { createRequire } from 'node:module';
const require = createRequire(import.meta.url);
export const { hl } = require('./treesitter.linux-x64-gnu.node');
```
Once we have this library we can load it inside Node (almost) like any other module.
```
Welcome to Node.js v21.6.1.
Type ".help" for more information.
> const treesitter = await import('./dist/index.js')
undefined
> treesitter.hl.toString()
'function hl() { [native code] }'
> treesitter.hl('ts', 'function a() {}')
[
{ name: 'keyword', kind: 'open' },
{ text: 'function', kind: 'text' },
{ kind: 'close' },
{ kind: 'text', text: ' ' },
{ kind: 'open', name: 'function' },
// ...
]
```
As you can see by the `[native code]` marker when calling `toString()` on the `hl`
function, this function is written using a compiled language.
We can use this function for example inside a remark plugin, to transform the
tree of elements we get from parsing a markdown file.
Below is an example which I used to highlight syntax in all code blocks.
```typescript
export default function rehypeTreesitter() {
return function (tree: any) {
visit(tree, null, (node, _, above) => {
if (node.tagName !== 'code' || above.tagName !== 'pre') return;
const code = node.children?.[0].value || '';
const lang = node.properties.className?.[0].replace('language-', '') || '';
const parent = { ...above };
above.tagName = 'figure';
above.children = [parent];
above.properties = {
className: 'listing kanagawa',
...!!lang && { "data-lang": lang },
};
const root = { children: [] };
const ptrs: any[] = [root];
for (const event of treesitter.hl(lang, code)) {
switch (event.kind) {
case 'text': {
const inserted = text(event.text);
ptrs.at(-1).children.push(inserted);
} break;
case 'open': {
const inserted = span(event.name);
ptrs.at(-1).children.push(inserted);
ptrs.push(inserted);
} break;
case 'close': {
ptrs.pop();
} break;
}
}
node.children = root.children;
});
};
}
```
## Extensions
Using Treesitter means that we can easily add new language parsers, and write custom
queries for highlights and injections.
For example if we want to have syntax highlighting for Astro, we have to install
parsers, which need to be included in the `Cargo.toml` file. Then we can set up them
like this.
```rust
pub static CONFIGS: Lazy<HashMap<&'static str, HighlightConfiguration>> = Lazy::new(|| {
HashMap::from([
(
"astro",
config_for(
tree_sitter_astro::language(),
query!("astro/highlights"),
query!("astro/injections"),
""
)
),
(
"html",
config_for(
tree_sitter_html::language(),
tree_sitter_html::HIGHLIGHTS_QUERY,
tree_sitter_html::INJECTIONS_QUERY,
"",
)
),
// -- snip --
])
})
```
In the previous snippet I've used some custom-made things, as well as used `once_cell`
just to statically load all the configurations right at the beginning.
I've used here a function `config_for` which is a simple wrapper around `HighlightConfiguration::new(...)`
and a `query!` macro.
The macro just loads a string from a file and embeds it as a static string.
```rust
macro_rules! query {
($path:literal) => {
include_str!(concat!(
env!("CARGO_MANIFEST_DIR"),
"/queries/",
$path,
".scm"
))
};
}
```
This is useful for when you would like to write a custom query, because the one
included with the parser is not good enough, or if there is none at all.
Below is an example `highlights.scm` query for Astro, which adds syntax highlighting
captures for all `.astro` files. I've taken them from `nvim-treesitter` repo.
Be careful, however, not every directive available in Neovim is available
for general use. For example, `#lua-match?` is Neovim only and won't work here.
```query
(tag_name) @tag
(erroneous_end_tag_name) @keyword
(doctype) @constant
(attribute_name) @property
(attribute_value) @string
(comment) @comment
[
(attribute_value)
(quoted_attribute_value)
] @string
"=" @operator
[
"{"
"}"
] @punctuation.bracket
[
"<"
">"
"</"
"/>"
] @tag.delimiter
```
As for the injections, we can add a `injections.scm` file. This will allow us
to highlight additional languages embedded inside Astro, like TypeScript, or HTML.
```query
(frontmatter
(raw_text) @injection.content
(#set! "injection.language" "typescript"))
(interpolation
(raw_text) @injection.content
(#set! "injection.language" "tsx"))
(script_element
(raw_text) @injection.content
(#set! "injection.language" "typescript"))
(style_element
(raw_text) @injection.content
(#set! "injection.language" "css"))
```
Last but not least, you have to configure the styles for the classes generated
by Treesitter. Some would argue this is in fact the hardest part, which is why
I've borrowed the color scheme from the [Kanagawa theme](https://github.com/rebelot/kanagawa.nvim) for Neovim :)
```scss
// Identifiers
.variable-builtin { color: var(--kngw-waveRed); }
.variable-parameter { color: var(--kngw-springViolet2); }
.constant { color: var(--kngw-surimiOrange); }
.constant-builtin { color: var(--kngw-surimiOrange); }
.label { color: var(--kngw-oniViolet); }
// Literals
.string { color: var(--kngw-springGreen); }
.string-special { color: var(--kngw-boatYellow2); }
.number { color: var(--kngw-sakuraPink); }
.number-float { color: var(--kngw-sakuraPink); };
```
## The result
If everything worked correctly you should be able to see a nicely highlighted
snippet below :)
```astro
---
const { isRed } = Astro.props;
---
<!-- If `isRed` is truthy, class will be "box red". -->
<!-- If `isRed` is falsy, class will be "box". -->
<div class:list={['box', { red: isRed }]}><slot /></div>
<style>
.box { border: 1px solid blue; }
.red { border-color: red; }
</style>
```
All the code, which I do use in production and might change in the future, is
[available on Github](https://github.com/kamoshi/kamoshi.org/tree/main/tools/treesitter).

View file

@ -52,6 +52,7 @@
.variable-parameter { color: var(--kngw-springViolet2); }
.constant { color: var(--kngw-surimiOrange); }
.constant-builtin { color: var(--kngw-surimiOrange); }
.label { color: var(--kngw-oniViolet); }

View file

@ -23,8 +23,11 @@ tree-sitter-haskell = { git = "https://github.com/tree-sitter/tree-sitter-haskel
tree-sitter-html = "0.20.0"
tree-sitter-javascript = "0.20.3"
tree-sitter-md = "0.1.7"
tree-sitter-query = "0.1.0"
tree-sitter-regex = "0.20.0"
tree-sitter-rust = "0.20.4"
tree-sitter-scheme = { git = "https://github.com/6cdh/tree-sitter-scheme", rev = "af0fd1fa452cb2562dc7b5c8a8c55551c39273b9" }
tree-sitter-toml = "0.20.0"
tree-sitter-typescript = "0.20.5"
[build-dependencies]

View file

@ -23,6 +23,8 @@ pub static EXTENSIONS: Lazy<HashMap<&'static str, &'static str>> = Lazy::new(||
("js", "javascript"),
("md", "markdown"),
("mdx", "markdown"),
("query", "scheme"),
("scm", "scheme"),
("scss", "css"),
("ts", "typescript")
])
@ -86,6 +88,18 @@ pub static CONFIGS: Lazy<HashMap<&'static str, HighlightConfiguration>> = Lazy::
tree_sitter_javascript::LOCALS_QUERY,
)
),
(
"jsx",
config_for(
tree_sitter_javascript::language(),
&format!("{} {}",
tree_sitter_javascript::HIGHLIGHT_QUERY,
tree_sitter_javascript::JSX_HIGHLIGHT_QUERY
),
tree_sitter_javascript::INJECTION_QUERY,
tree_sitter_javascript::LOCALS_QUERY,
)
),
(
"markdown",
config_for(
@ -113,16 +127,53 @@ pub static CONFIGS: Lazy<HashMap<&'static str, HighlightConfiguration>> = Lazy::
"",
)
),
(
"scheme",
config_for(
tree_sitter_scheme::language(),
tree_sitter_scheme::HIGHLIGHTS_QUERY,
"",
""
)
),
(
"toml",
config_for(
tree_sitter_toml::language(),
tree_sitter_toml::HIGHLIGHT_QUERY,
"",
""
)
),
(
"tsx",
config_for(
tree_sitter_typescript::language_tsx(),
&format!("{} {} {}",
tree_sitter_javascript::HIGHLIGHT_QUERY,
tree_sitter_javascript::JSX_HIGHLIGHT_QUERY,
tree_sitter_typescript::HIGHLIGHT_QUERY,
),
tree_sitter_javascript::INJECTION_QUERY,
&format!("{} {}",
tree_sitter_javascript::LOCALS_QUERY,
tree_sitter_typescript::LOCALS_QUERY
)
)
),
(
"typescript",
config_for(
tree_sitter_typescript::language_typescript(),
&format!("{}\n{}",
&format!("{} {}",
tree_sitter_javascript::HIGHLIGHT_QUERY,
tree_sitter_typescript::HIGHLIGHT_QUERY
),
tree_sitter_javascript::INJECTION_QUERY,
tree_sitter_typescript::LOCALS_QUERY,
&format!("{} {}",
tree_sitter_javascript::LOCALS_QUERY,
tree_sitter_typescript::LOCALS_QUERY
),
)
),
])