01 - Rewriting A Python Parser In Rust

# Rewriting a Python Parser in Rust This is the start of a series of posts about my experience [swapping out the parser infrastructure from under LibCST](https://github.com/Instagram/LibCST/releases/tag/v0.4.0). This post sets some context and talks about the motivation behind such an insane project. > [!info]- Disclaimer > I like Rust as a language, and I _also_ like Python. If you've come here expecting to see one or the other get bashed, I have to disappoint you. Better put those pitchforks away, have some ✨ 🍰 ✨ instead. Okay, the aforementioned context: [LibCST](https://github.com/instagram/libcst) is a library for programmatically changing Python code while having maximal control over syntax. It's useful for example for [applying type annotations](https://github.com/Instagram/LibCST/blob/f3811a0e3f45133b18c6e0281784304270113416/libcst/codemod/commands/convert_type_comments.py#L396), [writing lint rules](https://fixit.readthedocs.io/en/latest/), [renaming things](https://github.com/Instagram/LibCST/blob/f3811a0e3f45133b18c6e0281784304270113416/libcst/codemod/commands/rename.py#L35), [upgrading between APIs](https://github.com/Instagram/LibCST/blob/f3811a0e3f45133b18c6e0281784304270113416/libcst/codemod/commands/convert_format_to_fstring.py#L221). It's [widely used](https://github.com/Instagram/LibCST/network/dependents?package_id=UGFja2FnZS00MDE2NTkxNjg%3D) in [somewhat](https://github.com/python-packaging/dowsing/blob/d3724a35e5e9e7dcc678d4adf22c2d662016c3fc/dowsing/setuptools/setup_py_parsing.py) ... [surprising](https://github.com/googleapis/python-pubsub/blob/20d661c8562cc1f777ac7b3f1ba03dcad7a831c0/UPGRADING.md#method-calls) ... [places](https://github.com/HypothesisWorks/hypothesis/blob/fd1e9272e1294a2d6402abd71c39dc44c29979d5/hypothesis-python/src/hypothesis/extra/codemods.py). You can read more in the [docs](https://libcst.readthedocs.io/en/latest/) or listen to [a podcast](https://www.pythonpodcast.com/libcst-automated-code-modification-episode-361/). At its core, LibCST needs to parse Python code for further processing. Because it cares about syntactic trivia (comments, whitespace, etc), it can't use Python's built-in [ast](https://docs.python.org/3/library/ast.html) library. Unfortunately, language changes in Python 3.10 meant that this parser couldn't be adapted to process them, so I (as the sole maintainer) decided to rewrite it. This was an ambitious (crazy? 🙃) project that had multiple aspects, so let me explain some of the rewrite's: # Goals ## Future-proof, low maintenance parser LibCST's previous parser is written in Python, based on a fork of [@davidhalter](https://github.com/davidhalter)'s excellent [Parso](https://github.com/davidhalter/parso) project, which itself is a fork of CPython's [pgen2](https://github.com/python/cpython/tree/3680ebed7f3e529d01996dd0318601f9f0d02b4b/Lib/lib2to3/pgen2). Since Python 3.10 however, CPython has moved to [pegen](https://peps.python.org/pep-0617/). This causes two big problems for projects like LibCST that aim to parse Python source code: 1. `pegen` allows language features that were impossible[^peg] to correctly and efficiently implement using `pgen2`. LibCST simply couldn't correctly parse e.g. match statements. 2. Even _if_ some brilliant hack would allow LibCST to parse new language features, the engineering cost of adding these hacks would quickly grow to an unreasonable degree. To compound this, Python the language has seen an increase in language changes in recent years; with a [yearly release cycle](https://peps.python.org/pep-0602/), the maintenance burden on third-party Python parsers is becoming non-trivial. The diagram below shows significant[^diagram] language changes over time: ![[python language changes|Python Language Changes over time]] So the first (and primary) goal of LibCST's parser rewrite was to **find a parser that can sustainably evolve along with the Python language**. ## Drop-in replacement Changing the foundations of such a complex library is no small task. To minimize risk during [[rollout]] and keep the scope reasonable during [[implementation]], I decided to make **100% feature parity with the old implementation** a goal. In practice this means: - exposing (Python) APIs for parsing entire files, single statements, and expressions - making sure the new parser can process any input the old parser could (even if it's the incorrect behavior) - supporting different version targets for parsing There's a significant advantage of this approach: 99% of the existing test suite of LibCST (that was written with the old parser in mind) could be reused. ## Long-term goals While the above two were the primary goals of the rewrite, it is prudent to take the opportunity and use this project to move LibCST towards a longer-term goal: being able to power interactive refactors (say, in IDE scenarios). What this means for the project: - **parsing needs to be fast**; a UI action stops being delightful if it takes longer than 300ms. Parsing a typical Python file needs to be a fraction of that - eventually LibCST will need to **support parsing partially written files that have syntax errors** Both of these run counter to the primary goal - it would be way easier to achieve a fast, partial parser by hand-rolling the parsing logic. That will be much harder to maintain in the long run, though. But hey, who doesn't like a challenge? 🙃 # Decision Based on all of this, and after a bit of prototyping, I decided to re-implement the parser logic in Rust, using [rust-peg](https://github.com/kevinmehall/rust-peg) as the parser generator, [PyO3](https://pyo3.rs/) as the Rust-Python bridge, and [@bgw](https://github.com/bgw)'s tokenizer implementation. Stay tuned for notes on the [[implementation]] and [[rollout]]. [^peg]: For language enthusiasts: `pgen2` was good at parsing `LL(1)` grammars while `pegen` is for parsing `PEG` - which are like unambiguous context free grammars. It _is_ possible to implement some of these features as [recent improvements](https://github.com/psf/black/commits/main?author=isidentical) to Black's `blib2to3` show (excellent work by [@isidentical](https://github.com/isidentical) by the way), but at the cost of correctness and complexity of the implementation. [^diagram]: In the interest of keeping the diagram readable, minor language changes are sometimes grouped together, or missing altogether.