cargo : regex-automata @ 0.4.14
PE Patrick Elsen signed 2026-05-28 published 2026-05-28

README.md

118 lines · markdown

regex-automata==============This crate exposes a variety of regex engines used by the `regex` crate.It provides a vast, sprawling and "expert" level API to each regex engine.The regex engines provided by this crate focus heavily on finite automataimplementations and specifically guarantee worst case `O(m * n)` timecomplexity for all searches. (Where `m ~ len(regex)` and `n ~ len(haystack)`.)[![Build status](https://github.com/rust-lang/regex/workflows/ci/badge.svg)](https://github.com/rust-lang/regex/actions)[![Crates.io](https://img.shields.io/crates/v/regex-automata.svg)](https://crates.io/crates/regex-automata)### Documentationhttps://docs.rs/regex-automata### ExampleThis example shows how to search for matches of multiple regexes, where eachregex uses the same capture group names to parse different key-value formats.```rustuse regex_automata::{meta::Regex, PatternID};let re = Regex::new_many(&[    r#"(?m)^(?<key>[[:word:]]+)=(?<val>[[:word:]]+)$"#,    r#"(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$"#,    r#"(?m)^(?<key>[[:word:]]+)='(?<val>[^']+)'$"#,    r#"(?m)^(?<key>[[:word:]]+):\s*(?<val>[[:word:]]+)$"#,]).unwrap();let hay = r#"best_album="Blow Your Face Out"best_quote='"then as it was, then again it will be"'best_year=1973best_simpsons_episode: HOMR"#;let mut kvs = vec![];for caps in re.captures_iter(hay) {    // N.B. One could use capture indices '1' and '2' here    // as well. Capture indices are local to each pattern.    // (Just like names are.)    let key = &hay[caps.get_group_by_name("key").unwrap()];    let val = &hay[caps.get_group_by_name("val").unwrap()];    kvs.push((key, val));}assert_eq!(kvs, vec![    ("best_album", "Blow Your Face Out"),    ("best_quote", "\"then as it was, then again it will be\""),    ("best_year", "1973"),    ("best_simpsons_episode", "HOMR"),]);```### Safety**I welcome audits of `unsafe` code.**This crate tries to be extremely conservative in its use of `unsafe`, but doesuse it in a few spots. In general, I am very open to removing uses of `unsafe`if it doesn't result in measurable performance regressions and doesn't resultin significantly more complex code.Below is an outline of how `unsafe` is used in this crate.* `util::pool::Pool` makes use of `unsafe` to implement a fast path foraccessing an element of the pool. The fast path applies to the first threadthat uses the pool. In effect, the fast path is fast because it avoids a mutexlock. `unsafe` is also used in the no-std version of `Pool` to implement a spinlock for synchronization.* `util::lazy::Lazy` uses `unsafe` to implement a variant of`once_cell::sync::Lazy` that works in no-std environments. A no-std no-allocimplementation is also provided that requires use of `unsafe`.* The `dfa` module makes extensive use of `unsafe` to support zero-copydeserialization of DFAs. The high level problem is that you need to get from`&[u8]` to the internal representation of a DFA without doing any copies.This is required for support in no-std no-alloc environments. It also makesdeserialization extremely cheap.* The `dfa` and `hybrid` modules use `unsafe` to explicitly elide bounds checksin the core search loops. This makes the codegen tighter and typically leads toconsistent 5-10% performance improvements on some workloads.In general, the above reflect the only uses of `unsafe` throughout the entire`regex` crate. At present, there are no plans to meaningfully expand the useof `unsafe`. With that said, one thing folks have been asking for is cheapdeserialization of a `regex::Regex`. My sense is that this feature will requirea lot more `unsafe` in places to support zero-copy deserialization. It isunclear at this point whether this will be pursued.### MotivationI started out building this crate because I wanted to re-work the `regex`crate internals to make it more amenable to optimizations. It turns out thatthere are a lot of different ways to build regex engines and even more ways tocompose them. Moreover, heuristic literal optimizations are often tricky toget correct, but the fruit they bear is attractive. All of these things weredifficult to expand upon without risking the introduction of more bugs. So Idecided to tear things down and start fresh.In the course of doing so, I ended up designing strong boundaries between eachcomponent so that each component could be reasoned and tested independently.This also made it somewhat natural to expose the components as a library untoitself. Namely, folks have been asking for more capabilities in the regexcrate for a long time, but these capabilities usually come with additional APIcomplexity that I didn't want to introduce in the `regex` crate proper. Butexposing them in an "expert" level crate like `regex-automata` seemed quitefine.In the end, I do still somewhat consider this crate an experiment. It isunclear whether the strong boundaries between components will be an impedimentto ongoing development or not. De-coupling tends to lead to slower developmentin my experience, and when you mix in the added cost of not introducingbreaking changes all the time, things can get quite complicated. But, Idon't think anyone has ever release the internals of a regex engine as alibrary before. So it will be interesting to see how it plays out!