cargo : regex-syntax @ 0.8.10
PE Patrick Elsen signed 2026-05-28 published 2026-05-28

Claims

algorithm-impl-boundsalgorithm-impl-safealgorithm-impl-testeddatastructure-impl-boundsdatastructure-impl-safedatastructure-impl-testedhas-binarieshas-build-exechas-fuzz-testshas-install-exechas-integration-testshas-property-testshas-unit-testsimpl-algorithmimpl-concurrencyimpl-cryptoimpl-datastructureimpl-interpreterimpl-jitimpl-parserimpl-protocolis-benignparser-impl-safeparser-impl-testeduses-concurrencyuses-cryptouses-environmentuses-execuses-filesystemuses-interpreteruses-jituses-networkuses-unsafe

Summary

regex-syntax 0.8.10 parses regular expressions into an AST and translates them to an HIR, with #![forbid(unsafe_code)], no I/O, and one optional off-by-default dependency. Parsing, traversal, the nest-limit check, and the destructors are all iterative, so deeply nested patterns bound stack usage to a constant; repetition counts are validated as u32. No findings.

Report

Subject

regex-syntax 0.8.10 is the parser front end of the Rust regex ecosystem, maintained under rust-lang/regex. It converts a regular expression pattern string into an abstract syntax tree (Ast) via ast::parse::Parser, then translates that AST into a high-level intermediate representation (Hir) via hir::translate::Translator. The top-level Parser and the free parse function combine both steps. The crate also exposes character-class interval sets, Unicode property and case-folding lookups, UTF-8 range encoding, and limited literal extraction from an Hir. It is #![no_std] (using alloc) and ships bundled Unicode data tables behind granular feature flags. It does no I/O; it is a build-time and parse-time library.

Methodology

Tools: openvet 0.6.0, ripgrep, diff, git, and the Read tool. I read lib.rs, parser.rs, and the recursion-handling core in full: the iterative parse loop and NestLimiter in ast/parse.rs, the HeapVisitor driver in ast/visitor.rs, the AST-to-HIR translator's stack handling in hir/translate.rs, the custom iterative Drop impls for Ast, ClassSet, and Hir, the interval-set algebra and case folding in hir/interval.rs, and the SimpleCaseFolder table lookups in unicode.rs. I surveyed the remaining modules and the large generated unicode_tables/ files. Capability surveys for network, filesystem, process, environment, crypto, RNG, and concurrency all returned nothing. diff -rq contents vcs differs only in the cargo-normalised Cargo.toml; every source file is byte-identical to the VCS checkout (the vcs symlink resolves to the regex-syntax subdirectory of the rust-lang/regex workspace). I did not fetch the Unicode Character Database or the regex syntax specification, so I did not independently re-derive the generated tables or parse semantics against a reference.

Results

The crate is byte-equivalent to its VCS source aside from manifest normalisation. It carries #![forbid(unsafe_code)]; the only unsafe token in src/ is that attribute, so there is no unsafe code (uses-unsafe), no FFI, and no memory-management hazards to review. There is no networking, filesystem, process execution, environment access, cryptography, JIT, or interpreter (uses-network, uses-filesystem, uses-exec, uses-environment, uses-crypto, uses-jit, uses-interpreter, uses-concurrency), and no build or install scripts, binaries, crypto, protocol, or concurrency implementation (has-build-exec, has-install-exec, has-binaries, impl-crypto, impl-protocol, impl-concurrency, impl-interpreter, impl-jit). The code is benign: no obfuscation, encoded blobs, telemetry, or hidden endpoints (is-benign). The sole dependency, arbitrary, is optional and off by default.

The crate implements a parser (impl-parser), the interval-set data structure backing character classes (impl-datastructure), and the parsing, translation, case-folding, and UTF-8-encoding algorithms (impl-algorithm). Its central safety property is recursion avoidance on adversarial input. Parsing keeps nested groups and classes on explicit heap stacks rather than the call stack; AST traversal and AST-to-HIR translation run through HeapVisitor, which uses constant stack space and heap proportional to input size; and Ast, ClassSet, and Hir each carry a custom iterative Drop so that destroying a deeply nested value cannot overflow the stack. The post-parse NestLimiter (default limit 250) bounds tree depth with a checked_add guard, and literal extraction is gated by configurable limit_repeat, limit_class, and limit_total bounds. Repetition counts parse through u32::from_str_radix(..).ok(), so an out-of-range bound such as a{0,99999999999} returns DecimalInvalid rather than wrapping or panicking; counts are carried into the HIR as u32 min/max pairs without expansion at parse time, so large bounds are not a parse-time blowup. Together these justify parser-impl-safe, datastructure-impl-safe, algorithm-impl-safe, and algorithm-impl-bounds. The crate is exercised by 200-plus inline unit tests (has-unit-tests, parser-impl-tested, algorithm-impl-tested, datastructure-impl-tested) and by workspace fuzz targets with OSS-Fuzz integration and a checked-in regression corpus, including an arbitrary-driven AST round-trip target (has-fuzz-tests). It has no separate integration-test crate and no property-test harness (has-integration-tests, has-property-tests). No findings were recorded.

I did not assert parser-impl-correct, algorithm-impl-correct, or datastructure-impl-correct. The Unicode tables are generated from the UCD and the parse semantics track the regex crate's documented syntax, but verifying either against the reference data would require fetching the UCD and the spec, which was out of scope here.

Conclusion

regex-syntax 0.8.10 parses regular expressions into an AST and translates that AST into an HIR with no unsafe code, no I/O, and one optional, off-by-default dependency. Its attack surface is parser correctness and denial of service on adversarial patterns rather than memory safety. The parse loop, AST and HIR traversal, the nest-limit check, and the destructors are all iterative, bounding stack usage to a constant regardless of nesting depth; repetition counts are validated as u32 and rejected when out of range. The audit recorded no findings. Correctness of the generated Unicode tables against the UCD was not independently verified.

Findings

No findings.

Annotations(5)

src/ast/mod.rs

src/ast/mod.rs, line 1634-1684

impl Drop for Ast {
    fn drop(&mut self) {
        use core::mem;

        match *self {
            Ast::Empty(_)
            | Ast::Flags(_)
            | Ast::Literal(_)
            | Ast::Dot(_)
            | Ast::Assertion(_)
            | Ast::ClassUnicode(_)
            | Ast::ClassPerl(_)
            // Bracketed classes are recursive, they get their own Drop impl.
            | Ast::ClassBracketed(_) => return,
            Ast::Repetition(ref x) if !x.ast.has_subexprs() => return,
            Ast::Group(ref x) if !x.ast.has_subexprs() => return,
            Ast::Alternation(ref x) if x.asts.is_empty() => return,
            Ast::Concat(ref x) if x.asts.is_empty() => return,
            _ => {}
        }

        let empty_span = || Span::splat(Position::new(0, 0, 0));
        let empty_ast = || Ast::empty(empty_span());
        let mut stack = vec![mem::replace(self, empty_ast())];
        while let Some(mut ast) = stack.pop() {
            match ast {
                Ast::Empty(_)
                | Ast::Flags(_)
                | Ast::Literal(_)
                | Ast::Dot(_)
                | Ast::Assertion(_)
                | Ast::ClassUnicode(_)
                | Ast::ClassPerl(_)
                // Bracketed classes are recursive, so they get their own Drop
                // impl.
                | Ast::ClassBracketed(_) => {}
                Ast::Repetition(ref mut x) => {
                    stack.push(mem::replace(&mut x.ast, empty_ast()));
                }
                Ast::Group(ref mut x) => {
                    stack.push(mem::replace(&mut x.ast, empty_ast()));
                }
                Ast::Alternation(ref mut x) => {
                    stack.extend(x.asts.drain(..));
                }
                Ast::Concat(ref mut x) => {
                    stack.extend(x.asts.drain(..));
                }
            }
        }
    }

Ast has a custom Drop that deallocates iteratively using a heap work stack, so destroying a deeply nested AST does not recurse and cannot overflow the call stack. ClassSet (lines 1689 onward) and Hir (in hir/mod.rs) carry the same iterative-drop pattern. Without these, dropping a pathological pattern such as a few hundred thousand nested groups would recurse through the default destructor and crash. The interval-set data structure backing character classes (hir/interval.rs) keeps its ranges in a sorted, non-overlapping canonical form; union, intersection, and difference are linear merges over that form and canonicalize re-sorts in O(n log n), with no adversarial-input degradation beyond the sort. Justifies algorithm-impl-bounds, datastructure-impl-safe, and datastructure-impl-bounds.

src/ast/parse.rs

src/ast/parse.rs, line 2266-2306

struct NestLimiter<'p, 's, P> {
    /// The parser that is checking the nest limit.
    p: &'p ParserI<'s, P>,
    /// The current depth while walking an Ast.
    depth: u32,
}

impl<'p, 's, P: Borrow<Parser>> NestLimiter<'p, 's, P> {
    fn new(p: &'p ParserI<'s, P>) -> NestLimiter<'p, 's, P> {
        NestLimiter { p, depth: 0 }
    }

    #[inline(never)]
    fn check(self, ast: &Ast) -> Result<()> {
        ast::visit(ast, self)
    }

    fn increment_depth(&mut self, span: &Span) -> Result<()> {
        let new = self.depth.checked_add(1).ok_or_else(|| {
            self.p.error(
                span.clone(),
                ast::ErrorKind::NestLimitExceeded(u32::MAX),
            )
        })?;
        let limit = self.p.parser().nest_limit;
        if new > limit {
            return Err(self.p.error(
                span.clone(),
                ast::ErrorKind::NestLimitExceeded(limit),
            ));
        }
        self.depth = new;
        Ok(())
    }

    fn decrement_depth(&mut self) {
        // Assuming the correctness of the visitor, this should never drop
        // below 0.
        self.depth = self.depth.checked_sub(1).unwrap();
    }
}

The NestLimiter enforces the parser's nest_limit (default 250) after the full Ast is built. It runs as an ast::Visitor over the heap-based visitor driver, so the depth check itself uses constant stack space. increment_depth uses checked_add, returning NestLimitExceeded(u32::MAX) rather than overflowing, and rejects any node whose depth exceeds the configured limit. Base-case node kinds (literals, dot, assertions, Perl/Unicode classes) do not increment depth; only the recursive kinds do. This is the mechanism that lets downstream consumers walk an Ast with explicit recursion without risking stack overflow on adversarial patterns. Justifies parser-impl-safe and algorithm-impl-bounds.

src/ast/parse.rs, line 982-1032

    fn parse_with_comments(&self) -> Result<ast::WithComments> {
        assert_eq!(self.offset(), 0, "parser can only be used once");
        self.parser().reset();
        let mut concat = ast::Concat { span: self.span(), asts: vec![] };
        loop {
            self.bump_space();
            if self.is_eof() {
                break;
            }
            match self.char() {
                '(' => concat = self.push_group(concat)?,
                ')' => concat = self.pop_group(concat)?,
                '|' => concat = self.push_alternate(concat)?,
                '[' => {
                    let class = self.parse_set_class()?;
                    concat.asts.push(Ast::class_bracketed(class));
                }
                '?' => {
                    concat = self.parse_uncounted_repetition(
                        concat,
                        ast::RepetitionKind::ZeroOrOne,
                    )?;
                }
                '*' => {
                    concat = self.parse_uncounted_repetition(
                        concat,
                        ast::RepetitionKind::ZeroOrMore,
                    )?;
                }
                '+' => {
                    concat = self.parse_uncounted_repetition(
                        concat,
                        ast::RepetitionKind::OneOrMore,
                    )?;
                }
                '{' => {
                    concat = self.parse_counted_repetition(concat)?;
                }
                _ => concat.asts.push(self.parse_primitive()?.into_ast()),
            }
        }
        let ast = self.pop_group_end(concat)?;
        NestLimiter::new(self).check(&ast)?;
        Ok(ast::WithComments {
            ast,
            comments: mem::replace(
                &mut *self.parser().comments.borrow_mut(),
                vec![],
            ),
        })
    }

The main parse loop is iterative. Nested groups and alternations are managed through the explicit stack_group and stack_class heap stacks (RefCell<Vec<..>>) rather than recursion, so parsing a deeply nested pattern consumes heap, not call stack. The loop dispatches on a single character at a time and pushes onto the in-progress concatenation. parse_decimal (used for {m,n} repetition counts) collects digits and converts via u32::from_str_radix(..).ok(); an out-of-range value such as a{0,99999999999} yields DecimalInvalid rather than panicking or wrapping. Repetition bounds are kept as u32 and carried verbatim into the HIR Repetition { min, max } without expansion at parse time. Justifies parser-impl-safe.

src/ast/visitor.rs

src/ast/visitor.rs, line 104-120

/// Executes an implementation of `Visitor` in constant stack space.
///
/// This function will visit every node in the given `Ast` while calling the
/// appropriate methods provided by the [`Visitor`] trait.
///
/// The primary use case for this method is when one wants to perform case
/// analysis over an `Ast` without using a stack size proportional to the depth
/// of the `Ast`. Namely, this method will instead use constant stack size, but
/// will use heap space proportional to the size of the `Ast`. This may be
/// desirable in cases where the size of `Ast` is proportional to end user
/// input.
///
/// If the visitor returns an error at any point, then visiting is stopped and
/// the error is returned.
pub fn visit<V: Visitor>(ast: &Ast, visitor: V) -> Result<V::Output, V::Err> {
    HeapVisitor::new().visit(ast, visitor)
}

visit walks an Ast using HeapVisitor, which keeps its own work stack on the heap rather than recursing on the native call stack. The module doc states the intent: constant stack usage, heap usage proportional to the size of the Ast. Both the AST-to-HIR Translator (hir/translate.rs) and the NestLimiter (ast/parse.rs) run through this driver, so neither translation nor the nest-limit check can stack-overflow on a deeply nested pattern. Justifies parser-impl-safe and algorithm-impl-bounds.

src/lib.rs

src/lib.rs, line 167-197

#![no_std]
#![forbid(unsafe_code)]
#![deny(missing_docs, rustdoc::broken_intra_doc_links)]
#![warn(missing_debug_implementations)]
// This adds Cargo feature annotations to items in the rustdoc output. Which is
// sadly hugely beneficial for this crate due to the number of features.
#![cfg_attr(docsrs_regex, feature(doc_cfg))]

#[cfg(any(test, feature = "std"))]
extern crate std;

extern crate alloc;

pub use crate::{
    error::Error,
    parser::{parse, Parser, ParserBuilder},
    unicode::UnicodeWordError,
};

use alloc::string::String;

pub mod ast;
mod debug;
mod either;
mod error;
pub mod hir;
mod parser;
mod rank;
mod unicode;
mod unicode_tables;
pub mod utf8;

The crate is #![no_std] and #![forbid(unsafe_code)]. The only unsafe token in the entire src/ tree is this forbid attribute, so the compiler rejects any unsafe block. There is no I/O: surveys for std::net, std::fs, std::process, std::env, and RNG return nothing. The public surface is a Parser/ParserBuilder, the free parse function, and the ast, hir, and utf8 modules. Justifies uses-unsafe, uses-network, uses-filesystem, uses-environment, uses-exec, uses-crypto, uses-concurrency, uses-jit, uses-interpreter, has-build-exec, has-install-exec, has-binaries, and is-benign.

src/unicode.rs

src/unicode.rs, line 80-201

#[derive(Debug)]
pub struct SimpleCaseFolder {
    /// The simple case fold table. It's a sorted association list, where the
    /// keys are Unicode scalar values and the values are the corresponding
    /// equivalence class (not including the key) of the "simple" case folded
    /// Unicode scalar values.
    table: &'static [(char, &'static [char])],
    /// The last codepoint that was used for a lookup.
    last: Option<char>,
    /// The index to the entry in `table` corresponding to the smallest key `k`
    /// such that `k > k0`, where `k0` is the most recent key lookup. Note that
    /// in particular, `k0` may not be in the table!
    next: usize,
}

impl SimpleCaseFolder {
    /// Create a new simple case folder, returning an error if the underlying
    /// case folding table is unavailable.
    pub fn new() -> Result<SimpleCaseFolder, CaseFoldError> {
        #[cfg(not(feature = "unicode-case"))]
        {
            Err(CaseFoldError(()))
        }
        #[cfg(feature = "unicode-case")]
        {
            Ok(SimpleCaseFolder {
                table: crate::unicode_tables::case_folding_simple::CASE_FOLDING_SIMPLE,
                last: None,
                next: 0,
            })
        }
    }

    /// Return the equivalence class of case folded codepoints for the given
    /// codepoint. The equivalence class returned never includes the codepoint
    /// given. If the given codepoint has no case folded codepoints (i.e.,
    /// no entry in the underlying case folding table), then this returns an
    /// empty slice.
    ///
    /// # Panics
    ///
    /// This panics when called with a `c` that is less than or equal to the
    /// previous call. In other words, callers need to use this method with
    /// strictly increasing values of `c`.
    pub fn mapping(&mut self, c: char) -> &'static [char] {
        if let Some(last) = self.last {
            assert!(
                last < c,
                "got codepoint U+{:X} which occurs before \
                 last codepoint U+{:X}",
                u32::from(c),
                u32::from(last),
            );
        }
        self.last = Some(c);
        if self.next >= self.table.len() {
            return &[];
        }
        let (k, v) = self.table[self.next];
        if k == c {
            self.next += 1;
            return v;
        }
        match self.get(c) {
            Err(i) => {
                self.next = i;
                &[]
            }
            Ok(i) => {
                // Since we require lookups to proceed
                // in order, anything we find should be
                // after whatever we thought might be
                // next. Otherwise, the caller is either
                // going out of order or we would have
                // found our next key at 'self.next'.
                assert!(i > self.next);
                self.next = i + 1;
                self.table[i].1
            }
        }
    }

    /// Returns true if and only if the given range overlaps with any region
    /// of the underlying case folding table. That is, when true, there exists
    /// at least one codepoint in the inclusive range `[start, end]` that has
    /// a non-trivial equivalence class of case folded codepoints. Conversely,
    /// when this returns false, all codepoints in the range `[start, end]`
    /// correspond to the trivial equivalence class of case folded codepoints,
    /// i.e., itself.
    ///
    /// This is useful to call before iterating over the codepoints in the
    /// range and looking up the mapping for each. If you know none of the
    /// mappings will return anything, then you might be able to skip doing it
    /// altogether.
    ///
    /// # Panics
    ///
    /// This panics when `end < start`.
    pub fn overlaps(&self, start: char, end: char) -> bool {
        use core::cmp::Ordering;

        assert!(start <= end);
        self.table
            .binary_search_by(|&(c, _)| {
                if start <= c && c <= end {
                    Ordering::Equal
                } else if c > end {
                    Ordering::Greater
                } else {
                    Ordering::Less
                }
            })
            .is_ok()
    }

    /// Returns the index at which `c` occurs in the simple case fold table. If
    /// `c` does not occur, then this returns an `i` such that `table[i-1].0 <
    /// c` and `table[i].0 > c`.
    fn get(&self, c: char) -> Result<usize, usize> {
        self.table.binary_search_by_key(&c, |&(c1, _)| c1)
    }
}

SimpleCaseFolder reads the static CASE_FOLDING_SIMPLE table, a sorted association list of (char, &[char]). Lookups use binary search (get, overlaps) with an amortized monotonic-cursor fast path in mapping. The mapping contract requires strictly increasing input codepoints and asserts on violation; callers in hir/interval.rs iterate ranges in sorted order. Folding a class (IntervalSet::case_fold_simple) appends folded codepoints and then canonicalizes back to sorted, non-overlapping ranges. The table itself is generated from the Unicode Character Database; its entries were not independently re-derived against the UCD in this audit. Justifies algorithm-impl-safe.