Audit · shlex@2.0.1

cargo : shlex @ 2.0.1

Patrick Elsen, signed 2026-06-02, published 2026-06-02

Claims

has-binaries, has-build-exec, has-fuzz-tests, has-install-exec, has-integration-tests, has-property-tests, has-unit-tests, impl-algorithm, impl-concurrency, impl-crypto, impl-datastructure, impl-interpreter, impl-jit, impl-parser, impl-protocol, is-benign, parser-impl-correct, parser-impl-safe, parser-impl-tested, unsafe-documented, unsafe-minimal, unsafe-safe, unsafe-tested, uses-concurrency, uses-crypto, uses-environment, uses-exec, uses-filesystem, uses-interpreter, uses-jit, uses-network, uses-unsafe

Summary

Audit of shlex 2.0.1, a small POSIX-shell-word splitter/quoter (split, try_quote, try_join, Shlex iterator). Matches upstream Git byte-for-byte; no dependencies, no I/O, no concurrency, no build script. The byte-level parser is panic-free and the string-typed unsafe UTF-8 wrappers in lib.rs are sound. Two informational findings: a documented threat-model caveat (output not safe for interactive shells) and the soundness analysis for the unsafe blocks. Includes the RUSTSEC-2024-0006 fix.

Report

Subject

shlex is a Rust library that splits a string into shell words and quotes strings for safe use as shell arguments, modelled on Python's shlex.split and matching the default POSIX shell grammar. The public API is Shlex (an iterator), split, try_quote, try_join, and a Quoter builder. The crate exposes both a string-typed surface in src/lib.rs and a byte-typed surface in src/bytes.rs; the string surface is a thin unsafe wrapper around the byte surface. The crate supports no_std with alloc available (the std feature is on by default and only adds std::error::Error for QuoteError).

Methodology

The published crate contents were compared against the upstream Git repository at the commit recorded in .cargo_vcs_info.json using diff -rq; the two Rust source files were also diffed with diff and match byte-for-byte. Both src/lib.rs (324 lines) and src/bytes.rs (516 lines) were read in full, including the in-file unit tests. src/quoting_warning.md (documentation, loaded as a doc module) was read in full. grep was used to enumerate every unsafe token in the source (seven hits in src/lib.rs, none in src/bytes.rs) and to confirm absence of std::net, std::fs, std::process, std::env, thread::, and cryptographic crates. The upstream vcs/fuzz/ directory and .github/workflows/test.yml were inspected to confirm the project ships a fuzz harness (fuzz_quote.rs, fuzz_next.rs, plus three differential-fuzz subprojects against Python shlex, real shells, and wordexp) and that CI builds fuzz/ on stable and beta with -Dwarnings. The code was not built or executed locally.

Results

The published crate matches its upstream Git tree byte-for-byte; only the cargo-generated artefacts (.cargo_vcs_info.json, Cargo.lock, Cargo.toml.orig, normalised Cargo.toml) and the upstream fuzz/ subproject (excluded from the published crate) differ. The crate ships no binaries (justifying has-binaries), no build.rs (justifying has-build-exec), no install hooks (justifying has-install-exec), and has no declared dependencies in any scope — the crate is self-contained, using only alloc and optionally std.

The audited code makes no network calls (justifying uses-network), no filesystem calls (justifying uses-filesystem), no subprocess spawns (justifying uses-exec), no env-var reads (justifying uses-environment), no concurrency primitives (justifying uses-concurrency), no cryptographic operations (justifying uses-crypto, impl-crypto), no JIT (justifying uses-jit, impl-jit), and no interpreter (justifying uses-interpreter, impl-interpreter). The single implementation focus is shell-word parsing and quoting — there are no data structures, algorithms, protocols, or concurrency primitives implemented here (justifying impl-datastructure, impl-algorithm, impl-protocol, impl-concurrency).

The parser (bytes::Shlex, bytes::split) is a hand-written single-pass state machine over a core::slice::Iter<'a, u8>, modelled on the POSIX shell tokenizer at pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html (justifying impl-parser). It recognises single quotes, double quotes (including the four documented backslash escape targets $, `, ", \ plus \<newline>), unquoted backslash escapes, # comments, and whitespace separators. There are no panics, no recursion, and no allocations beyond the per-word Vec<u8> buffer — error states return None after setting the public had_error flag (justifying parser-impl-safe). The in-tree test suite covers 20+ split cases and 25+ quote round-trip cases, including the historical edge cases ('\n', embedded backslashes, nul-byte rejection), justifying has-unit-tests; the upstream repository additionally ships libfuzzer harnesses (fuzz/fuzz_targets/fuzz_quote.rs, fuzz_next.rs) that differential-fuzz against Python shlex, real shells (bash, zsh, dash, mksh), and wordexp, and CI builds those harnesses on every push, justifying has-fuzz-tests and parser-impl-tested. The repository does not include a separate tests/ directory or property-test crate, justifying has-integration-tests and has-property-tests. The implementation conforms to the documented POSIX subset with explicit, documented deviations (no \r special handling, no Python-shlex customisation knobs) and includes the RUSTSEC-2024-0006 fix for {, }, and \xa0 quoting (justifying parser-impl-correct).

Two findings were recorded, both low-severity and both informational rather than defects:

FINDING-1 documents the threat-model boundary: try_quote/try_join output is safe to feed to non-interactive POSIX shells but unsafe to pipe into interactive shells or cooked-mode ptys because POSIX shell syntax cannot portably escape control bytes. This is exhaustively documented in src/quoting_warning.md and the top-level doc comment.
FINDING-2 records the analysis behind the unsafe blocks in src/lib.rs: the string-typed API uses String::from_utf8_unchecked / core::str::from_utf8_unchecked on the output of the byte-level parser/quoter, relying on the invariant that those routines never split multi-byte UTF-8 sequences. Reviewing every match arm in parse_word/parse_single/parse_double and every byte insertion in append_quoted_chunk confirms the invariant. No defect; the analysis supports unsafe-safe.

The crate exposes seven unsafe tokens, all in src/lib.rs: two pub unsafe fns (Shlex::from_bytes, Shlex::as_bytes_mut) with explicit safety contracts, and five internal unsafe { ... } blocks that perform the unchecked UTF-8 transmutation described above (justifying unsafe-documented). The 2.0.0 release removed the unsound DerefMut impl that previously made this surface trivially unsound and replaced it with these explicitly-marked APIs (per the CHANGELOG). The unsafe is used only where strictly necessary to avoid re-validating output the byte layer has already produced in UTF-8 (justifying unsafe-minimal), and is indirectly exercised by the parse / quote test suites and the differential fuzz harnesses (justifying unsafe-tested).

The code makes no malicious calls — no data exfiltration, no telemetry, no obfuscated payloads, no targeted cfg branches — supporting is-benign.

Conclusion

shlex 2.0.1 is a small, focused, single-crate library with a well-defined POSIX-shell tokenizer/quoter, no runtime dependencies, no I/O of any kind, and unusually thorough security documentation. The two findings recorded are informational notes about the threat model and the soundness argument for the crate's small set of unsafe blocks; neither is a defect. The author's response to the historical RUSTSEC-2024-0006 advisory (the 1.2.1 / 1.3.0 / 2.0.0 line of releases) shows both the fix itself (now baked into 2.0.1) and an extensive write-up in quoting_warning.md of the broader class of issues, which is appropriate care for a library at this position in the dependency graph.

Findings (2)

FINDING-1 security low

Quoting output is unsafe to pipe into interactive shells (documented)

The crate's Quoter::quote / try_quote / try_join functions are documented as safe for noninteractive shell contexts (scripts, sh -c arguments, sourced scripts) but not safe to feed into the stdin of an interactive shell (bash -i) or a cooked-mode pty. Control bytes \x00..\x1f and \x7f are not — and cannot portably be — quoted by POSIX shell syntax: there are no numeric escape sequences, so a control byte in the input must appear literally in the output, and an interactive shell's line editor (or the tty layer) will interpret it before parsing even begins, allowing command injection.

This behaviour is exhaustively documented in src/quoting_warning.md and the warning is mirrored at the top of the crate-level doc comment in src/lib.rs. The crate also rejects nul bytes by default (QuoteError::Nul) and includes the past RUSTSEC-2024-0006 fix ({, }, \xa0 are now quoted). Recording the finding to make the threat-boundary visible: consumers must not pipe try_quote/try_join output to interactive shells or cooked-mode ptys. This is by design, not a defect.
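A consumer that cannot rule out an interactive destination may want to refuse such input before quoting it. The following is a minimal sketch of a caller-side guard, not part of shlex: `first_control_byte` is a hypothetical helper name, and the byte range is the one cited in this finding (\x00..\x1f and \x7f). Note that this range includes tab and newline, so the guard is stricter than a non-interactive context requires.

```rust
/// Hypothetical caller-side guard (not part of shlex): returns the offset and
/// value of the first control byte (0x00..=0x1f or 0x7f) in `input`, if any.
fn first_control_byte(input: &[u8]) -> Option<(usize, u8)> {
    input
        .iter()
        .copied()
        .enumerate()
        .find(|&(_, b)| b <= 0x1f || b == 0x7f)
}

fn main() {
    // A plain argument contains no control bytes.
    assert_eq!(first_control_byte(b"hello world"), None);
    // An embedded ETX (Ctrl-C, 0x03) is reported together with its offset.
    assert_eq!(first_control_byte(b"ab\x03cd"), Some((2, 0x03)));
    // DEL (0x7f) is treated as a control byte as well.
    assert_eq!(first_control_byte(b"x\x7f"), Some((1, 0x7f)));
}
```

Whether to reject, strip, or merely log such input is a policy decision for the consumer; the sketch only locates the offending byte.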

FINDING-2 safety low

UTF-8 invariant in unsafe wrappers depends on per-byte parser/quoter behaviour

The string-typed entry points in src/lib.rs are thin wrappers around the byte-typed implementations in src/bytes.rs. They use String::from_utf8_unchecked / core::str::from_utf8_unchecked to avoid re-validating output (lines 85, 167, 175, 179) with the rationale "given valid UTF-8, bytes::Shlex / bytes::quote() will always return valid UTF-8." The soundness of every unsafe block in the crate rests on that invariant.

The invariant holds because the byte-level parser/quoter only ever matches on ASCII bytes (< 0x80), namely quote delimiters, backslash, whitespace, and #, and pushes the input bytes unchanged into the output, never splitting a multi-byte UTF-8 sequence. The double-quote escape insertion only prepends a backslash before specific ASCII bytes ($, `, ", \), none of which can appear inside a multi-byte sequence. Shlex::new is the only safe constructor; Shlex::from_bytes and Shlex::as_bytes_mut are pub unsafe fn with explicit safety contracts requiring the caller to ensure UTF-8 validity.

Reviewing the relevant byte-level functions (bytes::Shlex::parse_word/parse_single/parse_double and bytes::Quoter::quote/append_quoted_chunk) confirms the invariant. No defect found; recording the analysis as a safety note so a future contributor changing those functions can see what to preserve.
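The argument above rests on a property of UTF-8 itself: every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter can never fall inside one. The sketch below illustrates this with a hypothetical stand-in splitter (`split_on_ascii_space`, not the crate's parser) and checks each piece with the validating conversion.

```rust
/// Hypothetical stand-in for a byte-level splitter (not the crate's parser):
/// cuts the input only at ASCII space bytes and copies all other bytes unchanged.
fn split_on_ascii_space(input: &[u8]) -> Vec<Vec<u8>> {
    input
        .split(|&b| b == b' ')
        .filter(|w| !w.is_empty())
        .map(|w| w.to_vec())
        .collect()
}

fn main() {
    let text = "naïve 日本語 🦀";
    // Every byte of a multi-byte UTF-8 sequence is >= 0x80, so an ASCII
    // delimiter can never occur in the middle of one.
    for ch in text.chars().filter(|c| !c.is_ascii()) {
        let mut buf = [0u8; 4];
        assert!(ch.encode_utf8(&mut buf).bytes().all(|b| b >= 0x80));
    }
    // Consequently each piece passes the checked conversion, which is the
    // property the unchecked conversions in src/lib.rs rely on.
    let words = split_on_ascii_space(text.as_bytes());
    assert_eq!(words.len(), 3);
    for w in &words {
        assert!(std::str::from_utf8(w).is_ok());
    }
}
```

The same reasoning would stop applying if a future change matched on, inserted, or dropped a byte >= 0x80, which is the case a contributor would need to re-examine.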

Annotations (3)

`src/bytes.rs`

`src/bytes.rs`, lines 36-154

/// An iterator that takes an input byte string and splits it into the words using the same syntax as
/// the POSIX shell.
pub struct Shlex<'a> {
    in_iter: core::slice::Iter<'a, u8>,
    /// The number of newlines read so far, plus one.
    pub line_no: usize,
    /// An input string is erroneous if it ends while inside a quotation or right after an
    /// unescaped backslash.  Since Iterator does not have a mechanism to return an error, if that
    /// happens, Shlex just throws out the last token, ends the iteration, and sets 'had_error' to
    /// true; best to check it after you're done iterating.
    pub had_error: bool,
}

impl<'a> Shlex<'a> {
    pub fn new(in_bytes: &'a [u8]) -> Self {
        Shlex {
            in_iter: in_bytes.iter(),
            line_no: 1,
            had_error: false,
        }
    }

    fn parse_word(&mut self, mut ch: u8) -> Option<Vec<u8>> {
        let mut result: Vec<u8> = Vec::new();
        loop {
            match ch as char {
                '"' => if let Err(()) = self.parse_double(&mut result) {
                    self.had_error = true;
                    return None;
                },
                '\'' => if let Err(()) = self.parse_single(&mut result) {
                    self.had_error = true;
                    return None;
                },
                '\\' => if let Some(ch2) = self.next_char() {
                    if ch2 != b'\n' { result.push(ch2); }
                } else {
                    self.had_error = true;
                    return None;
                },
                ' ' | '\t' | '\n' => { break; },
                _ => { result.push(ch); },
            }
            if let Some(ch2) = self.next_char() { ch = ch2; } else { break; }
        }
        Some(result)
    }

    fn parse_double(&mut self, result: &mut Vec<u8>) -> Result<(), ()> {
        loop {
            if let Some(ch2) = self.next_char() {
                match ch2 as char {
                    '\\' => {
                        if let Some(ch3) = self.next_char() {
                            match ch3 as char {
                                // \$ => $
                                '$' | '`' | '"' | '\\' => { result.push(ch3); },
                                // \<newline> => nothing
                                '\n' => {},
                                // \x => =x
                                _ => { result.push(b'\\'); result.push(ch3); }
                            }
                        } else {
                            return Err(());
                        }
                    },
                    '"' => { return Ok(()); },
                    _ => { result.push(ch2); },
                }
            } else {
                return Err(());
            }
        }
    }

    fn parse_single(&mut self, result: &mut Vec<u8>) -> Result<(), ()> {
        loop {
            if let Some(ch2) = self.next_char() {
                match ch2 as char {
                    '\'' => { return Ok(()); },
                    _ => { result.push(ch2); },
                }
            } else {
                return Err(());
            }
        }
    }

    fn next_char(&mut self) -> Option<u8> {
        let res = self.in_iter.next().copied();
        if res == Some(b'\n') { self.line_no += 1; }
        res
    }
}

impl Iterator for Shlex<'_> {
    type Item = Vec<u8>;
    fn next(&mut self) -> Option<Self::Item> {
        if let Some(mut ch) = self.next_char() {
            // skip initial whitespace
            loop {
                match ch as char {
                    ' ' | '\t' | '\n' => {},
                    '#' => {
                        while let Some(ch2) = self.next_char() {
                            if ch2 as char == '\n' { break; }
                        }
                    },
                    _ => { break; }
                }
                if let Some(ch2) = self.next_char() { ch = ch2; } else { return None; }
            }
            self.parse_word(ch)
        } else { // no initial character
            None
        }
    }

}

Core POSIX-shell tokenizer: a hand-written single-pass state machine over a byte iterator, recognising single/double quotes, backslash escapes, # comments, and whitespace word separators. Justifies impl-parser. No allocations beyond Vec<u8> word buffers, no recursion, no panics on any input — error states return None after setting had_error.
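Because the error is reported out of band, a caller has to inspect the flag after the iterator is exhausted. The sketch below shows that pattern on a hypothetical, heavily simplified tokenizer (`Words`, handling only spaces and single quotes) with a hypothetical `split_checked` helper; it is an illustration of the design, not the crate's code.

```rust
/// Hypothetical, heavily simplified tokenizer (single quotes and spaces only).
/// It mirrors the out-of-band error flag described above; it is not shlex.
struct Words<'a> {
    bytes: std::slice::Iter<'a, u8>,
    had_error: bool,
}

impl<'a> Words<'a> {
    fn new(input: &'a [u8]) -> Self {
        Words { bytes: input.iter(), had_error: false }
    }
}

impl Iterator for Words<'_> {
    type Item = Vec<u8>;
    fn next(&mut self) -> Option<Vec<u8>> {
        // Skip leading spaces; stop cleanly at end of input.
        let mut ch = loop {
            match self.bytes.next().copied() {
                Some(b' ') => continue,
                Some(b) => break b,
                None => return None,
            }
        };
        let mut word = Vec::new();
        loop {
            match ch {
                b' ' => break,
                b'\'' => loop {
                    match self.bytes.next().copied() {
                        Some(b'\'') => break,
                        Some(b) => word.push(b),
                        // Unterminated quote: drop the token and flag the error.
                        None => {
                            self.had_error = true;
                            return None;
                        }
                    }
                },
                b => word.push(b),
            }
            match self.bytes.next().copied() {
                Some(b) => ch = b,
                None => break,
            }
        }
        Some(word)
    }
}

/// Collects all words and turns the flag into a `Result` for the caller.
fn split_checked(input: &[u8]) -> Result<Vec<Vec<u8>>, ()> {
    let mut words = Words::new(input);
    let out: Vec<Vec<u8>> = words.by_ref().collect();
    if words.had_error { Err(()) } else { Ok(out) }
}

fn main() {
    assert_eq!(split_checked(b"a 'b c'"), Ok(vec![b"a".to_vec(), b"b c".to_vec()]));
    // The unterminated quote is reported instead of silently truncating.
    assert_eq!(split_checked(b"a 'b c"), Err(()));
}
```

Forgetting the final flag check would make truncated input indistinguishable from well-formed input, which is why wrapping the iteration in a helper of this kind can be worthwhile.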

`src/bytes.rs`, lines 244-441

/// Is this ASCII byte okay to emit unquoted?
const fn unquoted_ok(c: u8) -> bool {
    match c as char {
        // Allowed characters:
        '+' | '-' | '.' | '/' | ':' | '@' | ']' | '_' |
        '0'..='9' | 'A'..='Z' | 'a'..='z'
        => true,

        // Non-allowed characters:
        // From POSIX https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html
        // "The application shall quote the following characters if they are to represent themselves:"
        '|' | '&' | ';' | '<' | '>' | '(' | ')' | '$' | '`' | '\\' | '"' | '\'' | ' ' | '\t' | '\n' |
        // "and the following may need to be quoted under certain circumstances[..]:"
        '*' | '?' | '[' | '#' | '~' | '=' | '%' |
        // Brace expansion.  These ought to be in the POSIX list but aren't yet;
        // see: https://www.austingroupbugs.net/view.php?id=1193
        '{' | '}' |
        // Also quote comma, just to be safe in the extremely odd case that the user of this crate
        // is intentionally placing a quoted string inside a brace expansion, e.g.:
        //     format!("echo foo{{a,b,{}}}" | shlex::quote(some_str))
        ',' |
        // '\r' is allowed in a word by all real shells I tested, but is treated as a word
        // separator by Python `shlex` | and might be translated to '\n' in interactive mode.
        '\r' |
        // '!' and '^' are treated specially in interactive mode; see quoting_warning.
        '!' | '^' |
        // Nul bytes and control characters.
        '\x00' ..= '\x1f' | '\x7f'
        => false,
        '\u{80}' ..= '\u{10ffff}' => {
            // This is unreachable since `unquoted_ok` is only called for 0..128.
            // Non-ASCII bytes are handled separately in `quoting_strategy`.
            // Can't call unreachable!() from `const fn` on old Rust, so...
            unquoted_ok(c)
        },
    }
    // Note: The logic cited above for quoting comma might suggest that `..` should also be quoted,
    // it as a special case of brace expansion).  But it's not necessary.  There are three cases:
    //
    // 1. The user wants comma-based brace expansion, but the untrusted string being `quote`d
    //    contains `..`, so they get something like `{foo,bar,3..5}`.
    //  => That's safe; both Bash and Zsh expand this to `foo bar 3..5` rather than
    //     `foo bar 3 4 5`.  The presence of commas disables sequence expression expansion.
    //
    // 2. The user wants comma-based brace expansion where the contents of the braces are a
    //    variable number of `quote`d strings and nothing else.  There happens to be exactly
    //    one string and it contains `..`, so they get something like `{3..5}`.
    //  => Then this will expand as a sequence expression, which is unintended.  But I don't mind,
    //     because any such code is already buggy.  Suppose the untrusted string *didn't* contain
    //     `,` or `..`, resulting in shell input like `{foo}`.  Then the shell would interpret it
    //     as the literal string `{foo}` rather than brace-expanding it into `foo`.
    //
    // 3. The user wants a sequence expression and wants to supply an untrusted string as one of
    //    the endpoints or the increment.
    //  => Well, that's just silly, since the endpoints can only be numbers or single letters.
}

/// Optimized version of `unquoted_ok`.
fn unquoted_ok_fast(c: u8) -> bool {
    const UNQUOTED_OK_MASK: u128 = {
        // Make a mask of all bytes in 0..<0x80 that pass.
        let mut c = 0u8;
        let mut mask = 0u128;
        while c < 0x80 {
            if unquoted_ok(c) {
                mask |= 1u128 << c;
            }
            c += 1;
        }
        mask
    };
    ((UNQUOTED_OK_MASK >> c) & 1) != 0
}

/// Is this ASCII byte okay to emit in single quotes?
fn single_quoted_ok(c: u8) -> bool {
    match c {
        // No single quotes in single quotes.
        b'\'' => false,
        // To work around a Bash bug, ^ is only allowed right after an opening single quote; see
        // quoting_warning.
        b'^' => false,
        // Backslashes in single quotes are literal according to POSIX, but Fish treats them as an
        // escape character.  Ban them.  Fish doesn't aim to be POSIX-compatible, but we *can*
        // achieve Fish compatibility using double quotes, so we might as well.
        b'\\' => false,
        _ => true
    }
}

/// Is this ASCII byte okay to emit in double quotes?
fn double_quoted_ok(c: u8) -> bool {
    match c {
        // Work around Python `shlex` bug where parsing "\`" and "\$" doesn't strip the
        // backslash, even though POSIX requires it.
        b'`' | b'$' => false,
        // '!' and '^' are treated specially in interactive mode; see quoting_warning.
        b'!' | b'^' => false,
        _ => true
    }
}

/// Given an input, return a quoting strategy that can cover some prefix of the string, along with
/// the size of that prefix.
///
/// Precondition: input size is nonzero.  (Empty strings are handled by the caller.)
/// Postcondition: returned size is nonzero.
#[cfg_attr(manual_codegen_check, inline(never))]
fn quoting_strategy(in_bytes: &[u8]) -> (usize, QuotingStrategy) {
    const UNQUOTED_OK: u8 = 1;
    const SINGLE_QUOTED_OK: u8 = 2;
    const DOUBLE_QUOTED_OK: u8 = 4;

    let mut prev_ok = SINGLE_QUOTED_OK | DOUBLE_QUOTED_OK | UNQUOTED_OK;
    let mut i = 0;

    if in_bytes[0] == b'^' {
        // To work around a Bash bug, ^ is only allowed right after an opening single quote; see
        // quoting_warning.
        prev_ok = SINGLE_QUOTED_OK;
        i = 1;
    }

    while i < in_bytes.len() {
        let c = in_bytes[i];
        let mut cur_ok = prev_ok;

        if c >= 0x80 {
            // Normally, non-ASCII characters shouldn't require quoting, but see quoting_warning.md
            // about \xa0.  For now, just treat all non-ASCII characters as requiring quotes.  This
            // also ensures things are safe in the off-chance that you're in a legacy 8-bit locale that
            // has additional characters satisfying `isblank`.
            cur_ok &= !UNQUOTED_OK;
        } else {
            if !unquoted_ok_fast(c) {
                cur_ok &= !UNQUOTED_OK;
            }
            if !single_quoted_ok(c){
                cur_ok &= !SINGLE_QUOTED_OK;
            }
            if !double_quoted_ok(c) {
                cur_ok &= !DOUBLE_QUOTED_OK;
            }
        }

        if cur_ok == 0 {
            // There are no quoting strategies that would work for both the previous characters and
            // this one.  So we have to end the chunk before this character.  The caller will call
            // `quoting_strategy` again to handle the rest of the string.
            break;
        }

        prev_ok = cur_ok;
        i += 1;
    }

    // Pick the best allowed strategy.
    let strategy = if prev_ok & UNQUOTED_OK != 0 {
        QuotingStrategy::Unquoted
    } else if prev_ok & SINGLE_QUOTED_OK != 0 {
        QuotingStrategy::SingleQuoted
    } else if prev_ok & DOUBLE_QUOTED_OK != 0 {
        QuotingStrategy::DoubleQuoted
    } else {
        unreachable!()
    };
    debug_assert!(i > 0);
    (i, strategy)
}

fn append_quoted_chunk(out: &mut Vec<u8>, cur_chunk: &[u8], strategy: QuotingStrategy) {
    match strategy {
        QuotingStrategy::Unquoted => {
            out.extend_from_slice(cur_chunk);
        },
        QuotingStrategy::SingleQuoted => {
            out.reserve(cur_chunk.len() + 2);
            out.push(b'\'');
            out.extend_from_slice(cur_chunk);
            out.push(b'\'');
        },
        QuotingStrategy::DoubleQuoted => {
            out.reserve(cur_chunk.len() + 2);
            out.push(b'"');
            for &c in cur_chunk.iter() {
                if let b'$' | b'`' | b'"' | b'\\' = c {
                    // Add a preceding backslash.
                    // Note: We shouldn't actually get here for $ and ` because they don't pass
                    // `double_quoted_ok`.
                    out.push(b'\\');
                }
                // Add the character itself.
                out.push(c);
            }
            out.push(b'"');
        },
    }
}

Quoting machinery: chooses among Unquoted / SingleQuoted / DoubleQuoted strategies per chunk and emits escapes. The character classes (unquoted_ok, single_quoted_ok, double_quoted_ok) cite the POSIX shell spec and Bash/Zsh/Dash/Busybox/Mksh/Fish quirks; unquoted_ok_fast builds a const-fn u128 bitmask of allowed ASCII bytes for the hot path. The ^ and {/}/, handling fixes RUSTSEC-2024-0006. Supports parser-impl-correct: the rule set is matched to the documented compatibility targets in the crate-level doc comment.
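The bitmask technique used by unquoted_ok_fast can be sketched in isolation. The example below uses a hypothetical, much smaller allow-list (`allowed`, ASCII letters and digits only) rather than the crate's rule set, and adds an explicit range check that the crate does not need because it only calls its fast path for bytes below 0x80.

```rust
/// Hypothetical predicate for illustration (ASCII letters and digits only);
/// the crate's real allow-list is the `unquoted_ok` function quoted above.
const fn allowed(c: u8) -> bool {
    matches!(c, b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z')
}

/// One bit per ASCII byte, computed at compile time from `allowed`.
const ALLOWED_MASK: u128 = {
    let mut mask = 0u128;
    let mut c = 0u8;
    while c < 0x80 {
        if allowed(c) {
            mask |= 1u128 << c;
        }
        c += 1;
    }
    mask
};

/// Bit lookup replacing the branchy predicate. Bytes >= 0x80 are rejected
/// first because shifting a u128 by 128 or more would overflow.
fn allowed_fast(c: u8) -> bool {
    c < 0x80 && ((ALLOWED_MASK >> c) & 1) != 0
}

fn main() {
    // The mask agrees with the predicate on every ASCII byte.
    for c in 0u8..0x80 {
        assert_eq!(allowed_fast(c), allowed(c));
    }
    assert!(allowed_fast(b'a'));
    assert!(!allowed_fast(b'$'));
    assert!(!allowed_fast(0xa0));
}
```

Keeping the readable predicate as the single source of truth and deriving the mask from it means the two cannot drift apart.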

`src/lib.rs`

`src/lib.rs`, lines 80-88

impl Iterator for Shlex<'_> {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        self.0.next().map(|byte_word| {
            // Safety: given valid UTF-8, bytes::Shlex will always return valid UTF-8.
            unsafe { String::from_utf8_unchecked(byte_word) }
        })
    }
}

String::from_utf8_unchecked on the output of bytes::Shlex::next — soundness depends on the byte-level parser preserving UTF-8 validity. See FINDING-2 for the analysis that supports uses-unsafe and unsafe-safe.

`src/lib.rs`, lines 65-78

    /// # Safety
    ///
    /// The parameter must have been constructed from valid UTF-8.
    pub unsafe fn from_bytes(bytes: bytes::Shlex<'a>) -> Self {
        Self(bytes)
    }

    /// # Safety
    ///
    /// If the returned reference is reassigned, the new [`bytes::Shlex`] must have been constructed from valid UTF-8.
    pub unsafe fn as_bytes_mut(&mut self) -> &mut bytes::Shlex<'a> {
        &mut self.0
    }
}

Two pub unsafe fns with documented safety contracts requiring the caller to guarantee UTF-8 validity. Replaces the unsound DerefMut impl that existed before 2.0.0 (see CHANGELOG). Justifies unsafe-documented.
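The general shape of this design, a safe constructor that establishes the invariant plus an unsafe one that hands the proof obligation to the caller, can be sketched as follows. `Utf8Bytes` is a hypothetical type invented for illustration; it is not the crate's Shlex wrapper.

```rust
/// Hypothetical wrapper (not the crate's type) whose invariant is that the
/// inner bytes are always valid UTF-8.
struct Utf8Bytes<'a>(&'a [u8]);

impl<'a> Utf8Bytes<'a> {
    /// Safe constructor: a `&str` is valid UTF-8 by construction.
    fn new(s: &'a str) -> Self {
        Utf8Bytes(s.as_bytes())
    }

    /// # Safety
    ///
    /// `bytes` must be valid UTF-8; the caller takes over the proof obligation.
    unsafe fn from_bytes(bytes: &'a [u8]) -> Self {
        Utf8Bytes(bytes)
    }

    /// Relies on the invariant instead of re-validating the bytes.
    fn as_str(&self) -> &'a str {
        // Safety: both constructors guarantee the bytes are valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(self.0) }
    }
}

fn main() {
    assert_eq!(Utf8Bytes::new("héllo").as_str(), "héllo");
    // Safety: the literal is ASCII, hence valid UTF-8.
    let w = unsafe { Utf8Bytes::from_bytes(b"abc") };
    assert_eq!(w.as_str(), "abc");
}
```

Marking the escape hatch `unsafe` keeps every place where the invariant could be broken visible to a plain-text search, which is what made the token count in this audit meaningful.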

`src/quoting_warning.md`

Detailed threat-model documentation — nul bytes, overlong commands, control-character injection into interactive shells, and the past RUSTSEC-2024-0006 / GHSA-r7qv-8r2h-pg27 vulnerability that was fixed in 1.2.1 / 1.3.0. Loaded as a doc module only (#[cfg(all(doc, not(doctest)))] pub mod quoting_warning;).