Audit · encoding_rs@0.8.35

cargo : encoding_rs @ 0.8.35

Patrick Elsen, signed 2026-05-28, published 2026-05-28

Claims

algorithm-impl-tested, has-binaries, has-build-exec, has-fuzz-tests, has-install-exec, has-integration-tests, has-property-tests, has-unit-tests, impl-algorithm, impl-concurrency, impl-crypto, impl-datastructure, impl-interpreter, impl-jit, impl-parser, impl-protocol, is-benign, parser-impl-safe, uses-concurrency, uses-crypto, uses-environment, uses-exec, uses-filesystem, uses-interpreter, uses-jit, uses-network, uses-unsafe

Summary

encoding_rs 0.8.35 implements the WHATWG Encoding Standard (UTF-8/16, legacy CJK and single-byte decoders) with optional SIMD-accelerated conversion loops. The published contents are byte-equivalent to the VCS tree, and the crate performs no I/O and no build-time execution. One low-severity quality finding: per-block unsafe SAFETY documentation is uneven. Exhaustive review of all 271 unsafe blocks and full WHATWG-conformance verification were scoped out.

Report

Subject

encoding_rs 0.8.35 is Henri Sivonen's Gecko-oriented implementation of the WHATWG Encoding Standard, the character-encoding conversion library used by Firefox and much of the Rust HTTP/HTML ecosystem. It decodes and encodes UTF-8, UTF-16LE/BE, and the legacy single-byte and CJK encodings (Big5, EUC-JP, EUC-KR, GBK/GB18030, Shift_JIS, ISO-2022-JP, x-user-defined, and the windows/ISO single-byte families). The public API exposes the streaming Decoder/Encoder state machines, the Encoding label registry, and the mem module of helpers for in-RAM representation conversion and Latin1/UTF-16 bidi checks. SIMD acceleration is available behind the off-by-default simd-accel feature.

Methodology

Tools: openvet 0.6.0, ripgrep, grep, awk, diff. I verified VCS byte-equivalence with diff -rq contents vcs: the only difference is the synthetic .cargo_vcs_info.json, so contents/ matches the published VCS tree. I read the manifests (Cargo.toml, Cargo.toml.orig), the CI script, and surveyed the source for I/O, FFI, crypto, RNG, and concurrency. I read the SIMD/unsafe conversion core: simd_funcs.rs in full, the unsafe macros and ASCII fast paths in ascii.rs, the converter write machinery in handles.rs including its file-level safety note, and the headers and representative unsafe blocks in mem.rs, utf_8.rs, and lib.rs. The large generated index tables in data.rs (~2.5 MB) were treated as data. Roughly 6-8K lines of hand-written core were read in detail.

Scope. This is a large crate (~138K LOC including generated tables, 271 unsafe occurrences across ~11 files, much of it macro-expanded and SIMD-feature-gated). The following claims were not evaluated and are left unasserted; they must not be read as either satisfied or violated: !unsafe-safe, !unsafe-documented, !unsafe-minimal, !algorithm-impl-correct, and !parser-impl-correct. Full WHATWG-conformance verification and exhaustive review of every SIMD unsafe block are out of scope. This audit verifies supply-chain integrity, the capability surface, build/install execution, test presence, the implementation categorization, and the representative unsafe I read.

Results

contents/ is byte-equivalent to the VCS tree apart from the expected .cargo_vcs_info.json. The crate is pure computation: a source-wide search found no network, filesystem, process, or environment access (!uses-network, !uses-filesystem, !uses-exec, !uses-environment), no cryptography or RNG (!uses-crypto), no JIT or interpreter (!uses-jit, !uses-interpreter), and no concurrency primitives, unsafe impl Send/Sync, or atomics (!uses-concurrency). The manifest sets build = false with no proc-macro = true, so nothing runs at build or install time (!has-build-exec, !has-install-exec), and the published tree carries no compiled artifacts (!has-binaries); test_data/ holds plain-text encoding fixtures and data.rs holds generated lookup tables. The one extern block is a "platform-intrinsic" compiler intrinsic, not C FFI.

The crate decodes byte streams into Unicode and is therefore a parser (!impl-parser) and implements the WHATWG conversion algorithms (!impl-algorithm); it does not implement cryptography, an interpreter, a JIT, a network protocol, a general data structure, or concurrency primitives (!impl-crypto, !impl-interpreter, !impl-jit, !impl-protocol, !impl-datastructure, !impl-concurrency). The decoders handle malformed input per the Encoding Standard by emitting U+FFFD replacement characters rather than invoking undefined behavior; the representative decode paths I read keep their raw-pointer writes within capacity reserved by the converter handles, supporting !parser-impl-safe. Testing is extensive: 388 #[test] functions inline in src/ plus the test_data/ round-trip fixtures (!has-unit-tests, !algorithm-impl-tested). There is no tests/ directory, no fuzz/ targets, and no proptest/quickcheck usage (!has-integration-tests, !has-fuzz-tests, !has-property-tests).
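The replace-on-error behavior described above can be pictured without the crate itself. The following is a minimal sketch that uses the standard library's `String::from_utf8_lossy` as a stand-in; it is not encoding_rs's decoder, and the helper name `decode_lossy` is hypothetical. It only illustrates the general contract: malformed bytes become U+FFFD and the caller learns that a replacement happened.

```rust
use std::borrow::Cow;

/// Decode bytes as UTF-8, replacing malformed sequences with U+FFFD.
/// Hypothetical helper built on std only; it does not call encoding_rs.
fn decode_lossy(bytes: &[u8]) -> (String, bool) {
    let decoded = String::from_utf8_lossy(bytes);
    // from_utf8_lossy borrows valid input and allocates only when it had
    // to insert at least one replacement character.
    let had_errors = matches!(decoded, Cow::Owned(_));
    (decoded.into_owned(), had_errors)
}

fn main() {
    // 0xFF can never appear in well-formed UTF-8.
    let (text, had_errors) = decode_lossy(b"ab\xFFcd");
    println!("{:?} had_errors={}", text, had_errors);
}
```

The point of the sketch is that malformed input is an ordinary, reported outcome rather than a condition that reaches any unchecked memory access.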

The unsafe surface (!uses-unsafe) is real and performance-motivated: ascii.rs and single_byte.rs use macro-generated pub unsafe fn fast paths that elide bounds checks over src.add(i)/dst.add(i), handles.rs writes through dst.add(self.pos) after reserving space, and simd_funcs.rs reinterprets byte pointers as SIMD vectors and transmutes between same-size vector types. The representative blocks read are consistent with their stated contracts. One low-severity quality finding records that per-block // SAFETY: documentation is uneven: handles.rs and mem.rs document their contracts while ascii.rs (which says so explicitly at lines 12-14), single_byte.rs, lib.rs, and macros.rs carry none. No obfuscation, telemetry, base64 blobs, include_bytes!, or suspicious endpoints were found (!is-benign). Dependencies are minimal: cfg-if (always on) and the optional any_all_workaround, packed_simd, and serde.

Conclusion

The audit found one low-severity quality finding (uneven per-block unsafe documentation) and no security, safety, or correctness defects in the code that was read. encoding_rs 0.8.35 is byte-equivalent to its VCS tree, performs no I/O and no build- or install-time execution, has no concurrency or FFI to external libraries, and ships 388 inline tests plus round-trip fixtures. The 271 unsafe occurrences are concentrated in SIMD primitives and the bounds-check-eliding conversion fast paths; exhaustive verification of every block and full WHATWG-conformance checking were scoped out and left unasserted.

Findings (1)

FINDING-1 quality low

Uneven per-block SAFETY documentation across the unsafe surface

The crate contains 271 unsafe occurrences across roughly 11 source files, but per-block // SAFETY: documentation is uneven. handles.rs opens with a file-level SAFETY note (lines 10-14) stating that its bounds-check elision relies on the converter contract, and mem.rs carries 12 inline safety comments. By contrast ascii.rs, single_byte.rs, lib.rs, and macros.rs carry none. ascii.rs states the omission is deliberate (lines 12-14): "It would be nice to manually provide // SAFETY justifications for the unsafe invocations in this file, but the file is too macro-heavy for that to be practical."

The unsafe in these files is real pointer arithmetic with deliberately omitted bounds checks for performance: in ascii.rs the ascii_naive and ascii_alu macros read *(src.add(i)) and write *(dst.add(i)) across 0..len with no per-iteration bound check, pushing the length-validity contract onto callers (pub unsafe fn). In handles.rs, destination writers such as write_two store to dst.add(self.pos) and dst.add(self.pos + 1) then advance self.pos, relying on a prior check_space_* having reserved capacity.

This is a documentation/quality observation, not a demonstrated memory error. The representative blocks read (SIMD loads/stores and transmutes in simd_funcs.rs, the ALU/SIMD ASCII strides in ascii.rs, and the dst.add(pos) writes in handles.rs) are consistent with their stated contracts. Exhaustive verification of all 271 blocks was not performed (see the report Methodology scope note).
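To make the documentation gap concrete, here is a minimal sketch of what a fast path of this shape looks like with a `# Safety` contract and per-block `// SAFETY:` comments. The names `ascii_copy_naive` and `ascii_copy` are hypothetical and the body is simplified; it is not the crate's macro-generated code, only an illustration of the caller-owned length contract the finding describes.

```rust
/// Copies bytes from `src` to `dst` until a non-ASCII byte is found.
/// Returns `None` if all `len` bytes were ASCII, otherwise the first
/// non-ASCII byte and its index.
///
/// # Safety
/// `src` must be valid for reads of `len` bytes, `dst` must be valid for
/// writes of `len` bytes, and the two regions must not overlap.
unsafe fn ascii_copy_naive(src: *const u8, dst: *mut u8, len: usize) -> Option<(u8, usize)> {
    let mut i = 0;
    while i < len {
        // SAFETY: i < len, and the caller guarantees `len` readable bytes.
        let b = unsafe { *src.add(i) };
        if b > 0x7F {
            return Some((b, i));
        }
        // SAFETY: i < len, and the caller guarantees `len` writable bytes.
        unsafe { *dst.add(i) = b };
        i += 1;
    }
    None
}

/// Safe wrapper: the slice lengths establish the pointer contract.
fn ascii_copy(src: &[u8], dst: &mut [u8]) -> Option<(u8, usize)> {
    let len = src.len().min(dst.len());
    // SAFETY: both pointers come from live slices of at least `len` bytes,
    // and a shared and an exclusive slice cannot overlap.
    unsafe { ascii_copy_naive(src.as_ptr(), dst.as_mut_ptr(), len) }
}

fn main() {
    let mut out = [0u8; 8];
    let stop = ascii_copy(b"abc\xE9def", &mut out);
    println!("{:?} {:?}", stop, &out[..3]);
}
```

In macro-generated code the same comments would have to live inside the macro body, which is the practicality problem upstream cites.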

Justifies uses-unsafe.

Annotations (4)

`Cargo.toml`

build = false and no proc-macro = true: the crate runs no code at build or install time. Supports has-build-exec and has-install-exec. Default features are ["alloc"] only; simd-accel (which adds packed_simd and any_all_workaround) and serde are off by default. The only non-optional dependency is cfg-if. Supports the absence of network, filesystem, environment, and crypto usage (uses-network, uses-filesystem, uses-environment, uses-crypto).

`src/ascii.rs`

`src/ascii.rs`, lines 12-14

// x86_64 will always use SSE2 and 32-bit x86 will use SSE2 when compiled with
// a Mozilla-shipped rustc. SIMD support and especially detection on ARM is a
// mess. Under the circumstances, it seems to make sense to optimize the ALU

The ASCII fast-path macros (ascii_naive, ascii_alu, and the SIMD variants) are exposed as pub unsafe fn and read/write through raw pointers with bounds checks deliberately omitted for performance. The length-validity invariant is the caller's responsibility. Upstream notes here (lines 12-14) that per-block // SAFETY: comments are omitted because the file is macro-heavy; see the quality finding. Supports uses-unsafe.

`src/handles.rs`

`src/handles.rs`, lines 10-14

//! This module provides structs that use lifetimes to couple bounds checking
//! and space availability checking and detaching those from actual slice
//! reading/writing.
//!
//! At present, the internals of the implementation are safe code, so the

File-level safety contract for the converter write handles. The destination writers in this file elide per-write bounds checks for performance and rely on the converter machinery having reserved space via a prior check_space_* call before a write handle is produced. Representative writers (e.g. write_two) store to dst.add(self.pos) and dst.add(self.pos + 1), then advance self.pos; the reserved capacity makes those offsets in-bounds. The file uses 49 get_unchecked accesses under this contract. Supports uses-unsafe and parser-impl-safe.
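The reserve-then-write contract described above can be sketched as follows. This is a simplified illustration under stated assumptions: the types `Dest` and `TwoHandle` are hypothetical, and only the method names `check_space_two` and `write_two` echo the `check_space_*` and `write_two` names mentioned in this annotation. The idea shown is that a handle can only be obtained through the space check, and it borrows the destination exclusively so the cursor cannot move between the check and the write.

```rust
/// Destination buffer with a write cursor (hypothetical, simplified).
struct Dest<'a> {
    buf: &'a mut [u8],
    pos: usize, // invariant: pos <= buf.len()
}

/// Proof that two bytes of space were reserved. It holds an exclusive
/// borrow of the destination, so `pos` cannot change while it exists.
struct TwoHandle<'a, 'b> {
    dest: &'b mut Dest<'a>,
}

impl<'a> Dest<'a> {
    fn new(buf: &'a mut [u8]) -> Self {
        Dest { buf, pos: 0 }
    }

    /// The only bounds check: yields a handle only if two more bytes fit.
    fn check_space_two<'b>(&'b mut self) -> Option<TwoHandle<'a, 'b>> {
        if self.buf.len() - self.pos >= 2 {
            Some(TwoHandle { dest: self })
        } else {
            None
        }
    }
}

impl<'a, 'b> TwoHandle<'a, 'b> {
    /// Consumes the handle, so each reservation permits exactly one write.
    fn write_two(self, first: u8, second: u8) {
        let dest = self.dest;
        let pos = dest.pos;
        let dst = dest.buf.as_mut_ptr();
        // SAFETY: check_space_two verified pos + 2 <= buf.len(), and the
        // exclusive borrow kept `pos` unchanged since that check.
        unsafe {
            *dst.add(pos) = first;
            *dst.add(pos + 1) = second;
        }
        dest.pos = pos + 2;
    }
}

fn main() {
    let mut storage = [0u8; 3];
    let mut dest = Dest::new(&mut storage);
    if let Some(handle) = dest.check_space_two() {
        handle.write_two(0xC3, 0xA9);
    }
    // Only one byte remains, so a second reservation is refused.
    let refused = dest.check_space_two().is_none();
    println!("pos={} refused={}", dest.pos, refused);
}
```

With this shape, the soundness argument for each unchecked write reduces to one question: can a handle exist without a successful space check? Here the private field and the single constructor path answer it.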

`src/simd_funcs.rs`

`src/simd_funcs.rs`, lines 22-100


// TODO: Migrate unaligned access to stdlib code if/when the RFC
// https://github.com/rust-lang/rfcs/pull/1725 is implemented.

/// Safety invariant: ptr must be valid for an unaligned read of 16 bytes
#[inline(always)]
pub unsafe fn load16_unaligned(ptr: *const u8) -> u8x16 {
    let mut simd = ::core::mem::MaybeUninit::<u8x16>::uninit();
    ::core::ptr::copy_nonoverlapping(ptr, simd.as_mut_ptr() as *mut u8, 16);
    // Safety: copied 16 bytes of initialized memory into this, it is now initialized
    simd.assume_init()
}

/// Safety invariant: ptr must be valid for an aligned-for-u8x16 read of 16 bytes
#[allow(dead_code)]
#[inline(always)]
pub unsafe fn load16_aligned(ptr: *const u8) -> u8x16 {
    *(ptr as *const u8x16)
}

/// Safety invariant: ptr must be valid for an unaligned store of 16 bytes
#[inline(always)]
pub unsafe fn store16_unaligned(ptr: *mut u8, s: u8x16) {
    ::core::ptr::copy_nonoverlapping(&s as *const u8x16 as *const u8, ptr, 16);
}

/// Safety invariant: ptr must be valid for an aligned-for-u8x16 store of 16 bytes
#[allow(dead_code)]
#[inline(always)]
pub unsafe fn store16_aligned(ptr: *mut u8, s: u8x16) {
    *(ptr as *mut u8x16) = s;
}

/// Safety invariant: ptr must be valid for an unaligned read of 16 bytes
#[inline(always)]
pub unsafe fn load8_unaligned(ptr: *const u16) -> u16x8 {
    let mut simd = ::core::mem::MaybeUninit::<u16x8>::uninit();
    ::core::ptr::copy_nonoverlapping(ptr as *const u8, simd.as_mut_ptr() as *mut u8, 16);
    // Safety: copied 16 bytes of initialized memory into this, it is now initialized
    simd.assume_init()
}

/// Safety invariant: ptr must be valid for an aligned-for-u16x8 read of 16 bytes
#[allow(dead_code)]
#[inline(always)]
pub unsafe fn load8_aligned(ptr: *const u16) -> u16x8 {
    *(ptr as *const u16x8)
}

/// Safety invariant: ptr must be valid for an unaligned store of 16 bytes
#[inline(always)]
pub unsafe fn store8_unaligned(ptr: *mut u16, s: u16x8) {
    ::core::ptr::copy_nonoverlapping(&s as *const u16x8 as *const u8, ptr as *mut u8, 16);
}

/// Safety invariant: ptr must be valid for an aligned-for-u16x8 store of 16 bytes
#[allow(dead_code)]
#[inline(always)]
pub unsafe fn store8_aligned(ptr: *mut u16, s: u16x8) {
    *(ptr as *mut u16x8) = s;
}

cfg_if! {
    if #[cfg(all(target_feature = "sse2", target_arch = "x86_64"))] {
        use core::arch::x86_64::__m128i;
        use core::arch::x86_64::_mm_movemask_epi8;
        use core::arch::x86_64::_mm_packus_epi16;
    } else if #[cfg(all(target_feature = "sse2", target_arch = "x86"))] {
        use core::arch::x86::__m128i;
        use core::arch::x86::_mm_movemask_epi8;
        use core::arch::x86::_mm_packus_epi16;
    } else if #[cfg(target_arch = "aarch64")]{
        use core::arch::aarch64::vmaxvq_u8;
        use core::arch::aarch64::vmaxvq_u16;
    } else {

    }
}

SIMD primitive layer, compiled only under the simd-accel feature (not in the default build). The unsafe here covers three patterns: raw-pointer loads and stores, same-size vector transmutes (to_u16x8, to_u8x16, to_i16x8), and simd_shuffle16 plus the lane-mask reductions (all_mask8x16, any_mask16x8). In the quoted excerpt the unaligned variants (load16_unaligned, store16_unaligned) move 16 bytes with copy_nonoverlapping, while the aligned variants (load16_aligned, store16_aligned) cast the byte pointer to a u8x16 pointer and dereference it. The transmutes are between SIMD vectors of identical size and layout. The extern "platform-intrinsic" block declares a compiler intrinsic, not C FFI. Supports uses-unsafe.
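As an illustration of how a caller can discharge the "valid for an unaligned read of 16 bytes" invariant quoted above, here is a minimal std-only sketch. It assumes a plain `[u8; 16]` as a stand-in for `u8x16` so that it builds without the simd-accel dependencies, and the safe wrapper `load16_at` is hypothetical rather than part of the crate.

```rust
use core::mem::MaybeUninit;
use core::ptr;

/// Stand-in for a 16-lane byte vector; the quoted code uses `u8x16`.
type Bytes16 = [u8; 16];

/// # Safety
/// `ptr` must be valid for an unaligned read of 16 bytes.
unsafe fn load16_unaligned(ptr: *const u8) -> Bytes16 {
    let mut out = MaybeUninit::<Bytes16>::uninit();
    // SAFETY: the caller guarantees 16 readable bytes at `ptr`; `out` is a
    // fresh local, so source and destination cannot overlap. After the
    // copy all 16 bytes are written, so the value is initialized.
    unsafe {
        ptr::copy_nonoverlapping(ptr, out.as_mut_ptr() as *mut u8, 16);
        out.assume_init()
    }
}

/// Safe entry point (hypothetical): the length check establishes the
/// pointer contract before the unsafe call.
fn load16_at(src: &[u8], offset: usize) -> Option<Bytes16> {
    let end = offset.checked_add(16)?;
    if end > src.len() {
        return None;
    }
    // SAFETY: offset + 16 <= src.len(), so 16 bytes are readable there.
    Some(unsafe { load16_unaligned(src.as_ptr().add(offset)) })
}

fn main() {
    let data: Vec<u8> = (0u8..20).collect();
    println!("{:?}", load16_at(&data, 3));
    println!("{:?}", load16_at(&data, 5));
}
```

The second call asks for bytes 5 through 20 of a 20-byte buffer, so the wrapper refuses it instead of reading past the end.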