cargo : encoding_rs @ 0.8.35
PE Patrick Elsen signed 2026-05-28 published 2026-05-28

Ideas.md

This document contains notes about various ideas that for one reason or another
are not being actively pursued.

## Next byte is non-ASCII after ASCII optimization

The current plan for a SIMD-accelerated inner loop for handling ASCII bytes
makes no use of the bit of information that if the buffers didn't end but the
ASCII loop exited, the next byte will not be an ASCII byte.

## Handling ASCII with table lookups when decoding single-byte to UTF-16

Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16.
uconv doesn't even do anything fancy to manually unroll the loop (see below).
Both handle even the ASCII range using table lookup. That is, there's no branch
for checking if we're in the lower or upper half of the encoding.

However, adding SIMD acceleration for the ASCII half will likely be a bigger
win than eliminating the branch to decide ASCII vs. non-ASCII.

## Manual loop unrolling for single-byte encodings

ICU currently outperforms encoding_rs (by over x2!) when decoding a single-byte
encoding to UTF-16. This appears to be thanks to manually unrolling the
conversion loop by 16. See [ucnv_MBCSSingleToBMPWithOffsets][1].

[1]: https://ssl.icu-project.org/repos/icu/icu/tags/release-55-1/source/common/ucnvmbcs.cpp

Notably, none of the single-byte encodings have bytes that'd decode to the
upper half of BMP.
Therefore, if the unmappable marker has the highest bit set
instead of being zero, the check for unmappables within a 16-character stride
can be done either by ORing the BMP characters in the stride together and
checking the high bit or by loading the upper halves of the BMP characters
in a `u8x8` register and checking the high bits using the `_mm_movemask_epi8`
/ `pmovmskb` SSE2 instruction.

## After non-ASCII, handle ASCII punctuation without SIMD

Since the failure mode of SIMD ASCII acceleration involves wasted alignment
checks and a wasted SIMD read when the next code unit is non-ASCII and non-Latin
scripts have runs of non-ASCII even if ASCII spaces and punctuation are used,
consider handling the next two or three bytes following non-ASCII as non-SIMD
before looping back to the SIMD mode. Maybe move back to SIMD ASCII faster if
there's ASCII that's not space or punctuation. Maybe with the "space or
punctuation" check in place, this code can be allowed to be in place even for
UTF-8 and Latin single-byte (i.e. not having different code for Latin and
non-Latin single-byte).

## Prefer maintaining alignment

Instead of returning to acceleration directly after non-ASCII, consider
continuing to the alignment boundary without acceleration.

## Read from SIMD lanes instead of RAM (cache) when ASCII check fails

When the SIMD ASCII check fails, the data has already been read from memory.
Test whether it's faster to read the data by lane from the SIMD register than
to read it again from RAM (cache).

## Use Level 2 Hanzi and Level 2 Kanji ordering

These two are ordered by radical and then by stroke count, so in principle,
they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't
fully Unicode-ordered.
Is "mostly" good enough for encode acceleration?

## Create a `divmod_94()` function

Experiment with a function that computes `(i / 94, i % 94)` more efficiently
than generic code.

## Align writes on Aarch64

On [Cortex-A57](https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64/45938112#45938112), it might be a good idea to move the destination into 16-byte alignment.

## Unalign UTF-8 validation on Aarch64

Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns
reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!)

## Table-driven UTF-8 validation

When there are at least four bytes left, read all four. With each byte,
index into tables corresponding to magic values indexable by byte in
each position.

In the value read from the table indexed by lead byte, encode the
following in 16 bits: advance 2 bits (2, 3 or 4 bytes), 9 positional
bits one of which is set to indicate the type of lead byte (8 valid
types, in the 8 lowest bits, and invalid, ASCII would be tenth type),
and the mask for extracting the payload bits from the lead byte
(for conversion to UTF-16 or UTF-32).

In the tables indexable by the trail bytes, in each bit position
corresponding to a lead byte type, store 1 if the trail is
invalid given the lead and 0 if it is valid given the lead.

Use the low 8 bits of the 16 bits read from the first
table to mask (bitwise AND) one positional bit from each of the
three other values. Bitwise OR the results together with the
bit that is 1 if the lead is invalid. If the result is zero,
the sequence is valid. Otherwise it's invalid.

Use the advance to advance. In the conversion to UTF-16 or
UTF-32 case, use the mask for extracting the meaningful
bits from the lead byte to mask them from the lead. Shift
left by 6 as many times as the advance indicates, etc.
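
As a rough illustration of the validation part of this idea, here is a minimal sketch. It is not code from encoding_rs, and everything named in it (`LEAD_TYPES`, `Tables`, `build_tables`, `is_valid_utf8`, `INVALID_LEAD`, `ADVANCE_SHIFT`) is hypothetical. The bit layout is an assumption: lead type in bits 0 to 7, the invalid flag in bit 8, and the advance minus one in bits 9 and 10. The payload mask is left out because the sketch only validates. ASCII is handled by a plain branch instead of a tenth type, and a tail of fewer than four bytes is handed to `std::str::from_utf8`.

```rust
/// Bit 8 of a lead entry: set when the byte cannot start a multi-byte sequence.
const INVALID_LEAD: u16 = 1 << 8;
/// Bits 9 and 10 of a lead entry hold the sequence length minus one (assumed layout).
const ADVANCE_SHIFT: u32 = 9;

/// The eight valid lead byte types:
/// (first lead, last lead, sequence length, lowest and highest valid second byte).
const LEAD_TYPES: [(u8, u8, usize, u8, u8); 8] = [
    (0xC2, 0xDF, 2, 0x80, 0xBF),
    (0xE0, 0xE0, 3, 0xA0, 0xBF),
    (0xE1, 0xEC, 3, 0x80, 0xBF),
    (0xED, 0xED, 3, 0x80, 0x9F),
    (0xEE, 0xEF, 3, 0x80, 0xBF),
    (0xF0, 0xF0, 4, 0x90, 0xBF),
    (0xF1, 0xF3, 4, 0x80, 0xBF),
    (0xF4, 0xF4, 4, 0x80, 0x8F),
];

struct Tables {
    /// Indexed by lead byte.
    lead: [u16; 256],
    /// Indexed by trail position (second, third, fourth byte), then by byte.
    /// Bit `ty` is 1 if the byte is invalid in that position for lead type `ty`.
    trail: [[u8; 256]; 3],
}

fn build_tables() -> Tables {
    // Every byte starts out as an invalid lead. ASCII bytes stay that way
    // because the validator never looks them up.
    let mut t = Tables { lead: [INVALID_LEAD; 256], trail: [[0; 256]; 3] };
    for (ty, &(first, last, len, lo2, hi2)) in LEAD_TYPES.iter().enumerate() {
        for b in first..=last {
            t.lead[b as usize] = (1u16 << ty) | ((len as u16 - 1) << ADVANCE_SHIFT);
        }
        for b in 0..=255u8 {
            // Positions past the end of the sequence keep a 0 bit, so any
            // byte is accepted there.
            for pos in 1..len {
                let (lo, hi) = if pos == 1 { (lo2, hi2) } else { (0x80, 0xBF) };
                if b < lo || b > hi {
                    t.trail[pos - 1][b as usize] |= 1u8 << ty;
                }
            }
        }
    }
    t
}

fn is_valid_utf8(t: &Tables, bytes: &[u8]) -> bool {
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] < 0x80 {
            // ASCII: handled with a plain branch in this sketch.
            i += 1;
            continue;
        }
        if bytes.len() - i < 4 {
            // Fewer than four bytes left: defer to the standard library.
            return std::str::from_utf8(&bytes[i..]).is_ok();
        }
        let lead = t.lead[bytes[i] as usize];
        let types = lead as u8; // low 8 bits: the positional lead type bit
        let bad = (lead & INVALID_LEAD)
            | u16::from(types & t.trail[0][bytes[i + 1] as usize])
            | u16::from(types & t.trail[1][bytes[i + 2] as usize])
            | u16::from(types & t.trail[2][bytes[i + 3] as usize]);
        if bad != 0 {
            return false;
        }
        i += usize::from((lead >> ADVANCE_SHIFT) & 0b11) + 1;
    }
    true
}
```

Build the tables once with `build_tables()` and pass them to `is_valid_utf8(&tables, input)`. Real code would more likely use `static` tables generated at build time. Whether this beats the existing validation code is exactly the open question of this note.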