Detect control characters, quotes and backslashes efficiently using 'SWAR'

Detect control characters, quotes and backslashes efficiently using 'SWAR'

When trying to write fast functions operating over many bytes, we sometimes use 'SWAR'. SWAR stands for SIMD Within A Register, and it is a technique to perform parallel computations on multiple data elements packed into a single word. It treats a register (e.g., 64-bit) as a vector of smaller units (e.g., 8-bit bytes) and processes them simultaneously without explicit SIMD instructions.

For example, if you have 8 bytes in a 64-bit word x, computing (x - 0x2020202020202020) subtracts 32 (0x20 in hex) from each byte. It works well if each byte value is greater than or equal to 32. Otherwise, the operation 'overflows': if the least significant byte is smaller than 32, then its most significant bit is set, and you cannot rely on the other more significant byte results.

It works well if all bytes are 'ASCII' (they have their most significant bit unset). When that is the case, you can check if any byte value is less than 32 by computing (x - 0x2020202020202020) and checking if any most significant bit is set, by masking the result: (x - 0x2020202020202020) & 0x8080808080808080.

Similarly,  you can check whether the byte value 34 (0x22) appears simply by first doing an XOR which sets any such byte value to zero, (x ^ 0x2222222222222222) and then subtracting each byte value by 1: (x ^ 0x2222222222222222) - 0x0101010101010101. It then suffices to masking with 0x80...80 again to see if any  byte value was set to 34, as long as all input byte values were ASCII.

Thankfully, it is easy to select only the byte values that are ASCII: (0x8080808080808080 & ~x) will do.

In JSON and other languages, you need to escape within strings all character values smaller than 32 (control characters in ASCII), equal to 34 ('"' in ASCII) or 92 ("\" in ASCII). So you want to quickly check if a byte is smaller than 32, equal to 34 or 92. With SWAR, you can do it with about ten operations:

bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t lt32 =
        (x - 0x2020202020202020ULL);
    uint64_t sub34 = x ^ 0x2222222222222222ULL;
    uint64_t eq34 = (sub34 - 0x0101010101010101ULL);
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32 | eq34 | eq92) & is_ascii) != 0;
}
        

That's slightly more than one operation per input byte.

The following routine is even better (credit to Falk Hüffner):

bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t xor2 = x ^ 0x0202020202020202ULL;
    uint64_t lt32_or_eq34 = xor2 – 0x2121212121212121ULL;
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32_or_eq34 | eq92) & is_ascii) != 0;
}        

You might be able to do even better.

To view or add a comment, sign in

More articles by Daniel Lemire

Insights from the community

Others also viewed

Explore topics