I've put a few performance notes in comments. Specifically, I'm curious whether an inline function that expands to 128+ bytes like this should be wrapped in a function marked __attribute__((flatten)) __attribute__((noinline)). The idea is that flatten forces the full expansion to happen in one place, while noinline prevents that wrapper from being inlined into its callers, which should keep the generated code size down.
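To make the question concrete, here is a minimal sketch of the wrapper pattern I have in mind. The names `rotl32`, `mix_block`, and `mix_block_outlined` are hypothetical stand-ins, not the actual code under review, and the mixing rounds are arbitrary filler chosen only to give the inline function a sizeable body. It assumes GCC or Clang, since both attributes are compiler extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical small helper; rotation counts used below are all in 1..31. */
static inline uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32u - r));
}

/* Hypothetical stand-in for the large inline function: once the loop is
 * unrolled and rotl32 is inlined, this can expand to well over 128 bytes. */
static inline void mix_block(uint32_t s[4])
{
    for (int i = 0; i < 8; i++) {
        s[0] += s[1]; s[3] = rotl32(s[3] ^ s[0], 16);
        s[2] += s[3]; s[1] = rotl32(s[1] ^ s[2], 12);
        s[0] += s[1]; s[3] = rotl32(s[3] ^ s[0], 8);
        s[2] += s[3]; s[1] = rotl32(s[1] ^ s[2], 7);
    }
}

/* flatten:  ask the compiler to inline every call made inside this body,
 *           where possible, so the expansion happens here.
 * noinline: keep this function out of line, so callers share one copy. */
__attribute__((flatten, noinline))
void mix_block_outlined(uint32_t s[4])
{
    assert(s != NULL);
    mix_block(s);
}
```

Hot paths that can afford the extra code size would still call `mix_block` directly, while everything else would go through `mix_block_outlined`. Whether that trade-off actually pays off is what I'd like feedback on. I would expect it to depend on the call overhead relative to the body size, so it probably needs measuring rather than guessing.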