* [PATCH v3 0/3] lib/crypto: x86/sha: Add PHE Extensions support
@ 2026-01-16 7:15 AlanSong-oc
2026-01-16 7:15 ` [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor AlanSong-oc
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: AlanSong-oc @ 2026-01-16 7:15 UTC (permalink / raw)
To: herbert, davem, ebiggers, Jason, ardb, linux-crypto, linux-kernel,
x86
Cc: CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu,
AlanSong-oc
This series adds support for PHE Extensions optimized SHA1 and SHA256
transform functions for Zhaoxin processors in lib/crypto, and disables
the padlock-sha driver on Zhaoxin platforms due to self-test failures.
The table below shows benchmark results before and after this patch
series, measured with CRYPTO_LIB_BENCHMARK on a Zhaoxin KX-7000
platform, highlighting the achieved speedups.
+---------+-------------------------+--------------------------+
| | SHA1 | SHA256 |
+---------+--------+----------------+--------+-----------------+
| Len | Before | After | Before | After |
+---------+--------+----------------+--------+-----------------+
| 1* | 3** | 8 (2.67x) | 2 | 7 (3.50x) |
| 16 | 52 | 125 (2.40x) | 35 | 119 (3.40x) |
| 64 | 114 | 318 (2.79x) | 74 | 280 (3.78x) |
| 127 | 154 | 440 (2.86x) | 99 | 387 (3.91x) |
| 128 | 160 | 492 (3.08x) | 103 | 427 (4.15x) |
| 200 | 189 | 605 (3.20x) | 123 | 537 (4.37x) |
| 256 | 199 | 676 (3.40x) | 128 | 582 (4.55x) |
| 511 | 223 | 794 (3.56x) | 144 | 679 (4.72x) |
| 512 | 225 | 833 (3.70x) | 146 | 714 (4.89x) |
| 1024 | 243 | 941 (3.87x) | 157 | 796 (5.07x) |
| 3173 | 259 | 1044 (4.03x) | 167 | 883 (5.28x) |
| 4096 | 257 | 1044 (4.06x) | 166 | 876 (5.28x) |
| 16384 | 261 | 1073 (4.11x) | 169 | 899 (5.32x) |
+---------+--------+----------------+--------+-----------------+
*: The length in bytes of each data block processed by one complete SHA
sequence.
**: The throughput of processing data blocks, in MB/s.
After applying this patch series, the KUnit test suites for SHA1 and
SHA256 pass successfully on Zhaoxin platforms. The following shows the
detailed test logs:
[ 5.993700] # Subtest: sha1
[ 5.996813] # module: sha1_kunit
[ 5.996814] 1..11
[ 6.003399] ok 1 test_hash_test_vectors
[ 6.012489] ok 2 test_hash_all_lens_up_to_4096
[ 6.028511] ok 3 test_hash_incremental_updates
[ 6.035766] ok 4 test_hash_buffer_overruns
[ 6.043445] ok 5 test_hash_overlaps
[ 6.050315] ok 6 test_hash_alignment_consistency
[ 6.054994] ok 7 test_hash_ctx_zeroization
[ 6.127778] ok 8 test_hash_interrupt_context_1
[ 6.774847] ok 9 test_hash_interrupt_context_2
[ 6.810745] ok 10 test_hmac
[ 6.835169] # benchmark_hash: len=1: 8 MB/s
[ 6.847167] # benchmark_hash: len=16: 125 MB/s
[ 6.862114] # benchmark_hash: len=64: 318 MB/s
[ 6.878173] # benchmark_hash: len=127: 440 MB/s
[ 6.893081] # benchmark_hash: len=128: 492 MB/s
[ 6.907976] # benchmark_hash: len=200: 605 MB/s
[ 6.922658] # benchmark_hash: len=256: 676 MB/s
[ 6.937558] # benchmark_hash: len=511: 794 MB/s
[ 6.951994] # benchmark_hash: len=512: 833 MB/s
[ 6.966262] # benchmark_hash: len=1024: 941 MB/s
[ 6.980295] # benchmark_hash: len=3173: 1044 MB/s
[ 6.994494] # benchmark_hash: len=4096: 1044 MB/s
[ 7.008728] # benchmark_hash: len=16384: 1073 MB/s
[ 7.014515] ok 11 benchmark_hash
[ 7.019628] # sha1: pass:11 fail:0 skip:0 total:11
[ 7.023170] # Totals: pass:11 fail:0 skip:0 total:11
[ 7.027916] ok 5 sha1
[ 7.767257] # Subtest: sha256
[ 7.770542] # module: sha256_kunit
[ 7.770544] 1..15
[ 7.777383] ok 1 test_hash_test_vectors
[ 7.788563] ok 2 test_hash_all_lens_up_to_4096
[ 7.806090] ok 3 test_hash_incremental_updates
[ 7.813553] ok 4 test_hash_buffer_overruns
[ 7.822384] ok 5 test_hash_overlaps
[ 7.829388] ok 6 test_hash_alignment_consistency
[ 7.833843] ok 7 test_hash_ctx_zeroization
[ 7.915191] ok 8 test_hash_interrupt_context_1
[ 8.362312] ok 9 test_hash_interrupt_context_2
[ 8.401607] ok 10 test_hmac
[ 8.415458] ok 11 test_sha256_finup_2x
[ 8.419397] ok 12 test_sha256_finup_2x_defaultctx
[ 8.424107] ok 13 test_sha256_finup_2x_hugelen
[ 8.451289] # benchmark_hash: len=1: 7 MB/s
[ 8.465372] # benchmark_hash: len=16: 119 MB/s
[ 8.481760] # benchmark_hash: len=64: 280 MB/s
[ 8.499344] # benchmark_hash: len=127: 387 MB/s
[ 8.515800] # benchmark_hash: len=128: 427 MB/s
[ 8.531970] # benchmark_hash: len=200: 537 MB/s
[ 8.548241] # benchmark_hash: len=256: 582 MB/s
[ 8.564838] # benchmark_hash: len=511: 679 MB/s
[ 8.580872] # benchmark_hash: len=512: 714 MB/s
[ 8.596858] # benchmark_hash: len=1024: 796 MB/s
[ 8.612567] # benchmark_hash: len=3173: 883 MB/s
[ 8.628546] # benchmark_hash: len=4096: 876 MB/s
[ 8.644482] # benchmark_hash: len=16384: 899 MB/s
[ 8.649773] ok 14 benchmark_hash
[ 8.655505] ok 15 benchmark_sha256_finup_2x # SKIP not relevant
[ 8.659065] # sha256: pass:14 fail:0 skip:1 total:15
[ 8.665276] # Totals: pass:14 fail:0 skip:1 total:15
[ 8.670195] ok 7 sha256
Changes in v3:
- Implement PHE Extensions optimized SHA1 and SHA256 transform functions
using inline assembly instead of separate assembly files
- Eliminate unnecessary casts
- Add CONFIG_CPU_SUP_ZHAOXIN check to compile out the code when disabled
- Use 'boot_cpu_data.x86' to identify the CPU family instead of
'cpu_data(0).x86'
- Only check X86_FEATURE_PHE_EN for CPU support, consistent with other
CPU feature checks.
- Disable the padlock-sha driver on Zhaoxin processors with CPU family
0x07 and newer.
Changes in v2:
- Add Zhaoxin support to lib/crypto instead of extending the existing
padlock-sha driver
AlanSong-oc (3):
crypto: padlock-sha - Disable for Zhaoxin processor
lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
lib/crypto: x86/sha256: PHE Extensions optimized SHA256 transform
function
drivers/crypto/padlock-sha.c | 7 +++++++
lib/crypto/x86/sha1.h | 25 +++++++++++++++++++++++++
lib/crypto/x86/sha256.h | 25 +++++++++++++++++++++++++
3 files changed, 57 insertions(+)
--
2.34.1
^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor
  2026-01-16  7:15 [PATCH v3 0/3] lib/crypto: x86/sha: Add PHE Extensions support AlanSong-oc
@ 2026-01-16  7:15 ` AlanSong-oc
  2026-01-18  0:09   ` Eric Biggers
  2026-01-16  7:15 ` [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function AlanSong-oc
  2026-01-16  7:15 ` [PATCH v3 3/3] lib/crypto: x86/sha256: PHE Extensions optimized SHA256 " AlanSong-oc
  2 siblings, 1 reply; 13+ messages in thread
From: AlanSong-oc @ 2026-01-16  7:15 UTC (permalink / raw)
  To: herbert, davem, ebiggers, Jason, ardb, linux-crypto, linux-kernel,
	x86
  Cc: CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu,
	AlanSong-oc

On Zhaoxin processors, the XSHA1 instruction requires that the buffer
pointed to by the %rdi register be 32 bytes in size, and the XSHA1 and
XSHA256 instructions perform no operation when %ecx is zero. Because of
these requirements, the current padlock-sha driver does not work
correctly on Zhaoxin processors: it fails the self-tests, so the driver
is never activated there.

This issue has been reported in Debian [1]. The self-tests fail with
the following messages [2]:

alg: shash: sha1-padlock-nano test failed (wrong result) on test vector 0, cfg="init+update+final aligned buffer"
alg: self-tests for sha1 using sha1-padlock-nano failed (rc=-22)
------------[ cut here ]------------
alg: shash: sha256-padlock-nano test failed (wrong result) on test vector 0, cfg="init+update+final aligned buffer"
alg: self-tests for sha256 using sha256-padlock-nano failed (rc=-22)
------------[ cut here ]------------

Disable the padlock-sha driver on Zhaoxin processors with CPU family
0x07 and newer. Add PHE Extensions support for SHA-1 and SHA-256 to
lib/crypto, following the suggestion in [3].
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1103397
[2] https://linux-hardware.org/?probe=271fabb7a4&log=dmesg
[3] https://lore.kernel.org/linux-crypto/aUI4CGp6kK7mxgEr@gondor.apana.org.au/

Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>
---
 drivers/crypto/padlock-sha.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/crypto/padlock-sha.c b/drivers/crypto/padlock-sha.c
index 329f60ad4..9214bbfc8 100644
--- a/drivers/crypto/padlock-sha.c
+++ b/drivers/crypto/padlock-sha.c
@@ -332,6 +332,13 @@ static int __init padlock_init(void)
 	if (!x86_match_cpu(padlock_sha_ids) || !boot_cpu_has(X86_FEATURE_PHE_EN))
 		return -ENODEV;
 
+	/*
+	 * Skip family 0x07 and newer used by Zhaoxin processors,
+	 * as the driver's self-tests fail on these CPUs.
+	 */
+	if (c->x86 >= 0x07)
+		return -ENODEV;
+
 	/* Register the newly added algorithm module if on *
 	 * VIA Nano processor, or else just do as before */
 	if (c->x86_model < 0x0f) {
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor
  2026-01-16  7:15 ` [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor AlanSong-oc
@ 2026-01-18  0:09   ` Eric Biggers
  2026-03-05  1:36     ` AlanSong-oc
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Biggers @ 2026-01-18  0:09 UTC (permalink / raw)
  To: AlanSong-oc
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
	CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu

On Fri, Jan 16, 2026 at 03:15:11PM +0800, AlanSong-oc wrote:
> 
> Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>

Since this is a bug fix, please add Fixes and Cc stable tags.

- Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor
  2026-01-18  0:09   ` Eric Biggers
@ 2026-03-05  1:36     ` AlanSong-oc
  0 siblings, 0 replies; 13+ messages in thread
From: AlanSong-oc @ 2026-03-05  1:36 UTC (permalink / raw)
  To: Eric Biggers
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
	CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu

On 1/18/2026 8:09 AM, Eric Biggers wrote:
> 
> On Fri, Jan 16, 2026 at 03:15:11PM +0800, AlanSong-oc wrote:
>> 
>> Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>
> 
> Since this is a bug fix, please add Fixes and Cc stable tags.
> 

I will add Fixes and Cc stable tags in the next version of the patch.

Please accept my apologies for the delayed response due to
administrative procedures and the recent holidays. Thank you for your
review and valuable suggestions.

Best Regards
AlanSong-oc

^ permalink raw reply	[flat|nested] 13+ messages in thread
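For reference, the tags requested above take the following form in a commit message trailer. This is only an illustration: the hash and subject shown are placeholders, not the actual commit being fixed, which must be looked up in the git history:

```
Fixes: 0123456789ab ("crypto: padlock - <subject of the offending commit>")
Cc: stable@vger.kernel.org
```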
* [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-01-16  7:15 [PATCH v3 0/3] lib/crypto: x86/sha: Add PHE Extensions support AlanSong-oc
  2026-01-16  7:15 ` [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor AlanSong-oc
@ 2026-01-16  7:15 ` AlanSong-oc
  2026-01-18  0:31   ` Eric Biggers
  2026-01-16  7:15 ` [PATCH v3 3/3] lib/crypto: x86/sha256: PHE Extensions optimized SHA256 " AlanSong-oc
  2 siblings, 1 reply; 13+ messages in thread
From: AlanSong-oc @ 2026-01-16  7:15 UTC (permalink / raw)
  To: herbert, davem, ebiggers, Jason, ardb, linux-crypto, linux-kernel,
	x86
  Cc: CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu,
	AlanSong-oc

Zhaoxin CPUs have implemented the SHA(Secure Hash Algorithm) as its CPU
instructions by PHE(Padlock Hash Engine) Extensions, including XSHA1,
XSHA256, XSHA384 and XSHA512 instructions.

With the help of implementation of SHA in hardware instead of software,
can develop applications with higher performance, more security and more
flexibility.

This patch includes the XSHA1 instruction optimized implementation of
SHA-1 transform function.

Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>
---
 lib/crypto/x86/sha1.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/lib/crypto/x86/sha1.h b/lib/crypto/x86/sha1.h
index c48a0131f..d4946270a 100644
--- a/lib/crypto/x86/sha1.h
+++ b/lib/crypto/x86/sha1.h
@@ -48,6 +48,26 @@ static void sha1_blocks_avx2(struct sha1_block_state *state,
 	}
 }
 
+#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
+#define PHE_ALIGNMENT 16
+static void sha1_blocks_phe(struct sha1_block_state *state,
+			    const u8 *data, size_t nblocks)
+{
+	/*
+	 * XSHA1 requires %edi to point to a 32-byte, 16-byte-aligned
+	 * buffer on Zhaoxin processors.
+	 */
+	u8 buf[32 + PHE_ALIGNMENT - 1];
+	u8 *dst = PTR_ALIGN(&buf[0], PHE_ALIGNMENT);
+
+	memcpy(dst, state, SHA1_DIGEST_SIZE);
+	asm volatile(".byte 0xf3,0x0f,0xa6,0xc8"
+		     : "+S"(data), "+D"(dst)
+		     : "a"((long)-1), "c"(nblocks));
+	memcpy(state, dst, SHA1_DIGEST_SIZE);
+}
+#endif /* CONFIG_CPU_SUP_ZHAOXIN */
+
 static void sha1_blocks(struct sha1_block_state *state,
 			const u8 *data, size_t nblocks)
 {
@@ -59,6 +79,11 @@ static void sha1_mod_init_arch(void)
 {
 	if (boot_cpu_has(X86_FEATURE_SHA_NI)) {
 		static_call_update(sha1_blocks_x86, sha1_blocks_ni);
+#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
+	} else if (boot_cpu_has(X86_FEATURE_PHE_EN)) {
+		if (boot_cpu_data.x86 >= 0x07)
+			static_call_update(sha1_blocks_x86, sha1_blocks_phe);
+#endif
 	} else if (cpu_has_xfeatures(XFEATURE_MASK_SSE |
 				     XFEATURE_MASK_YMM, NULL) &&
 		   boot_cpu_has(X86_FEATURE_AVX)) {
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-01-16  7:15 ` [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function AlanSong-oc
@ 2026-01-18  0:31   ` Eric Biggers
  2026-03-05  1:37     ` AlanSong-oc
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Biggers @ 2026-01-18  0:31 UTC (permalink / raw)
  To: AlanSong-oc
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
	CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu

On Fri, Jan 16, 2026 at 03:15:12PM +0800, AlanSong-oc wrote:
> Zhaoxin CPUs have implemented the SHA(Secure Hash Algorithm) as its CPU
> instructions by PHE(Padlock Hash Engine) Extensions, including XSHA1,
> XSHA256, XSHA384 and XSHA512 instructions.
> 
> With the help of implementation of SHA in hardware instead of software,
> can develop applications with higher performance, more security and more
> flexibility.
> 
> This patch includes the XSHA1 instruction optimized implementation of
> SHA-1 transform function.
> 
> Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>

Please include the information I've asked for (benchmark results, test
results, and link to the specification) directly in the commit message.

> +#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
> +#define PHE_ALIGNMENT 16
> +static void sha1_blocks_phe(struct sha1_block_state *state,
> +			    const u8 *data, size_t nblocks)

The IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN) should go in the CPU feature
check, so that the code will be parsed regardless of the setting. That
reduces the chance that future changes will cause compilation errors.

> +	/*
> +	 * XSHA1 requires %edi to point to a 32-byte, 16-byte-aligned
> +	 * buffer on Zhaoxin processors.
> +	 */

This seems implausible. In 64-bit mode a pointer can't fit in %edi. I
thought you mentioned that this instruction is 64-bit compatible? You
may have meant %rdi.

Interestingly, the spec you provided specifically says the registers
operated on are %eax, %ecx, %esi, and %edi.

So assuming the code works, perhaps both the spec and your code comment
are incorrect?

These errors don't really inspire confidence in this instruction.

> +	memcpy(dst, state, SHA1_DIGEST_SIZE);
> +	asm volatile(".byte 0xf3,0x0f,0xa6,0xc8"
> +		     : "+S"(data), "+D"(dst)
> +		     : "a"((long)-1), "c"(nblocks));
> +	memcpy(state, dst, SHA1_DIGEST_SIZE);

Is the reason for using '.byte' that the GNU and clang assemblers don't
implement the mnemonic for this Zhaoxin-specific instruction? The spec
implies that the intended mnemonic is "rep sha1".

If that's correct, could you add a comment like /* rep sha1 */ so that
it's clear what the intended instruction is?

Also, the spec describes all four registers as both input and output
registers. Yet your inline asm marks %rax and %rcx as inputs only.

> @@ -59,6 +79,11 @@ static void sha1_mod_init_arch(void)
> {
>  	if (boot_cpu_has(X86_FEATURE_SHA_NI)) {
>  		static_call_update(sha1_blocks_x86, sha1_blocks_ni);
> +#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
> +	} else if (boot_cpu_has(X86_FEATURE_PHE_EN)) {
> +		if (boot_cpu_data.x86 >= 0x07)
> +			static_call_update(sha1_blocks_x86, sha1_blocks_phe);
> +#endif

I think it should be:

	} else if (IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN) &&
		   boot_cpu_has(X86_FEATURE_PHE_EN) &&
		   boot_cpu_data.x86 >= 0x07) {
		static_call_update(sha1_blocks_x86, sha1_blocks_phe);

... so (a) the code will be parsed even when !CONFIG_CPU_SUP_ZHAOXIN,
and (b) functions won't be unnecessarily disabled when
boot_cpu_has(X86_FEATURE_PHE_EN) && boot_cpu_data.x86 < 0x07.

As before, all these comments apply to the SHA-256 patch too.

- Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-01-18  0:31   ` Eric Biggers
@ 2026-03-05  1:37     ` AlanSong-oc
  2026-03-05 19:18       ` Eric Biggers
  0 siblings, 1 reply; 13+ messages in thread
From: AlanSong-oc @ 2026-03-05  1:37 UTC (permalink / raw)
  To: Eric Biggers
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
	CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu

On 1/18/2026 8:31 AM, Eric Biggers wrote:
> 
> On Fri, Jan 16, 2026 at 03:15:12PM +0800, AlanSong-oc wrote:
>> Zhaoxin CPUs have implemented the SHA(Secure Hash Algorithm) as its CPU
>> instructions by PHE(Padlock Hash Engine) Extensions, including XSHA1,
>> XSHA256, XSHA384 and XSHA512 instructions.
>> 
>> With the help of implementation of SHA in hardware instead of software,
>> can develop applications with higher performance, more security and more
>> flexibility.
>> 
>> This patch includes the XSHA1 instruction optimized implementation of
>> SHA-1 transform function.
>> 
>> Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>
> 
> Please include the information I've asked for (benchmark results, test
> results, and link to the specification) directly in the commit message.

Sorry for missing the link to the specification in the commit message.
I will include the benchmark results, test results, and the link to the
specification directly in the commit message in the next version of the
patch, rather than in the cover letter.

>> +#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
>> +#define PHE_ALIGNMENT 16
>> +static void sha1_blocks_phe(struct sha1_block_state *state,
>> +			    const u8 *data, size_t nblocks)
> 
> The IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN) should go in the CPU feature
> check, so that the code will be parsed regardless of the setting. That
> reduces the chance that future changes will cause compilation errors.

I will address this problem using the approach you described below in
the next version of the patch.

>> +	/*
>> +	 * XSHA1 requires %edi to point to a 32-byte, 16-byte-aligned
>> +	 * buffer on Zhaoxin processors.
>> +	 */
> 
> This seems implausible. In 64-bit mode a pointer can't fit in %edi. I
> thought you mentioned that this instruction is 64-bit compatible? You
> may have meant %rdi.
> 
> Interestingly, the spec you provided specifically says the registers
> operated on are %eax, %ecx, %esi, and %edi.
> 
> So assuming the code works, perhaps both the spec and your code comment
> are incorrect?
> 
> These errors don't really inspire confidence in this instruction.

Sorry for the misleading comment. I will correct it in the next version
of the patch.

The specification provided earlier uses the 32-bit registers as an
example, which doesn't mean the instruction only supports 32-bit mode.
The updated specification explicitly clarifies this point and is
available at the following link:
https://gitee.com/openzhaoxin/zhaoxin_specifications/blob/20260227/ZX_Padlock_Reference.pdf

>> +	memcpy(dst, state, SHA1_DIGEST_SIZE);
>> +	asm volatile(".byte 0xf3,0x0f,0xa6,0xc8"
>> +		     : "+S"(data), "+D"(dst)
>> +		     : "a"((long)-1), "c"(nblocks));
>> +	memcpy(state, dst, SHA1_DIGEST_SIZE);
> 
> Is the reason for using '.byte' that the GNU and clang assemblers don't
> implement the mnemonic for this Zhaoxin-specific instruction? The spec
> implies that the intended mnemonic is "rep sha1".
> 
> If that's correct, could you add a comment like /* rep sha1 */ so that
> it's clear what the intended instruction is?

The '.byte' directive is used to ensure the correct binary code is
generated regardless of assembler support, particularly for assemblers
that lack the corresponding mnemonic. I will add an appropriate comment
in the next version of the patch to clarify the intended instruction.

> Also, the spec describes all four registers as both input and output
> registers. Yet your inline asm marks %rax and %rcx as inputs only.

Thank you for pointing this out.

On the one hand, when the '+' constraint modifier is applied to an
operand, it is treated as both an input and an output operand.
Therefore, %rsi and %rdi are considered input operands as well.

On the other hand, after the instruction executes, the values in %rax,
%rsi, and %rcx are modified. These registers should therefore use the
'+' constraint modifier to inform the compiler that their values are
updated by the assembly code. We cannot rely on clobbers to indicate
that the values of input operands are modified, following the guidance
in the GCC manual. However, since %rax is initialized with a constant
value, it does not need the '+' constraint modifier; it can simply be
specified as an input operand.

In addition, although %rdi itself is not modified by the instruction,
the memory it references may be updated, so a "memory" clobber should
be added to notify the compiler about possible memory side effects.

The corrected inline assembly should be written as follows:

	asm volatile(".byte 0xf3,0x0f,0xa6,0xc8" /* REP XSHA1 */
		     : "+S"(data), "+c"(nblocks)
		     : "a"((long)-1), "D"(dst)
		     : "memory");

I will adopt it in the next version of the patch.

>> @@ -59,6 +79,11 @@ static void sha1_mod_init_arch(void)
>> {
>>  	if (boot_cpu_has(X86_FEATURE_SHA_NI)) {
>>  		static_call_update(sha1_blocks_x86, sha1_blocks_ni);
>> +#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
>> +	} else if (boot_cpu_has(X86_FEATURE_PHE_EN)) {
>> +		if (boot_cpu_data.x86 >= 0x07)
>> +			static_call_update(sha1_blocks_x86, sha1_blocks_phe);
>> +#endif
> 
> I think it should be:
> 
> 	} else if (IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN) &&
> 		   boot_cpu_has(X86_FEATURE_PHE_EN) &&
> 		   boot_cpu_data.x86 >= 0x07) {
> 		static_call_update(sha1_blocks_x86, sha1_blocks_phe);
> 
> ... so (a) the code will be parsed even when !CONFIG_CPU_SUP_ZHAOXIN,
> and (b) functions won't be unnecessarily disabled when
> boot_cpu_has(X86_FEATURE_PHE_EN) && boot_cpu_data.x86 < 0x07.

Thanks for the suggestion; that makes more sense.

Using #if and #endif for conditional compilation is a poor choice here,
as it prevents the code from being parsed when the option is disabled.
It is cleaner to include the CPU family check directly in the
condition.

> 
> As before, all these comments apply to the SHA-256 patch too.

Certainly, I will also apply all of the suggestions mentioned above to
the SHA-256 patch.

Please accept my apologies for the delayed response due to
administrative procedures and the recent holidays. Thank you for your
review and valuable suggestions.

Best Regards
AlanSong-oc

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-03-05  1:37     ` AlanSong-oc
@ 2026-03-05 19:18       ` Eric Biggers
  2026-03-11 11:37         ` AlanSong-oc
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Biggers @ 2026-03-05 19:18 UTC (permalink / raw)
  To: AlanSong-oc
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
	CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu

On Thu, Mar 05, 2026 at 09:37:01AM +0800, AlanSong-oc wrote:
>> Also, the spec describes all four registers as both input and output
>> registers. Yet your inline asm marks %rax and %rcx as inputs only.
> 
> Thank you for pointing this out.
> 
> On the one hand, when the '+' constraint modifier is applied to an
> operand, it is treated as both an input and an output operand.
> Therefore, %rsi and %rdi are considered input operands as well.
> 
> On the other hand, after the instruction executes, the values in %rax,
> %rsi, and %rcx are modified. These registers should therefore use the
> '+' constraint modifier to inform the compiler that their values are
> updated by the assembly code. We cannot rely on clobbers to indicate
> that the values of input operands are modified, following the guidance
> in the GCC manual. However, since %rax is initialized with a constant
> value, it does not need the '+' constraint modifier; it can simply be
> specified as an input operand.
> 
> In addition, although %rdi itself is not modified by the instruction,
> the memory it references may be updated, so a "memory" clobber should
> be added to notify the compiler about possible memory side effects.
> 
> The corrected inline assembly should be written as follows:
> 
> 	asm volatile(".byte 0xf3,0x0f,0xa6,0xc8" /* REP XSHA1 */
> 		     : "+S"(data), "+c"(nblocks)
> 		     : "a"((long)-1), "D"(dst)
> 		     : "memory");

If the instruction both reads and writes %rax, then the constraint needs
to be "+a", even if the C code doesn't use the updated value. Otherwise
the compiler can assume that the value stored in %rax is unchanged and
optimize the code accordingly, for example by not reinitializing %rax if
the constant -1 is needed again later on.

Yes, this means you'll need to move the constant -1 to a local variable.

>> As before, all these comments apply to the SHA-256 patch too.
> 
> Certainly, I will also apply all of the suggestions mentioned above to
> the SHA-256 patch.

I also have to ask: are you sure you need SHA-1 to be optimized at all?
SHA-1 has been deprecated for a long time. Most users have moved to
SHA-256 and other stronger algorithms, and those that haven't need to
move very soon. There's little value in adding new optimized code for
SHA-1.

How about simplifying your patch to just SHA-256? Then we can focus on
the one that's actually important and not on the deprecated SHA-1.

- Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-03-05 19:18       ` Eric Biggers
@ 2026-03-11 11:37         ` AlanSong-oc
  2026-03-12  4:03           ` Eric Biggers
  0 siblings, 1 reply; 13+ messages in thread
From: AlanSong-oc @ 2026-03-11 11:37 UTC (permalink / raw)
  To: Eric Biggers
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
	CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu, HansHu

On 3/6/26 03:18, Eric Biggers wrote:
> On Thu, Mar 05, 2026 at 09:37:01AM +0800, AlanSong-oc wrote:
>>> Also, the spec describes all four registers as both input and output
>>> registers. Yet your inline asm marks %rax and %rcx as inputs only.
>> 
>> Thank you for pointing this out.
>> 
>> On the one hand, when the '+' constraint modifier is applied to an
>> operand, it is treated as both an input and an output operand.
>> Therefore, %rsi and %rdi are considered input operands as well.
>> 
>> On the other hand, after the instruction executes, the values in %rax,
>> %rsi, and %rcx are modified. These registers should therefore use the
>> '+' constraint modifier to inform the compiler that their values are
>> updated by the assembly code. We cannot rely on clobbers to indicate
>> that the values of input operands are modified, following the guidance
>> in the GCC manual. However, since %rax is initialized with a constant
>> value, it does not need the '+' constraint modifier; it can simply be
>> specified as an input operand.
>> 
>> In addition, although %rdi itself is not modified by the instruction,
>> the memory it references may be updated, so a "memory" clobber should
>> be added to notify the compiler about possible memory side effects.
>> 
>> The corrected inline assembly should be written as follows:
>> 
>> 	asm volatile(".byte 0xf3,0x0f,0xa6,0xc8" /* REP XSHA1 */
>> 		     : "+S"(data), "+c"(nblocks)
>> 		     : "a"((long)-1), "D"(dst)
>> 		     : "memory");
> 
> If the instruction both reads and writes %rax, then the constraint needs
> to be "+a", even if the C code doesn't use the updated value. Otherwise
> the compiler can assume that the value stored in %rax is unchanged and
> optimize the code accordingly, for example by not reinitializing %rax if
> the constant -1 is needed again later on.
> 
> Yes, this means you'll need to move the constant -1 to a local variable.

Indeed, to ensure that the compiler generates correct binary code at
all optimization levels, the inline assembly should accurately describe
the behavior of the instruction. I will use a local variable initialized
to -1 for the %rax register, which the instruction both reads and
writes, in the next version of the patch.

>>> As before, all these comments apply to the SHA-256 patch too.
>> 
>> Certainly, I will also apply all of the suggestions mentioned above to
>> the SHA-256 patch.
> 
> I also have to ask: are you sure you need SHA-1 to be optimized at all?
> SHA-1 has been deprecated for a long time. Most users have moved to
> SHA-256 and other stronger algorithms, and those that haven't need to
> move very soon. There's little value in adding new optimized code for
> SHA-1.
> 
> How about simplifying your patch to just SHA-256? Then we can focus on
> the one that's actually important and not on the deprecated SHA-1.

It is true that SHA-1 is rarely used by most users today. However, it
may still be needed in certain scenarios. For those cases, we would like
to add support for the XSHA1 instruction to accelerate SHA-1.

Does the crypto community have any plans to remove SHA-1 support in
upcoming kernel versions?

Best Regards
AlanSong-oc

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-03-11 11:37         ` AlanSong-oc
@ 2026-03-12  4:03           ` Eric Biggers
  2026-03-13  2:58             ` AlanSong-oc
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Biggers @ 2026-03-12  4:03 UTC (permalink / raw)
  To: AlanSong-oc
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
	CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu, HansHu

On Wed, Mar 11, 2026 at 07:37:39PM +0800, AlanSong-oc wrote:
>> I also have to ask: are you sure you need SHA-1 to be optimized at all?
>> SHA-1 has been deprecated for a long time. Most users have moved to
>> SHA-256 and other stronger algorithms, and those that haven't need to
>> move very soon. There's little value in adding new optimized code for
>> SHA-1.
>> 
>> How about simplifying your patch to just SHA-256? Then we can focus on
>> the one that's actually important and not on the deprecated SHA-1.
> 
> It is true that SHA-1 is rarely used by most users today. However, it
> may still be needed in certain scenarios. For those cases, we would like
> to add support for the XSHA1 instruction to accelerate SHA-1.
> 
> Does the crypto community have any plans to remove SHA-1 support in
> upcoming kernel versions?

It's already possible to build a kernel without SHA-1 support. SHA-1
has been cryptographically broken and is considered obsolete.
Performance-critical hashing in the kernel already tends to use SHA-256.

These patches already feel marginal, as they are being pushed without
QEMU support, so the community will be unable to test them. The only
reason I would consider accepting them without QEMU support is that
there was already code in drivers/crypto/ that used these instructions.

It also helps that they are just single instructions. Though, even with
that I still found a bug in the proposed code as well as errors in the
CPU documentation, as mentioned. And the drivers/crypto/ implementation
that uses these instructions is broken too, as you're aware.

Overall, it's clear that platform-specific routines like this are very
risky to maintain without adequate testing. Yet, correctness is the
first priority in cryptographic code.

So I would suggest that, to reduce the risk, we focus on just one
algorithm, SHA-256. Note that this makes your job easier as well.

- Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-03-12  4:03 ` Eric Biggers
@ 2026-03-13  2:58 ` AlanSong-oc
  2026-03-13  3:28 ` Eric Biggers
  0 siblings, 1 reply; 13+ messages in thread
From: AlanSong-oc @ 2026-03-13  2:58 UTC (permalink / raw)
  To: Eric Biggers
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
      CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu, HansHu

On 3/12/26 12:03, Eric Biggers wrote:
> On Wed, Mar 11, 2026 at 07:37:39PM +0800, AlanSong-oc wrote:
>>> I also have to ask: are you sure you need SHA-1 to be optimized at all?
>>> SHA-1 has been deprecated for a long time. Most users have moved to
>>> SHA-256 and other stronger algorithms, and those that haven't need to
>>> move very soon. There's little value in adding new optimized code for
>>> SHA-1.
>>>
>>> How about simplifying your patch to just SHA-256? Then we can focus on
>>> the one that's actually important and not on the deprecated SHA-1.
>>
>> It is true that SHA-1 is rarely used by most users today. However, it
>> may still be needed in certain scenarios. For those cases, we would like
>> to add support for the XSHA1 instruction to accelerate SHA-1.
>>
>> Does the crypto community have any plans to remove SHA-1 support in
>> recent kernel versions?
>
> It's already possible to build a kernel without SHA-1 support. SHA-1
> has been cryptographically broken and is considered obsolete.
> Performance-critical hashing in the kernel already tends to use SHA-256.
>
> These patches already feel marginal, as they are being pushed without
> QEMU support, so the community will be unable to test them. The only
> reason I would consider accepting them without QEMU support is because
> there was already code in drivers/crypto/ that used these instructions.

Sorry for the inconvenience caused by the inability to test the provided
patches; QEMU currently does not support emulation of the XSHA1 and
XSHA256 instructions.

Besides, since the previous patch adding XSHA384 and XSHA512 instruction
support was not accepted, I would like to ask whether adding emulation
support for the XSHA384 and XSHA512 instructions in QEMU would help the
crypto community evaluate and accept the corresponding kernel patches.

> It also helps that they are just single instructions. Though, even with
> that I still found a bug in the proposed code as well as errors in the
> CPU documentation, as mentioned. And the drivers/crypto/ implementation
> that uses these instructions is broken too, as you're aware of.
>
> Overall, it's clear that platform-specific routines like this are very
> risky to maintain without adequate testing. Yet, correctness is the
> first priority in cryptographic code.
>
> So I would suggest that to reduce the risk, we focus on just one
> algorithm, SHA-256. Note that this makes your job easier, as well.

Thanks for your suggestions. I will only add XSHA256 instruction support
for the SHA-256 algorithm in the next version of the patch.

Best Regards,
AlanSong-oc

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
  2026-03-13  2:58 ` AlanSong-oc
@ 2026-03-13  3:28 ` Eric Biggers
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Biggers @ 2026-03-13  3:28 UTC (permalink / raw)
  To: AlanSong-oc
  Cc: herbert, davem, Jason, ardb, linux-crypto, linux-kernel, x86,
      CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu, HansHu

On Fri, Mar 13, 2026 at 10:58:07AM +0800, AlanSong-oc wrote:
> On 3/12/26 12:03, Eric Biggers wrote:
> > On Wed, Mar 11, 2026 at 07:37:39PM +0800, AlanSong-oc wrote:
> >>> I also have to ask: are you sure you need SHA-1 to be optimized at all?
> >>> SHA-1 has been deprecated for a long time. Most users have moved to
> >>> SHA-256 and other stronger algorithms, and those that haven't need to
> >>> move very soon. There's little value in adding new optimized code for
> >>> SHA-1.
> >>>
> >>> How about simplifying your patch to just SHA-256? Then we can focus on
> >>> the one that's actually important and not on the deprecated SHA-1.
> >>
> >> It is true that SHA-1 is rarely used by most users today. However, it
> >> may still be needed in certain scenarios. For those cases, we would like
> >> to add support for the XSHA1 instruction to accelerate SHA-1.
> >>
> >> Does the crypto community have any plans to remove SHA-1 support in
> >> recent kernel versions?
> >
> > It's already possible to build a kernel without SHA-1 support. SHA-1
> > has been cryptographically broken and is considered obsolete.
> > Performance-critical hashing in the kernel already tends to use SHA-256.
> >
> > These patches already feel marginal, as they are being pushed without
> > QEMU support, so the community will be unable to test them. The only
> > reason I would consider accepting them without QEMU support is because
> > there was already code in drivers/crypto/ that used these instructions.
>
> Sorry for the inconvenience caused by the inability to test provided
> patches, as QEMU currently does not support emulation of the XSHA1 and
> XSHA256 instructions.
>
> Besides, since the previous patch adding XSHA384 and XSHA512 instruction
> support was not accepted, I would like to ask whether adding emulation
> support for XSHA384 and XSHA512 instructions in QEMU would help the
> crypto community evaluate and accept the corresponding kernel patches.
>
> > It also helps that they are just single instructions. Though, even with
> > that I still found a bug in the proposed code as well as errors in the
> > CPU documentation, as mentioned. And the drivers/crypto/ implementation
> > that uses these instructions is broken too, as you're aware of.
> >
> > Overall, it's clear that platform-specific routines like this are very
> > risky to maintain without adequate testing. Yet, correctness is the
> > first priority in cryptographic code.
> >
> > So I would suggest that to reduce the risk, we focus on just one
> > algorithm, SHA-256. Note that this makes your job easier, as well.
>
> Thanks for your suggestions. I will only add XSHA256 instruction support
> for the SHA-256 algorithm in the next version of the patch.

Thanks. Yes, adding QEMU support for these instructions would be very
helpful.

- Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH v3 3/3] lib/crypto: x86/sha256: PHE Extensions optimized SHA256 transform function
  2026-01-16  7:15 [PATCH v3 0/3] lib/crypto: x86/sha: Add PHE Extensions support AlanSong-oc
  2026-01-16  7:15 ` [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor AlanSong-oc
  2026-01-16  7:15 ` [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function AlanSong-oc
@ 2026-01-16  7:15 ` AlanSong-oc
  2 siblings, 0 replies; 13+ messages in thread
From: AlanSong-oc @ 2026-01-16  7:15 UTC (permalink / raw)
  To: herbert, davem, ebiggers, Jason, ardb, linux-crypto, linux-kernel, x86
  Cc: CobeChen, TonyWWang-oc, YunShen, GeorgeXue, LeoLiu-oc, HansHu,
      AlanSong-oc

Zhaoxin CPUs implement SHA (Secure Hash Algorithm) as CPU instructions
through the PHE (Padlock Hash Engine) Extensions, which provide the
XSHA1, XSHA256, XSHA384 and XSHA512 instructions. Computing SHA in
hardware rather than software lets applications achieve higher
performance, stronger security and greater flexibility.

This patch adds the XSHA256 instruction optimized implementation of the
SHA-256 transform function.

Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>
---
 lib/crypto/x86/sha256.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/lib/crypto/x86/sha256.h b/lib/crypto/x86/sha256.h
index 38e33b22a..70b68bad0 100644
--- a/lib/crypto/x86/sha256.h
+++ b/lib/crypto/x86/sha256.h
@@ -31,6 +31,26 @@ DEFINE_X86_SHA256_FN(sha256_blocks_avx, sha256_transform_avx);
 DEFINE_X86_SHA256_FN(sha256_blocks_avx2, sha256_transform_rorx);
 DEFINE_X86_SHA256_FN(sha256_blocks_ni, sha256_ni_transform);
 
+#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
+#define PHE_ALIGNMENT 16
+static void sha256_blocks_phe(struct sha256_block_state *state,
+			      const u8 *data, size_t nblocks)
+{
+	/*
+	 * XSHA256 requires %edi to point to a 32-byte, 16-byte-aligned
+	 * buffer on Zhaoxin processors.
+	 */
+	u8 buf[32 + PHE_ALIGNMENT - 1];
+	u8 *dst = PTR_ALIGN(&buf[0], PHE_ALIGNMENT);
+
+	memcpy(dst, state, SHA256_DIGEST_SIZE);
+	asm volatile(".byte 0xf3,0x0f,0xa6,0xd0"
+		     : "+S"(data), "+D"(dst)
+		     : "a"((long)-1), "c"(nblocks));
+	memcpy(state, dst, SHA256_DIGEST_SIZE);
+}
+#endif /* CONFIG_CPU_SUP_ZHAOXIN */
+
 static void sha256_blocks(struct sha256_block_state *state,
 			  const u8 *data, size_t nblocks)
 {
@@ -79,6 +99,11 @@ static void sha256_mod_init_arch(void)
 	if (boot_cpu_has(X86_FEATURE_SHA_NI)) {
 		static_call_update(sha256_blocks_x86, sha256_blocks_ni);
 		static_branch_enable(&have_sha_ni);
+#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
+	} else if (boot_cpu_has(X86_FEATURE_PHE_EN)) {
+		if (boot_cpu_data.x86 >= 0x07)
+			static_call_update(sha256_blocks_x86, sha256_blocks_phe);
+#endif
 	} else if (cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM,
 				     NULL) &&
 		   boot_cpu_has(X86_FEATURE_AVX)) {
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-03-13  3:28 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-16  7:15 [PATCH v3 0/3] lib/crypto: x86/sha: Add PHE Extensions support AlanSong-oc
2026-01-16  7:15 ` [PATCH v3 1/3] crypto: padlock-sha - Disable for Zhaoxin processor AlanSong-oc
2026-01-18  0:09   ` Eric Biggers
2026-03-05  1:36     ` AlanSong-oc
2026-01-16  7:15 ` [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function AlanSong-oc
2026-01-18  0:31   ` Eric Biggers
2026-03-05  1:37     ` AlanSong-oc
2026-03-05 19:18       ` Eric Biggers
2026-03-11 11:37         ` AlanSong-oc
2026-03-12  4:03           ` Eric Biggers
2026-03-13  2:58             ` AlanSong-oc
2026-03-13  3:28               ` Eric Biggers
2026-01-16  7:15 ` [PATCH v3 3/3] lib/crypto: x86/sha256: PHE Extensions optimized SHA256 transform function AlanSong-oc