Date: Sun, 2 Mar 2025 14:04:26 -0800
From: Eric Biggers
To: Björn Töpel, Palmer Dabbelt
Cc: linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org, linux-riscv@lists.infradead.org, Zhihang Shao, Ard Biesheuvel, Xiao Wang, Charlie Jenkins, Alexandre Ghiti
Subject: Re: [PATCH 0/4] RISC-V CRC optimizations
Message-ID: <20250302220426.GC2079@quark.localdomain>

On Sun, Mar 02, 2025 at 07:56:46PM +0100, Björn Töpel wrote:
> Eric!
>
> Eric Biggers writes:
>
> > On Sun, Feb 16, 2025 at 02:55:26PM -0800, Eric Biggers wrote:
> >> This patchset is a replacement for
> >> "[PATCH v4] riscv: Optimize crct10dif with Zbc extension"
> >> (https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com/).
> >> It adopts the approach that I'm taking for x86 where code is shared
> >> among CRC variants.
> >> It replaces the existing Zbc optimized CRC32
> >> functions, then adds Zbc optimized CRC-T10DIF and CRC64 functions.
> >>
> >> This new code should be significantly faster than the current Zbc
> >> optimized CRC32 code and the previously proposed CRC-T10DIF code.  It
> >> uses "folding" instead of just Barrett reduction, and it also implements
> >> Barrett reduction more efficiently.
> >>
> >> This applies to crc-next at
> >> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next.
> >> It depends on other patches that are queued there for 6.15, so I plan to
> >> take it through there if there are no objections.
> >>
> >> Tested with crc_kunit in QEMU (set CONFIG_CRC_KUNIT_TEST=y and
> >> CONFIG_CRC_BENCHMARK=y), both 32-bit and 64-bit.  I don't have real Zbc
> >> capable hardware to benchmark this on, but the new code should work very
> >> well; similar optimizations work very well on other architectures.
> >
> > Any feedback on this series from the RISC-V side?
>
> I have not reviewed your series, but I did a testrun the Milk-V Jupiter
> which sports a Spacemit K1 that has Zbc.
>
> I based the run on commit 1973160c90d7 ("Merge tag
> 'gpio-fixes-for-v6.14-rc5' of
> git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux"), plus your
> crc-next branch (commit a0bd462f3a13 ("x86/crc: add ANNOTATE_NOENDBR to
> suppress objtool warnings")) merged:
>
> | --- base1.txt 2025-03-02 18:31:16.169438876 +0000
> | +++ eric.txt 2025-03-02 18:35:58.683017223 +0000
> | @@ -11,7 +11,7 @@
> |      # crc16_benchmark: len=127: 153 MB/s
> |      # crc16_benchmark: len=128: 153 MB/s
> |      # crc16_benchmark: len=200: 153 MB/s
> | -    # crc16_benchmark: len=256: 153 MB/s
> | +    # crc16_benchmark: len=256: 154 MB/s
> |      # crc16_benchmark: len=511: 154 MB/s
> |      # crc16_benchmark: len=512: 154 MB/s
> |      # crc16_benchmark: len=1024: 155 MB/s
> | @@ -20,94 +20,94 @@
> |      # crc16_benchmark: len=16384: 155 MB/s
> |      ok 2 crc16_benchmark
> |      ok 3 crc_t10dif_test
> | -    # crc_t10dif_benchmark: len=1: 48 MB/s
> | -    # crc_t10dif_benchmark: len=16: 125 MB/s
> | -    # crc_t10dif_benchmark: len=64: 136 MB/s
> | -    # crc_t10dif_benchmark: len=127: 138 MB/s
> | -    # crc_t10dif_benchmark: len=128: 138 MB/s
> | -    # crc_t10dif_benchmark: len=200: 138 MB/s
> | -    # crc_t10dif_benchmark: len=256: 138 MB/s
> | -    # crc_t10dif_benchmark: len=511: 139 MB/s
> | -    # crc_t10dif_benchmark: len=512: 139 MB/s
> | -    # crc_t10dif_benchmark: len=1024: 139 MB/s
> | -    # crc_t10dif_benchmark: len=3173: 140 MB/s
> | -    # crc_t10dif_benchmark: len=4096: 140 MB/s
> | -    # crc_t10dif_benchmark: len=16384: 140 MB/s
> | +    # crc_t10dif_benchmark: len=1: 28 MB/s
> | +    # crc_t10dif_benchmark: len=16: 236 MB/s
> | +    # crc_t10dif_benchmark: len=64: 450 MB/s
> | +    # crc_t10dif_benchmark: len=127: 480 MB/s
> | +    # crc_t10dif_benchmark: len=128: 540 MB/s
> | +    # crc_t10dif_benchmark: len=200: 559 MB/s
> | +    # crc_t10dif_benchmark: len=256: 600 MB/s
> | +    # crc_t10dif_benchmark: len=511: 613 MB/s
> | +    # crc_t10dif_benchmark: len=512: 635 MB/s
> | +    # crc_t10dif_benchmark: len=1024: 654 MB/s
> | +    # crc_t10dif_benchmark: len=3173: 665 MB/s
> | +    # crc_t10dif_benchmark: len=4096: 669 MB/s
> | +    # crc_t10dif_benchmark: len=16384: 673 MB/s
> |      ok 4 crc_t10dif_benchmark
> |      ok 5 crc32_le_test
> |      # crc32_le_benchmark: len=1: 31 MB/s
> | -    # crc32_le_benchmark: len=16: 456 MB/s
> | -    # crc32_le_benchmark: len=64: 682 MB/s
> | -    # crc32_le_benchmark: len=127: 620 MB/s
> | -    # crc32_le_benchmark: len=128: 744 MB/s
> | -    # crc32_le_benchmark: len=200: 768 MB/s
> | -    # crc32_le_benchmark: len=256: 777 MB/s
> | -    # crc32_le_benchmark: len=511: 758 MB/s
> | -    # crc32_le_benchmark: len=512: 798 MB/s
> | -    # crc32_le_benchmark: len=1024: 807 MB/s
> | -    # crc32_le_benchmark: len=3173: 807 MB/s
> | -    # crc32_le_benchmark: len=4096: 814 MB/s
> | -    # crc32_le_benchmark: len=16384: 816 MB/s
> | +    # crc32_le_benchmark: len=16: 439 MB/s
> | +    # crc32_le_benchmark: len=64: 1209 MB/s
> | +    # crc32_le_benchmark: len=127: 1067 MB/s
> | +    # crc32_le_benchmark: len=128: 1616 MB/s
> | +    # crc32_le_benchmark: len=200: 1739 MB/s
> | +    # crc32_le_benchmark: len=256: 1951 MB/s
> | +    # crc32_le_benchmark: len=511: 1855 MB/s
> | +    # crc32_le_benchmark: len=512: 2174 MB/s
> | +    # crc32_le_benchmark: len=1024: 2301 MB/s
> | +    # crc32_le_benchmark: len=3173: 2347 MB/s
> | +    # crc32_le_benchmark: len=4096: 2407 MB/s
> | +    # crc32_le_benchmark: len=16384: 2440 MB/s
> |      ok 6 crc32_le_benchmark
> |      ok 7 crc32_be_test
> | -    # crc32_be_benchmark: len=1: 27 MB/s
> | -    # crc32_be_benchmark: len=16: 258 MB/s
> | -    # crc32_be_benchmark: len=64: 388 MB/s
> | -    # crc32_be_benchmark: len=127: 402 MB/s
> | -    # crc32_be_benchmark: len=128: 424 MB/s
> | -    # crc32_be_benchmark: len=200: 438 MB/s
> | -    # crc32_be_benchmark: len=256: 444 MB/s
> | -    # crc32_be_benchmark: len=511: 449 MB/s
> | -    # crc32_be_benchmark: len=512: 455 MB/s
> | -    # crc32_be_benchmark: len=1024: 461 MB/s
> | -    # crc32_be_benchmark: len=3173: 463 MB/s
> | -    # crc32_be_benchmark: len=4096: 465 MB/s
> | -    # crc32_be_benchmark: len=16384: 466 MB/s
> | +    # crc32_be_benchmark: len=1: 25 MB/s
> | +    # crc32_be_benchmark: len=16: 251 MB/s
> | +    # crc32_be_benchmark: len=64: 458 MB/s
> | +    # crc32_be_benchmark: len=127: 496 MB/s
> | +    # crc32_be_benchmark: len=128: 547 MB/s
> | +    # crc32_be_benchmark: len=200: 569 MB/s
> | +    # crc32_be_benchmark: len=256: 605 MB/s
> | +    # crc32_be_benchmark: len=511: 621 MB/s
> | +    # crc32_be_benchmark: len=512: 637 MB/s
> | +    # crc32_be_benchmark: len=1024: 657 MB/s
> | +    # crc32_be_benchmark: len=3173: 668 MB/s
> | +    # crc32_be_benchmark: len=4096: 671 MB/s
> | +    # crc32_be_benchmark: len=16384: 674 MB/s
> |      ok 8 crc32_be_benchmark
> |      ok 9 crc32c_test
> |      # crc32c_benchmark: len=1: 31 MB/s
> | -    # crc32c_benchmark: len=16: 457 MB/s
> | -    # crc32c_benchmark: len=64: 682 MB/s
> | -    # crc32c_benchmark: len=127: 620 MB/s
> | -    # crc32c_benchmark: len=128: 744 MB/s
> | -    # crc32c_benchmark: len=200: 769 MB/s
> | -    # crc32c_benchmark: len=256: 779 MB/s
> | -    # crc32c_benchmark: len=511: 758 MB/s
> | -    # crc32c_benchmark: len=512: 797 MB/s
> | -    # crc32c_benchmark: len=1024: 807 MB/s
> | -    # crc32c_benchmark: len=3173: 806 MB/s
> | -    # crc32c_benchmark: len=4096: 813 MB/s
> | -    # crc32c_benchmark: len=16384: 816 MB/s
> | +    # crc32c_benchmark: len=16: 446 MB/s
> | +    # crc32c_benchmark: len=64: 1188 MB/s
> | +    # crc32c_benchmark: len=127: 1066 MB/s
> | +    # crc32c_benchmark: len=128: 1600 MB/s
> | +    # crc32c_benchmark: len=200: 1727 MB/s
> | +    # crc32c_benchmark: len=256: 1941 MB/s
> | +    # crc32c_benchmark: len=511: 1854 MB/s
> | +    # crc32c_benchmark: len=512: 2164 MB/s
> | +    # crc32c_benchmark: len=1024: 2300 MB/s
> | +    # crc32c_benchmark: len=3173: 2345 MB/s
> | +    # crc32c_benchmark: len=4096: 2402 MB/s
> | +    # crc32c_benchmark: len=16384: 2437 MB/s
> |      ok 10 crc32c_benchmark
> |      ok 11 crc64_be_test
> | -    # crc64_be_benchmark: len=1: 64 MB/s
> | -    # crc64_be_benchmark: len=16: 144 MB/s
> | -    # crc64_be_benchmark: len=64: 154 MB/s
> | -    # crc64_be_benchmark: len=127: 156 MB/s
> | -    # crc64_be_benchmark: len=128: 156 MB/s
> | -    # crc64_be_benchmark: len=200: 156 MB/s
> | -    # crc64_be_benchmark: len=256: 156 MB/s
> | -    # crc64_be_benchmark: len=511: 157 MB/s
> | -    # crc64_be_benchmark: len=512: 157 MB/s
> | -    # crc64_be_benchmark: len=1024: 157 MB/s
> | -    # crc64_be_benchmark: len=3173: 158 MB/s
> | -    # crc64_be_benchmark: len=4096: 158 MB/s
> | -    # crc64_be_benchmark: len=16384: 158 MB/s
> | +    # crc64_be_benchmark: len=1: 29 MB/s
> | +    # crc64_be_benchmark: len=16: 264 MB/s
> | +    # crc64_be_benchmark: len=64: 476 MB/s
> | +    # crc64_be_benchmark: len=127: 499 MB/s
> | +    # crc64_be_benchmark: len=128: 558 MB/s
> | +    # crc64_be_benchmark: len=200: 576 MB/s
> | +    # crc64_be_benchmark: len=256: 611 MB/s
> | +    # crc64_be_benchmark: len=511: 621 MB/s
> | +    # crc64_be_benchmark: len=512: 638 MB/s
> | +    # crc64_be_benchmark: len=1024: 659 MB/s
> | +    # crc64_be_benchmark: len=3173: 667 MB/s
> | +    # crc64_be_benchmark: len=4096: 671 MB/s
> | +    # crc64_be_benchmark: len=16384: 674 MB/s
> |      ok 12 crc64_be_benchmark
> |      ok 13 crc64_nvme_test
> | -    # crc64_nvme_benchmark: len=1: 64 MB/s
> | -    # crc64_nvme_benchmark: len=16: 144 MB/s
> | -    # crc64_nvme_benchmark: len=64: 154 MB/s
> | -    # crc64_nvme_benchmark: len=127: 156 MB/s
> | -    # crc64_nvme_benchmark: len=128: 156 MB/s
> | -    # crc64_nvme_benchmark: len=200: 156 MB/s
> | -    # crc64_nvme_benchmark: len=256: 156 MB/s
> | -    # crc64_nvme_benchmark: len=511: 157 MB/s
> | -    # crc64_nvme_benchmark: len=512: 157 MB/s
> | -    # crc64_nvme_benchmark: len=1024: 157 MB/s
> | -    # crc64_nvme_benchmark: len=3173: 158 MB/s
> | -    # crc64_nvme_benchmark: len=4096: 158 MB/s
> | -    # crc64_nvme_benchmark: len=16384: 158 MB/s
> | +    # crc64_nvme_benchmark: len=1: 36 MB/s
> | +    # crc64_nvme_benchmark: len=16: 479 MB/s
> | +    # crc64_nvme_benchmark: len=64: 1340 MB/s
> | +    # crc64_nvme_benchmark: len=127: 1179 MB/s
> | +    # crc64_nvme_benchmark: len=128: 1766 MB/s
> | +    # crc64_nvme_benchmark: len=200: 1965 MB/s
> | +    # crc64_nvme_benchmark: len=256: 2201 MB/s
> | +    # crc64_nvme_benchmark: len=511: 2087 MB/s
> | +    # crc64_nvme_benchmark: len=512: 2464 MB/s
> | +    # crc64_nvme_benchmark: len=1024: 2331 MB/s
> | +    # crc64_nvme_benchmark: len=3173: 2673 MB/s
> | +    # crc64_nvme_benchmark: len=4096: 2745 MB/s
> | +    # crc64_nvme_benchmark: len=16384: 2782 MB/s
> |      ok 14 crc64_nvme_benchmark
> |      # crc: pass:14 fail:0 skip:0 total:14
> |      # Totals: pass:14 fail:0 skip:0 total:14
>
> That's a significant speed up for this popular SoC, and it would be
> great to get this series in for the next merge window! Thank you!
>
> Tested-by: Björn Töpel

Thanks for testing this patchset!  So to summarize, on long messages the results
were roughly:

    lsb-first CRCs (crc32_le, crc32c, crc64_nvme):
        Generic table-based code:              158 MB/s
        Old Zbc-optimized code (crc32* only):  816 MB/s
        New Zbc-optimized code:               2440 MB/s

    msb-first CRCs (crc_t10dif, crc32_be, crc64_be):
        Generic table-based code:              158 MB/s
        Old Zbc-optimized code (crc32* only):  466 MB/s
        New Zbc-optimized code:                674 MB/s

So, quite positive results.  Though, the fact that the msb-first CRCs are (still)
so much slower than the lsb-first ones indicates that be64_to_cpu() is super slow
on RISC-V.  That seems to be caused by the rev8 instruction from Zbb not being
used.  I wonder if there are any plans to make the endianness swap macros use
rev8, or if I'm going to have to roll my own endianness swap in the CRC code.
(I assume it would be fine for the CRC code to depend on both Zbb and Zbc.)

Anyway, I've applied this series to the crc tree
(https://web.git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next).
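
To illustrate the endianness point above: on a little-endian CPU, be64_to_cpu() amounts to a 64-bit byte swap. The sketch below is not the kernel's implementation; it is a generic shift-and-mask swap in portable C, roughly the kind of multi-instruction sequence a compiler has to produce when no single byte-reverse instruction (such as rev8 from Zbb) is available to it. The names swap64_generic() and load_be64() are hypothetical and exist only for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): reverse the byte order of a
 * 64-bit value using only shifts and masks.  Each step swaps
 * progressively larger units: bytes, 16-bit halves, 32-bit halves.
 */
static uint64_t swap64_generic(uint64_t x)
{
	x = ((x & 0x00ff00ff00ff00ffULL) << 8) |
	    ((x >> 8) & 0x00ff00ff00ff00ffULL);
	x = ((x & 0x0000ffff0000ffffULL) << 16) |
	    ((x >> 16) & 0x0000ffff0000ffffULL);
	return (x << 32) | (x >> 32);
}

/*
 * Hypothetical helper: read 8 bytes as a big-endian (msb-first) value,
 * independent of the host's endianness.
 */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

For example, swap64_generic(0x0102030405060708) yields 0x0807060504030201. With GCC or Clang, __builtin_bswap64() expresses the same operation, and a compiler targeting a CPU with Zbb can typically lower it to a single rev8; whether the kernel's swap macros end up doing so on RISC-V is exactly the open question raised above.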
Palmer, I'd appreciate your ack though!

- Eric