Date: Tue, 15 Oct 2024 20:03:49 -0700
From: Eric Biggers
To: Ard Biesheuvel
Cc: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
    linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
    will@kernel.org, catalin.marinas@arm.com, Ard Biesheuvel, Kees Cook
Subject: Re: [PATCH 2/2] arm64/crc32: Implement 4-way interleave using PMULL
Message-ID: <20241016030349.GD1138@sol.localdomain>
References: <20241015104138.2875879-4-ardb+git@google.com>
    <20241015104138.2875879-6-ardb+git@google.com>
In-Reply-To: <20241015104138.2875879-6-ardb+git@google.com>

On Tue, Oct 15, 2024 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel
>
> Now that kernel mode NEON no longer disables preemption, using FP/SIMD
> in library code which is not obviously part of the crypto subsystem is
> no longer problematic, as it will no longer incur unexpected latencies.
>
> So accelerate the CRC-32 library code on arm64 to use a 4-way
> interleave, using PMULL instructions to implement the folding.
>
> On Apple M2, this results in a speedup of 2 - 2.8x when using input
> sizes of 1k - 8k. For smaller sizes, the overhead of preserving and
> restoring the FP/SIMD register file may not be worth it, so 1k is used
> as a threshold for choosing this code path.
>
> The coefficient tables were generated using code provided by Eric. [0]
>
> [0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c
>
> Cc: Eric Biggers
> Signed-off-by: Ard Biesheuvel
> ---
>  arch/arm64/lib/Makefile      |   2 +-
>  arch/arm64/lib/crc32-glue.c  |  36 +++
>  arch/arm64/lib/crc32-pmull.S | 240 ++++++++++++++++++++
>  3 files changed, 277 insertions(+), 1 deletion(-)

Thanks for doing this!  The new code looks good to me.  4-way does seem like
the right choice for arm64.

I'd recommend calling the file crc32-4way.S and the functions
crc32*_arm64_4way(), rather than crc32-pmull.S and crc32*_pmull().  This would
avoid confusion with a CRC implementation that is actually based entirely on
pmull (which is possible).  The proposed implementation uses the crc32
instructions to do most of the work and only uses pmull for combining the CRCs.
Yes, crc32c-pcl-intel-asm_64.S made this same mistake, but it is a mistake, IMO.

- Eric
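
For illustration, the split described above (per-chunk CRCs do most of the
work, and a carryless multiply is only needed to combine them) can be sketched
in portable C.  This is a minimal sketch under stated assumptions, not the code
from the patch: crc32_raw(), gf2_mulmod(), xpow_bytes() and crc32_4way_sketch()
are hypothetical names, the bitwise loop stands in for the crc32 instructions,
gf2_mulmod() stands in for what pmull would do, and the multiplier is computed
at run time where the real code would presumably use precomputed coefficients.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0xEDB88320u	/* reflected CRC-32 polynomial */

/* Bitwise update of the raw CRC register (no initial/final inversion).
 * Stand-in for the hardware crc32 instructions. */
static uint32_t crc32_raw(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY : 0);
	}
	return crc;
}

/* Multiply two polynomials modulo the CRC polynomial.  Bit 31 holds the
 * x^0 coefficient (reflected order).  Stand-in for a pmull-based multiply. */
static uint32_t gf2_mulmod(uint32_t a, uint32_t b)
{
	uint32_t prod = 0;

	for (uint32_t m = 0x80000000u; m != 0; m >>= 1) {
		if (a & m)
			prod ^= b;
		/* b *= x, reduced modulo the CRC polynomial */
		b = (b >> 1) ^ ((b & 1) ? CRC32_POLY : 0);
	}
	return prod;
}

/* x^(8 * nbytes) mod P, computed naively one bit at a time. */
static uint32_t xpow_bytes(size_t nbytes)
{
	uint32_t r = 0x80000000u;	/* x^0 */

	for (size_t i = 0; i < 8 * nbytes; i++)
		r = (r >> 1) ^ ((r & 1) ? CRC32_POLY : 0);
	return r;
}

/* CRC-32 over four equal chunks processed independently, then combined.
 * Only the first chunk carries the 0xFFFFFFFF initial value. */
static uint32_t crc32_4way_sketch(const uint8_t *p, size_t len)
{
	size_t n = len / 4;
	uint32_t c0 = crc32_raw(0xFFFFFFFFu, p, n);
	uint32_t c1 = crc32_raw(0, p + n, n);
	uint32_t c2 = crc32_raw(0, p + 2 * n, n);
	uint32_t c3 = crc32_raw(0, p + 3 * n, n);
	uint32_t k = xpow_bytes(n);	/* shifts a CRC past one chunk */
	uint32_t crc;

	/* Horner form: ((c0 * k ^ c1) * k ^ c2) * k ^ c3 */
	crc = gf2_mulmod(c0, k) ^ c1;
	crc = gf2_mulmod(crc, k) ^ c2;
	crc = gf2_mulmod(crc, k) ^ c3;

	/* Up to three leftover bytes are handled serially. */
	crc = crc32_raw(crc, p + 4 * n, len - 4 * n);
	return ~crc;
}
```

In this sketch the four crc32_raw() calls are independent, which is what lets
an interleaved implementation keep several crc32 instruction streams in
flight; the multiplies appear only in the final combine step.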