From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F3D9013C810; Mon, 14 Oct 2024 19:01:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728932505; cv=none; b=hnpfOFhNpzFsqGjVUhpyTLnMWZtDj9trKMzOM/Nw4Jv8EJYFUYrrM4bv9BpxtTq/uXd797rDuFeTu3rhZh6EQ1q1XZOvpStXrc79NzKMY9YRjb0e5s1RIRL96DKigpu4Tycf6hY4UVqzfVr+1pc2SqlGJmll92EUM1lsCl1+VBc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728932505; c=relaxed/simple; bh=4B/4MKRiZMEAx3KlSKct+QBFZ1DTuDQEOf7PADj4UYE=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=njYqvPwxwQLmBEBIRhLRX21BeUF13bWYbNpuMU4Pp8seKt5BjTsL/jQXV5Ty1AipwNhgQgMHPUOH6hf6TMVpWf92faoMvDvgqHJW0PEp9a8C3ZckMhMHEo9+AK3HesuL8d2ZKE0vEZBdjpJX0nxIXFNMCqJ5D4hGNb5RRToko5o= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=llYDHDQI; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="llYDHDQI" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5A59AC4CEC3; Mon, 14 Oct 2024 19:01:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728932504; bh=4B/4MKRiZMEAx3KlSKct+QBFZ1DTuDQEOf7PADj4UYE=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=llYDHDQIkfBeL0ZFDnSSgtqfLxoWdgvdPQQopLaTh+W9roGjh6pHAihxCtKvjbeiz wrXDNxYaPBV1FqD3nwk/3XYS6tVExcudDEkF40u11/es90SNLxLgEqHKl90Bz1sm8/ UE+OSwTwgfZeTZ5J5FhTealJUO8mYpq5b9wbmYQA9b1Lwm5kbcD2A7Hz0MRFUXysVb ql+tPM3W3hmDIUQGaM4lwFS/ufGdl+dQJJAMuhF3Jax0Bp7+aq6uc1oeUo74yE6RY1 KUoQCo+a8R1KttTQuGt8RzU9Ij4Kv1zZJNls5IYCd4SPowfxKwEAOsnoFw2o4vmQBx EZr/wgp+Hbv9A== Date: Mon, 14 Oct 2024 12:01:42 -0700 From: Eric Biggers To: David Laight Cc: "linux-crypto@vger.kernel.org" , "x86@kernel.org" , "linux-kernel@vger.kernel.org" , Ard Biesheuvel , Josh Poimboeuf , Peter Zijlstra Subject: Re: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling Message-ID: <20241014190142.GA1137@sol.localdomain> References: <20241014042447.50197-1-ebiggers@kernel.org> <20241014042447.50197-4-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Mon, Oct 14, 2024 at 04:30:05PM +0000, David Laight wrote: > From: Eric Biggers > > Sent: 14 October 2024 05:25 > > > > crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully > > unrolled and uses a jump table to jump into the correct location. This > > optimization is misguided, as it bloats the binary code size and > > introduces an indirect call. x86_64 CPUs can predict loops well, so it > > is fine to just use a loop instead. Loop bookkeeping instructions can > > compete with the crc instructions for the ALUs, but this is easily > > mitigated by unrolling the loop by a smaller amount, such as 4 times. > > Do you need to unroll it at all? It looks like on most CPUs, no. On Haswell, Emerald Rapids, Zen 2 it does not make a significant difference. However, it helps on Zen 5. > > + # Unroll the loop by a factor of 4 to reduce the overhead of the loop > > + # bookkeeping instructions, which can compete with crc32q for the ALUs. > > +.Lcrc_3lanes_4x_loop: > > + crc32q (bufp), crc_init_q > > + crc32q (bufp,chunk_bytes_q), crc1 > > + crc32q (bufp,chunk_bytes_q,2), crc2 > > + crc32q 8(bufp), crc_init_q > > + crc32q 8(bufp,chunk_bytes_q), crc1 > > + crc32q 8(bufp,chunk_bytes_q,2), crc2 > > + crc32q 16(bufp), crc_init_q > > + crc32q 16(bufp,chunk_bytes_q), crc1 > > + crc32q 16(bufp,chunk_bytes_q,2), crc2 > > + crc32q 24(bufp), crc_init_q > > + crc32q 24(bufp,chunk_bytes_q), crc1 > > + crc32q 24(bufp,chunk_bytes_q,2), crc2 > > + add $32, bufp > > + sub $4, %eax > > + jge .Lcrc_3lanes_4x_loop > > If you are really lucky you'll get two memory reads/clock. > So you won't ever to do than two crc32/clock. > Looking at Agner's instruction latency tables I don't think > any cpu can do more that one per clock, or pipeline them. > I think that means you don't even need two (never mind 3) > buffers. On most Intel and AMD CPUs (I tested Haswell for old Intel, Emerald Rapids for new Intel, and Zen 2 for slightly-old AMD), crc32q has 3 cycle latency and 1 per cycle throughput. So you do need at least 3 streams. AMD Zen 5 has much higher crc32q throughput and seems to want up to 7 streams. This is not implemented yet. > Most modern x86 can do 4 or 5 (or even more) ALU operations > per clock - depending on the combination of instructions. > > Replace the loop termination with a comparison of 'bufp' > against a pre-calculated limit and you get two instructions > (that might get merged into one u-op) for the loop overhead. > They'll run in parallel with the crc32q instructions. That's actually still three instructions: add, cmp, and jne. I tried it on both Intel and AMD, and it did not help. > I've never managed to get a 1-clock loop, but two is easy. > You might find that just: > 10: > crc32q (bufp), crc > crc32q 8(bufp), crc > add $16, bufp > cmp bufp, buf_lim > jne 10b > will run at 8 bytes/clock on modern intel cpu. No, the latency of crc32q is still three cycles. You need three streams. > You can write that in C with a simple asm function for the crc32 > instruction itself. Well, the single-stream CRC32C implementation already does that; see arch/x86/crypto/crc32c-intel_glue.c. Things are not as simple for this multi-stream implementation, which uses pclmulqdq to combine the CRCs. - Eric