Date: Thu, 23 Jan 2025 13:13:17 -0800
From: Eric Biggers
To: Linus Torvalds
Cc: Theodore Ts'o, linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org, Ard Biesheuvel, Chao Yu, "Darrick J. Wong", Geert Uytterhoeven, Kent Overstreet, "Martin K. Petersen", Michael Ellerman, Vinicius Peixoto, WangYuli
Subject: Re: [GIT PULL] CRC updates for 6.14
Message-ID: <20250123211317.GA88607@sol.localdomain>
References: <20250119225118.GA15398@sol.localdomain> <20250123051633.GA183612@sol.localdomain> <20250123074618.GB183612@sol.localdomain> <20250123140744.GB3875121@mit.edu> <20250123181818.GA2117666@google.com>
X-Mailing-List: linux-crypto@vger.kernel.org

On Thu, Jan 23, 2025 at 12:52:30PM -0800, Linus Torvalds wrote:
> On Thu, 23 Jan 2025 at 10:18, Eric Biggers wrote:
> >
> > FWIW, benchmarking the CRC library functions is easy now; just enable
> > CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.
> >
> > But, it's just a traditional benchmark that calls the functions in a loop, and
> > doesn't account for dcache thrashing.
>
> Yeah. I suspect the x86 vector version in particular is just not even
> worth it. If you have the crc instruction, the basic arch-optimized
> case is presumably already pretty good (and *that* code is tiny).

x86 unfortunately only has an instruction for crc32c, i.e. the variant of CRC32
that uses the Castagnoli polynomial. So it works great for crc32c(). But any
other variant of CRC such as the regular crc32() or crc_t10dif_update() needs
carryless multiplication (PCLMULQDQ) which uses the vector registers. It is
super fast on sufficiently long messages, but it does use the vector registers.

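To make the crc32 vs. crc32c distinction concrete, here is a minimal bit-at-a-time sketch in plain C. It is not the kernel's implementation; the function names crc32_bitwise(), crc32_ieee() and crc32c_castagnoli() are made up for illustration, and the initial value and final inversion follow the common check-value convention rather than any particular kernel API. The only thing that differs between the two variants is the generator polynomial, which is why an instruction hardwired to the Castagnoli polynomial cannot help with the regular crc32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected (LSB-first) forms of the two generator polynomials. */
#define CRC32_POLY_LE  0xEDB88320u /* regular CRC32 (IEEE 802.3) */
#define CRC32C_POLY_LE 0x82F63B78u /* CRC32C (Castagnoli) */

/* Hypothetical helper: update a reflected 32-bit CRC one bit at a time. */
static uint32_t crc32_bitwise(uint32_t poly, uint32_t crc,
                              const void *data, size_t len)
{
    const uint8_t *p = data;

    assert(p != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? poly : 0);
    }
    return crc;
}

/* Conventional framing (assumed here): start from ~0, invert at the end. */
static uint32_t crc32_ieee(const void *data, size_t len)
{
    return ~crc32_bitwise(CRC32_POLY_LE, ~0u, data, len);
}

static uint32_t crc32c_castagnoli(const void *data, size_t len)
{
    return ~crc32_bitwise(CRC32C_POLY_LE, ~0u, data, len);
}
```

With this framing, the standard check string "123456789" gives 0xCBF43926 for crc32_ieee() and 0xE3069283 for crc32c_castagnoli().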
FWIW, arm64 has an instruction for both crc32c and crc32. And RISC-V has
carryless multiplication using scalar registers. So things are a bit easier
there.

> Honestly, I took a quick look at the "by-4" and "by-8" cases, and
> considering that you still have to do per-byte lookups of the words
> _anyway_, I would expect that the regular by-1 is presumably not that
> much worse.

The difference I'm seeing on x86_64 (Ryzen 9 9950X) is 690 MB/s for slice-by-1
vs. 3091 MB/s for slice-by-8. It's significant since the latter gives much more
instruction-level parallelism.

But of course, CPUs on which it matters tend to have *much* faster
arch-optimized implementations anyway. Currently the x86_64-optimized
crc32c_le() is up to 43607 MB/s on the same CPU, and crc32_le() is up to 22664
MB/s. (By adding VPCLMULQDQ support we could actually achieve over 80000 MB/s.)
The caveat is that in the [V]PCLMULQDQ case the data length has to be long
enough for it to be worthwhile, but then again having an 8 KiB randomly-accessed
data table just to micro-optimize short messages seems not too worthwhile.

> IOW, maybe we could try to just do the simple by-1 for the generic
> case, and cut the x86 version down to the simple "use crc32b" case.
> And see if anybody even notices...
>
> Linus

- Eric
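For readers following the slice-by-1 vs. slice-by-8 comparison above, the two table-driven approaches can be sketched as below. This is an illustrative sketch only, not the kernel's lib/crc32.c; the names tab, crc32_init_tables(), crc32_slice1() and crc32_slice8() are hypothetical. It shows where the 8 KiB figure comes from (8 tables of 256 32-bit entries) and why slice-by-8 exposes more instruction-level parallelism: its eight lookups per iteration do not depend on one another, whereas each slice-by-1 lookup depends on the previous result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0xEDB88320u /* reflected CRC32 polynomial */

/* 8 tables x 256 entries x 4 bytes = 8 KiB; slice-by-1 uses only tab[0] (1 KiB). */
static uint32_t tab[8][256];

static void crc32_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;

        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ ((c & 1) ? POLY : 0);
        tab[0][i] = c;
    }
    /* tab[t][i] is the CRC of byte i followed by t zero bytes. */
    for (uint32_t i = 0; i < 256; i++)
        for (int t = 1; t < 8; t++)
            tab[t][i] = (tab[t - 1][i] >> 8) ^ tab[0][tab[t - 1][i] & 0xff];
}

/* Slice-by-1: one dependent table lookup per input byte. */
static uint32_t crc32_slice1(uint32_t crc, const uint8_t *p, size_t len)
{
    assert(p != NULL || len == 0);
    while (len--)
        crc = (crc >> 8) ^ tab[0][(crc ^ *p++) & 0xff];
    return crc;
}

/* Slice-by-8: eight independent lookups per 8 input bytes. */
static uint32_t crc32_slice8(uint32_t crc, const uint8_t *p, size_t len)
{
    assert(p != NULL || len == 0);
    while (len >= 8) {
        /* Assemble bytes explicitly so the code is endian-independent. */
        uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
                             (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
        uint32_t hi = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
                      (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;

        crc = tab[7][lo & 0xff] ^ tab[6][(lo >> 8) & 0xff] ^
              tab[5][(lo >> 16) & 0xff] ^ tab[4][lo >> 24] ^
              tab[3][hi & 0xff] ^ tab[2][(hi >> 8) & 0xff] ^
              tab[1][(hi >> 16) & 0xff] ^ tab[0][hi >> 24];
        p += 8;
        len -= 8;
    }
    return crc32_slice1(crc, p, len); /* handle the tail bytes */
}
```

Both functions leave the initial value and final inversion to the caller, so after calling crc32_init_tables() once, ~crc32_slice8(~0u, buf, len) and ~crc32_slice1(~0u, buf, len) should agree for any input.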