Date: Mon, 03 Mar 2025 15:57:57 -0800
From: "H. Peter Anvin"
To: David Laight, Bill Wendling
CC: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)", Eric Biggers, Ard Biesheuvel, Nathan Chancellor, Nick Desaulniers, Justin Stitt, LKML, linux-crypto@vger.kernel.org, clang-built-linux
Subject: Re: [PATCH v2] x86/crc32: use builtins to improve code generation
In-Reply-To: <20250303224216.30431b1d@pumpkin>
References: <20250303201509.32f6f062@pumpkin> <20250303224216.30431b1d@pumpkin>
Message-ID: <7BC89461-A060-462A-9B42-7C0138AA0307@zytor.com>

On March 3, 2025 2:42:16 PM PST, David Laight wrote:
>On Mon, 3 Mar 2025 12:27:21 -0800
>Bill Wendling wrote:
>
>> On Mon, Mar 3, 2025 at 12:15 PM David Laight wrote:
>> > On Thu, 27 Feb 2025 15:47:03 -0800
>> > Bill Wendling wrote:
>> >
>> > > For both gcc and clang, crc32 builtins generate better code than the
>> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> > > does the same and unrolls the loops. GCC has no changes on i386, but
>> > > Clang's code generation is vastly improved, due to Clang's "rm"
>> > > constraint issue.
>> > >
>> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> > > is expected because of the "rm" issue. However, Clang's performance is
>> > > better than GCC's by ~1.5%, most likely due to loop unrolling.
>> >
>> > How much does it unroll?
>> > How much you need depends on the latency of the crc32 instruction.
>> > The copy of Agner's tables I have gives it a latency of 3 on
>> > pretty much everything.
>> > If you can only do one chained crc instruction every three clocks
>> > it is hard to see how unrolling the loop will help.
>> > Intel cpu (since sandy bridge) will run a two clock loop.
>> > With three clocks to play with it should be easy (even for a compiler)
>> > to generate a loop with no extra clock stalls.
>> >
>> > Clearly if Clang decides to copy arguments to the stack an extra time
>> > that will kill things. But in this case you want the "m" constraint
>> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
>> >
>> Below is what Clang generates with the builtins. From what Eric said,
>> this code is only run for sizes <= 512 bytes?
So maybe it's not super
>> important to micro-optimize this. I apologize, but my ability to
>> measure clock loops for x86 code isn't great. (I'm sure I lack the
>> requisite benchmarks, etc.)
>
>Jeepers - that is trashing the I-cache.
>Not to mention all the conditional branches at the bottom.
>Consider the basic loop:
>1:	crc32q	(%rcx), %rbx
>	addq	$8, %rcx
>	cmp	%rcx, %rdx
>	jne	1b
>The crc32 has latency 3 so it must take at least 3 clocks.
>Even naively the addq can be issued in the same clock as the crc32
>and the cmp and jne in the following ones.
>Since the jne is predicted taken, the addq can be assumed to execute
>in the same clock as the jne.
>(The cmp+jne might also get merged into a single u-op)
>(I've done this with adc (for IP checksum), with two adc the loop takes
>two clocks even with the extra memory reads.)
>
>So that loop is likely to run limited by the three clock latency of crc32.
>Even the memory reads will happen with all the crc32 just waiting for the
>previous crc32 to finish.
>You can take an instruction out of the loop:
>1:	crc32q	(%rcx,%rdx), %rbx
>	addq	$8, %rdx
>	jne	1b
>but that may not be necessary, and (IIRC) gcc doesn't like letting you
>generate it.
>
>For buffers that aren't multiples of 8 bytes 'remember' that the crc of
>a byte depends on how far it is from the end of the buffer, and that initial
>zero bytes have no effect.
>So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
>right by the number of bytes needed to make the rest of the buffer a multiple
>of 8 bytes (the same as reading from across the start of the buffer and masking
>the low bytes) then treat exactly the same as a buffer that is a multiple
>of 8 bytes long.
>Don't worry about misaligned reads, you lose less than one clock per cache
>line (that is with adc doing a read every clock).
>
>Actually measuring the performance is hard.
>You can't use rdtsc because the clock speed will change when the cpu gets busy.
>There is a 'performance counter' that is actual clocks.
>While you can use the library functions to set it up, you need to just read the
>register - the library overhead is too big.
>You also need the odd lfence.
>Having done that, and provided the buffer is in the L1 d-cache you can measure
>the loop time in clocks and compare against the expected value.
>Once you've got 3 clocks per crc32 instruction it won't get any better,
>which is why the 'fast' code for big buffers does crc of 3+ buffer sections
>in parallel.
>
>	David
>
>>
>> -bw
>>
>> .LBB1_9:                    # =>This Inner Loop Header: Depth=1
>>	movl	%ebx, %ebx
>>	crc32q	(%rcx), %rbx
>>	addq	$8, %rcx
>>	incq	%rdi
>>	cmpq	%rdi, %rsi
>>	jne	.LBB1_9
>> # %bb.10:
>>	subq	%rdi, %rax
>>	jmp	.LBB1_11
>> .LBB1_7:
>>	movq	%r14, %rcx
>> .LBB1_11:
>>	movq	%r15, %rsi
>>	andq	$-8, %rsi
>>	cmpq	$7, %rdx
>>	jb	.LBB1_14
>> # %bb.12:
>>	xorl	%edx, %edx
>> .LBB1_13:                   # =>This Inner Loop Header: Depth=1
>>	movl	%ebx, %ebx
>>	crc32q	(%rcx,%rdx,8), %rbx
>>	crc32q	8(%rcx,%rdx,8), %rbx
>>	crc32q	16(%rcx,%rdx,8), %rbx
>>	crc32q	24(%rcx,%rdx,8), %rbx
>>	crc32q	32(%rcx,%rdx,8), %rbx
>>	crc32q	40(%rcx,%rdx,8), %rbx
>>	crc32q	48(%rcx,%rdx,8), %rbx
>>	crc32q	56(%rcx,%rdx,8), %rbx
>>	addq	$8, %rdx
>>	cmpq	%rdx, %rax
>>	jne	.LBB1_13
>> .LBB1_14:
>>	addq	%rsi, %r14
>> .LBB1_15:
>>	andq	$7, %r15
>>	je	.LBB1_23
>> # %bb.16:
>>	crc32b	(%r14), %ebx
>>	cmpl	$1, %r15d
>>	je	.LBB1_23
>> # %bb.17:
>>	crc32b	1(%r14), %ebx
>>	cmpl	$2, %r15d
>>	je	.LBB1_23
>> # %bb.18:
>>	crc32b	2(%r14), %ebx
>>	cmpl	$3, %r15d
>>	je	.LBB1_23
>> # %bb.19:
>>	crc32b	3(%r14), %ebx
>>	cmpl	$4, %r15d
>>	je	.LBB1_23
>> # %bb.20:
>>	crc32b	4(%r14), %ebx
>>	cmpl	$5, %r15d
>>	je	.LBB1_23
>> # %bb.21:
>>	crc32b	5(%r14), %ebx
>>	cmpl	$6, %r15d
>>	je	.LBB1_23
>> # %bb.22:
>>	crc32b	6(%r14), %ebx
>> .LBB1_23:
>>	movl	%ebx, %eax
>> .LBB1_24:
>
>

The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?