From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eu-smtp-delivery-151.mimecast.com (eu-smtp-delivery-151.mimecast.com [185.58.86.151]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 259591ABEC9 for ; Mon, 14 Oct 2024 16:30:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.58.86.151 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728923411; cv=none; b=dX7OkXOtav35PUgGcfi/EppBMvZ5hjDfrOVnZZVPjK4enflewnVu6HtTq/ikR5mt/5VLiXGB31WD7Y84IsTNkl0Jo/jE+vR+B6v1NRh+ayRfEKO6t52ft7zGTVsWzVhBhFxh60Ye/SyQLXqVUfZzrgXsfx6QVkJIRQGQ7PBOiCk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728923411; c=relaxed/simple; bh=MqDoa+aWbGkHWUz7ay7S+ulaBshn9RyOgsYiphaQi5U=; h=From:To:CC:Subject:Date:Message-ID:References:In-Reply-To: MIME-Version:Content-Type; b=fLXYsZcRci51PgGEJ9kEbvxkoNvb1Z/MurBL8VBbW+DfiwzFmmOqoPpd10CwCh1Tp/s6IO7oQ+zMxL6gXrfVBSAZlVSczdAMjUuqXioHT4VX+Tcu25i6XAsk0uEAblj/yer8t2YPbbap4qLYgvnrFVZ5jwJp4sLAF8nXxhsX7/Q= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=ACULAB.COM; spf=pass smtp.mailfrom=aculab.com; arc=none smtp.client-ip=185.58.86.151 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=ACULAB.COM Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=aculab.com Received: from AcuMS.aculab.com (156.67.243.121 [156.67.243.121]) by relay.mimecast.com with ESMTP with both STARTTLS and AUTH (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id uk-mta-215-SbgoJqXHMQafOzomsgtOvQ-1; Mon, 14 Oct 2024 17:30:06 +0100 X-MC-Unique: SbgoJqXHMQafOzomsgtOvQ-1 Received: from AcuMS.Aculab.com (10.202.163.6) by AcuMS.aculab.com (10.202.163.6) with Microsoft SMTP Server (TLS) id 15.0.1497.48; Mon, 14 Oct 2024 17:30:05 +0100 Received: from AcuMS.Aculab.com ([::1]) by AcuMS.aculab.com ([::1]) with mapi id 15.00.1497.048; Mon, 14 Oct 2024 17:30:05 +0100 From: David Laight To: 'Eric Biggers' , "linux-crypto@vger.kernel.org" CC: "x86@kernel.org" , "linux-kernel@vger.kernel.org" , Ard Biesheuvel , "Josh Poimboeuf" , Peter Zijlstra Subject: RE: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling Thread-Topic: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling Thread-Index: AQHbHfE2Of1hyBaCckeJTHnfIiWE7LKGaWew Date: Mon, 14 Oct 2024 16:30:05 +0000 Message-ID: References: <20241014042447.50197-1-ebiggers@kernel.org> <20241014042447.50197-4-ebiggers@kernel.org> In-Reply-To: <20241014042447.50197-4-ebiggers@kernel.org> Accept-Language: en-GB, en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-ms-exchange-transport-fromentityheader: Hosted Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: aculab.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable From: Eric Biggers > Sent: 14 October 2024 05:25 >=20 > crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully > unrolled and uses a jump table to jump into the correct location. This > optimization is misguided, as it bloats the binary code size and > introduces an indirect call. x86_64 CPUs can predict loops well, so it > is fine to just use a loop instead. Loop bookkeeping instructions can > compete with the crc instructions for the ALUs, but this is easily > mitigated by unrolling the loop by a smaller amount, such as 4 times. Do you need to unroll it at all? ... > +=09# Unroll the loop by a factor of 4 to reduce the overhead of the loop > +=09# bookkeeping instructions, which can compete with crc32q for the ALU= s. > +.Lcrc_3lanes_4x_loop: > +=09crc32q=09(bufp), crc_init_q > +=09crc32q=09(bufp,chunk_bytes_q), crc1 > +=09crc32q=09(bufp,chunk_bytes_q,2), crc2 > +=09crc32q=098(bufp), crc_init_q > +=09crc32q=098(bufp,chunk_bytes_q), crc1 > +=09crc32q=098(bufp,chunk_bytes_q,2), crc2 > +=09crc32q=0916(bufp), crc_init_q > +=09crc32q=0916(bufp,chunk_bytes_q), crc1 > +=09crc32q=0916(bufp,chunk_bytes_q,2), crc2 > +=09crc32q=0924(bufp), crc_init_q > +=09crc32q=0924(bufp,chunk_bytes_q), crc1 > +=09crc32q=0924(bufp,chunk_bytes_q,2), crc2 > +=09add=09$32, bufp > +=09sub=09$4, %eax > +=09jge=09.Lcrc_3lanes_4x_loop If you are really lucky you'll get two memory reads/clock. So you won't ever to do than two crc32/clock. Looking at Agner's instruction latency tables I don't think any cpu can do more that one per clock, or pipeline them. I think that means you don't even need two (never mind 3) buffers. Most modern x86 can do 4 or 5 (or even more) ALU operations per clock - depending on the combination of instructions. Replace the loop termination with a comparison of 'bufp' against a pre-calculated limit and you get two instructions (that might get merged into one u-op) for the loop overhead. They'll run in parallel with the crc32q instructions. I've never managed to get a 1-clock loop, but two is easy. You might find that just: 10: =09crc32q=09(bufp), crc =09crc32q=098(bufp), crc =09add=09=09$16, bufp =09cmp=09=09bufp, buf_lim =09jne=09=0910b will run at 8 bytes/clock on modern intel cpu. You can write that in C with a simple asm function for the crc32 instruction itself. You might need the more complex to setup loop: =09offset =3D -length; =09bufend =3D buf + length; 10: =09crc32q=09(offset, bufend), crc =09crc32q=098(offset, bufend), crc =09add=09=09&16, offset =09jne=09=0910b which uses negative offsets from the end of the buffer. =09David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1= PT, UK Registration No: 1397386 (Wales)