Date: Fri, 12 Jun 2026 10:04:32 +0100
From: David Laight
To: Christoph Hellwig
Cc: Eric Biggers, Andrew Morton, linux-kernel@vger.kernel.org,
 linux-crypto@vger.kernel.org, x86@kernel.org, Andrea Mazzoleni
Subject: Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
Message-ID: <20260612100432.1f1c8c7a@pumpkin>
In-Reply-To: <20260612052247.GA8848@lst.de>
References: <20260612044034.117442-1-ebiggers@kernel.org>
 <20260612052247.GA8848@lst.de>
X-Mailing-List: linux-crypto@vger.kernel.org

On Fri, 12 Jun 2026 07:22:47 +0200
Christoph Hellwig wrote:

> On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > Add an implementation of xor_gen() using AVX-512.
>
> > Benchmark on AMD Ryzen 9 9950X (Zen 5):
>
> Can you share the benchmark?
>
> In my local tree I have ports of the AVX2 and AVX512 implementations
> from snapraid (https://github.com/amadvance/snapraid), which in userspace
> give really good performance. On my Laptop with a AMD Ryzen AI 7 PRO 350
> (which is a Zen5 with the slower double pumped AVX512 unit), both of
> them get over 1GB/s throughput on the snapraid benchmarks.
> I've been holding them back as I don't have a good kernel benchmarking
> harness, and it's missing the quirks for old AVX512 or the newer AMD
> special cases.

From my experiments on Intel cpus (and I don't remember the zen-5 being that
different - but I've done less testing on it) you don't need to unroll
loops very much at all.

A reasonable model seems to be that the uops generated by the instruction
decoder get executed when all the prerequisite registers and the required
execution unit are available.
So for a memory copy (and the xor is basically a copy) the control loop can
run way ahead of the read/write instructions.
This means you can get the control loop 'for free' and unrolling further
makes little or no difference.

Each xor is two memory reads and one memory write.
The cpu I was using could only do one write/clock - so you can only do one
xor each clock.
I think some of the newer ones can do two writes/clock but I'm not sure
how many reads/clock they can do - might still be 2, don't think it's 4.
So you should be able to get one xor per clock, but I doubt you'll get two
(and possibly not even 1.3 - which would require 4 memory accesses per clock).

The best loop construct is the one that uses negative offsets from the
end of the buffers, basically:
	buf += len;
	offset = -len;
	do
		f(buf[offset]);
	while (offset += size);
that reduces the loop control to just an 'add' and 'jnz' (which can get
merged into a single u-op).

The cpus have enough execution units to execute two memory reads, a memory
write, an xor, the add and jnz every clock.
So even the 'rolled up' loop might run at one xor per clock.
While I think I got a 'one clock loop' on my zen-5 (testing word-at-a-time
strlen) I only managed a two clock loop on the newest Intel cpu I've got
(which isn't that new).
So put two xors in the loop and it shouldn't be limited by the loop control,
but will be limited by the memory accesses instead.
Further unrolling shouldn't help and may make things worse.
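
As a rough C sketch of that loop shape, with two xors per iteration:
xor_words() is a made-up name (nothing from the patch), it works on plain
64-bit words rather than AVX-512 registers, and it assumes a non-zero,
even word count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only: dst[i] ^= src[i].
 * Assumes nwords is non-zero and even (two xors per iteration).
 */
static void xor_words(uint64_t *dst, const uint64_t *src, size_t nwords)
{
	ptrdiff_t offset = -(ptrdiff_t)nwords;

	assert(nwords != 0 && nwords % 2 == 0);

	/* Point both buffers at their ends and index backwards from there. */
	dst += nwords;
	src += nwords;

	do {
		/* Each xor is two memory reads and one memory write. */
		dst[offset] ^= src[offset];
		dst[offset + 1] ^= src[offset + 1];
		/* Loop control is one add plus a test against zero. */
	} while (offset += 2);
}
```

Whether the compiler really emits a single add/jnz pair for the loop
control would need checking in the generated code.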

The Intel cpus have logic to directly forward the result of an ALU
instruction into the next few instructions, but after that you can get a
stall because of the 'round trip' via the register file.
So part way down an unrolled nn(%reg) sequence you can get a stall.
An extra 'add $0,%reg' in the middle of the unrolled loop will 'refresh'
the register and speed things up.
(I hit that with a loop that needed a rather more complicated control
structure.)

You definitely need to use the pmc clock counter and data dependencies
against the rdpmc instruction to get sensible performance figures.
They can reasonably reliably measure down to less than 20 clocks.

	David