From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 12C0234EF15; Fri, 12 Jun 2026 06:00:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781244061; cv=none; b=lqTPCi185jb2i3IpaGJgCTQ5RfzlTjFxZmrA94rZkLOw6Jnp2MkAuRo0POiCn+LCMhictDGmVIclhdKouCLmXN20FP2v4P6drv+95yW2VRBpYMXjzn6+WMssmVjHAM/si5/v7KvwjtZM5hZ8yrNlTAVupbjozGUceRETIyduZhU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781244061; c=relaxed/simple; bh=0yegm3eGBOOZV2A1GhTmyIf4+BMiPvaqf/Jc2Gdj6m4=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=HMtknywLgQfl4GcpSyHTDw5bs9jOrD5IaZFJBmux5+33fM8DA7or0D2k6UcjhfhbS/mrRG4E9nvrjZfEbI4/vyrMW7UemdkrUl4PiqH57axI89XTdcT2+GyDcRd4VVoK8F+UfJEEz21Cu2f+m/0K0SRn7VTDZD7XtwUOWbmThr4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=VyqZs07Z; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="VyqZs07Z" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 73DB21F000E9; Fri, 12 Jun 2026 06:00:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1781244059; bh=dEOLjGXrCgZXGel/l6AP9eBHKjfvTQ8DdxUtTIy4yCg=; h=Date:From:To:Cc:Subject:References:In-Reply-To; b=VyqZs07ZyQfs060hb7fOxSPPWHRUkqOo1M/LqIQBnwuWqKpU1I32WNPchqFMs2Xkg 1Y0FliB+HOgGrGGzCEhxFc7RhxZVj9yGzGiCeYBdEGS9cPPAz8mdLaLlPprxqqp8DH n/tXS5ElxlPWb3+MjYVssZ7DtVEJSR7GKHYGBPrvgeO6OSTR+sr0gZ1BOoy0HdEehN UYuXF4uDvd1C56mW5GBQJhYekaH9ehtBqgaiKFDLxuix4Kgff6edm/Mkwq978yJcSz uqBqdG5MKL96gYbRZ+BTLV6mPsecnFtiudU6sW05ZiGRwLwXSmSq5GExGHQtkXnsj2 MlMMCQcxR9mHA== Date: Thu, 11 Jun 2026 22:59:33 -0700 From: Eric Biggers To: Christoph Hellwig Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org, x86@kernel.org, Andrea Mazzoleni Subject: Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen() Message-ID: <20260612055933.GA6675@sol> References: <20260612044034.117442-1-ebiggers@kernel.org> <20260612052247.GA8848@lst.de> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260612052247.GA8848@lst.de> On Fri, Jun 12, 2026 at 07:22:47AM +0200, Christoph Hellwig wrote: > On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote: > > Add an implementation of xor_gen() using AVX-512. > > > Benchmark on AMD Ryzen 9 9950X (Zen 5): > > Can you share the benchmark? For now I had just hacked up do_xor_speed() as follows and changed xor_force() to xor_register(). There should be a benchmark added to the KUnit test similar to the one in the crypto and CRC tests, though. diff --git a/lib/raid/xor/xor-core.c b/lib/raid/xor/xor-core.c index bd4e6e434418..8c5814af03d5 100644 --- a/lib/raid/xor/xor-core.c +++ b/lib/raid/xor/xor-core.c @@ -76,15 +76,24 @@ void __init xor_force(struct xor_block_template *tmpl) #define REPS 800U static void __init -do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2) +do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2, + void *b3, void *b4, void *b5) { + for (int src_cnt = 1; src_cnt <= 4; src_cnt++) { int speed; unsigned long reps; ktime_t min, start, t0; - void *srcs[1] = { b2 }; + void *srcs[4] = { b2, b3, b4, b5 }; preempt_disable(); + /* warm-up */ + for (int i = 0; i < 8000; i++) { + mb(); /* prevent loop optimization */ + tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE); + mb(); + } + reps = 0; t0 = ktime_get(); /* delay start until time has advanced */ @@ -92,7 +101,7 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2) cpu_relax(); do { mb(); /* prevent loop optimization */ - tmpl->xor_gen(b1, srcs, 1, BENCH_SIZE); + tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE); mb(); } while (reps++ < REPS || (t0 = ktime_get()) == start); min = ktime_sub(t0, start); @@ -105,26 +114,30 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2) pr_info(" %-16s: %5d MB/sec\n", tmpl->name, speed); } +} static int __init calibrate_xor_blocks(void) { - void *b1, *b2; + void *b1, *b2, *b3, *b4, *b5; struct xor_block_template *f, *fastest; if (forced_template) return 0; - b1 = (void *) __get_free_pages(GFP_KERNEL, 2); + b1 = (void *) __get_free_pages(GFP_KERNEL, 4); if (!b1) { pr_warn("xor: Yikes! No memory available.\n"); return -ENOMEM; } b2 = b1 + 2*PAGE_SIZE + BENCH_SIZE; + b3 = b2 + 2*PAGE_SIZE + BENCH_SIZE; + b4 = b3 + 2*PAGE_SIZE + BENCH_SIZE; + b5 = b4 + 2*PAGE_SIZE + BENCH_SIZE; pr_info("xor: measuring software checksum speed\n"); fastest = template_list; for (f = template_list; f; f = f->next) { - do_xor_speed(f, b1, b2); + do_xor_speed(f, b1, b2, b3, b4, b5); if (f->speed > fastest->speed) fastest = f; } > In my local tree I have ports of the AVX2 and AVX512 implementations > from snapraid (https://github.com/amadvance/snapraid), which in userspace > give really good performance. On my Laptop with a AMD Ryzen AI 7 PRO 350 > (which is a Zen5 with the slower double pumped AVX512 unit), both of > them get over 1GB/s throughput on the snapraid benchmarks. I've been > holding them back as I don't have a good kernel benchmarking harness, > and it's missing the quirks for old AVX512 or the newer AMD special > cases. > > Attached for reference. > > Note that either way I'd prefer if we could get away from the stange > old code organization with the DO{1-4} helpers which don't really > help. Well, doing the same on your avx512bw version and adding a column to my table for it (by the way, I think it really just needs avx512f), I get: src_cnt avx avx512 avx512bw ======= ========== ========== ========== 1 68423 MB/s 81940 MB/s 12067 MB/s 2 56035 MB/s 74112 MB/s 10958 MB/s 3 49396 MB/s 67011 MB/s 8608 MB/s 4 43056 MB/s 60823 MB/s 8069 MB/s So, your version isn't great, I'm afraid. Making the inner loop be over src_cnt does simplify the code a lot, but it destroys performance since it turns into 9 instructions for each 64 bytes in each 3 buffers: 5b: 89 c1 mov %eax,%ecx 5d: 8d 70 01 lea 0x1(%rax),%esi 60: 48 8b 0c cb mov (%rbx,%rcx,8),%rcx 64: 48 8b 34 f3 mov (%rbx,%rsi,8),%rsi 68: 62 f1 fd 48 6f 0c 11 vmovdqa64 (%rcx,%rdx,1),%zmm1 6f: 62 f3 f5 48 25 04 16 vpternlogq $0x96,(%rsi,%rdx,1),%zmm1,%zmm0 76: 96 77: 83 c0 02 add $0x2,%eax 7a: 39 f8 cmp %edi,%eax 7c: 72 dd jb 5b You could try unrolling by 512 bytes, which should help. - Eric