From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 13FC01C860C; Tue, 25 Feb 2025 22:59:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740524375; cv=none; b=Jz1To4dvbgPqL2CFII4IEkBYCH0uxtCSZR0ARiBbHLQlgRNLqSyITiCZVfcIoRx5gp/n+ISuMiytiojaP7oo+N7T5QZ5Hs03mif2dm9X0D/GYyz30uzZpKy0pKNGSb+thbkBzVdDXiR6rs52rmIBOGx9L6tC0ZCRkP/Dm4hngy4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740524375; c=relaxed/simple; bh=UutXvj5xcObwy2bXzQ6fauT5FEVlvk2zdLl/xwmVg00=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=AGdjQ3SZ2V2fdXLwZzzw7t9p63nyNKxZhApg07i3kenCjw5AIpvUa3uVxzQUqF1ZNlC7JtZuOeh3ec+IdLydaV1ki9kN3EopOOoS8XpEYSdSg7XeLIf4PoOrHHY9RQHhbZzNuy+2h/Rmuxsr4+QHPXScdXuGbSjR8en+VY1T0Eg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=rEh04ANg; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="rEh04ANg" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 555A3C4CEDD; Tue, 25 Feb 2025 22:59:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1740524374; bh=UutXvj5xcObwy2bXzQ6fauT5FEVlvk2zdLl/xwmVg00=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=rEh04ANgrCejG70O9NFdPZdOC5Gt8H+W3ZDZBQKufLa/XY0HD4RUhBJxqoaE3clLC GFEf1wzF5vKHI669T5uC6Ir1XjLQvcLFwC9jA+narvdPkYlFy03JzVKA5rGPfXk9bP ls8WU47uZKYtklajeWa+kfWUaw0V17LHhzU9Psr1qDCAeE7a93WrLl39u7f0BqjTi5 vJzlTA9VehJ9ZpOYpCQGvVe1jtyijnELp4Kl/O9pJAmMO+I24e1DR/jCybEhHorpsU tOsOv5g+eq0gXI8cmQ7Y+tFbi5/NWZKLas2XswoovAxLnsjBtTPO8pfkOlra+JBLS1 YegVcvR1QM0pA== Date: Tue, 25 Feb 2025 22:59:32 +0000 From: Eric Biggers To: David Laight Cc: Xiao Liang , x86@kernel.org, linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org, Ard Biesheuvel , Ben Greear , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , Andy Lutomirski Subject: Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs Message-ID: <20250225225932.GA2975818@google.com> References: <20250220051325.340691-1-ebiggers@kernel.org> <20250220051325.340691-2-ebiggers@kernel.org> <20250221193124.GA3790599@google.com> <20250225222133.395f7194@pumpkin> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250225222133.395f7194@pumpkin> On Tue, Feb 25, 2025 at 10:21:33PM +0000, David Laight wrote: > > As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your > > patch from last year > > https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@gmail.com/ > > ostensibly did that. However, I found some bugs in it; e.g., it didn't take > > into account that struct fpu is variable-length. So it didn't turn out as > > simple as that patch made it seem. Just extending fpregs_{lock,unlock}() to > > kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids > > increasing the memory usage of the kernel. So I thought I'd propose that first. > > Since many kernel users don't want the traditional fpu, they just need to use > an instruction that requires an AVX register or two, is it possible for code > to specify a small save area for just two or four registers and then use just > those registers? (so treating then all as caller-saved). > I know that won't work with anything that affects the fpu status register, > but if you want a single wide register for a PCIe read (to generate a big TLP) > it is more than enough. > > I'm sure there are horrid pitfalls, especially if IPI are still used to for > deferred save of fpu state. I'm afraid that's not an accurate summary of what uses the vector registers in kernel mode. The main use case is crypto, and most of the crypto code uses a lot of vector registers. Some of the older crypto code uses at most 8 vector registers (xmm0-xmm7) for 32-bit compatibility, but newer code uses 16 or even up to 32 YMM or ZMM registers. The new AES-GCM code for example uses all 32 vector registers, and the new AES-XTS code uses 30. In general, taking full advantage of the vector register set improves performance, and the trend has very much been towards using more registers -- not fewer. (And the registers have been getting larger too!) AES by itself tends to need about 8 registers to take advantage of the CPU's full AES throughput, but there are other computations like GHASH or tweak computation that need to be interleaved with AES, using more registers. And various constants and round keys can be cached in registers to improve performance. If we had to save/restore a large number of vector registers in every crypto function call (not amortized to one save/restore per return to userspace), that would be a big performance problem. Most of the crypto code certainly could be written to use fewer registers. But it would reduce performance, especially if we tried to squeeze it down to use a really small number of registers like 2-4. Plus any such efforts would complicate efforts to port crypto code between the kernel and userspace, as userspace does not have such constraints on the number of registers. - Eric