From: Thomas Gleixner
To: Zach O'Keefe
Cc: Dave Hansen, "H. Peter Anvin", David Stevens, Pasha Tatashin,
 Linus Walleij, Will Deacon, Quentin Perret, Ingo Molnar,
 Borislav Petkov, Dave Hansen, x86@kernel.org, Andy Lutomirski, Xin Li,
 Peter Zijlstra, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
 "Liam R. Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
 Michal Hocko, Uladzislau Rezki, Kees Cook,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
References: <20260424191456.2679717-1-stevensd@google.com>
 <6369e5ce-74e3-4c68-8053-d7d7d21b6955@zytor.com>
 <87pl1md7h0.ffs@fw13> <87qzm2b39k.ffs@fw13>
Date: Sun, 21 Jun 2026 01:34:47 +0200
Message-ID: <87mrwon5uw.ffs@fw13>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
> On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner wrote:
>> The #PF path is considered performance critical. But how much the
>> downgrade matters needs actual numbers to analyze under various workload
>> scenarios.
>
> Ya, that's my concern as well, as I don't have a good intuition for
> how perf critical kernel #PF is for real workloads. If this is your
> primary concern, I'll take that as a _good_ thing; i.e. there's
> nothing architecturally stopping us from doing this downgrade safely.
> We'll still need the analysis, but that can be a later stage -- we're
> more than happy to get this data for all.

No. That's not a later stage optional requirement.

You have a PoC which works for you, otherwise you wouldn't have posted
it. So you can trivially microbenchmark the costs of the up/downgrade.
And that's critical information for us but also for you. If the costs
are significant then you really have to think about the tradeoffs.

Care to read Documentation/process/* carefully? It applies to you as it
applies to anyone else.

>> I've not seen numbers to that effect anywhere. The only numbers provided
>> are marketing material about the memory savings on a freshly booted idle
>> machine. There are _zero_ numbers about the actual real world savings,
>> but claims about the PETABYTE savings possible.
>>
>> Seriously?
>
> This is actually the most understood aspect. With O(100B) active tasks
> fleetwide at any point, it only takes an average savings of O(10KiB)
> per task to get to 1PiB. At least for our fleet, we know the % of
> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
> math confirms that we expect O(PiB) aggregate savings. The % of stacks
> requiring the full 16KiB is minuscule, but it still occurs at a rate
> higher than what we can tolerate for SO panics. Given the vast
> majority of stacks never exceed the first 4KiB, this enables the
> significant opportunity.

I know that the potential savings are well understood and my
understanding of math is sufficient to calculate how many tasks and what
average saving it takes to save 1PiB on a fleet.

That's a no-brainer, but this is an aggregate saving, which sounds WOW
but does not tell much about anything else.

  1) What's the actual percentage of savings in relation to the overall
     memory?

  2) Does the saving allow you to get more stuff done on a machine, pack
     more threads on it?

  3) Can you actually downsize the memory on the machines?

  4) What is the performance tradeoff for that?

IOW, you fail to tell what the actual benefit of such an intrusive
change is. Just boasting an aggregate Petabyte number does not tell
anything at all.

Let me give you a trivial example with a scenario which I have access
to:

     256 CPUs
     256 GiB Memory
     64k Threads

Let's assume the full saving of 12k per thread.
That sums up to

     64k * 12k = 768MB

of memory, which is 0.29% of the total 256 GiB of memory. Not so
impressive as the petabyte aggregate number, right?

The workload consumes about 80% of the overall memory and is already
constrained at close to 100% CPU utilization. Now let's assume that the
runtime overhead of this amounts to 1%, then this is a net loss.

Let me turn that around and use a made up example assuming the 1M
threads per compute unit taken from some reply in this thread. Now the
full saving of 12k per thread amounts to:

     1M * 12k = 12G

which is 4.7% of the overall available memory. Agreed, that's a
substantial number.

That 12G saving does not do anything in terms of hardware downsizing.
The only way that has a benefit is when the system is constrained by
overall memory consumption, but has quite some compute capacity left.

IOW, if 1M threads hit the memory limit, that means that the savings in
kernel stack consumed memory allow you to add about 4% (~40k) more
threads. If that ups the CPU utilization accordingly then yes, I can see
the benefit.

But TBH, if that's the case then you are trying to fix a user space
implementation problem in the kernel.

That said, you really have to describe the scenarios where there is a
benefit, and I do not buy this "fleet level" argument at all because
there is no single fleet which has a uniform workload distribution.

Aside from that: if your argument holds that there are only a few
scenarios which require a deep stack, then we are better off identifying
them and fixing them up rather than trying to hack around the occasional
insanity of deep stack usage by adding complexity for complexity's sake.

As you say that you have numbers from your fleet which confirm that the
vast majority of the stack depth is below 4k, you can surely figure out
which call chains are actually exceeding the limit.
I prefer to fix such shitty code and downgrade the stack size in general
instead of papering over the underlying issues, which probably have been
ignored for years if not decades.

Have you ever thought about that instead of adding complexity of dubious
value?

Thanks,

        tglx
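
For anyone who wants to re-run the back-of-the-envelope numbers from the
two per-machine scenarios and the quoted fleet estimate, here is a
minimal sketch. It rests on assumptions which are not spelled out in the
mail: "64k" and "1M" are read as 2^16 and 2^20 threads, "12k" and "10KiB"
as binary units, "O(100B)" as exactly 1e11 tasks, and the 1M thread
example is assumed to run on the same 256 GiB machine. The helper name
stack_saving() is made up for illustration.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3
PIB = 1024 ** 5


def stack_saving(threads, saving_per_thread, total_memory):
    """Return (bytes saved, saving as a percentage of total memory)."""
    saved = threads * saving_per_thread
    return saved, 100.0 * saved / total_memory


# Scenario 1: 64k threads on a 256 GiB machine, 12 KiB saved per thread.
saved_small, pct_small = stack_saving(64 * KIB, 12 * KIB, 256 * GIB)

# Scenario 2: 1M threads, 12 KiB saved per thread.
# Assumption: same 256 GiB machine as in scenario 1.
saved_large, pct_large = stack_saving(1024 * KIB, 12 * KIB, 256 * GIB)

# Quoted fleet estimate: ~100 billion tasks saving ~10 KiB each.
# Assumption: O(100B) is taken as exactly 1e11 tasks.
fleet_saving_pib = 100e9 * 10 * KIB / PIB

if __name__ == "__main__":
    print(f"64k threads: {saved_small / MIB:.0f} MiB, {pct_small:.2f}% of memory")
    print(f"1M threads:  {saved_large / GIB:.0f} GiB, {pct_large:.2f}% of memory")
    print(f"fleet:       {fleet_saving_pib:.2f} PiB aggregate")
```

The point of putting the three numbers side by side is that the same
per-thread saving looks very different depending on whether it is
expressed as a fleet aggregate or as a fraction of one machine's memory.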