From mboxrd@z Thu Jan  1 00:00:00 1970
From: Thomas Gleixner <tglx@kernel.org>
To: Zach O'Keefe <zokeefe@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>, "H. Peter Anvin" <hpa@zytor.com>,
 David Stevens <stevensd@google.com>, Pasha Tatashin
 <pasha.tatashin@soleen.com>, Linus Walleij <linus.walleij@linaro.org>,
 Will Deacon <willdeacon@google.com>, Quentin Perret <qperret@google.com>,
 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave
 Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, Andy Lutomirski
 <luto@kernel.org>, Xin Li <xin@zytor.com>, Peter Zijlstra
 <peterz@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, David
 Hildenbrand <david@kernel.org>, Lorenzo Stoakes <ljs@kernel.org>, "Liam R.
 Howlett" <Liam.Howlett@oracle.com>, Vlastimil Babka <vbabka@kernel.org>,
 Mike Rapoport <rppt@kernel.org>, Suren Baghdasaryan <surenb@google.com>,
 Michal Hocko <mhocko@suse.com>, Uladzislau Rezki <urezki@gmail.com>, Kees
 Cook <kees@kernel.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 00/13] Dynamic Kernel Stacks
In-Reply-To: <CAAa6QmTO=hhdJQa-ofSZ6wW0geLaEfWZumF6KmksxZqM3i33OA@mail.gmail.com>
References: <20260424191456.2679717-1-stevensd@google.com>
 <da9321ad-4198-494e-b9fa-30d69bd29be3@intel.com>
 <6369e5ce-74e3-4c68-8053-d7d7d21b6955@zytor.com>
 <dbeeea58-16cb-4383-b8e8-91a8ca84e88a@intel.com>
 <CAAa6QmRw6QLnVJ8+uvMV8ASreLXzSab5Jii3Ju11qCZYio6Few@mail.gmail.com>
 <c070c4d6-a570-4eea-aca0-72eed319a198@intel.com> <87pl1md7h0.ffs@fw13>
 <CAAa6QmSHBDeY0G=_N1P4dAAH917J7jerfZrWDfDd8w=8jH8nVw@mail.gmail.com>
 <87qzm2b39k.ffs@fw13>
 <CAAa6QmTO=hhdJQa-ofSZ6wW0geLaEfWZumF6KmksxZqM3i33OA@mail.gmail.com>
Date: Sun, 21 Jun 2026 01:34:47 +0200
Message-ID: <87mrwon5uw.ffs@fw13>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
> On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@kernel.org> wrote:
>> The #PF path is considered performance critical. But how much the
>> downgrade matters needs actual numbers to analyze under various workload
>> scenarios.
>
> Ya, that's my concern as well, as I don't have a good intuition for
> how perf critical kernel #PF is for real workloads. If this is your
> primary concern, I'll take that as a _good_ thing ; i.e. there's
> nothing architecturally stopping us from doing this downgrade safely.
> We'll still need the analysis, but that can be a later stage -- we're
> more than happy to get this data for all.

No. That's not a later stage optional requirement.

You have a PoC which works for you otherwise you wouldn't have posted
it. So you can trivially microbenchmark the costs of the
up/downgrade. And that's critical information for us but also for
you. If the costs are significant then you really have to think about
the tradeoffs.

Care to read Documentation/process/* carefully? It applies to you as it
applies to anyone else.

>> I've not seen numbers to that effect anywhere. The only numbers provided
>> are marketing material about the memory savings on a freshly booted idle
>> machine. There are _zero_ numbers about the actual real world savings,
>> but claims about the PETABYTE savings possible.
>>
>> Seriously?
>
> This is actually the most understood aspect. With O(100B) active tasks
> fleetwide at any point, it only takes an average savings of O(10KiB)
> per task to get to 1PiB. At least for our fleet, we know the % of
> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
> math confirms that we expect O(PiB) aggregate savings. The % of stacks
> requiring the full 16KiB is minuscule, but it still occurs at a rate
> higher than what we can tolerate for SO panics. Given the vast
> majority of stacks never exceed the first 4KiB, this enables the
> significant opportunity.

I know that the potential savings are well understood and my
understanding of math is sufficient to calculate how many tasks and how
much average saving it takes to save 1PiB on a fleet.

That's a no-brainer, but this is an aggregate saving, which sounds WOW
but does not tell much about anything else.

 1) What's the actual percentage of savings in relation to the overall
    memory?

 2) Does the saving allow you to get more stuff done on a machine, pack
    more threads on it?

 3) Can you actually downsize the memory on the machines?

 4) What is the performance tradeoff for that?

IOW, you fail to tell what the actual benefit of such an intrusive
change is. Just boasting an aggregate Petabyte number does not tell
anything at all.

Let me give you a trivial example with a scenario which I have access
to:

    256  CPUs
    256  GiB Memory
    64k  Threads

Let's assume the full saving of 12k per thread. That sums up to

      64k * 12k = 768MB of memory

which is 0.29% of the total 256 GiB of memory. Not as impressive as the
petabyte aggregate number, right?

The workload consumes about 80% of the overall memory and is already
constrained at close to 100% CPU utilization.

Now let's assume that the runtime overhead of this amounts to 1%. Then
this is a net loss.

Let me turn that around and use a made up example assuming the 1Mio
threads per compute unit taken from some reply in this thread.

Now the full saving of 12k per thread amounts to:

    1M * 12k = 12G

which is 4.7% of the overall available memory. Agreed that's a
substantial number.
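
For reference, the arithmetic of both scenarios can be reproduced with a
minimal Python sketch. The helper name stack_savings and the reading of
"k", "M" and "G" as binary units are assumptions made for illustration
only; nothing here comes from the patch series itself.

```python
# Assumption: "k", "M" and "G" in the text are read as binary units.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def stack_savings(threads, saved_per_thread, total_memory):
    """Return (bytes saved, saving as a percentage of total memory).

    Hypothetical helper: it only multiplies the thread count by the
    per-thread saving and relates the result to the machine's memory.
    """
    saved = threads * saved_per_thread
    return saved, 100.0 * saved / total_memory


# Scenario 1: 64k threads, the full 12k saving each, 256 GiB machine.
small = stack_savings(64 * 1024, 12 * KIB, 256 * GIB)

# Scenario 2: the made-up 1M threads case on the same amount of memory.
large = stack_savings(1024 * 1024, 12 * KIB, 256 * GIB)

print(f"64k threads: {small[0] // MIB} MiB saved, {small[1]:.2f}% of memory")
print(f"1M threads:  {large[0] // GIB} GiB saved, {large[1]:.1f}% of memory")
```

The sketch covers the memory side only. Whether the saving is a net win
still depends on the runtime overhead, which has to be measured.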

That 12G saving does not do anything in terms of hardware downsizing.

The only way that has a benefit is when the system is constrained by
overall memory consumption, but has quite some compute capacity left.

IOW, if 1M threads hit the memory limit that means that the savings in
kernel stack consumed memory allows you to add about 4% (~40k) more
threads. If that ups the CPU utilization accordingly then yes, I can see
the benefit. But TBH, if that's the case then you are trying to fix a
user space implementation problem in the kernel.

That said you really have to describe the scenarios where there is a
benefit and I do not buy this "fleet level" argument at all because
there is no single fleet which has a uniform workload distribution.

Aside from that, if your argument holds that there are only a few
scenarios which require a deep stack, then we are better off identifying
them and fixing them up rather than trying to hack around the occasional
insanity of deep stack usage by adding complexity for complexity's sake.

As you say that you have numbers from your fleet which confirm that the
vast majority of the stack depth is below 4k, you can surely figure out
which call chains are actually exceeding the limit.

I prefer to fix such shitty code and downgrade the stacksize in general
instead of papering over the underlying issues which probably have been
ignored for years if not decades.

Have you ever thought about that instead of adding complexity with a
dubious value?

Thanks,

        tglx