From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 15 Dec 2025 18:44:13 -0800
From: Andrew Morton
To: Ankur Arora
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
 david@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
 mingo@redhat.com, mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
 tglx@linutronix.de, willy@infradead.org, raghavendra.kt@amd.com,
 chleroy@kernel.org, ioworker0@gmail.com, boris.ostrovsky@oracle.com,
 konrad.wilk@oracle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
Message-Id: <20251215184413.19589400a74c2aadb42a2eca@linux-foundation.org>
In-Reply-To: <20251215204922.475324-8-ankur.a.arora@oracle.com>
References: <20251215204922.475324-1-ankur.a.arora@oracle.com>
 <20251215204922.475324-8-ankur.a.arora@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora wrote:

> Clear contiguous page ranges in folio_zero_user() instead of clearing
> a single page at a time. Exposing larger ranges enables extent-based
> processor optimizations.
>
> However, because the underlying clearing primitives do not, or might
> not be able to, call cond_resched() to check whether preemption is
> required, limit the worst-case preemption latency by doing the
> clearing in units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
>
> For architectures that define clear_pages(), we assume that the
> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
> worth of pages. This should be large enough to allow the processor
> to optimize the operation, yet small enough that we see reasonable
> preemption latency when this optimization is not possible
> (e.g. slow microarchitectures, memory bandwidth saturation).
>
> Architectures that don't define clear_pages() will continue to use
> the base value (single page). And preemptible models don't need
> invocations of cond_resched(), so they don't care about the batch
> size.
>
> The resultant performance depends on the kinds of optimizations
> available to the CPU for the region size being cleared. Two classes
> of optimizations:
>
>  - clearing iteration costs are amortized over a range larger
>    than a single page.
>  - cacheline allocation elision (seen on AMD Zen models).

8MB is a big chunk of memory.

> Testing a demand fault workload shows an improved baseline from the
> first optimization and a larger improvement when the region being
> cleared is large enough for the second optimization.
>
> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

So we break out of the copy to run cond_resched() 8192 times?  This
sounds like a minor cost.

> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>
>              page-at-a-time     contiguous clearing      change
>              (GB/s +- %stdev)   (GB/s +- %stdev)
>
>  pg-sz=2MB   12.92 +- 2.55%     17.03 +- 0.70%           + 31.8%  preempt=*
>
>  pg-sz=1GB   17.14 +- 2.27%     18.04 +- 1.05%           +  5.2%  preempt=none|voluntary
>  pg-sz=1GB   17.26 +- 1.24%     42.17 +- 4.21% [#]       +144.3%  preempt=full|lazy

And yet those 8192 cond_resched()'s have a huge impact on the
performance!  I find this result very surprising.  Is it explainable?

> [#] Notice that we perform much better with preempt=full|lazy. As
> mentioned above, preemptible models not needing explicit invocations
> of cond_resched() allow clearing of the full extent (1GB) as a
> single unit.
> In comparison, the maximum extent used for preempt=none|voluntary is
> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>
> The larger extent allows the processor to elide cacheline
> allocation (on Milan the threshold is LLC-size=32MB).

Is it this?

> Also as mentioned earlier, the baseline improvement is not specific to
> AMD Zen platforms.
> Intel Icelakex (pg-sz=2MB|1GB) sees a similar
> improvement as the Milan pg-sz=2MB workload above (~30%).
>