From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 15 Dec 2025 18:44:13 -0800
From: Andrew Morton
To: Ankur Arora
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
 david@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
 mingo@redhat.com, mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
 tglx@linutronix.de, willy@infradead.org, raghavendra.kt@amd.com,
 chleroy@kernel.org, ioworker0@gmail.com, boris.ostrovsky@oracle.com,
 konrad.wilk@oracle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
Message-Id: <20251215184413.19589400a74c2aadb42a2eca@linux-foundation.org>
In-Reply-To: <20251215204922.475324-8-ankur.a.arora@oracle.com>
References: <20251215204922.475324-1-ankur.a.arora@oracle.com>
 <20251215204922.475324-8-ankur.a.arora@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora wrote:

> Clear contiguous page ranges in folio_zero_user() instead of clearing
> a single page at a time. Exposing larger ranges enables extent-based
> processor optimizations.
>
> However, because the underlying clearing primitives do not, or might
> not be able to, call cond_resched() to check whether preemption is
> required, limit the worst-case preemption latency by doing the
> clearing in units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
>
> For architectures that define clear_pages(), we assume that the
> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
> worth of pages. This should be large enough to allow the processor
> to optimize the operation, yet small enough that we see reasonable
> preemption latency when this optimization is not possible
> (e.g. slow microarchitectures, memory bandwidth saturation).
>
> Architectures that don't define clear_pages() will continue to use
> the base value (single page). And preemptible models don't need
> invocations of cond_resched(), so they don't care about the batch
> size.
>
> The resultant performance depends on the kinds of optimizations
> available to the CPU for the region size being cleared. Two classes
> of optimizations:
>
>  - clearing iteration costs are amortized over a range larger
>    than a single page.
>  - cacheline allocation elision (seen on AMD Zen models).

8MB is a big chunk of memory.

> Testing a demand fault workload shows an improved baseline from the
> first optimization and a larger improvement when the region being
> cleared is large enough for the second optimization.
>
> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

So we break out of the copy to run cond_resched() 8192 times?  This
sounds like a minor cost.

> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>
>              page-at-a-time     contiguous clearing      change
>              (GB/s +- %stdev)   (GB/s +- %stdev)
>
>  pg-sz=2MB   12.92 +- 2.55%     17.03 +- 0.70%           + 31.8%  preempt=*
>
>  pg-sz=1GB   17.14 +- 2.27%     18.04 +- 1.05%           +  5.2%  preempt=none|voluntary
>  pg-sz=1GB   17.26 +- 1.24%     42.17 +- 4.21% [#]       +144.3%  preempt=full|lazy

And yet those 8192 cond_resched()'s have a huge impact on the
performance!  I find this result very surprising.  Is it explainable?

> [#] Notice that we perform much better with preempt=full|lazy. As
> mentioned above, preemptible models not needing explicit invocations
> of cond_resched() allow clearing of the full extent (1GB) as a
> single unit.
> In comparison, the maximum extent used for preempt=none|voluntary is
> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>
> The larger extent allows the processor to elide cacheline
> allocation (on Milan the threshold is LLC-size=32MB).

Is it this?

> Also as mentioned earlier, the baseline improvement is not specific to
> AMD Zen platforms.
> Intel Icelakex (pg-sz=2MB|1GB) sees a similar
> improvement as the Milan pg-sz=2MB workload above (~30%).
>