From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 12 Mar 2026 10:50:36 +0100
From: "David Hildenbrand (Arm)"
Subject: Re: [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
To: Benjamin Lee McQueen, Andrew Morton, Michal Hocko, Lorenzo Stoakes
Cc: "Liam R . Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20260305043038.2176-1-mcq@disroot.org>
In-Reply-To: <20260305043038.2176-1-mcq@disroot.org>
User-Agent: Mozilla Thunderbird
MIME-Version: 1.0
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 3/5/26 05:30, Benjamin Lee McQueen wrote:
> the vmpressure window size has been fixed at 512 pages
> (SWAP_CLUSTER_MAX * 16) ever since the file's inception. a TODO in
> the file notes that the vmpressure window size should be scaled with
> machine size, similarly to vmstat's thresholds.
> 
> the problem with a fixed window size on large memory systems:
> 
> the window fills after 512 pages (SWAP_CLUSTER_MAX * 16) of scanned
> pages. on a 256GB system that is 0.00076% of total memory. the
> reclaimer works in chunks of 32 pages (SWAP_CLUSTER_MAX), so the
> window fills up after 16 reclaim cycles. here, a single bad reclaim
> cycle (or a few) that reports false info has a considerable effect
> on the scanned/reclaimed ratio, producing an incorrect reading.
> 
> a larger window, however, is *potentially* prone to some additional
> notification latency, as more pages must be scanned before the ratio
> is calculated.
> 
> this is what we consider a false positive: a notification that
> doesn't correctly represent the current sustained memory pressure.
> 
> as for why false positives are bad: applications, or perhaps even
> the system itself, listening for these notifications are woken up
> unnecessarily and may perform actions that shouldn't happen, instead
> of staying in coherence with the actual sustained memory pressure.
> 
> i did some testing as well.
> 
> the testing was performed on ONLY a 9GB VM and nothing else.
> window sizes corresponding to larger machine memory were set
> manually via a debugfs knob, so my testing may have been skewed, as
> i only had 9GB on the VM and this doesn't correctly reflect larger
> systems, but it is the best i am able to do.
> 
> vmpressure_calc_level() was instrumented with a tracepoint emitting
> the raw pressure value. a controlled workload (2200MB allocation
> into a 2000MB cgroup with a 500MB swap cap) was run at each window
> size.
> 1000 pressure samples were collected per run, with a 50 sample
> warmup discarded, repeated 5 times per window size.
> 
> the key metrics are stddev and cv% (coefficient of variation:
> stddev divided by mean pressure, expressed as a percentage). cv% is
> load-independent, so it is a better measurement than stddev alone.
> a high cv% means the pressure signal is noisy relative to its own
> average; essentially, the readings are unpredictable and unreliable.
> a low cv% means the signal is stable and trustworthy.
> 
> do take the data with a grain of salt, as i probably didn't test
> rigorously. this patch still needs to be tested on larger memory
> systems with real workloads. but if you think there is a better way
> for me or others to test this, PLEASE REACH OUT!
> 
> Window   RAM Equiv   avg stddev   avg cv%
>    512   stock            45.86    91.24%
>   1024   4GB              34.62    69.28%
>   1792   8GB               4.03     7.97%
>   2304   32GB              9.90    18.53%
>   2560   64GB              9.95    18.59%
>   3072   256GB            11.49    20.99%
> 
> the results show an improvement in signal quality as window size
> increases. stock at 512 pages shows a cv% of 91.24%, meaning the
> noise in the pressure signal is nearly as large as the signal
> itself; the readings are essentially unpredictable. at the 8GB
> equivalent window (1792 pages) cv% drops to 7.97%, an 11x
> improvement in signal stability.
> 
> the data is consistent across 25 independent runs per window size
> (5 sweeps of 5 runs each). stddev and cv% barely move between
> sweeps, which gives me confidence the measurement is real and not
> an artifact of system state.
> 
> stddev increases slightly beyond the 8GB equivalent window, from
> 4.03 at win=1792 up to 11.49 at win=3072. this is expected and may
> also be an artifact of testing on a 9GB machine rather than real
> large-memory hardware. even at the 256GB equivalent window cv% is
> 20.99%; still a 4x improvement over stock's 91.24%.
>
> since i only have a 9GB VM, i'm setting the window size manually to
> simulate larger machines, but the actual reclaim behavior of a 9GB
> system doesn't match what a real 256GB machine would do. on a real
> large-memory machine the reclaimer has proportionally more work to
> do and the window would fill with more representative data. this is
> another reason why testing on real large-memory hardware is needed.
>
> the formula itself isn't quite like vmstat's threshold calculation,
> and uses total machine memory size (RAM), because reclaim costs grow
> with RAM, not with CPU count or any other property of the system.
> the formula's floor clause also preserves the existing 512 page
> window on smaller systems (up to about 512MB), so only larger
> systems are affected.
>
> if there are any other questions, i can try to answer them.
>
> IF YOU CAN TEST OR COME UP WITH BETTER METHODS PLEASE REACH OUT!
>
> Signed-off-by: Benjamin Lee McQueen
> ---
>  mm/vmpressure.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 3fbb86996c4d..0154df4d754e 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -10,6 +10,7 @@
>   */
>  
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -29,14 +30,19 @@
>   * sizes can cause lot of false positives, but too big window size will
>   * delay the notifications.
>   *
> - * As the vmscan reclaimer logic works with chunks which are multiple of
> - * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
> - *
> - * TODO: Make the window size depend on machine size, as we do for vmstat
> - * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
> + * As of now, we use a logarithmic scale to scale the window based on
> + * machine RAM size.
>   */
> -static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static unsigned long vmpressure_win;
> +
> +static int __init vmpressure_win_init(void)
> +{
> +	unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
> +
> +	vmpressure_win = SWAP_CLUSTER_MAX * max(16UL, (unsigned long)fls64(mem) * 8UL);
> +	return 0;
> +}
> +core_initcall(vmpressure_win_init);
>  
>  /*
>   * These thresholds are used when we account memory pressure through
>   * scanned/reclaimed ratio. The current values were chosen empirically. In

How does this interact with memory hotplug adding a lot of memory?

Just imagine you have a 4G VM and hotplug 128GB or more. Would we want
to get notified and adjust vmpressure_win dynamically?

-- 
Cheers,

David