Date: Thu, 13 Jun 2024 10:27:38 +0900
From: Byungchul Park
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, kernel_team@skhynix.com, hannes@cmpxchg.org,
 iamjoonsoo.kim@lge.com, rientjes@google.com
Subject: Re: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
Message-ID: <20240613012738.GA2327@system.software.com>
References: <20240604072323.10886-1-byungchul@sk.com>
 <87bk4hcf7h.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20240604084533.GA68919@system.software.com>
 <8734ptccgi.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20240605015021.GB75311@system.software.com>
 <87tti8b10g.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20240605021902.GC75311@system.software.com>
 <20240607071228.GA76933@system.software.com>
In-Reply-To: <20240607071228.GA76933@system.software.com>
On Fri, Jun 07, 2024 at 04:12:28PM +0900, Byungchul Park wrote:
> On Wed, Jun 05, 2024 at 11:19:02AM +0900, Byungchul Park wrote:
> > On Wed, Jun 05, 2024 at 10:02:07AM +0800, Huang, Ying wrote:
> > > Byungchul Park writes:
> > > 
> > > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> > > >> Byungchul Park writes:
> > > >> 
> > > >> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> > > >> >> Byungchul Park writes:
> > > >> >> 
> > > >> >> > Changes from v1:
> > > >> >> >    1. Don't allow kswapd to resume if the system is under memory
> > > >> >> >       pressure that might affect direct reclaim by any chance, like
> > > >> >> >       if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> > > >> >> > 
> > > >> >> > --->8---
> > > >> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > > >> >> > From: Byungchul Park
> > > >> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > > >> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be
> > > >> >> >  hopeless but may not now
> > > >> >> > 
> > > >> >> > A system should run with kswapd running in the background when under
> > > >> >> > memory pressure, such as when the available memory level is below the
> > > >> >> > low watermark and there are reclaimable folios.
> > > >> >> > 
> > > >> >> > However, the current code lets the system run with kswapd stopped once
> > > >> >> > kswapd has been stopped after more than MAX_RECLAIM_RETRIES failures,
> > > >> >> > leaving everything to direct reclaim, even if there are reclaimable
> > > >> >> > folios that kswapd could reclaim. This case was observed in the
> > > >> >> > following scenario:
> > > >> >> > 
> > > >> >> >    CONFIG_NUMA_BALANCING enabled
> > > >> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > > >> >> >    numa node0 (500GB local DRAM, 128 CPUs)
> > > >> >> >    numa node1 (100GB CXL memory, no CPUs)
> > > >> >> >    swap off
> > > >> >> > 
> > > >> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> > > >> >> >    2) Continue adding the same workload to the system.
> > > >> >> >    3) The anon pages are placed in node0 by promotion/demotion.
> > > >> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> > > >> >> >    5) Kill the memory hoggers to restore the system.
> > > >> >> > 
> > > >> >> > After restoring the system at 5), the system starts to run without
> > > >> >> > kswapd. Even worse, the tiering mechanism is no longer able to work,
> > > >> >> > since it relies on kswapd for demotion.
> > > >> >> 
> > > >> >> We have run into the situation that kswapd is kept in the failure state
> > > >> >> for long in a multi-tier system. I think that your solution is too
> > > >> > 
> > > >> > My solution just gives a chance for kswapd to work again, even if
> > > >> > kswapd_failures >= MAX_RECLAIM_RETRIES, when there are potentially
> > > >> > reclaimable folios. That's it.
> > > >> > 
> > > >> >> limited, because OOM killing may not happen, while the access pattern of
> > > >> > 
> > > >> > I don't get this. OOM will happen as is, through direct reclaim.
> > > >> 
> > > >> A system that fails to reclaim via kswapd may succeed to reclaim via
> > > >> direct reclaim, because more CPUs are used to scan the page tables.
> > > > 
> > > > Honestly, I don't think so with this description.
> > > > 
> > > > The fact that the system hit MAX_RECLAIM_RETRIES means the system is
> > > > currently hopeless unless folios are reclaimed in a stronger way, by
> > > > *direct reclaim*. The solution for this situation should not be about
> > > > letting more CPUs participate in reclaiming, again, *at least in this
> > > > situation*.
> > > > 
> > > > What you described here is true only in a normal state, where the more
> > > > CPUs work on reclaiming, the more reclaimable folios can be reclaimed.
> > > > kswapd can be a helper *only* when there are kswapd-reclaimable folios.
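
To make the v2 gate quoted above concrete: the condition in the changelog
amounts to something like the sketch below. This is an illustration only,
not the actual patch; kswapd_resume_allowed() is a made-up name, while
zone_page_state() and the watermark helpers are the existing ones from
include/linux/vmstat.h and include/linux/mmzone.h.

    /* Illustrative sketch of the v2 condition, not the real patch. */
    static bool kswapd_resume_allowed(struct zone *zone)
    {
        unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
        unsigned long min_wmark = min_wmark_pages(zone);
        unsigned long low_wmark = low_wmark_pages(zone);

        /*
         * If free memory has already fallen below the midpoint between
         * the min and low watermarks, direct reclaim may be imminent,
         * so do not clear the kswapd failure state.
         */
        return free >= (min_wmark + low_wmark) / 2;
    }
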
> > > Sometimes, we cannot reclaim just because we don't scan fast enough,
> > > so the Accessed-bit is set again during scanning. With more CPUs, we
> > > can scan faster and so make some progress. But, yes, this only covers
> > > one situation; there are other situations too.
> > 
> > What I mean is that *the issue we try to solve* is not a situation that
> > can be solved by letting more CPUs participate in reclaiming.
> 
> Again, in the situation where kswapd has failed more than
> MAX_RECLAIM_RETRIES times, that is, a hopeless one, I don't think it
> makes sense to wake up kswapd every 10 seconds. It'd be more sensible to
> wake up kswapd only if there are *at least potentially* reclaimable
> folios.

1) numa balancing tiering on

No doubt the patch should work for it, since numa balancing tiering
doesn't work at all once kswapd stops. We are already applying and using
this patch in tests for tiering. It works perfectly.

2) numa balancing tiering off

kswapd will be resumed even without this patch once free memory hits the
min wmark and direct reclaim succeeds (see the snippet at the bottom of
this mail). However, do we have to wait for direct reclaim to do that
work, when we could proactively avoid direct reclaim by using kswapd?

Byungchul

> As Ying said, there's no way to precisely track whether folios are
> reclaimable, but it's worth trying when the possibility becomes positive
> and looks more reasonable. Thoughts?
> 
> Byungchul
> 
> > Byungchul
> > 
> > > --
> > > Best Regards,
> > > Huang, Ying
> > > 
> > > > Byungchul
> > > > 
> > > >> In a system with NUMA-balancing-based page promotion and demotion
> > > >> enabled, page promotion will wake up kswapd, but kswapd may fail in
> > > >> some situations. But page promotion will not trigger direct reclaim
> > > >> or OOM.
> > > >> 
> > > >> >> the workloads may change. We have a preliminary and simple solution
> > > >> >> for this as follows,
> > > >> >> 
> > > >> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> > > >> > 
> > > >> > Whether tiering is involved or not, the same problem can arise if
> > > >> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> > > >> 
> > > >> Your description is about tiering too. Can you describe a situation
> > > >> without tiering?
> > > >> 
> > > >> --
> > > >> Best Regards,
> > > >> Huang, Ying
> > > >> 
> > > >> > Byungchul
> > > >> > 
> > > >> >> where we will try to wake up kswapd every 10 seconds to check if
> > > >> >> kswapd is in the failure state. This is another possible solution.
> > > >> >> 
> > > >> >> > However, node0 has pages newly allocated after 5), which might or
> > > >> >> > might not be reclaimable. Since those are potentially reclaimable,
> > > >> >> > it's worth hopefully trying reclaim by allowing kswapd to work
> > > >> >> > again.
> > > >> >> > 
> > > >> >> 
> > > >> >> [snip]
> > > >> >> 
> > > >> >> --
> > > >> >> Best Regards,
> > > >> >> Huang, Ying
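
For reference, the existing mainline behavior referred to above under
"2) numa balancing tiering off" is, as far as I recall, roughly the
following (paraphrased and trimmed from mm/vmscan.c of a recent kernel,
not quoted verbatim):

    /* mm/vmscan.c: wakeup_kswapd() */
    /* Hopeless node, leave it to direct reclaim if possible. */
    if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
        return;

    /* mm/vmscan.c: shrink_node() */
    /*
     * On reclaim progress, reset the failure counter.  A successful
     * direct reclaim run is what revives a dormant kswapd.
     */
    if (reclaimable)
        pgdat->kswapd_failures = 0;

So today a hopeless node stays hopeless until direct reclaim has already
kicked in and made progress; the question above is whether we really want
to wait that long.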