From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from invmail4.hynix.com (exvmail4.hynix.com [166.125.252.92]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 0A6FC81F for ; Fri, 24 May 2024 00:47:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=166.125.252.92 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716511672; cv=none; b=c56otq0N5WjotXhonFyh/fRTwymhwBIiTAPjmRIDXSGXXLcMsQs3MhM7wefaGQvuTj0N66RNF7xEJlKV1K1qB424t93/uipdWuBE/DPI/tWNkb1IE0CKluLrBhlsGcCEty7CsWUKIcBv9R3FsGAr0uW4Dj9k782j/ZNjc/TL+ok= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716511672; c=relaxed/simple; bh=9TiQ1NawlgxMFkjnvp0VImY+CBDxe13HEElO1pLM4yY=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=lRWNNh1YLuXh0p68KxUuZGNcrDM+d/W6F66/lemZHQUqig7sVUzk9ue1D3/xsS/K4rLdykbh5QA/KJihq29/gTiX5Gn7Na+SLzUPnC/uAQP8Z0DP1DN4e72Rnx4KmA3zWN2wE2FtS3Y4Rro2HPZi9se1d6Oq+ReM3zm4y/03G2E= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=sk.com; spf=pass smtp.mailfrom=sk.com; arc=none smtp.client-ip=166.125.252.92 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=sk.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=sk.com X-AuditID: a67dfc5b-d6dff70000001748-40-664fe3a87ca5 Date: Fri, 24 May 2024 09:47:31 +0900 From: Byungchul Park To: Karim Manaouil Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kernel_team@skhynix.com Subject: Re: [PATCH] mm: let kswapd work again for node that used to be hopeless but may not now Message-ID: <20240524004731.GA78958@system.software.com> References: <20240523051406.81053-1-byungchul@sk.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.4 (2018-02-28) X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFjrILMWRmVeSWpSXmKPExsXC9ZZnke6Kx/5pBs/n6lvMWb+GzeLthO2M Fpd3zWGzuLfmP6sDi8fOWXfZPTZ9msTucWLGbxaPz5vkAliiuGxSUnMyy1KL9O0SuDLOnG1j KlhkXjF13SPmBsafWl2MnBwSAiYSU39+YoexJ35dzwZiswioSvxuWsoMYrMJqEvcuPETzBYR 0JGY8LQdqJ6Dg1kgS2J6axRIWFggSaLn3gYmEJtXwEJix4J1jCC2kECCxIZvPcwQcUGJkzOf sIDYzAJaEjf+vWSCGCMtsfwfB0iYU8BU4tqNxWAlogLKEge2HQcq4QK6bAKbxMYDm6HOlJQ4 uOIGywRGgVlIxs5CMnYWwtgFjMyrGIUy88pyEzNzTPQyKvMyK/SS83M3MQIDdlntn+gdjJ8u BB9iFOBgVOLhPaDvnybEmlhWXJl7iFGCg1lJhDd6pW+aEG9KYmVValF+fFFpTmrxIUZpDhYl cV6jb+UpQgLpiSWp2ampBalFMFkmDk6pBsbcxr6jDaEmSR5HL9zmqIy49NWiZP9Fm7li3Ndv 50YJGXsbzaluSkp9u8Rdrl9glrH4tiXP3x23lLPc9XFdlugXI4U1x346rGacJpC11qr2Xt23 hXPfnM25w7d29iOJXRmvagVSV53ql8rbddvms8UVftbAPQz2ETGXN37/X13eunK205TS485K LMUZiYZazEXFiQChxJnnVAIAAA== X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFlrBLMWRmVeSWpSXmKPExsXC5WfdrLvisX+aweT/chZz1q9hszg89ySr xdsJ2xktLu+aw2Zxb81/VgdWj52z7rJ7bPo0id3jxIzfLB6LX3xg8vi8SS6ANYrLJiU1J7Ms tUjfLoEr48zZNqaCReYVU9c9Ym5g/KnVxcjJISFgIjHx63o2EJtFQFXid9NSZhCbTUBd4saN n2C2iICOxISn7exdjBwczAJZEtNbo0DCwgJJEj33NjCB2LwCFhI7FqxjBLGFBBIkNnzrYYaI C0qcnPmEBcRmFtCSuPHvJRPEGGmJ5f84QMKcAqYS124sBisRFVCWOLDtONMERt5ZSLpnIeme hdC9gJF5FaNIZl5ZbmJmjqlecXZGZV5mhV5yfu4mRmD4Lav9M3EH45fL7ocYBTgYlXh4D+j7 pwmxJpYVV+YeYpTgYFYS4Y1e6ZsmxJuSWFmVWpQfX1Sak1p8iFGag0VJnNcrPDVBSCA9sSQ1 OzW1ILUIJsvEwSnVwLiu57xNe9zCmbnd9/5VZT/9omHDu7BptpKNpL3zv3UBj1uktm479v+T RHTRZkfN/M3sy2tuTupuNtTWf7+Ay7XkSOivS3XNzdtD7gU/27xc+3PQ9eBLM/y5AmdbsJpt u/VVjqtp5uPf/7bPadn5L6je+uzVuZlr3tZ8UePxXRApOGH604X6Rh+UWIozEg21mIuKEwER keXAOwIAAA== X-CFilter-Loop: Reflected On Thu, May 23, 2024 at 01:53:37PM +0100, Karim Manaouil wrote: > On Thu, May 23, 2024 at 02:14:06PM +0900, Byungchul Park wrote: > > I suffered from kswapd stopped in the following scenario: > > > > CONFIG_NUMA_BALANCING enabled > > sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING > > numa node0 (500GB local DRAM, 128 CPUs) > > numa node1 (100GB CXL memory, no CPUs) > > swap off > > > > 1) Run any workload using a lot of anon pages e.g. mmap(200GB). > > 2) Keep adding another workload using a lot of anon pages. > > 3) The DRAM becomes filled with only anon pages through promotion. > > 4) Demotion barely works due to severe memory pressure. > > 5) kswapd for node0 stops because of the unreclaimable anon pages. > > It's not very clear to me, but if I understand correctly, if you have I don't have free memory on CXL. > free memory on CXL, kswapd0 should not stop as long as demotion is kswapd0 stops because demotion barely works. > successfully migrating the pages from DRAM to CXL and returns that as > nr_reclaimed in shrink_folio_list()? > > If that's the case, kswapd0 is making progress and shouldn't give up. It's not the case. > If CXL memory is also filled and migration fails, then it doesn't make > sense to me to wake up kswapd0 as it obvisoly won't help with anything, It's true *only* when it won't help with anything. However, kswapd should work again once the system got back to normal e.g. by terminating the anon hoggers. I addressed this issue. > because, you guessed it, you have no memory in the first place!! > > > 6) Manually kill the memory hoggers. This is the point. Byungchul > > 7) kswapd is still stopped even though the system got back to normal. > > > > From now on, the system should run without reclaim service in background > > served by kswapd until direct reclaim will do for that. Even worse, > > tiering mechanism is no longer able to work because kswapd has stopped > > that the mechanism relies on. > > > > However, after 6), the DRAM will be filled with pages that might or > > might not be reclaimable, that depends on how those are going to be used. > > Since those are potentially reclaimable anyway, it's worth hopefully > > trying reclaim by allowing kswapd to work again if needed. > > > > Signed-off-by: Byungchul Park > > --- > > include/linux/mmzone.h | 4 ++++ > > mm/page_alloc.c | 12 ++++++++++++ > > mm/vmscan.c | 21 ++++++++++++++++----- > > 3 files changed, 32 insertions(+), 5 deletions(-) > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index c11b7cde81ef..7c0ba90ea7b4 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -1331,6 +1331,10 @@ typedef struct pglist_data { > > enum zone_type kswapd_highest_zoneidx; > > > > int kswapd_failures; /* Number of 'reclaimed == 0' runs */ > > + int nr_may_reclaimable; /* Number of pages that have been > > + allocated since considered the > > + node is hopeless due to too many > > + kswapd_failures. */ > > > > #ifdef CONFIG_COMPACTION > > int kcompactd_max_order; > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 14d39f34d336..1dd2daede014 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -1538,8 +1538,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order, > > static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, > > unsigned int alloc_flags) > > { > > + pg_data_t *pgdat = page_pgdat(page); > > + > > post_alloc_hook(page, order, gfp_flags); > > > > + /* > > + * New pages might or might not be reclaimable depending on how > > + * these pages are going to be used. However, since these are > > + * potentially reclaimable, it's worth hopefully trying reclaim > > + * by allowing kswapd to work again even if there have been too > > + * many ->kswapd_failures, if ->nr_may_reclaimable is big enough. > > + */ > > + if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) > > + pgdat->nr_may_reclaimable += 1 << order; > > + > > if (order && (gfp_flags & __GFP_COMP)) > > prep_compound_page(page, order); > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 3ef654addd44..5b39090c4ef1 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -4943,6 +4943,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control * > > done: > > /* kswapd should never fail */ > > pgdat->kswapd_failures = 0; > > + pgdat->nr_may_reclaimable = 0; > > } > > > > /****************************************************************************** > > @@ -5991,9 +5992,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > > * sleep. On reclaim progress, reset the failure counter. A > > * successful direct reclaim run will revive a dormant kswapd. > > */ > > - if (reclaimable) > > + if (reclaimable) { > > pgdat->kswapd_failures = 0; > > - else if (sc->cache_trim_mode) > > + pgdat->nr_may_reclaimable = 0; > > + } else if (sc->cache_trim_mode) > > sc->cache_trim_mode_failed = 1; > > } > > > > @@ -6636,6 +6638,11 @@ static void clear_pgdat_congested(pg_data_t *pgdat) > > clear_bit(PGDAT_WRITEBACK, &pgdat->flags); > > } > > > > +static bool may_recaimable(pg_data_t *pgdat, int order) > > +{ > > + return pgdat->nr_may_reclaimable >= 1 << order; > > +} > > + > > /* > > * Prepare kswapd for sleeping. This verifies that there are no processes > > * waiting in throttle_direct_reclaim() and that watermarks have been met. > > @@ -6662,7 +6669,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, > > wake_up_all(&pgdat->pfmemalloc_wait); > > > > /* Hopeless node, leave it to direct reclaim */ > > - if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) > > + if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES && > > + !may_recaimable(pgdat, order)) > > return true; > > > > if (pgdat_balanced(pgdat, order, highest_zoneidx)) { > > @@ -6940,8 +6948,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) > > goto restart; > > } > > > > - if (!sc.nr_reclaimed) > > + if (!sc.nr_reclaimed) { > > pgdat->kswapd_failures++; > > + pgdat->nr_may_reclaimable = 0; > > + } > > > > out: > > clear_reclaim_active(pgdat, highest_zoneidx); > > @@ -7204,7 +7214,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, > > return; > > > > /* Hopeless node, leave it to direct reclaim if possible */ > > - if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES || > > + if ((pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES && > > + !may_recaimable(pgdat, order)) || > > (pgdat_balanced(pgdat, order, highest_zoneidx) && > > !pgdat_watermark_boosted(pgdat, highest_zoneidx))) { > > /* > > -- > > 2.17.1 > > > >