All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	hannes@cmpxchg.org, david@kernel.org, zhengqi.arch@bytedance.com,
	shakeel.butt@linux.dev, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, ziy@nvidia.com, matthew.brost@intel.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com,
	bingjiao@google.com, jonathan.cameron@huawei.com,
	pratyush.brahma@oss.qualcomm.com
Subject: Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Date: Wed, 4 Feb 2026 10:25:03 +0100	[thread overview]
Message-ID: <aYMQb9JTPXTutgvH@tiehlicka> (raw)
In-Reply-To: <CAC5umygW8PXnS5tix-DfujhPjrRjBaEKe8ojW=y5FmqhqBfurg@mail.gmail.com>

On Wed 04-02-26 11:07:03, Akinobu Mita wrote:
> 2026年2月2日(月) 22:11 Michal Hocko <mhocko@suse.com>:
> >
> > On Thu 29-01-26 09:40:17, Akinobu Mita wrote:
> > > 2026年1月28日(水) 7:00 Joshua Hahn <joshua.hahnjy@gmail.com>:
> > > >
> > > > > > > Therefore, it appears that the behavior of get_swappiness() is important
> > > > > > > in this issue.
> > > > > >
> > > > > > This is quite mysterious.
> > > > > >
> > > > > > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > > > > > it quite strange that the issue you mention above occurs regardless of whether
> > > > > > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > > > > > as before? Were these hangs similarly fixed by modifying the callsite in
> > > > > > get_swappiness?
> > > > >
> > > > > Good point.
> > > > > When MGLRU is disabled, changing only the behavior of can_demote()
> > > > > called by get_swappiness() did not solve the problem.
> > > > >
> > > > > Instead, the problem was avoided by changing only the behavior of
> > > > > can_demote() called by can_reclaim_anon_page(), without changing the
> > > > > behavior of can_demote() called from other places.
> > > > >
> > > > > > On a separate note, I feel a bit uncomfortable for making this the default
> > > > > > setting, regardless of whether there is swap space or not. Just as it is
> > > > > > easy to create a degenerate scenario where all memory is unreclaimable
> > > > > > and the system starts going into (wasteful) reclaim on the lower tiers,
> > > > > > it is equally easy to create a scenario where all memory is very easily
> > > > > > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > > > > > free up memory on the lower tiers.
> > > > > >
> > > > > > Reality is likely somewhere in between. And from my perspective, as long as
> > > > > > we have some amount of easily reclaimable memory, I don't think immediately
> > > > > > OOMing will be helpful for the system (and even if none of the memory is
> > > > > > easily reclaimable, we should still try doing something before killing).
> > > > > >
> > > > > > > > > The reason for this issue is that memory allocations do not directly
> > > > > > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > > > > > memory tier, it can always be reclaimed by demotion.
> > > > > >
> > > > > > This patch enforces that the opposite of this assumption is true; that even
> > > > > > if a target node has an underlying memory tier, it can never be reclaimed by
> > > > > > demotion.
> > > > > >
> > > > > > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > > > > > this new enforcement could be harmful to the system. What do you think?
> > > > >
> > > > > Thank you for the detailed explanation.
> > > > >
> > > > > I understand the concern regarding the current patch, which only
> > > > > checks the free memory of the demotion target node.
> > > > > I will explore a solution.
> > > >
> > > > Hello Akinobu, I hope you had a great weekend!
> > > >
> > > > I noticed something that I thought was worth flagging. It seems like the
> > > > primary addition of this patch, which is to check for zone_watermark_ok
> > > > across the zones, is already a part of should_reclaim_retry():
> > > >
> > > >     /*
> > > >      * Keep reclaiming pages while there is a chance this will lead
> > > >      * somewhere.  If none of the target zones can satisfy our allocation
> > > >      * request even if all reclaimable pages are considered then we are
> > > >      * screwed and have to go OOM.
> > > >      */
> > > >     for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
> > > >                 ac->highest_zoneidx, ac->nodemask) {
> > > >
> > > >         [...snip...]
> > > >
> > > >         /*
> > > >          * Would the allocation succeed if we reclaimed all
> > > >          * reclaimable pages?
> > > >          */
> > > >         wmark = __zone_watermark_ok(zone, order, min_wmark,
> > > >                 ac->highest_zoneidx, alloc_flags, available);
> > > >
> > > >         if (wmark) {
> > > >             ret = true;
> > > >             break;
> > > >         }
> > > >     }
> > > >
> > > > ... which is called in __alloc_pages_slowpath. I wonder why we don't already
> > > > hit this. It seems to do the same thing your patch is doing?
> > >
> > > I checked the number of calls and the time spent for several functions
> > > called by __alloc_pages_slowpath(), and found that time is spent in
> > > __alloc_pages_direct_reclaim() before reaching the first should_reclaim_retry().
> > >
> > > After a few minutes have passed and the debug code that automatically
> > > resets numa_demotion_enabled to false is executed, it appears that
> > > __alloc_pages_direct_reclaim() immediately exits.
> >
> > First of all is this MGLRU or traditional reclaim? Or both?
> 
> The behavior is almost the same whether MGLRU is enabled or not.
> However, one difference is that __alloc_pages_direct_reclaim() may be
> called multiple times when __alloc_pages_slowpath() is called, and
> should_reclaim_retry() also returns true several times.
> 
> This is probably because the watermark check in should_reclaim_retry()
> considers not only NR_FREE_PAGES but also NR_ZONE_INACTIVE_ANON and
> NR_ZONE_ACTIVE_ANON as potential free memory. (zone_reclaimable_pages())

Yes, seems like the same problem as with get_scan_count.

> The following is the increment of stats in /proc/vmstat from the start
> of the reproduction test until the problem occurred and
> numa_demotion_enabled was automatically reset by the debug code and
> OOM occurred a few minutes later:
> 
> workingset_nodes 578
> workingset_refault_anon 5054381
> workingset_refault_file 41502
> workingset_activate_anon 3003283
> workingset_activate_file 33232
> workingset_restore_anon 2556549
> workingset_restore_file 27139
> workingset_nodereclaim 3472
> pgdemote_kswapd 121684
> pgdemote_direct 23977
> pgdemote_khugepaged 0
> pgdemote_proactive 0
> pgsteal_kswapd 3480404
> pgsteal_direct 2602011
> pgsteal_khugepaged 74
> pgsteal_proactive 0
> pgscan_kswapd 93334262
> pgscan_direct 227649302
> pgscan_khugepaged 1232161
> pgscan_proactive 0
> pgscan_direct_throttle 18
> pgscan_anon 320480379
> pgscan_file 1735346
> pgsteal_anon 5828270
> pgsteal_file 254219

You can clearly see that the order of magnitute of pages scanned is
completely disproportional to pages actually reclaimed. So there is a
lot of scanning without any progress at all.

> > Then another thing I've noticed only now. There seems to be a layering
> > discrepancy (for traditional LRU reclaim) when get_scan_count which
> > controls the to-be-reclaimed lrus always relies on can_reclaim_anon_pages
> > while down the reclaim path shrink_folio_list tries to be more clever
> > and avoid demotion if it turns out to be inefficient.
> >
> > I wouldn't be surprised if get_scan_count predominantly (or even
> > exclusively) scanned anon LRUs only while increasing the reclaim
> > priority  (so essentially just checked all anon pages on the LRU list)
> > before concluding that it makes no sense. This can take quite some time
> > and in the worst case you could be recycling couple of page cache pages
> > remaining on the list to make small but sufficient progress to loop
> > around.
> >
> > So I think the first step is to make the demotion behavior consistent.
> > If demotion fails then it would probably makes sense to set sc->no_demotion
> > so that get_scan_count can learn from the reclaim feedback that
> > anonymous pages are not a good reclaim target in this situation. But the
> > whole reclaim path needs a careful review I am afraid.
> 
> If migrate_pages() in demote_folio_list() detects that it cannot
> migrate any folios and all calls to alloc_demote_folio() also fail
> (this is made possible by adding a few fields to migration_target_control),
> it sets sc->no_demotion to true, which also resolves the issue.
> 
>         migrate_pages(demote_folios, alloc_demote_folio, NULL,
>                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>                       &nr_succeeded);
>         if (!nr_succeeded && mtc.nr_alloc_tried > 0 &&
>                         (mtc.nr_alloc_tried == mtc.nr_alloc_failed)) {
>                 sc->no_demotion = 1;
>         }

This seems to low level place to make such a decision. Keep in mind that
shrink_list operates on SWAP_CLUSTER_MAX batches so the backoff could be
pre-mature. shrink_lruvec seems like a better place to make such a
decision but this really requires a deeper evaluation.

Anyway, it is good that we have a better understanding what is going on.
Thanks for confirming the theory.
-- 
Michal Hocko
SUSE Labs

      reply	other threads:[~2026-02-04  9:25 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-13  8:14 [PATCH v4 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
2026-01-13  8:14 ` [PATCH v4 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
2026-01-13  9:30   ` Pratyush Brahma
2026-01-13  8:14 ` [PATCH v4 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
2026-01-13  9:32   ` Pratyush Brahma
2026-01-13  8:14 ` [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
2026-01-13 13:40   ` Michal Hocko
2026-01-14 12:51     ` Akinobu Mita
2026-01-14 13:40       ` Michal Hocko
2026-01-14 17:49       ` Gregory Price
2026-01-15  0:40         ` Akinobu Mita
2026-01-22  0:32           ` Akinobu Mita
2026-01-22 16:38             ` Gregory Price
2026-01-26  1:57               ` Akinobu Mita
2026-01-27 21:21                 ` Gregory Price
2026-01-29  0:51                   ` Akinobu Mita
2026-01-29  2:48                     ` Gregory Price
2026-01-22 18:34       ` Joshua Hahn
2026-01-26  2:01         ` Akinobu Mita
2026-01-27 22:00           ` Joshua Hahn
2026-01-29  0:40             ` Akinobu Mita
2026-02-02 13:11               ` Michal Hocko
2026-02-02 13:15                 ` Michal Hocko
2026-02-04  2:07                 ` Akinobu Mita
2026-02-04  9:25                   ` Michal Hocko [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aYMQb9JTPXTutgvH@tiehlicka \
    --to=mhocko@suse.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akinobu.mita@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=bingjiao@google.com \
    --cc=byungchul@sk.com \
    --cc=david@kernel.org \
    --cc=gourry@gourry.net \
    --cc=hannes@cmpxchg.org \
    --cc=jonathan.cameron@huawei.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=matthew.brost@intel.com \
    --cc=pratyush.brahma@oss.qualcomm.com \
    --cc=rakie.kim@sk.com \
    --cc=rppt@kernel.org \
    --cc=shakeel.butt@linux.dev \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=weixugc@google.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.