[patch 0/2] mm: too_many_isolated can stall due to out of sync VM counters

From: Marcelo Tosatti <mtosatti@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Peter Xu <peterx@redhat.com>
Subject: [patch 0/2] mm: too_many_isolated can stall due to out of sync VM counters
Date: Mon, 13 Nov 2023 20:34:20 -0300
Message-ID: <20231113233420.446465795@redhat.com>

A customer reported processes hung in too_many_isolated(). Analysis
indicated that the hang was caused by out-of-sync per-CPU stats
(see below).

The fix is to use node_page_state_snapshot() to avoid the stale values.

    static unsigned long
    shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
                         struct scan_control *sc, enum lru_list lru)
    {
    :
            bool file = is_file_lru(lru);
    :
            struct pglist_data *pgdat = lruvec_pgdat(lruvec);
    :
            while (unlikely(too_many_isolated(pgdat, file, sc))) {
                    if (stalled)
                            return 0;

                    /* wait a bit for the reclaimer. */
                    msleep(100);   <--- some processes were sleeping here, with pending SIGKILL.
                    stalled = true;

                    /* We are about to die and free our memory. Return now. */
                    if (fatal_signal_pending(current))
                            return SWAP_CLUSTER_MAX;
            }

msleep() is reached only when too_many_isolated() reports too many
isolated pages:

    static int too_many_isolated(struct pglist_data *pgdat, int file,
                    struct scan_control *sc)
    {
    :
            if (file) {
                    inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
                    isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
            } else {
    :
            return isolated > inactive;

The return value was true since:

    crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
    $8 = {
      counter = 1
    }
    crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
    $9 = {
      counter = 2
    }

while the per-CPU stats had:

    crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
    $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
    crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
    $86 = 0xffff00917fcc32e0
    crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
    $87 = -1 '\377'

    crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
    $89 = 0xffff00917fe032e0
    crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
    $91 = -1 '\377'
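
To make the arithmetic above concrete, here is a minimal userspace
sketch (not the code from the patches) of a node counter kept as a
global value plus per-CPU deltas. The names node_counter_model,
NR_CPUS_MODEL, read_global, read_snapshot and too_many_isolated_model
are hypothetical. It assumes that only CPUs 42 and 44 hold pending
deltas for NR_ISOLATED_FILE and that NR_INACTIVE_FILE has none, which
the dump does not show either way.

```c
#include <assert.h>

#define NR_CPUS_MODEL 64	/* hypothetical; just large enough for CPUs 42 and 44 */

/* Simplified stand-in for one node counter: global value plus per-CPU deltas. */
struct node_counter_model {
	long global;				/* models pgdat->vm_stat[item] */
	signed char cpu_diff[NR_CPUS_MODEL];	/* models vm_node_stat_diff[item] */
};

/* Record a pending per-CPU delta that has not been folded into the global value. */
static void set_cpu_diff(struct node_counter_model *c, int cpu, signed char diff)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	c->cpu_diff[cpu] = diff;
}

/* Cheap read: global value only, pending per-CPU deltas are ignored. */
static unsigned long read_global(const struct node_counter_model *c)
{
	long x = c->global;

	return x < 0 ? 0 : (unsigned long)x;
}

/* Snapshot read: fold every CPU's pending delta into the result. */
static unsigned long read_snapshot(const struct node_counter_model *c)
{
	long x = c->global;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		x += c->cpu_diff[cpu];

	return x < 0 ? 0 : (unsigned long)x;
}

/* Same comparison as the last line of too_many_isolated() quoted above. */
static int too_many_isolated_model(unsigned long inactive, unsigned long isolated)
{
	return isolated > inactive;
}
```

With global = 2 and deltas of -1 on CPUs 42 and 44, read_global()
yields 2 while read_snapshot() yields 0. Against an inactive count of
1, the comparison is true for the stale value and false for the
snapshot, which is why the loop in shrink_inactive_list() kept
sleeping.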

It seems that processes were trapped in a direct reclaim/compaction loop
because free pages on these nodes were at or below the min watermark.

  crash> kmem -z | grep -A 3 Normal
  :
  NODE: 4  ZONE: 1  ADDR: ffff00817fffec40  NAME: "Normal"
    SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
    VM_STAT:
          NR_FREE_PAGES: 68
  --
  NODE: 5  ZONE: 1  ADDR: ffff00897fffec40  NAME: "Normal"
    SIZE: 118784  MIN/LOW/HIGH: 82/200/318
    VM_STAT:
          NR_FREE_PAGES: 45
  --
  NODE: 6  ZONE: 1  ADDR: ffff00917fffec40  NAME: "Normal"
    SIZE: 118784  MIN/LOW/HIGH: 82/200/318
    VM_STAT:
          NR_FREE_PAGES: 53
  --
  NODE: 7  ZONE: 1  ADDR: ffff00997fbbec40  NAME: "Normal"
    SIZE: 118784  MIN/LOW/HIGH: 82/200/318
    VM_STAT:
          NR_FREE_PAGES: 52
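
The comparison behind that observation can be spelled out with a small
sketch using only the NR_FREE_PAGES and MIN values from the kmem
output above. The struct and function names are hypothetical, and the
check is a simplification: it ignores allocation order and any
lowmem reserve, which the dump does not show.

```c
#include <assert.h>
#include <stddef.h>

/* One "Normal" zone sample taken from the kmem -z output above. */
struct zone_sample {
	int node;
	unsigned long free_pages;	/* NR_FREE_PAGES */
	unsigned long wmark_min;	/* MIN of MIN/LOW/HIGH */
};

static const struct zone_sample normal_zones[] = {
	{ 4, 68, 68 },
	{ 5, 45, 82 },
	{ 6, 53, 82 },
	{ 7, 52, 82 },
};

#define NR_SAMPLES (sizeof(normal_zones) / sizeof(normal_zones[0]))

/* Simplified check: is the zone at or below its min watermark? */
static int at_or_below_min(const struct zone_sample *z)
{
	assert(z != NULL);
	return z->free_pages <= z->wmark_min;
}

/* Count how many of the sampled zones are at or below min. */
static size_t count_at_or_below_min(void)
{
	size_t i, n = 0;

	for (i = 0; i < NR_SAMPLES; i++)
		n += at_or_below_min(&normal_zones[i]);

	return n;
}
```

All four sampled zones satisfy the check. Node 4 sits exactly at its
min watermark (68 free pages), while nodes 5 to 7 are below theirs.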


---

 include/linux/vmstat.h |    4 ++++
 mm/compaction.c        |    6 +++---
 mm/vmscan.c            |    8 ++++----
 mm/vmstat.c            |   28 ++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+), 7 deletions(-)
