From: Byungchul Park <byungchul@sk.com>
To: Phil Auld <pauld@redhat.com>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kernel_team@skhynix.com, akpm@linux-foundation.org
Subject: Re: [PATCH] sched/numa, mm: do not promote folios to nodes not set N_MEMORY
Date: Fri, 16 Feb 2024 16:45:51 +0900 [thread overview]
Message-ID: <20240216074551.GD32626@system.software.com> (raw)
In-Reply-To: <20240214200318.GA74174@lorien.usersys.redhat.com>
On Wed, Feb 14, 2024 at 03:03:18PM -0500, Phil Auld wrote:
> On Wed, Feb 14, 2024 at 07:31:37AM -0500 Phil Auld wrote:
> > Hi,
> >
> > On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> > > While running qemu with a configuration where some CPUs don't have their
> > > local memory and with a kernel numa balancing on, the following oops has
> > > been observed. It's because of null pointers of ->zone_pgdat of zones of
> > > those nodes that are not initialized at booting time. So should avoid
> > > nodes not set N_MEMORY from getting promoted.
> > >
> > > > BUG: unable to handle page fault for address: 00000000000033f3
> > > > #PF: supervisor read access in kernel mode
> > > > #PF: error_code(0x0000) - not-present page
> > > > PGD 0 P4D 0
> > > > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > > > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > > > Code: (omitted)
> > > > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > > > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > > > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > > > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > > > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > > > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > PKRU: 55555554
> > > > Call Trace:
> > > > <TASK>
> > > > ? __die
> > > > ? page_fault_oops
> > > > ? __pte_offset_map_lock
> > > > ? exc_page_fault
> > > > ? asm_exc_page_fault
> > > > ? wakeup_kswapd
> > > > migrate_misplaced_page
> > > > __handle_mm_fault
> > > > handle_mm_fault
> > > > do_user_addr_fault
> > > > exc_page_fault
> > > > asm_exc_page_fault
> > > > RIP: 0033:0x55b897ba0808
> > > > Code: (omitted)
> > > > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> > > > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> > > > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> > > > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> > > > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> > > > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> > > > </TASK>
> > > > Modules linked in:
> > > > CR2: 00000000000033f3
> > > > ---[ end trace 0000000000000000 ]---
> > > > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > > > Code: (omitted)
> > > > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > > > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > > > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > > > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > > > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > > > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > PKRU: 55555554
> > > > note: masim[895] exited with irqs disabled
> >
> > I think you could trim the down a little bit.
> >
> >
> > >
> > > Signed-off-by: Byungchul Park <byungchul@sk.com>
> > > Reported-by: hyeongtak.ji@sk.com
> > > ---
> > > kernel/sched/fair.c | 17 +++++++++++++++++
> > > 1 file changed, 17 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d7a3c63a2171..6d215cc85f14 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1828,6 +1828,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> > > int dst_nid = cpu_to_node(dst_cpu);
> > > int last_cpupid, this_cpupid;
> > >
> > > + /*
> > > + * A node of dst_nid might not have its local memory. Promoting
> > > + * a folio to the node is meaningless. What's even worse, oops
> > > + * can be observed by the null pointer of ->zone_pgdat in
> > > + * various points of the code during migration.
> > > + *
> >
> > > + * For instance, oops has been observed at CPU2 while qemu'ing:
> > > + *
> > > + * {qemu} \
> > > + * -numa node,nodeid=0,mem=1G,cpus=0-1 \
> > > + * -numa node,nodeid=1,cpus=2-3 \
> > > + * -numa node,nodeid=2,mem=8G \
> > > + * ...
> >
> > This part above should probably be in the commit message not in the code.
> > The first paragraph of comment is plenty.
> >
> > Otherwise, I think the check probably makes sense.
> >
>
> Actually, after looking at the memory.c code I wonder if this check should
> not be made farther up in the numa migrate machinery.
First of all, we cannot avoid hinting fault. It's because no one knows
which node a task eventually runs on until a hinting fault occurs. So
should let it go get hinting fault *and then* we can make the decision
if we can migrate the folio or not. Assuming that, IMHO,
should_numa_migrate_memory() is a good place to make it.
Thoughts? Am I missing something?
Byungchul
> Cheers,
> Phil
>
> >
> > Cheers,
> > Phil
> >
> > > + */
> > > + if (!node_state(dst_nid, N_MEMORY))
> > > + return false;
> > > +
> > > /*
> > > * The pages in slow memory node should be migrated according
> > > * to hot/cold instead of private/shared.
> > > --
> > > 2.17.1
> > >
> > >
> >
> > --
> >
> >
>
> --
next prev parent reply other threads:[~2024-02-16 7:46 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-14 3:53 [PATCH] sched/numa, mm: do not promote folios to nodes not set N_MEMORY Byungchul Park
2024-02-14 12:31 ` Phil Auld
2024-02-14 20:03 ` Phil Auld
2024-02-16 7:45 ` Byungchul Park [this message]
2024-02-16 5:26 ` Byungchul Park
2024-02-14 21:13 ` Oscar Salvador
2024-02-16 7:07 ` Byungchul Park
2024-02-16 7:52 ` Oscar Salvador
2024-02-16 9:11 ` Byungchul Park
2024-02-16 9:23 ` Byungchul Park
2024-02-16 11:26 ` Byungchul Park
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240216074551.GD32626@system.software.com \
--to=byungchul@sk.com \
--cc=akpm@linux-foundation.org \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=juri.lelli@redhat.com \
--cc=kernel_team@skhynix.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=pauld@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.