From: Nico Pache <npache@redhat.com>
To: Michal Hocko <mhocko@suse.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
akpm@linux-foundation.org, shakeelb@google.com,
ktkhai@virtuozzo.com, shy828301@gmail.com, guro@fb.com,
vbabka@suse.cz, vdavydov.dev@gmail.com, raquini@redhat.com
Subject: Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
Date: Tue, 7 Dec 2021 16:34:21 -0500
Message-ID: <ade0ed9d-4d6e-d780-4a72-0dcdb2fb2309@redhat.com>
In-Reply-To: <Ya3WcYKcej8XEI0W@dhcp22.suse.cz>
On 12/6/21 04:22, Michal Hocko wrote:
> On Sun 05-12-21 22:33:38, Nico Pache wrote:
>> We have run into a panic caused by a shrinker allocation being attempted
>> on an offlined node.
>>
>> Our crash analysis has determined that the issue originates from trying
>> to allocate pages on an offlined node in expand_one_shrinker_info. This
>> function makes the incorrect assumption that we can allocate on any node.
>> To correct this we make sure we only iterate over online nodes.
>>
>> This assumption can lead to an incorrect address being assigned to ac->zonelist
>> in the following callchain:
>> __alloc_pages
>> -> prepare_alloc_pages
>> -> node_zonelist
>>
>> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>> {
>> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>> }
>> If the node is not online, the return of node_zonelist() evaluates to an
>> invalid pointer of 0x0 + offsetof(node_zonelists) + (1|0).
>>
>> This address is then dereferenced further down the callchain in:
>> prepare_alloc_pages
>> -> first_zones_zonelist
>> -> next_zones_zonelist
>> -> zonelist_zone_idx
>>
>> static inline int zonelist_zone_idx(struct zoneref *zoneref)
>> {
>> return zoneref->zone_idx;
>> }
>>
>> Leading the system to panic.
>
> Thanks for the analysis! Please also add an oops report so that this is
> easier to search for. It would be also interesting to see specifics
> about the issue. Why was the specific node !online in the first place?
> What architecture was this on?
Here is the Oops report. I will also add it to my commit message in the second
posting! This was on x86 btw, but it has also been hit on PPC64.
[ 362.179917] RIP: 0010:prepare_alloc_pages.constprop.0+0xc6/0x150
[ 362.186639] Code: 03 80 c9 80 83 e2 03 83 fa 01 0f 44 c1 41 89 04 24 c1 eb 0c
48 8b 55 08 83 e3 01 8b 75 1c 48 8b 7d 00 88 5d 20 48 85 d2 75 6b <3b> 77 08 72
66 48 89 7d 10 b8 01 00 00 00 48 83 c4 08 5b 5d 41 5c
[ 362.207604] RSP: 0018:ffffb4ba31427bc0 EFLAGS: 00010246
[ 362.213443] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000081
[ 362.221412] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[ 362.229380] RBP: ffffb4ba31427bf8 R08: 0000000000000000 R09: ffffb4ba31427bf4
[ 362.237347] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb4ba31427bf0
[ 362.245316] R13: 0000000000000002 R14: ffff9f2fb3788000 R15: 0000000000000078
[ 362.253285] FS: 00007fbc57bfd740(0000) GS:ffff9f4c7d780000(0000)
knlGS:0000000000000000
[ 362.262322] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 362.268739] CR2: 0000000000002088 CR3: 000002004cb58002 CR4: 00000000007706e0
[ 362.276707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 362.284675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 362.292644] PKRU: 55555554
[ 362.295669] Call Trace:
[ 362.298402] __alloc_pages+0x9d/0x210
[ 362.302501] kmalloc_large_node+0x40/0x90
[ 362.306988] __kmalloc_node+0x3ac/0x480
[ 362.311279] kvmalloc_node+0x46/0x80
[ 362.315276] expand_one_shrinker_info+0x84/0x190
[ 362.320437] prealloc_shrinker+0x166/0x1c0
[ 362.325015] alloc_super+0x2ba/0x330
[ 362.329011] ? __fput_sync+0x30/0x30
[ 362.333003] ? set_anon_super+0x40/0x40
[ 362.337288] sget_fc+0x6c/0x2f0
[ 362.340798] ? mqueue_create+0x20/0x20
[ 362.344992] get_tree_keyed+0x2f/0xc0
[ 362.349086] vfs_get_tree+0x25/0xb0
[ 362.352982] fc_mount+0xe/0x30
[ 362.356397] mq_init_ns+0x105/0x1a0
[ 362.360291] copy_ipcs+0x129/0x220
[ 362.364093] create_new_namespaces+0xa1/0x2e0
[ 362.368960] unshare_nsproxy_namespaces+0x55/0xa0
[ 362.374216] ksys_unshare+0x198/0x380
[ 362.378310] __x64_sys_unshare+0xe/0x20
[ 362.382595] do_syscall_64+0x3b/0x90
[ 362.386597] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 362.392247] RIP: 0033:0x7fbc57d14d7b
[ 362.396242] Code: 73 01 c3 48 8b 0d ad 70 0e 00 f7 d8 64 89 01 48 83 c8 ff c3
66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 01 00 00 0f 05 <48> 3d 01 f0
ff ff 73 01 c3 48 8b 0d 7d 70 0e 00 f7 d8 64 89 01 48
[ 362.417204] RSP: 002b:00007fbc4dc73f88 EFLAGS: 00000206 ORIG_RAX:
0000000000000110
[ 362.425664] RAX: ffffffffffffffda RBX: 0000560144b09578 RCX: 00007fbc57d14d7b
[ 362.433634] RDX: 0000000000000010 RSI: 00007fbc4dc73f90 RDI: 0000000008000000
[ 362.441602] RBP: 0000560144b095a0 R08: 0000000000000000 R09: 0000000000000000
[ 362.449573] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fbc4dc73f90
[ 362.457542] R13: 000000000000006a R14: 0000560144e7e5a0 R15: 00007fff5dec8e10
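The faulting address in the oops lines up with the analysis in the commit
message: with a NULL pgdat for the !online node, the computed zonelist pointer
is just a small offset, and the first zoneref dereference faults on it. A
minimal sketch of the chain (offsets are illustrative and config dependent,
not taken from this exact build):

    /* NODE_DATA(nid) is NULL for the never-onlined node, so: */
    zonelist = NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
            /* == NULL + offsetof(pg_data_t, node_zonelists)
             *         + gfp_zonelist(flags) * sizeof(struct zonelist) */

    /* first_zones_zonelist() -> next_zones_zonelist() then reads
     * zoneref->zone_idx through that bogus pointer, which is why CR2
     * above is a small value (0x2088) instead of a real address. */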
>
>> We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
>> and reparent_shrinker_deferred.
>>
>> Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
>> Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
>
> Normally I would split the fix as it is fixing two issues, one introduced
> in 4.19 and the other in 5.13.
Both of these commits touch the same function: one introduces it, the other
moves it to a separate file. But as Yang Shi pointed out, the better commit to
blame is 86daf94efb11 ("mm/memcontrol.c: allocate shrinker_map on appropriate
NUMA node"), which actually made the change that can send the allocator to an
!online node.
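For context, the relevant change in 86daf94efb11 was (roughly, paraphrased
rather than quoted from the actual hunk) switching the shrinker map allocation
from a plain kvzalloc() to a node-aware allocation, which is what first hands
a possibly-!online nid to the page allocator:

    -       map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
    +       map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);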
>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/vmscan.c | 8 ++++----
>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index fb9584641ac7..731564b61e3f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>> int nid;
>> int size = map_size + defer_size;
>>
>> - for_each_node(nid) {
>> + for_each_online_node(nid) {
>> pn = memcg->nodeinfo[nid];
>> old = shrinker_info_protected(memcg, nid);
>> /* Not yet online memcg */
>> @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>> struct shrinker_info *info;
>> int nid;
>>
>> - for_each_node(nid) {
>> + for_each_online_node(nid) {
>> pn = memcg->nodeinfo[nid];
>> info = rcu_dereference_protected(pn->shrinker_info, true);
>> kvfree(info);
>> @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>> map_size = shrinker_map_size(shrinker_nr_max);
>> defer_size = shrinker_defer_size(shrinker_nr_max);
>> size = map_size + defer_size;
>> - for_each_node(nid) {
>> + for_each_online_node(nid) {
>> info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
>> if (!info) {
>> free_shrinker_info(memcg);
>> @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>
>> /* Prevent from concurrent shrinker_info expand */
>> down_read(&shrinker_rwsem);
>> - for_each_node(nid) {
>> + for_each_online_node(nid) {
>> child_info = shrinker_info_protected(memcg, nid);
>> parent_info = shrinker_info_protected(parent, nid);
>> for (i = 0; i < shrinker_nr_max; i++) {
>> --
>> 2.33.1
>
> This doesn't seem complete. Slab shrinkers are used in the reclaim
> context. Previously offline nodes could be onlined later and this would
> lead to NULL ptr because there is no hook to allocate new shrinker
> infos. This would be also really impractical because this would have to
> update all existing memcgs...
>
> To be completely honest I am not really sure this is a practical problem
> because some architectures allocate (aka make online) all possible nodes
> reported by the platform. There are major inconsistencies there. Maybe
> that should be unified, so that problems like this one do not really
> have to add a complexity to the code.
I'm currently working on a solution that will use register_one_node() to
perform the memcg node allocation for all memcgs. I will post it once I've
verified all potential corner cases.
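To give a rough idea of the direction (the helper and hook names below are
illustrative placeholders, not from an actual patch), the plan is to hook the
node onlining path and walk all existing memcgs, allocating the shrinker_info
for the newly onlined node much like alloc_shrinker_info() does at memcg
creation time:

    /* Hypothetical callback, invoked when nid transitions to online
     * (e.g. from the register_one_node() / node onlining path). */
    static int memcg_shrinker_info_node_online(int nid)
    {
            struct mem_cgroup *memcg;
            int ret = 0;

            down_write(&shrinker_rwsem);
            for (memcg = mem_cgroup_iter(NULL, NULL, NULL); memcg;
                 memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
                    /* alloc_shrinker_info_node() would be a new helper that
                     * allocates and attaches a single node's shrinker_info. */
                    ret = alloc_shrinker_info_node(memcg, nid);
                    if (ret) {
                            mem_cgroup_iter_break(NULL, memcg);
                            break;
                    }
            }
            up_write(&shrinker_rwsem);
            return ret;
    }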
Cheers,
-- Nico
>