* [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
@ 2026-05-28 19:03 Yury Norov
2026-05-28 19:37 ` Waiman Long
` (7 more replies)
0 siblings, 8 replies; 29+ messages in thread
From: Yury Norov @ 2026-05-28 19:03 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, linux-mm, linux-kernel
Cc: Yury Norov, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups
Reassigning nodes relative to an empty user-provided nodemask is useless,
and triggers a divide-by-zero in the function.
Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
mm/mempolicy.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59..cd961fa1eb33 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
const nodemask_t *rel)
{
+ unsigned int w = nodes_weight(*rel);
nodemask_t tmp;
- nodes_fold(tmp, *orig, nodes_weight(*rel));
+
+ if (w == 0)
+ return -EINVAL;
+
+ nodes_fold(tmp, *orig, w);
nodes_onto(*ret, tmp, *rel);
}
--
2.51.0
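
For readers following along, the arithmetic behind the report can be modeled
in userspace. This is a minimal sketch, not kernel code: toy_mask_t,
toy_weight() and toy_fold() are hypothetical stand-ins for nodemask_t,
nodes_weight() and nodes_fold(), limited to 64 nodes. It assumes only that
folding maps each set bit n to bit n % sz, which is where a weight of zero
becomes a division by zero.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for nodemask_t: one bit per node, at most 64 nodes. */
typedef uint64_t toy_mask_t;

/* Number of set bits, playing the role of nodes_weight(). */
static unsigned int toy_weight(toy_mask_t m)
{
	unsigned int w = 0;

	for (; m; m &= m - 1)	/* clear the lowest set bit each round */
		w++;
	return w;
}

/*
 * Fold: every set bit n of orig sets bit (n % sz) in *dst.
 * With sz == 0 the modulo is undefined, so refuse up front.
 * Returns 0 on success, -1 when sz is 0 (*dst is left untouched).
 */
static int toy_fold(toy_mask_t *dst, toy_mask_t orig, unsigned int sz)
{
	unsigned int bit;

	if (sz == 0)
		return -1;
	assert(sz <= 64);

	*dst = 0;
	for (bit = 0; bit < 64; bit++)
		if (orig & (UINT64_C(1) << bit))
			*dst |= UINT64_C(1) << (bit % sz);
	return 0;
}
```

Calling toy_fold(&d, orig, toy_weight(rel)) with an empty rel hits the early
return here; without that check the loop would evaluate bit % 0.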
^ permalink raw reply related [flat|nested] 29+ messages in thread* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov @ 2026-05-28 19:37 ` Waiman Long 2026-05-28 19:40 ` Yury Norov 2026-05-28 19:37 ` Matthew Wilcox ` (6 subsequent siblings) 7 siblings, 1 reply; 29+ messages in thread From: Waiman Long @ 2026-05-28 19:37 UTC (permalink / raw) To: Yury Norov, Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple, linux-mm, linux-kernel Cc: Farhad Alemi, Rasmus Villemoes, cgroups On 5/28/26 3:03 PM, Yury Norov wrote: > Reassigning nodes relative an empty user-provided nodemask is useless, > and triggers divide-by-zero in the function. > > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu> > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ > Signed-off-by: Yury Norov <ynorov@nvidia.com> > --- > mm/mempolicy.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 4e4421b22b59..cd961fa1eb33 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol) > static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, > const nodemask_t *rel) > { > + unsigned int w = nodes_weight(*rel); > nodemask_t tmp; > - nodes_fold(tmp, *orig, nodes_weight(*rel)); > + > + if (w == 0) > + return -EINVAL; > + > + nodes_fold(tmp, *orig, w); > nodes_onto(*ret, tmp, *rel); > } > mpol_relative_nodemask() is a void function, so this code should fail compilation. Right? Cheers, Longman ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-28 19:37 ` Waiman Long
@ 2026-05-28 19:40 ` Yury Norov
0 siblings, 0 replies; 29+ messages in thread
From: Yury Norov @ 2026-05-28 19:40 UTC (permalink / raw)
To: Waiman Long
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Farhad Alemi, Rasmus Villemoes, cgroups

On Thu, May 28, 2026 at 03:37:04PM -0400, Waiman Long wrote:
> On 5/28/26 3:03 PM, Yury Norov wrote:
> > Reassigning nodes relative to an empty user-provided nodemask is useless,
> > and triggers a divide-by-zero in the function.
> >
> > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
> > Signed-off-by: Yury Norov <ynorov@nvidia.com>
> > ---
> > mm/mempolicy.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 4e4421b22b59..cd961fa1eb33 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> > static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
> > const nodemask_t *rel)
> > {
> > + unsigned int w = nodes_weight(*rel);
> > nodemask_t tmp;
> > - nodes_fold(tmp, *orig, nodes_weight(*rel));
> > +
> > + if (w == 0)
> > + return -EINVAL;
> > +
> > + nodes_fold(tmp, *orig, w);
> > nodes_onto(*ret, tmp, *rel);
> > }
>
> mpol_relative_nodemask() is a void function, so this code should fail
> compilation. Right?

Apologies, I submitted the wrong file. Will resend shortly.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov
2026-05-28 19:37 ` Waiman Long
@ 2026-05-28 19:37 ` Matthew Wilcox
2026-05-28 19:41 ` Andrew Morton
` (5 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2026-05-28 19:37 UTC (permalink / raw)
To: Yury Norov
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes,
cgroups

On Thu, May 28, 2026 at 03:03:37PM -0400, Yury Norov wrote:
> static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
         ^^^^
> const nodemask_t *rel)
> {
> + unsigned int w = nodes_weight(*rel);
> nodemask_t tmp;
> - nodes_fold(tmp, *orig, nodes_weight(*rel));
> +
> + if (w == 0)
> + return -EINVAL;

... this doesn't even compile.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov 2026-05-28 19:37 ` Waiman Long 2026-05-28 19:37 ` Matthew Wilcox @ 2026-05-28 19:41 ` Andrew Morton 2026-05-29 15:26 ` Joshua Hahn 2026-05-29 8:47 ` [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() kernel test robot ` (4 subsequent siblings) 7 siblings, 1 reply; 29+ messages in thread From: Andrew Morton @ 2026-05-28 19:41 UTC (permalink / raw) To: Yury Norov Cc: David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple, linux-mm, linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups On Thu, 28 May 2026 15:03:37 -0400 Yury Norov <ynorov@nvidia.com> wrote: > Reassigning nodes relative an empty user-provided nodemask is useless, > and triggers divide-by-zero in the function. > > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu> > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ Thanks both. It looks like this is very old code, so we'll be wanting a cc:stable in this. > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol) > static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, > const nodemask_t *rel) > { > + unsigned int w = nodes_weight(*rel); > nodemask_t tmp; > - nodes_fold(tmp, *orig, nodes_weight(*rel)); > + > + if (w == 0) > + return -EINVAL; > + > + nodes_fold(tmp, *orig, w); > nodes_onto(*ret, tmp, *rel); > } I suspect we should address this at the mpol level - it should never have got that far. Hopefully the mempolicy maintainers can have a think. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-05-28 19:41 ` Andrew Morton @ 2026-05-29 15:26 ` Joshua Hahn 2026-05-29 17:47 ` Yury Norov 0 siblings, 1 reply; 29+ messages in thread From: Joshua Hahn @ 2026-05-29 15:26 UTC (permalink / raw) To: Andrew Morton Cc: Yury Norov, David Hildenbrand, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple, linux-mm, linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups On Thu, 28 May 2026 12:41:33 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 28 May 2026 15:03:37 -0400 Yury Norov <ynorov@nvidia.com> wrote: > > > Reassigning nodes relative an empty user-provided nodemask is useless, > > and triggers divide-by-zero in the function. > > > > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu> > > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ > > Thanks both. > > It looks like this is very old code, so we'll be wanting a cc:stable in > this. > > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol) > > static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, > > const nodemask_t *rel) > > { > > + unsigned int w = nodes_weight(*rel); > > nodemask_t tmp; > > - nodes_fold(tmp, *orig, nodes_weight(*rel)); > > + > > + if (w == 0) > > + return -EINVAL; > > + > > + nodes_fold(tmp, *orig, w); > > nodes_onto(*ret, tmp, *rel); > > } > > I suspect we should address this at the mpol level - it should never > have got that far. Hopefully the mempolicy maintainers can have a > think. Hello Andrew, hello Yury, I agree with Andrew here. mpol_relative_nodemask is called from two places, the first being mpol_rebind_nodemask which is the calling function seen in the bug report as well. 
The other place is mpol_set_nodemask, which has a helpful comment that notes: "mpol_set_nodemask is called after mpol_new() [...snip...] mpol_new() has already validated the nodes parameter with respect to the policy mode and flags". So it seems like we are missing the big if-else if-else if block from mpol_new in other places that should in fact have it, like mpol_rebind_nodemask. The approach proposed here of just checking whether the node weight is 0 won't work for a few cases, namely for MPOL_DEFAULT and MPOL_PREFERRED where empty nodemasks are actually allowed. So what should really be done here is to do the full policy-nodemask checking section in mpol_new and call that from mpol_set_nodemask as well. Thank you for taking a shot at fixing the bug report, please let me know what you think! Have a great day : -) Joshua ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-05-29 15:26 ` Joshua Hahn @ 2026-05-29 17:47 ` Yury Norov 2026-05-29 18:40 ` Joshua Hahn 2026-06-01 14:32 ` David Hildenbrand (Arm) 0 siblings, 2 replies; 29+ messages in thread From: Yury Norov @ 2026-05-29 17:47 UTC (permalink / raw) To: Joshua Hahn Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple, linux-mm, linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups On Fri, May 29, 2026 at 08:26:15AM -0700, Joshua Hahn wrote: > On Thu, 28 May 2026 12:41:33 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Thu, 28 May 2026 15:03:37 -0400 Yury Norov <ynorov@nvidia.com> wrote: > > > > > Reassigning nodes relative an empty user-provided nodemask is useless, > > > and triggers divide-by-zero in the function. > > > > > > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu> > > > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ > > > > Thanks both. > > > > It looks like this is very old code, so we'll be wanting a cc:stable in > > this. > > > > > --- a/mm/mempolicy.c > > > +++ b/mm/mempolicy.c > > > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol) > > > static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, > > > const nodemask_t *rel) > > > { > > > + unsigned int w = nodes_weight(*rel); > > > nodemask_t tmp; > > > - nodes_fold(tmp, *orig, nodes_weight(*rel)); > > > + > > > + if (w == 0) > > > + return -EINVAL; > > > + > > > + nodes_fold(tmp, *orig, w); > > > nodes_onto(*ret, tmp, *rel); > > > } > > > > I suspect we should address this at the mpol level - it should never > > have got that far. Hopefully the mempolicy maintainers can have a > > think. > > Hello Andrew, hello Yury, > > I agree with Andrew here. 
> mpol_relative_nodemask is called from two places, the first being
> mpol_rebind_nodemask which is the calling function seen in the bug report as
> well.
>
> The other place is mpol_set_nodemask, which has a helpful comment that notes:
> "mpol_set_nodemask is called after mpol_new() [...snip...] mpol_new() has
> already validated the nodes parameter with respect to the policy mode and
> flags".
>
> So it seems like we are missing the big if-else if-else if block from mpol_new
> in other places that should in fact have it, like mpol_rebind_nodemask.
>
> The approach proposed here of just checking whether the node weight is 0
> won't work for a few cases, namely for MPOL_DEFAULT and MPOL_PREFERRED where
> empty nodemasks are actually allowed. So what should really be done here is to
> do the full policy-nodemask checking section in mpol_new and call that from
> mpol_set_nodemask as well.
>
> Thank you for taking a shot at fixing the bug report, please let me know what
> you think! Have a great day :-)

Hi Joshua.

Indeed, quick and dirty shot.

The problem is that nodes_fold() can't work with sz == 0. In other
words, folding to a 0-bit bitmap is an error. We don't check that at
the bitmap level because it's an internal helper, and it's the
caller's responsibility to validate the parameters.

nodes_onto(), or more specifically bitmap_onto(), is a different
story. In the case of an empty relmap, the function actually clears
all the bits in dst and returns.

I see 2 options to move this forward.

1. Simply disallow an empty relmap in mpol_relative_nodemask(). There
   are no valid cases for it, AFAIK, so the nodes_fold() limitation
   looks reasonable. We can consider it a new policy.

   We've got 2 users of mpol_relative_nodemask(). In mpol_set_nodemask()
   we can simply propagate the error; and in mpol_rebind_nodemask() we
   can throw a warning and do nothing.

2. Follow the spirit of nodes_onto(), and in the case of an empty
   relmask, clear the ret mask and bail out.

I'm in favor of the 1st option, because an empty relmask looks buggy
anyway.

> The approach proposed here of just checking whether the node weight is 0
> won't work for a few cases, namely for MPOL_DEFAULT and MPOL_PREFERRED where
> empty nodemasks are actually allowed.

Not sure I understand this. mpol_relative_nodemask() is called only if
MPOL_F_RELATIVE_NODES is set. In mpol_rebind_nodemask(), if both
MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES are set, the former wins.
How would the RELATIVE mode mess with the others?

The mpol_new() code seemingly tries to disable empty nodes in the case
of MPOL_DEFAULT and MPOL_PREFERRED + MPOL_F_RELATIVE_NODES, but
obviously it doesn't work very well in the rebind case.

Anyway, I'm not really deep in the mempolicy domain, so please educate
me if I miss something.

Thanks,
Yury

^ permalink raw reply [flat|nested] 29+ messages in thread
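
The two options above can be compared side by side in a small userspace
model. This is a sketch under stated assumptions, not a proposed patch: the
toy_* names are hypothetical, masks are limited to 64 nodes, and the fold and
onto helpers only mirror the behavior described in the message (fold maps bit
n to n % sz; onto places bit n of the source at the n-th set bit of the
relative mask, so an empty relative mask yields an empty result).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for nodemask_t: one bit per node, at most 64 nodes. */
typedef uint64_t toy_mask_t;

static unsigned int toy_weight(toy_mask_t m)
{
	unsigned int w = 0;

	for (; m; m &= m - 1)
		w++;
	return w;
}

/* Fold bit n of orig onto bit (n % sz). The caller must pass sz != 0. */
static toy_mask_t toy_fold(toy_mask_t orig, unsigned int sz)
{
	toy_mask_t dst = 0;
	unsigned int bit;

	assert(sz != 0 && sz <= 64);
	for (bit = 0; bit < 64; bit++)
		if (orig & (UINT64_C(1) << bit))
			dst |= UINT64_C(1) << (bit % sz);
	return dst;
}

/* Onto: if bit n of orig is set, set the n-th set bit of rel in the result. */
static toy_mask_t toy_onto(toy_mask_t orig, toy_mask_t rel)
{
	toy_mask_t dst = 0;
	unsigned int bit, n = 0;

	for (bit = 0; bit < 64; bit++) {
		if (!(rel & (UINT64_C(1) << bit)))
			continue;
		if (orig & (UINT64_C(1) << n))
			dst |= UINT64_C(1) << bit;
		n++;
	}
	return dst;
}

/* Option 1: reject an empty rel and let the caller decide what to do. */
static int toy_relative_reject(toy_mask_t *ret, toy_mask_t orig, toy_mask_t rel)
{
	unsigned int w = toy_weight(rel);

	if (w == 0)
		return -EINVAL;	/* *ret is left untouched */
	*ret = toy_onto(toy_fold(orig, w), rel);
	return 0;
}

/* Option 2: an empty rel produces an empty result, as onto would. */
static void toy_relative_clear(toy_mask_t *ret, toy_mask_t orig, toy_mask_t rel)
{
	unsigned int w = toy_weight(rel);

	if (w == 0) {
		*ret = 0;
		return;
	}
	*ret = toy_onto(toy_fold(orig, w), rel);
}
```

In this model option 1 forces both callers to handle an error, while option 2
keeps the void signature but silently hands back an empty mask.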
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-05-29 17:47 ` Yury Norov @ 2026-05-29 18:40 ` Joshua Hahn 2026-06-01 14:32 ` David Hildenbrand (Arm) 1 sibling, 0 replies; 29+ messages in thread From: Joshua Hahn @ 2026-05-29 18:40 UTC (permalink / raw) To: Yury Norov Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple, linux-mm, linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups On Fri, 29 May 2026 13:47:12 -0400 Yury Norov <ynorov@nvidia.com> wrote: > On Fri, May 29, 2026 at 08:26:15AM -0700, Joshua Hahn wrote: > > On Thu, 28 May 2026 12:41:33 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > On Thu, 28 May 2026 15:03:37 -0400 Yury Norov <ynorov@nvidia.com> wrote: > > > > > > > Reassigning nodes relative an empty user-provided nodemask is useless, > > > > and triggers divide-by-zero in the function. > > > > > > > > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu> > > > > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ > > > > > > Thanks both. > > > > > > It looks like this is very old code, so we'll be wanting a cc:stable in > > > this. > > > > > > > --- a/mm/mempolicy.c > > > > +++ b/mm/mempolicy.c > > > > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol) > > > > static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, > > > > const nodemask_t *rel) > > > > { > > > > + unsigned int w = nodes_weight(*rel); > > > > nodemask_t tmp; > > > > - nodes_fold(tmp, *orig, nodes_weight(*rel)); > > > > + > > > > + if (w == 0) > > > > + return -EINVAL; > > > > + > > > > + nodes_fold(tmp, *orig, w); > > > > nodes_onto(*ret, tmp, *rel); > > > > } > > > > > > I suspect we should address this at the mpol level - it should never > > > have got that far. Hopefully the mempolicy maintainers can have a > > > think. 
> > > > Hello Andrew, hello Yury, > > > > I agree with Andrew here. > > mpol_relative_nodemask is called from two places, the first being > > mpol_rebind_nodemask which is the calling function seen in the bug report as > > well. > > > > The other place is mpol_set_nodemask, which has a helpful comment that notes: > > "mpol_set_nodemask is called after mpol_new() [...snip...] mpol_new() has > > already validated the nodes parameter with respect to the policy mode and > > flags". > > > > So it seems like we are missing the big if-else if-else if block from mpol_new > > in other places that should in fact have it, like mpol_rebind_nodemask. > > > > The approach proposed here of just checking whether the node weight is 0 > > won't work for a few cases, namely for MPOL_DEFAULT and MPOL_PREFERRED where > > empty nodemasks are actually allowed. So what should really be done here is to > > do the full policy-nodemask checking section in mpol_new and call that from > > mpol_set_nodemask as well. > > > > Thank you for taking a shot at fixing the bug report, please let me know what > > you think! Have a great day : -) > > Hi Joshua. > > Indeed, quick and dirty shot. > > The problem is that nodes_fold() can't work with the sz == 0. In > other words, folding to a 0-bit bitmap is an error. We don't check > that on bitmaps level because it's an internal helper, and it's a > caller's responsibility to validate the parameters. > > nodes_onto(), or more specifically bitmap_onto(), is a different > story. In case of empty relmap, the function actually clears all the > bits in dst and returns. I see, thank you for helping me understand. Yeah, we probably don't want an empty nodemask here regardless of policy, as long as MPOL_F_RELATIVE_NODES is set. > I see 2 options to move this forward. > > 1. Simply disallow empty relmap in mpol_relative_nodemask(). There's > no valid cases for it, AFAIK, so the nodes_fold() limitation looks > reasonable. We can consider it as a new policy. 
> > We've got 2 users for mpol_relative_nodemask(). In mpol_set_nodemask() > we can simply propagate the error; and in mpol_rebind_nodemask() we > can throw a warning and do nothing. I think we should never be able to reach mpol_set_nodemask with an empty nodemask if MPOL_F_RELATIVE_NODES is set. Not sure if we need to be extra defensive here. For mpol_rebind_nodemask I think we should actually do some more checks, I think we should do it in mpol_rebind_policy since it gives us an opportunity to catch other sources of failure too, like calling mpol_rebind_preferred with an empty nodemask as well (which shouldn't be allowed for MPOL_F_{ RELATIVE, STATIC}_NODES) as far as I can tell from the checks in mpol_new. Setting empty nodemask for mpol_rebind_preferred won't throw a div0 error like for mpol_rebind_nodemask but we can at least throw a warning like you suggested. Does that make sense? This is your fix and if you would prefer to address only the div0 case, that makes sense too, since the empty nodemask for preferred is more of a semantic incorrectness and will not cause panics. Entirely up to you! : -) > 2. Follow the spirit of the nodes_onto(), and in case of empty > relmask, clean the ret mask and bail out > > I'm in a favor for the 1st option, because empty relmask looks buggy > anyways. > > > The approach proposed here of just checking whether the node weight is 0 > > won't work for a few cases, namely for MPOL_DEFAULT and MPOL_PREFERRED where > > empty nodemasks are actually allowed. > > Not sure I understand this. The mpol_relative_nodemask() is called > only if MPOL_F_RELATIVE_NODES is set. In mpol_rebind_nodemask(), if > both MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES are set, the former > wins. How would the RELATIVE mode mess with the others? Yes, you're right, the case that MPOL_DEFAULT and MPOL_PREFERRED allows empty nodemasks is precisely when !STATIC && !RELATIVE :p this is my bad for missing that case completely. 
> Anyways, I'm not really deep in mempolicy domain, so please educate me if > I miss something. Thank you, I have also learned a lot looking into this to think about what the best solution is! Joshua ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-29 17:47 ` Yury Norov
2026-05-29 18:40 ` Joshua Hahn
@ 2026-06-01 14:32 ` David Hildenbrand (Arm)
2026-06-02 8:44 ` Gregory Price
1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 14:32 UTC (permalink / raw)
To: Yury Norov, Joshua Hahn
Cc: Andrew Morton, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, linux-mm, linux-kernel,
Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups

>>
>> Thank you for taking a shot at fixing the bug report, please let me know what
>> you think! Have a great day :-)
>
> Hi Joshua.
>
> Indeed, quick and dirty shot.
>
> The problem is that nodes_fold() can't work with the sz == 0. In
> other words, folding to a 0-bit bitmap is an error. We don't check
> that on bitmaps level because it's an internal helper, and it's a
> caller's responsibility to validate the parameters.
>
> nodes_onto(), or more specifically bitmap_onto(), is a different
> story. In case of empty relmap, the function actually clears all the
> bits in dst and returns.

It's very weird that mpol_new_nodemask() (->create() callback) disallows empty
nodemasks, but mpol_rebind_nodemask() (->rebind() callback) would allow empty
nodemasks.

I guess mpol_set_nodemask() could trigger it after doing the

	nodes_and(nsc->mask1,
		  cpuset_current_mems_allowed, node_states[N_MEMORY]);

And ending with an empty &nsc->mask1.

The later "mpol_ops[pol->mode].create(pol, &nsc->mask2);" would reject it, but
the division by zero could still happen.

>
> I see 2 options to move this forward.
>
> 1. Simply disallow empty relmap in mpol_relative_nodemask(). There's
> no valid cases for it, AFAIK, so the nodes_fold() limitation looks
> reasonable. We can consider it as a new policy.
>
> We've got 2 users for mpol_relative_nodemask(). In mpol_set_nodemask()
> we can simply propagate the error; and in mpol_rebind_nodemask() we
> can throw a warning and do nothing.

Throwing a warning is really bad. We'd still end up with an empty nodemask?

--
Cheers,

David

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-06-01 14:32 ` David Hildenbrand (Arm) @ 2026-06-02 8:44 ` Gregory Price 2026-06-02 9:19 ` David Hildenbrand (Arm) 0 siblings, 1 reply; 29+ messages in thread From: Gregory Price @ 2026-06-02 8:44 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: Yury Norov, Joshua Hahn, Andrew Morton, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, linux-mm, linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups On Mon, Jun 01, 2026 at 04:32:25PM +0200, David Hildenbrand (Arm) wrote: > >> > >> Thank you for taking a shot at fixing the bug report, please let me know what > >> you think! Have a great day : -) > > > > Hi Joshua. > > > > Indeed, quick and dirty shot. > > > > The problem is that nodes_fold() can't work with the sz == 0. In > > other words, folding to a 0-bit bitmap is an error. We don't check > > that on bitmaps level because it's an internal helper, and it's a > > caller's responsibility to validate the parameters. > > > > nodes_onto(), or more specifically bitmap_onto(), is a different > > story. In case of empty relmap, the function actually clears all the > > bits in dst and returns. > > It's very weird that mpol_new_nodemask() (->create() callback) disallows empty > nodemasks, but mpol_rebind_nodemask() (->rebind() callback) would allow empty > nodemasks. > Was this actually observed? mpol_rebind_nodemask() happens when cgroup.cpuset changes, and cgroup.cpuset cannot be empty. cpuset only changes with sysfs twiddles or offlining. In either case, cpuset *guarantees* that cpuset.mems will never be empty. So... is this an observed bug or just a statically discovered "bug" that can't actually be reached? ~Gregory ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-06-02 8:44 ` Gregory Price @ 2026-06-02 9:19 ` David Hildenbrand (Arm) 2026-06-02 9:54 ` Gregory Price 0 siblings, 1 reply; 29+ messages in thread From: David Hildenbrand (Arm) @ 2026-06-02 9:19 UTC (permalink / raw) To: Gregory Price Cc: Yury Norov, Joshua Hahn, Andrew Morton, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, linux-mm, linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups On 6/2/26 10:44, Gregory Price wrote: > On Mon, Jun 01, 2026 at 04:32:25PM +0200, David Hildenbrand (Arm) wrote: >>> >>> Hi Joshua. >>> >>> Indeed, quick and dirty shot. >>> >>> The problem is that nodes_fold() can't work with the sz == 0. In >>> other words, folding to a 0-bit bitmap is an error. We don't check >>> that on bitmaps level because it's an internal helper, and it's a >>> caller's responsibility to validate the parameters. >>> >>> nodes_onto(), or more specifically bitmap_onto(), is a different >>> story. In case of empty relmap, the function actually clears all the >>> bits in dst and returns. >> >> It's very weird that mpol_new_nodemask() (->create() callback) disallows empty >> nodemasks, but mpol_rebind_nodemask() (->rebind() callback) would allow empty >> nodemasks. >> > > Was this actually observed? > > mpol_rebind_nodemask() happens when cgroup.cpuset changes, and > cgroup.cpuset cannot be empty. > > cpuset only changes with sysfs twiddles or offlining. In either case, > cpuset *guarantees* that cpuset.mems will never be empty. > > So... is this an observed bug or just a statically discovered > "bug" that can't actually be reached? According to the report [1] syzkaller can trigger it. There is no reproducer, though. [1] https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ -- Cheers, David ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-06-02 9:19 ` David Hildenbrand (Arm)
@ 2026-06-02 9:54 ` Gregory Price
2026-06-02 15:01 ` Farhad Alemi
0 siblings, 1 reply; 29+ messages in thread
From: Gregory Price @ 2026-06-02 9:54 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Yury Norov, Joshua Hahn, Andrew Morton, Zi Yan, Matthew Brost,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, linux-mm,
linux-kernel, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups

On Tue, Jun 02, 2026 at 11:19:49AM +0200, David Hildenbrand (Arm) wrote:
>
> According to the report [1] syzkaller can trigger it. There is no reproducer,
> though.
>
> [1]
> https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
>

The actual implication of this report is that there is a bug in cpuset,
not mempolicy.

mpol_rebind_mm+0x3ab/0x680 mm/mempolicy.c:569
^^^ should never receive a 0-node nodemask ^^^
...snip...
cpuset_update_tasks_nodemask+0x22e/0x340 kernel/cgroup/cpuset.c:2777
^^^ calls guarantee_online_mems ^^^
...snip...
hotplug_update_tasks kernel/cgroup/cpuset.c:3882 [inline]
cpuset_hotplug_update_tasks kernel/cgroup/cpuset.c:3985 [inline]

Relevant code:

void cpuset_update_tasks_nodemask(struct cpuset *cs)
{
	... snip ...
	guarantee_online_mems(cs, &newmems); <<< critical call
	... snip ...
	while ((task = css_task_iter_next(&it))) {
		... snip ...
		mpol_rebind_mm(mm, &cs->mems_allowed);

Seems like maybe mpol_rebind_mm should be called with newmems, not
cs->mems_allowed, though cs->mems_allowed should never be allowed to be
empty, because that makes no sense.

Just eyeballing it, I can't say whether calling with newmems is the
right thing, or if mems_allowed should not be allowed to be empty, would
have to dig in a little further.

~Gregory

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-06-02 9:54 ` Gregory Price
@ 2026-06-02 15:01 ` Farhad Alemi
2026-06-05 15:18 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 29+ messages in thread
From: Farhad Alemi @ 2026-06-02 15:01 UTC (permalink / raw)
To: Gregory Price
Cc: falemi, David Hildenbrand (Arm), Yury Norov, Joshua Hahn,
Andrew Morton, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
Ying Huang, Alistair Popple, linux-mm, linux-kernel, Waiman Long,
Rasmus Villemoes, cgroups

[-- Attachment #1: Type: text/plain, Size: 3495 bytes --]

Confirmed, with a standalone reproducer (attached); it panics
linus/master at e8c2f9fdadee.

cs->mems_allowed can legitimately be empty on v2 -- a freshly created
cpuset child that never had cpuset.mems written keeps mems_allowed empty
(never initialized) while effective_mems is inherited non-empty in
cpuset_css_online(), and v2 allows attaching tasks to it (the empty-mems
guard in cpuset_can_attach_check() is gated on !is_in_v2_mode()). So the
non-empty guarantee holds for effective_mems, not for the configured
cs->mems_allowed; forbidding empty cpuset.mems would break v2's
inherit-from-parent semantics.

The reproducer enables +cpuset, mkdirs a child without writing
cpuset.mems, moves a task in, mbind()s a VMA with
MPOL_BIND | MPOL_F_RELATIVE_NODES, and offlines a CPU; the hotplug walk
then calls mpol_rebind_mm(mm, &cs->mems_allowed) with the empty mask and
folds modulo nodes_weight(*rel) == 0 (console logs attached).

The newmems instinct looks right: it's the effective, online mask the
task is actually allowed to use, guarantee_online_mems() keeps it
non-empty, and it matches cpuset_attach(), which already rebinds against
cs->effective_mems. The fix this implies:

	- mpol_rebind_mm(mm, &cs->mems_allowed);
	+ mpol_rebind_mm(mm, &newmems);

I built the current base (e8c2f9fdadee) with and without this one-liner:
the unpatched kernel panics on the first cpu1 offline, while the patched
kernel runs the reproducer's 8 offline/online cycles cleanly, with no
divide error.

This regressed in ae1c802382f7 ("cpuset: apply
cs->effective_{cpus,mems}", v3.17), which moved cpuset_attach() to the
effective mask but left this rebind on cs->mems_allowed.

Happy to send this as a proper patch (Fixes: ae1c802382f7, Cc: stable,
reproducer) if you agree the cpuset side is right, or to test a
mempolicy-side fix if not.

Thanks,
Farhad Alemi
PhD Student
SEFCOM Lab @ ASU

On Tue, Jun 2, 2026 at 2:54 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Tue, Jun 02, 2026 at 11:19:49AM +0200, David Hildenbrand (Arm) wrote:
> >
> > According to the report [1] syzkaller can trigger it. There is no reproducer,
> > though.
> >
> > [1]
> > https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
> >
>
> The actual implication of this report is that there is a bug in cpuset,
> not mempolicy.
>
> mpol_rebind_mm+0x3ab/0x680 mm/mempolicy.c:569
> ^^^ should never receive a 0-node nodemask ^^^
> ...snip...
> cpuset_update_tasks_nodemask+0x22e/0x340 kernel/cgroup/cpuset.c:2777
> ^^^ calls guarantee_online_mems ^^^
> ...snip...
> hotplug_update_tasks kernel/cgroup/cpuset.c:3882 [inline]
> cpuset_hotplug_update_tasks kernel/cgroup/cpuset.c:3985 [inline]
>
> Relevant code:
>
> void cpuset_update_tasks_nodemask(struct cpuset *cs)
> {
> ... snip ...
> guarantee_online_mems(cs, &newmems); <<< critical call
> ... snip ...
> while ((task = css_task_iter_next(&it))) {
> ... snip ...
> mpol_rebind_mm(mm, &cs->mems_allowed);
>
> Seems like maybe mpol_rebind_mm should be called with newmems, not
> cs->mems_allowed, though cs->mems_allowed should never be allowed to be
> empty, because that makes no sense.
> > Just eyeballing it, I can't say whether calling with newmems is the > right thing, or if mems_allowed should not be allowed to be empty, would > have to dig in a little further. > > ~Gregory [-- Attachment #2: reproducer.c --] [-- Type: application/octet-stream, Size: 5840 bytes --] // Reproducer for: divide error in bitmap_fold // (cpuset hotplug rebind of a relative-nodes mempolicy with an empty // cpuset.mems_allowed) // // Lore report: // https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ // // Crash signature (PID is a cpuhp/N kthread): // Oops: divide error: 0000 [#1] ... RIP: bitmap_fold+0x5e/0xb0 lib/bitmap.c:728 // __nodes_fold include/linux/nodemask.h // mpol_relative_nodemask mm/mempolicy.c:374 // mpol_rebind_nodemask mm/mempolicy.c:511 // mpol_rebind_policy mm/mempolicy.c:545 // mpol_rebind_mm mm/mempolicy.c:572 // cpuset_update_tasks_nodemask kernel/cgroup/cpuset.c:2652 // hotplug_update_tasks / cpuset_hotplug_update_tasks / cpuset_handle_hotplug // cpuset_cpu_active|inactive -> sched_cpu_activate|deactivate (CPU hotplug) // // Mechanism: // 1. A freshly created cgroup-v2 cpuset child has cpuset.mems_allowed == {} // (never written) while its effective_mems is inherited from the parent // and is non-empty. On the legacy (v1) hierarchy, changing a populated // cpuset's non-empty cpuset.mems to empty is rejected (-ENOSPC) by the // empty-mems check in cpuset1_validate_change(); on the default (v2) // hierarchy there is no such check, and a fresh child simply never // writes mems_allowed at all. // 2. A task in that child owns a VMA mempolicy created with // MPOL_F_RELATIVE_NODES and a non-empty user nodemask. // 3. A CPU hot{,un}plug event makes cpuset_handle_hotplug() walk every // descendant (the walk is gated on the active CPU/mem set actually // changing, which a cpu on/offline satisfies via cpus_updated). 
For the // child, new effective mems == old, but the v2 hotplug path still calls // cpuset_update_tasks_nodemask(), which rebinds VMA policies with // &cs->mems_allowed -- the *configured* (empty) mask, NOT the effective // one. // 4. mpol_rebind_nodemask() sees MPOL_F_RELATIVE_NODES and calls // mpol_relative_nodemask(tmp, user_nodemask, cs->mems_allowed={}), i.e. // nodes_fold(tmp, user_nodemask, nodes_weight({})==0) -> bitmap_fold() // with sz==0 -> `oldbit % 0` -> #DE. // // Run as root inside the test VM (kernel CONFIG_HOTPLUG_CPU, CONFIG_CPUSETS, // CONFIG_NUMA). The VM in the report has -smp 2, so cpu1 is offlined. // // gcc -O2 -static -o reproducer reproducer.c && ./reproducer #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <string.h> #include <sys/mman.h> #include <sys/mount.h> #include <sys/stat.h> #include <sys/syscall.h> #include <unistd.h> #define MPOL_BIND 2 #define MPOL_F_RELATIVE_NODES (1 << 14) static int write_file(const char *path, const char *val) { int fd = open(path, O_WRONLY); if (fd < 0) { fprintf(stderr, "[-] open(%s): %s\n", path, strerror(errno)); return -1; } int n = write(fd, val, strlen(val)); if (n < 0) fprintf(stderr, "[-] write(%s, \"%s\"): %s\n", path, val, strerror(errno)); close(fd); return n < 0 ? -1 : 0; } // Find a writable cgroup2 root, mounting a fresh view if needed. static const char *cgroup2_root(void) { static char root[256]; // A fresh mount of cgroup2 is just another view of the single unified // hierarchy, so its root subtree_control is the system one. mkdir("/tmp/cg2", 0755); if (mount("none", "/tmp/cg2", "cgroup2", 0, NULL) == 0 || errno == EBUSY) { strcpy(root, "/tmp/cg2"); if (access("/tmp/cg2/cgroup.subtree_control", F_OK) == 0) return root; } // Fall back to the conventional location. 
if (access("/sys/fs/cgroup/cgroup.subtree_control", F_OK) == 0) { strcpy(root, "/sys/fs/cgroup"); return root; } return NULL; } int main(void) { char path[320]; const char *root = cgroup2_root(); if (!root) { fprintf(stderr, "[-] no cgroup2 hierarchy available\n"); return 1; } printf("[+] cgroup2 root: %s\n", root); // 1. Enable the cpuset controller for children of the root. snprintf(path, sizeof(path), "%s/cgroup.subtree_control", root); write_file(path, "+cpuset"); // 2. Create a child cpuset. Crucially, we NEVER write its cpuset.mems, // so mems_allowed stays empty while effective_mems inherits {0}. snprintf(path, sizeof(path), "%s/koops", root); mkdir(path, 0755); // 3. Move ourselves into the child (allowed in v2 even with empty mems). snprintf(path, sizeof(path), "%s/koops/cgroup.procs", root); char pid[32]; snprintf(pid, sizeof(pid), "%d", getpid()); if (write_file(path, pid) < 0) fprintf(stderr, "[-] could not join child cgroup\n"); // 4. Install a VMA mempolicy with MPOL_F_RELATIVE_NODES and a non-empty // user nodemask (node 0). This is the policy whose later rebind with // an empty mask folds modulo zero. void *area = mmap(NULL, 0x4000, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (area == MAP_FAILED) { fprintf(stderr, "[-] mmap: %s\n", strerror(errno)); return 1; } unsigned long nodemask[16] = { 0 }; nodemask[0] = 1UL; // node 0 long r = syscall(__NR_mbind, area, 0x4000, MPOL_BIND | MPOL_F_RELATIVE_NODES, nodemask, sizeof(nodemask) * 8, 0); if (r != 0) fprintf(stderr, "[-] mbind: %s\n", strerror(errno)); else printf("[+] installed MPOL_F_RELATIVE_NODES VMA policy\n"); // 5. Trigger CPU hotplug. cpuset_handle_hotplug() then walks descendants // and rebinds our VMA policy with the empty mems_allowed -> #DE in the // cpuhp/N kthread. Loop a few times to cover online/offline timing. 
printf("[+] toggling cpu1 online state to trigger hotplug rebind...\n"); for (int i = 0; i < 8; i++) { write_file("/sys/devices/system/cpu/cpu1/online", "0"); write_file("/sys/devices/system/cpu/cpu1/online", "1"); } printf("[+] done (if the kernel did not crash, check NUMA/cpuset config)\n"); return 0; } ^ permalink raw reply [flat|nested] 29+ messages in thread
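
The modulo-by-zero described in step 4 of the reproducer's mechanism can be sketched in user space. This is a toy model under stated assumptions, not kernel code: toy_mask_t, toy_weight() and toy_fold() are hypothetical stand-ins for nodemask_t, nodes_weight() and bitmap_fold(), limited to 64 bits, and the sz == 0 check is an illustrative guard rather than what any patch in this thread does.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 64-bit stand-in for a nodemask; NOT the kernel's nodemask_t. */
typedef uint64_t toy_mask_t;

/* Count set bits, playing the role of nodes_weight(). */
static unsigned int toy_weight(toy_mask_t m)
{
	unsigned int w = 0;

	for (; m; m &= m - 1)	/* clear the lowest set bit each round */
		w++;
	return w;
}

/*
 * Fold every set bit of orig into the range [0, sz), the way the thread
 * describes bitmap_fold() doing with "oldbit % sz".
 * Returns 0 on success, or -1 when sz == 0, which is the case where an
 * unguarded modulo would raise a divide error.
 */
static int toy_fold(toy_mask_t *dst, toy_mask_t orig, unsigned int sz)
{
	unsigned int bit;

	assert(sz <= 64);	/* the toy mask only has 64 bits */
	if (sz == 0)
		return -1;

	*dst = 0;
	for (bit = 0; bit < 64; bit++)
		if (orig & (UINT64_C(1) << bit))
			*dst |= UINT64_C(1) << (bit % sz);
	return 0;
}
```

With an empty relative mask, toy_weight() yields 0 and toy_fold() reports an error instead of evaluating bit % 0. The fix discussed in this thread takes a different route: it keeps the empty mask from reaching the fold at all by passing a mask that is known to be non-empty.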
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() 2026-06-02 15:01 ` Farhad Alemi @ 2026-06-05 15:18 ` David Hildenbrand (Arm) 2026-06-09 23:57 ` [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed Farhad Alemi 0 siblings, 1 reply; 29+ messages in thread From: David Hildenbrand (Arm) @ 2026-06-05 15:18 UTC (permalink / raw) To: Farhad Alemi, Gregory Price Cc: falemi, Yury Norov, Joshua Hahn, Andrew Morton, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, linux-mm, linux-kernel, Waiman Long, Rasmus Villemoes, cgroups On 6/2/26 17:01, Farhad Alemi wrote: > Confirmed, with a standalone reproducer (attached); it panics linus/master > at e8c2f9fdadee. cs->mems_allowed can legitimately be empty > on v2 -- a freshly created cpuset child that never had cpuset.mems > written keeps mems_allowed empty (never initialized) while effective_mems > is inherited non-empty in cpuset_css_online(), and v2 allows attaching > tasks to it (the empty-mems guard in cpuset_can_attach_check() is gated > on !is_in_v2_mode()). So the non-empty guarantee holds for effective_mems, > not for the configured cs->mems_allowed; forbidding empty cpuset.mems > would break v2's inherit-from-parent semantics. > > The reproducer enables +cpuset, mkdirs a child without writing > cpuset.mems, moves a task in, mbind()s a VMA with > MPOL_BIND | MPOL_F_RELATIVE_NODES, and offlines a CPU; the hotplug walk > then calls mpol_rebind_mm(mm, &cs->mems_allowed) with the empty mask and > folds modulo nodes_weight(*rel) == 0 (console logs attached). > > The newmems instinct looks right: it's the effective, online mask the > task is actually allowed to use, guarantee_online_mems() keeps it > non-empty, and it matches cpuset_attach(), which already rebinds against > cs->effective_mems. 
The fix this implies: > > - mpol_rebind_mm(mm, &cs->mems_allowed); > + mpol_rebind_mm(mm, &newmems); > > I built the current base (e8c2f9fdadee) with and without this one-liner: > the unpatched kernel panics on the first cpu1 offline, while the patched > kernel runs the reproducer's 8 offline/online cycles cleanly, with no > divide error. > > This regressed in ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}", > v3.17), which moved cpuset_attach() to the effective mask but left this > rebind on cs->mems_allowed. > > Happy to send this as a proper patch (Fixes: ae1c802382f7, Cc: stable, > reproducer) if you agree the cpuset side is right, or to test a > mempolicy-side fix if not. Yes, please send a patch, including a high-level explanation of what you analyzed above! -- Cheers, David ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed 2026-06-05 15:18 ` David Hildenbrand (Arm) @ 2026-06-09 23:57 ` Farhad Alemi 2026-06-10 0:53 ` Andrew Morton ` (3 more replies) 0 siblings, 4 replies; 29+ messages in thread From: Farhad Alemi @ 2026-06-09 23:57 UTC (permalink / raw) To: Andrew Morton, David Hildenbrand, Gregory Price Cc: Farhad Alemi, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, Waiman Long, Rasmus Villemoes, linux-mm, linux-kernel, cgroups, stable cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the cpuset's effective, online mems (newmems, from guarantee_online_mems()), but rebinds that task's VMA mempolicies to the *configured* mask instead: cpuset_change_task_nodemask(task, &newmems); ... mpol_rebind_mm(mm, &cs->mems_allowed); On the default (v2) hierarchy a cpuset that has never had cpuset.mems written keeps mems_allowed empty while effective_mems is inherited non-empty from the parent, and tasks may be attached to it (the empty-mems attach check is v1-only). A subsequent rebind -- e.g. from a CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with an empty mask. For a VMA policy created with MPOL_F_RELATIVE_NODES this reaches mpol_relative_nodemask() -> nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(), whose set_bit(oldbit % sz, dst) divides by zero: Oops: divide error: 0000 [#1] SMP KASAN NOPTI RIP: 0010:bitmap_fold+0x5e/0xb0 mpol_rebind_nodemask mpol_rebind_mm cpuset_update_tasks_nodemask cpuset_handle_hotplug sched_cpu_deactivate cpuhp_thread_fun cs->mems_allowed is the only nodemask in this function that is not the effective set: the task-policy rebind, the page-migration target and cs->old_mems_allowed all use newmems. 
The sibling cpuset_attach() path already rebinds VMA policies against the effective mems (cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes that mems_allowed can be empty under hotplug. Rebind the VMA policies to newmems too: it is guaranteed non-empty by guarantee_online_mems(), which fixes the divide-by-zero, and it makes the VMA policies consistent with the task policy and with the nodes the task is actually allowed to use. Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") Suggested-by: Gregory Price <gourry@gourry.net> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu> Cc: stable@vger.kernel.org --- kernel/cgroup/cpuset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs) migrate = is_memory_migrate(cs); - mpol_rebind_mm(mm, &cs->mems_allowed); + mpol_rebind_mm(mm, &newmems); if (migrate) cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems); else -- 2.43.0 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed 2026-06-09 23:57 ` [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed Farhad Alemi @ 2026-06-10 0:53 ` Andrew Morton 2026-06-10 11:34 ` Gregory Price ` (2 subsequent siblings) 3 siblings, 0 replies; 29+ messages in thread From: Andrew Morton @ 2026-06-10 0:53 UTC (permalink / raw) To: Farhad Alemi Cc: David Hildenbrand, Gregory Price, Farhad Alemi, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, Waiman Long, Rasmus Villemoes, linux-mm, linux-kernel, cgroups, stable On Tue, 9 Jun 2026 19:57:41 -0400 Farhad Alemi <farhad.alemi@berkeley.edu> wrote: > cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the > cpuset's effective, online mems (newmems, from guarantee_online_mems()), > but rebinds that task's VMA mempolicies to the *configured* mask instead: Hard to understand. Was "rebinds" supposed to be "is supposed to rebind"? > cpuset_change_task_nodemask(task, &newmems); > ... > mpol_rebind_mm(mm, &cs->mems_allowed); > > On the default (v2) hierarchy a cpuset that has never had cpuset.mems > written keeps mems_allowed empty while effective_mems is inherited > non-empty from the parent, and tasks may be attached to it (the > empty-mems attach check is v1-only). A subsequent rebind -- e.g. from a > CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with > an empty mask. 
For a VMA policy created with MPOL_F_RELATIVE_NODES this > reaches mpol_relative_nodemask() -> > nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(), > whose set_bit(oldbit % sz, dst) divides by zero: > > Oops: divide error: 0000 [#1] SMP KASAN NOPTI > RIP: 0010:bitmap_fold+0x5e/0xb0 > mpol_rebind_nodemask > mpol_rebind_mm > cpuset_update_tasks_nodemask > cpuset_handle_hotplug > sched_cpu_deactivate > cpuhp_thread_fun > > cs->mems_allowed is the only nodemask in this function that is not the > effective set: the task-policy rebind, the page-migration target and > cs->old_mems_allowed all use newmems. The sibling cpuset_attach() path > already rebinds VMA policies against the effective mems > (cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes > that mems_allowed can be empty under hotplug. Rebind the VMA policies to > newmems too: it is guaranteed non-empty by guarantee_online_mems(), which > fixes the divide-by-zero, and it makes the VMA policies consistent with > the task policy and with the nodes the task is actually allowed to use. How is this bug triggered? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed 2026-06-09 23:57 ` [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed Farhad Alemi 2026-06-10 0:53 ` Andrew Morton @ 2026-06-10 11:34 ` Gregory Price 2026-06-11 2:50 ` Waiman Long 2026-06-14 13:25 ` [PATCH v2] " Farhad Alemi 3 siblings, 0 replies; 29+ messages in thread From: Gregory Price @ 2026-06-10 11:34 UTC (permalink / raw) To: Farhad Alemi Cc: Andrew Morton, David Hildenbrand, Farhad Alemi, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, Waiman Long, Rasmus Villemoes, linux-mm, linux-kernel, cgroups, stable On Tue, Jun 09, 2026 at 07:57:41PM -0400, Farhad Alemi wrote: > cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the > cpuset's effective, online mems (newmems, from guarantee_online_mems()), > but rebinds that task's VMA mempolicies to the *configured* mask instead: > > cpuset_change_task_nodemask(task, &newmems); > ... > mpol_rebind_mm(mm, &cs->mems_allowed); > > On the default (v2) hierarchy a cpuset that has never had cpuset.mems > written keeps mems_allowed empty while effective_mems is inherited > non-empty from the parent, and tasks may be attached to it (the > empty-mems attach check is v1-only). A subsequent rebind -- e.g. from a > CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with > an empty mask. 
For a VMA policy created with MPOL_F_RELATIVE_NODES this
> reaches mpol_relative_nodemask() ->
> nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(),
> whose set_bit(oldbit % sz, dst) divides by zero:
>
> Oops: divide error: 0000 [#1] SMP KASAN NOPTI
> RIP: 0010:bitmap_fold+0x5e/0xb0
> mpol_rebind_nodemask
> mpol_rebind_mm
> cpuset_update_tasks_nodemask
> cpuset_handle_hotplug
> sched_cpu_deactivate
> cpuhp_thread_fun
>
> cs->mems_allowed is the only nodemask in this function that is not the
> effective set: the task-policy rebind, the page-migration target and
> cs->old_mems_allowed all use newmems. The sibling cpuset_attach() path
> already rebinds VMA policies against the effective mems
> (cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes
> that mems_allowed can be empty under hotplug. Rebind the VMA policies to
> newmems too: it is guaranteed non-empty by guarantee_online_mems(), which
> fixes the divide-by-zero, and it makes the VMA policies consistent with
> the task policy and with the nodes the task is actually allowed to use.
>

I think you can make this a bit more concise:

Creating a child cpuset where cpuset.mems is never set leads to a div/0
when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a
CPU hotplug event.

Reproduction steps:
1) Create a cgroup w/ cpuset controls (do not set cpuset.mems)
2) Move the task into the child cpuset
3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES
4) unplug and hotplug a cpu
   echo 0 > /sys/devices/system/cpu/cpu1/online
   echo 1 > /sys/devices/system/cpu/cpu1/online
5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the
   call to __nodes_fold()

The cpuset code passes (cs->mems_allowed) which is not guaranteed to have
nodes to the rebind routine. Use mems_effective - the value returned by
guarantee_online_mems() - instead, which is guaranteed to have a
non-empty nodemask.

Maybe add a link to your reproducer and the original [BUG] Link: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/ Does this need a Closes tag? > Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") > Suggested-by: Gregory Price <gourry@gourry.net> > Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu> > Cc: stable@vger.kernel.org > --- > kernel/cgroup/cpuset.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs) > > migrate = is_memory_migrate(cs); > > - mpol_rebind_mm(mm, &cs->mems_allowed); > + mpol_rebind_mm(mm, &newmems); > if (migrate) > cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems); > else > -- > 2.43.0 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed 2026-06-09 23:57 ` [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed Farhad Alemi 2026-06-10 0:53 ` Andrew Morton 2026-06-10 11:34 ` Gregory Price @ 2026-06-11 2:50 ` Waiman Long 2026-06-14 13:25 ` [PATCH v2] " Farhad Alemi 3 siblings, 0 replies; 29+ messages in thread From: Waiman Long @ 2026-06-11 2:50 UTC (permalink / raw) To: Farhad Alemi, Andrew Morton, David Hildenbrand, Gregory Price Cc: Farhad Alemi, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel, cgroups, stable On 6/9/26 7:57 PM, Farhad Alemi wrote: > cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the > cpuset's effective, online mems (newmems, from guarantee_online_mems()), > but rebinds that task's VMA mempolicies to the *configured* mask instead: > > cpuset_change_task_nodemask(task, &newmems); > ... > mpol_rebind_mm(mm, &cs->mems_allowed); > > On the default (v2) hierarchy a cpuset that has never had cpuset.mems > written keeps mems_allowed empty while effective_mems is inherited > non-empty from the parent, and tasks may be attached to it (the > empty-mems attach check is v1-only). A subsequent rebind -- e.g. from a > CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with > an empty mask. 
For a VMA policy created with MPOL_F_RELATIVE_NODES this
> reaches mpol_relative_nodemask() ->
> nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(),
> whose set_bit(oldbit % sz, dst) divides by zero:
>
> Oops: divide error: 0000 [#1] SMP KASAN NOPTI
> RIP: 0010:bitmap_fold+0x5e/0xb0
> mpol_rebind_nodemask
> mpol_rebind_mm
> cpuset_update_tasks_nodemask
> cpuset_handle_hotplug
> sched_cpu_deactivate
> cpuhp_thread_fun
>
> cs->mems_allowed is the only nodemask in this function that is not the
> effective set: the task-policy rebind, the page-migration target and
> cs->old_mems_allowed all use newmems. The sibling cpuset_attach() path
> already rebinds VMA policies against the effective mems
> (cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes
> that mems_allowed can be empty under hotplug. Rebind the VMA policies to
> newmems too: it is guaranteed non-empty by guarantee_online_mems(), which
> fixes the divide-by-zero, and it makes the VMA policies consistent with
> the task policy and with the nodes the task is actually allowed to use.
>
> Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Cc: stable@vger.kernel.org
> ---
> kernel/cgroup/cpuset.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>
> migrate = is_memory_migrate(cs);
>
> - mpol_rebind_mm(mm, &cs->mems_allowed);
> + mpol_rebind_mm(mm, &newmems);
> if (migrate)
> cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
> else

Could you change it to &cs->effective_mems instead? For v2, effective_mems
will never be empty.

In fact, this is part of the following patch https://lore.kernel.org/lkml/20260604150229.414135-2-longman@redhat.com/ Given that this bug can crash the kernel, it should be separated out as a separate patch. Cheers, Longman ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed 2026-06-09 23:57 ` [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed Farhad Alemi ` (2 preceding siblings ...) 2026-06-11 2:50 ` Waiman Long @ 2026-06-14 13:25 ` Farhad Alemi 2026-06-15 8:08 ` David Hildenbrand (Arm) 3 siblings, 1 reply; 29+ messages in thread From: Farhad Alemi @ 2026-06-14 13:25 UTC (permalink / raw) To: Andrew Morton, Waiman Long Cc: Farhad Alemi, David Hildenbrand, Gregory Price, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel, cgroups, stable Creating a child cpuset where cpuset.mems is never set leads to a div/0 when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a CPU hotplug event. Reproduction steps: 1) Create a cgroup w/ cpuset controls (do not set cpuset.mems) 2) Move the task into the child cpuset 3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES 4) unplug and hotplug a cpu echo 0 > /sys/devices/system/cpu/cpu1/online echo 1 > /sys/devices/system/cpu/cpu1/online 5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the call to __nodes_fold() The cpuset code passes (cs->mems_allowed) which is not guaranteed to have nodes to the rebind routine. Use cs->effective_mems instead, which is guaranteed to have a non-empty nodemask. Link: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/ Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") Suggested-by: Gregory Price <gourry@gourry.net> Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu> Cc: stable@vger.kernel.org --- v2: rebind to cs->effective_mems instead of newmems (Waiman Long); condense the changelog. 
kernel/cgroup/cpuset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs) migrate = is_memory_migrate(cs); - mpol_rebind_mm(mm, &cs->mems_allowed); + mpol_rebind_mm(mm, &cs->effective_mems); if (migrate) cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems); else -- 2.43.0 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed 2026-06-14 13:25 ` [PATCH v2] " Farhad Alemi @ 2026-06-15 8:08 ` David Hildenbrand (Arm) 2026-06-15 9:38 ` Gregory Price 0 siblings, 1 reply; 29+ messages in thread From: David Hildenbrand (Arm) @ 2026-06-15 8:08 UTC (permalink / raw) To: Farhad Alemi, Andrew Morton, Waiman Long Cc: Farhad Alemi, Gregory Price, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel, cgroups, stable On 6/14/26 15:25, Farhad Alemi wrote: Hi, thanks for your patch! For the future, please don't submit new revisions as reply to previous submissions. > Creating a child cpuset where cpuset.mems is never set leads to a div/0 > when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a > CPU hotplug event. > > Reproduction steps: > 1) Create a cgroup w/ cpuset controls (do not set cpuset.mems) > 2) Move the task into the child cpuset > 3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES > 4) unplug and hotplug a cpu > echo 0 > /sys/devices/system/cpu/cpu1/online > echo 1 > /sys/devices/system/cpu/cpu1/online > 5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the > call to __nodes_fold() > > The cpuset code passes (cs->mems_allowed) which is not guaranteed to have > nodes to the rebind routine. Use cs->effective_mems instead, which is > guaranteed to have a non-empty nodemask. Probably worth mentioning here that this makes the linked reproducer happy. 
> > Link: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ This should be a Closes: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/ > Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/ > Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") > Suggested-by: Gregory Price <gourry@gourry.net> > Suggested-by: Waiman Long <longman@redhat.com> > Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu> > Cc: stable@vger.kernel.org > --- > v2: rebind to cs->effective_mems instead of newmems (Waiman Long); > condense the changelog. > > kernel/cgroup/cpuset.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs) > > migrate = is_memory_migrate(cs); > > - mpol_rebind_mm(mm, &cs->mems_allowed); > + mpol_rebind_mm(mm, &cs->effective_mems); God this is confusing. So, we obtain newmems from guarantee_online_mems(), which guarantees that newmems is non-empty. In cpuset_change_task_nodemask(), we set tsk->mems_allowed to newmems, and call mpol_rebind_task(tsk, newmems). So at least tsk->mems_allowed should be non-empty. Then we call mpol_rebind_mm(mm, &cs->mems_allowed); Naturally I wonder: Why are we not using "task->mems_allowed" (maybe cs vs. tsk was the original bug?), which is effectively just newmems? guarantee_online_mems() computes newmems as "cs->effective_mems & node_states[N_MEMORY]", but walks up to the parent if it would be empty. -- Cheers, David ^ permalink raw reply [flat|nested] 29+ messages in thread
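
The parent walk described above (take effective_mems restricted to nodes with memory, and climb to the parent while that is empty) can be sketched in user space. This is a rough illustration under stated assumptions: struct toy_cpuset and toy_online_mems() are hypothetical names, masks are plain 64-bit integers, and the NULL-parent check for the root is added for the sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 64-bit stand-in for a nodemask; NOT the kernel's nodemask_t. */
typedef uint64_t toy_mask_t;

/* Hypothetical, heavily reduced model of a cpuset. */
struct toy_cpuset {
	struct toy_cpuset *parent;	/* NULL for the root */
	toy_mask_t mems_allowed;	/* configured mask, may be empty on v2 */
	toy_mask_t effective_mems;	/* configured or inherited mask */
};

/*
 * Return the effective mems restricted to "online" (nodes with memory),
 * walking up to the parent while the intersection is empty. The result
 * is non-empty as long as some ancestor overlaps "online".
 */
static toy_mask_t toy_online_mems(const struct toy_cpuset *cs,
				  toy_mask_t online)
{
	assert(cs != NULL);
	while (cs->parent && !(cs->effective_mems & online))
		cs = cs->parent;
	return cs->effective_mems & online;
}
```

For a v2 child that never wrote cpuset.mems, mems_allowed is 0 in this model while toy_online_mems() still yields the inherited nodes, which is why a mask computed this way is safe to hand to a rebind and the configured mask is not.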
* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
2026-06-15 8:08 ` David Hildenbrand (Arm)
@ 2026-06-15 9:38 ` Gregory Price
2026-06-15 11:08 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 29+ messages in thread
From: Gregory Price @ 2026-06-15 9:38 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Farhad Alemi, Andrew Morton, Waiman Long, Farhad Alemi, Yury Norov,
Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel,
cgroups, stable

On Mon, Jun 15, 2026 at 10:08:51AM +0200, David Hildenbrand (Arm) wrote:
> On 6/14/26 15:25, Farhad Alemi wrote:
> >
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
> >
> > migrate = is_memory_migrate(cs);
> >
> > - mpol_rebind_mm(mm, &cs->mems_allowed);
> > + mpol_rebind_mm(mm, &cs->effective_mems);
>
> God this is confusing.
>

All interactions between mempolicy and cpuset are horrible and
confusing. Much like Lorenzo's anon_vma work, I have to keep
notes on how this whole thing doesn't just spew SIGBUS constantly.

The short answer is: mempolicy is advisory and cpuset is strictly
followed - in a dispute cpuset wins... except for file backed memory,
then everyone loses and nothing is consistent.

> Naturally I wonder: Why are we not using "task->mems_allowed" (maybe cs vs. tsk
> was the original bug?), which is effectively just newmems?
>

Short answer: task->mems_allowed is protected by the task lock and we
don't hold the task lock for a foreign task (not-current) over mm
operations.

Long answer: Reasons and "Stop looking at the spaghetti, it's going to
break"

~Gregory

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
2026-06-15 9:38 ` Gregory Price
@ 2026-06-15 11:08 ` David Hildenbrand (Arm)
2026-06-15 11:19 ` Gregory Price
0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-15 11:08 UTC (permalink / raw)
To: Gregory Price
Cc: Farhad Alemi, Andrew Morton, Waiman Long, Farhad Alemi, Yury Norov,
Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel,
cgroups, stable

On 6/15/26 11:38, Gregory Price wrote:
> On Mon, Jun 15, 2026 at 10:08:51AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/14/26 15:25, Farhad Alemi wrote:
>>>
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>>>
>>> migrate = is_memory_migrate(cs);
>>>
>>> - mpol_rebind_mm(mm, &cs->mems_allowed);
>>> + mpol_rebind_mm(mm, &cs->effective_mems);
>>
>> God this is confusing.
>>
>
> All interactions between mempolicy and cpuset are horrible and
> confusing. Much like Lorenzo's anon_vma work, I have to keep
> notes on how this whole thing doesn't just spew SIGBUS constantly.
>
> The short answer is: mempolicy is advisory and cpuset is strictly
> followed - in a dispute cpuset wins... except for file backed memory,
> then everyone loses and nothing is consistent.
>
>> Naturally I wonder: Why are we not using "task->mems_allowed" (maybe cs vs. tsk
>> was the original bug?), which is effectively just newmems?
>>
>
> Short answer: task->mems_allowed is protected by the task lock and we
> don't hold the task lock for a foreign task (not-current) over mm
> operations.

Well, we can just use newmems, which cannot change?

Again, that is based on cs->effective_mems but is guaranteed to return
something non-empty.

AI was not able to convince me (neither was I able to convince AI) that there is not some obscure cgroup v1 scenario where the current fix would also be wrong. With newmems it's clear that it is guaranteed to not be empty. -- Cheers, David ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
2026-06-15 11:08 ` David Hildenbrand (Arm)
@ 2026-06-15 11:19 ` Gregory Price
2026-06-15 11:39 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 29+ messages in thread
From: Gregory Price @ 2026-06-15 11:19 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Farhad Alemi, Andrew Morton, Waiman Long, Farhad Alemi, Yury Norov,
Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel,
cgroups, stable

On Mon, Jun 15, 2026 at 01:08:16PM +0200, David Hildenbrand (Arm) wrote:
> With newmems it's clear that it is guaranteed to not be empty.

I hadn't noticed he switched the patch from newmems -> effective_mems.

This needs to be changed back to newmems, otherwise we're depending on
a derivative value set somewhere else in the code being correct instead
of using what we *know* is correct *at the moment we need to use it*.

So yes, go back to using newmems.

~Gregory
* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
2026-06-15 11:19 ` Gregory Price
@ 2026-06-15 11:39 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-15 11:39 UTC (permalink / raw)
To: Gregory Price
Cc: Farhad Alemi, Andrew Morton, Waiman Long, Farhad Alemi, Yury Norov,
Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel,
cgroups, stable

On 6/15/26 13:19, Gregory Price wrote:
> On Mon, Jun 15, 2026 at 01:08:16PM +0200, David Hildenbrand (Arm) wrote:
>> With newmems it's clear that it is guaranteed to not be empty.
>
> I hadn't noticed he switched the patch from newmems -> effective_mems.
>
> This needs to be changed back to newmems, otherwise we're depending on
> a derivative value set somewhere else in the code being correct instead
> of using what we *know* is correct *at the moment we need to use it*.
>
> So yes, go back to using newmems.

Right, that's what v1 did looking at this now. Waiman requested the change, but
I don't think we want that.

So for v1:

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

--
Cheers,

David
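
[Editorial sketch] The sub-thread above only says that newmems is derived from
cs->effective_mems and is guaranteed to be non-empty at the point of use; it
does not show how that guarantee is obtained. One plausible way to get such a
guarantee is to walk towards the root cpuset until a mask with usable nodes is
found. The userspace model below illustrates that idea only. Everything in it
(struct toy_cpuset, toy_pick_newmems(), the 64-bit mask type, the "online"
argument) is a hypothetical stand-in and not the kernel's data structures or
helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for nodemask_t: one bit per node, at most 64 nodes. */
typedef uint64_t toy_nodemask_t;

/* Hypothetical, much-reduced cpuset: only the fields this sketch needs. */
struct toy_cpuset {
	struct toy_cpuset *parent;     /* NULL for the root */
	toy_nodemask_t effective_mems; /* may be empty in this toy model */
};

/*
 * Walk towards the root until a cpuset's effective_mems intersects the
 * set of nodes that currently have memory, then return the intersection.
 *
 * Assumption of the sketch: the root always intersects `online`. If it
 * does not, the function returns an empty mask rather than looping.
 */
static toy_nodemask_t toy_pick_newmems(const struct toy_cpuset *cs,
				       toy_nodemask_t online)
{
	while (cs->parent && !(cs->effective_mems & online))
		cs = cs->parent;

	return cs->effective_mems & online;
}
```

The point of computing the mask right before it is used, rather than reading a
stored field, is the one Gregory makes: the caller relies on a value it has
just validated instead of on another code path having kept a field consistent.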
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov
` (2 preceding siblings ...)
2026-05-28 19:41 ` Andrew Morton
@ 2026-05-29 8:47 ` kernel test robot
2026-05-29 8:58 ` kernel test robot
` (3 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2026-05-29 8:47 UTC (permalink / raw)
To: Yury Norov, Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, linux-kernel
Cc: oe-kbuild-all, Linux Memory Management List, Yury Norov, Farhad Alemi,
Waiman Long, Rasmus Villemoes, cgroups

Hi Yury,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Yury-Norov/mm-don-t-allow-empty-relative-nodemask-in-mpol_relative_nodemask/20260529-030835
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260528190337.878027-1-ynorov%40nvidia.com
patch subject: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
config: x86_64-buildonly-randconfig-003-20260529 (https://download.01.org/0day-ci/archive/20260529/202605291631.6MATSv6v-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260529/202605291631.6MATSv6v-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605291631.6MATSv6v-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/mempolicy.c: In function 'mpol_relative_nodemask':
>> mm/mempolicy.c:377:24: error: 'return' with a value, in function returning void [-Wreturn-mismatch]
     377 |                 return -EINVAL;
         |                        ^
   mm/mempolicy.c:370:13: note: declared here
     370 | static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
         |             ^~~~~~~~~~~~~~~~~~~~~~

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for MFD_STMFX
   Depends on [n]: HAS_IOMEM [=y] && I2C [=y] && OF [=n]
   Selected by [y]:
   - PINCTRL_STMFX [=y] && PINCTRL [=y] && I2C [=y] && HAS_IOMEM [=y]

vim +/return +377 mm/mempolicy.c

   369
   370  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
   371                                     const nodemask_t *rel)
   372  {
   373          unsigned int w = nodes_weight(*rel);
   374          nodemask_t tmp;
   375
   376          if (w == 0)
 > 377                  return -EINVAL;
   378
   379          nodes_fold(tmp, *orig, w);
   380          nodes_onto(*ret, tmp, *rel);
   381  }
   382

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov
` (3 preceding siblings ...)
2026-05-29 8:47 ` [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() kernel test robot
@ 2026-05-29 8:58 ` kernel test robot
2026-05-29 12:45 ` kernel test robot
` (2 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2026-05-29 8:58 UTC (permalink / raw)
To: Yury Norov, Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, linux-kernel
Cc: llvm, oe-kbuild-all, Linux Memory Management List, Yury Norov,
Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups

Hi Yury,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Yury-Norov/mm-don-t-allow-empty-relative-nodemask-in-mpol_relative_nodemask/20260529-030835
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260528190337.878027-1-ynorov%40nvidia.com
patch subject: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20260529/202605291609.AR5UEvmT-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260529/202605291609.AR5UEvmT-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605291609.AR5UEvmT-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/mempolicy.c:377:3: warning: void function 'mpol_relative_nodemask' should not return a value [-Wreturn-mismatch]
     377 |                 return -EINVAL;
         |                 ^      ~~~~~~~
   1 warning generated.

vim +/mpol_relative_nodemask +377 mm/mempolicy.c

   369
   370  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
   371                                     const nodemask_t *rel)
   372  {
   373          unsigned int w = nodes_weight(*rel);
   374          nodemask_t tmp;
   375
   376          if (w == 0)
 > 377                  return -EINVAL;
   378
   379          nodes_fold(tmp, *orig, w);
   380          nodes_onto(*ret, tmp, *rel);
   381  }
   382

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov
` (4 preceding siblings ...)
2026-05-29 8:58 ` kernel test robot
@ 2026-05-29 12:45 ` kernel test robot
2026-05-29 12:47 ` kernel test robot
2026-06-01 14:06 ` David Hildenbrand (Arm)
7 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2026-05-29 12:45 UTC (permalink / raw)
To: Yury Norov, Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, linux-kernel
Cc: llvm, oe-kbuild-all, Linux Memory Management List, Yury Norov,
Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups

Hi Yury,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Yury-Norov/mm-don-t-allow-empty-relative-nodemask-in-mpol_relative_nodemask/20260529-030835
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260528190337.878027-1-ynorov%40nvidia.com
patch subject: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20260529/202605291432.MbAf9EG6-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260529/202605291432.MbAf9EG6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605291432.MbAf9EG6-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/mempolicy.c:377:3: warning: void function 'mpol_relative_nodemask' should not return a value [-Wreturn-mismatch]
     377 |                 return -EINVAL;
         |                 ^      ~~~~~~~
   1 warning generated.

vim +/mpol_relative_nodemask +377 mm/mempolicy.c

   369
   370  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
   371                                     const nodemask_t *rel)
   372  {
   373          unsigned int w = nodes_weight(*rel);
   374          nodemask_t tmp;
   375
   376          if (w == 0)
 > 377                  return -EINVAL;
   378
   379          nodes_fold(tmp, *orig, w);
   380          nodes_onto(*ret, tmp, *rel);
   381  }
   382

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov
` (5 preceding siblings ...)
2026-05-29 12:45 ` kernel test robot
@ 2026-05-29 12:47 ` kernel test robot
2026-06-01 14:06 ` David Hildenbrand (Arm)
7 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2026-05-29 12:47 UTC (permalink / raw)
To: Yury Norov, Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, linux-kernel
Cc: oe-kbuild-all, Linux Memory Management List, Yury Norov, Farhad Alemi,
Waiman Long, Rasmus Villemoes, cgroups

Hi Yury,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Yury-Norov/mm-don-t-allow-empty-relative-nodemask-in-mpol_relative_nodemask/20260529-030835
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260528190337.878027-1-ynorov%40nvidia.com
patch subject: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
config: sparc64-randconfig-002-20260529 (https://download.01.org/0day-ci/archive/20260529/202605292049.eaIv99hr-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260529/202605292049.eaIv99hr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605292049.eaIv99hr-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/mempolicy.c: In function 'mpol_relative_nodemask':
>> mm/mempolicy.c:377:10: warning: 'return' with a value, in function returning void
      return -EINVAL;
             ^
   mm/mempolicy.c:370:13: note: declared here
    static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
                ^~~~~~~~~~~~~~~~~~~~~~

vim +/return +377 mm/mempolicy.c

   369
   370  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
   371                                     const nodemask_t *rel)
   372  {
   373          unsigned int w = nodes_weight(*rel);
   374          nodemask_t tmp;
   375
   376          if (w == 0)
 > 377                  return -EINVAL;
   378
   379          nodes_fold(tmp, *orig, w);
   380          nodes_onto(*ret, tmp, *rel);
   381  }
   382

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
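
[Editorial sketch] The robot reports above all flag the same problem: the patch
adds "return -EINVAL;" to a function declared as returning void. Two obvious
ways to resolve that are a bare "return;" (the caller never learns about the
empty mask) or changing the return type to int so the error can propagate. The
thread does not say which one a later revision picks. The userspace model below
illustrates the second option only, together with the reason the guard is
needed: folding onto a width of zero is a modulo by zero. All toy_* names and
the 64-bit mask type are hypothetical stand-ins for nodemask_t, nodes_weight(),
nodes_fold() and nodes_onto(); this is not the kernel implementation.

```c
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for nodemask_t: one bit per node, at most 64 nodes. */
typedef uint64_t toy_nodemask_t;

/* Number of set bits, modelling nodes_weight(). */
static unsigned int toy_weight(toy_nodemask_t m)
{
	unsigned int w = 0;

	for (; m; m &= m - 1)
		w++;
	return w;
}

/* Model of nodes_fold(): bit n of orig lands on bit (n % sz); sz must be > 0. */
static toy_nodemask_t toy_fold(toy_nodemask_t orig, unsigned int sz)
{
	toy_nodemask_t dst = 0;

	for (unsigned int n = 0; n < 64; n++)
		if (orig & (1ULL << n))
			dst |= 1ULL << (n % sz);
	return dst;
}

/* Model of nodes_onto(): the m-th set bit of rel is kept if bit m of orig is set. */
static toy_nodemask_t toy_onto(toy_nodemask_t orig, toy_nodemask_t rel)
{
	toy_nodemask_t dst = 0;
	unsigned int m = 0;

	for (unsigned int n = 0; n < 64; n++) {
		if (!(rel & (1ULL << n)))
			continue;
		if (orig & (1ULL << m))
			dst |= 1ULL << n;
		m++;
	}
	return dst;
}

/* Returns 0 on success, -EINVAL for an empty rel; *ret is untouched on error. */
static int toy_relative_nodemask(toy_nodemask_t *ret, toy_nodemask_t orig,
				 toy_nodemask_t rel)
{
	unsigned int w = toy_weight(rel);

	if (w == 0)
		return -EINVAL; /* without this check toy_fold() would compute n % 0 */

	*ret = toy_onto(toy_fold(orig, w), rel);
	return 0;
}
```

With an int return type every caller has to decide what to do with the error,
which is the larger part of such a change. A bare "return;" keeps the signature
but leaves *ret unset on the error path, so a caller would need to initialise
it beforehand.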
* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
2026-05-28 19:03 [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask() Yury Norov
` (6 preceding siblings ...)
2026-05-29 12:47 ` kernel test robot
@ 2026-06-01 14:06 ` David Hildenbrand (Arm)
7 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-01 14:06 UTC (permalink / raw)
To: Yury Norov, Andrew Morton, Zi Yan, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel
Cc: Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups

On 5/28/26 21:03, Yury Norov wrote:
> Reassigning nodes relative an empty user-provided nodemask is useless,
> and triggers divide-by-zero in the function.
>
> Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/

Likely this should be a Closes:

And be accompanied by a Fixes: and Cc stable.

> Signed-off-by: Yury Norov <ynorov@nvidia.com>
> ---
> mm/mempolicy.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4e4421b22b59..cd961fa1eb33 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
> const nodemask_t *rel)
> {

Continuing the discussion of the context in the other thread :)

--
Cheers,

David