* [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
@ 2026-05-07 10:54 Chen Wandun
2026-05-07 12:33 ` Michal Koutný
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Chen Wandun @ 2026-05-07 10:54 UTC (permalink / raw)
To: longman, chenridong, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel
Since prepare_alloc_pages() unconditionally adds __GFP_HARDWALL for the
fast path when cpusets are enabled, the __GFP_HARDWALL check in
cpuset_current_node_allowed() causes the PF_EXITING escape path to be
skipped on the first allocation attempt. This makes it unreachable in
the common case, so dying tasks can get stuck in direct reclaim or even
trigger OOM while trying to exit, despite being allowed to allocate from
any node.
Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks
can allocate memory from any node to exit quickly, even when cpusets
are enabled.
Also update the function comment to reflect the actual behavior of
prepare_alloc_pages() and the corrected check ordering.
Signed-off-by: Chen Wandun <chenwandun@lixiang.com>
---
kernel/cgroup/cpuset.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e3a081a07c6d..a48901a0416a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4176,11 +4176,11 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
* current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this
* node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
* yes. If current has access to memory reserves as an oom victim, yes.
- * Otherwise, no.
+ * If the current task is PF_EXITING, yes. Otherwise, no.
*
* GFP_USER allocations are marked with the __GFP_HARDWALL bit,
* and do not allow allocations outside the current tasks cpuset
- * unless the task has been OOM killed.
+ * unless the task has been OOM killed or is exiting.
* GFP_KERNEL allocations are not so marked, so can escape to the
* nearest enclosing hardwalled ancestor cpuset.
*
@@ -4194,7 +4194,9 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
* The first call here from mm/page_alloc:get_page_from_freelist()
* has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
* so no allocation on a node outside the cpuset is allowed (unless
- * in interrupt, of course).
+ * in interrupt, of course). The PF_EXITING check must therefore
+ * come before the __GFP_HARDWALL check, otherwise a dying task
+ * would be blocked on the fast path.
*
* The second pass through get_page_from_freelist() doesn't even call
* here for GFP_ATOMIC calls. For those calls, the __alloc_pages()
@@ -4204,6 +4206,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
* in_interrupt - any node ok (current task context irrelevant)
* GFP_ATOMIC - any node ok
* tsk_is_oom_victim - any node ok
+ * PF_EXITING - any node ok (let dying task exit quickly)
* GFP_KERNEL - any node in enclosing hardwalled cpuset ok
* GFP_USER - only nodes in current tasks mems allowed ok.
*/
@@ -4223,11 +4226,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
*/
if (unlikely(tsk_is_oom_victim(current)))
return true;
- if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
- return false;
-
if (current->flags & PF_EXITING) /* Let dying task have memory */
return true;
+ if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
+ return false;
/* Not hardwall and node outside mems_allowed: scan up cpusets */
spin_lock_irqsave(&callback_lock, flags);
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-07 10:54 [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed() Chen Wandun
@ 2026-05-07 12:33 ` Michal Koutný
2026-05-07 13:53 ` Waiman Long
2026-05-08 6:27 ` Wandun
2026-05-07 22:00 ` Tejun Heo
2026-05-08 1:39 ` Chen Ridong
2 siblings, 2 replies; 7+ messages in thread
From: Michal Koutný @ 2026-05-07 12:33 UTC (permalink / raw)
To: Chen Wandun; +Cc: longman, chenridong, tj, hannes, cgroups, linux-kernel
On Thu, May 07, 2026 at 06:54:34PM +0800, Chen Wandun <chenwandun1@gmail.com> wrote:
> This makes it unreachable in the common case, so dying tasks can get
> stuck in direct reclaim or even trigger OOM while trying to exit,
> despite being allowed to allocate from any node.
(OTOH, the caused OOM could select this task and bypass the hardwall. So
this should only expedite, not unblock, the exit path.)
> Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks
> can allocate memory from any node to exit quickly, even when cpusets
> are enabled.
This makes sense to me on its own (given other hardwall exemptions,
namely the commit c596d9f320aaf ("cpusets: allow TIF_MEMDIE threads to
allocate anywhere")).
Acked-by: Michal Koutný <mkoutny@suse.com>
At first, I wondered whether this could happen on cpuset v2 -- it can --
because only per-cpuset hardwalling is absent but the generic logic for
GFP_USER allocations is still meant to be in place. Nevertheless, it
occurred to me that we can spare callback_lock in this function (a separate
change for cpuset_current_node_allowed()):
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4213,6 +4213,9 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
if (current->flags & PF_EXITING) /* Let dying task have memory */
return true;
+ if (is_in_v2_mode())
+ return true;
+
/* Not hardwall and node outside mems_allowed: scan up cpusets */
spin_lock_irqsave(&callback_lock, flags);
Regards,
Michal
* Re: [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-07 12:33 ` Michal Koutný
@ 2026-05-07 13:53 ` Waiman Long
2026-05-08 6:27 ` Wandun
1 sibling, 0 replies; 7+ messages in thread
From: Waiman Long @ 2026-05-07 13:53 UTC (permalink / raw)
To: Michal Koutný, Chen Wandun
Cc: chenridong, tj, hannes, cgroups, linux-kernel
On 5/7/26 8:33 AM, Michal Koutný wrote:
> On Thu, May 07, 2026 at 06:54:34PM +0800, Chen Wandun <chenwandun1@gmail.com> wrote:
>> This makes it unreachable in the common case, so dying tasks can get
>> stuck in direct reclaim or even trigger OOM while trying to exit,
>> despite being allowed to allocate from any node.
> (OTOH, the caused OOM could select this task and bypass the hardwall. So
> this should only expedite, not unblock, the exit path.)
>
>> Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks
>> can allocate memory from any node to exit quickly, even when cpusets
>> are enabled.
> This makes sense to me on its own (given other hardwall exemptions,
> namely the commit c596d9f320aaf ("cpusets: allow TIF_MEMDIE threads to
> allocate anywhere")).
>
> Acked-by: Michal Koutný <mkoutny@suse.com>
This looks good to me too.
Acked-by: Waiman Long <longman@redhat.com>
>
>
> At first, I wondered whether this could happen on cpuset v2 -- it can --
> because only per-cpuset hardwalling is absent but the generic logic for
> GFP_USER allocations is still meant to be in place. Nevertheless, it
> occurred to me that we can spare callback_lock in this function (a separate
> change for cpuset_current_node_allowed()):
>
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4213,6 +4213,9 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
> if (current->flags & PF_EXITING) /* Let dying task have memory */
> return true;
>
> + if (is_in_v2_mode())
> + return true;
> +
> /* Not hardwall and node outside mems_allowed: scan up cpusets */
> spin_lock_irqsave(&callback_lock, flags);
Yes, it is a performance optimization that is worth having, as cgroup v2
doesn't have the concept of a memory hardwall yet.
Cheers,
Longman
* Re: [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-07 10:54 [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed() Chen Wandun
2026-05-07 12:33 ` Michal Koutný
@ 2026-05-07 22:00 ` Tejun Heo
2026-05-08 1:39 ` Chen Ridong
2 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2026-05-07 22:00 UTC (permalink / raw)
To: Chen Wandun, Chen Wandun
Cc: longman, chenridong, hannes, mkoutny, cgroups, linux-kernel
Hello,
> Chen Wandun (1):
> cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
Applied to cgroup/for-7.1-fixes.
Thanks.
--
tejun
* Re: [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-07 10:54 [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed() Chen Wandun
2026-05-07 12:33 ` Michal Koutný
2026-05-07 22:00 ` Tejun Heo
@ 2026-05-08 1:39 ` Chen Ridong
2026-05-08 6:15 ` Wandun
2 siblings, 1 reply; 7+ messages in thread
From: Chen Ridong @ 2026-05-08 1:39 UTC (permalink / raw)
To: Chen Wandun, longman, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel
On 2026/5/7 18:54, Chen Wandun wrote:
> Since prepare_alloc_pages() unconditionally adds __GFP_HARDWALL for the
> fast path when cpusets are enabled, the __GFP_HARDWALL check in
> cpuset_current_node_allowed() causes the PF_EXITING escape path to be
> skipped on the first allocation attempt. This makes it unreachable in
> the common case, so dying tasks can get stuck in direct reclaim or even
> trigger OOM while trying to exit, despite being allowed to allocate from
> any node.
>
> Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks
> can allocate memory from any node to exit quickly, even when cpusets
> are enabled.
>
> Also update the function comment to reflect the actual behavior of
> prepare_alloc_pages() and the corrected check ordering.
>
> Signed-off-by: Chen Wandun <chenwandun@lixiang.com>
> ---
> kernel/cgroup/cpuset.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index e3a081a07c6d..a48901a0416a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4176,11 +4176,11 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
> * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this
> * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
> * yes. If current has access to memory reserves as an oom victim, yes.
> - * Otherwise, no.
> + * If the current task is PF_EXITING, yes. Otherwise, no.
> *
> * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
> * and do not allow allocations outside the current tasks cpuset
> - * unless the task has been OOM killed.
> + * unless the task has been OOM killed or is exiting.
> * GFP_KERNEL allocations are not so marked, so can escape to the
> * nearest enclosing hardwalled ancestor cpuset.
> *
> @@ -4194,7 +4194,9 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
> * The first call here from mm/page_alloc:get_page_from_freelist()
> * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
> * so no allocation on a node outside the cpuset is allowed (unless
> - * in interrupt, of course).
> + * in interrupt, of course). The PF_EXITING check must therefore
> + * come before the __GFP_HARDWALL check, otherwise a dying task
> + * would be blocked on the fast path.
> *
> * The second pass through get_page_from_freelist() doesn't even call
> * here for GFP_ATOMIC calls. For those calls, the __alloc_pages()
> @@ -4204,6 +4206,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
> * in_interrupt - any node ok (current task context irrelevant)
> * GFP_ATOMIC - any node ok
> * tsk_is_oom_victim - any node ok
> + * PF_EXITING - any node ok (let dying task exit quickly)
> * GFP_KERNEL - any node in enclosing hardwalled cpuset ok
> * GFP_USER - only nodes in current tasks mems allowed ok.
> */
> @@ -4223,11 +4226,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
> */
> if (unlikely(tsk_is_oom_victim(current)))
> return true;
> - if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
> - return false;
> -
> if (current->flags & PF_EXITING) /* Let dying task have memory */
> return true;
> + if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
> + return false;
>
> /* Not hardwall and node outside mems_allowed: scan up cpusets */
> spin_lock_irqsave(&callback_lock, flags);
Make sense.
BTW, how did you find this issue?
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
--
Best regards,
Ridong
* Re: [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-08 1:39 ` Chen Ridong
@ 2026-05-08 6:15 ` Wandun
0 siblings, 0 replies; 7+ messages in thread
From: Wandun @ 2026-05-08 6:15 UTC (permalink / raw)
To: Chen Ridong; +Cc: cgroups, linux-kernel, longman, tj, hannes, mkoutny
On 5/8/26 09:39, Chen Ridong wrote:
>
> On 2026/5/7 18:54, Chen Wandun wrote:
>> Since prepare_alloc_pages() unconditionally adds __GFP_HARDWALL for the
>> fast path when cpusets are enabled, the __GFP_HARDWALL check in
>> cpuset_current_node_allowed() causes the PF_EXITING escape path to be
>> skipped on the first allocation attempt. This makes it unreachable in
>> the common case, so dying tasks can get stuck in direct reclaim or even
>> trigger OOM while trying to exit, despite being allowed to allocate from
>> any node.
>>
>> Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks
>> can allocate memory from any node to exit quickly, even when cpusets
>> are enabled.
>>
>> Also update the function comment to reflect the actual behavior of
>> prepare_alloc_pages() and the corrected check ordering.
>>
>> Signed-off-by: Chen Wandun <chenwandun@lixiang.com>
>> ---
>> kernel/cgroup/cpuset.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index e3a081a07c6d..a48901a0416a 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -4176,11 +4176,11 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
>> * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this
>> * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
>> * yes. If current has access to memory reserves as an oom victim, yes.
>> - * Otherwise, no.
>> + * If the current task is PF_EXITING, yes. Otherwise, no.
>> *
>> * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
>> * and do not allow allocations outside the current tasks cpuset
>> - * unless the task has been OOM killed.
>> + * unless the task has been OOM killed or is exiting.
>> * GFP_KERNEL allocations are not so marked, so can escape to the
>> * nearest enclosing hardwalled ancestor cpuset.
>> *
>> @@ -4194,7 +4194,9 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
>> * The first call here from mm/page_alloc:get_page_from_freelist()
>> * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
>> * so no allocation on a node outside the cpuset is allowed (unless
>> - * in interrupt, of course).
>> + * in interrupt, of course). The PF_EXITING check must therefore
>> + * come before the __GFP_HARDWALL check, otherwise a dying task
>> + * would be blocked on the fast path.
>> *
>> * The second pass through get_page_from_freelist() doesn't even call
>> * here for GFP_ATOMIC calls. For those calls, the __alloc_pages()
>> @@ -4204,6 +4206,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
>> * in_interrupt - any node ok (current task context irrelevant)
>> * GFP_ATOMIC - any node ok
>> * tsk_is_oom_victim - any node ok
>> + * PF_EXITING - any node ok (let dying task exit quickly)
>> * GFP_KERNEL - any node in enclosing hardwalled cpuset ok
>> * GFP_USER - only nodes in current tasks mems allowed ok.
>> */
>> @@ -4223,11 +4226,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
>> */
>> if (unlikely(tsk_is_oom_victim(current)))
>> return true;
>> - if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
>> - return false;
>> -
>> if (current->flags & PF_EXITING) /* Let dying task have memory */
>> return true;
>> + if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
>> + return false;
>>
>> /* Not hardwall and node outside mems_allowed: scan up cpusets */
>> spin_lock_irqsave(&callback_lock, flags);
> Make sense.
>
> BTW, how did you find this issue?
I found this while reviewing the cpuset node-allowed logic during an
investigation into a memory allocation issue (not the root cause of
that investigation).
>
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
>
* Re: [PATCH] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-07 12:33 ` Michal Koutný
2026-05-07 13:53 ` Waiman Long
@ 2026-05-08 6:27 ` Wandun
1 sibling, 0 replies; 7+ messages in thread
From: Wandun @ 2026-05-08 6:27 UTC (permalink / raw)
To: Michal Koutný; +Cc: longman, chenridong, tj, hannes, cgroups, linux-kernel
On 5/7/26 20:33, Michal Koutný wrote:
> On Thu, May 07, 2026 at 06:54:34PM +0800, Chen Wandun <chenwandun1@gmail.com> wrote:
>> This makes it unreachable in the common case, so dying tasks can get
>> stuck in direct reclaim or even trigger OOM while trying to exit,
>> despite being allowed to allocate from any node.
> (OTOH, the caused OOM could select this task and bypass the hardwall. So
> this should only expedite, not unblock, the exit path.)
>
>> Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks
>> can allocate memory from any node to exit quickly, even when cpusets
>> are enabled.
> This makes sense to me on its own (given other hardwall exemptions,
> namely the commit c596d9f320aaf ("cpusets: allow TIF_MEMDIE threads to
> allocate anywhere")).
>
> Acked-by: Michal Koutný <mkoutny@suse.com>
>
>
> At first, I wondered whether this could happen on cpuset v2 -- it can --
> because only per-cpuset hardwalling is absent but the generic logic for
> GFP_USER allocations is still meant to be in place. Nevertheless, it
> occurred to me that we can spare callback_lock in this function (a separate
> change for cpuset_current_node_allowed()):
>
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4213,6 +4213,9 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
> if (current->flags & PF_EXITING) /* Let dying task have memory */
> return true;
>
> + if (is_in_v2_mode())
> + return true;
> +
Thanks for the suggestion! I'll send a separate patch for this
optimization.

Best regards,
Wandun
> /* Not hardwall and node outside mems_allowed: scan up cpusets */
> spin_lock_irqsave(&callback_lock, flags);
>
> Regards,
> Michal