linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
@ 2025-07-02 16:32 Chen Yu
  2025-07-02 21:08 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Chen Yu @ 2025-07-02 16:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Chen Yu, Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

It was reported that after Commit ad6b26b6a0a7
("sched/numa: add statistics of numa balance task"),
a NULL pointer exception[1] occurs when accessing
p->mm. The following race condition was found to
trigger this bug: After a swap task candidate is
chosen during NUMA balancing, its mm_struct is
released due to task exit. Later, when the task
swapping is performed, p->mm is NULL, which causes
the problem:

CPU0                                   CPU1
:
...
task_numa_migrate
   task_numa_find_cpu
    task_numa_compare
      # a normal task p is chosen
      env->best_task = p

                                        # p exit:
                                        exit_signals(p);
                                           p->flags |= PF_EXITING
                                        exit_mm
                                           p->mm = NULL;

    migrate_swap_stop
      __migrate_swap_task(arg->src_task, arg->dst_cpu)
       count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL

Fix this issue by checking if the task has the PF_EXITING
flag set in migrate_swap_stop(). If it does, skip updating
the memcg events. Additionally, log a warning if p->mm is
NULL to facilitate future debugging.

Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Suggested-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8988d38d46a3..4e06bb955dad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
 	__schedstat_inc(p->stats.numa_task_swapped);
 	count_vm_numa_event(NUMA_TASK_SWAP);
-	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
+	/* exiting task has NULL mm */
+	if (!(p->flags & PF_EXITING)) {
+		WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
+			  p->pid, p->comm, p->flags);
+
+		if (p->mm)
+			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
+	}
 
 	if (task_on_rq_queued(p)) {
 		struct rq *src_rq, *dst_rq;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-02 16:32 [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap Chen Yu
@ 2025-07-02 21:08 ` Andrew Morton
  2025-07-03  9:24   ` Chen, Yu C
  2025-07-03  7:18 ` Michal Hocko
  2025-07-03  7:26 ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2025-07-02 21:08 UTC (permalink / raw)
  To: Chen Yu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, linux-kernel, Jirka Hladky,
	Srikanth Aithal, Suneeth D, Libo Chen

On Thu,  3 Jul 2025 00:32:47 +0800 Chen Yu <yu.c.chen@intel.com> wrote:

> It was reported that after Commit ad6b26b6a0a7
> ("sched/numa: add statistics of numa balance task"),
> a NULL pointer exception[1] occurs when accessing
> p->mm. The following race condition was found to
> trigger this bug: After a swap task candidate is
> chosen during NUMA balancing, its mm_struct is
> released due to task exit. Later, when the task
> swapping is performed, p->mm is NULL, which causes
> the problem:
> 
> CPU0                                   CPU1
> :
> ...
> task_numa_migrate
>    task_numa_find_cpu
>     task_numa_compare
>       # a normal task p is chosen
>       env->best_task = p
> 
>                                         # p exit:
>                                         exit_signals(p);
>                                            p->flags |= PF_EXITING
>                                         exit_mm
>                                            p->mm = NULL;
> 
>     migrate_swap_stop
>       __migrate_swap_task((arg->src_task, arg->dst_cpu)
>        count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL
> 
> Fix this issue by checking if the task has the PF_EXITING
> flag set in migrate_swap_stop(). If it does, skip updating
> the memcg events. Additionally, log a warning if p->mm is
> NULL to facilitate future debugging.
>
> ...
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
>  {
>  	__schedstat_inc(p->stats.numa_task_swapped);
>  	count_vm_numa_event(NUMA_TASK_SWAP);
> -	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +	/* exiting task has NULL mm */
> +	if (!(p->flags & PF_EXITING)) {
> +		WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
> +			  p->pid, p->comm, p->flags);
> +
> +		if (p->mm)
> +			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +	}

I don't think we should warn on a condition which is known to occur and
which we successfully handle.  What action can anyone take upon that
warning?

Which means the change might as well become

+	/* comment goes here */
+	if (p->mm)
+		count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);

But is that a real fix?  Can the other thread call exit(), set
PF_EXITING and null its p->mm right between the above two lines?  After
the p->mm test and before the count_memcg_event_mm() call?

IOW, there needs to be some locking in place to stabilize p->mm
throughout the p->mm test and the count_memcg_event_mm() call?


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-02 16:32 [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap Chen Yu
  2025-07-02 21:08 ` Andrew Morton
@ 2025-07-03  7:18 ` Michal Hocko
  2025-07-03  9:37   ` Chen, Yu C
  2025-07-03  7:26 ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-07-03  7:18 UTC (permalink / raw)
  To: Chen Yu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On Thu 03-07-25 00:32:47, Chen Yu wrote:
> It was reported that after Commit ad6b26b6a0a7
> ("sched/numa: add statistics of numa balance task"),
> a NULL pointer exception[1] occurs when accessing
> p->mm. The following race condition was found to
> trigger this bug: After a swap task candidate is
> chosen during NUMA balancing, its mm_struct is
> released due to task exit. Later, when the task
> swapping is performed, p->mm is NULL, which causes
> the problem:
> 
> CPU0                                   CPU1
> :
> ...
> task_numa_migrate
>    task_numa_find_cpu
>     task_numa_compare
>       # a normal task p is chosen
>       env->best_task = p
> 
>                                         # p exit:
>                                         exit_signals(p);
>                                            p->flags |= PF_EXITING
>                                         exit_mm
>                                            p->mm = NULL;
> 
>     migrate_swap_stop
>       __migrate_swap_task((arg->src_task, arg->dst_cpu)
>        count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL
> 
> Fix this issue by checking if the task has the PF_EXITING
> flag set in migrate_swap_stop(). If it does, skip updating
> the memcg events. Additionally, log a warning if p->mm is
> NULL to facilitate future debugging.
> 
> Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
> Reported-by: Jirka Hladky <jhladky@redhat.com>
> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
> Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
> Reported-by: Suneeth D <Suneeth.D@amd.com>
> Suggested-by: Libo Chen <libo.chen@oracle.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  kernel/sched/core.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8988d38d46a3..4e06bb955dad 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
>  {
>  	__schedstat_inc(p->stats.numa_task_swapped);
>  	count_vm_numa_event(NUMA_TASK_SWAP);
> -	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +	/* exiting task has NULL mm */
> +	if (!(p->flags & PF_EXITING)) {
> +		WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
> +			  p->pid, p->comm, p->flags);

As Andrew already said this is not really acceptable because this is
very likely too easy to trigger, so a) you do not want logs flooded with
warnings and b) there are setups with panic_on_warn configured, and for
those this would be a fatal situation without any good reason.

> +
> +		if (p->mm)
> +			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +	}

Why are you testing for p->mm here? Isn't the PF_EXITING test sufficient?
A robust way to guarantee a non-NULL mm against races when a task is
exiting is find_lock_task_mm. Probably too heavyweight for this path.
>  
>  	if (task_on_rq_queued(p)) {
>  		struct rq *src_rq, *dst_rq;
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs
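
The task_lock()-based direction discussed above could look roughly like
the following sketch (an untested illustration, not a submitted patch; it
relies on the fact that exit_mm() clears p->mm under task_lock(p), so
holding the same lock across the test and the update closes the window):

```c
/* Sketch only: stabilize p->mm against a concurrent exit_mm(). */
static void __migrate_swap_task(struct task_struct *p, int cpu)
{
	__schedstat_inc(p->stats.numa_task_swapped);
	count_vm_numa_event(NUMA_TASK_SWAP);

	/*
	 * exit_mm() sets p->mm = NULL under task_lock(p), so testing and
	 * using p->mm under the same lock avoids the NULL dereference.
	 */
	task_lock(p);
	if (p->mm)
		count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
	task_unlock(p);

	/* remainder of the function unchanged */
}
```

Whether taking task_lock() in this path is acceptable is exactly the
cost question debated later in the thread.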


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-02 16:32 [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap Chen Yu
  2025-07-02 21:08 ` Andrew Morton
  2025-07-03  7:18 ` Michal Hocko
@ 2025-07-03  7:26 ` Peter Zijlstra
  2025-07-03  9:28   ` Michal Hocko
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-03  7:26 UTC (permalink / raw)
  To: Chen Yu
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Andrew Morton, Tim Chen, linux-kernel, Jirka Hladky,
	Srikanth Aithal, Suneeth D, Libo Chen

On Thu, Jul 03, 2025 at 12:32:47AM +0800, Chen Yu wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8988d38d46a3..4e06bb955dad 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
>  {
>  	__schedstat_inc(p->stats.numa_task_swapped);
>  	count_vm_numa_event(NUMA_TASK_SWAP);
> -	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +	/* exiting task has NULL mm */
> +	if (!(p->flags & PF_EXITING)) {
> +		WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
> +			  p->pid, p->comm, p->flags);
> +
> +		if (p->mm)
> +			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +	}

Aside from the things already mentioned by Andrew and Michal; why not
simply do something like:

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..8396ebfab0d5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -969,7 +969,7 @@ static inline void count_memcg_events_mm(struct mm_struct *mm,
 {
 	struct mem_cgroup *memcg;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() || !mm)
 		return;
 
 	rcu_read_lock();


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-02 21:08 ` Andrew Morton
@ 2025-07-03  9:24   ` Chen, Yu C
  0 siblings, 0 replies; 21+ messages in thread
From: Chen, Yu C @ 2025-07-03  9:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, linux-kernel, Jirka Hladky,
	Srikanth Aithal, Suneeth D, Libo Chen

Hi Andrew,

On 7/3/2025 5:08 AM, Andrew Morton wrote:
> On Thu,  3 Jul 2025 00:32:47 +0800 Chen Yu <yu.c.chen@intel.com> wrote:
> 
>> It was reported that after Commit ad6b26b6a0a7
>> ("sched/numa: add statistics of numa balance task"),
>> a NULL pointer exception[1] occurs when accessing
>> p->mm. The following race condition was found to
>> trigger this bug: After a swap task candidate is
>> chosen during NUMA balancing, its mm_struct is
>> released due to task exit. Later, when the task
>> swapping is performed, p->mm is NULL, which causes
>> the problem:
>>
>> CPU0                                   CPU1
>> :
>> ...
>> task_numa_migrate
>>     task_numa_find_cpu
>>      task_numa_compare
>>        # a normal task p is chosen
>>        env->best_task = p
>>
>>                                          # p exit:
>>                                          exit_signals(p);
>>                                             p->flags |= PF_EXITING
>>                                          exit_mm
>>                                             p->mm = NULL;
>>
>>      migrate_swap_stop
>>        __migrate_swap_task((arg->src_task, arg->dst_cpu)
>>         count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL
>>
>> Fix this issue by checking if the task has the PF_EXITING
>> flag set in migrate_swap_stop(). If it does, skip updating
>> the memcg events. Additionally, log a warning if p->mm is
>> NULL to facilitate future debugging.
>>
>> ...
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
>>   {
>>   	__schedstat_inc(p->stats.numa_task_swapped);
>>   	count_vm_numa_event(NUMA_TASK_SWAP);
>> -	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>> +	/* exiting task has NULL mm */
>> +	if (!(p->flags & PF_EXITING)) {
>> +		WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
>> +			  p->pid, p->comm, p->flags);
>> +
>> +		if (p->mm)
>> +			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>> +	}
> 
> I don't think we should warn on a condition which is known to occur and
> which we successfully handle.  What action can anyone take upon that
> warning?
> 
> Which means the change might as well become
> 
> +	/* comment goes here */
> +	if (p->mm)
> +		count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> 
> But is that a real fix?  Can the other thread call exit(), set
> PF_EXITING and null its p->mm right between the above two lines?  After
> the p->mm test and before the count_memcg_event_mm() call?
> 

Thank you for the scenario description. Right, there is still a race
condition: although p->pi_lock is held when reaching here, p->alloc_lock
should be held to access p->mm.

> IOW, there needs to be some locking in place to stabilize p->mm
> throughout the p->mm test and the count_memcg_event_mm() call?

Yes, Michal and Peter have given some feedback on how to do this.

thanks,
Chenyu



* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03  7:26 ` Peter Zijlstra
@ 2025-07-03  9:28   ` Michal Hocko
  2025-07-03 11:50     ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-07-03  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chen Yu, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On Thu 03-07-25 09:26:08, Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 12:32:47AM +0800, Chen Yu wrote:
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 8988d38d46a3..4e06bb955dad 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
> >  {
> >  	__schedstat_inc(p->stats.numa_task_swapped);
> >  	count_vm_numa_event(NUMA_TASK_SWAP);
> > -	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> > +	/* exiting task has NULL mm */
> > +	if (!(p->flags & PF_EXITING)) {
> > +		WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
> > +			  p->pid, p->comm, p->flags);
> > +
> > +		if (p->mm)
> > +			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> > +	}
> 
> Aside from the things already mentioned by Andrew and Michal; why not
> simply do something like:
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 87b6688f124a..8396ebfab0d5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -969,7 +969,7 @@ static inline void count_memcg_events_mm(struct mm_struct *mm,
>  {
>  	struct mem_cgroup *memcg;
>  
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_disabled() || !mm)
>  		return;

This would impose an mm check on all other users that know their mm is
valid because they are operating on vma->mm or the current task.

But thinking about this some more, this would be racy, same as the
PF_EXITING check. This is not my area, but is this a performance sensitive
path that couldn't live with the proper find_lock_task_mm?

I do not see other race free way to deal with a remote task exit race.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03  7:18 ` Michal Hocko
@ 2025-07-03  9:37   ` Chen, Yu C
  2025-07-03 11:51     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-07-03  9:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

Hi Michal,

On 7/3/2025 3:18 PM, Michal Hocko wrote:
> On Thu 03-07-25 00:32:47, Chen Yu wrote:
>> It was reported that after Commit ad6b26b6a0a7
>> ("sched/numa: add statistics of numa balance task"),
>> a NULL pointer exception[1] occurs when accessing
>> p->mm. The following race condition was found to
>> trigger this bug: After a swap task candidate is
>> chosen during NUMA balancing, its mm_struct is
>> released due to task exit. Later, when the task
>> swapping is performed, p->mm is NULL, which causes
>> the problem:
>>
>> CPU0                                   CPU1
>> :
>> ...
>> task_numa_migrate
>>     task_numa_find_cpu
>>      task_numa_compare
>>        # a normal task p is chosen
>>        env->best_task = p
>>
>>                                          # p exit:
>>                                          exit_signals(p);
>>                                             p->flags |= PF_EXITING
>>                                          exit_mm
>>                                             p->mm = NULL;
>>
>>      migrate_swap_stop
>>        __migrate_swap_task((arg->src_task, arg->dst_cpu)
>>         count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL
>>
>> Fix this issue by checking if the task has the PF_EXITING
>> flag set in migrate_swap_stop(). If it does, skip updating
>> the memcg events. Additionally, log a warning if p->mm is
>> NULL to facilitate future debugging.
>>
>> Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
>> Reported-by: Jirka Hladky <jhladky@redhat.com>
>> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
>> Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
>> Reported-by: Suneeth D <Suneeth.D@amd.com>
>> Suggested-by: Libo Chen <libo.chen@oracle.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>>   kernel/sched/core.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8988d38d46a3..4e06bb955dad 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
>>   {
>>   	__schedstat_inc(p->stats.numa_task_swapped);
>>   	count_vm_numa_event(NUMA_TASK_SWAP);
>> -	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>> +	/* exiting task has NULL mm */
>> +	if (!(p->flags & PF_EXITING)) {
>> +		WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
>> +			  p->pid, p->comm, p->flags);
> 
> As Andrew already said this is not really acceptable because this is
> very likely too easy to trigger and a) you do not want logs flooded with
> warnings and also there are setups with panic_on_warn configured and for
> those this would be a fatal situation without any good reason.
> 

OK, got it, thanks for pointing it out.

>> +
>> +		if (p->mm)
>> +			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>> +	}
> 
> Why are you testing for p->mm here? Isn't PF_EXITING test sufficient?
> A robust way to guarantee non-NULL mm against races when a task is
> exiting is find_lock_task_mm. Probably too heavy weight for this path.

I suppose we might only need to grab task_lock(p) and check if its mm
pointer is NULL. If it is, we skip the update of the memcg event without
scanning for a non-NULL mm within the process (as find_lock_task_mm()
does). If the mm is non-NULL, we update the memcg event with task_lock(p)
held and release it afterwards.

thanks,
Chenyu

>>   
>>   	if (task_on_rq_queued(p)) {
>>   		struct rq *src_rq, *dst_rq;
>> -- 
>> 2.25.1
>>
> 


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03  9:28   ` Michal Hocko
@ 2025-07-03 11:50     ` Peter Zijlstra
  2025-07-03 12:01       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-03 11:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chen Yu, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On Thu, Jul 03, 2025 at 11:28:46AM +0200, Michal Hocko wrote:

> But thinking about this some more, this would be racy same as the
> PF_EXITING check. This is not my area but is this performance sensitive
> path that couldn't live with the proper find_lock_task_mm?

find_lock_task_mm() seems eminently unsuitable for accounting --
iterating the task group is insane.

Looking at this, the mm_struct lifetimes suck.. task_struct reference
doesn't help, rcu doesn't help :-(

Also, whatever the solution it needs to be inside this count_memcg_*()
nonsense, because nobody wants this overhead, esp. not for something
daft like accounting.

My primary desire at this point is to just revert the patch that caused
this. Accounting just isn't worth it. Esp. not since there is already a
tracepoint in this path -- people that want to count crap can very well
get their numbers from that.


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03  9:37   ` Chen, Yu C
@ 2025-07-03 11:51     ` Michal Hocko
  2025-07-03 11:55       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-07-03 11:51 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On Thu 03-07-25 17:37:23, Chen, Yu C wrote:
> Hi Michal,
> 
> On 7/3/2025 3:18 PM, Michal Hocko wrote:
> > On Thu 03-07-25 00:32:47, Chen Yu wrote:
[...]
> > > +
> > > +		if (p->mm)
> > > +			count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> > > +	}
> > 
> > Why are you testing for p->mm here? Isn't PF_EXITING test sufficient?
> > A robust way to guarantee non-NULL mm against races when a task is
> > exiting is find_lock_task_mm. Probably too heavy weight for this path.
> 
> I suppose we might only need to grab task_lock(p), check if its mm
> pointer is NULL. If yes, we skip the update of memcg event without
> scanning for a non-NULL mm within the process(as find_lock_task_mm()
> does)? If the mm is non-NULL, we update the memcg event with task_lock(p)
> hold and releases it later.

Why not use find_lock_task_mm if task_lock is acceptable for this code
path?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 11:51     ` Michal Hocko
@ 2025-07-03 11:55       ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-03 11:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chen, Yu C, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On Thu, Jul 03, 2025 at 01:51:23PM +0200, Michal Hocko wrote:
> Why not use find_lock_task_mm if task_lock is acceptable for this code
> path?

Having to take locks just to count a number nobody is ever going to look
at is silly. Iterating the whole thread group (might be hundreds or
thousands of tasks) just to maybe find one when racing with a fatal
signal is even more silly.




* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 11:50     ` Peter Zijlstra
@ 2025-07-03 12:01       ` Michal Hocko
  2025-07-03 12:04         ` Chen, Yu C
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-07-03 12:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chen Yu, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On Thu 03-07-25 13:50:06, Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 11:28:46AM +0200, Michal Hocko wrote:
> 
> > But thinking about this some more, this would be racy same as the
> > PF_EXITING check. This is not my area but is this performance sensitive
> > path that couldn't live with the proper find_lock_task_mm?
> 
> find_lock_task_mm() seems eminently unsuitable for accounting --
> iterating the task group is insane.
> 
> Looking at this, the mm_struct lifetimes suck.. task_struct reference
> doesn't help, rcu doesn't help :-(
> 
> Also, whatever the solution it needs to be inside this count_memcg_*()
> nonsense, because nobody wants this overhead, esp. not for something
> daft like accounting.
> 
> My primary desire at this point is to just revert the patch that caused
> this. Accounting just isn't worth it. Esp. not since there is already a
> tracepoint in this path -- people that want to count crap can very well
> get their numbers from that.

I would tend to agree with this. Doing the accounting race free on a
remote task is nasty, and if this is a rare event that can be avoided
then it should just be dropped rather than left racy and oops prone.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 12:01       ` Michal Hocko
@ 2025-07-03 12:04         ` Chen, Yu C
  2025-07-03 12:20           ` Libo Chen
  0 siblings, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-07-03 12:04 UTC (permalink / raw)
  To: Michal Hocko, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Andrew Morton, Tim Chen, linux-kernel, Jirka Hladky,
	Srikanth Aithal, Suneeth D, Libo Chen

On 7/3/2025 8:01 PM, Michal Hocko wrote:
> On Thu 03-07-25 13:50:06, Peter Zijlstra wrote:
>> On Thu, Jul 03, 2025 at 11:28:46AM +0200, Michal Hocko wrote:
>>
>>> But thinking about this some more, this would be racy same as the
>>> PF_EXITING check. This is not my area but is this performance sensitive
>>> path that couldn't live with the proper find_lock_task_mm?
>>
>> find_lock_task_mm() seems eminently unsuitable for accounting --
>> iterating the task group is insane.
>>
>> Looking at this, the mm_struct lifetimes suck.. task_struct reference
>> doesn't help, rcu doesn't help :-(
>>
>> Also, whatever the solution it needs to be inside this count_memcg_*()
>> nonsense, because nobody wants this overhead, esp. not for something
>> daft like accounting.
>>
>> My primary desire at this point is to just revert the patch that caused
>> this. Accounting just isn't worth it. Esp. not since there is already a
>> tracepoint in this path -- people that want to count crap can very well
>> get their numbers from that.
> 
> I would tend to agree with this. Doing the accounting race free on a
> remote task is nasty and if this is a rare event that could be avoided
> then it should be just dropped than racy and oops prone.
> 

OK, Michal and Peter,
how about keeping the per task schedstat and drop the memcg statistics?
The user can still get the per task information without having to filter
the ftrace log.

thanks,
Chenyu


* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 12:04         ` Chen, Yu C
@ 2025-07-03 12:20           ` Libo Chen
  2025-07-03 12:36             ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Libo Chen @ 2025-07-03 12:20 UTC (permalink / raw)
  To: Chen, Yu C, Michal Hocko, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Andrew Morton, Tim Chen, linux-kernel, Jirka Hladky,
	Srikanth Aithal, Suneeth D



On 7/3/25 05:04, Chen, Yu C wrote:
> On 7/3/2025 8:01 PM, Michal Hocko wrote:
>> On Thu 03-07-25 13:50:06, Peter Zijlstra wrote:
>>> On Thu, Jul 03, 2025 at 11:28:46AM +0200, Michal Hocko wrote:
>>>
>>>> But thinking about this some more, this would be racy same as the
>>>> PF_EXITING check. This is not my area but is this performance sensitive
>>>> path that couldn't live with the proper find_lock_task_mm?
>>>
>>> find_lock_task_mm() seems eminently unsuitable for accounting --
>>> iterating the task group is insane.
>>>
>>> Looking at this, the mm_struct lifetimes suck.. task_struct reference
>>> doesn't help, rcu doesn't help :-(
>>>
>>> Also, whatever the solution it needs to be inside this count_memcg_*()
>>> nonsense, because nobody wants this overhead, esp. not for something
>>> daft like accounting.
>>>
>>> My primary desire at this point is to just revert the patch that caused
>>> this. Accounting just isn't worth it. Esp. not since there is already a
>>> tracepoint in this path -- people that want to count crap can very well
>>> get their numbers from that.
>>
>> I would tend to agree with this. Doing the accounting race free on a
>> remote task is nasty and if this is a rare event that could be avoided
>> then it should be just dropped than racy and oops prone.
>>
> 
> OK, Michal and Peter,
> how about keeping the per task schedstat and drop the memcg statistics?
> The user can still get the per task information without having to filter
> the ftrace log.
> 

I agree. The other parts, schedstat and vmstat, are still quite helpful.
Also, tracepoints are more expensive than counters once enabled; I think
that's too much for just counting numbers.

Libo

> thanks,
> Chenyu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 12:20           ` Libo Chen
@ 2025-07-03 12:36             ` Peter Zijlstra
  2025-07-03 13:38               ` Chen, Yu C
  2025-07-03 13:57               ` Libo Chen
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-03 12:36 UTC (permalink / raw)
  To: Libo Chen
  Cc: Chen, Yu C, Michal Hocko, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Andrew Morton, Tim Chen,
	linux-kernel, Jirka Hladky, Srikanth Aithal, Suneeth D

On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:

> I agree. The other parts, schedstat and vmstat, are still quite helpful.
> Also tracepoints are more expensive than counters once enabled, I think
> that's too much for just counting numbers.

I'm not generally a fan of eBPF, but supposedly it is really good for
stuff like this. 

Attaching to a tracepoint and distributing into cgroup buckets seems
like it should be a trivial script.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 12:36             ` Peter Zijlstra
@ 2025-07-03 13:38               ` Chen, Yu C
  2025-07-03 14:01                 ` Peter Zijlstra
  2025-07-03 13:57               ` Libo Chen
  1 sibling, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-07-03 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

Hi Peter,

On 7/3/2025 8:36 PM, Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:
> 
>> I agree. The other parts, schedstat and vmstat, are still quite helpful.
>> Also tracepoints are more expensive than counters once enabled, I think
>> that's too much for just counting numbers.
> 
> I'm not generally a fan of eBPF, but supposedly it is really good for
> stuff like this.
> 
> Attaching to a tracepoint and distributing into cgroup buckets seems
> like it should be a trivial script.

Yes, it is feasible to use eBPF. On the other hand, if some
existing monitoring programs rely on /proc/{pid}/sched to observe
the NUMA balancing metrics of processes, it might be helpful to
include the NUMA migration/swap information in /proc/{pid}/sched.
This approach can minimize the modifications needed for these
monitoring programs, eliminating the need to add a new BPF script
to obtain NUMA balancing statistics from different sources IMHO.


thanks,
Chenyu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 12:36             ` Peter Zijlstra
  2025-07-03 13:38               ` Chen, Yu C
@ 2025-07-03 13:57               ` Libo Chen
  2025-07-03 14:18                 ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Libo Chen @ 2025-07-03 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chen, Yu C, Michal Hocko, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Andrew Morton, Tim Chen,
	linux-kernel, Jirka Hladky, Srikanth Aithal, Suneeth D



On 7/3/25 05:36, Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:
> 
>> I agree. The other parts, schedstat and vmstat, are still quite helpful.
>> Also tracepoints are more expensive than counters once enabled, I think
>> that's too much for just counting numbers.
> 
> I'm not generally a fan of eBPF, but supposedly it is really good for
> stuff like this. 
> 

Yeah but not nearly as good as, for example, __schedstat_inc(var) which
probably only takes a few CPU cycles if var is in the right place. eBPF
is gonna take a whole bunch of sequences to even get to updating an eBPF
map which itself is much more expensive than __schedstat_inc(var).

For one, __migrate_swap_task() happens when dst node is fully busy (most
likely src node is full as well), so the overhead of eBPF could be quite
noticeable.


Libo
> Attaching to a tracepoint and distributing into cgroup buckets seems
> like it should be a trivial script.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 13:38               ` Chen, Yu C
@ 2025-07-03 14:01                 ` Peter Zijlstra
  2025-07-04  5:57                   ` Chen, Yu C
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-03 14:01 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Michal Hocko, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On Thu, Jul 03, 2025 at 09:38:08PM +0800, Chen, Yu C wrote:
> Hi Peter,
> 
> On 7/3/2025 8:36 PM, Peter Zijlstra wrote:
> > On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:
> > 
> > > I agree. The other parts, schedstat and vmstat, are still quite helpful.
> > > Also tracepoints are more expensive than counters once enabled, I think
> > > that's too much for just counting numbers.
> > 
> > I'm not generally a fan of eBPF, but supposedly it is really good for
> > stuff like this.
> > 
> > Attaching to a tracepoint and distributing into cgroup buckets seems
> > like it should be a trivial script.
> 
> Yes, it is feasible to use eBPF. On the other hand, if some
> existing monitoring programs rely on /proc/{pid}/sched to observe
> the NUMA balancing metrics of processes, it might be helpful to
> include the NUMA migration/swap information in /proc/{pid}/sched.
> This approach can minimize the modifications needed for these
> monitoring programs, eliminating the need to add a new BPF script
> to obtain NUMA balancing statistics from different sources IMHO.

Maybe...

The thing is, most of the time the effort spent on collecting all these
numbers is wasted energy since nobody ever looks at them.

Sometimes we're stuck with ABI, like the proc files you mentioned. We
can't readily remove them, stuff would break. But does that mean we
should endlessly add to them just because convenient?

Ideally I would strip out all the statistics and accounting crap and
make sure we have tracepoints (not trace-events) covering all the needed
spots, and then maybe just maybe have a few kernel modules that hook
into those tracepoints to provide the legacy interfaces.

That way, only the people that care get to pay the overhead of actually
collecting the numbers.

One can dream I suppose... :-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 13:57               ` Libo Chen
@ 2025-07-03 14:18                 ` Peter Zijlstra
  2025-07-03 23:35                   ` Libo Chen
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-03 14:18 UTC (permalink / raw)
  To: Libo Chen
  Cc: Chen, Yu C, Michal Hocko, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Andrew Morton, Tim Chen,
	linux-kernel, Jirka Hladky, Srikanth Aithal, Suneeth D

On Thu, Jul 03, 2025 at 06:57:04AM -0700, Libo Chen wrote:
> 
> 
> On 7/3/25 05:36, Peter Zijlstra wrote:
> > On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:
> > 
> >> I agree. The other parts, schedstat and vmstat, are still quite helpful.
> >> Also tracepoints are more expensive than counters once enabled, I think
> >> that's too much for just counting numbers.
> > 
> > I'm not generally a fan of eBPF, but supposedly it is really good for
> > stuff like this. 
> > 
> 
> Yeah but not nearly as good as, for example, __schedstat_inc(var) which
> probably only takes a few CPU cycles if var is in the right place. eBPF
> is gonna take a whole bunch of sequences to even get to updating an eBPF
> map which itself is much more expensive than __schedstat_inc(var).
> 
> For one, __migrate_swap_task() happens when dst node is fully busy (most
> likely src node is full as well), so the overhead of eBPF could be quite
> noticeable.

But that overhead is only paid if you actually care about the numbers;
most people don't.

We already stick static branches in many of the accounting paths --
because we know they hurt.

But look at this:

        __schedstat_inc(p->stats.numa_task_swapped);
        count_vm_numa_event(NUMA_TASK_SWAP);
        count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);

that is _3_ different counters, 3 cachelines touched. For what?

Would not a single:

	trace_numa_task_swap_tp(p);

be much saner? It translates into a single no-op; no lines touched. Only
when someone wants the numbers do we attach to the tracepoint and start
collecting things.

Is the collecting more expensive; maybe. But the rest of us will be
better off, no?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 14:18                 ` Peter Zijlstra
@ 2025-07-03 23:35                   ` Libo Chen
  2025-07-04  8:26                     ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Libo Chen @ 2025-07-03 23:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chen, Yu C, Michal Hocko, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Andrew Morton, Tim Chen,
	linux-kernel, Jirka Hladky, Srikanth Aithal, Suneeth D



On 7/3/25 07:18, Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 06:57:04AM -0700, Libo Chen wrote:
>>
>>
>> On 7/3/25 05:36, Peter Zijlstra wrote:
>>> On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:
>>>
>>>> I agree. The other parts, schedstat and vmstat, are still quite helpful.
>>>> Also tracepoints are more expensive than counters once enabled, I think
>>>> that's too much for just counting numbers.
>>>
>>> I'm not generally a fan of eBPF, but supposedly it is really good for
>>> stuff like this. 
>>>
>>
>> Yeah but not nearly as good as, for example, __schedstat_inc(var) which
>> probably only takes a few CPU cycles if var is in the right place. eBPF
>> is gonna take a whole bunch of sequences to even get to updating an eBPF
>> map which itself is much more expensive than __schedstat_inc(var).
>>
>> For one, __migrate_swap_task() happens when dst node is fully busy (most
>> likely src node is full as well), so the overhead of ebpf could be quite
>> noticeable.
> 
> But that overhead is only paid if you actually care about the numbers;
> most people don't.
> 
> We already stick static branches in many of the accounting paths --
> because we know they hurt.
> 
> But look at this:
> 
>         __schedstat_inc(p->stats.numa_task_swapped);
>         count_vm_numa_event(NUMA_TASK_SWAP);
>         count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> 
> that is _3_ different counters, 3 cachelines touched. For what?
> 
> Would not a single:
> 
> 	trace_numa_task_swap_tp(p);
> 
> be much saner? It translates into a single no-op; no lines touched. Only
> when someone wants the numbers do we attach to the tracepoint and start
> collecting things.
> 
> Is the collecting more expensive; maybe. But the rest of us will be
> better off, no?

Probably not as bad as you may think. Systems with one NUMA node or NUMA
balancing disabled (which will be most of the machines) won't be affected
by this at all; task_numa_migrate() is also ratelimited, so it doesn't get
touched nearly as often as most other scheduler events.

If this is on a really hot and critical path that most of us have to take,
such as wakeup, I won't argue with you at all. I don't want to be too
persistent here; it's fine to use eBPF with the existing tracepoints. I
just think this is convenient and doesn't really hurt those who don't care
about these numbers. 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 14:01                 ` Peter Zijlstra
@ 2025-07-04  5:57                   ` Chen, Yu C
  0 siblings, 0 replies; 21+ messages in thread
From: Chen, Yu C @ 2025-07-04  5:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Andrew Morton, Tim Chen, linux-kernel,
	Jirka Hladky, Srikanth Aithal, Suneeth D, Libo Chen

On 7/3/2025 10:01 PM, Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 09:38:08PM +0800, Chen, Yu C wrote:
>> Hi Peter,
>>
>> On 7/3/2025 8:36 PM, Peter Zijlstra wrote:
>>> On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:
>>>
>>>> I agree. The other parts, schedstat and vmstat, are still quite helpful.
>>>> Also tracepoints are more expensive than counters once enabled, I think
>>>> that's too much for just counting numbers.
>>>
>>> I'm not generally a fan of eBPF, but supposedly it is really good for
>>> stuff like this.
>>>
>>> Attaching to a tracepoint and distributing into cgroup buckets seems
>>> like it should be a trivial script.
>>
>> Yes, it is feasible to use eBPF. On the other hand, if some
>> existing monitoring programs rely on /proc/{pid}/sched to observe
>> the NUMA balancing metrics of processes, it might be helpful to
>> include the NUMA migration/swap information in /proc/{pid}/sched.
>> This approach can minimize the modifications needed for these
>> monitoring programs, eliminating the need to add a new BPF script
>> to obtain NUMA balancing statistics from different sources IMHO.
> 
> Maybe...
> 
> The thing is, most of the time the effort spent on collecting all these
> numbers is wasted energy since nobody ever looks at them.
>

As for per-task NUMA balancing activity itself, we found it useful for
debugging when trying to ensure that cache-aware load balancing coexists
properly with NUMA balancing.
> Sometimes we're stuck with ABI, like the proc files you mentioned. We
> can't readily remove them, stuff would break. But does that mean we
> should endlessly add to them just because convenient?
> 
> Ideally I would strip out all the statistics and accounting crap and
> make sure we have tracepoints (not trace-events) covering all the needed
> spots, and then maybe just maybe have a few kernel modules that hook
> into those tracepoints to provide the legacy interfaces.
> 
> That way, only the people that care get to pay the overhead of actually
> collecting the numbers.
> 
> One can dream I suppose... :-)

I see.

If I understand correctly, it's generally not recommended to add
new items under /proc. Users are recommended to use tracepoints/events
when trying to collect the statistics, and something like schedstat_inc()
should be avoided. Is the per-task data an exception? We recently exposed
the task's slice via /proc/pid/sched :D
thanks,
Chenyu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap
  2025-07-03 23:35                   ` Libo Chen
@ 2025-07-04  8:26                     ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-04  8:26 UTC (permalink / raw)
  To: Libo Chen
  Cc: Chen, Yu C, Michal Hocko, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Andrew Morton, Tim Chen,
	linux-kernel, Jirka Hladky, Srikanth Aithal, Suneeth D

On Thu, Jul 03, 2025 at 04:35:42PM -0700, Libo Chen wrote:

> Probably not as bad as you may think. Systems with one NUMA node or NUMA
> balancing disabled (which will be most of the machines) won't be affected
> by this at all , task_numa_migrate() is also ratelimited so it doesn't get
> touched nearly as often as most of other scheduler events.
> 
> If this is on a really hot and critical path that most of us have to take,
> such as wakeup, I won't argue with you at all. I don't want to be too
> persistent here, it's fine to use eBPF with the existing tracepoints. I
> just think this is convenient and doesn't really hurt those who don't care
> about these numbers. 

It's not about this one path per se; it's more a general rant on the 'merit'
of endlessly adding statistics and accounting to code.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-07-04  8:26 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-02 16:32 [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap Chen Yu
2025-07-02 21:08 ` Andrew Morton
2025-07-03  9:24   ` Chen, Yu C
2025-07-03  7:18 ` Michal Hocko
2025-07-03  9:37   ` Chen, Yu C
2025-07-03 11:51     ` Michal Hocko
2025-07-03 11:55       ` Peter Zijlstra
2025-07-03  7:26 ` Peter Zijlstra
2025-07-03  9:28   ` Michal Hocko
2025-07-03 11:50     ` Peter Zijlstra
2025-07-03 12:01       ` Michal Hocko
2025-07-03 12:04         ` Chen, Yu C
2025-07-03 12:20           ` Libo Chen
2025-07-03 12:36             ` Peter Zijlstra
2025-07-03 13:38               ` Chen, Yu C
2025-07-03 14:01                 ` Peter Zijlstra
2025-07-04  5:57                   ` Chen, Yu C
2025-07-03 13:57               ` Libo Chen
2025-07-03 14:18                 ` Peter Zijlstra
2025-07-03 23:35                   ` Libo Chen
2025-07-04  8:26                     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).