[PATCH v5 0/2] sched/numa: add statistics of numa balance task migration

linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration
@ 2025-05-23 12:48 Chen Yu
  2025-05-23 12:51 ` [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads Chen Yu
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Chen Yu @ 2025-05-23 12:48 UTC (permalink / raw)
  To: peterz, akpm
  Cc: mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko, muchun.song,
	roman.gushchin, shakeel.butt, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf, Chen Yu

Introducing the task migration and swap statistics in the following places:
/sys/fs/cgroup/{GROUP}/memory.stat
/proc/{PID}/sched
/proc/vmstat

These statistics facilitate a rapid evaluation of the performance and resource
utilization of the target workload.

Patch 1 is a fix from Libo to avoid task swapping for kernel threads
and user thread that does not have mm, because Numa balance only cares
about the user pages via VMA.

Patch 2 is the major change to expose the statistics of task migration and
swapping in corresponding files.

The reason to fold patch 1 and patch 2 into 1 patch set is that patch 1 is
necessary for patch 2 to avoid accessing a NULL mm_struct from a kernel
thread, which causes NULL pointer exception.

Changes since v4:
Skip the kernel thread in patch 1, by checking if the target thread
has PF_KTHREAD(Peter). Besides, remove the check for PF_IDLE, because
idle thread has PF_KTHREAD set already(Prateek).

Previous version:
v4:
https://lore.kernel.org/all/cover.1746611892.git.yu.c.chen@intel.com/
v3:
https://lore.kernel.org/lkml/20250430103623.3349842-1-yu.c.chen@intel.com/
v2:
https://lore.kernel.org/lkml/20250408101444.192519-1-yu.c.chen@intel.com/
v1:
https://lore.kernel.org/lkml/20250402010611.3204674-1-yu.c.chen@intel.com/

Chen Yu (1):
  sched/numa: add statistics of numa balance task

Libo Chen (1):
  sched/numa: fix task swap by skipping kernel threads

 Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
 include/linux/sched.h                   | 4 ++++
 include/linux/vm_event_item.h           | 2 ++
 kernel/sched/core.c                     | 9 +++++++--
 kernel/sched/debug.c                    | 4 ++++
 kernel/sched/fair.c                     | 3 ++-
 mm/memcontrol.c                         | 2 ++
 mm/vmstat.c                             | 2 ++
 8 files changed, 29 insertions(+), 3 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads
  2025-05-23 12:48 [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Chen Yu
@ 2025-05-23 12:51 ` Chen Yu
  2025-05-23 23:22   ` Shakeel Butt
  2025-05-23 12:51 ` [PATCH v5 2/2] sched/numa: add statistics of numa balance task Chen Yu
  2025-05-23 22:06 ` [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Andrew Morton
  2 siblings, 1 reply; 21+ messages in thread
From: Chen Yu @ 2025-05-23 12:51 UTC (permalink / raw)
  To: peterz, akpm
  Cc: mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko, muchun.song,
	roman.gushchin, shakeel.butt, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf, Ayush Jain, Chen Yu

From: Libo Chen <libo.chen@oracle.com>

Task swapping is triggered when there are no idle CPUs in
task A's preferred node. In this case, the NUMA load balancer
chooses a task B on A's preferred node and swaps B with A. This
helps improve NUMA locality without introducing load imbalance
between nodes. In the current implementation, B's NUMA node
preference is not mandatory. That is to say, a kernel thread
might be incorrectly chosen as B. However, kernel thread and
user space thread that does not have mm are not supposed to be
covered by NUMA balancing because NUMA balancing only considers
user pages via VMAs.

According to Peter's suggestion for fixing this issue, we use
PF_KTHREAD to skip the kernel thread. curr->mm is also checked
because it is possible that user_mode_thread() might create a
user thread without an mm. As per Prateek's analysis, after
adding the PF_KTHREAD check, there is no need to further check
the PF_IDLE flag:
"
- play_idle_precise() already ensures PF_KTHREAD is set before adding
  PF_IDLE

- cpu_startup_entry() is only called from the startup thread which
  should be marked with PF_KTHREAD (based on my understanding looking at
  commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
  setup"))
"

In summary, the check in task_numa_compare() now aligns with
task_tick_numa().

Suggested-by: Michal Koutny <mkoutny@suse.com>
Tested-by: Ayush Jain <Ayush.jain3@amd.com>
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
v4->v5:
Add PF_KTHREAD check, and remove PF_IDLE check.
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..03d9a49a68b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2273,7 +2273,8 @@ static bool task_numa_compare(struct task_numa_env *env,
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
-	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
+	if (cur && ((cur->flags & (PF_EXITING | PF_KTHREAD)) ||
+		    !cur->mm))
 		cur = NULL;
 
 	/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-23 12:48 [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Chen Yu
  2025-05-23 12:51 ` [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads Chen Yu
@ 2025-05-23 12:51 ` Chen Yu
  2025-05-23 23:42   ` Shakeel Butt
  2025-05-23 22:06 ` [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Andrew Morton
  2 siblings, 1 reply; 21+ messages in thread
From: Chen Yu @ 2025-05-23 12:51 UTC (permalink / raw)
  To: peterz, akpm
  Cc: mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko, muchun.song,
	roman.gushchin, shakeel.butt, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf, Chen Yu

On systems with NUMA balancing enabled, it has been found
that tracking task activities resulting from NUMA balancing
is beneficial. NUMA balancing employs two mechanisms for task
migration: one is to migrate a task to an idle CPU within its
preferred node, and the other is to swap tasks located on
different nodes when they are on each other's preferred nodes.

The kernel already provides NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
it lacks statistics regarding task migration and swapping.
Therefore, relevant counts for task migration and swapping should
be added.

The following two new fields:

numa_task_migrated
numa_task_swapped

will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
and /proc/vmstat

Introducing both per-task and per-memory cgroup (memcg) NUMA
balancing statistics facilitates a rapid evaluation of the
performance and resource utilization of the target workload.
For instance, users can first identify the container with high
NUMA balancing activity and then further pinpoint a specific
task within that group, and subsequently adjust the memory policy
for that task. In short, although it is possible to iterate through
/proc/$pid/sched to locate the problematic task, the introduction
of aggregated NUMA balancing activity for tasks within each memcg
can assist users in identifying the task more efficiently through
a divide-and-conquer approach.

As Libo Chen pointed out, the memcg event relies on the text
names in vmstat_text, and /proc/vmstat generates corresponding items
based on vmstat_text. Thus, the relevant task migration and swapping
events introduced in vmstat_text also need to be populated by
count_vm_numa_event(), otherwise these values are zero in
/proc/vmstat.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
v4->v5:
no change.
v3->v4:
Populate the /prov/vmstat otherwise the items are all zero.
(Libo)
v2->v3:
Remove unnecessary p->mm check because kernel threads are
not supported by Numa Balancing. (Libo Chen)
v1->v2:
Update the Documentation/admin-guide/cgroup-v2.rst. (Michal)
---
 Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
 include/linux/sched.h                   | 4 ++++
 include/linux/vm_event_item.h           | 2 ++
 kernel/sched/core.c                     | 9 +++++++--
 kernel/sched/debug.c                    | 4 ++++
 mm/memcontrol.c                         | 2 ++
 mm/vmstat.c                             | 2 ++
 7 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1a16ce68a4d7..d346f3235945 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1670,6 +1670,12 @@ The following nested keys are defined.
 	  numa_hint_faults (npn)
 		Number of NUMA hinting faults.
 
+	  numa_task_migrated (npn)
+		Number of task migration by NUMA balancing.
+
+	  numa_task_swapped (npn)
+		Number of task swap by NUMA balancing.
+
 	  pgdemote_kswapd
 		Number of pages demoted by kswapd.
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..1c50e30b5c01 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -549,6 +549,10 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+#ifdef CONFIG_NUMA_BALANCING
+	u64				numa_task_migrated;
+	u64				numa_task_swapped;
+#endif
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9e15a088ba38..91a3ce9a2687 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_TASK_MIGRATE,
+		NUMA_TASK_SWAP,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba0..62b033199e9c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3352,6 +3352,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
+	__schedstat_inc(p->stats.numa_task_swapped);
+	count_vm_numa_event(NUMA_TASK_SWAP);
+	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
+
 	if (task_on_rq_queued(p)) {
 		struct rq *src_rq, *dst_rq;
 		struct rq_flags srf, drf;
@@ -7953,8 +7957,9 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
 		return -EINVAL;
 
-	/* TODO: This is not properly updating schedstats */
-
+	__schedstat_inc(p->stats.numa_task_migrated);
+	count_vm_numa_event(NUMA_TASK_MIGRATE);
+	count_memcg_event_mm(p->mm, NUMA_TASK_MIGRATE);
 	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..f971c2af7912 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1206,6 +1206,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+#ifdef CONFIG_NUMA_BALANCING
+		P_SCHEDSTAT(numa_task_migrated);
+		P_SCHEDSTAT(numa_task_swapped);
+#endif
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c96c1f2b9cf5..cdaab8a957f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -463,6 +463,8 @@ static const unsigned int memcg_vm_event_stat[] = {
 	NUMA_PAGE_MIGRATE,
 	NUMA_PTE_UPDATES,
 	NUMA_HINT_FAULTS,
+	NUMA_TASK_MIGRATE,
+	NUMA_TASK_SWAP,
 #endif
 };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4c268ce39ff2..ed08bb384ae4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1347,6 +1347,8 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_task_migrated",
+	"numa_task_swapped",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration
  2025-05-23 12:48 [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Chen Yu
  2025-05-23 12:51 ` [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads Chen Yu
  2025-05-23 12:51 ` [PATCH v5 2/2] sched/numa: add statistics of numa balance task Chen Yu
@ 2025-05-23 22:06 ` Andrew Morton
  2025-05-23 23:52   ` Shakeel Butt
  2 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2025-05-23 22:06 UTC (permalink / raw)
  To: Chen Yu
  Cc: peterz, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, shakeel.butt, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

On Fri, 23 May 2025 20:48:02 +0800 Chen Yu <yu.c.chen@intel.com> wrote:

> Introducing the task migration and swap statistics in the following places:
> /sys/fs/cgroup/{GROUP}/memory.stat
> /proc/{PID}/sched
> /proc/vmstat
> 
> These statistics facilitate a rapid evaluation of the performance and resource
> utilization of the target workload.

Thanks.  I added this.

We're late in -rc7 but an earlier verison of this did have a run in
linux-next.  Could reviewers please take a look relatively soon, let us
know whether they believe this looks suitable for 6.16-rc1?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads
  2025-05-23 12:51 ` [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads Chen Yu
@ 2025-05-23 23:22   ` Shakeel Butt
  0 siblings, 0 replies; 21+ messages in thread
From: Shakeel Butt @ 2025-05-23 23:22 UTC (permalink / raw)
  To: Chen Yu
  Cc: peterz, akpm, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf, Ayush Jain

On Fri, May 23, 2025 at 08:51:01PM +0800, Chen Yu wrote:
> From: Libo Chen <libo.chen@oracle.com>
> 
> Task swapping is triggered when there are no idle CPUs in
> task A's preferred node. In this case, the NUMA load balancer
> chooses a task B on A's preferred node and swaps B with A. This
> helps improve NUMA locality without introducing load imbalance
> between nodes. In the current implementation, B's NUMA node
> preference is not mandatory. That is to say, a kernel thread
> might be incorrectly chosen as B. However, kernel thread and
> user space thread that does not have mm are not supposed to be
> covered by NUMA balancing because NUMA balancing only considers
> user pages via VMAs.
> 
> According to Peter's suggestion for fixing this issue, we use
> PF_KTHREAD to skip the kernel thread. curr->mm is also checked
> because it is possible that user_mode_thread() might create a
> user thread without an mm. As per Prateek's analysis, after
> adding the PF_KTHREAD check, there is no need to further check
> the PF_IDLE flag:
> "
> - play_idle_precise() already ensures PF_KTHREAD is set before adding
>   PF_IDLE
> 
> - cpu_startup_entry() is only called from the startup thread which
>   should be marked with PF_KTHREAD (based on my understanding looking at
>   commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
>   setup"))
> "
> 
> In summary, the check in task_numa_compare() now aligns with
> task_tick_numa().
> 
> Suggested-by: Michal Koutny <mkoutny@suse.com>
> Tested-by: Ayush Jain <Ayush.jain3@amd.com>
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-23 12:51 ` [PATCH v5 2/2] sched/numa: add statistics of numa balance task Chen Yu
@ 2025-05-23 23:42   ` Shakeel Butt
  2025-05-24  9:07     ` Chen, Yu C
  2025-05-26 13:35     ` Michal Koutný
  0 siblings, 2 replies; 21+ messages in thread
From: Shakeel Butt @ 2025-05-23 23:42 UTC (permalink / raw)
  To: Chen Yu
  Cc: peterz, akpm, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
> On systems with NUMA balancing enabled, it has been found
> that tracking task activities resulting from NUMA balancing
> is beneficial. NUMA balancing employs two mechanisms for task
> migration: one is to migrate a task to an idle CPU within its
> preferred node, and the other is to swap tasks located on
> different nodes when they are on each other's preferred nodes.
> 
> The kernel already provides NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
> it lacks statistics regarding task migration and swapping.
> Therefore, relevant counts for task migration and swapping should
> be added.
> 
> The following two new fields:
> 
> numa_task_migrated
> numa_task_swapped
> 
> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
> and /proc/vmstat

Hmm these are scheduler events, how are these relevant to memory cgroup
or vmstat? Any reason to not expose these in cpu.stat?

> 
> Introducing both per-task and per-memory cgroup (memcg) NUMA
> balancing statistics facilitates a rapid evaluation of the
> performance and resource utilization of the target workload.
> For instance, users can first identify the container with high
> NUMA balancing activity and then further pinpoint a specific
> task within that group, and subsequently adjust the memory policy
> for that task. In short, although it is possible to iterate through
> /proc/$pid/sched to locate the problematic task, the introduction
> of aggregated NUMA balancing activity for tasks within each memcg
> can assist users in identifying the task more efficiently through
> a divide-and-conquer approach.
> 
> As Libo Chen pointed out, the memcg event relies on the text
> names in vmstat_text, and /proc/vmstat generates corresponding items
> based on vmstat_text. Thus, the relevant task migration and swapping
> events introduced in vmstat_text also need to be populated by
> count_vm_numa_event(), otherwise these values are zero in
> /proc/vmstat.
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> v4->v5:
> no change.
> v3->v4:
> Populate the /prov/vmstat otherwise the items are all zero.
> (Libo)
> v2->v3:
> Remove unnecessary p->mm check because kernel threads are
> not supported by Numa Balancing. (Libo Chen)
> v1->v2:
> Update the Documentation/admin-guide/cgroup-v2.rst. (Michal)
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
>  include/linux/sched.h                   | 4 ++++
>  include/linux/vm_event_item.h           | 2 ++
>  kernel/sched/core.c                     | 9 +++++++--
>  kernel/sched/debug.c                    | 4 ++++
>  mm/memcontrol.c                         | 2 ++
>  mm/vmstat.c                             | 2 ++
>  7 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 1a16ce68a4d7..d346f3235945 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1670,6 +1670,12 @@ The following nested keys are defined.
>  	  numa_hint_faults (npn)
>  		Number of NUMA hinting faults.
>  
> +	  numa_task_migrated (npn)
> +		Number of task migration by NUMA balancing.
> +
> +	  numa_task_swapped (npn)
> +		Number of task swap by NUMA balancing.
> +
>  	  pgdemote_kswapd
>  		Number of pages demoted by kswapd.
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..1c50e30b5c01 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -549,6 +549,10 @@ struct sched_statistics {
>  	u64				nr_failed_migrations_running;
>  	u64				nr_failed_migrations_hot;
>  	u64				nr_forced_migrations;
> +#ifdef CONFIG_NUMA_BALANCING
> +	u64				numa_task_migrated;
> +	u64				numa_task_swapped;
> +#endif
>  
>  	u64				nr_wakeups;
>  	u64				nr_wakeups_sync;
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 9e15a088ba38..91a3ce9a2687 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -66,6 +66,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		NUMA_HINT_FAULTS,
>  		NUMA_HINT_FAULTS_LOCAL,
>  		NUMA_PAGE_MIGRATE,
> +		NUMA_TASK_MIGRATE,
> +		NUMA_TASK_SWAP,
>  #endif
>  #ifdef CONFIG_MIGRATION
>  		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c81cf642dba0..62b033199e9c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3352,6 +3352,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  #ifdef CONFIG_NUMA_BALANCING
>  static void __migrate_swap_task(struct task_struct *p, int cpu)
>  {
> +	__schedstat_inc(p->stats.numa_task_swapped);
> +	count_vm_numa_event(NUMA_TASK_SWAP);
> +	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +
>  	if (task_on_rq_queued(p)) {
>  		struct rq *src_rq, *dst_rq;
>  		struct rq_flags srf, drf;
> @@ -7953,8 +7957,9 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
>  	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
>  		return -EINVAL;
>  
> -	/* TODO: This is not properly updating schedstats */
> -
> +	__schedstat_inc(p->stats.numa_task_migrated);
> +	count_vm_numa_event(NUMA_TASK_MIGRATE);
> +	count_memcg_event_mm(p->mm, NUMA_TASK_MIGRATE);
>  	trace_sched_move_numa(p, curr_cpu, target_cpu);
>  	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
>  }
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 56ae54e0ce6a..f971c2af7912 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1206,6 +1206,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>  		P_SCHEDSTAT(nr_failed_migrations_running);
>  		P_SCHEDSTAT(nr_failed_migrations_hot);
>  		P_SCHEDSTAT(nr_forced_migrations);
> +#ifdef CONFIG_NUMA_BALANCING
> +		P_SCHEDSTAT(numa_task_migrated);
> +		P_SCHEDSTAT(numa_task_swapped);
> +#endif
>  		P_SCHEDSTAT(nr_wakeups);
>  		P_SCHEDSTAT(nr_wakeups_sync);
>  		P_SCHEDSTAT(nr_wakeups_migrate);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c96c1f2b9cf5..cdaab8a957f3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -463,6 +463,8 @@ static const unsigned int memcg_vm_event_stat[] = {
>  	NUMA_PAGE_MIGRATE,
>  	NUMA_PTE_UPDATES,
>  	NUMA_HINT_FAULTS,
> +	NUMA_TASK_MIGRATE,
> +	NUMA_TASK_SWAP,
>  #endif
>  };
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4c268ce39ff2..ed08bb384ae4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1347,6 +1347,8 @@ const char * const vmstat_text[] = {
>  	"numa_hint_faults",
>  	"numa_hint_faults_local",
>  	"numa_pages_migrated",
> +	"numa_task_migrated",
> +	"numa_task_swapped",
>  #endif
>  #ifdef CONFIG_MIGRATION
>  	"pgmigrate_success",
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration
  2025-05-23 22:06 ` [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Andrew Morton
@ 2025-05-23 23:52   ` Shakeel Butt
  2025-05-28  0:21     ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2025-05-23 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chen Yu, peterz, mkoutny, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

On Fri, May 23, 2025 at 03:06:35PM -0700, Andrew Morton wrote:
> On Fri, 23 May 2025 20:48:02 +0800 Chen Yu <yu.c.chen@intel.com> wrote:
> 
> > Introducing the task migration and swap statistics in the following places:
> > /sys/fs/cgroup/{GROUP}/memory.stat
> > /proc/{PID}/sched
> > /proc/vmstat
> > 
> > These statistics facilitate a rapid evaluation of the performance and resource
> > utilization of the target workload.
> 
> Thanks.  I added this.
> 
> We're late in -rc7 but an earlier verison of this did have a run in
> linux-next.  Could reviewers please take a look relatively soon, let us
> know whether they believe this looks suitable for 6.16-rc1?
> 

The stats seems valuable but I am not convinced that memcg is the right
home for these stats. So, please hold until that is resolved.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-23 23:42   ` Shakeel Butt
@ 2025-05-24  9:07     ` Chen, Yu C
  2025-05-24 17:32       ` Shakeel Butt
  2025-05-26 13:35     ` Michal Koutný
  1 sibling, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-05-24  9:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: peterz, akpm, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

Hi Shakeel,

On 5/24/2025 7:42 AM, Shakeel Butt wrote:
> On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
>> On systems with NUMA balancing enabled, it has been found
>> that tracking task activities resulting from NUMA balancing
>> is beneficial. NUMA balancing employs two mechanisms for task
>> migration: one is to migrate a task to an idle CPU within its
>> preferred node, and the other is to swap tasks located on
>> different nodes when they are on each other's preferred nodes.
>>
>> The kernel already provides NUMA page migration statistics in
>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
>> it lacks statistics regarding task migration and swapping.
>> Therefore, relevant counts for task migration and swapping should
>> be added.
>>
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
>> and /proc/vmstat
> 
> Hmm these are scheduler events, how are these relevant to memory cgroup
> or vmstat? 
> Any reason to not expose these in cpu.stat?
> 

I understand that in theory they are scheduling activities.
The reason for including these statistics here was mainly that
I assumed there is a close relationship between page migration
and task migration in Numa Balance. Specifically, task migration
is triggered when page migration fails.
Placing these statistics closer to the existing Numa Balance page
statistics in /sys/fs/cgroup/{GROUP}/memory.stat and /proc/vmstat
may help users query relevant data from a single file, avoiding
the need to search through scattered files.
Notably, these events are associated with a task’s working set
(footprint) rather than pure CPU cycles IMO. I took a look at
the cpu_cfs_stat_show() for cpu.stat, it seems that a lot of
code is needed if we want to expose them in cpu.stat, while
reusing existing interface of count_memcg_event_mm() is simpler.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-24  9:07     ` Chen, Yu C
@ 2025-05-24 17:32       ` Shakeel Butt
  2025-05-25 12:35         ` Chen, Yu C
  0 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2025-05-24 17:32 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: peterz, akpm, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

On Sat, May 24, 2025 at 2:07 AM Chen, Yu C <yu.c.chen@intel.com> wrote:
>
> Hi Shakeel,
>
> On 5/24/2025 7:42 AM, Shakeel Butt wrote:
> > On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
> >> On systems with NUMA balancing enabled, it has been found
> >> that tracking task activities resulting from NUMA balancing
> >> is beneficial. NUMA balancing employs two mechanisms for task
> >> migration: one is to migrate a task to an idle CPU within its
> >> preferred node, and the other is to swap tasks located on
> >> different nodes when they are on each other's preferred nodes.
> >>
> >> The kernel already provides NUMA page migration statistics in
> >> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
> >> it lacks statistics regarding task migration and swapping.
> >> Therefore, relevant counts for task migration and swapping should
> >> be added.
> >>
> >> The following two new fields:
> >>
> >> numa_task_migrated
> >> numa_task_swapped
> >>
> >> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
> >> and /proc/vmstat
> >
> > Hmm these are scheduler events, how are these relevant to memory cgroup
> > or vmstat?
> > Any reason to not expose these in cpu.stat?
> >
>
> I understand that in theory they are scheduling activities.
> The reason for including these statistics here was mainly that
> I assumed there is a close relationship between page migration
> and task migration in Numa Balance. Specifically, task migration
> is triggered when page migration fails.
> Placing these statistics closer to the existing Numa Balance page
> statistics in /sys/fs/cgroup/{GROUP}/memory.stat and /proc/vmstat
> may help users query relevant data from a single file, avoiding
> the need to search through scattered files.
> Notably, these events are associated with a task’s working set
> (footprint) rather than pure CPU cycles IMO. I took a look at
> the cpu_cfs_stat_show() for cpu.stat, it seems that a lot of
> code is needed if we want to expose them in cpu.stat, while
> reusing existing interface of count_memcg_event_mm() is simpler.

Let me address two of your points first:

(1) cpu.stat currently contains cpu cycles stats. I don't see an issue
adding these new events in it as you can see memory.stat exposes stats
and events as well.

(2) You can still use count_memcg_event_mm() and related infra while
exposing the stats/events in cpu.stat.

Now your point on having related stats within a single interface is
more convincing. Let me ask you couple of simple questions:

I am not well versed with numa migration, can you expand a bit more on
these two events (numa_task_migrated & numa_task_swapped)? How are
these related to numa memory migration? You mentioned these events
happen on page migration failure, can you please give an end-to-end
flow/story of all these events happening on a timeline.

Beside that, do you think there might be some other scheduling events
(maybe unrelated to numa balancing) which might be suitable for
memory.stat? Basically I am trying to find if having sched events in
memory.stat be an exception for numa balancing or more general.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-24 17:32       ` Shakeel Butt
@ 2025-05-25 12:35         ` Chen, Yu C
  2025-05-27 17:48           ` Shakeel Butt
  0 siblings, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-05-25 12:35 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: peterz, akpm, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

On 5/25/2025 1:32 AM, Shakeel Butt wrote:
> On Sat, May 24, 2025 at 2:07 AM Chen, Yu C <yu.c.chen@intel.com> wrote:
>>
>> Hi Shakeel,
>>
>> On 5/24/2025 7:42 AM, Shakeel Butt wrote:
>>> On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
>>>> On systems with NUMA balancing enabled, it has been found
>>>> that tracking task activities resulting from NUMA balancing
>>>> is beneficial. NUMA balancing employs two mechanisms for task
>>>> migration: one is to migrate a task to an idle CPU within its
>>>> preferred node, and the other is to swap tasks located on
>>>> different nodes when they are on each other's preferred nodes.
>>>>
>>>> The kernel already provides NUMA page migration statistics in
>>>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
>>>> it lacks statistics regarding task migration and swapping.
>>>> Therefore, relevant counts for task migration and swapping should
>>>> be added.
>>>>
>>>> The following two new fields:
>>>>
>>>> numa_task_migrated
>>>> numa_task_swapped
>>>>
>>>> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
>>>> and /proc/vmstat
>>>
>>> Hmm these are scheduler events, how are these relevant to memory cgroup
>>> or vmstat?
>>> Any reason to not expose these in cpu.stat?
>>>
>>
>> I understand that in theory they are scheduling activities.
>> The reason for including these statistics here was mainly that
>> I assumed there is a close relationship between page migration
>> and task migration in Numa Balance. Specifically, task migration
>> is triggered when page migration fails.
>> Placing these statistics closer to the existing Numa Balance page
>> statistics in /sys/fs/cgroup/{GROUP}/memory.stat and /proc/vmstat
>> may help users query relevant data from a single file, avoiding
>> the need to search through scattered files.
>> Notably, these events are associated with a task’s working set
>> (footprint) rather than pure CPU cycles IMO. I took a look at
>> the cpu_cfs_stat_show() for cpu.stat, it seems that a lot of
>> code is needed if we want to expose them in cpu.stat, while
>> reusing existing interface of count_memcg_event_mm() is simpler.
> 
> Let me address two of your points first:
> 
> (1) cpu.stat currently contains cpu cycles stats. I don't see an issue
> adding these new events in it as you can see memory.stat exposes stats
> and events as well.
> 
> (2) You can still use count_memcg_event_mm() and related infra while
> exposing the stats/events in cpu.stat.
> 

Got it.

> Now your point on having related stats within a single interface is
> more convincing. Let me ask you couple of simple questions:
> 
> I am not well versed with numa migration, can you expand a bit more on
> these two events (numa_task_migrated & numa_task_swapped)? How are
> these related to numa memory migration? You mentioned these events> happen on page migration failure,

I double-checked the code, and it seems that task numa migration
occurs regardless of whether page migration fails or succeeds.

 > can you please give an end-to-end> flow/story of all these events 
happening on a timeline.
> 

Yes, sure, let me have a try.

The goal of NUMA balancing is to co-locate a task and its
memory pages on the same NUMA node. There are two strategies:
migrate the pages to the task's node, or migrate the task to
the node where its pages reside.

Suppose a task p1 is running on Node 0, but its pages are
located on Node 1. NUMA page fault statistics for p1 reveal
its "page footprint" across nodes. If NUMA balancing detects
that most of p1's pages are on Node 1:

1.Page Migration Attempt:
The Numa balance first tries to migrate p1's pages to Node 0.
The numa_page_migrate counter increments.

2.Task Migration Strategies:
After the page migration finishes, Numa balance checks every
1 second to see if p1 can be migrated to Node 1.

Case 2.1: Idle CPU Available
If Node 1 has an idle CPU, p1 is directly scheduled there. This event is 
logged as numa_task_migrated.
Case 2.2: No Idle CPU (Task Swap)
If all CPUs on Node1 are busy, direct migration could cause CPU 
contention or load imbalance. Instead:
The Numa balance selects a candidate task p2 on Node 1 that prefers
Node 0 (e.g., due to its own page footprint).
p1 and p2 are swapped. This cross-node swap is recorded as 
numa_task_swapped.

> Beside that, do you think there might be some other scheduling events
> (maybe unrelated to numa balancing) which might be suitable for
> memory.stat? Basically I am trying to find if having sched events in
> memory.stat be an exception for numa balancing or more general.

If the criterion is a combination of task scheduling strategy and
page-based operations, I cannot find any other existing scheduling
events. For now, NUMA balancing seems to be the only case.


thanks,
Chenyu

> 
> thanks,
> Shakeel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-23 23:42   ` Shakeel Butt
  2025-05-24  9:07     ` Chen, Yu C
@ 2025-05-26 13:35     ` Michal Koutný
  2025-05-27  9:20       ` Chen, Yu C
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Koutný @ 2025-05-26 13:35 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Chen Yu, peterz, akpm, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

[-- Attachment #1: Type: text/plain, Size: 561 bytes --]

On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> Hmm these are scheduler events, how are these relevant to memory cgroup
> or vmstat? Any reason to not expose these in cpu.stat?

Good point. If I take it further -- this functionality needs neither
memory controller (CONFIG_MEMCG) nor CPU controller
(CONFIG_CGROUP_SCHED), so it might be technically calculated and exposed
in _any_ cgroup (which would be same technical solution how cpu time is
counted in cpu.stat regardless of CPU controller, cpu_stat_show()).

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-26 13:35     ` Michal Koutný
@ 2025-05-27  9:20       ` Chen, Yu C
  2025-05-27 18:15         ` Shakeel Butt
  0 siblings, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-05-27  9:20 UTC (permalink / raw)
  To: Michal Koutný, Shakeel Butt
  Cc: peterz, akpm, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

On 5/26/2025 9:35 PM, Michal Koutný wrote:
> On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> Hmm these are scheduler events, how are these relevant to memory cgroup
>> or vmstat? Any reason to not expose these in cpu.stat?
> 
> Good point. If I take it further -- this functionality needs neither
> memory controller (CONFIG_MEMCG) nor CPU controller
> (CONFIG_CGROUP_SCHED), so it might be technically calculated and exposed
> in _any_ cgroup (which would be same technical solution how cpu time is
> counted in cpu.stat regardless of CPU controller, cpu_stat_show()).
> 

Yes, we can add it to cpu.stat. However, this might make it more difficult
for users to locate related events. Some statistics about NUMA page
migrations/faults are recorded in memory.stat, while others about NUMA task
migrations (triggered by NUMA faults periodicly) are stored in cpu.stat.

Do you recommend extending the struct cgroup_base_stat to include counters
for task_migrate/task_swap? Additionally, should we enhance
cgroup_base_stat_cputime_show() to parse task_migrate/task_swap in a manner
similar to cputime?

Alternatively, as Shakeel previously mentioned, could we reuse
"count_memcg_event_mm()" and related infrastructure while exposing these
statistics/events in cpu.stat? I assume Shakeel was referring to the 
following
approach:

1. Skip task migration/swap in memory.stat:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdaab8a957f3..b8eea3eca46f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1529,6 +1529,11 @@ static void memcg_stat_format(struct mem_cgroup 
*memcg, struct seq_buf *s)
                 if (memcg_vm_event_stat[i] == PGPGIN ||
                     memcg_vm_event_stat[i] == PGPGOUT)
                         continue;
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+               if (memcg_vm_event_stat[i] == NUMA_TASK_MIGRATE ||
+                   memcg_vm_event_stat[i] == NUMA_TASK_SWAP)
+                       continue;
  #endif

2.Skip task migration/swap in /proc/vmstat
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ed08bb384ae4..ea8a8ae1cdac 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1912,6 +1912,10 @@ static void *vmstat_next(struct seq_file *m, void 
*arg, loff_t *pos)
         (*pos)++;
         if (*pos >= NR_VMSTAT_ITEMS)
                 return NULL;
+#ifdef CONFIG_NUMA_BALANCING
+       if (*pos == NUMA_TASK_MIGRATE || *pos == NUMA_TASK_SWAP)
+               return NULL;
+#endif

3. Display task migration/swap events in cpu.stat:
  seq_buf_printf(&s, "%s %lu\n",
+ 
vm_event_name(memcg_vm_event_stat[NUMA_TASK_MIGRATE]),
+                      memcg_events(memcg, 
memcg_vm_event_stat[NUMA_TASK_MIGRATE]));


It looks like more code is needed. Michal, Shakeel, could you please advise
which strategy is preferred, or should we keep the current version?


Thanks,
Chenyu



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-25 12:35         ` Chen, Yu C
@ 2025-05-27 17:48           ` Shakeel Butt
  2025-05-29  5:04             ` Chen, Yu C
  0 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2025-05-27 17:48 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: peterz, akpm, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

On Sun, May 25, 2025 at 08:35:24PM +0800, Chen, Yu C wrote:
> On 5/25/2025 1:32 AM, Shakeel Butt wrote:
[...]
> > can you please give an end-to-end> flow/story of all these events
> happening on a timeline.
> > 
> 
> Yes, sure, let me have a try.
> 
> The goal of NUMA balancing is to co-locate a task and its
> memory pages on the same NUMA node. There are two strategies:
> migrate the pages to the task's node, or migrate the task to
> the node where its pages reside.
> 
> Suppose a task p1 is running on Node 0, but its pages are
> located on Node 1. NUMA page fault statistics for p1 reveal
> its "page footprint" across nodes. If NUMA balancing detects
> that most of p1's pages are on Node 1:
> 
> 1.Page Migration Attempt:
> The Numa balance first tries to migrate p1's pages to Node 0.
> The numa_page_migrate counter increments.
> 
> 2.Task Migration Strategies:
> After the page migration finishes, Numa balance checks every
> 1 second to see if p1 can be migrated to Node 1.
> 
> Case 2.1: Idle CPU Available
> If Node 1 has an idle CPU, p1 is directly scheduled there. This event is
> logged as numa_task_migrated.
> Case 2.2: No Idle CPU (Task Swap)
> If all CPUs on Node1 are busy, direct migration could cause CPU contention
> or load imbalance. Instead:
> The Numa balance selects a candidate task p2 on Node 1 that prefers
> Node 0 (e.g., due to its own page footprint).
> p1 and p2 are swapped. This cross-node swap is recorded as
> numa_task_swapped.
> 

Thanks for the explanation, this is really helpful and I would like this
to be included in the commit message.

> > Beside that, do you think there might be some other scheduling events
> > (maybe unrelated to numa balancing) which might be suitable for
> > memory.stat? Basically I am trying to find if having sched events in
> > memory.stat be an exception for numa balancing or more general.
> 
> If the criterion is a combination of task scheduling strategy and
> page-based operations, I cannot find any other existing scheduling
> events. For now, NUMA balancing seems to be the only case.

Mainly I was looking if in future we need to add more sched events to
memory.stat file.

Let me reply on the other email chain on what should we do next.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-27  9:20       ` Chen, Yu C
@ 2025-05-27 18:15         ` Shakeel Butt
  2025-06-02 16:53           ` Michal Koutný
  0 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2025-05-27 18:15 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Michal Koutný, peterz, akpm, mingo, tj, hannes, corbet,
	mgorman, mhocko, muchun.song, roman.gushchin, tim.c.chen,
	aubrey.li, libo.chen, kprateek.nayak, vineethr, venkat88,
	ayushjai, cgroups, linux-doc, linux-mm, linux-kernel,
	yu.chen.surf

On Tue, May 27, 2025 at 05:20:54PM +0800, Chen, Yu C wrote:
> On 5/26/2025 9:35 PM, Michal Koutný wrote:
> > On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > Hmm these are scheduler events, how are these relevant to memory cgroup
> > > or vmstat? Any reason to not expose these in cpu.stat?
> > 
> > Good point. If I take it further -- this functionality needs neither
> > memory controller (CONFIG_MEMCG) nor CPU controller
> > (CONFIG_CGROUP_SCHED), so it might be technically calculated and exposed
> > in _any_ cgroup (which would be same technical solution how cpu time is
> > counted in cpu.stat regardless of CPU controller, cpu_stat_show()).
> > 
> 
> Yes, we can add it to cpu.stat. However, this might make it more difficult
> for users to locate related events. Some statistics about NUMA page
> migrations/faults are recorded in memory.stat, while others about NUMA task
> migrations (triggered by NUMA faults periodicly) are stored in cpu.stat.
> 
> Do you recommend extending the struct cgroup_base_stat to include counters
> for task_migrate/task_swap? Additionally, should we enhance
> cgroup_base_stat_cputime_show() to parse task_migrate/task_swap in a manner
> similar to cputime?
> 
> Alternatively, as Shakeel previously mentioned, could we reuse
> "count_memcg_event_mm()" and related infrastructure while exposing these
> statistics/events in cpu.stat? I assume Shakeel was referring to the
> following
> approach:
> 
> 1. Skip task migration/swap in memory.stat:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cdaab8a957f3..b8eea3eca46f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1529,6 +1529,11 @@ static void memcg_stat_format(struct mem_cgroup
> *memcg, struct seq_buf *s)
>                 if (memcg_vm_event_stat[i] == PGPGIN ||
>                     memcg_vm_event_stat[i] == PGPGOUT)
>                         continue;
> +#endif
> +#ifdef CONFIG_NUMA_BALANCING
> +               if (memcg_vm_event_stat[i] == NUMA_TASK_MIGRATE ||
> +                   memcg_vm_event_stat[i] == NUMA_TASK_SWAP)
> +                       continue;
>  #endif
> 
> 2.Skip task migration/swap in /proc/vmstat
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ed08bb384ae4..ea8a8ae1cdac 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1912,6 +1912,10 @@ static void *vmstat_next(struct seq_file *m, void
> *arg, loff_t *pos)
>         (*pos)++;
>         if (*pos >= NR_VMSTAT_ITEMS)
>                 return NULL;
> +#ifdef CONFIG_NUMA_BALANCING
> +       if (*pos == NUMA_TASK_MIGRATE || *pos == NUMA_TASK_SWAP)
> +               return NULL;
> +#endif
> 
> 3. Display task migration/swap events in cpu.stat:
>  seq_buf_printf(&s, "%s %lu\n",
> + vm_event_name(memcg_vm_event_stat[NUMA_TASK_MIGRATE]),
> +                      memcg_events(memcg,
> memcg_vm_event_stat[NUMA_TASK_MIGRATE]));
> 

You would need to use memcg_events() and you will need to flush the
memcg rstat trees as well

> 
> It looks like more code is needed. Michal, Shakeel, could you please advise
> which strategy is preferred, or should we keep the current version?

I am now more inclined to keep these new stats in memory.stat as the
current version is doing because:

1. Relevant stats are exposed through the same interface and we already
   have numa balancing stats in memory.stat.

2. There is no single good home for these new stats and exposing them in
   cpu.stat would require more code and even if we reuse memcg infra, we
   would still need to flush the memcg stats, so why not just expose in
   the memory.stat.

3. Though a bit far fetched, I think we may add more stats which sit at
   the boundary of sched and mm in future. Numa balancing is one
   concrete example of such stats. I am envisioning for reliable memory
   reclaim or overcommit, there might be some useful events as well.
   Anyways it is still unbaked atm.


Michal, let me know your thought on this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration
  2025-05-23 23:52   ` Shakeel Butt
@ 2025-05-28  0:21     ` Andrew Morton
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2025-05-28  0:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Chen Yu, peterz, mkoutny, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

On Fri, 23 May 2025 16:52:46 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> On Fri, May 23, 2025 at 03:06:35PM -0700, Andrew Morton wrote:
> > On Fri, 23 May 2025 20:48:02 +0800 Chen Yu <yu.c.chen@intel.com> wrote:
> > 
> > > Introducing the task migration and swap statistics in the following places:
> > > /sys/fs/cgroup/{GROUP}/memory.stat
> > > /proc/{PID}/sched
> > > /proc/vmstat
> > > 
> > > These statistics facilitate a rapid evaluation of the performance and resource
> > > utilization of the target workload.
> > 
> > Thanks.  I added this.
> > 
> > We're late in -rc7 but an earlier verison of this did have a run in
> > linux-next.  Could reviewers please take a look relatively soon, let us
> > know whether they believe this looks suitable for 6.16-rc1?
> > 
> 
> The stats seems valuable but I am not convinced that memcg is the right
> home for these stats. So, please hold until that is resolved.

No probs, I'll keep these in mm-new until something changes.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-27 17:48           ` Shakeel Butt
@ 2025-05-29  5:04             ` Chen, Yu C
  0 siblings, 0 replies; 21+ messages in thread
From: Chen, Yu C @ 2025-05-29  5:04 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: peterz, akpm, mkoutny, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

On 5/28/2025 1:48 AM, Shakeel Butt wrote:
> On Sun, May 25, 2025 at 08:35:24PM +0800, Chen, Yu C wrote:
>> On 5/25/2025 1:32 AM, Shakeel Butt wrote:
> [...]
>>> can you please give an end-to-end> flow/story of all these events
>> happening on a timeline.
>>>
>>
>> Yes, sure, let me have a try.
>>
>> The goal of NUMA balancing is to co-locate a task and its
>> memory pages on the same NUMA node. There are two strategies:
>> migrate the pages to the task's node, or migrate the task to
>> the node where its pages reside.
>>
>> Suppose a task p1 is running on Node 0, but its pages are
>> located on Node 1. NUMA page fault statistics for p1 reveal
>> its "page footprint" across nodes. If NUMA balancing detects
>> that most of p1's pages are on Node 1:
>>
>> 1.Page Migration Attempt:
>> The Numa balance first tries to migrate p1's pages to Node 0.
>> The numa_page_migrate counter increments.
>>
>> 2.Task Migration Strategies:
>> After the page migration finishes, Numa balance checks every
>> 1 second to see if p1 can be migrated to Node 1.
>>
>> Case 2.1: Idle CPU Available
>> If Node 1 has an idle CPU, p1 is directly scheduled there. This event is
>> logged as numa_task_migrated.
>> Case 2.2: No Idle CPU (Task Swap)
>> If all CPUs on Node1 are busy, direct migration could cause CPU contention
>> or load imbalance. Instead:
>> The Numa balance selects a candidate task p2 on Node 1 that prefers
>> Node 0 (e.g., due to its own page footprint).
>> p1 and p2 are swapped. This cross-node swap is recorded as
>> numa_task_swapped.
>>
> 
> Thanks for the explanation, this is really helpful and I would like this
> to be included in the commit message.
> 

OK, just sent out a v6 with the commit message enhanced.

Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-05-27 18:15         ` Shakeel Butt
@ 2025-06-02 16:53           ` Michal Koutný
  2025-06-03 14:46             ` Chen, Yu C
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Koutný @ 2025-06-02 16:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Chen, Yu C, peterz, akpm, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

[-- Attachment #1: Type: text/plain, Size: 2160 bytes --]

On Tue, May 27, 2025 at 11:15:33AM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> I am now more inclined to keep these new stats in memory.stat as the
> current version is doing because:
> 
> 1. Relevant stats are exposed through the same interface and we already
>    have numa balancing stats in memory.stat.
> 
> 2. There is no single good home for these new stats and exposing them in
>    cpu.stat would require more code and even if we reuse memcg infra, we
>    would still need to flush the memcg stats, so why not just expose in
>    the memory.stat.
> 
> 3. Though a bit far fetched, I think we may add more stats which sit at
>    the boundary of sched and mm in future. Numa balancing is one
>    concrete example of such stats. I am envisioning for reliable memory
>    reclaim or overcommit, there might be some useful events as well.
>    Anyways it is still unbaked atm.
> 
> 
> Michal, let me know your thought on this.

I reckon users may be little bit more likely to look that info in
memory.stat.

Which would be OK unless threaded subtrees are considered (e.g. cpuset
(NUMA affinity) has thread granularity) and these migration stats are
potentially per-thread relevant.

I was also pondering why cannot be misplaced container found by existing
NUMA stats. Chen has explained task vs page migration in NUMA balancing.
I guess mere page migration number (especially when stagnating) may not
point to the the misplaced container. OK.

Second thing is what is the "misplaced" container. Is it because of
wrong set_mempolicy(2) or cpuset configuration? If it's the former (i.e.
it requires enabled cpuset controller), it'd justify exposing this info
in cpuset.stat, if it's the latter, the cgroup aggregation is not that
relevant (hence /proc/<PID>/sched) is sufficient. Or is there another
meaning of a misplaced container? Chen, could you please clarify?

Because memory controller doesn't control NUMA, it needn't be enabled
to have this statistics and it cannot be enabled in threaded groups, I'm
having some doubts whether memory.stat is a good home for this field.

Regards,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-06-02 16:53           ` Michal Koutný
@ 2025-06-03 14:46             ` Chen, Yu C
  2025-06-17  9:30               ` Michal Koutný
  0 siblings, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-06-03 14:46 UTC (permalink / raw)
  To: Michal Koutný, Shakeel Butt
  Cc: peterz, akpm, mingo, tj, hannes, corbet, mgorman, mhocko,
	muchun.song, roman.gushchin, tim.c.chen, aubrey.li, libo.chen,
	kprateek.nayak, vineethr, venkat88, ayushjai, cgroups, linux-doc,
	linux-mm, linux-kernel, yu.chen.surf

Hi Michal,

On 6/3/2025 12:53 AM, Michal Koutný wrote:
> On Tue, May 27, 2025 at 11:15:33AM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> I am now more inclined to keep these new stats in memory.stat as the
>> current version is doing because:
>>
>> 1. Relevant stats are exposed through the same interface and we already
>>     have numa balancing stats in memory.stat.
>>
>> 2. There is no single good home for these new stats and exposing them in
>>     cpu.stat would require more code and even if we reuse memcg infra, we
>>     would still need to flush the memcg stats, so why not just expose in
>>     the memory.stat.
>>
>> 3. Though a bit far fetched, I think we may add more stats which sit at
>>     the boundary of sched and mm in future. Numa balancing is one
>>     concrete example of such stats. I am envisioning for reliable memory
>>     reclaim or overcommit, there might be some useful events as well.
>>     Anyways it is still unbaked atm.
>>
>>
>> Michal, let me know your thought on this.
> 
> I reckon users may be little bit more likely to look that info in
> memory.stat.
> 
> Which would be OK unless threaded subtrees are considered (e.g. cpuset
> (NUMA affinity) has thread granularity) and these migration stats are
> potentially per-thread relevant.
> 
> 
> I was also pondering why cannot be misplaced container found by existing
> NUMA stats. Chen has explained task vs page migration in NUMA balancing.
> I guess mere page migration number (especially when stagnating) may not
> point to the the misplaced container. OK.
> 
> Second thing is what is the "misplaced" container. Is it because of
> wrong set_mempolicy(2) or cpuset configuration?
> If it's the former (i.e.
> it requires enabled cpuset controller), it'd justify exposing this info
> in cpuset.stat, if it's the latter, the cgroup aggregation is not that
> relevant (hence /proc/<PID>/sched) is sufficient. Or is there another
> meaning of a misplaced container? Chen, could you please clarify?

My understanding is that the "misplaced" container is not strictly tied
to set_mempolicy or cpuset configuration, but is mainly caused by the
scheduler's generic load balancer. The generic load balancer spreads
tasks across different nodes to fully utilize idle CPUs, while NUMA
balancing tries to pull misplaced tasks/pages back to honor NUMA locality.

Regarding the threaded subtrees mode, I was previously unfamiliar with
it and have been trying to understand it better. If I understand correctly,
if threads within a single process are placed in different cgroups via 
cpuset,
we might need to scan /proc/<PID>/sched to collect NUMA task migration/swap
statistics. If threaded subtrees are disabled for that process, we can query
memory.stat.

I agree with your prior point that NUMA balancing task activity is not 
directly
associated with either the Memory controller or the CPU controller. Although
showing this data in cpu.stat might seem more appropriate, we expose it in
memory.stat due to the following trade-offs(or as an exception for
NUMA balancing):

1.It aligns with existing NUMA-related metrics already present in 
memory.stat.
2.It simplifies code implementation.

thanks,
Chenyu

> 
> Because memory controller doesn't control NUMA, it needn't be enabled
> to have this statistics and it cannot be enabled in threaded groups, I'm
> having some doubts whether memory.stat is a good home for this field.
> 
> Regards,
> Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-06-03 14:46             ` Chen, Yu C
@ 2025-06-17  9:30               ` Michal Koutný
  2025-06-19 13:03                 ` Chen, Yu C
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Koutný @ 2025-06-17  9:30 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Shakeel Butt, peterz, akpm, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

[-- Attachment #1: Type: text/plain, Size: 1647 bytes --]

On Tue, Jun 03, 2025 at 10:46:06PM +0800, "Chen, Yu C" <yu.c.chen@intel.com> wrote:
> My understanding is that the "misplaced" container is not strictly tied
> to set_mempolicy or cpuset configuration, but is mainly caused by the
> scheduler's generic load balancer.

You are convincing me with this that, cpu.stat fits the concept better.
Doesn't that sound like that to you?

> Regarding the threaded subtrees mode, I was previously unfamiliar with
> it and have been trying to understand it better.

No problem.

> If I understand correctly, if threads within a single process are
> placed in different cgroups via cpuset, we might need to scan
> /proc/<PID>/sched to collect NUMA task migration/swap statistics.

The premise of your series was that you didn't want to do that :-)

> I agree with your prior point that NUMA balancing task activity is not
> directly
> associated with either the Memory controller or the CPU controller. Although
> showing this data in cpu.stat might seem more appropriate, we expose it in
> memory.stat due to the following trade-offs(or as an exception for
> NUMA balancing):
> 
> 1.It aligns with existing NUMA-related metrics already present in
> memory.stat.

That one I'd buy into. OTOH, I'd hope this could be overcome with
documentation.

> 2.It simplifies code implementation.

I'd say that only applies when accepting memory.stat as the better
place. I think the appropriately matching API should be picked first and
implementation is only secondary to that.
From your reasoning above, I think that the concept is closer to be in
cpu.stat ¯\_(ツ)_/¯

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-06-17  9:30               ` Michal Koutný
@ 2025-06-19 13:03                 ` Chen, Yu C
  2025-06-19 14:06                   ` Michal Koutný
  0 siblings, 1 reply; 21+ messages in thread
From: Chen, Yu C @ 2025-06-19 13:03 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Shakeel Butt, peterz, akpm, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

On 6/17/2025 5:30 PM, Michal Koutný wrote:
> On Tue, Jun 03, 2025 at 10:46:06PM +0800, "Chen, Yu C" <yu.c.chen@intel.com> wrote:
>> My understanding is that the "misplaced" container is not strictly tied
>> to set_mempolicy or cpuset configuration, but is mainly caused by the
>> scheduler's generic load balancer.
> 
> You are convincing me with this that, cpu.stat fits the concept better.
> Doesn't that sound like that to you?
> 
>> Regarding the threaded subtrees mode, I was previously unfamiliar with
>> it and have been trying to understand it better.
> 
> No problem.
> 
>> If I understand correctly, if threads within a single process are
>> placed in different cgroups via cpuset, we might need to scan
>> /proc/<PID>/sched to collect NUMA task migration/swap statistics.
> 
> The premise of your series was that you didn't want to do that :-)
> 
>> I agree with your prior point that NUMA balancing task activity is not
>> directly
>> associated with either the Memory controller or the CPU controller. Although
>> showing this data in cpu.stat might seem more appropriate, we expose it in
>> memory.stat due to the following trade-offs(or as an exception for
>> NUMA balancing):
>>
>> 1.It aligns with existing NUMA-related metrics already present in
>> memory.stat.
> 
> That one I'd buy into. OTOH, I'd hope this could be overcome with
> documentation.
> 
>> 2.It simplifies code implementation.
> 
> I'd say that only applies when accepting memory.stat as the better
> place. I think the appropriately matching API should be picked first and
> implementation is only secondary to that.

Thanks for this guidance.

>  From your reasoning above, I think that the concept is closer to be in
> cpu.stat ¯\_(ツ)_/¯
> 
OK. Since this change has already been addressed in upstream kernel,
I can update the numa_task_migrated/numa_task_swapped fields in
Documentation/admin-guide/cgroup-v2.rst to mention that, these
activities are not memory related but put here because they are
closer to numa balance's page statistics.
Or do you want me to submit a patch to move the items from
memory.stat to cpu.stat?

thanks,
Chenyu



> Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
  2025-06-19 13:03                 ` Chen, Yu C
@ 2025-06-19 14:06                   ` Michal Koutný
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Koutný @ 2025-06-19 14:06 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Shakeel Butt, peterz, akpm, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

[-- Attachment #1: Type: text/plain, Size: 638 bytes --]

On Thu, Jun 19, 2025 at 09:03:55PM +0800, "Chen, Yu C" <yu.c.chen@intel.com> wrote:
> OK. Since this change has already been addressed in upstream kernel,

Oh, I missed that. (Otherwise I wouldn't have bothered responding
anymore in this case.)

> I can update the numa_task_migrated/numa_task_swapped fields in
> Documentation/admin-guide/cgroup-v2.rst to mention that, these
> activities are not memory related but put here because they are
> closer to numa balance's page statistics.
> Or do you want me to submit a patch to move the items from
> memory.stat to cpu.stat?

I leave it up to you. (It's become sunk cost for me.)

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-06-19 14:06 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-23 12:48 [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Chen Yu
2025-05-23 12:51 ` [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads Chen Yu
2025-05-23 23:22   ` Shakeel Butt
2025-05-23 12:51 ` [PATCH v5 2/2] sched/numa: add statistics of numa balance task Chen Yu
2025-05-23 23:42   ` Shakeel Butt
2025-05-24  9:07     ` Chen, Yu C
2025-05-24 17:32       ` Shakeel Butt
2025-05-25 12:35         ` Chen, Yu C
2025-05-27 17:48           ` Shakeel Butt
2025-05-29  5:04             ` Chen, Yu C
2025-05-26 13:35     ` Michal Koutný
2025-05-27  9:20       ` Chen, Yu C
2025-05-27 18:15         ` Shakeel Butt
2025-06-02 16:53           ` Michal Koutný
2025-06-03 14:46             ` Chen, Yu C
2025-06-17  9:30               ` Michal Koutný
2025-06-19 13:03                 ` Chen, Yu C
2025-06-19 14:06                   ` Michal Koutný
2025-05-23 22:06 ` [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Andrew Morton
2025-05-23 23:52   ` Shakeel Butt
2025-05-28  0:21     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).