* [PATCH v6 0/2] sched/numa: add statistics of numa balance task migration
From: Chen Yu @ 2025-05-29  4:53 UTC
  To: peterz, akpm, mkoutny, shakeel.butt
  Cc: mingo, tj, hannes, corbet, mgorman, mhocko, muchun.song,
	roman.gushchin, tim.c.chen, aubrey.li, libo.chen, kprateek.nayak,
	vineethr, venkat88, ayushjai, cgroups, linux-doc, linux-mm,
	linux-kernel, yu.chen.surf, Chen Yu

Introduce task migration and swap statistics in the following places:
/sys/fs/cgroup/{GROUP}/memory.stat
/proc/{PID}/sched
/proc/vmstat

These statistics facilitate a rapid evaluation of the performance and resource
utilization of the target workload.
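
For example, after the series is applied, the new counters appear
as lines like the following (the counter names match the patches;
the values and exact layout below are illustrative only):

  # /sys/fs/cgroup/{GROUP}/memory.stat
  numa_task_migrated 125
  numa_task_swapped 13

  # /proc/{PID}/sched
  numa_task_migrated                           :                    3
  numa_task_swapped                            :                    1

  # /proc/vmstat
  numa_task_migrated 2314
  numa_task_swapped 562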

Patch 1 is a fix from Libo to avoid task swapping for kernel threads
and user threads that do not have an mm, because NUMA balancing only
cares about user pages via VMAs.

Patch 2 is the major change to expose the statistics of task migration and
swapping in corresponding files.

The reason for folding patch 1 and patch 2 into one patch set is that
patch 1 is necessary for patch 2: it prevents patch 2 from dereferencing
the NULL mm_struct of a kernel thread, which would cause a NULL pointer
exception.
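
For reference, the failing path would be the task-swap accounting
added in patch 2, which calls count_memcg_event_mm(p->mm, ...);
that helper dereferences the mm it is given. A paraphrased sketch
of the helper (based on include/linux/memcontrol.h, not the
verbatim source; details may differ across kernel versions):

	static inline void count_memcg_event_mm(struct mm_struct *mm,
						enum vm_event_item idx)
	{
		struct mem_cgroup *memcg;

		if (mem_cgroup_disabled())
			return;

		rcu_read_lock();
		/* a NULL mm (kernel thread) would fault right here */
		memcg = rcu_dereference(mm->owner);
		if (likely(memcg))
			count_memcg_events(memcg, idx, 1);
		rcu_read_unlock();
	}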

Changes since v5:
Enhance the git commit message to describe the reason why these events
are introduced in memory.stat/vmstat rather than cpu.stat.

Previous version:
v5:
https://lore.kernel.org/all/cover.1748002400.git.yu.c.chen@intel.com/
v4:
https://lore.kernel.org/all/cover.1746611892.git.yu.c.chen@intel.com/
v3:
https://lore.kernel.org/lkml/20250430103623.3349842-1-yu.c.chen@intel.com/
v2:
https://lore.kernel.org/lkml/20250408101444.192519-1-yu.c.chen@intel.com/
v1:
https://lore.kernel.org/lkml/20250402010611.3204674-1-yu.c.chen@intel.com/

Chen Yu (1):
  sched/numa: add statistics of numa balance task

Libo Chen (1):
  sched/numa: fix task swap by skipping kernel threads

 Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
 include/linux/sched.h                   | 4 ++++
 include/linux/vm_event_item.h           | 2 ++
 kernel/sched/core.c                     | 9 +++++++--
 kernel/sched/debug.c                    | 4 ++++
 kernel/sched/fair.c                     | 3 ++-
 mm/memcontrol.c                         | 2 ++
 mm/vmstat.c                             | 2 ++
 8 files changed, 29 insertions(+), 3 deletions(-)

-- 
2.25.1



* [PATCH v6 1/2] sched/numa: fix task swap by skipping kernel threads
From: Chen Yu @ 2025-05-29  4:54 UTC
  To: peterz, akpm, mkoutny, shakeel.butt
  Cc: mingo, tj, hannes, corbet, mgorman, mhocko, muchun.song,
	roman.gushchin, tim.c.chen, aubrey.li, libo.chen, kprateek.nayak,
	vineethr, venkat88, ayushjai, cgroups, linux-doc, linux-mm,
	linux-kernel, yu.chen.surf, Ayush Jain, Chen Yu

From: Libo Chen <libo.chen@oracle.com>

Task swapping is triggered when there are no idle CPUs on
task A's preferred node. In this case, the NUMA load balancer
chooses a task B on A's preferred node and swaps B with A. This
helps improve NUMA locality without introducing load imbalance
between nodes. In the current implementation, B's NUMA node
preference is not mandatory; that is to say, a kernel thread
might be incorrectly chosen as B. However, kernel threads and
user-space threads that do not have an mm are not supposed to
be covered by NUMA balancing, because NUMA balancing only
considers user pages via VMAs.

Following Peter's suggestion for fixing this issue, use
PF_KTHREAD to skip kernel threads. curr->mm is also checked,
because user_mode_thread() might create a user thread without
an mm. As per Prateek's analysis, after adding the PF_KTHREAD
check, there is no need to further check the PF_IDLE flag:
"
- play_idle_precise() already ensures PF_KTHREAD is set before adding
  PF_IDLE

- cpu_startup_entry() is only called from the startup thread which
  should be marked with PF_KTHREAD (based on my understanding looking at
  commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
  setup"))
"

In summary, the check in task_numa_compare() now aligns with
task_tick_numa().
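
For reference, the pre-existing check in task_tick_numa() that
this change aligns with looks roughly as follows (paraphrased
from kernel/sched/fair.c; exact code may differ across kernel
versions):

	/*
	 * We don't care about NUMA placement if we don't have
	 * memory, are exiting, or are a kernel thread.
	 */
	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
	    work->next != work)
		return;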

Suggested-by: Michal Koutny <mkoutny@suse.com>
Tested-by: Ayush Jain <Ayush.jain3@amd.com>
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
v5->v6:
Add Reviewed-by from Shakeel.
v4->v5:
Add PF_KTHREAD check and remove the PF_IDLE check (Prateek).
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 125912c0e9dd..68aa5941c8ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2273,7 +2273,8 @@ static bool task_numa_compare(struct task_numa_env *env,
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
-	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
+	if (cur && ((cur->flags & (PF_EXITING | PF_KTHREAD)) ||
+		    !cur->mm))
 		cur = NULL;
 
 	/*
-- 
2.25.1



* [PATCH v6 2/2] sched/numa: add statistics of numa balance task
From: Chen Yu @ 2025-05-29  4:55 UTC
  To: peterz, akpm, mkoutny, shakeel.butt
  Cc: mingo, tj, hannes, corbet, mgorman, mhocko, muchun.song,
	roman.gushchin, tim.c.chen, aubrey.li, libo.chen, kprateek.nayak,
	vineethr, venkat88, ayushjai, cgroups, linux-doc, linux-mm,
	linux-kernel, yu.chen.surf, Chen Yu

On systems with NUMA balancing enabled, it is useful to track
the task activities triggered by NUMA balancing. NUMA balancing
employs two mechanisms for task migration: one is to migrate
a task to an idle CPU on its preferred node, and the other is
to swap tasks located on different nodes when each is on the
other's preferred node.

The kernel already provides NUMA page migration statistics in
/sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched. However,
it lacks statistics on task migration and swapping. Therefore,
add the relevant counts for task migration and swapping.

The following two new fields:

numa_task_migrated
numa_task_swapped

will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched,
and /proc/vmstat.

Introducing both per-task and per-memory cgroup (memcg) NUMA
balancing statistics facilitates a rapid evaluation of the
performance and resource utilization of the target workload.
For instance, users can first identify the container with high
NUMA balancing activity, then pinpoint a specific task within
that group, and finally adjust the memory policy for that task.
In short, although it is possible to iterate through every
/proc/$pid/sched to locate the problematic task, aggregating the
NUMA balancing activity of the tasks in each memcg lets users
identify that task more efficiently, via a divide-and-conquer
approach.
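
A sketch of that workflow (the cgroup names, PIDs, and counter
values below are hypothetical):

  # Step 1: find the memcg with high NUMA balancing activity.
  $ grep numa_task_swapped /sys/fs/cgroup/*/memory.stat
  /sys/fs/cgroup/app1/memory.stat:numa_task_swapped 9024
  /sys/fs/cgroup/app2/memory.stat:numa_task_swapped 4

  # Step 2: only then iterate the tasks of that one group.
  $ for p in $(cat /sys/fs/cgroup/app1/cgroup.procs); do
        grep -H numa_task_swapped /proc/$p/sched; done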

As Libo Chen pointed out, the memcg events rely on the text
names in vmstat_text, and /proc/vmstat generates its items based
on vmstat_text. Thus, the task migration and swapping events
introduced in vmstat_text also need to be populated via
count_vm_numa_event(); otherwise, these values remain zero in
/proc/vmstat.
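
For context, count_vm_numa_event() is a thin wrapper around the
generic VM event counters when CONFIG_NUMA_BALANCING is enabled,
roughly as follows (paraphrased from include/linux/vm_event_item.h):

	#ifdef CONFIG_NUMA_BALANCING
	#define count_vm_numa_event(x)     count_vm_event(x)
	#else
	#define count_vm_numa_event(x)     do {} while (0)
	#endif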

In theory, task migration and swap events are part of the
scheduler's activities. The reason for exposing them through the
memory.stat/vmstat interface is that we already have NUMA balancing
statistics in memory.stat/vmstat, and these events are closely
related to each other. Following Shakeel's suggestion, we describe
the end-to-end flow/story of all these events occurring on a
timeline for future reference:

The goal of NUMA balancing is to co-locate a task and its
memory pages on the same NUMA node. There are two strategies:
migrate the pages to the task's node, or migrate the task to
the node where its pages reside.

Suppose a task p1 is running on Node 0, but its pages are
located on Node 1. NUMA page fault statistics for p1 reveal
its "page footprint" across nodes. If NUMA balancing detects
that most of p1's pages are on Node 1:

1. Page Migration Attempt:
NUMA balancing first tries to migrate p1's pages to Node 0.
The numa_pages_migrated counter increments.

2. Task Migration Strategies:
After the page migration finishes, NUMA balancing checks about
once per second whether p1 can be migrated to Node 1.

Case 2.1: Idle CPU Available
If Node 1 has an idle CPU, p1 is directly scheduled there. This
event is logged as numa_task_migrated.

Case 2.2: No Idle CPU (Task Swap)
If all CPUs on Node 1 are busy, direct migration could cause CPU
contention or load imbalance. Instead, NUMA balancing selects a
candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its
own page footprint). p1 and p2 are then swapped. This cross-node
swap is recorded as numa_task_swapped.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
v5->v6:
Enhance the commit message by adding Numa balance events description.
(Shakeel)
v4->v5:
no change.
v3->v4:
Populate /proc/vmstat; otherwise the items are all zero.
(Libo)
v2->v3:
Remove unnecessary p->mm check because kernel threads are
not supported by NUMA balancing. (Libo)
v1->v2:
Update the Documentation/admin-guide/cgroup-v2.rst. (Michal)
---
 Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
 include/linux/sched.h                   | 4 ++++
 include/linux/vm_event_item.h           | 2 ++
 kernel/sched/core.c                     | 9 +++++++--
 kernel/sched/debug.c                    | 4 ++++
 mm/memcontrol.c                         | 2 ++
 mm/vmstat.c                             | 2 ++
 7 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1edc26622594..2e5d08a273ad 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1705,6 +1705,12 @@ The following nested keys are defined.
 	  numa_hint_faults (npn)
 		Number of NUMA hinting faults.
 
+	  numa_task_migrated (npn)
+		Number of task migration by NUMA balancing.
+
+	  numa_task_swapped (npn)
+		Number of task swap by NUMA balancing.
+
 	  pgdemote_kswapd
 		Number of pages demoted by kswapd.
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e5953b8f32..bbc1820a1552 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -548,6 +548,10 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+#ifdef CONFIG_NUMA_BALANCING
+	u64				numa_task_migrated;
+	u64				numa_task_swapped;
+#endif
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9e15a088ba38..91a3ce9a2687 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_TASK_MIGRATE,
+		NUMA_TASK_SWAP,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62b3416f5e43..dce50fa57471 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3362,6 +3362,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
+	__schedstat_inc(p->stats.numa_task_swapped);
+	count_vm_numa_event(NUMA_TASK_SWAP);
+	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
+
 	if (task_on_rq_queued(p)) {
 		struct rq *src_rq, *dst_rq;
 		struct rq_flags srf, drf;
@@ -7930,8 +7934,9 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
 		return -EINVAL;
 
-	/* TODO: This is not properly updating schedstats */
-
+	__schedstat_inc(p->stats.numa_task_migrated);
+	count_vm_numa_event(NUMA_TASK_MIGRATE);
+	count_memcg_event_mm(p->mm, NUMA_TASK_MIGRATE);
 	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..9d71baf08075 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1210,6 +1210,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+#ifdef CONFIG_NUMA_BALANCING
+		P_SCHEDSTAT(numa_task_migrated);
+		P_SCHEDSTAT(numa_task_swapped);
+#endif
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec39e62b172e..ab89a3a0f1bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -463,6 +463,8 @@ static const unsigned int memcg_vm_event_stat[] = {
 	NUMA_PAGE_MIGRATE,
 	NUMA_PTE_UPDATES,
 	NUMA_HINT_FAULTS,
+	NUMA_TASK_MIGRATE,
+	NUMA_TASK_SWAP,
 #endif
 };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4c268ce39ff2..ed08bb384ae4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1347,6 +1347,8 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_task_migrated",
+	"numa_task_swapped",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
-- 
2.25.1



* Re: [PATCH v6 0/2] sched/numa: add statistics of numa balance task migration
From: Andrew Morton @ 2025-05-29  5:26 UTC
  To: Chen Yu
  Cc: peterz, mkoutny, shakeel.butt, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

On Thu, 29 May 2025 12:53:36 +0800 Chen Yu <yu.c.chen@intel.com> wrote:

> Introduce task migration and swap statistics in the following places:
> /sys/fs/cgroup/{GROUP}/memory.stat
> /proc/{PID}/sched
> /proc/vmstat
> 
> These statistics facilitate a rapid evaluation of the performance and resource
> utilization of the target workload.

OK, thanks, I confirmed that there were no code changes since v5 and I
updated the changelogs in place.

I'll aim to include this series in the second mm.git->Linus pull
request next week, unless someone prevents me.


* Re: [PATCH v6 0/2] sched/numa: add statistics of numa balance task migration
From: Chen, Yu C @ 2025-05-29  5:32 UTC
  To: Andrew Morton
  Cc: peterz, mkoutny, shakeel.butt, mingo, tj, hannes, corbet, mgorman,
	mhocko, muchun.song, roman.gushchin, tim.c.chen, aubrey.li,
	libo.chen, kprateek.nayak, vineethr, venkat88, ayushjai, cgroups,
	linux-doc, linux-mm, linux-kernel, yu.chen.surf

On 5/29/2025 1:26 PM, Andrew Morton wrote:
> On Thu, 29 May 2025 12:53:36 +0800 Chen Yu <yu.c.chen@intel.com> wrote:
> 
>> Introduce task migration and swap statistics in the following places:
>> /sys/fs/cgroup/{GROUP}/memory.stat
>> /proc/{PID}/sched
>> /proc/vmstat
>>
>> These statistics facilitate a rapid evaluation of the performance and resource
>> utilization of the target workload.
> 
> OK, thanks, I confirmed that there were no code changes since v5 and I
> updated the changelogs in place.
> 
> I'll aim to include this series in the second mm.git->Linus pull
> request next week, unless someone prevents me.

OK, thanks Andrew.

best,
Chenyu

