From: "Chen, Yu C" <yu.c.chen@intel.com>
To: Libo Chen <libo.chen@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>, Tejun Heo <tj@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Jonathan Corbet <corbet@lwn.net>,
Mel Gorman <mgormanmgorman@suse.de>,
Michal Hocko <mhocko@kernel.org>,
Michal Koutny <mkoutny@suse.com>,
Muchun Song <muchun.song@linux.dev>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
"Chen, Tim C" <tim.c.chen@intel.com>,
Aubrey Li <aubrey.li@intel.com>, <cgroups@vger.kernel.org>,
<linux-doc@vger.kernel.org>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v3] sched/numa: add statistics of numa balance task migration
Date: Fri, 2 May 2025 17:30:14 +0800 [thread overview]
Message-ID: <f024ebd4-2b18-4af5-8fc3-7b057501444b@intel.com> (raw)
In-Reply-To: <9a6402ca-ce63-430b-b60b-1a36971e37e4@oracle.com>
Hi Libo,
On 5/1/2025 3:00 PM, Libo Chen wrote:
> Hi Chen Yu
>
> On 4/30/25 03:36, Chen Yu wrote:
>> On systems with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful. NUMA balancing
>> has two mechanisms for task migration: one is to migrate the task to
>> an idle CPU in its preferred node, the other is to swap tasks on
>> different nodes if they are on each other's preferred node.
>>
>> The kernel already has NUMA page migration statistics in
>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched,
>> but does not have statistics for task migration/swap.
>> Add the task migration and swap count accordingly.
>>
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>>
>
> Both stats show up in expected places, but I notice they are also in
> /proc/vmstat and are always 0.
>
> I think you may have to add count_vm_numa_event() in migrate_task_to()
> and __migrate_swap_task() unless there is a way to not show both stats
> in /proc/vmstat.
>
>
Thanks for catching this. For some reasons, I added these
"placeholders" to the vmstat but didn't populate them. Let
me remove these items from vmstat because we mainly care
about task activity rather than page activity(not a vm event
I suppose) Let me have a try on this.
>> Introducing both pertask and permemcg NUMA balancing statistics helps
>> to quickly evaluate the performance and resource usage of the target
>> workload. For example, the user can first identify the container which
>> has high NUMA balance activity and then narrow down to a specific task
>> within that group, and tune the memory policy of that task.
>> In summary, it is plausible to iterate the /proc/$pid/sched to find the
>> offending task, but the introduction of per memcg tasks' Numa balancing
>> aggregated activity can further help users identify the task in a
>> divide-and-conquer way.
>>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>> v2->v3:
>> Remove unnecessary p->mm check because kernel threads are
>> not supported by Numa Balancing. (Libo Chen)
>> v1->v2:
>> Update the Documentation/admin-guide/cgroup-v2.rst. (Michal)
>> ---
>> Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
>> include/linux/sched.h | 4 ++++
>> include/linux/vm_event_item.h | 2 ++
>> kernel/sched/core.c | 7 +++++--
>> kernel/sched/debug.c | 4 ++++
>> mm/memcontrol.c | 2 ++
>> mm/vmstat.c | 2 ++
>> 7 files changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 1a16ce68a4d7..d346f3235945 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -1670,6 +1670,12 @@ The following nested keys are defined.
>> numa_hint_faults (npn)
>> Number of NUMA hinting faults.
>>
>> + numa_task_migrated (npn)
>> + Number of task migration by NUMA balancing.
>> +
>> + numa_task_swapped (npn)
>> + Number of task swap by NUMA balancing.
>> +
>> pgdemote_kswapd
>> Number of pages demoted by kswapd.
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index f96ac1982893..1c50e30b5c01 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -549,6 +549,10 @@ struct sched_statistics {
>> u64 nr_failed_migrations_running;
>> u64 nr_failed_migrations_hot;
>> u64 nr_forced_migrations;
>> +#ifdef CONFIG_NUMA_BALANCING
>> + u64 numa_task_migrated;
>> + u64 numa_task_swapped;
>> +#endif
>>
>
> This one is more of personal preference. I understand they show up only if
> you turn on schedstats, but will it be better to put them in sched_show_numa()
> so they will be printed out next to other numa stats such as numa_pages_migrated?
>
> @@ -1153,6 +1153,10 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
> if (p->mm)
> P(mm->numa_scan_seq);
>
> + if (schedstat_enabled()) {
> + P_SCHEDSTAT(numa_task_migrated);
> + P_SCHEDSTAT(numa_task_swapped);
> + }
> P(numa_pages_migrated);
> P(numa_preferred_nid);
> P(total_numa_faults);
>
>
Previously, the numa_task_migrated and numa_task_swapped were
put under the scope of schedstat_enabled() in
proc_sched_show_task().
We mainly care about task migration activity, so it is put near
the nr_forced_migrations. When it reaches sched_show_numa(),
P_SCHEDSTAT has been undefined. It is just simpler to do this
directly in proc_sched_show_task() IMO.
Thanks,
Chenyu
> Thanks,
> Libo
>> u64 nr_wakeups;
>> u64 nr_wakeups_sync;
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 9e15a088ba38..91a3ce9a2687 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -66,6 +66,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> NUMA_HINT_FAULTS,
>> NUMA_HINT_FAULTS_LOCAL,
>> NUMA_PAGE_MIGRATE,
>> + NUMA_TASK_MIGRATE,
>> + NUMA_TASK_SWAP,
>> #endif
>> #ifdef CONFIG_MIGRATION
>> PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c81cf642dba0..25a92f2abda4 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3352,6 +3352,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> #ifdef CONFIG_NUMA_BALANCING
>> static void __migrate_swap_task(struct task_struct *p, int cpu)
>> {
>> + __schedstat_inc(p->stats.numa_task_swapped);
>> + count_memcg_events_mm(p->mm, NUMA_TASK_SWAP, 1);
>> +
>> if (task_on_rq_queued(p)) {
>> struct rq *src_rq, *dst_rq;
>> struct rq_flags srf, drf;
>> @@ -7953,8 +7956,8 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
>> if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
>> return -EINVAL;
>>
>> - /* TODO: This is not properly updating schedstats */
>> -
>> + __schedstat_inc(p->stats.numa_task_migrated);
>> + count_memcg_events_mm(p->mm, NUMA_TASK_MIGRATE, 1);
>> trace_sched_move_numa(p, curr_cpu, target_cpu);
>> return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
>> }
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 56ae54e0ce6a..f971c2af7912 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -1206,6 +1206,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>> P_SCHEDSTAT(nr_failed_migrations_running);
>> P_SCHEDSTAT(nr_failed_migrations_hot);
>> P_SCHEDSTAT(nr_forced_migrations);
>> +#ifdef CONFIG_NUMA_BALANCING
>> + P_SCHEDSTAT(numa_task_migrated);
>> + P_SCHEDSTAT(numa_task_swapped);
>> +#endif
>> P_SCHEDSTAT(nr_wakeups);
>> P_SCHEDSTAT(nr_wakeups_sync);
>> P_SCHEDSTAT(nr_wakeups_migrate);
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c96c1f2b9cf5..cdaab8a957f3 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -463,6 +463,8 @@ static const unsigned int memcg_vm_event_stat[] = {
>> NUMA_PAGE_MIGRATE,
>> NUMA_PTE_UPDATES,
>> NUMA_HINT_FAULTS,
>> + NUMA_TASK_MIGRATE,
>> + NUMA_TASK_SWAP,
>> #endif
>> };
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 4c268ce39ff2..ed08bb384ae4 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1347,6 +1347,8 @@ const char * const vmstat_text[] = {
>> "numa_hint_faults",
>> "numa_hint_faults_local",
>> "numa_pages_migrated",
>> + "numa_task_migrated",
>> + "numa_task_swapped",
>> #endif
>> #ifdef CONFIG_MIGRATION
>> "pgmigrate_success",
>
next prev parent reply other threads:[~2025-05-02 9:31 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-30 10:36 [PATCH v3] sched/numa: add statistics of numa balance task migration Chen Yu
2025-05-01 7:00 ` Libo Chen
2025-05-02 9:30 ` Chen, Yu C [this message]
2025-05-05 6:43 ` Jain, Ayush
2025-05-05 15:03 ` Chen, Yu C
2025-05-05 17:25 ` Venkat Rao Bagalkote
2025-05-07 11:36 ` Chen, Yu C
2025-05-05 17:46 ` Michal Koutný
2025-05-05 18:27 ` Chen, Yu C
2025-05-05 18:49 ` Libo Chen
2025-05-05 21:32 ` Libo Chen
2025-05-05 21:57 ` Libo Chen
2025-05-06 5:06 ` Jain, Ayush
2025-05-06 5:36 ` Chen, Yu C
2025-05-06 7:03 ` Libo Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f024ebd4-2b18-4af5-8fc3-7b057501444b@intel.com \
--to=yu.c.chen@intel.com \
--cc=akpm@linux-foundation.org \
--cc=aubrey.li@intel.com \
--cc=cgroups@vger.kernel.org \
--cc=corbet@lwn.net \
--cc=hannes@cmpxchg.org \
--cc=kprateek.nayak@amd.com \
--cc=libo.chen@oracle.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgormanmgorman@suse.de \
--cc=mhocko@kernel.org \
--cc=mingo@redhat.com \
--cc=mkoutny@suse.com \
--cc=muchun.song@linux.dev \
--cc=peterz@infradead.org \
--cc=roman.gushchin@linux.dev \
--cc=shakeel.butt@linux.dev \
--cc=tim.c.chen@intel.com \
--cc=tj@kernel.org \
--cc=vineethr@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox