[PATCH v2] sched/numa: Introduce per cgroup numa balance control

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] sched/numa: Introduce per cgroup numa balance control
@ 2025-06-25 10:23 Chen Yu
  2025-06-25 12:19 ` Michal Koutný
  2025-07-01  8:25 ` Libo Chen
  0 siblings, 2 replies; 7+ messages in thread
From: Chen Yu @ 2025-06-25 10:23 UTC (permalink / raw)
  To: Jonathan Corbet, Peter Zijlstra, Ingo Molnar, Shakeel Butt,
	Michal Koutný
  Cc: Juri Lelli, Ben Segall, Libo Chen, Mel Gorman, Valentin Schneider,
	Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Phil Auld, Tejun Heo, Daniel Jordan, Jann Horn, Pedro Falcato,
	Aubrey Li, Tim Chen, Huang, Ying, linux-doc, linux-kernel,
	linux-mm, Chen Yu

[Problem Statement]
Currently, NUMA balancing is configured system-wide.
However, in some production environments, different
cgroups may have varying requirements for NUMA balancing.
Some cgroups are CPU-intensive, while others are
memory-intensive. Some do not benefit from NUMA balancing
due to the overhead associated with VMA scanning, while
others prefer NUMA balancing as it helps improve memory
locality. In this case, system-wide NUMA balancing is
usually disabled to avoid causing regressions.

[Proposal]
Introduce a per-cgroup interface to enable NUMA balancing
for specific cgroups. This interface is associated with
the CPU subsystem, which does not support threaded subtrees,
and close to CPU bandwidth control. The system administrator
needs to set the NUMA balancing mode to
NUMA_BALANCING_CGROUP=4 to enable this feature. When the
system is in NUMA_BALANCING_CGROUP mode, NUMA balancing
for all cgroups is disabled by default. After the
administrator enables this feature for a specific cgroup,
NUMA balancing for that cgroup is enabled.

A simple example to show how to use per-cgroup NUMA balancing:

Step1
//switch on per cgroup Numa balancing.
//All cgroup's NUMA balance is disabled by default.
echo 4 > /proc/sys/kernel/numa_balancing

Step2
//created a cgroup named mytest, enable its NUMA balancing
echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance

[Benchmark]
Tested on Xeon Sapphire Rapids, with 4 Numa nodes. Created
a cgroup mytest and launched autonumabench NUMA01_THREADLOCAL.
Each test runs 6 cycles.

baseline:
29360.56user 16280.68system 3:33.29elapsed
29788.41user 16060.31system 3:34.38elapsed
28307.51user 17043.45system 3:33.03elapsed
29552.49user 16307.65system 3:34.20elapsed
29847.41user 15966.15system 3:34.65elapsed
29111.10user 16532.78system 3:33.19elapsed

per cgroup NUMA balance:
7589.78user 16494.90system 1:53.18elapsed
7795.54user 16537.65system 1:54.11elapsed
8295.66user 16391.21system 1:55.98elapsed
7836.34user 17312.31system 1:55.71elapsed
7773.26user 16856.19system 1:54.08elapsed
7534.43user 17604.58system 1:55.01elapsed

The user time has been reduced to 33% of the
original, and the elapsed time has dropped to
45% of the original (lower values are better).

cat /sys/fs/cgroup/mytest/memory.stat | grep numa
numa_pages_migrated 10238503
numa_pte_updates 24378124
numa_hint_faults 16921590
numa_task_migrated 253
numa_task_swapped 4

to-do:
Per-cgroup NUMA balancing should consider the
hierarchy of the cgroup. Initially, NUMA balancing
is disabled for the root cgroup. A cgroup that has
NUMA balancing enabled should have all its parents
enabled. For example, suppose cgroup A is the parent
of cgroup B; if A.numa_load_balance is 0, even if
B.numa_load_balance is 1, NUMA balancing for B is
disabled. This idea is derived from
commit e39925734909 ("mm/memcontrol: respect
zswap.writeback setting from parent cgroup too").

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
v1->v2:

Add documentation in Documentation/admin-guide/sysctl/kernel.rst.

Add comments in tg_numa_balance_enabled() to explain that
the newly introduced NUMA balancing mode is naturally
exclusive of existing NUMA balancing modes. (Tim)
---
 Documentation/admin-guide/sysctl/kernel.rst |  6 ++++
 include/linux/sched/sysctl.h                |  1 +
 kernel/sched/core.c                         | 31 +++++++++++++++++++++
 kernel/sched/fair.c                         | 28 +++++++++++++++++++
 kernel/sched/sched.h                        |  3 ++
 mm/mprotect.c                               |  5 ++--
 6 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index dd49a89a62d3..ff88d1153c19 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -709,6 +709,7 @@ The value to set can be the result of ORing the following:
 0 NUMA_BALANCING_DISABLED
 1 NUMA_BALANCING_NORMAL
 2 NUMA_BALANCING_MEMORY_TIERING
+4 NUMA_BALANCING_CGROUP
 = =================================
 
 Or NUMA_BALANCING_NORMAL to optimize page placement among different
@@ -729,6 +730,11 @@ different types of memory (represented as different NUMA nodes) to
 place the hot pages in the fast memory.  This is implemented based on
 unmapping and page fault too.
 
+Or NUMA_BALANCING_CGROUP to enable the per cgroup NUMA balancing.
+This new behavior can be opted-in/out on a per-cgroup basis via a new
+cgroup CPU subsystem file named numa_load_balance. By default, per
+cgroup NUMA balancing for each cgroup is enabled.
+
 numa_balancing_promote_rate_limit_MBps
 ======================================
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..1e4d5a9ddb26 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -22,6 +22,7 @@ enum sched_tunable_scaling {
 #define NUMA_BALANCING_DISABLED		0x0
 #define NUMA_BALANCING_NORMAL		0x1
 #define NUMA_BALANCING_MEMORY_TIERING	0x2
+#define NUMA_BALANCING_CGROUP		0x4
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int sysctl_numa_balancing_mode;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8988d38d46a3..8e9aa59193df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10078,6 +10078,30 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 }
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+static int numa_balance_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 enable)
+{
+	struct task_group *tg;
+	bool was_enabled;
+
+	tg = css_tg(css);
+	was_enabled = READ_ONCE(tg->nlb_enabled);
+	if (was_enabled == enable)
+		return 0;
+
+	WRITE_ONCE(tg->nlb_enabled, enable);
+
+	return 0;
+}
+
+static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return READ_ONCE(css_tg(css)->nlb_enabled);
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_GROUP_SCHED_WEIGHT
 	{
@@ -10126,6 +10150,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_load_balance",
+		.read_u64 = numa_balance_read_u64,
+		.write_u64 = numa_balance_write_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..dcdee3bf9960 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3161,6 +3161,29 @@ void task_numa_free(struct task_struct *p, bool final)
 	}
 }
 
+/*
+ * Return true if the NUMA balance is allowed for
+ * the task in a task group.
+ */
+static bool tg_numa_balance_enabled(struct task_struct *p)
+{
+	/*
+	 * The min/max of sysctl_numa_balancing_mode ranges
+	 * from SYSCTL_ONE to SYSCTL_FOUR, so it is safe to
+	 * only check NUMA_BALANCING_CGROUP because it is
+	 * impossible to have both NUMA_BALANCING_CGROUP and
+	 * NUMA_BALANCING_NORMAL/NUMA_BALANCING_MEMORY_TIERING
+	 * set.
+	 */
+	struct task_group *tg = task_group(p);
+
+	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
+	    !READ_ONCE(tg->nlb_enabled))
+		return false;
+
+	return true;
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -3189,6 +3212,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	     !cpupid_valid(last_cpupid)))
 		return;
 
+	if (!tg_numa_balance_enabled(p))
+		return;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) *
@@ -3330,6 +3356,8 @@ static void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (!tg_numa_balance_enabled(p))
+		return;
 	/*
 	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 	 * no page can be migrated.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..4b0dc656688e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -485,6 +485,9 @@ struct task_group {
 	/* Effective clamp values used for a task group */
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	u64			nlb_enabled;
+#endif
 
 };
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 88608d0dc2c2..c288ffb92bfc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 				toptier = node_is_toptier(nid);
 
 				/*
-				 * Skip scanning top tier node if normal numa
+				 * Skip scanning top tier node if normal and cgroup numa
 				 * balancing is disabled
 				 */
-				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+				if (!(sysctl_numa_balancing_mode &
+				    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
 				    toptier)
 					continue;
 				if (folio_use_access_time(folio))
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
  2025-06-25 10:23 [PATCH v2] sched/numa: Introduce per cgroup numa balance control Chen Yu
@ 2025-06-25 12:19 ` Michal Koutný
  2025-06-26  9:07   ` Chen, Yu C
  2025-07-01  8:25 ` Libo Chen
  1 sibling, 1 reply; 7+ messages in thread
From: Michal Koutný @ 2025-06-25 12:19 UTC (permalink / raw)
  To: Chen Yu
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar, Shakeel Butt,
	Juri Lelli, Ben Segall, Libo Chen, Mel Gorman, Valentin Schneider,
	Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Phil Auld, Tejun Heo, Daniel Jordan, Jann Horn, Pedro Falcato,
	Aubrey Li, Tim Chen, Huang, Ying, linux-doc, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 1744 bytes --]

On Wed, Jun 25, 2025 at 06:23:37PM +0800, Chen Yu <yu.c.chen@intel.com> wrote:
> [Problem Statement]
> Currently, NUMA balancing is configured system-wide.
> However, in some production environments, different
> cgroups may have varying requirements for NUMA balancing.
> Some cgroups are CPU-intensive, while others are
> memory-intensive. Some do not benefit from NUMA balancing
> due to the overhead associated with VMA scanning, while
> others prefer NUMA balancing as it helps improve memory
> locality. In this case, system-wide NUMA balancing is
> usually disabled to avoid causing regressions.
> 
> [Proposal]
> Introduce a per-cgroup interface to enable NUMA balancing
> for specific cgroups.

The balancing works with task granularity already and this new attribute
is not much of a resource to control.
Have you considered a per-task attribute? (sched_setattr(), prctl() or
similar) That one could be inherited and respective cgroups would be
seeded with a process with intended values. And cpuset could be
traditionally used to restrict the scope of balancing of such tasks.

WDYT?

> This interface is associated with the CPU subsystem, which
> does not support threaded subtrees, and close to CPU bandwidth
> control.
 (??) does support

> The system administrator needs to set the NUMA balancing mode to
> NUMA_BALANCING_CGROUP=4 to enable this feature. When the system is in
> NUMA_BALANCING_CGROUP mode, NUMA balancing for all cgroups is disabled
> by default. After the administrator enables this feature for a
> specific cgroup, NUMA balancing for that cgroup is enabled.

How much dynamic do you such changes to be? In relation to given
cgroup's/process's lifecycle.

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
  2025-06-25 12:19 ` Michal Koutný
@ 2025-06-26  9:07   ` Chen, Yu C
  0 siblings, 0 replies; 7+ messages in thread
From: Chen, Yu C @ 2025-06-26  9:07 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar, Shakeel Butt,
	Juri Lelli, Ben Segall, Libo Chen, Mel Gorman, Valentin Schneider,
	Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Phil Auld, Tejun Heo, Daniel Jordan, Jann Horn, Pedro Falcato,
	Aubrey Li, Tim Chen, Huang, Ying, linux-doc, linux-kernel,
	linux-mm, Xunlei Pang, huifeng.le

Hi Michal,

Thanks for taking a look.

On 6/25/2025 8:19 PM, Michal Koutný wrote:
> On Wed, Jun 25, 2025 at 06:23:37PM +0800, Chen Yu <yu.c.chen@intel.com> wrote:
>> [Problem Statement]
>> Currently, NUMA balancing is configured system-wide.
>> However, in some production environments, different
>> cgroups may have varying requirements for NUMA balancing.
>> Some cgroups are CPU-intensive, while others are
>> memory-intensive. Some do not benefit from NUMA balancing
>> due to the overhead associated with VMA scanning, while
>> others prefer NUMA balancing as it helps improve memory
>> locality. In this case, system-wide NUMA balancing is
>> usually disabled to avoid causing regressions.
>>
>> [Proposal]
>> Introduce a per-cgroup interface to enable NUMA balancing
>> for specific cgroups.
> 
> The balancing works with task granularity already and this new attribute
> is not much of a resource to control.
> Have you considered a per-task attribute? (sched_setattr(), prctl() or
> similar) That one could be inherited and respective cgroups would be
> seeded with a process with intended values.

OK, the prctl approach should work. However, setting this
attribute via cgroup might be more convenient for the userspace
IMHO. The original requirement stems from cloud environments,
where it's typically unacceptable to require applications to
modify their code to add prctl(). Thus, the orchestration layer
must handle this. For example, the initial process of the container
needs adjustment. After consulting with cloud-native developers,
I learned that containerd-shim-runc-v2 serves as the first
process. Therefore, we may need to modify the
containerd-shim-runc-v2 code to use prctl for the NUMA
balancing attribute, allowing child processes to inherit the
settings. While if it is per cgroup control, the user can just
touch one sysfs item.

> And cpuset could be
> traditionally used to restrict the scope of balancing of such tasks.
> 
> WDYT?
> 

In some scenarios, cgroups serve as micro-service containers.
They are not bound to any CPU sets and instead run freely on all
online CPUs. These cgroups can be sensitive to CPU capacity, as well
as NUMA locality (involving page migration and task migration).

>> This interface is associated with the CPU subsystem, which
>> does not support threaded subtrees, and close to CPU bandwidth
>> control.
>   (??) does support
> 

Ah yes, it supports threaded cgroup type. In this case, we
might need to disable the per-cgroup NUMA balance for threaded
cgroup type.

>> The system administrator needs to set the NUMA balancing mode to
>> NUMA_BALANCING_CGROUP=4 to enable this feature. When the system is in
>> NUMA_BALANCING_CGROUP mode, NUMA balancing for all cgroups is disabled
>> by default. After the administrator enables this feature for a
>> specific cgroup, NUMA balancing for that cgroup is enabled.
> 
> How much dynamic do you such changes to be? In relation to given
> cgroup's/process's lifecycle.
> 

I think it depends on the design. Starting from Kubernetes v1.33,
there is a feature called "in-place Pod resize," which allows users
to modify CPU and memory requests and limits for containers(via
cgroup interfaces) in a running Pod — often without needing to
restart the container. That said, if an admin wants to adjust
NUMA balancing settings at runtime (after the monitor detects
excessive remote NUMA memory access), using prctl might require
iterating through each process in the cgroup and invoking prctl
on them individually.

thanks,
Chenyu

> Thanks,
> Michal

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
  2025-06-25 10:23 [PATCH v2] sched/numa: Introduce per cgroup numa balance control Chen Yu
  2025-06-25 12:19 ` Michal Koutný
@ 2025-07-01  8:25 ` Libo Chen
  2025-07-01 16:36   ` Chen, Yu C
  1 sibling, 1 reply; 7+ messages in thread
From: Libo Chen @ 2025-07-01  8:25 UTC (permalink / raw)
  To: Chen Yu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Shakeel Butt, Michal Koutný
  Cc: Juri Lelli, Ben Segall, Mel Gorman, Valentin Schneider,
	Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Phil Auld, Tejun Heo, Daniel Jordan, Jann Horn, Pedro Falcato,
	Aubrey Li, Tim Chen, Huang, Ying, linux-doc, linux-kernel,
	linux-mm

Hi Chenyu,

Thanks for the patch. See my comments below.

On 6/25/25 03:23, Chen Yu wrote:
> [Problem Statement]
> Currently, NUMA balancing is configured system-wide.
> However, in some production environments, different
> cgroups may have varying requirements for NUMA balancing.
> Some cgroups are CPU-intensive, while others are
> memory-intensive. Some do not benefit from NUMA balancing
> due to the overhead associated with VMA scanning, while
> others prefer NUMA balancing as it helps improve memory
> locality. In this case, system-wide NUMA balancing is
> usually disabled to avoid causing regressions.
> 
> [Proposal]
> Introduce a per-cgroup interface to enable NUMA balancing
> for specific cgroups. This interface is associated with
> the CPU subsystem, which does not support threaded subtrees,
> and close to CPU bandwidth control. The system administrator
> needs to set the NUMA balancing mode to
> NUMA_BALANCING_CGROUP=4 to enable this feature. When the
> system is in NUMA_BALANCING_CGROUP mode, NUMA balancing
> for all cgroups is disabled by default. After the
> administrator enables this feature for a specific cgroup,
> NUMA balancing for that cgroup is enabled.
> 
> A simple example to show how to use per-cgroup NUMA balancing:
> 
> Step1
> //switch on per cgroup Numa balancing.
> //All cgroup's NUMA balance is disabled by default.
> echo 4 > /proc/sys/kernel/numa_balancing
> 
> Step2
> //created a cgroup named mytest, enable its NUMA balancing
> echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
> 
> [Benchmark]
> Tested on Xeon Sapphire Rapids, with 4 Numa nodes. Created
> a cgroup mytest and launched autonumabench NUMA01_THREADLOCAL.
> Each test runs 6 cycles.
> 
> baseline:
> 29360.56user 16280.68system 3:33.29elapsed
> 29788.41user 16060.31system 3:34.38elapsed
> 28307.51user 17043.45system 3:33.03elapsed
> 29552.49user 16307.65system 3:34.20elapsed
> 29847.41user 15966.15system 3:34.65elapsed
> 29111.10user 16532.78system 3:33.19elapsed
> 
> per cgroup NUMA balance:
> 7589.78user 16494.90system 1:53.18elapsed
> 7795.54user 16537.65system 1:54.11elapsed
> 8295.66user 16391.21system 1:55.98elapsed
> 7836.34user 17312.31system 1:55.71elapsed
> 7773.26user 16856.19system 1:54.08elapsed
> 7534.43user 17604.58system 1:55.01elapsed
> 
> The user time has been reduced to 33% of the
> original, and the elapsed time has dropped to
> 45% of the original (lower values are better).
> 
> cat /sys/fs/cgroup/mytest/memory.stat | grep numa
> numa_pages_migrated 10238503
> numa_pte_updates 24378124
> numa_hint_faults 16921590
> numa_task_migrated 253
> numa_task_swapped 4
> 
> to-do:
> Per-cgroup NUMA balancing should consider the
> hierarchy of the cgroup. Initially, NUMA balancing
> is disabled for the root cgroup. A cgroup that has
> NUMA balancing enabled should have all its parents
> enabled. For example, suppose cgroup A is the parent
> of cgroup B; if A.numa_load_balance is 0, even if
> B.numa_load_balance is 1, NUMA balancing for B is
> disabled. This idea is derived from
> commit e39925734909 ("mm/memcontrol: respect
> zswap.writeback setting from parent cgroup too").
> 
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> v1->v2:
> 
> Add documentation in Documentation/admin-guide/sysctl/kernel.rst.
> 
> Add comments in tg_numa_balance_enabled() to explain that
> the newly introduced NUMA balancing mode is naturally
> exclusive of existing NUMA balancing modes. (Tim)
> ---
>  Documentation/admin-guide/sysctl/kernel.rst |  6 ++++
>  include/linux/sched/sysctl.h                |  1 +
>  kernel/sched/core.c                         | 31 +++++++++++++++++++++
>  kernel/sched/fair.c                         | 28 +++++++++++++++++++
>  kernel/sched/sched.h                        |  3 ++
>  mm/mprotect.c                               |  5 ++--
>  6 files changed, 72 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index dd49a89a62d3..ff88d1153c19 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -709,6 +709,7 @@ The value to set can be the result of ORing the following:
>  0 NUMA_BALANCING_DISABLED
>  1 NUMA_BALANCING_NORMAL
>  2 NUMA_BALANCING_MEMORY_TIERING
> +4 NUMA_BALANCING_CGROUP
>  = =================================
>  
>  Or NUMA_BALANCING_NORMAL to optimize page placement among different
> @@ -729,6 +730,11 @@ different types of memory (represented as different NUMA nodes) to
>  place the hot pages in the fast memory.  This is implemented based on
>  unmapping and page fault too.
>  
> +Or NUMA_BALANCING_CGROUP to enable the per cgroup NUMA balancing.
> +This new behavior can be opted-in/out on a per-cgroup basis via a new
> +cgroup CPU subsystem file named numa_load_balance. By default, per
> +cgroup NUMA balancing for each cgroup is enabled.
> +
>  numa_balancing_promote_rate_limit_MBps
>  ======================================
>  
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 5a64582b086b..1e4d5a9ddb26 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
>  #define NUMA_BALANCING_DISABLED		0x0
>  #define NUMA_BALANCING_NORMAL		0x1
>  #define NUMA_BALANCING_MEMORY_TIERING	0x2
> +#define NUMA_BALANCING_CGROUP		0x4
>  
>  #ifdef CONFIG_NUMA_BALANCING
>  extern int sysctl_numa_balancing_mode;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8988d38d46a3..8e9aa59193df 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10078,6 +10078,30 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
>  }
>  #endif
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
> +				  struct cftype *cftype, u64 enable)
> +{
> +	struct task_group *tg;
> +	bool was_enabled;
> +
> +	tg = css_tg(css);
> +	was_enabled = READ_ONCE(tg->nlb_enabled);
> +	if (was_enabled == enable)
> +		return 0;
> +
> +	WRITE_ONCE(tg->nlb_enabled, enable);
> +
> +	return 0;
> +}
> +
> +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
> +				 struct cftype *cft)
> +{
> +	return READ_ONCE(css_tg(css)->nlb_enabled);
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  static struct cftype cpu_files[] = {
>  #ifdef CONFIG_GROUP_SCHED_WEIGHT
>  	{
> @@ -10126,6 +10150,13 @@ static struct cftype cpu_files[] = {
>  		.seq_show = cpu_uclamp_max_show,
>  		.write = cpu_uclamp_max_write,
>  	},
> +#endif
> +#ifdef CONFIG_NUMA_BALANCING
> +	{
> +		.name = "numa_load_balance",
> +		.read_u64 = numa_balance_read_u64,
> +		.write_u64 = numa_balance_write_u64,
> +	},
>  #endif
>  	{ }	/* terminate */
>  };
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb..dcdee3bf9960 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3161,6 +3161,29 @@ void task_numa_free(struct task_struct *p, bool final)
>  	}
>  }
>  
> +/*
> + * Return true if the NUMA balance is allowed for
> + * the task in a task group.
> + */
> +static bool tg_numa_balance_enabled(struct task_struct *p)
> +{
> +	/*
> +	 * The min/max of sysctl_numa_balancing_mode ranges
> +	 * from SYSCTL_ONE to SYSCTL_FOUR, so it is safe to
> +	 * only check NUMA_BALANCING_CGROUP because it is
> +	 * impossible to have both NUMA_BALANCING_CGROUP and
> +	 * NUMA_BALANCING_NORMAL/NUMA_BALANCING_MEMORY_TIERING
> +	 * set.
> +	 */
> +	struct task_group *tg = task_group(p);
> +
> +	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
> +	    !READ_ONCE(tg->nlb_enabled))
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * Got a PROT_NONE fault for a page on @node.
>   */
> @@ -3189,6 +3212,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>  	     !cpupid_valid(last_cpupid)))
>  		return;
>  
> +	if (!tg_numa_balance_enabled(p))
> +		return;
> +

I think this one may be redundant when you already have it in task_numa_work().  Without the
scanning, there won't be any hinting page faults on that task, so neither do_numa_page() nor
do_huge_pmd_numa_page() will be called. Though it's a minor issue if tg_numa_balance_enabled(p)
is fast. 

Overall this is good. But more generally, I am thinking something finer-grained, like per-task
numab control with numab tunnables at task-level (if possible), that will be so much more useful
at least for us. There are use cases for controlling numa balancing at task level as applications
tuned for NUMA (that don't want numab mess with their tsk/mem placements) such as databases can
be in the same cgroup with other untuned applications, or not in a cgroup at all. Right now we
have to turn off numab globally but that's not really optimal in a lot of cases. I do understand
your use cases for per-cgroup control, but I wonder if there is a way to nicely combine them.
Per-task control should cover per-cgroup control functionality-wise, but it's an inconvenient
interface as one has to set for all tasks of the same cgroup, I haven't thought too hard about
it yet, just want to bring it out and see if we can work out something together.

Thanks,
Libo

>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
>  		int size = sizeof(*p->numa_faults) *
> @@ -3330,6 +3356,8 @@ static void task_numa_work(struct callback_head *work)
>  	if (p->flags & PF_EXITING)
>  		return;
>  
> +	if (!tg_numa_balance_enabled(p))
> +		return;
>  	/*
>  	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>  	 * no page can be migrated.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 475bb5998295..4b0dc656688e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -485,6 +485,9 @@ struct task_group {
>  	/* Effective clamp values used for a task group */
>  	struct uclamp_se	uclamp[UCLAMP_CNT];
>  #endif
> +#ifdef CONFIG_NUMA_BALANCING
> +	u64			nlb_enabled;
> +#endif
>  
>  };
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88608d0dc2c2..c288ffb92bfc 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				toptier = node_is_toptier(nid);
>  
>  				/*
> -				 * Skip scanning top tier node if normal numa
> +				 * Skip scanning top tier node if normal and cgroup numa
>  				 * balancing is disabled
>  				 */
> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> +				if (!(sysctl_numa_balancing_mode &
> +				    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
>  				    toptier)
>  					continue;
>  				if (folio_use_access_time(folio))



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
  2025-07-01  8:25 ` Libo Chen
@ 2025-07-01 16:36   ` Chen, Yu C
  2025-07-01 17:58     ` Libo Chen
  0 siblings, 1 reply; 7+ messages in thread
From: Chen, Yu C @ 2025-07-01 16:36 UTC (permalink / raw)
  To: Libo Chen
  Cc: Juri Lelli, Ben Segall, Mel Gorman, Valentin Schneider,
	Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Phil Auld, Tejun Heo, Daniel Jordan, Jann Horn, Pedro Falcato,
	Aubrey Li, Tim Chen, Huang, Ying, linux-doc, linux-kernel,
	linux-mm, Jonathan Corbet, Ingo Molnar, Michal Koutný,
	Peter Zijlstra, Shakeel Butt

Hi Libo,

On 7/1/2025 4:25 PM, Libo Chen wrote:
> Hi Chenyu,
> 
> Thanks for the patch. See my comments below.
> 
> On 6/25/25 03:23, Chen Yu wrote:
>> [Problem Statement]
>> Currently, NUMA balancing is configured system-wide.
>> However, in some production environments, different
>> cgroups may have varying requirements for NUMA balancing.
>> Some cgroups are CPU-intensive, while others are
>> memory-intensive. Some do not benefit from NUMA balancing
>> due to the overhead associated with VMA scanning, while
>> others prefer NUMA balancing as it helps improve memory
>> locality. In this case, system-wide NUMA balancing is
>> usually disabled to avoid causing regressions.
>>
>> [Proposal]
>> Introduce a per-cgroup interface to enable NUMA balancing
>> for specific cgroups. This interface is associated with
>> the CPU subsystem, which does not support threaded subtrees,
>> and close to CPU bandwidth control. The system administrator
>> needs to set the NUMA balancing mode to
>> NUMA_BALANCING_CGROUP=4 to enable this feature. When the
>> system is in NUMA_BALANCING_CGROUP mode, NUMA balancing
>> for all cgroups is disabled by default. After the
>> administrator enables this feature for a specific cgroup,
>> NUMA balancing for that cgroup is enabled.
>>
>> A simple example to show how to use per-cgroup NUMA balancing:
>>
>> Step1
>> //switch on per cgroup Numa balancing.
>> //All cgroup's NUMA balance is disabled by default.
>> echo 4 > /proc/sys/kernel/numa_balancing
>>
>> Step2
>> //created a cgroup named mytest, enable its NUMA balancing
>> echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
>>
>> [Benchmark]
>> Tested on Xeon Sapphire Rapids, with 4 Numa nodes. Created
>> a cgroup mytest and launched autonumabench NUMA01_THREADLOCAL.
>> Each test runs 6 cycles.
>>
>> baseline:
>> 29360.56user 16280.68system 3:33.29elapsed
>> 29788.41user 16060.31system 3:34.38elapsed
>> 28307.51user 17043.45system 3:33.03elapsed
>> 29552.49user 16307.65system 3:34.20elapsed
>> 29847.41user 15966.15system 3:34.65elapsed
>> 29111.10user 16532.78system 3:33.19elapsed
>>
>> per cgroup NUMA balance:
>> 7589.78user 16494.90system 1:53.18elapsed
>> 7795.54user 16537.65system 1:54.11elapsed
>> 8295.66user 16391.21system 1:55.98elapsed
>> 7836.34user 17312.31system 1:55.71elapsed
>> 7773.26user 16856.19system 1:54.08elapsed
>> 7534.43user 17604.58system 1:55.01elapsed
>>
>> The user time has been reduced to 33% of the
>> original, and the elapsed time has dropped to
>> 45% of the original (lower values are better).
>>
>> cat /sys/fs/cgroup/mytest/memory.stat | grep numa
>> numa_pages_migrated 10238503
>> numa_pte_updates 24378124
>> numa_hint_faults 16921590
>> numa_task_migrated 253
>> numa_task_swapped 4
>>
>> to-do:
>> Per-cgroup NUMA balancing should consider the
>> hierarchy of the cgroup. Initially, NUMA balancing
>> is disabled for the root cgroup. A cgroup that has
>> NUMA balancing enabled should have all its parents
>> enabled. For example, suppose cgroup A is the parent
>> of cgroup B; if A.numa_load_balance is 0, even if
>> B.numa_load_balance is 1, NUMA balancing for B is
>> disabled. This idea is derived from
>> commit e39925734909 ("mm/memcontrol: respect
>> zswap.writeback setting from parent cgroup too").
>>
>> Suggested-by: Tim Chen <tim.c.chen@intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>> v1->v2:
>>
>> Add documentation in Documentation/admin-guide/sysctl/kernel.rst.
>>
>> Add comments in tg_numa_balance_enabled() to explain that
>> the newly introduced NUMA balancing mode is naturally
>> exclusive of existing NUMA balancing modes. (Tim)
>> ---
>>   Documentation/admin-guide/sysctl/kernel.rst |  6 ++++
>>   include/linux/sched/sysctl.h                |  1 +
>>   kernel/sched/core.c                         | 31 +++++++++++++++++++++
>>   kernel/sched/fair.c                         | 28 +++++++++++++++++++
>>   kernel/sched/sched.h                        |  3 ++
>>   mm/mprotect.c                               |  5 ++--
>>   6 files changed, 72 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>> index dd49a89a62d3..ff88d1153c19 100644
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -709,6 +709,7 @@ The value to set can be the result of ORing the following:
>>   0 NUMA_BALANCING_DISABLED
>>   1 NUMA_BALANCING_NORMAL
>>   2 NUMA_BALANCING_MEMORY_TIERING
>> +4 NUMA_BALANCING_CGROUP
>>   = =================================
>>   
>>   Or NUMA_BALANCING_NORMAL to optimize page placement among different
>> @@ -729,6 +730,11 @@ different types of memory (represented as different NUMA nodes) to
>>   place the hot pages in the fast memory.  This is implemented based on
>>   unmapping and page fault too.
>>   
>> +Or NUMA_BALANCING_CGROUP to enable the per cgroup NUMA balancing.
>> +This new behavior can be opted-in/out on a per-cgroup basis via a new
>> +cgroup CPU subsystem file named numa_load_balance. By default, per
>> +cgroup NUMA balancing for each cgroup is enabled.
>> +
>>   numa_balancing_promote_rate_limit_MBps
>>   ======================================
>>   
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index 5a64582b086b..1e4d5a9ddb26 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
>>   #define NUMA_BALANCING_DISABLED		0x0
>>   #define NUMA_BALANCING_NORMAL		0x1
>>   #define NUMA_BALANCING_MEMORY_TIERING	0x2
>> +#define NUMA_BALANCING_CGROUP		0x4
>>   
>>   #ifdef CONFIG_NUMA_BALANCING
>>   extern int sysctl_numa_balancing_mode;
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8988d38d46a3..8e9aa59193df 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -10078,6 +10078,30 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
>>   }
>>   #endif
>>   
>> +#ifdef CONFIG_NUMA_BALANCING
>> +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
>> +				  struct cftype *cftype, u64 enable)
>> +{
>> +	struct task_group *tg;
>> +	bool was_enabled;
>> +
>> +	tg = css_tg(css);
>> +	was_enabled = READ_ONCE(tg->nlb_enabled);
>> +	if (was_enabled == enable)
>> +		return 0;
>> +
>> +	WRITE_ONCE(tg->nlb_enabled, enable);
>> +
>> +	return 0;
>> +}
>> +
>> +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
>> +				 struct cftype *cft)
>> +{
>> +	return READ_ONCE(css_tg(css)->nlb_enabled);
>> +}
>> +#endif /* CONFIG_NUMA_BALANCING */
>> +
>>   static struct cftype cpu_files[] = {
>>   #ifdef CONFIG_GROUP_SCHED_WEIGHT
>>   	{
>> @@ -10126,6 +10150,13 @@ static struct cftype cpu_files[] = {
>>   		.seq_show = cpu_uclamp_max_show,
>>   		.write = cpu_uclamp_max_write,
>>   	},
>> +#endif
>> +#ifdef CONFIG_NUMA_BALANCING
>> +	{
>> +		.name = "numa_load_balance",
>> +		.read_u64 = numa_balance_read_u64,
>> +		.write_u64 = numa_balance_write_u64,
>> +	},
>>   #endif
>>   	{ }	/* terminate */
>>   };
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a14da5396fb..dcdee3bf9960 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3161,6 +3161,29 @@ void task_numa_free(struct task_struct *p, bool final)
>>   	}
>>   }
>>   
>> +/*
>> + * Return true if the NUMA balance is allowed for
>> + * the task in a task group.
>> + */
>> +static bool tg_numa_balance_enabled(struct task_struct *p)
>> +{
>> +	/*
>> +	 * The min/max of sysctl_numa_balancing_mode ranges
>> +	 * from SYSCTL_ONE to SYSCTL_FOUR, so it is safe to
>> +	 * only check NUMA_BALANCING_CGROUP because it is
>> +	 * impossible to have both NUMA_BALANCING_CGROUP and
>> +	 * NUMA_BALANCING_NORMAL/NUMA_BALANCING_MEMORY_TIERING
>> +	 * set.
>> +	 */
>> +	struct task_group *tg = task_group(p);
>> +
>> +	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
>> +	    !READ_ONCE(tg->nlb_enabled))
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>>   /*
>>    * Got a PROT_NONE fault for a page on @node.
>>    */
>> @@ -3189,6 +3212,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>   	     !cpupid_valid(last_cpupid)))
>>   		return;
>>   
>> +	if (!tg_numa_balance_enabled(p))
>> +		return;
>> +
> 
> I think this one may be redundant when you already have it in task_numa_work().  Without the
> scanning, there won't be any hinting page faults on that task, so neither do_numa_page() nor
> do_huge_pmd_numa_page() will be called. Though it's a minor issue if tg_numa_balance_enabled(p)
> is fast.
> 

Previously I was thinking of the following sequence:
1. the NUMA balancing is enabled and task_numa_work() is invoked,
    pages are scanned and PROT_NONE is set.
2. cgroup NUMA balancing is disabled by the user
3. do_numa_page() is triggered and PROT_NONE is cleared.
    We don't want to do further task migration and
    task_numa_fault() bails out.(page migration is still
    allowed as we mainly want to control the behavior of
    the task)

> Overall this is good. But more generally, I am thinking something finer-grained, like per-task
> numab control with numab tunnables at task-level (if possible), that will be so much more useful
> at least for us. There are use cases for controlling numa balancing at task level as applications
> tuned for NUMA (that don't want numab mess with their tsk/mem placements) such as databases can
> be in the same cgroup with other untuned applications, or not in a cgroup at all. Right now we
> have to turn off numab globally but that's not really optimal in a lot of cases. I do understand
> your use cases for per-cgroup control, but I wonder if there is a way to nicely combine them.
> Per-task control should cover per-cgroup control functionality-wise, but it's an inconvenient
> interface as one has to set for all tasks of the same cgroup, 

OK. Michal has also suggested using the per-task interface
(prctl()/sched_setattr()) for NUMA balancing instead of per-cgroup
control. In theory, I think it is doable. In a cloud environment,
users can set the attribute (enable NUMA balancing) for the first
process of a cgroup, and later child processes will inherit this
attribute. But yes, when the admin decides to change this attribute,
each process of the cgroup has to be iterated.

> I haven't thought too hard about
> it yet, just want to bring it out and see if we can work out something together.
> 

Sure, let me have a try on this per-task version and we can
discuss/co-work on that.

thanks,
Chenyu

> Thanks,
> Libo
> 
>>   	/* Allocate buffer to track faults on a per-node basis */
>>   	if (unlikely(!p->numa_faults)) {
>>   		int size = sizeof(*p->numa_faults) *
>> @@ -3330,6 +3356,8 @@ static void task_numa_work(struct callback_head *work)
>>   	if (p->flags & PF_EXITING)
>>   		return;
>>   
>> +	if (!tg_numa_balance_enabled(p))
>> +		return;
>>   	/*
>>   	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>>   	 * no page can be migrated.
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 475bb5998295..4b0dc656688e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -485,6 +485,9 @@ struct task_group {
>>   	/* Effective clamp values used for a task group */
>>   	struct uclamp_se	uclamp[UCLAMP_CNT];
>>   #endif
>> +#ifdef CONFIG_NUMA_BALANCING
>> +	u64			nlb_enabled;
>> +#endif
>>   
>>   };
>>   
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88608d0dc2c2..c288ffb92bfc 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				toptier = node_is_toptier(nid);
>>   
>>   				/*
>> -				 * Skip scanning top tier node if normal numa
>> +				 * Skip scanning top tier node if normal and cgroup numa
>>   				 * balancing is disabled
>>   				 */
>> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>> +				if (!(sysctl_numa_balancing_mode &
>> +				    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
>>   				    toptier)
>>   					continue;
>>   				if (folio_use_access_time(folio))
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
  2025-07-01 16:36   ` Chen, Yu C
@ 2025-07-01 17:58     ` Libo Chen
  2025-07-02 11:25       ` Chen, Yu C
  0 siblings, 1 reply; 7+ messages in thread
From: Libo Chen @ 2025-07-01 17:58 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Ben Segall, Mel Gorman, Valentin Schneider,
	Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Phil Auld, Tejun Heo, Daniel Jordan, Jann Horn, Pedro Falcato,
	Aubrey Li, Tim Chen, Huang, Ying, linux-doc, linux-kernel,
	linux-mm, Jonathan Corbet, Ingo Molnar, Michal Koutný,
	Peter Zijlstra, Shakeel Butt

Hi Chenyu

On 7/1/25 09:36, Chen, Yu C wrote:
> Hi Libo,
> 
> On 7/1/2025 4:25 PM, Libo Chen wrote:
>> Hi Chenyu,
>>
>> Thanks for the patch. See my comments below.
>>
>> On 6/25/25 03:23, Chen Yu wrote:
>>> [Problem Statement]
>>> Currently, NUMA balancing is configured system-wide.
>>> However, in some production environments, different
>>> cgroups may have varying requirements for NUMA balancing.
>>> Some cgroups are CPU-intensive, while others are
>>> memory-intensive. Some do not benefit from NUMA balancing
>>> due to the overhead associated with VMA scanning, while
>>> others prefer NUMA balancing as it helps improve memory
>>> locality. In this case, system-wide NUMA balancing is
>>> usually disabled to avoid causing regressions.
>>>
>>> [Proposal]
>>> Introduce a per-cgroup interface to enable NUMA balancing
>>> for specific cgroups. This interface is associated with
>>> the CPU subsystem, which does not support threaded subtrees,
>>> and close to CPU bandwidth control. The system administrator
>>> needs to set the NUMA balancing mode to
>>> NUMA_BALANCING_CGROUP=4 to enable this feature. When the
>>> system is in NUMA_BALANCING_CGROUP mode, NUMA balancing
>>> for all cgroups is disabled by default. After the
>>> administrator enables this feature for a specific cgroup,
>>> NUMA balancing for that cgroup is enabled.
>>>
>>> A simple example to show how to use per-cgroup NUMA balancing:
>>>
>>> Step1
>>> //switch on per cgroup Numa balancing.
>>> //All cgroup's NUMA balance is disabled by default.
>>> echo 4 > /proc/sys/kernel/numa_balancing
>>>
>>> Step2
>>> //created a cgroup named mytest, enable its NUMA balancing
>>> echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
>>>
>>> [Benchmark]
>>> Tested on Xeon Sapphire Rapids, with 4 Numa nodes. Created
>>> a cgroup mytest and launched autonumabench NUMA01_THREADLOCAL.
>>> Each test runs 6 cycles.
>>>
>>> baseline:
>>> 29360.56user 16280.68system 3:33.29elapsed
>>> 29788.41user 16060.31system 3:34.38elapsed
>>> 28307.51user 17043.45system 3:33.03elapsed
>>> 29552.49user 16307.65system 3:34.20elapsed
>>> 29847.41user 15966.15system 3:34.65elapsed
>>> 29111.10user 16532.78system 3:33.19elapsed
>>>
>>> per cgroup NUMA balance:
>>> 7589.78user 16494.90system 1:53.18elapsed
>>> 7795.54user 16537.65system 1:54.11elapsed
>>> 8295.66user 16391.21system 1:55.98elapsed
>>> 7836.34user 17312.31system 1:55.71elapsed
>>> 7773.26user 16856.19system 1:54.08elapsed
>>> 7534.43user 17604.58system 1:55.01elapsed
>>>
>>> The user time has been reduced to 33% of the
>>> original, and the elapsed time has dropped to
>>> 45% of the original (lower values are better).
>>>
>>> cat /sys/fs/cgroup/mytest/memory.stat | grep numa
>>> numa_pages_migrated 10238503
>>> numa_pte_updates 24378124
>>> numa_hint_faults 16921590
>>> numa_task_migrated 253
>>> numa_task_swapped 4
>>>
>>> to-do:
>>> Per-cgroup NUMA balancing should consider the
>>> hierarchy of the cgroup. Initially, NUMA balancing
>>> is disabled for the root cgroup. A cgroup that has
>>> NUMA balancing enabled should have all its parents
>>> enabled. For example, suppose cgroup A is the parent
>>> of cgroup B; if A.numa_load_balance is 0, even if
>>> B.numa_load_balance is 1, NUMA balancing for B is
>>> disabled. This idea is derived from
>>> commit e39925734909 ("mm/memcontrol: respect
>>> zswap.writeback setting from parent cgroup too").
>>>
>>> Suggested-by: Tim Chen <tim.c.chen@intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>> v1->v2:
>>>
>>> Add documentation in Documentation/admin-guide/sysctl/kernel.rst.
>>>
>>> Add comments in tg_numa_balance_enabled() to explain that
>>> the newly introduced NUMA balancing mode is naturally
>>> exclusive of existing NUMA balancing modes. (Tim)
>>> ---
>>>   Documentation/admin-guide/sysctl/kernel.rst |  6 ++++
>>>   include/linux/sched/sysctl.h                |  1 +
>>>   kernel/sched/core.c                         | 31 +++++++++++++++++++++
>>>   kernel/sched/fair.c                         | 28 +++++++++++++++++++
>>>   kernel/sched/sched.h                        |  3 ++
>>>   mm/mprotect.c                               |  5 ++--
>>>   6 files changed, 72 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>>> index dd49a89a62d3..ff88d1153c19 100644
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -709,6 +709,7 @@ The value to set can be the result of ORing the following:
>>>   0 NUMA_BALANCING_DISABLED
>>>   1 NUMA_BALANCING_NORMAL
>>>   2 NUMA_BALANCING_MEMORY_TIERING
>>> +4 NUMA_BALANCING_CGROUP
>>>   = =================================
>>>     Or NUMA_BALANCING_NORMAL to optimize page placement among different
>>> @@ -729,6 +730,11 @@ different types of memory (represented as different NUMA nodes) to
>>>   place the hot pages in the fast memory.  This is implemented based on
>>>   unmapping and page fault too.
>>>   +Or NUMA_BALANCING_CGROUP to enable the per cgroup NUMA balancing.
>>> +This new behavior can be opted-in/out on a per-cgroup basis via a new
>>> +cgroup CPU subsystem file named numa_load_balance. By default, per
>>> +cgroup NUMA balancing for each cgroup is enabled.
>>> +
>>>   numa_balancing_promote_rate_limit_MBps
>>>   ======================================
>>>   diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>>> index 5a64582b086b..1e4d5a9ddb26 100644
>>> --- a/include/linux/sched/sysctl.h
>>> +++ b/include/linux/sched/sysctl.h
>>> @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
>>>   #define NUMA_BALANCING_DISABLED        0x0
>>>   #define NUMA_BALANCING_NORMAL        0x1
>>>   #define NUMA_BALANCING_MEMORY_TIERING    0x2
>>> +#define NUMA_BALANCING_CGROUP        0x4
>>>     #ifdef CONFIG_NUMA_BALANCING
>>>   extern int sysctl_numa_balancing_mode;
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 8988d38d46a3..8e9aa59193df 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -10078,6 +10078,30 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
>>>   }
>>>   #endif
>>>   +#ifdef CONFIG_NUMA_BALANCING
>>> +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
>>> +                  struct cftype *cftype, u64 enable)
>>> +{
>>> +    struct task_group *tg;
>>> +    bool was_enabled;
>>> +
>>> +    tg = css_tg(css);
>>> +    was_enabled = READ_ONCE(tg->nlb_enabled);
>>> +    if (was_enabled == enable)
>>> +        return 0;
>>> +
>>> +    WRITE_ONCE(tg->nlb_enabled, enable);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
>>> +                 struct cftype *cft)
>>> +{
>>> +    return READ_ONCE(css_tg(css)->nlb_enabled);
>>> +}
>>> +#endif /* CONFIG_NUMA_BALANCING */
>>> +
>>>   static struct cftype cpu_files[] = {
>>>   #ifdef CONFIG_GROUP_SCHED_WEIGHT
>>>       {
>>> @@ -10126,6 +10150,13 @@ static struct cftype cpu_files[] = {
>>>           .seq_show = cpu_uclamp_max_show,
>>>           .write = cpu_uclamp_max_write,
>>>       },
>>> +#endif
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +    {
>>> +        .name = "numa_load_balance",
>>> +        .read_u64 = numa_balance_read_u64,
>>> +        .write_u64 = numa_balance_write_u64,
>>> +    },
>>>   #endif
>>>       { }    /* terminate */
>>>   };
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7a14da5396fb..dcdee3bf9960 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3161,6 +3161,29 @@ void task_numa_free(struct task_struct *p, bool final)
>>>       }
>>>   }
>>>   +/*
>>> + * Return true if the NUMA balance is allowed for
>>> + * the task in a task group.
>>> + */
>>> +static bool tg_numa_balance_enabled(struct task_struct *p)
>>> +{
>>> +    /*
>>> +     * The min/max of sysctl_numa_balancing_mode ranges
>>> +     * from SYSCTL_ONE to SYSCTL_FOUR, so it is safe to
>>> +     * only check NUMA_BALANCING_CGROUP because it is
>>> +     * impossible to have both NUMA_BALANCING_CGROUP and
>>> +     * NUMA_BALANCING_NORMAL/NUMA_BALANCING_MEMORY_TIERING
>>> +     * set.
>>> +     */
>>> +    struct task_group *tg = task_group(p);
>>> +
>>> +    if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
>>> +        !READ_ONCE(tg->nlb_enabled))
>>> +        return false;
>>> +
>>> +    return true;
>>> +}
>>> +
>>>   /*
>>>    * Got a PROT_NONE fault for a page on @node.
>>>    */
>>> @@ -3189,6 +3212,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>>            !cpupid_valid(last_cpupid)))
>>>           return;
>>>   +    if (!tg_numa_balance_enabled(p))
>>> +        return;
>>> +
>>
>> I think this one may be redundant when you already have it in task_numa_work().  Without the
>> scanning, there won't be any hinting page faults on that task, so neither do_numa_page() nor
>> do_huge_pmd_numa_page() will be called. Though it's a minor issue if tg_numa_balance_enabled(p)
>> is fast.
>>
> 
> Previously I was thinking of the following sequence:
> 1. the NUMA balancing is enabled and task_numa_work() is invoked,
>    pages are scanned and PROT_NONE is set.
> 2. cgroup NUMA balancing is disabled by the user
> 3. do_numa_page() is triggered and PROT_NONE is cleared.
>    We don't want to do further task migration and
>    task_numa_fault() bails out.(page migration is still
>    allowed as we mainly want to control the behavior of
>    the task)
> 

Ah right, that makes sense. Does that fall under unlikely()? The timing window seems to be
quite small to me. 

>> Overall this is good. But more generally, I am thinking something finer-grained, like per-task
>> numab control with numab tunnables at task-level (if possible), that will be so much more useful
>> at least for us. There are use cases for controlling numa balancing at task level as applications
>> tuned for NUMA (that don't want numab mess with their tsk/mem placements) such as databases can
>> be in the same cgroup with other untuned applications, or not in a cgroup at all. Right now we
>> have to turn off numab globally but that's not really optimal in a lot of cases. I do understand
>> your use cases for per-cgroup control, but I wonder if there is a way to nicely combine them.
>> Per-task control should cover per-cgroup control functionality-wise, but it's an inconvenient
>> interface as one has to set for all tasks of the same cgroup, 
> 
> OK. Michal has also suggested using the per-task interface
> (prctl()/sched_setattr()) for NUMA balancing instead of per-cgroup
> control. In theory, I think it is doable. In a cloud environment,
> users can set the attribute (enable NUMA balancing) for the first
> process of a cgroup, and later child processes will inherit this
> attribute. But yes, when the admin decides to change this attribute,
> each process of the cgroup has to be iterated.
> 

An on/off button for cgroup could be added to libcgroup or similar for sysadmins, but still
inside the kernel, each task's attribute will be set one by one. What I am more worried about
is cgroups that frequently have tasks join and leave, it will become very annoying to set/unset
at each entry and exit. 

>> I haven't thought too hard about
>> it yet, just want to bring it out and see if we can work out something together.
>>
> 
> Sure, let me have a try on this per-task version and we can
> discuss/co-work on that.
> 

Cool~ Looking forward to it.

> thanks,
> Chenyu
> 
>> Thanks,
>> Libo
>>
>>>       /* Allocate buffer to track faults on a per-node basis */
>>>       if (unlikely(!p->numa_faults)) {
>>>           int size = sizeof(*p->numa_faults) *
>>> @@ -3330,6 +3356,8 @@ static void task_numa_work(struct callback_head *work)
>>>       if (p->flags & PF_EXITING)
>>>           return;
>>>   +    if (!tg_numa_balance_enabled(p))
>>> +        return;
>>>       /*
>>>        * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>>>        * no page can be migrated.
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 475bb5998295..4b0dc656688e 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -485,6 +485,9 @@ struct task_group {
>>>       /* Effective clamp values used for a task group */
>>>       struct uclamp_se    uclamp[UCLAMP_CNT];
>>>   #endif
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +    u64            nlb_enabled;
>>> +#endif
>>>     };
>>>   diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 88608d0dc2c2..c288ffb92bfc 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                   toptier = node_is_toptier(nid);
>>>                     /*
>>> -                 * Skip scanning top tier node if normal numa
>>> +                 * Skip scanning top tier node if normal and cgroup numa
>>>                    * balancing is disabled
>>>                    */
>>> -                if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>>> +                if (!(sysctl_numa_balancing_mode &
>>> +                    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
>>>                       toptier)
>>>                       continue;
>>>                   if (folio_use_access_time(folio))
>>



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
  2025-07-01 17:58     ` Libo Chen
@ 2025-07-02 11:25       ` Chen, Yu C
  0 siblings, 0 replies; 7+ messages in thread
From: Chen, Yu C @ 2025-07-02 11:25 UTC (permalink / raw)
  To: Libo Chen
  Cc: Juri Lelli, Ben Segall, Mel Gorman, Valentin Schneider,
	Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Phil Auld, Tejun Heo, Daniel Jordan, Jann Horn, Pedro Falcato,
	Aubrey Li, Tim Chen, Huang, Ying, linux-doc, linux-kernel,
	linux-mm, Jonathan Corbet, Ingo Molnar, Michal Koutný,
	Peter Zijlstra, Shakeel Butt

On 7/2/2025 1:58 AM, Libo Chen wrote:
> Hi Chenyu
> 
> On 7/1/25 09:36, Chen, Yu C wrote:
>>>> +
>>>>    /*
>>>>     * Got a PROT_NONE fault for a page on @node.
>>>>     */
>>>> @@ -3189,6 +3212,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>>>             !cpupid_valid(last_cpupid)))
>>>>            return;
>>>>    +    if (!tg_numa_balance_enabled(p))
>>>> +        return;
>>>> +
>>>
>>> I think this one may be redundant when you already have it in task_numa_work().  Without the
>>> scanning, there won't be any hinting page faults on that task, so neither do_numa_page() nor
>>> do_huge_pmd_numa_page() will be called. Though it's a minor issue if tg_numa_balance_enabled(p)
>>> is fast.
>>>
>>
>> Previously I was thinking of the following sequence:
>> 1. the NUMA balancing is enabled and task_numa_work() is invoked,
>>     pages are scanned and PROT_NONE is set.
>> 2. cgroup NUMA balancing is disabled by the user
>> 3. do_numa_page() is triggered and PROT_NONE is cleared.
>>     We don't want to do further task migration and
>>     task_numa_fault() bails out.(page migration is still
>>     allowed as we mainly want to control the behavior of
>>     the task)
>>
> 
> Ah right, that makes sense. Does that fall under unlikely()? The timing window seems to be
> quite small to me.
> 

Yup, and I guess that depends on how often the task access the
"invalid" address - it might take quite some time after the
page been set invalid and then the task access it to trigger
a page fault.

Thanks,
Chenyu



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-07-02 11:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-25 10:23 [PATCH v2] sched/numa: Introduce per cgroup numa balance control Chen Yu
2025-06-25 12:19 ` Michal Koutný
2025-06-26  9:07   ` Chen, Yu C
2025-07-01  8:25 ` Libo Chen
2025-07-01 16:36   ` Chen, Yu C
2025-07-01 17:58     ` Libo Chen
2025-07-02 11:25       ` Chen, Yu C

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).