* [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
@ 2025-04-15 1:35 Chris Hyser
2025-04-15 1:35 ` [PATCH 2/2] sched/numa: prctl to set/override " Chris Hyser
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Chris Hyser @ 2025-04-15 1:35 UTC (permalink / raw)
To: Chris Hyser, Peter Zijlstra, Mel Gorman, longman, linux-kernel
From: chris hyser <chris.hyser@oracle.com>
This patch allows directly setting and subsequent overriding of a task's
"Preferred Node Affinity" by setting the task's numa_preferred_nid and
relying on the existing NUMA balancing infrastructure.
NUMA balancing introduced the notion of tracking and using a task's
preferred memory node, both for migrating/consolidating the physical pages
accessed by a task and for assisting the scheduler in making NUMA-aware
placement and load balancing decisions.
The existing mechanism for determining this, Auto NUMA Balancing, relies
on periodic removal of virtual mappings for blocks of a task's address
space. The resulting faults can indicate a preference for an accessed
node.
This has two issues that this patch seeks to overcome:
- there is a trade-off between faulting overhead and the ability to detect
dynamic access patterns. In cases where the task or user understands the
NUMA sensitivities, this patch can enable the benefits of setting a
preferred node, either in conjunction with Auto NUMA Balancing's default
parameters or with the NUMA balancing parameters adjusted to reduce the
faulting rate (potentially to 0).
- memory pinned to nodes or to physical addresses (such as RDMA buffers)
cannot be migrated and has thus far been excluded from the scanning. Not
taking those faults, however, can prevent Auto NUMA Balancing from reliably
detecting a node preference, with the scheduler load balancer then
possibly operating on incorrect NUMA information.
The following results were from TPCC runs on an Oracle Database. The system
was a 2-node Intel machine with a database running on each node with local
memory allocations. No tasks or memory were pinned.
There are four scenarios of interest:
- Auto NUMA Balancing OFF.
base value
- Auto NUMA Balancing ON.
1.2% - ANB ON better than ANB OFF.
- Use the prctl(), ANB ON, parameters set to prevent faulting.
2.4% - prctl() better than ANB OFF.
1.2% - prctl() better than ANB ON.
- Use the prctl(), ANB parameters normal.
3.1% - prctl() and ANB ON better than ANB OFF.
1.9% - prctl() and ANB ON better than just ANB ON.
0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off.
In benchmarks pinning large regions of heavily accessed memory, the
advantages of the prctl() over Auto NUMA Balancing alone are significantly
higher.
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
---
include/linux/sched.h | 1 +
init/init_task.c | 1 +
kernel/sched/core.c | 5 ++++-
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 15 +++++++++++++--
5 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..373046c82b35 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1350,6 +1350,7 @@ struct task_struct {
short pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCING
+ int numa_preferred_nid_force;
int numa_scan_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..1921a87326db 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -184,6 +184,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.vtime.state = VTIME_SYS,
#endif
#ifdef CONFIG_NUMA_BALANCING
+ .numa_preferred_nid_force = NUMA_NO_NODE,
.numa_preferred_nid = NUMA_NO_NODE,
.numa_group = NULL,
.numa_faults = NULL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 79692f85643f..7d1532f35d15 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7980,7 +7980,10 @@ void sched_setnuma(struct task_struct *p, int nid)
if (running)
put_prev_task(rq, p);
- p->numa_preferred_nid = nid;
+ if (p->numa_preferred_nid_force != NUMA_NO_NODE)
+ p->numa_preferred_nid = p->numa_preferred_nid_force;
+ else
+ p->numa_preferred_nid = nid;
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..4cba21f5d24d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1154,6 +1154,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
P(mm->numa_scan_seq);
P(numa_pages_migrated);
+ P(numa_preferred_nid_force);
P(numa_preferred_nid);
P(total_numa_faults);
SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c19459c8042..79d3d0840fb2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,9 +2642,15 @@ static void numa_migrate_preferred(struct task_struct *p)
unsigned long interval = HZ;
/* This task has no NUMA fault statistics yet */
- if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults))
+ if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE))
return;
+ /* Execute rest of function if forced PNID */
+ if (p->numa_preferred_nid_force == NUMA_NO_NODE) {
+ if (unlikely(!p->numa_faults))
+ return;
+ }
+
/* Periodically retry migrating the task to the preferred node */
interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
p->numa_migrate_retry = jiffies + interval;
@@ -3578,6 +3584,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
/* New address space, reset the preferred nid */
if (!(clone_flags & CLONE_VM)) {
+ p->numa_preferred_nid_force = NUMA_NO_NODE;
p->numa_preferred_nid = NUMA_NO_NODE;
return;
}
@@ -9301,7 +9308,11 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
if (!static_branch_likely(&sched_numa_balancing))
return 0;
- if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ /* Execute rest of function if forced PNID */
+ if (p->numa_preferred_nid_force == NUMA_NO_NODE && !p->numa_faults)
+ return 0;
+
+ if (!(env->sd->flags & SD_NUMA))
return 0;
src_nid = cpu_to_node(env->src_cpu);
--
2.43.5
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/2] sched/numa: prctl to set/override task's numa_preferred_nid
2025-04-15 1:35 [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid Chris Hyser
@ 2025-04-15 1:35 ` Chris Hyser
2025-04-16 7:00 ` [PATCH 1/2] sched/numa: Add ability to override " Madadi Vineeth Reddy
2025-06-10 18:31 ` Dhaval Giani
2 siblings, 0 replies; 7+ messages in thread
From: Chris Hyser @ 2025-04-15 1:35 UTC (permalink / raw)
To: Chris Hyser, Peter Zijlstra, Mel Gorman, longman, linux-kernel
From: chris hyser <chris.hyser@oracle.com>
Adds a simple prctl() interface to set or read a task's
numa_preferred_nid. Once set, this value overrides any value set
by Auto NUMA Balancing.
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
---
.../scheduler/sched-preferred-node.rst | 67 +++++++++++++++++++
include/linux/sched.h | 9 +++
include/uapi/linux/prctl.h | 8 +++
kernel/sched/fair.c | 64 ++++++++++++++++++
kernel/sys.c | 5 ++
tools/include/uapi/linux/prctl.h | 6 ++
6 files changed, 159 insertions(+)
create mode 100644 Documentation/scheduler/sched-preferred-node.rst
diff --git a/Documentation/scheduler/sched-preferred-node.rst b/Documentation/scheduler/sched-preferred-node.rst
new file mode 100644
index 000000000000..753fd0b20993
--- /dev/null
+++ b/Documentation/scheduler/sched-preferred-node.rst
@@ -0,0 +1,67 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Prctl for Explicitly Setting Task's Preferred Node
+####################################################
+
+This feature is an addition to Auto NUMA Balancing. By default, Auto NUMA
+Balancing scans a task's address space, removing address translations such
+that subsequent faults can indicate the predominant node from which memory is
+being accessed. The task's numa_preferred_nid is then set to that node's ID.
+
+The numa_preferred_nid is used to both consolidate physical pages and assist the
+scheduler in making NUMA friendly load balancing decisions.
+
+While quite useful for some workloads, this has two issues that this prctl() can
+help solve:
+
+- There is a trade-off between faulting overhead and the ability to detect
+  dynamic access patterns. In cases where the task or user understands the
+  NUMA sensitivities, this prctl() can enable the benefits of setting a
+  preferred node, either in conjunction with Auto NUMA Balancing's default
+  parameters or with the NUMA balancing parameters adjusted to reduce the
+  faulting rate (potentially to 0).
+
+- Memory pinned to nodes or to physical addresses (such as RDMA buffers)
+  cannot be migrated and has thus far been excluded from the scanning. Not
+  taking those faults, however, can prevent Auto NUMA Balancing from reliably
+  detecting a node preference, with the scheduler load balancer then possibly
+  operating on incorrect NUMA information.
+
+
+Usage
+*******
+
+ Note: Auto NUMA Balancing must be enabled to get the effects.
+
+ #include <sys/prctl.h>
+
+ int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5);
+
+option:
+ ``PR_PREFERRED_NID``
+
+arg2:
+ Command for operation, must be one of:
+
+ - ``PR_PREFERRED_NID_GET`` -- get the forced preferred node ID for ``pid``.
+ - ``PR_PREFERRED_NID_SET`` -- set the forced preferred node ID for ``pid``.
+
+ Returns ERANGE for an illegal command.
+
+arg3:
+ ``pid`` of the task for which the operation applies. ``0`` implies current.
+
+ Returns ESRCH if ``pid`` is not found.
+
+arg4:
+ ``node_id`` for ``PR_PREFERRED_NID_SET``: either ``-1`` (no preference) or
+ a valid node ID, i.e. less than ``num_possible_nodes()``.
+
+ Returns EINVAL for an illegal node ID.
+
+arg5:
+ userspace pointer to an integer for returning the Node ID from
+ ``PR_PREFERRED_NID_GET``. Should be 0 for all other commands.
+
+The caller must have ptrace access mode ``PTRACE_MODE_READ_REALCREDS`` to the
+target process to get/set its preferred node ID; otherwise EPERM is returned.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 373046c82b35..8054fd37acdc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2261,6 +2261,15 @@ static inline void sched_core_fork(struct task_struct *p) { }
static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
#endif
+#ifdef CONFIG_NUMA_BALANCING
+/* Change a task's numa_preferred_nid */
+int prctl_chg_pref_nid(unsigned long cmd, pid_t pid, int nid,
+ unsigned long uaddr);
+#else
+static inline int prctl_chg_pref_nid(unsigned long cmd, pid_t pid, int nid,
+ unsigned long uaddr) { return -ERANGE; }
+#endif
+
extern void sched_set_stop_task(int cpu, struct task_struct *stop);
#ifdef CONFIG_MEM_ALLOC_PROFILING
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..e8a47777aeb2 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,12 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+/*
+ * Set or get a task's numa_preferred_nid
+ */
+#define PR_PREFERRED_NID 78
+# define PR_PREFERRED_NID_GET 0
+# define PR_PREFERRED_NID_SET 1
+# define PR_PREFERRED_NID_CMD_MAX 2
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 79d3d0840fb2..7afff9fa3922 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -49,6 +49,7 @@
#include <linux/ratelimit.h>
#include <linux/task_work.h>
#include <linux/rbtree_augmented.h>
+#include <linux/prctl.h>
#include <asm/switch_to.h>
@@ -3670,6 +3671,69 @@ static void update_scan_period(struct task_struct *p, int new_cpu)
p->numa_scan_period = task_scan_start(p);
}
+/*
+ * Enable setting task->numa_preferred_nid directly
+ */
+int prctl_chg_pref_nid(unsigned long cmd, pid_t pid, int nid,
+ unsigned long uaddr)
+{
+ struct task_struct *task;
+ struct rq_flags rf;
+ struct rq *rq;
+ int err = 0;
+
+ if (cmd >= PR_PREFERRED_NID_CMD_MAX)
+ return -ERANGE;
+
+ rcu_read_lock();
+ if (pid == 0) {
+ task = current;
+ } else {
+ task = find_task_by_vpid((pid_t)pid);
+ if (!task) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+ }
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ /*
+ * Check if this process has the right to modify the specified
+ * process. Use the regular "ptrace_may_access()" checks.
+ */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+ err = -EPERM;
+ goto out;
+ }
+
+ switch (cmd) {
+ case PR_PREFERRED_NID_GET:
+ if (uaddr & 0x3) {
+ err = -EINVAL;
+ goto out;
+ }
+ err = put_user(task->numa_preferred_nid_force,
+ (int __user *)uaddr);
+ break;
+
+ case PR_PREFERRED_NID_SET:
+ if (!(-1 <= nid && nid < num_possible_nodes())) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ rq = task_rq_lock(task, &rf);
+ task->numa_preferred_nid_force = nid;
+ task_rq_unlock(rq, task, &rf);
+ sched_setnuma(task, nid);
+ break;
+ }
+
+out:
+ put_task_struct(task);
+ return err;
+}
#else
static void task_tick_numa(struct rq *rq, struct task_struct *curr)
{
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..20629a3267b1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2746,6 +2746,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SCHED_CORE:
error = sched_core_share_pid(arg2, arg3, arg4, arg5);
break;
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+ case PR_PREFERRED_NID:
+ error = prctl_chg_pref_nid(arg2, arg3, arg4, arg5);
+ break;
#endif
case PR_SET_MDWE:
error = prctl_set_mdwe(arg2, arg3, arg4, arg5);
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 35791791a879..937160e3a77a 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -328,4 +328,10 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+/* Set or get a task's numa_preferred_nid
+ */
+#define PR_PREFERRED_NID 78
+# define PR_PREFERRED_NID_GET 0
+# define PR_PREFERRED_NID_SET 1
+# define PR_PREFERRED_NID_CMD_MAX 2
#endif /* _LINUX_PRCTL_H */
--
2.43.5
* Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
2025-04-15 1:35 [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid Chris Hyser
2025-04-15 1:35 ` [PATCH 2/2] sched/numa: prctl to set/override " Chris Hyser
@ 2025-04-16 7:00 ` Madadi Vineeth Reddy
2025-04-16 21:13 ` Chris Hyser
2025-06-10 18:31 ` Dhaval Giani
2 siblings, 1 reply; 7+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-16 7:00 UTC (permalink / raw)
To: Chris Hyser
Cc: Peter Zijlstra, Mel Gorman, longman, linux-kernel,
Madadi Vineeth Reddy
Hi Chris,
On 15/04/25 07:05, Chris Hyser wrote:
> From: chris hyser <chris.hyser@oracle.com>
>
[..snip..]
> The following results were from TPCC runs on an Oracle Database. The system
> was a 2-node Intel machine with a database running on each node with local
> memory allocations. No tasks or memory were pinned.
>
> There are four scenarios of interest:
>
> - Auto NUMA Balancing OFF.
> base value
>
> - Auto NUMA Balancing ON.
> 1.2% - ANB ON better than ANB OFF.
>
> - Use the prctl(), ANB ON, parameters set to prevent faulting.
> 2.4% - prctl() better then ANB OFF.
> 1.2% - prctl() better than ANB ON.
>
> - Use the prctl(), ANB parameters normal.
> 3.1% - prctl() and ANB ON better than ANB OFF.
> 1.9% - prctl() and ANB ON better than just ANB ON.
> 0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off
>
Are you using prctl() to set the preferred node id for all the tasks of your run?
If yes, then how does the `prctl() and ANB ON better than prctl() and ANB
ON/faulting off` case happen?
IIUC, when a preferred node is set in numa_preferred_nid_force, the original
numa_preferred_nid derived from page faults becomes a no-op, which should be
pure overhead.
Let me know if my understanding is correct. Also, can you tell me how to set
the ANB parameters to prevent faulting?
Thanks,
Madadi Vineeth Reddy
> In benchmarks pinning large regions of heavily accessed memory, the
> advantages of the prctl() over Auto NUMA Balancing alone is significantly
> higher.
>
> Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
[..snip..]
* Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
2025-04-16 7:00 ` [PATCH 1/2] sched/numa: Add ability to override " Madadi Vineeth Reddy
@ 2025-04-16 21:13 ` Chris Hyser
2025-04-17 3:39 ` Madadi Vineeth Reddy
0 siblings, 1 reply; 7+ messages in thread
From: Chris Hyser @ 2025-04-16 21:13 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Mel Gorman, longman@redhat.com,
linux-kernel@vger.kernel.org, Chris Hyser
> From: Madadi Vineeth Reddy
> Sent: Wednesday, April 16, 2025 3:00 AM
> To: Chris Hyser
> Cc: Peter Zijlstra; Mel Gorman; longman@redhat.com; linux-kernel@vger.kernel.org; Madadi Vineeth Reddy
> Subject: Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
>
>
> Hi Chris,
>
> On 15/04/25 07:05, Chris Hyser wrote:
>> From: chris hyser <chris.hyser@oracle.com>
>>
>
>[..snip..]
>
>> The following results were from TPCC runs on an Oracle Database. The system
>> was a 2-node Intel machine with a database running on each node with local
>> memory allocations. No tasks or memory were pinned.
>>
>> There are four scenarios of interest:
>>
>> - Auto NUMA Balancing OFF.
>> base value
>>
>> - Auto NUMA Balancing ON.
>> 1.2% - ANB ON better than ANB OFF.
>>
>> - Use the prctl(), ANB ON, parameters set to prevent faulting.
>> 2.4% - prctl() better then ANB OFF.
>> 1.2% - prctl() better than ANB ON.
>>
>> - Use the prctl(), ANB parameters normal.
>> 3.1% - prctl() and ANB ON better than ANB OFF.
>> 1.9% - prctl() and ANB ON better than just ANB ON.
>> 0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off
>>
>
> Are you using prctl() to set the preferred node id for all the tasks of your run?
> If yes, then how `prctl() and ANB ON better than prctl() and ANB ON/faulting off`
> case happens?
Not every task in the system (including some DB tasks) has a prctl()-set preferred node, as the expected preference is not always known. So that is part of it. However, the bigger influence, even with a prctl()-set preferred node, is that faulting drives physical page migration. You only want to migrate pages that the task is accessing. The fault tells you a page was accessed and which node it currently resides on, allowing a migration decision to be made.
> IIUC, when setting preferred node in numa_preferred_nid_force, the original
> numa_preferred_nid which is derived from page faults will be a nop which should
> be an overhead.
As mentioned above faulting drives physical page migration with the usual trade-off between faulting overhead and the benefits of consolidating pages on the same node.
One issue I've seen repeatably is that if you monitor a task (numa fields in /proc/<pid>/sched) some tasks keep changing their preferred node. This makes sense since spatial access locality can change over time, but you also see the migrated page count going up independent of which node is currently preferred. So on a two node system, there are pages being migrated back and forth (not necessarily the same ones). One possible effect of forcing the preferred node is that it isn't changing and migrated pages should be going the same way.
> Let me know if my understanding is correct. Also, can you tell how to set the
> parameters of ANB to prevent faulting.
Basically, I set the sampling periods to a large number of seconds. The sampling frequency, 1/period, is then effectively 0. Monitoring the task again, it should show no NUMA faults and no pages migrated.
kernel.numa_balancing : 1
scan_period_max_ms: 4294967295
scan_period_min_ms: 4294967295
scan_delay_ms: 4294967295
-chrish
* Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
2025-04-16 21:13 ` Chris Hyser
@ 2025-04-17 3:39 ` Madadi Vineeth Reddy
0 siblings, 0 replies; 7+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-17 3:39 UTC (permalink / raw)
To: Chris Hyser
Cc: Peter Zijlstra, Mel Gorman, longman@redhat.com,
linux-kernel@vger.kernel.org, Madadi Vineeth Reddy
On 17/04/25 02:43, Chris Hyser wrote:
>> From: Madadi Vineeth Reddy
>> Sent: Wednesday, April 16, 2025 3:00 AM
>> To: Chris Hyser
>> Cc: Peter Zijlstra; Mel Gorman; longman@redhat.com; linux-kernel@vger.kernel.org; Madadi Vineeth Reddy
>> Subject: Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
>>
>>
>> Hi Chris,
>>
>> On 15/04/25 07:05, Chris Hyser wrote:
>>> From: chris hyser <chris.hyser@oracle.com>
>>>
>>
>> [..snip..]
>>
>>> The following results were from TPCC runs on an Oracle Database. The system
>>> was a 2-node Intel machine with a database running on each node with local
>>> memory allocations. No tasks or memory were pinned.
>>>
>>> There are four scenarios of interest:
>>>
>>> - Auto NUMA Balancing OFF.
>>> base value
>>>
>>> - Auto NUMA Balancing ON.
>>> 1.2% - ANB ON better than ANB OFF.
>>>
>>> - Use the prctl(), ANB ON, parameters set to prevent faulting.
>>> 2.4% - prctl() better then ANB OFF.
>>> 1.2% - prctl() better than ANB ON.
>>>
>>> - Use the prctl(), ANB parameters normal.
>>> 3.1% - prctl() and ANB ON better than ANB OFF.
>>> 1.9% - prctl() and ANB ON better than just ANB ON.
>>> 0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off
>>>
>>
>> Are you using prctl() to set the preferred node id for all the tasks of your run?
>> If yes, then how `prctl() and ANB ON better than prctl() and ANB ON/faulting off`
>> case happens?
>
> Not every task in the system (including some DB tasks) has a prctl() set preferred node as the expected preference is not always known. So that is part of it, however the bigger influence even with a prctl() set preferred node, is that faulting drives physical page migration. You only want to migrate pages that the task is accessing. The fault tells you it was accessed and what node it is currently in allowing a migration decision to be made.
>
Yes, understood.
>> IIUC, when setting preferred node in numa_preferred_nid_force, the original
>> numa_preferred_nid which is derived from page faults will be a nop which should
>> be an overhead.
>
> As mentioned above faulting drives physical page migration with the usual trade-off between faulting overhead and the benefits of consolidating pages on the same node.
>
> One issue I've seen repeatably is that if you monitor a task (numa fields in /proc/<pid>/sched) some tasks keep changing their preferred node. This makes sense since spatial access locality can change over time, but you also see the migrated page count going up independent of which node is currently preferred. So on a two node system, there are pages being migrated back and forth (not necessarily the same ones). One possible effect of forcing the preferred node is that it isn't changing and migrated pages should be going the same way.
>
>> Let me know if my understanding is correct. Also, can you tell how to set the
>> parameters of ANB to prevent faulting.
>
> Basically, I set the sampling periods to a large number of seconds. Sampling frequency then is 1/large is ~0. Monitoring the task again, it should show no NUMA faults and no pages migrated.
>
> kernel.numa_balancing : 1
> scan_period_max_ms: 4294967295
> scan_period_min_ms: 4294967295
> scan_delay_ms: 4294967295
>
Got it. Thanks for the explanation.
Thanks,
Madadi Vineeth Reddy
> -chrish
* Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
2025-04-15 1:35 [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid Chris Hyser
2025-04-15 1:35 ` [PATCH 2/2] sched/numa: prctl to set/override " Chris Hyser
2025-04-16 7:00 ` [PATCH 1/2] sched/numa: Add ability to override " Madadi Vineeth Reddy
@ 2025-06-10 18:31 ` Dhaval Giani
2025-06-11 15:12 ` Chris Hyser
2 siblings, 1 reply; 7+ messages in thread
From: Dhaval Giani @ 2025-06-10 18:31 UTC (permalink / raw)
To: Chris Hyser; +Cc: Peter Zijlstra, Mel Gorman, longman, linux-kernel
On Mon, Apr 14, 2025 at 09:35:51PM -0400, Chris Hyser wrote:
> From: chris hyser <chris.hyser@oracle.com>
>
> This patch allows directly setting and subsequent overriding of a task's
> "Preferred Node Affinity" by setting the task's numa_preferred_nid and
> relying on the existing NUMA balancing infrastructure.
>
> NUMA balancing introduced the notion of tracking and using a task's
> preferred memory node for both migrating/consolidating the physical pages
> accessed by a task and to assist the scheduler in making NUMA aware
> placement and load balancing decisions.
>
> The existing mechanism for determining this, Auto NUMA Balancing, relies
> on periodic removal of virtual mappings for blocks of a task's address
> space. The resulting faults can indicate a preference for an accessed
> node.
>
> This has two issues that this patch seeks to overcome:
>
> - there is a trade-off between faulting overhead and the ability to detect
> dynamic access patterns. In cases where the task or user understand the
> NUMA sensitivities, this patch can enable the benefits of setting a
> preferred node used either in conjunction with Auto NUMA Balancing's
> default parameters or adjusting the NUMA balance parameters to reduce the
> faulting rate (potentially to 0).
>
> - memory pinned to nodes or to physical addresses such as RDMA cannot be
> migrated and have thus far been excluded from the scanning. Not taking
> those faults however can prevent Auto NUMA Balancing from reliably
> detecting a node preference with the scheduler load balancer then
> possibly operating with incorrect NUMA information.
>
> The following results were from TPCC runs on an Oracle Database. The system
> was a 2-node Intel machine with a database running on each node with local
> memory allocations. No tasks or memory were pinned.
>
> There are four scenarios of interest:
>
> - Auto NUMA Balancing OFF.
> base value
>
> - Auto NUMA Balancing ON.
> 1.2% - ANB ON better than ANB OFF.
>
> - Use the prctl(), ANB ON, parameters set to prevent faulting.
> 2.4% - prctl() better then ANB OFF.
> 1.2% - prctl() better than ANB ON.
>
> - Use the prctl(), ANB parameters normal.
> 3.1% - prctl() and ANB ON better than ANB OFF.
> 1.9% - prctl() and ANB ON better than just ANB ON.
> 0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off
>
> In benchmarks pinning large regions of heavily accessed memory, the
> advantages of the prctl() over Auto NUMA Balancing alone is significantly
> higher.
>
> Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
> ---
> include/linux/sched.h | 1 +
> init/init_task.c | 1 +
> kernel/sched/core.c | 5 ++++-
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 15 +++++++++++++--
> 5 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..373046c82b35 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1350,6 +1350,7 @@ struct task_struct {
> short pref_node_fork;
> #endif
> #ifdef CONFIG_NUMA_BALANCING
> + int numa_preferred_nid_force;
> int numa_scan_seq;
> unsigned int numa_scan_period;
> unsigned int numa_scan_period_max;
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..1921a87326db 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -184,6 +184,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
> .vtime.state = VTIME_SYS,
> #endif
> #ifdef CONFIG_NUMA_BALANCING
> + .numa_preferred_nid_force = NUMA_NO_NODE,
> .numa_preferred_nid = NUMA_NO_NODE,
> .numa_group = NULL,
> .numa_faults = NULL,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 79692f85643f..7d1532f35d15 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7980,7 +7980,10 @@ void sched_setnuma(struct task_struct *p, int nid)
> if (running)
> put_prev_task(rq, p);
>
> - p->numa_preferred_nid = nid;
> + if (p->numa_preferred_nid_force != NUMA_NO_NODE)
> + p->numa_preferred_nid = p->numa_preferred_nid_force;
> + else
> + p->numa_preferred_nid = nid;
>
> if (queued)
> enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 56ae54e0ce6a..4cba21f5d24d 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1154,6 +1154,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
> P(mm->numa_scan_seq);
>
> P(numa_pages_migrated);
> + P(numa_preferred_nid_force);
> P(numa_preferred_nid);
> P(total_numa_faults);
> SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c19459c8042..79d3d0840fb2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2642,9 +2642,15 @@ static void numa_migrate_preferred(struct task_struct *p)
> unsigned long interval = HZ;
>
> /* This task has no NUMA fault statistics yet */
> - if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults))
> + if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE))
> return;
>
> + /* Execute rest of function if forced PNID */
This comment had me confused, especially since you check for
NUMA_NO_NODE and exit right after. Move it to after please.
> + if (p->numa_preferred_nid_force == NUMA_NO_NODE) {
> + if (unlikely(!p->numa_faults))
> + return;
> + }
> +
I am in two minds with this one here -> why the unlikely for
p->numa_faults, and can we just make it one single if statement?
> /* Periodically retry migrating the task to the preferred node */
> interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> p->numa_migrate_retry = jiffies + interval;
> @@ -3578,6 +3584,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>
> /* New address space, reset the preferred nid */
> if (!(clone_flags & CLONE_VM)) {
> + p->numa_preferred_nid_force = NUMA_NO_NODE;
> p->numa_preferred_nid = NUMA_NO_NODE;
> return;
> }
> @@ -9301,7 +9308,11 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> if (!static_branch_likely(&sched_numa_balancing))
> return 0;
>
> - if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> + /* Execute rest of function if forced PNID */
Same here.
> + if (p->numa_preferred_nid_force == NUMA_NO_NODE && !p->numa_faults)
> + return 0;
> +
> + if (!(env->sd->flags & SD_NUMA))
> return 0;
>
> src_nid = cpu_to_node(env->src_cpu);
> --
> 2.43.5
>
>
* Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
2025-06-10 18:31 ` Dhaval Giani
@ 2025-06-11 15:12 ` Chris Hyser
0 siblings, 0 replies; 7+ messages in thread
From: Chris Hyser @ 2025-06-11 15:12 UTC (permalink / raw)
To: Dhaval Giani
Cc: Peter Zijlstra, Mel Gorman, longman@redhat.com,
linux-kernel@vger.kernel.org
From: Dhaval Giani <dhaval@gianis.ca>
Sent: Tuesday, June 10, 2025 2:31 PM
To: Chris Hyser
Cc: Peter Zijlstra; Mel Gorman; longman@redhat.com; linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
>On Mon, Apr 14, 2025 at 09:35:51PM -0400, Chris Hyser wrote:
>> From: chris hyser <chris.hyser@oracle.com>
>>
>> This patch allows directly setting and subsequent overriding of a task's
>> "Preferred Node Affinity" by setting the task's numa_preferred_nid and
>> relying on the existing NUMA balancing infrastructure.
>>
>> NUMA balancing introduced the notion of tracking and using a task's
>> preferred memory node for both migrating/consolidating the physical pages
>> accessed by a task and to assist the scheduler in making NUMA aware
>> placement and load balancing decisions.
>>
>> The existing mechanism for determining this, Auto NUMA Balancing, relies
>> on periodic removal of virtual mappings for blocks of a task's address
>> space. The resulting faults can indicate a preference for an accessed
>> node.
>>
>> This has two issues that this patch seeks to overcome:
>>
>> - there is a trade-off between faulting overhead and the ability to detect
>> dynamic access patterns. In cases where the task or user understands the
>> NUMA sensitivities, this patch enables the benefits of setting a
>> preferred node directly, either in conjunction with Auto NUMA Balancing's
>> default parameters or with the NUMA balance parameters adjusted to reduce
>> the faulting rate (potentially to 0).
>>
>> - memory pinned to nodes or to physical addresses, such as for RDMA, cannot
>> be migrated and has thus far been excluded from the scanning. Not taking
>> those faults, however, can prevent Auto NUMA Balancing from reliably
>> detecting a node preference, leaving the scheduler load balancer
>> operating with incorrect NUMA information.
>>
>> The following results were from TPCC runs on an Oracle Database. The system
>> was a 2-node Intel machine with a database running on each node with local
>> memory allocations. No tasks or memory were pinned.
>>
>> There are four scenarios of interest:
>>
>> - Auto NUMA Balancing OFF.
>> base value
>>
>> - Auto NUMA Balancing ON.
>> 1.2% - ANB ON better than ANB OFF.
>>
>> - Use the prctl(), ANB ON, parameters set to prevent faulting.
>> 2.4% - prctl() better than ANB OFF.
>> 1.2% - prctl() better than ANB ON.
>>
>> - Use the prctl(), ANB parameters normal.
>> 3.1% - prctl() and ANB ON better than ANB OFF.
>> 1.9% - prctl() and ANB ON better than just ANB ON.
>> 0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off
>>
>> In benchmarks pinning large regions of heavily accessed memory, the
>> advantage of the prctl() over Auto NUMA Balancing alone is significantly
>> higher.
>>
>> Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
>> ---
>> include/linux/sched.h | 1 +
>> init/init_task.c | 1 +
>> kernel/sched/core.c | 5 ++++-
>> kernel/sched/debug.c | 1 +
>> kernel/sched/fair.c | 15 +++++++++++++--
>> 5 files changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index f96ac1982893..373046c82b35 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1350,6 +1350,7 @@ struct task_struct {
>> short pref_node_fork;
>> #endif
>> #ifdef CONFIG_NUMA_BALANCING
>> + int numa_preferred_nid_force;
>> int numa_scan_seq;
>> unsigned int numa_scan_period;
>> unsigned int numa_scan_period_max;
>> diff --git a/init/init_task.c b/init/init_task.c
>> index e557f622bd90..1921a87326db 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -184,6 +184,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>> .vtime.state = VTIME_SYS,
>> #endif
>> #ifdef CONFIG_NUMA_BALANCING
>> + .numa_preferred_nid_force = NUMA_NO_NODE,
>> .numa_preferred_nid = NUMA_NO_NODE,
>> .numa_group = NULL,
>> .numa_faults = NULL,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 79692f85643f..7d1532f35d15 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7980,7 +7980,10 @@ void sched_setnuma(struct task_struct *p, int nid)
>> if (running)
>> put_prev_task(rq, p);
>>
>> - p->numa_preferred_nid = nid;
>> + if (p->numa_preferred_nid_force != NUMA_NO_NODE)
>> + p->numa_preferred_nid = p->numa_preferred_nid_force;
>> + else
>> + p->numa_preferred_nid = nid;
>>
>> if (queued)
>> enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 56ae54e0ce6a..4cba21f5d24d 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -1154,6 +1154,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
>> P(mm->numa_scan_seq);
>>
>> P(numa_pages_migrated);
>> + P(numa_preferred_nid_force);
>> P(numa_preferred_nid);
>> P(total_numa_faults);
>> SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0c19459c8042..79d3d0840fb2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2642,9 +2642,15 @@ static void numa_migrate_preferred(struct task_struct *p)
>> unsigned long interval = HZ;
>>
>> /* This task has no NUMA fault statistics yet */
>> - if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults))
>> + if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE))
>> return;
>>
>> + /* Execute rest of function if forced PNID */
>
>This comment had me confused, especially since you check for
>NUMA_NO_NODE and exit right after. Please move it below that check.
Sure.
> + if (p->numa_preferred_nid_force == NUMA_NO_NODE) {
> + if (unlikely(!p->numa_faults))
> + return;
> + }
> +
>
>I am in two minds with this one -> Why the unlikely for
>p->numa_faults, and can we just make it one single if statement?
A single if statement will work fine.
>> /* Periodically retry migrating the task to the preferred node */
>> interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
>> p->numa_migrate_retry = jiffies + interval;
>> @@ -3578,6 +3584,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>>
>> /* New address space, reset the preferred nid */
>> if (!(clone_flags & CLONE_VM)) {
>> + p->numa_preferred_nid_force = NUMA_NO_NODE;
>> p->numa_preferred_nid = NUMA_NO_NODE;
>> return;
>> }
>> @@ -9301,7 +9308,11 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
>> if (!static_branch_likely(&sched_numa_balancing))
>> return 0;
>>
>> - if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
>> + /* Execute rest of function if forced PNID */
>
>Same here.
Sure.
>> + if (p->numa_preferred_nid_force == NUMA_NO_NODE && !p->numa_faults)
>> + return 0;
>> +
>> + if (!(env->sd->flags & SD_NUMA))
>> return 0;
>>
>> src_nid = cpu_to_node(env->src_cpu);
>> --
>> 2.43.5
Thread overview: 7+ messages
2025-04-15 1:35 [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid Chris Hyser
2025-04-15 1:35 ` [PATCH 2/2] sched/numa: prctl to set/override " Chris Hyser
2025-04-16 7:00 ` [PATCH 1/2] sched/numa: Add ability to override " Madadi Vineeth Reddy
2025-04-16 21:13 ` Chris Hyser
2025-04-17 3:39 ` Madadi Vineeth Reddy
2025-06-10 18:31 ` Dhaval Giani
2025-06-11 15:12 ` Chris Hyser