All of lore.kernel.org
 help / color / mirror / Atom feed
From: Qian Cai <quic_qiancai@quicinc.com>
To: <linux-kernel@vger.kernel.org>
Cc: <linux-tip-commits@vger.kernel.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	 "Peter Zijlstra (Intel)" <peterz@infradead.org>,
	<x86@kernel.org>, <linux-arm-kernel@lists.infradead.org>
Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
Date: Tue, 1 Mar 2022 15:54:09 -0500	[thread overview]
Message-ID: <Yh6H8SPSqpjv1dl7@qian> (raw)
In-Reply-To: <164512421264.16921.689831789198253265.tip-bot2@tip-bot2>

On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Author:        Huang Ying <ying.huang@intel.com>
> AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
> 
> sched/numa: Avoid migrating task to CPU-less node
> 
> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
> nodes.  But if the number of the hint page faults on a PMEM node is
> the max for a task, The current NUMA balancing policy may try to place
> the task on the PMEM node instead of DRAM node.  This is unreasonable,
> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
> nodes are ignored when searching the migration target node for a task
> in this patch.
> 
> To test the patch, we run a workload that accesses more memory in PMEM
> node than memory in DRAM node.  Without the patch, the PMEM node will
> be chosen as preferred node in task_numa_placement().  While the DRAM
> node will be chosen instead with the patch.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com

Reverting this commit on the top of today's linux-next fixed a boot crash
on arm64 NUMA systems.

 Unable to handle kernel paging request at virtual address ffff7a6601694aec
 KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
 Mem abort info:
   ESR = 0x96000005
   EC = 0x25: DABT (current EL), IL = 32 bits
 mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000005
   CM = 0, WnR = 0
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
 [ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
 Internal error: Oops: 96000005 [#1] PREEMPT SMP
 Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
 CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
 pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : task_numa_placement
 lr : task_numa_placement
 sp : ffff800031047760
 x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
 x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000

 x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
 x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
 x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0

 x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
 x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
 x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
 x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
 x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
 Call trace:
  task_numa_placement
  arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
  (inlined by) node_state at include/linux/nodemask.h:416
  (inlined by) task_numa_placement at kernel/sched/fair.c:2439
  task_numa_fault
  do_numa_page
  handle_pte_fault
  __handle_mm_fault
  handle_mm_fault
  do_page_fault
  do_translation_fault
  do_mem_abort
  el0_da
  el0t_64_sync_handler
  el0t_64_sync
 Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
 PHYS_OFFSET: 0x80000000
 CPU features: 0x00,00042c0c,19801c82
 Memory Limit: none
 ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

> ---
>  kernel/sched/fair.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da3230b..11a72e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	ng = deref_curr_numa_group(p);
>  	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
> -		for_each_online_node(nid) {
> +		for_each_node_state(nid, N_CPU) {
>  			if (nid == env.src_nid || nid == p->numa_preferred_nid)
>  				continue;
>  
> @@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
>  	unsigned long faults, max_faults = 0;
>  	int nid, active_nodes = 0;
>  
> -	for_each_online_node(nid) {
> +	for_each_node_state(nid, N_CPU) {
>  		faults = group_faults_cpu(numa_group, nid);
>  		if (faults > max_faults)
>  			max_faults = faults;
>  	}
>  
> -	for_each_online_node(nid) {
> +	for_each_node_state(nid, N_CPU) {
>  		faults = group_faults_cpu(numa_group, nid);
>  		if (faults * ACTIVE_NODE_FRACTION > max_faults)
>  			active_nodes++;
> @@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
>  
>  		dist = sched_max_numa_distance;
>  
> -		for_each_online_node(node) {
> +		for_each_node_state(node, N_CPU) {
>  			score = group_weight(p, node, dist);
>  			if (score > max_score) {
>  				max_score = score;
> @@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
>  	 * inside the highest scoring group of nodes. The nodemask tricks
>  	 * keep the complexity of the search down.
>  	 */
> -	nodes = node_online_map;
> +	nodes = node_states[N_CPU];
>  	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
>  		unsigned long max_faults = 0;
>  		nodemask_t max_group = NODE_MASK_NONE;
> @@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
>  		}
>  	}
>  
> +	/* Cannot migrate task to CPU-less node */
> +	if (!node_state(max_nid, N_CPU)) {
> +		int near_nid = max_nid;
> +		int distance, near_distance = INT_MAX;
> +
> +		for_each_node_state(nid, N_CPU) {
> +			distance = node_distance(max_nid, nid);
> +			if (distance < near_distance) {
> +				near_nid = nid;
> +				near_distance = distance;
> +			}
> +		}
> +		max_nid = near_nid;
> +	}
> +
>  	if (ng) {
>  		numa_group_count_active_nodes(ng);
>  		spin_unlock_irq(group_lock);

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Qian Cai <quic_qiancai@quicinc.com>
To: <linux-kernel@vger.kernel.org>
Cc: <linux-tip-commits@vger.kernel.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>, <x86@kernel.org>,
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node
Date: Tue, 1 Mar 2022 15:54:09 -0500	[thread overview]
Message-ID: <Yh6H8SPSqpjv1dl7@qian> (raw)
In-Reply-To: <164512421264.16921.689831789198253265.tip-bot2@tip-bot2>

On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Author:        Huang Ying <ying.huang@intel.com>
> AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
> 
> sched/numa: Avoid migrating task to CPU-less node
> 
> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
> nodes.  But if the number of the hint page faults on a PMEM node is
> the max for a task, The current NUMA balancing policy may try to place
> the task on the PMEM node instead of DRAM node.  This is unreasonable,
> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
> nodes are ignored when searching the migration target node for a task
> in this patch.
> 
> To test the patch, we run a workload that accesses more memory in PMEM
> node than memory in DRAM node.  Without the patch, the PMEM node will
> be chosen as preferred node in task_numa_placement().  While the DRAM
> node will be chosen instead with the patch.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com

Reverting this commit on the top of today's linux-next fixed a boot crash
on arm64 NUMA systems.

 Unable to handle kernel paging request at virtual address ffff7a6601694aec
 KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
 Mem abort info:
   ESR = 0x96000005
   EC = 0x25: DABT (current EL), IL = 32 bits
 mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000005
   CM = 0, WnR = 0
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
 [ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
 Internal error: Oops: 96000005 [#1] PREEMPT SMP
 Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
 CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
 pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : task_numa_placement
 lr : task_numa_placement
 sp : ffff800031047760
 x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
 x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000

 x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
 x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
 x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0

 x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
 x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
 x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
 x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
 x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
 Call trace:
  task_numa_placement
  arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
  (inlined by) node_state at include/linux/nodemask.h:416
  (inlined by) task_numa_placement at kernel/sched/fair.c:2439
  task_numa_fault
  do_numa_page
  handle_pte_fault
  __handle_mm_fault
  handle_mm_fault
  do_page_fault
  do_translation_fault
  do_mem_abort
  el0_da
  el0t_64_sync_handler
  el0t_64_sync
 Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
 PHYS_OFFSET: 0x80000000
 CPU features: 0x00,00042c0c,19801c82
 Memory Limit: none
 ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

> ---
>  kernel/sched/fair.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da3230b..11a72e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	ng = deref_curr_numa_group(p);
>  	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
> -		for_each_online_node(nid) {
> +		for_each_node_state(nid, N_CPU) {
>  			if (nid == env.src_nid || nid == p->numa_preferred_nid)
>  				continue;
>  
> @@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
>  	unsigned long faults, max_faults = 0;
>  	int nid, active_nodes = 0;
>  
> -	for_each_online_node(nid) {
> +	for_each_node_state(nid, N_CPU) {
>  		faults = group_faults_cpu(numa_group, nid);
>  		if (faults > max_faults)
>  			max_faults = faults;
>  	}
>  
> -	for_each_online_node(nid) {
> +	for_each_node_state(nid, N_CPU) {
>  		faults = group_faults_cpu(numa_group, nid);
>  		if (faults * ACTIVE_NODE_FRACTION > max_faults)
>  			active_nodes++;
> @@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
>  
>  		dist = sched_max_numa_distance;
>  
> -		for_each_online_node(node) {
> +		for_each_node_state(node, N_CPU) {
>  			score = group_weight(p, node, dist);
>  			if (score > max_score) {
>  				max_score = score;
> @@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
>  	 * inside the highest scoring group of nodes. The nodemask tricks
>  	 * keep the complexity of the search down.
>  	 */
> -	nodes = node_online_map;
> +	nodes = node_states[N_CPU];
>  	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
>  		unsigned long max_faults = 0;
>  		nodemask_t max_group = NODE_MASK_NONE;
> @@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
>  		}
>  	}
>  
> +	/* Cannot migrate task to CPU-less node */
> +	if (!node_state(max_nid, N_CPU)) {
> +		int near_nid = max_nid;
> +		int distance, near_distance = INT_MAX;
> +
> +		for_each_node_state(nid, N_CPU) {
> +			distance = node_distance(max_nid, nid);
> +			if (distance < near_distance) {
> +				near_nid = nid;
> +				near_distance = distance;
> +			}
> +		}
> +		max_nid = near_nid;
> +	}
> +
>  	if (ng) {
>  		numa_group_count_active_nodes(ng);
>  		spin_unlock_irq(group_lock);

  reply	other threads:[~2022-03-01 20:55 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-14 12:15 [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Huang Ying
2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
2022-02-17 18:56   ` [tip: sched/core] sched/numa: Avoid migrating " tip-bot2 for Huang Ying
2022-03-01 20:54     ` Qian Cai [this message]
2022-03-01 20:54       ` Qian Cai
2022-03-02  0:59       ` Huang, Ying
2022-03-02  0:59         ` Huang, Ying
2022-03-02 12:37         ` Qian Cai
2022-03-02 12:37           ` Qian Cai
2022-03-07  5:51         ` Huang, Ying
2022-03-07  5:51           ` Huang, Ying
2022-03-07 13:53           ` Qian Cai
2022-03-07 13:53             ` Qian Cai
2022-03-08  0:40             ` Huang, Ying
2022-03-08  0:40               ` Huang, Ying
2022-03-08  2:05   ` [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate " Huang, Ying
2022-03-08  2:11     ` Huang, Ying
2022-03-16  0:37       ` Huang, Ying
2022-02-14 15:05 ` [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Peter Zijlstra
2022-02-15  1:29   ` Huang, Ying
2022-02-17 18:56 ` [tip: sched/core] sched/numa: Fix " tip-bot2 for Huang Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yh6H8SPSqpjv1dl7@qian \
    --to=quic_qiancai@quicinc.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=x86@kernel.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.