Re: [PATCH 2/5] workqueue: update wq_numa_possible_cpumask

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Lai Jiangshan <laijs@cn.fujitsu.com>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: <linux-kernel@vger.kernel.org>, Tejun Heo <tj@kernel.org>,
	Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>,
	"Gu, Zheng" <guz.fnst@cn.fujitsu.com>,
	tangchen <tangchen@cn.fujitsu.com>,
	Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH 2/5] workqueue: update wq_numa_possible_cpumask
Date: Thu, 18 Dec 2014 10:22:27 +0800	[thread overview]
Message-ID: <54923A63.3010701@cn.fujitsu.com> (raw)
In-Reply-To: <1418379595-6281-3-git-send-email-laijs@cn.fujitsu.com>

On 12/12/2014 06:19 PM, Lai Jiangshan wrote:
> Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
> between CPU and node is changed. This was the last scene:
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
> 
> Yasuaki Ishimatsu investigated that it happened in the following situation:
> 
> 1) System Node/CPU before offline/online:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 2 | 30-44, 90-104
> 	node 3 | 45-59, 105-119
> 
> 2) A system-board (contains node2 and node3) is offline:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 
> 3) A new system-board is online, two new node IDs are allocated
>    for the two node of the SB, but the old CPU IDs are allocated for
>    the SB, here the NUMA mapping between node and CPU is changed.
>    (the node of CPU#30 is changed from node#2 to node#4, for example)
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 4 | 30-59
> 	node 5 | 90-119
> 
> 4) now, the NUMA mapping is changed, but wq_numa_possible_cpumask
>    which is the convenient NUMA mapping cache in workqueue.c is still outdated.
>    thus pool->node calculated by get_unbound_pool() is incorrect.
> 
> 5) when the create_worker() is called with the incorrect offlined
>     pool->node, it is failed and the pool can't make any progress.
> 
> To fix this bug, we need to fixup the wq_numa_possible_cpumask and the
> pool->node, the fix is so complicated that we split it into two patches,
> this patch fix the wq_numa_possible_cpumask and the next fix the pool->node.
> 
> To fix the wq_numa_possible_cpumask, we only update the cpumasks of
> the orig_node and the new_node of the onlining @cpu.  we con't touch
> other unrelated nodes since the wq subsystem haven't seen the changed.
> 
> After this fix the new pool->node of new pools are correct.
> and existing wq's affinity is fixed up by wq_update_unbound_numa()
> after wq_update_numa_mapping().
> 
> Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
> Cc: tangchen <tangchen@cn.fujitsu.com>
> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> ---
>  kernel/workqueue.c |   42 +++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 41 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index a6fd2b8..4c88b61 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -266,7 +266,7 @@ struct workqueue_struct {
>  static struct kmem_cache *pwq_cache;
>  
>  static cpumask_var_t *wq_numa_possible_cpumask;
> -					/* possible CPUs of each node */
> +					/* PL: possible CPUs of each node */
>  
>  static bool wq_disable_numa;
>  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
> @@ -3949,6 +3949,44 @@ out_unlock:
>  	put_pwq_unlocked(old_pwq);
>  }
>  
> +static void wq_update_numa_mapping(int cpu)
> +{
> +	int node, orig_node = NUMA_NO_NODE, new_node = cpu_to_node(cpu);
> +
> +	lockdep_assert_held(&wq_pool_mutex);
> +
> +	if (!wq_numa_enabled)
> +		return;
> +
> +	/* the node of onlining CPU is not NUMA_NO_NODE */
> +	if (WARN_ON(new_node == NUMA_NO_NODE))
> +		return;
> +
> +	/* test whether the NUMA node mapping is changed. */
> +	if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[new_node]))
> +		return;
> +
> +	/* find the origin node */
> +	for_each_node(node) {
> +		if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])) {
> +			orig_node = node;
> +			break;
> +		}
> +	}
> +
> +	/* there may be multi mappings changed, re-initial. */
> +	cpumask_clear(wq_numa_possible_cpumask[new_node]);
> +	if (orig_node != NUMA_NO_NODE)
> +		cpumask_clear(wq_numa_possible_cpumask[orig_node]);
> +	for_each_possible_cpu(cpu) {
> +		node = cpu_to_node(node);

Hi, Yasuaki Ishimatsu

The bug is here.  It should be
		node = cpu_to_node(cpu);

> +		if (node == new_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
> +		else if (orig_node != NUMA_NO_NODE && node == orig_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[orig_node]);
> +	}
> +}
> +
>  static int alloc_and_link_pwqs(struct workqueue_struct *wq)
>  {
>  	bool highpri = wq->flags & WQ_HIGHPRI;
> @@ -4584,6 +4622,8 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>  			mutex_unlock(&pool->attach_mutex);
>  		}
>  
> +		wq_update_numa_mapping(cpu);
> +
>  		/* update NUMA affinity of unbound workqueues */
>  		list_for_each_entry(wq, &workqueues, list)
>  			wq_update_unbound_numa(wq, cpu, true);

next prev parent reply	other threads:[~2014-12-18  2:18 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
2014-12-12 10:19 ` [PATCH 1/5] workqueue: fix memory leak in wq_numa_init() Lai Jiangshan
2014-12-12 17:12   ` Tejun Heo
2014-12-15  5:25     ` Lai Jiangshan
2014-12-12 10:19 ` [PATCH 2/5] workqueue: update wq_numa_possible_cpumask Lai Jiangshan
2014-12-12 17:18   ` Tejun Heo
2014-12-15  2:02     ` Lai Jiangshan
2014-12-25 20:16       ` Tejun Heo
2014-12-18  2:22   ` Lai Jiangshan [this message]
2014-12-12 10:19 ` [PATCH 3/5] workqueue: fixup existing pool->node Lai Jiangshan
2014-12-12 17:25   ` Tejun Heo
2014-12-15  1:23     ` Lai Jiangshan
2014-12-25 20:14       ` Tejun Heo
2015-01-13  7:08         ` Lai Jiangshan
2015-01-13 15:24           ` Tejun Heo
2014-12-12 10:19 ` [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU Lai Jiangshan
2014-12-12 17:27   ` Tejun Heo
2014-12-15  1:28     ` Lai Jiangshan
2014-12-25 20:17       ` Tejun Heo
2014-12-12 10:19 ` [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails Lai Jiangshan
2014-12-12 16:05   ` KOSAKI Motohiro
2014-12-12 17:29     ` KOSAKI Motohiro
2014-12-12 17:29   ` Tejun Heo
2014-12-12 17:13 ` [PATCH 0/5] workqueue: fix bug when numa mapping is changed Yasuaki Ishimatsu
2014-12-15  1:34   ` Lai Jiangshan
2014-12-18  1:50     ` Yasuaki Ishimatsu
2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
2014-12-13 16:30   ` [PATCH 1/4] workqueue: add a hook for node hotplug Kamezawa Hiroyuki
2014-12-13 16:33   ` [PATCH 2/4] workqueue: add warning if pool->node is offline Kamezawa Hiroyuki
2014-12-13 16:35   ` [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline Kamezawa Hiroyuki
2014-12-15  2:06     ` Lai Jiangshan
2014-12-15  2:06       ` Kamezawa Hiroyuki
2014-12-13 16:38   ` [PATCH 4/4] workqueue: handle change in cpu-node relationship Kamezawa Hiroyuki
2014-12-15  2:12     ` Lai Jiangshan
2014-12-15  2:20       ` Kamezawa Hiroyuki
2014-12-15  2:48         ` Lai Jiangshan
2014-12-15  2:55           ` Kamezawa Hiroyuki
2014-12-15  3:30             ` Lai Jiangshan
2014-12-15  3:34             ` Lai Jiangshan
2014-12-15  4:04               ` Kamezawa Hiroyuki
2014-12-15  5:19                 ` Lai Jiangshan
2014-12-15  5:33                   ` Kamezawa Hiroyuki
2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
2014-12-15 11:14     ` [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection Kamezawa Hiroyuki
2014-12-16  5:30       ` Lai Jiangshan
2014-12-16  7:32         ` Kamezawa Hiroyuki
2014-12-16  7:54           ` Lai Jiangshan
2014-12-15 11:16     ` [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline Kamezawa Hiroyuki
2014-12-16  5:32       ` Lai Jiangshan
2014-12-16  7:25         ` Kamezawa Hiroyuki
2014-12-15 11:18     ` [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up Kamezawa Hiroyuki
2014-12-16  7:49       ` Lai Jiangshan
2014-12-16  8:10         ` Kamezawa Hiroyuki
2014-12-16  8:18           ` Kamezawa Hiroyuki
2014-12-15 11:22     ` [PATCH 4/4] workqueue: Handle cpu-node affinity change at CPU_ONLINE Kamezawa Hiroyuki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54923A63.3010701@cn.fujitsu.com \
    --to=laijs@cn.fujitsu.com \
    --cc=guz.fnst@cn.fujitsu.com \
    --cc=isimatu.yasuaki@jp.fujitsu.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tangchen@cn.fujitsu.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.