* [REGRESSION] funny sched_domain build failure during resume
From: Tejun Heo @ 2014-05-09 16:04 UTC
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, Johannes Weiner, Rafael J. Wysocki

Hello, guys.

So, after resuming from suspend, I found my build jobs cannot migrate
away from the CPU they started on and thus make use of only a single
core.  It turns out the scheduler failed to build sched domains due to
an order-3 allocation failure.

  systemd-sleep: page allocation failure: order:3, mode:0x104010
  CPU: 0 PID: 11648 Comm: systemd-sleep Not tainted 3.14.2-200.fc20.x86_64 #1
  Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
   0000000000000000 000000001bc36890 ffff88009c2d5958 ffffffff816eec92
   0000000000104010 ffff88009c2d59e8 ffffffff8117a32a 0000000000000000
   ffff88021efe6b00 0000000000000003 0000000000104010 ffff88009c2d59e8
  Call Trace:
   [<ffffffff816eec92>] dump_stack+0x45/0x56
   [<ffffffff8117a32a>] warn_alloc_failed+0xfa/0x170
   [<ffffffff8117e8f5>] __alloc_pages_nodemask+0x8e5/0xb00
   [<ffffffff811c0ce3>] alloc_pages_current+0xa3/0x170
   [<ffffffff811796a4>] __get_free_pages+0x14/0x50
   [<ffffffff8119823e>] kmalloc_order_trace+0x2e/0xa0
   [<ffffffff810c033f>] build_sched_domains+0x1ff/0xcc0
   [<ffffffff810c123e>] partition_sched_domains+0x35e/0x3d0
   [<ffffffff811168e7>] cpuset_update_active_cpus+0x17/0x40
   [<ffffffff810c130a>] cpuset_cpu_active+0x5a/0x70
   [<ffffffff816f9f4c>] notifier_call_chain+0x4c/0x70
   [<ffffffff810b2a1e>] __raw_notifier_call_chain+0xe/0x10
   [<ffffffff8108a413>] cpu_notify+0x23/0x50
   [<ffffffff8108a678>] _cpu_up+0x188/0x1a0
   [<ffffffff816e1783>] enable_nonboot_cpus+0x93/0xf0
   [<ffffffff810d9d45>] suspend_devices_and_enter+0x325/0x450
   [<ffffffff810d9fe8>] pm_suspend+0x178/0x260
   [<ffffffff810d8e79>] state_store+0x79/0xf0
   [<ffffffff81355bdf>] kobj_attr_store+0xf/0x20
   [<ffffffff81262c4d>] sysfs_kf_write+0x3d/0x50
   [<ffffffff81266b12>] kernfs_fop_write+0xd2/0x140
   [<ffffffff811e964a>] vfs_write+0xba/0x1e0
   [<ffffffff811ea0a5>] SyS_write+0x55/0xd0
   [<ffffffff816ff029>] system_call_fastpath+0x16/0x1b

The allocation is from alloc_rootdomain():

	struct root_domain *rd;

	rd = kmalloc(sizeof(*rd), GFP_KERNEL);

The thing is, the system has plenty of reclaimable memory and shouldn't
have any trouble satisfying one GFP_KERNEL order-3 allocation.  The
problem is that this happens during resume, before the devices have been
woken up, so pm_restrict_gfp_mask() punches GFP_IOFS out of all
allocation masks; the page allocator is left with just __GFP_WAIT to
work with and, with enough bad luck, fails as expected.

The problem has always been there but seems to have been exposed by the
addition of deadline scheduler support, which added cpudl to
root_domain, making it larger by around 20k bytes on my setup and thus
making an order-3 allocation necessary during CPU online.

It looks like the allocation is for a temp buffer, and there are also
percpu allocations going on.  Maybe just allocate the buffers at boot
and keep them around?

Kudos to Johannes for helping decipher the mm debug messages.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-09 16:15 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote:
> It looks like the allocation is for a temp buffer and there are also
> percpu allocations going on.  Maybe just allocate the buffers on boot
> and keep them around?

It's not a scratch buffer, but we could certainly try and keep it around
over suspend/resume.  I'll try and prod at that on Monday.
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-14 14:00 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote:
> Hello, guys.
>
> So, after resuming from suspend, I found my build jobs can not migrate
> away from the CPU it started on and thus just making use of single
> core.  It turns out the scheduler failed to build sched domains due to
> order-3 allocation failure.
>
> systemd-sleep: page allocation failure: order:3, mode:0x104010
> CPU: 0 PID: 11648 Comm: systemd-sleep Not tainted 3.14.2-200.fc20.x86_64 #1
> Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
>  0000000000000000 000000001bc36890 ffff88009c2d5958 ffffffff816eec92
>  0000000000104010 ffff88009c2d59e8 ffffffff8117a32a 0000000000000000
>  ffff88021efe6b00 0000000000000003 0000000000104010 ffff88009c2d59e8
> Call Trace:
>  [<ffffffff816eec92>] dump_stack+0x45/0x56
>  [<ffffffff8117a32a>] warn_alloc_failed+0xfa/0x170
>  [<ffffffff8117e8f5>] __alloc_pages_nodemask+0x8e5/0xb00
>  [<ffffffff811c0ce3>] alloc_pages_current+0xa3/0x170
>  [<ffffffff811796a4>] __get_free_pages+0x14/0x50
>  [<ffffffff8119823e>] kmalloc_order_trace+0x2e/0xa0
>  [<ffffffff810c033f>] build_sched_domains+0x1ff/0xcc0
>  [<ffffffff810c123e>] partition_sched_domains+0x35e/0x3d0
>  [<ffffffff811168e7>] cpuset_update_active_cpus+0x17/0x40
>  [<ffffffff810c130a>] cpuset_cpu_active+0x5a/0x70
>  [<ffffffff816f9f4c>] notifier_call_chain+0x4c/0x70
>  [<ffffffff810b2a1e>] __raw_notifier_call_chain+0xe/0x10
>  [<ffffffff8108a413>] cpu_notify+0x23/0x50
>  [<ffffffff8108a678>] _cpu_up+0x188/0x1a0
>  [<ffffffff816e1783>] enable_nonboot_cpus+0x93/0xf0
>  [<ffffffff810d9d45>] suspend_devices_and_enter+0x325/0x450
>  [<ffffffff810d9fe8>] pm_suspend+0x178/0x260
>  [<ffffffff810d8e79>] state_store+0x79/0xf0
>  [<ffffffff81355bdf>] kobj_attr_store+0xf/0x20
>  [<ffffffff81262c4d>] sysfs_kf_write+0x3d/0x50
>  [<ffffffff81266b12>] kernfs_fop_write+0xd2/0x140
>  [<ffffffff811e964a>] vfs_write+0xba/0x1e0
>  [<ffffffff811ea0a5>] SyS_write+0x55/0xd0
>  [<ffffffff816ff029>] system_call_fastpath+0x16/0x1b
>
> The allocation is from alloc_rootdomain().
>
> 	struct root_domain *rd;
>
> 	rd = kmalloc(sizeof(*rd), GFP_KERNEL);
>
> The thing is the system has plenty of reclaimable memory and shouldn't
> have any trouble satisfying one GFP_KERNEL order-3 allocation;
> however, the problem is that this is during resume and the devices
> haven't been woken up yet, so pm_restrict_gfp_mask() punches out
> GFP_IOFS from all allocation masks and the page allocator has just
> __GFP_WAIT to work with and, with enough bad luck, fails expectedly.
>
> The problem has always been there but seems to have been exposed by
> the addition of deadline scheduler support, which added cpudl to
> root_domain making it larger by around 20k bytes on my setup, making
> an order-3 allocation necessary during CPU online.
>
> It looks like the allocation is for a temp buffer and there are also
> percpu allocations going on.  Maybe just allocate the buffers on boot
> and keep them around?
>
> Kudos to Johannes for helping deciphering mm debug messages.

Does something like the below help any? I noticed those things (cpudl
and cpupri) had [NR_CPUS] arrays, which is always 'fun'.

The below is a mostly no thought involved conversion of cpudl which
boots; I'll also do cpupri and then actually stare at the algorithms to
see if I didn't make any obvious fails.

Juri?

---
 kernel/sched/cpudeadline.c | 29 +++++++++++++++++++----------
 kernel/sched/cpudeadline.h |  6 +++---
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index ab001b5d5048..c34ab09a790b 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -13,6 +13,7 @@
 
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/slab.h>
 #include "cpudeadline.h"
 
 static inline int parent(int i)
@@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
 
 static void cpudl_exchange(struct cpudl *cp, int a, int b)
 {
-	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
-
 	swap(cp->elements[a], cp->elements[b]);
-	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
@@ -140,7 +138,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 	WARN_ON(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
-	old_idx = cp->cpu_to_idx[cpu];
+	old_idx = cp->elements[cpu].idx;
 	if (!is_valid) {
 		/* remove item */
 		if (old_idx == IDX_INVALID) {
@@ -155,8 +153,8 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
 		cp->elements[old_idx].cpu = new_cpu;
 		cp->size--;
-		cp->cpu_to_idx[new_cpu] = old_idx;
-		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		cp->elements[new_cpu].idx = old_idx;
+		cp->elements[cpu].idx = IDX_INVALID;
 		while (old_idx > 0 && dl_time_before(
 				cp->elements[parent(old_idx)].dl,
 				cp->elements[old_idx].dl)) {
@@ -173,7 +171,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 		cp->size++;
 		cp->elements[cp->size - 1].dl = 0;
 		cp->elements[cp->size - 1].cpu = cpu;
-		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cp->elements[cpu].idx = cp->size - 1;
 		cpudl_change_key(cp, cp->size - 1, dl);
 		cpumask_clear_cpu(cpu, cp->free_cpus);
 	} else {
@@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
 	memset(cp, 0, sizeof(*cp));
 	raw_spin_lock_init(&cp->lock);
 	cp->size = 0;
-	for (i = 0; i < NR_CPUS; i++)
-		cp->cpu_to_idx[i] = IDX_INVALID;
-	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+
+	cp->elements = kcalloc(num_possible_cpus(),
+			       sizeof(struct cpudl_item),
+			       GFP_KERNEL);
+	if (!cp->elements)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
+		kfree(cp->elements);
 		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(i)
+		cp->elements[i].idx = IDX_INVALID;
+
 	cpumask_setall(cp->free_cpus);
 
 	return 0;
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
index a202789a412c..538c9796ad4a 100644
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -5,17 +5,17 @@
 
 #define IDX_INVALID -1
 
-struct array_item {
+struct cpudl_item {
 	u64 dl;
 	int cpu;
+	int idx;
 };
 
 struct cpudl {
 	raw_spinlock_t lock;
 	int size;
-	int cpu_to_idx[NR_CPUS];
-	struct array_item elements[NR_CPUS];
 	cpumask_var_t free_cpus;
+	struct cpudl_item *elements;
 };
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-14 14:05 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> The below is a mostly no thought involved conversion of cpudl which
                 ^^^^^^^^^^^^^^^^^^^
> boots, I'll also do cpupri and then actually stare at the algorithms to
> see if I didn't make any obvious fails.
>
> Juri?
>
> ---
> @@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
> 	memset(cp, 0, sizeof(*cp));
> 	raw_spin_lock_init(&cp->lock);
> 	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(num_possible_cpus(),
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
> 		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
> 	cpumask_setall(cp->free_cpus);
>
> 	return 0;

Also add:

---
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -220,4 +220,5 @@ int cpudl_init(struct cpudl *cp)
 void cpudl_cleanup(struct cpudl *cp)
 {
 	free_cpumask_var(cp->free_cpus);
+	kfree(cp->elements);
 }
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Tejun Heo @ 2014-05-14 17:02 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> Does something like the below help any? I noticed those things (cpudl
> and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
>
> The below is a mostly no thought involved conversion of cpudl which
> boots, I'll also do cpupri and then actually stare at the algorithms to
> see if I didn't make any obvious fails.

Yeah, it should avoid the large allocation on reasonably sized machines,
and I don't think 2k-CPU machines suspend regularly.  Prolly good / safe
enough for a -stable port?  It'd still be nice to avoid allocations
during online if possible, though, given that the operation happens
while mm is mostly crippled.

Thanks.

--
tejun
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-14 17:10 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
> On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> > Does something like the below help any? I noticed those things (cpudl
> > and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
> >
> > The below is a mostly no thought involved conversion of cpudl which
> > boots, I'll also do cpupri and then actually stare at the algorithms to
> > see if I didn't make any obvious fails.
>
> Yeah, should avoid large allocation on reasonably sized machines and I
> don't think 2k CPU machines suspend regularly.  Prolly good / safe
> enough for -stable port?

Yeah, it's certainly -stable material, esp. if this cures the immediate
problem.

> It'd be still nice to avoid allocations if
> possible during online tho given that the operation happens while mm
> is mostly crippled.

Yeah, I started looking at that, but it turned out to be slightly more
difficult than I had hoped (I got lost in the suspend code).  Also,
avoiding large-order allocations is good practice regardless.

So probably the easiest way to not free/alloc the entire sched_domain
thing is to just keep it around in its entirety over suspend/resume, as
I think the promise of suspend/resume is that you return to the status
quo.  But I'll stick it on the todo list after fixing this
use-after-free thing I've been trying to chase down.
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Tejun Heo @ 2014-05-14 22:36 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 07:10:51PM +0200, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
> > Yeah, should avoid large allocation on reasonably sized machines and I
> > don't think 2k CPU machines suspend regularly.  Prolly good / safe
> > enough for -stable port?
>
> Yeah, its certainly -stable material.  Esp. if this cures the immediate
> problem.

The patches are URL encoded.  I tried to reproduce the problem but
haven't succeeded; I'm quite confident about the analysis, though, so
verifying that the high-order allocation goes away should be enough.

I instrumented the allocator, and it looks like we also have other
sources of high-order allocations during resume, before the GFP_IOFS
restriction is lifted: some kthread creations (order 2, probably okay)
and, on my test setup, a series of order-3 allocations from e1000.

Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate buffers
across suspend/resume?  The driver ends up allocating multiple order-3
regions during resume, when mm doesn't have access to backing devices
and thus can't compact or reclaim, and it's not too unlikely for those
allocations to fail.

I wonder whether we need some generic solution to address the issue.
Unfortunately, I don't think it'd be possible to order device
initialization to bring up backing devices earlier; we don't really have
that kind of knowledge easily accessible.  Also, I don't think it's
realistic to require drivers to avoid high-order allocations during
resume.

Maybe we should pre-reclaim and set aside some amount of memory to be
used during resume?  It's a mushy solution but could be good enough.
Rafael, Johannes, what do you guys think?

Thanks.

--
tejun
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-15 8:57 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
> The patches are URL encoded.

*sigh*..  evo used to do that, and now I think mutt does it because I've
set up the whole GPG thing.  All my scripts already automagically decode
it, so I hadn't noticed.  I can give you some awk that undoes it if you
want?
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Johannes Weiner @ 2014-05-15 14:41 UTC
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
> The patches are URL encoded.  Tried to reproduce the problem but
> haven't succeeded but I'm quite confident about the analysis, so
> verifying that the high order allocation goes away should be enough.
>
> I instrumented the allocator and it looks like we also have other
> sources of high order allocation during resume before GFP_IOFS is
> cleared.  Some kthread creations (order 2, probably okay) and on my
> test setup a series of order 3 allocations from e1000.
>
> Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate
> buffers across suspend/resume?  The driver ends up allocating multiple
> order-3 regions during resume where mm doesn't have access to backing
> devices and thus can't compact or reclaim and it's not too unlikely
> for those allocations to fail.
>
> I wonder whether we need some generic solution to address the issue.
> Unfortunately, I don't think it'd be possible to order device
> initialization to bring up backing devices earlier.  We don't really
> have that kind of knowledge easily accessible.  Also, I don't think
> it's realistic to require drivers to avoid high order allocations
> during resume.
>
> Maybe we should pre-reclaim and set aside some amount of memory to be
> used during resume?  It's a mushy solution but could be good enough.
> Rafael, Johannes, what do you guys think?

Is it necessary that resume paths allocate at all?  Freeing at suspend
what you have to reallocate at resume is asking for trouble.  It's not
just higher-order allocations, either; even order-0 allocations are less
reliable without GFP_IOFS.  So I think this should be avoided as much as
possible.

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: page_alloc: warn about higher-order allocations during suspend

Higher-order allocations require compaction to work reliably, and
compaction requires the ability to do IO.  Unfortunately, backing
storage infrastructure is disabled during suspend & resume, and so
higher-order allocations can not be supported during that time.

Drivers should refrain from freeing and allocating data structures
during suspend and resume entirely, and fall back to order-0 pages if
strictly necessary.

Add an extra line of warning to the allocation failure dump when a
higher-order allocation fails while backing storage is suspended.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba2933c9c0..3acc12c0e093 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2122,6 +2122,16 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
 		current->comm, order, gfp_mask);
 
+	/*
+	 * Compaction doesn't work while backing storage is suspended
+	 * in the resume path.  Drivers should refrain from managing
+	 * kernel objects during suspend/resume, and leave this task
+	 * to init/exit as much as possible.
+	 */
+	if (order && pm_suspended_storage())
+		pr_warn("Higher-order allocations during resume "
+			"are unsupported\n");
+
 	dump_stack();
 	if (!should_suppress_show_mem())
 		show_mem(filter);
-- 
1.9.2
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-15 15:12 UTC
To: Johannes Weiner
Cc: Tejun Heo, Ingo Molnar, linux-kernel, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Thu, May 15, 2014 at 10:41:14AM -0400, Johannes Weiner wrote:
> Is it necessary that resume paths allocate at all?  Freeing at suspend
> what you have to reallocate at resume is asking for trouble.  It's not
> just higher order allocations, either, even order-0 allocations are
> less reliable without GFP_IOFS.  So I think this should be avoided as
> much as possible.

Well, in my case it's because suspend does an effective hot-unplug of
all CPUs in the system except CPU0, and if the CPU topology changes I
need to allocate new data structures and free the old ones.  The reverse
is true on resume: it hot-plugs all CPUs again, same story.
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Srivatsa S. Bhat @ 2014-05-16 11:47 UTC
To: Johannes Weiner
Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki, Juri Lelli, Bruce Allan, Linux PM mailing list

On 05/15/2014 08:11 PM, Johannes Weiner wrote:
> On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
>> The patches are URL encoded.  Tried to reproduce the problem but
>> haven't succeeded but I'm quite confident about the analysis, so
>> verifying that the high order allocation goes away should be enough.
>>
>> I instrumented the allocator and it looks like we also have other
>> sources of high order allocation during resume before GFP_IOFS is
>> cleared.  Some kthread creations (order 2, probably okay) and on my
>> test setup a series of order 3 allocations from e1000.
>>
>> Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate
>> buffers across suspend/resume?  The driver ends up allocating multiple
>> order-3 regions during resume where mm doesn't have access to backing
>> devices and thus can't compact or reclaim and it's not too unlikely
>> for those allocations to fail.
>>
>> I wonder whether we need some generic solution to address the issue.
>> Unfortunately, I don't think it'd be possible to order device
>> initialization to bring up backing devices earlier.  We don't really
>> have that kind of knowledge easily accessible.  Also, I don't think
>> it's realistic to require drivers to avoid high order allocations
>> during resume.
>>
>> Maybe we should pre-reclaim and set aside some amount of memory to be
>> used during resume?  It's a mushy solution but could be good enough.
>> Rafael, Johannes, what do you guys think?
>
> Is it necessary that resume paths allocate at all?  Freeing at suspend
> what you have to reallocate at resume is asking for trouble.  It's not
> just higher order allocations, either, even order-0 allocations are
> less reliable without GFP_IOFS.  So I think this should be avoided as
> much as possible.

From the discussion on this thread, what I understand is that certain
drivers and some subsystems like the scheduler free memory during
suspend and allocate it back during resume.  But why does that pose a
problem for the MM subsystem?  I mean, the memory freed and the memory
requested later are not substantially different, either in quantity or
in allocation order, right?  If that's the case, then what happened to
the freed memory?  Did the page cache or some other caching mechanism
launder most of it so soon that we are forced to rely on reclaim to
allocate memory during resume?  Isn't that somewhat suspicious?

I might be missing something obvious here, but given that tasks are
frozen during suspend and not a whole lot of things (allocations etc.)
happen in the suspend path, I would expect whatever memory was freed
during suspend to naturally remain available during resume as well.

Thoughts?

Regards,
Srivatsa S. Bhat

> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: page_alloc: warn about higher-order allocations during
>  suspend
>
> Higher-order allocations require compaction to work reliably, and
> compaction requires the ability to do IO.  Unfortunately, backing
> storage infrastructure is disabled during suspend & resume, and so
> higher-order allocations can not be supported during that time.
>
> Drivers should refrain from freeing and allocating data structures
> during suspend and resume entirely, and fall back to order-0 pages if
> strictly necessary.
>
> Add an extra line of warning to the allocation failure dump when a
> higher-order allocation fails while backing storage is suspended.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5dba2933c9c0..3acc12c0e093 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2122,6 +2122,16 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> 	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
> 		current->comm, order, gfp_mask);
>
> +	/*
> +	 * Compaction doesn't work while backing storage is suspended
> +	 * in the resume path.  Drivers should refrain from managing
> +	 * kernel objects during suspend/resume, and leave this task
> +	 * to init/exit as much as possible.
> +	 */
> +	if (order && pm_suspended_storage())
> +		pr_warn("Higher-order allocations during resume "
> +			"are unsupported\n");
> +
> 	dump_stack();
> 	if (!should_suppress_show_mem())
> 		show_mem(filter);
* Re: [REGRESSION] funny sched_domain build failure during resume 2014-05-14 14:00 ` Peter Zijlstra 2014-05-14 14:05 ` Peter Zijlstra 2014-05-14 17:02 ` Tejun Heo @ 2014-05-15 8:40 ` Juri Lelli 2014-05-15 8:51 ` Peter Zijlstra 2 siblings, 1 reply; 17+ messages in thread From: Juri Lelli @ 2014-05-15 8:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki Hi, On Wed, 14 May 2014 16:00:34 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote: > > Hello, guys. > > > > So, after resuming from suspend, I found my build jobs can not migrate > > away from the CPU it started on and thus just making use of single > > core. It turns out the scheduler failed to build sched domains due to > > order-3 allocation failure. > > [snip] > > Does something like the below help any? I noticed those things (cpudl > and cpupri) had [NR_CPUS] arrays, which is always 'fun'. > Yeah, not nice :/. > The below is a mostly no thought involved conversion of cpudl which > boots, I'll also do cpupri and then actually stare at the algorithms to > see if I didn't make any obvious fails. > > Juri? 
> 
> ---
>  kernel/sched/cpudeadline.c | 29 +++++++++++++++++++----------
>  kernel/sched/cpudeadline.h |  6 +++---
>  2 files changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index ab001b5d5048..c34ab09a790b 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -13,6 +13,7 @@
> 
>  #include <linux/gfp.h>
>  #include <linux/kernel.h>
> +#include <linux/slab.h>
>  #include "cpudeadline.h"
> 
>  static inline int parent(int i)
> @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> 
>  static void cpudl_exchange(struct cpudl *cp, int a, int b)
>  {
> -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> -
>  	swap(cp->elements[a], cp->elements[b]);
> -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
>  }
> 

I think there is a problem here. Your patch "embeds" cpu_to_idx array
in elements array, but here the swap operates differently on the two.

Let me try to clarify with a simple example. On a 4-CPU system suppose
we have this situation:

          CPU 1
          DL 50
          /   \
         /     \
        X       X
       /
      /
     X

 -- orig
 elements[dl/cpu] | 50/1 | 0/0  | 0/0  | 0/0  |
                     ^      \      \      \
 cpu_to_idx[idx]  |  -1  |  0   |  -1  |  -1  |

 -- yours
 elements[dl/cpu] | 50/1 | 0/0  | 0/0  | 0/0  |
                     ^      \      \      \
 elements[idx]    |  -1  |  0   |  -1  |  -1  |

So far, all is fine. But, what happens if I call cpudl_set(&cp, 2, 55, 1)?
No surprises: we insert the new element and then we try to bring it to
the root (as it has the max deadline). The new situation is:

          CPU 1
          DL 50
          /   \
         /     \
   ^  CPU 2     X
   |  DL 55
       /
      /
     X

 -- orig
 elements[dl/cpu] | 50/1 | 55/2 | 0/0  | 0/0  |
                     ^      ^      \      \
 cpu_to_idx[idx]  |  -1  |  0   |  1   |  -1  |

 -- yours
 elements[dl/cpu] | 50/1 | 55/2 | 0/0  | 0/0  |
                     ^      ^      \      \
 elements[idx]    |  -1  |  0   |  1   |  -1  |

Now, we do cpudl_exchange(&cp, 1, 0).
In orig we have

static void cpudl_exchange(struct cpudl *cp, int a, int b)
{
	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;

	swap(cp->elements[a], cp->elements[b]);
	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
}

note that here a = 1, b = 0, cpu_a = 2 and cpu_b = 1. While in yours

static void cpudl_exchange(struct cpudl *cp, int a, int b)
{
	swap(cp->elements[a], cp->elements[b]);
}

so we end up with

 -- orig
 elements[dl/cpu] | 55/2 | 50/1 | 0/0  | 0/0  |
                     ^      ^
                     |      |
              +------|------+
              |      |
 cpu_to_idx[idx]  |  -1  |  1   |  0   |  -1  |

 -- yours
 elements[dl/cpu] | 55/2 | 50/1 | 0/0  | 0/0  |
                     ^      ^
                     |      |
                     |  +---+
                     |  |
 elements[idx]    |  0   |  -1  |  1   |  -1  |

and this breaks how the heap works. For example, what if I want to
update CPU1's deadline? In orig I correctly find it at position 1 of the
elements array. But with the patch, CPU1's idx is IDX_INVALID.

Sorry for this long reply, but I had to convince also myself... So, I
think that having just one dynamic array is nicer and better, but we
still need to swap things separately.
Something like (on top of yours):

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 60ad0af..10dc7ab 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+static inline void swap_element(struct cpudl *cp, int a, int b)
+{
+	int cpu_tmp = cp->elements[a].cpu;
+	u64 dl_tmp = cp->elements[a].dl;
+
+	cp->elements[a].cpu = cp->elements[b].cpu;
+	cp->elements[a].dl = cp->elements[b].dl;
+	cp->elements[b].cpu = cpu_tmp;
+	cp->elements[b].dl = dl_tmp;
+}
+
+static inline void swap_idx(struct cpudl *cp, int a, int b)
+{
+	int idx_tmp = cp->elements[a].idx;
+
+	cp->elements[a].idx = cp->elements[b].idx;
+	cp->elements[b].idx = idx_tmp;
+}
+
 static void cpudl_exchange(struct cpudl *cp, int a, int b)
 {
-	swap(cp->elements[a], cp->elements[b]);
+	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+	swap_element(cp, a, b);
+	swap_idx(cp, cpu_a, cpu_b);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
---

But, I don't know if this is too ugly and we better go with two arrays
(or a better solution, as this is the fastest thing I could come up
with :)). In the meanwhile I'll test this more...
Thanks a lot,

- Juri

>  static void cpudl_heapify(struct cpudl *cp, int idx)
> @@ -140,7 +138,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  	WARN_ON(!cpu_present(cpu));
> 
>  	raw_spin_lock_irqsave(&cp->lock, flags);
> -	old_idx = cp->cpu_to_idx[cpu];
> +	old_idx = cp->elements[cpu].idx;
>  	if (!is_valid) {
>  		/* remove item */
>  		if (old_idx == IDX_INVALID) {
> @@ -155,8 +153,8 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
>  		cp->elements[old_idx].cpu = new_cpu;
>  		cp->size--;
> -		cp->cpu_to_idx[new_cpu] = old_idx;
> -		cp->cpu_to_idx[cpu] = IDX_INVALID;
> +		cp->elements[new_cpu].idx = old_idx;
> +		cp->elements[cpu].idx = IDX_INVALID;
>  		while (old_idx > 0 && dl_time_before(
>  				cp->elements[parent(old_idx)].dl,
>  				cp->elements[old_idx].dl)) {
> @@ -173,7 +171,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  		cp->size++;
>  		cp->elements[cp->size - 1].dl = 0;
>  		cp->elements[cp->size - 1].cpu = cpu;
> -		cp->cpu_to_idx[cpu] = cp->size - 1;
> +		cp->elements[cpu].idx = cp->size - 1;
>  		cpudl_change_key(cp, cp->size - 1, dl);
>  		cpumask_clear_cpu(cpu, cp->free_cpus);
>  	} else {
> @@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
>  	memset(cp, 0, sizeof(*cp));
>  	raw_spin_lock_init(&cp->lock);
>  	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(num_possible_cpus(),
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
>  		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
>  	cpumask_setall(cp->free_cpus);
> 
>  	return 0;
> diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
> index a202789a412c..538c9796ad4a 100644
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -5,17 +5,17 @@
> 
>  #define IDX_INVALID -1
> 
> -struct array_item {
> +struct cpudl_item {
>  	u64 dl;
>  	int cpu;
> +	int idx;
>  };
> 
>  struct cpudl {
>  	raw_spinlock_t lock;
>  	int size;
> -	int cpu_to_idx[NR_CPUS];
> -	struct array_item elements[NR_CPUS];
>  	cpumask_var_t free_cpus;
> +	struct cpudl_item *elements;
>  };
> 

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:40     ` Juri Lelli
@ 2014-05-15  8:51       ` Peter Zijlstra
  2014-05-15 11:52         ` Juri Lelli
  2014-05-16 10:43         ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-15  8:51 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 2626 bytes --]

On Thu, May 15, 2014 at 10:40:55AM +0200, Juri Lelli wrote:
> Hi,
> 
> > @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> > 
> >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> >  {
> > -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > -
> >  	swap(cp->elements[a], cp->elements[b]);
> > -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> >  }
> > 
> 
> I think there is a problem here. Your patch "embeds" cpu_to_idx array
> in elements array, but here the swap operates differently on the two.

<snip>

> Sorry for this long reply, but I had to convince also myself...

Glad you explained it before I tried to untangle that code myself.

> So, I think that having just one dynamic array is nicer and better, but
> we still need to swap things separately. Something like (on top of
> yours):
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 60ad0af..10dc7ab 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
>  	return (s64)(a - b) < 0;
>  }
>  
> +static inline void swap_element(struct cpudl *cp, int a, int b)
> +{
> +	int cpu_tmp = cp->elements[a].cpu;
> +	u64 dl_tmp = cp->elements[a].dl;
> +
> +	cp->elements[a].cpu = cp->elements[b].cpu;
> +	cp->elements[a].dl = cp->elements[b].dl;
> +	cp->elements[b].cpu = cpu_tmp;
> +	cp->elements[b].dl = dl_tmp;

You could've just written:

	swap(cp->elements[a].cpu, cp->elements[b].cpu);
	swap(cp->elements[a].dl , cp->elements[b].dl );

The swap macro does the tmp var for you.

> +}
> +
> +static inline void swap_idx(struct cpudl *cp, int a, int b)
> +{
> +	int idx_tmp = cp->elements[a].idx;
> +
> +	cp->elements[a].idx = cp->elements[b].idx;
> +	cp->elements[b].idx = idx_tmp;

	swap(cp->elements[a].idx, cp->elements[b].idx);

> +}
> +
>  static void cpudl_exchange(struct cpudl *cp, int a, int b)
>  {
> -	swap(cp->elements[a], cp->elements[b]);
> +	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> +
> +	swap_element(cp, a, b);
> +	swap_idx(cp, cpu_a, cpu_b);

Or just skip the lot and put the 3 swap() stmts here.

> }
> 
> static void cpudl_heapify(struct cpudl *cp, int idx)
> ---
> 
> But, I don't know if this is too ugly and we better go with two arrays
> (or a better solution, as this is the fastest thing I could come up
> with :)).

Thanks for looking at it, and sorry for breaking it.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:51       ` Peter Zijlstra
@ 2014-05-15 11:52         ` Juri Lelli
  0 siblings, 0 replies; 17+ messages in thread
From: Juri Lelli @ 2014-05-15 11:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Thu, 15 May 2014 10:51:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, May 15, 2014 at 10:40:55AM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > > @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> > > 
> > >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> > >  {
> > > -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > > -
> > >  	swap(cp->elements[a], cp->elements[b]);
> > > -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> > >  }
> > > 
> > 
> > I think there is a problem here. Your patch "embeds" cpu_to_idx array
> > in elements array, but here the swap operates differently on the two.
> 
> <snip>
> 
> > Sorry for this long reply, but I had to convince also myself...
> 
> Glad you explained it before I tried to untangle that code myself.
> 
> > So, I think that having just one dynamic array is nicer and better, but
> > we still need to swap things separately. Something like (on top of
> > yours):
> > 
> > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > index 60ad0af..10dc7ab 100644
> > --- a/kernel/sched/cpudeadline.c
> > +++ b/kernel/sched/cpudeadline.c
> > @@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
> >  	return (s64)(a - b) < 0;
> >  }
> >  
> > +static inline void swap_element(struct cpudl *cp, int a, int b)
> > +{
> > +	int cpu_tmp = cp->elements[a].cpu;
> > +	u64 dl_tmp = cp->elements[a].dl;
> > +
> > +	cp->elements[a].cpu = cp->elements[b].cpu;
> > +	cp->elements[a].dl = cp->elements[b].dl;
> > +	cp->elements[b].cpu = cpu_tmp;
> > +	cp->elements[b].dl = dl_tmp;
> 
> You could've just written:
> 
> 	swap(cp->elements[a].cpu, cp->elements[b].cpu);
> 	swap(cp->elements[a].dl , cp->elements[b].dl );
> 
> The swap macro does the tmp var for you.
> 
> > +}
> > +
> > +static inline void swap_idx(struct cpudl *cp, int a, int b)
> > +{
> > +	int idx_tmp = cp->elements[a].idx;
> > +
> > +	cp->elements[a].idx = cp->elements[b].idx;
> > +	cp->elements[b].idx = idx_tmp;
> 
> 	swap(cp->elements[a].idx, cp->elements[b].idx);
> 
> > +}
> > +
> >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> >  {
> > -	swap(cp->elements[a], cp->elements[b]);
> > +	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > +
> > +	swap_element(cp, a, b);
> > +	swap_idx(cp, cpu_a, cpu_b);
> 
> Or just skip the lot and put the 3 swap() stmts here.
> 

Ah, yes, sure! Maybe we could also add a comment to cpudeadline.c
explaining a bit how the thing works. Something along the lines of
cpupri.c:

/*
 *  kernel/sched/cpudeadline.c
 *
 *  Global CPU deadline management
 *
 *  Author: Juri Lelli <juri.lelli@gmail.com>
 *
 *  This code tracks the deadline of each CPU (i.e., the deadline of
 *  current on the CPU) so that global migration decisions are easy
 *  to calculate. Each CPU can be in one of the following states:
 *
 *      (INVALID), WITH_DL_TASK(S)
 *
 *  The system maintains this state with a max-heap implemented as
 *  a simple array. To efficiently handle updates on intermediate nodes
 *  the array can be thought of as split in two parts, one that contains
 *  heap nodes, and the other that keeps track of where nodes reside in
 *  the first part. From kernel/sched/cpudeadline.h we conceptually
 *  have:
 *
 *      struct cpudl_item {
 *              u64 dl;
 *              int cpu;
 *              -------------------
 *              int idx;
 *      }
 *
 *  Moreover, we keep track of CPUs in the INVALID state with a cpumask
 *  (no need to use the array if some CPU is free).
 *
 *  Let's clarify this with a simple example. Suppose at a certain
 *  instant of time we have this situation (4-CPU box):
 *
 *           CPU 1
 *           DL 50
 *           /   \
 *          /     \
 *      CPU 2     CPU 0
 *      DL 30     DL 40
 *        /
 *       /
 *    CPU 3
 *    DL 25
 *
 *  In this case the state is maintained as:
 *
 *   elements[dl/cpu] | 50/1 | 30/2 | 40/0 | 25/3 |
 *                       ^      ^      ^      ^
 *                       |      |      |      |
 *                       |  +---|------+      |
 *                       +--|---|--+          |
 *                    +-----+   |  |          |
 *                    |         |  |          |
 *   elements[idx]    |  2   |  0   |  1   |  3   |
 *
 *  Operations on the heap (e.g., node updates) must thus handle
 *  the two parts of the elements array separately, see cpudl_set()
 *  and cpudl_exchange() for details.
 *
 *  This program is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU General Public License
 *  as published by the Free Software Foundation; version 2
 *  of the License.
 */

Thanks,

- Juri

> > }
> > 
> > static void cpudl_heapify(struct cpudl *cp, int idx)
> > ---
> > 
> > But, I don't know if this is too ugly and we better go with two arrays
> > (or a better solution, as this is the fastest thing I could come up
> > with :)).
> 
> Thanks for looking at it, and sorry for breaking it.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:51       ` Peter Zijlstra
  2014-05-15 11:52         ` Juri Lelli
@ 2014-05-16 10:43         ` Peter Zijlstra
  2014-05-16 11:01           ` Juri Lelli
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-16 10:43 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 3666 bytes --]

OK I made that..

---
Subject: sched/cpudl: Replace NR_CPUS arrays
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 14 16:13:56 CEST 2014

Tejun reported that his resume was failing due to order-3 allocations
from sched_domain building.

Replace the NR_CPUS arrays in there with a dynamically allocated
array.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/cpudeadline.c | 33 ++++++++++++++++++++++++---------
 kernel/sched/cpudeadline.h |  6 +++---
 2 files changed, 27 insertions(+), 12 deletions(-)

--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -13,6 +13,7 @@
 
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/slab.h>
 #include "cpudeadline.h"
 
 static inline int parent(int i)
@@ -39,8 +40,10 @@ static void cpudl_exchange(struct cpudl
 {
 	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
 
-	swap(cp->elements[a], cp->elements[b]);
-	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+	swap(cp->elements[a].cpu, cp->elements[b].cpu);
+	swap(cp->elements[a].dl , cp->elements[b].dl );
+
+	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
@@ -140,7 +143,7 @@ void cpudl_set(struct cpudl *cp, int cpu
 	WARN_ON(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
-	old_idx = cp->cpu_to_idx[cpu];
+	old_idx = cp->elements[cpu].idx;
 	if (!is_valid) {
 		/* remove item */
 		if (old_idx == IDX_INVALID) {
@@ -155,8 +158,8 @@ void cpudl_set(struct cpudl *cp, int cpu
 		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
 		cp->elements[old_idx].cpu = new_cpu;
 		cp->size--;
-		cp->cpu_to_idx[new_cpu] = old_idx;
-		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		cp->elements[new_cpu].idx = old_idx;
+		cp->elements[cpu].idx = IDX_INVALID;
 		while (old_idx > 0 && dl_time_before(
 				cp->elements[parent(old_idx)].dl,
 				cp->elements[old_idx].dl)) {
@@ -173,7 +176,7 @@ void cpudl_set(struct cpudl *cp, int cpu
 		cp->size++;
 		cp->elements[cp->size - 1].dl = 0;
 		cp->elements[cp->size - 1].cpu = cpu;
-		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cp->elements[cpu].idx = cp->size - 1;
 		cpudl_change_key(cp, cp->size - 1, dl);
 		cpumask_clear_cpu(cpu, cp->free_cpus);
 	} else {
@@ -195,10 +198,21 @@ int cpudl_init(struct cpudl *cp)
 	memset(cp, 0, sizeof(*cp));
 	raw_spin_lock_init(&cp->lock);
 	cp->size = 0;
-	for (i = 0; i < NR_CPUS; i++)
-		cp->cpu_to_idx[i] = IDX_INVALID;
-	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+
+	cp->elements = kcalloc(nr_cpu_ids,
+			       sizeof(struct cpudl_item),
+			       GFP_KERNEL);
+	if (!cp->elements)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
+		kfree(cp->elements);
 		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(i)
+		cp->elements[i].idx = IDX_INVALID;
+
 	cpumask_setall(cp->free_cpus);
 
 	return 0;
@@ -211,4 +225,5 @@ int cpudl_init(struct cpudl *cp)
 void cpudl_cleanup(struct cpudl *cp)
 {
 	free_cpumask_var(cp->free_cpus);
+	kfree(cp->elements);
 }
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -5,17 +5,17 @@
 
 #define IDX_INVALID -1
 
-struct array_item {
+struct cpudl_item {
 	u64 dl;
 	int cpu;
+	int idx;
 };
 
 struct cpudl {
 	raw_spinlock_t lock;
 	int size;
-	int cpu_to_idx[NR_CPUS];
-	struct array_item elements[NR_CPUS];
 	cpumask_var_t free_cpus;
+	struct cpudl_item *elements;
 };

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-16 10:43         ` Peter Zijlstra
@ 2014-05-16 11:01           ` Juri Lelli
  2014-05-16 11:04             ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2014-05-16 11:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Fri, 16 May 2014 12:43:36 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> 
> OK I made that..
> 

Are the comments I proposed to add overdoing? Apart from this,

Acked-by: Juri Lelli <juri.lelli@gmail.com>

Thanks!

- Juri

> ---
> 
> Subject: sched/cpudl: Replace NR_CPUS arrays
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed May 14 16:13:56 CEST 2014
> 
> Tejun reported that his resume was failing due to order-3 allocations
> from sched_domain building.
> 
> Replace the NR_CPUS arrays in there with a dynamically allocated
> array.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Juri Lelli <juri.lelli@gmail.com>
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/cpudeadline.c | 33 ++++++++++++++++++++++++---------
>  kernel/sched/cpudeadline.h |  6 +++---
>  2 files changed, 27 insertions(+), 12 deletions(-)
> 
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -13,6 +13,7 @@
> 
>  #include <linux/gfp.h>
>  #include <linux/kernel.h>
> +#include <linux/slab.h>
>  #include "cpudeadline.h"
> 
>  static inline int parent(int i)
> @@ -39,8 +40,10 @@ static void cpudl_exchange(struct cpudl
>  {
>  	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> 
> -	swap(cp->elements[a], cp->elements[b]);
> -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> +	swap(cp->elements[a].cpu, cp->elements[b].cpu);
> +	swap(cp->elements[a].dl , cp->elements[b].dl );
> +
> +	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
>  }
> 
>  static void cpudl_heapify(struct cpudl *cp, int idx)
> @@ -140,7 +143,7 @@ void cpudl_set(struct cpudl *cp, int cpu
>  	WARN_ON(!cpu_present(cpu));
> 
>  	raw_spin_lock_irqsave(&cp->lock, flags);
> -	old_idx = cp->cpu_to_idx[cpu];
> +	old_idx = cp->elements[cpu].idx;
>  	if (!is_valid) {
>  		/* remove item */
>  		if (old_idx == IDX_INVALID) {
> @@ -155,8 +158,8 @@ void cpudl_set(struct cpudl *cp, int cpu
>  		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
>  		cp->elements[old_idx].cpu = new_cpu;
>  		cp->size--;
> -		cp->cpu_to_idx[new_cpu] = old_idx;
> -		cp->cpu_to_idx[cpu] = IDX_INVALID;
> +		cp->elements[new_cpu].idx = old_idx;
> +		cp->elements[cpu].idx = IDX_INVALID;
>  		while (old_idx > 0 && dl_time_before(
>  				cp->elements[parent(old_idx)].dl,
>  				cp->elements[old_idx].dl)) {
> @@ -173,7 +176,7 @@ void cpudl_set(struct cpudl *cp, int cpu
>  		cp->size++;
>  		cp->elements[cp->size - 1].dl = 0;
>  		cp->elements[cp->size - 1].cpu = cpu;
> -		cp->cpu_to_idx[cpu] = cp->size - 1;
> +		cp->elements[cpu].idx = cp->size - 1;
>  		cpudl_change_key(cp, cp->size - 1, dl);
>  		cpumask_clear_cpu(cpu, cp->free_cpus);
>  	} else {
> @@ -195,10 +198,21 @@ int cpudl_init(struct cpudl *cp)
>  	memset(cp, 0, sizeof(*cp));
>  	raw_spin_lock_init(&cp->lock);
>  	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(nr_cpu_ids,
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
>  		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
>  	cpumask_setall(cp->free_cpus);
> 
>  	return 0;
> @@ -211,4 +225,5 @@ int cpudl_init(struct cpudl *cp)
>  void cpudl_cleanup(struct cpudl *cp)
>  {
>  	free_cpumask_var(cp->free_cpus);
> +	kfree(cp->elements);
>  }
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -5,17 +5,17 @@
> 
>  #define IDX_INVALID -1
> 
> -struct array_item {
> +struct cpudl_item {
>  	u64 dl;
>  	int cpu;
> +	int idx;
>  };
> 
>  struct cpudl {
>  	raw_spinlock_t lock;
>  	int size;
> -	int cpu_to_idx[NR_CPUS];
> -	struct array_item elements[NR_CPUS];
>  	cpumask_var_t free_cpus;
> +	struct cpudl_item *elements;
>  };
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-16 11:01           ` Juri Lelli
@ 2014-05-16 11:04             ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-16 11:04 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 274 bytes --]

On Fri, May 16, 2014 at 01:01:53PM +0200, Juri Lelli wrote:
> Are the comments I proposed to add overdoing?

Dunno, might be helpful, if you post them as a proper patch I'll press
'A'.

> Apart from this,
> 
> Acked-by: Juri Lelli <juri.lelli@gmail.com>

Thanks!

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2014-05-16 11:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2014-05-09 16:04 [REGRESSION] funny sched_domain build failure during resume Tejun Heo
2014-05-09 16:15 ` Peter Zijlstra
2014-05-14 14:00 ` Peter Zijlstra
2014-05-14 14:05   ` Peter Zijlstra
2014-05-14 17:02   ` Tejun Heo
2014-05-14 17:10     ` Peter Zijlstra
2014-05-14 22:36       ` Tejun Heo
2014-05-15  8:57         ` Peter Zijlstra
2014-05-15 14:41           ` Johannes Weiner
2014-05-15 15:12             ` Peter Zijlstra
2014-05-16 11:47               ` Srivatsa S. Bhat
2014-05-15  8:40   ` Juri Lelli
2014-05-15  8:51     ` Peter Zijlstra
2014-05-15 11:52       ` Juri Lelli
2014-05-16 10:43       ` Peter Zijlstra
2014-05-16 11:01         ` Juri Lelli
2014-05-16 11:04           ` Peter Zijlstra