* [REGRESSION] funny sched_domain build failure during resume
From: Tejun Heo @ 2014-05-09 16:04 UTC
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, Johannes Weiner, Rafael J. Wysocki

Hello, guys.

So, after resuming from suspend, I found my build jobs cannot migrate
away from the CPU they started on and thus make use of only a single
core.  It turns out the scheduler failed to build sched domains due to
an order-3 allocation failure.

  systemd-sleep: page allocation failure: order:3, mode:0x104010
  CPU: 0 PID: 11648 Comm: systemd-sleep Not tainted 3.14.2-200.fc20.x86_64 #1
  Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
   0000000000000000 000000001bc36890 ffff88009c2d5958 ffffffff816eec92
   0000000000104010 ffff88009c2d59e8 ffffffff8117a32a 0000000000000000
   ffff88021efe6b00 0000000000000003 0000000000104010 ffff88009c2d59e8
  Call Trace:
   [<ffffffff816eec92>] dump_stack+0x45/0x56
   [<ffffffff8117a32a>] warn_alloc_failed+0xfa/0x170
   [<ffffffff8117e8f5>] __alloc_pages_nodemask+0x8e5/0xb00
   [<ffffffff811c0ce3>] alloc_pages_current+0xa3/0x170
   [<ffffffff811796a4>] __get_free_pages+0x14/0x50
   [<ffffffff8119823e>] kmalloc_order_trace+0x2e/0xa0
   [<ffffffff810c033f>] build_sched_domains+0x1ff/0xcc0
   [<ffffffff810c123e>] partition_sched_domains+0x35e/0x3d0
   [<ffffffff811168e7>] cpuset_update_active_cpus+0x17/0x40
   [<ffffffff810c130a>] cpuset_cpu_active+0x5a/0x70
   [<ffffffff816f9f4c>] notifier_call_chain+0x4c/0x70
   [<ffffffff810b2a1e>] __raw_notifier_call_chain+0xe/0x10
   [<ffffffff8108a413>] cpu_notify+0x23/0x50
   [<ffffffff8108a678>] _cpu_up+0x188/0x1a0
   [<ffffffff816e1783>] enable_nonboot_cpus+0x93/0xf0
   [<ffffffff810d9d45>] suspend_devices_and_enter+0x325/0x450
   [<ffffffff810d9fe8>] pm_suspend+0x178/0x260
   [<ffffffff810d8e79>] state_store+0x79/0xf0
   [<ffffffff81355bdf>] kobj_attr_store+0xf/0x20
   [<ffffffff81262c4d>] sysfs_kf_write+0x3d/0x50
   [<ffffffff81266b12>] kernfs_fop_write+0xd2/0x140
   [<ffffffff811e964a>] vfs_write+0xba/0x1e0
   [<ffffffff811ea0a5>] SyS_write+0x55/0xd0
   [<ffffffff816ff029>] system_call_fastpath+0x16/0x1b

The allocation is from alloc_rootdomain():

	struct root_domain *rd;

	rd = kmalloc(sizeof(*rd), GFP_KERNEL);

The thing is, the system has plenty of reclaimable memory and shouldn't
have any trouble satisfying one GFP_KERNEL order-3 allocation.  The
problem is that this happens during resume, before the devices have been
woken up, so pm_restrict_gfp_mask() punches GFP_IOFS out of all
allocation masks; the page allocator is left with just __GFP_WAIT to
work with and, with enough bad luck, fails as expected.

The problem has always been there but seems to have been exposed by the
addition of deadline scheduler support, which added cpudl to
root_domain, making it larger by around 20k bytes on my setup and thus
making an order-3 allocation necessary during CPU online.

It looks like the allocation is for a temp buffer, and there are also
percpu allocations going on.  Maybe just allocate the buffers at boot
and keep them around?

Kudos to Johannes for helping decipher the mm debug messages.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-09 16:15 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote:
> It looks like the allocation is for a temp buffer and there are also
> percpu allocations going on.  Maybe just allocate the buffers on boot
> and keep them around?

It's not a scratch buffer, but we could certainly try and keep it around
over suspend/resume.  I'll try and prod at that on Monday.
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-14 14:00 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote:
> Hello, guys.
>
> So, after resuming from suspend, I found my build jobs can not migrate
> away from the CPU it started on and thus just making use of single
> core.  It turns out the scheduler failed to build sched domains due to
> order-3 allocation failure.
>
> systemd-sleep: page allocation failure: order:3, mode:0x104010
> CPU: 0 PID: 11648 Comm: systemd-sleep Not tainted 3.14.2-200.fc20.x86_64 #1
> Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
>  0000000000000000 000000001bc36890 ffff88009c2d5958 ffffffff816eec92
>  0000000000104010 ffff88009c2d59e8 ffffffff8117a32a 0000000000000000
>  ffff88021efe6b00 0000000000000003 0000000000104010 ffff88009c2d59e8
> Call Trace:
>  [<ffffffff816eec92>] dump_stack+0x45/0x56
>  [<ffffffff8117a32a>] warn_alloc_failed+0xfa/0x170
>  [<ffffffff8117e8f5>] __alloc_pages_nodemask+0x8e5/0xb00
>  [<ffffffff811c0ce3>] alloc_pages_current+0xa3/0x170
>  [<ffffffff811796a4>] __get_free_pages+0x14/0x50
>  [<ffffffff8119823e>] kmalloc_order_trace+0x2e/0xa0
>  [<ffffffff810c033f>] build_sched_domains+0x1ff/0xcc0
>  [<ffffffff810c123e>] partition_sched_domains+0x35e/0x3d0
>  [<ffffffff811168e7>] cpuset_update_active_cpus+0x17/0x40
>  [<ffffffff810c130a>] cpuset_cpu_active+0x5a/0x70
>  [<ffffffff816f9f4c>] notifier_call_chain+0x4c/0x70
>  [<ffffffff810b2a1e>] __raw_notifier_call_chain+0xe/0x10
>  [<ffffffff8108a413>] cpu_notify+0x23/0x50
>  [<ffffffff8108a678>] _cpu_up+0x188/0x1a0
>  [<ffffffff816e1783>] enable_nonboot_cpus+0x93/0xf0
>  [<ffffffff810d9d45>] suspend_devices_and_enter+0x325/0x450
>  [<ffffffff810d9fe8>] pm_suspend+0x178/0x260
>  [<ffffffff810d8e79>] state_store+0x79/0xf0
>  [<ffffffff81355bdf>] kobj_attr_store+0xf/0x20
>  [<ffffffff81262c4d>] sysfs_kf_write+0x3d/0x50
>  [<ffffffff81266b12>] kernfs_fop_write+0xd2/0x140
>  [<ffffffff811e964a>] vfs_write+0xba/0x1e0
>  [<ffffffff811ea0a5>] SyS_write+0x55/0xd0
>  [<ffffffff816ff029>] system_call_fastpath+0x16/0x1b
>
> The allocation is from alloc_rootdomain().
>
> 	struct root_domain *rd;
>
> 	rd = kmalloc(sizeof(*rd), GFP_KERNEL);
>
> The thing is the system has plenty of reclaimable memory and shouldn't
> have any trouble satisfying one GFP_KERNEL order-3 allocation;
> however, the problem is that this is during resume and the devices
> haven't been woken up yet, so pm_restrict_gfp_mask() punches out
> GFP_IOFS from all allocation masks and the page allocator has just
> __GFP_WAIT to work with and, with enough bad luck, fails expectedly.
>
> The problem has always been there but seems to have been exposed by
> the addition of deadline scheduler support, which added cpudl to
> root_domain making it larger by around 20k bytes on my setup, making
> an order-3 allocation necessary during CPU online.
>
> It looks like the allocation is for a temp buffer and there are also
> percpu allocations going on.  Maybe just allocate the buffers on boot
> and keep them around?
>
> Kudos to Johannes for helping deciphering mm debug messages.

Does something like the below help any? I noticed those things (cpudl
and cpupri) had [NR_CPUS] arrays, which is always 'fun'.

The below is a mostly no thought involved conversion of cpudl which
boots; I'll also do cpupri and then actually stare at the algorithms to
see if I didn't make any obvious fails.

Juri?

---
 kernel/sched/cpudeadline.c | 29 +++++++++++++++++++----------
 kernel/sched/cpudeadline.h |  6 +++---
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index ab001b5d5048..c34ab09a790b 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -13,6 +13,7 @@
 
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/slab.h>
 #include "cpudeadline.h"
 
 static inline int parent(int i)
@@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
 
 static void cpudl_exchange(struct cpudl *cp, int a, int b)
 {
-	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
-
 	swap(cp->elements[a], cp->elements[b]);
-	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
@@ -140,7 +138,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 	WARN_ON(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
-	old_idx = cp->cpu_to_idx[cpu];
+	old_idx = cp->elements[cpu].idx;
 	if (!is_valid) {
 		/* remove item */
 		if (old_idx == IDX_INVALID) {
@@ -155,8 +153,8 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
 		cp->elements[old_idx].cpu = new_cpu;
 		cp->size--;
-		cp->cpu_to_idx[new_cpu] = old_idx;
-		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		cp->elements[new_cpu].idx = old_idx;
+		cp->elements[cpu].idx = IDX_INVALID;
 		while (old_idx > 0 && dl_time_before(
 				cp->elements[parent(old_idx)].dl,
 				cp->elements[old_idx].dl)) {
@@ -173,7 +171,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 		cp->size++;
 		cp->elements[cp->size - 1].dl = 0;
 		cp->elements[cp->size - 1].cpu = cpu;
-		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cp->elements[cpu].idx = cp->size - 1;
 		cpudl_change_key(cp, cp->size - 1, dl);
 		cpumask_clear_cpu(cpu, cp->free_cpus);
 	} else {
@@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
 	memset(cp, 0, sizeof(*cp));
 	raw_spin_lock_init(&cp->lock);
 	cp->size = 0;
-	for (i = 0; i < NR_CPUS; i++)
-		cp->cpu_to_idx[i] = IDX_INVALID;
-	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+
+	cp->elements = kcalloc(num_possible_cpus(),
+			       sizeof(struct cpudl_item),
+			       GFP_KERNEL);
+	if (!cp->elements)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
+		kfree(cp->elements);
 		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(i)
+		cp->elements[i].idx = IDX_INVALID;
+
 	cpumask_setall(cp->free_cpus);
 
 	return 0;
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
index a202789a412c..538c9796ad4a 100644
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -5,17 +5,17 @@
 
 #define IDX_INVALID -1
 
-struct array_item {
+struct cpudl_item {
 	u64 dl;
 	int cpu;
+	int idx;
 };
 
 struct cpudl {
 	raw_spinlock_t lock;
 	int size;
-	int cpu_to_idx[NR_CPUS];
-	struct array_item elements[NR_CPUS];
 	cpumask_var_t free_cpus;
+	struct cpudl_item *elements;
 };
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-14 14:05 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> The below is a mostly no thought involved conversion of cpudl which
                 ^^^^^^^^^^^^^^^^^^^
> boots, I'll also do cpupri and then actually stare at the algorithms to
> see if I didn't make any obvious fails.
>
> Juri?
>
> ---
> @@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
> 	memset(cp, 0, sizeof(*cp));
> 	raw_spin_lock_init(&cp->lock);
> 	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(num_possible_cpus(),
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
> 		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
> 	cpumask_setall(cp->free_cpus);
>
> 	return 0;

Also add:

---
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -220,4 +220,5 @@ int cpudl_init(struct cpudl *cp)
 void cpudl_cleanup(struct cpudl *cp)
 {
 	free_cpumask_var(cp->free_cpus);
+	kfree(cp->elements);
 }
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Tejun Heo @ 2014-05-14 17:02 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> Does something like the below help any? I noticed those things (cpudl
> and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
>
> The below is a mostly no thought involved conversion of cpudl which
> boots, I'll also do cpupri and then actually stare at the algorithms to
> see if I didn't make any obvious fails.

Yeah, it should avoid the large allocation on reasonably sized machines,
and I don't think 2k-CPU machines suspend regularly.  Prolly good / safe
enough for a -stable port?  It'd still be nice to avoid allocations
during online if possible, though, given that the operation happens
while mm is mostly crippled.

Thanks.

--
tejun
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-14 17:10 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli

On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
> On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> > Does something like the below help any? I noticed those things (cpudl
> > and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
> >
> > The below is a mostly no thought involved conversion of cpudl which
> > boots, I'll also do cpupri and then actually stare at the algorithms to
> > see if I didn't make any obvious fails.
>
> Yeah, should avoid large allocation on reasonably sized machines and I
> don't think 2k CPU machines suspend regularly.  Prolly good / safe
> enough for -stable port?

Yeah, it's certainly -stable material, esp. if this cures the immediate
problem.

> It'd be still nice to avoid allocations if
> possible during online tho given that the operation happens while mm
> is mostly crippled.

Yeah, I started looking at that, but it turned out to be slightly more
difficult than I had hoped (I got lost in the suspend code).  Also,
avoiding large-order allocations is good practice regardless.

So probably the easiest way to not free/alloc the entire sched_domain
thing is to just keep it around in its entirety over suspend/resume, as
I think the promise of suspend/resume is that you return to the status
quo.  But I'll stick it on the todo list after fixing this
use-after-free thing I've been trying to chase down.
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Tejun Heo @ 2014-05-14 22:36 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 07:10:51PM +0200, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
> > Yeah, should avoid large allocation on reasonably sized machines and I
> > don't think 2k CPU machines suspend regularly.  Prolly good / safe
> > enough for -stable port?
>
> Yeah, its certainly -stable material.  Esp. if this cures the immediate
> problem.

The patches are URL encoded.  I tried to reproduce the problem but
haven't succeeded; I'm quite confident about the analysis, though, so
verifying that the high-order allocation goes away should be enough.

I instrumented the allocator, and it looks like we also have other
sources of high-order allocations during resume, before the GFP_IOFS
restriction is lifted: some kthread creations (order 2, probably okay)
and, on my test setup, a series of order-3 allocations from e1000.

Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate buffers
across suspend/resume?  The driver ends up allocating multiple order-3
regions during resume, when mm doesn't have access to backing devices
and thus can't compact or reclaim, and it's not too unlikely for those
allocations to fail.

I wonder whether we need some generic solution to address the issue.
Unfortunately, I don't think it'd be possible to order device
initialization to bring up backing devices earlier; we don't really have
that kind of knowledge easily accessible.  Also, I don't think it's
realistic to require drivers to avoid high-order allocations during
resume.

Maybe we should pre-reclaim and set aside some amount of memory to be
used during resume?  It's a mushy solution but could be good enough.
Rafael, Johannes, what do you guys think?

Thanks.

--
tejun
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-15 8:57 UTC
To: Tejun Heo
Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
> The patches are URL encoded.

*sigh*..  evo used to do that, and now I think mutt does it because I've
set up the whole GPG thing.  All my scripts already automagically decode
it, so I hadn't noticed.  I can give you some awk that undoes it if you
want?
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Johannes Weiner @ 2014-05-15 14:41 UTC
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
> The patches are URL encoded.  Tried to reproduce the problem but
> haven't succeeded but I'm quite confident about the analysis, so
> verifying that the high order allocation goes away should be enough.
>
> I instrumented the allocator and it looks like we also have other
> sources of high order allocation during resume before GFP_IOFS is
> cleared.  Some kthread creations (order 2, probably okay) and on my
> test setup a series of order 3 allocations from e1000.
>
> Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate
> buffers across suspend/resume?  The driver ends up allocating multiple
> order-3 regions during resume where mm doesn't have access to backing
> devices and thus can't compact or reclaim and it's not too unlikely
> for those allocations to fail.
>
> I wonder whether we need some generic solution to address the issue.
> Unfortunately, I don't think it'd be possible to order device
> initialization to bring up backing devices earlier.  We don't really
> have that kind of knowledge easily accessible.  Also, I don't think
> it's realistic to require drivers to avoid high order allocations
> during resume.
>
> Maybe we should pre-reclaim and set aside some amount of memory to be
> used during resume?  It's a mushy solution but could be good enough.
> Rafael, Johannes, what do you guys think?

Is it necessary that resume paths allocate at all?  Freeing at suspend
what you have to reallocate at resume is asking for trouble.  It's not
just higher-order allocations, either; even order-0 allocations are less
reliable without GFP_IOFS.  So I think this should be avoided as much as
possible.

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: page_alloc: warn about higher-order allocations during suspend

Higher-order allocations require compaction to work reliably, and
compaction requires the ability to do IO.  Unfortunately, backing
storage infrastructure is disabled during suspend & resume, and so
higher-order allocations can not be supported during that time.

Drivers should refrain from freeing and allocating data structures
during suspend and resume entirely, and fall back to order-0 pages if
strictly necessary.

Add an extra line of warning to the allocation failure dump when a
higher-order allocation fails while backing storage is suspended.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba2933c9c0..3acc12c0e093 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2122,6 +2122,16 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
 		current->comm, order, gfp_mask);
 
+	/*
+	 * Compaction doesn't work while backing storage is suspended
+	 * in the resume path.  Drivers should refrain from managing
+	 * kernel objects during suspend/resume, and leave this task
+	 * to init/exit as much as possible.
+	 */
+	if (order && pm_suspended_storage())
+		pr_warn("Higher-order allocations during resume "
+			"are unsupported\n");
+
 	dump_stack();
 	if (!should_suppress_show_mem())
 		show_mem(filter);
-- 
1.9.2
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Peter Zijlstra @ 2014-05-15 15:12 UTC
To: Johannes Weiner
Cc: Tejun Heo, Ingo Molnar, linux-kernel, Rafael J. Wysocki, Juri Lelli, Bruce Allan

On Thu, May 15, 2014 at 10:41:14AM -0400, Johannes Weiner wrote:
> Is it necessary that resume paths allocate at all?  Freeing at suspend
> what you have to reallocate at resume is asking for trouble.  It's not
> just higher order allocations, either, even order-0 allocations are
> less reliable without GFP_IOFS.  So I think this should be avoided as
> much as possible.

Well, in my case it's because suspend does an effective hot-unplug of
all CPUs in the system except CPU0, and if the CPU topology changes I
need to allocate new data structures and free the old ones.  The reverse
is true on resume: it hot-plugs all CPUs again, same story.
* Re: [REGRESSION] funny sched_domain build failure during resume
From: Srivatsa S. Bhat @ 2014-05-16 11:47 UTC
To: Johannes Weiner
Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki, Juri Lelli, Bruce Allan, Linux PM mailing list

On 05/15/2014 08:11 PM, Johannes Weiner wrote:
> On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
>> The patches are URL encoded.  Tried to reproduce the problem but
>> haven't succeeded but I'm quite confident about the analysis, so
>> verifying that the high order allocation goes away should be enough.
>>
>> I instrumented the allocator and it looks like we also have other
>> sources of high order allocation during resume before GFP_IOFS is
>> cleared.  Some kthread creations (order 2, probably okay) and on my
>> test setup a series of order 3 allocations from e1000.
>>
>> Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate
>> buffers across suspend/resume?  The driver ends up allocating multiple
>> order-3 regions during resume where mm doesn't have access to backing
>> devices and thus can't compact or reclaim and it's not too unlikely
>> for those allocations to fail.
>>
>> I wonder whether we need some generic solution to address the issue.
>> Unfortunately, I don't think it'd be possible to order device
>> initialization to bring up backing devices earlier.  We don't really
>> have that kind of knowledge easily accessible.  Also, I don't think
>> it's realistic to require drivers to avoid high order allocations
>> during resume.
>>
>> Maybe we should pre-reclaim and set aside some amount of memory to be
>> used during resume?  It's a mushy solution but could be good enough.
>> Rafael, Johannes, what do you guys think?
>
> Is it necessary that resume paths allocate at all?  Freeing at suspend
> what you have to reallocate at resume is asking for trouble.  It's not
> just higher order allocations, either, even order-0 allocations are
> less reliable without GFP_IOFS.  So I think this should be avoided as
> much as possible.

From the discussion on this thread, what I understand is that certain
drivers and some subsystems like the scheduler free memory during
suspend and allocate it back during resume.  But why does that pose a
problem for the MM subsystem?  I mean, the memory freed and the memory
requested later are not substantially different, either in quantity or
in allocation order, right?  If that's the case, then what happened to
the freed memory?  Did the page cache or some other caching mechanism
launder most of it so soon that we are forced to rely on reclaim to
allocate memory during resume?  Isn't that somewhat suspicious?

I might be missing something obvious here, but given that tasks are
frozen during suspend and not a whole lot of things (allocations etc.)
happen in the suspend path, I would expect whatever memory was freed
during suspend to naturally remain available during resume as well.

Thoughts?

Regards,
Srivatsa S. Bhat

> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: page_alloc: warn about higher-order allocations during
>  suspend
>
> Higher-order allocations require compaction to work reliably, and
> compaction requires the ability to do IO.  Unfortunately, backing
> storage infrastructure is disabled during suspend & resume, and so
> higher-order allocations can not be supported during that time.
>
> Drivers should refrain from freeing and allocating data structures
> during suspend and resume entirely, and fall back to order-0 pages if
> strictly necessary.
>
> Add an extra line of warning to the allocation failure dump when a
> higher-order allocation fails while backing storage is suspended.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5dba2933c9c0..3acc12c0e093 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2122,6 +2122,16 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> 	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
> 		current->comm, order, gfp_mask);
>
> +	/*
> +	 * Compaction doesn't work while backing storage is suspended
> +	 * in the resume path.  Drivers should refrain from managing
> +	 * kernel objects during suspend/resume, and leave this task
> +	 * to init/exit as much as possible.
> +	 */
> +	if (order && pm_suspended_storage())
> +		pr_warn("Higher-order allocations during resume "
> +			"are unsupported\n");
> +
> 	dump_stack();
> 	if (!should_suppress_show_mem())
> 		show_mem(filter);
* Re: [REGRESSION] funny sched_domain build failure during resume 2014-05-14 14:00 ` Peter Zijlstra 2014-05-14 14:05 ` Peter Zijlstra 2014-05-14 17:02 ` Tejun Heo @ 2014-05-15 8:40 ` Juri Lelli 2014-05-15 8:51 ` Peter Zijlstra 2 siblings, 1 reply; 17+ messages in thread From: Juri Lelli @ 2014-05-15 8:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki Hi, On Wed, 14 May 2014 16:00:34 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote: > > Hello, guys. > > > > So, after resuming from suspend, I found my build jobs can not migrate > > away from the CPU it started on and thus just making use of single > > core. It turns out the scheduler failed to build sched domains due to > > order-3 allocation failure. > > [snip] > > Does something like the below help any? I noticed those things (cpudl > and cpupri) had [NR_CPUS] arrays, which is always 'fun'. > Yeah, not nice :/. > The below is a mostly no thought involved conversion of cpudl which > boots, I'll also do cpupri and then actually stare at the algorithms to > see if I didn't make any obvious fails. > > Juri? 
> 
> ---
>  kernel/sched/cpudeadline.c | 29 +++++++++++++++++++----------
>  kernel/sched/cpudeadline.h |  6 +++---
>  2 files changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index ab001b5d5048..c34ab09a790b 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -13,6 +13,7 @@
> 
>  #include <linux/gfp.h>
>  #include <linux/kernel.h>
> +#include <linux/slab.h>
>  #include "cpudeadline.h"
> 
>  static inline int parent(int i)
> @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> 
>  static void cpudl_exchange(struct cpudl *cp, int a, int b)
>  {
> -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> -
>  	swap(cp->elements[a], cp->elements[b]);
> -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
>  }
> 

I think there is a problem here. Your patch "embeds" cpu_to_idx array
in elements array, but here the swap operates differently on the two.

Let me try to clarify with a simple example. On a 4-CPU system suppose
we have this situation:

          CPU 1
          DL 50
          /   \
         /     \
        X       X
       /
      /
     X

 -- orig
 elements[dl/cpu] | 50/1 | 0/0  | 0/0  | 0/0  |
                     ^      \      \      \
 cpu_to_idx[idx]  |  -1  |  0   |  -1  |  -1  |

 -- yours
 elements[dl/cpu] | 50/1 | 0/0  | 0/0  | 0/0  |
                     ^      \      \      \
 elements[idx]    |  -1  |  0   |  -1  |  -1  |

So far, all is fine. But, what happens if I call cpudl_set(&cp, 2, 55, 1)?
No surprises: we insert the new element and then we try to bring it to
the root (as it has the max deadline). The new situation is:

          CPU 1
          DL 50
          /   \
         /     \
   ^  CPU 2     X
   |  DL 55
       /
      /
     X

 -- orig
 elements[dl/cpu] | 50/1 | 55/2 | 0/0  | 0/0  |
                     ^      ^      \      \
 cpu_to_idx[idx]  |  -1  |  0   |  1   |  -1  |

 -- yours
 elements[dl/cpu] | 50/1 | 55/2 | 0/0  | 0/0  |
                     ^      ^      \      \
 elements[idx]    |  -1  |  0   |  1   |  -1  |

Now, we do cpudl_exchange(&cp, 1, 0).
In orig we have

static void cpudl_exchange(struct cpudl *cp, int a, int b)
{
	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;

	swap(cp->elements[a], cp->elements[b]);
	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
}

note that here a = 1, b = 0, cpu_a = 2 and cpu_b = 1. While in yours

static void cpudl_exchange(struct cpudl *cp, int a, int b)
{
	swap(cp->elements[a], cp->elements[b]);
}

so we end up with

 -- orig
 elements[dl/cpu] | 55/2 | 50/1 | 0/0  | 0/0  |
                     ^      ^
                     |      |
              +------|------+
              |      |
 cpu_to_idx[idx]  |  -1  |  1   |  0   |  -1  |

 -- yours
 elements[dl/cpu] | 55/2 | 50/1 | 0/0  | 0/0  |
                     ^      ^
                     |      |
                     |  +---+
                     |  |
 elements[idx]    |  0   |  -1  |  1   |  -1  |

and this breaks how the heap works. For example, what if I want to
update CPU1's deadline? In orig I correctly find it at position 1 of the
elements array. But with the patch, CPU1's idx is IDX_INVALID.

Sorry for this long reply, but I had to convince also myself... So, I
think that having just one dynamic array is nicer and better, but we
still need to swap things separately.
Something like (on top of yours):

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 60ad0af..10dc7ab 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+static inline void swap_element(struct cpudl *cp, int a, int b)
+{
+	int cpu_tmp = cp->elements[a].cpu;
+	u64 dl_tmp = cp->elements[a].dl;
+
+	cp->elements[a].cpu = cp->elements[b].cpu;
+	cp->elements[a].dl = cp->elements[b].dl;
+	cp->elements[b].cpu = cpu_tmp;
+	cp->elements[b].dl = dl_tmp;
+}
+
+static inline void swap_idx(struct cpudl *cp, int a, int b)
+{
+	int idx_tmp = cp->elements[a].idx;
+
+	cp->elements[a].idx = cp->elements[b].idx;
+	cp->elements[b].idx = idx_tmp;
+}
+
 static void cpudl_exchange(struct cpudl *cp, int a, int b)
 {
-	swap(cp->elements[a], cp->elements[b]);
+	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+	swap_element(cp, a, b);
+	swap_idx(cp, cpu_a, cpu_b);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
---

But, I don't know if this is too ugly and we better go with two arrays
(or a better solution, as this is the fastest thing I could come up
with :)). In the meanwhile I'll test this more...
Thanks a lot,

- Juri

>  static void cpudl_heapify(struct cpudl *cp, int idx)
> @@ -140,7 +138,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  	WARN_ON(!cpu_present(cpu));
> 
>  	raw_spin_lock_irqsave(&cp->lock, flags);
> -	old_idx = cp->cpu_to_idx[cpu];
> +	old_idx = cp->elements[cpu].idx;
>  	if (!is_valid) {
>  		/* remove item */
>  		if (old_idx == IDX_INVALID) {
> @@ -155,8 +153,8 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
>  		cp->elements[old_idx].cpu = new_cpu;
>  		cp->size--;
> -		cp->cpu_to_idx[new_cpu] = old_idx;
> -		cp->cpu_to_idx[cpu] = IDX_INVALID;
> +		cp->elements[new_cpu].idx = old_idx;
> +		cp->elements[cpu].idx = IDX_INVALID;
>  		while (old_idx > 0 && dl_time_before(
>  				cp->elements[parent(old_idx)].dl,
>  				cp->elements[old_idx].dl)) {
> @@ -173,7 +171,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  		cp->size++;
>  		cp->elements[cp->size - 1].dl = 0;
>  		cp->elements[cp->size - 1].cpu = cpu;
> -		cp->cpu_to_idx[cpu] = cp->size - 1;
> +		cp->elements[cpu].idx = cp->size - 1;
>  		cpudl_change_key(cp, cp->size - 1, dl);
>  		cpumask_clear_cpu(cpu, cp->free_cpus);
>  	} else {
> @@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
>  	memset(cp, 0, sizeof(*cp));
>  	raw_spin_lock_init(&cp->lock);
>  	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(num_possible_cpus(),
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
>  		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
>  	cpumask_setall(cp->free_cpus);
> 
>  	return 0;
> diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
> index a202789a412c..538c9796ad4a 100644
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -5,17 +5,17 @@
> 
>  #define IDX_INVALID -1
> 
> -struct array_item {
> +struct cpudl_item {
>  	u64 dl;
>  	int cpu;
> +	int idx;
>  };
> 
>  struct cpudl {
>  	raw_spinlock_t lock;
>  	int size;
> -	int cpu_to_idx[NR_CPUS];
> -	struct array_item elements[NR_CPUS];
>  	cpumask_var_t free_cpus;
> +	struct cpudl_item *elements;
>  };
> 

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:40     ` Juri Lelli
@ 2014-05-15  8:51       ` Peter Zijlstra
  2014-05-15 11:52         ` Juri Lelli
  2014-05-16 10:43         ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-15  8:51 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 2626 bytes --]

On Thu, May 15, 2014 at 10:40:55AM +0200, Juri Lelli wrote:
> Hi,
> 
> > @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> > 
> >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> >  {
> > -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > -
> >  	swap(cp->elements[a], cp->elements[b]);
> > -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> >  }
> > 
> 
> I think there is a problem here. Your patch "embeds" cpu_to_idx array
> in elements array, but here the swap operates differently on the two.

<snip>

> Sorry for this long reply, but I had to convince also myself...

Glad you explained it before I tried to untangle that code myself.

> So, I think that having just one dynamic array is nicer and better, but
> we still need to swap things separately. Something like (on top of
> yours):
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 60ad0af..10dc7ab 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
>  	return (s64)(a - b) < 0;
>  }
>  
> +static inline void swap_element(struct cpudl *cp, int a, int b)
> +{
> +	int cpu_tmp = cp->elements[a].cpu;
> +	u64 dl_tmp = cp->elements[a].dl;
> +
> +	cp->elements[a].cpu = cp->elements[b].cpu;
> +	cp->elements[a].dl = cp->elements[b].dl;
> +	cp->elements[b].cpu = cpu_tmp;
> +	cp->elements[b].dl = dl_tmp;

You could've just written:

	swap(cp->elements[a].cpu, cp->elements[b].cpu);
	swap(cp->elements[a].dl , cp->elements[b].dl );

The swap macro does the tmp var for you.

> +}
> +
> +static inline void swap_idx(struct cpudl *cp, int a, int b)
> +{
> +	int idx_tmp = cp->elements[a].idx;
> +
> +	cp->elements[a].idx = cp->elements[b].idx;
> +	cp->elements[b].idx = idx_tmp;

	swap(cp->elements[a].idx, cp->elements[b].idx);

> +}
> +
>  static void cpudl_exchange(struct cpudl *cp, int a, int b)
>  {
> -	swap(cp->elements[a], cp->elements[b]);
> +	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> +
> +	swap_element(cp, a, b);
> +	swap_idx(cp, cpu_a, cpu_b);

Or just skip the lot and put the 3 swap() stmts here.

> }
> 
> static void cpudl_heapify(struct cpudl *cp, int idx)
> ---
> 
> But, I don't know if this is too ugly and we better go with two arrays
> (or a better solution, as this is the fastest thing I could come up
> with :)).

Thanks for looking at it, and sorry for breaking it.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:51       ` Peter Zijlstra
@ 2014-05-15 11:52         ` Juri Lelli
  0 siblings, 0 replies; 17+ messages in thread
From: Juri Lelli @ 2014-05-15 11:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Thu, 15 May 2014 10:51:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, May 15, 2014 at 10:40:55AM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > > @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> > > 
> > >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> > >  {
> > > -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > > -
> > >  	swap(cp->elements[a], cp->elements[b]);
> > > -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> > >  }
> > > 
> > 
> > I think there is a problem here. Your patch "embeds" cpu_to_idx array
> > in elements array, but here the swap operates differently on the two.
> 
> <snip>
> 
> > Sorry for this long reply, but I had to convince also myself...
> 
> Glad you explained it before I tried to untangle that code myself.
> 
> > So, I think that having just one dynamic array is nicer and better, but
> > we still need to swap things separately. Something like (on top of
> > yours):
> > 
> > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > index 60ad0af..10dc7ab 100644
> > --- a/kernel/sched/cpudeadline.c
> > +++ b/kernel/sched/cpudeadline.c
> > @@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
> >  	return (s64)(a - b) < 0;
> >  }
> >  
> > +static inline void swap_element(struct cpudl *cp, int a, int b)
> > +{
> > +	int cpu_tmp = cp->elements[a].cpu;
> > +	u64 dl_tmp = cp->elements[a].dl;
> > +
> > +	cp->elements[a].cpu = cp->elements[b].cpu;
> > +	cp->elements[a].dl = cp->elements[b].dl;
> > +	cp->elements[b].cpu = cpu_tmp;
> > +	cp->elements[b].dl = dl_tmp;
> 
> You could've just written:
> 
> 	swap(cp->elements[a].cpu, cp->elements[b].cpu);
> 	swap(cp->elements[a].dl , cp->elements[b].dl );
> 
> The swap macro does the tmp var for you.
> 
> > +}
> > +
> > +static inline void swap_idx(struct cpudl *cp, int a, int b)
> > +{
> > +	int idx_tmp = cp->elements[a].idx;
> > +
> > +	cp->elements[a].idx = cp->elements[b].idx;
> > +	cp->elements[b].idx = idx_tmp;
> 
> 	swap(cp->elements[a].idx, cp->elements[b].idx);
> 
> > +}
> > +
> >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> >  {
> > -	swap(cp->elements[a], cp->elements[b]);
> > +	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > +
> > +	swap_element(cp, a, b);
> > +	swap_idx(cp, cpu_a, cpu_b);
> 
> Or just skip the lot and put the 3 swap() stmts here.
> 

Ah, yes, sure! Maybe we could also add a comment to cpudeadline.c
explaining a bit how the thing works. Something along the lines of
cpupri.c:

/*
 *  kernel/sched/cpudeadline.c
 *
 *  Global CPU deadline management
 *
 *  Author: Juri Lelli <juri.lelli@gmail.com>
 *
 *  This code tracks the deadline of each CPU (i.e., the deadline of
 *  current on the CPU) so that global migration decisions are easy
 *  to calculate. Each CPU can be in one of the following states:
 *
 *      (INVALID), WITH_DL_TASK(S)
 *
 *  The system maintains this state with a max-heap implemented as
 *  a simple array. To efficiently handle updates on intermediate nodes
 *  the array can be thought of as split in two parts, one that contains
 *  heap nodes, and the other that keeps track of where nodes reside in
 *  the first part. From kernel/sched/cpudeadline.h we conceptually
 *  have:
 *
 *      struct cpudl_item {
 *              u64 dl;
 *              int cpu;
 *              -------------------
 *              int idx;
 *      }
 *
 *  Moreover, we keep track of CPUs in the INVALID state with a cpumask
 *  (no need to use the array if some CPU is free).
 *
 *  Let's clarify this with a simple example. Suppose at a certain
 *  instant of time we have this situation (4-CPU box):
 *
 *           CPU 1
 *           DL 50
 *           /   \
 *          /     \
 *      CPU 2     CPU 0
 *      DL 30     DL 40
 *        /
 *       /
 *    CPU 3
 *    DL 25
 *
 *  In this case the state is maintained as:
 *
 *   elements[dl/cpu] | 50/1 | 30/2 | 40/0 | 25/3 |
 *                       ^      ^      ^      ^
 *                       |      |      |      |
 *                       |  +---|------+      |
 *                       +--|---|--+          |
 *                    +-----+   |  |          |
 *                    |         |  |          |
 *   elements[idx]    |  2   |  0   |  1   |  3   |
 *
 *  Operations on the heap (e.g., node updates) must thus handle
 *  the two parts of the elements array separately, see cpudl_set()
 *  and cpudl_exchange() for details.
 *
 *  This program is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU General Public License
 *  as published by the Free Software Foundation; version 2
 *  of the License.
 */

Thanks,

- Juri

> > }
> > 
> > static void cpudl_heapify(struct cpudl *cp, int idx)
> > ---
> > 
> > But, I don't know if this is too ugly and we better go with two arrays
> > (or a better solution, as this is the fastest thing I could come up
> > with :)).
> 
> Thanks for looking at it, and sorry for breaking it.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:51       ` Peter Zijlstra
  2014-05-15 11:52         ` Juri Lelli
@ 2014-05-16 10:43         ` Peter Zijlstra
  2014-05-16 11:01           ` Juri Lelli
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-16 10:43 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 3666 bytes --]

OK I made that..

---
Subject: sched/cpudl: Replace NR_CPUS arrays
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 14 16:13:56 CEST 2014

Tejun reported that his resume was failing due to order-3 allocations
from sched_domain building.

Replace the NR_CPUS arrays in there with a dynamically allocated
array.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/cpudeadline.c | 33 ++++++++++++++++++++++++---------
 kernel/sched/cpudeadline.h |  6 +++---
 2 files changed, 27 insertions(+), 12 deletions(-)

--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -13,6 +13,7 @@
 
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/slab.h>
 #include "cpudeadline.h"
 
 static inline int parent(int i)
@@ -39,8 +40,10 @@ static void cpudl_exchange(struct cpudl
 {
 	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
 
-	swap(cp->elements[a], cp->elements[b]);
-	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+	swap(cp->elements[a].cpu, cp->elements[b].cpu);
+	swap(cp->elements[a].dl , cp->elements[b].dl );
+
+	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
@@ -140,7 +143,7 @@ void cpudl_set(struct cpudl *cp, int cpu
 	WARN_ON(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
-	old_idx = cp->cpu_to_idx[cpu];
+	old_idx = cp->elements[cpu].idx;
 	if (!is_valid) {
 		/* remove item */
 		if (old_idx == IDX_INVALID) {
@@ -155,8 +158,8 @@ void cpudl_set(struct cpudl *cp, int cpu
 		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
 		cp->elements[old_idx].cpu = new_cpu;
 		cp->size--;
-		cp->cpu_to_idx[new_cpu] = old_idx;
-		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		cp->elements[new_cpu].idx = old_idx;
+		cp->elements[cpu].idx = IDX_INVALID;
 		while (old_idx > 0 && dl_time_before(
 				cp->elements[parent(old_idx)].dl,
 				cp->elements[old_idx].dl)) {
@@ -173,7 +176,7 @@ void cpudl_set(struct cpudl *cp, int cpu
 		cp->size++;
 		cp->elements[cp->size - 1].dl = 0;
 		cp->elements[cp->size - 1].cpu = cpu;
-		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cp->elements[cpu].idx = cp->size - 1;
 		cpudl_change_key(cp, cp->size - 1, dl);
 		cpumask_clear_cpu(cpu, cp->free_cpus);
 	} else {
@@ -195,10 +198,21 @@ int cpudl_init(struct cpudl *cp)
 	memset(cp, 0, sizeof(*cp));
 	raw_spin_lock_init(&cp->lock);
 	cp->size = 0;
-	for (i = 0; i < NR_CPUS; i++)
-		cp->cpu_to_idx[i] = IDX_INVALID;
-	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+
+	cp->elements = kcalloc(nr_cpu_ids,
+			       sizeof(struct cpudl_item),
+			       GFP_KERNEL);
+	if (!cp->elements)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
+		kfree(cp->elements);
 		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(i)
+		cp->elements[i].idx = IDX_INVALID;
+
 	cpumask_setall(cp->free_cpus);
 
 	return 0;
@@ -211,4 +225,5 @@ int cpudl_init(struct cpudl *cp)
 void cpudl_cleanup(struct cpudl *cp)
 {
 	free_cpumask_var(cp->free_cpus);
+	kfree(cp->elements);
 }
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -5,17 +5,17 @@
 
 #define IDX_INVALID -1
 
-struct array_item {
+struct cpudl_item {
 	u64 dl;
 	int cpu;
+	int idx;
 };
 
 struct cpudl {
 	raw_spinlock_t lock;
 	int size;
-	int cpu_to_idx[NR_CPUS];
-	struct array_item elements[NR_CPUS];
 	cpumask_var_t free_cpus;
+	struct cpudl_item *elements;
 };

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-16 10:43         ` Peter Zijlstra
@ 2014-05-16 11:01           ` Juri Lelli
  2014-05-16 11:04             ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2014-05-16 11:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Fri, 16 May 2014 12:43:36 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> 
> OK I made that..
> 

Are the comments I proposed to add overdoing? Apart from this,

Acked-by: Juri Lelli <juri.lelli@gmail.com>

Thanks!

- Juri

> ---
> 
> Subject: sched/cpudl: Replace NR_CPUS arrays
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed May 14 16:13:56 CEST 2014
> 
> Tejun reported that his resume was failing due to order-3 allocations
> from sched_domain building.
> 
> Replace the NR_CPUS arrays in there with a dynamically allocated
> array.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Juri Lelli <juri.lelli@gmail.com>
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/cpudeadline.c | 33 ++++++++++++++++++++++++---------
>  kernel/sched/cpudeadline.h |  6 +++---
>  2 files changed, 27 insertions(+), 12 deletions(-)
> 
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -13,6 +13,7 @@
> 
>  #include <linux/gfp.h>
>  #include <linux/kernel.h>
> +#include <linux/slab.h>
>  #include "cpudeadline.h"
> 
>  static inline int parent(int i)
> @@ -39,8 +40,10 @@ static void cpudl_exchange(struct cpudl
>  {
>  	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> 
> -	swap(cp->elements[a], cp->elements[b]);
> -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> +	swap(cp->elements[a].cpu, cp->elements[b].cpu);
> +	swap(cp->elements[a].dl , cp->elements[b].dl );
> +
> +	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
>  }
> 
>  static void cpudl_heapify(struct cpudl *cp, int idx)
> @@ -140,7 +143,7 @@ void cpudl_set(struct cpudl *cp, int cpu
>  	WARN_ON(!cpu_present(cpu));
> 
>  	raw_spin_lock_irqsave(&cp->lock, flags);
> -	old_idx = cp->cpu_to_idx[cpu];
> +	old_idx = cp->elements[cpu].idx;
>  	if (!is_valid) {
>  		/* remove item */
>  		if (old_idx == IDX_INVALID) {
> @@ -155,8 +158,8 @@ void cpudl_set(struct cpudl *cp, int cpu
>  		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
>  		cp->elements[old_idx].cpu = new_cpu;
>  		cp->size--;
> -		cp->cpu_to_idx[new_cpu] = old_idx;
> -		cp->cpu_to_idx[cpu] = IDX_INVALID;
> +		cp->elements[new_cpu].idx = old_idx;
> +		cp->elements[cpu].idx = IDX_INVALID;
>  		while (old_idx > 0 && dl_time_before(
>  				cp->elements[parent(old_idx)].dl,
>  				cp->elements[old_idx].dl)) {
> @@ -173,7 +176,7 @@ void cpudl_set(struct cpudl *cp, int cpu
>  		cp->size++;
>  		cp->elements[cp->size - 1].dl = 0;
>  		cp->elements[cp->size - 1].cpu = cpu;
> -		cp->cpu_to_idx[cpu] = cp->size - 1;
> +		cp->elements[cpu].idx = cp->size - 1;
>  		cpudl_change_key(cp, cp->size - 1, dl);
>  		cpumask_clear_cpu(cpu, cp->free_cpus);
>  	} else {
> @@ -195,10 +198,21 @@ int cpudl_init(struct cpudl *cp)
>  	memset(cp, 0, sizeof(*cp));
>  	raw_spin_lock_init(&cp->lock);
>  	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(nr_cpu_ids,
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
>  		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
>  	cpumask_setall(cp->free_cpus);
> 
>  	return 0;
> @@ -211,4 +225,5 @@ int cpudl_init(struct cpudl *cp)
>  void cpudl_cleanup(struct cpudl *cp)
>  {
>  	free_cpumask_var(cp->free_cpus);
> +	kfree(cp->elements);
>  }
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -5,17 +5,17 @@
> 
>  #define IDX_INVALID -1
> 
> -struct array_item {
> +struct cpudl_item {
>  	u64 dl;
>  	int cpu;
> +	int idx;
>  };
> 
>  struct cpudl {
>  	raw_spinlock_t lock;
>  	int size;
> -	int cpu_to_idx[NR_CPUS];
> -	struct array_item elements[NR_CPUS];
>  	cpumask_var_t free_cpus;
> +	struct cpudl_item *elements;
>  };
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-16 11:01           ` Juri Lelli
@ 2014-05-16 11:04             ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-16 11:04 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 274 bytes --]

On Fri, May 16, 2014 at 01:01:53PM +0200, Juri Lelli wrote:
> Are the comments I proposed to add overdoing?

Dunno, might be helpful, if you post them as a proper patch I'll press
'A'.

> Apart from this,
> 
> Acked-by: Juri Lelli <juri.lelli@gmail.com>

Thanks!

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2014-05-16 11:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2014-05-09 16:04 [REGRESSION] funny sched_domain build failure during resume Tejun Heo
2014-05-09 16:15 ` Peter Zijlstra
2014-05-14 14:00 ` Peter Zijlstra
2014-05-14 14:05   ` Peter Zijlstra
2014-05-14 17:02   ` Tejun Heo
2014-05-14 17:10     ` Peter Zijlstra
2014-05-14 22:36       ` Tejun Heo
2014-05-15  8:57         ` Peter Zijlstra
2014-05-15 14:41           ` Johannes Weiner
2014-05-15 15:12             ` Peter Zijlstra
2014-05-16 11:47               ` Srivatsa S. Bhat
2014-05-15  8:40   ` Juri Lelli
2014-05-15  8:51     ` Peter Zijlstra
2014-05-15 11:52       ` Juri Lelli
2014-05-16 10:43       ` Peter Zijlstra
2014-05-16 11:01         ` Juri Lelli
2014-05-16 11:04           ` Peter Zijlstra