* [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() @ 2015-04-03 12:41 Konstantin Khlebnikov 2015-04-03 12:51 ` Konstantin Khlebnikov 2015-04-06 22:45 ` bsegall 0 siblings, 2 replies; 10+ messages in thread From: Konstantin Khlebnikov @ 2015-04-03 12:41 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: Ben Segall, Roman Gushchin Pick_next_task_fair() must be sure that there is at least one runnable task before calling put_prev_task(), but put_prev_task() can expire the last remnants of the cfs quota and throttle all currently runnable tasks. As a result pick_next_task_fair() cannot find a next task and crashes. This patch leaves 1 in ->runtime_remaining when the current assignment expires and tries to refill it right after that. In the worst case the task will be scheduled once and throttled at the end of the slice. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> --- kernel/sched/fair.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7ce18f3c097a..91785d077db4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - /* if the deadline is ahead of our clock, nothing to do */ - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) + /* nothing to expire */ + if (cfs_rq->runtime_remaining <= 0) return; - if (cfs_rq->runtime_remaining < 0) + /* if the deadline is ahead of our clock, nothing to do */ + if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) return; /* @@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; } else { - /* global deadline is ahead, expiration has passed */ - cfs_rq->runtime_remaining = 0; + /* + * Global 
deadline is ahead, expiration has passed. + * + * Do not expire runtime completely. Otherwise put_prev_task() + * can throttle all tasks when we already checked nr_running or + * put_prev_entity() can throttle already chosen next entity. + */ + cfs_rq->runtime_remaining = 1; } } @@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) cfs_rq->runtime_remaining -= delta_exec; expire_cfs_rq_runtime(cfs_rq); - if (likely(cfs_rq->runtime_remaining > 0)) + if (likely(cfs_rq->runtime_remaining > 1)) return; /* ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-03 12:41 [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() Konstantin Khlebnikov @ 2015-04-03 12:51 ` Konstantin Khlebnikov 2015-04-07 12:52 ` Peter Zijlstra 2015-04-06 22:45 ` bsegall 1 sibling, 1 reply; 10+ messages in thread From: Konstantin Khlebnikov @ 2015-04-03 12:51 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: Ben Segall, Roman Gushchin On 03.04.2015 15:41, Konstantin Khlebnikov wrote: > Pick_next_task_fair() must be sure that here is at least one runnable > task before calling put_prev_task(), but put_prev_task() can expire > last remains of cfs quota and throttle all currently runnable tasks. > As a result pick_next_task_fair() cannot find next task and crashes. Kernel crash looks like this: <1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 <1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80 <4>[50288.719567] PGD 0 <4>[50288.719578] Oops: 0000 [#1] SMP <4>[50288.719594] Modules linked in: vhost_net macvtap macvlan vhost 8021q mrp garp ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc netconsole configfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw ttm gf128mul drm_kms_helper drm glue_helper aes_x86_64 i2c_algo_bit sysimgblt sysfillrect i2c_core sb_edac edac_core syscopyarea microcode ipmi_si ipmi_msghandler lpc_ich ioatdma dca mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp e1000e ptp pps_core ahci libahci raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath<4>[50288.719956] linear <4>[50288.719964] CPU: 27 PID: 11505 Comm: kvm Not tainted 3.18.10-7 #7 
<4>[50288.719987] Hardware name: <4>[50288.720015] task: ffff880036acbaa0 ti: ffff8808445f8000 task.ti: ffff8808445f8000 <4>[50288.720041] RIP: 0010:[<ffffffff81097b8c>] [<ffffffff81097b8c>] set_next_entity+0x1c/0x80 <4>[50288.720072] RSP: 0018:ffff8808445fbbb8 EFLAGS: 00010086 <4>[50288.720091] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000bcb8 <4>[50288.720116] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88107fd72af0 <4>[50288.720141] RBP: ffff8808445fbbd8 R08: 0000000000000000 R09: 0000000000000001 <4>[50288.720165] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 <4>[50288.720190] R13: 0000000000000000 R14: ffff880b6f250030 R15: ffff88107fd72af0 <4>[50288.720214] FS: 00007f55467fc700(0000) GS:ffff88107fd60000(0000) knlGS:ffff8802175e0000 <4>[50288.720242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[50288.720262] CR2: 0000000000000038 CR3: 0000000324ede000 CR4: 00000000000427e0 <4>[50288.720287] Stack: <4>[50288.720296] ffff88107fd72a80 ffff88107fd72a80 0000000000000000 0000000000000000 <4>[50288.720327] ffff8808445fbc68 ffffffff8109ead8 ffff880800000000 ffffffffa1438990 <4>[50288.720357] ffff880b6f250000 0000000000000000 0000000000012a80 ffff880036acbaa0 <4>[50288.720388] Call Trace: <4>[50288.720402] [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0 <4>[50288.720429] [<ffffffffa1438990>] ? vmx_fpu_activate.part.63+0x90/0xb0 [kvm_intel] <4>[50288.720457] [<ffffffff81096b95>] ? sched_clock_cpu+0x85/0xc0 <4>[50288.720479] [<ffffffff816b5b99>] __schedule+0xf9/0x7d0 <4>[50288.720500] [<ffffffff816bb210>] ? reboot_interrupt+0x80/0x80 <4>[50288.720522] [<ffffffff816b630a>] _cond_resched+0x2a/0x40 <4>[50288.720549] [<ffffffffa03dd8c5>] __vcpu_run+0xd35/0xf30 [kvm] <4>[50288.720573] [<ffffffff81075fc7>] ? __set_task_blocked+0x37/0x80 <4>[50288.720595] [<ffffffff8109387e>] ? 
try_to_wake_up+0x21e/0x360 <4>[50288.720622] [<ffffffffa03ddb65>] kvm_arch_vcpu_ioctl_run+0xa5/0x220 [kvm] <4>[50288.720650] [<ffffffffa03c48b2>] kvm_vcpu_ioctl+0x2c2/0x620 [kvm] <4>[50288.720675] [<ffffffff811c01c6>] do_vfs_ioctl+0x86/0x4f0 <4>[50288.720697] [<ffffffff810d14a2>] ? SyS_futex+0x142/0x1a0 <4>[50288.720717] [<ffffffff811c06c1>] SyS_ioctl+0x91/0xb0 <4>[50288.720737] [<ffffffff816ba489>] system_call_fastpath+0x12/0x17 <4>[50288.720758] Code: c7 47 60 00 00 00 00 45 31 c0 e9 0c ff ff ff 66 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74 <1>[50288.722636] RIP [<ffffffff81097b8c>] set_next_entity+0x1c/0x80 <4>[50288.723533] RSP <ffff8808445fbbb8> <4>[50288.724406] CR2: 0000000000000038 in pick_next_task_fair() cfs_rq->nr_running was non-zero but after put_prev_task(rq, prev) kernel cannot find any tasks to schedule next. It crashes from time to time on strange libvirt/kvm setup where cfs_quota is set on two levels: at parent cgroup which contains kvm and at per-vcpu child cgroup. This patch isn't verified yet. But I haven't found any other possible reasons for that crash. > > This patch leaves 1 in ->runtime_remaining when current assignation > expires and tries to refill it right after that. In the worst case > task will be scheduled once and throttled at the end of slice. 
> > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > --- > kernel/sched/fair.c | 19 +++++++++++++------ > 1 file changed, 13 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 7ce18f3c097a..91785d077db4 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) > { > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); > > - /* if the deadline is ahead of our clock, nothing to do */ > - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) > + /* nothing to expire */ > + if (cfs_rq->runtime_remaining <= 0) > return; > > - if (cfs_rq->runtime_remaining < 0) > + /* if the deadline is ahead of our clock, nothing to do */ > + if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) > return; > > /* > @@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) > /* extend local deadline, drift is bounded above by 2 ticks */ > cfs_rq->runtime_expires += TICK_NSEC; > } else { > - /* global deadline is ahead, expiration has passed */ > - cfs_rq->runtime_remaining = 0; > + /* > + * Global deadline is ahead, expiration has passed. > + * > + * Do not expire runtime completely. Otherwise put_prev_task() > + * can throttle all tasks when we already checked nr_running or > + * put_prev_entity() can throttle already chosen next entity. > + */ > + cfs_rq->runtime_remaining = 1; > } > } > > @@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) > cfs_rq->runtime_remaining -= delta_exec; > expire_cfs_rq_runtime(cfs_rq); > > - if (likely(cfs_rq->runtime_remaining > 0)) > + if (likely(cfs_rq->runtime_remaining > 1)) > return; > > /* > -- Konstantin ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-03 12:51 ` Konstantin Khlebnikov @ 2015-04-07 12:52 ` Peter Zijlstra 2015-04-07 13:47 ` Peter Zijlstra 0 siblings, 1 reply; 10+ messages in thread From: Peter Zijlstra @ 2015-04-07 12:52 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin On Fri, Apr 03, 2015 at 03:51:17PM +0300, Konstantin Khlebnikov wrote: > On 03.04.2015 15:41, Konstantin Khlebnikov wrote: > >Pick_next_task_fair() must be sure that here is at least one runnable > >task before calling put_prev_task(), but put_prev_task() can expire > >last remains of cfs quota and throttle all currently runnable tasks. > >As a result pick_next_task_fair() cannot find next task and crashes. > > Kernel crash looks like this: > > <1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 > <1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80 > <4>[50288.720388] Call Trace: > <4>[50288.720402] [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0 > <4>[50288.720479] [<ffffffff816b5b99>] __schedule+0xf9/0x7d0 Which set_next_entity() is that? There are 3 in pick_next_task_fair(). I have a vague suspicion it's in the 'simple' code, please verify. The thinking is that if it was the 'complex' pick_next_entity() returning NULL we'd have exploded elsewhere, the cfs_rq iteration would've wandered off into random memory and most likely exploded on cfs_rq->curr. Which too would suggest the check_cfs_rq_runtime() thing works just fine, it sends us to the simple code. > >This patch leaves 1 in ->runtime_remaining when current assignation > >expires and tries to refill it right after that. In the worst case > >task will be scheduled once and throttled at the end of slice. Which is a strange approach. If pick_next_task_fair() is borken, we should fix that, no? 
In any case, it appears to me that: 606dba2e2894 ("sched: Push put_prev_task() into pick_next_task()") inverted the ->nr_running and put_prev_task() statements. If the above set_next_entity() is indeed the simple one, does the below cure things? --- kernel/sched/fair.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fdae26eb7218..df72d61138a8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev) simple: cfs_rq = &rq->cfs; #endif + put_prev_task(rq, prev); if (!cfs_rq->nr_running) goto idle; - put_prev_task(rq, prev); - do { se = pick_next_entity(cfs_rq, NULL); set_next_entity(cfs_rq, se); ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-07 12:52 ` Peter Zijlstra @ 2015-04-07 13:47 ` Peter Zijlstra 2015-04-07 15:12 ` Peter Zijlstra 0 siblings, 1 reply; 10+ messages in thread From: Peter Zijlstra @ 2015-04-07 13:47 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin On Tue, Apr 07, 2015 at 02:52:51PM +0200, Peter Zijlstra wrote: > If the above set_next_entity() is indeed the simple one, does the below > cure things? > > --- > kernel/sched/fair.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index fdae26eb7218..df72d61138a8 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev) > simple: > cfs_rq = &rq->cfs; > #endif > + put_prev_task(rq, prev); > > if (!cfs_rq->nr_running) > goto idle; > > - put_prev_task(rq, prev); > - > do { > se = pick_next_entity(cfs_rq, NULL); > set_next_entity(cfs_rq, se); Bah, that's broken because if we end up going idle pick_next_task_idle() is going to do put_prev_task() again. Lemme think a bit more on that. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-07 13:47 ` Peter Zijlstra @ 2015-04-07 15:12 ` Peter Zijlstra 2015-04-07 15:32 ` Konstantin Khlebnikov 0 siblings, 1 reply; 10+ messages in thread From: Peter Zijlstra @ 2015-04-07 15:12 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote: > Lemme think a bit more on that. So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from empty queue") something like this would be called for. --- kernel/sched/fair.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fdae26eb7218..1e47f816c976 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev) return p; simple: cfs_rq = &rq->cfs; + + if (cfs_bandwidth_used()) { + /* + * We may dequeue prev's cfs_rq in put_prev_task(). + * So, we update time before cfs_rq->nr_running check. + */ + if (prev->on_rq) + update_curr(cfs_rq); + + se = &prev->se; + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + check_cfs_rq_runtime(cfs_rq); + } + } #endif if (!cfs_rq->nr_running) ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-07 15:12 ` Peter Zijlstra @ 2015-04-07 15:32 ` Konstantin Khlebnikov 2015-04-07 15:43 ` Peter Zijlstra 0 siblings, 1 reply; 10+ messages in thread From: Konstantin Khlebnikov @ 2015-04-07 15:32 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin On 07.04.2015 18:12, Peter Zijlstra wrote: > On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote: >> Lemme think a bit more on that. > > So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from > empty queue") something like this would be called for. > > --- > kernel/sched/fair.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index fdae26eb7218..1e47f816c976 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev) > return p; > simple: > cfs_rq = &rq->cfs; We don't need this if prev isn't from fair_sched_class: first "goto simple" should go directly to "if (!cfs_rq->nr_running)". > + > + if (cfs_bandwidth_used()) { > + /* > + * We may dequeue prev's cfs_rq in put_prev_task(). > + * So, we update time before cfs_rq->nr_running check. > + */ > + if (prev->on_rq) > + update_curr(cfs_rq); > + > + se = &prev->se; > + for_each_sched_entity(se) { Probably update_curr() should be inside loop too? > + cfs_rq = cfs_rq_of(se); > + check_cfs_rq_runtime(cfs_rq); > + } > + } > #endif > > if (!cfs_rq->nr_running) > -- Konstantin ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-07 15:32 ` Konstantin Khlebnikov @ 2015-04-07 15:43 ` Peter Zijlstra 0 siblings, 0 replies; 10+ messages in thread From: Peter Zijlstra @ 2015-04-07 15:43 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin On Tue, Apr 07, 2015 at 06:32:47PM +0300, Konstantin Khlebnikov wrote: > On 07.04.2015 18:12, Peter Zijlstra wrote: > >On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote: > >>Lemme think a bit more on that. > > > >So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from > >empty queue") something like this would be called for. > > > >--- > > kernel/sched/fair.c | 15 +++++++++++++++ > > 1 file changed, 15 insertions(+) > > > >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >index fdae26eb7218..1e47f816c976 100644 > >--- a/kernel/sched/fair.c > >+++ b/kernel/sched/fair.c > >@@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev) > > return p; > > simple: > > cfs_rq = &rq->cfs; > > We don't need this if prev isn't from fair_sched_class: > first "goto simple" should go directly to "if (!cfs_rq->nr_running)". Just add that to the test here: > >+ if (cfs_bandwidth_used()) { > >+ /* > >+ * We may dequeue prev's cfs_rq in put_prev_task(). > >+ * So, we update time before cfs_rq->nr_running check. > >+ */ > >+ if (prev->on_rq) > >+ update_curr(cfs_rq); > >+ > >+ se = &prev->se; > >+ for_each_sched_entity(se) { > > Probably update_curr() should be inside loop too? Ah, yes, you're right. > >+ cfs_rq = cfs_rq_of(se); > >+ check_cfs_rq_runtime(cfs_rq); > >+ } > >+ } > > #endif > > > > if (!cfs_rq->nr_running) > > > > > -- > Konstantin ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-03 12:41 [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() Konstantin Khlebnikov 2015-04-03 12:51 ` Konstantin Khlebnikov @ 2015-04-06 22:45 ` bsegall 2015-04-07 15:53 ` Konstantin Khlebnikov 2015-06-07 17:47 ` [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() tip-bot for Ben Segall 1 sibling, 2 replies; 10+ messages in thread From: bsegall @ 2015-04-06 22:45 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Roman Gushchin, pjt Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > Pick_next_task_fair() must be sure that here is at least one runnable > task before calling put_prev_task(), but put_prev_task() can expire > last remains of cfs quota and throttle all currently runnable tasks. > As a result pick_next_task_fair() cannot find next task and crashes. > > This patch leaves 1 in ->runtime_remaining when current assignation > expires and tries to refill it right after that. In the worst case > task will be scheduled once and throttled at the end of slice. > I don't think expire_cfs_rq_runtime is the problem. What I believe happens is this: /prev/some_task is running, calls schedule() with nr_running == 2. pick_next's first do/while loop does update_curr(/) and picks /next, and the next iteration just sees check_cfs_rq_runtime(/next), and thus does goto simple. However, there is now only /prev/some_task runnable, and it hasn't checked the entire prev hierarchy for throttling, thus leading to the crash. This would require that check_cfs_rq_runtime(/next) return true despite being on_rq though, which iirc is not supposed to happen (note that we do not call update_curr(/next), and it would do nothing if we did, because /next isn't part of the current thread's hierarchy). 
However, this /can/ happen if runtime has just been (re)enabled on /next, because tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1. The idea was that each rq would grab runtime when they were scheduled (pick_next_task_fair didn't ever look at throttling info), so this was fine with the old code, but is a problem now. I think it would be sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably more precise option would be to only check_cfs_rq_runtime if cfs_rq->curr is set, but the code is slightly less pretty. Could you check this patch to see if it works (or the trivial tg_set_bandwidth runtime_remaining = 1 patch)? ---8<----->8--- >From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001 From: Ben Segall <bsegall@google.com> Date: Mon, 6 Apr 2015 15:28:10 -0700 Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed to trigger when cfs_rq is still an ancestor of prev. However, it was able to trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global pool. Fix this by only calling check_cfs_rq_runtime if we are still in prev's ancestry, as evidenced by cfs_rq->curr. Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Ben Segall <bsegall@google.com> --- kernel/sched/fair.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ee595ef..5cb52e9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5038,18 +5038,21 @@ again: * entity, update_curr() will update its vruntime, otherwise * forget we've ever seen it. 
*/ - if (curr && curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; + if (curr) { + if (curr->on_rq) + update_curr(cfs_rq); + else + curr = NULL; - /* - * This call to check_cfs_rq_runtime() will do the throttle and - * dequeue its entity in the parent(s). Therefore the 'simple' - * nr_running test will indeed be correct. - */ - if (unlikely(check_cfs_rq_runtime(cfs_rq))) - goto simple; + /* + * This call to check_cfs_rq_runtime() will do the + * throttle and dequeue its entity in the parent(s). + * Therefore the 'simple' nr_running test will indeed + * be correct. + */ + if (unlikely(check_cfs_rq_runtime(cfs_rq))) + goto simple; + } se = pick_next_entity(cfs_rq, curr); cfs_rq = group_cfs_rq(se); -- 2.2.0.rc0.207.ga3a616c ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() 2015-04-06 22:45 ` bsegall @ 2015-04-07 15:53 ` Konstantin Khlebnikov 2015-06-07 17:47 ` [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() tip-bot for Ben Segall 1 sibling, 0 replies; 10+ messages in thread From: Konstantin Khlebnikov @ 2015-04-07 15:53 UTC (permalink / raw) To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Roman Gushchin, pjt On 07.04.2015 01:45, bsegall@google.com wrote: > Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > >> Pick_next_task_fair() must be sure that here is at least one runnable >> task before calling put_prev_task(), but put_prev_task() can expire >> last remains of cfs quota and throttle all currently runnable tasks. >> As a result pick_next_task_fair() cannot find next task and crashes. >> >> This patch leaves 1 in ->runtime_remaining when current assignation >> expires and tries to refill it right after that. In the worst case >> task will be scheduled once and throttled at the end of slice. >> > > I don't think expire_cfs_rq_runtime is the problem. What I believe > happens is this: > > /prev/some_task is running, calls schedule() with nr_running == 2. > pick_next's first do/while loop does update_curr(/) and picks /next, and > the next iteration just sees check_cfs_rq_runtime(/next), and thus does > goto simple. However, there is now only /prev/some_task runnable, and it > hasn't checked the entire prev hierarchy for throttling, thus leading to > the crash. > > This would require that check_cfs_rq_runtime(/next) return true despite > being on_rq though, which iirc is not supposed to happen (note that we > do not call update_curr(/next), and it would do nothing if we did, > because /next isn't part of the current thread's hierarchy). However, > this /can/ happen if runtime has just been (re)enabled on /next, because > tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1. 
Yeah, this seems possible too. > The idea was that each rq would grab runtime when they were scheduled > (pick_next_task_fair didn't ever look at throttling info), so this was > fine with the old code, but is a problem now. I think it would be > sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably > more precise option would be to only check_cfs_rq_runtime if > cfs_rq->curr is set, but the code is slightly less pretty. Probably this code should call something like account_cfs_rq_runtime(cfs_rq, 0); > > Could you check this patch to see if it works (or the trivial > tg_set_bandwidth runtime_remaining = 1 patch)? I'm not sure that I'll see this bug again: we're replacing this setup with something different. > > ---8<----->8--- > > From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001 > From: Ben Segall <bsegall@google.com> > Date: Mon, 6 Apr 2015 15:28:10 -0700 > Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair > > The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed > to trigger when cfs_rq is still an ancestor of prev. However, it was able to > trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth > set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global > pool. > > Fix this by only calling check_cfs_rq_runtime if we are still in prev's > ancestry, as evidenced by cfs_rq->curr. > > Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > Signed-off-by: Ben Segall <bsegall@google.com> > --- > kernel/sched/fair.c | 25 ++++++++++++++----------- > 1 file changed, 14 insertions(+), 11 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ee595ef..5cb52e9 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5038,18 +5038,21 @@ again: > * entity, update_curr() will update its vruntime, otherwise > * forget we've ever seen it. 
> */ > - if (curr && curr->on_rq) > - update_curr(cfs_rq); > - else > - curr = NULL; > + if (curr) { > + if (curr->on_rq) > + update_curr(cfs_rq); > + else > + curr = NULL; > > - /* > - * This call to check_cfs_rq_runtime() will do the throttle and > - * dequeue its entity in the parent(s). Therefore the 'simple' > - * nr_running test will indeed be correct. > - */ > - if (unlikely(check_cfs_rq_runtime(cfs_rq))) > - goto simple; > + /* > + * This call to check_cfs_rq_runtime() will do the > + * throttle and dequeue its entity in the parent(s). > + * Therefore the 'simple' nr_running test will indeed > + * be correct. > + */ > + if (unlikely(check_cfs_rq_runtime(cfs_rq))) > + goto simple; > + } > > se = pick_next_entity(cfs_rq, curr); > cfs_rq = group_cfs_rq(se); > -- Konstantin ^ permalink raw reply [flat|nested] 10+ messages in thread
* [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() 2015-04-06 22:45 ` bsegall 2015-04-07 15:53 ` Konstantin Khlebnikov @ 2015-06-07 17:47 ` tip-bot for Ben Segall 1 sibling, 0 replies; 10+ messages in thread From: tip-bot for Ben Segall @ 2015-06-07 17:47 UTC (permalink / raw) To: linux-tip-commits Cc: mingo, torvalds, akpm, bsegall, mnaser, khlebnikov, peterz, hpa, klamm, tglx, linux-kernel Commit-ID: 54d27365cae88fbcc853b391dcd561e71acb81fa Gitweb: http://git.kernel.org/tip/54d27365cae88fbcc853b391dcd561e71acb81fa Author: Ben Segall <bsegall@google.com> AuthorDate: Mon, 6 Apr 2015 15:28:10 -0700 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Sun, 7 Jun 2015 15:57:44 +0200 sched/fair: Prevent throttling in early pick_next_task_fair() The optimized task selection logic optimistically selects a new task to run without first doing a full put_prev_task(). This is so that we can avoid a put/set on the common ancestors of the old and new task. Similarly, we should only call check_cfs_rq_runtime() to throttle eligible groups if they're part of the common ancestry, otherwise it is possible to end up with no eligible task in the simple task selection. Imagine: /root /prev /next /A /B If our optimistic selection ends up throttling /next, we goto simple and our put_prev_task() ends up throttling /prev, after which we're going to bug out in set_next_entity() because there aren't any tasks left. Avoid this scenario by only throttling common ancestors. Reported-by: Mohammed Naser <mnaser@vexxhost.com> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Ben Segall <bsegall@google.com> [ munged Changelog ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. 
Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pjt@google.com Fixes: 678d5718d8d0 ("sched/fair: Optimize cgroup pick_next_task_fair()") Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0d4632f..84ada05 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5322,18 +5322,21 @@ again: * entity, update_curr() will update its vruntime, otherwise * forget we've ever seen it. */ - if (curr && curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; + if (curr) { + if (curr->on_rq) + update_curr(cfs_rq); + else + curr = NULL; - /* - * This call to check_cfs_rq_runtime() will do the throttle and - * dequeue its entity in the parent(s). Therefore the 'simple' - * nr_running test will indeed be correct. - */ - if (unlikely(check_cfs_rq_runtime(cfs_rq))) - goto simple; + /* + * This call to check_cfs_rq_runtime() will do the + * throttle and dequeue its entity in the parent(s). + * Therefore the 'simple' nr_running test will indeed + * be correct. + */ + if (unlikely(check_cfs_rq_runtime(cfs_rq))) + goto simple; + } se = pick_next_entity(cfs_rq, curr); cfs_rq = group_cfs_rq(se); ^ permalink raw reply related [flat|nested] 10+ messages in thread