From: Nicholas Piggin
To: linux-kernel
Cc: Nicholas Piggin, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Paul E. McKenney
Date: 2018-04-15 13:31 UTC
Subject: [RFC PATCH] kernel/sched/core: busy wait before going idle

This is a quick hack for comments, but I've always wondered --
if we have short-term polling idle states in cpuidle for performance
-- why not skip the context switch and entry into all the idle states,
and just wait for a bit to see if something wakes up again.

It's not uncommon to see various going-to-idle work in kernel profiles.
This might be a way to reduce that (and just the cost of switching
registers and kernel stack to idle thread). This can be an important
path for single thread request-response throughput.

tbench bandwidth seems to be improved (the numbers aren't too stable
but they pretty consistently show some gain). 10-20% would be a pretty
nice gain for such workloads.

clients     1     2     4     8    16   128
vanilla   232   467   823  1819  3218  9065
patched   310   503   962  2465  3743  9820

---
 kernel/sched/core.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8afd6086f23..30a0b13edfa5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3404,6 +3404,7 @@ static void __sched notrace __schedule(bool preempt)
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
+	bool do_idle_spin = true;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
@@ -3428,6 +3429,7 @@ static void __sched notrace __schedule(bool preempt)
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
 
+idle_spin_end:
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
@@ -3437,6 +3439,32 @@ static void __sched notrace __schedule(bool preempt)
 	if (unlikely(signal_pending_state(prev->state, prev))) {
 		prev->state = TASK_RUNNING;
 	} else {
+		/*
+		 * Busy wait before switching to idle thread. This
+		 * is marked unlikely because we're idle so jumping
+		 * out of line doesn't matter too much.
+		 */
+		if (unlikely(do_idle_spin && rq->nr_running == 1)) {
+			u64 start;
+
+			do_idle_spin = false;
+
+			rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
+			rq_unlock_irq(rq, &rf);
+
+			spin_begin();
+			start = local_clock();
+			while (!need_resched() && prev->state &&
+			       !signal_pending_state(prev->state, prev)) {
+				spin_cpu_relax();
+				if (local_clock() - start > 1000000)
+					break;
+			}
+			spin_end();
+
+			rq_lock_irq(rq, &rf);
+			goto idle_spin_end;
+		}
 		deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 		prev->on_rq = 0;
-- 
2.17.0
From: Peter Zijlstra
To: Nicholas Piggin
Cc: linux-kernel, Ingo Molnar, Rafael J. Wysocki, Paul E. McKenney
Date: 2018-04-20 7:44 UTC
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> This is a quick hack for comments, but I've always wondered --
> if we have short-term polling idle states in cpuidle for performance
> -- why not skip the context switch and entry into all the idle states,
> and just wait for a bit to see if something wakes up again.

Is that context switch so expensive?

And what kernel did you test on? We recently merged a bunch of patches
from Rafael that avoided disabling the tick for short idle predictions.
That also has performance improvements for such workloads. Did your
kernel include those?

> It's not uncommon to see various going-to-idle work in kernel profiles.
> This might be a way to reduce that (and just the cost of switching
> registers and kernel stack to idle thread). This can be an important
> path for single thread request-response throughput.

So I feel that _if_ we do a spin here, it should only be long enough to
amortize the schedule switch context.

However, doing busy waits here has the downside that the 'idle' time is
not in fact fed into the cpuidle predictor.
From: Nicholas Piggin
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Rafael J. Wysocki, Paul E. McKenney
Date: 2018-04-20 9:01 UTC
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Fri, 20 Apr 2018 09:44:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > This is a quick hack for comments, but I've always wondered --
> > if we have short-term polling idle states in cpuidle for performance
> > -- why not skip the context switch and entry into all the idle states,
> > and just wait for a bit to see if something wakes up again.
>
> Is that context switch so expensive?

I guess relatively much more than taking one branch mispredict on the
loop exit when the task wakes. 10s of cycles vs 1000s?

> And what kernel did you test on? We recently merged a bunch of patches
> from Rafael that avoided disabling the tick for short idle predictions.
> That also has performance improvements for such workloads. Did your
> kernel include those?

Yes, that actually improved profiles quite a lot, but these numbers
were with those changes. I'll try to find some fast disks or network
and get some more interesting numbers.

> > It's not uncommon to see various going-to-idle work in kernel profiles.
> > This might be a way to reduce that (and just the cost of switching
> > registers and kernel stack to idle thread). This can be an important
> > path for single thread request-response throughput.
>
> So I feel that _if_ we do a spin here, it should only be long enough to
> amortize the schedule switch context.
>
> However, doing busy waits here has the downside that the 'idle' time is
> not in fact fed into the cpuidle predictor.

That's why I cc'ed Rafael :)

Yes, the latency in my hack is probably too long, but I think if we did
this, the cpuidle predictor could become involved here. There is no
fundamental reason it has to wait for the idle task to be context
switched for that... it's already become involved in core scheduler
code.

Thanks,
Nick
From: Peter Zijlstra
To: Nicholas Piggin
Cc: linux-kernel, Ingo Molnar, Rafael J. Wysocki, Paul E. McKenney
Date: 2018-04-20 10:58 UTC
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Fri, Apr 20, 2018 at 07:01:47PM +1000, Nicholas Piggin wrote:
> On Fri, 20 Apr 2018 09:44:56 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > > This is a quick hack for comments, but I've always wondered --
> > > if we have short-term polling idle states in cpuidle for performance
> > > -- why not skip the context switch and entry into all the idle states,
> > > and just wait for a bit to see if something wakes up again.
> >
> > Is that context switch so expensive?
>
> I guess relatively much more than taking one branch mispredict on the
> loop exit when the task wakes. 10s of cycles vs 1000s?

Sure, just wondering how much. And I'm assuming you're looking at Power
here, right?

> > And what kernel did you test on? We recently merged a bunch of patches
> > from Rafael that avoided disabling the tick for short idle predictions.
> > That also has performance improvements for such workloads. Did your
> > kernel include those?
>
> Yes, that actually improved profiles quite a lot, but these numbers
> were with those changes. I'll try to find some fast disks or network
> and get some more interesting numbers.

OK, good that you have those patches in. That ensures you're not trying
to fix something that's possibly already addressed elsewhere.

> > > It's not uncommon to see various going-to-idle work in kernel profiles.
> > > This might be a way to reduce that (and just the cost of switching
> > > registers and kernel stack to idle thread). This can be an important
> > > path for single thread request-response throughput.
> >
> > So I feel that _if_ we do a spin here, it should only be long enough to
> > amortize the schedule switch context.
> >
> > However, doing busy waits here has the downside that the 'idle' time is
> > not in fact fed into the cpuidle predictor.
>
> That's why I cc'ed Rafael :)
>
> Yes, the latency in my hack is probably too long, but I think if we did
> this, the cpuidle predictor could become involved here. There is no
> fundamental reason it has to wait for the idle task to be context
> switched for that... it's already become involved in core scheduler
> code.

Yes, cpuidle/cpufreq are getting more and more integrated, so there is
no objection from that point.

Growing multiple 'idle' points otoh is a little dodgy and could cause
some maintenance issues.

Of course, this loop would have the same idle-duration problems as the
poll_state.c one. We should probably use that code. Also, do we want to
ask the estimator before doing this? If it predicts a very long idle
time, spinning here is just wasting cycles.
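Peter's last point — ask the estimator before spinning — could look roughly like the sketch below. The function names and the 2x threshold are illustrative assumptions, not the actual cpuidle governor API; the sketch only expresses the idea that a spin is worthwhile when the predicted idle period is on the order of the cost of switching to the idle thread and back.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical gate on the pre-idle spin: if the governor predicts a
 * long idle period, the spin budget would expire anyway, so polling is
 * pure waste. Spin only when the predicted idle is short enough that
 * catching the wakeup saves roughly a round trip through the idle
 * thread (switch out plus switch back, hence the factor of 2).
 */
static bool should_spin_before_idle(uint64_t predicted_idle_ns,
				    uint64_t switch_cost_ns)
{
	return predicted_idle_ns < 2 * switch_cost_ns;
}
```

This is also the direction Peter points at with poll_state.c: reuse the polling-state machinery, which already consults idle-duration predictions, rather than growing a second ad-hoc idle point in __schedule().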
From: Nicholas Piggin
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Rafael J. Wysocki, Paul E. McKenney
Date: 2018-04-20 12:28 UTC
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Fri, 20 Apr 2018 12:58:27 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Apr 20, 2018 at 07:01:47PM +1000, Nicholas Piggin wrote:
> > I guess relatively much more than taking one branch mispredict on the
> > loop exit when the task wakes. 10s of cycles vs 1000s?
>
> Sure, just wondering how much. And I'm assuming you're looking at Power
> here, right?

Well, I'll try to get more numbers. Yes, talking about Power. It trails
x86 on context switches by a bit, but similar orders of magnitude. My
Skylake is doing ~1900 cycles syscall + context switch with a distro
kernel. POWER9 is ~2500.

> > Yes, that actually improved profiles quite a lot, but these numbers
> > were with those changes. I'll try to find some fast disks or network
> > and get some more interesting numbers.
>
> OK, good that you have those patches in. That ensures you're not trying
> to fix something that's possibly already addressed elsewhere.

Yep.

> > That's why I cc'ed Rafael :)
> >
> > Yes, the latency in my hack is probably too long, but I think if we
> > did this, the cpuidle predictor could become involved here. There is
> > no fundamental reason it has to wait for the idle task to be context
> > switched for that... it's already become involved in core scheduler
> > code.
>
> Yes, cpuidle/cpufreq are getting more and more integrated, so there is
> no objection from that point.
>
> Growing multiple 'idle' points otoh is a little dodgy and could cause
> some maintenance issues.

Right, it should be done a bit better than my patch, which is just a
hack.

> Of course, this loop would have the same idle-duration problems as the
> poll_state.c one. We should probably use that code. Also, do we want to
> ask the estimator before doing this? If it predicts a very long idle
> time, spinning here is just wasting cycles.

I would say so, yes. I think if we did go this route, it should take
over the existing polling idle states, so it would make sense to
control it in a similar way. (Unless polling idle is the only state
available, of course, we need to switch to it eventually, and we must
immediately switch in case of do_task_dead, etc.)

Anyway, I'll wait for the merge window to settle and try to get some
more numbers. I basically just wanted to see if there were any
fundamental problems with the concept.

Thanks,
Nick
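Taking the cycle counts quoted above at face value, a back-of-envelope calculation (the 3.5 GHz clock is an assumed figure, not one measured in this thread) shows how far the patch's 1 ms spin cap is from the switch cost Peter wants the spin to amortize:

```c
/*
 * Rough arithmetic on the quoted numbers: ~2500 cycles for a POWER9
 * syscall + context switch, at an assumed 3.5 GHz core clock.
 */

/* Nanoseconds taken by an operation measured in CPU cycles. */
static double cycles_to_ns(double cycles, double clock_hz)
{
	return cycles / clock_hz * 1e9;
}

/* How many times over the spin budget covers the cost it amortizes. */
static double budget_ratio(double budget_ns, double op_ns)
{
	return budget_ns / op_ns;
}
```

At these assumed numbers a switch costs on the order of 700 ns, so the 1 ms cap is roughly three orders of magnitude larger — consistent with the "latency in my hack is probably too long" assessment earlier in the thread.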
From: Pavan Kondeti
To: Nicholas Piggin
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Paul E. McKenney
Date: 2018-04-23 10:17 UTC
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

Hi Nick,

On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> This is a quick hack for comments, but I've always wondered --
> if we have short-term polling idle states in cpuidle for performance
> -- why not skip the context switch and entry into all the idle states,
> and just wait for a bit to see if something wakes up again.
>
> It's not uncommon to see various going-to-idle work in kernel profiles.
> This might be a way to reduce that (and just the cost of switching
> registers and kernel stack to idle thread). This can be an important
> path for single thread request-response throughput.
>
> tbench bandwidth seems to be improved (the numbers aren't too stable
> but they pretty consistently show some gain). 10-20% would be a pretty
> nice gain for such workloads.
>
> clients     1     2     4     8    16   128
> vanilla   232   467   823  1819  3218  9065
> patched   310   503   962  2465  3743  9820

<snip>

> +idle_spin_end:
>  	/* Promote REQ to ACT */
>  	rq->clock_update_flags <<= 1;
>  	update_rq_clock(rq);
> @@ -3437,6 +3439,32 @@ static void __sched notrace __schedule(bool preempt)
>  	if (unlikely(signal_pending_state(prev->state, prev))) {
>  		prev->state = TASK_RUNNING;
>  	} else {
> +		/*
> +		 * Busy wait before switching to idle thread. This
> +		 * is marked unlikely because we're idle so jumping
> +		 * out of line doesn't matter too much.
> +		 */
> +		if (unlikely(do_idle_spin && rq->nr_running == 1)) {
> +			u64 start;
> +
> +			do_idle_spin = false;
> +
> +			rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
> +			rq_unlock_irq(rq, &rf);
> +
> +			spin_begin();
> +			start = local_clock();
> +			while (!need_resched() && prev->state &&
> +			       !signal_pending_state(prev->state, prev)) {
> +				spin_cpu_relax();
> +				if (local_clock() - start > 1000000)
> +					break;
> +			}

Couple of comments/questions.

When an RT task is doing this busy loop,

(1) need_resched() may not be set even if a fair/normal task is
enqueued on this CPU.

(2) Any lower prio RT task waking up on this CPU may migrate to another
CPU thinking this CPU is busy with a higher prio RT task.

> +			spin_end();
> +
> +			rq_lock_irq(rq, &rf);
> +			goto idle_spin_end;
> +		}
>  		deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
>  		prev->on_rq = 0;
>
> --
> 2.17.0

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation
Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, a Linux Foundation Collaborative Project.
From: Nicholas Piggin
To: Pavan Kondeti
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Paul E. McKenney
Date: 2018-04-24 5:26 UTC
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Mon, 23 Apr 2018 15:47:40 +0530
Pavan Kondeti <pkondeti@codeaurora.org> wrote:

> Hi Nick,
>
> On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > This is a quick hack for comments, but I've always wondered --
> > if we have short-term polling idle states in cpuidle for performance
> > -- why not skip the context switch and entry into all the idle states,
> > and just wait for a bit to see if something wakes up again.
> >
> > It's not uncommon to see various going-to-idle work in kernel profiles.
> > This might be a way to reduce that (and just the cost of switching
> > registers and kernel stack to idle thread). This can be an important
> > path for single thread request-response throughput.
>
> <snip>
>
> > +		if (unlikely(do_idle_spin && rq->nr_running == 1)) {
> > +			u64 start;
> > +
> > +			do_idle_spin = false;
> > +
> > +			rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
> > +			rq_unlock_irq(rq, &rf);
> > +
> > +			spin_begin();
> > +			start = local_clock();
> > +			while (!need_resched() && prev->state &&
> > +			       !signal_pending_state(prev->state, prev)) {
> > +				spin_cpu_relax();
> > +				if (local_clock() - start > 1000000)
> > +					break;
> > +			}
>
> Couple of comments/questions.
>
> When an RT task is doing this busy loop,
>
> (1) need_resched() may not be set even if a fair/normal task is
> enqueued on this CPU.

This is true; it should probably spin on nr_running == 1, good catch.

> (2) Any lower prio RT task waking up on this CPU may migrate to another
> CPU thinking this CPU is busy with a higher prio RT task.

Also true. If we completely replaced the polling idle states with a
spin here, this would not be acceptable, and it would have to be quite
a lot more work to interact with load calculations etc. On the other
hand, if it is a much smaller spin on the order of context switch
latency that could be considered part of the cost of context switching
for the purposes of load balancing, *maybe* not much else is needed.

Thanks,
Nick
Thread overview: 7 messages (newest: 2018-04-24 5:26 UTC)

2018-04-15 13:31 [RFC PATCH] kernel/sched/core: busy wait before going idle  Nicholas Piggin
2018-04-20  7:44 ` Peter Zijlstra
2018-04-20  9:01   ` Nicholas Piggin
2018-04-20 10:58     ` Peter Zijlstra
2018-04-20 12:28       ` Nicholas Piggin
2018-04-23 10:17 ` Pavan Kondeti
2018-04-24  5:26   ` Nicholas Piggin