* too many timer retries happen when do local timer swtich with broadcast timer
From: Jason Liu @ 2013-02-20 11:16 UTC
To: linux-arm-kernel

Hi,

Sorry for the long email, please be patient... thanks.

I have seen too many timer retries when switching between the local timer and
the broadcast timer on an ARM Cortex-A9 SMP (4 cores). See the following log
(e.g. retries: 36383):

root@~$ cat /proc/timer_list
Timer List Version: v0.6
HRTIMER_MAX_CLOCK_BASES: 3
now at 3297691988044 nsecs

cpu: 0
 clock 0:
  .base:       8c0084b8
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
[...]
Tick Device: mode:     1
Broadcast device
Clock Event Device: mxc_timer1
 max_delta_ns:   1431655863333
 min_delta_ns:   85000
 mult:           12884901
 shift:          32
 mode:           3
 next_event:     3297700000000 nsecs
 set_next_event: v2_set_next_event
 set_mode:       mxc_set_mode
 event_handler:  tick_handle_oneshot_broadcast
 retries:        92
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 0000000a

Tick Device: mode:     1
Per CPU device: 0
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           3
 next_event:     3297700000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        36383

Tick Device: mode:     1
Per CPU device: 1
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           1
 next_event:     3297720000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        6510

Tick Device: mode:     1
Per CPU device: 2
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           3
 next_event:     3297700000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        790

Tick Device: mode:     1
Per CPU device: 3
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           1
 next_event:     3298000000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        6873

Since on our platform the local timer stops when the CPU enters the C3 state,
we need to switch from the local timer to the broadcast timer when entering
that state and switch back when exiting it. The code looks like this:

void arch_idle(void)
{
	....
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);

	enter_the_wait_mode();

	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
}

When the broadcast timer interrupt arrives (this interrupt just wakes up the
ARM core; the core has no chance to handle it yet since local irqs are
disabled -- in fact they are disabled in cpu_idle() in
arch/arm/kernel/process.c), it wakes up the CPU, which runs:

clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); ->
	tick_broadcast_oneshot_control(...);
	->
	tick_program_event(dev->next_event, 1);
	->
	tick_dev_program_event(dev, expires, force);
	->
	for (i = 0;;) {
		int ret = clockevents_program_event(dev, expires, now);
		if (!ret || !force)
			return ret;

		dev->retries++;
		....
		now = ktime_get();
		expires = ktime_add_ns(now, dev->min_delta_ns);
	}

clockevents_program_event(dev, expires, now);

	delta = ktime_to_ns(ktime_sub(expires, now));

	if (delta <= 0)
		return -ETIME;

When the broadcast timer interrupt arrives, the last programmed local timer
has expired too. So clockevents_program_event() returns -ETIME, which causes
dev->retries++ while retrying to program the already-expired timer.

In the worst case, after reprogramming the expired timer, the CPU enters idle
again before the reprogrammed timer expires, which makes the system ping-pong
forever:

switch to bc timer -> wait -> bc timer expires -> wakeup -> switch to loc timer
        ^                                                              |
        |                                                              v
        +----------- enter idle <- reprogram the expired loc timer <---+

I have run into this worst case on my project, and I think it is a common
issue on ARM platforms. How do you think we can fix this problem?

Thank you.

Best Regards,
Jason Liu
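To make the failure mode described above concrete, here is a minimal
user-space sketch of the forced-reprogram loop quoted in the report. It is
purely illustrative and not kernel code: the function names only loosely
mirror clockevents_program_event()/tick_dev_program_event(), and the
min_delta_ns value is simply taken from the local_timer entries in the dump.

/*
 * Each call with a stale expiry fails with -ETIME, bumps `retries` and pushes
 * the expiry out by min_delta_ns before trying again.  Every broadcast exit
 * whose local next_event already lies in the past therefore adds to the
 * retry count, and the idle ping-pong repeats this indefinitely.
 */
#include <stdio.h>
#include <stdint.h>
#include <errno.h>
#include <time.h>

static const uint64_t min_delta_ns = 1000;	/* from the local_timer dump */
static unsigned long retries;

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Stand-in for clockevents_program_event(): refuse expiries in the past. */
static int program_event(uint64_t expires, uint64_t now)
{
	if (expires <= now)
		return -ETIME;
	/* real code would write the hardware comparator here */
	return 0;
}

/* Stand-in for tick_dev_program_event(..., force=1). */
static void program_event_forced(uint64_t expires)
{
	for (;;) {
		uint64_t now = now_ns();

		if (!program_event(expires, now))
			return;
		retries++;			/* the counter from /proc/timer_list */
		expires = now + min_delta_ns;	/* retry just min_delta in the future */
	}
}

int main(void)
{
	/* Each iteration models one broadcast exit whose local next_event
	 * has already expired by the time we try to restore it. */
	for (int i = 0; i < 5; i++)
		program_event_forced(now_ns() - 1);

	printf("retries bumped to %lu\n", retries);
	return 0;
}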
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-20 11:16 too many timer retries happen when do local timer swtich with broadcast timer Jason Liu @ 2013-02-20 13:33 ` Thomas Gleixner 2013-02-21 6:16 ` Jason Liu 0 siblings, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-20 13:33 UTC (permalink / raw) To: linux-arm-kernel On Wed, 20 Feb 2013, Jason Liu wrote: > void arch_idle(void) > { > .... > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); > > enter_the_wait_mode(); > > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > } > > when the broadcast timer interrupt arrives(this interrupt just wakeup > the ARM, and ARM has no chance > to handle it since local irq is disabled. In fact it's disabled in > cpu_idle() of arch/arm/kernel/process.c) > > the broadcast timer interrupt will wake up the CPU and run: > > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> > tick_broadcast_oneshot_control(...); > -> > tick_program_event(dev->next_event, 1); > -> > tick_dev_program_event(dev, expires, force); > -> > for (i = 0;;) { > int ret = clockevents_program_event(dev, expires, now); > if (!ret || !force) > return ret; > > dev->retries++; > .... > now = ktime_get(); > expires = ktime_add_ns(now, dev->min_delta_ns); > } > clockevents_program_event(dev, expires, now); > > delta = ktime_to_ns(ktime_sub(expires, now)); > > if (delta <= 0) > return -ETIME; > > when the bc timer interrupt arrives, which means the last local timer > expires too. so, > clockevents_program_event will return -ETIME, which will cause the > dev->retries++ > when retry to program the expired timer. > > Even under the worst case, after the re-program the expired timer, > then CPU enter idle > quickly before the re-progam timer expired, it will make system > ping-pang forever, That's nonsense. The timer IPI brings the core out of the deep idle state. So after returning from enter_wait_mode() and after calling clockevents_notify() it returns from arch_idle() to cpu_idle(). In cpu_idle() interrupts are reenabled, so the timer IPI handler is invoked. That calls the event_handler of the per cpu local clockevent device (the one which stops in C3). That ends up in the generic timer code which expires timers and reprograms the local clock event device with the next pending timer. So you cannot go idle again, before the expired timers of this event are handled and their callbacks invoked. Now the reprogramming itself is a different issue. It's necessary as we have no information whether the wakeup was caused by the timer IPI or by something else. We might try to be clever and store the pending IPI in tick_handle_oneshot_broadcast() and avoid the reprogramming in that case. Completely untested patch below. Thanks, tglx Index: linux-2.6/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.orig/kernel/time/tick-broadcast.c +++ linux-2.6/kernel/time/tick-broadcast.c @@ -29,6 +29,7 @@ static struct tick_device tick_broadcast_device; /* FIXME: Use cpumask_var_t. 
*/ static DECLARE_BITMAP(tick_broadcast_mask, NR_CPUS); +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); static DECLARE_BITMAP(tmpmask, NR_CPUS); static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); static int tick_broadcast_force; @@ -417,9 +418,10 @@ again: /* Find all expired events */ for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { td = &per_cpu(tick_cpu_device, cpu); - if (td->evtdev->next_event.tv64 <= now.tv64) + if (td->evtdev->next_event.tv64 <= now.tv64) { cpumask_set_cpu(cpu, to_cpumask(tmpmask)); - else if (td->evtdev->next_event.tv64 < next_event.tv64) + set_bit(cpu, tick_broadcast_pending); + } else if (td->evtdev->next_event.tv64 < next_event.tv64) next_event.tv64 = td->evtdev->next_event.tv64; } @@ -493,7 +495,8 @@ void tick_broadcast_oneshot_control(unsi cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); - if (dev->next_event.tv64 != KTIME_MAX) + if (dev->next_event.tv64 != KTIME_MAX && + !test_and_clear_bit(cpu, tick_broadcast_pending)) tick_program_event(dev->next_event, 1); } } ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-20 13:33 ` Thomas Gleixner @ 2013-02-21 6:16 ` Jason Liu 2013-02-21 9:36 ` Thomas Gleixner 2013-02-21 10:35 ` Lorenzo Pieralisi 0 siblings, 2 replies; 25+ messages in thread From: Jason Liu @ 2013-02-21 6:16 UTC (permalink / raw) To: linux-arm-kernel 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: > On Wed, 20 Feb 2013, Jason Liu wrote: >> void arch_idle(void) >> { >> .... >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); >> >> enter_the_wait_mode(); >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> } >> >> when the broadcast timer interrupt arrives(this interrupt just wakeup >> the ARM, and ARM has no chance >> to handle it since local irq is disabled. In fact it's disabled in >> cpu_idle() of arch/arm/kernel/process.c) >> >> the broadcast timer interrupt will wake up the CPU and run: >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> >> tick_broadcast_oneshot_control(...); >> -> >> tick_program_event(dev->next_event, 1); >> -> >> tick_dev_program_event(dev, expires, force); >> -> >> for (i = 0;;) { >> int ret = clockevents_program_event(dev, expires, now); >> if (!ret || !force) >> return ret; >> >> dev->retries++; >> .... >> now = ktime_get(); >> expires = ktime_add_ns(now, dev->min_delta_ns); >> } >> clockevents_program_event(dev, expires, now); >> >> delta = ktime_to_ns(ktime_sub(expires, now)); >> >> if (delta <= 0) >> return -ETIME; >> >> when the bc timer interrupt arrives, which means the last local timer >> expires too. so, >> clockevents_program_event will return -ETIME, which will cause the >> dev->retries++ >> when retry to program the expired timer. >> >> Even under the worst case, after the re-program the expired timer, >> then CPU enter idle >> quickly before the re-progam timer expired, it will make system >> ping-pang forever, > > That's nonsense. I don't think so. > > The timer IPI brings the core out of the deep idle state. > > So after returning from enter_wait_mode() and after calling > clockevents_notify() it returns from arch_idle() to cpu_idle(). > > In cpu_idle() interrupts are reenabled, so the timer IPI handler is > invoked. That calls the event_handler of the per cpu local clockevent > device (the one which stops in C3). That ends up in the generic timer > code which expires timers and reprograms the local clock event device > with the next pending timer. > > So you cannot go idle again, before the expired timers of this event > are handled and their callbacks invoked. That's true for the CPUs which not response to the global timer interrupt. Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) The global timer device will keep running even in the deep idle mode, so, it can be used as the broadcast timer device, and the interrupt of this device just raised to CPU0 when the timer expired, then, CPU0 will broadcast the IPI timer to other CPUs which is in deep idle mode. So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle state, after running clockevents_notify() it returns from arch_idle() to cpu_idle(), then local_irq_enable(), the IPI handler will be invoked and handle the expires times and re-program the next pending timer. But, that's not true for the CPU0. 
The flow for CPU0 is: the global timer interrupt wakes up CPU0 and then call: clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); in the function tick_broadcast_oneshot_control(), After return from clockevents_notify(), it will return to cpu_idle from arch_idle, then local_irq_enable(), the CPU0 will response to the global timer interrupt, and call the interrupt handler: tick_handle_oneshot_broadcast() static void tick_handle_oneshot_broadcast(struct clock_event_device *dev) { struct tick_device *td; ktime_t now, next_event; int cpu; raw_spin_lock(&tick_broadcast_lock); again: dev->next_event.tv64 = KTIME_MAX; next_event.tv64 = KTIME_MAX; cpumask_clear(to_cpumask(tmpmask)); now = ktime_get(); /* Find all expired events */ for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { td = &per_cpu(tick_cpu_device, cpu); if (td->evtdev->next_event.tv64 <= now.tv64) cpumask_set_cpu(cpu, to_cpumask(tmpmask)); else if (td->evtdev->next_event.tv64 < next_event.tv64) next_event.tv64 = td->evtdev->next_event.tv64; } /* * Wakeup the cpus which have an expired event. */ tick_do_broadcast(to_cpumask(tmpmask)); ... } since cpu0 has been removed from the tick_get_broadcast_oneshot_mask(), and if all the other cpu1/2/3 state in idle, and no expired timers, then the tmpmask will be 0, when call tick_do_broadcast(). static void tick_do_broadcast(struct cpumask *mask) { int cpu = smp_processor_id(); struct tick_device *td; /* * Check, if the current cpu is in the mask */ if (cpumask_test_cpu(cpu, mask)) { cpumask_clear_cpu(cpu, mask); td = &per_cpu(tick_cpu_device, cpu); td->evtdev->event_handler(td->evtdev); } if (!cpumask_empty(mask)) { /* * It might be necessary to actually check whether the devices * have different broadcast functions. For now, just use the * one of the first device. This works as long as we have this * misfeature only on x86 (lapic) */ td = &per_cpu(tick_cpu_device, cpumask_first(mask)); td->evtdev->broadcast(mask); } } If the mask is empty, then tick_do_broadcast will do nothing and return, which will make cpu0 enter idle quickly, and then system will ping-pang there. switch to bc timer->wait->bc timer expires->wakeup->switch to loc timer-> | ^ |------<-enter idle <- reprogram the expired loc timer ---< We did see such things happen on the system which will make system stuck there and no any sched-tick handler running on the CPU0. And the easy way to reproduce it is: /* hack code: just for experiment */ diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index f113755..d142802 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -493,8 +493,12 @@ void tick_broadcast_oneshot_control(unsigned long reason) cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); - if (dev->next_event.tv64 != KTIME_MAX) - tick_program_event(dev->next_event, 1); + if (dev->next_event.tv64 != KTIME_MAX) { + if (cpu) + tick_program_event(dev->next_event, 1); + else + tick_program_event(ktime_add_ns(dev->next_event, 100000), 1); + } } } We need fix this common issue. Do you have good idea about how to fix it? > > > Now the reprogramming itself is a different issue. > > It's necessary as we have no information whether the wakeup was caused > by the timer IPI or by something else. > > We might try to be clever and store the pending IPI in > tick_handle_oneshot_broadcast() and avoid the reprogramming in that > case. 
Completely untested patch below. Thanks for the patch. > > Thanks, > > tglx > > Index: linux-2.6/kernel/time/tick-broadcast.c > =================================================================== > --- linux-2.6.orig/kernel/time/tick-broadcast.c > +++ linux-2.6/kernel/time/tick-broadcast.c > @@ -29,6 +29,7 @@ > static struct tick_device tick_broadcast_device; > /* FIXME: Use cpumask_var_t. */ > static DECLARE_BITMAP(tick_broadcast_mask, NR_CPUS); > +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); > static DECLARE_BITMAP(tmpmask, NR_CPUS); > static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); > static int tick_broadcast_force; > @@ -417,9 +418,10 @@ again: > /* Find all expired events */ > for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { > td = &per_cpu(tick_cpu_device, cpu); > - if (td->evtdev->next_event.tv64 <= now.tv64) > + if (td->evtdev->next_event.tv64 <= now.tv64) { > cpumask_set_cpu(cpu, to_cpumask(tmpmask)); > - else if (td->evtdev->next_event.tv64 < next_event.tv64) > + set_bit(cpu, tick_broadcast_pending); > + } else if (td->evtdev->next_event.tv64 < next_event.tv64) > next_event.tv64 = td->evtdev->next_event.tv64; > } > > @@ -493,7 +495,8 @@ void tick_broadcast_oneshot_control(unsi > cpumask_clear_cpu(cpu, > tick_get_broadcast_oneshot_mask()); > clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); > - if (dev->next_event.tv64 != KTIME_MAX) > + if (dev->next_event.tv64 != KTIME_MAX && > + !test_and_clear_bit(cpu, tick_broadcast_pending)) > tick_program_event(dev->next_event, 1); > } > } > I have tested the patch, it can fix the retries on the CPU1/2/3, but not on CPU0. see, cpu0: retries: 39042 root@~$ cat /proc/timer_list Timer List Version: v0.6 HRTIMER_MAX_CLOCK_BASES: 3 [...] Tick Device: mode: 1 Per CPU device: 0 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 3 next_event: 3295010000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 39042 Tick Device: mode: 1 Per CPU device: 1 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 1 next_event: 3295050000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 1 Tick Device: mode: 1 Per CPU device: 2 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 1 next_event: 10740407770000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 0 Tick Device: mode: 1 Per CPU device: 3 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 1 next_event: 10737578200000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 0 Jason Liu > > > > > ^ permalink raw reply related [flat|nested] 25+ messages in thread
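The CPU0 numbers above fit the scenario Jason describes: the CPU that takes
the broadcast IRQ has already cleared itself out of the oneshot mask by the
time the handler runs, so the handler finds nobody to wake. Below is a toy,
single-threaded user-space model of that corner case; it is a sketch only,
the masks and expiry values are invented and nothing here is kernel-accurate.

#include <stdio.h>

#define NCPUS 4

int main(void)
{
	unsigned int oneshot_mask = 0;	/* CPUs currently in broadcast (deep idle) mode */
	unsigned int tmpmask = 0;	/* CPUs with expired events, to be woken by IPI */
	long now = 1000;
	long next_event[NCPUS] = { 1000, 2000, 3000, 4000 };	/* per-cpu expiries */
	int cpu;

	/* All four CPUs went into deep idle and entered broadcast mode. */
	for (cpu = 0; cpu < NCPUS; cpu++)
		oneshot_mask |= 1u << cpu;

	/*
	 * The broadcast IRQ fires on CPU0, which runs BROADCAST_EXIT and so
	 * removes itself from the oneshot mask *before* the handler runs.
	 */
	oneshot_mask &= ~1u;

	/* tick_handle_oneshot_broadcast(), reduced to the expired-event scan: */
	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (!(oneshot_mask & (1u << cpu)))
			continue;
		if (next_event[cpu] <= now)
			tmpmask |= 1u << cpu;
	}

	if (!tmpmask)
		printf("mask empty: nobody is woken, and CPU0's own expired event is never handled\n");
	else
		printf("waking CPUs 0x%x\n", tmpmask);

	return 0;
}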
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 6:16 ` Jason Liu @ 2013-02-21 9:36 ` Thomas Gleixner 2013-02-21 10:50 ` Jason Liu 2013-02-21 10:35 ` Lorenzo Pieralisi 1 sibling, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-21 9:36 UTC (permalink / raw) To: linux-arm-kernel On Thu, 21 Feb 2013, Jason Liu wrote: > 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: > > On Wed, 20 Feb 2013, Jason Liu wrote: > >> void arch_idle(void) > >> { > >> .... > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); > >> > >> enter_the_wait_mode(); > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > >> } > >> > >> when the broadcast timer interrupt arrives(this interrupt just wakeup > >> the ARM, and ARM has no chance > >> to handle it since local irq is disabled. In fact it's disabled in > >> cpu_idle() of arch/arm/kernel/process.c) > >> > >> the broadcast timer interrupt will wake up the CPU and run: > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> > >> tick_broadcast_oneshot_control(...); > >> -> > >> tick_program_event(dev->next_event, 1); > >> -> > >> tick_dev_program_event(dev, expires, force); > >> -> > >> for (i = 0;;) { > >> int ret = clockevents_program_event(dev, expires, now); > >> if (!ret || !force) > >> return ret; > >> > >> dev->retries++; > >> .... > >> now = ktime_get(); > >> expires = ktime_add_ns(now, dev->min_delta_ns); > >> } > >> clockevents_program_event(dev, expires, now); > >> > >> delta = ktime_to_ns(ktime_sub(expires, now)); > >> > >> if (delta <= 0) > >> return -ETIME; > >> > >> when the bc timer interrupt arrives, which means the last local timer > >> expires too. so, > >> clockevents_program_event will return -ETIME, which will cause the > >> dev->retries++ > >> when retry to program the expired timer. > >> > >> Even under the worst case, after the re-program the expired timer, > >> then CPU enter idle > >> quickly before the re-progam timer expired, it will make system > >> ping-pang forever, > > > > That's nonsense. > > I don't think so. > > > > > The timer IPI brings the core out of the deep idle state. > > > > So after returning from enter_wait_mode() and after calling > > clockevents_notify() it returns from arch_idle() to cpu_idle(). > > > > In cpu_idle() interrupts are reenabled, so the timer IPI handler is > > invoked. That calls the event_handler of the per cpu local clockevent > > device (the one which stops in C3). That ends up in the generic timer > > code which expires timers and reprograms the local clock event device > > with the next pending timer. > > > > So you cannot go idle again, before the expired timers of this event > > are handled and their callbacks invoked. > > That's true for the CPUs which not response to the global timer interrupt. > Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) > The global timer device will keep running even in the deep idle mode, so, it > can be used as the broadcast timer device, and the interrupt of this device > just raised to CPU0 when the timer expired, then, CPU0 will broadcast the > IPI timer to other CPUs which is in deep idle mode. > > So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle > state, after running clockevents_notify() it returns from arch_idle() > to cpu_idle(), > then local_irq_enable(), the IPI handler will be invoked and handle > the expires times > and re-program the next pending timer. > > But, that's not true for the CPU0. 
> The flow for CPU0 is:
> the global timer interrupt wakes up CPU0 and then call:
> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
>
> which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask());
> in the function tick_broadcast_oneshot_control(),

Now your explanation makes sense.

I have no fast solution for this, but I think that I have an idea how
to fix it. Stay tuned.

Thanks,

	tglx
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 9:36 ` Thomas Gleixner @ 2013-02-21 10:50 ` Jason Liu 2013-02-21 13:48 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Jason Liu @ 2013-02-21 10:50 UTC (permalink / raw) To: linux-arm-kernel 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: > On Thu, 21 Feb 2013, Jason Liu wrote: >> 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: >> > On Wed, 20 Feb 2013, Jason Liu wrote: >> >> void arch_idle(void) >> >> { >> >> .... >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); >> >> >> >> enter_the_wait_mode(); >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> } >> >> >> >> when the broadcast timer interrupt arrives(this interrupt just wakeup >> >> the ARM, and ARM has no chance >> >> to handle it since local irq is disabled. In fact it's disabled in >> >> cpu_idle() of arch/arm/kernel/process.c) >> >> >> >> the broadcast timer interrupt will wake up the CPU and run: >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> >> >> tick_broadcast_oneshot_control(...); >> >> -> >> >> tick_program_event(dev->next_event, 1); >> >> -> >> >> tick_dev_program_event(dev, expires, force); >> >> -> >> >> for (i = 0;;) { >> >> int ret = clockevents_program_event(dev, expires, now); >> >> if (!ret || !force) >> >> return ret; >> >> >> >> dev->retries++; >> >> .... >> >> now = ktime_get(); >> >> expires = ktime_add_ns(now, dev->min_delta_ns); >> >> } >> >> clockevents_program_event(dev, expires, now); >> >> >> >> delta = ktime_to_ns(ktime_sub(expires, now)); >> >> >> >> if (delta <= 0) >> >> return -ETIME; >> >> >> >> when the bc timer interrupt arrives, which means the last local timer >> >> expires too. so, >> >> clockevents_program_event will return -ETIME, which will cause the >> >> dev->retries++ >> >> when retry to program the expired timer. >> >> >> >> Even under the worst case, after the re-program the expired timer, >> >> then CPU enter idle >> >> quickly before the re-progam timer expired, it will make system >> >> ping-pang forever, >> > >> > That's nonsense. >> >> I don't think so. >> >> > >> > The timer IPI brings the core out of the deep idle state. >> > >> > So after returning from enter_wait_mode() and after calling >> > clockevents_notify() it returns from arch_idle() to cpu_idle(). >> > >> > In cpu_idle() interrupts are reenabled, so the timer IPI handler is >> > invoked. That calls the event_handler of the per cpu local clockevent >> > device (the one which stops in C3). That ends up in the generic timer >> > code which expires timers and reprograms the local clock event device >> > with the next pending timer. >> > >> > So you cannot go idle again, before the expired timers of this event >> > are handled and their callbacks invoked. >> >> That's true for the CPUs which not response to the global timer interrupt. >> Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) >> The global timer device will keep running even in the deep idle mode, so, it >> can be used as the broadcast timer device, and the interrupt of this device >> just raised to CPU0 when the timer expired, then, CPU0 will broadcast the >> IPI timer to other CPUs which is in deep idle mode. 
>> >> So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle >> state, after running clockevents_notify() it returns from arch_idle() >> to cpu_idle(), >> then local_irq_enable(), the IPI handler will be invoked and handle >> the expires times >> and re-program the next pending timer. >> >> But, that's not true for the CPU0. The flow for CPU0 is: >> the global timer interrupt wakes up CPU0 and then call: >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); >> in the function tick_broadcast_oneshot_control(), > > Now your explanation makes sense. > > I have no fast solution for this, but I think that I have an idea how > to fix it. Stay tuned. Thanks Thomas, wait for your fix. :) > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 10:50 ` Jason Liu @ 2013-02-21 13:48 ` Thomas Gleixner 2013-02-21 15:12 ` Santosh Shilimkar 2013-02-22 10:26 ` Jason Liu 0 siblings, 2 replies; 25+ messages in thread From: Thomas Gleixner @ 2013-02-21 13:48 UTC (permalink / raw) To: linux-arm-kernel Jason, On Thu, 21 Feb 2013, Jason Liu wrote: > 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: > > Now your explanation makes sense. > > > > I have no fast solution for this, but I think that I have an idea how > > to fix it. Stay tuned. > > Thanks Thomas, wait for your fix. :) find below a completely untested patch, which should address that issue. Sigh. Stopping the cpu local timer in deep idle is known to be an idiotic design decision for 10+ years. I'll never understand why ARM vendors insist on implementing the same stupidity over and over. Thanks, tglx Index: linux-2.6/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.orig/kernel/time/tick-broadcast.c +++ linux-2.6/kernel/time/tick-broadcast.c @@ -397,6 +397,8 @@ int tick_resume_broadcast(void) /* FIXME: use cpumask_var_t. */ static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS); +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); +static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS); /* * Exposed for debugging: see timer_list.c @@ -453,12 +455,24 @@ again: /* Find all expired events */ for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { td = &per_cpu(tick_cpu_device, cpu); - if (td->evtdev->next_event.tv64 <= now.tv64) + if (td->evtdev->next_event.tv64 <= now.tv64) { cpumask_set_cpu(cpu, to_cpumask(tmpmask)); - else if (td->evtdev->next_event.tv64 < next_event.tv64) + /* + * Mark the remote cpu in the pending mask, so + * it can avoid reprogramming the cpu local + * timer in tick_broadcast_oneshot_control(). + */ + set_bit(cpu, tick_broadcast_pending); + } else if (td->evtdev->next_event.tv64 < next_event.tv64) next_event.tv64 = td->evtdev->next_event.tv64; } + /* Take care of enforced broadcast requests */ + for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) { + set_bit(cpu, tmpmask); + clear_bit(cpu, tick_force_broadcast_mask); + } + /* * Wakeup the cpus which have an expired event. */ @@ -494,6 +508,7 @@ void tick_broadcast_oneshot_control(unsi struct clock_event_device *bc, *dev; struct tick_device *td; unsigned long flags; + ktime_t now; int cpu; /* @@ -518,6 +533,8 @@ void tick_broadcast_oneshot_control(unsi raw_spin_lock_irqsave(&tick_broadcast_lock, flags); if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) { + WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending)); + WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask)); if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) { cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN); @@ -529,10 +546,63 @@ void tick_broadcast_oneshot_control(unsi cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); - if (dev->next_event.tv64 != KTIME_MAX) - tick_program_event(dev->next_event, 1); + if (dev->next_event.tv64 == KTIME_MAX) + goto out; + /* + * The cpu handling the broadcast timer marked + * this cpu in the broadcast pending mask and + * fired the broadcast IPI. So we are going to + * handle the expired event anyway via the + * broadcast IPI handler. No need to reprogram + * the timer with an already expired event. 
+ */ + if (test_and_clear_bit(cpu, tick_broadcast_pending)) + goto out; + /* + * If the pending bit is not set, then we are + * either the CPU handling the broadcast + * interrupt or we got woken by something else. + * + * We are not longer in the broadcast mask, so + * if the cpu local expiry time is already + * reached, we would reprogram the cpu local + * timer with an already expired event. + * + * This can lead to a ping-pong when we return + * to idle and therefor rearm the broadcast + * timer before the cpu local timer was able + * to fire. This happens because the forced + * reprogramming makes sure that the event + * will happen in the future and depending on + * the min_delta setting this might be far + * enough out that the ping-pong starts. + * + * If the cpu local next_event has expired + * then we know that the broadcast timer + * next_event has expired as well and + * broadcast is about to be handled. So we + * avoid reprogramming and enforce that the + * broadcast handler, which did not run yet, + * will invoke the cpu local handler. + * + * We cannot call the handler directly from + * here, because we might be in a NOHZ phase + * and we did not go through the irq_enter() + * nohz fixups. + */ + now = ktime_get(); + if (dev->next_event.tv64 <= now.tv64) { + set_bit(cpu, tick_force_broadcast_mask); + goto out; + } + /* + * We got woken by something else. Reprogram + * the cpu local timer device. + */ + tick_program_event(dev->next_event, 1); } } +out: raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); } ^ permalink raw reply [flat|nested] 25+ messages in thread
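To see how the two new masks in the patch above are meant to interact, here
is a toy, single-threaded user-space model of the protocol it introduces.
This is only a sketch: locking, the actual IPI and the shutdown path are
omitted, and the event times and CPU ordering are invented for illustration.

#include <stdio.h>

#define NCPUS 4

static unsigned int oneshot_mask, pending_mask, force_mask;
static long next_event[NCPUS];

/* Model of tick_handle_oneshot_broadcast() on the CPU that took the IRQ. */
static void broadcast_handler(long now)
{
	unsigned int wake = 0;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (!(oneshot_mask & (1u << cpu)))
			continue;
		if (next_event[cpu] <= now) {
			wake |= 1u << cpu;
			pending_mask |= 1u << cpu;	/* the IPI will handle it */
		}
	}
	/* take care of enforced requests from CPUs that raced out of broadcast mode */
	wake |= force_mask;
	force_mask = 0;

	printf("broadcast handler: waking CPUs 0x%x\n", wake);
}

/* Model of the BROADCAST_EXIT path of tick_broadcast_oneshot_control(). */
static void broadcast_exit(int cpu, long now)
{
	oneshot_mask &= ~(1u << cpu);

	if (pending_mask & (1u << cpu)) {
		pending_mask &= ~(1u << cpu);
		printf("cpu%d: IPI pending, no reprogram needed\n", cpu);
	} else if (next_event[cpu] <= now) {
		force_mask |= 1u << cpu;	/* let the broadcast handler run us */
		printf("cpu%d: event already expired, forcing broadcast\n", cpu);
	} else {
		printf("cpu%d: reprogramming local timer for %ld\n",
		       cpu, next_event[cpu]);
	}
}

int main(void)
{
	long now = 1000;

	next_event[0] = 1000;	/* broadcast-IRQ cpu, its own event has expired */
	next_event[1] = 1000;	/* remote cpu with an expired event */
	next_event[2] = 5000;	/* remote cpus with future events */
	next_event[3] = 9000;
	oneshot_mask = 0xf;

	broadcast_exit(0, now);	/* cpu0 leaves broadcast mode first ... */
	broadcast_handler(now);	/* ... then services the broadcast IRQ */
	broadcast_exit(1, now);	/* woken by the IPI: pending bit is set */
	broadcast_exit(2, now);	/* spurious wakeup: genuine reprogram */
	return 0;
}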
* too many timer retries happen when do local timer swtich with broadcast timer
From: Santosh Shilimkar @ 2013-02-21 15:12 UTC
To: linux-arm-kernel

Thomas,

On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> Jason,
>
> On Thu, 21 Feb 2013, Jason Liu wrote:
>> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
>>> Now your explanation makes sense.
>>>
>>> I have no fast solution for this, but I think that I have an idea how
>>> to fix it. Stay tuned.
>>
>> Thanks Thomas, wait for your fix. :)
>
> find below a completely untested patch, which should address that issue.
>
After looking at the thread, I tried to see the issue on OMAP and could
see the same issue as Jason. Your patch fixes the retries on both CPUs
on my dual-core machine, so you can use my Tested-by if you need one.

Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>

Thanks for the patch. And thanks to Jason for spotting the issue.

Regards,
Santosh
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 15:12 ` Santosh Shilimkar @ 2013-02-21 22:19 ` Thomas Gleixner 2013-02-22 10:07 ` Santosh Shilimkar 2013-02-22 10:28 ` Lorenzo Pieralisi 0 siblings, 2 replies; 25+ messages in thread From: Thomas Gleixner @ 2013-02-21 22:19 UTC (permalink / raw) To: linux-arm-kernel On Thu, 21 Feb 2013, Santosh Shilimkar wrote: > On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote: > > find below a completely untested patch, which should address that issue. > > > After looking at the thread, I tried to see the issue on OMAP and could > see the same issue as Jason. That's interesting. We have the same issue on x86 since 2007 and nobody noticed ever. It's basically the same problem there, but it seems that on x86 getting out of those low power states is way slower than the minimal reprogramming delta which is used to enforce the local timer to fire after the wakeup. I'm still amazed that as Jason stated a 1us reprogramming delta is sufficient to get this ping-pong going. I somehow doubt that, but maybe ARM is really that fast :) > Your patch fixes the retries on both CPUs on my dual core machine. So > you use my tested by if you need one. They are always welcome. > Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> > > Thanks for the patch. > And thanks to Jason for spotting the issue. And for coping with my initial inability to parse his report! Thanks, tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 22:19 ` Thomas Gleixner @ 2013-02-22 10:07 ` Santosh Shilimkar 2013-02-22 10:24 ` Thomas Gleixner 2013-02-22 10:28 ` Lorenzo Pieralisi 1 sibling, 1 reply; 25+ messages in thread From: Santosh Shilimkar @ 2013-02-22 10:07 UTC (permalink / raw) To: linux-arm-kernel Thomas, On Friday 22 February 2013 03:49 AM, Thomas Gleixner wrote: > On Thu, 21 Feb 2013, Santosh Shilimkar wrote: >> On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote: >>> find below a completely untested patch, which should address that issue. >>> >> After looking at the thread, I tried to see the issue on OMAP and could >> see the same issue as Jason. > > That's interesting. We have the same issue on x86 since 2007 and > nobody noticed ever. It's basically the same problem there, but it > seems that on x86 getting out of those low power states is way slower > than the minimal reprogramming delta which is used to enforce the > local timer to fire after the wakeup. > > I'm still amazed that as Jason stated a 1us reprogramming delta is > sufficient to get this ping-pong going. I somehow doubt that, but > maybe ARM is really that fast :) > >> Your patch fixes the retries on both CPUs on my dual core machine. So >> you use my tested by if you need one. > > They are always welcome. > BTW, Lorenzo off-list mentioned to me about warning in boot-up which I missed while testing your patch. It will take bit more time for me to look into it and hence thought of reporting it. [ 2.186126] ------------[ cut here ]------------ [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 tick_broadcast_oneshot_control+0x1c0/0x21c() [ 2.200622] Modules linked in: [ 2.203826] [<c001bfe4>] (unwind_backtrace+0x0/0xf0) from [<c0047d6c>] (warn_slowpath_common+0x4c/0x64) [ 2.213653] [<c0047d6c>] (warn_slowpath_common+0x4c/0x64) from [<c0047da0>] (warn_slowpath_null+0x1c/0x24) [ 2.223754] [<c0047da0>] (warn_slowpath_null+0x1c/0x24) from [<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c) [ 2.234924] [<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c) from [<c00928dc>] (tick_notify+0x23c/0x42c) [ 2.245666] [<c00928dc>] (tick_notify+0x23c/0x42c) from [<c0539a3c>] (notifier_call_chain+0x44/0x84) [ 2.255218] [<c0539a3c>] (notifier_call_chain+0x44/0x84) from [<c0071068>] (raw_notifier_call_chain+0x18/0x20) [ 2.265686] [<c0071068>] (raw_notifier_call_chain+0x18/0x20) from [<c0091c70>] (clockevents_notify+0x2c/0x174) [ 2.276123] [<c0091c70>] (clockevents_notify+0x2c/0x174) from [<c0035294>] (omap_enter_idle_smp+0x3c/0x120) [ 2.286315] [<c0035294>] (omap_enter_idle_smp+0x3c/0x120) from [<c042e504>] (cpuidle_enter+0x14/0x18) [ 2.295928] [<c042e504>] (cpuidle_enter+0x14/0x18) from [<c042ef14>] (cpuidle_wrap_enter+0x34/0xa0) [ 2.305389] [<c042ef14>] (cpuidle_wrap_enter+0x34/0xa0) from [<c042eb20>] (cpuidle_idle_call+0xe0/0x328) [ 2.315307] [<c042eb20>] (cpuidle_idle_call+0xe0/0x328) from [<c0015100>] (cpu_idle+0x8c/0x11c) [ 2.324401] [<c0015100>] (cpu_idle+0x8c/0x11c) from [<c073d7ac>] (start_kernel+0x2b0/0x300) [ 2.333129] ---[ end trace 6fe1f7b4606a9e20 ]--- ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer
From: Thomas Gleixner @ 2013-02-22 10:24 UTC
To: linux-arm-kernel

On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> BTW, Lorenzo off-list mentioned to me about warning in boot-up
> which I missed while testing your patch. It will take bit more
> time for me to look into it and hence thought of reporting it.
>
> [    2.186126] ------------[ cut here ]------------
> [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> tick_broadcast_oneshot_control+0x1c0/0x21c()

Which one is that? tick_broadcast_pending or tick_force_broadcast_mask?
* too many timer retries happen when do local timer swtich with broadcast timer
From: Santosh Shilimkar @ 2013-02-22 10:30 UTC
To: linux-arm-kernel

On Friday 22 February 2013 03:54 PM, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
>> BTW, Lorenzo off-list mentioned to me about warning in boot-up
>> which I missed while testing your patch. It will take bit more
>> time for me to look into it and hence thought of reporting it.
>>
>> [    2.186126] ------------[ cut here ]------------
>> [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
>> tick_broadcast_oneshot_control+0x1c0/0x21c()
>
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
>
The force broadcast mask one, coming from below:

	WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));

Regards,
Santosh
* too many timer retries happen when do local timer swtich with broadcast timer
From: Lorenzo Pieralisi @ 2013-02-22 10:31 UTC
To: linux-arm-kernel

On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> > BTW, Lorenzo off-list mentioned to me about warning in boot-up
> > which I missed while testing your patch. It will take bit more
> > time for me to look into it and hence thought of reporting it.
> >
> > [    2.186126] ------------[ cut here ]------------
> > [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> > tick_broadcast_oneshot_control+0x1c0/0x21c()
>
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?

It is the tick_force_broadcast_mask, and I think that's because on all
systems we are testing, the broadcast timer IRQ is a thundering herd:
all CPUs get out of idle at once and try to get out of broadcast mode
at more or less the same time.

Lorenzo
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 10:31 ` Lorenzo Pieralisi @ 2013-02-22 11:02 ` Santosh Shilimkar 2013-02-22 12:07 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Santosh Shilimkar @ 2013-02-22 11:02 UTC (permalink / raw) To: linux-arm-kernel On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote: > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote: >> On Fri, 22 Feb 2013, Santosh Shilimkar wrote: >>> BTW, Lorenzo off-list mentioned to me about warning in boot-up >>> which I missed while testing your patch. It will take bit more >>> time for me to look into it and hence thought of reporting it. >>> >>> [ 2.186126] ------------[ cut here ]------------ >>> [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 >>> tick_broadcast_oneshot_control+0x1c0/0x21c() >> >> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ? > > It is the tick_force_broadcast_mask and I think that's because on all > systems we are testing, the broadcast timer IRQ is a thundering herd, > all CPUs get out of idle at once and try to get out of broadcast mode > at more or less the same time. > So the issue comes ups only when the idle state used where CPU wakeup more or less at same time as Lorenzo mentioned. I have two platforms where I could test the patch and see the issue only with one platform. Yesterday I didn't notice the warning because it wasn't seen on that platform :-) OMAP4 idle entry and exit in deep state is staggered between CPUs and hence the warning isn't seen. On OMAP5 though, there is an additional C-state where idle entry/exit for CPU isn't staggered and I see the issue in that case. Actually the broad-cast code doesn't expect such a behavior from CPUs since only the broad-cast affine CPU should wake up and rest of the CPU should be woken up by the broad-cast IPIs. Regards, Santosh ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 11:02 ` Santosh Shilimkar @ 2013-02-22 12:07 ` Thomas Gleixner 2013-02-22 14:48 ` Lorenzo Pieralisi 0 siblings, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-22 12:07 UTC (permalink / raw) To: linux-arm-kernel On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote: > > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote: > > > On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > > > > BTW, Lorenzo off-list mentioned to me about warning in boot-up > > > > which I missed while testing your patch. It will take bit more > > > > time for me to look into it and hence thought of reporting it. > > > > > > > > [ 2.186126] ------------[ cut here ]------------ > > > > [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 > > > > tick_broadcast_oneshot_control+0x1c0/0x21c() > > > > > > Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ? > > > > It is the tick_force_broadcast_mask and I think that's because on all > > systems we are testing, the broadcast timer IRQ is a thundering herd, > > all CPUs get out of idle at once and try to get out of broadcast mode > > at more or less the same time. > > > So the issue comes ups only when the idle state used where CPU wakeup > more or less at same time as Lorenzo mentioned. I have two platforms > where I could test the patch and see the issue only with one platform. > > Yesterday I didn't notice the warning because it wasn't seen on that > platform :-) OMAP4 idle entry and exit in deep state is staggered > between CPUs and hence the warning isn't seen. On OMAP5 though, > there is an additional C-state where idle entry/exit for CPU > isn't staggered and I see the issue in that case. > > Actually the broad-cast code doesn't expect such a behavior > from CPUs since only the broad-cast affine CPU should wake > up and rest of the CPU should be woken up by the broad-cast > IPIs. That's what I feared. We might have the same issue on x86, depending on the cpu model. But thinking more about it. It's actually not a real problem, just pointless burning of cpu cycles. So on the CPU which gets woken along with the target CPU of the broadcast the following happens: deep_idle() <-- spurious wakeup broadcast_exit() set forced bit enable interrupts <-- Nothing happens disable interrupts broadcast_enter() <-- Here we observe the forced bit is set deep_idle() Now after that the target CPU of the broadcast runs the broadcast handler and finds the other CPU in both the broadcast and the forced mask, sends the IPI and stuff gets back to normal. So it's not actually harmful, just more evidence for the theory, that hardware designers have access to very special drug supplies. Now we could make use of that and avoid going deep idle just to come back right away via the IPI. Unfortunately the notification thingy has no return value, but we can fix that. To confirm that theory, could you please try the hack below and add some instrumentation (trace_printk)? 
Thanks, tglx Index: linux-2.6/arch/arm/kernel/process.c =================================================================== --- linux-2.6.orig/arch/arm/kernel/process.c +++ linux-2.6/arch/arm/kernel/process.c @@ -199,7 +199,7 @@ void cpu_idle(void) #ifdef CONFIG_PL310_ERRATA_769419 wmb(); #endif - if (hlt_counter) { + if (hlt_counter || tick_check_broadcast_pending()) { local_irq_enable(); cpu_relax(); } else if (!need_resched()) { Index: linux-2.6/include/linux/clockchips.h =================================================================== --- linux-2.6.orig/include/linux/clockchips.h +++ linux-2.6/include/linux/clockchips.h @@ -170,6 +170,12 @@ extern void tick_broadcast(const struct extern int tick_receive_broadcast(void); #endif +#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) +extern int tick_check_broadcast_pending(void); +#else +static inline int tick_check_broadcast_pending(void) { return 0; } +#endif + #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void clockevents_notify(unsigned long reason, void *arg); #else Index: linux-2.6/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.orig/kernel/time/tick-broadcast.c +++ linux-2.6/kernel/time/tick-broadcast.c @@ -418,6 +418,15 @@ static int tick_broadcast_set_event(ktim return clockevents_program_event(bc, expires, force); } +/* + * Called before going idle with interrupts disabled. Checks whether a + * broadcast event from the other core is about to happen. + */ +int tick_check_broadcast_pending(void) +{ + return test_bit(smp_processor_id(), tick_force_broadcast_mask); +} + int tick_resume_broadcast_oneshot(struct clock_event_device *bc) { clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT); ^ permalink raw reply [flat|nested] 25+ messages in thread
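As a rough illustration of what the hack above buys, here is a purely
hypothetical, single-threaded user-space model (nothing kernel-accurate): a
CPU whose force bit was set on BROADCAST_EXIT can notice this in the idle
loop and avoid diving back into the deep state just to be kicked straight
back out by the broadcast IPI.

#include <stdio.h>
#include <stdbool.h>

static bool force_bit;		/* stands in for tick_force_broadcast_mask */

static bool check_broadcast_pending(void)
{
	return force_bit;
}

/* Model of BROADCAST_EXIT when the local next_event already lies in the past. */
static void broadcast_exit_with_expired_event(void)
{
	force_bit = true;
}

static void idle_loop_iteration(void)
{
	if (check_broadcast_pending()) {
		printf("broadcast IPI imminent: stay out of deep idle, just wait\n");
		return;
	}
	printf("enter deep idle (local timer stops, broadcast armed)\n");
}

int main(void)
{
	idle_loop_iteration();			/* normal entry */
	broadcast_exit_with_expired_event();	/* spurious wakeup path */
	idle_loop_iteration();			/* now the deep state is skipped */
	return 0;
}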
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 12:07 ` Thomas Gleixner @ 2013-02-22 14:48 ` Lorenzo Pieralisi 2013-02-22 15:03 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-22 14:48 UTC (permalink / raw) To: linux-arm-kernel On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > > > On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote: > > > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote: > > > > On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > > > > > BTW, Lorenzo off-list mentioned to me about warning in boot-up > > > > > which I missed while testing your patch. It will take bit more > > > > > time for me to look into it and hence thought of reporting it. > > > > > > > > > > [ 2.186126] ------------[ cut here ]------------ > > > > > [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 > > > > > tick_broadcast_oneshot_control+0x1c0/0x21c() > > > > > > > > Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ? > > > > > > It is the tick_force_broadcast_mask and I think that's because on all > > > systems we are testing, the broadcast timer IRQ is a thundering herd, > > > all CPUs get out of idle at once and try to get out of broadcast mode > > > at more or less the same time. > > > > > So the issue comes ups only when the idle state used where CPU wakeup > > more or less at same time as Lorenzo mentioned. I have two platforms > > where I could test the patch and see the issue only with one platform. > > > > Yesterday I didn't notice the warning because it wasn't seen on that > > platform :-) OMAP4 idle entry and exit in deep state is staggered > > between CPUs and hence the warning isn't seen. On OMAP5 though, > > there is an additional C-state where idle entry/exit for CPU > > isn't staggered and I see the issue in that case. > > > > Actually the broad-cast code doesn't expect such a behavior > > from CPUs since only the broad-cast affine CPU should wake > > up and rest of the CPU should be woken up by the broad-cast > > IPIs. > > That's what I feared. We might have the same issue on x86, depending > on the cpu model. > > But thinking more about it. It's actually not a real problem, just > pointless burning of cpu cycles. > > So on the CPU which gets woken along with the target CPU of the > broadcast the following happens: > > deep_idle() > <-- spurious wakeup > broadcast_exit() > set forced bit > > enable interrupts > > <-- Nothing happens > > disable interrupts > > broadcast_enter() > <-- Here we observe the forced bit is set > deep_idle() > > Now after that the target CPU of the broadcast runs the broadcast > handler and finds the other CPU in both the broadcast and the forced > mask, sends the IPI and stuff gets back to normal. > > So it's not actually harmful, just more evidence for the theory, that > hardware designers have access to very special drug supplies. > > Now we could make use of that and avoid going deep idle just to come > back right away via the IPI. Unfortunately the notification thingy has > no return value, but we can fix that. > > To confirm that theory, could you please try the hack below and add > some instrumentation (trace_printk)? Applied, and it looks like that's exactly why the warning triggers, at least on the platform I am testing on which is a dual-cluster ARM testchip. 
There is still a time window, though, where the CPU (the IPI target) can get
back to idle (tick_broadcast_pending still not set) before the CPU target of
the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
tick_broadcast_pending), or am I missing something? It is a corner case,
granted. The best thing would be to check pending IRQs in the idle driver
back-end (or have always-on local timers :-)).

Thanks,
Lorenzo
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 14:48 ` Lorenzo Pieralisi @ 2013-02-22 15:03 ` Thomas Gleixner 2013-02-22 15:26 ` Lorenzo Pieralisi 0 siblings, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-22 15:03 UTC (permalink / raw) To: linux-arm-kernel On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > Now we could make use of that and avoid going deep idle just to come > > back right away via the IPI. Unfortunately the notification thingy has > > no return value, but we can fix that. > > > > To confirm that theory, could you please try the hack below and add > > some instrumentation (trace_printk)? > > Applied, and it looks like that's exactly why the warning triggers, at least > on the platform I am testing on which is a dual-cluster ARM testchip. > > There is a still time window though where the CPU (the IPI target) can get > back to idle (tick_broadcast_pending still not set) before the CPU target of > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > tick_broadcast_pending), or am I missing something ? Well, the tick_broadcast_pending bit is uninteresting if the force_broadcast bit is set. Because if that bit is set we know for sure, that we got woken with the cpu which gets the broadcast timer and raced back to idle before the broadcast handler managed to send the IPI. If we did not get woken before the broadcast IPI then the pending bit is set when we exit the broadcast mode. > It is a corner case, granted. Best thing would be to check pending IRQs in the > idle driver back-end (or have always-on local timers :-)). The latter is definitely the only sane solution. Thanks, tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 15:03 ` Thomas Gleixner @ 2013-02-22 15:26 ` Lorenzo Pieralisi 2013-02-22 18:52 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-22 15:26 UTC (permalink / raw) To: linux-arm-kernel On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > > Now we could make use of that and avoid going deep idle just to come > > > back right away via the IPI. Unfortunately the notification thingy has > > > no return value, but we can fix that. > > > > > > To confirm that theory, could you please try the hack below and add > > > some instrumentation (trace_printk)? > > > > Applied, and it looks like that's exactly why the warning triggers, at least > > on the platform I am testing on which is a dual-cluster ARM testchip. > > > > There is a still time window though where the CPU (the IPI target) can get > > back to idle (tick_broadcast_pending still not set) before the CPU target of > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > > tick_broadcast_pending), or am I missing something ? > > Well, the tick_broadcast_pending bit is uninteresting if the > force_broadcast bit is set. Because if that bit is set we know for > sure, that we got woken with the cpu which gets the broadcast timer > and raced back to idle before the broadcast handler managed to > send the IPI. Gah, my bad sorry, I mixed things up. I thought tick_check_broadcast_pending() was checking against the tick_broadcast_pending mask not tick_force_broadcast_mask as it correctly does. All clear now. Thanks a lot, Lorenzo ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 15:26 ` Lorenzo Pieralisi @ 2013-02-22 18:52 ` Thomas Gleixner 2013-02-25 6:12 ` Santosh Shilimkar ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Thomas Gleixner @ 2013-02-22 18:52 UTC (permalink / raw) To: linux-arm-kernel On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: > > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > > > Now we could make use of that and avoid going deep idle just to come > > > > back right away via the IPI. Unfortunately the notification thingy has > > > > no return value, but we can fix that. > > > > > > > > To confirm that theory, could you please try the hack below and add > > > > some instrumentation (trace_printk)? > > > > > > Applied, and it looks like that's exactly why the warning triggers, at least > > > on the platform I am testing on which is a dual-cluster ARM testchip. > > > > > > There is a still time window though where the CPU (the IPI target) can get > > > back to idle (tick_broadcast_pending still not set) before the CPU target of > > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > > > tick_broadcast_pending), or am I missing something ? > > > > Well, the tick_broadcast_pending bit is uninteresting if the > > force_broadcast bit is set. Because if that bit is set we know for > > sure, that we got woken with the cpu which gets the broadcast timer > > and raced back to idle before the broadcast handler managed to > > send the IPI. > > Gah, my bad sorry, I mixed things up. I thought > > tick_check_broadcast_pending() > > was checking against the tick_broadcast_pending mask not > > tick_force_broadcast_mask Yep, that's a misnomer. I just wanted to make sure that my theory is correct. I need to think about the real solution some more. We have two alternatives: 1) Make the clockevents_notify function have a return value. 2) Add something like the hack I gave you with a proper name. The latter has the beauty, that we just need to modify the platform independent idle code instead of going down to every callsite of the clockevents_notify thing. Thanks, tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 18:52 ` Thomas Gleixner @ 2013-02-25 6:12 ` Santosh Shilimkar 2013-02-25 6:38 ` Jason Liu 2013-02-25 13:34 ` Lorenzo Pieralisi 2 siblings, 0 replies; 25+ messages in thread From: Santosh Shilimkar @ 2013-02-25 6:12 UTC (permalink / raw) To: linux-arm-kernel On Saturday 23 February 2013 12:22 AM, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >> On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: >>> On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >>>> On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: >>>>> Now we could make use of that and avoid going deep idle just to come >>>>> back right away via the IPI. Unfortunately the notification thingy has >>>>> no return value, but we can fix that. >>>>> >>>>> To confirm that theory, could you please try the hack below and add >>>>> some instrumentation (trace_printk)? >>>> >>>> Applied, and it looks like that's exactly why the warning triggers, at least >>>> on the platform I am testing on which is a dual-cluster ARM testchip. >>>> I too confirm that the warnings cause is same. >>>> There is a still time window though where the CPU (the IPI target) can get >>>> back to idle (tick_broadcast_pending still not set) before the CPU target of >>>> the broadcast has a chance to run tick_handle_oneshot_broadcast (and set >>>> tick_broadcast_pending), or am I missing something ? >>> >>> Well, the tick_broadcast_pending bit is uninteresting if the >>> force_broadcast bit is set. Because if that bit is set we know for >>> sure, that we got woken with the cpu which gets the broadcast timer >>> and raced back to idle before the broadcast handler managed to >>> send the IPI. >> >> Gah, my bad sorry, I mixed things up. I thought >> >> tick_check_broadcast_pending() >> >> was checking against the tick_broadcast_pending mask not >> >> tick_force_broadcast_mask > > Yep, that's a misnomer. I just wanted to make sure that my theory is > correct. I need to think about the real solution some more. > > We have two alternatives: > > 1) Make the clockevents_notify function have a return value. > > 2) Add something like the hack I gave you with a proper name. > > The latter has the beauty, that we just need to modify the platform > independent idle code instead of going down to every callsite of the > clockevents_notify thing. > I agree that 2 is better alternative to avoid multiple changes. Whichever alternative you choose, will be happy to test it :) Regards, Santosh ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 18:52 ` Thomas Gleixner 2013-02-25 6:12 ` Santosh Shilimkar @ 2013-02-25 6:38 ` Jason Liu 2013-02-25 13:34 ` Lorenzo Pieralisi 2 siblings, 0 replies; 25+ messages in thread From: Jason Liu @ 2013-02-25 6:38 UTC (permalink / raw) To: linux-arm-kernel Thomas, 2013/2/23 Thomas Gleixner <tglx@linutronix.de>: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >> On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: >> > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >> > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: >> > > > Now we could make use of that and avoid going deep idle just to come >> > > > back right away via the IPI. Unfortunately the notification thingy has >> > > > no return value, but we can fix that. >> > > > >> > > > To confirm that theory, could you please try the hack below and add >> > > > some instrumentation (trace_printk)? >> > > >> > > Applied, and it looks like that's exactly why the warning triggers, at least >> > > on the platform I am testing on which is a dual-cluster ARM testchip. >> > > >> > > There is a still time window though where the CPU (the IPI target) can get >> > > back to idle (tick_broadcast_pending still not set) before the CPU target of >> > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set >> > > tick_broadcast_pending), or am I missing something ? >> > >> > Well, the tick_broadcast_pending bit is uninteresting if the >> > force_broadcast bit is set. Because if that bit is set we know for >> > sure, that we got woken with the cpu which gets the broadcast timer >> > and raced back to idle before the broadcast handler managed to >> > send the IPI. >> >> Gah, my bad sorry, I mixed things up. I thought >> >> tick_check_broadcast_pending() >> >> was checking against the tick_broadcast_pending mask not >> >> tick_force_broadcast_mask > > Yep, that's a misnomer. I just wanted to make sure that my theory is > correct. I need to think about the real solution some more. I have applied your patch and tested, there is NO warning at all then. I think your theory is correct. > > We have two alternatives: > > 1) Make the clockevents_notify function have a return value. > > 2) Add something like the hack I gave you with a proper name. > > The latter has the beauty, that we just need to modify the platform > independent idle code instead of going down to every callsite of the > clockevents_notify thing. I prefer the solution 2). > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 18:52 ` Thomas Gleixner 2013-02-25 6:12 ` Santosh Shilimkar 2013-02-25 6:38 ` Jason Liu @ 2013-02-25 13:34 ` Lorenzo Pieralisi 2 siblings, 0 replies; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-25 13:34 UTC (permalink / raw) To: linux-arm-kernel On Fri, Feb 22, 2013 at 06:52:14PM +0000, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: > > > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > > > > Now we could make use of that and avoid going deep idle just to come > > > > > back right away via the IPI. Unfortunately the notification thingy has > > > > > no return value, but we can fix that. > > > > > > > > > > To confirm that theory, could you please try the hack below and add > > > > > some instrumentation (trace_printk)? > > > > > > > > Applied, and it looks like that's exactly why the warning triggers, at least > > > > on the platform I am testing on which is a dual-cluster ARM testchip. > > > > > > > > There is a still time window though where the CPU (the IPI target) can get > > > > back to idle (tick_broadcast_pending still not set) before the CPU target of > > > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > > > > tick_broadcast_pending), or am I missing something ? > > > > > > Well, the tick_broadcast_pending bit is uninteresting if the > > > force_broadcast bit is set. Because if that bit is set we know for > > > sure, that we got woken with the cpu which gets the broadcast timer > > > and raced back to idle before the broadcast handler managed to > > > send the IPI. > > > > Gah, my bad sorry, I mixed things up. I thought > > > > tick_check_broadcast_pending() > > > > was checking against the tick_broadcast_pending mask not > > > > tick_force_broadcast_mask > > Yep, that's a misnomer. I just wanted to make sure that my theory is > correct. I need to think about the real solution some more. > > We have two alternatives: > > 1) Make the clockevents_notify function have a return value. > > 2) Add something like the hack I gave you with a proper name. > > The latter has the beauty, that we just need to modify the platform > independent idle code instead of going down to every callsite of the > clockevents_notify thing. Thank you. I am not sure (1) would buy us anything compared to (2) and as you said we would end up patching all callsites so (2) is preferred. As I mentioned, we can even just apply your fixes and leave platform specific code deal with this optimization, at the end of the day idle driver has just to check pending IRQs/wake-up sources (which would cover all IRQs not just TIMER IPI) if and when it has to start time consuming operations like cache cleaning to enter deep idle. If it goes into a shallow C-state so be it. On x86 I think it is HW/FW that prevents C-state entering if IRQs are pending, and on ARM it is likely to happen too, so I am just saying you should not bother if you think the code becomes too hairy to justify this change. Thank you very much for the fixes and your help, Lorenzo ^ permalink raw reply [flat|nested] 25+ messages in thread
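The platform-side check Lorenzo describes could look roughly like the sketch below. pending_wakeup_irq() stands in for a platform-specific poll of the interrupt controller (for example the GIC pending registers) and platform_power_down() for the real power-down path; both names are made up for this example, while cpu_do_idle() and flush_cache_all() are existing ARM kernel interfaces.

	static void platform_enter_deep_cstate(int cpu)
	{
		/*
		 * If a wake-up source is already pending, cleaning the
		 * caches and powering down would be wasted effort: the CPU
		 * would resume immediately.  Fall back to a shallow WFI.
		 */
		if (pending_wakeup_irq(cpu)) {
			cpu_do_idle();			/* shallow WFI */
			return;
		}

		flush_cache_all();			/* costly part of deep idle entry */
		platform_power_down(cpu);		/* placeholder for the real power-down */
	}

This covers all pending IRQs, not just the timer IPI, which is the broader point being made here.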
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 22:19 ` Thomas Gleixner 2013-02-22 10:07 ` Santosh Shilimkar @ 2013-02-22 10:28 ` Lorenzo Pieralisi 1 sibling, 0 replies; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-22 10:28 UTC (permalink / raw) To: linux-arm-kernel On Thu, Feb 21, 2013 at 10:19:18PM +0000, Thomas Gleixner wrote: > On Thu, 21 Feb 2013, Santosh Shilimkar wrote: > > On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote: > > > find below a completely untested patch, which should address that issue. > > > > > After looking at the thread, I tried to see the issue on OMAP and could > > see the same issue as Jason. > > That's interesting. We have the same issue on x86 since 2007 and > nobody noticed ever. It's basically the same problem there, but it > seems that on x86 getting out of those low power states is way slower > than the minimal reprogramming delta which is used to enforce the > local timer to fire after the wakeup. > > I'm still amazed that as Jason stated a 1us reprogramming delta is > sufficient to get this ping-pong going. I somehow doubt that, but > maybe ARM is really that fast :) It also depends on when the idle driver exits broadcast mode. Certainly if that's the last thing it does before enabling IRQs, that might help trigger the issue. I am still a bit sceptic myself too, and I take advantage of Thomas' knowledge on the subject, which is ways deeper than mine BTW, to ask a question. The thread started with a subject "too many retries...." and here I have a doubt. If the fix is not applied, on the CPU affine to the broadcast timer, it is _normal_ to have local timer retries, since the CPU is setting/forcing the local timer to fire after a min_delta_ns every time the expired event was local to the CPU affine to the broadcast timer. The problem, supposedly, is that the timer has not enough time (sorry for the mouthful) to expire(fire) before IRQs are disabled and the idle thread goes back to idle again. This means that we should notice a mismatch between the number of broadcast timer IRQs and local timer IRQs on the CPU affine to the broadcast timer IRQ (granted, we also have to take into account broadcast timer IRQs meant to service (through IPI) a local timer expired on a CPU which is not the one running the broadcast IRQ handler and "normal" local timer IRQs as well). I am not sure the sheer number of retries is a symptom of the problem happening, but I might well be mistaken so I am asking. For certain, with the fix applied, lots of duplicate IRQs on the CPU affine to the broadcast timer are eliminated, since the local timer is not reprogrammed anymore (before the fix, basically the broadcast timer was firing an IRQ that did nothing since the CPU was already out of broadcast mode by the time the broadcast handler was running, the real job was carried out in the local timer handler). > > > Your patch fixes the retries on both CPUs on my dual core machine. So > > you use my tested by if you need one. > > They are always welcome. > > > Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> You can add mine too, we should fix the WARN_ON_ONCE mentioned in Santosh's reply. Tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 13:48 ` Thomas Gleixner 2013-02-21 15:12 ` Santosh Shilimkar @ 2013-02-22 10:26 ` Jason Liu 1 sibling, 0 replies; 25+ messages in thread From: Jason Liu @ 2013-02-22 10:26 UTC (permalink / raw) To: linux-arm-kernel Thomas, 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: > Jason, > > On Thu, 21 Feb 2013, Jason Liu wrote: >> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: >> > Now your explanation makes sense. >> > >> > I have no fast solution for this, but I think that I have an idea how >> > to fix it. Stay tuned. >> >> Thanks Thomas, wait for your fix. :) > > find below a completely untested patch, which should address that issue. > I have tested the below patch, but run into the following warnings: [ 713.340593] ------------[ cut here ]------------ [ 713.345238] WARNING: at /linux-2.6-imx/kernel/time/tick-broadcast.c:497 tick_broadcast_oneshot_control+0x210/0x254() [ 713.359240] Modules linked in: [ 713.362332] [<8004b2cc>] (unwind_backtrace+0x0/0xf8) from [<80079764>] (warn_slowpath_common+0x54/0x64) [ 713.371740] [<80079764>] (warn_slowpath_common+0x54/0x64) from [<80079790>] (warn_slowpath_null+0x1c/0x24) [ 713.381408] [<80079790>] (warn_slowpath_null+0x1c/0x24) from [<800a5320>] (tick_broadcast_oneshot_control+0x210/0x254) [ 713.392120] [<800a5320>] (tick_broadcast_oneshot_control+0x210/0x254) from [<800a48b0>] (tick_notify+0xd4/0x1f8) [ 713.402317] [<800a48b0>] (tick_notify+0xd4/0x1f8) from [<8009b154>] (notifier_call_chain+0x4c/0x8c) [ 713.411377] [<8009b154>] (notifier_call_chain+0x4c/0x8c) from [<8009b24c>] (raw_notifier_call_chain+0x18/0x20) [ 713.421392] [<8009b24c>] (raw_notifier_call_chain+0x18/0x20) from [<800a3d9c>] (clockevents_notify+0x30/0x198) [ 713.431417] [<800a3d9c>] (clockevents_notify+0x30/0x198) from [<80052f58>] (arch_idle_multi_core+0x4c/0xc4) [ 713.441175] [<80052f58>] (arch_idle_multi_core+0x4c/0xc4) from [<80044f68>] (default_idle+0x20/0x28) [ 713.450322] [<80044f68>] (default_idle+0x20/0x28) from [<800455c0>] (cpu_idle+0xc8/0xfc) [ 713.458425] [<800455c0>] (cpu_idle+0xc8/0xfc) from [<80008984>] (start_kernel+0x260/0x294) [ 713.466701] [<80008984>] (start_kernel+0x260/0x294) from [<10008040>] (0x10008040) So, the code here which bring the warnings: void tick_broadcast_oneshot_control(unsigned long reason) { ... WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask)); } > > Thanks, > > tglx > > Index: linux-2.6/kernel/time/tick-broadcast.c > =================================================================== > --- linux-2.6.orig/kernel/time/tick-broadcast.c > +++ linux-2.6/kernel/time/tick-broadcast.c > @@ -397,6 +397,8 @@ int tick_resume_broadcast(void) > > /* FIXME: use cpumask_var_t. */ > static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS); > +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); > +static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS); > > /* > * Exposed for debugging: see timer_list.c > @@ -453,12 +455,24 @@ again: > /* Find all expired events */ > for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { > td = &per_cpu(tick_cpu_device, cpu); > - if (td->evtdev->next_event.tv64 <= now.tv64) > + if (td->evtdev->next_event.tv64 <= now.tv64) { > cpumask_set_cpu(cpu, to_cpumask(tmpmask)); > - else if (td->evtdev->next_event.tv64 < next_event.tv64) > + /* > + * Mark the remote cpu in the pending mask, so > + * it can avoid reprogramming the cpu local > + * timer in tick_broadcast_oneshot_control(). 
> + */ > + set_bit(cpu, tick_broadcast_pending); > + } else if (td->evtdev->next_event.tv64 < next_event.tv64) > next_event.tv64 = td->evtdev->next_event.tv64; > } > > + /* Take care of enforced broadcast requests */ > + for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) { > + set_bit(cpu, tmpmask); > + clear_bit(cpu, tick_force_broadcast_mask); > + } > + > /* > * Wakeup the cpus which have an expired event. > */ > @@ -494,6 +508,7 @@ void tick_broadcast_oneshot_control(unsi > struct clock_event_device *bc, *dev; > struct tick_device *td; > unsigned long flags; > + ktime_t now; > int cpu; > > /* > @@ -518,6 +533,8 @@ void tick_broadcast_oneshot_control(unsi > > raw_spin_lock_irqsave(&tick_broadcast_lock, flags); > if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) { > + WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending)); > + WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask)); > if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) { > cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask()); > clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN); > @@ -529,10 +546,63 @@ void tick_broadcast_oneshot_control(unsi > cpumask_clear_cpu(cpu, > tick_get_broadcast_oneshot_mask()); > clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); > - if (dev->next_event.tv64 != KTIME_MAX) > - tick_program_event(dev->next_event, 1); > + if (dev->next_event.tv64 == KTIME_MAX) > + goto out; > + /* > + * The cpu handling the broadcast timer marked > + * this cpu in the broadcast pending mask and > + * fired the broadcast IPI. So we are going to > + * handle the expired event anyway via the > + * broadcast IPI handler. No need to reprogram > + * the timer with an already expired event. > + */ > + if (test_and_clear_bit(cpu, tick_broadcast_pending)) > + goto out; > + /* > + * If the pending bit is not set, then we are > + * either the CPU handling the broadcast > + * interrupt or we got woken by something else. > + * > + * We are not longer in the broadcast mask, so > + * if the cpu local expiry time is already > + * reached, we would reprogram the cpu local > + * timer with an already expired event. > + * > + * This can lead to a ping-pong when we return > + * to idle and therefor rearm the broadcast > + * timer before the cpu local timer was able > + * to fire. This happens because the forced > + * reprogramming makes sure that the event > + * will happen in the future and depending on > + * the min_delta setting this might be far > + * enough out that the ping-pong starts. > + * > + * If the cpu local next_event has expired > + * then we know that the broadcast timer > + * next_event has expired as well and > + * broadcast is about to be handled. So we > + * avoid reprogramming and enforce that the > + * broadcast handler, which did not run yet, > + * will invoke the cpu local handler. > + * > + * We cannot call the handler directly from > + * here, because we might be in a NOHZ phase > + * and we did not go through the irq_enter() > + * nohz fixups. > + */ > + now = ktime_get(); > + if (dev->next_event.tv64 <= now.tv64) { > + set_bit(cpu, tick_force_broadcast_mask); > + goto out; > + } > + /* > + * We got woken by something else. Reprogram > + * the cpu local timer device. > + */ > + tick_program_event(dev->next_event, 1); > } > } > +out: > raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); > } > ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 6:16 ` Jason Liu 2013-02-21 9:36 ` Thomas Gleixner @ 2013-02-21 10:35 ` Lorenzo Pieralisi 2013-02-21 10:49 ` Jason Liu 1 sibling, 1 reply; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-21 10:35 UTC (permalink / raw) To: linux-arm-kernel Hi Jason, On Thu, Feb 21, 2013 at 06:16:51AM +0000, Jason Liu wrote: > 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: > > On Wed, 20 Feb 2013, Jason Liu wrote: > >> void arch_idle(void) > >> { > >> .... > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); > >> > >> enter_the_wait_mode(); > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > >> } > >> > >> when the broadcast timer interrupt arrives(this interrupt just wakeup > >> the ARM, and ARM has no chance > >> to handle it since local irq is disabled. In fact it's disabled in > >> cpu_idle() of arch/arm/kernel/process.c) > >> > >> the broadcast timer interrupt will wake up the CPU and run: > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> > >> tick_broadcast_oneshot_control(...); > >> -> > >> tick_program_event(dev->next_event, 1); > >> -> > >> tick_dev_program_event(dev, expires, force); > >> -> > >> for (i = 0;;) { > >> int ret = clockevents_program_event(dev, expires, now); > >> if (!ret || !force) > >> return ret; > >> > >> dev->retries++; > >> .... > >> now = ktime_get(); > >> expires = ktime_add_ns(now, dev->min_delta_ns); > >> } > >> clockevents_program_event(dev, expires, now); > >> > >> delta = ktime_to_ns(ktime_sub(expires, now)); > >> > >> if (delta <= 0) > >> return -ETIME; > >> > >> when the bc timer interrupt arrives, which means the last local timer > >> expires too. so, > >> clockevents_program_event will return -ETIME, which will cause the > >> dev->retries++ > >> when retry to program the expired timer. > >> > >> Even under the worst case, after the re-program the expired timer, > >> then CPU enter idle > >> quickly before the re-progam timer expired, it will make system > >> ping-pang forever, > > > > That's nonsense. > > I don't think so. > > > > > The timer IPI brings the core out of the deep idle state. > > > > So after returning from enter_wait_mode() and after calling > > clockevents_notify() it returns from arch_idle() to cpu_idle(). > > > > In cpu_idle() interrupts are reenabled, so the timer IPI handler is > > invoked. That calls the event_handler of the per cpu local clockevent > > device (the one which stops in C3). That ends up in the generic timer > > code which expires timers and reprograms the local clock event device > > with the next pending timer. > > > > So you cannot go idle again, before the expired timers of this event > > are handled and their callbacks invoked. > > That's true for the CPUs which not response to the global timer interrupt. > Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) > The global timer device will keep running even in the deep idle mode, so, it > can be used as the broadcast timer device, and the interrupt of this device > just raised to CPU0 when the timer expired, then, CPU0 will broadcast the > IPI timer to other CPUs which is in deep idle mode. > > So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle > state, after running clockevents_notify() it returns from arch_idle() > to cpu_idle(), > then local_irq_enable(), the IPI handler will be invoked and handle > the expires times > and re-program the next pending timer. 
> > But, that's not true for the CPU0. The flow for CPU0 is: > the global timer interrupt wakes up CPU0 and then call: > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > > which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); > in the function tick_broadcast_oneshot_control(), For my own understanding: at this point in time CPU0 local timer is also reprogrammed, with min_delta (ie 1us) if I got your description right. > > After return from clockevents_notify(), it will return to cpu_idle > from arch_idle, > then local_irq_enable(), the CPU0 will response to the global timer > interrupt, and > call the interrupt handler: tick_handle_oneshot_broadcast() > > static void tick_handle_oneshot_broadcast(struct clock_event_device *dev) > { > struct tick_device *td; > ktime_t now, next_event; > int cpu; > > raw_spin_lock(&tick_broadcast_lock); > again: > dev->next_event.tv64 = KTIME_MAX; > next_event.tv64 = KTIME_MAX; > cpumask_clear(to_cpumask(tmpmask)); > now = ktime_get(); > /* Find all expired events */ > for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { > td = &per_cpu(tick_cpu_device, cpu); > if (td->evtdev->next_event.tv64 <= now.tv64) > cpumask_set_cpu(cpu, to_cpumask(tmpmask)); > else if (td->evtdev->next_event.tv64 < next_event.tv64) > next_event.tv64 = td->evtdev->next_event.tv64; > } > > /* > * Wakeup the cpus which have an expired event. > */ > tick_do_broadcast(to_cpumask(tmpmask)); > ... > } > > since cpu0 has been removed from the tick_get_broadcast_oneshot_mask(), and if > all the other cpu1/2/3 state in idle, and no expired timers, then the > tmpmask will be 0, > when call tick_do_broadcast(). > > static void tick_do_broadcast(struct cpumask *mask) > { > int cpu = smp_processor_id(); > struct tick_device *td; > > /* > * Check, if the current cpu is in the mask > */ > if (cpumask_test_cpu(cpu, mask)) { > cpumask_clear_cpu(cpu, mask); > td = &per_cpu(tick_cpu_device, cpu); > td->evtdev->event_handler(td->evtdev); > } > > if (!cpumask_empty(mask)) { > /* > * It might be necessary to actually check whether the devices > * have different broadcast functions. For now, just use the > * one of the first device. This works as long as we have this > * misfeature only on x86 (lapic) > */ > td = &per_cpu(tick_cpu_device, cpumask_first(mask)); > td->evtdev->broadcast(mask); > } > } > > If the mask is empty, then tick_do_broadcast will do nothing and return, which > will make cpu0 enter idle quickly, and then system will ping-pang there. This means that the local timer reprogrammed above (to actually emulate the expired local timer on CPU0, likely to be set to min_delta == 1us) does not have time to expire before the idle thread disables IRQs and goes idle again. Is this a correct description of what's happening ? Thanks a lot, Lorenzo ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 10:35 ` Lorenzo Pieralisi @ 2013-02-21 10:49 ` Jason Liu 0 siblings, 0 replies; 25+ messages in thread From: Jason Liu @ 2013-02-21 10:49 UTC (permalink / raw) To: linux-arm-kernel 2013/2/21 Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>: > Hi Jason, > > On Thu, Feb 21, 2013 at 06:16:51AM +0000, Jason Liu wrote: >> 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: >> > On Wed, 20 Feb 2013, Jason Liu wrote: >> >> void arch_idle(void) >> >> { >> >> .... >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); >> >> >> >> enter_the_wait_mode(); >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> } >> >> >> >> when the broadcast timer interrupt arrives(this interrupt just wakeup >> >> the ARM, and ARM has no chance >> >> to handle it since local irq is disabled. In fact it's disabled in >> >> cpu_idle() of arch/arm/kernel/process.c) >> >> >> >> the broadcast timer interrupt will wake up the CPU and run: >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> >> >> tick_broadcast_oneshot_control(...); >> >> -> >> >> tick_program_event(dev->next_event, 1); >> >> -> >> >> tick_dev_program_event(dev, expires, force); >> >> -> >> >> for (i = 0;;) { >> >> int ret = clockevents_program_event(dev, expires, now); >> >> if (!ret || !force) >> >> return ret; >> >> >> >> dev->retries++; >> >> .... >> >> now = ktime_get(); >> >> expires = ktime_add_ns(now, dev->min_delta_ns); >> >> } >> >> clockevents_program_event(dev, expires, now); >> >> >> >> delta = ktime_to_ns(ktime_sub(expires, now)); >> >> >> >> if (delta <= 0) >> >> return -ETIME; >> >> >> >> when the bc timer interrupt arrives, which means the last local timer >> >> expires too. so, >> >> clockevents_program_event will return -ETIME, which will cause the >> >> dev->retries++ >> >> when retry to program the expired timer. >> >> >> >> Even under the worst case, after the re-program the expired timer, >> >> then CPU enter idle >> >> quickly before the re-progam timer expired, it will make system >> >> ping-pang forever, >> > >> > That's nonsense. >> >> I don't think so. >> >> > >> > The timer IPI brings the core out of the deep idle state. >> > >> > So after returning from enter_wait_mode() and after calling >> > clockevents_notify() it returns from arch_idle() to cpu_idle(). >> > >> > In cpu_idle() interrupts are reenabled, so the timer IPI handler is >> > invoked. That calls the event_handler of the per cpu local clockevent >> > device (the one which stops in C3). That ends up in the generic timer >> > code which expires timers and reprograms the local clock event device >> > with the next pending timer. >> > >> > So you cannot go idle again, before the expired timers of this event >> > are handled and their callbacks invoked. >> >> That's true for the CPUs which not response to the global timer interrupt. >> Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) >> The global timer device will keep running even in the deep idle mode, so, it >> can be used as the broadcast timer device, and the interrupt of this device >> just raised to CPU0 when the timer expired, then, CPU0 will broadcast the >> IPI timer to other CPUs which is in deep idle mode. 
>> >> So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle >> state, after running clockevents_notify() it returns from arch_idle() >> to cpu_idle(), >> then local_irq_enable(), the IPI handler will be invoked and handle >> the expires times >> and re-program the next pending timer. >> >> But, that's not true for the CPU0. The flow for CPU0 is: >> the global timer interrupt wakes up CPU0 and then call: >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); >> in the function tick_broadcast_oneshot_control(), > > For my own understanding: at this point in time CPU0 local timer is > also reprogrammed, with min_delta (ie 1us) if I got your description > right. yes, right. > >> >> After return from clockevents_notify(), it will return to cpu_idle >> from arch_idle, >> then local_irq_enable(), the CPU0 will response to the global timer >> interrupt, and >> call the interrupt handler: tick_handle_oneshot_broadcast() >> >> static void tick_handle_oneshot_broadcast(struct clock_event_device *dev) >> { >> struct tick_device *td; >> ktime_t now, next_event; >> int cpu; >> >> raw_spin_lock(&tick_broadcast_lock); >> again: >> dev->next_event.tv64 = KTIME_MAX; >> next_event.tv64 = KTIME_MAX; >> cpumask_clear(to_cpumask(tmpmask)); >> now = ktime_get(); >> /* Find all expired events */ >> for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { >> td = &per_cpu(tick_cpu_device, cpu); >> if (td->evtdev->next_event.tv64 <= now.tv64) >> cpumask_set_cpu(cpu, to_cpumask(tmpmask)); >> else if (td->evtdev->next_event.tv64 < next_event.tv64) >> next_event.tv64 = td->evtdev->next_event.tv64; >> } >> >> /* >> * Wakeup the cpus which have an expired event. >> */ >> tick_do_broadcast(to_cpumask(tmpmask)); >> ... >> } >> >> since cpu0 has been removed from the tick_get_broadcast_oneshot_mask(), and if >> all the other cpu1/2/3 state in idle, and no expired timers, then the >> tmpmask will be 0, >> when call tick_do_broadcast(). >> >> static void tick_do_broadcast(struct cpumask *mask) >> { >> int cpu = smp_processor_id(); >> struct tick_device *td; >> >> /* >> * Check, if the current cpu is in the mask >> */ >> if (cpumask_test_cpu(cpu, mask)) { >> cpumask_clear_cpu(cpu, mask); >> td = &per_cpu(tick_cpu_device, cpu); >> td->evtdev->event_handler(td->evtdev); >> } >> >> if (!cpumask_empty(mask)) { >> /* >> * It might be necessary to actually check whether the devices >> * have different broadcast functions. For now, just use the >> * one of the first device. This works as long as we have this >> * misfeature only on x86 (lapic) >> */ >> td = &per_cpu(tick_cpu_device, cpumask_first(mask)); >> td->evtdev->broadcast(mask); >> } >> } >> >> If the mask is empty, then tick_do_broadcast will do nothing and return, which >> will make cpu0 enter idle quickly, and then system will ping-pang there. > > This means that the local timer reprogrammed above (to actually emulate the > expired local timer on CPU0, likely to be set to min_delta == 1us) does not > have time to expire before the idle thread disables IRQs and goes idle again. > > Is this a correct description of what's happening ? yes, correct. > > Thanks a lot, > Lorenzo > ^ permalink raw reply [flat|nested] 25+ messages in thread
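As a summary of the exchange above, the CPU0 sequence that Jason confirms can be condensed into the following annotated call trace (not compilable code, just the order of events, with min_delta assumed to be the 1 us discussed above):

	/* broadcast (global) timer fires; CPU0 wakes with IRQs still disabled */
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
		/* CPU0 is cleared from tick_broadcast_oneshot_mask and its
		 * already-expired local timer is force-reprogrammed to
		 * now + min_delta (1 us) */

	/* back in cpu_idle(): */
	local_irq_enable();
	tick_handle_oneshot_broadcast();
		/* CPU0 is no longer in the mask and no other CPU has an
		 * expired event, so tmpmask is empty and
		 * tick_do_broadcast() does nothing */

	/* the idle loop comes around again in less than 1 us: */
	local_irq_disable();
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
	enter_the_wait_mode();	/* broadcast timer fires again -> repeat */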
end of thread, other threads: [~2013-02-25 13:34 UTC | newest]

Thread overview: 25+ messages (links below jump to the message on this page):

2013-02-20 11:16 too many timer retries happen when do local timer swtich with broadcast timer Jason Liu
2013-02-20 13:33 ` Thomas Gleixner
2013-02-21 6:16 ` Jason Liu
2013-02-21 9:36 ` Thomas Gleixner
2013-02-21 10:50 ` Jason Liu
2013-02-21 13:48 ` Thomas Gleixner
2013-02-21 15:12 ` Santosh Shilimkar
2013-02-21 22:19 ` Thomas Gleixner
2013-02-22 10:07 ` Santosh Shilimkar
2013-02-22 10:24 ` Thomas Gleixner
2013-02-22 10:30 ` Santosh Shilimkar
2013-02-22 10:31 ` Lorenzo Pieralisi
2013-02-22 11:02 ` Santosh Shilimkar
2013-02-22 12:07 ` Thomas Gleixner
2013-02-22 14:48 ` Lorenzo Pieralisi
2013-02-22 15:03 ` Thomas Gleixner
2013-02-22 15:26 ` Lorenzo Pieralisi
2013-02-22 18:52 ` Thomas Gleixner
2013-02-25 6:12 ` Santosh Shilimkar
2013-02-25 6:38 ` Jason Liu
2013-02-25 13:34 ` Lorenzo Pieralisi
2013-02-22 10:28 ` Lorenzo Pieralisi
2013-02-22 10:26 ` Jason Liu
2013-02-21 10:35 ` Lorenzo Pieralisi
2013-02-21 10:49 ` Jason Liu