* too many timer retries happen when do local timer swtich with broadcast timer
From: Jason Liu @ 2013-02-20 11:16 UTC
To: linux-arm-kernel

Hi,

Sorry for the long email, please be patient... thanks.

I have seen too many timer retries when switching between the local timer and
the broadcast timer on an ARM Cortex-A9 SMP (4 cores). See the following log
(e.g. retries: 36383):

root@~$ cat /proc/timer_list
Timer List Version: v0.6
HRTIMER_MAX_CLOCK_BASES: 3
now at 3297691988044 nsecs

cpu: 0
 clock 0:
  .base:       8c0084b8
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
[...]
Tick Device: mode:     1
Broadcast device
Clock Event Device: mxc_timer1
 max_delta_ns:   1431655863333
 min_delta_ns:   85000
 mult:           12884901
 shift:          32
 mode:           3
 next_event:     3297700000000 nsecs
 set_next_event: v2_set_next_event
 set_mode:       mxc_set_mode
 event_handler:  tick_handle_oneshot_broadcast
 retries:        92
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 0000000a

Tick Device: mode:     1
Per CPU device: 0
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           3
 next_event:     3297700000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        36383

Tick Device: mode:     1
Per CPU device: 1
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           1
 next_event:     3297720000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        6510

Tick Device: mode:     1
Per CPU device: 2
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           3
 next_event:     3297700000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        790

Tick Device: mode:     1
Per CPU device: 3
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           1
 next_event:     3298000000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        6873

Since on our platform the local timer stops when the CPU enters the C3 state,
we need to switch from the local timer to the broadcast timer when entering
that state and switch back when exiting it. The code looks like this:

void arch_idle(void)
{
	....
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);

	enter_the_wait_mode();

	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
}

When the broadcast timer interrupt arrives (this interrupt just wakes up the
ARM core; the core has no chance to handle it yet since local irqs are
disabled -- in fact they are disabled in cpu_idle() in
arch/arm/kernel/process.c), it wakes up the CPU, which runs:

clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); ->
	tick_broadcast_oneshot_control(...);
	->
	tick_program_event(dev->next_event, 1);
	->
	tick_dev_program_event(dev, expires, force);
	->
	for (i = 0;;) {
		int ret = clockevents_program_event(dev, expires, now);
		if (!ret || !force)
			return ret;

		dev->retries++;
		....
		now = ktime_get();
		expires = ktime_add_ns(now, dev->min_delta_ns);
	}

clockevents_program_event(dev, expires, now);

	delta = ktime_to_ns(ktime_sub(expires, now));

	if (delta <= 0)
		return -ETIME;

When the broadcast timer interrupt arrives, the last programmed local timer
has expired too. So clockevents_program_event() returns -ETIME, which causes
dev->retries++ while retrying to program the already-expired timer.

In the worst case, after reprogramming the expired timer, the CPU enters idle
again before the reprogrammed timer expires, which makes the system ping-pong
forever:

switch to bc timer -> wait -> bc timer expires -> wakeup -> switch to loc timer
        ^                                                              |
        |                                                              v
        +----------- enter idle <- reprogram the expired loc timer <---+

I have run into this worst case on my project, and I think it is a common
issue on ARM platforms. How do you think we can fix this problem?

Thank you.

Best Regards,
Jason Liu
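To make the failure mode described above concrete, here is a minimal
user-space sketch of the forced-reprogram loop quoted in the report. It is
purely illustrative and not kernel code: the function names only loosely
mirror clockevents_program_event()/tick_dev_program_event(), and the
min_delta_ns value is simply taken from the local_timer entries in the dump.

/*
 * Each call with a stale expiry fails with -ETIME, bumps `retries` and pushes
 * the expiry out by min_delta_ns before trying again.  Every broadcast exit
 * whose local next_event already lies in the past therefore adds to the
 * retry count, and the idle ping-pong repeats this indefinitely.
 */
#include <stdio.h>
#include <stdint.h>
#include <errno.h>
#include <time.h>

static const uint64_t min_delta_ns = 1000;	/* from the local_timer dump */
static unsigned long retries;

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Stand-in for clockevents_program_event(): refuse expiries in the past. */
static int program_event(uint64_t expires, uint64_t now)
{
	if (expires <= now)
		return -ETIME;
	/* real code would write the hardware comparator here */
	return 0;
}

/* Stand-in for tick_dev_program_event(..., force=1). */
static void program_event_forced(uint64_t expires)
{
	for (;;) {
		uint64_t now = now_ns();

		if (!program_event(expires, now))
			return;
		retries++;			/* the counter from /proc/timer_list */
		expires = now + min_delta_ns;	/* retry just min_delta in the future */
	}
}

int main(void)
{
	/* Each iteration models one broadcast exit whose local next_event
	 * has already expired by the time we try to restore it. */
	for (int i = 0; i < 5; i++)
		program_event_forced(now_ns() - 1);

	printf("retries bumped to %lu\n", retries);
	return 0;
}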
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-20 11:16 too many timer retries happen when do local timer swtich with broadcast timer Jason Liu @ 2013-02-20 13:33 ` Thomas Gleixner 2013-02-21 6:16 ` Jason Liu 0 siblings, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-20 13:33 UTC (permalink / raw) To: linux-arm-kernel On Wed, 20 Feb 2013, Jason Liu wrote: > void arch_idle(void) > { > .... > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); > > enter_the_wait_mode(); > > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > } > > when the broadcast timer interrupt arrives(this interrupt just wakeup > the ARM, and ARM has no chance > to handle it since local irq is disabled. In fact it's disabled in > cpu_idle() of arch/arm/kernel/process.c) > > the broadcast timer interrupt will wake up the CPU and run: > > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> > tick_broadcast_oneshot_control(...); > -> > tick_program_event(dev->next_event, 1); > -> > tick_dev_program_event(dev, expires, force); > -> > for (i = 0;;) { > int ret = clockevents_program_event(dev, expires, now); > if (!ret || !force) > return ret; > > dev->retries++; > .... > now = ktime_get(); > expires = ktime_add_ns(now, dev->min_delta_ns); > } > clockevents_program_event(dev, expires, now); > > delta = ktime_to_ns(ktime_sub(expires, now)); > > if (delta <= 0) > return -ETIME; > > when the bc timer interrupt arrives, which means the last local timer > expires too. so, > clockevents_program_event will return -ETIME, which will cause the > dev->retries++ > when retry to program the expired timer. > > Even under the worst case, after the re-program the expired timer, > then CPU enter idle > quickly before the re-progam timer expired, it will make system > ping-pang forever, That's nonsense. The timer IPI brings the core out of the deep idle state. So after returning from enter_wait_mode() and after calling clockevents_notify() it returns from arch_idle() to cpu_idle(). In cpu_idle() interrupts are reenabled, so the timer IPI handler is invoked. That calls the event_handler of the per cpu local clockevent device (the one which stops in C3). That ends up in the generic timer code which expires timers and reprograms the local clock event device with the next pending timer. So you cannot go idle again, before the expired timers of this event are handled and their callbacks invoked. Now the reprogramming itself is a different issue. It's necessary as we have no information whether the wakeup was caused by the timer IPI or by something else. We might try to be clever and store the pending IPI in tick_handle_oneshot_broadcast() and avoid the reprogramming in that case. Completely untested patch below. Thanks, tglx Index: linux-2.6/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.orig/kernel/time/tick-broadcast.c +++ linux-2.6/kernel/time/tick-broadcast.c @@ -29,6 +29,7 @@ static struct tick_device tick_broadcast_device; /* FIXME: Use cpumask_var_t. 
*/ static DECLARE_BITMAP(tick_broadcast_mask, NR_CPUS); +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); static DECLARE_BITMAP(tmpmask, NR_CPUS); static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); static int tick_broadcast_force; @@ -417,9 +418,10 @@ again: /* Find all expired events */ for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { td = &per_cpu(tick_cpu_device, cpu); - if (td->evtdev->next_event.tv64 <= now.tv64) + if (td->evtdev->next_event.tv64 <= now.tv64) { cpumask_set_cpu(cpu, to_cpumask(tmpmask)); - else if (td->evtdev->next_event.tv64 < next_event.tv64) + set_bit(cpu, tick_broadcast_pending); + } else if (td->evtdev->next_event.tv64 < next_event.tv64) next_event.tv64 = td->evtdev->next_event.tv64; } @@ -493,7 +495,8 @@ void tick_broadcast_oneshot_control(unsi cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); - if (dev->next_event.tv64 != KTIME_MAX) + if (dev->next_event.tv64 != KTIME_MAX && + !test_and_clear_bit(cpu, tick_broadcast_pending)) tick_program_event(dev->next_event, 1); } } ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-20 13:33 ` Thomas Gleixner @ 2013-02-21 6:16 ` Jason Liu 2013-02-21 9:36 ` Thomas Gleixner 2013-02-21 10:35 ` Lorenzo Pieralisi 0 siblings, 2 replies; 25+ messages in thread From: Jason Liu @ 2013-02-21 6:16 UTC (permalink / raw) To: linux-arm-kernel 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: > On Wed, 20 Feb 2013, Jason Liu wrote: >> void arch_idle(void) >> { >> .... >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); >> >> enter_the_wait_mode(); >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> } >> >> when the broadcast timer interrupt arrives(this interrupt just wakeup >> the ARM, and ARM has no chance >> to handle it since local irq is disabled. In fact it's disabled in >> cpu_idle() of arch/arm/kernel/process.c) >> >> the broadcast timer interrupt will wake up the CPU and run: >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> >> tick_broadcast_oneshot_control(...); >> -> >> tick_program_event(dev->next_event, 1); >> -> >> tick_dev_program_event(dev, expires, force); >> -> >> for (i = 0;;) { >> int ret = clockevents_program_event(dev, expires, now); >> if (!ret || !force) >> return ret; >> >> dev->retries++; >> .... >> now = ktime_get(); >> expires = ktime_add_ns(now, dev->min_delta_ns); >> } >> clockevents_program_event(dev, expires, now); >> >> delta = ktime_to_ns(ktime_sub(expires, now)); >> >> if (delta <= 0) >> return -ETIME; >> >> when the bc timer interrupt arrives, which means the last local timer >> expires too. so, >> clockevents_program_event will return -ETIME, which will cause the >> dev->retries++ >> when retry to program the expired timer. >> >> Even under the worst case, after the re-program the expired timer, >> then CPU enter idle >> quickly before the re-progam timer expired, it will make system >> ping-pang forever, > > That's nonsense. I don't think so. > > The timer IPI brings the core out of the deep idle state. > > So after returning from enter_wait_mode() and after calling > clockevents_notify() it returns from arch_idle() to cpu_idle(). > > In cpu_idle() interrupts are reenabled, so the timer IPI handler is > invoked. That calls the event_handler of the per cpu local clockevent > device (the one which stops in C3). That ends up in the generic timer > code which expires timers and reprograms the local clock event device > with the next pending timer. > > So you cannot go idle again, before the expired timers of this event > are handled and their callbacks invoked. That's true for the CPUs which not response to the global timer interrupt. Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) The global timer device will keep running even in the deep idle mode, so, it can be used as the broadcast timer device, and the interrupt of this device just raised to CPU0 when the timer expired, then, CPU0 will broadcast the IPI timer to other CPUs which is in deep idle mode. So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle state, after running clockevents_notify() it returns from arch_idle() to cpu_idle(), then local_irq_enable(), the IPI handler will be invoked and handle the expires times and re-program the next pending timer. But, that's not true for the CPU0. 
The flow for CPU0 is: the global timer interrupt wakes up CPU0 and then call: clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); in the function tick_broadcast_oneshot_control(), After return from clockevents_notify(), it will return to cpu_idle from arch_idle, then local_irq_enable(), the CPU0 will response to the global timer interrupt, and call the interrupt handler: tick_handle_oneshot_broadcast() static void tick_handle_oneshot_broadcast(struct clock_event_device *dev) { struct tick_device *td; ktime_t now, next_event; int cpu; raw_spin_lock(&tick_broadcast_lock); again: dev->next_event.tv64 = KTIME_MAX; next_event.tv64 = KTIME_MAX; cpumask_clear(to_cpumask(tmpmask)); now = ktime_get(); /* Find all expired events */ for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { td = &per_cpu(tick_cpu_device, cpu); if (td->evtdev->next_event.tv64 <= now.tv64) cpumask_set_cpu(cpu, to_cpumask(tmpmask)); else if (td->evtdev->next_event.tv64 < next_event.tv64) next_event.tv64 = td->evtdev->next_event.tv64; } /* * Wakeup the cpus which have an expired event. */ tick_do_broadcast(to_cpumask(tmpmask)); ... } since cpu0 has been removed from the tick_get_broadcast_oneshot_mask(), and if all the other cpu1/2/3 state in idle, and no expired timers, then the tmpmask will be 0, when call tick_do_broadcast(). static void tick_do_broadcast(struct cpumask *mask) { int cpu = smp_processor_id(); struct tick_device *td; /* * Check, if the current cpu is in the mask */ if (cpumask_test_cpu(cpu, mask)) { cpumask_clear_cpu(cpu, mask); td = &per_cpu(tick_cpu_device, cpu); td->evtdev->event_handler(td->evtdev); } if (!cpumask_empty(mask)) { /* * It might be necessary to actually check whether the devices * have different broadcast functions. For now, just use the * one of the first device. This works as long as we have this * misfeature only on x86 (lapic) */ td = &per_cpu(tick_cpu_device, cpumask_first(mask)); td->evtdev->broadcast(mask); } } If the mask is empty, then tick_do_broadcast will do nothing and return, which will make cpu0 enter idle quickly, and then system will ping-pang there. switch to bc timer->wait->bc timer expires->wakeup->switch to loc timer-> | ^ |------<-enter idle <- reprogram the expired loc timer ---< We did see such things happen on the system which will make system stuck there and no any sched-tick handler running on the CPU0. And the easy way to reproduce it is: /* hack code: just for experiment */ diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index f113755..d142802 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -493,8 +493,12 @@ void tick_broadcast_oneshot_control(unsigned long reason) cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); - if (dev->next_event.tv64 != KTIME_MAX) - tick_program_event(dev->next_event, 1); + if (dev->next_event.tv64 != KTIME_MAX) { + if (cpu) + tick_program_event(dev->next_event, 1); + else + tick_program_event(ktime_add_ns(dev->next_event, 100000), 1); + } } } We need fix this common issue. Do you have good idea about how to fix it? > > > Now the reprogramming itself is a different issue. > > It's necessary as we have no information whether the wakeup was caused > by the timer IPI or by something else. > > We might try to be clever and store the pending IPI in > tick_handle_oneshot_broadcast() and avoid the reprogramming in that > case. 
Completely untested patch below. Thanks for the patch. > > Thanks, > > tglx > > Index: linux-2.6/kernel/time/tick-broadcast.c > =================================================================== > --- linux-2.6.orig/kernel/time/tick-broadcast.c > +++ linux-2.6/kernel/time/tick-broadcast.c > @@ -29,6 +29,7 @@ > static struct tick_device tick_broadcast_device; > /* FIXME: Use cpumask_var_t. */ > static DECLARE_BITMAP(tick_broadcast_mask, NR_CPUS); > +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); > static DECLARE_BITMAP(tmpmask, NR_CPUS); > static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); > static int tick_broadcast_force; > @@ -417,9 +418,10 @@ again: > /* Find all expired events */ > for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { > td = &per_cpu(tick_cpu_device, cpu); > - if (td->evtdev->next_event.tv64 <= now.tv64) > + if (td->evtdev->next_event.tv64 <= now.tv64) { > cpumask_set_cpu(cpu, to_cpumask(tmpmask)); > - else if (td->evtdev->next_event.tv64 < next_event.tv64) > + set_bit(cpu, tick_broadcast_pending); > + } else if (td->evtdev->next_event.tv64 < next_event.tv64) > next_event.tv64 = td->evtdev->next_event.tv64; > } > > @@ -493,7 +495,8 @@ void tick_broadcast_oneshot_control(unsi > cpumask_clear_cpu(cpu, > tick_get_broadcast_oneshot_mask()); > clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); > - if (dev->next_event.tv64 != KTIME_MAX) > + if (dev->next_event.tv64 != KTIME_MAX && > + !test_and_clear_bit(cpu, tick_broadcast_pending)) > tick_program_event(dev->next_event, 1); > } > } > I have tested the patch, it can fix the retries on the CPU1/2/3, but not on CPU0. see, cpu0: retries: 39042 root@~$ cat /proc/timer_list Timer List Version: v0.6 HRTIMER_MAX_CLOCK_BASES: 3 [...] Tick Device: mode: 1 Per CPU device: 0 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 3 next_event: 3295010000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 39042 Tick Device: mode: 1 Per CPU device: 1 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 1 next_event: 3295050000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 1 Tick Device: mode: 1 Per CPU device: 2 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 1 next_event: 10740407770000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 0 Tick Device: mode: 1 Per CPU device: 3 Clock Event Device: local_timer max_delta_ns: 8624432320 min_delta_ns: 1000 mult: 2138893713 shift: 32 mode: 1 next_event: 10737578200000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 0 Jason Liu > > > > > ^ permalink raw reply related [flat|nested] 25+ messages in thread
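The CPU0 numbers above fit the scenario Jason describes: the CPU that takes
the broadcast IRQ has already cleared itself out of the oneshot mask by the
time the handler runs, so the handler finds nobody to wake. Below is a toy,
single-threaded user-space model of that corner case; it is a sketch only,
the masks and expiry values are invented and nothing here is kernel-accurate.

#include <stdio.h>

#define NCPUS 4

int main(void)
{
	unsigned int oneshot_mask = 0;	/* CPUs currently in broadcast (deep idle) mode */
	unsigned int tmpmask = 0;	/* CPUs with expired events, to be woken by IPI */
	long now = 1000;
	long next_event[NCPUS] = { 1000, 2000, 3000, 4000 };	/* per-cpu expiries */
	int cpu;

	/* All four CPUs went into deep idle and entered broadcast mode. */
	for (cpu = 0; cpu < NCPUS; cpu++)
		oneshot_mask |= 1u << cpu;

	/*
	 * The broadcast IRQ fires on CPU0, which runs BROADCAST_EXIT and so
	 * removes itself from the oneshot mask *before* the handler runs.
	 */
	oneshot_mask &= ~1u;

	/* tick_handle_oneshot_broadcast(), reduced to the expired-event scan: */
	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (!(oneshot_mask & (1u << cpu)))
			continue;
		if (next_event[cpu] <= now)
			tmpmask |= 1u << cpu;
	}

	if (!tmpmask)
		printf("mask empty: nobody is woken, and CPU0's own expired event is never handled\n");
	else
		printf("waking CPUs 0x%x\n", tmpmask);

	return 0;
}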
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 6:16 ` Jason Liu @ 2013-02-21 9:36 ` Thomas Gleixner 2013-02-21 10:50 ` Jason Liu 2013-02-21 10:35 ` Lorenzo Pieralisi 1 sibling, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-21 9:36 UTC (permalink / raw) To: linux-arm-kernel On Thu, 21 Feb 2013, Jason Liu wrote: > 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: > > On Wed, 20 Feb 2013, Jason Liu wrote: > >> void arch_idle(void) > >> { > >> .... > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); > >> > >> enter_the_wait_mode(); > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > >> } > >> > >> when the broadcast timer interrupt arrives(this interrupt just wakeup > >> the ARM, and ARM has no chance > >> to handle it since local irq is disabled. In fact it's disabled in > >> cpu_idle() of arch/arm/kernel/process.c) > >> > >> the broadcast timer interrupt will wake up the CPU and run: > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> > >> tick_broadcast_oneshot_control(...); > >> -> > >> tick_program_event(dev->next_event, 1); > >> -> > >> tick_dev_program_event(dev, expires, force); > >> -> > >> for (i = 0;;) { > >> int ret = clockevents_program_event(dev, expires, now); > >> if (!ret || !force) > >> return ret; > >> > >> dev->retries++; > >> .... > >> now = ktime_get(); > >> expires = ktime_add_ns(now, dev->min_delta_ns); > >> } > >> clockevents_program_event(dev, expires, now); > >> > >> delta = ktime_to_ns(ktime_sub(expires, now)); > >> > >> if (delta <= 0) > >> return -ETIME; > >> > >> when the bc timer interrupt arrives, which means the last local timer > >> expires too. so, > >> clockevents_program_event will return -ETIME, which will cause the > >> dev->retries++ > >> when retry to program the expired timer. > >> > >> Even under the worst case, after the re-program the expired timer, > >> then CPU enter idle > >> quickly before the re-progam timer expired, it will make system > >> ping-pang forever, > > > > That's nonsense. > > I don't think so. > > > > > The timer IPI brings the core out of the deep idle state. > > > > So after returning from enter_wait_mode() and after calling > > clockevents_notify() it returns from arch_idle() to cpu_idle(). > > > > In cpu_idle() interrupts are reenabled, so the timer IPI handler is > > invoked. That calls the event_handler of the per cpu local clockevent > > device (the one which stops in C3). That ends up in the generic timer > > code which expires timers and reprograms the local clock event device > > with the next pending timer. > > > > So you cannot go idle again, before the expired timers of this event > > are handled and their callbacks invoked. > > That's true for the CPUs which not response to the global timer interrupt. > Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) > The global timer device will keep running even in the deep idle mode, so, it > can be used as the broadcast timer device, and the interrupt of this device > just raised to CPU0 when the timer expired, then, CPU0 will broadcast the > IPI timer to other CPUs which is in deep idle mode. > > So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle > state, after running clockevents_notify() it returns from arch_idle() > to cpu_idle(), > then local_irq_enable(), the IPI handler will be invoked and handle > the expires times > and re-program the next pending timer. > > But, that's not true for the CPU0. 
> The flow for CPU0 is:
> the global timer interrupt wakes up CPU0 and then call:
> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
>
> which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask());
> in the function tick_broadcast_oneshot_control(),

Now your explanation makes sense.

I have no fast solution for this, but I think that I have an idea how
to fix it. Stay tuned.

Thanks,

	tglx
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 9:36 ` Thomas Gleixner @ 2013-02-21 10:50 ` Jason Liu 2013-02-21 13:48 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Jason Liu @ 2013-02-21 10:50 UTC (permalink / raw) To: linux-arm-kernel 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: > On Thu, 21 Feb 2013, Jason Liu wrote: >> 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: >> > On Wed, 20 Feb 2013, Jason Liu wrote: >> >> void arch_idle(void) >> >> { >> >> .... >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); >> >> >> >> enter_the_wait_mode(); >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> } >> >> >> >> when the broadcast timer interrupt arrives(this interrupt just wakeup >> >> the ARM, and ARM has no chance >> >> to handle it since local irq is disabled. In fact it's disabled in >> >> cpu_idle() of arch/arm/kernel/process.c) >> >> >> >> the broadcast timer interrupt will wake up the CPU and run: >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> >> >> tick_broadcast_oneshot_control(...); >> >> -> >> >> tick_program_event(dev->next_event, 1); >> >> -> >> >> tick_dev_program_event(dev, expires, force); >> >> -> >> >> for (i = 0;;) { >> >> int ret = clockevents_program_event(dev, expires, now); >> >> if (!ret || !force) >> >> return ret; >> >> >> >> dev->retries++; >> >> .... >> >> now = ktime_get(); >> >> expires = ktime_add_ns(now, dev->min_delta_ns); >> >> } >> >> clockevents_program_event(dev, expires, now); >> >> >> >> delta = ktime_to_ns(ktime_sub(expires, now)); >> >> >> >> if (delta <= 0) >> >> return -ETIME; >> >> >> >> when the bc timer interrupt arrives, which means the last local timer >> >> expires too. so, >> >> clockevents_program_event will return -ETIME, which will cause the >> >> dev->retries++ >> >> when retry to program the expired timer. >> >> >> >> Even under the worst case, after the re-program the expired timer, >> >> then CPU enter idle >> >> quickly before the re-progam timer expired, it will make system >> >> ping-pang forever, >> > >> > That's nonsense. >> >> I don't think so. >> >> > >> > The timer IPI brings the core out of the deep idle state. >> > >> > So after returning from enter_wait_mode() and after calling >> > clockevents_notify() it returns from arch_idle() to cpu_idle(). >> > >> > In cpu_idle() interrupts are reenabled, so the timer IPI handler is >> > invoked. That calls the event_handler of the per cpu local clockevent >> > device (the one which stops in C3). That ends up in the generic timer >> > code which expires timers and reprograms the local clock event device >> > with the next pending timer. >> > >> > So you cannot go idle again, before the expired timers of this event >> > are handled and their callbacks invoked. >> >> That's true for the CPUs which not response to the global timer interrupt. >> Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) >> The global timer device will keep running even in the deep idle mode, so, it >> can be used as the broadcast timer device, and the interrupt of this device >> just raised to CPU0 when the timer expired, then, CPU0 will broadcast the >> IPI timer to other CPUs which is in deep idle mode. 
>> >> So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle >> state, after running clockevents_notify() it returns from arch_idle() >> to cpu_idle(), >> then local_irq_enable(), the IPI handler will be invoked and handle >> the expires times >> and re-program the next pending timer. >> >> But, that's not true for the CPU0. The flow for CPU0 is: >> the global timer interrupt wakes up CPU0 and then call: >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); >> in the function tick_broadcast_oneshot_control(), > > Now your explanation makes sense. > > I have no fast solution for this, but I think that I have an idea how > to fix it. Stay tuned. Thanks Thomas, wait for your fix. :) > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 10:50 ` Jason Liu @ 2013-02-21 13:48 ` Thomas Gleixner 2013-02-21 15:12 ` Santosh Shilimkar 2013-02-22 10:26 ` Jason Liu 0 siblings, 2 replies; 25+ messages in thread From: Thomas Gleixner @ 2013-02-21 13:48 UTC (permalink / raw) To: linux-arm-kernel Jason, On Thu, 21 Feb 2013, Jason Liu wrote: > 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: > > Now your explanation makes sense. > > > > I have no fast solution for this, but I think that I have an idea how > > to fix it. Stay tuned. > > Thanks Thomas, wait for your fix. :) find below a completely untested patch, which should address that issue. Sigh. Stopping the cpu local timer in deep idle is known to be an idiotic design decision for 10+ years. I'll never understand why ARM vendors insist on implementing the same stupidity over and over. Thanks, tglx Index: linux-2.6/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.orig/kernel/time/tick-broadcast.c +++ linux-2.6/kernel/time/tick-broadcast.c @@ -397,6 +397,8 @@ int tick_resume_broadcast(void) /* FIXME: use cpumask_var_t. */ static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS); +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); +static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS); /* * Exposed for debugging: see timer_list.c @@ -453,12 +455,24 @@ again: /* Find all expired events */ for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { td = &per_cpu(tick_cpu_device, cpu); - if (td->evtdev->next_event.tv64 <= now.tv64) + if (td->evtdev->next_event.tv64 <= now.tv64) { cpumask_set_cpu(cpu, to_cpumask(tmpmask)); - else if (td->evtdev->next_event.tv64 < next_event.tv64) + /* + * Mark the remote cpu in the pending mask, so + * it can avoid reprogramming the cpu local + * timer in tick_broadcast_oneshot_control(). + */ + set_bit(cpu, tick_broadcast_pending); + } else if (td->evtdev->next_event.tv64 < next_event.tv64) next_event.tv64 = td->evtdev->next_event.tv64; } + /* Take care of enforced broadcast requests */ + for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) { + set_bit(cpu, tmpmask); + clear_bit(cpu, tick_force_broadcast_mask); + } + /* * Wakeup the cpus which have an expired event. */ @@ -494,6 +508,7 @@ void tick_broadcast_oneshot_control(unsi struct clock_event_device *bc, *dev; struct tick_device *td; unsigned long flags; + ktime_t now; int cpu; /* @@ -518,6 +533,8 @@ void tick_broadcast_oneshot_control(unsi raw_spin_lock_irqsave(&tick_broadcast_lock, flags); if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) { + WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending)); + WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask)); if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) { cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN); @@ -529,10 +546,63 @@ void tick_broadcast_oneshot_control(unsi cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); - if (dev->next_event.tv64 != KTIME_MAX) - tick_program_event(dev->next_event, 1); + if (dev->next_event.tv64 == KTIME_MAX) + goto out; + /* + * The cpu handling the broadcast timer marked + * this cpu in the broadcast pending mask and + * fired the broadcast IPI. So we are going to + * handle the expired event anyway via the + * broadcast IPI handler. No need to reprogram + * the timer with an already expired event. 
+ */ + if (test_and_clear_bit(cpu, tick_broadcast_pending)) + goto out; + /* + * If the pending bit is not set, then we are + * either the CPU handling the broadcast + * interrupt or we got woken by something else. + * + * We are not longer in the broadcast mask, so + * if the cpu local expiry time is already + * reached, we would reprogram the cpu local + * timer with an already expired event. + * + * This can lead to a ping-pong when we return + * to idle and therefor rearm the broadcast + * timer before the cpu local timer was able + * to fire. This happens because the forced + * reprogramming makes sure that the event + * will happen in the future and depending on + * the min_delta setting this might be far + * enough out that the ping-pong starts. + * + * If the cpu local next_event has expired + * then we know that the broadcast timer + * next_event has expired as well and + * broadcast is about to be handled. So we + * avoid reprogramming and enforce that the + * broadcast handler, which did not run yet, + * will invoke the cpu local handler. + * + * We cannot call the handler directly from + * here, because we might be in a NOHZ phase + * and we did not go through the irq_enter() + * nohz fixups. + */ + now = ktime_get(); + if (dev->next_event.tv64 <= now.tv64) { + set_bit(cpu, tick_force_broadcast_mask); + goto out; + } + /* + * We got woken by something else. Reprogram + * the cpu local timer device. + */ + tick_program_event(dev->next_event, 1); } } +out: raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); } ^ permalink raw reply [flat|nested] 25+ messages in thread
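To see how the two new masks in the patch above are meant to interact, here
is a toy, single-threaded user-space model of the protocol it introduces.
This is only a sketch: locking, the actual IPI and the shutdown path are
omitted, and the event times and CPU ordering are invented for illustration.

#include <stdio.h>

#define NCPUS 4

static unsigned int oneshot_mask, pending_mask, force_mask;
static long next_event[NCPUS];

/* Model of tick_handle_oneshot_broadcast() on the CPU that took the IRQ. */
static void broadcast_handler(long now)
{
	unsigned int wake = 0;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (!(oneshot_mask & (1u << cpu)))
			continue;
		if (next_event[cpu] <= now) {
			wake |= 1u << cpu;
			pending_mask |= 1u << cpu;	/* the IPI will handle it */
		}
	}
	/* take care of enforced requests from CPUs that raced out of broadcast mode */
	wake |= force_mask;
	force_mask = 0;

	printf("broadcast handler: waking CPUs 0x%x\n", wake);
}

/* Model of the BROADCAST_EXIT path of tick_broadcast_oneshot_control(). */
static void broadcast_exit(int cpu, long now)
{
	oneshot_mask &= ~(1u << cpu);

	if (pending_mask & (1u << cpu)) {
		pending_mask &= ~(1u << cpu);
		printf("cpu%d: IPI pending, no reprogram needed\n", cpu);
	} else if (next_event[cpu] <= now) {
		force_mask |= 1u << cpu;	/* let the broadcast handler run us */
		printf("cpu%d: event already expired, forcing broadcast\n", cpu);
	} else {
		printf("cpu%d: reprogramming local timer for %ld\n",
		       cpu, next_event[cpu]);
	}
}

int main(void)
{
	long now = 1000;

	next_event[0] = 1000;	/* broadcast-IRQ cpu, its own event has expired */
	next_event[1] = 1000;	/* remote cpu with an expired event */
	next_event[2] = 5000;	/* remote cpus with future events */
	next_event[3] = 9000;
	oneshot_mask = 0xf;

	broadcast_exit(0, now);	/* cpu0 leaves broadcast mode first ... */
	broadcast_handler(now);	/* ... then services the broadcast IRQ */
	broadcast_exit(1, now);	/* woken by the IPI: pending bit is set */
	broadcast_exit(2, now);	/* spurious wakeup: genuine reprogram */
	return 0;
}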
* too many timer retries happen when do local timer swtich with broadcast timer
From: Santosh Shilimkar @ 2013-02-21 15:12 UTC
To: linux-arm-kernel

Thomas,

On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> Jason,
>
> On Thu, 21 Feb 2013, Jason Liu wrote:
>> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
>>> Now your explanation makes sense.
>>>
>>> I have no fast solution for this, but I think that I have an idea how
>>> to fix it. Stay tuned.
>>
>> Thanks Thomas, wait for your fix. :)
>
> find below a completely untested patch, which should address that issue.
>
After looking at the thread, I tried to see the issue on OMAP and could
see the same issue as Jason. Your patch fixes the retries on both CPUs
on my dual-core machine, so you can use my Tested-by if you need one.

Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>

Thanks for the patch. And thanks to Jason for spotting the issue.

Regards,
Santosh
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 15:12 ` Santosh Shilimkar @ 2013-02-21 22:19 ` Thomas Gleixner 2013-02-22 10:07 ` Santosh Shilimkar 2013-02-22 10:28 ` Lorenzo Pieralisi 0 siblings, 2 replies; 25+ messages in thread From: Thomas Gleixner @ 2013-02-21 22:19 UTC (permalink / raw) To: linux-arm-kernel On Thu, 21 Feb 2013, Santosh Shilimkar wrote: > On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote: > > find below a completely untested patch, which should address that issue. > > > After looking at the thread, I tried to see the issue on OMAP and could > see the same issue as Jason. That's interesting. We have the same issue on x86 since 2007 and nobody noticed ever. It's basically the same problem there, but it seems that on x86 getting out of those low power states is way slower than the minimal reprogramming delta which is used to enforce the local timer to fire after the wakeup. I'm still amazed that as Jason stated a 1us reprogramming delta is sufficient to get this ping-pong going. I somehow doubt that, but maybe ARM is really that fast :) > Your patch fixes the retries on both CPUs on my dual core machine. So > you use my tested by if you need one. They are always welcome. > Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> > > Thanks for the patch. > And thanks to Jason for spotting the issue. And for coping with my initial inability to parse his report! Thanks, tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 22:19 ` Thomas Gleixner @ 2013-02-22 10:07 ` Santosh Shilimkar 2013-02-22 10:24 ` Thomas Gleixner 2013-02-22 10:28 ` Lorenzo Pieralisi 1 sibling, 1 reply; 25+ messages in thread From: Santosh Shilimkar @ 2013-02-22 10:07 UTC (permalink / raw) To: linux-arm-kernel Thomas, On Friday 22 February 2013 03:49 AM, Thomas Gleixner wrote: > On Thu, 21 Feb 2013, Santosh Shilimkar wrote: >> On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote: >>> find below a completely untested patch, which should address that issue. >>> >> After looking at the thread, I tried to see the issue on OMAP and could >> see the same issue as Jason. > > That's interesting. We have the same issue on x86 since 2007 and > nobody noticed ever. It's basically the same problem there, but it > seems that on x86 getting out of those low power states is way slower > than the minimal reprogramming delta which is used to enforce the > local timer to fire after the wakeup. > > I'm still amazed that as Jason stated a 1us reprogramming delta is > sufficient to get this ping-pong going. I somehow doubt that, but > maybe ARM is really that fast :) > >> Your patch fixes the retries on both CPUs on my dual core machine. So >> you use my tested by if you need one. > > They are always welcome. > BTW, Lorenzo off-list mentioned to me about warning in boot-up which I missed while testing your patch. It will take bit more time for me to look into it and hence thought of reporting it. [ 2.186126] ------------[ cut here ]------------ [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 tick_broadcast_oneshot_control+0x1c0/0x21c() [ 2.200622] Modules linked in: [ 2.203826] [<c001bfe4>] (unwind_backtrace+0x0/0xf0) from [<c0047d6c>] (warn_slowpath_common+0x4c/0x64) [ 2.213653] [<c0047d6c>] (warn_slowpath_common+0x4c/0x64) from [<c0047da0>] (warn_slowpath_null+0x1c/0x24) [ 2.223754] [<c0047da0>] (warn_slowpath_null+0x1c/0x24) from [<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c) [ 2.234924] [<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c) from [<c00928dc>] (tick_notify+0x23c/0x42c) [ 2.245666] [<c00928dc>] (tick_notify+0x23c/0x42c) from [<c0539a3c>] (notifier_call_chain+0x44/0x84) [ 2.255218] [<c0539a3c>] (notifier_call_chain+0x44/0x84) from [<c0071068>] (raw_notifier_call_chain+0x18/0x20) [ 2.265686] [<c0071068>] (raw_notifier_call_chain+0x18/0x20) from [<c0091c70>] (clockevents_notify+0x2c/0x174) [ 2.276123] [<c0091c70>] (clockevents_notify+0x2c/0x174) from [<c0035294>] (omap_enter_idle_smp+0x3c/0x120) [ 2.286315] [<c0035294>] (omap_enter_idle_smp+0x3c/0x120) from [<c042e504>] (cpuidle_enter+0x14/0x18) [ 2.295928] [<c042e504>] (cpuidle_enter+0x14/0x18) from [<c042ef14>] (cpuidle_wrap_enter+0x34/0xa0) [ 2.305389] [<c042ef14>] (cpuidle_wrap_enter+0x34/0xa0) from [<c042eb20>] (cpuidle_idle_call+0xe0/0x328) [ 2.315307] [<c042eb20>] (cpuidle_idle_call+0xe0/0x328) from [<c0015100>] (cpu_idle+0x8c/0x11c) [ 2.324401] [<c0015100>] (cpu_idle+0x8c/0x11c) from [<c073d7ac>] (start_kernel+0x2b0/0x300) [ 2.333129] ---[ end trace 6fe1f7b4606a9e20 ]--- ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer
From: Thomas Gleixner @ 2013-02-22 10:24 UTC
To: linux-arm-kernel

On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> BTW, Lorenzo off-list mentioned to me about warning in boot-up
> which I missed while testing your patch. It will take bit more
> time for me to look into it and hence thought of reporting it.
>
> [    2.186126] ------------[ cut here ]------------
> [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> tick_broadcast_oneshot_control+0x1c0/0x21c()

Which one is that? tick_broadcast_pending or tick_force_broadcast_mask?
* too many timer retries happen when do local timer swtich with broadcast timer
From: Santosh Shilimkar @ 2013-02-22 10:30 UTC
To: linux-arm-kernel

On Friday 22 February 2013 03:54 PM, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
>> BTW, Lorenzo off-list mentioned to me about warning in boot-up
>> which I missed while testing your patch. It will take bit more
>> time for me to look into it and hence thought of reporting it.
>>
>> [    2.186126] ------------[ cut here ]------------
>> [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
>> tick_broadcast_oneshot_control+0x1c0/0x21c()
>
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
>
The force broadcast mask one, coming from below:

	WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));

Regards,
Santosh
* too many timer retries happen when do local timer swtich with broadcast timer
From: Lorenzo Pieralisi @ 2013-02-22 10:31 UTC
To: linux-arm-kernel

On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> > BTW, Lorenzo off-list mentioned to me about warning in boot-up
> > which I missed while testing your patch. It will take bit more
> > time for me to look into it and hence thought of reporting it.
> >
> > [    2.186126] ------------[ cut here ]------------
> > [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> > tick_broadcast_oneshot_control+0x1c0/0x21c()
>
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?

It is the tick_force_broadcast_mask, and I think that's because on all
systems we are testing, the broadcast timer IRQ is a thundering herd:
all CPUs get out of idle at once and try to get out of broadcast mode
at more or less the same time.

Lorenzo
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 10:31 ` Lorenzo Pieralisi @ 2013-02-22 11:02 ` Santosh Shilimkar 2013-02-22 12:07 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Santosh Shilimkar @ 2013-02-22 11:02 UTC (permalink / raw) To: linux-arm-kernel On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote: > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote: >> On Fri, 22 Feb 2013, Santosh Shilimkar wrote: >>> BTW, Lorenzo off-list mentioned to me about warning in boot-up >>> which I missed while testing your patch. It will take bit more >>> time for me to look into it and hence thought of reporting it. >>> >>> [ 2.186126] ------------[ cut here ]------------ >>> [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 >>> tick_broadcast_oneshot_control+0x1c0/0x21c() >> >> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ? > > It is the tick_force_broadcast_mask and I think that's because on all > systems we are testing, the broadcast timer IRQ is a thundering herd, > all CPUs get out of idle at once and try to get out of broadcast mode > at more or less the same time. > So the issue comes ups only when the idle state used where CPU wakeup more or less at same time as Lorenzo mentioned. I have two platforms where I could test the patch and see the issue only with one platform. Yesterday I didn't notice the warning because it wasn't seen on that platform :-) OMAP4 idle entry and exit in deep state is staggered between CPUs and hence the warning isn't seen. On OMAP5 though, there is an additional C-state where idle entry/exit for CPU isn't staggered and I see the issue in that case. Actually the broad-cast code doesn't expect such a behavior from CPUs since only the broad-cast affine CPU should wake up and rest of the CPU should be woken up by the broad-cast IPIs. Regards, Santosh ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 11:02 ` Santosh Shilimkar @ 2013-02-22 12:07 ` Thomas Gleixner 2013-02-22 14:48 ` Lorenzo Pieralisi 0 siblings, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-22 12:07 UTC (permalink / raw) To: linux-arm-kernel On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote: > > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote: > > > On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > > > > BTW, Lorenzo off-list mentioned to me about warning in boot-up > > > > which I missed while testing your patch. It will take bit more > > > > time for me to look into it and hence thought of reporting it. > > > > > > > > [ 2.186126] ------------[ cut here ]------------ > > > > [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 > > > > tick_broadcast_oneshot_control+0x1c0/0x21c() > > > > > > Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ? > > > > It is the tick_force_broadcast_mask and I think that's because on all > > systems we are testing, the broadcast timer IRQ is a thundering herd, > > all CPUs get out of idle at once and try to get out of broadcast mode > > at more or less the same time. > > > So the issue comes ups only when the idle state used where CPU wakeup > more or less at same time as Lorenzo mentioned. I have two platforms > where I could test the patch and see the issue only with one platform. > > Yesterday I didn't notice the warning because it wasn't seen on that > platform :-) OMAP4 idle entry and exit in deep state is staggered > between CPUs and hence the warning isn't seen. On OMAP5 though, > there is an additional C-state where idle entry/exit for CPU > isn't staggered and I see the issue in that case. > > Actually the broad-cast code doesn't expect such a behavior > from CPUs since only the broad-cast affine CPU should wake > up and rest of the CPU should be woken up by the broad-cast > IPIs. That's what I feared. We might have the same issue on x86, depending on the cpu model. But thinking more about it. It's actually not a real problem, just pointless burning of cpu cycles. So on the CPU which gets woken along with the target CPU of the broadcast the following happens: deep_idle() <-- spurious wakeup broadcast_exit() set forced bit enable interrupts <-- Nothing happens disable interrupts broadcast_enter() <-- Here we observe the forced bit is set deep_idle() Now after that the target CPU of the broadcast runs the broadcast handler and finds the other CPU in both the broadcast and the forced mask, sends the IPI and stuff gets back to normal. So it's not actually harmful, just more evidence for the theory, that hardware designers have access to very special drug supplies. Now we could make use of that and avoid going deep idle just to come back right away via the IPI. Unfortunately the notification thingy has no return value, but we can fix that. To confirm that theory, could you please try the hack below and add some instrumentation (trace_printk)? 
Thanks, tglx Index: linux-2.6/arch/arm/kernel/process.c =================================================================== --- linux-2.6.orig/arch/arm/kernel/process.c +++ linux-2.6/arch/arm/kernel/process.c @@ -199,7 +199,7 @@ void cpu_idle(void) #ifdef CONFIG_PL310_ERRATA_769419 wmb(); #endif - if (hlt_counter) { + if (hlt_counter || tick_check_broadcast_pending()) { local_irq_enable(); cpu_relax(); } else if (!need_resched()) { Index: linux-2.6/include/linux/clockchips.h =================================================================== --- linux-2.6.orig/include/linux/clockchips.h +++ linux-2.6/include/linux/clockchips.h @@ -170,6 +170,12 @@ extern void tick_broadcast(const struct extern int tick_receive_broadcast(void); #endif +#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) +extern int tick_check_broadcast_pending(void); +#else +static inline int tick_check_broadcast_pending(void) { return 0; } +#endif + #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void clockevents_notify(unsigned long reason, void *arg); #else Index: linux-2.6/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.orig/kernel/time/tick-broadcast.c +++ linux-2.6/kernel/time/tick-broadcast.c @@ -418,6 +418,15 @@ static int tick_broadcast_set_event(ktim return clockevents_program_event(bc, expires, force); } +/* + * Called before going idle with interrupts disabled. Checks whether a + * broadcast event from the other core is about to happen. + */ +int tick_check_broadcast_pending(void) +{ + return test_bit(smp_processor_id(), tick_force_broadcast_mask); +} + int tick_resume_broadcast_oneshot(struct clock_event_device *bc) { clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT); ^ permalink raw reply [flat|nested] 25+ messages in thread
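As a rough illustration of what the hack above buys, here is a purely
hypothetical, single-threaded user-space model (nothing kernel-accurate): a
CPU whose force bit was set on BROADCAST_EXIT can notice this in the idle
loop and avoid diving back into the deep state just to be kicked straight
back out by the broadcast IPI.

#include <stdio.h>
#include <stdbool.h>

static bool force_bit;		/* stands in for tick_force_broadcast_mask */

static bool check_broadcast_pending(void)
{
	return force_bit;
}

/* Model of BROADCAST_EXIT when the local next_event already lies in the past. */
static void broadcast_exit_with_expired_event(void)
{
	force_bit = true;
}

static void idle_loop_iteration(void)
{
	if (check_broadcast_pending()) {
		printf("broadcast IPI imminent: stay out of deep idle, just wait\n");
		return;
	}
	printf("enter deep idle (local timer stops, broadcast armed)\n");
}

int main(void)
{
	idle_loop_iteration();			/* normal entry */
	broadcast_exit_with_expired_event();	/* spurious wakeup path */
	idle_loop_iteration();			/* now the deep state is skipped */
	return 0;
}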
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 12:07 ` Thomas Gleixner @ 2013-02-22 14:48 ` Lorenzo Pieralisi 2013-02-22 15:03 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-22 14:48 UTC (permalink / raw) To: linux-arm-kernel On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > > > On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote: > > > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote: > > > > On Fri, 22 Feb 2013, Santosh Shilimkar wrote: > > > > > BTW, Lorenzo off-list mentioned to me about warning in boot-up > > > > > which I missed while testing your patch. It will take bit more > > > > > time for me to look into it and hence thought of reporting it. > > > > > > > > > > [ 2.186126] ------------[ cut here ]------------ > > > > > [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 > > > > > tick_broadcast_oneshot_control+0x1c0/0x21c() > > > > > > > > Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ? > > > > > > It is the tick_force_broadcast_mask and I think that's because on all > > > systems we are testing, the broadcast timer IRQ is a thundering herd, > > > all CPUs get out of idle at once and try to get out of broadcast mode > > > at more or less the same time. > > > > > So the issue comes ups only when the idle state used where CPU wakeup > > more or less at same time as Lorenzo mentioned. I have two platforms > > where I could test the patch and see the issue only with one platform. > > > > Yesterday I didn't notice the warning because it wasn't seen on that > > platform :-) OMAP4 idle entry and exit in deep state is staggered > > between CPUs and hence the warning isn't seen. On OMAP5 though, > > there is an additional C-state where idle entry/exit for CPU > > isn't staggered and I see the issue in that case. > > > > Actually the broad-cast code doesn't expect such a behavior > > from CPUs since only the broad-cast affine CPU should wake > > up and rest of the CPU should be woken up by the broad-cast > > IPIs. > > That's what I feared. We might have the same issue on x86, depending > on the cpu model. > > But thinking more about it. It's actually not a real problem, just > pointless burning of cpu cycles. > > So on the CPU which gets woken along with the target CPU of the > broadcast the following happens: > > deep_idle() > <-- spurious wakeup > broadcast_exit() > set forced bit > > enable interrupts > > <-- Nothing happens > > disable interrupts > > broadcast_enter() > <-- Here we observe the forced bit is set > deep_idle() > > Now after that the target CPU of the broadcast runs the broadcast > handler and finds the other CPU in both the broadcast and the forced > mask, sends the IPI and stuff gets back to normal. > > So it's not actually harmful, just more evidence for the theory, that > hardware designers have access to very special drug supplies. > > Now we could make use of that and avoid going deep idle just to come > back right away via the IPI. Unfortunately the notification thingy has > no return value, but we can fix that. > > To confirm that theory, could you please try the hack below and add > some instrumentation (trace_printk)? Applied, and it looks like that's exactly why the warning triggers, at least on the platform I am testing on which is a dual-cluster ARM testchip. 
There is still a time window, though, where the CPU (the IPI target) can get
back to idle (tick_broadcast_pending still not set) before the CPU target of
the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
tick_broadcast_pending), or am I missing something? It is a corner case,
granted. The best thing would be to check pending IRQs in the idle driver
back-end (or have always-on local timers :-)).

Thanks,
Lorenzo
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 14:48 ` Lorenzo Pieralisi @ 2013-02-22 15:03 ` Thomas Gleixner 2013-02-22 15:26 ` Lorenzo Pieralisi 0 siblings, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2013-02-22 15:03 UTC (permalink / raw) To: linux-arm-kernel On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > Now we could make use of that and avoid going deep idle just to come > > back right away via the IPI. Unfortunately the notification thingy has > > no return value, but we can fix that. > > > > To confirm that theory, could you please try the hack below and add > > some instrumentation (trace_printk)? > > Applied, and it looks like that's exactly why the warning triggers, at least > on the platform I am testing on which is a dual-cluster ARM testchip. > > There is a still time window though where the CPU (the IPI target) can get > back to idle (tick_broadcast_pending still not set) before the CPU target of > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > tick_broadcast_pending), or am I missing something ? Well, the tick_broadcast_pending bit is uninteresting if the force_broadcast bit is set. Because if that bit is set we know for sure, that we got woken with the cpu which gets the broadcast timer and raced back to idle before the broadcast handler managed to send the IPI. If we did not get woken before the broadcast IPI then the pending bit is set when we exit the broadcast mode. > It is a corner case, granted. Best thing would be to check pending IRQs in the > idle driver back-end (or have always-on local timers :-)). The latter is definitely the only sane solution. Thanks, tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 15:03 ` Thomas Gleixner @ 2013-02-22 15:26 ` Lorenzo Pieralisi 2013-02-22 18:52 ` Thomas Gleixner 0 siblings, 1 reply; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-22 15:26 UTC (permalink / raw) To: linux-arm-kernel On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > > Now we could make use of that and avoid going deep idle just to come > > > back right away via the IPI. Unfortunately the notification thingy has > > > no return value, but we can fix that. > > > > > > To confirm that theory, could you please try the hack below and add > > > some instrumentation (trace_printk)? > > > > Applied, and it looks like that's exactly why the warning triggers, at least > > on the platform I am testing on which is a dual-cluster ARM testchip. > > > > There is a still time window though where the CPU (the IPI target) can get > > back to idle (tick_broadcast_pending still not set) before the CPU target of > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > > tick_broadcast_pending), or am I missing something ? > > Well, the tick_broadcast_pending bit is uninteresting if the > force_broadcast bit is set. Because if that bit is set we know for > sure, that we got woken with the cpu which gets the broadcast timer > and raced back to idle before the broadcast handler managed to > send the IPI. Gah, my bad sorry, I mixed things up. I thought tick_check_broadcast_pending() was checking against the tick_broadcast_pending mask not tick_force_broadcast_mask as it correctly does. All clear now. Thanks a lot, Lorenzo ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 15:26 ` Lorenzo Pieralisi @ 2013-02-22 18:52 ` Thomas Gleixner 2013-02-25 6:12 ` Santosh Shilimkar ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Thomas Gleixner @ 2013-02-22 18:52 UTC (permalink / raw) To: linux-arm-kernel On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: > > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > > > Now we could make use of that and avoid going deep idle just to come > > > > back right away via the IPI. Unfortunately the notification thingy has > > > > no return value, but we can fix that. > > > > > > > > To confirm that theory, could you please try the hack below and add > > > > some instrumentation (trace_printk)? > > > > > > Applied, and it looks like that's exactly why the warning triggers, at least > > > on the platform I am testing on which is a dual-cluster ARM testchip. > > > > > > There is a still time window though where the CPU (the IPI target) can get > > > back to idle (tick_broadcast_pending still not set) before the CPU target of > > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > > > tick_broadcast_pending), or am I missing something ? > > > > Well, the tick_broadcast_pending bit is uninteresting if the > > force_broadcast bit is set. Because if that bit is set we know for > > sure, that we got woken with the cpu which gets the broadcast timer > > and raced back to idle before the broadcast handler managed to > > send the IPI. > > Gah, my bad sorry, I mixed things up. I thought > > tick_check_broadcast_pending() > > was checking against the tick_broadcast_pending mask not > > tick_force_broadcast_mask Yep, that's a misnomer. I just wanted to make sure that my theory is correct. I need to think about the real solution some more. We have two alternatives: 1) Make the clockevents_notify function have a return value. 2) Add something like the hack I gave you with a proper name. The latter has the beauty, that we just need to modify the platform independent idle code instead of going down to every callsite of the clockevents_notify thing. Thanks, tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 18:52 ` Thomas Gleixner @ 2013-02-25 6:12 ` Santosh Shilimkar 2013-02-25 6:38 ` Jason Liu 2013-02-25 13:34 ` Lorenzo Pieralisi 2 siblings, 0 replies; 25+ messages in thread From: Santosh Shilimkar @ 2013-02-25 6:12 UTC (permalink / raw) To: linux-arm-kernel On Saturday 23 February 2013 12:22 AM, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >> On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: >>> On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >>>> On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: >>>>> Now we could make use of that and avoid going deep idle just to come >>>>> back right away via the IPI. Unfortunately the notification thingy has >>>>> no return value, but we can fix that. >>>>> >>>>> To confirm that theory, could you please try the hack below and add >>>>> some instrumentation (trace_printk)? >>>> >>>> Applied, and it looks like that's exactly why the warning triggers, at least >>>> on the platform I am testing on which is a dual-cluster ARM testchip. >>>> I too confirm that the warnings cause is same. >>>> There is a still time window though where the CPU (the IPI target) can get >>>> back to idle (tick_broadcast_pending still not set) before the CPU target of >>>> the broadcast has a chance to run tick_handle_oneshot_broadcast (and set >>>> tick_broadcast_pending), or am I missing something ? >>> >>> Well, the tick_broadcast_pending bit is uninteresting if the >>> force_broadcast bit is set. Because if that bit is set we know for >>> sure, that we got woken with the cpu which gets the broadcast timer >>> and raced back to idle before the broadcast handler managed to >>> send the IPI. >> >> Gah, my bad sorry, I mixed things up. I thought >> >> tick_check_broadcast_pending() >> >> was checking against the tick_broadcast_pending mask not >> >> tick_force_broadcast_mask > > Yep, that's a misnomer. I just wanted to make sure that my theory is > correct. I need to think about the real solution some more. > > We have two alternatives: > > 1) Make the clockevents_notify function have a return value. > > 2) Add something like the hack I gave you with a proper name. > > The latter has the beauty, that we just need to modify the platform > independent idle code instead of going down to every callsite of the > clockevents_notify thing. > I agree that 2 is better alternative to avoid multiple changes. Whichever alternative you choose, will be happy to test it :) Regards, Santosh ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 18:52 ` Thomas Gleixner 2013-02-25 6:12 ` Santosh Shilimkar @ 2013-02-25 6:38 ` Jason Liu 2013-02-25 13:34 ` Lorenzo Pieralisi 2 siblings, 0 replies; 25+ messages in thread From: Jason Liu @ 2013-02-25 6:38 UTC (permalink / raw) To: linux-arm-kernel Thomas, 2013/2/23 Thomas Gleixner <tglx@linutronix.de>: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >> On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: >> > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: >> > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: >> > > > Now we could make use of that and avoid going deep idle just to come >> > > > back right away via the IPI. Unfortunately the notification thingy has >> > > > no return value, but we can fix that. >> > > > >> > > > To confirm that theory, could you please try the hack below and add >> > > > some instrumentation (trace_printk)? >> > > >> > > Applied, and it looks like that's exactly why the warning triggers, at least >> > > on the platform I am testing on which is a dual-cluster ARM testchip. >> > > >> > > There is a still time window though where the CPU (the IPI target) can get >> > > back to idle (tick_broadcast_pending still not set) before the CPU target of >> > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set >> > > tick_broadcast_pending), or am I missing something ? >> > >> > Well, the tick_broadcast_pending bit is uninteresting if the >> > force_broadcast bit is set. Because if that bit is set we know for >> > sure, that we got woken with the cpu which gets the broadcast timer >> > and raced back to idle before the broadcast handler managed to >> > send the IPI. >> >> Gah, my bad sorry, I mixed things up. I thought >> >> tick_check_broadcast_pending() >> >> was checking against the tick_broadcast_pending mask not >> >> tick_force_broadcast_mask > > Yep, that's a misnomer. I just wanted to make sure that my theory is > correct. I need to think about the real solution some more. I have applied your patch and tested, there is NO warning at all then. I think your theory is correct. > > We have two alternatives: > > 1) Make the clockevents_notify function have a return value. > > 2) Add something like the hack I gave you with a proper name. > > The latter has the beauty, that we just need to modify the platform > independent idle code instead of going down to every callsite of the > clockevents_notify thing. I prefer the solution 2). > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-22 18:52 ` Thomas Gleixner 2013-02-25 6:12 ` Santosh Shilimkar 2013-02-25 6:38 ` Jason Liu @ 2013-02-25 13:34 ` Lorenzo Pieralisi 2 siblings, 0 replies; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-25 13:34 UTC (permalink / raw) To: linux-arm-kernel On Fri, Feb 22, 2013 at 06:52:14PM +0000, Thomas Gleixner wrote: > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote: > > > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote: > > > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote: > > > > > Now we could make use of that and avoid going deep idle just to come > > > > > back right away via the IPI. Unfortunately the notification thingy has > > > > > no return value, but we can fix that. > > > > > > > > > > To confirm that theory, could you please try the hack below and add > > > > > some instrumentation (trace_printk)? > > > > > > > > Applied, and it looks like that's exactly why the warning triggers, at least > > > > on the platform I am testing on which is a dual-cluster ARM testchip. > > > > > > > > There is a still time window though where the CPU (the IPI target) can get > > > > back to idle (tick_broadcast_pending still not set) before the CPU target of > > > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set > > > > tick_broadcast_pending), or am I missing something ? > > > > > > Well, the tick_broadcast_pending bit is uninteresting if the > > > force_broadcast bit is set. Because if that bit is set we know for > > > sure, that we got woken with the cpu which gets the broadcast timer > > > and raced back to idle before the broadcast handler managed to > > > send the IPI. > > > > Gah, my bad sorry, I mixed things up. I thought > > > > tick_check_broadcast_pending() > > > > was checking against the tick_broadcast_pending mask not > > > > tick_force_broadcast_mask > > Yep, that's a misnomer. I just wanted to make sure that my theory is > correct. I need to think about the real solution some more. > > We have two alternatives: > > 1) Make the clockevents_notify function have a return value. > > 2) Add something like the hack I gave you with a proper name. > > The latter has the beauty, that we just need to modify the platform > independent idle code instead of going down to every callsite of the > clockevents_notify thing. Thank you. I am not sure (1) would buy us anything compared to (2) and as you said we would end up patching all callsites so (2) is preferred. As I mentioned, we can even just apply your fixes and leave platform specific code deal with this optimization, at the end of the day idle driver has just to check pending IRQs/wake-up sources (which would cover all IRQs not just TIMER IPI) if and when it has to start time consuming operations like cache cleaning to enter deep idle. If it goes into a shallow C-state so be it. On x86 I think it is HW/FW that prevents C-state entering if IRQs are pending, and on ARM it is likely to happen too, so I am just saying you should not bother if you think the code becomes too hairy to justify this change. Thank you very much for the fixes and your help, Lorenzo ^ permalink raw reply [flat|nested] 25+ messages in thread
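The platform-side check Lorenzo describes could look roughly like the sketch below. pending_wakeup_irq() stands in for a platform-specific poll of the interrupt controller (for example the GIC pending registers) and platform_power_down() for the real power-down path; both names are made up for this example, while cpu_do_idle() and flush_cache_all() are existing ARM kernel interfaces.

	static void platform_enter_deep_cstate(int cpu)
	{
		/*
		 * If a wake-up source is already pending, cleaning the
		 * caches and powering down would be wasted effort: the CPU
		 * would resume immediately.  Fall back to a shallow WFI.
		 */
		if (pending_wakeup_irq(cpu)) {
			cpu_do_idle();			/* shallow WFI */
			return;
		}

		flush_cache_all();			/* costly part of deep idle entry */
		platform_power_down(cpu);		/* placeholder for the real power-down */
	}

This covers all pending IRQs, not just the timer IPI, which is the broader point being made here.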
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 22:19 ` Thomas Gleixner 2013-02-22 10:07 ` Santosh Shilimkar @ 2013-02-22 10:28 ` Lorenzo Pieralisi 1 sibling, 0 replies; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-22 10:28 UTC (permalink / raw) To: linux-arm-kernel On Thu, Feb 21, 2013 at 10:19:18PM +0000, Thomas Gleixner wrote: > On Thu, 21 Feb 2013, Santosh Shilimkar wrote: > > On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote: > > > find below a completely untested patch, which should address that issue. > > > > > After looking at the thread, I tried to see the issue on OMAP and could > > see the same issue as Jason. > > That's interesting. We have the same issue on x86 since 2007 and > nobody noticed ever. It's basically the same problem there, but it > seems that on x86 getting out of those low power states is way slower > than the minimal reprogramming delta which is used to enforce the > local timer to fire after the wakeup. > > I'm still amazed that as Jason stated a 1us reprogramming delta is > sufficient to get this ping-pong going. I somehow doubt that, but > maybe ARM is really that fast :) It also depends on when the idle driver exits broadcast mode. Certainly if that's the last thing it does before enabling IRQs, that might help trigger the issue. I am still a bit sceptic myself too, and I take advantage of Thomas' knowledge on the subject, which is ways deeper than mine BTW, to ask a question. The thread started with a subject "too many retries...." and here I have a doubt. If the fix is not applied, on the CPU affine to the broadcast timer, it is _normal_ to have local timer retries, since the CPU is setting/forcing the local timer to fire after a min_delta_ns every time the expired event was local to the CPU affine to the broadcast timer. The problem, supposedly, is that the timer has not enough time (sorry for the mouthful) to expire(fire) before IRQs are disabled and the idle thread goes back to idle again. This means that we should notice a mismatch between the number of broadcast timer IRQs and local timer IRQs on the CPU affine to the broadcast timer IRQ (granted, we also have to take into account broadcast timer IRQs meant to service (through IPI) a local timer expired on a CPU which is not the one running the broadcast IRQ handler and "normal" local timer IRQs as well). I am not sure the sheer number of retries is a symptom of the problem happening, but I might well be mistaken so I am asking. For certain, with the fix applied, lots of duplicate IRQs on the CPU affine to the broadcast timer are eliminated, since the local timer is not reprogrammed anymore (before the fix, basically the broadcast timer was firing an IRQ that did nothing since the CPU was already out of broadcast mode by the time the broadcast handler was running, the real job was carried out in the local timer handler). > > > Your patch fixes the retries on both CPUs on my dual core machine. So > > you use my tested by if you need one. > > They are always welcome. > > > Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> You can add mine too, we should fix the WARN_ON_ONCE mentioned in Santosh's reply. Tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 13:48 ` Thomas Gleixner 2013-02-21 15:12 ` Santosh Shilimkar @ 2013-02-22 10:26 ` Jason Liu 1 sibling, 0 replies; 25+ messages in thread From: Jason Liu @ 2013-02-22 10:26 UTC (permalink / raw) To: linux-arm-kernel Thomas, 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: > Jason, > > On Thu, 21 Feb 2013, Jason Liu wrote: >> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>: >> > Now your explanation makes sense. >> > >> > I have no fast solution for this, but I think that I have an idea how >> > to fix it. Stay tuned. >> >> Thanks Thomas, wait for your fix. :) > > find below a completely untested patch, which should address that issue. > I have tested the below patch, but run into the following warnings: [ 713.340593] ------------[ cut here ]------------ [ 713.345238] WARNING: at /linux-2.6-imx/kernel/time/tick-broadcast.c:497 tick_broadcast_oneshot_control+0x210/0x254() [ 713.359240] Modules linked in: [ 713.362332] [<8004b2cc>] (unwind_backtrace+0x0/0xf8) from [<80079764>] (warn_slowpath_common+0x54/0x64) [ 713.371740] [<80079764>] (warn_slowpath_common+0x54/0x64) from [<80079790>] (warn_slowpath_null+0x1c/0x24) [ 713.381408] [<80079790>] (warn_slowpath_null+0x1c/0x24) from [<800a5320>] (tick_broadcast_oneshot_control+0x210/0x254) [ 713.392120] [<800a5320>] (tick_broadcast_oneshot_control+0x210/0x254) from [<800a48b0>] (tick_notify+0xd4/0x1f8) [ 713.402317] [<800a48b0>] (tick_notify+0xd4/0x1f8) from [<8009b154>] (notifier_call_chain+0x4c/0x8c) [ 713.411377] [<8009b154>] (notifier_call_chain+0x4c/0x8c) from [<8009b24c>] (raw_notifier_call_chain+0x18/0x20) [ 713.421392] [<8009b24c>] (raw_notifier_call_chain+0x18/0x20) from [<800a3d9c>] (clockevents_notify+0x30/0x198) [ 713.431417] [<800a3d9c>] (clockevents_notify+0x30/0x198) from [<80052f58>] (arch_idle_multi_core+0x4c/0xc4) [ 713.441175] [<80052f58>] (arch_idle_multi_core+0x4c/0xc4) from [<80044f68>] (default_idle+0x20/0x28) [ 713.450322] [<80044f68>] (default_idle+0x20/0x28) from [<800455c0>] (cpu_idle+0xc8/0xfc) [ 713.458425] [<800455c0>] (cpu_idle+0xc8/0xfc) from [<80008984>] (start_kernel+0x260/0x294) [ 713.466701] [<80008984>] (start_kernel+0x260/0x294) from [<10008040>] (0x10008040) So, the code here which bring the warnings: void tick_broadcast_oneshot_control(unsigned long reason) { ... WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask)); } > > Thanks, > > tglx > > Index: linux-2.6/kernel/time/tick-broadcast.c > =================================================================== > --- linux-2.6.orig/kernel/time/tick-broadcast.c > +++ linux-2.6/kernel/time/tick-broadcast.c > @@ -397,6 +397,8 @@ int tick_resume_broadcast(void) > > /* FIXME: use cpumask_var_t. */ > static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS); > +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS); > +static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS); > > /* > * Exposed for debugging: see timer_list.c > @@ -453,12 +455,24 @@ again: > /* Find all expired events */ > for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { > td = &per_cpu(tick_cpu_device, cpu); > - if (td->evtdev->next_event.tv64 <= now.tv64) > + if (td->evtdev->next_event.tv64 <= now.tv64) { > cpumask_set_cpu(cpu, to_cpumask(tmpmask)); > - else if (td->evtdev->next_event.tv64 < next_event.tv64) > + /* > + * Mark the remote cpu in the pending mask, so > + * it can avoid reprogramming the cpu local > + * timer in tick_broadcast_oneshot_control(). 
> + */ > + set_bit(cpu, tick_broadcast_pending); > + } else if (td->evtdev->next_event.tv64 < next_event.tv64) > next_event.tv64 = td->evtdev->next_event.tv64; > } > > + /* Take care of enforced broadcast requests */ > + for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) { > + set_bit(cpu, tmpmask); > + clear_bit(cpu, tick_force_broadcast_mask); > + } > + > /* > * Wakeup the cpus which have an expired event. > */ > @@ -494,6 +508,7 @@ void tick_broadcast_oneshot_control(unsi > struct clock_event_device *bc, *dev; > struct tick_device *td; > unsigned long flags; > + ktime_t now; > int cpu; > > /* > @@ -518,6 +533,8 @@ void tick_broadcast_oneshot_control(unsi > > raw_spin_lock_irqsave(&tick_broadcast_lock, flags); > if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) { > + WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending)); > + WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask)); > if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) { > cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask()); > clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN); > @@ -529,10 +546,63 @@ void tick_broadcast_oneshot_control(unsi > cpumask_clear_cpu(cpu, > tick_get_broadcast_oneshot_mask()); > clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); > - if (dev->next_event.tv64 != KTIME_MAX) > - tick_program_event(dev->next_event, 1); > + if (dev->next_event.tv64 == KTIME_MAX) > + goto out; > + /* > + * The cpu handling the broadcast timer marked > + * this cpu in the broadcast pending mask and > + * fired the broadcast IPI. So we are going to > + * handle the expired event anyway via the > + * broadcast IPI handler. No need to reprogram > + * the timer with an already expired event. > + */ > + if (test_and_clear_bit(cpu, tick_broadcast_pending)) > + goto out; > + /* > + * If the pending bit is not set, then we are > + * either the CPU handling the broadcast > + * interrupt or we got woken by something else. > + * > + * We are not longer in the broadcast mask, so > + * if the cpu local expiry time is already > + * reached, we would reprogram the cpu local > + * timer with an already expired event. > + * > + * This can lead to a ping-pong when we return > + * to idle and therefor rearm the broadcast > + * timer before the cpu local timer was able > + * to fire. This happens because the forced > + * reprogramming makes sure that the event > + * will happen in the future and depending on > + * the min_delta setting this might be far > + * enough out that the ping-pong starts. > + * > + * If the cpu local next_event has expired > + * then we know that the broadcast timer > + * next_event has expired as well and > + * broadcast is about to be handled. So we > + * avoid reprogramming and enforce that the > + * broadcast handler, which did not run yet, > + * will invoke the cpu local handler. > + * > + * We cannot call the handler directly from > + * here, because we might be in a NOHZ phase > + * and we did not go through the irq_enter() > + * nohz fixups. > + */ > + now = ktime_get(); > + if (dev->next_event.tv64 <= now.tv64) { > + set_bit(cpu, tick_force_broadcast_mask); > + goto out; > + } > + /* > + * We got woken by something else. Reprogram > + * the cpu local timer device. > + */ > + tick_program_event(dev->next_event, 1); > } > } > +out: > raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); > } > ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 6:16 ` Jason Liu 2013-02-21 9:36 ` Thomas Gleixner @ 2013-02-21 10:35 ` Lorenzo Pieralisi 2013-02-21 10:49 ` Jason Liu 1 sibling, 1 reply; 25+ messages in thread From: Lorenzo Pieralisi @ 2013-02-21 10:35 UTC (permalink / raw) To: linux-arm-kernel Hi Jason, On Thu, Feb 21, 2013 at 06:16:51AM +0000, Jason Liu wrote: > 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: > > On Wed, 20 Feb 2013, Jason Liu wrote: > >> void arch_idle(void) > >> { > >> .... > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); > >> > >> enter_the_wait_mode(); > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > >> } > >> > >> when the broadcast timer interrupt arrives(this interrupt just wakeup > >> the ARM, and ARM has no chance > >> to handle it since local irq is disabled. In fact it's disabled in > >> cpu_idle() of arch/arm/kernel/process.c) > >> > >> the broadcast timer interrupt will wake up the CPU and run: > >> > >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> > >> tick_broadcast_oneshot_control(...); > >> -> > >> tick_program_event(dev->next_event, 1); > >> -> > >> tick_dev_program_event(dev, expires, force); > >> -> > >> for (i = 0;;) { > >> int ret = clockevents_program_event(dev, expires, now); > >> if (!ret || !force) > >> return ret; > >> > >> dev->retries++; > >> .... > >> now = ktime_get(); > >> expires = ktime_add_ns(now, dev->min_delta_ns); > >> } > >> clockevents_program_event(dev, expires, now); > >> > >> delta = ktime_to_ns(ktime_sub(expires, now)); > >> > >> if (delta <= 0) > >> return -ETIME; > >> > >> when the bc timer interrupt arrives, which means the last local timer > >> expires too. so, > >> clockevents_program_event will return -ETIME, which will cause the > >> dev->retries++ > >> when retry to program the expired timer. > >> > >> Even under the worst case, after the re-program the expired timer, > >> then CPU enter idle > >> quickly before the re-progam timer expired, it will make system > >> ping-pang forever, > > > > That's nonsense. > > I don't think so. > > > > > The timer IPI brings the core out of the deep idle state. > > > > So after returning from enter_wait_mode() and after calling > > clockevents_notify() it returns from arch_idle() to cpu_idle(). > > > > In cpu_idle() interrupts are reenabled, so the timer IPI handler is > > invoked. That calls the event_handler of the per cpu local clockevent > > device (the one which stops in C3). That ends up in the generic timer > > code which expires timers and reprograms the local clock event device > > with the next pending timer. > > > > So you cannot go idle again, before the expired timers of this event > > are handled and their callbacks invoked. > > That's true for the CPUs which not response to the global timer interrupt. > Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) > The global timer device will keep running even in the deep idle mode, so, it > can be used as the broadcast timer device, and the interrupt of this device > just raised to CPU0 when the timer expired, then, CPU0 will broadcast the > IPI timer to other CPUs which is in deep idle mode. > > So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle > state, after running clockevents_notify() it returns from arch_idle() > to cpu_idle(), > then local_irq_enable(), the IPI handler will be invoked and handle > the expires times > and re-program the next pending timer. 
> > But, that's not true for the CPU0. The flow for CPU0 is: > the global timer interrupt wakes up CPU0 and then call: > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); > > which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); > in the function tick_broadcast_oneshot_control(), For my own understanding: at this point in time CPU0 local timer is also reprogrammed, with min_delta (ie 1us) if I got your description right. > > After return from clockevents_notify(), it will return to cpu_idle > from arch_idle, > then local_irq_enable(), the CPU0 will response to the global timer > interrupt, and > call the interrupt handler: tick_handle_oneshot_broadcast() > > static void tick_handle_oneshot_broadcast(struct clock_event_device *dev) > { > struct tick_device *td; > ktime_t now, next_event; > int cpu; > > raw_spin_lock(&tick_broadcast_lock); > again: > dev->next_event.tv64 = KTIME_MAX; > next_event.tv64 = KTIME_MAX; > cpumask_clear(to_cpumask(tmpmask)); > now = ktime_get(); > /* Find all expired events */ > for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { > td = &per_cpu(tick_cpu_device, cpu); > if (td->evtdev->next_event.tv64 <= now.tv64) > cpumask_set_cpu(cpu, to_cpumask(tmpmask)); > else if (td->evtdev->next_event.tv64 < next_event.tv64) > next_event.tv64 = td->evtdev->next_event.tv64; > } > > /* > * Wakeup the cpus which have an expired event. > */ > tick_do_broadcast(to_cpumask(tmpmask)); > ... > } > > since cpu0 has been removed from the tick_get_broadcast_oneshot_mask(), and if > all the other cpu1/2/3 state in idle, and no expired timers, then the > tmpmask will be 0, > when call tick_do_broadcast(). > > static void tick_do_broadcast(struct cpumask *mask) > { > int cpu = smp_processor_id(); > struct tick_device *td; > > /* > * Check, if the current cpu is in the mask > */ > if (cpumask_test_cpu(cpu, mask)) { > cpumask_clear_cpu(cpu, mask); > td = &per_cpu(tick_cpu_device, cpu); > td->evtdev->event_handler(td->evtdev); > } > > if (!cpumask_empty(mask)) { > /* > * It might be necessary to actually check whether the devices > * have different broadcast functions. For now, just use the > * one of the first device. This works as long as we have this > * misfeature only on x86 (lapic) > */ > td = &per_cpu(tick_cpu_device, cpumask_first(mask)); > td->evtdev->broadcast(mask); > } > } > > If the mask is empty, then tick_do_broadcast will do nothing and return, which > will make cpu0 enter idle quickly, and then system will ping-pang there. This means that the local timer reprogrammed above (to actually emulate the expired local timer on CPU0, likely to be set to min_delta == 1us) does not have time to expire before the idle thread disables IRQs and goes idle again. Is this a correct description of what's happening ? Thanks a lot, Lorenzo ^ permalink raw reply [flat|nested] 25+ messages in thread
* too many timer retries happen when do local timer swtich with broadcast timer 2013-02-21 10:35 ` Lorenzo Pieralisi @ 2013-02-21 10:49 ` Jason Liu 0 siblings, 0 replies; 25+ messages in thread From: Jason Liu @ 2013-02-21 10:49 UTC (permalink / raw) To: linux-arm-kernel 2013/2/21 Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>: > Hi Jason, > > On Thu, Feb 21, 2013 at 06:16:51AM +0000, Jason Liu wrote: >> 2013/2/20 Thomas Gleixner <tglx@linutronix.de>: >> > On Wed, 20 Feb 2013, Jason Liu wrote: >> >> void arch_idle(void) >> >> { >> >> .... >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); >> >> >> >> enter_the_wait_mode(); >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> } >> >> >> >> when the broadcast timer interrupt arrives(this interrupt just wakeup >> >> the ARM, and ARM has no chance >> >> to handle it since local irq is disabled. In fact it's disabled in >> >> cpu_idle() of arch/arm/kernel/process.c) >> >> >> >> the broadcast timer interrupt will wake up the CPU and run: >> >> >> >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); -> >> >> tick_broadcast_oneshot_control(...); >> >> -> >> >> tick_program_event(dev->next_event, 1); >> >> -> >> >> tick_dev_program_event(dev, expires, force); >> >> -> >> >> for (i = 0;;) { >> >> int ret = clockevents_program_event(dev, expires, now); >> >> if (!ret || !force) >> >> return ret; >> >> >> >> dev->retries++; >> >> .... >> >> now = ktime_get(); >> >> expires = ktime_add_ns(now, dev->min_delta_ns); >> >> } >> >> clockevents_program_event(dev, expires, now); >> >> >> >> delta = ktime_to_ns(ktime_sub(expires, now)); >> >> >> >> if (delta <= 0) >> >> return -ETIME; >> >> >> >> when the bc timer interrupt arrives, which means the last local timer >> >> expires too. so, >> >> clockevents_program_event will return -ETIME, which will cause the >> >> dev->retries++ >> >> when retry to program the expired timer. >> >> >> >> Even under the worst case, after the re-program the expired timer, >> >> then CPU enter idle >> >> quickly before the re-progam timer expired, it will make system >> >> ping-pang forever, >> > >> > That's nonsense. >> >> I don't think so. >> >> > >> > The timer IPI brings the core out of the deep idle state. >> > >> > So after returning from enter_wait_mode() and after calling >> > clockevents_notify() it returns from arch_idle() to cpu_idle(). >> > >> > In cpu_idle() interrupts are reenabled, so the timer IPI handler is >> > invoked. That calls the event_handler of the per cpu local clockevent >> > device (the one which stops in C3). That ends up in the generic timer >> > code which expires timers and reprograms the local clock event device >> > with the next pending timer. >> > >> > So you cannot go idle again, before the expired timers of this event >> > are handled and their callbacks invoked. >> >> That's true for the CPUs which not response to the global timer interrupt. >> Take our platform as example: we have 4CPUs(CPU0, CPU1,CPU2,CPU3) >> The global timer device will keep running even in the deep idle mode, so, it >> can be used as the broadcast timer device, and the interrupt of this device >> just raised to CPU0 when the timer expired, then, CPU0 will broadcast the >> IPI timer to other CPUs which is in deep idle mode. 
>> >> So for CPU1, CPU2, CPU3, you are right, the IPI timer will bring it out of idle >> state, after running clockevents_notify() it returns from arch_idle() >> to cpu_idle(), >> then local_irq_enable(), the IPI handler will be invoked and handle >> the expires times >> and re-program the next pending timer. >> >> But, that's not true for the CPU0. The flow for CPU0 is: >> the global timer interrupt wakes up CPU0 and then call: >> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); >> >> which will cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()); >> in the function tick_broadcast_oneshot_control(), > > For my own understanding: at this point in time CPU0 local timer is > also reprogrammed, with min_delta (ie 1us) if I got your description > right. yes, right. > >> >> After return from clockevents_notify(), it will return to cpu_idle >> from arch_idle, >> then local_irq_enable(), the CPU0 will response to the global timer >> interrupt, and >> call the interrupt handler: tick_handle_oneshot_broadcast() >> >> static void tick_handle_oneshot_broadcast(struct clock_event_device *dev) >> { >> struct tick_device *td; >> ktime_t now, next_event; >> int cpu; >> >> raw_spin_lock(&tick_broadcast_lock); >> again: >> dev->next_event.tv64 = KTIME_MAX; >> next_event.tv64 = KTIME_MAX; >> cpumask_clear(to_cpumask(tmpmask)); >> now = ktime_get(); >> /* Find all expired events */ >> for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) { >> td = &per_cpu(tick_cpu_device, cpu); >> if (td->evtdev->next_event.tv64 <= now.tv64) >> cpumask_set_cpu(cpu, to_cpumask(tmpmask)); >> else if (td->evtdev->next_event.tv64 < next_event.tv64) >> next_event.tv64 = td->evtdev->next_event.tv64; >> } >> >> /* >> * Wakeup the cpus which have an expired event. >> */ >> tick_do_broadcast(to_cpumask(tmpmask)); >> ... >> } >> >> since cpu0 has been removed from the tick_get_broadcast_oneshot_mask(), and if >> all the other cpu1/2/3 state in idle, and no expired timers, then the >> tmpmask will be 0, >> when call tick_do_broadcast(). >> >> static void tick_do_broadcast(struct cpumask *mask) >> { >> int cpu = smp_processor_id(); >> struct tick_device *td; >> >> /* >> * Check, if the current cpu is in the mask >> */ >> if (cpumask_test_cpu(cpu, mask)) { >> cpumask_clear_cpu(cpu, mask); >> td = &per_cpu(tick_cpu_device, cpu); >> td->evtdev->event_handler(td->evtdev); >> } >> >> if (!cpumask_empty(mask)) { >> /* >> * It might be necessary to actually check whether the devices >> * have different broadcast functions. For now, just use the >> * one of the first device. This works as long as we have this >> * misfeature only on x86 (lapic) >> */ >> td = &per_cpu(tick_cpu_device, cpumask_first(mask)); >> td->evtdev->broadcast(mask); >> } >> } >> >> If the mask is empty, then tick_do_broadcast will do nothing and return, which >> will make cpu0 enter idle quickly, and then system will ping-pang there. > > This means that the local timer reprogrammed above (to actually emulate the > expired local timer on CPU0, likely to be set to min_delta == 1us) does not > have time to expire before the idle thread disables IRQs and goes idle again. > > Is this a correct description of what's happening ? yes, correct. > > Thanks a lot, > Lorenzo > ^ permalink raw reply [flat|nested] 25+ messages in thread
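As a summary of the exchange above, the CPU0 sequence that Jason confirms can be condensed into the following annotated call trace (not compilable code, just the order of events, with min_delta assumed to be the 1 us discussed above):

	/* broadcast (global) timer fires; CPU0 wakes with IRQs still disabled */
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
		/* CPU0 is cleared from tick_broadcast_oneshot_mask and its
		 * already-expired local timer is force-reprogrammed to
		 * now + min_delta (1 us) */

	/* back in cpu_idle(): */
	local_irq_enable();
	tick_handle_oneshot_broadcast();
		/* CPU0 is no longer in the mask and no other CPU has an
		 * expired event, so tmpmask is empty and
		 * tick_do_broadcast() does nothing */

	/* the idle loop comes around again in less than 1 us: */
	local_irq_disable();
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
	enter_the_wait_mode();	/* broadcast timer fires again -> repeat */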
end of thread, other threads: [~2013-02-25 13:34 UTC | newest]

Thread overview: 25+ messages (links below jump to the message on this page):

2013-02-20 11:16 too many timer retries happen when do local timer swtich with broadcast timer Jason Liu
2013-02-20 13:33 ` Thomas Gleixner
2013-02-21 6:16 ` Jason Liu
2013-02-21 9:36 ` Thomas Gleixner
2013-02-21 10:50 ` Jason Liu
2013-02-21 13:48 ` Thomas Gleixner
2013-02-21 15:12 ` Santosh Shilimkar
2013-02-21 22:19 ` Thomas Gleixner
2013-02-22 10:07 ` Santosh Shilimkar
2013-02-22 10:24 ` Thomas Gleixner
2013-02-22 10:30 ` Santosh Shilimkar
2013-02-22 10:31 ` Lorenzo Pieralisi
2013-02-22 11:02 ` Santosh Shilimkar
2013-02-22 12:07 ` Thomas Gleixner
2013-02-22 14:48 ` Lorenzo Pieralisi
2013-02-22 15:03 ` Thomas Gleixner
2013-02-22 15:26 ` Lorenzo Pieralisi
2013-02-22 18:52 ` Thomas Gleixner
2013-02-25 6:12 ` Santosh Shilimkar
2013-02-25 6:38 ` Jason Liu
2013-02-25 13:34 ` Lorenzo Pieralisi
2013-02-22 10:28 ` Lorenzo Pieralisi
2013-02-22 10:26 ` Jason Liu
2013-02-21 10:35 ` Lorenzo Pieralisi
2013-02-21 10:49 ` Jason Liu