linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* Problem about CPU stalling in hrtimer_intterrupts()
@ 2015-10-22  6:43 Yang Yingliang
  2015-10-22  7:43 ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Yang Yingliang @ 2015-10-22  6:43 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Thomas,

I got the following call trace when I called some set time syscalls in 
userspace.

---start---
Jan 01 00:02:29 Linux kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jan 01 00:02:29 Linux kernel:         0: (2 GPs behind) idle=913/1/0 
softirq=59289/59291 fqs=488
Jan 01 00:02:29 Linux kernel:         (detected by 20, t=5252 jiffies, 
g=35769, c=35768, q=567)
Jan 01 00:02:29 Linux kernel: Task dump for CPU 0:
Jan 01 00:02:29 Linux kernel: swapper/0       R  running task        0 
   0      0 0x00000002
Jan 01 00:02:29 Linux kernel: Call trace:
Jan 01 00:02:29 Linux kernel: [<ffffffc000086c5c>] __switch_to+0x74/0x8c
Jan 01 00:02:29 Linux kernel: rcu_sched kthread starved for 4764 jiffies!
Jan 01 00:03:32 Linux kernel: NMI watchdog: BUG: soft lockup - CPU#0 
stuck for 23s! [swapper/0:0]
Jan 01 00:03:32 Linux kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.1.6+ #184
Jan 01 00:03:32 Linux kernel: task: ffffffc00091cdf0 ti: 
ffffffc000910000 task.ti: ffffffc000910000
Jan 01 00:03:32 Linux kernel: PC is at arch_cpu_idle+0x10/0x18
Jan 01 00:03:32 Linux kernel: LR is at arch_cpu_idle+0xc/0x18
Jan 01 00:03:32 Linux kernel: pc : [<ffffffc00008686c>] lr : 
[<ffffffc000086868>] pstate: 60000145
Jan 01 00:03:32 Linux kernel: sp : ffffffc000913f20
Jan 01 00:03:32 Linux kernel: x29: ffffffc000913f20 x28: 000000003f4bbab0
Jan 01 00:03:32 Linux kernel: x27: ffffffc00091669c x26: ffffffc00096e000
Jan 01 00:03:32 Linux kernel: x25: ffffffc000804000 x24: ffffffc000913f30
Jan 01 00:03:32 Linux kernel: x23: 0000000000000001 x22: ffffffc0006817f8
Jan 01 00:03:32 Linux kernel: x21: ffffffc0008fdb00 x20: ffffffc000916618
Jan 01 00:03:32 Linux kernel: x19: ffffffc000910000 x18: 00000000ffffffff
Jan 01 00:03:32 Linux kernel: x17: 0000007f9d6f682c x16: ffffffc0001e19d0
Jan 01 00:03:32 Linux kernel: x15: 0000000000000061 x14: 0000000000000072
Jan 01 00:03:32 Linux kernel: x13: 0000000000000067 x12: ffffffc000682528
Jan 01 00:03:32 Linux kernel: x11: 0000000000000005 x10: 00000001000faf9a
Jan 01 00:03:32 Linux kernel: x9 : ffffffc000913e60 x8 : ffffffc00091d350
Jan 01 00:03:32 Linux kernel: x7 : 0000000000000000 x6 : 002b24c4f00026aa
Jan 01 00:03:32 Linux kernel: x5 : 0000001ffd5c6000 x4 : ffffffc000913ea0
Jan 01 00:03:32 Linux kernel: x3 : ffffffdffdec3b44 x2 : ffffffdffdec3b44
Jan 01 00:03:32 Linux kernel: x1 : 0000000000000000 x0 : 0000000000000000
---end---


I use the kernel-4.1.6 running on arm64.
My testcase is that it calls clock_settime and clock_adjtime alternately 
with random params on each core. My system has 32 cores.

I found the cpu stalling in  hrtimer_intterrupts(). So I added some 
debug info in hrtimer_intterrupts() and found that the while loop runs 
1020437660 times and takes 98761 jiffies(HZ=250).

Some debug log is here:
---start---
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0 
timer:ffffffdffdec6138, timer->function:ffffffc000129b84
Jan 01 00:03:32 Linux kernel: hrtimer_interrupt: 12827864 callbacks 
suppressed
[....]
Jan 01 00:08:43 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4805395417269 ccpu0 
timer:ffffffef90390a98, timer->function:ffffffc00052bcb8
Jan 01 00:08:43 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4805395417269 ccpu0 
timer:ffffffef90390a98, timer->function:ffffffc00052bcb8
Jan 01 00:08:43 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4805395417269 ccpu0 
timer:ffffffef90390a98, timer->function:ffffffc00052bcb8
Jan 01 00:08:43 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4805395417269 ccpu0 
timer:ffffffef90390a98, timer->function:ffffffc00052bcb8
Jan 01 00:08:43 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4805395417269 ccpu0 
timer:ffffffef90390a98, timer->function:ffffffc00052bcb8
Jan 01 00:08:43 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4805395417269 ccpu0 
timer:ffffffef90390a98, timer->function:ffffffc00052bcb8
Jan 01 00:08:43 Linux kernel: i:0 basenow.tv64:4809284991830 
hrtimer_get_softexpires_tv64(timer):4809295433162 ccpu0
Jan 01 00:08:43 Linux kernel: i:0 while_cnt:1020437660, cost 
jiffies:98761, ccpu0
---end---

Is there maybe a bug or do you have any opinions ?

Thanks,
Yang

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Problem about CPU stalling in hrtimer_intterrupts()
  2015-10-22  6:43 Problem about CPU stalling in hrtimer_intterrupts() Yang Yingliang
@ 2015-10-22  7:43 ` Thomas Gleixner
  2015-10-22  9:28   ` Ding Tianhong
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2015-10-22  7:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 22 Oct 2015, Yang Yingliang wrote:
> I use the kernel-4.1.6 running on arm64.
> My testcase is that it calls clock_settime and clock_adjtime alternately with
> random params on each core. My system has 32 cores.
> 
> I found the cpu stalling in  hrtimer_intterrupts(). So I added some debug info
> in hrtimer_intterrupts() and found that the while loop runs 1020437660 times
> and takes 98761 jiffies(HZ=250).
> 
> Some debug log is here:
> ---start---
> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0
> timer:ffffffdffdec6138, timer->function:ffffffc000129b84
> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0

Something is rearming a timer over and over with expiry time in the
past.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Problem about CPU stalling in hrtimer_intterrupts()
  2015-10-22  7:43 ` Thomas Gleixner
@ 2015-10-22  9:28   ` Ding Tianhong
  2015-10-22 10:25     ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Ding Tianhong @ 2015-10-22  9:28 UTC (permalink / raw)
  To: linux-arm-kernel

On 2015/10/22 15:43, Thomas Gleixner wrote:
> On Thu, 22 Oct 2015, Yang Yingliang wrote:
>> I use the kernel-4.1.6 running on arm64.
>> My testcase is that it calls clock_settime and clock_adjtime alternately with
>> random params on each core. My system has 32 cores.
>>
>> I found the cpu stalling in  hrtimer_intterrupts(). So I added some debug info
>> in hrtimer_intterrupts() and found that the while loop runs 1020437660 times
>> and takes 98761 jiffies(HZ=250).
>>
>> Some debug log is here:
>> ---start---
>> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
>> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0
>> timer:ffffffdffdec6138, timer->function:ffffffc000129b84
>> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
>> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0
> 
> Something is rearming a timer over and over with expiry time in the
> past.
> 
> Thanks,
> 
> 	tglx
> 


Hi Thomas:

This problem could only occur on the system with 32 cores, when I cut the cores to 16, this problem disappeared, 
so I think there is some parallel problem when the 32 core set clock time together:

I try to reproduce the scene:

1.do_settimeofday64
2.update tk time
3.update base time offset
4.update expires_next

the 3 and 4 will be called in softirq, but the hrtimer_interrupt may break the order and run before 3, I am not
sure whether this could make the problem, do we need to update base time and expires_next in the hrtimer_interrupt?
maybe I miss something, thanks for any suggestion.

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 93ef7190..9adab23 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1254,6 +1254,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)

raw_spin_lock(&cpu_base->lock);
entry_time = now = hrtimer_update_base(cpu_base);
+ hrtimer_force_reprogram(cpu_base, 0);
retry:
cpu_base->in_hrtirq = 1;


Thanks

Ding


> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> 

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Problem about CPU stalling in hrtimer_intterrupts()
  2015-10-22  9:28   ` Ding Tianhong
@ 2015-10-22 10:25     ` Thomas Gleixner
  2015-10-22 13:43       ` Yang Yingliang
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2015-10-22 10:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 22 Oct 2015, Ding Tianhong wrote:
> On 2015/10/22 15:43, Thomas Gleixner wrote:
> >> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
> >> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0
> >> timer:ffffffdffdec6138, timer->function:ffffffc000129b84
> >> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
> >> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0

> This problem could only occur on the system with 32 cores, when I
> cut the cores to 16, this problem disappeared, so I think there is
> some parallel problem when the 32 core set clock time together:

> I try to reproduce the scene:
> 
> 1.do_settimeofday64
> 2.update tk time
> 3.update base time offset
> 4.update expires_next
> 
> the 3 and 4 will be called in softirq, but the hrtimer_interrupt may
> break the order and run before 3, I am not sure whether this could
> make the problem, do we need to update base time and expires_next in
> the hrtimer_interrupt?  maybe I miss something, thanks for any
> suggestion.

Base offset is updated in hrtimer_interrupt as
well. hrtimer_update_base() does that. So that's not the problem.

Can you apply the patch below and enable the hrtimer tracepoints and
collect trace data across the point where the problem happens?

Thanks,

	tglx
----

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 44d2cc0436f4..614f8d272cb0 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -575,8 +575,14 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 	update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono);
 	update_fast_timekeeper(&tk->tkr_raw,  &tk_fast_raw);
 
-	if (action & TK_CLOCK_WAS_SET)
+	if (action & TK_CLOCK_WAS_SET) {
 		tk->clock_was_set_seq++;
+		trace_printk("TK: Seq: %u R: %lld B: %lld T: %lld\n",
+			     tk->clock_was_set_seq,
+			     tk->offs_real,
+			     tk->offs_boot,
+			     tk->offs_tai);
+	}
 	/*
 	 * The mirroring of the data to the shadow-timekeeper needs
 	 * to happen last here to ensure we don't over-write the
@@ -1954,6 +1960,11 @@ ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq, ktime_t *offs_real,
 		base = ktime_add_ns(base, nsecs);
 
 		if (*cwsseq != tk->clock_was_set_seq) {
+			trace_printk("HR: Seq: %u R: %lld B: %lld T: %lld\n",
+				     tk->clock_was_set_seq,
+				     tk->offs_real,
+				     tk->offs_boot,
+				     tk->offs_tai);
 			*cwsseq = tk->clock_was_set_seq;
 			*offs_real = tk->offs_real;
 			*offs_boot = tk->offs_boot;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Problem about CPU stalling in hrtimer_intterrupts()
  2015-10-22 10:25     ` Thomas Gleixner
@ 2015-10-22 13:43       ` Yang Yingliang
  2015-10-24  9:58         ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Yang Yingliang @ 2015-10-22 13:43 UTC (permalink / raw)
  To: linux-arm-kernel


On 2015/10/22 18:25, Thomas Gleixner wrote:
> On Thu, 22 Oct 2015, Ding Tianhong wrote:
>> On 2015/10/22 15:43, Thomas Gleixner wrote:
>>>> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
>>>> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0
>>>> timer:ffffffdffdec6138, timer->function:ffffffc000129b84
>>>> Jan 01 00:03:32 Linux kernel: i:0 basenow.tv64:4809284991830
>>>> hrtimer_get_softexpires_tv64(timer):4440120000000 ccpu0
>
>> This problem could only occur on the system with 32 cores, when I
>> cut the cores to 16, this problem disappeared, so I think there is
>> some parallel problem when the 32 core set clock time together:
>
>> I try to reproduce the scene:
>>
>> 1.do_settimeofday64
>> 2.update tk time
>> 3.update base time offset
>> 4.update expires_next
>>
>> the 3 and 4 will be called in softirq, but the hrtimer_interrupt may
>> break the order and run before 3, I am not sure whether this could
>> make the problem, do we need to update base time and expires_next in
>> the hrtimer_interrupt?  maybe I miss something, thanks for any
>> suggestion.
>
> Base offset is updated in hrtimer_interrupt as
> well. hrtimer_update_base() does that. So that's not the problem.
>
> Can you apply the patch below and enable the hrtimer tracepoints and
> collect trace data across the point where the problem happens?
>
> Thanks,
>
> 	tglx
> ----
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 44d2cc0436f4..614f8d272cb0 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -575,8 +575,14 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
>   	update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono);
>   	update_fast_timekeeper(&tk->tkr_raw,  &tk_fast_raw);
>
> -	if (action & TK_CLOCK_WAS_SET)
> +	if (action & TK_CLOCK_WAS_SET) {
>   		tk->clock_was_set_seq++;
> +		trace_printk("TK: Seq: %u R: %lld B: %lld T: %lld\n",
> +			     tk->clock_was_set_seq,
> +			     tk->offs_real,
> +			     tk->offs_boot,
> +			     tk->offs_tai);
> +	}
>   	/*
>   	 * The mirroring of the data to the shadow-timekeeper needs
>   	 * to happen last here to ensure we don't over-write the
> @@ -1954,6 +1960,11 @@ ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq, ktime_t *offs_real,
>   		base = ktime_add_ns(base, nsecs);
>
>   		if (*cwsseq != tk->clock_was_set_seq) {
> +			trace_printk("HR: Seq: %u R: %lld B: %lld T: %lld\n",
> +				     tk->clock_was_set_seq,
> +				     tk->offs_real,
> +				     tk->offs_boot,
> +				     tk->offs_tai);
>   			*cwsseq = tk->clock_was_set_seq;
>   			*offs_real = tk->offs_real;
>   			*offs_boot = tk->offs_boot;
>
> .
>


I don't try this patch yet.
But I found out when the cpu is stalling, basenow.tv64(7676664221321) is
bigger than ktime_get().tv64(7336008904750) in hrtimer_interrupt() and
the timer->_softexpires is 7336288000000. This makes it can not finish
the while loop until ktime_get().tv64 arrives basenow.tv64.

Is it correct that basenow bigger than ktime_get() ?

Thanks,
Yang

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Problem about CPU stalling in hrtimer_intterrupts()
  2015-10-22 13:43       ` Yang Yingliang
@ 2015-10-24  9:58         ` Thomas Gleixner
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2015-10-24  9:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 22 Oct 2015, Yang Yingliang wrote:

> But I found out when the cpu is stalling, basenow.tv64(7676664221321) is
> bigger than ktime_get().tv64(7336008904750) in hrtimer_interrupt() and
> the timer->_softexpires is 7336288000000. This makes it can not finish
> the while loop until ktime_get().tv64 arrives basenow.tv64.
> 
> Is it correct that basenow bigger than ktime_get() ?

You only can compare basenow and ktime_get() for the clock monotonic
base. If you are actually observing this on clock monotonic base then
base->offset of clock monotonic is not 0, which should never happen.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-10-24  9:58 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-10-22  6:43 Problem about CPU stalling in hrtimer_intterrupts() Yang Yingliang
2015-10-22  7:43 ` Thomas Gleixner
2015-10-22  9:28   ` Ding Tianhong
2015-10-22 10:25     ` Thomas Gleixner
2015-10-22 13:43       ` Yang Yingliang
2015-10-24  9:58         ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).