diff for duplicates of <20161001175837.GP14933@linux.vnet.ibm.com> diff --git a/a/1.txt b/N1/1.txt index 7b43285..941088c 100644 --- a/a/1.txt +++ b/N1/1.txt @@ -14,8 +14,8 @@ On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote: > > > into a NOHZ idle sleep. > > > > > > > <idle>-0 [001] d... 109.953436: tick_stop: success=1 dependency=NONE -> > > > <idle>-0 [001] d... 109.953617: hrtimer_cancel: hrtimer\x109f449c -> > > > <idle>-0 [001] d... 109.953818: hrtimer_start: hrtimer\x109f449c function=tick_sched_timer expires\x109880000000 softexpires\x109880000000 +> > > > <idle>-0 [001] d... 109.953617: hrtimer_cancel: hrtimer=109f449c +> > > > <idle>-0 [001] d... 109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000 > > > > > > which is (using the 0.087621us delta between the trace clock and clock > > > MONO) at: 109.880 + 0.087621 = 109.968 @@ -46,40 +46,40 @@ and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior. But are there rcu0 kthreads running on your system? -> > > > rcu_sched-7 [001] d... 109.952633: timer_start: timer\x160a9eb0 function=process_timeout expiresB94948284 [timeout=1] flags=0x00000001 +> > > > rcu_sched-7 [001] d... 109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001 > > > > > > The CPU nr encoded in flags is: 1 > > > > > > Now we cancel and restart the timer w/o seing the interrupt expiring > > > it. And that expiry should have happened at 109.968000 !?! > > > -> > > > <idle>-0 [001] d... 109.968225: hrtimer_cancel: hrtimer\x109f449c -> > > > <idle>-0 [001] d... 109.968526: hrtimer_start: hrtimer\x109f449c function=tick_sched_timer expires\x109890000000 softexpires\x109890000000 +> > > > <idle>-0 [001] d... 109.968225: hrtimer_cancel: hrtimer=109f449c +> > > > <idle>-0 [001] d... 109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000 > > > > > > So this advances the next tick even further out. And CPU 0 sets the timer to > > > the exact smae value: > > > -> > > > <idle>-0 [000] d.h. 109.969104: hrtimer_start: hrtimer\x109e949c function=tick_sched_timer expires\x109890000000 softexpires\x109890000000 +> > > > <idle>-0 [000] d.h. 109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000 > > > > > > -> > > > <idle>-0 [000] d.h. 109.977690: irq_handler_entry: irq\x16 name=jcore_pit -> > > > <idle>-0 [000] d.h. 109.977911: hrtimer_cancel: hrtimer\x109e949c -> > > > <idle>-0 [000] d.h. 109.978053: hrtimer_expire_entry: hrtimer\x109e949c function=tick_sched_timer now\x109890434160 +> > > > <idle>-0 [000] d.h. 109.977690: irq_handler_entry: irq=16 name=jcore_pit +> > > > <idle>-0 [000] d.h. 109.977911: hrtimer_cancel: hrtimer=109e949c +> > > > <idle>-0 [000] d.h. 109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160 > > > > > > Which expires here. And CPU1 instead of getting an interrupt and expiring > > > the timer does the cancel/restart to the next jiffie again: > > > -> > > > <idle>-0 [001] d... 109.978206: hrtimer_cancel: hrtimer\x109f449c -> > > > <idle>-0 [001] d... 109.978495: hrtimer_start: hrtimer\x109f449c function=tick_sched_timer expires\x109900000000 softexpires\x109900000000 +> > > > <idle>-0 [001] d... 109.978206: hrtimer_cancel: hrtimer=109f449c +> > > > <idle>-0 [001] d... 109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000 > > > > > > And this repeats; > > > -> > > > <idle>-0 [000] d.h. 109.987726: irq_handler_entry: irq\x16 name=jcore_pit -> > > > <idle>-0 [000] d.h. 109.987954: hrtimer_cancel: hrtimer\x109e949c -> > > > <idle>-0 [000] d.h. 109.988095: hrtimer_expire_entry: hrtimer\x109e949c function=tick_sched_timer now\x109900474620 +> > > > <idle>-0 [000] d.h. 109.987726: irq_handler_entry: irq=16 name=jcore_pit +> > > > <idle>-0 [000] d.h. 109.987954: hrtimer_cancel: hrtimer=109e949c +> > > > <idle>-0 [000] d.h. 109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620 > > > -> > > > <idle>-0 [001] d... 109.988243: hrtimer_cancel: hrtimer\x109f449c -> > > > <idle>-0 [001] d... 109.988537: hrtimer_start: hrtimer\x109f449c fun9c +> > > > <idle>-0 [001] d... 109.988243: hrtimer_cancel: hrtimer=109f449c +> > > > <idle>-0 [001] d... 109.988537: hrtimer_start: hrtimer=109f449c fun9c > > > > > > There is something badly wrong here. > > > @@ -87,7 +87,7 @@ But are there rcu0 kthreads running on your system? > > > > <idle>-0 [000] ..s. 110.019617: softirq_exit: vec=1 [action=TIMER] > > > > <idle>-0 [000] ..s. 110.019730: softirq_entry: vec=7 [action=SCHED] > > > > <idle>-0 [000] ..s. 110.020174: softirq_exit: vec=7 [action=SCHED] -> > > > <idle>-0 [000] d.h. 110.027674: irq_handler_entry: irq\x16 name=jcore_pit +> > > > <idle>-0 [000] d.h. 110.027674: irq_handler_entry: irq=16 name=jcore_pit > > > > > > > > The rcu_sched process does not run again after the tick_stop until > > > > 132s, and only a few RCU softirqs happen (all shown above). During @@ -110,22 +110,22 @@ But are there rcu0 kthreads running on your system? > > > Please put a tracepoint into your set_next_event() callback as well. SO > > > this changes here: > > > -> > > > <idle>-0 [001] d... 132.198170: hrtimer_cancel: hrtimer\x109f449c -> > > > <idle>-0 [001] d... 132.198451: hrtimer_start: hrtimer\x109f449c function=tick_sched_timer expires\x132120000000 softexpires\x132120000000 +> > > > <idle>-0 [001] d... 132.198170: hrtimer_cancel: hrtimer=109f449c +> > > > <idle>-0 [001] d... 132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000 > > > -> > > > <idle>-0 [001] dnh. 132.205860: irq_handler_entry: irq name=ipi -> > > > <idle>-0 [001] dnh. 132.206041: irq_handler_exit: irq ret=handle +> > > > <idle>-0 [001] dnh. 132.205860: irq_handler_entry: irq=20 name=ipi +> > > > <idle>-0 [001] dnh. 132.206041: irq_handler_exit: irq=20 ret=handle > > > > > > So CPU1 gets an IPI > > > -> > > > <idle>-0 [001] dn.. 132.206650: hrtimer_cancel: hrtimer\x109f449c -> > > 49c function=tick_sched_timer now\x132119115200 -> > > > <idle>-0 [001] dn.. 132.206936: hrtimer_start: hrtimer\x109f449c function=tick_sched_timer expires\x132120000000 softexpires\x132120000000 +> > > > <idle>-0 [001] dn.. 132.206650: hrtimer_cancel: hrtimer=109f449c +> > > 49c function=tick_sched_timer now=132119115200 +> > > > <idle>-0 [001] dn.. 132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000 > > > > > > And rcu-sched-7 gets running magically, but we don't know what woke it > > > up. Definitely not the timer, because that did not fire. > > > -> > > > rcu_sched-7 [001] d... 132.207710: timer_cancel: timer\x160a9eb0 +> > > > rcu_sched-7 [001] d... 132.207710: timer_cancel: timer=160a9eb0 > > > > It could have been an explicit wakeup at the end of a grace period. That > > would explain its cancelling the timer, anyway. @@ -161,9 +161,9 @@ be awakened every few jiffies by default. > > > obviously either getting out of idle and then moves the tick ahead for some > > > unknown reason. > > -> > And a one-jiffy timeout is in fact expected behavior when HZ\x100. -> > You have to be running HZ%0 or better to have two-jiffy timeouts, -> > and HZP0 or better for three-jiffy timeouts. +> > And a one-jiffy timeout is in fact expected behavior when HZ=100. +> > You have to be running HZ=250 or better to have two-jiffy timeouts, +> > and HZ=500 or better for three-jiffy timeouts. > > One possible theory I'm looking at is that the two cpus are both > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy diff --git a/a/content_digest b/N1/content_digest index 274bd16..f4e633b 100644 --- a/a/content_digest +++ b/N1/content_digest @@ -10,7 +10,7 @@ "ref\020161001170508.GO19318@brightrain.aerifal.cx\0" "From\0Paul E. McKenney <paulmck@linux.vnet.ibm.com>\0" "Subject\0Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver\0" - "Date\0Sat, 01 Oct 2016 17:58:37 +0000\0" + "Date\0Sat, 1 Oct 2016 10:58:37 -0700\0" "To\0Rich Felker <dalias@libc.org>\0" "Cc\0Thomas Gleixner <tglx@linutronix.de>" Daniel Lezcano <daniel.lezcano@linaro.org> @@ -37,8 +37,8 @@ "> > > into a NOHZ idle sleep.\n" "> > > \n" "> > > > <idle>-0 [001] d... 109.953436: tick_stop: success=1 dependency=NONE\n" - "> > > > <idle>-0 [001] d... 109.953617: hrtimer_cancel: hrtimer\0209f449c\n" - "> > > > <idle>-0 [001] d... 109.953818: hrtimer_start: hrtimer\0209f449c function=tick_sched_timer expires\0209880000000 softexpires\0209880000000\n" + "> > > > <idle>-0 [001] d... 109.953617: hrtimer_cancel: hrtimer=109f449c\n" + "> > > > <idle>-0 [001] d... 109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000\n" "> > > \n" "> > > which is (using the 0.087621us delta between the trace clock and clock\n" "> > > MONO) at: 109.880 + 0.087621 = 109.968\n" @@ -69,40 +69,40 @@ "\n" "But are there rcu0 kthreads running on your system?\n" "\n" - "> > > > rcu_sched-7 [001] d... 109.952633: timer_start: timer\0260a9eb0 function=process_timeout expiresB94948284 [timeout=1] flags=0x00000001\n" + "> > > > rcu_sched-7 [001] d... 109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001\n" "> > > \n" "> > > The CPU nr encoded in flags is: 1\n" "> > > \n" "> > > Now we cancel and restart the timer w/o seing the interrupt expiring\n" "> > > it. And that expiry should have happened at 109.968000 !?!\n" "> > > \n" - "> > > > <idle>-0 [001] d... 109.968225: hrtimer_cancel: hrtimer\0209f449c\n" - "> > > > <idle>-0 [001] d... 109.968526: hrtimer_start: hrtimer\0209f449c function=tick_sched_timer expires\0209890000000 softexpires\0209890000000\n" + "> > > > <idle>-0 [001] d... 109.968225: hrtimer_cancel: hrtimer=109f449c\n" + "> > > > <idle>-0 [001] d... 109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000\n" "> > > \n" "> > > So this advances the next tick even further out. And CPU 0 sets the timer to\n" "> > > the exact smae value:\n" "> > > \n" - "> > > > <idle>-0 [000] d.h. 109.969104: hrtimer_start: hrtimer\0209e949c function=tick_sched_timer expires\0209890000000 softexpires\0209890000000\n" + "> > > > <idle>-0 [000] d.h. 109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000\n" "> > > \n" "> > > \n" - "> > > > <idle>-0 [000] d.h. 109.977690: irq_handler_entry: irq\026 name=jcore_pit\n" - "> > > > <idle>-0 [000] d.h. 109.977911: hrtimer_cancel: hrtimer\0209e949c\n" - "> > > > <idle>-0 [000] d.h. 109.978053: hrtimer_expire_entry: hrtimer\0209e949c function=tick_sched_timer now\0209890434160\n" + "> > > > <idle>-0 [000] d.h. 109.977690: irq_handler_entry: irq=16 name=jcore_pit\n" + "> > > > <idle>-0 [000] d.h. 109.977911: hrtimer_cancel: hrtimer=109e949c\n" + "> > > > <idle>-0 [000] d.h. 109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160\n" "> > > \n" "> > > Which expires here. And CPU1 instead of getting an interrupt and expiring\n" "> > > the timer does the cancel/restart to the next jiffie again:\n" "> > > \n" - "> > > > <idle>-0 [001] d... 109.978206: hrtimer_cancel: hrtimer\0209f449c\n" - "> > > > <idle>-0 [001] d... 109.978495: hrtimer_start: hrtimer\0209f449c function=tick_sched_timer expires\0209900000000 softexpires\0209900000000\n" + "> > > > <idle>-0 [001] d... 109.978206: hrtimer_cancel: hrtimer=109f449c\n" + "> > > > <idle>-0 [001] d... 109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000\n" "> > > \n" "> > > And this repeats;\n" "> > > \n" - "> > > > <idle>-0 [000] d.h. 109.987726: irq_handler_entry: irq\026 name=jcore_pit\n" - "> > > > <idle>-0 [000] d.h. 109.987954: hrtimer_cancel: hrtimer\0209e949c\n" - "> > > > <idle>-0 [000] d.h. 109.988095: hrtimer_expire_entry: hrtimer\0209e949c function=tick_sched_timer now\0209900474620\n" + "> > > > <idle>-0 [000] d.h. 109.987726: irq_handler_entry: irq=16 name=jcore_pit\n" + "> > > > <idle>-0 [000] d.h. 109.987954: hrtimer_cancel: hrtimer=109e949c\n" + "> > > > <idle>-0 [000] d.h. 109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620\n" "> > > \n" - "> > > > <idle>-0 [001] d... 109.988243: hrtimer_cancel: hrtimer\0209f449c\n" - "> > > > <idle>-0 [001] d... 109.988537: hrtimer_start: hrtimer\0209f449c fun9c\n" + "> > > > <idle>-0 [001] d... 109.988243: hrtimer_cancel: hrtimer=109f449c\n" + "> > > > <idle>-0 [001] d... 109.988537: hrtimer_start: hrtimer=109f449c fun9c\n" "> > > \n" "> > > There is something badly wrong here.\n" "> > > \n" @@ -110,7 +110,7 @@ "> > > > <idle>-0 [000] ..s. 110.019617: softirq_exit: vec=1 [action=TIMER]\n" "> > > > <idle>-0 [000] ..s. 110.019730: softirq_entry: vec=7 [action=SCHED]\n" "> > > > <idle>-0 [000] ..s. 110.020174: softirq_exit: vec=7 [action=SCHED]\n" - "> > > > <idle>-0 [000] d.h. 110.027674: irq_handler_entry: irq\026 name=jcore_pit\n" + "> > > > <idle>-0 [000] d.h. 110.027674: irq_handler_entry: irq=16 name=jcore_pit\n" "> > > > \n" "> > > > The rcu_sched process does not run again after the tick_stop until\n" "> > > > 132s, and only a few RCU softirqs happen (all shown above). During\n" @@ -133,22 +133,22 @@ "> > > Please put a tracepoint into your set_next_event() callback as well. SO\n" "> > > this changes here:\n" "> > > \n" - "> > > > <idle>-0 [001] d... 132.198170: hrtimer_cancel: hrtimer\0209f449c\n" - "> > > > <idle>-0 [001] d... 132.198451: hrtimer_start: hrtimer\0209f449c function=tick_sched_timer expires\0232120000000 softexpires\0232120000000\n" + "> > > > <idle>-0 [001] d... 132.198170: hrtimer_cancel: hrtimer=109f449c\n" + "> > > > <idle>-0 [001] d... 132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000\n" "> > > \n" - "> > > > <idle>-0 [001] dnh. 132.205860: irq_handler_entry: irq name=ipi\n" - "> > > > <idle>-0 [001] dnh. 132.206041: irq_handler_exit: irq ret=handle\n" + "> > > > <idle>-0 [001] dnh. 132.205860: irq_handler_entry: irq=20 name=ipi\n" + "> > > > <idle>-0 [001] dnh. 132.206041: irq_handler_exit: irq=20 ret=handle\n" "> > > \n" "> > > So CPU1 gets an IPI\n" "> > > \n" - "> > > > <idle>-0 [001] dn.. 132.206650: hrtimer_cancel: hrtimer\0209f449c\n" - "> > > 49c function=tick_sched_timer now\0232119115200\n" - "> > > > <idle>-0 [001] dn.. 132.206936: hrtimer_start: hrtimer\0209f449c function=tick_sched_timer expires\0232120000000 softexpires\0232120000000\n" + "> > > > <idle>-0 [001] dn.. 132.206650: hrtimer_cancel: hrtimer=109f449c\n" + "> > > 49c function=tick_sched_timer now=132119115200\n" + "> > > > <idle>-0 [001] dn.. 132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000\n" "> > > \n" "> > > And rcu-sched-7 gets running magically, but we don't know what woke it\n" "> > > up. Definitely not the timer, because that did not fire.\n" "> > > \n" - "> > > > rcu_sched-7 [001] d... 132.207710: timer_cancel: timer\0260a9eb0\n" + "> > > > rcu_sched-7 [001] d... 132.207710: timer_cancel: timer=160a9eb0\n" "> > \n" "> > It could have been an explicit wakeup at the end of a grace period. That\n" "> > would explain its cancelling the timer, anyway.\n" @@ -184,9 +184,9 @@ "> > > obviously either getting out of idle and then moves the tick ahead for some\n" "> > > unknown reason.\n" "> > \n" - "> > And a one-jiffy timeout is in fact expected behavior when HZ\0200.\n" - "> > You have to be running HZ%0 or better to have two-jiffy timeouts,\n" - "> > and HZP0 or better for three-jiffy timeouts.\n" + "> > And a one-jiffy timeout is in fact expected behavior when HZ=100.\n" + "> > You have to be running HZ=250 or better to have two-jiffy timeouts,\n" + "> > and HZ=500 or better for three-jiffy timeouts.\n" "> \n" "> One possible theory I'm looking at is that the two cpus are both\n" "> waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy\n" @@ -221,4 +221,4 @@ "\n" "\t\t\t\t\t\t\tThanx, Paul" -327035edb1076a24a560d10d0f4e82394fb341db6ef336e1348e86f1161fab8f +b78201c5a7cc5594b8b77ccbb3b3714b249660fa4228928e121fc35e8fcd189f
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.