* Interrupt Bottom Half Scheduling
@ 2011-02-14 22:31 Peter LaDow
2011-02-14 23:04 ` Sven-Thorsten Dietrich
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Peter LaDow @ 2011-02-14 22:31 UTC (permalink / raw)
To: linux-rt-users
How is the scheduling of the hrtimers softirq thread handled?
When querying the RT priority of the hrtimer softirq, I get a priority
of 50. But when running a priority 99 thread, we still seem to be
getting interrupted. Shouldn't the hrtimer softirq be put off until
the CPU is idle or a lower priority task is running?
Thanks,
Pete
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-14 22:31 Interrupt Bottom Half Scheduling Peter LaDow @ 2011-02-14 23:04 ` Sven-Thorsten Dietrich 2011-02-14 23:08 ` Peter LaDow 2011-02-14 23:30 ` Frank Rowand 2011-02-15 8:23 ` Uwe Kleine-König 2 siblings, 1 reply; 18+ messages in thread From: Sven-Thorsten Dietrich @ 2011-02-14 23:04 UTC (permalink / raw) To: Peter LaDow; +Cc: linux-rt-users On Mon, 2011-02-14 at 14:31 -0800, Peter LaDow wrote: > How is the scheduling of the hrtimers softirq thread handled? > > When querying the RT priority of the hrtimer softirq, I get a priority > of 50. But when running a priority 99 thread, we still seem to be > getting interrupted. Shouldn't the hrtimer softirq be put off until > the CPU is idle or a lower priority task is running? Does your prio 99 thread perhaps encounter a prio inversion dependency on one of the softirq threads? > > Thanks, > Pete > -- > To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-14 23:04 ` Sven-Thorsten Dietrich @ 2011-02-14 23:08 ` Peter LaDow 2011-02-14 23:35 ` Sven-Thorsten Dietrich 0 siblings, 1 reply; 18+ messages in thread From: Peter LaDow @ 2011-02-14 23:08 UTC (permalink / raw) To: Sven-Thorsten Dietrich; +Cc: linux-rt-users Not sure how that is possible. This is related to my earlier posting about timing jitter. Our code is basically this: while(1) { t1 = clock_gettime() for(i=0; i < 10000; i++) t2 = clock_gettime() diff = t2 - t1 } This task is pending on nothing. Unless clock_gettime() causes some sort of priority inversion, I don't see the problem. We see significant jitter on the for-loop when there are a significant number of other kernel timers. Now, as we understand it, the hrtimers run in the softirq. But if the softirq is priority 50, and this for-loop is priority 99, it shouldn't be affected by the softirq thread. Pete On Mon, Feb 14, 2011 at 3:04 PM, Sven-Thorsten Dietrich <thebigcorporation@gmail.com> wrote: > On Mon, 2011-02-14 at 14:31 -0800, Peter LaDow wrote: >> How is the scheduling of the hrtimers softirq thread handled? >> >> When querying the RT priority of the hrtimer softirq, I get a priority >> of 50. But when running a priority 99 thread, we still seem to be >> getting interrupted. Shouldn't the hrtimer softirq be put off until >> the CPU is idle or a lower priority task is running? > > Does your prio 99 thread perhaps encounter a prio inversion dependency > on one of the softriq threads? > >> >> Thanks, >> Pete >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
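For reference, a compilable version of that loop might look like the sketch below. This is an illustration only, not the actual test program: the CLOCK_MONOTONIC choice, the helper name, and the printf() are assumptions, and the task is assumed to be started with something like "chrt -f 99". On older glibc, link with -lrt for clock_gettime().

#include <stdio.h>
#include <time.h>

static long long to_ns(const struct timespec *t)
{
        return (long long)t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(void)
{
        struct timespec t1, t2;
        int i;

        for (;;) {
                clock_gettime(CLOCK_MONOTONIC, &t1);
                for (i = 0; i < 10000; i++)
                        clock_gettime(CLOCK_MONOTONIC, &t2);
                /* jitter shows up as variation in this duration */
                printf("%lld ns\n", to_ns(&t2) - to_ns(&t1));
        }
        return 0;
}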
* Re: Interrupt Bottom Half Scheduling 2011-02-14 23:08 ` Peter LaDow @ 2011-02-14 23:35 ` Sven-Thorsten Dietrich 2011-02-14 23:42 ` Peter LaDow 0 siblings, 1 reply; 18+ messages in thread From: Sven-Thorsten Dietrich @ 2011-02-14 23:35 UTC (permalink / raw) To: Peter LaDow; +Cc: linux-rt-users On Mon, 2011-02-14 at 15:08 -0800, Peter LaDow wrote: > Note sure how that is possible. This is related to my earlier posting > about timing jitter. Our code is basically this: > Where is the semicolon ending the for loop? > while(1) > { > t1 = clock_gettime() > for(i=0; i < 10000; i++) > t2 = clock_gettime() > > diff = t2 - 1 > } > > This task is pending on nothing. Unless clock_gettime() causes some > sort of priority inversion, I don't see the problem. We see > significant jitter on the for-loop when there are a significant number > of other kernel timers. Now, as we understand it, the hrtimers run in > the softirq. But if the softirq is priority 50, and this for-loop is > priority 99, it shouldn't be affected by the softirq thread. > > Pete > > On Mon, Feb 14, 2011 at 3:04 PM, Sven-Thorsten Dietrich > <thebigcorporation@gmail.com> wrote: > > On Mon, 2011-02-14 at 14:31 -0800, Peter LaDow wrote: > >> How is the scheduling of the hrtimers softirq thread handled? > >> > >> When querying the RT priority of the hrtimer softirq, I get a priority > >> of 50. But when running a priority 99 thread, we still seem to be > >> getting interrupted. Shouldn't the hrtimer softirq be put off until > >> the CPU is idle or a lower priority task is running? > > > > Does your prio 99 thread perhaps encounter a prio inversion dependency > > on one of the softriq threads? > > > >> > >> Thanks, > >> Pete > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in > >> the body of a message to majordomo@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-14 23:35 ` Sven-Thorsten Dietrich @ 2011-02-14 23:42 ` Peter LaDow 2011-02-15 0:50 ` Sven-Thorsten Dietrich 0 siblings, 1 reply; 18+ messages in thread From: Peter LaDow @ 2011-02-14 23:42 UTC (permalink / raw) To: Sven-Thorsten Dietrich; +Cc: Peter LaDow, linux-rt-users@vger.kernel.org Pseudo-code. I can post the full code if it helps. On Feb 14, 2011, at 3:35 PM, Sven-Thorsten Dietrich <thebigcorporation@gmail.com> wrote: > On Mon, 2011-02-14 at 15:08 -0800, Peter LaDow wrote: >> Note sure how that is possible. This is related to my earlier posting >> about timing jitter. Our code is basically this: >> > > Where is the semicolon ending the for loop? > > >> while(1) >> { >> t1 = clock_gettime() >> for(i=0; i < 10000; i++) >> t2 = clock_gettime() >> >> diff = t2 - 1 >> } >> >> This task is pending on nothing. Unless clock_gettime() causes some >> sort of priority inversion, I don't see the problem. We see >> significant jitter on the for-loop when there are a significant number >> of other kernel timers. Now, as we understand it, the hrtimers run in >> the softirq. But if the softirq is priority 50, and this for-loop is >> priority 99, it shouldn't be affected by the softirq thread. >> >> Pete >> >> On Mon, Feb 14, 2011 at 3:04 PM, Sven-Thorsten Dietrich >> <thebigcorporation@gmail.com> wrote: >>> On Mon, 2011-02-14 at 14:31 -0800, Peter LaDow wrote: >>>> How is the scheduling of the hrtimers softirq thread handled? >>>> >>>> When querying the RT priority of the hrtimer softirq, I get a priority >>>> of 50. But when running a priority 99 thread, we still seem to be >>>> getting interrupted. Shouldn't the hrtimer softirq be put off until >>>> the CPU is idle or a lower priority task is running? >>> >>> Does your prio 99 thread perhaps encounter a prio inversion dependency >>> on one of the softriq threads? >>> >>>> >>>> Thanks, >>>> Pete >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >>> > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-14 23:42 ` Peter LaDow @ 2011-02-15 0:50 ` Sven-Thorsten Dietrich 0 siblings, 0 replies; 18+ messages in thread From: Sven-Thorsten Dietrich @ 2011-02-15 0:50 UTC (permalink / raw) To: Peter LaDow; +Cc: Peter LaDow, linux-rt-users@vger.kernel.org On 02/14/2011 03:42 PM, Peter LaDow wrote: > Pseudo-code. I can post the full code if it helps. > My first thought was whether the compiler optimizes away the for loop. Either way you would be hammering on the clock_gettime in a narly way, it might be better just to hand-code a register read, if this is just some kind of benchmark. It would be good to know the time source chipset and its resolution. what is the distribution for t2 - t1 btw? for starters ;) On Feb 14, 2011, at 3:35 PM, Sven-Thorsten Dietrich <thebigcorporation@gmail.com> wrote: >> On Mon, 2011-02-14 at 15:08 -0800, Peter LaDow wrote: >>> Note sure how that is possible. This is related to my earlier posting >>> about timing jitter. Our code is basically this: >>> >> Where is the semicolon ending the for loop? >> >> >>> while(1) >>> { >>> t1 = clock_gettime() >>> for(i=0; i< 10000; i++) >>> t2 = clock_gettime() >>> >>> diff = t2 - 1 >>> } >>> >>> This task is pending on nothing. Unless clock_gettime() causes some >>> sort of priority inversion, I don't see the problem. We see >>> significant jitter on the for-loop when there are a significant number >>> of other kernel timers. Now, as we understand it, the hrtimers run in >>> the softirq. But if the softirq is priority 50, and this for-loop is >>> priority 99, it shouldn't be affected by the softirq thread. >>> >>> Pete >>> >>> On Mon, Feb 14, 2011 at 3:04 PM, Sven-Thorsten Dietrich >>> <thebigcorporation@gmail.com> wrote: >>>> On Mon, 2011-02-14 at 14:31 -0800, Peter LaDow wrote: >>>>> How is the scheduling of the hrtimers softirq thread handled? >>>>> >>>>> When querying the RT priority of the hrtimer softirq, I get a priority >>>>> of 50. But when running a priority 99 thread, we still seem to be >>>>> getting interrupted. Shouldn't the hrtimer softirq be put off until >>>>> the CPU is idle or a lower priority task is running? >>>> Does your prio 99 thread perhaps encounter a prio inversion dependency >>>> on one of the softriq threads? >>>> >>>>> Thanks, >>>>> Pete >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> >> ^ permalink raw reply [flat|nested] 18+ messages in thread
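Hand-coding the register read that Sven suggests is straightforward on this platform. A sketch, assuming 32-bit PowerPC, where the 64-bit timebase has to be read in two halves and re-checked for rollover:

static inline unsigned long long read_timebase(void)
{
        unsigned long hi, lo, chk;

        do {
                asm volatile("mftbu %0" : "=r" (hi));
                asm volatile("mftb  %0" : "=r" (lo));
                asm volatile("mftbu %0" : "=r" (chk));
        } while (chk != hi);    /* upper half rolled over mid-read; retry */

        return ((unsigned long long)hi << 32) | lo;
}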
* Re: Interrupt Bottom Half Scheduling 2011-02-14 22:31 Interrupt Bottom Half Scheduling Peter LaDow 2011-02-14 23:04 ` Sven-Thorsten Dietrich @ 2011-02-14 23:30 ` Frank Rowand 2011-02-15 1:10 ` Peter LaDow 2011-02-15 8:23 ` Uwe Kleine-König 2 siblings, 1 reply; 18+ messages in thread From: Frank Rowand @ 2011-02-14 23:30 UTC (permalink / raw) To: Peter LaDow; +Cc: linux-rt-users On Mon, Feb 14, 2011 at 2:31 PM, Peter LaDow <petela@gocougs.wsu.edu> wrote: > How is the scheduling of the hrtimers softirq thread handled? > > When querying the RT priority of the hrtimer softirq, I get a priority > of 50. But when running a priority 99 thread, we still seem to be > getting interrupted. Shouldn't the hrtimer softirq be put off until > the CPU is idle or a lower priority task is running? Is the hrtimer softirq executing when the priority 99 thread is spinning in it's for loop? Your "jitter Due to Large Number of Timers" email said that the lower priority tasks don't seem to be interrupting the priority 99 thread. The hardware timer interupts will interrupt the priority 99 thread. The cost of these interrupts and the resultant calls to try_to_wake_up() of the hrtimer softirq might be quite large considering the rate of timer expires you mentioned in your first email. Out of curiosity, is the system UP or SMP? -Frank -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-14 23:30 ` Frank Rowand @ 2011-02-15 1:10 ` Peter LaDow 2011-02-15 1:58 ` Frank Rowand 2011-02-15 8:40 ` Armin Steinhoff 0 siblings, 2 replies; 18+ messages in thread From: Peter LaDow @ 2011-02-15 1:10 UTC (permalink / raw) To: Frank Rowand; +Cc: linux-rt-users On Mon, Feb 14, 2011 at 3:30 PM, Frank Rowand <frank.rowand@gmail.com> wrote: > On Mon, Feb 14, 2011 at 2:31 PM, Peter LaDow <petela@gocougs.wsu.edu> wrote: >> How is the scheduling of the hrtimers softirq thread handled? >> >> When querying the RT priority of the hrtimer softirq, I get a priority >> of 50. But when running a priority 99 thread, we still seem to be >> getting interrupted. Shouldn't the hrtimer softirq be put off until >> the CPU is idle or a lower priority task is running? > > Is the hrtimer softirq executing when the priority 99 thread is spinning > in it's for loop? Your "jitter Due to Large Number of Timers" email > said that the lower priority tasks don't seem to be interrupting the > priority 99 thread. Did I? Hmm, well I mean the lower priority task with 100 threads. At least I think so. It is hard to tell. It seems to me that the softirq thread is the source of the problem. Since the tight loop is getting such a variety of times (400us of jitter only while the other process is running), it does seem that the loop is getting interrupted. > The hardware timer interupts will interrupt the priority 99 thread. The > cost of these interrupts and the resultant calls to try_to_wake_up() > of the hrtimer softirq might be quite large considering the rate of > timer expires you mentioned in your first email. Sure, we expect the timer interrupt to interfere. But as we understand it, the softirq is what schedules the task switch. The top half only schedules the bottom half. But since the bottom half is priority 50, there shouldn't be any interruption of the priority 99 task except to handle the low level IRQ. > Out of curiosity, is the system UP or SMP? UP. Just a single MPC8349. pete -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-15 1:10 ` Peter LaDow @ 2011-02-15 1:58 ` Frank Rowand 2011-02-15 2:16 ` Frank Rowand 2011-02-15 16:42 ` Peter LaDow 2011-02-15 8:40 ` Armin Steinhoff 1 sibling, 2 replies; 18+ messages in thread From: Frank Rowand @ 2011-02-15 1:58 UTC (permalink / raw) To: Peter LaDow; +Cc: linux-rt-users On Mon, Feb 14, 2011 at 5:10 PM, Peter LaDow <petela@gocougs.wsu.edu> wrote: > On Mon, Feb 14, 2011 at 3:30 PM, Frank Rowand <frank.rowand@gmail.com> wrote: >> On Mon, Feb 14, 2011 at 2:31 PM, Peter LaDow <petela@gocougs.wsu.edu> wrote: >>> How is the scheduling of the hrtimers softirq thread handled? >>> >>> When querying the RT priority of the hrtimer softirq, I get a priority >>> of 50. But when running a priority 99 thread, we still seem to be >>> getting interrupted. Shouldn't the hrtimer softirq be put off until >>> the CPU is idle or a lower priority task is running? >> >> Is the hrtimer softirq executing when the priority 99 thread is spinning >> in it's for loop? Your "jitter Due to Large Number of Timers" email >> said that the lower priority tasks don't seem to be interrupting the >> priority 99 thread. > > Did I? Hmm, well I mean the lower priority task with 100 threads. At > least I think so. It is hard to tell. > > It seems to me that the softirq thread is the source of the problem. > Since the tight loop is getting such a variety of times (400us of > jitter only while the other process is running) that it does seem that > the loop is getting interrupt. Just so we are speaking with a common definition of jitter, your first email said that the duration of the priority 99 thread loop increased by around 350us (average and maximum) when the lower priority task timers were added to the system. > >> The hardware timer interupts will interrupt the priority 99 thread. The >> cost of these interrupts and the resultant calls to try_to_wake_up() >> of the hrtimer softirq might be quite large considering the rate of >> timer expires you mentioned in your first email. > > Sure, we expect the timer interrupt to interfere. But as we So what is the overhead of the timer interrupt? 1) All 100 of the the test100 timers pop at the same time: - 1 hardware interrupt - Chasing through an hrtimer list of size 100 - call try_to_wake_up() for the hrtimer softirq (I don't remember whether try_to_wake_up() will be called just once, or once per timer. But even if called 100 times, the first call is "expensive" and the other 99 will be very cheap). 2) Each of the 100 test100 timers pop at a separate, unique time: - 100 hardware interrupts - For each interrupt, chase through an hrtimer list of size 1 - For each interrupt, call try_to_wake_up(). The first call is "expensive", for the other 99 interrupts the call will be cheap. 3) Then every other possible combination of clumps of timers popping at the same time. 4) Just for completeness, not all of the test100 timers has to pop during each iteration of the priority 99 thread loop, but that does not impact the analysis of the worst case scenario, so we can just ignore that. I would expect scenario 1 to have the lowest overhead, scenario 2 to have the highest overhead, and scenario 3 to be in the middle. For a 533 Mhz PPC, I would not be surprised if the overhead of these three scenarios is as large as 350 us. > understand it, the softirq is what schedules the task switch. The top > half only schedules the bottom half. 
But since the bottom half is > priority 50, there shouldn't be any interruption of the priority 99 > expect to handle the low level IRQ. You can verify whether any other process is executing through a variety of tools. LTT if you have it in your kernel. I think 2.6.29 had ftrace "Trace process context switches" (CONFIG_CONTEXT_SWITCH_TRACER), I don't think there was a perf option for context switches in 2.6.29. > >> Out of curiosity, is the system UP or SMP? > > UP. Just a single MPC5349. > > pete > -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
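To make the scenarios above concrete, the kind of background load being discussed can be approximated with a sketch like the following (illustrative only; the thread count and the staggered periods are assumptions, not the actual test100 program). Each thread arms its own hrtimer through clock_nanosleep(), and the staggered periods push things toward scenario 2, the worst case. Build with -lpthread (and -lrt on older glibc).

#include <pthread.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 100

static void *sleeper(void *arg)
{
        long n = (long)arg;
        /* 1 ms plus a per-thread offset so the expiries do not line up */
        struct timespec period = { 0, 1000000 + n * 10000 };

        for (;;)
                clock_nanosleep(CLOCK_MONOTONIC, 0, &period, NULL);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long n;

        for (n = 0; n < NTHREADS; n++)
                pthread_create(&tid[n], NULL, sleeper, (void *)n);
        pause();        /* let the workers run */
        return 0;
}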
* Re: Interrupt Bottom Half Scheduling 2011-02-15 1:58 ` Frank Rowand @ 2011-02-15 2:16 ` Frank Rowand 2011-02-15 16:42 ` Peter LaDow 1 sibling, 0 replies; 18+ messages in thread From: Frank Rowand @ 2011-02-15 2:16 UTC (permalink / raw) To: Peter LaDow; +Cc: linux-rt-users On Mon, Feb 14, 2011 at 5:58 PM, Frank Rowand <frank.rowand@gmail.com> wrote: > I don't think there was a perf option for context switches in 2.6.29. But it probably is in 2.6.33. I don't remember when it first showed up... -Frank ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-15 1:58 ` Frank Rowand 2011-02-15 2:16 ` Frank Rowand @ 2011-02-15 16:42 ` Peter LaDow 2011-02-15 18:38 ` Frank Rowand 1 sibling, 1 reply; 18+ messages in thread From: Peter LaDow @ 2011-02-15 16:42 UTC (permalink / raw) To: Frank Rowand; +Cc: linux-rt-users On Mon, Feb 14, 2011 at 5:58 PM, Frank Rowand <frank.rowand@gmail.com> wrote: > Just so we are speaking with a common definition of jitter, your first email > said that the duration of the priority 99 thread loop increased by > around 350us (average and maximum) when the lower priority task > timers were added to the system. Well, I'm only speaking to the maximum. We do expect some increase in the maximum runtime of the loop when those other timers are added. However, we did not expect it to occasionally spike by 350us. >> Sure, we expect the timer interrupt to interfere. But as we > > So what is the overhead of the timer interrupt? We are on a PPC platform, and the decrementer interrupt is in arch/powerpc/kernel/time.c on lines 541-593. The only line that seems like it can have an impact (at least with regard to the timers) is on line 576: evt->event_handler(evt); Which according to /proc/timer_list is hrtimer_interrupt. This is found in kernel/hrtimer.c (lines 1195-1267). And this does indeed seem to be where the bulk of the problem lies. On line 1226 we have: while ((node = base->first)) { Which loops through all the clock bases. This only checks the first timer on the rbtree (uses base->first). It then calls __run_hrtimer with the timer at the head of the tree. And __run_hrtimer calls the timer callback function. In the case of these timers it is hrtimer_wakeup. And each of these calls wake_up_process(). So hmm, perhaps this is it. There is no softirq that calls the wakeup function. In fact, there doesn't seem to be a bottom half in this case at all. The decrementer interrupt does all the work, rather than postpone it to a bottom half. Looking at the call tree: timer_interrupt | + hrtimer_interrupt | + __run_hrtimer | + hrtimer_wakeup | + wake_up_process | + try_to_wake_up And the try_to_wake_up is the scheduler (no?). So, if this is the chain of events, then what is sirq-hrtimer for? I see in hrtimers_init (lines 1642-1650): open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq); And run_hrtimer_softirq eventually calls hrtimer_interrupt. But the prior mechanism seems to be the standard means. Even on my x86 box (2.6.32-28) it shows hrtimer_interrupt as the event handler for the clocks. And arch/x86/kernel/time_32.c and arch/x86/kernel/time_64.c both take the same route. So, it seems to me that run_hrtimer_softirq never gets called via any interrupt mechanism. In fact, it only seems to be called when creating timers such as in nanosleep. The HRTIMER_SOFTIRQ is only raised in hrtimer_enqueue_reprogram, which is called in hrtimer_start_range_ns. And none of these have to do with timer expiration. So, it seems the problem really is interrupt overhead. We had presumed that the timer sirq-hrtimer handled these timer expirations, and thus the scheduler. Rather, we find that a full reschedule is being done on every interrupt. Does my analysis make sense? Thanks, Pete -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-15 16:42 ` Peter LaDow @ 2011-02-15 18:38 ` Frank Rowand 2011-02-15 18:40 ` Frank Rowand 2011-02-15 19:12 ` Peter LaDow 0 siblings, 2 replies; 18+ messages in thread From: Frank Rowand @ 2011-02-15 18:38 UTC (permalink / raw) To: Peter LaDow; +Cc: Frank Rowand, linux-rt-users On 02/15/11 08:42, Peter LaDow wrote: > On Mon, Feb 14, 2011 at 5:58 PM, Frank Rowand <frank.rowand@gmail.com> wrote: >> Just so we are speaking with a common definition of jitter, your first email >> said that the duration of the priority 99 thread loop increased by >> around 350us (average and maximum) when the lower priority task >> timers were added to the system. > > Well, I'm only speaking to the maximum. We do expect some increase in > the maximum runtime of the loop when those other timers are added. > However, we did not expect it to occasionally spike by 350us. > >>> Sure, we expect the timer interrupt to interfere. But as we >> >> So what is the overhead of the timer interrupt? > > We are on a PPC platform, and the decrementer interrupt is in > arch/powerpc/kernel/time.c on lines 541-593. The only line that seems > that it can have an impact (at least with regard to the timers) is on > line 576: > > evt->event_handler(evt); > > Which according to /proc/timer_list is hrtimer_interrupt. This is > found in kernel/hrtimer.c (lines 1195-1267). And this does indeed > seem to be where the bulk of the problem lies. On line 1226 we have: > > while ((node = base->first)) { > > Which loops through all the clock bases. This only checks the first > timer on the rbtree (uses base-->first). It then calls __run_timer > with the timer at the head of the tree. And __run_hrtimer calls the > timer callback function. In the case of these timers it is > hrtimer_wakeup. And each of these calls wake_up_process(). > > So hmm, perhaps this is it. There is no softirq that calls the wakeup > function. In fact, there doesn't seem to be a bottom half in this > case at all. The decrementer interrupt does all the work, rather than > postpone it to a bottom half. Looking at the call tree: > > timer_interrupt > | > + hrtimer_interrupt > | > + __run_timer > | > + hrtimer_wakeup > | > + wake_up_process > | > + try_to_wake_up > > And the try_to_wake_up is the scheduler (no?). try_to_wake_up() is in the scheduler code (kernel/sched.c), but it is not "the scheduler". If the task is not already running, try_to_wake_up() will put the task on the run queue and set it's state to TASK_RUNNING. If the priority of the newly woken thread was higher than the current thread, then the newly woken thread would preempt current. If a preemption occurred, then TIF_NEED_RESCHED is set. The actual "schedule" will occur on the exit path of the interrupt only if TIF_NEED_RESCHED is set (see the call of preempt_schedule_irq()). > > So, if this is the chain of events, then what is sirq-hrtimer for? I > see in hrtimers_init (lines 1642-1650): > > open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq); > > And run_hrtimer_softirq eventually calls hrtimer_interrupt. But the > prior mechanism seems to be the standard means. Even on my x86 box > (2.6.32-28) it shows hrtimer_interrupt as the event handler for the > clocks. And looking in arch/x86/kernel/time_32.c and > arch/x86/kernel/time_64.c both take the same route. > > So, it seems to me that run_hrtimer_softirq never gets called via any > interrupt mechanism. In fact, it only seems to be called when > creating timers such as in nanosleep. 
The HRTIMER_SOFTIRQ is only > raised in hrtimer_enqueue_reprogram, which is called in > hrtimer_start_range_ns. And none of these have to do with timer > expiration. > > So, it seems the problem really is interrupt overhead. We had > presumed that the timer sirq-hrtimer handled these timer expirations, > and thus the scheduler. Rather, we find that a full reschedule is > being done every interrupt. You should not have a full reschedule when a timer interrupt occurs for a priority 50 process while the priority 99 process is executing (see earlier explanation). But yes, there is a possibility that the problem is interrupt overhead. You could measure it to verify the theory. > > Does my analysis make sense? Yes. I did not double check the actual code that you described, and I haven't been poking around in PPC for a while, but what you describe sounds reasonable. > > Thanks, > Pete > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-15 18:38 ` Frank Rowand @ 2011-02-15 18:40 ` Frank Rowand 0 siblings, 0 replies; 18+ messages in thread From: Frank Rowand @ 2011-02-15 18:40 UTC (permalink / raw) To: frank.rowand; +Cc: Peter LaDow, Frank Rowand, linux-rt-users On 02/15/11 10:38, Frank Rowand wrote: > try_to_wake_up() is in the scheduler code (kernel/sched.c), but it is > not "the scheduler". If the task is not already running, > try_to_wake_up() will put the task on the run queue and set it's state > to TASK_RUNNING. If the priority of the newly woken thread was higher > than the current thread, then the newly woken thread would preempt > current. If a preemption occurred, then TIF_NEED_RESCHED is set. Oops, I slipped into using "thread" instead of "task". Just substitute "task" for each occurrence of "thread" in that paragraph. > The actual "schedule" will occur on the exit path of the interrupt > only if TIF_NEED_RESCHED is set (see the call of preempt_schedule_irq()). ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-15 18:38 ` Frank Rowand 2011-02-15 18:40 ` Frank Rowand @ 2011-02-15 19:12 ` Peter LaDow 2011-02-15 19:35 ` Frank Rowand 1 sibling, 1 reply; 18+ messages in thread From: Peter LaDow @ 2011-02-15 19:12 UTC (permalink / raw) To: frank.rowand; +Cc: Frank Rowand, linux-rt-users I made an error in my last post. My call tree wasn't accurate since I was looking at unpatched code. After applying the RT patch, the call tree changes a bit: timer_interrupt | + hrtimer_interrupt | + raise_softirq_irqoff | + wakeup_softirqd | + wake_up_process | + try_to_wakeup It indeed does offload the timer expirations to the hrtimer softirq. And the only task that try_to_wakeup works on is the softirq handler. So this overhead is even less than I thought. Indeed it is quite light. So it seems that I was on track before. The hrtimer softirq task is running at a priority of 50: # ps | grep irq 10 root 0 SW< [sirq-hrtimer/0] # chrt -p 10 pid 10's current scheduling policy: SCHED_FIFO pid 10's current scheduling priority: 50 And I run my program with 'chrt -f 99'. So it does seem that the hrtimer softirq task should not interfere. So I'm back to the scenarios you described earlier. I suppose if the timers are close in proximity, there would be a flurry of interrupts frequently occurring. Each of these could in fact slow things down. So to prevent this deluge, we tried something. We bumped up the minimum resolution on the decrementer to something closer to 1ms. This means the decrementer would interrupt us no more often than 1ms. We modified arch/powerpc/kernel/time.c to set the min_delta_ns of the decrement to a larger value (large enough to equal about 1ms) rather than the default 2. The jitter disappeared. Now, I know that doing this effectively eliminates their use as "high resolution", but it proves the point that it is the flurry of interrupts causing the problems. So it does seem that it is the interrupt overhead that is the problem. So if we want high resolution, but low overhead, we have to get around the problem of lots of tasks using clock_nanosleep. In our real-world system, we have only 1 high priority task that must run every 500us. More than 99% of the time, it gets to run and completes its work very quickly. However, than <1% of the time, it doesn't run for 1ms to 2ms, breaking our requirements. We have several lower priority tasks running, each using clock_nanosleep or pending on an I/O event. It may be in our system that the relatively large number of timers is occasionally causing a flurry of interrupts increasing the jitter. So how do we get rid of it? I see only 2 ways: 1) stop using clock_nanosleep or 2) stop using high resolution timers. Implementation of both is problematic. Eliminating use of clock_nanosleep would require replacing it with something that didn't resolve to an underlying nanosleep system call, which I think is impossible (except for using sleep, but that only gives us 1sec resolution). And turning off the high resolution timers makes it impossible for us to wake every 500us. Hmmm....I guess this really is a limitation of our platform. We are just up against the wall in terms of burden and processing power. There just isn't enough horsepower to do everything we want at the time we want. 
On Tue, Feb 15, 2011 at 10:38 AM, Frank Rowand <frank.rowand@gmail.com> wrote: > On 02/15/11 08:42, Peter LaDow wrote: >> On Mon, Feb 14, 2011 at 5:58 PM, Frank Rowand <frank.rowand@gmail.com> wrote: >>> Just so we are speaking with a common definition of jitter, your first email >>> said that the duration of the priority 99 thread loop increased by >>> around 350us (average and maximum) when the lower priority task >>> timers were added to the system. >> >> Well, I'm only speaking to the maximum. We do expect some increase in >> the maximum runtime of the loop when those other timers are added. >> However, we did not expect it to occasionally spike by 350us. >> >>>> Sure, we expect the timer interrupt to interfere. But as we >>> >>> So what is the overhead of the timer interrupt? >> >> We are on a PPC platform, and the decrementer interrupt is in >> arch/powerpc/kernel/time.c on lines 541-593. The only line that seems >> that it can have an impact (at least with regard to the timers) is on >> line 576: >> >> evt->event_handler(evt); >> >> Which according to /proc/timer_list is hrtimer_interrupt. This is >> found in kernel/hrtimer.c (lines 1195-1267). And this does indeed >> seem to be where the bulk of the problem lies. On line 1226 we have: >> >> while ((node = base->first)) { >> >> Which loops through all the clock bases. This only checks the first >> timer on the rbtree (uses base-->first). It then calls __run_timer >> with the timer at the head of the tree. And __run_hrtimer calls the >> timer callback function. In the case of these timers it is >> hrtimer_wakeup. And each of these calls wake_up_process(). >> >> So hmm, perhaps this is it. There is no softirq that calls the wakeup >> function. In fact, there doesn't seem to be a bottom half in this >> case at all. The decrementer interrupt does all the work, rather than >> postpone it to a bottom half. Looking at the call tree: >> >> timer_interrupt >> | >> + hrtimer_interrupt >> | >> + __run_timer >> | >> + hrtimer_wakeup >> | >> + wake_up_process >> | >> + try_to_wake_up >> >> And the try_to_wake_up is the scheduler (no?). > > try_to_wake_up() is in the scheduler code (kernel/sched.c), but it is > not "the scheduler". If the task is not already running, > try_to_wake_up() will put the task on the run queue and set it's state > to TASK_RUNNING. If the priority of the newly woken thread was higher > than the current thread, then the newly woken thread would preempt > current. If a preemption occurred, then TIF_NEED_RESCHED is set. > > The actual "schedule" will occur on the exit path of the interrupt > only if TIF_NEED_RESCHED is set (see the call of preempt_schedule_irq()). > >> >> So, if this is the chain of events, then what is sirq-hrtimer for? I >> see in hrtimers_init (lines 1642-1650): >> >> open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq); >> >> And run_hrtimer_softirq eventually calls hrtimer_interrupt. But the >> prior mechanism seems to be the standard means. Even on my x86 box >> (2.6.32-28) it shows hrtimer_interrupt as the event handler for the >> clocks. And looking in arch/x86/kernel/time_32.c and >> arch/x86/kernel/time_64.c both take the same route. >> >> So, it seems to me that run_hrtimer_softirq never gets called via any >> interrupt mechanism. In fact, it only seems to be called when >> creating timers such as in nanosleep. The HRTIMER_SOFTIRQ is only >> raised in hrtimer_enqueue_reprogram, which is called in >> hrtimer_start_range_ns. And none of these have to do with timer >> expiration. 
>> >> So, it seems the problem really is interrupt overhead. We had >> presumed that the timer sirq-hrtimer handled these timer expirations, >> and thus the scheduler. Rather, we find that a full reschedule is >> being done every interrupt. > > You should not have a full reschedule when a timer interrupt occurs > for a priority 50 process while the priority 99 process is executing > (see earlier explanation). > > But yes, there is a possibility that the problem is interrupt > overhead. You could measure it to verify the theory. > >> >> Does my analysis make sense? > > Yes. I did not double check the actual code that you described, > and I haven't been poking around in PPC for a while, but what you > describe sounds reasonable. > >> >> Thanks, >> Pete >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
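One technique worth mentioning for the 500us task: sleeping to an absolute deadline rather than for a relative interval keeps the wake-up error from accumulating, so a single late wake-up does not push every later cycle out as well. A rough sketch (do_work(), the function names, and the priority setup are placeholders, not the actual application code):

#include <time.h>

#define PERIOD_NS 500000        /* 500 us */

static void do_work(void)
{
        /* per-cycle application work */
}

static void periodic_loop(void)
{
        struct timespec next;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
                next.tv_nsec += PERIOD_NS;
                if (next.tv_nsec >= 1000000000L) {
                        next.tv_nsec -= 1000000000L;
                        next.tv_sec++;
                }
                /* sleep until the absolute deadline, then run the cycle */
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                do_work();
        }
}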
* Re: Interrupt Bottom Half Scheduling 2011-02-15 19:12 ` Peter LaDow @ 2011-02-15 19:35 ` Frank Rowand 2011-02-16 20:18 ` Peter LaDow 0 siblings, 1 reply; 18+ messages in thread From: Frank Rowand @ 2011-02-15 19:35 UTC (permalink / raw) To: Peter LaDow; +Cc: frank.rowand, linux-rt-users On Tue, Feb 15, 2011 at 11:12 AM, Peter LaDow <petela@gocougs.wsu.edu> wrote: > I made an error in my last post. My call tree wasn't accurate since I > was looking at unpatched code. After applying the RT patch, the call > tree changes a bit: > > timer_interrupt > | > + hrtimer_interrupt > | > + raise_softirq_irqoff > | > + wakeup_softirqd > | > + wake_up_process > | > + try_to_wakeup > > It indeed does offload the timer expirations to the hrtimer softirq. > And the only task that try_to_wakeup works on is the softirq handler. > So this overhead is even less than I thought. Indeed it is quite > light. > > So it seems that I was on track before. The hrtimer softirq task is > running at a priority of 50: > > # ps | grep irq > 10 root 0 SW< [sirq-hrtimer/0] > # chrt -p 10 > pid 10's current scheduling policy: SCHED_FIFO > pid 10's current scheduling priority: 50 > > And I run my program with 'chrt -f 99'. So it does seem that the > hrtimer softirq task should not interfere. > > So I'm back to the scenarios you described earlier. I suppose if the > timers are close in proximity, there would be a flurry of interrupts > frequently occurring. Each of these could in fact slow things down. > So to prevent this deluge, we tried something. We bumped up the > minimum resolution on the decrementer to something closer to 1ms. > This means the decrementer would interrupt us no more often than 1ms. > We modified arch/powerpc/kernel/time.c to set the min_delta_ns of the > decrement to a larger value (large enough to equal about 1ms) rather > than the default 2. The jitter disappeared. Now, I know that doing > this effectively eliminates their use as "high resolution", but it > proves the point that it is the flurry of interrupts causing the > problems. > > So it does seem that it is the interrupt overhead that is the problem. > So if we want high resolution, but low overhead, we have to get > around the problem of lots of tasks using clock_nanosleep. In our > real-world system, we have only 1 high priority task that must run > every 500us. More than 99% of the time, it gets to run and completes > its work very quickly. However, than <1% of the time, it doesn't run > for 1ms to 2ms, breaking our requirements. We have several lower > priority tasks running, each using clock_nanosleep or pending on an > I/O event. It may be in our system that the relatively large number > of timers is occasionally causing a flurry of interrupts increasing > the jitter. So how do we get rid of it? > > I see only 2 ways: 1) stop using clock_nanosleep or 2) stop using > high resolution timers. Implementation of both is problematic. > Eliminating use of clock_nanosleep would require replacing it with > something that didn't resolve to an underlying nanosleep system call, > which I think is impossible (except for using sleep, but that only > gives us 1sec resolution). And turning off the high resolution timers > makes it impossible for us to wake every 500us. You might be able to use range timers to solve your problem: http://lwn.net/Articles/296578/ > > Hmmm....I guess this really is a limitation of our platform. We are > just up against the wall in terms of burden and processing power. 
> There just isn't enough horsepower to do everything we want at the > time we want. -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
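A related knob on the userspace side of the range-timer work is the per-task timer slack (prctl PR_SET_TIMERSLACK, merged around 2.6.28): giving the non-critical sleepers a generous slack lets the kernel coalesce their hrtimer expiries into fewer interrupts, while the 500us task keeps a tight slack. Whether it helps here depends on the kernel version and on the tasks' scheduling class (realtime tasks default to zero slack), so the sketch below is an assumption to verify, not a known fix.

#include <sys/prctl.h>

#ifndef PR_SET_TIMERSLACK
#define PR_SET_TIMERSLACK 29    /* usual value if the headers predate it */
#endif

/* Called by a low-priority task to allow its sleeps to be deferred and
 * grouped with other expiries; slack_ns = 1000000 permits about 1 ms. */
static int relax_timer_precision(unsigned long slack_ns)
{
        return prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0);
}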
* Re: Interrupt Bottom Half Scheduling 2011-02-15 19:35 ` Frank Rowand @ 2011-02-16 20:18 ` Peter LaDow 0 siblings, 0 replies; 18+ messages in thread From: Peter LaDow @ 2011-02-16 20:18 UTC (permalink / raw) To: Frank Rowand; +Cc: frank.rowand, linux-rt-users For those who are interested, I've gathered some statistics about the high resolution timers. I've already described the test conditions, but here I'll give more information on our platform. MPC8349 (PPC32 @ 528MHz) 512MB DDR2 SDRAM 16GB NAND (x8) Some asked about the jitter distribution, and I've gathered some numbers. I ran this over 10000 samples. Baseline: Min = 3.821ms, Max = 3.951ms, Mean = 3.844ms, Variance = 88.5108 (deviation of about 9.4us) Loaded: Min = 3.833ms, Max = 4.234ms, Mean = 4.099ms, Variance = 1276.35 (deviation of about 35us) We see the mean go up and the variance increase. The large number of timers really does impact the performance of high priority tasks, and the jitter gets much worse. Further, I instrumented the timer_interrupt code to get a feeling for the burden on the interrupt and how often the interrupts happen. I read the timebase register (a 64-bit incrementer that increments once every ~15ns) at the start and end of the interrupt, and determine the maximum. I also measure the time between the end of the timer interrupt and the start of the next timer interrupt. Baseline: max burden = 79.7us, minimum time between interrupts = 364ns, average time between interrupts 421us. Loaded: max burden = 92.7us, minimum time between interrupts = 24ns, average time between interrupts 180us As we see, the interrupt frequency spikes when loaded, and the time between interrupts drops significantly. The interrupt burden rises only slightly, but we are interrupted quite frequently. Now, our modification to increase the minimum decrementer delta did improve the time between interrupts, but significantly increased the timer interrupt burden (from 92.7us to 462us). So, we either get lots of interrupts, averaging 93us per hit, or few interrupts, but 426us of burden. Either way, it averages out to too much. We are looking into ways to get away from nanosleep (or dramatically drop the number of nanosleep calls). To that end, we are considering a timer "server" type application. This application accepts timer requests, schedules the appropriate nanosleep (or select, poll, etc), then responds when the client needs to be awakened. Does anyone know of an existing project that does something similar to this? Perhaps something smaller than dbus, but with the same kind of operation. Thanks, Pete ^ permalink raw reply [flat|nested] 18+ messages in thread
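A bare-bones sketch of the timer "server" idea (purely illustrative; the 1 ms tick, the names, and the condition-variable transport are assumptions, not an existing project): one thread owns the only clock_nanosleep(), so a single hrtimer is armed no matter how many clients want periodic wake-ups. The trade-off is that every client's period is quantized to the server tick.

#include <pthread.h>
#include <time.h>

static pthread_mutex_t tick_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tick_cond = PTHREAD_COND_INITIALIZER;
static unsigned long   tick;            /* incremented once per server period */

static void *timer_server(void *unused)
{
        struct timespec next;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
                next.tv_nsec += 1000000;        /* 1 ms server tick */
                if (next.tv_nsec >= 1000000000L) {
                        next.tv_nsec -= 1000000000L;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

                pthread_mutex_lock(&tick_lock);
                tick++;
                pthread_cond_broadcast(&tick_cond);
                pthread_mutex_unlock(&tick_lock);
        }
        return NULL;
}

/* Clients call this instead of nanosleep(); n is a count of server ticks. */
static void wait_ticks(unsigned long n)
{
        unsigned long target;

        pthread_mutex_lock(&tick_lock);
        target = tick + n;
        while (tick < target)
                pthread_cond_wait(&tick_cond, &tick_lock);
        pthread_mutex_unlock(&tick_lock);
}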
* Re: Interrupt Bottom Half Scheduling 2011-02-15 1:10 ` Peter LaDow 2011-02-15 1:58 ` Frank Rowand @ 2011-02-15 8:40 ` Armin Steinhoff 1 sibling, 0 replies; 18+ messages in thread From: Armin Steinhoff @ 2011-02-15 8:40 UTC (permalink / raw) To: Peter LaDow; +Cc: linux-rt-users Hi, the scheduling is done by the CFS ? --Armin Peter LaDow wrote: > On Mon, Feb 14, 2011 at 3:30 PM, Frank Rowand<frank.rowand@gmail.com> wrote: >> On Mon, Feb 14, 2011 at 2:31 PM, Peter LaDow<petela@gocougs.wsu.edu> wrote: >>> How is the scheduling of the hrtimers softirq thread handled? >>> >>> When querying the RT priority of the hrtimer softirq, I get a priority >>> of 50. But when running a priority 99 thread, we still seem to be >>> getting interrupted. Shouldn't the hrtimer softirq be put off until >>> the CPU is idle or a lower priority task is running? >> Is the hrtimer softirq executing when the priority 99 thread is spinning >> in it's for loop? Your "jitter Due to Large Number of Timers" email >> said that the lower priority tasks don't seem to be interrupting the >> priority 99 thread. > Did I? Hmm, well I mean the lower priority task with 100 threads. At > least I think so. It is hard to tell. > > It seems to me that the softirq thread is the source of the problem. > Since the tight loop is getting such a variety of times (400us of > jitter only while the other process is running) that it does seem that > the loop is getting interrupt. > >> The hardware timer interupts will interrupt the priority 99 thread. The >> cost of these interrupts and the resultant calls to try_to_wake_up() >> of the hrtimer softirq might be quite large considering the rate of >> timer expires you mentioned in your first email. > Sure, we expect the timer interrupt to interfere. But as we > understand it, the softirq is what schedules the task switch. The top > half only schedules the bottom half. But since the bottom half is > priority 50, there shouldn't be any interruption of the priority 99 > expect to handle the low level IRQ. > >> Out of curiosity, is the system UP or SMP? > UP. Just a single MPC5349. > > pete > -- > To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Interrupt Bottom Half Scheduling 2011-02-14 22:31 Interrupt Bottom Half Scheduling Peter LaDow 2011-02-14 23:04 ` Sven-Thorsten Dietrich 2011-02-14 23:30 ` Frank Rowand @ 2011-02-15 8:23 ` Uwe Kleine-König 2 siblings, 0 replies; 18+ messages in thread From: Uwe Kleine-König @ 2011-02-15 8:23 UTC (permalink / raw) To: Peter LaDow; +Cc: linux-rt-users On Mon, Feb 14, 2011 at 02:31:43PM -0800, Peter LaDow wrote: > How is the scheduling of the hrtimers softirq thread handled? > > When querying the RT priority of the hrtimer softirq, I get a priority > of 50. But when running a priority 99 thread, we still seem to be > getting interrupted. Shouldn't the hrtimer softirq be put off until > the CPU is idle or a lower priority task is running? The problem isn't just $(cat /proc/sys/kernel/sched_rt_runtime_us) being smaller than $(cat /proc/sys/kernel/sched_rt_period_us)? Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König | Industrial Linux Solutions | http://www.pengutronix.de/ | -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
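For completeness, the knobs Uwe is referring to are the realtime throttling sysctls. They can be inspected and, with care, changed from a shell; the defaults shown are the usual ones, not values read from this system, and disabling throttling means a runaway SCHED_FIFO task can lock up a UP machine.

cat /proc/sys/kernel/sched_rt_period_us     # typically 1000000 (1 s)
cat /proc/sys/kernel/sched_rt_runtime_us    # typically 950000 (950 ms)
echo -1 > /proc/sys/kernel/sched_rt_runtime_us   # disables RT throttling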