* RT and Cascade interrupts @ 2005-05-12 14:43 Daniel Walker 2005-05-13 7:44 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: Daniel Walker @ 2005-05-12 14:43 UTC (permalink / raw) To: mingo; +Cc: linux-kernel It seems like the cascade interrupts could run in threads, but i386 doesn't, and I know ARM crashed with cascades in threads. You may have a bit of a slow down, but it seems possible. Does anyone have some reasoning for why we aren't running the cascades in threads? Daniel ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-12 14:43 RT and Cascade interrupts Daniel Walker @ 2005-05-13 7:44 ` Ingo Molnar 2005-05-13 13:12 ` john cooper 0 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2005-05-13 7:44 UTC (permalink / raw) To: Daniel Walker; +Cc: linux-kernel * Daniel Walker <dwalker@mvista.com> wrote: > It seems like the cascade interrupts could run in threads, but > i386 doesn't, and I know ARM crashed with cascades in threads. You may > have a bit of a slow down, but it seems possible. Does anyone have > some reasoning for why we aren't running the cascades in threads? are the x86 cascade interrupts real ones in fact? Normally they should never trigger directly. (except on ARM which has a completely different notion of cascade irq) Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-13 7:44 ` Ingo Molnar @ 2005-05-13 13:12 ` john cooper 2005-05-24 16:32 ` john cooper 0 siblings, 1 reply; 34+ messages in thread From: john cooper @ 2005-05-13 13:12 UTC (permalink / raw) To: Ingo Molnar; +Cc: Daniel Walker, linux-kernel, john cooper Ingo Molnar wrote: > * Daniel Walker <dwalker@mvista.com> wrote: > > >> It seems like the cascade interrupts could run in threads, but >>i386 doesn't, and I know ARM crashed with cascades in threads. You may >>have a bit of a slow down, but it seems possible. Does anyone have >>some reasoning for why we aren't running the cascades in threads? > > > are the x86 cascade interrupts real ones in fact? Normally they should > never trigger directly. (except on ARM which has a completely different > notion of cascade irq) I just caught this thread in the corner of my eye. I'm seeing the BUG assert in kernel/timers.c:cascade() kick in (tmp->base is somehow 0) during a test which creates a few tasks of priority higher than ksoftirqd. This race doesn't happen if ksoftirqd's priority is elevated (eg: chrt -f -p 75 2) so the -RT patch might be opening up a window here. I'm in the process of investigating this and see a few potential suspects but was wondering if anyone else has seen this behavior? -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-13 13:12 ` john cooper @ 2005-05-24 16:32 ` john cooper 2005-05-27 7:25 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: john cooper @ 2005-05-24 16:32 UTC (permalink / raw) To: Ingo Molnar; +Cc: john cooper, Daniel Walker, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1550 bytes --] john cooper wrote: > I'm seeing the BUG assert in kernel/timers.c:cascade() > kick in (tmp->base is somehow 0) during a test which > creates a few tasks of priority higher than ksoftirqd. > This race doesn't happen if ksoftirqd's priority is > elevated (eg: chrt -f -p 75 2) so the -RT patch might > be opening up a window here. There is a window in rpc_run_timer() which allows it to lose track of timer ownership when ksoftirqd (and thus itself) are preempted. This doesn't immediately cause a problem but does corrupt the timer cascade list when the timer struct is recycled/requeued. This shows up some time later as the list is processed. The failure mode is cascade() attempting to percolate a timer with poisoned next/prev *s and a NULL base causing the assertion BUG(tmp->base != base) to kick in. The RPC code is attempting to replicate state of timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER in rpc_task.tk_runstate. Besides not working correctly in the case of preemptable context it is a replication of state of a timer pending in the cascade structure (ie: timer->base). The fix changes the RPC code to use timer->base when deciding whether an outstanding timer registration exists during rpc_task tear down. Note: this failure occurred in the 40-04 version of the patch though it applies to more current versions. It was seen when executing stress tests on a number of PPC targets running on an NFS mounted root though was not observed on a x86 target under similar conditions. 
-john
-- 
john.cooper@timesys.com

[-- Attachment #2: RPC.patch --]
[-- Type: text/plain, Size: 1630 bytes --]

 ./include/linux/sunrpc/sched.h
 ./net/sunrpc/sched.c

=================================================================
--- ./include/linux/sunrpc/sched.h.ORG	2005-05-24 10:29:24.000000000 -0400
+++ ./include/linux/sunrpc/sched.h	2005-05-24 10:47:56.000000000 -0400
@@ -142,7 +142,6 @@ typedef void			(*rpc_action)(struct rpc_
 #define RPC_TASK_RUNNING	0
 #define RPC_TASK_QUEUED		1
 #define RPC_TASK_WAKEUP		2
-#define RPC_TASK_HAS_TIMER	3
 
 #define RPC_IS_RUNNING(t)	(test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
 #define rpc_set_running(t)	(set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
=================================================================
--- ./net/sunrpc/sched.c.ORG	2005-05-24 10:29:52.000000000 -0400
+++ ./net/sunrpc/sched.c	2005-05-24 11:02:44.000000000 -0400
@@ -103,9 +103,6 @@ static void rpc_run_timer(struct rpc_tas
 		dprintk("RPC: %4d running timer\n", task->tk_pid);
 		callback(task);
 	}
-	smp_mb__before_clear_bit();
-	clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
-	smp_mb__after_clear_bit();
 }
 
 /*
@@ -124,7 +121,6 @@ __rpc_add_timer(struct rpc_task *task, r
 		task->tk_timeout_fn = timer;
 	else
 		task->tk_timeout_fn = __rpc_default_timer;
-	set_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
 	mod_timer(&task->tk_timer, jiffies + task->tk_timeout);
 }
 
@@ -135,7 +131,7 @@ __rpc_add_timer(struct rpc_task *task, r
 static inline void
 rpc_delete_timer(struct rpc_task *task)
 {
-	if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
+	if (task->tk_timer.base) {
 		del_singleshot_timer_sync(&task->tk_timer);
 		dprintk("RPC: %4d deleting timer\n", task->tk_pid);
 	}

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-24 16:32 ` john cooper @ 2005-05-27 7:25 ` Ingo Molnar 2005-05-27 13:53 ` john cooper 0 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2005-05-27 7:25 UTC (permalink / raw) To: john cooper; +Cc: Daniel Walker, linux-kernel * john cooper <john.cooper@timesys.com> wrote: > john cooper wrote: > >I'm seeing the BUG assert in kernel/timers.c:cascade() > >kick in (tmp->base is somehow 0) during a test which > >creates a few tasks of priority higher than ksoftirqd. > >This race doesn't happen if ksoftirqd's priority is > >elevated (eg: chrt -f -p 75 2) so the -RT patch might > >be opening up a window here. > > There is a window in rpc_run_timer() which allows > it to lose track of timer ownership when ksoftirqd > (and thus itself) are preempted. This doesn't > immediately cause a problem but does corrupt > the timer cascade list when the timer struct is > recycled/requeued. This shows up some time later > as the list is processed. The failure mode is cascade() > attempting to percolate a timer with poisoned > next/prev *s and a NULL base causing the assertion > BUG(tmp->base != base) to kick in. > > The RPC code is attempting to replicate state of > timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER > in rpc_task.tk_runstate. Besides not working > correctly in the case of preemptable context it is > a replication of state of a timer pending in the > cascade structure (ie: timer->base). The fix > changes the RPC code to use timer->base when > deciding whether an outstanding timer registration > exists during rpc_task tear down. > > Note: this failure occurred in the 40-04 version of > the patch though it applies to more current versions. > It was seen when executing stress tests on a number > of PPC targets running on an NFS mounted root though > was not observed on a x86 target under similar > conditions. should this fix go upstream too? Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-27 7:25 ` Ingo Molnar @ 2005-05-27 13:53 ` john cooper 0 siblings, 0 replies; 34+ messages in thread From: john cooper @ 2005-05-27 13:53 UTC (permalink / raw) To: Ingo Molnar; +Cc: Daniel Walker, linux-kernel, john cooper Ingo Molnar wrote: > * john cooper <john.cooper@timesys.com> wrote: >>The RPC code is attempting to replicate state of >>timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER >>in rpc_task.tk_runstate. Besides not working >>correctly in the case of preemptable context it is >>a replication of state of a timer pending in the >>cascade structure (ie: timer->base). The fix >>changes the RPC code to use timer->base when >>deciding whether an outstanding timer registration >>exists during rpc_task tear down. >> >>Note: this failure occurred in the 40-04 version of >>the patch though it applies to more current versions. >>It was seen when executing stress tests on a number >>of PPC targets running on an NFS mounted root though >>was not observed on a x86 target under similar >>conditions. > > > should this fix go upstream too? Yes. The RPC code is attempting to replicate existing and easily accessible state information in a timer structure. The simplistic means by which it does so fails if ksoftirqd/rpc_run_timer() runs in preemptive context. -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts
@ 2005-05-27 16:47 Oleg Nesterov
2005-05-27 23:37 ` john cooper
0 siblings, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2005-05-27 16:47 UTC (permalink / raw)
To: john cooper; +Cc: linux-kernel, Ingo Molnar
john cooper wrote:
>
> rpc_delete_timer(struct rpc_task *task)
> {
> - if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> + if (task->tk_timer.base) {
> del_singleshot_timer_sync(&task->tk_timer);
> dprintk("RPC: %4d deleting timer\n", task->tk_pid);
> }
I know nothing about rpc, but this looks completely wrong to me.
Please don't use the ->base directly, use timer_pending(). This
field is never NULL in -mm tree.
Next, timer_pending() == 0 does not mean that del_timer_sync() is
not needed, it may be running on the other CPU.
Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER
was invented to indicate when it is safe not to wait for timer
completion. This bit is cleared after the ->tk_timeout_fn call in
the timer's handler. After that rpc_run_timer will not touch *task.
Oleg.
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: RT and Cascade interrupts 2005-05-27 16:47 Oleg Nesterov @ 2005-05-27 23:37 ` john cooper 2005-05-28 8:52 ` Oleg Nesterov 0 siblings, 1 reply; 34+ messages in thread From: john cooper @ 2005-05-27 23:37 UTC (permalink / raw) To: Oleg Nesterov; +Cc: linux-kernel, Ingo Molnar, john cooper [-- Attachment #1: Type: text/plain, Size: 3984 bytes --] Oleg Nesterov wrote: > john cooper wrote: > >> rpc_delete_timer(struct rpc_task *task) >> { >>- if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { >>+ if (task->tk_timer.base) { >> del_singleshot_timer_sync(&task->tk_timer); >> dprintk("RPC: %4d deleting timer\n", task->tk_pid); >> } > > > I know nothing about rpc, but this looks completely wrong to me. > > Please don't use the ->base directly, use timer_pending(). A nit here but fair enough. > This field is never NULL in -mm tree. Which is ok if timer_pending() still provides the needed information. Otherwise another means will be needed to determine whether the struct timer is logically "off" the base to which it was previous queued. I added such state information to the timer structure and associated code in the process of hunting this problem, but removed it in the patch as for the tree to which this applies the state of base already provided the information. > Next, timer_pending() == 0 does not mean that del_timer_sync() is > not needed, it may be running on the other CPU. Whether timer_pending() is SMP safe depends on the caller's context relative to the overall usage of the timer structure. For example we aren't holding base->lock in rpc_delete_timer() as would normally be expected. The reason this is safe is the transition to "timer off base" and "timer.base <- NULL" follows this sequence in __run_timers(). So the worst we can do is to be racing with another cpu trying to expire this timer and we will delete an already inactive timer which is innocuous here. 
The state transition of "timer on base" and "timer.base <- base" can only happen after we free the struct rpc_task in which the struct timer is embedded. We haven't done so yet so no race here exists. > Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER > was invented to indicate when it is safe not to wait for timer > completition. This bit is cleared after ->tk_timeout_fn call in > the timer's handler. After that rpc_run_timer will not touch *task. The problem being it is possible in the now preemptible context in which rpc_run_timer() executes for the call of callback(task) to be preempted allowing another task restart the timer (via a call to __rpc_add_timer()). So the previous implied assumption in a nonpreemptive rpc_run_timer() of the timer known to be inactive (as that's how we arrived here) is no longer reliable. The failure is for the timer to have been requeued during the preemption but RPC_TASK_HAS_TIMER to afterwards be cleared in rpc_run_timer(). This results in the call to rpc_delete_timer() from rpc_release_task() never occurring and the rpc_task struct then released for reuse with the embedded timer struct still linked in the timer cascade list. The next reuse of this rpc_task struct requeues the embedded timer struct again in the cascade list causing the corruption. The fix removes the attempt to replicate timer queue status from the RPC code by removing the usage of the pre-calculated RPC_TASK_HAS_TIMER. This information is only for the benefit of rpc_delete_timer() which is called in the case of rpc_task tear down. So the test is deferred until just before freeing the rpc_task for reuse. The potential race within rpc_release_task() of the timer being requeued between the call to rpc_delete_timer() and the call to rpc_free() is closed by setting tk_timeout = 0 in the rpc_task before the rpc_delete_timer() call. It may be unnecessary but I couldn't conclude a race did not exist so it errs on the side of caution. 
Updated patch attached relative to 40-04. This is a rather complex failure scenario which unwinds over time and requires considerable instrumentation to isolate. I think the addition of configurable debug state information to the timer structure along with assertion checking in the primitives which manipulate them would be quite useful for catching similar problems. -john -- john.cooper@timesys.com [-- Attachment #2: RPC.patch2 --] [-- Type: text/plain, Size: 1734 bytes --] --- ./include/linux/sunrpc/sched.h.ORG 2005-05-24 10:29:24.000000000 -0400 +++ ./include/linux/sunrpc/sched.h 2005-05-24 10:47:56.000000000 -0400 @@ -142,7 +142,6 @@ typedef void (*rpc_action)(struct rpc_ #define RPC_TASK_RUNNING 0 #define RPC_TASK_QUEUED 1 #define RPC_TASK_WAKEUP 2 -#define RPC_TASK_HAS_TIMER 3 #define RPC_IS_RUNNING(t) (test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)) #define rpc_set_running(t) (set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)) ================================================================= --- ./net/sunrpc/sched.c.ORG 2005-05-24 10:29:52.000000000 -0400 +++ ./net/sunrpc/sched.c 2005-05-27 18:27:41.000000000 -0400 @@ -103,9 +103,6 @@ static void rpc_run_timer(struct rpc_tas dprintk("RPC: %4d running timer\n", task->tk_pid); callback(task); } - smp_mb__before_clear_bit(); - clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate); - smp_mb__after_clear_bit(); } /* @@ -124,7 +121,6 @@ __rpc_add_timer(struct rpc_task *task, r task->tk_timeout_fn = timer; else task->tk_timeout_fn = __rpc_default_timer; - set_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate); mod_timer(&task->tk_timer, jiffies + task->tk_timeout); } @@ -135,7 +131,7 @@ __rpc_add_timer(struct rpc_task *task, r static inline void rpc_delete_timer(struct rpc_task *task) { - if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { + if (timer_pending(&task->tk_timer)) { del_singleshot_timer_sync(&task->tk_timer); dprintk("RPC: %4d deleting timer\n", task->tk_pid); } @@ -849,6 +845,7 @@ void 
rpc_release_task(struct rpc_task *t task->tk_active = 0; /* Synchronously delete any running timer */ + task->tk_timeout = 0; rpc_delete_timer(task); /* Release resources */ ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts
  2005-05-27 23:37 ` john cooper
@ 2005-05-28  8:52 ` Oleg Nesterov
  2005-05-28 14:02   ` john cooper
  2005-05-28 17:48   ` john cooper
  1 sibling, 2 replies; 34+ messages in thread
From: Oleg Nesterov @ 2005-05-28  8:52 UTC (permalink / raw)
  To: john cooper; +Cc: linux-kernel, Ingo Molnar, Olaf Kirch

john cooper wrote:
>
> Oleg Nesterov wrote:
> > john cooper wrote:
> >
> >> rpc_delete_timer(struct rpc_task *task)
> >> {
> >>- if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> >>+ if (task->tk_timer.base) {
> >> 	del_singleshot_timer_sync(&task->tk_timer);
> >> 	dprintk("RPC: %4d deleting timer\n", task->tk_pid);
> >> }
> >
> > I know nothing about rpc, but this looks completely wrong to me.
> >
> > Next, timer_pending() == 0 does not mean that del_timer_sync() is
> > not needed, it may be running on the other CPU.
>
> Whether timer_pending() is SMP safe depends on the caller's
> context relative to the overall usage of the timer structure.
>
> For example we aren't holding base->lock in rpc_delete_timer()
> as would normally be expected.  The reason this is safe is
> the transition to "timer off base" and "timer.base <- NULL"
> follows this sequence in __run_timers().  So the worst we
> can do is to be racing with another cpu trying to expire this
> timer and we will delete an already inactive timer which is
> innocuous here.

CPU_0                           CPU_1

rpc_release_task(task)
                                __run_timers:
                                    timer->base = NULL;
    rpc_delete_timer()
        if (timer->base)
            // not taken
            del_timer_sync();

    mempool_free(task);
                                timer->function(task):
                                    // task already freed/reused
                                    __rpc_default_timer(task);

This is totally pointless to do:

	if (timer_pending())
		del_singleshot_timer_sync();

If you need to ensure that timer's handler is not running on any
cpu then timer_pending() can't help.  If you don't need this, you
should use plain del_timer().
> > Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER > > was invented to indicate when it is safe not to wait for timer > > completition. This bit is cleared after ->tk_timeout_fn call in > > the timer's handler. After that rpc_run_timer will not touch *task. > > The problem being it is possible in the now preemptible > context in which rpc_run_timer() executes for the call > of callback(task) to be preempted allowing another task > restart the timer (via a call to __rpc_add_timer()). > So the previous implied assumption in a nonpreemptive > rpc_run_timer() of the timer known to be inactive (as > that's how we arrived here) is no longer reliable. I don't understand why this race is rt specific. I think that PREEMPT_SOFTIRQS just enlarges the window. I believe the right fix is just to call del_singleshot_timer_sync() unconditionally and kill RPC_TASK_HAS_TIMER bit. I have added Olaf Kirch to the CC: list. > The fix removes the attempt to replicate timer queue > status from the RPC code by removing the usage of the > pre-calculated RPC_TASK_HAS_TIMER. No, RPC_TASK_HAS_TIMER doesn't replicate timer queue status. I believe, it was intended as optimization, to avoid costly del_timer_sync() call. Oleg. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-28 8:52 ` Oleg Nesterov @ 2005-05-28 14:02 ` john cooper 2005-05-28 16:34 ` Oleg Nesterov 2005-05-28 17:48 ` john cooper 1 sibling, 1 reply; 34+ messages in thread From: john cooper @ 2005-05-28 14:02 UTC (permalink / raw) To: Oleg Nesterov; +Cc: linux-kernel, Ingo Molnar, Olaf Kirch, john cooper Oleg Nesterov wrote: [race scenario deleted] This was not the race scenario of concern. > This is totally pointless to do: > > if (timer_pending()) > del_singleshot_timer_sync(); However this is the action the RPC code was attempting to take originally based upon stale/incorrect data. Again the failure mode was of the RPC_TASK_HAS_TIMER bit not being set when the embedded timer actually was still present in the timer cascade structure (eg: timer->base != 0 or timer_pending()). This caused the rpc_task to be recycled with the embedded timer struct still linked in the timer cascade. > If you need to ensure that timer's handler is not running on any > cpu then timer_pending() can't help. If you don't need this, you > should use plain del_timer(). That's not the goal of the timer_pending() usage here. Rather we're at a point in rpc_release_task where we want to tear down an rpc_task. The call to timer_pending() determines if the embedded timer is still linked in the timer cascade structure. The embedded timer may be running on another CPU. We may even check timer->base during this race when it is non-NULL and wind up deleting a timer which the path running on another CPU has already deleted. That is allowable. We simply want to idle the timer. > I don't understand why this race is rt specific. I think > that PREEMPT_SOFTIRQS just enlarges the window. That is certainly possible. However this code has been around for some time and this problem apparently hasn't otherwise surfaced. 
I suspect it depends on the combination of timer cascade structures
existing per-CPU and the execution context of callback(task) invoked
in rpc_run_timer() now being preemptive (ie: preemption is required
on the same CPU for this problem to surface).

> I believe
> the right fix is just to call del_singleshot_timer_sync()
> unconditionally and kill RPC_TASK_HAS_TIMER bit.

Ignoring RPC_TASK_HAS_TIMER and always calling
del_singleshot_timer_sync() will work but seems excessive.

> I have added Olaf Kirch to the CC: list.

Welcome.

>>The fix removes the attempt to replicate timer queue
>>status from the RPC code by removing the usage of the
>>pre-calculated RPC_TASK_HAS_TIMER.
>
> No, RPC_TASK_HAS_TIMER doesn't replicate timer queue status.
> I believe, it was intended as optimization, to avoid costly
> del_timer_sync() call.

If true its chosen name seems a misnomer.  In any case the embedded
timer in the rpc_task structure must be quiesced before the rpc_task
is made available for reuse.  The fix under discussion here does so.
Note the goal was not to rearchitect the RPC code but rather to
address this very specific bug which surfaced during stress testing.

If anyone with more ownership of the RPC code would like to comment,
any insight would be most welcome.

-john
-- 
john.cooper@timesys.com

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-28 14:02 ` john cooper @ 2005-05-28 16:34 ` Oleg Nesterov 0 siblings, 0 replies; 34+ messages in thread From: Oleg Nesterov @ 2005-05-28 16:34 UTC (permalink / raw) To: john cooper; +Cc: linux-kernel, Ingo Molnar, Olaf Kirch, Trond Myklebust john cooper wrote: > > Oleg Nesterov wrote: > > > If you need to ensure that timer's handler is not running on any > > cpu then timer_pending() can't help. If you don't need this, you > > should use plain del_timer(). > > That's not the goal of the timer_pending() usage here. > Rather we're at a point in rpc_release_task where we > want to tear down an rpc_task. The call to timer_pending() > determines if the embedded timer is still linked in the > timer cascade structure. Yes, I see what you are trying to fix. However, your fix opens even worse bug. > If anyone with more ownership of the RPC code would like > to comment, any insight would be most welcome. Yes. Trond, could you please look at this thread: http://marc.theaimsgroup.com/?t=111590936700001 and put an end to our discussion? Oleg. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-28 8:52 ` Oleg Nesterov 2005-05-28 14:02 ` john cooper @ 2005-05-28 17:48 ` john cooper 2005-05-28 20:35 ` Trond Myklebust 2005-05-28 22:17 ` Trond Myklebust 1 sibling, 2 replies; 34+ messages in thread From: john cooper @ 2005-05-28 17:48 UTC (permalink / raw) To: Oleg Nesterov Cc: linux-kernel, Ingo Molnar, Olaf Kirch, trond.myklebust, john cooper Oleg Nesterov wrote: > > CPU_0 CPU_1 > > rpc_release_task(task) > __run_timers: > timer->base = NULL; > rpc_delete_timer() > if (timer->base) > // not taken > del_timer_sync(); > > mempool_free(task); > timer->function(task): > // task already freed/reused > __rpc_default_timer(task); Ah ok, I was misreading the above. Now I see your point. The race if hit loses the synchronization provided by del_timer_sync() in my case. Thanks for being persistent. The other possibility to fix the original problem within the scope of the RPC code was to replace the bit state of RPC_TASK_HAS_TIMER with a count so we don't obscure the fact the timer was requeued during the preemption window. This happens as rpc_run_timer() does an unconditional clear_bit(RPC_TASK_HAS_TIMER,..) before returning. Another would be to check timer->base/timer_pending() in rpc_run_timer() and to omit doing a clear_bit() if the timer is found to have been requeued. The former has simpler locking requirements but requires an rpc_task data structure modification. The latter does not and is a bit more intuitive but needs additional locking. Both appear to provide a solution though I haven't yet attempted either. Anyone with more ownership of this code care to comment on a preferred fix? -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-28 17:48 ` john cooper @ 2005-05-28 20:35 ` Trond Myklebust 2005-05-29 3:12 ` john cooper 2005-05-29 11:31 ` Oleg Nesterov 2005-05-28 22:17 ` Trond Myklebust 1 sibling, 2 replies; 34+ messages in thread From: Trond Myklebust @ 2005-05-28 20:35 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch lau den 28.05.2005 Klokka 13:48 (-0400) skreiv john cooper: > The other possibility to fix the original problem within > the scope of the RPC code was to replace the bit state of > RPC_TASK_HAS_TIMER with a count so we don't obscure the > fact the timer was requeued during the preemption window. > This happens as rpc_run_timer() does an unconditional > clear_bit(RPC_TASK_HAS_TIMER,..) before returning. > > Another would be to check timer->base/timer_pending() > in rpc_run_timer() and to omit doing a clear_bit() > if the timer is found to have been requeued. Could you please explain why you think such a scenario is possible? The timer functions themselves should never be causing a re-queue, and every iteration through the loop in __rpc_execute() should cause any pending timer to be killed, as should rpc_release_task(). That's why we can use del_singleshot_timer_sync() in the first place: because there is no recursion, and no re-queueing that can cause races. I don't see how either preemption or RT will change that (and if they do, then _that_ is the real bug that needs fixing). Cheers, Trond ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-28 20:35 ` Trond Myklebust @ 2005-05-29 3:12 ` john cooper 2005-05-29 7:40 ` Trond Myklebust 2005-05-29 11:31 ` Oleg Nesterov 1 sibling, 1 reply; 34+ messages in thread From: john cooper @ 2005-05-29 3:12 UTC (permalink / raw) To: Trond Myklebust Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper Trond Myklebust wrote: > lau den 28.05.2005 Klokka 13:48 (-0400) skreiv john cooper: > Could you please explain why you think such a scenario is possible? The > timer functions themselves should never be causing a re-queue, and every > iteration through the loop in __rpc_execute() should cause any pending > timer to be killed, as should rpc_release_task(). I'm trying to pinpoint the cause of the RPC timer cascade structure corruption. Here is the data I've ascertained thus far: 1. the failure only has been seen for the timer struct embedded in an rpc_task. The callback function is always rpc_run_timer(). 2. the failure mode is an rpc_task's embedded timer struct being queued a second time in the timer cascade without having been dequeued from an existing cascade vector. The failure is not immediate but causes a corrupt entry which is later encountered in timer cascade code. 3. the problem occurs if rpc_run_timer() executes in preemptable context. The problem was not encountered when preemption was explicitly disabled within rpc_run_timer(). 4. the problem appears related to RPC_TASK_HAS_TIMER not being set in rpc_release_task(). Specifically the problem arises when !RPC_TASK_HAS_TIMER but timer->base is non-zero. Modifying rpc_delete_timer() to effect the del_singleshot_timer_sync() call when (RPC_TASK_HAS_TIMER || timer->base) prevents the failure. 5. 
(a detail of #2 above) Instrumenting the number of logical
add/mod/del operations on the timer struct, the corruption occurs at
a point when the tally of the operations satisfies:

	(mod == add == del + 1)

ie: the timer has been added/modified several times, one count
greater than the number of delete operations (active in cascade).
The next operation on this timer struct is an add/mod operation with
the state of the timer having been reinitialized in
rpc_init_task():init_timer(), with no intervening delete having been
performed.

If it wasn't clear in my prior mail, please disregard the earlier
claim of RPC_TASK_HAS_TIMER replicating the state of timer->base.
From Oleg's earlier mail I see that isn't the case, as additional
RPC-specific state is attached to this flag.  The patch as well
should be disregarded.

> That's why we can use del_singleshot_timer_sync() in the first place:
> because there is no recursion, and no re-queueing that can cause races.
> I don't see how either preemption or RT will change that (and if they
> do, then _that_ is the real bug that needs fixing).

During the time I have been hunting this bug I've lost count of the
number of times I've alternately suspected either the kernel timer
code or net/sunrpc/sched.c's usage of the same.  I still feel it is
somehow related to the RPC code but will need to refine the
instrumentation to extract further information from the failure
scenario.

A few questions I'd like to pose:

1. Can you correlate anything about rpc_run_timer() running in
   preemptive context which could explain the above behavior?

2. Do you agree that (!RPC_TASK_HAS_TIMER && timer->base) is an
   inconsistent state at the time of
   rpc_release_task():rpc_delete_timer() ?

-john
-- 
john.cooper@timesys.com

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-29 3:12 ` john cooper @ 2005-05-29 7:40 ` Trond Myklebust 2005-05-30 21:32 ` john cooper 0 siblings, 1 reply; 34+ messages in thread From: Trond Myklebust @ 2005-05-29 7:40 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch [-- Attachment #1: Type: text/plain, Size: 1465 bytes --] lau den 28.05.2005 Klokka 23:12 (-0400) skreiv john cooper: > 1. Can you correlate anything of rpc_run_timer() running in > preemptive context which could explain the above behavior? > > 2. Do you agree that (!RPC_TASK_HAS_TIMER && timer->base) > is an inconsistent state at the time of > rpc_release_task():rpc_delete_timer() ? I have yet to see an example of that happening. The current code should be working according to the following rules: timers are only set if the rpc_task is being put on a wait queue and tk_timeout is non-zero. RPC_TASK_HAS_TIMER is always set immediately before the timer is activated. It is cleared either by the handler _after_ it has run (and after timer->base has been cleared by __rpc_run_timer()), or by __rpc_execute()/rpc_release_task() immediately prior to the call to del_singleshot_timer_sync(). task->tk_callback should never be putting a task to sleep, so there is no timer being set after the call to rpc_delete_timer() and prior to the call to task->tk_action. The above rules should suffice to guarantee strict ordering w.r.t. starting a timer, clearing the same timer, and restarting a new timer, so the code should not need to be reentrant. I've appended a patch that should check for strict compliance of the above rules. Could you try it out and see if it triggers any Oopses? 
Cheers, Trond [-- Attachment #2: linux-2.6.12-test_timer.dif --] [-- Type: text/plain, Size: 1557 bytes --] sched.c | 8 ++++++++ 1 files changed, 8 insertions(+) Index: linux-2.6.12-rc4/net/sunrpc/sched.c =================================================================== --- linux-2.6.12-rc4.orig/net/sunrpc/sched.c +++ linux-2.6.12-rc4/net/sunrpc/sched.c @@ -135,6 +135,8 @@ __rpc_add_timer(struct rpc_task *task, r static void rpc_delete_timer(struct rpc_task *task) { + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) == 0 && + timer_pending(&task->tk_timer)); if (RPC_IS_QUEUED(task)) return; if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { @@ -337,6 +339,8 @@ static void __rpc_sleep_on(struct rpc_wa void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task, rpc_action action, rpc_action timer) { + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || + timer_pending(&task->tk_timer)); /* * Protect the queue operations. */ @@ -594,6 +598,8 @@ static int __rpc_execute(struct rpc_task unlock_kernel(); } + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || + timer_pending(&task->tk_timer)); /* * Perform the next FSM step. * tk_action may be NULL when the task has been killed @@ -925,6 +931,8 @@ fail: void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func) { + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || + timer_pending(&task->tk_timer)); spin_lock_bh(&childq.lock); /* N.B. Is it possible for the child to have already finished? */ __rpc_sleep_on(&childq, task, func, NULL); ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-29 7:40 ` Trond Myklebust @ 2005-05-30 21:32 ` john cooper 2005-05-31 23:09 ` john cooper 2005-06-01 14:22 ` Oleg Nesterov 0 siblings, 2 replies; 34+ messages in thread From: john cooper @ 2005-05-30 21:32 UTC (permalink / raw) To: Trond Myklebust Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper Trond Myklebust wrote: > I've appended a patch that > should check for strict compliance of the above rules. Could you try it > out and see if it triggers any Oopses? Yes, the assert in rpc_delete_timer() occurs just before the cascade list corruption. This is consistent with what I have seen. ie: the timer in a released rpc_task is still active. BTW, the patch from Oleg is relative to 2.6.12 and didn't look to apply to the 2.6.11-derived base with which I'm working (the RPC_IS_QUEUED() test at the head of rpc_delete_timer() does not exist). In any case the relocation of restarted: in __rpc_execute() did not influence the failure. I'd like to move to a 2.6.12-based -RT patch, however I'm dealing with "code in the pipe" and unfortunately don't have that option. Sorry I'm just responding now. We're in the middle of a long holiday weekend. I will have more time come tomorrow to analyze this further. 
-john > ------------------------------------------------------------------------ > > sched.c | 8 ++++++++ > 1 files changed, 8 insertions(+) > > Index: linux-2.6.12-rc4/net/sunrpc/sched.c > =================================================================== > --- linux-2.6.12-rc4.orig/net/sunrpc/sched.c > +++ linux-2.6.12-rc4/net/sunrpc/sched.c > @@ -135,6 +135,8 @@ __rpc_add_timer(struct rpc_task *task, r > static void > rpc_delete_timer(struct rpc_task *task) > { > + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) == 0 && > + timer_pending(&task->tk_timer)); > if (RPC_IS_QUEUED(task)) > return; > if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { > @@ -337,6 +339,8 @@ static void __rpc_sleep_on(struct rpc_wa > void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task, > rpc_action action, rpc_action timer) > { > + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || > + timer_pending(&task->tk_timer)); > /* > * Protect the queue operations. > */ > @@ -594,6 +598,8 @@ static int __rpc_execute(struct rpc_task > unlock_kernel(); > } > > + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || > + timer_pending(&task->tk_timer)); > /* > * Perform the next FSM step. > * tk_action may be NULL when the task has been killed > @@ -925,6 +931,8 @@ fail: > > void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func) > { > + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || > + timer_pending(&task->tk_timer)); > spin_lock_bh(&childq.lock); > /* N.B. Is it possible for the child to have already finished? */ > __rpc_sleep_on(&childq, task, func, NULL); -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-30 21:32 ` john cooper @ 2005-05-31 23:09 ` john cooper 2005-06-01 14:22 ` Oleg Nesterov 1 sibling, 0 replies; 34+ messages in thread From: john cooper @ 2005-05-31 23:09 UTC (permalink / raw) To: Trond Myklebust Cc: john cooper, Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch john cooper wrote: > Trond Myklebust wrote: > >> I've appended a patch that >> should check for strict compliance of the above rules. Could you try it >> out and see if it triggers any Oopses? > > > Yes, the assert in rpc_delete_timer() occurs just before > the cascade list corruption. This is consistent with > what I have seen. ie: the timer in a released rpc_task > is still active. I've captured more data in the instrumentation and found the rpc_task's timer is being requeued by an application task which is preempting ksoftirqd when it wakes up in xprt_transmit(). This is what I had originally suspected but likely didn't communicate it effectively. The scenario unfolds as: [high priority app task] : call_transmit() xprt_transmit() /* blocks in xprt_transmit() */ ksoftirqd() __run_timers() list_del("rpc_task_X.timer") /* logically off cascade */ rpc_run_timer(data) task->tk_timeout_fn(task) /* ksoftirqd preempted */ : --------------------------------------------------------- /* Don't race with disconnect */ if (!xprt_connected(xprt)) task->tk_status = -ENOTCONN; else if (!req->rq_received) rpc_sleep_on(&xprt->pending, task, NULL, xprt_timer); --------------------------------------------------------- __rpc_sleep_on() __mod_timer("rpc_task_X.timer") /* requeued in cascade */ /* blocks */ /* rpc_run_timer resumes from preempt */ clear_bit(RPC_TASK_HAS_TIMER, "rpc_task_X.tk_runstate"); /* rpc_task_X.timer is now enqueued in cascade without RPC_TASK_HAS_TIMER set and will not be dequeued in rpc_release_task()/rpc_delete_timer() */ The usage of "rpc_task_X.timer" indicates the same KVA observed for the timer struct at the associated points in the 
instrumented code. The above was gathered by logging usage of the kernel/timer.c primitives. Thus I don't have more detailed state of the rpc_task in RPC context. However I did verify which of the three calls to rpc_sleep_on() in xprt_transmit() was being invoked (as above). So the root cause appears to be the rpc_task's timer being requeued in xprt_transmit() when rpc_run_timer is preempted. From looking at the code I'm unsure if modifying xprt_transmit()/out_receive is appropriate to synchronize with rpc_release_task(). It seems allowing rpc_sleep_on() to occur is more natural and for rpc_release_task() to detect the pending timer and remove it before proceeding. I'm still in the process of trying to digest the logic here but I thought there was enough information here to be of use. Suggestions, warnings welcome. -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-30 21:32 ` john cooper 2005-05-31 23:09 ` john cooper @ 2005-06-01 14:22 ` Oleg Nesterov 2005-06-01 18:05 ` john cooper 1 sibling, 1 reply; 34+ messages in thread From: Oleg Nesterov @ 2005-06-01 14:22 UTC (permalink / raw) To: john cooper; +Cc: Trond Myklebust, linux-kernel, Ingo Molnar, Olaf Kirch del_timer_sync() is known to be racy. Now I think it has (theoretically) problems with singleshot timers too. Suppose we have an rpc timer which is running on CPU_0, and it is preempted *after* it has cleared the RPC_TASK_HAS_TIMER bit. Then __rpc_execute's main loop on CPU_1 calls rpc_delete_timer(), which returns without doing del_timer_sync(). Now it calls ->tk_action()->rpc_sleep_on()->__rpc_add_timer(), timer pending on CPU_1. Preemption comes, reschedule at another (not CPU_1) processor (this step is not strictly necessary). Next loop iteration, __rpc_execute calls rpc_delete_timer() again, and now it calls del_timer_sync(). del_timer_sync: // __run_timers starts on CPU_1, picks rpc timer, sets // timer->base = 0 ret += del_timer(timer); // return 0, timer is not pending. for_each_online_cpu() { if (timer running on that cpu) { // finds the timer still running on CPU_0, // waits for __run_timers on CPU_0 change // ->running_timer, // the timer on CPU_0 completes. break; } } if (timer_pending()) // NO goto del_again; // The timer still running on CPU_1 return; This all is very unlikely of course, but it would be nice to verify that kernel/timer.c is not the source of the problem. John, if it is easy to reproduce the problem, could you please retest with this patch? Oleg. 
--- 2.6.12-rc5/net/sunrpc/sched.c~ Wed Jun 1 17:49:57 2005 +++ 2.6.12-rc5/net/sunrpc/sched.c Wed Jun 1 18:00:31 2005 @@ -137,8 +137,12 @@ rpc_delete_timer(struct rpc_task *task) { if (RPC_IS_QUEUED(task)) return; - if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { - del_singleshot_timer_sync(&task->tk_timer); + if (test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { + if (del_singleshot_timer_sync(&task->tk_timer)) { + BUG_ON(!test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)); + clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate); + } else + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)); dprintk("RPC: %4d deleting timer\n", task->tk_pid); } } ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 14:22 ` Oleg Nesterov @ 2005-06-01 18:05 ` john cooper 2005-06-01 18:31 ` Trond Myklebust 0 siblings, 1 reply; 34+ messages in thread From: john cooper @ 2005-06-01 18:05 UTC (permalink / raw) To: Oleg Nesterov Cc: Trond Myklebust, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper Oleg Nesterov wrote: > This all is very unlikely of course, but it would be nice to verify > that kernel/timer.c is not the source of the problem. The problem I am seeing is on a single CPU system. > John, if it is easy to reproduce the problem, could you please retest > with this patch? I've done so and the second assert is generated when running the test. So here we have a case of RPC_TASK_HAS_TIMER set but the associated timer->base == NULL. It would seem this could easily be the scenario of executing in: __run_timers() timer->base = NULL; rpc_run_timer() task->tk_timeout_fn(task) /* ksoftirqd preempted */ : /* RPC client */ rpc_execute() rpc_delete_timer() del_timer() returns 0 BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)); : /* rpc_run_timer() resumes */ clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate); I don't see how this would imply a kernel/timer.c problem. It also appears this wouldn't be causing the timer cascade corruption I've seen as the end result is deleting an already dequeued timer which is safe here. BTW, if preemption is explicitly disabled in rpc_run_timer() the BUG_ON assert is not generated nor (as reported earlier) does the cascade corruption occur. I'm still investigating. 
-john > --- 2.6.12-rc5/net/sunrpc/sched.c~ Wed Jun 1 17:49:57 2005 > +++ 2.6.12-rc5/net/sunrpc/sched.c Wed Jun 1 18:00:31 2005 > @@ -137,8 +137,12 @@ rpc_delete_timer(struct rpc_task *task) > { > if (RPC_IS_QUEUED(task)) > return; > - if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { > - del_singleshot_timer_sync(&task->tk_timer); > + if (test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { > + if (del_singleshot_timer_sync(&task->tk_timer)) { > + BUG_ON(!test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)); > + clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate); > + } else > + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)); > dprintk("RPC: %4d deleting timer\n", task->tk_pid); > } > } > -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 18:05 ` john cooper @ 2005-06-01 18:31 ` Trond Myklebust 2005-06-01 19:20 ` john cooper 0 siblings, 1 reply; 34+ messages in thread From: Trond Myklebust @ 2005-06-01 18:31 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch on den 01.06.2005 Klokka 14:05 (-0400) skreiv john cooper: > I've done so and the second assert is generated > when running the test. So here we have a case > of RPC_TASK_HAS_TIMER set but the associated > timer->base == NULL. It would seem this could > easily be the scenario of executing in: > > __run_timers() > timer->base = NULL; > rpc_run_timer() > task->tk_timeout_fn(task) > /* ksoftirqd preempted */ > : > /* RPC client */ > rpc_execute() > rpc_delete_timer() > del_timer() returns 0 > BUG_ON(test_bit(RPC_TASK_HAS_TIMER, > &task->tk_runstate)); > : > /* rpc_run_timer() resumes */ > clear_bit(RPC_TASK_HAS_TIMER, > &task->tk_runstate); > > I don't see how this would imply a kernel/timer.c > problem. It also appears this wouldn't be causing > the timer cascade corruption I've seen as the > end result is deleting an already dequeued timer > which is safe here. If timer->base==NULL, then del_timer() returns 0 (as you correctly state above). What you appear to be missing is that this will trigger a call to del_timer_sync() inside del_singleshot_timer_sync(). In other words, your scenario above should, if the timer code is working according to spec instead lead to __run_timers() timer->base = NULL; rpc_run_timer() task->tk_timeout_fn(task) /* ksoftirqd preempted */ : /* RPC client */ rpc_execute() rpc_delete_timer() del_singleshot_timer_sync() del_timer() returns 0 del_timer_sync() /* wait until timer has completed */ /*rpc_run_timer() resumes */ clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate); .... 
set_running_timer(some_other_value); BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)); : Cheers, Trond ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 18:31 ` Trond Myklebust @ 2005-06-01 19:20 ` john cooper 2005-06-01 19:46 ` Trond Myklebust 2005-06-01 20:21 ` Trond Myklebust 0 siblings, 2 replies; 34+ messages in thread From: john cooper @ 2005-06-01 19:20 UTC (permalink / raw) To: Trond Myklebust Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper Trond Myklebust wrote: > on den 01.06.2005 Klokka 14:05 (-0400) skreiv john cooper: >>I don't see how this would imply a kernel/timer.c >>problem. It also appears this wouldn't be causing >>the timer cascade corruption I've seen as the >>end result is deleting an already dequeued timer >>which is safe here. > > > If timer->base==NULL, then del_timer() returns 0 (as you correctly state > above). What you appear to be missing is that this will trigger a call > to del_timer_sync() inside del_singleshot_timer_sync(). You might have missed it in my earlier mail: this is not an MP kernel, ie: !CONFIG_SMP. The synchronous timer delete primitives don't exist in this configuration: include/linux/timer.h: #ifdef CONFIG_SMP extern int del_timer_sync(struct timer_list *timer); extern int del_singleshot_timer_sync(struct timer_list *timer); #else # define del_timer_sync(t) del_timer(t) # define del_singleshot_timer_sync(t) del_timer(t) #endif BTW, I don't know if you happened on the mail I sent yesterday. It details rpc_run_timer() waking up an application task blocked in call_transmit(). The app task preempts ksoftirqd and eventually does a __rpc_sleep_on()/__mod_timer() which requeues the timer in the cascade. When ksoftirqd/rpc_run_timer() resumes, RPC_TASK_HAS_TIMER is unconditionally cleared; however the timer is queued in the cascade. This appears to be at least one cause of the timer cascade corruption I've seen. -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 19:20 ` john cooper @ 2005-06-01 19:46 ` Trond Myklebust 2005-06-01 20:21 ` Trond Myklebust 1 sibling, 0 replies; 34+ messages in thread From: Trond Myklebust @ 2005-06-01 19:46 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch on den 01.06.2005 Klokka 15:20 (-0400) skreiv john cooper: > You might have missed in my earlier mail as > this is a not an MP kernel ie: !CONFIG_SMP > The synchronous timer delete primitives don't > exist in this configuration: > > include/linux/timer.h: > > #ifdef CONFIG_SMP > extern int del_timer_sync(struct timer_list *timer); > extern int del_singleshot_timer_sync(struct timer_list *timer); > #else > # define del_timer_sync(t) del_timer(t) > # define del_singleshot_timer_sync(t) del_timer(t) > #endif For the RT patched stuff that should read #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_SOFTIRQS) extern int del_timer_sync(struct timer_list *timer); extern int del_singleshot_timer_sync(struct timer_list *timer); #else # define del_timer_sync(t) del_timer(t) # define del_singleshot_timer_sync(t) del_timer(t) #endif > BTW, I don't know if you happened on the mail I sent > yesterday. It details rpc_run_timer() waking up an > application task blocked in call_transmit(). The > app task preempts ksoftirqd and eventually does a > __rpc_sleep_on()/__mod_timer() which requeues the > timer in the cascade. When ksoftirqd/rpc_run_timer() > resumes RPC_TASK_HAS_TIMER is unconditionally cleared > however the timer is queued in the cascade. This > appears to be at least one cause of the timer cascade > corruption I've seen. I saw it. Once again, I don't see how that can happen. __rpc_execute() should be calling rpc_delete_timer() before it calls task->tk_action. There should be no instances of RPC entering call_transmit() or any other tk_action callback with a pending timer. Cheers, Trond ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 19:20 ` john cooper 2005-06-01 19:46 ` Trond Myklebust @ 2005-06-01 20:21 ` Trond Myklebust 2005-06-01 20:59 ` john cooper 1 sibling, 1 reply; 34+ messages in thread From: Trond Myklebust @ 2005-06-01 20:21 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch on den 01.06.2005 Klokka 15:20 (-0400) skreiv john cooper: > You might have missed in my earlier mail as > this is a not an MP kernel ie: !CONFIG_SMP > The synchronous timer delete primitives don't > exist in this configuration: This probably explains your trouble. It makes no sense to allow __run_timer to be preemptible without having the synchronous timer delete primitives. Synchronisation is impossible without them. Which version of the RT patches are you using? The one I'm looking at (2.6.12-rc5-rt-V0.7.47-15) certainly defines both del_timer_sync() and del_singleshot_timer_sync() to be the same as the SMP versions if you are running an RT kernel with preemptible softirqs. Cheers, Trond ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 20:21 ` Trond Myklebust @ 2005-06-01 20:59 ` john cooper 2005-06-01 22:51 ` Trond Myklebust 0 siblings, 1 reply; 34+ messages in thread From: john cooper @ 2005-06-01 20:59 UTC (permalink / raw) To: Trond Myklebust Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper Trond Myklebust wrote: > on den 01.06.2005 Klokka 15:20 (-0400) skreiv john cooper: > >>You might have missed in my earlier mail as >>this is a not an MP kernel ie: !CONFIG_SMP >>The synchronous timer delete primitives don't >>exist in this configuration: > > > This probably explains your trouble. It makes no sense to allow > __run_timer to be preemptible without having the synchronous timer > delete primitives. Synchronisation is impossible without them. The addition of CONFIG_PREEMPT_SOFTIRQS in this context came into place in more recent RT patch versions than the one I'm dealing with. I've just pulled it in but this doesn't appear to alter the nature of the failure. I'm still catching an inconsistency at the very head of rpc_delete_timer() in the case of: BUG_ON(!test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) && timer_pending(&task->tk_timer)); > Which version of the RT patches are you using? The one I'm looking at > (2.6.12-rc5-rt-V0.7.47-15) certainly defines both del_timer_sync() and > del_singleshot_timer_sync() to be the same as the SMP versions if you > are running an RT kernel with preemptible softirqs. Yes, later versions of the patch do. The version at hand, 40-04, is based on 2.6.11. We intend to sync up with a more recent version of the RT patch pending resolution of this issue. I have two potential work-arounds I'm trying to validate. Though I have a bit more tree-shaking to do before I've completed this exercise. -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 20:59 ` john cooper @ 2005-06-01 22:51 ` Trond Myklebust 2005-06-01 23:09 ` Trond Myklebust 2005-06-02 3:31 ` john cooper 0 siblings, 2 replies; 34+ messages in thread From: Trond Myklebust @ 2005-06-01 22:51 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch [-- Attachment #1: Type: text/plain, Size: 701 bytes --] on den 01.06.2005 Klokka 16:59 (-0400) skreiv john cooper: > Yes later versions of the patch do. The version at hand > 40-04 is based on 2.6.11. We intend to sync-up with a > more recent version of the RT patch pending resolution > of this issue. Well it is pointless to concentrate on an obviously buggy patch. Could you please sync up to rc5-rt-V0.7.47-15 at least: that looks like it might be working (or at least be close to working). Could you then apply the following debugging patch? It should warn you in case something happens to corrupt base->running_timer (something which would screw up del_timer_sync()). I'm not sure that can happen, but it might be worth checking. 
Cheers, Trond [-- Attachment #2: linux-2.6.12-rt-warn.dif --] [-- Type: text/plain, Size: 1617 bytes --] timer.c | 14 +++++++++++++- 1 files changed, 13 insertions(+), 1 deletion(-) Index: linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c =================================================================== --- linux-2.6.12-rc5-rt-V0.7.47-15.orig/kernel/timer.c +++ linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c @@ -436,7 +436,7 @@ static int cascade(tvec_base_t *base, tv static inline void __run_timers(tvec_base_t *base) { - struct timer_list *timer; + struct timer_list *timer = NULL; spin_lock_irq(&base->lock); while (time_after_eq(jiffies, base->timer_jiffies)) { @@ -444,6 +444,10 @@ static inline void __run_timers(tvec_bas struct list_head *head = &work_list; int index = base->timer_jiffies & TVR_MASK; + if (unlikely(base->running_timer != timer)) { + printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__); + return; + } if (softirq_need_resched()) { /* running_timer might be stale: */ set_running_timer(base, NULL); @@ -459,6 +463,10 @@ static inline void __run_timers(tvec_bas * valid. Any new jiffy will be taken care of in * subsequent loops: */ + if (unlikely(base->running_timer != timer)) { + printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__); + return; + } } /* * Cascade timers: @@ -498,6 +506,10 @@ repeat: goto repeat; } } + if (unlikely(base->running_timer != timer)) { + printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__); + return; + } set_running_timer(base, NULL); spin_unlock_irq(&base->lock); // if (waitqueue_active(&base->wait_running_timer)) ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 22:51 ` Trond Myklebust @ 2005-06-01 23:09 ` Trond Myklebust 2005-06-02 3:31 ` john cooper 1 sibling, 0 replies; 34+ messages in thread From: Trond Myklebust @ 2005-06-01 23:09 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch [-- Attachment #1: Type: text/plain, Size: 424 bytes --] on den 01.06.2005 Klokka 18:51 (-0400) skreiv Trond Myklebust: > Could you then apply the following debugging patch? It should warn you > in case something happens to corrupt base->running_timer (something > which would screw up del_timer_sync()). I'm not sure that can happen, > but it might be worth checking. Oops. You don't want to return without releasing the spinlock. This patch should be better... Cheers, Trond [-- Attachment #2: linux-2.6.12-rt-warn.dif --] [-- Type: text/plain, Size: 1673 bytes --] timer.c | 15 ++++++++++++++- 1 files changed, 14 insertions(+), 1 deletion(-) Index: linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c =================================================================== --- linux-2.6.12-rc5-rt-V0.7.47-15.orig/kernel/timer.c +++ linux-2.6.12-rc5-rt-V0.7.47-15/kernel/timer.c @@ -436,7 +436,7 @@ static int cascade(tvec_base_t *base, tv static inline void __run_timers(tvec_base_t *base) { - struct timer_list *timer; + struct timer_list *timer = NULL; spin_lock_irq(&base->lock); while (time_after_eq(jiffies, base->timer_jiffies)) { @@ -444,6 +444,10 @@ static inline void __run_timers(tvec_bas struct list_head *head = &work_list; int index = base->timer_jiffies & TVR_MASK; + if (unlikely(base->running_timer != timer)) { + printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__); + goto out; + } if (softirq_need_resched()) { /* running_timer might be stale: */ set_running_timer(base, NULL); @@ -459,6 +463,10 @@ static inline void __run_timers(tvec_bas * valid. 
Any new jiffy will be taken care of in * subsequent loops: */ + if (unlikely(base->running_timer != timer)) { + printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__); + goto out; + } } /* * Cascade timers: @@ -498,7 +506,12 @@ repeat: goto repeat; } } + if (unlikely(base->running_timer != timer)) { + printk("BUG in line %d: __run_timers() is non-reentrant!\n", __LINE__); + goto out; + } set_running_timer(base, NULL); +out: spin_unlock_irq(&base->lock); // if (waitqueue_active(&base->wait_running_timer)) wake_up(&base->wait_for_running_timer); ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-01 22:51 ` Trond Myklebust 2005-06-01 23:09 ` Trond Myklebust @ 2005-06-02 3:31 ` john cooper 2005-06-02 4:26 ` Trond Myklebust 1 sibling, 1 reply; 34+ messages in thread From: john cooper @ 2005-06-02 3:31 UTC (permalink / raw) To: Trond Myklebust Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper Trond Myklebust wrote: > on den 01.06.2005 Klokka 16:59 (-0400) skreiv john cooper: > >>Yes later versions of the patch do. The version at hand >>40-04 is based on 2.6.11. We intend to sync-up with a >>more recent version of the RT patch pending resolution >>of this issue. > > > Well it is pointless to concentrate on an obviously buggy patch. Could > you please sync up to rc5-rt-V0.7.47-15 at least: that looks like it > might be working (or at least be close to working). I fully share your frustration of wanting to "use the latest patch -- dammit". However there are other practical constraints coming into play. This tree has accumulated a substantial amount of fixes for scheduler violation assertions along with associated testing and has fared well thus far. The bug under discussion here is the last major operational problem found in the associated testing process. Arriving at this point also required development of target specific driver/board code so a resync to a later version is not a trivial operation. However it would be justifiable in the case of encountering an impasse with the current tree. > Could you then apply the following debugging patch? It should warn you > in case something happens to corrupt base->running_timer (something > which would screw up del_timer_sync()). I'm not sure that can happen, > but it might be worth checking. Yes, thanks. Though the event trace does not suggest a reentrance in __run_timer() but rather a preemption of it during the call to rpc_run_timer() by a high priority application task in the midst of an RPC. 
The preempting task requeues the timer in the cascade at the tail of xprt_transmit(). rpc_run_timer() upon resuming execution unconditionally clears the RPC_TASK_HAS_TIMER flag. This creates the inconsistent state. No explicit deletion attempt of the timer (synchronous or otherwise) is coming into play in the failure scenario as witnessed by the event trace. Rather it is the implicit dequeue of the timer from the cascade in __run_timer() and attempt to track ownership of it in rpc_run_timer() via RPC_TASK_HAS_TIMER which is undermined in the case of preemption. From earlier mail: > There should be no instances of RPC entering call_transmit() or any > other tk_action callback with a pending timer. My description wasn't clear. The timeout isn't pending before call_transmit(). Rather the RPC appears to be blocked elsewhere and upon wakeup via __run_timer()/xprt_timer() preempts ksoftirqd and does the __rpc_sleep_on()/__mod_timer() at the very tail of xprt_transmit(). I have work-arounds which detect the preemption of rpc_run_timer() and correct the timer state. But I don't suggest they are general solutions or a permanent fix. In any case I will ascertain whether or not the problem here still exists as soon as possible upon resync with a current tree. -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-02 3:31 ` john cooper @ 2005-06-02 4:26 ` Trond Myklebust 2005-06-09 23:17 ` George Anzinger 0 siblings, 1 reply; 34+ messages in thread From: Trond Myklebust @ 2005-06-02 4:26 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch on den 01.06.2005 Klokka 23:31 (-0400) skreiv john cooper: > I fully share your frustration of wanting to "use the > latest patch -- dammit". However there are other practical > constraints coming into play. This tree has accumulated a > substantial amount of fixes for scheduler violation assertions > along with associated testing and has faired well thus far. > The bug under discussion here is the last major operational > problem found in the associated testing process. Arriving > at this point also required development of target specific > driver/board code so a resync to a later version is not a > trivial operation. However it would be justifiable in the > case of encountering at an impasse with the current tree. My point is that you are considering timer bugs due to synchronization problems in code which is obviously not designed to accommodate synchronization. Once that fact is established, one moves on and considers the code which does support synchronization. > > Could you then apply the following debugging patch? It should warn you > > in case something happens to corrupt base->running_timer (something > > which would screw up del_timer_sync()). I'm not sure that can happen, > > but it might be worth checking. > > Yes, thanks. Though the event trace does not suggest a > reentrance in __run_timer() but rather a preemption of it > during the call to rpc_run_timer() by a high priority > application task in the midst of an RPC. The preempting > task requeues the timer in the cascade at the tail of > xprt_transmit(). rpc_run_timer() upon resuming execution > unconditionally clears the RPC_TASK_HAS_TIMER flag. This > creates the inconsistent state. 
There are NO cases where that is supposed to be allowed to occur. This case is precisely what del_timer_sync() is supposed to treat. > No explicit deletion attempt of the timer (synchronous or > otherwise) is coming into play in the failure scenario as > witnessed by the event trace. Rather it is the implicit > dequeue of the timer from the cascade in __run_timer() and > attempt to track ownership of it in rpc_run_timer() via > RPC_TASK_HAS_TIMER which is undermined in the case of > preemption. No!!! The responsibility for tracking timers that have been dequeued and that are currently running inside __run_timer() lies fairly and squarely with del_timer_sync(). There is NOTHING within the RT patches that implies that the existing callers of del_timer_sync() should be burdened with having to do additional tracking of pending timers. To do so would be a major change of the existing API, and would require a lot of justification. IOW: nobody but you is claiming that the RPC code is trying to deal with this case by tracking RPC_TASK_HAS_TIMER. That is not its purpose, nor should it be in the RT case. > From earlier mail: > > > There should be no instances of RPC entering call_transmit() or any > > other tk_action callback with a pending timer. > > My description wasn't clear. The timeout isn't pending > before call_transmit(). Rather the RPC appears to be > blocked elsewhere and upon wakeup via __run_timer()/xprt_timer() > preempts ksoftirqd and does the __rpc_sleep_on()/__mod_timer() > at the very tail of xprt_transmit(). No!!! How is this supposed to happen? There is only one thread that is allowed to call rpc_sleep_on(), and that is the exact same thread that is calling __rpc_execute(). It may call rpc_sleep_on() only from inside a task->tk_action() call, and therefore only _after_ it has called rpc_delete_timer(). There is supposed to be strict ordering here! Trond ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-02 4:26 ` Trond Myklebust @ 2005-06-09 23:17 ` George Anzinger 2005-06-09 23:52 ` john cooper 0 siblings, 1 reply; 34+ messages in thread From: George Anzinger @ 2005-06-09 23:17 UTC (permalink / raw) To: Trond Myklebust Cc: john cooper, Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch Excuse me for interrupting this thread, but have you seen: http://marc.theaimsgroup.com/?l=linux-kernel&m=111717961227508&w=2 I think this will fix your problem. George -- Trond Myklebust wrote: > On Wed, 01.06.2005 at 23:31 (-0400), john cooper wrote: > >>I fully share your frustration of wanting to "use the >>latest patch -- dammit". However there are other practical >>constraints coming into play. This tree has accumulated a >>substantial amount of fixes for scheduler violation assertions >>along with associated testing and has fared well thus far. >>The bug under discussion here is the last major operational >>problem found in the associated testing process. Arriving >>at this point also required development of target specific >>driver/board code so a resync to a later version is not a >>trivial operation. However it would be justifiable in the >>case of encountering an impasse with the current tree. > > > My point is that you are considering timer bugs due to synchronization > problems in code which is obviously not designed to accommodate > synchronization. Once that fact is established, one moves on and > considers the code which does support synchronization. > > >>>Could you then apply the following debugging patch? It should warn you >>>in case something happens to corrupt base->running_timer (something >>>which would screw up del_timer_sync()). I'm not sure that can happen, >>>but it might be worth checking. >> >>Yes, thanks. Though the event trace does not suggest a >>reentrance in __run_timer() but rather a preemption of it >>during the call to rpc_run_timer() by a high priority >>application task in the midst of an RPC.
The preempting >>task requeues the timer in the cascade at the tail of >>xprt_transmit(). rpc_run_timer() upon resuming execution >>unconditionally clears the RPC_TASK_HAS_TIMER flag. This >>creates the inconsistent state. > > > There are NO cases where that is supposed to be allowed to occur. This > case is precisely what del_timer_sync() is supposed to treat. > > >>No explicit deletion attempt of the timer (synchronous or >>otherwise) is coming into play in the failure scenario as >>witnessed by the event trace. Rather it is the implicit >>dequeue of the timer from the cascade in __run_timer() and >>attempt to track ownership of it in rpc_run_timer() via >>RPC_TASK_HAS_TIMER which is undermined in the case of >>preemption. > > > No!!! The responsibility for tracking timers that have been dequeued and > that are currently running inside __run_timer() lies fairly and squarely > with del_timer_sync(). > There is NOTHING within the RT patches that implies that the existing > callers of del_timer_sync() should be burdened with having to do > additional tracking of pending timers. To do so would be a major change > of the existing API, and would require a lot of justification. > > IOW: nobody but you is claiming that the RPC code is trying to deal with > this case by tracking RPC_TASK_HAS_TIMER. That is not its purpose, nor > should it be in the RT case. > > >> From earlier mail: >> >> > There should be no instances of RPC entering call_transmit() or any >> > other tk_action callback with a pending timer. >> >>My description wasn't clear. The timeout isn't pending >>before call_transmit(). Rather the RPC appears to be >>blocked elsewhere and upon wakeup via __run_timer()/xprt_timer() >>preempts ksoftirqd and does the __rpc_sleep_on()/__mod_timer() >>at the very tail of xprt_transmit(). > > > No!!! How is this supposed to happen? There is only one thread that is > allowed to call rpc_sleep_on(), and that is the exact same thread that > is calling __rpc_execute(). 
It may call rpc_sleep_on() only from inside > a task->tk_action() call, and therefore only _after_ it has called > rpc_delete_timer(). There is supposed to be strict ordering here! > > Trond -- George Anzinger george@mvista.com HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-06-09 23:17 ` George Anzinger @ 2005-06-09 23:52 ` john cooper 0 siblings, 0 replies; 34+ messages in thread From: john cooper @ 2005-06-09 23:52 UTC (permalink / raw) To: george Cc: Trond Myklebust, Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch, john cooper George Anzinger wrote: > Excuse me for interrupting this thread, but have you seen: > > http://marc.theaimsgroup.com/?l=linux-kernel&m=111717961227508&w=2 > > I think this will fix your problem. That was touched on earlier in the thread. It did not fix the problem I was chasing in 40-04. I'll revisit this issue once I've moved to a more current version of the patch should it still exist. -john -- john.cooper@timesys.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-28 20:35 ` Trond Myklebust 2005-05-29 3:12 ` john cooper @ 2005-05-29 11:31 ` Oleg Nesterov 2005-05-29 13:58 ` Trond Myklebust 1 sibling, 1 reply; 34+ messages in thread From: Oleg Nesterov @ 2005-05-29 11:31 UTC (permalink / raw) To: Trond Myklebust; +Cc: john cooper, linux-kernel, Ingo Molnar, Olaf Kirch __rpc_execute() calls ->tk_exit and goes to 'restarted' label. What if ->tk_exit calls rpc_sleep_on()? If it is possible, we have a race. CPU_0 (main loop in rpc_execute restarted) CPU_1 rpc_delete_timer: if (RPC_IS_QUEUED()) // YES return; rpc_run_timer: rpc_wake_up_task: clear_bit(RPC_TASK_QUEUED) preemption, or long interrupt if (!RPC_IS_QUEUED()) // YES task->tk_action() __rpc_add_timer: set_bit(RPC_TASK_HAS_TIMER) mod_timer(); clear_bit(RPC_TASK_HAS_TIMER) Now we have a pending timer without the RPC_TASK_HAS_TIMER bit set. Does this patch make any sense? Oleg. --- 2.6.12-rc5/net/sunrpc/sched.c~ Sun May 29 15:09:25 2005 +++ 2.6.12-rc5/net/sunrpc/sched.c Sun May 29 15:11:24 2005 @@ -135,8 +135,6 @@ __rpc_add_timer(struct rpc_task *task, r static void rpc_delete_timer(struct rpc_task *task) { - if (RPC_IS_QUEUED(task)) - return; if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { del_singleshot_timer_sync(&task->tk_timer); dprintk("RPC: %4d deleting timer\n", task->tk_pid); @@ -566,7 +564,6 @@ static int __rpc_execute(struct rpc_task BUG_ON(RPC_IS_QUEUED(task)); - restarted: while (1) { /* * Garbage collection of pending timers... */ @@ -607,6 +604,7 @@ static int __rpc_execute(struct rpc_task unlock_kernel(); } + restarted: /* * Lockless check for whether task is sleeping or not. */ ^ permalink raw reply [flat|nested] 34+ messages in thread
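Oleg's interleaving can be replayed deterministically in user space. The sketch below uses hypothetical `*_toy` names, collapses the two CPUs into an explicit call order, and models only the three bits of state involved; it is not the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>

/* The three bools mirror RPC_TASK_QUEUED, RPC_TASK_HAS_TIMER and
 * timer_pending(&task->tk_timer). */
struct toy_task {
    bool queued;
    bool has_timer;
    bool pending;
};

/* CPU_0: rpc_delete_timer() with the early return that the patch removes. */
static void rpc_delete_timer_toy(struct toy_task *t)
{
    if (t->queued)
        return;             /* races with the wakeup on CPU_1 */
    if (t->has_timer) {
        t->has_timer = false;
        t->pending = false; /* del_singleshot_timer_sync() */
    }
}

/* CPU_1: rpc_wake_up_task() clears RPC_TASK_QUEUED... */
static void cpu1_wake_up_task(struct toy_task *t)
{
    t->queued = false;
}

/* CPU_0: ...so tk_action() runs and re-arms via __rpc_add_timer(). */
static void cpu0_tk_action(struct toy_task *t)
{
    t->has_timer = true;
    t->pending = true;      /* mod_timer() */
}

/* CPU_1: rpc_run_timer() resumes and unconditionally clears the flag. */
static void cpu1_run_timer_finish(struct toy_task *t)
{
    t->has_timer = false;
}
```

Replaying the four steps in the order of the diagram, starting from a queued task whose expired timer has already been dequeued, ends with `pending && !has_timer`: a live timer that the next rpc_delete_timer() will skip. With the early return removed, the first step instead performs the synchronous delete up front.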
* Re: RT and Cascade interrupts 2005-05-29 11:31 ` Oleg Nesterov @ 2005-05-29 13:58 ` Trond Myklebust 2005-05-30 14:50 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: Trond Myklebust @ 2005-05-29 13:58 UTC (permalink / raw) To: Oleg Nesterov; +Cc: john cooper, linux-kernel, Ingo Molnar, Olaf Kirch On Sun, 29.05.2005 at 15:31 (+0400), Oleg Nesterov wrote: > __rpc_execute() calls ->tk_exit and goes to 'restarted' label. > What if ->tk_exit calls rpc_sleep_on() ? If it is possible, we > have a race. > > CPU_0 (main loop in rpc_execute restarted) CPU_1 > > rpc_delete_timer: > if (RPC_IS_QUEUED()) // YES > return; > rpc_run_timer: > rpc_wake_up_task: > clear_bit(RPC_TASK_QUEUED) > preemption, or long interrupt > > if (!RPC_IS_QUEUED()) // YES > task->tk_action() > __rpc_add_timer: > set_bit(RPC_TASK_HAS_TIMER) > mod_timer(); > clear_bit(RPC_TASK_HAS_TIMER) > > Now we have pending timer without RPC_TASK_HAS_TIMER bit set. It is possible, but it should be _extremely_ rare. The only cases where tk_exit() sets a timer are the cases where we call rpc_delay(). In the existing cases in the kernel, that would be setting timers of 5 seconds (for the case of JUKEBOX errors) or longer. These tk_exit()+restart events are very rare in themselves (JUKEBOX errors only occur if the server needs to do something with a long latency - such as retrieving a file from an HSM unit), but for that to coincide with a preemption or interrupt of > 5 seconds inside __rpc_execute... I'd hate to see those RPC performance figures. > Does this patch make any sense? Yes. I agree the scenario is theoretically possible (so I can queue that patch up for you). I am not convinced it is a plausible explanation for what John claims to be seeing, though. Cheers, Trond ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-29 13:58 ` Trond Myklebust @ 2005-05-30 14:50 ` Ingo Molnar 0 siblings, 0 replies; 34+ messages in thread From: Ingo Molnar @ 2005-05-30 14:50 UTC (permalink / raw) To: Trond Myklebust; +Cc: Oleg Nesterov, john cooper, linux-kernel, Olaf Kirch * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > Does this patch make any sense? > > Yes. I agree the scenario is theoretically possible (so I can queue > that patch up for you). I am not convinced it is a plausible > explanation for what John claims to be seeing, though. i've added this patch (and your debug asserts, except for the rpc_delete_timer() one) to the -RT tree and i've removed the earlier hack - perhaps John can re-run the test and see whether it still occurs under -rc5-RT-V0.7.47-13 or later? most races are much more likely to occur under PREEMPT_RT than under other preemption models, but maybe there's something else going on as well. Ingo --- linux/net/sunrpc/sched.c.orig +++ linux/net/sunrpc/sched.c @@ -135,8 +135,6 @@ __rpc_add_timer(struct rpc_task *task, r static void rpc_delete_timer(struct rpc_task *task) { - if (RPC_IS_QUEUED(task)) - return; if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) { del_singleshot_timer_sync(&task->tk_timer); dprintk("RPC: %4d deleting timer\n", task->tk_pid); @@ -337,6 +335,8 @@ static void __rpc_sleep_on(struct rpc_wa void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task, rpc_action action, rpc_action timer) { + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || + timer_pending(&task->tk_timer)); /* * Protect the queue operations. */ @@ -566,7 +566,6 @@ static int __rpc_execute(struct rpc_task BUG_ON(RPC_IS_QUEUED(task)); - restarted: while (1) { /* * Garbage collection of pending timers... 
@@ -594,6 +593,8 @@ static int __rpc_execute(struct rpc_task unlock_kernel(); } + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || + timer_pending(&task->tk_timer)); /* * Perform the next FSM step. * tk_action may be NULL when the task has been killed @@ -607,6 +608,7 @@ static int __rpc_execute(struct rpc_task unlock_kernel(); } + restarted: /* * Lockless check for whether task is sleeping or not. */ @@ -925,6 +927,8 @@ fail: void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func) { + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 || + timer_pending(&task->tk_timer)); spin_lock_bh(&childq.lock); /* N.B. Is it possible for the child to have already finished? */ __rpc_sleep_on(&childq, task, func, NULL); ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RT and Cascade interrupts 2005-05-28 17:48 ` john cooper 2005-05-28 20:35 ` Trond Myklebust @ 2005-05-28 22:17 ` Trond Myklebust 1 sibling, 0 replies; 34+ messages in thread From: Trond Myklebust @ 2005-05-28 22:17 UTC (permalink / raw) To: john cooper; +Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Olaf Kirch On Sat, 28.05.2005 at 13:48 (-0400), john cooper wrote: > The other possibility to fix the original problem within > the scope of the RPC code was to replace the bit state of > RPC_TASK_HAS_TIMER with a count so we don't obscure the > fact the timer was requeued during the preemption window. > This happens as rpc_run_timer() does an unconditional > clear_bit(RPC_TASK_HAS_TIMER,..) before returning. Just to clarify a bit more here. RPC_TASK_HAS_TIMER is definitely _not_ redundantly replicating timer state, as I saw you asserting previously. There are two distinct cases where timer->base may be NULL. In the first case, the timer is inactive and was never queued so it is entirely pointless to be calling del_timer() & friends. In the second case, the timer was active, but has expired and was picked up by __run_timers(). del_singleshot_timer_sync() fails to distinguish these two cases, and will therefore end up redundantly calling del_timer_sync() every time a task loops through __rpc_execute() without setting a timer. Your patch to remove RPC_TASK_HAS_TIMER will not only re-introduce that redundant behaviour but will also re-open precisely the race that del_timer_sync() is designed to fix. So, I repeat: I want to see a case description that clearly demonstrates a race scenario in the existing code before I'm willing to consider any patches at all. Cheers, Trond ^ permalink raw reply [flat|nested] 34+ messages in thread
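The two timer->base cases Trond distinguishes can be made concrete with a small sketch (hypothetical `*_toy` names; a user-space model of the distinction, not the kernel's struct timer_list). In both cases base ends up NULL, which is why the timer fields alone cannot tell a caller whether a sync delete is needed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* base == NULL both for a timer that was never queued and for one that
 * __run_timers() has already dequeued; 'was_queued' plays the role that
 * RPC_TASK_HAS_TIMER plays for the RPC code. */
struct toy_timer {
    void *base;         /* NULL: not on any timer base */
    bool  was_queued;
};

static void init_timer_toy(struct toy_timer *t)
{
    t->base = NULL;     /* case 1: inactive, never queued */
    t->was_queued = false;
}

static void mod_timer_toy(struct toy_timer *t, void *base)
{
    t->base = base;     /* queued on a base */
    t->was_queued = true;
}

static void run_timers_toy(struct toy_timer *t)
{
    t->base = NULL;     /* case 2: expired and dequeued; the handler
                         * may still be running somewhere */
}
```

With only base to look at, a caller must pessimistically do the del_timer_sync() round trip on every pass through its loop; the extra bit is what lets the RPC code skip that in case 1.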
end of thread, other threads:[~2005-06-09 23:55 UTC | newest] Thread overview: 34+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-05-12 14:43 RT and Cascade interrupts Daniel Walker 2005-05-13 7:44 ` Ingo Molnar 2005-05-13 13:12 ` john cooper 2005-05-24 16:32 ` john cooper 2005-05-27 7:25 ` Ingo Molnar 2005-05-27 13:53 ` john cooper -- strict thread matches above, loose matches on Subject: below -- 2005-05-27 16:47 Oleg Nesterov 2005-05-27 23:37 ` john cooper 2005-05-28 8:52 ` Oleg Nesterov 2005-05-28 14:02 ` john cooper 2005-05-28 16:34 ` Oleg Nesterov 2005-05-28 17:48 ` john cooper 2005-05-28 20:35 ` Trond Myklebust 2005-05-29 3:12 ` john cooper 2005-05-29 7:40 ` Trond Myklebust 2005-05-30 21:32 ` john cooper 2005-05-31 23:09 ` john cooper 2005-06-01 14:22 ` Oleg Nesterov 2005-06-01 18:05 ` john cooper 2005-06-01 18:31 ` Trond Myklebust 2005-06-01 19:20 ` john cooper 2005-06-01 19:46 ` Trond Myklebust 2005-06-01 20:21 ` Trond Myklebust 2005-06-01 20:59 ` john cooper 2005-06-01 22:51 ` Trond Myklebust 2005-06-01 23:09 ` Trond Myklebust 2005-06-02 3:31 ` john cooper 2005-06-02 4:26 ` Trond Myklebust 2005-06-09 23:17 ` George Anzinger 2005-06-09 23:52 ` john cooper 2005-05-29 11:31 ` Oleg Nesterov 2005-05-29 13:58 ` Trond Myklebust 2005-05-30 14:50 ` Ingo Molnar 2005-05-28 22:17 ` Trond Myklebust
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox