From: john cooper <john.cooper@timesys.com>
To: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Oleg Nesterov <oleg@tv-sign.ru>,
linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
Olaf Kirch <okir@suse.de>, john cooper <john.cooper@timesys.com>
Subject: Re: RT and Cascade interrupts
Date: Sat, 28 May 2005 23:12:47 -0400
Message-ID: <4299332F.6090900@timesys.com>
In-Reply-To: <1117312557.10746.6.camel@lade.trondhjem.org>
Trond Myklebust wrote:
> On Sat, 28.05.2005 at 13:48 (-0400), john cooper wrote:
> Could you please explain why you think such a scenario is possible? The
> timer functions themselves should never be causing a re-queue, and every
> iteration through the loop in __rpc_execute() should cause any pending
> timer to be killed, as should rpc_release_task().
I'm trying to pinpoint the cause of the RPC timer cascade
structure corruption. Here is the data I've ascertained thus far:
1. the failure has only been seen for the timer struct embedded
in an rpc_task. The callback function is always rpc_run_timer().
2. the failure mode is an rpc_task's embedded timer struct being
queued a second time in the timer cascade without having
been dequeued from an existing cascade vector. The failure
is not immediate but causes a corrupt entry which is later
encountered in timer cascade code.
3. the problem occurs if rpc_run_timer() executes in preemptable
context. The problem was not encountered when preemption was
explicitly disabled within rpc_run_timer().
4. the problem appears related to RPC_TASK_HAS_TIMER not being set
in rpc_release_task(). Specifically the problem arises when
!RPC_TASK_HAS_TIMER but timer->base is non-zero. Modifying
rpc_delete_timer() to effect the del_singleshot_timer_sync()
call when (RPC_TASK_HAS_TIMER || timer->base) prevents the
failure.
5. (a detail of #2 above) Instrumenting the number of logical
add/mod/del operations on the timer struct shows that the
corruption occurs at a point when the tally of the operations is
(mod == add == del + 1), i.e. the timer has been added/modified
several times, one count greater than the number of delete
operations (active in cascade). The next operation on this
timer struct is an add/mod operation with the state of the
timer having been reinitialized in rpc_init_task():init_timer()
with no intervening delete having been performed.
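The sequence in items 2 and 5 can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every `mock_*` name is hypothetical, the "cascade vector" is reduced to one circular doubly linked list, and a non-NULL `base` is assumed to be what marks a timer as queued. It is not the kernel timer code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel timer: list links plus a base pointer. */
struct mock_timer {
    struct mock_timer *next, *prev;
    void *base;                     /* non-NULL means "queued" (assumption) */
};

/* Tallies kept outside the timer so re-initialization cannot wipe them. */
struct mock_tally {
    unsigned long adds, dels;
};

/* An empty vector is a head node that points at itself. */
static void mock_vec_init(struct mock_timer *head)
{
    head->next = head->prev = head;
    head->base = NULL;
}

/* Models init_timer(): clears the timer without touching the vector. */
static void mock_init_timer(struct mock_timer *t)
{
    t->next = t->prev = NULL;
    t->base = NULL;
}

/* Appends the timer at the tail of the vector and marks it queued. */
static void mock_add_timer(struct mock_timer *head, struct mock_timer *t,
                           struct mock_tally *c)
{
    struct mock_timer *tail = head->prev;

    head->prev = t;
    t->next = head;
    t->prev = tail;
    tail->next = t;
    t->base = head;
    c->adds++;
}

/* Unlinks the timer only if it believes it is queued. */
static bool mock_del_timer(struct mock_timer *t, struct mock_tally *c)
{
    if (t->base == NULL)
        return false;
    t->prev->next = t->next;
    t->next->prev = t->prev;
    mock_init_timer(t);
    c->dels++;
    return true;
}

/* Walks the vector and checks that every forward link has a matching back link. */
static bool mock_vec_consistent(const struct mock_timer *head, size_t max_nodes)
{
    const struct mock_timer *n = head;

    for (size_t i = 0; i <= max_nodes; i++) {
        if (n->next == NULL || n->next->prev != n)
            return false;
        n = n->next;
        if (n == head)
            return true;
    }
    return false;   /* the list no longer closes on the head */
}

/* Replays add, optional delete, re-init, add; reports vector consistency. */
static bool mock_replay(bool delete_before_reinit, struct mock_tally *c)
{
    struct mock_timer head, t;

    mock_vec_init(&head);
    mock_init_timer(&t);
    mock_add_timer(&head, &t, c);       /* adds == dels + 1: queued */
    if (delete_before_reinit)
        mock_del_timer(&t, c);
    mock_init_timer(&t);                /* re-init, as in a reused task */
    mock_add_timer(&head, &t, c);
    return mock_vec_consistent(&head, 4);
}
```

Calling `mock_replay(false, &tally)` re-initializes the timer while it is still linked, so the second add leaves the head pointing at a node whose back link no longer points at the head; `mock_replay(true, &tally)` deletes first and the vector stays consistent.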
If it wasn't clear in my prior mail, please disregard the earlier
claim of RPC_TASK_HAS_TIMER replicating the state of timer->base.
From Oleg's earlier mail I see that isn't the case as additional
RPC-specific state is attached to this flag. The patch as well
should be disregarded.
> That's why we can use del_singleshot_timer_sync() in the first place:
> because there is no recursion, and no re-queueing that can cause races.
> I don't see how either preemption or RT will change that (and if they
> do, then _that_ is the real bug that needs fixing).
During the time I have been hunting this bug I've lost
count of the number of times I've alternately suspected
either the kernel timer code or net/sunrpc/sched.c usage
of the same. I still feel it is somehow related to the RPC
code but will need to refine the instrumentation to extract
further information from the failure scenario.
A few questions I'd like to pose:
1. Can you think of anything about rpc_run_timer() running in
preemptible context which could explain the above behavior?
2. Do you agree that (!RPC_TASK_HAS_TIMER && timer->base)
is an inconsistent state at the time of
rpc_release_task():rpc_delete_timer() ?
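To make the state in question 2 and the experiment from item 4 concrete, here is a reduced user-space sketch. The struct, the flag bit, and all `mock_*` functions are hypothetical stand-ins, not the definitions in net/sunrpc/sched.c; the synchronous delete is modeled as simply clearing the base pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for RPC_TASK_HAS_TIMER. */
#define MOCK_HAS_TIMER (1UL << 0)

/* Hypothetical, reduced view of a task: flag word plus the timer's base. */
struct mock_task {
    unsigned long runstate;
    void *timer_base;           /* non-NULL while the timer is queued */
};

/* The state asked about: flag clear, yet the timer still looks queued. */
static bool mock_task_inconsistent(const struct mock_task *task)
{
    return !(task->runstate & MOCK_HAS_TIMER) && task->timer_base != NULL;
}

/* Stand-in for the synchronous delete; here it only clears the base. */
static void mock_sync_delete(struct mock_task *task)
{
    task->timer_base = NULL;
}

/* Delete keyed on the flag alone. */
static void mock_delete_timer_flag_only(struct mock_task *task)
{
    if (task->runstate & MOCK_HAS_TIMER) {
        task->runstate &= ~MOCK_HAS_TIMER;
        mock_sync_delete(task);
    }
}

/* Experimental variant: also delete when the base is still set. */
static void mock_delete_timer_flag_or_base(struct mock_task *task)
{
    if ((task->runstate & MOCK_HAS_TIMER) || task->timer_base != NULL) {
        task->runstate &= ~MOCK_HAS_TIMER;
        mock_sync_delete(task);
    }
}
```

In this model a task with the flag clear and a non-NULL base is left queued by the flag-only delete, while the flag-or-base variant dequeues it. That matches the observation that the variant prevents the failure, but it says nothing about why the flag and the base disagree in the first place.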
-john
--
john.cooper@timesys.com