From: john cooper <john.cooper@timesys.com>
To: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Oleg Nesterov <oleg@tv-sign.ru>,
linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
Olaf Kirch <okir@suse.de>, john cooper <john.cooper@timesys.com>
Subject: Re: RT and Cascade interrupts
Date: Sat, 28 May 2005 23:12:47 -0400
Message-ID: <4299332F.6090900@timesys.com>
In-Reply-To: <1117312557.10746.6.camel@lade.trondhjem.org>
Trond Myklebust wrote:
> On Sat, 28.05.2005 at 13:48 (-0400), john cooper wrote:
> Could you please explain why you think such a scenario is possible? The
> timer functions themselves should never be causing a re-queue, and every
> iteration through the loop in __rpc_execute() should cause any pending
> timer to be killed, as should rpc_release_task().
I'm trying to pinpoint the cause of the RPC timer cascade
structure corruption. Here is the data I've ascertained thus far:
1. the failure has only been seen for the timer struct embedded
in an rpc_task. The callback function is always rpc_run_timer().
2. the failure mode is an rpc_task's embedded timer struct being
queued a second time in the timer cascade without having
been dequeued from an existing cascade vector. The failure
is not immediate but causes a corrupt entry which is later
encountered in timer cascade code.
3. the problem occurs if rpc_run_timer() executes in preemptible
context. The problem was not encountered when preemption was
explicitly disabled within rpc_run_timer().
4. the problem appears related to RPC_TASK_HAS_TIMER not being set
in rpc_release_task(). Specifically the problem arises when
!RPC_TASK_HAS_TIMER but timer->base is non-zero. Modifying
rpc_delete_timer() to effect the del_singleshot_timer_sync()
call when (RPC_TASK_HAS_TIMER || timer->base) prevents the
failure.
5. (a detail of #2 above) Instrumenting the number of logical
add/mod/del operations on the timer struct shows that the
corruption occurs at a point where the tallies satisfy
(mod == add == del + 1), i.e. the timer has been added/modified
several times, once more than it has been deleted, so it is
still active in the cascade. The next operation on this
timer struct is an add/mod operation, with the state of the
timer having been reinitialized in rpc_init_task():init_timer()
with no intervening delete having been performed.
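The sequence in items 2 and 5 can be replayed with a minimal user-space sketch. This is not kernel code: struct node, struct model_timer and every function below are hypothetical stand-ins, and the only assumption carried over from the kernel is that a cascade vector is a circular doubly linked list and that re-initializing a timer forgets its queued state without unlinking it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
struct node { struct node *next, *prev; };

struct model_timer {
    struct node entry;    /* link in a cascade vector */
    int pending;          /* stands in for timer->base being non-zero */
    unsigned adds, dels;  /* instrumentation tallies, kept across re-init */
};

static void vec_init(struct node *head) { head->next = head->prev = head; }

/* Like a timer re-init: forgets the queued state but does NOT unlink. */
static void model_init_timer(struct model_timer *t)
{
    t->entry.next = t->entry.prev = NULL;
    t->pending = 0;
}

/* Queue at the tail of a vector. */
static void model_add_timer(struct model_timer *t, struct node *head)
{
    struct node *prev = head->prev;

    assert(!t->pending);  /* the timer itself believes it is idle */
    head->prev = &t->entry;
    t->entry.next = head;
    t->entry.prev = prev;
    prev->next = &t->entry;
    t->pending = 1;
    t->adds++;
}

/* Unlink from whatever vector the timer is queued on, if any. */
static void model_del_timer(struct model_timer *t)
{
    if (!t->pending)
        return;
    t->entry.prev->next = t->entry.next;
    t->entry.next->prev = t->entry.prev;
    t->entry.next = t->entry.prev = NULL;
    t->pending = 0;
    t->dels++;
}

/* Returns 1 if a bounded walk gets back to head with every back link intact. */
static int vec_is_consistent(const struct node *head, int max_nodes)
{
    const struct node *n = head;

    for (int i = 0; i <= max_nodes; i++) {
        if (n->next == NULL || n->next->prev != n)
            return 0;
        n = n->next;
        if (n == head)
            return 1;
    }
    return 0;
}

/* Replays item 5: add, re-init with no intervening delete, add again. */
static int replay_reinit_without_delete(void)
{
    struct node vec;
    struct model_timer t = {0};

    vec_init(&vec);
    model_init_timer(&t);
    model_add_timer(&t, &vec);  /* adds == dels + 1: active in the vector */
    model_init_timer(&t);       /* state wiped, entry still linked */
    model_add_timer(&t, &vec);  /* queued a second time */
    return vec_is_consistent(&vec, 4);
}
```

In this model the second add leaves the entry pointing at itself, so the damage is only noticed when the vector is next walked, which matches the delayed failure described in item 2. An add/del/add sequence through model_del_timer() leaves the vector consistent.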
If it wasn't clear in my prior mail, please disregard the earlier
claim that RPC_TASK_HAS_TIMER replicates the state of timer->base.
From Oleg's earlier mail I see that isn't the case, as additional
RPC-specific state is attached to this flag. The patch should be
disregarded as well.
> That's why we can use del_singleshot_timer_sync() in the first place:
> because there is no recursion, and no re-queueing that can cause races.
> I don't see how either preemption or RT will change that (and if they
> do, then _that_ is the real bug that needs fixing).
During the time I have been hunting this bug I've lost
count of the number of times I've alternately suspected
either the kernel timer code or the use of it in
net/sunrpc/sched.c. I still feel it is somehow related to the RPC
code but will need to refine the instrumentation to extract
further information from the failure scenario.
A few questions I'd like to pose:
1. Can you think of anything about rpc_run_timer() running in
preemptible context which could explain the above behavior?
2. Do you agree that (!RPC_TASK_HAS_TIMER && timer->base)
is an inconsistent state at the time of
rpc_release_task():rpc_delete_timer() ?
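To make question 2 concrete, here is a small user-space sketch of the two gating conditions from item 4. The struct, enum and function names are hypothetical; the sketch assumes only what is stated above, namely that the unmodified rpc_delete_timer() consults the flag alone and that the diagnostic variant also looks at timer->base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two pieces of state discussed above. */
struct model_task {
    bool has_timer;    /* stands in for the RPC_TASK_HAS_TIMER bit */
    void *timer_base;  /* stands in for timer->base; non-NULL while queued */
};

enum delete_action { SKIP_DELETE, SYNC_DELETE };

/* Gate that consults only the flag. */
static enum delete_action gate_flag_only(const struct model_task *t)
{
    assert(t != NULL);
    return t->has_timer ? SYNC_DELETE : SKIP_DELETE;
}

/* Diagnostic gate from item 4: also delete when the timer is still queued. */
static enum delete_action gate_flag_or_base(const struct model_task *t)
{
    assert(t != NULL);
    return (t->has_timer || t->timer_base != NULL) ? SYNC_DELETE : SKIP_DELETE;
}

/* The state asked about in question 2: flag clear, timer still queued. */
static bool state_is_suspect(const struct model_task *t)
{
    assert(t != NULL);
    return !t->has_timer && t->timer_base != NULL;
}
```

The two gates differ only in the suspect state: there the flag-only gate skips the delete and leaves the timer queued, while the diagnostic gate removes it. The sketch says nothing about how that state is reached, which is the open question.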
-john
--
john.cooper@timesys.com