public inbox for linux-nfs@vger.kernel.org
From: Trond Myklebust <trondmy@hammerspace.com>
To: "aglo@umich.edu" <aglo@umich.edu>
Cc: "bfields@fieldses.org" <bfields@fieldses.org>,
	"jiufei.xue@linux.alibaba.com" <jiufei.xue@linux.alibaba.com>,
	"Anna.Schumaker@netapp.com" <Anna.Schumaker@netapp.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"joseph.qi@linux.alibaba.com" <joseph.qi@linux.alibaba.com>
Subject: Re: [bug report] task hang while testing xfstests generic/323
Date: Mon, 11 Mar 2019 15:28:53 +0000
Message-ID: <7ffd7594113b9d7e3105aef61752caa1f01e61e3.camel@hammerspace.com>
In-Reply-To: <CAN-5tyH52dv7zuCkQoUS04rpbqotWn+SM1rJoxqBgsWG2bKLcg@mail.gmail.com>

On Mon, 2019-03-11 at 11:14 -0400, Olga Kornievskaia wrote:
> On Mon, Mar 11, 2019 at 11:12 AM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > On Mon, 2019-03-11 at 14:30 +0000, Trond Myklebust wrote:
> > > Hi Olga,
> > > 
> > > On Sun, 2019-03-10 at 18:20 -0400, Olga Kornievskaia wrote:
> > > > > There are a bunch of cases where multiple operations are using
> > > > > the same seqid and slot.
> > > > > 
> > > > > An example of such weirdness is (nfs.seqid == 0x000002f4) &&
> > > > > (nfs.slotid == 0), and it is the one leading to the hang.
> > > > > 
> > > > > In frame 415870, there is an OPEN using that seqid and slot for
> > > > > the first time (but this slot will be re-used a bunch of times
> > > > > before it gets a reply in frame 415908 with the open stateid
> > > > > seq=40). (Also in this packet there is an example of reuse of
> > > > > slot=1+seqid=0x000128f7 by both TEST_STATEID and OPEN, but let's
> > > > > set that aside.)
> > > > > 
> > > > > In frame 415874 (in the same packet), the client sends 5 opens on
> > > > > the SAME seqid and slot (all have distinct xids). In a way that
> > > > > ends up being alright, since the opens are for the same file and
> > > > > thus are replied to out of the cache, and the reply is ERR_DELAY.
> > > > > But in frame 415876, the client again uses the same seqid and
> > > > > slot, and in this case it's used by 3 opens and a test_stateid.
> > 
> > This should result in exactly 1 bump of the stateid seqid.
> > 
> > > > > The client in all this mess never processes the open stateid
> > > > > seq=40 and keeps on resending CLOSE with seq=37 (also to note,
> > > > > the client "missed" processing seqid=38 and 39 as well; 39
> > > > > probably because it was a reply on the same kind of "reused"
> > > > > slot=1 and seq=0x000128f7. I haven't tracked 38 but I'm assuming
> > > > > it's the same). I don't know how many times, but after 5 mins I
> > > > > see a TEST_STATEID that again uses the same seqid+slot (which
> > > > > gets a reply from the cache matching OPEN). Also open + close
> > > > > (still with seq=37): the open is replied to, but after this the
> > > > > client goes into a soft lockup: the logs have
> > > > > "nfs4_schedule_state_manager: kthread_run: -4" over and over
> > > > > again, then a soft lockup.
> > > > > 
> > > > > Looking back on slot 0: nfs.seqid=0x000002f3 was used in
> > > > > frame=415866 by the TEST_STATEID. This is replied to in frame
> > > > > 415877 (with an ERR_DELAY). But before the client got a reply,
> > > > > the slot and the seq were used by frame 415874. TEST_STATEID is
> > > > > a synchronous and interruptible operation. I'm suspecting that
> > > > > somehow it was interrupted and that's how the slot was able to
> > > > > be re-used by frame 415874. But how the several opens were able
> > > > > to get the same slot, I don't know.
> > > 
> > > > Is this still true with the current linux-next? I would expect
> > > > this patch
> > > > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3453d5708b33efe76f40eca1c0ed60923094b971
> > > > to change the Linux client behaviour in the above regard.
> > > 
> > 
> > > Note also that what you describe would appear to indicate a serious
> > > server bug. If the client is reusing a slot+seqid, then the correct
> > > server behaviour is either to return one of the errors
> > > NFS4ERR_DELAY, NFS4ERR_SEQ_FALSE_RETRY, NFS4ERR_RETRY_UNCACHED_REP,
> > > NFS4ERR_SEQ_MISORDERED, or else to replay the exact same reply
> > > that it had cached for the original request.
> > 
> > > It is a protocol violation for the server to execute new requests
> > > on a slot+seqid combination that has already been used. For that
> > > reason, it is also impossible for a replay to cause further state
> > > changes on the server; a replay means that the server belts out the
> > > exact same reply that it attempted to send earlier, with no changes
> > > (with stateids that match bit for bit what would have been sent
> > > earlier).
> > 
> 
> But they are the same request, because all of them are opens on the
> same file, same everything.

That is irrelevant. The whole point of the session slot mechanism is to
provide reliable exactly-once semantics, as defined in section 2.10.6 of
RFC 5661: https://tools.ietf.org/html/rfc5661#section-2.10.6
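
To make the slot rule concrete, here is a minimal sketch in Python of the
check a server could apply to each request on a slot. It is an illustration
only, not the knfsd code: the names Slot, handle_sequence and execute are
hypothetical, error codes are modelled as plain strings, and
NFS4ERR_RETRY_UNCACHED_REP is left out because the sketch always caches the
reply.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Error names as used in the thread; modelled as strings for illustration.
NFS4ERR_DELAY = "NFS4ERR_DELAY"
NFS4ERR_SEQ_FALSE_RETRY = "NFS4ERR_SEQ_FALSE_RETRY"
NFS4ERR_SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"


@dataclass
class Slot:
    """Hypothetical per-slot reply cache entry (an assumption, not knfsd's)."""
    seqid: int = 0                   # seqid of the last request accepted on this slot
    request: Optional[bytes] = None  # that request, kept to detect false retries
    reply: Optional[str] = None      # cached reply; None while still executing


def handle_sequence(slot: Slot, seqid: int, request: bytes,
                    execute: Callable[[bytes], str]) -> str:
    if seqid == slot.seqid and slot.request is not None:
        # Same slot+seqid as before: this is a retry and is never executed.
        if request != slot.request:
            return NFS4ERR_SEQ_FALSE_RETRY
        if slot.reply is None:
            return NFS4ERR_DELAY     # the original is still being processed
        return slot.reply            # replay the cached reply unchanged
    if seqid == (slot.seqid + 1) % 2**32:
        # Next seqid (32-bit wraparound): the only case that executes anything.
        slot.seqid, slot.request, slot.reply = seqid, request, None
        slot.reply = execute(request)
        return slot.reply
    return NFS4ERR_SEQ_MISORDERED
```

In this sketch, several OPENs arriving with the same slot+seqid reach
execute() at most once, so they can bump the stateid seqid at most once;
every later arrival gets either the cached reply or one of the errors.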

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com


