From: Jeff Layton <jlayton@kernel.org>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Chuck Lever III <chuck.lever@oracle.com>,
Trond Myklebust <trond.myklebust@hammerspace.com>,
Anna Schumaker <anna@kernel.org>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
"yoyang@redhat.com" <yoyang@redhat.com>
Subject: Re: [PATCH 0/7] lockd: fix races that can result in stuck filelocks
Date: Mon, 13 Mar 2023 15:19:52 -0400 [thread overview]
Message-ID: <1538df6baedec8ed465c3902aebebe60d560f859.camel@kernel.org> (raw)
In-Reply-To: <CAOQ4uxhFf=k+7Zm-Go=a+MJs0hYHrD+KrxOXw2mLXMcz4xACMQ@mail.gmail.com>
On Mon, 2023-03-13 at 17:14 +0200, Amir Goldstein wrote:
> On Mon, Mar 13, 2023 at 12:45 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sun, 2023-03-12 at 17:33 +0200, Amir Goldstein wrote:
> > > On Fri, Mar 3, 2023 at 4:54 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
> > > >
> > > >
> > > >
> > > > > On Mar 3, 2023, at 7:15 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > >
> > > > > I sent the first patch in this series the other day, but didn't get any
> > > > > responses.
> > > >
> > > > We'll have to work out who will take which patches in this set.
> > > > Once fully reviewed, I can take the set if the client maintainers
> > > > send Acks for 2-4 and 6-7.
> > > >
> > > > nfsd-next for v6.4 is not yet open. I can work on setting that up
> > > > today.
> > > >
> > > >
> > > > > Since then I've had time to follow up on the client-side part
> > > > > of this problem, which eventually also pointed out yet another bug on
> > > > > the server side. There are also a couple of cleanup patches in here too,
> > > > > and a patch to add some tracepoints that I found useful while diagnosing
> > > > > this.
> > > > >
> > > > > With this set on both client and server, I'm now able to run Yongcheng's
> > > > > test for an hour straight with no stuck locks.
> > >
> > > My nfstest_lock test occasionally gets into an endless wait loop for the lock in
> > > one of the optests.
>
> I forgot to mention that the regression is only with nfsversion=3!
> Is anyone else running nfstest_lock with nfsversion=3?
>
> > >
> > > AFAIK, this started happening after I upgraded my client machine to v5.15.88.
> > > Does this seem related to the client bug fixes in this patch set?
> > >
> > > If so, is this bug a regression? and why are the fixes aimed for v6.4?
> > >
> >
> > Most of this (lockd) code hasn't changed in well over a decade, so if
> > this is a regression then it's a very old one. I suppose it's possible
> > that this regressed after the BKL was removed from this code, but that
> > was a long time ago now and I'm not sure I can identify a commit that
> > this fixes.
> >
> > I'm fine with this going in sooner than v6.4, but given that this has
> > been broken so long, I didn't see the need to rush.
> >
>
> I don't know what the relation is between the optest regression I am
> experiencing and the client and server bugs mentioned in this patch set.
> I just re-tested optest01 with several combinations of client-server kernels.
> I rebooted both client and server before each test.
> The results are a bit odd:
>
> client        server      optest01 result
> ----------------------------------------------------------------
> 5.10.109      5.10.109    completes successfully after <30s
> 5.15.88       5.15.88     never completes (see attached log)
> 5.15.88       5.10.109    never completes
> 5.15.88+ [*]  5.15.88     never completes
> 5.15.88+      5.10.109    never completes
> 5.15.88+      5.15.88+    completes successfully after ~300s [**]
>
> Unless I missed something with the tests, it looks like
> 1.a. There was a regression in the client from 5.10.109..5.15.88
> 1.b. The regression is manifested with both 5.10 and 5.15 servers
> 2.a. The patches improve the situation (from infinite to 30s per wait)...
> 2.b. ...but only when applied to both client and server and...
> 2.c. The situation is still a lot worse than 5.10 client with 5.10 server
>
> Also attached is the NFS[D] Kconfig, which is identical for the tested
> 5.10 and 5.15 kernels.
>
> Do you need me to provide any traces or any other info?
>
> Thanks,
> Amir.
>
> [*] 5.15.88+ stands for 5.15.88 + the patches in this set, which all
> apply cleanly
> [**] The test takes 300s because every single 30s wait takes the entire 30s:
>
> DBG1: 15:21:47.118095 - Unlock file (F_UNLCK, F_SETLK) off=0 len=0
> range(0, 18446744073709551615)
> DBG3: 15:21:47.119832 - Wait up to 30 secs to check if blocked
> lock has been granted @253.87
> DBG3: 15:21:48.121296 - Check if blocked lock has been granted @254.87
> ...
> DBG3: 15:22:14.158314 - Check if blocked lock has been granted @280.90
> DBG3: 15:22:15.017594 - Getting results from blocked lock @281.76
> DBG1: 15:22:15.017832 - Unlock file (F_UNLCK, F_SETLK) off=0 len=0
> range(0, 18446744073709551615) on second process @281.76
> PASS: Locking byte range (72 passed, 0 failed)
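The wait in the log above follows a simple pattern: one process holds a POSIX lock, a second blocks on it, and the test measures how long the blocked lock takes to be granted once the holder unlocks. A minimal local-filesystem sketch of that pattern (plain fcntl locks only, so the NFS/NLM grant machinery is not involved; all names here are illustrative and not part of nfstest):

```python
# Sketch: measure how long a blocked POSIX lock waits to be granted.
# On a healthy local filesystem the grant is essentially immediate once
# the holder unlocks; the bug under discussion is that over NLM the
# waiter can instead stall for a full poll interval (or forever).
import fcntl
import multiprocessing
import os
import tempfile
import time

def blocked_locker(path, granted_after):
    with open(path, "r+b") as f:
        t0 = time.monotonic()
        fcntl.lockf(f, fcntl.LOCK_EX)          # blocks (F_SETLKW) until granted
        granted_after.value = time.monotonic() - t0
        fcntl.lockf(f, fcntl.LOCK_UN)

def main():
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x")
    granted_after = multiprocessing.Value("d", -1.0)
    with open(path, "r+b") as holder:
        fcntl.lockf(holder, fcntl.LOCK_EX)     # first process takes the lock
        p = multiprocessing.Process(target=blocked_locker,
                                    args=(path, granted_after))
        p.start()
        time.sleep(1.0)                        # second process is now blocked
        fcntl.lockf(holder, fcntl.LOCK_UN)     # unlock: grant should be prompt
        p.join(timeout=10)
    os.close(fd)
    os.unlink(path)
    return granted_after.value

if __name__ == "__main__":
    print("blocked lock granted after %.1fs" % main())
```

Against a local filesystem the reported wait is roughly the one-second sleep; a value near 30s (or a never-returning join) would match the stalls in the optest log.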

This sounds like a different problem from the one this patchset fixes. This
patchset is really all about signal handling during the wait for a lock.
That sounds more like the wait is just not completing.

I just kicked off this test in nfstest with vers=3 and I think I see
the same 30s stalls. Coincidentally:

    #define NLMCLNT_POLL_TIMEOUT	(30*HZ)

So it does look like something may be going wrong with the lock granting
mechanism. I'll need to do a bit of investigation to figure out what's
going on.
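To make that suspicion concrete: the NLM client blocks waiting for the server's GRANTED callback, but also re-polls on a timer so a lost callback does not strand it forever. If the grant notification never arrives, every wait costs a full NLMCLNT_POLL_TIMEOUT, which is exactly the ~30s-per-wait shape of the log above. A toy user-space model of that wait (a sketch only, with made-up names and a shortened timeout; the real logic lives in fs/lockd in the kernel):

```python
# Model: wait for a grant callback, falling back to a periodic re-poll.
# If the callback is lost, progress happens only at poll granularity.
import threading

POLL_TIMEOUT = 0.2  # stand-in for NLMCLNT_POLL_TIMEOUT (30*HZ in the kernel)

def wait_for_grant(granted_event, try_lock, max_polls=10):
    """Block until granted: either the callback fires, or a poll succeeds."""
    for attempt in range(1, max_polls + 1):
        if granted_event.wait(timeout=POLL_TIMEOUT):
            return ("callback", attempt)       # woken by the GRANTED callback
        if try_lock():
            return ("poll", attempt)           # re-poll found the lock free
    return ("timeout", max_polls)

# Case 1: callback delivered -> the waiter wakes immediately.
ev = threading.Event()
ev.set()
print(wait_for_grant(ev, lambda: False))       # ('callback', 1)

# Case 2: callback lost -> only the periodic poll makes progress.
state = {"polls": 0}
def try_lock():
    state["polls"] += 1
    return state["polls"] >= 3                 # lock becomes free by the 3rd poll
print(wait_for_grant(threading.Event(), try_lock))   # ('poll', 3)
```

In case 2 the waiter spends the full timeout between each poll, which is the 30s-per-wait symptom: the lock does eventually get granted, but never via the callback path.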
--
Jeff Layton <jlayton@kernel.org>
Thread overview:
2023-03-03 12:15 [PATCH 0/7] lockd: fix races that can result in stuck filelocks Jeff Layton
2023-03-03 12:15 ` [PATCH 1/7] lockd: purge resources held on behalf of nlm clients when shutting down Jeff Layton
2023-03-03 12:15 ` [PATCH 2/7] lockd: remove 2 unused helper functions Jeff Layton
2023-03-03 12:15 ` [PATCH 3/7] lockd: move struct nlm_wait to lockd.h Jeff Layton
2023-03-03 12:16 ` [PATCH 4/7] lockd: fix races in client GRANTED_MSG wait logic Jeff Layton
2023-03-03 12:16 ` [PATCH 5/7] lockd: server should unlock lock if client rejects the grant Jeff Layton
2023-03-03 12:16 ` [PATCH 6/7] nfs: move nfs_fhandle_hash to common include file Jeff Layton
2023-03-03 12:16 ` [PATCH 7/7] lockd: add some client-side tracepoints Jeff Layton
2023-03-03 14:41 ` [PATCH 0/7] lockd: fix races that can result in stuck filelocks Chuck Lever III
2023-03-03 18:11 ` Chuck Lever III
2023-03-12 15:33 ` Amir Goldstein
2023-03-12 16:44 ` Chuck Lever III
2023-03-13 10:45 ` Jeff Layton
2023-03-13 15:14 ` Amir Goldstein
2023-03-13 19:19 ` Jeff Layton [this message]