Linux NFS development
* REGRESSION: NFSv4: mkdir returns EEXIST after NFS4ERR_DELAY-then-success;
@ 2026-04-29  5:02 Igor Raits
  2026-04-29  7:12 ` Igor Raits
  0 siblings, 1 reply; 4+ messages in thread
From: Igor Raits @ 2026-04-29  5:02 UTC (permalink / raw)
  To: NeilBrown, Anna Schumaker, Trond Myklebust, linux-nfs
  Cc: Jaroslav Pulchart, Jan Cipa

Hi all,

I think I've run into an NFSv4 client regression and wanted to report
it before I forget the details. Apologies in advance if I'm
mis-reading the code — please correct me if so.

Symptom: an occasional mkdir(2) on an NFSv4 mount returns -EEXIST,
but the directory it was supposed to create is actually present
afterwards. It's reproducible on both NFSv4.0 and NFSv4.2 against an
in-kernel Linux nfsd. Both client and server are running 6.19.14.

Reproducer (random 16-hex-digit names, so collisions are not the cause):

  N=2000000; base=/var/gdc/export
  for ((i=1; i<=N; i++)); do
      d=$base/$(openssl rand -hex 8)
      mkdir "$d" 2>/dev/null || echo "$(date +%T) failed loop=$i $d"
      rmdir "$d" 2>/dev/null
  done

Failures cluster every ~2-3 minutes, and also reliably trigger on the
first mkdir after a few minutes of mount idleness. Each failed mkdir
takes about 100 ms.
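
For anyone who prefers to see the raw errno instead of a silenced shell error, the same loop can be sketched in C. This is only an illustration: `stress_mkdir()` is a hypothetical helper name, `/dev/urandom` stands in for `openssl rand -hex 8`, and the base directory is whatever NFSv4 mount is under test.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create and remove n randomly named directories
 * under base. Returns the number of failed mkdir(2) calls, or -1 if the
 * random source or the path buffer cannot be used.
 */
long stress_mkdir(const char *base, long n)
{
	FILE *rnd = fopen("/dev/urandom", "rb");
	long failures = 0;

	if (!rnd)
		return -1;

	for (long i = 1; i <= n; i++) {
		unsigned char raw[8];
		char path[4096];
		int len;

		if (fread(raw, 1, sizeof(raw), rnd) != sizeof(raw)) {
			fclose(rnd);
			return -1;
		}
		/* 8 random bytes -> 16 hex digits, as in the shell loop. */
		len = snprintf(path, sizeof(path),
			       "%s/%02x%02x%02x%02x%02x%02x%02x%02x", base,
			       raw[0], raw[1], raw[2], raw[3],
			       raw[4], raw[5], raw[6], raw[7]);
		if (len < 0 || (size_t)len >= sizeof(path)) {
			fclose(rnd);
			return -1;
		}
		if (mkdir(path, 0777) != 0) {
			failures++;
			fprintf(stderr, "failed loop=%ld %s: %s\n",
				i, path, strerror(errno));
		}
		/* Like the shell version, always attempt the cleanup. */
		rmdir(path);
	}
	fclose(rnd);
	return failures;
}
```

Calling `stress_mkdir("/var/gdc/export", 2000000)` would mirror the shell reproducer; on a healthy filesystem the return value is expected to be 0.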

strace shows just one syscall, so userspace isn't retrying:

  $ strace -ttt -e trace=mkdir mkdir "$dir"
  mkdir("/var/gdc/export/954ce422698ef4b1", 0777) = -1 EEXIST (File exists)
  +++ exited with 1 +++

A packet capture for one failure (NFSv4.2; the v4.0 capture has the
same shape):

  client → server  CREATE name=...  → NFS4ERR_DELAY (10008)
  ~100 ms later
  client → server  CREATE name=...  → NFS4_OK            ← dir created
  ~80 µs later
  client → server  CREATE name=...  → NFS4ERR_EXIST (17) ← server is right

Three CREATE RPCs from one mkdir(2). The server looks correct: it
returns DELAY, then OK on the retry, then EXIST when the client asks
again for a name that now exists. The client then surfaces that final
EXIST to userspace even though its own previous retry already
succeeded.

While poking around in fs/nfs/nfs4proc.c I noticed nfs4_proc_mkdir()
looks like this in current master:

  do {
      alias = _nfs4_proc_mkdir(dir, dentry, sattr, label, &err);
      trace_nfs4_mkdir(dir, &dentry->d_name, err);
      if (err)
          alias = ERR_PTR(nfs4_handle_exception(NFS_SERVER(dir),
                                                err, &exception));
  } while (exception.retry);

If I'm reading this right, on a successful retry (err == 0)
nfs4_handle_exception() is skipped, so exception.retry stays at the
value it had after the previous DELAY iteration (which is 1). The
loop then runs once more, sends another CREATE for the same name,
and that one legitimately gets NFS4ERR_EXIST. Other do-while loops
in the same file (e.g. nfs4_proc_symlink) seem to call
nfs4_handle_exception() unconditionally, which would reset
exception.retry to 0 on success and exit the loop.
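
To make the stale-flag theory concrete, here is a small userspace model of the loop. It is only a sketch under stated assumptions: every `sim_*` name is hypothetical, the fake server returns DELAY exactly once, and `sim_handle_exception()` models only the behaviour described above (clear `retry` on entry, set it again on DELAY). It is not the kernel implementation.

```c
#include <stdio.h>

/* Hypothetical stand-ins; the numeric values follow the capture above. */
enum { SIM_OK = 0, SIM_EXIST = 17, SIM_DELAY = 10008 };

struct sim_exception { int retry; };
struct sim_server { int creates; int dir_exists; };

/* Fake server: the first CREATE gets DELAY, later ones behave normally. */
int sim_create(struct sim_server *srv)
{
	srv->creates++;
	if (srv->creates == 1)
		return SIM_DELAY;
	if (srv->dir_exists)
		return SIM_EXIST;
	srv->dir_exists = 1;
	return SIM_OK;
}

/* Assumed handler behaviour: clear retry on entry, set it again on DELAY. */
int sim_handle_exception(int err, struct sim_exception *exc)
{
	exc->retry = 0;
	if (err == SIM_DELAY) {
		exc->retry = 1;
		return 0;
	}
	return err;
}

/* Loop shaped like the quoted one: the handler runs only when err != 0. */
int sim_mkdir_gated(struct sim_server *srv)
{
	struct sim_exception exc = { 0 };
	int err, result;

	do {
		err = sim_create(srv);
		result = 0;
		if (err)
			result = sim_handle_exception(err, &exc);
		printf("CREATE #%d -> %d, retry=%d\n",
		       srv->creates, err, exc.retry);
	} while (exc.retry);
	return result;
}
```

Run against a zero-initialised `struct sim_server`, `sim_mkdir_gated()` issues three simulated CREATEs and returns 17 even though the directory exists after the second one, which matches the shape of the capture.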

git blame points at:

  dd862da61e91 ("nfs: fix incorrect handling of large-number NFS errors
                 in nfs4_do_mkdir()")

(stable backport: 062feb506caf). The change makes sense in itself —
the goal of returning the int separately from the dentry is good — but
the `if (err)` gate around nfs4_handle_exception() seems to be what
introduced the retry-state issue. I might be wrong about that, though,
so please take it with a grain of salt.

Happy to provide pcaps, more traces, or test a patch if useful.
Reproduces on demand here, so iteration should be quick.

Thanks for all the work on NFS,
Igor


* Re: REGRESSION: NFSv4: mkdir returns EEXIST after NFS4ERR_DELAY-then-success;
  2026-04-29  5:02 REGRESSION: NFSv4: mkdir returns EEXIST after NFS4ERR_DELAY-then-success; Igor Raits
@ 2026-04-29  7:12 ` Igor Raits
  2026-04-29  9:58   ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Igor Raits @ 2026-04-29  7:12 UTC (permalink / raw)
  To: NeilBrown, Anna Schumaker, Trond Myklebust, linux-nfs
  Cc: Jaroslav Pulchart, Jan Cipa

FTR, I have applied the following patch and it seems to fix our issue:

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a0885ae55abc..ffd14141ea1d 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5393,10 +5393,9 @@ static struct dentry *nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
        do {
                alias = _nfs4_proc_mkdir(dir, dentry, sattr, label, &err);
                trace_nfs4_mkdir(dir, &dentry->d_name, err);
+               err = nfs4_handle_exception(NFS_SERVER(dir), err, &exception);
                if (err)
-                       alias = ERR_PTR(nfs4_handle_exception(NFS_SERVER(dir),
-                                                             err,
-                                                             &exception));
+                       alias = ERR_PTR(err);
        } while (exception.retry);
        nfs4_label_release_security(label);
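
The effect of the change can be sketched in userspace. This is a simplified model, not kernel code: all `sim_*` names are hypothetical, the fake server returns DELAY a configurable number of times, and the handler only models "clear retry on entry, set it again on DELAY".

```c
/* Hypothetical stand-ins; values follow the NFSv4 status codes in the capture. */
enum { SIM_OK = 0, SIM_EXIST = 17, SIM_DELAY = 10008 };

struct sim_exception { int retry; };
struct sim_server { int delays_left; int creates; int dir_exists; };

/* Fake server: answer DELAY while delays_left > 0, then create normally. */
int sim_create(struct sim_server *srv)
{
	srv->creates++;
	if (srv->delays_left > 0) {
		srv->delays_left--;
		return SIM_DELAY;
	}
	if (srv->dir_exists)
		return SIM_EXIST;
	srv->dir_exists = 1;
	return SIM_OK;
}

/* Assumed handler behaviour: clear retry on entry, set it again on DELAY. */
int sim_handle_exception(int err, struct sim_exception *exc)
{
	exc->retry = 0;
	if (err == SIM_DELAY) {
		exc->retry = 1;
		return 0;
	}
	return err;
}

/* Loop shaped like the patched one: the handler runs on every iteration. */
int sim_mkdir_patched(struct sim_server *srv)
{
	struct sim_exception exc = { 0 };
	int err;

	do {
		err = sim_create(srv);
		/* Runs on success too, so a stale retry flag is cleared. */
		err = sim_handle_exception(err, &exc);
	} while (exc.retry);
	return err;
}
```

With one simulated DELAY the loop stops after the second CREATE and returns 0. If the directory already exists, the EXIST status still propagates to the caller, so genuine errors are not hidden by the unconditional call.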


* Re: REGRESSION: NFSv4: mkdir returns EEXIST after NFS4ERR_DELAY-then-success;
  2026-04-29  7:12 ` Igor Raits
@ 2026-04-29  9:58   ` NeilBrown
  2026-04-29 10:49     ` [PATCH] NFSv4: clear exception state on successful mkdir retry Igor Raits
  0 siblings, 1 reply; 4+ messages in thread
From: NeilBrown @ 2026-04-29  9:58 UTC (permalink / raw)
  To: Igor Raits
  Cc: Anna Schumaker, Trond Myklebust, linux-nfs, Jaroslav Pulchart,
	Jan Cipa

On Wed, 29 Apr 2026, Igor Raits wrote:
> On Wed, Apr 29, 2026 at 7:02 AM Igor Raits <igor@gooddata.com> wrote:
> >
> > While poking around in fs/nfs/nfs4proc.c I noticed nfs4_proc_mkdir()
> > looks like this in current master:
> >
> >   do {
> >       alias = _nfs4_proc_mkdir(dir, dentry, sattr, label, &err);
> >       trace_nfs4_mkdir(dir, &dentry->d_name, err);
> >       if (err)
> >           alias = ERR_PTR(nfs4_handle_exception(NFS_SERVER(dir),
> >                                                 err, &exception));
> >   } while (exception.retry);

Oh dear, that was careless of me.  We *always* need to call
nfs4_handle_exception(). 

> 
> 
> FTR I have applied following patch and it seems to fix our issue:
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index a0885ae55abc..ffd14141ea1d 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5393,10 +5393,9 @@ static struct dentry *nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
>         do {
>                 alias = _nfs4_proc_mkdir(dir, dentry, sattr, label, &err);
>                 trace_nfs4_mkdir(dir, &dentry->d_name, err);
> +               err = nfs4_handle_exception(NFS_SERVER(dir), err, &exception);
>                 if (err)
> -                       alias = ERR_PTR(nfs4_handle_exception(NFS_SERVER(dir),
> -                                                             err,
> -                                                             &exception));
> +                       alias = ERR_PTR(err);
>         } while (exception.retry);
>         nfs4_label_release_security(label);
> 

That is exactly the patch I was thinking of when I saw your first email.
If you would like to create a properly formatted patch for submission,
please add
  Reviewed-by: NeilBrown <neil@brown.name>

If you don't want to, I can do it for you.

Thanks for the report.

NeilBrown



* [PATCH] NFSv4: clear exception state on successful mkdir retry
  2026-04-29  9:58   ` NeilBrown
@ 2026-04-29 10:49     ` Igor Raits
  0 siblings, 0 replies; 4+ messages in thread
From: Igor Raits @ 2026-04-29 10:49 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker
  Cc: NeilBrown, Jan Čípa, linux-nfs, linux-kernel, stable

After a server returns NFS4ERR_DELAY for an NFSv4 CREATE issued by
mkdir(2), the client correctly waits and retries.  When the retry
succeeds, however, mkdir(2) can still surface -EEXIST to userspace
even though the directory was just created on the server.

Reproducer (random 16-hex names so collisions are not the cause)
against an in-kernel Linux nfsd; reproduces under both NFSv4.0 and
NFSv4.2:

  N=2000000; base=/var/gdc/export
  for ((i=1; i<=N; i++)); do
      d=$base/$(openssl rand -hex 8)
      mkdir "$d" 2>/dev/null || echo "$(date +%T) failed loop=$i $d"
      rmdir "$d" 2>/dev/null
  done

Failures cluster at the cadence at which the server-side auth/export
cache refresh path causes nfsd to return NFS4ERR_DELAY for CREATE.

A wire trace of one failure (the three CREATE RPCs all come from a
single mkdir(2), generated by the do-while in nfs4_proc_mkdir()):

  client -> server  CREATE name=...  -> NFS4ERR_DELAY
  ~100 ms later
  client -> server  CREATE name=...  -> NFS4_OK         (dir created)
  ~80 us later
  client -> server  CREATE name=...  -> NFS4ERR_EXIST   (correct)

Since commit dd862da61e91 ("nfs: fix incorrect handling of large-number
NFS errors in nfs4_do_mkdir()"), nfs4_handle_exception() is called only
when _nfs4_proc_mkdir() returned an error.  That gate breaks retry-state
hygiene: nfs4_do_handle_exception() resets exception.{delay,recovering,
retry} to 0 on entry, so calling it on success is what previously
cleared the retry flag set by the preceding NFS4ERR_DELAY iteration.
With the gate in place, exception.retry stays at 1 after the successful
retry, the loop runs once more, and the resulting CREATE for an
already-created name yields NFS4ERR_EXIST -> -EEXIST to userspace.

Drop the conditional and call nfs4_handle_exception() unconditionally,
matching every other do-while in fs/nfs/nfs4proc.c (nfs4_proc_symlink(),
nfs4_proc_link(), etc.).  The dentry/status separation introduced by
that commit is preserved.

Fixes: dd862da61e91 ("nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir()")
Reported-and-tested-by: Jan Čípa <jan.cipa@gooddata.com>
Closes: https://lore.kernel.org/linux-nfs/CA+9S74hSp_tJu2Ffe2BPNC2T25gfkhgjjDkdgSsF5c2rnJq_wA@mail.gmail.com/
Reviewed-by: NeilBrown <neil@brown.name>
Cc: stable@vger.kernel.org
Signed-off-by: Igor Raits <igor.raits@gmail.com>
---
 fs/nfs/nfs4proc.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a0885ae55abc..ffd14141ea1d 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5393,10 +5393,9 @@ static struct dentry *nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
 	do {
 		alias = _nfs4_proc_mkdir(dir, dentry, sattr, label, &err);
 		trace_nfs4_mkdir(dir, &dentry->d_name, err);
+		err = nfs4_handle_exception(NFS_SERVER(dir), err, &exception);
 		if (err)
-			alias = ERR_PTR(nfs4_handle_exception(NFS_SERVER(dir),
-							      err,
-							      &exception));
+			alias = ERR_PTR(err);
 	} while (exception.retry);
 	nfs4_label_release_security(label);
 
-- 
2.53.0


