Btrfs, NFS (v3) and ESTALE

* Btrfs, NFS (v3) and ESTALE
@ 2010-09-23 11:02 David Flynn
  2010-09-23 12:28 ` Daniel J Blueman
  2010-09-23 14:03 ` David Nicol
  0 siblings, 2 replies; 4+ messages in thread
From: David Flynn @ 2010-09-23 11:02 UTC
  To: linux-btrfs; +Cc: davidf

Dear all;

On a cluster of ~35 machines used for batch processing, which all mount
via NFS (v3) a BTRFS export, I am experiencing issues that are causing
NFS clients to occasionally produce Stale NFS handle errors on accessing
this file system.  I would be interested to know if this is possibly
related to use of BTRFS, or is mere coincidence.

Background:
  - The NFS server is running 2.6.33, with a btrfs file system created
    under the same kernel.

  - The file system is mounted as:
    /dev/md2 /work btrfs rw,noatime,nodiratime 0 0

  - The file system is exported as:
    /work           <world>(rw,wdelay,root_squash,no_subtree_check)

  - Clients are mostly 2.6.35; however, problems have also been
    seen with 2.6.32.

  - Clients mount (from /proc/mounts)
    vc-fs0:/work /work nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.29.146.16,mountvers=3,mountport=51102,mountproto=udp,addr=172.29.146.16 0 0
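With ~35 clients, it may help to confirm that every node really mounts the export with the same options. The following is only a minimal sketch: the function name `nfs_mount_option` is hypothetical, and it assumes a POSIX shell and awk reading a file in /proc/mounts format.

```shell
#!/bin/sh
# Hypothetical helper: for every NFS entry in a /proc/mounts-style file,
# print the mount point followed by the value of one mount option.
# Usage: nfs_mount_option OPTION [FILE]   (FILE defaults to /proc/mounts)
nfs_mount_option() {
    opt=$1
    mounts=${2:-/proc/mounts}
    awk -v opt="$opt" '
        # Field 3 is the filesystem type, field 4 the option list.
        $3 == "nfs" || $3 == "nfs4" {
            n = split($4, o, ",")
            val = "(not set)"
            for (i = 1; i <= n; i++)
                # Match "opt=" only at the start of an option, so that
                # "vers" does not also match "mountvers".
                if (index(o[i], opt "=") == 1)
                    val = substr(o[i], length(opt) + 2)
            print $2, val
        }
    ' "$mounts"
}
```

Running `nfs_mount_option vers` on each client and comparing the output would show whether any node has negotiated a different protocol version.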

The problem manifests itself when issuing a job to the cluster, of ~120
tasks on 30 nodes.  We will occasionally find that a machine reports
NFS stale filehandle errors when trying to stat a directory.  The
directory will not have been deleted during the lifetime of the job;
however, some (e.g. 30) sub-directories will have been created.

The errors are usually seen from a machine that has not done any work.

For example:

(2.6.35:)
vcfe0:~$ ls -l /work >/dev/null
--launch job (doesn't do anything on vcfe, uses different nodes)--
... time passes (unknown how long) ...

vcfe0:~$ ls -l /work >/dev/null
ls: cannot access /work/marta-cip-test: Stale NFS file handle
ls: cannot access /work/andrea-test-ais: Stale NFS file handle

(2.6.35:)
vc-r210-0:~$ ls -l /work >/dev/null
vc-r210-0:~$

(2.6.32:)
b36048:~$ ls -l /work/ >/dev/null
ls: cannot access /work/marta-cip-test: Stale NFS file handle
ls: cannot access /work/andrea-test-ais: Stale NFS file handle

Two separate machines are seeing the same stale file handles.  b36048
hadn't even touched /work for some considerable time before doing that
ls.

Performing `touch /work/andrea-test-ais' on the client allows the
client machine to stat the directory again; doing the same on the
file server does not.

Performing `echo 2 > /proc/sys/vm/drop_caches' on the client will
sometimes, but not always, solve the problem for that client.
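To see which entries a given client currently considers stale, the manual `ls` check could be scripted roughly as below. This is a sketch under assumptions, not a tested diagnostic: the function name `list_stale_entries` is hypothetical, and it relies on `stat` printing the C-locale error text "Stale NFS file handle" (or "Stale file handle" on newer systems).

```shell
#!/bin/sh
# Hypothetical helper: print every entry directly under DIR whose stat
# fails with a stale-file-handle error.
list_stale_entries() {
    dir=$1
    for entry in "$dir"/*; do
        # LC_ALL=C keeps the error text in English so it can be matched.
        # Only stderr is captured; normal stat output is discarded.
        err=$(LC_ALL=C stat -- "$entry" 2>&1 >/dev/null)
        case $err in
            *"Stale "*"file handle"*) printf '%s\n' "$entry" ;;
        esac
    done
}
```

For example, `list_stale_entries /work` on an affected client would be expected to list the directories that `ls -l` complains about. The function only reports; the `touch` and drop_caches workarounds would still have to be applied by hand.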

I've not yet found a reliable way to reproduce the problem, other than
running large jobs (we aren't running small ones at the moment, so I can't
say whether it is related to size).

I would be interested to know if anyone believes this may be related to
the use of btrfs (or even to a configuration or NFS cache coherency problem).

Some extra anecdotal evidence:
  I don't recall this being an issue before we upgraded all the compute
  nodes to 2.6.35.  Previously they used 2.6.33, but an upgrade was
  forced due to an nfs bug under high write loads.  However, it may be
  that the nature of the jobs that we are running now has changed
  slightly too.

Kind regards,

..david

* Re: Btrfs, NFS (v3) and ESTALE
  2010-09-23 11:02 Btrfs, NFS (v3) and ESTALE David Flynn
@ 2010-09-23 12:28 ` Daniel J Blueman
  2010-11-04 22:40   ` David Flynn
  2010-09-23 14:03 ` David Nicol
  1 sibling, 1 reply; 4+ messages in thread
From: Daniel J Blueman @ 2010-09-23 12:28 UTC
  To: David Flynn; +Cc: linux-btrfs, Trond Myklebust

Hi David,

I was experiencing a similar pattern of ESTALE issues with NFS with
2.6.33 (IIRC) and cached data on ext4, and could reproduce it from
time to time performing kernel rebuilds over NFS.

I've CC'd Trond on the full email to see if it rings a bell. The best
outcome may be if we write a micro-reproducer which exploits this race
using cached data.
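As a starting point for such a reproducer, a harness might mimic the reported pattern: sub-directories created from one node while an otherwise idle client periodically stats the parent. The sketch below is not known to trigger the bug; the function names `create_task_dirs` and `poll_stat` are hypothetical, as are the `task-N` directory names.

```shell
#!/bin/sh
# Hypothetical load generator: mimic the job's side effect of creating
# COUNT sub-directories under PARENT, one per simulated task.
create_task_dirs() {
    parent=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        mkdir -p "$parent/task-$i" || return 1
        i=$((i + 1))
    done
}

# Hypothetical observer: stat TARGET every INTERVAL seconds, up to TRIES
# times, and report the first failure together with a UTC timestamp.
poll_stat() {
    target=$1
    interval=$2
    tries=$3
    n=0
    while [ "$n" -lt "$tries" ]; do
        # Capture only stderr; a non-zero exit means stat failed.
        if ! err=$(LC_ALL=C stat -- "$target" 2>&1 >/dev/null); then
            printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$err"
            return 1
        fi
        n=$((n + 1))
        sleep "$interval"
    done
    return 0
}
```

One node would run something like `create_task_dirs /work/some-test 30` while an idle client runs `poll_stat /work/some-test 60 1440` (the path is a placeholder). The timestamp of the first failure would at least bound the "time passes" interval in the original report.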

Thanks,
  Daniel
-- 
Daniel J Blueman

* Re: Btrfs, NFS (v3) and ESTALE
  2010-09-23 11:02 Btrfs, NFS (v3) and ESTALE David Flynn
  2010-09-23 12:28 ` Daniel J Blueman
@ 2010-09-23 14:03 ` David Nicol
  1 sibling, 0 replies; 4+ messages in thread
From: David Nicol @ 2010-09-23 14:03 UTC
  Cc: linux-btrfs

I wonder how difficult it would be to use GFS storage allocation, so
that the right way to cluster btrfs would be at the block device level
instead of the file system level.

Like the way btrfs can coexist in an ext4 partition, but coexisting with GFS.


On Thu, Sep 23, 2010 at 6:02 AM, David Flynn <davidf@rd.bbc.co.uk> wrote:
> Dear all;
>
> On a cluster of ~35 machines used for batch processing, which all mount
> via NFS (v3) a BTRFS export, I am experiencing

* Re: Btrfs, NFS (v3) and ESTALE
  2010-09-23 12:28 ` Daniel J Blueman
@ 2010-11-04 22:40   ` David Flynn
  0 siblings, 0 replies; 4+ messages in thread
From: David Flynn @ 2010-11-04 22:40 UTC
  To: Daniel J Blueman; +Cc: David Flynn, linux-btrfs, Trond Myklebust

* Daniel J Blueman (daniel.blueman@gmail.com) wrote:
> I was experiencing a similar pattern of ESTALE issues with NFS with
> 2.6.33 (IIRC) and cached data on ext4, and could reproduce it from
> time to time performing kernel rebuilds over NFS.
> 
> I've CC'd Trond on the full email to see if it rings a bell. The best
> outcome may be if we write a micro-reproducer which exploits this race
> using cached data.

I've recently seen quite a concrete case, which may be interesting:

NB, this is not an exact transcript

## step1: build a binary to use (out of tree build, touches: depends
##        files, object files, binaries)
vcfe:some/dir/bin$ make
## step2: launch a job on the cluster that uses the binaries in dir/bin
##        but does not touch any other files in dir/bin
vcfe:some/dir$ sbatch -N4 my_job.sh
## step3: let time pass (job completed, came back next day)
##
vcfe:some/dir$ ls -l bin
< many stale filehandle errors >

In actual fact, steps 1 and 2 were repeated several times (I happened to be
bisecting something) without issue, then the following day step 3
revealed a problem.

Now all writes to dir/bin occurred on vcfe, other computers only accessed
it for the binary.  Other computers will have created extra directories
in "some/dir/".

The stale filehandle errors were resolved by:
  echo 2 > /proc/sys/vm/drop_caches

A quick summary of the setup:
 - nfs client was 2.6.35, mounting with nfsv3
 - nfs server was 2.6.33, exporting a btrfs filesystem (noatime,nodiratime)

I'd be very interested if anyone has any further thoughts on the issue.

Kind regards,

..david
