Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

linux-lvm.redhat.com archive mirror
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: Spelic <spelic@shiftmail.org>
Cc: linux-lvm@redhat.com,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	xfs@oss.sgi.com
Subject: Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
Date: Fri, 3 Dec 2010 10:07:43 +1100	[thread overview]
Message-ID: <20101202230743.GZ16922@dastard> (raw)
In-Reply-To: <4CF7A9CF.2020904@shiftmail.org>

On Thu, Dec 02, 2010 at 03:14:39PM +0100, Spelic wrote:
> Sorry for replying to my own email already
> one more thing on the 3rd bug:
> 
> On 12/02/2010 02:55 PM, Spelic wrote:
> >Hello all
> >[CUT]
> >.......
> >with NFS over <RDMA or IPoIB> over Infiniband over XFS over
> >ramdisk it is possible to write a file (2.3GB) which is larger
> >than
> 
> This is also reproducible with:
> NFS over TCP over Ethernet over XFS over ramdisk.
> You don't need infiniband for this.
> With ethernet it doesn't hang (that's another bug, for RDMA people,
> in the othter thread) but the file is still 1.9GB, i.e. larger than
> the device.
> 
> 
> Look, after running the test over ethernet,
> at server side:
> 
> # ll -h /mnt/ram
> total 1.5G
> drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
> drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
> -rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

This is a classic ENOSPC vs NFS client writeback overcommit caching
issue.  Have a look at the block map output - I bet theres holes in
the file and it's only consuming 1.5GB of disk space. use xfs_bmap
to check this. du should tell you the same thing.

Basically, the NFS client overcommits the server filesystem space by
doing local writeback caching. Hence it caches 1.9GB of data before
it gets the first ENOSPC error back from the server at around 1.5GB
of written data. At that point, the data that gets ENOSPC errors is
tossed by the NFS client, and a ENOSPC error is placed on the
address space to be reported to the next write/sync call. That gets
to the dd process when it's 1.9GB into the write.

However, there is still (in this case) 400MB of dirty data in the
NFS client cache that it will try to write to the server. Because
XFS uses speculative preallocation and reserves some space for
metadata allocation during delayed allocation, it's handling of the
initial ENOSPC condition can result in some space being freed up
again as unused reserved metadata space is returned to the free pool
as delalloc occurs during server writeback. This usually takes a
second or two to complete.

As a result, shortly after the first ENOSPC has been reported and
subsequent writes have also ENOSPC, we can have space freed up and
another write will succeed. At that point, the write that succeeds
will be a different offset to the last one that succeeded, leaving a
hole in the file and moving the EOF well past 1.5GB. That will go on
until there really is no space left at all or the NFS client has no
more dirty data to send.

Basically, what you see it not a bug in XFS, it is a result of NFS
clients being able to overcommit server filesystem space and the
interaction that has with the way the filesystem on the NFS server
handles ENOSPC.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

next prev parent reply	other threads:[~2010-12-02 23:07 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-02 13:55 [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram Spelic
2010-12-02 14:11 ` Christoph Hellwig
2010-12-02 14:14   ` Spelic
2010-12-02 14:17     ` Christoph Hellwig
2010-12-02 21:22       ` Mike Snitzer
2010-12-02 22:08         ` Mike Snitzer
2010-12-03 17:11         ` Nick Piggin
2010-12-03 18:15           ` Ted Ts'o
2010-12-02 14:14 ` Spelic
2010-12-02 23:07   ` Dave Chinner [this message]
2010-12-03 14:07     ` Spelic
2010-12-06  4:09       ` Dave Chinner
2010-12-06 12:20         ` [linux-lvm] NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram) Spelic
2010-12-06 13:33           ` Trond Myklebust

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101202230743.GZ16922@dastard \
    --to=david@fromorbit.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-lvm@redhat.com \
    --cc=spelic@shiftmail.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).