* [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-02 13:55 UTC
To: linux-kernel@vger.kernel.org, xfs, linux-lvm

Hello all,
I have noticed what seem to be 4 bugs (kernel v2.6.37-rc4, but probably earlier kernels as well).

The first two are one in mkfs.xfs and one in device mapper (LVM mailing list I suppose; otherwise please forward it).

Steps to reproduce:

Boot with a large ramdisk, e.g. ramdisk_size=2097152 (I actually had a 14GB ramdisk when I tried this, but I don't think it makes a difference).

Now create a single 1GB partition on it:

fdisk /dev/ram0
n
p
1
1
+1G
w

(only one 1GB physical partition)

Make a device-mapper mapping for the partition:

kpartx -av /dev/ram0

mkfs.xfs -f /dev/mapper/ram0p1
meta-data=/dev/mapper/ram0p1     isize=256    agcount=4, agsize=66266 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=265064, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Now, lo and behold, the partition is gone!

fdisk /dev/ram0
p

will show no partitions. You can also check with

dd if=/dev/ram0 bs=1M count=1 | hexdump -C

The whole first MB of /dev/ram0 is zeroed! Also

mount /dev/ram0p1 /mnt

will fail: unknown filesystem.

I think this shows 2 bugs: first, mkfs.xfs writes before the beginning of the device it was asked to operate on. Second, device mapper does not constrain access to within the boundaries of the mapped device, which I think it should.

Then I have 2 more bugs for you. Please see my thread in linux-rdma called "NFS-RDMA hangs: connection closed (-103)", in particular this post:
http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg06632.html

With NFS over <RDMA or IPoIB> over Infiniband over XFS over ramdisk it is possible to write a file (2.3GB) which is larger than the size of the device (1.5GB). One bug I think is for the XFS people (because I think XFS should check whether the filesystem has run out of space), and the other I think is for the /dev/ram people (what mailing list? I am adding lkml), because I think the device should reject writes beyond its end.

Thank you

PS: I am not subscribed to lkml so please do not reply ONLY to lkml.
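[A quick way to confirm that the start of the ramdisk was really overwritten, rather than the kernel merely holding a stale partition table, is to compare the first megabyte before and after mkfs.xfs and then ask the kernel to re-read the partition table. This is only an illustrative sketch of that check; the device names follow the reproduction above, but the exact commands are not part of the original report.]

# Sketch: does mkfs.xfs on /dev/mapper/ram0p1 clobber the start of /dev/ram0?
# (destructive test, run only on a throwaway ramdisk)
fdisk -l /dev/ram0                       # partition table before
dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum > /tmp/ram0.before

mkfs.xfs -f /dev/mapper/ram0p1           # format the kpartx mapping

dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum > /tmp/ram0.after
diff /tmp/ram0.before /tmp/ram0.after || echo "first MB of /dev/ram0 changed"

blockdev --rereadpt /dev/ram0            # rule out a stale in-kernel view
fdisk -l /dev/ram0                       # partition table after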
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Christoph Hellwig
Date: 2010-12-02 14:11 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
option must never be enabled, as it causes block devices to be
randomly renumbered. Together with the ramdisk driver overloading
the BLKFLSBUF ioctl to discard all data, it guarantees data loss
like yours.
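[For anyone following along, checking whether that option is set on a running kernel is a one-liner; this assumes the usual places a kernel config is exposed (a distro config file under /boot, or /proc/config.gz when CONFIG_IKCONFIG_PROC is enabled).]

# Check whether CONFIG_DEBUG_BLOCK_EXT_DEVT is enabled on the running kernel.
grep CONFIG_DEBUG_BLOCK_EXT_DEVT /boot/config-"$(uname -r)" 2>/dev/null \
  || zgrep CONFIG_DEBUG_BLOCK_EXT_DEVT /proc/config.gz 2>/dev/null \
  || echo "kernel config not exposed on this system"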
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-02 14:14 UTC
To: Christoph Hellwig
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> option must never be enabled, as it causes block devices to be
> randomly renumbered. Together with the ramdisk driver overloading
> the BLKFLSBUF ioctl to discard all data, it guarantees data loss
> like yours.

Nope...

# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Christoph Hellwig
Date: 2010-12-02 14:17 UTC
To: Spelic
Cc: Christoph Hellwig, linux-lvm, linux-kernel@vger.kernel.org, xfs

On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> > I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> > option must never be enabled, as it causes block devices to be
> > randomly renumbered. Together with the ramdisk driver overloading
> > the BLKFLSBUF ioctl to discard all data, it guarantees data loss
> > like yours.
>
> Nope...
>
> # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set

Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
effect.
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Mike Snitzer
Date: 2010-12-02 21:22 UTC
To: LVM general discussion and development
Cc: npiggin, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Thu, Dec 02 2010 at 9:17am -0500, Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> > On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> > > I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> > > option must never be enabled, as it causes block devices to be
> > > randomly renumbered. Together with the ramdisk driver overloading
> > > the BLKFLSBUF ioctl to discard all data, it guarantees data loss
> > > like yours.
> >
> > Nope...
> >
> > # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
>
> Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
> effect.

For the benefit of others:
- mkfs.xfs will avoid sending BLKFLSBUF to any device whose major is
  ramdisk's major; this dates back to 2004:
  http://oss.sgi.com/archives/xfs/2004-08/msg00463.html
- but because a kpartx partition overlay (a linear DM mapping) is used
  for the /dev/ram0p1 device, mkfs.xfs only sees a device with DM's major
- so mkfs.xfs sends BLKFLSBUF to the DM device, blissfully unaware that
  the backing device (behind the DM linear target) is a brd device
- DM forwards the BLKFLSBUF ioctl to brd, which triggers
  drivers/block/brd.c:brd_ioctl (nuking the entire ramdisk in the process)

So, coming full circle, this is what hch was referring to when he mentioned:
1) "ramdisk driver overloading the BLKFLSBUF ioctl ..."
2) "dm-linear's dumb forwarding of ioctls ..."

I really can't see DM adding a specific check for ramdisk's major when
forwarding the BLKFLSBUF ioctl.

brd has direct partition support (see commit d7853d1f8932c), so maybe
kpartx should just blacklist /dev/ram devices?

Alternatively, what about switching brd away from overloading BLKFLSBUF
to a real implementation of BLKDISCARD support in brd.c? One that
doesn't blindly nuke the entire device but properly processes the
discard request.

Mike
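[The forwarding chain Mike describes can be exercised without mkfs.xfs at all, since util-linux's `blockdev --flushbufs` issues the same BLKFLSBUF ioctl. The sketch below is illustrative only (and destructive to the ramdisk contents), assuming the same /dev/ram0 + kpartx setup as in the original report.]

# Sketch: show that BLKFLSBUF sent to the dm-linear mapping reaches brd,
# which drops every page of the backing ramdisk.
kpartx -av /dev/ram0                        # creates /dev/mapper/ram0p1
dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum   # contains the partition table

blockdev --flushbufs /dev/mapper/ram0p1     # sends BLKFLSBUF to the DM device

dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum   # now reads back as zeroes
fdisk -l /dev/ram0                          # partition table is gone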
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Mike Snitzer
Date: 2010-12-02 22:08 UTC
To: LVM general discussion and development
Cc: tytso, npiggin, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Thu, Dec 02 2010 at 4:22pm -0500, Mike Snitzer <snitzer@redhat.com> wrote:

> [...]
>
> Alternatively, what about switching brd away from overloading BLKFLSBUF
> to a real implementation of BLKDISCARD support in brd.c? One that
> doesn't blindly nuke the entire device but properly processes the
> discard request.

Hmm, any chance we could revisit this approach?
http://lkml.indiana.edu/hypermail/linux/kernel/0405.3/0998.html
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Nick Piggin
Date: 2010-12-03 17:11 UTC
To: Mike Snitzer
Cc: npiggin, LVM general discussion and development, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Thu, Dec 02, 2010 at 04:22:27PM -0500, Mike Snitzer wrote:
> [...]
>
> I really can't see DM adding a specific check for ramdisk's major when
> forwarding the BLKFLSBUF ioctl.
>
> brd has direct partition support (see commit d7853d1f8932c), so maybe
> kpartx should just blacklist /dev/ram devices?
>
> Alternatively, what about switching brd away from overloading BLKFLSBUF
> to a real implementation of BLKDISCARD support in brd.c? One that
> doesn't blindly nuke the entire device but properly processes the
> discard request.

Yeah, the situation really sucks (mkfs.jfs doesn't work on ramdisk for
the same reason).

Unfortunately I want to keep the existing ioctl behaviour for
compatibility, but adding new, saner ones would be welcome. It would
also be fine to have a non-default config option or load-time parameter
for brd to skip the special case, if that would help testing on older
userspace.

DISCARD is actually a problem for brd. To get proper correctness, you
need to preload brd with pages; otherwise, when doing stress tests, I/O
can require memory allocations and deadlock. If we add a discard that
frees pages, that reintroduces the same problem.

If you find any option useful for testing, however, patches are fine --
brd is pretty much only useful for testing nowadays.
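[Nick's point about preloading is worth spelling out for anyone using brd for stress testing: touch every page of the ramdisk once before the test so no allocation has to happen on the I/O path. One simple, illustrative way to do that from userspace (not taken from the thread) is just to write the whole device once up front:]

# Preload every brd page so later I/O never has to allocate memory.
# brd allocates backing pages on first write, so writing zeroes once
# across the whole device is enough to fault them all in.
SIZE=$(blockdev --getsize64 /dev/ram0)
dd if=/dev/zero of=/dev/ram0 bs=1M oflag=direct count=$(( SIZE / 1048576 ))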
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Ted Ts'o
Date: 2010-12-03 18:15 UTC
To: Nick Piggin
Cc: Mike Snitzer, LVM general discussion and development, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Sat, Dec 04, 2010 at 04:11:40AM +1100, Nick Piggin wrote:
> > Alternatively, what about switching brd away from overloading BLKFLSBUF
> > to a real implementation of BLKDISCARD support in brd.c? One that
> > doesn't blindly nuke the entire device but properly processes the
> > discard request.
>
> Yeah, the situation really sucks (mkfs.jfs doesn't work on ramdisk for
> the same reason).
>
> Unfortunately I want to keep the existing ioctl behaviour for
> compatibility, but adding new, saner ones would be welcome. [...]

How many programs actually depend on BLKFLSBUF dropping the pages used
in /dev/ram? The fact that it did this at all was a historical accident
of how the original /dev/ram was implemented (directly in the buffer
cache), and not anything that was intended.

I think that's something we should be able to fix, since the number of
programs that knowingly operate on the ramdisk is quite small: just a
few system programs used by distributions in their early boot
scripts...

So I would argue for dropping the "special" behavior of BLKFLSBUF for
/dev/ram.

 - Ted
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-02 14:14 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

Sorry for replying to my own email already; one more thing on the 3rd bug.

On 12/02/2010 02:55 PM, Spelic wrote:
> Hello all
> [CUT]
> .......
> With NFS over <RDMA or IPoIB> over Infiniband over XFS over ramdisk it
> is possible to write a file (2.3GB) which is larger than

This is also reproducible with NFS over TCP over Ethernet over XFS over
ramdisk; you don't need Infiniband for this. With Ethernet it doesn't
hang (that's another bug, for the RDMA people, in the other thread),
but the file is still 1.9GB, i.e. larger than the device.

Look, after running the test over Ethernet, at the server side:

# ll -h /mnt/ram
total 1.5G
drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
-rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

# mount
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
devtmpfs on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
nfsd on /proc/fs/nfsd type nfsd (rw)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)
/dev/ram0 on /mnt/ram type xfs (rw)

# blockdev --getsize64 /dev/ram0
1610612736

# dd if=/mnt/ram/zerofile | wc -c
3878784+0 records in
3878784+0 records out
1985937408 bytes (2.0 GB) copied, 6.57081 s, 302 MB/s
1985937408

Feel free to forward this to the NFS mailing list as well if you think
it's appropriate.

Thank you
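[For anyone who wants to reproduce the Ethernet variant, the setup described above boils down to exporting the XFS-on-ramdisk mount over NFS and writing past its capacity from a client. The export options, hostnames, and mount points below are assumptions for illustration, not details taken from the original report.]

# --- server side (paths/options illustrative) ---
mkfs.xfs -f /dev/ram0
mount /dev/ram0 /mnt/ram
exportfs -o rw,no_root_squash '*:/mnt/ram'
blockdev --getsize64 /dev/ram0            # 1610612736 (~1.5GB) in the report

# --- client side ---
mount -t nfs server:/mnt/ram /mnt/nfsram
dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M   # runs well past the device size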
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Dave Chinner
Date: 2010-12-02 23:07 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On Thu, Dec 02, 2010 at 03:14:39PM +0100, Spelic wrote:
> Sorry for replying to my own email already; one more thing on the 3rd bug.
>
> On 12/02/2010 02:55 PM, Spelic wrote:
> > Hello all
> > [CUT]
> > .......
> > With NFS over <RDMA or IPoIB> over Infiniband over XFS over
> > ramdisk it is possible to write a file (2.3GB) which is larger
> > than
>
> This is also reproducible with NFS over TCP over Ethernet over XFS over
> ramdisk; you don't need Infiniband for this. With Ethernet it doesn't
> hang (that's another bug, for the RDMA people, in the other thread),
> but the file is still 1.9GB, i.e. larger than the device.
>
> Look, after running the test over Ethernet, at the server side:
>
> # ll -h /mnt/ram
> total 1.5G
> drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
> drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
> -rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

This is a classic ENOSPC vs NFS client writeback overcommit caching
issue. Have a look at the block map output - I bet there are holes in
the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
to check this; du should tell you the same thing.

Basically, the NFS client overcommits the server filesystem's space by
doing local writeback caching. Hence it caches 1.9GB of data before
it gets the first ENOSPC error back from the server at around 1.5GB
of written data. At that point, the data that got ENOSPC errors is
tossed by the NFS client, and an ENOSPC error is placed on the
address space to be reported to the next write/sync call. That
reaches the dd process when it is 1.9GB into the write. However,
there is still (in this case) 400MB of dirty data in the NFS client
cache that it will try to write to the server.

Because XFS uses speculative preallocation and reserves some space for
metadata allocation during delayed allocation, its handling of the
initial ENOSPC condition can result in some space being freed up again,
as unused reserved metadata space is returned to the free pool while
delayed allocations are converted during server writeback. This usually
takes a second or two to complete. As a result, shortly after the first
ENOSPC has been reported and subsequent writes have also hit ENOSPC,
space can be freed up and another write will succeed. At that point,
the write that succeeds will be at a different offset from the last one
that succeeded, leaving a hole in the file and moving the EOF well past
1.5GB. That goes on until there really is no space left at all or the
NFS client has no more dirty data to send.

Basically, what you see is not a bug in XFS; it is a result of NFS
clients being able to overcommit server filesystem space, and the
interaction that has with the way the filesystem on the NFS server
handles ENOSPC.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
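[Dave's suggested check (apparent size versus allocated blocks, plus the extent map) can be done in one short pass. A minimal sketch; the path matches the report, and xfs_bmap is only meaningful on XFS:]

F=/mnt/ram/zerofile
ls -ls "$F"                      # first column: allocated size, vs. apparent size
du -h --apparent-size "$F"       # apparent size
du -h "$F"                       # blocks actually consumed
stat -c 'apparent=%s bytes, allocated=%b blocks of %B bytes' "$F"
xfs_bmap -v "$F" | grep -c hole  # number of holes in the extent map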
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-03 14:07 UTC
To: Dave Chinner
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On 12/03/2010 12:07 AM, Dave Chinner wrote:
> This is a classic ENOSPC vs NFS client writeback overcommit caching
> issue. Have a look at the block map output - I bet there are holes in
> the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
> to check this; du should tell you the same thing.

Yes, you are right!

root@server:/mnt/ram# ll -h
total 1.5G
drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
-rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

root@server:/mnt/ram# ls -lsh
total 1.5G
1.5G -rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

(it's a sparse file)

root@server:/mnt/ram# xfs_bmap zerofile
zerofile:
 0: [0..786367]: 786496..1572863
 1: [786368..1572735]: 2359360..3145727
 2: [1572736..2232319]: 1593408..2252991
 3: [2232320..2529279]: 285184..582143
 4: [2529280..2531327]: hole
 5: [2531328..2816407]: 96..285175
 6: [2816408..2971511]: 582144..737247
 7: [2971512..2971647]: hole
 8: [2971648..2975183]: 761904..765439
 9: [2975184..2975743]: hole
10: [2975744..2975751]: 765440..765447
11: [2975752..2977791]: hole
12: [2977792..2977799]: 765480..765487
13: [2977800..2979839]: hole
14: [2979840..2979847]: 765448..765455
15: [2979848..2981887]: hole
16: [2981888..2981895]: 765472..765479
17: [2981896..2983935]: hole
18: [2983936..2983943]: 765456..765463
19: [2983944..2985983]: hole
20: [2985984..2985991]: 765464..765471
21: [2985992..3202903]: hole
22: [3202904..3215231]: 737248..749575
23: [3215232..3239767]: hole
24: [3239768..3252095]: 774104..786431
25: [3252096..3293015]: hole
26: [3293016..3305343]: 749576..761903
27: [3305344..3370839]: hole
28: [3370840..3383167]: 2252992..2265319
29: [3383168..3473239]: hole
30: [3473240..3485567]: 2265328..2277655
31: [3485568..3632983]: hole
32: [3632984..3645311]: 2277656..2289983
33: [3645312..3866455]: hole
34: [3866456..3878783]: 2289984..2302311

(many delayed allocation extents could not be filled because the space
on the device ran out)

However ...

> Basically, the NFS client overcommits the server filesystem's space by
> doing local writeback caching. Hence it caches 1.9GB of data before
> it gets the first ENOSPC error back from the server at around 1.5GB
> of written data. At that point, the data that got ENOSPC errors is
> tossed by the NFS client, and an ENOSPC error is placed on the
> address space to be reported to the next write/sync call. That
> reaches the dd process when it is 1.9GB into the write.

I'm no great expert, but isn't this a design flaw in NFS?

Ok, in this case we were lucky it was all zeroes, so XFS made a sparse
file and could fit 1.9GB into a 1.5GB device. In general, with nonzero
data, it seems to me you will get data corruption, because the NFS
client thinks it has written the data while the NFS server really can't
write more data than the device size.

It's nice that the NFS server does local writeback caching, but it
should also cache the filesystem's free space (and check it
periodically, since nfs-server is presumably not the only process
writing to that filesystem) so that it doesn't accept more data than it
can really write. Alternatively, when free space drops below 1GB (or a
reasonable size based on network speed), nfs-server should turn off
filesystem writeback caching.

I can't repeat the test with urandom because it's too slow (8MB/sec!?).
How come Linux hasn't got a "uurandom" device capable of e.g. 400MB/sec
with only very weak randomness?

But I have repeated the test over Ethernet with a bunch of symlinks to
a 100MB file created from urandom.

At the client side:

# time cat randfile{001..020} | pv -b > /mnt/nfsram/randfile
1.95GB

real    0m22.978s
user    0m0.310s
sys     0m5.360s

At the server side:

# ls -lsh ram
total 1.5G
1.5G -rw-r--r-- 1 root root 1.7G 2010-12-03 14:43 randfile

# xfs_bmap ram/randfile
ram/randfile:
 0: [0..786367]: 786496..1572863
 1: [786368..790527]: 96..4255
 2: [790528..1130495]: hole
 3: [1130496..1916863]: 2359360..3145727
 4: [1916864..2682751]: 1593408..2359295
 5: [2682752..3183999]: 285184..786431
 6: [3184000..3387207]: 4256..207463
 7: [3387208..3387391]: hole
 8: [3387392..3391567]: 207648..211823
 9: [3391568..3393535]: hole
10: [3393536..3393543]: 211824..211831
11: [3393544..3395583]: hole
12: [3395584..3395591]: 211832..211839
13: [3395592..3397631]: hole
14: [3397632..3397639]: 211856..211863
15: [3397640..3399679]: hole
16: [3399680..3399687]: 211848..211855
17: [3399688..3401727]: hole
18: [3401728..3409623]: 221984..229879

# dd if=/mnt/ram/randfile | wc -c
3409624+0 records in
3409624+0 records out
1745727488 bytes (1.7 GB) copied, 5.72443 s, 305 MB/s
1745727488

The file is still sparse, and this time it certainly has data
corruption (the holes will be read back as zeroes). I understand that
the client receives an Input/output error when this condition is hit,
but the file written at the server side has an apparent size of 1.8GB
while the valid data in it is not 1.8GB. Is that good semantics?
Wouldn't it be better for nfs-server to turn off writeback caching when
it approaches a disk-full situation?

And then I see another problem: as you can see, xfs_bmap shows lots of
holes, even with the random file (it was taken from urandom, so you can
be sure it hasn't got many zeroes), already from offset 790528 sectors,
which is far from the disk-full point...

First I checked that this does not happen when pushing less than 1.5GB
of data. It does not. Then I tried with exactly 15x100MB (the files are
100MB symlinks to a file created with
dd if=/dev/urandom of=randfile.rnd bs=1M count=100) and this happened:

Client side:

# time cat randfile{001..015} | pv -b > /mnt/nfsram/randfile
1.46GB

real    0m18.265s
user    0m0.260s
sys     0m4.460s

(Please note: no I/O error at the client side!
blockdev --getsize64 /dev/ram0 == 1610612736)

Server side:

# ls -ls ram
total 1529676
1529676 -rw-r--r-- 1 root root 1571819520 2010-12-03 14:51 randfile

# dd if=/mnt/ram/randfile | wc -c
3069960+0 records in
3069960+0 records out
1571819520 bytes (1.6 GB) copied, 5.30442 s, 296 MB/s
1571819520

# xfs_bmap ram/randfile
ram/randfile:
 0: [0..112639]: 96..112735
 1: [112640..208895]: 114784..211039
 2: [208896..399359]: 285184..475647
 3: [399360..401407]: 112736..114783
 4: [401408..573439]: 475648..647679
 5: [573440..937983]: 786496..1151039
 6: [937984..1724351]: 2359360..3145727
 7: [1724352..2383871]: 1593408..2252927
 8: [2383872..2805695]: 1151040..1572863
 9: [2805696..2944447]: 647680..786431
10: [2944448..2949119]: 211040..215711
11: [2949120..3055487]: 2252928..2359295
12: [3055488..3058871]: 215712..219095
13: [3058872..3059711]: hole
14: [3059712..3060143]: 219936..220367
15: [3060144..3061759]: hole
16: [3061760..3061767]: 220368..220375
17: [3061768..3063807]: hole
18: [3063808..3063815]: 220376..220383
19: [3063816..3065855]: hole
20: [3065856..3065863]: 220384..220391
21: [3065864..3067903]: hole
22: [3067904..3067911]: 220392..220399
23: [3067912..3069951]: hole
24: [3069952..3069959]: 220400..220407

Holes in a random file! This is data corruption, and nobody is notified
of it: no error at the client side or the server side! Is that good
semantics? How could the client get notified of this? Some kind of
fsync, maybe?

Thank you
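[Since neither the client nor the server reported an error in the 15x100MB case, an end-to-end integrity check is the only way to notice that the file is full of holes. A minimal sketch that would have caught this silent corruption by comparing a checksum of what was sent against what actually landed on the server; paths as in the test above:]

# Client side: checksum the data stream as it is written to the NFS mount.
cat randfile{001..015} | tee /mnt/nfsram/randfile | md5sum > /tmp/sent.md5

# Server side: checksum what was actually stored on the XFS-on-ramdisk export.
md5sum /mnt/ram/randfile > /tmp/stored.md5

# Compare the checksum fields; a mismatch means silent corruption.
awk '{print $1}' /tmp/sent.md5 /tmp/stored.md5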
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Dave Chinner
Date: 2010-12-06 4:09 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On Fri, Dec 03, 2010 at 03:07:58PM +0100, Spelic wrote:
> On 12/03/2010 12:07 AM, Dave Chinner wrote:
> > This is a classic ENOSPC vs NFS client writeback overcommit caching
> > issue. Have a look at the block map output - I bet there are holes in
> > the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
> > to check this; du should tell you the same thing.
>
> Yes, you are right!
....
> root@server:/mnt/ram# xfs_bmap zerofile
> zerofile:
....
> 30: [3473240..3485567]: 2265328..2277655
> 31: [3485568..3632983]: hole
> 32: [3632984..3645311]: 2277656..2289983
> 33: [3645312..3866455]: hole
> 34: [3866456..3878783]: 2289984..2302311
>
> (many delayed allocation extents could not be filled because the space
> on the device ran out)
>
> However ...
>
> > Basically, the NFS client overcommits the server filesystem's space by
> > doing local writeback caching. Hence it caches 1.9GB of data before
> > it gets the first ENOSPC error back from the server at around 1.5GB
> > of written data. At that point, the data that got ENOSPC errors is
> > tossed by the NFS client, and an ENOSPC error is placed on the
> > address space to be reported to the next write/sync call. That
> > reaches the dd process when it is 1.9GB into the write.
>
> I'm no great expert, but isn't this a design flaw in NFS?

Yes, sure is.

[ Well, to be precise, the original NFSv2 specification didn't have this
flaw because all writes were synchronous. NFSv3 introduced asynchronous
writes (writeback caching) and with it this problem. NFSv4 does not fix
this flaw. ]

> Ok, in this case we were lucky it was all zeroes, so XFS made a sparse
> file and could fit 1.9GB into a 1.5GB device.
>
> In general, with nonzero data, it seems to me you will get data
> corruption, because the NFS client thinks it has written the data
> while the NFS server really can't write more data than the device
> size.

Yup, well known issue. Simple rule: don't run your NFS server out of
space.

> It's nice that the NFS server does local writeback caching, but it
> should also cache the filesystem's free space (and check it
> periodically, since nfs-server is presumably not the only process
> writing to that filesystem) so that it doesn't accept more data than
> it can really write. Alternatively, when free space drops below 1GB
> (or a reasonable size based on network speed), nfs-server should
> turn off filesystem writeback caching.

This isn't an NFS server problem, or one that can be worked around at
the server. It's an NFS _client_ problem, in that the client does not
get synchronous ENOSPC errors when using writeback caching. There is no
way for the NFS client to know the server is near ENOSPC conditions
prior to writing the data to the server, as clients operate
independently.

If you really want your NFS clients to behave correctly when the server
goes ENOSPC, turn off writeback caching at the client side, not the
server (i.e. use sync mounts on the client side). Write performance
will suck, but if you want sane ENOSPC behaviour...

.....

> Holes in a random file!
> This is data corruption, and nobody is notified of it: no error at
> the client side or the server side!
> Is that good semantics? How could the client get notified of this?
> Some kind of fsync, maybe?

Use wireshark to determine if the server sends an ENOSPC to the client
when the first background write fails. I bet it does, and that your dd
write failed with ENOSPC too. Something stopped it writing at 1.9GB....

What happens to the remaining cached writeback data in the NFS client
once the server runs out of space is NFS-client-specific behaviour. If
you end up with only bits of the file on the server, then that's a
result of NFS client behaviour, not an NFS server problem.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
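[To do what Dave suggests, i.e. confirm on the wire whether the server ever returns NFS3ERR_NOSPC, it is enough to capture the NFS traffic during the test and filter for non-zero status codes. The capture commands below are generic; the display-filter field name is an assumption about Wireshark's NFS dissector and may need adjusting for your version.]

# Capture NFS traffic (port 2049) on either end while re-running the test.
tcpdump -i eth0 -s 0 -w /tmp/nfs-enospc.pcap port 2049

# Afterwards, look for WRITE/COMMIT replies with a non-zero NFSv3 status
# (NFS3ERR_NOSPC is 28). Field and option names assumed; verify locally.
tshark -r /tmp/nfs-enospc.pcap -Y 'nfs.nfsstat3 != 0'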
* [linux-lvm] NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)
From: Spelic
Date: 2010-12-06 12:20 UTC
To: Dave Chinner
Cc: linux-nfs, linux-lvm, linux-kernel@vger.kernel.org, xfs

On 12/06/2010 05:09 AM, Dave Chinner wrote:
> > [Files become sparse at the nfs-server side upon hitting ENOSPC if
> > the NFS client uses local writeback caching]
> >
> > It's nice that the NFS server does local writeback caching, but it
> > should also cache the filesystem's free space (and check it
> > periodically, since nfs-server is presumably not the only process
> > writing to that filesystem) so that it doesn't accept more data than
> > it can really write. Alternatively, when free space drops below 1GB
> > (or a reasonable size based on network speed), nfs-server should
> > turn off filesystem writeback caching.
>
> This isn't an NFS server problem, or one that can be worked around at
> the server. It's an NFS _client_ problem, in that the client does not
> get synchronous ENOSPC errors when using writeback caching. There is
> no way for the NFS client to know the server is near ENOSPC conditions
> prior to writing the data to the server, as clients operate
> independently.
>
> If you really want your NFS clients to behave correctly when the
> server goes ENOSPC, turn off writeback caching at the client side, not
> the server (i.e. use sync mounts on the client side). Write
> performance will suck, but if you want sane ENOSPC behaviour...

[adding the NFS ML in cc]

Thank you for your very clear explanation.

Going without writeback caching is a problem (write performance
suffers, as you say), but guaranteeing never to reach ENOSPC is also
hardly feasible, especially if humans are logged in at the client side
doing "whatever they want".

I would suggest that either the NFS client poll to see whether the
server is near ENOSPC and, if so, disable writeback caching; or the
server do the polling and, when it finds a near-ENOSPC condition, send
a specific message to the clients to warn them so that they can disable
caching. Doing it at the client side wouldn't change the NFS protocol
and could be good enough if one can specify how often free space should
be polled and what the free-space threshold is. Or, with just one
value: specify the maximum speed at which the server disk can fill (the
next polling interval can be inferred from the current free space), and
perhaps also a minimum polling period (just in case).

Regarding the last part of the email, perhaps I was not clear:

> .....
>
> > Holes in a random file!
> > This is data corruption, and nobody is notified of it: no error at
> > the client side or the server side!
> > Is that good semantics? How could the client get notified of this?
> > Some kind of fsync, maybe?
>
> Use wireshark to determine if the server sends an ENOSPC to the client
> when the first background write fails. I bet it does, and that your dd
> write failed with ENOSPC too. Something stopped it writing at 1.9GB....

No: in that case I had written 15x100MB, which was more than the
available space but less than available + writeback cache. So "cat"
finished by itself and never got an ENOSPC error, yet the data never
reached the disk at the other side.

However, today I found that by using fsync the problem is fortunately
detected:

# time cat randfile{001..015} | pv -b | dd conv=fsync of=/mnt/nfsram/randfile
1.46GB
dd: fsync failed for `/mnt/nfsram/randfile': Input/output error
3072000+0 records in
3072000+0 records out
1572864000 bytes (1.6 GB) copied, 20.9101 s, 75.2 MB/s

real    0m21.364s
user    0m0.470s
sys     0m11.440s

So, ok, I understand that processes needing guarantees about written
data should use fsync/fdatasync (which is good practice on a local
filesystem too, actually...).

Thank you
* Re: [linux-lvm] NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)
From: Trond Myklebust
Date: 2010-12-06 13:33 UTC
To: Spelic
Cc: linux-nfs, linux-lvm, Dave Chinner, linux-kernel@vger.kernel.org, xfs

On Mon, 2010-12-06 at 13:20 +0100, Spelic wrote:
> [...]
>
> I would suggest that either the NFS client poll to see whether the
> server is near ENOSPC and, if so, disable writeback caching; or the
> server do the polling and, when it finds a near-ENOSPC condition, send
> a specific message to the clients to warn them so that they can
> disable caching. Doing it at the client side wouldn't change the NFS
> protocol and could be good enough if one can specify how often free
> space should be polled and what the free-space threshold is. Or, with
> just one value: specify the maximum speed at which the server disk can
> fill (the next polling interval can be inferred from the current free
> space), and perhaps also a minimum polling period (just in case).

You can just as easily do this at the application level. The kernel
can't do it any more reliably than the application can, so there really
is no point in doing it there.

We already ensure that when the server does send us an error, we switch
to synchronous operation until the error clears.

Trond