* [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-02 13:55 UTC
To: linux-kernel@vger.kernel.org, xfs, linux-lvm

Hello all,
I have noticed what seem to be 4 bugs (kernel v2.6.37-rc4, but probably earlier kernels as well).

The first two are one in mkfs.xfs and one in device mapper (LVM mailing list I suppose; otherwise please forward it).

Steps to reproduce:

Boot with a large ramdisk, e.g. ramdisk_size=2097152 (I actually had a 14GB ramdisk when I tried this, but I don't think it makes a difference).

Now create a single 1GB partition on it:

fdisk /dev/ram0
n
p
1
1
+1G
w

(only one 1GB physical partition)

Make a device-mapper mapping for the partition:

kpartx -av /dev/ram0

mkfs.xfs -f /dev/mapper/ram0p1
meta-data=/dev/mapper/ram0p1     isize=256    agcount=4, agsize=66266 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=265064, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Now, lo and behold, the partition is gone!

fdisk /dev/ram0
p

will show no partitions. You can also check with

dd if=/dev/ram0 bs=1M count=1 | hexdump -C

The whole first MB of /dev/ram0 is zeroed! Also

mount /dev/ram0p1 /mnt

will fail: unknown filesystem.

I think this shows 2 bugs: first, mkfs.xfs writes before the beginning of the device it was asked to operate on. Second, device mapper does not constrain access to within the boundaries of the mapped device, which I think it should.

Then I have 2 more bugs for you. Please see my thread in linux-rdma called "NFS-RDMA hangs: connection closed (-103)", in particular this post:
http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg06632.html

With NFS over <RDMA or IPoIB> over Infiniband over XFS over ramdisk it is possible to write a file (2.3GB) which is larger than the size of the device (1.5GB). One bug I think is for the XFS people (because I think XFS should check whether the filesystem has run out of space), and the other I think is for the /dev/ram people (what mailing list? I am adding lkml), because I think the device should reject writes beyond its end.

Thank you

PS: I am not subscribed to lkml so please do not reply ONLY to lkml.
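[A quick way to confirm that the start of the ramdisk was really overwritten, rather than the kernel merely holding a stale partition table, is to compare the first megabyte before and after mkfs.xfs and then ask the kernel to re-read the partition table. This is only an illustrative sketch of that check; the device names follow the reproduction above, but the exact commands are not part of the original report.]

# Sketch: does mkfs.xfs on /dev/mapper/ram0p1 clobber the start of /dev/ram0?
# (destructive test, run only on a throwaway ramdisk)
fdisk -l /dev/ram0                       # partition table before
dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum > /tmp/ram0.before

mkfs.xfs -f /dev/mapper/ram0p1           # format the kpartx mapping

dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum > /tmp/ram0.after
diff /tmp/ram0.before /tmp/ram0.after || echo "first MB of /dev/ram0 changed"

blockdev --rereadpt /dev/ram0            # rule out a stale in-kernel view
fdisk -l /dev/ram0                       # partition table after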
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Christoph Hellwig
Date: 2010-12-02 14:11 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
option must never be enabled, as it causes block devices to be
randomly renumbered. Together with the ramdisk driver overloading
the BLKFLSBUF ioctl to discard all data, it guarantees data loss
like yours.
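[For anyone following along, checking whether that option is set on a running kernel is a one-liner; this assumes the usual places a kernel config is exposed (a distro config file under /boot, or /proc/config.gz when CONFIG_IKCONFIG_PROC is enabled).]

# Check whether CONFIG_DEBUG_BLOCK_EXT_DEVT is enabled on the running kernel.
grep CONFIG_DEBUG_BLOCK_EXT_DEVT /boot/config-"$(uname -r)" 2>/dev/null \
  || zgrep CONFIG_DEBUG_BLOCK_EXT_DEVT /proc/config.gz 2>/dev/null \
  || echo "kernel config not exposed on this system"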
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-02 14:14 UTC
To: Christoph Hellwig
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> option must never be enabled, as it causes block devices to be
> randomly renumbered. Together with the ramdisk driver overloading
> the BLKFLSBUF ioctl to discard all data, it guarantees data loss
> like yours.

Nope...

# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Christoph Hellwig
Date: 2010-12-02 14:17 UTC
To: Spelic
Cc: Christoph Hellwig, linux-lvm, linux-kernel@vger.kernel.org, xfs

On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> > I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> > option must never be enabled, as it causes block devices to be
> > randomly renumbered. Together with the ramdisk driver overloading
> > the BLKFLSBUF ioctl to discard all data, it guarantees data loss
> > like yours.
>
> Nope...
>
> # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set

Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
effect.
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Mike Snitzer
Date: 2010-12-02 21:22 UTC
To: LVM general discussion and development
Cc: npiggin, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Thu, Dec 02 2010 at 9:17am -0500, Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> > On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> > > I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> > > option must never be enabled, as it causes block devices to be
> > > randomly renumbered. Together with the ramdisk driver overloading
> > > the BLKFLSBUF ioctl to discard all data, it guarantees data loss
> > > like yours.
> >
> > Nope...
> >
> > # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
>
> Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
> effect.

For the benefit of others:
- mkfs.xfs will avoid sending BLKFLSBUF to any device whose major is
  ramdisk's major; this dates back to 2004:
  http://oss.sgi.com/archives/xfs/2004-08/msg00463.html
- but because a kpartx partition overlay (a linear DM mapping) is used
  for the /dev/ram0p1 device, mkfs.xfs only sees a device with DM's major
- so mkfs.xfs sends BLKFLSBUF to the DM device, blissfully unaware that
  the backing device (behind the DM linear target) is a brd device
- DM forwards the BLKFLSBUF ioctl to brd, which triggers
  drivers/block/brd.c:brd_ioctl (nuking the entire ramdisk in the process)

So, coming full circle, this is what hch was referring to when he mentioned:
1) "ramdisk driver overloading the BLKFLSBUF ioctl ..."
2) "dm-linear's dumb forwarding of ioctls ..."

I really can't see DM adding a specific check for ramdisk's major when
forwarding the BLKFLSBUF ioctl.

brd has direct partition support (see commit d7853d1f8932c), so maybe
kpartx should just blacklist /dev/ram devices?

Alternatively, what about switching brd away from overloading BLKFLSBUF
to a real implementation of BLKDISCARD support in brd.c? One that
doesn't blindly nuke the entire device but properly processes the
discard request.

Mike
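[The forwarding chain Mike describes can be exercised without mkfs.xfs at all, since util-linux's `blockdev --flushbufs` issues the same BLKFLSBUF ioctl. The sketch below is illustrative only (and destructive to the ramdisk contents), assuming the same /dev/ram0 + kpartx setup as in the original report.]

# Sketch: show that BLKFLSBUF sent to the dm-linear mapping reaches brd,
# which drops every page of the backing ramdisk.
kpartx -av /dev/ram0                        # creates /dev/mapper/ram0p1
dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum   # contains the partition table

blockdev --flushbufs /dev/mapper/ram0p1     # sends BLKFLSBUF to the DM device

dd if=/dev/ram0 bs=1M count=1 2>/dev/null | md5sum   # now reads back as zeroes
fdisk -l /dev/ram0                          # partition table is gone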
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Mike Snitzer
Date: 2010-12-02 22:08 UTC
To: LVM general discussion and development
Cc: tytso, npiggin, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Thu, Dec 02 2010 at 4:22pm -0500, Mike Snitzer <snitzer@redhat.com> wrote:

> [...]
>
> Alternatively, what about switching brd away from overloading BLKFLSBUF
> to a real implementation of BLKDISCARD support in brd.c? One that
> doesn't blindly nuke the entire device but properly processes the
> discard request.

Hmm, any chance we could revisit this approach?
http://lkml.indiana.edu/hypermail/linux/kernel/0405.3/0998.html
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Nick Piggin
Date: 2010-12-03 17:11 UTC
To: Mike Snitzer
Cc: npiggin, LVM general discussion and development, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Thu, Dec 02, 2010 at 04:22:27PM -0500, Mike Snitzer wrote:
> [...]
>
> I really can't see DM adding a specific check for ramdisk's major when
> forwarding the BLKFLSBUF ioctl.
>
> brd has direct partition support (see commit d7853d1f8932c), so maybe
> kpartx should just blacklist /dev/ram devices?
>
> Alternatively, what about switching brd away from overloading BLKFLSBUF
> to a real implementation of BLKDISCARD support in brd.c? One that
> doesn't blindly nuke the entire device but properly processes the
> discard request.

Yeah, the situation really sucks (mkfs.jfs doesn't work on ramdisk for
the same reason).

Unfortunately I want to keep the existing ioctl behaviour for
compatibility, but adding new, saner ones would be welcome. It would
also be fine to have a non-default config option or load-time parameter
for brd to skip the special case, if that would help testing on older
userspace.

DISCARD is actually a problem for brd. To get proper correctness, you
need to preload brd with pages; otherwise, when doing stress tests, I/O
can require memory allocations and deadlock. If we add a discard that
frees pages, that reintroduces the same problem.

If you find any option useful for testing, however, patches are fine --
brd is pretty much only useful for testing nowadays.
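[Nick's point about preloading is worth spelling out for anyone using brd for stress testing: touch every page of the ramdisk once before the test so no allocation has to happen on the I/O path. One simple, illustrative way to do that from userspace (not taken from the thread) is just to write the whole device once up front:]

# Preload every brd page so later I/O never has to allocate memory.
# brd allocates backing pages on first write, so writing zeroes once
# across the whole device is enough to fault them all in.
SIZE=$(blockdev --getsize64 /dev/ram0)
dd if=/dev/zero of=/dev/ram0 bs=1M oflag=direct count=$(( SIZE / 1048576 ))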
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Ted Ts'o
Date: 2010-12-03 18:15 UTC
To: Nick Piggin
Cc: Mike Snitzer, LVM general discussion and development, linux-kernel@vger.kernel.org, xfs, Christoph Hellwig, dm-devel, Spelic

On Sat, Dec 04, 2010 at 04:11:40AM +1100, Nick Piggin wrote:
> > Alternatively, what about switching brd away from overloading BLKFLSBUF
> > to a real implementation of BLKDISCARD support in brd.c? One that
> > doesn't blindly nuke the entire device but properly processes the
> > discard request.
>
> Yeah, the situation really sucks (mkfs.jfs doesn't work on ramdisk for
> the same reason).
>
> Unfortunately I want to keep the existing ioctl behaviour for
> compatibility, but adding new, saner ones would be welcome. [...]

How many programs actually depend on BLKFLSBUF dropping the pages used
in /dev/ram? The fact that it did this at all was a historical accident
of how the original /dev/ram was implemented (directly in the buffer
cache), and not anything that was intended.

I think that's something we should be able to fix, since the number of
programs that knowingly operate on the ramdisk is quite small: just a
few system programs used by distributions in their early boot
scripts...

So I would argue for dropping the "special" behavior of BLKFLSBUF for
/dev/ram.

 - Ted
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-02 14:14 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

Sorry for replying to my own email already; one more thing on the 3rd bug.

On 12/02/2010 02:55 PM, Spelic wrote:
> Hello all
> [CUT]
> .......
> With NFS over <RDMA or IPoIB> over Infiniband over XFS over ramdisk it
> is possible to write a file (2.3GB) which is larger than

This is also reproducible with NFS over TCP over Ethernet over XFS over
ramdisk; you don't need Infiniband for this. With Ethernet it doesn't
hang (that's another bug, for the RDMA people, in the other thread),
but the file is still 1.9GB, i.e. larger than the device.

Look, after running the test over Ethernet, at the server side:

# ll -h /mnt/ram
total 1.5G
drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
-rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

# mount
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
devtmpfs on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
nfsd on /proc/fs/nfsd type nfsd (rw)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)
/dev/ram0 on /mnt/ram type xfs (rw)

# blockdev --getsize64 /dev/ram0
1610612736

# dd if=/mnt/ram/zerofile | wc -c
3878784+0 records in
3878784+0 records out
1985937408 bytes (2.0 GB) copied, 6.57081 s, 302 MB/s
1985937408

Feel free to forward this to the NFS mailing list as well if you think
it's appropriate.

Thank you
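[For anyone who wants to reproduce the Ethernet variant, the setup described above boils down to exporting the XFS-on-ramdisk mount over NFS and writing past its capacity from a client. The export options, hostnames, and mount points below are assumptions for illustration, not details taken from the original report.]

# --- server side (paths/options illustrative) ---
mkfs.xfs -f /dev/ram0
mount /dev/ram0 /mnt/ram
exportfs -o rw,no_root_squash '*:/mnt/ram'
blockdev --getsize64 /dev/ram0            # 1610612736 (~1.5GB) in the report

# --- client side ---
mount -t nfs server:/mnt/ram /mnt/nfsram
dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M   # runs well past the device size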
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Dave Chinner
Date: 2010-12-02 23:07 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On Thu, Dec 02, 2010 at 03:14:39PM +0100, Spelic wrote:
> Sorry for replying to my own email already; one more thing on the 3rd bug.
>
> On 12/02/2010 02:55 PM, Spelic wrote:
> > Hello all
> > [CUT]
> > .......
> > With NFS over <RDMA or IPoIB> over Infiniband over XFS over
> > ramdisk it is possible to write a file (2.3GB) which is larger
> > than
>
> This is also reproducible with NFS over TCP over Ethernet over XFS over
> ramdisk; you don't need Infiniband for this. With Ethernet it doesn't
> hang (that's another bug, for the RDMA people, in the other thread),
> but the file is still 1.9GB, i.e. larger than the device.
>
> Look, after running the test over Ethernet, at the server side:
>
> # ll -h /mnt/ram
> total 1.5G
> drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
> drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
> -rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

This is a classic ENOSPC vs NFS client writeback overcommit caching
issue. Have a look at the block map output - I bet there are holes in
the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
to check this; du should tell you the same thing.

Basically, the NFS client overcommits the server filesystem's space by
doing local writeback caching. Hence it caches 1.9GB of data before
it gets the first ENOSPC error back from the server at around 1.5GB
of written data. At that point, the data that got ENOSPC errors is
tossed by the NFS client, and an ENOSPC error is placed on the
address space to be reported to the next write/sync call. That
reaches the dd process when it is 1.9GB into the write. However,
there is still (in this case) 400MB of dirty data in the NFS client
cache that it will try to write to the server.

Because XFS uses speculative preallocation and reserves some space for
metadata allocation during delayed allocation, its handling of the
initial ENOSPC condition can result in some space being freed up again,
as unused reserved metadata space is returned to the free pool while
delayed allocations are converted during server writeback. This usually
takes a second or two to complete. As a result, shortly after the first
ENOSPC has been reported and subsequent writes have also hit ENOSPC,
space can be freed up and another write will succeed. At that point,
the write that succeeds will be at a different offset from the last one
that succeeded, leaving a hole in the file and moving the EOF well past
1.5GB. That goes on until there really is no space left at all or the
NFS client has no more dirty data to send.

Basically, what you see is not a bug in XFS; it is a result of NFS
clients being able to overcommit server filesystem space, and the
interaction that has with the way the filesystem on the NFS server
handles ENOSPC.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
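[Dave's suggested check (apparent size versus allocated blocks, plus the extent map) can be done in one short pass. A minimal sketch; the path matches the report, and xfs_bmap is only meaningful on XFS:]

F=/mnt/ram/zerofile
ls -ls "$F"                      # first column: allocated size, vs. apparent size
du -h --apparent-size "$F"       # apparent size
du -h "$F"                       # blocks actually consumed
stat -c 'apparent=%s bytes, allocated=%b blocks of %B bytes' "$F"
xfs_bmap -v "$F" | grep -c hole  # number of holes in the extent map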
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Spelic
Date: 2010-12-03 14:07 UTC
To: Dave Chinner
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On 12/03/2010 12:07 AM, Dave Chinner wrote:
> This is a classic ENOSPC vs NFS client writeback overcommit caching
> issue. Have a look at the block map output - I bet there are holes in
> the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
> to check this; du should tell you the same thing.

Yes, you are right!

root@server:/mnt/ram# ll -h
total 1.5G
drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
-rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

root@server:/mnt/ram# ls -lsh
total 1.5G
1.5G -rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

(it's a sparse file)

root@server:/mnt/ram# xfs_bmap zerofile
zerofile:
 0: [0..786367]: 786496..1572863
 1: [786368..1572735]: 2359360..3145727
 2: [1572736..2232319]: 1593408..2252991
 3: [2232320..2529279]: 285184..582143
 4: [2529280..2531327]: hole
 5: [2531328..2816407]: 96..285175
 6: [2816408..2971511]: 582144..737247
 7: [2971512..2971647]: hole
 8: [2971648..2975183]: 761904..765439
 9: [2975184..2975743]: hole
10: [2975744..2975751]: 765440..765447
11: [2975752..2977791]: hole
12: [2977792..2977799]: 765480..765487
13: [2977800..2979839]: hole
14: [2979840..2979847]: 765448..765455
15: [2979848..2981887]: hole
16: [2981888..2981895]: 765472..765479
17: [2981896..2983935]: hole
18: [2983936..2983943]: 765456..765463
19: [2983944..2985983]: hole
20: [2985984..2985991]: 765464..765471
21: [2985992..3202903]: hole
22: [3202904..3215231]: 737248..749575
23: [3215232..3239767]: hole
24: [3239768..3252095]: 774104..786431
25: [3252096..3293015]: hole
26: [3293016..3305343]: 749576..761903
27: [3305344..3370839]: hole
28: [3370840..3383167]: 2252992..2265319
29: [3383168..3473239]: hole
30: [3473240..3485567]: 2265328..2277655
31: [3485568..3632983]: hole
32: [3632984..3645311]: 2277656..2289983
33: [3645312..3866455]: hole
34: [3866456..3878783]: 2289984..2302311

(many delayed allocation extents could not be filled because the space
on the device ran out)

However ...

> Basically, the NFS client overcommits the server filesystem's space by
> doing local writeback caching. Hence it caches 1.9GB of data before
> it gets the first ENOSPC error back from the server at around 1.5GB
> of written data. At that point, the data that got ENOSPC errors is
> tossed by the NFS client, and an ENOSPC error is placed on the
> address space to be reported to the next write/sync call. That
> reaches the dd process when it is 1.9GB into the write.

I'm no great expert, but isn't this a design flaw in NFS?

Ok, in this case we were lucky it was all zeroes, so XFS made a sparse
file and could fit 1.9GB into a 1.5GB device. In general, with nonzero
data, it seems to me you will get data corruption, because the NFS
client thinks it has written the data while the NFS server really can't
write more data than the device size.

It's nice that the NFS server does local writeback caching, but it
should also cache the filesystem's free space (and check it
periodically, since nfs-server is presumably not the only process
writing to that filesystem) so that it doesn't accept more data than it
can really write. Alternatively, when free space drops below 1GB (or a
reasonable size based on network speed), nfs-server should turn off
filesystem writeback caching.

I can't repeat the test with urandom because it's too slow (8MB/sec!?).
How come Linux hasn't got a "uurandom" device capable of e.g. 400MB/sec
with only very weak randomness?

But I have repeated the test over Ethernet with a bunch of symlinks to
a 100MB file created from urandom.

At the client side:

# time cat randfile{001..020} | pv -b > /mnt/nfsram/randfile
1.95GB

real    0m22.978s
user    0m0.310s
sys     0m5.360s

At the server side:

# ls -lsh ram
total 1.5G
1.5G -rw-r--r-- 1 root root 1.7G 2010-12-03 14:43 randfile

# xfs_bmap ram/randfile
ram/randfile:
 0: [0..786367]: 786496..1572863
 1: [786368..790527]: 96..4255
 2: [790528..1130495]: hole
 3: [1130496..1916863]: 2359360..3145727
 4: [1916864..2682751]: 1593408..2359295
 5: [2682752..3183999]: 285184..786431
 6: [3184000..3387207]: 4256..207463
 7: [3387208..3387391]: hole
 8: [3387392..3391567]: 207648..211823
 9: [3391568..3393535]: hole
10: [3393536..3393543]: 211824..211831
11: [3393544..3395583]: hole
12: [3395584..3395591]: 211832..211839
13: [3395592..3397631]: hole
14: [3397632..3397639]: 211856..211863
15: [3397640..3399679]: hole
16: [3399680..3399687]: 211848..211855
17: [3399688..3401727]: hole
18: [3401728..3409623]: 221984..229879

# dd if=/mnt/ram/randfile | wc -c
3409624+0 records in
3409624+0 records out
1745727488 bytes (1.7 GB) copied, 5.72443 s, 305 MB/s
1745727488

The file is still sparse, and this time it certainly has data
corruption (the holes will be read back as zeroes). I understand that
the client receives an Input/output error when this condition is hit,
but the file written at the server side has an apparent size of 1.8GB
while the valid data in it is not 1.8GB. Is that good semantics?
Wouldn't it be better for nfs-server to turn off writeback caching when
it approaches a disk-full situation?

And then I see another problem: as you can see, xfs_bmap shows lots of
holes, even with the random file (it was taken from urandom, so you can
be sure it hasn't got many zeroes), already from offset 790528 sectors,
which is far from the disk-full point...

First I checked that this does not happen when pushing less than 1.5GB
of data. It does not. Then I tried with exactly 15x100MB (the files are
100MB symlinks to a file created with
dd if=/dev/urandom of=randfile.rnd bs=1M count=100) and this happened:

Client side:

# time cat randfile{001..015} | pv -b > /mnt/nfsram/randfile
1.46GB

real    0m18.265s
user    0m0.260s
sys     0m4.460s

(Please note: no I/O error at the client side!
blockdev --getsize64 /dev/ram0 == 1610612736)

Server side:

# ls -ls ram
total 1529676
1529676 -rw-r--r-- 1 root root 1571819520 2010-12-03 14:51 randfile

# dd if=/mnt/ram/randfile | wc -c
3069960+0 records in
3069960+0 records out
1571819520 bytes (1.6 GB) copied, 5.30442 s, 296 MB/s
1571819520

# xfs_bmap ram/randfile
ram/randfile:
 0: [0..112639]: 96..112735
 1: [112640..208895]: 114784..211039
 2: [208896..399359]: 285184..475647
 3: [399360..401407]: 112736..114783
 4: [401408..573439]: 475648..647679
 5: [573440..937983]: 786496..1151039
 6: [937984..1724351]: 2359360..3145727
 7: [1724352..2383871]: 1593408..2252927
 8: [2383872..2805695]: 1151040..1572863
 9: [2805696..2944447]: 647680..786431
10: [2944448..2949119]: 211040..215711
11: [2949120..3055487]: 2252928..2359295
12: [3055488..3058871]: 215712..219095
13: [3058872..3059711]: hole
14: [3059712..3060143]: 219936..220367
15: [3060144..3061759]: hole
16: [3061760..3061767]: 220368..220375
17: [3061768..3063807]: hole
18: [3063808..3063815]: 220376..220383
19: [3063816..3065855]: hole
20: [3065856..3065863]: 220384..220391
21: [3065864..3067903]: hole
22: [3067904..3067911]: 220392..220399
23: [3067912..3069951]: hole
24: [3069952..3069959]: 220400..220407

Holes in a random file! This is data corruption, and nobody is notified
of it: no error at the client side or the server side! Is that good
semantics? How could the client get notified of this? Some kind of
fsync, maybe?

Thank you
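[Since neither the client nor the server reported an error in the 15x100MB case, an end-to-end integrity check is the only way to notice that the file is full of holes. A minimal sketch that would have caught this silent corruption by comparing a checksum of what was sent against what actually landed on the server; paths as in the test above:]

# Client side: checksum the data stream as it is written to the NFS mount.
cat randfile{001..015} | tee /mnt/nfsram/randfile | md5sum > /tmp/sent.md5

# Server side: checksum what was actually stored on the XFS-on-ramdisk export.
md5sum /mnt/ram/randfile > /tmp/stored.md5

# Compare the checksum fields; a mismatch means silent corruption.
awk '{print $1}' /tmp/sent.md5 /tmp/stored.md5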
* Re: [linux-lvm] Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram
From: Dave Chinner
Date: 2010-12-06 4:09 UTC
To: Spelic
Cc: linux-lvm, linux-kernel@vger.kernel.org, xfs

On Fri, Dec 03, 2010 at 03:07:58PM +0100, Spelic wrote:
> On 12/03/2010 12:07 AM, Dave Chinner wrote:
> > This is a classic ENOSPC vs NFS client writeback overcommit caching
> > issue. Have a look at the block map output - I bet there are holes in
> > the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
> > to check this; du should tell you the same thing.
>
> Yes, you are right!
....
> root@server:/mnt/ram# xfs_bmap zerofile
> zerofile:
....
> 30: [3473240..3485567]: 2265328..2277655
> 31: [3485568..3632983]: hole
> 32: [3632984..3645311]: 2277656..2289983
> 33: [3645312..3866455]: hole
> 34: [3866456..3878783]: 2289984..2302311
>
> (many delayed allocation extents could not be filled because the space
> on the device ran out)
>
> However ...
>
> > Basically, the NFS client overcommits the server filesystem's space by
> > doing local writeback caching. Hence it caches 1.9GB of data before
> > it gets the first ENOSPC error back from the server at around 1.5GB
> > of written data. At that point, the data that got ENOSPC errors is
> > tossed by the NFS client, and an ENOSPC error is placed on the
> > address space to be reported to the next write/sync call. That
> > reaches the dd process when it is 1.9GB into the write.
>
> I'm no great expert, but isn't this a design flaw in NFS?

Yes, sure is.

[ Well, to be precise, the original NFSv2 specification didn't have this
flaw because all writes were synchronous. NFSv3 introduced asynchronous
writes (writeback caching) and with it this problem. NFSv4 does not fix
this flaw. ]

> Ok, in this case we were lucky it was all zeroes, so XFS made a sparse
> file and could fit 1.9GB into a 1.5GB device.
>
> In general, with nonzero data, it seems to me you will get data
> corruption, because the NFS client thinks it has written the data
> while the NFS server really can't write more data than the device
> size.

Yup, well known issue. Simple rule: don't run your NFS server out of
space.

> It's nice that the NFS server does local writeback caching, but it
> should also cache the filesystem's free space (and check it
> periodically, since nfs-server is presumably not the only process
> writing to that filesystem) so that it doesn't accept more data than
> it can really write. Alternatively, when free space drops below 1GB
> (or a reasonable size based on network speed), nfs-server should
> turn off filesystem writeback caching.

This isn't an NFS server problem, or one that can be worked around at
the server. It's an NFS _client_ problem, in that the client does not
get synchronous ENOSPC errors when using writeback caching. There is no
way for the NFS client to know the server is near ENOSPC conditions
prior to writing the data to the server, as clients operate
independently.

If you really want your NFS clients to behave correctly when the server
goes ENOSPC, turn off writeback caching at the client side, not the
server (i.e. use sync mounts on the client side). Write performance
will suck, but if you want sane ENOSPC behaviour...

.....

> Holes in a random file!
> This is data corruption, and nobody is notified of it: no error at
> the client side or the server side!
> Is that good semantics? How could the client get notified of this?
> Some kind of fsync, maybe?

Use wireshark to determine if the server sends an ENOSPC to the client
when the first background write fails. I bet it does, and that your dd
write failed with ENOSPC too. Something stopped it writing at 1.9GB....

What happens to the remaining cached writeback data in the NFS client
once the server runs out of space is NFS-client-specific behaviour. If
you end up with only bits of the file on the server, then that's a
result of NFS client behaviour, not an NFS server problem.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
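[To do what Dave suggests, i.e. confirm on the wire whether the server ever returns NFS3ERR_NOSPC, it is enough to capture the NFS traffic during the test and filter for non-zero status codes. The capture commands below are generic; the display-filter field name is an assumption about Wireshark's NFS dissector and may need adjusting for your version.]

# Capture NFS traffic (port 2049) on either end while re-running the test.
tcpdump -i eth0 -s 0 -w /tmp/nfs-enospc.pcap port 2049

# Afterwards, look for WRITE/COMMIT replies with a non-zero NFSv3 status
# (NFS3ERR_NOSPC is 28). Field and option names assumed; verify locally.
tshark -r /tmp/nfs-enospc.pcap -Y 'nfs.nfsstat3 != 0'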
* [linux-lvm] NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)
From: Spelic
Date: 2010-12-06 12:20 UTC
To: Dave Chinner
Cc: linux-nfs, linux-lvm, linux-kernel@vger.kernel.org, xfs

On 12/06/2010 05:09 AM, Dave Chinner wrote:
> > [Files become sparse at the nfs-server side upon hitting ENOSPC if
> > the NFS client uses local writeback caching]
> >
> > It's nice that the NFS server does local writeback caching, but it
> > should also cache the filesystem's free space (and check it
> > periodically, since nfs-server is presumably not the only process
> > writing to that filesystem) so that it doesn't accept more data than
> > it can really write. Alternatively, when free space drops below 1GB
> > (or a reasonable size based on network speed), nfs-server should
> > turn off filesystem writeback caching.
>
> This isn't an NFS server problem, or one that can be worked around at
> the server. It's an NFS _client_ problem, in that the client does not
> get synchronous ENOSPC errors when using writeback caching. There is
> no way for the NFS client to know the server is near ENOSPC conditions
> prior to writing the data to the server, as clients operate
> independently.
>
> If you really want your NFS clients to behave correctly when the
> server goes ENOSPC, turn off writeback caching at the client side, not
> the server (i.e. use sync mounts on the client side). Write
> performance will suck, but if you want sane ENOSPC behaviour...

[adding the NFS ML in cc]

Thank you for your very clear explanation.

Going without writeback caching is a problem (write performance
suffers, as you say), but guaranteeing never to reach ENOSPC is also
hardly feasible, especially if humans are logged in at the client side
doing "whatever they want".

I would suggest that either the NFS client poll to see whether the
server is near ENOSPC and, if so, disable writeback caching; or the
server do the polling and, when it finds a near-ENOSPC condition, send
a specific message to the clients to warn them so that they can disable
caching. Doing it at the client side wouldn't change the NFS protocol
and could be good enough if one can specify how often free space should
be polled and what the free-space threshold is. Or, with just one
value: specify the maximum speed at which the server disk can fill (the
next polling interval can be inferred from the current free space), and
perhaps also a minimum polling period (just in case).

Regarding the last part of the email, perhaps I was not clear:

> .....
>
> > Holes in a random file!
> > This is data corruption, and nobody is notified of it: no error at
> > the client side or the server side!
> > Is that good semantics? How could the client get notified of this?
> > Some kind of fsync, maybe?
>
> Use wireshark to determine if the server sends an ENOSPC to the client
> when the first background write fails. I bet it does, and that your dd
> write failed with ENOSPC too. Something stopped it writing at 1.9GB....

No: in that case I had written 15x100MB, which was more than the
available space but less than available + writeback cache. So "cat"
finished by itself and never got an ENOSPC error, yet the data never
reached the disk at the other side.

However, today I found that by using fsync the problem is fortunately
detected:

# time cat randfile{001..015} | pv -b | dd conv=fsync of=/mnt/nfsram/randfile
1.46GB
dd: fsync failed for `/mnt/nfsram/randfile': Input/output error
3072000+0 records in
3072000+0 records out
1572864000 bytes (1.6 GB) copied, 20.9101 s, 75.2 MB/s

real    0m21.364s
user    0m0.470s
sys     0m11.440s

So, ok, I understand that processes needing guarantees about written
data should use fsync/fdatasync (which is good practice on a local
filesystem too, actually...).

Thank you
* Re: [linux-lvm] NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)
From: Trond Myklebust
Date: 2010-12-06 13:33 UTC
To: Spelic
Cc: linux-nfs, linux-lvm, Dave Chinner, linux-kernel@vger.kernel.org, xfs

On Mon, 2010-12-06 at 13:20 +0100, Spelic wrote:
> [...]
>
> I would suggest that either the NFS client poll to see whether the
> server is near ENOSPC and, if so, disable writeback caching; or the
> server do the polling and, when it finds a near-ENOSPC condition, send
> a specific message to the clients to warn them so that they can
> disable caching. Doing it at the client side wouldn't change the NFS
> protocol and could be good enough if one can specify how often free
> space should be polled and what the free-space threshold is. Or, with
> just one value: specify the maximum speed at which the server disk can
> fill (the next polling interval can be inferred from the current free
> space), and perhaps also a minimum polling period (just in case).

You can just as easily do this at the application level. The kernel
can't do it any more reliably than the application can, so there really
is no point in doing it there.

We already ensure that when the server does send us an error, we switch
to synchronous operation until the error clears.

Trond