correct use of vmtruncate()?

linux-fsdevel.vger.kernel.org archive mirror

* correct use of vmtruncate()?
@ 2008-04-29 10:06 David Chinner
  2008-04-29 17:10 ` Zach Brown
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: David Chinner @ 2008-04-29 10:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, xfs-oss

Folks,

It appears to me that vmtruncate() is not used correctly in
block_write_begin() and friends. The short summary is that it
appears that the usage in these functions implies that vmtruncate()
should cause truncation of blocks on disk but no filesystem
appears to do this, nor does the documentation imply they should.

The longer story now.

For as long as I've worked on XFS we've had intermittent ASSERT
failures when tearing down inodes or doing direct I/O where inodes
have delayed allocation extents still attached to them where they
shouldn't.  Because the ASSERT failure occurs so long after the
underlying problem, and only once in a blue moon, it's been
extremely difficult to track down.

Lucky for me, I had my main test box start to fall over the problem
reliably last week. [ I say lucky, because a customer started to
trip over a different symptom of the same problem reliably about a
week before that and I have no idea what changed in my code base
to make it trigger on every run. ]

The problem centres on this piece of the debug trace pulled
from KDB after the system died with an ASSERT:

PAGE INVALIDATE:
ip 0xe0000038805cc600 inode 0xe0000038805bb980 page 0xa07fffffdf8b5180
pgoff 0x0 di_size 0x026b000 isize 0x026b000 offset 0x0320000
delalloc 1 unmapped 0 unwritten 0 pid 2930
^^^^^^^^^^

PAGE RELEASE:
ip 0xe0000038805cc600 inode 0xe0000038805bb980 page 0xa07fffffdf8b5180
pgoff 0x0 di_size 0x026b000 isize 0x026b000 offset 0x0320000
delalloc 0 unmapped 1 unwritten 0 pid 2930
^^^^^^^^^^

When ->invalidate_page is called, we have a delalloc extent on the page,
but by the time ->release_page is called, the delalloc extent is gone.
The code path is:

        ->invalidate_page
          xfs_vm_invalidatepage
            block_invalidatepage
>>>>>         discard_buffer
              try_to_release_page
                ->release_page
                  xfs_vm_releasepage

The key point here is that in this code path, discard_buffer() is
called on all the buffers on the page being invalidated. That is, we
do this to them:

static void discard_buffer(struct buffer_head * bh)
{
        lock_buffer(bh);
        clear_buffer_dirty(bh);
        bh->b_bdev = NULL;
        clear_buffer_mapped(bh);
        clear_buffer_req(bh);
        clear_buffer_new(bh);
        clear_buffer_delay(bh);
        clear_buffer_unwritten(bh);
        unlock_buffer(bh);
}

We *clear* the delalloc state from the buffers on the page, and hence
we lose that state before we get to xfs_vm_releasepage(). It also
makes the buffers appear unmapped, so just removing the
clear_buffer_delay(bh) is not sufficient to let us know this
is a delalloc buffer without changing other code.
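
To make the state loss concrete, here is a minimal user-space sketch. It
is not kernel code: the toy_* names and the TOY_BH_* bit values are made
up for illustration, and the "unmapped" check is a simplification rather
than what xfs_count_page_state() really tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the buffer_head state bits discussed above. */
enum toy_bh_state {
        TOY_BH_DIRTY     = 1 << 0,
        TOY_BH_MAPPED    = 1 << 1,
        TOY_BH_DELAY     = 1 << 2,
        TOY_BH_UNWRITTEN = 1 << 3,
};

struct toy_bh {
        unsigned int state;
};

/* Same effect as discard_buffer() on this model: every state bit goes. */
void toy_discard_buffer(struct toy_bh *bh)
{
        assert(bh);
        bh->state &= ~(TOY_BH_DIRTY | TOY_BH_MAPPED |
                       TOY_BH_DELAY | TOY_BH_UNWRITTEN);
}

/* What a release hook can learn by looking at the buffer alone. */
bool toy_looks_delalloc(const struct toy_bh *bh)
{
        return (bh->state & TOY_BH_DELAY) != 0;
}

/* Simplified: "unmapped" here means no mapping-related bit is set. */
bool toy_looks_unmapped(const struct toy_bh *bh)
{
        return (bh->state &
                (TOY_BH_MAPPED | TOY_BH_DELAY | TOY_BH_UNWRITTEN)) == 0;
}
```

A buffer that starts as TOY_BH_DIRTY | TOY_BH_DELAY reports delalloc
before toy_discard_buffer() and unmapped afterwards, which is the same
flip seen between the PAGE INVALIDATE and PAGE RELEASE trace entries.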

The result is that xfs_vm_releasepage() is unable to convert those
extents to real extents (beyond eof) because it can't tell they
exist by looking at the bufferhead state. Hence if we then extend
the file again later we can trip over these delalloc extents. If
it's buffered I/O, it's ok. If it's inode reclaim, then we ASSERT
fail. If it's direct I/O, we BUG_ON() in __xfs_get_blocks. If it's
hole punching, then we ASSERT fail there. Pain, pain and more pain.

IOWs, the current path through vmtruncate into XFS and releasing the
page does no truncation at all - in fact ->releasepage *allocates*
delayed extents, as its semantics imply that the caller will write
the page out and needs the blocks allocated.

The question I was asking now was "how the hell do we get to
->invalidate_page call with an active extent without having a
matching extent removal operation from the filesystem to clean
up?"

The key to solving the problem came from this ASSERT failure on a very
new inode during a hole punch:

Assertion failed: imap.br_startblock != DELAYSTARTBLOCK, file: fs/xfs/xfs_vnodeops.c, line: 3619
....
 [<a0000001003f7920>] assfail+0x60/0x80
                                sp=e00000381a0dfc40 bsp=e00000381a0d10b8
 [<a0000001003cbdf0>] xfs_zero_remaining_bytes+0x2f0/0x560
                                sp=e00000381a0dfc40 bsp=e00000381a0d1050
 [<a0000001003cc7d0>] xfs_free_file_space+0x770/0xbc0
                                sp=e00000381a0dfc90 bsp=e00000381a0d0fc8
 [<a0000001003d1ac0>] xfs_change_file_space+0x320/0x6a0
                                sp=e00000381a0dfd10 bsp=e00000381a0d0f78
 [<a0000001003e9630>] xfs_ioc_space+0x1b0/0x1e0
                                sp=e00000381a0dfdb0 bsp=e00000381a0d0f30
 [<a0000001003ebff0>] xfs_ioctl+0x6b0/0x1260
                                sp=e00000381a0dfde0 bsp=e00000381a0d0ee0
 [<a0000001003e7db0>] xfs_file_ioctl+0x50/0xe0
                                sp=e00000381a0dfe10 bsp=e00000381a0d0e98
 [<a0000001001805f0>] vfs_ioctl+0x90/0x180
                                sp=e00000381a0dfe10 bsp=e00000381a0d0e58
 [<a000000100181060>] do_vfs_ioctl+0x980/0xa00
                                sp=e00000381a0dfe10 bsp=e00000381a0d0e10
 [<a000000100181140>] sys_ioctl+0x60/0xc0
                                sp=e00000381a0dfe20 bsp=e00000381a0d0d90
....

And the trace:

[1]kdb> xexlist 0xe000003880774e00
inode 0xe000003880774e00 df extents 0xe000003880774e80 nextents 0x1
0: startoff 41 startblock NULLSTARTBLOCK(5) blockcount 1 flag 0
[1]kdb> xrwtrc 0xe000003880774e00
i_rwtrace = 0xe00000381ca8d4a0
WRITE ENTER:
ip 0xe000003880774e00 size 0x00 ptr 0xe00000381a0dfd30 size 1
io offset 0x029a61 ioflags 0x1 new size 0x048fa3 pid 2939

IOMAP WRITE ENTER:
ip 0xe000003880774e00 size 0x00 offset 0x029000 count 0x1000
io new size 0x048fa3 pid=2939

ALLOC MAP:
ip 0xe000003880774e00 size 0x00 offset 0x029000 count 0x1000
bmapi flags 0x2 <write > iomap off 0x0199a35 delta 0x809f2a00 bsize 0x4815a972 bno 0x0
imap off 0x29 count 0x1 block 0xffffffff

IOMAP WRITE ENTER:
ip 0xe000003880774e00 size 0x00 offset 0x02a000 count 0x1000
io new size 0x048fa3 pid=2939

IOMAP WRITE NOSPACE:
ip 0xe000003880774e00 size 0x00 offset 0x02a000 count 0x1000
io new size 0x048fa3 pid=2939

IOMAP WRITE NOSPACE:
ip 0xe000003880774e00 size 0x00 offset 0x02a000 count 0x1000
io new size 0x048fa3 pid=2939

IOMAP WRITE NOSPACE:
ip 0xe000003880774e00 size 0x00 offset 0x02a000 count 0x1000
io new size 0x048fa3 pid=2939

IOMAP WRITE NOSPACE:
ip 0xe000003880774e00 size 0x00 offset 0x02a000 count 0x1000
io new size 0x048fa3 pid=2939

PAGE INVALIDATE:
ip 0xe000003880774e00 inode 0xe000003880767000 page 0xa07fffffdf875a00
pgoff 0x0 di_size 0x00 isize 0x00 offset 0x020000
delalloc 1 unmapped 0 unwritten 0 pid 2939

PAGE RELEASE:
ip 0xe000003880774e00 inode 0xe000003880767000 page 0xa07fffffdf875a00
pgoff 0x0 di_size 0x00 isize 0x00 offset 0x020000
delalloc 0 unmapped 1 unwritten 0 pid 2939

-----

And a strategically placed dump_stack() call showed
invalidate_page() had come from vmtruncate() via
__block_prepare_write().

IOWs, this trace says that xfs_get_blocks() has returned ENOSPC to
__block_prepare_write() after the first buffer on the page has been
set up for delayed allocation. As a result of this write being
beyond the current EOF, block_write_begin() sees this error and
decides to roll back the entire change by truncating the address
space beyond the old EOF with a call to vmtruncate().

But, as we've already seen, vmtruncate() does not cause removal of
blocks in XFS; only the removal of pages and buffers from the
mapping. IOWs, we've just leaked a delayed allocation extent and
left a landmine that we can step on later.
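
The leak can be modelled in user space. The sketch below is only an
illustration under simplifying assumptions: one flag per file block for
the page cache and one for the filesystem's extent list, with all toy_*
names invented here. It contrasts a rollback that only drops page cache
state with one that also removes the extents.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCKS 8

/* Hypothetical model: two independent layers of per-block state. */
struct toy_inode {
        size_t size_blocks;             /* i_size, in blocks */
        int page_cache[TOY_BLOCKS];     /* 1 = buffer present in the mapping */
        int delalloc[TOY_BLOCKS];       /* 1 = delalloc extent held by the fs */
};

/* get_block-style callout: reserve one block, or return -1 for ENOSPC. */
int toy_get_block(struct toy_inode *ip, size_t blk, size_t *free_blocks)
{
        assert(blk < TOY_BLOCKS);
        if (*free_blocks == 0)
                return -1;
        (*free_blocks)--;
        ip->delalloc[blk] = 1;
        ip->page_cache[blk] = 1;
        return 0;
}

/* vmtruncate-like rollback: only page cache state beyond EOF is dropped. */
void toy_vmtruncate(struct toy_inode *ip)
{
        for (size_t b = ip->size_blocks; b < TOY_BLOCKS; b++)
                ip->page_cache[b] = 0;
}

/* setattr-like rollback: page cache state and extents are both removed. */
void toy_setattr_truncate(struct toy_inode *ip, size_t *free_blocks)
{
        toy_vmtruncate(ip);
        for (size_t b = ip->size_blocks; b < TOY_BLOCKS; b++) {
                if (ip->delalloc[b]) {
                        ip->delalloc[b] = 0;
                        (*free_blocks)++;
                }
        }
}

/* Delalloc extents beyond EOF with no page cache state left to find them. */
size_t toy_leaked(const struct toy_inode *ip)
{
        size_t n = 0;

        for (size_t b = ip->size_blocks; b < TOY_BLOCKS; b++)
                if (ip->delalloc[b] && !ip->page_cache[b])
                        n++;
        return n;
}
```

With one free block, the first toy_get_block() succeeds and the second
fails. After toy_vmtruncate() the model reports one leaked extent; after
toy_setattr_truncate() it reports none and the reservation is returned.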

My understanding is that XFS is behaving correctly with respect to
->invalidate_page and vmtruncate. Looking at the only relevant hit
on vmtruncate() in the Documentation directory, filesystems/Locking
says:

|         ->truncate() is never called directly - it's a callback, not a
| method. It's called by vmtruncate() - library function normally used by
| ->setattr(). Locking information above applies to that call (i.e. is
| inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
| passed).

This implies that vmtruncate() should only be called from within
filesystems when the size of the inode is being changed. In XFS, the
vmtruncate() call is closely followed by the extent removal
transactions, and they are effectively done as an atomic operation
due to the locks that are held at the time. So AFAICT XFS is doing
the right thing here...

[ Indeed, if vmtruncate() were to do the extent removal at this
point in time, XFS would totally suck at removing large files as it
would need to do a transaction per page as opposed to one every two
extents being removed. ]

Hence it seems to me that calling vmtruncate() directly from any
context other than from within a filesystem whilst a size change is
being executed is incorrect use of vmtruncate(). i.e. all the
*write_begin implementations that are used by filesystems that
support multiple blocks per page are broken because they are relying
on vmtruncate() to remove blocks that are allocated via get_block
callouts before the failure occurred.

The obvious fix for this is that block_write_begin() and
friends should be calling ->setattr to do the truncation and hence
follow normal convention for truncating blocks off an inode.
However, even that appears to have thorns. e.g. in XFS we hold the
iolock exclusively when we call block_write_begin(), but it is not
held in all cases where ->setattr is currently called. Hence calling
->setattr from block_write_begin in this failure case will deadlock
unless we also pass a "nolock" flag. XFS already
supports this (e.g. see the XFS fallocate implementation) but no other
filesystem does (some probably don't need to).
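
As a rough illustration of the locking problem, here is a user-space
sketch with a non-recursive lock modelled as a boolean. Everything in it
is hypothetical (the toy_* functions, the TOY_ATTR_* values); where real
code would block forever on a lock it already holds, the model returns
-1 instead.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ATTR_SIZE    (1u << 0)
#define TOY_ATTR_NO_LOCK (1u << 1)      /* hypothetical: caller holds the iolock */

struct toy_inode {
        bool iolock_held;               /* non-recursive exclusive lock */
        long size;
};

/* Returns 0 on success, -1 where a real lock would self-deadlock. */
int toy_setattr(struct toy_inode *ip, unsigned int valid, long new_size)
{
        bool took_lock = false;

        assert(ip);
        if (!(valid & TOY_ATTR_NO_LOCK)) {
                if (ip->iolock_held)
                        return -1;      /* real code would hang here */
                ip->iolock_held = true;
                took_lock = true;
        }
        if (valid & TOY_ATTR_SIZE)
                ip->size = new_size;
        if (took_lock)
                ip->iolock_held = false;
        return 0;
}

/* write_begin-style error path: runs with the iolock already held. */
int toy_write_begin_rollback(struct toy_inode *ip, unsigned int extra_flags)
{
        int ret;

        ip->iolock_held = true;         /* taken by the write path earlier */
        ret = toy_setattr(ip, TOY_ATTR_SIZE | extra_flags, ip->size);
        ip->iolock_held = false;
        return ret;
}
```

Calling toy_write_begin_rollback() without TOY_ATTR_NO_LOCK hits the
would-deadlock case; passing the flag lets the truncate proceed under
the caller's lock, while an ordinary toy_setattr() call from an unlocked
context still takes and drops the lock itself.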

Hence I'm not sure what the best way to fix this is. I don't want to
have to duplicate all the generic code just to be able to issue a
correct, non-deadlocking truncate operation. I don't want to have to
commit the hack I already have for ->invalidate_page that does:

	xfs_count_page_state(page, &delalloc, ....)
	if (delalloc && !PageUptodate(page)) {
		/*
		 * set up and call xfs_bunmapi() to remove the delalloc
		 * extents on this page.
		 */
		.....
	}
	block_invalidatepage(page, offset);

because it has negative performance impact on several different
common workloads and is completely unnecessary except for this
rare error case from block_write_begin(). Since it's impossible to
uniquely identify the case in ->invalidate_page, the above hack
is as good as I can see can be done.

All in all, I'd prefer the ->setattr() with an "ATTR_NO_LOCK" flag
solution as the simplest way to solve this, but maybe there's
something that I've missed. Comments, suggestions are welcome....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: correct use of vmtruncate()?
  2008-04-29 10:06 correct use of vmtruncate()? David Chinner
@ 2008-04-29 17:10 ` Zach Brown
  2008-04-29 21:52   ` David Chinner
  2008-04-30  7:24   ` Aneesh Kumar K.V
  2008-04-30  3:46 ` David Chinner
  2008-04-30  7:47 ` Aneesh Kumar K.V
  2 siblings, 2 replies; 8+ messages in thread
From: Zach Brown @ 2008-04-29 17:10 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel, linux-mm, xfs-oss

> The obvious fix for this is that block_write_begin() and
> friends should be calling ->setattr to do the truncation and hence
> follow normal convention for truncating blocks off an inode.
> However, even that appears to have thorns. e.g. in XFS we hold the
> iolock exclusively when we call block_write_begin(), but it is not
> held in all cases where ->setattr is currently called. Hence calling
> ->setattr from block_write_begin in this failure case will deadlock
> unless we also pass a "nolock" flag as well. XFS already
> supports this (e.g. see the XFS fallocate implementation) but no other
> filesystem does (some probably don't need to).

This paragraph in particular reminds me of an outstanding bug with
O_DIRECT and ext*.  It isn't truncating partial allocations when a dio
fails with ENOSPC.  This was noticed by a user who saw that fsck found
blocks outside i_size in the file that saw ENOSPC if they tried to
unmount and check the volume after the failed write.

So, whether we decide that failed writes should call setattr or
vmtruncate, we should also keep the generic O_DIRECT path in
consideration.  Today it doesn't even try the supposed generic method of
calling vmtruncate().

- z

(Though I'm sure XFS' dio code already handles freeing blocks :))


* Re: correct use of vmtruncate()?
  2008-04-29 17:10 ` Zach Brown
@ 2008-04-29 21:52   ` David Chinner
  2008-04-30  7:24   ` Aneesh Kumar K.V
  1 sibling, 0 replies; 8+ messages in thread
From: David Chinner @ 2008-04-29 21:52 UTC (permalink / raw)
  To: Zach Brown; +Cc: David Chinner, linux-fsdevel, linux-mm, xfs-oss

On Tue, Apr 29, 2008 at 10:10:59AM -0700, Zach Brown wrote:
> 
> > The obvious fix for this is that block_write_begin() and
> > friends should be calling ->setattr to do the truncation and hence
> > follow normal convention for truncating blocks off an inode.
> > However, even that appears to have thorns. e.g. in XFS we hold the
> > iolock exclusively when we call block_write_begin(), but it is not
> > held in all cases where ->setattr is currently called. Hence calling
> > ->setattr from block_write_begin in this failure case will deadlock
> > unless we also pass a "nolock" flag as well. XFS already
> > supports this (e.g. see the XFS fallocate implementation) but no other
> > filesystem does (some probably don't need to).
> 
> This paragraph in particular reminds me of an outstanding bug with
> O_DIRECT and ext*.  It isn't truncating partial allocations when a dio
> fails with ENOSPC.  This was noticed by a user who saw that fsck found
> blocks outside i_size in the file that saw ENOSPC if they tried to
> unmount and check the volume after the failed write.

That sounds very similar - ENOSPC seems to be one way of "easily"
generating the error condition that exposes this problem, but
I'm sure there are others as well...

> So, whether we decide that failed writes should call setattr or
> vmtruncate, we should also keep the generic O_DIRECT path in
> consideration.  Today it doesn't even try the supposed generic method of
> calling vmtruncate().

Certainly, though the locking will be entertaining in
this path....

> (Though I'm sure XFS' dio code already handles freeing blocks :))

Not the dio code as such, but the close path does. Blocks beyond EOF get
truncated off in ->release or ->clear_inode (unless they were specifically
preallocated) and dio does not do delayed allocation, so it does not suffer
from the "need ->setattr to truncate them away on ENOSPC" issue. i.e. after
the error occurs and the app closes the fd, the blocks get truncated away.

Basically the problem I described is leaving delayed allocation blocks beyond
EOF without any page cache mappings to indicate they are there - allocated
blocks beyond EOF are not a problem...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: correct use of vmtruncate()?
  2008-04-29 17:10 ` Zach Brown
  2008-04-29 21:52   ` David Chinner
@ 2008-04-30  7:24   ` Aneesh Kumar K.V
  2008-04-30 15:55     ` Zach Brown
  1 sibling, 1 reply; 8+ messages in thread
From: Aneesh Kumar K.V @ 2008-04-30  7:24 UTC (permalink / raw)
  To: Zach Brown; +Cc: David Chinner, linux-fsdevel, linux-mm, xfs-oss

On Tue, Apr 29, 2008 at 10:10:59AM -0700, Zach Brown wrote:
> 
> > The obvious fix for this is that block_write_begin() and
> > friends should be calling ->setattr to do the truncation and hence
> > follow normal convention for truncating blocks off an inode.
> > However, even that appears to have thorns. e.g. in XFS we hold the
> > iolock exclusively when we call block_write_begin(), but it is not
> > held in all cases where ->setattr is currently called. Hence calling
> > ->setattr from block_write_begin in this failure case will deadlock
> > unless we also pass a "nolock" flag as well. XFS already
> > supports this (e.g. see the XFS fallocate implementation) but no other
> > filesystem does (some probably don't need to).
> 
> This paragraph in particular reminds me of an outstanding bug with
> O_DIRECT and ext*.  It isn't truncating partial allocations when a dio
> fails with ENOSPC.  This was noticed by a user who saw that fsck found
> blocks outside i_size in the file that saw ENOSPC if they tried to
> unmount and check the volume after the failed write.

This patch should be the fix I guess
	http://lkml.org/lkml/2006/12/18/103

-aneesh


* Re: correct use of vmtruncate()?
  2008-04-30  7:24   ` Aneesh Kumar K.V
@ 2008-04-30 15:55     ` Zach Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Zach Brown @ 2008-04-30 15:55 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: David Chinner, linux-fsdevel, linux-mm, xfs-oss


>> This paragraph in particular reminds me of an outstanding bug with
>> O_DIRECT and ext*.  It isn't truncating partial allocations when a dio
>> fails with ENOSPC.  This was noticed by a user who saw that fsck found
>> blocks outside i_size in the file that saw ENOSPC if they tried to
>> unmount and check the volume after the failed write.
> 
> This patch should be the fix I guess
> 	http://lkml.org/lkml/2006/12/18/103

That's the thread related to the bug, yes, but that isn't the right fix
as David's later messages in the thread indicate.

- z


* Re: correct use of vmtruncate()?
  2008-04-29 10:06 correct use of vmtruncate()? David Chinner
  2008-04-29 17:10 ` Zach Brown
@ 2008-04-30  3:46 ` David Chinner
  2008-04-30  7:47 ` Aneesh Kumar K.V
  2 siblings, 0 replies; 8+ messages in thread
From: David Chinner @ 2008-04-30  3:46 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel, linux-mm, xfs-oss

On Tue, Apr 29, 2008 at 08:06:01PM +1000, David Chinner wrote:
> Folks,
> 
> It appears to me that vmtruncate() is not used correctly in
> block_write_begin() and friends. The short summary is that it
> appears that the usage in these functions implies that vmtruncate()
> should cause truncation of blocks on disk but no filesystem
> appears to do this, nor does the documentation imply they should.

[snip]

> All in all, I'd prefer the ->setattr() with a "ATTR_NO_LOCK" flag
> solution as the simplest way to solve this, but maybe there's
> something that I've missed. Comments, suggestions are welcome....

And the patch to demonstrate this is below. It does appear to fix
the problem, so I'd appreciate some feedback from various other fs
maintainers on whether this will cause problems or not....

Cheers,

Dave.

---
 fs/buffer.c                 |   18 ++++++++++++++----
 fs/xfs/linux-2.6/xfs_iops.c |    4 ++++
 include/linux/fs.h          |    1 +
 3 files changed, 19 insertions(+), 4 deletions(-)

Index: 2.6.x-xfs-new/fs/buffer.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/buffer.c	2008-04-30 12:32:59.482687869 +1000
+++ 2.6.x-xfs-new/fs/buffer.c	2008-04-30 12:43:15.595973324 +1000
@@ -2019,8 +2019,13 @@ int block_write_begin(struct file *file,
 			 * outside i_size.  Trim these off again. Don't need
 			 * i_size_read because we hold i_mutex.
 			 */
-			if (pos + len > inode->i_size)
-				vmtruncate(inode, inode->i_size);
+			if (pos + len > inode->i_size) {
+				struct iattr newattrs;
+
+				newattrs.ia_size = inode->i_size;
+				newattrs.ia_valid = ATTR_SIZE | ATTR_NO_LOCK;
+				notify_change(file->f_dentry, &newattrs);
+			}
 		}
 		goto out;
 	}
@@ -2576,8 +2581,13 @@ out_release:
 	page_cache_release(page);
 	*pagep = NULL;
 
-	if (pos + len > inode->i_size)
-		vmtruncate(inode, inode->i_size);
+	if (pos + len > inode->i_size) {
+		struct iattr newattrs;
+
+		newattrs.ia_size = inode->i_size;
+		newattrs.ia_valid = ATTR_SIZE | ATTR_NO_LOCK;
+		notify_change(file->f_dentry, &newattrs);
+	}
 
 	return ret;
 }
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2008-04-30 12:32:59.046743585 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2008-04-30 12:33:28.946922244 +1000
@@ -709,6 +709,10 @@ xfs_vn_setattr(
 
 	if (ia_valid & (ATTR_MTIME_SET | ATTR_ATIME_SET))
 		flags |= ATTR_UTIME;
+
+	if (ia_valid & ATTR_NO_LOCK)
+		flags |= ATTR_NOLOCK;
+
 #ifdef ATTR_NO_BLOCK
 	if ((ia_valid & ATTR_NO_BLOCK))
 		flags |= ATTR_NONBLOCK;
Index: 2.6.x-xfs-new/include/linux/fs.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/fs.h	2008-04-30 12:32:59.094737451 +1000
+++ 2.6.x-xfs-new/include/linux/fs.h	2008-04-30 12:33:28.998915599 +1000
@@ -337,6 +337,7 @@ typedef void (dio_iodone_t)(struct kiocb
 #define ATTR_FILE	8192
 #define ATTR_KILL_PRIV	16384
 #define ATTR_OPEN	32768	/* Truncating from open(O_TRUNC) */
+#define ATTR_NO_LOCK	65536	/* calling with fs locks already held */
 
 /*
  * This is the Inode Attributes structure, used for notify_change().  It

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group



* Re: correct use of vmtruncate()?
  2008-04-29 10:06 correct use of vmtruncate()? David Chinner
  2008-04-29 17:10 ` Zach Brown
  2008-04-30  3:46 ` David Chinner
@ 2008-04-30  7:47 ` Aneesh Kumar K.V
  2008-04-30 10:15   ` David Chinner
  2 siblings, 1 reply; 8+ messages in thread
From: Aneesh Kumar K.V @ 2008-04-30  7:47 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel, linux-mm, xfs-oss

On Tue, Apr 29, 2008 at 08:06:01PM +1000, David Chinner wrote:
> Folks,
> 
> It appears to me that vmtruncate() is not used correctly in
> block_write_begin() and friends. The short summary is that it
> appears that the usage in these functions implies that vmtruncate()
> should cause truncation of blocks on disk but no filesystem
> appears to do this, nor does the documentation imply they should.

Looking at ext*_truncate, I see we are freeing blocks as a part of vmtruncate.
Or did I miss something ?

-aneesh


* Re: correct use of vmtruncate()?
  2008-04-30  7:47 ` Aneesh Kumar K.V
@ 2008-04-30 10:15   ` David Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: David Chinner @ 2008-04-30 10:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: David Chinner, linux-fsdevel, linux-mm, xfs-oss

On Wed, Apr 30, 2008 at 01:17:38PM +0530, Aneesh Kumar K.V wrote:
> On Tue, Apr 29, 2008 at 08:06:01PM +1000, David Chinner wrote:
> > Folks,
> > 
> > It appears to me that vmtruncate() is not used correctly in
> > block_write_begin() and friends. The short summary is that it
> > appears that the usage in these functions implies that vmtruncate()
> > should cause truncation of blocks on disk but no filesystem
> > appears to do this, nor does the documentation imply they should.
> 
> Looking at ext*_truncate, I see we are freeing blocks as a part of vmtruncate.
> Or did I miss something ?

No, I missed something. I was looking at block_truncate_page() which is
called by various truncate methods but does not do truncation itself.

Still doesn't help XFS, though, as updating different parts of the
inode in different transactions will result in non-atomic ->setattr
updates. Which, given that XFS tends to excel at exposing non-atomic
modifications in crash recovery, is a really bad thing.

Looking further, doing the truncate operation in ->truncate is
probably really stupid simply because the interface does not allow
errors to be returned to the caller. e.g. ufs_setattr() has this
comment:

/*
 * We don't define our `inode->i_op->truncate', and call it here,
 * because of:
 * - there is no way to know old size
 * - there is no way inform user about error, if it happens in `truncate'
 */

and I've just added a WARN_ON(error) to xfs_vn_truncate() so that
errors don't get lost silently.
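
The interface problem can be shown with a small user-space sketch. It is
an illustration only: the toy_* names, the fail_truncate knob and the
TOY_EIO value are assumptions made for the example, not kernel
definitions.

```c
#include <assert.h>

#define TOY_EIO 5       /* stand-in error number for this sketch */

struct toy_inode {
        long size;
        int fail_truncate;      /* hypothetical knob: block removal fails */
};

/* Shared block-removal step; returns 0 or a negative error. */
int toy_remove_blocks(struct toy_inode *ip, long new_size)
{
        assert(ip);
        if (ip->fail_truncate)
                return -TOY_EIO;
        ip->size = new_size;
        return 0;
}

/* ->truncate-style callback: returns void, so a failure is dropped. */
void toy_truncate_cb(struct toy_inode *ip, long new_size)
{
        (void)toy_remove_blocks(ip, new_size);
}

/* ->setattr-style entry point: the error reaches the caller. */
int toy_setattr(struct toy_inode *ip, long new_size)
{
        return toy_remove_blocks(ip, new_size);
}
```

With fail_truncate set, toy_truncate_cb() returns normally and leaves
the size unchanged, so the caller has no way to notice the failure,
whereas toy_setattr() hands back -TOY_EIO.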

UFS also uses block_write_begin(), so it will have exactly the same
problem as XFS - blocks beyond EOF don't get truncated away by
vmtruncate if an error occurs in block_write_begin().

AFAICT, gfs2 is another filesystem that does not have a ->truncate
callback - truncation is driven through the ->setattr interface.
However, gfs2_write_begin() calls vmtruncate() like
block_write_begin() on error from block_prepare_write() and hence
also has this bug. 

I'm sure there are other filesystems that, like XFS, UFS and GFS2,
don't do block truncation in ->truncate. Hence it really does seem
that calling vmtruncate() from anything other than a ->setattr method
is a bug because to do so is to make a false assumption about how
filesystems are implemented....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

