public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
       [not found] <87k60y1rq4.fsf@sw.ru>
@ 2006-12-11 20:40 ` Andrew Morton
  2006-12-12  9:22   ` Dmitriy Monakhov
  2006-12-12 12:20   ` Dmitriy Monakhov
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2006-12-11 20:40 UTC (permalink / raw)
  To: Dmitriy Monakhov; +Cc: linux-kernel, Linux Memory Management, devel, xfs

On Mon, 11 Dec 2006 16:34:27 +0300
Dmitriy Monakhov <dmonakhov@openvz.org> wrote:

> OpenVZ team has discovered error inside generic_file_direct_write()
> If generic_file_direct_IO() has fail (ENOSPC condition) it may have instantiated
> a few blocks outside i_size. And fsck will complain about wrong i_size
> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as error),
> after fsck will fix error i_size will be increased to the biggest block,
> but this blocks contain gurbage from previous write attempt, this is not 
> information leak, but its silence file data corruption. 
> We need truncate any block beyond i_size after write have failed , do in simular
> generic_file_buffered_write() error path.
> 
> Exampe:
> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
> write(3, "aaaaaa"..., 4096) = -1 ENOSPC (No space left on device)
> 
> stat mnt2/FILE3
> File: `mnt2/FILE3'
> Size: 0               Blocks: 4          IO Block: 4096   regular empty file
> >>>>>>>>>>>>>>>>>>>>>>^^^^^^^^^^ file size is less than biggest block idx
> Device: 700h/1792d      Inode: 14          Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> 
> fsck.ext2 -f -n  mnt1/fs_img
> Pass 1: Checking inodes, blocks, and sizes
> Inode 14, i_size is 0, should be 2048.  Fix? no
> 
> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
> ----------
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7b84dc8..bf7cf6c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>  			mark_inode_dirty(inode);
>  		}
>  		*ppos = end;
> +	} else if (written < 0) {
> +		loff_t isize = i_size_read(inode);
> +		/*
> +		 * generic_file_direct_IO() may have instantiated a few blocks
> +		 * outside i_size.  Trim these off again.
> +		 */
> +		if (pos + count > isize)
> +			vmtruncate(inode, isize);
>  	}
>  

XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
And vmtruncate() expects i_mutex to be held.

I guess a suitable solution would be to push this problem back up to the
callers: let them decide whether to run vmtruncate() and if so, to ensure
that i_mutex is held.

The existence of generic_file_aio_write_nolock() makes that rather messy
though.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-12  9:22   ` Dmitriy Monakhov
@ 2006-12-12  6:36     ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2006-12-12  6:36 UTC (permalink / raw)
  To: Dmitriy Monakhov
  Cc: Dmitriy Monakhov, linux-kernel, Linux Memory Management, devel,
	xfs

On Tue, 12 Dec 2006 12:22:14 +0300
Dmitriy Monakhov <dmonakhov@sw.ru> wrote:

> >> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
> >>  			mark_inode_dirty(inode);
> >>  		}
> >>  		*ppos = end;
> >> +	} else if (written < 0) {
> >> +		loff_t isize = i_size_read(inode);
> >> +		/*
> >> +		 * generic_file_direct_IO() may have instantiated a few blocks
> >> +		 * outside i_size.  Trim these off again.
> >> +		 */
> >> +		if (pos + count > isize)
> >> +			vmtruncate(inode, isize);
> >>  	}
> >>  
> >
> > XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
> How could it be ?
> 
> from mm/filemap.c:2046 generic_file_direct_write() comment right after 
> place where i want to add vmtruncate()
> /*
> 	 * Sync the fs metadata but not the minor inode changes and
> 	 * of course not the data as we did direct DMA for the IO.
> 	 * i_mutex is held, which protects generic_osync_inode() from
> 	 * livelocking.
> 	 */
> 
> > And vmtruncate() expects i_mutex to be held.
> generic_file_direct_IO must called under i_mutex too
> from mm/filemap.c:2388
>   /*
>    * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if something
>    * went wrong during pagecache shootdown.
>    */
>   static ssize_t
>   generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

yup, the comments are wrong.

> This means XFS generic_file_direct_write() call generic_file_direct_IO() without
> i_mutex held too?

Think so.  XFS uses blockdev_direct_IO_own_locking().  We'd need to check
with the XFS guys regarding its precise operation and what needs to be done
here.

> >
> > I guess a suitable solution would be to push this problem back up to the
> > callers: let them decide whether to run vmtruncate() and if so, to ensure
> > that i_mutex is held.
> >
> > The existence of generic_file_aio_write_nolock() makes that rather messy
> > though.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-11 20:40 ` [PATCH] incorrect error handling inside generic_file_direct_write Andrew Morton
@ 2006-12-12  9:22   ` Dmitriy Monakhov
  2006-12-12  6:36     ` Andrew Morton
  2006-12-12 12:20   ` Dmitriy Monakhov
  1 sibling, 1 reply; 12+ messages in thread
From: Dmitriy Monakhov @ 2006-12-12  9:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitriy Monakhov, linux-kernel, Linux Memory Management, devel,
	xfs

Andrew Morton <akpm@osdl.org> writes:

> On Mon, 11 Dec 2006 16:34:27 +0300
> Dmitriy Monakhov <dmonakhov@openvz.org> wrote:
>
>> OpenVZ team has discovered error inside generic_file_direct_write()
>> If generic_file_direct_IO() has fail (ENOSPC condition) it may have instantiated
>> a few blocks outside i_size. And fsck will complain about wrong i_size
>> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as error),
>> after fsck will fix error i_size will be increased to the biggest block,
>> but this blocks contain gurbage from previous write attempt, this is not 
>> information leak, but its silence file data corruption. 
>> We need truncate any block beyond i_size after write have failed , do in simular
>> generic_file_buffered_write() error path.
>> 
>> Exampe:
>> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
>> write(3, "aaaaaa"..., 4096) = -1 ENOSPC (No space left on device)
>> 
>> stat mnt2/FILE3
>> File: `mnt2/FILE3'
>> Size: 0               Blocks: 4          IO Block: 4096   regular empty file
>> >>>>>>>>>>>>>>>>>>>>>>^^^^^^^^^^ file size is less than biggest block idx
>> Device: 700h/1792d      Inode: 14          Links: 1
>> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>> 
>> fsck.ext2 -f -n  mnt1/fs_img
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 14, i_size is 0, should be 2048.  Fix? no
>> 
>> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
>> ----------
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b84dc8..bf7cf6c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>>  			mark_inode_dirty(inode);
>>  		}
>>  		*ppos = end;
>> +	} else if (written < 0) {
>> +		loff_t isize = i_size_read(inode);
>> +		/*
>> +		 * generic_file_direct_IO() may have instantiated a few blocks
>> +		 * outside i_size.  Trim these off again.
>> +		 */
>> +		if (pos + count > isize)
>> +			vmtruncate(inode, isize);
>>  	}
>>  
>
> XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
How could it be ?

from mm/filemap.c:2046 generic_file_direct_write() comment right after 
place where i want to add vmtruncate()
/*
	 * Sync the fs metadata but not the minor inode changes and
	 * of course not the data as we did direct DMA for the IO.
	 * i_mutex is held, which protects generic_osync_inode() from
	 * livelocking.
	 */

> And vmtruncate() expects i_mutex to be held.
generic_file_direct_IO must called under i_mutex too
from mm/filemap.c:2388
  /*
   * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if something
   * went wrong during pagecache shootdown.
   */
  static ssize_t
  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

This means XFS generic_file_direct_write() call generic_file_direct_IO() without
i_mutex held too?
>
> I guess a suitable solution would be to push this problem back up to the
> callers: let them decide whether to run vmtruncate() and if so, to ensure
> that i_mutex is held.
>
> The existence of generic_file_aio_write_nolock() makes that rather messy
> though.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-12 12:20   ` Dmitriy Monakhov
@ 2006-12-12  9:52     ` Andrew Morton
  2006-12-12 13:18       ` Dmitriy Monakhov
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2006-12-12  9:52 UTC (permalink / raw)
  To: Dmitriy Monakhov
  Cc: Dmitriy Monakhov, linux-kernel, Linux Memory Management, devel,
	xfs

On Tue, 12 Dec 2006 15:20:52 +0300
Dmitriy Monakhov <dmonakhov@sw.ru> wrote:

> > XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
> > And vmtruncate() expects i_mutex to be held.
> >
> > I guess a suitable solution would be to push this problem back up to the
> > callers: let them decide whether to run vmtruncate() and if so, to ensure
> > that i_mutex is held.
> >
> > The existence of generic_file_aio_write_nolock() makes that rather messy
> > though.
> This means we may call generic_file_aio_write_nolock() without i_mutex, right?
> but call trace is :
>   generic_file_aio_write_nolock() 
>   ->generic_file_buffered_write() /* i_mutex not held here */ 
> but according to filemaps locking rules: mm/filemap.c:77
>  ..
>  *  ->i_mutex			(generic_file_buffered_write)
>  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
>  ..
> I'm confused a litle bit, where is the truth? 

xfs_write() calls generic_file_direct_write() without taking i_mutex for
O_DIRECT writes.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-12 13:18       ` Dmitriy Monakhov
@ 2006-12-12 10:40         ` Andrew Morton
  2006-12-12 23:14           ` Dmitriy Monakhov
  2006-12-13  2:43           ` Chen, Kenneth W
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2006-12-12 10:40 UTC (permalink / raw)
  To: Dmitriy Monakhov
  Cc: Dmitriy Monakhov, linux-kernel, Linux Memory Management, devel,
	xfs

On Tue, 12 Dec 2006 16:18:32 +0300
Dmitriy Monakhov <dmonakhov@sw.ru> wrote:

> >> but according to filemaps locking rules: mm/filemap.c:77
> >>  ..
> >>  *  ->i_mutex			(generic_file_buffered_write)
> >>  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
> >>  ..
> >> I'm confused a litle bit, where is the truth? 
> >
> > xfs_write() calls generic_file_direct_write() without taking i_mutex for
> > O_DIRECT writes.
> Yes, but my quastion is about __generic_file_aio_write_nolock().
> As i understand _nolock sufix means that i_mutex was already locked 
> by caller, am i right ?

Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
the lock.  We don't assume or require that the caller took it.  For example
the raw driver calls generic_file_aio_write_nolock() without taking
i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
we cannot assume that all callers have taken i_mutex, I think.

I guess we can make that a rule (document it, add
BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
really checking that this matches reality for all callers.

It's important, too - if we have an unprotected i_size_write() then the
seqlock can get out of sync due to a race and then i_size_read() locks up
the kernel.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-11 20:40 ` [PATCH] incorrect error handling inside generic_file_direct_write Andrew Morton
  2006-12-12  9:22   ` Dmitriy Monakhov
@ 2006-12-12 12:20   ` Dmitriy Monakhov
  2006-12-12  9:52     ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Dmitriy Monakhov @ 2006-12-12 12:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitriy Monakhov, linux-kernel, Linux Memory Management, devel,
	xfs

Andrew Morton <akpm@osdl.org> writes:

> On Mon, 11 Dec 2006 16:34:27 +0300
> Dmitriy Monakhov <dmonakhov@openvz.org> wrote:
>
>> OpenVZ team has discovered error inside generic_file_direct_write()
>> If generic_file_direct_IO() has fail (ENOSPC condition) it may have instantiated
>> a few blocks outside i_size. And fsck will complain about wrong i_size
>> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as error),
>> after fsck will fix error i_size will be increased to the biggest block,
>> but this blocks contain gurbage from previous write attempt, this is not 
>> information leak, but its silence file data corruption. 
>> We need truncate any block beyond i_size after write have failed , do in simular
>> generic_file_buffered_write() error path.
>> 
>> Exampe:
>> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
>> write(3, "aaaaaa"..., 4096) = -1 ENOSPC (No space left on device)
>> 
>> stat mnt2/FILE3
>> File: `mnt2/FILE3'
>> Size: 0               Blocks: 4          IO Block: 4096   regular empty file
>> >>>>>>>>>>>>>>>>>>>>>>^^^^^^^^^^ file size is less than biggest block idx
>> Device: 700h/1792d      Inode: 14          Links: 1
>> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>> 
>> fsck.ext2 -f -n  mnt1/fs_img
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 14, i_size is 0, should be 2048.  Fix? no
>> 
>> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
>> ----------
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b84dc8..bf7cf6c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>>  			mark_inode_dirty(inode);
>>  		}
>>  		*ppos = end;
>> +	} else if (written < 0) {
>> +		loff_t isize = i_size_read(inode);
>> +		/*
>> +		 * generic_file_direct_IO() may have instantiated a few blocks
>> +		 * outside i_size.  Trim these off again.
>> +		 */
>> +		if (pos + count > isize)
>> +			vmtruncate(inode, isize);
>>  	}
>>  
>
> XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
> And vmtruncate() expects i_mutex to be held.
>
> I guess a suitable solution would be to push this problem back up to the
> callers: let them decide whether to run vmtruncate() and if so, to ensure
> that i_mutex is held.
>
> The existence of generic_file_aio_write_nolock() makes that rather messy
> though.
This means we may call generic_file_aio_write_nolock() without i_mutex, right?
but call trace is :
  generic_file_aio_write_nolock() 
  ->generic_file_buffered_write() /* i_mutex not held here */ 
but according to filemaps locking rules: mm/filemap.c:77
 ..
 *  ->i_mutex			(generic_file_buffered_write)
 *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
 ..
I'm confused a litle bit, where is the truth? 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-12  9:52     ` Andrew Morton
@ 2006-12-12 13:18       ` Dmitriy Monakhov
  2006-12-12 10:40         ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Dmitriy Monakhov @ 2006-12-12 13:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitriy Monakhov, Dmitriy Monakhov, linux-kernel,
	Linux Memory Management, devel, xfs

[-- Attachment #1: Type: text/plain, Size: 1433 bytes --]

Andrew Morton <akpm@osdl.org> writes:

> On Tue, 12 Dec 2006 15:20:52 +0300
> Dmitriy Monakhov <dmonakhov@sw.ru> wrote:
>
>> > XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
>> > And vmtruncate() expects i_mutex to be held.
>> >
>> > I guess a suitable solution would be to push this problem back up to the
>> > callers: let them decide whether to run vmtruncate() and if so, to ensure
>> > that i_mutex is held.
>> >
>> > The existence of generic_file_aio_write_nolock() makes that rather messy
>> > though.
>> This means we may call generic_file_aio_write_nolock() without i_mutex, right?
>> but call trace is :
>>   generic_file_aio_write_nolock() 
>>   ->generic_file_buffered_write() /* i_mutex not held here */ 
>> but according to filemaps locking rules: mm/filemap.c:77
>>  ..
>>  *  ->i_mutex			(generic_file_buffered_write)
>>  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
>>  ..
>> I'm confused a litle bit, where is the truth? 
>
> xfs_write() calls generic_file_direct_write() without taking i_mutex for
> O_DIRECT writes.
Yes, but my quastion is about __generic_file_aio_write_nolock().
As i understand _nolock sufix means that i_mutex was already locked 
by caller, am i right ?
If yes, than __generic_file_aio_write_nolock() is beter place for vmtrancate() 
acclivity after generic_file_direct_write() has fail.
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
-------

[-- Attachment #2: diff-generic-direct-io-write-fix --]
[-- Type: text/plain, Size: 587 bytes --]

diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..723d2ca 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2282,6 +2282,15 @@ __generic_file_aio_write_nolock(struct k
 
 		written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
 							ppos, count, ocount);
+		if (written < 0) {
+			loff_t isize = i_size_read(inode);
+			/*
+			 * generic_file_direct_write() may have instantiated 
+			 * a few blocks outside i_size. Trim these off again.
+			 */
+			if (pos + count > isize)
+				vmtruncate(inode, isize);
+		}
 		if (written < 0 || written == count)
 			goto out;
 		/*

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-12 10:40         ` Andrew Morton
@ 2006-12-12 23:14           ` Dmitriy Monakhov
  2006-12-13  2:43           ` Chen, Kenneth W
  1 sibling, 0 replies; 12+ messages in thread
From: Dmitriy Monakhov @ 2006-12-12 23:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitriy Monakhov, linux-kernel, Linux Memory Management, devel,
	xfs

[-- Attachment #1: Type: text/plain, Size: 1535 bytes --]

Andrew Morton <akpm@osdl.org> writes:

> On Tue, 12 Dec 2006 16:18:32 +0300
> Dmitriy Monakhov <dmonakhov@sw.ru> wrote:
>
>> >> but according to filemaps locking rules: mm/filemap.c:77
>> >>  ..
>> >>  *  ->i_mutex			(generic_file_buffered_write)
>> >>  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
>> >>  ..
>> >> I'm confused a litle bit, where is the truth? 
>> >
>> > xfs_write() calls generic_file_direct_write() without taking i_mutex for
>> > O_DIRECT writes.
>> Yes, but my quastion is about __generic_file_aio_write_nolock().
>> As i understand _nolock sufix means that i_mutex was already locked 
>> by caller, am i right ?
>
> Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
> the lock.  We don't assume or require that the caller took it.  For example
> the raw driver calls generic_file_aio_write_nolock() without taking
> i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
> we cannot assume that all callers have taken i_mutex, I think.
>
> I guess we can make that a rule (document it, add
> BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
> really checking that this matches reality for all callers.
I've checked generic_file_aio_write_nolock() callers for non blockdev.
Only ocfs2 call it explicitly, and do it under i_mutex.
So we need to do: 
1) Change wrong comments.
2) Add BUG_ON(!mutex_is_locked(..)) for non blkdev.
3) Invoke vmtruncate only for non blkdev.

Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
-------

[-- Attachment #2: direct-io-fix.patch --]
[-- Type: text/plain, Size: 2123 bytes --]

diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..540ef5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2046,8 +2046,8 @@ generic_file_direct_write(struct kiocb *
 	/*
 	 * Sync the fs metadata but not the minor inode changes and
 	 * of course not the data as we did direct DMA for the IO.
-	 * i_mutex is held, which protects generic_osync_inode() from
-	 * livelocking.
+	 * i_mutex may not being held, if so some specific locking
+	 * ordering must protect generic_osync_inode() from livelocking.
 	 */
 	if (written >= 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
@@ -2282,6 +2282,17 @@ __generic_file_aio_write_nolock(struct k
 
 		written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
 							ppos, count, ocount);
+		/*
+		 * If host is not S_ISBLK generic_file_direct_write() may 
+		 * have instantiated a few blocks outside i_size  files
+		 * Trim these off again.
+		 */
+		if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
+			loff_t isize = i_size_read(inode);
+			if (pos + count > isize)
+				vmtruncate(inode, isize);
+		}
+
 		if (written < 0 || written == count)
 			goto out;
 		/*
@@ -2344,6 +2355,13 @@ ssize_t generic_file_aio_write_nolock(st
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
+	/*
+	 *  generic_file_buffered_write() may be called inside 
+	 *  __generic_file_aio_write_nolock() even in case of
+	 *  O_DIRECT for non S_ISBLK files. So i_mutex must be held.
+	 */
+	if (!S_ISBLK(inode->i_mode))
+		BUG_ON(!mutex_is_locked(&inode->i_mutex));
 
 	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
 			&iocb->ki_pos);
@@ -2386,8 +2404,8 @@ ssize_t generic_file_aio_write(struct ki
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if something
- * went wrong during pagecache shootdown.
+ * May be called without i_mutex for writes to S_ISREG files.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* RE: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-12 10:40         ` Andrew Morton
  2006-12-12 23:14           ` Dmitriy Monakhov
@ 2006-12-13  2:43           ` Chen, Kenneth W
  2006-12-15 10:43             ` 'Christoph Hellwig'
  1 sibling, 1 reply; 12+ messages in thread
From: Chen, Kenneth W @ 2006-12-13  2:43 UTC (permalink / raw)
  To: 'Andrew Morton', Dmitriy Monakhov,
	'Christoph Hellwig'
  Cc: Dmitriy Monakhov, linux-kernel, Linux Memory Management, devel,
	xfs

Andrew Morton wrote on Tuesday, December 12, 2006 2:40 AM
> On Tue, 12 Dec 2006 16:18:32 +0300
> Dmitriy Monakhov <dmonakhov@sw.ru> wrote:
> 
> > >> but according to filemaps locking rules: mm/filemap.c:77
> > >>  ..
> > >>  *  ->i_mutex			(generic_file_buffered_write)
> > >>  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
> > >>  ..
> > >> I'm confused a litle bit, where is the truth? 
> > >
> > > xfs_write() calls generic_file_direct_write() without taking i_mutex for
> > > O_DIRECT writes.
> > Yes, but my quastion is about __generic_file_aio_write_nolock().
> > As i understand _nolock sufix means that i_mutex was already locked 
> > by caller, am i right ?
> 
> Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
> the lock.  We don't assume or require that the caller took it.  For example
> the raw driver calls generic_file_aio_write_nolock() without taking
> i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
> we cannot assume that all callers have taken i_mutex, I think.


I think we should also clean up generic_file_aio_write_nolock. This was
brought up a couple of weeks ago and I gave up too early.  Here is my
second attempt.

How about the following patch, I think we can kill generic_file_aio_write_nolock
and merge both *file_aio_write_nolock into one function, then

generic_file_aio_write
ocfs2_file_aio_write
blk_dev->aio_write

all points to a non-lock version of __generic_file_aio_write().  First
two already hold i_mutex, while the block device's aio_write method
doesn't require i_mutex to be held.


Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

diff -Nurp linux-2.6.19/drivers/char/raw.c linux-2.6.19.ken/drivers/char/raw.c
--- linux-2.6.19/drivers/char/raw.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/drivers/char/raw.c	2006-12-12 16:41:39.000000000 -0800
@@ -242,7 +242,7 @@ static const struct file_operations raw_
 	.read	=	do_sync_read,
 	.aio_read = 	generic_file_aio_read,
 	.write	=	do_sync_write,
-	.aio_write = 	generic_file_aio_write_nolock,
+	.aio_write = 	__generic_file_aio_write,
 	.open	=	raw_open,
 	.release=	raw_release,
 	.ioctl	=	raw_ioctl,
diff -Nurp linux-2.6.19/fs/block_dev.c linux-2.6.19.ken/fs/block_dev.c
--- linux-2.6.19/fs/block_dev.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/block_dev.c	2006-12-12 16:47:58.000000000 -0800
@@ -1198,7 +1198,7 @@ const struct file_operations def_blk_fop
 	.read		= do_sync_read,
 	.write		= do_sync_write,
   	.aio_read	= generic_file_aio_read,
-  	.aio_write	= generic_file_aio_write_nolock,
+  	.aio_write	= __generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= block_fsync,
 	.unlocked_ioctl	= block_ioctl,
diff -Nurp linux-2.6.19/fs/ocfs2/file.c linux-2.6.19.ken/fs/ocfs2/file.c
--- linux-2.6.19/fs/ocfs2/file.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/ocfs2/file.c	2006-12-12 16:42:09.000000000 -0800
@@ -1107,7 +1107,7 @@ static ssize_t ocfs2_file_aio_write(stru
 	/* communicate with ocfs2_dio_end_io */
 	ocfs2_iocb_set_rw_locked(iocb);
 
-	ret = generic_file_aio_write_nolock(iocb, iov, nr_segs, iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb->ki_pos);
 
 	/* buffered aio wouldn't have proper lock coverage today */
 	BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT));
diff -Nurp linux-2.6.19/include/linux/fs.h linux-2.6.19.ken/include/linux/fs.h
--- linux-2.6.19/include/linux/fs.h	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/include/linux/fs.h	2006-12-12 16:41:58.000000000 -0800
@@ -1742,7 +1742,7 @@ extern int file_send_actor(read_descript
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
-extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
diff -Nurp linux-2.6.19/mm/filemap.c linux-2.6.19.ken/mm/filemap.c
--- linux-2.6.19/mm/filemap.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/mm/filemap.c	2006-12-12 16:47:58.000000000 -0800
@@ -2219,9 +2219,9 @@ zero_length_segment:
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+ssize_t
+__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space * mapping = file->f_mapping;
@@ -2229,9 +2229,10 @@ __generic_file_aio_write_nolock(struct k
 	size_t count;		/* after file limit checks */
 	struct inode 	*inode = mapping->host;
 	unsigned long	seg;
-	loff_t		pos;
+	loff_t		*ppos = &iocb->ki_pos;
 	ssize_t		written;
 	ssize_t		err;
+	ssize_t		ret;
 
 	ocount = 0;
 	for (seg = 0; seg < nr_segs; seg++) {
@@ -2254,7 +2255,6 @@ __generic_file_aio_write_nolock(struct k
 	}
 
 	count = ocount;
-	pos = *ppos;
 
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
@@ -2332,32 +2332,16 @@ __generic_file_aio_write_nolock(struct k
 	}
 out:
 	current->backing_dev_info = NULL;
-	return written ? written : err;
-}
-
-ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
-		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
-{
-	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	ssize_t ret;
-
-	BUG_ON(iocb->ki_pos != pos);
-
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = written ? written : err;
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		ssize_t err;
-
 		err = sync_page_range_nolock(inode, mapping, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
 	return ret;
 }
-EXPORT_SYMBOL(generic_file_aio_write_nolock);
+EXPORT_SYMBOL(__generic_file_aio_write);
 
 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
@@ -2370,8 +2354,7 @@ ssize_t generic_file_aio_write(struct ki
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-13  2:43           ` Chen, Kenneth W
@ 2006-12-15 10:43             ` 'Christoph Hellwig'
  2006-12-15 18:53               ` Chen, Kenneth W
  0 siblings, 1 reply; 12+ messages in thread
From: 'Christoph Hellwig' @ 2006-12-15 10:43 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Andrew Morton', Dmitriy Monakhov,
	'Christoph Hellwig', Dmitriy Monakhov, linux-kernel,
	Linux Memory Management, devel, xfs

> +ssize_t
> +__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> +				unsigned long nr_segs, loff_t pos)

I'd still call this generic_file_aio_write_nolock.

> +	loff_t		*ppos = &iocb->ki_pos;

I'd rather use iocb->ki_pos directly in the few places ppos is referenced
currently.

>  	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -		ssize_t err;
> -
>  		err = sync_page_range_nolock(inode, mapping, pos, ret);
>  		if (err < 0)
>  			ret = err;
>  	}

So we're doing the sync_page_range once in __generic_file_aio_write
with i_mutex held.


>  	mutex_lock(&inode->i_mutex);
> -	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> -			&iocb->ki_pos);
> +	ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
>  	mutex_unlock(&inode->i_mutex);
>  
>  	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {

And then another time after it's unlocked, this seems wrong.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-15 10:43             ` 'Christoph Hellwig'
@ 2006-12-15 18:53               ` Chen, Kenneth W
  2007-01-02 11:17                 ` 'Christoph Hellwig'
  0 siblings, 1 reply; 12+ messages in thread
From: Chen, Kenneth W @ 2006-12-15 18:53 UTC (permalink / raw)
  To: 'Christoph Hellwig'
  Cc: 'Andrew Morton', Dmitriy Monakhov, Dmitriy Monakhov,
	linux-kernel, Linux Memory Management, devel, xfs

Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
> So we're doing the sync_page_range once in __generic_file_aio_write
> with i_mutex held.
> 
> 
> >  	mutex_lock(&inode->i_mutex);
> > -	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> > -			&iocb->ki_pos);
> > +	ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
> >  	mutex_unlock(&inode->i_mutex);
> >  
> >  	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> 
> And then another time after it's unlocked, this seems wrong.


I didn't invent that mess though.

I should've ask the question first: in 2.6.20-rc1, generic_file_aio_write
will call sync_page_range twice, once from __generic_file_aio_write_nolock
and once within the function itself.  Is it redundant?  Can we delete the
one in the top level function?  Like the following?


--- ./mm/filemap.c.orig	2006-12-15 09:02:58.000000000 -0800
+++ ./mm/filemap.c	2006-12-15 09:03:19.000000000 -0800
@@ -2370,14 +2370,6 @@ ssize_t generic_file_aio_write(struct ki
 	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
 			&iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
-
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		ssize_t err;
-
-		err = sync_page_range(inode, mapping, pos, ret);
-		if (err < 0)
-			ret = err;
-	}
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_aio_write);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH]  incorrect error handling inside generic_file_direct_write
  2006-12-15 18:53               ` Chen, Kenneth W
@ 2007-01-02 11:17                 ` 'Christoph Hellwig'
  0 siblings, 0 replies; 12+ messages in thread
From: 'Christoph Hellwig' @ 2007-01-02 11:17 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Christoph Hellwig', 'Andrew Morton',
	Dmitriy Monakhov, Dmitriy Monakhov, linux-kernel,
	Linux Memory Management, devel, xfs

On Fri, Dec 15, 2006 at 10:53:18AM -0800, Chen, Kenneth W wrote:
> Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
> > So we're doing the sync_page_range once in __generic_file_aio_write
> > with i_mutex held.
> > 
> > 
> > >  	mutex_lock(&inode->i_mutex);
> > > -	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> > > -			&iocb->ki_pos);
> > > +	ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
> > >  	mutex_unlock(&inode->i_mutex);
> > >  
> > >  	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > 
> > And then another time after it's unlocked, this seems wrong.
> 
> 
> I didn't invent that mess though.
> 
> I should've ask the question first: in 2.6.20-rc1, generic_file_aio_write
> will call sync_page_range twice, once from __generic_file_aio_write_nolock
> and once within the function itself.  Is it redundant?  Can we delete the
> one in the top level function?  Like the following?

Really?  I'm looking at -rc3 now as -rc1 is rather old and it's definitly
not the case there.  I also can't remember ever doing this - when I
started the generic read/write path untangling I had exactly the same
situation that's now in -rc3:

  - generic_file_aio_write_nolock calls sync_page_range_nolock
  - generic_file_aio_write calls sync_page_range
  - __generic_file_aio_write_nolock doesn't call any sync_page_range variant

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2007-01-02 11:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <87k60y1rq4.fsf@sw.ru>
2006-12-11 20:40 ` [PATCH] incorrect error handling inside generic_file_direct_write Andrew Morton
2006-12-12  9:22   ` Dmitriy Monakhov
2006-12-12  6:36     ` Andrew Morton
2006-12-12 12:20   ` Dmitriy Monakhov
2006-12-12  9:52     ` Andrew Morton
2006-12-12 13:18       ` Dmitriy Monakhov
2006-12-12 10:40         ` Andrew Morton
2006-12-12 23:14           ` Dmitriy Monakhov
2006-12-13  2:43           ` Chen, Kenneth W
2006-12-15 10:43             ` 'Christoph Hellwig'
2006-12-15 18:53               ` Chen, Kenneth W
2007-01-02 11:17                 ` 'Christoph Hellwig'

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox