[PATCH 0/9] remove i_alloc

linux-fsdevel.vger.kernel.org archive mirror

* [PATCH 0/9] remove i_alloc_sem V2
@ 2011-06-24 18:29 Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 1/9] fat: remove i_alloc_sem abuse Christoph Hellwig
                   ` (8 more replies)
  0 siblings, 9 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

i_alloc_sem has always been a bit of an odd "lock".  It's the only remaining
rw_semaphore that can be released by a different thread than the one that
locked it, and its use case in the core direct I/O code is more like a
counter given that the writers already have external serialization.

This series removes it in favour of a simpler counter scheme, thus getting
rid of the rw_semaphore non-owner APIs as requested by Thomas, while at the
same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

The only nasty bit is that two filesystems (fat and ext4) have started
abusing the lock for their own purposes.  I've added a new rw_semaphore
to the fat node structures to keep the current behaviour, and merged a
patch from Jan Kara to remove the i_alloc_sem abuse from ext4.

changes from v1:
 - update the fat patch description
 - replace my ext4 truncate_lock patch with Jan's rewrite of ext4_page_mkwrite
 - do not use wait_on_bit, but replace it with an opencoded hashed waitqueue
 - rename inode_dio_wake to inode_dio_done
 - add kerneldoc comments for inode_dio_wait and inode_dio_done
 - simplify the blockdev_direct_IO prototype
 - move the i_dio_count decrement into the ->end_io handler if present to
   make i_dio_count useful for filesystems delaying AIO completion
 - reorder the patch series - patches 1 to 5 are the meat, the rest is
   additional tidyups in that area required for future improvements


* [PATCH 1/9] fat: remove i_alloc_sem abuse
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 2/9] ext4: Rewrite ext4_page_mkwrite() to use generic helpers Christoph Hellwig
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fat-avoid-i_alloc_sem --]
[-- Type: text/plain, Size: 2242 bytes --]

Add a new rw_semaphore to protect bmap against truncate.  Previously
i_alloc_sem was abused for this, but it's going away in this series.

Note that we can't simply use i_mutex, given that the swapon code
calls ->bmap under it.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2011-06-20 21:28:19.707963855 +0200
+++ linux-2.6/fs/fat/inode.c	2011-06-20 21:29:25.031293882 +0200
@@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
 	sector_t blocknr;
 
 	/* fat_get_cluster() assumes the requested blocknr isn't truncated. */
-	down_read(&mapping->host->i_alloc_sem);
+	down_read(&MSDOS_I(mapping->host)->truncate_lock);
 	blocknr = generic_block_bmap(mapping, block, fat_get_block);
-	up_read(&mapping->host->i_alloc_sem);
+	up_read(&MSDOS_I(mapping->host)->truncate_lock);
 
 	return blocknr;
 }
@@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
 	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
+
+	init_rwsem(&ei->truncate_lock);
 	return &ei->vfs_inode;
 }
 
Index: linux-2.6/fs/fat/fat.h
===================================================================
--- linux-2.6.orig/fs/fat/fat.h	2011-06-20 21:28:19.724630522 +0200
+++ linux-2.6/fs/fat/fat.h	2011-06-20 21:29:25.034627215 +0200
@@ -109,6 +109,7 @@ struct msdos_inode_info {
 	int i_attrs;		/* unused attribute bits */
 	loff_t i_pos;		/* on-disk position of directory entry or 0 */
 	struct hlist_node i_fat_hash;	/* hash by i_location */
+	struct rw_semaphore truncate_lock; /* protect bmap against truncate */
 	struct inode vfs_inode;
 };
 
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-20 21:28:19.744630521 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 21:29:54.501292390 +0200
@@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		down_write(&MSDOS_I(inode)->truncate_lock);
 		truncate_setsize(inode, attr->ia_size);
 		fat_truncate_blocks(inode, attr->ia_size);
+		up_write(&MSDOS_I(inode)->truncate_lock);
 	}
 
 	setattr_copy(inode, attr);



* [PATCH 2/9] ext4: Rewrite ext4_page_mkwrite() to use generic helpers
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 1/9] fat: remove i_alloc_sem abuse Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 3/9] fs: simplify handling of zero sized reads in __blockdev_direct_IO Christoph Hellwig
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec,
	Jan Kara

[-- Attachment #1: ext4-rewrite-page_mkwrite --]
[-- Type: text/plain, Size: 5329 bytes --]

From:	Jan Kara <jack@suse.cz>

Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
removes the need to use i_alloc_sem to avoid races with truncate, which
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
be problematic because we can decide to flush delay-allocated blocks which
will acquire s_umount semaphore - again creating unpleasant lock dependency
if not directly a deadlock.

Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |  106 ++++++++++++++++++++++++++++--------------------------
 1 files changed, 55 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e3126c0..bd30976 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5843,80 +5843,84 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	loff_t size;
 	unsigned long len;
-	int ret = -EINVAL;
-	void *fsdata;
+	int ret;
 	struct file *file = vma->vm_file;
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct address_space *mapping = inode->i_mapping;
+	handle_t *handle;
+	get_block_t *get_block;
+	int retries = 0;
 
 	/*
-	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
-	 * get i_mutex because we are already holding mmap_sem.
+	 * This check is racy but catches the common case. We rely on
+	 * __block_page_mkwrite() to do a reliable check.
 	 */
-	down_read(&inode->i_alloc_sem);
-	size = i_size_read(inode);
-	if (page->mapping != mapping || size <= page_offset(page)
-	    || !PageUptodate(page)) {
-		/* page got truncated from under us? */
-		goto out_unlock;
+	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	/* Delalloc case is easy... */
+	if (test_opt(inode->i_sb, DELALLOC) &&
+	    !ext4_should_journal_data(inode) &&
+	    !ext4_nonda_switch(inode->i_sb)) {
+		do {
+			ret = __block_page_mkwrite(vma, vmf,
+						   ext4_da_get_block_prep);
+		} while (ret == -ENOSPC &&
+		       ext4_should_retry_alloc(inode->i_sb, &retries));
+		goto out_ret;
 	}
-	ret = 0;
 
 	lock_page(page);
-	wait_on_page_writeback(page);
-	if (PageMappedToDisk(page)) {
-		up_read(&inode->i_alloc_sem);
-		return VM_FAULT_LOCKED;
+	size = i_size_read(inode);
+	/* Page got truncated from under us? */
+	if (page->mapping != mapping || page_offset(page) > size) {
+		unlock_page(page);
+		ret = VM_FAULT_NOPAGE;
+		goto out;
 	}
 
 	if (page->index == size >> PAGE_CACHE_SHIFT)
 		len = size & ~PAGE_CACHE_MASK;
 	else
 		len = PAGE_CACHE_SIZE;
-
 	/*
-	 * return if we have all the buffers mapped. This avoid
-	 * the need to call write_begin/write_end which does a
-	 * journal_start/journal_stop which can block and take
-	 * long time
+	 * Return if we have all the buffers mapped. This avoids the need to do
+	 * journal_start/journal_stop which can block and take a long time
 	 */
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
-			up_read(&inode->i_alloc_sem);
-			return VM_FAULT_LOCKED;
+			/* Wait so that we don't change page under IO */
+			wait_on_page_writeback(page);
+			ret = VM_FAULT_LOCKED;
+			goto out;
 		}
 	}
 	unlock_page(page);
-	/*
-	 * OK, we need to fill the hole... Do write_begin write_end
-	 * to do block allocation/reservation.We are not holding
-	 * inode.i__mutex here. That allow * parallel write_begin,
-	 * write_end call. lock_page prevent this from happening
-	 * on the same page though
-	 */
-	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
-			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
-	if (ret < 0)
-		goto out_unlock;
-	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
-			len, len, page, fsdata);
-	if (ret < 0)
-		goto out_unlock;
-	ret = 0;
-
-	/*
-	 * write_begin/end might have created a dirty page and someone
-	 * could wander in and start the IO.  Make sure that hasn't
-	 * happened.
-	 */
-	lock_page(page);
-	wait_on_page_writeback(page);
-	up_read(&inode->i_alloc_sem);
-	return VM_FAULT_LOCKED;
-out_unlock:
-	if (ret)
+	/* OK, we need to fill the hole... */
+	if (ext4_should_dioread_nolock(inode))
+		get_block = ext4_get_block_write;
+	else
+		get_block = ext4_get_block;
+retry_alloc:
+	handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+	if (IS_ERR(handle)) {
 		ret = VM_FAULT_SIGBUS;
-	up_read(&inode->i_alloc_sem);
+		goto out;
+	}
+	ret = __block_page_mkwrite(vma, vmf, get_block);
+	if (!ret && ext4_should_journal_data(inode)) {
+		if (walk_page_buffers(handle, page_buffers(page), 0,
+			  PAGE_CACHE_SIZE, NULL, do_journal_get_write_access)) {
+			unlock_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+	}
+	ext4_journal_stop(handle);
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry_alloc;
+out_ret:
+	ret = block_page_mkwrite_return(ret);
+out:
 	return ret;
 }
-- 
1.7.1





* [PATCH 3/9] fs: simplify handling of zero sized reads in __blockdev_direct_IO
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 1/9] fat: remove i_alloc_sem abuse Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 2/9] ext4: Rewrite ext4_page_mkwrite() to use generic helpers Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 4/9] fs: kill i_alloc_sem Christoph Hellwig
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-cleanup-zero-size-dio-reads --]
[-- Type: text/plain, Size: 983 bytes --]

Reject zero sized reads as soon as we know our I/O length, and don't
bother with locks or allocations that might have to be cleaned up
otherwise.
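
To illustrate the idea in isolation, here is a small userspace sketch
(not the kernel function).  The demo_* names and the allocation counter
are made up for the example: the zero-length read is rejected before
anything is allocated, so no cleanup path is needed for it.

```c
#include <assert.h>
#include <stdlib.h>

enum demo_rw { DEMO_READ, DEMO_WRITE };

/* Hypothetical per-request state, standing in for struct dio. */
struct demo_dio {
	long offset;
	long end;
};

/* Counts allocations so the early return can be observed. */
static int demo_allocations;

/* Returns the number of bytes "transferred", or -1 on allocation failure. */
static long demo_direct_io(enum demo_rw rw, long offset, long end)
{
	struct demo_dio *dio;
	long bytes;

	/* Reject zero-length reads before allocating or locking anything. */
	if (rw == DEMO_READ && end == offset)
		return 0;

	dio = malloc(sizeof(*dio));
	if (!dio)
		return -1;
	demo_allocations++;

	dio->offset = offset;
	dio->end = end;
	bytes = dio->end - dio->offset;

	free(dio);
	return bytes;
}
```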

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-24 14:30:22.488402525 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-24 15:13:16.711605526 +0200
@@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 	}
 
+	/* watch out for a 0 len io from a tricksy fs */
+	if (rw == READ && end == offset)
+		return 0;
+
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
@@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
 	dio->flags = flags;
 	if (dio->flags & DIO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
+		if (rw == READ) {
 			struct address_space *mapping =
 					iocb->ki_filp->f_mapping;
 



* [PATCH 4/9] fs: kill i_alloc_sem
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2011-06-24 18:29 ` [PATCH 3/9] fs: simplify handling of zero sized reads in __blockdev_direct_IO Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:34   ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 5/9] rw_semaphore: remove up/down_read_non_owner Christoph Hellwig
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-kill-i_alloc_sem --]
[-- Type: text/plain, Size: 15748 bytes --]

i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
be released by a non-owner, and its write side is always mirrored by
real exclusion.  Its intended use is to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall away
 - the reader side is replaced by an i_dio_count member in struct inode
   that counts the number of pending direct I/O requests.  Truncate can't
   proceed as long as it's non-zero
 - when i_dio_count reaches zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_state
 - new references to i_dio_count can't appear while we are waiting for
   it to read zero because the direct I/O count always needs i_mutex
   (or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
system).
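
For readers who want to see the shape of the scheme outside the kernel,
the following is a userspace sketch under stated assumptions, not the
implementation in this patch.  A pthread mutex and condition variable
stand in for the hashed bit waitqueue, and every demo_* name is
hypothetical.  As in the description above, starting a request only bumps
a counter, the last completion wakes the waiter, and the waiter relies on
an outer lock (not shown) to keep new requests from starting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the part of struct inode used by the scheme. */
struct demo_inode {
	atomic_int dio_count;		/* number of in-flight requests */
	pthread_mutex_t wait_lock;	/* pairs with wait_cond below */
	pthread_cond_t wait_cond;	/* stands in for the bit waitqueue */
};

static void demo_inode_init(struct demo_inode *inode)
{
	atomic_init(&inode->dio_count, 0);
	pthread_mutex_init(&inode->wait_lock, NULL);
	pthread_cond_init(&inode->wait_cond, NULL);
}

/* Start a request.  The caller is assumed to hold the outer lock. */
static void demo_dio_begin(struct demo_inode *inode)
{
	atomic_fetch_add(&inode->dio_count, 1);
}

/* Finish a request, possibly from a different thread than the starter. */
static void demo_dio_done(struct demo_inode *inode)
{
	int old = atomic_fetch_sub(&inode->dio_count, 1);

	assert(old > 0);
	if (old == 1) {
		/* Count dropped to zero: wake anyone waiting to truncate. */
		pthread_mutex_lock(&inode->wait_lock);
		pthread_cond_broadcast(&inode->wait_cond);
		pthread_mutex_unlock(&inode->wait_lock);
	}
}

/* Block until no requests are in flight.  Caller holds the outer lock. */
static void demo_dio_wait(struct demo_inode *inode)
{
	pthread_mutex_lock(&inode->wait_lock);
	while (atomic_load(&inode->dio_count) != 0)
		pthread_cond_wait(&inode->wait_cond, &inode->wait_lock);
	pthread_mutex_unlock(&inode->wait_lock);
}
```

Unlike a reader/writer lock, nothing here records which thread started a
request, so completing it from another thread needs no special API.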

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-24 15:13:16.711605526 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-24 15:18:33.021589512 +0200
@@ -135,6 +135,50 @@ struct dio {
 	struct page *pages[DIO_PAGES];	/* page buffer */
 };
 
+static void __inode_dio_wait(struct inode *inode)
+{
+	wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP);
+	DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);
+
+	do {
+		prepare_to_wait(wq, &q.wait, TASK_UNINTERRUPTIBLE);
+		if (atomic_read(&inode->i_dio_count))
+			schedule();
+	} while (atomic_read(&inode->i_dio_count));
+	finish_wait(wq, &q.wait);
+}
+
+/**
+ * inode_dio_wait - wait for outstanding DIO requests to finish
+ * @inode: inode to wait for
+ *
+ * Waits for all pending direct I/O requests to finish so that we can
+ * proceed with a truncate or equivalent operation.
+ *
+ * Must be called under a lock that serializes taking new references
+ * to i_dio_count, usually by inode->i_mutex.
+ */
+void inode_dio_wait(struct inode *inode)
+{
+	if (atomic_read(&inode->i_dio_count))
+		__inode_dio_wait(inode);
+}
+EXPORT_SYMBOL_GPL(inode_dio_wait);
+
+/**
+ * inode_dio_done - signal finish of a direct I/O request
+ * @inode: inode the direct I/O happens on
+ *
+ * This is called once we've finished processing a direct I/O request,
+ * and is used to wake up callers waiting for direct I/O to be quiesced.
+ */
+void inode_dio_done(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_dio_count))
+		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
+}
+EXPORT_SYMBOL_GPL(inode_dio_done);
+
 /*
  * How many pages are in the queue?
  */
@@ -254,9 +298,7 @@ static ssize_t dio_complete(struct dio *
 	}
 
 	if (dio->flags & DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
-
+		inode_dio_done(dio->inode);
 	return ret;
 }
 
@@ -980,9 +1022,6 @@ out:
 	return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1146,15 +1185,14 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    For reads and writes i_alloc_sem is taken in shared mode and released
- *    on I/O completion (which may happen asynchronously after returning to
- *    the caller).
+ *    The i_dio_count counter keeps track of the number of outstanding
+ *    direct I/O requests, and truncate waits for it to reach zero.
+ *    New references to i_dio_count must only be grabbed with i_mutex
+ *    held.
  *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
- *    For reads and writes both i_mutex and i_alloc_sem are not held on
- *    entry and are never taken.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1234,10 +1272,9 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 
 		/*
-		 * Will be released at I/O completion, possibly in a
-		 * different thread.
+		 * Will be decremented at I/O completion time.
 		 */
-		down_read_non_owner(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2011-06-24 15:13:16.804938855 +0200
+++ linux-2.6/mm/filemap.c	2011-06-24 15:14:56.364933813 +0200
@@ -78,9 +78,6 @@
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
- *  ->i_mutex
- *    ->i_alloc_sem             (various)
- *
  *  inode_wb_list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2011-06-24 15:13:16.818272187 +0200
+++ linux-2.6/mm/rmap.c	2011-06-24 15:14:56.368267154 +0200
@@ -21,7 +21,6 @@
  * Lock ordering in mm:
  *
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
- *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_mutex
Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c	2011-06-24 15:13:16.721605526 +0200
+++ linux-2.6/fs/attr.c	2011-06-24 15:14:56.368267154 +0200
@@ -233,16 +233,13 @@ int notify_change(struct dentry * dentry
 		return error;
 
 	if (ia_valid & ATTR_SIZE)
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(inode);
 
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
 		error = simple_setattr(dentry, attr);
 
-	if (ia_valid & ATTR_SIZE)
-		up_write(&dentry->d_inode->i_alloc_sem);
-
 	if (!error)
 		fsnotify_change(dentry, ia_valid);
 
Index: linux-2.6/fs/ntfs/file.c
===================================================================
--- linux-2.6.orig/fs/ntfs/file.c	2011-06-24 15:13:16.734938859 +0200
+++ linux-2.6/fs/ntfs/file.c	2011-06-24 15:14:56.371600489 +0200
@@ -1832,9 +1832,8 @@ static ssize_t ntfs_file_buffered_write(
 	 * fails again.
 	 */
 	if (unlikely(NInoTruncateFailed(ni))) {
-		down_write(&vi->i_alloc_sem);
+		inode_dio_wait(vi);
 		err = ntfs_truncate(vi);
-		up_write(&vi->i_alloc_sem);
 		if (err || NInoTruncateFailed(ni)) {
 			if (!err)
 				err = -EIO;
Index: linux-2.6/fs/reiserfs/xattr.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/xattr.c	2011-06-24 15:13:16.758272190 +0200
+++ linux-2.6/fs/reiserfs/xattr.c	2011-06-24 15:14:56.374933821 +0200
@@ -555,11 +555,10 @@ reiserfs_xattr_set_handle(struct reiserf
 
 		reiserfs_write_unlock(inode->i_sb);
 		mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_XATTR);
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(dentry->d_inode);
 		reiserfs_write_lock(inode->i_sb);
 
 		err = reiserfs_setattr(dentry, &newattrs);
-		up_write(&dentry->d_inode->i_alloc_sem);
 		mutex_unlock(&dentry->d_inode->i_mutex);
 	} else
 		update_ctime(inode);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2011-06-24 15:13:16.858272186 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-24 15:14:56.378267151 +0200
@@ -776,7 +776,7 @@ struct inode {
 	struct timespec		i_ctime;
 	blkcnt_t		i_blocks;
 	unsigned short          i_bytes;
-	struct rw_semaphore	i_alloc_sem;
+	atomic_t		i_dio_count;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock	*i_flock;
 	struct address_space	*i_mapping;
@@ -1692,6 +1692,10 @@ struct super_operations {
  *			set during data writeback, and cleared with a wakeup
  *			on the bit address once it is done.
  *
+ * I_REFERENCED		Marks the inode as recently referenced on the LRU list.
+ *
+ * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -1705,6 +1709,8 @@ struct super_operations {
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
 #define I_REFERENCED		(1 << 8)
+#define __I_DIO_WAKEUP		9
+#define I_DIO_WAKEUP		(1 << __I_DIO_WAKEUP)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
@@ -1815,7 +1821,6 @@ struct file_system_type {
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
 	struct lock_class_key i_mutex_dir_key;
-	struct lock_class_key i_alloc_sem_key;
 };
 
 extern struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
@@ -2367,6 +2372,8 @@ enum {
 };
 
 void dio_end_io(struct bio *bio, int error);
+void inode_dio_wait(struct inode *inode);
+void inode_dio_done(struct inode *inode);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2011-06-24 15:13:16.828272188 +0200
+++ linux-2.6/mm/memory.c	2011-06-24 15:14:56.381600482 +0200
@@ -2811,12 +2811,11 @@ int vmtruncate_range(struct inode *inode
 		return -ENOSYS;
 
 	mutex_lock(&inode->i_mutex);
-	down_write(&inode->i_alloc_sem);
+	inode_dio_wait(inode);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	truncate_inode_pages_range(mapping, offset, end);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	inode->i_op->truncate_range(inode, offset, end);
-	up_write(&inode->i_alloc_sem);
 	mutex_unlock(&inode->i_mutex);
 
 	return 0;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2011-06-24 15:13:16.771605525 +0200
+++ linux-2.6/fs/inode.c	2011-06-24 15:14:56.381600482 +0200
@@ -176,8 +176,7 @@ int inode_init_always(struct super_block
 	mutex_init(&inode->i_mutex);
 	lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
 
-	init_rwsem(&inode->i_alloc_sem);
-	lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key);
+	atomic_set(&inode->i_dio_count, 0);
 
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c	2011-06-24 15:13:16.744938858 +0200
+++ linux-2.6/fs/ntfs/inode.c	2011-06-24 15:14:56.384933812 +0200
@@ -2357,12 +2357,7 @@ static const char *es = "  Leaving incon
  *
  * Returns 0 on success or -errno on error.
  *
- * Called with ->i_mutex held.  In all but one case ->i_alloc_sem is held for
- * writing.  The only case in the kernel where ->i_alloc_sem is not held is
- * mm/filemap.c::generic_file_buffered_write() where vmtruncate() is called
- * with the current i_size as the offset.  The analogous place in NTFS is in
- * fs/ntfs/file.c::ntfs_file_buffered_write() where we call vmtruncate() again
- * without holding ->i_alloc_sem.
+ * Called with ->i_mutex held.
  */
 int ntfs_truncate(struct inode *vi)
 {
@@ -2887,8 +2882,7 @@ void ntfs_truncate_vfs(struct inode *vi)
  * We also abort all changes of user, group, and mode as we do not implement
  * the NTFS ACLs yet.
  *
- * Called with ->i_mutex held.  For the ATTR_SIZE (i.e. ->truncate) case, also
- * called with ->i_alloc_sem held for writing.
+ * Called with ->i_mutex held.
  */
 int ntfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-24 15:13:16.781605524 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-24 15:14:56.388267143 +0200
@@ -551,9 +551,8 @@ bail:
 
 /*
  * ocfs2_dio_end_io is called by the dio core when a dio is finished.  We're
- * particularly interested in the aio/dio case.  Like the core uses
- * i_alloc_sem, we use the rw_lock DLM lock to protect io on one node from
- * truncation on another.
+ * particularly interested in the aio/dio case.  We use the rw_lock DLM lock
+ * to protect io on one node from truncation on another.
  */
 static void ocfs2_dio_end_io(struct kiocb *iocb,
 			     loff_t offset,
@@ -569,7 +568,7 @@ static void ocfs2_dio_end_io(struct kioc
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
 	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_done(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-24 15:13:16.794938856 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-24 15:14:56.391600477 +0200
@@ -2236,9 +2236,9 @@ static ssize_t ocfs2_file_aio_write(stru
 	ocfs2_iocb_clear_sem_locked(iocb);
 
 relock:
-	/* to match setattr's i_mutex -> i_alloc_sem -> rw_lock ordering */
+	/* to match setattr's i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		down_read(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2290,7 +2290,7 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		up_read(&inode->i_alloc_sem);
+		inode_dio_done(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2361,8 +2361,7 @@ out_dio:
 	/*
 	 * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
 	 * function pointer which is called when o_direct io completes so that
-	 * it can unlock our rw lock.  (it's the clustered equivalent of
-	 * i_alloc_sem; protects truncate from racing with pending ios).
+	 * it can unlock our rw lock.
 	 * Unfortunately there are error cases which call end_io and others
 	 * that don't.  so we don't have to unlock the rw_lock if either an
 	 * async dio is going to do it in the future or an end_io after an
@@ -2379,7 +2378,7 @@ out:
 
 out_sems:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_done(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
@@ -2531,8 +2530,8 @@ static ssize_t ocfs2_file_aio_read(struc
 	 * need locks to protect pending reads from racing with truncate.
 	 */
 	if (filp->f_flags & O_DIRECT) {
-		down_read(&inode->i_alloc_sem);
 		have_alloc_sem = 1;
+		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,7 +2574,7 @@ static ssize_t ocfs2_file_aio_read(struc
 
 bail:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_done(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 	if (rw_level != -1)
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c	2011-06-24 15:13:16.841605521 +0200
+++ linux-2.6/mm/madvise.c	2011-06-24 15:14:56.394933812 +0200
@@ -218,7 +218,7 @@ static long madvise_remove(struct vm_are
 	endoff = (loff_t)(end - vma->vm_start - 1)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-	/* vmtruncate_range needs to take i_mutex and i_alloc_sem */
+	/* vmtruncate_range needs to take i_mutex */
 	up_read(&current->mm->mmap_sem);
 	error = vmtruncate_range(mapping->host, offset, endoff);
 	down_read(&current->mm->mmap_sem);



* Re: [PATCH 4/9] fs: kill i_alloc_sem
  2011-06-24 18:29 ` [PATCH 4/9] fs: kill i_alloc_sem Christoph Hellwig
@ 2011-06-24 18:34   ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:34 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).

And I still haven't fixed that typo, damn.  Updated in local version now
to make sure it won't be missed next time.



* [PATCH 5/9] rw_semaphore: remove up/down_read_non_owner
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2011-06-24 18:29 ` [PATCH 4/9] fs: kill i_alloc_sem Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 6/9] fs: move inode_dio_wait calls into ->setattr Christoph Hellwig
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: remove-rw_semaphore-non_owner --]
[-- Type: text/plain, Size: 1921 bytes --]

Now that the last user is gone these can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/linux/rwsem.h
===================================================================
--- linux-2.6.orig/include/linux/rwsem.h	2011-06-24 14:30:21.571735905 +0200
+++ linux-2.6/include/linux/rwsem.h	2011-06-24 15:02:14.854972359 +0200
@@ -124,19 +124,9 @@ extern void downgrade_write(struct rw_se
  */
 extern void down_read_nested(struct rw_semaphore *sem, int subclass);
 extern void down_write_nested(struct rw_semaphore *sem, int subclass);
-/*
- * Take/release a lock when not the owner will release it.
- *
- * [ This API should be avoided as much as possible - the
- *   proper abstraction for this case is completions. ]
- */
-extern void down_read_non_owner(struct rw_semaphore *sem);
-extern void up_read_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)		down_read(sem)
 # define down_write_nested(sem, subclass)	down_write(sem)
-# define down_read_non_owner(sem)		down_read(sem)
-# define up_read_non_owner(sem)			up_read(sem)
 #endif
 
 #endif /* _LINUX_RWSEM_H */
Index: linux-2.6/kernel/rwsem.c
===================================================================
--- linux-2.6.orig/kernel/rwsem.c	2011-06-24 14:30:21.588402571 +0200
+++ linux-2.6/kernel/rwsem.c	2011-06-24 15:02:14.854972359 +0200
@@ -117,15 +117,6 @@ void down_read_nested(struct rw_semaphor
 
 EXPORT_SYMBOL(down_read_nested);
 
-void down_read_non_owner(struct rw_semaphore *sem)
-{
-	might_sleep();
-
-	__down_read(sem);
-}
-
-EXPORT_SYMBOL(down_read_non_owner);
-
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
@@ -136,13 +127,6 @@ void down_write_nested(struct rw_semapho
 
 EXPORT_SYMBOL(down_write_nested);
 
-void up_read_non_owner(struct rw_semaphore *sem)
-{
-	__up_read(sem);
-}
-
-EXPORT_SYMBOL(up_read_non_owner);
-
 #endif
 
 



* [PATCH 6/9] fs: move inode_dio_wait calls into ->setattr
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2011-06-24 18:29 ` [PATCH 5/9] rw_semaphore: remove up/down_read_non_owner Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 7/9] fs: always maintain i_dio_count Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-move-dio_wait --]
[-- Type: text/plain, Size: 6289 bytes --]

Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand.  This means filesystem-specific locks that prevent
new dio references from appearing can be held while waiting.  This is
important to allow generalizing i_dio_count to non-DIO_LOCKING filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 09:28:54.516815966 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 09:31:34.706807855 +0200
@@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry,
 		if (status)
 			goto bail_unlock;
 
+		inode_dio_wait(inode);
+
 		if (i_size_read(inode) > attr->ia_size) {
 			if (ocfs2_should_order_data(inode)) {
 				status = ocfs2_begin_ordered_truncate(inode,
Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c	2011-06-20 09:28:54.490149300 +0200
+++ linux-2.6/fs/attr.c	2011-06-20 09:29:06.000000000 +0200
@@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry
 	if (error)
 		return error;
 
-	if (ia_valid & ATTR_SIZE)
-		inode_dio_wait(inode);
-
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2011-06-18 12:54:28.058273680 +0200
+++ linux-2.6/fs/ext2/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
 
+	inode_dio_wait(inode);
+
 	if (mapping_is_xip(inode->i_mapping))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c	2011-06-18 12:54:28.071607014 +0200
+++ linux-2.6/fs/ext3/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -3216,6 +3216,9 @@ int ext3_setattr(struct dentry *dentry,
 		ext3_journal_stop(handle);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		inode_dio_wait(inode);
+
 	if (S_ISREG(inode->i_mode) &&
 	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
 		handle_t *handle;
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 09:28:54.506815967 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry,
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-18 12:54:28.118273678 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s
 	 * sequence.
 	 */
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (attr->ia_size > inode->i_size) {
 			error = fat_cont_expand(inode, attr->ia_size);
 			if (error || attr->ia_valid == ATTR_SIZE)
Index: linux-2.6/fs/gfs2/bmap.c
===================================================================
--- linux-2.6.orig/fs/gfs2/bmap.c	2011-06-18 12:54:28.141607009 +0200
+++ linux-2.6/fs/gfs2/bmap.c	2011-06-20 09:29:06.510148693 +0200
@@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod
 	if (ret)
 		return ret;
 
+	inode_dio_wait(inode);
+
 	oldsize = inode->i_size;
 	if (newsize >= oldsize)
 		return do_grow(inode, newsize);
Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c	2011-06-18 12:54:28.154940342 +0200
+++ linux-2.6/fs/hfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c	2011-06-18 12:54:28.168273676 +0200
+++ linux-2.6/fs/hfsplus/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -296,6 +296,8 @@ static int hfsplus_setattr(struct dentry
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/jfs/file.c
===================================================================
--- linux-2.6.orig/fs/jfs/file.c	2011-06-18 12:54:28.191607007 +0200
+++ linux-2.6/fs/jfs/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -110,6 +110,8 @@ int jfs_setattr(struct dentry *dentry, s
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		rc = vmtruncate(inode, iattr->ia_size);
 		if (rc)
 			return rc;
Index: linux-2.6/fs/nilfs2/inode.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/inode.c	2011-06-18 12:54:28.204940339 +0200
+++ linux-2.6/fs/nilfs2/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -778,6 +778,8 @@ int nilfs_setattr(struct dentry *dentry,
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		err = vmtruncate(inode, iattr->ia_size);
 		if (unlikely(err))
 			goto out_err;
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2011-06-18 12:54:28.218273673 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -3114,6 +3114,9 @@ int reiserfs_setattr(struct dentry *dent
 			error = -EFBIG;
 			goto out;
 		}
+
+		inode_dio_wait(inode);
+
 		/* fill in hole pointers in the expanding truncate case. */
 		if (attr->ia_size > inode->i_size) {
 			error = generic_cont_expand_simple(inode, attr->ia_size);



* [PATCH 7/9] fs: always maintain i_dio_count
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2011-06-24 18:29 ` [PATCH 6/9] fs: move inode_dio_wait calls into ->setattr Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 8/9] fs: simplify the blockdev_direct_IO prototype Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 9/9] fs: move inode_dio_done to the end_io handler Christoph Hellwig
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-generalize-dio_count --]
[-- Type: text/plain, Size: 4846 bytes --]

Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This allows these filesystems to also protect truncate against direct I/O
requests by using common code.  Right now the only non-DIO_LOCKING filesystem
that appears to do so is XFS, which uses an opencoded variant of the
i_dio_count scheme.

Behaviour doesn't change for filesystems never calling inode_dio_wait.
For ext4 behaviour changes when using the dioread_nonlock option, which
previously was missing any protection between truncate and direct I/O reads.
For ocfs2 the handcrafted i_dio_count manipulations are replaced with
the common code now enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-24 15:18:52.000000000 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-24 15:22:25.341577750 +0200
@@ -297,8 +297,7 @@ static ssize_t dio_complete(struct dio *
 		aio_complete(dio->iocb, ret, 0);
 	}
 
-	if (dio->flags & DIO_LOCKING)
-		inode_dio_done(dio->inode);
+	inode_dio_done(dio->inode);
 	return ret;
 }
 
@@ -1185,14 +1184,16 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    The i_dio_count counter keeps track of the number of outstanding
- *    direct I/O requests, and truncate waits for it to reach zero.
- *    New references to i_dio_count must only be grabbed with i_mutex
- *    held.
- *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
+ *
+ * To help with locking against truncate we incremented the i_dio_count
+ * counter before starting direct I/O, and decrement it once we are done.
+ * Truncate can wait for it to reach zero to provide exclusion.  It is
+ * expected that filesystem provide exclusion between new direct I/O
+ * and truncates.  For DIO_LOCKING filesystems this is done by i_mutex,
+ * but other filesystems need to take care of this on their own.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1270,14 +1271,14 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 			}
 		}
-
-		/*
-		 * Will be decremented at I/O completion time.
-		 */
-		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
+	 * Will be decremented at I/O completion time.
+	 */
+	atomic_inc(&inode->i_dio_count);
+
+	/*
 	 * For file extending writes updating i_size before data
 	 * writeouts complete can expose uninitialized blocks. So
 	 * even for AIO, we need to wait for i/o to complete before
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-24 15:18:52.000000000 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-24 15:22:02.918245553 +0200
@@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc
 	/* this io's submitter should not have unlocked this before we could */
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
-	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		inode_dio_done(inode);
+	if (ocfs2_iocb_is_sem_locked(iocb))
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	ocfs2_iocb_clear_rw_locked(iocb);
 
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-24 15:18:53.268255154 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-24 15:20:41.668249665 +0200
@@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru
 relock:
 	/* to match setattr's i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2292,7 +2291,6 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		inode_dio_done(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2379,10 +2377,8 @@ out:
 		ocfs2_rw_unlock(inode, rw_level);
 
 out_sems:
-	if (have_alloc_sem) {
-		inode_dio_done(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	mutex_unlock(&inode->i_mutex);
 
@@ -2533,7 +2529,6 @@ static ssize_t ocfs2_file_aio_read(struc
 	 */
 	if (filp->f_flags & O_DIRECT) {
 		have_alloc_sem = 1;
-		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,10 +2570,9 @@ static ssize_t ocfs2_file_aio_read(struc
 	}
 
 bail:
-	if (have_alloc_sem) {
-		inode_dio_done(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
+
 	if (rw_level != -1)
 		ocfs2_rw_unlock(inode, rw_level);
 


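The counter scheme that patch 7 generalizes can be sketched in userspace C.
This is only an illustration under stated assumptions, not the kernel
implementation: every sketch_ name is hypothetical, a C11 atomic stands in
for i_dio_count, and a pthread mutex plus condition variable stand in for
the hashed waitqueue mentioned in the cover letter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the inode fields used here. */
struct sketch_inode {
	atomic_int dio_count;		/* outstanding direct I/O requests */
	pthread_mutex_t lock;		/* serializes waiters against wakeups */
	pthread_cond_t idle;		/* signalled when dio_count hits zero */
};

static void sketch_inode_init(struct sketch_inode *inode)
{
	atomic_init(&inode->dio_count, 0);
	pthread_mutex_init(&inode->lock, NULL);
	pthread_cond_init(&inode->idle, NULL);
}

/* Submission side: take a reference before starting the I/O. */
static void sketch_dio_begin(struct sketch_inode *inode)
{
	atomic_fetch_add(&inode->dio_count, 1);
}

/* Completion side: drop the reference, wake waiters on the last one. */
static void sketch_dio_done(struct sketch_inode *inode)
{
	if (atomic_fetch_sub(&inode->dio_count, 1) == 1) {
		pthread_mutex_lock(&inode->lock);
		pthread_cond_broadcast(&inode->idle);
		pthread_mutex_unlock(&inode->lock);
	}
}

/* Truncate side: sleep until no direct I/O is outstanding. */
static void sketch_dio_wait(struct sketch_inode *inode)
{
	pthread_mutex_lock(&inode->lock);
	while (atomic_load(&inode->dio_count) != 0)
		pthread_cond_wait(&inode->idle, &inode->lock);
	pthread_mutex_unlock(&inode->lock);
}
```

Nothing in the sketch records which thread called sketch_dio_begin(), so the
matching sketch_dio_done() may run in any context, which is what the
non-owner rw_semaphore calls were needed for before.  As in the patch,
keeping new submissions out while a truncate waits is left to the caller.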

* [PATCH 8/9] fs: simplify the blockdev_direct_IO prototype
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2011-06-24 18:29 ` [PATCH 7/9] fs: always maintain i_dio_count Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  2011-06-24 18:29 ` [PATCH 9/9] fs: move inode_dio_done to the end_io handler Christoph Hellwig
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-simplify-blockdev_direct_IO-prototype --]
[-- Type: text/plain, Size: 7977 bytes --]

Simple filesystems always pass inode->i_sb->s_bdev as the block device
argument, and never need an end_io handler.  Let's simplify things for
them and for my grepping activity by dropping these arguments.  The
only thing not falling into that scheme is ext4, which passes an
end_io handler without needing special flags (yet), but given how
messy the direct I/O code there is, using __blockdev_direct_IO
directly in one more of the three call sites isn't going to make a
large difference anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2011-06-24 15:27:33.131562166 +0200
+++ linux-2.6/fs/ext2/inode.c	2011-06-24 15:43:31.164846996 +0200
@@ -843,8 +843,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 	struct inode *inode = mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
-				iov, offset, nr_segs, ext2_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 ext2_get_block);
 	if (ret < 0 && (rw & WRITE))
 		ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
 	return ret;
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c	2011-06-24 15:27:33.151562165 +0200
+++ linux-2.6/fs/ext3/inode.c	2011-06-24 15:28:09.048226915 +0200
@@ -1816,9 +1816,8 @@ static ssize_t ext3_direct_IO(int rw, st
 	}
 
 retry:
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext3_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 ext3_get_block);
 	/*
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-24 15:27:33.171562165 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-24 15:32:40.694879881 +0200
@@ -3501,10 +3501,8 @@ retry:
 				 offset, nr_segs,
 				 ext4_get_block, NULL, NULL, 0);
 	else {
-		ret = blockdev_direct_IO(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext4_get_block, NULL);
+		ret = blockdev_direct_IO(rw, iocb, inode, iov,
+				 offset, nr_segs, ext4_get_block);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
@@ -3748,11 +3746,13 @@ static ssize_t ext4_ext_direct_IO(int rw
 			EXT4_I(inode)->cur_aio_dio = iocb->private;
 		}
 
-		ret = blockdev_direct_IO(rw, iocb, inode,
+		ret = __blockdev_direct_IO(rw, iocb, inode,
 					 inode->i_sb->s_bdev, iov,
 					 offset, nr_segs,
 					 ext4_get_block_write,
-					 ext4_end_io_dio);
+					 ext4_end_io_dio,
+					 NULL,
+					 DIO_LOCKING | DIO_SKIP_HOLES);
 		if (iocb->private)
 			EXT4_I(inode)->cur_aio_dio = NULL;
 		/*
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2011-06-24 15:27:33.188228830 +0200
+++ linux-2.6/fs/fat/inode.c	2011-06-24 15:32:48.341546189 +0200
@@ -211,8 +211,8 @@ static ssize_t fat_direct_IO(int rw, str
 	 * FAT need to use the DIO_LOCKING for avoiding the race
 	 * condition of fat_get_block() and ->truncate().
 	 */
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
-				 iov, offset, nr_segs, fat_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 fat_get_block);
 	if (ret < 0 && (rw & WRITE))
 		fat_write_failed(mapping, offset + iov_length(iov, nr_segs));
 
Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c	2011-06-24 15:27:33.228228829 +0200
+++ linux-2.6/fs/hfs/inode.c	2011-06-24 15:29:45.218222143 +0200
@@ -123,8 +123,8 @@ static ssize_t hfs_direct_IO(int rw, str
 	struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				  offset, nr_segs, hfs_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 hfs_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c	2011-06-24 15:27:33.244895494 +0200
+++ linux-2.6/fs/hfsplus/inode.c	2011-06-24 15:29:59.911554734 +0200
@@ -119,8 +119,8 @@ static ssize_t hfsplus_direct_IO(int rw,
 	struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				  offset, nr_segs, hfsplus_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 hfsplus_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/jfs/inode.c
===================================================================
--- linux-2.6.orig/fs/jfs/inode.c	2011-06-24 15:27:33.264895492 +0200
+++ linux-2.6/fs/jfs/inode.c	2011-06-24 15:30:11.701554144 +0200
@@ -329,8 +329,8 @@ static ssize_t jfs_direct_IO(int rw, str
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				offset, nr_segs, jfs_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 jfs_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/nilfs2/inode.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/inode.c	2011-06-24 15:27:33.284895493 +0200
+++ linux-2.6/fs/nilfs2/inode.c	2011-06-24 15:30:24.968220135 +0200
@@ -259,8 +259,8 @@ nilfs_direct_IO(int rw, struct kiocb *io
 		return 0;
 
 	/* Needs synchronization with the cleaner */
-	size = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				  offset, nr_segs, nilfs_get_block, NULL);
+	size = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				  nilfs_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2011-06-24 15:27:33.324895489 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2011-06-24 15:30:38.311552796 +0200
@@ -3068,9 +3068,8 @@ static ssize_t reiserfs_direct_IO(int rw
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				  offset, nr_segs,
-				  reiserfs_get_blocks_direct_io, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				  reiserfs_get_blocks_direct_io);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2011-06-24 15:27:33.361562155 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-24 15:46:57.914836526 +0200
@@ -2381,12 +2381,11 @@ ssize_t __blockdev_direct_IO(int rw, str
 	dio_submit_t submit_io,	int flags);
 
 static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
-	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
-	loff_t offset, unsigned long nr_segs, get_block_t get_block,
-	dio_iodone_t end_io)
+		struct inode *inode, const struct iovec *iov, loff_t offset,
+		unsigned long nr_segs, get_block_t get_block)
 {
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				    nr_segs, get_block, end_io, NULL,
+	return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
+				    offset, nr_segs, get_block, NULL, NULL,
 				    DIO_LOCKING | DIO_SKIP_HOLES);
 }
 #endif


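The shape of the prototype change in patch 8 can be shown with a reduced
sketch: a thin inline wrapper derives the block device from the inode and
hardwires the arguments simple callers never vary.  All types and names
below are hypothetical stand-ins, and the "full" function only records what
it was given instead of doing any I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct sketch_bdev { int id; };
struct sketch_sb { struct sketch_bdev *s_bdev; };
struct sketch_inode { struct sketch_sb *i_sb; };

typedef void (*sketch_end_io_t)(void);

#define SKETCH_DIO_LOCKING	0x01
#define SKETCH_DIO_SKIP_HOLES	0x02

/* Records the arguments the full-featured entry point received. */
struct sketch_call {
	struct sketch_bdev *bdev;
	sketch_end_io_t end_io;
	int flags;
};

/* Stand-in for the full entry point that takes every argument. */
static struct sketch_call sketch_full_direct_io(struct sketch_inode *inode,
		struct sketch_bdev *bdev, sketch_end_io_t end_io, int flags)
{
	struct sketch_call call = { bdev, end_io, flags };

	(void)inode;	/* unused in this sketch */
	return call;
}

/* Simplified wrapper: no bdev or end_io argument for simple callers. */
static inline struct sketch_call sketch_direct_io(struct sketch_inode *inode)
{
	return sketch_full_direct_io(inode, inode->i_sb->s_bdev, NULL,
				     SKETCH_DIO_LOCKING | SKETCH_DIO_SKIP_HOLES);
}
```

A caller that needs its own completion handler or different flags would
call the full entry point directly, which is what the ext4 hunk does.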

* [PATCH 9/9] fs: move inode_dio_done to the end_io handler
  2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2011-06-24 18:29 ` [PATCH 8/9] fs: simplify the blockdev_direct_IO prototype Christoph Hellwig
@ 2011-06-24 18:29 ` Christoph Hellwig
  8 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2011-06-24 18:29 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-move-inode_dio_done-to-end_io --]
[-- Type: text/plain, Size: 3020 bytes --]

For filesystems that delay their end_io processing we should keep our
i_dio_count until the processing is done.  Enable this by moving
the inode_dio_done call to the end_io handler if one exists.  Note that
the actual move to the workqueue for ext4 and XFS is not done in
this patch yet, but left to the filesystem maintainers.  At least
for XFS it's not needed yet either as XFS has an internal equivalent
to i_dio_count.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-24 15:27:14.124896461 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-24 15:47:03.358169584 +0200
@@ -293,11 +293,12 @@ static ssize_t dio_complete(struct dio *
 	if (dio->end_io && dio->result) {
 		dio->end_io(dio->iocb, offset, transferred,
 			    dio->map_bh.b_private, ret, is_async);
-	} else if (is_async) {
-		aio_complete(dio->iocb, ret, 0);
+	} else {
+		if (is_async)
+			aio_complete(dio->iocb, ret, 0);
+		inode_dio_done(dio->inode);
 	}
 
-	inode_dio_done(dio->inode);
 	return ret;
 }
 
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-24 15:47:13.111502423 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-24 15:50:13.471493302 +0200
@@ -3573,6 +3573,7 @@ static void ext4_end_io_dio(struct kiocb
 			    ssize_t size, void *private, int ret,
 			    bool is_async)
 {
+	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
         ext4_io_end_t *io_end = iocb->private;
 	struct workqueue_struct *wq;
 	unsigned long flags;
@@ -3594,6 +3595,7 @@ static void ext4_end_io_dio(struct kiocb
 out:
 		if (is_async)
 			aio_complete(iocb, ret, 0);
+		inode_dio_done(inode);
 		return;
 	}
 
@@ -3614,6 +3616,9 @@ out:
 	/* queue the work to convert unwritten extents to written */
 	queue_work(wq, &io_end->work);
 	iocb->private = NULL;
+
+	/* XXX: probably should move into the real I/O completion handler */
+	inode_dio_done(inode);
 }
 
 static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-24 15:49:26.731495659 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-24 15:49:48.324827901 +0200
@@ -577,6 +577,7 @@ static void ocfs2_dio_end_io(struct kioc
 
 	if (is_async)
 		aio_complete(iocb, ret, 0);
+	inode_dio_done(inode);
 }
 
 /*
Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-06-24 15:48:25.581498754 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c	2011-06-24 15:51:00.874824252 +0200
@@ -1339,6 +1339,9 @@ xfs_end_io_direct_write(
 	} else {
 		xfs_finish_ioend_sync(ioend);
 	}
+
+	/* XXX: probably should move into the real I/O completion handler */
+	inode_dio_done(ioend->io_inode);
 }
 
 STATIC ssize_t


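The ownership rule that patch 9 introduces can be sketched as follows: when
an end_io handler is present and I/O was submitted, the handler becomes
responsible for dropping the count; otherwise the common completion path
drops it.  This is a simplified illustration with hypothetical names, a
plain int in place of the atomic counter, and an imagined deferring handler
of the kind the commit message leaves to the filesystem maintainers.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dio;
typedef void (*sketch_end_io_t)(struct sketch_dio *dio);

/* Hypothetical, heavily reduced stand-in for struct dio. */
struct sketch_dio {
	int *dio_count;		/* stand-in for inode->i_dio_count */
	sketch_end_io_t end_io;	/* optional filesystem completion handler */
	int result;		/* non-zero if any I/O was submitted */
	int deferred;		/* set while the handler has work queued */
};

static void sketch_inode_dio_done(int *dio_count)
{
	(*dio_count)--;
}

/* Common completion: hand the reference to the handler if there is one. */
static void sketch_dio_complete(struct sketch_dio *dio)
{
	if (dio->end_io && dio->result)
		dio->end_io(dio);
	else
		sketch_inode_dio_done(dio->dio_count);
}

/* Imagined handler that queues its work and keeps the reference. */
static void sketch_deferring_end_io(struct sketch_dio *dio)
{
	dio->deferred = 1;
}

/* Imagined deferred worker: drops the reference once processing is done. */
static void sketch_finish_deferred(struct sketch_dio *dio)
{
	if (dio->deferred) {
		dio->deferred = 0;
		sketch_inode_dio_done(dio->dio_count);
	}
}
```

With this split a waiter on the count is not released until the deferred
processing has finished, rather than as soon as the I/O itself completes.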

Thread overview: 11+ messages
2011-06-24 18:29 [PATCH 0/9] remove i_alloc_sem V2 Christoph Hellwig
2011-06-24 18:29 ` [PATCH 1/9] fat: remove i_alloc_sem abuse Christoph Hellwig
2011-06-24 18:29 ` [PATCH 2/9] ext4: Rewrite ext4_page_mkwrite() to use generic helpers Christoph Hellwig
2011-06-24 18:29 ` [PATCH 3/9] fs: simplify handling of zero sized reads in __blockdev_direct_IO Christoph Hellwig
2011-06-24 18:29 ` [PATCH 4/9] fs: kill i_alloc_sem Christoph Hellwig
2011-06-24 18:34   ` Christoph Hellwig
2011-06-24 18:29 ` [PATCH 5/9] rw_semaphore: remove up/down_read_non_owner Christoph Hellwig
2011-06-24 18:29 ` [PATCH 6/9] fs: move inode_dio_wait calls into ->setattr Christoph Hellwig
2011-06-24 18:29 ` [PATCH 7/9] fs: always maintain i_dio_count Christoph Hellwig
2011-06-24 18:29 ` [PATCH 8/9] fs: simplify the blockdev_direct_IO prototype Christoph Hellwig
2011-06-24 18:29 ` [PATCH 9/9] fs: move inode_dio_done to the end_io handler Christoph Hellwig
