+ fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

All of lore.kernel.org
 help / color / mirror / Atom feed

* + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
@ 2007-05-29 21:19 akpm
  2007-05-30  3:13 ` Nick Piggin
  2007-06-13 13:40 ` Dmitriy Monakhov
  0 siblings, 2 replies; 20+ messages in thread
From: akpm @ 2007-05-29 21:19 UTC (permalink / raw)
  To: mm-commits; +Cc: npiggin, mark.fasheh


The patch titled
     fs: introduce write_begin, write_end, and perform_write aops
has been added to the -mm tree.  Its filename is
     fs-introduce-write_begin-write_end-and-perform_write-aops.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

------------------------------------------------------
Subject: fs: introduce write_begin, write_end, and perform_write aops
From: Nick Piggin <npiggin@suse.de>

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

[mark.fasheh@oracle.com: API design contributions, code review and fixes]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/filesystems/Locking |    9 -
 Documentation/filesystems/vfs.txt |   45 +++++
 drivers/block/loop.c              |   75 +++-----
 fs/buffer.c                       |  198 +++++++++++++++++++----
 fs/libfs.c                        |   44 +++++
 fs/namei.c                        |   47 +----
 fs/splice.c                       |   70 --------
 include/linux/buffer_head.h       |   10 +
 include/linux/fs.h                |   30 +++
 include/linux/pagemap.h           |    2 
 mm/filemap.c                      |  238 ++++++++++++++++++++++++----
 11 files changed, 562 insertions(+), 206 deletions(-)

diff -puN Documentation/filesystems/Locking~fs-introduce-write_begin-write_end-and-perform_write-aops Documentation/filesystems/Locking
--- a/Documentation/filesystems/Locking~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/Documentation/filesystems/Locking
@@ -178,15 +178,18 @@ prototypes:
 locking rules:
 	All except set_page_dirty may block
 
-			BKL	PageLocked(page)
+			BKL	PageLocked(page)	i_sem
 writepage:		no	yes, unlocks (see below)
 readpage:		no	yes, unlocks
 sync_page:		no	maybe
 writepages:		no
 set_page_dirty		no	no
 readpages:		no
-prepare_write:		no	yes
-commit_write:		no	yes
+prepare_write:		no	yes			yes
+commit_write:		no	yes			yes
+write_begin:		no	locks the page		yes
+write_end:		no	yes, unlocks		yes
+perform_write:		no	n/a			yes
 bmap:			yes
 invalidatepage:		no	yes
 releasepage:		no	yes
diff -puN Documentation/filesystems/vfs.txt~fs-introduce-write_begin-write_end-and-perform_write-aops Documentation/filesystems/vfs.txt
--- a/Documentation/filesystems/vfs.txt~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/Documentation/filesystems/vfs.txt
@@ -534,6 +534,12 @@ struct address_space_operations {
 			struct list_head *pages, unsigned nr_pages);
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
 	sector_t (*bmap)(struct address_space *, sector_t);
 	int (*invalidatepage) (struct page *, unsigned long);
 	int (*releasepage) (struct page *, int);
@@ -629,6 +635,45 @@ struct address_space_operations {
         operations.  It should avoid returning an error if possible -
         errors should have been handled by prepare_write.
 
+  write_begin: This is intended as a replacement for prepare_write. The
+	key differences being that:
+		- it returns a locked page (in *pagep) rather than being
+		  given a pre locked page;
+		- it must be able to cope with short writes (where the
+		  length passed to write_begin is greater than the number
+		  of bytes copied into the page).
+
+	Called by the generic buffered write code to ask the filesystem to
+	prepare to write len bytes at the given offset in the file. The
+	address_space should check that the write will be able to complete,
+	by allocating space if necessary and doing any other internal
+	housekeeping.  If the write will update parts of any basic-blocks on
+	storage, then those blocks should be pre-read (if they haven't been
+	read already) so that the updated blocks can be written out properly.
+
+        The filesystem must return the locked pagecache page for the specified
+	offset, in *pagep, for the caller to write into.
+
+	flags is a field for AOP_FLAG_xxx flags, described in
+	include/linux/fs.h.
+
+        A void * may be returned in fsdata, which then gets passed into
+        write_end.
+
+        Returns 0 on success; < 0 on failure (which is the error code), in
+	which case write_end is not called.
+
+  write_end: After a successful write_begin, and data copy, write_end must
+        be called. len is the original len passed to write_begin, and copied
+        is the amount that was able to be copied (copied == len is always true
+	if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
+
+        The filesystem must take care of unlocking the page and releasing it
+        refcount, and updating i_size.
+
+        Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
+        that were able to be copied into pagecache.
+
   bmap: called by the VFS to map a logical block offset within object to
   	physical block number. This method is used by the FIBMAP
   	ioctl and for working with swap-files.  To be able to swap to
diff -puN drivers/block/loop.c~fs-introduce-write_begin-write_end-and-perform_write-aops drivers/block/loop.c
--- a/drivers/block/loop.c~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/drivers/block/loop.c
@@ -202,14 +202,13 @@ lo_do_transfer(struct loop_device *lo, i
  * do_lo_send_aops - helper for writing data to a loop device
  *
  * This is the fast version for backing filesystems which implement the address
- * space operations prepare_write and commit_write.
+ * space operations write_begin and write_end.
  */
 static int do_lo_send_aops(struct loop_device *lo, struct bio_vec *bvec,
-		int bsize, loff_t pos, struct page *page)
+		int bsize, loff_t pos, struct page *unused)
 {
 	struct file *file = lo->lo_backing_file; /* kudos to NFsckingS */
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *aops = mapping->a_ops;
 	pgoff_t index;
 	unsigned offset, bv_offs;
 	int len, ret;
@@ -221,63 +220,47 @@ static int do_lo_send_aops(struct loop_d
 	len = bvec->bv_len;
 	while (len > 0) {
 		sector_t IV;
-		unsigned size;
+		unsigned size, copied;
 		int transfer_result;
+		struct page *page;
+		void *fsdata;
 
 		IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
 		size = PAGE_CACHE_SIZE - offset;
 		if (size > len)
 			size = len;
-		page = grab_cache_page(mapping, index);
-		if (unlikely(!page))
+
+		ret = pagecache_write_begin(file, mapping, pos, size, 0,
+							&page, &fsdata);
+		if (ret)
 			goto fail;
-		ret = aops->prepare_write(file, page, offset,
-					  offset + size);
-		if (unlikely(ret)) {
-			if (ret == AOP_TRUNCATED_PAGE) {
-				page_cache_release(page);
-				continue;
-			}
-			goto unlock;
-		}
+
 		transfer_result = lo_do_transfer(lo, WRITE, page, offset,
 				bvec->bv_page, bv_offs, size, IV);
-		if (unlikely(transfer_result)) {
-			/*
-			 * The transfer failed, but we still write the data to
-			 * keep prepare/commit calls balanced.
-			 */
-			printk(KERN_ERR "loop: transfer error block %llu\n",
-			       (unsigned long long)index);
-			zero_user_page(page, offset, size, KM_USER0);
-		}
-		flush_dcache_page(page);
-		ret = aops->commit_write(file, page, offset,
-					 offset + size);
-		if (unlikely(ret)) {
-			if (ret == AOP_TRUNCATED_PAGE) {
-				page_cache_release(page);
-				continue;
-			}
-			goto unlock;
-		}
+		copied = size;
 		if (unlikely(transfer_result))
-			goto unlock;
-		bv_offs += size;
-		len -= size;
+			copied = 0;
+
+		ret = pagecache_write_end(file, mapping, pos, size, copied,
+							page, fsdata);
+		if (ret < 0)
+			goto fail;
+		if (ret < copied)
+			copied = ret;
+
+		if (unlikely(transfer_result))
+			goto fail;
+
+		bv_offs += copied;
+		len -= copied;
 		offset = 0;
 		index++;
-		pos += size;
-		unlock_page(page);
-		page_cache_release(page);
+		pos += copied;
 	}
 	ret = 0;
 out:
 	mutex_unlock(&mapping->host->i_mutex);
 	return ret;
-unlock:
-	unlock_page(page);
-	page_cache_release(page);
 fail:
 	ret = -1;
 	goto out;
@@ -311,7 +294,7 @@ static int __do_lo_send_write(struct fil
  * do_lo_send_direct_write - helper for writing data to a loop device
  *
  * This is the fast, non-transforming version for backing filesystems which do
- * not implement the address space operations prepare_write and commit_write.
+ * not implement the address space operations write_begin and write_end.
  * It uses the write file operation which should be present on all writeable
  * filesystems.
  */
@@ -330,7 +313,7 @@ static int do_lo_send_direct_write(struc
  * do_lo_send_write - helper for writing data to a loop device
  *
  * This is the slow, transforming version for filesystems which do not
- * implement the address space operations prepare_write and commit_write.  It
+ * implement the address space operations write_begin and write_end.  It
  * uses the write file operation which should be present on all writeable
  * filesystems.
  *
@@ -762,7 +745,7 @@ static int loop_set_fd(struct loop_devic
 		 */
 		if (!file->f_op->sendfile)
 			goto out_putf;
-		if (aops->prepare_write && aops->commit_write)
+		if (aops->prepare_write || aops->write_begin)
 			lo_flags |= LO_FLAGS_USE_AOPS;
 		if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
 			lo_flags |= LO_FLAGS_READ_ONLY;
diff -puN fs/buffer.c~fs-introduce-write_begin-write_end-and-perform_write-aops fs/buffer.c
--- a/fs/buffer.c~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/fs/buffer.c
@@ -1742,6 +1742,48 @@ recover:
 	goto done;
 }
 
+/*
+ * If a page has any new buffers, zero them out here, and mark them uptodate
+ * and dirty so they'll be written out (in order to prevent uninitialised
+ * block data from leaking). And clear the new bit.
+ */
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+{
+	unsigned int block_start, block_end;
+	struct buffer_head *head, *bh;
+
+	BUG_ON(!PageLocked(page));
+	if (!page_has_buffers(page))
+		return;
+
+	bh = head = page_buffers(page);
+	block_start = 0;
+	do {
+		block_end = block_start + bh->b_size;
+
+		if (buffer_new(bh)) {
+			if (block_end > from && block_start < to) {
+				if (!PageUptodate(page)) {
+					unsigned start, size;
+
+					start = max(from, block_start);
+					size = min(to, block_end) - start;
+
+					zero_user_page(page, start, size, KM_USER0);
+					set_buffer_uptodate(bh);
+				}
+
+				clear_buffer_new(bh);
+				mark_buffer_dirty(bh);
+			}
+		}
+
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (bh != head);
+}
+EXPORT_SYMBOL(page_zero_new_buffers);
+
 static int __block_prepare_write(struct inode *inode, struct page *page,
 		unsigned from, unsigned to, get_block_t *get_block)
 {
@@ -1826,38 +1868,8 @@ static int __block_prepare_write(struct 
 		if (!buffer_uptodate(*wait_bh))
 			err = -EIO;
 	}
-	if (!err) {
-		bh = head;
-		do {
-			if (buffer_new(bh))
-				clear_buffer_new(bh);
-		} while ((bh = bh->b_this_page) != head);
-		return 0;
-	}
-	/* Error case: */
-	/*
-	 * Zero out any newly allocated blocks to avoid exposing stale
-	 * data.  If BH_New is set, we know that the block was newly
-	 * allocated in the above loop.
-	 */
-	bh = head;
-	block_start = 0;
-	do {
-		block_end = block_start+blocksize;
-		if (block_end <= from)
-			goto next_bh;
-		if (block_start >= to)
-			break;
-		if (buffer_new(bh)) {
-			clear_buffer_new(bh);
-			zero_user_page(page, block_start, bh->b_size, KM_USER0);
-			set_buffer_uptodate(bh);
-			mark_buffer_dirty(bh);
-		}
-next_bh:
-		block_start = block_end;
-		bh = bh->b_this_page;
-	} while (bh != head);
+	if (unlikely(err))
+		page_zero_new_buffers(page, from, to);
 	return err;
 }
 
@@ -1882,6 +1894,7 @@ static int __block_commit_write(struct i
 			set_buffer_uptodate(bh);
 			mark_buffer_dirty(bh);
 		}
+		clear_buffer_new(bh);
 	}
 
 	/*
@@ -1896,6 +1909,127 @@ static int __block_commit_write(struct i
 }
 
 /*
+ * block_write_begin takes care of the basic task of block allocation and
+ * bringing partial write blocks uptodate first.
+ *
+ * If *pagep is not NULL, then block_write_begin uses the locked page
+ * at *pagep rather than allocating its own. In this case, the page will
+ * not be unlocked or deallocated on failure.
+ */
+int block_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata,
+			get_block_t *get_block)
+{
+	struct inode *inode = mapping->host;
+	int status = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned start, end;
+	int ownpage = 0;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	start = pos & (PAGE_CACHE_SIZE - 1);
+	end = start + len;
+
+	page = *pagep;
+	if (page == NULL) {
+		ownpage = 1;
+		page = __grab_cache_page(mapping, index);
+		if (!page) {
+			status = -ENOMEM;
+			goto out;
+		}
+		*pagep = page;
+	} else
+		BUG_ON(!PageLocked(page));
+
+	status = __block_prepare_write(inode, page, start, end, get_block);
+	if (unlikely(status)) {
+		ClearPageUptodate(page);
+
+		if (ownpage) {
+			unlock_page(page);
+			page_cache_release(page);
+
+			/*
+			 * prepare_write() may have instantiated a few blocks
+			 * outside i_size.  Trim these off again. Don't need
+			 * i_size_read because we hold i_mutex.
+			 */
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+		}
+		goto out;
+	}
+
+out:
+	return status;
+}
+EXPORT_SYMBOL(block_write_begin);
+
+int block_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+	unsigned start;
+
+	start = pos & (PAGE_CACHE_SIZE - 1);
+
+	if (unlikely(copied < len)) {
+		/*
+		 * The buffers that were written will now be uptodate, so we
+		 * don't have to worry about a readpage reading them and
+		 * overwriting a partial write. However if we have encountered
+		 * a short write and only partially written into a buffer, it
+		 * will not be marked uptodate, so a readpage might come in and
+		 * destroy our partial write.
+		 *
+		 * Do the simplest thing, and just treat any short write to a
+		 * non uptodate page as a zero-length write, and force the
+		 * caller to redo the whole thing.
+		 */
+		if (!PageUptodate(page))
+			copied = 0;
+
+		page_zero_new_buffers(page, start+copied, start+len);
+	}
+	flush_dcache_page(page);
+
+	/* This could be a short (even 0-length) commit */
+	__block_commit_write(inode, page, start, start+copied);
+
+	return copied;
+}
+EXPORT_SYMBOL(block_write_end);
+
+int generic_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+
+	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+	unlock_page(page);
+	mark_page_accessed(page); /* XXX: put this in caller? */
+	page_cache_release(page);
+
+	/*
+	 * No need to use i_size_read() here, the i_size
+	 * cannot change under us because we hold i_mutex.
+	 */
+	if (pos+copied > inode->i_size) {
+		i_size_write(inode, pos+copied);
+		mark_inode_dirty(inode);
+	}
+
+	return copied;
+}
+EXPORT_SYMBOL(generic_write_end);
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
diff -puN fs/libfs.c~fs-introduce-write_begin-write_end-and-perform_write-aops fs/libfs.c
--- a/fs/libfs.c~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/fs/libfs.c
@@ -351,6 +351,26 @@ int simple_prepare_write(struct file *fi
 	return 0;
 }
 
+int simple_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct page *page;
+	pgoff_t index;
+	unsigned from;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+
+	*pagep = page;
+
+	return simple_prepare_write(file, page, from, from+len);
+}
+
 int simple_commit_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
@@ -369,6 +389,28 @@ int simple_commit_write(struct file *fil
 	return 0;
 }
 
+int simple_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	/* zero the stale part of the page if we did a short copy */
+	if (copied < len) {
+		void *kaddr = kmap_atomic(page, KM_USER0);
+		memset(kaddr + from + copied, 0, len - copied);
+		flush_dcache_page(page);
+		kunmap_atomic(kaddr, KM_USER0);
+	}
+
+	simple_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 /*
  * the inodes created here are not hashed. If you use iunique to generate
  * unique inode values later for this filesystem, then you must take care
@@ -642,6 +684,8 @@ EXPORT_SYMBOL(dcache_dir_open);
 EXPORT_SYMBOL(dcache_readdir);
 EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(get_sb_pseudo);
+EXPORT_SYMBOL(simple_write_begin);
+EXPORT_SYMBOL(simple_write_end);
 EXPORT_SYMBOL(simple_commit_write);
 EXPORT_SYMBOL(simple_dir_inode_operations);
 EXPORT_SYMBOL(simple_dir_operations);
diff -puN fs/namei.c~fs-introduce-write_begin-write_end-and-perform_write-aops fs/namei.c
--- a/fs/namei.c~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/fs/namei.c
@@ -2697,53 +2697,30 @@ int __page_symlink(struct inode *inode, 
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
+	void *fsdata;
 	int err;
 	char *kaddr;
 
 retry:
-	err = -ENOMEM;
-	page = find_or_create_page(mapping, 0, gfp_mask);
-	if (!page)
-		goto fail;
-	err = mapping->a_ops->prepare_write(NULL, page, 0, len-1);
-	if (err == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry;
-	}
+	err = pagecache_write_begin(NULL, mapping, 0, PAGE_CACHE_SIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
 	if (err)
-		goto fail_map;
+		goto fail;
+
 	kaddr = kmap_atomic(page, KM_USER0);
 	memcpy(kaddr, symname, len-1);
+	memset(kaddr+len-1, 0, PAGE_CACHE_SIZE-(len-1));
 	kunmap_atomic(kaddr, KM_USER0);
-	err = mapping->a_ops->commit_write(NULL, page, 0, len-1);
-	if (err == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry;
-	}
-	if (err)
-		goto fail_map;
-	/*
-	 * Notice that we are _not_ going to block here - end of page is
-	 * unmapped, so this will only try to map the rest of page, see
-	 * that it is unmapped (typically even will not look into inode -
-	 * ->i_size will be enough for everything) and zero it out.
-	 * OTOH it's obviously correct and should make the page up-to-date.
-	 */
-	if (!PageUptodate(page)) {
-		err = mapping->a_ops->readpage(NULL, page);
-		if (err != AOP_TRUNCATED_PAGE)
-			wait_on_page_locked(page);
-	} else {
-		unlock_page(page);
-	}
-	page_cache_release(page);
+
+	err = pagecache_write_end(NULL, mapping, 0, PAGE_CACHE_SIZE, PAGE_CACHE_SIZE,
+							page, fsdata);
 	if (err < 0)
 		goto fail;
+	if (err < PAGE_CACHE_SIZE)
+		goto retry;
+
 	mark_inode_dirty(inode);
 	return 0;
-fail_map:
-	unlock_page(page);
-	page_cache_release(page);
 fail:
 	return err;
 }
diff -puN fs/splice.c~fs-introduce-write_begin-write_end-and-perform_write-aops fs/splice.c
--- a/fs/splice.c~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/fs/splice.c
@@ -558,7 +558,7 @@ static int pipe_to_file(struct pipe_inod
 	struct address_space *mapping = file->f_mapping;
 	unsigned int offset, this_len;
 	struct page *page;
-	pgoff_t index;
+	void *fsdata;
 	int ret;
 
 	/*
@@ -568,49 +568,16 @@ static int pipe_to_file(struct pipe_inod
 	if (unlikely(ret))
 		return ret;
 
-	index = sd->pos >> PAGE_CACHE_SHIFT;
 	offset = sd->pos & ~PAGE_CACHE_MASK;
 
 	this_len = sd->len;
 	if (this_len + offset > PAGE_CACHE_SIZE)
 		this_len = PAGE_CACHE_SIZE - offset;
 
-find_page:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		ret = -ENOMEM;
-		page = page_cache_alloc_cold(mapping);
-		if (unlikely(!page))
-			goto out_ret;
-
-		/*
-		 * This will also lock the page
-		 */
-		ret = add_to_page_cache_lru(page, mapping, index,
-					    GFP_KERNEL);
-		if (unlikely(ret))
-			goto out;
-	}
-
-	ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
-	if (unlikely(ret)) {
-		loff_t isize = i_size_read(mapping->host);
-
-		if (ret != AOP_TRUNCATED_PAGE)
-			unlock_page(page);
-		page_cache_release(page);
-		if (ret == AOP_TRUNCATED_PAGE)
-			goto find_page;
-
-		/*
-		 * prepare_write() may have instantiated a few blocks
-		 * outside i_size.  Trim these off again.
-		 */
-		if (sd->pos + this_len > isize)
-			vmtruncate(mapping->host, isize);
-
-		goto out_ret;
-	}
+	ret = pagecache_write_begin(file, mapping, sd->pos, sd->len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (unlikely(ret))
+		goto out;
 
 	if (buf->page != page) {
 		/*
@@ -620,35 +587,14 @@ find_page:
 		char *dst = kmap_atomic(page, KM_USER1);
 
 		memcpy(dst + offset, src + buf->offset, this_len);
-		flush_dcache_page(page);
 		kunmap_atomic(dst, KM_USER1);
 		buf->ops->unmap(pipe, buf, src);
 	}
 
-	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
-	if (ret) {
-		if (ret == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			goto find_page;
-		}
-		if (ret < 0)
-			goto out;
-		/*
-		 * Partial write has happened, so 'ret' already initialized by
-		 * number of bytes written, Where is nothing we have to do here.
-		 */
-	} else
-		ret = this_len;
-	/*
-	 * Return the number of bytes written and mark page as
-	 * accessed, we are now done!
-	 */
-	mark_page_accessed(page);
-	balance_dirty_pages_ratelimited(mapping);
+	ret = pagecache_write_end(file, mapping, sd->pos, sd->len, sd->len, page, fsdata);
+
 out:
-	page_cache_release(page);
-	unlock_page(page);
-out_ret:
+
 	return ret;
 }
 
diff -puN include/linux/buffer_head.h~fs-introduce-write_begin-write_end-and-perform_write-aops include/linux/buffer_head.h
--- a/include/linux/buffer_head.h~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/include/linux/buffer_head.h
@@ -203,6 +203,16 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int block_write_begin(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page **, void **, get_block_t*);
+int block_write_end(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page *, void *);
+int generic_write_end(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page *, void *);
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
 int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int cont_prepare_write(struct page*, unsigned, unsigned, get_block_t*,
 				loff_t *);
diff -puN include/linux/fs.h~fs-introduce-write_begin-write_end-and-perform_write-aops include/linux/fs.h
--- a/include/linux/fs.h~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/include/linux/fs.h
@@ -391,6 +391,8 @@ enum positive_aop_returns {
 	AOP_TRUNCATED_PAGE	= 0x80001,
 };
 
+#define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
+
 /*
  * oh the beauties of C type declarations.
  */
@@ -410,7 +412,7 @@ size_t iov_iter_copy_from_user_atomic(st
 size_t iov_iter_copy_from_user(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
@@ -451,6 +453,14 @@ struct address_space_operations {
 	 */
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidatepage) (struct page *, unsigned long);
@@ -465,6 +475,18 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 };
 
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
 struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
@@ -1920,6 +1942,12 @@ extern int simple_prepare_write(struct f
 			unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
 				unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
diff -puN include/linux/pagemap.h~fs-introduce-write_begin-write_end-and-perform_write-aops include/linux/pagemap.h
--- a/include/linux/pagemap.h~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/include/linux/pagemap.h
@@ -96,6 +96,8 @@ unsigned find_get_pages_contig(struct ad
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index);
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
diff -puN mm/filemap.c~fs-introduce-write_begin-write_end-and-perform_write-aops mm/filemap.c
--- a/mm/filemap.c~fs-introduce-write_begin-write_end-and-perform_write-aops
+++ a/mm/filemap.c
@@ -1794,11 +1794,10 @@ void iov_iter_advance(struct iov_iter *i
 	i->count -= bytes;
 }
 
-int iov_iter_fault_in_readable(struct iov_iter *i)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
-	size_t seglen = min(i->iov->iov_len - i->iov_offset, i->count);
 	char __user *buf = i->iov->iov_base + i->iov_offset;
-	return fault_in_pages_readable(buf, seglen);
+	return fault_in_pages_readable(buf, bytes);
 }
 
 /*
@@ -1897,6 +1896,93 @@ inline int generic_write_checks(struct f
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int pagecache_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	const struct address_space_operations *aops = mapping->a_ops;
+
+	if (aops->write_begin) {
+		return aops->write_begin(file, mapping, pos, len, flags,
+							pagep, fsdata);
+	} else {
+		int ret;
+		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+		struct inode *inode = mapping->host;
+		struct page *page;
+again:
+		page = __grab_cache_page(mapping, index);
+		*pagep = page;
+		if (!page)
+			return -ENOMEM;
+
+		if (flags & AOP_FLAG_UNINTERRUPTIBLE && !PageUptodate(page)) {
+			/*
+			 * There is no way to resolve a short write situation
+			 * for a !Uptodate page (except by double copying in
+			 * the caller done by generic_perform_write_2copy).
+			 *
+			 * Instead, we have to bring it uptodate here.
+			 */
+			ret = aops->readpage(file, page);
+			page_cache_release(page);
+			if (ret) {
+				if (ret == AOP_TRUNCATED_PAGE)
+					goto again;
+				return ret;
+			}
+			goto again;
+		}
+
+		ret = aops->prepare_write(file, page, offset, offset+len);
+		if (ret) {
+			if (ret != AOP_TRUNCATED_PAGE)
+				unlock_page(page);
+			page_cache_release(page);
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+			if (ret == AOP_TRUNCATED_PAGE)
+				goto again;
+		}
+		return ret;
+	}
+}
+EXPORT_SYMBOL(pagecache_write_begin);
+
+int pagecache_write_end(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
+{
+	const struct address_space_operations *aops = mapping->a_ops;
+	int ret;
+
+	if (aops->write_begin) {
+		ret = aops->write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+	} else {
+		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+		struct inode *inode = mapping->host;
+
+		flush_dcache_page(page);
+		ret = aops->commit_write(file, page, offset, offset+len);
+		unlock_page(page);
+		page_cache_release(page);
+		BUG_ON(ret == AOP_TRUNCATED_PAGE); /* can't deal with */
+
+		if (ret < 0) {
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+		} else if (ret > 0)
+			ret = min_t(size_t, copied, ret);
+		else
+			ret = copied;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(pagecache_write_end);
+
 ssize_t
 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long *nr_segs, loff_t pos, loff_t *ppos,
@@ -1940,8 +2026,7 @@ EXPORT_SYMBOL(generic_file_direct_write)
  * Find or create a page at the given pagecache position. Return the locked
  * page. This function is specifically for buffered writes.
  */
-static struct page *__grab_cache_page(struct address_space *mapping,
-							pgoff_t index)
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index)
 {
 	int status;
 	struct page *page;
@@ -1962,20 +2047,16 @@ repeat:
 	}
 	return page;
 }
+EXPORT_SYMBOL(__grab_cache_page);
 
-ssize_t
-generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
-		unsigned long nr_segs, loff_t pos, loff_t *ppos,
-		size_t count, ssize_t written)
+static ssize_t generic_perform_write_2copy(struct file *file,
+				struct iov_iter *i, loff_t pos)
 {
-	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	struct iov_iter i;
-
-	iov_iter_init(&i, iov, nr_segs, count, written);
+	struct inode *inode = mapping->host;
+	long status = 0;
+	ssize_t written = 0;
 
 	do {
 		struct page *src_page;
@@ -1988,7 +2069,7 @@ generic_file_buffered_write(struct kiocb
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
-						iov_iter_count(&i));
+						iov_iter_count(i));
 
 		/*
 		 * a non-NULL src_page indicates that we're doing the
@@ -2006,7 +2087,7 @@ generic_file_buffered_write(struct kiocb
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(&i))) {
+		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
 			status = -EFAULT;
 			break;
 		}
@@ -2037,7 +2118,7 @@ generic_file_buffered_write(struct kiocb
 			 * same reason as we can't take a page fault with a
 			 * page locked (as explained below).
 			 */
-			copied = iov_iter_copy_from_user(src_page, &i,
+			copied = iov_iter_copy_from_user(src_page, i,
 								offset, bytes);
 			if (unlikely(copied == 0)) {
 				status = -EFAULT;
@@ -2062,7 +2143,6 @@ generic_file_buffered_write(struct kiocb
 				page_cache_release(src_page);
 				continue;
 			}
-
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
@@ -2084,7 +2164,7 @@ generic_file_buffered_write(struct kiocb
 			 * really matter.
 			 */
 			pagefault_disable();
-			copied = iov_iter_copy_from_user_atomic(page, &i,
+			copied = iov_iter_copy_from_user_atomic(page, i,
 								offset, bytes);
 			pagefault_enable();
 		} else {
@@ -2110,9 +2190,9 @@ generic_file_buffered_write(struct kiocb
 		if (src_page)
 			page_cache_release(src_page);
 
-		iov_iter_advance(&i, copied);
-		written += copied;
+		iov_iter_advance(i, copied);
 		pos += copied;
+		written += copied;
 
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
@@ -2136,13 +2216,117 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-	} while (iov_iter_count(&i));
-	*ppos = pos;
+	} while (iov_iter_count(i));
+
+	return written ? written : status;
+}
+
+static ssize_t generic_perform_write(struct file *file,
+				struct iov_iter *i, loff_t pos)
+{
+	struct address_space *mapping = file->f_mapping;
+	const struct address_space_operations *a_ops = mapping->a_ops;
+	long status = 0;
+	ssize_t written = 0;
+
+	do {
+		struct page *page;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long bytes;	/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
+		void *fsdata;
+
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
+		index = pos >> PAGE_CACHE_SHIFT;
+		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+						iov_iter_count(i));
+
+again:
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
+		 */
+		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+			status = -EFAULT;
+			break;
+		}
+
+		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+						&page, &fsdata);
+		if (unlikely(status))
+			break;
+
+		pagefault_disable();
+		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		pagefault_enable();
+		flush_dcache_page(page);
+
+		status = a_ops->write_end(file, mapping, pos, bytes, copied,
+						page, fsdata);
+		if (unlikely(status < 0))
+			break;
+		copied = status;
+
+		cond_resched();
+
+		if (unlikely(copied == 0)) {
+			/*
+			 * If we were unable to copy any data at all, we must
+			 * fall back to a single segment length write.
+			 *
+			 * If we didn't fallback here, we could livelock
+			 * because not all segments in the iov can be copied at
+			 * once without a pagefault.
+			 */
+			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+						iov_iter_single_seg_count(i));
+			goto again;
+		}
+		iov_iter_advance(i, copied);
+		pos += copied;
+		written += copied;
+
+		balance_dirty_pages_ratelimited(mapping);
+
+	} while (iov_iter_count(i));
+
+	return written ? written : status;
+}
+
+ssize_t
+generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos, loff_t *ppos,
+		size_t count, ssize_t written)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	const struct address_space_operations *a_ops = mapping->a_ops;
+	struct inode *inode = mapping->host;
+	ssize_t status;
+	struct iov_iter i;
+
+	iov_iter_init(&i, iov, nr_segs, count, written);
+	if (a_ops->write_begin)
+		status = generic_perform_write(file, &i, pos);
+	else
+		status = generic_perform_write_2copy(file, &i, pos);
 
-	/*
-	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
-	 */
 	if (likely(status >= 0)) {
+		written += status;
+		*ppos = pos + status;
+
+		/*
+		 * For now, when the user asks for O_SYNC, we'll actually give
+		 * O_DSYNC
+		 */
 		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 			if (!a_ops->writepage || !is_sync_kiocb(iocb))
 				status = generic_osync_inode(inode, mapping,
_

Patches currently in -mm which might be from npiggin@suse.de are

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-nopfn-into-fault.patch
mm-remove-legacy-cruft.patch
mm-debug-check-for-the-fault-vs-invalidate-race.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
git-arm-master.patch
slob-rework-freelist-handling.patch
slob-remove-bigblock-tracking.patch
slob-improved-alignment-handling.patch
mm-revert-kernel_ds-buffered-write-optimisation.patch
revert-81b0c8713385ce1b1b9058e916edcf9561ad76d6.patch
revert-6527c2bdf1f833cc18e8f42bd97973d583e4aa83.patch
mm-clean-up-buffered-write-code.patch
mm-debug-write-deadlocks.patch
mm-trim-more-holes.patch
mm-buffered-write-cleanup.patch
mm-write-iovec-cleanup.patch
mm-fix-pagecache-write-deadlocks.patch
mm-buffered-write-iterator.patch
fs-fix-data-loss-on-error.patch
fs-introduce-write_begin-write_end-and-perform_write-aops.patch
mm-restore-kernel_ds-optimisations.patch
implement-simple-fs-aops.patch
block_dev-convert-to-new-aops.patch
ext2-convert-to-new-aops.patch
ext3-convert-to-new-aops.patch
ext4-convert-to-new-aops.patch
xfs-convert-to-new-aops.patch
fs-new-cont-helpers.patch
fat-convert-to-new-aops.patch
hfs-convert-to-new-aops.patch
hfsplus-convert-to-new-aops.patch
hpfs-convert-to-new-aops.patch
bfs-convert-to-new-aops.patch
qnx4-convert-to-new-aops.patch
reiserfs-use-generic-write.patch
reiserfs-convert-to-new-aops.patch
reiserfs-use-generic_cont_expand_simple.patch
with-reiserfs-no-longer-using-the-weird-generic_cont_expand-remove-it-completely.patch
nfs-convert-to-new-aops.patch
smb-convert-to-new-aops.patch
fuse-convert-to-new-aops.patch
hostfs-convert-to-new-aops.patch
jffs2-convert-to-new-aops.patch
ufs-convert-to-new-aops.patch
udf-convert-to-new-aops.patch
sysv-convert-to-new-aops.patch
minix-convert-to-new-aops.patch
jfs-convert-to-new-aops.patch
mm-document-fault_data-and-flags.patch
fs-introduce-some-page-buffer-invariants.patch
fs-introduce-write_begin-write_end-and-perform_write-aops-revoke.patch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-05-29 21:19 + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree akpm
@ 2007-05-30  3:13 ` Nick Piggin
  2007-05-30  3:24   ` Andrew Morton
  2007-05-30 10:39   ` Steven Whitehouse
  2007-06-13 13:40 ` Dmitriy Monakhov
  1 sibling, 2 replies; 20+ messages in thread
From: Nick Piggin @ 2007-05-30  3:13 UTC (permalink / raw)
  To: akpm; +Cc: mark.fasheh, Steven Whitehouse, linux-fsdevel

On Tue, May 29, 2007 at 02:19:55PM -0700, Andrew Morton wrote:
> 
> The patch titled
>      fs: introduce write_begin, write_end, and perform_write aops
> has been added to the -mm tree.  Its filename is
>      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
> 
> ------------------------------------------------------
> Subject: fs: introduce write_begin, write_end, and perform_write aops
> From: Nick Piggin <npiggin@suse.de>
> 
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).

OK, well now Andrew's merged a significant chunk of this work, I
would like to try getting the clustered filesystem patches back
in too (Steven, the last GFS2 patch you sent had rejects against this
tree, so I dropped it... hope it isn't too much work to bring it back
uptodate?).

The cluster filesystems aren't 100% happy with the backward-compat
code, because pagecache_write_end cannot handle AOP_TRUNCATED_PAGE from
->commit_write... so if you were to try using loop over GFS2, it might
go BUG. This is a bit bad of me, however the compat code would have been
a whole lot uglier to support that, and I figure the cluster filesystems
want to convert to the new aops ASAP anyway.

I doubt anybody but the filesystem developers would be using -mm in such
a way, but even so I hope we can fix this before long.

Meanwhile, I'll look at redoing the rest of the filesystems that got
left behind.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-05-30  3:13 ` Nick Piggin
@ 2007-05-30  3:24   ` Andrew Morton
  2007-05-30 22:44     ` Mark Fasheh
  2007-05-30 10:39   ` Steven Whitehouse
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2007-05-30  3:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: mark.fasheh, Steven Whitehouse, linux-fsdevel

On Wed, 30 May 2007 05:13:54 +0200 Nick Piggin <npiggin@suse.de> wrote:

> On Tue, May 29, 2007 at 02:19:55PM -0700, Andrew Morton wrote:
> > 
> > The patch titled
> >      fs: introduce write_begin, write_end, and perform_write aops
> > has been added to the -mm tree.  Its filename is
> >      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > 
> > *** Remember to use Documentation/SubmitChecklist when testing your code ***
> > 
> > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> > out what to do about this
> > 
> > ------------------------------------------------------
> > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > From: Nick Piggin <npiggin@suse.de>
> > 
> > These are intended to replace prepare_write and commit_write with more
> > flexible alternatives that are also able to avoid the buffered write
> > deadlock problems efficiently (which prepare_write is unable to do).
> 
> OK, well now Andrew's merged a significant chunk of this work, I
> would like to try getting the clustered filesystem patches back
> in too (Steven, the last GFS2 patch you sent had rejects against this
> tree, so I dropped it... hope it isn't too much work to bring it back
> uptodate?).
>
> The cluster filesystems aren't 100% happy with the backward-compat
> code, because pagecache_write_end cannot handle AOP_TRUNCATED_PAGE from
> ->commit_write... so if you were to try using loop over GFS2, it might
> go BUG. This is a bit bad of me, however the compat code would have been
> a whole lot uglier to support that, and I figure the cluster filesystems
> want to convert to the new aops ASAP anyway.
> 
> I doubt anybody but the filesystem developers would be using -mm in such
> a way, but even so I hope we can fix this before long.
> 
> Meanwhile, I'll look at redoing the rest of the filesystems that got
> left behind.

hm, I suppose that means I need to undrop git-ocfs2.patch.  It has a mild
disagreeement with the fault-vs-invalidate patches which I didn't feel like
fixing.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-05-30  3:13 ` Nick Piggin
  2007-05-30  3:24   ` Andrew Morton
@ 2007-05-30 10:39   ` Steven Whitehouse
  2007-05-31  5:50     ` Nick Piggin
  1 sibling, 1 reply; 20+ messages in thread
From: Steven Whitehouse @ 2007-05-30 10:39 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, mark.fasheh, linux-fsdevel

Hi,

On Wed, 2007-05-30 at 05:13 +0200, Nick Piggin wrote:
> On Tue, May 29, 2007 at 02:19:55PM -0700, Andrew Morton wrote:
> > 
> > The patch titled
> >      fs: introduce write_begin, write_end, and perform_write aops
> > has been added to the -mm tree.  Its filename is
> >      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > 
> > *** Remember to use Documentation/SubmitChecklist when testing your code ***
> > 
> > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> > out what to do about this
> > 
> > ------------------------------------------------------
> > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > From: Nick Piggin <npiggin@suse.de>
> > 
> > These are intended to replace prepare_write and commit_write with more
> > flexible alternatives that are also able to avoid the buffered write
> > deadlock problems efficiently (which prepare_write is unable to do).
> 
> OK, well now Andrew's merged a significant chunk of this work, I
> would like to try getting the clustered filesystem patches back
> in too (Steven, the last GFS2 patch you sent had rejects against this
> tree, so I dropped it... hope it isn't too much work to bring it back
> uptodate?).
> 
I think the following should do the trick... sorry for the delay, I was
on holiday last week and I'm just catching up again. There is not a lot
of change from the previous version, just a few small changes in the
upstream code which had caused one chunk not to apply,

Steve.

---------------------------------------------------------------------------------
diff --git a/fs/gfs2/ops_address.c b/fs/gfs2/ops_address.c
index fb84478..ad84a55 100644
--- a/fs/gfs2/ops_address.c
+++ b/fs/gfs2/ops_address.c
@@ -17,6 +17,7 @@
 #include <linux/mpage.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
+#include <linux/swap.h>
 #include <linux/gfs2_ondisk.h>
 #include <linux/lm_interface.h>
 
@@ -349,45 +350,49 @@ out_unlock:
 }
 
 /**
- * gfs2_prepare_write - Prepare to write a page to a file
+ * gfs2_write_begin - Begin to write to a file
  * @file: The file to write to
- * @page: The page which is to be prepared for writing
- * @from: From (byte range within page)
- * @to: To (byte range within page)
+ * @mapping: The mapping in which to write
+ * @pos: The file offset at which to start writing
+ * @len: Length of the write
+ * @flags: Various flags
+ * @pagep: Pointer to return the page
+ * @fsdata: Pointer to return fs data (unused by GFS2)
  *
  * Returns: errno
  */
 
-static int gfs2_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int gfs2_write_begin(struct file *file, struct address_space *mapping,
+			    loff_t pos, unsigned len, unsigned flags,
+			    struct page **pagep, void **fsdata)
 {
-	struct gfs2_inode *ip = GFS2_I(page->mapping->host);
-	struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
+	struct gfs2_inode *ip = GFS2_I(mapping->host);
+	struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
 	unsigned int data_blocks, ind_blocks, rblocks;
 	int alloc_required;
 	int error = 0;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + from;
-	loff_t end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 	struct gfs2_alloc *al;
-	unsigned int write_len = to - from;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+	unsigned to = from + len;
+	struct page *page;
 
-
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME|LM_FLAG_TRY_1CB, &ip->i_gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME, &ip->i_gh);
 	error = gfs2_glock_nq_atime(&ip->i_gh);
-	if (unlikely(error)) {
-		if (error == GLR_TRYFAILED) {
-			unlock_page(page);
-			error = AOP_TRUNCATED_PAGE;
-			yield();
-		}
+	if (unlikely(error))
 		goto out_uninit;
-	}
 
-	gfs2_write_calc_reserv(ip, write_len, &data_blocks, &ind_blocks);
+	error = -ENOMEM;
+	page = __grab_cache_page(mapping, index);
+	*pagep = page;
+	if (!page)
+		goto out_unlock;
+
+	gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks);
 
-	error = gfs2_write_alloc_required(ip, pos, write_len, &alloc_required);
+	error = gfs2_write_alloc_required(ip, pos, len, &alloc_required);
 	if (error)
-		goto out_unlock;
+		goto out_putpage;
 
 
 	ip->i_alloc.al_requested = 0;
@@ -419,7 +424,7 @@ static int gfs2_prepare_write(struct file *file, struct page *page,
 		goto out;
 
 	if (gfs2_is_stuffed(ip)) {
-		if (end > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
+		if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
 			error = gfs2_unstuff_dinode(ip, page);
 			if (error == 0)
 				goto prepare_write;
@@ -441,6 +446,10 @@ out_qunlock:
 out_alloc_put:
 			gfs2_alloc_put(ip);
 		}
+out_putpage:
+		page_cache_release(page);
+		if (pos + len > ip->i_inode.i_size)
+			vmtruncate(&ip->i_inode, ip->i_inode.i_size);
 out_unlock:
 		gfs2_glock_dq_m(1, &ip->i_gh);
 out_uninit:
@@ -476,65 +485,118 @@ static void adjust_fs_space(struct inode *inode)
 }
 
 /**
- * gfs2_commit_write - Commit write to a file
+ * gfs2_stuffed_write_end - Write end for stuffed files
+ * @inode: The inode
+ * @dibh: The buffer_head containing the on-disk inode
+ * @pos: The file position
+ * @len: The length of the write
+ * @copied: How much was actually copied by the VFS
+ * @page: The page
+ *
+ * This copies the data from the page into the inode block after
+ * the inode data structure itself.
+ *
+ * Returns: errno
+ */
+static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh,
+				  loff_t pos, unsigned len, unsigned copied,
+				  struct page *page)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	u64 to = pos + copied;
+	void *kaddr;
+	unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode);
+	struct gfs2_dinode *di = (struct gfs2_dinode *)dibh->b_data;
+
+	BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode)));
+	kaddr = kmap_atomic(page, KM_USER0);
+	memcpy(buf + pos, kaddr + pos, copied);
+	memset(kaddr + pos + copied, 0, len - copied);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr, KM_USER0);
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+	unlock_page(page);
+	mark_page_accessed(page);
+	page_cache_release(page);
+
+	if (inode->i_size < to) {
+		i_size_write(inode, to);
+		ip->i_di.di_size = inode->i_size;
+		di->di_size = cpu_to_be64(inode->i_size);
+		mark_inode_dirty(inode);
+	}
+
+	if (inode == sdp->sd_rindex)
+		adjust_fs_space(inode);
+
+	brelse(dibh);
+	gfs2_trans_end(sdp);
+	gfs2_glock_dq(&ip->i_gh);
+	gfs2_holder_uninit(&ip->i_gh);
+	return copied;
+}
+
+/**
+ * gfs2_write_end
  * @file: The file to write to
- * @page: The page containing the data
- * @from: From (byte range within page)
- * @to: To (byte range within page)
+ * @mapping: The address space to write to
+ * @pos: The file position
+ * @len: The length of the data
+ * @copied:
+ * @page: The page that has been written
+ * @fsdata: The fsdata (unused in GFS2)
+ *
+ * The main write_end function for GFS2. We have a separate one for
+ * stuffed files as they are slightly different, otherwise we just
+ * put our locking around the VFS provided functions.
  *
  * Returns: errno
  */
 
-static int gfs2_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int gfs2_write_end(struct file *file, struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
 {
 	struct inode *inode = page->mapping->host;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	int error = -EOPNOTSUPP;
 	struct buffer_head *dibh;
 	struct gfs2_alloc *al = &ip->i_alloc;
 	struct gfs2_dinode *di;
+	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
+	unsigned int to = from + len;
+	int ret;
 
-	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_locked_by_me(ip->i_gl)))
-                goto fail_nounlock;
+	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == 0);
 
-	error = gfs2_meta_inode_buffer(ip, &dibh);
-	if (error)
-		goto fail_endtrans;
+	ret = gfs2_meta_inode_buffer(ip, &dibh);
+	if (unlikely(ret)) {
+		unlock_page(page);
+		page_cache_release(page);
+		goto failed;
+	}
 
 	gfs2_trans_add_bh(ip->i_gl, dibh, 1);
-	di = (struct gfs2_dinode *)dibh->b_data;
 
-	if (gfs2_is_stuffed(ip)) {
-		u64 file_size;
-		void *kaddr;
+	if (gfs2_is_stuffed(ip))
+		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
 
-		file_size = ((u64)page->index << PAGE_CACHE_SHIFT) + to;
+	if (sdp->sd_args.ar_data == GFS2_DATA_ORDERED || gfs2_is_jdata(ip))
+		gfs2_page_add_databufs(ip, page, from, to);
 
-		kaddr = kmap_atomic(page, KM_USER0);
-		memcpy(dibh->b_data + sizeof(struct gfs2_dinode) + from,
-		       kaddr + from, to - from);
-		kunmap_atomic(kaddr, KM_USER0);
+	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
 
-		SetPageUptodate(page);
-
-		if (inode->i_size < file_size) {
-			i_size_write(inode, file_size);
+	if (likely(ret >= 0)) {
+		copied = ret;
+		if  ((pos + copied) > inode->i_size) {
+			di = (struct gfs2_dinode *)dibh->b_data;
+			ip->i_di.di_size = inode->i_size;
+			di->di_size = cpu_to_be64(inode->i_size);
 			mark_inode_dirty(inode);
 		}
-	} else {
-		if (sdp->sd_args.ar_data == GFS2_DATA_ORDERED ||
-		    gfs2_is_jdata(ip))
-			gfs2_page_add_databufs(ip, page, from, to);
-		error = generic_commit_write(file, page, from, to);
-		if (error)
-			goto fail;
-	}
-
-	if (ip->i_di.di_size < inode->i_size) {
-		ip->i_di.di_size = inode->i_size;
-		di->di_size = cpu_to_be64(inode->i_size);
 	}
 
 	if (inode == sdp->sd_rindex)
@@ -542,33 +604,15 @@ static int gfs2_commit_write(struct file *file, struct page *page,
 
 	brelse(dibh);
 	gfs2_trans_end(sdp);
+failed:
 	if (al->al_requested) {
 		gfs2_inplace_release(ip);
 		gfs2_quota_unlock(ip);
 		gfs2_alloc_put(ip);
 	}
-	unlock_page(page);
-	gfs2_glock_dq_m(1, &ip->i_gh);
-	lock_page(page);
+	gfs2_glock_dq(&ip->i_gh);
 	gfs2_holder_uninit(&ip->i_gh);
-	return 0;
-
-fail:
-	brelse(dibh);
-fail_endtrans:
-	gfs2_trans_end(sdp);
-	if (al->al_requested) {
-		gfs2_inplace_release(ip);
-		gfs2_quota_unlock(ip);
-		gfs2_alloc_put(ip);
-	}
-	unlock_page(page);
-	gfs2_glock_dq_m(1, &ip->i_gh);
-	lock_page(page);
-	gfs2_holder_uninit(&ip->i_gh);
-fail_nounlock:
-	ClearPageUptodate(page);
-	return error;
+	return ret;
 }
 
 /**
@@ -837,8 +881,8 @@ const struct address_space_operations gfs2_file_aops = {
 	.readpage = gfs2_readpage,
 	.readpages = gfs2_readpages,
 	.sync_page = block_sync_page,
-	.prepare_write = gfs2_prepare_write,
-	.commit_write = gfs2_commit_write,
+	.write_begin = gfs2_write_begin,
+	.write_end = gfs2_write_end,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
 	.releasepage = gfs2_releasepage,




^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-05-30  3:24   ` Andrew Morton
@ 2007-05-30 22:44     ` Mark Fasheh
  2007-05-30 22:49       ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Fasheh @ 2007-05-30 22:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Steven Whitehouse, linux-fsdevel

On Tue, May 29, 2007 at 08:24:41PM -0700, Andrew Morton wrote:
> hm, I suppose that means I need to undrop git-ocfs2.patch.  It has a mild
> disagreeement with the fault-vs-invalidate patches which I didn't feel like
> fixing.

Erf, somehow I missed that it was dropped. I guess I'll take a look...
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-05-30 22:44     ` Mark Fasheh
@ 2007-05-30 22:49       ` Andrew Morton
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2007-05-30 22:49 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Nick Piggin, Steven Whitehouse, linux-fsdevel

On Wed, 30 May 2007 15:44:43 -0700
Mark Fasheh <mark.fasheh@oracle.com> wrote:

> On Tue, May 29, 2007 at 08:24:41PM -0700, Andrew Morton wrote:
> > hm, I suppose that means I need to undrop git-ocfs2.patch.  It has a mild
> > disagreeement with the fault-vs-invalidate patches which I didn't feel like
> > fixing.
> 
> Erf, somehow I missed that it was dropped.

I didn't tell you ;)

> I guess I'll take a look...

It's the sort of mm-vs-mm thing which I get to fix, only I've been putting
it off.  As it now looks like the fault patches are moing away from death's
door, I shall do that.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-05-30 10:39   ` Steven Whitehouse
@ 2007-05-31  5:50     ` Nick Piggin
  0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2007-05-31  5:50 UTC (permalink / raw)
  To: Steven Whitehouse; +Cc: akpm, mark.fasheh, linux-fsdevel

On Wed, May 30, 2007 at 11:39:54AM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Wed, 2007-05-30 at 05:13 +0200, Nick Piggin wrote:
> > On Tue, May 29, 2007 at 02:19:55PM -0700, Andrew Morton wrote:
> > > 
> > > The patch titled
> > >      fs: introduce write_begin, write_end, and perform_write aops
> > > has been added to the -mm tree.  Its filename is
> > >      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > > 
> > > *** Remember to use Documentation/SubmitChecklist when testing your code ***
> > > 
> > > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> > > out what to do about this
> > > 
> > > ------------------------------------------------------
> > > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > > From: Nick Piggin <npiggin@suse.de>
> > > 
> > > These are intended to replace prepare_write and commit_write with more
> > > flexible alternatives that are also able to avoid the buffered write
> > > deadlock problems efficiently (which prepare_write is unable to do).
> > 
> > OK, well now Andrew's merged a significant chunk of this work, I
> > would like to try getting the clustered filesystem patches back
> > in too (Steven, the last GFS2 patch you sent had rejects against this
> > tree, so I dropped it... hope it isn't too much work to bring it back
> > uptodate?).
> > 
> I think the following should do the trick... sorry for the delay, I was
> on holiday last week and I'm just catching up again. There is not a lot
> of change from the previous version, just a few small changes in the
> upstream code which had caused one chunk not to apply,

OK thanks! I'll send this to Andrew after he releases the next -mm and
hopefully picks up the patchset again.
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 13:40 ` Dmitriy Monakhov
@ 2007-06-13 11:43   ` Nick Piggin
  2007-06-13 18:51     ` Dmitriy Monakhov
  2007-06-13 23:07     ` Badari Pulavarty
  2007-06-13 13:50   ` [patch] new aop block_write_begin fix Dmitriy Monakhov
  2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
  2 siblings, 2 replies; 20+ messages in thread
From: Nick Piggin @ 2007-06-13 11:43 UTC (permalink / raw)
  To: linux-kernel, mark.fasheh, linux-ext4; +Cc: Andrew Morton

On Wed, Jun 13, 2007 at 05:40:05PM +0400, Dmitriy Monakhov wrote:
> On 14:19 ?????? 29 ??????     , akpm@linux-foundation.org wrote:
> > 
> > The patch titled
> >      fs: introduce write_begin, write_end, and perform_write aops
> > has been added to the -mm tree.  Its filename is
> >      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > 
> > *** Remember to use Documentation/SubmitChecklist when testing your code ***
> > 
> > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> > out what to do about this
> > 
> > ------------------------------------------------------
> > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > From: Nick Piggin <npiggin@suse.de>
> > 
> > These are intended to replace prepare_write and commit_write with more
> > flexible alternatives that are also able to avoid the buffered write
> > deadlock problems efficiently (which prepare_write is unable to do).
> > 
> > [mark.fasheh@oracle.com: API design contributions, code review and fixes]
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
> And i have some fixes and questions. My be i've missed somthing and it was 
> already disscussed, but i cant find in LKML.

Thanks, that's really helpful.

 
> 1) loop dev:
> 	loop.c code itself is not perfect. In fact before Nick's patch
> 	partial write was't possible. Assumption what write chunks are
> 	always page aligned is realy weird ( see "index++" line).
> 	Fixed by "new aop loop fix" patch

I think you're right, fix looks good.

 
> 2)block_write_begin:
> 	After we enter to block_write_begin with *pagep == NULL and
> 	some page was grabed we remember this page in *pagep
> 	And if __block_prepare_write() we have to clear *pagep , as 
> 	it was before. Because this may confuse caller.
> 	for example caller may have folowing code:
> 		ret = block_write_begin(..., pagep,...)
> 		if (ret && *pagep != NULL) {
> 			unlock_page(*pagep);
> 			page_cache_release(*pagep);
> 		}
> 	Fixed my "new aop block_write_begin fix" patch

I don't think the caller can rely on that if it returns failure.
But that is more defensive I guess. Maybe setting it to 1 or
so would catch abusers.

 
> 3) __page_symlink:
> 	Nick's patch add folowing code:
> 	+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
> 	+                 AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
> 	symlink now consume whole page. I have only one question "WHY???".
> 	I don't see any advantages, but where are huge list of
> 	dissdvantages:
> 	a) fs with blksize == 1k and pagesize == 16k after this patch
> 	   waste more than 10x times disk space for nothing.
> 	b) What happends if we want use fs with blksize == 4k on i386
> 	   after it was used by ia64 ??? (before this patch it was
> 	   possible).
> 	
> 	I dont prepare patch for this because i dont understand issue
> 	witch Nick aimed to fix.

I don't know why myself :P I think it would be just fine to use
len-1 as it did previously, so it must have been a typo?


> 4) iov_iter_fault_in_readable:
> 	Function prerform check for signgle region, with out respect to
> 	segment nature of iovec, For example writev no longer works :) :
> 	writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
> 	this hidden bug, and it was invisiable because XXXX_fault_in_readable
> 	return value was ignored before. Lets iov_iter_fault_in_readable
> 	perform checks for all segments.
> 	Fixed by :"iov_iter_fault_in_readable fix"

OK thanks. I would rather just change this to use the length of
the first iovec always (prefaulting multiple iovecs would have to
be benchmarked on real apps that make heavy use of writev, I
suspect).

 
> 5) ext3_write_end:
> 	Before  write_begin/write_end patch set we have folowing locking
> 	order:
> 		stop_journal(handle);
> 		unlock_page(page);
> 	But now order is oposite:
> 		unlock_page(page);
> 		stop_journal(handle);
> 	Can we got any race condition now? I'm not sure is it actual problem,
> 	may be somebody cant describe this.

Can we just change it to the original order? That would seem to be
safest unless one of the ext3 devs explicitly acks it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-05-29 21:19 + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree akpm
  2007-05-30  3:13 ` Nick Piggin
@ 2007-06-13 13:40 ` Dmitriy Monakhov
  2007-06-13 11:43   ` Nick Piggin
                     ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 13:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: npiggin, mark.fasheh, dmonakhov, linux-ext4

On 14:19 Втр 29 Май     , akpm@linux-foundation.org wrote:
> 
> The patch titled
>      fs: introduce write_begin, write_end, and perform_write aops
> has been added to the -mm tree.  Its filename is
>      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
> 
> ------------------------------------------------------
> Subject: fs: introduce write_begin, write_end, and perform_write aops
> From: Nick Piggin <npiggin@suse.de>
> 
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).
> 
> [mark.fasheh@oracle.com: API design contributions, code review and fixes]
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
And i have some fixes and questions. My be i've missed somthing and it was 
already disscussed, but i cant find in LKML.

1) loop dev:
	loop.c code itself is not perfect. In fact before Nick's patch
	partial write was't possible. Assumption what write chunks are
	always page aligned is realy weird ( see "index++" line).
	Fixed by "new aop loop fix" patch

2)block_write_begin:
	After we enter to block_write_begin with *pagep == NULL and
	some page was grabed we remember this page in *pagep
	And if __block_prepare_write() we have to clear *pagep , as 
	it was before. Because this may confuse caller.
	for example caller may have folowing code:
		ret = block_write_begin(..., pagep,...)
		if (ret && *pagep != NULL) {
			unlock_page(*pagep);
			page_cache_release(*pagep);
		}
	Fixed my "new aop block_write_begin fix" patch

3) __page_symlink:
	Nick's patch add folowing code:
	+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
	+                 AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
	symlink now consume whole page. I have only one question "WHY???".
	I don't see any advantages, but where are huge list of
	dissdvantages:
	a) fs with blksize == 1k and pagesize == 16k after this patch
	   waste more than 10x times disk space for nothing.
	b) What happends if we want use fs with blksize == 4k on i386
	   after it was used by ia64 ??? (before this patch it was
	   possible).
	
	I dont prepare patch for this because i dont understand issue
	witch Nick aimed to fix.

4) iov_iter_fault_in_readable:
	Function prerform check for signgle region, with out respect to
	segment nature of iovec, For example writev no longer works :) :
	writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
	this hidden bug, and it was invisiable because XXXX_fault_in_readable
	return value was ignored before. Lets iov_iter_fault_in_readable
	perform checks for all segments.
	Fixed by :"iov_iter_fault_in_readable fix"

5) ext3_write_end:
	Before  write_begin/write_end patch set we have folowing locking
	order:
		stop_journal(handle);
		unlock_page(page);
	But now order is oposite:
		unlock_page(page);
		stop_journal(handle);
	Can we got any race condition now? I'm not sure is it actual problem,
	may be somebody cant describe this.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch] new aop block_write_begin fix
  2007-06-13 13:40 ` Dmitriy Monakhov
  2007-06-13 11:43   ` Nick Piggin
@ 2007-06-13 13:50   ` Dmitriy Monakhov
  2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
  2 siblings, 0 replies; 20+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 13:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: npiggin

	After we enter to block_write_begin with *pagep == NULL and
 	some page was grabed we remember this page in *pagep
 	And if __block_prepare_write() we have to clear *pagep , as 
 	it was before. Because this may confuse caller.
 	for example caller may have folowing code:
 		ret = block_write_begin(..., pagep,...)
 		if (ret && *pagep != NULL) {
 			unlock_page(*pagep);
 			page_cache_release(*pagep);
 		}
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>

diff --git a/fs/buffer.c b/fs/buffer.c
index 07cd457..df933ba 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1982,6 +1982,7 @@ int block_write_begin(struct file *file, struct address_space *mapping,
 		if (ownpage) {
 			unlock_page(page);
 			page_cache_release(page);
+			*pagep = NULL;
 
 			/*
 			 * prepare_write() may have instantiated a few blocks 


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* iov_iter_fault_in_readable fix
  2007-06-13 13:40 ` Dmitriy Monakhov
  2007-06-13 11:43   ` Nick Piggin
  2007-06-13 13:50   ` [patch] new aop block_write_begin fix Dmitriy Monakhov
@ 2007-06-13 13:57   ` Dmitriy Monakhov
  2007-06-14 17:31     ` Christoph Hellwig
  2 siblings, 1 reply; 20+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 13:57 UTC (permalink / raw)
  To: linux-kernel, npiggin, mark.fasheh, linux-ext4

 	Function prerform check for signgle region, with out respect to
 	segment nature of iovec, For example writev no longer works :)

	/* TESTCASE BEGIN */
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <sys/uio.h>
	#include <sys/mman.h>
	#define SIZE  (4096 * 2)
	int main(int argc, char* argv[])
	{	
		char* ptr[4];
		struct iovec iov[2];
		int fd, ret;
		ptr[0] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
		ptr[1] = mmap(NULL, SIZE, PROT_NONE,
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
		ptr[2] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
		ptr[3] = mmap(NULL, SIZE, PROT_NONE, 
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

		iov[0].iov_base = ptr[0] + (SIZE -1);
		iov[0].iov_len = 1;
		memset(ptr[0], 1, SIZE);

		iov[1].iov_base = ptr[2];
		iov[1].iov_len = SIZE;
		memset(ptr[2], 2, SIZE);

		fd = open(argv[1], O_CREAT|O_RDWR|O_TRUNC, 0666);
		ret = writev(fd, iov, sizeof(iov) / sizeof(struct iovec));
		return 0;
	}	
	/* TESTCASE END*/
	We will get folowing result:
		writev(3, [{"\1", 1}, {"\2"..., 8192}], 2) = -1 EFAULT (Bad address)
 	
	this is hidden bug, and it was invisiable because XXXX_fault_in_readable
 	return value was ignored before. Lets iov_iter_fault_in_readable
 	perform checks for all segments.

Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fef19fc..7e025ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,7 +433,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 size_t iov_iter_copy_from_user(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t *bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8d59ed9..8600c3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1817,10 +1817,32 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t* bytes)
 {
-	char __user *buf = i->iov->iov_base + i->iov_offset;
-	return fault_in_pages_readable(buf, bytes);
+	size_t len = *bytes;
+	int ret;
+	if (likely(i->nr_segs == 1)) {
+		ret = fault_in_pages_readable(i->iov->iov_base, len);
+		if (ret)
+			*bytes = 0;
+	} else {
+		const struct iovec *iov = i->iov;
+		size_t base = i->iov_offset;
+		*bytes = 0;
+		while (len) {
+			int copy = min(len, iov->iov_len - base);
+			if ((ret = fault_in_pages_readable(iov->iov_base + base, copy)))
+				break;
+			*bytes += copy;
+			len -= copy;
+			base += copy;
+			if (iov->iov_len == base) {
+				iov++;
+				base = 0;
+			}
+		}
+	}
+	return ret;	
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
@@ -2110,7 +2132,7 @@ static ssize_t generic_perform_write_2copy(struct file *file,
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(iov_iter_fault_in_readable(i, &bytes) && !bytes)) {
 			status = -EFAULT;
 			break;
 		}
@@ -2284,7 +2306,7 @@ again:
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(iov_iter_fault_in_readable(i, &bytes)&& !bytes)) {
 			status = -EFAULT;
 			break;
 		}

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 11:43   ` Nick Piggin
@ 2007-06-13 18:51     ` Dmitriy Monakhov
  2007-06-13 23:07     ` Badari Pulavarty
  1 sibling, 0 replies; 20+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 18:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, mark.fasheh, linux-ext4, Andrew Morton

On 13:43 Срд 13 Июн     , Nick Piggin wrote:
> On Wed, Jun 13, 2007 at 05:40:05PM +0400, Dmitriy Monakhov wrote:
> > On 14:19 ?????? 29 ??????     , akpm@linux-foundation.org wrote:
> > > 
> > > The patch titled
> > >      fs: introduce write_begin, write_end, and perform_write aops
> > > has been added to the -mm tree.  Its filename is
> > >      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > > 
> > > *** Remember to use Documentation/SubmitChecklist when testing your code ***
> > > 
> > > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> > > out what to do about this
> > > 
> > > ------------------------------------------------------
> > > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > > From: Nick Piggin <npiggin@suse.de>
> > > 
> > > These are intended to replace prepare_writ eand commit_write with more
> > > flexible alternatives that are also able to avoid the buffered write
> > > deadlock problems efficiently (which prepare_write is unable to do).
> > > 
> > > [mark.fasheh@oracle.com: API design contributions, code review and fixes]
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > > Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
> > And i have some fixes and questions. My be i've missed somthing and it was 
> > already disscussed, but i cant find in LKML.
> 
> Thanks, that's really helpful.
> 
>  
> > 1) loop dev:
> > 	loop.c code itself is not perfect. In fact before Nick's patch
> > 	partial write was't possible. Assumption what write chunks are
> > 	always page aligned is realy weird ( see "index++" line).
> > 	Fixed by "new aop loop fix" patch
> 
> I think you're right, fix looks good.
> 
>  
> > 2)block_write_begin:
> > 	After we enter to block_write_begin with *pagep == NULL and
> > 	some page was grabed we remember this page in *pagep
> > 	And if __block_prepare_write() we have to clear *pagep , as 
> > 	it was before. Because this may confuse caller.
> > 	for example caller may have folowing code:
> > 		ret = block_write_begin(..., pagep,...)
> > 		if (ret && *pagep != NULL) {
> > 			unlock_page(*pagep);
> > 			page_cache_release(*pagep);
> > 		}
> > 	Fixed my "new aop block_write_begin fix" patch
> 
> I don't think the caller can rely on that if it returns failure.
> But that is more defensive I guess. Maybe setting it to 1 or
> so would catch abusers.
> 
>  
> > 3) __page_symlink:
> > 	Nick's patch add folowing code:
> > 	+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
> > 	+                 AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
> > 	symlink now consume whole page. I have only one question "WHY???".
> > 	I don't see any advantages, but where are huge list of
> > 	dissdvantages:
> > 	a) fs with blksize == 1k and pagesize == 16k after this patch
> > 	   waste more than 10x times disk space for nothing.
> > 	b) What happends if we want use fs with blksize == 4k on i386
> > 	   after it was used by ia64 ??? (before this patch it was
> > 	   possible).
One more visiable effect caused by wrong symlink size:
fsstress cause folowing error:
<LOG BEGIN>
EXT3-fs unexpected failure: !buffer_revoked(bh); 
inconsistent data on disk
ext3_forget: aborting transaction: IO failure in __ext3_journal_revoke
ext3_abort called.

EXT3-fs error (device dm-4): ext3_forget: error -5 when attempting
revoke
Remounting filesystem read-only 
Aborting journal on device dm-4.
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_free_blocks_sb: Journal has aborted

journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_truncate: IO failure
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_delete_inode: IO failure
<LOG END>

After symlink size was fixed to len-1 problem dissappeared. 
> > 	
> > 	I dont prepare patch for this because i dont understand issue
> > 	witch Nick aimed to fix.
> 
> I don't know why myself :P I think it would be just fine to use
> len-1 as it did previously, so it must have been a typo?
> 
> 
> > 4) iov_iter_fault_in_readable:
> > 	Function prerform check for signgle region, with out respect to
> > 	segment nature of iovec, For example writev no longer works :) :
> > 	writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
> > 	this hidden bug, and it was invisiable because XXXX_fault_in_readable
> > 	return value was ignored before. Lets iov_iter_fault_in_readable
> > 	perform checks for all segments.
> > 	Fixed by :"iov_iter_fault_in_readable fix"
> 
> OK thanks. I would rather just change this to use the length of
> the first iovec always (prefaulting multiple iovecs would have to
> be benchmarked on real apps that make heavy use of writev, I
> suspect).
> 
>  
> > 5) ext3_write_end:
> > 	Before  write_begin/write_end patch set we have folowing locking
> > 	order:
> > 		stop_journal(handle);
> > 		unlock_page(page);
> > 	But now order is oposite:
> > 		unlock_page(page);
> > 		stop_journal(handle);
> > 	Can we got any race condition now? I'm not sure is it actual problem,
> > 	may be somebody cant describe this.
> 
> Can we just change it to the original order? That would seem to be
> safest unless one of the ext3 devs explicitly acks it.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 11:43   ` Nick Piggin
  2007-06-13 18:51     ` Dmitriy Monakhov
@ 2007-06-13 23:07     ` Badari Pulavarty
  2007-06-13 23:28       ` Nick Piggin
  2007-06-14  9:52       ` Jan Kara
  1 sibling, 2 replies; 20+ messages in thread
From: Badari Pulavarty @ 2007-06-13 23:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: lkml, mark.fasheh, ext4, Andrew Morton, cmm

On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
..
>  
> > 5) ext3_write_end:
> > 	Before  write_begin/write_end patch set we have folowing locking
> > 	order:
> > 		stop_journal(handle);
> > 		unlock_page(page);
> > 	But now order is oposite:
> > 		unlock_page(page);
> > 		stop_journal(handle);
> > 	Can we got any race condition now? I'm not sure is it actual problem,
> > 	may be somebody cant describe this.
> 
> Can we just change it to the original order? That would seem to be
> safest unless one of the ext3 devs explicitly acks it.

It would be nice to go back to original order, but its not that
simple with current structure of the code. With Nick's patches
unlock_page() happens in generic_write_end(). journal_stop() 
needs to happen after generic_write_end(). :(

Mingming, can you take a look at the current & proposed order ?
I ran into bunch of races when I tried to change the order for
->writepages() support earlier :(

Thanks,
Badari

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 23:07     ` Badari Pulavarty
@ 2007-06-13 23:28       ` Nick Piggin
  2007-06-14  9:52       ` Jan Kara
  1 sibling, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2007-06-13 23:28 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: lkml, mark.fasheh, ext4, Andrew Morton, cmm

On Wed, Jun 13, 2007 at 04:07:01PM -0700, Badari Pulavarty wrote:
> On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> ..
> >  
> > > 5) ext3_write_end:
> > > 	Before  write_begin/write_end patch set we have folowing locking
> > > 	order:
> > > 		stop_journal(handle);
> > > 		unlock_page(page);
> > > 	But now order is oposite:
> > > 		unlock_page(page);
> > > 		stop_journal(handle);
> > > 	Can we got any race condition now? I'm not sure is it actual problem,
> > > 	may be somebody cant describe this.
> > 
> > Can we just change it to the original order? That would seem to be
> > safest unless one of the ext3 devs explicitly acks it.
> 
> It would be nice to go back to original order, but its not that
> simple with current structure of the code. With Nick's patches
> unlock_page() happens in generic_write_end(). journal_stop() 
> needs to happen after generic_write_end(). :(

Well we could use block_write_end?

 
> Mingming, can you take a look at the current & proposed order ?
> I ran into bunch of races when I tried to change the order for
> ->writepages() support earlier :(

OK, it sounds like we probably want to revert to the original
order at least for this patchset. If the new order is proven
safe then that could be introduced later to simplify things...

Thanks,
Nick

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 23:07     ` Badari Pulavarty
  2007-06-13 23:28       ` Nick Piggin
@ 2007-06-14  9:52       ` Jan Kara
  2007-06-14 10:39         ` Nick Piggin
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kara @ 2007-06-14  9:52 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Nick Piggin, lkml, mark.fasheh, ext4, Andrew Morton, cmm

> On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> ..
> >  
> > > 5) ext3_write_end:
> > > 	Before  write_begin/write_end patch set we have folowing locking
> > > 	order:
> > > 		stop_journal(handle);
> > > 		unlock_page(page);
> > > 	But now order is oposite:
> > > 		unlock_page(page);
> > > 		stop_journal(handle);
> > > 	Can we got any race condition now? I'm not sure is it actual problem,
> > > 	may be somebody cant describe this.
> > 
> > Can we just change it to the original order? That would seem to be
> > safest unless one of the ext3 devs explicitly acks it.
  Sorry, I've missed beginning of this thread. But what problems can
exactly cause this ordering change? ext3_journal_stop has no need to be
protected by the page lock - it can be even better that it's not
protected as it can trigger commit and all that would happen
unnecessarily under page lock...

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-14  9:52       ` Jan Kara
@ 2007-06-14 10:39         ` Nick Piggin
  0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2007-06-14 10:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Badari Pulavarty, lkml, mark.fasheh, ext4, Andrew Morton, cmm

On Thu, Jun 14, 2007 at 11:52:49AM +0200, Jan Kara wrote:
> > On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> > ..
> > >  
> > > > 5) ext3_write_end:
> > > > 	Before  write_begin/write_end patch set we have folowing locking
> > > > 	order:
> > > > 		stop_journal(handle);
> > > > 		unlock_page(page);
> > > > 	But now order is oposite:
> > > > 		unlock_page(page);
> > > > 		stop_journal(handle);
> > > > 	Can we got any race condition now? I'm not sure is it actual problem,
> > > > 	may be somebody cant describe this.
> > > 
> > > Can we just change it to the original order? That would seem to be
> > > safest unless one of the ext3 devs explicitly acks it.
>   Sorry, I've missed beginning of this thread. But what problems can
> exactly cause this ordering change? ext3_journal_stop has no need to be
> protected by the page lock - it can be even better that it's not
> protected as it can trigger commit and all that would happen
> unnecessarily under page lock...

Sure, if you think it is safe. I would rather it be done in a
different patch though.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
@ 2007-06-14 17:31     ` Christoph Hellwig
  2007-06-14 22:21       ` David Chinner
  2007-06-16 18:17       ` Dmitriy Monakhov
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Hellwig @ 2007-06-14 17:31 UTC (permalink / raw)
  To: linux-kernel, npiggin, mark.fasheh, linux-ext4, xfs

On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
>  	Function prerform check for signgle region, with out respect to
>  	segment nature of iovec, For example writev no longer works :)

Btw, could someone please start to collect all sniplets like this in
a nice simple regression test suite?  If no one wants to start a new
one we should probably just put it into xfsqa (which should be useable
for other filesystems aswell despite the name)

> 
> 	/* TESTCASE BEGIN */
> 	#include <stdio.h>
> 	#include <sys/types.h>
> 	#include <sys/stat.h>
> 	#include <fcntl.h>
> 	#include <sys/uio.h>
> 	#include <sys/mman.h>
> 	#define SIZE  (4096 * 2)
> 	int main(int argc, char* argv[])
> 	{	
> 		char* ptr[4];
> 		struct iovec iov[2];
> 		int fd, ret;
> 		ptr[0] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 		ptr[1] = mmap(NULL, SIZE, PROT_NONE,
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 		ptr[2] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 		ptr[3] = mmap(NULL, SIZE, PROT_NONE, 
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
> 		iov[0].iov_base = ptr[0] + (SIZE -1);
> 		iov[0].iov_len = 1;
> 		memset(ptr[0], 1, SIZE);
> 
> 		iov[1].iov_base = ptr[2];
> 		iov[1].iov_len = SIZE;
> 		memset(ptr[2], 2, SIZE);
> 
> 		fd = open(argv[1], O_CREAT|O_RDWR|O_TRUNC, 0666);
> 		ret = writev(fd, iov, sizeof(iov) / sizeof(struct iovec));
> 		return 0;
> 	}	
> 	/* TESTCASE END*/
> 	We will get folowing result:
> 		writev(3, [{"\1", 1}, {"\2"..., 8192}], 2) = -1 EFAULT (Bad address)
>  	
> 	this is hidden bug, and it was invisiable because XXXX_fault_in_readable
>  	return value was ignored before. Lets iov_iter_fault_in_readable
>  	perform checks for all segments.
> 
> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fef19fc..7e025ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -433,7 +433,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
>  size_t iov_iter_copy_from_user(struct page *page,
>  		struct iov_iter *i, unsigned long offset, size_t bytes);
>  void iov_iter_advance(struct iov_iter *i, size_t bytes);
> -int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
> +int iov_iter_fault_in_readable(struct iov_iter *i, size_t *bytes);
>  size_t iov_iter_single_seg_count(struct iov_iter *i);
>  
>  static inline void iov_iter_init(struct iov_iter *i,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8d59ed9..8600c3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1817,10 +1817,32 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
>  }
>  EXPORT_SYMBOL(iov_iter_advance);
>  
> -int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
> +int iov_iter_fault_in_readable(struct iov_iter *i, size_t* bytes)
>  {
> -	char __user *buf = i->iov->iov_base + i->iov_offset;
> -	return fault_in_pages_readable(buf, bytes);
> +	size_t len = *bytes;
> +	int ret;
> +	if (likely(i->nr_segs == 1)) {
> +		ret = fault_in_pages_readable(i->iov->iov_base, len);
> +		if (ret)
> +			*bytes = 0;
> +	} else {
> +		const struct iovec *iov = i->iov;
> +		size_t base = i->iov_offset;
> +		*bytes = 0;
> +		while (len) {
> +			int copy = min(len, iov->iov_len - base);
> +			if ((ret = fault_in_pages_readable(iov->iov_base + base, copy)))
> +				break;
> +			*bytes += copy;
> +			len -= copy;
> +			base += copy;
> +			if (iov->iov_len == base) {
> +				iov++;
> +				base = 0;
> +			}
> +		}
> +	}
> +	return ret;	
>  }
>  EXPORT_SYMBOL(iov_iter_fault_in_readable);
>  
> @@ -2110,7 +2132,7 @@ static ssize_t generic_perform_write_2copy(struct file *file,
>  		 * to check that the address is actually valid, when atomic
>  		 * usercopies are used, below.
>  		 */
> -		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> +		if (unlikely(iov_iter_fault_in_readable(i, &bytes) && !bytes)) {
>  			status = -EFAULT;
>  			break;
>  		}
> @@ -2284,7 +2306,7 @@ again:
>  		 * to check that the address is actually valid, when atomic
>  		 * usercopies are used, below.
>  		 */
> -		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> +		if (unlikely(iov_iter_fault_in_readable(i, &bytes)&& !bytes)) {
>  			status = -EFAULT;
>  			break;
>  		}
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
---end quoted text---

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-14 17:31     ` Christoph Hellwig
@ 2007-06-14 22:21       ` David Chinner
  2007-06-14 22:34         ` Christoph Hellwig
  2007-06-16 18:17       ` Dmitriy Monakhov
  1 sibling, 1 reply; 20+ messages in thread
From: David Chinner @ 2007-06-14 22:21 UTC (permalink / raw)
  To: Christoph Hellwig, linux-kernel, npiggin, mark.fasheh, linux-ext4,
	xfs

On Thu, Jun 14, 2007 at 06:31:53PM +0100, Christoph Hellwig wrote:
> On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
> >  	Function prerform check for signgle region, with out respect to
> >  	segment nature of iovec, For example writev no longer works :)
> 
> Btw, could someone please start to collect all sniplets like this in
> a nice simple regression test suite?  If no one wants to start a new
> one we should probably just put it into xfsqa (which should be useable
> for other filesystems aswell despite the name)

Yeah, it can run a subset of the tests on NFS and UDF filesystems as well and
there are some specific UDF-only tests in it too.  I think the NFS test group
is mostly generic tests that don't use or test specific XFS features.

I'd be happy to accumulate tests like these in xfsqa...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-14 22:21       ` David Chinner
@ 2007-06-14 22:34         ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2007-06-14 22:34 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Hellwig, linux-kernel, npiggin, mark.fasheh, linux-ext4,
	xfs

On Fri, Jun 15, 2007 at 08:21:09AM +1000, David Chinner wrote:
> Yeah, it can run a subset of the tests on NFS and UDF filesystems as well and
> there are some specific UDF-only tests in it too.  I think the NFS test group
> is mostly generic tests that don't use or test specific XFS features.

Actually most testcases can run on any reasonable posixish filesystem, we
just need some glue to tell the testsuite it's actually okay.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-14 17:31     ` Christoph Hellwig
  2007-06-14 22:21       ` David Chinner
@ 2007-06-16 18:17       ` Dmitriy Monakhov
  1 sibling, 0 replies; 20+ messages in thread
From: Dmitriy Monakhov @ 2007-06-16 18:17 UTC (permalink / raw)
  To: Christoph Hellwig, linux-kernel, npiggin, mark.fasheh, linux-ext4,
	xfs

On 18:31 Чтв 14 Июн     , Christoph Hellwig wrote:
> On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
> >  	Function prerform check for signgle region, with out respect to
> >  	segment nature of iovec, For example writev no longer works :)
> 
> Btw, could someone please start to collect all sniplets like this in
> a nice simple regression test suite?  If no one wants to start a new
> one we should probably just put it into xfsqa (which should be useable
> for other filesystems aswell despite the name)
I've prepared testcase (testcases/kernel/syscalls/writev/writev06.c) 
and sent it to ltp mailing list.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2007-06-16 18:18 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-29 21:19 + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree akpm
2007-05-30  3:13 ` Nick Piggin
2007-05-30  3:24   ` Andrew Morton
2007-05-30 22:44     ` Mark Fasheh
2007-05-30 22:49       ` Andrew Morton
2007-05-30 10:39   ` Steven Whitehouse
2007-05-31  5:50     ` Nick Piggin
2007-06-13 13:40 ` Dmitriy Monakhov
2007-06-13 11:43   ` Nick Piggin
2007-06-13 18:51     ` Dmitriy Monakhov
2007-06-13 23:07     ` Badari Pulavarty
2007-06-13 23:28       ` Nick Piggin
2007-06-14  9:52       ` Jan Kara
2007-06-14 10:39         ` Nick Piggin
2007-06-13 13:50   ` [patch] new aop block_write_begin fix Dmitriy Monakhov
2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
2007-06-14 17:31     ` Christoph Hellwig
2007-06-14 22:21       ` David Chinner
2007-06-14 22:34         ` Christoph Hellwig
2007-06-16 18:17       ` Dmitriy Monakhov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.