linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] ext4: ioc_move_ext fixes
@ 2012-09-24 12:23 Dmitry Monakhov
  2012-09-24 12:23 ` [PATCH 1/4] ext4: move_extent code cleanup Dmitry Monakhov
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 12:23 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov

Currenty EXT4_IOC_MOVE_EXT is very dangerous call because anyone who
has write perm is able to call it and trigger kernel panic. Even more
all kernel since v2.6.30-6558-g748de67 are affected, so current
Debian and Redhat server solutions are dangerous.

Tests are available here:
http://www.spinics.net/lists/linux-fsdevel/msg58316.html
Or here:
https://github.com/dmonakhov/xfstests/tree/ce8e3adab629b2a9be8ba2e73db7dad49eb46614
288'th test

This patchset attempt to close all known bugs, so it survived after
all tests i know.

TOC:
ext4: move_extent code cleanup
ext4: online defrag is not supported for journaled files
ext4: basic bug shield for move_extent_per_page V2
ext4: reimplement uninit extent optimization for move_extent_per_page V2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/4] ext4: move_extent code cleanup
  2012-09-24 12:23 [PATCH 0/4] ext4: ioc_move_ext fixes Dmitry Monakhov
@ 2012-09-24 12:23 ` Dmitry Monakhov
  2012-09-26 16:30   ` Theodore Ts'o
  2012-09-24 12:23 ` [PATCH 2/4] ext4: online defrag is not supported for journaled files Dmitry Monakhov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 12:23 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov

- Remove usless checks, because it is too late to check that inode != NULL
  at the moment it was referenced several times.
- Double lock routines looks very ugly and locking ordering relays on
  order of i_ino, but other kernel code relay on order of pointers.
  Let's  make them simple and clean.
- check that inodes belongs to the same SB as soon as possible.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/move_extent.c |  131 +++++++++++--------------------------------------
 1 files changed, 29 insertions(+), 102 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index ae0357b..37c8497 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -141,55 +141,21 @@ mext_next_extent(struct inode *inode, struct ext4_ext_path *path,
 }
 
 /**
- * mext_check_null_inode - NULL check for two inodes
- *
- * If inode1 or inode2 is NULL, return -EIO. Otherwise, return 0.
- */
-static int
-mext_check_null_inode(struct inode *inode1, struct inode *inode2,
-		      const char *function, unsigned int line)
-{
-	int ret = 0;
-
-	if (inode1 == NULL) {
-		__ext4_error(inode2->i_sb, function, line,
-			"Both inodes should not be NULL: "
-			"inode1 NULL inode2 %lu", inode2->i_ino);
-		ret = -EIO;
-	} else if (inode2 == NULL) {
-		__ext4_error(inode1->i_sb, function, line,
-			"Both inodes should not be NULL: "
-			"inode1 %lu inode2 NULL", inode1->i_ino);
-		ret = -EIO;
-	}
-	return ret;
-}
-
-/**
  * double_down_write_data_sem - Acquire two inodes' write lock of i_data_sem
  *
- * @orig_inode:		original inode structure
- * @donor_inode:	donor inode structure
- * Acquire write lock of i_data_sem of the two inodes (orig and donor) by
- * i_ino order.
+ * Acquire write lock of i_data_sem of the two inodes
  */
 static void
-double_down_write_data_sem(struct inode *orig_inode, struct inode *donor_inode)
+double_down_write_data_sem(struct inode *first, struct inode *second)
 {
-	struct inode *first = orig_inode, *second = donor_inode;
+	if (first < second) {
+		down_write(&EXT4_I(first)->i_data_sem);
+		down_write_nested(&EXT4_I(second)->i_data_sem, SINGLE_DEPTH_NESTING);
+	} else {
+		down_write(&EXT4_I(second)->i_data_sem);
+		down_write_nested(&EXT4_I(first)->i_data_sem, SINGLE_DEPTH_NESTING);
 
-	/*
-	 * Use the inode number to provide the stable locking order instead
-	 * of its address, because the C language doesn't guarantee you can
-	 * compare pointers that don't come from the same array.
-	 */
-	if (donor_inode->i_ino < orig_inode->i_ino) {
-		first = donor_inode;
-		second = orig_inode;
 	}
-
-	down_write(&EXT4_I(first)->i_data_sem);
-	down_write_nested(&EXT4_I(second)->i_data_sem, SINGLE_DEPTH_NESTING);
 }
 
 /**
@@ -969,14 +935,6 @@ mext_check_arguments(struct inode *orig_inode,
 		return -EINVAL;
 	}
 
-	/* Files should be in the same ext4 FS */
-	if (orig_inode->i_sb != donor_inode->i_sb) {
-		ext4_debug("ext4 move extent: The argument files "
-			"should be in same FS [ino:orig %lu, donor %lu]\n",
-			orig_inode->i_ino, donor_inode->i_ino);
-		return -EINVAL;
-	}
-
 	/* Ext4 move extent supports only extent based file */
 	if (!(ext4_test_inode_flag(orig_inode, EXT4_INODE_EXTENTS))) {
 		ext4_debug("ext4 move extent: orig file is not extents "
@@ -1072,35 +1030,19 @@ mext_check_arguments(struct inode *orig_inode,
  * @inode1:	the inode structure
  * @inode2:	the inode structure
  *
- * Lock two inodes' i_mutex by i_ino order.
- * If inode1 or inode2 is NULL, return -EIO. Otherwise, return 0.
+ * Lock two inodes' i_mutex
  */
-static int
+static void
 mext_inode_double_lock(struct inode *inode1, struct inode *inode2)
 {
-	int ret = 0;
-
-	BUG_ON(inode1 == NULL && inode2 == NULL);
-
-	ret = mext_check_null_inode(inode1, inode2, __func__, __LINE__);
-	if (ret < 0)
-		goto out;
-
-	if (inode1 == inode2) {
-		mutex_lock(&inode1->i_mutex);
-		goto out;
-	}
-
-	if (inode1->i_ino < inode2->i_ino) {
+	BUG_ON(inode1 == inode2);
+	if (inode1 < inode2) {
 		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
 		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
 	} else {
 		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
 		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
 	}
-
-out:
-	return ret;
 }
 
 /**
@@ -1109,28 +1051,13 @@ out:
  * @inode1:     the inode that is released first
  * @inode2:     the inode that is released second
  *
- * If inode1 or inode2 is NULL, return -EIO. Otherwise, return 0.
  */
 
-static int
+static void
 mext_inode_double_unlock(struct inode *inode1, struct inode *inode2)
 {
-	int ret = 0;
-
-	BUG_ON(inode1 == NULL && inode2 == NULL);
-
-	ret = mext_check_null_inode(inode1, inode2, __func__, __LINE__);
-	if (ret < 0)
-		goto out;
-
-	if (inode1)
-		mutex_unlock(&inode1->i_mutex);
-
-	if (inode2 && inode2 != inode1)
-		mutex_unlock(&inode2->i_mutex);
-
-out:
-	return ret;
+	mutex_unlock(&inode1->i_mutex);
+	mutex_unlock(&inode2->i_mutex);
 }
 
 /**
@@ -1187,16 +1114,23 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp,
 	ext4_lblk_t block_end, seq_start, add_blocks, file_end, seq_blocks = 0;
 	ext4_lblk_t rest_blocks;
 	pgoff_t orig_page_offset = 0, seq_end_page;
-	int ret1, ret2, depth, last_extent = 0;
+	int ret1, depth, last_extent = 0;
 	int blocks_per_page = PAGE_CACHE_SIZE >> orig_inode->i_blkbits;
 	int data_offset_in_page;
 	int block_len_in_page;
 	int uninit;
 
-	/* orig and donor should be different file */
-	if (orig_inode->i_ino == donor_inode->i_ino) {
+	if (orig_inode->i_sb != donor_inode->i_sb) {
+		ext4_debug("ext4 move extent: The argument files "
+			"should be in same FS [ino:orig %lu, donor %lu]\n",
+			orig_inode->i_ino, donor_inode->i_ino);
+		return -EINVAL;
+	}
+
+	/* orig and donor should be different inodes */
+	if (orig_inode == donor_inode) {
 		ext4_debug("ext4 move extent: The argument files should not "
-			"be same file [ino:orig %lu, donor %lu]\n",
+			"be same inode [ino:orig %lu, donor %lu]\n",
 			orig_inode->i_ino, donor_inode->i_ino);
 		return -EINVAL;
 	}
@@ -1210,9 +1144,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp,
 	}
 
 	/* Protect orig and donor inodes against a truncate */
-	ret1 = mext_inode_double_lock(orig_inode, donor_inode);
-	if (ret1 < 0)
-		return ret1;
+	mext_inode_double_lock(orig_inode, donor_inode);
 
 	/* Wait for all existing dio workers */
 	ext4_inode_block_unlocked_dio(orig_inode);
@@ -1420,12 +1352,7 @@ out:
 	double_up_write_data_sem(orig_inode, donor_inode);
 	ext4_inode_block_unlocked_dio(orig_inode);
 	ext4_inode_block_unlocked_dio(donor_inode);
-	ret2 = mext_inode_double_unlock(orig_inode, donor_inode);
+	mext_inode_double_unlock(orig_inode, donor_inode);
 
-	if (ret1)
-		return ret1;
-	else if (ret2)
-		return ret2;

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/4] ext4: online defrag is not supported for journaled files
  2012-09-24 12:23 [PATCH 0/4] ext4: ioc_move_ext fixes Dmitry Monakhov
  2012-09-24 12:23 ` [PATCH 1/4] ext4: move_extent code cleanup Dmitry Monakhov
@ 2012-09-24 12:23 ` Dmitry Monakhov
  2012-09-26 16:30   ` Theodore Ts'o
  2012-09-24 12:23 ` [PATCH 3/4] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
  2012-09-24 12:23 ` [PATCH 4/4] ext4: reimplement uninit extent optimization for move_extent_per_page V2 Dmitry Monakhov
  3 siblings, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 12:23 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov

Proper block swap for inodes with full journaling enabled is
truly non obvious task. In order to be on a safe side let's
explicitly disable it for now.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/move_extent.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 37c8497..0aa2ad7 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -1142,7 +1142,12 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp,
 			orig_inode->i_ino, donor_inode->i_ino);
 		return -EINVAL;
 	}

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/4] ext4: basic bug shield for move_extent_per_page
  2012-09-24 12:23 [PATCH 0/4] ext4: ioc_move_ext fixes Dmitry Monakhov
  2012-09-24 12:23 ` [PATCH 1/4] ext4: move_extent code cleanup Dmitry Monakhov
  2012-09-24 12:23 ` [PATCH 2/4] ext4: online defrag is not supported for journaled files Dmitry Monakhov
@ 2012-09-24 12:23 ` Dmitry Monakhov
  2012-09-26 16:50   ` Theodore Ts'o
  2012-09-24 12:23 ` [PATCH 4/4] ext4: reimplement uninit extent optimization for move_extent_per_page V2 Dmitry Monakhov
  3 siblings, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 12:23 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov

The move_extent_per_page function is just one big peace of crap.
this is the most buggy peace of code i've ever seen

Non-full list of bugs:
1) uninitialized extent optimization does not hold page's lock,
   and simply replace brunches after that writeback code goes
   crazy because block mapping changed under it's feets
   kernel BUG at fs/ext4/inode.c:1434!  ( 288'th xfstress)

2) uninitialized extent may became initialized right after we
   drop i_data_sem, so extent state must be rechecked

3) Locked pages goes uptodate via following sequence:
   ->readpage(page); lock_page(page); use_that_page(page)
   But after readpage() one may invalidate it because it is
   uptodate and unlocked (reclaimer does that)
   As result kernel bug at include/linux/buffer_head.c:133!

4) We call write_begin() with already opened stansaction which
   result in following deadlock:
->move_extent_per_page()
  ->ext4_journal_start()-> hold journal transaction
  ->write_begin()
    ->ext4_da_write_begin()
      ->ext4_nonda_switch()
        ->writeback_inodes_sb_if_idle()  --> will wait for journal_stop()

5) try_to_release_page() may fail and it does fail if one of page's bh was
   pinned by journal

6) If we about to change page's mapping we MUST hold it's lock during entire
   remapping procedure, this is true for both pages(original and donor one)

Fixes:

- Avoid (1) and (2) simply by temproraly drop uninitialized extent handling
  optimization, this will be reimplemented later.

- Fix (3) by manually forcing page to uptodate state w/o dropping it's lock

- Fix (4) by rearranging existing locking:
  from: journal_start(); ->write_begin
  to: write_begin(); journal_extend()
- Fix (5) simply by checking retvalue
- Fix (6) by locking both (original and donor one) pages during extent swap
  with help of mext_page_double_lock()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/move_extent.c |  251 ++++++++++++++++++++++++++++++++++---------------
 1 files changed, 176 insertions(+), 75 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 0aa2ad7..f416f83 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -736,6 +736,117 @@ out:
 }
 
 /**
+ * mext_page_double_lock - Grab and lock pages on both @inode1 and @inode2
+ *
+ * @inode1:	the inode structure
+ * @inode2:	the inode structure
+ * @index:	page index
+ * @page:	result page vector
+ *
+ * Grab two locked pages for inode's by inode order
+ */
+static int
+mext_page_double_lock(struct inode *inode1, struct inode *inode2,
+		      pgoff_t index, struct page *page[2])
+{
+	struct address_space *mapping[2];
+	unsigned fl = AOP_FLAG_NOFS;
+
+	BUG_ON(!inode1 || !inode2);
+	if (inode1 < inode2) {
+		mapping[0] = inode1->i_mapping;
+		mapping[1] = inode2->i_mapping;
+	} else {
+		mapping[0] = inode2->i_mapping;
+		mapping[1] = inode1->i_mapping;
+	}
+
+	page[0] = grab_cache_page_write_begin(mapping[0], index, fl);
+	if (!page[0])
+		return -ENOMEM;
+
+	page[1] = grab_cache_page_write_begin(mapping[1], index, fl);
+	if (!page[1]) {
+		unlock_page(page[0]);
+		page_cache_release(page[0]);
+		return -ENOMEM;
+	}
+
+	if (inode1 > inode2) {
+		struct page *tmp;
+		tmp = page[0];
+		page[0] = page[1];
+		page[1] = tmp;
+	}
+	return 0;
+}
+
+/* Force page buffers uptodate w/o dropping page's lock */
+static int
+mext_page_mkuptodate(struct page *page, unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	sector_t block;
+	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	unsigned int blocksize, block_start, block_end;
+	int i, err,  nr = 0, partial = 0;
+	BUG_ON(!PageLocked(page));
+	BUG_ON(PageWriteback(page));
+
+	if (PageUptodate(page))
+		return 0;
+
+	blocksize = 1 << inode->i_blkbits;
+	if (!page_has_buffers(page))
+		create_empty_buffers(page, blocksize, 0);
+
+	head = page_buffers(page);
+	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+	for(bh = head, block_start = 0; bh != head || !block_start;
+	    block++, block_start = block_end, bh = bh->b_this_page) {
+		block_end = block_start + blocksize;
+		if (block_end <= from || block_start >= to) {
+			if (!buffer_uptodate(bh))
+				partial = 1;
+			continue;
+		}
+		if (buffer_uptodate(bh))
+			continue;
+		if (!buffer_mapped(bh)) {
+			int err = 0;
+			err = ext4_get_block(inode, block, bh, 0);
+			if (err) {
+				SetPageError(page);
+				return err;
+			}
+			if (!buffer_mapped(bh)) {
+				zero_user(page, block_start, blocksize);
+				if (!err)
+					set_buffer_uptodate(bh);
+				continue;
+			}
+		}
+		arr[nr++] = bh;
+	}
+	/* No io required */
+	if (!nr)
+		goto out;
+
+	for (i = 0; i < nr; i++) {
+		bh = arr[i];
+		if (!bh_uptodate_or_lock(bh)) {
+			err = bh_submit_read(bh);
+			if (err)
+				return err;
+		}
+	}
+out:
+	if (!partial)
+		SetPageUptodate(page);
+	return 0;
+}
+
+/**
  * move_extent_per_page - Move extent data per page
  *
  * @o_filp:			file structure of original file
@@ -757,26 +868,24 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
 		  int block_len_in_page, int uninit, int *err)
 {
 	struct inode *orig_inode = o_filp->f_dentry->d_inode;
-	struct address_space *mapping = orig_inode->i_mapping;
-	struct buffer_head *bh;
-	struct page *page = NULL;
-	const struct address_space_operations *a_ops = mapping->a_ops;
+	struct page *pagep[2] = {NULL, NULL};
 	handle_t *handle;
 	ext4_lblk_t orig_blk_offset;
 	long long offs = orig_page_offset << PAGE_CACHE_SHIFT;
 	unsigned long blocksize = orig_inode->i_sb->s_blocksize;
 	unsigned int w_flags = 0;
 	unsigned int tmp_data_size, data_size, replaced_size;
-	void *fsdata;
-	int i, jblocks;
-	int err2 = 0;
+	int err2, jblocks, retries = 0;
 	int replaced_count = 0;
+	int from = data_offset_in_page << orig_inode->i_blkbits;
 	int blocks_per_page = PAGE_CACHE_SIZE >> orig_inode->i_blkbits;
 
 	/*
 	 * It needs twice the amount of ordinary journal buffers because
 	 * inode and donor_inode may change each different metadata blocks.
 	 */
+again:
+	*err = 0;
 	jblocks = ext4_writepage_trans_blocks(orig_inode) * 2;
 	handle = ext4_journal_start(orig_inode, jblocks);
 	if (IS_ERR(handle)) {
@@ -790,19 +899,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
 	orig_blk_offset = orig_page_offset * blocks_per_page +
 		data_offset_in_page;
 
-	/*
-	 * If orig extent is uninitialized one,
-	 * it's not necessary force the page into memory
-	 * and then force it to be written out again.
-	 * Just swap data blocks between orig and donor.
-	 */
-	if (uninit) {
-		replaced_count = mext_replace_branches(handle, orig_inode,
-						donor_inode, orig_blk_offset,
-						block_len_in_page, err);
-		goto out2;
-	}
-
 	offs = (long long)orig_blk_offset << orig_inode->i_blkbits;
 
 	/* Calculate data_size */
@@ -824,75 +920,80 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
 
 	replaced_size = data_size;
 
-	*err = a_ops->write_begin(o_filp, mapping, offs, data_size, w_flags,
-				 &page, &fsdata);
+	*err = mext_page_double_lock(orig_inode, donor_inode, orig_page_offset,
+				     pagep);
 	if (unlikely(*err < 0))
-		goto out;
+		goto stop_journal;
 
-	if (!PageUptodate(page)) {
-		mapping->a_ops->readpage(o_filp, page);
-		lock_page(page);
+	*err = mext_page_mkuptodate(pagep[0], from, from + replaced_size);
+	if (*err)
+		goto unlock_pages;
+
+	/* At this point all buffers in range are uptodate, old mapping layout
+	 * is no longer required, try to drop it now. */
+	if ((page_has_private(pagep[0]) && !try_to_release_page(pagep[0], 0)) ||
+	    (page_has_private(pagep[1]) && !try_to_release_page(pagep[1], 0))) {
+		*err = -EBUSY;
+		goto unlock_pages;
 	}
 
-	/*
-	 * try_to_release_page() doesn't call releasepage in writeback mode.
-	 * We should care about the order of writing to the same file
-	 * by multiple move extent processes.
-	 * It needs to call wait_on_page_writeback() to wait for the
-	 * writeback of the page.
-	 */
-	wait_on_page_writeback(page);
-
-	/* Release old bh and drop refs */
-	try_to_release_page(page, 0);
-
 	replaced_count = mext_replace_branches(handle, orig_inode, donor_inode,
-					orig_blk_offset, block_len_in_page,
-					&err2);
-	if (err2) {
+					orig_blk_offset, block_len_in_page, err);
+	if (*err) {
 		if (replaced_count) {
 			block_len_in_page = replaced_count;
 			replaced_size =
 				block_len_in_page << orig_inode->i_blkbits;
 		} else
-			goto out;
+			goto unlock_pages;
 	}
+	/* Perform all necessary steps similar write_begin()/write_end()
+	 * but keeping in mind that i_size will not change */
+	*err = __block_write_begin(pagep[0], from, from + replaced_size,
+				   ext4_get_block);
+	if (!*err)
+		*err = block_commit_write(pagep[0], from, from + replaced_size);
 
-	if (!page_has_buffers(page))
-		create_empty_buffers(page, 1 << orig_inode->i_blkbits, 0);
-
-	bh = page_buffers(page);
-	for (i = 0; i < data_offset_in_page; i++)
-		bh = bh->b_this_page;
-
-	for (i = 0; i < block_len_in_page; i++) {
-		*err = ext4_get_block(orig_inode,
-				(sector_t)(orig_blk_offset + i), bh, 0);
-		if (*err < 0)
-			goto out;
-
-		if (bh->b_this_page != NULL)
-			bh = bh->b_this_page;
-	}
-
-	*err = a_ops->write_end(o_filp, mapping, offs, data_size, replaced_size,
-			       page, fsdata);
-	page = NULL;
-
-out:
-	if (unlikely(page)) {
-		if (PageLocked(page))
-			unlock_page(page);
-		page_cache_release(page);
-		ext4_journal_stop(handle);
-	}
-out2:
+	if (unlikely(*err < 0))
+		goto repair_branches;
+
+	/* Even in case of data=writeback it is reasonable to pin
+	 * inode to transaction, to prevent unexpected data loss */
+	*err = ext4_jbd2_file_inode(handle, orig_inode);
+
+unlock_pages:
+	unlock_page(pagep[0]);
+	page_cache_release(pagep[0]);
+	unlock_page(pagep[1]);
+	page_cache_release(pagep[1]);
+stop_journal:
 	ext4_journal_stop(handle);
-
-	if (err2)
-		*err = err2;
-
+	/* Buffer was busy because probably is pinned to journal transaction,
+	 * force transaction commit may help to free it. */
+	if (*err == -EBUSY && ext4_should_retry_alloc(orig_inode->i_sb,
+						      &retries))
+		goto again;
 	return replaced_count;
+
+repair_branches:
+	/*
+	 * This should never ever happen!
+	 * Extents are swapped already, but we are not able to copy data.
+	 * Try to swap extents to it's original places
+	 */
+	double_down_write_data_sem(orig_inode, donor_inode);
+	replaced_count = mext_replace_branches(handle, donor_inode, orig_inode,
+					       orig_blk_offset,
+					       block_len_in_page, &err2);
+	double_up_write_data_sem(orig_inode, donor_inode);
+	if (replaced_count != block_len_in_page) {
+		EXT4_ERROR_INODE_BLOCK(orig_inode,(sector_t)(orig_blk_offset),
+				       "Unable to copy data block,"
+				       " data will be lost.");
+		*err = -EIO;
+	}
+	replaced_count = 0;
+	goto unlock_pages;
 }
 
 /**
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/4] ext4: reimplement uninit extent optimization for move_extent_per_page V2
  2012-09-24 12:23 [PATCH 0/4] ext4: ioc_move_ext fixes Dmitry Monakhov
                   ` (2 preceding siblings ...)
  2012-09-24 12:23 ` [PATCH 3/4] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
@ 2012-09-24 12:23 ` Dmitry Monakhov
  2012-09-26 16:56   ` Theodore Ts'o
  3 siblings, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 12:23 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov

Uninitialized extent may became initialized(parallel writeback task)
at any moment after we drop i_data_sem, so we have to recheck extent's
state after we hold page's lock and i_data_sem.

If we about to change page's mapping we must hold page's lock in order to
serialize other users.
Changes from V1 in response to Akira's comments:
- fix last block mistype
- Rerutn inital error code

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/move_extent.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index f416f83..cbcca98 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -595,6 +595,43 @@ mext_calc_swap_extents(struct ext4_extent *tmp_dext,
 }
 
 /**
+ * mext_check_coverage - Check that all extents in range has the same type
+ *
+ * @inode:		inode in question
+ * @from:		block offset of inode
+ * @count:		block count to be checked
+ * @uninit:		extents expected to be uninitialized
+ * @err:		pointer to save error value
+ *
+ * Return 1 if all extents in range has expected type, and zero otherwise.
+ */
+static int
+mext_check_coverage(struct inode *inode, ext4_lblk_t from, ext4_lblk_t count,
+			  int uninit, int *err)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ext;
+	ext4_lblk_t last = from + count;
+	while (from < last) {
+		*err = get_ext_path(inode, from, &path);
+		if (*err)
+			return 0;
+		ext = path[ext_depth(inode)].p_ext;
+		if (!ext) {
+			ext4_ext_drop_refs(path);
+			return 0;
+		}
+		if (uninit != ext4_ext_is_uninitialized(ext)) {
+			ext4_ext_drop_refs(path);
+			return 0;
+		}
+		from += ext4_ext_get_actual_len(ext);
+		ext4_ext_drop_refs(path);
+	}
+	return 1;
+}
+
+/**
  * mext_replace_branches - Replace original extents with new extents
  *
  * @handle:		journal handle
@@ -629,9 +666,6 @@ mext_replace_branches(handle_t *handle, struct inode *orig_inode,
 	int replaced_count = 0;
 	int dext_alen;
 
-	/* Protect extent trees against block allocations via delalloc */
-	double_down_write_data_sem(orig_inode, donor_inode);
-
 	/* Get the original extent for the block "orig_off" */
 	*err = get_ext_path(orig_inode, orig_off, &orig_path);
 	if (*err)
@@ -730,8 +764,6 @@ out:
 	ext4_ext_invalidate_cache(orig_inode);
 	ext4_ext_invalidate_cache(donor_inode);
 
-	double_up_write_data_sem(orig_inode, donor_inode);
-
 	return replaced_count;
 }
 
@@ -924,7 +956,46 @@ again:
 				     pagep);
 	if (unlikely(*err < 0))
 		goto stop_journal;
+	/*
+	 * If orig extent was uninitialized it can become initialized at any time
+	 * after i_data_sem was dropped, in order to serialize with delalloc
+	 * we have recheck extent while we hold page's lock, if it is still the
+	 * case data copy is not necessary, just swap data blocks between orig
+	 * and donor.
+	 */
+	if (uninit) {
+		double_down_write_data_sem(orig_inode, donor_inode);
+		/* If any of extents in range became initialized we have to
+		 * fallback to data copying */
+		uninit = mext_check_coverage(orig_inode, orig_blk_offset,
+					     block_len_in_page, 1, err);
+		if (*err)
+			goto drop_data_sem;
 
+		uninit &= mext_check_coverage(donor_inode, orig_blk_offset,
+					      block_len_in_page, 1, err);
+		if (*err)
+			goto drop_data_sem;
+
+		if (!uninit) {
+			double_up_write_data_sem(orig_inode, donor_inode);
+			goto data_copy;
+		}
+		if ((page_has_private(pagep[0]) &&
+		     !try_to_release_page(pagep[0], 0)) ||
+		    (page_has_private(pagep[1]) &&
+		     !try_to_release_page(pagep[1], 0))) {
+			*err = -EBUSY;
+			goto drop_data_sem;
+		}
+		replaced_count = mext_replace_branches(handle, orig_inode,
+						donor_inode, orig_blk_offset,
+						block_len_in_page, err);
+	drop_data_sem:
+		double_up_write_data_sem(orig_inode, donor_inode);
+		goto unlock_pages;
+	}
+data_copy:
 	*err = mext_page_mkuptodate(pagep[0], from, from + replaced_size);
 	if (*err)
 		goto unlock_pages;
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/4] ext4: move_extent code cleanup
  2012-09-24 12:23 ` [PATCH 1/4] ext4: move_extent code cleanup Dmitry Monakhov
@ 2012-09-26 16:30   ` Theodore Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2012-09-26 16:30 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, a-fujita

On Mon, Sep 24, 2012 at 04:23:51PM +0400, Dmitry Monakhov wrote:
> - Remove usless checks, because it is too late to check that inode != NULL
>   at the moment it was referenced several times.
> - Double lock routines looks very ugly and locking ordering relays on
>   order of i_ino, but other kernel code relay on order of pointers.
>   Let's  make them simple and clean.
> - check that inodes belongs to the same SB as soon as possible.
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

Applied, thanks!!

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] ext4: online defrag is not supported for journaled files
  2012-09-24 12:23 ` [PATCH 2/4] ext4: online defrag is not supported for journaled files Dmitry Monakhov
@ 2012-09-26 16:30   ` Theodore Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2012-09-26 16:30 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, a-fujita

On Mon, Sep 24, 2012 at 04:23:52PM +0400, Dmitry Monakhov wrote:
> Proper block swap for inodes with full journaling enabled is
> truly non obvious task. In order to be on a safe side let's
> explicitly disable it for now.
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

Applied, thanks!!

					- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] ext4: basic bug shield for move_extent_per_page
  2012-09-24 12:23 ` [PATCH 3/4] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
@ 2012-09-26 16:50   ` Theodore Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2012-09-26 16:50 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, a-fujita

Applied, thanks.

I added one sanity check:

> +	for(bh = head, block_start = 0; bh != head || !block_start;
> +	    block++, block_start = block_end, bh = bh->b_this_page) {
  	    ...
		BUG_ON(nr >= MAX_BUF_PER_PAGE);
> +		arr[nr++] = bh;
> +	}

...just to make sure a corrupted link list on b_this_page can't cause
a on-stack array overrun.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 4/4] ext4: reimplement uninit extent optimization for move_extent_per_page V2
  2012-09-24 12:23 ` [PATCH 4/4] ext4: reimplement uninit extent optimization for move_extent_per_page V2 Dmitry Monakhov
@ 2012-09-26 16:56   ` Theodore Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2012-09-26 16:56 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, a-fujita

On Mon, Sep 24, 2012 at 04:23:54PM +0400, Dmitry Monakhov wrote:
> Uninitialized extent may became initialized(parallel writeback task)
> at any moment after we drop i_data_sem, so we have to recheck extent's
> state after we hold page's lock and i_data_sem.
> 
> If we about to change page's mapping we must hold page's lock in order to
> serialize other users.
> Changes from V1 in response to Akira's comments:
> - fix last block mistype
> - Rerutn inital error code
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-09-26 16:56 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-09-24 12:23 [PATCH 0/4] ext4: ioc_move_ext fixes Dmitry Monakhov
2012-09-24 12:23 ` [PATCH 1/4] ext4: move_extent code cleanup Dmitry Monakhov
2012-09-26 16:30   ` Theodore Ts'o
2012-09-24 12:23 ` [PATCH 2/4] ext4: online defrag is not supported for journaled files Dmitry Monakhov
2012-09-26 16:30   ` Theodore Ts'o
2012-09-24 12:23 ` [PATCH 3/4] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
2012-09-26 16:50   ` Theodore Ts'o
2012-09-24 12:23 ` [PATCH 4/4] ext4: reimplement uninit extent optimization for move_extent_per_page V2 Dmitry Monakhov
2012-09-26 16:56   ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).