Re: [PATCH 2/3] ext4: basic bug shield for move_extent_per_page

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dmitry Monakhov <dmonakhov@openvz.org>
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, a-fujita@rs.jp.nec.com
Subject: Re: [PATCH 2/3] ext4: basic bug shield for move_extent_per_page
Date: Tue, 28 Aug 2012 20:29:52 +0400	[thread overview]
Message-ID: <878vczdlof.fsf@openvz.org> (raw)
In-Reply-To: <1346170903-7563-2-git-send-email-dmonakhov@openvz.org>

[-- Attachment #1: Type: text/plain, Size: 164 bytes --]

On Tue, 28 Aug 2012 20:21:42 +0400, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
In order to test MOVE_EXT ioctl code i use run e4defrag and fsstress
in parallel:

[-- Attachment #2: 285 --]
[-- Type: text/plain, Size: 2522 bytes --]

#! /bin/bash
# FSQA Test No. 270
#
# Online defrag stress test for ext4
#
#-----------------------------------------------------------------------
# Copyright (c) 2006 Silicon Graphics, Inc.  All Rights Reserved.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
#
#-----------------------------------------------------------------------
#
# creator
owner=dmonakhov@openvz.org

seq=`basename $0`
echo "QA output created by $seq"

here=`pwd`
tmp=/tmp/$$
status=1	# failure is the default!
trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15

# get standard environment, filters and checks
. ./common.rc
. ./common.filter
. ./common.quota

# Disable all sync operations to get higher load
FSSTRESS_AVOID="$FSSTRESS_AVOID -ffsync=0 -fsync=0 -ffdatasync=0"
_workout()
{
	echo ""
	echo "Run fsstress"
	echo ""
	num_iterations=10
	delay=2
	out=$SCRATCH_MNT/fsstress.$$
	# Restrict number of inodes in order to increase  
	setquota -u $qa_user 1000000 1000000 100 100 $SCRATCH_MNT
	args="-p4 -n999999999 -f setattr=0 $FSSTRESS_AVOID -d $out"
	echo "fsstress $args" >> $here/$seq.full
	(su $qa_user -c "$FSSTRESS_PROG $args"  &) > /dev/null 2>&1
	pid=$!
	echo "Run online defrag in parallel"
	for ((i=0; i < num_iterations; i++))
	do
		e4defrag -v $SCRATCH_MNT >> $here/$seq.full 2>&1
		sleep $delay
	done
	killall fsstress
	wait $pid
}

# real QA test starts here
_supported_fs generic
_supported_os Linux
_require_quota
_require_user
_need_to_be_root
_require_scratch

_scratch_mkfs_sized $((512 * 1024 * 1024)) >> $seq.full 2>&1
_scratch_mount "-o usrquota,grpquota"
chmod 777 $SCRATCH_MNT
quotacheck -u -g $SCRATCH_MNT 2>/dev/null
quotaon -u -g $SCRATCH_MNT 2>/dev/null

if ! _workout; then
	_scratch_unmount 2>/dev/null
	exit
fi

if ! _check_quota_usage; then
	_scratch_unmount 2>/dev/null
	status=1
	exit
fi

echo Comparing filesystem consistency
if ! _scratch_unmount; then
	echo "failed to umount"
	status=1
	exit
fi
_check_scratch_fs
status=$?
exit

[-- Attachment #3: Type: text/plain, Size: 6221 bytes --]


> The move_extent_per_page function is just one big peace of crap.
> 
> Non-full list of bugs:
> 1) uninitialized extent optimization does not hold page's lock,
>    and simply replace brunches, so writeback code goes
>    crazy because block mapping changed under it's feets
>    kernel BUG at fs/ext4/inode.c:1434!
> 
> 2) uninitialized extent may became initialized right after we
>    drop i_data_sem, so extent state must be rechecked
> 
> 3) Locked pages goes up to date via following sequence:
>    ->readpage(page); lock_page(page); use_that_page(page)
>    But after readpage() one may invalidate it because it is
>    uptodate and unlocked (reclaimer does that)
>    As result kernel bug at include/linux/buffer_head.c:133!
> 4) We call write_begin() with already opened stansaction which
>    result in following deadlock:
> ->move_extent_per_page()
>   ->ext4_journal_start()-> hold journal transaction
>   ->write_begin()
>     ->ext4_da_write_begin()
>       ->ext4_nonda_switch()
>         ->writeback_inodes_sb_if_idle()  --> will wait for journal_stop()
> 
> 5) try_to_release_page() may fail and it does fail if one of page's bh was
>    pinned by journal
> 
> 6) If we about to change page's mapping we MUST hold it's lock during entire
>    remapping procedure, this is true for both pages(original and donor one)
> 
> Fixes:
> 
> - Avoid (1) and (2) simply by temproraly drop uninitialized extent handling
>   optimization, this will be reimplemented later.
> 
> - Fix (3) by manually forcing page to uptodate state w/o dropping it's lock
> 
> - Fix (4) by rearranging existing locking:
>   from: journal_start(); ->write_begin
>   to: write_begin(); journal_extend()
> - Fix (5) simply by checking retvalue
> - (6) requires full function's code rearrangement, will be done later.
> ---
>  fs/ext4/move_extent.c |   74 ++++++++++++++++++++++++-------------------------
>  1 files changed, 36 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index c5826c6..c5ca317 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -812,11 +812,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
>  	 * inode and donor_inode may change each different metadata blocks.
>  	 */
>  	jblocks = ext4_writepage_trans_blocks(orig_inode) * 2;
> -	handle = ext4_journal_start(orig_inode, jblocks);
> -	if (IS_ERR(handle)) {
> -		*err = PTR_ERR(handle);
> -		return 0;
> -	}
>  
>  	if (segment_eq(get_fs(), KERNEL_DS))
>  		w_flags |= AOP_FLAG_UNINTERRUPTIBLE;
> @@ -824,19 +819,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
>  	orig_blk_offset = orig_page_offset * blocks_per_page +
>  		data_offset_in_page;
>  
> -	/*
> -	 * If orig extent is uninitialized one,
> -	 * it's not necessary force the page into memory
> -	 * and then force it to be written out again.
> -	 * Just swap data blocks between orig and donor.
> -	 */
> -	if (uninit) {
> -		replaced_count = mext_replace_branches(handle, orig_inode,
> -						donor_inode, orig_blk_offset,
> -						block_len_in_page, err);
> -		goto out2;
> -	}
> -
>  	offs = (long long)orig_blk_offset << orig_inode->i_blkbits;
>  
>  	/* Calculate data_size */
> @@ -862,12 +844,28 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
>  				 &page, &fsdata);
>  	if (unlikely(*err < 0))
>  		goto out;
> +	handle = journal_current_handle();
>  
> +	if (!page_has_buffers(page))
> +		create_empty_buffers(page, 1 << orig_inode->i_blkbits, 0);
> +
> +	/* Force page uptodate similar block_write_commit */
> +	page_zero_new_buffers(page, 0, PAGE_SIZE);
>  	if (!PageUptodate(page)) {
> -		mapping->a_ops->readpage(o_filp, page);
> -		lock_page(page);
> +		struct buffer_head *head, *bh;
> +		bh = head =  page_buffers(page);
> +		do {
> +			if (!bh_uptodate_or_lock(bh)) {
> +				if (bh_submit_read(bh) < 0) {
> +					put_bh(bh);
> +					err2  = -EIO;
> +					goto drop_page;
> +				}
> +			}
> +			bh = bh->b_this_page;
> +		} while (bh != head);
>  	}
> -
> +	SetPageUptodate(page);
>  	/*
>  	 * try_to_release_page() doesn't call releasepage in writeback mode.
>  	 * We should care about the order of writing to the same file
> @@ -876,9 +874,15 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
>  	 * writeback of the page.
>  	 */
>  	wait_on_page_writeback(page);
> -
> -	/* Release old bh and drop refs */
> -	try_to_release_page(page, 0);
> +	/* Finally page is fully uptodate and locked, it is time to drop
> +	 * old mapping info, function may fail other task hold reference
> +	 * on page's bh */
> +	if (!try_to_release_page(page, 0)) {
> +		replaced_size = 0;
> +		goto write_end;
> +	}
> +	if (ext4_journal_extend(handle, jblocks) < 0)
> +		goto write_end;
>  
>  	replaced_count = mext_replace_branches(handle, orig_inode, donor_inode,
>  					orig_blk_offset, block_len_in_page,
> @@ -889,7 +893,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
>  			replaced_size =
>  				block_len_in_page << orig_inode->i_blkbits;
>  		} else
> -			goto out;
> +			goto drop_page;
>  	}
>  
>  	if (!page_has_buffers(page))
> @@ -903,30 +907,24 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
>  		*err = ext4_get_block(orig_inode,
>  				(sector_t)(orig_blk_offset + i), bh, 0);
>  		if (*err < 0)
> -			goto out;
> +			goto drop_page;
>  
>  		if (bh->b_this_page != NULL)
>  			bh = bh->b_this_page;
>  	}
> -
> +write_end:
>  	*err = a_ops->write_end(o_filp, mapping, offs, data_size, replaced_size,
>  			       page, fsdata);
> -	page = NULL;
> -
>  out:
> -	if (unlikely(page)) {
> -		if (PageLocked(page))
> -			unlock_page(page);
> -		page_cache_release(page);
> -		ext4_journal_stop(handle);
> -	}
> -out2:
> -	ext4_journal_stop(handle);
> -
>  	if (err2)
>  		*err = err2;
>  
>  	return replaced_count;
> +drop_page:
> +	unlock_page(page);
> +	page_cache_release(page);
> +	ext4_journal_stop(handle);
> +	goto out;
>  }
>  
>  /**
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2012-08-28 16:29 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-08-28 16:21 [PATCH 1/3] ext4: nonda_switch prevent deadlock Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 2/3] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
2012-08-28 16:29   ` Dmitry Monakhov [this message]
2012-08-28 16:21 ` [PATCH 3/3] ext4: reimplement uninit extent optimization " Dmitry Monakhov
2012-08-29  8:24   ` Akira Fujita
2012-08-29 13:28 ` [PATCH 1/3] ext4: nonda_switch prevent deadlock Jan Kara
2012-08-30  6:48   ` Dmitry Monakhov
2012-08-30 12:55     ` Jan Kara
2012-08-30 11:12   ` Akira Fujita

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=878vczdlof.fsf@openvz.org \
    --to=dmonakhov@openvz.org \
    --cc=a-fujita@rs.jp.nec.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.