* [PATCH 1/3] ext4: nonda_switch prevent deadlock
@ 2012-08-28 16:21 Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 2/3] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Dmitry Monakhov @ 2012-08-28 16:21 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov
Currently ext4_da_write_begin may deadlock if called with opened journal
transaction. Real life example:
->move_extent_per_page()
->ext4_journal_start()-> hold journal transaction
->write_begin()
->ext4_da_write_begin()
->ext4_nonda_switch()
->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
This bug may be easily fixed by code reordering,
But in some cases it should be possible to call write_begin()
while holding journal's transaction, in this case caller may avoid
recoursion by passing AOP_FLAG_NOFS flag.
---
fs/ext4/inode.c | 28 +++++++++++++++++-----------
1 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6324f74..d12d30e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -889,6 +889,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
struct page *page;
pgoff_t index;
unsigned from, to;
+ int nofs = flags & AOP_FLAG_NOFS;
+
+ /* We cannot recurse into the filesystem if the transaction is already
+ * started */
+ BUG_ON(!nofs && journal_current_handle());
trace_ext4_write_begin(inode, pos, len, flags);
/*
@@ -906,9 +911,6 @@ retry:
ret = PTR_ERR(handle);
goto out;
}
-
- /* We cannot recurse into the filesystem as the transaction is already
- * started */
flags |= AOP_FLAG_NOFS;
page = grab_cache_page_write_begin(mapping, index, flags);
@@ -957,7 +959,8 @@ retry:
}
}
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ if (!nofs && ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
out:
return ret;
@@ -2447,7 +2450,7 @@ out_writepages:
}
#define FALL_BACK_TO_NONDELALLOC 1
-static int ext4_nonda_switch(struct super_block *sb)
+static int ext4_nonda_switch(struct super_block *sb, int writeback_allowed)
{
s64 free_blocks, dirty_blocks;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -2475,7 +2478,7 @@ static int ext4_nonda_switch(struct super_block *sb)
* Even if we don't switch but are nearing capacity,
* start pushing delalloc when 1/2 of free blocks are dirty.
*/
- if (free_blocks < 2 * dirty_blocks)
+ if (writeback_allowed && free_blocks < 2 * dirty_blocks)
writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
return 0;
@@ -2490,10 +2493,14 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index;
struct inode *inode = mapping->host;
handle_t *handle;
+ int nofs = flags & AOP_FLAG_NOFS;
index = pos >> PAGE_CACHE_SHIFT;
+ /* We cannot recurse into the filesystem if the transaction is already
+ * started */
+ BUG_ON(!nofs && journal_current_handle());
- if (ext4_nonda_switch(inode->i_sb)) {
+ if (ext4_nonda_switch(inode->i_sb, !nofs)) {
*fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
return ext4_write_begin(file, mapping, pos,
len, flags, pagep, fsdata);
@@ -2512,8 +2519,6 @@ retry:
ret = PTR_ERR(handle);
goto out;
}
- /* We cannot recurse into the filesystem as the transaction is already
- * started */
flags |= AOP_FLAG_NOFS;
page = grab_cache_page_write_begin(mapping, index, flags);
@@ -2538,7 +2543,8 @@ retry:
ext4_truncate_failed_write(inode);
}
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ if (!nofs && ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
out:
return ret;
@@ -4791,7 +4797,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
- !ext4_nonda_switch(inode->i_sb)) {
+ !ext4_nonda_switch(inode->i_sb, 1)) {
do {
ret = __block_page_mkwrite(vma, vmf,
ext4_da_get_block_prep);
--
1.7.7.6
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/3] ext4: basic bug shield for move_extent_per_page
2012-08-28 16:21 [PATCH 1/3] ext4: nonda_switch prevent deadlock Dmitry Monakhov
@ 2012-08-28 16:21 ` Dmitry Monakhov
2012-08-28 16:29 ` Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 3/3] ext4: reimplement uninit extent optimization " Dmitry Monakhov
2012-08-29 13:28 ` [PATCH 1/3] ext4: nonda_switch prevent deadlock Jan Kara
2 siblings, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2012-08-28 16:21 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov
The move_extent_per_page function is just one big peace of crap.
Non-full list of bugs:
1) uninitialized extent optimization does not hold page's lock,
and simply replace brunches, so writeback code goes
crazy because block mapping changed under it's feets
kernel BUG at fs/ext4/inode.c:1434!
2) uninitialized extent may became initialized right after we
drop i_data_sem, so extent state must be rechecked
3) Locked pages goes up to date via following sequence:
->readpage(page); lock_page(page); use_that_page(page)
But after readpage() one may invalidate it because it is
uptodate and unlocked (reclaimer does that)
As result kernel bug at include/linux/buffer_head.c:133!
4) We call write_begin() with already opened stansaction which
result in following deadlock:
->move_extent_per_page()
->ext4_journal_start()-> hold journal transaction
->write_begin()
->ext4_da_write_begin()
->ext4_nonda_switch()
->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
5) try_to_release_page() may fail and it does fail if one of page's bh was
pinned by journal
6) If we about to change page's mapping we MUST hold it's lock during entire
remapping procedure, this is true for both pages(original and donor one)
Fixes:
- Avoid (1) and (2) simply by temproraly drop uninitialized extent handling
optimization, this will be reimplemented later.
- Fix (3) by manually forcing page to uptodate state w/o dropping it's lock
- Fix (4) by rearranging existing locking:
from: journal_start(); ->write_begin
to: write_begin(); journal_extend()
- Fix (5) simply by checking retvalue
- (6) requires full function's code rearrangement, will be done later.
---
fs/ext4/move_extent.c | 74 ++++++++++++++++++++++++-------------------------
1 files changed, 36 insertions(+), 38 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5826c6..c5ca317 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -812,11 +812,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
* inode and donor_inode may change each different metadata blocks.
*/
jblocks = ext4_writepage_trans_blocks(orig_inode) * 2;
- handle = ext4_journal_start(orig_inode, jblocks);
- if (IS_ERR(handle)) {
- *err = PTR_ERR(handle);
- return 0;
- }
if (segment_eq(get_fs(), KERNEL_DS))
w_flags |= AOP_FLAG_UNINTERRUPTIBLE;
@@ -824,19 +819,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
orig_blk_offset = orig_page_offset * blocks_per_page +
data_offset_in_page;
- /*
- * If orig extent is uninitialized one,
- * it's not necessary force the page into memory
- * and then force it to be written out again.
- * Just swap data blocks between orig and donor.
- */
- if (uninit) {
- replaced_count = mext_replace_branches(handle, orig_inode,
- donor_inode, orig_blk_offset,
- block_len_in_page, err);
- goto out2;
- }
-
offs = (long long)orig_blk_offset << orig_inode->i_blkbits;
/* Calculate data_size */
@@ -862,12 +844,28 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
&page, &fsdata);
if (unlikely(*err < 0))
goto out;
+ handle = journal_current_handle();
+ if (!page_has_buffers(page))
+ create_empty_buffers(page, 1 << orig_inode->i_blkbits, 0);
+
+ /* Force page uptodate similar block_write_commit */
+ page_zero_new_buffers(page, 0, PAGE_SIZE);
if (!PageUptodate(page)) {
- mapping->a_ops->readpage(o_filp, page);
- lock_page(page);
+ struct buffer_head *head, *bh;
+ bh = head = page_buffers(page);
+ do {
+ if (!bh_uptodate_or_lock(bh)) {
+ if (bh_submit_read(bh) < 0) {
+ put_bh(bh);
+ err2 = -EIO;
+ goto drop_page;
+ }
+ }
+ bh = bh->b_this_page;
+ } while (bh != head);
}
-
+ SetPageUptodate(page);
/*
* try_to_release_page() doesn't call releasepage in writeback mode.
* We should care about the order of writing to the same file
@@ -876,9 +874,15 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
* writeback of the page.
*/
wait_on_page_writeback(page);
-
- /* Release old bh and drop refs */
- try_to_release_page(page, 0);
+ /* Finally page is fully uptodate and locked, it is time to drop
+ * old mapping info, function may fail other task hold reference
+ * on page's bh */
+ if (!try_to_release_page(page, 0)) {
+ replaced_size = 0;
+ goto write_end;
+ }
+ if (ext4_journal_extend(handle, jblocks) < 0)
+ goto write_end;
replaced_count = mext_replace_branches(handle, orig_inode, donor_inode,
orig_blk_offset, block_len_in_page,
@@ -889,7 +893,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
replaced_size =
block_len_in_page << orig_inode->i_blkbits;
} else
- goto out;
+ goto drop_page;
}
if (!page_has_buffers(page))
@@ -903,30 +907,24 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
*err = ext4_get_block(orig_inode,
(sector_t)(orig_blk_offset + i), bh, 0);
if (*err < 0)
- goto out;
+ goto drop_page;
if (bh->b_this_page != NULL)
bh = bh->b_this_page;
}
-
+write_end:
*err = a_ops->write_end(o_filp, mapping, offs, data_size, replaced_size,
page, fsdata);
- page = NULL;
-
out:
- if (unlikely(page)) {
- if (PageLocked(page))
- unlock_page(page);
- page_cache_release(page);
- ext4_journal_stop(handle);
- }
-out2:
- ext4_journal_stop(handle);
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/3] ext4: reimplement uninit extent optimization for move_extent_per_page
2012-08-28 16:21 [PATCH 1/3] ext4: nonda_switch prevent deadlock Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 2/3] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
@ 2012-08-28 16:21 ` Dmitry Monakhov
2012-08-29 8:24 ` Akira Fujita
2012-08-29 13:28 ` [PATCH 1/3] ext4: nonda_switch prevent deadlock Jan Kara
2 siblings, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2012-08-28 16:21 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, a-fujita, Dmitry Monakhov
Uninitialized extent may became initialized(parallel writeback task)
at any moment after we drop i_data_sem, so we have to recheck extent's
state after we hold page's lock and i_data_sem.
If we about to change page's mapping we must hold page's lock in order to
serialize other users.
---
fs/ext4/move_extent.c | 96 ++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 87 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5ca317..b62defa 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -629,6 +629,42 @@ mext_calc_swap_extents(struct ext4_extent *tmp_dext,
}
/**
+ * mext_check_range_coverage - Check that all extents in range has the same type
+ *
+ * @inode: inode in question
+ * @from: block offset of inode
+ * @count: block count to be checked
+ * @uninit: extents expected to be uninitialized
+ * @err: pointer to save error value
+ *
+ * Return 1 if all extents in range has expected type, and zero otherwise.
+ */
+int mext_check_range_coverage(struct inode *inode, ext4_lblk_t from, ext4_lblk_t count,
+ int uninit, int *err)
+{
+ struct ext4_ext_path *path = NULL;
+ struct ext4_extent *ext;
+ ext4_lblk_t last = from;
+ while (from < last) {
+ *err = get_ext_path(inode, from, &path);
+ if (*err)
+ return 0;
+ ext = path[ext_depth(inode)].p_ext;
+ if (!ext) {
+ ext4_ext_drop_refs(path);
+ return 0;
+ }
+ if (uninit != ext4_ext_is_uninitialized(ext)) {
+ ext4_ext_drop_refs(path);
+ return 0;
+ }
+ from += ext4_ext_get_actual_len(ext);
+ ext4_ext_drop_refs(path);
+ }
+ return 1;
+}
+
+/**
* mext_replace_branches - Replace original extents with new extents
*
* @handle: journal handle
@@ -663,9 +699,6 @@ mext_replace_branches(handle_t *handle, struct inode *orig_inode,
int replaced_count = 0;
int dext_alen;
- /* Protect extent trees against block allocations via delalloc */
- double_down_write_data_sem(orig_inode, donor_inode);
-
/* Get the original extent for the block "orig_off" */
*err = get_ext_path(orig_inode, orig_off, &orig_path);
if (*err)
@@ -764,8 +797,6 @@ out:
ext4_ext_invalidate_cache(orig_inode);
ext4_ext_invalidate_cache(donor_inode);
- double_up_write_data_sem(orig_inode, donor_inode);
-
return replaced_count;
}
@@ -807,10 +838,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
int replaced_count = 0;
int blocks_per_page = PAGE_CACHE_SIZE >> orig_inode->i_blkbits;
- /*
- * It needs twice the amount of ordinary journal buffers because
- * inode and donor_inode may change each different metadata blocks.
- */
jblocks = ext4_writepage_trans_blocks(orig_inode) * 2;
if (segment_eq(get_fs(), KERNEL_DS))
@@ -819,6 +846,55 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
orig_blk_offset = orig_page_offset * blocks_per_page +
data_offset_in_page;
+ /*
+ * If orig extent was uninitialized, but it can become initialized at any
+ * time after i_data_sem was dropped, in order to serialize with delalloc
+ * we have recheck extent while we hold page's lock, if it is still the
+ * case data copy is not necessery, just swap data blocks between orig
+ * and donor.
+ */
+ if (uninit) {
+ handle = ext4_journal_start(orig_inode, jblocks);
+ if (!handle) {
+ *err = -ENOMEM;
+ return 0;
+ }
+ page = find_or_create_page(mapping,orig_page_offset, GFP_NOFS);
+ if (!page) {
+ ext4_journal_stop(handle);
+ *err = -ENOMEM;
+ return 0;
+ }
+ double_down_write_data_sem(orig_inode, donor_inode);
+ /* If any of extents in range became initialized we have to
+ * fallback to data copying */
+ if (!mext_check_range_coverage(orig_inode, orig_blk_offset,
+ block_len_in_page, 1, err) ||
+ !mext_check_range_coverage(donor_inode, orig_blk_offset,
+ block_len_in_page, 1, &err2)) {
+ double_up_write_data_sem(orig_inode, donor_inode);
+ unlock_page(page);
+ page_cache_release(page);
+ ext4_journal_stop(handle);
+ if (*err || err2)
+ goto out;
+ goto data_copy;
+ }
+ if (!try_to_release_page(page, 0))
+ goto drop_data_sem;
+
+ if (!PageUptodate(page)) {
+ zero_user(page, 0, PAGE_SIZE);
+ SetPageUptodate(page);
+ }
+ replaced_count = mext_replace_branches(handle, orig_inode,
+ donor_inode, orig_blk_offset,
+ block_len_in_page, err);
+ drop_data_sem:
+ double_up_write_data_sem(orig_inode, donor_inode);
+ goto drop_page;
+ }
+data_copy:
offs = (long long)orig_blk_offset << orig_inode->i_blkbits;
/* Calculate data_size */
@@ -884,9 +960,11 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
if (ext4_journal_extend(handle, jblocks) < 0)
goto write_end;
+ double_down_write_data_sem(orig_inode, donor_inode);
replaced_count = mext_replace_branches(handle, orig_inode, donor_inode,
orig_blk_offset, block_len_in_page,
&err2);
+ double_up_write_data_sem(orig_inode, donor_inode);
if (err2) {
if (replaced_count) {
block_len_in_page = replaced_count;
--
1.7.7.6
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/3] ext4: basic bug shield for move_extent_per_page
2012-08-28 16:21 ` [PATCH 2/3] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
@ 2012-08-28 16:29 ` Dmitry Monakhov
0 siblings, 0 replies; 9+ messages in thread
From: Dmitry Monakhov @ 2012-08-28 16:29 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, a-fujita
[-- Attachment #1: Type: text/plain, Size: 164 bytes --]
On Tue, 28 Aug 2012 20:21:42 +0400, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
In order to test MOVE_EXT ioctl code i use run e4defrag and fsstress
in parallel:
[-- Attachment #2: 285 --]
[-- Type: text/plain, Size: 2522 bytes --]
#! /bin/bash
# FSQA Test No. 270
#
# Online defrag stress test for ext4
#
#-----------------------------------------------------------------------
# Copyright (c) 2006 Silicon Graphics, Inc. All Rights Reserved.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
#
#-----------------------------------------------------------------------
#
# creator
owner=dmonakhov@openvz.org
seq=`basename $0`
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15
# get standard environment, filters and checks
. ./common.rc
. ./common.filter
. ./common.quota
# Disable all sync operations to get higher load
FSSTRESS_AVOID="$FSSTRESS_AVOID -ffsync=0 -fsync=0 -ffdatasync=0"
_workout()
{
echo ""
echo "Run fsstress"
echo ""
num_iterations=10
delay=2
out=$SCRATCH_MNT/fsstress.$$
# Restrict number of inodes in order to increase
setquota -u $qa_user 1000000 1000000 100 100 $SCRATCH_MNT
args="-p4 -n999999999 -f setattr=0 $FSSTRESS_AVOID -d $out"
echo "fsstress $args" >> $here/$seq.full
(su $qa_user -c "$FSSTRESS_PROG $args" &) > /dev/null 2>&1
pid=$!
echo "Run online defrag in parallel"
for ((i=0; i < num_iterations; i++))
do
e4defrag -v $SCRATCH_MNT >> $here/$seq.full 2>&1
sleep $delay
done
killall fsstress
wait $pid
}
# real QA test starts here
_supported_fs generic
_supported_os Linux
_require_quota
_require_user
_need_to_be_root
_require_scratch
_scratch_mkfs_sized $((512 * 1024 * 1024)) >> $seq.full 2>&1
_scratch_mount "-o usrquota,grpquota"
chmod 777 $SCRATCH_MNT
quotacheck -u -g $SCRATCH_MNT 2>/dev/null
quotaon -u -g $SCRATCH_MNT 2>/dev/null
if ! _workout; then
_scratch_unmount 2>/dev/null
exit
fi
if ! _check_quota_usage; then
_scratch_unmount 2>/dev/null
status=1
exit
fi
echo Comparing filesystem consistency
if ! _scratch_unmount; then
echo "failed to umount"
status=1
exit
fi
_check_scratch_fs
status=$?
exit
[-- Attachment #3: Type: text/plain, Size: 6221 bytes --]
> The move_extent_per_page function is just one big peace of crap.
>
> Non-full list of bugs:
> 1) uninitialized extent optimization does not hold page's lock,
> and simply replace brunches, so writeback code goes
> crazy because block mapping changed under it's feets
> kernel BUG at fs/ext4/inode.c:1434!
>
> 2) uninitialized extent may became initialized right after we
> drop i_data_sem, so extent state must be rechecked
>
> 3) Locked pages goes up to date via following sequence:
> ->readpage(page); lock_page(page); use_that_page(page)
> But after readpage() one may invalidate it because it is
> uptodate and unlocked (reclaimer does that)
> As result kernel bug at include/linux/buffer_head.c:133!
> 4) We call write_begin() with already opened stansaction which
> result in following deadlock:
> ->move_extent_per_page()
> ->ext4_journal_start()-> hold journal transaction
> ->write_begin()
> ->ext4_da_write_begin()
> ->ext4_nonda_switch()
> ->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
>
> 5) try_to_release_page() may fail and it does fail if one of page's bh was
> pinned by journal
>
> 6) If we about to change page's mapping we MUST hold it's lock during entire
> remapping procedure, this is true for both pages(original and donor one)
>
> Fixes:
>
> - Avoid (1) and (2) simply by temproraly drop uninitialized extent handling
> optimization, this will be reimplemented later.
>
> - Fix (3) by manually forcing page to uptodate state w/o dropping it's lock
>
> - Fix (4) by rearranging existing locking:
> from: journal_start(); ->write_begin
> to: write_begin(); journal_extend()
> - Fix (5) simply by checking retvalue
> - (6) requires full function's code rearrangement, will be done later.
> ---
> fs/ext4/move_extent.c | 74 ++++++++++++++++++++++++-------------------------
> 1 files changed, 36 insertions(+), 38 deletions(-)
>
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index c5826c6..c5ca317 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -812,11 +812,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> * inode and donor_inode may change each different metadata blocks.
> */
> jblocks = ext4_writepage_trans_blocks(orig_inode) * 2;
> - handle = ext4_journal_start(orig_inode, jblocks);
> - if (IS_ERR(handle)) {
> - *err = PTR_ERR(handle);
> - return 0;
> - }
>
> if (segment_eq(get_fs(), KERNEL_DS))
> w_flags |= AOP_FLAG_UNINTERRUPTIBLE;
> @@ -824,19 +819,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> orig_blk_offset = orig_page_offset * blocks_per_page +
> data_offset_in_page;
>
> - /*
> - * If orig extent is uninitialized one,
> - * it's not necessary force the page into memory
> - * and then force it to be written out again.
> - * Just swap data blocks between orig and donor.
> - */
> - if (uninit) {
> - replaced_count = mext_replace_branches(handle, orig_inode,
> - donor_inode, orig_blk_offset,
> - block_len_in_page, err);
> - goto out2;
> - }
> -
> offs = (long long)orig_blk_offset << orig_inode->i_blkbits;
>
> /* Calculate data_size */
> @@ -862,12 +844,28 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> &page, &fsdata);
> if (unlikely(*err < 0))
> goto out;
> + handle = journal_current_handle();
>
> + if (!page_has_buffers(page))
> + create_empty_buffers(page, 1 << orig_inode->i_blkbits, 0);
> +
> + /* Force page uptodate similar block_write_commit */
> + page_zero_new_buffers(page, 0, PAGE_SIZE);
> if (!PageUptodate(page)) {
> - mapping->a_ops->readpage(o_filp, page);
> - lock_page(page);
> + struct buffer_head *head, *bh;
> + bh = head = page_buffers(page);
> + do {
> + if (!bh_uptodate_or_lock(bh)) {
> + if (bh_submit_read(bh) < 0) {
> + put_bh(bh);
> + err2 = -EIO;
> + goto drop_page;
> + }
> + }
> + bh = bh->b_this_page;
> + } while (bh != head);
> }
> -
> + SetPageUptodate(page);
> /*
> * try_to_release_page() doesn't call releasepage in writeback mode.
> * We should care about the order of writing to the same file
> @@ -876,9 +874,15 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> * writeback of the page.
> */
> wait_on_page_writeback(page);
> -
> - /* Release old bh and drop refs */
> - try_to_release_page(page, 0);
> + /* Finally page is fully uptodate and locked, it is time to drop
> + * old mapping info, function may fail other task hold reference
> + * on page's bh */
> + if (!try_to_release_page(page, 0)) {
> + replaced_size = 0;
> + goto write_end;
> + }
> + if (ext4_journal_extend(handle, jblocks) < 0)
> + goto write_end;
>
> replaced_count = mext_replace_branches(handle, orig_inode, donor_inode,
> orig_blk_offset, block_len_in_page,
> @@ -889,7 +893,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> replaced_size =
> block_len_in_page << orig_inode->i_blkbits;
> } else
> - goto out;
> + goto drop_page;
> }
>
> if (!page_has_buffers(page))
> @@ -903,30 +907,24 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> *err = ext4_get_block(orig_inode,
> (sector_t)(orig_blk_offset + i), bh, 0);
> if (*err < 0)
> - goto out;
> + goto drop_page;
>
> if (bh->b_this_page != NULL)
> bh = bh->b_this_page;
> }
> -
> +write_end:
> *err = a_ops->write_end(o_filp, mapping, offs, data_size, replaced_size,
> page, fsdata);
> - page = NULL;
> -
> out:
> - if (unlikely(page)) {
> - if (PageLocked(page))
> - unlock_page(page);
> - page_cache_release(page);
> - ext4_journal_stop(handle);
> - }
> -out2:
> - ext4_journal_stop(handle);
> -
> if (err2)
> *err = err2;
>
> return replaced_count;
> +drop_page:
> + unlock_page(page);
> + page_cache_release(page);
> + ext4_journal_stop(handle);
> + goto out;
> }
>
> /**
> --
> 1.7.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] ext4: reimplement uninit extent optimization for move_extent_per_page
2012-08-28 16:21 ` [PATCH 3/3] ext4: reimplement uninit extent optimization " Dmitry Monakhov
@ 2012-08-29 8:24 ` Akira Fujita
0 siblings, 0 replies; 9+ messages in thread
From: Akira Fujita @ 2012-08-29 8:24 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso
Hi Dmitry,
(2012/08/29 1:21), Dmitry Monakhov wrote:
> Uninitialized extent may became initialized(parallel writeback task)
> at any moment after we drop i_data_sem, so we have to recheck extent's
> state after we hold page's lock and i_data_sem.
>
> If we about to change page's mapping we must hold page's lock in order to
> serialize other users.
> ---
> fs/ext4/move_extent.c | 96 ++++++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 87 insertions(+), 9 deletions(-)
>
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index c5ca317..b62defa 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -629,6 +629,42 @@ mext_calc_swap_extents(struct ext4_extent *tmp_dext,
> }
>
> /**
> + * mext_check_range_coverage - Check that all extents in range has the same type
> + *
> + * @inode: inode in question
> + * @from: block offset of inode
> + * @count: block count to be checked
> + * @uninit: extents expected to be uninitialized
> + * @err: pointer to save error value
> + *
> + * Return 1 if all extents in range has expected type, and zero otherwise.
> + */
> +int mext_check_range_coverage(struct inode *inode, ext4_lblk_t from, ext4_lblk_t count,
> + int uninit, int *err)
> +{
> + struct ext4_ext_path *path = NULL;
> + struct ext4_extent *ext;
> + ext4_lblk_t last = from;
ext4_lblk_t last = from + count;
> + while (from < last) {
> + *err = get_ext_path(inode, from, &path);
> + if (*err)
> + return 0;
> + ext = path[ext_depth(inode)].p_ext;
> + if (!ext) {
> + ext4_ext_drop_refs(path);
> + return 0;
> + }
> + if (uninit != ext4_ext_is_uninitialized(ext)) {
> + ext4_ext_drop_refs(path);
> + return 0;
> + }
> + from += ext4_ext_get_actual_len(ext);
> + ext4_ext_drop_refs(path);
> + }
> + return 1;
> +}
> +
> +/**
> * mext_replace_branches - Replace original extents with new extents
> *
> * @handle: journal handle
> @@ -663,9 +699,6 @@ mext_replace_branches(handle_t *handle, struct inode *orig_inode,
> int replaced_count = 0;
> int dext_alen;
>
> - /* Protect extent trees against block allocations via delalloc */
> - double_down_write_data_sem(orig_inode, donor_inode);
> -
> /* Get the original extent for the block "orig_off" */
> *err = get_ext_path(orig_inode, orig_off, &orig_path);
> if (*err)
> @@ -764,8 +797,6 @@ out:
> ext4_ext_invalidate_cache(orig_inode);
> ext4_ext_invalidate_cache(donor_inode);
>
> - double_up_write_data_sem(orig_inode, donor_inode);
> -
> return replaced_count;
> }
>
> @@ -807,10 +838,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> int replaced_count = 0;
> int blocks_per_page = PAGE_CACHE_SIZE >> orig_inode->i_blkbits;
>
> - /*
> - * It needs twice the amount of ordinary journal buffers because
> - * inode and donor_inode may change each different metadata blocks.
> - */
> jblocks = ext4_writepage_trans_blocks(orig_inode) * 2;
>
> if (segment_eq(get_fs(), KERNEL_DS))
> @@ -819,6 +846,55 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> orig_blk_offset = orig_page_offset * blocks_per_page +
> data_offset_in_page;
>
> + /*
> + * If orig extent was uninitialized, but it can become initialized at any
> + * time after i_data_sem was dropped, in order to serialize with delalloc
> + * we have recheck extent while we hold page's lock, if it is still the
> + * case data copy is not necessery, just swap data blocks between orig
> + * and donor.
> + */
> + if (uninit) {
> + handle = ext4_journal_start(orig_inode, jblocks);
> + if (!handle) {
> + *err = -ENOMEM;
> + return 0;
> + }
It would be better we use IS_ERR(handle)
because ext4_journal_start() can return various error values.
> + page = find_or_create_page(mapping,orig_page_offset, GFP_NOFS);
You might want to put a white space after "mapping,".
> + if (!page) {
> + ext4_journal_stop(handle);
> + *err = -ENOMEM;
> + return 0;
> + }
> + double_down_write_data_sem(orig_inode, donor_inode);
> + /* If any of extents in range became initialized we have to
> + * fallback to data copying */
> + if (!mext_check_range_coverage(orig_inode, orig_blk_offset,
> + block_len_in_page, 1, err) ||
> + !mext_check_range_coverage(donor_inode, orig_blk_offset,
> + block_len_in_page, 1, &err2)) {
> + double_up_write_data_sem(orig_inode, donor_inode);
> + unlock_page(page);
> + page_cache_release(page);
> + ext4_journal_stop(handle);
> + if (*err || err2)
> + goto out;
> + goto data_copy;
> + }
> + if (!try_to_release_page(page, 0))
> + goto drop_data_sem;
> +
> + if (!PageUptodate(page)) {
> + zero_user(page, 0, PAGE_SIZE);
> + SetPageUptodate(page);
> + }
> + replaced_count = mext_replace_branches(handle, orig_inode,
> + donor_inode, orig_blk_offset,
> + block_len_in_page, err);
> + drop_data_sem:
> + double_up_write_data_sem(orig_inode, donor_inode);
> + goto drop_page;
> + }
> +data_copy:
> offs = (long long)orig_blk_offset << orig_inode->i_blkbits;
>
> /* Calculate data_size */
> @@ -884,9 +960,11 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
> if (ext4_journal_extend(handle, jblocks) < 0)
> goto write_end;
>
> + double_down_write_data_sem(orig_inode, donor_inode);
> replaced_count = mext_replace_branches(handle, orig_inode, donor_inode,
> orig_blk_offset, block_len_in_page,
> &err2);
> + double_up_write_data_sem(orig_inode, donor_inode);
> if (err2) {
> if (replaced_count) {
> block_len_in_page = replaced_count;
>
Regards,
Akira Fujita
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] ext4: nonda_switch prevent deadlock
2012-08-28 16:21 [PATCH 1/3] ext4: nonda_switch prevent deadlock Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 2/3] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 3/3] ext4: reimplement uninit extent optimization " Dmitry Monakhov
@ 2012-08-29 13:28 ` Jan Kara
2012-08-30 6:48 ` Dmitry Monakhov
2012-08-30 11:12 ` Akira Fujita
2 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2012-08-29 13:28 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, a-fujita
On Tue 28-08-12 20:21:41, Dmitry Monakhov wrote:
> Currently ext4_da_write_begin may deadlock if called with opened journal
> transaction. Real life example:
> ->move_extent_per_page()
> ->ext4_journal_start()-> hold journal transaction
> ->write_begin()
> ->ext4_da_write_begin()
> ->ext4_nonda_switch()
> ->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
>
> This bug may be easily fixed by code reordering,
> But in some cases it should be possible to call write_begin()
> while holding journal's transaction, in this case caller may avoid
> recoursion by passing AOP_FLAG_NOFS flag.
Well, I find calling ext4_write_begin() with a transaction started a bug.
Possibly ext4_write_begin() can be tweaked to handle that but things would
be simpler if we didn't have to.
Looking into move_extent_per_page(), calling ->write_begin() doesn't seem
to be quite right there anyway. For example it results in filling holes
under that page which is not desirable. I'm not even sure why do we call
->write_begin() there at all. The data in the page is unchanged. So it
should be enough to just remap buffers and mark the page dirty (but I might
be missing some subtlety here). Fujita-san, can you possibly explain?
Honza
> ---
> fs/ext4/inode.c | 28 +++++++++++++++++-----------
> 1 files changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6324f74..d12d30e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -889,6 +889,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> struct page *page;
> pgoff_t index;
> unsigned from, to;
> + int nofs = flags & AOP_FLAG_NOFS;
> +
> + /* We cannot recurse into the filesystem if the transaction is already
> + * started */
> + BUG_ON(!nofs && journal_current_handle());
>
> trace_ext4_write_begin(inode, pos, len, flags);
> /*
> @@ -906,9 +911,6 @@ retry:
> ret = PTR_ERR(handle);
> goto out;
> }
> -
> - /* We cannot recurse into the filesystem as the transaction is already
> - * started */
> flags |= AOP_FLAG_NOFS;
>
> page = grab_cache_page_write_begin(mapping, index, flags);
> @@ -957,7 +959,8 @@ retry:
> }
> }
>
> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + if (!nofs && ret == -ENOSPC &&
> + ext4_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
> out:
> return ret;
> @@ -2447,7 +2450,7 @@ out_writepages:
> }
>
> #define FALL_BACK_TO_NONDELALLOC 1
> -static int ext4_nonda_switch(struct super_block *sb)
> +static int ext4_nonda_switch(struct super_block *sb, int writeback_allowed)
> {
> s64 free_blocks, dirty_blocks;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -2475,7 +2478,7 @@ static int ext4_nonda_switch(struct super_block *sb)
> * Even if we don't switch but are nearing capacity,
> * start pushing delalloc when 1/2 of free blocks are dirty.
> */
> - if (free_blocks < 2 * dirty_blocks)
> + if (writeback_allowed && free_blocks < 2 * dirty_blocks)
> writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
>
> return 0;
> @@ -2490,10 +2493,14 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> pgoff_t index;
> struct inode *inode = mapping->host;
> handle_t *handle;
> + int nofs = flags & AOP_FLAG_NOFS;
>
> index = pos >> PAGE_CACHE_SHIFT;
> + /* We cannot recurse into the filesystem if the transaction is already
> + * started */
> + BUG_ON(!nofs && journal_current_handle());
>
> - if (ext4_nonda_switch(inode->i_sb)) {
> + if (ext4_nonda_switch(inode->i_sb, !nofs)) {
> *fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
> return ext4_write_begin(file, mapping, pos,
> len, flags, pagep, fsdata);
> @@ -2512,8 +2519,6 @@ retry:
> ret = PTR_ERR(handle);
> goto out;
> }
> - /* We cannot recurse into the filesystem as the transaction is already
> - * started */
> flags |= AOP_FLAG_NOFS;
>
> page = grab_cache_page_write_begin(mapping, index, flags);
> @@ -2538,7 +2543,8 @@ retry:
> ext4_truncate_failed_write(inode);
> }
>
> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + if (!nofs && ret == -ENOSPC &&
> + ext4_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
> out:
> return ret;
> @@ -4791,7 +4797,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> /* Delalloc case is easy... */
> if (test_opt(inode->i_sb, DELALLOC) &&
> !ext4_should_journal_data(inode) &&
> - !ext4_nonda_switch(inode->i_sb)) {
> + !ext4_nonda_switch(inode->i_sb, 1)) {
> do {
> ret = __block_page_mkwrite(vma, vmf,
> ext4_da_get_block_prep);
> --
> 1.7.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] ext4: nonda_switch prevent deadlock
2012-08-29 13:28 ` [PATCH 1/3] ext4: nonda_switch prevent deadlock Jan Kara
@ 2012-08-30 6:48 ` Dmitry Monakhov
2012-08-30 12:55 ` Jan Kara
2012-08-30 11:12 ` Akira Fujita
1 sibling, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2012-08-30 6:48 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, tytso, a-fujita
On Wed, 29 Aug 2012 15:28:16 +0200, Jan Kara <jack@suse.cz> wrote:
> On Tue 28-08-12 20:21:41, Dmitry Monakhov wrote:
> > Currently ext4_da_write_begin may deadlock if called with opened journal
> > transaction. Real life example:
> > ->move_extent_per_page()
> > ->ext4_journal_start()-> hold journal transaction
> > ->write_begin()
> > ->ext4_da_write_begin()
> > ->ext4_nonda_switch()
> > ->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
> >
> > This bug may be easily fixed by code reordering,
> > But in some cases it should be possible to call write_begin()
> > while holding journal's transaction, in this case caller may avoid
> > recoursion by passing AOP_FLAG_NOFS flag.
> Well, I find calling ext4_write_begin() with a transaction started a bug.
> Possibly ext4_write_begin() can be tweaked to handle that but things would
> be simpler if we didn't have to.
>
> Looking into move_extent_per_page(), calling ->write_begin() doesn't seem
> to be quite right there anyway. For example it results in filling holes
> under that page which is not desirable. I'm not even sure why do we call
> ->write_begin() there at all. The data in the page is unchanged. So it
> should be enough to just remap buffers and mark the page dirty (but I might
> be missing some subtlety here). Fujita-san, can you possibly explain?
Yes, and No imho write_begin/write_end is not necessary here,
but simple marking page as dirty is not sufficient either because
we end up files becomes zeroed if power failure during defrag procedure,
as it now happen in case of delalloc
start_journal()
remap_page()
set_page_dirty()
stop_journal()
...15secs <---- power failure
write_back_page()
but in our case situation is even worse because good/oldfile
will become corrupted so we have to somehow pin dirty pages to
transaction similar to ordered mode.
Currently i'm work on that approach which is simple but looks
reliable: remap_procedure should looks like follows(error paths missed):
/* Double lock pages on two inodes, this is horrible but necessary */
double_grab_page(struct inode *inode, struct inode *inode2,
pgoff idx, struct page** pgvec1, page** pgvec2)
{
if (mapping1 > mapping2) {
*pgvec1 = find_or_create_page(mapping1, idx, GFP_NOFS);
*pgvec2 = find_or_create_page(mapping2, idx, GFP_NOFS);
} else {
*pgvec2 = find_or_create_page(mapping2, idx, GFP_NOFS);
*pgvec1 = find_or_create_page(mapping1, idx, GFP_NOFS);
}
}
move_extent () {
page * o_vec[32];
page * d_vec[32];
new_bio:
bio = bio_alloc()
bio->bi_sector = donor_extent->sector+start
... /* other bio initialization */
/* start journal before lock pages */
journal_start();
for (idx = 0; idx <= end; idx++) {
double_grab_page(orig, donor, idx + start, o_vec + idx, d_vec + idx)
make_page_uptodate_nounlock(o_vec + idx)
try_to_release_page(o_vec+idx)
try_to_release_page(d_vec+idx)
/* try to add page to bio which point to donor extent */
if (add_page_to_bio(bio, o_vec + idx))
continue;
/* There is no mo space left in bio, it is time to submit it
* pages will be unlocked soon so we have to protest pages
* from modification via writeback bit, writeback will be
* cleared at bio_end_io
*/
for_each_page(o_vec, o_vec + idx, SetPageWriteback)
submit_bio(bio);
/* After we submitted bio for io we can be shure that
* it will be flushed do disk before commit journal record
* because commit record request contain barrier,
* so effectively this bio pinned to transaction
*/
mext_replace_branches(orig, donor, start, start+idx)
/* Ok we have complete remapping, it is now safe to unlock pages */
for_each_page(o_vec, o_vec + idx, page_unlock)
for_each_page(d_vec, d_vec + idx, page_unlock)
start += idx;
goto new_bio:
}
}
This should works for any file types with data (with extra precautions for dir
and symlinks)
>
> Honza
>
> > ---
> > fs/ext4/inode.c | 28 +++++++++++++++++-----------
> > 1 files changed, 17 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 6324f74..d12d30e 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -889,6 +889,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> > struct page *page;
> > pgoff_t index;
> > unsigned from, to;
> > + int nofs = flags & AOP_FLAG_NOFS;
> > +
> > + /* We cannot recurse into the filesystem if the transaction is already
> > + * started */
> > + BUG_ON(!nofs && journal_current_handle());
> >
> > trace_ext4_write_begin(inode, pos, len, flags);
> > /*
> > @@ -906,9 +911,6 @@ retry:
> > ret = PTR_ERR(handle);
> > goto out;
> > }
> > -
> > - /* We cannot recurse into the filesystem as the transaction is already
> > - * started */
> > flags |= AOP_FLAG_NOFS;
> >
> > page = grab_cache_page_write_begin(mapping, index, flags);
> > @@ -957,7 +959,8 @@ retry:
> > }
> > }
> >
> > - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > + if (!nofs && ret == -ENOSPC &&
> > + ext4_should_retry_alloc(inode->i_sb, &retries))
> > goto retry;
> > out:
> > return ret;
> > @@ -2447,7 +2450,7 @@ out_writepages:
> > }
> >
> > #define FALL_BACK_TO_NONDELALLOC 1
> > -static int ext4_nonda_switch(struct super_block *sb)
> > +static int ext4_nonda_switch(struct super_block *sb, int writeback_allowed)
> > {
> > s64 free_blocks, dirty_blocks;
> > struct ext4_sb_info *sbi = EXT4_SB(sb);
> > @@ -2475,7 +2478,7 @@ static int ext4_nonda_switch(struct super_block *sb)
> > * Even if we don't switch but are nearing capacity,
> > * start pushing delalloc when 1/2 of free blocks are dirty.
> > */
> > - if (free_blocks < 2 * dirty_blocks)
> > + if (writeback_allowed && free_blocks < 2 * dirty_blocks)
> > writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> >
> > return 0;
> > @@ -2490,10 +2493,14 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> > pgoff_t index;
> > struct inode *inode = mapping->host;
> > handle_t *handle;
> > + int nofs = flags & AOP_FLAG_NOFS;
> >
> > index = pos >> PAGE_CACHE_SHIFT;
> > + /* We cannot recurse into the filesystem if the transaction is already
> > + * started */
> > + BUG_ON(!nofs && journal_current_handle());
> >
> > - if (ext4_nonda_switch(inode->i_sb)) {
> > + if (ext4_nonda_switch(inode->i_sb, !nofs)) {
> > *fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
> > return ext4_write_begin(file, mapping, pos,
> > len, flags, pagep, fsdata);
> > @@ -2512,8 +2519,6 @@ retry:
> > ret = PTR_ERR(handle);
> > goto out;
> > }
> > - /* We cannot recurse into the filesystem as the transaction is already
> > - * started */
> > flags |= AOP_FLAG_NOFS;
> >
> > page = grab_cache_page_write_begin(mapping, index, flags);
> > @@ -2538,7 +2543,8 @@ retry:
> > ext4_truncate_failed_write(inode);
> > }
> >
> > - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > + if (!nofs && ret == -ENOSPC &&
> > + ext4_should_retry_alloc(inode->i_sb, &retries))
> > goto retry;
> > out:
> > return ret;
> > @@ -4791,7 +4797,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > /* Delalloc case is easy... */
> > if (test_opt(inode->i_sb, DELALLOC) &&
> > !ext4_should_journal_data(inode) &&
> > - !ext4_nonda_switch(inode->i_sb)) {
> > + !ext4_nonda_switch(inode->i_sb, 1)) {
> > do {
> > ret = __block_page_mkwrite(vma, vmf,
> > ext4_da_get_block_prep);
> > --
> > 1.7.7.6
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] ext4: nonda_switch prevent deadlock
2012-08-29 13:28 ` [PATCH 1/3] ext4: nonda_switch prevent deadlock Jan Kara
2012-08-30 6:48 ` Dmitry Monakhov
@ 2012-08-30 11:12 ` Akira Fujita
1 sibling, 0 replies; 9+ messages in thread
From: Akira Fujita @ 2012-08-30 11:12 UTC (permalink / raw)
To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4, tytso
Hi,
(2012/08/29 22:28), Jan Kara wrote:
> On Tue 28-08-12 20:21:41, Dmitry Monakhov wrote:
>> Currently ext4_da_write_begin may deadlock if called with opened journal
>> transaction. Real life example:
>> ->move_extent_per_page()
>> ->ext4_journal_start()-> hold journal transaction
>> ->write_begin()
>> ->ext4_da_write_begin()
>> ->ext4_nonda_switch()
>> ->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
>>
>> This bug may be easily fixed by code reordering,
>> But in some cases it should be possible to call write_begin()
>> while holding journal's transaction, in this case caller may avoid
>> recoursion by passing AOP_FLAG_NOFS flag.
> Well, I find calling ext4_write_begin() with a transaction started a bug.
> Possibly ext4_write_begin() can be tweaked to handle that but things would
> be simpler if we didn't have to.
>
> Looking into move_extent_per_page(), calling ->write_begin() doesn't seem
> to be quite right there anyway. For example it results in filling holes
> under that page which is not desirable. I'm not even sure why do we call
> ->write_begin() there at all. The data in the page is unchanged. So it
> should be enough to just remap buffers and mark the page dirty (but I might
> be missing some subtlety here). Fujita-san, can you possibly explain?
Originally, calling write_begin/end in move_extent_per_page() was
to get a page and mark bh which exchanged by mext_replace_branches() as dirty.
But if there is a better way to do this, it makes sense to fix.
Regards,
Akira Fujita
> Honza
>
>> ---
>> fs/ext4/inode.c | 28 +++++++++++++++++-----------
>> 1 files changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 6324f74..d12d30e 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -889,6 +889,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>> struct page *page;
>> pgoff_t index;
>> unsigned from, to;
>> + int nofs = flags & AOP_FLAG_NOFS;
>> +
>> + /* We cannot recurse into the filesystem if the transaction is already
>> + * started */
>> + BUG_ON(!nofs && journal_current_handle());
>>
>> trace_ext4_write_begin(inode, pos, len, flags);
>> /*
>> @@ -906,9 +911,6 @@ retry:
>> ret = PTR_ERR(handle);
>> goto out;
>> }
>> -
>> - /* We cannot recurse into the filesystem as the transaction is already
>> - * started */
>> flags |= AOP_FLAG_NOFS;
>>
>> page = grab_cache_page_write_begin(mapping, index, flags);
>> @@ -957,7 +959,8 @@ retry:
>> }
>> }
>>
>> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>> + if (!nofs && ret == -ENOSPC &&
>> + ext4_should_retry_alloc(inode->i_sb, &retries))
>> goto retry;
>> out:
>> return ret;
>> @@ -2447,7 +2450,7 @@ out_writepages:
>> }
>>
>> #define FALL_BACK_TO_NONDELALLOC 1
>> -static int ext4_nonda_switch(struct super_block *sb)
>> +static int ext4_nonda_switch(struct super_block *sb, int writeback_allowed)
>> {
>> s64 free_blocks, dirty_blocks;
>> struct ext4_sb_info *sbi = EXT4_SB(sb);
>> @@ -2475,7 +2478,7 @@ static int ext4_nonda_switch(struct super_block *sb)
>> * Even if we don't switch but are nearing capacity,
>> * start pushing delalloc when 1/2 of free blocks are dirty.
>> */
>> - if (free_blocks < 2 * dirty_blocks)
>> + if (writeback_allowed && free_blocks < 2 * dirty_blocks)
>> writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
>>
>> return 0;
>> @@ -2490,10 +2493,14 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
>> pgoff_t index;
>> struct inode *inode = mapping->host;
>> handle_t *handle;
>> + int nofs = flags & AOP_FLAG_NOFS;
>>
>> index = pos >> PAGE_CACHE_SHIFT;
>> + /* We cannot recurse into the filesystem if the transaction is already
>> + * started */
>> + BUG_ON(!nofs && journal_current_handle());
>>
>> - if (ext4_nonda_switch(inode->i_sb)) {
>> + if (ext4_nonda_switch(inode->i_sb, !nofs)) {
>> *fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
>> return ext4_write_begin(file, mapping, pos,
>> len, flags, pagep, fsdata);
>> @@ -2512,8 +2519,6 @@ retry:
>> ret = PTR_ERR(handle);
>> goto out;
>> }
>> - /* We cannot recurse into the filesystem as the transaction is already
>> - * started */
>> flags |= AOP_FLAG_NOFS;
>>
>> page = grab_cache_page_write_begin(mapping, index, flags);
>> @@ -2538,7 +2543,8 @@ retry:
>> ext4_truncate_failed_write(inode);
>> }
>>
>> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>> + if (!nofs && ret == -ENOSPC &&
>> + ext4_should_retry_alloc(inode->i_sb, &retries))
>> goto retry;
>> out:
>> return ret;
>> @@ -4791,7 +4797,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>> /* Delalloc case is easy... */
>> if (test_opt(inode->i_sb, DELALLOC) &&
>> !ext4_should_journal_data(inode) &&
>> - !ext4_nonda_switch(inode->i_sb)) {
>> + !ext4_nonda_switch(inode->i_sb, 1)) {
>> do {
>> ret = __block_page_mkwrite(vma, vmf,
>> ext4_da_get_block_prep);
>> --
>> 1.7.7.6
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Akira Fujita <a-fujita@rs.jp.nec.com>
The First Fundamental Software Development Group,
Platform Division, NEC Software Tohoku, Ltd.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] ext4: nonda_switch prevent deadlock
2012-08-30 6:48 ` Dmitry Monakhov
@ 2012-08-30 12:55 ` Jan Kara
0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2012-08-30 12:55 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, tytso, a-fujita
On Thu 30-08-12 10:48:43, Dmitry Monakhov wrote:
> On Wed, 29 Aug 2012 15:28:16 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Tue 28-08-12 20:21:41, Dmitry Monakhov wrote:
> > > Currently ext4_da_write_begin may deadlock if called with opened journal
> > > transaction. Real life example:
> > > ->move_extent_per_page()
> > > ->ext4_journal_start()-> hold journal transaction
> > > ->write_begin()
> > > ->ext4_da_write_begin()
> > > ->ext4_nonda_switch()
> > > ->writeback_inodes_sb_if_idle() --> will wait for journal_stop()
> > >
> > > This bug may be easily fixed by code reordering,
> > > But in some cases it should be possible to call write_begin()
> > > while holding journal's transaction, in this case caller may avoid
> > > recoursion by passing AOP_FLAG_NOFS flag.
> > Well, I find calling ext4_write_begin() with a transaction started a bug.
> > Possibly ext4_write_begin() can be tweaked to handle that but things would
> > be simpler if we didn't have to.
> >
> > Looking into move_extent_per_page(), calling ->write_begin() doesn't seem
> > to be quite right there anyway. For example it results in filling holes
> > under that page which is not desirable. I'm not even sure why do we call
> > ->write_begin() there at all. The data in the page is unchanged. So it
> > should be enough to just remap buffers and mark the page dirty (but I might
> > be missing some subtlety here). Fujita-san, can you possibly explain?
> Yes, and No imho write_begin/write_end is not necessary here,
> but simple marking page as dirty is not sufficient either because
> we end up files becomes zeroed if power failure during defrag procedure,
> as it now happen in case of delalloc
> start_journal()
> remap_page()
> set_page_dirty()
> stop_journal()
> ...15secs <---- power failure
> write_back_page()
Right, thanks for noticing this. Luckily that's rather easy to fix - just
do ext4_jbd2_file_inode() to make journal thread flush all dirty pages from
the inode before the transaction is committed.
> but in our case situation is even worse because good/oldfile
> will become corrupted so we have to somehow pin dirty pages to
> transaction similar to ordered mode.
> Currently i'm work on that approach which is simple but looks
> reliable: remap_procedure should looks like follows(error paths missed):
>
> /* Double lock pages on two inodes, this is horrible but necessary */
> double_grab_page(struct inode *inode, struct inode *inode2,
> pgoff idx, struct page** pgvec1, page** pgvec2)
> {
> if (mapping1 > mapping2) {
> *pgvec1 = find_or_create_page(mapping1, idx, GFP_NOFS);
> *pgvec2 = find_or_create_page(mapping2, idx, GFP_NOFS);
> } else {
> *pgvec2 = find_or_create_page(mapping2, idx, GFP_NOFS);
> *pgvec1 = find_or_create_page(mapping1, idx, GFP_NOFS);
> }
> }
>
> move_extent () {
> page * o_vec[32];
> page * d_vec[32];
>
> new_bio:
> bio = bio_alloc()
> bio->bi_sector = donor_extent->sector+start
> ... /* other bio initialization */
> /* start journal before lock pages */
> journal_start();
> for (idx = 0; idx <= end; idx++) {
> double_grab_page(orig, donor, idx + start, o_vec + idx, d_vec + idx)
I don't think you can lock several pages like this - you have to use
trylock to avoid deadlock...
> make_page_uptodate_nounlock(o_vec + idx)
> try_to_release_page(o_vec+idx)
> try_to_release_page(d_vec+idx)
> /* try to add page to bio which point to donor extent */
> if (add_page_to_bio(bio, o_vec + idx))
> continue;
>
> /* There is no mo space left in bio, it is time to submit it
> * pages will be unlocked soon so we have to protest pages
> * from modification via writeback bit, writeback will be
> * cleared at bio_end_io
> */
> for_each_page(o_vec, o_vec + idx, SetPageWriteback)
> submit_bio(bio);
>
> /* After we submitted bio for io we can be shure that
> * it will be flushed do disk before commit journal record
> * because commit record request contain barrier,
> * so effectively this bio pinned to transaction
> */
This isn't true. We can run without barriers and it's perfectly fine for
battery backed storage. But for your code to work, you have to wait for
storage to complete the bio before you replace the branches.
> mext_replace_branches(orig, donor, start, start+idx)
>
> /* Ok we have complete remapping, it is now safe to unlock pages */
> for_each_page(o_vec, o_vec + idx, page_unlock)
> for_each_page(d_vec, d_vec + idx, page_unlock)
> start += idx;
> goto new_bio:
> }
> }
> This should works for any file types with data (with extra precautions for dir
> and symlinks)
In principle, something like this should work but I think letting journal
thread handle the flushing is easier...
Honza
> > > ---
> > > fs/ext4/inode.c | 28 +++++++++++++++++-----------
> > > 1 files changed, 17 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 6324f74..d12d30e 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -889,6 +889,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> > > struct page *page;
> > > pgoff_t index;
> > > unsigned from, to;
> > > + int nofs = flags & AOP_FLAG_NOFS;
> > > +
> > > + /* We cannot recurse into the filesystem if the transaction is already
> > > + * started */
> > > + BUG_ON(!nofs && journal_current_handle());
> > >
> > > trace_ext4_write_begin(inode, pos, len, flags);
> > > /*
> > > @@ -906,9 +911,6 @@ retry:
> > > ret = PTR_ERR(handle);
> > > goto out;
> > > }
> > > -
> > > - /* We cannot recurse into the filesystem as the transaction is already
> > > - * started */
> > > flags |= AOP_FLAG_NOFS;
> > >
> > > page = grab_cache_page_write_begin(mapping, index, flags);
> > > @@ -957,7 +959,8 @@ retry:
> > > }
> > > }
> > >
> > > - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > > + if (!nofs && ret == -ENOSPC &&
> > > + ext4_should_retry_alloc(inode->i_sb, &retries))
> > > goto retry;
> > > out:
> > > return ret;
> > > @@ -2447,7 +2450,7 @@ out_writepages:
> > > }
> > >
> > > #define FALL_BACK_TO_NONDELALLOC 1
> > > -static int ext4_nonda_switch(struct super_block *sb)
> > > +static int ext4_nonda_switch(struct super_block *sb, int writeback_allowed)
> > > {
> > > s64 free_blocks, dirty_blocks;
> > > struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > @@ -2475,7 +2478,7 @@ static int ext4_nonda_switch(struct super_block *sb)
> > > * Even if we don't switch but are nearing capacity,
> > > * start pushing delalloc when 1/2 of free blocks are dirty.
> > > */
> > > - if (free_blocks < 2 * dirty_blocks)
> > > + if (writeback_allowed && free_blocks < 2 * dirty_blocks)
> > > writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> > >
> > > return 0;
> > > @@ -2490,10 +2493,14 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> > > pgoff_t index;
> > > struct inode *inode = mapping->host;
> > > handle_t *handle;
> > > + int nofs = flags & AOP_FLAG_NOFS;
> > >
> > > index = pos >> PAGE_CACHE_SHIFT;
> > > + /* We cannot recurse into the filesystem if the transaction is already
> > > + * started */
> > > + BUG_ON(!nofs && journal_current_handle());
> > >
> > > - if (ext4_nonda_switch(inode->i_sb)) {
> > > + if (ext4_nonda_switch(inode->i_sb, !nofs)) {
> > > *fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
> > > return ext4_write_begin(file, mapping, pos,
> > > len, flags, pagep, fsdata);
> > > @@ -2512,8 +2519,6 @@ retry:
> > > ret = PTR_ERR(handle);
> > > goto out;
> > > }
> > > - /* We cannot recurse into the filesystem as the transaction is already
> > > - * started */
> > > flags |= AOP_FLAG_NOFS;
> > >
> > > page = grab_cache_page_write_begin(mapping, index, flags);
> > > @@ -2538,7 +2543,8 @@ retry:
> > > ext4_truncate_failed_write(inode);
> > > }
> > >
> > > - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > > + if (!nofs && ret == -ENOSPC &&
> > > + ext4_should_retry_alloc(inode->i_sb, &retries))
> > > goto retry;
> > > out:
> > > return ret;
> > > @@ -4791,7 +4797,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > /* Delalloc case is easy... */
> > > if (test_opt(inode->i_sb, DELALLOC) &&
> > > !ext4_should_journal_data(inode) &&
> > > - !ext4_nonda_switch(inode->i_sb)) {
> > > + !ext4_nonda_switch(inode->i_sb, 1)) {
> > > do {
> > > ret = __block_page_mkwrite(vma, vmf,
> > > ext4_da_get_block_prep);
> > > --
> > > 1.7.7.6
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-08-30 12:55 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-08-28 16:21 [PATCH 1/3] ext4: nonda_switch prevent deadlock Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 2/3] ext4: basic bug shield for move_extent_per_page Dmitry Monakhov
2012-08-28 16:29 ` Dmitry Monakhov
2012-08-28 16:21 ` [PATCH 3/3] ext4: reimplement uninit extent optimization " Dmitry Monakhov
2012-08-29 8:24 ` Akira Fujita
2012-08-29 13:28 ` [PATCH 1/3] ext4: nonda_switch prevent deadlock Jan Kara
2012-08-30 6:48 ` Dmitry Monakhov
2012-08-30 12:55 ` Jan Kara
2012-08-30 11:12 ` Akira Fujita
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).