From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([59.151.112.132]:63623 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1753869AbdBVBbv (ORCPT ); Tue, 21 Feb 2017 20:31:51 -0500 Subject: Re: [PATCH v2] btrfs: Handle delalloc error correctly to avoid deadlock To: References: <20170221080659.23299-1-quwenruo@cn.fujitsu.com> <20170222011409.GE32725@lim.localdomain> CC: , From: Qu Wenruo Message-ID: <5cafeced-c62a-af10-8042-bc73d3c27acf@cn.fujitsu.com> Date: Wed, 22 Feb 2017 09:31:33 +0800 MIME-Version: 1.0 In-Reply-To: <20170222011409.GE32725@lim.localdomain> Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: At 02/22/2017 09:14 AM, Liu Bo wrote: > On Tue, Feb 21, 2017 at 04:06:59PM +0800, Qu Wenruo wrote: >> If run btrfs/125 with nospace_cache or space_cache=v2 mount option, >> btrfs will block with the following backtrace: >> >> Call Trace: >> __schedule+0x2d4/0xae0 >> schedule+0x3d/0x90 >> btrfs_start_ordered_extent+0x160/0x200 [btrfs] >> ? wake_atomic_t_function+0x60/0x60 >> btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] >> btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] >> btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] >> process_one_work+0x2af/0x720 >> ? process_one_work+0x22b/0x720 >> worker_thread+0x4b/0x4f0 >> kthread+0x10f/0x150 >> ? process_one_work+0x720/0x720 >> ? kthread_create_on_node+0x40/0x40 >> ret_from_fork+0x2e/0x40 >> >> The direct cause is the error handler in run_delalloc_nocow() doesn't >> handle error from btrfs_reloc_clone_csums() well. >> >> The error handler of run_delalloc_nocow() will clear dirty and finish IO >> for the pages in that extent. >> However we have already inserted one ordered extent. >> And that ordered extent is relying on endio hooks to wait all its pages >> to finish, while only the first page will finish. >> >> This makes that ordered extent never finish, so blocking the file >> system. >> >> Although the root cause is still in RAID5/6, it won't hurt to fix the >> error routine first. >> >> This patch will slightly modify one existing function, >> btrfs_endio_direct_write_update_ordered() to handle free space inode, >> just like what btrfs_writepage_end_io_hook() does. >> >> And use it as base to implement one inline function, >> btrfs_cleanup_ordered_extents() to handle the error in >> run_delalloc_nocow() and cow_file_range(). >> >> For compression, it's calling writepage_end_io_hook() itself to handle >> its error, and any submitted ordered extent will have its bio submitted, >> so no need to worry about compression part. >> >> Suggested-by: Filipe Manana >> Signed-off-by: Qu Wenruo >> --- >> v2: >> Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing >> outstanding extents, which is already done by >> extent_clear_unlock_delalloc() >> --- >> fs/btrfs/inode.c | 75 +++++++++++++++++++++++++++++++++++++++++-------- >> fs/btrfs/ordered-data.h | 2 ++ >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 1e861a063721..a0b09ff73eae 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -116,6 +116,41 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start, >> >> static int btrfs_dirty_inode(struct inode *inode); >> >> +static void __endio_write_update_ordered(struct inode *inode, >> + const u64 offset, >> + const u64 bytes, >> + const int uptodate, >> + const int skip_meta); >> +static inline void btrfs_endio_direct_write_update_ordered(struct inode *inode, >> + const u64 offset, >> + const u64 bytes, >> + const int uptodate) >> +{ >> + return __endio_write_update_ordered(inode, offset, bytes, uptodate, 0); >> +} >> + >> +/* >> + * Set error bit and cleanup all ordered extents in specified range of @inode. >> + * >> + * This is for error case where ordered extent(s) is submitted but >> + * corresponding bio is not submitted. >> + * This can make waiter on such ordered extent never finish, as there is no >> + * endio hook called to finish such ordered extent. >> + */ >> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, >> + const u64 offset, >> + const u64 bytes) >> +{ >> + /* >> + * In error handler, we have extent_clear_unlock_delalloc() called >> + * to reduce our metadata space reservation and outstanding extents. >> + * >> + * So here, we don't need finish_ordered_io() to free metadata space >> + * for us, or we will underflow outstanding extents. >> + */ >> + return __endio_write_update_ordered(inode, offset, bytes, 0, 1); >> +} >> + >> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS >> void btrfs_test_inode_set_ops(struct inode *inode) >> { >> @@ -237,7 +272,6 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, >> return err; >> } >> >> - >> /* >> * conditionally insert an inline extent into the file. This >> * does the checks required to make sure the data is small enough >> @@ -1096,6 +1130,7 @@ static noinline int cow_file_range(struct inode *inode, >> EXTENT_DELALLOC | EXTENT_DEFRAG, >> PAGE_UNLOCK | PAGE_CLEAR_DIRTY | >> PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK); >> + btrfs_cleanup_ordered_extents(inode, start, end - start + 1); > > Note that @start is rolling forward in the 'while' loop, using start here won't > cleanup previous added ordered extent. Thanks for pointing this out, cow_file_range() differs from run_delalloc_nocow() and does modify start. This needs to be fixed. > > Also, the above extent_clear_unlock_extent() doesn't cover previous added range, > so it won't release metadata, either. Right. Due to the newly introduced BTRFS_ORDERED_SKIP_META, btrfs_finish_ordered_io() won't release metadata space in that case, so extent_clear_unlock_extent() must cover the full range. Thanks, Qu > > Thanks, > > -liubo >> goto out; >> } >> >> @@ -1538,7 +1573,7 @@ static noinline int run_delalloc_nocow(struct inode *inode, >> if (!ret) >> ret = err; >> >> - if (ret && cur_offset < end) >> + if (ret && cur_offset < end) { >> extent_clear_unlock_delalloc(inode, cur_offset, end, end, >> locked_page, EXTENT_LOCKED | >> EXTENT_DELALLOC | EXTENT_DEFRAG | >> @@ -1546,6 +1581,8 @@ static noinline int run_delalloc_nocow(struct inode *inode, >> PAGE_CLEAR_DIRTY | >> PAGE_SET_WRITEBACK | >> PAGE_END_WRITEBACK); >> + btrfs_cleanup_ordered_extents(inode, start, end - start + 1); >> + } >> btrfs_free_path(path); >> return ret; >> } >> @@ -2889,6 +2926,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) >> >> nolock = btrfs_is_free_space_inode(inode); >> >> + /* Ordered extent with SKIP_META implies IOERR */ >> + if (test_bit(BTRFS_ORDERED_SKIP_META, &ordered_extent->flags)) >> + ASSERT(test_bit(BTRFS_ORDERED_IOERR, &ordered_extent->flags)); >> + >> if (test_bit(BTRFS_ORDERED_IOERR, &ordered_extent->flags)) { >> ret = -EIO; >> goto out; >> @@ -3008,7 +3049,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) >> ordered_extent->file_offset + >> ordered_extent->len - 1, &cached_state, GFP_NOFS); >> out: >> - if (root != fs_info->tree_root) >> + if (root != fs_info->tree_root && >> + !test_bit(BTRFS_ORDERED_SKIP_META, &ordered_extent->flags)) >> btrfs_delalloc_release_metadata(inode, ordered_extent->len); >> if (trans) >> btrfs_end_transaction(trans); >> @@ -8185,17 +8227,28 @@ static void btrfs_endio_direct_read(struct bio *bio) >> bio_put(bio); >> } >> >> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode, >> - const u64 offset, >> - const u64 bytes, >> - const int uptodate) >> +static void __endio_write_update_ordered(struct inode *inode, >> + const u64 offset, >> + const u64 bytes, >> + const int uptodate, >> + const int skip_meta) >> { >> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); >> struct btrfs_ordered_extent *ordered = NULL; >> + struct btrfs_workqueue *wq; >> + btrfs_work_func_t func; >> u64 ordered_offset = offset; >> u64 ordered_bytes = bytes; >> int ret; >> >> + if (btrfs_is_free_space_inode(inode)) { >> + wq = fs_info->endio_freespace_worker; >> + func = btrfs_freespace_write_helper; >> + } else { >> + wq = fs_info->endio_write_workers; >> + func = btrfs_endio_write_helper; >> + } >> + >> again: >> ret = btrfs_dec_test_first_ordered_pending(inode, &ordered, >> &ordered_offset, >> @@ -8203,10 +8256,10 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode, >> uptodate); >> if (!ret) >> goto out_test; >> - >> - btrfs_init_work(&ordered->work, btrfs_endio_write_helper, >> - finish_ordered_fn, NULL, NULL); >> - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work); >> + if (skip_meta) >> + set_bit(BTRFS_ORDERED_SKIP_META, &ordered->flags); >> + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL); >> + btrfs_queue_work(wq, &ordered->work); >> out_test: >> /* >> * our bio might span multiple ordered extents. If we haven't >> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h >> index 5f2b0ca28705..1434372c6352 100644 >> --- a/fs/btrfs/ordered-data.h >> +++ b/fs/btrfs/ordered-data.h >> @@ -75,6 +75,8 @@ struct btrfs_ordered_sum { >> * in the logging code. */ >> #define BTRFS_ORDERED_PENDING 11 /* We are waiting for this ordered extent to >> * complete in the current transaction. */ >> +#define BTRFS_ORDERED_SKIP_META 12 /* No need to free metadata */ >> + >> struct btrfs_ordered_extent { >> /* logical offset in the file */ >> u64 file_offset; >> -- >> 2.11.1 >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >