From: Nikolay Borisov <nborisov@suse.com>
To: Qu Wenruo <wqu@suse.de>, David Sterba <DSterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents
Date: Fri, 26 Oct 2018 15:02:24 +0300 [thread overview]
Message-ID: <10430924-b70e-4882-15fc-61c0376d83a4@suse.com> (raw)
In-Reply-To: <abaee0e6-c865-3402-3352-33dc6a1d4ed7@suse.de>
On 26.10.2018 14:53, Qu Wenruo wrote:
>
>
> On 2018/10/26 下午7:41, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>> btrfs D 0 5760 5324 0x00000000
>> Call Trace:
>> ? __schedule+0x243/0x800
>> schedule+0x33/0x90
>> btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>> ? wait_woken+0xa0/0xa0
>> btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>> btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>> btrfs_relocate_chunk+0x49/0x100 [btrfs]
>> btrfs_balance+0xbeb/0x1740 [btrfs]
>> btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>> btrfs_ioctl+0x1691/0x3110 [btrfs]
>> ? lockdep_hardirqs_on+0xed/0x180
>> ? __handle_mm_fault+0x8e7/0xfb0
>> ? _raw_spin_unlock+0x24/0x30
>> ? __handle_mm_fault+0x8e7/0xfb0
>> ? do_vfs_ioctl+0xa5/0x6e0
>> ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>> do_vfs_ioctl+0xa5/0x6e0
>> ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>> ksys_ioctl+0x3a/0x70
>> __x64_sys_ioctl+0x16/0x20
>> do_syscall_64+0x60/0x1b0
>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back.
>
> Just a nitpick, would you please split long paragraphs with newlines?
>
> It makes reviewers' eyes less painful.
>
>> This happens since
>> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
>> range when asked and doens't really consider the range of the passed
>> page.
>
>
>> When such a foregin range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect then the clean up code in
>> btrfs_cleanup_ordered_extents will be called. This function has the
>> wrong assumption that caller of run_delalloc_range always properly
>> cleans the first page of the range hence when it calls
>> __endio_write_update_ordered it explicitly ommits the first page of
>> the delalloc range.
>
> Yes, that's the old assumption, at least well explained by some ascii
> art. (even it's wrong)
>
>> This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
>
> The cause sounds valid, however would you please explain more about how
> such cleaning on unrelated delalloc range happens?
So in my case the following happened - 2 block groups were created as
delalloc ranges in the - 0-1m and 1m-128m. Their respective pages were
dirtied, so when page 0 - 0-4k when into writepage_delalloc,
find_lock_delalloc_range would return the range 0-1m. So the call to
fill_delalloc instantiates OE 0-1m and writeback continues as expected.
Now, when the 2nd page from range 0-1m is again set for writeback and
find_lock_delalloc_range is called with delalloc_start == 4096 it will
actually return the range 1m-128m. Then fill_delalloc is called with
locked_page belonging to the range which was already instantiated and
the start/end arguments pointing to 1m-128m if an error occurred in
run_delalloc_range in this case then btrfs_cleanup_ordered_extents will
be called which will clear Private2 bit for pages belonging to 1m-128m
range and the OE will be cleared of all but the first page (since the
code wrongly assumed locked_page always belongs to the range currently
being instantiated).
>
> Even the fix looks solid, it's still better to explain the cause a
> little more, as the more we understand the cause, the better solution we
> may get.
>
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
>
> And the solution also looks good to me, and much more robust, without
> any (possibly wrong) assumption.
>
> Thanks,
> Qu
>
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
>> ---
>>
>> V2:
>> * Fix compilation failure due to missing parentheses
>> * Fixed the "Fixes" tag.
>>
>> fs/btrfs/inode.c | 29 ++++++++++++++++++++---------
>> 1 file changed, 20 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index e1f00d8c24ce..5906564ae2e9 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -109,17 +109,17 @@ static void __endio_write_update_ordered(struct inode *inode,
>> * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>> * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
>> * to be released, which we want to happen only when finishing the ordered
>> - * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>> - * fill_delalloc() callback already does proper cleanup for the first page of
>> - * the range, that is, it invokes the callback writepage_end_io_hook() for the
>> - * range of the first page.
>> + * extent (btrfs_finish_ordered_io()).
>> */
>> static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> - const u64 offset,
>> - const u64 bytes)
>> + struct page *locked_page,
>> + u64 offset, u64 bytes)
>> {
>> unsigned long index = offset >> PAGE_SHIFT;
>> unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
>> + u64 page_start = page_offset(locked_page);
>> + u64 page_end = page_start + PAGE_SIZE - 1;
>> +
>> struct page *page;
>>
>> while (index <= end_index) {
>> @@ -130,8 +130,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> ClearPagePrivate2(page);
>> put_page(page);
>> }
>> - return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>> - bytes - PAGE_SIZE, false);
>> +
>> + /*
>> + * In case this page belongs to the delalloc range being instantiated
>> + * then skip it, since the first page of a range is going to be
>> + * properly cleaned up by the caller of run_delalloc_range
>> + */
>> + if (page_start >= offset && page_end <= (offset + bytes - 1)) {
>> + offset += PAGE_SIZE;
>> + bytes -= PAGE_SIZE;
>> + }
>> +
>> + return __endio_write_update_ordered(inode, offset, bytes, false);
>> }
>>
>> static int btrfs_dirty_inode(struct inode *inode);
>> @@ -1606,7 +1616,8 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
>> write_flags);
>> }
>> if (ret)
>> - btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>> + btrfs_cleanup_ordered_extents(inode, locked_page, start,
>> + end - start + 1);
>> return ret;
>> }
>>
>>
>
next prev parent reply other threads:[~2018-10-26 12:02 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-10-26 11:41 [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents Nikolay Borisov
2018-10-26 11:53 ` Qu Wenruo
2018-10-26 12:02 ` Nikolay Borisov [this message]
2018-10-29 5:53 ` Qu Wenruo
2018-10-29 7:51 ` Nikolay Borisov
2018-10-29 12:21 ` Nikolay Borisov
2018-10-29 14:39 ` Nikolay Borisov
2018-11-20 19:00 ` Josef Bacik
2018-11-20 21:09 ` Nikolay Borisov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=10430924-b70e-4882-15fc-61c0376d83a4@suse.com \
--to=nborisov@suse.com \
--cc=DSterba@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).