From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Josef Bacik <jbacik@fb.com>, <linux-btrfs@vger.kernel.org>
Cc: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Subject: Re: [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space
Date: Tue, 26 Apr 2016 08:50:53 +0800 [thread overview]
Message-ID: <f596b5ff-a1db-e621-867f-093f098a04b3@cn.fujitsu.com> (raw)
In-Reply-To: <55a7c2e4-58a7-c36f-1604-bdfc9ee2f330@fb.com>
Josef Bacik wrote on 2016/04/25 10:05 -0400:
> On 04/24/2016 08:54 PM, Qu Wenruo wrote:
>>
>>
>> Josef Bacik wrote on 2016/04/22 14:06 -0400:
>>> On 03/21/2016 09:35 PM, Qu Wenruo wrote:
>>>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>>>
>>>> In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we
>>>> try
>>>> to reserve is calculated by the difference between outstanding_extents
>>>> and
>>>> reserved_extents.
>>>>
>>>> When reserve_metadata_bytes() fails to reserve desited metadata space,
>>>> it has already done some reclaim work, such as write ordered extents.
>>>>
>>>> In that case, outstanding_extents and reserved_extents may already
>>>> changed, and we may reserve enough metadata space then.
>>>>
>>>> So this patch will try to call reserve_metadata_bytes() at most 3 times
>>>> to ensure we really run out of space.
>>>>
>>>> Such false ENOSPC is mainly caused by small file extents and time
>>>> consuming delalloc functions, which mainly affects in-band
>>>> de-duplication. (Compress should also be affected, but LZO/zlib is
>>>> faster than SHA256, so still harder to trigger than dedupe).
>>>>
>>>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>>> ---
>>>> fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
>>>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index dabd721..016d2ec 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct
>>>> btrfs_trans_handle *trans,
>>>> * a new extent is revered, then deleted
>>>> * in one tran, and inc/dec get merged to 0.
>>>> *
>>>> - * In this case, we need to remove its dedup
>>>> + * In this case, we need to remove its dedupe
>>>> * hash.
>>>> */
>>>> btrfs_dedupe_del(trans, fs_info, node->bytenr);
>>>> @@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode
>>>> *inode, u64 num_bytes)
>>>> bool delalloc_lock = true;
>>>> u64 to_free = 0;
>>>> unsigned dropped;
>>>> + int loops = 0;
>>>>
>>>> /* If we are a free space inode we need to not flush since we
>>>> will be in
>>>> * the middle of a transaction commit. We also don't need the
>>>> delalloc
>>>> @@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct
>>>> inode *inode, u64 num_bytes)
>>>> btrfs_transaction_in_commit(root->fs_info))
>>>> schedule_timeout(1);
>>>>
>>>> + num_bytes = ALIGN(num_bytes, root->sectorsize);
>>>> +
>>>> +again:
>>>> if (delalloc_lock)
>>>> mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
>>>>
>>>> - num_bytes = ALIGN(num_bytes, root->sectorsize);
>>>> -
>>>> spin_lock(&BTRFS_I(inode)->lock);
>>>> nr_extents = (unsigned)div64_u64(num_bytes +
>>>> BTRFS_MAX_EXTENT_SIZE - 1,
>>>> @@ -5815,6 +5817,23 @@ out_fail:
>>>> }
>>>> if (delalloc_lock)
>>>> mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
>>>> + /*
>>>> + * The number of metadata bytes is calculated by the difference
>>>> + * between outstanding_extents and reserved_extents. Sometimes
>>>> though
>>>> + * reserve_metadata_bytes() fails to reserve the wanted metadata
>>>> bytes,
>>>> + * indeed it has already done some work to reclaim metadata
>>>> space, hence
>>>> + * both outstanding_extents and reserved_extents would have
>>>> changed and
>>>> + * the bytes we try to reserve would also has changed(may be
>>>> smaller).
>>>> + * So here we try to reserve again. This is much useful for online
>>>> + * dedupe, which will easily eat almost all meta space.
>>>> + *
>>>> + * XXX: Indeed here 3 is arbitrarily choosed, it's a good
>>>> workaround for
>>>> + * online dedupe, later we should find a better method to avoid
>>>> dedupe
>>>> + * enospc issue.
>>>> + */
>>>> + if (unlikely(ret == -ENOSPC && loops++ < 3))
>>>> + goto again;
>>>> +
>>>> return ret;
>>>> }
>>>>
>>>>
>>>
>>> NAK, we aren't going to just arbitrarily retry to make our metadata
>>> reservation. Dropping reserved metadata space by completing ordered
>>> extents should free enough to make our current reservation, and in fact
>>> this only accounts for the disparity, so should be an accurate count
>>> most of the time. I can see a case for detecting that the disparity no
>>> longer exists and retrying in that case (we free enough ordered extents
>>> that we are no longer trying to reserve ours + overflow but now only
>>> ours) and retry in _that specific case_, but we need to limit it to this
>>> case only. Thanks,
>>
>> Would it be OK to retry only for dedupe enabled case?
>>
>> Currently it's only a workaround and we are still digging the root
>> cause, but for a workaround, I assume it is good enough though for
>> dedupe enabled case.
>>
>
> No we're not going to leave things in a known broken state to come back
> to later, that just makes it so we forget stuff and it sits there
> forever. Thanks,
>
> Josef
OK, We'll investigate it and find the best fix.
BTW, we also found extent-tree.c is using the same 3 loops code:
(and that's why we choose the same method)
------
loops = 0;
while (delalloc_bytes && loops < 3) {
max_reclaim = min(delalloc_bytes, to_reclaim);
nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;
btrfs_writeback_inodes_sb_nr(root, nr_pages, items);
/*
* We need to wait for the async pages to actually
start before
* we do anything.
*/
max_reclaim =
atomic_read(&root->fs_info->async_delalloc_pages);
if (!max_reclaim)
goto skip_async;
------
Any idea why it's still there?
Thanks,
Qu
next prev parent reply other threads:[~2016-04-26 0:51 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-03-22 1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 01/27] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 02/27] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 03/27] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 04/27] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 05/27] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 06/27] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 07/27] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 08/27] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 09/27] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
2016-03-24 20:58 ` Chris Mason
2016-03-25 1:59 ` Qu Wenruo
2016-03-25 15:11 ` Chris Mason
2016-03-26 13:11 ` Qu Wenruo
2016-03-28 14:09 ` Chris Mason
2016-03-29 1:47 ` Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
2016-03-29 17:31 ` Alex Lyakas
2016-03-30 0:26 ` Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 12/27] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 13/27] btrfs: dedupe: Add support to delete hash for on-disk backend Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 14/27] btrfs: dedupe: Add support for adding " Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
2016-03-22 2:29 ` kbuild test robot
2016-03-22 2:48 ` kbuild test robot
2016-03-22 1:35 ` [PATCH v8 16/27] btrfs: dedupe: add an inode nodedupe flag Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 17/27] btrfs: dedupe: add a property handler for online dedupe Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 18/27] btrfs: dedupe: add per-file online dedupe control Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space Qu Wenruo
2016-04-22 18:06 ` Josef Bacik
2016-04-25 0:54 ` Qu Wenruo
2016-04-25 14:05 ` Josef Bacik
2016-04-26 0:50 ` Qu Wenruo [this message]
2016-03-22 1:35 ` [PATCH v8 20/27] btrfs: dedupe: Fix a bug when running inband dedupe with balance Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 21/27] btrfs: Fix a memory leak in inband dedupe hash Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 22/27] btrfs: dedupe: Fix metadata balance error when dedupe is enabled Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 23/27] btrfs: dedupe: Avoid submit IO for hash hit extent Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 24/27] btrfs: dedupe: Preparation for compress-dedupe co-work Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue Qu Wenruo
2016-03-24 20:35 ` Chris Mason
2016-03-25 1:44 ` Qu Wenruo
2016-03-25 15:12 ` Chris Mason
2016-03-22 1:35 ` [PATCH v8 26/27] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
2016-03-22 1:35 ` [PATCH v8 27/27] btrfs: dedupe: Fix a space cache delalloc bytes underflow bug Qu Wenruo
2016-03-22 13:38 ` [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework David Sterba
2016-03-23 2:25 ` Qu Wenruo
2016-03-24 13:42 ` David Sterba
2016-03-25 1:38 ` Qu Wenruo
2016-04-04 16:55 ` David Sterba
2016-04-05 3:08 ` Qu Wenruo
2016-04-20 2:02 ` Qu Wenruo
2016-04-20 19:14 ` Chris Mason
2016-04-06 3:47 ` Nicholas D Steeves
2016-04-06 5:22 ` Qu Wenruo
2016-04-22 22:14 ` Nicholas D Steeves
2016-04-25 1:25 ` Qu Wenruo
2016-03-29 17:22 ` Alex Lyakas
2016-03-30 0:34 ` Qu Wenruo
2016-03-30 10:36 ` Alex Lyakas
2016-04-03 8:22 ` Alex Lyakas
2016-04-05 3:51 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f596b5ff-a1db-e621-867f-093f098a04b3@cn.fujitsu.com \
--to=quwenruo@cn.fujitsu.com \
--cc=jbacik@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=wangxg.fnst@cn.fujitsu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).