From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <bo.li.liu@oracle.com>
Cc: <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
Date: Thu, 18 Sep 2014 15:58:26 +0800 [thread overview]
Message-ID: <541A90A2.5050407@cn.fujitsu.com> (raw)
In-Reply-To: <20140918073307.GB15092@localhost.localdomain>
-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to
insert best fitted extent map
From: Liu Bo <bo.li.liu@oracle.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2014年09月18日 15:33
> On Thu, Sep 18, 2014 at 01:36:47PM +0800, Qu Wenruo wrote:
>> -------- Original Message --------
>> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping()
>> to insert best fitted extent map
>> From: Liu Bo <bo.li.liu@oracle.com>
>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> Date: 2014年09月18日 12:21
>>> On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote:
>>>> The following commit enhanced the merge_extent_mapping() to reduce
>>>> fragment in extent map tree, but it can't handle case which existing
>>>> lies before map_start:
>>>> 51f39 btrfs: Use right extent length when inserting overlap extent map.
>>>>
>>>> [BUG]
>>>> When existing extent map's start is before map_start,
>>>> the em->len will be minus, which will corrupt the extent map and fail to
>>>> insert the new extent map.
>>>> This will happen when someone get a large extent map, but when it is
>>>> going to insert it into extent map tree, some one has already commit
>>>> some write and split the huge extent into small parts.
>>>>
>>>> [REPRODUCER]
>>>> It is very easy to tiger using filebench with randomrw personality.
>>>> It is about 100% to reproduce when using 8G preallocated file in 60s
>>>> randonrw test.
>>>>
>>>> [FIX]
>>>> This patch can now handle any existing extent position.
>>>> Since it does not directly use existing->start, now it will find the
>>>> previous and next extent around map_start.
>>>> So the old existing->start < map_start bug will never happen again.
>>>>
>>>> [ENHANCE]
>>>> This patch will insert the best fitted extent map into extent map tree,
>>>> other than the oldest [map_start, map_start + sectorsize) or the
>>>> relatively newer but not perfect [map_start, existing->start).
>>>>
>>>> The patch will first search existing extent that does not intersects with
>>>> the desired map range [map_start, map_start + len).
>>>> The existing extent will be either before or behind map_start, and based
>>>> on the existing extent, we can find out the previous and next extent
>>>> around map_start.
>>>>
>>>> So the best fitted extent would be [prev->end, next->start).
>>>> For prev or next is not found, em->start would be prev->end and em->end
>>>> wold be next->start.
>>>>
>>>> With this patch, the fragment in extent map tree should be reduced much
>>>> more than the 51f39 commit and reduce an unneeded extent map tree search.
>>>>
>>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> ---
>>>> fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++----------------
>>>> 1 file changed, 57 insertions(+), 22 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>> index 016c403..8039021 100644
>>>> --- a/fs/btrfs/inode.c
>>>> +++ b/fs/btrfs/inode.c
>>>> @@ -6191,21 +6191,60 @@ out_fail_inode:
>>>> goto out_fail;
>>>> }
>>>> +/* Find next extent map of a given extent map, caller needs to ensure locks */
>>>> +static struct extent_map *next_extent_map(struct extent_map *em)
>>>> +{
>>>> + struct rb_node *next;
>>>> +
>>>> + next = rb_next(&em->rb_node);
>>>> + if (!next)
>>>> + return NULL;
>>>> + return container_of(next, struct extent_map, rb_node);
>>>> +}
>>>> +
>>>> +static struct extent_map *prev_extent_map(struct extent_map *em)
>>>> +{
>>>> + struct rb_node *prev;
>>>> +
>>>> + prev = rb_prev(&em->rb_node);
>>>> + if (!prev)
>>>> + return NULL;
>>>> + return container_of(prev, struct extent_map, rb_node);
>>>> +}
>>>> +
>>>> /* helper for btfs_get_extent. Given an existing extent in the tree,
>>>> + * the existing extent is the nearest extent to map_start,
>>>> * and an extent that you want to insert, deal with overlap and insert
>>>> - * the new extent into the tree.
>>>> + * the best fitted new extent into the tree.
>>>> */
>>>> static int merge_extent_mapping(struct extent_map_tree *em_tree,
>>>> struct extent_map *existing,
>>>> struct extent_map *em,
>>>> u64 map_start)
>>>> {
>>>> + struct extent_map *prev;
>>>> + struct extent_map *next;
>>>> + u64 start;
>>>> + u64 end;
>>>> u64 start_diff;
>>>> BUG_ON(map_start < em->start || map_start >= extent_map_end(em));
>>>> - start_diff = map_start - em->start;
>>>> - em->start = map_start;
>>>> - em->len = existing->start - em->start;
>>>> +
>>>> + if (existing->start > map_start) {
>>>> + next = existing;
>>>> + prev = prev_extent_map(next);
>>>> + } else {
>>>> + prev = existing;
>>>> + next = next_extent_map(prev);
>>>> + }
>>>> +
>>>> + start = prev ? extent_map_end(prev) : em->start;
>>>> + start = max_t(u64, start, em->start);
>>>> + end = next ? next->start : extent_map_end(em);
>>>> + end = min_t(u64, end, extent_map_end(em));
>>>> + start_diff = start - em->start;
>>>> + em->start = start;
>>>> + em->len = end - start;
>>>> if (em->block_start < EXTENT_MAP_LAST_BYTE &&
>>>> !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
>>>> em->block_start += start_diff;
>>>> @@ -6482,25 +6521,21 @@ insert:
>>>> ret = 0;
>>>> - existing = lookup_extent_mapping(em_tree, start, len);
>>>> - if (existing && (existing->start > start ||
>>>> - existing->start + existing->len <= start)) {
>>>> + existing = search_extent_mapping(em_tree, start, len);
>>>> + /*
>>>> + * existing will always be non-NULL, since there must be
>>>> + * extent causing the -EEXIST.
>>>> + */
>>>> + if (start >= extent_map_end(existing) ||
>>>> + start + len <= existing->start) {
>>> This will introduce something wrong, the 'else' part is 'em = existing;',
>>> and the condition is actually
>>> (start < extent_map_end(existing) && start + len > existing->start),
>>> which means the existing overlaps with [start, start+len).
>> Nope, the else part is doing the right thing.
>>
>> Before the patch, going to the 'em = existing;' routine's condition
>> is like the following:
>> 1) existing returned by lookup_extent_mapping is not NULL
>> 2) (existing->start > start || existing->start + existing->len
>> <=start) is not met
>>
>> 1) implies the following condition: (in extent_map.c,
>> __lookup_extent_mapping())
>> !!(end > existing->start && start < extent_map_end(existing)), which
>> is equal to the following:
>> start + len > existing->start(1) && start < extent_map_end(existing) (2)
>>
>> 2) is actually the following
>> start >= existing->start (3) && start < extent_map_end(existing) (4)
>>
>> And the hidden condition len > 0(5)
>> combining 1) and 2), you will find the real condition to go to 'em =
>> existing' routine is what the patch does.
>> Due to (5), (1) and (3) is the same condition, and (2) (4) is the same too.
>> So the patch is OK. 'em = existing' condition is not broken.
> Okay, but think it twice, they're not same,
>
> original: (start >= existing->start && start < extent_map_end(existing))
> this patch: (start < extent_map_end(existing) && start + len > existing->start)
>
> (start + len > existing->start) doesn't equal to start >= existing->start,
> here is a case of (start+len > existing->start) but (start <= existing->start).
>
> |--------| -->(existing)
> |--------| -->[start, start+len)
>
> And calling search_extent_mapping() doesn't make sure that
> (start >= existing->start) is true, either.
All right, that case is right and I'm wrong.
I'll change the if() to use start >= existing->start.
BTW, Any other problem?
Thanks,
Qu
>
>>> And one of overlapping cases is (existing->start > start), ie. em->start > start, this is
>>> against our rule of btrfs_get_extent,
>> Nope again, this overlapping in fact is quite normal in multithread
>> random read/write.
>> The files's [0~16) is a preallocated one,
>> Thread A:
>> write [4K, 8K) into the file, but not committed yet.
>> extent map tree contains [0,16K) only
>> Thread B:
>> btrfs_get_extent()
>> the map_start is 8K, len is 4K as an example
>> grab a large em, take [0,16K), since [4K,8K) write is not committed.
>> comes to insert: btrfs_release_path(path);
>>
>> Thread A:
>> [4K, 8K) is not committed
>> the extent map is now [0, 4K) [4K, 8K) [8K, 16K).
>>
>> Thread B:
>> goes to insert: add_extent_mapping()
>> the [0,16K) is overlapping, and the returned existing one is [8K, 16K).
>> which contains the [map_start, map_start + len).
> So this's an example of existing->start == start (both are 8K), not
> existing->start > start.
>
> See __extent_writepage_io(),
>
> {
> ...
> em = epd->get_extent(inode, page, pg_offset, cur,
> end - cur + 1, 1);
> if (IS_ERR_OR_NULL(em)) {
> SetPageError(page);
> ret = PTR_ERR_OR_ZERO(em);
> break;
> }
>
> extent_offset = cur - em->start;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur)
>
> ...
> }
>
> thanks,
> -liubo
>
>>> struct extent_map *btrfs_get_extent(...)
>>> {
>>> [...]
>>> insert:
>>> btrfs_release_path(path);
>>> if (em->start > start || extent_map_end(em) <= start) {
>>> btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed
>>> [%llu %llu]",
>>> em->start, em->len, start, len);
>>> err = -EIO;
>>> goto out;
>>> }
>>> [...]
>>> }
>>>
>>> thanks,
>>> -liubo
>>>
>>>> + /*
>>>> + * The existing extent map is the one nearest to
>>>> + * the [start, start + len) range which overlaps
>>>> + */
>>>> + err = merge_extent_mapping(em_tree, existing,
>>>> + em, start);
>>>> free_extent_map(existing);
>>>> - existing = NULL;
>>>> - }
>>>> - if (!existing) {
>>>> - existing = lookup_extent_mapping(em_tree, em->start,
>>>> - em->len);
>>>> - if (existing) {
>>>> - err = merge_extent_mapping(em_tree, existing,
>>>> - em, start);
>>>> - free_extent_map(existing);
>>>> - if (err) {
>>>> - free_extent_map(em);
>>>> - em = NULL;
>>>> - }
>>>> - } else {
>>>> - err = -EIO;
>>>> + if (err) {
>>>> free_extent_map(em);
>>>> em = NULL;
>>>> }
>>>> --
>>>> 2.1.0
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2014-09-18 7:58 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-09-17 3:53 [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map Qu Wenruo
2014-09-18 4:21 ` Liu Bo
2014-09-18 5:36 ` Qu Wenruo
2014-09-18 5:40 ` Qu Wenruo
2014-09-18 7:33 ` Liu Bo
2014-09-18 7:58 ` Qu Wenruo [this message]
2014-09-18 8:20 ` Liu Bo
2014-09-18 8:24 ` Qu Wenruo
2014-09-18 9:01 ` Liu Bo
2014-09-18 13:16 ` Filipe David Manana
2014-09-19 0:31 ` Qu Wenruo
2014-10-08 12:08 ` Filipe David Manana
2014-10-09 0:28 ` Qu Wenruo
2014-10-09 10:27 ` Filipe David Manana
2014-10-10 2:39 ` Qu Wenruo
2014-10-10 8:08 ` Filipe David Manana
2014-10-13 2:47 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=541A90A2.5050407@cn.fujitsu.com \
--to=quwenruo@cn.fujitsu.com \
--cc=bo.li.liu@oracle.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).