All of lore.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <bo.li.liu@oracle.com>
Cc: <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
Date: Thu, 18 Sep 2014 13:36:47 +0800	[thread overview]
Message-ID: <541A6F6F.4010901@cn.fujitsu.com> (raw)
In-Reply-To: <20140918042113.GA15092@localhost.localdomain>


-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to 
insert best fitted extent map
From: Liu Bo <bo.li.liu@oracle.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2014年09月18日 12:21
> On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote:
>> The following commit enhanced the merge_extent_mapping() to reduce
>> fragment in extent map tree, but it can't handle case which existing
>> lies before map_start:
>> 51f39 btrfs: Use right extent length when inserting overlap extent map.
>>
>> [BUG]
>> When existing extent map's start is before map_start,
>> the em->len will be minus, which will corrupt the extent map and fail to
>> insert the new extent map.
>> This will happen when someone get a large extent map, but when it is
>> going to insert it into extent map tree, some one has already commit
>> some write and split the huge extent into small parts.
>>
>> [REPRODUCER]
>> It is very easy to tiger using filebench with randomrw personality.
>> It is about 100% to reproduce when using 8G preallocated file in 60s
>> randonrw test.
>>
>> [FIX]
>> This patch can now handle any existing extent position.
>> Since it does not directly use existing->start, now it will find the
>> previous and next extent around map_start.
>> So the old existing->start < map_start bug will never happen again.
>>
>> [ENHANCE]
>> This patch will insert the best fitted extent map into extent map tree,
>> other than the oldest [map_start, map_start + sectorsize) or the
>> relatively newer but not perfect [map_start, existing->start).
>>
>> The patch will first search existing extent that does not intersects with
>> the desired map range [map_start, map_start + len).
>> The existing extent will be either before or behind map_start, and based
>> on the existing extent, we can find out the previous and next extent
>> around map_start.
>>
>> So the best fitted extent would be [prev->end, next->start).
>> For prev or next is not found, em->start would be prev->end and em->end
>> wold be next->start.
>>
>> With this patch, the fragment in extent map tree should be reduced much
>> more than the 51f39 commit and reduce an unneeded extent map tree search.
>>
>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>   fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 57 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 016c403..8039021 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -6191,21 +6191,60 @@ out_fail_inode:
>>   	goto out_fail;
>>   }
>>   
>> +/* Find next extent map of a given extent map, caller needs to ensure locks */
>> +static struct extent_map *next_extent_map(struct extent_map *em)
>> +{
>> +	struct rb_node *next;
>> +
>> +	next = rb_next(&em->rb_node);
>> +	if (!next)
>> +		return NULL;
>> +	return container_of(next, struct extent_map, rb_node);
>> +}
>> +
>> +static struct extent_map *prev_extent_map(struct extent_map *em)
>> +{
>> +	struct rb_node *prev;
>> +
>> +	prev = rb_prev(&em->rb_node);
>> +	if (!prev)
>> +		return NULL;
>> +	return container_of(prev, struct extent_map, rb_node);
>> +}
>> +
>>   /* helper for btfs_get_extent.  Given an existing extent in the tree,
>> + * the existing extent is the nearest extent to map_start,
>>    * and an extent that you want to insert, deal with overlap and insert
>> - * the new extent into the tree.
>> + * the best fitted new extent into the tree.
>>    */
>>   static int merge_extent_mapping(struct extent_map_tree *em_tree,
>>   				struct extent_map *existing,
>>   				struct extent_map *em,
>>   				u64 map_start)
>>   {
>> +	struct extent_map *prev;
>> +	struct extent_map *next;
>> +	u64 start;
>> +	u64 end;
>>   	u64 start_diff;
>>   
>>   	BUG_ON(map_start < em->start || map_start >= extent_map_end(em));
>> -	start_diff = map_start - em->start;
>> -	em->start = map_start;
>> -	em->len = existing->start - em->start;
>> +
>> +	if (existing->start > map_start) {
>> +		next = existing;
>> +		prev = prev_extent_map(next);
>> +	} else {
>> +		prev = existing;
>> +		next = next_extent_map(prev);
>> +	}
>> +
>> +	start = prev ? extent_map_end(prev) : em->start;
>> +	start = max_t(u64, start, em->start);
>> +	end = next ? next->start : extent_map_end(em);
>> +	end = min_t(u64, end, extent_map_end(em));
>> +	start_diff = start - em->start;
>> +	em->start = start;
>> +	em->len = end - start;
>>   	if (em->block_start < EXTENT_MAP_LAST_BYTE &&
>>   	    !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
>>   		em->block_start += start_diff;
>> @@ -6482,25 +6521,21 @@ insert:
>>   
>>   		ret = 0;
>>   
>> -		existing = lookup_extent_mapping(em_tree, start, len);
>> -		if (existing && (existing->start > start ||
>> -		    existing->start + existing->len <= start)) {
>> +		existing = search_extent_mapping(em_tree, start, len);
>> +		/*
>> +		 * existing will always be non-NULL, since there must be
>> +		 * extent causing the -EEXIST.
>> +		 */
>> +		if (start >= extent_map_end(existing) ||
>> +		    start + len <= existing->start) {
> This will introduce something wrong, the 'else' part is 'em = existing;',
> and the condition is actually
> (start < extent_map_end(existing) && start + len > existing->start),
> which means the existing overlaps with [start, start+len).
Nope, the else part is doing the right thing.

Before the patch, going to the 'em = existing;' routine's condition is 
like the following:
1) existing returned by lookup_extent_mapping is not NULL
2) (existing->start > start || existing->start + existing->len <=start) 
is not met

1) implies the following condition: (in extent_map.c, 
__lookup_extent_mapping())
!!(end > existing->start && start < extent_map_end(existing)), which is 
equal to the following:
start + len > existing->start(1) && start < extent_map_end(existing) (2)

2) is actually the following
start >= existing->start (3) && start < extent_map_end(existing) (4)

And the hidden condition len > 0(5)
combining 1) and 2), you will find the real condition to go to 'em = 
existing' routine is what the patch does.
Due to (5), (1) and (3) is the same condition, and (2) (4) is the same too.
So the patch is OK. 'em = existing' condition is not broken.

>
> And one of overlapping cases is (existing->start > start), ie. em->start > start, this is
> against our rule of btrfs_get_extent,
Nope again, this overlapping in fact is quite normal in multithread 
random read/write.
The files's [0~16) is a preallocated one,
Thread A:
     write [4K, 8K) into the file, but not committed yet.
     extent map tree contains [0,16K) only
Thread B:
btrfs_get_extent()
     the map_start is 8K, len is 4K as an example
     grab a large em, take [0,16K), since [4K,8K) write is not committed.
     comes to insert: btrfs_release_path(path);

Thread A:
     [4K, 8K) is not committed
     the extent map is now [0, 4K) [4K, 8K) [8K, 16K).

Thread B:
     goes to insert: add_extent_mapping()
     the [0,16K) is overlapping, and the returned existing one is [8K, 16K).
     which contains the [map_start, map_start +  len).

> struct extent_map *btrfs_get_extent(...)
> {
> 	[...]
> 	insert:
> 		btrfs_release_path(path);
> 		if (em->start > start || extent_map_end(em) <= start) {
> 			btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] passed
> 	[%llu %llu]",
> 				em->start, em->len, start, len);
> 			err = -EIO;
>                  goto out;
>          }
>          [...]
> }
>
> thanks,
> -liubo
>
>> +			/*
>> +			 * The existing extent map is the one nearest to
>> +			 * the [start, start + len) range which overlaps
>> +			 */
>> +			err = merge_extent_mapping(em_tree, existing,
>> +						   em, start);
>>   			free_extent_map(existing);
>> -			existing = NULL;
>> -		}
>> -		if (!existing) {
>> -			existing = lookup_extent_mapping(em_tree, em->start,
>> -							 em->len);
>> -			if (existing) {
>> -				err = merge_extent_mapping(em_tree, existing,
>> -							   em, start);
>> -				free_extent_map(existing);
>> -				if (err) {
>> -					free_extent_map(em);
>> -					em = NULL;
>> -				}
>> -			} else {
>> -				err = -EIO;
>> +			if (err) {
>>   				free_extent_map(em);
>>   				em = NULL;
>>   			}
>> -- 
>> 2.1.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


  reply	other threads:[~2014-09-18  5:36 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-17  3:53 [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map Qu Wenruo
2014-09-18  4:21 ` Liu Bo
2014-09-18  5:36   ` Qu Wenruo [this message]
2014-09-18  5:40     ` Qu Wenruo
2014-09-18  7:33     ` Liu Bo
2014-09-18  7:58       ` Qu Wenruo
2014-09-18  8:20         ` Liu Bo
2014-09-18  8:24           ` Qu Wenruo
2014-09-18  9:01             ` Liu Bo
2014-09-18 13:16 ` Filipe David Manana
2014-09-19  0:31   ` Qu Wenruo
2014-10-08 12:08     ` Filipe David Manana
2014-10-09  0:28       ` Qu Wenruo
2014-10-09 10:27         ` Filipe David Manana
2014-10-10  2:39           ` Qu Wenruo
2014-10-10  8:08             ` Filipe David Manana
2014-10-13  2:47               ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=541A6F6F.4010901@cn.fujitsu.com \
    --to=quwenruo@cn.fujitsu.com \
    --cc=bo.li.liu@oracle.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.