* [PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical more clean.
@ 2014-12-24 1:55 Qu Wenruo
2014-12-24 1:55 ` [PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation Qu Wenruo
2015-02-26 1:20 ` [PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical more clean Qu Wenruo
0 siblings, 2 replies; 5+ messages in thread
From: Qu Wenruo @ 2014-12-24 1:55 UTC (permalink / raw)
To: linux-btrfs
The original __btrfs_alloc_chunk() uses max_chunk_size to limit the chunk
size, but it mixes on-disk space with logical space.
When the "10% of writeable space" cap is applied, max_chunk_size refers to
on-disk size, yet the same value is also used as the logical size limit.
This is confusing and causes inconsistency between profiles.
For example:
On a 5G single-device btrfs with metadata and data single,
a data chunk is limited to 512M by the 10% limit.
On a two-device 10Gx2 btrfs with metadata and data RAID1,
a data chunk is limited to 2G because the 10% limit is mixed with on-disk
space, giving a logical chunk size of 1G, twice that of the single-device
case.
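The following is a minimal userspace sketch of the arithmetic behind the
example above (illustrative only, not the kernel path; it merely assumes
that div_factor(x, 1) is the 10% factor named in the "10% of writeable
space" comment):

#include <stdio.h>

int main(void)
{
    /* Case 1: 5G single device, data single -> 1 copy of every byte */
    unsigned long long cap1 = (5ULL << 30) / 10;  /* 512M on-disk budget */
    unsigned long long log1 = cap1 / 1;           /* 512M logical limit  */

    /* Case 2: two 10G devices, data RAID1 -> 2 copies of every byte */
    unsigned long long cap2 = (20ULL << 30) / 10; /* 2G on-disk budget   */
    unsigned long long log2 = cap2 / 2;           /* 1G logical limit    */

    printf("single: %lluM logical, raid1: %lluM logical\n",
           log1 >> 20, log2 >> 20);
    return 0;
}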
This patch makes the logical and on-disk space limits independent and
explicit, resolving the inconsistency above.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
changelog:
v2:
Newly introduced.
---
fs/btrfs/volumes.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0144790..8e74b34 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4235,10 +4235,12 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
int ncopies; /* how many copies to data has */
int ret;
u64 max_stripe_size;
- u64 max_chunk_size;
+ u64 max_logical_size; /* Up limit on chunk's logical size */
+ u64 max_physical_size; /* Up limit on a chunk's on-disk size */
u64 stripe_size;
u64 num_bytes;
u64 raid_stripe_len = BTRFS_STRIPE_LEN;
+ int need_bump = 0;
int ndevs;
int i;
int j;
@@ -4260,7 +4262,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
if (type & BTRFS_BLOCK_GROUP_DATA) {
max_stripe_size = 1024 * 1024 * 1024;
- max_chunk_size = 10 * max_stripe_size;
+ max_logical_size = 10 * max_stripe_size;
if (!devs_max)
devs_max = BTRFS_MAX_DEVS(info->chunk_root);
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
@@ -4269,12 +4271,12 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
max_stripe_size = 1024 * 1024 * 1024;
else
max_stripe_size = 256 * 1024 * 1024;
- max_chunk_size = max_stripe_size;
+ max_logical_size = max_stripe_size;
if (!devs_max)
devs_max = BTRFS_MAX_DEVS(info->chunk_root);
} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
max_stripe_size = 32 * 1024 * 1024;
- max_chunk_size = 2 * max_stripe_size;
+ max_logical_size = 2 * max_stripe_size;
if (!devs_max)
devs_max = BTRFS_MAX_DEVS_SYS_CHUNK;
} else {
@@ -4284,8 +4286,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
}
/* we don't want a chunk larger than 10% of writeable space */
- max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
- max_chunk_size);
+ max_physical_size = div_factor(fs_devices->total_rw_bytes, 1);
devices_info = kzalloc(sizeof(*devices_info) * fs_devices->rw_devices,
GFP_NOFS);
@@ -4391,15 +4392,21 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
data_stripes = num_stripes - 2;
}
- /*
- * Use the number of data stripes to figure out how big this chunk
- * is really going to be in terms of logical address space,
- * and compare that answer with the max chunk size
- */
- if (stripe_size * data_stripes > max_chunk_size) {
- u64 mask = (1ULL << 24) - 1;
- stripe_size = max_chunk_size;
+ /* Restrict on-disk chunk size */
+ if (stripe_size * num_stripes > max_physical_size) {
+ stripe_size = max_physical_size;
+ do_div(stripe_size, num_stripes);
+ need_bump = 1;
+ }
+ /* restrict logical chunk size */
+ if (stripe_size * data_stripes > max_logical_size) {
+ stripe_size = max_logical_size;
do_div(stripe_size, data_stripes);
+ need_bump = 1;
+ }
+
+ if (need_bump) {
+ u64 mask = (1ULL << 24) - 1;
/* bump the answer up to a 16MB boundary */
stripe_size = (stripe_size + mask) & ~mask;
@@ -4411,6 +4418,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
stripe_size = devices_info[ndevs-1].max_avail;
}
+ /*
+ * Special handle for DUP, since stripe_size is the largest free extent
+ * we found, DUP can only use half of it. Other profile's dev_stripes
+ * is always 1.
+ */
do_div(stripe_size, dev_stripes);
/* align to BTRFS_STRIPE_LEN */
--
2.2.1
* [PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
2014-12-24 1:55 [PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical more clean Qu Wenruo
@ 2014-12-24 1:55 ` Qu Wenruo
2014-12-29 14:56 ` David Sterba
2015-02-26 1:20 ` [PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical more clean Qu Wenruo
1 sibling, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2014-12-24 1:55 UTC (permalink / raw)
To: linux-btrfs
When btrfs allocates a chunk, it tries to allocate up to 1G for data and
256M for metadata, capped at 10% of all the writeable space, as long as
there is enough room on the devices for the stripes.
However, when we are running out of space, this may lead to unbalanced
chunk allocation.
For example, if only 1G of space is unallocated and a DATA chunk
allocation request arrives, all of that space is allocated as a data
chunk, so a later metadata chunk allocation request cannot be satisfied,
which causes ENOSPC.
This is one of the common complaints from end users: ENOSPC happens even
though there still appears to be available space.
This patch avoids allocating a chunk larger than half of the unallocated
space, making the use of the last space more balanced, at the small cost
of more fragmented chunks within the last 1G.
A simple example:
Preallocate 17.5G on a 20G empty btrfs fs:
[Before]
# btrfs fi show /mnt/test
Label: none uuid: da8741b1-5d47-4245-9e94-bfccea34e91e
Total devices 1 FS bytes used 17.50GiB
devid 1 size 20.00GiB used 20.00GiB path /dev/sdb
All space is allocated. No space for later metadata allocation.
[After]
# btrfs fi show /mnt/test
Label: none uuid: e6935aeb-a232-4140-84f9-80aab1f23d56
Total devices 1 FS bytes used 17.50GiB
devid 1 size 20.00GiB used 19.77GiB path /dev/sdb
About 230M is still available for later metadata allocation.
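To make the "more fragmented chunks in the last 1G" trade-off concrete,
here is a toy userspace simulation of the half-of-remaining rule combined
with the 16M bump (illustrative only; it ignores stripe counts, per-device
limits and everything else the real allocator does):

#include <stdio.h>

int main(void)
{
    unsigned long long mask  = (1ULL << 24) - 1; /* 16M alignment mask  */
    unsigned long long avail = 1ULL << 30;       /* last 1G unallocated */

    while (avail >= (16ULL << 20)) {
        unsigned long long chunk = avail / 2;    /* half-of-remaining rule */
        chunk = (chunk + mask) & ~mask;          /* bump to 16M boundary   */
        if (chunk > avail)
            chunk = avail;
        printf("alloc %4lluM, %4lluM left\n",
               chunk >> 20, (avail - chunk) >> 20);
        avail -= chunk;
    }
    return 0;
}

Under these simplified assumptions the last 1G is carved up as
512M + 256M + 128M + 64M + 32M + 16M + 16M, which is the fragmentation
cost mentioned above.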
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
Changelog:
v2:
Remove false dead zone judgement since it won't happen
---
fs/btrfs/volumes.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e74b34..20b3eea 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4237,6 +4237,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
u64 max_stripe_size;
u64 max_logical_size; /* Up limit on chunk's logical size */
u64 max_physical_size; /* Up limit on a chunk's on-disk size */
+ u64 total_physical_avail = 0;
u64 stripe_size;
u64 num_bytes;
u64 raid_stripe_len = BTRFS_STRIPE_LEN;
@@ -4349,6 +4350,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+ total_physical_avail += total_avail;
++ndevs;
}
@@ -4398,6 +4400,23 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
do_div(stripe_size, num_stripes);
need_bump = 1;
}
+
+ /*
+ * Don't alloc chunk whose physical size is larger than half
+ * of the rest physical space.
+ * This will reduce the possibility of ENOSPC when comes to
+ * last unallocated space
+ *
+ * For the last 16~32M (e.g. 20M), it will first alloc 16M
+ * (bumped to 16M) and the next time will be the rest size
+ * (bumped to 16M and reduced to 4M).
+ * So no dead zone.
+ */
+ if (stripe_size * num_stripes > total_physical_avail / 2) {
+ stripe_size = total_physical_avail / 2;
+ need_bump = 1;
+
+ }
/* restrict logical chunk size */
if (stripe_size * data_stripes > max_logical_size) {
stripe_size = max_logical_size;
--
2.2.1
* Re: [PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
2014-12-24 1:55 ` [PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation Qu Wenruo
@ 2014-12-29 14:56 ` David Sterba
2014-12-30 0:40 ` Qu Wenruo
0 siblings, 1 reply; 5+ messages in thread
From: David Sterba @ 2014-12-29 14:56 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Wed, Dec 24, 2014 at 09:55:14AM +0800, Qu Wenruo wrote:
> When btrfs allocates a chunk, it tries to allocate up to 1G for data and
> 256M for metadata, capped at 10% of all the writeable space, as long as
> there is enough room on the devices for the stripes.
>
> However, when we are running out of space, this may lead to unbalanced
> chunk allocation.
> For example, if only 1G of space is unallocated and a DATA chunk
> allocation request arrives, all of that space is allocated as a data
> chunk, so a later metadata chunk allocation request cannot be satisfied,
> which causes ENOSPC.
The question is why the metadata is full although there's 1G free, as
the metadata chunks are being preallocated according to the metadata
ratio.
> This is one of the common complaints from end users: ENOSPC happens even
> though there still appears to be available space.
>
> This patch avoids allocating a chunk larger than half of the unallocated
> space, making the use of the last space more balanced, at the small cost
> of more fragmented chunks within the last 1G.
I'm really worried about the small chunks and the fragmentation on that
level wrt balancing. The small chunks will be relocated to bigger free
chunks (e.g. 256MB) and make it unusable for further rebalancing of the
256MB chunks. Newly allocated chunks will have to be reduced in size to
fit in the remaining space and will cause further fragmentation of the
chunk space.
The drawbacks of small chunks are obvious:
* more chunks mean more processing
* smaller chance of getting big contiguous space for extents, leading to
  file fragmentation that cannot be much improved by defragmentation
IMO the chunk allocation should be more predictable and should give some
clue about how the layout happens, otherwise this will become another dark
corner that makes debugging harder and can negatively and unpredictably
affect performance after some time.
The problems you're trying to address are real, no doubt here, but I'd
rather try to address them in a different way.
* Re: [PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
2014-12-29 14:56 ` David Sterba
@ 2014-12-30 0:40 ` Qu Wenruo
0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2014-12-30 0:40 UTC (permalink / raw)
To: dsterba, linux-btrfs
-------- Original Message --------
Subject: Re: [PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation
algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
From: David Sterba <dsterba@suse.cz>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2014-12-29 22:56
> On Wed, Dec 24, 2014 at 09:55:14AM +0800, Qu Wenruo wrote:
>> When btrfs allocates a chunk, it tries to allocate up to 1G for data and
>> 256M for metadata, capped at 10% of all the writeable space, as long as
>> there is enough room on the devices for the stripes.
>>
>> However, when we are running out of space, this may lead to unbalanced
>> chunk allocation.
>> For example, if only 1G of space is unallocated and a DATA chunk
>> allocation request arrives, all of that space is allocated as a data
>> chunk, so a later metadata chunk allocation request cannot be satisfied,
>> which causes ENOSPC.
> The question is why the metadata is full although there's 1G free, as
> the metadata chunks are being preallocated according to the metadata
> ratio.
This can still happen if, after the data chunk is allocated, the later
workload is metadata-heavy only.
>
>> This is one of the common complaints from end users: ENOSPC happens even
>> though there still appears to be available space.
>>
>> This patch avoids allocating a chunk larger than half of the unallocated
>> space, making the use of the last space more balanced, at the small cost
>> of more fragmented chunks within the last 1G.
> I'm really worried about the small chunks and the fragmentation on that
> level wrt balancing. The small chunks will be relocated to bigger free
> chunks (e.g. 256MB) and make it unusable for further rebalancing of the
> 256MB chunks. Newly allocated chunks will have to be reduced in size to
> fit in the remaining space and will cause further fragmentation of the
> chunk space.
>
> The drawbacks of small chunks are obvious:
>
> * more chunks mean more processing
> * smaller chance of getting big contiguous space for extents, leading to
> file fragmentation that cannot be much improved by
> defragmentation
You're right, such a half-half method will mess up relocation; that's
what I forgot.
>
> IMO the chunk allocation should be more predictable and should give some
> clue about how the layout happens, otherwise this will become another dark
> corner that makes debugging harder and can negatively and unpredictably
> affect performance after some time.
Some other methods also come to mind, like predicting the data:metadata
ratio from the currently or recently allocated data:metadata ratio, but
that does not seem to help the last-1GB case.
Or, when it comes to the last 1GB, allocate it as a mixed (data+metadata)
chunk? That seems to need a new incompat flag and some tweaks to
relocation.
Thanks,
Qu
>
> The problems you're trying to address are real, no doubt here, but I'd
> rather try to address them in a different way.
* Re: [PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical more clean.
2014-12-24 1:55 [PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical more clean Qu Wenruo
2014-12-24 1:55 ` [PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation Qu Wenruo
@ 2015-02-26 1:20 ` Qu Wenruo
1 sibling, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2015-02-26 1:20 UTC (permalink / raw)
To: linux-btrfs
Ping.
Any comment?
Thanks,
Qu
-------- Original Message --------
Subject: [PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical
more clean.
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <linux-btrfs@vger.kernel.org>
Date: 2014-12-24 09:55
> The original __btrfs_alloc_chunk() uses max_chunk_size to limit the chunk
> size, but it mixes on-disk space with logical space.
> When the "10% of writeable space" cap is applied, max_chunk_size refers to
> on-disk size, yet the same value is also used as the logical size limit.
> This is confusing and causes inconsistency between profiles.
>
> For example:
> On a 5G single-device btrfs with metadata and data single,
> a data chunk is limited to 512M by the 10% limit.
>
> On a two-device 10Gx2 btrfs with metadata and data RAID1,
> a data chunk is limited to 2G because the 10% limit is mixed with on-disk
> space, giving a logical chunk size of 1G, twice that of the single-device
> case.
>
> This patch makes the logical and on-disk space limits independent and
> explicit, resolving the inconsistency above.
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> changelog:
> v2:
> Newly introduced.
> ---
> fs/btrfs/volumes.c | 40 ++++++++++++++++++++++++++--------------
> 1 file changed, 26 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 0144790..8e74b34 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4235,10 +4235,12 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> int ncopies; /* how many copies to data has */
> int ret;
> u64 max_stripe_size;
> - u64 max_chunk_size;
> + u64 max_logical_size; /* Up limit on chunk's logical size */
> + u64 max_physical_size; /* Up limit on a chunk's on-disk size */
> u64 stripe_size;
> u64 num_bytes;
> u64 raid_stripe_len = BTRFS_STRIPE_LEN;
> + int need_bump = 0;
> int ndevs;
> int i;
> int j;
> @@ -4260,7 +4262,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>
> if (type & BTRFS_BLOCK_GROUP_DATA) {
> max_stripe_size = 1024 * 1024 * 1024;
> - max_chunk_size = 10 * max_stripe_size;
> + max_logical_size = 10 * max_stripe_size;
> if (!devs_max)
> devs_max = BTRFS_MAX_DEVS(info->chunk_root);
> } else if (type & BTRFS_BLOCK_GROUP_METADATA) {
> @@ -4269,12 +4271,12 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> max_stripe_size = 1024 * 1024 * 1024;
> else
> max_stripe_size = 256 * 1024 * 1024;
> - max_chunk_size = max_stripe_size;
> + max_logical_size = max_stripe_size;
> if (!devs_max)
> devs_max = BTRFS_MAX_DEVS(info->chunk_root);
> } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
> max_stripe_size = 32 * 1024 * 1024;
> - max_chunk_size = 2 * max_stripe_size;
> + max_logical_size = 2 * max_stripe_size;
> if (!devs_max)
> devs_max = BTRFS_MAX_DEVS_SYS_CHUNK;
> } else {
> @@ -4284,8 +4286,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> }
>
> /* we don't want a chunk larger than 10% of writeable space */
> - max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
> - max_chunk_size);
> + max_physical_size = div_factor(fs_devices->total_rw_bytes, 1);
>
> devices_info = kzalloc(sizeof(*devices_info) * fs_devices->rw_devices,
> GFP_NOFS);
> @@ -4391,15 +4392,21 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> data_stripes = num_stripes - 2;
> }
>
> - /*
> - * Use the number of data stripes to figure out how big this chunk
> - * is really going to be in terms of logical address space,
> - * and compare that answer with the max chunk size
> - */
> - if (stripe_size * data_stripes > max_chunk_size) {
> - u64 mask = (1ULL << 24) - 1;
> - stripe_size = max_chunk_size;
> + /* Restrict on-disk chunk size */
> + if (stripe_size * num_stripes > max_physical_size) {
> + stripe_size = max_physical_size;
> + do_div(stripe_size, num_stripes);
> + need_bump = 1;
> + }
> + /* restrict logical chunk size */
> + if (stripe_size * data_stripes > max_logical_size) {
> + stripe_size = max_logical_size;
> do_div(stripe_size, data_stripes);
> + need_bump = 1;
> + }
> +
> + if (need_bump) {
> + u64 mask = (1ULL << 24) - 1;
>
> /* bump the answer up to a 16MB boundary */
> stripe_size = (stripe_size + mask) & ~mask;
> @@ -4411,6 +4418,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> stripe_size = devices_info[ndevs-1].max_avail;
> }
>
> + /*
> + * Special handle for DUP, since stripe_size is the largest free extent
> + * we found, DUP can only use half of it. Other profile's dev_stripes
> + * is always 1.
> + */
> do_div(stripe_size, dev_stripes);
>
> /* align to BTRFS_STRIPE_LEN */