From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from cn.fujitsu.com ([59.151.112.132]:31198 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1752093AbaJ3DqU convert rfc822-to-8bit (ORCPT ); Wed, 29 Oct 2014 23:46:20 -0400
Message-ID: <5451B48A.4080508@cn.fujitsu.com>
Date: Thu, 30 Oct 2014 11:46:18 +0800
From: Qu Wenruo
MIME-Version: 1.0
To:
CC:
Subject: Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
References: <1414031871-10859-1-git-send-email-quwenruo@cn.fujitsu.com>
 <20141024110624.GB32526@localhost.localdomain>
 <544D8F44.8050706@cn.fujitsu.com>
 <20141027081456.GD27271@localhost.localdomain>
 <544E040A.3090407@cn.fujitsu.com>
 <20141029142917.GA9547@localhost.localdomain>
 <54518D1F.9040008@cn.fujitsu.com>
In-Reply-To: <54518D1F.9040008@cn.fujitsu.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

-------- Original Message --------
Subject: Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
From: Qu Wenruo
To: bo.li.liu@oracle.com
Date: 2014-10-30 08:58
>
> -------- Original Message --------
> Subject: Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm
> to reduce ENOSPC caused by unbalanced data/metadata allocation.
> From: Liu Bo
> To: Qu Wenruo
> Date: 2014-10-29 22:29
>> On Mon, Oct 27, 2014 at 04:36:26PM +0800, Qu Wenruo wrote:
>>> -------- Original Message --------
>>> Subject: Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm
>>> to reduce ENOSPC caused by unbalanced data/metadata allocation.
>>> From: Liu Bo
>>> To: Qu Wenruo
>>> Date: 2014-10-27 16:14
>>>> On Mon, Oct 27, 2014 at 08:18:12AM +0800, Qu Wenruo wrote:
>>>>> -------- Original Message --------
>>>>> Subject: Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm
>>>>> to reduce ENOSPC caused by unbalanced data/metadata allocation.
>>>>> From: Liu Bo
>>>>> To: Qu Wenruo
>>>>> Date: 2014-10-24 19:06
>>>>>> On Thu, Oct 23, 2014 at 10:37:51AM +0800, Qu Wenruo wrote:
>>>>>>> When btrfs allocates a chunk, it will try to alloc up to 1G for data and
>>>>>>> 256M for metadata, or 10% of all the writeable space if there is enough
>>>>>> 10G for data,
>>>>>> 	if (type & BTRFS_BLOCK_GROUP_DATA) {
>>>>>> 		max_stripe_size = 1024 * 1024 * 1024;
>>>>>> 		max_chunk_size = 10 * max_stripe_size;
>>>>> Oh, sorry, 10G is right.
>>>>>
>>>>> Any other comments?
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>
>>>>>> ...
>>>>>>
>>>>>> thanks,
>>>>>> -liubo
>>>>>>
>>>>>>> space for the stripe on device.
>>>>>>>
>>>>>>> However, when we run out of space, this allocation may cause unbalanced
>>>>>>> chunk allocation.
>>>>>>> For example, if there is only 1G of unallocated space and a request to
>>>>>>> allocate a DATA chunk is sent, all the space will be allocated as a data
>>>>>>> chunk, so a later metadata chunk alloc request cannot be handled, which
>>>>>>> will cause ENOSPC.
>>>>>>> This is one of the common complaints from end users about why ENOSPC
>>>>>>> happens while there is still available space.
>>>> Okay, I don't think this is the common case. AFAIK, most ENOSPC is caused
>>>> by our runtime worst case metadata reservation problem.
>>>>
>>>> btrfs has been inclined to create a fairly large metadata chunk (1G) in its
>>>> initial mkfs stage, and a 256M metadata chunk is also a very large one.
>>>>
>>>> As for your example below, yes, we don't have space for metadata
>>>> allocation, but do we really need to allocate a new one?
>>>>
>>>> Or am I missing something?
>>>>
>>>> thanks,
>>>> -liubo
>>> Yes, that's true, this is not the common cause, but at least this patch
>>> may let the used percentage shown by the 'df' command get as close to
>>> 100% as possible before hitting ENOSPC under normal operations.
>>> (If not using balance)
>>>
>>> And some cases like the following mail may be improved by the patch:
>>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36097.html
>>>
>>> I understand that most cases of a lot of free data space and no metadata
>>> space are caused by creating and then deleting large files, but if the
>>> last gigabytes can be allocated more carefully, at least the available
>>> bytes shown by the 'df' command should be reduced before hitting ENOSPC.
>>>
>>> What do you think about it?
>> Sorry for the late reply.
>>
>> I just noticed that a recent commit has fixed this problem.
>>
>> commit 47ab2a6c689913db23ccae38349714edf8365e0a
>> Author: Josef Bacik
>> Date: Thu Sep 18 11:20:02 2014 -0400
>>
>>     Btrfs: remove empty block groups automatically
>> thanks,
>> -liubo
> Oh, that's much better than my patch.
>
> So please ignore my patch.
>
> Thanks,
> Qu
Wait a second. It's true that block group auto-reclaim can deal with some
cases, but it will not improve the used percentage that plain 'df' shows
before hitting ENOSPC.

With the old 10%/10G limit, a 100G disk will still hit ENOSPC below 90%
used space.
This patch should raise that to above 95%, or even above 99%.

The old behavior may leave normal users with the bad impression that btrfs
can't use space effectively.

So I still think the patch has a positive effect on btrfs.

Thanks,
Qu
>>
>>> Thanks,
>>> Qu
>>>>>>> This patch will try not to alloc a chunk which is more than half of the
>>>>>>> unallocated space, making the last space more balanced at a small cost
>>>>>>> of more fragmented chunks in the last 1G.
>>>>>>>
>>>>>>> Some easy example:
>>>>>>> Preallocate 17.5G on a 20G empty btrfs fs:
>>>>>>> [Before]
>>>>>>> # btrfs fi show /mnt/test
>>>>>>> Label: none uuid: da8741b1-5d47-4245-9e94-bfccea34e91e
>>>>>>> 	Total devices 1 FS bytes used 17.50GiB
>>>>>>> 	devid 1 size 20.00GiB used 20.00GiB path /dev/sdb
>>>>>>> All space is allocated. No space is left for later metadata allocation.
>>>>>>>
>>>>>>> [After]
>>>>>>> # btrfs fi show /mnt/test
>>>>>>> Label: none uuid: e6935aeb-a232-4140-84f9-80aab1f23d56
>>>>>>> 	Total devices 1 FS bytes used 17.50GiB
>>>>>>> 	devid 1 size 20.00GiB used 19.77GiB path /dev/sdb
>>>>>>> About 230M is still available for later metadata allocation.
>>>>>>>
>>>>>>> Signed-off-by: Qu Wenruo
>>>>>>> ---
>>>>>>>  fs/btrfs/volumes.c | 18 ++++++++++++++++++
>>>>>>>  1 file changed, 18 insertions(+)
>>>>>>>
>>>>>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>>>>>> index d47289c..fa8de79 100644
>>>>>>> --- a/fs/btrfs/volumes.c
>>>>>>> +++ b/fs/btrfs/volumes.c
>>>>>>> @@ -4240,6 +4240,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>>>>>>  	int ret;
>>>>>>>  	u64 max_stripe_size;
>>>>>>>  	u64 max_chunk_size;
>>>>>>> +	u64 total_avail_space = 0;
>>>>>>>  	u64 stripe_size;
>>>>>>>  	u64 num_bytes;
>>>>>>>  	u64 raid_stripe_len = BTRFS_STRIPE_LEN;
>>>>>>> @@ -4352,10 +4353,27 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>>>>>>  		devices_info[ndevs].max_avail = max_avail;
>>>>>>>  		devices_info[ndevs].total_avail = total_avail;
>>>>>>>  		devices_info[ndevs].dev = device;
>>>>>>> +		total_avail_space += total_avail;
>>>>>>>  		++ndevs;
>>>>>>>  	}
>>>>>>>  	/*
>>>>>>> +	 * Try not to occupy more than half of the unallocated space.
>>>>>>> +	 * When running short of space, allocating all the space to
>>>>>>> +	 * data/metadata will cause ENOSPC to be triggered more easily.
>>>>>>> +	 *
>>>>>>> +	 * And since the minimum chunk size is 16M, the half-half will cause
>>>>>>> +	 * 16M allocated from 20M available space and the rest 4M will not be
>>>>>>> +	 * used ever. In that case (16~32M), allocate all directly.
>>>>>>> +	 */
>>>>>>> +	if (total_avail_space < 32 * 1024 * 1024 &&
>>>>>>> +	    total_avail_space > 16 * 1024 * 1024)
>>>>>>> +		max_chunk_size = total_avail_space;
>>>>>>> +	else
>>>>>>> +		max_chunk_size = min(total_avail_space / 2, max_chunk_size);
>>>>>>> +	max_chunk_size = min(total_avail_space / 2, max_chunk_size);
>>>>>>> +
>>>>>>> +	/*
>>>>>>>  	 * now sort the devices by hole size / available space
>>>>>>>  	 */
>>>>>>>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>>>>>> --
>>>>>>> 2.1.2