From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([59.151.112.132]:2201 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1751104AbbJIFpI (ORCPT ); Fri, 9 Oct 2015 01:45:08 -0400 Subject: Re: [PATCH v2 00/23] Rework btrfs qgroup reserved space framework To: Josef Bacik , References: <1444356684-30162-1-git-send-email-quwenruo@cn.fujitsu.com> <5617445F.7030203@fb.com> From: Qu Wenruo Message-ID: <5617545C.2080303@cn.fujitsu.com> Date: Fri, 9 Oct 2015 13:45:00 +0800 MIME-Version: 1.0 In-Reply-To: <5617445F.7030203@fb.com> Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: Josef Bacik wrote on 2015/10/08 21:36 -0700: > On 10/08/2015 07:11 PM, Qu Wenruo wrote: >> In previous rework of qgroup, we succeeded in fixing qgroup accounting >> part, making the rfer/excl numbers accurate. >> >> But that's just part of qgroup work, another part of qgroup still has >> quite a lot problem, that's qgroup reserve space part which will lead to >> EQUOT even we are far from the limit. >> >> [[BUG]] >> The easiest way to trigger the bug is, >> 1) Enable quota >> 2) Limit excl of qgroup 5 to 16M >> 3) Write [0,2M) of a file inside subvol 5 10 times without sync >> >> EQUOT will be triggered at about the 8th write. >> But after remount, we can still write until about 15M. >> >> [[CAUSE]] >> The problem is caused by the fact that qgroup will reserve space even >> the data space is already reserved. >> >> In above reproducer, each time we buffered write [0,2M) qgroup will >> reserve 2M space, but in fact, at the 1st time, we have already reserved >> 2M and from then on, we don't need to reserved any data space as we are >> only writing [0,2M). >> >> Also, the reserved space will only be freed *ONCE* when its backref is >> run at commit_transaction() time. >> >> That's causing the reserved space leaking. >> >> [[FIX]] >> The fix is not a simple one, as currently btrfs_qgroup_reserve() will >> allocate whatever caller asked for. >> >> So for accurate qgroup reserve, we introduce a completely new framework >> for data and metadata. >> 1) Per-inode data reserve map >> Now, each inode will have a data reserve map, recording which range >> of data is already reserved. >> If we are writing a range which is already reserved, we won't need to >> reserve space again. >> >> Also, for the fact that qgroup is only accounted at commit_trans(), >> for data commit into disc and its metadata is also inserted into >> current tree, we should free the data reserved range, but still keep >> the reserved space until commit_trans(). >> >> So delayed_ref_head will have new members to record how much space is >> reserved and free them at commit_trans() time. > > This is already handled by setting DELALLOC in the io_tree, we do > similar sort of stuff for the normal enospc accounting, why not use > that? Thanks, > > Josef Thanks for pointing this out. I was also searching for a existing facility, but didn't find one as I'm not familiar with io_tree. After a quick glance, it seems quite fit the need, but not completely sure. I'll keep investigating on it and try to use it. BTW, from what I understand, __btrfs_buffered_write() should cause the range to be DEALLOC, but I didn't find any call to set_extent_delalloc(), it that done in other place? Thanks, Qu