From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:28679 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932418AbbJIEgy (ORCPT ); Fri, 9 Oct 2015 00:36:54 -0400 Subject: Re: [PATCH v2 00/23] Rework btrfs qgroup reserved space framework To: Qu Wenruo , References: <1444356684-30162-1-git-send-email-quwenruo@cn.fujitsu.com> From: Josef Bacik Message-ID: <5617445F.7030203@fb.com> Date: Thu, 8 Oct 2015 21:36:47 -0700 MIME-Version: 1.0 In-Reply-To: <1444356684-30162-1-git-send-email-quwenruo@cn.fujitsu.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: On 10/08/2015 07:11 PM, Qu Wenruo wrote: > In previous rework of qgroup, we succeeded in fixing qgroup accounting > part, making the rfer/excl numbers accurate. > > But that's just part of qgroup work, another part of qgroup still has > quite a lot problem, that's qgroup reserve space part which will lead to > EQUOT even we are far from the limit. > > [[BUG]] > The easiest way to trigger the bug is, > 1) Enable quota > 2) Limit excl of qgroup 5 to 16M > 3) Write [0,2M) of a file inside subvol 5 10 times without sync > > EQUOT will be triggered at about the 8th write. > But after remount, we can still write until about 15M. > > [[CAUSE]] > The problem is caused by the fact that qgroup will reserve space even > the data space is already reserved. > > In above reproducer, each time we buffered write [0,2M) qgroup will > reserve 2M space, but in fact, at the 1st time, we have already reserved > 2M and from then on, we don't need to reserved any data space as we are > only writing [0,2M). > > Also, the reserved space will only be freed *ONCE* when its backref is > run at commit_transaction() time. > > That's causing the reserved space leaking. > > [[FIX]] > The fix is not a simple one, as currently btrfs_qgroup_reserve() will > allocate whatever caller asked for. > > So for accurate qgroup reserve, we introduce a completely new framework > for data and metadata. > 1) Per-inode data reserve map > Now, each inode will have a data reserve map, recording which range > of data is already reserved. > If we are writing a range which is already reserved, we won't need to > reserve space again. > > Also, for the fact that qgroup is only accounted at commit_trans(), > for data commit into disc and its metadata is also inserted into > current tree, we should free the data reserved range, but still keep > the reserved space until commit_trans(). > > So delayed_ref_head will have new members to record how much space is > reserved and free them at commit_trans() time. This is already handled by setting DELALLOC in the io_tree, we do similar sort of stuff for the normal enospc accounting, why not use that? Thanks, Josef