Subject: Re: [PATCH v3] btrfs: qgroup: Fix qgroup accounting when creating snapshot
From: Qu Wenruo
To: Mark Fasheh
CC: , Filipe Manana
Date: Mon, 18 Apr 2016 09:34:07 +0800
Message-ID: <5714398F.90103@cn.fujitsu.com>
In-Reply-To: <20160415160005.GP2187@wotan.suse.de>
References: <1460612320-19199-1-git-send-email-quwenruo@cn.fujitsu.com>
 <20160414214208.GM2187@wotan.suse.de>
 <57103D16.2050100@cn.fujitsu.com>
 <20160415160005.GP2187@wotan.suse.de>

Mark Fasheh wrote on 2016/04/15 09:00 -0700:
> On Fri, Apr 15, 2016 at 09:00:06AM +0800, Qu Wenruo wrote:
>>
>>
>> Mark Fasheh wrote on 2016/04/14 14:42 -0700:
>>> Hi Qu,
>>>
>>> On Thu, Apr 14, 2016 at 01:38:40PM +0800, Qu Wenruo wrote:
>>>> The current btrfs qgroup design implies a requirement that after calling
>>>> btrfs_qgroup_account_extents() there must be a commit root switch.
>>>>
>>>> Normally this is OK, as btrfs_qgroup_account_extents() is only called
>>>> inside btrfs_commit_transaction(), just before commit_cowonly_roots().
>>>>
>>>> However, there is an exception in create_pending_snapshot(), which calls
>>>> btrfs_qgroup_account_extents() without any commit root switch.
>>>>
>>>> When creating a snapshot whose parent root is itself (creating a
>>>> snapshot of the fs tree), this corrupts the qgroup numbers, as shown by
>>>> the following trace:
>>>> (unrelated data skipped)
>>>> ======
>>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>>>> ======
>>>>
>>>> The problem is that in the first qgroup_account_extent(), the
>>>> nr_new_roots of the extent is 1, which means its reference got
>>>> increased, so the qgroup increased its rfer and excl.
>>>>
>>>> But in the second qgroup_account_extent(), its reference got decreased,
>>>> yet between these two qgroup_account_extent() calls there was no root
>>>> switch. This leads to the same nr_old_roots, so the extent is simply
>>>> ignored by qgroup, which means it is wrongly accounted.
>>>>
>>>> Fix it by calling commit_cowonly_roots() after qgroup_account_extent()
>>>> in create_pending_snapshot(), with the needed preparation.
>>>>
>>>> Reported-by: Mark Fasheh
>>>
>>> Can you please CC me on this patch when you send it out? FYI it's customary
>>> to CC anyone listed here as well as significant reviewers of your patch
>>> (such as Filipe).
>>>
>>>
>>>> Signed-off-by: Qu Wenruo
>>>> ---
>>>> v2:
>>>>   Fix a soft lockup caused by a missing switch_commit_root() call.
>>>>   Fix a warning caused by a dirty-but-not-committed root.
>>>
>>> This version doesn't introduce any lockups that I encountered, thanks!
>>>
>>>
>>>> v3:
>>>>   Fix a behavior difference: btrfs qgroup would start accounting
>>>>   dropped roots when creating snapshots, instead of always accounting
>>>>   them in the next transaction.
>>>
>>> This still corrupts the qgroup numbers if you do anything significant to the
>>> source subvolume.
>>> For example, this script shows a 16K difference. My guess
>>> is that we're missing accounting of some metadata somewhere?
>>>
>>>
>>> #!/bin/bash
>>>
>>> DEV=/dev/vdb1
>>> MNT=/btrfs
>>>
>>> mkfs.btrfs -f $DEV
>>> mount -t btrfs $DEV $MNT
>>> btrfs quota enable $MNT
>>> mkdir "$MNT/snaps"
>>> mkdir "$MNT/data"
>>> echo "populate $MNT with some data"
>>> for i in `seq -w 0 640`; do
>>>     dd if=/dev/zero of="$MNT/data/file$i" bs=1M count=1 >&/dev/null
>>> done;
>>> for i in `seq -w 0 1`; do
>>>     S="$MNT/snaps/snap$i"
>>>     echo "create snapshot $S"
>>>     btrfs su snap $MNT $S;
>>> done;
>>> btrfs qg show $MNT
>>>
>>> umount $MNT
>>> btrfsck $DEV
>>>
>>>
>>> Sample output:
>>>
>>> btrfs-progs v4.4+20160122
>>> See http://btrfs.wiki.kernel.org for more information.
>>>
>>> Label:              (null)
>>> UUID:               a0b648b1-7a23-4213-9bc3-db02b8520efe
>>> Node size:          16384
>>> Sector size:        4096
>>> Filesystem size:    16.00GiB
>>> Block group profiles:
>>>   Data:             single            8.00MiB
>>>   Metadata:         DUP               1.01GiB
>>>   System:           DUP              12.00MiB
>>> SSD detected:       no
>>> Incompat features:  extref, skinny-metadata
>>> Number of devices:  1
>>> Devices:
>>>    ID        SIZE  PATH
>>>     1    16.00GiB  /dev/vdb1
>>>
>>> populate /btrfs with some data
>>> create snapshot /btrfs/snaps/snap0
>>> Create a snapshot of '/btrfs' in '/btrfs/snaps/snap0'
>>> create snapshot /btrfs/snaps/snap1
>>> Create a snapshot of '/btrfs' in '/btrfs/snaps/snap1'
>>> qgroupid         rfer         excl
>>> --------         ----         ----
>>> 0/5         641.34MiB     16.00KiB
>>> 0/258       641.34MiB     16.00KiB
>>> 0/259       641.34MiB     16.00KiB
>>> Checking filesystem on /dev/vdb1
>>> UUID: a0b648b1-7a23-4213-9bc3-db02b8520efe
>>> checking extents
>>> checking free space cache
>>> checking fs roots
>>> checking csums
>>> checking root refs
>>> checking quota groups
>>> Counts for qgroup id: 5 are different
>>> our:     referenced 672497664 referenced compressed 672497664
>>> disk:    referenced 672497664 referenced compressed 672497664
>>> our:     exclusive 49152 exclusive compressed 49152
>>> disk:    exclusive 16384 exclusive compressed 16384
>>> diff:    exclusive 32768 exclusive compressed 32768
>>> found 673562626 bytes used err is 0
>>> total csum bytes: 656384
>>> total tree bytes: 1425408
>>> total fs tree bytes: 442368
>>> total extent tree bytes: 98304
>>> btree space waste bytes: 385361
>>> file data blocks allocated: 672661504
>>>  referenced 672661504
>>> extent buffer leak: start 30965760 len 16384
>>> extent buffer leak: start 30998528 len 16384
>>> extent buffer leak: start 31014912 len 16384
>>
>> I recently found btrfsck --qgroup-report itself is not stable.
>
> No, it works - I just don't think you're reading the output correctly.
> Notice above in my example that btrfsck shows a difference in exclusive
> counts? You don't have that below. Also it reports 'Counts for qgroup id: X
> are different' when there is a difference as you see above.
>
>
>>
>> So here I prefer to do the accounting check by qgroup rescan, and
>> compare the binary output.
>>>
>>>
>>>
>>> How are you testing this on your end?
>>
>> Much like yours, but with smaller fs size, about 16M.
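(For reference, the test I ran was essentially the script below. Treat it
as a sketch matching the output quoted underneath; the device path and the
data file names are only placeholders, not the exact ones I used.)

#!/bin/bash

DEV=/dev/vdX        # placeholder: any scratch device
MNT=/mnt/test

mkfs.btrfs -f $DEV
mount -t btrfs $DEV $MNT
btrfs quota enable $MNT
btrfs qgroup show -pcre $MNT

echo "populating '$MNT' with 16 normal files, 1M size"
for i in `seq -w 1 16`; do
    dd if=/dev/zero of="$MNT/file$i" bs=1M count=1 >& /dev/null
done
sync

for i in `seq 1 3`; do
    btrfs subvolume snapshot "$MNT" "$MNT/snap$i"
done

# numbers as accounted at snapshot creation time
btrfs qgroup show -pcre $MNT

# rescan and show again: if the numbers change after the rescan,
# the runtime accounting was wrong
btrfs quota rescan -w $MNT
btrfs qgroup show -pcre $MNT

umount $MNT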
>>
>> The result looks pretty good:
>> (As I don't trust btrfsck --qgroup-report for now, I use rescan to
>> check the result)
>> ------
>> qgroupid         rfer         excl     max_rfer     max_excl parent  child
>> --------         ----         ----     --------     -------- ------  -----
>> 0/5          16.00KiB     16.00KiB         none         none ---     ---
>> populating '/mnt/test' with 16 normal files, 1M size
>> sync
>> Create a snapshot of '/mnt/test' in '/mnt/test/snap1'
>> Create a snapshot of '/mnt/test' in '/mnt/test/snap2'
>> Create a snapshot of '/mnt/test' in '/mnt/test/snap3'
>> qgroupid         rfer         excl     max_rfer     max_excl parent  child
>> --------         ----         ----     --------     -------- ------  -----
>> 0/5          16793600        16384         none         none ---     ---
>> 0/258        16793600        16384         none         none ---     ---
>> 0/259        16793600        16384         none         none ---     ---
>> 0/260        16793600        16384         none         none ---     ---
>> quota rescan started
>> qgroupid         rfer         excl     max_rfer     max_excl parent  child
>> --------         ----         ----     --------     -------- ------  -----
>> 0/5          16793600        16384         none         none ---     ---
>> 0/258        16793600        16384         none         none ---     ---
>> 0/259        16793600        16384         none         none ---     ---
>> 0/260        16793600        16384         none         none ---     ---
>> ------
>>
>> And yes, even after rescan, btrfsck --qgroup-report seems to report
>> a false alert.
>> ------
>> Counts for qgroup id: 5
>> our:     referenced 16793600 referenced compressed 16793600
>> disk:    referenced 16793600 referenced compressed 16793600
>> our:     exclusive 16384 exclusive compressed 16384
>> disk:    exclusive 16384 exclusive compressed 16384
>> Counts for qgroup id: 258
>> our:     referenced 16793600 referenced compressed 16793600
>> disk:    referenced 16793600 referenced compressed 16793600
>> our:     exclusive 16384 exclusive compressed 16384
>> disk:    exclusive 16384 exclusive compressed 16384
>> Counts for qgroup id: 259
>> our:     referenced 16793600 referenced compressed 16793600
>> disk:    referenced 16793600 referenced compressed 16793600
>> our:     exclusive 16384 exclusive compressed 16384
>> disk:    exclusive 16384 exclusive compressed 16384
>> Counts for qgroup id: 260
>> our:     referenced 16793600 referenced compressed 16793600
>> disk:    referenced 16793600 referenced compressed 16793600
>> our:     exclusive 16384 exclusive compressed 16384
>> disk:    exclusive 16384 exclusive compressed 16384
>
> This isn't reporting any inconsistency - if you look at the numbers, the
> 'disk' and 'our' versions are all the same. Basically, in this case you asked
> for a summary and that's what you're getting.
>
> Please try to use btrfsck to check qgroup inconsistencies, IMHO you will
> find them much faster that way.
>	--Mark
>
> --
> Mark Fasheh
>
>

You're right. I just read the output wrong.

The code shows that with --qgroup-report, btrfsck always reports the info
for every qgroup, whether or not it is corrupted, while without
--qgroup-report it only reports the corrupted ones.

But even when a qgroup is wrong, btrfsck still returns 0, not 1.
I'd better fix the return value along with the extent buffer leak.

Thanks for pointing this out, it will be quite handy for detecting qgroup
bugs.

Thanks,
Qu