Qgroups wrong after snapshot create

linux-btrfs.vger.kernel.org archive mirror

* Qgroups wrong after snapshot create
@ 2016-04-04 23:06 Mark Fasheh
  2016-04-05  1:27 ` Qu Wenruo
  2016-04-05 13:27 ` Qu Wenruo
  0 siblings, 2 replies; 5+ messages in thread
From: Mark Fasheh @ 2016-04-04 23:06 UTC (permalink / raw)
  To: linux-btrfs; +Cc: quwenruo, jbacik, clm

Hi,

Making a snapshot gets us the wrong qgroup numbers. This is very easy to
reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
snapshot. In this example we have a newly created filesystem mounted at
/btrfs:

# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
# btrfs qg show /btrfs

qgroupid         rfer         excl 
--------         ----         ---- 
0/5          32.00KiB     32.00KiB 
0/257        16.00KiB     16.00KiB 


In the example above, the default subvolume (0/5) should read 16KiB
referenced and 16KiB exclusive.

A rescan fixes things, so we know the rescan process is doing the math
right:

# btrfs quota rescan /btrfs
# btrfs qgroup show /btrfs
qgroupid         rfer         excl 
--------         ----         ---- 
0/5          16.00KiB     16.00KiB 
0/257        16.00KiB     16.00KiB 
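
A minimal sketch of the same before/after comparison in bytes, assuming the
`btrfs qgroup show` column layout shown above (qgroupid, rfer, excl, with
B/KiB/MiB/GiB/TiB suffixes). The helper names are made up for illustration
and are not part of btrfs-progs:

```python
# Size suffixes assumed to appear in `btrfs qgroup show` output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def to_bytes(text):
    """Convert a size such as '32.00KiB' to an integer byte count."""
    # Try the longer suffixes first so 'KiB' is not mistaken for 'B'.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * UNITS[suffix])
    return int(text)


def parse_qgroup_show(output):
    """Map qgroupid -> (rfer_bytes, excl_bytes), skipping header lines."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and "/" in fields[0] and fields[0][0].isdigit():
            table[fields[0]] = (to_bytes(fields[1]), to_bytes(fields[2]))
    return table


BEFORE_RESCAN = """\
qgroupid         rfer         excl
--------         ----         ----
0/5          32.00KiB     32.00KiB
0/257        16.00KiB     16.00KiB
"""

AFTER_RESCAN = """\
qgroupid         rfer         excl
--------         ----         ----
0/5          16.00KiB     16.00KiB
0/257        16.00KiB     16.00KiB
"""

before = parse_qgroup_show(BEFORE_RESCAN)
after = parse_qgroup_show(AFTER_RESCAN)

# Report only the qgroups whose numbers the rescan changed.
changed = {qg: (before[qg], after[qg]) for qg in before if before[qg] != after.get(qg)}
print(changed)
```

With the two tables from this report, only 0/5 differs: the rescan takes it
from 32768 bytes down to 16384 bytes for both rfer and excl.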



The last kernel to get this right was v4.1:

# uname -r
4.1.20
# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
Create a snapshot of '/btrfs/' in '/btrfs/snap1'
# btrfs qg show /btrfs
qgroupid         rfer         excl 
--------         ----         ---- 
0/5          16.00KiB     16.00KiB 
0/257        16.00KiB     16.00KiB 


Which leads me to believe that this was a regression introduced by Qu's
rewrite as that is the biggest change to qgroups during that development
period.


Going back to upstream, I applied my tracing patch from this list
( http://thread.gmane.org/gmane.comp.file-systems.btrfs/54685 ), with a
couple changes - I'm printing the rfer/excl bytecounts in
qgroup_update_counters AND I print them twice - once before we make any
changes and once after the changes. If I enable tracing in
btrfs_qgroup_account_extent and qgroup_update_counters just before the
snapshot creation, we get the following trace:


# btrfs quota enable /btrfs
# <wait a sec for the rescan to finish>
# echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account_extent/enable
# echo 1 > /sys/kernel/debug/tracing/events/btrfs/qgroup_update_counters/enable
# btrfs sub sna /btrfs/ /btrfs/snap2
Create a snapshot of '/btrfs/' in '/btrfs/snap2'
# btrfs qg show /btrfs
qgroupid         rfer         excl 
--------         ----         ---- 
0/5          32.00KiB     32.00KiB 
0/257        16.00KiB     16.00KiB 
# cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:2
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           btrfs-10233 [001] .... 260298.823339: btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
           btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
           btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
           btrfs-10233 [001] .... 260298.823343: btrfs_qgroup_account_extent: bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
           btrfs-10233 [001] .... 260298.823345: btrfs_qgroup_account_extent: bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
           btrfs-10233 [001] .... 260298.823347: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
           btrfs-10233 [001] .... 260298.823347: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
           btrfs-10233 [001] .... 260298.823348: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
           btrfs-10233 [001] .... 260298.823421: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
           btrfs-10233 [001] .... 260298.823422: btrfs_qgroup_account_extent: bytenr = 29835264, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
           btrfs-10233 [001] .... 260298.823425: btrfs_qgroup_account_extent: bytenr = 29851648, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
           btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
           btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768

If you read through the whole log we do some... interesting things - at
the start, we *subtract* from qgroup 5, making its count go to zero. I want
to say that this is kind of unexpected for a snapshot create, but perhaps
there's something I'm missing.

Remember that I'm printing each qgroup twice in qgroup_update_counters (once
before, once after). So we can see that extent 29851648 (len 16k) is the
extent being counted against qgroup 5, which makes the count invalid.
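
To make the before/after pairs easier to read, here is a minimal sketch that
turns them into per-extent deltas. It assumes the modified tracepoint format
shown above, with the two qgroup_update_counters lines for a qgroup appearing
back to back. The trace text is abridged from the log above (task/timestamp
prefixes and the extents with no counter updates are dropped), and the
function name is made up for illustration:

```python
import re

EXTENT_RE = re.compile(r"btrfs_qgroup_account_extent: bytenr = (\d+)")
COUNTER_RE = re.compile(
    r"qgroup_update_counters: qgid = (\d+), .*rfer = (\d+), excl = (\d+)"
)

TRACE = """\
btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29851648, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768
"""


def counter_deltas(trace_text):
    """Return (bytenr, qgid, rfer_delta, excl_delta) per before/after pair."""
    deltas = []
    bytenr = None
    before = None  # first line of the pair currently being matched
    for line in trace_text.splitlines():
        m = EXTENT_RE.search(line)
        if m:
            bytenr = int(m.group(1))
            before = None
            continue
        m = COUNTER_RE.search(line)
        if not m:
            continue
        qgid, rfer, excl = (int(g) for g in m.groups())
        if before is None:
            before = (qgid, rfer, excl)
        else:
            deltas.append((bytenr, qgid, rfer - before[1], excl - before[2]))
            before = None
    return deltas


for bytenr, qgid, d_rfer, d_excl in counter_deltas(TRACE):
    print(f"extent {bytenr}: qgroup {qgid} rfer {d_rfer:+d} excl {d_excl:+d}")
```

Read this way, the log is one subtraction of 16384 followed by two additions
of 16384, so qgroup 5 ends up 16384 bytes above where it started.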

From a btrfs-debug-tree I get the following records referencing that extent:

From the root tree:
        item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
                root data bytenr 29851648 level 0 dirid 256 refs 1 gen 10 lastsnap 10
                uuid 00000000-0000-0000-0000-000000000000
                ctransid 10 otransid 0 stransid 0 rtransid 0

From the extent tree:
        item 9 key (29851648 METADATA_ITEM 0) itemoff 15960 itemsize 33
                extent refs 1 gen 10 flags TREE_BLOCK
                tree block skinny level 0
                tree block backref root 5

And here is the block itself:

fs tree key (FS_TREE ROOT_ITEM 0) 
leaf 29851648 items 4 free space 15941 generation 10 owner 5
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
        item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
                inode generation 3 transid 10 size 10 nbytes 16384
                block group 0 mode 40755 links 1 uid 0 gid 0
                rdev 0 flags 0x0
        item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
                inode ref index 0 namelen 2 name: ..
        item 2 key (256 DIR_ITEM 3390559794) itemoff 16076 itemsize 35
                location key (257 ROOT_ITEM -1) type DIR
                namelen 5 datalen 0 name: snap2
        item 3 key (256 DIR_INDEX 2) itemoff 16041 itemsize 35
                location key (257 ROOT_ITEM -1) type DIR
                namelen 5 datalen 0 name: snap2


So unless I'm mistaken, it seems like we're counting the original snapshot
root against itself when creating a snapshot.

I found this looking for what I believe to be a _different_ corruption in
qgroups. In the meantime, while I track that one down, I was hoping that
someone might be able to shed some light on this particular issue.

Qu, do you have any ideas how we might fix this?

Thanks,
	--Mark

PS: I have attached the output of btrfs-debug-tree for the FS used in this
example.

--
Mark Fasheh

[-- Attachment #2: debug-tree.txt --]
[-- Type: text/plain, Size: 10754 bytes --]

root tree
leaf 29884416 items 17 free space 11820 generation 11 owner 1
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
		root data bytenr 29900800 level 0 dirid 0 refs 1 gen 11 lastsnap 0
		uuid 00000000-0000-0000-0000-000000000000
	item 1 key (DEV_TREE ROOT_ITEM 0) itemoff 15405 itemsize 439
		root data bytenr 29507584 level 0 dirid 0 refs 1 gen 6 lastsnap 0
		uuid 00000000-0000-0000-0000-000000000000
	item 2 key (FS_TREE INODE_REF 6) itemoff 15388 itemsize 17
		inode ref index 0 namelen 7 name: default
	item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
		root data bytenr 29851648 level 0 dirid 256 refs 1 gen 10 lastsnap 10
		uuid 00000000-0000-0000-0000-000000000000
		ctransid 10 otransid 0 stransid 0 rtransid 0
	item 4 key (FS_TREE ROOT_REF 257) itemoff 14926 itemsize 23
		root ref key dirid 256 sequence 2 name snap2
	item 5 key (ROOT_TREE_DIR INODE_ITEM 0) itemoff 14766 itemsize 160
		inode generation 3 transid 0 size 0 nbytes 16384
		block group 0 mode 40755 links 1 uid 0 gid 0
		rdev 0 flags 0x0
	item 6 key (ROOT_TREE_DIR INODE_REF 6) itemoff 14754 itemsize 12
		inode ref index 0 namelen 2 name: ..
	item 7 key (ROOT_TREE_DIR DIR_ITEM 2378154706) itemoff 14717 itemsize 37
		location key (FS_TREE ROOT_ITEM -1) type DIR
		namelen 7 datalen 0 name: default
	item 8 key (CSUM_TREE ROOT_ITEM 0) itemoff 14278 itemsize 439
		root data bytenr 29933568 level 0 dirid 0 refs 1 gen 11 lastsnap 0
		uuid 00000000-0000-0000-0000-000000000000
	item 9 key (QUOTA_TREE ROOT_ITEM 0) itemoff 13839 itemsize 439
		root data bytenr 29917184 level 0 dirid 0 refs 1 gen 11 lastsnap 0
		uuid d66e47c6-9943-ae4e-9adb-6d97065f6358
	item 10 key (UUID_TREE ROOT_ITEM 0) itemoff 13400 itemsize 439
		root data bytenr 29802496 level 0 dirid 0 refs 1 gen 10 lastsnap 0
		uuid 4bded89b-be0f-ba46-becf-15604fcc58fc
	item 11 key (256 INODE_ITEM 0) itemoff 13240 itemsize 160
		inode generation 11 transid 11 size 262144 nbytes 1572864
		block group 0 mode 100600 links 1 uid 0 gid 0
		rdev 0 flags 0x1b
	item 12 key (256 EXTENT_DATA 0) itemoff 13187 itemsize 53
		extent data disk byte 12845056 nr 262144
		extent data offset 0 nr 262144 ram 262144
		extent compression 0
	item 13 key (257 ROOT_ITEM 10) itemoff 12748 itemsize 439
		root data bytenr 29736960 level 0 dirid 256 refs 1 gen 10 lastsnap 10
		uuid fb326c16-07e8-4944-aba6-9154d860322c
	item 14 key (257 ROOT_BACKREF 5) itemoff 12725 itemsize 23
		root backref key dirid 256 sequence 2 name snap2
	item 15 key (FREE_SPACE UNTYPED 29360128) itemoff 12684 itemsize 41
		location key (256 INODE_ITEM 0)
		cache generation 11 entries 10 bitmaps 0
	item 16 key (DATA_RELOC_TREE ROOT_ITEM 0) itemoff 12245 itemsize 439
		root data bytenr 29442048 level 0 dirid 256 refs 1 gen 4 lastsnap 0
		uuid 00000000-0000-0000-0000-000000000000
chunk tree
leaf 20987904 items 4 free space 15781 generation 5 owner 3
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
		dev item devid 1 total_bytes 17178820608 bytes used 2172649472
		dev uuid 24080b38-13bb-4f4c-8c9f-c6d5313c8621
	item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 12582912) itemoff 16105 itemsize 80
		chunk length 8388608 owner 2 stripe_len 65536
		type DATA num_stripes 1
			stripe 0 devid 1 offset 12582912
			dev uuid: 24080b38-13bb-4f4c-8c9f-c6d5313c8621
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15993 itemsize 112
		chunk length 8388608 owner 2 stripe_len 65536
		type SYSTEM|DUP num_stripes 2
			stripe 0 devid 1 offset 20971520
			dev uuid: 24080b38-13bb-4f4c-8c9f-c6d5313c8621
			stripe 1 devid 1 offset 29360128
			dev uuid: 24080b38-13bb-4f4c-8c9f-c6d5313c8621
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 29360128) itemoff 15881 itemsize 112
		chunk length 1073741824 owner 2 stripe_len 65536
		type METADATA|DUP num_stripes 2
			stripe 0 devid 1 offset 37748736
			dev uuid: 24080b38-13bb-4f4c-8c9f-c6d5313c8621
			stripe 1 devid 1 offset 1111490560
			dev uuid: 24080b38-13bb-4f4c-8c9f-c6d5313c8621
extent tree key (EXTENT_TREE ROOT_ITEM 0) 
leaf 29900800 items 14 free space 15478 generation 11 owner 2
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (12582912 BLOCK_GROUP_ITEM 8388608) itemoff 16259 itemsize 24
		block group used 262144 chunk_objectid 256 flags DATA
	item 1 key (12845056 EXTENT_ITEM 262144) itemoff 16206 itemsize 53
		extent refs 1 gen 11 flags DATA
		extent data backref root 1 objectid 256 offset 0 count 1
	item 2 key (20971520 BLOCK_GROUP_ITEM 8388608) itemoff 16182 itemsize 24
		block group used 16384 chunk_objectid 256 flags SYSTEM|DUP
	item 3 key (20987904 METADATA_ITEM 0) itemoff 16149 itemsize 33
		extent refs 1 gen 5 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 3
	item 4 key (29360128 BLOCK_GROUP_ITEM 1073741824) itemoff 16125 itemsize 24
		block group used 147456 chunk_objectid 256 flags METADATA|DUP
	item 5 key (29442048 METADATA_ITEM 0) itemoff 16092 itemsize 33
		extent refs 1 gen 4 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 18446744073709551607
	item 6 key (29507584 METADATA_ITEM 0) itemoff 16059 itemsize 33
		extent refs 1 gen 6 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 4
	item 7 key (29736960 METADATA_ITEM 0) itemoff 16026 itemsize 33
		extent refs 1 gen 10 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 257
	item 8 key (29802496 METADATA_ITEM 0) itemoff 15993 itemsize 33
		extent refs 1 gen 10 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 9
	item 9 key (29851648 METADATA_ITEM 0) itemoff 15960 itemsize 33
		extent refs 1 gen 10 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 5
	item 10 key (29884416 METADATA_ITEM 0) itemoff 15927 itemsize 33
		extent refs 1 gen 11 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 1
	item 11 key (29900800 METADATA_ITEM 0) itemoff 15894 itemsize 33
		extent refs 1 gen 11 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 2
	item 12 key (29917184 METADATA_ITEM 0) itemoff 15861 itemsize 33
		extent refs 1 gen 11 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 8
	item 13 key (29933568 METADATA_ITEM 0) itemoff 15828 itemsize 33
		extent refs 1 gen 11 flags TREE_BLOCK
		tree block skinny level 0
		tree block backref root 7
device tree key (DEV_TREE ROOT_ITEM 0) 
leaf 29507584 items 6 free space 15853 generation 6 owner 4
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (0 DEV_STATS 1) itemoff 16243 itemsize 40
		device stats
	item 1 key (1 DEV_EXTENT 12582912) itemoff 16195 itemsize 48
		dev extent chunk_tree 3
		chunk objectid 256 chunk offset 12582912 length 8388608
	item 2 key (1 DEV_EXTENT 20971520) itemoff 16147 itemsize 48
		dev extent chunk_tree 3
		chunk objectid 256 chunk offset 20971520 length 8388608
	item 3 key (1 DEV_EXTENT 29360128) itemoff 16099 itemsize 48
		dev extent chunk_tree 3
		chunk objectid 256 chunk offset 20971520 length 8388608
	item 4 key (1 DEV_EXTENT 37748736) itemoff 16051 itemsize 48
		dev extent chunk_tree 3
		chunk objectid 256 chunk offset 29360128 length 1073741824
	item 5 key (1 DEV_EXTENT 1111490560) itemoff 16003 itemsize 48
		dev extent chunk_tree 3
		chunk objectid 256 chunk offset 29360128 length 1073741824
fs tree key (FS_TREE ROOT_ITEM 0) 
leaf 29851648 items 4 free space 15941 generation 10 owner 5
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 3 transid 10 size 10 nbytes 16384
		block group 0 mode 40755 links 1 uid 0 gid 0
		rdev 0 flags 0x0
	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
		inode ref index 0 namelen 2 name: ..
	item 2 key (256 DIR_ITEM 3390559794) itemoff 16076 itemsize 35
		location key (257 ROOT_ITEM -1) type DIR
		namelen 5 datalen 0 name: snap2
	item 3 key (256 DIR_INDEX 2) itemoff 16041 itemsize 35
		location key (257 ROOT_ITEM -1) type DIR
		namelen 5 datalen 0 name: snap2
checksum tree key (CSUM_TREE ROOT_ITEM 0) 
leaf 29933568 items 0 free space 16283 generation 11 owner 7
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
quota tree key (QUOTA_TREE ROOT_ITEM 0) 
leaf 29917184 items 5 free space 15966 generation 11 owner 8
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (0 QGROUP_STATUS 0) itemoff 16251 itemsize 32
		version 1 generation 11 flags ON scan -1
	item 1 key (0 QGROUP_INFO 0/5) itemoff 16211 itemsize 40
		generation 10
		referenced 32768 referenced compressed 32768
		exclusive 32768 exclusive compressed 32768
	item 2 key (0 QGROUP_INFO 0/257) itemoff 16171 itemsize 40
		generation 10
		referenced 16384 referenced compressed 16384
		exclusive 16384 exclusive compressed 16384
	item 3 key (0 QGROUP_LIMIT 0/5) itemoff 16131 itemsize 40
		flags 0
		max referenced 0 max exclusive 0
		rsv referenced 0 rsv exclusive 0
	item 4 key (0 QGROUP_LIMIT 0/257) itemoff 16091 itemsize 40
		flags 0
		max referenced 0 max exclusive 0
		rsv referenced 0 rsv exclusive 0
uuid tree key (UUID_TREE ROOT_ITEM 0) 
leaf 29802496 items 1 free space 16250 generation 10 owner 9
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (0x4449e807166c32fb UUID_KEY_SUBVOL 0x2c3260d85491a6ab) itemoff 16275 itemsize 8
		subvol_id 257
file tree key (257 ROOT_ITEM 10) 
leaf 29736960 items 2 free space 16061 generation 10 owner 257
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 3 transid 0 size 0 nbytes 16384
		block group 0 mode 40755 links 1 uid 0 gid 0
		rdev 0 flags 0x0
	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
		inode ref index 0 namelen 2 name: ..
data reloc tree key (DATA_RELOC_TREE ROOT_ITEM 0) 
leaf 29442048 items 2 free space 16061 generation 4 owner 18446744073709551607
fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 3 transid 0 size 0 nbytes 16384
		block group 0 mode 40755 links 1 uid 0 gid 0
		rdev 0 flags 0x0
	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
		inode ref index 0 namelen 2 name: ..
total bytes 17178820608
bytes used 425984
uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
btrfs-progs v4.4+20160122


* Re: Qgroups wrong after snapshot create
  2016-04-04 23:06 Qgroups wrong after snapshot create Mark Fasheh
@ 2016-04-05  1:27 ` Qu Wenruo
  2016-04-05 22:16   ` Mark Fasheh
  2016-04-05 13:27 ` Qu Wenruo
  1 sibling, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2016-04-05  1:27 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs; +Cc: jbacik, clm

Hi,

Thanks for the report.

Mark Fasheh wrote on 2016/04/04 16:06 -0700:
> Hi,
>
> Making a snapshot gets us the wrong qgroup numbers. This is very easy to
> reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
> snapshot. In this example we have mounted a newly created fresh filesystem
> and mounted it at /btrfs:
>
> # btrfs quota enable /btrfs
> # btrfs sub sna /btrfs/ /btrfs/snap1
> # btrfs qg show /btrfs
>
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          32.00KiB     32.00KiB
> 0/257        16.00KiB     16.00KiB
>

Also reproduced it.

My first guess is that the old snapshot qgroup hack is involved.

Unlike btrfs_inc/dec_extent_ref(), snapshotting just uses a dirty hack to
handle it:
Copy rfer from the source subvolume, and directly set excl to nodesize.

If that work happens before the snapshot inode is added to the src
subvolume, it may be what is causing the bug.
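
A toy model of the shortcut described above, only to show what it computes.
This is not the kernel code: the class and function names are hypothetical,
and only the two assignments named in the text (rfer copied from the source,
excl set to nodesize) are modelled:

```python
from dataclasses import dataclass


@dataclass
class Qgroup:
    rfer: int = 0  # referenced bytes
    excl: int = 0  # exclusive bytes


def inherit_for_snapshot(src, nodesize):
    """Build the new snapshot's qgroup without walking any extents."""
    # rfer is copied from the source subvolume; excl is set to one tree block.
    return Qgroup(rfer=src.rfer, excl=nodesize)


# Numbers from the report: an empty fs tree with 16KiB nodesize.
src = Qgroup(rfer=16384, excl=16384)
snap = inherit_for_snapshot(src, nodesize=16384)
print(snap)
```

For the filesystem in this report the result matches the 16KiB/16KiB shown
for 0/257. The shortcut never runs the per-extent accounting, so any change
made to the source subvolume in the same transaction has to be accounted
correctly on its own.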

>
> In the example above, the default subvolume (0/5) should read 16KiB
> referenced and 16KiB exclusive.
>
> A rescan fixes things, so we know the rescan process is doing the math
> right:
>
> # btrfs quota rescan /btrfs
> # btrfs qgroup show /btrfs
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          16.00KiB     16.00KiB
> 0/257        16.00KiB     16.00KiB
>

So the core of the qgroup code is not affected; otherwise we might need
another painful rework.

>
>
> The last kernel to get this right was v4.1:
>
> # uname -r
> 4.1.20
> # btrfs quota enable /btrfs
> # btrfs sub sna /btrfs/ /btrfs/snap1
> Create a snapshot of '/btrfs/' in '/btrfs/snap1'
> # btrfs qg show /btrfs
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          16.00KiB     16.00KiB
> 0/257        16.00KiB     16.00KiB
>
>
> Which leads me to believe that this was a regression introduced by Qu's
> rewrite as that is the biggest change to qgroups during that development
> period.
>
>
> Going back to upstream, I applied my tracing patch from this list
> ( http://thread.gmane.org/gmane.comp.file-systems.btrfs/54685 ), with a
> couple changes - I'm printing the rfer/excl bytecounts in
> qgroup_update_counters AND I print them twice - once before we make any
> changes and once after the changes. If I enable tracing in
> btrfs_qgroup_account_extent and qgroup_update_counters just before the
> snapshot creation, we get the following trace:
>
>
> # btrfs quota enable /btrfs
> # <wait a sec for the rescan to finish>
> # echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account_extent/enable
> # echo 1 > //sys/kernel/debug/tracing/events/btrfs/qgroup_update_counters/enable
> # btrfs sub sna /btrfs/ /btrfs/snap2
> Create a snapshot of '/btrfs/' in '/btrfs/snap2'
> # btrfs qg show /btrfs
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          32.00KiB     32.00KiB
> 0/257        16.00KiB     16.00KiB
> # fstest1:~ # cat /sys/kernel/debug/tracing/trace
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 13/13   #P:2
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
>             btrfs-10233 [001] .... 260298.823339: btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
>             btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
>             btrfs-10233 [001] .... 260298.823343: btrfs_qgroup_account_extent: bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823345: btrfs_qgroup_account_extent: bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823347: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1

Now, for extent 29786112, its nr_new_roots is 1.

>             btrfs-10233 [001] .... 260298.823347: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>             btrfs-10233 [001] .... 260298.823348: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>             btrfs-10233 [001] .... 260298.823421: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0

Now the problem is here: nr_old_roots should be 1, not 0.
Just as the previous trace shows, we increased the extent ref on that
extent, but now it has dropped back to 0.

Since nr_old_roots == nr_new_roots == 0, the qgroup code doesn't do
anything with it.
If nr_old_roots were 1, qgroup would drop its excl/rfer to 0, and then
the accounting might go back to normal.
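
To make the effect concrete, here is a toy replay of the seven
btrfs_qgroup_account_extent events from the trace against a single qgroup.
It is a simplified model, not the kernel algorithm: it assumes every root
counted in nr_old_roots/nr_new_roots is root 5, and it tracks only rfer
(excl moves identically in this trace). The function name and the event
tuples are made up for illustration:

```python
def replay(events, start):
    """Apply (bytenr, num_bytes, nr_old_roots, nr_new_roots) to one qgroup."""
    rfer = start
    for bytenr, num_bytes, nr_old, nr_new in events:
        if nr_old and not nr_new:
            rfer -= num_bytes  # extent no longer referenced by the root
        elif nr_new and not nr_old:
            rfer += num_bytes  # extent newly referenced by the root
        # otherwise (both zero, or referenced before and after): no change
    return rfer


# The events exactly as they appear in the trace.
OBSERVED = [
    (29360128, 16384, 1, 0),
    (29720576, 16384, 0, 0),
    (29736960, 16384, 0, 0),
    (29786112, 16384, 0, 1),
    (29786112, 16384, 0, 0),  # second pass: nr_old_roots reported as 0
    (29835264, 16384, 0, 0),
    (29851648, 16384, 0, 1),
]

# Same events, but the second pass over 29786112 sees nr_old_roots == 1.
EXPECTED = list(OBSERVED)
EXPECTED[4] = (29786112, 16384, 1, 0)

print(replay(OBSERVED, start=16384))  # 32768
print(replay(EXPECTED, start=16384))  # 16384
```

Under these assumptions the observed sequence ends at 32768, matching the
wrong 32KiB for 0/5, while the sequence with nr_old_roots == 1 on the second
pass ends at 16384, which is what the rescan reports.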

>             btrfs-10233 [001] .... 260298.823422: btrfs_qgroup_account_extent: bytenr = 29835264, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823425: btrfs_qgroup_account_extent: bytenr = 29851648, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>             btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>             btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768
>
> If you read through the whole log we do some ... interesting.. things - at
> the start, we *subtract* from qgroup 5, making it's count go to zero. I want
> to say that this is kind of unexpected for a snapshot create but perhaps
> there's something I'm missing.
>
> Remember that I'm printing each qgroup twice in qgroup_adjust_counters (once
> before, once after). Sothen we can see then that extent 29851648 (len 16k)
> is the extent being counted against qgroup 5 which makes the count invalid.

It seems that for 29851648 we are doing the right accounting, as we are
modifying the source subvolume to create the new inode for the snapshot.

>
>  From a btrfs-debug-tree I get the following records referencing that extent:
>
>  From the root tree:
>          item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
>                  root data bytenr 29851648 level 0 dirid 256 refs 1 gen 10 lastsnap 10
>                  uuid 00000000-0000-0000-0000-000000000000
>                  ctransid 10 otransid 0 stransid 0 rtransid 0
>
>  From the extent tree:
>          item 9 key (29851648 METADATA_ITEM 0) itemoff 15960 itemsize 33
>                  extent refs 1 gen 10 flags TREE_BLOCK
>                  tree block skinny level 0
>                  tree block backref root 5
>
> And here is the block itself:
>
> fs tree key (FS_TREE ROOT_ITEM 0)
> leaf 29851648 items 4 free space 15941 generation 10 owner 5
> fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
> chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
>          item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
>                  inode generation 3 transid 10 size 10 nbytes 16384
>                  block group 0 mode 40755 links 1 uid 0 gid 0
>                  rdev 0 flags 0x0
>          item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
>                  inode ref index 0 namelen 2 name: ..
>          item 2 key (256 DIR_ITEM 3390559794) itemoff 16076 itemsize 35
>                  location key (257 ROOT_ITEM -1) type DIR
>                  namelen 5 datalen 0 name: snap2
>          item 3 key (256 DIR_INDEX 2) itemoff 16041 itemsize 35
>                  location key (257 ROOT_ITEM -1) type DIR
>                  namelen 5 datalen 0 name: snap2
>
>
> So unless I'm mistaken, it seems like we're counting the original snapshot
> root against itself when creating a snapshot.

I assume this is because we're modifying the source subvolume during the
snapshot creation process.

That is to say, if you use the following subvolume layout, the snapshot
hack code won't cause the bug:

root(5)
|- subvol 257(src)
\- snapshot 258 (src 257).

Anyway, this is really a big bug. I'll investigate it to ensure the new
qgroup code works as expected.

BTW, since we are using the hack for snapshot creation, using it with the
"-i" option will still cause qgroup corruption, as we don't go through
correct accounting, making the higher-level qgroups go crazy.

Thanks,
Qu


>
> I found this looking for what I believe to be a _different_ corruption in
> qgroups. In the meantime while I track that one down though I was hoping
> that someone might be able to shed some light on this particular issue.
>
> Qu, do you have any ideas how we might fix this?
>
> Thanks,
> 	--Mark
>
> PS: I have attached the output of btrfs-debug-tree for the FS used in this
> example.
>
> --
> Mark Fasheh
>
>




* Re: Qgroups wrong after snapshot create
  2016-04-05  1:27 ` Qu Wenruo
@ 2016-04-05 22:16   ` Mark Fasheh
  2016-04-05 22:28     ` Mark Fasheh
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Fasheh @ 2016-04-05 22:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, jbacik, clm

On Tue, Apr 05, 2016 at 09:27:01AM +0800, Qu Wenruo wrote:
> Mark Fasheh wrote on 2016/04/04 16:06 -0700:
> >Hi,
> >
> >Making a snapshot gets us the wrong qgroup numbers. This is very easy to
> >reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
> >snapshot. In this example we have mounted a newly created fresh filesystem
> >and mounted it at /btrfs:
> >
> ># btrfs quota enable /btrfs
> ># btrfs sub sna /btrfs/ /btrfs/snap1
> ># btrfs qg show /btrfs
> >
> >qgroupid         rfer         excl
> >--------         ----         ----
> >0/5          32.00KiB     32.00KiB
> >0/257        16.00KiB     16.00KiB
> >
> 
> Also reproduced it.
> 
> My first idea is, old snapshot qgroup hack is involved.
> 
> Unlike btrfs_inc/dec_extent_ref(), snapshotting just use a dirty
> hack to handle it:
> Copy rfer from source subvolume, and directly set excl to nodesize.
> 
> If such work is before adding snapshot inode into src subvolume, it
> may be the reason causing the bug.

Ok, thanks very much for looking into this Qu.


> >In the example above, the default subvolume (0/5) should read 16KiB
> >referenced and 16KiB exclusive.
> >
> >A rescan fixes things, so we know the rescan process is doing the math
> >right:
> >
> ># btrfs quota rescan /btrfs
> ># btrfs qgroup show /btrfs
> >qgroupid         rfer         excl
> >--------         ----         ----
> >0/5          16.00KiB     16.00KiB
> >0/257        16.00KiB     16.00KiB
> >
> 
> So the base of qgroup code is not affected, or we may need another
> painful rework.

Yeah as far as I can tell the core algorithm is fine. We're just running the
extents incorrectly somehow.


> >#                              _-----=> irqs-off
> >#                             / _----=> need-resched
> >#                            | / _---=> hardirq/softirq
> >#                            || / _--=> preempt-depth
> >#                            ||| /     delay
> >#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> >#              | |       |   ||||       |         |
> >            btrfs-10233 [001] .... 260298.823339: btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
> >            btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
> >            btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
> >            btrfs-10233 [001] .... 260298.823343: btrfs_qgroup_account_extent: bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> >            btrfs-10233 [001] .... 260298.823345: btrfs_qgroup_account_extent: bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> >            btrfs-10233 [001] .... 260298.823347: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> 
> Now, for extent 29786112, its nr_new_roots is 1.
> 
> >            btrfs-10233 [001] .... 260298.823347: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> >            btrfs-10233 [001] .... 260298.823348: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> >            btrfs-10233 [001] .... 260298.823421: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> 
> Now the problem is here, nr_old_roots should be 1, not 0.
> Just as previous trace shows, we increased extent ref on that
> extent, but now it dropped back to 0.
> 
> Since its old_root == new_root == 0, qgroup code doesn't do anything on it.
> If its nr_old_roots is 1, qgroup will drop it's excl/rfer to 0, and
> then accounting may goes back to normal.

Ok, so we're fine with the numbers going to zero so long as it gets back to
where it should be. That also explains the 'strange' behavior I saw.


> >            btrfs-10233 [001] .... 260298.823422: btrfs_qgroup_account_extent: bytenr = 29835264, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> >            btrfs-10233 [001] .... 260298.823425: btrfs_qgroup_account_extent: bytenr = 29851648, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> >            btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> >            btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768
> >
> >If you read through the whole log we do some ... interesting.. things - at
> >the start, we *subtract* from qgroup 5, making it's count go to zero. I want
> >to say that this is kind of unexpected for a snapshot create but perhaps
> >there's something I'm missing.
> >
> >Remember that I'm printing each qgroup twice in qgroup_adjust_counters (once
> >before, once after). Sothen we can see then that extent 29851648 (len 16k)
> >is the extent being counted against qgroup 5 which makes the count invalid.
> 
> It seems that, for 29851648 we are doing right accounting.
> As we are modifying source subvolume to create the new inode for snapshot.

Got it, the real problem is the 2nd pass at 29786112 where it should have
had nr_old_roots == 1.
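
To double-check that reading, here is a toy replay of the trace in Python.
It is only a sketch: replay() and with_fixed_second_pass() are made-up
helper names, and the update rule covers just the single-qgroup case seen
in this trace (only old roots: subtract, only new roots: add). It is not
the real qgroup_update_counters logic.

```python
# Toy replay of the trace for qgroup 0/5. Single-qgroup case only.
# Each event is (bytenr, num_bytes, nr_old_roots, nr_new_roots), copied
# from the btrfs_qgroup_account_extent lines in the trace.
TRACE = [
    (29360128, 16384, 1, 0),
    (29720576, 16384, 0, 0),
    (29736960, 16384, 0, 0),
    (29786112, 16384, 0, 1),
    (29786112, 16384, 0, 0),  # 2nd pass over the same extent
    (29835264, 16384, 0, 0),
    (29851648, 16384, 0, 1),
]


def replay(events, rfer=16384):
    """Apply the simplified add/subtract rule and return the final rfer."""
    for _bytenr, num_bytes, nr_old, nr_new in events:
        if nr_old > 0 and nr_new == 0:
            rfer -= num_bytes  # extent left the qgroup
        elif nr_old == 0 and nr_new > 0:
            rfer += num_bytes  # extent joined the qgroup
        # nr_old == nr_new == 0: nothing to do
    return rfer


def with_fixed_second_pass(events):
    """Same trace, but the 2nd pass at 29786112 sees nr_old_roots == 1."""
    fixed = list(events)
    fixed[4] = (29786112, 16384, 1, 0)
    return fixed


if __name__ == "__main__":
    print(replay(TRACE))                          # 32768, as in the trace
    print(replay(with_fixed_second_pass(TRACE)))  # 16384, the expected value
```

With the trace as recorded we end at 32768, and with nr_old_roots == 1 on
the second pass we end at 16384, which matches what the rescan reports.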


> > From a btrfs-debug-tree I get the following records referencing that extent:
> >
> > From the root tree:
> >         item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
> >                 root data bytenr 29851648 level 0 dirid 256 refs 1 gen 10 lastsnap 10
> >                 uuid 00000000-0000-0000-0000-000000000000
> >                 ctransid 10 otransid 0 stransid 0 rtransid 0
> >
> > From the extent tree:
> >         item 9 key (29851648 METADATA_ITEM 0) itemoff 15960 itemsize 33
> >                 extent refs 1 gen 10 flags TREE_BLOCK
> >                 tree block skinny level 0
> >                 tree block backref root 5
> >
> >And here is the block itself:
> >
> >fs tree key (FS_TREE ROOT_ITEM 0)
> >leaf 29851648 items 4 free space 15941 generation 10 owner 5
> >fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
> >chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
> >         item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
> >                 inode generation 3 transid 10 size 10 nbytes 16384
> >                 block group 0 mode 40755 links 1 uid 0 gid 0
> >                 rdev 0 flags 0x0
> >         item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
> >                 inode ref index 0 namelen 2 name: ..
> >         item 2 key (256 DIR_ITEM 3390559794) itemoff 16076 itemsize 35
> >                 location key (257 ROOT_ITEM -1) type DIR
> >                 namelen 5 datalen 0 name: snap2
> >         item 3 key (256 DIR_INDEX 2) itemoff 16041 itemsize 35
> >                 location key (257 ROOT_ITEM -1) type DIR
> >                 namelen 5 datalen 0 name: snap2
> >
> >
> >So unless I'm mistaken, it seems like we're counting the original snapshot
> >root against itself when creating a snapshot.
> 
> I assume this is because we're modifying the source subvolume at
> snapshot creation process.
> 
> That's to say, if you use the following subvolume layout, snapshot
> hack code won't cause bug:
> 
> root(5)
> |- subvol 257(src)
> \- snapshot 258 (src 257).

Indeed, I can confirm that making subvolumes like that does not reproduce the
issue.


> Anyway, this is really a big bug, I'll investigate it to ensure the
> new qgroup code will work as expected.
> 
> BTW, since we are using the hack for snapshot creation, if using
> with "-i" option, it will still cause qgroup corruption as we didn't
> go through correct accounting, making higher level go crazy.

Right, I have seen that on someone else's FS but so far haven't been able to
corrupt a higher level qgroup on my own.

Thanks,
	--Mark

--
Mark Fasheh

* Re: Qgroups wrong after snapshot create
  2016-04-05 22:16   ` Mark Fasheh
@ 2016-04-05 22:28     ` Mark Fasheh
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Fasheh @ 2016-04-05 22:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, jbacik, clm

On Tue, Apr 05, 2016 at 03:16:54PM -0700, Mark Fasheh wrote:
> On Tue, Apr 05, 2016 at 09:27:01AM +0800, Qu Wenruo wrote:
> > Mark Fasheh wrote on 2016/04/04 16:06 -0700:
> > >Hi,
> > >
> > >Making a snapshot gets us the wrong qgroup numbers. This is very easy to
> > >reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
> > >snapshot. In this example we have mounted a newly created fresh filesystem
> > >and mounted it at /btrfs:
> > >
> > ># btrfs quota enable /btrfs
> > ># btrfs sub sna /btrfs/ /btrfs/snap1
> > ># btrfs qg show /btrfs
> > >
> > >qgroupid         rfer         excl
> > >--------         ----         ----
> > >0/5          32.00KiB     32.00KiB
> > >0/257        16.00KiB     16.00KiB
> > >
> > 
> > Also reproduced it.
> > 
> > My first idea is, old snapshot qgroup hack is involved.
> > 
> > Unlike btrfs_inc/dec_extent_ref(), snapshotting just use a dirty
> > hack to handle it:
> > Copy rfer from source subvolume, and directly set excl to nodesize.
> > 
> > If such work is before adding snapshot inode into src subvolume, it
> > may be the reason causing the bug.
> 
> Ok, thanks very much for looking into this Qu.
> 
> 
> > >In the example above, the default subvolume (0/5) should read 16KiB
> > >referenced and 16KiB exclusive.
> > >
> > >A rescan fixes things, so we know the rescan process is doing the math
> > >right:
> > >
> > ># btrfs quota rescan /btrfs
> > ># btrfs qgroup show /btrfs
> > >qgroupid         rfer         excl
> > >--------         ----         ----
> > >0/5          16.00KiB     16.00KiB
> > >0/257        16.00KiB     16.00KiB
> > >
> > 
> > So the base of qgroup code is not affected, or we may need another
> > painful rework.
> 
> Yeah as far as I can tell the core algorithm is fine. We're just running the
> extents incorrectly somehow.

Btw, I should add - my biggest fear was an algorithm change which would have
made older versions of btrfsck incompatible. It seems though we can still
use it for checking qgroups.
	--Mark

--
Mark Fasheh

* Re: Qgroups wrong after snapshot create
  2016-04-04 23:06 Qgroups wrong after snapshot create Mark Fasheh
  2016-04-05  1:27 ` Qu Wenruo
@ 2016-04-05 13:27 ` Qu Wenruo
  1 sibling, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2016-04-05 13:27 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs; +Cc: quwenruo, jbacik, clm



On 04/05/2016 07:06 AM, Mark Fasheh wrote:
> Hi,
>
> Making a snapshot gets us the wrong qgroup numbers. This is very easy to
> reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
> snapshot. In this example we have mounted a newly created fresh filesystem
> and mounted it at /btrfs:
>
> # btrfs quota enable /btrfs
> # btrfs sub sna /btrfs/ /btrfs/snap1
> # btrfs qg show /btrfs
>
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          32.00KiB     32.00KiB
> 0/257        16.00KiB     16.00KiB
>
>
> In the example above, the default subvolume (0/5) should read 16KiB
> referenced and 16KiB exclusive.
>
> A rescan fixes things, so we know the rescan process is doing the math
> right:
>
> # btrfs quota rescan /btrfs
> # btrfs qgroup show /btrfs
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          16.00KiB     16.00KiB
> 0/257        16.00KiB     16.00KiB
>
>
>
> The last kernel to get this right was v4.1:
>
> # uname -r
> 4.1.20
> # btrfs quota enable /btrfs
> # btrfs sub sna /btrfs/ /btrfs/snap1
> Create a snapshot of '/btrfs/' in '/btrfs/snap1'
> # btrfs qg show /btrfs
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          16.00KiB     16.00KiB
> 0/257        16.00KiB     16.00KiB
>
>
> Which leads me to believe that this was a regression introduced by Qu's
> rewrite as that is the biggest change to qgroups during that development
> period.
>
>
> Going back to upstream, I applied my tracing patch from this list
> ( http://thread.gmane.org/gmane.comp.file-systems.btrfs/54685 ), with a
> couple changes - I'm printing the rfer/excl bytecounts in
> qgroup_update_counters AND I print them twice - once before we make any
> changes and once after the changes. If I enable tracing in
> btrfs_qgroup_account_extent and qgroup_update_counters just before the
> snapshot creation, we get the following trace:
>
>
> # btrfs quota enable /btrfs
> # <wait a sec for the rescan to finish>
> # echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account_extent/enable
> # echo 1 > //sys/kernel/debug/tracing/events/btrfs/qgroup_update_counters/enable
> # btrfs sub sna /btrfs/ /btrfs/snap2
> Create a snapshot of '/btrfs/' in '/btrfs/snap2'
> # btrfs qg show /btrfs
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          32.00KiB     32.00KiB
> 0/257        16.00KiB     16.00KiB
> # fstest1:~ # cat /sys/kernel/debug/tracing/trace
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 13/13   #P:2
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
>             btrfs-10233 [001] .... 260298.823339: btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
>             btrfs-10233 [001] .... 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
>             btrfs-10233 [001] .... 260298.823343: btrfs_qgroup_account_extent: bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823345: btrfs_qgroup_account_extent: bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823347: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>             btrfs-10233 [001] .... 260298.823347: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>             btrfs-10233 [001] .... 260298.823348: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>             btrfs-10233 [001] .... 260298.823421: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0

I think I caught the direct cause.

Here, bytenr 29786112 got quota dirtied twice: once for the COW to a 
new block (although I don't get why it got COWed, as it is COWed again 
later), and once for the de-ref.

But the problem is, qgroup accounting is run twice inside the same 
transaction.

The current qgroup design depends heavily on transactions to get the 
correct old and new roots.
It always gets old roots from the *COMMIT ROOT*, and gets new roots from 
the current root just before *SWITCHING COMMIT ROOT*.

However, in this case we got new roots in the middle of a transaction 
and updated qgroup accounting.
This means we get the wrong new_roots for an extent which will never be 
committed to disk.

In this case, that extent is bytenr 29786112. It never reaches disk, so 
it is not on disk in either the transaction before or the one after the 
snapshot.
Normally, its old and new roots should both be 0, and qgroup will skip 
it, so nothing gets corrupted.
However, the current qgroup code doesn't do it like that: it runs delayed 
refs, then updates qgroup accounting, then adds the snapshot inode into 
the fs tree, and finally commits the transaction (which does qgroup 
accounting again).

That's why extent 29786112 gets nr_new_roots = 1 the first time, but 
nr_new_roots = 0 the second time.
(nr_old_roots is always 0 as we didn't switch commit root. The 1st time, 
the extent had just been allocated, so its nr_new_roots = 1; the 2nd 
time, it had been removed, so its nr_new_roots = 0.)


To fix it we have at least 2 methods.
1) Ensure no temporary COWed tree block is created in create_snapshot()
    That eliminates extent 29786112, leaving only the old fs root and the
    final fs root (29851648).

    This fix has some value, as we really don't need the temporary tree
    block, but it's not the real fix.
    (And I haven't found the reason for the extra COW yet)

2) Don't run qgroup accounting in the middle of a transaction
    That's the root fix; I'll use this method for the fix tomorrow.
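
To illustrate why method 2 is the root fix, here is a rough Python model 
of the two orderings. It is not kernel code: account() and snapshot() are 
hypothetical names, only qgroup 5 is modelled, and I assume 29360128 is 
the old fs root block, 29786112 the temporary block and 29851648 the 
final fs root, as the trace suggests.

```python
NODESIZE = 16384

OLD_ROOT = 29360128    # fs root block as of the last commit (assumed)
TEMP_BLOCK = 29786112  # temporary COWed block, never reaches disk
FINAL_ROOT = 29851648  # fs root block after the snapshot


def account(dirty, commit_root, current_root, rfer):
    """One accounting pass over the dirty extents.

    Old roots are looked up in the commit root, new roots in the
    current root, mirroring the design described above.
    """
    for extent in dirty:
        old = extent in commit_root
        new = extent in current_root
        if old and not new:
            rfer -= NODESIZE
        elif new and not old:
            rfer += NODESIZE
        # old == new: skipped, nothing changes
    return rfer


def snapshot(mid_transaction_accounting):
    """Return rfer of qgroup 5 after a modelled snapshot creation."""
    commit_root = {OLD_ROOT}  # not switched until the transaction commits
    rfer = NODESIZE
    dirty = set()

    # First COW: old fs root -> temporary block.
    current_root = {TEMP_BLOCK}
    dirty |= {OLD_ROOT, TEMP_BLOCK}
    if mid_transaction_accounting:
        rfer = account(dirty, commit_root, current_root, rfer)
        dirty = set()

    # Second COW when the snapshot entry is added: temporary -> final.
    current_root = {FINAL_ROOT}
    dirty |= {TEMP_BLOCK, FINAL_ROOT}

    # Accounting at transaction commit.
    return account(dirty, commit_root, current_root, rfer)


if __name__ == "__main__":
    print(snapshot(mid_transaction_accounting=True))   # 32768
    print(snapshot(mid_transaction_accounting=False))  # 16384
```

In the model, accounting once at commit sees the temporary block with no 
old and no new roots and skips it, while the mid-transaction pass counts 
it and nothing ever subtracts it again.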

Thanks for your report again.
Qu
>             btrfs-10233 [001] .... 260298.823422: btrfs_qgroup_account_extent: bytenr = 29835264, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>             btrfs-10233 [001] .... 260298.823425: btrfs_qgroup_account_extent: bytenr = 29851648, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>             btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>             btrfs-10233 [001] .... 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768
>
> If you read through the whole log we do some ... interesting.. things - at
> the start, we *subtract* from qgroup 5, making it's count go to zero. I want
> to say that this is kind of unexpected for a snapshot create but perhaps
> there's something I'm missing.
>
> Remember that I'm printing each qgroup twice in qgroup_adjust_counters (once
> before, once after). Sothen we can see then that extent 29851648 (len 16k)
> is the extent being counted against qgroup 5 which makes the count invalid.
>
>  From a btrfs-debug-tree I get the following records referencing that extent:
>
>  From the root tree:
>          item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
>                  root data bytenr 29851648 level 0 dirid 256 refs 1 gen 10 lastsnap 10
>                  uuid 00000000-0000-0000-0000-000000000000
>                  ctransid 10 otransid 0 stransid 0 rtransid 0
>
>  From the extent tree:
>          item 9 key (29851648 METADATA_ITEM 0) itemoff 15960 itemsize 33
>                  extent refs 1 gen 10 flags TREE_BLOCK
>                  tree block skinny level 0
>                  tree block backref root 5
>
> And here is the block itself:
>
> fs tree key (FS_TREE ROOT_ITEM 0)
> leaf 29851648 items 4 free space 15941 generation 10 owner 5
> fs uuid f7e55c97-b0b3-44e5-bab1-1fd55d54409b
> chunk uuid b78fe016-e35f-4f57-8211-796cbc9be3a4
>          item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
>                  inode generation 3 transid 10 size 10 nbytes 16384
>                  block group 0 mode 40755 links 1 uid 0 gid 0
>                  rdev 0 flags 0x0
>          item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
>                  inode ref index 0 namelen 2 name: ..
>          item 2 key (256 DIR_ITEM 3390559794) itemoff 16076 itemsize 35
>                  location key (257 ROOT_ITEM -1) type DIR
>                  namelen 5 datalen 0 name: snap2
>          item 3 key (256 DIR_INDEX 2) itemoff 16041 itemsize 35
>                  location key (257 ROOT_ITEM -1) type DIR
>                  namelen 5 datalen 0 name: snap2
>
>
> So unless I'm mistaken, it seems like we're counting the original snapshot
> root against itself when creating a snapshot.
>
> I found this looking for what I believe to be a _different_ corruption in
> qgroups. In the meantime while I track that one down though I was hoping
> that someone might be able to shed some light on this particular issue.
>
> Qu, do you have any ideas how we might fix this?
>
> Thanks,
> 	--Mark
>
> PS: I have attached the output of btrfs-debug-tree for the FS used in this
> example.
>
> --
> Mark Fasheh
>


Thread overview: 5+ messages
2016-04-04 23:06 Qgroups wrong after snapshot create Mark Fasheh
2016-04-05  1:27 ` Qu Wenruo
2016-04-05 22:16   ` Mark Fasheh
2016-04-05 22:28     ` Mark Fasheh
2016-04-05 13:27 ` Qu Wenruo
