[PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard
@ 2014-11-26 15:28 Filipe Manana
  2014-11-26 15:28 ` [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed Filipe Manana
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 15:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

This patchset fixes several issues exposed by block group removal/allocation and
trim/discard running in parallel.

The first 3 patches and the last one (6) are independent and don't depend on each
other. Patches 3 and 6 are not really related to trim/discard at all.

I bundled all these patches into the same patchset because all these issues were
triggered by the same test for xfstests that I prepared and will send out soon.
The issues were triggered on a qemu/kvm guest using scsi-hd drives with discard
support enabled (which makes the host do hole punching on the virtual disks'
image files), 4 virtual CPUs and 4Gb of ram. Some of these issues were hard to
trigger and happened like once for every 10+ runs of the test (each test run
takes nearly 30 minutes with my hardware and a debug kernel).

Filipe Manana (6):
  Btrfs: fix invalid block group rbtree access after bg is removed
  Btrfs: fix crash caused by block group removal
  Btrfs: fix freeing used extents after removing empty block group
  Btrfs: fix race between fs trimming and block group remove/allocation
  Btrfs: fix race between writing free space cache and trimming
  Btrfs: make btrfs_abort_transaction consider existence of new block
    groups

 fs/btrfs/ctree.h            | 14 ++++++-
 fs/btrfs/disk-io.c          | 14 +++++++
 fs/btrfs/extent-tree.c      | 90 ++++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/free-space-cache.c | 85 ++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/free-space-cache.h |  2 +
 fs/btrfs/inode-map.c        |  2 +
 fs/btrfs/super.c            |  2 +-
 fs/btrfs/volumes.c          | 33 +++++++++++++----
 8 files changed, 215 insertions(+), 27 deletions(-)

-- 
2.1.3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed
  2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
@ 2014-11-26 15:28 ` Filipe Manana
  2014-11-26 15:58   ` Josef Bacik
  2014-11-26 15:28 ` [PATCH 2/6] Btrfs: fix crash caused by block group removal Filipe Manana
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 15:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 744b580..3ba65d9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3162,7 +3162,19 @@ next_block_group(struct btrfs_root *root,
 		 struct btrfs_block_group_cache *cache)
 {
 	struct rb_node *node;
+
 	spin_lock(&root->fs_info->block_group_cache_lock);
+
+	/* If our block group was removed, we need a full search. */
+	if (RB_EMPTY_NODE(&cache->cache_node)) {
+		const u64 next_bytenr = cache->key.objectid + cache->key.offset;
+
+		spin_unlock(&root->fs_info->block_group_cache_lock);
+		btrfs_put_block_group(cache);
+		cache = btrfs_lookup_first_block_group(root->fs_info,
+						       next_bytenr);
+		return cache;
+	}
 	node = rb_next(&cache->cache_node);
 	btrfs_put_block_group(cache);
 	if (node) {
@@ -9400,6 +9412,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	spin_lock(&root->fs_info->block_group_cache_lock);
 	rb_erase(&block_group->cache_node,
 		 &root->fs_info->block_group_cache_tree);
+	RB_CLEAR_NODE(&block_group->cache_node);
 
 	if (root->fs_info->first_logical_byte == block_group->key.objectid)
 		root->fs_info->first_logical_byte = (u64)-1;
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/6] Btrfs: fix crash caused by block group removal
  2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
  2014-11-26 15:28 ` [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed Filipe Manana
@ 2014-11-26 15:28 ` Filipe Manana
  2014-11-26 15:57   ` Josef Bacik
  2014-11-26 15:28 ` [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group Filipe Manana
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 15:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
[58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
[58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
[58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
[58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
[58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
[58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
[58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
[58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
[58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
[58380.443238] Stack:
[58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
[58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
[58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
[58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
[58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d3ccd09..7f40a65 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
 	unsigned int ro:1;
 	unsigned int dirty:1;
 	unsigned int iref:1;
+	unsigned int has_caching_ctl:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3ba65d9..b7e40ef 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -607,6 +607,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 				cache->cached = BTRFS_CACHE_NO;
 			} else {
 				cache->cached = BTRFS_CACHE_STARTED;
+				cache->has_caching_ctl = 1;
 			}
 		}
 		spin_unlock(&cache->lock);
@@ -627,6 +628,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 			cache->cached = BTRFS_CACHE_NO;
 		} else {
 			cache->cached = BTRFS_CACHE_STARTED;
+			cache->has_caching_ctl = 1;
 		}
 		spin_unlock(&cache->lock);
 		wake_up(&caching_ctl->wait);
@@ -9328,6 +9330,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	int ret;
 	int index;
 	int factor;
+	struct btrfs_caching_control *caching_ctl = NULL;
 
 	root = root->fs_info->extent_root;
 
@@ -9435,8 +9438,32 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		kobject_put(kobj);
 	}
 
+	if (block_group->has_caching_ctl)
+		caching_ctl = get_caching_control(block_group);
 	if (block_group->cached == BTRFS_CACHE_STARTED)
 		wait_block_group_cache_done(block_group);
+	if (block_group->has_caching_ctl) {
+		down_write(&root->fs_info->commit_root_sem);
+		if (!caching_ctl) {
+			struct btrfs_caching_control *ctl;
+
+			list_for_each_entry(ctl,
+				    &root->fs_info->caching_block_groups, list)
+				if (ctl->block_group == block_group) {
+					caching_ctl = ctl;
+					atomic_inc(&caching_ctl->count);
+					break;
+				}
+		}
+		if (caching_ctl)
+			list_del_init(&caching_ctl->list);
+		up_write(&root->fs_info->commit_root_sem);
+		if (caching_ctl) {
+			/* Once for the caching bgs list and once for us. */
+			put_caching_control(caching_ctl);
+			put_caching_control(caching_ctl);
+		}
+	}
 
 	btrfs_remove_free_space_cache(block_group);
 
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group
  2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
  2014-11-26 15:28 ` [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed Filipe Manana
  2014-11-26 15:28 ` [PATCH 2/6] Btrfs: fix crash caused by block group removal Filipe Manana
@ 2014-11-26 15:28 ` Filipe Manana
  2014-11-26 16:02   ` Josef Bacik
  2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 15:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).

That is:

     CPU 1                                CPU 2

                                  btrfs_delete_unused_bgs()
update_block_group()
   add block group to
   fs_info->unused_bgs
                                    got block group from the list
                                    clear_extent_bits for the whole
                                    block group range in freed_extents[]
   set_extent_dirty for the
   range covering the freed
   extent in freed_extents[]
   (fs_info->pinned_extents)

                                  block group deleted, and a new block
                                  group with the same logical address is
                                  created

                                  reserve space from the new block group
                                  for new data or metadata - the reserved
                                  space overlaps the range specified by
                                  CPU 1 for set_extent_dirty()

                                  commit transaction
                                    find all ranges marked as dirty in
                                    fs_info->pinned_extents, clear them
                                    and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
 sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
[ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
[ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
[ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
[ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
[ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
[ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
[ 2163.430042] Stack:
[ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
[ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
[ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
[ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
[ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b7e40ef..92f61f2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5439,7 +5439,17 @@ static int update_block_group(struct btrfs_root *root,
 			spin_unlock(&cache->space_info->lock);
 		} else {
 			old_val -= num_bytes;
+			btrfs_set_block_group_used(&cache->item, old_val);
+			cache->pinned += num_bytes;
+			cache->space_info->bytes_pinned += num_bytes;
+			cache->space_info->bytes_used -= num_bytes;
+			cache->space_info->disk_used -= num_bytes * factor;
+			spin_unlock(&cache->lock);
+			spin_unlock(&cache->space_info->lock);
 
+			set_extent_dirty(info->pinned_extents,
+					 bytenr, bytenr + num_bytes - 1,
+					 GFP_NOFS | __GFP_NOFAIL);
 			/*
 			 * No longer have used bytes in this block group, queue
 			 * it for deletion.
@@ -5453,17 +5463,6 @@ static int update_block_group(struct btrfs_root *root,
 				}
 				spin_unlock(&info->unused_bgs_lock);
 			}
-			btrfs_set_block_group_used(&cache->item, old_val);
-			cache->pinned += num_bytes;
-			cache->space_info->bytes_pinned += num_bytes;
-			cache->space_info->bytes_used -= num_bytes;
-			cache->space_info->disk_used -= num_bytes * factor;
-			spin_unlock(&cache->lock);
-			spin_unlock(&cache->space_info->lock);
-
-			set_extent_dirty(info->pinned_extents,
-					 bytenr, bytenr + num_bytes - 1,
-					 GFP_NOFS | __GFP_NOFAIL);
 		}
 		btrfs_put_block_group(cache);
 		total -= num_bytes;
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
  2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
                   ` (2 preceding siblings ...)
  2014-11-26 15:28 ` [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group Filipe Manana
@ 2014-11-26 15:28 ` Filipe Manana
  2014-11-26 16:15   ` Josef Bacik
                     ` (3 more replies)
  2014-11-26 15:28 ` [PATCH 5/6] Btrfs: fix race between writing free space cache and trimming Filipe Manana
  2014-11-26 15:28 ` [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups Filipe Manana
  5 siblings, 4 replies; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 15:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h            | 13 ++++++++++++-
 fs/btrfs/disk-io.c          | 14 ++++++++++++++
 fs/btrfs/extent-tree.c      | 24 +++++++++++++++++++++++-
 fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++-
 fs/btrfs/volumes.c          | 33 ++++++++++++++++++++++++++-------
 5 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..51056c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
 	unsigned int dirty:1;
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
+	unsigned int removed:1;
 
 	int disk_cache_state;
 
@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
 
 	/* For delayed block group creation or deletion of empty block groups */
 	struct list_head bg_list;
+
+	atomic_t trimming;
 };
 
 /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
+
+	/*
+	 * Chunks that can't be freed yet (under a trim/discard operation)
+	 * and will be latter freed.
+	 */
+	rwlock_t pinned_chunks_lock;
+	struct list_head pinned_chunks;
 };
 
 struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   u64 type, u64 chunk_objectid, u64 chunk_offset,
 			   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start);
+			     struct btrfs_root *root, u64 group_start,
+			     bool *remove_em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9d4fb0a..76012d0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 
+	rwlock_init(&fs_info->pinned_chunks_lock);
+	INIT_LIST_HEAD(&fs_info->pinned_chunks);
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
@@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
 
 	btrfs_free_block_rsv(root, root->orphan_block_rsv);
 	root->orphan_block_rsv = NULL;
+
+	write_lock(&fs_info->pinned_chunks_lock);
+	while (!list_empty(&fs_info->pinned_chunks)) {
+		struct extent_map *em;
+
+		em = list_first_entry(&fs_info->pinned_chunks,
+				      struct extent_map, list);
+		list_del_init(&em->list);
+		free_extent_map(em);
+	}
+	write_unlock(&fs_info->pinned_chunks_lock);
 }
 
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92f61f2..4bf8f02 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	btrfs_init_free_space_ctl(cache);
+	atomic_set(&cache->trimming, 0);
 
 	return cache;
 }
@@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 }
 
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start)
+			     struct btrfs_root *root, u64 group_start,
+			     bool *remove_em)
 {
 	struct btrfs_path *path;
 	struct btrfs_block_group_cache *block_group;
@@ -9474,6 +9476,26 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	memcpy(&key, &block_group->key, sizeof(key));
 
+	spin_lock(&block_group->lock);
+	block_group->removed = 1;
+	/*
+	 * At this point trimming can't start on this block group, because we
+	 * removed the block group from the tree fs_info->block_group_cache_tree
+	 * so no one can't find it anymore.
+	 *
+	 * And we must tell our caller to not remove the extent map from the
+	 * fs_info->mapping_tree to prevent the same logical address range and
+	 * physical device space ranges from being reused for a new block group.
+	 * This is because our fs trim operation (btrfs_trim_fs(),
+	 * btrfs_ioctl_fitrim()) is completely transactionless, so while its
+	 * trimming a range the currently running transaction might finish and
+	 * a new one start, allowing for new block groups to be created that can
+	 * reuse the same physical device locations unless we take this special
+	 * care.
+	 */
+	*remove_em = (atomic_read(&block_group->trimming) == 0);
+	spin_unlock(&block_group->lock);
+
 	btrfs_put_block_group(block_group);
 	btrfs_put_block_group(block_group);
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 3384819..16c2d39 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 
 	*trimmed = 0;
 
+	atomic_inc(&block_group->trimming);
+
 	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+out:
+	if (atomic_dec_and_test(&block_group->trimming) &&
+	    block_group->removed) {
+		struct extent_map_tree *em_tree;
+		struct extent_map *em;
+
+		em_tree = &block_group->fs_info->mapping_tree.map_tree;
+		write_lock(&em_tree->lock);
+		em = lookup_extent_mapping(em_tree, block_group->key.objectid,
+					   1);
+		BUG_ON(!em); /* logic error, can't happen */
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
+
+		write_lock(&block_group->fs_info->pinned_chunks_lock);
+		list_del_init(&em->list);
+		write_unlock(&block_group->fs_info->pinned_chunks_lock);
+
+		/* once for us and once for the tree */
+		free_extent_map(em);
+		free_extent_map(em);
+	}
 
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..e0ad613 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1069,9 +1069,14 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 				   u64 *start, u64 len)
 {
 	struct extent_map *em;
+	struct list_head *search_list = &trans->transaction->pending_chunks;
 	int ret = 0;
 
-	list_for_each_entry(em, &trans->transaction->pending_chunks, list) {
+again:
+	if (search_list == &trans->root->fs_info->pinned_chunks)
+		read_lock(&trans->root->fs_info->pinned_chunks_lock);
+
+	list_for_each_entry(em, search_list, list) {
 		struct map_lookup *map;
 		int i;
 
@@ -1088,6 +1093,12 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 			ret = 1;
 		}
 	}
+	if (search_list == &trans->root->fs_info->pinned_chunks) {
+		read_unlock(&trans->root->fs_info->pinned_chunks_lock);
+	} else {
+		search_list = &trans->root->fs_info->pinned_chunks;
+		goto again;
+	}
 
 	return ret;
 }
@@ -2579,6 +2590,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 	u64 chunk_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
 	u64 chunk_tree = root->fs_info->chunk_root->objectid;
 	int i, ret = 0;
+	bool remove_em;
 
 	/* Just in case */
 	root = root->fs_info->chunk_root;
@@ -2648,18 +2660,25 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset);
+	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset,
+				       &remove_em);
 	if (ret) {
 		btrfs_abort_transaction(trans, extent_root, ret);
 		goto out;
 	}
 
-	write_lock(&em_tree->lock);
-	remove_extent_mapping(em_tree, em);
-	write_unlock(&em_tree->lock);
+	if (remove_em) {
+		write_lock(&em_tree->lock);
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
 
-	/* once for the tree */
-	free_extent_map(em);
+		/* once for the tree */
+		free_extent_map(em);
+	} else {
+		write_lock(&root->fs_info->pinned_chunks_lock);
+		list_move_tail(&em->list, &root->fs_info->pinned_chunks);
+		write_unlock(&root->fs_info->pinned_chunks_lock);
+	}
 out:
 	/* once for us */
 	free_extent_map(em);
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/6] Btrfs: fix race between writing free space cache and trimming
  2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
                   ` (3 preceding siblings ...)
  2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
@ 2014-11-26 15:28 ` Filipe Manana
  2014-12-01 17:04   ` [PATCH v2 " Filipe Manana
  2014-11-26 15:28 ` [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups Filipe Manana
  5 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 15:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits, we will have some free space that won't be used again unless
the free space cache is rebuilt. After the unmount, fsck (btrfsck, btrfs check)
reports the issue like the following example:

        *** fsck.btrfs output ***
        checking extents
        checking free space cache
        There is no free space entry for 521764864-521781248
        There is no free space entry for 521764864-1103101952
        cache appears valid but isnt 29360128
        Checking filesystem on /dev/sdc
        UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
        found 1235681286 bytes used err is -22
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/free-space-cache.c | 59 ++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.h |  2 ++
 fs/btrfs/inode-map.c        |  2 ++
 3 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 16c2d39..6380863 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -31,6 +31,12 @@
 #define BITS_PER_BITMAP		(PAGE_CACHE_SIZE * 8)
 #define MAX_CACHE_BYTES_PER_GIG	(32 * 1024)
 
+struct btrfs_trim_range {
+	u64 start;
+	u64 bytes;
+	struct list_head list;
+};
+
 static int link_free_space(struct btrfs_free_space_ctl *ctl,
 			   struct btrfs_free_space *info);
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
@@ -881,6 +887,7 @@ int write_cache_extent_entries(struct io_ctl *io_ctl,
 	int ret;
 	struct btrfs_free_cluster *cluster = NULL;
 	struct rb_node *node = rb_first(&ctl->free_space_offset);
+	struct btrfs_trim_range *trim_entry;
 
 	/* Get the cluster for this block_group if it exists */
 	if (block_group && !list_empty(&block_group->cluster_list)) {
@@ -916,6 +923,21 @@ int write_cache_extent_entries(struct io_ctl *io_ctl,
 			cluster = NULL;
 		}
 	}
+
+	/*
+	 * Make sure we don't miss any range that was removed from our rbtree
+	 * because trimming is running. Otherwise after a umount+mount (or crash
+	 * after committing the transaction) we would leak free space and get
+	 * an inconsistent free space cache report from fsck.
+	 */
+	list_for_each_entry(trim_entry, &ctl->trimming_ranges, list) {
+		ret = io_ctl_add_entry(io_ctl, trim_entry->start,
+				       trim_entry->bytes, NULL);
+		if (ret)
+			goto fail;
+		*entries += 1;
+	}
+
 	return 0;
 fail:
 	return -ENOSPC;
@@ -1135,10 +1157,12 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
 
 	io_ctl_set_generation(&io_ctl, trans->transid);
 
+	mutex_lock(&ctl->cache_writeout_mutex);
 	/* Write out the extent entries in the free space cache */
 	ret = write_cache_extent_entries(&io_ctl, ctl,
 					 block_group, &entries, &bitmaps,
 					 &bitmap_list);
+	mutex_unlock(&ctl->cache_writeout_mutex);
 	if (ret)
 		goto out_nospc;
 
@@ -2295,6 +2319,8 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group)
 	ctl->start = block_group->key.objectid;
 	ctl->private = block_group;
 	ctl->op = &free_space_op;
+	INIT_LIST_HEAD(&ctl->trimming_ranges);
+	mutex_init(&ctl->cache_writeout_mutex);
 
 	/*
 	 * we only want to have 32k of ram per block group for keeping
@@ -2911,10 +2937,12 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 
 static int do_trimming(struct btrfs_block_group_cache *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
-		       u64 reserved_start, u64 reserved_bytes)
+		       u64 reserved_start, u64 reserved_bytes,
+		       struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
 	u64 trimmed = 0;
@@ -2934,7 +2962,10 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 	if (!ret)
 		*total_trimmed += trimmed;
 
+	mutex_lock(&ctl->cache_writeout_mutex);
 	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	list_del(&trim_entry->list);
+	mutex_unlock(&ctl->cache_writeout_mutex);
 
 	if (update) {
 		spin_lock(&space_info->lock);
@@ -2962,16 +2993,21 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 	u64 bytes;
 
 	while (start < end) {
+		struct btrfs_trim_range trim_entry;
+
+		mutex_lock(&ctl->cache_writeout_mutex);
 		spin_lock(&ctl->tree_lock);
 
 		if (ctl->free_space < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
 		entry = tree_search_offset(ctl, start, 0, 1);
 		if (!entry) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
@@ -2980,6 +3016,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 			node = rb_next(&entry->offset_index);
 			if (!node) {
 				spin_unlock(&ctl->tree_lock);
+				mutex_unlock(&ctl->cache_writeout_mutex);
 				goto out;
 			}
 			entry = rb_entry(node, struct btrfs_free_space,
@@ -2988,6 +3025,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 
 		if (entry->offset >= end) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
@@ -2997,6 +3035,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			goto next;
 		}
 
@@ -3004,9 +3043,13 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		kmem_cache_free(btrfs_free_space_cachep, entry);
 
 		spin_unlock(&ctl->tree_lock);
+		trim_entry.start = extent_start;
+		trim_entry.bytes = extent_bytes;
+		list_add_tail(&trim_entry.list, &ctl->trimming_ranges);
+		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes);
+				  extent_start, extent_bytes, &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3035,17 +3078,21 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 
 	while (offset < end) {
 		bool next_bitmap = false;
+		struct btrfs_trim_range trim_entry;
 
+		mutex_lock(&ctl->cache_writeout_mutex);
 		spin_lock(&ctl->tree_lock);
 
 		if (ctl->free_space < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
 		entry = tree_search_offset(ctl, offset, 1, 0);
 		if (!entry) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			next_bitmap = true;
 			goto next;
 		}
@@ -3054,6 +3101,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		ret2 = search_bitmap(ctl, entry, &start, &bytes);
 		if (ret2 || start >= end) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			next_bitmap = true;
 			goto next;
 		}
@@ -3061,6 +3109,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		bytes = min(bytes, end - start);
 		if (bytes < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			goto next;
 		}
 
@@ -3069,9 +3118,13 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			free_bitmap(ctl, entry);
 
 		spin_unlock(&ctl->tree_lock);
+		trim_entry.start = start;
+		trim_entry.bytes = bytes;
+		list_add_tail(&trim_entry.list, &ctl->trimming_ranges);
+		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes);
+				  start, bytes, &trim_entry);
 		if (ret)
 			break;
 next:
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 0cf4977..88b2238 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -38,6 +38,8 @@ struct btrfs_free_space_ctl {
 	u64 start;
 	struct btrfs_free_space_op *op;
 	void *private;
+	struct mutex cache_writeout_mutex;
+	struct list_head trimming_ranges;
 };
 
 struct btrfs_free_space_op {
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 83d646b..81efd83 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -364,6 +364,8 @@ void btrfs_init_free_ino_ctl(struct btrfs_root *root)
 	ctl->start = 0;
 	ctl->private = NULL;
 	ctl->op = &free_ino_op;
+	INIT_LIST_HEAD(&ctl->trimming_ranges);
+	mutex_init(&ctl->cache_writeout_mutex);
 
 	/*
 	 * Initially we allow to use 16K of ram to cache chunks of
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
  2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
                   ` (4 preceding siblings ...)
  2014-11-26 15:28 ` [PATCH 5/6] Btrfs: fix race between writing free space cache and trimming Filipe Manana
@ 2014-11-26 15:28 ` Filipe Manana
  2014-11-26 16:07   ` Josef Bacik
  5 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 15:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
[25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
[25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
[25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
[25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
[25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
[25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
[25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
[25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 5 +++--
 fs/btrfs/super.c       | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4bf8f02..0a5e770 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 	int ret = 0;
 
 	list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
-		list_del_init(&block_group->bg_list);
 		if (ret)
-			continue;
+			goto next;
 
 		spin_lock(&block_group->lock);
 		memcpy(&item, &block_group->item, sizeof(item));
@@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 					       key.objectid, key.offset);
 		if (ret)
 			btrfs_abort_transaction(trans, extent_root, ret);
+next:
+		list_del_init(&block_group->bg_list);
 	}
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a2b97ef..c9ab88c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -262,7 +262,7 @@ void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
 	trans->aborted = errno;
 	/* Nothing used. The other threads that have joined this
 	 * transaction may be able to continue. */
-	if (!trans->blocks_used) {
+	if (!trans->blocks_used && list_empty(&trans->new_bgs)) {
 		const char *errstr;
 
 		errstr = btrfs_decode_error(errno);
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
  2014-11-26 15:28 ` [PATCH 2/6] Btrfs: fix crash caused by block group removal Filipe Manana
@ 2014-11-26 15:57   ` Josef Bacik
  2014-11-26 16:09     ` Filipe David Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 15:57 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs

On 11/26/2014 10:28 AM, Filipe Manana wrote:
> If we remove a block group (because it became empty), we might have left
> a caching_ctl structure in fs_info->caching_block_groups that points to
> the block group and is accessed at transaction commit time. This results
> in accessing an invalid or incorrect block group. This issue became visible
> after Josef's patch "Btrfs: remove empty block groups automatically".
>
> So if the block group is removed make sure we don't leave a dangling
> caching_ctl in caching_block_groups.
>
> Sample crash trace:
>
> [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
> [58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
> [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
> [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
> [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
> [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
> [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
> [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
> [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
> [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
> [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
> [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
> [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
> [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
> [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
> [58380.443238] Stack:
> [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
> [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
> [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
> [58380.443238] Call Trace:
> [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
> [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
> [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
> [58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
> [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
> [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/ctree.h       |  1 +
>   fs/btrfs/extent-tree.c | 27 +++++++++++++++++++++++++++
>   2 files changed, 28 insertions(+)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index d3ccd09..7f40a65 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
>   	unsigned int ro:1;
>   	unsigned int dirty:1;
>   	unsigned int iref:1;
> +	unsigned int has_caching_ctl:1;
>

Don't do this, just unconditionally call get_caching_control in 
btrfs_remove_block_group, then if we get something we can do stuff, 
otherwise we can just continue.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed
  2014-11-26 15:28 ` [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed Filipe Manana
@ 2014-11-26 15:58   ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 15:58 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs

On 11/26/2014 10:28 AM, Filipe Manana wrote:
> If we grab a block group, for example in btrfs_trim_fs(), we will be holding
> a reference on it but the block group can be removed after we got it (via
> btrfs_remove_block_group), which means it will no longer be part of the
> rbtree.
>
> However, btrfs_remove_block_group() was only calling rb_erase() which leaves
> the block group's rb_node left and right child pointers with the same content
> they had before calling rb_erase. This was dangerous because a call to
> next_block_group() would access the node's left and right child pointers (via
> rb_next), which can be no longer valid.
>
> Fix this by clearing a block group's node after removing it from the tree,
> and have next_block_group() do a tree search to get the next block group
> instead of using rb_next() if our block group was removed.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group
  2014-11-26 15:28 ` [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group Filipe Manana
@ 2014-11-26 16:02   ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:02 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs

On 11/26/2014 10:28 AM, Filipe Manana wrote:
> There's a race between adding a block group to the list of the unused
> block groups and removing an unused block group (cleaner kthread) that
> leads to freeing extents that are in use or a crash during transaction
> commmit. Basically the cleaner kthread, when executing
> btrfs_delete_unused_bgs(), might catch the newly added block group to
> the list fs_info->unused_bgs and clear the range representing the whole
> group from fs_info->freed_extents[] before the task that added the block
> group to the list (running update_block_group()) marked the last freed
> extent as dirty in fs_info->freed_extents (pinned_extents).
>
> That is:
>
>       CPU 1                                CPU 2
>
>                                    btrfs_delete_unused_bgs()
> update_block_group()
>     add block group to
>     fs_info->unused_bgs
>                                      got block group from the list
>                                      clear_extent_bits for the whole
>                                      block group range in freed_extents[]
>     set_extent_dirty for the
>     range covering the freed
>     extent in freed_extents[]
>     (fs_info->pinned_extents)
>
>                                    block group deleted, and a new block
>                                    group with the same logical address is
>                                    created
>
>                                    reserve space from the new block group
>                                    for new data or metadata - the reserved
>                                    space overlaps the range specified by
>                                    CPU 1 for set_extent_dirty()
>
>                                    commit transaction
>                                      find all ranges marked as dirty in
>                                      fs_info->pinned_extents, clear them
>                                      and add them to the free space cache
>
> Alternatively, if CPU 2 doesn't create a new block group with the same
> logical address, we get a crash/BUG_ON at transaction commit when unpining
> extent ranges because we can't find a block group for the range marked as
> dirty by CPU 1. Sample trace:
>
> [ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
> gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
>   sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
> [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
> [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
> [ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
> [ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
> [ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
> [ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
> [ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
> [ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
> [ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
> [ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
> [ 2163.430042] Stack:
> [ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
> [ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
> [ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
> [ 2163.430042] Call Trace:
> [ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
> [ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
> [ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
> [ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
> [ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
> [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
> [ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
> [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>
> So fix this by making update_block_group() first set the range as dirty
> in pinned_extents before adding the block group to the unused_bgs list.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/extent-tree.c | 21 ++++++++++-----------
>   1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index b7e40ef..92f61f2 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5439,7 +5439,17 @@ static int update_block_group(struct btrfs_root *root,
>   			spin_unlock(&cache->space_info->lock);
>   		} else {
>   			old_val -= num_bytes;
> +			btrfs_set_block_group_used(&cache->item, old_val);
> +			cache->pinned += num_bytes;
> +			cache->space_info->bytes_pinned += num_bytes;
> +			cache->space_info->bytes_used -= num_bytes;
> +			cache->space_info->disk_used -= num_bytes * factor;
> +			spin_unlock(&cache->lock);
> +			spin_unlock(&cache->space_info->lock);
>
> +			set_extent_dirty(info->pinned_extents,
> +					 bytenr, bytenr + num_bytes - 1,
> +					 GFP_NOFS | __GFP_NOFAIL);
>   			/*
>   			 * No longer have used bytes in this block group, queue
>   			 * it for deletion.
> @@ -5453,17 +5463,6 @@ static int update_block_group(struct btrfs_root *root,
>   				}
>   				spin_unlock(&info->unused_bgs_lock);
>   			}
> -			btrfs_set_block_group_used(&cache->item, old_val);
> -			cache->pinned += num_bytes;
> -			cache->space_info->bytes_pinned += num_bytes;
> -			cache->space_info->bytes_used -= num_bytes;
> -			cache->space_info->disk_used -= num_bytes * factor;
> -			spin_unlock(&cache->lock);
> -			spin_unlock(&cache->space_info->lock);
> -
> -			set_extent_dirty(info->pinned_extents,
> -					 bytenr, bytenr + num_bytes - 1,
> -					 GFP_NOFS | __GFP_NOFAIL);
>   		}
>   		btrfs_put_block_group(cache);
>   		total -= num_bytes;
>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
  2014-11-26 15:28 ` [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups Filipe Manana
@ 2014-11-26 16:07   ` Josef Bacik
  2014-11-26 16:15     ` Filipe David Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:07 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs

On 11/26/2014 10:28 AM, Filipe Manana wrote:
> If the transaction handle doesn't have used blocks but has created new block
> groups make sure we turn the fs into readonly mode too. This is because the
> new block groups didn't get all their metadata persisted into the chunk and
> device trees, and therefore if a subsequent transaction starts, allocates
> space from the new block groups, writes data or metadata into that space,
> commits successfully and then after we unmount and mount the filesystem
> again, the same space can be allocated again for a new block group,
> resulting in file data or metadata corruption.
>
> Example where we don't abort the transaction when we fail to finish the
> chunk allocation (add items to the chunk and device trees) and later a
> future transaction where the block group is removed fails because it can't
> find the chunk item in the chunk tree:
>
> [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
> [25230.404301] BTRFS: Transaction aborted (error -28)
> [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
> [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
> [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
> [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
> [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
> [25230.404334] Call Trace:
> [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
> [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
> [25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
> [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
> [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
> [25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
> [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
> [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
> [25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
> [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
> [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
> [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
> [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
> [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
> [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
> [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
> [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
> [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
> [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
> [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
> [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
> [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/extent-tree.c | 5 +++--
>   fs/btrfs/super.c       | 2 +-
>   2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4bf8f02..0a5e770 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>   	int ret = 0;
>
>   	list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
> -		list_del_init(&block_group->bg_list);
>   		if (ret)
> -			continue;
> +			goto next;
>
>   		spin_lock(&block_group->lock);
>   		memcpy(&item, &block_group->item, sizeof(item));
> @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>   					       key.objectid, key.offset);
>   		if (ret)
>   			btrfs_abort_transaction(trans, extent_root, ret);
> +next:
> +		list_del_init(&block_group->bg_list);
>   	}
>   }
>

I don't understand this change, logically it seems the same as what we 
had before.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
  2014-11-26 15:57   ` Josef Bacik
@ 2014-11-26 16:09     ` Filipe David Manana
  2014-11-26 16:24       ` Josef Bacik
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe David Manana @ 2014-11-26 16:09 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>
>> If we remove a block group (because it became empty), we might have left
>> a caching_ctl structure in fs_info->caching_block_groups that points to
>> the block group and is accessed at transaction commit time. This results
>> in accessing an invalid or incorrect block group. This issue became
>> visible
>> after Josef's patch "Btrfs: remove empty block groups automatically".
>>
>> So if the block group is removed make sure we don't leave a dangling
>> caching_ctl in caching_block_groups.
>>
>> Sample crash trace:
>>
>> [58380.439449] BUG: unable to handle kernel paging request at
>> ffff8801446eaeb8
>> [58380.439707] IP: [<ffffffffa03f6d05>]
>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>> [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
>> 80000001446ea060
>> [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
>> auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
>> processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
>> thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>> virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
>> virtio [last unloaded: btrfs]
>> [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W
>> 3.17.0-rc5-btrfs-next-1+ #1
>> [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>> [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti:
>> ffff88013896c000
>> [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>]
>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>> [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
>> [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX:
>> 0000000000000000
>> [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI:
>> ffff8801446eaeb8
>> [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09:
>> ffff88013896fc60
>> [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12:
>> ffff880222cae000
>> [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15:
>> ffff8801446eae00
>> [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000)
>> knlGS:0000000000000000
>> [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4:
>> 00000000000006e0
>> [58380.443238] Stack:
>> [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850
>> ffff880185e16800
>> [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00
>> 0000000000000000
>> [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0
>> ffff88013ac82090
>> [58380.443238] Call Trace:
>> [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7
>> [btrfs]
>> [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882
>> [btrfs]
>> [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4
>> [btrfs]
>> [58380.443238]  [<ffffffffa040bf66>] ?
>> btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
>> [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>> [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>   fs/btrfs/ctree.h       |  1 +
>>   fs/btrfs/extent-tree.c | 27 +++++++++++++++++++++++++++
>>   2 files changed, 28 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index d3ccd09..7f40a65 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
>>         unsigned int ro:1;
>>         unsigned int dirty:1;
>>         unsigned int iref:1;
>> +       unsigned int has_caching_ctl:1;
>>
>
> Don't do this, just unconditionally call get_caching_control in
> btrfs_remove_block_group, then if we get something we can do stuff,
> otherwise we can just continue.  Thanks,

That's what I initially thought too. However, get_caching_control only
returns us the caching_ctl if block_group->cached ==
BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
has_caching_ctl flag is just to avoid holding the semaphore and search
through the list (since block_group->caching_ctl can be NULL but a
caching_ctl point to the block group can be in the list).

>
> Josef
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
  2014-11-26 16:07   ` Josef Bacik
@ 2014-11-26 16:15     ` Filipe David Manana
  2014-11-26 16:19       ` Josef Bacik
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe David Manana @ 2014-11-26 16:15 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>
>> If the transaction handle doesn't have used blocks but has created new
>> block
>> groups make sure we turn the fs into readonly mode too. This is because
>> the
>> new block groups didn't get all their metadata persisted into the chunk
>> and
>> device trees, and therefore if a subsequent transaction starts, allocates
>> space from the new block groups, writes data or metadata into that space,
>> commits successfully and then after we unmount and mount the filesystem
>> again, the same space can be allocated again for a new block group,
>> resulting in file data or metadata corruption.
>>
>> Example where we don't abort the transaction when we fail to finish the
>> chunk allocation (add items to the chunk and device trees) and later a
>> future transaction where the block group is removed fails because it can't
>> find the chunk item in the chunk tree:
>>
>> [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
>> __btrfs_abort_transaction+0x50/0xfc [btrfs]()
>> [25230.404301] BTRFS: Transaction aborted (error -28)
>> [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
>> raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
>> dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop
>> psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
>> serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom
>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>> virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
>> virtio [last unloaded: btrfs]
>> [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
>> 3.17.0-rc5-btrfs-next-1+ #1
>> [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>> [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13
>> ffff88004581bb50
>> [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a
>> 00000000ffffffe4
>> [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800
>> ffff88004581bba8
>> [25230.404334] Call Trace:
>> [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
>> [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
>> [25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc
>> [btrfs]
>> [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
>> [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc
>> [btrfs]
>> [25230.404374]  [<ffffffffa04a8c43>]
>> btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
>> [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de
>> [btrfs]
>> [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12
>> [btrfs]
>> [25230.404408]  [<ffffffffa04a3d64>]
>> btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
>> [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d
>> [btrfs]
>> [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
>> [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
>> [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f
>> [btrfs]
>> [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
>> [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
>> [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
>> [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
>> [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
>> [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
>> [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
>> [25230.404458] BTRFS warning (device sdc):
>> btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space
>> left).
>> [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
>> errno=-2 No such entry (Failed lookup while freeing chunk.)
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>   fs/btrfs/extent-tree.c | 5 +++--
>>   fs/btrfs/super.c       | 2 +-
>>   2 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 4bf8f02..0a5e770 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
>> btrfs_trans_handle *trans,
>>         int ret = 0;
>>
>>         list_for_each_entry_safe(block_group, tmp, &trans->new_bgs,
>> bg_list) {
>> -               list_del_init(&block_group->bg_list);
>>                 if (ret)
>> -                       continue;
>> +                       goto next;
>>
>>                 spin_lock(&block_group->lock);
>>                 memcpy(&item, &block_group->item, sizeof(item));
>> @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
>> btrfs_trans_handle *trans,
>>                                                key.objectid, key.offset);
>>                 if (ret)
>>                         btrfs_abort_transaction(trans, extent_root, ret);
>> +next:
>> +               list_del_init(&block_group->bg_list);
>>         }
>>   }
>>
>
> I don't understand this change, logically it seems the same as what we had
> before.  Thanks,

It isn't. Before we would not turn the fs readonly if the transaction
had no blocks used but had new block groups. This change just makes it
turn into readonly mode if it has new block groups (even if no blocks
were used).
See the trace in the log where it shows we failed to finish the chunk
allocation but the transaction was aborted and didn't turn the fs into
readonly mode.


>
> Josef
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
  2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
@ 2014-11-26 16:15   ` Josef Bacik
  2014-11-26 16:25     ` Filipe David Manana
  2014-11-26 17:19   ` [PATCH v2 " Filipe Manana
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:15 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs

On 11/26/2014 10:28 AM, Filipe Manana wrote:
> Our fs trim operation, which is completely transactionless (doesn't start
> or joins an existing transaction) consists of visiting all block groups
> and then for each one to iterate its free space entries and perform a
> discard operation against the space range represented by the free space
> entries. However before performing a discard, the corresponding free space
> entry is removed from the free space rbtree, and when the discard completes
> it is added back to the free space rbtree.
>
> If a block group remove operation happens while the discard is ongoing (or
> before it starts and after a free space entry is hidden), we end up not
> waiting for the discard to complete, remove the extent map that maps
> logical address to physical addresses and the corresponding chunk metadata
> from the the chunk and device trees. After that and before the discard
> completes, the current running transaction can finish and a new one start,
> allowing for new block groups that map to the same physical addresses to
> be allocated and written to.
>
> So fix this by keeping the extent map in memory until the discard completes
> so that the same physical addresses aren't reused before it completes.
>
> If the physical locations that are under a discard operation end up being
> used for a new metadata block group for example, and dirty metadata extents
> are written before the discard finishes (the VM might call writepages() of
> our btree inode's i_mapping for example, or an fsync log commit happens) we
> end up overwriting metadata with zeroes, which leads to errors from fsck
> like the following:
>
>          checking extents
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          read block failed check_tree_block
>          owner ref check failed [833912832 16384]
>          Errors found in extent allocation tree or chunk allocation
>          checking free space cache
>          checking fs roots
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          Check tree block failed, want=833912832, have=0
>          read block failed check_tree_block
>          root 5 root dir 256 error
>          root 5 inode 260 errors 2001, no inode item, link count wrong
>                  unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
>          root 5 inode 262 errors 2001, no inode item, link count wrong
>                  unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
>          root 5 inode 263 errors 2001, no inode item, link count wrong
>          (...)
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/ctree.h            | 13 ++++++++++++-
>   fs/btrfs/disk-io.c          | 14 ++++++++++++++
>   fs/btrfs/extent-tree.c      | 24 +++++++++++++++++++++++-
>   fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++-
>   fs/btrfs/volumes.c          | 33 ++++++++++++++++++++++++++-------
>   5 files changed, 100 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7f40a65..51056c7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
>   	unsigned int dirty:1;
>   	unsigned int iref:1;
>   	unsigned int has_caching_ctl:1;
> +	unsigned int removed:1;
>
>   	int disk_cache_state;
>
> @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
>
>   	/* For delayed block group creation or deletion of empty block groups */
>   	struct list_head bg_list;
> +
> +	atomic_t trimming;
>   };
>
>   /* delayed seq elem */
> @@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
>
>   	/* For btrfs to record security options */
>   	struct security_mnt_opts security_opts;
> +
> +	/*
> +	 * Chunks that can't be freed yet (under a trim/discard operation)
> +	 * and will be latter freed.
> +	 */
> +	rwlock_t pinned_chunks_lock;
> +	struct list_head pinned_chunks;
>   };
>
>   struct btrfs_subvolume_writers {
> @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
>   			   u64 type, u64 chunk_objectid, u64 chunk_offset,
>   			   u64 size);
>   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> -			     struct btrfs_root *root, u64 group_start);
> +			     struct btrfs_root *root, u64 group_start,
> +			     bool *remove_em);
>   void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>   void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>   				       struct btrfs_root *root);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 9d4fb0a..76012d0 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
>   	init_waitqueue_head(&fs_info->transaction_blocked_wait);
>   	init_waitqueue_head(&fs_info->async_submit_wait);
>
> +	rwlock_init(&fs_info->pinned_chunks_lock);
> +	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> +
>   	ret = btrfs_alloc_stripe_hash_table(fs_info);
>   	if (ret) {
>   		err = ret;
> @@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
>
>   	btrfs_free_block_rsv(root, root->orphan_block_rsv);
>   	root->orphan_block_rsv = NULL;
> +
> +	write_lock(&fs_info->pinned_chunks_lock);
> +	while (!list_empty(&fs_info->pinned_chunks)) {
> +		struct extent_map *em;
> +
> +		em = list_first_entry(&fs_info->pinned_chunks,
> +				      struct extent_map, list);
> +		list_del_init(&em->list);
> +		free_extent_map(em);
> +	}
> +	write_unlock(&fs_info->pinned_chunks_lock);
>   }
>
>   int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 92f61f2..4bf8f02 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
>   	INIT_LIST_HEAD(&cache->cluster_list);
>   	INIT_LIST_HEAD(&cache->bg_list);
>   	btrfs_init_free_space_ctl(cache);
> +	atomic_set(&cache->trimming, 0);
>
>   	return cache;
>   }
> @@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
>   }
>
>   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> -			     struct btrfs_root *root, u64 group_start)
> +			     struct btrfs_root *root, u64 group_start,
> +			     bool *remove_em)
>   {
>   	struct btrfs_path *path;
>   	struct btrfs_block_group_cache *block_group;
> @@ -9474,6 +9476,26 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>
>   	memcpy(&key, &block_group->key, sizeof(key));
>
> +	spin_lock(&block_group->lock);
> +	block_group->removed = 1;

Ok you set block_group->removed here, but you don't check it in 
btrfs_trim_block_group, so we can easily race in afterwards and start 
trimming this block group.

> +	/*
> +	 * At this point trimming can't start on this block group, because we
> +	 * removed the block group from the tree fs_info->block_group_cache_tree
> +	 * so no one can't find it anymore.
> +	 *
> +	 * And we must tell our caller to not remove the extent map from the
> +	 * fs_info->mapping_tree to prevent the same logical address range and
> +	 * physical device space ranges from being reused for a new block group.
> +	 * This is because our fs trim operation (btrfs_trim_fs(),
> +	 * btrfs_ioctl_fitrim()) is completely transactionless, so while its
> +	 * trimming a range the currently running transaction might finish and
> +	 * a new one start, allowing for new block groups to be created that can
> +	 * reuse the same physical device locations unless we take this special
> +	 * care.
> +	 */
> +	*remove_em = (atomic_read(&block_group->trimming) == 0);
> +	spin_unlock(&block_group->lock);
> +
>   	btrfs_put_block_group(block_group);
>   	btrfs_put_block_group(block_group);
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 3384819..16c2d39 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
>
>   	*trimmed = 0;
>
> +	atomic_inc(&block_group->trimming);
> +

This needs to be something like this


spin_lock(&block_group->lock);
if (block_group->removed == 1) {
	spin_unlock(&block_group->lock);
	return 0;
}
atomic_inc(&block_group->trimming);
spin_unlock(&block_group->lock);

To be  properly safe.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
  2014-11-26 16:15     ` Filipe David Manana
@ 2014-11-26 16:19       ` Josef Bacik
  2014-11-26 16:29         ` Filipe David Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:19 UTC (permalink / raw)
  To: fdmanana; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On 11/26/2014 11:15 AM, Filipe David Manana wrote:
> On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>
>>> If the transaction handle doesn't have used blocks but has created new
>>> block
>>> groups make sure we turn the fs into readonly mode too. This is because
>>> the
>>> new block groups didn't get all their metadata persisted into the chunk
>>> and
>>> device trees, and therefore if a subsequent transaction starts, allocates
>>> space from the new block groups, writes data or metadata into that space,
>>> commits successfully and then after we unmount and mount the filesystem
>>> again, the same space can be allocated again for a new block group,
>>> resulting in file data or metadata corruption.
>>>
>>> Example where we don't abort the transaction when we fail to finish the
>>> chunk allocation (add items to the chunk and device trees) and later a
>>> future transaction where the block group is removed fails because it can't
>>> find the chunk item in the chunk tree:
>>>
>>> [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
>>> __btrfs_abort_transaction+0x50/0xfc [btrfs]()
>>> [25230.404301] BTRFS: Transaction aborted (error -28)
>>> [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
>>> raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
>>> dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop
>>> psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
>>> serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom
>>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>>> virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
>>> virtio [last unloaded: btrfs]
>>> [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
>>> 3.17.0-rc5-btrfs-next-1+ #1
>>> [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>> [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13
>>> ffff88004581bb50
>>> [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a
>>> 00000000ffffffe4
>>> [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800
>>> ffff88004581bba8
>>> [25230.404334] Call Trace:
>>> [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
>>> [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
>>> [25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc
>>> [btrfs]
>>> [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
>>> [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc
>>> [btrfs]
>>> [25230.404374]  [<ffffffffa04a8c43>]
>>> btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
>>> [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de
>>> [btrfs]
>>> [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12
>>> [btrfs]
>>> [25230.404408]  [<ffffffffa04a3d64>]
>>> btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
>>> [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d
>>> [btrfs]
>>> [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
>>> [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
>>> [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f
>>> [btrfs]
>>> [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
>>> [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
>>> [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
>>> [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
>>> [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
>>> [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
>>> [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
>>> [25230.404458] BTRFS warning (device sdc):
>>> btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space
>>> left).
>>> [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
>>> errno=-2 No such entry (Failed lookup while freeing chunk.)
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>    fs/btrfs/extent-tree.c | 5 +++--
>>>    fs/btrfs/super.c       | 2 +-
>>>    2 files changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 4bf8f02..0a5e770 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
>>> btrfs_trans_handle *trans,
>>>          int ret = 0;
>>>
>>>          list_for_each_entry_safe(block_group, tmp, &trans->new_bgs,
>>> bg_list) {
>>> -               list_del_init(&block_group->bg_list);
>>>                  if (ret)
>>> -                       continue;
>>> +                       goto next;
>>>
>>>                  spin_lock(&block_group->lock);
>>>                  memcpy(&item, &block_group->item, sizeof(item));
>>> @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
>>> btrfs_trans_handle *trans,
>>>                                                 key.objectid, key.offset);
>>>                  if (ret)
>>>                          btrfs_abort_transaction(trans, extent_root, ret);
>>> +next:
>>> +               list_del_init(&block_group->bg_list);
>>>          }
>>>    }
>>>
>>
>> I don't understand this change, logically it seems the same as what we had
>> before.  Thanks,
>
> It isn't. Before we would not turn the fs readonly if the transaction
> had no blocks used but had new block groups. This change just makes it
> turn into readonly mode if it has new block groups (even if no blocks
> were used).
> See the trace in the log where it shows we failed to finish the chunk
> allocation but the transaction was aborted and didn't turn the fs into
> readonly mode.
>

Yeah the bit below makes sense, its the above change that is the same. 
Before we deleted the entry from the list and then if there is an error 
we just continue, now if there is an error you go to next and delete the 
entry and continue, which is the same thing.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
  2014-11-26 16:09     ` Filipe David Manana
@ 2014-11-26 16:24       ` Josef Bacik
  2014-11-26 16:34         ` Filipe David Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:24 UTC (permalink / raw)
  To: fdmanana; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On 11/26/2014 11:09 AM, Filipe David Manana wrote:
> On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>
>>> If we remove a block group (because it became empty), we might have left
>>> a caching_ctl structure in fs_info->caching_block_groups that points to
>>> the block group and is accessed at transaction commit time. This results
>>> in accessing an invalid or incorrect block group. This issue became
>>> visible
>>> after Josef's patch "Btrfs: remove empty block groups automatically".
>>>
>>> So if the block group is removed make sure we don't leave a dangling
>>> caching_ctl in caching_block_groups.
>>>
>>> Sample crash trace:
>>>
>>> [58380.439449] BUG: unable to handle kernel paging request at
>>> ffff8801446eaeb8
>>> [58380.439707] IP: [<ffffffffa03f6d05>]
>>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>>> [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
>>> 80000001446ea060
>>> [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
>>> auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
>>> processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
>>> thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
>>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>>> virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
>>> virtio [last unloaded: btrfs]
>>> [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W
>>> 3.17.0-rc5-btrfs-next-1+ #1
>>> [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>> [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti:
>>> ffff88013896c000
>>> [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>]
>>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>>> [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
>>> [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX:
>>> 0000000000000000
>>> [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI:
>>> ffff8801446eaeb8
>>> [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09:
>>> ffff88013896fc60
>>> [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12:
>>> ffff880222cae000
>>> [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15:
>>> ffff8801446eae00
>>> [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000)
>>> knlGS:0000000000000000
>>> [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4:
>>> 00000000000006e0
>>> [58380.443238] Stack:
>>> [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850
>>> ffff880185e16800
>>> [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00
>>> 0000000000000000
>>> [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0
>>> ffff88013ac82090
>>> [58380.443238] Call Trace:
>>> [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7
>>> [btrfs]
>>> [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882
>>> [btrfs]
>>> [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4
>>> [btrfs]
>>> [58380.443238]  [<ffffffffa040bf66>] ?
>>> btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
>>> [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
>>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>>> [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
>>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>    fs/btrfs/ctree.h       |  1 +
>>>    fs/btrfs/extent-tree.c | 27 +++++++++++++++++++++++++++
>>>    2 files changed, 28 insertions(+)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index d3ccd09..7f40a65 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
>>>          unsigned int ro:1;
>>>          unsigned int dirty:1;
>>>          unsigned int iref:1;
>>> +       unsigned int has_caching_ctl:1;
>>>
>>
>> Don't do this, just unconditionally call get_caching_control in
>> btrfs_remove_block_group, then if we get something we can do stuff,
>> otherwise we can just continue.  Thanks,
>
> That's what I initially thought too. However, get_caching_control only
> returns us the caching_ctl if block_group->cached ==
> BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
> has_caching_ctl flag is just to avoid holding the semaphore and search
> through the list (since block_group->caching_ctl can be NULL but a
> caching_ctl point to the block group can be in the list).
>

Oh God that's not good, we need to change get_caching_control to return 
if there is a caching control at all, since the other users want to wait 
for the fast caching to finish too.  So change that and then use it 
unconditionally.  I bet this has been causing us the random early ENOSPC 
problems.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
  2014-11-26 16:15   ` Josef Bacik
@ 2014-11-26 16:25     ` Filipe David Manana
  2014-11-26 16:30       ` Josef Bacik
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe David Manana @ 2014-11-26 16:25 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>
>> Our fs trim operation, which is completely transactionless (doesn't start
>> or joins an existing transaction) consists of visiting all block groups
>> and then for each one to iterate its free space entries and perform a
>> discard operation against the space range represented by the free space
>> entries. However before performing a discard, the corresponding free space
>> entry is removed from the free space rbtree, and when the discard
>> completes
>> it is added back to the free space rbtree.
>>
>> If a block group remove operation happens while the discard is ongoing (or
>> before it starts and after a free space entry is hidden), we end up not
>> waiting for the discard to complete, remove the extent map that maps
>> logical address to physical addresses and the corresponding chunk metadata
>> from the the chunk and device trees. After that and before the discard
>> completes, the current running transaction can finish and a new one start,
>> allowing for new block groups that map to the same physical addresses to
>> be allocated and written to.
>>
>> So fix this by keeping the extent map in memory until the discard
>> completes
>> so that the same physical addresses aren't reused before it completes.
>>
>> If the physical locations that are under a discard operation end up being
>> used for a new metadata block group for example, and dirty metadata
>> extents
>> are written before the discard finishes (the VM might call writepages() of
>> our btree inode's i_mapping for example, or an fsync log commit happens)
>> we
>> end up overwriting metadata with zeroes, which leads to errors from fsck
>> like the following:
>>
>>          checking extents
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          read block failed check_tree_block
>>          owner ref check failed [833912832 16384]
>>          Errors found in extent allocation tree or chunk allocation
>>          checking free space cache
>>          checking fs roots
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          Check tree block failed, want=833912832, have=0
>>          read block failed check_tree_block
>>          root 5 root dir 256 error
>>          root 5 inode 260 errors 2001, no inode item, link count wrong
>>                  unresolved ref dir 256 index 0 namelen 8 name foobar_3
>> filetype 1 errors 6, no dir index, no inode ref
>>          root 5 inode 262 errors 2001, no inode item, link count wrong
>>                  unresolved ref dir 256 index 0 namelen 8 name foobar_5
>> filetype 1 errors 6, no dir index, no inode ref
>>          root 5 inode 263 errors 2001, no inode item, link count wrong
>>          (...)
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>   fs/btrfs/ctree.h            | 13 ++++++++++++-
>>   fs/btrfs/disk-io.c          | 14 ++++++++++++++
>>   fs/btrfs/extent-tree.c      | 24 +++++++++++++++++++++++-
>>   fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++-
>>   fs/btrfs/volumes.c          | 33 ++++++++++++++++++++++++++-------
>>   5 files changed, 100 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 7f40a65..51056c7 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
>>         unsigned int dirty:1;
>>         unsigned int iref:1;
>>         unsigned int has_caching_ctl:1;
>> +       unsigned int removed:1;
>>
>>         int disk_cache_state;
>>
>> @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
>>
>>         /* For delayed block group creation or deletion of empty block
>> groups */
>>         struct list_head bg_list;
>> +
>> +       atomic_t trimming;
>>   };
>>
>>   /* delayed seq elem */
>> @@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
>>
>>         /* For btrfs to record security options */
>>         struct security_mnt_opts security_opts;
>> +
>> +       /*
>> +        * Chunks that can't be freed yet (under a trim/discard operation)
>> +        * and will be latter freed.
>> +        */
>> +       rwlock_t pinned_chunks_lock;
>> +       struct list_head pinned_chunks;
>>   };
>>
>>   struct btrfs_subvolume_writers {
>> @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle
>> *trans,
>>                            u64 type, u64 chunk_objectid, u64 chunk_offset,
>>                            u64 size);
>>   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>> -                            struct btrfs_root *root, u64 group_start);
>> +                            struct btrfs_root *root, u64 group_start,
>> +                            bool *remove_em);
>>   void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>>   void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>>                                        struct btrfs_root *root);
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 9d4fb0a..76012d0 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
>>         init_waitqueue_head(&fs_info->transaction_blocked_wait);
>>         init_waitqueue_head(&fs_info->async_submit_wait);
>>
>> +       rwlock_init(&fs_info->pinned_chunks_lock);
>> +       INIT_LIST_HEAD(&fs_info->pinned_chunks);
>> +
>>         ret = btrfs_alloc_stripe_hash_table(fs_info);
>>         if (ret) {
>>                 err = ret;
>> @@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
>>
>>         btrfs_free_block_rsv(root, root->orphan_block_rsv);
>>         root->orphan_block_rsv = NULL;
>> +
>> +       write_lock(&fs_info->pinned_chunks_lock);
>> +       while (!list_empty(&fs_info->pinned_chunks)) {
>> +               struct extent_map *em;
>> +
>> +               em = list_first_entry(&fs_info->pinned_chunks,
>> +                                     struct extent_map, list);
>> +               list_del_init(&em->list);
>> +               free_extent_map(em);
>> +       }
>> +       write_unlock(&fs_info->pinned_chunks_lock);
>>   }
>>
>>   int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 92f61f2..4bf8f02 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root
>> *root, u64 start, u64 size)
>>         INIT_LIST_HEAD(&cache->cluster_list);
>>         INIT_LIST_HEAD(&cache->bg_list);
>>         btrfs_init_free_space_ctl(cache);
>> +       atomic_set(&cache->trimming, 0);
>>
>>         return cache;
>>   }
>> @@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct
>> btrfs_fs_info *fs_info, u64 flags)
>>   }
>>
>>   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>> -                            struct btrfs_root *root, u64 group_start)
>> +                            struct btrfs_root *root, u64 group_start,
>> +                            bool *remove_em)
>>   {
>>         struct btrfs_path *path;
>>         struct btrfs_block_group_cache *block_group;
>> @@ -9474,6 +9476,26 @@ int btrfs_remove_block_group(struct
>> btrfs_trans_handle *trans,
>>
>>         memcpy(&key, &block_group->key, sizeof(key));
>>
>> +       spin_lock(&block_group->lock);
>> +       block_group->removed = 1;
>
>
> Ok you set block_group->removed here, but you don't check it in
> btrfs_trim_block_group, so we can easily race in afterwards and start
> trimming this block group.
>
>
>> +       /*
>> +        * At this point trimming can't start on this block group, because
>> we
>> +        * removed the block group from the tree
>> fs_info->block_group_cache_tree
>> +        * so no one can't find it anymore.
>> +        *
>> +        * And we must tell our caller to not remove the extent map from
>> the
>> +        * fs_info->mapping_tree to prevent the same logical address range
>> and
>> +        * physical device space ranges from being reused for a new block
>> group.
>> +        * This is because our fs trim operation (btrfs_trim_fs(),
>> +        * btrfs_ioctl_fitrim()) is completely transactionless, so while
>> its
>> +        * trimming a range the currently running transaction might finish
>> and
>> +        * a new one start, allowing for new block groups to be created
>> that can
>> +        * reuse the same physical device locations unless we take this
>> special
>> +        * care.
>> +        */
>> +       *remove_em = (atomic_read(&block_group->trimming) == 0);
>> +       spin_unlock(&block_group->lock);
>> +
>>         btrfs_put_block_group(block_group);
>>         btrfs_put_block_group(block_group);
>>
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index 3384819..16c2d39 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct
>> btrfs_block_group_cache *block_group,
>>
>>         *trimmed = 0;
>>
>> +       atomic_inc(&block_group->trimming);
>> +
>
>
> This needs to be something like this
>
>
> spin_lock(&block_group->lock);
> if (block_group->removed == 1) {
>         spin_unlock(&block_group->lock);
>         return 0;
> }
> atomic_inc(&block_group->trimming);
> spin_unlock(&block_group->lock);
>
> To be  properly safe.  Thanks,

Hi Josef, it is safe.
btrfs_trim_block_group() checks for ->removed after decrementing the
atomic. Before it starts trimming, it increments the counter - if the
block group removal already started it isn't a problem, because it
removed all the free space entries from the rbtree while holding the
tree's lock.

>
> Josef
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
  2014-11-26 16:19       ` Josef Bacik
@ 2014-11-26 16:29         ` Filipe David Manana
  2014-11-26 16:32           ` Josef Bacik
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe David Manana @ 2014-11-26 16:29 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On Wed, Nov 26, 2014 at 4:19 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 11/26/2014 11:15 AM, Filipe David Manana wrote:
>>
>> On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik <jbacik@fb.com> wrote:
>>>
>>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>>
>>>>
>>>> If the transaction handle doesn't have used blocks but has created new
>>>> block
>>>> groups make sure we turn the fs into readonly mode too. This is because
>>>> the
>>>> new block groups didn't get all their metadata persisted into the chunk
>>>> and
>>>> device trees, and therefore if a subsequent transaction starts,
>>>> allocates
>>>> space from the new block groups, writes data or metadata into that
>>>> space,
>>>> commits successfully and then after we unmount and mount the filesystem
>>>> again, the same space can be allocated again for a new block group,
>>>> resulting in file data or metadata corruption.
>>>>
>>>> Example where we don't abort the transaction when we fail to finish the
>>>> chunk allocation (add items to the chunk and device trees) and later a
>>>> future transaction where the block group is removed fails because it
>>>> can't
>>>> find the chunk item in the chunk tree:
>>>>
>>>> [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
>>>> __btrfs_abort_transaction+0x50/0xfc [btrfs]()
>>>> [25230.404301] BTRFS: Transaction aborted (error -28)
>>>> [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
>>>> raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
>>>> dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc
>>>> loop
>>>> psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
>>>> serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod
>>>> cdrom
>>>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>>>> virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
>>>> virtio [last unloaded: btrfs]
>>>> [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
>>>> 3.17.0-rc5-btrfs-next-1+ #1
>>>> [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>> BIOS
>>>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>>> [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13
>>>> ffff88004581bb50
>>>> [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a
>>>> 00000000ffffffe4
>>>> [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800
>>>> ffff88004581bba8
>>>> [25230.404334] Call Trace:
>>>> [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
>>>> [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
>>>> [25230.404351]  [<ffffffffa049386a>] ?
>>>> __btrfs_abort_transaction+0x50/0xfc
>>>> [btrfs]
>>>> [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
>>>> [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc
>>>> [btrfs]
>>>> [25230.404374]  [<ffffffffa04a8c43>]
>>>> btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
>>>> [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de
>>>> [btrfs]
>>>> [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12
>>>> [btrfs]
>>>> [25230.404408]  [<ffffffffa04a3d64>]
>>>> btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
>>>> [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d
>>>> [btrfs]
>>>> [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
>>>> [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
>>>> [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f
>>>> [btrfs]
>>>> [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
>>>> [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
>>>> [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
>>>> [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
>>>> [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
>>>> [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
>>>> [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
>>>> [25230.404458] BTRFS warning (device sdc):
>>>> btrfs_create_pending_block_groups:9228: Aborting unused transaction(No
>>>> space
>>>> left).
>>>> [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
>>>> errno=-2 No such entry (Failed lookup while freeing chunk.)
>>>>
>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>> ---
>>>>    fs/btrfs/extent-tree.c | 5 +++--
>>>>    fs/btrfs/super.c       | 2 +-
>>>>    2 files changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index 4bf8f02..0a5e770 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
>>>> btrfs_trans_handle *trans,
>>>>          int ret = 0;
>>>>
>>>>          list_for_each_entry_safe(block_group, tmp, &trans->new_bgs,
>>>> bg_list) {
>>>> -               list_del_init(&block_group->bg_list);
>>>>                  if (ret)
>>>> -                       continue;
>>>> +                       goto next;
>>>>
>>>>                  spin_lock(&block_group->lock);
>>>>                  memcpy(&item, &block_group->item, sizeof(item));
>>>> @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
>>>> btrfs_trans_handle *trans,
>>>>                                                 key.objectid,
>>>> key.offset);
>>>>                  if (ret)
>>>>                          btrfs_abort_transaction(trans, extent_root,
>>>> ret);
>>>> +next:
>>>> +               list_del_init(&block_group->bg_list);
>>>>          }
>>>>    }
>>>>
>>>
>>> I don't understand this change, logically it seems the same as what we
>>> had
>>> before.  Thanks,
>>
>>
>> It isn't. Before we would not turn the fs readonly if the transaction
>> had no blocks used but had new block groups. This change just makes it
>> turn into readonly mode if it has new block groups (even if no blocks
>> were used).
>> See the trace in the log where it shows we failed to finish the chunk
>> allocation but the transaction was aborted and didn't turn the fs into
>> readonly mode.
>>
>
> Yeah the bit below makes sense, its the above change that is the same.
> Before we deleted the entry from the list and then if there is an error we
> just continue, now if there is an error you go to next and delete the entry
> and continue, which is the same thing.  Thanks,

I don't see how it's the same.
Before if we had a single new block group, when we called
abort_transaction the list of block groups was empty because we
removed the bg from the list before calling abort_transaction. This
change just ensures that for the single new block group case
abort_transaction sees there's a new block group and takes the action
of making the fs readonly.

>
> Josef
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
  2014-11-26 16:25     ` Filipe David Manana
@ 2014-11-26 16:30       ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:30 UTC (permalink / raw)
  To: fdmanana; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On 11/26/2014 11:25 AM, Filipe David Manana wrote:
> On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>
>>> Our fs trim operation, which is completely transactionless (doesn't start
>>> or joins an existing transaction) consists of visiting all block groups
>>> and then for each one to iterate its free space entries and perform a
>>> discard operation against the space range represented by the free space
>>> entries. However before performing a discard, the corresponding free space
>>> entry is removed from the free space rbtree, and when the discard
>>> completes
>>> it is added back to the free space rbtree.
>>>
>>> If a block group remove operation happens while the discard is ongoing (or
>>> before it starts and after a free space entry is hidden), we end up not
>>> waiting for the discard to complete, remove the extent map that maps
>>> logical address to physical addresses and the corresponding chunk metadata
>>> from the the chunk and device trees. After that and before the discard
>>> completes, the current running transaction can finish and a new one start,
>>> allowing for new block groups that map to the same physical addresses to
>>> be allocated and written to.
>>>
>>> So fix this by keeping the extent map in memory until the discard
>>> completes
>>> so that the same physical addresses aren't reused before it completes.
>>>
>>> If the physical locations that are under a discard operation end up being
>>> used for a new metadata block group for example, and dirty metadata
>>> extents
>>> are written before the discard finishes (the VM might call writepages() of
>>> our btree inode's i_mapping for example, or an fsync log commit happens)
>>> we
>>> end up overwriting metadata with zeroes, which leads to errors from fsck
>>> like the following:
>>>
>>>           checking extents
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           read block failed check_tree_block
>>>           owner ref check failed [833912832 16384]
>>>           Errors found in extent allocation tree or chunk allocation
>>>           checking free space cache
>>>           checking fs roots
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           Check tree block failed, want=833912832, have=0
>>>           read block failed check_tree_block
>>>           root 5 root dir 256 error
>>>           root 5 inode 260 errors 2001, no inode item, link count wrong
>>>                   unresolved ref dir 256 index 0 namelen 8 name foobar_3
>>> filetype 1 errors 6, no dir index, no inode ref
>>>           root 5 inode 262 errors 2001, no inode item, link count wrong
>>>                   unresolved ref dir 256 index 0 namelen 8 name foobar_5
>>> filetype 1 errors 6, no dir index, no inode ref
>>>           root 5 inode 263 errors 2001, no inode item, link count wrong
>>>           (...)
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>    fs/btrfs/ctree.h            | 13 ++++++++++++-
>>>    fs/btrfs/disk-io.c          | 14 ++++++++++++++
>>>    fs/btrfs/extent-tree.c      | 24 +++++++++++++++++++++++-
>>>    fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++-
>>>    fs/btrfs/volumes.c          | 33 ++++++++++++++++++++++++++-------
>>>    5 files changed, 100 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 7f40a65..51056c7 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
>>>          unsigned int dirty:1;
>>>          unsigned int iref:1;
>>>          unsigned int has_caching_ctl:1;
>>> +       unsigned int removed:1;
>>>
>>>          int disk_cache_state;
>>>
>>> @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
>>>
>>>          /* For delayed block group creation or deletion of empty block
>>> groups */
>>>          struct list_head bg_list;
>>> +
>>> +       atomic_t trimming;
>>>    };
>>>
>>>    /* delayed seq elem */
>>> @@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
>>>
>>>          /* For btrfs to record security options */
>>>          struct security_mnt_opts security_opts;
>>> +
>>> +       /*
>>> +        * Chunks that can't be freed yet (under a trim/discard operation)
>>> +        * and will be latter freed.
>>> +        */
>>> +       rwlock_t pinned_chunks_lock;
>>> +       struct list_head pinned_chunks;
>>>    };
>>>
>>>    struct btrfs_subvolume_writers {
>>> @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle
>>> *trans,
>>>                             u64 type, u64 chunk_objectid, u64 chunk_offset,
>>>                             u64 size);
>>>    int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>> -                            struct btrfs_root *root, u64 group_start);
>>> +                            struct btrfs_root *root, u64 group_start,
>>> +                            bool *remove_em);
>>>    void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>>>    void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>>>                                         struct btrfs_root *root);
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 9d4fb0a..76012d0 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
>>>          init_waitqueue_head(&fs_info->transaction_blocked_wait);
>>>          init_waitqueue_head(&fs_info->async_submit_wait);
>>>
>>> +       rwlock_init(&fs_info->pinned_chunks_lock);
>>> +       INIT_LIST_HEAD(&fs_info->pinned_chunks);
>>> +
>>>          ret = btrfs_alloc_stripe_hash_table(fs_info);
>>>          if (ret) {
>>>                  err = ret;
>>> @@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
>>>
>>>          btrfs_free_block_rsv(root, root->orphan_block_rsv);
>>>          root->orphan_block_rsv = NULL;
>>> +
>>> +       write_lock(&fs_info->pinned_chunks_lock);
>>> +       while (!list_empty(&fs_info->pinned_chunks)) {
>>> +               struct extent_map *em;
>>> +
>>> +               em = list_first_entry(&fs_info->pinned_chunks,
>>> +                                     struct extent_map, list);
>>> +               list_del_init(&em->list);
>>> +               free_extent_map(em);
>>> +       }
>>> +       write_unlock(&fs_info->pinned_chunks_lock);
>>>    }
>>>
>>>    int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 92f61f2..4bf8f02 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root
>>> *root, u64 start, u64 size)
>>>          INIT_LIST_HEAD(&cache->cluster_list);
>>>          INIT_LIST_HEAD(&cache->bg_list);
>>>          btrfs_init_free_space_ctl(cache);
>>> +       atomic_set(&cache->trimming, 0);
>>>
>>>          return cache;
>>>    }
>>> @@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct
>>> btrfs_fs_info *fs_info, u64 flags)
>>>    }
>>>
>>>    int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>> -                            struct btrfs_root *root, u64 group_start)
>>> +                            struct btrfs_root *root, u64 group_start,
>>> +                            bool *remove_em)
>>>    {
>>>          struct btrfs_path *path;
>>>          struct btrfs_block_group_cache *block_group;
>>> @@ -9474,6 +9476,26 @@ int btrfs_remove_block_group(struct
>>> btrfs_trans_handle *trans,
>>>
>>>          memcpy(&key, &block_group->key, sizeof(key));
>>>
>>> +       spin_lock(&block_group->lock);
>>> +       block_group->removed = 1;
>>
>>
>> Ok you set block_group->removed here, but you don't check it in
>> btrfs_trim_block_group, so we can easily race in afterwards and start
>> trimming this block group.
>>
>>
>>> +       /*
>>> +        * At this point trimming can't start on this block group, because
>>> we
>>> +        * removed the block group from the tree
>>> fs_info->block_group_cache_tree
>>> +        * so no one can't find it anymore.
>>> +        *
>>> +        * And we must tell our caller to not remove the extent map from
>>> the
>>> +        * fs_info->mapping_tree to prevent the same logical address range
>>> and
>>> +        * physical device space ranges from being reused for a new block
>>> group.
>>> +        * This is because our fs trim operation (btrfs_trim_fs(),
>>> +        * btrfs_ioctl_fitrim()) is completely transactionless, so while
>>> its
>>> +        * trimming a range the currently running transaction might finish
>>> and
>>> +        * a new one start, allowing for new block groups to be created
>>> that can
>>> +        * reuse the same physical device locations unless we take this
>>> special
>>> +        * care.
>>> +        */
>>> +       *remove_em = (atomic_read(&block_group->trimming) == 0);
>>> +       spin_unlock(&block_group->lock);
>>> +
>>>          btrfs_put_block_group(block_group);
>>>          btrfs_put_block_group(block_group);
>>>
>>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>>> index 3384819..16c2d39 100644
>>> --- a/fs/btrfs/free-space-cache.c
>>> +++ b/fs/btrfs/free-space-cache.c
>>> @@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct
>>> btrfs_block_group_cache *block_group,
>>>
>>>          *trimmed = 0;
>>>
>>> +       atomic_inc(&block_group->trimming);
>>> +
>>
>>
>> This needs to be something like this
>>
>>
>> spin_lock(&block_group->lock);
>> if (block_group->removed == 1) {
>>          spin_unlock(&block_group->lock);
>>          return 0;
>> }
>> atomic_inc(&block_group->trimming);
>> spin_unlock(&block_group->lock);
>>
>> To be  properly safe.  Thanks,
>
> Hi Josef, it is safe.
> btrfs_trim_block_group() checks for ->removed after decrementing the
> atomic. Before it starts trimming, it increments the counter - if the
> block group removal already started it isn't a problem, because it
> removed all the free space entries from the rbtree while holding the
> tree's lock.

Ah ok that makes sense.  Will you add that to the comment cause it was 
not clear and still took me 5 minutes to figure out how it was safe. 
Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups
  2014-11-26 16:29         ` Filipe David Manana
@ 2014-11-26 16:32           ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:32 UTC (permalink / raw)
  To: fdmanana; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On 11/26/2014 11:29 AM, Filipe David Manana wrote:
> On Wed, Nov 26, 2014 at 4:19 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/26/2014 11:15 AM, Filipe David Manana wrote:
>>>
>>> On Wed, Nov 26, 2014 at 4:07 PM, Josef Bacik <jbacik@fb.com> wrote:
>>>>
>>>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>>>
>>>>>
>>>>> If the transaction handle doesn't have used blocks but has created new
>>>>> block
>>>>> groups make sure we turn the fs into readonly mode too. This is because
>>>>> the
>>>>> new block groups didn't get all their metadata persisted into the chunk
>>>>> and
>>>>> device trees, and therefore if a subsequent transaction starts,
>>>>> allocates
>>>>> space from the new block groups, writes data or metadata into that
>>>>> space,
>>>>> commits successfully and then after we unmount and mount the filesystem
>>>>> again, the same space can be allocated again for a new block group,
>>>>> resulting in file data or metadata corruption.
>>>>>
>>>>> Example where we don't abort the transaction when we fail to finish the
>>>>> chunk allocation (add items to the chunk and device trees) and later a
>>>>> future transaction where the block group is removed fails because it
>>>>> can't
>>>>> find the chunk item in the chunk tree:
>>>>>
>>>>> [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260
>>>>> __btrfs_abort_transaction+0x50/0xfc [btrfs]()
>>>>> [25230.404301] BTRFS: Transaction aborted (error -28)
>>>>> [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor
>>>>> raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2
>>>>> dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc
>>>>> loop
>>>>> psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr
>>>>> serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod
>>>>> cdrom
>>>>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>>>>> virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod
>>>>> virtio [last unloaded: btrfs]
>>>>> [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted
>>>>> 3.17.0-rc5-btrfs-next-1+ #1
>>>>> [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>> BIOS
>>>>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>>>> [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13
>>>>> ffff88004581bb50
>>>>> [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a
>>>>> 00000000ffffffe4
>>>>> [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800
>>>>> ffff88004581bba8
>>>>> [25230.404334] Call Trace:
>>>>> [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
>>>>> [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
>>>>> [25230.404351]  [<ffffffffa049386a>] ?
>>>>> __btrfs_abort_transaction+0x50/0xfc
>>>>> [btrfs]
>>>>> [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
>>>>> [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc
>>>>> [btrfs]
>>>>> [25230.404374]  [<ffffffffa04a8c43>]
>>>>> btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
>>>>> [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de
>>>>> [btrfs]
>>>>> [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12
>>>>> [btrfs]
>>>>> [25230.404408]  [<ffffffffa04a3d64>]
>>>>> btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
>>>>> [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d
>>>>> [btrfs]
>>>>> [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
>>>>> [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
>>>>> [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f
>>>>> [btrfs]
>>>>> [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
>>>>> [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
>>>>> [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
>>>>> [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
>>>>> [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
>>>>> [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
>>>>> [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
>>>>> [25230.404458] BTRFS warning (device sdc):
>>>>> btrfs_create_pending_block_groups:9228: Aborting unused transaction(No
>>>>> space
>>>>> left).
>>>>> [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509:
>>>>> errno=-2 No such entry (Failed lookup while freeing chunk.)
>>>>>
>>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>>> ---
>>>>>     fs/btrfs/extent-tree.c | 5 +++--
>>>>>     fs/btrfs/super.c       | 2 +-
>>>>>     2 files changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>>> index 4bf8f02..0a5e770 100644
>>>>> --- a/fs/btrfs/extent-tree.c
>>>>> +++ b/fs/btrfs/extent-tree.c
>>>>> @@ -9209,9 +9209,8 @@ void btrfs_create_pending_block_groups(struct
>>>>> btrfs_trans_handle *trans,
>>>>>           int ret = 0;
>>>>>
>>>>>           list_for_each_entry_safe(block_group, tmp, &trans->new_bgs,
>>>>> bg_list) {
>>>>> -               list_del_init(&block_group->bg_list);
>>>>>                   if (ret)
>>>>> -                       continue;
>>>>> +                       goto next;
>>>>>
>>>>>                   spin_lock(&block_group->lock);
>>>>>                   memcpy(&item, &block_group->item, sizeof(item));
>>>>> @@ -9226,6 +9225,8 @@ void btrfs_create_pending_block_groups(struct
>>>>> btrfs_trans_handle *trans,
>>>>>                                                  key.objectid,
>>>>> key.offset);
>>>>>                   if (ret)
>>>>>                           btrfs_abort_transaction(trans, extent_root,
>>>>> ret);
>>>>> +next:
>>>>> +               list_del_init(&block_group->bg_list);
>>>>>           }
>>>>>     }
>>>>>
>>>>
>>>> I don't understand this change, logically it seems the same as what we
>>>> had
>>>> before.  Thanks,
>>>
>>>
>>> It isn't. Before we would not turn the fs readonly if the transaction
>>> had no blocks used but had new block groups. This change just makes it
>>> turn into readonly mode if it has new block groups (even if no blocks
>>> were used).
>>> See the trace in the log where it shows we failed to finish the chunk
>>> allocation but the transaction was aborted and didn't turn the fs into
>>> readonly mode.
>>>
>>
>> Yeah the bit below makes sense, its the above change that is the same.
>> Before we deleted the entry from the list and then if there is an error we
>> just continue, now if there is an error you go to next and delete the entry
>> and continue, which is the same thing.  Thanks,
>
> I don't see how it's the same.
> Before if we had a single new block group, when we called
> abort_transaction the list of block groups was empty because we
> removed the bg from the list before calling abort_transaction. This
> change just ensures that for the single new block group case
> abort_transaction sees there's a new block group and takes the action
> of making the fs readonly.
>

Oooooh Jesus I see it now, sorry

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
  2014-11-26 16:24       ` Josef Bacik
@ 2014-11-26 16:34         ` Filipe David Manana
  2014-11-26 16:41           ` Josef Bacik
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe David Manana @ 2014-11-26 16:34 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On Wed, Nov 26, 2014 at 4:24 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 11/26/2014 11:09 AM, Filipe David Manana wrote:
>>
>> On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik <jbacik@fb.com> wrote:
>>>
>>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>>
>>>>
>>>> If we remove a block group (because it became empty), we might have left
>>>> a caching_ctl structure in fs_info->caching_block_groups that points to
>>>> the block group and is accessed at transaction commit time. This results
>>>> in accessing an invalid or incorrect block group. This issue became
>>>> visible
>>>> after Josef's patch "Btrfs: remove empty block groups automatically".
>>>>
>>>> So if the block group is removed make sure we don't leave a dangling
>>>> caching_ctl in caching_block_groups.
>>>>
>>>> Sample crash trace:
>>>>
>>>> [58380.439449] BUG: unable to handle kernel paging request at
>>>> ffff8801446eaeb8
>>>> [58380.439707] IP: [<ffffffffa03f6d05>]
>>>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>>>> [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
>>>> 80000001446ea060
>>>> [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>>> [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
>>>> auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
>>>> processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
>>>> thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
>>>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>>>> virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
>>>> virtio [last unloaded: btrfs]
>>>> [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G
>>>> W
>>>> 3.17.0-rc5-btrfs-next-1+ #1
>>>> [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>> BIOS
>>>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>>> [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti:
>>>> ffff88013896c000
>>>> [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>]
>>>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>>>> [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
>>>> [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX:
>>>> 0000000000000000
>>>> [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI:
>>>> ffff8801446eaeb8
>>>> [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09:
>>>> ffff88013896fc60
>>>> [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12:
>>>> ffff880222cae000
>>>> [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15:
>>>> ffff8801446eae00
>>>> [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000)
>>>> knlGS:0000000000000000
>>>> [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4:
>>>> 00000000000006e0
>>>> [58380.443238] Stack:
>>>> [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850
>>>> ffff880185e16800
>>>> [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00
>>>> 0000000000000000
>>>> [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0
>>>> ffff88013ac82090
>>>> [58380.443238] Call Trace:
>>>> [58380.443238]  [<ffffffffa03fe2d5>]
>>>> btrfs_prepare_extent_commit+0x5a/0xd7
>>>> [btrfs]
>>>> [58380.443238]  [<ffffffffa040fbcf>]
>>>> btrfs_commit_transaction+0x45c/0x882
>>>> [btrfs]
>>>> [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4
>>>> [btrfs]
>>>> [58380.443238]  [<ffffffffa040bf66>] ?
>>>> btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
>>>> [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
>>>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>>>> [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
>>>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>>>>
>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>> ---
>>>>    fs/btrfs/ctree.h       |  1 +
>>>>    fs/btrfs/extent-tree.c | 27 +++++++++++++++++++++++++++
>>>>    2 files changed, 28 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>> index d3ccd09..7f40a65 100644
>>>> --- a/fs/btrfs/ctree.h
>>>> +++ b/fs/btrfs/ctree.h
>>>> @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
>>>>          unsigned int ro:1;
>>>>          unsigned int dirty:1;
>>>>          unsigned int iref:1;
>>>> +       unsigned int has_caching_ctl:1;
>>>>
>>>
>>> Don't do this, just unconditionally call get_caching_control in
>>> btrfs_remove_block_group, then if we get something we can do stuff,
>>> otherwise we can just continue.  Thanks,
>>
>>
>> That's what I initially thought too. However, get_caching_control only
>> returns us the caching_ctl if block_group->cached ==
>> BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
>> has_caching_ctl flag is just to avoid holding the semaphore and search
>> through the list (since block_group->caching_ctl can be NULL but a
>> caching_ctl point to the block group can be in the list).
>>
>
> Oh God that's not good, we need to change get_caching_control to return if
> there is a caching control at all, since the other users want to wait for
> the fast caching to finish too.  So change that and then use it
> unconditionally.  I bet this has been causing us the random early ENOSPC
> problems.  Thanks,

Right, I think that's a separate change different from what I'm trying to fix.

When caching_thread finishes, it sets the block_group->caching_ctl to
NULL, and caching_ctl remains in the list until transaction commit
time. So doing that change, to make get_caching_control() always
return the block group's ->caching_ctl wouldn't solve this issue - we
still need to search in the list if block_group->caching_ctl is NULL.


>
> Josef
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] Btrfs: fix crash caused by block group removal
  2014-11-26 16:34         ` Filipe David Manana
@ 2014-11-26 16:41           ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2014-11-26 16:41 UTC (permalink / raw)
  To: fdmanana; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On 11/26/2014 11:34 AM, Filipe David Manana wrote:
> On Wed, Nov 26, 2014 at 4:24 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/26/2014 11:09 AM, Filipe David Manana wrote:
>>>
>>> On Wed, Nov 26, 2014 at 3:57 PM, Josef Bacik <jbacik@fb.com> wrote:
>>>>
>>>> On 11/26/2014 10:28 AM, Filipe Manana wrote:
>>>>>
>>>>>
>>>>> If we remove a block group (because it became empty), we might have left
>>>>> a caching_ctl structure in fs_info->caching_block_groups that points to
>>>>> the block group and is accessed at transaction commit time. This results
>>>>> in accessing an invalid or incorrect block group. This issue became
>>>>> visible
>>>>> after Josef's patch "Btrfs: remove empty block groups automatically".
>>>>>
>>>>> So if the block group is removed make sure we don't leave a dangling
>>>>> caching_ctl in caching_block_groups.
>>>>>
>>>>> Sample crash trace:
>>>>>
>>>>> [58380.439449] BUG: unable to handle kernel paging request at
>>>>> ffff8801446eaeb8
>>>>> [58380.439707] IP: [<ffffffffa03f6d05>]
>>>>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>>>>> [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE
>>>>> 80000001446ea060
>>>>> [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>>>> [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
>>>>> auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse
>>>>> processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore
>>>>> thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom
>>>>> ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common
>>>>> virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring
>>>>> virtio [last unloaded: btrfs]
>>>>> [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G
>>>>> W
>>>>> 3.17.0-rc5-btrfs-next-1+ #1
>>>>> [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>> BIOS
>>>>> rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
>>>>> [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti:
>>>>> ffff88013896c000
>>>>> [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>]
>>>>> block_group_cache_done.isra.21+0xc/0x1c [btrfs]
>>>>> [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
>>>>> [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX:
>>>>> 0000000000000000
>>>>> [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI:
>>>>> ffff8801446eaeb8
>>>>> [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09:
>>>>> ffff88013896fc60
>>>>> [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12:
>>>>> ffff880222cae000
>>>>> [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15:
>>>>> ffff8801446eae00
>>>>> [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000)
>>>>> knlGS:0000000000000000
>>>>> [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>> [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4:
>>>>> 00000000000006e0
>>>>> [58380.443238] Stack:
>>>>> [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850
>>>>> ffff880185e16800
>>>>> [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00
>>>>> 0000000000000000
>>>>> [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0
>>>>> ffff88013ac82090
>>>>> [58380.443238] Call Trace:
>>>>> [58380.443238]  [<ffffffffa03fe2d5>]
>>>>> btrfs_prepare_extent_commit+0x5a/0xd7
>>>>> [btrfs]
>>>>> [58380.443238]  [<ffffffffa040fbcf>]
>>>>> btrfs_commit_transaction+0x45c/0x882
>>>>> [btrfs]
>>>>> [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4
>>>>> [btrfs]
>>>>> [58380.443238]  [<ffffffffa040bf66>] ?
>>>>> btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
>>>>> [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
>>>>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>>>>> [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
>>>>> [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
>>>>>
>>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>>> ---
>>>>>     fs/btrfs/ctree.h       |  1 +
>>>>>     fs/btrfs/extent-tree.c | 27 +++++++++++++++++++++++++++
>>>>>     2 files changed, 28 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>> index d3ccd09..7f40a65 100644
>>>>> --- a/fs/btrfs/ctree.h
>>>>> +++ b/fs/btrfs/ctree.h
>>>>> @@ -1277,6 +1277,7 @@ struct btrfs_block_group_cache {
>>>>>           unsigned int ro:1;
>>>>>           unsigned int dirty:1;
>>>>>           unsigned int iref:1;
>>>>> +       unsigned int has_caching_ctl:1;
>>>>>
>>>>
>>>> Don't do this, just unconditionally call get_caching_control in
>>>> btrfs_remove_block_group, then if we get something we can do stuff,
>>>> otherwise we can just continue.  Thanks,
>>>
>>>
>>> That's what I initially thought too. However, get_caching_control only
>>> returns us the caching_ctl if block_group->cached ==
>>> BTRFS_CACHE_STARTED, so it's not enough to use it exclusively. The
>>> has_caching_ctl flag is just to avoid holding the semaphore and search
>>> through the list (since block_group->caching_ctl can be NULL but a
>>> caching_ctl point to the block group can be in the list).
>>>
>>
>> Oh God that's not good, we need to change get_caching_control to return if
>> there is a caching control at all, since the other users want to wait for
>> the fast caching to finish too.  So change that and then use it
>> unconditionally.  I bet this has been causing us the random early ENOSPC
>> problems.  Thanks,
>
> Right, I think that's a separate change different from what I'm trying to fix.
>
> When caching_thread finishes, it sets the block_group->caching_ctl to
> NULL, and caching_ctl remains in the list until transaction commit
> time. So doing that change, to make get_caching_control() always
> return the block group's ->caching_ctl wouldn't solve this issue - we
> still need to search in the list if block_group->caching_ctl is NULL.
>

/me bangs his head on the desk.  Ok we should probably remove the 
caching ctl from the list when we finish caching so we don't have this 
problem, but I'm fine doing that later.  I'll fix the 
get_caching_control thing here and send a patch.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
  2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
  2014-11-26 16:15   ` Josef Bacik
@ 2014-11-26 17:19   ` Filipe Manana
  2014-11-27  1:16   ` [PATCH v3 " Filipe Manana
  2014-11-27 21:14   ` [PATCH v4 " Filipe Manana
  3 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2014-11-26 17:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, Filipe Manana

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Added a comment to better explain the synchronization between block
    group removal and triming, as requested by Josef.

 fs/btrfs/ctree.h            | 13 ++++++++++++-
 fs/btrfs/disk-io.c          | 14 ++++++++++++++
 fs/btrfs/extent-tree.c      | 28 +++++++++++++++++++++++++++-
 fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++-
 fs/btrfs/volumes.c          | 33 ++++++++++++++++++++++++++-------
 5 files changed, 104 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..51056c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
 	unsigned int dirty:1;
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
+	unsigned int removed:1;
 
 	int disk_cache_state;
 
@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
 
 	/* For delayed block group creation or deletion of empty block groups */
 	struct list_head bg_list;
+
+	atomic_t trimming;
 };
 
 /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
+
+	/*
+	 * Chunks that can't be freed yet (under a trim/discard operation)
+	 * and will be latter freed.
+	 */
+	rwlock_t pinned_chunks_lock;
+	struct list_head pinned_chunks;
 };
 
 struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   u64 type, u64 chunk_objectid, u64 chunk_offset,
 			   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start);
+			     struct btrfs_root *root, u64 group_start,
+			     bool *remove_em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9d4fb0a..76012d0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 
+	rwlock_init(&fs_info->pinned_chunks_lock);
+	INIT_LIST_HEAD(&fs_info->pinned_chunks);
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
@@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
 
 	btrfs_free_block_rsv(root, root->orphan_block_rsv);
 	root->orphan_block_rsv = NULL;
+
+	write_lock(&fs_info->pinned_chunks_lock);
+	while (!list_empty(&fs_info->pinned_chunks)) {
+		struct extent_map *em;
+
+		em = list_first_entry(&fs_info->pinned_chunks,
+				      struct extent_map, list);
+		list_del_init(&em->list);
+		free_extent_map(em);
+	}
+	write_unlock(&fs_info->pinned_chunks_lock);
 }
 
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92f61f2..12dc053 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	btrfs_init_free_space_ctl(cache);
+	atomic_set(&cache->trimming, 0);
 
 	return cache;
 }
@@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 }
 
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start)
+			     struct btrfs_root *root, u64 group_start,
+			     bool *remove_em)
 {
 	struct btrfs_path *path;
 	struct btrfs_block_group_cache *block_group;
@@ -9474,6 +9476,30 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	memcpy(&key, &block_group->key, sizeof(key));
 
+	spin_lock(&block_group->lock);
+	block_group->removed = 1;
+	/*
+	 * At this point trimming can't start on this block group, because we
+	 * removed the block group from the tree fs_info->block_group_cache_tree
+	 * so no one can't find it anymore and even if someone already got this
+	 * block group before we removed it from the rbtree, they have already
+	 * incremented block_group->trimming (and we also removed already all
+	 * the free space entries from the block group's cache through the call
+	 * to btrfs_remove_free_space_cache()).
+	 *
+	 * And we must tell our caller to not remove the extent map from the
+	 * fs_info->mapping_tree to prevent the same logical address range and
+	 * physical device space ranges from being reused for a new block group.
+	 * This is because our fs trim operation (btrfs_trim_fs(),
+	 * btrfs_ioctl_fitrim()) is completely transactionless, so while its
+	 * trimming a range the currently running transaction might finish and
+	 * a new one start, allowing for new block groups to be created that can
+	 * reuse the same physical device locations unless we take this special
+	 * care.
+	 */
+	*remove_em = (atomic_read(&block_group->trimming) == 0);
+	spin_unlock(&block_group->lock);
+
 	btrfs_put_block_group(block_group);
 	btrfs_put_block_group(block_group);
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 3384819..16c2d39 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 
 	*trimmed = 0;
 
+	atomic_inc(&block_group->trimming);
+
 	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+out:
+	if (atomic_dec_and_test(&block_group->trimming) &&
+	    block_group->removed) {
+		struct extent_map_tree *em_tree;
+		struct extent_map *em;
+
+		em_tree = &block_group->fs_info->mapping_tree.map_tree;
+		write_lock(&em_tree->lock);
+		em = lookup_extent_mapping(em_tree, block_group->key.objectid,
+					   1);
+		BUG_ON(!em); /* logic error, can't happen */
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
+
+		write_lock(&block_group->fs_info->pinned_chunks_lock);
+		list_del_init(&em->list);
+		write_unlock(&block_group->fs_info->pinned_chunks_lock);
+
+		/* once for us and once for the tree */
+		free_extent_map(em);
+		free_extent_map(em);
+	}
 
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..e0ad613 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1069,9 +1069,14 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 				   u64 *start, u64 len)
 {
 	struct extent_map *em;
+	struct list_head *search_list = &trans->transaction->pending_chunks;
 	int ret = 0;
 
-	list_for_each_entry(em, &trans->transaction->pending_chunks, list) {
+again:
+	if (search_list == &trans->root->fs_info->pinned_chunks)
+		read_lock(&trans->root->fs_info->pinned_chunks_lock);
+
+	list_for_each_entry(em, search_list, list) {
 		struct map_lookup *map;
 		int i;
 
@@ -1088,6 +1093,12 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 			ret = 1;
 		}
 	}
+	if (search_list == &trans->root->fs_info->pinned_chunks) {
+		read_unlock(&trans->root->fs_info->pinned_chunks_lock);
+	} else {
+		search_list = &trans->root->fs_info->pinned_chunks;
+		goto again;
+	}
 
 	return ret;
 }
@@ -2579,6 +2590,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 	u64 chunk_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
 	u64 chunk_tree = root->fs_info->chunk_root->objectid;
 	int i, ret = 0;
+	bool remove_em;
 
 	/* Just in case */
 	root = root->fs_info->chunk_root;
@@ -2648,18 +2660,25 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset);
+	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset,
+				       &remove_em);
 	if (ret) {
 		btrfs_abort_transaction(trans, extent_root, ret);
 		goto out;
 	}
 
-	write_lock(&em_tree->lock);
-	remove_extent_mapping(em_tree, em);
-	write_unlock(&em_tree->lock);
+	if (remove_em) {
+		write_lock(&em_tree->lock);
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
 
-	/* once for the tree */
-	free_extent_map(em);
+		/* once for the tree */
+		free_extent_map(em);
+	} else {
+		write_lock(&root->fs_info->pinned_chunks_lock);
+		list_move_tail(&em->list, &root->fs_info->pinned_chunks);
+		write_unlock(&root->fs_info->pinned_chunks_lock);
+	}
 out:
 	/* once for us */
 	free_extent_map(em);
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
  2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
  2014-11-26 16:15   ` Josef Bacik
  2014-11-26 17:19   ` [PATCH v2 " Filipe Manana
@ 2014-11-27  1:16   ` Filipe Manana
  2014-11-27 21:14   ` [PATCH v4 " Filipe Manana
  3 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2014-11-27  1:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Added a comment to better explain the synchronization between block
    group removal and triming, as requested by Josef.

V3: Changed logic to move extent map to pinned chunks list while holding
    the block group's spinlock, and similarly to make trimming check for
    the need to remove the em from the pinned chunks list while holding
    that spinlock. This is to avoid the (unlikely) case where the trimmer
    attempts to remove the em before chunk removal adds it to the pinned
    chunks list, which would leave the em alive (but outside the mapping
    tree) until umount time.

 fs/btrfs/ctree.h            | 13 ++++++++++++-
 fs/btrfs/disk-io.c          | 14 ++++++++++++++
 fs/btrfs/extent-tree.c      | 43 ++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/free-space-cache.c | 37 ++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c          | 21 +++++++++++++--------
 5 files changed, 117 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..4a167a0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
 	unsigned int dirty:1;
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
+	unsigned int removed:1;
 
 	int disk_cache_state;
 
@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
 
 	/* For delayed block group creation or deletion of empty block groups */
 	struct list_head bg_list;
+
+	atomic_t trimming;
 };
 
 /* delayed seq elem */
@@ -1731,6 +1734,13 @@ struct btrfs_fs_info {
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
+
+	/*
+	 * Chunks that can't be freed yet (under a trim/discard operation)
+	 * and will be latter freed.
+	 */
+	rwlock_t pinned_chunks_lock;
+	struct list_head pinned_chunks;
 };
 
 struct btrfs_subvolume_writers {
@@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   u64 type, u64 chunk_objectid, u64 chunk_offset,
 			   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start);
+			     struct btrfs_root *root, u64 group_start,
+			     struct extent_map *em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9d4fb0a..76012d0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 
+	rwlock_init(&fs_info->pinned_chunks_lock);
+	INIT_LIST_HEAD(&fs_info->pinned_chunks);
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
@@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root)
 
 	btrfs_free_block_rsv(root, root->orphan_block_rsv);
 	root->orphan_block_rsv = NULL;
+
+	write_lock(&fs_info->pinned_chunks_lock);
+	while (!list_empty(&fs_info->pinned_chunks)) {
+		struct extent_map *em;
+
+		em = list_first_entry(&fs_info->pinned_chunks,
+				      struct extent_map, list);
+		list_del_init(&em->list);
+		free_extent_map(em);
+	}
+	write_unlock(&fs_info->pinned_chunks_lock);
 }
 
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92f61f2..010f46b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	btrfs_init_free_space_ctl(cache);
+	atomic_set(&cache->trimming, 0);
 
 	return cache;
 }
@@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 }
 
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start)
+			     struct btrfs_root *root, u64 group_start,
+			     struct extent_map *em)
 {
 	struct btrfs_path *path;
 	struct btrfs_block_group_cache *block_group;
@@ -9330,6 +9332,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	int index;
 	int factor;
 	struct btrfs_caching_control *caching_ctl = NULL;
+	bool remove_em;
+	struct extent_map_tree *em_tree = &root->fs_info->mapping_tree.map_tree;
 
 	root = root->fs_info->extent_root;
 
@@ -9474,6 +9478,43 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	memcpy(&key, &block_group->key, sizeof(key));
 
+	spin_lock(&block_group->lock);
+	block_group->removed = 1;
+	/*
+	 * At this point trimming can't start on this block group, because we
+	 * removed the block group from the tree fs_info->block_group_cache_tree
+	 * so no one can't find it anymore and even if someone already got this
+	 * block group before we removed it from the rbtree, they have already
+	 * incremented block_group->trimming - if they didn't, they won't find
+	 * any free space entries because we already removed them all when we
+	 * called btrfs_remove_free_space_cache().
+	 *
+	 * And we must tell our caller to not remove the extent map from the
+	 * fs_info->mapping_tree to prevent the same logical address range and
+	 * physical device space ranges from being reused for a new block group.
+	 * This is because our fs trim operation (btrfs_trim_fs(),
+	 * btrfs_ioctl_fitrim()) is completely transactionless, so while its
+	 * trimming a range the currently running transaction might finish and
+	 * a new one start, allowing for new block groups to be created that can
+	 * reuse the same physical device locations unless we take this special
+	 * care.
+	 */
+	remove_em = (atomic_read(&block_group->trimming) == 0);
+	if (!remove_em) {
+		write_lock(&root->fs_info->pinned_chunks_lock);
+		list_move_tail(&em->list, &root->fs_info->pinned_chunks);
+		write_unlock(&root->fs_info->pinned_chunks_lock);
+	}
+	spin_unlock(&block_group->lock);
+
+	if (remove_em) {
+		write_lock(&em_tree->lock);
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
+		/* once for the tree */
+		free_extent_map(em);
+	}
+
 	btrfs_put_block_group(block_group);
 	btrfs_put_block_group(block_group);
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 3384819..5309026 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3101,11 +3101,46 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 
 	*trimmed = 0;
 
+	spin_lock(&block_group->lock);
+	if (block_group->removed) {
+		spin_unlock(&block_group->lock);
+		return 0;
+	}
+	atomic_inc(&block_group->trimming);
+	spin_unlock(&block_group->lock);
+
 	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+out:
+	spin_lock(&block_group->lock);
+	if (atomic_dec_and_test(&block_group->trimming) &&
+	    block_group->removed) {
+		struct extent_map_tree *em_tree;
+		struct extent_map *em;
+
+		spin_unlock(&block_group->lock);
+
+		em_tree = &block_group->fs_info->mapping_tree.map_tree;
+		write_lock(&em_tree->lock);
+		em = lookup_extent_mapping(em_tree, block_group->key.objectid,
+					   1);
+		BUG_ON(!em); /* logic error, can't happen */
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
+
+		write_lock(&block_group->fs_info->pinned_chunks_lock);
+		list_del_init(&em->list);
+		write_unlock(&block_group->fs_info->pinned_chunks_lock);
+
+		/* once for us and once for the tree */
+		free_extent_map(em);
+		free_extent_map(em);
+	} else {
+		spin_unlock(&block_group->lock);
+	}
 
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..16a00bd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1069,9 +1069,14 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 				   u64 *start, u64 len)
 {
 	struct extent_map *em;
+	struct list_head *search_list = &trans->transaction->pending_chunks;
 	int ret = 0;
 
-	list_for_each_entry(em, &trans->transaction->pending_chunks, list) {
+again:
+	if (search_list == &trans->root->fs_info->pinned_chunks)
+		read_lock(&trans->root->fs_info->pinned_chunks_lock);
+
+	list_for_each_entry(em, search_list, list) {
 		struct map_lookup *map;
 		int i;
 
@@ -1088,6 +1093,12 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 			ret = 1;
 		}
 	}
+	if (search_list == &trans->root->fs_info->pinned_chunks) {
+		read_unlock(&trans->root->fs_info->pinned_chunks_lock);
+	} else {
+		search_list = &trans->root->fs_info->pinned_chunks;
+		goto again;
+	}
 
 	return ret;
 }
@@ -2648,18 +2659,12 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset);
+	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset, em);
 	if (ret) {
 		btrfs_abort_transaction(trans, extent_root, ret);
 		goto out;
 	}
 
-	write_lock(&em_tree->lock);
-	remove_extent_mapping(em_tree, em);
-	write_unlock(&em_tree->lock);
-
-	/* once for the tree */
-	free_extent_map(em);
 out:
 	/* once for us */
 	free_extent_map(em);
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v4 4/6] Btrfs: fix race between fs trimming and block group remove/allocation
  2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
                     ` (2 preceding siblings ...)
  2014-11-27  1:16   ` [PATCH v3 " Filipe Manana
@ 2014-11-27 21:14   ` Filipe Manana
  3 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2014-11-27 21:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Added a comment to better explain the synchronization between block
    group removal and triming, as requested by Josef.

V3: Changed logic to move extent map to pinned chunks list while holding
    the block group's spinlock, and similarly to make trimming check for
    the need to remove the em from the pinned chunks list while holding
    that spinlock. This is to avoid the (unlikely) case where the trimmer
    attempts to remove the em before chunk removal adds it to the pinned
    chunks list, which would leave the em alive (but outside the mapping
    tree) until umount time.

V4: Added more comments about the synchronization between block group
    removal and trim tasks. Made it safer to move the em into the pinned
    chunks list, because it might currently be in transaction->pending_chunks,
    (which is protected by the chunk mutex and used during chunk allocation)
    and block/chunk remove isn't invoked with the chunk mutex held.

 fs/btrfs/ctree.h            | 12 ++++++++-
 fs/btrfs/disk-io.c          | 13 ++++++++++
 fs/btrfs/extent-tree.c      | 60 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/free-space-cache.c | 38 +++++++++++++++++++++++++++-
 fs/btrfs/volumes.c          | 26 ++++++--------------
 fs/btrfs/volumes.h          | 12 +++++++++
 6 files changed, 140 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f40a65..c7e5f2a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache {
 	unsigned int dirty:1;
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
+	unsigned int removed:1;
 
 	int disk_cache_state;
 
@@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache {
 
 	/* For delayed block group creation or deletion of empty block groups */
 	struct list_head bg_list;
+
+	atomic_t trimming;
 };
 
 /* delayed seq elem */
@@ -1731,6 +1734,12 @@ struct btrfs_fs_info {
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
+
+	/*
+	 * Chunks that can't be freed yet (under a trim/discard operation)
+	 * and will be latter freed. Protected by fs_info->chunk_mutex.
+	 */
+	struct list_head pinned_chunks;
 };
 
 struct btrfs_subvolume_writers {
@@ -3353,7 +3362,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   u64 type, u64 chunk_objectid, u64 chunk_offset,
 			   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start);
+			     struct btrfs_root *root, u64 group_start,
+			     struct extent_map *em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9d4fb0a..7e2405a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2397,6 +2397,8 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 
+	INIT_LIST_HEAD(&fs_info->pinned_chunks);
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
@@ -3726,6 +3728,17 @@ void close_ctree(struct btrfs_root *root)
 
 	btrfs_free_block_rsv(root, root->orphan_block_rsv);
 	root->orphan_block_rsv = NULL;
+
+	lock_chunks(root);
+	while (!list_empty(&fs_info->pinned_chunks)) {
+		struct extent_map *em;
+
+		em = list_first_entry(&fs_info->pinned_chunks,
+				      struct extent_map, list);
+		list_del_init(&em->list);
+		free_extent_map(em);
+	}
+	unlock_chunks(root);
 }
 
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92f61f2..a40979a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	btrfs_init_free_space_ctl(cache);
+	atomic_set(&cache->trimming, 0);
 
 	return cache;
 }
@@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 }
 
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-			     struct btrfs_root *root, u64 group_start)
+			     struct btrfs_root *root, u64 group_start,
+			     struct extent_map *em)
 {
 	struct btrfs_path *path;
 	struct btrfs_block_group_cache *block_group;
@@ -9330,6 +9332,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	int index;
 	int factor;
 	struct btrfs_caching_control *caching_ctl = NULL;
+	bool remove_em;
 
 	root = root->fs_info->extent_root;
 
@@ -9474,6 +9477,61 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	memcpy(&key, &block_group->key, sizeof(key));
 
+	lock_chunks(root);
+	spin_lock(&block_group->lock);
+	block_group->removed = 1;
+	/*
+	 * At this point trimming can't start on this block group, because we
+	 * removed the block group from the tree fs_info->block_group_cache_tree
+	 * so no one can't find it anymore and even if someone already got this
+	 * block group before we removed it from the rbtree, they have already
+	 * incremented block_group->trimming - if they didn't, they won't find
+	 * any free space entries because we already removed them all when we
+	 * called btrfs_remove_free_space_cache().
+	 *
+	 * And we must not remove the extent map from the fs_info->mapping_tree
+	 * to prevent the same logical address range and physical device space
+	 * ranges from being reused for a new block group. This is because our
+	 * fs trim operation (btrfs_trim_fs() / btrfs_ioctl_fitrim()) is
+	 * completely transactionless, so while it is trimming a range the
+	 * currently running transaction might finish and a new one start,
+	 * allowing for new block groups to be created that can reuse the same
+	 * physical device locations unless we take this special care.
+	 */
+	remove_em = (atomic_read(&block_group->trimming) == 0);
+	/*
+	 * Make sure a trimmer task always sees the em in the pinned_chunks list
+	 * if it sees block_group->removed == 1 (needs to lock block_group->lock
+	 * before checking block_group->removed).
+	 */
+	if (!remove_em) {
+		/*
+		 * Our em might be in trans->transaction->pending_chunks which
+		 * is protected by fs_info->chunk_mutex ([lock|unlock]_chunks),
+		 * and so is the fs_info->pinned_chunks list.
+		 *
+		 * So at this point we must be holding the chunk_mutex to avoid
+		 * any races with chunk allocation (more specifically at
+		 * volumes.c:contains_pending_extent()), to ensure it always
+		 * sees the em, either in the pending_chunks list or in the
+		 * pinned_chunks list.
+		 */
+		list_move_tail(&em->list, &root->fs_info->pinned_chunks);
+	}
+	spin_unlock(&block_group->lock);
+	unlock_chunks(root);
+
+	if (remove_em) {
+		struct extent_map_tree *em_tree;
+
+		em_tree = &root->fs_info->mapping_tree.map_tree;
+		write_lock(&em_tree->lock);
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
+		/* once for the tree */
+		free_extent_map(em);
+	}
+
 	btrfs_put_block_group(block_group);
 	btrfs_put_block_group(block_group);
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 3384819..0ddc114 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -27,6 +27,7 @@
 #include "disk-io.h"
 #include "extent_io.h"
 #include "inode-map.h"
+#include "volumes.h"
 
 #define BITS_PER_BITMAP		(PAGE_CACHE_SIZE * 8)
 #define MAX_CACHE_BYTES_PER_GIG	(32 * 1024)
@@ -3101,11 +3102,46 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 
 	*trimmed = 0;
 
+	spin_lock(&block_group->lock);
+	if (block_group->removed) {
+		spin_unlock(&block_group->lock);
+		return 0;
+	}
+	atomic_inc(&block_group->trimming);
+	spin_unlock(&block_group->lock);
+
 	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+out:
+	spin_lock(&block_group->lock);
+	if (atomic_dec_and_test(&block_group->trimming) &&
+	    block_group->removed) {
+		struct extent_map_tree *em_tree;
+		struct extent_map *em;
+
+		spin_unlock(&block_group->lock);
+
+		em_tree = &block_group->fs_info->mapping_tree.map_tree;
+		write_lock(&em_tree->lock);
+		em = lookup_extent_mapping(em_tree, block_group->key.objectid,
+					   1);
+		BUG_ON(!em); /* logic error, can't happen */
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
+
+		lock_chunks(block_group->fs_info->chunk_root);
+		list_del_init(&em->list);
+		unlock_chunks(block_group->fs_info->chunk_root);
+
+		/* once for us and once for the tree */
+		free_extent_map(em);
+		free_extent_map(em);
+	} else {
+		spin_unlock(&block_group->lock);
+	}
 
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..66a5a1e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -53,16 +53,6 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 
-static void lock_chunks(struct btrfs_root *root)
-{
-	mutex_lock(&root->fs_info->chunk_mutex);
-}
-
-static void unlock_chunks(struct btrfs_root *root)
-{
-	mutex_unlock(&root->fs_info->chunk_mutex);
-}
-
 static struct btrfs_fs_devices *__alloc_fs_devices(void)
 {
 	struct btrfs_fs_devices *fs_devs;
@@ -1069,9 +1059,11 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 				   u64 *start, u64 len)
 {
 	struct extent_map *em;
+	struct list_head *search_list = &trans->transaction->pending_chunks;
 	int ret = 0;
 
-	list_for_each_entry(em, &trans->transaction->pending_chunks, list) {
+again:
+	list_for_each_entry(em, search_list, list) {
 		struct map_lookup *map;
 		int i;
 
@@ -1088,6 +1080,10 @@ static int contains_pending_extent(struct btrfs_trans_handle *trans,
 			ret = 1;
 		}
 	}
+	if (search_list == &trans->transaction->pending_chunks) {
+		search_list = &trans->root->fs_info->pinned_chunks;
+		goto again;
+	}
 
 	return ret;
 }
@@ -2648,18 +2644,12 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset);
+	ret = btrfs_remove_block_group(trans, extent_root, chunk_offset, em);
 	if (ret) {
 		btrfs_abort_transaction(trans, extent_root, ret);
 		goto out;
 	}
 
-	write_lock(&em_tree->lock);
-	remove_extent_mapping(em_tree, em);
-	write_unlock(&em_tree->lock);
-
-	/* once for the tree */
-	free_extent_map(em);
 out:
 	/* once for us */
 	free_extent_map(em);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 08980fa..479c238 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -513,4 +513,16 @@ static inline void btrfs_dev_stat_reset(struct btrfs_device *dev,
 void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info);
 void btrfs_update_commit_device_bytes_used(struct btrfs_root *root,
 					struct btrfs_transaction *transaction);
+
+static inline void lock_chunks(struct btrfs_root *root)
+{
+	mutex_lock(&root->fs_info->chunk_mutex);
+}
+
+static inline void unlock_chunks(struct btrfs_root *root)
+{
+	mutex_unlock(&root->fs_info->chunk_mutex);
+}
+
+
 #endif
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 5/6] Btrfs: fix race between writing free space cache and trimming
  2014-11-26 15:28 ` [PATCH 5/6] Btrfs: fix race between writing free space cache and trimming Filipe Manana
@ 2014-12-01 17:04   ` Filipe Manana
  0 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2014-12-01 17:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:

        *** fsck.btrfs output ***
        checking extents
        checking free space cache
        There is no free space entry for 521764864-521781248
        There is no free space entry for 521764864-1103101952
        cache appears valid but isnt 29360128
        Checking filesystem on /dev/sdc
        UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
        found 1235681286 bytes used err is -22
        (...)

Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The later case results in the following crash:

[55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
[55650.807166] RIP: 0010:[<ffffffffa037cf45>]  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:ffff88007153bc30  EFLAGS: 00010246
[55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
[55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
[55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
[55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
[55650.808017] FS:  0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
[55650.808017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
[55650.808017] Stack:
[55650.808017]  ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
[55650.808017]  ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
[55650.808017]  ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
[55650.808017] Call Trace:
[55650.808017]  [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017]  [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017]  [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
[55650.808017]  [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017]  [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
[55650.808017]  [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
[55650.808017]  [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017]  [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[55650.808017]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017]  RSP <ffff88007153bc30>
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---

Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Enlonged the critical section to include the cache writeout of bitmaps,
    since I ran into this recently. The issue is that a concurrent trim can
    modify or free the bitmaps while the cache writeout task is using them.
    Updated commit message with crash trace example.

 fs/btrfs/free-space-cache.c | 73 +++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/free-space-cache.h |  2 ++
 fs/btrfs/inode-map.c        |  2 ++
 3 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 0ddc114..2ee73c2 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -32,6 +32,12 @@
 #define BITS_PER_BITMAP		(PAGE_CACHE_SIZE * 8)
 #define MAX_CACHE_BYTES_PER_GIG	(32 * 1024)
 
+struct btrfs_trim_range {
+	u64 start;
+	u64 bytes;
+	struct list_head list;
+};
+
 static int link_free_space(struct btrfs_free_space_ctl *ctl,
 			   struct btrfs_free_space *info);
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
@@ -882,6 +888,7 @@ int write_cache_extent_entries(struct io_ctl *io_ctl,
 	int ret;
 	struct btrfs_free_cluster *cluster = NULL;
 	struct rb_node *node = rb_first(&ctl->free_space_offset);
+	struct btrfs_trim_range *trim_entry;
 
 	/* Get the cluster for this block_group if it exists */
 	if (block_group && !list_empty(&block_group->cluster_list)) {
@@ -917,6 +924,21 @@ int write_cache_extent_entries(struct io_ctl *io_ctl,
 			cluster = NULL;
 		}
 	}
+
+	/*
+	 * Make sure we don't miss any range that was removed from our rbtree
+	 * because trimming is running. Otherwise after a umount+mount (or crash
+	 * after committing the transaction) we would leak free space and get
+	 * an inconsistent free space cache report from fsck.
+	 */
+	list_for_each_entry(trim_entry, &ctl->trimming_ranges, list) {
+		ret = io_ctl_add_entry(io_ctl, trim_entry->start,
+				       trim_entry->bytes, NULL);
+		if (ret)
+			goto fail;
+		*entries += 1;
+	}
+
 	return 0;
 fail:
 	return -ENOSPC;
@@ -1136,12 +1158,15 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
 
 	io_ctl_set_generation(&io_ctl, trans->transid);
 
+	mutex_lock(&ctl->cache_writeout_mutex);
 	/* Write out the extent entries in the free space cache */
 	ret = write_cache_extent_entries(&io_ctl, ctl,
 					 block_group, &entries, &bitmaps,
 					 &bitmap_list);
-	if (ret)
+	if (ret) {
+		mutex_unlock(&ctl->cache_writeout_mutex);
 		goto out_nospc;
+	}
 
 	/*
 	 * Some spaces that are freed in the current transaction are pinned,
@@ -1149,11 +1174,18 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
 	 * committed, we shouldn't lose them.
 	 */
 	ret = write_pinned_extent_entries(root, block_group, &io_ctl, &entries);
-	if (ret)
+	if (ret) {
+		mutex_unlock(&ctl->cache_writeout_mutex);
 		goto out_nospc;
+	}
 
-	/* At last, we write out all the bitmaps. */
+	/*
+	 * At last, we write out all the bitmaps and keep cache_writeout_mutex
+	 * locked while doing it because a concurrent trim can be manipulating
+	 * or freeing the bitmap.
+	 */
 	ret = write_bitmap_entries(&io_ctl, &bitmap_list);
+	mutex_unlock(&ctl->cache_writeout_mutex);
 	if (ret)
 		goto out_nospc;
 
@@ -2296,6 +2328,8 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group)
 	ctl->start = block_group->key.objectid;
 	ctl->private = block_group;
 	ctl->op = &free_space_op;
+	INIT_LIST_HEAD(&ctl->trimming_ranges);
+	mutex_init(&ctl->cache_writeout_mutex);
 
 	/*
 	 * we only want to have 32k of ram per block group for keeping
@@ -2912,10 +2946,12 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 
 static int do_trimming(struct btrfs_block_group_cache *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
-		       u64 reserved_start, u64 reserved_bytes)
+		       u64 reserved_start, u64 reserved_bytes,
+		       struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
 	u64 trimmed = 0;
@@ -2935,7 +2971,10 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 	if (!ret)
 		*total_trimmed += trimmed;
 
+	mutex_lock(&ctl->cache_writeout_mutex);
 	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	list_del(&trim_entry->list);
+	mutex_unlock(&ctl->cache_writeout_mutex);
 
 	if (update) {
 		spin_lock(&space_info->lock);
@@ -2963,16 +3002,21 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 	u64 bytes;
 
 	while (start < end) {
+		struct btrfs_trim_range trim_entry;
+
+		mutex_lock(&ctl->cache_writeout_mutex);
 		spin_lock(&ctl->tree_lock);
 
 		if (ctl->free_space < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
 		entry = tree_search_offset(ctl, start, 0, 1);
 		if (!entry) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
@@ -2981,6 +3025,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 			node = rb_next(&entry->offset_index);
 			if (!node) {
 				spin_unlock(&ctl->tree_lock);
+				mutex_unlock(&ctl->cache_writeout_mutex);
 				goto out;
 			}
 			entry = rb_entry(node, struct btrfs_free_space,
@@ -2989,6 +3034,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 
 		if (entry->offset >= end) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
@@ -2998,6 +3044,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			goto next;
 		}
 
@@ -3005,9 +3052,13 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		kmem_cache_free(btrfs_free_space_cachep, entry);
 
 		spin_unlock(&ctl->tree_lock);
+		trim_entry.start = extent_start;
+		trim_entry.bytes = extent_bytes;
+		list_add_tail(&trim_entry.list, &ctl->trimming_ranges);
+		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes);
+				  extent_start, extent_bytes, &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3036,17 +3087,21 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 
 	while (offset < end) {
 		bool next_bitmap = false;
+		struct btrfs_trim_range trim_entry;
 
+		mutex_lock(&ctl->cache_writeout_mutex);
 		spin_lock(&ctl->tree_lock);
 
 		if (ctl->free_space < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			break;
 		}
 
 		entry = tree_search_offset(ctl, offset, 1, 0);
 		if (!entry) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			next_bitmap = true;
 			goto next;
 		}
@@ -3055,6 +3110,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		ret2 = search_bitmap(ctl, entry, &start, &bytes);
 		if (ret2 || start >= end) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			next_bitmap = true;
 			goto next;
 		}
@@ -3062,6 +3118,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		bytes = min(bytes, end - start);
 		if (bytes < minlen) {
 			spin_unlock(&ctl->tree_lock);
+			mutex_unlock(&ctl->cache_writeout_mutex);
 			goto next;
 		}
 
@@ -3070,9 +3127,13 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 			free_bitmap(ctl, entry);
 
 		spin_unlock(&ctl->tree_lock);
+		trim_entry.start = start;
+		trim_entry.bytes = bytes;
+		list_add_tail(&trim_entry.list, &ctl->trimming_ranges);
+		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes);
+				  start, bytes, &trim_entry);
 		if (ret)
 			break;
 next:
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 0cf4977..88b2238 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -38,6 +38,8 @@ struct btrfs_free_space_ctl {
 	u64 start;
 	struct btrfs_free_space_op *op;
 	void *private;
+	struct mutex cache_writeout_mutex;
+	struct list_head trimming_ranges;
 };
 
 struct btrfs_free_space_op {
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 83d646b..81efd83 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -364,6 +364,8 @@ void btrfs_init_free_ino_ctl(struct btrfs_root *root)
 	ctl->start = 0;
 	ctl->private = NULL;
 	ctl->op = &free_ino_op;
+	INIT_LIST_HEAD(&ctl->trimming_ranges);
+	mutex_init(&ctl->cache_writeout_mutex);
 
 	/*
 	 * Initially we allow to use 16K of ram to cache chunks of
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-12-01 17:04 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-26 15:28 [PATCH 0/6] Btrfs: fixes for block group remove/allocation and trim/discard Filipe Manana
2014-11-26 15:28 ` [PATCH 1/6] Btrfs: fix invalid block group rbtree access after bg is removed Filipe Manana
2014-11-26 15:58   ` Josef Bacik
2014-11-26 15:28 ` [PATCH 2/6] Btrfs: fix crash caused by block group removal Filipe Manana
2014-11-26 15:57   ` Josef Bacik
2014-11-26 16:09     ` Filipe David Manana
2014-11-26 16:24       ` Josef Bacik
2014-11-26 16:34         ` Filipe David Manana
2014-11-26 16:41           ` Josef Bacik
2014-11-26 15:28 ` [PATCH 3/6] Btrfs: fix freeing used extents after removing empty block group Filipe Manana
2014-11-26 16:02   ` Josef Bacik
2014-11-26 15:28 ` [PATCH 4/6] Btrfs: fix race between fs trimming and block group remove/allocation Filipe Manana
2014-11-26 16:15   ` Josef Bacik
2014-11-26 16:25     ` Filipe David Manana
2014-11-26 16:30       ` Josef Bacik
2014-11-26 17:19   ` [PATCH v2 " Filipe Manana
2014-11-27  1:16   ` [PATCH v3 " Filipe Manana
2014-11-27 21:14   ` [PATCH v4 " Filipe Manana
2014-11-26 15:28 ` [PATCH 5/6] Btrfs: fix race between writing free space cache and trimming Filipe Manana
2014-12-01 17:04   ` [PATCH v2 " Filipe Manana
2014-11-26 15:28 ` [PATCH 6/6] Btrfs: make btrfs_abort_transaction consider existence of new block groups Filipe Manana
2014-11-26 16:07   ` Josef Bacik
2014-11-26 16:15     ` Filipe David Manana
2014-11-26 16:19       ` Josef Bacik
2014-11-26 16:29         ` Filipe David Manana
2014-11-26 16:32           ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).