linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue
@ 2016-07-25  7:51 Wang Xiaoguang
  2016-07-25  7:51 ` [PATCH v2 1/4] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() Wang Xiaoguang
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-25  7:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, jbacik, holger

Currently in btrfs, for data space reservation, bytes_may_use is not
updated in btrfs_update_reserved_bytes(); the decrease operation is
delayed until extent_clear_unlock_delalloc(). For fallocate(2), the
decrease is delayed even further, to the end of btrfs_fallocate(), which
is too late. Obviously such delays put unnecessary pressure on the
enospc system.

So in this patch set, we will remove RESERVE_FREE, RESERVE_ALLOC and
RESERVE_ALLOC_NO_ACCOUNT, and always update bytes_may_use timely.

I have already sent an fstests test case for this issue. [PATCH 4/4]
could be sent as an independent patch, but the bug it fixes can also be
revealed by the same reproducer script, so I include it here.

Changelog:
v2:
  Fix a trace point issue.

Wang Xiaoguang (4):
  btrfs: use correct offset for reloc_inode in
    prealloc_file_extent_cluster()
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: should block unused block groups deletion work when allocating
    data space

 fs/btrfs/ctree.h       |   3 +-
 fs/btrfs/disk-io.c     |   1 +
 fs/btrfs/extent-tree.c | 171 ++++++++++++++++++++++++++++---------------------
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file.c        |  26 ++++----
 fs/btrfs/inode-map.c   |   3 +-
 fs/btrfs/inode.c       |  37 ++++++++---
 fs/btrfs/relocation.c  |  17 +++--
 8 files changed, 159 insertions(+), 100 deletions(-)

-- 
2.9.0

* [PATCH v2 1/4] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
  2016-07-25  7:51 [PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue Wang Xiaoguang
@ 2016-07-25  7:51 ` Wang Xiaoguang
  2016-07-25  7:51 ` [PATCH v2 2/4] btrfs: divide btrfs_update_reserved_bytes() into two functions Wang Xiaoguang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-25  7:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, jbacik, holger

In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
the wrong file offset for reloc_inode: it passes cluster->start and
cluster->end, which are actually extent bytenrs. The correct values are
cluster->[start|end] minus the block group's start bytenr.

start bytenr   cluster->start
|              |     extent      |   extent   | ...| extent |
|----------------------------------------------------------------|
|                block group reloc_inode                         |

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/relocation.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..a0de885 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3030,12 +3030,14 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	u64 num_bytes;
 	int nr = 0;
 	int ret = 0;
+	u64 prealloc_start = cluster->start - offset;
+	u64 prealloc_end = cluster->end - offset;
 
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	inode_lock(inode);
 
-	ret = btrfs_check_data_free_space(inode, cluster->start,
-					  cluster->end + 1 - cluster->start);
+	ret = btrfs_check_data_free_space(inode, prealloc_start,
+					  prealloc_end + 1 - prealloc_start);
 	if (ret)
 		goto out;
 
@@ -3056,8 +3058,8 @@ int prealloc_file_extent_cluster(struct inode *inode,
 			break;
 		nr++;
 	}
-	btrfs_free_reserved_data_space(inode, cluster->start,
-				       cluster->end + 1 - cluster->start);
+	btrfs_free_reserved_data_space(inode, prealloc_start,
+				       prealloc_end + 1 - prealloc_start);
 out:
 	inode_unlock(inode);
 	return ret;
-- 
2.9.0

* [PATCH v2 2/4] btrfs: divide btrfs_update_reserved_bytes() into two functions
  2016-07-25  7:51 [PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue Wang Xiaoguang
  2016-07-25  7:51 ` [PATCH v2 1/4] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() Wang Xiaoguang
@ 2016-07-25  7:51 ` Wang Xiaoguang
  2016-07-25  7:51 ` [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely Wang Xiaoguang
  2016-07-25  7:51 ` [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
  3 siblings, 0 replies; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-25  7:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, jbacik, holger

This patch divides btrfs_update_reserved_bytes() into
btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(). The next
patch will extend btrfs_add_reserved_bytes() to fix some false ENOSPC
errors; see that patch for the details.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 97 +++++++++++++++++++++++++++++---------------------
 1 file changed, 57 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82b912a..8eaac39 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -104,9 +104,10 @@ static int find_next_key(struct btrfs_path *path, int level,
 			 struct btrfs_key *key);
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
 			    int dump_block_groups);
-static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
-				       u64 num_bytes, int reserve,
-				       int delalloc);
+static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+				    u64 num_bytes, int reserve, int delalloc);
+static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache,
+				     u64 num_bytes, int delalloc);
 static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
 			       u64 num_bytes);
 int btrfs_pin_extent(struct btrfs_root *root,
@@ -6297,19 +6298,14 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 }
 
 /**
- * btrfs_update_reserved_bytes - update the block_group and space info counters
+ * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache:	The cache we are manipulating
  * @num_bytes:	The number of bytes in question
  * @reserve:	One of the reservation enums
  * @delalloc:   The blocks are allocated for the delalloc write
  *
- * This is called by the allocator when it reserves space, or by somebody who is
- * freeing space that was never actually used on disk.  For example if you
- * reserve some space for a new leaf in transaction A and before transaction A
- * commits you free that leaf, you call this with reserve set to 0 in order to
- * clear the reservation.
- *
- * Metadata reservations should be called with RESERVE_ALLOC so we do the proper
+ * This is called by the allocator when it reserves space. Metadata
+ * reservations should be called with RESERVE_ALLOC so we do the proper
  * ENOSPC accounting.  For data we handle the reservation through clearing the
  * delalloc bits in the io_tree.  We have to do this since we could end up
  * allocating less disk space for the amount of data we have reserved in the
@@ -6319,44 +6315,65 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
  * make the reservation and return -EAGAIN, otherwise this function always
  * succeeds.
  */
-static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
-				       u64 num_bytes, int reserve, int delalloc)
+static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+				    u64 num_bytes, int reserve, int delalloc)
 {
 	struct btrfs_space_info *space_info = cache->space_info;
 	int ret = 0;
 
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
-	if (reserve != RESERVE_FREE) {
-		if (cache->ro) {
-			ret = -EAGAIN;
-		} else {
-			cache->reserved += num_bytes;
-			space_info->bytes_reserved += num_bytes;
-			if (reserve == RESERVE_ALLOC) {
-				trace_btrfs_space_reservation(cache->fs_info,
-						"space_info", space_info->flags,
-						num_bytes, 0);
-				space_info->bytes_may_use -= num_bytes;
-			}
-
-			if (delalloc)
-				cache->delalloc_bytes += num_bytes;
-		}
+	if (cache->ro) {
+		ret = -EAGAIN;
 	} else {
-		if (cache->ro)
-			space_info->bytes_readonly += num_bytes;
-		cache->reserved -= num_bytes;
-		space_info->bytes_reserved -= num_bytes;
+		cache->reserved += num_bytes;
+		space_info->bytes_reserved += num_bytes;
+		if (reserve == RESERVE_ALLOC) {
+			trace_btrfs_space_reservation(cache->fs_info,
+					"space_info", space_info->flags,
+					num_bytes, 0);
+			space_info->bytes_may_use -= num_bytes;
+		}
 
 		if (delalloc)
-			cache->delalloc_bytes -= num_bytes;
+			cache->delalloc_bytes += num_bytes;
 	}
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
 }
 
+/**
+ * btrfs_free_reserved_bytes - update the block_group and space info counters
+ * @cache:      The cache we are manipulating
+ * @num_bytes:  The number of bytes in question
+ * @delalloc:   The blocks are allocated for the delalloc write
+ *
+ * This is called by somebody who is freeing space that was never actually used
+ * on disk.  For example if you reserve some space for a new leaf in transaction
+ * A and before transaction A commits you free that leaf, you call this
+ * function to clear the reservation.
+ */
+
+static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache,
+				     u64 num_bytes, int delalloc)
+{
+	struct btrfs_space_info *space_info = cache->space_info;
+	int ret = 0;
+
+	spin_lock(&space_info->lock);
+	spin_lock(&cache->lock);
+	if (cache->ro)
+		space_info->bytes_readonly += num_bytes;
+	cache->reserved -= num_bytes;
+	space_info->bytes_reserved -= num_bytes;
+
+	if (delalloc)
+		cache->delalloc_bytes -= num_bytes;
+	spin_unlock(&cache->lock);
+	spin_unlock(&space_info->lock);
+	return ret;
+}
 void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root)
 {
@@ -6976,7 +6993,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
-		btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE, 0);
+		btrfs_free_reserved_bytes(cache, buf->len, 0);
 		btrfs_put_block_group(cache);
 		trace_btrfs_reserved_extent_free(root, buf->start, buf->len);
 		pin = 0;
@@ -7548,8 +7565,8 @@ checks:
 					     search_start - offset);
 		BUG_ON(offset > search_start);
 
-		ret = btrfs_update_reserved_bytes(block_group, num_bytes,
-						  alloc_type, delalloc);
+		ret = btrfs_add_reserved_bytes(block_group, num_bytes,
+				alloc_type, delalloc);
 		if (ret == -EAGAIN) {
 			btrfs_add_free_space(block_group, offset, num_bytes);
 			goto loop;
@@ -7781,7 +7798,7 @@ static int __btrfs_free_reserved_extent(struct btrfs_root *root,
 		if (btrfs_test_opt(root, DISCARD))
 			ret = btrfs_discard_extent(root, start, len, NULL);
 		btrfs_add_free_space(cache, start, len);
-		btrfs_update_reserved_bytes(cache, len, RESERVE_FREE, delalloc);
+		btrfs_free_reserved_bytes(cache, len, delalloc);
 	}
 
 	btrfs_put_block_group(cache);
@@ -8011,8 +8028,8 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 	if (!block_group)
 		return -EINVAL;
 
-	ret = btrfs_update_reserved_bytes(block_group, ins->offset,
-					  RESERVE_ALLOC_NO_ACCOUNT, 0);
+	ret = btrfs_add_reserved_bytes(block_group, ins->offset,
+			RESERVE_ALLOC_NO_ACCOUNT, 0);
 	BUG_ON(ret); /* logic error */
 	ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
 					 0, owner, offset, ins, 1);
-- 
2.9.0

* [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely
  2016-07-25  7:51 [PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue Wang Xiaoguang
  2016-07-25  7:51 ` [PATCH v2 1/4] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() Wang Xiaoguang
  2016-07-25  7:51 ` [PATCH v2 2/4] btrfs: divide btrfs_update_reserved_bytes() into two functions Wang Xiaoguang
@ 2016-07-25  7:51 ` Wang Xiaoguang
  2016-07-25 13:33   ` Josef Bacik
  2016-07-25 18:10   ` Josef Bacik
  2016-07-25  7:51 ` [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
  3 siblings, 2 replies; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-25  7:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, jbacik, holger

This patch fixes some false ENOSPC errors; the script below can
reproduce one of them:
	#!/bin/bash
	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
	dev=$(losetup --show -f fs.img)
	mkfs.btrfs -f -M $dev
	mkdir /tmp/mntpoint
	mount $dev /tmp/mntpoint
	cd /tmp/mntpoint
	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The script above fails with ENOSPC, although the filesystem still has
enough free space to satisfy the request. Please see the call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
    |-> btrfs_reserve_extent()
        |-> btrfs_add_reserved_bytes()
        |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
        |   change bytes_may_use, and bytes_reserved += 64M. Now
        |   bytes_may_use + bytes_reserved == 128M, which is greater
        |   than btrfs_space_info's total_bytes, false enospc occurs.
        |   Note, the bytes_may_use decrease operation will be done in
        |   end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
                    CPU 1              |              CPU 2
                                       |
|-> cow_file_range()                   |-> __btrfs_buffered_write()
    |-> btrfs_reserve_extent()         |   |
    |                                  |   |
    |                                  |   |
    |    .....                         |   |-> btrfs_check_data_free_space()
    |                                  |
    |                                  |
    |-> extent_clear_unlock_delalloc() |

On CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
is delayed until extent_clear_unlock_delalloc(). Assume that
btrfs_reserve_extent() reserved 128MB of data and CPU 2's
btrfs_check_data_free_space() tries to reserve 100MB of data space.
If
	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allocate a new data chunk, call
btrfs_start_delalloc_roots(), or commit the current transaction to
reclaim free space, which is obviously a lot of work. None of it is
necessary: if bytes_may_use had been decreased in time, subtracting
CPU 1's 128MB, there would still be enough free space for the request.

To fix this issue, this patch updates bytes_may_use for both data and
metadata in btrfs_add_reserved_bytes(). For the compress path, the real
extent length may not equal the file content length, and bytes_may_use
was increased by the file content length; so introduce a ram_bytes
argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), letting the compress path update
bytes_may_use correctly. RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and
RESERVE_FREE can then be discarded.

As we know, EXTENT_DO_ACCOUNTING is usually used for error paths. In
run_delalloc_nocow(), for an inode marked NODATACOW or an extent marked
PREALLOC, we also need to update bytes_may_use, but we cannot pass
EXTENT_DO_ACCOUNTING because it would clear the metadata reservation as
well. So introduce an EXTENT_CLEAR_DATA_RESV flag to tell
btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() now calls
btrfs_free_reserved_data_space() internally on both the successful and
the failed paths, so btrfs_prealloc_file_range()'s callers no longer
need to call btrfs_free_reserved_data_space() themselves.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/extent-tree.c | 56 +++++++++++++++++---------------------------------
 fs/btrfs/extent_io.h   |  1 +
 fs/btrfs/file.c        | 26 +++++++++++++----------
 fs/btrfs/inode-map.c   |  3 +--
 fs/btrfs/inode.c       | 37 ++++++++++++++++++++++++---------
 fs/btrfs/relocation.c  | 11 ++++++++--
 7 files changed, 73 insertions(+), 63 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4274a7b..7eb2913 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2556,7 +2556,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root,
 				   u64 root_objectid, u64 owner, u64 offset,
 				   struct btrfs_key *ins);
-int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
+int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 			 struct btrfs_key *ins, int is_data, int delalloc);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8eaac39..df8d756 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -60,21 +60,6 @@ enum {
 	CHUNK_ALLOC_FORCE = 2,
 };
 
-/*
- * Control how reservations are dealt with.
- *
- * RESERVE_FREE - freeing a reservation.
- * RESERVE_ALLOC - allocating space and we need to update bytes_may_use for
- *   ENOSPC accounting
- * RESERVE_ALLOC_NO_ACCOUNT - allocating space and we should not update
- *   bytes_may_use as the ENOSPC accounting is done elsewhere
- */
-enum {
-	RESERVE_FREE = 0,
-	RESERVE_ALLOC = 1,
-	RESERVE_ALLOC_NO_ACCOUNT = 2,
-};
-
 static int update_block_group(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root, u64 bytenr,
 			      u64 num_bytes, int alloc);
@@ -105,7 +90,7 @@ static int find_next_key(struct btrfs_path *path, int level,
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
 			    int dump_block_groups);
 static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
-				    u64 num_bytes, int reserve, int delalloc);
+				    u64 ram_bytes, u64 num_bytes, int delalloc);
 static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache,
 				     u64 num_bytes, int delalloc);
 static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
@@ -3491,7 +3476,6 @@ again:
 		dcs = BTRFS_DC_SETUP;
 	else if (ret == -ENOSPC)
 		set_bit(BTRFS_TRANS_CACHE_ENOSPC, &trans->transaction->flags);
-	btrfs_free_reserved_data_space(inode, 0, num_pages);
 
 out_put:
 	iput(inode);
@@ -6300,8 +6284,9 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 /**
  * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache:	The cache we are manipulating
+ * @ram_bytes:  The number of bytes of file content, and will be same to
+ *              @num_bytes except for the compress path.
  * @num_bytes:	The number of bytes in question
- * @reserve:	One of the reservation enums
  * @delalloc:   The blocks are allocated for the delalloc write
  *
  * This is called by the allocator when it reserves space. Metadata
@@ -6316,7 +6301,7 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
  * succeeds.
  */
 static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
-				    u64 num_bytes, int reserve, int delalloc)
+				    u64 ram_bytes, u64 num_bytes, int delalloc)
 {
 	struct btrfs_space_info *space_info = cache->space_info;
 	int ret = 0;
@@ -6328,13 +6313,11 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
 	} else {
 		cache->reserved += num_bytes;
 		space_info->bytes_reserved += num_bytes;
-		if (reserve == RESERVE_ALLOC) {
-			trace_btrfs_space_reservation(cache->fs_info,
-					"space_info", space_info->flags,
-					num_bytes, 0);
-			space_info->bytes_may_use -= num_bytes;
-		}
 
+		trace_btrfs_space_reservation(cache->fs_info,
+				"space_info", space_info->flags,
+				ram_bytes, 0);
+		space_info->bytes_may_use -= ram_bytes;
 		if (delalloc)
 			cache->delalloc_bytes += num_bytes;
 	}
@@ -7218,9 +7201,9 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache,
  * the free space extent currently.
  */
 static noinline int find_free_extent(struct btrfs_root *orig_root,
-				     u64 num_bytes, u64 empty_size,
-				     u64 hint_byte, struct btrfs_key *ins,
-				     u64 flags, int delalloc)
+				u64 ram_bytes, u64 num_bytes, u64 empty_size,
+				u64 hint_byte, struct btrfs_key *ins,
+				u64 flags, int delalloc)
 {
 	int ret = 0;
 	struct btrfs_root *root = orig_root->fs_info->extent_root;
@@ -7232,8 +7215,6 @@ static noinline int find_free_extent(struct btrfs_root *orig_root,
 	struct btrfs_space_info *space_info;
 	int loop = 0;
 	int index = __get_raid_index(flags);
-	int alloc_type = (flags & BTRFS_BLOCK_GROUP_DATA) ?
-		RESERVE_ALLOC_NO_ACCOUNT : RESERVE_ALLOC;
 	bool failed_cluster_refill = false;
 	bool failed_alloc = false;
 	bool use_cluster = true;
@@ -7565,8 +7546,8 @@ checks:
 					     search_start - offset);
 		BUG_ON(offset > search_start);
 
-		ret = btrfs_add_reserved_bytes(block_group, num_bytes,
-				alloc_type, delalloc);
+		ret = btrfs_add_reserved_bytes(block_group, ram_bytes,
+				num_bytes, delalloc);
 		if (ret == -EAGAIN) {
 			btrfs_add_free_space(block_group, offset, num_bytes);
 			goto loop;
@@ -7739,7 +7720,7 @@ again:
 	up_read(&info->groups_sem);
 }
 
-int btrfs_reserve_extent(struct btrfs_root *root,
+int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 			 u64 num_bytes, u64 min_alloc_size,
 			 u64 empty_size, u64 hint_byte,
 			 struct btrfs_key *ins, int is_data, int delalloc)
@@ -7751,8 +7732,8 @@ int btrfs_reserve_extent(struct btrfs_root *root,
 	flags = btrfs_get_alloc_profile(root, is_data);
 again:
 	WARN_ON(num_bytes < root->sectorsize);
-	ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins,
-			       flags, delalloc);
+	ret = find_free_extent(root, ram_bytes, num_bytes, empty_size,
+			       hint_byte, ins, flags, delalloc);
 	if (!ret && !is_data) {
 		btrfs_dec_block_group_reservations(root->fs_info,
 						   ins->objectid);
@@ -7761,6 +7742,7 @@ again:
 			num_bytes = min(num_bytes >> 1, ins->offset);
 			num_bytes = round_down(num_bytes, root->sectorsize);
 			num_bytes = max(num_bytes, min_alloc_size);
+			ram_bytes = num_bytes;
 			if (num_bytes == min_alloc_size)
 				final_tried = true;
 			goto again;
@@ -8029,7 +8011,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 		return -EINVAL;
 
 	ret = btrfs_add_reserved_bytes(block_group, ins->offset,
-			RESERVE_ALLOC_NO_ACCOUNT, 0);
+				       ins->offset, 0);
 	BUG_ON(ret); /* logic error */
 	ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
 					 0, owner, offset, ins, 1);
@@ -8171,7 +8153,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 	if (IS_ERR(block_rsv))
 		return ERR_CAST(block_rsv);
 
-	ret = btrfs_reserve_extent(root, blocksize, blocksize,
+	ret = btrfs_reserve_extent(root, blocksize, blocksize, blocksize,
 				   empty_size, hint, &ins, 0, 0);
 	if (ret)
 		goto out_unuse;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c0c1c4f..b52ca5d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -20,6 +20,7 @@
 #define EXTENT_DAMAGED		(1U << 14)
 #define EXTENT_NORESERVE	(1U << 15)
 #define EXTENT_QGROUP_RESERVED	(1U << 16)
+#define EXTENT_CLEAR_DATA_RESV	(1U << 17)
 #define EXTENT_IOBITS		(EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2234e88..b4d9258 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2669,6 +2669,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 
 	alloc_start = round_down(offset, blocksize);
 	alloc_end = round_up(offset + len, blocksize);
+	cur_offset = alloc_start;
 
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2761,7 +2762,6 @@ static long btrfs_fallocate(struct file *file, int mode,
 
 	/* First, check if we exceed the qgroup limit */
 	INIT_LIST_HEAD(&reserve_list);
-	cur_offset = alloc_start;
 	while (1) {
 		em = btrfs_get_extent(inode, NULL, 0, cur_offset,
 				      alloc_end - cur_offset, 0);
@@ -2788,6 +2788,14 @@ static long btrfs_fallocate(struct file *file, int mode,
 					last_byte - cur_offset);
 			if (ret < 0)
 				break;
+		} else {
+			/*
+			 * Do not need to reserve unwritten extent for this
+			 * range, free reserved data space first, otherwise
+			 * it'll result in false ENOSPC error.
+			 */
+			btrfs_free_reserved_data_space(inode, cur_offset,
+				last_byte - cur_offset);
 		}
 		free_extent_map(em);
 		cur_offset = last_byte;
@@ -2805,6 +2813,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 					range->start,
 					range->len, 1 << inode->i_blkbits,
 					offset + len, &alloc_hint);
+		else
+			btrfs_free_reserved_data_space(inode, range->start,
+						       range->len);
 		list_del(&range->list);
 		kfree(range);
 	}
@@ -2839,18 +2850,11 @@ out_unlock:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			     &cached_state, GFP_KERNEL);
 out:
-	/*
-	 * As we waited the extent range, the data_rsv_map must be empty
-	 * in the range, as written data range will be released from it.
-	 * And for prealloacted extent, it will also be released when
-	 * its metadata is written.
-	 * So this is completely used as cleanup.
-	 */
-	btrfs_qgroup_free_data(inode, alloc_start, alloc_end - alloc_start);
 	inode_unlock(inode);
 	/* Let go of our reservation. */
-	btrfs_free_reserved_data_space(inode, alloc_start,
-				       alloc_end - alloc_start);
+	if (ret != 0)
+		btrfs_free_reserved_data_space(inode, alloc_start,
+				       alloc_end - cur_offset);
 	return ret;
 }
 
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 70107f7..e59e7d6 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -495,10 +495,9 @@ again:
 	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
 					      prealloc, prealloc, &alloc_hint);
 	if (ret) {
-		btrfs_delalloc_release_space(inode, 0, prealloc);
+		btrfs_delalloc_release_metadata(inode, prealloc);
 		goto out_put;
 	}
-	btrfs_free_reserved_data_space(inode, 0, prealloc);
 
 	ret = btrfs_write_out_ino_cache(root, trans, path, inode);
 out_put:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4421954..e0cee59 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -564,6 +564,8 @@ cont:
 						     PAGE_SET_WRITEBACK |
 						     page_error_op |
 						     PAGE_END_WRITEBACK);
+			btrfs_free_reserved_data_space_noquota(inode, start,
+						end - start + 1);
 			goto free_pages_out;
 		}
 	}
@@ -739,7 +741,7 @@ retry:
 		lock_extent(io_tree, async_extent->start,
 			    async_extent->start + async_extent->ram_size - 1);
 
-		ret = btrfs_reserve_extent(root,
+		ret = btrfs_reserve_extent(root, async_extent->ram_size,
 					   async_extent->compressed_size,
 					   async_extent->compressed_size,
 					   0, alloc_hint, &ins, 1, 1);
@@ -966,7 +968,8 @@ static noinline int cow_file_range(struct inode *inode,
 				     EXTENT_DEFRAG, PAGE_UNLOCK |
 				     PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
 				     PAGE_END_WRITEBACK);
-
+			btrfs_free_reserved_data_space_noquota(inode, start,
+						end - start + 1);
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
@@ -986,7 +989,7 @@ static noinline int cow_file_range(struct inode *inode,
 		unsigned long op;
 
 		cur_alloc_size = disk_num_bytes;
-		ret = btrfs_reserve_extent(root, cur_alloc_size,
+		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
 					   &ins, 1, 1);
 		if (ret < 0)
@@ -1485,8 +1488,10 @@ out_check:
 		extent_clear_unlock_delalloc(inode, cur_offset,
 					     cur_offset + num_bytes - 1,
 					     locked_page, EXTENT_LOCKED |
-					     EXTENT_DELALLOC, PAGE_UNLOCK |
-					     PAGE_SET_PRIVATE2);
+					     EXTENT_DELALLOC |
+					     EXTENT_CLEAR_DATA_RESV,
+					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
+
 		if (!nolock && nocow)
 			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
@@ -1803,7 +1808,9 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 			return;
 
 		if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
-		    && do_list && !(state->state & EXTENT_NORESERVE))
+		    && do_list && !(state->state & EXTENT_NORESERVE)
+		    && (*bits & (EXTENT_DO_ACCOUNTING |
+		    EXTENT_CLEAR_DATA_RESV)))
 			btrfs_free_reserved_data_space_noquota(inode,
 					state->start, len);
 
@@ -7214,7 +7221,7 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode,
 	int ret;
 
 	alloc_hint = get_extent_allocation_hint(inode, start, len);
-	ret = btrfs_reserve_extent(root, len, root->sectorsize, 0,
+	ret = btrfs_reserve_extent(root, len, len, root->sectorsize, 0,
 				   alloc_hint, &ins, 1, 1);
 	if (ret)
 		return ERR_PTR(ret);
@@ -7714,6 +7721,13 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 				ret = PTR_ERR(em2);
 				goto unlock_err;
 			}
+			/*
+			 * For inode marked NODATACOW or extent marked PREALLOC,
+			 * use the existing or preallocated extent, so does not
+			 * need to adjust btrfs_space_info's bytes_may_use.
+			 */
+			btrfs_free_reserved_data_space_noquota(inode,
+					start, len);
 			goto unlock;
 		}
 	}
@@ -7748,7 +7762,6 @@ unlock:
 			i_size_write(inode, start + len);
 
 		adjust_dio_outstanding_extents(inode, dio_data, len);
-		btrfs_free_reserved_data_space(inode, start, len);
 		WARN_ON(dio_data->reserve < len);
 		dio_data->reserve -= len;
 		dio_data->unsubmitted_oe_range_end = start + len;
@@ -10269,6 +10282,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 	u64 last_alloc = (u64)-1;
 	int ret = 0;
 	bool own_trans = true;
+	u64 end = start + num_bytes - 1;
 
 	if (trans)
 		own_trans = false;
@@ -10290,8 +10304,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		 * sized chunks.
 		 */
 		cur_bytes = min(cur_bytes, last_alloc);
-		ret = btrfs_reserve_extent(root, cur_bytes, min_size, 0,
-					   *alloc_hint, &ins, 1, 0);
+		ret = btrfs_reserve_extent(root, cur_bytes, cur_bytes,
+				min_size, 0, *alloc_hint, &ins, 1, 0);
 		if (ret) {
 			if (own_trans)
 				btrfs_end_transaction(trans, root);
@@ -10377,6 +10391,9 @@ next:
 		if (own_trans)
 			btrfs_end_transaction(trans, root);
 	}
+	if (cur_offset < end)
+		btrfs_free_reserved_data_space(inode, cur_offset,
+			end - cur_offset + 1);
 	return ret;
 }
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index a0de885..f39c4db 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3032,6 +3032,7 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	int ret = 0;
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
+	u64 cur_offset;
 
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	inode_lock(inode);
@@ -3041,6 +3042,7 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	cur_offset = prealloc_start;
 	while (nr < cluster->nr) {
 		start = cluster->boundary[nr] - offset;
 		if (nr + 1 < cluster->nr)
@@ -3050,16 +3052,21 @@ int prealloc_file_extent_cluster(struct inode *inode,
 
 		lock_extent(&BTRFS_I(inode)->io_tree, start, end);
 		num_bytes = end + 1 - start;
+		if (cur_offset < start)
+			btrfs_free_reserved_data_space(inode, cur_offset,
+					start - cur_offset);
 		ret = btrfs_prealloc_file_range(inode, 0, start,
 						num_bytes, num_bytes,
 						end + 1, &alloc_hint);
+		cur_offset = end + 1;
 		unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
 		if (ret)
 			break;
 		nr++;
 	}
-	btrfs_free_reserved_data_space(inode, prealloc_start,
-				       prealloc_end + 1 - prealloc_start);
+	if (cur_offset < prealloc_end)
+		btrfs_free_reserved_data_space(inode, cur_offset,
+				       prealloc_end + 1 - cur_offset);
 out:
 	inode_unlock(inode);
 	return ret;
-- 
2.9.0

* [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-25  7:51 [PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue Wang Xiaoguang
                   ` (2 preceding siblings ...)
  2016-07-25  7:51 ` [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely Wang Xiaoguang
@ 2016-07-25  7:51 ` Wang Xiaoguang
  2016-07-25 13:32   ` Josef Bacik
  3 siblings, 1 reply; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-25  7:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, jbacik, holger

cleaner_kthread() may run at any time, calling btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also
result in false ENOSPC errors. Please see the race window below:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and does not |   |
    |   have free space to satisfy the data   |   |
    |   request.                              |   |
    |                                         |   |-> btrfs_delete_unused_bgs()
    |                                         |   |   it will decrease btrfs_space_info's
    |                                         |   |   total_bytes and make
    |                                         |   |   btrfs_space_info not full.
    |                                         |   |
In this case, we may get an ENOSPC error even though btrfs_space_info is not actually full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs().
So here we introduce a new struct rw_semaphore, bg_delete_sem, to do this job.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent-tree.c | 40 ++++++++++++++++++++++++++++++++++------
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..bf0751d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..65a1465 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2676,6 +2676,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->ordered_operations_mutex);
 	mutex_init(&fs_info->tree_log_mutex);
 	mutex_init(&fs_info->chunk_mutex);
+	init_rwsem(&fs_info->bg_delete_sem);
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
 	mutex_init(&fs_info->volume_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..d1f8638 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	}
 
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		down_read(&root->fs_info->bg_delete_sem);
+		have_bg_delete_sem = 1;
 		goto alloc;
+	}
 
 again:
 	/* make sure we have enough space to handle the data first */
@@ -4134,10 +4138,21 @@ again:
 	if (used + bytes > data_sinfo->total_bytes) {
 		struct btrfs_trans_handle *trans;
 
+		spin_unlock(&data_sinfo->lock);
+		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (have_bg_delete_sem == 0) {
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+		}
+
 		/*
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
+		spin_lock(&data_sinfo->lock);
 		if (!data_sinfo->full) {
 			u64 alloc_target;
 
@@ -4156,17 +4171,20 @@ alloc:
 			 * the fs.
 			 */
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 
 			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 					     alloc_target,
 					     CHUNK_ALLOC_NO_FORCE);
 			btrfs_end_transaction(trans, root);
 			if (ret < 0) {
-				if (ret != -ENOSPC)
+				if (ret != -ENOSPC) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
-				else {
+				} else {
 					have_pinned_space = 1;
 					goto commit_trans;
 				}
@@ -4200,15 +4218,19 @@ commit_trans:
 			}
 
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 			if (have_pinned_space >= 0 ||
 			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
 				     &trans->transaction->flags) ||
 			    need_commit > 0) {
 				ret = btrfs_commit_transaction(trans, root);
-				if (ret)
+				if (ret) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
+				}
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
@@ -4225,6 +4247,7 @@ commit_trans:
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
 					      data_sinfo->flags, bytes, 1);
+		up_read(&root->fs_info->bg_delete_sem);
 		return -ENOSPC;
 	}
 	data_sinfo->bytes_may_use += bytes;
@@ -4232,6 +4255,9 @@ commit_trans:
 				      data_sinfo->flags, bytes, 1);
 	spin_unlock(&data_sinfo->lock);
 
+	if (have_bg_delete_sem == 1)
+		up_read(&root->fs_info->bg_delete_sem);
+
 	return ret;
 }
 
@@ -10594,6 +10620,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&fs_info->unused_bgs_lock);
 
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_write(&root->fs_info->bg_delete_sem);
 
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
@@ -10721,6 +10748,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
+		up_write(&root->fs_info->bg_delete_sem);
 		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
-- 
2.9.0

* Re: [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-25  7:51 ` [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
@ 2016-07-25 13:32   ` Josef Bacik
  2016-07-26 12:04     ` [PATCH v3] " Wang Xiaoguang
  0 siblings, 1 reply; 21+ messages in thread
From: Josef Bacik @ 2016-07-25 13:32 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs; +Cc: dsterba, holger

On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:
> cleaner_kthread() may run at any time, calling btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also
> result in false ENOSPC errors. Please see the race window below:
>
>                CPU1                           |             CPU2
>                                               |
> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>     |-> do_chunk_alloc()                      |   |
>     |   assume it returns ENOSPC, which means |   |
>     |   btrfs_space_info is full and does not |   |
>     |   have free space to satisfy the data   |   |
>     |   request.                              |   |
>     |                                         |   |-> btrfs_delete_unused_bgs()
>     |                                         |   |   it will decrease btrfs_space_info's
>     |                                         |   |   total_bytes and make
>     |                                         |   |   btrfs_space_info not full.
>     |                                         |   |
> In this case, we may get an ENOSPC error even though btrfs_space_info is not actually full.
>
> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
> do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs().
> So here we introduce a new struct rw_semaphore, bg_delete_sem, to do this job.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/disk-io.c     |  1 +
>  fs/btrfs/extent-tree.c | 40 ++++++++++++++++++++++++++++++++++------
>  3 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7eb2913..bf0751d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -800,6 +800,7 @@ struct btrfs_fs_info {
>  	struct mutex cleaner_mutex;
>  	struct mutex chunk_mutex;
>  	struct mutex volume_mutex;
> +	struct rw_semaphore bg_delete_sem;
>
>  	/*
>  	 * this is taken to make sure we don't set block groups ro after
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 60ce119..65a1465 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2676,6 +2676,7 @@ int open_ctree(struct super_block *sb,
>  	mutex_init(&fs_info->ordered_operations_mutex);
>  	mutex_init(&fs_info->tree_log_mutex);
>  	mutex_init(&fs_info->chunk_mutex);
> +	init_rwsem(&fs_info->bg_delete_sem);
>  	mutex_init(&fs_info->transaction_kthread_mutex);
>  	mutex_init(&fs_info->cleaner_mutex);
>  	mutex_init(&fs_info->volume_mutex);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index df8d756..d1f8638 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>  	int ret = 0;
>  	int need_commit = 2;
>  	int have_pinned_space;
> +	int have_bg_delete_sem = 0;
>
>  	/* make sure bytes are sectorsize aligned */
>  	bytes = ALIGN(bytes, root->sectorsize);
> @@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>  	}
>
>  	data_sinfo = fs_info->data_sinfo;
> -	if (!data_sinfo)
> +	if (!data_sinfo) {
> +		down_read(&root->fs_info->bg_delete_sem);
> +		have_bg_delete_sem = 1;
>  		goto alloc;
> +	}
>
>  again:
>  	/* make sure we have enough space to handle the data first */
> @@ -4134,10 +4138,21 @@ again:
>  	if (used + bytes > data_sinfo->total_bytes) {
>  		struct btrfs_trans_handle *trans;
>
> +		spin_unlock(&data_sinfo->lock);
> +		/*
> +		 * We may need to allocate new chunk, so we should block
> +		 * btrfs_delete_unused_bgs()
> +		 */
> +		if (have_bg_delete_sem == 0) {
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +		}
> +

We could race with somebody else doing an allocation here, so instead do 
something like

if (!have_bg_delete_sem) {
	spin_unlock(&data_sinfo->lock);
	down_read(&root->fs_info->bg_delete_sem);
	have_bg_delete_sem = 1;
	spin_lock(&data_sinfo->lock);
	if (used + bytes <= data_sinfo->total_bytes)
		goto done; // add a done label at bytes_may_use += bytes
}

Thanks,

Josef


* Re: [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely
  2016-07-25  7:51 ` [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely Wang Xiaoguang
@ 2016-07-25 13:33   ` Josef Bacik
  2016-07-25 18:10   ` Josef Bacik
  1 sibling, 0 replies; 21+ messages in thread
From: Josef Bacik @ 2016-07-25 13:33 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs; +Cc: dsterba, holger

On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:
> This patch can fix some false ENOSPC errors, below test script can
> reproduce one false ENOSPC error:
> 	#!/bin/bash
> 	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
> 	dev=$(losetup --show -f fs.img)
> 	mkfs.btrfs -f -M $dev
> 	mkdir /tmp/mntpoint
> 	mount $dev /tmp/mntpoint
> 	cd /tmp/mntpoint
> 	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
>
> Above script will fail for ENOSPC reason, but indeed fs still has free
> space to satisfy this request. Please see call graph:
> btrfs_fallocate()
> |-> btrfs_alloc_data_chunk_ondemand()
> |   bytes_may_use += 64M
> |-> btrfs_prealloc_file_range()
>     |-> btrfs_reserve_extent()
>         |-> btrfs_add_reserved_bytes()
>         |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
>         |   change bytes_may_use, and bytes_reserved += 64M. Now
>         |   bytes_may_use + bytes_reserved == 128M, which is greater
>         |   than btrfs_space_info's total_bytes, false enospc occurs.
>         |   Note, the bytes_may_use decrease operation will be done in
>         |   end of btrfs_fallocate(), which is too late.
>
> Here is another simple case for buffered write:
>                     CPU 1              |              CPU 2
>                                        |
> |-> cow_file_range()                   |-> __btrfs_buffered_write()
>     |-> btrfs_reserve_extent()         |   |
>     |                                  |   |
>     |                                  |   |
>     |    .....                         |   |-> btrfs_check_data_free_space()
>     |                                  |
>     |                                  |
>     |-> extent_clear_unlock_delalloc() |
>
> In CPU 1, btrfs_reserve_extent()->find_free_extent()->
> btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
> operation will be delayed to be done in extent_clear_unlock_delalloc().
> Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
> btrfs_check_data_free_space() tries to reserve 100MB data space.
> If
> 	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
> 		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
> 		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
> btrfs_check_data_free_space() will try to allocate a new data chunk or call
> btrfs_start_delalloc_roots(), or commit current transaction in order to
> reserve some free space, obviously a lot of work. But indeed it's not
> necessary as long as we decrease bytes_may_use timely: after subtracting
> the 128M from bytes_may_use, we still have free space.
>
> To fix this issue, this patch chooses to update bytes_may_use for both
> data and metadata in btrfs_add_reserved_bytes(). For compress path, real
> extent length may not be equal to file content length, so introduce a
> ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
> btrfs_add_reserved_bytes(), it's because bytes_may_use is increased by
> file content length. Then compress path can update bytes_may_use
> correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
> and RESERVE_FREE.
>
> As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
> run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
> PREALLOC, we also need to update bytes_may_use, but can not pass
> EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
> here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
> to update btrfs_space_info's bytes_may_use.
>
> Meanwhile __btrfs_prealloc_file_range() will call
> btrfs_free_reserved_data_space() internally for both successful and failed
> paths, so btrfs_prealloc_file_range()'s callers do not need to call
> btrfs_free_reserved_data_space() any more.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Yup this is good, thanks for fixing this problem

Signed-off-by: Josef Bacik <jbacik@fb.com>



* Re: [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely
  2016-07-25  7:51 ` [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely Wang Xiaoguang
  2016-07-25 13:33   ` Josef Bacik
@ 2016-07-25 18:10   ` Josef Bacik
  1 sibling, 0 replies; 21+ messages in thread
From: Josef Bacik @ 2016-07-25 18:10 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs; +Cc: dsterba, holger

On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:
> This patch can fix some false ENOSPC errors, below test script can
> reproduce one false ENOSPC error:
> 	#!/bin/bash
> 	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
> 	dev=$(losetup --show -f fs.img)
> 	mkfs.btrfs -f -M $dev
> 	mkdir /tmp/mntpoint
> 	mount $dev /tmp/mntpoint
> 	cd /tmp/mntpoint
> 	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
>
> Above script will fail for ENOSPC reason, but indeed fs still has free
> space to satisfy this request. Please see call graph:
> btrfs_fallocate()
> |-> btrfs_alloc_data_chunk_ondemand()
> |   bytes_may_use += 64M
> |-> btrfs_prealloc_file_range()
>     |-> btrfs_reserve_extent()
>         |-> btrfs_add_reserved_bytes()
>         |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
>         |   change bytes_may_use, and bytes_reserved += 64M. Now
>         |   bytes_may_use + bytes_reserved == 128M, which is greater
>         |   than btrfs_space_info's total_bytes, false enospc occurs.
>         |   Note, the bytes_may_use decrease operation will be done in
>         |   end of btrfs_fallocate(), which is too late.
>
> Here is another simple case for buffered write:
>                     CPU 1              |              CPU 2
>                                        |
> |-> cow_file_range()                   |-> __btrfs_buffered_write()
>     |-> btrfs_reserve_extent()         |   |
>     |                                  |   |
>     |                                  |   |
>     |    .....                         |   |-> btrfs_check_data_free_space()
>     |                                  |
>     |                                  |
>     |-> extent_clear_unlock_delalloc() |
>
> In CPU 1, btrfs_reserve_extent()->find_free_extent()->
> btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
> operation will be delayed to be done in extent_clear_unlock_delalloc().
> Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
> btrfs_check_data_free_space() tries to reserve 100MB data space.
> If
> 	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
> 		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
> 		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
> btrfs_check_data_free_space() will try to allocate a new data chunk or call
> btrfs_start_delalloc_roots(), or commit current transaction in order to
> reserve some free space, obviously a lot of work. But indeed it's not
> necessary as long as we decrease bytes_may_use timely: after subtracting
> the 128M from bytes_may_use, we still have free space.
>
> To fix this issue, this patch chooses to update bytes_may_use for both
> data and metadata in btrfs_add_reserved_bytes(). For compress path, real
> extent length may not be equal to file content length, so introduce a
> ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
> btrfs_add_reserved_bytes(), it's because bytes_may_use is increased by
> file content length. Then compress path can update bytes_may_use
> correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
> and RESERVE_FREE.
>
> As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
> run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
> PREALLOC, we also need to update bytes_may_use, but can not pass
> EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
> here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
> to update btrfs_space_info's bytes_may_use.
>
> Meanwhile __btrfs_prealloc_file_range() will call
> btrfs_free_reserved_data_space() internally for both successful and failed
> paths, so btrfs_prealloc_file_range()'s callers do not need to call
> btrfs_free_reserved_data_space() any more.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---

Sigh sorry, baby kept me up most of the night, that should be

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef



* [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-25 13:32   ` Josef Bacik
@ 2016-07-26 12:04     ` Wang Xiaoguang
  2016-07-26 15:51       ` Josef Bacik
  0 siblings, 1 reply; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-26 12:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik

cleaner_kthread() may run at any time, calling btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also
result in false ENOSPC errors. Please see the race window below:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and does not |   |
    |   have free space to satisfy the data   |   |
    |   request.                              |   |
    |                                         |   |-> btrfs_delete_unused_bgs()
    |                                         |   |   it will decrease btrfs_space_info's
    |                                         |   |   total_bytes and make
    |                                         |   |   btrfs_space_info not full.
    |                                         |   |
In this case, we may get an ENOSPC error even though btrfs_space_info is not actually full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs().
So here we introduce a new struct rw_semaphore, bg_delete_sem, to do this job.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
v3: improve the code logic as suggested by Josef, thanks.
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent-tree.c | 44 ++++++++++++++++++++++++++++++++++++++------
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..bf0751d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..f8609fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2683,6 +2683,7 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	init_rwsem(&fs_info->bg_delete_sem);
 	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
 
 	btrfs_init_dev_replace_locks(fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..50b440e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	}
 
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		down_read(&root->fs_info->bg_delete_sem);
+		have_bg_delete_sem = 1;
 		goto alloc;
+	}
 
 again:
 	/* make sure we have enough space to handle the data first */
@@ -4135,6 +4139,19 @@ again:
 		struct btrfs_trans_handle *trans;
 
 		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (!have_bg_delete_sem) {
+			spin_unlock(&data_sinfo->lock);
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+			spin_lock(&data_sinfo->lock);
+			if (used + bytes <= data_sinfo->total_bytes)
+				goto done;
+		}
+
+		/*
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
@@ -4156,17 +4173,20 @@ alloc:
 			 * the fs.
 			 */
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 
 			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 					     alloc_target,
 					     CHUNK_ALLOC_NO_FORCE);
 			btrfs_end_transaction(trans, root);
 			if (ret < 0) {
-				if (ret != -ENOSPC)
+				if (ret != -ENOSPC) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
-				else {
+				} else {
 					have_pinned_space = 1;
 					goto commit_trans;
 				}
@@ -4200,15 +4220,19 @@ commit_trans:
 			}
 
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 			if (have_pinned_space >= 0 ||
 			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
 				     &trans->transaction->flags) ||
 			    need_commit > 0) {
 				ret = btrfs_commit_transaction(trans, root);
-				if (ret)
+				if (ret) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
+				}
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
@@ -4225,13 +4249,19 @@ commit_trans:
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
 					      data_sinfo->flags, bytes, 1);
+		up_read(&root->fs_info->bg_delete_sem);
 		return -ENOSPC;
 	}
+
+done:
 	data_sinfo->bytes_may_use += bytes;
 	trace_btrfs_space_reservation(root->fs_info, "space_info",
 				      data_sinfo->flags, bytes, 1);
 	spin_unlock(&data_sinfo->lock);
 
+	if (have_bg_delete_sem)
+		up_read(&root->fs_info->bg_delete_sem);
+
 	return ret;
 }
 
@@ -10594,6 +10624,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&fs_info->unused_bgs_lock);
 
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_write(&root->fs_info->bg_delete_sem);
 
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
@@ -10721,6 +10752,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
+		up_write(&root->fs_info->bg_delete_sem);
 		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
-- 
2.9.0


* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-26 12:04     ` [PATCH v3] " Wang Xiaoguang
@ 2016-07-26 15:51       ` Josef Bacik
  2016-07-26 16:25         ` David Sterba
  2016-07-27  1:26         ` Wang Xiaoguang
  0 siblings, 2 replies; 21+ messages in thread
From: Josef Bacik @ 2016-07-26 15:51 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs

On 07/26/2016 08:04 AM, Wang Xiaoguang wrote:
> cleaner_kthread() may run at any time, calling btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may also
> result in false ENOSPC errors. Please see the race window below:
>
>                CPU1                           |             CPU2
>                                               |
> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>     |-> do_chunk_alloc()                      |   |
>     |   assume it returns ENOSPC, which means |   |
>     |   btrfs_space_info is full and does not |   |
>     |   have free space to satisfy the data   |   |
>     |   request.                              |   |
>     |                                         |   |-> btrfs_delete_unused_bgs()
>     |                                         |   |   it will decrease btrfs_space_info's
>     |                                         |   |   total_bytes and make
>     |                                         |   |   btrfs_space_info not full.
>     |                                         |   |
> In this case, we may get an ENOSPC error even though btrfs_space_info is not actually full.
>
> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
> do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs().
> So here we introduce a new struct rw_semaphore, bg_delete_sem, to do this job.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
> v3: improve one code logic suggested by Josef, thanks.
> ---
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/disk-io.c     |  1 +
>  fs/btrfs/extent-tree.c | 44 ++++++++++++++++++++++++++++++++++++++------
>  3 files changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7eb2913..bf0751d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -800,6 +800,7 @@ struct btrfs_fs_info {
>  	struct mutex cleaner_mutex;
>  	struct mutex chunk_mutex;
>  	struct mutex volume_mutex;
> +	struct rw_semaphore bg_delete_sem;
>
>  	/*
>  	 * this is taken to make sure we don't set block groups ro after
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 60ce119..f8609fd 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2683,6 +2683,7 @@ int open_ctree(struct super_block *sb,
>  	init_rwsem(&fs_info->commit_root_sem);
>  	init_rwsem(&fs_info->cleanup_work_sem);
>  	init_rwsem(&fs_info->subvol_sem);
> +	init_rwsem(&fs_info->bg_delete_sem);
>  	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>
>  	btrfs_init_dev_replace_locks(fs_info);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index df8d756..50b440e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>  	int ret = 0;
>  	int need_commit = 2;
>  	int have_pinned_space;
> +	int have_bg_delete_sem = 0;
>
>  	/* make sure bytes are sectorsize aligned */
>  	bytes = ALIGN(bytes, root->sectorsize);
> @@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>  	}
>
>  	data_sinfo = fs_info->data_sinfo;
> -	if (!data_sinfo)
> +	if (!data_sinfo) {
> +		down_read(&root->fs_info->bg_delete_sem);
> +		have_bg_delete_sem = 1;
>  		goto alloc;
> +	}
>
>  again:
>  	/* make sure we have enough space to handle the data first */
> @@ -4135,6 +4139,19 @@ again:
>  		struct btrfs_trans_handle *trans;
>
>  		/*
> +		 * We may need to allocate new chunk, so we should block
> +		 * btrfs_delete_unused_bgs()
> +		 */
> +		if (!have_bg_delete_sem) {
> +			spin_unlock(&data_sinfo->lock);
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +			spin_lock(&data_sinfo->lock);
> +			if (used + bytes <= data_sinfo->total_bytes)

Doh sorry I forgot to mention you need to re-calculate used here.  Also I 
noticed we already have a delete_unused_bgs_mutex, can we drop that and replace 
it with bg_delete_sem as well?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-26 15:51       ` Josef Bacik
@ 2016-07-26 16:25         ` David Sterba
  2016-07-27  1:26         ` Wang Xiaoguang
  1 sibling, 0 replies; 21+ messages in thread
From: David Sterba @ 2016-07-26 16:25 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Wang Xiaoguang, linux-btrfs

On Tue, Jul 26, 2016 at 11:51:39AM -0400, Josef Bacik wrote:
> Also I 
> noticed we already have a delete_unused_bgs_mutex, can we drop that and replace 
> it with bg_delete_sem as well?  Thanks,

Oh, yes please, making sense of all the mutexes and semaphores gets
harder with every new one.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-26 15:51       ` Josef Bacik
  2016-07-26 16:25         ` David Sterba
@ 2016-07-27  1:26         ` Wang Xiaoguang
  2016-07-27  5:50           ` [PATCH v4] " Wang Xiaoguang
  1 sibling, 1 reply; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-27  1:26 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs

hello,

On 07/26/2016 11:51 PM, Josef Bacik wrote:
> On 07/26/2016 08:04 AM, Wang Xiaoguang wrote:
>> cleaner_kthread() may run at any time, in which it'll call
>> btrfs_delete_unused_bgs() to delete unused block groups. Because this
>> work is asynchronous, it may also result in a false ENOSPC error.
>> Please see the race window below:
>>
>>                CPU1                           |             CPU2
>>                                               |
>> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>>     |-> do_chunk_alloc()                      |   |
>>     |   assume it returns ENOSPC, which means |   |
>>     |   btrfs_space_info is full and has no   |   |
>>     |   free space to satisfy the data request|   |
>>     |                                         |   |-> btrfs_delete_unused_bgs()
>>     |                                         |   |    it will decrease btrfs_space_info
>>     |                                         |   |    total_bytes and make
>>     |                                         |   |    btrfs_space_info not full.
>>     |                                         |   |
>> In this case, we may get an ENOSPC error, but btrfs_space_info is not full.
>>
>> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to
>> call do_chunk_alloc() to allocate a new chunk, we should block
>> btrfs_delete_unused_bgs(). So here we introduce a new struct
>> rw_semaphore bg_delete_sem to do this job.
>>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> ---
>> v3: improve one code logic suggested by Josef, thanks.
>> ---
>>  fs/btrfs/ctree.h       |  1 +
>>  fs/btrfs/disk-io.c     |  1 +
>>  fs/btrfs/extent-tree.c | 44 ++++++++++++++++++++++++++++++++++++++------
>>  3 files changed, 40 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 7eb2913..bf0751d 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -800,6 +800,7 @@ struct btrfs_fs_info {
>>      struct mutex cleaner_mutex;
>>      struct mutex chunk_mutex;
>>      struct mutex volume_mutex;
>> +    struct rw_semaphore bg_delete_sem;
>>
>>      /*
>>       * this is taken to make sure we don't set block groups ro after
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 60ce119..f8609fd 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2683,6 +2683,7 @@ int open_ctree(struct super_block *sb,
>>      init_rwsem(&fs_info->commit_root_sem);
>>      init_rwsem(&fs_info->cleanup_work_sem);
>>      init_rwsem(&fs_info->subvol_sem);
>> +    init_rwsem(&fs_info->bg_delete_sem);
>>      sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>>
>>      btrfs_init_dev_replace_locks(fs_info);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index df8d756..50b440e 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>>      int ret = 0;
>>      int need_commit = 2;
>>      int have_pinned_space;
>> +    int have_bg_delete_sem = 0;
>>
>>      /* make sure bytes are sectorsize aligned */
>>      bytes = ALIGN(bytes, root->sectorsize);
>> @@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>>      }
>>
>>      data_sinfo = fs_info->data_sinfo;
>> -    if (!data_sinfo)
>> +    if (!data_sinfo) {
>> +        down_read(&root->fs_info->bg_delete_sem);
>> +        have_bg_delete_sem = 1;
>>          goto alloc;
>> +    }
>>
>>  again:
>>      /* make sure we have enough space to handle the data first */
>> @@ -4135,6 +4139,19 @@ again:
>>          struct btrfs_trans_handle *trans;
>>
>>          /*
>> +         * We may need to allocate new chunk, so we should block
>> +         * btrfs_delete_unused_bgs()
>> +         */
>> +        if (!have_bg_delete_sem) {
>> +            spin_unlock(&data_sinfo->lock);
>> +            down_read(&root->fs_info->bg_delete_sem);
>> +            have_bg_delete_sem = 1;
>> +            spin_lock(&data_sinfo->lock);
>> +            if (used + bytes <= data_sinfo->total_bytes)
>
> Doh sorry I forgot to mention you need to re-calculate used here. Also 
> I noticed we already have a delete_unused_bgs_mutex, can we drop that 
> and replace it with bg_delete_sem as well?  Thanks,
Sorry, I also forgot to re-calculate used...
OK, I'll try to replace delete_unused_bgs_mutex with bg_delete_sem, 
please wait a while :)

Regards,
Xiaoguang Wang

>
> Josef
>
>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4] btrfs: should block unused block groups deletion work when allocating data space
  2016-07-27  1:26         ` Wang Xiaoguang
@ 2016-07-27  5:50           ` Wang Xiaoguang
  0 siblings, 0 replies; 21+ messages in thread
From: Wang Xiaoguang @ 2016-07-27  5:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, dsterba

cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also result
in a false ENOSPC error. Please see the race window below:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and has no   |   |
    |   free space to satisfy the data request|   |
    |                                         |   |-> btrfs_delete_unused_bgs()
    |                                         |   |    it will decrease btrfs_space_info
    |                                         |   |    total_bytes and make
    |                                         |   |    btrfs_space_info not full.
    |                                         |   |
In this case, we may get an ENOSPC error, but btrfs_space_info is not full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs().
Here we introduce a new struct rw_semaphore bg_delete_sem to do this job.

Indeed there is already a "struct mutex delete_unused_bgs_mutex", but it is a
mutex, so we cannot use it for this purpose. Of course, we could redefine it as
a struct rw_semaphore and then use it in btrfs_alloc_data_chunk_ondemand().
Either method would work.

But given that delete_unused_bgs_mutex's name is longer than bg_delete_sem,
I chose the first method: create a new struct rw_semaphore bg_delete_sem and
delete delete_unused_bgs_mutex :)

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
v3: improve one code logic suggested by Josef, thanks.
v4: replace delete_unused_bgs_mutex with our new bg_delete_sem, also
    suggested by Josef.
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/disk-io.c     | 13 ++++++-------
 fs/btrfs/extent-tree.c | 42 ++++++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.c     | 42 +++++++++++++++++++++---------------------
 4 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..90041a2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
@@ -1079,7 +1080,6 @@ struct btrfs_fs_info {
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
 	struct mutex unused_bg_unpin_mutex;
-	struct mutex delete_unused_bgs_mutex;
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..86cad9a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1839,12 +1839,11 @@ static int cleaner_kthread(void *arg)
 		btrfs_run_defrag_inodes(root->fs_info);
 
 		/*
-		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
-		 * with relocation (btrfs_relocate_chunk) and relocation
-		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
-		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
-		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
-		 * unused block groups.
+		 * Acquires fs_info->bg_delete_sem to avoid racing with
+		 * relocation (btrfs_relocate_chunk) and relocation acquires
+		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
+		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
+		 * to, fs_info->cleaner_mutex when deleting unused block groups.
 		 */
 		btrfs_delete_unused_bgs(root->fs_info);
 sleep:
@@ -2595,7 +2594,6 @@ int open_ctree(struct super_block *sb,
 	spin_lock_init(&fs_info->unused_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
-	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
@@ -2683,6 +2681,7 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	init_rwsem(&fs_info->bg_delete_sem);
 	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
 
 	btrfs_init_dev_replace_locks(fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..19d4bb1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	}
 
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		down_read(&root->fs_info->bg_delete_sem);
+		have_bg_delete_sem = 1;
 		goto alloc;
+	}
 
 again:
 	/* make sure we have enough space to handle the data first */
@@ -4135,6 +4139,17 @@ again:
 		struct btrfs_trans_handle *trans;
 
 		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (!have_bg_delete_sem) {
+			spin_unlock(&data_sinfo->lock);
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+			goto again;
+		}
+
+		/*
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
@@ -4156,17 +4171,20 @@ alloc:
 			 * the fs.
 			 */
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 
 			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 					     alloc_target,
 					     CHUNK_ALLOC_NO_FORCE);
 			btrfs_end_transaction(trans, root);
 			if (ret < 0) {
-				if (ret != -ENOSPC)
+				if (ret != -ENOSPC) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
-				else {
+				} else {
 					have_pinned_space = 1;
 					goto commit_trans;
 				}
@@ -4200,15 +4218,19 @@ commit_trans:
 			}
 
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 			if (have_pinned_space >= 0 ||
 			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
 				     &trans->transaction->flags) ||
 			    need_commit > 0) {
 				ret = btrfs_commit_transaction(trans, root);
-				if (ret)
+				if (ret) {
+					up_read(&root->fs_info->bg_delete_sem);
 					return ret;
+				}
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
@@ -4225,6 +4247,7 @@ commit_trans:
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
 					      data_sinfo->flags, bytes, 1);
+		up_read(&root->fs_info->bg_delete_sem);
 		return -ENOSPC;
 	}
 	data_sinfo->bytes_may_use += bytes;
@@ -4232,6 +4255,9 @@ commit_trans:
 				      data_sinfo->flags, bytes, 1);
 	spin_unlock(&data_sinfo->lock);
 
+	if (have_bg_delete_sem)
+		up_read(&root->fs_info->bg_delete_sem);
+
 	return ret;
 }
 
@@ -10593,7 +10619,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_write(&root->fs_info->bg_delete_sem);
 
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
@@ -10721,7 +10747,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		up_write(&root->fs_info->bg_delete_sem);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 589f128..7dbf70a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2876,7 +2876,7 @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
 	 * we release the path used to search the chunk/dev tree and before
 	 * the current task acquires this mutex and calls us.
 	 */
-	ASSERT(mutex_is_locked(&root->fs_info->delete_unused_bgs_mutex));
+	ASSERT(rwsem_is_locked(&root->fs_info->bg_delete_sem));
 
 	ret = btrfs_can_relocate(extent_root, chunk_offset);
 	if (ret)
@@ -2929,10 +2929,10 @@ again:
 	key.type = BTRFS_CHUNK_ITEM_KEY;
 
 	while (1) {
-		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
+		down_read(&root->fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			goto error;
 		}
 		BUG_ON(ret == 0); /* Corruption */
@@ -2940,7 +2940,7 @@ again:
 		ret = btrfs_previous_item(chunk_root, path, key.objectid,
 					  key.type);
 		if (ret)
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 		if (ret < 0)
 			goto error;
 		if (ret > 0)
@@ -2962,7 +2962,7 @@ again:
 			else
 				BUG_ON(ret);
 		}
-		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+		up_read(&root->fs_info->bg_delete_sem);
 
 		if (found_key.offset == 0)
 			break;
@@ -3498,10 +3498,10 @@ again:
 			goto error;
 		}
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_read(&fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto error;
 		}
 
@@ -3515,7 +3515,7 @@ again:
 		ret = btrfs_previous_item(chunk_root, path, 0,
 					  BTRFS_CHUNK_ITEM_KEY);
 		if (ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			ret = 0;
 			break;
 		}
@@ -3525,7 +3525,7 @@ again:
 		btrfs_item_key_to_cpu(leaf, &found_key, slot);
 
 		if (found_key.objectid != key.objectid) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			break;
 		}
 
@@ -3543,12 +3543,12 @@ again:
 
 		btrfs_release_path(path);
 		if (!ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto loop;
 		}
 
 		if (counting) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			spin_lock(&fs_info->balance_lock);
 			bctl->stat.expected++;
 			spin_unlock(&fs_info->balance_lock);
@@ -3573,7 +3573,7 @@ again:
 					count_meta < bctl->meta.limit_min)
 				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
 					count_sys < bctl->sys.limit_min)) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto loop;
 		}
 
@@ -3586,7 +3586,7 @@ again:
 		    !chunk_reserved && !bytes_used) {
 			trans = btrfs_start_transaction(chunk_root, 0);
 			if (IS_ERR(trans)) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+				up_read(&fs_info->bg_delete_sem);
 				ret = PTR_ERR(trans);
 				goto error;
 			}
@@ -3595,7 +3595,7 @@ again:
 						      BTRFS_BLOCK_GROUP_DATA);
 			btrfs_end_transaction(trans, chunk_root);
 			if (ret < 0) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+				up_read(&fs_info->bg_delete_sem);
 				goto error;
 			}
 			chunk_reserved = 1;
@@ -3603,7 +3603,7 @@ again:
 
 		ret = btrfs_relocate_chunk(chunk_root,
 					   found_key.offset);
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		up_read(&fs_info->bg_delete_sem);
 		if (ret && ret != -ENOSPC)
 			goto error;
 		if (ret == -ENOSPC) {
@@ -4330,16 +4330,16 @@ again:
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
 	do {
-		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
+		down_read(&root->fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			goto done;
 		}
 
 		ret = btrfs_previous_item(root, path, 0, key.type);
 		if (ret)
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 		if (ret < 0)
 			goto done;
 		if (ret) {
@@ -4353,7 +4353,7 @@ again:
 		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
 
 		if (key.objectid != device->devid) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4362,7 +4362,7 @@ again:
 		length = btrfs_dev_extent_length(l, dev_extent);
 
 		if (key.offset + length <= new_size) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4371,7 +4371,7 @@ again:
 		btrfs_release_path(path);
 
 		ret = btrfs_relocate_chunk(root, chunk_offset);
-		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+		up_read(&root->fs_info->bg_delete_sem);
 		if (ret && ret != -ENOSPC)
 			goto done;
 		if (ret == -ENOSPC)
-- 
2.9.0




^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
@ 2016-09-09  8:17 Wang Xiaoguang
  2016-09-09  8:25 ` Wang Xiaoguang
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Wang Xiaoguang @ 2016-09-09  8:17 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, s.priebe

cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also result
in a false ENOSPC error. Please see the race window below:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and has no   |   |
    |   free space to satisfy the data request|   |
    |                                         |   |-> btrfs_delete_unused_bgs()
    |                                         |   |    it will decrease btrfs_space_info
    |                                         |   |    total_bytes and make
    |                                         |   |    btrfs_space_info not full.
    |                                         |   |
In this case, we may get an ENOSPC error, but btrfs_space_info is not full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs().
Here we introduce a new struct rw_semaphore bg_delete_sem to do this job.

Indeed there is already a "struct mutex delete_unused_bgs_mutex", but it is a
mutex, so we cannot use it for this purpose. Of course, we could redefine it as
a struct rw_semaphore and then use it in btrfs_alloc_data_chunk_ondemand().
Either method would work.

But given that delete_unused_bgs_mutex's name is longer than bg_delete_sem,
I chose the first method: create a new struct rw_semaphore bg_delete_sem and
delete delete_unused_bgs_mutex :)

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
v2: fix a deadlock revealed by fstests case btrfs/071: we now call
    start_transaction() before down_write(bg_delete_sem) in
    btrfs_delete_unused_bgs().

v3: Stefan Priebe reported another similar deadlock, so here we choose
    not to call down_read(bg_delete_sem) for the free space inode in
    btrfs_alloc_data_chunk_ondemand(). Meanwhile, because we only do the
    data space reservation for the free space cache in the transaction
    context, btrfs_delete_unused_bgs() will either have finished its job
    or started a new transaction waiting for the current transaction to
    complete; there will be no unused block groups to be deleted, so it's
    safe not to call down_read(bg_delete_sem).
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/disk-io.c     | 13 +++++------
 fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++----------
 fs/btrfs/volumes.c     | 42 +++++++++++++++++------------------
 4 files changed, 76 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eff3993..fa78ef9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -788,6 +788,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
@@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
 	struct mutex unused_bg_unpin_mutex;
-	struct mutex delete_unused_bgs_mutex;
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54bc8c7..3cdbd05 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
 		btrfs_run_defrag_inodes(root->fs_info);
 
 		/*
-		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
-		 * with relocation (btrfs_relocate_chunk) and relocation
-		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
-		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
-		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
-		 * unused block groups.
+		 * Acquires fs_info->bg_delete_sem to avoid racing with
+		 * relocation (btrfs_relocate_chunk) and relocation acquires
+		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
+		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
+		 * to, fs_info->cleaner_mutex when deleting unused block groups.
 		 */
 		btrfs_delete_unused_bgs(root->fs_info);
 sleep:
@@ -2634,7 +2633,6 @@ int open_ctree(struct super_block *sb,
 	spin_lock_init(&fs_info->unused_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
-	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
@@ -2723,6 +2721,7 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	init_rwsem(&fs_info->bg_delete_sem);
 	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
 
 	btrfs_init_dev_replace_locks(fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8c8a4d1..f037769 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4128,18 +4128,35 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
+	bool free_space_inode = btrfs_is_free_space_inode(inode);
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
 
-	if (btrfs_is_free_space_inode(inode)) {
+	if (free_space_inode) {
 		need_commit = 0;
 		ASSERT(current->journal_info);
 	}
 
+	/*
+	 * Here we shouldn't call down_read(bg_delete_sem) for free space inode,
+	 * there is lock order between bg_delete_sem and "wait current trans
+	 * finished". Meanwhile because we only do the data space reservation
+	 * for free space cache in the transaction context,
+	 * btrfs_delete_unused_bgs() will either have finished its job, or start
+	 * a new transaction waiting current transaction to complete, there will
+	 * be no unused block groups to be deleted, so it's safe to not call
+	 * down_read(bg_delete_sem).
+	 */
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		if (!free_space_inode) {
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+		}
 		goto alloc;
+	}
 
 again:
 	/* make sure we have enough space to handle the data first */
@@ -4152,6 +4169,17 @@ again:
 		struct btrfs_trans_handle *trans;
 
 		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (!have_bg_delete_sem && !free_space_inode) {
+			spin_unlock(&data_sinfo->lock);
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+			goto again;
+		}
+
+		/*
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
@@ -4173,8 +4201,10 @@ alloc:
 			 * the fs.
 			 */
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
-				return PTR_ERR(trans);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				goto out;
+			}
 
 			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 					     alloc_target,
@@ -4182,7 +4212,7 @@ alloc:
 			btrfs_end_transaction(trans, root);
 			if (ret < 0) {
 				if (ret != -ENOSPC)
-					return ret;
+					goto out;
 				else {
 					have_pinned_space = 1;
 					goto commit_trans;
@@ -4217,15 +4247,17 @@ commit_trans:
 			}
 
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
-				return PTR_ERR(trans);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				goto out;
+			}
 			if (have_pinned_space >= 0 ||
 			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
 				     &trans->transaction->flags) ||
 			    need_commit > 0) {
 				ret = btrfs_commit_transaction(trans, root);
 				if (ret)
-					return ret;
+					goto out;
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
@@ -4242,13 +4274,18 @@ commit_trans:
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
 					      data_sinfo->flags, bytes, 1);
-		return -ENOSPC;
+		ret = -ENOSPC;
+		goto out;
 	}
 	data_sinfo->bytes_may_use += bytes;
 	trace_btrfs_space_reservation(root->fs_info, "space_info",
 				      data_sinfo->flags, bytes, 1);
 	spin_unlock(&data_sinfo->lock);
 
+out:
+	if (have_bg_delete_sem && !free_space_inode)
+		up_read(&root->fs_info->bg_delete_sem);
+
 	return ret;
 }
 
@@ -10806,7 +10843,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_write(&root->fs_info->bg_delete_sem);
 
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
@@ -10934,7 +10971,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		up_write(&root->fs_info->bg_delete_sem);
 		btrfs_put_block_group(block_group);
 		spin_lock(&fs_info->unused_bgs_lock);
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 035efce..408fb0c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2926,7 +2926,7 @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
 	 * we release the path used to search the chunk/dev tree and before
 	 * the current task acquires this mutex and calls us.
 	 */
-	ASSERT(mutex_is_locked(&root->fs_info->delete_unused_bgs_mutex));
+	ASSERT(rwsem_is_locked(&root->fs_info->bg_delete_sem));
 
 	ret = btrfs_can_relocate(extent_root, chunk_offset);
 	if (ret)
@@ -2979,10 +2979,10 @@ again:
 	key.type = BTRFS_CHUNK_ITEM_KEY;
 
 	while (1) {
-		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
+		down_read(&root->fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			goto error;
 		}
 		BUG_ON(ret == 0); /* Corruption */
@@ -2990,7 +2990,7 @@ again:
 		ret = btrfs_previous_item(chunk_root, path, key.objectid,
 					  key.type);
 		if (ret)
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 		if (ret < 0)
 			goto error;
 		if (ret > 0)
@@ -3012,7 +3012,7 @@ again:
 			else
 				BUG_ON(ret);
 		}
-		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+		up_read(&root->fs_info->bg_delete_sem);
 
 		if (found_key.offset == 0)
 			break;
@@ -3568,10 +3568,10 @@ again:
 			goto error;
 		}
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		down_read(&fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto error;
 		}
 
@@ -3585,7 +3585,7 @@ again:
 		ret = btrfs_previous_item(chunk_root, path, 0,
 					  BTRFS_CHUNK_ITEM_KEY);
 		if (ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			ret = 0;
 			break;
 		}
@@ -3595,7 +3595,7 @@ again:
 		btrfs_item_key_to_cpu(leaf, &found_key, slot);
 
 		if (found_key.objectid != key.objectid) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			break;
 		}
 
@@ -3613,12 +3613,12 @@ again:
 
 		btrfs_release_path(path);
 		if (!ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto loop;
 		}
 
 		if (counting) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			spin_lock(&fs_info->balance_lock);
 			bctl->stat.expected++;
 			spin_unlock(&fs_info->balance_lock);
@@ -3643,7 +3643,7 @@ again:
 					count_meta < bctl->meta.limit_min)
 				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
 					count_sys < bctl->sys.limit_min)) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			up_read(&fs_info->bg_delete_sem);
 			goto loop;
 		}
 
@@ -3656,7 +3656,7 @@ again:
 		    !chunk_reserved && !bytes_used) {
 			trans = btrfs_start_transaction(chunk_root, 0);
 			if (IS_ERR(trans)) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+				up_read(&fs_info->bg_delete_sem);
 				ret = PTR_ERR(trans);
 				goto error;
 			}
@@ -3665,7 +3665,7 @@ again:
 						      BTRFS_BLOCK_GROUP_DATA);
 			btrfs_end_transaction(trans, chunk_root);
 			if (ret < 0) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+				up_read(&fs_info->bg_delete_sem);
 				goto error;
 			}
 			chunk_reserved = 1;
@@ -3673,7 +3673,7 @@ again:
 
 		ret = btrfs_relocate_chunk(chunk_root,
 					   found_key.offset);
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		up_read(&fs_info->bg_delete_sem);
 		if (ret && ret != -ENOSPC)
 			goto error;
 		if (ret == -ENOSPC) {
@@ -4400,16 +4400,16 @@ again:
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
 	do {
-		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
+		down_read(&root->fs_info->bg_delete_sem);
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			goto done;
 		}
 
 		ret = btrfs_previous_item(root, path, 0, key.type);
 		if (ret)
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 		if (ret < 0)
 			goto done;
 		if (ret) {
@@ -4423,7 +4423,7 @@ again:
 		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
 
 		if (key.objectid != device->devid) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4432,7 +4432,7 @@ again:
 		length = btrfs_dev_extent_length(l, dev_extent);
 
 		if (key.offset + length <= new_size) {
-			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+			up_read(&root->fs_info->bg_delete_sem);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4441,7 +4441,7 @@ again:
 		btrfs_release_path(path);
 
 		ret = btrfs_relocate_chunk(root, chunk_offset);
-		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+		up_read(&root->fs_info->bg_delete_sem);
 		if (ret && ret != -ENOSPC)
 			goto done;
 		if (ret == -ENOSPC)
-- 
2.9.0




^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:17 [PATCH v3] " Wang Xiaoguang
@ 2016-09-09  8:25 ` Wang Xiaoguang
  2016-09-09  9:02   ` David Sterba
  2016-09-09 10:18 ` Holger Hoffstätte
  2016-09-10  6:04 ` Stefan Priebe - Profihost AG
  2 siblings, 1 reply; 21+ messages in thread
From: Wang Xiaoguang @ 2016-09-09  8:25 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, s.priebe

Hello David,

The v2 version of this patch in your for-next-20160906 branch is still
wrong, I'm really sorry; please revert it.
Stefan Priebe has reported another similar issue, though I didn't see it
in my test environment. Now I choose not to call down_read(bg_delete_sem)
for the free space inode, which I think resolves these issues; please
check, thanks.

Regards,
Xiaoguang Wang

On 09/09/2016 04:17 PM, Wang Xiaoguang wrote:
> cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may
> also result in a false ENOSPC error. Please see the race window below:
>
>                 CPU1                           |             CPU2
>                                                |
> |-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
>      |-> do_chunk_alloc()                      |   |
>      |   assume it returns ENOSPC, which means |   |
>      |   btrfs_space_info is full and has no   |   |
>      |   free space to satisfy the request.    |   |
>      |                                         |   |- > btrfs_delete_unused_bgs()
>      |                                         |   |    it will decrease btrfs_space_info
>      |                                         |   |    total_bytes and make
>      |                                         |   |    btrfs_space_info no longer full.
>      |                                         |   |
> In this case, we may get an ENOSPC error even though btrfs_space_info is not full.
>
> To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
> do_chunk_alloc() to allocate a new chunk, we should block
> btrfs_delete_unused_bgs(). Here we introduce a new struct rw_semaphore,
> bg_delete_sem, to do this job.
>
> Indeed there is already a "struct mutex delete_unused_bgs_mutex", but as a
> mutex it cannot be taken for shared (read) access, so we can not use it for
> this purpose. Of course, we could re-define it as a struct rw_semaphore and
> then use it in btrfs_alloc_data_chunk_ondemand(). Either method would work.
>
> But given that delete_unused_bgs_mutex is a longer name than bg_delete_sem,
> I chose the first method: create a new struct rw_semaphore bg_delete_sem
> and delete delete_unused_bgs_mutex :)
>
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
> V2: fix a deadlock revealed by fstests case btrfs/071: we now call
>      start_transaction() before down_write(bg_delete_sem) in
>      btrfs_delete_unused_bgs().
> 
> v3: Stefan Priebe reported another similar deadlock, so here we choose
>      not to call down_read(bg_delete_sem) for the free space inode in
>      btrfs_alloc_data_chunk_ondemand(). Meanwhile, because we only do the
>      data space reservation for the free space cache in the transaction
>      context, btrfs_delete_unused_bgs() will either have finished its job
>      or will start a new transaction and wait for the current transaction
>      to complete; there will be no unused block groups to be deleted, so
>      it's safe not to call down_read(bg_delete_sem).
> ---
> ---
>   fs/btrfs/ctree.h       |  2 +-
>   fs/btrfs/disk-io.c     | 13 +++++------
>   fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++++++++++++++----------
>   fs/btrfs/volumes.c     | 42 +++++++++++++++++------------------
>   4 files changed, 76 insertions(+), 40 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index eff3993..fa78ef9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -788,6 +788,7 @@ struct btrfs_fs_info {
>   	struct mutex cleaner_mutex;
>   	struct mutex chunk_mutex;
>   	struct mutex volume_mutex;
> +	struct rw_semaphore bg_delete_sem;
>   
>   	/*
>   	 * this is taken to make sure we don't set block groups ro after
> @@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
>   	spinlock_t unused_bgs_lock;
>   	struct list_head unused_bgs;
>   	struct mutex unused_bg_unpin_mutex;
> -	struct mutex delete_unused_bgs_mutex;
>   
>   	/* For btrfs to record security options */
>   	struct security_mnt_opts security_opts;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 54bc8c7..3cdbd05 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
>   		btrfs_run_defrag_inodes(root->fs_info);
>   
>   		/*
> -		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
> -		 * with relocation (btrfs_relocate_chunk) and relocation
> -		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
> -		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
> -		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
> -		 * unused block groups.
> +		 * Acquires fs_info->bg_delete_sem to avoid racing with
> +		 * relocation (btrfs_relocate_chunk) and relocation acquires
> +		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
> +		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
> +		 * to, fs_info->cleaner_mutex when deleting unused block groups.
>   		 */
>   		btrfs_delete_unused_bgs(root->fs_info);
>   sleep:
> @@ -2634,7 +2633,6 @@ int open_ctree(struct super_block *sb,
>   	spin_lock_init(&fs_info->unused_bgs_lock);
>   	rwlock_init(&fs_info->tree_mod_log_lock);
>   	mutex_init(&fs_info->unused_bg_unpin_mutex);
> -	mutex_init(&fs_info->delete_unused_bgs_mutex);
>   	mutex_init(&fs_info->reloc_mutex);
>   	mutex_init(&fs_info->delalloc_root_mutex);
>   	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> @@ -2723,6 +2721,7 @@ int open_ctree(struct super_block *sb,
>   	init_rwsem(&fs_info->commit_root_sem);
>   	init_rwsem(&fs_info->cleanup_work_sem);
>   	init_rwsem(&fs_info->subvol_sem);
> +	init_rwsem(&fs_info->bg_delete_sem);
>   	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
>   
>   	btrfs_init_dev_replace_locks(fs_info);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8c8a4d1..f037769 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4128,18 +4128,35 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
>   	int ret = 0;
>   	int need_commit = 2;
>   	int have_pinned_space;
> +	int have_bg_delete_sem = 0;
> +	bool free_space_inode = btrfs_is_free_space_inode(inode);
>   
>   	/* make sure bytes are sectorsize aligned */
>   	bytes = ALIGN(bytes, root->sectorsize);
>   
> -	if (btrfs_is_free_space_inode(inode)) {
> +	if (free_space_inode) {
>   		need_commit = 0;
>   		ASSERT(current->journal_info);
>   	}
>   
> +	/*
> +	 * Here we shouldn't call down_read(bg_delete_sem) for free space inode,
> +	 * there is lock order between bg_delete_sem and "wait current trans
> +	 * finished". Meanwhile because we only do the data space reservation
> +	 * for free space cache in the transaction context,
> +	 * btrfs_delete_unused_bgs() will either have finished its job, or start
> +	 * a new transaction waiting current transaction to complete, there will
> +	 * be no unused block groups to be deleted, so it's safe to not call
> +	 * down_read(bg_delete_sem).
> +	 */
>   	data_sinfo = fs_info->data_sinfo;
> -	if (!data_sinfo)
> +	if (!data_sinfo) {
> +		if (!free_space_inode) {
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +		}
>   		goto alloc;
> +	}
>   
>   again:
>   	/* make sure we have enough space to handle the data first */
> @@ -4152,6 +4169,17 @@ again:
>   		struct btrfs_trans_handle *trans;
>   
>   		/*
> +		 * We may need to allocate new chunk, so we should block
> +		 * btrfs_delete_unused_bgs()
> +		 */
> +		if (!have_bg_delete_sem && !free_space_inode) {
> +			spin_unlock(&data_sinfo->lock);
> +			down_read(&root->fs_info->bg_delete_sem);
> +			have_bg_delete_sem = 1;
> +			goto again;
> +		}
> +
> +		/*
>   		 * if we don't have enough free bytes in this space then we need
>   		 * to alloc a new chunk.
>   		 */
> @@ -4173,8 +4201,10 @@ alloc:
>   			 * the fs.
>   			 */
>   			trans = btrfs_join_transaction(root);
> -			if (IS_ERR(trans))
> -				return PTR_ERR(trans);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				goto out;
> +			}
>   
>   			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
>   					     alloc_target,
> @@ -4182,7 +4212,7 @@ alloc:
>   			btrfs_end_transaction(trans, root);
>   			if (ret < 0) {
>   				if (ret != -ENOSPC)
> -					return ret;
> +					goto out;
>   				else {
>   					have_pinned_space = 1;
>   					goto commit_trans;
> @@ -4217,15 +4247,17 @@ commit_trans:
>   			}
>   
>   			trans = btrfs_join_transaction(root);
> -			if (IS_ERR(trans))
> -				return PTR_ERR(trans);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				goto out;
> +			}
>   			if (have_pinned_space >= 0 ||
>   			    test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
>   				     &trans->transaction->flags) ||
>   			    need_commit > 0) {
>   				ret = btrfs_commit_transaction(trans, root);
>   				if (ret)
> -					return ret;
> +					goto out;
>   				/*
>   				 * The cleaner kthread might still be doing iput
>   				 * operations. Wait for it to finish so that
> @@ -4242,13 +4274,18 @@ commit_trans:
>   		trace_btrfs_space_reservation(root->fs_info,
>   					      "space_info:enospc",
>   					      data_sinfo->flags, bytes, 1);
> -		return -ENOSPC;
> +		ret = -ENOSPC;
> +		goto out;
>   	}
>   	data_sinfo->bytes_may_use += bytes;
>   	trace_btrfs_space_reservation(root->fs_info, "space_info",
>   				      data_sinfo->flags, bytes, 1);
>   	spin_unlock(&data_sinfo->lock);
>   
> +out:
> +	if (have_bg_delete_sem && !free_space_inode)
> +		up_read(&root->fs_info->bg_delete_sem);
> +
>   	return ret;
>   }
>   
> @@ -10806,7 +10843,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>   		}
>   		spin_unlock(&fs_info->unused_bgs_lock);
>   
> -		mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +		down_write(&root->fs_info->bg_delete_sem);
>   
>   		/* Don't want to race with allocators so take the groups_sem */
>   		down_write(&space_info->groups_sem);
> @@ -10934,7 +10971,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>   end_trans:
>   		btrfs_end_transaction(trans, root);
>   next:
> -		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +		up_write(&root->fs_info->bg_delete_sem);
>   		btrfs_put_block_group(block_group);
>   		spin_lock(&fs_info->unused_bgs_lock);
>   	}
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 035efce..408fb0c 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2926,7 +2926,7 @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
>   	 * we release the path used to search the chunk/dev tree and before
>   	 * the current task acquires this mutex and calls us.
>   	 */
> -	ASSERT(mutex_is_locked(&root->fs_info->delete_unused_bgs_mutex));
> +	ASSERT(rwsem_is_locked(&root->fs_info->bg_delete_sem));
>   
>   	ret = btrfs_can_relocate(extent_root, chunk_offset);
>   	if (ret)
> @@ -2979,10 +2979,10 @@ again:
>   	key.type = BTRFS_CHUNK_ITEM_KEY;
>   
>   	while (1) {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>   		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>   		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			goto error;
>   		}
>   		BUG_ON(ret == 0); /* Corruption */
> @@ -2990,7 +2990,7 @@ again:
>   		ret = btrfs_previous_item(chunk_root, path, key.objectid,
>   					  key.type);
>   		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   		if (ret < 0)
>   			goto error;
>   		if (ret > 0)
> @@ -3012,7 +3012,7 @@ again:
>   			else
>   				BUG_ON(ret);
>   		}
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>   
>   		if (found_key.offset == 0)
>   			break;
> @@ -3568,10 +3568,10 @@ again:
>   			goto error;
>   		}
>   
> -		mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +		down_read(&fs_info->bg_delete_sem);
>   		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>   		if (ret < 0) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			goto error;
>   		}
>   
> @@ -3585,7 +3585,7 @@ again:
>   		ret = btrfs_previous_item(chunk_root, path, 0,
>   					  BTRFS_CHUNK_ITEM_KEY);
>   		if (ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			ret = 0;
>   			break;
>   		}
> @@ -3595,7 +3595,7 @@ again:
>   		btrfs_item_key_to_cpu(leaf, &found_key, slot);
>   
>   		if (found_key.objectid != key.objectid) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			break;
>   		}
>   
> @@ -3613,12 +3613,12 @@ again:
>   
>   		btrfs_release_path(path);
>   		if (!ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			goto loop;
>   		}
>   
>   		if (counting) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			spin_lock(&fs_info->balance_lock);
>   			bctl->stat.expected++;
>   			spin_unlock(&fs_info->balance_lock);
> @@ -3643,7 +3643,7 @@ again:
>   					count_meta < bctl->meta.limit_min)
>   				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
>   					count_sys < bctl->sys.limit_min)) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>   			goto loop;
>   		}
>   
> @@ -3656,7 +3656,7 @@ again:
>   		    !chunk_reserved && !bytes_used) {
>   			trans = btrfs_start_transaction(chunk_root, 0);
>   			if (IS_ERR(trans)) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>   				ret = PTR_ERR(trans);
>   				goto error;
>   			}
> @@ -3665,7 +3665,7 @@ again:
>   						      BTRFS_BLOCK_GROUP_DATA);
>   			btrfs_end_transaction(trans, chunk_root);
>   			if (ret < 0) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>   				goto error;
>   			}
>   			chunk_reserved = 1;
> @@ -3673,7 +3673,7 @@ again:
>   
>   		ret = btrfs_relocate_chunk(chunk_root,
>   					   found_key.offset);
> -		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +		up_read(&fs_info->bg_delete_sem);
>   		if (ret && ret != -ENOSPC)
>   			goto error;
>   		if (ret == -ENOSPC) {
> @@ -4400,16 +4400,16 @@ again:
>   	key.type = BTRFS_DEV_EXTENT_KEY;
>   
>   	do {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>   		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>   		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			goto done;
>   		}
>   
>   		ret = btrfs_previous_item(root, path, 0, key.type);
>   		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   		if (ret < 0)
>   			goto done;
>   		if (ret) {
> @@ -4423,7 +4423,7 @@ again:
>   		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
>   
>   		if (key.objectid != device->devid) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			btrfs_release_path(path);
>   			break;
>   		}
> @@ -4432,7 +4432,7 @@ again:
>   		length = btrfs_dev_extent_length(l, dev_extent);
>   
>   		if (key.offset + length <= new_size) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>   			btrfs_release_path(path);
>   			break;
>   		}
> @@ -4441,7 +4441,7 @@ again:
>   		btrfs_release_path(path);
>   
>   		ret = btrfs_relocate_chunk(root, chunk_offset);
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>   		if (ret && ret != -ENOSPC)
>   			goto done;
>   		if (ret == -ENOSPC)




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:25 ` Wang Xiaoguang
@ 2016-09-09  9:02   ` David Sterba
  0 siblings, 0 replies; 21+ messages in thread
From: David Sterba @ 2016-09-09  9:02 UTC (permalink / raw)
  To: Wang Xiaoguang; +Cc: linux-btrfs, dsterba, s.priebe

On Fri, Sep 09, 2016 at 04:25:15PM +0800, Wang Xiaoguang wrote:
> hello David,
> 
> The v2 version of this patch in your for-next-20160906 branch is still
> wrong, I'm really sorry; please revert it.

Patch replaced with V3 in the upcoming for-next.

> Stefan Priebe has reported another similar issue, though I didn't see it
> in my test environment. Now I choose not to call down_read(bg_delete_sem)
> for the free space inode, which I think resolves these issues; please
> check, thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:17 [PATCH v3] " Wang Xiaoguang
  2016-09-09  8:25 ` Wang Xiaoguang
@ 2016-09-09 10:18 ` Holger Hoffstätte
  2016-09-09 10:56   ` Holger Hoffstätte
  2016-09-10  6:04 ` Stefan Priebe - Profihost AG
  2 siblings, 1 reply; 21+ messages in thread
From: Holger Hoffstätte @ 2016-09-09 10:18 UTC (permalink / raw)
  To: linux-btrfs

On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:

> cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may
> also result in a false ENOSPC error.
<snip>

With this v3 I can no longer balance (tested only with metadata).
New chunks are allocated (as balance does), but nothing ever shrinks until
after unmount/remount, when the cleaner eventually kicks in.

This might be related to the recent patch by Naohiro Aota:
"btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"

which by itself doesn't seem to do any harm (i.e. everything still seems
to work as expected).

-h


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09 10:18 ` Holger Hoffstätte
@ 2016-09-09 10:56   ` Holger Hoffstätte
  2016-09-12  7:38     ` Wang Xiaoguang
  0 siblings, 1 reply; 21+ messages in thread
From: Holger Hoffstätte @ 2016-09-09 10:56 UTC (permalink / raw)
  To: linux-btrfs, Wang Xiaoguang, Naohiro Aota

On 09/09/16 12:18, Holger Hoffstätte wrote:
> On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:
> 
>> cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
>> to delete unused block groups. Because this work is asynchronous, it may
>> also result in a false ENOSPC error.
> <snip>
> 
> With this v3 I can no longer balance (tested only with metadata).
> New chunks are allocated (as balance does), but nothing ever shrinks until
> after unmount/remount, when the cleaner eventually kicks in.
> 
> This might be related to the recent patch by Naohiro Aota:
> "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
> 
> which by itself doesn't seem to do any harm (i.e. everything still seems
> to work as expected).

Actually even that is not true; both patches seem to be wrong in subtle
ways. Naohiro's patch seems to prevent the deletion during balance, whereas
yours prevents the cleaner from kicking in.

As a simple reproducer you can convert from -mdup to -msingle (to create
bloat) and then balance with -musage=10. Depending on which of the two
patches is applied, you end up with bloat that only grows and never shrinks,
or bloat that ends up in a mixed state (dup and single).

Undoing both makes both balancing and cleaning work again.
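Written out as a script, the reproducer above might look like the following.
The loop-device image, its size, and the mount path are my assumptions
rather than Holger's exact setup, and the destructive part is gated behind
RUN_REPRO=1 since it needs root and btrfs-progs:

```shell
#!/bin/sh
# Sketch of the metadata-bloat reproducer. With the broken patches applied,
# the second balance never shrinks the chunk count until a remount.
if [ "${RUN_REPRO:-0}" != "1" ]; then
    echo "set RUN_REPRO=1 (as root, with btrfs-progs) to run the reproducer"
    exit 0
fi

IMG=/tmp/btrfs-bloat.img
MNT=/tmp/btrfs-bloat.mnt

truncate -s 4G "$IMG"
mkfs.btrfs -f -m dup "$IMG"
mkdir -p "$MNT"
mount -o loop "$IMG" "$MNT"

# Step 1: convert metadata from dup to single, leaving behind
# mostly-empty chunks ("bloat").
btrfs balance start -f -mconvert=single "$MNT"

# Step 2: compact metadata chunks that are less than 10% used;
# observe whether unused block groups are ever reclaimed.
btrfs balance start -musage=10 "$MNT"
btrfs filesystem df "$MNT"

umount "$MNT"
rm -f "$IMG"
```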

-h


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09  8:17 [PATCH v3] " Wang Xiaoguang
  2016-09-09  8:25 ` Wang Xiaoguang
  2016-09-09 10:18 ` Holger Hoffstätte
@ 2016-09-10  6:04 ` Stefan Priebe - Profihost AG
  2 siblings, 0 replies; 21+ messages in thread
From: Stefan Priebe - Profihost AG @ 2016-09-10  6:04 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs; +Cc: dsterba, Holger Hoffstätte

Thanks,

this one works fine. No deadlocks.

Stefan

On 09.09.2016 at 10:17, Wang Xiaoguang wrote:
> cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
> to delete unused block groups. Because this work is asynchronous, it may
> also result in a false ENOSPC error.
> <snip>
>  
>  	ret = btrfs_can_relocate(extent_root, chunk_offset);
>  	if (ret)
> @@ -2979,10 +2979,10 @@ again:
>  	key.type = BTRFS_CHUNK_ITEM_KEY;
>  
>  	while (1) {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>  		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			goto error;
>  		}
>  		BUG_ON(ret == 0); /* Corruption */
> @@ -2990,7 +2990,7 @@ again:
>  		ret = btrfs_previous_item(chunk_root, path, key.objectid,
>  					  key.type);
>  		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  		if (ret < 0)
>  			goto error;
>  		if (ret > 0)
> @@ -3012,7 +3012,7 @@ again:
>  			else
>  				BUG_ON(ret);
>  		}
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>  
>  		if (found_key.offset == 0)
>  			break;
> @@ -3568,10 +3568,10 @@ again:
>  			goto error;
>  		}
>  
> -		mutex_lock(&fs_info->delete_unused_bgs_mutex);
> +		down_read(&fs_info->bg_delete_sem);
>  		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			goto error;
>  		}
>  
> @@ -3585,7 +3585,7 @@ again:
>  		ret = btrfs_previous_item(chunk_root, path, 0,
>  					  BTRFS_CHUNK_ITEM_KEY);
>  		if (ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			ret = 0;
>  			break;
>  		}
> @@ -3595,7 +3595,7 @@ again:
>  		btrfs_item_key_to_cpu(leaf, &found_key, slot);
>  
>  		if (found_key.objectid != key.objectid) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			break;
>  		}
>  
> @@ -3613,12 +3613,12 @@ again:
>  
>  		btrfs_release_path(path);
>  		if (!ret) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			goto loop;
>  		}
>  
>  		if (counting) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			spin_lock(&fs_info->balance_lock);
>  			bctl->stat.expected++;
>  			spin_unlock(&fs_info->balance_lock);
> @@ -3643,7 +3643,7 @@ again:
>  					count_meta < bctl->meta.limit_min)
>  				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
>  					count_sys < bctl->sys.limit_min)) {
> -			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +			up_read(&fs_info->bg_delete_sem);
>  			goto loop;
>  		}
>  
> @@ -3656,7 +3656,7 @@ again:
>  		    !chunk_reserved && !bytes_used) {
>  			trans = btrfs_start_transaction(chunk_root, 0);
>  			if (IS_ERR(trans)) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>  				ret = PTR_ERR(trans);
>  				goto error;
>  			}
> @@ -3665,7 +3665,7 @@ again:
>  						      BTRFS_BLOCK_GROUP_DATA);
>  			btrfs_end_transaction(trans, chunk_root);
>  			if (ret < 0) {
> -				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +				up_read(&fs_info->bg_delete_sem);
>  				goto error;
>  			}
>  			chunk_reserved = 1;
> @@ -3673,7 +3673,7 @@ again:
>  
>  		ret = btrfs_relocate_chunk(chunk_root,
>  					   found_key.offset);
> -		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> +		up_read(&fs_info->bg_delete_sem);
>  		if (ret && ret != -ENOSPC)
>  			goto error;
>  		if (ret == -ENOSPC) {
> @@ -4400,16 +4400,16 @@ again:
>  	key.type = BTRFS_DEV_EXTENT_KEY;
>  
>  	do {
> -		mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
> +		down_read(&root->fs_info->bg_delete_sem);
>  		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			goto done;
>  		}
>  
>  		ret = btrfs_previous_item(root, path, 0, key.type);
>  		if (ret)
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  		if (ret < 0)
>  			goto done;
>  		if (ret) {
> @@ -4423,7 +4423,7 @@ again:
>  		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
>  
>  		if (key.objectid != device->devid) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			btrfs_release_path(path);
>  			break;
>  		}
> @@ -4432,7 +4432,7 @@ again:
>  		length = btrfs_dev_extent_length(l, dev_extent);
>  
>  		if (key.offset + length <= new_size) {
> -			mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +			up_read(&root->fs_info->bg_delete_sem);
>  			btrfs_release_path(path);
>  			break;
>  		}
> @@ -4441,7 +4441,7 @@ again:
>  		btrfs_release_path(path);
>  
>  		ret = btrfs_relocate_chunk(root, chunk_offset);
> -		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
> +		up_read(&root->fs_info->bg_delete_sem);
>  		if (ret && ret != -ENOSPC)
>  			goto done;
>  		if (ret == -ENOSPC)
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-09 10:56   ` Holger Hoffstätte
@ 2016-09-12  7:38     ` Wang Xiaoguang
  2016-09-12  8:19       ` Holger Hoffstätte
  0 siblings, 1 reply; 21+ messages in thread
From: Wang Xiaoguang @ 2016-09-12  7:38 UTC (permalink / raw)
  To: Holger Hoffstätte, linux-btrfs, Naohiro Aota

hello,

On 09/09/2016 06:56 PM, Holger Hoffstätte wrote:
> On 09/09/16 12:18, Holger Hoffstätte wrote:
>> On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:
>>
>>> cleaner_kthread() may run at any time, and when it does it calls
>>> btrfs_delete_unused_bgs() to delete unused block groups. Because this work
>>> is asynchronous, it can also result in a false ENOSPC error.
>> <snip>
>>
>> With this v3 I can now no longer balance (tested only with metadata).
>> New chunks are allocated (as balance does) but nothing ever shrinks, until
>> after unmount/remount, when the cleaner eventually kicks in.
>>
>> This might be related to the recent patch by Naohiro Aota:
>> "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
>>
>> which by itself doesn't seem to do any harm (i.e. everything still seems
>> to work as expected).
> Actually even that is not true; both patches seem to be wrong in subtle
> ways. Naohiro's patch seems to prevent the deletion during balance, whereas
> yours prevents the cleaner from kicking in.
Indeed, in my patch I just changed "struct mutex delete_unused_bgs_mutex"
to "struct rw_semaphore bg_delete_sem" and take bg_delete_sem for reading
when we allocate data space, so the cleaner should work as before :)

>
> As a simple reproducer you can convert from -mdup to -msingle (to create
> bloat) and then balance with -musage=10. Depending on which of the two
> patches are applied, you end with bloat that only grows and never shrinks,
> or bloat that ends up in mixed state (dup and single).
Could you give me a simple test script to reproduce your problem?
(I can write it myself, but I'm afraid I may misunderstand you :) )

Regards,
Xiaoguang Wang
>
> Undoing both makes both balancing and cleaning work again.
>
> -h
>
>
>





* Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
  2016-09-12  7:38     ` Wang Xiaoguang
@ 2016-09-12  8:19       ` Holger Hoffstätte
  0 siblings, 0 replies; 21+ messages in thread
From: Holger Hoffstätte @ 2016-09-12  8:19 UTC (permalink / raw)
  To: Wang Xiaoguang, linux-btrfs, Naohiro Aota

On 09/12/16 09:38, Wang Xiaoguang wrote:
<snip>
>> Actually even that is not true; both patches seem to be wrong in subtle
>> ways. Naohiro's patch seems to prevent the deletion during balance, whereas
>> yours prevents the cleaner from kicking in.
> Indeed in my patch, I just change "struct mutex delete_unused_bgs_mutex"
> to "struct rw_semaphore bg_delete_sem", and try to get bg_delete_sem when
> we allocate data space, so this patch should work as before :)
> 
>>
>> As a simple reproducer you can convert from -mdup to -msingle (to create
>> bloat) and then balance with -musage=10. Depending on which of the two
>> patches are applied, you end with bloat that only grows and never shrinks,
>> or bloat that ends up in mixed state (dup and single).
> Can you give me a simple test script to reproduce your problem.
> (I can write it myself, but I'm afraid I may misunderstand you :) )

I don't have a script and no time this week to play with this.
Just create an fs with dup metadata, balance -mconvert=single and then
back with -mconvert=dup. That's all I tried.
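The steps Holger describes could be sketched as a script along these lines (a hypothetical reproducer, not one from the thread; the device and mount point are placeholders you must supply, and the commands assume btrfs-progs is installed — note this wipes the given device):

```shell
#!/bin/sh
# Sketch of the reproducer described above: make a filesystem with dup
# metadata, convert to single, then convert back to dup, and check
# whether the now-unused block groups ever get cleaned up.
# DESTRUCTIVE: wipes the given device, so it must be passed explicitly.

repro() {
	DEV="$1"
	MNT="${2:-/mnt/btrfs-test}"

	if [ -z "$DEV" ]; then
		echo "usage: repro <device> [mountpoint]"
		return 0
	fi

	mkfs.btrfs -f -m dup "$DEV"
	mkdir -p "$MNT"
	mount "$DEV" "$MNT"

	# Create some metadata so there is something to balance.
	for i in $(seq 1 1000); do touch "$MNT/f$i"; done
	sync

	# Converting metadata to single reduces redundancy, hence -f.
	btrfs balance start -f -mconvert=single "$MNT"
	btrfs filesystem df "$MNT"   # expect only single metadata here
	btrfs balance start -mconvert=dup "$MNT"
	btrfs filesystem df "$MNT"   # look for leftover single chunks

	umount "$MNT"
}

repro "$@"
```

With the buggy patches applied, the second `btrfs filesystem df` would reportedly show metadata stuck in a mixed dup-and-single state until the cleaner runs after a remount.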

Holger



end of thread, other threads:[~2016-09-12  8:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-25  7:51 [PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue Wang Xiaoguang
2016-07-25  7:51 ` [PATCH v2 1/4] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() Wang Xiaoguang
2016-07-25  7:51 ` [PATCH v2 2/4] btrfs: divide btrfs_update_reserved_bytes() into two functions Wang Xiaoguang
2016-07-25  7:51 ` [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely Wang Xiaoguang
2016-07-25 13:33   ` Josef Bacik
2016-07-25 18:10   ` Josef Bacik
2016-07-25  7:51 ` [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space Wang Xiaoguang
2016-07-25 13:32   ` Josef Bacik
2016-07-26 12:04     ` [PATCH v3] " Wang Xiaoguang
2016-07-26 15:51       ` Josef Bacik
2016-07-26 16:25         ` David Sterba
2016-07-27  1:26         ` Wang Xiaoguang
2016-07-27  5:50           ` [PATCH v4] " Wang Xiaoguang
  -- strict thread matches above, loose matches on Subject: below --
2016-09-09  8:17 [PATCH v3] " Wang Xiaoguang
2016-09-09  8:25 ` Wang Xiaoguang
2016-09-09  9:02   ` David Sterba
2016-09-09 10:18 ` Holger Hoffstätte
2016-09-09 10:56   ` Holger Hoffstätte
2016-09-12  7:38     ` Wang Xiaoguang
2016-09-12  8:19       ` Holger Hoffstätte
2016-09-10  6:04 ` Stefan Priebe - Profihost AG
