[PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group


* [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
@ 2023-08-07 16:12 Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 01/10] btrfs: introduce struct to consolidate extent buffer write context Naohiro Aota
                   ` (12 more replies)
  0 siblings, 13 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, josef, dsterba, Naohiro Aota

In the current implementation, block groups are activated at
reservation time to ensure that all reserved bytes can be written to
an active metadata block group. However, this approach has proven to
be less efficient, as it activates block groups more frequently than
necessary, putting pressure on the active zone resource and leading to
potential issues such as early ENOSPC or hung_task.

Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block
group allocations, resulting in decreased overall performance.

Actually, we don't need so many active metadata block groups because
there is only one sequential metadata write stream.

So, this series introduces write-time activation of metadata and
system block groups. This involves reserving active zones for at least
one metadata block group and one system block group. By the time a
write goes into a new block group, all the regions in the current
active block group must already have been allocated. So, we can wait
for IOs to fill the space, and then switch to the new block group.
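
As a rough illustration only, here is a user-space toy model (not kernel
code) of a single sequential metadata stream that activates the next
block group only when a write no longer fits in the current one. All
toy_* names are hypothetical, and waiting for in-flight IO is reduced to
a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one sequential metadata write stream. */
struct toy_stream {
	uint64_t bg_capacity;     /* usable bytes per block group */
	uint64_t write_pointer;   /* bytes written in the current block group */
	unsigned int activations; /* block groups activated so far */
};

static void toy_stream_init(struct toy_stream *s, uint64_t bg_capacity)
{
	s->bg_capacity = bg_capacity;
	s->write_pointer = 0;
	s->activations = 1;       /* the first block group starts active */
}

/* Write one extent buffer; switch block groups only when the current one is full. */
static void toy_stream_write(struct toy_stream *s, uint64_t eb_len)
{
	assert(eb_len <= s->bg_capacity);

	if (s->write_pointer + eb_len > s->bg_capacity) {
		/*
		 * The current block group is fully allocated. The real code
		 * would wait for IOs to fill it before activating the next.
		 */
		s->write_pointer = 0;
		s->activations++;
	}
	s->write_pointer += eb_len;
}

/* Count activations needed to write nr_ebs buffers of eb_len bytes. */
static unsigned int toy_activations_for(uint64_t bg_capacity, uint64_t eb_len,
					unsigned int nr_ebs)
{
	struct toy_stream s;
	unsigned int i;

	toy_stream_init(&s, bg_capacity);
	for (i = 0; i < nr_ebs; i++)
		toy_stream_write(&s, eb_len);
	return s.activations;
}
```

In this model at most one block group is being written at any time,
which is why a small fixed reservation of active zones is enough.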

Switching to write-time activation solves the above issues and leads
to better performance.

* Performance

There is a significant difference with a workload (buffered write without
sync) because we re-enable metadata over-commit.

before the patch:  741.00 MB/sec
after the patch:  1430.27 MB/sec (+ 93%)

* Organization

Patches 1-5 are preparation patches involving the meta_write_pointer check.

Patches 6 and 7 are the main part of this series, implementing the
write-time activation.

Patches 8-10 address the code for reservation-time activation: counting
a fresh block group as zone_unusable, activating a block group on
allocation, and disabling metadata over-commit.

* Changes

- v3
  - Rework the reservation patch to fix the over-reservation problem
    https://lore.kernel.org/all/xpb5wdmxx5wops26ihulo73oluc64dt4zpxqc7cirp2wvxl3qy@hv7lsvma5hxf/
  - Rename btrfs_eb_write_context's block_group to zoned_bg.

- v2
  - Introduce a struct to consolidate extent buffer write context
    (btrfs_eb_write_context)
  - Change return type of btrfs_check_meta_write_pointer to int
  - Calculate the reservation count only when it sees DUP BG
  - Drop unnecessary BG lock

Naohiro Aota (10):
  btrfs: introduce struct to consolidate extent buffer write context
  btrfs: zoned: introduce block group context to btrfs_eb_write_context
  btrfs: zoned: return int from btrfs_check_meta_write_pointer
  btrfs: zoned: defer advancing meta_write_pointer
  btrfs: zoned: update meta_write_pointer on zone finish
  btrfs: zoned: reserve zones for an active metadata/system block group
  btrfs: zoned: activate metadata block group on write time
  btrfs: zoned: no longer count fresh BG region as zone unusable
  btrfs: zoned: don't activate non-DATA BG on allocation
  btrfs: zoned: re-enable metadata over-commit for zoned mode

 fs/btrfs/block-group.c      |  13 +-
 fs/btrfs/disk-io.c          |   2 +
 fs/btrfs/extent-tree.c      |   8 +-
 fs/btrfs/extent_io.c        |  44 +++---
 fs/btrfs/extent_io.h        |   7 +
 fs/btrfs/free-space-cache.c |   8 +-
 fs/btrfs/fs.h               |   3 +
 fs/btrfs/space-info.c       |  34 +----
 fs/btrfs/zoned.c            | 259 ++++++++++++++++++++++++++++--------
 fs/btrfs/zoned.h            |  29 ++--
 10 files changed, 273 insertions(+), 134 deletions(-)

-- 
2.41.0


* [PATCH v3 01/10] btrfs: introduce struct to consolidate extent buffer write context
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 02/10] btrfs: zoned: introduce block group context to btrfs_eb_write_context Naohiro Aota
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, josef, dsterba, Naohiro Aota, Christoph Hellwig,
	Johannes Thumshirn

Introduce btrfs_eb_write_context to consolidate writeback_control and the
extent buffer context.

This will help with adding a block group context as well.

While at it, move the eb context setting before
btrfs_check_meta_write_pointer(). We can set it here because we need to
skip pages in the same eb anyway if that eb is rejected by
btrfs_check_meta_write_pointer().
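
For illustration, a user-space toy model (not kernel code) of why
recording the eb before the check is enough: once an eb is rejected,
later pages of the same eb are skipped without running the check again.
All toy_* names and fields are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_wbc { int sync_all; };

struct toy_eb {
	int aligned;          /* would the write pointer check pass? */
	unsigned int checks;  /* how many times the check ran on this eb */
};

/* Mirrors the shape of btrfs_eb_write_context in this patch. */
struct toy_eb_write_context {
	struct toy_wbc *wbc;
	struct toy_eb *eb;    /* last eb seen, accepted or rejected */
};

/* Returns 1 if the eb would be submitted, 0 if the page is skipped. */
static int toy_submit_eb_page(struct toy_eb *page_eb,
			      struct toy_eb_write_context *ctx)
{
	assert(page_eb && ctx);

	/* Another page of the eb we already handled: nothing to do. */
	if (page_eb == ctx->eb)
		return 0;

	/* Remember the eb before the check, as the patch does. */
	ctx->eb = page_eb;
	page_eb->checks++;
	return page_eb->aligned ? 1 : 0;
}

/* Feed nr_pages pages of one rejected eb and count the checks performed. */
static unsigned int toy_checks_for_rejected_eb(unsigned int nr_pages)
{
	struct toy_wbc wbc = { 0 };
	struct toy_eb eb = { 0, 0 };
	struct toy_eb_write_context ctx = { &wbc, NULL };
	unsigned int i;

	for (i = 0; i < nr_pages; i++)
		toy_submit_eb_page(&eb, &ctx);
	return eb.checks;
}
```

With four pages backing one rejected eb, the check runs once in this
model instead of four times.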

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 14 +++++++-------
 fs/btrfs/extent_io.h |  5 +++++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 177d65d51447..5905d2d42aab 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1784,9 +1784,9 @@ static int submit_eb_subpage(struct page *page, struct writeback_control *wbc)
  * previous call.
  * Return <0 for fatal error.
  */
-static int submit_eb_page(struct page *page, struct writeback_control *wbc,
-			  struct extent_buffer **eb_context)
+static int submit_eb_page(struct page *page, struct btrfs_eb_write_context *ctx)
 {
+	struct writeback_control *wbc = ctx->wbc;
 	struct address_space *mapping = page->mapping;
 	struct btrfs_block_group *cache = NULL;
 	struct extent_buffer *eb;
@@ -1815,7 +1815,7 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 		return 0;
 	}
 
-	if (eb == *eb_context) {
+	if (eb == ctx->eb) {
 		spin_unlock(&mapping->private_lock);
 		return 0;
 	}
@@ -1824,6 +1824,8 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 	if (!ret)
 		return 0;
 
+	ctx->eb = eb;
+
 	if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, &cache)) {
 		/*
 		 * If for_sync, this hole will be filled with
@@ -1837,8 +1839,6 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 		return ret;
 	}
 
-	*eb_context = eb;
-
 	if (!lock_extent_buffer_for_io(eb, wbc)) {
 		btrfs_revert_meta_write_pointer(cache, eb);
 		if (cache)
@@ -1861,7 +1861,7 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 int btree_write_cache_pages(struct address_space *mapping,
 				   struct writeback_control *wbc)
 {
-	struct extent_buffer *eb_context = NULL;
+	struct btrfs_eb_write_context ctx = { .wbc = wbc };
 	struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
 	int ret = 0;
 	int done = 0;
@@ -1903,7 +1903,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 		for (i = 0; i < nr_folios; i++) {
 			struct folio *folio = fbatch.folios[i];
 
-			ret = submit_eb_page(&folio->page, wbc, &eb_context);
+			ret = submit_eb_page(&folio->page, &ctx);
 			if (ret == 0)
 				continue;
 			if (ret < 0) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index adda14c1b763..e243a8eac910 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -93,6 +93,11 @@ struct extent_buffer {
 #endif
 };
 
+struct btrfs_eb_write_context {
+	struct writeback_control *wbc;
+	struct extent_buffer *eb;
+};
+
 /*
  * Get the correct offset inside the page of extent buffer.
  *
-- 
2.41.0



* [PATCH v3 02/10] btrfs: zoned: introduce block group context to btrfs_eb_write_context
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 01/10] btrfs: introduce struct to consolidate extent buffer write context Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 03/10] btrfs: zoned: return int from btrfs_check_meta_write_pointer Naohiro Aota
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, josef, dsterba, Naohiro Aota, Christoph Hellwig,
	Johannes Thumshirn

For metadata writeout in zoned mode, we call
btrfs_check_meta_write_pointer() to check if an extent buffer to be written
is aligned to the write pointer.

We look up the block group containing the extent buffer for every extent
buffer, which takes unnecessary effort as the extent buffers being written
are mostly contiguous.

Introduce "zoned_bg" to cache the block group being worked on.

Also, while at it, rename "cache" to "block_group".
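
A user-space toy model (not kernel code) of the caching idea: the
cached block group is reused while the extent buffer start stays inside
its range and is looked up again otherwise. The toy_* names, the linear
search and the missing reference counting are simplifications assumed
for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_bg {
	uint64_t start;
	uint64_t length;
};

struct toy_ctx {
	const struct toy_bg *zoned_bg; /* cached block group, may be NULL */
	unsigned int lookups;          /* number of full lookups performed */
};

/* Stand-in for btrfs_lookup_block_group(): linear search by address. */
static const struct toy_bg *toy_lookup(const struct toy_bg *bgs, size_t n,
					uint64_t addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (bgs[i].start <= addr && addr < bgs[i].start + bgs[i].length)
			return &bgs[i];
	return NULL;
}

/* Return the block group holding eb_start, using the cache when it still covers it. */
static const struct toy_bg *toy_get_bg(struct toy_ctx *ctx,
					const struct toy_bg *bgs, size_t n,
					uint64_t eb_start)
{
	const struct toy_bg *bg = ctx->zoned_bg;

	/* Same range test as in the patch: drop the cache if eb is outside. */
	if (bg && (bg->start > eb_start || bg->start + bg->length <= eb_start)) {
		bg = NULL;
		ctx->zoned_bg = NULL;
	}
	if (!bg) {
		ctx->lookups++;
		bg = toy_lookup(bgs, n, eb_start);
		ctx->zoned_bg = bg;
	}
	return bg;
}

/* Five mostly contiguous buffers across two block groups. */
static unsigned int toy_demo_lookups(void)
{
	static const struct toy_bg bgs[] = { { 0, 1024 }, { 1024, 1024 } };
	static const uint64_t ebs[] = { 0, 16, 32, 1024, 1040 };
	struct toy_ctx ctx = { NULL, 0 };
	size_t i;

	for (i = 0; i < sizeof(ebs) / sizeof(ebs[0]); i++)
		assert(toy_get_bg(&ctx, bgs, 2, ebs[i]) != NULL);
	return ctx.lookups;
}
```

In this run, five extent buffers need two lookups rather than five.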

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 15 +++++++--------
 fs/btrfs/extent_io.h |  2 ++
 fs/btrfs/zoned.c     | 35 ++++++++++++++++++++---------------
 fs/btrfs/zoned.h     |  6 ++----
 4 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5905d2d42aab..4a629a7cf478 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1788,7 +1788,6 @@ static int submit_eb_page(struct page *page, struct btrfs_eb_write_context *ctx)
 {
 	struct writeback_control *wbc = ctx->wbc;
 	struct address_space *mapping = page->mapping;
-	struct btrfs_block_group *cache = NULL;
 	struct extent_buffer *eb;
 	int ret;
 
@@ -1826,7 +1825,7 @@ static int submit_eb_page(struct page *page, struct btrfs_eb_write_context *ctx)
 
 	ctx->eb = eb;
 
-	if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, &cache)) {
+	if (!btrfs_check_meta_write_pointer(eb->fs_info, ctx)) {
 		/*
 		 * If for_sync, this hole will be filled with
 		 * trasnsaction commit.
@@ -1840,18 +1839,15 @@ static int submit_eb_page(struct page *page, struct btrfs_eb_write_context *ctx)
 	}
 
 	if (!lock_extent_buffer_for_io(eb, wbc)) {
-		btrfs_revert_meta_write_pointer(cache, eb);
-		if (cache)
-			btrfs_put_block_group(cache);
+		btrfs_revert_meta_write_pointer(ctx->zoned_bg, eb);
 		free_extent_buffer(eb);
 		return 0;
 	}
-	if (cache) {
+	if (ctx->zoned_bg) {
 		/*
 		 * Implies write in zoned mode. Mark the last eb in a block group.
 		 */
-		btrfs_schedule_zone_finish_bg(cache, eb);
-		btrfs_put_block_group(cache);
+		btrfs_schedule_zone_finish_bg(ctx->zoned_bg, eb);
 	}
 	write_one_eb(eb, wbc);
 	free_extent_buffer(eb);
@@ -1964,6 +1960,9 @@ int btree_write_cache_pages(struct address_space *mapping,
 		ret = 0;
 	if (!ret && BTRFS_FS_ERROR(fs_info))
 		ret = -EROFS;
+
+	if (ctx.zoned_bg)
+		btrfs_put_block_group(ctx.zoned_bg);
 	btrfs_zoned_meta_io_unlock(fs_info);
 	return ret;
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e243a8eac910..68368ba99321 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -96,6 +96,8 @@ struct extent_buffer {
 struct btrfs_eb_write_context {
 	struct writeback_control *wbc;
 	struct extent_buffer *eb;
+	/* Block group @eb resides in. Only used for zoned mode. */
+	struct btrfs_block_group *zoned_bg;
 };
 
 /*
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 5e4285ae112c..3a763eb535b0 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1748,30 +1748,35 @@ void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
 }
 
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
-				    struct extent_buffer *eb,
-				    struct btrfs_block_group **cache_ret)
+				    struct btrfs_eb_write_context *ctx)
 {
-	struct btrfs_block_group *cache;
-	bool ret = true;
+	const struct extent_buffer *eb = ctx->eb;
+	struct btrfs_block_group *block_group = ctx->zoned_bg;
 
 	if (!btrfs_is_zoned(fs_info))
 		return true;
 
-	cache = btrfs_lookup_block_group(fs_info, eb->start);
-	if (!cache)
-		return true;
+	if (block_group) {
+		if (block_group->start > eb->start ||
+		    block_group->start + block_group->length <= eb->start) {
+			btrfs_put_block_group(block_group);
+			block_group = NULL;
+			ctx->zoned_bg = NULL;
+		}
+	}
 
-	if (cache->meta_write_pointer != eb->start) {
-		btrfs_put_block_group(cache);
-		cache = NULL;
-		ret = false;
-	} else {
-		cache->meta_write_pointer = eb->start + eb->len;
+	if (!block_group) {
+		block_group = btrfs_lookup_block_group(fs_info, eb->start);
+		if (!block_group)
+			return true;
+		ctx->zoned_bg = block_group;
 	}
 
-	*cache_ret = cache;
+	if (block_group->meta_write_pointer != eb->start)
+		return false;
+	block_group->meta_write_pointer = eb->start + eb->len;
 
-	return ret;
+	return true;
 }
 
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 27322b926038..49d5bd87245c 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -59,8 +59,7 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 bool btrfs_use_zone_append(struct btrfs_bio *bbio);
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
-				    struct extent_buffer *eb,
-				    struct btrfs_block_group **cache_ret);
+				    struct btrfs_eb_write_context *ctx);
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
@@ -190,8 +189,7 @@ static inline void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 }
 
 static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
-			       struct extent_buffer *eb,
-			       struct btrfs_block_group **cache_ret)
+						  struct btrfs_eb_write_context *ctx)
 {
 	return true;
 }
-- 
2.41.0



* [PATCH v3 03/10] btrfs: zoned: return int from btrfs_check_meta_write_pointer
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 01/10] btrfs: introduce struct to consolidate extent buffer write context Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 02/10] btrfs: zoned: introduce block group context to btrfs_eb_write_context Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 04/10] btrfs: zoned: defer advancing meta_write_pointer Naohiro Aota
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, josef, dsterba, Naohiro Aota, Christoph Hellwig,
	Johannes Thumshirn

Now that we have writeback_control passed to
btrfs_check_meta_write_pointer(), we can move the wbc condition in
submit_eb_page() to btrfs_check_meta_write_pointer() and return int.
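
The resulting return-value contract can be sketched in user space as
follows (not kernel code). The toy_* types stand in for
writeback_control and are assumptions of this sketch. The mapping of
-EBUSY to 0 in the caller follows the submit_eb_page() hunk below.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

struct toy_wbc {
	enum toy_sync_mode sync_mode;
	bool for_sync;
};

/*
 * 0:       the buffer is at the write pointer and can be written.
 * -EAGAIN: there is a hole and the caller has to handle it.
 * -EBUSY:  there is a hole, but the caller can just bail out.
 */
static int toy_check_write_pointer(uint64_t write_pointer, uint64_t eb_start,
				   const struct toy_wbc *wbc)
{
	assert(wbc);

	if (write_pointer == eb_start)
		return 0;
	/* If for_sync, the hole is filled by the transaction commit. */
	if (wbc->sync_mode == TOY_SYNC_ALL && !wbc->for_sync)
		return -EAGAIN;
	return -EBUSY;
}

/* Caller side: 1 means "submitted", 0 means "skipped", <0 is an error. */
static int toy_submit(uint64_t write_pointer, uint64_t eb_start,
		      const struct toy_wbc *wbc)
{
	int ret = toy_check_write_pointer(write_pointer, eb_start, wbc);

	if (ret) {
		if (ret == -EBUSY)
			ret = 0;
		return ret;
	}
	return 1;
}
```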

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 11 +++--------
 fs/btrfs/zoned.c     | 30 ++++++++++++++++++++++--------
 fs/btrfs/zoned.h     | 10 +++++-----
 3 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4a629a7cf478..f6c47e24956a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1825,14 +1825,9 @@ static int submit_eb_page(struct page *page, struct btrfs_eb_write_context *ctx)
 
 	ctx->eb = eb;
 
-	if (!btrfs_check_meta_write_pointer(eb->fs_info, ctx)) {
-		/*
-		 * If for_sync, this hole will be filled with
-		 * trasnsaction commit.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
-			ret = -EAGAIN;
-		else
+	ret = btrfs_check_meta_write_pointer(eb->fs_info, ctx);
+	if (ret) {
+		if (ret == -EBUSY)
 			ret = 0;
 		free_extent_buffer(eb);
 		return ret;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 3a763eb535b0..beaf082c16c0 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1747,14 +1747,23 @@ void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
 	}
 }
 
-bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
-				    struct btrfs_eb_write_context *ctx)
+/*
+ * Check @ctx->eb is aligned to the write pointer
+ *
+ * Return:
+ *   0: @ctx->eb is at the write pointer. You can write it.
+ *   -EAGAIN: There is a hole. The caller should handle the case.
+ *   -EBUSY: There is a hole, but the caller can just bail out.
+ */
+int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				   struct btrfs_eb_write_context *ctx)
 {
+	const struct writeback_control *wbc = ctx->wbc;
 	const struct extent_buffer *eb = ctx->eb;
 	struct btrfs_block_group *block_group = ctx->zoned_bg;
 
 	if (!btrfs_is_zoned(fs_info))
-		return true;
+		return 0;
 
 	if (block_group) {
 		if (block_group->start > eb->start ||
@@ -1768,15 +1777,20 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 	if (!block_group) {
 		block_group = btrfs_lookup_block_group(fs_info, eb->start);
 		if (!block_group)
-			return true;
+			return 0;
 		ctx->zoned_bg = block_group;
 	}
 
-	if (block_group->meta_write_pointer != eb->start)
-		return false;
-	block_group->meta_write_pointer = eb->start + eb->len;
+	if (block_group->meta_write_pointer == eb->start) {
+		block_group->meta_write_pointer = eb->start + eb->len;
 
-	return true;
+		return 0;
+	}
+
+	/* If for_sync, this hole will be filled with trasnsaction commit. */
+	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
+		return -EAGAIN;
+	return -EBUSY;
 }
 
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 49d5bd87245c..c0859d8be152 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -58,8 +58,8 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 bool btrfs_use_zone_append(struct btrfs_bio *bbio);
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
-bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
-				    struct btrfs_eb_write_context *ctx);
+int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				   struct btrfs_eb_write_context *ctx);
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
@@ -188,10 +188,10 @@ static inline void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 {
 }
 
-static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
-						  struct btrfs_eb_write_context *ctx)
+static inline int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+						 struct btrfs_eb_write_context *ctx)
 {
-	return true;
+	return 0;
 }
 
 static inline void btrfs_revert_meta_write_pointer(
-- 
2.41.0



* [PATCH v3 04/10] btrfs: zoned: defer advancing meta_write_pointer
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (2 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 03/10] btrfs: zoned: return int from btrfs_check_meta_write_pointer Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 05/10] btrfs: zoned: update meta_write_pointer on zone finish Naohiro Aota
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, josef, dsterba, Naohiro Aota, Christoph Hellwig,
	Johannes Thumshirn

We currently advance the meta_write_pointer in
btrfs_check_meta_write_pointer(). That makes it necessary to revert it
when locking the buffer fails. Instead, we can advance it just before
sending the buffer.

Also, this is necessary for the following commit. That commit needs to
release the zoned_meta_io_lock to allow IOs to come in and wait for them
to fill the currently active block group. If we advance the
meta_write_pointer before locking the extent buffer, the following extent
buffer can pass the meta_write_pointer check, resulting in an unaligned
write failure.

Advancing the pointer is still thread-safe as the extent buffer is locked.
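
A user-space toy model (not kernel code) of the new ordering: the check
leaves the pointer alone, and the pointer advances only after the buffer
is locked, so the failure path has nothing to revert. The toy_* names
and the "lockable" flag are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_bg {
	uint64_t meta_write_pointer;
};

struct toy_eb {
	uint64_t start;
	uint64_t len;
	bool lockable; /* stand-in for lock_extent_buffer_for_io() succeeding */
};

/* Returns 1 if the buffer is sent, 0 if it is skipped. */
static int toy_submit_eb(struct toy_bg *bg, const struct toy_eb *eb)
{
	assert(bg && eb);

	/* The check only compares; it no longer advances the pointer. */
	if (bg->meta_write_pointer != eb->start)
		return 0;

	/* Locking failed: the pointer is untouched, so no revert is needed. */
	if (!eb->lockable)
		return 0;

	/* Advance just before sending the buffer. */
	bg->meta_write_pointer += eb->len;
	return 1;
}
```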

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c |  8 ++++----
 fs/btrfs/zoned.c     | 15 +--------------
 fs/btrfs/zoned.h     |  8 --------
 3 files changed, 5 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f6c47e24956a..d1b0a0181aed 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1834,15 +1834,15 @@ static int submit_eb_page(struct page *page, struct btrfs_eb_write_context *ctx)
 	}
 
 	if (!lock_extent_buffer_for_io(eb, wbc)) {
-		btrfs_revert_meta_write_pointer(ctx->zoned_bg, eb);
 		free_extent_buffer(eb);
 		return 0;
 	}
 	if (ctx->zoned_bg) {
-		/*
-		 * Implies write in zoned mode. Mark the last eb in a block group.
-		 */
+		/* Implies write in zoned mode. */
+
+		/* Mark the last eb in the block group. */
 		btrfs_schedule_zone_finish_bg(ctx->zoned_bg, eb);
+		ctx->zoned_bg->meta_write_pointer += eb->len;
 	}
 	write_one_eb(eb, wbc);
 	free_extent_buffer(eb);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index beaf082c16c0..3f56604bdaef 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1781,11 +1781,8 @@ int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 		ctx->zoned_bg = block_group;
 	}
 
-	if (block_group->meta_write_pointer == eb->start) {
-		block_group->meta_write_pointer = eb->start + eb->len;
-
+	if (block_group->meta_write_pointer == eb->start)
 		return 0;
-	}
 
 	/* If for_sync, this hole will be filled with trasnsaction commit. */
 	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
@@ -1793,16 +1790,6 @@ int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 	return -EBUSY;
 }
 
-void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
-				     struct extent_buffer *eb)
-{
-	if (!btrfs_is_zoned(eb->fs_info) || !cache)
-		return;
-
-	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
-	cache->meta_write_pointer = eb->start;
-}
-
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length)
 {
 	if (!btrfs_dev_is_sequential(device, physical))
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index c0859d8be152..74ec37a25808 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -60,8 +60,6 @@ bool btrfs_use_zone_append(struct btrfs_bio *bbio);
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
 int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				   struct btrfs_eb_write_context *ctx);
-void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
-				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
 int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
 				  u64 physical_start, u64 physical_pos);
@@ -194,12 +192,6 @@ static inline int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-static inline void btrfs_revert_meta_write_pointer(
-						struct btrfs_block_group *cache,
-						struct extent_buffer *eb)
-{
-}
-
 static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
 					    u64 physical, u64 length)
 {
-- 
2.41.0



* [PATCH v3 05/10] btrfs: zoned: update meta_write_pointer on zone finish
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (3 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 04/10] btrfs: zoned: defer advancing meta_write_pointer Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 06/10] btrfs: zoned: reserve zones for an active metadata/system block group Naohiro Aota
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, josef, dsterba, Naohiro Aota, Christoph Hellwig,
	Johannes Thumshirn

On finishing a zone, the meta_write_pointer should be set to the end of the
zone to reflect the actual write pointer position.
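
A user-space toy model (not kernel code) of the update. The TOY_BG_*
flag values and the struct layout are assumptions of this sketch, not
the on-disk flag values.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BG_DATA     (1u << 0)
#define TOY_BG_SYSTEM   (1u << 1)
#define TOY_BG_METADATA (1u << 2)

struct toy_bg {
	uint64_t start;              /* logical start address */
	uint64_t zone_capacity;      /* usable bytes in the zone */
	uint64_t alloc_offset;       /* relative to start */
	uint64_t meta_write_pointer; /* absolute logical address */
	unsigned int flags;
};

/* Mark the block group as full, as zone finish does. */
static void toy_zone_finish(struct toy_bg *bg)
{
	assert(bg);

	bg->alloc_offset = bg->zone_capacity;
	/* Only metadata and system block groups track meta_write_pointer. */
	if (bg->flags & (TOY_BG_METADATA | TOY_BG_SYSTEM))
		bg->meta_write_pointer = bg->start + bg->zone_capacity;
}
```

Note that alloc_offset is relative to the block group start while
meta_write_pointer is an absolute address, hence the added start.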

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 3f56604bdaef..fd1458049b18 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2056,6 +2056,9 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 
 	clear_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags);
 	block_group->alloc_offset = block_group->zone_capacity;
+	if (block_group->flags & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM))
+		block_group->meta_write_pointer = block_group->start +
+			block_group->zone_capacity;
 	block_group->free_space_ctl->free_space = 0;
 	btrfs_clear_treelog_bg(block_group);
 	btrfs_clear_data_reloc_bg(block_group);
-- 
2.41.0



* [PATCH v3 06/10] btrfs: zoned: reserve zones for an active metadata/system block group
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (4 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 05/10] btrfs: zoned: update meta_write_pointer on zone finish Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 07/10] btrfs: zoned: activate metadata block group on write time Naohiro Aota
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, josef, dsterba, Naohiro Aota

Ensure a metadata and system block group can be activated at write time, by
leaving a certain number of active zones when trying to activate a data
block group.

Zones for two metadata block groups (normal and tree-log) and one system
block group are reserved, according to the profile type: two zones per
block group on the DUP profile and one zone per block group otherwise.

The reservation must be freed once a non-data block group is allocated. If
not, we over-reserve the active zones and data block group activation will
suffer. For the dynamic reservation count, we need to manage the
reservation count per device.

The reservation count variable is protected by
fs_info->zone_active_bgs_lock.
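
The arithmetic can be sketched in user space as follows (not kernel
code). The toy_* helpers are hypothetical; the counts and the
comparison follow btrfs_check_active_zone_reservation() and
btrfs_can_activate_zone() in the diff below.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Initial per-device reservation: two metadata block groups (normal and
 * tree-log) plus one system block group, doubled for each on DUP.
 */
static int toy_reserved_active_zones(bool metadata_dup, bool system_dup)
{
	int metadata_reserve = metadata_dup ? 4 : 2;
	int system_reserve = system_dup ? 2 : 1;

	return metadata_reserve + system_reserve;
}

/*
 * Can a new block group be activated on this device? Only data block
 * groups have to leave the reserved zones alone.
 */
static bool toy_can_activate(int active_zones_left, int reserved,
			     bool is_data, bool dup)
{
	int needed = dup ? 2 : 1;

	assert(active_zones_left >= 0 && reserved >= 0);

	if (!is_data)
		reserved = 0;
	return active_zones_left >= needed + reserved;
}
```

For example, with SINGLE metadata and system profiles three zones are
reserved, so a SINGLE data block group needs at least four active zones
left on the device, while a metadata block group needs only one.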

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c |  2 +
 fs/btrfs/zoned.c   | 97 +++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/zoned.h   |  9 +++++
 3 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index da51e5750443..471bc20e5397 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3456,6 +3456,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 
 	btrfs_free_zone_cache(fs_info);
 
+	btrfs_check_active_zone_reservation(fs_info);
+
 	if (!sb_rdonly(sb) && fs_info->fs_devices->missing_devices &&
 	    !btrfs_check_rw_degradable(fs_info, NULL)) {
 		btrfs_warn(fs_info,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index fd1458049b18..c9a1732469fd 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1889,6 +1889,7 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 	struct map_lookup *map;
 	struct btrfs_device *device;
 	u64 physical;
+	bool is_data = (block_group->flags & BTRFS_BLOCK_GROUP_DATA);
 	bool ret;
 	int i;
 
@@ -1910,19 +1911,40 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 		goto out_unlock;
 	}
 
+	spin_lock(&fs_info->zone_active_bgs_lock);
 	for (i = 0; i < map->num_stripes; i++) {
+		struct btrfs_zoned_device_info *zinfo;
+		int reserved = 0;
+
 		device = map->stripes[i].dev;
 		physical = map->stripes[i].physical;
+		zinfo = device->zone_info;
 
-		if (device->zone_info->max_active_zones == 0)
+		if (zinfo->max_active_zones == 0)
 			continue;
 
+		if (is_data)
+			reserved = zinfo->reserved_active_zones;
+		/*
+		 * For the data block group, leave active zones for one
+		 * metadata block group and one system block group.
+		 */
+		if (atomic_read(&zinfo->active_zones_left) <= reserved) {
+			ret = false;
+			spin_unlock(&fs_info->zone_active_bgs_lock);
+			goto out_unlock;
+		}
+
 		if (!btrfs_dev_set_active_zone(device, physical)) {
 			/* Cannot activate the zone */
 			ret = false;
+			spin_unlock(&fs_info->zone_active_bgs_lock);
 			goto out_unlock;
 		}
+		if (!is_data)
+			zinfo->reserved_active_zones--;
 	}
+	spin_unlock(&fs_info->zone_active_bgs_lock);
 
 	/* Successfully activated all the zones */
 	set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags);
@@ -2068,18 +2090,21 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 	for (i = 0; i < map->num_stripes; i++) {
 		struct btrfs_device *device = map->stripes[i].dev;
 		const u64 physical = map->stripes[i].physical;
+		struct btrfs_zoned_device_info *zinfo = device->zone_info;
 
-		if (device->zone_info->max_active_zones == 0)
+		if (zinfo->max_active_zones == 0)
 			continue;
 
 		ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_FINISH,
 				       physical >> SECTOR_SHIFT,
-				       device->zone_info->zone_size >> SECTOR_SHIFT,
+				       zinfo->zone_size >> SECTOR_SHIFT,
 				       GFP_NOFS);
 
 		if (ret)
 			return ret;
 
+		if (!(block_group->flags & BTRFS_BLOCK_GROUP_DATA))
+			zinfo->reserved_active_zones++;
 		btrfs_dev_clear_active_zone(device, physical);
 	}
 
@@ -2118,8 +2143,10 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
 
 	/* Check if there is a device with active zones left */
 	mutex_lock(&fs_info->chunk_mutex);
+	spin_lock(&fs_info->zone_active_bgs_lock);
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
 		struct btrfs_zoned_device_info *zinfo = device->zone_info;
+		int reserved = 0;
 
 		if (!device->bdev)
 			continue;
@@ -2129,17 +2156,21 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
 			break;
 		}
 
+		if (flags & BTRFS_BLOCK_GROUP_DATA)
+			reserved = zinfo->reserved_active_zones;
+
 		switch (flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
 		case 0: /* single */
-			ret = (atomic_read(&zinfo->active_zones_left) >= 1);
+			ret = (atomic_read(&zinfo->active_zones_left) >= (1 + reserved));
 			break;
 		case BTRFS_BLOCK_GROUP_DUP:
-			ret = (atomic_read(&zinfo->active_zones_left) >= 2);
+			ret = (atomic_read(&zinfo->active_zones_left) >= (2 + reserved));
 			break;
 		}
 		if (ret)
 			break;
 	}
+	spin_unlock(&fs_info->zone_active_bgs_lock);
 	mutex_unlock(&fs_info->chunk_mutex);
 
 	if (!ret)
@@ -2386,3 +2417,59 @@ int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 
 	return 0;
 }
+
+/*
+ * Check if we properly activated one metadata block group and one
+ * system block group.
+ */
+void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_block_group *block_group;
+	struct btrfs_device *device;
+	/* Reserve zones for normal SINGLE metadata and tree-log block group. */
+	unsigned int metadata_reserve = 2;
+	/* Reserve a zone for SINGLE system block group. */
+	unsigned int system_reserve = 1;
+
+	/*
+	 * This function is called from the mount context. So, there
+	 * is no parallel process touching the bits. No need for
+	 * read_seqretry().
+	 */
+	if (fs_info->avail_metadata_alloc_bits & BTRFS_BLOCK_GROUP_DUP)
+		metadata_reserve = 4;
+	if (fs_info->avail_system_alloc_bits & BTRFS_BLOCK_GROUP_DUP)
+		system_reserve = 2;
+
+	/* Apply the reservation on all the devices. */
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		if (!device->bdev)
+			continue;
+
+		device->zone_info->reserved_active_zones =
+			metadata_reserve + system_reserve;
+	}
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	/* Release reservation for currently active block groups. */
+	spin_lock(&fs_info->zone_active_bgs_lock);
+	list_for_each_entry(block_group, &fs_info->zone_active_bgs,
+			    active_bg_list) {
+		struct map_lookup *map = block_group->physical_map;
+		int i;
+
+		if (!(block_group->flags &
+		      (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)))
+			continue;
+
+		for (i = 0; i < map->num_stripes; i++) {
+			struct btrfs_zoned_device_info *zinfo =
+				map->stripes[i].dev->zone_info;
+
+			zinfo->reserved_active_zones--;
+		}
+	}
+	spin_unlock(&fs_info->zone_active_bgs_lock);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 74ec37a25808..03e140018f29 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -22,6 +22,12 @@ struct btrfs_zoned_device_info {
 	u8  zone_size_shift;
 	u32 nr_zones;
 	unsigned int max_active_zones;
+	/*
+	 * Reserved active zones for one metadata and one system block
+	 * group. It can vary per-device depending on the allocation
+	 * status.
+	 */
+	int reserved_active_zones;
 	atomic_t active_zones_left;
 	unsigned long *seq_zones;
 	unsigned long *empty_zones;
@@ -78,6 +84,7 @@ void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logica
 int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
 int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 				struct btrfs_space_info *space_info, bool do_finish);
+void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -252,6 +259,8 @@ static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+static inline void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.41.0


* [PATCH v3 07/10] btrfs: zoned: activate metadata block group on write time
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (5 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 06/10] btrfs: zoned: reserve zones for an active metadata/system block group Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-09 16:42   ` David Sterba
  2023-08-07 16:12 ` [PATCH v3 08/10] btrfs: zoned: no longer count fresh BG region as zone unusable Naohiro Aota
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, josef, dsterba, Naohiro Aota

In the current implementation, block groups are activated at reservation
time to ensure that all reserved bytes can be written to an active metadata
block group. However, this approach has proven to be less efficient, as it
activates block groups more frequently than necessary, putting pressure on
the active zone resource and leading to potential issues such as early
ENOSPC or hung_task.

Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block group
allocations, resulting in decreased overall performance.

To address these issues, this commit introduces write-time activation of
metadata and system block groups. This involves reserving active zones for
at least one metadata block group and one system block group.

Since metadata write-out is always allocated sequentially, when we need to
write to a non-active block group, we can wait for the ongoing IOs to
complete, activate a new block group, and then proceed with writing to the
new block group.

Fixes: b09315139136 ("btrfs: zoned: activate metadata block group on flush_space")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
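For readers following the thread, here is a minimal user-space sketch of
the pivot decision described in the commit message above. It is not the
kernel code: struct sketch_bg, sketch_pivot() and the can_wait flag are
hypothetical names, the tree-log path and activation failure are left out,
and "waiting for IO" is modelled by simply advancing the write pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_block_group. */
struct sketch_bg {
	uint64_t start;         /* logical start of the block group */
	uint64_t alloc_offset;  /* bytes allocated so far */
	uint64_t write_pointer; /* absolute position of the next write */
	bool active;            /* models BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE */
};

enum sketch_result {
	SKETCH_WRITE_NOW, /* target is (now) active, the write may go ahead */
	SKETCH_TRY_LATER, /* unsent IO remains and the caller must not wait */
};

/*
 * Decide whether a write to @target may proceed.
 * @active points at the slot holding the currently active block group.
 * @can_wait models a for_sync writeback that is allowed to wait.
 */
enum sketch_result sketch_pivot(struct sketch_bg **active,
				struct sketch_bg *target, bool can_wait)
{
	struct sketch_bg *cur = *active;

	if (target->active)
		return SKETCH_WRITE_NOW;

	if (cur) {
		/* Allocated but not yet written area left in the old group? */
		if (cur->write_pointer < cur->start + cur->alloc_offset &&
		    !can_wait)
			return SKETCH_TRY_LATER;

		/* The kernel waits for writeback and finishes the zone here. */
		cur->write_pointer = cur->start + cur->alloc_offset;
		cur->active = false;
		*active = NULL;
	}

	/* Activate the new block group and make it the current one. */
	target->active = true;
	*active = target;
	return SKETCH_WRITE_NOW;
}
```

The point of the sketch is that only one slot is tracked per stream, so at
most one block group per slot needs to stay active at any time.
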
 fs/btrfs/block-group.c | 11 ++++++
 fs/btrfs/fs.h          |  3 ++
 fs/btrfs/zoned.c       | 83 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index a127865f49f9..b0e432c30e1d 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -4287,6 +4287,17 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 	struct btrfs_caching_control *caching_ctl;
 	struct rb_node *n;
 
+	if (btrfs_is_zoned(info)) {
+		if (info->active_meta_bg) {
+			btrfs_put_block_group(info->active_meta_bg);
+			info->active_meta_bg = NULL;
+		}
+		if (info->active_system_bg) {
+			btrfs_put_block_group(info->active_system_bg);
+			info->active_system_bg = NULL;
+		}
+	}
+
 	write_lock(&info->block_group_cache_lock);
 	while (!list_empty(&info->caching_block_groups)) {
 		caching_ctl = list_entry(info->caching_block_groups.next,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index ef07c6c252d8..a523d64d5491 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -770,6 +770,9 @@ struct btrfs_fs_info {
 	u64 data_reloc_bg;
 	struct mutex zoned_data_reloc_io_lock;
 
+	struct btrfs_block_group *active_meta_bg;
+	struct btrfs_block_group *active_system_bg;
+
 	u64 nr_global_roots;
 
 	spinlock_t zone_active_bgs_lock;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c9a1732469fd..4fa1590f71ac 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -65,6 +65,9 @@
 
 #define SUPER_INFO_SECTORS	((u64)BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT)
 
+static void wait_eb_writebacks(struct btrfs_block_group *block_group);
+static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_written);
+
 static inline bool sb_zone_is_full(const struct blk_zone *zone)
 {
 	return (zone->cond == BLK_ZONE_COND_FULL) ||
@@ -1747,6 +1750,64 @@ void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
 	}
 }
 
+static bool check_bg_is_active(struct btrfs_eb_write_context *ctx,
+			       struct btrfs_block_group **active_bg)
+{
+	const struct writeback_control *wbc = ctx->wbc;
+	struct btrfs_block_group *block_group = ctx->zoned_bg;
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
+
+	if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags))
+		return true;
+
+	if (fs_info->treelog_bg == block_group->start) {
+		if (!btrfs_zone_activate(block_group)) {
+			int ret_fin = btrfs_zone_finish_one_bg(fs_info);
+
+			if (ret_fin != 1 || !btrfs_zone_activate(block_group))
+				return false;
+		}
+	} else if (*active_bg != block_group) {
+		struct btrfs_block_group *tgt = *active_bg;
+
+		/*
+		 * zoned_meta_io_lock protects fs_info->active_{meta,system}_bg.
+		 */
+		lockdep_assert_held(&fs_info->zoned_meta_io_lock);
+
+		if (tgt) {
+			/*
+			 * If there is an unsent IO left in the allocated area,
+			 * we cannot wait for them as it may cause a deadlock.
+			 */
+			if (tgt->meta_write_pointer < tgt->start + tgt->alloc_offset) {
+				if (wbc->sync_mode == WB_SYNC_NONE ||
+				    (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync))
+					return false;
+			}
+
+			/* Pivot active metadata/system block group. */
+			btrfs_zoned_meta_io_unlock(fs_info);
+			wait_eb_writebacks(tgt);
+			do_zone_finish(tgt, true);
+			btrfs_zoned_meta_io_lock(fs_info);
+			if (*active_bg == tgt) {
+				btrfs_put_block_group(tgt);
+				*active_bg = NULL;
+			}
+		}
+		if (!btrfs_zone_activate(block_group))
+			return false;
+		if (*active_bg != block_group) {
+			ASSERT(*active_bg == NULL);
+			*active_bg = block_group;
+			btrfs_get_block_group(block_group);
+		}
+	}
+
+	return true;
+}
+
 /*
  * Check @ctx->eb is aligned to the write pointer
  *
@@ -1781,8 +1842,26 @@ int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 		ctx->zoned_bg = block_group;
 	}
 
-	if (block_group->meta_write_pointer == eb->start)
-		return 0;
+	if (block_group->meta_write_pointer == eb->start) {
+		struct btrfs_block_group **tgt;
+
+		if (!test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags))
+			return 0;
+
+		if (block_group->flags & BTRFS_BLOCK_GROUP_SYSTEM)
+			tgt = &fs_info->active_system_bg;
+		else
+			tgt = &fs_info->active_meta_bg;
+		if (check_bg_is_active(ctx, tgt))
+			return 0;
+	}
+
+	/*
+	 * Since we may release fs_info->zoned_meta_io_lock, someone can already
+	 * start writing this eb. In that case, we can just bail out.
+	 */
+	if (block_group->meta_write_pointer > eb->start)
+		return -EBUSY;
 
 	/* If for_sync, this hole will be filled with trasnsaction commit. */
 	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
-- 
2.41.0


* [PATCH v3 08/10] btrfs: zoned: no longer count fresh BG region as zone unusable
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (6 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 07/10] btrfs: zoned: activate metadata block group on write time Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-09 16:49   ` David Sterba
  2023-08-07 16:12 ` [PATCH v3 09/10] btrfs: zoned: don't activate non-DATA BG on allocation Naohiro Aota
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, josef, dsterba, Naohiro Aota

Now that we have switched to write-time activation, we no longer need to
(and must not) count the fresh region as zone unusable. This commit is
similar to a revert of commit fc22cf8eba79 ("btrfs: zoned: count fresh BG
region as zone unusable").

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
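As a worked example of the accounting that remains once the special case
is dropped, the following sketch mirrors the two formulas kept in
btrfs_calc_zone_unusable() in the diff below. The struct and function
names are hypothetical and the sizes in the usage note are made-up
numbers, chosen only to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result pair; the kernel stores these in the block group. */
struct sketch_space {
	uint64_t unusable; /* bytes that cannot be allocated any more */
	uint64_t free;     /* bytes still available for allocation */
};

/*
 * unusable = (alloc_offset - used) + (length - zone_capacity)
 * free     = zone_capacity - alloc_offset
 */
struct sketch_space sketch_calc_zone_unusable(uint64_t length,
					      uint64_t zone_capacity,
					      uint64_t alloc_offset,
					      uint64_t used)
{
	struct sketch_space s;

	/* Sanity checks on the assumed invariants of the inputs. */
	assert(zone_capacity <= length);
	assert(used <= alloc_offset);
	assert(alloc_offset <= zone_capacity);

	s.unusable = (alloc_offset - used) + (length - zone_capacity);
	s.free = zone_capacity - alloc_offset;
	return s;
}
```

With length 256, zone_capacity 192 and nothing allocated, a fresh block
group now reports 64 unusable and 192 free, regardless of whether its zone
is active yet.
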
 fs/btrfs/free-space-cache.c |  8 +-------
 fs/btrfs/zoned.c            | 26 +++-----------------------
 2 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cd5bfda2c259..27fad70451aa 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2704,13 +2704,8 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
 		bg_reclaim_threshold = READ_ONCE(sinfo->bg_reclaim_threshold);
 
 	spin_lock(&ctl->tree_lock);
-	/* Count initial region as zone_unusable until it gets activated. */
 	if (!used)
 		to_free = size;
-	else if (initial &&
-		 test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &block_group->fs_info->flags) &&
-		 (block_group->flags & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)))
-		to_free = 0;
 	else if (initial)
 		to_free = block_group->zone_capacity;
 	else if (offset >= block_group->alloc_offset)
@@ -2738,8 +2733,7 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
 	reclaimable_unusable = block_group->zone_unusable -
 			       (block_group->length - block_group->zone_capacity);
 	/* All the region is now unusable. Mark it as unused and reclaim */
-	if (block_group->zone_unusable == block_group->length &&
-	    block_group->alloc_offset) {
+	if (block_group->zone_unusable == block_group->length) {
 		btrfs_mark_bg_unused(block_group);
 	} else if (bg_reclaim_threshold &&
 		   reclaimable_unusable >=
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 4fa1590f71ac..957fc76079bd 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1586,19 +1586,9 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
 		return;
 
 	WARN_ON(cache->bytes_super != 0);
-
-	/* Check for block groups never get activated */
-	if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &cache->fs_info->flags) &&
-	    cache->flags & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM) &&
-	    !test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags) &&
-	    cache->alloc_offset == 0) {
-		unusable = cache->length;
-		free = 0;
-	} else {
-		unusable = (cache->alloc_offset - cache->used) +
-			   (cache->length - cache->zone_capacity);
-		free = cache->zone_capacity - cache->alloc_offset;
-	}
+	unusable = (cache->alloc_offset - cache->used) +
+		   (cache->length - cache->zone_capacity);
+	free = cache->zone_capacity - cache->alloc_offset;
 
 	/* We only need ->free_space in ALLOC_SEQ block groups */
 	cache->cached = BTRFS_CACHE_FINISHED;
@@ -1964,7 +1954,6 @@ int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
 bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
-	struct btrfs_space_info *space_info = block_group->space_info;
 	struct map_lookup *map;
 	struct btrfs_device *device;
 	u64 physical;
@@ -1977,7 +1966,6 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 	map = block_group->physical_map;
 
-	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 	if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags)) {
 		ret = true;
@@ -2027,14 +2015,7 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 	/* Successfully activated all the zones */
 	set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags);
-	WARN_ON(block_group->alloc_offset != 0);
-	if (block_group->zone_unusable == block_group->length) {
-		block_group->zone_unusable = block_group->length - block_group->zone_capacity;
-		space_info->bytes_zone_unusable -= block_group->zone_capacity;
-	}
 	spin_unlock(&block_group->lock);
-	btrfs_try_granting_tickets(fs_info, space_info);
-	spin_unlock(&space_info->lock);
 
 	/* For the active block group list */
 	btrfs_get_block_group(block_group);
@@ -2047,7 +2028,6 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 out_unlock:
 	spin_unlock(&block_group->lock);
-	spin_unlock(&space_info->lock);
 	return ret;
 }
 
-- 
2.41.0


* [PATCH v3 09/10] btrfs: zoned: don't activate non-DATA BG on allocation
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (7 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 08/10] btrfs: zoned: no longer count fresh BG region as zone unusable Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-07 16:12 ` [PATCH v3 10/10] btrfs: zoned: re-enable metadata over-commit for zoned mode Naohiro Aota
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, josef, dsterba, Naohiro Aota, Johannes Thumshirn

Now that a non-DATA block group is activated at write time, don't activate
it at allocation time.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
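A rough sketch of how the allocation-time check looks after this patch,
combined with the reservation test from patch 06. It models only the first
two checks of can_allocate_chunk_zoned() and the per-device comparison in
btrfs_can_activate_zone(); the function name and its flat parameter list
are hypothetical simplifications, not the kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true when the modelled checks let a chunk allocation proceed.
 * @is_data:               allocation is for a DATA block group
 * @is_dup:                DUP profile (needs two zones), else single
 * @active_zones_left:     active zones still available on the device
 * @reserved_active_zones: zones held back for metadata/system
 */
bool sketch_chunk_alloc_ok(bool is_data, bool is_dup, int active_zones_left,
			   int reserved_active_zones)
{
	int needed = is_dup ? 2 : 1;

	/* Activeness is not a requirement for non-DATA block groups. */
	if (!is_data)
		return true;

	/* DATA must leave the reserved zones untouched. */
	return active_zones_left >= needed + reserved_active_zones;
}
```

For example, with 3 zones left and 3 reserved, a single-profile DATA
allocation fails this check, while a metadata allocation passes because
its zone is activated later, at write time.
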
 fs/btrfs/block-group.c |  2 +-
 fs/btrfs/extent-tree.c |  8 +++++++-
 fs/btrfs/space-info.c  | 28 ----------------------------
 3 files changed, 8 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index b0e432c30e1d..0cb1dee965a0 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -4089,7 +4089,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 
 	if (IS_ERR(ret_bg)) {
 		ret = PTR_ERR(ret_bg);
-	} else if (from_extent_allocation) {
+	} else if (from_extent_allocation && (flags & BTRFS_BLOCK_GROUP_DATA)) {
 		/*
 		 * New block group is likely to be used soon. Try to activate
 		 * it now. Failure is OK for now.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 12bd8dc37385..92eccb0cd487 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3690,7 +3690,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	}
 	spin_unlock(&block_group->lock);
 
-	if (!ret && !btrfs_zone_activate(block_group)) {
+	/* Metadata block group is activated on write time. */
+	if (!ret && (block_group->flags & BTRFS_BLOCK_GROUP_DATA) &&
+	    !btrfs_zone_activate(block_group)) {
 		ret = 1;
 		/*
 		 * May need to clear fs_info->{treelog,data_reloc}_bg.
@@ -3870,6 +3872,10 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
 static int can_allocate_chunk_zoned(struct btrfs_fs_info *fs_info,
 				    struct find_free_extent_ctl *ffe_ctl)
 {
+	/* Block group's activeness is not a requirement for METADATA block groups. */
+	if (!(ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA))
+		return 0;
+
 	/* If we can activate new zone, just allocate a chunk and use it */
 	if (btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
 		return 0;
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 17c86db7b1b1..356638f54fef 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -761,18 +761,6 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		break;
 	case ALLOC_CHUNK:
 	case ALLOC_CHUNK_FORCE:
-		/*
-		 * For metadata space on zoned filesystem, reaching here means we
-		 * don't have enough space left in active_total_bytes. Try to
-		 * activate a block group first, because we may have inactive
-		 * block group already allocated.
-		 */
-		ret = btrfs_zoned_activate_one_bg(fs_info, space_info, false);
-		if (ret < 0)
-			break;
-		else if (ret == 1)
-			break;
-
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
@@ -784,22 +772,6 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 					CHUNK_ALLOC_FORCE);
 		btrfs_end_transaction(trans);
 
-		/*
-		 * For metadata space on zoned filesystem, allocating a new chunk
-		 * is not enough. We still need to activate the block * group.
-		 * Active the newly allocated block group by (maybe) finishing
-		 * a block group.
-		 */
-		if (ret == 1) {
-			ret = btrfs_zoned_activate_one_bg(fs_info, space_info, true);
-			/*
-			 * Revert to the original ret regardless we could finish
-			 * one block group or not.
-			 */
-			if (ret >= 0)
-				ret = 1;
-		}
-
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
 		break;
-- 
2.41.0


* [PATCH v3 10/10] btrfs: zoned: re-enable metadata over-commit for zoned mode
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (8 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 09/10] btrfs: zoned: don't activate non-DATA BG on allocation Naohiro Aota
@ 2023-08-07 16:12 ` Naohiro Aota
  2023-08-09 18:02 ` [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group David Sterba
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-07 16:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, josef, dsterba, Naohiro Aota, Johannes Thumshirn

Now that the activation has moved from reservation time to write time, we
no longer need to ensure that all the reserved bytes are properly
activated, so we can re-enable metadata over-commit.

Without metadata over-commit, performance suffers because delalloc items
must be flushed more often and more block groups must be allocated.
Re-enabling metadata over-commit solves the issue.

Fixes: 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
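To illustrate the effect, here is a small sketch of the final comparison
in btrfs_can_overcommit() as it appears in the diff below. The function
name and the plain integer arguments are hypothetical; in the kernel,
avail comes from calc_available_free_space(), and before this patch it was
forced to 0 for zoned metadata.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Over-commit is allowed while the space already used plus the new
 * request stays below the allocated total plus the estimated free space.
 */
bool sketch_can_overcommit(uint64_t used, uint64_t bytes,
			   uint64_t total_bytes, uint64_t avail)
{
	return used + bytes < total_bytes + avail;
}
```

With used = 90, bytes = 20 and total_bytes = 100, the request is rejected
when avail is 0 (the old zoned behaviour) and accepted when avail is 50.
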
 fs/btrfs/space-info.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 356638f54fef..d7e8cd4f140c 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -389,11 +389,7 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 		return 0;
 
 	used = btrfs_space_info_used(space_info, true);
-	if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags) &&
-	    (space_info->flags & BTRFS_BLOCK_GROUP_METADATA))
-		avail = 0;
-	else
-		avail = calc_available_free_space(fs_info, space_info, flush);
+	avail = calc_available_free_space(fs_info, space_info, flush);
 
 	if (used + bytes < space_info->total_bytes + avail)
 		return 1;
-- 
2.41.0


* Re: [PATCH v3 07/10] btrfs: zoned: activate metadata block group on write time
  2023-08-07 16:12 ` [PATCH v3 07/10] btrfs: zoned: activate metadata block group on write time Naohiro Aota
@ 2023-08-09 16:42   ` David Sterba
  0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2023-08-09 16:42 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, hch, josef, dsterba

On Tue, Aug 08, 2023 at 01:12:37AM +0900, Naohiro Aota wrote:
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -65,6 +65,9 @@
>  
>  #define SUPER_INFO_SECTORS	((u64)BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT)
>  
> +static void wait_eb_writebacks(struct btrfs_block_group *block_group);
> +static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_written);

Looks like the forward declarations can't be avoided due to
btrfs_check_meta_write_pointer being defined earlier in the file.

* Re: [PATCH v3 08/10] btrfs: zoned: no longer count fresh BG region as zone unusable
  2023-08-07 16:12 ` [PATCH v3 08/10] btrfs: zoned: no longer count fresh BG region as zone unusable Naohiro Aota
@ 2023-08-09 16:49   ` David Sterba
  0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2023-08-09 16:49 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, hch, josef, dsterba

On Tue, Aug 08, 2023 at 01:12:38AM +0900, Naohiro Aota wrote:
> Now that we switched to write time activation, we no longer need to (and
> must not) count the fresh region as zone unusable. This commit is similar
> to revert commit fc22cf8eba79 ("btrfs: zoned: count fresh BG region as zone
> unusable").

The commit id should be fa2068d7e922b434eba guessing by the subject, I
don't have fc22cf8eba79.

* Re: [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (9 preceding siblings ...)
  2023-08-07 16:12 ` [PATCH v3 10/10] btrfs: zoned: re-enable metadata over-commit for zoned mode Naohiro Aota
@ 2023-08-09 18:02 ` David Sterba
  2023-08-10 12:59 ` Josef Bacik
  2023-08-10 13:34 ` Josef Bacik
  12 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2023-08-09 18:02 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, hch, josef, dsterba

On Tue, Aug 08, 2023 at 01:12:30AM +0900, Naohiro Aota wrote:
> In the current implementation, block groups are activated at
> reservation time to ensure that all reserved bytes can be written to
> an active metadata block group. However, this approach has proven to
> be less efficient, as it activates block groups more frequently than
> necessary, putting pressure on the active zone resource and leading to
> potential issues such as early ENOSPC or hung_task.
> 
> Another drawback of the current method is that it hampers metadata
> over-commit, and necessitates additional flush operations and block
> group allocations, resulting in decreased overall performance.
> 
> Actually, we don't need so many active metadata block groups because
> there is only one sequential metadata write stream.
> 
> So, this series introduces a write-time activation of metadata and
> system block group. This involves reserving at least one active block
> group specifically for a metadata and system block group. When the
> write goes into a new block group, it should have allocated all the
> regions in the current active block group. So, we can wait for IOs to
> fill the space, and then switch to a new block group.
> 
> Switching to the write-time activation solves the above issue and will
> lead to better performance.
> 
> * Performance
> 
> There is a significant difference with a workload (buffered write without
> sync) because we re-enable metadata over-commit.
> 
> before the patch:  741.00 MB/sec
> after the patch:  1430.27 MB/sec (+ 93%)
> 
> * Organization
> 
> Patches 1-5 are preparation patches involves meta_write_pointer check.
> 
> Patches 6 and 7 are the main part of this series, implementing the
> write-time activation.
> 
> Patches 8-10 addresses code for reserve time activation: counting fresh
> block group as zone_unusable, activating a block group on allocation,
> and disabling metadata over-commit.
> 
> * Changes
> 
> - v3
>   - Rework the reservation patch to fix the over-reservation problem
>     https://lore.kernel.org/all/xpb5wdmxx5wops26ihulo73oluc64dt4zpxqc7cirp2wvxl3qy@hv7lsvma5hxf/
>   - Rename btrfs_eb_write_context's block_group to zoned_bg.

Added to misc-next, thanks. We need it in order to enable zoned tests in
the CI so this goes in now, any fixups or more review tags will be done
in the commits.

* Re: [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (10 preceding siblings ...)
  2023-08-09 18:02 ` [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group David Sterba
@ 2023-08-10 12:59 ` Josef Bacik
  2023-08-10 14:13   ` Naohiro Aota
  2023-08-10 13:34 ` Josef Bacik
  12 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2023-08-10 12:59 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, hch, dsterba

On Tue, Aug 08, 2023 at 01:12:30AM +0900, Naohiro Aota wrote:
> In the current implementation, block groups are activated at
> reservation time to ensure that all reserved bytes can be written to
> an active metadata block group. However, this approach has proven to
> be less efficient, as it activates block groups more frequently than
> necessary, putting pressure on the active zone resource and leading to
> potential issues such as early ENOSPC or hung_task.
> 
> Another drawback of the current method is that it hampers metadata
> over-commit, and necessitates additional flush operations and block
> group allocations, resulting in decreased overall performance.
> 
> Actually, we don't need so many active metadata block groups because
> there is only one sequential metadata write stream.
> 
> So, this series introduces a write-time activation of metadata and
> system block group. This involves reserving at least one active block
> group specifically for a metadata and system block group. When the
> write goes into a new block group, it should have allocated all the
> regions in the current active block group. So, we can wait for IOs to
> fill the space, and then switch to a new block group.
> 
> Switching to the write-time activation solves the above issue and will
> lead to better performance.
> 
> * Performance
> 
> There is a significant difference with a workload (buffered write without
> sync) because we re-enable metadata over-commit.
> 
> before the patch:  741.00 MB/sec
> after the patch:  1430.27 MB/sec (+ 93%)
> 
> * Organization
> 
> Patches 1-5 are preparation patches involves meta_write_pointer check.
> 
> Patches 6 and 7 are the main part of this series, implementing the
> write-time activation.
> 
> Patches 8-10 addresses code for reserve time activation: counting fresh
> block group as zone_unusable, activating a block group on allocation,
> and disabling metadata over-commit.
> 

Hey Naohiro,

This enabled me to turn on the zoned vm for the GitHub CI, we're only failing 7
tests now, so great job!

However all the !zoned vms panic immediately

https://paste.centos.org/view/54d11384

Can you fix that up?  Also you can submit a PR against the 'ci' branch of our
linux repo in the btrfs GitHub project to run through the CI yourself to make
sure you didn't mess anything up.  Thanks,

Josef

* Re: [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
  2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
                   ` (11 preceding siblings ...)
  2023-08-10 12:59 ` Josef Bacik
@ 2023-08-10 13:34 ` Josef Bacik
  2023-08-10 14:34   ` Naohiro Aota
  12 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2023-08-10 13:34 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, hch, dsterba

On Tue, Aug 08, 2023 at 01:12:30AM +0900, Naohiro Aota wrote:
> In the current implementation, block groups are activated at
> reservation time to ensure that all reserved bytes can be written to
> an active metadata block group. However, this approach has proven to
> be less efficient, as it activates block groups more frequently than
> necessary, putting pressure on the active zone resource and leading to
> potential issues such as early ENOSPC or hung_task.
> 
> Another drawback of the current method is that it hampers metadata
> over-commit, and necessitates additional flush operations and block
> group allocations, resulting in decreased overall performance.
> 
> Actually, we don't need so many active metadata block groups because
> there is only one sequential metadata write stream.
> 
> So, this series introduces a write-time activation of metadata and
> system block group. This involves reserving at least one active block
> group specifically for a metadata and system block group. When the
> write goes into a new block group, it should have allocated all the
> regions in the current active block group. So, we can wait for IOs to
> fill the space, and then switch to a new block group.
> 
> Switching to the write-time activation solves the above issue and will
> lead to better performance.
> 
> * Performance
> 
> There is a significant difference with a workload (buffered write without
> sync) because we re-enable metadata over-commit.
> 
> before the patch:  741.00 MB/sec
> after the patch:  1430.27 MB/sec (+ 93%)
> 
> * Organization
> 
> Patches 1-5 are preparation patches involves meta_write_pointer check.
> 
> Patches 6 and 7 are the main part of this series, implementing the
> write-time activation.
> 
> Patches 8-10 addresses code for reserve time activation: counting fresh
> block group as zone_unusable, activating a block group on allocation,
> and disabling metadata over-commit.
> 
> * Changes

Additionally you had these failures in the CI setup

btrfs/220 btrfs/237 btrfs/239 btrfs/273 btrfs/295 generic/551 generic/574

I've excluded them so we can catch regressions, but everything except btrfs/220
seems like legitimate failures.  btrfs/220 needs to be updated since zoned
doesn't do discard=async, but you can do that whenever, I'm less worried about
that.  The rest should be investigated at some point, though not as a
prerequisite for merging this series.  Thanks,

Josef

* Re: [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
  2023-08-10 12:59 ` Josef Bacik
@ 2023-08-10 14:13   ` Naohiro Aota
  0 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2023-08-10 14:13 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs@vger.kernel.org, hch@infradead.org, dsterba@suse.cz

On Thu, Aug 10, 2023 at 08:59:37AM -0400, Josef Bacik wrote:
> On Tue, Aug 08, 2023 at 01:12:30AM +0900, Naohiro Aota wrote:
> > In the current implementation, block groups are activated at
> > reservation time to ensure that all reserved bytes can be written to
> > an active metadata block group. However, this approach has proven to
> > be less efficient, as it activates block groups more frequently than
> > necessary, putting pressure on the active zone resource and leading to
> > potential issues such as early ENOSPC or hung_task.
> > 
> > Another drawback of the current method is that it hampers metadata
> > over-commit, and necessitates additional flush operations and block
> > group allocations, resulting in decreased overall performance.
> > 
> > Actually, we don't need so many active metadata block groups because
> > there is only one sequential metadata write stream.
> > 
> > So, this series introduces a write-time activation of metadata and
> > system block group. This involves reserving at least one active block
> > group specifically for a metadata and system block group. When the
> > write goes into a new block group, it should have allocated all the
> > regions in the current active block group. So, we can wait for IOs to
> > fill the space, and then switch to a new block group.
> > 
> > Switching to the write-time activation solves the above issue and will
> > lead to better performance.
> > 
> > * Performance
> > 
> > There is a significant difference with a workload (buffered write without
> > sync) because we re-enable metadata over-commit.
> > 
> > before the patch:  741.00 MB/sec
> > after the patch:  1430.27 MB/sec (+ 93%)
> > 
> > * Organization
> > 
> > Patches 1-5 are preparation patches involving the meta_write_pointer check.
> > 
> > Patches 6 and 7 are the main part of this series, implementing the
> > write-time activation.
> > 
> > Patches 8-10 address code for reserve-time activation: counting fresh
> > block group as zone_unusable, activating a block group on allocation,
> > and disabling metadata over-commit.
> > 
> 
> Hey Naohiro,
> 
> This enabled me to turn on the zoned vm for the GitHub CI, we're only failing 7
> tests now, so great job!

Thanks! The GitHub CI setup is really interesting. I tried to figure out
how it sets up the zoned devices. Are they QEMU-emulated ZNS devices?

> However all the !zoned vms panic immediately
> 
> https://paste.centos.org/view/54d11384
> 
> Can you fix that up?  Also you can submit a PR against the 'ci' branch of our
> linux repo in the btrfs GitHub project to run through the CI yourself to make
> sure you didn't mess anything up.  Thanks,

I sent a candidate fix as a PR. I hope it works well.

> 
> Josef

* Re: [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
  2023-08-10 13:34 ` Josef Bacik
@ 2023-08-10 14:34   ` Naohiro Aota
  2023-08-10 14:36     ` David Sterba
  0 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2023-08-10 14:34 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs@vger.kernel.org, hch@infradead.org, dsterba@suse.cz

On Thu, Aug 10, 2023 at 09:34:58AM -0400, Josef Bacik wrote:
> On Tue, Aug 08, 2023 at 01:12:30AM +0900, Naohiro Aota wrote:
> > In the current implementation, block groups are activated at
> > reservation time to ensure that all reserved bytes can be written to
> > an active metadata block group. However, this approach has proven to
> > be less efficient, as it activates block groups more frequently than
> > necessary, putting pressure on the active zone resource and leading to
> > potential issues such as early ENOSPC or hung_task.
> > 
> > Another drawback of the current method is that it hampers metadata
> > over-commit, and necessitates additional flush operations and block
> > group allocations, resulting in decreased overall performance.
> > 
> > Actually, we don't need so many active metadata block groups because
> > there is only one sequential metadata write stream.
> > 
> > So, this series introduces a write-time activation of metadata and
> > system block group. This involves reserving at least one active block
> > group specifically for a metadata and system block group. When the
> > write goes into a new block group, it should have allocated all the
> > regions in the current active block group. So, we can wait for IOs to
> > fill the space, and then switch to a new block group.
> > 
> > Switching to the write-time activation solves the above issue and will
> > lead to better performance.
> > 
> > * Performance
> > 
> > There is a significant difference with a workload (buffered write without
> > sync) because we re-enable metadata over-commit.
> > 
> > before the patch:  741.00 MB/sec
> > after the patch:  1430.27 MB/sec (+ 93%)
> > 
> > * Organization
> > 
> > Patches 1-5 are preparation patches involving the meta_write_pointer check.
> > 
> > Patches 6 and 7 are the main part of this series, implementing the
> > write-time activation.
> > 
> > Patches 8-10 address code for reserve-time activation: counting fresh
> > block group as zone_unusable, activating a block group on allocation,
> > and disabling metadata over-commit.
> > 
> > * Changes
> 
> Additionally, you had these failures in the CI setup:
> 
> btrfs/220 btrfs/237 btrfs/239 btrfs/273 btrfs/295 generic/551 generic/574
> 
> I've excluded them so we can catch regressions, but all of them except btrfs/220
> seem like legitimate failures.  btrfs/220 needs to be updated since zoned
> doesn't do discard=async, but you can do that whenever, I'm less worried about
> that.  The rest should be investigated at some point, though not as a
> prerequisite for merging this series.  Thanks,

I checked the CI log. Yes, btrfs/220 fails because of discard=async.

* known to fail
- btrfs/237: we need to tweak the test for ZNS (zone capacity != zone size)
- btrfs/239: somehow, tree-log behaves differently in zoned mode... I
  have no idea why it fails. But I think it is still a valid status...

* need to modify test?
- btrfs/295: overwriting a zoned device won't work. So, this test should be skipped.
- generic/574: not sure fsverity works with zoned mode. Need to check.

So, btrfs/273 and generic/551 are suspicious. btrfs/273 prints some WARN
messages in dmesg and generic/551 killed an AIO_TEST program... Are there
details available?

> 
> Josef

* Re: [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
  2023-08-10 14:34   ` Naohiro Aota
@ 2023-08-10 14:36     ` David Sterba
  0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2023-08-10 14:36 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Josef Bacik, linux-btrfs@vger.kernel.org, hch@infradead.org,
	dsterba@suse.cz

On Thu, Aug 10, 2023 at 02:34:11PM +0000, Naohiro Aota wrote:
> > seem like legitimate failures.  btrfs/220 needs to be updated since zoned
> > doesn't do discard=async, but you can do that whenever, I'm less worried about
> > that.  The rest should be investigated at some point, though not as a
> > prerequisite for merging this series.  Thanks,
> 
> I checked the CI log. Yes, btrfs/220 fails because of discard=async.
> 
> * known to fail
> - btrfs/237: we need to tweak the test for ZNS (zone capacity != zone size)
> - btrfs/239: somehow, tree-log behaves differently in zoned mode... I
>   have no idea why it fails. But I think it is still a valid status...
> 
> * need to modify test?
> - generic/574: not sure fsverity works with zoned mode. Need to check.

The compatibility matrix at https://btrfs.readthedocs.io/en/latest/Status.html#zoned-mode
does not mention fsverity, so somebody has to test it and add the entry.

Thread overview: 19+ messages
2023-08-07 16:12 [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 01/10] btrfs: introduce struct to consolidate extent buffer write context Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 02/10] btrfs: zoned: introduce block group context to btrfs_eb_write_context Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 03/10] btrfs: zoned: return int from btrfs_check_meta_write_pointer Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 04/10] btrfs: zoned: defer advancing meta_write_pointer Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 05/10] btrfs: zoned: update meta_write_pointer on zone finish Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 06/10] btrfs: zoned: reserve zones for an active metadata/system block group Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 07/10] btrfs: zoned: activate metadata block group on write time Naohiro Aota
2023-08-09 16:42   ` David Sterba
2023-08-07 16:12 ` [PATCH v3 08/10] btrfs: zoned: no longer count fresh BG region as zone unusable Naohiro Aota
2023-08-09 16:49   ` David Sterba
2023-08-07 16:12 ` [PATCH v3 09/10] btrfs: zoned: don't activate non-DATA BG on allocation Naohiro Aota
2023-08-07 16:12 ` [PATCH v3 10/10] btrfs: zoned: re-enable metadata over-commit for zoned mode Naohiro Aota
2023-08-09 18:02 ` [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group David Sterba
2023-08-10 12:59 ` Josef Bacik
2023-08-10 14:13   ` Naohiro Aota
2023-08-10 13:34 ` Josef Bacik
2023-08-10 14:34   ` Naohiro Aota
2023-08-10 14:36     ` David Sterba
