* [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group
@ 2023-07-24 4:18 Naohiro Aota
2023-07-24 4:18 ` [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page() Naohiro Aota
` (7 more replies)
0 siblings, 8 replies; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
In the current implementation, block groups are activated at
reservation time to ensure that all reserved bytes can be written to
an active metadata block group. However, this approach has proven to
be less efficient, as it activates block groups more frequently than
necessary, putting pressure on the active zone resource and leading to
potential issues such as early ENOSPC or hung_task.
Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block
group allocations, resulting in decreased overall performance.
Actually, we don't need so many active metadata block groups because
there is only one sequential metadata write stream.
So, this series introduces write-time activation of metadata and system
block groups. This involves reserving at least one active block group
specifically for metadata and system block groups. When a write goes
into a new block group, all regions in the currently active block group
should already be allocated. So, we can wait for the in-flight IOs to
fill the remaining space and then switch to the new block group.
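Roughly, the switch looks like the following (simplified pseudo-code
sketch; new_bg_is_active()/old_active_bg/new_bg are made-up names for
illustration, the real implementation is in patch 5):
  /* Sketch: an eb write is about to go into a not-yet-active BG. */
  if (!new_bg_is_active(new_bg)) {
          /*
           * All extents in the old active BG are already allocated, so
           * only in-flight metadata IOs remain. Wait for them, finish
           * the zone, and only then activate the new block group.
           */
          wait_eb_writebacks(old_active_bg);
          do_zone_finish(old_active_bg, true);
          if (!btrfs_zone_activate(new_bg))
                  return false; /* the caller skips this eb for now */
  }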
Switching to write-time activation solves the above issues and leads to
better performance.
* Organization
Patches 1-3 are preparation patches reworking the meta_write_pointer
check.
Patches 4 and 5 are the main part of this series, implementing the
write-time activation.
Patches 6-8 remove the now-unnecessary reservation-time activation code:
counting a fresh block group as zone_unusable, activating a block group
on allocation, and disabling metadata over-commit.
Naohiro Aota (8):
btrfs: zoned: introduce block_group context for submit_eb_page()
btrfs: zoned: defer advancing meta_write_pointer
btrfs: zoned: update meta_write_pointer on zone finish
btrfs: zoned: reserve zones for an active metadata/system block group
btrfs: zoned: activate metadata block group on write time
btrfs: zoned: no longer count fresh BG region as zone unusable
btrfs: zoned: don't activate non-DATA BG on allocation
btrfs: zoned: re-enable metadata over-commit for zoned mode
fs/btrfs/block-group.c | 13 ++-
fs/btrfs/extent-tree.c | 8 +-
fs/btrfs/extent_io.c | 28 +++---
fs/btrfs/free-space-cache.c | 8 +-
fs/btrfs/fs.h | 3 +
fs/btrfs/space-info.c | 34 +------
fs/btrfs/zoned.c | 187 +++++++++++++++++++++++++++---------
fs/btrfs/zoned.h | 9 +-
8 files changed, 181 insertions(+), 109 deletions(-)
--
2.41.0
* [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page()
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
2023-07-24 15:00 ` Christoph Hellwig
2023-07-24 4:18 ` [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer Naohiro Aota
` (6 subsequent siblings)
7 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
For metadata write out on zoned mode, we call
btrfs_check_meta_write_pointer() to check if an extent buffer to be
written is aligned to the write pointer.
We look up the block group containing the extent buffer for every extent
buffer, which takes unnecessary effort as the extent buffers being
written are mostly contiguous.
Introduce "bg_context" to cache the block group we are currently working
on.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/extent_io.c | 21 +++++++++++----------
fs/btrfs/zoned.c | 32 +++++++++++++++++++-------------
2 files changed, 30 insertions(+), 23 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91282aefcb77..c7a88d2b5555 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1855,10 +1855,10 @@ static int submit_eb_subpage(struct page *page, struct writeback_control *wbc)
* Return <0 for fatal error.
*/
static int submit_eb_page(struct page *page, struct writeback_control *wbc,
- struct extent_buffer **eb_context)
+ struct extent_buffer **eb_context,
+ struct btrfs_block_group **bg_context)
{
struct address_space *mapping = page->mapping;
- struct btrfs_block_group *cache = NULL;
struct extent_buffer *eb;
int ret;
@@ -1894,7 +1894,7 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
if (!ret)
return 0;
- if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, &cache)) {
+ if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, bg_context)) {
/*
* If for_sync, this hole will be filled with
* trasnsaction commit.
@@ -1910,18 +1910,15 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
*eb_context = eb;
if (!lock_extent_buffer_for_io(eb, wbc)) {
- btrfs_revert_meta_write_pointer(cache, eb);
- if (cache)
- btrfs_put_block_group(cache);
+ btrfs_revert_meta_write_pointer(*bg_context, eb);
free_extent_buffer(eb);
return 0;
}
- if (cache) {
+ if (*bg_context) {
/*
* Implies write in zoned mode. Mark the last eb in a block group.
*/
- btrfs_schedule_zone_finish_bg(cache, eb);
- btrfs_put_block_group(cache);
+ btrfs_schedule_zone_finish_bg(*bg_context, eb);
}
write_one_eb(eb, wbc);
free_extent_buffer(eb);
@@ -1932,6 +1929,7 @@ int btree_write_cache_pages(struct address_space *mapping,
struct writeback_control *wbc)
{
struct extent_buffer *eb_context = NULL;
+ struct btrfs_block_group *bg_context = NULL;
struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
int ret = 0;
int done = 0;
@@ -1973,7 +1971,7 @@ int btree_write_cache_pages(struct address_space *mapping,
for (i = 0; i < nr_folios; i++) {
struct folio *folio = fbatch.folios[i];
- ret = submit_eb_page(&folio->page, wbc, &eb_context);
+ ret = submit_eb_page(&folio->page, wbc, &eb_context, &bg_context);
if (ret == 0)
continue;
if (ret < 0) {
@@ -2034,6 +2032,9 @@ int btree_write_cache_pages(struct address_space *mapping,
ret = 0;
if (!ret && BTRFS_FS_ERROR(fs_info))
ret = -EROFS;
+
+ if (bg_context)
+ btrfs_put_block_group(bg_context);
btrfs_zoned_meta_io_unlock(fs_info);
return ret;
}
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 5e4285ae112c..58bd2de4026d 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1751,27 +1751,33 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
struct extent_buffer *eb,
struct btrfs_block_group **cache_ret)
{
- struct btrfs_block_group *cache;
- bool ret = true;
+ struct btrfs_block_group *cache = NULL;
if (!btrfs_is_zoned(fs_info))
return true;
- cache = btrfs_lookup_block_group(fs_info, eb->start);
- if (!cache)
- return true;
+ if (*cache_ret) {
+ cache = *cache_ret;
+ if (cache->start > eb->start ||
+ cache->start + cache->length <= eb->start) {
+ btrfs_put_block_group(cache);
+ cache = NULL;
+ *cache_ret = NULL;
+ }
+ }
- if (cache->meta_write_pointer != eb->start) {
- btrfs_put_block_group(cache);
- cache = NULL;
- ret = false;
- } else {
- cache->meta_write_pointer = eb->start + eb->len;
+ if (!cache) {
+ cache = btrfs_lookup_block_group(fs_info, eb->start);
+ if (!cache)
+ return true;
+ *cache_ret = cache;
}
- *cache_ret = cache;
+ if (cache->meta_write_pointer != eb->start)
+ return false;
+ cache->meta_write_pointer = eb->start + eb->len;
- return ret;
+ return true;
}
void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
--
2.41.0
* [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
2023-07-24 4:18 ` [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page() Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
2023-07-24 15:04 ` Christoph Hellwig
2023-07-24 4:18 ` [PATCH 3/8] btrfs: zoned: update meta_write_pointer on zone finish Naohiro Aota
` (5 subsequent siblings)
7 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
We currently advance the meta_write_pointer in
btrfs_check_meta_write_pointer(). That makes it necessary to revert it
when locking the extent buffer fails. Instead, we can advance it just
before submitting the buffer.
Also, this is necessary for a following commit. That commit needs to
release the zoned_meta_io_lock to allow IOs to come in and wait for them
to fill the currently active block group. If we advance the
meta_write_pointer before locking the extent buffer, the following
extent buffer can pass the meta_write_pointer check, resulting in an
unaligned write failure.
Advancing the pointer is still thread-safe as the extent buffer is locked.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/extent_io.c | 11 ++++++-----
fs/btrfs/zoned.c | 14 +++-----------
fs/btrfs/zoned.h | 8 --------
3 files changed, 9 insertions(+), 24 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c7a88d2b5555..46a0b5357009 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1910,15 +1910,16 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
*eb_context = eb;
if (!lock_extent_buffer_for_io(eb, wbc)) {
- btrfs_revert_meta_write_pointer(*bg_context, eb);
free_extent_buffer(eb);
return 0;
}
if (*bg_context) {
- /*
- * Implies write in zoned mode. Mark the last eb in a block group.
- */
- btrfs_schedule_zone_finish_bg(*bg_context, eb);
+ /* Implies write in zoned mode. */
+ struct btrfs_block_group *bg = *bg_context;
+
+ /* Mark the last eb in the block group. */
+ btrfs_schedule_zone_finish_bg(bg, eb);
+ bg->meta_write_pointer += eb->len;
}
write_one_eb(eb, wbc);
free_extent_buffer(eb);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 58bd2de4026d..3f8f5e8c28a9 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1773,23 +1773,15 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
*cache_ret = cache;
}
+ /* Someone already start writing this eb. */
+ if (cache->meta_write_pointer > eb->start)
+ return true;
if (cache->meta_write_pointer != eb->start)
return false;
- cache->meta_write_pointer = eb->start + eb->len;
return true;
}
-void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
- struct extent_buffer *eb)
-{
- if (!btrfs_is_zoned(eb->fs_info) || !cache)
- return;
-
- ASSERT(cache->meta_write_pointer == eb->start + eb->len);
- cache->meta_write_pointer = eb->start;
-}
-
int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length)
{
if (!btrfs_dev_is_sequential(device, physical))
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 27322b926038..42a4df94dc29 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -61,8 +61,6 @@ void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
struct extent_buffer *eb,
struct btrfs_block_group **cache_ret);
-void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
- struct extent_buffer *eb);
int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
u64 physical_start, u64 physical_pos);
@@ -196,12 +194,6 @@ static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
return true;
}
-static inline void btrfs_revert_meta_write_pointer(
- struct btrfs_block_group *cache,
- struct extent_buffer *eb)
-{
-}
-
static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
u64 physical, u64 length)
{
--
2.41.0
* [PATCH 3/8] btrfs: zoned: update meta_write_pointer on zone finish
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
2023-07-24 4:18 ` [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page() Naohiro Aota
2023-07-24 4:18 ` [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
2023-07-24 15:04 ` Christoph Hellwig
2023-07-24 4:18 ` [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group Naohiro Aota
` (4 subsequent siblings)
7 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
On finishing a zone, the meta_write_pointer should be set to the end of
the zone to reflect the actual write pointer position.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/zoned.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 3f8f5e8c28a9..dd86f1858c88 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2048,6 +2048,9 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
clear_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags);
block_group->alloc_offset = block_group->zone_capacity;
+ if (block_group->flags & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM))
+ block_group->meta_write_pointer = block_group->start +
+ block_group->zone_capacity;
block_group->free_space_ctl->free_space = 0;
btrfs_clear_treelog_bg(block_group);
btrfs_clear_data_reloc_bg(block_group);
--
2.41.0
* [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
` (2 preceding siblings ...)
2023-07-24 4:18 ` [PATCH 3/8] btrfs: zoned: update meta_write_pointer on zone finish Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
2023-07-24 15:06 ` Christoph Hellwig
2023-07-24 4:18 ` [PATCH 5/8] btrfs: zoned: activate metadata block group on write time Naohiro Aota
` (3 subsequent siblings)
7 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
Ensure that a metadata and system block group can be activated at write
time by leaving a certain number of active zones free when trying to
activate a data block group.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/zoned.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 44 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index dd86f1858c88..dbfa49c70c1a 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1867,6 +1867,35 @@ int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
}
+/*
+ * Number of active zones reserved for one metadata block group and one
+ * system block group.
+ */
+static int reserved_zones(struct btrfs_fs_info *fs_info)
+{
+ const u64 flags[] = {BTRFS_BLOCK_GROUP_METADATA, BTRFS_BLOCK_GROUP_SYSTEM};
+ int reserved = 0;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(flags); i++) {
+ u64 profile = btrfs_get_alloc_profile(fs_info, flags[i]);
+
+ switch (profile) {
+ case 0: /* single */
+ reserved += 1;
+ break;
+ case BTRFS_BLOCK_GROUP_DUP:
+ reserved += 2;
+ break;
+ default:
+ ASSERT(0);
+ break;
+ }
+ }
+
+ return reserved;
+}
+
/*
* Activate block group and underlying device zones
*
@@ -1880,6 +1909,8 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
struct btrfs_space_info *space_info = block_group->space_info;
struct map_lookup *map;
struct btrfs_device *device;
+ const int reserved = (block_group->flags & BTRFS_BLOCK_GROUP_DATA) ?
+ reserved_zones(fs_info) : 0;
u64 physical;
bool ret;
int i;
@@ -1909,6 +1940,15 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
if (device->zone_info->max_active_zones == 0)
continue;
+ /*
+ * For the data block group, leave active zones for one
+ * metadata block group and one system block group.
+ */
+ if (atomic_read(&device->zone_info->active_zones_left) <= reserved) {
+ ret = false;
+ goto out_unlock;
+ }
+
if (!btrfs_dev_set_active_zone(device, physical)) {
/* Cannot activate the zone */
ret = false;
@@ -2103,6 +2143,8 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
{
struct btrfs_fs_info *fs_info = fs_devices->fs_info;
struct btrfs_device *device;
+ const int reserved = (flags & BTRFS_BLOCK_GROUP_DATA) ?
+ reserved_zones(fs_info) : 0;
bool ret = false;
if (!btrfs_is_zoned(fs_info))
@@ -2123,10 +2165,10 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
switch (flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
case 0: /* single */
- ret = (atomic_read(&zinfo->active_zones_left) >= 1);
+ ret = (atomic_read(&zinfo->active_zones_left) >= (1 + reserved));
break;
case BTRFS_BLOCK_GROUP_DUP:
- ret = (atomic_read(&zinfo->active_zones_left) >= 2);
+ ret = (atomic_read(&zinfo->active_zones_left) >= (2 + reserved));
break;
}
if (ret)
--
2.41.0
* [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
` (3 preceding siblings ...)
2023-07-24 4:18 ` [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
2023-07-24 6:24 ` kernel test robot
` (3 more replies)
2023-07-24 4:18 ` [PATCH 6/8] btrfs: zoned: no longer count fresh BG region as zone unusable Naohiro Aota
` (2 subsequent siblings)
7 siblings, 4 replies; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
In the current implementation, block groups are activated at reservation
time to ensure that all reserved bytes can be written to an active metadata
block group. However, this approach has proven to be less efficient, as it
activates block groups more frequently than necessary, putting pressure on
the active zone resource and leading to potential issues such as early
ENOSPC or hung_task.
Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block group
allocations, resulting in decreased overall performance.
To address these issues, this commit introduces write-time activation of
metadata and system block groups. This involves reserving at least one
active block group specifically for metadata and system block groups.
Since metadata write-out is always allocated sequentially, when we need to
write to a non-active block group, we can wait for the ongoing IOs to
complete, activate a new block group, and then proceed with writing to the
new block group.
Fixes: b09315139136 ("btrfs: zoned: activate metadata block group on flush_space")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/block-group.c | 11 +++++++
fs/btrfs/extent_io.c | 2 +-
fs/btrfs/fs.h | 3 ++
fs/btrfs/zoned.c | 72 ++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/zoned.h | 1 +
5 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 91d38f38c1e7..75f8482f45e5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -4274,6 +4274,17 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
struct btrfs_caching_control *caching_ctl;
struct rb_node *n;
+ if (btrfs_is_zoned(info)) {
+ if (info->active_meta_bg) {
+ btrfs_put_block_group(info->active_meta_bg);
+ info->active_meta_bg = NULL;
+ }
+ if (info->active_system_bg) {
+ btrfs_put_block_group(info->active_system_bg);
+ info->active_system_bg = NULL;
+ }
+ }
+
write_lock(&info->block_group_cache_lock);
while (!list_empty(&info->caching_block_groups)) {
caching_ctl = list_entry(info->caching_block_groups.next,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 46a0b5357009..3f104559c0cc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1894,7 +1894,7 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
if (!ret)
return 0;
- if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, bg_context)) {
+ if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
/*
* If for_sync, this hole will be filled with
* trasnsaction commit.
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 203d2a267828..1f2d33112106 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -766,6 +766,9 @@ struct btrfs_fs_info {
u64 data_reloc_bg;
struct mutex zoned_data_reloc_io_lock;
+ struct btrfs_block_group *active_meta_bg;
+ struct btrfs_block_group *active_system_bg;
+
u64 nr_global_roots;
spinlock_t zone_active_bgs_lock;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index dbfa49c70c1a..f440853dff1c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -65,6 +65,9 @@
#define SUPER_INFO_SECTORS ((u64)BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT)
+static void wait_eb_writebacks(struct btrfs_block_group *block_group);
+static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_written);
+
static inline bool sb_zone_is_full(const struct blk_zone *zone)
{
return (zone->cond == BLK_ZONE_COND_FULL) ||
@@ -1748,6 +1751,7 @@ void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
}
bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+ struct writeback_control *wbc,
struct extent_buffer *eb,
struct btrfs_block_group **cache_ret)
{
@@ -1779,6 +1783,74 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
if (cache->meta_write_pointer != eb->start)
return false;
+ if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags)) {
+ bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
+
+ spin_lock(&cache->lock);
+ if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+ &cache->runtime_flags)) {
+ spin_unlock(&cache->lock);
+ return true;
+ }
+
+ spin_unlock(&cache->lock);
+ if (fs_info->treelog_bg == cache->start) {
+ if (!btrfs_zone_activate(cache)) {
+ int ret_fin = btrfs_zone_finish_one_bg(fs_info);
+
+ if (ret_fin != 1 || !btrfs_zone_activate(cache))
+ return false;
+ }
+ } else if ((!is_system && fs_info->active_meta_bg != cache) ||
+ (is_system && fs_info->active_system_bg != cache)) {
+ struct btrfs_block_group *tgt = is_system ?
+ fs_info->active_system_bg : fs_info->active_meta_bg;
+
+ /*
+ * zoned_meta_io_lock protects
+ * fs_info->active_{meta,system}_bg.
+ */
+ lockdep_assert_held(&fs_info->zoned_meta_io_lock);
+
+ if (tgt) {
+ /*
+ * If there is an unsent IO left in the
+ * allocated area, we cannot wait for them
+ * as it may cause a deadlock.
+ */
+ if (tgt->meta_write_pointer < tgt->start + tgt->alloc_offset) {
+ if (wbc->sync_mode == WB_SYNC_NONE ||
+ (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync))
+ return false;
+ }
+
+ /* Pivot active metadata/system block group. */
+ btrfs_zoned_meta_io_unlock(fs_info);
+ wait_eb_writebacks(tgt);
+ do_zone_finish(tgt, true);
+ btrfs_zoned_meta_io_lock(fs_info);
+ if (is_system && fs_info->active_system_bg == tgt) {
+ btrfs_put_block_group(tgt);
+ fs_info->active_system_bg = NULL;
+ } else if (!is_system && fs_info->active_meta_bg == tgt) {
+ btrfs_put_block_group(tgt);
+ fs_info->active_meta_bg = NULL;
+ }
+ }
+ if (!btrfs_zone_activate(cache))
+ return false;
+ if (is_system && fs_info->active_system_bg != cache) {
+ ASSERT(fs_info->active_system_bg == NULL);
+ fs_info->active_system_bg = cache;
+ btrfs_get_block_group(cache);
+ } else if (!is_system && fs_info->active_meta_bg != cache) {
+ ASSERT(fs_info->active_meta_bg == NULL);
+ fs_info->active_meta_bg = cache;
+ btrfs_get_block_group(cache);
+ }
+ }
+ }
+
return true;
}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 42a4df94dc29..6935e04effdd 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -59,6 +59,7 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
bool btrfs_use_zone_append(struct btrfs_bio *bbio);
void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+ struct writeback_control *wbc,
struct extent_buffer *eb,
struct btrfs_block_group **cache_ret);
int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
--
2.41.0
* [PATCH 6/8] btrfs: zoned: no longer count fresh BG region as zone unusable
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
` (4 preceding siblings ...)
2023-07-24 4:18 ` [PATCH 5/8] btrfs: zoned: activate metadata block group on write time Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
2023-07-24 4:18 ` [PATCH 7/8] btrfs: zoned: don't activate non-DATA BG on allocation Naohiro Aota
2023-07-24 4:18 ` [PATCH 8/8] btrfs: zoned: re-enable metadata over-commit for zoned mode Naohiro Aota
7 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
Now that we have switched to write-time activation, we no longer need to
(and must not) count the fresh region as zone unusable. This commit
effectively reverts commit fc22cf8eba79 ("btrfs: zoned: count fresh BG
region as zone unusable").
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/free-space-cache.c | 8 +-------
fs/btrfs/zoned.c | 26 +++-----------------------
2 files changed, 4 insertions(+), 30 deletions(-)
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index bfc01352351c..13f47d9ec13d 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2704,13 +2704,8 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
bg_reclaim_threshold = READ_ONCE(sinfo->bg_reclaim_threshold);
spin_lock(&ctl->tree_lock);
- /* Count initial region as zone_unusable until it gets activated. */
if (!used)
to_free = size;
- else if (initial &&
- test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &block_group->fs_info->flags) &&
- (block_group->flags & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)))
- to_free = 0;
else if (initial)
to_free = block_group->zone_capacity;
else if (offset >= block_group->alloc_offset)
@@ -2738,8 +2733,7 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
reclaimable_unusable = block_group->zone_unusable -
(block_group->length - block_group->zone_capacity);
/* All the region is now unusable. Mark it as unused and reclaim */
- if (block_group->zone_unusable == block_group->length &&
- block_group->alloc_offset) {
+ if (block_group->zone_unusable == block_group->length) {
btrfs_mark_bg_unused(block_group);
} else if (bg_reclaim_threshold &&
reclaimable_unusable >=
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index f440853dff1c..ce816f5885fb 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1586,19 +1586,9 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
return;
WARN_ON(cache->bytes_super != 0);
-
- /* Check for block groups never get activated */
- if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &cache->fs_info->flags) &&
- cache->flags & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM) &&
- !test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags) &&
- cache->alloc_offset == 0) {
- unusable = cache->length;
- free = 0;
- } else {
- unusable = (cache->alloc_offset - cache->used) +
- (cache->length - cache->zone_capacity);
- free = cache->zone_capacity - cache->alloc_offset;
- }
+ unusable = (cache->alloc_offset - cache->used) +
+ (cache->length - cache->zone_capacity);
+ free = cache->zone_capacity - cache->alloc_offset;
/* We only need ->free_space in ALLOC_SEQ block groups */
cache->cached = BTRFS_CACHE_FINISHED;
@@ -1978,7 +1968,6 @@ static int reserved_zones(struct btrfs_fs_info *fs_info)
bool btrfs_zone_activate(struct btrfs_block_group *block_group)
{
struct btrfs_fs_info *fs_info = block_group->fs_info;
- struct btrfs_space_info *space_info = block_group->space_info;
struct map_lookup *map;
struct btrfs_device *device;
const int reserved = (block_group->flags & BTRFS_BLOCK_GROUP_DATA) ?
@@ -1992,7 +1981,6 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
map = block_group->physical_map;
- spin_lock(&space_info->lock);
spin_lock(&block_group->lock);
if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags)) {
ret = true;
@@ -2030,14 +2018,7 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
/* Successfully activated all the zones */
set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags);
- WARN_ON(block_group->alloc_offset != 0);
- if (block_group->zone_unusable == block_group->length) {
- block_group->zone_unusable = block_group->length - block_group->zone_capacity;
- space_info->bytes_zone_unusable -= block_group->zone_capacity;
- }
spin_unlock(&block_group->lock);
- btrfs_try_granting_tickets(fs_info, space_info);
- spin_unlock(&space_info->lock);
/* For the active block group list */
btrfs_get_block_group(block_group);
@@ -2050,7 +2031,6 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
out_unlock:
spin_unlock(&block_group->lock);
- spin_unlock(&space_info->lock);
return ret;
}
--
2.41.0
* [PATCH 7/8] btrfs: zoned: don't activate non-DATA BG on allocation
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
` (5 preceding siblings ...)
2023-07-24 4:18 ` [PATCH 6/8] btrfs: zoned: no longer count fresh BG region as zone unusable Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
2023-07-24 4:18 ` [PATCH 8/8] btrfs: zoned: re-enable metadata over-commit for zoned mode Naohiro Aota
7 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
Now that a non-DATA block group is activated at write time, there is no
need to activate it at allocation time.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/block-group.c | 2 +-
fs/btrfs/extent-tree.c | 8 +++++++-
fs/btrfs/space-info.c | 28 ----------------------------
3 files changed, 8 insertions(+), 30 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 75f8482f45e5..fc5f6b977189 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -4076,7 +4076,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
if (IS_ERR(ret_bg)) {
ret = PTR_ERR(ret_bg);
- } else if (from_extent_allocation) {
+ } else if (from_extent_allocation && (flags & BTRFS_BLOCK_GROUP_DATA)) {
/*
* New block group is likely to be used soon. Try to activate
* it now. Failure is OK for now.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 602cb750100c..9804e3fcc5ba 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3663,7 +3663,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
}
spin_unlock(&block_group->lock);
- if (!ret && !btrfs_zone_activate(block_group)) {
+ /* Metadata block group is activated on write time. */
+ if (!ret && (block_group->flags & BTRFS_BLOCK_GROUP_DATA) &&
+ !btrfs_zone_activate(block_group)) {
ret = 1;
/*
* May need to clear fs_info->{treelog,data_reloc}_bg.
@@ -3843,6 +3845,10 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
static int can_allocate_chunk_zoned(struct btrfs_fs_info *fs_info,
struct find_free_extent_ctl *ffe_ctl)
{
+ /* Block group's activeness is not a requirement for METADATA block groups. */
+ if (!(ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA))
+ return 0;
+
/* If we can activate new zone, just allocate a chunk and use it */
if (btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
return 0;
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 75e7fa337e66..a84b6088a73d 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -747,18 +747,6 @@ static void flush_space(struct btrfs_fs_info *fs_info,
break;
case ALLOC_CHUNK:
case ALLOC_CHUNK_FORCE:
- /*
- * For metadata space on zoned filesystem, reaching here means we
- * don't have enough space left in active_total_bytes. Try to
- * activate a block group first, because we may have inactive
- * block group already allocated.
- */
- ret = btrfs_zoned_activate_one_bg(fs_info, space_info, false);
- if (ret < 0)
- break;
- else if (ret == 1)
- break;
-
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
@@ -770,22 +758,6 @@ static void flush_space(struct btrfs_fs_info *fs_info,
CHUNK_ALLOC_FORCE);
btrfs_end_transaction(trans);
- /*
- * For metadata space on zoned filesystem, allocating a new chunk
- * is not enough. We still need to activate the block * group.
- * Active the newly allocated block group by (maybe) finishing
- * a block group.
- */
- if (ret == 1) {
- ret = btrfs_zoned_activate_one_bg(fs_info, space_info, true);
- /*
- * Revert to the original ret regardless we could finish
- * one block group or not.
- */
- if (ret >= 0)
- ret = 1;
- }
-
if (ret > 0 || ret == -ENOSPC)
ret = 0;
break;
--
2.41.0
* [PATCH 8/8] btrfs: zoned: re-enable metadata over-commit for zoned mode
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
` (6 preceding siblings ...)
2023-07-24 4:18 ` [PATCH 7/8] btrfs: zoned: don't activate non-DATA BG on allocation Naohiro Aota
@ 2023-07-24 4:18 ` Naohiro Aota
7 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2023-07-24 4:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: Naohiro Aota
Now that we have moved the activation from reservation time to write
time, we no longer need to ensure all the reserved bytes can be written
into already active block groups. So, we can re-enable metadata
over-commit.
Without metadata over-commit, the filesystem suffers from lower
performance because it needs to flush delalloc items more often and
allocate more block groups. Re-enabling metadata over-commit solves the
issue.
Fixes: 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/space-info.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index a84b6088a73d..4c4a30439fcf 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -389,11 +389,7 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
return 0;
used = btrfs_space_info_used(space_info, true);
- if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags) &&
- (space_info->flags & BTRFS_BLOCK_GROUP_METADATA))
- avail = 0;
- else
- avail = calc_available_free_space(fs_info, space_info, flush);
+ avail = calc_available_free_space(fs_info, space_info, flush);
if (used + bytes < space_info->total_bytes + avail)
return 1;
--
2.41.0
* Re: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
2023-07-24 4:18 ` [PATCH 5/8] btrfs: zoned: activate metadata block group on write time Naohiro Aota
@ 2023-07-24 6:24 ` kernel test robot
2023-07-24 7:36 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2023-07-24 6:24 UTC (permalink / raw)
To: Naohiro Aota, linux-btrfs; +Cc: oe-kbuild-all, Naohiro Aota
Hi Naohiro,
kernel test robot noticed the following build errors:
[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.5-rc3 next-20230721]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Naohiro-Aota/btrfs-zoned-introduce-block_group-context-for-submit_eb_page/20230724-122053
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/r/2f8e30f966cfd9e06b0745a691bd1a5566aab780.1690171333.git.naohiro.aota%40wdc.com
patch subject: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
config: parisc-randconfig-r023-20230724 (https://download.01.org/0day-ci/archive/20230724/202307241420.4nG9x5F7-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230724/202307241420.4nG9x5F7-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307241420.4nG9x5F7-lkp@intel.com/
All errors (new ones prefixed by >>):
fs/btrfs/extent_io.c: In function 'submit_eb_page':
>> fs/btrfs/extent_io.c:1827:58: error: passing argument 2 of 'btrfs_check_meta_write_pointer' from incompatible pointer type [-Werror=incompatible-pointer-types]
1827 | if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
| ^~~
| |
| struct writeback_control *
In file included from fs/btrfs/extent_io.c:30:
fs/btrfs/zoned.h:192:54: note: expected 'struct extent_buffer *' but argument is of type 'struct writeback_control *'
192 | struct extent_buffer *eb,
| ~~~~~~~~~~~~~~~~~~~~~~^~
fs/btrfs/extent_io.c:1827:63: error: passing argument 3 of 'btrfs_check_meta_write_pointer' from incompatible pointer type [-Werror=incompatible-pointer-types]
1827 | if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
| ^~
| |
| struct extent_buffer *
fs/btrfs/zoned.h:193:59: note: expected 'struct btrfs_block_group **' but argument is of type 'struct extent_buffer *'
193 | struct btrfs_block_group **cache_ret)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
>> fs/btrfs/extent_io.c:1827:14: error: too many arguments to function 'btrfs_check_meta_write_pointer'
1827 | if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/zoned.h:191:20: note: declared here
191 | static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/btrfs_check_meta_write_pointer +1827 fs/btrfs/extent_io.c
1766
1767 /*
1768 * Submit all page(s) of one extent buffer.
1769 *
1770 * @page: the page of one extent buffer
1771 * @eb_context: to determine if we need to submit this page, if current page
1772 * belongs to this eb, we don't need to submit
1773 *
1774 * The caller should pass each page in their bytenr order, and here we use
1775 * @eb_context to determine if we have submitted pages of one extent buffer.
1776 *
1777 * If we have, we just skip until we hit a new page that doesn't belong to
1778 * current @eb_context.
1779 *
1780 * If not, we submit all the page(s) of the extent buffer.
1781 *
1782 * Return >0 if we have submitted the extent buffer successfully.
1783 * Return 0 if we don't need to submit the page, as it's already submitted by
1784 * previous call.
1785 * Return <0 for fatal error.
1786 */
1787 static int submit_eb_page(struct page *page, struct writeback_control *wbc,
1788 struct extent_buffer **eb_context,
1789 struct btrfs_block_group **bg_context)
1790 {
1791 struct address_space *mapping = page->mapping;
1792 struct extent_buffer *eb;
1793 int ret;
1794
1795 if (!PagePrivate(page))
1796 return 0;
1797
1798 if (btrfs_sb(page->mapping->host->i_sb)->nodesize < PAGE_SIZE)
1799 return submit_eb_subpage(page, wbc);
1800
1801 spin_lock(&mapping->private_lock);
1802 if (!PagePrivate(page)) {
1803 spin_unlock(&mapping->private_lock);
1804 return 0;
1805 }
1806
1807 eb = (struct extent_buffer *)page->private;
1808
1809 /*
1810 * Shouldn't happen and normally this would be a BUG_ON but no point
1811 * crashing the machine for something we can survive anyway.
1812 */
1813 if (WARN_ON(!eb)) {
1814 spin_unlock(&mapping->private_lock);
1815 return 0;
1816 }
1817
1818 if (eb == *eb_context) {
1819 spin_unlock(&mapping->private_lock);
1820 return 0;
1821 }
1822 ret = atomic_inc_not_zero(&eb->refs);
1823 spin_unlock(&mapping->private_lock);
1824 if (!ret)
1825 return 0;
1826
> 1827 if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
1828 /*
1829 * If for_sync, this hole will be filled with
1830 * trasnsaction commit.
1831 */
1832 if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
1833 ret = -EAGAIN;
1834 else
1835 ret = 0;
1836 free_extent_buffer(eb);
1837 return ret;
1838 }
1839
1840 *eb_context = eb;
1841
1842 if (!lock_extent_buffer_for_io(eb, wbc)) {
1843 free_extent_buffer(eb);
1844 return 0;
1845 }
1846 if (*bg_context) {
1847 /* Implies write in zoned mode. */
1848 struct btrfs_block_group *bg = *bg_context;
1849
1850 /* Mark the last eb in the block group. */
1851 btrfs_schedule_zone_finish_bg(bg, eb);
1852 bg->meta_write_pointer += eb->len;
1853 }
1854 write_one_eb(eb, wbc);
1855 free_extent_buffer(eb);
1856 return 1;
1857 }
1858
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
2023-07-24 4:18 ` [PATCH 5/8] btrfs: zoned: activate metadata block group on write time Naohiro Aota
2023-07-24 6:24 ` kernel test robot
@ 2023-07-24 7:36 ` kernel test robot
2023-07-24 7:57 ` kernel test robot
2023-07-24 15:12 ` Christoph Hellwig
3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2023-07-24 7:36 UTC (permalink / raw)
To: Naohiro Aota, linux-btrfs; +Cc: llvm, oe-kbuild-all, Naohiro Aota
Hi Naohiro,
kernel test robot noticed the following build errors:
[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.5-rc3 next-20230724]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Naohiro-Aota/btrfs-zoned-introduce-block_group-context-for-submit_eb_page/20230724-122053
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/r/2f8e30f966cfd9e06b0745a691bd1a5566aab780.1690171333.git.naohiro.aota%40wdc.com
patch subject: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20230724/202307241530.1g5O8gwF-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce: (https://download.01.org/0day-ci/archive/20230724/202307241530.1g5O8gwF-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307241530.1g5O8gwF-lkp@intel.com/
All errors (new ones prefixed by >>):
fs/btrfs/extent_io.c:214:16: warning: variable 'pages_processed' set but not used [-Wunused-but-set-variable]
unsigned long pages_processed = 0;
^
>> fs/btrfs/extent_io.c:1827:60: error: too many arguments to function call, expected 3, have 4
if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^~~~~~~~~~
fs/btrfs/zoned.h:191:20: note: 'btrfs_check_meta_write_pointer' declared here
static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
^
1 warning and 1 error generated.
vim +1827 fs/btrfs/extent_io.c
1766
1767 /*
1768 * Submit all page(s) of one extent buffer.
1769 *
1770 * @page: the page of one extent buffer
1771 * @eb_context: to determine if we need to submit this page, if current page
1772 * belongs to this eb, we don't need to submit
1773 *
1774 * The caller should pass each page in their bytenr order, and here we use
1775 * @eb_context to determine if we have submitted pages of one extent buffer.
1776 *
1777 * If we have, we just skip until we hit a new page that doesn't belong to
1778 * current @eb_context.
1779 *
1780 * If not, we submit all the page(s) of the extent buffer.
1781 *
1782 * Return >0 if we have submitted the extent buffer successfully.
1783 * Return 0 if we don't need to submit the page, as it's already submitted by
1784 * previous call.
1785 * Return <0 for fatal error.
1786 */
1787 static int submit_eb_page(struct page *page, struct writeback_control *wbc,
1788 struct extent_buffer **eb_context,
1789 struct btrfs_block_group **bg_context)
1790 {
1791 struct address_space *mapping = page->mapping;
1792 struct extent_buffer *eb;
1793 int ret;
1794
1795 if (!PagePrivate(page))
1796 return 0;
1797
1798 if (btrfs_sb(page->mapping->host->i_sb)->nodesize < PAGE_SIZE)
1799 return submit_eb_subpage(page, wbc);
1800
1801 spin_lock(&mapping->private_lock);
1802 if (!PagePrivate(page)) {
1803 spin_unlock(&mapping->private_lock);
1804 return 0;
1805 }
1806
1807 eb = (struct extent_buffer *)page->private;
1808
1809 /*
1810 * Shouldn't happen and normally this would be a BUG_ON but no point
1811 * crashing the machine for something we can survive anyway.
1812 */
1813 if (WARN_ON(!eb)) {
1814 spin_unlock(&mapping->private_lock);
1815 return 0;
1816 }
1817
1818 if (eb == *eb_context) {
1819 spin_unlock(&mapping->private_lock);
1820 return 0;
1821 }
1822 ret = atomic_inc_not_zero(&eb->refs);
1823 spin_unlock(&mapping->private_lock);
1824 if (!ret)
1825 return 0;
1826
> 1827 if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
1828 /*
1829 * If for_sync, this hole will be filled with
1830 * trasnsaction commit.
1831 */
1832 if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
1833 ret = -EAGAIN;
1834 else
1835 ret = 0;
1836 free_extent_buffer(eb);
1837 return ret;
1838 }
1839
1840 *eb_context = eb;
1841
1842 if (!lock_extent_buffer_for_io(eb, wbc)) {
1843 free_extent_buffer(eb);
1844 return 0;
1845 }
1846 if (*bg_context) {
1847 /* Implies write in zoned mode. */
1848 struct btrfs_block_group *bg = *bg_context;
1849
1850 /* Mark the last eb in the block group. */
1851 btrfs_schedule_zone_finish_bg(bg, eb);
1852 bg->meta_write_pointer += eb->len;
1853 }
1854 write_one_eb(eb, wbc);
1855 free_extent_buffer(eb);
1856 return 1;
1857 }
1858
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
2023-07-24 4:18 ` [PATCH 5/8] btrfs: zoned: activate metadata block group on write time Naohiro Aota
2023-07-24 6:24 ` kernel test robot
2023-07-24 7:36 ` kernel test robot
@ 2023-07-24 7:57 ` kernel test robot
2023-07-24 15:12 ` Christoph Hellwig
3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2023-07-24 7:57 UTC (permalink / raw)
To: Naohiro Aota, linux-btrfs; +Cc: llvm, oe-kbuild-all, Naohiro Aota
Hi Naohiro,
kernel test robot noticed the following build errors:
[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.5-rc3 next-20230724]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Naohiro-Aota/btrfs-zoned-introduce-block_group-context-for-submit_eb_page/20230724-122053
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/r/2f8e30f966cfd9e06b0745a691bd1a5566aab780.1690171333.git.naohiro.aota%40wdc.com
patch subject: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
config: arm-randconfig-r002-20230724 (https://download.01.org/0day-ci/archive/20230724/202307241541.8w52nEnt-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230724/202307241541.8w52nEnt-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307241541.8w52nEnt-lkp@intel.com/
All errors (new ones prefixed by >>):
fs/btrfs/extent_io.c:214:16: warning: variable 'pages_processed' set but not used [-Wunused-but-set-variable]
214 | unsigned long pages_processed = 0;
| ^
>> fs/btrfs/extent_io.c:1827:60: error: too many arguments to function call, expected 3, have 4
1827 | if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^~~~~~~~~~
fs/btrfs/zoned.h:191:20: note: 'btrfs_check_meta_write_pointer' declared here
191 | static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
| ^
1 warning and 1 error generated.
vim +1827 fs/btrfs/extent_io.c
1766
1767 /*
1768 * Submit all page(s) of one extent buffer.
1769 *
1770 * @page: the page of one extent buffer
1771 * @eb_context: to determine if we need to submit this page, if current page
1772 * belongs to this eb, we don't need to submit
1773 *
1774 * The caller should pass each page in their bytenr order, and here we use
1775 * @eb_context to determine if we have submitted pages of one extent buffer.
1776 *
1777 * If we have, we just skip until we hit a new page that doesn't belong to
1778 * current @eb_context.
1779 *
1780 * If not, we submit all the page(s) of the extent buffer.
1781 *
1782 * Return >0 if we have submitted the extent buffer successfully.
1783 * Return 0 if we don't need to submit the page, as it's already submitted by
1784 * previous call.
1785 * Return <0 for fatal error.
1786 */
1787 static int submit_eb_page(struct page *page, struct writeback_control *wbc,
1788 struct extent_buffer **eb_context,
1789 struct btrfs_block_group **bg_context)
1790 {
1791 struct address_space *mapping = page->mapping;
1792 struct extent_buffer *eb;
1793 int ret;
1794
1795 if (!PagePrivate(page))
1796 return 0;
1797
1798 if (btrfs_sb(page->mapping->host->i_sb)->nodesize < PAGE_SIZE)
1799 return submit_eb_subpage(page, wbc);
1800
1801 spin_lock(&mapping->private_lock);
1802 if (!PagePrivate(page)) {
1803 spin_unlock(&mapping->private_lock);
1804 return 0;
1805 }
1806
1807 eb = (struct extent_buffer *)page->private;
1808
1809 /*
1810 * Shouldn't happen and normally this would be a BUG_ON but no point
1811 * crashing the machine for something we can survive anyway.
1812 */
1813 if (WARN_ON(!eb)) {
1814 spin_unlock(&mapping->private_lock);
1815 return 0;
1816 }
1817
1818 if (eb == *eb_context) {
1819 spin_unlock(&mapping->private_lock);
1820 return 0;
1821 }
1822 ret = atomic_inc_not_zero(&eb->refs);
1823 spin_unlock(&mapping->private_lock);
1824 if (!ret)
1825 return 0;
1826
> 1827 if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
1828 /*
1829 * If for_sync, this hole will be filled with
1830 * trasnsaction commit.
1831 */
1832 if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
1833 ret = -EAGAIN;
1834 else
1835 ret = 0;
1836 free_extent_buffer(eb);
1837 return ret;
1838 }
1839
1840 *eb_context = eb;
1841
1842 if (!lock_extent_buffer_for_io(eb, wbc)) {
1843 free_extent_buffer(eb);
1844 return 0;
1845 }
1846 if (*bg_context) {
1847 /* Implies write in zoned mode. */
1848 struct btrfs_block_group *bg = *bg_context;
1849
1850 /* Mark the last eb in the block group. */
1851 btrfs_schedule_zone_finish_bg(bg, eb);
1852 bg->meta_write_pointer += eb->len;
1853 }
1854 write_one_eb(eb, wbc);
1855 free_extent_buffer(eb);
1856 return 1;
1857 }
1858
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page()
2023-07-24 4:18 ` [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page() Naohiro Aota
@ 2023-07-24 15:00 ` Christoph Hellwig
2023-07-25 5:44 ` Naohiro Aota
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2023-07-24 15:00 UTC (permalink / raw)
To: Naohiro Aota; +Cc: linux-btrfs
On Mon, Jul 24, 2023 at 01:18:30PM +0900, Naohiro Aota wrote:
> For metadata write out on zoned mode, we call
> btrfs_check_meta_write_pointer() to check if an extent buffer to be
> written is aligned to the write pointer.
>
> We look up the block group containing the extent buffer for every extent
> buffer, which takes unnecessary effort as the extent buffers being
> written are mostly contiguous.
>
> Introduce "bg_context" to cache the block group we are currently working
> on.
As someone who already found the eb_context naming and handling in the
existing code highly confusing, I wonder if we should first add a new
eb_write_context structure, which initially contains the wbc
and eb_context pointers, and which also would grow the bg. This
should make argument passing a little simpler, but most importantly
removes all the dealing with the double pointers that need a lot
of checking and clearing.
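Something along these lines maybe (just a sketch to show the idea, the
struct and field names are made up and of course up for discussion):
  struct btrfs_eb_write_context {
          struct writeback_control *wbc;
          struct extent_buffer *eb;
          /* Block group of the current eb, only used on zoned mode. */
          struct btrfs_block_group *zoned_bg;
  };
Then submit_eb_page() and btrfs_check_meta_write_pointer() would only
need a single context argument and could update ctx->eb and
ctx->zoned_bg in place instead of juggling double pointers.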
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 5e4285ae112c..58bd2de4026d 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1751,27 +1751,33 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
> struct extent_buffer *eb,
> struct btrfs_block_group **cache_ret)
> {
>
>
> - struct btrfs_block_group *cache;
> - bool ret = true;
> + struct btrfs_block_group *cache = NULL;
.. and independent of the above comment, I also found the "cache" and
"cache_ret" namings here very highly confusing. What speaks again
using a bg-based naming that makes it clear what we're dealing with?
* Re: [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer
2023-07-24 4:18 ` [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer Naohiro Aota
@ 2023-07-24 15:04 ` Christoph Hellwig
2023-07-25 8:59 ` Naohiro Aota
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2023-07-24 15:04 UTC (permalink / raw)
To: Naohiro Aota; +Cc: linux-btrfs
> @@ -1773,23 +1773,15 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
> *cache_ret = cache;
> }
>
> + /* Someone already start writing this eb. */
> + if (cache->meta_write_pointer > eb->start)
> + return true;
This is actually a behavior change and not mentioned in the commit log.
How is this supposed to work? If eb->start is before the write pointer
we're going to get a write reordering problem.
The other bits in the patch look like a very nice improvement, though.
* Re: [PATCH 3/8] btrfs: zoned: update meta_write_pointer on zone finish
2023-07-24 4:18 ` [PATCH 3/8] btrfs: zoned: update meta_write_pointer on zone finish Naohiro Aota
@ 2023-07-24 15:04 ` Christoph Hellwig
2023-07-25 9:01 ` Naohiro Aota
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2023-07-24 15:04 UTC (permalink / raw)
To: Naohiro Aota; +Cc: linux-btrfs
On Mon, Jul 24, 2023 at 01:18:32PM +0900, Naohiro Aota wrote:
> On finishing a zone, the meta_write_pointer should be set to the end of
> the zone to reflect the actual write pointer position.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
Does this need a Fixes a tag?
* Re: [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group
2023-07-24 4:18 ` [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group Naohiro Aota
@ 2023-07-24 15:06 ` Christoph Hellwig
2023-07-25 9:07 ` Naohiro Aota
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2023-07-24 15:06 UTC (permalink / raw)
To: Naohiro Aota; +Cc: linux-btrfs
> +static int reserved_zones(struct btrfs_fs_info *fs_info)
> +{
> + const u64 flags[] = {BTRFS_BLOCK_GROUP_METADATA, BTRFS_BLOCK_GROUP_SYSTEM};
> + int reserved = 0;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(flags); i++) {
> + u64 profile = btrfs_get_alloc_profile(fs_info, flags[i]);
> +
> + switch (profile) {
> + case 0: /* single */
> + reserved += 1;
> + break;
> + case BTRFS_BLOCK_GROUP_DUP:
> + reserved += 2;
> + break;
> + default:
> + ASSERT(0);
> + break;
> + }
> + }
> +
> + return reserved;
> +}
Shouldn't we just store the number of reserved zones in fs_info instead
of recalculating this over and over?
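E.g. something like this (sketch only, the field name is made up),
computed once when the zoned mode is set up or the profile changes:
  /* sketch: new field in struct btrfs_fs_info */
  int reserved_active_zones;
  /* set once, e.g. in btrfs_check_zoned_mode(): */
  fs_info->reserved_active_zones = reserved_zones(fs_info);
  /* then in btrfs_zone_activate() / btrfs_can_activate_zone(): */
  const int reserved = (flags & BTRFS_BLOCK_GROUP_DATA) ?
          fs_info->reserved_active_zones : 0;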
* Re: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
2023-07-24 4:18 ` [PATCH 5/8] btrfs: zoned: activate metadata block group on write time Naohiro Aota
` (2 preceding siblings ...)
2023-07-24 7:57 ` kernel test robot
@ 2023-07-24 15:12 ` Christoph Hellwig
2023-07-25 9:28 ` Naohiro Aota
3 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2023-07-24 15:12 UTC (permalink / raw)
To: Naohiro Aota; +Cc: linux-btrfs
On Mon, Jul 24, 2023 at 01:18:34PM +0900, Naohiro Aota wrote:
> Since metadata write-out is always allocated sequentially, when we need to
> write to a non-active block group, we can wait for the ongoing IOs to
> complete, activate a new block group, and then proceed with writing to the
> new block group.
Somewhat unrelated, but isn't the same true for data as well, and maybe
in a follow on the same activation policy should apply there as well?
I guess it won't quite work without much work as reservations are tied
to a BG, but it would seem like the better scheme.
> + if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags)) {
Just a more or less cosmetic nit: there is a lot of code in this
branch, and I wonder if the active zone tracking code would benefit
from being split out into one or more helpers.
> + bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
> +
> + spin_lock(&cache->lock);
> + if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> + &cache->runtime_flags)) {
> + spin_unlock(&cache->lock);
> + return true;
> + }
> +
> + spin_unlock(&cache->lock);
What is the point of the cache->lock critical section here? The
information can be out of date as soon as you drop the lock, so it
looks superfluous to me.
> + if (fs_info->treelog_bg == cache->start) {
> + if (!btrfs_zone_activate(cache)) {
> + int ret_fin = btrfs_zone_finish_one_bg(fs_info);
> +
> + if (ret_fin != 1 || !btrfs_zone_activate(cache))
> + return false;
The ret_fin variable here doesn't seem to be actually needed.
> + }
> + } else if ((!is_system && fs_info->active_meta_bg != cache) ||
> + (is_system && fs_info->active_system_bg != cache)) {
> + struct btrfs_block_group *tgt = is_system ?
> + fs_info->active_system_bg : fs_info->active_meta_bg;
There's a lot of checks for is_system here and later on in the
logic. If we had a helper for this, you could just pass in a bg double
pointer argument that the callers sets to &fs_info->active_system_bg or
&fs_info->active_meta_bg and simplify a lot of the logic.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page()
2023-07-24 15:00 ` Christoph Hellwig
@ 2023-07-25 5:44 ` Naohiro Aota
0 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2023-07-25 5:44 UTC (permalink / raw)
To: hch@infradead.org; +Cc: linux-btrfs@vger.kernel.org
On Mon, Jul 24, 2023 at 08:00:04AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 01:18:30PM +0900, Naohiro Aota wrote:
> > For metadata write out on the zoned mode, we call
> > btrfs_check_meta_write_pointer() to check if an extent buffer to be written
> > is aligned to the write pointer.
> >
> > We lookup for a block group containing the extent buffer for every extent
> > buffer, which take unnecessary effort as the writing extent buffers are
> > mostly contiguous.
> >
> > Introduce "bg_context" to cache the block group working on.
>
> As someone who already found the eb_context naming and handling in the
> existing code highly confusing, I wonder if we should first add a new
> eb_write_context structure, which initially contains the wbc
> and eb_context pointers, and which also would grow the bg. This
> should make argument passing a little simpler, but most importantly
> removes all the dealing with the double pointers that need a lot
> of checking and clearing.
Ah, that works. Actually, I previously used the bio_ctrl for that purpose,
but it was removed from the metadata context.
> > diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> > index 5e4285ae112c..58bd2de4026d 100644
> > --- a/fs/btrfs/zoned.c
> > +++ b/fs/btrfs/zoned.c
> > @@ -1751,27 +1751,33 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
> > struct extent_buffer *eb,
> > struct btrfs_block_group **cache_ret)
> > {
> >
> >
> > - struct btrfs_block_group *cache;
> > - bool ret = true;
> > + struct btrfs_block_group *cache = NULL;
>
> .. and independent of the above comment, I also found the "cache" and
> "cache_ret" namings here highly confusing. What speaks against
> using a bg-based naming that makes it clear what we're dealing with?
Yes. That is because the original code was written before
"btrfs_block_group_cache" was renamed to "btrfs_block_group". This is a good
opportunity to rename it as well.
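To spell out the suggestion, such a context structure (call it
btrfs_eb_write_context, say) could look roughly like this; this is only a
sketch with illustrative field names, not an actual patch:

struct btrfs_eb_write_context {
        struct writeback_control *wbc;
        struct extent_buffer *eb;
        /* Block group the extent buffers being written out belong to (zoned only). */
        struct btrfs_block_group *zoned_bg;
};

Callers would then pass one context around instead of the wbc, the eb_context,
and the block group double pointer separately.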
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer
2023-07-24 15:04 ` Christoph Hellwig
@ 2023-07-25 8:59 ` Naohiro Aota
2023-07-26 13:06 ` hch
0 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2023-07-25 8:59 UTC (permalink / raw)
To: hch@infradead.org; +Cc: linux-btrfs@vger.kernel.org
On Mon, Jul 24, 2023 at 08:04:10AM -0700, Christoph Hellwig wrote:
> > @@ -1773,23 +1773,15 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
> > *cache_ret = cache;
> > }
> >
> > + /* Someone already start writing this eb. */
> > + if (cache->meta_write_pointer > eb->start)
> > + return true;
>
> This is actually a behavior change and not mentioned in the commit log.
> How is this supposed to work? If eb->start is before the write pointer
> we're going to get a write reordering problem.
This won't happen at this point, but can happen after the write-time
activation patch. When one process waits for the current metadata writeback to
finish, another process can pick up the same contiguous extent buffers. That
region will then be processed by the first process, and the second process
sees a write pointer larger than the start of the extent buffer it holds.
In detail, consider that we have two extent buffers (eb#0 and eb#1) to write,
and they are going to a new, non-active block group.
Process A:
  btree_write_cache_pages()
    btrfs_zoned_meta_io_lock();
    filemap_get_folio_tags() returns pages for the two ebs.
    submit_eb_page()
      btrfs_check_meta_write_pointer()
      # unlock and wait for the writeback
      btrfs_zoned_meta_io_unlock();
      wait_eb_writebacks();

Process B:
  btree_write_cache_pages();
    btrfs_zoned_meta_lock();
    filemap_get_folio_tags() returns pages for the two ebs.
    submit_eb_page();
      btrfs_check_meta_write_pointer()
      # We may still need to wait.
      btrfs_zoned_meta_unlock()
      wait_eb_writebacks();

Process A:
  # Now, writeback finished
  btrfs_zoned_meta_io_lock();
  lock_extent_buffer_for_io(eb#0)
  # write_pointer == end of eb#0
  write_one_eb(eb#0);
  ...
  # and, write eb#1 as well.
  # write_pointer == end of eb#1
  write_one_eb(eb#1);
  btrfs_zoned_meta_io_unlock();

Process B:
  btrfs_zoned_meta_io_lock();
  lock_extent_buffer_for_io(eb#0) fails, return
  # and, proceed to eb#1
  submit_eb_page(#1)
    btrfs_check_meta_write_pointer(eb#1)
    # Now, the write pointer is larger than eb#1->start.
    # Hitting the above condition.
Returning true itself in this case should be OK, because in the end that eb
is rejected by lock_extent_buffer_for_io(). We cannot simply return
false here, because that may lead to returning -EAGAIN in a certain
case. For the return value, we can move the wbc check from submit_eb_page()
to btrfs_check_meta_write_pointer() and return a proper int value, e.g.
returning -EBUSY here and having submit_eb_page() convert it to
free_extent_buffer(eb) and return 0.
But, yeah, these lines should go with the write-time activation patch.
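A rough sketch of that direction, purely for illustration (the helper name and
the exact error-code plumbing are made up here, not taken from a patch):

/*
 * 0 means the eb is at the write pointer and can be submitted, -EAGAIN means
 * the caller has to retry later, -EBUSY means someone else already started
 * writing this eb and the caller should just drop it.
 */
static int check_meta_write_pointer(struct btrfs_block_group *block_group,
                                    struct extent_buffer *eb)
{
        if (block_group->meta_write_pointer > eb->start)
                return -EBUSY;
        if (block_group->meta_write_pointer < eb->start)
                return -EAGAIN;
        return 0;
}

and in submit_eb_page(), hypothetically:

        if (ret == -EBUSY) {
                free_extent_buffer(eb);
                return 0;
        }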
>
> The other bits in the patch look like a very nice improvement, though.
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/8] btrfs: zoned: update meta_write_pointer on zone finish
2023-07-24 15:04 ` Christoph Hellwig
@ 2023-07-25 9:01 ` Naohiro Aota
0 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2023-07-25 9:01 UTC (permalink / raw)
To: hch@infradead.org; +Cc: linux-btrfs@vger.kernel.org
On Mon, Jul 24, 2023 at 08:04:54AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 01:18:32PM +0900, Naohiro Aota wrote:
> > On finishing a zone, the meta_write_pointer should be set to the end of the
> > zone to reflect the actual write pointer position.
>
> Looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> Does this need a Fixes a tag?
I don't think so, because there is no actual bug hit by this as far as I
know.
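FWIW, the change itself is essentially a single assignment in the zone
finishing path, roughly like this (sketched from the commit message, not the
exact hunk):

        /* The zone is finished, so the write pointer is at the zone end. */
        block_group->meta_write_pointer = block_group->start +
                                          block_group->zone_capacity;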
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group
2023-07-24 15:06 ` Christoph Hellwig
@ 2023-07-25 9:07 ` Naohiro Aota
2023-07-26 13:07 ` hch
0 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2023-07-25 9:07 UTC (permalink / raw)
To: hch@infradead.org; +Cc: linux-btrfs@vger.kernel.org
On Mon, Jul 24, 2023 at 08:06:50AM -0700, Christoph Hellwig wrote:
> > +static int reserved_zones(struct btrfs_fs_info *fs_info)
> > +{
> > + const u64 flags[] = {BTRFS_BLOCK_GROUP_METADATA, BTRFS_BLOCK_GROUP_SYSTEM};
> > + int reserved = 0;
> > + int i;
> > +
> > + for (i = 0; i < ARRAY_SIZE(flags); i++) {
> > + u64 profile = btrfs_get_alloc_profile(fs_info, flags[i]);
> > +
> > + switch (profile) {
> > + case 0: /* single */
> > + reserved += 1;
> > + break;
> > + case BTRFS_BLOCK_GROUP_DUP:
> > + reserved += 2;
> > + break;
> > + default:
> > + ASSERT(0);
> > + break;
> > + }
> > + }
> > +
> > + return reserved;
> > +}
>
> Shouldn't we just store the number of reserved zones in fs_info instead
> of recalculating this over and over?
>
We can convert the profile from SINGLE to DUP while the FS is running. So,
simply storing it won't work.
But, we can hook the conversion process and increase the reservation count
when it converts to DUP? Then, we can calculate it on mount and record
it in fs_info.
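Roughly like this; the helper and the fs_info field are made up here just to
illustrate the idea:

/* Called on mount, and again when balance converts metadata/system to DUP. */
static void btrfs_calc_reserved_active_zones(struct btrfs_fs_info *fs_info)
{
        const u64 flags[] = { BTRFS_BLOCK_GROUP_METADATA, BTRFS_BLOCK_GROUP_SYSTEM };
        int reserved = 0;
        int i;

        for (i = 0; i < ARRAY_SIZE(flags); i++) {
                u64 profile = btrfs_get_alloc_profile(fs_info, flags[i]);

                /* single needs one reserved active zone, DUP needs two. */
                reserved += (profile & BTRFS_BLOCK_GROUP_DUP) ? 2 : 1;
        }

        /* Hypothetical field caching the reservation count. */
        fs_info->reserved_active_zones = reserved;
}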
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
2023-07-24 15:12 ` Christoph Hellwig
@ 2023-07-25 9:28 ` Naohiro Aota
2023-07-26 13:10 ` hch
0 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2023-07-25 9:28 UTC (permalink / raw)
To: hch@infradead.org; +Cc: linux-btrfs@vger.kernel.org
On Mon, Jul 24, 2023 at 08:12:28AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 01:18:34PM +0900, Naohiro Aota wrote:
> > Since metadata write-out is always allocated sequentially, when we need to
> > write to a non-active block group, we can wait for the ongoing IOs to
> > complete, activate a new block group, and then proceed with writing to the
> > new block group.
>
> Somewhat unrelated, but isn't the same true for data as well, and maybe
> in a follow on the same activation policy should apply there as well?
> I guess it won't quite work without much work as reservations are tied
> to a BG, but it would seem like the better scheme.
Yeah, we may be able to apply the write-time activation to data as
well. But there are some differences between data and metadata which make
it not so useful:
- Data allocation and write-out are done in the same context under
  run_delalloc_zoned(), so they happen close together in time
- Metadata, in contrast, is allocated while building the B-tree and only
  written out on a transaction commit
- There is no over-commit for data in the first place
> > + if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags)) {
>
> Just a more or less cosmetic nit: there is a lot of code in this
> branch, and I wonder if the active zone tracking code would benefit
> from being split out into one or more helpers.
Yeah, I'll create a helper.
> > + bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
> > +
> > + spin_lock(&cache->lock);
> > + if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> > + &cache->runtime_flags)) {
> > + spin_unlock(&cache->lock);
> > + return true;
> > + }
> > +
> > + spin_unlock(&cache->lock);
>
> What is the point of the cache->lock critical section here? The
> information can be out of date as soon as you drop the lock, so it
> looks superfluous to me.
The block group's activeness goes from non-active to activated and
finally to finished. So, if it's "ACTIVE" here, the block group is either
still active or already finished. It's fine if it is still active. If it is
finished, the IO will fail anyway, which would be a bug in itself.
> > + if (fs_info->treelog_bg == cache->start) {
> > + if (!btrfs_zone_activate(cache)) {
> > + int ret_fin = btrfs_zone_finish_one_bg(fs_info);
> > +
> > + if (ret_fin != 1 || !btrfs_zone_activate(cache))
> > + return false;
>
> The ret_fin variable here doesn't seem to be actually needed.
Indeed.
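i.e. something like this, shown only to illustrate dropping the variable:

        if (!btrfs_zone_activate(cache)) {
                if (btrfs_zone_finish_one_bg(fs_info) != 1 ||
                    !btrfs_zone_activate(cache))
                        return false;
        }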
> > + }
> > + } else if ((!is_system && fs_info->active_meta_bg != cache) ||
> > + (is_system && fs_info->active_system_bg != cache)) {
> > + struct btrfs_block_group *tgt = is_system ?
> > + fs_info->active_system_bg : fs_info->active_meta_bg;
>
> There's a lot of checks for is_system here and later on in the
> logic. If we had a helper for this, you could just pass in a bg double
> pointer argument that the callers sets to &fs_info->active_system_bg or
> &fs_info->active_meta_bg and simplify a lot of the logic.
>
Sure.
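Something along these lines, as a rough sketch only (the helper name is made
up, and the waiting for / finishing of the previously active block group is
left out):

/*
 * The caller passes &fs_info->active_meta_bg or &fs_info->active_system_bg,
 * so all the is_system special-casing collapses into one place.
 */
static bool check_bg_is_active(struct btrfs_block_group *cache,
                               struct btrfs_block_group **active_bg)
{
        if (*active_bg == cache)
                return true;

        if (!btrfs_zone_activate(cache))
                return false;

        /* Track the newly activated block group as the active one. */
        *active_bg = cache;
        return true;
}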
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer
2023-07-25 8:59 ` Naohiro Aota
@ 2023-07-26 13:06 ` hch
0 siblings, 0 replies; 25+ messages in thread
From: hch @ 2023-07-26 13:06 UTC (permalink / raw)
To: Naohiro Aota; +Cc: hch@infradead.org, linux-btrfs@vger.kernel.org
On Tue, Jul 25, 2023 at 08:59:17AM +0000, Naohiro Aota wrote:
> This won't happen at this point, but can happen after the write-time
> activation patch.
Let's move the check there and very clearly document it in that patch
instead of slipping it in here.
> Returning true itself in this case should be OK, because in the end that eb
> is rejected by lock_extent_buffer_for_io(). We cannot simply return
> false here, because that may lead to returning -EAGAIN in a certain
> case. For the return value, we can move the wbc check from submit_eb_page()
> to btrfs_check_meta_write_pointer() and return a proper int value, e.g.
> returning -EBUSY here and having submit_eb_page() convert it to
> free_extent_buffer(eb) and return 0.
That feels like the right thing to do. What I think would be really
good in the long run is to actually avoid the write_pointer holes
entirely by never cancelling buffers for zoned file systems, but instead
move the zeroing that's currently in btrfs_redirty_list_add into
btrfs_clear_buffer_dirty. I tried this in a naive way and it failed,
but I plan to get back to it eventually.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group
2023-07-25 9:07 ` Naohiro Aota
@ 2023-07-26 13:07 ` hch
0 siblings, 0 replies; 25+ messages in thread
From: hch @ 2023-07-26 13:07 UTC (permalink / raw)
To: Naohiro Aota; +Cc: hch@infradead.org, linux-btrfs@vger.kernel.org
On Tue, Jul 25, 2023 at 09:07:19AM +0000, Naohiro Aota wrote:
> We can convert the profile from SINGLE to DUP while the FS is running. So,
> simply storing it won't work.
>
> But, we can hook the conversion process and increase the reservation count
> when it converts to DUP? Then, we can calculate it on mount and record
> it in fs_info.
That sounds like the right thing to do to me.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
2023-07-25 9:28 ` Naohiro Aota
@ 2023-07-26 13:10 ` hch
0 siblings, 0 replies; 25+ messages in thread
From: hch @ 2023-07-26 13:10 UTC (permalink / raw)
To: Naohiro Aota; +Cc: hch@infradead.org, linux-btrfs@vger.kernel.org
On Tue, Jul 25, 2023 at 09:28:09AM +0000, Naohiro Aota wrote:
> > > + bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
> > > +
> > > + spin_lock(&cache->lock);
> > > + if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> > > + &cache->runtime_flags)) {
> > > + spin_unlock(&cache->lock);
> > > + return true;
> > > + }
> > > +
> > > + spin_unlock(&cache->lock);
> >
> > What is the point of the cache->lock critical section here? The
> > information can be out of date as soon as you drop the lock, so it
> > looks superfluous to me.
>
> The block group's activeness goes from non-active to activated and
> finally to finished. So, if it's "ACTIVE" here, the block group is either
> still active or already finished. It's fine if it is still active. If it is
> finished, the IO will fail anyway, which would be a bug in itself.
Yes, but what I mean is that there is no point in taking cache->lock
here. The only thing that's done under it is testing a single bit,
and the information from that bit is only used after dropping the
lock. So there should be no need to take the lock in the first place.
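I.e., just (sketch):

        /* No lock needed: the flag only moves from inactive to active to finished. */
        if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags))
                return true;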
^ permalink raw reply [flat|nested] 25+ messages in thread
Thread overview: 25+ messages
2023-07-24 4:18 [PATCH 0/8] btrfs: zoned: write-time activation of metadata block group Naohiro Aota
2023-07-24 4:18 ` [PATCH 1/8] btrfs: zoned: introduce block_group context for submit_eb_page() Naohiro Aota
2023-07-24 15:00 ` Christoph Hellwig
2023-07-25 5:44 ` Naohiro Aota
2023-07-24 4:18 ` [PATCH 2/8] btrfs: zoned: defer advancing meta_write_pointer Naohiro Aota
2023-07-24 15:04 ` Christoph Hellwig
2023-07-25 8:59 ` Naohiro Aota
2023-07-26 13:06 ` hch
2023-07-24 4:18 ` [PATCH 3/8] btrfs: zoned: update meta_write_pointer on zone finish Naohiro Aota
2023-07-24 15:04 ` Christoph Hellwig
2023-07-25 9:01 ` Naohiro Aota
2023-07-24 4:18 ` [PATCH 4/8] btrfs: zoned: reserve zones for an active metadata/system block group Naohiro Aota
2023-07-24 15:06 ` Christoph Hellwig
2023-07-25 9:07 ` Naohiro Aota
2023-07-26 13:07 ` hch
2023-07-24 4:18 ` [PATCH 5/8] btrfs: zoned: activate metadata block group on write time Naohiro Aota
2023-07-24 6:24 ` kernel test robot
2023-07-24 7:36 ` kernel test robot
2023-07-24 7:57 ` kernel test robot
2023-07-24 15:12 ` Christoph Hellwig
2023-07-25 9:28 ` Naohiro Aota
2023-07-26 13:10 ` hch
2023-07-24 4:18 ` [PATCH 6/8] btrfs: zoned: no longer count fresh BG region as zone unusable Naohiro Aota
2023-07-24 4:18 ` [PATCH 7/8] btrfs: zoned: don't activate non-DATA BG on allocation Naohiro Aota
2023-07-24 4:18 ` [PATCH 8/8] btrfs: zoned: re-enable metadata over-commit for zoned mode Naohiro Aota