All of lore.kernel.org
 help / color / mirror / Atom feed
* simplify dirty buffer tracking
@ 2023-05-15 19:22 Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers Christoph Hellwig
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-15 19:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs

Hi all,

this series rewrites the tracking and writeback of the per-transaction
or per-tree_log dirty buffers.  It is intended as a first step towards
getting rid of btree_inode, but turns out to be a pretty nice cleanup
on it's own.  I also see minor speedup in fs_mark workloads, but given
how variable they are in general I'm not sure if they really matter.

This series starts out new extent_buffer.[ch].  I plan to eventually
concentrate all code for extent_buffer handling there, but for now
this just has brand-new code.

Note that the series is based on top of net-next because the previous
round of extent_buffer changes has not made it to misc-next yet.

Diffstat:
 fs/btrfs/Makefile                |    2 
 fs/btrfs/ctree.h                 |    3 
 fs/btrfs/disk-io.c               |   72 +++++--------
 fs/btrfs/extent-io-tree.c        |  212 ---------------------------------------
 fs/btrfs/extent-io-tree.h        |   14 --
 fs/btrfs/extent-tree.c           |   21 ---
 fs/btrfs/extent_buffer.c         |  104 +++++++++++++++++++
 fs/btrfs/extent_buffer.h         |   26 ++++
 fs/btrfs/extent_io.c             |  131 ++++++------------------
 fs/btrfs/extent_io.h             |    2 
 fs/btrfs/fs.h                    |    3 
 fs/btrfs/tests/extent-io-tests.c |    2 
 fs/btrfs/transaction.c           |  168 ++----------------------------
 fs/btrfs/transaction.h           |    3 
 fs/btrfs/tree-log.c              |   71 ++++++++-----
 fs/btrfs/zoned.c                 |   11 +-
 include/trace/events/btrfs.h     |   57 ----------
 17 files changed, 278 insertions(+), 624 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers
  2023-05-15 19:22 simplify dirty buffer tracking Christoph Hellwig
@ 2023-05-15 19:22 ` Christoph Hellwig
  2023-05-17 10:40   ` Filipe Manana
  2023-05-15 19:22 ` [PATCH 2/6] btrfs: remove convert_extent_bit Christoph Hellwig
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-15 19:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs

Currently there is an extent_io_tree per transaction or tree_log root to
track the dirty buffer ranges in the transaction or tree_log.  This is
fairly inefficient, as:

 - we only add to the list, until the transaction / log_tree commit
   happens, at which point we'll want to walk them sequentially
 - the dirty range means we need another mechanism to actually find
   the dirty buffer based off the dirty range

This patch instead switches tracking to one linked list per transaction
and two to each root for the two tree logs which link a new object that
points directly to the buffer.  Note that the list_head can't directly be
embedded into the extent_buffer structure given that a buffer can be part
of more than one transaction or tree_log.  This also means the existing
error propagation based off eb->log_index never fully worked, as this
index would get overwritten once a buffer is added to a new dirty tree.
The new code instead directly looks at the EXTENT_BUFFER_WRITE_ERR bit,
which doesn't have this problem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/Makefile                |   2 +-
 fs/btrfs/ctree.h                 |   3 +-
 fs/btrfs/disk-io.c               |  72 ++++++-------
 fs/btrfs/extent-io-tree.h        |  10 --
 fs/btrfs/extent-tree.c           |  21 +---
 fs/btrfs/extent_buffer.c         |  99 ++++++++++++++++++
 fs/btrfs/extent_buffer.h         |  23 +++++
 fs/btrfs/extent_io.c             |  59 +----------
 fs/btrfs/extent_io.h             |   2 -
 fs/btrfs/fs.h                    |   3 -
 fs/btrfs/tests/extent-io-tests.c |   2 -
 fs/btrfs/transaction.c           | 168 +++----------------------------
 fs/btrfs/transaction.h           |   3 +-
 fs/btrfs/tree-log.c              |  71 ++++++++-----
 fs/btrfs/zoned.c                 |   4 +-
 include/trace/events/btrfs.h     |  16 +--
 16 files changed, 232 insertions(+), 326 deletions(-)
 create mode 100644 fs/btrfs/extent_buffer.c
 create mode 100644 fs/btrfs/extent_buffer.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 90d53209755bf8..661cead213d6df 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -24,7 +24,7 @@ obj-$(CONFIG_BTRFS_FS) := btrfs.o
 
 btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   file-item.o inode-item.o disk-io.o \
-	   transaction.o inode.o file.o defrag.o \
+	   transaction.o inode.o file.o defrag.o extent_buffer.o \
 	   extent_map.o sysfs.o accessors.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5af61480dde613..5a5fda077a0288 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -202,7 +202,8 @@ struct btrfs_root {
 	struct btrfs_root_item root_item;
 	struct btrfs_key root_key;
 	struct btrfs_fs_info *fs_info;
-	struct extent_io_tree dirty_log_pages;
+	struct list_head dirty_buffers[2];
+	spinlock_t dirty_buffers_lock;
 
 	struct mutex objectid_mutex;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f58806b798cd65..3b81d6ec1b47b8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -24,6 +24,7 @@
 #include "transaction.h"
 #include "btrfs_inode.h"
 #include "bio.h"
+#include "extent_buffer.h"
 #include "print-tree.h"
 #include "locking.h"
 #include "tree-log.h"
@@ -64,9 +65,6 @@ static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 				      struct btrfs_fs_info *fs_info);
 static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root);
-static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
-					struct extent_io_tree *dirty_pages,
-					int mark);
 static int btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
 				       struct extent_io_tree *pinned_extents);
 static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info);
@@ -693,8 +691,9 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	root->last_log_commit = 0;
 	root->anon_dev = 0;
 	if (!dummy) {
-		extent_io_tree_init(fs_info, &root->dirty_log_pages,
-				    IO_TREE_ROOT_DIRTY_LOG_PAGES);
+		INIT_LIST_HEAD(&root->dirty_buffers[0]);
+		INIT_LIST_HEAD(&root->dirty_buffers[1]);
+		spin_lock_init(&root->dirty_buffers_lock);
 		extent_io_tree_init(fs_info, &root->log_csum_range,
 				    IO_TREE_LOG_CSUM_RANGE);
 	}
@@ -4208,18 +4207,16 @@ static void warn_about_uncommitted_trans(struct btrfs_fs_info *fs_info)
 	 */
 	ASSERT(test_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags));
 	list_for_each_entry_safe(trans, tmp, &fs_info->trans_list, list) {
-		struct extent_state *cached = NULL;
+		struct dirty_buffer *db;
 		u64 dirty_bytes = 0;
-		u64 cur = 0;
-		u64 found_start;
-		u64 found_end;
 
 		found = true;
-		while (!find_first_extent_bit(&trans->dirty_pages, cur,
-			&found_start, &found_end, EXTENT_DIRTY, &cached)) {
-			dirty_bytes += found_end + 1 - found_start;
-			cur = found_end + 1;
-		}
+	
+		spin_lock(&trans->dirty_buffers_lock);
+		list_for_each_entry(db, &trans->dirty_buffers, wb_entry)
+			dirty_bytes += db->eb->len;
+		spin_unlock(&trans->dirty_buffers_lock);
+
 		btrfs_warn(fs_info,
 	"transaction %llu (with %llu dirty metadata bytes) is not committed",
 			   trans->transid, dirty_bytes);
@@ -4707,38 +4704,30 @@ static void btrfs_destroy_all_delalloc_inodes(struct btrfs_fs_info *fs_info)
 	spin_unlock(&fs_info->delalloc_root_lock);
 }
 
-static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
-					struct extent_io_tree *dirty_pages,
-					int mark)
+static void btrfs_destroy_dirty_buffers(struct btrfs_fs_info *fs_info,
+					struct btrfs_transaction *cur_trans)
 {
-	int ret;
-	struct extent_buffer *eb;
-	u64 start = 0;
-	u64 end;
+	struct dirty_buffer *db;
 
-	while (1) {
-		ret = find_first_extent_bit(dirty_pages, start, &start, &end,
-					    mark, NULL);
-		if (ret)
-			break;
+	spin_lock(&cur_trans->dirty_buffers_lock);
+	while ((db = list_first_entry_or_null(&cur_trans->dirty_buffers,
+			struct dirty_buffer, wb_entry))) {
+		struct extent_buffer *eb = db->eb;
 
-		clear_extent_bits(dirty_pages, start, end, mark);
-		while (start <= end) {
-			eb = find_extent_buffer(fs_info, start);
-			start += fs_info->nodesize;
-			if (!eb)
-				continue;
+		list_del_init(&db->wb_entry);
+		spin_unlock(&cur_trans->dirty_buffers_lock);
 
-			btrfs_tree_lock(eb);
-			wait_on_extent_buffer_writeback(eb);
-			btrfs_clear_buffer_dirty(NULL, eb);
-			btrfs_tree_unlock(eb);
+		btrfs_tree_lock(eb);
+		wait_on_extent_buffer_writeback(eb);
+		btrfs_clear_buffer_dirty(NULL, eb);
+		btrfs_tree_unlock(eb);
 
-			free_extent_buffer_stale(eb);
-		}
-	}
+		free_extent_buffer_stale(eb);
+		kfree(db);
 
-	return ret;
+		spin_lock(&cur_trans->dirty_buffers_lock);
+	}
+	spin_unlock(&cur_trans->dirty_buffers_lock);
 }
 
 static int btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
@@ -4866,8 +4855,7 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 
 	btrfs_destroy_delayed_inodes(fs_info);
 
-	btrfs_destroy_marked_extents(fs_info, &cur_trans->dirty_pages,
-				     EXTENT_DIRTY);
+	btrfs_destroy_dirty_buffers(fs_info, cur_trans);
 	btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents);
 
 	cur_trans->state =TRANS_STATE_COMPLETED;
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 21766e49ec02e7..0dae0a16a87cb7 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -12,13 +12,11 @@ enum {
 	ENUM_BIT(EXTENT_DIRTY),
 	ENUM_BIT(EXTENT_UPTODATE),
 	ENUM_BIT(EXTENT_LOCKED),
-	ENUM_BIT(EXTENT_NEW),
 	ENUM_BIT(EXTENT_DELALLOC),
 	ENUM_BIT(EXTENT_DEFRAG),
 	ENUM_BIT(EXTENT_BOUNDARY),
 	ENUM_BIT(EXTENT_NODATASUM),
 	ENUM_BIT(EXTENT_CLEAR_META_RESV),
-	ENUM_BIT(EXTENT_NEED_WAIT),
 	ENUM_BIT(EXTENT_NORESERVE),
 	ENUM_BIT(EXTENT_QGROUP_RESERVED),
 	ENUM_BIT(EXTENT_CLEAR_DATA_RESV),
@@ -68,8 +66,6 @@ enum {
 	IO_TREE_BTREE_INODE_IO,
 	IO_TREE_INODE_IO,
 	IO_TREE_RELOC_BLOCKS,
-	IO_TREE_TRANS_DIRTY_PAGES,
-	IO_TREE_ROOT_DIRTY_LOG_PAGES,
 	IO_TREE_INODE_FILE_EXTENT,
 	IO_TREE_LOG_CSUM_RANGE,
 	IO_TREE_SELFTEST,
@@ -210,12 +206,6 @@ static inline int set_extent_defrag(struct extent_io_tree *tree, u64 start,
 			      cached_state, GFP_NOFS);
 }
 
-static inline int set_extent_new(struct extent_io_tree *tree, u64 start,
-		u64 end)
-{
-	return set_extent_bit(tree, start, end, EXTENT_NEW, NULL, GFP_NOFS);
-}
-
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 			  u64 *start_ret, u64 *end_ret, u32 bits,
 			  struct extent_state **cached_state);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 99b9ab5ccf932a..20d7cb51de944f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -17,6 +17,7 @@
 #include <linux/lockdep.h>
 #include <linux/crc32c.h>
 #include "ctree.h"
+#include "extent_buffer.h"
 #include "extent-tree.h"
 #include "tree-log.h"
 #include "disk-io.h"
@@ -4822,23 +4823,9 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_header_owner(buf, owner);
 	write_extent_buffer_fsid(buf, fs_info->fs_devices->metadata_uuid);
 	write_extent_buffer_chunk_tree_uuid(buf, fs_info->chunk_tree_uuid);
-	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
-		buf->log_index = root->log_transid % 2;
-		/*
-		 * we allow two log transactions at a time, use different
-		 * EXTENT bit to differentiate dirty pages.
-		 */
-		if (buf->log_index == 0)
-			set_extent_dirty(&root->dirty_log_pages, buf->start,
-					buf->start + buf->len - 1, GFP_NOFS);
-		else
-			set_extent_new(&root->dirty_log_pages, buf->start,
-					buf->start + buf->len - 1);
-	} else {
-		buf->log_index = -1;
-		set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
-			 buf->start + buf->len - 1, GFP_NOFS);
-	}
+
+	btrfs_add_to_dirty_buffers_list(buf, trans->transaction, root);
+
 	/* this returns a buffer locked for blocking */
 	return buf;
 }
diff --git a/fs/btrfs/extent_buffer.c b/fs/btrfs/extent_buffer.c
new file mode 100644
index 00000000000000..d20fc0d113968f
--- /dev/null
+++ b/fs/btrfs/extent_buffer.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Christoph Hellwig.
+ */
+
+#include <linux/list_sort.h>
+#include "extent_buffer.h"
+#include "extent_io.h"
+#include "fs.h"
+#include "transaction.h"
+
+void btrfs_add_to_dirty_buffers_list(struct extent_buffer *eb,
+				     struct btrfs_transaction *trans,
+				     struct btrfs_root *root)
+{
+	struct dirty_buffer *db = kmalloc(sizeof(*db), GFP_NOFS | __GFP_NOFAIL);
+
+	db->eb = eb;
+	atomic_inc(&eb->refs); /* for the dirty list */
+	if (root && root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		spin_lock(&root->dirty_buffers_lock);
+		list_add_tail(&db->wb_entry,
+			      &root->dirty_buffers[root->log_transid % 2]);
+		spin_unlock(&root->dirty_buffers_lock);
+	} else {
+		spin_lock(&trans->dirty_buffers_lock);
+		list_add_tail(&db->wb_entry, &trans->dirty_buffers);
+		spin_unlock(&trans->dirty_buffers_lock);
+	}
+}
+
+static int extent_buffer_cmp(void *priv, const struct list_head *a,
+			     const struct list_head *b)
+{
+	s64 diff = container_of(a, struct dirty_buffer, wb_entry)->eb->start -
+		container_of(b, struct dirty_buffer, wb_entry)->eb->start;
+
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+int btrfs_write_buffers(struct btrfs_fs_info *fs_info, struct list_head *list)
+{
+	struct address_space *mapping = fs_info->btree_inode->i_mapping;
+	struct dirty_buffer *db;
+	int werr = 0, err;
+
+	list_sort(NULL, list, extent_buffer_cmp);
+
+	list_for_each_entry(db, list, wb_entry) {
+		err = filemap_fdatawrite_range(mapping, db->eb->start,
+					       db->eb->start + db->eb->len - 1);
+		if (err)
+			werr = err;
+		cond_resched();
+	}
+
+	return werr;
+}
+
+int btrfs_wait_buffers(struct btrfs_fs_info *fs_info, struct list_head *list)
+{
+	struct address_space *mapping = fs_info->btree_inode->i_mapping;
+	struct dirty_buffer *db;
+	int werr = 0, err;
+
+	while ((db = list_first_entry_or_null(list, struct dirty_buffer,
+			wb_entry))) {
+		list_del_init(&db->wb_entry);
+
+		err = filemap_fdatawait_range(mapping, db->eb->start,
+					      db->eb->start + db->eb->len - 1);
+		if (err)
+			werr = err;
+		if (!werr && test_bit(EXTENT_BUFFER_WRITE_ERR, &db->eb->bflags))
+			werr = -EIO;
+		free_extent_buffer(db->eb);
+		kfree(db);
+
+		cond_resched();
+	}
+
+	return werr;
+}
+
+void btrfs_release_buffers(struct list_head *list)
+{
+	struct dirty_buffer *db;
+
+	while ((db = list_first_entry_or_null(list, struct dirty_buffer,
+			wb_entry))) {
+		list_del_init(&db->wb_entry);
+		free_extent_buffer(db->eb);
+		kfree(db);
+	}
+}
diff --git a/fs/btrfs/extent_buffer.h b/fs/btrfs/extent_buffer.h
new file mode 100644
index 00000000000000..880b325d7b411a
--- /dev/null
+++ b/fs/btrfs/extent_buffer.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BTRFS_EXTENT_BUFFER_H
+#define BTRFS_EXTENT_BUFFER_H
+
+struct extent_buffer;
+struct btrfs_fs_info;
+struct btrfs_root;
+struct btrfs_transaction;
+struct list_head;
+
+struct dirty_buffer {
+	struct list_head wb_entry;
+	struct extent_buffer *eb;
+};
+
+void btrfs_add_to_dirty_buffers_list(struct extent_buffer *eb,
+				     struct btrfs_transaction *trans,
+				     struct btrfs_root *root);
+int btrfs_write_buffers(struct btrfs_fs_info *fs_info, struct list_head *list);
+int btrfs_wait_buffers(struct btrfs_fs_info *fs_info, struct list_head *list);
+void btrfs_release_buffers(struct list_head *list);
+
+#endif /* BTRFS_EXTENT_BUFFER_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5999ac3ee601db..d5937ed0962d38 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1659,8 +1659,11 @@ static bool lock_extent_buffer_for_io(struct extent_buffer *eb,
 
 static void set_btree_ioerr(struct extent_buffer *eb)
 {
-	struct btrfs_fs_info *fs_info = eb->fs_info;
-
+	/*
+	 * Propagate this error to btrfs_wait_buffers in case the buffer got
+	 * written out by optimistic VM flushing and not the explicit writeback
+	 * from btrfs_write_buffers().
+	 */
 	set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 
 	/*
@@ -1676,58 +1679,6 @@ static void set_btree_ioerr(struct extent_buffer *eb)
 	 * the superblock.
 	 */
 	mapping_set_error(eb->fs_info->btree_inode->i_mapping, -EIO);
-
-	/*
-	 * If writeback for a btree extent that doesn't belong to a log tree
-	 * failed, increment the counter transaction->eb_write_errors.
-	 * We do this because while the transaction is running and before it's
-	 * committing (when we call filemap_fdata[write|wait]_range against
-	 * the btree inode), we might have
-	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
-	 * returns an error or an error happens during writeback, when we're
-	 * committing the transaction we wouldn't know about it, since the pages
-	 * can be no longer dirty nor marked anymore for writeback (if a
-	 * subsequent modification to the extent buffer didn't happen before the
-	 * transaction commit), which makes filemap_fdata[write|wait]_range not
-	 * able to find the pages tagged with SetPageError at transaction
-	 * commit time. So if this happens we must abort the transaction,
-	 * otherwise we commit a super block with btree roots that point to
-	 * btree nodes/leafs whose content on disk is invalid - either garbage
-	 * or the content of some node/leaf from a past generation that got
-	 * cowed or deleted and is no longer valid.
-	 *
-	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
-	 * not be enough - we need to distinguish between log tree extents vs
-	 * non-log tree extents, and the next filemap_fdatawait_range() call
-	 * will catch and clear such errors in the mapping - and that call might
-	 * be from a log sync and not from a transaction commit. Also, checking
-	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
-	 * not done and would not be reliable - the eb might have been released
-	 * from memory and reading it back again means that flag would not be
-	 * set (since it's a runtime flag, not persisted on disk).
-	 *
-	 * Using the flags below in the btree inode also makes us achieve the
-	 * goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
-	 * writeback for all dirty pages and before filemap_fdatawait_range()
-	 * is called, the writeback for all dirty pages had already finished
-	 * with errors - because we were not using AS_EIO/AS_ENOSPC,
-	 * filemap_fdatawait_range() would return success, as it could not know
-	 * that writeback errors happened (the pages were no longer tagged for
-	 * writeback).
-	 */
-	switch (eb->log_index) {
-	case -1:
-		set_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags);
-		break;
-	case 0:
-		set_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
-		break;
-	case 1:
-		set_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
-		break;
-	default:
-		BUG(); /* unexpected, logic error */
-	}
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index db9148bafd02c3..7c0e638346017e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -83,8 +83,6 @@ struct extent_buffer {
 	int read_mirror;
 	struct rcu_head rcu_head;
 	pid_t lock_owner;
-	/* >= 0 if eb belongs to a log tree, -1 otherwise */
-	s8 log_index;
 
 	struct rw_semaphore lock;
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 840e4def18b519..f71281408102c3 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -81,9 +81,6 @@ enum {
 	BTRFS_FS_QUOTA_ENABLED,
 	BTRFS_FS_UPDATE_UUID_TREE_GEN,
 	BTRFS_FS_CREATING_FREE_SPACE_TREE,
-	BTRFS_FS_BTREE_ERR,
-	BTRFS_FS_LOG1_ERR,
-	BTRFS_FS_LOG2_ERR,
 	BTRFS_FS_QUOTA_OVERRIDE,
 	/* Used to record internally whether fs has been frozen */
 	BTRFS_FS_FROZEN,
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index dfc5c7fa603899..df7f526e0718e0 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -75,13 +75,11 @@ static void extent_flag_to_str(const struct extent_state *state, char *dest)
 	PRINT_ONE_FLAG(state, dest, cur, DIRTY);
 	PRINT_ONE_FLAG(state, dest, cur, UPTODATE);
 	PRINT_ONE_FLAG(state, dest, cur, LOCKED);
-	PRINT_ONE_FLAG(state, dest, cur, NEW);
 	PRINT_ONE_FLAG(state, dest, cur, DELALLOC);
 	PRINT_ONE_FLAG(state, dest, cur, DEFRAG);
 	PRINT_ONE_FLAG(state, dest, cur, BOUNDARY);
 	PRINT_ONE_FLAG(state, dest, cur, NODATASUM);
 	PRINT_ONE_FLAG(state, dest, cur, CLEAR_META_RESV);
-	PRINT_ONE_FLAG(state, dest, cur, NEED_WAIT);
 	PRINT_ONE_FLAG(state, dest, cur, NORESERVE);
 	PRINT_ONE_FLAG(state, dest, cur, QGROUP_RESERVED);
 	PRINT_ONE_FLAG(state, dest, cur, CLEAR_DATA_RESV);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index fe0f00e717a834..5353bfd8309425 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2007 Oracle.  All rights reserved.
  */
 
+#include <linux/list_sort.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -12,6 +13,7 @@
 #include <linux/blkdev.h>
 #include <linux/uuid.h>
 #include <linux/timekeeping.h>
+#include "extent_buffer.h"
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -191,7 +193,8 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
 		list_del_init(&root->dirty_list);
 		free_extent_buffer(root->commit_root);
 		root->commit_root = btrfs_root_node(root);
-		extent_io_tree_release(&root->dirty_log_pages);
+		ASSERT(list_empty_careful(&root->dirty_buffers[0]));
+		ASSERT(list_empty_careful(&root->dirty_buffers[1]));
 		btrfs_qgroup_clean_swapped_blocks(root);
 	}
 
@@ -375,8 +378,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
-	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
-			IO_TREE_TRANS_DIRTY_PAGES);
+	INIT_LIST_HEAD(&cur_trans->dirty_buffers);
+	spin_lock_init(&cur_trans->dirty_buffers_lock);
 	extent_io_tree_init(fs_info, &cur_trans->pinned_extents,
 			IO_TREE_FS_PINNED_EXTENTS);
 	fs_info->generation++;
@@ -1039,141 +1042,6 @@ int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans)
 	return __btrfs_end_transaction(trans, 1);
 }
 
-/*
- * when btree blocks are allocated, they have some corresponding bits set for
- * them in one of two extent_io trees.  This is used to make sure all of
- * those extents are sent to disk but does not wait on them
- */
-int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
-			       struct extent_io_tree *dirty_pages, int mark)
-{
-	int err = 0;
-	int werr = 0;
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
-	struct extent_state *cached_state = NULL;
-	u64 start = 0;
-	u64 end;
-
-	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
-				      mark, &cached_state)) {
-		bool wait_writeback = false;
-
-		err = convert_extent_bit(dirty_pages, start, end,
-					 EXTENT_NEED_WAIT,
-					 mark, &cached_state);
-		/*
-		 * convert_extent_bit can return -ENOMEM, which is most of the
-		 * time a temporary error. So when it happens, ignore the error
-		 * and wait for writeback of this range to finish - because we
-		 * failed to set the bit EXTENT_NEED_WAIT for the range, a call
-		 * to __btrfs_wait_marked_extents() would not know that
-		 * writeback for this range started and therefore wouldn't
-		 * wait for it to finish - we don't want to commit a
-		 * superblock that points to btree nodes/leafs for which
-		 * writeback hasn't finished yet (and without errors).
-		 * We cleanup any entries left in the io tree when committing
-		 * the transaction (through extent_io_tree_release()).
-		 */
-		if (err == -ENOMEM) {
-			err = 0;
-			wait_writeback = true;
-		}
-		if (!err)
-			err = filemap_fdatawrite_range(mapping, start, end);
-		if (err)
-			werr = err;
-		else if (wait_writeback)
-			werr = filemap_fdatawait_range(mapping, start, end);
-		free_extent_state(cached_state);
-		cached_state = NULL;
-		cond_resched();
-		start = end + 1;
-	}
-	return werr;
-}
-
-/*
- * when btree blocks are allocated, they have some corresponding bits set for
- * them in one of two extent_io trees.  This is used to make sure all of
- * those extents are on disk for transaction or log commit.  We wait
- * on all the pages and clear them from the dirty pages state tree
- */
-static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
-				       struct extent_io_tree *dirty_pages)
-{
-	int err = 0;
-	int werr = 0;
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
-	struct extent_state *cached_state = NULL;
-	u64 start = 0;
-	u64 end;
-
-	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
-				      EXTENT_NEED_WAIT, &cached_state)) {
-		/*
-		 * Ignore -ENOMEM errors returned by clear_extent_bit().
-		 * When committing the transaction, we'll remove any entries
-		 * left in the io tree. For a log commit, we don't remove them
-		 * after committing the log because the tree can be accessed
-		 * concurrently - we do it only at transaction commit time when
-		 * it's safe to do it (through extent_io_tree_release()).
-		 */
-		err = clear_extent_bit(dirty_pages, start, end,
-				       EXTENT_NEED_WAIT, &cached_state);
-		if (err == -ENOMEM)
-			err = 0;
-		if (!err)
-			err = filemap_fdatawait_range(mapping, start, end);
-		if (err)
-			werr = err;
-		free_extent_state(cached_state);
-		cached_state = NULL;
-		cond_resched();
-		start = end + 1;
-	}
-	if (err)
-		werr = err;
-	return werr;
-}
-
-static int btrfs_wait_extents(struct btrfs_fs_info *fs_info,
-		       struct extent_io_tree *dirty_pages)
-{
-	bool errors = false;
-	int err;
-
-	err = __btrfs_wait_marked_extents(fs_info, dirty_pages);
-	if (test_and_clear_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags))
-		errors = true;
-
-	if (errors && !err)
-		err = -EIO;
-	return err;
-}
-
-int btrfs_wait_tree_log_extents(struct btrfs_root *log_root, int mark)
-{
-	struct btrfs_fs_info *fs_info = log_root->fs_info;
-	struct extent_io_tree *dirty_pages = &log_root->dirty_log_pages;
-	bool errors = false;
-	int err;
-
-	ASSERT(log_root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
-
-	err = __btrfs_wait_marked_extents(fs_info, dirty_pages);
-	if ((mark & EXTENT_DIRTY) &&
-	    test_and_clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags))
-		errors = true;
-
-	if ((mark & EXTENT_NEW) &&
-	    test_and_clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags))
-		errors = true;
-
-	if (errors && !err)
-		err = -EIO;
-	return err;
-}
-
 /*
  * When btree blocks are allocated the corresponding extents are marked dirty.
  * This function ensures such extents are persisted on disk for transaction or
@@ -1183,25 +1051,22 @@ int btrfs_wait_tree_log_extents(struct btrfs_root *log_root, int mark)
  */
 static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans)
 {
-	int ret;
-	int ret2;
-	struct extent_io_tree *dirty_pages = &trans->transaction->dirty_pages;
-	struct btrfs_fs_info *fs_info = trans->fs_info;
+	LIST_HEAD(dirty_buffers);
 	struct blk_plug plug;
+	int ret, ret2;
+
+	spin_lock(&trans->transaction->dirty_buffers_lock);
+	list_splice_init(&trans->transaction->dirty_buffers, &dirty_buffers);
+	spin_unlock(&trans->transaction->dirty_buffers_lock);
 
 	blk_start_plug(&plug);
-	ret = btrfs_write_marked_extents(fs_info, dirty_pages, EXTENT_DIRTY);
+	ret = btrfs_write_buffers(trans->fs_info, &dirty_buffers);
 	blk_finish_plug(&plug);
-	ret2 = btrfs_wait_extents(fs_info, dirty_pages);
-
-	extent_io_tree_release(&trans->transaction->dirty_pages);
 
+	ret2 = btrfs_wait_buffers(trans->fs_info, &dirty_buffers);
 	if (ret)
 		return ret;
-	else if (ret2)
-		return ret2;
-	else
-		return 0;
+	return ret2;
 }
 
 /*
@@ -2443,9 +2308,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	btrfs_commit_device_sizes(cur_trans);
 
-	clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
-	clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
-
 	btrfs_trans_release_chunk_metadata(trans);
 
 	/*
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 8e9fa23bd7fed7..eade7eb41a07ef 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -47,7 +47,8 @@ struct btrfs_transaction {
 	enum btrfs_trans_state state;
 	int aborted;
 	struct list_head list;
-	struct extent_io_tree dirty_pages;
+	struct list_head dirty_buffers;
+	spinlock_t dirty_buffers_lock;
 	time64_t start_time;
 	wait_queue_head_t writer_wait;
 	wait_queue_head_t commit_wait;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9b212e8c70cc58..012b7e03d57968 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -8,6 +8,7 @@
 #include <linux/blkdev.h>
 #include <linux/list_sort.h>
 #include <linux/iversion.h>
+#include "extent_buffer.h"
 #include "misc.h"
 #include "ctree.h"
 #include "tree-log.h"
@@ -2847,6 +2848,24 @@ static inline void btrfs_remove_all_log_ctxs(struct btrfs_root *root,
 	}
 }
 
+static void btrfs_pick_all_dirty_log_buffers(struct btrfs_root *log,
+		struct list_head *dirty_buffers)
+{
+	spin_lock(&log->dirty_buffers_lock);
+	list_splice_tail_init(&log->dirty_buffers[0], dirty_buffers);
+	list_splice_tail_init(&log->dirty_buffers[1], dirty_buffers);
+	spin_unlock(&log->dirty_buffers_lock);
+}
+
+static void btrfs_pick_dirty_log_buffers(struct btrfs_root *log,
+					 struct list_head *dirty_buffers,
+					 int log_transid)
+{
+	spin_lock(&log->dirty_buffers_lock);
+	list_splice_init(&log->dirty_buffers[log_transid % 2], dirty_buffers);
+	spin_unlock(&log->dirty_buffers_lock);
+}
+
 /*
  * btrfs_sync_log does sends a given tree log down to the disk and
  * updates the super blocks to record it.  When this call is done,
@@ -2864,7 +2883,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 {
 	int index1;
 	int index2;
-	int mark;
 	int ret;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *log = root->log_root;
@@ -2875,6 +2893,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	struct blk_plug plug;
 	u64 log_root_start;
 	u64 log_root_level;
+	LIST_HEAD(dirty_log_buffers);
+	LIST_HEAD(dirty_log_root_buffers);
 
 	mutex_lock(&root->log_mutex);
 	log_transid = ctx->log_transid;
@@ -2917,16 +2937,15 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
-	if (log_transid % 2 == 0)
-		mark = EXTENT_DIRTY;
-	else
-		mark = EXTENT_NEW;
+	blk_start_plug(&plug);
 
-	/* we start IO on  all the marked extents here, but we don't actually
-	 * wait for them until later.
+	/*
+	 * Start I/O on all dirty buffers, but don't actually wait for them
+	 * until later.
 	 */
-	blk_start_plug(&plug);
-	ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
+	btrfs_pick_dirty_log_buffers(log, &dirty_log_buffers, log_transid);
+	ret = btrfs_write_buffers(fs_info, &dirty_log_buffers);
+
 	/*
 	 * -EAGAIN happens when someone, e.g., a concurrent transaction
 	 *  commit, writes a dirty extent in this tree-log commit. This
@@ -3008,7 +3027,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 			btrfs_err(fs_info,
 				  "failed to update log for root %llu ret %d",
 				  root->root_key.objectid, ret);
-		btrfs_wait_tree_log_extents(log, mark);
+		btrfs_wait_buffers(fs_info, &dirty_log_buffers);
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out;
 	}
@@ -3024,7 +3043,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
 		blk_finish_plug(&plug);
-		ret = btrfs_wait_tree_log_extents(log, mark);
+		ret = btrfs_wait_buffers(fs_info, &dirty_log_buffers);
 		wait_log_commit(log_root_tree,
 				root_log_ctx.log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -3046,15 +3065,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	if (btrfs_need_log_full_commit(trans)) {
 		blk_finish_plug(&plug);
-		btrfs_wait_tree_log_extents(log, mark);
+		btrfs_wait_buffers(fs_info, &dirty_log_buffers);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = BTRFS_LOG_FORCE_COMMIT;
 		goto out_wake_log_root;
 	}
 
-	ret = btrfs_write_marked_extents(fs_info,
-					 &log_root_tree->dirty_log_pages,
-					 EXTENT_DIRTY | EXTENT_NEW);
+	btrfs_pick_all_dirty_log_buffers(log_root_tree, &dirty_log_root_buffers);
+	ret = btrfs_write_buffers(fs_info, &dirty_log_root_buffers);
 	blk_finish_plug(&plug);
 	/*
 	 * As described above, -EAGAIN indicates a hole in the extents. We
@@ -3063,7 +3081,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	if (ret == -EAGAIN && btrfs_is_zoned(fs_info)) {
 		btrfs_set_log_full_commit(trans);
-		btrfs_wait_tree_log_extents(log, mark);
+		btrfs_wait_buffers(fs_info, &dirty_log_buffers);
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	} else if (ret) {
@@ -3071,10 +3089,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
-	ret = btrfs_wait_tree_log_extents(log, mark);
+	ret = btrfs_wait_buffers(fs_info, &dirty_log_buffers);
 	if (!ret)
-		ret = btrfs_wait_tree_log_extents(log_root_tree,
-						  EXTENT_NEW | EXTENT_DIRTY);
+		ret = btrfs_wait_buffers(fs_info, &dirty_log_root_buffers);
 	if (ret) {
 		btrfs_set_log_full_commit(trans);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -3160,6 +3177,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	atomic_set(&root->log_commit[index1], 0);
 	mutex_unlock(&root->log_mutex);
 
+	btrfs_release_buffers(&dirty_log_buffers);
+	btrfs_release_buffers(&dirty_log_root_buffers);
+
 	/*
 	 * The barrier before waitqueue_active (in cond_wake_up) is needed so
 	 * all the updates above are seen by the woken threads. It might not be
@@ -3177,6 +3197,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		.free = 1,
 		.process_func = process_one_buffer
 	};
+	LIST_HEAD(dirty_buffers);
 
 	if (log->node) {
 		ret = walk_log_tree(trans, log, &wc);
@@ -3198,11 +3219,9 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 			 * them out and wait for their writeback to complete, so
 			 * that we properly cleanup their state and pages.
 			 */
-			btrfs_write_marked_extents(log->fs_info,
-						   &log->dirty_log_pages,
-						   EXTENT_DIRTY | EXTENT_NEW);
-			btrfs_wait_tree_log_extents(log,
-						    EXTENT_DIRTY | EXTENT_NEW);
+			btrfs_pick_all_dirty_log_buffers(log, &dirty_buffers);
+			btrfs_write_buffers(log->fs_info, &dirty_buffers);
+			btrfs_wait_buffers(log->fs_info, &dirty_buffers);
 
 			if (trans)
 				btrfs_abort_transaction(trans, ret);
@@ -3211,8 +3230,8 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
-			  EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+	btrfs_pick_all_dirty_log_buffers(log, &dirty_buffers);
+	btrfs_release_buffers(&dirty_buffers);
 	extent_io_tree_release(&log->log_csum_range);
 
 	btrfs_put_root(log);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index bda301a55cbe3b..e4b8134ab70166 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -6,6 +6,7 @@
 #include <linux/sched/mm.h>
 #include <linux/atomic.h>
 #include <linux/vmalloc.h>
+#include "extent_buffer.h"
 #include "ctree.h"
 #include "volumes.h"
 #include "zoned.h"
@@ -1611,8 +1612,7 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 	memzero_extent_buffer(eb, 0, eb->len);
 	set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
 	set_extent_buffer_dirty(eb);
-	set_extent_bits_nowait(&trans->dirty_pages, eb->start,
-			       eb->start + eb->len - 1, EXTENT_DIRTY);
+	btrfs_add_to_dirty_buffers_list(eb, trans, NULL);
 }
 
 bool btrfs_use_zone_append(struct btrfs_bio *bbio)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8ea9cea9bfeb4d..aae33b9067b918 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -86,8 +86,6 @@ struct find_free_extent_ctl;
 	EM( IO_TREE_BTREE_INODE_IO,	  "BTREE_INODE_IO")	    \
 	EM( IO_TREE_INODE_IO,		  "INODE_IO")		    \
 	EM( IO_TREE_RELOC_BLOCKS,	  "RELOC_BLOCKS")	    \
-	EM( IO_TREE_TRANS_DIRTY_PAGES,	  "TRANS_DIRTY_PAGES")      \
-	EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \
 	EM( IO_TREE_INODE_FILE_EXTENT,	  "INODE_FILE_EXTENT")      \
 	EM( IO_TREE_LOG_CSUM_RANGE,	  "LOG_CSUM_RANGE")         \
 	EMe(IO_TREE_SELFTEST,		  "SELFTEST")
@@ -147,13 +145,11 @@ FLUSH_STATES
 	{ EXTENT_DIRTY,			"DIRTY"},		\
 	{ EXTENT_UPTODATE,		"UPTODATE"},		\
 	{ EXTENT_LOCKED,		"LOCKED"},		\
-	{ EXTENT_NEW,			"NEW"},			\
 	{ EXTENT_DELALLOC,		"DELALLOC"},		\
 	{ EXTENT_DEFRAG,		"DEFRAG"},		\
 	{ EXTENT_BOUNDARY,		"BOUNDARY"},		\
 	{ EXTENT_NODATASUM,		"NODATASUM"},		\
 	{ EXTENT_CLEAR_META_RESV,	"CLEAR_META_RESV"},	\
-	{ EXTENT_NEED_WAIT,		"NEED_WAIT"},		\
 	{ EXTENT_NORESERVE,		"NORESERVE"},		\
 	{ EXTENT_QGROUP_RESERVED,	"QGROUP_RESERVED"},	\
 	{ EXTENT_CLEAR_DATA_RESV,	"CLEAR_DATA_RESV"},	\
@@ -2289,7 +2285,6 @@ DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
 		__field(	u64,	end_ns		)
 		__field(	u64,	diff_ns		)
 		__field(	u64,	owner		)
-		__field(	int,	is_log_tree	)
 	),
 
 	TP_fast_assign_btrfs(eb->fs_info,
@@ -2299,14 +2294,13 @@ DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
 		__entry->end_ns		= ktime_get_ns();
 		__entry->diff_ns	= __entry->end_ns - start_ns;
 		__entry->owner		= btrfs_header_owner(eb);
-		__entry->is_log_tree	= (eb->log_index >= 0);
 	),
 
 	TP_printk_btrfs(
-"block=%llu generation=%llu start_ns=%llu end_ns=%llu diff_ns=%llu owner=%llu is_log_tree=%d",
+"block=%llu generation=%llu start_ns=%llu end_ns=%llu diff_ns=%llu owner=%llu",
 		__entry->block, __entry->generation,
 		__entry->start_ns, __entry->end_ns, __entry->diff_ns,
-		__entry->owner, __entry->is_log_tree)
+		__entry->owner)
 );
 
 DEFINE_EVENT(btrfs_sleep_tree_lock, btrfs_tree_read_lock,
@@ -2330,19 +2324,17 @@ DECLARE_EVENT_CLASS(btrfs_locking_events,
 		__field(	u64,	block		)
 		__field(	u64,	generation	)
 		__field(	u64,	owner		)
-		__field(	int,	is_log_tree	)
 	),
 
 	TP_fast_assign_btrfs(eb->fs_info,
 		__entry->block		= eb->start;
 		__entry->generation	= btrfs_header_generation(eb);
 		__entry->owner		= btrfs_header_owner(eb);
-		__entry->is_log_tree	= (eb->log_index >= 0);
 	),
 
-	TP_printk_btrfs("block=%llu generation=%llu owner=%llu is_log_tree=%d",
+	TP_printk_btrfs("block=%llu generation=%llu owner=%llu",
 		__entry->block, __entry->generation,
-		__entry->owner, __entry->is_log_tree)
+		__entry->owner)
 );
 
 #define DEFINE_BTRFS_LOCK_EVENT(name)				\
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/6] btrfs: remove convert_extent_bit
  2023-05-15 19:22 simplify dirty buffer tracking Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers Christoph Hellwig
@ 2023-05-15 19:22 ` Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 3/6] btrfs: directly wait for buffer writeback completion in btrfs_wait_buffers Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-15 19:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs

The last user of this function just went away, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent-io-tree.c    | 212 -----------------------------------
 fs/btrfs/extent-io-tree.h    |   4 -
 include/trace/events/btrfs.h |  41 -------
 3 files changed, 257 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 29a225836e286e..bfda8e1debee1e 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1194,218 +1194,6 @@ int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 				cached_state, NULL, mask);
 }
 
-/*
- * Convert all bits in a given range from one bit to another
- *
- * @tree:	the io tree to search
- * @start:	the start offset in bytes
- * @end:	the end offset in bytes (inclusive)
- * @bits:	the bits to set in this range
- * @clear_bits:	the bits to clear in this range
- * @cached_state:	state that we're going to cache
- *
- * This will go through and set bits for the given range.  If any states exist
- * already in this range they are set with the given bit and cleared of the
- * clear_bits.  This is only meant to be used by things that are mergeable, ie.
- * converting from say DELALLOC to DIRTY.  This is not meant to be used with
- * boundary bits like LOCK.
- *
- * All allocations are done with GFP_NOFS.
- */
-int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		       u32 bits, u32 clear_bits,
-		       struct extent_state **cached_state)
-{
-	struct extent_state *state;
-	struct extent_state *prealloc = NULL;
-	struct rb_node **p = NULL;
-	struct rb_node *parent = NULL;
-	int err = 0;
-	u64 last_start;
-	u64 last_end;
-	bool first_iteration = true;
-
-	btrfs_debug_check_extent_io_range(tree, start, end);
-	trace_btrfs_convert_extent_bit(tree, start, end - start + 1, bits,
-				       clear_bits);
-
-again:
-	if (!prealloc) {
-		/*
-		 * Best effort, don't worry if extent state allocation fails
-		 * here for the first iteration. We might have a cached state
-		 * that matches exactly the target range, in which case no
-		 * extent state allocations are needed. We'll only know this
-		 * after locking the tree.
-		 */
-		prealloc = alloc_extent_state(GFP_NOFS);
-		if (!prealloc && !first_iteration)
-			return -ENOMEM;
-	}
-
-	spin_lock(&tree->lock);
-	if (cached_state && *cached_state) {
-		state = *cached_state;
-		if (state->start <= start && state->end > start &&
-		    extent_state_in_tree(state))
-			goto hit_next;
-	}
-
-	/*
-	 * This search will find all the extents that end after our range
-	 * starts.
-	 */
-	state = tree_search_for_insert(tree, start, &p, &parent);
-	if (!state) {
-		prealloc = alloc_extent_state_atomic(prealloc);
-		if (!prealloc) {
-			err = -ENOMEM;
-			goto out;
-		}
-		prealloc->start = start;
-		prealloc->end = end;
-		insert_state_fast(tree, prealloc, p, parent, bits, NULL);
-		cache_state(prealloc, cached_state);
-		prealloc = NULL;
-		goto out;
-	}
-hit_next:
-	last_start = state->start;
-	last_end = state->end;
-
-	/*
-	 * | ---- desired range ---- |
-	 * | state |
-	 *
-	 * Just lock what we found and keep going.
-	 */
-	if (state->start == start && state->end <= end) {
-		set_state_bits(tree, state, bits, NULL);
-		cache_state(state, cached_state);
-		state = clear_state_bit(tree, state, clear_bits, 0, NULL);
-		if (last_end == (u64)-1)
-			goto out;
-		start = last_end + 1;
-		if (start < end && state && state->start == start &&
-		    !need_resched())
-			goto hit_next;
-		goto search_again;
-	}
-
-	/*
-	 *     | ---- desired range ---- |
-	 * | state |
-	 *   or
-	 * | ------------- state -------------- |
-	 *
-	 * We need to split the extent we found, and may flip bits on second
-	 * half.
-	 *
-	 * If the extent we found extends past our range, we just split and
-	 * search again.  It'll get split again the next time though.
-	 *
-	 * If the extent we found is inside our range, we set the desired bit
-	 * on it.
-	 */
-	if (state->start < start) {
-		prealloc = alloc_extent_state_atomic(prealloc);
-		if (!prealloc) {
-			err = -ENOMEM;
-			goto out;
-		}
-		err = split_state(tree, state, prealloc, start);
-		if (err)
-			extent_io_tree_panic(tree, err);
-		prealloc = NULL;
-		if (err)
-			goto out;
-		if (state->end <= end) {
-			set_state_bits(tree, state, bits, NULL);
-			cache_state(state, cached_state);
-			state = clear_state_bit(tree, state, clear_bits, 0, NULL);
-			if (last_end == (u64)-1)
-				goto out;
-			start = last_end + 1;
-			if (start < end && state && state->start == start &&
-			    !need_resched())
-				goto hit_next;
-		}
-		goto search_again;
-	}
-	/*
-	 * | ---- desired range ---- |
-	 *     | state | or               | state |
-	 *
-	 * There's a hole, we need to insert something in it and ignore the
-	 * extent we found.
-	 */
-	if (state->start > start) {
-		u64 this_end;
-		if (end < last_start)
-			this_end = end;
-		else
-			this_end = last_start - 1;
-
-		prealloc = alloc_extent_state_atomic(prealloc);
-		if (!prealloc) {
-			err = -ENOMEM;
-			goto out;
-		}
-
-		/*
-		 * Avoid to free 'prealloc' if it can be merged with the later
-		 * extent.
-		 */
-		prealloc->start = start;
-		prealloc->end = this_end;
-		err = insert_state(tree, prealloc, bits, NULL);
-		if (err)
-			extent_io_tree_panic(tree, err);
-		cache_state(prealloc, cached_state);
-		prealloc = NULL;
-		start = this_end + 1;
-		goto search_again;
-	}
-	/*
-	 * | ---- desired range ---- |
-	 *                        | state |
-	 *
-	 * We need to split the extent, and set the bit on the first half.
-	 */
-	if (state->start <= end && state->end > end) {
-		prealloc = alloc_extent_state_atomic(prealloc);
-		if (!prealloc) {
-			err = -ENOMEM;
-			goto out;
-		}
-
-		err = split_state(tree, state, prealloc, end + 1);
-		if (err)
-			extent_io_tree_panic(tree, err);
-
-		set_state_bits(tree, prealloc, bits, NULL);
-		cache_state(prealloc, cached_state);
-		clear_state_bit(tree, prealloc, clear_bits, 0, NULL);
-		prealloc = NULL;
-		goto out;
-	}
-
-search_again:
-	if (start > end)
-		goto out;
-	spin_unlock(&tree->lock);
-	cond_resched();
-	first_iteration = false;
-	goto again;
-
-out:
-	spin_unlock(&tree->lock);
-	if (prealloc)
-		free_extent_state(prealloc);
-
-	return err;
-}
-
 /*
  * Find the first range that has @bits not set. This range could start before
  * @start.
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 0dae0a16a87cb7..1567ce11c10c9a 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -185,10 +185,6 @@ static inline int clear_extent_dirty(struct extent_io_tree *tree, u64 start,
 				EXTENT_DO_ACCOUNTING, cached);
 }
 
-int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		       u32 bits, u32 clear_bits,
-		       struct extent_state **cached_state);
-
 static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
 				      u64 end, u32 extra_bits,
 				      struct extent_state **cached_state)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index aae33b9067b918..4f4a2006b4d5e3 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2129,47 +2129,6 @@ TRACE_EVENT(btrfs_clear_extent_bit,
 		__print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
 );
 
-TRACE_EVENT(btrfs_convert_extent_bit,
-	TP_PROTO(const struct extent_io_tree *tree,
-		 u64 start, u64 len, unsigned set_bits, unsigned clear_bits),
-
-	TP_ARGS(tree, start, len, set_bits, clear_bits),
-
-	TP_STRUCT__entry_btrfs(
-		__field(	unsigned,	owner	)
-		__field(	u64,		ino	)
-		__field(	u64,		rootid	)
-		__field(	u64,		start	)
-		__field(	u64,		len	)
-		__field(	unsigned,	set_bits)
-		__field(	unsigned,	clear_bits)
-	),
-
-	TP_fast_assign_btrfs(tree->fs_info,
-		__entry->owner = tree->owner;
-		if (tree->inode) {
-			const struct btrfs_inode *inode = tree->inode;
-
-			__entry->ino	= btrfs_ino(inode);
-			__entry->rootid	= inode->root->root_key.objectid;
-		} else {
-			__entry->ino	= 0;
-			__entry->rootid	= 0;
-		}
-		__entry->start		= start;
-		__entry->len		= len;
-		__entry->set_bits	= set_bits;
-		__entry->clear_bits	= clear_bits;
-	),
-
-	TP_printk_btrfs(
-"io_tree=%s ino=%llu root=%llu start=%llu len=%llu set_bits=%s clear_bits=%s",
-		  __print_symbolic(__entry->owner, IO_TREE_OWNER), __entry->ino,
-		  __entry->rootid, __entry->start, __entry->len,
-		  __print_flags(__entry->set_bits , "|", EXTENT_FLAGS),
-		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
-);
-
 DECLARE_EVENT_CLASS(btrfs_dump_space_info,
 	TP_PROTO(struct btrfs_fs_info *fs_info,
 		 const struct btrfs_space_info *sinfo),
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/6] btrfs: directly wait for buffer writeback completion in btrfs_wait_buffers
  2023-05-15 19:22 simplify dirty buffer tracking Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 2/6] btrfs: remove convert_extent_bit Christoph Hellwig
@ 2023-05-15 19:22 ` Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 4/6] btrfs: move dropping the bg reference out of submit_eb_page Christoph Hellwig
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-15 19:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs

Don't bother to got to filemap_fdatawait_range to look up each page in an
extent_buffer and then wait for writeback completion on each page, but
instead just call wait_on_extent_buffer_writeback on the buffer.  All
write errors are propagated through the EXTENT_BUFFER_WRITE_ERR bit on
the extent_buffer itself, so there is no need to check for per-page
errors either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_buffer.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_buffer.c b/fs/btrfs/extent_buffer.c
index d20fc0d113968f..70a6a5f2e0e0a8 100644
--- a/fs/btrfs/extent_buffer.c
+++ b/fs/btrfs/extent_buffer.c
@@ -63,19 +63,15 @@ int btrfs_write_buffers(struct btrfs_fs_info *fs_info, struct list_head *list)
 
 int btrfs_wait_buffers(struct btrfs_fs_info *fs_info, struct list_head *list)
 {
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct dirty_buffer *db;
-	int werr = 0, err;
+	int werr = 0;
 
 	while ((db = list_first_entry_or_null(list, struct dirty_buffer,
 			wb_entry))) {
 		list_del_init(&db->wb_entry);
 
-		err = filemap_fdatawait_range(mapping, db->eb->start,
-					      db->eb->start + db->eb->len - 1);
-		if (err)
-			werr = err;
-		if (!werr && test_bit(EXTENT_BUFFER_WRITE_ERR, &db->eb->bflags))
+		wait_on_extent_buffer_writeback(db->eb);
+		if (test_bit(EXTENT_BUFFER_WRITE_ERR, &db->eb->bflags))
 			werr = -EIO;
 		free_extent_buffer(db->eb);
 		kfree(db);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/6] btrfs: move dropping the bg reference out of submit_eb_page
  2023-05-15 19:22 simplify dirty buffer tracking Christoph Hellwig
                   ` (2 preceding siblings ...)
  2023-05-15 19:22 ` [PATCH 3/6] btrfs: directly wait for buffer writeback completion in btrfs_wait_buffers Christoph Hellwig
@ 2023-05-15 19:22 ` Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 5/6] btrfs: move locking and write pointer checking into write_one_eb Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 6/6] btrfs: bypass filemap_fdatawrite_range in btrfs_write_buffers Christoph Hellwig
  5 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-15 19:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs

Instead of putting the cached bg for zoned metadata writes in
submit_eb_page, let the btrfs_revert_meta_write_pointer and
btrfs_schedule_zone_finish_bg helpers consume it.  This mirrors how the
reference to it is acquired in btrfs_check_meta_write_pointer and
isolated the extent_buffer writeback code from some of the zoned
device implementation details.  It also avoids a reference roundtrip
for the case where btrfs_schedule_zone_finish_bg actually schedules
a zone_finish command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 3 ---
 fs/btrfs/zoned.c     | 7 +++++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d5937ed0962d38..1205b3a3147e7d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1955,8 +1955,6 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 
 	if (!lock_extent_buffer_for_io(eb, wbc)) {
 		btrfs_revert_meta_write_pointer(cache, eb);
-		if (cache)
-			btrfs_put_block_group(cache);
 		free_extent_buffer(eb);
 		return 0;
 	}
@@ -1965,7 +1963,6 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 		 * Implies write in zoned mode. Mark the last eb in a block group.
 		 */
 		btrfs_schedule_zone_finish_bg(cache, eb);
-		btrfs_put_block_group(cache);
 	}
 	write_one_eb(eb, wbc);
 	free_extent_buffer(eb);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index e4b8134ab70166..eed96ec35052a0 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1752,6 +1752,7 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 
 	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
 	cache->meta_write_pointer = eb->start;
+	btrfs_put_block_group(cache);
 }
 
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length)
@@ -2145,17 +2146,19 @@ void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 				   struct extent_buffer *eb)
 {
 	if (!test_bit(BLOCK_GROUP_FLAG_SEQUENTIAL_ZONE, &bg->runtime_flags) ||
-	    eb->start + eb->len * 2 <= bg->start + bg->zone_capacity)
+	    eb->start + eb->len * 2 <= bg->start + bg->zone_capacity) {
+		btrfs_put_block_group(bg);
 		return;
+	}
 
 	if (WARN_ON(bg->zone_finish_work.func == btrfs_zone_finish_endio_workfn)) {
 		btrfs_err(bg->fs_info, "double scheduling of bg %llu zone finishing",
 			  bg->start);
+		btrfs_put_block_group(bg);
 		return;
 	}
 
 	/* For the work */
-	btrfs_get_block_group(bg);
 	atomic_inc(&eb->refs);
 	bg->last_eb = eb;
 	INIT_WORK(&bg->zone_finish_work, btrfs_zone_finish_endio_workfn);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 5/6] btrfs: move locking and write pointer checking into write_one_eb
  2023-05-15 19:22 simplify dirty buffer tracking Christoph Hellwig
                   ` (3 preceding siblings ...)
  2023-05-15 19:22 ` [PATCH 4/6] btrfs: move dropping the bg reference out of submit_eb_page Christoph Hellwig
@ 2023-05-15 19:22 ` Christoph Hellwig
  2023-05-15 19:22 ` [PATCH 6/6] btrfs: bypass filemap_fdatawrite_range in btrfs_write_buffers Christoph Hellwig
  5 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-15 19:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs

Currently only submit_eb_page contains handling for checking against the
write pointer on zoned devices.  This is ok as there is no support for
block size < PAGE_SIZE with zoned devices at the moment, but prevents
us from easily adding new callers of write_one_eb without breaking zoned
device support.

Move the call tolock_extent_buffer_for_io as well as the write pointer
checking and related code into write_one_eb.  To be able to do this,
write_one_eb now needs to set the eb_context pointer if the caller
passed in a non-NULL argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 69 +++++++++++++++++++++-----------------------
 1 file changed, 33 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1205b3a3147e7d..c291fc728db047 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1759,12 +1759,39 @@ static void prepare_eb_write(struct extent_buffer *eb)
 	}
 }
 
-static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
-					    struct writeback_control *wbc)
+static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
+					   struct writeback_control *wbc,
+					   struct extent_buffer **eb_context)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct btrfs_block_group *zoned_bg = NULL;
 	struct btrfs_bio *bbio;
 
+	if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, &zoned_bg)) {
+		/*
+		 * If for_sync, this hole will be filled with
+		 * trasnsaction commit.
+		 */
+		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
+			return -EAGAIN;
+		return 0;
+	}
+
+	if (eb_context)
+		*eb_context = eb;
+
+	if (!lock_extent_buffer_for_io(eb, wbc)) {
+		btrfs_revert_meta_write_pointer(zoned_bg, eb);
+		return 0;
+	}
+
+	/*
+	 * A non-NULL zoned_bg implies zoned mode and that we are writing the
+	 * last possible block in the zone.
+	 */
+	if (zoned_bg)
+		btrfs_schedule_zone_finish_bg(zoned_bg, eb);
+
 	prepare_eb_write(eb);
 
 	bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
@@ -1801,6 +1828,7 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
 		}
 	}
 	btrfs_submit_bio(bbio, 0);
+	return 1;
 }
 
 /*
@@ -1868,11 +1896,8 @@ static int submit_eb_subpage(struct page *page, struct writeback_control *wbc)
 		 */
 		if (!eb)
 			continue;
-
-		if (lock_extent_buffer_for_io(eb, wbc)) {
-			write_one_eb(eb, wbc);
+		if (write_one_eb(eb, wbc, NULL) > 0)
 			submitted++;
-		}
 		free_extent_buffer(eb);
 	}
 	return submitted;
@@ -1902,7 +1927,6 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 			  struct extent_buffer **eb_context)
 {
 	struct address_space *mapping = page->mapping;
-	struct btrfs_block_group *cache = NULL;
 	struct extent_buffer *eb;
 	int ret;
 
@@ -1937,36 +1961,9 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 	spin_unlock(&mapping->private_lock);
 	if (!ret)
 		return 0;
-
-	if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, &cache)) {
-		/*
-		 * If for_sync, this hole will be filled with
-		 * trasnsaction commit.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
-			ret = -EAGAIN;
-		else
-			ret = 0;
-		free_extent_buffer(eb);
-		return ret;
-	}
-
-	*eb_context = eb;
-
-	if (!lock_extent_buffer_for_io(eb, wbc)) {
-		btrfs_revert_meta_write_pointer(cache, eb);
-		free_extent_buffer(eb);
-		return 0;
-	}
-	if (cache) {
-		/*
-		 * Implies write in zoned mode. Mark the last eb in a block group.
-		 */
-		btrfs_schedule_zone_finish_bg(cache, eb);
-	}
-	write_one_eb(eb, wbc);
+	ret = write_one_eb(eb, wbc, eb_context);
 	free_extent_buffer(eb);
-	return 1;
+	return ret;
 }
 
 int btree_write_cache_pages(struct address_space *mapping,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 6/6] btrfs: bypass filemap_fdatawrite_range in btrfs_write_buffers
  2023-05-15 19:22 simplify dirty buffer tracking Christoph Hellwig
                   ` (4 preceding siblings ...)
  2023-05-15 19:22 ` [PATCH 5/6] btrfs: move locking and write pointer checking into write_one_eb Christoph Hellwig
@ 2023-05-15 19:22 ` Christoph Hellwig
  5 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-15 19:22 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs

btrfs_write_buffers already has a pointer to each extent_buffer it needs
to write.  Instead of going through filemap_fdatawrite_range to look up
each page in the buffer, just to cycle back to the buffer for writing
it, just call write_one_eb directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_buffer.c | 17 +++++++++++++----
 fs/btrfs/extent_buffer.h |  3 +++
 fs/btrfs/extent_io.c     |  6 +++---
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_buffer.c b/fs/btrfs/extent_buffer.c
index 70a6a5f2e0e0a8..8d0802d2ab004e 100644
--- a/fs/btrfs/extent_buffer.c
+++ b/fs/btrfs/extent_buffer.c
@@ -8,6 +8,7 @@
 #include "extent_io.h"
 #include "fs.h"
 #include "transaction.h"
+#include "zoned.h"
 
 void btrfs_add_to_dirty_buffers_list(struct extent_buffer *eb,
 				     struct btrfs_transaction *trans,
@@ -44,20 +45,28 @@ static int extent_buffer_cmp(void *priv, const struct list_head *a,
 
 int btrfs_write_buffers(struct btrfs_fs_info *fs_info, struct list_head *list)
 {
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct dirty_buffer *db;
+	struct writeback_control wbc = {
+		.sync_mode	= WB_SYNC_ALL,
+		.nr_to_write	= LONG_MAX,
+	};
 	int werr = 0, err;
 
 	list_sort(NULL, list, extent_buffer_cmp);
 
+	btrfs_zoned_meta_io_lock(fs_info);
 	list_for_each_entry(db, list, wb_entry) {
-		err = filemap_fdatawrite_range(mapping, db->eb->start,
-					       db->eb->start + db->eb->len - 1);
-		if (err)
+		err = write_one_eb(db->eb, &wbc, NULL);
+		if (err < 0) {
 			werr = err;
+			break;
+		}
 		cond_resched();
 	}
+	btrfs_zoned_meta_io_unlock(fs_info);
 
+	if (!werr && BTRFS_FS_ERROR(fs_info))
+		return -EROFS;
 	return werr;
 }
 
diff --git a/fs/btrfs/extent_buffer.h b/fs/btrfs/extent_buffer.h
index 880b325d7b411a..c521685d7ca85c 100644
--- a/fs/btrfs/extent_buffer.h
+++ b/fs/btrfs/extent_buffer.h
@@ -7,6 +7,7 @@ struct btrfs_fs_info;
 struct btrfs_root;
 struct btrfs_transaction;
 struct list_head;
+struct writeback_control;
 
 struct dirty_buffer {
 	struct list_head wb_entry;
@@ -19,5 +20,7 @@ void btrfs_add_to_dirty_buffers_list(struct extent_buffer *eb,
 int btrfs_write_buffers(struct btrfs_fs_info *fs_info, struct list_head *list);
 int btrfs_wait_buffers(struct btrfs_fs_info *fs_info, struct list_head *list);
 void btrfs_release_buffers(struct list_head *list);
+int write_one_eb(struct extent_buffer *eb, struct writeback_control *wbc,
+		 struct extent_buffer **eb_context);
 
 #endif /* BTRFS_EXTENT_BUFFER_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c291fc728db047..28078dd4acd99a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -15,6 +15,7 @@
 #include <linux/prefetch.h>
 #include <linux/fsverity.h>
 #include "misc.h"
+#include "extent_buffer.h"
 #include "extent_io.h"
 #include "extent-io-tree.h"
 #include "extent_map.h"
@@ -1759,9 +1760,8 @@ static void prepare_eb_write(struct extent_buffer *eb)
 	}
 }
 
-static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
-					   struct writeback_control *wbc,
-					   struct extent_buffer **eb_context)
+int write_one_eb(struct extent_buffer *eb, struct writeback_control *wbc,
+		 struct extent_buffer **eb_context)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct btrfs_block_group *zoned_bg = NULL;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers
  2023-05-15 19:22 ` [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers Christoph Hellwig
@ 2023-05-17 10:40   ` Filipe Manana
  2023-05-17 12:21     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2023-05-17 10:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs

On Mon, May 15, 2023 at 8:27 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Currently there is an extent_io_tree per transaction or tree_log root to
> track the dirty buffer ranges in the transaction or tree_log.  This is
> fairly inefficient, as:
>
>  - we only add to the list, until the transaction / log_tree commit
>    happens, at which point we'll want to walk them sequentially
>  - the dirty range means we need another mechanism to actually find
>    the dirty buffer based off the dirty range
>
> This patch instead switches tracking to one linked list per transaction
> and two to each root for the two tree logs which link a new object that
> points directly to the buffer.  Note that the list_head can't directly be
> embedded into the extent_buffer structure given that a buffer can be part
> of more than one transaction or tree_log.  This also means the existing
> error propagation based off eb->log_index never fully worked, as this
> index would get overwritten once a buffer is added to a new dirty tree.

If an extent buffer is part of 2 transactions, it means that when it
was allocated
for the next one, it was already cleaned up in the previous one, so
its ->log_index is
no longer used by the previous transaction and can be safely
overwritten by the next transaction.

Or did you find a case where that is not true?

Also some comments inlined below.

> The new code instead directly looks at the EXTENT_BUFFER_WRITE_ERR bit,
> which doesn't have this problem.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/Makefile                |   2 +-
>  fs/btrfs/ctree.h                 |   3 +-
>  fs/btrfs/disk-io.c               |  72 ++++++-------
>  fs/btrfs/extent-io-tree.h        |  10 --
>  fs/btrfs/extent-tree.c           |  21 +---
>  fs/btrfs/extent_buffer.c         |  99 ++++++++++++++++++
>  fs/btrfs/extent_buffer.h         |  23 +++++
>  fs/btrfs/extent_io.c             |  59 +----------
>  fs/btrfs/extent_io.h             |   2 -
>  fs/btrfs/fs.h                    |   3 -
>  fs/btrfs/tests/extent-io-tests.c |   2 -
>  fs/btrfs/transaction.c           | 168 +++----------------------------
>  fs/btrfs/transaction.h           |   3 +-
>  fs/btrfs/tree-log.c              |  71 ++++++++-----
>  fs/btrfs/zoned.c                 |   4 +-
>  include/trace/events/btrfs.h     |  16 +--
>  16 files changed, 232 insertions(+), 326 deletions(-)
>  create mode 100644 fs/btrfs/extent_buffer.c
>  create mode 100644 fs/btrfs/extent_buffer.h
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 90d53209755bf8..661cead213d6df 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -24,7 +24,7 @@ obj-$(CONFIG_BTRFS_FS) := btrfs.o
>
>  btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>            file-item.o inode-item.o disk-io.o \
> -          transaction.o inode.o file.o defrag.o \
> +          transaction.o inode.o file.o defrag.o extent_buffer.o \
>            extent_map.o sysfs.o accessors.o xattr.o ordered-data.o \
>            extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
>            export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5af61480dde613..5a5fda077a0288 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -202,7 +202,8 @@ struct btrfs_root {
>         struct btrfs_root_item root_item;
>         struct btrfs_key root_key;
>         struct btrfs_fs_info *fs_info;
> -       struct extent_io_tree dirty_log_pages;
> +       struct list_head dirty_buffers[2];

As this is for the log tree, I'd prefer to have its name reflect that,
like we had before.
Something like "log_dirty_buffers" for example.

> +       spinlock_t dirty_buffers_lock;

Same here.

>
>         struct mutex objectid_mutex;
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f58806b798cd65..3b81d6ec1b47b8 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -24,6 +24,7 @@
>  #include "transaction.h"
>  #include "btrfs_inode.h"
>  #include "bio.h"
> +#include "extent_buffer.h"
>  #include "print-tree.h"
>  #include "locking.h"
>  #include "tree-log.h"
> @@ -64,9 +65,6 @@ static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
>  static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
>                                       struct btrfs_fs_info *fs_info);
>  static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root);
> -static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
> -                                       struct extent_io_tree *dirty_pages,
> -                                       int mark);
>  static int btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
>                                        struct extent_io_tree *pinned_extents);
>  static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info);
> @@ -693,8 +691,9 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>         root->last_log_commit = 0;
>         root->anon_dev = 0;
>         if (!dummy) {
> -               extent_io_tree_init(fs_info, &root->dirty_log_pages,
> -                                   IO_TREE_ROOT_DIRTY_LOG_PAGES);
> +               INIT_LIST_HEAD(&root->dirty_buffers[0]);
> +               INIT_LIST_HEAD(&root->dirty_buffers[1]);
> +               spin_lock_init(&root->dirty_buffers_lock);
>                 extent_io_tree_init(fs_info, &root->log_csum_range,
>                                     IO_TREE_LOG_CSUM_RANGE);
>         }
> @@ -4208,18 +4207,16 @@ static void warn_about_uncommitted_trans(struct btrfs_fs_info *fs_info)
>          */
>         ASSERT(test_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags));
>         list_for_each_entry_safe(trans, tmp, &fs_info->trans_list, list) {
> -               struct extent_state *cached = NULL;
> +               struct dirty_buffer *db;
>                 u64 dirty_bytes = 0;
> -               u64 cur = 0;
> -               u64 found_start;
> -               u64 found_end;
>
>                 found = true;
> -               while (!find_first_extent_bit(&trans->dirty_pages, cur,
> -                       &found_start, &found_end, EXTENT_DIRTY, &cached)) {
> -                       dirty_bytes += found_end + 1 - found_start;
> -                       cur = found_end + 1;
> -               }
> +
> +               spin_lock(&trans->dirty_buffers_lock);
> +               list_for_each_entry(db, &trans->dirty_buffers, wb_entry)
> +                       dirty_bytes += db->eb->len;
> +               spin_unlock(&trans->dirty_buffers_lock);
> +
>                 btrfs_warn(fs_info,
>         "transaction %llu (with %llu dirty metadata bytes) is not committed",
>                            trans->transid, dirty_bytes);
> @@ -4707,38 +4704,30 @@ static void btrfs_destroy_all_delalloc_inodes(struct btrfs_fs_info *fs_info)
>         spin_unlock(&fs_info->delalloc_root_lock);
>  }
>
> -static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
> -                                       struct extent_io_tree *dirty_pages,
> -                                       int mark)
> +static void btrfs_destroy_dirty_buffers(struct btrfs_fs_info *fs_info,
> +                                       struct btrfs_transaction *cur_trans)
>  {
> -       int ret;
> -       struct extent_buffer *eb;
> -       u64 start = 0;
> -       u64 end;
> +       struct dirty_buffer *db;
>
> -       while (1) {
> -               ret = find_first_extent_bit(dirty_pages, start, &start, &end,
> -                                           mark, NULL);
> -               if (ret)
> -                       break;
> +       spin_lock(&cur_trans->dirty_buffers_lock);
> +       while ((db = list_first_entry_or_null(&cur_trans->dirty_buffers,
> +                       struct dirty_buffer, wb_entry))) {
> +               struct extent_buffer *eb = db->eb;
>
> -               clear_extent_bits(dirty_pages, start, end, mark);
> -               while (start <= end) {
> -                       eb = find_extent_buffer(fs_info, start);
> -                       start += fs_info->nodesize;
> -                       if (!eb)
> -                               continue;
> +               list_del_init(&db->wb_entry);
> +               spin_unlock(&cur_trans->dirty_buffers_lock);
>
> -                       btrfs_tree_lock(eb);
> -                       wait_on_extent_buffer_writeback(eb);
> -                       btrfs_clear_buffer_dirty(NULL, eb);
> -                       btrfs_tree_unlock(eb);
> +               btrfs_tree_lock(eb);
> +               wait_on_extent_buffer_writeback(eb);
> +               btrfs_clear_buffer_dirty(NULL, eb);
> +               btrfs_tree_unlock(eb);
>
> -                       free_extent_buffer_stale(eb);
> -               }
> -       }
> +               free_extent_buffer_stale(eb);
> +               kfree(db);
>
> -       return ret;
> +               spin_lock(&cur_trans->dirty_buffers_lock);
> +       }
> +       spin_unlock(&cur_trans->dirty_buffers_lock);
>  }
>
>  static int btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
> @@ -4866,8 +4855,7 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
>
>         btrfs_destroy_delayed_inodes(fs_info);
>
> -       btrfs_destroy_marked_extents(fs_info, &cur_trans->dirty_pages,
> -                                    EXTENT_DIRTY);
> +       btrfs_destroy_dirty_buffers(fs_info, cur_trans);
>         btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents);
>
>         cur_trans->state =TRANS_STATE_COMPLETED;
> diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> index 21766e49ec02e7..0dae0a16a87cb7 100644
> --- a/fs/btrfs/extent-io-tree.h
> +++ b/fs/btrfs/extent-io-tree.h
> @@ -12,13 +12,11 @@ enum {
>         ENUM_BIT(EXTENT_DIRTY),
>         ENUM_BIT(EXTENT_UPTODATE),
>         ENUM_BIT(EXTENT_LOCKED),
> -       ENUM_BIT(EXTENT_NEW),
>         ENUM_BIT(EXTENT_DELALLOC),
>         ENUM_BIT(EXTENT_DEFRAG),
>         ENUM_BIT(EXTENT_BOUNDARY),
>         ENUM_BIT(EXTENT_NODATASUM),
>         ENUM_BIT(EXTENT_CLEAR_META_RESV),
> -       ENUM_BIT(EXTENT_NEED_WAIT),
>         ENUM_BIT(EXTENT_NORESERVE),
>         ENUM_BIT(EXTENT_QGROUP_RESERVED),
>         ENUM_BIT(EXTENT_CLEAR_DATA_RESV),
> @@ -68,8 +66,6 @@ enum {
>         IO_TREE_BTREE_INODE_IO,
>         IO_TREE_INODE_IO,
>         IO_TREE_RELOC_BLOCKS,
> -       IO_TREE_TRANS_DIRTY_PAGES,
> -       IO_TREE_ROOT_DIRTY_LOG_PAGES,
>         IO_TREE_INODE_FILE_EXTENT,
>         IO_TREE_LOG_CSUM_RANGE,
>         IO_TREE_SELFTEST,
> @@ -210,12 +206,6 @@ static inline int set_extent_defrag(struct extent_io_tree *tree, u64 start,
>                               cached_state, GFP_NOFS);
>  }
>
> -static inline int set_extent_new(struct extent_io_tree *tree, u64 start,
> -               u64 end)
> -{
> -       return set_extent_bit(tree, start, end, EXTENT_NEW, NULL, GFP_NOFS);
> -}
> -
>  int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
>                           u64 *start_ret, u64 *end_ret, u32 bits,
>                           struct extent_state **cached_state);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 99b9ab5ccf932a..20d7cb51de944f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -17,6 +17,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/crc32c.h>
>  #include "ctree.h"
> +#include "extent_buffer.h"
>  #include "extent-tree.h"
>  #include "tree-log.h"
>  #include "disk-io.h"
> @@ -4822,23 +4823,9 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>         btrfs_set_header_owner(buf, owner);
>         write_extent_buffer_fsid(buf, fs_info->fs_devices->metadata_uuid);
>         write_extent_buffer_chunk_tree_uuid(buf, fs_info->chunk_tree_uuid);
> -       if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> -               buf->log_index = root->log_transid % 2;
> -               /*
> -                * we allow two log transactions at a time, use different
> -                * EXTENT bit to differentiate dirty pages.
> -                */
> -               if (buf->log_index == 0)
> -                       set_extent_dirty(&root->dirty_log_pages, buf->start,
> -                                       buf->start + buf->len - 1, GFP_NOFS);
> -               else
> -                       set_extent_new(&root->dirty_log_pages, buf->start,
> -                                       buf->start + buf->len - 1);
> -       } else {
> -               buf->log_index = -1;
> -               set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
> -                        buf->start + buf->len - 1, GFP_NOFS);
> -       }
> +
> +       btrfs_add_to_dirty_buffers_list(buf, trans->transaction, root);
> +
>         /* this returns a buffer locked for blocking */
>         return buf;
>  }
> diff --git a/fs/btrfs/extent_buffer.c b/fs/btrfs/extent_buffer.c
> new file mode 100644
> index 00000000000000..d20fc0d113968f
> --- /dev/null
> +++ b/fs/btrfs/extent_buffer.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 Christoph Hellwig.
> + */
> +
> +#include <linux/list_sort.h>
> +#include "extent_buffer.h"
> +#include "extent_io.h"
> +#include "fs.h"
> +#include "transaction.h"
> +
> +void btrfs_add_to_dirty_buffers_list(struct extent_buffer *eb,
> +                                    struct btrfs_transaction *trans,
> +                                    struct btrfs_root *root)
> +{
> +       struct dirty_buffer *db = kmalloc(sizeof(*db), GFP_NOFS | __GFP_NOFAIL);
> +
> +       db->eb = eb;
> +       atomic_inc(&eb->refs); /* for the dirty list */
> +       if (root && root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> +               spin_lock(&root->dirty_buffers_lock);
> +               list_add_tail(&db->wb_entry,
> +                             &root->dirty_buffers[root->log_transid % 2]);
> +               spin_unlock(&root->dirty_buffers_lock);
> +       } else {
> +               spin_lock(&trans->dirty_buffers_lock);
> +               list_add_tail(&db->wb_entry, &trans->dirty_buffers);
> +               spin_unlock(&trans->dirty_buffers_lock);
> +       }

Ok, so there are several important differences here when compared to
the existing io tree approach that you don't mention at all in the
changelog.
These are:

1) With the io tree approach, if we allocate multiple extent buffers
that are adjacent, we get a single entry to represent them, due the
merging done by the io tree code.
With this new approach we don't have any merging at all, using more
memory and keeping a longer list, which will take longer to iterate
and sort.
For example if a 1G metadata block group is allocated, and then we
allocate all metadata extents from it, we get 65536 struct
dirty_buffer allocated, while with the io tree approach we would get a
single struct extent_state record.

2) We now need to keep references on the extent buffers. This means
they can't be released from memory until the transaction commits.
Before we didn't do this, and if an extent buffer was allocated and
freed in the same transaction, we wouldn't need to keep it in memory
until the transaction commits.
We would not need to do it as well if its writeback is started and
completed before the transaction commit.

3) Looking a bit below, we now need to sort the list, which can be
huge, especially taking into account the fact that adjacent extents
are not merged anymore as mentioned before, potentially making anyone
who's waiting on the transaction commit to wait for longer.
On the other end, when a task allocates a new buffer the insertion is
faster, as it's just appending to a list and can reduce the latency of
many syscalls (creat, mkdir, rmdir, rename, link/unlink, reflinks,
etc)

Not saying that in the end this approach isn't often or generally
better, but at the very least I would like to see all these
differences explicitly mentioned in the changelog.
The changelog gives the impression that there are no tradeoffs and the
new solution is better in every aspect.

I would say more tests/benchmarks results would be good to have too,
other than just fs_mark, and have the tests mentioned in the changelog
(results before and after this change, command lines, fio configs for
example).

Thanks.

> +}
> +
> +static int extent_buffer_cmp(void *priv, const struct list_head *a,
> +                            const struct list_head *b)
> +{
> +       s64 diff = container_of(a, struct dirty_buffer, wb_entry)->eb->start -
> +               container_of(b, struct dirty_buffer, wb_entry)->eb->start;
> +
> +       if (diff < 0)
> +               return -1;
> +       if (diff > 0)
> +               return 1;
> +       return 0;
> +}
> +
> +int btrfs_write_buffers(struct btrfs_fs_info *fs_info, struct list_head *list)
> +{
> +       struct address_space *mapping = fs_info->btree_inode->i_mapping;
> +       struct dirty_buffer *db;
> +       int werr = 0, err;
> +
> +       list_sort(NULL, list, extent_buffer_cmp);
> +
> +       list_for_each_entry(db, list, wb_entry) {
> +               err = filemap_fdatawrite_range(mapping, db->eb->start,
> +                                              db->eb->start + db->eb->len - 1);
> +               if (err)
> +                       werr = err;
> +               cond_resched();
> +       }
> +
> +       return werr;
> +}
> +
> +int btrfs_wait_buffers(struct btrfs_fs_info *fs_info, struct list_head *list)
> +{
> +       struct address_space *mapping = fs_info->btree_inode->i_mapping;
> +       struct dirty_buffer *db;
> +       int werr = 0, err;
> +
> +       while ((db = list_first_entry_or_null(list, struct dirty_buffer,
> +                       wb_entry))) {
> +               list_del_init(&db->wb_entry);
> +
> +               err = filemap_fdatawait_range(mapping, db->eb->start,
> +                                             db->eb->start + db->eb->len - 1);
> +               if (err)
> +                       werr = err;
> +               if (!werr && test_bit(EXTENT_BUFFER_WRITE_ERR, &db->eb->bflags))
> +                       werr = -EIO;
> +               free_extent_buffer(db->eb);
> +               kfree(db);
> +
> +               cond_resched();
> +       }
> +
> +       return werr;
> +}
> +
> +void btrfs_release_buffers(struct list_head *list)
> +{
> +       struct dirty_buffer *db;
> +
> +       while ((db = list_first_entry_or_null(list, struct dirty_buffer,
> +                       wb_entry))) {
> +               list_del_init(&db->wb_entry);
> +               free_extent_buffer(db->eb);
> +               kfree(db);
> +       }
> +}
> diff --git a/fs/btrfs/extent_buffer.h b/fs/btrfs/extent_buffer.h
> new file mode 100644
> index 00000000000000..880b325d7b411a
> --- /dev/null
> +++ b/fs/btrfs/extent_buffer.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef BTRFS_EXTENT_BUFFER_H
> +#define BTRFS_EXTENT_BUFFER_H
> +
> +struct extent_buffer;
> +struct btrfs_fs_info;
> +struct btrfs_root;
> +struct btrfs_transaction;
> +struct list_head;
> +
> +struct dirty_buffer {
> +       struct list_head wb_entry;
> +       struct extent_buffer *eb;
> +};
> +
> +void btrfs_add_to_dirty_buffers_list(struct extent_buffer *eb,
> +                                    struct btrfs_transaction *trans,
> +                                    struct btrfs_root *root);
> +int btrfs_write_buffers(struct btrfs_fs_info *fs_info, struct list_head *list);
> +int btrfs_wait_buffers(struct btrfs_fs_info *fs_info, struct list_head *list);
> +void btrfs_release_buffers(struct list_head *list);
> +
> +#endif /* BTRFS_EXTENT_BUFFER_H */
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5999ac3ee601db..d5937ed0962d38 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1659,8 +1659,11 @@ static bool lock_extent_buffer_for_io(struct extent_buffer *eb,
>
>  static void set_btree_ioerr(struct extent_buffer *eb)
>  {
> -       struct btrfs_fs_info *fs_info = eb->fs_info;
> -
> +       /*
> +        * Propagate this error to btrfs_wait_buffers in case the buffer got
> +        * written out by optimistic VM flushing and not the explicit writeback
> +        * from btrfs_write_buffers().
> +        */
>         set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>
>         /*
> @@ -1676,58 +1679,6 @@ static void set_btree_ioerr(struct extent_buffer *eb)
>          * the superblock.
>          */
>         mapping_set_error(eb->fs_info->btree_inode->i_mapping, -EIO);
> -
> -       /*
> -        * If writeback for a btree extent that doesn't belong to a log tree
> -        * failed, increment the counter transaction->eb_write_errors.
> -        * We do this because while the transaction is running and before it's
> -        * committing (when we call filemap_fdata[write|wait]_range against
> -        * the btree inode), we might have
> -        * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
> -        * returns an error or an error happens during writeback, when we're
> -        * committing the transaction we wouldn't know about it, since the pages
> -        * can be no longer dirty nor marked anymore for writeback (if a
> -        * subsequent modification to the extent buffer didn't happen before the
> -        * transaction commit), which makes filemap_fdata[write|wait]_range not
> -        * able to find the pages tagged with SetPageError at transaction
> -        * commit time. So if this happens we must abort the transaction,
> -        * otherwise we commit a super block with btree roots that point to
> -        * btree nodes/leafs whose content on disk is invalid - either garbage
> -        * or the content of some node/leaf from a past generation that got
> -        * cowed or deleted and is no longer valid.
> -        *
> -        * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
> -        * not be enough - we need to distinguish between log tree extents vs
> -        * non-log tree extents, and the next filemap_fdatawait_range() call
> -        * will catch and clear such errors in the mapping - and that call might
> -        * be from a log sync and not from a transaction commit. Also, checking
> -        * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
> -        * not done and would not be reliable - the eb might have been released
> -        * from memory and reading it back again means that flag would not be
> -        * set (since it's a runtime flag, not persisted on disk).
> -        *
> -        * Using the flags below in the btree inode also makes us achieve the
> -        * goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
> -        * writeback for all dirty pages and before filemap_fdatawait_range()
> -        * is called, the writeback for all dirty pages had already finished
> -        * with errors - because we were not using AS_EIO/AS_ENOSPC,
> -        * filemap_fdatawait_range() would return success, as it could not know
> -        * that writeback errors happened (the pages were no longer tagged for
> -        * writeback).
> -        */
> -       switch (eb->log_index) {
> -       case -1:
> -               set_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags);
> -               break;
> -       case 0:
> -               set_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
> -               break;
> -       case 1:
> -               set_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
> -               break;
> -       default:
> -               BUG(); /* unexpected, logic error */
> -       }
>  }
>
>  /*
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index db9148bafd02c3..7c0e638346017e 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -83,8 +83,6 @@ struct extent_buffer {
>         int read_mirror;
>         struct rcu_head rcu_head;
>         pid_t lock_owner;
> -       /* >= 0 if eb belongs to a log tree, -1 otherwise */
> -       s8 log_index;
>
>         struct rw_semaphore lock;
>
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 840e4def18b519..f71281408102c3 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -81,9 +81,6 @@ enum {
>         BTRFS_FS_QUOTA_ENABLED,
>         BTRFS_FS_UPDATE_UUID_TREE_GEN,
>         BTRFS_FS_CREATING_FREE_SPACE_TREE,
> -       BTRFS_FS_BTREE_ERR,
> -       BTRFS_FS_LOG1_ERR,
> -       BTRFS_FS_LOG2_ERR,
>         BTRFS_FS_QUOTA_OVERRIDE,
>         /* Used to record internally whether fs has been frozen */
>         BTRFS_FS_FROZEN,
> diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
> index dfc5c7fa603899..df7f526e0718e0 100644
> --- a/fs/btrfs/tests/extent-io-tests.c
> +++ b/fs/btrfs/tests/extent-io-tests.c
> @@ -75,13 +75,11 @@ static void extent_flag_to_str(const struct extent_state *state, char *dest)
>         PRINT_ONE_FLAG(state, dest, cur, DIRTY);
>         PRINT_ONE_FLAG(state, dest, cur, UPTODATE);
>         PRINT_ONE_FLAG(state, dest, cur, LOCKED);
> -       PRINT_ONE_FLAG(state, dest, cur, NEW);
>         PRINT_ONE_FLAG(state, dest, cur, DELALLOC);
>         PRINT_ONE_FLAG(state, dest, cur, DEFRAG);
>         PRINT_ONE_FLAG(state, dest, cur, BOUNDARY);
>         PRINT_ONE_FLAG(state, dest, cur, NODATASUM);
>         PRINT_ONE_FLAG(state, dest, cur, CLEAR_META_RESV);
> -       PRINT_ONE_FLAG(state, dest, cur, NEED_WAIT);
>         PRINT_ONE_FLAG(state, dest, cur, NORESERVE);
>         PRINT_ONE_FLAG(state, dest, cur, QGROUP_RESERVED);
>         PRINT_ONE_FLAG(state, dest, cur, CLEAR_DATA_RESV);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index fe0f00e717a834..5353bfd8309425 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -3,6 +3,7 @@
>   * Copyright (C) 2007 Oracle.  All rights reserved.
>   */
>
> +#include <linux/list_sort.h>
>  #include <linux/fs.h>
>  #include <linux/slab.h>
>  #include <linux/sched.h>
> @@ -12,6 +13,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/uuid.h>
>  #include <linux/timekeeping.h>
> +#include "extent_buffer.h"
>  #include "misc.h"
>  #include "ctree.h"
>  #include "disk-io.h"
> @@ -191,7 +193,8 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
>                 list_del_init(&root->dirty_list);
>                 free_extent_buffer(root->commit_root);
>                 root->commit_root = btrfs_root_node(root);
> -               extent_io_tree_release(&root->dirty_log_pages);
> +               ASSERT(list_empty_careful(&root->dirty_buffers[0]));
> +               ASSERT(list_empty_careful(&root->dirty_buffers[1]));
>                 btrfs_qgroup_clean_swapped_blocks(root);
>         }
>
> @@ -375,8 +378,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
>         INIT_LIST_HEAD(&cur_trans->deleted_bgs);
>         spin_lock_init(&cur_trans->dropped_roots_lock);
>         list_add_tail(&cur_trans->list, &fs_info->trans_list);
> -       extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
> -                       IO_TREE_TRANS_DIRTY_PAGES);
> +       INIT_LIST_HEAD(&cur_trans->dirty_buffers);
> +       spin_lock_init(&cur_trans->dirty_buffers_lock);
>         extent_io_tree_init(fs_info, &cur_trans->pinned_extents,
>                         IO_TREE_FS_PINNED_EXTENTS);
>         fs_info->generation++;
> @@ -1039,141 +1042,6 @@ int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans)
>         return __btrfs_end_transaction(trans, 1);
>  }
>
> -/*
> - * when btree blocks are allocated, they have some corresponding bits set for
> - * them in one of two extent_io trees.  This is used to make sure all of
> - * those extents are sent to disk but does not wait on them
> - */
> -int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
> -                              struct extent_io_tree *dirty_pages, int mark)
> -{
> -       int err = 0;
> -       int werr = 0;
> -       struct address_space *mapping = fs_info->btree_inode->i_mapping;
> -       struct extent_state *cached_state = NULL;
> -       u64 start = 0;
> -       u64 end;
> -
> -       while (!find_first_extent_bit(dirty_pages, start, &start, &end,
> -                                     mark, &cached_state)) {
> -               bool wait_writeback = false;
> -
> -               err = convert_extent_bit(dirty_pages, start, end,
> -                                        EXTENT_NEED_WAIT,
> -                                        mark, &cached_state);
> -               /*
> -                * convert_extent_bit can return -ENOMEM, which is most of the
> -                * time a temporary error. So when it happens, ignore the error
> -                * and wait for writeback of this range to finish - because we
> -                * failed to set the bit EXTENT_NEED_WAIT for the range, a call
> -                * to __btrfs_wait_marked_extents() would not know that
> -                * writeback for this range started and therefore wouldn't
> -                * wait for it to finish - we don't want to commit a
> -                * superblock that points to btree nodes/leafs for which
> -                * writeback hasn't finished yet (and without errors).
> -                * We cleanup any entries left in the io tree when committing
> -                * the transaction (through extent_io_tree_release()).
> -                */
> -               if (err == -ENOMEM) {
> -                       err = 0;
> -                       wait_writeback = true;
> -               }
> -               if (!err)
> -                       err = filemap_fdatawrite_range(mapping, start, end);
> -               if (err)
> -                       werr = err;
> -               else if (wait_writeback)
> -                       werr = filemap_fdatawait_range(mapping, start, end);
> -               free_extent_state(cached_state);
> -               cached_state = NULL;
> -               cond_resched();
> -               start = end + 1;
> -       }
> -       return werr;
> -}
> -
> -/*
> - * when btree blocks are allocated, they have some corresponding bits set for
> - * them in one of two extent_io trees.  This is used to make sure all of
> - * those extents are on disk for transaction or log commit.  We wait
> - * on all the pages and clear them from the dirty pages state tree
> - */
> -static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
> -                                      struct extent_io_tree *dirty_pages)
> -{
> -       int err = 0;
> -       int werr = 0;
> -       struct address_space *mapping = fs_info->btree_inode->i_mapping;
> -       struct extent_state *cached_state = NULL;
> -       u64 start = 0;
> -       u64 end;
> -
> -       while (!find_first_extent_bit(dirty_pages, start, &start, &end,
> -                                     EXTENT_NEED_WAIT, &cached_state)) {
> -               /*
> -                * Ignore -ENOMEM errors returned by clear_extent_bit().
> -                * When committing the transaction, we'll remove any entries
> -                * left in the io tree. For a log commit, we don't remove them
> -                * after committing the log because the tree can be accessed
> -                * concurrently - we do it only at transaction commit time when
> -                * it's safe to do it (through extent_io_tree_release()).
> -                */
> -               err = clear_extent_bit(dirty_pages, start, end,
> -                                      EXTENT_NEED_WAIT, &cached_state);
> -               if (err == -ENOMEM)
> -                       err = 0;
> -               if (!err)
> -                       err = filemap_fdatawait_range(mapping, start, end);
> -               if (err)
> -                       werr = err;
> -               free_extent_state(cached_state);
> -               cached_state = NULL;
> -               cond_resched();
> -               start = end + 1;
> -       }
> -       if (err)
> -               werr = err;
> -       return werr;
> -}
> -
> -static int btrfs_wait_extents(struct btrfs_fs_info *fs_info,
> -                      struct extent_io_tree *dirty_pages)
> -{
> -       bool errors = false;
> -       int err;
> -
> -       err = __btrfs_wait_marked_extents(fs_info, dirty_pages);
> -       if (test_and_clear_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags))
> -               errors = true;
> -
> -       if (errors && !err)
> -               err = -EIO;
> -       return err;
> -}
> -
> -int btrfs_wait_tree_log_extents(struct btrfs_root *log_root, int mark)
> -{
> -       struct btrfs_fs_info *fs_info = log_root->fs_info;
> -       struct extent_io_tree *dirty_pages = &log_root->dirty_log_pages;
> -       bool errors = false;
> -       int err;
> -
> -       ASSERT(log_root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
> -
> -       err = __btrfs_wait_marked_extents(fs_info, dirty_pages);
> -       if ((mark & EXTENT_DIRTY) &&
> -           test_and_clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags))
> -               errors = true;
> -
> -       if ((mark & EXTENT_NEW) &&
> -           test_and_clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags))
> -               errors = true;
> -
> -       if (errors && !err)
> -               err = -EIO;
> -       return err;
> -}
> -
>  /*
>   * When btree blocks are allocated the corresponding extents are marked dirty.
>   * This function ensures such extents are persisted on disk for transaction or
> @@ -1183,25 +1051,22 @@ int btrfs_wait_tree_log_extents(struct btrfs_root *log_root, int mark)
>   */
>  static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans)
>  {
> -       int ret;
> -       int ret2;
> -       struct extent_io_tree *dirty_pages = &trans->transaction->dirty_pages;
> -       struct btrfs_fs_info *fs_info = trans->fs_info;
> +       LIST_HEAD(dirty_buffers);
>         struct blk_plug plug;
> +       int ret, ret2;
> +
> +       spin_lock(&trans->transaction->dirty_buffers_lock);
> +       list_splice_init(&trans->transaction->dirty_buffers, &dirty_buffers);
> +       spin_unlock(&trans->transaction->dirty_buffers_lock);
>
>         blk_start_plug(&plug);
> -       ret = btrfs_write_marked_extents(fs_info, dirty_pages, EXTENT_DIRTY);
> +       ret = btrfs_write_buffers(trans->fs_info, &dirty_buffers);
>         blk_finish_plug(&plug);
> -       ret2 = btrfs_wait_extents(fs_info, dirty_pages);
> -
> -       extent_io_tree_release(&trans->transaction->dirty_pages);
>
> +       ret2 = btrfs_wait_buffers(trans->fs_info, &dirty_buffers);
>         if (ret)
>                 return ret;
> -       else if (ret2)
> -               return ret2;
> -       else
> -               return 0;
> +       return ret2;
>  }
>
>  /*
> @@ -2443,9 +2308,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>
>         btrfs_commit_device_sizes(cur_trans);
>
> -       clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
> -       clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
> -
>         btrfs_trans_release_chunk_metadata(trans);
>
>         /*
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 8e9fa23bd7fed7..eade7eb41a07ef 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -47,7 +47,8 @@ struct btrfs_transaction {
>         enum btrfs_trans_state state;
>         int aborted;
>         struct list_head list;
> -       struct extent_io_tree dirty_pages;
> +       struct list_head dirty_buffers;
> +       spinlock_t dirty_buffers_lock;
>         time64_t start_time;
>         wait_queue_head_t writer_wait;
>         wait_queue_head_t commit_wait;
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 9b212e8c70cc58..012b7e03d57968 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -8,6 +8,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/list_sort.h>
>  #include <linux/iversion.h>
> +#include "extent_buffer.h"
>  #include "misc.h"
>  #include "ctree.h"
>  #include "tree-log.h"
> @@ -2847,6 +2848,24 @@ static inline void btrfs_remove_all_log_ctxs(struct btrfs_root *root,
>         }
>  }
>
> +static void btrfs_pick_all_dirty_log_buffers(struct btrfs_root *log,
> +               struct list_head *dirty_buffers)
> +{
> +       spin_lock(&log->dirty_buffers_lock);
> +       list_splice_tail_init(&log->dirty_buffers[0], dirty_buffers);
> +       list_splice_tail_init(&log->dirty_buffers[1], dirty_buffers);
> +       spin_unlock(&log->dirty_buffers_lock);
> +}
> +
> +static void btrfs_pick_dirty_log_buffers(struct btrfs_root *log,
> +                                        struct list_head *dirty_buffers,
> +                                        int log_transid)
> +{
> +       spin_lock(&log->dirty_buffers_lock);
> +       list_splice_init(&log->dirty_buffers[log_transid % 2], dirty_buffers);
> +       spin_unlock(&log->dirty_buffers_lock);
> +}
> +
>  /*
>   * btrfs_sync_log does sends a given tree log down to the disk and
>   * updates the super blocks to record it.  When this call is done,
> @@ -2864,7 +2883,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  {
>         int index1;
>         int index2;
> -       int mark;
>         int ret;
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_root *log = root->log_root;
> @@ -2875,6 +2893,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         struct blk_plug plug;
>         u64 log_root_start;
>         u64 log_root_level;
> +       LIST_HEAD(dirty_log_buffers);
> +       LIST_HEAD(dirty_log_root_buffers);
>
>         mutex_lock(&root->log_mutex);
>         log_transid = ctx->log_transid;
> @@ -2917,16 +2937,15 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                 goto out;
>         }
>
> -       if (log_transid % 2 == 0)
> -               mark = EXTENT_DIRTY;
> -       else
> -               mark = EXTENT_NEW;
> +       blk_start_plug(&plug);
>
> -       /* we start IO on  all the marked extents here, but we don't actually
> -        * wait for them until later.
> +       /*
> +        * Start I/O on all dirty buffers, but don't actually wait for them
> +        * until later.
>          */
> -       blk_start_plug(&plug);
> -       ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
> +       btrfs_pick_dirty_log_buffers(log, &dirty_log_buffers, log_transid);
> +       ret = btrfs_write_buffers(fs_info, &dirty_log_buffers);
> +
>         /*
>          * -EAGAIN happens when someone, e.g., a concurrent transaction
>          *  commit, writes a dirty extent in this tree-log commit. This
> @@ -3008,7 +3027,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                         btrfs_err(fs_info,
>                                   "failed to update log for root %llu ret %d",
>                                   root->root_key.objectid, ret);
> -               btrfs_wait_tree_log_extents(log, mark);
> +               btrfs_wait_buffers(fs_info, &dirty_log_buffers);
>                 mutex_unlock(&log_root_tree->log_mutex);
>                 goto out;
>         }
> @@ -3024,7 +3043,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         index2 = root_log_ctx.log_transid % 2;
>         if (atomic_read(&log_root_tree->log_commit[index2])) {
>                 blk_finish_plug(&plug);
> -               ret = btrfs_wait_tree_log_extents(log, mark);
> +               ret = btrfs_wait_buffers(fs_info, &dirty_log_buffers);
>                 wait_log_commit(log_root_tree,
>                                 root_log_ctx.log_transid);
>                 mutex_unlock(&log_root_tree->log_mutex);
> @@ -3046,15 +3065,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>          */
>         if (btrfs_need_log_full_commit(trans)) {
>                 blk_finish_plug(&plug);
> -               btrfs_wait_tree_log_extents(log, mark);
> +               btrfs_wait_buffers(fs_info, &dirty_log_buffers);
>                 mutex_unlock(&log_root_tree->log_mutex);
>                 ret = BTRFS_LOG_FORCE_COMMIT;
>                 goto out_wake_log_root;
>         }
>
> -       ret = btrfs_write_marked_extents(fs_info,
> -                                        &log_root_tree->dirty_log_pages,
> -                                        EXTENT_DIRTY | EXTENT_NEW);
> +       btrfs_pick_all_dirty_log_buffers(log_root_tree, &dirty_log_root_buffers);
> +       ret = btrfs_write_buffers(fs_info, &dirty_log_root_buffers);
>         blk_finish_plug(&plug);
>         /*
>          * As described above, -EAGAIN indicates a hole in the extents. We
> @@ -3063,7 +3081,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>          */
>         if (ret == -EAGAIN && btrfs_is_zoned(fs_info)) {
>                 btrfs_set_log_full_commit(trans);
> -               btrfs_wait_tree_log_extents(log, mark);
> +               btrfs_wait_buffers(fs_info, &dirty_log_buffers);
>                 mutex_unlock(&log_root_tree->log_mutex);
>                 goto out_wake_log_root;
>         } else if (ret) {
> @@ -3071,10 +3089,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                 mutex_unlock(&log_root_tree->log_mutex);
>                 goto out_wake_log_root;
>         }
> -       ret = btrfs_wait_tree_log_extents(log, mark);
> +       ret = btrfs_wait_buffers(fs_info, &dirty_log_buffers);
>         if (!ret)
> -               ret = btrfs_wait_tree_log_extents(log_root_tree,
> -                                                 EXTENT_NEW | EXTENT_DIRTY);
> +               ret = btrfs_wait_buffers(fs_info, &dirty_log_root_buffers);
>         if (ret) {
>                 btrfs_set_log_full_commit(trans);
>                 mutex_unlock(&log_root_tree->log_mutex);
> @@ -3160,6 +3177,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         atomic_set(&root->log_commit[index1], 0);
>         mutex_unlock(&root->log_mutex);
>
> +       btrfs_release_buffers(&dirty_log_buffers);
> +       btrfs_release_buffers(&dirty_log_root_buffers);
> +
>         /*
>          * The barrier before waitqueue_active (in cond_wake_up) is needed so
>          * all the updates above are seen by the woken threads. It might not be
> @@ -3177,6 +3197,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
>                 .free = 1,
>                 .process_func = process_one_buffer
>         };
> +       LIST_HEAD(dirty_buffers);
>
>         if (log->node) {
>                 ret = walk_log_tree(trans, log, &wc);
> @@ -3198,11 +3219,9 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
>                          * them out and wait for their writeback to complete, so
>                          * that we properly cleanup their state and pages.
>                          */
> -                       btrfs_write_marked_extents(log->fs_info,
> -                                                  &log->dirty_log_pages,
> -                                                  EXTENT_DIRTY | EXTENT_NEW);
> -                       btrfs_wait_tree_log_extents(log,
> -                                                   EXTENT_DIRTY | EXTENT_NEW);
> +                       btrfs_pick_all_dirty_log_buffers(log, &dirty_buffers);
> +                       btrfs_write_buffers(log->fs_info, &dirty_buffers);
> +                       btrfs_wait_buffers(log->fs_info, &dirty_buffers);
>
>                         if (trans)
>                                 btrfs_abort_transaction(trans, ret);
> @@ -3211,8 +3230,8 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
>                 }
>         }
>
> -       clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
> -                         EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
> +       btrfs_pick_all_dirty_log_buffers(log, &dirty_buffers);
> +       btrfs_release_buffers(&dirty_buffers);
>         extent_io_tree_release(&log->log_csum_range);
>
>         btrfs_put_root(log);
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index bda301a55cbe3b..e4b8134ab70166 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -6,6 +6,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/atomic.h>
>  #include <linux/vmalloc.h>
> +#include "extent_buffer.h"
>  #include "ctree.h"
>  #include "volumes.h"
>  #include "zoned.h"
> @@ -1611,8 +1612,7 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
>         memzero_extent_buffer(eb, 0, eb->len);
>         set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
>         set_extent_buffer_dirty(eb);
> -       set_extent_bits_nowait(&trans->dirty_pages, eb->start,
> -                              eb->start + eb->len - 1, EXTENT_DIRTY);
> +       btrfs_add_to_dirty_buffers_list(eb, trans, NULL);
>  }
>
>  bool btrfs_use_zone_append(struct btrfs_bio *bbio)
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 8ea9cea9bfeb4d..aae33b9067b918 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -86,8 +86,6 @@ struct find_free_extent_ctl;
>         EM( IO_TREE_BTREE_INODE_IO,       "BTREE_INODE_IO")         \
>         EM( IO_TREE_INODE_IO,             "INODE_IO")               \
>         EM( IO_TREE_RELOC_BLOCKS,         "RELOC_BLOCKS")           \
> -       EM( IO_TREE_TRANS_DIRTY_PAGES,    "TRANS_DIRTY_PAGES")      \
> -       EM( IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES")   \
>         EM( IO_TREE_INODE_FILE_EXTENT,    "INODE_FILE_EXTENT")      \
>         EM( IO_TREE_LOG_CSUM_RANGE,       "LOG_CSUM_RANGE")         \
>         EMe(IO_TREE_SELFTEST,             "SELFTEST")
> @@ -147,13 +145,11 @@ FLUSH_STATES
>         { EXTENT_DIRTY,                 "DIRTY"},               \
>         { EXTENT_UPTODATE,              "UPTODATE"},            \
>         { EXTENT_LOCKED,                "LOCKED"},              \
> -       { EXTENT_NEW,                   "NEW"},                 \
>         { EXTENT_DELALLOC,              "DELALLOC"},            \
>         { EXTENT_DEFRAG,                "DEFRAG"},              \
>         { EXTENT_BOUNDARY,              "BOUNDARY"},            \
>         { EXTENT_NODATASUM,             "NODATASUM"},           \
>         { EXTENT_CLEAR_META_RESV,       "CLEAR_META_RESV"},     \
> -       { EXTENT_NEED_WAIT,             "NEED_WAIT"},           \
>         { EXTENT_NORESERVE,             "NORESERVE"},           \
>         { EXTENT_QGROUP_RESERVED,       "QGROUP_RESERVED"},     \
>         { EXTENT_CLEAR_DATA_RESV,       "CLEAR_DATA_RESV"},     \
> @@ -2289,7 +2285,6 @@ DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
>                 __field(        u64,    end_ns          )
>                 __field(        u64,    diff_ns         )
>                 __field(        u64,    owner           )
> -               __field(        int,    is_log_tree     )
>         ),
>
>         TP_fast_assign_btrfs(eb->fs_info,
> @@ -2299,14 +2294,13 @@ DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
>                 __entry->end_ns         = ktime_get_ns();
>                 __entry->diff_ns        = __entry->end_ns - start_ns;
>                 __entry->owner          = btrfs_header_owner(eb);
> -               __entry->is_log_tree    = (eb->log_index >= 0);
>         ),
>
>         TP_printk_btrfs(
> -"block=%llu generation=%llu start_ns=%llu end_ns=%llu diff_ns=%llu owner=%llu is_log_tree=%d",
> +"block=%llu generation=%llu start_ns=%llu end_ns=%llu diff_ns=%llu owner=%llu",
>                 __entry->block, __entry->generation,
>                 __entry->start_ns, __entry->end_ns, __entry->diff_ns,
> -               __entry->owner, __entry->is_log_tree)
> +               __entry->owner)
>  );
>
>  DEFINE_EVENT(btrfs_sleep_tree_lock, btrfs_tree_read_lock,
> @@ -2330,19 +2324,17 @@ DECLARE_EVENT_CLASS(btrfs_locking_events,
>                 __field(        u64,    block           )
>                 __field(        u64,    generation      )
>                 __field(        u64,    owner           )
> -               __field(        int,    is_log_tree     )
>         ),
>
>         TP_fast_assign_btrfs(eb->fs_info,
>                 __entry->block          = eb->start;
>                 __entry->generation     = btrfs_header_generation(eb);
>                 __entry->owner          = btrfs_header_owner(eb);
> -               __entry->is_log_tree    = (eb->log_index >= 0);
>         ),
>
> -       TP_printk_btrfs("block=%llu generation=%llu owner=%llu is_log_tree=%d",
> +       TP_printk_btrfs("block=%llu generation=%llu owner=%llu",
>                 __entry->block, __entry->generation,
> -               __entry->owner, __entry->is_log_tree)
> +               __entry->owner)
>  );
>
>  #define DEFINE_BTRFS_LOCK_EVENT(name)                          \
> --
> 2.39.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers
  2023-05-17 10:40   ` Filipe Manana
@ 2023-05-17 12:21     ` Christoph Hellwig
  2023-05-17 16:24       ` Filipe Manana
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-17 12:21 UTC (permalink / raw)
  To: Filipe Manana
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	linux-btrfs

On Wed, May 17, 2023 at 11:40:14AM +0100, Filipe Manana wrote:
> > This patch instead switches tracking to one linked list per transaction
> > and two to each root for the two tree logs which link a new object that
> > points directly to the buffer.  Note that the list_head can't directly be
> > embedded into the extent_buffer structure given that a buffer can be part
> > of more than one transaction or tree_log.  This also means the existing
> > error propagation based off eb->log_index never fully worked, as this
> > index would get overwritten once a buffer is added to a new dirty tree.
> 
> If an extent buffer is part of 2 transactions, it means that when it
> was allocated
> for the next one, it was already cleaned up in the previous one, so
> its ->log_index is
> no longer used by the previous transaction and can be safely
> overwritten by the next transaction.
> 
> Or did you find a case where that is not true?

At least with a previous version of this patch where the list_head was
embedded into the extent_buffer structure I could very easily reproduce
cases where one buffer was added to another list while still on another
one.  The most common case was a tree log and a transaction, but I think
I've also seen two transactions or two tree logs.

"Cleanup up" means written back and waited for writeback or dirty
canceled I guess? Or is there some other aspect I should look for?

> > @@ -202,7 +202,8 @@ struct btrfs_root {
> >         struct btrfs_root_item root_item;
> >         struct btrfs_key root_key;
> >         struct btrfs_fs_info *fs_info;
> > -       struct extent_io_tree dirty_log_pages;
> > +       struct list_head dirty_buffers[2];
> 
> As this is for the log tree, I'd prefer to have its name reflect that,
> like we had before.
> Something like "log_dirty_buffers" for example.

Ok.  Given that the btrfs_root structure isn't specific to log_trees
that absolutely makes sense.

> 1) With the io tree approach, if we allocate multiple extent buffers
> that are adjacent, we get a single entry to represent them, due the
> merging done by the io tree code.
> With this new approach we don't have any merging at all, using more
> memory and keeping a longer list, which will take longer to iterate
> and sort.

At least in theory yes.  But we also save a whole lot of lookups
by going directly to the object instead of indirecting through the
pages xarray and then again through the buffers array for the
sub-block case.

> For example if a 1G metadata block group is allocated, and then we
> allocate all metadata extents from it, we get 65536 struct
> dirty_buffer allocated, while with the io tree approach we would get a
> single struct extent_state record.

But that will only get you to the filemap_fdtawrite and fdatawait.
After that we're still looking up every single page in that range
while with this series (the patch alone isn't enough, the rest of
the work comes in later patches) the list gets us straight to the
extent_buffer.

> 2) We now need to keep references on the extent buffers. This means
> they can't be released from memory until the transaction commits.
> Before we didn't do this, and if an extent buffer was allocated and
> freed in the same transaction, we wouldn't need to keep it in memory
> until the transaction commits.
> We would not need to do it as well if its writeback is started and
> completed before the transaction commit.

True.  But at the same time it will allow to get rid of all the
extent_buffer_under_io hacks.  Note that if we figure out what
causes buffers to be added to multiple transactions/tree_logs and
just have a list in the object we could just deleted it when cleaned
and have the best of both worlds.

> 3) Looking a bit below, we now need to sort the list, which can be
> huge, especially taking into account the fact that adjacent extents
> are not merged anymore as mentioned before, potentially making anyone
> who's waiting on the transaction commit to wait for longer.
> On the other end, when a task allocates a new buffer the insertion is
> faster, as it's just appending to a list and can reduce the latency of
> many syscalls (creat, mkdir, rmdir, rename, link/unlink, reflinks,
> etc)
> 
> Not saying that in the end this approach isn't often or generally
> better, but at the very least I would like to see all these
> differences explicitly mentioned in the changelog.
> The changelog gives the impression that there are no tradeoffs and the
> new solution is better in every aspect.
> 
> I would say more tests/benchmarks results would be good to have too,
> other than just fs_mark, and have the tests mentioned in the changelog
> (results before and after this change, command lines, fio configs for
> example).

Do you have any particular workload you think would be useful to test?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers
  2023-05-17 12:21     ` Christoph Hellwig
@ 2023-05-17 16:24       ` Filipe Manana
  2023-05-17 16:37         ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2023-05-17 16:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs

On Wed, May 17, 2023 at 1:21 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 17, 2023 at 11:40:14AM +0100, Filipe Manana wrote:
> > > This patch instead switches tracking to one linked list per transaction
> > > and two to each root for the two tree logs which link a new object that
> > > points directly to the buffer.  Note that the list_head can't directly be
> > > embedded into the extent_buffer structure given that a buffer can be part
> > > of more than one transaction or tree_log.  This also means the existing
> > > error propagation based off eb->log_index never fully worked, as this
> > > index would get overwritten once a buffer is added to a new dirty tree.
> >
> > If an extent buffer is part of 2 transactions, it means that when it
> > was allocated
> > for the next one, it was already cleaned up in the previous one, so
> > its ->log_index is
> > no longer used by the previous transaction and can be safely
> > overwritten by the next transaction.
> >
> > Or did you find a case where that is not true?
>
> At least with a previous version of this patch where the list_head was
> embedded into the extent_buffer structure I could very easily reproduce
> cases where one buffer was added to another list while still on another
> one.  The most common case was a tree log and a transaction, but I think
> I've also seen two transactions or two tree logs.

Probably what you ran into is the case where a tree block is allocated
and freed in the same transaction before being written, in that case
it gets added
back to the free space cache (this happens at btrfs_free_tree_block())
and can be allocated again in the same transaction.
If so, in that case you can remove it from the list and you shouldn't
run into that issue anymore.
We currently don't remove it from the io tree in that case, because
that may need splitting an existing extent state record and therefore
requiring allocating memory.

I can't remember any other case that could lead to that.

>
> "Cleanup up" means written back and waited for writeback or dirty
> canceled I guess? Or is there some other aspect I should look for?

Yes, it's that.

>
> > > @@ -202,7 +202,8 @@ struct btrfs_root {
> > >         struct btrfs_root_item root_item;
> > >         struct btrfs_key root_key;
> > >         struct btrfs_fs_info *fs_info;
> > > -       struct extent_io_tree dirty_log_pages;
> > > +       struct list_head dirty_buffers[2];
> >
> > As this is for the log tree, I'd prefer to have its name reflect that,
> > like we had before.
> > Something like "log_dirty_buffers" for example.
>
> Ok.  Given that the btrfs_root structure isn't specific to log_trees
> that absolutely makes sense.
>
> > 1) With the io tree approach, if we allocate multiple extent buffers
> > that are adjacent, we get a single entry to represent them, due the
> > merging done by the io tree code.
> > With this new approach we don't have any merging at all, using more
> > memory and keeping a longer list, which will take longer to iterate
> > and sort.
>
> At least in theory yes.  But we also save a whole lot of lookups
> by going directly to the object instead of indirecting through the
> pages xarray and then again through the buffers array for the
> sub-block case.

I think you are talking about further changes in the patchset, not
about this particular patch?

I'm talking about the data structure used to represent a range, not pages.
The io tree is good to merge adjacent ranges and reduce the total
structure size and used memory
(and it's not uncommon to have allocated metadata extents that are
adjacent, either due to new block groups or holes in existing ones).

>
> > For example if a 1G metadata block group is allocated, and then we
> > allocate all metadata extents from it, we get 65536 struct
> > dirty_buffer allocated, while with the io tree approach we would get a
> > single struct extent_state record.
>
> But that will only get you to the filemap_fdtawrite and fdatawait.
> After that we're still looking up every single page in that range
> while with this series (the patch alone isn't enough, the rest of
> the work comes in later patches) the list gets us straight to the
> extent_buffer.
>
> > 2) We now need to keep references on the extent buffers. This means
> > they can't be released from memory until the transaction commits.
> > Before we didn't do this, and if an extent buffer was allocated and
> > freed in the same transaction, we wouldn't need to keep it in memory
> > until the transaction commits.
> > We would not need to do it as well if its writeback is started and
> > completed before the transaction commit.
>
> True.  But at the same time it will allow to get rid of all the
> extent_buffer_under_io hacks.  Note that if we figure out what
> causes buffers to be added to multiple transactions/tree_logs and
> just have a list in the object we could just deleted it when cleaned
> and have the best of both worlds.
>
> > 3) Looking a bit below, we now need to sort the list, which can be
> > huge, especially taking into account the fact that adjacent extents
> > are not merged anymore as mentioned before, potentially making anyone
> > who's waiting on the transaction commit to wait for longer.
> > On the other end, when a task allocates a new buffer the insertion is
> > faster, as it's just appending to a list and can reduce the latency of
> > many syscalls (creat, mkdir, rmdir, rename, link/unlink, reflinks,
> > etc)
> >
> > Not saying that in the end this approach isn't often or generally
> > better, but at the very least I would like to see all these
> > differences explicitly mentioned in the changelog.
> > The changelog gives the impression that there are no tradeoffs and the
> > new solution is better in every aspect.
> >
> > I would say more tests/benchmarks results would be good to have too,
> > other than just fs_mark, and have the tests mentioned in the changelog
> > (results before and after this change, command lines, fio configs for
> > example).
>
> Do you have any particular workload you think would be useful to test?

fs_mark (as you did already), dbench, fio for random writes and periodic fsync,
anything that is not too short, runtimes of several minutes at least.

Thanks.


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers
  2023-05-17 16:24       ` Filipe Manana
@ 2023-05-17 16:37         ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-05-17 16:37 UTC (permalink / raw)
  To: Filipe Manana
  Cc: Christoph Hellwig, Chris Mason, Josef Bacik, David Sterba,
	linux-btrfs

On Wed, May 17, 2023 at 05:24:59PM +0100, Filipe Manana wrote:
> Probably what you ran into is the case where a tree block is allocated
> and freed in the same transaction before being written, in that case
> it gets added
> back to the free space cache (this happens at btrfs_free_tree_block())
> and can be allocated again in the same transaction.
> If so, in that case you can remove it from the list and you shouldn't
> run into that issue anymore.
> We currently don't remove it from the io tree in that case, because
> that may need splitting an existing extent state record and therefore
> requiring allocating memory.

Yes.  I did in fact run into exactly that when I resurrected the
old idea of the embedded list_head and add a lot of instrumentation.
With the list removal in btrfs_clear_buffer_dirty it works fine.

> > At least in theory yes.  But we also save a whole lot of lookups
> > by going directly to the object instead of indirecting through the
> > pages xarray and then again through the buffers array for the
> > sub-block case.
> 
> I think you are talking about further changes in the patchset, not
> about this particular patch?

Yes.  This one is the enabler.

> I'm talking about the data structure used to represent a range, not pages.
> The io tree is good to merge adjacent ranges and reduce the total
> structure size and used memory

Yes.  Assuming you actually want to track ranges.  But that's now
what we ultimatively want to track here, though.  In the end everything
is about the actual buffers.  Patches 3 and 6 are the other key ones
that completely removed the need for any ranges.  I've tried to keep
them separate as Dave tends to prefer split out patches, but maybe
I should just fold them back to make it more obvious what is going on
here.

> fs_mark (as you did already), dbench, fio for random writes and periodic fsync,
> anything that is not too short, runtimes of several minutes at least.

Ok.  If you have any particular workloads you care about and have
scripts for them, feel free to send them my way.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-05-17 16:38 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-15 19:22 simplify dirty buffer tracking Christoph Hellwig
2023-05-15 19:22 ` [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers Christoph Hellwig
2023-05-17 10:40   ` Filipe Manana
2023-05-17 12:21     ` Christoph Hellwig
2023-05-17 16:24       ` Filipe Manana
2023-05-17 16:37         ` Christoph Hellwig
2023-05-15 19:22 ` [PATCH 2/6] btrfs: remove convert_extent_bit Christoph Hellwig
2023-05-15 19:22 ` [PATCH 3/6] btrfs: directly wait for buffer writeback completion in btrfs_wait_buffers Christoph Hellwig
2023-05-15 19:22 ` [PATCH 4/6] btrfs: move dropping the bg reference out of submit_eb_page Christoph Hellwig
2023-05-15 19:22 ` [PATCH 5/6] btrfs: move locking and write pointer checking into write_one_eb Christoph Hellwig
2023-05-15 19:22 ` [PATCH 6/6] btrfs: bypass filemap_fdatawrite_range in btrfs_write_buffers Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.