* [PATCH v2 0/3] btrfs: no longer hold the extent lock for the entire read
@ 2024-08-26 16:37 Josef Bacik
2024-08-26 16:37 ` [PATCH v2 1/3] btrfs: introduce EXTENT_DIO_LOCKED Josef Bacik
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Josef Bacik @ 2024-08-26 16:37 UTC (permalink / raw)
To: linux-btrfs, kernel-team
v1: https://lore.kernel.org/linux-btrfs/cover.1724255475.git.josef@toxicpanda.com/
v1->v2:
- Added Goldwyn's bit to move the extent lock to only cover the part where we
look up the extent map, added his Co-developed-by to the patch.
-- Original email --
Hello,
One of the biggest obstacles to switching to iomap is that our extent locking is
a bit wonky. We've made a lot of progress on the write side to drastically
narrow the scope of the extent lock, but the read side is arguably the most
problematic in that we hold it until the readpage is completed.
This patch series addresses this by no longer holding the lock for the entire
IO. This is safe on the buffered side because we are protected by the page lock
and the checks for ordered extents when we start the write.
The direct io side is the more problematic side, since we could now end up with
overlapping and concurrent direct writes in direct read ranges. To solve this
problem I'm introducing a new extent io bit to do range locking for direct IO.
This will work basically the same as the extent lock did before, but only for
direct IO. We are saved by the normal inode lock and page cache in the mixed
buffered and direct case, so this shouldn't carry the same iomap+locking
related woes that the extent lock did.
This will also allow us to remove the page fault restrictions in the direct IO
case, which will be done in a followup series.
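The range-locking idea described above can be sketched with a toy model (illustrative C only, with made-up names; the real implementation stores an EXTENT_DIO_LOCKED bit per range in the extent io tree rather than using a per-sector array):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of direct IO range locking: one flag per "sector" of a
 * small file. A range can only be locked if no sector in it is
 * already held by another in-flight direct IO.
 */
#define NSECTORS 16
static bool dio_locked[NSECTORS];

static bool try_lock_dio_range(size_t start, size_t end)
{
	/* First pass: fail if any part of the range overlaps a holder. */
	for (size_t i = start; i <= end; i++)
		if (dio_locked[i])
			return false;
	/* Second pass: claim the whole range. */
	for (size_t i = start; i <= end; i++)
		dio_locked[i] = true;
	return true;
}

static void unlock_dio_range(size_t start, size_t end)
{
	for (size_t i = start; i <= end; i++)
		dio_locked[i] = false;
}
```

An overlapping direct write must wait for an in-flight direct read on the same range, while disjoint ranges proceed independently; buffered IO never takes this lock at all.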
I've run this through the CI and a lot of local testing. I'm keeping it small
and targeted because this is a pretty big logical shift for us, so I want to
make sure we get good testing on it before I go doing the other larger projects.
Thanks,
Josef
Josef Bacik (3):
btrfs: introduce EXTENT_DIO_LOCKED
btrfs: take the dio extent lock during O_DIRECT operations
btrfs: do not hold the extent lock for entire read
fs/btrfs/compression.c | 2 +-
fs/btrfs/direct-io.c | 72 +++++++++++++++++++-----------
fs/btrfs/extent-io-tree.c | 52 ++++++++++------------
fs/btrfs/extent-io-tree.h | 38 ++++++++++++++--
fs/btrfs/extent_io.c | 94 ++-------------------------------------
5 files changed, 109 insertions(+), 149 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v2 1/3] btrfs: introduce EXTENT_DIO_LOCKED
2024-08-26 16:37 [PATCH v2 0/3] btrfs: no longer hold the extent lock for the entire read Josef Bacik
@ 2024-08-26 16:37 ` Josef Bacik
2024-08-26 16:37 ` [PATCH v2 2/3] btrfs: take the dio extent lock during O_DIRECT operations Josef Bacik
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Josef Bacik @ 2024-08-26 16:37 UTC (permalink / raw)
To: linux-btrfs, kernel-team
In order to support dropping the extent lock during a read we need a way
to make sure that direct reads and direct writes for overlapping ranges
are protected from each other. To accomplish this introduce another
lock bit specifically for direct io. Subsequent patches will utilize
this to protect direct IO operations.
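Conceptually the patch turns the existing lock helpers into thin wrappers around one bit-parameterized implementation. A simplified stand-alone sketch (hypothetical names; the real helpers operate on the extent io tree, not a plain word):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the two exclusive lock bits. */
#define EXTENT_LOCKED      (1U << 0)
#define EXTENT_DIO_LOCKED  (1U << 1)
#define EXTENT_LOCK_BITS   (EXTENT_LOCKED | EXTENT_DIO_LOCKED)

/* Single word standing in for one range's state in the io tree. */
static uint32_t state;

/* One shared implementation, parameterized by which lock bit to set. */
static bool __try_lock_extent_sketch(uint32_t bits)
{
	if (state & bits)
		return false;	/* -EEXIST in the real code */
	state |= bits;
	return true;
}

/* The existing helper keeps its signature and passes EXTENT_LOCKED... */
static bool try_lock_extent_sketch(void)
{
	return __try_lock_extent_sketch(EXTENT_LOCKED);
}

/* ...while the new DIO helper passes EXTENT_DIO_LOCKED. */
static bool try_lock_dio_extent_sketch(void)
{
	return __try_lock_extent_sketch(EXTENT_DIO_LOCKED);
}
```

Because the two bits are distinct, a DIO lock can be taken on a range that already holds the regular extent lock; each bit is only exclusive against itself.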
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/extent-io-tree.c | 52 ++++++++++++++++++---------------------
fs/btrfs/extent-io-tree.h | 38 +++++++++++++++++++++++++---
2 files changed, 58 insertions(+), 32 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index c54c5d7a5cd5..383e55e0f62e 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -126,7 +126,7 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
* Empty an io tree, removing and freeing every extent state record from the
* tree. This should be called once we are sure no other task can access the
* tree anymore, so no tree updates happen after we empty the tree and there
- * aren't any waiters on any extent state record (EXTENT_LOCKED bit is never
+ * aren't any waiters on any extent state record (EXTENT_LOCK_BITS are never
* set on any extent state when calling this function).
*/
void extent_io_tree_release(struct extent_io_tree *tree)
@@ -141,7 +141,7 @@ void extent_io_tree_release(struct extent_io_tree *tree)
rbtree_postorder_for_each_entry_safe(state, tmp, &root, rb_node) {
/* Clear node to keep free_extent_state() happy. */
RB_CLEAR_NODE(&state->rb_node);
- ASSERT(!(state->state & EXTENT_LOCKED));
+ ASSERT(!(state->state & EXTENT_LOCK_BITS));
/*
* No need for a memory barrier here, as we are holding the tree
* lock and we only change the waitqueue while holding that lock
@@ -399,7 +399,7 @@ static void merge_next_state(struct extent_io_tree *tree, struct extent_state *s
*/
static void merge_state(struct extent_io_tree *tree, struct extent_state *state)
{
- if (state->state & (EXTENT_LOCKED | EXTENT_BOUNDARY))
+ if (state->state & (EXTENT_LOCK_BITS | EXTENT_BOUNDARY))
return;
merge_prev_state(tree, state);
@@ -445,7 +445,7 @@ static struct extent_state *insert_state(struct extent_io_tree *tree,
struct rb_node *parent = NULL;
const u64 start = state->start - 1;
const u64 end = state->end + 1;
- const bool try_merge = !(bits & (EXTENT_LOCKED | EXTENT_BOUNDARY));
+ const bool try_merge = !(bits & (EXTENT_LOCK_BITS | EXTENT_BOUNDARY));
set_state_bits(tree, state, bits, changeset);
@@ -616,9 +616,6 @@ static void set_gfp_mask_from_bits(u32 *bits, gfp_t *mask)
* inserting elements in the tree, so the gfp mask is used to indicate which
* allocations or sleeping are allowed.
*
- * Pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove the given
- * range from the tree regardless of state (ie for truncate).
- *
* The range [start, end] is inclusive.
*
* This takes the tree lock, and returns 0 on success and < 0 on error.
@@ -647,8 +644,8 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (bits & EXTENT_DELALLOC)
bits |= EXTENT_NORESERVE;
- wake = (bits & EXTENT_LOCKED) ? 1 : 0;
- if (bits & (EXTENT_LOCKED | EXTENT_BOUNDARY))
+ wake = (bits & EXTENT_LOCK_BITS) ? 1 : 0;
+ if (bits & (EXTENT_LOCK_BITS | EXTENT_BOUNDARY))
clear = 1;
again:
if (!prealloc) {
@@ -862,7 +859,7 @@ static void cache_state(struct extent_state *state,
struct extent_state **cached_ptr)
{
return cache_state_if_flags(state, cached_ptr,
- EXTENT_LOCKED | EXTENT_BOUNDARY);
+ EXTENT_LOCK_BITS | EXTENT_BOUNDARY);
}
/*
@@ -1063,7 +1060,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
int ret = 0;
u64 last_start;
u64 last_end;
- u32 exclusive_bits = (bits & EXTENT_LOCKED);
+ u32 exclusive_bits = (bits & EXTENT_LOCK_BITS);
gfp_t mask;
set_gfp_mask_from_bits(&bits, &mask);
@@ -1812,12 +1809,11 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
u32 bits, struct extent_changeset *changeset)
{
/*
- * We don't support EXTENT_LOCKED yet, as current changeset will
- * record any bits changed, so for EXTENT_LOCKED case, it will
- * either fail with -EEXIST or changeset will record the whole
- * range.
+ * We don't support EXTENT_LOCK_BITS yet, as current changeset will
+ * record any bits changed, so for EXTENT_LOCK_BITS case, it will either
+ * fail with -EEXIST or changeset will record the whole range.
*/
- ASSERT(!(bits & EXTENT_LOCKED));
+ ASSERT(!(bits & EXTENT_LOCK_BITS));
return __set_extent_bit(tree, start, end, bits, NULL, NULL, NULL, changeset);
}
@@ -1826,26 +1822,26 @@ int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
u32 bits, struct extent_changeset *changeset)
{
/*
- * Don't support EXTENT_LOCKED case, same reason as
+ * Don't support EXTENT_LOCK_BITS case, same reason as
* set_record_extent_bits().
*/
- ASSERT(!(bits & EXTENT_LOCKED));
+ ASSERT(!(bits & EXTENT_LOCK_BITS));
return __clear_extent_bit(tree, start, end, bits, NULL, changeset);
}
-int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
- struct extent_state **cached)
+bool __try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+ u32 bits, struct extent_state **cached)
{
int err;
u64 failed_start;
- err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
+ err = __set_extent_bit(tree, start, end, bits, &failed_start,
NULL, cached, NULL);
if (err == -EEXIST) {
if (failed_start > start)
clear_extent_bit(tree, start, failed_start - 1,
- EXTENT_LOCKED, cached);
+ bits, cached);
return 0;
}
return 1;
@@ -1855,23 +1851,23 @@ int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
* Either insert or lock state struct between start and end use mask to tell
* us if waiting is desired.
*/
-int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
- struct extent_state **cached_state)
+int __lock_extent(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
+ struct extent_state **cached_state)
{
struct extent_state *failed_state = NULL;
int err;
u64 failed_start;
- err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
+ err = __set_extent_bit(tree, start, end, bits, &failed_start,
&failed_state, cached_state, NULL);
while (err == -EEXIST) {
if (failed_start != start)
clear_extent_bit(tree, start, failed_start - 1,
- EXTENT_LOCKED, cached_state);
+ bits, cached_state);
- wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED,
+ wait_extent_bit(tree, failed_start, end, bits,
&failed_state);
- err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
+ err = __set_extent_bit(tree, start, end, bits,
&failed_start, &failed_state,
cached_state, NULL);
}
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 9d3a52d8f59a..def953f441f9 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -19,6 +19,7 @@ enum {
ENUM_BIT(EXTENT_DIRTY),
ENUM_BIT(EXTENT_UPTODATE),
ENUM_BIT(EXTENT_LOCKED),
+ ENUM_BIT(EXTENT_DIO_LOCKED),
ENUM_BIT(EXTENT_NEW),
ENUM_BIT(EXTENT_DELALLOC),
ENUM_BIT(EXTENT_DEFRAG),
@@ -67,6 +68,8 @@ enum {
EXTENT_ADD_INODE_BYTES | \
EXTENT_CLEAR_ALL_BITS)
+#define EXTENT_LOCK_BITS (EXTENT_LOCKED | EXTENT_DIO_LOCKED)
+
/*
* Redefined bits above which are used only in the device allocation tree,
* shouldn't be using EXTENT_LOCKED / EXTENT_BOUNDARY / EXTENT_CLEAR_META_RESV
@@ -134,12 +137,22 @@ const struct btrfs_fs_info *extent_io_tree_to_fs_info(const struct extent_io_tre
void extent_io_tree_init(struct btrfs_fs_info *fs_info,
struct extent_io_tree *tree, unsigned int owner);
void extent_io_tree_release(struct extent_io_tree *tree);
+int __lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+ u32 bits, struct extent_state **cached);
+bool __try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+ u32 bits, struct extent_state **cached);
-int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
- struct extent_state **cached);
+static inline int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached)
+{
+ return __lock_extent(tree, start, end, EXTENT_LOCKED, cached);
+}
-int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
- struct extent_state **cached);
+static inline bool try_lock_extent(struct extent_io_tree *tree, u64 start,
+ u64 end, struct extent_state **cached)
+{
+ return __try_lock_extent(tree, start, end, EXTENT_LOCKED, cached);
+}
int __init extent_state_init_cachep(void);
void __cold extent_state_free_cachep(void);
@@ -212,5 +225,22 @@ int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
u64 *end, u64 max_bytes,
struct extent_state **cached_state);
+static inline int lock_dio_extent(struct extent_io_tree *tree, u64 start,
+ u64 end, struct extent_state **cached)
+{
+ return __lock_extent(tree, start, end, EXTENT_DIO_LOCKED, cached);
+}
+
+static inline bool try_lock_dio_extent(struct extent_io_tree *tree, u64 start,
+ u64 end, struct extent_state **cached)
+{
+ return __try_lock_extent(tree, start, end, EXTENT_DIO_LOCKED, cached);
+}
+
+static inline int unlock_dio_extent(struct extent_io_tree *tree, u64 start,
+ u64 end, struct extent_state **cached)
+{
+ return __clear_extent_bit(tree, start, end, EXTENT_DIO_LOCKED, cached, NULL);
+}
#endif /* BTRFS_EXTENT_IO_TREE_H */
--
2.43.0
* [PATCH v2 2/3] btrfs: take the dio extent lock during O_DIRECT operations
2024-08-26 16:37 [PATCH v2 0/3] btrfs: no longer hold the extent lock for the entire read Josef Bacik
2024-08-26 16:37 ` [PATCH v2 1/3] btrfs: introduce EXTENT_DIO_LOCKED Josef Bacik
@ 2024-08-26 16:37 ` Josef Bacik
2024-08-26 16:37 ` [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read Josef Bacik
2024-08-27 0:53 ` [PATCH v2 0/3] btrfs: no longer hold the extent lock for the " David Sterba
3 siblings, 0 replies; 8+ messages in thread
From: Josef Bacik @ 2024-08-26 16:37 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Currently we hold the extent lock for the entire duration of a read.
This isn't really necessary in the buffered case because we're protected
by the page lock; however, it is necessary for O_DIRECT.
For O_DIRECT reads, if we only locked the extent for the part where we
get the extent, we could potentially race with an O_DIRECT write in the
same region. This isn't really a problem unless the read is delayed so
much that the write does the CoW, unpins the old extent, and some other
application re-allocates the extent before the read is actually able to
be submitted. At that point at best we'd have a csum mismatch, but at
worst we could read data that doesn't belong to us.
To address this potential race we need to make sure we don't have
overlapping, concurrent direct io reads and writes.
To accomplish this, use the new EXTENT_DIO_LOCKED bit in the direct IO
case in the same spot as the current extent lock. Writes will take this
bit while they're creating the ordered extent, which is also used to
make sure concurrent buffered reads or concurrent direct reads are not
allowed to occur, and will drop the bit once the ordered extent is
created. For reads it will act like the current EXTENT_LOCKED behavior:
we set it when starting the read, and we clear it in the end_io handler
to allow other direct writes to continue.
This still has the drawback of disallowing concurrent overlapping direct
reads, but that limitation exists with the current extent locking too.
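The lock ordering and error unwinding described above can be sketched as follows (a hypothetical simplified model; the real lock_extent_direct() operates on the extent io tree and also deals with flushing ordered extents in a retry loop):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define EXTENT_LOCKED      (1U << 0)
#define EXTENT_DIO_LOCKED  (1U << 1)

/* One range's lock state, standing in for the extent io tree. */
static uint32_t range_state;

static bool try_set_bit(uint32_t bit)
{
	if (range_state & bit)
		return false;
	range_state |= bit;
	return true;
}

/*
 * Nowait path: the dio lock must be taken before the extent lock, and
 * must be dropped again if the extent lock cannot be acquired, so a
 * failed attempt leaves no bits behind.
 */
static int lock_extent_direct_nowait(void)
{
	if (!try_set_bit(EXTENT_DIO_LOCKED))
		return -EAGAIN;
	if (!try_set_bit(EXTENT_LOCKED)) {
		range_state &= ~EXTENT_DIO_LOCKED;	/* unwind on failure */
		return -EAGAIN;
	}
	return 0;
}
```

A second attempt on an already-locked range fails cleanly with -EAGAIN, which is what lets the io_uring NOWAIT path bail out instead of blocking.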
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/direct-io.c | 47 +++++++++++++++++++++++++++++++++-----------
1 file changed, 35 insertions(+), 12 deletions(-)
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 67adbe9d294a..576f469cacee 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -40,11 +40,22 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
struct btrfs_ordered_extent *ordered;
int ret = 0;
+ /* Direct lock must be taken before the extent lock. */
+ if (nowait) {
+ if (!try_lock_dio_extent(io_tree, lockstart, lockend,
+ cached_state))
+ return -EAGAIN;
+ } else {
+ lock_dio_extent(io_tree, lockstart, lockend, cached_state);
+ }
+
while (1) {
if (nowait) {
if (!try_lock_extent(io_tree, lockstart, lockend,
- cached_state))
- return -EAGAIN;
+ cached_state)) {
+ ret = -EAGAIN;
+ break;
+ }
} else {
lock_extent(io_tree, lockstart, lockend, cached_state);
}
@@ -120,6 +131,8 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
cond_resched();
}
+ if (ret)
+ unlock_dio_extent(io_tree, lockstart, lockend, cached_state);
return ret;
}
@@ -546,8 +559,9 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
}
if (unlock_extents)
- unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
- &cached_state);
+ clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
+ EXTENT_LOCKED | EXTENT_DIO_LOCKED,
+ &cached_state);
else
free_extent_state(cached_state);
@@ -572,8 +586,13 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
return 0;
unlock_err:
- unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
- &cached_state);
+ /*
+ * Don't use EXTENT_LOCK_BITS here in case we extend it later and forget
+ * to update this, be explicit that we expect EXTENT_LOCKED and
+ * EXTENT_DIO_LOCKED to be set here, and so that's what we're clearing.
+ */
+ clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
+ EXTENT_LOCKED | EXTENT_DIO_LOCKED, &cached_state);
err:
if (dio_data->data_space_reserved) {
btrfs_free_reserved_data_space(BTRFS_I(inode),
@@ -596,8 +615,9 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
if (!write && (iomap->type == IOMAP_HOLE)) {
/* If reading from a hole, unlock and return */
- unlock_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1,
- NULL);
+ clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
+ pos + length - 1,
+ EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
return 0;
}
@@ -608,8 +628,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
btrfs_finish_ordered_extent(dio_data->ordered, NULL,
pos, length, false);
else
- unlock_extent(&BTRFS_I(inode)->io_tree, pos,
- pos + length - 1, NULL);
+ clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
+ pos + length - 1,
+ EXTENT_LOCKED | EXTENT_DIO_LOCKED,
+ NULL);
ret = -ENOTBLK;
}
if (write) {
@@ -641,8 +663,9 @@ static void btrfs_dio_end_io(struct btrfs_bio *bbio)
dip->file_offset, dip->bytes,
!bio->bi_status);
} else {
- unlock_extent(&inode->io_tree, dip->file_offset,
- dip->file_offset + dip->bytes - 1, NULL);
+ clear_extent_bit(&inode->io_tree, dip->file_offset,
+ dip->file_offset + dip->bytes - 1,
+ EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
}
bbio->bio.bi_private = bbio->private;
--
2.43.0
* [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read
2024-08-26 16:37 [PATCH v2 0/3] btrfs: no longer hold the extent lock for the entire read Josef Bacik
2024-08-26 16:37 ` [PATCH v2 1/3] btrfs: introduce EXTENT_DIO_LOCKED Josef Bacik
2024-08-26 16:37 ` [PATCH v2 2/3] btrfs: take the dio extent lock during O_DIRECT operations Josef Bacik
@ 2024-08-26 16:37 ` Josef Bacik
2024-08-27 19:00 ` Goldwyn Rodrigues
2024-11-25 12:11 ` Filipe Manana
2024-08-27 0:53 ` [PATCH v2 0/3] btrfs: no longer hold the extent lock for the " David Sterba
3 siblings, 2 replies; 8+ messages in thread
From: Josef Bacik @ 2024-08-26 16:37 UTC (permalink / raw)
To: linux-btrfs, kernel-team; +Cc: Goldwyn Rodrigues
Historically we've held the extent lock throughout the entire read.
There have been a few reasons for this, but it's mostly just caused us
problems. For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock. This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.
On the buffered side we are protected by the page lock: as long as we're
reading, things like buffered writes, punch hole, and even direct IO to
a certain degree will get hung up on the page lock while the page is in
flight.
On the direct side we have the dio extent lock, which acts much the same
way the extent lock worked prior to this patch, but only for direct
reads. This protects direct reads from concurrent direct writes, while
we're protected from buffered writes via the inode lock.
Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation. Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map. This portion was contributed by Goldwyn.
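The read/write asymmetry in when each bit gets dropped can be illustrated with a small sketch (the helper name is made up for illustration; in the patch this logic lives inline in btrfs_dio_iomap_begin()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXTENT_LOCKED      (1U << 0)
#define EXTENT_DIO_LOCKED  (1U << 1)

/*
 * After this patch, iomap_begin always drops EXTENT_LOCKED once the
 * extent map has been consumed. Writes also drop EXTENT_DIO_LOCKED
 * here, because the ordered extent now protects them; reads keep
 * EXTENT_DIO_LOCKED set until end_io completes.
 */
static uint32_t iomap_begin_unlock_bits(bool write)
{
	uint32_t unlock_bits = EXTENT_LOCKED;

	if (write)
		unlock_bits |= EXTENT_DIO_LOCKED;
	return unlock_bits;
}
```

So a direct read holds only the dio bit for the duration of the IO, and a concurrent direct write on the same range blocks on that bit rather than on the general extent lock.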
Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/compression.c | 2 +-
fs/btrfs/direct-io.c | 51 +++++++++++------------
fs/btrfs/extent_io.c | 94 ++----------------------------------------
3 files changed, 29 insertions(+), 118 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 832ab8984c41..511f81f312af 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -521,6 +521,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
}
add_size = min(em->start + em->len, page_end + 1) - cur;
free_extent_map(em);
+ unlock_extent(tree, cur, page_end, NULL);
if (folio->index == end_index) {
size_t zero_offset = offset_in_folio(folio, isize);
@@ -534,7 +535,6 @@ static noinline int add_ra_bio_pages(struct inode *inode,
if (!bio_add_folio(orig_bio, folio, add_size,
offset_in_folio(folio, cur))) {
- unlock_extent(tree, cur, page_end, NULL);
folio_unlock(folio);
folio_put(folio);
break;
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 576f469cacee..66f1ce5fcfd2 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -366,7 +366,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
int ret = 0;
u64 len = length;
const u64 data_alloc_len = length;
- bool unlock_extents = false;
+ u32 unlock_bits = EXTENT_LOCKED;
/*
* We could potentially fault if we have a buffer > PAGE_SIZE, and if
@@ -527,7 +527,6 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
start, &len, flags);
if (ret < 0)
goto unlock_err;
- unlock_extents = true;
/* Recalc len in case the new em is smaller than requested */
len = min(len, em->len - (start - em->start));
if (dio_data->data_space_reserved) {
@@ -548,23 +547,8 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
release_offset,
release_len);
}
- } else {
- /*
- * We need to unlock only the end area that we aren't using.
- * The rest is going to be unlocked by the endio routine.
- */
- lockstart = start + len;
- if (lockstart < lockend)
- unlock_extents = true;
}
- if (unlock_extents)
- clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
- EXTENT_LOCKED | EXTENT_DIO_LOCKED,
- &cached_state);
- else
- free_extent_state(cached_state);
-
/*
* Translate extent map information to iomap.
* We trim the extents (and move the addr) even though iomap code does
@@ -583,6 +567,23 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
iomap->length = len;
free_extent_map(em);
+ /*
+ * Reads will hold the EXTENT_DIO_LOCKED bit until the io is completed,
+ * writes only hold it for this part. We hold the extent lock until
+ * we're completely done with the extent map to make sure it remains
+ * valid.
+ */
+ if (write)
+ unlock_bits |= EXTENT_DIO_LOCKED;
+
+ clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
+ unlock_bits, &cached_state);
+
+ /* We didn't use everything, unlock the dio extent for the remainder. */
+ if (!write && (start + len) < lockend)
+ unlock_dio_extent(&BTRFS_I(inode)->io_tree, start + len,
+ lockend, NULL);
+
return 0;
unlock_err:
@@ -615,9 +616,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
if (!write && (iomap->type == IOMAP_HOLE)) {
/* If reading from a hole, unlock and return */
- clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
- pos + length - 1,
- EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
+ unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
+ pos + length - 1, NULL);
return 0;
}
@@ -628,10 +628,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
btrfs_finish_ordered_extent(dio_data->ordered, NULL,
pos, length, false);
else
- clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
- pos + length - 1,
- EXTENT_LOCKED | EXTENT_DIO_LOCKED,
- NULL);
+ unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
+ pos + length - 1, NULL);
ret = -ENOTBLK;
}
if (write) {
@@ -663,9 +661,8 @@ static void btrfs_dio_end_io(struct btrfs_bio *bbio)
dip->file_offset, dip->bytes,
!bio->bi_status);
} else {
- clear_extent_bit(&inode->io_tree, dip->file_offset,
- dip->file_offset + dip->bytes - 1,
- EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
+ unlock_dio_extent(&inode->io_tree, dip->file_offset,
+ dip->file_offset + dip->bytes - 1, NULL);
}
bbio->bio.bi_private = bbio->private;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6083bed89df2..77161116af7a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -480,75 +480,6 @@ static void end_bbio_data_write(struct btrfs_bio *bbio)
bio_put(bio);
}
-/*
- * Record previously processed extent range
- *
- * For endio_readpage_release_extent() to handle a full extent range, reducing
- * the extent io operations.
- */
-struct processed_extent {
- struct btrfs_inode *inode;
- /* Start of the range in @inode */
- u64 start;
- /* End of the range in @inode */
- u64 end;
- bool uptodate;
-};
-
-/*
- * Try to release processed extent range
- *
- * May not release the extent range right now if the current range is
- * contiguous to processed extent.
- *
- * Will release processed extent when any of @inode, @uptodate, the range is
- * no longer contiguous to the processed range.
- *
- * Passing @inode == NULL will force processed extent to be released.
- */
-static void endio_readpage_release_extent(struct processed_extent *processed,
- struct btrfs_inode *inode, u64 start, u64 end,
- bool uptodate)
-{
- struct extent_state *cached = NULL;
- struct extent_io_tree *tree;
-
- /* The first extent, initialize @processed */
- if (!processed->inode)
- goto update;
-
- /*
- * Contiguous to processed extent, just uptodate the end.
- *
- * Several things to notice:
- *
- * - bio can be merged as long as on-disk bytenr is contiguous
- * This means we can have page belonging to other inodes, thus need to
- * check if the inode still matches.
- * - bvec can contain range beyond current page for multi-page bvec
- * Thus we need to do processed->end + 1 >= start check
- */
- if (processed->inode == inode && processed->uptodate == uptodate &&
- processed->end + 1 >= start && end >= processed->end) {
- processed->end = end;
- return;
- }
-
- tree = &processed->inode->io_tree;
- /*
- * Now we don't have range contiguous to the processed range, release
- * the processed range now.
- */
- unlock_extent(tree, processed->start, processed->end, &cached);
-
-update:
- /* Update processed to current range */
- processed->inode = inode;
- processed->start = start;
- processed->end = end;
- processed->uptodate = uptodate;
-}
-
static void begin_folio_read(struct btrfs_fs_info *fs_info, struct folio *folio)
{
ASSERT(folio_test_locked(folio));
@@ -575,7 +506,6 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
{
struct btrfs_fs_info *fs_info = bbio->fs_info;
struct bio *bio = &bbio->bio;
- struct processed_extent processed = { 0 };
struct folio_iter fi;
const u32 sectorsize = fs_info->sectorsize;
@@ -640,11 +570,7 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
/* Update page status and unlock. */
end_folio_read(folio, uptodate, start, len);
- endio_readpage_release_extent(&processed, BTRFS_I(inode),
- start, end, uptodate);
}
- /* Release the last extent */
- endio_readpage_release_extent(&processed, NULL, 0, 0, false);
bio_put(bio);
}
@@ -973,6 +899,7 @@ static struct extent_map *__get_extent_map(struct inode *inode,
u64 len, struct extent_map **em_cached)
{
struct extent_map *em;
+ struct extent_state *cached_state = NULL;
ASSERT(em_cached);
@@ -988,12 +915,15 @@ static struct extent_map *__get_extent_map(struct inode *inode,
*em_cached = NULL;
}
+ btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), start, start + len - 1, &cached_state);
em = btrfs_get_extent(BTRFS_I(inode), folio, start, len);
if (!IS_ERR(em)) {
BUG_ON(*em_cached);
refcount_inc(&em->refs);
*em_cached = em;
}
+ unlock_extent(&BTRFS_I(inode)->io_tree, start, start + len - 1, &cached_state);
+
return em;
}
/*
@@ -1019,11 +949,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
size_t pg_offset = 0;
size_t iosize;
size_t blocksize = fs_info->sectorsize;
- struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
ret = set_folio_extent_mapped(folio);
if (ret < 0) {
- unlock_extent(tree, start, end, NULL);
folio_unlock(folio);
return ret;
}
@@ -1047,14 +975,12 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
if (cur >= last_byte) {
iosize = folio_size(folio) - pg_offset;
folio_zero_range(folio, pg_offset, iosize);
- unlock_extent(tree, cur, cur + iosize - 1, NULL);
end_folio_read(folio, true, cur, iosize);
break;
}
em = __get_extent_map(inode, folio, cur, end - cur + 1,
em_cached);
if (IS_ERR(em)) {
- unlock_extent(tree, cur, end, NULL);
end_folio_read(folio, false, cur, end + 1 - cur);
return PTR_ERR(em);
}
@@ -1123,7 +1049,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
if (block_start == EXTENT_MAP_HOLE) {
folio_zero_range(folio, pg_offset, iosize);
- unlock_extent(tree, cur, cur + iosize - 1, NULL);
end_folio_read(folio, true, cur, iosize);
cur = cur + iosize;
pg_offset += iosize;
@@ -1131,7 +1056,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
}
/* the get_extent function already copied into the folio */
if (block_start == EXTENT_MAP_INLINE) {
- unlock_extent(tree, cur, cur + iosize - 1, NULL);
end_folio_read(folio, true, cur, iosize);
cur = cur + iosize;
pg_offset += iosize;
@@ -1156,15 +1080,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
int btrfs_read_folio(struct file *file, struct folio *folio)
{
- struct btrfs_inode *inode = folio_to_inode(folio);
- u64 start = folio_pos(folio);
- u64 end = start + folio_size(folio) - 1;
struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
struct extent_map *em_cached = NULL;
int ret;
- btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
-
ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
free_extent_map(em_cached);
@@ -2337,15 +2256,10 @@ int btrfs_writepages(struct address_space *mapping, struct writeback_control *wb
void btrfs_readahead(struct readahead_control *rac)
{
struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
- struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
struct folio *folio;
- u64 start = readahead_pos(rac);
- u64 end = start + readahead_length(rac) - 1;
struct extent_map *em_cached = NULL;
u64 prev_em_start = (u64)-1;
- btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
-
while ((folio = readahead_folio(rac)) != NULL)
btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
--
2.43.0
* Re: [PATCH v2 0/3] btrfs: no longer hold the extent lock for the entire read
2024-08-26 16:37 [PATCH v2 0/3] btrfs: no longer hold the extent lock for the entire read Josef Bacik
` (2 preceding siblings ...)
2024-08-26 16:37 ` [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read Josef Bacik
@ 2024-08-27 0:53 ` David Sterba
3 siblings, 0 replies; 8+ messages in thread
From: David Sterba @ 2024-08-27 0:53 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
On Mon, Aug 26, 2024 at 12:37:23PM -0400, Josef Bacik wrote:
> v1: https://lore.kernel.org/linux-btrfs/cover.1724255475.git.josef@toxicpanda.com/
>
> v1->v2:
> - Added Goldwyn's bit to move the extent lock to only cover the part where we
> look up the extent map, added his Co-developed-by to the patch.
>
> -- Original email --
> Hello,
>
> One of the biggest obstacles to switching to iomap is that our extent locking is
> a bit wonky. We've made a lot of progress on the write side to drastically
> narrow the scope of the extent lock, but the read side is arguably the most
> problematic in that we hold it until the readpage is completed.
>
> This patch series addresses this by no longer holding the lock for the entire
> IO. This is safe on the buffered side because we are protected by the page lock
> and the checks for ordered extents when we start the write.
>
> The direct io side is the more problematic side, since we could now end up with
> overlapping and concurrent direct writes in direct read ranges. To solve this
> problem I'm introducing a new extent io bit to do range locking for direct IO.
> This will work basically the same as the extent lock did before, but only for
> direct IO. We are saved by the normal inode lock and page cache in the mixed
> buffered and direct case, so this shouldn't carry the same iomap+locking
> reloated woes that the extent lock did.
>
> This will also allow us to remove the page fault restrictions in the direct IO
> case, which will be done in a followup series.
>
> I've run this through the CI and a lot of local testing, I'm keeping it small
> and targeted because this is a pretty big logical shift for us, so I want to
> make sure we get good testing on it before I go doing the other larger projects.
> Thanks,
We don't have anything that would interfere with that in for-next; the
folio conversions are mostly direct and there were no new problems
reported. The read locking change will probably lead to some
performance gain, so I guess this is a good improvement for the next
release.
>
> Josef
>
> Josef Bacik (3):
> btrfs: introduce EXTENT_DIO_LOCKED
> btrfs: take the dio extent lock during O_DIRECT operations
Patches 1 and 2 are easy, so
Reviewed-by: David Sterba <dsterba@suse.com>
> btrfs: do not hold the extent lock for entire read
I don't see anything obviously wrong here; the logic change makes sense
and the code follows it.
* Re: [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read
2024-08-26 16:37 ` [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read Josef Bacik
@ 2024-08-27 19:00 ` Goldwyn Rodrigues
2024-11-25 12:11 ` Filipe Manana
1 sibling, 0 replies; 8+ messages in thread
From: Goldwyn Rodrigues @ 2024-08-27 19:00 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
On 12:37 26/08, Josef Bacik wrote:
> Historically we've held the extent lock throughout the entire read.
> There's been a few reasons for this, but it's mostly just caused us
> problems. For example, this prevents us from allowing page faults
> during direct io reads, because we could deadlock. This has forced us
> to only allow 4k reads at a time for io_uring NOWAIT requests because we
> have no idea if we'll be forced to page fault and thus have to do a
> whole lot of work.
>
> On the buffered side we are protected by the page lock, as long as we're
> reading things like buffered writes, punch hole, and even direct IO to a
> certain degree will get hung up on the page lock while the page is in
> flight.
>
> On the direct side we have the dio extent lock, which acts much like the
> way the extent lock worked previously to this patch, however just for
> direct reads. This protects direct reads from concurrent direct writes,
> while we're protected from buffered writes via the inode lock.
>
> Now that we're protected in all cases, narrow the extent lock to the
> part where we're getting the extent map to submit the reads, no longer
> holding the extent lock for the entire read operation. Push the extent
> lock down into do_readpage() so that we're only grabbing it when looking
> up the extent map. This portion was contributed by Goldwyn.
>
> Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Looks good.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
* Re: [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read
2024-08-26 16:37 ` [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read Josef Bacik
2024-08-27 19:00 ` Goldwyn Rodrigues
@ 2024-11-25 12:11 ` Filipe Manana
2024-11-25 12:17 ` Filipe Manana
1 sibling, 1 reply; 8+ messages in thread
From: Filipe Manana @ 2024-11-25 12:11 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team, Goldwyn Rodrigues
On Mon, Aug 26, 2024 at 5:39 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Historically we've held the extent lock throughout the entire read.
> There's been a few reasons for this, but it's mostly just caused us
> problems. For example, this prevents us from allowing page faults
> during direct io reads, because we could deadlock. This has forced us
> to only allow 4k reads at a time for io_uring NOWAIT requests because we
> have no idea if we'll be forced to page fault and thus have to do a
> whole lot of work.
>
> On the buffered side we are protected by the page lock, as long as we're
> reading things like buffered writes, punch hole, and even direct IO to a
> certain degree will get hung up on the page lock while the page is in
> flight.
>
> On the direct side we have the dio extent lock, which acts much like the
> way the extent lock worked previously to this patch, however just for
> direct reads. This protects direct reads from concurrent direct writes,
> while we're protected from buffered writes via the inode lock.
>
> Now that we're protected in all cases, narrow the extent lock to the
> part where we're getting the extent map to submit the reads, no longer
> holding the extent lock for the entire read operation. Push the extent
> lock down into do_readpage() so that we're only grabbing it when looking
> up the extent map. This portion was contributed by Goldwyn.
>
> Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/compression.c | 2 +-
> fs/btrfs/direct-io.c | 51 +++++++++++------------
> fs/btrfs/extent_io.c | 94 ++----------------------------------------
> 3 files changed, 29 insertions(+), 118 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 832ab8984c41..511f81f312af 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -521,6 +521,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
> }
> add_size = min(em->start + em->len, page_end + 1) - cur;
> free_extent_map(em);
> + unlock_extent(tree, cur, page_end, NULL);
>
> if (folio->index == end_index) {
> size_t zero_offset = offset_in_folio(folio, isize);
> @@ -534,7 +535,6 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>
> if (!bio_add_folio(orig_bio, folio, add_size,
> offset_in_folio(folio, cur))) {
> - unlock_extent(tree, cur, page_end, NULL);
> folio_unlock(folio);
> folio_put(folio);
> break;
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 576f469cacee..66f1ce5fcfd2 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -366,7 +366,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> int ret = 0;
> u64 len = length;
> const u64 data_alloc_len = length;
> - bool unlock_extents = false;
> + u32 unlock_bits = EXTENT_LOCKED;
>
> /*
> * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> @@ -527,7 +527,6 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> start, &len, flags);
> if (ret < 0)
> goto unlock_err;
> - unlock_extents = true;
> /* Recalc len in case the new em is smaller than requested */
> len = min(len, em->len - (start - em->start));
> if (dio_data->data_space_reserved) {
> @@ -548,23 +547,8 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> release_offset,
> release_len);
> }
> - } else {
> - /*
> - * We need to unlock only the end area that we aren't using.
> - * The rest is going to be unlocked by the endio routine.
> - */
> - lockstart = start + len;
> - if (lockstart < lockend)
> - unlock_extents = true;
> }
>
> - if (unlock_extents)
> - clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> - EXTENT_LOCKED | EXTENT_DIO_LOCKED,
> - &cached_state);
> - else
> - free_extent_state(cached_state);
> -
> /*
> * Translate extent map information to iomap.
> * We trim the extents (and move the addr) even though iomap code does
> @@ -583,6 +567,23 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> iomap->length = len;
> free_extent_map(em);
>
> + /*
> + * Reads will hold the EXTENT_DIO_LOCKED bit until the io is completed,
> + * writes only hold it for this part. We hold the extent lock until
> + * we're completely done with the extent map to make sure it remains
> + * valid.
> + */
> + if (write)
> + unlock_bits |= EXTENT_DIO_LOCKED;
> +
> + clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> + unlock_bits, &cached_state);
> +
> + /* We didn't use everything, unlock the dio extent for the remainder. */
> + if (!write && (start + len) < lockend)
> + unlock_dio_extent(&BTRFS_I(inode)->io_tree, start + len,
> + lockend, NULL);
> +
> return 0;
>
> unlock_err:
> @@ -615,9 +616,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
>
> if (!write && (iomap->type == IOMAP_HOLE)) {
> /* If reading from a hole, unlock and return */
> - clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
> - pos + length - 1,
> - EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
> + unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
> + pos + length - 1, NULL);
> return 0;
> }
>
> @@ -628,10 +628,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> pos, length, false);
> else
> - clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
> - pos + length - 1,
> - EXTENT_LOCKED | EXTENT_DIO_LOCKED,
> - NULL);
> + unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
> + pos + length - 1, NULL);
> ret = -ENOTBLK;
> }
> if (write) {
> @@ -663,9 +661,8 @@ static void btrfs_dio_end_io(struct btrfs_bio *bbio)
> dip->file_offset, dip->bytes,
> !bio->bi_status);
> } else {
> - clear_extent_bit(&inode->io_tree, dip->file_offset,
> - dip->file_offset + dip->bytes - 1,
> - EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
> + unlock_dio_extent(&inode->io_tree, dip->file_offset,
> + dip->file_offset + dip->bytes - 1, NULL);
> }
>
> bbio->bio.bi_private = bbio->private;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6083bed89df2..77161116af7a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -480,75 +480,6 @@ static void end_bbio_data_write(struct btrfs_bio *bbio)
> bio_put(bio);
> }
>
> -/*
> - * Record previously processed extent range
> - *
> - * For endio_readpage_release_extent() to handle a full extent range, reducing
> - * the extent io operations.
> - */
> -struct processed_extent {
> - struct btrfs_inode *inode;
> - /* Start of the range in @inode */
> - u64 start;
> - /* End of the range in @inode */
> - u64 end;
> - bool uptodate;
> -};
> -
> -/*
> - * Try to release processed extent range
> - *
> - * May not release the extent range right now if the current range is
> - * contiguous to processed extent.
> - *
> - * Will release processed extent when any of @inode, @uptodate, the range is
> - * no longer contiguous to the processed range.
> - *
> - * Passing @inode == NULL will force processed extent to be released.
> - */
> -static void endio_readpage_release_extent(struct processed_extent *processed,
> - struct btrfs_inode *inode, u64 start, u64 end,
> - bool uptodate)
> -{
> - struct extent_state *cached = NULL;
> - struct extent_io_tree *tree;
> -
> - /* The first extent, initialize @processed */
> - if (!processed->inode)
> - goto update;
> -
> - /*
> - * Contiguous to processed extent, just uptodate the end.
> - *
> - * Several things to notice:
> - *
> - * - bio can be merged as long as on-disk bytenr is contiguous
> - * This means we can have page belonging to other inodes, thus need to
> - * check if the inode still matches.
> - * - bvec can contain range beyond current page for multi-page bvec
> - * Thus we need to do processed->end + 1 >= start check
> - */
> - if (processed->inode == inode && processed->uptodate == uptodate &&
> - processed->end + 1 >= start && end >= processed->end) {
> - processed->end = end;
> - return;
> - }
> -
> - tree = &processed->inode->io_tree;
> - /*
> - * Now we don't have range contiguous to the processed range, release
> - * the processed range now.
> - */
> - unlock_extent(tree, processed->start, processed->end, &cached);
> -
> -update:
> - /* Update processed to current range */
> - processed->inode = inode;
> - processed->start = start;
> - processed->end = end;
> - processed->uptodate = uptodate;
> -}
> -
> static void begin_folio_read(struct btrfs_fs_info *fs_info, struct folio *folio)
> {
> ASSERT(folio_test_locked(folio));
> @@ -575,7 +506,6 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
> {
> struct btrfs_fs_info *fs_info = bbio->fs_info;
> struct bio *bio = &bbio->bio;
> - struct processed_extent processed = { 0 };
> struct folio_iter fi;
> const u32 sectorsize = fs_info->sectorsize;
>
> @@ -640,11 +570,7 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
>
> /* Update page status and unlock. */
> end_folio_read(folio, uptodate, start, len);
> - endio_readpage_release_extent(&processed, BTRFS_I(inode),
> - start, end, uptodate);
> }
> - /* Release the last extent */
> - endio_readpage_release_extent(&processed, NULL, 0, 0, false);
> bio_put(bio);
> }
>
> @@ -973,6 +899,7 @@ static struct extent_map *__get_extent_map(struct inode *inode,
> u64 len, struct extent_map **em_cached)
> {
> struct extent_map *em;
> + struct extent_state *cached_state = NULL;
>
> ASSERT(em_cached);
>
> @@ -988,12 +915,15 @@ static struct extent_map *__get_extent_map(struct inode *inode,
> *em_cached = NULL;
> }
>
> + btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), start, start + len - 1, &cached_state);
> em = btrfs_get_extent(BTRFS_I(inode), folio, start, len);
> if (!IS_ERR(em)) {
> BUG_ON(*em_cached);
> refcount_inc(&em->refs);
> *em_cached = em;
> }
> + unlock_extent(&BTRFS_I(inode)->io_tree, start, start + len - 1, &cached_state);
Sorry for the late review, but I missed this when it was posted; I was
busy with other stuff and summer holidays were approaching.
I don't see how this is safe against concurrent COW writes. This leaves
some races that either trigger checksum failures, making the reads fail
with -EIO, or allow a read to return data from another file (which can
be seen as a security issue).
So before this patch we locked the extent range when we started the
read and unlocked it only after the read bios finished - after the data
was read from disk and we verified it matched the checksums.
But now we lock it only to get an extent map, and we unlock it
immediately after getting the extent map.
So the first unsafe scenario, for the simple case of reading a single 4K page:
1) We lock the extent range, get the extent map, which let's say points
to an extent at disk_bytenr X, and then unlock the range;
2) We read the checksums from the csum tree and submit the read bios
later, when calling btrfs_submit_bbio() -> btrfs_submit_chunk();
3) But before we get into btrfs_submit_bbio(), a COW write comes in for
the 4K range, which will now proceed because we are no longer locking
the extent range for the whole duration of the read.
This results in removing extent X, replacing it with a new one, and
pinning it in the current transaction N;
4) Still before we reach btrfs_submit_bbio(), transaction N is
committed, so extent X is unpinned and from now on anyone can allocate
that extent;
5) A write for another file happens and it allocates extent X, writes
to it, and adds checksums to the csum tree;
6) Our read proceeds and enters btrfs_submit_bbio(), so we submit a
read for extent X and the read will succeed, since the checksums match
the extent's data.
We end up reading data from another file - this can cause all sorts of
unexpected results for applications, and can also be seen as a security
issue.
Another possible race, again for reading a single 4K page (simplest case):
1) We lock the extent range, get the extent map, which let's say points
to an extent at disk_bytenr X, and then unlock the range;
2) We get into btrfs_submit_bbio() -> btrfs_submit_chunk() ->
btrfs_lookup_bio_sums() and read the checksums for the extent from the
csum tree;
3) Before we submit the read bio(s), a COW write for this range
happens, so it drops extent X and pins it in the current transaction N;
4) Transaction N is committed, so extent X is unpinned and can now be
re-allocated by anyone;
5) A write for another file happens and it allocates extent X, writing
data to it;
6) The read proceeds and submits a bio to read extent X;
7) When the bio completes we do checksum validation and it fails, since
the data does not match the checksums we got in step 2; as a result the
read fails with -EIO due to a checksum failure.
These are races we didn't have before, since we unlocked the extent
range only after the bio completed.
Please tell me that I'm missing something here and these races are impossible.
Thanks.
> +
> return em;
> }
> /*
> @@ -1019,11 +949,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> size_t pg_offset = 0;
> size_t iosize;
> size_t blocksize = fs_info->sectorsize;
> - struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>
> ret = set_folio_extent_mapped(folio);
> if (ret < 0) {
> - unlock_extent(tree, start, end, NULL);
> folio_unlock(folio);
> return ret;
> }
> @@ -1047,14 +975,12 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> if (cur >= last_byte) {
> iosize = folio_size(folio) - pg_offset;
> folio_zero_range(folio, pg_offset, iosize);
> - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> end_folio_read(folio, true, cur, iosize);
> break;
> }
> em = __get_extent_map(inode, folio, cur, end - cur + 1,
> em_cached);
> if (IS_ERR(em)) {
> - unlock_extent(tree, cur, end, NULL);
> end_folio_read(folio, false, cur, end + 1 - cur);
> return PTR_ERR(em);
> }
> @@ -1123,7 +1049,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> if (block_start == EXTENT_MAP_HOLE) {
> folio_zero_range(folio, pg_offset, iosize);
>
> - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> end_folio_read(folio, true, cur, iosize);
> cur = cur + iosize;
> pg_offset += iosize;
> @@ -1131,7 +1056,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> }
> /* the get_extent function already copied into the folio */
> if (block_start == EXTENT_MAP_INLINE) {
> - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> end_folio_read(folio, true, cur, iosize);
> cur = cur + iosize;
> pg_offset += iosize;
> @@ -1156,15 +1080,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>
> int btrfs_read_folio(struct file *file, struct folio *folio)
> {
> - struct btrfs_inode *inode = folio_to_inode(folio);
> - u64 start = folio_pos(folio);
> - u64 end = start + folio_size(folio) - 1;
> struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
> struct extent_map *em_cached = NULL;
> int ret;
>
> - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
> -
> ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> free_extent_map(em_cached);
>
> @@ -2337,15 +2256,10 @@ int btrfs_writepages(struct address_space *mapping, struct writeback_control *wb
> void btrfs_readahead(struct readahead_control *rac)
> {
> struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
> - struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> struct folio *folio;
> - u64 start = readahead_pos(rac);
> - u64 end = start + readahead_length(rac) - 1;
> struct extent_map *em_cached = NULL;
> u64 prev_em_start = (u64)-1;
>
> - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
> -
> while ((folio = readahead_folio(rac)) != NULL)
> btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
>
> --
> 2.43.0
>
>
* Re: [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read
2024-11-25 12:11 ` Filipe Manana
@ 2024-11-25 12:17 ` Filipe Manana
0 siblings, 0 replies; 8+ messages in thread
From: Filipe Manana @ 2024-11-25 12:17 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team, Goldwyn Rodrigues
On Mon, Nov 25, 2024 at 12:11 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Mon, Aug 26, 2024 at 5:39 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > [ full patch quoted in the previous message snipped ]
>
> Sorry for the late review, but I missed this when it was posted, was
> busy with other stuff and summer holidays were approaching.
>
> I don't see how this is safe against concurrent COW writes, this
> leaves here some races that trigger either checksum
> failures making the reads fail with -EIO or allow a read to read data
> from another file (and that can be seen as a
> security issue).
>
> So before this patch we locked the extent range when we started the
> read and then unlocked it only after read bios
> finished - after the data was read from disk and we verified it
> matches the checksums.
>
> But now we lock it only to get an extent map, and we unlock it
> immediately after getting the extent map.
>
> So first unsafe scenario, for the simple case of reading a single 4K page:
>
> 1) We lock the extent range, get the extent map, which lets say points
> to an extent at disk_bytenr X, and then unlock the range;
>
> 2) We read the checksums from the csum tree and submit the read bios
> later, when calling btrfs_submit_bbio() -> btrfs_submit_chunk();
>
> 3) But before we go into btrfs_submit_bbio(), a COW write comes for 4K
> range, which will now proceed because we are not locking
> the extent range for the whole duration of the read anymore.
>
> This results in removing extent X, replacing it with a new one, and
> pinning it in the current transaction N;
>
> 4) Still before we reach btrfs_submit_bbio(), transaction N is
> committed, so extent X is unpinned and from now on anyone can
> allocate that extent;
>
> 5) A write for another file happens and it allocates extent X, writes
> to it, and adds checksums to the csum tree;
>
> 6) Our read proceeds and enters btrfs_submit_bbio().
> So we submit a read for extent X and the read will succeed since
> the checksums match the extent's data.
>
> We end up reading data from another file - this can cause all
> sorts of unexpected results for applications, and can also be
> seen as a security issue.
>
> Another possible race, again for reading a single 4K page (simplest case):
>
> 1) We lock the extent range, get the extent map, which lets say points
> to an extent at disk_bytenr X, and then unlock the range;
>
> 2) We get into btrfs_submit_bbio() -> btrfs_submit_chunk() ->
> btrfs_lookup_bio_sums(), which reads the checksums for the extent
> from the csum tree;
>
> 3) Before we submit the read bio(s), a COW write for this range
> happens, so it drops extent X and pins it in the current transaction
> N;
>
> 4) Transaction N is committed, so extent X is unpinned and can now be
> re-allocated by anyone;
>
> 5) A write for another file happens and it allocates extent X, writing
> data to it;
>
> 6) The read proceeds and submits a bio to read extent X;
>
> 7) When the bio completes we do checksum validation and it fails,
> since the data does not match the checksums we got in step 2. As a
> result the read fails with -EIO due to the checksum mismatch.
>
> These are races we didn't have before, since we unlocked the extent
> range only after the bio completed.
Forgot to mention, those writes are direct IO writes, so they don't
use page locks to serialize with the reads.
>
> Please tell me that I'm missing something here and these races are impossible.
>
> Thanks.
>
> > +
> > return em;
> > }
> > /*
> > @@ -1019,11 +949,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > size_t pg_offset = 0;
> > size_t iosize;
> > size_t blocksize = fs_info->sectorsize;
> > - struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> >
> > ret = set_folio_extent_mapped(folio);
> > if (ret < 0) {
> > - unlock_extent(tree, start, end, NULL);
> > folio_unlock(folio);
> > return ret;
> > }
> > @@ -1047,14 +975,12 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > if (cur >= last_byte) {
> > iosize = folio_size(folio) - pg_offset;
> > folio_zero_range(folio, pg_offset, iosize);
> > - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> > end_folio_read(folio, true, cur, iosize);
> > break;
> > }
> > em = __get_extent_map(inode, folio, cur, end - cur + 1,
> > em_cached);
> > if (IS_ERR(em)) {
> > - unlock_extent(tree, cur, end, NULL);
> > end_folio_read(folio, false, cur, end + 1 - cur);
> > return PTR_ERR(em);
> > }
> > @@ -1123,7 +1049,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > if (block_start == EXTENT_MAP_HOLE) {
> > folio_zero_range(folio, pg_offset, iosize);
> >
> > - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> > end_folio_read(folio, true, cur, iosize);
> > cur = cur + iosize;
> > pg_offset += iosize;
> > @@ -1131,7 +1056,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > }
> > /* the get_extent function already copied into the folio */
> > if (block_start == EXTENT_MAP_INLINE) {
> > - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> > end_folio_read(folio, true, cur, iosize);
> > cur = cur + iosize;
> > pg_offset += iosize;
> > @@ -1156,15 +1080,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> >
> > int btrfs_read_folio(struct file *file, struct folio *folio)
> > {
> > - struct btrfs_inode *inode = folio_to_inode(folio);
> > - u64 start = folio_pos(folio);
> > - u64 end = start + folio_size(folio) - 1;
> > struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
> > struct extent_map *em_cached = NULL;
> > int ret;
> >
> > - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
> > -
> > ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> > free_extent_map(em_cached);
> >
> > @@ -2337,15 +2256,10 @@ int btrfs_writepages(struct address_space *mapping, struct writeback_control *wb
> > void btrfs_readahead(struct readahead_control *rac)
> > {
> > struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
> > - struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> > struct folio *folio;
> > - u64 start = readahead_pos(rac);
> > - u64 end = start + readahead_length(rac) - 1;
> > struct extent_map *em_cached = NULL;
> > u64 prev_em_start = (u64)-1;
> >
> > - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
> > -
> > while ((folio = readahead_folio(rac)) != NULL)
> > btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> >
> > --
> > 2.43.0
> >
> >
Thread overview: 8+ messages
2024-08-26 16:37 [PATCH v2 0/3] btrfs: no longer hold the extent lock for the entire read Josef Bacik
2024-08-26 16:37 ` [PATCH v2 1/3] btrfs: introduce EXTENT_DIO_LOCKED Josef Bacik
2024-08-26 16:37 ` [PATCH v2 2/3] btrfs: take the dio extent lock during O_DIRECT operations Josef Bacik
2024-08-26 16:37 ` [PATCH v2 3/3] btrfs: do not hold the extent lock for entire read Josef Bacik
2024-08-27 19:00 ` Goldwyn Rodrigues
2024-11-25 12:11 ` Filipe Manana
2024-11-25 12:17 ` Filipe Manana
2024-08-27 0:53 ` [PATCH v2 0/3] btrfs: no longer hold the extent lock for the " David Sterba