small writeback fixes

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* small writeback fixes
@ 2023-07-13 13:04 Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 1/9] btrfs: don't stop integrity writeback too early Christoph Hellwig
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

Hi all,

this series has various fixes for bugs found in inspect or only triggered
with upcoming changes that are a fallout from my work on bound lifetimes
for the ordered extent and better confirming to expectations from the
common writeback code.

Note that this series builds on the "btrfs compressed writeback cleanups"
series sent out previously.

A git tree is also available here:

    git://git.infradead.org/users/hch/misc.git btrfs-writeback-fixes

Gitweb:

    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-writeback-fixes

Diffatat:
 extent_io.c |  182 ++++++++++++++++++++++++++++++++++++------------------------
 inode.c     |   16 +----
 2 files changed, 117 insertions(+), 81 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/9] btrfs: don't stop integrity writeback too early
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 2/9] btrfs: don't wait for writeback on clean pages in extent_write_cache_pages Christoph Hellwig
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero.  That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range.  Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.

This is a port of write_cache_pages changes in commit 05fe478dd04e
("mm: write_cache_pages integrity fix").

Note that I've only trigger the problem with other changes to the btrfs
writeback code, but this condition seems worthwhile fixing anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f2b9c72ea0c104..cb8c5d06fe2304 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2098,7 +2098,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
 			 * We have to make sure to honor the new nr_to_write
 			 * at any time
 			 */
-			nr_to_write_done = wbc->nr_to_write <= 0;
+			nr_to_write_done = wbc->sync_mode == WB_SYNC_NONE &&
+						wbc->nr_to_write <= 0;
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
-- 
2.39.2

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/9] btrfs: don't wait for writeback on clean pages in extent_write_cache_pages
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 1/9] btrfs: don't stop integrity writeback too early Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 3/9] btrfs: fix an error handling corner case in cow_file_range Christoph Hellwig
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

__extent_writepage can have started on more pages than the one it was
called for.  This happens regularly for zoned file systems, and in theory
could happen for compressed I/O if the worker thread was executed very
quicky. For such pages extent_write_cache_pages waits for writeback
to complete before moving on to the next page, which is highly inefficient
as it blocks the flusher thread

Port over the PageDirty check that was added to write_cache_pages in
commit 515f4a037fb ("mm: write_cache_pages optimise page cleaning") to
fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb8c5d06fe2304..2ae3badaf36df6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2075,6 +2075,12 @@ static int extent_write_cache_pages(struct address_space *mapping,
 				continue;
 			}
 
+			if (!folio_test_dirty(folio)) {
+				/* someone wrote it for us */
+				folio_unlock(folio);
+				continue;
+			}
+
 			if (wbc->sync_mode != WB_SYNC_NONE) {
 				if (folio_test_writeback(folio))
 					submit_write_bio(bio_ctrl, 0);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/9] btrfs: fix an error handling corner case in cow_file_range
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 1/9] btrfs: don't stop integrity writeback too early Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 2/9] btrfs: don't wait for writeback on clean pages in extent_write_cache_pages Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 4/9] btrfs: move the cow_fixup earlier in writepages handling Christoph Hellwig
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

When the call to btrfs_reloc_clone_csums in cow_file_range returns an
error, we jump to the out_unlock label with the extent_reserved variable
set to false.   The cleanup at the label will then call
extent_clear_unlock_delalloc on the range from start to end.  But we've
already added cur_alloc_size to start, so there might no range be left
from the newly increment start to end.  Move the check for start < end
so that it is reached by also for the !extent_reserved case.

Fixes: a315e68f6e8b ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ae1bda58004b26..20bee7f8dae01c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1548,8 +1548,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 					     clear_bits,
 					     page_ops);
 		start += cur_alloc_size;
-		if (start >= end)
-			return ret;
 	}
 
 	/*
@@ -1558,9 +1556,11 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	 * space_info's bytes_may_use counter, reserved in
 	 * btrfs_check_data_free_space().
 	 */
-	extent_clear_unlock_delalloc(inode, start, end, locked_page,
-				     clear_bits | EXTENT_CLEAR_DATA_RESV,
-				     page_ops);
+	if (start < end) {
+		clear_bits |= EXTENT_CLEAR_DATA_RESV;
+		extent_clear_unlock_delalloc(inode, start, end, locked_page,
+					     clear_bits, page_ops);
+	}
 	return ret;
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/9] btrfs: move the cow_fixup earlier in writepages handling
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (2 preceding siblings ...)
  2023-07-13 13:04 ` [PATCH 3/9] btrfs: fix an error handling corner case in cow_file_range Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 5/9] btrfs: fix handling of errors from __extent_writepage_io Christoph Hellwig
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

btrfs has a special fixup for pages that are marked dirty without having
space reserved for them.  But the place where is is run means it can't
work for I/O that isn't kicked off inline from __extent_writepage, most
notable compressed I/O and I/O to zoned file systems.

Move the fixup earlier based on not findining any delalloc range in the
I/O tree to cover this case as well instead of relying on the fairly
obscure fallthrough behavior that calls __extent_write_page_io even when
no delalloc space was found.

Fixes: c8b978188c9a ("Btrfs: Add zlib compression support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2ae3badaf36df6..d8185582b9f4b0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1153,6 +1153,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
+	bool found_delalloc = false;
 	int ret = 0;
 
 	while (delalloc_start < page_end) {
@@ -1169,6 +1170,22 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 			return ret;
 
 		delalloc_start = delalloc_end + 1;
+		found_delalloc = true;
+	}
+
+	/*
+	 * If we did not find any delalloc range in the io_tree, this must be
+	 * the rare case of dirtying pages through get_user_pages without
+	 * calling into ->page_mkwrite.
+	 * While these are in the process of being fixed by switching to
+	 * pin_user_pages, some are still around and need to be worked around
+	 * by creating a delalloc reservation in a fixup worker, and waiting
+	 * us to be called again with that reservation.
+	 */
+	if (!found_delalloc && btrfs_writepage_cow_fixup(page)) {
+		redirty_page_for_writepage(wbc, page);
+		unlock_page(page);
+		return 1;
 	}
 
 	/*
@@ -1274,14 +1291,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	int ret = 0;
 	int nr = 0;
 
-	ret = btrfs_writepage_cow_fixup(page);
-	if (ret) {
-		/* Fixup worker will requeue */
-		redirty_page_for_writepage(bio_ctrl->wbc, page);
-		unlock_page(page);
-		return 1;
-	}
-
 	bio_ctrl->end_io_func = end_bio_extent_writepage;
 	while (cur <= end) {
 		u32 len = end - cur + 1;
@@ -1421,9 +1430,6 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 		goto done;
 
 	ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr);
-	if (ret == 1)
-		return 0;
-
 	bio_ctrl->wbc->nr_to_write--;
 
 done:
@@ -2176,8 +2182,6 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 
 		ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
 					    i_size, &nr);
-		if (ret == 1)
-			goto next_page;
 
 		/* Make sure the mapping tag for page dirty gets cleared. */
 		if (nr == 0) {
@@ -2193,7 +2197,6 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 		btrfs_page_unlock_writer(fs_info, page, cur, cur_len);
 		if (ret < 0)
 			found_error = true;
-next_page:
 		put_page(page);
 		cur = cur_end + 1;
 	}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/9] btrfs: fix handling of errors from __extent_writepage_io
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (3 preceding siblings ...)
  2023-07-13 13:04 ` [PATCH 4/9] btrfs: move the cow_fixup earlier in writepages handling Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 6/9] btrfs: stop submitting I/O after an error in extent_write_locked_range Christoph Hellwig
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

When __extent_writepage_io fails, the callers complete the ordered range
for the entire page, including the range for which bios have been
submitted already, leading to a double completion for those ranges.

Fix this by pulling the error handling into __extent_writepage_io, and
only apply it to the ranges not submitted yet.

This also means that the remaining error handling in __extent_writepage
never needs to complete the ordered extent, as there won't be one before
writepage_delalloc is called, or when writepage_delalloc returns an
error.

Fixes: 61391d562229 ("Btrfs: fix hang on error (such as ENOSPC) when writing extent pages")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 78 +++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d8185582b9f4b0..0c5e540eb2ebea 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1278,12 +1278,11 @@ static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
  */
 static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 				 struct page *page,
-				 struct btrfs_bio_ctrl *bio_ctrl,
-				 loff_t i_size,
-				 int *nr_ret)
+				 struct btrfs_bio_ctrl *bio_ctrl, loff_t i_size)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u64 cur = page_offset(page);
+	u64 start = page_offset(page);
+	u64 cur = start;
 	u64 end = cur + PAGE_SIZE - 1;
 	u64 extent_offset;
 	u64 block_start;
@@ -1325,7 +1324,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		em = btrfs_get_extent(inode, NULL, 0, cur, len);
 		if (IS_ERR(em)) {
 			ret = PTR_ERR_OR_ZERO(em);
-			goto out_error;
+			goto out;
 		}
 
 		extent_offset = cur - em->start;
@@ -1372,15 +1371,21 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	}
 
 	btrfs_page_assert_not_dirty(fs_info, page);
-	*nr_ret = nr;
-	return 0;
+	ret = 0;
+out:
+	/* Make sure the mapping tag for page dirty gets cleared. */
+	if (nr == 0) {
+		set_page_writeback(page);
+		end_page_writeback(page);
+	}
+	if (ret) {
+		u32 remaining = end + 1 - cur;
 
-out_error:
-	/*
-	 * If we finish without problem, we should not only clear page dirty,
-	 * but also empty subpage dirty bits
-	 */
-	*nr_ret = nr;
+		btrfs_mark_ordered_io_finished(inode, page, cur, remaining,
+					       false);
+		btrfs_page_clear_uptodate(fs_info, page, cur, remaining);
+		mapping_set_error(page->mapping, ret);
+	}
 	return ret;
 }
 
@@ -1399,7 +1404,6 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 	struct inode *inode = page->mapping->host;
 	const u64 page_start = page_offset(page);
 	int ret;
-	int nr = 0;
 	size_t pg_offset;
 	loff_t i_size = i_size_read(inode);
 	unsigned long end_index = i_size >> PAGE_SHIFT;
@@ -1421,30 +1425,27 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 
 	ret = set_page_extent_mapped(page);
 	if (ret < 0)
-		goto done;
+		goto error;
 
 	ret = writepage_delalloc(BTRFS_I(inode), page, bio_ctrl->wbc);
 	if (ret == 1)
 		return 0;
 	if (ret)
-		goto done;
+		goto error;
 
-	ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr);
+	ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size);
+	/* __extent_writepage_io cleans up by itself on error. */
 	bio_ctrl->wbc->nr_to_write--;
-
-done:
-	if (nr == 0) {
-		/* make sure the mapping tag for page dirty gets cleared */
-		set_page_writeback(page);
-		end_page_writeback(page);
-	}
-	if (ret) {
-		btrfs_mark_ordered_io_finished(BTRFS_I(inode), page, page_start,
-					       PAGE_SIZE, !ret);
-		btrfs_page_clear_uptodate(btrfs_sb(inode->i_sb), page,
-					  page_start, PAGE_SIZE);
-		mapping_set_error(page->mapping, ret);
-	}
+	goto unlock_page;
+
+error:
+	/* Make sure the mapping tag for page dirty gets cleared. */
+	set_page_writeback(page);
+	end_page_writeback(page);
+	btrfs_page_clear_uptodate(btrfs_sb(inode->i_sb), page,
+				  page_start, PAGE_SIZE);
+	mapping_set_error(page->mapping, ret);
+unlock_page:
 	unlock_page(page);
 	ASSERT(ret <= 0);
 	return ret;
@@ -2171,7 +2172,6 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 		u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
 		u32 cur_len = cur_end + 1 - cur;
 		struct page *page;
-		int nr = 0;
 
 		page = find_get_page(mapping, cur >> PAGE_SHIFT);
 		ASSERT(PageLocked(page));
@@ -2181,19 +2181,7 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 		}
 
 		ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
-					    i_size, &nr);
-
-		/* Make sure the mapping tag for page dirty gets cleared. */
-		if (nr == 0) {
-			set_page_writeback(page);
-			end_page_writeback(page);
-		}
-		if (ret) {
-			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
-						       cur, cur_len, !ret);
-			btrfs_page_clear_uptodate(fs_info, page, cur, cur_len);
-			mapping_set_error(page->mapping, ret);
-		}
+					    i_size);
 		btrfs_page_unlock_writer(fs_info, page, cur, cur_len);
 		if (ret < 0)
 			found_error = true;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/9] btrfs: stop submitting I/O after an error in extent_write_locked_range
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (4 preceding siblings ...)
  2023-07-13 13:04 ` [PATCH 5/9] btrfs: fix handling of errors from __extent_writepage_io Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 7/9] btrfs: fix a race in clearing the writeback bit for sub-page I/O Christoph Hellwig
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

As soon a __extent_writepage_io returns an error we know that the
extent_map is corrupted or we're out of memory, and there is no point
in continuing to submit I/O.  Follow the behavior in extent_write_cache
pages and stop submitting I/O, and instead just unlock all pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c5e540eb2ebea..035f49de024887 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2151,7 +2151,6 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 			       u64 start, u64 end, struct writeback_control *wbc,
 			       bool pages_dirty)
 {
-	bool found_error = false;
 	int ret = 0;
 	struct address_space *mapping = inode->i_mapping;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2180,16 +2179,29 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 			clear_page_dirty_for_io(page);
 		}
 
-		ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
-					    i_size);
+		/*
+		 * The entire page range that we were called on is locked and
+		 * covered by an ordered_extent.  Make sure we continue the
+		 * loop after an initial writeback error to unwind these as well
+		 * as clearing the dirty bit on the page by starting and ending
+		 * writeback.
+		 */
+		if (ret) {
+			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
+						       cur, cur_len, false);
+			btrfs_page_clear_uptodate(fs_info, page, cur, cur_len);
+			set_page_writeback(page);
+			end_page_writeback(page);
+		} else {
+			ret = __extent_writepage_io(BTRFS_I(inode), page,
+						    &bio_ctrl, i_size);
+		}
 		btrfs_page_unlock_writer(fs_info, page, cur, cur_len);
-		if (ret < 0)
-			found_error = true;
 		put_page(page);
 		cur = cur_end + 1;
 	}
 
-	submit_write_bio(&bio_ctrl, found_error ? ret : 0);
+	submit_write_bio(&bio_ctrl, ret);
 }
 
 int extent_writepages(struct address_space *mapping,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 7/9] btrfs: fix a race in clearing the writeback bit for sub-page I/O
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (5 preceding siblings ...)
  2023-07-13 13:04 ` [PATCH 6/9] btrfs: stop submitting I/O after an error in extent_write_locked_range Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 8/9] btrfs: remove the call to btrfs_mark_ordered_io_finished in btrfs_writepage_fixup_worker Christoph Hellwig
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

For sub-page I/O, a fast I/O completion can clear the page writeback bit
in the I/O completion handler before subsequent bios were submitted.
Use a trick from iomap and defer submission of bios until we're reached
the end of the page to avoid this race.

Fixes: c5ef5c6c733a ("btrfs: make __extent_writepage_io() only submit dirty range for subpage")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 54 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 035f49de024887..994d38f59cbed4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -105,7 +105,8 @@ struct btrfs_bio_ctrl {
 	struct writeback_control *wbc;
 };
 
-static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
+static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl,
+			   struct bio_list *bio_list)
 {
 	struct btrfs_bio *bbio = bio_ctrl->bbio;
 
@@ -118,6 +119,8 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 	if (btrfs_op(&bbio->bio) == BTRFS_MAP_READ &&
 	    bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
 		btrfs_submit_compressed_read(bbio);
+	else if (bio_list)
+		bio_list_add(bio_list, &bbio->bio);
 	else
 		btrfs_submit_bio(bbio, 0);
 
@@ -141,7 +144,22 @@ static void submit_write_bio(struct btrfs_bio_ctrl *bio_ctrl, int ret)
 		/* The bio is owned by the end_io handler now */
 		bio_ctrl->bbio = NULL;
 	} else {
-		submit_one_bio(bio_ctrl);
+		submit_one_bio(bio_ctrl, NULL);
+	}
+}
+
+static void btrfs_submit_pending_bios(struct bio_list *bio_list, int ret)
+{
+	struct bio *bio;
+
+	if (ret) {
+		blk_status_t status = errno_to_blk_status(ret);
+
+		while ((bio = bio_list_pop(bio_list)))
+			btrfs_bio_end_io(btrfs_bio(bio), status);
+	} else {
+		while ((bio = bio_list_pop(bio_list)))
+			btrfs_submit_bio(btrfs_bio(bio), 0);
 	}
 }
 
@@ -791,7 +809,8 @@ static void alloc_new_bio(struct btrfs_inode *inode,
  */
 static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl,
 			       u64 disk_bytenr, struct page *page,
-			       size_t size, unsigned long pg_offset)
+			       size_t size, unsigned long pg_offset,
+			       struct bio_list *bio_list)
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 
@@ -800,7 +819,7 @@ static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl,
 
 	if (bio_ctrl->bbio &&
 	    !btrfs_bio_is_contig(bio_ctrl, page, disk_bytenr, pg_offset))
-		submit_one_bio(bio_ctrl);
+		submit_one_bio(bio_ctrl, bio_list);
 
 	do {
 		u32 len = size;
@@ -820,7 +839,7 @@ static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl,
 
 		if (bio_add_page(&bio_ctrl->bbio->bio, page, len, pg_offset) != len) {
 			/* bio full: move on to a new one */
-			submit_one_bio(bio_ctrl);
+			submit_one_bio(bio_ctrl, bio_list);
 			continue;
 		}
 
@@ -834,7 +853,7 @@ static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl,
 
 		/* Ordered extent boundary: move on to a new bio. */
 		if (bio_ctrl->len_to_oe_boundary == 0)
-			submit_one_bio(bio_ctrl);
+			submit_one_bio(bio_ctrl, bio_list);
 	} while (size);
 }
 
@@ -1082,14 +1101,14 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		}
 
 		if (bio_ctrl->compress_type != compress_type) {
-			submit_one_bio(bio_ctrl);
+			submit_one_bio(bio_ctrl, NULL);
 			bio_ctrl->compress_type = compress_type;
 		}
 
 		if (force_bio_submit)
-			submit_one_bio(bio_ctrl);
+			submit_one_bio(bio_ctrl, NULL);
 		submit_extent_page(bio_ctrl, disk_bytenr, page, iosize,
-				   pg_offset);
+				   pg_offset, NULL);
 		cur = cur + iosize;
 		pg_offset += iosize;
 	}
@@ -1113,7 +1132,7 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
 	 * If btrfs_do_readpage() failed we will want to submit the assembled
 	 * bio to do the cleanup.
 	 */
-	submit_one_bio(&bio_ctrl);
+	submit_one_bio(&bio_ctrl, NULL);
 	return ret;
 }
 
@@ -1287,6 +1306,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	u64 extent_offset;
 	u64 block_start;
 	struct extent_map *em;
+	struct bio_list bio_list = BIO_EMPTY_LIST;
 	int ret = 0;
 	int nr = 0;
 
@@ -1365,7 +1385,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		btrfs_page_clear_dirty(fs_info, page, cur, iosize);
 
 		submit_extent_page(bio_ctrl, disk_bytenr, page, iosize,
-				   cur - page_offset(page));
+				   cur - page_offset(page), &bio_list);
 		cur += iosize;
 		nr++;
 	}
@@ -1378,6 +1398,16 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		set_page_writeback(page);
 		end_page_writeback(page);
 	}
+
+	/*
+	 * Submit all bios we queued up for this page.
+	 *
+	 * The bios are not submitted directly after building them as otherwise
+	 * a very fast I/O completion on an earlier bio could clear the page
+	 * writeback bit before the subsequent bios are even submitted.
+	 */
+	btrfs_submit_pending_bios(&bio_list, ret);
+
 	if (ret) {
 		u32 remaining = end + 1 - cur;
 
@@ -2243,7 +2273,7 @@ void extent_readahead(struct readahead_control *rac)
 
 	if (em_cached)
 		free_extent_map(em_cached);
-	submit_one_bio(&bio_ctrl);
+	submit_one_bio(&bio_ctrl, NULL);
 }
 
 /*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 8/9] btrfs: remove the call to btrfs_mark_ordered_io_finished in btrfs_writepage_fixup_worker
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (6 preceding siblings ...)
  2023-07-13 13:04 ` [PATCH 7/9] btrfs: fix a race in clearing the writeback bit for sub-page I/O Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-13 13:04 ` [PATCH 9/9] btrfs: lift the call to mapping_set_error out of cow_file_range Christoph Hellwig
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

If btrfs_writepage_fixup_worker is called for a range that is
covered by and ordered_extent, but which doesn't have the PageOrdered
bit set, that means the ordered_extent is either for a pending direct
I/O, or for a buffered writeback in the final stages of I/O completion.

In either case it is now owned by the calling context, and should
never be completed by it.

Fixes: 87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 20bee7f8dae01c..46d04803d76f13 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2796,8 +2796,6 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		 * to reflect the errors and clean the page.
 		 */
 		mapping_set_error(page->mapping, ret);
-		btrfs_mark_ordered_io_finished(inode, page, page_start,
-					       PAGE_SIZE, !ret);
 		btrfs_page_clear_uptodate(fs_info, page, page_start, PAGE_SIZE);
 		clear_page_dirty_for_io(page);
 	}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 9/9] btrfs: lift the call to mapping_set_error out of cow_file_range
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (7 preceding siblings ...)
  2023-07-13 13:04 ` [PATCH 8/9] btrfs: remove the call to btrfs_mark_ordered_io_finished in btrfs_writepage_fixup_worker Christoph Hellwig
@ 2023-07-13 13:04 ` Christoph Hellwig
  2023-07-14 13:59 ` small writeback fixes Josef Bacik
  2023-07-18 17:17 ` Josef Bacik
  10 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-13 13:04 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, linux-fsdevel

Instead of calling mapping_set_error in cow_file_range for the
!locked_page case, make the submit_uncompressed_range call
mapping_set_error also for the !locked_page as that is the only caller
that might pass a NULL locked_page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 46d04803d76f13..9305a100b5f809 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1074,6 +1074,7 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
 	wbc_detach_inode(&wbc);
 	if (ret < 0) {
 		btrfs_cleanup_ordered_extents(inode, locked_page, start, end - start + 1);
+		mapping_set_error(inode->vfs_inode.i_mapping, ret);
 		if (locked_page) {
 			const u64 page_start = page_offset(locked_page);
 
@@ -1085,7 +1086,6 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
 			btrfs_page_clear_uptodate(inode->root->fs_info,
 						  locked_page, page_start,
 						  PAGE_SIZE);
-			mapping_set_error(locked_page->mapping, ret);
 			unlock_page(locked_page);
 		}
 	}
@@ -1525,8 +1525,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	 * pages (except @locked_page) to ensure all the pages are unlocked.
 	 */
 	if ((flags & CFR_KEEP_LOCKED) && orig_start < start) {
-		if (!locked_page)
-			mapping_set_error(inode->vfs_inode.i_mapping, ret);
 		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
 					     locked_page, 0, page_ops);
 	}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (8 preceding siblings ...)
  2023-07-13 13:04 ` [PATCH 9/9] btrfs: lift the call to mapping_set_error out of cow_file_range Christoph Hellwig
@ 2023-07-14 13:59 ` Josef Bacik
  2023-07-18 17:17 ` Josef Bacik
  10 siblings, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2023-07-14 13:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chris Mason, David Sterba, linux-btrfs, linux-fsdevel

On Thu, Jul 13, 2023 at 03:04:22PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series has various fixes for bugs found in inspect or only triggered
> with upcoming changes that are a fallout from my work on bound lifetimes
> for the ordered extent and better confirming to expectations from the
> common writeback code.
> 
> Note that this series builds on the "btrfs compressed writeback cleanups"
> series sent out previously.
> 
> A git tree is also available here:
> 
>     git://git.infradead.org/users/hch/misc.git btrfs-writeback-fixes
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-writeback-fixes
> 
> Diffatat:
>  extent_io.c |  182 ++++++++++++++++++++++++++++++++++++------------------------
>  inode.c     |   16 +----
>  2 files changed, 117 insertions(+), 81 deletions(-)

You can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

to the series, thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-13 13:04 small writeback fixes Christoph Hellwig
                   ` (9 preceding siblings ...)
  2023-07-14 13:59 ` small writeback fixes Josef Bacik
@ 2023-07-18 17:17 ` Josef Bacik
  2023-07-19  5:39   ` Christoph Hellwig
  10 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2023-07-18 17:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chris Mason, David Sterba, linux-btrfs, linux-fsdevel

On Thu, Jul 13, 2023 at 03:04:22PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series has various fixes for bugs found in inspect or only triggered
> with upcoming changes that are a fallout from my work on bound lifetimes
> for the ordered extent and better confirming to expectations from the
> common writeback code.
> 
> Note that this series builds on the "btrfs compressed writeback cleanups"
> series sent out previously.
> 
> A git tree is also available here:
> 
>     git://git.infradead.org/users/hch/misc.git btrfs-writeback-fixes
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-writeback-fixes
> 
> Diffatat:
>  extent_io.c |  182 ++++++++++++++++++++++++++++++++++++------------------------
>  inode.c     |   16 +----
>  2 files changed, 117 insertions(+), 81 deletions(-)

Just FYI I've been using these two series to see how the github CI stuff was
working, and I keep tripping over a hang in generic/475.  It appears to be in
the fixup worker, here's the sysrq w output

sysrq: Show Blocked State
task:kworker/u4:5    state:D stack:0     pid:1713600 ppid:2      flags:0x00004000
Workqueue: btrfs-fixup btrfs_work_helper
Call Trace:
 <TASK>
 __schedule+0x533/0x1910
 ? find_held_lock+0x2b/0x80
 schedule+0x5e/0xd0
 __reserve_bytes+0x4e2/0x830
 ? __pfx_autoremove_wake_function+0x10/0x10
 btrfs_reserve_data_bytes+0x54/0x170
 btrfs_check_data_free_space+0x6a/0xf0
 btrfs_delalloc_reserve_space+0x2b/0xe0
 btrfs_writepage_fixup_worker+0x7e/0x4c0
 btrfs_work_helper+0xff/0x410
 process_one_work+0x26b/0x550
 worker_thread+0x53/0x3a0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xf5/0x130
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>
task:kworker/u4:4    state:D stack:0     pid:2513631 ppid:2      flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_data_space
Call Trace:
 <TASK>
 __schedule+0x533/0x1910
 ? lock_acquire+0xca/0x2b0
 schedule+0x5e/0xd0
 schedule_timeout+0x1ad/0x1c0
 __wait_for_common+0xbd/0x220
 ? __pfx_schedule_timeout+0x10/0x10
 btrfs_wait_ordered_extents+0x3e3/0x480
 btrfs_wait_ordered_roots+0x184/0x260
 flush_space+0x3de/0x6a0
 ? btrfs_async_reclaim_data_space+0x52/0x180
 ? lock_release+0xc9/0x270
 btrfs_async_reclaim_data_space+0xff/0x180
 process_one_work+0x26b/0x550
 worker_thread+0x1eb/0x3a0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xf5/0x130
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>
task:kworker/u4:6    state:D stack:0     pid:2513783 ppid:2      flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper
Call Trace:
 <TASK>
 __schedule+0x533/0x1910
 schedule+0x5e/0xd0
 btrfs_start_ordered_extent+0x153/0x210
 ? __pfx_autoremove_wake_function+0x10/0x10
 btrfs_run_ordered_extent_work+0x19/0x30
 btrfs_work_helper+0xff/0x410
 process_one_work+0x26b/0x550
 worker_thread+0x53/0x3a0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xf5/0x130
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>

We appear to be getting hung up because the ENOSPC stuff is flushing and waiting
on ordered extents, and then the fixup worker is waiting on trying to reserve
space.  My hunch is the page that's in the fixup worker is attached to an
ordered extent.

I can pretty reliably reproduce this in the CI, so if you have trouble
reproducing it let me know.  I'll dig into it later today, but I may not get to
it before you do.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-18 17:17 ` Josef Bacik
@ 2023-07-19  5:39   ` Christoph Hellwig
  2023-07-19 11:50     ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-19  5:39 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, Chris Mason, David Sterba, linux-btrfs,
	linux-fsdevel

On Tue, Jul 18, 2023 at 01:17:44PM -0400, Josef Bacik wrote:
> Just FYI I've been using these two series to see how the github CI stuff was
> working, and I keep tripping over a hang in generic/475.  It appears to be in
> the fixup worker, here's the sysrq w output

> 
> We appear to be getting hung up because the ENOSPC stuff is flushing and waiting
> on ordered extents, and then the fixup worker is waiting on trying to reserve
> space.  My hunch is the page that's in the fixup worker is attached to an
> ordered extent.
> 
> I can pretty reliably reproduce this in the CI, so if you have trouble
> reproducing it let me know.  I'll dig into it later today, but I may not get to
> it before you do.  Thanks,

My day was already over by the time you sent this, but I looked into
it the first thing this morning.

I can't reproduce the hang, but my first thought was "why the heck do
even end up in the fixup worker" given that there is no GUP-based
dirtying in the thread.

I can reproduce the test case hitting the fixup worker now, while
I can't on misc-next.  Looking into it now, but the rework of the
fixup logic is a hot candidate.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-19  5:39   ` Christoph Hellwig
@ 2023-07-19 11:50     ` Christoph Hellwig
  2023-07-19 14:30       ` Josef Bacik
  2023-07-19 21:42       ` Josef Bacik
  0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-19 11:50 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, Chris Mason, David Sterba, linux-btrfs,
	linux-fsdevel

On Wed, Jul 19, 2023 at 07:39:01AM +0200, Christoph Hellwig wrote:
> My day was already over by the time you sent this, but I looked into
> it the first thing this morning.
> 
> I can't reproduce the hang, but my first thought was "why the heck do
> even end up in the fixup worker" given that there is no GUP-based
> dirtying in the thread.
> 
> I can reproduce the test case hitting the fixup worker now, while
> I can't on misc-next.  Looking into it now, but the rework of the
> fixup logic is a hot candidate.

So unfortunately even the BUG seems to trigger in a very sporadic
manner, making a bisect impossible.  This is made worse by me actually
hitting another hang (dmesg output below) way more frequently, but that
one actually reproduces on misc-next as well.  I'm also still confused
on how we hit the fixup worker, as that means we'll need to see a page
that.

  a) was dirty so that the writeback code picks it up
  b) had the delalloc bit already cleaned in the I/O tree
  c) does not have the orderd bit set

"btrfs: move the cow_fixup earlier in writepages handling" would
be the obvious candidate touching this area, even if I can't see
how it makes a difference.  Any chance you could check if it is
indeed the culprit?

And here is the more frequent hang I see with generic/475 loops:

  135.245713] BTRFS info (device dm-0): using crc32c (crc32c-intel) checksum algorithm
  [  135.246243] BTRFS info (device dm-0): using free space tree
  [  135.252874] BTRFS info (device dm-0): bdev /dev/mapper/error-test errs: wr 0, rd 0, flush 00
  [  135.255126] BTRFS info (device dm-0): auto enabling async discard
  [  363.683264] INFO: task kworker/u4:7:4288 blocked for more than 120 seconds.
  [  363.683948]       Not tainted 6.5.0-rc2+ #1662
  [  363.684290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  363.684866] task:kworker/u4:7    state:D stack:0     pid:4288  ppid:2      flags:0x00004000
  [  363.685421] Workqueue: btrfs-cache btrfs_work_helper
  [  363.685763] Call Trace:
  [  363.685926]  <TASK>
  [  363.686067]  __schedule+0x406/0xcf0
  [  363.686314]  schedule+0x5d/0xe0
  [  363.686526]  io_schedule+0x41/0x70
  [  363.686751]  bit_wait_io+0x8/0x50
  [  363.686967]  __wait_on_bit+0x43/0x140
  [  363.688362]  ? bit_wait+0x50/0x50
  [  363.688607]  out_of_line_wait_on_bit+0x8f/0xb0
  [  363.688929]  ? collect_percpu_times+0x380/0x380
  [  363.689257]  read_extent_buffer_pages+0x18b/0x1d0
  [  363.689588]  btrfs_read_extent_buffer+0x8e/0x170
  [  363.689905]  read_tree_block+0x2e/0xa0
  [  363.690153]  read_block_for_search+0x23b/0x350
  [  363.690453]  btrfs_search_slot+0x2cf/0xe30
  [  363.690723]  ? _raw_read_unlock+0x24/0x40
  [  363.690990]  caching_thread+0x348/0x9e0
  [  363.699234]  btrfs_work_helper+0xe6/0x3d0
  [  363.699543]  ? lock_is_held_type+0xe3/0x140
  [  363.699859]  process_one_work+0x255/0x4e0
  [  363.700160]  worker_thread+0x4a/0x3a0
  [  363.700433]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
  [  363.700773]  ? process_one_work+0x4e0/0x4e0
  [  363.701058]  kthread+0xee/0x120
  [  363.701275]  ? kthread_complete_and_exit+0x20/0x20
  [  363.701590]  ret_from_fork+0x28/0x40
  [  363.701833]  ? kthread_complete_and_exit+0x20/0x20
  [  363.702148]  ret_from_fork_asm+0x11/0x20
  [  363.702413] RIP: 0000:0x0
  [  363.702596] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  [  363.703028] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
  [  363.704102] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
  [  363.704592] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  [  363.705076] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [  363.705552] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  [  363.706024] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [  363.706502]  </TASK>
  [  363.706665] INFO: task btrfs-transacti:8653 blocked for more than 120 seconds.
  [  363.707141]       Not tainted 6.5.0-rc2+ #1662
  [  363.719240] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  363.719833] task:btrfs-transacti state:D stack:0     pid:8653  ppid:2      flags:0x00004000
  [  363.720456] Call Trace:
  [  363.720643]  <TASK>
  [  363.720812]  __schedule+0x406/0xcf0
  [  363.721089]  schedule+0x5d/0xe0
  [  363.721314]  io_schedule+0x41/0x70
  [  363.721548]  bit_wait_io+0x8/0x50
  [  363.721778]  __wait_on_bit+0x43/0x140
  [  363.722031]  ? bit_wait+0x50/0x50
  [  363.722260]  out_of_line_wait_on_bit+0x8f/0xb0
  [  363.722566]  ? collect_percpu_times+0x380/0x380
  [  363.722877]  read_extent_buffer_pages+0x18b/0x1d0
  [  363.723221]  btrfs_read_extent_buffer+0x8e/0x170
  [  363.723542]  read_tree_block+0x2e/0xa0
  [  363.723804]  read_block_for_search+0x23b/0x350
  [  363.724115]  btrfs_search_slot+0x2cf/0xe30
  [  363.724401]  ? _raw_read_unlock+0x24/0x40
  [  363.724682]  lookup_inline_extent_backref+0x16c/0x770
  [  363.725038]  ? lock_acquire+0xe4/0x2d0
  [  363.725305]  lookup_extent_backref+0x3c/0xc0
  [  363.725605]  __btrfs_free_extent+0xf2/0x1070
  [  363.725907]  ? lock_release+0x142/0x290
  [  363.726176]  __btrfs_run_delayed_refs+0x2e8/0x1280
  [  363.726514]  ? btrfs_commit_transaction+0x38/0x1290
  [  363.726852]  btrfs_run_delayed_refs+0x70/0x1e0
  [  363.727161]  ? btrfs_commit_transaction+0x38/0x1290
  [  363.735220]  btrfs_commit_transaction+0xa7/0x1290
  [  363.735598]  ? start_transaction+0xc0/0x700
  [  363.735936]  transaction_kthread+0x139/0x1a0
  [  363.736284]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
  [  363.736680]  ? close_ctree+0x5a0/0x5a0
  [  363.736944]  kthread+0xee/0x120
  [  363.737176]  ? kthread_complete_and_exit+0x20/0x20
  [  363.737506]  ret_from_fork+0x28/0x40
  [  363.737757]  ? kthread_complete_and_exit+0x20/0x20
  [  363.738087]  ret_from_fork_asm+0x11/0x20
  [  363.738356] RIP: 0000:0x0
  [  363.738544] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  [  363.738986] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
  [  363.739512] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
  [  363.740001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  [  363.740500] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [  363.740995] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  [  363.741500] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [  363.742008]  </TASK>
  [  363.742171] INFO: task fsstress:8657 blocked for more than 120 seconds.
  [  363.742630]       Not tainted 6.5.0-rc2+ #1662
  [  363.742941] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  363.751221] task:fsstress        state:D stack:0     pid:8657  ppid:8655   flags:0x00004002
  [  363.751839] Call Trace:
  [  363.752017]  <TASK>
  [  363.752178]  __schedule+0x406/0xcf0
  [  363.752427]  ? _raw_spin_unlock_irqrestore+0x2b/0x60
  [  363.752771]  ? __x64_sys_tee+0xc0/0xc0
  [  363.753041]  schedule+0x5d/0xe0
  [  363.753269]  wb_wait_for_completion+0x4d/0x80
  [  363.753574]  ? this_rq_lock_irq+0xa0/0xa0
  [  363.753858]  sync_inodes_sb+0xd4/0x470
  [  363.754124]  ? __x64_sys_tee+0xc0/0xc0
  [  363.754395]  iterate_supers+0x80/0xe0
  [  363.754661]  ksys_sync+0x3b/0xa0
  [  363.754893]  __do_sys_sync+0x5/0x10
  [  363.755137]  do_syscall_64+0x34/0x80
  [  363.755403]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [  363.755759] RIP: 0033:0x7f6039ab2a37
  [  363.756006] RSP: 002b:00007fff79cafd78 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [  363.756522] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f6039ab2a37
  [  363.757009] RDX: 00000000fd69c659 RSI: 00000000029639a6 RDI: 0000000000000007
  [  363.757511] RBP: 000000000007a120 R08: 0000000000000070 R09: 00007fff79caf89c
  [  363.757994] R10: 00007f60399bd258 R11: 0000000000000206 R12: 000055b6208925e0
  [  363.758483] R13: 028f5c28f5c28f5c R14: 8f5c28f5c28f5c29 R15: 000055b62087e620
  [  363.758988]  </TASK>
  [  363.759147] INFO: task fsstress:8658 blocked for more than 120 seconds.
  [  363.767219]       Not tainted 6.5.0-rc2+ #1662
  [  363.767560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  363.768135] task:fsstress        state:D stack:0     pid:8658  ppid:8655   flags:0x00004002
  [  363.768753] Call Trace:
  [  363.768942]  <TASK>
  [  363.769116]  __schedule+0x406/0xcf0
  [  363.769363]  schedule+0x5d/0xe0
  [  363.769589]  io_schedule+0x41/0x70
  [  363.769831]  folio_wait_bit_common+0x11d/0x340
  [  363.770146]  ? filemap_get_folios_tag+0x378/0x450
  [  363.770484]  ? __probestub_file_check_and_advance_wb_err+0x10/0x10
  [  363.770925]  folio_wait_writeback+0x1f/0xa0
  [  363.771240]  __filemap_fdatawait_range+0x74/0xd0
  [  363.771575]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
  [  363.771929]  ? __clear_extent_bit+0x173/0x4c0
  [  363.772242]  filemap_fdatawait_range+0x9/0x20
  [  363.772551]  __btrfs_wait_marked_extents.isra.0+0xb2/0xf0
  [  363.772928]  btrfs_wait_tree_log_extents+0x28/0xa0
  [  363.773281]  btrfs_sync_log+0x4a0/0xb90
  [  363.773567]  btrfs_sync_file+0x424/0x5e0
  [  363.773854]  __x64_sys_fsync+0x32/0x60
  [  363.774123]  do_syscall_64+0x34/0x80
  [  363.774381]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [  363.774735] RIP: 0033:0x7f6039ab29b0
  [  363.774988] RSP: 002b:00007fff79cafd48 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
  [  363.783212] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f6039ab29b0
  [  363.783739] RDX: 0000000000000105 RSI: 000055b62088b048 RDI: 0000000000000004
  [  363.784206] RBP: 0000000000000015 R08: 00007f6039b4e4c0 R09: 00007fff79cafd5c
  [  363.784683] R10: 00007f60399cab78 R11: 0000000000000202 R12: 00007fff79cafd60
  [  363.785184] R13: 028f5c28f5c28f5c R14: 8f5c28f5c28f5c29 R15: 000055b6208867c0
  [  363.785677]  </TASK>
  [  363.785837] INFO: task fsstress:8659 blocked for more than 120 seconds.
  [  363.786296]       Not tainted 6.5.0-rc2+ #1662
  [  363.786620] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  363.787157] task:fsstress        state:D stack:0     pid:8659  ppid:8655   flags:0x00000002
  [  363.787755] Call Trace:
  [  363.787940]  <TASK>
  [  363.788099]  __schedule+0x406/0xcf0
  [  363.788348]  schedule+0x5d/0xe0
  [  363.788573]  io_schedule+0x41/0x70
  [  363.788816]  folio_wait_bit_common+0x11d/0x340
  [  363.789140]  ? __probestub_file_check_and_advance_wb_err+0x10/0x10
  [  363.789565]  extent_write_cache_pages+0x41e/0x7b0
  [  363.789892]  ? __kernel_text_address+0x9/0x30
  [  363.790196]  ? unwind_get_return_address+0x16/0x30
  [  363.790538]  extent_writepages+0x7e/0x100
  [  363.790823]  do_writepages+0xc5/0x180
  [  363.791087]  filemap_fdatawrite_wbc+0x56/0x80
  [  363.799222]  __filemap_fdatawrite_range+0x53/0x70
  [  363.799596]  btrfs_remap_file_range+0x132/0x620
  [  363.799944]  vfs_dedupe_file_range_one+0x177/0x1f0
  [  363.800297]  vfs_dedupe_file_range+0x186/0x1f0
  [  363.800620]  do_vfs_ioctl+0x49d/0x930
  [  363.800890]  __x64_sys_ioctl+0x64/0xc0
  [  363.801170]  do_syscall_64+0x34/0x80
  [  363.801430]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [  363.801784] RIP: 0033:0x7f6039ab1afb
  [  363.802038] RSP: 002b:00007fff79cafc70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [  363.802555] RAX: ffffffffffffffda RBX: 000055b6224398e8 RCX: 00007f6039ab1afb
  [  363.803038] RDX: 000055b6224397c0 RSI: 00000000c0189436 RDI: 0000000000000004
  [  363.803544] RBP: 000055b622439988 R08: 000055b6224399c0 R09: 000055b6224399a4
  [  363.804039] R10: 0000000000001000 R11: 0000000000000246 R12: 000055b622439820
  [  363.804520] R13: 000000000000000b R14: 0000000000000002 R15: 0000000000007000
  [  363.805015]  </TASK>
  [  363.805204] 
  [  363.805204] Showing all locks held in the system:
  [  363.805646] 1 lock held by rcu_tasks_kthre/11:
  [  363.805965]  #0: ffffffff839bf8f0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one0
  [  363.806625] 1 lock held by rcu_tasks_trace/12:
  [  363.806936]  #0: ffffffff839bf630 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tas0
  [  363.815221] 1 lock held by khungtaskd/24:
  [  363.815528]  #0: ffffffff839c0360 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0xe0
  [  363.816232] 1 lock held by in:imklog/3192:
  [  363.816570]  #0: ffff8881076fb788 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x43/0x50
  [  363.817228] 4 locks held by kworker/u4:6/4181:
  [  363.817546] 4 locks held by kworker/u4:7/4288:
  [  363.817859]  #0: ffff888113284d38 ((wq_completion)btrfs-cache){+.+.}-{0:0}, at: process_one0
  [  363.818539]  #1: ffffc90004b4fe58 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: 0
  [  363.819289]  #2: ffff8881168d5078 (&caching_ctl->mutex){+.+.}-{3:3}, at: caching_thread+0x40
  [  363.819904]  #3: ffff88810d5acab0 (&fs_info->commit_root_sem){++++}-{3:3}, at: caching_thre0
  [  363.820560] 6 locks held by btrfs-transacti/8653:
  [  363.820888]  #0: ffff88810d5ac7d0 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: tr0
  [  363.821625]  #1: ffff88810d5ae378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transacti0
  [  363.822276]  #2: ffff88810d5ae3a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transa0
  [  363.822944]  #3: ffff88810d5ae3c8 (btrfs_trans_commit_start){.+.+}-{0:0}, at: btrfs_commit_0
  [  363.835223]  #4: ffff88810da5ca28 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_delayed_ref_loc0
  [  363.835936]  #5: ffff8881149c7548 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_lock+0x0
  [  363.836570] 2 locks held by fsstress/8657:
  [  363.836857]  #0: ffff88810da390e0 (&type->s_umount_key#51){++++}-{3:3}, at: iterate_supers+0
  [  363.837504]  #1: ffff88810da3c7d0 (&bdi->wb_switch_rwsem){++++}-{3:3}, at: sync_inodes_sb+00
  [  363.838126] 4 locks held by fsstress/8658:
  [  363.838408]  #0: ffff88810da395f0 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x339/0x0
  [  363.838994]  #1: ffff88810d5ae378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transacti0
  [  363.843228]  #2: ffff88810d5ae3a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transa0
  [  363.844006]  #3: ffff888115b193b0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x303/0
  [  363.844690] 3 locks held by fsstress/8659:
  [  363.844973]  #0: ffff88810da39400 (sb_writers#14){.+.+}-{0:0}, at: vfs_dedupe_file_range_on0
  [  363.845612]  #1: ffff888111ef2110 (&sb->s_type->i_mutex_key#18){++++}-{3:3}, at: btrfs_inod0
  [  363.846264]  #2: ffff888111ef1f98 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0x440
  [  363.846858] 4 locks held by fsstress/8660:
  [  363.847136] 1 lock held by dmsetup/8661:
  [  363.851231]  #0: ffff88811549d068 (&md->suspend_lock/1){+.+.}-{3:3}, at: dm_suspend+0x23/0x0
  [  363.851867] 
  [  363.851993] =============================================
  [  363.851993] 
  [  484.515287] INFO: task kworker/u4:7:4288 blocked for more than 241 seconds.
  [  484.515907]       Not tainted 6.5.0-rc2+ #1662
  [  484.516254] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  484.516822] task:kworker/u4:7    state:D stack:0     pid:4288  ppid:2      flags:0x00004000
  [  484.517411] Workqueue: btrfs-cache btrfs_work_helper
  [  484.517772] Call Trace:
  [  484.517950]  <TASK>
  [  484.518107]  __schedule+0x406/0xcf0
  [  484.518364]  schedule+0x5d/0xe0
  [  484.518591]  io_schedule+0x41/0x70
  [  484.518839]  bit_wait_io+0x8/0x50
  [  484.519079]  __wait_on_bit+0x43/0x140
  [  484.520645]  ? bit_wait+0x50/0x50
  [  484.521078]  out_of_line_wait_on_bit+0x8f/0xb0
  [  484.521409]  ? collect_percpu_times+0x380/0x380
  [  484.521731]  read_extent_buffer_pages+0x18b/0x1d0
  [  484.522065]  btrfs_read_extent_buffer+0x8e/0x170
  [  484.522392]  read_tree_block+0x2e/0xa0
  [  484.522658]  read_block_for_search+0x23b/0x350
  [  484.522978]  btrfs_search_slot+0x2cf/0xe30
  [  484.529712]  ? _raw_read_unlock+0x24/0x40
  [  484.530361]  caching_thread+0x348/0x9e0
  [  484.530747]  btrfs_work_helper+0xe6/0x3d0
  [  484.531037]  ? lock_is_held_type+0xe3/0x140
  [  484.531510]  process_one_work+0x255/0x4e0
  [  484.531823]  worker_thread+0x4a/0x3a0
  [  484.532089]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
  [  484.532530]  ? process_one_work+0x4e0/0x4e0
  [  484.532932]  kthread+0xee/0x120
  [  484.533201]  ? kthread_complete_and_exit+0x20/0x20
  [  484.533537]  ret_from_fork+0x28/0x40
  [  484.533796]  ? kthread_complete_and_exit+0x20/0x20
  [  484.534145]  ret_from_fork_asm+0x11/0x20
  [  484.534431] RIP: 0000:0x0
  [  484.534630] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  [  484.536665] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
  [  484.537222] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
  [  484.537726] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  [  484.538245] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [  484.538754] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  [  484.547221] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [  484.547899]  </TASK>
  [  484.548073] INFO: task btrfs-transacti:8653 blocked for more than 241 seconds.
  [  484.548592]       Not tainted 6.5.0-rc2+ #1662
  [  484.548909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  484.549452] task:btrfs-transacti state:D stack:0     pid:8653  ppid:2      flags:0x00004000
  [  484.550050] Call Trace:
  [  484.550243]  <TASK>
  [  484.550408]  __schedule+0x406/0xcf0
  [  484.550684]  schedule+0x5d/0xe0
  [  484.550921]  io_schedule+0x41/0x70
  [  484.551176]  bit_wait_io+0x8/0x50
  [  484.552000]  __wait_on_bit+0x43/0x140
  [  484.552295]  ? bit_wait+0x50/0x50
  [  484.552549]  out_of_line_wait_on_bit+0x8f/0xb0
  [  484.552884]  ? collect_percpu_times+0x380/0x380
  [  484.553221]  read_extent_buffer_pages+0x18b/0x1d0
  [  484.553569]  btrfs_read_extent_buffer+0x8e/0x170
  [  484.553914]  read_tree_block+0x2e/0xa0
  [  484.554194]  read_block_for_search+0x23b/0x350
  [  484.554533]  btrfs_search_slot+0x2cf/0xe30
  [  484.554846]  ? _raw_read_unlock+0x24/0x40
  [  484.555144]  lookup_inline_extent_backref+0x16c/0x770
  [  484.563223]  ? lock_acquire+0xe4/0x2d0
  [  484.563556]  lookup_extent_backref+0x3c/0xc0
  [  484.563905]  __btrfs_free_extent+0xf2/0x1070
  [  484.564234]  ? lock_release+0x142/0x290
  [  484.564533]  __btrfs_run_delayed_refs+0x2e8/0x1280
  [  484.564910]  ? btrfs_commit_transaction+0x38/0x1290
  [  484.565283]  btrfs_run_delayed_refs+0x70/0x1e0
  [  484.565629]  ? btrfs_commit_transaction+0x38/0x1290
  [  484.566007]  btrfs_commit_transaction+0xa7/0x1290
  [  484.566372]  ? start_transaction+0xc0/0x700
  [  484.566697]  transaction_kthread+0x139/0x1a0
  [  484.567033]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
  [  484.567952]  ? close_ctree+0x5a0/0x5a0
  [  484.568261]  kthread+0xee/0x120
  [  484.568515]  ? kthread_complete_and_exit+0x20/0x20


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-19 11:50     ` Christoph Hellwig
@ 2023-07-19 14:30       ` Josef Bacik
  2023-07-19 15:25         ` Christoph Hellwig
  2023-07-19 21:42       ` Josef Bacik
  1 sibling, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2023-07-19 14:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chris Mason, David Sterba, linux-btrfs, linux-fsdevel

On Wed, Jul 19, 2023 at 01:50:10PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 19, 2023 at 07:39:01AM +0200, Christoph Hellwig wrote:
> > My day was already over by the time you sent this, but I looked into
> > it the first thing this morning.
> > 
> > I can't reproduce the hang, but my first thought was "why the heck do
> > even end up in the fixup worker" given that there is no GUP-based
> > dirtying in the thread.
> > 
> > I can reproduce the test case hitting the fixup worker now, while
> > I can't on misc-next.  Looking into it now, but the rework of the
> > fixup logic is a hot candidate.
> 
> So unfortunately even the BUG seems to trigger in a very sporadic
> manner, making a bisect impossible.  This is made worse by me actually
> hitting another hang (dmesg output below) way more frequently, but that
> one actually reproduces on misc-next as well.  I'm also still confused
> on how we hit the fixup worker, as that means we'll need to see a page
> that.
> 
>   a) was dirty so that the writeback code picks it up
>   b) had the delalloc bit already cleaned in the I/O tree
>   c) does not have the orderd bit set
> 
> "btrfs: move the cow_fixup earlier in writepages handling" would
> be the obvious candidate touching this area, even if I can't see
> how it makes a difference.  Any chance you could check if it is
> indeed the culprit?
> 
> And here is the more frequent hang I see with generic/475 loops:
> 

I backed your patches out and re-ran and I hit hangs with generic/475 still, so
I think you're clear.  There's something awkward going on here, the below hang
just looks like we're waiting for IO.  The caching thread is blocking the
transaction commit because it's trying to read some old blocks, and it's been
waiting for them to come back for 2 minutes.  That's holding everybody else up.
I'll dig into all of this, misc-next is definitely fucked somehow, your stuff
may just be a victim.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-19 14:30       ` Josef Bacik
@ 2023-07-19 15:25         ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-19 15:25 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, Chris Mason, David Sterba, linux-btrfs,
	linux-fsdevel

On Wed, Jul 19, 2023 at 10:30:32AM -0400, Josef Bacik wrote:
> I backed your patches out and re-ran and I hit hangs with generic/475 still, so
> I think you're clear.  There's something awkward going on here, the below hang
> just looks like we're waiting for IO.  The caching thread is blocking the
> transaction commit because it's trying to read some old blocks, and it's been
> waiting for them to come back for 2 minutes.  That's holding everybody else up.
> I'll dig into all of this, misc-next is definitely fucked somehow, your stuff
> may just be a victim.  Thanks,

Yes.  I think this has shown up in misc-next as so far I've not
been able to reproduce anything on 6.5-rc2.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-19 11:50     ` Christoph Hellwig
  2023-07-19 14:30       ` Josef Bacik
@ 2023-07-19 21:42       ` Josef Bacik
  2023-07-20  4:48         ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2023-07-19 21:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chris Mason, David Sterba, linux-btrfs, linux-fsdevel

On Wed, Jul 19, 2023 at 01:50:10PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 19, 2023 at 07:39:01AM +0200, Christoph Hellwig wrote:
> > My day was already over by the time you sent this, but I looked into
> > it the first thing this morning.
> > 
> > I can't reproduce the hang, but my first thought was "why the heck do
> > even end up in the fixup worker" given that there is no GUP-based
> > dirtying in the thread.
> > 
> > I can reproduce the test case hitting the fixup worker now, while
> > I can't on misc-next.  Looking into it now, but the rework of the
> > fixup logic is a hot candidate.
> 
> So unfortunately even the BUG seems to trigger in a very sporadic
> manner, making a bisect impossible.  This is made worse by me actually
> hitting another hang (dmesg output below) way more frequently, but that
> one actually reproduces on misc-next as well.  I'm also still confused
> on how we hit the fixup worker, as that means we'll need to see a page
> that.
> 
>   a) was dirty so that the writeback code picks it up
>   b) had the delalloc bit already cleaned in the I/O tree
>   c) does not have the orderd bit set
> 
> "btrfs: move the cow_fixup earlier in writepages handling" would
> be the obvious candidate touching this area, even if I can't see
> how it makes a difference.  Any chance you could check if it is
> indeed the culprit?
> 
> And here is the more frequent hang I see with generic/475 loops:
> 

More investigation, what's happening is dmsetup suspend is stuck because it's
waiting for outstanding io to finish, so these other IO's are stuck in the
deffered list for the linear mapping.  I'm still getting to why the outstanding
IO's aren't going, I'll figure that out in the morning, but seems like this may
not be a btrfs problem.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: small writeback fixes
  2023-07-19 21:42       ` Josef Bacik
@ 2023-07-20  4:48         ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-20  4:48 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, Chris Mason, David Sterba, linux-btrfs,
	linux-fsdevel

On Wed, Jul 19, 2023 at 05:42:55PM -0400, Josef Bacik wrote:
> More investigation, what's happening is dmsetup suspend is stuck because it's
> waiting for outstanding io to finish, so these other IO's are stuck in the
> deffered list for the linear mapping.  I'm still getting to why the outstanding
> IO's aren't going, I'll figure that out in the morning, but seems like this may
> not be a btrfs problem.  Thanks,

I tried to bisect it based on my earlier observation that 6.5-rc2 wasn't
affect, but at that point could not reproduce the hang any more.
Overall the conditions seems very non-deterministic and might depend
on the state and thus latency of the SSD or something like that.

Instead of seen a bunch of unmount failures and then busy devices which
clear up with another re-run..

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-07-20  4:48 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-13 13:04 small writeback fixes Christoph Hellwig
2023-07-13 13:04 ` [PATCH 1/9] btrfs: don't stop integrity writeback too early Christoph Hellwig
2023-07-13 13:04 ` [PATCH 2/9] btrfs: don't wait for writeback on clean pages in extent_write_cache_pages Christoph Hellwig
2023-07-13 13:04 ` [PATCH 3/9] btrfs: fix an error handling corner case in cow_file_range Christoph Hellwig
2023-07-13 13:04 ` [PATCH 4/9] btrfs: move the cow_fixup earlier in writepages handling Christoph Hellwig
2023-07-13 13:04 ` [PATCH 5/9] btrfs: fix handling of errors from __extent_writepage_io Christoph Hellwig
2023-07-13 13:04 ` [PATCH 6/9] btrfs: stop submitting I/O after an error in extent_write_locked_range Christoph Hellwig
2023-07-13 13:04 ` [PATCH 7/9] btrfs: fix a race in clearing the writeback bit for sub-page I/O Christoph Hellwig
2023-07-13 13:04 ` [PATCH 8/9] btrfs: remove the call to btrfs_mark_ordered_io_finished in btrfs_writepage_fixup_worker Christoph Hellwig
2023-07-13 13:04 ` [PATCH 9/9] btrfs: lift the call to mapping_set_error out of cow_file_range Christoph Hellwig
2023-07-14 13:59 ` small writeback fixes Josef Bacik
2023-07-18 17:17 ` Josef Bacik
2023-07-19  5:39   ` Christoph Hellwig
2023-07-19 11:50     ` Christoph Hellwig
2023-07-19 14:30       ` Josef Bacik
2023-07-19 15:25         ` Christoph Hellwig
2023-07-19 21:42       ` Josef Bacik
2023-07-20  4:48         ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).