* [PATCH v5 1/5] btrfs: make __extent_writepage_io() to write specified range only
2024-05-18 5:07 [PATCH v5 0/5] btrfs: subpage + zoned fixes Qu Wenruo
@ 2024-05-18 5:07 ` Qu Wenruo
2024-05-21 7:23 ` Naohiro Aota
2024-05-18 5:07 ` [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking Qu Wenruo
` (3 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2024-05-18 5:07 UTC (permalink / raw)
To: linux-btrfs; +Cc: Johannes.Thumshirn, josef, Johannes Thumshirn
Function __extent_writepage_io() is designed to find all dirty range of
a page, and add that dirty range into the bio_ctrl for submission.
It requires all the dirtied ranges to be covered by ordered extents.
It gets called in two locations, but one call site is not subpage aware:
- __extent_writepage()
It gets called when writepage_delalloc() returned 0, which means
writepage_delalloc() has handled delalloc for all subpage sectors
inside the page.
So this call site is OK.
- extent_write_locked_range()
This call site is utilized by zoned support, and in this case, we may
only run delalloc range for a subset of the page, like this: (64K page
size)
0 16K 32K 48K 64K
|/////| |///////| |
In above case, if extent_write_locked_range() is only triggered for
range [0, 16K), __extent_writepage_io() would still try to submit
the dirty range of [32K, 48K), then it would not find any ordered
extent for it and trigger various ASSERT()s.
Fix this problem by:
- Introducing @start and @len parameters to specify the range
For the first call site, we just pass the whole page, and the behavior
is not touched, since run_delalloc_range() for the page should have
created all ordered extents for the page.
For the second call site, we would avoid touching anything beyond the
range, thus avoid the dirty range which is not yet covered by any
delalloc range.
- Making btrfs_folio_assert_not_dirty() subpage aware
The only caller is inside __extent_writepage_io(), and since that
caller now accepts a subpage range, we should also check the subpage
range other than the whole page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 18 +++++++++++-------
fs/btrfs/subpage.c | 22 ++++++++++++++++------
fs/btrfs/subpage.h | 3 ++-
3 files changed, 29 insertions(+), 14 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 597387e9f040..8a4a7d00795f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1339,20 +1339,23 @@ static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
* < 0 if there were errors (page still locked)
*/
static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
- struct page *page,
+ struct page *page, u64 start, u32 len,
struct btrfs_bio_ctrl *bio_ctrl,
loff_t i_size,
int *nr_ret)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u64 cur = page_offset(page);
- u64 end = cur + PAGE_SIZE - 1;
+ u64 cur = start;
+ u64 end = start + len - 1;
u64 extent_offset;
u64 block_start;
struct extent_map *em;
int ret = 0;
int nr = 0;
+ ASSERT(start >= page_offset(page) &&
+ start + len <= page_offset(page) + PAGE_SIZE);
+
ret = btrfs_writepage_cow_fixup(page);
if (ret) {
/* Fixup worker will requeue */
@@ -1441,7 +1444,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
nr++;
}
- btrfs_folio_assert_not_dirty(fs_info, page_folio(page));
+ btrfs_folio_assert_not_dirty(fs_info, page_folio(page), start, len);
*nr_ret = nr;
return 0;
@@ -1499,7 +1502,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
if (ret)
goto done;
- ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr);
+ ret = __extent_writepage_io(BTRFS_I(inode), page, page_offset(page),
+ PAGE_SIZE, bio_ctrl, i_size, &nr);
if (ret == 1)
return 0;
@@ -2251,8 +2255,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
clear_page_dirty_for_io(page);
}
- ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
- i_size, &nr);
+ ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
+ &bio_ctrl, i_size, &nr);
if (ret == 1)
goto next_page;
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 54736f6238e6..183b32f51f51 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -703,19 +703,29 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
* Make sure not only the page dirty bit is cleared, but also subpage dirty bit
* is cleared.
*/
-void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio)
+void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+ struct folio *folio, u64 start, u32 len)
{
- struct btrfs_subpage *subpage = folio_get_private(folio);
+ struct btrfs_subpage *subpage;
+ int start_bit;
+ int nbits;
+ unsigned long flags;
if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
return;
- ASSERT(!folio_test_dirty(folio));
- if (!btrfs_is_subpage(fs_info, folio->mapping))
+ if (!btrfs_is_subpage(fs_info, folio->mapping)) {
+ ASSERT(!folio_test_dirty(folio));
return;
+ }
- ASSERT(folio_test_private(folio) && folio_get_private(folio));
- ASSERT(subpage_test_bitmap_all_zero(fs_info, subpage, dirty));
+ start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len);
+ nbits = len >> fs_info->sectorsize_bits;
+ subpage = folio_get_private(folio);
+ ASSERT(subpage);
+ spin_lock_irqsave(&subpage->lock, flags);
+ ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+ spin_unlock_irqrestore(&subpage->lock, flags);
}
/*
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index b6dc013b0fdc..4b363d9453af 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -156,7 +156,8 @@ DECLARE_BTRFS_SUBPAGE_OPS(checked);
bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
struct folio *folio, u64 start, u32 len);
-void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio);
+void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+ struct folio *folio, u64 start, u32 len);
void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
struct folio *folio, u64 start, u32 len);
void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
--
2.45.0
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v5 1/5] btrfs: make __extent_writepage_io() to write specified range only
2024-05-18 5:07 ` [PATCH v5 1/5] btrfs: make __extent_writepage_io() to write specified range only Qu Wenruo
@ 2024-05-21 7:23 ` Naohiro Aota
0 siblings, 0 replies; 15+ messages in thread
From: Naohiro Aota @ 2024-05-21 7:23 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
On Sat, May 18, 2024 at 02:37:39PM GMT, Qu Wenruo wrote:
> Function __extent_writepage_io() is designed to find all dirty range of
> a page, and add that dirty range into the bio_ctrl for submission.
> It requires all the dirtied range to be covered by an ordered extent.
>
> It get called in two locations, but one call site is not subpage aware:
>
> - __extent_writepage()
> It get called when writepage_delalloc() returned 0, which means
> writepage_delalloc() has handled dellalloc for all subpage sectors
> inside the page.
>
> So this call site is OK.
>
> - extent_write_locked_range()
> This call site is utilized by zoned support, and in this case, we may
> only run delalloc range for a subset of the page, like this: (64K page
> size)
>
> 0 16K 32K 48K 64K
> |/////| |///////| |
>
> In above case, if extent_write_locked_range() is only triggered for
> range [0, 16K), __extent_writepage_io() would still try to submit
> the dirty range of [32K, 48K), then it would not find any ordered
> extent for it and trigger various ASSERT()s.
>
> Fix this problem by:
>
> - Introducing @start and @len parameters to specify the range
>
> For the first call site, we just pass the whole page, and the behavior
> is not touched, since run_delalloc_range() for the page should have
> created all ordered extents for the page.
>
> For the second call site, we would avoid touching anything beyond the
> range, thus avoid the dirty range which is not yet covered by any
> delalloc range.
>
> - Making btrfs_folio_assert_not_dirty() subpage aware
> The only caller is inside __extent_writepage_io(), and since that
> caller now accepts a subpage range, we should also check the subpage
> range other than the whole page.
>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 18 +++++++++++-------
> fs/btrfs/subpage.c | 22 ++++++++++++++++------
> fs/btrfs/subpage.h | 3 ++-
> 3 files changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 597387e9f040..8a4a7d00795f 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1339,20 +1339,23 @@ static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
> * < 0 if there were errors (page still locked)
> */
> static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
> - struct page *page,
> + struct page *page, u64 start, u32 len,
> struct btrfs_bio_ctrl *bio_ctrl,
> loff_t i_size,
> int *nr_ret)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> - u64 cur = page_offset(page);
> - u64 end = cur + PAGE_SIZE - 1;
> + u64 cur = start;
> + u64 end = start + len - 1;
> u64 extent_offset;
> u64 block_start;
> struct extent_map *em;
> int ret = 0;
> int nr = 0;
>
> + ASSERT(start >= page_offset(page) &&
> + start + len <= page_offset(page) + PAGE_SIZE);
> +
> ret = btrfs_writepage_cow_fixup(page);
> if (ret) {
> /* Fixup worker will requeue */
> @@ -1441,7 +1444,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
> nr++;
> }
>
> - btrfs_folio_assert_not_dirty(fs_info, page_folio(page));
> + btrfs_folio_assert_not_dirty(fs_info, page_folio(page), start, len);
> *nr_ret = nr;
> return 0;
>
> @@ -1499,7 +1502,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
> if (ret)
> goto done;
>
> - ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr);
> + ret = __extent_writepage_io(BTRFS_I(inode), page, page_offset(page),
> + PAGE_SIZE, bio_ctrl, i_size, &nr);
> if (ret == 1)
> return 0;
>
> @@ -2251,8 +2255,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
> clear_page_dirty_for_io(page);
> }
>
> - ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
> - i_size, &nr);
> + ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
> + &bio_ctrl, i_size, &nr);
> if (ret == 1)
> goto next_page;
>
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 54736f6238e6..183b32f51f51 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -703,19 +703,29 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
> * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
> * is cleared.
> */
> -void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio)
> +void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
> + struct folio *folio, u64 start, u32 len)
> {
> - struct btrfs_subpage *subpage = folio_get_private(folio);
> + struct btrfs_subpage *subpage;
> + int start_bit;
> + int nbits;
Can we have these as "unsigned int" to be consistent with the function
interface? But, it's not a big deal so
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
> + unsigned long flags;
>
> if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
> return;
>
> - ASSERT(!folio_test_dirty(folio));
> - if (!btrfs_is_subpage(fs_info, folio->mapping))
> + if (!btrfs_is_subpage(fs_info, folio->mapping)) {
> + ASSERT(!folio_test_dirty(folio));
> return;
> + }
>
> - ASSERT(folio_test_private(folio) && folio_get_private(folio));
> - ASSERT(subpage_test_bitmap_all_zero(fs_info, subpage, dirty));
> + start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len);
> + nbits = len >> fs_info->sectorsize_bits;
> + subpage = folio_get_private(folio);
> + ASSERT(subpage);
> + spin_lock_irqsave(&subpage->lock, flags);
> + ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
> + spin_unlock_irqrestore(&subpage->lock, flags);
> }
>
> /*
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index b6dc013b0fdc..4b363d9453af 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -156,7 +156,8 @@ DECLARE_BTRFS_SUBPAGE_OPS(checked);
> bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
> struct folio *folio, u64 start, u32 len);
>
> -void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio);
> +void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
> + struct folio *folio, u64 start, u32 len);
> void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
> struct folio *folio, u64 start, u32 len);
> void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
> --
> 2.45.0
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking
2024-05-18 5:07 [PATCH v5 0/5] btrfs: subpage + zoned fixes Qu Wenruo
2024-05-18 5:07 ` [PATCH v5 1/5] btrfs: make __extent_writepage_io() to write specified range only Qu Wenruo
@ 2024-05-18 5:07 ` Qu Wenruo
2024-05-21 7:50 ` Naohiro Aota
2024-05-18 5:07 ` [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc() Qu Wenruo
` (2 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2024-05-18 5:07 UTC (permalink / raw)
To: linux-btrfs; +Cc: Johannes.Thumshirn, josef, Johannes Thumshirn
Three new helpers are introduced for the incoming subpage delalloc locking
change.
- btrfs_folio_set_writer_lock()
This is to mark specified range with subpage specific writer lock.
After calling this, the subpage range can be properly unlocked by
btrfs_folio_end_writer_lock()
- btrfs_subpage_find_writer_locked()
This is to find the writer locked subpage range in a page.
With the help of btrfs_folio_set_writer_lock(), it can allow us to
record and find previously locked subpage range without extra memory
allocation.
- btrfs_folio_end_all_writers()
This is for the locked_page of __extent_writepage(), as there may be
multiple subpage delalloc ranges locked.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/subpage.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/subpage.h | 7 +++
2 files changed, 123 insertions(+)
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 183b32f51f51..3c957d03324e 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -775,6 +775,122 @@ void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
btrfs_folio_end_writer_lock(fs_info, folio, start, len);
}
+/*
+ * This is for folio already locked by plain lock_page()/folio_lock(), which
+ * doesn't have any subpage awareness.
+ *
+ * This would populate the involved subpage ranges so that subpage helpers can
+ * properly unlock them.
+ */
+void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
+ struct folio *folio, u64 start, u32 len)
+{
+ struct btrfs_subpage *subpage;
+ unsigned long flags;
+ int start_bit;
+ int nbits;
+ int ret;
+
+ ASSERT(folio_test_locked(folio));
+ if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping))
+ return;
+
+ subpage = folio_get_private(folio);
+ start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
+ nbits = len >> fs_info->sectorsize_bits;
+ spin_lock_irqsave(&subpage->lock, flags);
+ /* Target range should not yet be locked. */
+ ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+ bitmap_set(subpage->bitmaps, start_bit, nbits);
+ ret = atomic_add_return(nbits, &subpage->writers);
+ ASSERT(ret <= fs_info->subpage_info->bitmap_nr_bits);
+ spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+/*
+ * Find any subpage writer locked range inside @folio, starting at file offset
+ * @search_start.
+ * The caller should ensure the folio is locked.
+ *
+ * Return true and update @found_start_ret and @found_len_ret to the first
+ * writer locked range.
+ * Return false if there is no writer locked range.
+ */
+bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
+ struct folio *folio, u64 search_start,
+ u64 *found_start_ret, u32 *found_len_ret)
+{
+ struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
+ struct btrfs_subpage *subpage = folio_get_private(folio);
+ const int len = PAGE_SIZE - offset_in_page(search_start);
+ const int start_bit = subpage_calc_start_bit(fs_info, folio, locked,
+ search_start, len);
+ const int locked_bitmap_start = subpage_info->locked_offset;
+ const int locked_bitmap_end = locked_bitmap_start +
+ subpage_info->bitmap_nr_bits;
+ unsigned long flags;
+ int first_zero;
+ int first_set;
+ bool found = false;
+
+ ASSERT(folio_test_locked(folio));
+ spin_lock_irqsave(&subpage->lock, flags);
+ first_set = find_next_bit(subpage->bitmaps, locked_bitmap_end,
+ start_bit);
+ if (first_set >= locked_bitmap_end)
+ goto out;
+
+ found = true;
+ *found_start_ret = folio_pos(folio) +
+ ((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);
+
+ first_zero = find_next_zero_bit(subpage->bitmaps,
+ locked_bitmap_end, first_set);
+ *found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
+out:
+ spin_unlock_irqrestore(&subpage->lock, flags);
+ return found;
+}
+
+/*
+ * Unlike btrfs_folio_end_writer_lock() which unlock a specified subpage range,
+ * this would end all writer locked ranges of a page.
+ *
+ * This is for the locked page of __extent_writepage(), as the locked page
+ * can contain several locked subpage ranges.
+ */
+void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
+ struct folio *folio)
+{
+ u64 folio_start = folio_pos(folio);
+ u64 cur = folio_start;
+
+ ASSERT(folio_test_locked(folio));
+ if (!btrfs_is_subpage(fs_info, folio->mapping)) {
+ folio_unlock(folio);
+ return;
+ }
+
+ while (cur < folio_start + PAGE_SIZE) {
+ u64 found_start;
+ u32 found_len;
+ bool found;
+ bool last;
+
+ found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
+ &found_start, &found_len);
+ if (!found)
+ break;
+ last = btrfs_subpage_end_and_test_writer(fs_info, folio,
+ found_start, found_len);
+ if (last) {
+ folio_unlock(folio);
+ break;
+ }
+ cur = found_start + found_len;
+ }
+}
+
#define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst) \
bitmap_cut(dst, subpage->bitmaps, 0, \
subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 4b363d9453af..9f19850d59f2 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -112,6 +112,13 @@ int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
struct folio *folio, u64 start, u32 len);
void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
struct folio *folio, u64 start, u32 len);
+void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
+ struct folio *folio, u64 start, u32 len);
+bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
+ struct folio *folio, u64 search_start,
+ u64 *found_start_ret, u32 *found_len_ret);
+void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
+ struct folio *folio);
/*
* Template for subpage related operations.
--
2.45.0
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking
2024-05-18 5:07 ` [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking Qu Wenruo
@ 2024-05-21 7:50 ` Naohiro Aota
2024-05-21 7:57 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Naohiro Aota @ 2024-05-21 7:50 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
On Sat, May 18, 2024 at 02:37:40PM GMT, Qu Wenruo wrote:
> Three new helpers are introduced for the incoming subpage delalloc locking
> change.
>
> - btrfs_folio_set_writer_lock()
> This is to mark specified range with subpage specific writer lock.
> After calling this, the subpage range can be proper unlocked by
> btrfs_folio_end_writer_lock()
>
> - btrfs_subpage_find_writer_locked()
> This is to find the writer locked subpage range in a page.
> With the help of btrfs_folio_set_writer_lock(), it can allow us to
> record and find previously locked subpage range without extra memory
> allocation.
>
> - btrfs_folio_end_all_writers()
> This is for the locked_page of __extent_writepage(), as there may be
> multiple subpage delalloc ranges locked.
>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
There are some nits inlined below, basically it looks good.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
> fs/btrfs/subpage.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/subpage.h | 7 +++
> 2 files changed, 123 insertions(+)
>
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 183b32f51f51..3c957d03324e 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -775,6 +775,122 @@ void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
> btrfs_folio_end_writer_lock(fs_info, folio, start, len);
> }
>
> +/*
> + * This is for folio already locked by plain lock_page()/folio_lock(), which
> + * doesn't have any subpage awareness.
> + *
> + * This would populate the involved subpage ranges so that subpage helpers can
> + * properly unlock them.
> + */
> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
> + struct folio *folio, u64 start, u32 len)
> +{
> + struct btrfs_subpage *subpage;
> + unsigned long flags;
> + int start_bit;
> + int nbits;
May want to use unsigned int for a consistency...
> + int ret;
> +
> + ASSERT(folio_test_locked(folio));
> + if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping))
> + return;
> +
> + subpage = folio_get_private(folio);
> + start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
> + nbits = len >> fs_info->sectorsize_bits;
> + spin_lock_irqsave(&subpage->lock, flags);
> + /* Target range should not yet be locked. */
> + ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
> + bitmap_set(subpage->bitmaps, start_bit, nbits);
> + ret = atomic_add_return(nbits, &subpage->writers);
> + ASSERT(ret <= fs_info->subpage_info->bitmap_nr_bits);
> + spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> +/*
> + * Find any subpage writer locked range inside @folio, starting at file offset
> + * @search_start.
> + * The caller should ensure the folio is locked.
> + *
> + * Return true and update @found_start_ret and @found_len_ret to the first
> + * writer locked range.
> + * Return false if there is no writer locked range.
> + */
> +bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
> + struct folio *folio, u64 search_start,
> + u64 *found_start_ret, u32 *found_len_ret)
> +{
> + struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
> + struct btrfs_subpage *subpage = folio_get_private(folio);
> + const int len = PAGE_SIZE - offset_in_page(search_start);
> + const int start_bit = subpage_calc_start_bit(fs_info, folio, locked,
> + search_start, len);
> + const int locked_bitmap_start = subpage_info->locked_offset;
> + const int locked_bitmap_end = locked_bitmap_start +
> + subpage_info->bitmap_nr_bits;
> + unsigned long flags;
> + int first_zero;
> + int first_set;
> + bool found = false;
> +
> + ASSERT(folio_test_locked(folio));
> + spin_lock_irqsave(&subpage->lock, flags);
> + first_set = find_next_bit(subpage->bitmaps, locked_bitmap_end,
> + start_bit);
> + if (first_set >= locked_bitmap_end)
> + goto out;
> +
> + found = true;
> + *found_start_ret = folio_pos(folio) +
> + ((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);
It's a bit fearful to see an "int" value is shifted and added into u64
value. But, I guess sectorsize is within 32-bit range, right?
> +
> + first_zero = find_next_zero_bit(subpage->bitmaps,
> + locked_bitmap_end, first_set);
> + *found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
> +out:
> + spin_unlock_irqrestore(&subpage->lock, flags);
> + return found;
> +}
> +
> +/*
> + * Unlike btrfs_folio_end_writer_lock() which unlock a specified subpage range,
> + * this would end all writer locked ranges of a page.
> + *
> + * This is for the locked page of __extent_writepage(), as the locked page
> + * can contain several locked subpage ranges.
> + */
> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
> + struct folio *folio)
> +{
> + u64 folio_start = folio_pos(folio);
> + u64 cur = folio_start;
> +
> + ASSERT(folio_test_locked(folio));
> + if (!btrfs_is_subpage(fs_info, folio->mapping)) {
> + folio_unlock(folio);
> + return;
> + }
> +
> + while (cur < folio_start + PAGE_SIZE) {
> + u64 found_start;
> + u32 found_len;
> + bool found;
> + bool last;
> +
> + found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
> + &found_start, &found_len);
> + if (!found)
> + break;
> + last = btrfs_subpage_end_and_test_writer(fs_info, folio,
> + found_start, found_len);
> + if (last) {
> + folio_unlock(folio);
> + break;
> + }
> + cur = found_start + found_len;
> + }
> +}
> +
> #define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst) \
> bitmap_cut(dst, subpage->bitmaps, 0, \
> subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 4b363d9453af..9f19850d59f2 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -112,6 +112,13 @@ int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
> struct folio *folio, u64 start, u32 len);
> void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
> struct folio *folio, u64 start, u32 len);
> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
> + struct folio *folio, u64 start, u32 len);
> +bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
> + struct folio *folio, u64 search_start,
> + u64 *found_start_ret, u32 *found_len_ret);
> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
> + struct folio *folio);
>
> /*
> * Template for subpage related operations.
> --
> 2.45.0
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking
2024-05-21 7:50 ` Naohiro Aota
@ 2024-05-21 7:57 ` Qu Wenruo
0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2024-05-21 7:57 UTC (permalink / raw)
To: Naohiro Aota, Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
在 2024/5/21 17:20, Naohiro Aota 写道:
[...]
>> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
>> + struct folio *folio, u64 start, u32 len)
>> +{
>> + struct btrfs_subpage *subpage;
>> + unsigned long flags;
>> + int start_bit;
>> + int nbits;
>
> May want to use unsigned int for a consistency...
>
I can definitely change all the int to "unsigned int" to be consistent
during pushing to for-next branch.
[...]
>> + found = true;
>> + *found_start_ret = folio_pos(folio) +
>> + ((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);
>
> It's a bit fearful to see an "int" value is shifted and added into u64
> value. But, I guess sectorsize is within 32-bit range, right?
In fact, (first_set - locked_bitmap_start) is never going to be larger
than (PAGE_SIZE / sectorsize).
I can add extra ASSERT() to be extra safe for that too.
Thanks,
Qu
>
>> +
>> + first_zero = find_next_zero_bit(subpage->bitmaps,
>> + locked_bitmap_end, first_set);
>> + *found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
>> +out:
>> + spin_unlock_irqrestore(&subpage->lock, flags);
>> + return found;
>> +}
>> +
>> +/*
>> + * Unlike btrfs_folio_end_writer_lock() which unlock a specified subpage range,
>> + * this would end all writer locked ranges of a page.
>> + *
>> + * This is for the locked page of __extent_writepage(), as the locked page
>> + * can contain several locked subpage ranges.
>> + */
>> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>> + struct folio *folio)
>> +{
>> + u64 folio_start = folio_pos(folio);
>> + u64 cur = folio_start;
>> +
>> + ASSERT(folio_test_locked(folio));
>> + if (!btrfs_is_subpage(fs_info, folio->mapping)) {
>> + folio_unlock(folio);
>> + return;
>> + }
>> +
>> + while (cur < folio_start + PAGE_SIZE) {
>> + u64 found_start;
>> + u32 found_len;
>> + bool found;
>> + bool last;
>> +
>> + found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
>> + &found_start, &found_len);
>> + if (!found)
>> + break;
>> + last = btrfs_subpage_end_and_test_writer(fs_info, folio,
>> + found_start, found_len);
>> + if (last) {
>> + folio_unlock(folio);
>> + break;
>> + }
>> + cur = found_start + found_len;
>> + }
>> +}
>> +
>> #define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst) \
>> bitmap_cut(dst, subpage->bitmaps, 0, \
>> subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> index 4b363d9453af..9f19850d59f2 100644
>> --- a/fs/btrfs/subpage.h
>> +++ b/fs/btrfs/subpage.h
>> @@ -112,6 +112,13 @@ int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
>> struct folio *folio, u64 start, u32 len);
>> void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
>> struct folio *folio, u64 start, u32 len);
>> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
>> + struct folio *folio, u64 start, u32 len);
>> +bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
>> + struct folio *folio, u64 search_start,
>> + u64 *found_start_ret, u32 *found_len_ret);
>> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>> + struct folio *folio);
>>
>> /*
>> * Template for subpage related operations.
>> --
>> 2.45.0
>>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
2024-05-18 5:07 [PATCH v5 0/5] btrfs: subpage + zoned fixes Qu Wenruo
2024-05-18 5:07 ` [PATCH v5 1/5] btrfs: make __extent_writepage_io() to write specified range only Qu Wenruo
2024-05-18 5:07 ` [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking Qu Wenruo
@ 2024-05-18 5:07 ` Qu Wenruo
2024-05-21 8:11 ` Naohiro Aota
2024-05-18 5:07 ` [PATCH v5 4/5] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
2024-05-18 5:07 ` [PATCH v5 5/5] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
4 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2024-05-18 5:07 UTC (permalink / raw)
To: linux-btrfs; +Cc: Johannes.Thumshirn, josef
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special pitfall for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
This patch would address the page unlocking problem of run_delalloc_cow(),
by changing the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page after the
last lock user has released it.
To do that, this patch would:
- Utilizing subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside
writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges, we have to call
run_delalloc_ranges() multiple times.
If we hit an error halfway through, we still need to unlock the
remaining ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 77 ++++++++++++++++++++++++++++++++++++++++----
fs/btrfs/subpage.c | 6 ++++
2 files changed, 76 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8a4a7d00795f..b6dc9308105d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1226,13 +1226,22 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
struct page *page, struct writeback_control *wbc)
{
+ struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode);
+ struct folio *folio = page_folio(page);
const u64 page_start = page_offset(page);
const u64 page_end = page_start + PAGE_SIZE - 1;
+ /*
+ * Saves the last found delalloc end. As the delalloc end can go beyond
+ * page boundary, thus we can not rely on subpage bitmap to locate
+ * the last dealloc end.
+ */
+ u64 last_delalloc_end = 0;
u64 delalloc_start = page_start;
u64 delalloc_end = page_end;
u64 delalloc_to_write = 0;
int ret = 0;
+ /* Lock all (subpage) dealloc ranges inside the page first. */
while (delalloc_start < page_end) {
delalloc_end = page_end;
if (!find_lock_delalloc_range(&inode->vfs_inode, page,
@@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
delalloc_start = delalloc_end + 1;
continue;
}
-
- ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
- delalloc_end, wbc);
- if (ret < 0)
- return ret;
-
+ btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
+ min(delalloc_end, page_end) + 1 -
+ delalloc_start);
+ last_delalloc_end = delalloc_end;
delalloc_start = delalloc_end + 1;
}
+ delalloc_start = page_start;
+ /* Run the delalloc ranges for above locked ranges. */
+ while (last_delalloc_end && delalloc_start < page_end) {
+ u64 found_start;
+ u32 found_len;
+ bool found;
+ if (!btrfs_is_subpage(fs_info, page->mapping)) {
+ /*
+ * For non-subpage case, the found delalloc range must
+ * cover this page and there must be only one locked
+ * delalloc range.
+ */
+ found_start = page_start;
+ found_len = last_delalloc_end + 1 - found_start;
+ found = true;
+ } else {
+ found = btrfs_subpage_find_writer_locked(fs_info, folio,
+ delalloc_start, &found_start, &found_len);
+ }
+ if (!found)
+ break;
+ /*
+ * The subpage range covers the last sector, the delalloc range may
+ * end beyonds the page boundary, use the saved delalloc_end
+ * instead.
+ */
+ if (found_start + found_len >= page_end)
+ found_len = last_delalloc_end + 1 - found_start;
+
+ if (ret < 0) {
+ /* Cleanup the remaining locked ranges. */
+ unlock_extent(&inode->io_tree, found_start,
+ found_start + found_len - 1, NULL);
+ __unlock_for_delalloc(&inode->vfs_inode, page, found_start,
+ found_start + found_len - 1);
+ } else {
+ ret = btrfs_run_delalloc_range(inode, page, found_start,
+ found_start + found_len - 1, wbc);
+ }
+ /*
+ * Above btrfs_run_delalloc_range() may have unlocked the page,
+ * Thus for the last range, we can not touch the page anymore.
+ */
+ if (found_start + found_len >= last_delalloc_end + 1)
+ break;
+
+ delalloc_start = found_start + found_len;
+ }
+ if (ret < 0)
+ return ret;
+
+ if (last_delalloc_end)
+ delalloc_end = last_delalloc_end;
+ else
+ delalloc_end = page_end;
/*
* delalloc_end is already one less than the total length, so
* we don't subtract one from PAGE_SIZE
@@ -1520,7 +1582,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
PAGE_SIZE, !ret);
mapping_set_error(page->mapping, ret);
}
- unlock_page(page);
+
+ btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
ASSERT(ret <= 0);
return ret;
}
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 3c957d03324e..81b862d7ab53 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -862,6 +862,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
struct folio *folio)
{
+ struct btrfs_subpage *subpage = folio_get_private(folio);
u64 folio_start = folio_pos(folio);
u64 cur = folio_start;
@@ -871,6 +872,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
return;
}
+ /* The page has no new delalloc range locked on it. Just plain unlock. */
+ if (atomic_read(&subpage->writers) == 0) {
+ folio_unlock(folio);
+ return;
+ }
while (cur < folio_start + PAGE_SIZE) {
u64 found_start;
u32 found_len;
--
2.45.0
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
2024-05-18 5:07 ` [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc() Qu Wenruo
@ 2024-05-21 8:11 ` Naohiro Aota
2024-05-21 8:45 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Naohiro Aota @ 2024-05-21 8:11 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
On Sat, May 18, 2024 at 02:37:41PM GMT, Qu Wenruo wrote:
> If we have a subpage range like this for a 16K page with 4K sectorsize:
>
> 0 4K 8K 12K 16K
> |/////| |//////| |
>
> |/////| = dirty range
>
> Currently writepage_delalloc() would go through the following steps:
>
> - lock range [0, 4K)
> - run delalloc range for [0, 4K)
> - lock range [8K, 12K)
> - run delalloc range for [8K 12K)
>
> So far it's fine for regular subpage writeback, as
> btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
> cow_file_range() and run_delalloc_compressed().
>
> But there is a special pitfall for zoned subpage, where we will go
> through run_delalloc_cow(), which would create the ordered extent for the
> range and immediately submit the range.
> This would unlock the whole page range, causing all kinds of different
> ASSERT()s related to locked page.
>
> This patch would address the page unlocking problem of run_delalloc_cow(),
> by changing the workflow to the following one:
>
> - lock range [0, 4K)
> - lock range [8K, 12K)
> - run delalloc range for [0, 4K)
> - run delalloc range for [8K, 12K)
>
> So that run_delalloc_cow() can only unlock the full page until the
> last lock user released.
>
> To do that, this patch would:
>
> - Utilizing subpage locked bitmap
> So for every delalloc range we found, call
> btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
> and later btrfs_folio_end_all_writers() if the page is fully unlocked.
>
> So we know there is a delalloc range that needs to be run later.
>
> - Save the @delalloc_end as @last_delalloc_end inside
> writepage_delalloc()
> Since subpage locked bitmap is only for ranges inside the page,
> meanwhile we can have delalloc range ends beyond our page boundary,
> we have to save the @last_delalloc_end just in case it's beyond our
> page boundary.
>
> Although there is one extra point to notice:
>
> - We need to handle errors in previous iteration
> Since we can have multiple locked delalloc ranges thus we have to call
> run_delalloc_ranges() multiple times.
> If we hit an error half way, we still need to unlock the remaining
> ranges.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 77 ++++++++++++++++++++++++++++++++++++++++----
> fs/btrfs/subpage.c | 6 ++++
> 2 files changed, 76 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8a4a7d00795f..b6dc9308105d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1226,13 +1226,22 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
> static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> struct page *page, struct writeback_control *wbc)
> {
> + struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode);
> + struct folio *folio = page_folio(page);
> const u64 page_start = page_offset(page);
> const u64 page_end = page_start + PAGE_SIZE - 1;
> + /*
> + * Saves the last found delalloc end. As the delalloc end can go beyond
> + * page boundary, thus we can not rely on subpage bitmap to locate
> + * the last dealloc end.
typo: dealloc -> delalloc
> + */
> + u64 last_delalloc_end = 0;
> u64 delalloc_start = page_start;
> u64 delalloc_end = page_end;
> u64 delalloc_to_write = 0;
> int ret = 0;
>
> + /* Lock all (subpage) dealloc ranges inside the page first. */
Same here.
> while (delalloc_start < page_end) {
> delalloc_end = page_end;
> if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> delalloc_start = delalloc_end + 1;
> continue;
> }
> -
> - ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> - delalloc_end, wbc);
> - if (ret < 0)
> - return ret;
> -
> + btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> + min(delalloc_end, page_end) + 1 -
> + delalloc_start);
> + last_delalloc_end = delalloc_end;
> delalloc_start = delalloc_end + 1;
> }
Can we bail out on the "if (!last_delalloc_end)" case? It would make the
following code simpler.
> + delalloc_start = page_start;
> + /* Run the delalloc ranges for above locked ranges. */
> + while (last_delalloc_end && delalloc_start < page_end) {
> + u64 found_start;
> + u32 found_len;
> + bool found;
>
> + if (!btrfs_is_subpage(fs_info, page->mapping)) {
> + /*
> + * For non-subpage case, the found delalloc range must
> + * cover this page and there must be only one locked
> + * delalloc range.
> + */
> + found_start = page_start;
> + found_len = last_delalloc_end + 1 - found_start;
> + found = true;
> + } else {
> + found = btrfs_subpage_find_writer_locked(fs_info, folio,
> + delalloc_start, &found_start, &found_len);
> + }
> + if (!found)
> + break;
> + /*
> + * The subpage range covers the last sector, the delalloc range may
> + * end beyonds the page boundary, use the saved delalloc_end
> + * instead.
> + */
> + if (found_start + found_len >= page_end)
> + found_len = last_delalloc_end + 1 - found_start;
> +
> + if (ret < 0) {
At first glance, it is strange because "ret" is not set above. But, it is
executed when btrfs_run_delalloc_range() returns an error in an iteration,
for the remaining iterations...
I'd like to have a dedicated clean-up path ... but I agree it is difficult
to make such cleanup loop clean.
Flipping the if-conditions looks better? Or, adding more comments would be nice.
> + /* Cleanup the remaining locked ranges. */
> + unlock_extent(&inode->io_tree, found_start,
> + found_start + found_len - 1, NULL);
> + __unlock_for_delalloc(&inode->vfs_inode, page, found_start,
> + found_start + found_len - 1);
> + } else {
> + ret = btrfs_run_delalloc_range(inode, page, found_start,
> + found_start + found_len - 1, wbc);
Also, what happens if the first range returns "1" and a later one returns
"0"? Is it OK to override the "ret" for the case? Actually, I guess it
won't happen now because (as said in patch 5) subpage disables an inline
extent, but having an ASSERT() would be good to catch a future mistake.
> + }
> + /*
> + * Above btrfs_run_delalloc_range() may have unlocked the page,
> + * Thus for the last range, we can not touch the page anymore.
> + */
> + if (found_start + found_len >= last_delalloc_end + 1)
> + break;
> +
> + delalloc_start = found_start + found_len;
> + }
> + if (ret < 0)
> + return ret;
> +
> + if (last_delalloc_end)
> + delalloc_end = last_delalloc_end;
> + else
> + delalloc_end = page_end;
> /*
> * delalloc_end is already one less than the total length, so
> * we don't subtract one from PAGE_SIZE
> @@ -1520,7 +1582,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
> PAGE_SIZE, !ret);
> mapping_set_error(page->mapping, ret);
> }
> - unlock_page(page);
> +
> + btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
> ASSERT(ret <= 0);
> return ret;
> }
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 3c957d03324e..81b862d7ab53 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -862,6 +862,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
> void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
> struct folio *folio)
> {
> + struct btrfs_subpage *subpage = folio_get_private(folio);
> u64 folio_start = folio_pos(folio);
> u64 cur = folio_start;
>
> @@ -871,6 +872,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
> return;
> }
>
> + /* The page has no new delalloc range locked on it. Just plain unlock. */
> + if (atomic_read(&subpage->writers) == 0) {
> + folio_unlock(folio);
> + return;
> + }
> while (cur < folio_start + PAGE_SIZE) {
> u64 found_start;
> u32 found_len;
> --
> 2.45.0
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
2024-05-21 8:11 ` Naohiro Aota
@ 2024-05-21 8:45 ` Qu Wenruo
2024-05-21 11:54 ` Naohiro Aota
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2024-05-21 8:45 UTC (permalink / raw)
To: Naohiro Aota
Cc: linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
在 2024/5/21 17:41, Naohiro Aota 写道:
[...]
> Same here.
>
>> while (delalloc_start < page_end) {
>> delalloc_end = page_end;
>> if (!find_lock_delalloc_range(&inode->vfs_inode, page,
>> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>> delalloc_start = delalloc_end + 1;
>> continue;
>> }
>> -
>> - ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
>> - delalloc_end, wbc);
>> - if (ret < 0)
>> - return ret;
>> -
>> + btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
>> + min(delalloc_end, page_end) + 1 -
>> + delalloc_start);
>> + last_delalloc_end = delalloc_end;
>> delalloc_start = delalloc_end + 1;
>> }
>
> Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> following code simpler.
Mind to explain it a little more?
Did you mean something like this:
while (delalloc_start < page_end) {
/* lock all subpage delalloc range code */
}
if (!last_delalloc_end)
goto finish;
while (delalloc_start < page_end) {
/* run the delalloc ranges code* /
}
If so, I can definitely go that way.
>
>> + delalloc_start = page_start;
>> + /* Run the delalloc ranges for above locked ranges. */
>> + while (last_delalloc_end && delalloc_start < page_end) {
>> + u64 found_start;
>> + u32 found_len;
>> + bool found;
>>
>> + if (!btrfs_is_subpage(fs_info, page->mapping)) {
>> + /*
>> + * For non-subpage case, the found delalloc range must
>> + * cover this page and there must be only one locked
>> + * delalloc range.
>> + */
>> + found_start = page_start;
>> + found_len = last_delalloc_end + 1 - found_start;
>> + found = true;
>> + } else {
>> + found = btrfs_subpage_find_writer_locked(fs_info, folio,
>> + delalloc_start, &found_start, &found_len);
>> + }
>> + if (!found)
>> + break;
>> + /*
>> + * The subpage range covers the last sector, the delalloc range may
>> + * end beyonds the page boundary, use the saved delalloc_end
>> + * instead.
>> + */
>> + if (found_start + found_len >= page_end)
>> + found_len = last_delalloc_end + 1 - found_start;
>> +
>> + if (ret < 0) {
>
> At first glance, it is strange because "ret" is not set above. But, it is
> executed when btrfs_run_delalloc_range() returns an error in an iteration,
> for the remaining iterations...
>
> I'd like to have a dedicated clean-up path ... but I agree it is difficult
> to make such cleanup loop clean.
I can add an extra bool to indicate if we have any error, but overall
it's not much different.
>
> Flipping the if-conditions looks better? Or, adding more comments would be nice.
I guess that would go this path, flipping the if conditions and extra
comments.
>
>> + /* Cleanup the remaining locked ranges. */
>> + unlock_extent(&inode->io_tree, found_start,
>> + found_start + found_len - 1, NULL);
>> + __unlock_for_delalloc(&inode->vfs_inode, page, found_start,
>> + found_start + found_len - 1);
>> + } else {
>> + ret = btrfs_run_delalloc_range(inode, page, found_start,
>> + found_start + found_len - 1, wbc);
>
> Also, what happens if the first range returns "1" and a later one returns
> "0"? Is it OK to override the "ret" for the case? Actually, I guess it
> won't happen now because (as said in patch 5) subpage disables an inline
> extent, but having an ASSERT() would be good to catch a future mistake.
It's not only inline but also compression can return 1.
Thankfully for subpage, inline is disabled, meanwhile compression can
only be done for a full page aligned range (start and end are both page
aligned).
Considering you're mentioning this, I would definitely add an ASSERT()
with comments explaining this.
Thanks for the feedback!
Qu
>
>> + }
>> + /*
>> + * Above btrfs_run_delalloc_range() may have unlocked the page,
>> + * Thus for the last range, we can not touch the page anymore.
>> + */
>> + if (found_start + found_len >= last_delalloc_end + 1)
>> + break;
>> +
>> + delalloc_start = found_start + found_len;
>> + }
>> + if (ret < 0)
>> + return ret;
>> +
>> + if (last_delalloc_end)
>> + delalloc_end = last_delalloc_end;
>> + else
>> + delalloc_end = page_end;
>> /*
>> * delalloc_end is already one less than the total length, so
>> * we don't subtract one from PAGE_SIZE
>> @@ -1520,7 +1582,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
>> PAGE_SIZE, !ret);
>> mapping_set_error(page->mapping, ret);
>> }
>> - unlock_page(page);
>> +
>> + btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
>> ASSERT(ret <= 0);
>> return ret;
>> }
>> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
>> index 3c957d03324e..81b862d7ab53 100644
>> --- a/fs/btrfs/subpage.c
>> +++ b/fs/btrfs/subpage.c
>> @@ -862,6 +862,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
>> void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>> struct folio *folio)
>> {
>> + struct btrfs_subpage *subpage = folio_get_private(folio);
>> u64 folio_start = folio_pos(folio);
>> u64 cur = folio_start;
>>
>> @@ -871,6 +872,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>> return;
>> }
>>
>> + /* The page has no new delalloc range locked on it. Just plain unlock. */
>> + if (atomic_read(&subpage->writers) == 0) {
>> + folio_unlock(folio);
>> + return;
>> + }
>> while (cur < folio_start + PAGE_SIZE) {
>> u64 found_start;
>> u32 found_len;
>> --
>> 2.45.0
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
2024-05-21 8:45 ` Qu Wenruo
@ 2024-05-21 11:54 ` Naohiro Aota
2024-05-21 22:16 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Naohiro Aota @ 2024-05-21 11:54 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
>
>
> 在 2024/5/21 17:41, Naohiro Aota 写道:
> [...]
> > Same here.
> >
> > > while (delalloc_start < page_end) {
> > > delalloc_end = page_end;
> > > if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> > > @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> > > delalloc_start = delalloc_end + 1;
> > > continue;
> > > }
> > > -
> > > - ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> > > - delalloc_end, wbc);
> > > - if (ret < 0)
> > > - return ret;
> > > -
> > > + btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> > > + min(delalloc_end, page_end) + 1 -
> > > + delalloc_start);
> > > + last_delalloc_end = delalloc_end;
> > > delalloc_start = delalloc_end + 1;
> > > }
> >
> > Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> > following code simpler.
>
> Mind to explain it a little more?
>
> Did you mean something like this:
>
> while (delalloc_start < page_end) {
> /* lock all subpage delalloc range code */
> }
> if (!last_delalloc_end)
> goto finish;
> while (delalloc_start < page_end) {
> /* run the delalloc ranges code* /
> }
>
> If so, I can definitely go that way.
Yes, I meant that way. Apparently, "!last_delalloc_end" means it found no
delalloc region. So, we can just return 0 in that case without touching
"wbc->nr_to_write" as the current code does.
BTW, is this actually an overlooked error case? Is it OK to progress in
__extent_writepage() even if we don't run run_delalloc_range() ?
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
2024-05-21 11:54 ` Naohiro Aota
@ 2024-05-21 22:16 ` Qu Wenruo
2024-05-22 1:10 ` Naohiro Aota
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2024-05-21 22:16 UTC (permalink / raw)
To: Naohiro Aota, Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
在 2024/5/21 21:24, Naohiro Aota 写道:
> On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
>>
>>
>> 在 2024/5/21 17:41, Naohiro Aota 写道:
>> [...]
>>> Same here.
>>>
>>>> while (delalloc_start < page_end) {
>>>> delalloc_end = page_end;
>>>> if (!find_lock_delalloc_range(&inode->vfs_inode, page,
>>>> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>>> delalloc_start = delalloc_end + 1;
>>>> continue;
>>>> }
>>>> -
>>>> - ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
>>>> - delalloc_end, wbc);
>>>> - if (ret < 0)
>>>> - return ret;
>>>> -
>>>> + btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
>>>> + min(delalloc_end, page_end) + 1 -
>>>> + delalloc_start);
>>>> + last_delalloc_end = delalloc_end;
>>>> delalloc_start = delalloc_end + 1;
>>>> }
>>>
>>> Can we bail out on the "if (!last_delalloc_end)" case? It would make the
>>> following code simpler.
>>
>> Mind to explain it a little more?
>>
>> Did you mean something like this:
>>
>> while (delalloc_start < page_end) {
>> /* lock all subpage delalloc range code */
>> }
>> if (!last_delalloc_end)
>> goto finish;
>> while (delalloc_start < page_end) {
>> /* run the delalloc ranges code* /
>> }
>>
>> If so, I can definitely go that way.
>
> Yes, I meant that way. Apparently, "!last_delalloc_end" means it get no
> delalloc region. So, we can just return 0 in that case without touching
> "wbc->nr_to_write" as the current code does.
>
> BTW, is this actually an overlooked error case? Is it OK to progress in
> __extent_writepage() even if we don't run run_delalloc_range() ?
That's totally expected, and it would even be more common in fact.
Consider a very ordinary case like this:
0 4K 8K 12K
|/////////////|///////////////|/////////////|
When running extent_writepage() for page 0, we run delalloc range for
the whole [0, 12K) range, and created an OE for it.
Then __extent_writepage_io() add page range [0, 4k) for bio.
Then extent_writepage() for page 4K, find_lock_delalloc() would not find
any range, as previous iteration at page 0 has already created OE for
the whole [0, 12K) range.
Although we would still run __extent_writepage_io() to add page range
[4k, 8K) to the bio.
The same for page 8K.
Thanks,
Qu
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
2024-05-21 22:16 ` Qu Wenruo
@ 2024-05-22 1:10 ` Naohiro Aota
0 siblings, 0 replies; 15+ messages in thread
From: Naohiro Aota @ 2024-05-22 1:10 UTC (permalink / raw)
To: Qu Wenruo
Cc: Qu Wenruo, linux-btrfs@vger.kernel.org, Johannes Thumshirn,
josef@toxicpanda.com
On Wed, May 22, 2024 at 07:46:46AM GMT, Qu Wenruo wrote:
>
>
> 在 2024/5/21 21:24, Naohiro Aota 写道:
> > On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
> > >
> > >
> > > 在 2024/5/21 17:41, Naohiro Aota 写道:
> > > [...]
> > > > Same here.
> > > >
> > > > > while (delalloc_start < page_end) {
> > > > > delalloc_end = page_end;
> > > > > if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> > > > > @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> > > > > delalloc_start = delalloc_end + 1;
> > > > > continue;
> > > > > }
> > > > > -
> > > > > - ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> > > > > - delalloc_end, wbc);
> > > > > - if (ret < 0)
> > > > > - return ret;
> > > > > -
> > > > > + btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> > > > > + min(delalloc_end, page_end) + 1 -
> > > > > + delalloc_start);
> > > > > + last_delalloc_end = delalloc_end;
> > > > > delalloc_start = delalloc_end + 1;
> > > > > }
> > > >
> > > > Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> > > > following code simpler.
> > >
> > > Mind to explain it a little more?
> > >
> > > Did you mean something like this:
> > >
> > > while (delalloc_start < page_end) {
> > > /* lock all subpage delalloc range code */
> > > }
> > > if (!last_delalloc_end)
> > > goto finish;
> > > while (delalloc_start < page_end) {
> > > /* run the delalloc ranges code* /
> > > }
> > >
> > > If so, I can definitely go that way.
> >
> > Yes, I meant that way. Apparently, "!last_delalloc_end" means it get no
> > delalloc region. So, we can just return 0 in that case without touching
> > "wbc->nr_to_write" as the current code does.
> >
> > BTW, is this actually an overlooked error case? Is it OK to progress in
> > __extent_writepage() even if we don't run run_delalloc_range() ?
>
> That's totally expected, and it would even be more common in fact.
>
> Consider a very ordinary case like this:
>
> 0 4K 8K 12K
> |/////////////|///////////////|/////////////|
>
> When running extent_writepage() for page 0, we run delalloc range for
> the whole [0, 12K) range, and created an OE for it.
> Then __extent_writepage_io() add page range [0, 4k) for bio.
>
> Then extent_writepage() for page 4K, find_lock_delalloc() would not find
> any range, as previous iteration at page 0 has already created OE for
> the whole [0, 12K) range.
>
> Although we would still run __extent_writepage_io() to add page range
> [4k, 8K) to the bio.
>
> The same for page 8K.
>
> Thanks,
> Qu
Ah, yes, that's true. I forgot about the case of the following pages. Thank
you for your explanation.
Regards,
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v5 4/5] btrfs: do not clear page dirty inside extent_write_locked_range()
2024-05-18 5:07 [PATCH v5 0/5] btrfs: subpage + zoned fixes Qu Wenruo
` (2 preceding siblings ...)
2024-05-18 5:07 ` [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc() Qu Wenruo
@ 2024-05-18 5:07 ` Qu Wenruo
2024-05-18 5:07 ` [PATCH v5 5/5] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
4 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2024-05-18 5:07 UTC (permalink / raw)
To: linux-btrfs; +Cc: Johannes.Thumshirn, josef, Johannes Thumshirn
[BUG]
For subpage + zoned case, the following workload can lead to rsv data
leak at unmount time:
# mkfs.btrfs -f -s 4k $dev
# mount $dev $mnt
# fsstress -w -n 8 -d $mnt -s 1709539240
0/0: fiemap - no filename
0/1: copyrange read - no filename
0/2: write - no filename
0/3: rename - no source filename
0/4: creat f0 x:0 0 0
0/4: creat add id=0,parent=-1
0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
# umount $mnt
The dmesg would include the following rsv leak detection warning (all
call trace skipped):
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info DATA has 268218368 free, is not full
BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
append size of 64K, and the system has 64K page size.
[CAUSE]
I have added several trace_printk() to show the events (header skipped):
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The above lines shows our buffered write has dirtied 3 pages of inode
259 of root 5:
704K 768K 832K 896K
I |////I/////////////////I///////////| I
756K 868K
|///| is the dirtied range using subpage bitmaps, and 'I' is the page
boundary.
Meanwhile all three pages (704K, 768K, 832K) have their PageDirty
flag set.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
Then direct IO write starts, since the range [680K, 780K) covers the
beginning part of the above dirty range, btrfs needs to writeback the
two pages at 704K and 768K.
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
Now the above 2 lines shows that, we're writing back for dirty range
[756K, 756K + 64K).
We only writeback 64K because the zoned device has max zone append size
as 64K.
> extent_write_locked_range: r/i=5/259 clear dirty for page=786432
!!! The above line shows the root cause. !!!
We're calling clear_page_dirty_for_io() inside extent_write_locked_range(),
for the page 768K.
This is because extent_write_locked_range() can go beyond the current
locked page, here we hit the page at 768K and clear its page dirty flag.
In fact this would lead to the desync between subpage dirty and page
dirty flags.
We have the page dirty flag cleared, but the subpage range [820K, 832K)
is still dirty.
After the writeback of range [756K, 820K), the dirty flags looks like
this, as page 768K no longer has dirty flag set.
704K 768K 832K 896K
I I | I/////////////| I
820K 868K
This means we will no longer writeback range [820K, 832K), thus the
reserved data/metadata space would never be properly released.
> extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432
Now even if we try to start writeback for page 768K, since the
page is not dirty, we completely skip it at extent_write_cache_pages()
time.
> btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0
Now the direct IO finished.
> cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
> extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864
Now we writeback the remaining dirty range, which is [832K, 868K).
Causing the range [820K, 832K) never be submitted, thus leaking the
reserved space.
This bug only affects subpage and zoned case.
For non-subpage and zoned case, we have exactly one sector for each page,
thus no such partial dirty cases.
For subpage and non-zoned case, we never go into run_delalloc_cow(), and
normally all the dirty subpage ranges would be properly submitted inside
__extent_writepage_io().
[FIX]
Just do not clear the page dirty at all inside
extent_write_locked_range().
As __extent_writepage_io() would do a more accurate, subpage compatible
clear for page and subpage dirty flags anyway.
Now the correct trace would look like this:
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The page dirty part is still the same 3 pages.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
And the writeback for the first 64K is still correct.
> cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
> extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152
Now with the fix, we can properly writeback the range [820K, 832K), and
properly release the reserved data/metadata space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b6dc9308105d..08556db4b4de 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2313,10 +2313,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
page = find_get_page(mapping, cur >> PAGE_SHIFT);
ASSERT(PageLocked(page));
- if (pages_dirty && page != locked_page) {
+ if (pages_dirty && page != locked_page)
ASSERT(PageDirty(page));
- clear_page_dirty_for_io(page);
- }
ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
&bio_ctrl, i_size, &nr);
--
2.45.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v5 5/5] btrfs: make extent_write_locked_range() to handle subpage writeback correctly
2024-05-18 5:07 [PATCH v5 0/5] btrfs: subpage + zoned fixes Qu Wenruo
` (3 preceding siblings ...)
2024-05-18 5:07 ` [PATCH v5 4/5] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
@ 2024-05-18 5:07 ` Qu Wenruo
2024-05-21 7:13 ` Naohiro Aota
4 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2024-05-18 5:07 UTC (permalink / raw)
To: linux-btrfs; +Cc: Johannes.Thumshirn, josef, Johannes Thumshirn
When extent_write_locked_range() generated an inline extent, it would
set and finish the writeback for the whole page.
Although currently it's safe since subpage disables inline creation,
for the sake of consistency, let it go with subpage helpers to set and
clear the writeback flags.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 08556db4b4de..7275bd919a3e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2309,6 +2309,7 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
u32 cur_len = cur_end + 1 - cur;
struct page *page;
+ struct folio *folio;
int nr = 0;
page = find_get_page(mapping, cur >> PAGE_SHIFT);
@@ -2323,8 +2324,9 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
/* Make sure the mapping tag for page dirty gets cleared. */
if (nr == 0) {
- set_page_writeback(page);
- end_page_writeback(page);
+ folio = page_folio(page);
+ btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+ btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
}
if (ret) {
btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
--
2.45.0
^ permalink raw reply related [flat|nested] 15+ messages in thread