* [PATCH v2] Btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
@ 2023-08-01 16:28 Chris Mason
From: Chris Mason @ 2023-08-01 16:28 UTC (permalink / raw)
To: linux-btrfs, dsterba, josef, hch
[Note: I dropped the RFC because I can now trigger this on Linus' kernels,
and I think we need to send something to stable as well.]
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.
Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:
- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?
The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs and other sadness.
This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.
The code has been cleaned up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think Christoph's
commit ended up exposing the bug.
The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when Jens
is doing it for us.
Sample repro; note that you'll need to change the path to the bdi and the device:
SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs
btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync;sync;sync
echo 4 > /proc/sys/vm/drop_caches
echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
while(true) ; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
CC: stable@vger.kernel.org # 6.4
Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
---
fs/btrfs/extent_io.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
v1 -> v2: update the comments, add repro to commit log
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6b40189a1a3e..c25115592d99 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -849,7 +849,30 @@ static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl,
size -= len;
pg_offset += len;
disk_bytenr += len;
- bio_ctrl->len_to_oe_boundary -= len;
+
+ /*
+ * len_to_oe_boundary defaults to U32_MAX, which isn't page or
+ * sector aligned. alloc_new_bio() then sets it to the end of
+ * our ordered extent for writes into zoned devices.
+ *
+ * When len_to_oe_boundary is tracking an ordered extent, we
+ * trust the ordered extent code to align things properly, and
+ * the check above to cap our write to the ordered extent
+ * boundary is correct.
+ *
+ * When len_to_oe_boundary is U32_MAX, the cap above would
+ * result in a 4095 byte IO for the last page riiiiight before
+ * we hit the bio limit of UINT_MAX. bio_add_page() has all
+ * the checks required to make sure we don't overflow the bio,
+ * and we should just ignore len_to_oe_boundary completely
+ * unless we're using it to track an ordered extent.
+ *
+ * It's pretty hard to make a bio sized U32_MAX, but it can
+ * happen when the page cache is able to feed us contiguous
+ * pages for large extents.
+ */
+ if (bio_ctrl->len_to_oe_boundary != U32_MAX)
+ bio_ctrl->len_to_oe_boundary -= len;
/* Ordered extent boundary: move on to a new bio. */
if (bio_ctrl->len_to_oe_boundary == 0)
--
2.34.1
* Re: [PATCH v2] Btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
From: Christoph Hellwig @ 2023-08-01 16:42 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-btrfs, dsterba, josef, hch
On Tue, Aug 01, 2023 at 09:28:28AM -0700, Chris Mason wrote:
> + * When len_to_oe_boundary is U32_MAX, the cap above would
> + * result in a 4095 byte IO for the last page riiiiight before
> + * we hit the bio limit of UINT_MAX. bio_add_page() has all
> + * the checks required to make sure we don't overflow the bio,
> + * and we should just ignore len_to_oe_boundary completely
> + * unless we're using it to track an ordered extent.
> + *
> + * It's pretty hard to make a bio sized U32_MAX, but it can
> + * happen when the page cache is able to feed us contiguous
> + * pages for large extents.
> + */
> + if (bio_ctrl->len_to_oe_boundary != U32_MAX)
So I don't know the btrfs extent allocator, but what is the maximum
size of an extent? Could there be an U32_MAX sized extent that could
be hitting this? In other words, what about adding an explicit flag
to bio_ctrl when to check the boundary, and just don't bother with
len_to_oe_boundary at all if it isn't set.
* Re: [PATCH v2] Btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
From: Chris Mason @ 2023-08-01 17:29 UTC (permalink / raw)
To: Christoph Hellwig, Chris Mason; +Cc: linux-btrfs, dsterba, josef
On 8/1/23 12:42 PM, Christoph Hellwig wrote:
> On Tue, Aug 01, 2023 at 09:28:28AM -0700, Chris Mason wrote:
>> + * When len_to_oe_boundary is U32_MAX, the cap above would
>> + * result in a 4095 byte IO for the last page riiiiight before
>> + * we hit the bio limit of UINT_MAX. bio_add_page() has all
>> + * the checks required to make sure we don't overflow the bio,
>> + * and we should just ignore len_to_oe_boundary completely
>> + * unless we're using it to track an ordered extent.
>> + *
>> + * It's pretty hard to make a bio sized U32_MAX, but it can
>> + * happen when the page cache is able to feed us contiguous
>> + * pages for large extents.
>> + */
>> + if (bio_ctrl->len_to_oe_boundary != U32_MAX)
>
> So I don't know the btrfs extent allocator, but what is the maximum
> size of an extent?
The actual max extent size is limited to one chunk, which is currently
at most 10G, and then we round down for various reasons. The disk format is
all u64s, so it really comes down to checks like this one in
find_lock_delalloc_range():
u64 max_bytes = fs_info ? fs_info->max_extent_size : BTRFS_MAX_EXTENT_SIZE;
(BTRFS_MAX_EXTENT_SIZE is 128MB)
In practice we're less aggressive than XFS about trying really hard to
lay out big extents on disk, and the max size we'll find is what we
collect via delayed allocation.
So, taking the swapfile from my repro:
item 5 key (257 INODE_REF 256) itemoff 15857 itemsize 18
inode ref index 2 namelen 8 name: swapfile
item 6 key (257 EXTENT_DATA 0) itemoff 15804 itemsize 53
extent data disk byte 1104150528 nr 134217728
extent data offset 0 nr 134217728 ram 134217728
extent compression(none)
item 7 key (257 EXTENT_DATA 134217728) itemoff 15751 itemsize 53
extent data disk byte 1238368256 nr 134217728
extent data offset 0 nr 134217728 ram 134217728
extent compression(none)
item 8 key (257 EXTENT_DATA 268435456) itemoff 15698 itemsize 53
extent data disk byte 1372585984 nr 134217728
extent data offset 0 nr 134217728 ram 134217728
extent compression(none)
item 9 key (257 EXTENT_DATA 402653184) itemoff 15645 itemsize 53
extent data disk byte 1506803712 nr 134217728
extent data offset 0 nr 134217728 ram 134217728
extent compression(none)
But these are all contig on disk:
# filefrag -v /btrfs/swapvol/swapfile
Filesystem type is: 9123683e
File size of /btrfs/swapvol/swapfile is 8589934592 (2097152 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  2097151:     269568..   2366719: 2097152:             last,eof
/btrfs/swapvol/swapfile: 1 extent found
> Could there be an U32_MAX sized extent that could
> be hitting this?
For writes, I didn't actually audit the way we're setting
len_to_oe_boundary vs U32_MAX. BTRFS_MAX_EXTENT_SIZE should save us
from any single wild extents.
But especially for reads, we'll end up collapsing contiguous areas on
disk into a single bio even if they span extents.
> In other words, what about adding an explicit flag
> to bio_ctrl when to check the boundary, and just don't bother with
> len_to_oe_boundary at all if it isn't set.
>
Yeah, I'm just using (len_to_oe_boundary == U32_MAX) as the explicit
flag, but your idea above is basically where I started. I actually
started at len_to_oe_boundary = ROUND_DOWN(U32_MAX, PAGE_SIZE), but then
I realized I'd have to actually think about sub-page blocksizes and went
to the flag idea.
Especially since the code is going away, I'd prefer a minimal change and
a big comment, but I don't actually have strong opinions.
-chris
* Re: [PATCH v2] Btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
From: Qu Wenruo @ 2023-08-01 22:34 UTC (permalink / raw)
To: Chris Mason, linux-btrfs, dsterba, josef, hch
On 2023/8/2 00:28, Chris Mason wrote:
> [Note: I dropped the RFC because I can now trigger this on Linus' kernels,
> and I think we need to send something to stable as well.]
>
> bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
> as we submit bios for writes. Every time we add a page to the bio, we
> decrement those bytes from len_to_oe_boundary, and then we submit the
> bio if we happen to hit zero.
>
> Most of the time, len_to_oe_boundary gets set to U32_MAX.
> submit_extent_page() adds pages into our bio, and the size of the bio
> ends up limited by:
>
> - Are we contiguous on disk?
> - Does bio_add_page() allow us to stuff more in?
> - is len_to_oe_boundary > 0?
>
> The len_to_oe_boundary math starts with U32_MAX, which isn't page or
> sector aligned, and subtracts from it until it hits zero. In the
> non-zoned case, the last IO we submit before we hit zero is going to be
> unaligned, triggering BUGs and other sadness.
>
> This is hard to trigger because bio_add_page() isn't going to make a bio
> of U32_MAX size unless you give it a perfect set of pages and fully
> contiguous extents on disk. We can hit it pretty reliably while making
> large swapfiles during provisioning because the machine is freshly
> booted, mostly idle, and the disk is freshly formatted. It's also
> possible to trigger with reads when read_ahead_kb is set to 4GB.
>
> The code has been cleaned up and shifted around a few times, but this flaw
> has been lurking since the counter was added. I think Christoph's
> commit ended up exposing the bug.
>
> The fix used here is to skip doing math on len_to_oe_boundary unless
> we've changed it from the default U32_MAX value. bio_add_page() is the
> real limit we want, and there's no reason to do extra math when Jens
> is doing it for us.
>
> Sample repro; note that you'll need to change the path to the bdi and the device:
>
> SUBVOL=/btrfs/swapvol
> SWAPFILE=$SUBVOL/swapfile
> SZMB=8192
>
> mkfs.btrfs -f /dev/vdb
> mount /dev/vdb /btrfs
>
> btrfs subvol create $SUBVOL
> chattr +C $SUBVOL
> dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
> sync;sync;sync
>
> echo 4 > /proc/sys/vm/drop_caches
>
> echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
>
> while(true) ; do
> echo 1 > /proc/sys/vm/drop_caches
> echo 1 > /proc/sys/vm/drop_caches
> dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
> done
>
> Signed-off-by: Chris Mason <clm@fb.com>
> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
> CC: stable@vger.kernel.org # 6.4
> Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/extent_io.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> v1 -> v2: update the comments, add repro to commit log
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6b40189a1a3e..c25115592d99 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -849,7 +849,30 @@ static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl,
> size -= len;
> pg_offset += len;
> disk_bytenr += len;
> - bio_ctrl->len_to_oe_boundary -= len;
> +
> + /*
> + * len_to_oe_boundary defaults to U32_MAX, which isn't page or
> + * sector aligned. alloc_new_bio() then sets it to the end of
> + * our ordered extent for writes into zoned devices.
> + *
> + * When len_to_oe_boundary is tracking an ordered extent, we
> + * trust the ordered extent code to align things properly, and
> + * the check above to cap our write to the ordered extent
> + * boundary is correct.
> + *
> + * When len_to_oe_boundary is U32_MAX, the cap above would
> + * result in a 4095 byte IO for the last page riiiiight before
> + * we hit the bio limit of UINT_MAX. bio_add_page() has all
> + * the checks required to make sure we don't overflow the bio,
> + * and we should just ignore len_to_oe_boundary completely
> + * unless we're using it to track an ordered extent.
> + *
> + * It's pretty hard to make a bio sized U32_MAX, but it can
> + * happen when the page cache is able to feed us contiguous
> + * pages for large extents.
> + */
> + if (bio_ctrl->len_to_oe_boundary != U32_MAX)
> + bio_ctrl->len_to_oe_boundary -= len;
>
> /* Ordered extent boundary: move on to a new bio. */
> if (bio_ctrl->len_to_oe_boundary == 0)
* Re: [PATCH v2] Btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
From: Christoph Hellwig @ 2023-08-02 9:35 UTC (permalink / raw)
To: Chris Mason; +Cc: Christoph Hellwig, Chris Mason, linux-btrfs, dsterba, josef
On Tue, Aug 01, 2023 at 01:29:51PM -0400, Chris Mason wrote:
> The actual max extent size is limited to one chunk, which is currently
> at most 10G, and then we round down for various reasons.
Ok.
> For writes, I didn't actually audit the way we're setting
> len_to_oe_boundary vs U32_MAX. BTRFS_MAX_EXTENT_SIZE should save us
> from any single wild extents.
At least in older kernels, U32_MAX is the default, as we're only
looking up the ordered extent for zoned writes. This is changing
right now, though.
> Yeah, I'm just using (len_to_oe_boundary == U32_MAX) as the explicit
> flag, but your idea above is basically where I started. I actually
> started at len_to_oe_boundary = ROUND_DOWN(U32_MAX, PAGE_SIZE), but then
> I realized I'd have to actually think about sub-page blocksizes and went
> to the flag idea.
>
> Especially since the code is going away, I'd prefer a minimal change and
> a big comment, but I don't actually have strong opinions.
Ok:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v2] Btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
From: David Sterba @ 2023-08-17 11:38 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-btrfs, dsterba, josef, hch
On Tue, Aug 01, 2023 at 09:28:28AM -0700, Chris Mason wrote:
> [Note: I dropped the RFC because I can now trigger this on Linus' kernels,
> and I think we need to send something to stable as well.]
>
> bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
> as we submit bios for writes. Every time we add a page to the bio, we
> decrement those bytes from len_to_oe_boundary, and then we submit the
> bio if we happen to hit zero.
>
> Most of the time, len_to_oe_boundary gets set to U32_MAX.
> submit_extent_page() adds pages into our bio, and the size of the bio
> ends up limited by:
>
> - Are we contiguous on disk?
> - Does bio_add_page() allow us to stuff more in?
> - is len_to_oe_boundary > 0?
>
> The len_to_oe_boundary math starts with U32_MAX, which isn't page or
> sector aligned, and subtracts from it until it hits zero. In the
> non-zoned case, the last IO we submit before we hit zero is going to be
> unaligned, triggering BUGs and other sadness.
>
> This is hard to trigger because bio_add_page() isn't going to make a bio
> of U32_MAX size unless you give it a perfect set of pages and fully
> contiguous extents on disk. We can hit it pretty reliably while making
> large swapfiles during provisioning because the machine is freshly
> booted, mostly idle, and the disk is freshly formatted. It's also
> possible to trigger with reads when read_ahead_kb is set to 4GB.
>
> The code has been cleaned up and shifted around a few times, but this flaw
> has been lurking since the counter was added. I think Christoph's
> commit ended up exposing the bug.
>
> The fix used here is to skip doing math on len_to_oe_boundary unless
> we've changed it from the default U32_MAX value. bio_add_page() is the
> real limit we want, and there's no reason to do extra math when Jens
> is doing it for us.
>
> Sample repro; note that you'll need to change the path to the bdi and the device:
>
> SUBVOL=/btrfs/swapvol
> SWAPFILE=$SUBVOL/swapfile
> SZMB=8192
>
> mkfs.btrfs -f /dev/vdb
> mount /dev/vdb /btrfs
>
> btrfs subvol create $SUBVOL
> chattr +C $SUBVOL
> dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
> sync;sync;sync
>
> echo 4 > /proc/sys/vm/drop_caches
>
> echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
>
> while(true) ; do
> echo 1 > /proc/sys/vm/drop_caches
> echo 1 > /proc/sys/vm/drop_caches
> dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
> done
>
> Signed-off-by: Chris Mason <clm@fb.com>
> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
> CC: stable@vger.kernel.org # 6.4
> Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
Added to misc-next, thanks.