* [PATCH] btrfs: adjust overcommit logic when very close to full
@ 2023-09-18 19:27 Josef Bacik
2023-09-18 20:14 ` Boris Burkov
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Josef Bacik @ 2023-09-18 19:27 UTC (permalink / raw)
To: linux-btrfs, kernel-team
A user reported some unpleasant behavior with very small file systems.
The reproducer is this
mkfs.btrfs -f -m single -b 8g /dev/vdb
mount /dev/vdb /mnt/test
dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20
This will result in usage that looks like this
Overall:
    Device size:                   8.00GiB
    Device allocated:              8.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Device slack:                  2.00GiB
    Used:                          5.47GiB
    Free (estimated):              2.52GiB      (min: 2.52GiB)
    Free (statfs, df):               0.00B
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:                5.50MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
   /dev/vdb        7.99GiB

Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
   /dev/vdb        8.00MiB

System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
   /dev/vdb        4.00MiB

Unallocated:
   /dev/vdb        1.00MiB
As you can see we've gotten ourselves quite full with metadata, with all
of the disk being allocated for data.
On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here. Generally speaking data
reservations result in chunk allocations as we assume reservation ==
actual use for data. This means at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.
Address this by adjusting the overcommit logic. Simply put we need to
take away 1 chunk from the available chunk space in case of a data
reservation. This will allow us to stop overcommitting before we
potentially lose this space to a data allocation. With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d7e8cd4f140c..7aa53058d893 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -365,6 +365,23 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 	factor = btrfs_bg_type_to_factor(profile);
 	avail = div_u64(avail, factor);
 
+	/*
+	 * Since data allocations immediately use block groups as part of the
+	 * reservation, because we assume that data reservations will == actual
+	 * usage, we could potentially overcommit and then immediately have that
+	 * available space used by a data allocation, which could put us in a
+	 * bind when we get close to filling the file system.
+	 *
+	 * To handle this simply remove 1G (which is our current maximum chunk
+	 * allocation size) from the available space. If we are relatively
+	 * empty this won't affect our ability to overcommit much, and if we're
+	 * very close to full it'll keep us from getting into a position where
+	 * we've given ourselves very little metadata wiggle room.
+	 */
+	if (avail < SZ_1G)
+		return 0;
+	avail -= SZ_1G;
+
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
 	 * 1/2th of the space. If we can flush, don't let us overcommit
--
2.41.0
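
[Editor's illustration, not part of the thread: a tiny stand-alone C sketch of
the clamp the hunk above adds, fed with numbers from the report. The helper
name and the user-space framing are assumptions made for demonstration only.]

  #include <stdio.h>
  #include <stdint.h>

  #define SZ_1G (1024ULL * 1024 * 1024)

  /* Hold one maximum-size (1G) chunk back from the overcommittable space. */
  static uint64_t clamp_available(uint64_t avail)
  {
          if (avail < SZ_1G)
                  return 0;
          return avail - SZ_1G;
  }

  int main(void)
  {
          /* Near-full case from the report: ~1MiB left unallocated -> 0. */
          printf("%llu\n", (unsigned long long)clamp_available(1ULL << 20));
          /* A mostly empty 8GiB fs can still overcommit plenty (~6GiB). */
          printf("%llu\n", (unsigned long long)clamp_available(7ULL << 30));
          return 0;
  }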

^ permalink raw reply related [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-18 19:27 [PATCH] btrfs: adjust overcommit logic when very close to full Josef Bacik
@ 2023-09-18 20:14 ` Boris Burkov
  2023-09-20 13:59   ` Josef Bacik
  2023-09-18 21:29 ` David Sterba
  2023-09-20 19:05 ` David Sterba
  2 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2023-09-18 20:14 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> A user reported some unpleasant behavior with very small file systems.
> [...]
> Address this by adjusting the overcommit logic. Simply put we need to
> take away 1 chunk from the available chunk space in case of a data
> reservation. This will allow us to stop overcommitting before we
> potentially lose this space to a data allocation. With this fix in
> place we properly allocate a metadata chunk before we're completely
> full, allowing for enough slack space in metadata.

LGTM, this should help and I've been kicking around the same idea in my
head for a while.

I do think this is kind of a band-aid, though. It isn't hard to imagine
that you allocate data chunks up to the 1G, then allocate a metadata
chunk, then fragment/under-utilize the data to the point that you
actually fill up the metadata and get right back to this same point.

Long term, I think we still need more/smarter reclaim, but this should
be a good steam valve for the simple cases where we deterministically
gobble up all the unallocated space for data.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Boris Burkov <boris@bur.io>

> [...]

^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-18 20:14 ` Boris Burkov
@ 2023-09-20 13:59   ` Josef Bacik
  2023-09-20 19:02     ` David Sterba
  0 siblings, 1 reply; 10+ messages in thread
From: Josef Bacik @ 2023-09-20 13:59 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Mon, Sep 18, 2023 at 01:14:41PM -0700, Boris Burkov wrote:
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > A user reported some unpleasant behavior with very small file systems.
> > [...]
>
> LGTM, this should help and I've been kicking around the same idea in my
> head for a while.
>
> I do think this is kind of a band-aid, though. It isn't hard to imagine
> that you allocate data chunks up to the 1G, then allocate a metadata
> chunk, then fragment/under-utilize the data to the point that you
> actually fill up the metadata and get right back to this same point.
>
> Long term, I think we still need more/smarter reclaim, but this should
> be a good steam valve for the simple cases where we deterministically
> gobble up all the unallocated space for data.

This is definitely a bit of a bandaid, because we can have any number of
things allocate a chunk at any given time, however this is more of a
concern for small file systems where we only have the initial 8mib
metadata block group.

We spoke offline, but the long term fix here is to have a chunk
reservation system and use that in overcommit so we never have to worry
about suddenly not being able to allocate a chunk. Then if we need to
revoke that reservation we can force flush everything to get us under
the overcommit threshold, and then disable overcommit because we'll have
allocated that chunk.

For now this fixes the problem with the least surprise. Thanks,

Josef

^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-20 13:59   ` Josef Bacik
@ 2023-09-20 19:02     ` David Sterba
  0 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2023-09-20 19:02 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Boris Burkov, linux-btrfs, kernel-team

On Wed, Sep 20, 2023 at 09:59:23AM -0400, Josef Bacik wrote:
> On Mon, Sep 18, 2023 at 01:14:41PM -0700, Boris Burkov wrote:
> > [...]
>
> This is definitely a bit of a bandaid, because we can have any number of
> things allocate a chunk at any given time, however this is more of a
> concern for small file systems where we only have the initial 8mib
> metadata block group.

Is it really 8M for metadata? The default mkfs creates something like
this:

  Filesystem size:    4.00GiB
  Block group profiles:
    Data:             single            8.00MiB
    Metadata:         DUP             256.00MiB
    System:           DUP               8.00MiB

^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-18 19:27 [PATCH] btrfs: adjust overcommit logic when very close to full Josef Bacik
  2023-09-18 20:14 ` Boris Burkov
@ 2023-09-18 21:29 ` David Sterba
  2023-09-20 14:01   ` Josef Bacik
  2023-09-20 19:05 ` David Sterba
  2 siblings, 1 reply; 10+ messages in thread
From: David Sterba @ 2023-09-18 21:29 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> [...]
> +	 * To handle this simply remove 1G (which is our current maximum chunk
> +	 * allocation size) from the available space. If we are relatively
> +	 * empty this won't affect our ability to overcommit much, and if we're
> +	 * very close to full it'll keep us from getting into a position where
> +	 * we've given ourselves very little metadata wiggle room.
> +	 */
> +	if (avail < SZ_1G)
> +		return 0;
> +	avail -= SZ_1G;

Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
chunk_size? Or at least use a named constant, similar to
BTRFS_MAX_DATA_CHUNK_SIZE.

^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-18 21:29 ` David Sterba
@ 2023-09-20 14:01   ` Josef Bacik
  2023-09-20 14:04     ` David Sterba
  0 siblings, 1 reply; 10+ messages in thread
From: Josef Bacik @ 2023-09-20 14:01 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs, kernel-team

On Mon, Sep 18, 2023 at 11:29:50PM +0200, David Sterba wrote:
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
> > +	if (avail < SZ_1G)
> > +		return 0;
> > +	avail -= SZ_1G;
>
> Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
> chunk_size? Or at least use a named constant, similar to
> BTRFS_MAX_DATA_CHUNK_SIZE.

I originally had this, but we still have this logic in the chunk
allocator

	/* Stripe size should not go beyond 1G. */
	ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);

max_chunk_size tends to be larger than this, BTRFS_MAX_DATA_CHUNK_SIZE
is 10G.

We need to go back and clean up the logic to not have the SZ_1G here,
rename it to BTRFS_MAX_CHUNK_SIZE or something like that, and then
update my patch to use this.

I opted for the cleanest fix for now, the cleanups can come after.
Alternatively I can do the cleanups and do the fix, it's up to you.
Thanks,

Josef

^ permalink raw reply [flat|nested] 10+ messages in thread
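
[Editor's illustration, not part of the thread: a rough sketch of the rename
Josef describes above. The constant name BTRFS_MAX_CHUNK_SIZE and its
definition site are assumptions; the surrounding lines mirror the code quoted
in this thread.]

  /* Name the 1G chunk-size cap (definition site assumed). */
  #define BTRFS_MAX_CHUNK_SIZE	SZ_1G

  /* Chunk allocator: stripe size should not go beyond the chunk cap. */
  ctl->stripe_size = min_t(u64, ctl->stripe_size, BTRFS_MAX_CHUNK_SIZE);

  /* Overcommit: hold one maximum-size chunk back from the available space. */
  if (avail < BTRFS_MAX_CHUNK_SIZE)
          return 0;
  avail -= BTRFS_MAX_CHUNK_SIZE;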

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-20 14:01   ` Josef Bacik
@ 2023-09-20 14:04     ` David Sterba
  0 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2023-09-20 14:04 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Wed, Sep 20, 2023 at 10:01:47AM -0400, Josef Bacik wrote:
> On Mon, Sep 18, 2023 at 11:29:50PM +0200, David Sterba wrote:
> > [...]
> > Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
> > chunk_size? Or at least use a named constant, similar to
> > BTRFS_MAX_DATA_CHUNK_SIZE.
>
> I originally had this, but we still have this logic in the chunk
> allocator
>
> 	/* Stripe size should not go beyond 1G. */
> 	ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);
>
> max_chunk_size tends to be larger than this, BTRFS_MAX_DATA_CHUNK_SIZE
> is 10G.
>
> We need to go back and clean up the logic to not have the SZ_1G here,
> rename it to BTRFS_MAX_CHUNK_SIZE or something like that, and then
> update my patch to use this.
>
> I opted for the cleanest fix for now, the cleanups can come after.
> Alternatively I can do the cleanups and do the fix, it's up to you.

For a quick fix it's ok, at least it's commented and not just a random
magic constant. The cleanups can be done later. Thanks.

^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-18 19:27 [PATCH] btrfs: adjust overcommit logic when very close to full Josef Bacik
  2023-09-18 20:14 ` Boris Burkov
  2023-09-18 21:29 ` David Sterba
@ 2023-09-20 19:05 ` David Sterba
  2023-09-21 22:50   ` Filipe Manana
  2 siblings, 1 reply; 10+ messages in thread
From: David Sterba @ 2023-09-20 19:05 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> A user reported some unpleasant behavior with very small file systems.
> [...]
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Added to misc-next, thanks.

^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-20 19:05 ` David Sterba
@ 2023-09-21 22:50   ` Filipe Manana
  2023-09-22 10:25     ` David Sterba
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2023-09-21 22:50 UTC (permalink / raw)
  To: dsterba; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Wed, Sep 20, 2023 at 11:44 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > A user reported some unpleasant behavior with very small file systems.
> > [...]
>
> Added to misc-next, thanks.

So after this change, at least 2 test cases from fstests are failing
with -ENOSPC on misc-next:

$ ./check btrfs/156 btrfs/177
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian0 6.6.0-rc2-btrfs-next-138+ #1 SMP PREEMPT_DYNAMIC Thu Sep 21 17:58:48 WEST 2023
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/156 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad)
    --- tests/btrfs/156.out	2020-06-10 19:29:03.818519162 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad	2023-09-21 23:41:49.911844058 +0100
    @@ -1,2 +1,6 @@
     QA output created by 156
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Silence is golden
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/156.out /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad' to see the entire diff)
btrfs/177 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad)
    --- tests/btrfs/177.out	2023-03-02 21:47:53.872609330 +0000
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad	2023-09-21 23:41:57.200117690 +0100
    @@ -1,4 +1,8 @@
     QA output created by 177
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Resized to 3221225472
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Text file busy
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/177.out /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad' to see the entire diff)
Ran: btrfs/156 btrfs/177
Failures: btrfs/156 btrfs/177
Failed 2 of 2 tests

The tests pass again after reverting this change.

Thanks.

^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: adjust overcommit logic when very close to full
  2023-09-21 22:50   ` Filipe Manana
@ 2023-09-22 10:25     ` David Sterba
  0 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2023-09-22 10:25 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Thu, Sep 21, 2023 at 11:50:24PM +0100, Filipe Manana wrote:
> On Wed, Sep 20, 2023 at 11:44 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > > [...]
> >
> > Added to misc-next, thanks.
>
> So after this change, at least 2 test cases from fstests are failing
> with -ENOSPC on misc-next:
>
> $ ./check btrfs/156 btrfs/177

Thanks, I've removed the patch from misc-next for now.

^ permalink raw reply [flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-09-22 10:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2023-09-18 19:27 [PATCH] btrfs: adjust overcommit logic when very close to full Josef Bacik
2023-09-18 20:14 ` Boris Burkov
2023-09-20 13:59   ` Josef Bacik
2023-09-20 19:02     ` David Sterba
2023-09-18 21:29 ` David Sterba
2023-09-20 14:01   ` Josef Bacik
2023-09-20 14:04     ` David Sterba
2023-09-20 19:05 ` David Sterba
2023-09-21 22:50   ` Filipe Manana
2023-09-22 10:25     ` David Sterba