* fix for selecting a zone with active GC I/O for GC
@ 2025-10-17  6:07 Christoph Hellwig
  2025-10-17  6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
  2025-10-17  6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
  0 siblings, 2 replies; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-17  6:07 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

Hi all,

the first patch in the series adds a counter to ensure that we never
select a zone for which there is active GC I/O for GC again, as we'd
try to copy the same blocks again, only to see the mapping had changed
in I/O completion to discard that I/O.  The second one documents a
non-obvious version of that GC race detection that took us way too
long to track down.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-17  6:07 fix for selecting a zone with active GC I/O for GC Christoph Hellwig
@ 2025-10-17  6:07 ` Christoph Hellwig
  2025-10-17 12:37   ` Carlos Maiolino
                     ` (2 more replies)
  2025-10-17  6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
  1 sibling, 3 replies; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-17  6:07 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

When we are picking a zone for gc it might already be in the pipeline
which can lead to us moving the same data twice resulting in in write
amplification and a very unfortunate case where keep on garbage
collecting the zone we just filled with migrated data stopping all
forward progress.

Fix this by introducing a count of on-going GC operations on a zone, so
and skip any zone with ongoing GC when picking a new victim.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_rtgroup.h |  6 ++++++
 fs/xfs/xfs_zone_gc.c        | 27 +++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index d36a6ae0abe5..d4fcf591e63d 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -50,6 +50,12 @@ struct xfs_rtgroup {
 		uint8_t			*rtg_rsum_cache;
 		struct xfs_open_zone	*rtg_open_zone;
 	};
+
+	/*
+	 * Count of outstanding GC operations for zoned XFS. Any RTG with a
+	 * non-zero rtg_gccount will not be picked as new GC victim.
+	 */
+	atomic_t		rtg_gccount;
 };
 
 /*
diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
index 109877d9a6bf..efcb52796d05 100644
--- a/fs/xfs/xfs_zone_gc.c
+++ b/fs/xfs/xfs_zone_gc.c
@@ -114,6 +114,8 @@ struct xfs_gc_bio {
 	/* Open Zone being written to */
 	struct xfs_open_zone	*oz;
 
+	struct xfs_rtgroup	*victim_rtg;
+
 	/* Bio used for reads and writes, including the bvec used by it */
 	struct bio_vec		bv;
 	struct bio		bio;	/* must be last */
@@ -264,6 +266,7 @@ xfs_zone_gc_iter_init(
 	iter->rec_count = 0;
 	iter->rec_idx = 0;
 	iter->victim_rtg = victim_rtg;
+	atomic_inc(&victim_rtg->rtg_gccount);
 }
 
 /*
@@ -362,6 +365,7 @@ xfs_zone_gc_query(
 
 	return 0;
 done:
+	atomic_dec(&iter->victim_rtg->rtg_gccount);
 	xfs_rtgroup_rele(iter->victim_rtg);
 	iter->victim_rtg = NULL;
 	return 0;
@@ -451,6 +455,20 @@ xfs_zone_gc_pick_victim_from(
 		if (!rtg)
 			continue;
 
+		/*
+		 * If the zone is already undergoing GC, don't pick it again.
+		 *
+		 * This prevents us from picking one of the zones for which we
+		 * already submitted GC I/O, but for which the remapping hasn't
+		 * concluded again. This won't cause data corruption, but
+		 * increases write amplification and slows down GC, so this is
+		 * a bad thing.
+		 */
+		if (atomic_read(&rtg->rtg_gccount)) {
+			xfs_rtgroup_rele(rtg);
+			continue;
+		}
+
 		/* skip zones that are just waiting for a reset */
 		if (rtg_rmap(rtg)->i_used_blocks == 0 ||
 		    rtg_rmap(rtg)->i_used_blocks >= victim_used) {
@@ -688,6 +706,9 @@ xfs_zone_gc_start_chunk(
 	chunk->scratch = &data->scratch[data->scratch_idx];
 	chunk->data = data;
 	chunk->oz = oz;
+	chunk->victim_rtg = iter->victim_rtg;
+	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
+	atomic_inc(&chunk->victim_rtg->rtg_gccount);
 
 	bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
 	bio->bi_end_io = xfs_zone_gc_end_io;
@@ -710,6 +731,8 @@ static void
 xfs_zone_gc_free_chunk(
 	struct xfs_gc_bio	*chunk)
 {
+	atomic_dec(&chunk->victim_rtg->rtg_gccount);
+	xfs_rtgroup_rele(chunk->victim_rtg);
 	list_del(&chunk->entry);
 	xfs_open_zone_put(chunk->oz);
 	xfs_irele(chunk->ip);
@@ -770,6 +793,10 @@ xfs_zone_gc_split_write(
 	split_chunk->oz = chunk->oz;
 	atomic_inc(&chunk->oz->oz_ref);
 
+	split_chunk->victim_rtg = chunk->victim_rtg;
+	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
+	atomic_inc(&chunk->victim_rtg->rtg_gccount);
+
 	chunk->offset += split_len;
 	chunk->len -= split_len;
 	chunk->old_startblock += XFS_B_TO_FSB(data->mp, split_len);
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 17+ messages in thread
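The reference-count guard the patch builds around rtg_gccount can be modeled as a small standalone C toy. This is an illustration only, not the kernel code: the names mirror the patch, but the kernel's atomic_t helpers are replaced with C11 <stdatomic.h>, and the victim-selection and I/O machinery is omitted. It shows why every increment (iterator init, each GC chunk, each split write) needs a matching decrement before the zone becomes pickable again:

```c
/*
 * Toy model of the per-zone GC guard from the patch above.
 * Illustrative sketch, not the actual XFS implementation.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_rtgroup {
	atomic_int	rtg_gccount;	/* outstanding GC operations on this zone */
	int		used_blocks;	/* blocks still live in the zone */
};

/* A zone qualifies as a GC victim only if no GC I/O is in flight on it. */
static bool toy_may_pick_victim(struct toy_rtgroup *rtg)
{
	return atomic_load(&rtg->rtg_gccount) == 0;
}

/* Taken whenever a GC chunk (or split chunk) referencing the zone is set up. */
static void toy_gc_start(struct toy_rtgroup *rtg)
{
	atomic_fetch_add(&rtg->rtg_gccount, 1);
}

/* Dropped when the chunk is freed after the remapping has concluded. */
static void toy_gc_finish(struct toy_rtgroup *rtg)
{
	atomic_fetch_sub(&rtg->rtg_gccount, 1);
}
```

As in the patch, a split write takes an extra reference of its own, so the zone only becomes a candidate again once all outstanding chunks have been freed.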
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-17  6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
@ 2025-10-17 12:37   ` Carlos Maiolino
  2025-10-20 11:49     ` Christoph Hellwig
  2025-10-18  4:08   ` Damien Le Moal
  2025-10-23  6:16   ` Darrick J. Wong
  2 siblings, 1 reply; 17+ messages in thread

From: Carlos Maiolino @ 2025-10-17 12:37 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

On Fri, Oct 17, 2025 at 08:07:02AM +0200, Christoph Hellwig wrote:
> When we are picking a zone for gc it might already be in the pipeline
> which can lead to us moving the same data twice resulting in in write
> amplification and a very unfortunate case where keep on garbage
> collecting the zone we just filled with migrated data stopping all
> forward progress.
> 
> Fix this by introducing a count of on-going GC operations on a zone, so

This 'so' shouldn't be here? ^

> and skip any zone with ongoing GC when picking a new victim.
> 
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

This looks good, but I think it's worth to add a:

Fixes: 080d01c41 ("xfs: implement zoned garbage collection")


If you agree I can update both nitpicks above when I pull the series,
otherwise, feel free to add to the next version:

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

> ---
>  fs/xfs/libxfs/xfs_rtgroup.h |  6 ++++++
>  fs/xfs/xfs_zone_gc.c        | 27 +++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
> index d36a6ae0abe5..d4fcf591e63d 100644
> --- a/fs/xfs/libxfs/xfs_rtgroup.h
> +++ b/fs/xfs/libxfs/xfs_rtgroup.h
> @@ -50,6 +50,12 @@ struct xfs_rtgroup {
>  		uint8_t			*rtg_rsum_cache;
>  		struct xfs_open_zone	*rtg_open_zone;
>  	};
> +
> +	/*
> +	 * Count of outstanding GC operations for zoned XFS. Any RTG with a
> +	 * non-zero rtg_gccount will not be picked as new GC victim.
> +	 */
> +	atomic_t		rtg_gccount;
>  };
> 
>  /*
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 109877d9a6bf..efcb52796d05 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -114,6 +114,8 @@ struct xfs_gc_bio {
>  	/* Open Zone being written to */
>  	struct xfs_open_zone	*oz;
> 
> +	struct xfs_rtgroup	*victim_rtg;
> +
>  	/* Bio used for reads and writes, including the bvec used by it */
>  	struct bio_vec		bv;
>  	struct bio		bio;	/* must be last */
> @@ -264,6 +266,7 @@ xfs_zone_gc_iter_init(
>  	iter->rec_count = 0;
>  	iter->rec_idx = 0;
>  	iter->victim_rtg = victim_rtg;
> +	atomic_inc(&victim_rtg->rtg_gccount);
>  }
> 
>  /*
> @@ -362,6 +365,7 @@ xfs_zone_gc_query(
> 
>  	return 0;
> done:
> +	atomic_dec(&iter->victim_rtg->rtg_gccount);
>  	xfs_rtgroup_rele(iter->victim_rtg);
>  	iter->victim_rtg = NULL;
>  	return 0;
> @@ -451,6 +455,20 @@ xfs_zone_gc_pick_victim_from(
>  		if (!rtg)
>  			continue;
> 
> +		/*
> +		 * If the zone is already undergoing GC, don't pick it again.
> +		 *
> +		 * This prevents us from picking one of the zones for which we
> +		 * already submitted GC I/O, but for which the remapping hasn't
> +		 * concluded again. This won't cause data corruption, but
> +		 * increases write amplification and slows down GC, so this is
> +		 * a bad thing.
> +		 */
> +		if (atomic_read(&rtg->rtg_gccount)) {
> +			xfs_rtgroup_rele(rtg);
> +			continue;
> +		}
> +
>  		/* skip zones that are just waiting for a reset */
>  		if (rtg_rmap(rtg)->i_used_blocks == 0 ||
>  		    rtg_rmap(rtg)->i_used_blocks >= victim_used) {
> @@ -688,6 +706,9 @@ xfs_zone_gc_start_chunk(
>  	chunk->scratch = &data->scratch[data->scratch_idx];
>  	chunk->data = data;
>  	chunk->oz = oz;
> +	chunk->victim_rtg = iter->victim_rtg;
> +	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> +	atomic_inc(&chunk->victim_rtg->rtg_gccount);
> 
>  	bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
>  	bio->bi_end_io = xfs_zone_gc_end_io;
> @@ -710,6 +731,8 @@ static void
>  xfs_zone_gc_free_chunk(
>  	struct xfs_gc_bio	*chunk)
>  {
> +	atomic_dec(&chunk->victim_rtg->rtg_gccount);
> +	xfs_rtgroup_rele(chunk->victim_rtg);
>  	list_del(&chunk->entry);
>  	xfs_open_zone_put(chunk->oz);
>  	xfs_irele(chunk->ip);
> @@ -770,6 +793,10 @@ xfs_zone_gc_split_write(
>  	split_chunk->oz = chunk->oz;
>  	atomic_inc(&chunk->oz->oz_ref);
> 
> +	split_chunk->victim_rtg = chunk->victim_rtg;
> +	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> +	atomic_inc(&chunk->victim_rtg->rtg_gccount);
> +
>  	chunk->offset += split_len;
>  	chunk->len -= split_len;
>  	chunk->old_startblock += XFS_B_TO_FSB(data->mp, split_len);
> -- 
> 2.47.3
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-17 12:37 ` Carlos Maiolino
@ 2025-10-20 11:49   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-20 11:49 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Christoph Hellwig, Hans Holmberg, Darrick J. Wong, linux-xfs

On Fri, Oct 17, 2025 at 02:37:02PM +0200, Carlos Maiolino wrote:
> This looks good, but I think it's worth to add a:
> 
> Fixes: 080d01c41 ("xfs: implement zoned garbage collection")
> 
> 
> If you agree I can update both nitpicks above when I pull the series,
> otherwise, feel free to add to the next version:

Sure.  Let's see if I need a resend for another reason, but otherwise
your tweaks look good.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-17  6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
  2025-10-17 12:37   ` Carlos Maiolino
@ 2025-10-18  4:08   ` Damien Le Moal
  2025-10-23  6:16   ` Darrick J. Wong
  2 siblings, 0 replies; 17+ messages in thread

From: Damien Le Moal @ 2025-10-18  4:08 UTC (permalink / raw)
To: Christoph Hellwig, Carlos Maiolino
Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

On 10/17/25 15:07, Christoph Hellwig wrote:
> When we are picking a zone for gc it might already be in the pipeline
> which can lead to us moving the same data twice resulting in in write
> amplification and a very unfortunate case where keep on garbage

s/keep/we keep

> collecting the zone we just filled with migrated data stopping all
> forward progress.
> 
> Fix this by introducing a count of on-going GC operations on a zone, so
> and skip any zone with ongoing GC when picking a new victim.
> 
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-17  6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
  2025-10-17 12:37   ` Carlos Maiolino
  2025-10-18  4:08   ` Damien Le Moal
@ 2025-10-23  6:16   ` Darrick J. Wong
  2025-10-23  6:28     ` Christoph Hellwig
  2 siblings, 1 reply; 17+ messages in thread

From: Darrick J. Wong @ 2025-10-23  6:16 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Carlos Maiolino, Hans Holmberg, linux-xfs

On Fri, Oct 17, 2025 at 08:07:02AM +0200, Christoph Hellwig wrote:
> When we are picking a zone for gc it might already be in the pipeline
> which can lead to us moving the same data twice resulting in in write
> amplification and a very unfortunate case where keep on garbage
> collecting the zone we just filled with migrated data stopping all
> forward progress.
> 
> Fix this by introducing a count of on-going GC operations on a zone, so
> and skip any zone with ongoing GC when picking a new victim.
> 
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_rtgroup.h |  6 ++++++
>  fs/xfs/xfs_zone_gc.c        | 27 +++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
> index d36a6ae0abe5..d4fcf591e63d 100644
> --- a/fs/xfs/libxfs/xfs_rtgroup.h
> +++ b/fs/xfs/libxfs/xfs_rtgroup.h
> @@ -50,6 +50,12 @@ struct xfs_rtgroup {
>  		uint8_t			*rtg_rsum_cache;
>  		struct xfs_open_zone	*rtg_open_zone;
>  	};
> +
> +	/*
> +	 * Count of outstanding GC operations for zoned XFS. Any RTG with a
> +	 * non-zero rtg_gccount will not be picked as new GC victim.
> +	 */
> +	atomic_t		rtg_gccount;
>  };
> 
>  /*
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 109877d9a6bf..efcb52796d05 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -114,6 +114,8 @@ struct xfs_gc_bio {
>  	/* Open Zone being written to */
>  	struct xfs_open_zone	*oz;
> 
> +	struct xfs_rtgroup	*victim_rtg;
> +
>  	/* Bio used for reads and writes, including the bvec used by it */
>  	struct bio_vec		bv;
>  	struct bio		bio;	/* must be last */
> @@ -264,6 +266,7 @@ xfs_zone_gc_iter_init(
>  	iter->rec_count = 0;
>  	iter->rec_idx = 0;
>  	iter->victim_rtg = victim_rtg;
> +	atomic_inc(&victim_rtg->rtg_gccount);
>  }
> 
>  /*
> @@ -362,6 +365,7 @@ xfs_zone_gc_query(
> 
>  	return 0;
> done:
> +	atomic_dec(&iter->victim_rtg->rtg_gccount);
>  	xfs_rtgroup_rele(iter->victim_rtg);
>  	iter->victim_rtg = NULL;
>  	return 0;
> @@ -451,6 +455,20 @@ xfs_zone_gc_pick_victim_from(
>  		if (!rtg)
>  			continue;
> 
> +		/*
> +		 * If the zone is already undergoing GC, don't pick it again.
> +		 *
> +		 * This prevents us from picking one of the zones for which we
> +		 * already submitted GC I/O, but for which the remapping hasn't
> +		 * concluded again. This won't cause data corruption, but

"...but that I/O hasn't yet finished."

With that and the other comments corrected and a Fixes tag applied,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> +		 * increases write amplification and slows down GC, so this is
> +		 * a bad thing.
> +		 */
> +		if (atomic_read(&rtg->rtg_gccount)) {
> +			xfs_rtgroup_rele(rtg);
> +			continue;
> +		}
> +
>  		/* skip zones that are just waiting for a reset */
>  		if (rtg_rmap(rtg)->i_used_blocks == 0 ||
>  		    rtg_rmap(rtg)->i_used_blocks >= victim_used) {
> @@ -688,6 +706,9 @@ xfs_zone_gc_start_chunk(
>  	chunk->scratch = &data->scratch[data->scratch_idx];
>  	chunk->data = data;
>  	chunk->oz = oz;
> +	chunk->victim_rtg = iter->victim_rtg;
> +	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> +	atomic_inc(&chunk->victim_rtg->rtg_gccount);
> 
>  	bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
>  	bio->bi_end_io = xfs_zone_gc_end_io;
> @@ -710,6 +731,8 @@ static void
>  xfs_zone_gc_free_chunk(
>  	struct xfs_gc_bio	*chunk)
>  {
> +	atomic_dec(&chunk->victim_rtg->rtg_gccount);
> +	xfs_rtgroup_rele(chunk->victim_rtg);
>  	list_del(&chunk->entry);
>  	xfs_open_zone_put(chunk->oz);
>  	xfs_irele(chunk->ip);
> @@ -770,6 +793,10 @@ xfs_zone_gc_split_write(
>  	split_chunk->oz = chunk->oz;
>  	atomic_inc(&chunk->oz->oz_ref);
> 
> +	split_chunk->victim_rtg = chunk->victim_rtg;
> +	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> +	atomic_inc(&chunk->victim_rtg->rtg_gccount);
> +
>  	chunk->offset += split_len;
>  	chunk->len -= split_len;
>  	chunk->old_startblock += XFS_B_TO_FSB(data->mp, split_len);
> -- 
> 2.47.3
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-23  6:16 ` Darrick J. Wong
@ 2025-10-23  6:28   ` Christoph Hellwig
  2025-10-23 15:02     ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-23  6:28 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Carlos Maiolino, Hans Holmberg, linux-xfs

On Wed, Oct 22, 2025 at 11:16:22PM -0700, Darrick J. Wong wrote:
> > +		/*
> > +		 * If the zone is already undergoing GC, don't pick it again.
> > +		 *
> > +		 * This prevents us from picking one of the zones for which we
> > +		 * already submitted GC I/O, but for which the remapping hasn't
> > +		 * concluded again. This won't cause data corruption, but
> 
> "...but that I/O hasn't yet finished."

It's really the remapping after the I/O that has to finish, but given
that we talk about I/O earlier, maybe your wording is less confusing.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-23  6:28 ` Christoph Hellwig
@ 2025-10-23 15:02   ` Darrick J. Wong
  2025-10-23 15:04     ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread

From: Darrick J. Wong @ 2025-10-23 15:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Carlos Maiolino, Hans Holmberg, linux-xfs

On Thu, Oct 23, 2025 at 08:28:29AM +0200, Christoph Hellwig wrote:
> On Wed, Oct 22, 2025 at 11:16:22PM -0700, Darrick J. Wong wrote:
> > > +		 *
> > > +		 * This prevents us from picking one of the zones for which we
> > > +		 * already submitted GC I/O, but for which the remapping hasn't
> > > +		 * concluded again. This won't cause data corruption, but
> > 
> > "...but that I/O hasn't yet finished."
> 
> It's really the remapping after the I/O that has to finish, but given
> that we talk about I/O earlier, maybe your wording is less confusing.

The word 'again' at the end of the sentence tripped me up.  I think the
word you want there is "yet" (because "again" implies that this isn't
the first time we've remapped the space, which isn't true) but you could
also end the sentence with "concluded".

--D

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-23 15:02 ` Darrick J. Wong
@ 2025-10-23 15:04   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-23 15:04 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Carlos Maiolino, Hans Holmberg, linux-xfs

On Thu, Oct 23, 2025 at 08:02:32AM -0700, Darrick J. Wong wrote:
> > > > +		 *
> > > > +		 * This prevents us from picking one of the zones for which we
> > > > +		 * already submitted GC I/O, but for which the remapping hasn't
> > > > +		 * concluded again. This won't cause data corruption, but
> > > 
> > > "...but that I/O hasn't yet finished."
> > 
> > It's really the remapping after the I/O that has to finish, but given
> > that we talk about I/O earlier, maybe your wording is less confusing.
> 
> The word 'again' at the end of the sentence tripped me up.  I think the
> word you want there is "yet" (because "again" implies that this isn't
> the first time we've remapped the space, which isn't true) but you could
> also end the sentence with "concluded".

Both sound fine.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
  2025-10-17  6:07 fix for selecting a zone with active GC I/O for GC Christoph Hellwig
  2025-10-17  6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
@ 2025-10-17  6:07 ` Christoph Hellwig
  2025-10-17 12:40   ` Carlos Maiolino
                     ` (3 more replies)
  1 sibling, 4 replies; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-17  6:07 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

Besides blocks being invalidated, there is another case when the original
mapping could have changed between querying the rmap for GC and calling
xfs_zoned_map_extent.  Document it there as it took us quite some time
to figure out what is going on while developing the multiple-GC
protection fix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_zone_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
index e7e439918f6d..2790001ee0f1 100644
--- a/fs/xfs/xfs_zone_alloc.c
+++ b/fs/xfs/xfs_zone_alloc.c
@@ -246,6 +246,14 @@ xfs_zoned_map_extent(
 	 * If a data write raced with this GC write, keep the existing data in
 	 * the data fork, mark our newly written GC extent as reclaimable, then
 	 * move on to the next extent.
+	 *
+	 * Note that this can also happen when racing with operations that do
+	 * not actually invalidate the data, but just move it to a different
+	 * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
+	 * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE).  If the
+	 * data was just moved around, GC fails to free the zone, but the zone
+	 * becomes a GC candidate again as soon as all previous GC I/O has
+	 * finished and these blocks will be moved out eventually.
 	 */
 	if (old_startblock != NULLFSBLOCK &&
 	    old_startblock != data.br_startblock)
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 17+ messages in thread
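The staleness check the comment above documents can be modeled as a tiny standalone C toy. This is an illustration under assumed names (TOY_NULLFSBLOCK and the helper are invented for this sketch, not kernel API): a GC write is only remapped if the extent still points at the block the GC read came from; if the data moved in the meantime (a racing write, XFS_IOC_EXCHANGE_RANGE, or a COLLAPSE/INSERT range operation), the GC copy is discarded:

```c
/*
 * Toy model of the check quoted from xfs_zoned_map_extent() above.
 * Illustrative only; names and constants are invented for this sketch.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NULLFSBLOCK	UINT64_MAX	/* stands in for NULLFSBLOCK */

/*
 * Return true if a GC write may be remapped into the file: either this
 * is not a GC write at all (old_startblock == TOY_NULLFSBLOCK), or the
 * current mapping still points at the block the GC data was read from.
 */
static bool toy_gc_write_still_valid(uint64_t old_startblock,
				     uint64_t cur_startblock)
{
	if (old_startblock == TOY_NULLFSBLOCK)
		return true;
	return old_startblock == cur_startblock;
}
```

Note that, as the comment explains, a mismatch does not distinguish "data invalidated" from "data merely moved": in both cases the stale GC write is dropped, and a moved-but-live zone simply becomes a GC candidate again later.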
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
  2025-10-17  6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
@ 2025-10-17 12:40   ` Carlos Maiolino
  2025-10-18  4:09   ` Damien Le Moal
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread

From: Carlos Maiolino @ 2025-10-17 12:40 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

On Fri, Oct 17, 2025 at 08:07:03AM +0200, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent.  Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_zone_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index e7e439918f6d..2790001ee0f1 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -246,6 +246,14 @@ xfs_zoned_map_extent(
>  	 * If a data write raced with this GC write, keep the existing data in
>  	 * the data fork, mark our newly written GC extent as reclaimable, then
>  	 * move on to the next extent.
> +	 *
> +	 * Note that this can also happen when racing with operations that do
> +	 * not actually invalidate the data, but just move it to a different
> +	 * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> +	 * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE).  If the
> +	 * data was just moved around, GC fails to free the zone, but the zone
> +	 * becomes a GC candidate again as soon as all previous GC I/O has
> +	 * finished and these blocks will be moved out eventually.
>  	 */
>  	if (old_startblock != NULLFSBLOCK &&
>  	    old_startblock != data.br_startblock)
> -- 
> 2.47.3

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

> 

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
  2025-10-17  6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
  2025-10-17 12:40   ` Carlos Maiolino
@ 2025-10-18  4:09   ` Damien Le Moal
  2025-10-20  8:19   ` Hans Holmberg
  2025-10-23  6:21   ` Darrick J. Wong
  3 siblings, 0 replies; 17+ messages in thread

From: Damien Le Moal @ 2025-10-18  4:09 UTC (permalink / raw)
To: Christoph Hellwig, Carlos Maiolino
Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

On 10/17/25 15:07, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent.  Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
  2025-10-17  6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
  2025-10-17 12:40   ` Carlos Maiolino
  2025-10-18  4:09   ` Damien Le Moal
@ 2025-10-20  8:19   ` Hans Holmberg
  2025-10-23  6:21   ` Darrick J. Wong
  3 siblings, 0 replies; 17+ messages in thread

From: Hans Holmberg @ 2025-10-20  8:19 UTC (permalink / raw)
To: hch, Carlos Maiolino; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org

On 17/10/2025 08:07, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent.  Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_zone_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index e7e439918f6d..2790001ee0f1 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -246,6 +246,14 @@ xfs_zoned_map_extent(
>  	 * If a data write raced with this GC write, keep the existing data in
>  	 * the data fork, mark our newly written GC extent as reclaimable, then
>  	 * move on to the next extent.
> +	 *
> +	 * Note that this can also happen when racing with operations that do
> +	 * not actually invalidate the data, but just move it to a different
> +	 * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> +	 * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE).  If the
> +	 * data was just moved around, GC fails to free the zone, but the zone
> +	 * becomes a GC candidate again as soon as all previous GC I/O has
> +	 * finished and these blocks will be moved out eventually.
>  	 */
>  	if (old_startblock != NULLFSBLOCK &&
>  	    old_startblock != data.br_startblock)

Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
  2025-10-17  6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
                     ` (2 preceding siblings ...)
  2025-10-20  8:19   ` Hans Holmberg
@ 2025-10-23  6:21   ` Darrick J. Wong
  2025-10-23  6:30     ` Christoph Hellwig
  3 siblings, 1 reply; 17+ messages in thread

From: Darrick J. Wong @ 2025-10-23  6:21 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Carlos Maiolino, Hans Holmberg, linux-xfs

On Fri, Oct 17, 2025 at 08:07:03AM +0200, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent.  Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_zone_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index e7e439918f6d..2790001ee0f1 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -246,6 +246,14 @@ xfs_zoned_map_extent(
>  	 * If a data write raced with this GC write, keep the existing data in
>  	 * the data fork, mark our newly written GC extent as reclaimable, then
>  	 * move on to the next extent.
> +	 *
> +	 * Note that this can also happen when racing with operations that do
> +	 * not actually invalidate the data, but just move it to a different
> +	 * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> +	 * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE).  If the

Or (eventually) this will also be possible if zonegc races with a plain
remapping via FICLONERANGE, right?

(Like, whenever reflink gets implemented)

I guess the zonegc write completion could just iterate the rtrmapbt for
records that overlap the old extent and remap them until there are no
rtrmapbt entries returned.  Then you'd be able to catch the
exchange-range and reflink cases, right?

(Not asking for a reflink-aware zonegc right this second, just checking
my understanding.)

If the answers to both questions are yes then I've grokked this well
enough to say
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> +	 * data was just moved around, GC fails to free the zone, but the zone
> +	 * becomes a GC candidate again as soon as all previous GC I/O has
> +	 * finished and these blocks will be moved out eventually.
>  	 */
>  	if (old_startblock != NULLFSBLOCK &&
>  	    old_startblock != data.br_startblock)
> -- 
> 2.47.3
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
  2025-10-23  6:21 ` Darrick J. Wong
@ 2025-10-23  6:30   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-23  6:30 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Carlos Maiolino, Hans Holmberg, linux-xfs

On Wed, Oct 22, 2025 at 11:21:25PM -0700, Darrick J. Wong wrote:
> > +	 * Note that this can also happen when racing with operations that do
> > +	 * not actually invalidate the data, but just move it to a different
> > +	 * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> > +	 * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE).  If the
> 
> Or (eventually) this will also be possible if zonegc races with a plain
> remapping via FICLONERANGE, right?
> 
> (Like, whenever reflink gets implemented)

Yes.

> I guess the zonegc write completion could just iterate the rtrmapbt for
> records that overlap the old extent and remap them until there are no
> rtrmapbt entries returned.  Then you'd be able to catch the
> exchange-range and reflink cases, right?

I don't think so.  All these remap operations would at this point have
removed the rmapbt record for the old mapping because they changed the
logical mapping.  So I don't think we'd find anything useful there.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* fix for selecting a zone with active GC I/O for GC v2
@ 2025-10-23 15:17 Christoph Hellwig
  2025-10-23 15:17 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread

From: Christoph Hellwig @ 2025-10-23 15:17 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs

Hi all,

the first patch in the series adds a counter to ensure that we never
select a zone for which there is active GC I/O for GC again, as we'd
try to copy the same blocks again, only to see the mapping had changed
in I/O completion to discard that I/O.  The second one documents a
non-obvious version of that GC race detection that took us way too
long to track down.

Changes since v1:
 - fix text in a comment and a commit message

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-23 15:17 fix for selecting a zone with active GC I/O for GC v2 Christoph Hellwig
@ 2025-10-23 15:17 ` Christoph Hellwig
  2025-10-31 11:19   ` Carlos Maiolino
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-10-23 15:17 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Hans Holmberg, Darrick J. Wong, linux-xfs, Damien Le Moal,
	Carlos Maiolino

When we are picking a zone for gc it might already be in the pipeline,
which can lead to us moving the same data twice, resulting in write
amplification and a very unfortunate case where we keep on garbage
collecting the zone we just filled with migrated data, stopping all
forward progress.

Fix this by introducing a count of on-going GC operations on a zone, and
skip any zone with ongoing GC when picking a new victim.

Fixes: 080d01c41 ("xfs: implement zoned garbage collection")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
---
 fs/xfs/libxfs/xfs_rtgroup.h |  6 ++++++
 fs/xfs/xfs_zone_gc.c        | 27 +++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index d36a6ae0abe5..d4fcf591e63d 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -50,6 +50,12 @@ struct xfs_rtgroup {
 		uint8_t			*rtg_rsum_cache;
 		struct xfs_open_zone	*rtg_open_zone;
 	};
+
+	/*
+	 * Count of outstanding GC operations for zoned XFS. Any RTG with a
+	 * non-zero rtg_gccount will not be picked as new GC victim.
+	 */
+	atomic_t		rtg_gccount;
 };

 /*
diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
index 109877d9a6bf..4ade54445532 100644
--- a/fs/xfs/xfs_zone_gc.c
+++ b/fs/xfs/xfs_zone_gc.c
@@ -114,6 +114,8 @@ struct xfs_gc_bio {
 	/* Open Zone being written to */
 	struct xfs_open_zone	*oz;

+	struct xfs_rtgroup	*victim_rtg;
+
 	/* Bio used for reads and writes, including the bvec used by it */
 	struct bio_vec		bv;
 	struct bio		bio;	/* must be last */
@@ -264,6 +266,7 @@ xfs_zone_gc_iter_init(
 	iter->rec_count = 0;
 	iter->rec_idx = 0;
 	iter->victim_rtg = victim_rtg;
+	atomic_inc(&victim_rtg->rtg_gccount);
 }

 /*
@@ -362,6 +365,7 @@ xfs_zone_gc_query(
 	return 0;

 done:
+	atomic_dec(&iter->victim_rtg->rtg_gccount);
 	xfs_rtgroup_rele(iter->victim_rtg);
 	iter->victim_rtg = NULL;
 	return 0;
@@ -451,6 +455,20 @@ xfs_zone_gc_pick_victim_from(
 		if (!rtg)
 			continue;

+		/*
+		 * If the zone is already undergoing GC, don't pick it again.
+		 *
+		 * This prevents us from picking one of the zones for which we
+		 * already submitted GC I/O, but for which the remapping hasn't
+		 * concluded yet.  This won't cause data corruption, but
+		 * increases write amplification and slows down GC, so this is
+		 * a bad thing.
+		 */
+		if (atomic_read(&rtg->rtg_gccount)) {
+			xfs_rtgroup_rele(rtg);
+			continue;
+		}
+
 		/* skip zones that are just waiting for a reset */
 		if (rtg_rmap(rtg)->i_used_blocks == 0 ||
 		    rtg_rmap(rtg)->i_used_blocks >= victim_used) {
@@ -688,6 +706,9 @@ xfs_zone_gc_start_chunk(
 	chunk->scratch = &data->scratch[data->scratch_idx];
 	chunk->data = data;
 	chunk->oz = oz;
+	chunk->victim_rtg = iter->victim_rtg;
+	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
+	atomic_inc(&chunk->victim_rtg->rtg_gccount);

 	bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
 	bio->bi_end_io = xfs_zone_gc_end_io;
@@ -710,6 +731,8 @@ static void
 xfs_zone_gc_free_chunk(
 	struct xfs_gc_bio	*chunk)
 {
+	atomic_dec(&chunk->victim_rtg->rtg_gccount);
+	xfs_rtgroup_rele(chunk->victim_rtg);
 	list_del(&chunk->entry);
 	xfs_open_zone_put(chunk->oz);
 	xfs_irele(chunk->ip);
@@ -770,6 +793,10 @@ xfs_zone_gc_split_write(
 	split_chunk->oz = chunk->oz;
 	atomic_inc(&chunk->oz->oz_ref);

+	split_chunk->victim_rtg = chunk->victim_rtg;
+	atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
+	atomic_inc(&chunk->victim_rtg->rtg_gccount);
+
 	chunk->offset += split_len;
 	chunk->len -= split_len;
 	chunk->old_startblock += XFS_B_TO_FSB(data->mp, split_len);
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
  2025-10-23 15:17 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
@ 2025-10-31 11:19   ` Carlos Maiolino
  0 siblings, 0 replies; 17+ messages in thread
From: Carlos Maiolino @ 2025-10-31 11:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hans Holmberg, Darrick J. Wong, linux-xfs, Damien Le Moal,
	Carlos Maiolino

On Thu, 23 Oct 2025 17:17:02 +0200, Christoph Hellwig wrote:
> When we are picking a zone for gc it might already be in the pipeline,
> which can lead to us moving the same data twice, resulting in write
> amplification and a very unfortunate case where we keep on garbage
> collecting the zone we just filled with migrated data, stopping all
> forward progress.
>
> Fix this by introducing a count of on-going GC operations on a zone, and
> skip any zone with ongoing GC when picking a new victim.
>
> [...]

Applied to for-next, thanks!

[1/2] xfs: prevent gc from picking the same zone twice
      commit: 83bac569c762651ac6dff9a86f54ecc13d911f7d
[2/2] xfs: document another racy GC case in xfs_zoned_map_extent
      commit: 0db22d7ee462c42c1284e98d47840932792c1adb

Best regards,
-- 
Carlos Maiolino <cem@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-10-31 11:19 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-17  6:07 fix for selecting a zone with active GC I/O for GC Christoph Hellwig
2025-10-17  6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
2025-10-17 12:37   ` Carlos Maiolino
2025-10-20 11:49     ` Christoph Hellwig
2025-10-18  4:08   ` Damien Le Moal
2025-10-23  6:16   ` Darrick J. Wong
2025-10-23  6:28     ` Christoph Hellwig
2025-10-23 15:02       ` Darrick J. Wong
2025-10-23 15:04         ` Christoph Hellwig
2025-10-17  6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
2025-10-17 12:40   ` Carlos Maiolino
2025-10-18  4:09   ` Damien Le Moal
2025-10-20  8:19     ` Hans Holmberg
2025-10-23  6:21   ` Darrick J. Wong
2025-10-23  6:30     ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2025-10-23 15:17 fix for selecting a zone with active GC I/O for GC v2 Christoph Hellwig
2025-10-23 15:17 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
2025-10-31 11:19   ` Carlos Maiolino