* fix for selecting a zone with active GC I/O for GC
@ 2025-10-17 6:07 Christoph Hellwig
2025-10-17 6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
2025-10-17 6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
0 siblings, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-17 6:07 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs
Hi all,
the first patch in the series adds a per-zone counter to ensure that we
never again select a zone for GC while it still has active GC I/O, as we
would otherwise try to copy the same blocks a second time, only to see at
I/O completion that the mapping had changed and discard that I/O.
The second one documents a non-obvious variant of that GC race detection
that took us way too long to track down.
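To illustrate the idea, here is a minimal sketch of the scheme (simplified
structure and helper names, not the actual XFS code): each zone keeps an
atomic count of in-flight GC work that is bumped when GC I/O for the zone
is submitted, dropped once the remapping has concluded, and checked when
picking the next victim.

/* minimal sketch only; names are simplified, not the real xfs code */
struct zone {
	atomic_t	gc_count;	/* in-flight GC chunks for this zone */
	/* ... */
};

/* bump the count when a GC chunk targeting @zone is submitted */
static void zone_gc_chunk_start(struct zone *zone)
{
	atomic_inc(&zone->gc_count);
}

/* drop it again once the chunk's remapping has concluded */
static void zone_gc_chunk_done(struct zone *zone)
{
	atomic_dec(&zone->gc_count);
}

/* victim selection: skip any zone that still has GC in flight */
static bool zone_is_gc_candidate(struct zone *zone)
{
	return atomic_read(&zone->gc_count) == 0;
}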
* [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-17 6:07 fix for selecting a zone with active GC I/O for GC Christoph Hellwig
@ 2025-10-17 6:07 ` Christoph Hellwig
2025-10-17 12:37 ` Carlos Maiolino
` (2 more replies)
2025-10-17 6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
1 sibling, 3 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-17 6:07 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs
When we are picking a zone for gc it might already be in the pipeline
which can lead to us moving the same data twice resulting in in write
amplification and a very unfortunate case where keep on garbage
collecting the zone we just filled with migrated data stopping all
forward progress.
Fix this by introducing a count of on-going GC operations on a zone, so
and skip any zone with ongoing GC when picking a new victim.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_rtgroup.h | 6 ++++++
fs/xfs/xfs_zone_gc.c | 27 +++++++++++++++++++++++++++
2 files changed, 33 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index d36a6ae0abe5..d4fcf591e63d 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -50,6 +50,12 @@ struct xfs_rtgroup {
uint8_t *rtg_rsum_cache;
struct xfs_open_zone *rtg_open_zone;
};
+
+ /*
+ * Count of outstanding GC operations for zoned XFS. Any RTG with a
+ * non-zero rtg_gccount will not be picked as new GC victim.
+ */
+ atomic_t rtg_gccount;
};
/*
diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
index 109877d9a6bf..efcb52796d05 100644
--- a/fs/xfs/xfs_zone_gc.c
+++ b/fs/xfs/xfs_zone_gc.c
@@ -114,6 +114,8 @@ struct xfs_gc_bio {
/* Open Zone being written to */
struct xfs_open_zone *oz;
+ struct xfs_rtgroup *victim_rtg;
+
/* Bio used for reads and writes, including the bvec used by it */
struct bio_vec bv;
struct bio bio; /* must be last */
@@ -264,6 +266,7 @@ xfs_zone_gc_iter_init(
iter->rec_count = 0;
iter->rec_idx = 0;
iter->victim_rtg = victim_rtg;
+ atomic_inc(&victim_rtg->rtg_gccount);
}
/*
@@ -362,6 +365,7 @@ xfs_zone_gc_query(
return 0;
done:
+ atomic_dec(&iter->victim_rtg->rtg_gccount);
xfs_rtgroup_rele(iter->victim_rtg);
iter->victim_rtg = NULL;
return 0;
@@ -451,6 +455,20 @@ xfs_zone_gc_pick_victim_from(
if (!rtg)
continue;
+ /*
+ * If the zone is already undergoing GC, don't pick it again.
+ *
+ * This prevents us from picking one of the zones for which we
+ * already submitted GC I/O, but for which the remapping hasn't
+ * concluded again. This won't cause data corruption, but
+ * increases write amplification and slows down GC, so this is
+ * a bad thing.
+ */
+ if (atomic_read(&rtg->rtg_gccount)) {
+ xfs_rtgroup_rele(rtg);
+ continue;
+ }
+
/* skip zones that are just waiting for a reset */
if (rtg_rmap(rtg)->i_used_blocks == 0 ||
rtg_rmap(rtg)->i_used_blocks >= victim_used) {
@@ -688,6 +706,9 @@ xfs_zone_gc_start_chunk(
chunk->scratch = &data->scratch[data->scratch_idx];
chunk->data = data;
chunk->oz = oz;
+ chunk->victim_rtg = iter->victim_rtg;
+ atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
+ atomic_inc(&chunk->victim_rtg->rtg_gccount);
bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
bio->bi_end_io = xfs_zone_gc_end_io;
@@ -710,6 +731,8 @@ static void
xfs_zone_gc_free_chunk(
struct xfs_gc_bio *chunk)
{
+ atomic_dec(&chunk->victim_rtg->rtg_gccount);
+ xfs_rtgroup_rele(chunk->victim_rtg);
list_del(&chunk->entry);
xfs_open_zone_put(chunk->oz);
xfs_irele(chunk->ip);
@@ -770,6 +793,10 @@ xfs_zone_gc_split_write(
split_chunk->oz = chunk->oz;
atomic_inc(&chunk->oz->oz_ref);
+ split_chunk->victim_rtg = chunk->victim_rtg;
+ atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
+ atomic_inc(&chunk->victim_rtg->rtg_gccount);
+
chunk->offset += split_len;
chunk->len -= split_len;
chunk->old_startblock += XFS_B_TO_FSB(data->mp, split_len);
--
2.47.3
* [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
2025-10-17 6:07 fix for selecting a zone with active GC I/O for GC Christoph Hellwig
2025-10-17 6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
@ 2025-10-17 6:07 ` Christoph Hellwig
2025-10-17 12:40 ` Carlos Maiolino
` (3 more replies)
1 sibling, 4 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-17 6:07 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs
Besides blocks being invalidated, there is another case when the original
mapping could have changed between querying the rmap for GC and calling
xfs_zoned_map_extent. Document it there as it took us quite some time
to figure out what is going on while developing the multiple-GC
protection fix.
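As an illustration (a hypothetical userspace sequence, not taken from any
regression test), data can end up at a different offset without ever being
rewritten, which is enough for the mapping check in xfs_zoned_map_extent
to fire:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[65536];
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	/* write two 64k chunks of data and make sure they are on disk */
	memset(buf, 'a', sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);
	pwrite(fd, buf, sizeof(buf), sizeof(buf));
	fsync(fd);

	/*
	 * Collapse the first 64k away (assumes 64k is a multiple of the
	 * file system block size).  The second chunk keeps its on-disk
	 * blocks but now lives at file offset 0, so a GC pass that
	 * queried the rmap before this call sees a changed mapping.
	 */
	if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, sizeof(buf)) < 0)
		return 1;
	close(fd);
	return 0;
}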
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_zone_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
index e7e439918f6d..2790001ee0f1 100644
--- a/fs/xfs/xfs_zone_alloc.c
+++ b/fs/xfs/xfs_zone_alloc.c
@@ -246,6 +246,14 @@ xfs_zoned_map_extent(
* If a data write raced with this GC write, keep the existing data in
* the data fork, mark our newly written GC extent as reclaimable, then
* move on to the next extent.
+ *
+ * Note that this can also happen when racing with operations that do
+ * not actually invalidate the data, but just move it to a different
+ * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
+ * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE). If the
+ * data was just moved around, GC fails to free the zone, but the zone
+ * becomes a GC candidate again as soon as all previous GC I/O has
+ * finished and these blocks will be moved out eventually.
*/
if (old_startblock != NULLFSBLOCK &&
old_startblock != data.br_startblock)
--
2.47.3
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-17 6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
@ 2025-10-17 12:37 ` Carlos Maiolino
2025-10-20 11:49 ` Christoph Hellwig
2025-10-18 4:08 ` Damien Le Moal
2025-10-23 6:16 ` Darrick J. Wong
2 siblings, 1 reply; 16+ messages in thread
From: Carlos Maiolino @ 2025-10-17 12:37 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs
On Fri, Oct 17, 2025 at 08:07:02AM +0200, Christoph Hellwig wrote:
> When we are picking a zone for gc it might already be in the pipeline
> which can lead to us moving the same data twice resulting in in write
> amplification and a very unfortunate case where keep on garbage
> collecting the zone we just filled with migrated data stopping all
> forward progress.
>
> Fix this by introducing a count of on-going GC operations on a zone, so
This 'so' shouldn't be here? ^
> and skip any zone with ongoing GC when picking a new victim.
>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
This looks good, but I think it's worth adding a:
Fixes: 080d01c41 ("xfs: implement zoned garbage collection")
If you agree I can fix up both nitpicks above when I pull the series;
otherwise, feel free to add to the next version:
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> fs/xfs/libxfs/xfs_rtgroup.h | 6 ++++++
> fs/xfs/xfs_zone_gc.c | 27 +++++++++++++++++++++++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
> index d36a6ae0abe5..d4fcf591e63d 100644
> --- a/fs/xfs/libxfs/xfs_rtgroup.h
> +++ b/fs/xfs/libxfs/xfs_rtgroup.h
> @@ -50,6 +50,12 @@ struct xfs_rtgroup {
> uint8_t *rtg_rsum_cache;
> struct xfs_open_zone *rtg_open_zone;
> };
> +
> + /*
> + * Count of outstanding GC operations for zoned XFS. Any RTG with a
> + * non-zero rtg_gccount will not be picked as new GC victim.
> + */
> + atomic_t rtg_gccount;
> };
>
> /*
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 109877d9a6bf..efcb52796d05 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -114,6 +114,8 @@ struct xfs_gc_bio {
> /* Open Zone being written to */
> struct xfs_open_zone *oz;
>
> + struct xfs_rtgroup *victim_rtg;
> +
> /* Bio used for reads and writes, including the bvec used by it */
> struct bio_vec bv;
> struct bio bio; /* must be last */
> @@ -264,6 +266,7 @@ xfs_zone_gc_iter_init(
> iter->rec_count = 0;
> iter->rec_idx = 0;
> iter->victim_rtg = victim_rtg;
> + atomic_inc(&victim_rtg->rtg_gccount);
> }
>
> /*
> @@ -362,6 +365,7 @@ xfs_zone_gc_query(
>
> return 0;
> done:
> + atomic_dec(&iter->victim_rtg->rtg_gccount);
> xfs_rtgroup_rele(iter->victim_rtg);
> iter->victim_rtg = NULL;
> return 0;
> @@ -451,6 +455,20 @@ xfs_zone_gc_pick_victim_from(
> if (!rtg)
> continue;
>
> + /*
> + * If the zone is already undergoing GC, don't pick it again.
> + *
> + * This prevents us from picking one of the zones for which we
> + * already submitted GC I/O, but for which the remapping hasn't
> + * concluded again. This won't cause data corruption, but
> + * increases write amplification and slows down GC, so this is
> + * a bad thing.
> + */
> + if (atomic_read(&rtg->rtg_gccount)) {
> + xfs_rtgroup_rele(rtg);
> + continue;
> + }
> +
> /* skip zones that are just waiting for a reset */
> if (rtg_rmap(rtg)->i_used_blocks == 0 ||
> rtg_rmap(rtg)->i_used_blocks >= victim_used) {
> @@ -688,6 +706,9 @@ xfs_zone_gc_start_chunk(
> chunk->scratch = &data->scratch[data->scratch_idx];
> chunk->data = data;
> chunk->oz = oz;
> + chunk->victim_rtg = iter->victim_rtg;
> + atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> + atomic_inc(&chunk->victim_rtg->rtg_gccount);
>
> bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
> bio->bi_end_io = xfs_zone_gc_end_io;
> @@ -710,6 +731,8 @@ static void
> xfs_zone_gc_free_chunk(
> struct xfs_gc_bio *chunk)
> {
> + atomic_dec(&chunk->victim_rtg->rtg_gccount);
> + xfs_rtgroup_rele(chunk->victim_rtg);
> list_del(&chunk->entry);
> xfs_open_zone_put(chunk->oz);
> xfs_irele(chunk->ip);
> @@ -770,6 +793,10 @@ xfs_zone_gc_split_write(
> split_chunk->oz = chunk->oz;
> atomic_inc(&chunk->oz->oz_ref);
>
> + split_chunk->victim_rtg = chunk->victim_rtg;
> + atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> + atomic_inc(&chunk->victim_rtg->rtg_gccount);
> +
> chunk->offset += split_len;
> chunk->len -= split_len;
> chunk->old_startblock += XFS_B_TO_FSB(data->mp, split_len);
> --
> 2.47.3
>
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
2025-10-17 6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
@ 2025-10-17 12:40 ` Carlos Maiolino
2025-10-18 4:09 ` Damien Le Moal
` (2 subsequent siblings)
3 siblings, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-10-17 12:40 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Hans Holmberg, Darrick J. Wong, linux-xfs
On Fri, Oct 17, 2025 at 08:07:03AM +0200, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent. Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_zone_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index e7e439918f6d..2790001ee0f1 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -246,6 +246,14 @@ xfs_zoned_map_extent(
> * If a data write raced with this GC write, keep the existing data in
> * the data fork, mark our newly written GC extent as reclaimable, then
> * move on to the next extent.
> + *
> + * Note that this can also happen when racing with operations that do
> + * not actually invalidate the data, but just move it to a different
> + * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> + * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE). If the
> + * data was just moved around, GC fails to free the zone, but the zone
> + * becomes a GC candidate again as soon as all previous GC I/O has
> + * finished and these blocks will be moved out eventually.
> */
> if (old_startblock != NULLFSBLOCK &&
> old_startblock != data.br_startblock)
> --
> 2.47.3
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
>
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-17 6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
2025-10-17 12:37 ` Carlos Maiolino
@ 2025-10-18 4:08 ` Damien Le Moal
2025-10-23 6:16 ` Darrick J. Wong
2 siblings, 0 replies; 16+ messages in thread
From: Damien Le Moal @ 2025-10-18 4:08 UTC (permalink / raw)
To: Christoph Hellwig, Carlos Maiolino
Cc: Hans Holmberg, Darrick J. Wong, linux-xfs
On 10/17/25 15:07, Christoph Hellwig wrote:
> When we are picking a zone for gc it might already be in the pipeline
> which can lead to us moving the same data twice resulting in in write
> amplification and a very unfortunate case where keep on garbage
s/keep/we keep
> collecting the zone we just filled with migrated data stopping all
> forward progress.
>
> Fix this by introducing a count of on-going GC operations on a zone, so
> and skip any zone with ongoing GC when picking a new victim.
>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
2025-10-17 6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
2025-10-17 12:40 ` Carlos Maiolino
@ 2025-10-18 4:09 ` Damien Le Moal
2025-10-20 8:19 ` Hans Holmberg
2025-10-23 6:21 ` Darrick J. Wong
3 siblings, 0 replies; 16+ messages in thread
From: Damien Le Moal @ 2025-10-18 4:09 UTC (permalink / raw)
To: Christoph Hellwig, Carlos Maiolino
Cc: Hans Holmberg, Darrick J. Wong, linux-xfs
On 10/17/25 15:07, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent. Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
2025-10-17 6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
2025-10-17 12:40 ` Carlos Maiolino
2025-10-18 4:09 ` Damien Le Moal
@ 2025-10-20 8:19 ` Hans Holmberg
2025-10-23 6:21 ` Darrick J. Wong
3 siblings, 0 replies; 16+ messages in thread
From: Hans Holmberg @ 2025-10-20 8:19 UTC (permalink / raw)
To: hch, Carlos Maiolino; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org
On 17/10/2025 08:07, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent. Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_zone_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index e7e439918f6d..2790001ee0f1 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -246,6 +246,14 @@ xfs_zoned_map_extent(
> * If a data write raced with this GC write, keep the existing data in
> * the data fork, mark our newly written GC extent as reclaimable, then
> * move on to the next extent.
> + *
> + * Note that this can also happen when racing with operations that do
> + * not actually invalidate the data, but just move it to a different
> + * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> + * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE). If the
> + * data was just moved around, GC fails to free the zone, but the zone
> + * becomes a GC candidate again as soon as all previous GC I/O has
> + * finished and these blocks will be moved out eventually.
> */
> if (old_startblock != NULLFSBLOCK &&
> old_startblock != data.br_startblock)
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-17 12:37 ` Carlos Maiolino
@ 2025-10-20 11:49 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-20 11:49 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Christoph Hellwig, Hans Holmberg, Darrick J. Wong, linux-xfs
On Fri, Oct 17, 2025 at 02:37:02PM +0200, Carlos Maiolino wrote:
> This looks good, but I think it's worth to add a:
>
> Fixes: 080d01c41 ("xfs: implement zoned garbage collection")
>
>
> If you agree I can update both nitpicks above when I pull the series,
> otherwise, feel free to add to the next version:
Sure. Let's see if I need a resend for another reason, but otherwise
your tweaks look good.
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-17 6:07 ` [PATCH 1/2] xfs: prevent gc from picking the same zone twice Christoph Hellwig
2025-10-17 12:37 ` Carlos Maiolino
2025-10-18 4:08 ` Damien Le Moal
@ 2025-10-23 6:16 ` Darrick J. Wong
2025-10-23 6:28 ` Christoph Hellwig
2 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2025-10-23 6:16 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Carlos Maiolino, Hans Holmberg, linux-xfs
On Fri, Oct 17, 2025 at 08:07:02AM +0200, Christoph Hellwig wrote:
> When we are picking a zone for gc it might already be in the pipeline
> which can lead to us moving the same data twice resulting in in write
> amplification and a very unfortunate case where keep on garbage
> collecting the zone we just filled with migrated data stopping all
> forward progress.
>
> Fix this by introducing a count of on-going GC operations on a zone, so
> and skip any zone with ongoing GC when picking a new victim.
>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_rtgroup.h | 6 ++++++
> fs/xfs/xfs_zone_gc.c | 27 +++++++++++++++++++++++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
> index d36a6ae0abe5..d4fcf591e63d 100644
> --- a/fs/xfs/libxfs/xfs_rtgroup.h
> +++ b/fs/xfs/libxfs/xfs_rtgroup.h
> @@ -50,6 +50,12 @@ struct xfs_rtgroup {
> uint8_t *rtg_rsum_cache;
> struct xfs_open_zone *rtg_open_zone;
> };
> +
> + /*
> + * Count of outstanding GC operations for zoned XFS. Any RTG with a
> + * non-zero rtg_gccount will not be picked as new GC victim.
> + */
> + atomic_t rtg_gccount;
> };
>
> /*
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 109877d9a6bf..efcb52796d05 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -114,6 +114,8 @@ struct xfs_gc_bio {
> /* Open Zone being written to */
> struct xfs_open_zone *oz;
>
> + struct xfs_rtgroup *victim_rtg;
> +
> /* Bio used for reads and writes, including the bvec used by it */
> struct bio_vec bv;
> struct bio bio; /* must be last */
> @@ -264,6 +266,7 @@ xfs_zone_gc_iter_init(
> iter->rec_count = 0;
> iter->rec_idx = 0;
> iter->victim_rtg = victim_rtg;
> + atomic_inc(&victim_rtg->rtg_gccount);
> }
>
> /*
> @@ -362,6 +365,7 @@ xfs_zone_gc_query(
>
> return 0;
> done:
> + atomic_dec(&iter->victim_rtg->rtg_gccount);
> xfs_rtgroup_rele(iter->victim_rtg);
> iter->victim_rtg = NULL;
> return 0;
> @@ -451,6 +455,20 @@ xfs_zone_gc_pick_victim_from(
> if (!rtg)
> continue;
>
> + /*
> + * If the zone is already undergoing GC, don't pick it again.
> + *
> + * This prevents us from picking one of the zones for which we
> + * already submitted GC I/O, but for which the remapping hasn't
> + * concluded again. This won't cause data corruption, but
"...but that I/O hasn't yet finished."
With that and the other comments corrected and a Fixes tag applied,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> + * increases write amplification and slows down GC, so this is
> + * a bad thing.
> + */
> + if (atomic_read(&rtg->rtg_gccount)) {
> + xfs_rtgroup_rele(rtg);
> + continue;
> + }
> +
> /* skip zones that are just waiting for a reset */
> if (rtg_rmap(rtg)->i_used_blocks == 0 ||
> rtg_rmap(rtg)->i_used_blocks >= victim_used) {
> @@ -688,6 +706,9 @@ xfs_zone_gc_start_chunk(
> chunk->scratch = &data->scratch[data->scratch_idx];
> chunk->data = data;
> chunk->oz = oz;
> + chunk->victim_rtg = iter->victim_rtg;
> + atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> + atomic_inc(&chunk->victim_rtg->rtg_gccount);
>
> bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
> bio->bi_end_io = xfs_zone_gc_end_io;
> @@ -710,6 +731,8 @@ static void
> xfs_zone_gc_free_chunk(
> struct xfs_gc_bio *chunk)
> {
> + atomic_dec(&chunk->victim_rtg->rtg_gccount);
> + xfs_rtgroup_rele(chunk->victim_rtg);
> list_del(&chunk->entry);
> xfs_open_zone_put(chunk->oz);
> xfs_irele(chunk->ip);
> @@ -770,6 +793,10 @@ xfs_zone_gc_split_write(
> split_chunk->oz = chunk->oz;
> atomic_inc(&chunk->oz->oz_ref);
>
> + split_chunk->victim_rtg = chunk->victim_rtg;
> + atomic_inc(&chunk->victim_rtg->rtg_group.xg_active_ref);
> + atomic_inc(&chunk->victim_rtg->rtg_gccount);
> +
> chunk->offset += split_len;
> chunk->len -= split_len;
> chunk->old_startblock += XFS_B_TO_FSB(data->mp, split_len);
> --
> 2.47.3
>
>
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
2025-10-17 6:07 ` [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent Christoph Hellwig
` (2 preceding siblings ...)
2025-10-20 8:19 ` Hans Holmberg
@ 2025-10-23 6:21 ` Darrick J. Wong
2025-10-23 6:30 ` Christoph Hellwig
3 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2025-10-23 6:21 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Carlos Maiolino, Hans Holmberg, linux-xfs
On Fri, Oct 17, 2025 at 08:07:03AM +0200, Christoph Hellwig wrote:
> Besides blocks being invalidated, there is another case when the original
> mapping could have changed between querying the rmap for GC and calling
> xfs_zoned_map_extent. Document it there as it took us quite some time
> to figure out what is going on while developing the multiple-GC
> protection fix.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_zone_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index e7e439918f6d..2790001ee0f1 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -246,6 +246,14 @@ xfs_zoned_map_extent(
> * If a data write raced with this GC write, keep the existing data in
> * the data fork, mark our newly written GC extent as reclaimable, then
> * move on to the next extent.
> + *
> + * Note that this can also happen when racing with operations that do
> + * not actually invalidate the data, but just move it to a different
> + * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> + * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE). If the
Or (eventually) this will also be possible if zonegc races with a plain
remapping via FICLONERANGE, right?
(Like, whenever reflink gets implemented)
I guess the zonegc write completion could just iterate the rtrmapbt for
records that overlap the old extent and remap them until there are no
rtrmapbt entries returned. Then you'd be able to catch the
exchange-range and reflink cases, right?
(Not asking for a reflink-aware zonegc right this second, just checking
my understanding.)
If the answers to both questions are yes then I've grokked this well
enough to say
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> + * data was just moved around, GC fails to free the zone, but the zone
> + * becomes a GC candidate again as soon as all previous GC I/O has
> + * finished and these blocks will be moved out eventually.
> */
> if (old_startblock != NULLFSBLOCK &&
> old_startblock != data.br_startblock)
> --
> 2.47.3
>
>
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-23 6:16 ` Darrick J. Wong
@ 2025-10-23 6:28 ` Christoph Hellwig
2025-10-23 15:02 ` Darrick J. Wong
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-23 6:28 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Carlos Maiolino, Hans Holmberg, linux-xfs
On Wed, Oct 22, 2025 at 11:16:22PM -0700, Darrick J. Wong wrote:
> > + /*
> > + * If the zone is already undergoing GC, don't pick it again.
> > + *
> > + * This prevents us from picking one of the zones for which we
> > + * already submitted GC I/O, but for which the remapping hasn't
> > + * concluded again. This won't cause data corruption, but
>
> "...but that I/O hasn't yet finished."
It's really the remapping after the I/O that has to finish, but given
that we talk about I/O earlier, maybe your wording is less confusing.
* Re: [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
2025-10-23 6:21 ` Darrick J. Wong
@ 2025-10-23 6:30 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-23 6:30 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Carlos Maiolino, Hans Holmberg, linux-xfs
On Wed, Oct 22, 2025 at 11:21:25PM -0700, Darrick J. Wong wrote:
> > + * Note that this can also happen when racing with operations that do
> > + * not actually invalidate the data, but just move it to a different
> > + * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
> > + * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE). If the
>
> Or (eventually) this will also be possible if zonegc races with a plain
> remapping via FICLONERANGE, right?
>
> (Like, whenever reflink gets implemented)
Yes.
> I guess the zonegc write completion could just iterate the rtrmapbt for
> records that overlap the old extent and remap them until there are no
> rtrmapbt entries returned. Then you'd be able to catch the
> exchange-range and reflink cases, right?
I don't think so. All these remap operations would at this point have
removed the rmapbt record for the old extent because they changed the
logical mapping. So I don't think we'd find anything useful there.
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-23 6:28 ` Christoph Hellwig
@ 2025-10-23 15:02 ` Darrick J. Wong
2025-10-23 15:04 ` Christoph Hellwig
0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2025-10-23 15:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Carlos Maiolino, Hans Holmberg, linux-xfs
On Thu, Oct 23, 2025 at 08:28:29AM +0200, Christoph Hellwig wrote:
> On Wed, Oct 22, 2025 at 11:16:22PM -0700, Darrick J. Wong wrote:
> > > + /*
> > > + * If the zone is already undergoing GC, don't pick it again.
> > > + *
> > > + * This prevents us from picking one of the zones for which we
> > > + * already submitted GC I/O, but for which the remapping hasn't
> > > + * concluded again. This won't cause data corruption, but
> >
> > "...but that I/O hasn't yet finished."
>
> It's really the remapping after the I/O that has to finish, but given
> that we talk about I/O earlier, maybe your wording is less confusing.
The word 'again' at the end of the sentence tripped me up. I think the
word you want there is "yet" (because "again" implies that this isn't
the first time we've remapped the space, which isn't true) but you could
also end the sentence with "concluded".
--D
* Re: [PATCH 1/2] xfs: prevent gc from picking the same zone twice
2025-10-23 15:02 ` Darrick J. Wong
@ 2025-10-23 15:04 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-23 15:04 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Carlos Maiolino, Hans Holmberg, linux-xfs
On Thu, Oct 23, 2025 at 08:02:32AM -0700, Darrick J. Wong wrote:
> > > > + *
> > > > + * This prevents us from picking one of the zones for which we
> > > > + * already submitted GC I/O, but for which the remapping hasn't
> > > > + * concluded again. This won't cause data corruption, but
> > >
> > > "...but that I/O hasn't yet finished."
> >
> > It's really the remapping after the I/O that has to finish, but given
> > that we talk about I/O earlier, maybe your wording is less confusing.
>
> The word 'again' at the end of the sentence tripped me up. I think the
> word you want there is "yet" (because "again" implies that this isn't
> the first time we've remapped the space, which isn't true) but you could
> also end the sentence wit` "concluded".
Both sound fine.
* [PATCH 2/2] xfs: document another racy GC case in xfs_zoned_map_extent
2025-10-23 15:17 fix for selecting a zone with active GC I/O for GC v2 Christoph Hellwig
@ 2025-10-23 15:17 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-10-23 15:17 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Hans Holmberg, Darrick J. Wong, linux-xfs, Damien Le Moal,
Carlos Maiolino
Besides blocks being invalidated, there is another case when the original
mapping could have changed between querying the rmap for GC and calling
xfs_zoned_map_extent. Document it there as it took us quite some time
to figure out what is going on while developing the multiple-GC
protection fix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/xfs_zone_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
index 27276519a4a7..336c9f184a75 100644
--- a/fs/xfs/xfs_zone_alloc.c
+++ b/fs/xfs/xfs_zone_alloc.c
@@ -246,6 +246,14 @@ xfs_zoned_map_extent(
* If a data write raced with this GC write, keep the existing data in
* the data fork, mark our newly written GC extent as reclaimable, then
* move on to the next extent.
+ *
+ * Note that this can also happen when racing with operations that do
+ * not actually invalidate the data, but just move it to a different
+ * inode (XFS_IOC_EXCHANGE_RANGE), or to a different offset inside the
+ * inode (FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE). If the
+ * data was just moved around, GC fails to free the zone, but the zone
+ * becomes a GC candidate again as soon as all previous GC I/O has
+ * finished and these blocks will be moved out eventually.
*/
if (old_startblock != NULLFSBLOCK &&
old_startblock != data.br_startblock)
--
2.47.3