public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-14  4:23 [PATCH 1/2] xfs: remove a stale comment in xfs_ioc_trim Christoph Hellwig
@ 2024-08-14  4:23 ` Christoph Hellwig
  2024-08-14  5:41   ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-14  4:23 UTC (permalink / raw)
  To: chandan.babu; +Cc: djwong, linux-xfs

Fix the newly added discard support to only offer a FITRIM range that
spans the RT device in addition to the main device if the RT device
actually supports discard.  Without this we'll incorrectly accept
a larger range than actually supported and confuse user space if the
RT device does not support discard.  This can easily happen when the
main device is a SSD but the RT device is a hard driver.

Move the code around a bit to keep the max_blocks and granularity
assignments together and to handle the case where only the RT device
supports discard as well.

Fixes: 3ba3ab1f6719 ("xfs: enable FITRIM on the realtime device")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_discard.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 6516afecce0979..46e28499a7d3ce 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -654,9 +654,9 @@ xfs_ioc_trim(
 	struct xfs_mount		*mp,
 	struct fstrim_range __user	*urange)
 {
-	unsigned int		granularity =
-		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
+	struct block_device	*bdev = mp->m_ddev_targp->bt_bdev;
 	struct block_device	*rt_bdev = NULL;
+	unsigned int		granularity = 0;
 	struct fstrim_range	range;
 	xfs_daddr_t		start, end;
 	xfs_extlen_t		minlen;
@@ -666,15 +666,6 @@ xfs_ioc_trim(
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
-	if (mp->m_rtdev_targp &&
-	    bdev_max_discard_sectors(mp->m_rtdev_targp->bt_bdev))
-		rt_bdev = mp->m_rtdev_targp->bt_bdev;
-	if (!bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev) && !rt_bdev)
-		return -EOPNOTSUPP;
-
-	if (rt_bdev)
-		granularity = max(granularity,
-				  bdev_discard_granularity(rt_bdev));
 
 	/*
 	 * We haven't recovered the log, so we cannot use our bnobt-guided
@@ -683,13 +674,33 @@ xfs_ioc_trim(
 	if (xfs_has_norecovery(mp))
 		return -EROFS;
 
+	if (bdev_max_discard_sectors(bdev)) {
+		max_blocks = mp->m_sb.sb_dblocks;
+		granularity = bdev_discard_granularity(bdev);
+	} else {
+		bdev = NULL;
+	}
+
+	if (mp->m_rtdev_targp) {
+		rt_bdev = mp->m_rtdev_targp->bt_bdev;
+		if (!bdev_max_discard_sectors(rt_bdev)) {
+			if (!bdev)
+				return -EOPNOTSUPP;
+			rt_bdev = NULL;
+		}
+	}
+	if (rt_bdev) {
+		max_blocks += mp->m_sb.sb_rblocks;
+		granularity =
+			max(granularity, bdev_discard_granularity(rt_bdev));
+	}
+
 	if (copy_from_user(&range, urange, sizeof(range)))
 		return -EFAULT;
 
 	range.minlen = max_t(u64, granularity, range.minlen);
 	minlen = XFS_B_TO_FSB(mp, range.minlen);
 
-	max_blocks = mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks;
 	if (range.start >= XFS_FSB_TO_B(mp, max_blocks) ||
 	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
 	    range.len < mp->m_sb.sb_blocksize)
@@ -698,7 +709,7 @@ xfs_ioc_trim(
 	start = BTOBB(range.start);
 	end = start + BTOBBT(range.len) - 1;
 
-	if (bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev)) {
+	if (bdev) {
 		error = xfs_trim_datadev_extents(mp, start, end, minlen,
 				&blocks_trimmed);
 		if (error)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-14  4:23 ` [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard Christoph Hellwig
@ 2024-08-14  5:41   ` Darrick J. Wong
  2024-08-14  5:48     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-14  5:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: chandan.babu, linux-xfs

On Wed, Aug 14, 2024 at 06:23:58AM +0200, Christoph Hellwig wrote:
> Fix the newly added discard support to only offer a FITRIM range that
> spans the RT device in addition to the main device if the RT device
> actually supports discard.  Without this we'll incorrectly accept
> a larger range than actually supported and confuse user space if the
> RT device does not support discard.  This can easily happen when the
> main device is a SSD but the RT device is a hard driver.
> 
> Move the code around a bit to keep the max_blocks and granularity
> assignments together and to handle the case where only the RT device
> supports discard as well.
> 
> Fixes: 3ba3ab1f6719 ("xfs: enable FITRIM on the realtime device")
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_discard.c | 37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index 6516afecce0979..46e28499a7d3ce 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -654,9 +654,9 @@ xfs_ioc_trim(
>  	struct xfs_mount		*mp,
>  	struct fstrim_range __user	*urange)
>  {
> -	unsigned int		granularity =
> -		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
> +	struct block_device	*bdev = mp->m_ddev_targp->bt_bdev;
>  	struct block_device	*rt_bdev = NULL;
> +	unsigned int		granularity = 0;
>  	struct fstrim_range	range;
>  	xfs_daddr_t		start, end;
>  	xfs_extlen_t		minlen;
> @@ -666,15 +666,6 @@ xfs_ioc_trim(
>  
>  	if (!capable(CAP_SYS_ADMIN))
>  		return -EPERM;
> -	if (mp->m_rtdev_targp &&
> -	    bdev_max_discard_sectors(mp->m_rtdev_targp->bt_bdev))
> -		rt_bdev = mp->m_rtdev_targp->bt_bdev;
> -	if (!bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev) && !rt_bdev)
> -		return -EOPNOTSUPP;

Does this still return EOPNOTSUPP if there's no rt device and the data
device doesn't support discard?  I only see an EOPNOTSUPP return if
there is a rt device and both devices don't support discard.

> -
> -	if (rt_bdev)
> -		granularity = max(granularity,
> -				  bdev_discard_granularity(rt_bdev));
>  
>  	/*
>  	 * We haven't recovered the log, so we cannot use our bnobt-guided
> @@ -683,13 +674,33 @@ xfs_ioc_trim(
>  	if (xfs_has_norecovery(mp))
>  		return -EROFS;
>  
> +	if (bdev_max_discard_sectors(bdev)) {
> +		max_blocks = mp->m_sb.sb_dblocks;
> +		granularity = bdev_discard_granularity(bdev);
> +	} else {
> +		bdev = NULL;
> +	}
> +
> +	if (mp->m_rtdev_targp) {
> +		rt_bdev = mp->m_rtdev_targp->bt_bdev;
> +		if (!bdev_max_discard_sectors(rt_bdev)) {
> +			if (!bdev)
> +				return -EOPNOTSUPP;
> +			rt_bdev = NULL;
> +		}
> +	}
> +	if (rt_bdev) {
> +		max_blocks += mp->m_sb.sb_rblocks;

I think this will break xfs_scrub, which (unlike fstrim) breaks up its
FITRIM requests into smaller pieces.  The (afwul) FITRIM interface says
that [0, dblocks) trims the data device, and [dblocks, dblocks +
rblocks) trims the realtime device.

If the data device doesn't support discard, max_blocks will be rblocks,
and that's what we use to validate the @start parameter.  For example,
if the data device is 10T spinning rust and the rt device is a 10G SSD,
max_blocks will be 10G.  A FITRIM request for just the rt device will be
[10T, 10G), which now fails with EINVAL.

I don't have a fix to suggest for this yet, but let me play around with
this tomorrow and see if I can come up with something better, or figure
out how I'm being thick. ;)

My guess is that what we really want is if either device supports
discard we allow the full range, but if a specific device doesn't
support discard then we skip it and don't add anything to the outgoing
range.len.  But that's what I thought the current code does. <shrug>

--D

> +		granularity =
> +			max(granularity, bdev_discard_granularity(rt_bdev));
> +	}
> +
>  	if (copy_from_user(&range, urange, sizeof(range)))
>  		return -EFAULT;
>  
>  	range.minlen = max_t(u64, granularity, range.minlen);
>  	minlen = XFS_B_TO_FSB(mp, range.minlen);
>  
> -	max_blocks = mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks;
>  	if (range.start >= XFS_FSB_TO_B(mp, max_blocks) ||
>  	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
>  	    range.len < mp->m_sb.sb_blocksize)
> @@ -698,7 +709,7 @@ xfs_ioc_trim(
>  	start = BTOBB(range.start);
>  	end = start + BTOBBT(range.len) - 1;
>  
> -	if (bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev)) {
> +	if (bdev) {
>  		error = xfs_trim_datadev_extents(mp, start, end, minlen,
>  				&blocks_trimmed);
>  		if (error)
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-14  5:41   ` Darrick J. Wong
@ 2024-08-14  5:48     ` Christoph Hellwig
  2024-08-14  6:16       ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-14  5:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, chandan.babu, linux-xfs

On Tue, Aug 13, 2024 at 10:41:18PM -0700, Darrick J. Wong wrote:
> Does this still return EOPNOTSUPP if there's no rt device and the data
> device doesn't support discard?

Yes, I'll need to fix that.

> > +	if (rt_bdev) {
> > +		max_blocks += mp->m_sb.sb_rblocks;
> 
> I think this will break xfs_scrub, which (unlike fstrim) breaks up its
> FITRIM requests into smaller pieces.

No breakage noticed during my testing, but I'm not sure how that
would have materialized anyway..

> The (afwul) FITRIM interface says
> that [0, dblocks) trims the data device, and [dblocks, dblocks +
> rblocks) trims the realtime device.
> 
> If the data device doesn't support discard, max_blocks will be rblocks,
> and that's what we use to validate the @start parameter.  For example,
> if the data device is 10T spinning rust and the rt device is a 10G SSD,
> max_blocks will be 10G.  A FITRIM request for just the rt device will be
> [10T, 10G), which now fails with EINVAL.
> 
> I don't have a fix to suggest for this yet, but let me play around with
> this tomorrow and see if I can come up with something better, or figure
> out how I'm being thick. ;)
> 
> My guess is that what we really want is if either device supports
> discard we allow the full range, but if a specific device doesn't
> support discard then we skip it and don't add anything to the outgoing
> range.len.  But that's what I thought the current code does. <shrug>

The problem is that if we allow the full range, but return a smaller
around generic/260 fails.  That is with your patches to not fail if
FITRIM for the RT device is supported, without them it always fails
as soon as FITRIM is supported on the RT device.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-14  5:48     ` Christoph Hellwig
@ 2024-08-14  6:16       ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-14  6:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: chandan.babu, linux-xfs

On Wed, Aug 14, 2024 at 07:48:38AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 13, 2024 at 10:41:18PM -0700, Darrick J. Wong wrote:
> > Does this still return EOPNOTSUPP if there's no rt device and the data
> > device doesn't support discard?
> 
> Yes, I'll need to fix that.
> 
> > > +	if (rt_bdev) {
> > > +		max_blocks += mp->m_sb.sb_rblocks;
> > 
> > I think this will break xfs_scrub, which (unlike fstrim) breaks up its
> > FITRIM requests into smaller pieces.
> 
> No breakage noticed during my testing, but I'm not sure how that
> would have materialized anyway..

scrub silently starts discarding less than it does now.  My guess is
nobody would notice because meh and who cares? ;)

> > The (afwul) FITRIM interface says
> > that [0, dblocks) trims the data device, and [dblocks, dblocks +
> > rblocks) trims the realtime device.
> > 
> > If the data device doesn't support discard, max_blocks will be rblocks,
> > and that's what we use to validate the @start parameter.  For example,
> > if the data device is 10T spinning rust and the rt device is a 10G SSD,
> > max_blocks will be 10G.  A FITRIM request for just the rt device will be
> > [10T, 10G), which now fails with EINVAL.
> > 
> > I don't have a fix to suggest for this yet, but let me play around with
> > this tomorrow and see if I can come up with something better, or figure
> > out how I'm being thick. ;)
> > 
> > My guess is that what we really want is if either device supports
> > discard we allow the full range, but if a specific device doesn't
> > support discard then we skip it and don't add anything to the outgoing
> > range.len.  But that's what I thought the current code does. <shrug>
> 
> The problem is that if we allow the full range, but return a smaller
> around generic/260 fails.  That is with your patches to not fail if
> FITRIM for the RT device is supported, without them it always fails
> as soon as FITRIM is supported on the RT device.

Ugh.  We can't just lie and add unsupported ranges to range.len because
(a) that's dishonest even if there's no expectation that the discards
did anything and (b) we'd have walk the freespace metadata just to lie.

OTOH generic/260 is butt-ugly with all the accounting special cases in
there.  Maybe there's a way to change the test to figure out how much
discarding is possible on an xfs filesystem so that it can handle
hetergeneous filesystems?

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* fix FITRIM with non-discard capable RT device v2
@ 2024-08-16  8:18 Christoph Hellwig
  2024-08-16  8:18 ` [PATCH 1/2] xfs: remove a stale comment in xfs_ioc_trim Christoph Hellwig
  2024-08-16  8:18 ` [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard Christoph Hellwig
  0 siblings, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-16  8:18 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs

Hi all,

this series fixes FITRIM when an RT device that does not support
discards is present.

Changes since v1:
 - explicitly reject the discard capable RT device and non-discard capable
   main device case because it is too much of a mess to handle and not
   practically useful

Diffstat:
 xfs_discard.c |   57 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 26 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] xfs: remove a stale comment in xfs_ioc_trim
  2024-08-16  8:18 fix FITRIM with non-discard capable RT device v2 Christoph Hellwig
@ 2024-08-16  8:18 ` Christoph Hellwig
  2024-08-16  8:18 ` [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-16  8:18 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs

There is no truncating down going on here, the code has changed multiple
times since the comment was added with the initial FITRIM implementation
and it doesn't make sense in the current context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_discard.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 80586336276c19..d56efe9eae2cef 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -720,13 +720,6 @@ xfs_ioc_trim(
 	range.minlen = max_t(u64, granularity, range.minlen);
 	minlen = XFS_B_TO_FSB(mp, range.minlen);
 
-	/*
-	 * Truncating down the len isn't actually quite correct, but using
-	 * BBTOB would mean we trivially get overflows for values
-	 * of ULLONG_MAX or slightly lower.  And ULLONG_MAX is the default
-	 * used by the fstrim application.  In the end it really doesn't
-	 * matter as trimming blocks is an advisory interface.
-	 */
 	max_blocks = mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks;
 	if (range.start >= XFS_FSB_TO_B(mp, max_blocks) ||
 	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-16  8:18 fix FITRIM with non-discard capable RT device v2 Christoph Hellwig
  2024-08-16  8:18 ` [PATCH 1/2] xfs: remove a stale comment in xfs_ioc_trim Christoph Hellwig
@ 2024-08-16  8:18 ` Christoph Hellwig
  2024-08-16 21:50   ` Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-16  8:18 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs

Fix the newly added discard support to only offer a FITRIM range that
spans the RT device in addition to the main device if the RT device
actually supports discard.  Without this we'll incorrectly accept
a larger range than actually supported and confuse user space if the
RT device does not support discard.  This can easily happen when the
main device is a SSD but the RT device is a hard driver.

Move the code around a bit to keep the max_blocks and granularity
assignments together and explicitly reject that case where only the
RT device supports discard, as that does not fit the way the FITRIM ABI
works very well and is a very fringe use case.

Fixes: 3ba3ab1f6719 ("xfs: enable FITRIM on the realtime device")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_discard.c | 50 +++++++++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index d56efe9eae2cef..77d9d2b39f9b00 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -685,27 +685,18 @@ xfs_ioc_trim(
 	struct xfs_mount		*mp,
 	struct fstrim_range __user	*urange)
 {
-	unsigned int		granularity =
-		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
+	struct block_device	*bdev = mp->m_ddev_targp->bt_bdev;
 	struct block_device	*rt_bdev = NULL;
+	unsigned int		granularity = bdev_discard_granularity(bdev);
+	xfs_rfsblock_t		max_blocks = mp->m_sb.sb_dblocks;
 	struct fstrim_range	range;
 	xfs_daddr_t		start, end;
 	xfs_extlen_t		minlen;
-	xfs_rfsblock_t		max_blocks;
 	uint64_t		blocks_trimmed = 0;
 	int			error, last_error = 0;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
-	if (mp->m_rtdev_targp &&
-	    bdev_max_discard_sectors(mp->m_rtdev_targp->bt_bdev))
-		rt_bdev = mp->m_rtdev_targp->bt_bdev;
-	if (!bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev) && !rt_bdev)
-		return -EOPNOTSUPP;
-
-	if (rt_bdev)
-		granularity = max(granularity,
-				  bdev_discard_granularity(rt_bdev));
 
 	/*
 	 * We haven't recovered the log, so we cannot use our bnobt-guided
@@ -714,13 +705,36 @@ xfs_ioc_trim(
 	if (xfs_has_norecovery(mp))
 		return -EROFS;
 
+	/*
+	 * If the main device doesn't support discards, fail the entire ioctl
+	 * as the RT blocks are mapped after the main device blocks, and we'd
+	 * have to fake operation results for the main device for the ABI to
+	 * work.
+	 *
+	 * Note that while a main device that supports discards (SSD) and a RT
+	 * device that does not (HDD) is a fairly common setup, the reverse is
+	 * theoretically possible but rather odd, so this is not much of a loss.
+	 */
+	if (!bdev_max_discard_sectors(bdev))
+		return -EOPNOTSUPP;
+
+	if (mp->m_rtdev_targp) {
+		rt_bdev = mp->m_rtdev_targp->bt_bdev;
+		if (!bdev_max_discard_sectors(rt_bdev))
+			rt_bdev = NULL;
+	}
+	if (rt_bdev) {
+		max_blocks += mp->m_sb.sb_rblocks;
+		granularity =
+			max(granularity, bdev_discard_granularity(rt_bdev));
+	}
+
 	if (copy_from_user(&range, urange, sizeof(range)))
 		return -EFAULT;
 
 	range.minlen = max_t(u64, granularity, range.minlen);
 	minlen = XFS_B_TO_FSB(mp, range.minlen);
 
-	max_blocks = mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks;
 	if (range.start >= XFS_FSB_TO_B(mp, max_blocks) ||
 	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
 	    range.len < mp->m_sb.sb_blocksize)
@@ -729,12 +743,10 @@ xfs_ioc_trim(
 	start = BTOBB(range.start);
 	end = start + BTOBBT(range.len) - 1;
 
-	if (bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev)) {
-		error = xfs_trim_datadev_extents(mp, start, end, minlen,
-				&blocks_trimmed);
-		if (error)
-			last_error = error;
-	}
+	error = xfs_trim_datadev_extents(mp, start, end, minlen,
+			&blocks_trimmed);
+	if (error)
+		last_error = error;
 
 	if (rt_bdev && !xfs_trim_should_stop()) {
 		error = xfs_trim_rtdev_extents(mp, start, end, minlen,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-16  8:18 ` [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard Christoph Hellwig
@ 2024-08-16 21:50   ` Darrick J. Wong
  2024-08-19 12:44     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-16 21:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs

On Fri, Aug 16, 2024 at 10:18:43AM +0200, Christoph Hellwig wrote:
> Fix the newly added discard support to only offer a FITRIM range that
> spans the RT device in addition to the main device if the RT device
> actually supports discard.  Without this we'll incorrectly accept
> a larger range than actually supported and confuse user space if the
> RT device does not support discard.  This can easily happen when the
> main device is a SSD but the RT device is a hard driver.
> 
> Move the code around a bit to keep the max_blocks and granularity
> assignments together and explicitly reject that case where only the
> RT device supports discard, as that does not fit the way the FITRIM ABI
> works very well and is a very fringe use case.

Is there more to this than generic/260 failing?  And if not, does the
following patch things up for you?

--D

From: Darrick J. Wong <djwong@kernel.org>
Subject: [PATCH] generic/260: fix for multi-device xfs with mixed discard support

Fix this test so that it can handle XFS filesystems with a realtime
volume when the data and rt devices do not both support discard.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/rc  |   12 ++++++++++-
 common/xfs |   65 +++++++++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/common/rc b/common/rc
index b718030a59..426f2de43b 100644
--- a/common/rc
+++ b/common/rc
@@ -4207,6 +4207,16 @@ _require_batched_discard()
 	fi
 }
 
+_bdev_queue_property()
+{
+	local dev="$1"
+	local property="$2"
+	local default="$3"
+
+	local fname="/sys/block/$(_short_dev "$dev")/queue/$property"
+	cat "$fname" 2>/dev/null || echo "$default"
+}
+
 # Given a mountpoint and the device associated with that mountpoint, return the
 # maximum start offset that the FITRIM command will accept, in units of 1024
 # byte blocks.
@@ -4214,7 +4224,7 @@ _discard_max_offset_kb()
 {
 	case "$FSTYP" in
 	xfs)
-		_xfs_discard_max_offset_kb "$1"
+		_xfs_discard_max_offset_kb "$@"
 		;;
 	*)
 		$DF_PROG -k | awk -v dev="$2" -v mnt="$1" '$1 == dev && $7 == mnt { print $3 }'
diff --git a/common/xfs b/common/xfs
index 9501adac4c..5ad60a3ddc 100644
--- a/common/xfs
+++ b/common/xfs
@@ -1882,7 +1882,13 @@ _require_xfs_scratch_atomicswap()
 # of 1024 byte blocks.
 _xfs_discard_max_offset_kb()
 {
+	local mount="$1"
+	local dev="$2"
 	local statfs
+	local datadev_discard=
+	local rtdev_discard=
+	local dev_discard_max=
+	local rtdev_discard_max=
 
 	# Use awk to read the statfs output for the XFS filesystem, compute
 	# the two possible FITRIM offset maximums, and then use some horrid
@@ -1895,31 +1901,66 @@ _xfs_discard_max_offset_kb()
 	# 2: Realtime volume size in fsblocks.
 	# 3: Max FITRIM offset if we can only trim the data volume
 	# 4: Max FITRIM offset if we can trim the data and rt volumes
-	readarray -t statfs < <($XFS_IO_PROG -c 'statfs' "$1" | \
-		awk '{g[$1] = $3} END {printf("%d\n%d\n%d\n%d\n%d\n",
+	# 5: Max FITRIM offset if we can only trim the rt volume
+	readarray -t statfs < <($XFS_IO_PROG -c 'statfs' "$mount" | \
+		awk '{g[$1] = $3} END {printf("%d\n%d\n%d\n%d\n%d\n%d\n",
 			g["geom.bsize"],
 			g["geom.datablocks"],
 			g["geom.rtblocks"],
 			g["geom.bsize"] * g["geom.datablocks"] / 1024,
-			g["geom.bsize"] * (g["geom.datablocks"] + g["geom.rtblocks"]) / 1024);}')
+			g["geom.bsize"] * (g["geom.datablocks"] + g["geom.rtblocks"]) / 1024,
+			g["geom.bsize"] * g["geom.rtblocks"] / 1024);}')
 
 	# If the kernel supports discarding the realtime volume, then it will
 	# not reject a FITRIM for fsblock dblks+1, even if the len/minlen
 	# arguments are absurd.
 	if [ "${statfs[2]}" -gt 0 ]; then
-		if $FSTRIM_PROG -o "$((statfs[0] * statfs[1]))" \
+		case "$dev" in
+		"$SCRATCH_DEV")
+			rtdev_discard_max="$(_bdev_queue_property "$SCRATCH_RTDEV" discard_max_bytes 0)"
+			;;
+		"$TEST_DEV")
+			rtdev_discard_max="$(_bdev_queue_property "$TEST_RTDEV" discard_max_bytes 0)"
+			;;
+		*)
+			echo "Unrecognized device $dev" >&2
+			rtdev_discard_max=0
+			;;
+		esac
+
+		if [ "$rtdev_discard_max" -gt 0 ] &&
+		   $FSTRIM_PROG -o "$((statfs[0] * statfs[1]))" \
 				-l "${statfs[0]}" \
-				-m "$((statfs[0] * 2))" "$1" &>/dev/null; then
-			# The kernel supports discarding the rt volume, so
-			# print out the second answer from above.
-			echo "${statfs[4]}"
-			return
+				-m "$((statfs[0] * 2))" "$mount" &>/dev/null; then
+			# The kernel supports discarding the rt volume
+			rtdev_discard=2
 		fi
 	fi
 
-	# The kernel does not support discarding the rt volume or there is no
-	# rt volume.  Print out the first answer from above.
-	echo "${statfs[3]}"
+	# The kernel supports discarding the rt volume
+	dev_discard_max="$(_bdev_queue_property "$dev" discard_max_bytes 0)"
+	if [ "$dev_discard_max" -gt 0 ]; then
+		datadev_discard=1
+	fi
+
+	case "$datadev_discard$rtdev_discard" in
+	"12")
+		# Both devices support it
+		echo "${statfs[4]}"
+		;;
+	"1")
+		# Only the data device supports it
+		echo "${statfs[3]}"
+		;;
+	"2")
+		# Only the rt device supports it
+		echo "${statfs[5]}"
+		;;
+	*)
+		# No support at all
+		echo 0
+		;;
+	esac
 }
 
 # check if mkfs and the kernel support nocrc (v4) file systems

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-16 21:50   ` Darrick J. Wong
@ 2024-08-19 12:44     ` Christoph Hellwig
  2024-08-19 15:00       ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-19 12:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs

On Fri, Aug 16, 2024 at 02:50:17PM -0700, Darrick J. Wong wrote:
> Is there more to this than generic/260 failing?

The only other users of the detection is generic/251, which doesn't
seem to care about the exact value.

> And if not, does the
> following patch things up for you?

This works.  OTOH it will break again with the zoned RT subvolume
which can't support FITRIM even on devices that claim it.  And for
actual users that care (and not just xfstests) these kinds of hacks
don't seem very palatable..


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-19 12:44     ` Christoph Hellwig
@ 2024-08-19 15:00       ` Darrick J. Wong
  2024-08-19 15:08         ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-19 15:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs

On Mon, Aug 19, 2024 at 02:44:07PM +0200, Christoph Hellwig wrote:
> On Fri, Aug 16, 2024 at 02:50:17PM -0700, Darrick J. Wong wrote:
> > Is there more to this than generic/260 failing?
> 
> The only other users of the detection is generic/251, which doesn't
> seem to care about the exact value.
> 
> > And if not, does the
> > following patch things up for you?
> 
> This works.  OTOH it will break again with the zoned RT subvolume
> which can't support FITRIM even on devices that claim it.  And for
> actual users that care (and not just xfstests) these kinds of hacks
> don't seem very palatable..

What does discard do on a zoned device?  Is that how you reset the write
pointer?  And does that mean that either you tell the device to discard
everything it's written in a zone, or it will do nothing?

I see that this test really just puts a lower bound on how much FITRIM
can report that it discarded from an empty fs.  That number is already
rather meaningless on XFS because the amount "discarded" is roughly
(free space - (amount of pending gc work + discard work already in
progress)) and the user has no visibility into the amount of pending gc
work.  And you can repeatedly FITRIM and it reports the same number
since we have no idea if the device actually *did* anything.

Hmm.  No manpage for FITRIM.  Why don't we return the number of bytes
in the space map that we iterated as range.len?  Or perhaps leave it
unchanged?

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-19 15:00       ` Darrick J. Wong
@ 2024-08-19 15:08         ` Christoph Hellwig
  2024-08-20 16:19           ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-19 15:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs

On Mon, Aug 19, 2024 at 08:00:30AM -0700, Darrick J. Wong wrote:
> > This works.  OTOH it will break again with the zoned RT subvolume
> > which can't support FITRIM even on devices that claim it.  And for
> > actual users that care (and not just xfstests) these kinds of hacks
> > don't seem very palatable..
> 
> What does discard do on a zoned device?  Is that how you reset the write
> pointer?  And does that mean that either you tell the device to discard
> everything it's written in a zone, or it will do nothing?

On an actual zone device it will probably do nothing.  But at least for
NVMe the command used to implement discard is mandatory, so all
devices will show support.  We also support the zoned mode on
conventional devices, but instead of through FITRIM we want to issue
it instad of a zone reset when the whole rtg has been garbage collected.

> Hmm.  No manpage for FITRIM.  Why don't we return the number of bytes
> in the space map that we iterated as range.len?  Or perhaps leave it
> unchanged?

The above would seem sensible.  Not sure if we can still pull it
off, though.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-19 15:08         ` Christoph Hellwig
@ 2024-08-20 16:19           ` Darrick J. Wong
  2024-08-20 16:35             ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-20 16:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs

On Mon, Aug 19, 2024 at 05:08:04PM +0200, Christoph Hellwig wrote:
> On Mon, Aug 19, 2024 at 08:00:30AM -0700, Darrick J. Wong wrote:
> > > This works.  OTOH it will break again with the zoned RT subvolume
> > > which can't support FITRIM even on devices that claim it.  And for
> > > actual users that care (and not just xfstests) these kinds of hacks
> > > don't seem very palatable..
> > 
> > What does discard do on a zoned device?  Is that how you reset the write
> > pointer?  And does that mean that either you tell the device to discard
> > everything it's written in a zone, or it will do nothing?
> 
> On an actual zone device it will probably do nothing.  But at least for
> NVMe the command used to implement discard is mandatory, so all
> devices will show support.  We also support the zoned mode on
> conventional devices, but instead of through FITRIM we want to issue
> it instad of a zone reset when the whole rtg has been garbage collected.
> 
> > Hmm.  No manpage for FITRIM.  Why don't we return the number of bytes
> > in the space map that we iterated as range.len?  Or perhaps leave it
> > unchanged?
> 
> The above would seem sensible.  Not sure if we can still pull it
> off, though.

It seems to have survived testing on TOT overnight, so I'll bake it into
djwong-dev when I go through and remove the rtgroups/rtsb feature bits
today.  And I guess the rtgroups xarray conversion too.

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard
  2024-08-20 16:19           ` Darrick J. Wong
@ 2024-08-20 16:35             ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-20 16:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs

On Tue, Aug 20, 2024 at 09:19:55AM -0700, Darrick J. Wong wrote:
> It seems to have survived testing on TOT overnight, so I'll bake it into
> djwong-dev

Note that this a fix for the new RT discard code in 6.11 and it would
be kinda nice to get the fix into the 6.11 tree so that we don't have
a release with the current state.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-08-20 16:35 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-16  8:18 fix FITRIM with non-discard capable RT device v2 Christoph Hellwig
2024-08-16  8:18 ` [PATCH 1/2] xfs: remove a stale comment in xfs_ioc_trim Christoph Hellwig
2024-08-16  8:18 ` [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard Christoph Hellwig
2024-08-16 21:50   ` Darrick J. Wong
2024-08-19 12:44     ` Christoph Hellwig
2024-08-19 15:00       ` Darrick J. Wong
2024-08-19 15:08         ` Christoph Hellwig
2024-08-20 16:19           ` Darrick J. Wong
2024-08-20 16:35             ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2024-08-14  4:23 [PATCH 1/2] xfs: remove a stale comment in xfs_ioc_trim Christoph Hellwig
2024-08-14  4:23 ` [PATCH 2/2] xfs: don't extend the FITRIM range if the rt device does not support discard Christoph Hellwig
2024-08-14  5:41   ` Darrick J. Wong
2024-08-14  5:48     ` Christoph Hellwig
2024-08-14  6:16       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox