public inbox for stable@vger.kernel.org
* [PATCH v2 0/1] scsi: sas: fix mkfs.xfs failure due to bogus optimal_io_size
@ 2026-03-18  7:43 Ionut Nechita (Wind River)
  2026-03-18  7:43 ` [PATCH 1/1] scsi: sas: skip opt_sectors when DMA reports no real optimization hint Ionut Nechita (Wind River)
  2026-03-18  8:51 ` [PATCH v2 0/1] scsi: sas: fix mkfs.xfs failure due to bogus optimal_io_size John Garry
  0 siblings, 2 replies; 5+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-18  7:43 UTC (permalink / raw)
  To: James E . J . Bottomley, Martin K . Petersen
  Cc: ahuang12, axboe, damien.lemoal, hch, iommu, ionut_n2001,
	john.g.garry, kbusch, linux-kernel, linux-nvme, linux-scsi,
	m.szyprowski, robin.murphy, sagi, stable, sunlightlinux,
	Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

v2:
  - Dropped the dma_opt_mapping_size() change per Robin Murphy's feedback:
    the DMA core semantics are correct, the bug is in the caller.
  - Dropped the nvme-pci patch (no longer needed).
  - Single patch now fixes the actual bug in scsi_transport_sas.c by
    checking if dma_opt_mapping_size() == dma_max_mapping_size() before
    setting opt_sectors.  When they are equal, no backend provided a
    real hint.
  - Added concrete values from the affected system (Dell PowerEdge R750,
    mpt3sas, SAMSUNG MZILT800HBHQ0D3) to the commit message.

v1 feedback summary:
  - Robin Murphy: dma_opt_mapping_size() semantics are correct; if no
    restriction exists, the largest efficient size IS the largest size.
    Fix the caller, not the common code.
  - John Garry: Asked for concrete max_sectors/opt_sectors values and
    questioned whether sd_revalidate_disk() would override opt_sectors
    via opt_xfer_blocks.
  - Damien Le Moal: Suggested min_not_zero() for nvme-pci (now moot).

Answer to John's question about opt_xfer_blocks:
  The SAS disks on this system do not report Optimal Transfer Length in
  VPD page B0, so sdkp->opt_xfer_blocks = 0.  sd_revalidate_disk() uses
  min_not_zero(0, opt_sectors) which returns opt_sectors, propagating
  the bogus value.  Observed values:

    shost->max_sectors      = 32767
    opt_sectors             = 32767  (capped at max_sectors)
    optimal_io_size         = 16773120  (visible in lsblk --topology)
    minimum_io_size         = 8192

  mkfs.xfs computes swidth=4095 and sunit=2, then fails because
  4095 % 2 != 0.

Answer to John's question about blk_validate_limits() rounding:
  blk_validate_limits() rounds optimal_io_size down to physical_block_size
  (4096), but does NOT enforce that optimal_io_size is a multiple of
  minimum_io_size (8192).  So optimal_io_size=16773120 survives validation
  unchanged — it is already a multiple of 4096.  The mismatch only shows
  up when mkfs.xfs divides optimal_io_size by minimum_io_size and expects
  an integer result: 16773120 / 8192 = 2047.5, giving swidth=4095 and
  sunit=2, with 4095 % 2 != 0.
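
The arithmetic above can be reproduced with a small userspace model. This is
only a sketch, not code from the kernel or from mkfs.xfs: the function names
io_opt_bytes() and stripe_geometry_ok() are hypothetical, and the 4096-byte
filesystem block size is an assumption taken from the divisor used in the
patch description.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Round x down to a multiple of align (align must be non-zero). */
static uint64_t round_down_to(uint64_t x, uint64_t align)
{
	return x - (x % align);
}

/*
 * Hypothetical model: optimal_io_size in bytes as derived from
 * opt_sectors, rounded down to the physical block size.
 */
static uint64_t io_opt_bytes(uint32_t opt_sectors, uint32_t phys_block_size)
{
	return round_down_to((uint64_t)opt_sectors << SECTOR_SHIFT,
			     phys_block_size);
}

/*
 * Hypothetical model of the divisibility check described above:
 * returns 1 when swidth is a whole multiple of sunit, 0 otherwise.
 * fs_block_size is an assumed filesystem block size in bytes.
 */
static int stripe_geometry_ok(uint64_t io_opt, uint64_t io_min,
			      uint32_t fs_block_size)
{
	uint64_t swidth = io_opt / fs_block_size;
	uint64_t sunit = io_min / fs_block_size;

	return sunit != 0 && swidth % sunit == 0;
}
```

With the observed values (opt_sectors = 32767, physical block size 4096,
minimum_io_size 8192), io_opt_bytes() yields 16773120 and
stripe_geometry_ok() returns 0, matching the failure reported above.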

Test environment:
  - Dell PowerEdge R750
  - SAS Controller: Broadcom/LSI mpt3sas (SAS3816, FW 33.15.00.00)
  - Disks: SAMSUNG MZILT800HBHQ0D3 (800GB SCSI SAS SSD)
  - Kernel: 6.12.0-1-amd64 with intel_iommu=off
  - IOMMU: Disabled (DMAR: IOMMU disabled), default domain: Passthrough

Based on linux-next (next-20260316).

Link: https://lore.kernel.org/lkml/20260316203956.64515-1-ionut.nechita@windriver.com/

Ionut Nechita (1):
  scsi: sas: skip opt_sectors when DMA reports no real optimization hint

 drivers/scsi/scsi_transport_sas.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

--
2.43.0


* [PATCH 1/1] scsi: sas: skip opt_sectors when DMA reports no real optimization hint
  2026-03-18  7:43 [PATCH v2 0/1] scsi: sas: fix mkfs.xfs failure due to bogus optimal_io_size Ionut Nechita (Wind River)
@ 2026-03-18  7:43 ` Ionut Nechita (Wind River)
  2026-03-18  7:53   ` Christoph Hellwig
  2026-03-18 16:39   ` Robin Murphy
  2026-03-18  8:51 ` [PATCH v2 0/1] scsi: sas: fix mkfs.xfs failure due to bogus optimal_io_size John Garry
  1 sibling, 2 replies; 5+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-18  7:43 UTC (permalink / raw)
  To: James E . J . Bottomley, Martin K . Petersen
  Cc: ahuang12, axboe, damien.lemoal, hch, iommu, ionut_n2001,
	john.g.garry, kbusch, linux-kernel, linux-nvme, linux-scsi,
	m.szyprowski, robin.murphy, sagi, stable, sunlightlinux,
	Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

sas_host_setup() unconditionally sets shost->opt_sectors from
dma_opt_mapping_size().  When the IOMMU is disabled or in passthrough
mode and no DMA ops provide an opt_mapping_size callback,
dma_opt_mapping_size() returns min(dma_max_mapping_size(), SIZE_MAX)
which equals dma_max_mapping_size() — a hard upper bound, not an
optimization hint.

On a Dell PowerEdge R750 with mpt3sas (Broadcom SAS3816, FW 33.15.00.00)
and intel_iommu=off the following values are observed:

  dma_opt_mapping_size()  = dma_max_mapping_size() (no real hint)
  shost->max_sectors      = 32767
  opt_sectors             = min(32767, huge >> 9) = 32767
  optimal_io_size         = 32767 << 9 = 16776704
                          → round_down(16776704, 4096) = 16773120

The SAS disks (SAMSUNG MZILT800HBHQ0D3) do not report an
Optimal Transfer Length in VPD page B0, so sdkp->opt_xfer_blocks remains 0.
sd_revalidate_disk() then uses min_not_zero(0, opt_sectors) = opt_sectors,
propagating the bogus value into the block device's optimal_io_size
(visible as OPT-IO = 16773120 in lsblk --topology).

mkfs.xfs picks up optimal_io_size and minimum_io_size and computes:

  swidth = 16773120 / 4096 = 4095
  sunit  = 8192 / 4096     = 2

Since 4095 % 2 != 0, XFS rejects the geometry:

  SB stripe unit sanity check failed

This makes it impossible to create XFS filesystems (e.g. for
/var/lib/docker) during system bootstrap.

Fix this by only setting opt_sectors when dma_opt_mapping_size() returns
a value strictly less than dma_max_mapping_size(), which indicates a
genuine DMA optimization constraint from an IOMMU or DMA ops backend.
When they are equal, no backend provided a real hint, so leave
opt_sectors at its default of 0 ("no preference").

Fixes: 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 drivers/scsi/scsi_transport_sas.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 12124f9d5ccd..6b4de5116feb 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -240,8 +240,20 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 			   shost->host_no);
 
 	if (dma_dev->dma_mask) {
-		shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
-				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+		size_t opt = dma_opt_mapping_size(dma_dev);
+
+		/*
+		 * Only set opt_sectors when the DMA layer reports a
+		 * genuine optimization constraint.  When opt equals
+		 * dma_max_mapping_size() no backend provided a real
+		 * hint — the value is just the DMA maximum, which is
+		 * not useful as an optimal I/O size and can cause
+		 * mkfs.xfs to compute invalid stripe geometry.
+		 */
+		if (opt < dma_max_mapping_size(dma_dev))
+			shost->opt_sectors = min_t(unsigned int,
+					shost->max_sectors,
+					opt >> SECTOR_SHIFT);
 	}
 
 	return 0;
-- 
2.53.0



* Re: [PATCH 1/1] scsi: sas: skip opt_sectors when DMA reports no real optimization hint
  2026-03-18  7:43 ` [PATCH 1/1] scsi: sas: skip opt_sectors when DMA reports no real optimization hint Ionut Nechita (Wind River)
@ 2026-03-18  7:53   ` Christoph Hellwig
  2026-03-18 16:39   ` Robin Murphy
  1 sibling, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2026-03-18  7:53 UTC (permalink / raw)
  To: Ionut Nechita (Wind River)
  Cc: James E . J . Bottomley, Martin K . Petersen, ahuang12, axboe,
	damien.lemoal, hch, iommu, ionut_n2001, john.g.garry, kbusch,
	linux-kernel, linux-nvme, linux-scsi, m.szyprowski, robin.murphy,
	sagi, stable, sunlightlinux

>  	if (dma_dev->dma_mask) {
> -		shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
> -				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
> +		size_t opt = dma_opt_mapping_size(dma_dev);
> +
> +		/*
> +		 * Only set opt_sectors when the DMA layer reports a
> +		 * genuine optimization constraint.  When opt equals
> +		 * dma_max_mapping_size() no backend provided a real
> +		 * hint — the value is just the DMA maximum, which is
> +		 * not useful as an optimal I/O size and can cause
> +		 * mkfs.xfs to compute invalid stripe geometry.
> +		 */
> +		if (opt < dma_max_mapping_size(dma_dev))
> +			shost->opt_sectors = min_t(unsigned int,
> +					shost->max_sectors,
> +					opt >> SECTOR_SHIFT);

This looks reasonable, but please also round down the opt value
to a power of two when you touch this anyway.

With that added, this logic becomes complicated enough to warrant a
small helper that is clearly split out.
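
For illustration only, a userspace model of what such a split-out helper
might compute is sketched below. The names rounddown_pow2() and
sas_opt_sectors_model() are hypothetical and do not come from the patch; in
the kernel itself, rounddown_pow_of_two() from <linux/log2.h> would be the
natural choice for the rounding step. The dma_opt and dma_max parameters
stand in for the results of dma_opt_mapping_size() and
dma_max_mapping_size().

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Largest power of two that is <= x; returns 0 for x == 0. */
static uint32_t rounddown_pow2(uint32_t x)
{
	uint32_t p = 1;

	if (x == 0)
		return 0;
	while (p <= x / 2)
		p *= 2;
	return p;
}

/*
 * Hypothetical model of the patched logic plus the requested rounding:
 * return 0 ("no preference") when the DMA layer gives no hint below the
 * maximum, otherwise cap the hint at max_sectors and round it down to a
 * power of two.
 */
static uint32_t sas_opt_sectors_model(uint32_t max_sectors,
				      size_t dma_opt, size_t dma_max)
{
	uint64_t sectors;

	if (dma_opt >= dma_max)
		return 0;

	sectors = (uint64_t)dma_opt >> SECTOR_SHIFT;
	if (sectors > max_sectors)
		sectors = max_sectors;

	return rounddown_pow2((uint32_t)sectors);
}
```

With max_sectors = 32767, the model returns 0 when dma_opt equals dma_max,
and 2048 sectors when the DMA hint is 1 MiB.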


* Re: [PATCH v2 0/1] scsi: sas: fix mkfs.xfs failure due to bogus optimal_io_size
  2026-03-18  7:43 [PATCH v2 0/1] scsi: sas: fix mkfs.xfs failure due to bogus optimal_io_size Ionut Nechita (Wind River)
  2026-03-18  7:43 ` [PATCH 1/1] scsi: sas: skip opt_sectors when DMA reports no real optimization hint Ionut Nechita (Wind River)
@ 2026-03-18  8:51 ` John Garry
  1 sibling, 0 replies; 5+ messages in thread
From: John Garry @ 2026-03-18  8:51 UTC (permalink / raw)
  To: Ionut Nechita (Wind River), James E . J . Bottomley,
	Martin K . Petersen
  Cc: ahuang12, axboe, damien.lemoal, hch, iommu, ionut_n2001, kbusch,
	linux-kernel, linux-nvme, linux-scsi, m.szyprowski, robin.murphy,
	sagi, stable, sunlightlinux

On 18/03/2026 07:43, Ionut Nechita (Wind River) wrote:
> Answer to John's question about blk_validate_limits() rounding:
>    blk_validate_limits() rounds optimal_io_size down to physical_block_size
>    (4096), but does NOT enforce that optimal_io_size is a multiple of
>    minimum_io_size (8192).  So optimal_io_size=16773120 survives validation
>    unchanged — it is already a multiple of 4096.  The mismatch only shows
>    up when mkfs.xfs divides optimal_io_size by minimum_io_size and expects
>    an integer result: 16773120 / 8192 = 2047.5, giving swidth=4095 and
>    sunit=2, with 4095 % 2 != 0.

Thanks for the info. I feel that io_opt should be a multiple of
io_min and that we should enforce this in the block queue limits
validation, but that could mask problems like the one you have seen.


* Re: [PATCH 1/1] scsi: sas: skip opt_sectors when DMA reports no real optimization hint
  2026-03-18  7:43 ` [PATCH 1/1] scsi: sas: skip opt_sectors when DMA reports no real optimization hint Ionut Nechita (Wind River)
  2026-03-18  7:53   ` Christoph Hellwig
@ 2026-03-18 16:39   ` Robin Murphy
  1 sibling, 0 replies; 5+ messages in thread
From: Robin Murphy @ 2026-03-18 16:39 UTC (permalink / raw)
  To: Ionut Nechita (Wind River), James E . J . Bottomley,
	Martin K . Petersen
  Cc: ahuang12, axboe, damien.lemoal, hch, iommu, ionut_n2001,
	john.g.garry, kbusch, linux-kernel, linux-nvme, linux-scsi,
	m.szyprowski, sagi, stable, sunlightlinux

On 2026-03-18 7:43 am, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> sas_host_setup() unconditionally sets shost->opt_sectors from
> dma_opt_mapping_size().  When the IOMMU is disabled or in passthrough
> mode and no DMA ops provide an opt_mapping_size callback,
> dma_opt_mapping_size() returns min(dma_max_mapping_size(), SIZE_MAX)
> which equals dma_max_mapping_size() — a hard upper bound, not an
> optimization hint.
> 
> On a Dell PowerEdge R750 with mpt3sas (Broadcom SAS3816, FW 33.15.00.00)
> and intel_iommu=off the following values are observed:
> 
>    dma_opt_mapping_size()  = dma_max_mapping_size() (no real hint)
>    shost->max_sectors      = 32767
>    opt_sectors             = min(32767, huge >> 9) = 32767
>    optimal_io_size         = 32767 << 9 = 16776704
>                            → round_down(16776704, 4096) = 16773120
> 
> The SAS disks (SAMSUNG MZILT800HBHQ0D3) do not report an
> Optimal Transfer Length in VPD page B0, so sdkp->opt_xfer_blocks remains 0.
> sd_revalidate_disk() then uses min_not_zero(0, opt_sectors) = opt_sectors,
> propagating the bogus value into the block device's optimal_io_size
> (visible as OPT-IO = 16773120 in lsblk --topology).
> 
> mkfs.xfs picks up optimal_io_size and minimum_io_size and computes:
> 
>    swidth = 16773120 / 4096 = 4095
>    sunit  = 8192 / 4096     = 2
> 
> Since 4095 % 2 != 0, XFS rejects the geometry:
> 
>    SB stripe unit sanity check failed
> 
> This makes it impossible to create XFS filesystems (e.g. for
> /var/lib/docker) during system bootstrap.
> 
> Fix this by only setting opt_sectors when dma_opt_mapping_size() returns
> a value strictly less than dma_max_mapping_size(), which indicates a
> genuine DMA optimization constraint from an IOMMU or DMA ops backend.
> When they are equal, no backend provided a real hint, so leave
> opt_sectors at its default of 0 ("no preference").
> 
> Fixes: 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>   drivers/scsi/scsi_transport_sas.c | 16 ++++++++++++++--
>   1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
> index 12124f9d5ccd..6b4de5116feb 100644
> --- a/drivers/scsi/scsi_transport_sas.c
> +++ b/drivers/scsi/scsi_transport_sas.c
> @@ -240,8 +240,20 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
>   			   shost->host_no);
>   
>   	if (dma_dev->dma_mask) {
> -		shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
> -				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
> +		size_t opt = dma_opt_mapping_size(dma_dev);
> +
> +		/*
> +		 * Only set opt_sectors when the DMA layer reports a
> +		 * genuine optimization constraint.  When opt equals
> +		 * dma_max_mapping_size() no backend provided a real
> +		 * hint — the value is just the DMA maximum, which is
> +		 * not useful as an optimal I/O size and can cause
> +		 * mkfs.xfs to compute invalid stripe geometry.
> +		 */
> +		if (opt < dma_max_mapping_size(dma_dev))

The point is more that dma_opt_mapping_size() is *always* only ever a 
constraint, never a target. This code should be coming up with its own 
idea of whether max_sectors is large enough to be meaningless, and 
picking an initial opt_sectors value based on that, and only *then* 
potentially reducing that value further if the DMA API indicates it 
would be more efficient to do so. Making this conditional makes little 
sense even if it wasn't clearly still broken when dma_opt_mapping_size() 
== (dma_max_mapping_size() - n) for most non-zero values of n.

That said, the comment in sd_revalidate_disk() implies that opt_sectors 
itself is also only intended as an upper limit rather than a specific 
preference, so there wouldn't seem to be any harm in deriving a 
suitably-aligned value from dma_max_mapping_size() either.
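
As a rough userspace sketch of the ordering described above (not a proposed
patch): first derive an aligned starting value from max_sectors, then treat
the DMA figure purely as an upper bound. The names pow2_floor() and
opt_sectors_clamped() are hypothetical, and rounding down to a power of two
is only one assumed alignment policy. The dma_opt parameter stands in for
the result of dma_opt_mapping_size().

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Largest power of two that is <= x; returns 0 for x == 0. */
static uint64_t pow2_floor(uint64_t x)
{
	uint64_t p = 1;

	if (x == 0)
		return 0;
	while (p <= x / 2)
		p *= 2;
	return p;
}

/*
 * Hypothetical model: pick an aligned initial value from max_sectors,
 * then reduce it only if the (aligned) DMA constraint is smaller.
 * The DMA value is never used as a target, only as a limit, so the
 * result is the same whether dma_opt equals the DMA maximum or sits
 * slightly below it. A dma_opt below one sector yields 0.
 */
static uint32_t opt_sectors_clamped(uint32_t max_sectors, size_t dma_opt)
{
	uint64_t opt = pow2_floor(max_sectors);
	uint64_t dma_limit = pow2_floor((uint64_t)dma_opt >> SECTOR_SHIFT);

	if (dma_limit < opt)
		opt = dma_limit;

	return (uint32_t)opt;
}
```

With max_sectors = 32767 and no effective DMA restriction, this model gives
16384 sectors (8388608 bytes), which is a multiple of the 8192-byte
minimum_io_size reported for the affected disks.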

Thanks,
Robin.

> +			shost->opt_sectors = min_t(unsigned int,
> +					shost->max_sectors,
> +					opt >> SECTOR_SHIFT);
>   	}
>   
>   	return 0;


