public inbox for stable@vger.kernel.org
* [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists
@ 2026-03-16 20:39 Ionut Nechita (Wind River)
  2026-03-16 20:39 ` [PATCH v1 1/2] dma: return 0 from dma_opt_mapping_size() when no real " Ionut Nechita (Wind River)
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-16 20:39 UTC (permalink / raw)
  To: m.szyprowski, kbusch, axboe, hch, sagi
  Cc: robin.murphy, martin.petersen, damien.lemoal, john.g.garry,
	ahuang12, iommu, linux-nvme, linux-kernel, stable, ionut_n2001,
	sunlightlinux, Ionut Nechita (Wind River)

dma_opt_mapping_size() currently returns min(dma_max_mapping_size(),
SIZE_MAX) when neither an IOMMU nor a DMA ops opt_mapping_size callback
is present.  That value is the DMA maximum, not an optimal transfer
size, yet callers treat it as a genuine optimization hint.

The concrete problem shows up on SAS controllers (e.g. mpt3sas) running
with IOMMU in passthrough mode.  The bogus value propagates through
scsi_transport_sas into Scsi_Host.opt_sectors and then into the block
device's optimal_io_size.  mkfs.xfs picks it up, computes
swidth=4095 / sunit=2, and fails with:

  XFS: SB stripe unit sanity check failed

making it impossible to create filesystems during system bootstrap.

Patch 1 changes dma_opt_mapping_size() to return 0 ("no preference")
when no backend provides a real hint.

Patch 2 adjusts the only other in-tree caller (nvme-pci) to handle the
new 0 return value, falling back to its existing default instead of
setting max_hw_sectors to 0.

Note: the scsi_transport_sas caller (the one that triggers the XFS
issue) already handles 0 safely.  It passes the return value through
min_t() into shost->opt_sectors, which becomes 0; sd.c then feeds that
into min_not_zero() when computing io_opt, so a zero opt_sectors is
correctly treated as "no preference" and ignored.

Based on linux-next (next-20260316).

Ionut Nechita (2):
  dma: return 0 from dma_opt_mapping_size() when no real hint exists
  nvme-pci: handle dma_opt_mapping_size() returning 0

 drivers/nvme/host/pci.c | 15 ++++++++++-----
 kernel/dma/mapping.c    | 13 ++++++++-----
 2 files changed, 18 insertions(+), 10 deletions(-)

-- 
2.53.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v1 1/2] dma: return 0 from dma_opt_mapping_size() when no real hint exists
  2026-03-16 20:39 [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists Ionut Nechita (Wind River)
@ 2026-03-16 20:39 ` Ionut Nechita (Wind River)
  2026-03-17  9:43   ` Robin Murphy
  2026-03-16 20:39 ` [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0 Ionut Nechita (Wind River)
  2026-03-17  9:11 ` [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists John Garry
  2 siblings, 1 reply; 12+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-16 20:39 UTC (permalink / raw)
  To: m.szyprowski, kbusch, axboe, hch, sagi
  Cc: robin.murphy, martin.petersen, damien.lemoal, john.g.garry,
	ahuang12, iommu, linux-nvme, linux-kernel, stable, ionut_n2001,
	sunlightlinux, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

dma_opt_mapping_size() currently initializes its local size to SIZE_MAX
and, when neither an IOMMU nor a DMA ops opt_mapping_size callback is
present, returns min(dma_max_mapping_size(dev), SIZE_MAX).  That value
is a large but finite number that has nothing to do with an optimal
transfer size — it is simply the maximum the DMA layer can map.

Callers such as scsi_transport_sas treat the return value as a genuine
optimization hint and propagate it into Scsi_Host.opt_sectors, which in
turn becomes the block device's optimal_io_size.  On SAS controllers
like mpt3sas running with IOMMU in passthrough mode the bogus value
(max_sectors << 9 = 16776704, rounded to 16773120) reaches mkfs.xfs,
which computes swidth=4095 and sunit=2.  Because 4095 is not a multiple
of 2, XFS rejects the geometry with "SB stripe unit sanity check
failed", making it impossible to create filesystems during system
bootstrap.

Fix this by returning 0 when no backend provides an optimal mapping size
hint.  A return value of 0 unambiguously means "no preference" and lets
callers that use min() or min_not_zero() do the right thing without
special-casing.

The only other in-tree caller (nvme-pci) is adjusted in the next patch.

Fixes: a229cc14f339 ("dma-mapping: add dma_opt_mapping_size()")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 kernel/dma/mapping.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 78d8b4039c3e6..fffa6a3f191a3 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -984,14 +984,17 @@ EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 size_t dma_opt_mapping_size(struct device *dev)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
-	size_t size = SIZE_MAX;
 
 	if (use_dma_iommu(dev))
-		size = iommu_dma_opt_mapping_size();
-	else if (ops && ops->opt_mapping_size)
-		size = ops->opt_mapping_size();
+		return iommu_dma_opt_mapping_size();
+	if (ops && ops->opt_mapping_size)
+		return ops->opt_mapping_size();
 
-	return min(dma_max_mapping_size(dev), size);
+	/*
+	 * No backend provided an optimal size hint. Return 0 so that
+	 * callers can distinguish "no hint" from a real value.
+	 */
+	return 0;
 }
 EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
 
-- 
2.53.0



* [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0
  2026-03-16 20:39 [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists Ionut Nechita (Wind River)
  2026-03-16 20:39 ` [PATCH v1 1/2] dma: return 0 from dma_opt_mapping_size() when no real " Ionut Nechita (Wind River)
@ 2026-03-16 20:39 ` Ionut Nechita (Wind River)
  2026-03-16 21:21   ` Damien Le Moal
                     ` (2 more replies)
  2026-03-17  9:11 ` [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists John Garry
  2 siblings, 3 replies; 12+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-16 20:39 UTC (permalink / raw)
  To: m.szyprowski, kbusch, axboe, hch, sagi
  Cc: robin.murphy, martin.petersen, damien.lemoal, john.g.garry,
	ahuang12, iommu, linux-nvme, linux-kernel, stable, ionut_n2001,
	sunlightlinux, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

After the previous commit, dma_opt_mapping_size() returns 0 when no DMA
backend provides an optimal mapping size hint (e.g. IOMMU in passthrough
mode with no ops->opt_mapping_size callback).

The NVMe PCI driver used min_t(u32, NVME_MAX_BYTES >> SECTOR_SHIFT,
dma_opt_mapping_size() >> 9) to cap max_hw_sectors.  With a 0 return
value this would set max_hw_sectors to 0, which is invalid.

Guard the min_t so that max_hw_sectors is only capped when
dma_opt_mapping_size() provides a real hint.  When it returns 0, fall
back to the existing NVME_MAX_BYTES >> SECTOR_SHIFT default.

Fixes: 3710e2b056cb ("nvme-pci: clamp max_hw_sectors based on DMA optimized limitation")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 drivers/nvme/host/pci.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b78ba239c8ea8..dc148fb6eff28 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3640,6 +3640,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 {
 	unsigned long quirks = id->driver_data;
 	int node = dev_to_node(&pdev->dev);
+	size_t dma_opt;
 	struct nvme_dev *dev;
 	struct quirk_entry *qentry;
 	int ret = -ENOMEM;
@@ -3691,12 +3692,16 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
 
 	/*
-	 * Limit the max command size to prevent iod->sg allocations going
-	 * over a single page.
+	 * Limit the max command size to prevent iod->sg allocations
+	 * going over a single page.  Only apply the DMA optimal mapping
+	 * size limit when the DMA layer actually provides one (non-zero
+	 * return from dma_opt_mapping_size()).
 	 */
-	dev->ctrl.max_hw_sectors = min_t(u32,
-			NVME_MAX_BYTES >> SECTOR_SHIFT,
-			dma_opt_mapping_size(&pdev->dev) >> 9);
+	dev->ctrl.max_hw_sectors = NVME_MAX_BYTES >> SECTOR_SHIFT;
+	dma_opt = dma_opt_mapping_size(&pdev->dev);
+	if (dma_opt)
+		dev->ctrl.max_hw_sectors =
+			min_t(u32, dev->ctrl.max_hw_sectors, dma_opt >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;
 	dev->ctrl.max_integrity_segments = 1;
 	return dev;
-- 
2.53.0



* Re: [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0
  2026-03-16 20:39 ` [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0 Ionut Nechita (Wind River)
@ 2026-03-16 21:21   ` Damien Le Moal
  2026-03-17  8:55   ` John Garry
  2026-03-17 14:14   ` Christoph Hellwig
  2 siblings, 0 replies; 12+ messages in thread
From: Damien Le Moal @ 2026-03-16 21:21 UTC (permalink / raw)
  To: Ionut Nechita (Wind River), m.szyprowski, kbusch, axboe, hch,
	sagi
  Cc: robin.murphy, martin.petersen, damien.lemoal, john.g.garry,
	ahuang12, iommu, linux-nvme, linux-kernel, stable, ionut_n2001,
	sunlightlinux

On 3/17/26 05:39, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> After the previous commit, dma_opt_mapping_size() returns 0 when no DMA
> backend provides an optimal mapping size hint (e.g. IOMMU in passthrough
> mode with no ops->opt_mapping_size callback).
> 
> The NVMe PCI driver used min_t(u32, NVME_MAX_BYTES >> SECTOR_SHIFT,
> dma_opt_mapping_size() >> 9) to cap max_hw_sectors.  With a 0 return
> value this would set max_hw_sectors to 0, which is invalid.
> 
> Guard the min_t so that max_hw_sectors is only capped when
> dma_opt_mapping_size() provides a real hint.  When it returns 0, fall
> back to the existing NVME_MAX_BYTES >> SECTOR_SHIFT default.
> 
> Fixes: 3710e2b056cb ("nvme-pci: clamp max_hw_sectors based on DMA optimized limitation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>  drivers/nvme/host/pci.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b78ba239c8ea8..dc148fb6eff28 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3640,6 +3640,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>  {
>  	unsigned long quirks = id->driver_data;
>  	int node = dev_to_node(&pdev->dev);
> +	size_t dma_opt;
>  	struct nvme_dev *dev;
>  	struct quirk_entry *qentry;
>  	int ret = -ENOMEM;
> @@ -3691,12 +3692,16 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>  	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
>  
>  	/*
> -	 * Limit the max command size to prevent iod->sg allocations going
> -	 * over a single page.
> +	 * Limit the max command size to prevent iod->sg allocations
> +	 * going over a single page.  Only apply the DMA optimal mapping
> +	 * size limit when the DMA layer actually provides one (non-zero
> +	 * return from dma_opt_mapping_size()).
>  	 */
> -	dev->ctrl.max_hw_sectors = min_t(u32,
> -			NVME_MAX_BYTES >> SECTOR_SHIFT,
> -			dma_opt_mapping_size(&pdev->dev) >> 9);

Why not simply change this to min_not_zero() ? That would do the same. Are you
maybe getting a warning without the u32 cast ?

> +	dev->ctrl.max_hw_sectors = NVME_MAX_BYTES >> SECTOR_SHIFT;
> +	dma_opt = dma_opt_mapping_size(&pdev->dev);
> +	if (dma_opt)
> +		dev->ctrl.max_hw_sectors =
> +			min_t(u32, dev->ctrl.max_hw_sectors, dma_opt >> 9);
>  	dev->ctrl.max_segments = NVME_MAX_SEGS;
>  	dev->ctrl.max_integrity_segments = 1;
>  	return dev;


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0
  2026-03-16 20:39 ` [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0 Ionut Nechita (Wind River)
  2026-03-16 21:21   ` Damien Le Moal
@ 2026-03-17  8:55   ` John Garry
  2026-03-17 14:14   ` Christoph Hellwig
  2 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2026-03-17  8:55 UTC (permalink / raw)
  To: Ionut Nechita (Wind River), m.szyprowski, kbusch, axboe, hch,
	sagi
  Cc: robin.murphy, martin.petersen, damien.lemoal, ahuang12, iommu,
	linux-nvme, linux-kernel, stable, ionut_n2001, sunlightlinux

On 16/03/2026 20:39, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> After the previous commit, dma_opt_mapping_size() returns 0 when no DMA
> backend provides an optimal mapping size hint (e.g. IOMMU in passthrough
> mode with no ops->opt_mapping_size callback).
> 
> The NVMe PCI driver used min_t(u32, NVME_MAX_BYTES >> SECTOR_SHIFT,
> dma_opt_mapping_size() >> 9) to cap max_hw_sectors.  With a 0 return
> value this would set max_hw_sectors to 0, which is invalid.

With the first patch you have introduced a temporary breakage.

> 
> Guard the min_t so that max_hw_sectors is only capped when
> dma_opt_mapping_size() provides a real hint.  When it returns 0, fall
> back to the existing NVME_MAX_BYTES >> SECTOR_SHIFT default.
> 
> Fixes: 3710e2b056cb ("nvme-pci: clamp max_hw_sectors based on DMA optimized limitation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>   drivers/nvme/host/pci.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b78ba239c8ea8..dc148fb6eff28 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3640,6 +3640,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>   {
>   	unsigned long quirks = id->driver_data;
>   	int node = dev_to_node(&pdev->dev);
> +	size_t dma_opt;
>   	struct nvme_dev *dev;
>   	struct quirk_entry *qentry;
>   	int ret = -ENOMEM;
> @@ -3691,12 +3692,16 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>   	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
>   
>   	/*
> -	 * Limit the max command size to prevent iod->sg allocations going
> -	 * over a single page.
> +	 * Limit the max command size to prevent iod->sg allocations
> +	 * going over a single page.  Only apply the DMA optimal mapping
> +	 * size limit when the DMA layer actually provides one (non-zero
> +	 * return from dma_opt_mapping_size()).
>   	 */
> -	dev->ctrl.max_hw_sectors = min_t(u32,
> -			NVME_MAX_BYTES >> SECTOR_SHIFT,
> -			dma_opt_mapping_size(&pdev->dev) >> 9);
> +	dev->ctrl.max_hw_sectors = NVME_MAX_BYTES >> SECTOR_SHIFT;
> +	dma_opt = dma_opt_mapping_size(&pdev->dev);
> +	if (dma_opt)
> +		dev->ctrl.max_hw_sectors =
> +			min_t(u32, dev->ctrl.max_hw_sectors, dma_opt >> 9); 

SECTOR_SHIFT can be used instead of hard-coded '9'

>   	dev->ctrl.max_segments = NVME_MAX_SEGS;
>   	dev->ctrl.max_integrity_segments = 1;
>   	return dev;



* Re: [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists
  2026-03-16 20:39 [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists Ionut Nechita (Wind River)
  2026-03-16 20:39 ` [PATCH v1 1/2] dma: return 0 from dma_opt_mapping_size() when no real " Ionut Nechita (Wind River)
  2026-03-16 20:39 ` [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0 Ionut Nechita (Wind River)
@ 2026-03-17  9:11 ` John Garry
  2026-03-17  9:18   ` Damien Le Moal
  2026-03-17 14:36   ` Christoph Hellwig
  2 siblings, 2 replies; 12+ messages in thread
From: John Garry @ 2026-03-17  9:11 UTC (permalink / raw)
  To: Ionut Nechita (Wind River), m.szyprowski, kbusch, axboe, hch,
	sagi
  Cc: robin.murphy, martin.petersen, damien.lemoal, ahuang12, iommu,
	linux-nvme, linux-kernel, stable, ionut_n2001, sunlightlinux

On 16/03/2026 20:39, Ionut Nechita (Wind River) wrote:
> dma_opt_mapping_size() currently returns min(dma_max_mapping_size(),
> SIZE_MAX) when neither an IOMMU nor a DMA ops opt_mapping_size callback
> is present.  That value is the DMA maximum, not an optimal transfer
> size, yet callers treat it as a genuine optimization hint.
> 
> The concrete problem shows up on SAS controllers (e.g. mpt3sas) running
> with IOMMU in passthrough mode.  The bogus value propagates through
> scsi_transport_sas into Scsi_Host.opt_sectors and then into the block
> device's optimal_io_size.  mkfs.xfs picks it up, computes
> swidth=4095 / sunit=2, and fails with:
> 
>    XFS: SB stripe unit sanity check failed
> 
> making it impossible to create filesystems during system bootstrap.

For SAS controllers, don't we limit shost->opt_sectors at 
shost->max_sectors, and then in sd_revalidate_disk() this value is 
ignored as sdkp->opt_xfer_blocks would be smaller, right?

What value are you seeing for max_sectors and opt_sectors? That mpt3sas 
driver seems to have many methods to set max_sectors.

Thanks,
John

> 
> Patch 1 changes dma_opt_mapping_size() to return 0 ("no preference")
> when no backend provides a real hint.
> 
> Patch 2 adjusts the only other in-tree caller (nvme-pci) to handle the
> new 0 return value, falling back to its existing default instead of
> setting max_hw_sectors to 0.
> 
> Note: the scsi_transport_sas caller (the one that triggers the XFS
> issue) already handles 0 safely.  It passes the return value through
> min_t() into shost->opt_sectors, which becomes 0; sd.c then feeds that
> into min_not_zero() when computing io_opt, so a zero opt_sectors is
> correctly treated as "no preference" and ignored.
> 
> Based on linux-next (next-20260316).
> 
> Ionut Nechita (2):
>    dma: return 0 from dma_opt_mapping_size() when no real hint exists
>    nvme-pci: handle dma_opt_mapping_size() returning 0
> 
>   drivers/nvme/host/pci.c | 15 ++++++++++-----
>   kernel/dma/mapping.c    | 13 ++++++++-----
>   2 files changed, 18 insertions(+), 10 deletions(-)
> 



* Re: [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists
  2026-03-17  9:11 ` [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists John Garry
@ 2026-03-17  9:18   ` Damien Le Moal
  2026-03-17 14:36   ` Christoph Hellwig
  1 sibling, 0 replies; 12+ messages in thread
From: Damien Le Moal @ 2026-03-17  9:18 UTC (permalink / raw)
  To: John Garry, Ionut Nechita (Wind River), m.szyprowski, kbusch,
	axboe, hch, sagi
  Cc: robin.murphy, martin.petersen, damien.lemoal, ahuang12, iommu,
	linux-nvme, linux-kernel, stable, ionut_n2001, sunlightlinux

On 3/17/26 18:11, John Garry wrote:
> On 16/03/2026 20:39, Ionut Nechita (Wind River) wrote:
>> dma_opt_mapping_size() currently returns min(dma_max_mapping_size(),
>> SIZE_MAX) when neither an IOMMU nor a DMA ops opt_mapping_size callback
>> is present.  That value is the DMA maximum, not an optimal transfer
>> size, yet callers treat it as a genuine optimization hint.
>>
>> The concrete problem shows up on SAS controllers (e.g. mpt3sas) running
>> with IOMMU in passthrough mode.  The bogus value propagates through
>> scsi_transport_sas into Scsi_Host.opt_sectors and then into the block
>> device's optimal_io_size.  mkfs.xfs picks it up, computes
>> swidth=4095 / sunit=2, and fails with:
>>
>>    XFS: SB stripe unit sanity check failed
>>
>> making it impossible to create filesystems during system bootstrap.
> 
> For SAS controllers, don't we limit shost->opt_sectors at 
> shost->max_sectors, and then in sd_revalidate_disk() this value is 
> ignored as sdkp->opt_xfer_blocks would be smaller, right?
> 
> What value are you seeing for max_sectors and opt_sectors? That mpt3sas 
> driver seems to have many methods to set max_sectors.

And mpi3mr is also very similar.

> 
> Thanks,
> John
> 
>>
>> Patch 1 changes dma_opt_mapping_size() to return 0 ("no preference")
>> when no backend provides a real hint.
>>
>> Patch 2 adjusts the only other in-tree caller (nvme-pci) to handle the
>> new 0 return value, falling back to its existing default instead of
>> setting max_hw_sectors to 0.
>>
>> Note: the scsi_transport_sas caller (the one that triggers the XFS
>> issue) already handles 0 safely.  It passes the return value through
>> min_t() into shost->opt_sectors, which becomes 0; sd.c then feeds that
>> into min_not_zero() when computing io_opt, so a zero opt_sectors is
>> correctly treated as "no preference" and ignored.
>>
>> Based on linux-next (next-20260316).
>>
>> Ionut Nechita (2):
>>    dma: return 0 from dma_opt_mapping_size() when no real hint exists
>>    nvme-pci: handle dma_opt_mapping_size() returning 0
>>
>>   drivers/nvme/host/pci.c | 15 ++++++++++-----
>>   kernel/dma/mapping.c    | 13 ++++++++-----
>>   2 files changed, 18 insertions(+), 10 deletions(-)
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v1 1/2] dma: return 0 from dma_opt_mapping_size() when no real hint exists
  2026-03-16 20:39 ` [PATCH v1 1/2] dma: return 0 from dma_opt_mapping_size() when no real " Ionut Nechita (Wind River)
@ 2026-03-17  9:43   ` Robin Murphy
  2026-03-17 14:19     ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Robin Murphy @ 2026-03-17  9:43 UTC (permalink / raw)
  To: Ionut Nechita (Wind River), m.szyprowski, kbusch, axboe, hch,
	sagi
  Cc: martin.petersen, damien.lemoal, john.g.garry, ahuang12, iommu,
	linux-nvme, linux-kernel, stable, ionut_n2001, sunlightlinux

On 2026-03-16 8:39 pm, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> dma_opt_mapping_size() currently initializes its local size to SIZE_MAX
> and, when neither an IOMMU nor a DMA ops opt_mapping_size callback is
> present, returns min(dma_max_mapping_size(dev), SIZE_MAX).  That value
> is a large but finite number that has nothing to do with an optimal
> transfer size — it is simply the maximum the DMA layer can map.

No, the current code is correct. dma_opt_mapping_size() represents the 
largest size that can be mapped without incurring any significant 
performance penalty (compared to smaller sizes). If the implementation 
has no such restriction, then the largest "efficient" size is quite 
obviously just the largest size in total.

> Callers such as scsi_transport_sas treat the return value as a genuine
> optimization hint and propagate it into Scsi_Host.opt_sectors, which in
> turn becomes the block device's optimal_io_size.  On SAS controllers
> like mpt3sas running with IOMMU in passthrough mode the bogus value
> (max_sectors << 9 = 16776704, rounded to 16773120) reaches mkfs.xfs,
> which computes swidth=4095 and sunit=2.  Because 4095 is not a multiple
> of 2, XFS rejects the geometry with "SB stripe unit sanity check
> failed", making it impossible to create filesystems during system
> bootstrap.

And that is obviously a bug. There has never been any guarantee offered 
about the values returned by either dma_max_mapping_size() or 
dma_opt_mapping_size() - they could be very large, very small, and 
certainly do not have to be powers of 2. Say an implementation has some 
internal data size optimisation that makes U32_MAX its largest 
"efficient" size, it's free to return that, and then you'll still have 
the same bug regardless of this bodge.

Fix the actual bug, don't break common code in an attempt to paper over 
it that doesn't even achieve that very well.

Thanks,
Robin.

> Fix this by returning 0 when no backend provides an optimal mapping size
> hint.  A return value of 0 unambiguously means "no preference" and lets
> callers that use min() or min_not_zero() do the right thing without
> special-casing.
> 
> The only other in-tree caller (nvme-pci) is adjusted in the next patch.
> 
> Fixes: a229cc14f339 ("dma-mapping: add dma_opt_mapping_size()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>   kernel/dma/mapping.c | 13 ++++++++-----
>   1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index 78d8b4039c3e6..fffa6a3f191a3 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -984,14 +984,17 @@ EXPORT_SYMBOL_GPL(dma_max_mapping_size);
>   size_t dma_opt_mapping_size(struct device *dev)
>   {
>   	const struct dma_map_ops *ops = get_dma_ops(dev);
> -	size_t size = SIZE_MAX;
>   
>   	if (use_dma_iommu(dev))
> -		size = iommu_dma_opt_mapping_size();
> -	else if (ops && ops->opt_mapping_size)
> -		size = ops->opt_mapping_size();
> +		return iommu_dma_opt_mapping_size();
> +	if (ops && ops->opt_mapping_size)
> +		return ops->opt_mapping_size();
>   
> -	return min(dma_max_mapping_size(dev), size);
> +	/*
> +	 * No backend provided an optimal size hint. Return 0 so that
> +	 * callers can distinguish "no hint" from a real value.
> +	 */
> +	return 0;
>   }
>   EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
>   



* Re: [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0
  2026-03-16 20:39 ` [PATCH v1 2/2] nvme-pci: handle dma_opt_mapping_size() returning 0 Ionut Nechita (Wind River)
  2026-03-16 21:21   ` Damien Le Moal
  2026-03-17  8:55   ` John Garry
@ 2026-03-17 14:14   ` Christoph Hellwig
  2 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-03-17 14:14 UTC (permalink / raw)
  To: Ionut Nechita (Wind River)
  Cc: m.szyprowski, kbusch, axboe, hch, sagi, robin.murphy,
	martin.petersen, damien.lemoal, john.g.garry, ahuang12, iommu,
	linux-nvme, linux-kernel, stable, ionut_n2001, sunlightlinux

On Mon, Mar 16, 2026 at 10:39:56PM +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> After the previous commit, dma_opt_mapping_size() returns 0 when no DMA
> backend provides an optimal mapping size hint (e.g. IOMMU in passthrough
> mode with no ops->opt_mapping_size callback).
> 
> The NVMe PCI driver used min_t(u32, NVME_MAX_BYTES >> SECTOR_SHIFT,
> dma_opt_mapping_size() >> 9) to cap max_hw_sectors.  With a 0 return
> value this would set max_hw_sectors to 0, which is invalid.

... which means that if you want to change it, you need to combine
both patches into one to not create a regression.



* Re: [PATCH v1 1/2] dma: return 0 from dma_opt_mapping_size() when no real hint exists
  2026-03-17  9:43   ` Robin Murphy
@ 2026-03-17 14:19     ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-03-17 14:19 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Ionut Nechita (Wind River), m.szyprowski, kbusch, axboe, sagi,
	martin.petersen, damien.lemoal, john.g.garry, ahuang12, iommu,
	linux-nvme, linux-kernel, stable, ionut_n2001, sunlightlinux

On Tue, Mar 17, 2026 at 09:43:46AM +0000, Robin Murphy wrote:
> On 2026-03-16 8:39 pm, Ionut Nechita (Wind River) wrote:
>> From: Ionut Nechita <ionut.nechita@windriver.com>
>>
>> dma_opt_mapping_size() currently initializes its local size to SIZE_MAX
>> and, when neither an IOMMU nor a DMA ops opt_mapping_size callback is
>> present, returns min(dma_max_mapping_size(dev), SIZE_MAX).  That value
>> is a large but finite number that has nothing to do with an optimal
>> transfer size — it is simply the maximum the DMA layer can map.
>
> No, the current code is correct. dma_opt_mapping_size() represents the 
> largest size that can be mapped without incurring any significant 
> performance penalty (compared to smaller sizes). If the implementation has 
> no such restriction, then the largest "efficient" size is quite obviously 
> just the largest size in total.

Yes.

>> Callers such as scsi_transport_sas treat the return value as a genuine
>> optimization hint and propagate it into Scsi_Host.opt_sectors, which in
>> turn becomes the block device's optimal_io_size.  On SAS controllers
>> like mpt3sas running with IOMMU in passthrough mode the bogus value
>> (max_sectors << 9 = 16776704, rounded to 16773120) reaches mkfs.xfs,
>> which computes swidth=4095 and sunit=2.  Because 4095 is not a multiple
>> of 2, XFS rejects the geometry with "SB stripe unit sanity check
>> failed", making it impossible to create filesystems during system
>> bootstrap.
>
> And that is obviously a bug. There has never been any guarantee offered 
> about the values returned by either dma_max_mapping_size() or 
> dma_opt_mapping_size() - they could be very large, very small, and 
> certainly do not have to be powers of 2. Say an implementation has some 
> internal data size optimisation that makes U32_MAX its largest "efficient" 
> size, it's free to return that, and then you'll still have the same bug 
> regardless of this bodge.

Yes, the SCSI/SAS code needs to properly round the value.

But we might also need to split the values up a bit, as tools just
assign too much value to the I/O opt value.  I.e. the file system
geometry really should not be affected by the IOMMU details.
>
> Fix the actual bug, don't break common code in an attempt to paper over it 
> that doesn't even achieve that very well.
>
> Thanks,
> Robin.
>
>> Fix this by returning 0 when no backend provides an optimal mapping size
>> hint.  A return value of 0 unambiguously means "no preference" and lets
>> callers that use min() or min_not_zero() do the right thing without
>> special-casing.
>>
>> The only other in-tree caller (nvme-pci) is adjusted in the next patch.
>>
>> Fixes: a229cc14f339 ("dma-mapping: add dma_opt_mapping_size()")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
>> ---
>>   kernel/dma/mapping.c | 13 ++++++++-----
>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>> index 78d8b4039c3e6..fffa6a3f191a3 100644
>> --- a/kernel/dma/mapping.c
>> +++ b/kernel/dma/mapping.c
>> @@ -984,14 +984,17 @@ EXPORT_SYMBOL_GPL(dma_max_mapping_size);
>>   size_t dma_opt_mapping_size(struct device *dev)
>>   {
>>   	const struct dma_map_ops *ops = get_dma_ops(dev);
>> -	size_t size = SIZE_MAX;
>>     	if (use_dma_iommu(dev))
>> -		size = iommu_dma_opt_mapping_size();
>> -	else if (ops && ops->opt_mapping_size)
>> -		size = ops->opt_mapping_size();
>> +		return iommu_dma_opt_mapping_size();
>> +	if (ops && ops->opt_mapping_size)
>> +		return ops->opt_mapping_size();
>>   -	return min(dma_max_mapping_size(dev), size);
>> +	/*
>> +	 * No backend provided an optimal size hint. Return 0 so that
>> +	 * callers can distinguish "no hint" from a real value.
>> +	 */
>> +	return 0;
>>   }
>>   EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
>>   
---end quoted text---


* Re: [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists
  2026-03-17  9:11 ` [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists John Garry
  2026-03-17  9:18   ` Damien Le Moal
@ 2026-03-17 14:36   ` Christoph Hellwig
  2026-03-17 15:18     ` John Garry
  1 sibling, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2026-03-17 14:36 UTC (permalink / raw)
  To: John Garry
  Cc: Ionut Nechita (Wind River), m.szyprowski, kbusch, axboe, hch,
	sagi, robin.murphy, martin.petersen, damien.lemoal, ahuang12,
	iommu, linux-nvme, linux-kernel, stable, ionut_n2001,
	sunlightlinux

On Tue, Mar 17, 2026 at 09:11:59AM +0000, John Garry wrote:
> For SAS controllers, don't we limit shost->opt_sectors at 
> shost->max_sectors, and then in sd_revalidate_disk() this value is ignored 
> as sdkp->opt_xfer_blocks would be smaller, right?

That assumes opt_xfer_blocks is actually set.  It's an optional and
relatively recent SCSI feature.  So don't expect crappy SSDs or
RAID controllers faking up SCSI in shitty firmware to actually set
it.



* Re: [PATCH v1 0/2] dma: fix dma_opt_mapping_size() returning bogus value when no backend hint exists
  2026-03-17 14:36   ` Christoph Hellwig
@ 2026-03-17 15:18     ` John Garry
  0 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2026-03-17 15:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ionut Nechita (Wind River), m.szyprowski, kbusch, axboe, sagi,
	robin.murphy, martin.petersen, damien.lemoal, ahuang12, iommu,
	linux-nvme, linux-kernel, stable, ionut_n2001, sunlightlinux

On 17/03/2026 14:36, Christoph Hellwig wrote:
> On Tue, Mar 17, 2026 at 09:11:59AM +0000, John Garry wrote:
>> For SAS controllers, don't we limit shost->opt_sectors at
>> shost->max_sectors, and then in sd_revalidate_disk() this value is ignored
>> as sdkp->opt_xfer_blocks would be smaller, right?
> That assumes opt_xfer_blocks is actually set.  It's an optional and
> relatively recent SCSI feature.  So don't expect crappy SSDs or
> RAID controllers faking up SCSI in shitty firmware to actually set
> it.

Sure, and then we would have io_opt at max_sectors, and it seems that 
value is totally configurable for that HBA driver.

However I still find the values reported strange:
swidth=4095 / sunit=2

I thought that they were from io_opt and io_min, and 
blk_validate_limits() does rounding to PBS, except io_min has no 
rounding for > PBS.

