* [PATCH v4 0/1] scsi: sas: fix mkfs.xfs failure due to bogus optimal_io_size
@ 2026-03-19 8:39 Ionut Nechita (Wind River)
From: Ionut Nechita (Wind River) @ 2026-03-19 8:39 UTC (permalink / raw)
To: linux-scsi
Cc: James.Bottomley, ahuang12, axboe, damien.lemoal, dlemoal, hch,
iommu, ionut_n2001, john.g.garry, kbusch, linux-kernel,
linux-nvme, m.szyprowski, martin.petersen, robin.murphy, sagi,
stable, sunlightlinux, Ionut Nechita
From: Ionut Nechita <ionut.nechita@windriver.com>

v4 (per Damien Le Moal's review of v3):
  - Split the opt >= max check into a WARN_ONCE for the impossible
    opt > max case (driver bug) and a plain == check for the "no hint"
    case.
  - Used min_t(unsigned int, ...) for the return value to avoid any
    potential overflow when shifting size_t down to sectors.
  - Reformatted the call site as suggested:
      shost->opt_sectors =
              sas_dma_opt_sectors(dma_dev, shost->max_sectors);

v3 (per Christoph Hellwig's review of v2):
  - Extracted the opt_sectors logic into a dedicated sas_dma_opt_sectors()
    helper function, clearly split out from sas_host_setup().
  - Added rounddown_pow_of_two() on the DMA optimal mapping size so that
    the resulting opt_sectors is always a power of two, keeping filesystem
    geometry calculations clean.
  - Added #include <linux/log2.h> for rounddown_pow_of_two().

v2:
  - Dropped the dma_opt_mapping_size() change per Robin Murphy's feedback:
    the DMA core semantics are correct, the bug is in the caller.
  - Dropped the nvme-pci patch (no longer needed).
  - Single patch now fixes the actual bug in scsi_transport_sas.c.

v1 feedback summary:
  - Robin Murphy: dma_opt_mapping_size() semantics are correct; if no
    restriction exists, the largest efficient size IS the largest size.
    Fix the caller, not the common code.
  - John Garry: Asked for concrete max_sectors/opt_sectors values and
    questioned whether sd_revalidate_disk() would override opt_sectors
    via opt_xfer_blocks.
  - Damien Le Moal: Suggested min_not_zero() for nvme-pci (now moot).
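
For readers following the changelog, the decision logic that v2 through v4
converge on can be modelled in userspace roughly as below. This is a sketch
under assumptions, not the patch itself: model_opt_sectors() and
rounddown_pow2() are hypothetical names, opt and max are passed in rather
than queried via dma_opt_mapping_size()/dma_max_mapping_size(), the
WARN_ONCE is reduced to a plain return, and the cap is applied in size_t
before narrowing to unsigned int.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Userspace stand-in for the kernel's rounddown_pow_of_two(), for n >= 1. */
static size_t rounddown_pow2(size_t n)
{
	size_t p = 1;

	/* Compare against n / 2 so that p never overflows. */
	while (p <= n / 2)
		p *= 2;
	return p;
}

/*
 * Hypothetical model of the helper. opt and max stand for the values the
 * real code would read from the DMA layer.
 */
static unsigned int model_opt_sectors(size_t opt, size_t max,
				      unsigned int max_sectors)
{
	size_t sectors;

	if (opt > max)		/* "impossible" case, WARN_ONCE in the patch */
		return 0;
	if (opt == max)		/* no backend supplied a real hint */
		return 0;

	sectors = rounddown_pow2(opt) >> SECTOR_SHIFT;

	/* Cap at the host limit, then narrow. */
	return sectors < max_sectors ? (unsigned int)sectors : max_sectors;
}
```

With opt == max == SIZE_MAX the model returns 0 ("no preference"). With
opt = 200000 bytes and max = SIZE_MAX it rounds down to 131072 bytes and
returns 256 sectors.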

Answer to John's question (from v2, still relevant):
  The SAS disks on this system do not report Optimal Transfer Length in
  VPD page B0, so sdkp->opt_xfer_blocks = 0. sd_revalidate_disk() uses
  min_not_zero(0, opt_sectors) which returns opt_sectors, propagating
  the bogus value. Observed values:

    shost->max_sectors = 32767
    opt_sectors        = 32767 (capped at max_sectors)
    optimal_io_size    = 16773120 (visible in lsblk --topology)
    minimum_io_size    = 8192

  mkfs.xfs computes swidth=4095, sunit=2, fails because 4095 % 2 != 0.
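
The numbers above can be re-derived with a small userspace sketch. It is a
simplified model built only from the values quoted in this mail, not the
actual sd or mkfs.xfs code: min_not_zero_u(), model_io_opt() and
model_stripe_ok() are hypothetical names, the device hint is assumed to be
already expressed in 512-byte sectors, and the 4096-byte rounding granule
and filesystem block size are taken from the observed values.

```c
#include <stdbool.h>

#define SECTOR_SHIFT 9

/* Same semantics as the kernel's min_not_zero() for two unsigned values. */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * Pick the effective hint in sectors, convert it to bytes and round it
 * down to a multiple of granule bytes (granule must be non-zero).
 */
static unsigned long long model_io_opt(unsigned int dev_opt_sectors,
				       unsigned int host_opt_sectors,
				       unsigned int granule)
{
	unsigned long long bytes;

	bytes = (unsigned long long)min_not_zero_u(dev_opt_sectors,
						   host_opt_sectors)
		<< SECTOR_SHIFT;
	return bytes - bytes % granule;
}

/*
 * True when swidth is a whole multiple of sunit, both expressed in
 * filesystem blocks (fs_block must be non-zero).
 */
static bool model_stripe_ok(unsigned long long io_opt, unsigned int io_min,
			    unsigned int fs_block)
{
	unsigned long long swidth = io_opt / fs_block;
	unsigned long long sunit = io_min / fs_block;

	return sunit == 0 || swidth % sunit == 0;
}
```

model_io_opt(0, 32767, 4096) gives 16773120, and model_stripe_ok(16773120,
8192, 4096) is false because 4095 is not a multiple of 2. A power-of-two
hint such as 131072 bytes passes the same check.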

Test environment:
  - Dell PowerEdge R750
  - SAS Controller: Broadcom/LSI mpt3sas (SAS3816, FW 33.15.00.00)
  - Disks: SAMSUNG MZILT800HBHQ0D3 (800GB SCSI SAS SSD)
  - Kernel: 6.12.0-1-amd64 with intel_iommu=off
  - IOMMU: Disabled (DMAR: IOMMU disabled), default domain: Passthrough

Based on linux-next (next-20260318).

Link: https://lore.kernel.org/lkml/20260316203956.64515-1-ionut.nechita@windriver.com/
Link: https://lore.kernel.org/all/20260318074314.17372-1-ionut.nechita@windriver.com/
Link: https://lore.kernel.org/all/20260318200532.51232-1-ionut.nechita@windriver.com/

Ionut Nechita (1):
scsi: sas: skip opt_sectors when DMA reports no real optimization hint
drivers/scsi/scsi_transport_sas.c | 40 +++++++++++++++++++++++++++----
1 file changed, 36 insertions(+), 4 deletions(-)
--
2.43.0

* [PATCH v4] scsi: sas: skip opt_sectors when DMA reports no real optimization hint
@ 2026-03-19 8:39 Ionut Nechita (Wind River)
From: Ionut Nechita (Wind River) @ 2026-03-19 8:39 UTC (permalink / raw)
To: linux-scsi
Cc: James.Bottomley, ahuang12, axboe, damien.lemoal, dlemoal, hch,
iommu, ionut_n2001, john.g.garry, kbusch, linux-kernel,
linux-nvme, m.szyprowski, martin.petersen, robin.murphy, sagi,
stable, sunlightlinux, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

sas_host_setup() unconditionally sets shost->opt_sectors from
dma_opt_mapping_size(). When the IOMMU is disabled or in passthrough
mode and no DMA ops provide an opt_mapping_size callback,
dma_opt_mapping_size() returns min(dma_max_mapping_size(), SIZE_MAX)
which equals dma_max_mapping_size(), a hard upper bound, not an
optimization hint.

On a Dell PowerEdge R750 with mpt3sas (Broadcom SAS3816, FW 33.15.00.00)
and intel_iommu=off the following values are observed:

  dma_opt_mapping_size() = dma_max_mapping_size() (no real hint)
  shost->max_sectors     = 32767
  opt_sectors            = min(32767, huge >> 9) = 32767
  optimal_io_size        = 32767 << 9 = 16776704
                           → round_down(16776704, 4096) = 16773120

The SAS disks (SAMSUNG MZILT800HBHQ0D3) do not report an
Optimal Transfer Length in VPD page B0, so sdkp->opt_xfer_blocks remains 0.
sd_revalidate_disk() then uses min_not_zero(0, opt_sectors) = opt_sectors,
propagating the bogus value into the block device's optimal_io_size
(visible as OPT-IO = 16773120 in lsblk --topology).

mkfs.xfs picks up optimal_io_size and minimum_io_size and computes:

  swidth = 16773120 / 4096 = 4095
  sunit  = 8192 / 4096 = 2

Since 4095 % 2 != 0, XFS rejects the geometry:

  SB stripe unit sanity check failed

This makes it impossible to create XFS filesystems (e.g. for
/var/lib/docker) during system bootstrap.

Fix this by introducing a sas_dma_opt_sectors() helper that only returns
a non-zero opt_sectors when dma_opt_mapping_size() is strictly less than
dma_max_mapping_size(), indicating a genuine DMA optimization constraint
from an IOMMU or DMA ops backend. The helper also rounds the value down
to a power of two so that filesystem geometry calculations always produce
clean results. When the two DMA values are equal, no backend provided a
real hint, so opt_sectors stays at 0 ("no preference").

A WARN_ONCE guards against dma_opt_mapping_size() returning a value
larger than dma_max_mapping_size(), which would indicate a driver bug.
The return value uses min_t(unsigned int, ...) to avoid any potential
overflow when shifting the size_t opt value down to sectors.

Fixes: 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 drivers/scsi/scsi_transport_sas.c | 40 +++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 12124f9d5ccd0..696627b6fe2c3 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -27,6 +27,7 @@
 #include <linux/module.h>
 #include <linux/jiffies.h>
 #include <linux/err.h>
+#include <linux/log2.h>
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/blkdev.h>
@@ -222,6 +223,38 @@ static int sas_bsg_initialize(struct Scsi_Host *shost, struct sas_rphy *rphy)
  * SAS host attributes
  */
 
+/**
+ * sas_dma_opt_sectors - derive opt_sectors from DMA optimal mapping size
+ * @dma_dev: device to query DMA parameters for
+ * @max_sectors: upper bound from the host adapter
+ *
+ * When the DMA layer reports a genuine optimization constraint (i.e.
+ * dma_opt_mapping_size() < dma_max_mapping_size()), convert it to a
+ * sector count, round it down to a power of two so that filesystem
+ * geometry calculations stay sane, and cap it at @max_sectors.
+ *
+ * When the two values are equal no backend provided a real hint and
+ * the function returns 0 ("no preference").
+ */
+static unsigned int sas_dma_opt_sectors(struct device *dma_dev,
+					unsigned int max_sectors)
+{
+	size_t opt = dma_opt_mapping_size(dma_dev);
+	size_t max = dma_max_mapping_size(dma_dev);
+
+	if (WARN_ONCE(opt > max,
+		      "dma_opt_mapping_size (%zu) > dma_max_mapping_size (%zu)\n",
+		      opt, max))
+		return 0;
+
+	if (opt == max)
+		return 0;
+
+	opt = rounddown_pow_of_two(opt);
+
+	return min_t(unsigned int, opt >> SECTOR_SHIFT, max_sectors);
+}
+
 static int sas_host_setup(struct transport_container *tc, struct device *dev,
 			  struct device *cdev)
 {
@@ -239,10 +272,9 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 		dev_printk(KERN_ERR, dev, "fail to a bsg device %d\n",
 			   shost->host_no);
 
-	if (dma_dev->dma_mask) {
-		shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
-				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
-	}
+	if (dma_dev->dma_mask)
+		shost->opt_sectors =
+			sas_dma_opt_sectors(dma_dev, shost->max_sectors);
 
 	return 0;
 }
-- 
2.53.0

* Re: [PATCH v4] scsi: sas: skip opt_sectors when DMA reports no real optimization hint
@ 2026-03-19 11:06 Damien Le Moal
From: Damien Le Moal @ 2026-03-19 11:06 UTC (permalink / raw)
To: Ionut Nechita (Wind River), linux-scsi
Cc: James.Bottomley, ahuang12, axboe, damien.lemoal, hch, iommu,
ionut_n2001, john.g.garry, kbusch, linux-kernel, linux-nvme,
m.szyprowski, martin.petersen, robin.murphy, sagi, stable,
sunlightlinux

On 3/19/26 17:39, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> sas_host_setup() unconditionally sets shost->opt_sectors from
> dma_opt_mapping_size(). When the IOMMU is disabled or in passthrough
> mode and no DMA ops provide an opt_mapping_size callback,
> dma_opt_mapping_size() returns min(dma_max_mapping_size(), SIZE_MAX)
> which equals dma_max_mapping_size(), a hard upper bound, not an
> optimization hint.

Please reduce the distribution list. This is now a scsi patch. Nothing to do
with iommu or nvme.

>
> On a Dell PowerEdge R750 with mpt3sas (Broadcom SAS3816, FW 33.15.00.00)
> and intel_iommu=off the following values are observed:
>
>   dma_opt_mapping_size() = dma_max_mapping_size() (no real hint)
>   shost->max_sectors     = 32767
>   opt_sectors            = min(32767, huge >> 9) = 32767
>   optimal_io_size        = 32767 << 9 = 16776704
>                            → round_down(16776704, 4096) = 16773120
>
> The SAS disks (SAMSUNG MZILT800HBHQ0D3) do not report an
> Optimal Transfer Length in VPD page B0, so sdkp->opt_xfer_blocks remains 0.
> sd_revalidate_disk() then uses min_not_zero(0, opt_sectors) = opt_sectors,
> propagating the bogus value into the block device's optimal_io_size
> (visible as OPT-IO = 16773120 in lsblk --topology).
>
> mkfs.xfs picks up optimal_io_size and minimum_io_size and computes:
>
>   swidth = 16773120 / 4096 = 4095
>   sunit  = 8192 / 4096 = 2
>
> Since 4095 % 2 != 0, XFS rejects the geometry:
>
>   SB stripe unit sanity check failed
>
> This makes it impossible to create XFS filesystems (e.g. for
> /var/lib/docker) during system bootstrap.
>
> Fix this by introducing a sas_dma_opt_sectors() helper that only returns
> a non-zero opt_sectors when dma_opt_mapping_size() is strictly less than
> dma_max_mapping_size(), indicating a genuine DMA optimization constraint
> from an IOMMU or DMA ops backend. The helper also rounds the value down
> to a power of two so that filesystem geometry calculations always produce
> clean results. When the two DMA values are equal, no backend provided a
> real hint, so opt_sectors stays at 0 ("no preference").
>
> A WARN_ONCE guards against dma_opt_mapping_size() returning a value
> larger than dma_max_mapping_size(), which would indicate a driver bug.
> The return value uses min_t(unsigned int, ...) to avoid any potential
> overflow when shifting the size_t opt value down to sectors.
>
> Fixes: 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>  drivers/scsi/scsi_transport_sas.c | 40 +++++++++++++++++++++++++++----
>  1 file changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
> index 12124f9d5ccd0..696627b6fe2c3 100644
> --- a/drivers/scsi/scsi_transport_sas.c
> +++ b/drivers/scsi/scsi_transport_sas.c
> @@ -27,6 +27,7 @@
>  #include <linux/module.h>
>  #include <linux/jiffies.h>
>  #include <linux/err.h>
> +#include <linux/log2.h>
>  #include <linux/slab.h>
>  #include <linux/string.h>
>  #include <linux/blkdev.h>
> @@ -222,6 +223,38 @@ static int sas_bsg_initialize(struct Scsi_Host *shost, struct sas_rphy *rphy)
>   * SAS host attributes
>   */
>
> +/**
> + * sas_dma_opt_sectors - derive opt_sectors from DMA optimal mapping size
> + * @dma_dev: device to query DMA parameters for
> + * @max_sectors: upper bound from the host adapter
> + *
> + * When the DMA layer reports a genuine optimization constraint (i.e.
> + * dma_opt_mapping_size() < dma_max_mapping_size()), convert it to a
> + * sector count, round it down to a power of two so that filesystem
> + * geometry calculations stay sane, and cap it at @max_sectors.
> + *
> + * When the two values are equal no backend provided a real hint and
> + * the function returns 0 ("no preference").
> + */
> +static unsigned int sas_dma_opt_sectors(struct device *dma_dev,
> +					unsigned int max_sectors)
> +{
> +	size_t opt = dma_opt_mapping_size(dma_dev);
> +	size_t max = dma_max_mapping_size(dma_dev);
> +
> +	if (WARN_ONCE(opt > max,
> +		      "dma_opt_mapping_size (%zu) > dma_max_mapping_size (%zu)\n",
> +		      opt, max))
> +		return 0;
> +
> +	if (opt == max)
> +		return 0;
> +
> +	opt = rounddown_pow_of_two(opt);
> +
> +	return min_t(unsigned int, opt >> SECTOR_SHIFT, max_sectors);
> +}
> +
>  static int sas_host_setup(struct transport_container *tc, struct device *dev,
>  			  struct device *cdev)
>  {
> @@ -239,10 +272,9 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
>  		dev_printk(KERN_ERR, dev, "fail to a bsg device %d\n",
>  			   shost->host_no);
>
> -	if (dma_dev->dma_mask) {
> -		shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
> -				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
> -	}
> +	if (dma_dev->dma_mask)
> +		shost->opt_sectors =
> +			sas_dma_opt_sectors(dma_dev, shost->max_sectors);
>
>  	return 0;
>  }

-- 
Damien Le Moal
Western Digital Research