[PATCH] dma-buf: Split sgl by largest page-aligned chunk

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] dma-buf: Split sgl by largest page-aligned chunk
@ 2026-06-21 22:21 David Hu
  2026-06-21 22:34 ` sashiko-bot
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: David Hu @ 2026-06-21 22:21 UTC (permalink / raw)
  To: Sumit Semwal, Christian König
  Cc: Jason Gunthorpe, Nicolin Chen, Leon Romanovsky, Kevin Tian,
	Ankit Agrawal, Alex Williamson, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, iommu, jmoroni, praan, kpberry,
	David Hu, sashiko-bot, stable

Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
first entry, resulting in non-page-aligned DMA addresses for all
subsequent entries.

While the underlying IOMMU mapping may be contiguous, hardware
DMA engines often require explicit address alignment (e.g., page,
cacheline, or storage sector boundaries). Passing unaligned
addresses and lengths can cause explicit failures in DMA descriptor
creation or silent data corruption if lower unaligned bits are
truncated.

Fix this by splitting the scatterlist by the largest possible page
aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
This ensures all scatterlist DMA addresses and lengths remain page
aligned and satisfy hardware constraints.

Page-aligned entries allow the system to cleanly chunk payloads into
PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
As a result, this may help reduce TLP fragmentation in P2P transfers
and alleviate potential congestion within a logical PCIe switch
partition, especially when Relaxed Ordering is not possible due to
hardware constraints.

Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Cc: stable@vger.kernel.org
Signed-off-by: David Hu <xuehaohu@google.com>
---
 drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index 794acff2546a..f2bde38fdb1f 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -5,6 +5,9 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/dma-resv.h>
+#include <linux/align.h>
+
+#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
 
 static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 					 dma_addr_t addr)
@@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 	unsigned int len, nents;
 	int i;
 
-	nents = DIV_ROUND_UP(length, UINT_MAX);
+	nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
 	for (i = 0; i < nents; i++) {
-		len = min_t(size_t, length, UINT_MAX);
+		len = min_t(size_t, length, MAX_ENT_SZ);
 		length -= len;
 		/*
 		 * DMABUF abuses scatterlist to create a scatterlist
@@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 		 * does not require the CPU list for mapping or unmapping.
 		 */
 		sg_set_page(sgl, NULL, 0, 0);
-		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+		sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
 		sg_dma_len(sgl) = len;
 		sgl = sg_next(sgl);
 	}
@@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
 
 	if (!state || !dma_use_iova(state)) {
 		for (i = 0; i < nr_ranges; i++)
-			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
 	} else {
 		/*
 		 * In IOVA case, there is only one SG entry which spans
 		 * for whole IOVA address space, but we need to make sure
 		 * that it fits sg->length, maybe we need more.
 		 */
-		nents = DIV_ROUND_UP(size, UINT_MAX);
+		nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
 	}
 
 	return nents;
-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-21 22:21 [PATCH] dma-buf: Split sgl by largest page-aligned chunk David Hu
@ 2026-06-21 22:34 ` sashiko-bot
  2026-06-22  8:13 ` David Laight
  2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
  2 siblings, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-06-21 22:34 UTC (permalink / raw)
  To: David Hu; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Calling `sg_set_page` with a `NULL` page parameter triggers a kernel panic when `CONFIG_DEBUG_VM` and SPARSEMEM without VMEMMAP are enabled.
- [High] Calculating the number of scatterlist entries (`nents`) using `DIV_ROUND_UP()` overflows on 32-bit architectures, resulting in `nents = 0` and uninitialized or empty scatterlists for large buffer lengths.
- [High] dma_unmap_phys() is invoked iteratively on chunked scatterlist addresses and lengths rather than the exact address and size initially mapped by dma_map_phys().
- [High] The error path in `dma_buf_phys_vec_to_sgt` unmaps all allocated scatterlist entries, including unpopulated ones, leading to invalid DMA unmap requests.
--

commit f6cce91b1ad2e1c68270e81a3cd60cffb903c456
Author: David Hu <xuehaohu@google.com>

dma-buf: Split sgl by largest page-aligned chunk

Currently, fill_sg_entry() splits the scatterlist using UINT_MAX, resulting
in non-page-aligned DMA addresses for subsequent entries. This patch changes
the split size to the largest page-aligned chunk within UINT_MAX to satisfy
hardware DMA alignment constraints.

Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Cc: stable@vger.kernel.org
Signed-off-by: David Hu <xuehaohu@google.com>

> diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> index 794acff2546a3..f2bde38fdb1f7 100644
> --- a/drivers/dma-buf/dma-buf-mapping.c
> +++ b/drivers/dma-buf/dma-buf-mapping.c
> @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  	unsigned int len, nents;
>  	int i;
>  
> -	nents = DIV_ROUND_UP(length, UINT_MAX);
> +	nents = DIV_ROUND_UP(length, MAX_ENT_SZ);

[Severity: High]
This isn't a bug introduced by this patch, but does this DIV_ROUND_UP
calculation overflow on 32-bit architectures?

On 32-bit platforms, size_t is 32 bits and MAX_ENT_SZ evaluates to 0xFFFFF000
assuming 4KB pages. The DIV_ROUND_UP macro expands to:

(length + MAX_ENT_SZ - 1) / MAX_ENT_SZ

If the length is greater than PAGE_SIZE, the numerator wraps around in 32-bit
arithmetic, resulting in a small value that evaluates to 0. This skips the
scatterlist setup loop entirely for large buffer lengths.

>  	for (i = 0; i < nents; i++) {
> -		len = min_t(size_t, length, UINT_MAX);
> +		len = min_t(size_t, length, MAX_ENT_SZ);
>  		length -= len;
>  		/*
>  		 * DMABUF abuses scatterlist to create a scatterlist
> @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  		 * does not require the CPU list for mapping or unmapping.
>  		 */
>  		sg_set_page(sgl, NULL, 0, 0);

[Severity: Critical]
This is a pre-existing issue, but does passing NULL to sg_set_page here
cause a kernel panic under certain configurations?

When CONFIG_DEBUG_VM and classic SPARSEMEM without VMEMMAP are enabled,
sg_set_page calls page_range_contiguous, which unconditionally calls
page_to_pfn.

On these setups, page_to_pfn determines the memory section by dereferencing
the page flags. Dereferencing a NULL page pointer here causes an immediate
kernel oops.

[ ... ]

> @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
>  
>  	if (!state || !dma_use_iova(state)) {
>  		for (i = 0; i < nr_ranges; i++)
> -			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> +			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);

[Severity: High]
This is a pre-existing issue, but does this calculation also suffer from the
same 32-bit overflow mentioned above, resolving to 0 for large lengths?

[ ... ]

[Severity: High]
This isn't a bug introduced by this patch, but looking at how the
scatterlists populated by fill_sg_entry are mapped and unmapped, are we
violating the DMA API contract in dma_buf_free_sgt?

In dma_buf_phys_vec_to_sgt, a monolithic physical range is mapped using a
single dma_map_phys call, and then fill_sg_entry splits it into chunks.

However, in dma_buf_free_sgt, dma_unmap_phys is iteratively called on each
individual scatterlist chunk:

for_each_sgtable_dma_sg(sgt, sgl, i)
    dma_unmap_phys(attach->dev, sg_dma_address(sgl),
                   sg_dma_len(sgl), dir, DMA_ATTR_MMIO);

The DMA API requires that an allocation must be unmapped using the exact
same address and size parameters passed during creation. Chunked unmapping
violates this requirement.

[Severity: High]
This is also a pre-existing issue, but does the error path in
dma_buf_phys_vec_to_sgt incorrectly handle unmapping?

If dma_map_phys fails, the code jumps to the err_unmap_dma label. The error
loop uses:

for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
    dma_unmap_phys(attach->dev, sg_dma_address(sgl), ...);

Because the loop macro uses i as its internal iterator up to
dma->sgt.nents, which was initialized to the total allocated entries,
it overwrites the failure index.

This causes it to iterate over all allocated entries, including unpopulated
ones, passing 0 for the address and length to dma_unmap_phys, which leads
to invalid DMA unmap requests.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260621222130.1667453-1-xuehaohu@google.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-21 22:21 [PATCH] dma-buf: Split sgl by largest page-aligned chunk David Hu
  2026-06-21 22:34 ` sashiko-bot
@ 2026-06-22  8:13 ` David Laight
  2026-06-22 21:26   ` David Hu
  2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
  2 siblings, 1 reply; 8+ messages in thread
From: David Laight @ 2026-06-22  8:13 UTC (permalink / raw)
  To: David Hu
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, sashiko-bot, stable

On Sun, 21 Jun 2026 22:21:30 +0000
David Hu <xuehaohu@google.com> wrote:

> Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> first entry, resulting in non-page-aligned DMA addresses for all
> subsequent entries.

How did you find this?
It requires a single buffer over 4GB - seems highly unlikely.


> 
> While the underlying IOMMU mapping may be contiguous, hardware
> DMA engines often require explicit address alignment (e.g., page,
> cacheline, or storage sector boundaries). Passing unaligned
> addresses and lengths can cause explicit failures in DMA descriptor
> creation or silent data corruption if lower unaligned bits are
> truncated.
> 
> Fix this by splitting the scatterlist by the largest possible page
> aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
> This ensures all scatterlist DMA addresses and lengths remain page
> aligned and satisfy hardware constraints.

It would almost certainly better to spilt into 2G chunks.
That removes any need for any divisions.

> Page-aligned entries allow the system to cleanly chunk payloads into
> PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
> As a result, this may help reduce TLP fragmentation in P2P transfers
> and alleviate potential congestion within a logical PCIe switch
> partition, especially when Relaxed Ordering is not possible due to
> hardware constraints.
> 
> Reported-by: sashiko-bot <sashiko-bot@kernel.org>
> Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
> Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> Cc: stable@vger.kernel.org
> Signed-off-by: David Hu <xuehaohu@google.com>
> ---
>  drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> index 794acff2546a..f2bde38fdb1f 100644
> --- a/drivers/dma-buf/dma-buf-mapping.c
> +++ b/drivers/dma-buf/dma-buf-mapping.c
> @@ -5,6 +5,9 @@
>   */
>  #include <linux/dma-buf-mapping.h>
>  #include <linux/dma-resv.h>
> +#include <linux/align.h>
> +
> +#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)

>  
>  static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  					 dma_addr_t addr)
> @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  	unsigned int len, nents;
>  	int i;
>  
> -	nents = DIV_ROUND_UP(length, UINT_MAX);
> +	nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
>  	for (i = 0; i < nents; i++) {

Why not change that to 'while (length) {' to avoid the division above.

> -		len = min_t(size_t, length, UINT_MAX);
> +		len = min_t(size_t, length, MAX_ENT_SZ);

I bet that doesn't need to be min_t()

>  		length -= len;
>  		/*
>  		 * DMABUF abuses scatterlist to create a scatterlist
> @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  		 * does not require the CPU list for mapping or unmapping.
>  		 */
>  		sg_set_page(sgl, NULL, 0, 0);
> -		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> +		sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
>  		sg_dma_len(sgl) = len;

Replace the multiply with 'addr += len'.

-- David

>  		sgl = sg_next(sgl);
>  	}
> @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
>  
>  	if (!state || !dma_use_iova(state)) {
>  		for (i = 0; i < nr_ranges; i++)
> -			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> +			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
>  	} else {
>  		/*
>  		 * In IOVA case, there is only one SG entry which spans
>  		 * for whole IOVA address space, but we need to make sure
>  		 * that it fits sg->length, maybe we need more.
>  		 */
> -		nents = DIV_ROUND_UP(size, UINT_MAX);
> +		nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
>  	}
>  
>  	return nents;


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-22  8:13 ` David Laight
@ 2026-06-22 21:26   ` David Hu
  2026-06-23  8:25     ` David Laight
  0 siblings, 1 reply; 8+ messages in thread
From: David Hu @ 2026-06-22 21:26 UTC (permalink / raw)
  To: David Laight
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, sashiko-bot, stable

On Mon, Jun 22, 2026 at 4:13 AM David Laight
<david.laight.linux@gmail.com> wrote:
>

Hi David,

Thank you for your review. You raised many good points regarding
optimizations here. I'll switch to using 2G as the max entry size
(`SZ_2G` from `linux/sizes.h`), and remove divisions and
multiplications. I'll also replace the `for()` loop with `while
(length)`, and drop `min_t()` in favor of `min()` by casting `SZ_2G`
to `size_t`. I'll send out a v2 with these changes shortly.

Thanks,
David

> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
>
> How did you find this?
> It requires a single buffer over 4GB - seems highly unlikely.

It was observed during experiments with buffers over 8GB on an accelerator.

> >
> > While the underlying IOMMU mapping may be contiguous, hardware
> > DMA engines often require explicit address alignment (e.g., page,
> > cacheline, or storage sector boundaries). Passing unaligned
> > addresses and lengths can cause explicit failures in DMA descriptor
> > creation or silent data corruption if lower unaligned bits are
> > truncated.
> >
> > Fix this by splitting the scatterlist by the largest possible page
> > aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
> > This ensures all scatterlist DMA addresses and lengths remain page
> > aligned and satisfy hardware constraints.
>
> It would almost certainly better to spilt into 2G chunks.
> That removes any need for any divisions.

I agree. 2G naturally aligns with most hardware boundaries, while also
allowing compiler optimizations with simple bit shifts.

>
> > Page-aligned entries allow the system to cleanly chunk payloads into
> > PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
> > As a result, this may help reduce TLP fragmentation in P2P transfers
> > and alleviate potential congestion within a logical PCIe switch
> > partition, especially when Relaxed Ordering is not possible due to
> > hardware constraints.
> >
> > Reported-by: sashiko-bot <sashiko-bot@kernel.org>
> > Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
> > Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: David Hu <xuehaohu@google.com>
> > ---
> >  drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> > index 794acff2546a..f2bde38fdb1f 100644
> > --- a/drivers/dma-buf/dma-buf-mapping.c
> > +++ b/drivers/dma-buf/dma-buf-mapping.c
> > @@ -5,6 +5,9 @@
> >   */
> >  #include <linux/dma-buf-mapping.h>
> >  #include <linux/dma-resv.h>
> > +#include <linux/align.h>
> > +
> > +#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
>
> >
> >  static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >                                        dma_addr_t addr)
> > @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >       unsigned int len, nents;
> >       int i;
> >
> > -     nents = DIV_ROUND_UP(length, UINT_MAX);
> > +     nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
> >       for (i = 0; i < nents; i++) {
>
> Why not change that to 'while (length) {' to avoid the division above.

Sounds good, will do.

>
> > -             len = min_t(size_t, length, UINT_MAX);
> > +             len = min_t(size_t, length, MAX_ENT_SZ);
>
> I bet that doesn't need to be min_t()

Agreed.


>
> >               length -= len;
> >               /*
> >                * DMABUF abuses scatterlist to create a scatterlist
> > @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >                * does not require the CPU list for mapping or unmapping.
> >                */
> >               sg_set_page(sgl, NULL, 0, 0);
> > -             sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> > +             sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
> >               sg_dma_len(sgl) = len;
>
> Replace the multiply with 'addr += len'.

Will update this as well.

>
> -- David
>
> >               sgl = sg_next(sgl);
> >       }
> > @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
> >
> >       if (!state || !dma_use_iova(state)) {
> >               for (i = 0; i < nr_ranges; i++)
> > -                     nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> > +                     nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
> >       } else {
> >               /*
> >                * In IOVA case, there is only one SG entry which spans
> >                * for whole IOVA address space, but we need to make sure
> >                * that it fits sg->length, maybe we need more.
> >                */
> > -             nents = DIV_ROUND_UP(size, UINT_MAX);
> > +             nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
> >       }
> >
> >       return nents;
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-22 21:26   ` David Hu
@ 2026-06-23  8:25     ` David Laight
  0 siblings, 0 replies; 8+ messages in thread
From: David Laight @ 2026-06-23  8:25 UTC (permalink / raw)
  To: David Hu
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, sashiko-bot, stable

On Mon, 22 Jun 2026 17:26:10 -0400
David Hu <xuehaohu@google.com> wrote:

> On Mon, Jun 22, 2026 at 4:13 AM David Laight
> <david.laight.linux@gmail.com> wrote:
> >  
> 
> Hi David,
> 
> Thank you for your review. You raised many good points regarding
> optimizations here. I'll switch to using 2G as the max entry size
> (`SZ_2G` from `linux/sizes.h`), and remove divisions and
> multiplications. I'll also replace the `for()` loop with `while
> (length)`, and drop `min_t()` in favor of `min()` by casting `SZ_2G`
> to `size_t`.

You shouldn't need a cast at all.

	David L.

> I'll send out a v2 with these changes shortly.
> 
> Thanks,
> David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-21 22:21 [PATCH] dma-buf: Split sgl by largest page-aligned chunk David Hu
  2026-06-21 22:34 ` sashiko-bot
  2026-06-22  8:13 ` David Laight
@ 2026-06-23  1:54 ` David Hu
  2026-06-23  2:08   ` sashiko-bot
  2026-06-23  8:44   ` David Laight
  2 siblings, 2 replies; 8+ messages in thread
From: David Hu @ 2026-06-23  1:54 UTC (permalink / raw)
  To: Sumit Semwal, Christian König
  Cc: David Laight, Jason Gunthorpe, Nicolin Chen, Leon Romanovsky,
	Kevin Tian, Ankit Agrawal, Alex Williamson, linux-media,
	dri-devel, linaro-mm-sig, linux-kernel, iommu, jmoroni, praan,
	kpberry, chriscli, sashiko-bot, stable, David Hu

Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
first entry, resulting in non-page-aligned DMA addresses for all
subsequent entries.

While the underlying IOMMU mapping may be contiguous, hardware
DMA engines often require explicit address alignment (e.g., page,
cacheline, or storage sector boundaries). Passing unaligned
addresses and lengths can cause explicit failures in DMA descriptor
creation or silent data corruption if lower unaligned bits are
truncated.

Fix this by splitting the scatterlist into 2G chunks. An alternative
previously considered was to use the largest page aligned chunk within
`UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`) to satisfy page
alignment. A 2G chunk is better as it naturally aligns with most known
hardware boundaries, while also allowing compiler optimizations with
simple bit shifts. This ensures all scatterlist DMA addresses and
lengths remain page aligned and satisfy hardware constraints.

Page-aligned entries allow the system to cleanly chunk payloads into
PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
As a result, this may help reduce TLP fragmentation in P2P transfers
and alleviate potential congestion within a logical PCIe switch
partition, especially when Relaxed Ordering is not possible due to
hardware constraints.

Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Cc: stable@vger.kernel.org
Signed-off-by: David Hu <xuehaohu@google.com>
---
 Changes in v2:
 - Updated commit title and message to reflect the switch to 2G chunks
 - Switch to using 2G as the max sg entry size as it naturally aligns
   with most hardware boundaries, while allowing compiler optimizations
   with bit shifts (David Laight)
 - Optimized away division calculation for `nent`, and multiplication
   calculation for sgl address, by dropping the `for` loop in favor of a
   `while (length)` loop (David Laight)
 - Dropped `min_t` in favor of `min()` to maintain a strict type
   checking safety net (David Laight)

 drivers/dma-buf/dma-buf-mapping.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index 794acff2546a..2d88e08c5ebf 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -5,16 +5,17 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/dma-resv.h>
+#include <linux/sizes.h>
+
+#define MAX_SG_ENT_SZ ((size_t)SZ_2G)
 
 static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 					 dma_addr_t addr)
 {
-	unsigned int len, nents;
-	int i;
+	size_t len;
 
-	nents = DIV_ROUND_UP(length, UINT_MAX);
-	for (i = 0; i < nents; i++) {
-		len = min_t(size_t, length, UINT_MAX);
+	while (length) {
+		len = min(length, MAX_SG_ENT_SZ);
 		length -= len;
 		/*
 		 * DMABUF abuses scatterlist to create a scatterlist
@@ -24,11 +25,12 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 		 * does not require the CPU list for mapping or unmapping.
 		 */
 		sg_set_page(sgl, NULL, 0, 0);
-		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+		sg_dma_address(sgl) = addr;
 		sg_dma_len(sgl) = len;
+		addr += len;
+		/* Unconditionally advance. On last segment, this becomes NULL */
 		sgl = sg_next(sgl);
 	}
-
 	return sgl;
 }
 
@@ -41,14 +43,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
 
 	if (!state || !dma_use_iova(state)) {
 		for (i = 0; i < nr_ranges; i++)
-			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_SG_ENT_SZ);
 	} else {
 		/*
 		 * In IOVA case, there is only one SG entry which spans
 		 * for whole IOVA address space, but we need to make sure
 		 * that it fits sg->length, maybe we need more.
 		 */
-		nents = DIV_ROUND_UP(size, UINT_MAX);
+		nents = DIV_ROUND_UP(size, MAX_SG_ENT_SZ);
 	}
 
 	return nents;
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
@ 2026-06-23  2:08   ` sashiko-bot
  2026-06-23  8:44   ` David Laight
  1 sibling, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-06-23  2:08 UTC (permalink / raw)
  To: David Hu; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Critical] Arithmetic overflow in `calc_sg_nents` on 32-bit systems causes an under-allocation of scatterlist entries, leading to a NULL pointer dereference in `fill_sg_entry`.

Pre-existing issues:
- [High] Incorrect DMA unmapping of split scatterlist chunks and unpopulated entries violates the DMA API and corrupts IOMMU state.
--

commit f3f35bb73743f61d1b46175517c947331e0902f1
Author: David Hu <xuehaohu@google.com>

dma-buf: Split sgl into page-aligned 2G chunks

Currently, fill_sg_entry() splits the scatterlist using UINT_MAX. This patch
changes the split to 2G chunks to satisfy page alignment and natural hardware
boundaries for DMA engines.

Link: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/

> diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> index 794acff2546a3..2d88e08c5ebf9 100644
> --- a/drivers/dma-buf/dma-buf-mapping.c
> +++ b/drivers/dma-buf/dma-buf-mapping.c
> @@ -5,16 +5,17 @@
>   */
>  #include <linux/dma-buf-mapping.h>
>  #include <linux/dma-resv.h>
> +#include <linux/sizes.h>
> +
> +#define MAX_SG_ENT_SZ ((size_t)SZ_2G)
>  
>  static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  					 dma_addr_t addr)
>  {
> -	unsigned int len, nents;
> -	int i;
> +	size_t len;
>  
> -	nents = DIV_ROUND_UP(length, UINT_MAX);
> -	for (i = 0; i < nents; i++) {
> -		len = min_t(size_t, length, UINT_MAX);
> +	while (length) {
> +		len = min(length, MAX_SG_ENT_SZ);
>  		length -= len;
>  		/*
>  		 * DMABUF abuses scatterlist to create a scatterlist
> @@ -24,11 +25,12 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  		 * does not require the CPU list for mapping or unmapping.
>  		 */
>  		sg_set_page(sgl, NULL, 0, 0);

[Severity: Critical]
Will this lead to a NULL pointer dereference on 32-bit systems?

If phys_vec[i].len is greater than 2GB (e.g. 0x80000001), the DIV_ROUND_UP
macro in calc_sg_nents() will overflow because size_t is 32-bit and
MAX_SG_ENT_SZ is 0x80000000:

    (0x80000001 + 0x80000000 - 1) / 0x80000000 == 0

This means no SGL entries are calculated or allocated for this region. However,
the while loop in fill_sg_entry() will still iterate to process the remaining
length. It will eventually advance past the end of the undersized scatterlist
table via:

    sgl = sg_next(sgl);

which becomes NULL, and then dereference it on the next iteration via
sg_set_page(sgl, ...).

> -		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> +		sg_dma_address(sgl) = addr;
>  		sg_dma_len(sgl) = len;
> +		addr += len;
> +		/* Unconditionally advance. On last segment, this becomes NULL */
>  		sgl = sg_next(sgl);
>  	}
> -
>  	return sgl;
>  }
>  
> @@ -41,14 +43,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
>  
>  	if (!state || !dma_use_iova(state)) {
>  		for (i = 0; i < nr_ranges; i++)
> -			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> +			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_SG_ENT_SZ);
>  	} else {
>  		/*
>  		 * In IOVA case, there is only one SG entry which spans
>  		 * for whole IOVA address space, but we need to make sure
>  		 * that it fits sg->length, maybe we need more.
>  		 */
> -		nents = DIV_ROUND_UP(size, UINT_MAX);
> +		nents = DIV_ROUND_UP(size, MAX_SG_ENT_SZ);
>  	}
>  
>  	return nents;

[Severity: High]
This isn't a bug introduced by this patch, but does iterating over the SGL to
unmap chunked entries violate the DMA API?

In dma_buf_phys_vec_to_sgt(), dma_map_phys() maps the full, contiguous original
length of phys_vec[i].len. However, fill_sg_entry() splits this mapped region
into smaller chunks.

When unmapping, both the error path in dma_buf_phys_vec_to_sgt() and the
teardown in dma_buf_free_sgt() iterate over the split SGL entries:

drivers/dma-buf/dma-buf-mapping.c:dma_buf_free_sgt() {
    ...
		for_each_sgtable_dma_sg(sgt, sgl, i)
			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
    ...
}

Does the DMA API permit unmapping partial chunks instead of the exact length
and address originally mapped?

Additionally, in the error path of dma_buf_phys_vec_to_sgt():

drivers/dma-buf/dma-buf-mapping.c:dma_buf_phys_vec_to_sgt() {
    ...
err_unmap_dma:
	if (!i || !dma->state) {
		; /* Do nothing */
	...
	} else {
		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
	}
    ...
}

Is it safe to reuse the outer loop variable i here? Since i was tracking the
successfully mapped count, using it in for_each_sgtable_dma_sg() clobbers it
and forces iteration up to the sgt->nents capacity, which could unmap
unpopulated zero-initialized entries or page 0.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623015459.1153884-1-xuehaohu@google.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
  2026-06-23  2:08   ` sashiko-bot
@ 2026-06-23  8:44   ` David Laight
  1 sibling, 0 replies; 8+ messages in thread
From: David Laight @ 2026-06-23  8:44 UTC (permalink / raw)
  To: David Hu
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, chriscli, sashiko-bot, stable

On Tue, 23 Jun 2026 01:54:59 +0000
David Hu <xuehaohu@google.com> wrote:

> Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> first entry, resulting in non-page-aligned DMA addresses for all
> subsequent entries.

There is a separate issue of whether this code is even needed at all.
Where can transfers over 2G (never mind 4G) actually come from.

The read, write and similar system calls limit transfers to INT_MAX
(even on 64bit) and a lot of driver code will need fixing it longer
lengths are allowed though.
io_uring better enforce the same limits.
So the transfers can come directly from userspace.

Not only that but you also need a single physically contiguous buffer.
Good luck allocating that!

Now maybe there are some peer-to-peer places where the large buffer
is device memory, but they will be unusual and probably need
special treatment anyway.

	David

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-23  8:44 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-21 22:21 [PATCH] dma-buf: Split sgl by largest page-aligned chunk David Hu
2026-06-21 22:34 ` sashiko-bot
2026-06-22  8:13 ` David Laight
2026-06-22 21:26   ` David Hu
2026-06-23  8:25     ` David Laight
2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
2026-06-23  2:08   ` sashiko-bot
2026-06-23  8:44   ` David Laight

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.