[PATCH] dma-buf: Split sgl by largest page-aligned chunk

The Linux Kernel Mailing List

* [PATCH] dma-buf: Split sgl by largest page-aligned chunk
@ 2026-06-21 22:21 David Hu
  2026-06-22  8:13 ` David Laight
  2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
  0 siblings, 2 replies; 14+ messages in thread
From: David Hu @ 2026-06-21 22:21 UTC (permalink / raw)
  To: Sumit Semwal, Christian König
  Cc: Jason Gunthorpe, Nicolin Chen, Leon Romanovsky, Kevin Tian,
	Ankit Agrawal, Alex Williamson, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, iommu, jmoroni, praan, kpberry,
	David Hu, sashiko-bot, stable

Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
first entry, resulting in non-page-aligned DMA addresses for all
subsequent entries.

While the underlying IOMMU mapping may be contiguous, hardware
DMA engines often require explicit address alignment (e.g., page,
cacheline, or storage sector boundaries). Passing unaligned
addresses and lengths can cause explicit failures in DMA descriptor
creation or silent data corruption if lower unaligned bits are
truncated.

Fix this by splitting the scatterlist by the largest possible page
aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
This ensures all scatterlist DMA addresses and lengths remain page
aligned and satisfy hardware constraints.
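
As an illustration only (not part of the posted patch), the arithmetic
above can be sketched in userspace C. The 4 KiB page size and the names
`PAGE_SIZE_ASSUMED`, `ALIGN_DOWN_POW2`, `entry_addr` and
`is_page_aligned` are assumptions made for this sketch; the kernel uses
`PAGE_SIZE` and `ALIGN_DOWN()` instead.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096ULL /* assumption: 4 KiB pages */

/* Round x down to a multiple of a; a must be a power of two. */
#define ALIGN_DOWN_POW2(x, a) ((x) & ~((uint64_t)(a) - 1))

/* DMA address of the i-th entry when a range starting at base is
 * split into chunk-sized pieces (mirrors addr + i * chunk in v1). */
static uint64_t entry_addr(uint64_t base, unsigned int i, uint64_t chunk)
{
	assert(chunk != 0);
	return base + (uint64_t)i * chunk;
}

/* Non-zero when v is a multiple of the assumed page size. */
static int is_page_aligned(uint64_t v)
{
	return (v & (PAGE_SIZE_ASSUMED - 1)) == 0;
}
```

With a page-aligned base, the second entry is misaligned when the chunk
is `UINT_MAX`, and aligned when the chunk is
`ALIGN_DOWN_POW2(UINT_MAX, PAGE_SIZE_ASSUMED)`.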

Page-aligned entries allow the system to cleanly chunk payloads into
PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
As a result, this may help reduce TLP fragmentation in P2P transfers
and alleviate potential congestion within a logical PCIe switch
partition, especially when Relaxed Ordering is not possible due to
hardware constraints.

Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Cc: stable@vger.kernel.org
Signed-off-by: David Hu <xuehaohu@google.com>
---
 drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index 794acff2546a..f2bde38fdb1f 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -5,6 +5,9 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/dma-resv.h>
+#include <linux/align.h>
+
+#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
 
 static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 					 dma_addr_t addr)
@@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 	unsigned int len, nents;
 	int i;
 
-	nents = DIV_ROUND_UP(length, UINT_MAX);
+	nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
 	for (i = 0; i < nents; i++) {
-		len = min_t(size_t, length, UINT_MAX);
+		len = min_t(size_t, length, MAX_ENT_SZ);
 		length -= len;
 		/*
 		 * DMABUF abuses scatterlist to create a scatterlist
@@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 		 * does not require the CPU list for mapping or unmapping.
 		 */
 		sg_set_page(sgl, NULL, 0, 0);
-		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+		sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
 		sg_dma_len(sgl) = len;
 		sgl = sg_next(sgl);
 	}
@@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
 
 	if (!state || !dma_use_iova(state)) {
 		for (i = 0; i < nr_ranges; i++)
-			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
 	} else {
 		/*
 		 * In IOVA case, there is only one SG entry which spans
 		 * for whole IOVA address space, but we need to make sure
 		 * that it fits sg->length, maybe we need more.
 		 */
-		nents = DIV_ROUND_UP(size, UINT_MAX);
+		nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
 	}
 
 	return nents;
-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog


* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-21 22:21 [PATCH] dma-buf: Split sgl by largest page-aligned chunk David Hu
@ 2026-06-22  8:13 ` David Laight
  2026-06-22 21:26   ` David Hu
  2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
  1 sibling, 1 reply; 14+ messages in thread
From: David Laight @ 2026-06-22  8:13 UTC (permalink / raw)
  To: David Hu
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, sashiko-bot, stable

On Sun, 21 Jun 2026 22:21:30 +0000
David Hu <xuehaohu@google.com> wrote:

> Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> first entry, resulting in non-page-aligned DMA addresses for all
> subsequent entries.

How did you find this?
It requires a single buffer over 4GB - seems highly unlikely.


> 
> While the underlying IOMMU mapping may be contiguous, hardware
> DMA engines often require explicit address alignment (e.g., page,
> cacheline, or storage sector boundaries). Passing unaligned
> addresses and lengths can cause explicit failures in DMA descriptor
> creation or silent data corruption if lower unaligned bits are
> truncated.
> 
> Fix this by splitting the scatterlist by the largest possible page
> aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
> This ensures all scatterlist DMA addresses and lengths remain page
> aligned and satisfy hardware constraints.

It would almost certainly be better to split into 2G chunks.
That removes any need for any divisions.

> Page-aligned entries allow the system to cleanly chunk payloads into
> PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
> As a result, this may help reduce TLP fragmentation in P2P transfers
> and alleviate potential congestion within a logical PCIe switch
> partition, especially when Relaxed Ordering is not possible due to
> hardware constraints.
> 
> Reported-by: sashiko-bot <sashiko-bot@kernel.org>
> Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
> Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> Cc: stable@vger.kernel.org
> Signed-off-by: David Hu <xuehaohu@google.com>
> ---
>  drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> index 794acff2546a..f2bde38fdb1f 100644
> --- a/drivers/dma-buf/dma-buf-mapping.c
> +++ b/drivers/dma-buf/dma-buf-mapping.c
> @@ -5,6 +5,9 @@
>   */
>  #include <linux/dma-buf-mapping.h>
>  #include <linux/dma-resv.h>
> +#include <linux/align.h>
> +
> +#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)

>  
>  static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  					 dma_addr_t addr)
> @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  	unsigned int len, nents;
>  	int i;
>  
> -	nents = DIV_ROUND_UP(length, UINT_MAX);
> +	nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
>  	for (i = 0; i < nents; i++) {

Why not change that to 'while (length) {' to avoid the division above?

> -		len = min_t(size_t, length, UINT_MAX);
> +		len = min_t(size_t, length, MAX_ENT_SZ);

I bet that doesn't need to be min_t()

>  		length -= len;
>  		/*
>  		 * DMABUF abuses scatterlist to create a scatterlist
> @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>  		 * does not require the CPU list for mapping or unmapping.
>  		 */
>  		sg_set_page(sgl, NULL, 0, 0);
> -		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> +		sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
>  		sg_dma_len(sgl) = len;

Replace the multiply with 'addr += len'.

-- David

>  		sgl = sg_next(sgl);
>  	}
> @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
>  
>  	if (!state || !dma_use_iova(state)) {
>  		for (i = 0; i < nr_ranges; i++)
> -			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> +			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
>  	} else {
>  		/*
>  		 * In IOVA case, there is only one SG entry which spans
>  		 * for whole IOVA address space, but we need to make sure
>  		 * that it fits sg->length, maybe we need more.
>  		 */
> -		nents = DIV_ROUND_UP(size, UINT_MAX);
> +		nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
>  	}
>  
>  	return nents;


* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-22  8:13 ` David Laight
@ 2026-06-22 21:26   ` David Hu
  2026-06-23  8:25     ` David Laight
  0 siblings, 1 reply; 14+ messages in thread
From: David Hu @ 2026-06-22 21:26 UTC (permalink / raw)
  To: David Laight
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, sashiko-bot, stable

On Mon, Jun 22, 2026 at 4:13 AM David Laight
<david.laight.linux@gmail.com> wrote:
>

Hi David,

Thank you for your review. You raised many good points regarding
optimizations here. I'll switch to using 2G as the max entry size
(`SZ_2G` from `linux/sizes.h`), and remove divisions and
multiplications. I'll also replace the `for()` loop with `while
(length)`, and drop `min_t()` in favor of `min()` by casting `SZ_2G`
to `size_t`. I'll send out a v2 with these changes shortly.

Thanks,
David

> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
>
> How did you find this?
> It requires a single buffer over 4GB - seems highly unlikely.

It was observed during experiments with buffers over 8GB on an accelerator.

> >
> > While the underlying IOMMU mapping may be contiguous, hardware
> > DMA engines often require explicit address alignment (e.g., page,
> > cacheline, or storage sector boundaries). Passing unaligned
> > addresses and lengths can cause explicit failures in DMA descriptor
> > creation or silent data corruption if lower unaligned bits are
> > truncated.
> >
> > Fix this by splitting the scatterlist by the largest possible page
> > aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
> > This ensures all scatterlist DMA addresses and lengths remain page
> > aligned and satisfy hardware constraints.
>
> It would almost certainly be better to split into 2G chunks.
> That removes any need for any divisions.

I agree. 2G naturally aligns with most hardware boundaries, while also
allowing compiler optimizations with simple bit shifts.

>
> > Page-aligned entries allow the system to cleanly chunk payloads into
> > PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
> > As a result, this may help reduce TLP fragmentation in P2P transfers
> > and alleviate potential congestion within a logical PCIe switch
> > partition, especially when Relaxed Ordering is not possible due to
> > hardware constraints.
> >
> > Reported-by: sashiko-bot <sashiko-bot@kernel.org>
> > Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
> > Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: David Hu <xuehaohu@google.com>
> > ---
> >  drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> > index 794acff2546a..f2bde38fdb1f 100644
> > --- a/drivers/dma-buf/dma-buf-mapping.c
> > +++ b/drivers/dma-buf/dma-buf-mapping.c
> > @@ -5,6 +5,9 @@
> >   */
> >  #include <linux/dma-buf-mapping.h>
> >  #include <linux/dma-resv.h>
> > +#include <linux/align.h>
> > +
> > +#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
>
> >
> >  static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >                                        dma_addr_t addr)
> > @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >       unsigned int len, nents;
> >       int i;
> >
> > -     nents = DIV_ROUND_UP(length, UINT_MAX);
> > +     nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
> >       for (i = 0; i < nents; i++) {
>
> Why not change that to 'while (length) {' to avoid the division above.

Sounds good, will do.

>
> > -             len = min_t(size_t, length, UINT_MAX);
> > +             len = min_t(size_t, length, MAX_ENT_SZ);
>
> I bet that doesn't need to be min_t()

Agreed.


>
> >               length -= len;
> >               /*
> >                * DMABUF abuses scatterlist to create a scatterlist
> > @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >                * does not require the CPU list for mapping or unmapping.
> >                */
> >               sg_set_page(sgl, NULL, 0, 0);
> > -             sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> > +             sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
> >               sg_dma_len(sgl) = len;
>
> Replace the multiply with 'addr += len'.

Will update this as well.

>
> -- David
>
> >               sgl = sg_next(sgl);
> >       }
> > @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
> >
> >       if (!state || !dma_use_iova(state)) {
> >               for (i = 0; i < nr_ranges; i++)
> > -                     nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> > +                     nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
> >       } else {
> >               /*
> >                * In IOVA case, there is only one SG entry which spans
> >                * for whole IOVA address space, but we need to make sure
> >                * that it fits sg->length, maybe we need more.
> >                */
> > -             nents = DIV_ROUND_UP(size, UINT_MAX);
> > +             nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
> >       }
> >
> >       return nents;
>

* [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-21 22:21 [PATCH] dma-buf: Split sgl by largest page-aligned chunk David Hu
  2026-06-22  8:13 ` David Laight
@ 2026-06-23  1:54 ` David Hu
  2026-06-23  8:44   ` David Laight
  1 sibling, 1 reply; 14+ messages in thread
From: David Hu @ 2026-06-23  1:54 UTC (permalink / raw)
  To: Sumit Semwal, Christian König
  Cc: David Laight, Jason Gunthorpe, Nicolin Chen, Leon Romanovsky,
	Kevin Tian, Ankit Agrawal, Alex Williamson, linux-media,
	dri-devel, linaro-mm-sig, linux-kernel, iommu, jmoroni, praan,
	kpberry, chriscli, sashiko-bot, stable, David Hu

Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
first entry, resulting in non-page-aligned DMA addresses for all
subsequent entries.

While the underlying IOMMU mapping may be contiguous, hardware
DMA engines often require explicit address alignment (e.g., page,
cacheline, or storage sector boundaries). Passing unaligned
addresses and lengths can cause explicit failures in DMA descriptor
creation or silent data corruption if lower unaligned bits are
truncated.

Fix this by splitting the scatterlist into 2G chunks. An alternative
previously considered was to use the largest page aligned chunk within
`UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`) to satisfy page
alignment. A 2G chunk is better as it naturally aligns with most known
hardware boundaries, while also allowing compiler optimizations with
simple bit shifts. This ensures all scatterlist DMA addresses and
lengths remain page aligned and satisfy hardware constraints.
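
As an illustration only, the `while (length)` splitting described above
can be sketched in userspace C. This is not the kernel code: `struct
seg`, `split_2g` and `CHUNK_2G` are hypothetical names standing in for
the scatterlist entry, `fill_sg_entry()` and `SZ_2G`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_2G (UINT64_C(1) << 31) /* stands in for SZ_2G */

/* Hypothetical stand-in for one DMA-only scatterlist entry. */
struct seg {
	uint64_t addr;
	uint32_t len; /* 2G fits in 32 bits, like sg_dma_len() */
};

/* Split [addr, addr + length) into entries of at most 2G each.
 * Returns the number of entries written; max is the capacity of out. */
static size_t split_2g(uint64_t addr, uint64_t length,
		       struct seg *out, size_t max)
{
	size_t n = 0;

	while (length) {
		uint64_t len = length < CHUNK_2G ? length : CHUNK_2G;

		assert(n < max);
		out[n].addr = addr;
		out[n].len = (uint32_t)len;
		addr += len;   /* no multiplication needed */
		length -= len; /* loop ends when nothing is left */
		n++;
	}
	return n;
}
```

For a page-aligned start address, every entry except possibly the last
has a 2G length, so every entry address stays page aligned.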

Page-aligned entries allow the system to cleanly chunk payloads into
PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
As a result, this may help reduce TLP fragmentation in P2P transfers
and alleviate potential congestion within a logical PCIe switch
partition, especially when Relaxed Ordering is not possible due to
hardware constraints.

Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260609165431.778061F00893@smtp.kernel.org/
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Cc: stable@vger.kernel.org
Signed-off-by: David Hu <xuehaohu@google.com>
---
 Changes in v2:
 - Updated commit title and message to reflect the switch to 2G chunks
 - Switch to using 2G as the max sg entry size as it naturally aligns
   with most hardware boundaries, while allowing compiler optimizations
   with bit shifts (David Laight)
 - Optimized away division calculation for `nents`, and multiplication
   calculation for sgl address, by dropping the `for` loop in favor of a
   `while (length)` loop (David Laight)
 - Dropped `min_t` in favor of `min()` to maintain a strict type
   checking safety net (David Laight)

 drivers/dma-buf/dma-buf-mapping.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index 794acff2546a..2d88e08c5ebf 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -5,16 +5,17 @@
  */
 #include <linux/dma-buf-mapping.h>
 #include <linux/dma-resv.h>
+#include <linux/sizes.h>
+
+#define MAX_SG_ENT_SZ ((size_t)SZ_2G)
 
 static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 					 dma_addr_t addr)
 {
-	unsigned int len, nents;
-	int i;
+	size_t len;
 
-	nents = DIV_ROUND_UP(length, UINT_MAX);
-	for (i = 0; i < nents; i++) {
-		len = min_t(size_t, length, UINT_MAX);
+	while (length) {
+		len = min(length, MAX_SG_ENT_SZ);
 		length -= len;
 		/*
 		 * DMABUF abuses scatterlist to create a scatterlist
@@ -24,11 +25,12 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
 		 * does not require the CPU list for mapping or unmapping.
 		 */
 		sg_set_page(sgl, NULL, 0, 0);
-		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+		sg_dma_address(sgl) = addr;
 		sg_dma_len(sgl) = len;
+		addr += len;
+		/* Unconditionally advance. On last segment, this becomes NULL */
 		sgl = sg_next(sgl);
 	}
-
 	return sgl;
 }
 
@@ -41,14 +43,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
 
 	if (!state || !dma_use_iova(state)) {
 		for (i = 0; i < nr_ranges; i++)
-			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+			nents += DIV_ROUND_UP(phys_vec[i].len, MAX_SG_ENT_SZ);
 	} else {
 		/*
 		 * In IOVA case, there is only one SG entry which spans
 		 * for whole IOVA address space, but we need to make sure
 		 * that it fits sg->length, maybe we need more.
 		 */
-		nents = DIV_ROUND_UP(size, UINT_MAX);
+		nents = DIV_ROUND_UP(size, MAX_SG_ENT_SZ);
 	}
 
 	return nents;
-- 
2.55.0.rc0.799.gd6f94ed593-goog


* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-22 21:26   ` David Hu
@ 2026-06-23  8:25     ` David Laight
  2026-06-23 21:03       ` David Hu
  0 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2026-06-23  8:25 UTC (permalink / raw)
  To: David Hu
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, sashiko-bot, stable

On Mon, 22 Jun 2026 17:26:10 -0400
David Hu <xuehaohu@google.com> wrote:

> On Mon, Jun 22, 2026 at 4:13 AM David Laight
> <david.laight.linux@gmail.com> wrote:
> >  
> 
> Hi David,
> 
> Thank you for your review. You raised many good points regarding
> optimizations here. I'll switch to using 2G as the max entry size
> (`SZ_2G` from `linux/sizes.h`), and remove divisions and
> multiplications. I'll also replace the `for()` loop with `while
> (length)`, and drop `min_t()` in favor of `min()` by casting `SZ_2G`
> to `size_t`.

You shouldn't need a cast at all.

	David L.

> I'll send out a v2 with these changes shortly.
> 
> Thanks,
> David

* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
@ 2026-06-23  8:44   ` David Laight
  2026-06-23 20:55     ` Pranjal Shrivastava
  2026-06-30 12:38     ` Jason Gunthorpe
  0 siblings, 2 replies; 14+ messages in thread
From: David Laight @ 2026-06-23  8:44 UTC (permalink / raw)
  To: David Hu
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, chriscli, sashiko-bot, stable

On Tue, 23 Jun 2026 01:54:59 +0000
David Hu <xuehaohu@google.com> wrote:

> Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> first entry, resulting in non-page-aligned DMA addresses for all
> subsequent entries.

There is a separate issue of whether this code is even needed at all.
Where can transfers over 2G (never mind 4G) actually come from?

The read, write and similar system calls limit transfers to INT_MAX
(even on 64bit) and a lot of driver code will need fixing if longer
lengths are allowed though.
io_uring had better enforce the same limits.
So the transfers can't come directly from userspace.

Not only that but you also need a single physically contiguous buffer.
Good luck allocating that!

Now maybe there are some peer-to-peer places where the large buffer
is device memory, but they will be unusual and probably need
special treatment anyway.

	David

* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23  8:44   ` David Laight
@ 2026-06-23 20:55     ` Pranjal Shrivastava
  2026-06-23 22:53       ` David Laight
  2026-06-30 12:38     ` Jason Gunthorpe
  1 sibling, 1 reply; 14+ messages in thread
From: Pranjal Shrivastava @ 2026-06-23 20:55 UTC (permalink / raw)
  To: David Laight
  Cc: David Hu, Sumit Semwal, Christian König, Jason Gunthorpe,
	Nicolin Chen, Leon Romanovsky, Kevin Tian, Ankit Agrawal,
	Alex Williamson, linux-media, dri-devel, linaro-mm-sig,
	linux-kernel, iommu, jmoroni, kpberry, chriscli, sashiko-bot,
	stable

On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:

Hi David,

> On Tue, 23 Jun 2026 01:54:59 +0000
> David Hu <xuehaohu@google.com> wrote:
> 
> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
> 
> There is a separate issue of whether this code is even needed at all.
> Where can transfers over 2G (never mind 4G) actually come from?
> 
> The read, write and similar system calls limit transfers to INT_MAX
> (even on 64bit) and a lot of driver code will need fixing if longer
> lengths are allowed though.
> io_uring had better enforce the same limits.
> So the transfers can't come directly from userspace.
> 
> Not only that but you also need a single physically contiguous buffer.
> Good luck allocating that!
> 
> Now maybe there are some peer-to-peer places where the large buffer
> is device memory, but they will be unusual and probably need
> special treatment anyway.
> 

I agree that traditional VFS read/write face the MAX_RW_COUNT limit 
(~2GB), and io_uring has its limits, but I'm a little confused by the
push to enforce these limits here in the SGL code?

File I/O seems to be only one side of the picture. In my view, this fix
is necessary and certainly has a use-case:

For example, the RDMA subsystem has the capability to import dmabufs [1],
which gives rise to use cases for dmabuf beyond standard file ops 
(via VFS/io_uring). 

In these scenarios, GPU HBM can be exported as dmabufs. With recent GPUs,
HBM capacity can be in the order of hundreds of GBs [2]. RDMA can employ
infrastructure like the vfio-dmabuf-exporter [3] or similar dmabuf 
exporters to frequently move huge blocks of data via P2PDMA.

If we restrict incoming dmabuf transfers to fit within VFS-centric 
limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
it to manage a significantly higher number of memory registrations. By 
cleanly splitting these massive contiguous device buffers into 
page-aligned SGL entries, we directly improve the efficiency of P2P 
transfers and memory registration.

Since this change doesn't seem to have a negative impact on standard file
I/O or break existing VFS constraints, I'm curious why we shouldn't 
support splitting these >4GB P2P transfers? Am I missing something?

Thanks,
Praan

[1] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/infiniband/core/umem_dmabuf.c#L174 
[2] https://nvdam.widen.net/s/fdvdqvfvj2/hopper-h200-nvl-product-brief (Table 2-2)
[3] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/vfio/pci/vfio_pci_dmabuf.c#L297

* Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk
  2026-06-23  8:25     ` David Laight
@ 2026-06-23 21:03       ` David Hu
  0 siblings, 0 replies; 14+ messages in thread
From: David Hu @ 2026-06-23 21:03 UTC (permalink / raw)
  To: David Laight
  Cc: Sumit Semwal, Christian König, Jason Gunthorpe, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, sashiko-bot, stable

On Tue, Jun 23, 2026 at 4:25 AM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Mon, 22 Jun 2026 17:26:10 -0400
> David Hu <xuehaohu@google.com> wrote:
>
> > On Mon, Jun 22, 2026 at 4:13 AM David Laight
> > <david.laight.linux@gmail.com> wrote:
> > >
> >
> > Hi David,
> >
> > Thank you for your review. You raised many good points regarding
> > optimizations here. I'll switch to using 2G as the max entry size
> > (`SZ_2G` from `linux/sizes.h`), and remove divisions and
> > multiplications. I'll also replace the `for()` loop with `while
> > (length)`, and drop `min_t()` in favor of `min()` by casting `SZ_2G`
> > to `size_t`.
>
> You shouldn't need a cast at all.

Hi David,

You are right. It looks like `min(length, CONSTANT)` works well here
without triggering any type mismatch warnings, regardless of whether
`CONSTANT` is `SZ_1G` (`int`), `SZ_2G` (`unsigned int`), `SZ_4G`
(`unsigned long long`), or larger. I'll drop the cast and send out a
v3 shortly.

Thanks,
David

* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23 20:55     ` Pranjal Shrivastava
@ 2026-06-23 22:53       ` David Laight
  2026-06-24 14:31         ` Leon Romanovsky
  2026-06-30 12:42         ` Jason Gunthorpe
  0 siblings, 2 replies; 14+ messages in thread
From: David Laight @ 2026-06-23 22:53 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: David Hu, Sumit Semwal, Christian König, Jason Gunthorpe,
	Nicolin Chen, Leon Romanovsky, Kevin Tian, Ankit Agrawal,
	Alex Williamson, linux-media, dri-devel, linaro-mm-sig,
	linux-kernel, iommu, jmoroni, kpberry, chriscli, sashiko-bot,
	stable

On Tue, 23 Jun 2026 20:55:32 +0000
Pranjal Shrivastava <praan@google.com> wrote:

> On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:
> 
> Hi David,
> 
> > On Tue, 23 Jun 2026 01:54:59 +0000
> > David Hu <xuehaohu@google.com> wrote:
> >   
> > > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > > first entry, resulting in non-page-aligned DMA addresses for all
> > > subsequent entries.  
> > 
> > There is a separate issue of whether this code is even needed at all.
> > Where can transfers over 2G (never mind 4G) actually come from?
> > 
> > The read, write and similar system calls limit transfers to INT_MAX
> > (even on 64bit) and a lot of driver code will need fixing if longer
> > lengths are allowed though.
> > io_uring had better enforce the same limits.
> > So the transfers can't come directly from userspace.
> > 
> > Not only that but you also need a single physically contiguous buffer.
> > Good luck allocating that!
> > 
> > Now maybe there are some peer-to-peer places where the large buffer
> > is device memory, but they will be unusual and probably need
> > special treatment anyway.
> >   
> 
> I agree that traditional VFS read/write face the MAX_RW_COUNT limit 
> (~2GB), and io_uring has its limits, but I'm a little confused by the
> push to enforce these limits here in the SGL code?
> 
> File I/O seems to be only one side of the picture. In my view, this fix
> is necessary and certainly has a use-case:
> 
> For example, the RDMA subsystem has the capability to import dmabufs [1],
> which gives rise to use cases for dmabuf beyond standard file ops 
> (via VFS/io_uring). 
> 
> In these scenarios, GPU HBM can be exported as dmabufs. With recent GPUs,
> HBM capacity can be in the order of hundreds of GBs [2]. RDMA can employ
> infrastructure like the vfio-dmabuf-exporter [3] or similar dmabuf 
> exporters to frequently move huge blocks of data via P2PDMA.

Ok, that explains where big buffers can come from.
I just wasn't sure.

> If we restrict incoming dmabuf transfers to fit within VFS-centric 
> limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
> it to manage a significantly higher number of memory registrations. By 
> cleanly splitting these massive contiguous device buffers into 
> page-aligned SGL entries, we directly improve the efficiency of P2P 
> transfers and memory registration.

But a divide by '4G - PAGE_SIZE' is also non-trivial and (I think) affects
a lot of I/O where the quotient is always 1.
Splitting into 2G chunks is a lot cheaper.

> Since this change doesn't seem to have a negative impact on standard file
> I/O or break existing VFS constraints, I'm curious why we shouldn't 
> support splitting these >4GB P2P transfers? Am I missing something?

I was only wondering whether it was needed...
It does bring up the question of why the >4GB transfers even need splitting.
But that is another question.

If you want to split large transfers into 4G-PAGE_SIZE blocks
it is probably worth having a quick test that returns 1 for 'small' buffers.
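
As an illustration only, the two entry-count strategies being compared
can be sketched in userspace C. The names `nents_aligned_max` and
`nents_2g`, and the 4 KiB page size, are assumptions for this sketch
and do not appear in the patch. Neither function guards against
overflow for lengths near the top of the 64-bit range.

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096ULL /* assumption: 4 KiB pages */

/* Largest page-aligned length that fits in 32 bits: 4G - PAGE_SIZE. */
#define MAX_ENT_ALIGNED (0xFFFFFFFFULL & ~(PAGE_SIZE_ASSUMED - 1))

/* Entry count with a '4G - PAGE_SIZE' limit. The quick test returns 1
 * for the common small-buffer case, so the division is only reached
 * for zero-length or very large buffers. */
static uint64_t nents_aligned_max(uint64_t len)
{
	if (len && len <= MAX_ENT_ALIGNED)
		return 1;
	return (len + MAX_ENT_ALIGNED - 1) / MAX_ENT_ALIGNED;
}

/* Entry count with a 2G limit. The divisor is a power of two, so the
 * round-up division reduces to an add and a shift. */
static uint64_t nents_2g(uint64_t len)
{
	return (len + (1ULL << 31) - 1) >> 31;
}
```

An 8G buffer needs three entries with the '4G - PAGE_SIZE' limit and
four with the 2G limit; buffers of 2G or less need one entry either way.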

	David

> 
> Thanks,
> Praan
> 
> [1] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/infiniband/core/umem_dmabuf.c#L174 
> [2] https://nvdam.widen.net/s/fdvdqvfvj2/hopper-h200-nvl-product-brief (Table 2-2)
> [3] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/vfio/pci/vfio_pci_dmabuf.c#L297
> 


* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23 22:53       ` David Laight
@ 2026-06-24 14:31         ` Leon Romanovsky
  2026-06-30 12:42         ` Jason Gunthorpe
  1 sibling, 0 replies; 14+ messages in thread
From: Leon Romanovsky @ 2026-06-24 14:31 UTC (permalink / raw)
  To: David Laight
  Cc: Pranjal Shrivastava, David Hu, Sumit Semwal, Christian König,
	Jason Gunthorpe, Nicolin Chen, Kevin Tian, Ankit Agrawal,
	Alex Williamson, linux-media, dri-devel, linaro-mm-sig,
	linux-kernel, iommu, jmoroni, kpberry, chriscli, sashiko-bot,
	stable

On Tue, Jun 23, 2026 at 11:53:50PM +0100, David Laight wrote:
> On Tue, 23 Jun 2026 20:55:32 +0000
> Pranjal Shrivastava <praan@google.com> wrote:
> 
> > On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:
> > 
> > Hi David,
> > 
> > > On Tue, 23 Jun 2026 01:54:59 +0000
> > > David Hu <xuehaohu@google.com> wrote:
> > >   
> > > > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > > > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > > > first entry, resulting in non-page-aligned DMA addresses for all
> > > > subsequent entries.  
> > > 
> > > There is a separate issue of whether this code is even needed at all.
> > > > Where can transfers over 2G (never mind 4G) actually come from?
> > > 
> > > The read, write and similar system calls limit transfers to INT_MAX
> > > > (even on 64bit) and a lot of driver code will need fixing if longer
> > > > lengths are allowed though.
> > > io_uring better enforce the same limits.
> > > > So the transfers can't come directly from userspace.
> > > 
> > > Not only that but you also need a single physically contiguous buffer.
> > > Good luck allocating that!
> > > 
> > > Now maybe there are some peer-to-peer places where the large buffer
> > > is device memory, but they will be unusual and probably need
> > > special treatment anyway.
> > >   
> > 
> > I agree that traditional VFS read/write face the MAX_RW_COUNT limit 
> > (~2GB), and io_uring has its limits, but I'm a little confused by the
> > push to enforce these limits here in the SGL code?
> > 
> > File I/O seems to be only one side of the picture. In my view, this fix
> > is necessary and certainly has a use-case:
> > 
> > For example, the RDMA subsystem has the capability to import dmabufs [1],
> > which gives rise to use cases for dmabuf beyond standard file ops 
> > (via VFS/io_uring). 
> > 
> > In these scenarios, GPU HBM can be exported as dmabufs. With recent GPUs,
> > HBM capacity can be in the order of hundreds of GBs [2]. RDMA can employ
> > infrastructure like the vfio-dmabuf-exporter [3] or similar dmabuf 
> > exporters to frequently move huge blocks of data via P2PDMA.
> 
> Ok, that explains where big buffers can come from.
> I just wasn't sure.
> 
> > If we restrict incoming dmabuf transfers to fit within VFS-centric 
> > limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
> > it to manage a significantly higher number of memory registrations. By 
> > cleanly splitting these massive contiguous device buffers into 
> > page-aligned SGL entries, we directly improve the efficiency of P2P 
> > transfers and memory registration.
> 
> But a divide by '4G - PAGE_SIZE' is also non-trivial and (I think affects
> a lot of io) when the quotient is always 1.
> Splitting into 2G chunks is a lot cheaper.
> 
> > Since this change doesn't seem to have a negative impact on standard file
> > I/O or break existing VFS constraints, I'm curious why we shouldn't 
> > support splitting these >4GB P2P transfers? Am I missing something?
> 
> I was only wondering whether it was needed...
> It does bring up the question of why the >4GB transfers even need splitting.
> But that is another question.

Just a side note:

In our vision, we aim to transition DMABUF to use physical  
addresses directly https://lore.kernel.org/all/0-v1-b5cab63049c0+191af-dmabuf_map_type_jgg@nvidia.com/  
and eliminate the scatter‑gather layer from the DMABUF path.

Thanks

> 
> If you want to split large transfers into 4G-PAGE_SIZE blocks
> it is probably worth having a quick test that returns 1 for 'small' buffers.
> 
> 	David
> 
> > 
> > Thanks,
> > Praan
> > 
> > [1] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/infiniband/core/umem_dmabuf.c#L174 
> > [2] https://nvdam.widen.net/s/fdvdqvfvj2/hopper-h200-nvl-product-brief (Table 2-2)
> > [3] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/vfio/pci/vfio_pci_dmabuf.c#L297
> > 
> 


* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23  8:44   ` David Laight
  2026-06-23 20:55     ` Pranjal Shrivastava
@ 2026-06-30 12:38     ` Jason Gunthorpe
  1 sibling, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2026-06-30 12:38 UTC (permalink / raw)
  To: David Laight
  Cc: David Hu, Sumit Semwal, Christian König, Nicolin Chen,
	Leon Romanovsky, Kevin Tian, Ankit Agrawal, Alex Williamson,
	linux-media, dri-devel, linaro-mm-sig, linux-kernel, iommu,
	jmoroni, praan, kpberry, chriscli, sashiko-bot, stable

On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:
> On Tue, 23 Jun 2026 01:54:59 +0000
> David Hu <xuehaohu@google.com> wrote:
> 
> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
> 
> There is a separate issue of whether this code is even needed at all.
> Where can transfers over 2G (never mind 4G) actually come from?

This is DMABUF land, you really can allocate DMABUFs of huge amounts of
physical memory, VFIO does this reliably and trivially for example. It
wouldn't come from the physical allocator.

So yes, these scenarios need to work in this code.

Jason


* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-23 22:53       ` David Laight
  2026-06-24 14:31         ` Leon Romanovsky
@ 2026-06-30 12:42         ` Jason Gunthorpe
  2026-07-02  4:56           ` David Hu
  1 sibling, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2026-06-30 12:42 UTC (permalink / raw)
  To: David Laight
  Cc: Pranjal Shrivastava, David Hu, Sumit Semwal, Christian König,
	Nicolin Chen, Leon Romanovsky, Kevin Tian, Ankit Agrawal,
	Alex Williamson, linux-media, dri-devel, linaro-mm-sig,
	linux-kernel, iommu, jmoroni, kpberry, chriscli, sashiko-bot,
	stable

On Tue, Jun 23, 2026 at 11:53:50PM +0100, David Laight wrote:

> > If we restrict incoming dmabuf transfers to fit within VFS-centric 
> > limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
> > it to manage a significantly higher number of memory registrations. By 
> > cleanly splitting these massive contiguous device buffers into 
> > page-aligned SGL entries, we directly improve the efficiency of P2P 
> > transfers and memory registration.
> 
> But a divide by '4G - PAGE_SIZE' is also non-trivial and (I think affects
> a lot of io) when the quotient is always 1.
> Splitting into 2G chunks is a lot cheaper.

Doesn't matter, this isn't fast path stuff. It is better to use fewer
SGL entries, IMHO.

> > Since this change doesn't seem to have a negative impact on standard file
> > I/O or break existing VFS constraints, I'm curious why we shouldn't 
> > support splitting these >4GB P2P transfers? Am I missing something?
> 
> I was only wondering whether it was needed...
> It does bring up the question of why the >4GB transfers even need splitting.
> But that is another question.

SGL can only store an unsigned int size, so any large physical range
has to be split down.

RDMA nowadays has code to process the sgl and restore the > 4G
sizes, since more RDMA HW can accept that.

commit 486055f5e09df959ad4e3aa4ee75b5c91ddeec2e
Author: Michael Margolin <mrgolin@amazon.com>
Date:   Mon Feb 17 14:16:23 2025 +0000

    RDMA/core: Fix best page size finding when it can cross SG entries
    
So whatever this produces needs to be compatible with that to undo it.
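
To illustrate the splitting constraint, here is a userspace sketch, not the
kernel's `fill_sg_entry()`. It walks a contiguous range in pieces whose
length fits an `unsigned int` and reports whether every piece starts on a
page boundary. The 4 KiB page size and the `sketch_` names are assumptions:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch; kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: split [addr, addr + len) into pieces of at most
 * `chunk` bytes. Returns true if every piece starts page aligned and
 * stores the number of pieces in *nents.
 */
static bool sketch_split_is_aligned(uint64_t addr, uint64_t len,
				    unsigned int chunk, unsigned int *nents)
{
	bool aligned = true;

	assert(chunk > 0);
	*nents = 0;
	while (len) {
		/* Each entry length has to fit in an unsigned int. */
		unsigned int this_len = len < chunk ? (unsigned int)len : chunk;

		if (addr & (SKETCH_PAGE_SIZE - 1u))
			aligned = false;
		addr += this_len;
		len -= this_len;
		(*nents)++;
	}
	return aligned;
}
```

For a page-aligned 8 GiB range, a chunk of `UINT_MAX` gives three pieces with
unaligned second and third start addresses, while a chunk of `0xFFFFF000`
(4G - PAGE_SIZE) also gives three pieces and keeps every start aligned. A 2G
chunk stays aligned but needs four pieces.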

Jason


* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-06-30 12:42         ` Jason Gunthorpe
@ 2026-07-02  4:56           ` David Hu
  2026-07-02  8:10             ` David Laight
  0 siblings, 1 reply; 14+ messages in thread
From: David Hu @ 2026-07-02  4:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Laight, Pranjal Shrivastava, Sumit Semwal,
	Christian König, Nicolin Chen, Leon Romanovsky, Kevin Tian,
	Ankit Agrawal, Alex Williamson, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, iommu, jmoroni, kpberry, chriscli,
	sashiko-bot, stable

On Tue, Jun 30, 2026 at 8:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jun 23, 2026 at 11:53:50PM +0100, David Laight wrote:
>
> > > If we restrict incoming dmabuf transfers to fit within VFS-centric
> > > limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
> > > it to manage a significantly higher number of memory registrations. By
> > > cleanly splitting these massive contiguous device buffers into
> > > page-aligned SGL entries, we directly improve the efficiency of P2P
> > > transfers and memory registration.
> >
> > But a divide by '4G - PAGE_SIZE' is also non-trivial and (I think affects
> > a lot of io) when the quotient is always 1.
> > Splitting into 2G chunks is a lot cheaper.
>
> Doesn't matter, this isn't fast path stuff. It is better to use fewer
> SGL entries, IMHO.
>
> > > Since this change doesn't seem to have a negative impact on standard file
> > > I/O or break existing VFS constraints, I'm curious why we shouldn't
> > > support splitting these >4GB P2P transfers? Am I missing something?
> >
> > I was only wondering whether it was needed...
> > It does bring up the question of why the >4GB transfers even need splitting.
> > But that is another question.
>
> SGL can only store an unsigned int size, so any large physical range
> has to be split down.
>
> RDMA nowadays has code to process the sgl and restore the > 4G
> sizes, since more RDMA HW can accept that.
>
> commit 486055f5e09df959ad4e3aa4ee75b5c91ddeec2e
> Author: Michael Margolin <mrgolin@amazon.com>
> Date:   Mon Feb 17 14:16:23 2025 +0000
>
>     RDMA/core: Fix best page size finding when it can cross SG entries
>
> So whatever this produces needs to be compatible with that to undo it.

Thank you everyone. It looks like most open issues are sorted out.
I'll wait for maintainers to weigh in before sending out v3 (which
will remove the type cast for min() per David L.'s feedback, and
revert to ALIGN_DOWN(UINT_MAX, PAGE_SIZE) per Jason's feedback).

Hi Jason,

Thank you for your feedback. I took a closer look at the commit to
ensure compatibility. This patch is perfectly complementary, and
actually prevents a failure in an edge case for the latest
`ib_umem_find_best_pgsz` [1].

Regards,
David

[1] For a dma-buf split with `0xFFFFFFFF`, in case of a discontiguity
in later buffers, we will hit this code path in
`ib_umem_find_best_pgsz`

```
if (i != 0)
    mask |= va;
```
(*After `va` had been incremented by `0xFFFFFFFF`, due to `va +=
sg_dma_len(sg) - pgoff`)
(*Which will set the lowest bit of `mask` to 1)

Because `count_trailing_zeros(mask)` returns 0,
`ib_umem_find_best_pgsz()` will always return 0 in such cases.
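
The arithmetic can be checked with a small userspace model. This is only a
sketch of the two quoted statements, not the kernel function; it assumes
`pgoff == 0`, and `sketch_ctz64()` and `sketch_max_pgsz_shift()` are
hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Portable count of trailing zero bits; `x` must be non-zero. */
static unsigned int sketch_ctz64(uint64_t x)
{
	unsigned int n = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		n++;
	}
	return n;
}

/*
 * Hypothetical model: start from a page-aligned virtual address, advance
 * it by the first entry's length, then fold it into the mask as happens
 * at a discontiguity. The result bounds log2 of the usable page size.
 */
static unsigned int sketch_max_pgsz_shift(uint64_t va, unsigned int first_len)
{
	uint64_t mask = 0;

	va += first_len;	/* va += sg_dma_len(sg) - pgoff, pgoff == 0 */
	mask |= va;		/* if (i != 0) mask |= va; */
	return sketch_ctz64(mask);
}
```

With a first entry of `0xFFFFFFFF` the shift is 0, so no page size survives.
With `0xFFFFF000` the shift is 12, which still permits 4 KiB pages.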


* Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
  2026-07-02  4:56           ` David Hu
@ 2026-07-02  8:10             ` David Laight
  0 siblings, 0 replies; 14+ messages in thread
From: David Laight @ 2026-07-02  8:10 UTC (permalink / raw)
  To: David Hu
  Cc: Jason Gunthorpe, Pranjal Shrivastava, Sumit Semwal,
	Christian König, Nicolin Chen, Leon Romanovsky, Kevin Tian,
	Ankit Agrawal, Alex Williamson, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, iommu, jmoroni, kpberry, chriscli,
	sashiko-bot, stable

On Thu, 2 Jul 2026 00:56:40 -0400
David Hu <xuehaohu@google.com> wrote:

> On Tue, Jun 30, 2026 at 8:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Jun 23, 2026 at 11:53:50PM +0100, David Laight wrote:
> >  
> > > > If we restrict incoming dmabuf transfers to fit within VFS-centric
> > > > limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
> > > > it to manage a significantly higher number of memory registrations. By
> > > > cleanly splitting these massive contiguous device buffers into
> > > > page-aligned SGL entries, we directly improve the efficiency of P2P
> > > > transfers and memory registration.  
> > >
> > > But a divide by '4G - PAGE_SIZE' is also non-trivial and (I think affects
> > > a lot of io) when the quotient is always 1.
> > > Splitting into 2G chunks is a lot cheaper.  
> >
> > Doesn't matter, this isn't fast path stuff. It is better to use fewer
> > SGL entries, IMHO.
> >  
> > > > Since this change doesn't seem to have a negative impact on standard file
> > > > I/O or break existing VFS constraints, I'm curious why we shouldn't
> > > > support splitting these >4GB P2P transfers? Am I missing something?  
> > >
> > > I was only wondering whether it was needed...
> > > It does bring up the question of why the >4GB transfers even need splitting.
> > > But that is another question.  
> >
> > SGL can only store an unsigned int size, so any large physical range
> > has to be split down.
> >
> > RDMA nowadays has code to process the sgl and restore the > 4G
> > sizes, since more RDMA HW can accept that.
> >
> > commit 486055f5e09df959ad4e3aa4ee75b5c91ddeec2e
> > Author: Michael Margolin <mrgolin@amazon.com>
> > Date:   Mon Feb 17 14:16:23 2025 +0000
> >
> >     RDMA/core: Fix best page size finding when it can cross SG entries
> >
> > So whatever this produces needs to be compatible with that to undo it.  
> 
> Thank you everyone. It looks like most open issues are sorted out.
> I'll wait for maintainers to weigh in before sending out v3 (which
> will remove the type cast for min() per David L.'s feedback, and
> revert to ALIGN_DOWN(UINT_MAX, PAGE_SIZE) per Jason's feedback).

Does this code get used a lot for 'normal' transfers?
I'm away from my normal systems and can't check.
But if pretty much all of the fragments are small (< 4G) then
it is probably worth adding a check for 'size < limit' before
anything else and optimising that case.

	David

> 
> Hi Jason,
> 
> Thank you for your feedback. I took a closer look at the commit to
> ensure compatibility. This patch is perfectly complementary, and
> actually prevents a failure in an edge case for the latest
> `ib_umem_find_best_pgsz` [1].
> 
> Regards,
> David
> 
> [1] For a dma-buf split with `0xFFFFFFFF`, in case of a discontiguity
> in later buffers, we will hit this code path in
> `ib_umem_find_best_pgsz`
> 
> ```
> if (i != 0)
>     mask |= va;
> ```
> (*After `va` had been incremented by `0xFFFFFFFF`, due to `va +=
> sg_dma_len(sg) - pgoff`)
> (*Which will set the lowest bit of `mask` to 1)
> 
> Because `count_trailing_zeros(mask)` returns 0,
> `ib_umem_find_best_pgsz()` will always return 0 in such cases.



end of thread, other threads:[~2026-07-02  8:10 UTC | newest]

Thread overview: 14+ messages
2026-06-21 22:21 [PATCH] dma-buf: Split sgl by largest page-aligned chunk David Hu
2026-06-22  8:13 ` David Laight
2026-06-22 21:26   ` David Hu
2026-06-23  8:25     ` David Laight
2026-06-23 21:03       ` David Hu
2026-06-23  1:54 ` [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks David Hu
2026-06-23  8:44   ` David Laight
2026-06-23 20:55     ` Pranjal Shrivastava
2026-06-23 22:53       ` David Laight
2026-06-24 14:31         ` Leon Romanovsky
2026-06-30 12:42         ` Jason Gunthorpe
2026-07-02  4:56           ` David Hu
2026-07-02  8:10             ` David Laight
2026-06-30 12:38     ` Jason Gunthorpe
