* [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read.
@ 2025-02-04 23:36 Jianxiong Gao
2025-02-04 23:36 ` [RFC PATCH 1/3] add full_buffer_write flag to struct device_dma_parameters Jianxiong Gao
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: Jianxiong Gao @ 2025-02-04 23:36 UTC (permalink / raw)
To: Keith Busch, Marek Szyprowski
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Robin Murphy,
Andy Shevchenko, Dan Williams, Erdem Aktas, Vishal Annapurve,
Ryan Afranji, linux-nvme, iommu, Jianxiong Gao
Removes an extra memory copy that occurs during IO read
operations through the SWIOTLB. During high throughput
read workloads, this extra copy is causing a lot of stress
on the SWIOTLB.
With high performance devices, for example NVMe devices,
the device will be overwriting the entire buffer. In such
cases the entire pre-copy is redundant and only slows down
the overall bounce buffering.
We propose adding a full_buffer_write flag to struct
device_dma_parameters. When the flag is set, the pre-copy
can be omitted to boost performance.
Jianxiong Gao (3):
add full_buffer_write flag to struct device_dma_parameters
skip swiotlb pre copy if the device does full buffer write.
set full_buffer_write for nvme devices.
drivers/nvme/host/pci.c | 1 +
include/linux/device.h | 1 +
include/linux/dma-mapping.h | 15 +++++++++++++++
kernel/dma/swiotlb.c | 3 ++-
4 files changed, 19 insertions(+), 1 deletion(-)
--
2.48.1.362.g079036d154-goog
* [RFC PATCH 1/3] add full_buffer_write flag to struct device_dma_parameters
2025-02-04 23:36 [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Jianxiong Gao
@ 2025-02-04 23:36 ` Jianxiong Gao
2025-02-05 8:01 ` Andy Shevchenko
2025-02-04 23:36 ` [RFC PATCH 2/3] skip swiotlb pre copy if the device does full buffer write Jianxiong Gao
` (3 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: Jianxiong Gao @ 2025-02-04 23:36 UTC (permalink / raw)
To: Keith Busch, Marek Szyprowski
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Robin Murphy,
Andy Shevchenko, Dan Williams, Erdem Aktas, Vishal Annapurve,
Ryan Afranji, linux-nvme, iommu, Jianxiong Gao
Some devices are guaranteed to overwrite the entire buffer when
they write to it. For such devices, the need to pre-condition the
original buffer can be reduced. For example, when bouncing data
through swiotlb, the buffer is forced to be synced before the
device writes to it. For devices that we know for sure overwrite
the entire buffer, this flag can be used to eliminate the extra
copy on every IO bounced through the swiotlb.
Signed-off-by: Jianxiong Gao <jxgao@google.com>
---
include/linux/device.h | 1 +
include/linux/dma-mapping.h | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/device.h b/include/linux/device.h
index 80a5b3268986..003007ad6ad3 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -460,6 +460,7 @@ struct device_dma_parameters {
 	 * a low level driver may set these to teach IOMMU code about
 	 * sg limitations.
 	 */
+	bool full_buffer_write;
 	unsigned int max_segment_size;
 	unsigned int min_align_mask;
 	unsigned long segment_boundary_mask;
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index b79925b1c433..e93b909865d3 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -513,6 +513,21 @@ static inline int dma_coerce_mask_and_coherent(struct device *dev, u64 mask)
 	return dma_set_mask_and_coherent(dev, mask);
 }
+static inline bool dma_is_full_buffer_write(struct device *dev)
+{
+	if (dev->dma_parms)
+		return dev->dma_parms->full_buffer_write;
+	return false;
+}
+
+static inline int dma_set_full_buffer_write(struct device *dev, bool full_buffer_write)
+{
+	if (WARN_ON_ONCE(!dev->dma_parms))
+		return -EIO;
+	dev->dma_parms->full_buffer_write = full_buffer_write;
+	return 0;
+}
+
 static inline unsigned int dma_get_max_seg_size(struct device *dev)
 {
 	if (dev->dma_parms && dev->dma_parms->max_segment_size)
--
2.48.1.362.g079036d154-goog
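
The behaviour of the two helpers in this patch can be sketched in userspace as follows. This is only an illustration under stated assumptions: the `mock_` types and functions are hypothetical stand-ins for `struct device` and `struct device_dma_parameters`, and the `WARN_ON_ONCE()` from the kernel version is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures (hypothetical mock_ names). */
struct mock_dma_parms {
	bool full_buffer_write;
};

struct mock_device {
	struct mock_dma_parms *dma_parms;
};

/* Same logic as dma_is_full_buffer_write(): a missing dma_parms means "no". */
static bool mock_is_full_buffer_write(const struct mock_device *dev)
{
	assert(dev != NULL);
	return dev->dma_parms ? dev->dma_parms->full_buffer_write : false;
}

/* Same logic as dma_set_full_buffer_write(), minus the WARN_ON_ONCE(). */
static int mock_set_full_buffer_write(struct mock_device *dev, bool val)
{
	assert(dev != NULL);
	if (!dev->dma_parms)
		return -EIO;
	dev->dma_parms->full_buffer_write = val;
	return 0;
}
```

A device without dma_parms therefore keeps the existing behaviour, because the getter falls back to false. The setter reports -EIO in that case, so a caller that wants to know whether the flag actually took effect would need to check the return value.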
* [RFC PATCH 2/3] skip swiotlb pre copy if the device does full buffer write.
2025-02-04 23:36 [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Jianxiong Gao
2025-02-04 23:36 ` [RFC PATCH 1/3] add full_buffer_write flag to struct device_dma_parameters Jianxiong Gao
@ 2025-02-04 23:36 ` Jianxiong Gao
2025-02-04 23:36 ` [RFC PATCH 3/3] set full_buffer_write for nvme devices Jianxiong Gao
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Jianxiong Gao @ 2025-02-04 23:36 UTC (permalink / raw)
To: Keith Busch, Marek Szyprowski
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Robin Murphy,
Andy Shevchenko, Dan Williams, Erdem Aktas, Vishal Annapurve,
Ryan Afranji, linux-nvme, iommu, Jianxiong Gao
When the device is known to do a full buffer write, we
can skip the pre-copy of the swiotlb buffer.
Signed-off-by: Jianxiong Gao <jxgao@google.com>
---
kernel/dma/swiotlb.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index abcf3fa63a56..12124d4fd44f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1436,7 +1436,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	 * hardware behavior. Use of swiotlb is supposed to be transparent,
 	 * i.e. swiotlb must not corrupt memory by clobbering unwritten bytes.
 	 */
-	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE, pool);
+	if ((dir == DMA_TO_DEVICE) || !dma_is_full_buffer_write(dev))
+		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE, pool);
 	return tlb_addr;
 }
--
2.48.1.362.g079036d154-goog
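
To make the new condition easier to reason about, the sketch below evaluates it for each DMA direction. It is a standalone illustration, not kernel code: `precopy_runs()` is a hypothetical helper name, and the enum is redeclared locally with the same values the kernel uses for `enum dma_data_direction`.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy of the kernel's enum dma_data_direction values. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
};

/*
 * Hypothetical helper: returns true when the condition proposed in this
 * patch would still perform the pre-copy in swiotlb_tbl_map_single().
 */
static bool precopy_runs(enum dma_data_direction dir, bool full_buffer_write)
{
	assert(dir != DMA_NONE);
	return dir == DMA_TO_DEVICE || !full_buffer_write;
}
```

With the flag clear, the pre-copy runs for every direction, as before. With the flag set, it runs only for DMA_TO_DEVICE, so both DMA_FROM_DEVICE and DMA_BIDIRECTIONAL mappings skip it. The patch does not say whether skipping the copy for DMA_BIDIRECTIONAL is intended.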
* [RFC PATCH 3/3] set full_buffer_write for nvme devices.
2025-02-04 23:36 [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Jianxiong Gao
2025-02-04 23:36 ` [RFC PATCH 1/3] add full_buffer_write flag to struct device_dma_parameters Jianxiong Gao
2025-02-04 23:36 ` [RFC PATCH 2/3] skip swiotlb pre copy if the device does full buffer write Jianxiong Gao
@ 2025-02-04 23:36 ` Jianxiong Gao
2025-02-05 7:54 ` [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Andy Shevchenko
2025-02-05 11:46 ` Robin Murphy
4 siblings, 0 replies; 9+ messages in thread
From: Jianxiong Gao @ 2025-02-04 23:36 UTC (permalink / raw)
To: Keith Busch, Marek Szyprowski
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Robin Murphy,
Andy Shevchenko, Dan Williams, Erdem Aktas, Vishal Annapurve,
Ryan Afranji, linux-nvme, iommu, Jianxiong Gao
NVMe DMA operations always overwrite the whole DMA buffer passed to
the device. Make use of the newly introduced full_buffer_write flag
so that when swiotlb is used to bounce the buffer, we no longer need
to do the extra copy when preparing the buffer.
Signed-off-by: Jianxiong Gao <jxgao@google.com>
---
drivers/nvme/host/pci.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 278bed4e35bb..8fb6a0a87202 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3209,6 +3209,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
 	else
 		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+	dma_set_full_buffer_write(&pdev->dev, true);
 	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
 	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
--
2.48.1.362.g079036d154-goog
* Re: [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read.
2025-02-04 23:36 [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Jianxiong Gao
` (2 preceding siblings ...)
2025-02-04 23:36 ` [RFC PATCH 3/3] set full_buffer_write for nvme devices Jianxiong Gao
@ 2025-02-05 7:54 ` Andy Shevchenko
2025-02-05 15:43 ` Christoph Hellwig
2025-02-05 11:46 ` Robin Murphy
4 siblings, 1 reply; 9+ messages in thread
From: Andy Shevchenko @ 2025-02-05 7:54 UTC (permalink / raw)
To: Jianxiong Gao
Cc: Keith Busch, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, Robin Murphy, Dan Williams, Erdem Aktas,
Vishal Annapurve, Ryan Afranji, linux-nvme, iommu
On Tue, Feb 04, 2025 at 11:36:27PM +0000, Jianxiong Gao wrote:
> Removes an extra memory copy that occurs during IO read
> operations through the SWIOTLB. During high throughput
> read workloads, this extra copy is causing a lot of stress
> on the SWIOTLB.
>
> With high performance devices, for example NVMe devices,
> the device will be overwriting the entire buffer.
Is this really guaranteed? I can imagine surprise power cut or
hotplug event, for example, just in the middle of the transfer.
> In such cases the entire pre-copy is redundant and only slows down
> the overall bounce buffering.
>
> We propose adding a full_buffer_write flag to struct
> device_dma_parameters. When the flag is set, the pre-copy
> can be omitted to boost performance.
--
With Best Regards,
Andy Shevchenko
* Re: [RFC PATCH 1/3] add full_buffer_write flag to struct device_dma_parameters
2025-02-04 23:36 ` [RFC PATCH 1/3] add full_buffer_write flag to struct device_dma_parameters Jianxiong Gao
@ 2025-02-05 8:01 ` Andy Shevchenko
0 siblings, 0 replies; 9+ messages in thread
From: Andy Shevchenko @ 2025-02-05 8:01 UTC (permalink / raw)
To: Jianxiong Gao
Cc: Keith Busch, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, Robin Murphy, Dan Williams, Erdem Aktas,
Vishal Annapurve, Ryan Afranji, linux-nvme, iommu
On Tue, Feb 04, 2025 at 11:36:28PM +0000, Jianxiong Gao wrote:
> Some devices are guaranteed to overwrite the entire buffer when
> they write to it. For such devices, the need to pre-condition the
> original buffer can be reduced. For example, when bouncing data
> through swiotlb, the buffer is forced to be synced before the
> device writes to it. For devices that we know for sure overwrite
> the entire buffer, this flag can be used to eliminate the extra
> copy on every IO bounced through the swiotlb.
...
> struct device_dma_parameters {
> * a low level driver may set these to teach IOMMU code about
> * sg limitations.
> */
> + bool full_buffer_write;
> unsigned int max_segment_size;
> unsigned int min_align_mask;
> unsigned long segment_boundary_mask;
Have you run `pahole`? Please, share the results.
--
With Best Regards,
Andy Shevchenko
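
The layout question behind the `pahole` request can be approximated in plain C with `sizeof` and `offsetof`. The sketch below is an assumption-laden stand-in: the two struct names are hypothetical, the members are copied from the quoted hunk, and the numbers it yields depend on the target ABI, so it does not replace running `pahole` on an actual kernel build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Layout before the patch (members as shown in the quoted hunk). */
struct dma_parms_before {
	unsigned int max_segment_size;
	unsigned int min_align_mask;
	unsigned long segment_boundary_mask;
};

/* Layout proposed by the patch: the bool is placed first. */
struct dma_parms_after {
	bool full_buffer_write;
	unsigned int max_segment_size;
	unsigned int min_align_mask;
	unsigned long segment_boundary_mask;
};

static_assert(offsetof(struct dma_parms_after, full_buffer_write) == 0,
	      "the flag is the first member");

/* Bytes of padding the compiler inserts right after the leading bool. */
static size_t padding_after_flag(void)
{
	return offsetof(struct dma_parms_after, max_segment_size) - sizeof(bool);
}

/* Growth of the structure caused by adding the flag. */
static size_t size_growth(void)
{
	return sizeof(struct dma_parms_after) - sizeof(struct dma_parms_before);
}
```

On a typical LP64 target such as x86-64, this would work out to 3 bytes of padding after the flag and a growth from 16 to 24 bytes, because a second hole opens in front of the unsigned long member. Other ABIs may differ.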
* Re: [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read.
2025-02-04 23:36 [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Jianxiong Gao
` (3 preceding siblings ...)
2025-02-05 7:54 ` [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Andy Shevchenko
@ 2025-02-05 11:46 ` Robin Murphy
4 siblings, 0 replies; 9+ messages in thread
From: Robin Murphy @ 2025-02-05 11:46 UTC (permalink / raw)
To: Jianxiong Gao, Keith Busch, Marek Szyprowski
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Andy Shevchenko,
Dan Williams, Erdem Aktas, Vishal Annapurve, Ryan Afranji,
linux-nvme, iommu
On 2025-02-04 11:36 pm, Jianxiong Gao wrote:
> Removes an extra memory copy that occurs during IO read
> operations through the SWIOTLB. During high throughput
> read workloads, this extra copy is causing a lot of stress
> on the SWIOTLB.
>
> With high performance devices, for example NVMe devices,
> the device will be overwriting the entire buffer. In such
> cases the entire pre-copy is redundant and only slows down
> the overall bounce buffering.
>
> We propose adding a full_buffer_write flag to struct
> device_dma_parameters. When the flag is set, the pre-copy
> can be omitted to boost performance.
No. We already went through this a couple of weeks ago[1]. It's not
about what the driver intends the device to do, it's about what happens
if for whatever reason it then *doesn't* do that.
Thanks,
Robin.
[1]
https://lore.kernel.org/lkml/9582878b-1ce7-4fc4-9b45-b72bba722f49@arm.com/
>
> Jianxiong Gao (3):
> add full_buffer_write flag to struct device_dma_parameters
> skip swiotlb pre copy if the device does full buffer write.
> set full_buffer_write for nvme devices.
>
> drivers/nvme/host/pci.c | 1 +
> include/linux/device.h | 1 +
> include/linux/dma-mapping.h | 15 +++++++++++++++
> kernel/dma/swiotlb.c | 3 ++-
> 4 files changed, 19 insertions(+), 1 deletion(-)
>
* Re: [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read.
2025-02-05 7:54 ` [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read Andy Shevchenko
@ 2025-02-05 15:43 ` Christoph Hellwig
2025-02-05 16:18 ` Keith Busch
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-02-05 15:43 UTC (permalink / raw)
To: Andy Shevchenko
Cc: Jianxiong Gao, Keith Busch, Marek Szyprowski, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, Robin Murphy, Dan Williams,
Erdem Aktas, Vishal Annapurve, Ryan Afranji, linux-nvme, iommu
On Wed, Feb 05, 2025 at 09:54:21AM +0200, Andy Shevchenko wrote:
> On Tue, Feb 04, 2025 at 11:36:27PM +0000, Jianxiong Gao wrote:
> > Removes an extra memory copy that occurs during IO read
> > operations through the SWIOTLB. During high throughput
> > read workloads, this extra copy is causing a lot of stress
> > on the SWIOTLB.
> >
> > With high performance devices, for example NVMe devices,
> > the device will be overwriting the entire buffer.
>
> Is this really guaranteed? I can imagine surprise power cut or
> hotplug event, for example, just in the middle of the transfer.
Many commands can return less data than originally mapped. Get Log Page
is an example that comes to mind that is heavily used that way.
* Re: [RFC PATCH 0/3] Add an option for devices to skip SWIOTLB pre-copy on read.
2025-02-05 15:43 ` Christoph Hellwig
@ 2025-02-05 16:18 ` Keith Busch
0 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2025-02-05 16:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andy Shevchenko, Jianxiong Gao, Marek Szyprowski, Jens Axboe,
Sagi Grimberg, Robin Murphy, Dan Williams, Erdem Aktas,
Vishal Annapurve, Ryan Afranji, linux-nvme, iommu
On Wed, Feb 05, 2025 at 04:43:21PM +0100, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 09:54:21AM +0200, Andy Shevchenko wrote:
> > On Tue, Feb 04, 2025 at 11:36:27PM +0000, Jianxiong Gao wrote:
> > > Removes an extra memory copy that occurs during IO read
> > > operations through the SWIOTLB. During high throughput
> > > read workloads, this extra copy is causing a lot of stress
> > > on the SWIOTLB.
> > >
> > > With high performance devices, for example NVMe devices,
> > > the device will be overwriting the entire buffer.
> >
> > Is this really guaranteed? I can imagine surprise power cut or
> > hotplug event, for example, just in the middle of the transfer.
>
> Many commands can return less data than originally mapped. Get Log Page
> is an example that comes to mind that is heavily used that way.
It's also easy enough for a user to "mistakenly" request more memory
than the device will transfer. The safest thing is to clear any contents
that may get copied back to the user.