* XDP performance regression due to CONFIG_RETPOLINE Spectre V2
@ 2018-04-12 13:50 Jesper Dangaard Brouer
  2018-04-12 14:51 ` Christoph Hellwig
  2018-04-16 12:27 ` Christoph Hellwig
  0 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-12 13:50 UTC (permalink / raw)
  To: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org
  Cc: brouer, Christoph Hellwig, David Woodhouse, William Tu,
      Björn Töpel, Karlsson, Magnus, Alexander Duyck,
      Arnaldo Carvalho de Melo

Heads-up XDP performance nerds!

I got an unpleasant surprise when I updated my GCC compiler (to support
the option -mindirect-branch=thunk-extern). My XDP redirect performance
numbers were cut in half: from approx 13Mpps to 6Mpps (single CPU core).

I've identified the issue, which is caused by kernel CONFIG_RETPOLINE;
it only takes effect when the GCC compiler has support for it. This is
the mitigation of Spectre variant 2 (CVE-2017-5715) related to indirect
(function call) branches.

XDP_REDIRECT itself has only two primary (per packet) indirect function
calls, ndo_xdp_xmit and invoking the bpf_prog, plus any map_lookup_elem
calls in the bpf_prog. I implemented a PoC of bulking for ndo_xdp_xmit,
which helped, but not enough. The real root cause is all the DMA API
calls, which use function pointers extensively.

Mitigation plan
---------------
Implement support for keeping the DMA mapping through the XDP return
call, to remove RX map/unmap calls. Implement bulking for XDP
ndo_xdp_xmit and the XDP return-frame API. Bulking allows performing
DMA bulking via scatter-gather DMA calls; XDP TX needs it for DMA
map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
to mitigate (via a bulk technique). Ask the DMA maintainer for a
common-case direct call for the swiotlb DMA sync call ;-)

Root-cause verification
-----------------------
I have verified that indirect DMA calls are the root cause, by removing
the DMA sync calls from the code (as they do nothing for swiotlb), and
manually inlining the DMA map calls (basically calling
phys_to_dma(dev, page_to_phys(page)) + offset). For my ixgbe test,
performance "returned" to 11Mpps.

Perf reports
------------
It is not easy to diagnose via the perf event tool. I'm coordinating
with ACME to make it easier to pinpoint the hotspots. Look out for
symbols: __x86_indirect_thunk_r10, __indirect_thunk_start,
__x86_indirect_thunk_rdx etc. Be aware that they might not rank super
high in perf top, but they stop CPU speculation. Thus, instead use
perf-stat and look at the negative effect on 'insn per cycle'.

To understand retpoline at the ASM level, read this:
 https://support.google.com/faqs/answer/7625886

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread
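[Editorial illustration: the manual inlining Jesper describes can be sketched as a userspace toy model. All names here are stand-ins, not the kernel implementation; an identity phys_to_dma() and a PAGE_SHIFT of 12 are assumptions for the no-IOMMU x86 case.]

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;
typedef uint64_t phys_addr_t;

#define PAGE_SHIFT 12

/* Hypothetical stand-in for the kernel's page_to_phys(): a page frame
 * number shifted up to a physical byte address. */
static phys_addr_t page_to_phys(unsigned long pfn)
{
	return (phys_addr_t)pfn << PAGE_SHIFT;
}

/* Hypothetical stand-in for phys_to_dma(): on x86 without an IOMMU
 * the device sees physical addresses directly (identity mapping). */
static dma_addr_t phys_to_dma(phys_addr_t phys)
{
	return (dma_addr_t)phys;
}

/* The manually inlined fast path: the DMA address is computed
 * directly, with no retpoline-guarded ops->map_page() indirect call. */
static dma_addr_t manual_dma_map(unsigned long pfn, unsigned long offset)
{
	return phys_to_dma(page_to_phys(pfn)) + offset;
}
```

The whole fast path reduces to a shift and two additions, which is why the retpoline-guarded indirect call through dma_ops dominates the per-packet cost once it is added back.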
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-12 13:50 XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
@ 2018-04-12 14:51 ` Christoph Hellwig
  2018-04-12 14:56   ` Christoph Hellwig
  2018-04-16 12:27 ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-12 14:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
      Christoph Hellwig, David Woodhouse, William Tu, Björn Töpel,
      Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo

On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> ---------------
> Implement support for keeping the DMA mapping through the XDP return
> call, to remove RX map/unmap calls. Implement bulking for XDP
> ndo_xdp_xmit and the XDP return-frame API. Bulking allows performing
> DMA bulking via scatter-gather DMA calls; XDP TX needs it for DMA
> map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
> to mitigate (via a bulk technique). Ask the DMA maintainer for a
> common-case direct call for the swiotlb DMA sync call ;-)

Why do you even end up in swiotlb code? Once you bounce buffer, your
performance is toast anyway..

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-12 14:51 ` Christoph Hellwig
@ 2018-04-12 14:56   ` Christoph Hellwig
  2018-04-12 15:31     ` Jesper Dangaard Brouer
  2018-04-13 17:12     ` Tushar Dave
  0 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-12 14:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
      Christoph Hellwig, David Woodhouse, William Tu, Björn Töpel,
      Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo

On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> > ---------------
> > Implement support for keeping the DMA mapping through the XDP return
> > call, to remove RX map/unmap calls. Implement bulking for XDP
> > ndo_xdp_xmit and the XDP return-frame API. Bulking allows performing
> > DMA bulking via scatter-gather DMA calls; XDP TX needs it for DMA
> > map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
> > to mitigate (via a bulk technique). Ask the DMA maintainer for a
> > common-case direct call for the swiotlb DMA sync call ;-)
>
> Why do you even end up in swiotlb code? Once you bounce buffer, your
> performance is toast anyway..

I guess that is because x86 selects it as the default as soon as we
have more than 4G memory. That should be solvable fairly easily with
the per-device dma ops, though.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-12 14:56   ` Christoph Hellwig
@ 2018-04-12 15:31     ` Jesper Dangaard Brouer
  2018-04-13 16:49       ` Christoph Hellwig
  2018-04-13 17:12     ` Tushar Dave
  1 sibling, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-12 15:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
      David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus,
      Alexander Duyck, Arnaldo Carvalho de Melo, brouer

On Thu, 12 Apr 2018 16:56:53 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> > > ---------------
> > > Implement support for keeping the DMA mapping through the XDP return
> > > call, to remove RX map/unmap calls. Implement bulking for XDP
> > > ndo_xdp_xmit and the XDP return-frame API. Bulking allows performing
> > > DMA bulking via scatter-gather DMA calls; XDP TX needs it for DMA
> > > map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
> > > to mitigate (via a bulk technique). Ask the DMA maintainer for a
> > > common-case direct call for the swiotlb DMA sync call ;-)
> >
> > Why do you even end up in swiotlb code? Once you bounce buffer, your
> > performance is toast anyway..
>
> I guess that is because x86 selects it as the default as soon as
> we have more than 4G memory.

I was also confused about why I ended up using SWIOTLB (SoftWare
IO-TLB); that might explain it. And I'm not hitting the bounce-buffer
case.

How do I control which DMA engine I use? (So I can play a little.)

> That should be solveable fairly easily with the per-device dma ops,
> though.

I didn't understand this part.

I wanted to ask your opinion on a hackish idea I have... which is how
to detect if I can reuse the RX-DMA map address for a TX-DMA operation
on another device (still/only calling sync_single_for_device).

With XDP_REDIRECT we are redirecting between net_devices. Usually we
keep the RX-DMA mapping, as we recycle the page. On the redirect to
the TX device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX. The
question is how to avoid this mapping(?). In some cases, with some DMA
engines (or the lack of one), I guess the DMA address is actually the
same as the already-known RX-DMA mapping dma_addr_t, right? For those
cases, would it be possible to just (re)use that address for TX?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-12 15:31     ` Jesper Dangaard Brouer
@ 2018-04-13 16:49       ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-13 16:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Christoph Hellwig, xdp-newbies@vger.kernel.org,
      netdev@vger.kernel.org, David Woodhouse, William Tu,
      Björn Töpel, Karlsson, Magnus, Alexander Duyck,
      Arnaldo Carvalho de Melo

On Thu, Apr 12, 2018 at 05:31:31PM +0200, Jesper Dangaard Brouer wrote:
> > I guess that is because x86 selects it as the default as soon as
> > we have more than 4G memory.
>
> I was also confused about why I ended up using SWIOTLB (SoftWare
> IO-TLB); that might explain it. And I'm not hitting the bounce-buffer
> case.
>
> How do I control which DMA engine I use? (So I can play a little.)

At the lowest level you control it by:

 (1) setting the dma_ops pointer in struct device
 (2) if that is NULL, by choosing what is returned from
     get_arch_dma_ops()

> > That should be solveable fairly easily with the per-device dma ops,
> > though.
>
> I didn't understand this part.

What I mean by that is that we can start out setting dma_ops to
dma_direct_ops for everyone on x86 (that is, assuming we don't have an
iommu), and only switch to swiotlb_dma_ops when actually required by
either a dma_mask that can't address all memory, or some other special
cases like SEV or broken bridges.

> I wanted to ask your opinion on a hackish idea I have... which is how
> to detect if I can reuse the RX-DMA map address for a TX-DMA operation
> on another device (still/only calling sync_single_for_device).
>
> With XDP_REDIRECT we are redirecting between net_devices. Usually we
> keep the RX-DMA mapping, as we recycle the page. On the redirect to
> the TX device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX. The
> question is how to avoid this mapping(?). In some cases, with some DMA
> engines (or the lack of one), I guess the DMA address is actually the
> same as the already-known RX-DMA mapping dma_addr_t, right? For those
> cases, would it be possible to just (re)use that address for TX?

You can't in any sensible way without breaking a lot of abstractions.

For dma direct ops that mapping will be the same unless the devices
have different dma_offsets in their struct device, or the architecture
overrides phys_to_dma entirely, in which case all bets are off. If you
have an iommu, it depends on which devices are behind the same iommu.

^ permalink raw reply	[flat|nested] 19+ messages in thread
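[Editorial illustration: the condition Christoph states, that the dma-direct mapping is the same for two devices unless their per-device offsets differ, can be shown with a toy model. The names are invented; the kernel's actual field at the time was dma_pfn_offset (a page-frame offset), modeled here as a plain byte offset.]

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;
typedef uint64_t phys_addr_t;

/* Toy model of a device doing dma-direct mapping: each device may
 * carry its own offset that shifts physical addresses into its DMA
 * address space. */
struct toy_device {
	uint64_t dma_offset;
};

/* Direct mapping: the DMA address is the physical address plus the
 * device's own offset, with no IOMMU translation involved. */
static dma_addr_t toy_direct_map(const struct toy_device *dev,
				 phys_addr_t phys)
{
	return phys + dev->dma_offset;
}
```

Two devices see the same dma_addr_t for a given page exactly when their offsets agree; that is the condition under which reusing the RX mapping for TX could even be contemplated, and it silently breaks as soon as one device sits behind an IOMMU or a nonzero offset.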
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-12 14:56   ` Christoph Hellwig
  2018-04-12 15:31     ` Jesper Dangaard Brouer
@ 2018-04-13 17:12     ` Tushar Dave
  2018-04-13 17:26       ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Tushar Dave @ 2018-04-13 17:12 UTC (permalink / raw)
  To: Christoph Hellwig, Jesper Dangaard Brouer
  Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
      David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus,
      Alexander Duyck, Arnaldo Carvalho de Melo

On 04/12/2018 07:56 AM, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
>> On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
>>> ---------------
>>> Implement support for keeping the DMA mapping through the XDP return
>>> call, to remove RX map/unmap calls. Implement bulking for XDP
>>> ndo_xdp_xmit and the XDP return-frame API. Bulking allows performing
>>> DMA bulking via scatter-gather DMA calls; XDP TX needs it for DMA
>>> map+unmap. The driver RX DMA-sync (to CPU) per-packet calls are harder
>>> to mitigate (via a bulk technique). Ask the DMA maintainer for a
>>> common-case direct call for the swiotlb DMA sync call ;-)
>>
>> Why do you even end up in swiotlb code? Once you bounce buffer, your
>> performance is toast anyway..
>
> I guess that is because x86 selects it as the default as soon as
> we have more than 4G memory. That should be solvable fairly easily
> with the per-device dma ops, though.

I guess there is nothing we need to do!

On x86, in case there is no Intel IOMMU or the IOMMU is disabled, you
end up in swiotlb for DMA API calls when the system has more than 4G
of memory. However, AFAICT, for 64-bit DMA capable devices the swiotlb
DMA APIs do not use the bounce buffer unless you have swiotlb=force
specified on the kernel command line.

e.g. here is the snip:

dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
			    unsigned long offset, size_t size,
			    enum dma_data_direction dir,
			    unsigned long attrs)
{
	phys_addr_t map, phys = page_to_phys(page) + offset;
	dma_addr_t dev_addr = phys_to_dma(dev, phys);

	BUG_ON(dir == DMA_NONE);
	/*
	 * If the address happens to be in the device's DMA window,
	 * we can safely return the device addr and not worry about bounce
	 * buffering it.
	 */
	if (dma_capable(dev, dev_addr, size) && swiotlb_force != SWIOTLB_FORCE)
		return dev_addr;

-Tushar

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-13 17:12     ` Tushar Dave
@ 2018-04-13 17:26       ` Christoph Hellwig
  2018-04-14 19:29         ` David Woodhouse
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-13 17:26 UTC (permalink / raw)
  To: Tushar Dave
  Cc: Christoph Hellwig, Jesper Dangaard Brouer,
      xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
      David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus,
      Alexander Duyck, Arnaldo Carvalho de Melo

On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> I guess there is nothing we need to do!
>
> On x86, in case there is no Intel IOMMU or the IOMMU is disabled, you
> end up in swiotlb for DMA API calls when the system has more than 4G
> of memory. However, AFAICT, for 64-bit DMA capable devices the swiotlb
> DMA APIs do not use the bounce buffer unless you have swiotlb=force
> specified on the kernel command line.

Sure. But that means every sync_*_to_device and sync_*_to_cpu now
involves an indirect call to do exactly nothing, which in the workload
Jesper is looking at is causing a huge performance degradation due to
retpolines.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-13 17:26       ` Christoph Hellwig
@ 2018-04-14 19:29         ` David Woodhouse
  2018-04-16  6:02           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2018-04-14 19:29 UTC (permalink / raw)
  To: Christoph Hellwig, Tushar Dave
  Cc: Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org,
      netdev@vger.kernel.org, William Tu, Björn Töpel,
      Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo

On Fri, 2018-04-13 at 19:26 +0200, Christoph Hellwig wrote:
> On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> > I guess there is nothing we need to do!
> >
> > On x86, in case there is no Intel IOMMU or the IOMMU is disabled, you
> > end up in swiotlb for DMA API calls when the system has more than 4G
> > of memory. However, AFAICT, for 64-bit DMA capable devices the swiotlb
> > DMA APIs do not use the bounce buffer unless you have swiotlb=force
> > specified on the kernel command line.
>
> Sure. But that means every sync_*_to_device and sync_*_to_cpu now
> involves an indirect call to do exactly nothing, which in the workload
> Jesper is looking at is causing a huge performance degradation due to
> retpolines.

We should look at using the

	if (dma_ops == swiotlb_dma_ops)
		swiotlb_map_page()
	else
		dma_ops->map_page()

trick for this. Perhaps with alternatives, so that when an Intel or AMD
IOMMU is detected it's *that* which is checked for as the special case.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-14 19:29         ` David Woodhouse
@ 2018-04-16  6:02           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-16 6:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Christoph Hellwig, Tushar Dave, xdp-newbies@vger.kernel.org,
      netdev@vger.kernel.org, William Tu, Björn Töpel,
      Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo,
      brouer

On Sat, 14 Apr 2018 21:29:26 +0200
David Woodhouse <dwmw2@infradead.org> wrote:

> On Fri, 2018-04-13 at 19:26 +0200, Christoph Hellwig wrote:
> > On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> > > I guess there is nothing we need to do!
> > >
> > > On x86, in case there is no Intel IOMMU or the IOMMU is disabled, you
> > > end up in swiotlb for DMA API calls when the system has more than 4G
> > > of memory. However, AFAICT, for 64-bit DMA capable devices the swiotlb
> > > DMA APIs do not use the bounce buffer unless you have swiotlb=force
> > > specified on the kernel command line.
> >
> > Sure. But that means every sync_*_to_device and sync_*_to_cpu now
> > involves an indirect call to do exactly nothing, which in the workload
> > Jesper is looking at is causing a huge performance degradation due to
> > retpolines.

Yes, exactly.

> We should look at using the
>
> 	if (dma_ops == swiotlb_dma_ops)
> 		swiotlb_map_page()
> 	else
> 		dma_ops->map_page()
>
> trick for this. Perhaps with alternatives, so that when an Intel or AMD
> IOMMU is detected it's *that* which is checked for as the special case.

Yes, this trick is basically what I'm asking for :-)

It did sound like Hellwig wanted to first avoid/fix the fact that x86
ends up defaulting to swiotlb. Thus, we just have to do the same trick
with the new default fall-through dma_ops.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread
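[Editorial illustration: the trick agreed on above can be sketched in plain C with toy names (not the kernel API). The dispatch compares the ops pointer against a well-known ops table and makes a direct call in the common case, so the compiler emits a plain call there; only the uncommon case goes through the retpoline-guarded indirect call.]

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Toy ops table, standing in for struct dma_map_ops. */
struct toy_dma_ops {
	dma_addr_t (*map_page)(uint64_t phys);
};

static dma_addr_t toy_swiotlb_map_page(uint64_t phys)
{
	return phys;	/* direct mapping, no bounce buffer */
}

static const struct toy_dma_ops toy_swiotlb_ops = {
	.map_page = toy_swiotlb_map_page,
};

/* A second ops table that maps differently, so the two dispatch
 * paths can be told apart. */
static dma_addr_t toy_other_map_page(uint64_t phys)
{
	return phys + 1;
}

static const struct toy_dma_ops toy_other_ops = {
	.map_page = toy_other_map_page,
};

static dma_addr_t toy_dma_map_page(const struct toy_dma_ops *ops,
				   uint64_t phys)
{
	/*
	 * When ops is the well-known table, call the implementation
	 * directly: a plain call is not rewritten into a retpoline
	 * thunk, so the common case pays no speculation penalty.
	 */
	if (ops == &toy_swiotlb_ops)
		return toy_swiotlb_map_page(phys);
	return ops->map_page(phys);
}
```

The design choice is that correctness does not depend on the comparison: both branches produce the same result for the well-known ops, so the check is purely an optimization that can later be patched via alternatives, as David suggests.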
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2 2018-04-12 13:50 XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer 2018-04-12 14:51 ` Christoph Hellwig @ 2018-04-16 12:27 ` Christoph Hellwig 2018-04-16 16:04 ` Alexander Duyck ` (3 more replies) 1 sibling, 4 replies; 19+ messages in thread From: Christoph Hellwig @ 2018-04-16 12:27 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo Can you try the following hack which avoids indirect calls entirely for the fast path direct mapping case? --- >From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig <hch@lst.de> Date: Mon, 16 Apr 2018 14:18:14 +0200 Subject: dma-mapping: bypass dma_ops for direct mappings Reportedly the retpoline mitigation for spectre causes huge penalties for indirect function calls. This hack bypasses the dma_ops mechanism for simple direct mappings. Signed-off-by: Christoph Hellwig <hch@lst.de> --- include/linux/device.h | 1 + include/linux/dma-mapping.h | 53 +++++++++++++++++++++++++++---------- lib/dma-direct.c | 4 +-- 3 files changed, 42 insertions(+), 16 deletions(-) diff --git a/include/linux/device.h b/include/linux/device.h index 0059b99e1f25..725eec4c6653 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -990,6 +990,7 @@ struct device { bool offline_disabled:1; bool offline:1; bool of_node_reused:1; + bool is_dma_direct:1; }; static inline struct device *kobj_to_dev(struct kobject *kobj) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index f8ab1c0f589e..c5d384ae25d6 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -223,6 +223,13 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) } #endif +/* do not use directly! 
*/ +dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, enum dma_data_direction dir, + unsigned long attrs); +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, unsigned long attrs); + static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, size_t size, enum dma_data_direction dir, @@ -232,9 +239,13 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, dma_addr_t addr; BUG_ON(!valid_dma_direction(dir)); - addr = ops->map_page(dev, virt_to_page(ptr), - offset_in_page(ptr), size, - dir, attrs); + if (dev->is_dma_direct) { + addr = dma_direct_map_page(dev, virt_to_page(ptr), + offset_in_page(ptr), size, dir, attrs); + } else { + addr = ops->map_page(dev, virt_to_page(ptr), + offset_in_page(ptr), size, dir, attrs); + } debug_dma_map_page(dev, virt_to_page(ptr), offset_in_page(ptr), size, dir, addr, true); @@ -249,7 +260,7 @@ static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->unmap_page) + if (!dev->is_dma_direct && ops->unmap_page) ops->unmap_page(dev, addr, size, dir, attrs); debug_dma_unmap_page(dev, addr, size, dir, true); } @@ -266,7 +277,10 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int ents; BUG_ON(!valid_dma_direction(dir)); - ents = ops->map_sg(dev, sg, nents, dir, attrs); + if (dev->is_dma_direct) + ents = dma_direct_map_sg(dev, sg, nents, dir, attrs); + else + ents = ops->map_sg(dev, sg, nents, dir, attrs); BUG_ON(ents < 0); debug_dma_map_sg(dev, sg, nents, ents, dir); @@ -281,7 +295,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg BUG_ON(!valid_dma_direction(dir)); debug_dma_unmap_sg(dev, sg, nents, dir); - if (ops->unmap_sg) + if (!dev->is_dma_direct && ops->unmap_sg) ops->unmap_sg(dev, sg, nents, dir, 
attrs); } @@ -295,7 +309,10 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev, dma_addr_t addr; BUG_ON(!valid_dma_direction(dir)); - addr = ops->map_page(dev, page, offset, size, dir, attrs); + if (dev->is_dma_direct) + addr = dma_direct_map_page(dev, page, offset, size, dir, attrs); + else + addr = ops->map_page(dev, page, offset, size, dir, attrs); debug_dma_map_page(dev, page, offset, size, dir, addr, false); return addr; @@ -309,7 +326,7 @@ static inline void dma_unmap_page_attrs(struct device *dev, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->unmap_page) + if (!dev->is_dma_direct && ops->unmap_page) ops->unmap_page(dev, addr, size, dir, attrs); debug_dma_unmap_page(dev, addr, size, dir, false); } @@ -356,7 +373,7 @@ static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_cpu) + if (!dev->is_dma_direct && ops->sync_single_for_cpu) ops->sync_single_for_cpu(dev, addr, size, dir); debug_dma_sync_single_for_cpu(dev, addr, size, dir); } @@ -368,7 +385,7 @@ static inline void dma_sync_single_for_device(struct device *dev, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_device) + if (!dev->is_dma_direct && ops->sync_single_for_device) ops->sync_single_for_device(dev, addr, size, dir); debug_dma_sync_single_for_device(dev, addr, size, dir); } @@ -382,7 +399,7 @@ static inline void dma_sync_single_range_for_cpu(struct device *dev, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_cpu) + if (!dev->is_dma_direct && ops->sync_single_for_cpu) ops->sync_single_for_cpu(dev, addr + offset, size, dir); debug_dma_sync_single_range_for_cpu(dev, addr, offset, size, dir); } @@ -396,7 +413,7 @@ static inline void dma_sync_single_range_for_device(struct device 
*dev, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_single_for_device) + if (!dev->is_dma_direct && ops->sync_single_for_device) ops->sync_single_for_device(dev, addr + offset, size, dir); debug_dma_sync_single_range_for_device(dev, addr, offset, size, dir); } @@ -408,7 +425,7 @@ dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_sg_for_cpu) + if (!dev->is_dma_direct && ops->sync_sg_for_cpu) ops->sync_sg_for_cpu(dev, sg, nelems, dir); debug_dma_sync_sg_for_cpu(dev, sg, nelems, dir); } @@ -420,7 +437,7 @@ dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (ops->sync_sg_for_device) + if (!dev->is_dma_direct && ops->sync_sg_for_device) ops->sync_sg_for_device(dev, sg, nelems, dir); debug_dma_sync_sg_for_device(dev, sg, nelems, dir); @@ -600,6 +617,8 @@ static inline int dma_supported(struct device *dev, u64 mask) return ops->dma_supported(dev, mask); } +extern const struct dma_map_ops swiotlb_dma_ops; + #ifndef HAVE_ARCH_DMA_SET_MASK static inline int dma_set_mask(struct device *dev, u64 mask) { @@ -609,6 +628,12 @@ static inline int dma_set_mask(struct device *dev, u64 mask) dma_check_mask(dev, mask); *dev->dma_mask = mask; + if (dev->dma_ops == &dma_direct_ops || + (dev->dma_ops == &swiotlb_dma_ops && + mask == DMA_BIT_MASK(64))) + dev->is_dma_direct = true; + else + dev->is_dma_direct = false; return 0; } #endif diff --git a/lib/dma-direct.c b/lib/dma-direct.c index c0bba30fef0a..3deb8666974b 100644 --- a/lib/dma-direct.c +++ b/lib/dma-direct.c @@ -120,7 +120,7 @@ void dma_direct_free(struct device *dev, size_t size, void *cpu_addr, free_pages((unsigned long)cpu_addr, page_order); } -static dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, +dma_addr_t 
dma_direct_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, unsigned long attrs) { @@ -131,7 +131,7 @@ static dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, return dma_addr; } -static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs) { int i; -- 2.17.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2 2018-04-16 12:27 ` Christoph Hellwig @ 2018-04-16 16:04 ` Alexander Duyck 2018-04-17 6:19 ` Christoph Hellwig 2018-04-16 18:05 ` dma-mapping: bypass dma_ops for direct mappings kbuild test robot ` (2 subsequent siblings) 3 siblings, 1 reply; 19+ messages in thread From: Alexander Duyck @ 2018-04-16 16:04 UTC (permalink / raw) To: Christoph Hellwig Cc: Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org, netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus, Arnaldo Carvalho de Melo On Mon, Apr 16, 2018 at 5:27 AM, Christoph Hellwig <hch@infradead.org> wrote: > Can you try the following hack which avoids indirect calls entirely > for the fast path direct mapping case? > > --- > From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig <hch@lst.de> > Date: Mon, 16 Apr 2018 14:18:14 +0200 > Subject: dma-mapping: bypass dma_ops for direct mappings > > Reportedly the retpoline mitigation for spectre causes huge penalties > for indirect function calls. This hack bypasses the dma_ops mechanism > for simple direct mappings. 
> > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > include/linux/device.h | 1 + > include/linux/dma-mapping.h | 53 +++++++++++++++++++++++++++---------- > lib/dma-direct.c | 4 +-- > 3 files changed, 42 insertions(+), 16 deletions(-) > > diff --git a/include/linux/device.h b/include/linux/device.h > index 0059b99e1f25..725eec4c6653 100644 > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -990,6 +990,7 @@ struct device { > bool offline_disabled:1; > bool offline:1; > bool of_node_reused:1; > + bool is_dma_direct:1; > }; > > static inline struct device *kobj_to_dev(struct kobject *kobj) > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h > index f8ab1c0f589e..c5d384ae25d6 100644 > --- a/include/linux/dma-mapping.h > +++ b/include/linux/dma-mapping.h > @@ -223,6 +223,13 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) > } > #endif > > +/* do not use directly! */ > +dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, > + unsigned long offset, size_t size, enum dma_data_direction dir, > + unsigned long attrs); > +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, > + int nents, enum dma_data_direction dir, unsigned long attrs); > + > static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, > size_t size, > enum dma_data_direction dir, > @@ -232,9 +239,13 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, > dma_addr_t addr; > > BUG_ON(!valid_dma_direction(dir)); > - addr = ops->map_page(dev, virt_to_page(ptr), > - offset_in_page(ptr), size, > - dir, attrs); > + if (dev->is_dma_direct) { > + addr = dma_direct_map_page(dev, virt_to_page(ptr), > + offset_in_page(ptr), size, dir, attrs); > + } else { > + addr = ops->map_page(dev, virt_to_page(ptr), > + offset_in_page(ptr), size, dir, attrs); > + } > debug_dma_map_page(dev, virt_to_page(ptr), > offset_in_page(ptr), size, > dir, addr, true); I'm not sure if I am really a fan 
of trying to solve this in this way. It seems like this is going to be
optimizing the paths for one case at the detriment of others.
Historically mapping and unmapping has always been expensive,
especially in the case of IOMMU enabled environments. I would much
rather see us focus on having swiotlb_dma_ops replaced with
dma_direct_ops in the cases where the device can access all of
physical memory.

> @@ -249,7 +260,7 @@ static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->unmap_page)
> +	if (!dev->is_dma_direct && ops->unmap_page)

If I understand correctly this is only needed for the swiotlb case and
not the dma_direct case. It would make much more sense to just
overwrite the dev->dma_ops pointer with dma_direct_ops to address all
of the sync and unmap cases.

> 		ops->unmap_page(dev, addr, size, dir, attrs);
> 	debug_dma_unmap_page(dev, addr, size, dir, true);
> }
> @@ -266,7 +277,10 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> 	int ents;
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +	if (dev->is_dma_direct)
> +		ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
> +	else
> +		ents = ops->map_sg(dev, sg, nents, dir, attrs);
> 	BUG_ON(ents < 0);
> 	debug_dma_map_sg(dev, sg, nents, ents, dir);
>
> @@ -281,7 +295,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
>
> 	BUG_ON(!valid_dma_direction(dir));
> 	debug_dma_unmap_sg(dev, sg, nents, dir);
> -	if (ops->unmap_sg)
> +	if (!dev->is_dma_direct && ops->unmap_sg)
> 		ops->unmap_sg(dev, sg, nents, dir, attrs);
> }
>
> @@ -295,7 +309,10 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev,
> 	dma_addr_t addr;
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	addr = ops->map_page(dev, page, offset, size, dir, attrs);
> +	if (dev->is_dma_direct)
> +		addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
> +	else
> +		addr = ops->map_page(dev, page, offset, size, dir, attrs);
> 	debug_dma_map_page(dev, page, offset, size, dir, addr, false);
>
> 	return addr;
> @@ -309,7 +326,7 @@ static inline void dma_unmap_page_attrs(struct device *dev,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->unmap_page)
> +	if (!dev->is_dma_direct && ops->unmap_page)
> 		ops->unmap_page(dev, addr, size, dir, attrs);
> 	debug_dma_unmap_page(dev, addr, size, dir, false);
> }
> @@ -356,7 +373,7 @@ static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->sync_single_for_cpu)
> +	if (!dev->is_dma_direct && ops->sync_single_for_cpu)
> 		ops->sync_single_for_cpu(dev, addr, size, dir);
> 	debug_dma_sync_single_for_cpu(dev, addr, size, dir);
> }
> @@ -368,7 +385,7 @@ static inline void dma_sync_single_for_device(struct device *dev,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->sync_single_for_device)
> +	if (!dev->is_dma_direct && ops->sync_single_for_device)
> 		ops->sync_single_for_device(dev, addr, size, dir);
> 	debug_dma_sync_single_for_device(dev, addr, size, dir);
> }
> @@ -382,7 +399,7 @@ static inline void dma_sync_single_range_for_cpu(struct device *dev,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->sync_single_for_cpu)
> +	if (!dev->is_dma_direct && ops->sync_single_for_cpu)
> 		ops->sync_single_for_cpu(dev, addr + offset, size, dir);
> 	debug_dma_sync_single_range_for_cpu(dev, addr, offset, size, dir);
> }
> @@ -396,7 +413,7 @@ static inline void dma_sync_single_range_for_device(struct device *dev,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->sync_single_for_device)
> +	if (!dev->is_dma_direct && ops->sync_single_for_device)
> 		ops->sync_single_for_device(dev, addr + offset, size, dir);
> 	debug_dma_sync_single_range_for_device(dev, addr, offset, size, dir);
> }
> @@ -408,7 +425,7 @@ dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->sync_sg_for_cpu)
> +	if (!dev->is_dma_direct && ops->sync_sg_for_cpu)
> 		ops->sync_sg_for_cpu(dev, sg, nelems, dir);
> 	debug_dma_sync_sg_for_cpu(dev, sg, nelems, dir);
> }
> @@ -420,7 +437,7 @@ dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
> 	const struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> -	if (ops->sync_sg_for_device)
> +	if (!dev->is_dma_direct && ops->sync_sg_for_device)
> 		ops->sync_sg_for_device(dev, sg, nelems, dir);
> 	debug_dma_sync_sg_for_device(dev, sg, nelems, dir);
>
> @@ -600,6 +617,8 @@ static inline int dma_supported(struct device *dev, u64 mask)
> 		return ops->dma_supported(dev, mask);
> }
>
> +extern const struct dma_map_ops swiotlb_dma_ops;
> +
> #ifndef HAVE_ARCH_DMA_SET_MASK
> static inline int dma_set_mask(struct device *dev, u64 mask)
> {
> @@ -609,6 +628,12 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
> 	dma_check_mask(dev, mask);
>
> 	*dev->dma_mask = mask;
> +	if (dev->dma_ops == &dma_direct_ops ||
> +	    (dev->dma_ops == &swiotlb_dma_ops &&
> +	     mask == DMA_BIT_MASK(64)))
> +		dev->is_dma_direct = true;
> +	else
> +		dev->is_dma_direct = false;

So I am not sure this will work on x86. If I am not mistaken I believe
dev->dma_ops is normally not set and instead the default dma
operations are pulled via get_arch_dma_ops which returns the global
dma_ops pointer.

What you may want to consider as an alternative would be to look at
modifying drivers that are using the swiotlb so that you could just
overwrite the dev->dma_ops with the dma_direct_ops in the cases where
the hardware can support accessing all of physical hardware and where
we aren't forcing the use of the bounce buffers in the swiotlb.

Then for the above code you only have to worry about the map calls,
and you could just do a check against the dma_direct_ops pointer
instead of having to add a new flag.

> 	return 0;
> }
> #endif
> diff --git a/lib/dma-direct.c b/lib/dma-direct.c
> index c0bba30fef0a..3deb8666974b 100644
> --- a/lib/dma-direct.c
> +++ b/lib/dma-direct.c
> @@ -120,7 +120,7 @@ void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
> 	free_pages((unsigned long)cpu_addr, page_order);
> }
>
> -static dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
> +dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
> 		unsigned long offset, size_t size, enum dma_data_direction dir,
> 		unsigned long attrs)
> {
> @@ -131,7 +131,7 @@ static dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
> 	return dma_addr;
> }
>
> -static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
> +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
> 		int nents, enum dma_data_direction dir, unsigned long attrs)
> {
> 	int i;
> --
> 2.17.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread
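[Editorial note: the shape of the hack being reviewed above — skip the retpolined `ops->` indirection whenever a per-device flag marks the mapping as direct — can be sketched in plain C outside the kernel. Everything below is a hypothetical stand-in, not the real `struct device`, `struct dma_map_ops`, or DMA API; it only illustrates the dispatch pattern.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

struct device;

/* Stand-in for the kernel's table of indirect DMA operations. */
struct dma_map_ops {
	dma_addr_t (*map_page)(struct device *dev, uint64_t phys);
};

struct device {
	const struct dma_map_ops *ops;
	bool is_dma_direct;	/* the flag the patch under review adds */
	uint64_t dma_offset;
};

/* The direct fast path: no function pointer, just an offset. */
static dma_addr_t dma_direct_map_page(struct device *dev, uint64_t phys)
{
	return phys + dev->dma_offset;
}

/* A slow-path implementation, reached only via the indirect call. */
static dma_addr_t bounce_map_page(struct device *dev, uint64_t phys)
{
	(void)dev;
	(void)phys;
	return 0xdead0000u;	/* pretend: a bounce-buffer address */
}

/*
 * The wrapper the patch modifies: when is_dma_direct is set, the
 * indirect (and on retpoline kernels, expensive) call through
 * ops->map_page is skipped in favor of a direct call.
 */
static dma_addr_t dma_map_page_attrs(struct device *dev, uint64_t phys)
{
	if (dev->is_dma_direct)
		return dma_direct_map_page(dev, phys);
	return dev->ops->map_page(dev, phys);
}
```

The trade-off Duyck objects to is visible here: every non-direct caller now pays an extra flag test so that the direct case can avoid the indirect branch.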
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-16 16:04 ` Alexander Duyck
@ 2018-04-17  6:19 ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-17  6:19 UTC (permalink / raw)
To: Alexander Duyck
Cc: Christoph Hellwig, Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org,
    netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse, William Tu,
    Björn Töpel, Karlsson, Magnus, Arnaldo Carvalho de Melo

> I'm not sure if I am really a fan of trying to solve this in this way.
> It seems like this is going to be optimizing the paths for one case at
> the detriment of others. Historically mapping and unmapping has always
> been expensive, especially in the case of IOMMU enabled environments.
> I would much rather see us focus on having swiotlb_dma_ops replaced
> with dma_direct_ops in the cases where the device can access all of
> physical memory.

I am definitely not a fan, but IFF indirect calls are such an overhead
it makes sense to avoid it for the common and simple case. And the
direct mapping is a common case present on just about every
architecture, and it is a very simple fast path that just adds an
offset to the physical address. So if we want to speed something up,
this is it.

> > -	if (ops->unmap_page)
> > +	if (!dev->is_dma_direct && ops->unmap_page)
>
> If I understand correctly this is only needed for the swiotlb case and
> not the dma_direct case. It would make much more sense to just
> overwrite the dev->dma_ops pointer with dma_direct_ops to address all
> of the sync and unmap cases.

Yes.

> > +	if (dev->dma_ops == &dma_direct_ops ||
> > +	    (dev->dma_ops == &swiotlb_dma_ops &&
> > +	     mask == DMA_BIT_MASK(64)))
> > +		dev->is_dma_direct = true;
> > +	else
> > +		dev->is_dma_direct = false;
>
> So I am not sure this will work on x86. If I am not mistaken I believe
> dev->dma_ops is normally not set and instead the default dma
> operations are pulled via get_arch_dma_ops which returns the global
> dma_ops pointer.

True, for x86 we'd need to check get_arch_dma_ops().

> What you may want to consider as an alternative would be to look at
> modifying drivers that are using the swiotlb so that you could just
> overwrite the dev->dma_ops with the dma_direct_ops in the cases where
> the hardware can support accessing all of physical hardware and where
> we aren't forcing the use of the bounce buffers in the swiotlb.
>
> Then for the above code you only have to worry about the map calls,
> and you could just do a check against the dma_direct_ops pointer
> instead of having to add a new flag.

That would be the long term plan IFF we go down this route. For now I
just wanted a quick hack for performance testing.

^ permalink raw reply	[flat|nested] 19+ messages in thread
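[Editorial note: the x86 subtlety called out above — `dev->dma_ops` is usually unset and the effective ops come from `get_arch_dma_ops()` — is easy to model. In this toy version the names mirror the kernel's but every type and function is a simplified stand-in; a naive pointer check against `dev->dma_ops` misclassifies a device whose arch default happens to be the direct ops.]

```c
#include <assert.h>
#include <stddef.h>

struct dma_map_ops { int id; };

/* Stand-in for the kernel's dma_direct_ops instance. */
static const struct dma_map_ops dma_direct_ops = { .id = 1 };

struct device {
	const struct dma_map_ops *dma_ops;	/* often NULL on x86 */
};

/*
 * Arch fallback: on this toy "x86" the default ops happen to be the
 * direct ops, as on a machine that needs no IOMMU or bounce buffering.
 */
static const struct dma_map_ops *get_arch_dma_ops(void)
{
	return &dma_direct_ops;
}

/* Resolve the effective ops: per-device pointer, else arch default. */
static const struct dma_map_ops *get_dma_ops(const struct device *dev)
{
	if (dev->dma_ops)
		return dev->dma_ops;
	return get_arch_dma_ops();
}

/* Checking only the per-device pointer misses the arch-default case... */
static int is_direct_naive(const struct device *dev)
{
	return dev->dma_ops == &dma_direct_ops;
}

/* ...while checking the resolved ops classifies it correctly. */
static int is_direct_resolved(const struct device *dev)
{
	return get_dma_ops(dev) == &dma_direct_ops;
}
```

This is exactly why the quoted `dev->dma_ops == &dma_direct_ops` test in the RFC patch would evaluate false for most x86 devices even when their mapping really is direct.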
* Re: dma-mapping: bypass dma_ops for direct mappings
  2018-04-16 12:27 ` Christoph Hellwig
  2018-04-16 16:04 ` Alexander Duyck
@ 2018-04-16 18:05 ` kbuild test robot
  2018-04-16 18:26 ` Jesper Dangaard Brouer
  2018-04-16 18:31 ` kbuild test robot
  2018-04-16 21:07 ` XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
  3 siblings, 1 reply; 19+ messages in thread
From: kbuild test robot @ 2018-04-16 18:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: kbuild-all, Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org,
    netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse, William Tu,
    Björn Töpel, Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

Hi Christoph,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc1 next-20180416]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/dma-mapping-bypass-dma_ops-for-direct-mappings/20180416-230032
config: i386-defconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers/char/agp/intel-gtt.o: In function `intel_gmch_probe':
>> intel-gtt.c:(.text+0x11e4): undefined reference to `swiotlb_dma_ops'
   drivers/ata/ahci.o: In function `ahci_init_one':
>> ahci.c:(.text+0x108d): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/broadcom/bnx2.o: In function `bnx2_init_one':
>> bnx2.c:(.text+0x7fe7): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/broadcom/tg3.o: In function `tg3_init_one':
>> tg3.c:(.text+0x13549): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/intel/e1000/e1000_main.o: In function `e1000_probe':
>> e1000_main.c:(.text+0x49b3): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/intel/e1000e/netdev.o:netdev.c:(.text+0xa65e): more undefined references to `swiotlb_dma_ops' follow

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26269 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: dma-mapping: bypass dma_ops for direct mappings
  2018-04-16 18:05 ` dma-mapping: bypass dma_ops for direct mappings kbuild test robot
@ 2018-04-16 18:26 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-16 18:26 UTC (permalink / raw)
To: kbuild test robot
Cc: Christoph Hellwig, kbuild-all, netdev@vger.kernel.org, Christoph Hellwig,
    David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus,
    Alexander Duyck, Arnaldo Carvalho de Melo, brouer

On Tue, 17 Apr 2018 02:05:12 +0800
kbuild test robot <lkp@intel.com> wrote:

> Hi Christoph,
>
> I love your patch! Yet something to improve:

I was just about to complain about the same compile error ;-)

> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.17-rc1 next-20180416]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/dma-mapping-bypass-dma_ops-for-direct-mappings/20180416-230032
> config: i386-defconfig (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    drivers/char/agp/intel-gtt.o: In function `intel_gmch_probe':
> >> intel-gtt.c:(.text+0x11e4): undefined reference to `swiotlb_dma_ops'
>    drivers/ata/ahci.o: In function `ahci_init_one':
> >> ahci.c:(.text+0x108d): undefined reference to `swiotlb_dma_ops'
>    drivers/net/ethernet/broadcom/bnx2.o: In function `bnx2_init_one':
> >> bnx2.c:(.text+0x7fe7): undefined reference to `swiotlb_dma_ops'
>    drivers/net/ethernet/broadcom/tg3.o: In function `tg3_init_one':
> >> tg3.c:(.text+0x13549): undefined reference to `swiotlb_dma_ops'
>    drivers/net/ethernet/intel/e1000/e1000_main.o: In function `e1000_probe':
> >> e1000_main.c:(.text+0x49b3): undefined reference to `swiotlb_dma_ops'
>    drivers/net/ethernet/intel/e1000e/netdev.o:netdev.c:(.text+0xa65e): more undefined references to `swiotlb_dma_ops' follow
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: dma-mapping: bypass dma_ops for direct mappings
  2018-04-16 12:27 ` Christoph Hellwig
  2018-04-16 16:04 ` Alexander Duyck
  2018-04-16 18:05 ` dma-mapping: bypass dma_ops for direct mappings kbuild test robot
@ 2018-04-16 18:31 ` kbuild test robot
  2018-04-16 21:07 ` XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
  3 siblings, 0 replies; 19+ messages in thread
From: kbuild test robot @ 2018-04-16 18:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: kbuild-all, Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org,
    netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse, William Tu,
    Björn Töpel, Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo

[-- Attachment #1: Type: text/plain, Size: 12130 bytes --]

Hi Christoph,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc1 next-20180416]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/dma-mapping-bypass-dma_ops-for-direct-mappings/20180416-230032
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   include/net/mac80211.h:2083: warning: bad line: >
   [the identical mac80211.h:2083 "bad line" warning was repeated dozens of times and has been trimmed]
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'rx_stats_avg' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'rx_stats_avg.signal' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'rx_stats_avg.chain_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.filtered' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.retry_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.retry_count' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.lost_packets' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.last_tdls_pkt_time' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.msdu_retries' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.msdu_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.last_ack' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.last_ack_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.ack_signal_filled' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.packets' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.bytes' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.last_rate' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.msdu' not described in 'sta_info'
   kernel/sched/fair.c:3731: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_excl.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_excl.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_excl.active' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_shared.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_shared.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_shared.active' not described in 'dma_buf'
   include/linux/dma-fence-array.h:54: warning: Function parameter or member 'work' not described in 'dma_fence_array'
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware_direct drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware_into_buf drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware_nowait drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function firmware_request_cache drivers/base/firmware_class.c' failed with return code 1
   include/linux/gpio/driver.h:142: warning: Function parameter or member 'request_key' not described in 'gpio_irq_chip'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.sign' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.realbits' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.storagebits' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.shift' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.repeat' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.endianness' not described in 'iio_chan_spec'
   include/linux/iio/hw-consumer.h:1: warning: no structured comments found
>> include/linux/device.h:995: warning: Function parameter or member 'is_dma_direct' not described in 'device'
   Error: Cannot open file drivers/base/firmware_class.c
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -export drivers/base/firmware_class.c' failed with return code 2
   include/linux/input/sparse-keymap.h:46: warning: Function parameter or member 'sw' not described in 'key_entry'
   include/linux/mtd/rawnand.h:752: warning: Function parameter or member 'timings.sdr' not described in 'nand_data_interface'
   include/linux/mtd/rawnand.h:817: warning: Function parameter or member 'buf' not described in 'nand_op_data_instr'
   include/linux/mtd/rawnand.h:817: warning: Function parameter or member 'buf.in' not described in 'nand_op_data_instr'
   include/linux/mtd/rawnand.h:817: warning: Function parameter or member 'buf.out' not described in 'nand_op_data_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.cmd' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.addr' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.data' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.waitrdy' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:1010: warning: Function parameter or member 'ctx' not described in 'nand_op_parser_pattern_elem'
   include/linux/mtd/rawnand.h:1010: warning: Function parameter or member 'ctx.addr' not described in 'nand_op_parser_pattern_elem'
   include/linux/mtd/rawnand.h:1010: warning: Function parameter or member 'ctx.data' not described in 'nand_op_parser_pattern_elem'
   include/linux/mtd/rawnand.h:1313: warning: Function parameter or member 'manufacturer.desc' not described in 'nand_chip'
   include/linux/mtd/rawnand.h:1313: warning: Function parameter or member 'manufacturer.priv' not described in 'nand_chip'
   include/linux/regulator/driver.h:222: warning: Function parameter or member 'resume_early' not described in 'regulator_ops'
   drivers/regulator/core.c:4306: warning: Excess function parameter 'state' description in 'regulator_suspend_late'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw0' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw1' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw2' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw3' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.eadm' not described in 'irb'
   drivers/usb/typec/mux.c:186: warning: Function parameter or member 'mux' not described in 'typec_mux_unregister'
   drivers/usb/typec/mux.c:186: warning: Excess function parameter 'sw' description in 'typec_mux_unregister'

vim +995 include/linux/device.h

^1da177e Linus Torvalds 2005-04-16 @995

:::::: The code at line 995 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:::::: TO: Linus Torvalds <torvalds@ppc970.osdl.org>
:::::: CC: Linus Torvalds <torvalds@ppc970.osdl.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6353 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-16 12:27 ` Christoph Hellwig
    ` (2 preceding siblings ...)
  2018-04-16 18:31 ` kbuild test robot
@ 2018-04-16 21:07 ` Jesper Dangaard Brouer
  2018-04-17  6:15 ` Christoph Hellwig
  3 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-16 21:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org, Christoph Hellwig,
    David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus,
    Alexander Duyck, Arnaldo Carvalho de Melo, brouer

On Mon, 16 Apr 2018 05:27:06 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> Can you try the following hack which avoids indirect calls entirely
> for the fast path direct mapping case?
>
> ---
> From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Mon, 16 Apr 2018 14:18:14 +0200
> Subject: dma-mapping: bypass dma_ops for direct mappings
>
> Reportedly the retpoline mitigation for spectre causes huge penalties
> for indirect function calls. This hack bypasses the dma_ops mechanism
> for simple direct mappings.

I did below to get it compiling, and working... On X86 swiotlb
fallback (via get_dma_ops -> get_arch_dma_ops) to use
x86_swiotlb_dma_ops, instead of swiotlb_dma_ops. I also included that
in below fix patch.

Performance improved to 8.9 Mpps from approx 6.5 Mpps.

(This was without my bulking for net_device->ndo_xdp_xmit, so that
number should improve more).

---
[PATCH RFC] fixups for Hellwig's DMA avoid retpoline overhead patch

From: Jesper Dangaard Brouer <brouer@redhat.com>

Performance improved to 8.9 Mpps (8917613 pkt/s); it was around
6.5 Mpps before.
---
 arch/x86/kernel/pci-swiotlb.c                 |  3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  1 +
 include/linux/dma-mapping.h                   | 14 +++++++++++++-
 lib/Kconfig                                   |  2 +-
 lib/Makefile                                  |  1 +
 lib/dma-direct.c                              |  2 ++
 lib/swiotlb.c                                 |  1 +
 7 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 0ee0f8f34251..46207e288587 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -48,7 +48,7 @@ void x86_swiotlb_free_coherent(struct device *dev, size_t size,
 	dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
 }
 
-static const struct dma_map_ops x86_swiotlb_dma_ops = {
+const struct dma_map_ops x86_swiotlb_dma_ops = {
 	.mapping_error = swiotlb_dma_mapping_error,
 	.alloc = x86_swiotlb_alloc_coherent,
 	.free = x86_swiotlb_free_coherent,
@@ -62,6 +62,7 @@ static const struct dma_map_ops x86_swiotlb_dma_ops = {
 	.unmap_page = swiotlb_unmap_page,
 	.dma_supported = NULL,
 };
+EXPORT_SYMBOL(x86_swiotlb_dma_ops);
 
 /*
  * pci_swiotlb_detect_override - set swiotlb to 1 if necessary
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0daccaf72a30..6d2e3f75febc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10297,6 +10297,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		return err;
 
 	if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
+		pr_info("XXX %s() dma_set_mask_and_coherent\n", __func__);
 		pci_using_dac = 1;
 	} else {
 		err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f2fb5aec7626..7fa92664ebfd 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -622,6 +622,7 @@ static inline int dma_supported(struct device *dev, u64 mask)
 }
 
 extern const struct dma_map_ops swiotlb_dma_ops;
+extern const struct dma_map_ops x86_swiotlb_dma_ops;
 
 #ifndef HAVE_ARCH_DMA_SET_MASK
 static inline int dma_set_mask(struct device *dev, u64 mask)
@@ -632,12 +633,23 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
 	dma_check_mask(dev, mask);
 
 	*dev->dma_mask = mask;
+#ifdef CONFIG_DMA_DIRECT_OPS
 	if (dev->dma_ops == &dma_direct_ops ||
+# ifdef CONFIG_SWIOTLB
 	    (dev->dma_ops == &swiotlb_dma_ops &&
-	     mask == DMA_BIT_MASK(64)))
+	     mask == DMA_BIT_MASK(64)) ||
+#  ifdef CONFIG_X86
+	    (get_dma_ops(dev) == &x86_swiotlb_dma_ops &&
+	     mask == DMA_BIT_MASK(64))
+#  endif /* CONFIG_X86 */
+# endif /* CONFIG_SWIOTLB */
+	    )
 		dev->is_dma_direct = true;
 	else
+#endif /* CONFIG_DMA_DIRECT_OPS */
 		dev->is_dma_direct = false;
+
+	pr_info("XXX: %s() DMA is direct: %d\n", __func__, dev->is_dma_direct);
 	return 0;
 }
 #endif
diff --git a/lib/Kconfig b/lib/Kconfig
index e96089499371..6eba2bcf468a 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -416,7 +416,7 @@ config SGL_ALLOC
 config DMA_DIRECT_OPS
 	bool
 	depends on HAS_DMA && (!64BIT || ARCH_DMA_ADDR_T_64BIT)
-	default n
+	default y
 
 config DMA_VIRT_OPS
 	bool
diff --git a/lib/Makefile b/lib/Makefile
index a90d4fcd748f..df4885eabf9c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -29,6 +29,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
 lib-$(CONFIG_DMA_DIRECT_OPS) += dma-direct.o
+#lib-y += dma-direct.o
 lib-$(CONFIG_DMA_VIRT_OPS) += dma-virt.o
 
 lib-y	+= kobject.o klist.o
diff --git a/lib/dma-direct.c b/lib/dma-direct.c
index ea69f8777e7f..d945efea3dae 100644
--- a/lib/dma-direct.c
+++ b/lib/dma-direct.c
@@ -107,6 +107,7 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
 		return DIRECT_MAPPING_ERROR;
 	return dma_addr;
 }
+EXPORT_SYMBOL(dma_direct_map_page);
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
@@ -125,6 +126,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
 	return nents;
 }
+EXPORT_SYMBOL(dma_direct_map_sg);
 
 int dma_direct_supported(struct device *dev, u64 mask)
 {
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index c43ec2271469..ecb70f5e95ba 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -1132,4 +1132,5 @@ const struct dma_map_ops swiotlb_dma_ops = {
 	.unmap_page = swiotlb_unmap_page,
 	.dma_supported = swiotlb_dma_supported,
 };
+EXPORT_SYMBOL(swiotlb_dma_ops);
 #endif /* CONFIG_DMA_DIRECT_OPS */

^ permalink raw reply related	[flat|nested] 19+ messages in thread
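[Editorial note: both the original hack and this fixup lean on the test `mask == DMA_BIT_MASK(64)` to mean "the device can reach all physical memory, so swiotlb never needs to bounce and the direct path is safe." The macro below matches the kernel's definition of DMA_BIT_MASK(); `can_use_direct()` is only an illustrative reduction of that condition, not kernel code.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirrors the kernel's DMA_BIT_MASK(): the lowest n bits set.
 * n == 64 is special-cased because a shift by 64 on a 64-bit type
 * is undefined behavior in C.
 */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Illustrative reduction of the patch's heuristic: a device that
 * advertises a full 64-bit DMA mask can address any physical page,
 * so bounce buffering can never be required for it.
 */
static int can_use_direct(uint64_t mask)
{
	return mask == DMA_BIT_MASK(64);
}
```

A 32-bit-only device (mask 0xffffffff) fails the check, because buffers above 4 GiB would still need swiotlb bouncing, and the indirect ops path must stay in place for it.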
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-16 21:07 ` XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
@ 2018-04-17  6:15 ` Christoph Hellwig
  2018-04-17  7:07 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-17  6:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Christoph Hellwig, xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
    Christoph Hellwig, David Woodhouse, William Tu, Björn Töpel,
    Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo

On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> On X86 swiotlb fallback (via get_dma_ops -> get_arch_dma_ops) to use
> x86_swiotlb_dma_ops, instead of swiotlb_dma_ops. I also included that
> in below fix patch.

x86_swiotlb_dma_ops should not exist any more, and x86 now uses
dma_direct_ops. Looks like you are applying it to an old kernel :)

> Performance improved to 8.9 Mpps from approx 6.5 Mpps.
>
> (This was without my bulking for net_device->ndo_xdp_xmit, so that
> number should improve more).

What is the number for the otherwise comparable setup without
retpolines?

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-17  6:15 ` Christoph Hellwig
@ 2018-04-17  7:07 ` Jesper Dangaard Brouer
  2018-04-17  7:13 ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-17  7:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org, Christoph Hellwig,
    David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus,
    Alexander Duyck, Arnaldo Carvalho de Melo, brouer

On Mon, 16 Apr 2018 23:15:50 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> > On X86 swiotlb fallback (via get_dma_ops -> get_arch_dma_ops) to use
> > x86_swiotlb_dma_ops, instead of swiotlb_dma_ops. I also included that
> > in below fix patch.
>
> x86_swiotlb_dma_ops should not exist any more, and x86 now uses
> dma_direct_ops. Looks like you are applying it to an old kernel :)
>
> > Performance improved to 8.9 Mpps from approx 6.5 Mpps.
> >
> > (This was without my bulking for net_device->ndo_xdp_xmit, so that
> > number should improve more).
>
> What is the number for the otherwise comparable setup without
> retpolines?

Approx 12 Mpps.

You forgot to handle the dma_direct_mapping_error() case, which still
used the retpoline in the above 8.9 Mpps measurement. I fixed it up
and performance increased to 9.6 Mpps.

Notice, in this test there are still two retpoline/indirect-calls
left: the net_device->ndo_xdp_xmit call and the invocation of the XDP
BPF prog.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
  2018-04-17  7:07     ` Jesper Dangaard Brouer
@ 2018-04-17  7:13       ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-17 7:13 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Christoph Hellwig, xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
    David Woodhouse, William Tu, Björn Töpel, Karlsson, Magnus,
    Alexander Duyck, Arnaldo Carvalho de Melo

On Tue, Apr 17, 2018 at 09:07:01AM +0200, Jesper Dangaard Brouer wrote:
> > > number should improve more.)
> >
> > What is the number for the otherwise comparable setup without retpolines?
>
> Approx 12 Mpps.
>
> You forgot to handle the dma_direct_mapping_error() case, which still
> went through a retpoline in the 8.9 Mpps measurement above. After I
> fixed that up, performance increased to 9.6 Mpps.
>
> Notice that in this test there are still two retpoline/indirect calls
> left: net_device->ndo_xdp_xmit and the invocation of the XDP BPF prog.

But that seems like a pretty clear indicator that we want the fast path
direct mapping. I'll try to find some time over the next weeks to do a
cleaner version of it.

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads: [~2018-04-17  7:13 UTC | newest]

Thread overview: 19+ messages -- links below jump to the message on this page:
2018-04-12 13:50 XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
2018-04-12 14:51 ` Christoph Hellwig
2018-04-12 14:56 ` Christoph Hellwig
2018-04-12 15:31 ` Jesper Dangaard Brouer
2018-04-13 16:49 ` Christoph Hellwig
2018-04-13 17:12 ` Tushar Dave
2018-04-13 17:26 ` Christoph Hellwig
2018-04-14 19:29 ` David Woodhouse
2018-04-16  6:02 ` Jesper Dangaard Brouer
2018-04-16 12:27 ` Christoph Hellwig
2018-04-16 16:04 ` Alexander Duyck
2018-04-17  6:19 ` Christoph Hellwig
2018-04-16 18:05 ` dma-mapping: bypass dma_ops for direct mappings kbuild test robot
2018-04-16 18:26 ` Jesper Dangaard Brouer
2018-04-16 18:31 ` kbuild test robot
2018-04-16 21:07 ` XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
2018-04-17  6:15 ` Christoph Hellwig
2018-04-17  7:07 ` Jesper Dangaard Brouer
2018-04-17  7:13 ` Christoph Hellwig