* s390-iommu.c default domain conversion
From: Jason Gunthorpe
Date: 2022-05-09 23:35 UTC
To: Matthew Rosato
Cc: linux-s390, alex.williamson, cohuck, schnelle, farman, pmorel,
    borntraeger, hca, gor, gerald.schaefer, agordeev, svens, frankja,
    david, imbrenda, vneethv, oberpar, freude, thuth, pasic, Robin Murphy

Hi s390 folks/Matthew

Since everyone is looking at iommu support for the nested domains,
could we also tackle the default domain conversion please? s390 is one
of the last 4 drivers that need it.

From what I can see, it looks like when detach_dev() is called it
expects the platform's dma_ops in arch/s390/pci/pci_dma.c to work?

Has anyone thought about converting the dma_ops to use the normal DMA
API iommu support and run it through the iommu driver instead of
through the dma_ops?

Alternatively, perhaps we can keep the dma_ops with some change on the
iommu side.

Thanks,
Jason
* Re: s390-iommu.c default domain conversion
From: Matthew Rosato
Date: 2022-05-10 15:25 UTC
To: Jason Gunthorpe, schnelle
Cc: linux-s390, alex.williamson, cohuck, farman, pmorel, borntraeger,
    hca, gor, gerald.schaefer, agordeev, svens, frankja, david,
    imbrenda, vneethv, oberpar, freude, thuth, pasic, Robin Murphy

On 5/9/22 7:35 PM, Jason Gunthorpe wrote:
> Since everyone is looking at iommu support for the nested domains,
> could we also tackle the default domain conversion please? s390 is
> one of the last 4 drivers that need it.
>
> From what I can see, it looks like when detach_dev() is called it
> expects the platform's dma_ops in arch/s390/pci/pci_dma.c to work?

Yes

> Has anyone thought about converting the dma_ops to use the normal
> DMA API iommu support and run it through the iommu driver instead of
> through the dma_ops?
>
> Alternatively, perhaps we can keep the dma_ops with some change on
> the iommu side.

It has come up before. So ultimately the goal is to be driving the DMA
through the default iommu domain (via dma-iommu) rather than directly
in the dma_ops? One of our main concerns is performance loss from the
s390-specific optimizations in the dma_ops, like RPCIT avoidance /
lazy map + global flush (maybe I am wrong, or something can be
generalized through the API - or something is already there that will
work).

Of course there are some obvious benefits too; we have some
duplication of work between s390-iommu.c and arch/s390/pci/pci_dma.c.

I think the reality is that Niklas and I need to have a close look and
do some testing on our end to see what it will take and whether we can
get acceptable performance from a conversion, then get back to you.

Thanks,
Matt
* Re: s390-iommu.c default domain conversion
From: Jason Gunthorpe
Date: 2022-05-10 16:09 UTC
To: Matthew Rosato
Cc: schnelle, linux-s390, alex.williamson, cohuck, farman, pmorel,
    borntraeger, hca, gor, gerald.schaefer, agordeev, svens, frankja,
    david, imbrenda, vneethv, oberpar, freude, thuth, pasic, Robin Murphy

On Tue, May 10, 2022 at 11:25:54AM -0400, Matthew Rosato wrote:
> It has come up before. So ultimately the goal is to be driving the
> DMA through the default iommu domain (via dma-iommu) rather than
> directly in the dma_ops? One of our main concerns is performance
> loss from the s390-specific optimizations in the dma_ops, like RPCIT
> avoidance / lazy map + global flush

The core version is somewhat different: it triggers flush_iotlb_all()
from a timer, not just on address space wrap around, but the fast path
on unmap can still skip the zpci_refresh_trans().

On the other hand it doesn't have the limit on IOVA space, and the
IOVA allocator is somewhat more sophisticated, which will optimize the
large page cases that s390 currently doesn't. Basically it will work
better with things like mlx5 cards in the normal case.

The lazy flush is done via IOMMU_DOMAIN_DMA_FQ and the iommu
gather->queued mechanism, to allow skipping the RPCIT during the
normal iotlb_sync.

> I think the reality is that Niklas and I need to have a close look
> and do some testing on our end to see what it will take and whether
> we can get acceptable performance from a conversion, then get back
> to you.

It would be a good long term goal; getting rid of these duplicated
dma_ops is another open task. There is a patch series out there to
convert arm, so this whole area will become even more niche.

But another path is to somehow keep them and just allow default
domains to work - ARM did this.

Jason
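[Illustration: a minimal sketch of the two driver hooks this scheme
leans on. The names s390_domain, to_s390_domain() and
rpcit_refresh_range() are assumptions for illustration, not the actual
driver code.]

#include <linux/iommu.h>

/* Sketch, not the real driver.  In strict mode the core calls
 * ->iotlb_sync() synchronously after unmap; with IOMMU_DOMAIN_DMA_FQ
 * it instead queues the freed IOVAs and a timer eventually calls
 * ->flush_iotlb_all(), so the per-unmap RPCIT is skipped. */
static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
				  struct iommu_iotlb_gather *gather)
{
	struct s390_domain *s390_domain = to_s390_domain(domain);

	if (gather->start > gather->end)	/* nothing gathered */
		return;

	/* assumed wrapper around zpci_refresh_trans() for one range */
	rpcit_refresh_range(s390_domain, gather->start,
			    gather->end - gather->start + 1);
}

static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
{
	struct s390_domain *s390_domain = to_s390_domain(domain);

	/* one global RPCIT over the whole aperture; this is what the
	 * flush queue timer ends up invoking in lazy mode */
	rpcit_refresh_range(s390_domain, domain->geometry.aperture_start,
			    domain->geometry.aperture_end -
			    domain->geometry.aperture_start + 1);
}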
* Re: s390-iommu.c default domain conversion
From: Niklas Schnelle
Date: 2022-05-20 13:05 UTC
To: Jason Gunthorpe, Matthew Rosato
Cc: linux-s390, alex.williamson, cohuck, farman, pmorel, borntraeger,
    hca, gor, gerald.schaefer, agordeev, svens, frankja, david,
    imbrenda, vneethv, oberpar, freude, thuth, pasic, Robin Murphy

On Tue, 2022-05-10 at 13:09 -0300, Jason Gunthorpe wrote:
> The core version is somewhat different: it triggers flush_iotlb_all()
> from a timer, not just on address space wrap around, but the fast
> path on unmap can still skip the zpci_refresh_trans().
>
> On the other hand it doesn't have the limit on IOVA space, and the
> IOVA allocator is somewhat more sophisticated, which will optimize
> the large page cases that s390 currently doesn't. Basically it will
> work better with things like mlx5 cards in the normal case.
>
> The lazy flush is done via IOMMU_DOMAIN_DMA_FQ and the iommu
> gather->queued mechanism, to allow skipping the RPCIT during the
> normal iotlb_sync.
>
> It would be a good long term goal; getting rid of these duplicated
> dma_ops is another open task. There is a patch series out there to
> convert arm, so this whole area will become even more niche.
>
> But another path is to somehow keep them and just allow default
> domains to work - ARM did this.

I did some testing and created a prototype that gets rid of
arch/s390/pci/pci_dma.c and works solely via dma-iommu on top of our
IOMMU driver. It looks like the existing dma-iommu code allows us to
do this with relatively simple changes to the IOMMU driver only,
mostly just implementing iotlb_sync(), iotlb_sync_map() and
flush_iotlb_all(), so that's great. They also seem to map quite well
to our RPCIT I/O TLB flush. For now the prototype still uses 4k pages
only.

With that, performance on the LPAR machine hypervisor (no paging) is
on par with our existing code. On paging hypervisors (z/VM and KVM),
i.e. with the hypervisor shadowing the I/O translation tables, it's
still slower than our existing code, and interestingly strict mode
seems to be better than lazy here. One thing I haven't done yet is
implement the map_pages() operation or add larger page sizes. Maybe
you have some tips on what you'd expect to be most beneficial? Either
way, we're optimistic this can be solved, and this conversion will be
a high-ranking item on my backlog going forward.

I also stumbled over the following patch series, which I think would
also help our paging hypervisor cases a lot, since it should alleviate
the cost of shadowing short-lived mappings:
https://lore.kernel.org/linux-iommu/20210806103423.3341285-1-stevensd@google.com/
Sadly it seems it hasn't gained much traction so far.

Thanks,
Niklas
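[Illustration: the shape such a prototype would take - a sketch of how
those three callbacks slot into the split domain ops of kernels around
this time. All s390_* callback names here are assumptions; the real
prototype may differ.]

static const struct iommu_domain_ops s390_domain_ops = {
	.attach_dev	 = s390_iommu_attach_device,
	.detach_dev	 = s390_iommu_detach_device,
	.map		 = s390_iommu_map,	/* still 4k-only here */
	.unmap		 = s390_iommu_unmap,
	/* flush after map: only needed when a paging hypervisor has to
	 * shadow the updated I/O translation tables */
	.iotlb_sync_map	 = s390_iommu_iotlb_sync_map,
	/* ranged flush driven by the gather on unmap (strict mode) */
	.iotlb_sync	 = s390_iommu_iotlb_sync,
	/* whole-aperture flush, used by the dma-iommu flush queue */
	.flush_iotlb_all = s390_iommu_flush_iotlb_all,
	.iova_to_phys	 = s390_iommu_iova_to_phys,
	.free		 = s390_domain_free,
};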
* Re: s390-iommu.c default domain conversion
From: Jason Gunthorpe
Date: 2022-05-20 13:44 UTC
To: Niklas Schnelle
Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
    pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
    frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
    Robin Murphy

On Fri, May 20, 2022 at 03:05:46PM +0200, Niklas Schnelle wrote:
> I did some testing and created a prototype that gets rid of
> arch/s390/pci/pci_dma.c and works solely via dma-iommu on top of our
> IOMMU driver. It looks like the existing dma-iommu code allows us to
> do this with relatively simple changes to the IOMMU driver only,
> mostly just implementing iotlb_sync(), iotlb_sync_map() and
> flush_iotlb_all(), so that's great. They also seem to map quite well
> to our RPCIT I/O TLB flush. For now the prototype still uses 4k
> pages only.

You are going to want to improve the page sizes in the iommu driver
anyhow for VFIO.

> With that, performance on the LPAR machine hypervisor (no paging) is
> on par with our existing code. On paging hypervisors (z/VM and KVM),
> i.e. with the hypervisor shadowing the I/O translation tables, it's
> still slower than our existing code, and interestingly strict mode
> seems to be better than lazy here. One thing I haven't done yet is
> implement the map_pages() operation or add larger page sizes.

map_pages() speeds things up if there is contiguous memory; I'm not
sure what workload you are testing with, so it's hard to guess whether
that is interesting or not.

> Maybe you have some tips on what you'd expect to be most beneficial?
> Either way, we're optimistic this can be solved, and this conversion
> will be a high-ranking item on my backlog going forward.

I'm not really sure I understand the differences; do you have a sense
of what is making it slower? Maybe there is some small feature that
can be added to the core code? It is very strange that strict is
faster; that should not be: strict requires a synchronous flush in the
unmap case, lazy does not. Are you sure you are getting the lazy
flushes enabled?

> I also stumbled over the following patch series, which I think would
> also help our paging hypervisor cases a lot, since it should
> alleviate the cost of shadowing short-lived mappings:

This is quite different from what your current code does though?

Still, it seems encouraging

Jason
* Re: s390-iommu.c default domain conversion
From: Niklas Schnelle
Date: 2022-05-20 15:17 UTC
To: Jason Gunthorpe
Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
    pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
    frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
    Robin Murphy

On Fri, 2022-05-20 at 10:44 -0300, Jason Gunthorpe wrote:
> You are going to want to improve the page sizes in the iommu driver
> anyhow for VFIO.

Ok, we'll look into this.

> map_pages() speeds things up if there is contiguous memory; I'm not
> sure what workload you are testing with, so it's hard to guess
> whether that is interesting or not.

Our most important driver is mlx5, with both IP and RDMA traffic on
ConnectX-4/5/6, but we also support NVMes.

> I'm not really sure I understand the differences; do you have a
> sense of what is making it slower? Maybe there is some small feature
> that can be added to the core code? It is very strange that strict
> is faster; that should not be: strict requires a synchronous flush
> in the unmap case, lazy does not. Are you sure you are getting the
> lazy flushes enabled?

The lazy flushes are the timer-triggered flush_iotlb_all() in
fq_flush_iotlb(), right? I definitely see that when tracing my
flush_iotlb_all() implementation via that path. That flush_iotlb_all()
in my prototype is basically the same as the global RPCIT we did once
we wrapped around our IOVA address space. I suspect that this just
happens much more often with the timer than with our wrap around, and
that flushing the entire aperture is somewhat slow because it causes
the hypervisor to re-examine the entire I/O translation table. On the
other hand, in strict mode the iommu_iotlb_sync() call in
__iommu_unmap() always flushes a relatively small contiguous range, as
I'm using the following construct to extend the gather:

	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
		iommu_iotlb_sync(domain, gather);

	iommu_iotlb_gather_add_range(gather, iova, size);

Maybe the smaller contiguous ranges just help with locality/caching
because the flushed range in the guest's I/O tables was just updated.

> > I also stumbled over the following patch series, which I think
> > would also help our paging hypervisor cases a lot, since it should
> > alleviate the cost of shadowing short-lived mappings:
>
> This is quite different from what your current code does though?

Yes

> Still, it seems encouraging
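[Illustration: for reference, the helper used above - as defined in
include/linux/iommu.h around this time (reproduced approximately;
check the tree). It only lets the gather keep growing while ranges
stay adjacent or overlapping, which is what produces the small
contiguous flushes described here.]

static inline bool
iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
			       unsigned long iova, size_t size)
{
	unsigned long start = iova, end = start + size - 1;

	/* disjoint iff a non-empty gather exists and the new range
	 * neither touches nor overlaps it */
	return gather->end != 0 &&
		(end + 1 < gather->start || start > gather->end + 1);
}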
* Re: s390-iommu.c default domain conversion
From: Robin Murphy
Date: 2022-05-20 15:51 UTC
To: Niklas Schnelle, Jason Gunthorpe
Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
    pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
    frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic

On 2022-05-20 16:17, Niklas Schnelle wrote:
> Our most important driver is mlx5, with both IP and RDMA traffic on
> ConnectX-4/5/6, but we also support NVMes.

Since you already have the loop in s390_iommu_update_trans(), updating
to map/unmap_pages should be trivial, and well worth it.

> Maybe the smaller contiguous ranges just help with locality/caching
> because the flushed range in the guest's I/O tables was just
> updated.

That's entirely believable - both the AMD and Intel drivers force
strict mode when virtualised for similar reasons, so feel free to do
the same.

Robin.
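[Illustration: a sketch of what that conversion might look like,
reusing the existing s390_iommu_update_trans() loop for the whole
contiguous run. The flags_for() helper and the exact signature of
s390_iommu_update_trans() are assumptions.]

static int s390_iommu_map_pages(struct iommu_domain *domain,
				unsigned long iova, phys_addr_t paddr,
				size_t pgsize, size_t pgcount, int prot,
				gfp_t gfp, size_t *mapped)
{
	size_t size = pgsize * pgcount;
	int rc;

	if (pgsize != SZ_4K)		/* sketch: 4k-only for now */
		return -EINVAL;

	/* one table walk + validate over pgcount pages instead of
	 * pgcount separate ->map() calls */
	rc = s390_iommu_update_trans(to_s390_domain(domain), paddr, iova,
				     size, flags_for(prot));
	if (rc)
		return rc;

	*mapped = size;
	return 0;
}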
* Re: s390-iommu.c default domain conversion
From: Jason Gunthorpe
Date: 2022-05-20 15:56 UTC
To: Niklas Schnelle
Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
    pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
    frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
    Robin Murphy

On Fri, May 20, 2022 at 05:17:05PM +0200, Niklas Schnelle wrote:
> Our most important driver is mlx5, with both IP and RDMA traffic on
> ConnectX-4/5/6, but we also support NVMes.

So you probably won't see big gains here from larger page sizes unless
you also have a specific userspace that is triggering huge pages.

qemu userspaces do this, so it is worth doing anyhow though.

> The lazy flushes are the timer-triggered flush_iotlb_all() in
> fq_flush_iotlb(), right? I definitely see that when tracing my
> flush_iotlb_all() implementation via that path. That
> flush_iotlb_all() in my prototype is basically the same as the
> global RPCIT we did once we wrapped around our IOVA address space. I
> suspect that this just happens much more often with the timer than
> with our wrap around, and that flushing the entire aperture is
> somewhat slow because it causes the hypervisor to re-examine the
> entire I/O translation table. On the other hand, in strict mode the
> iommu_iotlb_sync() call in __iommu_unmap() always flushes a
> relatively small contiguous range, as I'm using the following
> construct to extend the gather:
>
> 	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
> 		iommu_iotlb_sync(domain, gather);
>
> 	iommu_iotlb_gather_add_range(gather, iova, size);
>
> Maybe the smaller contiguous ranges just help with locality/caching
> because the flushed range in the guest's I/O tables was just
> updated.

So, from what I can tell, the S390 HW is not really the same as a
normal iommu, in that you can map over IOVA that hasn't been flushed
yet and the map will restore coherency to the new page table entries.
I see the zpci_refresh_trans() call in map, which is why I assume
this?

(Note that normal HW has a HW IOTLB cache that MUST be flushed or new
maps will not be loaded by the HW, so mapping to areas that previously
had uninvalidated IOVA is a functional problem, which motivates the
design of this scheme.)

However, since S390 can restore coherency during map, the lazy
invalidation is not for correctness but only for security - to
eventually unmap things that the DMA device should not be touching?

So, what you want is a slightly different FQ operation in the core
code:
 - Continue to skip iotlb_sync during unmap
 - Do not do fq_ring_add() tracking; immediately recycle the IOVA
 - Call flush-all on a timer with a suitable duration - much like
   today

You can teach the core code to do this with some new flag, or rely on
a custom driver implementation:
 - Make the alloc for IOMMU_DOMAIN_DMA produce an iommu_domain with
   special ops: iotlb_sync and flush_iotlb_all both NOPs
 - Have a timer per iommu_domain and internally flush-all on that
   timer, quite similar to how the current code works
 - Flush all when detaching a device
 - IOMMU_DOMAIN_UNMANAGED will work the same as today and includes the
   real ops

Jason
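[Illustration: a rough sketch of the second, driver-side variant. The
flush_timer field, the s390_dma_nop_sync_ops table and the 10ms period
are all assumptions, not a proposed final design.]

static void s390_dma_flush_timer_fn(struct timer_list *t)
{
	struct s390_domain *s390_domain =
		from_timer(s390_domain, t, flush_timer);

	s390_iommu_flush_iotlb_all(&s390_domain->domain);
	mod_timer(&s390_domain->flush_timer,
		  jiffies + msecs_to_jiffies(10));
}

static struct iommu_domain *s390_domain_alloc(unsigned int type)
{
	struct s390_domain *s390_domain;

	s390_domain = kzalloc(sizeof(*s390_domain), GFP_KERNEL);
	if (!s390_domain)
		return NULL;

	if (type == IOMMU_DOMAIN_DMA) {
		/* iotlb_sync and flush_iotlb_all are NOPs in these
		 * ops, so unmap is cheap and IOVAs recycle
		 * immediately; the timer provides the eventual global
		 * RPCIT */
		s390_domain->domain.ops = &s390_dma_nop_sync_ops;
		timer_setup(&s390_domain->flush_timer,
			    s390_dma_flush_timer_fn, 0);
		mod_timer(&s390_domain->flush_timer,
			  jiffies + msecs_to_jiffies(10));
	} else {
		/* IOMMU_DOMAIN_UNMANAGED keeps the real sync ops */
		s390_domain->domain.ops = &s390_domain_ops;
	}
	return &s390_domain->domain;
}

A flush-all on detach and a del_timer_sync() in the domain free path
would complete the picture.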
* Re: s390-iommu.c default domain conversion
From: Niklas Schnelle
Date: 2022-05-20 16:26 UTC
To: Jason Gunthorpe
Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
    pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
    frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
    Robin Murphy

On Fri, 2022-05-20 at 12:56 -0300, Jason Gunthorpe wrote:
> So, from what I can tell, the S390 HW is not really the same as a
> normal iommu, in that you can map over IOVA that hasn't been flushed
> yet and the map will restore coherency to the new page table
> entries. I see the zpci_refresh_trans() call in map, which is why I
> assume this?

The zpci_refresh_trans() in map is only there because previously we
didn't implement iotlb_sync_map(). Also, we only need to flush on map
for the paged guest case, so the hypervisor can update its shadow
table. It happens unconditionally in the existing s390_iommu.c because
that was not well optimized and uses the same
s390_iommu_update_trans() for map and unmap. We had the skipping of
the TLB flush handled properly in the arch/s390/pci/pci_dma.c mapping
code, where !zdev->tlb_refresh indicates that we don't need flushes on
map.

> (Note that normal HW has a HW IOTLB cache that MUST be flushed or
> new maps will not be loaded by the HW, so mapping to areas that
> previously had uninvalidated IOVA is a functional problem, which
> motivates the design of this scheme.)

We do need to flush the TLBs on unmap. The reason is that under LPAR
(non-paging hypervisor) the hardware can establish a new mapping on
its own if an I/O PTE is changed from invalid to a valid translation
and it wasn't previously in the TLB. I think that's how most hardware
IOMMUs work, and how I understand your explanation too.

> However, since S390 can restore coherency during map, the lazy
> invalidation is not for correctness but only for security - to
> eventually unmap things that the DMA device should not be touching?

As explained above, it is for correctness, but the existing code
handles this slightly differently. As we go through the entire IOVA
space, we never reuse a previously unmapped IOVA until we run out of
IOVAs. Then we do a global flush, which on LPAR just drops the
hardware TLBs, making sure that future re-uses of IOVAs will trigger a
hardware walk of the I/O translation tables. Same constraints, just a
different scheme.
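[Illustration: the scheme Niklas describes, condensed into a sketch.
The bitmap fields on struct zpci_dev and the zpci_refresh_global()
behavior are simplified and may not match arch/s390/pci/pci_dma.c
line for line.]

static dma_addr_t dma_alloc_address(struct zpci_dev *zdev, int pages)
{
	unsigned long offset;

	/* always allocate upward from next_bit, never reusing a freed
	 * IOVA below it ... */
	offset = bitmap_find_next_zero_area(zdev->iommu_bitmap,
					    zdev->iommu_pages,
					    zdev->next_bit, pages, 0);
	if (offset >= zdev->iommu_pages) {
		/* ... until the space is exhausted: one global RPCIT
		 * drops all stale TLB entries, after which every IOVA
		 * below next_bit is safe to hand out again */
		zpci_refresh_global(zdev);
		offset = bitmap_find_next_zero_area(zdev->iommu_bitmap,
						    zdev->iommu_pages,
						    0, pages, 0);
		if (offset >= zdev->iommu_pages)
			return DMA_MAPPING_ERROR;
	}
	bitmap_set(zdev->iommu_bitmap, offset, pages);
	zdev->next_bit = offset + pages;

	return zdev->start_dma + (dma_addr_t)offset * PAGE_SIZE;
}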
* Re: s390-iommu.c default domain conversion
From: Jason Gunthorpe
Date: 2022-05-20 16:43 UTC
To: Niklas Schnelle
Cc: Matthew Rosato, linux-s390, alex.williamson, cohuck, farman,
    pmorel, borntraeger, hca, gor, gerald.schaefer, agordeev, svens,
    frankja, david, imbrenda, vneethv, oberpar, freude, thuth, pasic,
    Robin Murphy

On Fri, May 20, 2022 at 06:26:03PM +0200, Niklas Schnelle wrote:
> The zpci_refresh_trans() in map is only there because previously we
> didn't implement iotlb_sync_map(). Also, we only need to flush on
> map for the paged guest case, so the hypervisor can update its
> shadow table. It happens unconditionally in the existing
> s390_iommu.c because that was not well optimized and uses the same
> s390_iommu_update_trans() for map and unmap. We had the skipping of
> the TLB flush handled properly in the arch/s390/pci/pci_dma.c
> mapping code, where !zdev->tlb_refresh indicates that we don't need
> flushes on map.

Even the arch/s390/pci/pci_dma.c code has a zpci_refresh_trans() on
map; it is just conditional on zdev->tlb_refresh. I had also assumed
that the paging case uses this path?

> We do need to flush the TLBs on unmap. The reason is that under LPAR
> (non-paging hypervisor) the hardware can establish a new mapping on
> its own if an I/O PTE is changed from invalid to a valid translation
> and it wasn't previously in the TLB. I think that's how most
> hardware IOMMUs work, and how I understand your explanation too.

Since you said LPAR was OK performance-wise, I was thinking only about
the paging case. You can have different iommu_domain implementations
for these two cases. But if the paging case does need the hypercall
anyhow, it doesn't work.

You'd have to change the core code to increase the timer duration, I
think.

Jason
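[Illustration: the duration in question is a hard-coded constant in
the core flush queue code - approximately the following in
drivers/iommu/dma-iommu.c of this vintage; verify the name and value
against the actual tree before relying on it.]

/* How long stale IOVAs may sit on the flush queue before the timer
 * fires and calls ->flush_iotlb_all(); raising this would amortize
 * the whole-aperture RPCIT over more unmaps. */
#define IOVA_FQ_TIMEOUT	10	/* ms */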
end of thread, other threads: [~2022-05-20 16:43 UTC | newest]

Thread overview: 10+ messages:
2022-05-09 23:35 s390-iommu.c default domain conversion - Jason Gunthorpe
2022-05-10 15:25 ` Matthew Rosato
2022-05-10 16:09   ` Jason Gunthorpe
2022-05-20 13:05     ` Niklas Schnelle
2022-05-20 13:44       ` Jason Gunthorpe
2022-05-20 15:17         ` Niklas Schnelle
2022-05-20 15:51           ` Robin Murphy
2022-05-20 15:56           ` Jason Gunthorpe
2022-05-20 16:26             ` Niklas Schnelle
2022-05-20 16:43               ` Jason Gunthorpe