From: Jason Gunthorpe <jgg@nvidia.com>
To: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>,
linux-s390@vger.kernel.org, alex.williamson@redhat.com,
cohuck@redhat.com, farman@linux.ibm.com, pmorel@linux.ibm.com,
borntraeger@linux.ibm.com, hca@linux.ibm.com, gor@linux.ibm.com,
gerald.schaefer@linux.ibm.com, agordeev@linux.ibm.com,
svens@linux.ibm.com, frankja@linux.ibm.com, david@redhat.com,
imbrenda@linux.ibm.com, vneethv@linux.ibm.com,
oberpar@linux.ibm.com, freude@linux.ibm.com, thuth@redhat.com,
pasic@linux.ibm.com, Robin Murphy <robin.murphy@arm.com>
Subject: Re: s390-iommu.c default domain conversion
Date: Fri, 20 May 2022 12:56:49 -0300 [thread overview]
Message-ID: <20220520155649.GJ1343366@nvidia.com> (raw)
In-Reply-To: <6271dd24bfcf82b0c1b911a163ae9549c24691a4.camel@linux.ibm.com>
On Fri, May 20, 2022 at 05:17:05PM +0200, Niklas Schnelle wrote:
> > > With that the performance on the LPAR machine hypervisor (no paging) is
> > > on par with our existing code. On paging hypervisors (z/VM and KVM)
> > > i.e. with the hypervisor shadowing the I/O translation tables, it's
> > > still slower than our existing code and interestingly strict mode seems
> > > to be better than lazy here. One thing I haven't done yet is implement
> > > the map_pages() operation or adding larger page sizes.
> >
> > map_pages() speeds thiings up if there is contiguous memory, I'm not
> > sure what work load you are testing with so hard to guess if that is
> > interesting or not.
>
> Our most important driver is mlx5 with both IP and RDMA traffic on
> ConnectX-4/5/6 but we also support NVMes.
So you probably won't see big gains here from larger page sizes unless
you also have a specific userspace that is trigger huge pages.
qemu users spaces do this so it is worth doing anyhow though.
> > > Maybe you have some tips what you'd expect to be most beneficial?
> > > Either way we're optimistic this can be solved and this conversion
> > > will be a high ranking item on my backlog going forward.
> >
> > I'm not really sure I understand the differences, do you have a sense
> > what is making it slower? Maybe there is some small feature that can
> > be added to the core code? It is very strange that strict is faster,
> > that should not be, strict requires synchronous flush in the unmap
> > cas, lazy does not. Are you sure you are getting the lazy flushes
> > enabled?
>
> The lazy flushes are the timer triggered flush_iotlb_all() in
> fq_flush_iotlb(), right? I definitely see that when tracing my
> flush_iotlb_all() implementation via that path. That flush_iotlb_all()
> in my prototype is basically the same as the global RPCIT we did once
> we wrapped around our IOVA address space. I suspect that this just
> happens much more often with the timer than our wrap around and
> flushing the entire aperture is somewhat slow because it causes the
> hypervisor to re-examine the entire I/O translation table. On the other
> hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
> always flushes a relatively small contiguous range as I'm using the
> following construct to extend gather:
>
> if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
> iommu_iotlb_sync(domain, gather);
>
> iommu_iotlb_gather_add_range(gather, iova, size);
>
> Maybe the smaller contiguous ranges just help with locality/caching
> because the flushed range in the guests I/O tables was just updated.
So, from what I can tell, the S390 HW is not really the same as a
normal iommu in that you can do map over IOVA that hasn't been flushed
yet and the map will restore coherency to the new page table
entries. I see the zpci_refresh_trans() call in map which is why I
assume this?
(note that normal HW has a HW IOTLB cache that MUST be flushed or new
maps will not be loaded by the HW, so mapping to areas that previously
had uninvalidated IOVA is a functional problem, which motivates the
design of this scheme)
However, since S390 can restore coherency during map the lazy
invalidation is not for correctness but only for security - to
eventually unmap things that the DMA device should not be
touching?
So, what you want is a slightly different FQ operation in the core
code:
- Continue to skip iotlb_sync during unmap
- Do not do fq_ring_add() tracking, immediately recycle the IOVA
- Call flush all on a timer with a suitable duration - much like today
You can teach the core code to do this with some new flag, or
rely on a custom driver implementation:
- Make the alloc for IOMMU_DOMAIN_DMA produce a iommu_domain with
special ops: iotlb_sync/flush all both NOPs.
- Have a timer per iommu_domain and internally flush all on that
timer, quite similar to how the current code works
- Flush all when detaching a device
- IOMMU_DOMAIN_UNMANAGED will work the same as today and includes the
ops.
Jason
next prev parent reply other threads:[~2022-05-20 15:57 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-09 23:35 s390-iommu.c default domain conversion Jason Gunthorpe
2022-05-10 15:25 ` Matthew Rosato
2022-05-10 16:09 ` Jason Gunthorpe
2022-05-20 13:05 ` Niklas Schnelle
2022-05-20 13:44 ` Jason Gunthorpe
2022-05-20 15:17 ` Niklas Schnelle
2022-05-20 15:51 ` Robin Murphy
2022-05-20 15:56 ` Jason Gunthorpe [this message]
2022-05-20 16:26 ` Niklas Schnelle
2022-05-20 16:43 ` Jason Gunthorpe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220520155649.GJ1343366@nvidia.com \
--to=jgg@nvidia.com \
--cc=agordeev@linux.ibm.com \
--cc=alex.williamson@redhat.com \
--cc=borntraeger@linux.ibm.com \
--cc=cohuck@redhat.com \
--cc=david@redhat.com \
--cc=farman@linux.ibm.com \
--cc=frankja@linux.ibm.com \
--cc=freude@linux.ibm.com \
--cc=gerald.schaefer@linux.ibm.com \
--cc=gor@linux.ibm.com \
--cc=hca@linux.ibm.com \
--cc=imbrenda@linux.ibm.com \
--cc=linux-s390@vger.kernel.org \
--cc=mjrosato@linux.ibm.com \
--cc=oberpar@linux.ibm.com \
--cc=pasic@linux.ibm.com \
--cc=pmorel@linux.ibm.com \
--cc=robin.murphy@arm.com \
--cc=schnelle@linux.ibm.com \
--cc=svens@linux.ibm.com \
--cc=thuth@redhat.com \
--cc=vneethv@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.