Re: s390-iommu.c default domain conversion

public inbox for linux-s390@vger.kernel.org

From: Jason Gunthorpe <jgg@nvidia.com>
To: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>,
	linux-s390@vger.kernel.org, alex.williamson@redhat.com,
	cohuck@redhat.com, farman@linux.ibm.com, pmorel@linux.ibm.com,
	borntraeger@linux.ibm.com, hca@linux.ibm.com, gor@linux.ibm.com,
	gerald.schaefer@linux.ibm.com, agordeev@linux.ibm.com,
	svens@linux.ibm.com, frankja@linux.ibm.com, david@redhat.com,
	imbrenda@linux.ibm.com, vneethv@linux.ibm.com,
	oberpar@linux.ibm.com, freude@linux.ibm.com, thuth@redhat.com,
	pasic@linux.ibm.com, Robin Murphy <robin.murphy@arm.com>
Subject: Re: s390-iommu.c default domain conversion
Date: Fri, 20 May 2022 12:56:49 -0300
Message-ID: <20220520155649.GJ1343366@nvidia.com>
In-Reply-To: <6271dd24bfcf82b0c1b911a163ae9549c24691a4.camel@linux.ibm.com>

On Fri, May 20, 2022 at 05:17:05PM +0200, Niklas Schnelle wrote:

> > > With that the performance on the LPAR machine hypervisor (no paging) is
> > > on par with our existing code. On paging hypervisors (z/VM and KVM)
> > > i.e. with the hypervisor shadowing the I/O translation tables, it's
> > > still slower than our existing code and interestingly strict mode seems
> > > to be better than lazy here. One thing I haven't done yet is implementing
> > > the map_pages() operation or adding larger page sizes.
> > 
> > map_pages() speeds things up if there is contiguous memory. I'm not
> > sure what workload you are testing with, so it is hard to guess whether
> > that is interesting or not.
> 
> Our most important driver is mlx5 with both IP and RDMA traffic on
> ConnectX-4/5/6 but we also support NVMes.

So you probably won't see big gains here from larger page sizes unless
you also have a specific userspace that is triggering huge pages.

qemu userspaces do this, so it is worth doing anyhow.

> > > Maybe you have some tips what you'd expect to be most beneficial?
> > > Either way we're optimistic this can be solved and this conversion
> > > will be a high ranking item on my backlog going forward.
> > 
> > I'm not really sure I understand the differences, do you have a sense
> > what is making it slower? Maybe there is some small feature that can
> > be added to the core code? It is very strange that strict is faster,
> > that should not be, strict requires synchronous flush in the unmap
> > case, lazy does not. Are you sure you are getting the lazy flushes
> > enabled?
> 
> The lazy flushes are the timer triggered flush_iotlb_all() in
> fq_flush_iotlb(), right? I definitely see that when tracing my
> flush_iotlb_all() implementation via that path. That flush_iotlb_all()
> in my prototype is basically the same as the global RPCIT we did once
> we wrapped around our IOVA address space. I suspect that this just
> happens much more often with the timer than our wrap around and
> flushing the entire aperture is somewhat slow because it causes the
> hypervisor to re-examine the entire I/O translation table. On the other
> hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
> always flushes a relatively small contiguous range as I'm using the
> following construct to extend gather:
> 
> 	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
> 		iommu_iotlb_sync(domain, gather);
> 
> 	iommu_iotlb_gather_add_range(gather, iova, size);
> 
> Maybe the smaller contiguous ranges just help with locality/caching
> because the flushed range in the guest's I/O tables was just updated.
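
To make the effect of that construct concrete, here is a rough user-space
model of the gather logic. It is only a sketch: the gather_model struct
and the gather_*()/model_unmap() names are made up for illustration, the
range checks follow the kernel helpers as I understand them, and the
"sync" is just a counter standing in for a ranged flush.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct iommu_iotlb_gather. */
struct gather_model {
	unsigned long start;	/* first byte of the pending range */
	unsigned long end;	/* last byte of the pending range, 0 = empty */
	unsigned int nr_syncs;	/* how many ranged flushes were issued */
};

static void gather_reset(struct gather_model *g)
{
	g->start = ULONG_MAX;
	g->end = 0;
}

static void gather_init(struct gather_model *g)
{
	gather_reset(g);
	g->nr_syncs = 0;
}

/* True if [iova, iova + size) neither overlaps nor touches the pending range. */
static bool gather_is_disjoint(const struct gather_model *g,
			       unsigned long iova, size_t size)
{
	unsigned long start = iova, end = iova + size - 1;

	return g->end != 0 &&
	       (end + 1 < g->start || start > g->end + 1);
}

/* Grow the pending range so that it also covers [iova, iova + size). */
static void gather_add_range(struct gather_model *g,
			     unsigned long iova, size_t size)
{
	unsigned long end = iova + size - 1;

	if (g->start > iova)
		g->start = iova;
	if (g->end < end)
		g->end = end;
}

/* Stand-in for the ranged flush: count it and empty the gather. */
static void gather_sync(struct gather_model *g)
{
	if (g->end == 0)
		return;
	g->nr_syncs++;
	gather_reset(g);
}

/* The construct from the mail: flush first if the new range is disjoint. */
static void model_unmap(struct gather_model *g,
			unsigned long iova, size_t size)
{
	assert(size > 0);
	if (gather_is_disjoint(g, iova, size))
		gather_sync(g);
	gather_add_range(g, iova, size);
}
```

Adjacent unmaps merge into one pending range, and only a jump to an
unrelated IOVA forces a flush, so every flush covers one small
contiguous range.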

So, from what I can tell, the S390 HW is not really the same as a
normal iommu in that you can do map over IOVA that hasn't been flushed
yet and the map will restore coherency to the new page table
entries. I see the zpci_refresh_trans() call in map which is why I
assume this?

(note that normal HW has a HW IOTLB cache that MUST be flushed or new
maps will not be loaded by the HW, so mapping to areas that previously
had uninvalidated IOVA is a functional problem, which motivates the
design of this scheme)

However, since S390 can restore coherency during map the lazy
invalidation is not for correctness but only for security - to
eventually unmap things that the DMA device should not be
touching?

So, what you want is a slightly different FQ operation in the core
code:

 - Continue to skip iotlb_sync during unmap
 - Do not do fq_ring_add() tracking, immediately recycle the IOVA
 - Call flush all on a timer with a suitable duration - much like today
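
The three points above could be modelled roughly like this. This is a
user-space sketch, not core code: the lazy_model struct, its fields and
the lazy_*() names are invented, time is a simulated tick counter, and
arming the timer on the first unmap is my assumption about what "much
like today" would mean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-domain state for the modified flush queue policy. */
struct lazy_model {
	unsigned long now;		/* simulated time in ticks */
	unsigned long period;		/* flush-all timer duration */
	unsigned long deadline;		/* when the armed timer fires */
	bool timer_armed;
	unsigned int nr_flush_all;	/* global flushes issued */
	unsigned int nr_recycled;	/* IOVAs handed back right away */
};

static void lazy_init(struct lazy_model *m, unsigned long period)
{
	assert(period > 0);
	m->now = 0;
	m->period = period;
	m->deadline = 0;
	m->timer_armed = false;
	m->nr_flush_all = 0;
	m->nr_recycled = 0;
}

/*
 * Unmap: no iotlb_sync and no fq_ring_add()-style tracking. The IOVA is
 * recycled immediately; we only make sure a flush-all is scheduled.
 */
static void lazy_unmap(struct lazy_model *m)
{
	m->nr_recycled++;
	if (!m->timer_armed) {
		m->timer_armed = true;
		m->deadline = m->now + m->period;
	}
}

/* Advance simulated time; fire the flush-all timer if it has expired. */
static void lazy_advance(struct lazy_model *m, unsigned long ticks)
{
	m->now += ticks;
	if (m->timer_armed && m->now >= m->deadline) {
		m->nr_flush_all++;
		m->timer_armed = false;
	}
}
```

Any number of unmaps within one period costs a single global flush, and
the period is the knob that trades flush cost against how long a stale
translation can stay usable by the device.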

You can teach the core code to do this with some new flag, or 
rely on a custom driver implementation:

 - Make the alloc for IOMMU_DOMAIN_DMA produce an iommu_domain with
   special ops: iotlb_sync and flush_iotlb_all are both NOPs.

 - Have a timer per iommu_domain and internally flush all on that
   timer, quite similar to how the current code works

 - Flush all when detaching a device

 - IOMMU_DOMAIN_UNMANAGED will work the same as today and keeps the
   regular iotlb_sync/flush_iotlb_all ops.
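
As a sketch of the driver-side variant, assuming nothing about the real
s390 code: every type and function below carries a _model suffix because
it is invented for illustration, and a counter stands in for the global
flush. The point is only the shape: two ops tables selected at alloc
time, with the DMA one doing nothing from the core code's point of view.

```c
#include <assert.h>

enum domain_type_model { DOMAIN_DMA_MODEL, DOMAIN_UNMANAGED_MODEL };

struct domain_model;

/* Hypothetical subset of the per-domain ops. */
struct domain_ops_model {
	void (*iotlb_sync)(struct domain_model *d);
	void (*flush_iotlb_all)(struct domain_model *d);
};

struct domain_model {
	const struct domain_ops_model *ops;
	unsigned int nr_hw_flushes;	/* stand-in for real flushes */
};

static void hw_flush_model(struct domain_model *d)
{
	d->nr_hw_flushes++;
}

static void nop_model(struct domain_model *d)
{
	(void)d;	/* DMA domains ignore flush requests from the core */
}

/* IOMMU_DOMAIN_DMA: both ops are NOPs, the driver flushes by itself. */
static const struct domain_ops_model dma_ops_model = {
	.iotlb_sync = nop_model,
	.flush_iotlb_all = nop_model,
};

/* IOMMU_DOMAIN_UNMANAGED: same as today, the ops really flush. */
static const struct domain_ops_model unmanaged_ops_model = {
	.iotlb_sync = hw_flush_model,
	.flush_iotlb_all = hw_flush_model,
};

static void domain_alloc_model(struct domain_model *d,
			       enum domain_type_model type)
{
	d->ops = (type == DOMAIN_DMA_MODEL) ? &dma_ops_model
					    : &unmanaged_ops_model;
	d->nr_hw_flushes = 0;
}

/* Driver-internal per-domain timer callback: flush everything. */
static void domain_timer_fired_model(struct domain_model *d)
{
	hw_flush_model(d);
}

/* Detaching a device must not leave stale translations behind. */
static void domain_detach_model(struct domain_model *d)
{
	hw_flush_model(d);
}
```

With this shape the core code needs no new flag. The cost is that the
flush timing lives in the driver rather than in shared code.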

Jason
