public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed
From: Niklas Schnelle <schnelle@linux.ibm.com>
To: Robin Murphy <robin.murphy@arm.com>,
	Matthew Rosato <mjrosato@linux.ibm.com>,
	Gerd Bayer <gbayer@linux.ibm.com>,
	iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
	Will Deacon <will@kernel.org>, Jason Gunthorpe <jgg@nvidia.com>,
	Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Pierre Morel <pmorel@linux.ibm.com>,
	linux-s390@vger.kernel.org, borntraeger@linux.ibm.com,
	hca@linux.ibm.com, gor@linux.ibm.com,
	gerald.schaefer@linux.ibm.com, agordeev@linux.ibm.com,
	svens@linux.ibm.com, linux-kernel@vger.kernel.org,
	Julian Ruess <julianr@linux.ibm.com>
Subject: Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
Date: Tue, 06 Dec 2022 11:13:48 +0100	[thread overview]
Message-ID: <44e2bc660f0fb4fdb2061be1553c93339146cef9.camel@linux.ibm.com> (raw)
In-Reply-To: <fa9729b7-5d20-fb3d-4bf9-e073d18235b3@arm.com>

On Mon, 2022-12-05 at 18:24 +0000, Robin Murphy wrote:
> On 2022-12-02 14:29, Niklas Schnelle wrote:
> > On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote:
> > > On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
> > > > On 2022-11-29 12:00, Niklas Schnelle wrote:
> > > > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> > > > > > On 2022-11-16 17:16, Niklas Schnelle wrote:
> > > > > > > When RPCIT indicates that the underlying hypervisor has run out of
> > > > > > > resources it often means that its IOVA space is exhausted and IOVAs need
> > > > > > > to be freed before new ones can be created. By triggering a flush of the
> > > > > > > IOVA queue we can get the queued IOVAs freed and also get the new
> > > > > > > mapping established during the global flush.
> > > > > > 
> > > > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
> > > > > > exhausted and fail the DMA API call before even getting as far as
> > > > > > iommu_map(), though? Or is there some less obvious limitation like a
> > > > > > maximum total number of distinct IOVA regions regardless of size?
> > > > > 
> > > > > Well, yes and no. Your thinking is of course correct if the advertised
> > > > > available IOVA space can be fully utilized without exhausting
> > > > > hypervisor resources we won't trigger this case. However sadly there
> > > > > are complications. The most obvious being that in QEMU/KVM the
> > > > > restriction of the IOVA space to what QEMU can actually have mapped at
> > > > > once was just added recently[0] prior to that we would regularly go
> > > > > through this "I'm out of resources free me some IOVAs" dance with our
> > > > > existing DMA API implementation where this just triggers an early cycle
> > > > > of freeing all unused IOVAs followed by a global flush. On z/VM I know
> > > > > of no situations where this is triggered. That said this signalling is
> > > > > architected so z/VM may have corner cases where it does this. On our
> > > > > bare metal hypervisor (no paging) this return code is unused and IOTLB
> > > > > flushes are simply hardware cache flushes as on bare metal platforms.
> > > > > 
> > > > > [0]
> > > > > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/
> > > > 
> > > > That sheds a bit more light, thanks, although I'm still not confident I
> > > > fully understand the whole setup. AFAICS that patch looks to me like
> > > > it's putting a fixed limit on the size of the usable address space. That
> > > > in turn implies that "free some IOVAs and try again" might be a red
> > > > herring and never going to work; for your current implementation, what
> > > > that presumably means in reality is "free some IOVAs, resetting the
> > > > allocator to start allocating lower down in the address space where it
> > > > will happen to be below that limit, and try again", but the iommu-dma
> > > > allocator won't do that. If it doesn't know that some arbitrary range
> > > > below the top of the driver-advertised aperture is unusable, it will
> > > > just keep allocating IOVAs up there and mappings will always fail.
> > > > 
> > > > If the driver can't accurately represent the usable IOVA space via the
> > > > aperture and/or reserved regions, then this whole approach seems doomed.
> > > > If on the other hand I've misunderstood and you can actually still use
> > > > any address, just not all of them at the same time,
> > > 
> > > 
> > > This is exactly it, the problem is a limit on the number of IOVAs that
> > > are concurrently mapped. In QEMU pass-through the tightest limit is
> > > usually the one set by the host kernel parameter
> > > vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. With
> > > IOMMU_DOMAIN_DMA we stay under this limit without extra action but once
> > > there is a flush queue (including the existing per-CPU one) where each
> > > entry may keep many pages lazily unmapped this is easly hit with fio
> > > bandwidth tests on an NVMe. For this case this patch works reliably
> > > because of course the number of actually active mappings without the
> > > lazily freed ones is similar to the number of active ones with
> > > IOMMU_DOMAIN_DMA.
> > > 
> > > >   then it might in
> > > > fact be considerably easier to skip the flush queue mechanism entirely
> > > > and implement this internally to the driver - basically make .iotlb_sync
> > > > a no-op for non-strict DMA domains,
> > > 
> > > I'm assuming you mean .iotlb_sync_map above.
> 
> No, I did mean .iotlb_sync, however on reflection that was under the 
> assumption that it's OK for the hypervisor to see a new mapping for a 
> previously-used IOVA without having seen it explicitly unmapped in 
> between. Fair enough if that isn't the case, but if it is then your 
> pagetable can essentially act as the "flush queue" by itself.

Hmm, I see. We do want the hypervisor to see mappings as invalid before
they are changed again to a new valid mapping. In case of e.g. QEMU/KVM
this allows the hypervisor to itself rely on the hardware to fill in
the IOTLB on map i.e. use a no-op .iotlb_sync_map and a .iotlb_sync
that flushes the hardware IOTLB. Also this allows QEMU to emulate the
IOMMU with just map/unmap primitives on vfio/iommufd. Consequently, I
recently found that vfio-pci pass-through works in combination with an
emulated s390 guest on an x86_64 host albeit very slowly of course.

> 
> > > >   put the corresponding RPCIT flush
> > > > and retry in .sync_map, then allow that to propagate an error back to
> > > > iommu_map() if the new mapping still hasn't taken.
> > > > 
> > > > Thanks,
> > > > Robin.
> > > 
> > > Hmm, interesting. This would leave the IOVAs in the flush queue lazily
> > > unmapped and thus still block their re-use but free their host
> > > resources via a global RPCIT allowing the guest to use a different
> > > porition of the IOVA space with those resources. It could work, though
> > > I need to test it, but it feels a bit clunky.
> > > 
> > > Maybe we can go cleaner while using this idea of not having to flush
> > > the queue but just freeing their host side resources. If we allowed
> > > .iotlb_sync_map to return an error that fails the mapping operation,
> > > then we could do it all in there. In the normal case it just does the
> > > RPCIT but if that returns that the hypervisor ran out of resources it
> > > does another global RPCIT allowing the hypervisor to free IOVAs that
> > > were lazily unmapped. If the latter succeeds all is good if not then
> > > the mapping operation failed. Logically it makes sense too,
> > > .iotlb_sync_map is the final step of syncing the mapping to the host
> > > which can fail just like the mapping operation itself.
> > > 
> > > Apart from the out of active IOVAs case this would also handle the
> > > other useful error case when using .iotlb_sync_map for shadowing where
> > > it fails because the host ran against a pinned pages limit or out of
> > > actual memory. Not by fixing it but at least we would get a failed
> > > mapping operation.
> > > 
> > > The other callbacks .flush_iotlb_all and .iotlb_sync
> > > could stay the same as they are only used for unmapped pages where we
> > > can't reasonably run out of resources in the host neither active IOVAs
> > > nor pinned pages.
> > > 
> > 
> > Ok, I've done some testing with the above idea and this seems to work
> > great. I've verified that my version of QEMU (without Matt's IOVA
> > aperture resrtriction patch) creates the RPCIT out of resource
> > indications and then the global flush in .iotlb_sync_map is triggered
> > and allows QEMU to unpin pages and free IOVAs while the guest still has
> > them lazily unpapeg (sitting in the flush queue) and thus uses
> > different IOVAs.
> > 
> > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
> > can return and error and then failing the mapping operation I think
> > this is a great approach. One advantage over the previous approach of
> > flushing the queue isthat this should work for the pure IOMMU API too.
> 
> Whatever happens I think allowing .iotlb_sync_map to propagate an error 
> out through iommu_map() is an appropriate thing to do - it sounds like 
> s390 might technically need that for regular IOMMU API correctness in 
> some circumstances anyway. Besides, even in the cases where it 
> represents "simple" TLB maintenance, there are potentially ways that 
> could fail (like timing out if the IOMMU has gone completely wrong), so 
> it wouldn't seem entirely unreasonable if a driver might want to report 
> overall failure if it can't guarantee that the new mapping will actually 
> be usable.
> 
> Thanks,
> Robin.
> 

Great then I'll use this approach for v3 or maybe even sent this
separately as a prerequisite IOMMU fix as yes this is also required for
the IOMMU API to work in a guest where the hypervisor may report
running out of resources.
> 

  reply	other threads:[~2022-12-06 10:15 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 1/7] s390/ism: Set DMA coherent mask Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 2/7] s390/pci: prepare is_passed_through() for dma-iommu Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 3/7] s390/pci: Use dma-iommu layer Niklas Schnelle
2022-11-28 18:03   ` Robin Murphy
2022-12-19 15:17     ` Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type Niklas Schnelle
2022-11-17  1:55   ` Baolu Lu
2022-11-28 11:10     ` Niklas Schnelle
2022-11-28 13:00       ` Baolu Lu
2022-11-28 13:29       ` Jason Gunthorpe
2022-11-28 15:54         ` Niklas Schnelle
2022-11-28 16:35           ` Jason Gunthorpe
2022-11-28 21:01             ` Robin Murphy
2022-11-29 17:33               ` Jason Gunthorpe
2022-11-29 18:41                 ` Robin Murphy
2022-11-29 20:09                   ` Jason Gunthorpe
2022-11-30  1:28                     ` Baolu Lu
2022-12-05 15:34                     ` Niklas Schnelle
2022-12-06 23:09                       ` Jason Gunthorpe
2022-12-07 13:18                         ` Baolu Lu
2022-12-07 13:23                           ` Jason Gunthorpe
2022-12-07 14:18                             ` Robin Murphy
2022-12-07 14:30                               ` Jason Gunthorpe
2022-11-28 16:56           ` Robin Murphy
2022-11-16 17:16 ` [PATCH v2 5/7] iommu/dma: Allow a single FQ in addition to per-CPU FQs Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 6/7] iommu/dma: Enable variable queue size and use larger single queue Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication Niklas Schnelle
2022-11-28 14:52   ` Robin Murphy
2022-11-29 12:00     ` Niklas Schnelle
2022-11-29 12:53       ` Robin Murphy
2022-11-29 14:40         ` Niklas Schnelle
2022-12-02 14:29           ` Niklas Schnelle
2022-12-02 14:42             ` Jason Gunthorpe
2022-12-02 15:12               ` Niklas Schnelle
2022-12-02 15:24                 ` Jason Gunthorpe
2022-12-05 18:24             ` Robin Murphy
2022-12-06 10:13               ` Niklas Schnelle [this message]
2022-11-29 13:51       ` Matthew Rosato

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=44e2bc660f0fb4fdb2061be1553c93339146cef9.camel@linux.ibm.com \
    --to=schnelle@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=gbayer@linux.ibm.com \
    --cc=gerald.schaefer@linux.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=hca@linux.ibm.com \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@nvidia.com \
    --cc=joro@8bytes.org \
    --cc=julianr@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mjrosato@linux.ibm.com \
    --cc=pmorel@linux.ibm.com \
    --cc=robin.murphy@arm.com \
    --cc=svens@linux.ibm.com \
    --cc=wenjia@linux.ibm.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox