linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Marc Zyngier <maz@kernel.org>,
	ankita@nvidia.com, alex.williamson@redhat.com,
	naoya.horiguchi@nec.com, oliver.upton@linux.dev,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	Lorenzo Pieralisi <lpieralisi@kernel.org>
Subject: Re: [PATCH v3 1/6] kvm: determine memory type from VMA
Date: Wed, 31 May 2023 12:35:57 +0100	[thread overview]
Message-ID: <ZHcxHbCb439I1Uk2@arm.com> (raw)
In-Reply-To: <ZDarrZmLWlA+BHQG@nvidia.com>

On Wed, Apr 12, 2023 at 10:01:33AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 12, 2023 at 01:43:26PM +0100, Marc Zyngier wrote:
> > What makes it safe? How does VFIO ensures that the memory type used is
> > correct in all circumstances? This has to hold for *ANY* device, not
> > just your favourite toy of the day. Nothing in this patch is horribly
> > wrong, but the above question must be answered before we can consider
> > any of this.
> 
> In VFIO we now have the concept of "variant drivers" which work with
> specific PCI IDs. The variant drivers can inject device specific
> knowledge into VFIO. In this series the driver injects the cachable
> pgprot when it creates some of the VMAs because it knows the PCI IDs
> it supports, parses the ACPI description, and knows for sure that the
> memory it puts in the cachable VMA is linked with a cache coherent
> interconnect.
> 
> The generic vfio-pci path is not changed, so 'any device' is not
> relevant here.

There were several off-list discussions, I'm trying to summarise my
understanding here. This series aims to relax the VFIO mapping to
cacheable and have KVM map it into the guest with the same attributes.
Somewhat related past threads also tried to relax the KVM device
pass-through mapping from Device_nGnRnE (pgprot_noncached) to Normal_NC
(pgprot_writecombine). Those were initially using the PCIe prefetchable
BAR attribute but that's not a reliable means to infer whether Normal vs
Device is safe. Anyway, I think we'd need to unify these threads and
come up with some common handling that can cater for various attributes
required by devices/drivers. Therefore replying in this thread.

Current status on arm64: vfio-pci user mapping (Qemu) of PCIe BARs is
Device_nGnRnE. KVM also maps it at Stage 2 with the same attributes. The
user mapping is fine since the VMM may not have detailed knowledge about
how to safely map the BAR. KVM doesn't have such detailed knowledge
either in the device pass-through case but for safety reasons it follows
the same attributes as the user mapping.

From a safety perspective, relaxing the KVM stage 2 mapping has some
implications:

1. It creates multiple memory aliases with different attributes.

2. Speculative loads, unaligned accesses.

3. Potentially new uncontained errors introduced by Normal NC vs Device
   mappings.

From the private discussions I had regarding point (3), Device doesn't
really make any difference and it can be even worse. For example, Normal
mappings would allow a contained error while Device would trigger an
uncontained SError. So the only safe bet here is for the platform, root
complex to behave properly when device pass-through is involved (e.g.
PCIe to ignore writes, return 0xff on reads if there are errors). That's
something Arm is working on to require in the BSA specs (base system
architecture). Presumably cloud vendors allowing device pass-through
already know their system well enough, no need for further policing in
KVM (which it can't do properly anyway).

As long as the errors are contained, point (2) becomes the
responsibility of the guest, given its detailed knowledge of the device,
using the appropriate attributes (whether x86 WC maps exactly onto arm64
Normal NC is the subject of a separate discussion; so far that's the
assumption we took in the arm64 kernel). Regarding vfio-pci, the user
mapping must remain Device_nGnRnE on arm64 to avoid unexpected
speculative loads.

This leaves us with (1) and since the vfio-pci user mapping must remain
Device, there's a potential mismatched alias if the guest maps the
corresponding BAR as Normal NC. Luckily the Arm ARM constrains the
hardware behaviour here to basically allowing the Device mapping in the
VMM to behave somewhat the Normal NC mapping in the guest. IOW, the VMM
loses the Device guarantees (non-speculation, ordering). That's not a
problem for the device since the guest already deemed such mapping to be
safe.

While I think it's less likely for the VMM to access the same BAR that
was mapped into the guest, if we want to be on the safe side from an ABI
perspective (the returned mmap() now has different memory guarantees),
we could make this an opt-in. That's pretty much the VMM stating that it
is ok with losing some of the Device guarantees for the vfio-pci
mapping. A question here is whether we want to this to be done at the
vfio-pci level, the KVM memory slot or a big knob per VMM. Or whether we
need to bother with this at all (if it's just theoretical, maybe we can
have a global opt-out).

Going back to this series, allowing Cacheable mapping in KVM has similar
implications as above. (2) and (3) would be assumed safe by the platform
vendors. Regarding (1), to avoid confusion, I would only allow it if FWB
(force write-back) is present so that KVM can enforce a cacheable
mapping at Stage 2 if the vfio variant driver also maps it as cacheable
in user space.

There were some other thoughts on only allowing different attributes in
KVM if the user counterpart does not have any mapping (e.g. fd-based
KVM memory slots). However, this won't work well with CXL-attached
memory for example where the VMM may want access to it (e.g. virtio). So
I wouldn't spend more thoughts on this.


The TL;DR summary of what I think we should do:

a) Relax the KVM Stage 2 mapping for vfio-pci to Normal NC
   (pgprot_writecombine). TBD whether we need a VMM opt-in is at the
   vfio-pci level, KVM slot or per-VMM level. Another option is to do
   this by default and have a global opt-out (cmdline). I don't think
   the latter is such a bad idea since I find it unlikely for the VMM to
   also touch the same PCIe BAR _and_ expect different memory ordering
   guarantees than the guest device driver.

b) Map the KVM stage 2 mapping as cacheable+FWB iff the VMM has a
   corresponding cacheable mapping.

Thoughts?

-- 
Catalin


  reply	other threads:[~2023-05-31 11:36 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-05 18:01 [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible ankita
2023-04-05 18:01 ` [PATCH v3 1/6] kvm: determine memory type from VMA ankita
2023-04-12 12:43   ` Marc Zyngier
2023-04-12 13:01     ` Jason Gunthorpe
2023-05-31 11:35       ` Catalin Marinas [this message]
2023-06-14 12:44         ` Jason Gunthorpe
2023-07-14  8:10         ` Benjamin Herrenschmidt
2023-07-16 15:09           ` Catalin Marinas
2023-07-16 22:30             ` Jason Gunthorpe
2023-07-17 18:35               ` Alex Williamson
2023-07-25  6:18                 ` Benjamin Herrenschmidt
2023-04-05 18:01 ` [PATCH v3 2/6] vfio/nvgpu: expose GPU device memory as BAR1 ankita
2023-04-05 21:07   ` kernel test robot
2023-04-05 18:01 ` [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages ankita
2023-04-05 21:07   ` kernel test robot
2023-05-09  9:51   ` HORIGUCHI NAOYA(堀口 直也)
2023-05-15 11:18     ` Ankit Agrawal
2023-05-23  5:43       ` HORIGUCHI NAOYA(堀口 直也)
2023-04-05 18:01 ` [PATCH v3 4/6] mm: Add poison error check in fixup_user_fault() for mapped PFN ankita
2023-04-05 18:01 ` [PATCH v3 5/6] mm: Change ghes code to allow poison of non-struct PFN ankita
2023-04-05 18:01 ` [PATCH v3 6/6] vfio/nvgpu: register device memory for poison handling ankita
2023-04-05 20:24   ` Zhi Wang
2023-04-05 21:50   ` kernel test robot
2023-05-24  9:53   ` Dan Carpenter
2023-04-06 12:07 ` [PATCH v3 0/6] Expose GPU memory as coherently CPU accessible David Hildenbrand
2023-04-12  8:43   ` Ankit Agrawal
2023-04-12  9:48     ` Marc Zyngier
2023-04-12 12:28 ` Marc Zyngier
2023-04-12 12:53   ` Jason Gunthorpe
2023-04-13  9:52     ` Marc Zyngier
2023-04-13 13:19       ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZHcxHbCb439I1Uk2@arm.com \
    --to=catalin.marinas@arm.com \
    --cc=acurrid@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=aniketa@nvidia.com \
    --cc=ankita@nvidia.com \
    --cc=apopple@nvidia.com \
    --cc=cjia@nvidia.com \
    --cc=danw@nvidia.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lpieralisi@kernel.org \
    --cc=maz@kernel.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=oliver.upton@linux.dev \
    --cc=targupta@nvidia.com \
    --cc=vsethi@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).