Re: [PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory

From: Jason Gunthorpe <jgg@nvidia.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>,
	ankita@nvidia.com,
	Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
	oliver.upton@linux.dev, suzuki.poulose@arm.com,
	yuzenghui@huawei.com, will@kernel.org, ardb@kernel.org,
	akpm@linux-foundation.org, gshan@redhat.com, aniketa@nvidia.com,
	cjia@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com,
	vsethi@nvidia.com, acurrid@nvidia.com, apopple@nvidia.com,
	jhubbard@nvidia.com, danw@nvidia.com, mochs@nvidia.com,
	kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	lpieralisi@kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
Date: Tue, 5 Dec 2023 12:43:18 -0400	[thread overview]
Message-ID: <20231205164318.GG2692119@nvidia.com> (raw)
In-Reply-To: <ZW9OSe8Z9gAmM7My@arm.com>

On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote:
> Yeah, I made this argument in the past. But it's a fair question to ask
> since the Arm world is different from x86. Just reusing an existing
> driver in a different context may break its expectations. Does Normal NC
> access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is
> completed? It does reach some point of serialisation with subsequent
> accesses to the same address but not sure how it is ordered with an
> access to a different location like the config space used for reset.
> Maybe it's not a problem at all or it is safe only for PCIe but it would
> be good to get to the bottom of this.

IMHO, the answer is you can't know architecturally. The specific
vfio-platform driver must do an analysis of it's specific SOC and
determine what exactly is required to order the reset. The primary
purpose of the vfio-platform drivers is to provide this reset!

In most cases I would expect some reads from the device to be required
before the reset.

> > Remember, the feedback we got from the CPU architects was that even
> > DEVICE_* will experience an uncontained failure if the device tiggers
> > an error response in shipping ARM IP.
> > 
> > The reason PCIe is safe is because the PCI bridge does not generate
> > errors in the first place!
> 
> That's an argument to restrict this feature to PCIe. It's really about
> fewer arguments on the behaviour of other devices. Marc did raise
> another issue with the GIC VCPU interface (does this even have a vma in
> the host VMM?). That's a class of devices where the mapping is
> context-switched, so the TLBI+DSB rules don't help.

I don't know anything about the GIC VCPU interface, to give any
comment unfortunately. Since it seems there is something to fix here I
would appreciate some background..

When you say it is context switched do you mean kvm does a register
write on every vm entry to set the proper HW context for the vCPU?

We are worrying that register write will possibly not order after
NORMAL_NC?

> > Thus, the way a platform device can actually be safe is if it too
> > never generates errors in the first place! Obviously this approach
> > works just as well with NORMAL_NC.
> > 
> > If a platform device does generate errors then we shouldn't expect
> > containment at all, and the memory type has no bearing on the
> > safety. The correct answer is to block these platform devices from
> > VFIO/KVM/etc because they can trigger uncontained failures.
> 
> Assuming the error containment is sorted, there are two other issues
> with other types of devices:
> 
> 1. Ordering guarantees on reclaim or context switch

Solved in VFIO

> 2. Unaligned accesses
> 
> On (2), I think PCIe is fairly clear on how the TLPs are generated, so I
> wouldn't expect additional errors here. But I have no idea what AMBA/AXI
> does here in general. Perhaps it's fine, I don't think we looked into it
> as the focus was mostly on PCIe.

I would expect AXI devices to throw errors in all sorts of odd
cases. eg I would not be surprised at all to see carelessly built AXI
devices error if ST64B is pointed at them. At least when I was
building AXI logic years ago I was so lazy :P

This is mostly my point - if the devices under vfio-platform were not
designed to have contained errors then, IMHO, it is hard to believe
that DEVICE_X/NORMAL_NC is the only issue in that HW.

> So, I think it would be easier to get this patch upstream if we limit
> the change to PCIe devices for now. We may relax this further in the
> future. Do you actually have a need for non-PCIe devices to support WC
> in the guest or it's more about the complexity of the logic to detect
> whether it's actually a PCIe BAR we are mapping into the guest? (I can
> see some Arm GPU folk asking for this but those devices are not easily
> virtualisable).

The complexity is my concern, and the disruption to the ecosystem with
some of the ideas given.

If there was a trivial way to convey in the VMA that it is safe then
sure, no objection from me.

My worry is this has turned from fixing a real problem we have today
into a debate about theoretical issues that nobody may care about that
are very disruptive to solve.

I would turn it around and ask we find a way to restrict platform
devices when someone comes with a platform device that wants to use
secure kvm and has a single well defined HW problem that is solved by
this work.

What if we change vfio-pci to use pgprot_device() like it already
really should and say the pgprot_noncached() is enforced as
DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
Would that be acceptable?

Jason