public inbox for kvm@vger.kernel.org
From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: ankita@nvidia.com, yishaih@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	anuaggarwal@nvidia.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper
Date: Mon, 18 Sep 2023 10:02:56 -0300
Message-ID: <20230918130256.GE13733@nvidia.com>
In-Reply-To: <20230915082430.11096aa3.alex.williamson@redhat.com>

On Fri, Sep 15, 2023 at 08:24:30AM -0600, Alex Williamson wrote:
> On Thu, 14 Sep 2023 19:54:15 -0700
> <ankita@nvidia.com> wrote:
> 
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary cache coherent interconnect.
> > 
> > This representation has a number of limitations compared to a real PCI
> > device, in particular, it does not model the coherent GPU memory
> > aperture as a PCI config space BAR, and PCI doesn't know anything
> > about cacheable memory types.
> > 
> > Provide a VFIO PCI variant driver that adapts the unique PCI
> > representation into a more standard PCI representation facing
> > userspace. The GPU memory aperture is obtained from ACPI using
> > device_property_read_u64(), according to the FW specification,
> > and exported to userspace as a separate VFIO_REGION. Since the device
> > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> > to the next available PCI BAR (BAR2). QEMU will then naturally generate a
> > PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> > is reported in BAR2).
> > 
> > Since this memory region is actually cache coherent with the CPU, the
> > VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> > mapping is done using remap_pfn_range().
> > 
> > PCI BARs are aligned to a power of 2, but the actual memory on the
> > device may not be. A read or write access to the physical address from
> > the last device PFN up to the next power-of-2 aligned physical address
> > results in reading ~0 and dropped writes.
> > 
> > Lastly the presence of CPU cache coherent device memory is exposed
> > through sysfs for use by user space.
> 
> This looks like a giant red flag that this approach of masquerading the
> coherent memory as a PCI BAR is the wrong way to go.  If the VMM needs
> to know about this coherent memory, it needs to get that information
> in-band. 

The VMM part doesn't need this flag, nor does the VM. The
orchestration needs to know when to set up the pxm stuff.

I think we should drop the sysfs for now until the qemu thread about
the pxm stuff settles into an idea.

When the qemu API is clear we can have a discussion on what component
should detect this driver and set up the pxm things, then answer how
the detection should work from the kernel side.

> be reaching out to arbitrary sysfs attributes.  Minimally this
> information should be provided via a capability on the region info
> chain, 

That definitely isn't suitable, e.g. libvirt won't have access to
in-band information if it turns out libvirt is supposed to set up the
pxm qemu arguments.

> A "coherent_mem" attribute on the device provides a very weak
> association to the memory region it's trying to describe.

That's because its use has nothing to do with the memory region :)

Jason
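
For readers following the quoted commit message, the power-of-2 BAR
padding it describes (reads past the last device PFN return ~0, writes
there are dropped) can be sketched in plain userspace C. This is only an
illustrative model under stated assumptions, not the code from the
patch: struct coherent_region, roundup_pow2(), region_read() and
region_write() are hypothetical names, and a byte array stands in for
the device memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the exported region; not taken from the patch. */
struct coherent_region {
	uint8_t *mem;      /* stand-in for the coherent device memory */
	uint64_t mem_size; /* actual device memory size in bytes */
	uint64_t bar_size; /* mem_size rounded up to a power of two */
};

/* Round v up to the next power of two; v must be in 1..2^63. */
static uint64_t roundup_pow2(uint64_t v)
{
	uint64_t p = 1;

	assert(v != 0 && v <= (UINT64_C(1) << 63));
	while (p < v)
		p <<= 1;
	return p;
}

/* Number of bytes of [off, off + len) backed by real memory. */
static uint64_t backed_len(const struct coherent_region *r,
			   uint64_t off, uint64_t len)
{
	if (off >= r->mem_size)
		return 0;
	return len < r->mem_size - off ? len : r->mem_size - off;
}

/* Read from the BAR: the padding beyond mem_size reads as all ones. */
static int region_read(const struct coherent_region *r, uint64_t off,
		       void *buf, uint64_t len)
{
	uint64_t n;

	if (off > r->bar_size || len > r->bar_size - off)
		return -1; /* access outside the BAR */
	n = backed_len(r, off, len);
	if (n)
		memcpy(buf, r->mem + off, n);
	memset((uint8_t *)buf + n, 0xff, len - n);
	return 0;
}

/* Write to the BAR: bytes landing in the padding are silently dropped. */
static int region_write(struct coherent_region *r, uint64_t off,
			const void *buf, uint64_t len)
{
	uint64_t n;

	if (off > r->bar_size || len > r->bar_size - off)
		return -1; /* access outside the BAR */
	n = backed_len(r, off, len);
	if (n)
		memcpy(r->mem + off, buf, n);
	return 0;
}
```

With 6 bytes of backing memory the modelled BAR is 8 bytes: a 4-byte
read at offset 4 returns the last two real bytes followed by two 0xff
bytes, and a 4-byte write at the same offset only changes the two real
bytes.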
