All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: ankita@nvidia.com, yishaih@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	anuaggarwal@nvidia.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper
Date: Mon, 18 Sep 2023 10:02:56 -0300	[thread overview]
Message-ID: <20230918130256.GE13733@nvidia.com> (raw)
In-Reply-To: <20230915082430.11096aa3.alex.williamson@redhat.com>

On Fri, Sep 15, 2023 at 08:24:30AM -0600, Alex Williamson wrote:
> On Thu, 14 Sep 2023 19:54:15 -0700
> <ankita@nvidia.com> wrote:
> 
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary cache coherent interconnect.
> > 
> > This representation has a number of limitations compared to a real PCI
> > device, in particular, it does not model the coherent GPU memory
> > aperture as a PCI config space BAR, and PCI doesn't know anything
> > about cacheable memory types.
> > 
> > Provide a VFIO PCI variant driver that adapts the unique PCI
> > representation into a more standard PCI representation facing
> > userspace. The GPU memory aperture is obtained from ACPI using
> > device_property_read_u64(), according to the FW specification,
> > and exported to userspace as a separate VFIO_REGION. Since the device
> > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> > to the next available PCI BAR (BAR2). Qemu will then naturally generate a
> > PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> > reported in BAR2).
> > 
> > Since this memory region is actually cache coherent with the CPU, the
> > VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> > mapping is done using remap_pfn_range().
> > 
> > PCI BAR are aligned to the power-of-2, but the actual memory on the
> > device may not. A read or write access to the physical address from the
> > last device PFN up to the next power-of-2 aligned physical address
> > results in reading ~0 and dropped writes.
> > 
> > Lastly the presence of CPU cache coherent device memory is exposed
> > through sysfs for use by user space.
> 
> This looks like a giant red flag that this approach of masquerading the
> coherent memory as a PCI BAR is the wrong way to go.  If the VMM needs
> to know about this coherent memory, it needs to get that information
> in-band. 

The VMM part doesn't need this flag, nor does the VM. The
orchestration needs to know when to setup the pxm stuff.

I think we should drop the sysfs for now until the qemu thread about
the pxm stuff settles into an idea.

When the qemu API is clear we can have a discussion on what component
should detect this driver and setup the pxm things, then answer the
how should the detection work from the kernel side.

> be reaching out to arbitrary sysfs attributes.  Minimally this
> information should be provided via a capability on the region info
> chain, 

That definitely isn't suitable, eg libvirt won't have access to inband
information if it turns out libvirt is supposed to setup the pxm qemu
arguments?

> A "coherent_mem" attribute on the device provides a very weak
> association to the memory region it's trying to describe.

That's because it's use has nothing to do with the memory region :)

Jason

  reply	other threads:[~2023-09-18 15:21 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-15  2:54 [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper ankita
2023-09-15 14:24 ` Alex Williamson
2023-09-18 13:02   ` Jason Gunthorpe [this message]
2023-09-18 14:27     ` Alex Williamson
2023-09-18 14:49       ` Jason Gunthorpe
2023-09-18 17:19         ` Alex Williamson
2023-09-18 17:47           ` Jason Gunthorpe
2023-09-28  6:33 ` Tian, Kevin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230918130256.GE13733@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=acurrid@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=aniketa@nvidia.com \
    --cc=ankita@nvidia.com \
    --cc=anuaggarwal@nvidia.com \
    --cc=apopple@nvidia.com \
    --cc=cjia@nvidia.com \
    --cc=danw@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=targupta@nvidia.com \
    --cc=vsethi@nvidia.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.