public inbox for kvm@vger.kernel.org
From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: ankita@nvidia.com, yishaih@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	anuaggarwal@nvidia.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper
Date: Mon, 18 Sep 2023 14:47:16 -0300	[thread overview]
Message-ID: <20230918174716.GL13733@nvidia.com> (raw)
In-Reply-To: <20230918111949.1d6c8482.alex.williamson@redhat.com>

On Mon, Sep 18, 2023 at 11:19:49AM -0600, Alex Williamson wrote:

> > > And the VMM usage is self inflicted because we insist on
> > > masquerading the coherent memory as a nondescript PCI BAR rather
> > > than providing a device specific region to enlighten the VMM to this
> > > unique feature.  
> > 
> > I see it as two completely separate things.
> > 
> > 1) VFIO and qemu creating a vPCI device. Here we don't need this
> >    information.
> > 
> > 2) This ACPI pxm stuff to emulate the bare metal FW.
> >    Including a proposal for auto-detection of what kind of bare metal
> >    FW is being used.
> > 
> > This being a poor idea for #2 doesn't jump to problems with #1, it
> > just says more work is needed on the ACPI PXM stuff.
> 
> But I don't think we've justified why it's a good idea for #1.  Does
> the composed vPCI device with coherent memory masqueraded as BAR2 have
> a standalone use case without #2?

Today there is no SW that can operate that configuration, but that is
purely a SW-in-the-VM problem.

Jonathan got it right here:

https://lore.kernel.org/all/20230915153740.00006185@Huawei.com/

If Linux in the VM wants to use certain Linux kernel APIs, then the FW
must provision these empty nodes, universally. It is a CXL problem as
well.

For instance, I could hack up Linux and force it to create extra nodes
regardless of ACPI, and then everything would be fine with #1 alone.

When/if Linux learns to dynamically create these things without relying
on FW then we don't need #2.

It is ugly, it is a hack, but it is copying what real FW decided to do.
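As an aside (this is not from the patches under discussion, just an
illustration of the kind of provisioning being argued about): a VMM can
already instantiate memory-less guest NUMA nodes today. Something like
the following QEMU invocation would do it; the host BDF, sizes, and node
count are all placeholders:

```shell
# Hypothetical invocation: one populated node plus two memory-less
# nodes that a guest driver could later claim for coherent memory.
qemu-system-aarch64 \
    -machine virt -cpu host -enable-kvm -smp 4 -m 4G \
    -object memory-backend-ram,id=mem0,size=4G \
    -numa node,nodeid=0,memdev=mem0,cpus=0-3 \
    -numa node,nodeid=1 \
    -numa node,nodeid=2 \
    -device vfio-pci,host=0000:09:00.0
```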

> My understanding based on these series is that the guest driver somehow
> carves up the coherent memory among a set of memory-less NUMA nodes
> (how to know how many?) created by the VMM and reported via the _DSD for
> the device.  If this sort of configuration is a requirement for making
> use of the coherent memory, then what exactly becomes easier by the fact
> that it's exposed as a PCI BAR?

It is keeping two concerns separate. The vPCI layer doesn't care about
any of this because it is a Linux problem. A coherent BAR is fine and
results in the least amount of special code everywhere.

The ACPI layer has to learn how to make this hack to support Linux.

I don't think we should dramatically warp the modeling of the VFIO
regions just to support auto detecting an ACPI hack.

> In fact, if it weren't a BAR I'd probably suggest that the whole
> configuration of this device should be centered around a new
> nvidia-gpu-mem object.  That object could reference the ID of a
> vfio-pci device providing the coherent memory via a device specific
> region and be provided with a range of memory-less nodes created for
> its use.  The object would insert the coherent memory range into the VM
> address space and provide the device properties to make use of it in
> the same way as done on bare metal.

How does that give auto-configuration? The other thread mentions that
many other things need this too, such as CXL and imagined coherent
virtio devices?

Can we do the API you imagine more generically with any VFIO region
(even a normal BAR) providing the memory object?

> It seems to me that the PCI BAR representation of coherent memory is
> largely just a shortcut to getting it into the VM address space, but
> it's also leading us down these paths where the "pxm stuff" is invoked
> based on the device attached to the VM, which is getting a lot of
> resistance.  

I don't like the idea of a dedicated memory region type; I think we
will have more of these than just one.

Some kind of flag on the vfio device indicating that PXM nodes (and
how many) should be auto-created would be fine.

But if there is resistance to auto-configuration, I don't see how that
goes away just because we shift around the indicator that triggers it?

Jason


Thread overview: 8+ messages
2023-09-15  2:54 [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper ankita
2023-09-15 14:24 ` Alex Williamson
2023-09-18 13:02   ` Jason Gunthorpe
2023-09-18 14:27     ` Alex Williamson
2023-09-18 14:49       ` Jason Gunthorpe
2023-09-18 17:19         ` Alex Williamson
2023-09-18 17:47           ` Jason Gunthorpe [this message]
2023-09-28  6:33 ` Tian, Kevin
