From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Alex Williamson <alex@shazbot.org>
Cc: <mhonap@nvidia.com>, <aniketa@nvidia.com>, <ankita@nvidia.com>,
	<vsethi@nvidia.com>, <jgg@nvidia.com>, <mochs@nvidia.com>,
	<skolothumtho@nvidia.com>, <alejandro.lucero-palau@amd.com>,
	<dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
	<ira.weiny@intel.com>, <dan.j.williams@intel.com>, <jgg@ziepe.ca>,
	<yishaih@nvidia.com>, <kevin.tian@intel.com>, <cjia@nvidia.com>,
	<targupta@nvidia.com>, <zhiw@nvidia.com>, <kjaju@nvidia.com>,
	<linux-kernel@vger.kernel.org>, <linux-cxl@vger.kernel.org>,
	<kvm@vger.kernel.org>
Subject: Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
Date: Thu, 19 Mar 2026 16:06:30 +0000
Message-ID: <20260319160630.00002c6d@huawei.com>
In-Reply-To: <20260317152445.67a93881@shazbot.org>

On Tue, 17 Mar 2026 15:24:45 -0600
Alex Williamson <alex@shazbot.org> wrote:

> On Fri, 13 Mar 2026 12:13:41 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> > On Thu, 12 Mar 2026 02:04:38 +0530
> > mhonap@nvidia.com wrote:
> >   
> > > From: Manish Honap <mhonap@nvidia.com>
> > > ---
> > >  Documentation/driver-api/index.rst        |   1 +
> > >  Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
> > >  2 files changed, 217 insertions(+)
> > >  create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> > > 
> > > diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
> > > index 1833e6a0687e..7ec661846f6b 100644
> > > --- a/Documentation/driver-api/index.rst
> > > +++ b/Documentation/driver-api/index.rst    
> >   
> > >  
> > >  Bus-level documentation
> > >  =======================
> > > diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
> > > new file mode 100644
> > > index 000000000000..f2cbe2fdb036
> > > --- /dev/null
> > > +++ b/Documentation/driver-api/vfio-pci-cxl.rst    
> >   
> > > +Device Detection
> > > +----------------
> > > +
> > > +CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
> > > +device that has:
> > > +
> > > +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> > > +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.    
> > 
> > FWIW to be type 2 as opposed to a type 3 non class code device (e.g. the
> > compressed memory devices Gregory Price and others are using) you need
> > Cache_capable as well.  Might be worth making this all about
> > CXL Type-2 and non class code Type-3.
> >   
> > > +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> > > +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> > > +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.    
> > 
> > This is the bit that we need to make more general. Otherwise you'll have
> > to have a bios upgrade for every type 2 device (and no native hotplug).
> > Note native hotplug is quite likely if anyone is switch based device
> > pooling.
> > 
> > I assume that you are doing this today to get something upstream
> > and presume it works for the type 2 device you have on the host you
> > care about.  I'm not sure there are 'general' solutions but maybe
> > there are some heuristics or sufficient conditions for establishing the
> > size.
> > 
> > Type 2 might have any of:
> > - Conveniently preprogrammed HDM decoders (the case you use)
> > - Maximum of 2 HDM decoders + the same number of Range registers.
> >   In general the problem with range registers is they are a legacy feature
> >   and there are only 2 of them whereas a real device may have many more
> >   DPA ranges. In this corner case though, is it enough to give us the
> >   necessary sizes?  I think it might be but would like others familiar
> >   with the spec to confirm. (If needed I'll take this to the consortium
> >   for an 'official' view).
> > - A DOE and table access protocol.  CDAT should give us enough info to
> >   be fairly sure what is needed.
> > - A CXL mailbox (maybe the version in the PCI spec now) and the spec defined
> >   commands to query what is there.  Reading the intro to 8.2.10.9 Memory
> >   Device Command Sets, it's a little unclear on whether these are valid on
> >   non class code devices but I believe having the appropriate Mailbox
> >   type identifier is enough to say we expect to get them.
> > 
> > None of this is required though and the mailboxes are non trivial.
> > So personally I think we should propose a new DVSEC that provides any
> > info we need for generic passthrough.  Starting with what we need
> > to get the regions right.  Until something like that is in place we
> > will have to store this info somewhere.
> > 
> > There is (maybe) an alternative of doing the region allocation on demand.
> > That is emulate the HDM decoders in QEMU (on top of the emulation
> > here) and when settings corresponding to a region setup occur,
> > go request one from the CXL core. The problem is we can't guarantee
> > it will be available at that time. So we can 'guess' what to provide
> > to the VM in terms of CXL fixed memory windows, but short of heuristics
> > (either whole of the host offer, or divide it up based on devices present
> >  vs what is in the VM) that is going to be prone to it not being available
> > later.
> > 
> > Where do people think this should be?  We are going to end up with
> > a device list somewhere. Could be in kernel, or in QEMU or make it an
> > orchestrator problem (applying the 'someone else's problem' solution).  
> 
> That's the typical approach.  That's what we did with resizable BARs.
> If we cannot guarantee allocation on demand, we need to push the policy
> to the device, via something that indicates the size to use, or to the
> orchestration, via something that allows the size to be committed
> out-of-band.  As with REBAR, we then need to be able to restrict the
> guest behavior to select only the configured option.
> 
> I imagine this means for the non-pre-allocated case, we need to develop
> some sysfs attributes that allows that out-of-band sizing, which would
> then appear as a fixed, pre-allocated configuration to the guest.
> Thanks,

I did some reading, as I was only vaguely familiar with how the resizable BAR
stuff was done. That approach should be fairly straightforward to adapt here:
stash some config in struct pci_dev, via a sysfs interface, before binding
vfio-pci/cxl.  Given that the association with the CXL infrastructure only
happens later (unlike BAR config), it would then be the job of the
vfio-pci/cxl driver to see what was requested and attempt to set up the
CXL topology to deliver it at bind time.
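
As a rough illustration of that two-step flow (record the request first,
act on it at bind), here is a minimal user-space model in C. Everything in
it is hypothetical: the fake_* names, the single requested-size field and
the capacity check are stand-ins for illustration, not existing PCI, CXL or
vfio interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for state stashed on the device before bind. */
struct fake_pci_dev {
	uint64_t cxl_requested_size;	/* 0 means nothing was requested */
};

/* Hypothetical stand-in for CXL capacity the host can still hand out. */
struct fake_cxl_host {
	uint64_t free_bytes;
};

/* Models the sysfs store: only record the size, do not touch CXL yet. */
int fake_store_requested_size(struct fake_pci_dev *pdev, uint64_t size)
{
	assert(pdev != NULL);
	if (size == 0)
		return -EINVAL;
	pdev->cxl_requested_size = size;
	return 0;
}

/*
 * Models bind time: the driver looks at what was requested and tries to
 * get a region of exactly that size. Failure here fails the bind, so the
 * guest never sees a size that could not be delivered.
 */
int fake_bind(const struct fake_pci_dev *pdev, struct fake_cxl_host *host,
	      uint64_t *region_size)
{
	assert(pdev != NULL && host != NULL && region_size != NULL);
	if (pdev->cxl_requested_size == 0)
		return -ENODEV;		/* nothing configured out-of-band */
	if (pdev->cxl_requested_size > host->free_bytes)
		return -ENOSPC;		/* host cannot deliver the request */
	host->free_bytes -= pdev->cxl_requested_size;
	*region_size = pdev->cxl_requested_size;
	return 0;
}
```

In this model the size handed to the guest is fixed at bind time, which is
what lets it appear as a pre-allocated configuration later on.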

Manish, would you mind hacking up a small PoC on top of your existing code to
see if this approach shows up any problems?  I don't have anything to test
against right now, though I could probably hack some emulation together fairly
fast. I'm thinking you'll get there faster!  I'm mostly focused on stuff for
this cycle at the moment, and I suspect we'll be discussing this for a while
yet + it has dependencies on other series that aren't in yet.

I'm not sure the PCI folk will like us stashing random stuff in their
structures just because we haven't bound anything yet, and so have no
CXL structures to use.  We should probably think about how VF CXL.mem
region/sub-region assignment might work as well.

Sticking to PF (well actually just function 0) passthrough for now...
For the guest, we can constrain things so there is only one right option,
though it will limit what topologies we can build.  Basically each device
passed through has its own CXL fixed memory window, its own host bridge,
and its own root port + no switches.  The sizing it sees for the CFMWS
matches what we configured in the host.  We could program that topology up
and lock it down, but that means VM BIOS nastiness so I'd leave it to the
native Linux code to bring it up.  If anyone wants to do P2P it'll get
harder to do within the spec, as we will have to prevent topologies that
contain foot guns like the ability to configure interleave.

This constrained approach is what we plan for the CXL class code type 3
device emulation used for DCD, so we've been exploring it already.
It's still possible to do annoying things like zero size decoders +
skip. For now we can fail HDM decoder commits if they are particularly
nonsensical and we haven't handled them yet - ultimately we'll probably
want to minimize what we refuse to handle, as I'm sure 'other OS' may not
do things the same way as Linux.
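
A sketch of what such a commit check might look like, written as a
self-contained C model. The structures, field names and the specific
rejection policy (zero size, non-zero skip, interleave, or a range outside
the single fixed memory window) are assumptions made for illustration;
they are not taken from the series or from the CXL core.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the one fixed memory window the guest was given. */
struct fake_window {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical snapshot of what the guest wrote before setting commit. */
struct fake_hdm_commit {
	uint64_t base;
	uint64_t size;
	uint64_t skip;
	unsigned int interleave_ways;
};

/* Returns 0 to accept the commit, or a negative errno to fail it. */
int fake_check_commit(const struct fake_window *win,
		      const struct fake_hdm_commit *c)
{
	assert(win != NULL && c != NULL);

	if (c->size == 0)
		return -EINVAL;		/* zero size decoder */
	if (c->skip != 0)
		return -EOPNOTSUPP;	/* skip not handled in this model */
	if (c->interleave_ways != 1)
		return -EOPNOTSUPP;	/* one device per window, no interleave */

	/* The decoded range must sit entirely inside the window. */
	if (c->base < win->base || c->size > win->size ||
	    c->base - win->base > win->size - c->size)
		return -ERANGE;

	return 0;
}
```

The unsupported cases are kept separate from the invalid ones so that the
list of refused configurations can shrink over time without changing how
genuinely bad requests are reported.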

P2P and the fun of a single device on multiple PCI hierarchies have to be
solved later. As an FYI, for bandwidth, people will be building devices that
interleave memory addresses over multiple root ports. Dan reminded
me of that challenge last night.  See bundled ports in CXL 4.0, though
this particular part, related to CXL.mem, is actually possible prior to
that stuff for CXL.cache.  Oh and don't get me started on TSP / coco
challenges. I take the view they are Dan's problem for now ;)

Jonathan


> 
> Alex

