* [PATCH 01/20] cxl: Introduce cxl_get_hdm_reg_info()
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-12 11:28 ` Jonathan Cameron
2026-03-12 16:33 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 02/20] cxl: Expose cxl subsystem specific functions for vfio mhonap
` (18 subsequent siblings)
19 siblings, 2 replies; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
The CXL core has the information about which CXL register groups a device implements.
When initializing the device, the CXL core probes the register groups
and saves the information. The probing sequence is quite complicated.
vfio-cxl requires the HDM register information to emulate the HDM decoder
registers.
Introduce cxl_get_hdm_reg_info() for vfio-cxl to leverage the HDM
register information in the CXL core. Thus, it doesn't need to implement
its own probing sequence.
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/pci.c | 45 ++++++++++++++++++++++++++++++++++++++++++
include/cxl/cxl.h | 3 +++
2 files changed, 48 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index ba2d393c540a..52ed0b4f5e78 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -449,6 +449,51 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
}
EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
+/**
+ * cxl_get_hdm_reg_info - Get HDM decoder register block location and count
+ * @cxlds: CXL device state (must have component regs enumerated)
+ * @count: Output: number of HDM decoders (from DVSEC cap). Only set when
+ * the device has a valid HDM decoder capability.
+ * @offset: Output: byte offset of the HDM decoder register block within the
+ * component register BAR. Only set when valid.
+ * @size: Output: size in bytes of the HDM decoder register block. Only set
+ * when valid.
+ *
+ * Reads the CXL component register map and DVSEC capability to return the
+ * Host Managed Device Memory (HDM) decoder register block offset and size,
+ * and the number of HDM decoders. This function requires cxlds->cxl_dvsec
+ * to be non-zero.
+ *
+ * Return: 0 on success, or a negative errno on a config read failure or
+ * when the decoder registers are not valid.
+ */
+int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
+ resource_size_t *offset, resource_size_t *size)
+{
+ struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+ struct cxl_component_reg_map *map =
+ &cxlds->reg_map.component_map;
+ int d = cxlds->cxl_dvsec;
+ u16 cap;
+ int rc;
+
+ /* HDM decoder registers not implemented */
+ if (!map->hdm_decoder.valid || !d)
+ return -ENODEV;
+
+ rc = pci_read_config_word(pdev,
+ d + PCI_DVSEC_CXL_CAP, &cap);
+ if (rc)
+ return rc;
+
+ *count = FIELD_GET(PCI_DVSEC_CXL_HDM_COUNT, cap);
+ *offset = map->hdm_decoder.offset;
+ *size = map->hdm_decoder.size;
+
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_reg_info, "CXL");
+
#define CXL_DOE_TABLE_ACCESS_REQ_CODE 0x000000ff
#define CXL_DOE_TABLE_ACCESS_REQ_CODE_READ 0
#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE 0x0000ff00
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 50acbd13bcf8..8456177b523e 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -284,4 +284,7 @@ int cxl_dpa_free(struct cxl_endpoint_decoder *cxled);
struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
struct cxl_endpoint_decoder **cxled,
int ways);
+
+int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
+ resource_size_t *offset, resource_size_t *size);
#endif /* __CXL_CXL_H__ */
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 01/20] cxl: Introduce cxl_get_hdm_reg_info()
2026-03-11 20:34 ` [PATCH 01/20] cxl: Introduce cxl_get_hdm_reg_info() mhonap
@ 2026-03-12 11:28 ` Jonathan Cameron
2026-03-12 16:33 ` Dave Jiang
1 sibling, 0 replies; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-12 11:28 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Thu, 12 Mar 2026 02:04:21 +0530
mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> CXL core has the information of what CXL register groups a device has.
> When initializing the device, the CXL core probes the register groups
> and saves the information. The probing sequence is quite complicated.
>
> vfio-cxl requires the HDM register information to emulate the HDM decoder
> registers.
>
> Introduce cxl_get_hdm_reg_info() for vfio-cxl to leverage the HDM
> register information in the CXL core. Thus, it doesn't need to implement
> its own probing sequence.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
Hi Manish, Zhi,
Taking a first pass look at this so likely comments will be trivial
whilst I get my head around it.
Jonathan
> ---
> drivers/cxl/core/pci.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> include/cxl/cxl.h | 3 +++
> 2 files changed, 48 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index ba2d393c540a..52ed0b4f5e78 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -449,6 +449,51 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> }
> EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
>
> +/**
> + * cxl_get_hdm_reg_info - Get HDM decoder register block location and count
> + * @cxlds: CXL device state (must have component regs enumerated)
> + * @count: Output: number of HDM decoders (from DVSEC cap). Only set when
> + * the device has a valid HDM decoder capability.
> + * @offset: Output: byte offset of the HDM decoder register block within the
> + * component register BAR. Only set when valid.
> + * @size: Output: size in bytes of the HDM decoder register block. Only set
> + * when valid.
> + *
> + * Reads the CXL component register map and DVSEC capability to return the
> + * Host Managed Device Memory (HDM) decoder register block offset and size,
> + * and the number of HDM decoders.
No it doesn't, see below.
> This function requires cxlds->cxl_dvsec
> + * to be non-zero.
> + *
> + * Return: 0 on success. A negative errno is returned when config read
> + * failure or when the decoder registers are not valid.
> + */
> +int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
If we are going to use a fixed size for count (which is fine) maybe pick a
smaller one. It's never bigger than a u8. Actually never bigger than 2...
See below.
> + resource_size_t *offset, resource_size_t *size)
> +{
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + struct cxl_component_reg_map *map =
> + &cxlds->reg_map.component_map;
Trivial: Fits on one < 80 char line.
> + int d = cxlds->cxl_dvsec;
> + u16 cap;
> + int rc;
> +
> + /* HDM decoder registers not implemented */
> + if (!map->hdm_decoder.valid || !d)
> + return -ENODEV;
> +
> + rc = pci_read_config_word(pdev,
> + d + PCI_DVSEC_CXL_CAP, &cap);
Single line.
Why do you want this? You are using HDM decoders, but checking the
number of ranges. Historical artifact before the HDM Decoder stuff
was added to the spec. That could really do with a rename!
In theory the spec recommends keeping these HDM ranges aligned
with the first 2 HDM decoders, but it doesn't require it and
functionally that doesn't matter so I'd not bother.
> + if (rc)
> + return rc;
> +
> + *count = FIELD_GET(PCI_DVSEC_CXL_HDM_COUNT, cap);
> + *offset = map->hdm_decoder.offset;
> + *size = map->hdm_decoder.size;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_reg_info, "CXL");
> +
> #define CXL_DOE_TABLE_ACCESS_REQ_CODE 0x000000ff
> #define CXL_DOE_TABLE_ACCESS_REQ_CODE_READ 0
> #define CXL_DOE_TABLE_ACCESS_TABLE_TYPE 0x0000ff00
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index 50acbd13bcf8..8456177b523e 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -284,4 +284,7 @@ int cxl_dpa_free(struct cxl_endpoint_decoder *cxled);
> struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder **cxled,
> int ways);
> +
> +int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
> + resource_size_t *offset, resource_size_t *size);
> #endif /* __CXL_CXL_H__ */
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 01/20] cxl: Introduce cxl_get_hdm_reg_info()
2026-03-11 20:34 ` [PATCH 01/20] cxl: Introduce cxl_get_hdm_reg_info() mhonap
2026-03-12 11:28 ` Jonathan Cameron
@ 2026-03-12 16:33 ` Dave Jiang
1 sibling, 0 replies; 54+ messages in thread
From: Dave Jiang @ 2026-03-12 16:33 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> CXL core has the information of what CXL register groups a device has.
> When initializing the device, the CXL core probes the register groups
> and saves the information. The probing sequence is quite complicated.
>
> vfio-cxl requires the HDM register information to emulate the HDM decoder
/register/register block/
Otherwise it implies a single register.
> registers.
>
> Introduce cxl_get_hdm_reg_info() for vfio-cxl to leverage the HDM
> register information in the CXL core. Thus, it doesn't need to implement
same here. "register block"
> its own probing sequence.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/cxl/core/pci.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> include/cxl/cxl.h | 3 +++
> 2 files changed, 48 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index ba2d393c540a..52ed0b4f5e78 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -449,6 +449,51 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> }
> EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
>
> +/**
> + * cxl_get_hdm_reg_info - Get HDM decoder register block location and count
cxl_get_hdm_info() to be consistent with CXL core naming convention.
> + * @cxlds: CXL device state (must have component regs enumerated)
> + * @count: Output: number of HDM decoders (from DVSEC cap). Only set when
> + * the device has a valid HDM decoder capability.
> + * @offset: Output: byte offset of the HDM decoder register block within the
> + * component register BAR. Only set when valid.
> + * @size: Output: size in bytes of the HDM decoder register block. Only set
> + * when valid.
> + *
> + * Reads the CXL component register map and DVSEC capability to return the
> + * Host Managed Device Memory (HDM) decoder register block offset and size,
> + * and the number of HDM decoders. This function requires cxlds->cxl_dvsec
> + * to be non-zero.
> + *
> + * Return: 0 on success. A negative errno is returned when config read
> + * failure or when the decoder registers are not valid.
> + */
> +int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
u8 for count probably sufficient as Jonathan mentioned. Decoder cap is only 4 bits for count.
> + resource_size_t *offset, resource_size_t *size)
> +{
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + struct cxl_component_reg_map *map =
> + &cxlds->reg_map.component_map;
> + int d = cxlds->cxl_dvsec;
Name it dvsec for easier correlation when reading the code.
> + u16 cap;
> + int rc;
> +
> + /* HDM decoder registers not implemented */
> + if (!map->hdm_decoder.valid || !d)
> + return -ENODEV;
> +
> + rc = pci_read_config_word(pdev,
> + d + PCI_DVSEC_CXL_CAP, &cap);
> + if (rc)
> + return rc;
> +
> + *count = FIELD_GET(PCI_DVSEC_CXL_HDM_COUNT, cap);
This is not the correct place to find number of decoders. You should be looking at CXL HDM Decoder Capability Register (CXL r4.0 8.2.4.20.1) for the 'Decoder Count' field.
DJ
> + *offset = map->hdm_decoder.offset;
> + *size = map->hdm_decoder.size;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_reg_info, "CXL");
> +
> #define CXL_DOE_TABLE_ACCESS_REQ_CODE 0x000000ff
> #define CXL_DOE_TABLE_ACCESS_REQ_CODE_READ 0
> #define CXL_DOE_TABLE_ACCESS_TABLE_TYPE 0x0000ff00
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index 50acbd13bcf8..8456177b523e 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -284,4 +284,7 @@ int cxl_dpa_free(struct cxl_endpoint_decoder *cxled);
> struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder **cxled,
> int ways);
> +
> +int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
> + resource_size_t *offset, resource_size_t *size);
> #endif /* __CXL_CXL_H__ */
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 02/20] cxl: Expose cxl subsystem specific functions for vfio
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
2026-03-11 20:34 ` [PATCH 01/20] cxl: Introduce cxl_get_hdm_reg_info() mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-12 16:49 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 03/20] cxl: Move CXL spec defines to public header mhonap
` (17 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
The following functions from the CXL subsystem will be required by
vfio-cxl to support Type-2 device passthrough:
cxl_find_regblock - To find component registers
cxl_probe_component_regs - Probe HDM/RAS capabilities
Make these functions available by declaring them in the public include
header instead of the subsystem-specific header.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/cxl.h | 4 ----
include/cxl/cxl.h | 7 +++++++
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 2b1f7d687a0e..10ddab3949ee 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -198,8 +198,6 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
#define CXLDEV_MBOX_BG_CMD_COMMAND_VENDOR_MASK GENMASK_ULL(63, 48)
#define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
-void cxl_probe_component_regs(struct device *dev, void __iomem *base,
- struct cxl_component_reg_map *map);
void cxl_probe_device_regs(struct device *dev, void __iomem *base,
struct cxl_device_reg_map *map);
int cxl_map_device_regs(const struct cxl_register_map *map,
@@ -211,8 +209,6 @@ enum cxl_regloc_type;
int cxl_count_regblock(struct pci_dev *pdev, enum cxl_regloc_type type);
int cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_type type,
struct cxl_register_map *map, unsigned int index);
-int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
- struct cxl_register_map *map);
int cxl_setup_regs(struct cxl_register_map *map);
struct cxl_dport;
int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 8456177b523e..610711e861d4 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -287,4 +287,11 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
resource_size_t *offset, resource_size_t *size);
+struct pci_dev;
+enum cxl_regloc_type;
+int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
+ struct cxl_register_map *map);
+void cxl_probe_component_regs(struct device *dev, void __iomem *base,
+ struct cxl_component_reg_map *map);
+
#endif /* __CXL_CXL_H__ */
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 02/20] cxl: Expose cxl subsystem specific functions for vfio
2026-03-11 20:34 ` [PATCH 02/20] cxl: Expose cxl subsystem specific functions for vfio mhonap
@ 2026-03-12 16:49 ` Dave Jiang
2026-03-13 10:05 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-12 16:49 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> Below functions from CXL subsystem will be required in vfio-cxl
> for supporting the type-2 device passthrough:
> cxl_find_regblock - To find component registers
> cxl_probe_component_regs - Probe HDM/RAS capabilities
>
> Make these functions available via declaring them from include header
> instead of subsystem-specific header.
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/cxl/cxl.h | 4 ----
> include/cxl/cxl.h | 7 +++++++
> 2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 2b1f7d687a0e..10ddab3949ee 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -198,8 +198,6 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
> #define CXLDEV_MBOX_BG_CMD_COMMAND_VENDOR_MASK GENMASK_ULL(63, 48)
> #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
>
> -void cxl_probe_component_regs(struct device *dev, void __iomem *base,
> - struct cxl_component_reg_map *map);
> void cxl_probe_device_regs(struct device *dev, void __iomem *base,
> struct cxl_device_reg_map *map);
> int cxl_map_device_regs(const struct cxl_register_map *map,
> @@ -211,8 +209,6 @@ enum cxl_regloc_type;
> int cxl_count_regblock(struct pci_dev *pdev, enum cxl_regloc_type type);
> int cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_type type,
> struct cxl_register_map *map, unsigned int index);
> -int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
> - struct cxl_register_map *map);
> int cxl_setup_regs(struct cxl_register_map *map);
> struct cxl_dport;
> int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index 8456177b523e..610711e861d4 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -287,4 +287,11 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
>
> int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
> resource_size_t *offset, resource_size_t *size);
> +struct pci_dev;
> +enum cxl_regloc_type;
> +int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
> + struct cxl_register_map *map);
> +void cxl_probe_component_regs(struct device *dev, void __iomem *base,
> + struct cxl_component_reg_map *map);
I do wonder if this needs an ifdef of CONFIG_CXL_BUS given cxl_core can be built as a module or disabled.
> +
> #endif /* __CXL_CXL_H__ */
^ permalink raw reply [flat|nested] 54+ messages in thread
* RE: [PATCH 02/20] cxl: Expose cxl subsystem specific functions for vfio
2026-03-12 16:49 ` Dave Jiang
@ 2026-03-13 10:05 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-13 10:05 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang, Krishnakant Jaju,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 12 March 2026 22:19
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 02/20] cxl: Expose cxl subsystem specific functions
> for vfio
>
>
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > Below functions from CXL subsystem will be required in vfio-cxl for
> > supporting the type-2 device passthrough:
> > cxl_find_regblock - To find component registers
> > cxl_probe_component_regs - Probe HDM/RAS capabilities
> >
> > Make these functions available via declaring them from include header
> > instead of subsystem-specific header.
> >
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> > ---
> > drivers/cxl/cxl.h | 4 ----
> > include/cxl/cxl.h | 7 +++++++
> > 2 files changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index
> > 2b1f7d687a0e..10ddab3949ee 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -198,8 +198,6 @@ static inline int ways_to_eiw(unsigned int ways, u8
> *eiw)
> > #define CXLDEV_MBOX_BG_CMD_COMMAND_VENDOR_MASK GENMASK_ULL(63, 48)
> > #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
> >
> > -void cxl_probe_component_regs(struct device *dev, void __iomem *base,
> > - struct cxl_component_reg_map *map);
> > void cxl_probe_device_regs(struct device *dev, void __iomem *base,
> > struct cxl_device_reg_map *map); int
> > cxl_map_device_regs(const struct cxl_register_map *map, @@ -211,8
> > +209,6 @@ enum cxl_regloc_type; int cxl_count_regblock(struct pci_dev
> > *pdev, enum cxl_regloc_type type); int
> > cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_type
> type,
> > struct cxl_register_map *map, unsigned
> > int index); -int cxl_find_regblock(struct pci_dev *pdev, enum
> cxl_regloc_type type,
> > - struct cxl_register_map *map);
> > int cxl_setup_regs(struct cxl_register_map *map); struct cxl_dport;
> > int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport
> > *dport); diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h index
> > 8456177b523e..610711e861d4 100644
> > --- a/include/cxl/cxl.h
> > +++ b/include/cxl/cxl.h
> > @@ -287,4 +287,11 @@ struct cxl_region *cxl_create_region(struct
> > cxl_root_decoder *cxlrd,
> >
> > int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
> > resource_size_t *offset, resource_size_t
> > *size);
> > +struct pci_dev;
> > +enum cxl_regloc_type;
> > +int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
> > + struct cxl_register_map *map); void
> > +cxl_probe_component_regs(struct device *dev, void __iomem *base,
> > + struct cxl_component_reg_map *map);
>
> I do wonder if this needs an ifdef of CONFIG_CXL_BUS given cxl_core can be
> made as a module or disabled.
>
Okay, I will update this section with #ifdef protection.
>
> > +
> > #endif /* __CXL_CXL_H__ */
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 03/20] cxl: Move CXL spec defines to public header
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
2026-03-11 20:34 ` [PATCH 01/20] cxl: Introduce cxl_get_hdm_reg_info() mhonap
2026-03-11 20:34 ` [PATCH 02/20] cxl: Expose cxl subsystem specific functions for vfio mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-13 12:18 ` Jonathan Cameron
2026-03-11 20:34 ` [PATCH 04/20] cxl: Media ready check refactoring mhonap
` (16 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
The HDM decoder capability structure and component register block size
defines need to be used by the VFIO subsystem. Move these macros from
the private CXL header to the public one.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/cxl.h | 30 ------------------------------
include/cxl/cxl.h | 30 ++++++++++++++++++++++++++++++
2 files changed, 30 insertions(+), 30 deletions(-)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 10ddab3949ee..7146059e0dae 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -24,9 +24,6 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
* (port-driver, region-driver, nvdimm object-drivers... etc).
*/
-/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
-#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
-
/* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
#define CXL_CM_OFFSET 0x1000
#define CXL_CM_CAP_HDR_OFFSET 0x0
@@ -39,33 +36,6 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
#define CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
#define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
-/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
-#define CXL_HDM_DECODER_CAP_OFFSET 0x0
-#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
-#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
-#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
-#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
-#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
-#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
-#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
-#define CXL_HDM_DECODER_ENABLE BIT(1)
-#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
-#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
-#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
-#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
-#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
-#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
-#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
-#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
-#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
-#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
-#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
-#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
-#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
-#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
-#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
-#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
-
/* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
#define CXL_DECODER_MIN_GRANULARITY 256
#define CXL_DECODER_MAX_ENCODED_IG 6
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 610711e861d4..27c006fa53c3 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -75,6 +75,36 @@ struct cxl_regs {
#define CXL_CM_CAP_CAP_ID_HDM 0x5
#define CXL_CM_CAP_CAP_HDM_VERSION 1
+/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
+#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
+
+/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
+#define CXL_HDM_DECODER_CAP_OFFSET 0x0
+#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
+#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
+#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
+#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
+#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
+#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
+#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
+#define CXL_HDM_DECODER_ENABLE BIT(1)
+#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
+#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
+#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
+#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
+#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
+#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
+#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
+#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
+#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
+#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
+#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
+#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
+#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
+#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
+#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+
struct cxl_reg_map {
bool valid;
int id;
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 03/20] cxl: Move CXL spec defines to public header
2026-03-11 20:34 ` [PATCH 03/20] cxl: Move CXL spec defines to public header mhonap
@ 2026-03-13 12:18 ` Jonathan Cameron
2026-03-13 16:56 ` Dave Jiang
0 siblings, 1 reply; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-13 12:18 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Thu, 12 Mar 2026 02:04:23 +0530
mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> HDM decoder capability structure and component reg block size
> needs to be used by VFIO subsystem. Move the macros from private
> CXL header to public one.
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
Not really related to this patch, but...
Maybe this is the time to think about doing a PCI like
uapi/linux/cxl_regs.h header?
So far we have these duplicated in at least a couple of code
bases and that is less than ideal!
Move seems fine to me otherwise.
> ---
> drivers/cxl/cxl.h | 30 ------------------------------
> include/cxl/cxl.h | 30 ++++++++++++++++++++++++++++++
> 2 files changed, 30 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 10ddab3949ee..7146059e0dae 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -24,9 +24,6 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
> * (port-driver, region-driver, nvdimm object-drivers... etc).
> */
>
> -/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
> -#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
> -
> /* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
> #define CXL_CM_OFFSET 0x1000
> #define CXL_CM_CAP_HDR_OFFSET 0x0
> @@ -39,33 +36,6 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
> #define CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
> #define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
>
> -/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
> -#define CXL_HDM_DECODER_CAP_OFFSET 0x0
> -#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
> -#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
> -#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
> -#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
> -#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
> -#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
> -#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
> -#define CXL_HDM_DECODER_ENABLE BIT(1)
> -#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
> -#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
> -#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
> -#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
> -#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
> -#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
> -#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
> -#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
> -#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
> -#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
> -#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
> -#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
> -#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
> -#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
> -#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
> -#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
> -
> /* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
> #define CXL_DECODER_MIN_GRANULARITY 256
> #define CXL_DECODER_MAX_ENCODED_IG 6
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index 610711e861d4..27c006fa53c3 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -75,6 +75,36 @@ struct cxl_regs {
> #define CXL_CM_CAP_CAP_ID_HDM 0x5
> #define CXL_CM_CAP_CAP_HDM_VERSION 1
>
> +/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
> +#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
> +
> +/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
> +#define CXL_HDM_DECODER_CAP_OFFSET 0x0
> +#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
> +#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
> +#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
> +#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
> +#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
> +#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
> +#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
> +#define CXL_HDM_DECODER_ENABLE BIT(1)
> +#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
> +#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
> +#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
> +#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
> +#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
> +#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
> +#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
> +#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
> +#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
> +#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
> +#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
> +#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
> +#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
> +#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
> +#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
> +#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
> +
> struct cxl_reg_map {
> bool valid;
> int id;
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 03/20] cxl: Move CXL spec defines to public header
2026-03-13 12:18 ` Jonathan Cameron
@ 2026-03-13 16:56 ` Dave Jiang
2026-03-18 14:56 ` Jonathan Cameron
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-13 16:56 UTC (permalink / raw)
To: Jonathan Cameron, mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, jgg, yishaih, kevin.tian, cjia,
targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/13/26 5:18 AM, Jonathan Cameron wrote:
> On Thu, 12 Mar 2026 02:04:23 +0530
> mhonap@nvidia.com wrote:
>
>> From: Manish Honap <mhonap@nvidia.com>
>>
>> HDM decoder capability structure and component reg block size
>> needs to be used by VFIO subsystem. Move the macros from private
>> CXL header to public one.
>>
>> Signed-off-by: Manish Honap <mhonap@nvidia.com>
>
> Not really related to this patch, but...
>
> Maybe this is the time to think about doing a PCI like
> uapi/linux/cxl_regs.h header?
Sounds reasonable. Maybe makes sense to be in uapi/cxl/? There's already a features.h header in there.
>
> So far we have these duplicated in at least a couple of code
> bases and that is less than ideal!
>
> Move seems fine to me otherwise.
>
>> ---
>> drivers/cxl/cxl.h | 30 ------------------------------
>> include/cxl/cxl.h | 30 ++++++++++++++++++++++++++++++
>> 2 files changed, 30 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index 10ddab3949ee..7146059e0dae 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -24,9 +24,6 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
>> * (port-driver, region-driver, nvdimm object-drivers... etc).
>> */
>>
>> -/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
>> -#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
>> -
>> /* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
>> #define CXL_CM_OFFSET 0x1000
>> #define CXL_CM_CAP_HDR_OFFSET 0x0
>> @@ -39,33 +36,6 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
>> #define CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
>> #define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
>>
>> -/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
>> -#define CXL_HDM_DECODER_CAP_OFFSET 0x0
>> -#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
>> -#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
>> -#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
>> -#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
>> -#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
>> -#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
>> -#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
>> -#define CXL_HDM_DECODER_ENABLE BIT(1)
>> -#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
>> -#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
>> -#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
>> -#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
>> -#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
>> -#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
>> -#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
>> -#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
>> -#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
>> -#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
>> -#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
>> -#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
>> -#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
>> -#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
>> -#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
>> -#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
>> -
>> /* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
>> #define CXL_DECODER_MIN_GRANULARITY 256
>> #define CXL_DECODER_MAX_ENCODED_IG 6
>> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
>> index 610711e861d4..27c006fa53c3 100644
>> --- a/include/cxl/cxl.h
>> +++ b/include/cxl/cxl.h
>> @@ -75,6 +75,36 @@ struct cxl_regs {
>> #define CXL_CM_CAP_CAP_ID_HDM 0x5
>> #define CXL_CM_CAP_CAP_HDM_VERSION 1
>>
>> +/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
>> +#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
>> +
>> +/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
>> +#define CXL_HDM_DECODER_CAP_OFFSET 0x0
>> +#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
>> +#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
>> +#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
>> +#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
>> +#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
>> +#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
>> +#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
>> +#define CXL_HDM_DECODER_ENABLE BIT(1)
>> +#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
>> +#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
>> +#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
>> +#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
>> +#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
>> +#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
>> +#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
>> +#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
>> +#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
>> +#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
>> +#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
>> +#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
>> +#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
>> +#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
>> +#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
>> +#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
>> +
>> struct cxl_reg_map {
>> bool valid;
>> int id;
>
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 03/20] cxl: Move CXL spec defines to public header
2026-03-13 16:56 ` Dave Jiang
@ 2026-03-18 14:56 ` Jonathan Cameron
2026-03-18 17:51 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-18 14:56 UTC (permalink / raw)
To: Dave Jiang
Cc: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Fri, 13 Mar 2026 09:56:10 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> On 3/13/26 5:18 AM, Jonathan Cameron wrote:
> > On Thu, 12 Mar 2026 02:04:23 +0530
> > mhonap@nvidia.com wrote:
> >
> >> From: Manish Honap <mhonap@nvidia.com>
> >>
> >> HDM decoder capability structure and component reg block size
> >> needs to be used by VFIO subsystem. Move the macros from private
> >> CXL header to public one.
> >>
> >> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> >
> > Not really related to this patch, but...
> >
> > Maybe this is the time to think about doing a PCI like
> > uapi/linux/cxl_regs.h header?
>
> Sounds reasonable. Maybe makes sense to be in uapi/cxl/? There's already a features.h header in there.
Sure. I'd forgotten about that directory.
Jonathan
^ permalink raw reply [flat|nested] 54+ messages in thread
* RE: [PATCH 03/20] cxl: Move CXL spec defines to public header
2026-03-18 14:56 ` Jonathan Cameron
@ 2026-03-18 17:51 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 17:51 UTC (permalink / raw)
To: Jonathan Cameron, Dave Jiang
Cc: Aniket Agashe, Ankit Agrawal, Alex Williamson, Vikram Sethi,
Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com, jgg@ziepe.ca,
Yishai Hadas, kevin.tian@intel.com, Neo Jia, Tarun Gupta (SW-GPU),
Zhi Wang, Krishnakant Jaju, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 18 March 2026 20:27
> To: Dave Jiang <dave.jiang@intel.com>
> Cc: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; alison.schofield@intel.com; vishal.l.verma@intel.com;
> ira.weiny@intel.com; dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas
> <yishaih@nvidia.com>; kevin.tian@intel.com; Neo Jia <cjia@nvidia.com>;
> Tarun Gupta (SW-GPU) <targupta@nvidia.com>; Zhi Wang <zhiw@nvidia.com>;
> Krishnakant Jaju <kjaju@nvidia.com>; linux-kernel@vger.kernel.org; linux-
> cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 03/20] cxl: Move CXL spec defines to public header
>
> External email: Use caution opening links or attachments
>
>
> On Fri, 13 Mar 2026 09:56:10 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
> > On 3/13/26 5:18 AM, Jonathan Cameron wrote:
> > > On Thu, 12 Mar 2026 02:04:23 +0530
> > > mhonap@nvidia.com wrote:
> > >
> > >> From: Manish Honap <mhonap@nvidia.com>
> > >>
> > >> HDM decoder capability structure and component reg block size needs
> > >> to be used by VFIO subsystem. Move the macros from private CXL
> > >> header to public one.
> > >>
> > >> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> > >
> > > Not really related to this patch, but...
> > >
> > > Maybe this is the time to think about doing a PCI like
> > > uapi/linux/cxl_regs.h header?
> >
> > Sounds reasonable. Maybe makes sense to be in uapi/cxl/? There's already
> a features.h header in there.
>
> Sure. I'd forgotten about that directory.
> Jonathan
Okay, I have moved the CXL r4.0 spec-defined register bits and flags to
include/uapi/cxl/cxl_regs.h, but the bits/flags which are required for
emulation handling I have kept in vfio_cxl_priv.h.
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 04/20] cxl: Media ready check refactoring
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (2 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 03/20] cxl: Move CXL spec defines to public header mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-12 20:29 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 05/20] cxl: Expose BAR index and offset from register map mhonap
` (15 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Before accessing CXL device memory after reset/power-on, the driver
must ensure media is ready. Not every CXL device implements the CXL
Memory Device register group (many Type-2 devices do not).
cxl_await_media_ready() reads cxlds->regs.memdev; calling it
on a Type-2 without that block can result in kernel panic.
Refactor the HDM range based check into a new function which can be
safely used for both Type-2 and Type-3 devices. cxl_await_media_ready()
keeps the same flow of checking the HDM ranges followed by the memory
device status register.
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
include/cxl/cxl.h | 1 +
2 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 52ed0b4f5e78..2b7e4d73a6dd 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
return 0;
}
-/*
- * Wait up to @media_ready_timeout for the device to report memory
- * active.
+/**
+ * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to be active
+ * @cxlds: CXL device state (DVSEC and HDM count must be valid)
+ *
+ * For each HDM decoder range reported in the CXL DVSEC capability, waits for
+ * the range to report MEM INFO VALID (up to 1s per range), then MEM ACTIVE
+ * (up to media_ready_timeout seconds per range, default 60s). Used by
+ * cxl_await_media_ready() and by callers that only need range readiness
+ * without checking the memory device status register.
+ *
+ * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a timeout
+ * occurs, or a negative errno from config read on failure.
*/
-int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int cxl_await_range_active(struct cxl_dev_state *cxlds)
{
struct pci_dev *pdev = to_pci_dev(cxlds->dev);
int d = cxlds->cxl_dvsec;
int rc, i, hdm_count;
- u64 md_status;
u16 cap;
rc = pci_read_config_word(pdev,
@@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
return rc;
}
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
+
+/*
+ * Wait up to @media_ready_timeout for the device to report memory
+ * active.
+ */
+int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+{
+ u64 md_status;
+ int rc;
+
+ rc = cxl_await_range_active(cxlds);
+ if (rc)
+ return rc;
+
md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
if (!CXLMDEV_READY(md_status))
return -EIO;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 27c006fa53c3..684603799fb1 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -323,5 +323,6 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
struct cxl_register_map *map);
void cxl_probe_component_regs(struct device *dev, void __iomem *base,
struct cxl_component_reg_map *map);
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
#endif /* __CXL_CXL_H__ */
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 04/20] cxl: Media ready check refactoring
2026-03-11 20:34 ` [PATCH 04/20] cxl: Media ready check refactoring mhonap
@ 2026-03-12 20:29 ` Dave Jiang
2026-03-13 10:05 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-12 20:29 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> Before accessing CXL device memory after reset/power-on, the driver
> must ensure media is ready. Not every CXL device implements the CXL
> Memory Device register group (many Type-2 devices do not).
> cxl_await_media_ready() reads cxlds->regs.memdev; calling it
> on a Type-2 without that block can result in kernel panic.
please consider:
Accessing the memory device registers on a Type-2 device may result in a kernel panic.
>
> This commit refactors the HDM range based check in a new function which
> can be safely used for type-2 and type-3 devices. cxl_await_media_ready
> still uses the same format of checking the HDM range and memory device
> register status.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
> include/cxl/cxl.h | 1 +
> 2 files changed, 31 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 52ed0b4f5e78..2b7e4d73a6dd 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
> return 0;
> }
>
> -/*
> - * Wait up to @media_ready_timeout for the device to report memory
> - * active.
> +/**
> + * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to be active
> + * @cxlds: CXL device state (DVSEC and HDM count must be valid)
> + *
> + * For each HDM decoder range reported in the CXL DVSEC capability, waits for
> + * the range to report MEM INFO VALID (up to 1s per range), then MEM ACTIVE
> + * (up to media_ready_timeout seconds per range, default 60s). Used by
> + * cxl_await_media_ready() and by callers that only need range readiness
> + * without checking the memory device status register.
> + *
> + * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a timeout
> + * occurs, or a negative errno from config read on failure.
> */
> -int cxl_await_media_ready(struct cxl_dev_state *cxlds)
> +int cxl_await_range_active(struct cxl_dev_state *cxlds)
> {
> struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> int d = cxlds->cxl_dvsec;
> int rc, i, hdm_count;
> - u64 md_status;
> u16 cap;
>
> rc = pci_read_config_word(pdev,
> @@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
> return rc;
> }
>
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
> +
> +/*
> + * Wait up to @media_ready_timeout for the device to report memory
> + * active.
> + */
> +int cxl_await_media_ready(struct cxl_dev_state *cxlds)
> +{
> + u64 md_status;
> + int rc;
> +
> + rc = cxl_await_range_active(cxlds);
> + if (rc)
> + return rc;
> +
> md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
> if (!CXLMDEV_READY(md_status))
> return -EIO;
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index 27c006fa53c3..684603799fb1 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -323,5 +323,6 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
> struct cxl_register_map *map);
> void cxl_probe_component_regs(struct device *dev, void __iomem *base,
> struct cxl_component_reg_map *map);
> +int cxl_await_range_active(struct cxl_dev_state *cxlds);
>
> #endif /* __CXL_CXL_H__ */
^ permalink raw reply [flat|nested] 54+ messages in thread
* RE: [PATCH 04/20] cxl: Media ready check refactoring
2026-03-12 20:29 ` Dave Jiang
@ 2026-03-13 10:05 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-13 10:05 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang, Krishnakant Jaju,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 13 March 2026 01:59
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 04/20] cxl: Media ready check refactoring
>
> External email: Use caution opening links or attachments
>
>
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > Before accessing CXL device memory after reset/power-on, the driver
> > must ensure media is ready. Not every CXL device implements the CXL
> > Memory Device register group (many Type-2 devices do not).
> > cxl_await_media_ready() reads cxlds->regs.memdev; calling it on a
> > Type-2 without that block can result in kernel panic.
>
> please consider:
> Accessing the memory device registers on a Type-2 device may result in a
> kernel panic.
Okay, I will update this wording accordingly.
>
> >
> > This commit refactors the HDM range based check in a new function
> > which can be safely used for type-2 and type-3 devices.
> > cxl_await_media_ready still uses the same format of checking the HDM
> > range and memory device register status.
> >
> > Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> > Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> > ---
> > drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
> > include/cxl/cxl.h | 1 +
> > 2 files changed, 31 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index
> > 52ed0b4f5e78..2b7e4d73a6dd 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct
> cxl_dev_state *cxlds, int id)
> > return 0;
> > }
> >
> > -/*
> > - * Wait up to @media_ready_timeout for the device to report memory
> > - * active.
> > +/**
> > + * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to
> > +be active
> > + * @cxlds: CXL device state (DVSEC and HDM count must be valid)
> > + *
> > + * For each HDM decoder range reported in the CXL DVSEC capability,
> > +waits for
> > + * the range to report MEM INFO VALID (up to 1s per range), then MEM
> > +ACTIVE
> > + * (up to media_ready_timeout seconds per range, default 60s). Used
> > +by
> > + * cxl_await_media_ready() and by callers that only need range
> > +readiness
> > + * without checking the memory device status register.
> > + *
> > + * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a
> > +timeout
> > + * occurs, or a negative errno from config read on failure.
> > */
> > -int cxl_await_media_ready(struct cxl_dev_state *cxlds)
> > +int cxl_await_range_active(struct cxl_dev_state *cxlds)
> > {
> > struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> > int d = cxlds->cxl_dvsec;
> > int rc, i, hdm_count;
> > - u64 md_status;
> > u16 cap;
> >
> > rc = pci_read_config_word(pdev,
> > @@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state
> *cxlds)
> > return rc;
> > }
> >
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
> > +
> > +/*
> > + * Wait up to @media_ready_timeout for the device to report memory
> > + * active.
> > + */
> > +int cxl_await_media_ready(struct cxl_dev_state *cxlds) {
> > + u64 md_status;
> > + int rc;
> > +
> > + rc = cxl_await_range_active(cxlds);
> > + if (rc)
> > + return rc;
> > +
> > md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
> > if (!CXLMDEV_READY(md_status))
> > return -EIO;
> > diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h index
> > 27c006fa53c3..684603799fb1 100644
> > --- a/include/cxl/cxl.h
> > +++ b/include/cxl/cxl.h
> > @@ -323,5 +323,6 @@ int cxl_find_regblock(struct pci_dev *pdev, enum
> cxl_regloc_type type,
> > struct cxl_register_map *map); void
> > cxl_probe_component_regs(struct device *dev, void __iomem *base,
> > struct cxl_component_reg_map *map);
> > +int cxl_await_range_active(struct cxl_dev_state *cxlds);
> >
> > #endif /* __CXL_CXL_H__ */
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 05/20] cxl: Expose BAR index and offset from register map
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (3 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 04/20] cxl: Media ready check refactoring mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-12 20:58 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 06/20] vfio/cxl: Add UAPI for CXL Type-2 device passthrough mhonap
` (14 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
The Register Locator DVSEC (CXL 2.0 8.1.9) describes register blocks by
BAR index (BIR) and offset within the BAR. CXL core currently only
stores the resolved HPA (resource + offset) in struct cxl_register_map,
so callers that need to use pci_iomap() or report the BAR to userspace
must reverse-engineer the BAR from the HPA.
Add bar_index and bar_offset to struct cxl_register_map and fill them
in cxl_decode_regblock() when the regblock is BAR-backed (BIR 0-5).
Add cxl_regblock_get_bar_info() so callers (e.g. vfio-cxl) can get BAR
index and offset directly and use pci_iomap() instead of ioremap(HPA).
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/regs.c | 29 +++++++++++++++++++++++++++++
include/cxl/cxl.h | 11 +++++++++++
2 files changed, 40 insertions(+)
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 20c2d9fbcfe7..720eb6eb5a45 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -287,9 +287,37 @@ static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
map->reg_type = reg_type;
map->resource = pci_resource_start(pdev, bar) + offset;
map->max_size = pci_resource_len(pdev, bar) - offset;
+ map->bar_index = (bar >= 0 && bar < PCI_STD_NUM_BARS) ? (u8)bar : 0xFF;
+ map->bar_offset = offset;
return true;
}
+/**
+ * cxl_regblock_get_bar_info() - Get BAR index and offset for a BAR-backed regblock
+ * @map: Register map from cxl_find_regblock() or cxl_find_regblock_instance()
+ * @bar_index: Output BAR index (0-5). Optional, may be NULL.
+ * @bar_offset: Output offset within the BAR. Optional, may be NULL.
+ *
+ * When the register block was found via the Register Locator DVSEC and
+ * lives in a PCI BAR (BIR 0-5), this returns the BAR index and the offset
+ * within that BAR. Callers can use pci_iomap(pdev, bar_index, size) and
+ * base + bar_offset instead of ioremap(map->resource).
+ *
+ * Return: 0 if the regblock is BAR-backed (bar_index <= 5), -EINVAL otherwise.
+ */
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8 *bar_index,
+ resource_size_t *bar_offset)
+{
+ if (!map || map->bar_index > PCI_STD_NUM_BARS - 1)
+ return -EINVAL;
+ if (bar_index)
+ *bar_index = map->bar_index;
+ if (bar_offset)
+ *bar_offset = map->bar_offset;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
+
/*
* __cxl_find_regblock_instance() - Locate a register block or count instances by type / index
* Use CXL_INSTANCES_COUNT for @index if counting instances.
@@ -308,6 +336,7 @@ static int __cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_ty
*map = (struct cxl_register_map) {
.host = &pdev->dev,
+ .bar_index = 0xFF,
.resource = CXL_RESOURCE_NONE,
};
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 684603799fb1..08e327a929ba 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -134,9 +134,16 @@ struct cxl_pmu_reg_map {
* @resource: physical resource base of the register block
* @max_size: maximum mapping size to perform register search
* @reg_type: see enum cxl_regloc_type
+ * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xFF otherwise
+ * @bar_offset: offset within the BAR; only valid when bar_index <= 5
* @component_map: cxl_reg_map for component registers
* @device_map: cxl_reg_maps for device registers
* @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
+ *
+ * When the register block is described by the Register Locator DVSEC with
+ * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so callers can
+ * use pci_iomap(pdev, bar_index, size) and base + bar_offset instead of
+ * ioremap(resource).
*/
struct cxl_register_map {
struct device *host;
@@ -144,6 +151,8 @@ struct cxl_register_map {
resource_size_t resource;
resource_size_t max_size;
u8 reg_type;
+ u8 bar_index;
+ resource_size_t bar_offset;
union {
struct cxl_component_reg_map component_map;
struct cxl_device_reg_map device_map;
@@ -319,6 +328,8 @@ int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
resource_size_t *offset, resource_size_t *size);
struct pci_dev;
enum cxl_regloc_type;
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8 *bar_index,
+ resource_size_t *bar_offset);
int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
struct cxl_register_map *map);
void cxl_probe_component_regs(struct device *dev, void __iomem *base,
--
2.25.1
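For reference, the validity check in cxl_regblock_get_bar_info() can be exercised stand-alone. The sketch below mocks struct cxl_register_map, PCI_STD_NUM_BARS, and resource_size_t in user space purely for illustration; it mirrors the logic in the patch above but is not kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mocks of the kernel definitions, for illustration only */
#define PCI_STD_NUM_BARS 6
typedef uint64_t resource_size_t;

struct cxl_register_map {
	uint8_t bar_index;          /* 0-5 when BAR-backed, 0xFF otherwise */
	resource_size_t bar_offset; /* offset within the BAR */
};

/* Mirrors cxl_regblock_get_bar_info() from the patch */
int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
			      uint8_t *bar_index,
			      resource_size_t *bar_offset)
{
	if (!map || map->bar_index > PCI_STD_NUM_BARS - 1)
		return -EINVAL;
	if (bar_index)
		*bar_index = map->bar_index;
	if (bar_offset)
		*bar_offset = map->bar_offset;
	return 0;
}
```

A caller that gets 0 back can then use pci_iomap(pdev, bar_index, size) plus bar_offset; the 0xFF sentinel (set by the map initializer) makes non-BAR-backed regblocks fail the range check.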
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 05/20] cxl: Expose BAR index and offset from register map
2026-03-11 20:34 ` [PATCH 05/20] cxl: Expose BAR index and offset from register map mhonap
@ 2026-03-12 20:58 ` Dave Jiang
2026-03-13 10:11 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-12 20:58 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> The Register Locator DVSEC (CXL 2.0 8.1.9) describes register blocks by
CXL r4.0
Let's keep it to the latest spec version.
> BAR index (BIR) and offset within the BAR. CXL core currently only
> stores the resolved HPA (resource + offset) in struct cxl_register_map,
> so callers that need to use pci_iomap() or report the BAR to userspace
> must reverse-engineer the BAR from the HPA.
>
> Add bar_index and bar_offset to struct cxl_register_map and fill them
> in cxl_decode_regblock() when the regblock is BAR-backed (BIR 0-5).
> Add cxl_regblock_get_bar_info() so callers (e.g. vfio-cxl) can get BAR
> index and offset directly and use pci_iomap() instead of ioremap(HPA).
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/cxl/core/regs.c | 29 +++++++++++++++++++++++++++++
> include/cxl/cxl.h | 11 +++++++++++
> 2 files changed, 40 insertions(+)
>
> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> index 20c2d9fbcfe7..720eb6eb5a45 100644
> --- a/drivers/cxl/core/regs.c
> +++ b/drivers/cxl/core/regs.c
> @@ -287,9 +287,37 @@ static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
> map->reg_type = reg_type;
> map->resource = pci_resource_start(pdev, bar) + offset;
> map->max_size = pci_resource_len(pdev, bar) - offset;
> + map->bar_index = (bar >= 0 && bar < PCI_STD_NUM_BARS) ? (u8)bar : 0xFF;
map->bar_index = bar should be fine. Otherwise the pci_resource_start() and pci_resource_len() would also be invalid.
> + map->bar_offset = offset;
> return true;
> }
>
> +/**
> + * cxl_regblock_get_bar_info() - Get BAR index and offset for a BAR-backed regblock
> + * @map: Register map from cxl_find_regblock() or cxl_find_regblock_instance()
> + * @bar_index: Output BAR index (0-5). Optional, may be NULL.
> + * @bar_offset: Output offset within the BAR. Optional, may be NULL.
> + *
> + * When the register block was found via the Register Locator DVSEC and
> + * lives in a PCI BAR (BIR 0-5), this returns the BAR index and the offset
> + * within that BAR. Callers can use pci_iomap(pdev, bar_index, size) and
> + * base + bar_offset instead of ioremap(map->resource).
> + *
> + * Return: 0 if the regblock is BAR-backed (bar_index <= 5), -EINVAL otherwise.
> + */
> +int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8 *bar_index,
> + resource_size_t *bar_offset)
> +{
> + if (!map || map->bar_index > PCI_STD_NUM_BARS - 1)
map->bar_index == 0xff? Otherwise it's probably a hardware issue right?
DJ
> + return -EINVAL;
> + if (bar_index)
> + *bar_index = map->bar_index;
> + if (bar_offset)
> + *bar_offset = map->bar_offset;
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
> +
> /*
> * __cxl_find_regblock_instance() - Locate a register block or count instances by type / index
> * Use CXL_INSTANCES_COUNT for @index if counting instances.
> @@ -308,6 +336,7 @@ static int __cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_ty
>
> *map = (struct cxl_register_map) {
> .host = &pdev->dev,
> + .bar_index = 0xFF,
> .resource = CXL_RESOURCE_NONE,
> };
>
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index 684603799fb1..08e327a929ba 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -134,9 +134,16 @@ struct cxl_pmu_reg_map {
> * @resource: physical resource base of the register block
> * @max_size: maximum mapping size to perform register search
> * @reg_type: see enum cxl_regloc_type
> + * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xFF otherwise
> + * @bar_offset: offset within the BAR; only valid when bar_index <= 5
> * @component_map: cxl_reg_map for component registers
> * @device_map: cxl_reg_maps for device registers
> * @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
> + *
> + * When the register block is described by the Register Locator DVSEC with
> + * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so callers can
> + * use pci_iomap(pdev, bar_index, size) and base + bar_offset instead of
> + * ioremap(resource).
> */
> struct cxl_register_map {
> struct device *host;
> @@ -144,6 +151,8 @@ struct cxl_register_map {
> resource_size_t resource;
> resource_size_t max_size;
> u8 reg_type;
> + u8 bar_index;
> + resource_size_t bar_offset;
> union {
> struct cxl_component_reg_map component_map;
> struct cxl_device_reg_map device_map;
> @@ -319,6 +328,8 @@ int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
> resource_size_t *offset, resource_size_t *size);
> struct pci_dev;
> enum cxl_regloc_type;
> +int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8 *bar_index,
> + resource_size_t *bar_offset);
> int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
> struct cxl_register_map *map);
> void cxl_probe_component_regs(struct device *dev, void __iomem *base,
* RE: [PATCH 05/20] cxl: Expose BAR index and offset from register map
2026-03-12 20:58 ` Dave Jiang
@ 2026-03-13 10:11 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-13 10:11 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang, Krishnakant Jaju,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 13 March 2026 02:29
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 05/20] cxl: Expose BAR index and offset from register
> map
>
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > The Register Locator DVSEC (CXL 2.0 8.1.9) describes register blocks
> > by
>
> CXL r4.0
>
> Let's keep it to the latest spec version.
Okay, I will update the spec version references to r4.0
>
> > BAR index (BIR) and offset within the BAR. CXL core currently only
> > stores the resolved HPA (resource + offset) in struct
> > cxl_register_map, so callers that need to use pci_iomap() or report
> > the BAR to userspace must reverse-engineer the BAR from the HPA.
> >
> > Add bar_index and bar_offset to struct cxl_register_map and fill them
> > in cxl_decode_regblock() when the regblock is BAR-backed (BIR 0-5).
> > Add cxl_regblock_get_bar_info() so callers (e.g. vfio-cxl) can get BAR
> > index and offset directly and use pci_iomap() instead of ioremap(HPA).
> >
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> > ---
> > drivers/cxl/core/regs.c | 29 +++++++++++++++++++++++++++++
> > include/cxl/cxl.h | 11 +++++++++++
> > 2 files changed, 40 insertions(+)
> >
> > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c index
> > 20c2d9fbcfe7..720eb6eb5a45 100644
> > --- a/drivers/cxl/core/regs.c
> > +++ b/drivers/cxl/core/regs.c
> > @@ -287,9 +287,37 @@ static bool cxl_decode_regblock(struct pci_dev
> *pdev, u32 reg_lo, u32 reg_hi,
> > map->reg_type = reg_type;
> > map->resource = pci_resource_start(pdev, bar) + offset;
> > map->max_size = pci_resource_len(pdev, bar) - offset;
> > + map->bar_index = (bar >= 0 && bar < PCI_STD_NUM_BARS) ? (u8)bar
> > + : 0xFF;
>
> map->bar_index = bar should be fine. Otherwise the pci_resource_start()
> and pci_resource_len() would also be invalid.
Okay, I will update this.
>
>
> > + map->bar_offset = offset;
> > return true;
> > }
> >
> > +/**
> > + * cxl_regblock_get_bar_info() - Get BAR index and offset for a
> > +BAR-backed regblock
> > + * @map: Register map from cxl_find_regblock() or
> > +cxl_find_regblock_instance()
> > + * @bar_index: Output BAR index (0-5). Optional, may be NULL.
> > + * @bar_offset: Output offset within the BAR. Optional, may be NULL.
> > + *
> > + * When the register block was found via the Register Locator DVSEC
> > +and
> > + * lives in a PCI BAR (BIR 0-5), this returns the BAR index and the
> > +offset
> > + * within that BAR. Callers can use pci_iomap(pdev, bar_index, size)
> > +and
> > + * base + bar_offset instead of ioremap(map->resource).
> > + *
> > + * Return: 0 if the regblock is BAR-backed (bar_index <= 5), -EINVAL
> otherwise.
> > + */
> > +int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8
> *bar_index,
> > + resource_size_t *bar_offset) {
> > + if (!map || map->bar_index > PCI_STD_NUM_BARS - 1)
>
> map->bar_index == 0xff? Otherwise it's probably a hardware issue right?
>
Okay, agreed.
> DJ
>
> > + return -EINVAL;
> > + if (bar_index)
> > + *bar_index = map->bar_index;
> > + if (bar_offset)
> > + *bar_offset = map->bar_offset;
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
> > +
> > /*
> > * __cxl_find_regblock_instance() - Locate a register block or count
> instances by type / index
> > * Use CXL_INSTANCES_COUNT for @index if counting instances.
> > @@ -308,6 +336,7 @@ static int __cxl_find_regblock_instance(struct
> > pci_dev *pdev, enum cxl_regloc_ty
> >
> > *map = (struct cxl_register_map) {
> > .host = &pdev->dev,
> > + .bar_index = 0xFF,
> > .resource = CXL_RESOURCE_NONE,
> > };
> >
> > diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h index
> > 684603799fb1..08e327a929ba 100644
> > --- a/include/cxl/cxl.h
> > +++ b/include/cxl/cxl.h
> > @@ -134,9 +134,16 @@ struct cxl_pmu_reg_map {
> > * @resource: physical resource base of the register block
> > * @max_size: maximum mapping size to perform register search
> > * @reg_type: see enum cxl_regloc_type
> > + * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xFF
> > + otherwise
> > + * @bar_offset: offset within the BAR; only valid when bar_index <= 5
> > * @component_map: cxl_reg_map for component registers
> > * @device_map: cxl_reg_maps for device registers
> > * @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
> > + *
> > + * When the register block is described by the Register Locator DVSEC
> > + with
> > + * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so
> > + callers can
> > + * use pci_iomap(pdev, bar_index, size) and base + bar_offset instead
> > + of
> > + * ioremap(resource).
> > */
> > struct cxl_register_map {
> > struct device *host;
> > @@ -144,6 +151,8 @@ struct cxl_register_map {
> > resource_size_t resource;
> > resource_size_t max_size;
> > u8 reg_type;
> > + u8 bar_index;
> > + resource_size_t bar_offset;
> > union {
> > struct cxl_component_reg_map component_map;
> > struct cxl_device_reg_map device_map; @@ -319,6 +328,8
> > @@ int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u32 *count,
> > resource_size_t *offset, resource_size_t
> > *size); struct pci_dev; enum cxl_regloc_type;
> > +int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8
> *bar_index,
> > + resource_size_t *bar_offset);
> > int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
> > struct cxl_register_map *map); void
> > cxl_probe_component_regs(struct device *dev, void __iomem *base,
* [PATCH 06/20] vfio/cxl: Add UAPI for CXL Type-2 device passthrough
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (4 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 05/20] cxl: Expose BAR index and offset from register map mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-12 21:04 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 07/20] vfio/pci: Add CXL state to vfio_pci_core_device mhonap
` (13 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
CXL capabilities include:
- hdm_count: Number of HDM decoders available
- dpa_size: Total device memory (DPA) capacity
- flags: COMMITTED, PRECOMMITTED
This UAPI enables VMMs such as QEMU to pass CXL Type-2 devices
(GPUs, accelerators) with coherent memory through to VMs.
Also add user-kernel API definitions for CXL Type-2 device passthrough.
Document how VFIO_DEVICE_FLAGS_CXL relates to VFIO_DEVICE_FLAGS_PCI
and VFIO_DEVICE_FLAGS_CAPS, and add field and flag descriptions
for the CXL capability.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
include/uapi/linux/vfio.h | 52 +++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ac2329f24141..7ec0f96cc2d9 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -215,6 +215,13 @@ struct vfio_device_info {
#define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6) /* vfio-fsl-mc device */
#define VFIO_DEVICE_FLAGS_CAPS (1 << 7) /* Info supports caps */
#define VFIO_DEVICE_FLAGS_CDX (1 << 8) /* vfio-cdx device */
+/*
+ * CXL Type-2 device (memory coherent; e.g. GPU, accelerator). When set,
+ * VFIO_DEVICE_FLAGS_PCI is also set (same device is a PCI device). The
+ * capability chain (VFIO_DEVICE_FLAGS_CAPS) contains VFIO_DEVICE_INFO_CAP_CXL
+ * describing HDM decoders, DPA size, and CXL-specific options.
+ */
+#define VFIO_DEVICE_FLAGS_CXL (1 << 9) /* Device supports CXL */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
__u32 cap_offset; /* Offset within info struct of first cap */
@@ -257,6 +264,39 @@ struct vfio_device_info_cap_pci_atomic_comp {
__u32 reserved;
};
+/*
+ * VFIO_DEVICE_INFO_CAP_CXL - CXL Type-2 device capability
+ *
+ * Present in the device info capability chain when VFIO_DEVICE_FLAGS_CXL
+ * is set. Describes Host Managed Device Memory (HDM) layout and CXL
+ * memory options so that userspace (e.g. QEMU) can expose the CXL region
+ * and component registers correctly to the guest.
+ */
+#define VFIO_DEVICE_INFO_CAP_CXL 6
+struct vfio_device_info_cap_cxl {
+ struct vfio_info_cap_header header;
+ __u8 hdm_count; /* Number of HDM decoders */
+ __u8 hdm_regs_bar_index; /* PCI BAR containing HDM registers */
+ __u16 pad;
+ __u32 flags;
+/* Decoder was committed by host firmware/BIOS */
+#define VFIO_CXL_CAP_COMMITTED (1 << 0)
+/*
+ * Memory was pre-committed (firmware-programmed); VMM need not allocate
+ * from CXL pool
+ */
+#define VFIO_CXL_CAP_PRECOMMITTED (1 << 1)
+ __u64 hdm_regs_size; /* Size in bytes of HDM register block */
+ __u64 hdm_regs_offset; /* Byte offset within the BAR to the HDM decoder block */
+ __u64 dpa_size; /* Device Physical Address (DPA) size in bytes */
+ /*
+ * Region indices for the two CXL VFIO device regions.
+ * Avoids forcing userspace to scan all regions by type/subtype.
+ */
+ __u32 dpa_region_index; /* VFIO_REGION_SUBTYPE_CXL */
+ __u32 comp_regs_region_index; /* VFIO_REGION_SUBTYPE_CXL_COMP_REGS */
+};
+
/**
* VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
* struct vfio_region_info)
@@ -370,6 +410,18 @@ struct vfio_region_info_cap_type {
*/
#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
+/* 1e98 vendor PCI sub-types (CXL Consortium) */
+/*
+ * CXL memory region. Use with region type
+ * (PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE).
+ * DPA memory region (fault+zap mmap)
+ */
+#define VFIO_REGION_SUBTYPE_CXL (1)
+/*
+ * HDM decoder register emulation region (read/write only, no mmap).
+ */
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2)
+
/* sub-types for VFIO_REGION_TYPE_GFX */
#define VFIO_REGION_SUBTYPE_GFX_EDID (1)
--
2.25.1
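Userspace consumes this capability by walking the vfio_info_cap_header chain starting at cap_offset. The sketch below is a hypothetical user-space walker over a mocked buffer: the header layout follows the vfio UAPI, but vfio_find_cap() is an illustrative helper, not an existing API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VFIO_DEVICE_INFO_CAP_CXL 6

/* Matches the vfio UAPI capability header layout */
struct vfio_info_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next; /* offset of next cap from start of info struct, 0 = end */
};

/* Illustrative helper: walk the capability chain inside an info buffer
 * and return the matching header, or NULL if the id is not present. */
static struct vfio_info_cap_header *
vfio_find_cap(void *info, uint32_t cap_offset, uint16_t id)
{
	while (cap_offset) {
		struct vfio_info_cap_header *h =
			(struct vfio_info_cap_header *)((char *)info + cap_offset);
		if (h->id == id)
			return h;
		cap_offset = h->next;
	}
	return NULL;
}
```

Once the VFIO_DEVICE_INFO_CAP_CXL header is found, the VMM casts it to struct vfio_device_info_cap_cxl and can jump straight to dpa_region_index and comp_regs_region_index instead of scanning all regions.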
* Re: [PATCH 06/20] vfio/cxl: Add UAPI for CXL Type-2 device passthrough
2026-03-11 20:34 ` [PATCH 06/20] vfio/cxl: Add UAPI for CXL Type-2 device passthrough mhonap
@ 2026-03-12 21:04 ` Dave Jiang
0 siblings, 0 replies; 54+ messages in thread
From: Dave Jiang @ 2026-03-12 21:04 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> CXL capabilities include:
> - hdm_count: Number of HDM decoders available
> - capacity: Total device memory (DPA)
> - flags: COMMITTED, PRECOMMITTED
>
> This UAPI enables VMMs like QEMU to passthrough CXL Type-2 devices
> (GPUs, accelerators) with coherent memory to VMs.
>
> Also added user-kernel API definitions for CXL Type-2 device passthrough.
> Document how VFIO_DEVICE_FLAGS_CXL relates to VFIO_DEVICE_FLAGS_PCI
> and VFIO_DEVICE_FLAGS_CAPS, and add field and flag descriptions
> for the CXL capability.
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> include/uapi/linux/vfio.h | 52 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 52 insertions(+)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ac2329f24141..7ec0f96cc2d9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -215,6 +215,13 @@ struct vfio_device_info {
> #define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6) /* vfio-fsl-mc device */
> #define VFIO_DEVICE_FLAGS_CAPS (1 << 7) /* Info supports caps */
> #define VFIO_DEVICE_FLAGS_CDX (1 << 8) /* vfio-cdx device */
> +/*
> + * CXL Type-2 device (memory coherent; e.g. GPU, accelerator). When set,
> + * VFIO_DEVICE_FLAGS_PCI is also set (same device is a PCI device). The
> + * capability chain (VFIO_DEVICE_FLAGS_CAPS) contains VFIO_DEVICE_INFO_CAP_CXL
> + * describing HDM decoders, DPA size, and CXL-specific options.
> + */
> +#define VFIO_DEVICE_FLAGS_CXL (1 << 9) /* Device supports CXL */
> __u32 num_regions; /* Max region index + 1 */
> __u32 num_irqs; /* Max IRQ index + 1 */
> __u32 cap_offset; /* Offset within info struct of first cap */
> @@ -257,6 +264,39 @@ struct vfio_device_info_cap_pci_atomic_comp {
> __u32 reserved;
> };
>
> +/*
> + * VFIO_DEVICE_INFO_CAP_CXL - CXL Type-2 device capability
> + *
> + * Present in the device info capability chain when VFIO_DEVICE_FLAGS_CXL
> + * is set. Describes Host Managed Device Memory (HDM) layout and CXL
> + * memory options so that userspace (e.g. QEMU) can expose the CXL region
> + * and component registers correctly to the guest.
> + */
> +#define VFIO_DEVICE_INFO_CAP_CXL 6
> +struct vfio_device_info_cap_cxl {
> + struct vfio_info_cap_header header;
> + __u8 hdm_count; /* Number of HDM decoders */
> + __u8 hdm_regs_bar_index; /* PCI BAR containing HDM registers */
> + __u16 pad;
> + __u32 flags;
> +/* Decoder was committed by host firmware/BIOS */
I'm confused by COMMITTED vs PRECOMMITTED. Should it just say "Decoder is committed" here? Otherwise what is the difference? Also can you explain a little the usage for COMMITTED vs PRECOMMITTED in the commit log please? i.e. why does VFIO CXL need to know a decoder is pre-committed?
DJ
> +#define VFIO_CXL_CAP_COMMITTED (1 << 0)
> +/*
> + * Memory was pre-committed (firmware-programmed); VMM need not allocate
> + * from CXL pool
> + */
> +#define VFIO_CXL_CAP_PRECOMMITTED (1 << 1)
> + __u64 hdm_regs_size; /* Size in bytes of HDM register block */
> + __u64 hdm_regs_offset; /* Byte offset within the BAR to the HDM decoder block */
> + __u64 dpa_size; /* Device Physical Address (DPA) size in bytes */
> + /*
> + * Region indices for the two CXL VFIO device regions.
> + * Avoids forcing userspace to scan all regions by type/subtype.
> + */
> + __u32 dpa_region_index; /* VFIO_REGION_SUBTYPE_CXL */
> + __u32 comp_regs_region_index; /* VFIO_REGION_SUBTYPE_CXL_COMP_REGS */
> +};
> +
> /**
> * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
> * struct vfio_region_info)
> @@ -370,6 +410,18 @@ struct vfio_region_info_cap_type {
> */
> #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
>
> +/* 1e98 vendor PCI sub-types (CXL Consortium) */
> +/*
> + * CXL memory region. Use with region type
> + * (PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE).
> + * DPA memory region (fault+zap mmap)
> + */
> +#define VFIO_REGION_SUBTYPE_CXL (1)
> +/*
> + * HDM decoder register emulation region (read/write only, no mmap).
> + */
> +#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2)
> +
> /* sub-types for VFIO_REGION_TYPE_GFX */
> #define VFIO_REGION_SUBTYPE_GFX_EDID (1)
>
* [PATCH 07/20] vfio/pci: Add CXL state to vfio_pci_core_device
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (5 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 06/20] vfio/cxl: Add UAPI for CXL Type-2 device passthrough mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-11 20:34 ` [PATCH 08/20] vfio/pci: Add vfio-cxl Kconfig and build infrastructure mhonap
` (12 subsequent siblings)
19 siblings, 0 replies; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Add CXL-specific state to the vfio_pci_core_device structure to support
CXL Type-2 device passthrough.
The new vfio_pci_cxl_state structure embeds CXL core objects:
- struct cxl_dev_state: CXL device state (from CXL core)
- struct cxl_memdev: CXL memory device
- struct cxl_region: CXL region object
- Root and endpoint decoders
Key design point: The CXL state pointer is NULL for non-CXL devices,
allowing vfio-pci-core to handle both CXL and standard PCI devices
with minimal overhead.
This follows the approach of making vfio-pci-core itself CXL-aware,
rather than requiring a separate variant driver.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 29 ++++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 3 +++
2 files changed, 32 insertions(+)
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
new file mode 100644
index 000000000000..818a83a3809d
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Common infrastructure for CXL Type-2 device variant drivers
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef __LINUX_VFIO_CXL_PRIV_H
+#define __LINUX_VFIO_CXL_PRIV_H
+
+#include <cxl/cxl.h>
+#include <linux/types.h>
+
+/* CXL device state embedded in vfio_pci_core_device */
+struct vfio_pci_cxl_state {
+ struct cxl_dev_state cxlds;
+ struct cxl_memdev *cxlmd;
+ struct cxl_root_decoder *cxlrd;
+ struct cxl_endpoint_decoder *cxled;
+ resource_size_t hdm_reg_offset;
+ size_t hdm_reg_size;
+ resource_size_t comp_reg_offset;
+ size_t comp_reg_size;
+ u32 hdm_count;
+ u16 dvsec;
+ u8 comp_reg_bar;
+};
+
+#endif /* __LINUX_VFIO_CXL_PRIV_H */
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 1ac86896875c..cd8ed98a82a3 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -30,6 +30,8 @@ struct vfio_pci_region;
struct p2pdma_provider;
struct dma_buf_phys_vec;
struct dma_buf_attachment;
+struct vfio_pci_cxl_state;
+
struct vfio_pci_eventfd {
struct eventfd_ctx *ctx;
@@ -138,6 +140,7 @@ struct vfio_pci_core_device {
struct mutex ioeventfds_lock;
struct list_head ioeventfds_list;
struct vfio_pci_vf_token *vf_token;
+ struct vfio_pci_cxl_state *cxl;
struct list_head sriov_pfs_item;
struct vfio_pci_core_device *sriov_pf_core_dev;
struct notifier_block nb;
--
2.25.1
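The NULL-gating design point above can be sketched with mocked-up types (illustrative only; the real structs live in vfio_pci_core.h and vfio_cxl_priv.h, and vdev_is_cxl() is a hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal mocks of the two structures from the patch */
struct vfio_pci_cxl_state {
	int hdm_count;
};

struct vfio_pci_core_device {
	struct vfio_pci_cxl_state *cxl; /* NULL for non-CXL devices */
};

/* Every CXL-specific path gates on a single pointer check, so plain PCI
 * devices pay only this comparison. */
static bool vdev_is_cxl(const struct vfio_pci_core_device *vdev)
{
	return vdev->cxl != NULL;
}
```

This is why a single pointer in vfio_pci_core_device is enough: non-CXL devices never allocate the state, and every CXL branch short-circuits on the NULL check.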
* [PATCH 08/20] vfio/pci: Add vfio-cxl Kconfig and build infrastructure
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (6 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 07/20] vfio/pci: Add CXL state to vfio_pci_core_device mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-13 12:27 ` Jonathan Cameron
2026-03-11 20:34 ` [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing mhonap
` (11 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Introduce the Kconfig option CONFIG_VFIO_CXL_CORE and the necessary
build rules to compile CXL Type-2 passthrough support into the
vfio-pci-core module. The new option depends on VFIO_PCI_CORE,
CXL_BUS and CXL_MEM.
Wire up the detection and cleanup entry-point stubs in
vfio_pci_core_register_device() and vfio_pci_core_unregister_device()
so that subsequent patches can fill in the CXL-specific logic without
touching the vfio-pci-core flow again.
The vfio_cxl_core.c file added here is an empty skeleton; the actual
CXL detection and initialisation code is introduced in the following
patch to keep this build-system patch reviewable on its own.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Kconfig | 2 ++
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/cxl/Kconfig | 7 ++++++
drivers/vfio/pci/cxl/vfio_cxl_core.c | 35 ++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 4 ++++
drivers/vfio/pci/vfio_pci_priv.h | 14 +++++++++++
6 files changed, 63 insertions(+)
create mode 100644 drivers/vfio/pci/cxl/Kconfig
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 1e82b44bda1a..b981a7c164ca 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -68,6 +68,8 @@ source "drivers/vfio/pci/virtio/Kconfig"
source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
+source "drivers/vfio/pci/cxl/Kconfig"
+
source "drivers/vfio/pci/qat/Kconfig"
source "drivers/vfio/pci/xe/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index e0a0757dd1d2..ecb0eacbc089 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
diff --git a/drivers/vfio/pci/cxl/Kconfig b/drivers/vfio/pci/cxl/Kconfig
new file mode 100644
index 000000000000..41d60dc0de2d
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Kconfig
@@ -0,0 +1,7 @@
+config VFIO_CXL_CORE
+ bool "VFIO CXL core"
+ depends on VFIO_PCI_CORE && CXL_BUS && CXL_MEM
+ help
+ Core library for VFIO CXL Type-2 device support (enlightened path).
+ When enabled, vfio-pci-core can detect and manage CXL Type-2 devices
+ without a separate variant driver.
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
new file mode 100644
index 000000000000..7698d94e16be
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VFIO CXL Core - Common infrastructure for CXL Type-2 device variant drivers
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ *
+ * This module provides common functionality for VFIO variant drivers that
+ * support CXL Type-2 devices (cache-coherent accelerators with attached memory).
+ */
+
+#include <linux/vfio_pci_core.h>
+#include <linux/pci.h>
+#include <cxl/cxl.h>
+#include <cxl/pci.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+MODULE_IMPORT_NS("CXL");
+
+/**
+ * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
+ * @vdev: VFIO PCI device
+ *
+ * Called from vfio_pci_core_register_device(). Detects CXL DVSEC capability
+ * and initializes CXL features. On failure vdev->cxl remains NULL and the
+ * device operates as a standard PCI device.
+ */
+void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
+{
+}
+
+void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 3a11e6f450f7..b7364178e23d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2181,6 +2181,8 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
if (ret)
goto out_vf;
+ vfio_pci_cxl_detect_and_init(vdev);
+
vfio_pci_probe_power_state(vdev);
/*
@@ -2224,6 +2226,8 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
vfio_pci_vf_uninit(vdev);
vfio_pci_vga_uninit(vdev);
+ vfio_pci_cxl_cleanup(vdev);
+
if (!disable_idle_d3)
pm_runtime_get_noresume(&vdev->pdev->dev);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 27ac280f00b9..d7df5538dcde 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -133,4 +133,18 @@ static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
}
#endif
+#if IS_ENABLED(CONFIG_VFIO_CXL_CORE)
+
+void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
+
+#else
+
+static inline void
+vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev) { }
+
+#endif /* CONFIG_VFIO_CXL_CORE */
+
#endif
--
2.25.1
* Re: [PATCH 08/20] vfio/pci: Add vfio-cxl Kconfig and build infrastructure
2026-03-11 20:34 ` [PATCH 08/20] vfio/pci: Add vfio-cxl Kconfig and build infrastructure mhonap
@ 2026-03-13 12:27 ` Jonathan Cameron
2026-03-18 17:21 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-13 12:27 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Thu, 12 Mar 2026 02:04:28 +0530
mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> Introduce the Kconfig option CONFIG_VFIO_CXL_CORE and the necessary
> build rules to compile CXL Type-2 passthrough support into the
> vfio-pci-core module. The new option depends on VFIO_PCI_CORE,
> CXL_BUS and CXL_MEM.
>
> Wire up the detection and cleanup entry-point stubs in
> vfio_pci_core_register_device() and vfio_pci_core_unregister_device()
> so that subsequent patches can fill in the CXL-specific logic without
> touching the vfio-pci-core flow again.
>
> The vfio_cxl_core.c file added here is an empty skeleton; the actual
> CXL detection and initialisation code is introduced in the following
> patch to keep this build-system patch reviewable on its own.
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
Hi Manish,
A few trivial things inline.
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> new file mode 100644
> index 000000000000..7698d94e16be
> --- /dev/null
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VFIO CXL Core - Common infrastructure for CXL Type-2 device variant drivers
> + *
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + *
> + * This module provides common functionality for VFIO variant drivers that
> + * support CXL Type-2 devices (cache-coherent accelerators with attached memory).
As I mentioned for the docs, that definition needs some finessing, as it also
covers CXL Type 3 devices, though the intention is not to cover the ones
compliant with the class code, as those can be nicely paravirtualized.
There is a whole class of CXL.mem only devices with various forms of accelerator
that never need CXL.cache and so aren't Type 2.
E.g. Compressed memory type 3 devices are known to be in the wild.
> + */
> +
> +#include <linux/vfio_pci_core.h>
> +#include <linux/pci.h>
> +#include <cxl/cxl.h>
> +#include <cxl/pci.h>
> +
> +#include "../vfio_pci_priv.h"
> +#include "vfio_cxl_priv.h"
> +
> +MODULE_IMPORT_NS("CXL");
Most often I've seen this added at the end of the file next to other MODULE_X calls.
Whilst we don't have any of those here, it still feels like a sensible place to put it.
> +
> +/**
> + * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
> + * @vdev: VFIO PCI device
> + *
> + * Called from vfio_pci_core_register_device(). Detects CXL DVSEC capability
> + * and initializes CXL features. On failure vdev->cxl remains NULL and the
> + * device operates as a standard PCI device.
> + */
> +void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
> +{
> +}
> +
> +void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +}
* RE: [PATCH 08/20] vfio/pci: Add vfio-cxl Kconfig and build infrastructure
2026-03-13 12:27 ` Jonathan Cameron
@ 2026-03-18 17:21 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 17:21 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Aniket Agashe, Ankit Agrawal, Alex Williamson, Vikram Sethi,
Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com, Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang,
Krishnakant Jaju, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 13 March 2026 17:57
> To: Manish Honap <mhonap@nvidia.com>
> Cc: Aniket Agashe <aniketa@nvidia.com>; Ankit Agrawal <ankita@nvidia.com>;
> Alex Williamson <alwilliamson@nvidia.com>; Vikram Sethi
> <vsethi@nvidia.com>; Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> alejandro.lucero-palau@amd.com; dave@stgolabs.net; dave.jiang@intel.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com; Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU)
> <targupta@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju
> <kjaju@nvidia.com>; linux-kernel@vger.kernel.org; linux-
> cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 08/20] vfio/pci: Add vfio-cxl Kconfig and build
> infrastructure
>
>
>
> On Thu, 12 Mar 2026 02:04:28 +0530
> mhonap@nvidia.com wrote:
>
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > Introduce the Kconfig option CONFIG_VFIO_CXL_CORE and the necessary
> > build rules to compile CXL Type-2 passthrough support into the
> > vfio-pci-core module. The new option depends on VFIO_PCI_CORE,
> > CXL_BUS and CXL_MEM.
> >
> > Wire up the detection and cleanup entry-point stubs in
> > vfio_pci_core_register_device() and vfio_pci_core_unregister_device()
> > so that subsequent patches can fill in the CXL-specific logic without
> > touching the vfio-pci-core flow again.
> >
> > The vfio_cxl_core.c file added here is an empty skeleton; the actual
> > CXL detection and initialisation code is introduced in the following
> > patch to keep this build-system patch reviewable on its own.
> >
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> Hi Manish,
> A few trivial things inline.
>
> > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > new file mode 100644
> > index 000000000000..7698d94e16be
> > --- /dev/null
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > @@ -0,0 +1,35 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VFIO CXL Core - Common infrastructure for CXL Type-2 device
> > +variant drivers
> > + *
> > + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights
> > +reserved
> > + *
> > + * This module provides common functionality for VFIO variant drivers
> > +that
> > + * support CXL Type-2 devices (cache-coherent accelerators with
> attached memory).
> As I mentioned for the docs, that definition needs some finessing, as it also
> covers CXL Type 3 devices, though the intention is not to cover the ones
> compliant with the class code, as those can be nicely paravirtualized.
>
> There is a whole class of CXL.mem only devices with various forms of
> accelerator that never need CXL.cache and so aren't Type 2.
>
> E.g. Compressed memory type 3 devices are known to be in the wild.
Okay, I have updated this wording to specify the CXL capable device expectation.
>
> > + */
> > +
> > +#include <linux/vfio_pci_core.h>
> > +#include <linux/pci.h>
> > +#include <cxl/cxl.h>
> > +#include <cxl/pci.h>
> > +
> > +#include "../vfio_pci_priv.h"
> > +#include "vfio_cxl_priv.h"
> > +
> > +MODULE_IMPORT_NS("CXL");
>
> Most often I've seen this added at the end of the file next to other MODULE_X
> calls.
> Whilst we don't have any of those here, it still feels like a sensible place
> to put it.
>
Moved this to EOF.
>
> > +
> > +/**
> > + * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2
> > +device
> > + * @vdev: VFIO PCI device
> > + *
> > + * Called from vfio_pci_core_register_device(). Detects CXL DVSEC
> > +capability
> > + * and initializes CXL features. On failure vdev->cxl remains NULL
> > +and the
> > + * device operates as a standard PCI device.
> > + */
> > +void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
> > +{ }
> > +
> > +void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev) { }
* [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (7 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 08/20] vfio/pci: Add vfio-cxl Kconfig and build infrastructure mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-12 22:31 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 10/20] vfio/cxl: CXL region management mhonap
` (10 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Implement the core CXL Type-2 device detection and component register
probing logic in vfio_pci_cxl_detect_and_init().
Three private helpers are introduced:
vfio_cxl_create_device_state() allocates the per-device
vfio_pci_cxl_state structure using devm_cxl_dev_state_create() so
that lifetime is tied to the PCI device binding.
vfio_cxl_find_bar() locates the PCI BAR that contains a given HPA
range, returning the BAR index and offset within it.
vfio_cxl_setup_regs() uses the CXL core helpers cxl_find_regblock()
and cxl_probe_component_regs() to enumerate the HDM decoder register
block, then records its BAR index, offset and size in the CXL state.
vfio_pci_cxl_detect_and_init() orchestrates detection:
1. Check for CXL DVSEC via pcie_is_cxl() + pci_find_dvsec_capability().
2. Allocate CXL device state.
3. Temporarily call pci_enable_device_mem() for ioremap, then disable.
4. Probe component registers to find the HDM decoder block.
On any failure vdev->cxl is devm_kfree'd so that device falls back to
plain PCI mode transparently.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 151 +++++++++++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 8 ++
2 files changed, 159 insertions(+)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 7698d94e16be..2da6da1c0605 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -18,6 +18,114 @@
MODULE_IMPORT_NS("CXL");
+static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
+ u16 dvsec)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct device *dev = &pdev->dev;
+ struct vfio_pci_cxl_state *cxl;
+ bool cxl_mem_capable, is_cxl_type3;
+ u16 cap_word;
+
+ /*
+ * The devm allocation for the CXL state remains for the entire time
+ * the PCI device is bound to vfio-pci. From successful CXL init
+ * in probe until the device is released on unbind.
+ * No extra explicit free is needed; devm handles it when
+ * pdev->dev is released.
+ */
+ vdev->cxl = devm_cxl_dev_state_create(dev,
+ CXL_DEVTYPE_DEVMEM,
+ pdev->dev.id, dvsec,
+ struct vfio_pci_cxl_state,
+ cxlds, false);
+ if (!vdev->cxl)
+ return -ENOMEM;
+
+ cxl = vdev->cxl;
+ cxl->dvsec = dvsec;
+
+ pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
+ &cap_word);
+
+ cxl_mem_capable = !!(cap_word & CXL_DVSEC_MEM_CAPABLE);
+ is_cxl_type3 = ((pdev->class >> 8) == PCI_CLASS_MEMORY_CXL);
+
+ /*
+ * Type 2 = CXL memory capable but NOT Type 3 (e.g. accelerator/GPU)
+ * Unsupported for non cxl type-2 class of devices.
+ */
+ if (!(cxl_mem_capable && !is_cxl_type3)) {
+ devm_kfree(&pdev->dev, vdev->cxl);
+ vdev->cxl = NULL;
+ return -ENODEV;
+ }
+
+ return 0;
+}
+
+static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct cxl_register_map *map = &cxl->cxlds.reg_map;
+ resource_size_t offset, bar_offset, size;
+ struct pci_dev *pdev = vdev->pdev;
+ void __iomem *base;
+ u32 count;
+ int ret;
+ u8 bar;
+
+ if (WARN_ON_ONCE(!pci_is_enabled(pdev)))
+ return -EINVAL;
+
+ /* Find component register block via Register Locator DVSEC */
+ ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, map);
+ if (ret)
+ return ret;
+
+ /* Temporarily map the register block */
+ base = ioremap(map->resource, map->max_size);
+ if (!base)
+ return -ENOMEM;
+
+ /* Probe component register capabilities */
+ cxl_probe_component_regs(&pdev->dev, base, &map->component_map);
+
+ /* Unmap immediately */
+ iounmap(base);
+
+ /* Check if HDM decoder was found */
+ if (!map->component_map.hdm_decoder.valid)
+ return -ENODEV;
+
+ pci_dbg(pdev,
+ "vfio_cxl: HDM decoder at offset=0x%lx, size=0x%lx\n",
+ map->component_map.hdm_decoder.offset,
+ map->component_map.hdm_decoder.size);
+
+ /* Get HDM register info */
+ ret = cxl_get_hdm_reg_info(&cxl->cxlds, &count, &offset, &size);
+ if (ret)
+ return ret;
+
+ if (!count || !size)
+ return -ENODEV;
+
+ cxl->hdm_count = count;
+ cxl->hdm_reg_offset = offset;
+ cxl->hdm_reg_size = size;
+
+ ret = cxl_regblock_get_bar_info(map, &bar, &bar_offset);
+ if (ret)
+ return ret;
+
+ cxl->comp_reg_bar = bar;
+ cxl->comp_reg_offset = bar_offset;
+ cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
+
+ return 0;
+}
+
/**
* vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
* @vdev: VFIO PCI device
@@ -28,8 +136,51 @@ MODULE_IMPORT_NS("CXL");
*/
void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
{
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_pci_cxl_state *cxl;
+ u16 dvsec;
+ int ret;
+
+ if (!pcie_is_cxl(pdev))
+ return;
+
+ dvsec = pci_find_dvsec_capability(pdev,
+ PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ return;
+
+ ret = vfio_cxl_create_device_state(vdev, dvsec);
+ if (ret)
+ return;
+
+ cxl = vdev->cxl;
+
+ /*
+ * Required for ioremap of the component register block and
+ * calls to cxl_probe_component_regs().
+ */
+ ret = pci_enable_device_mem(pdev);
+ if (ret)
+ goto failed;
+
+ ret = vfio_cxl_setup_regs(vdev);
+ if (ret) {
+ pci_disable_device(pdev);
+ goto failed;
+ }
+
+ pci_disable_device(pdev);
+
+ return;
+
+failed:
+ devm_kfree(&pdev->dev, vdev->cxl);
+ vdev->cxl = NULL;
}
void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
{
+ if (!vdev->cxl)
+ return;
}
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 818a83a3809d..57fed39a80da 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -26,4 +26,12 @@ struct vfio_pci_cxl_state {
u8 comp_reg_bar;
};
+/*
+ * CXL DVSEC for CXL Devices - register offsets within the DVSEC
+ * (CXL 2.0+ 8.1.3).
+ * Offsets are relative to the DVSEC capability base (cxl->dvsec).
+ */
+#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
+#define CXL_DVSEC_MEM_CAPABLE BIT(2)
+
#endif /* __LINUX_VFIO_CXL_PRIV_H */
--
2.25.1
* Re: [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing
2026-03-11 20:34 ` [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing mhonap
@ 2026-03-12 22:31 ` Dave Jiang
2026-03-13 12:43 ` Jonathan Cameron
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-12 22:31 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> Implement the core CXL Type-2 device detection and component register
> probing logic in vfio_pci_cxl_detect_and_init().
>
> Three private helpers are introduced:
>
> vfio_cxl_create_device_state() allocates the per-device
> vfio_pci_cxl_state structure using devm_cxl_dev_state_create() so
> that lifetime is tied to the PCI device binding.
>
> vfio_cxl_find_bar() locates the PCI BAR that contains a given HPA
> range, returning the BAR index and offset within it.
>
> vfio_cxl_setup_regs() uses the CXL core helpers cxl_find_regblock()
> and cxl_probe_component_regs() to enumerate the HDM decoder register
> block, then records its BAR index, offset and size in the CXL state.
>
> vfio_pci_cxl_detect_and_init() orchestrates detection:
> 1. Check for CXL DVSEC via pcie_is_cxl() + pci_find_dvsec_capability().
> 2. Allocate CXL device state.
> 3. Temporarily call pci_enable_device_mem() for ioremap, then disable.
> 4. Probe component registers to find the HDM decoder block.
>
> On any failure vdev->cxl is devm_kfree'd so that device falls back to
> plain PCI mode transparently.
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/vfio/pci/cxl/vfio_cxl_core.c | 151 +++++++++++++++++++++++++++
> drivers/vfio/pci/cxl/vfio_cxl_priv.h | 8 ++
> 2 files changed, 159 insertions(+)
>
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> index 7698d94e16be..2da6da1c0605 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> @@ -18,6 +18,114 @@
>
> MODULE_IMPORT_NS("CXL");
>
> +static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> + u16 dvsec)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + struct device *dev = &pdev->dev;
> + struct vfio_pci_cxl_state *cxl;
> + bool cxl_mem_capable, is_cxl_type3;
> + u16 cap_word;
> +
> + /*
> + * The devm allocation for the CXL state remains for the entire time
> + * the PCI device is bound to vfio-pci. From successful CXL init
> + * in probe until the device is released on unbind.
> + * No extra explicit free is needed; devm handles it when
> + * pdev->dev is released.
> + */
> + vdev->cxl = devm_cxl_dev_state_create(dev,
> + CXL_DEVTYPE_DEVMEM,
> + pdev->dev.id, dvsec,
> + struct vfio_pci_cxl_state,
> + cxlds, false);
> + if (!vdev->cxl)
> + return -ENOMEM;
> +
> + cxl = vdev->cxl;
> + cxl->dvsec = dvsec;
> +
> + pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> + &cap_word);
> +
> + cxl_mem_capable = !!(cap_word & CXL_DVSEC_MEM_CAPABLE);
> + is_cxl_type3 = ((pdev->class >> 8) == PCI_CLASS_MEMORY_CXL);
Both of these can use FIELD_GET().
> +
> + /*
> + * Type 2 = CXL memory capable but NOT Type 3 (e.g. accelerator/GPU)
> + * Unsupported for non cxl type-2 class of devices.
> + */
> + if (!(cxl_mem_capable && !is_cxl_type3)) {
> + devm_kfree(&pdev->dev, vdev->cxl);
> + vdev->cxl = NULL;
> + return -ENODEV;
> + }
> +
> + return 0;
> +}
> +
> +static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + struct cxl_register_map *map = &cxl->cxlds.reg_map;
> + resource_size_t offset, bar_offset, size;
> + struct pci_dev *pdev = vdev->pdev;
> + void __iomem *base;
> + u32 count;
> + int ret;
> + u8 bar;
> +
> + if (WARN_ON_ONCE(!pci_is_enabled(pdev)))
> + return -EINVAL;
> +
> + /* Find component register block via Register Locator DVSEC */
> + ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, map);
> + if (ret)
> + return ret;
> +
> + /* Temporarily map the register block */
> + base = ioremap(map->resource, map->max_size);
Request the mem region before mapping it?
DJ
> + if (!base)
> + return -ENOMEM;
> +
> + /* Probe component register capabilities */
> + cxl_probe_component_regs(&pdev->dev, base, &map->component_map);
> +
> + /* Unmap immediately */
> + iounmap(base);
> +
> + /* Check if HDM decoder was found */
> + if (!map->component_map.hdm_decoder.valid)
> + return -ENODEV;
> +
> + pci_dbg(pdev,
> + "vfio_cxl: HDM decoder at offset=0x%lx, size=0x%lx\n",
> + map->component_map.hdm_decoder.offset,
> + map->component_map.hdm_decoder.size);
> +
> + /* Get HDM register info */
> + ret = cxl_get_hdm_reg_info(&cxl->cxlds, &count, &offset, &size);
> + if (ret)
> + return ret;
> +
> + if (!count || !size)
> + return -ENODEV;
> +
> + cxl->hdm_count = count;
> + cxl->hdm_reg_offset = offset;
> + cxl->hdm_reg_size = size;
> +
> + ret = cxl_regblock_get_bar_info(map, &bar, &bar_offset);
> + if (ret)
> + return ret;
> +
> + cxl->comp_reg_bar = bar;
> + cxl->comp_reg_offset = bar_offset;
> + cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
> +
> + return 0;
> +}
> +
> /**
> * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
> * @vdev: VFIO PCI device
> @@ -28,8 +136,51 @@ MODULE_IMPORT_NS("CXL");
> */
> void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
> {
> + struct pci_dev *pdev = vdev->pdev;
> + struct vfio_pci_cxl_state *cxl;
> + u16 dvsec;
> + int ret;
> +
> + if (!pcie_is_cxl(pdev))
> + return;
> +
> + dvsec = pci_find_dvsec_capability(pdev,
> + PCI_VENDOR_ID_CXL,
> + PCI_DVSEC_CXL_DEVICE);
> + if (!dvsec)
> + return;
> +
> + ret = vfio_cxl_create_device_state(vdev, dvsec);
> + if (ret)
> + return;
> +
> + cxl = vdev->cxl;
> +
> + /*
> + * Required for ioremap of the component register block and
> + * calls to cxl_probe_component_regs().
> + */
> + ret = pci_enable_device_mem(pdev);
> + if (ret)
> + goto failed;
> +
> + ret = vfio_cxl_setup_regs(vdev);
> + if (ret) {
> + pci_disable_device(pdev);
> + goto failed;
> + }
> +
> + pci_disable_device(pdev);
> +
> + return;
> +
> +failed:
> + devm_kfree(&pdev->dev, vdev->cxl);
> + vdev->cxl = NULL;
> }
>
> void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
> {
> + if (!vdev->cxl)
> + return;
> }
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> index 818a83a3809d..57fed39a80da 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> @@ -26,4 +26,12 @@ struct vfio_pci_cxl_state {
> u8 comp_reg_bar;
> };
>
> +/*
> + * CXL DVSEC for CXL Devices - register offsets within the DVSEC
> + * (CXL 2.0+ 8.1.3).
> + * Offsets are relative to the DVSEC capability base (cxl->dvsec).
> + */
> +#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
> +#define CXL_DVSEC_MEM_CAPABLE BIT(2)
> +
> #endif /* __LINUX_VFIO_CXL_PRIV_H */
* Re: [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing
2026-03-12 22:31 ` Dave Jiang
@ 2026-03-13 12:43 ` Jonathan Cameron
2026-03-18 17:43 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-13 12:43 UTC (permalink / raw)
To: Dave Jiang
Cc: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Thu, 12 Mar 2026 15:31:03 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > Implement the core CXL Type-2 device detection and component register
> > probing logic in vfio_pci_cxl_detect_and_init().
> >
> > Three private helpers are introduced:
> >
> > vfio_cxl_create_device_state() allocates the per-device
> > vfio_pci_cxl_state structure using devm_cxl_dev_state_create() so
> > that lifetime is tied to the PCI device binding.
> >
> > vfio_cxl_find_bar() locates the PCI BAR that contains a given HPA
> > range, returning the BAR index and offset within it.
> >
> > vfio_cxl_setup_regs() uses the CXL core helpers cxl_find_regblock()
> > and cxl_probe_component_regs() to enumerate the HDM decoder register
> > block, then records its BAR index, offset and size in the CXL state.
> >
> > vfio_pci_cxl_detect_and_init() orchestrates detection:
> > 1. Check for CXL DVSEC via pcie_is_cxl() + pci_find_dvsec_capability().
> > 2. Allocate CXL device state.
> > 3. Temporarily call pci_enable_device_mem() for ioremap, then disable.
> > 4. Probe component registers to find the HDM decoder block.
> >
> > On any failure vdev->cxl is devm_kfree'd so that device falls back to
> > plain PCI mode transparently.
> >
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
Given Dave didn't crop anything I'll just reply on top and avoid duplication.
Mostly lifetime handling comments. I get nervous when devm occurs in the middle
of non devm code. It needs to be done very carefully. I don't think you
have a bug here, but I'm not keen on the resulting difference in order of setup
and tear down. I'd like cleanup to tidy it up even though it would be safe
to do later.
> > ---
> > drivers/vfio/pci/cxl/vfio_cxl_core.c | 151 +++++++++++++++++++++++++++
> > drivers/vfio/pci/cxl/vfio_cxl_priv.h | 8 ++
> > 2 files changed, 159 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > index 7698d94e16be..2da6da1c0605 100644
> > --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > @@ -18,6 +18,114 @@
> >
> > MODULE_IMPORT_NS("CXL");
> >
> > +static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> > + u16 dvsec)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + struct device *dev = &pdev->dev;
> > + struct vfio_pci_cxl_state *cxl;
> > + bool cxl_mem_capable, is_cxl_type3;
> > + u16 cap_word;
> > +
> > + /*
> > + * The devm allocation for the CXL state remains for the entire time
> > + * the PCI device is bound to vfio-pci. From successful CXL init
> > + * in probe until the device is released on unbind.
> > + * No extra explicit free is needed; devm handles it when
> > + * pdev->dev is released.
> > + */
> > + vdev->cxl = devm_cxl_dev_state_create(dev,
Rather than assigning this in here, I'd use a local variable for
the return of this, operate on that and return it from the function.
That both creates a clean separation and possibly makes error
handling simpler later...
Also similar to some of the feedback Alejandro had on his type 2
series, be very careful with mixing devm calls and non devm,
generally that's a path to hard to read and reason about code.
I know you've stated this is fine because it's tied to the
PCI device lifetime and you are probably right on that, but
I'm not keen.
This might be a case where manual unwinding of devres is needed
(maybe using devres groups if we end up with a bunch of stuff
to undo).
> > + CXL_DEVTYPE_DEVMEM,
> > + pdev->dev.id, dvsec,
> > + struct vfio_pci_cxl_state,
> > + cxlds, false);
> > + if (!vdev->cxl)
> > + return -ENOMEM;
> > +
> > + cxl = vdev->cxl;
> > + cxl->dvsec = dvsec;
That's a bit odd: given we pass dvsec into devm_cxl_dev_state_create(),
why doesn't it assign it?
> > +
> > + pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> > + &cap_word);
> > +
> > + cxl_mem_capable = !!(cap_word & CXL_DVSEC_MEM_CAPABLE);
> > + is_cxl_type3 = ((pdev->class >> 8) == PCI_CLASS_MEMORY_CXL);
>
> Both of these can use FIELD_GET().
>
> > +
> > + /*
> > + * Type 2 = CXL memory capable but NOT Type 3 (e.g. accelerator/GPU)
> > + * Unsupported for non cxl type-2 class of devices.
As in other places, type 3 doesn't mean that. What you mean is class
code compliant type 3.
> > + */
> > + if (!(cxl_mem_capable && !is_cxl_type3)) {
> > + devm_kfree(&pdev->dev, vdev->cxl);
As below. That needs a name that makes it clear it is the right devm call.
> > + vdev->cxl = NULL;
> > + return -ENODEV;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + struct cxl_register_map *map = &cxl->cxlds.reg_map;
> > + resource_size_t offset, bar_offset, size;
> > + struct pci_dev *pdev = vdev->pdev;
> > + void __iomem *base;
> > + u32 count;
> > + int ret;
> > + u8 bar;
> > +
> > + if (WARN_ON_ONCE(!pci_is_enabled(pdev)))
> > + return -EINVAL;
> > +
> > + /* Find component register block via Register Locator DVSEC */
> > + ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, map);
> > + if (ret)
> > + return ret;
> > +
> > + /* Temporarily map the register block */
> > + base = ioremap(map->resource, map->max_size);
>
> Request the mem region before mapping it?
>
> DJ
>
> > + if (!base)
> > + return -ENOMEM;
> > +
> > + /* Probe component register capabilities */
> > + cxl_probe_component_regs(&pdev->dev, base, &map->component_map);
> > +
> > + /* Unmap immediately */
> > + iounmap(base);
> > +
> > + /* Check if HDM decoder was found */
> > + if (!map->component_map.hdm_decoder.valid)
> > + return -ENODEV;
> > +
> > + pci_dbg(pdev,
> > + "vfio_cxl: HDM decoder at offset=0x%lx, size=0x%lx\n",
> > + map->component_map.hdm_decoder.offset,
> > + map->component_map.hdm_decoder.size);
> > +
> > + /* Get HDM register info */
> > + ret = cxl_get_hdm_reg_info(&cxl->cxlds, &count, &offset, &size);
> > + if (ret)
> > + return ret;
> > +
> > + if (!count || !size)
> > + return -ENODEV;
> > +
> > + cxl->hdm_count = count;
> > + cxl->hdm_reg_offset = offset;
> > + cxl->hdm_reg_size = size;
> > +
> > + ret = cxl_regblock_get_bar_info(map, &bar, &bar_offset);
> > + if (ret)
> > + return ret;
> > +
> > + cxl->comp_reg_bar = bar;
> > + cxl->comp_reg_offset = bar_offset;
> > + cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
> > +
> > + return 0;
> > +}
> > +
> > /**
> > * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
> > * @vdev: VFIO PCI device
> > @@ -28,8 +136,51 @@ MODULE_IMPORT_NS("CXL");
> > */
> > void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
> > {
> > + struct pci_dev *pdev = vdev->pdev;
> > + struct vfio_pci_cxl_state *cxl;
> > + u16 dvsec;
> > + int ret;
> > +
> > + if (!pcie_is_cxl(pdev))
> > + return;
> > +
> > + dvsec = pci_find_dvsec_capability(pdev,
> > + PCI_VENDOR_ID_CXL,
> > + PCI_DVSEC_CXL_DEVICE);
> > + if (!dvsec)
> > + return;
> > +
> > + ret = vfio_cxl_create_device_state(vdev, dvsec);
Suggestion above would lead to.
cxl = vfio_cxl_create_device_state(vdev, dvsec);
if (IS_ERR(cxl))
return PTR_ERR(cxl); //assuming failing at this point is fatal.
then only set vdev->cxl once you are sure this function succeeded.
Thus removing the need to set it to NULL on failure.
You need to pass it into a few more calls though.
> > + if (ret)
> > + return;
> > +
> > + cxl = vdev->cxl;
> > +
> > + /*
> > + * Required for ioremap of the component register block and
> > + * calls to cxl_probe_component_regs().
> > + */
> > + ret = pci_enable_device_mem(pdev);
> > + if (ret)
> > + goto failed;
> > +
> > + ret = vfio_cxl_setup_regs(vdev);
> > + if (ret) {
> > + pci_disable_device(pdev);
> > + goto failed;
> > + }
> > +
> > + pci_disable_device(pdev);
> > +
As above, I think this would be cleaner if only here we have
vdev->cxl = cxl;
> > + return;
> > +
> > +failed:
> > + devm_kfree(&pdev->dev, vdev->cxl);
If you get here you found a CXL device but couldn't handle it. Is it useful
to continue? I'd suggest probably not. If returning an error then devm
cleanup happens. Also, we should have a wrapper around devm_kfree to
make it clear that it is valid to use it for undoing the
devm_cxl_dev_state_create() (I haven't even checked it is valid)
> > + vdev->cxl = NULL;
> > }
> > #endif /* __LINUX_VFIO_CXL_PRIV_H */
>
>
* RE: [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing
2026-03-13 12:43 ` Jonathan Cameron
@ 2026-03-18 17:43 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 17:43 UTC (permalink / raw)
To: Jonathan Cameron, Dave Jiang
Cc: Aniket Agashe, Ankit Agrawal, Alex Williamson, Vikram Sethi,
Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com, jgg@ziepe.ca,
Yishai Hadas, kevin.tian@intel.com, Neo Jia, Tarun Gupta (SW-GPU),
Zhi Wang, Krishnakant Jaju, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 13 March 2026 18:14
> To: Dave Jiang <dave.jiang@intel.com>
> Cc: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; alison.schofield@intel.com; vishal.l.verma@intel.com;
> ira.weiny@intel.com; dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas
> <yishaih@nvidia.com>; kevin.tian@intel.com; Neo Jia <cjia@nvidia.com>;
> Tarun Gupta (SW-GPU) <targupta@nvidia.com>; Zhi Wang <zhiw@nvidia.com>;
> Krishnakant Jaju <kjaju@nvidia.com>; linux-kernel@vger.kernel.org; linux-
> cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 09/20] vfio/cxl: Implement CXL device detection and
> HDM register probing
>
> External email: Use caution opening links or attachments
>
>
> On Thu, 12 Mar 2026 15:31:03 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
> > On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > > From: Manish Honap <mhonap@nvidia.com>
> > >
> > > Implement the core CXL Type-2 device detection and component
> > > register probing logic in vfio_pci_cxl_detect_and_init().
> > >
> > > Three private helpers are introduced:
> > >
> > > vfio_cxl_create_device_state() allocates the per-device
> > > vfio_pci_cxl_state structure using devm_cxl_dev_state_create() so
> > > that lifetime is tied to the PCI device binding.
> > >
> > > vfio_cxl_find_bar() locates the PCI BAR that contains a given HPA
> > > range, returning the BAR index and offset within it.
> > >
> > > vfio_cxl_setup_regs() uses the CXL core helpers cxl_find_regblock()
> > > and cxl_probe_component_regs() to enumerate the HDM decoder register
> > > block, then records its BAR index, offset and size in the CXL state.
> > >
> > > vfio_pci_cxl_detect_and_init() orchestrates detection:
> > > 1. Check for CXL DVSEC via pcie_is_cxl() +
> pci_find_dvsec_capability().
> > > 2. Allocate CXL device state.
> > > 3. Temporarily call pci_enable_device_mem() for ioremap, then
> disable.
> > > 4. Probe component registers to find the HDM decoder block.
> > >
> > > On any failure vdev->cxl is devm_kfree'd so that device falls back
> > > to plain PCI mode transparently.
> > >
> > > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> Given Dave didn't crop anything I'll just reply on top and avoid
> duplication.
>
> Mostly lifetime handling comments. I get nervous when devm occurs in the
> middle of non devm code. It needs to be done very carefully. I don't think
> you have a bug here, but I'm not keen on the resulting difference in order
> of setup and tear down. I'd like cleanup to tidy it up even though it
> would be safe to do later.
>
I have addressed these comments by creating a local variable before assigning
cxl to vdev->cxl.
>
> > > ---
> > > drivers/vfio/pci/cxl/vfio_cxl_core.c | 151
> +++++++++++++++++++++++++++
> > > drivers/vfio/pci/cxl/vfio_cxl_priv.h | 8 ++
> > > 2 files changed, 159 insertions(+)
> > >
> > > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > index 7698d94e16be..2da6da1c0605 100644
> > > --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > @@ -18,6 +18,114 @@
> > >
> > > MODULE_IMPORT_NS("CXL");
> > >
> > > +static int vfio_cxl_create_device_state(struct vfio_pci_core_device
> *vdev,
> > > + u16 dvsec) {
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + struct device *dev = &pdev->dev;
> > > + struct vfio_pci_cxl_state *cxl;
> > > + bool cxl_mem_capable, is_cxl_type3;
> > > + u16 cap_word;
> > > +
> > > + /*
> > > + * The devm allocation for the CXL state remains for the entire
> time
> > > + * the PCI device is bound to vfio-pci. From successful CXL init
> > > + * in probe until the device is released on unbind.
> > > + * No extra explicit free is needed; devm handles it when
> > > + * pdev->dev is released.
> > > + */
> > > + vdev->cxl = devm_cxl_dev_state_create(dev,
>
> Rather than assigning this in here, I'd use a local variable for the
> return of this, operate on that and return it from the function.
> That both creates a clean separation and possibly make error handling
> simpler later...
>
> Also similar to some of the feedback Alejandro had on his type 2 series,
> be very careful with mixing devm calls and non devm, generally that's a
> path to hard to read and reason about code.
> I know you've stated this is fine because it's tied to the PCI device
> lifetime and you are probably right on that, but I'm not keen.
>
> This might be a case where manual unwinding of devres is needed (maybe
> using devres groups if we end up with a bunch of stuff to undo).
>
>
>
> > > + CXL_DEVTYPE_DEVMEM,
> > > + pdev->dev.id, dvsec,
> > > + struct vfio_pci_cxl_state,
> > > + cxlds, false);
> > > + if (!vdev->cxl)
> > > + return -ENOMEM;
> > > +
> > > + cxl = vdev->cxl;
> > > + cxl->dvsec = dvsec;
>
> That's a bit odd given we pass dvsec into devm_cxl_dev_state_create() why
> doesn't it assign it?
Yes, dvsec was redundant in cxl->dvsec and cxl->cxlds.cxl_dvsec
Removed it from cxl.
>
> > > +
> > > + pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> > > + &cap_word);
> > > +
> > > + cxl_mem_capable = !!(cap_word & CXL_DVSEC_MEM_CAPABLE);
> > > + is_cxl_type3 = ((pdev->class >> 8) == PCI_CLASS_MEMORY_CXL);
> >
> > Both of these can use FIELD_GET().
I have used FIELD_GET() for the MEM_CAPABLE flag. For pdev->class, I see there are
other occurrences (drivers/pci/pcie/aer_cxl_rch.c) which use the same check for
detecting the memory class, so I have kept the is_cxl_type3 check as it is.
> >
> > > +
> > > + /*
> > > + * Type 2 = CXL memory capable but NOT Type 3 (e.g.
> accelerator/GPU)
> > > + * Unsupported for non cxl type-2 class of devices.
>
> As in other places, type 3 doesn't mean that. What you mean is class code
> compliant type 3.
Updated the comments and checks.
>
> > > + */
> > > + if (!(cxl_mem_capable && !is_cxl_type3)) {
> > > + devm_kfree(&pdev->dev, vdev->cxl);
>
> As below. That needs a name that makes it clear it is the right devm call.
>
> > > + vdev->cxl = NULL;
> > > + return -ENODEV;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev) {
> > > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > > + struct cxl_register_map *map = &cxl->cxlds.reg_map;
> > > + resource_size_t offset, bar_offset, size;
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + void __iomem *base;
> > > + u32 count;
> > > + int ret;
> > > + u8 bar;
> > > +
> > > + if (WARN_ON_ONCE(!pci_is_enabled(pdev)))
> > > + return -EINVAL;
> > > +
> > > + /* Find component register block via Register Locator DVSEC */
> > > + ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, map);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + /* Temporarily map the register block */
> > > + base = ioremap(map->resource, map->max_size);
> >
> > Request the mem region before mapping it?
Yes, added a call to request the memory region before mapping it.
> >
> > DJ
> >
> > > + if (!base)
> > > + return -ENOMEM;
> > > +
> > > + /* Probe component register capabilities */
> > > + cxl_probe_component_regs(&pdev->dev, base, &map->component_map);
> > > +
> > > + /* Unmap immediately */
> > > + iounmap(base);
> > > +
> > > + /* Check if HDM decoder was found */
> > > + if (!map->component_map.hdm_decoder.valid)
> > > + return -ENODEV;
> > > +
> > > + pci_dbg(pdev,
> > > + "vfio_cxl: HDM decoder at offset=0x%lx, size=0x%lx\n",
> > > + map->component_map.hdm_decoder.offset,
> > > + map->component_map.hdm_decoder.size);
> > > +
> > > + /* Get HDM register info */
> > > + ret = cxl_get_hdm_reg_info(&cxl->cxlds, &count, &offset, &size);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (!count || !size)
> > > + return -ENODEV;
> > > +
> > > + cxl->hdm_count = count;
> > > + cxl->hdm_reg_offset = offset;
> > > + cxl->hdm_reg_size = size;
> > > +
> > > + ret = cxl_regblock_get_bar_info(map, &bar, &bar_offset);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + cxl->comp_reg_bar = bar;
> > > + cxl->comp_reg_offset = bar_offset;
> > > + cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > /**
> > > * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2
> device
> > > * @vdev: VFIO PCI device
> > > @@ -28,8 +136,51 @@ MODULE_IMPORT_NS("CXL");
> > > */
> > > void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device
> > > *vdev) {
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + struct vfio_pci_cxl_state *cxl;
> > > + u16 dvsec;
> > > + int ret;
> > > +
> > > + if (!pcie_is_cxl(pdev))
> > > + return;
> > > +
> > > + dvsec = pci_find_dvsec_capability(pdev,
> > > + PCI_VENDOR_ID_CXL,
> > > + PCI_DVSEC_CXL_DEVICE);
> > > + if (!dvsec)
> > > + return;
> > > +
> > > + ret = vfio_cxl_create_device_state(vdev, dvsec);
> Suggestion above would lead to.
> cxl = vfio_cxl_create_device_state(vdev, dvsec);
> if (IS_ERR(cxl))
> return PTR_ERR(cxl); //assuming failing at this point is
> fatal.
>
> then only set vdev->cxl once you are sure this function succeeded.
> Thus removing the need to set it to NULL on failure.
> You need to pass it into a few more calls though.
Yes, I have added a local variable for this purpose.
>
>
> > > + if (ret)
> > > + return;
> > > +
> > > + cxl = vdev->cxl;
> > > +
> > > + /*
> > > + * Required for ioremap of the component register block and
> > > + * calls to cxl_probe_component_regs().
> > > + */
> > > + ret = pci_enable_device_mem(pdev);
> > > + if (ret)
> > > + goto failed;
> > > +
> > > + ret = vfio_cxl_setup_regs(vdev);
> > > + if (ret) {
> > > + pci_disable_device(pdev);
> > > + goto failed;
> > > + }
> > > +
> > > + pci_disable_device(pdev);
> > > +
> As above, I think this would be cleaner if only here do we have
>
> vdev->cxl = cxl;
>
> > > + return;
> > > +
> > > +failed:
> > > + devm_kfree(&pdev->dev, vdev->cxl);
>
> If you get here you found a CXL device but couldn't handle it. Is it
> useful
> to continue? I'd suggest probably not. If returning an error then devm
> cleanup happens. Also, we should have a wrapper around devm_kfree to make
> it clear that it is valid to use it for undoing the
> devm_cxl_dev_state_create() (I haven't even checked it is valid)
>
Added a wrapper around devm_kfree.
> > > + vdev->cxl = NULL;
> > > }
> > > #endif /* __LINUX_VFIO_CXL_PRIV_H */
> >
> >
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 10/20] vfio/cxl: CXL region management
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (8 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-12 22:55 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 11/20] vfio/cxl: Expose DPA memory region to userspace with fault+zap mmap mhonap
` (9 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Add CXL region management for future guest access.
Region Management makes use of APIs provided by CXL_CORE as below:
CREATE_REGION flow:
1. Validate request (size, decoder availability)
2. Allocate HPA via cxl_get_hpa_freespace()
3. Allocate DPA via cxl_request_dpa()
4. Create region via cxl_create_region() - commits HDM decoder!
5. Get HPA range via cxl_get_region_range()
DESTROY_REGION flow:
1. Detach decoder via cxl_decoder_detach()
2. Free DPA via cxl_dpa_free()
3. Release root decoder via cxl_put_root_decoder()
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 118 ++++++++++++++++++++++++++-
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 5 ++
drivers/vfio/pci/vfio_pci_priv.h | 8 ++
3 files changed, 130 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 2da6da1c0605..9c71f592e74e 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -126,6 +126,112 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
return 0;
}
+int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev, resource_size_t size)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ resource_size_t max_size;
+ int ret;
+
+ if (cxl->precommitted)
+ return 0;
+
+ cxl->cxlrd = cxl_get_hpa_freespace(cxl->cxlmd, 1,
+ CXL_DECODER_F_RAM |
+ CXL_DECODER_F_TYPE2,
+ &max_size);
+ if (IS_ERR(cxl->cxlrd))
+ return PTR_ERR(cxl->cxlrd);
+
+ /* Insufficient HPA space */
+ if (max_size < size) {
+ cxl_put_root_decoder(cxl->cxlrd);
+ cxl->cxlrd = NULL;
+ return -ENOSPC;
+ }
+
+ cxl->cxled = cxl_request_dpa(cxl->cxlmd, CXL_PARTMODE_RAM, size);
+ if (IS_ERR(cxl->cxled)) {
+ ret = PTR_ERR(cxl->cxled);
+ goto err_free_hpa;
+ }
+
+ cxl->region = cxl_create_region(cxl->cxlrd, &cxl->cxled, 1);
+ if (IS_ERR(cxl->region)) {
+ ret = PTR_ERR(cxl->region);
+ goto err_free_dpa;
+ }
+
+ return 0;
+
+err_free_dpa:
+ cxl_dpa_free(cxl->cxled);
+err_free_hpa:
+ if (cxl->cxlrd)
+ cxl_put_root_decoder(cxl->cxlrd);
+
+ return ret;
+}
+
+void vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl->region)
+ return;
+
+ cxl_unregister_region(cxl->region);
+ cxl->region = NULL;
+
+ if (cxl->precommitted)
+ return;
+
+ cxl_dpa_free(cxl->cxled);
+ cxl_put_root_decoder(cxl->cxlrd);
+}
+
+static int vfio_cxl_create_region_helper(struct vfio_pci_core_device *vdev,
+ resource_size_t capacity)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+
+ if (cxl->precommitted) {
+ cxl->cxled = cxl_get_committed_decoder(cxl->cxlmd,
+ &cxl->region);
+ if (IS_ERR(cxl->cxled))
+ return PTR_ERR(cxl->cxled);
+ } else {
+ ret = vfio_cxl_create_cxl_region(vdev, capacity);
+ if (ret)
+ return ret;
+ }
+
+ if (cxl->region) {
+ struct range range;
+
+ ret = cxl_get_region_range(cxl->region, &range);
+ if (ret)
+ goto failed;
+
+ cxl->region_hpa = range.start;
+ cxl->region_size = range_len(&range);
+
+ pci_dbg(pdev, "Precommitted decoder: HPA 0x%llx size %lu MB\n",
+ cxl->region_hpa, cxl->region_size >> 20);
+ } else {
+ pci_err(pdev, "Failed to create CXL region\n");
+ ret = -ENODEV;
+ goto failed;
+ }
+
+ return 0;
+
+failed:
+ vfio_cxl_destroy_cxl_region(vdev);
+ return ret;
+}
+
/**
* vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
* @vdev: VFIO PCI device
@@ -172,6 +278,12 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
pci_disable_device(pdev);
+ ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
+ if (ret)
+ goto failed;
+
+ cxl->precommitted = true;
+
return;
failed:
@@ -181,6 +293,10 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
{
- if (!vdev->cxl)
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->region)
return;
+
+ vfio_cxl_destroy_cxl_region(vdev);
}
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 57fed39a80da..985680842a13 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -17,6 +17,10 @@ struct vfio_pci_cxl_state {
struct cxl_memdev *cxlmd;
struct cxl_root_decoder *cxlrd;
struct cxl_endpoint_decoder *cxled;
+ struct cxl_region *region;
+ resource_size_t region_hpa;
+ size_t region_size;
+ void __iomem *region_vaddr;
resource_size_t hdm_reg_offset;
size_t hdm_reg_size;
resource_size_t comp_reg_offset;
@@ -24,6 +28,7 @@ struct vfio_pci_cxl_state {
u32 hdm_count;
u16 dvsec;
u8 comp_reg_bar;
+ bool precommitted;
};
/*
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index d7df5538dcde..818d99f098bf 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -137,6 +137,9 @@ static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev);
void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
+int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev,
+ resource_size_t size);
+void vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev);
#else
@@ -144,6 +147,11 @@ static inline void
vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev) { }
static inline void
vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev) { }
+static inline int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev,
+ resource_size_t size)
+{ return 0; }
+static inline void
+vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev) { }
#endif /* CONFIG_VFIO_CXL_CORE */
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread

* Re: [PATCH 10/20] vfio/cxl: CXL region management
2026-03-11 20:34 ` [PATCH 10/20] vfio/cxl: CXL region management mhonap
@ 2026-03-12 22:55 ` Dave Jiang
2026-03-13 12:52 ` Jonathan Cameron
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-12 22:55 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> Add CXL region management for future guest access.
>
> Region Management makes use of APIs provided by CXL_CORE as below:
>
> CREATE_REGION flow:
> 1. Validate request (size, decoder availability)
> 2. Allocate HPA via cxl_get_hpa_freespace()
> 3. Allocate DPA via cxl_request_dpa()
> 4. Create region via cxl_create_region() - commits HDM decoder!
> 5. Get HPA range via cxl_get_region_range()
>
> DESTROY_REGION flow:
> 1. Detach decoder via cxl_decoder_detach()
> 2. Free DPA via cxl_dpa_free()
> 3. Release root decoder via cxl_put_root_decoder()
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/vfio/pci/cxl/vfio_cxl_core.c | 118 ++++++++++++++++++++++++++-
> drivers/vfio/pci/cxl/vfio_cxl_priv.h | 5 ++
> drivers/vfio/pci/vfio_pci_priv.h | 8 ++
> 3 files changed, 130 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> index 2da6da1c0605..9c71f592e74e 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> @@ -126,6 +126,112 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
> return 0;
> }
>
> +int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev, resource_size_t size)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + resource_size_t max_size;
> + int ret;
> +
> + if (cxl->precommitted)
> + return 0;
> +
> + cxl->cxlrd = cxl_get_hpa_freespace(cxl->cxlmd, 1,
> + CXL_DECODER_F_RAM |
> + CXL_DECODER_F_TYPE2,
> + &max_size);
Not sure what the VFIO subsystem's policy is on scope-based resource cleanup, but a __free() here can get you out of managing the put() of the root decoder.
> + if (IS_ERR(cxl->cxlrd))
> + return PTR_ERR(cxl->cxlrd);
> +
> + /* Insufficient HPA space */
> + if (max_size < size) {
> + cxl_put_root_decoder(cxl->cxlrd);
> + cxl->cxlrd = NULL;
> + return -ENOSPC;
> + }
> +
> + cxl->cxled = cxl_request_dpa(cxl->cxlmd, CXL_PARTMODE_RAM, size);
Same comment here about __free().
> + if (IS_ERR(cxl->cxled)) {
> + ret = PTR_ERR(cxl->cxled);
> + goto err_free_hpa;
> + }
> +
> + cxl->region = cxl_create_region(cxl->cxlrd, &cxl->cxled, 1);
> + if (IS_ERR(cxl->region)) {
> + ret = PTR_ERR(cxl->region);
> + goto err_free_dpa;
> + }
> +
> + return 0;
> +
> +err_free_dpa:
> + cxl_dpa_free(cxl->cxled);
> +err_free_hpa:
> + if (cxl->cxlrd)
> + cxl_put_root_decoder(cxl->cxlrd);
> +
> + return ret;
> +}
> +
> +void vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> +
> + if (!cxl->region)
> + return;
> +
> + cxl_unregister_region(cxl->region);
> + cxl->region = NULL;
> +
> + if (cxl->precommitted)
> + return;
> +
> + cxl_dpa_free(cxl->cxled);
> + cxl_put_root_decoder(cxl->cxlrd);
> +}
> +
> +static int vfio_cxl_create_region_helper(struct vfio_pci_core_device *vdev,
> + resource_size_t capacity)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + struct pci_dev *pdev = vdev->pdev;
> + int ret;
> +
> + if (cxl->precommitted) {
> + cxl->cxled = cxl_get_committed_decoder(cxl->cxlmd,
> + &cxl->region);
> + if (IS_ERR(cxl->cxled))
> + return PTR_ERR(cxl->cxled);
> + } else {
> + ret = vfio_cxl_create_cxl_region(vdev, capacity);
> + if (ret)
> + return ret;
> + }
> +
> + if (cxl->region) {
Maybe if you do 'if (!cxl->region)' first and just exit, then you don't need to indent the normal code path.
> + struct range range;
> +
> + ret = cxl_get_region_range(cxl->region, &range);
> + if (ret)
> + goto failed;
> +
> + cxl->region_hpa = range.start;
> + cxl->region_size = range_len(&range);
> +
> + pci_dbg(pdev, "Precommitted decoder: HPA 0x%llx size %lu MB\n",
> + cxl->region_hpa, cxl->region_size >> 20);
> + } else {
> + pci_err(pdev, "Failed to create CXL region\n");
> + ret = -ENODEV;
> + goto failed;
> + }
> +
> + return 0;
> +
> +failed:
> + vfio_cxl_destroy_cxl_region(vdev);
> + return ret;
> +}
> +
> /**
> * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
> * @vdev: VFIO PCI device
> @@ -172,6 +278,12 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
>
> pci_disable_device(pdev);
>
> + ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
Maybe a comment on why this size?
DJ
> + if (ret)
> + goto failed;
> +
> + cxl->precommitted = true;
> +
> return;
>
> failed:
> @@ -181,6 +293,10 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
>
> void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
> {
> - if (!vdev->cxl)
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> +
> + if (!cxl || !cxl->region)
> return;
> +
> + vfio_cxl_destroy_cxl_region(vdev);
> }
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> index 57fed39a80da..985680842a13 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> @@ -17,6 +17,10 @@ struct vfio_pci_cxl_state {
> struct cxl_memdev *cxlmd;
> struct cxl_root_decoder *cxlrd;
> struct cxl_endpoint_decoder *cxled;
> + struct cxl_region *region;
> + resource_size_t region_hpa;
> + size_t region_size;
> + void __iomem *region_vaddr;
> resource_size_t hdm_reg_offset;
> size_t hdm_reg_size;
> resource_size_t comp_reg_offset;
> @@ -24,6 +28,7 @@ struct vfio_pci_cxl_state {
> u32 hdm_count;
> u16 dvsec;
> u8 comp_reg_bar;
> + bool precommitted;
> };
>
> /*
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index d7df5538dcde..818d99f098bf 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -137,6 +137,9 @@ static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
>
> void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev);
> void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
> +int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev,
> + resource_size_t size);
> +void vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev);
>
> #else
>
> @@ -144,6 +147,11 @@ static inline void
> vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev) { }
> static inline void
> vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev) { }
> +static inline int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev,
> + resource_size_t size)
> +{ return 0; }
> +static inline void
> +vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev) { }
>
> #endif /* CONFIG_VFIO_CXL_CORE */
>
^ permalink raw reply [flat|nested] 54+ messages in thread

* Re: [PATCH 10/20] vfio/cxl: CXL region management
2026-03-12 22:55 ` Dave Jiang
@ 2026-03-13 12:52 ` Jonathan Cameron
2026-03-18 17:48 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-13 12:52 UTC (permalink / raw)
To: Dave Jiang
Cc: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Thu, 12 Mar 2026 15:55:32 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > Add CXL region management for future guest access.
> >
> > Region Management makes use of APIs provided by CXL_CORE as below:
> >
> > CREATE_REGION flow:
> > 1. Validate request (size, decoder availability)
> > 2. Allocate HPA via cxl_get_hpa_freespace()
> > 3. Allocate DPA via cxl_request_dpa()
> > 4. Create region via cxl_create_region() - commits HDM decoder!
> > 5. Get HPA range via cxl_get_region_range()
> >
> > DESTROY_REGION flow:
> > 1. Detach decoder via cxl_decoder_detach()
> > 2. Free DPA via cxl_dpa_free()
> > 3. Release root decoder via cxl_put_root_decoder()
> >
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
A few additional comments from me.
> > ---
> > drivers/vfio/pci/cxl/vfio_cxl_core.c | 118 ++++++++++++++++++++++++++-
> > drivers/vfio/pci/cxl/vfio_cxl_priv.h | 5 ++
> > drivers/vfio/pci/vfio_pci_priv.h | 8 ++
> > 3 files changed, 130 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > index 2da6da1c0605..9c71f592e74e 100644
> > --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > @@ -126,6 +126,112 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
> > return 0;
> > }
> >
> > +int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev, resource_size_t size)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + resource_size_t max_size;
> > + int ret;
> > +
> > + if (cxl->precommitted)
> > + return 0;
> > +
> > + cxl->cxlrd = cxl_get_hpa_freespace(cxl->cxlmd, 1,
> > + CXL_DECODER_F_RAM |
> > + CXL_DECODER_F_TYPE2,
> > + &max_size);
>
> Not sure what the VFIO subsystem's policy is on scope-based resource cleanup, but a __free() here can get you out of managing the put() of the root decoder.
>
> > + if (IS_ERR(cxl->cxlrd))
> > + return PTR_ERR(cxl->cxlrd);
> > +
> > + /* Insufficient HPA space */
> > + if (max_size < size) {
> > + cxl_put_root_decoder(cxl->cxlrd);
> > + cxl->cxlrd = NULL;
Similar to other cases, I'd defer assigning things into cxl until the point
where there are no more error paths. Use local variables until then.
(That would fit with using __free() as well, which I'd also favor if
accepted in VFIO.)
> > + return -ENOSPC;
> > + }
> > +
> > + cxl->cxled = cxl_request_dpa(cxl->cxlmd, CXL_PARTMODE_RAM, size);
>
> Same comment here about __free().
>
> > + if (IS_ERR(cxl->cxled)) {
> > + ret = PTR_ERR(cxl->cxled);
> > + goto err_free_hpa;
> > + }
> > +
> > + cxl->region = cxl_create_region(cxl->cxlrd, &cxl->cxled, 1);
> > + if (IS_ERR(cxl->region)) {
> > + ret = PTR_ERR(cxl->region);
You carefully NULL this in vfio_cxl_destroy_cxl_region(), but if you fail
here you end up with it containing an ERR_PTR(). I'd avoid that by
using a local variable and only assigning cxl->region after this
succeeds.
> > + goto err_free_dpa;
> > + }
> > +
> > + return 0;
> > +
> > +err_free_dpa:
> > + cxl_dpa_free(cxl->cxled);
> > +err_free_hpa:
> > + if (cxl->cxlrd)
> > + cxl_put_root_decoder(cxl->cxlrd);
> > +
> > + return ret;
> > +}
> > +
> > +void vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > +
> > + if (!cxl->region)
> > + return;
> > +
> > + cxl_unregister_region(cxl->region);
> > + cxl->region = NULL;
> > +
> > + if (cxl->precommitted)
> > + return;
> > +
> > + cxl_dpa_free(cxl->cxled);
> > + cxl_put_root_decoder(cxl->cxlrd);
> > +}
> > +
> > +static int vfio_cxl_create_region_helper(struct vfio_pci_core_device *vdev,
> > + resource_size_t capacity)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + struct pci_dev *pdev = vdev->pdev;
> > + int ret;
> > +
> > + if (cxl->precommitted) {
> > + cxl->cxled = cxl_get_committed_decoder(cxl->cxlmd,
> > + &cxl->region);
> > + if (IS_ERR(cxl->cxled))
> > + return PTR_ERR(cxl->cxled);
> > + } else {
> > + ret = vfio_cxl_create_cxl_region(vdev, capacity);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + if (cxl->region) {
>
> Maybe if you do 'if (!cxl->region)' first and just exit, then you don't need to indent the normal code path.
>
> > + struct range range;
> > +
> > + ret = cxl_get_region_range(cxl->region, &range);
> > + if (ret)
> > + goto failed;
> > +
> > + cxl->region_hpa = range.start;
> > + cxl->region_size = range_len(&range);
> > +
> > + pci_dbg(pdev, "Precommitted decoder: HPA 0x%llx size %lu MB\n",
> > + cxl->region_hpa, cxl->region_size >> 20);
> > + } else {
> > + pci_err(pdev, "Failed to create CXL region\n");
> > + ret = -ENODEV;
> > + goto failed;
> > + }
> > +
> > + return 0;
> > +
> > +failed:
> > + vfio_cxl_destroy_cxl_region(vdev);
Little bit of refactoring and this could be replaced with __free() magic.
> > + return ret;
> > +}
> > +
> > /**
> > * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2 device
> > * @vdev: VFIO PCI device
> > @@ -172,6 +278,12 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
> >
> > pci_disable_device(pdev);
> >
> > + ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
>
> Maybe a comment on why this size?
:) I wondered that as well. I'm guessing your BIOS isn't always providing
the decoder and this lets you test.
>
> DJ
>
> > + if (ret)
> > + goto failed;
> > +
> > + cxl->precommitted = true;
> > +
> > return;
> >
> > failed:
> > @@ -181,6 +293,10 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
> >
> > void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
> > {
> > - if (!vdev->cxl)
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
Do that in the earlier patch to reduce churn a tiny bit.
> > +
> > + if (!cxl || !cxl->region)
> > return;
> > +
> > + vfio_cxl_destroy_cxl_region(vdev);
> > }
^ permalink raw reply [flat|nested] 54+ messages in thread

* RE: [PATCH 10/20] vfio/cxl: CXL region management
2026-03-13 12:52 ` Jonathan Cameron
@ 2026-03-18 17:48 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 17:48 UTC (permalink / raw)
To: Jonathan Cameron, Dave Jiang
Cc: Aniket Agashe, Ankit Agrawal, Alex Williamson, Vikram Sethi,
Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com, jgg@ziepe.ca,
Yishai Hadas, kevin.tian@intel.com, Neo Jia, Tarun Gupta (SW-GPU),
Zhi Wang, Krishnakant Jaju, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 13 March 2026 18:23
> To: Dave Jiang <dave.jiang@intel.com>
> Cc: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; alison.schofield@intel.com; vishal.l.verma@intel.com;
> ira.weiny@intel.com; dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas
> <yishaih@nvidia.com>; kevin.tian@intel.com; Neo Jia <cjia@nvidia.com>;
> Tarun Gupta (SW-GPU) <targupta@nvidia.com>; Zhi Wang <zhiw@nvidia.com>;
> Krishnakant Jaju <kjaju@nvidia.com>; linux-kernel@vger.kernel.org; linux-
> cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 10/20] vfio/cxl: CXL region management
>
>
> On Thu, 12 Mar 2026 15:55:32 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
> > On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > > From: Manish Honap <mhonap@nvidia.com>
> > >
> > > Add CXL region management for future guest access.
> > >
> > > Region Management makes use of APIs provided by CXL_CORE as below:
> > >
> > > CREATE_REGION flow:
> > > 1. Validate request (size, decoder availability)
> > > 2. Allocate HPA via cxl_get_hpa_freespace()
> > > 3. Allocate DPA via cxl_request_dpa()
> > > 4. Create region via cxl_create_region() - commits HDM decoder!
> > > 5. Get HPA range via cxl_get_region_range()
> > >
> > > DESTROY_REGION flow:
> > > 1. Detach decoder via cxl_decoder_detach()
> > > 2. Free DPA via cxl_dpa_free()
> > > 3. Release root decoder via cxl_put_root_decoder()
> > >
> > > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> A few additional comments from me.
>
> > > ---
> > > drivers/vfio/pci/cxl/vfio_cxl_core.c | 118 ++++++++++++++++++++++++++-
> > > drivers/vfio/pci/cxl/vfio_cxl_priv.h | 5 ++
> > > drivers/vfio/pci/vfio_pci_priv.h | 8 ++
> > > 3 files changed, 130 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > index 2da6da1c0605..9c71f592e74e 100644
> > > --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > @@ -126,6 +126,112 @@ static int vfio_cxl_setup_regs(struct
> vfio_pci_core_device *vdev)
> > > return 0;
> > > }
> > >
> > > +int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev,
> > > +			       resource_size_t size)
> > > +{
> > > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > > + resource_size_t max_size;
> > > + int ret;
> > > +
> > > + if (cxl->precommitted)
> > > + return 0;
> > > +
> > > + cxl->cxlrd = cxl_get_hpa_freespace(cxl->cxlmd, 1,
> > > + CXL_DECODER_F_RAM |
> > > + CXL_DECODER_F_TYPE2,
> > > + &max_size);
> >
> > Not sure what VFIO subsystem's policy is on scoped base resource
> cleanup, but a __free() here can get you out of managing put() of the root
> decoder.
After having a word with Alex for this, I have added __free macros.
> >
> > > + if (IS_ERR(cxl->cxlrd))
> > > + return PTR_ERR(cxl->cxlrd);
> > > +
> > > + /* Insufficient HPA space */
> > > + if (max_size < size) {
> > > + cxl_put_root_decoder(cxl->cxlrd);
> > > + cxl->cxlrd = NULL;
> Similar to other cases, I'd keep assigning stuff in cxl to the point where
> there are no more error paths. Use local variables until then.
> (that would fit with using __free() as well which I'd also favor if
> accepted in VFIO).
>
Yes, addressed.
> > > + return -ENOSPC;
> > > + }
> > > +
> > > + cxl->cxled = cxl_request_dpa(cxl->cxlmd, CXL_PARTMODE_RAM,
> > > + size);
> >
> > Same comment here about __free().
> >
> > > + if (IS_ERR(cxl->cxled)) {
> > > + ret = PTR_ERR(cxl->cxled);
> > > + goto err_free_hpa;
> > > + }
> > > +
> > > + cxl->region = cxl_create_region(cxl->cxlrd, &cxl->cxled, 1);
> > > + if (IS_ERR(cxl->region)) {
> > > + ret = PTR_ERR(cxl->region);
>
> You carefully NULL this in vfio_cxl_destroy_region, but if you fail here
> you end up with it containing an ERR_PTR(). I'd avoid that by using a
> local variable and only assigning cxl->region after this succeeds.
Agreed.
>
> > > + goto err_free_dpa;
> > > + }
> > > +
> > > + return 0;
> > > +
> > > +err_free_dpa:
> > > + cxl_dpa_free(cxl->cxled);
> > > +err_free_hpa:
> > > + if (cxl->cxlrd)
> > > + cxl_put_root_decoder(cxl->cxlrd);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +void vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev)
> > > +{
> > > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > > +
> > > + if (!cxl->region)
> > > + return;
> > > +
> > > + cxl_unregister_region(cxl->region);
> > > + cxl->region = NULL;
> > > +
> > > + if (cxl->precommitted)
> > > + return;
> > > +
> > > + cxl_dpa_free(cxl->cxled);
> > > + cxl_put_root_decoder(cxl->cxlrd);
> > > +}
> > > +
> > > +static int vfio_cxl_create_region_helper(struct vfio_pci_core_device *vdev,
> > > +					 resource_size_t capacity)
> > > +{
> > > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + int ret;
> > > +
> > > + if (cxl->precommitted) {
> > > + cxl->cxled = cxl_get_committed_decoder(cxl->cxlmd,
> > > + &cxl->region);
> > > + if (IS_ERR(cxl->cxled))
> > > + return PTR_ERR(cxl->cxled);
> > > + } else {
> > > + ret = vfio_cxl_create_cxl_region(vdev, capacity);
> > > + if (ret)
> > > + return ret;
> > > + }
> > > +
> > > + if (cxl->region) {
> >
> > Maybe if you do 'if (!cxl->region)' first and just exit, then you don't
> need to indent the normal code path.
Okay, I will change this.
> >
> > > + struct range range;
> > > +
> > > + ret = cxl_get_region_range(cxl->region, &range);
> > > + if (ret)
> > > + goto failed;
> > > +
> > > + cxl->region_hpa = range.start;
> > > + cxl->region_size = range_len(&range);
> > > +
> > > + pci_dbg(pdev, "Precommitted decoder: HPA 0x%llx size %lu
> MB\n",
> > > + cxl->region_hpa, cxl->region_size >> 20);
> > > + } else {
> > > + pci_err(pdev, "Failed to create CXL region\n");
> > > + ret = -ENODEV;
> > > + goto failed;
> > > + }
> > > +
> > > + return 0;
> > > +
> > > +failed:
> > > + vfio_cxl_destroy_cxl_region(vdev);
>
> Little bit of refactoring and this could be replaced with __free() magic.
>
> > > + return ret;
> > > +}
> > > +
> > > /**
> > > * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2
> device
> > > * @vdev: VFIO PCI device
> > > @@ -172,6 +278,12 @@ void vfio_pci_cxl_detect_and_init(struct
> > > vfio_pci_core_device *vdev)
> > >
> > > pci_disable_device(pdev);
> > >
> > > + ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
> >
> > Maybe a comment on why this size?
> :) I wondered that as well. I'm guessing your bios isn't always
> providing the decoder and this lets you test.
This was added for handling the patch organization so that the patches
are cut at a short reviewable boundary. It seems this patch is not
cut correctly.
I have removed this temporary size and refactored the patches to
incorporate firmware committed decoder size here.
>
>
> >
> > DJ
> >
> > > + if (ret)
> > > + goto failed;
> > > +
> > > + cxl->precommitted = true;
> > > +
> > > return;
> > >
> > > failed:
> > > @@ -181,6 +293,10 @@ void vfio_pci_cxl_detect_and_init(struct
> > > vfio_pci_core_device *vdev)
> > >
> > > void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
> > > {
> > > - if (!vdev->cxl)
> > > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
>
> Do that in the earlier patch to reduce churn a tiny bit.
Okay, agreed.
>
> > > +
> > > + if (!cxl || !cxl->region)
> > > return;
> > > +
> > > + vfio_cxl_destroy_cxl_region(vdev);
> > > }
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 11/20] vfio/cxl: Expose DPA memory region to userspace with fault+zap mmap
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (9 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 10/20] vfio/cxl: CXL region management mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-13 17:07 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 12/20] vfio/pci: Export config access helpers mhonap
` (8 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
To directly access the device memory, a CXL region is required. For
the userspace (e.g. QEMU) to access the CXL region, the region is
required to be exposed via VFIO interfaces.
Introduce a new VFIO device region and region ops to expose the created
CXL region. Introduce a new sub region type for userspace to identify
a CXL region.
CXL region lifecycle:
- The CXL memory region is registered with VFIO layer during
vfio_pci_open_device
- mmap() establishes the VMA with vm_ops but inserts no PTEs
- Each guest page fault calls vfio_cxl_region_page_fault() which
inserts a single PFN under the memory_lock read side
- On device reset, vfio_cxl_zap_region_locked() sets region_active=false
and calls unmap_mapping_range() to invalidate all DPA PTEs atomically
while holding memory_lock for writing
- Faults racing with reset see region_active==false and return
VM_FAULT_SIGBUS
- vfio_cxl_reactivate_region() restores region_active after successful
hardware reset
Also integrate the zap/reactivate calls into vfio_pci_ioctl_reset() so
that FLR correctly invalidates DPA mappings and restores them on success.
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 222 +++++++++++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 2 +
drivers/vfio/pci/vfio_pci.c | 9 ++
drivers/vfio/pci/vfio_pci_core.c | 11 ++
drivers/vfio/pci/vfio_pci_priv.h | 13 ++
5 files changed, 257 insertions(+)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 9c71f592e74e..03846bd11c8a 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -44,6 +44,7 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
cxl = vdev->cxl;
cxl->dvsec = dvsec;
+ cxl->dpa_region_idx = -1;
pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
&cap_word);
@@ -300,3 +301,224 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
vfio_cxl_destroy_cxl_region(vdev);
}
+
+/*
+ * Fault handler for the DPA region VMA. Called under mm->mmap_lock read
+ * side by the fault path. We take memory_lock read side here to exclude
+ * the write-side held by vfio_cxl_zap_region_locked() during reset.
+ */
+static vm_fault_t vfio_cxl_region_page_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct vfio_pci_core_device *vdev = vma->vm_private_data;
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ unsigned long pfn;
+
+ guard(rwsem_read)(&vdev->memory_lock);
+
+ if (!READ_ONCE(cxl->region_active))
+ return VM_FAULT_SIGBUS;
+
+ pfn = PHYS_PFN(cxl->region_hpa) +
+ ((vmf->address - vma->vm_start) >> PAGE_SHIFT);
+
+ /*
+ * Scrub the page via the kernel ioremap_cache mapping before inserting
+ * the user PFN. Prevent the stale device data from leaking across VFIO
+ * device open/close boundaries.
+ */
+ memset_io((u8 __iomem *)cxl->region_vaddr +
+ ((pfn - PHYS_PFN(cxl->region_hpa)) << PAGE_SHIFT),
+ 0, PAGE_SIZE);
+
+ return vmf_insert_pfn(vma, vmf->address, pfn);
+}
+
+static const struct vm_operations_struct vfio_cxl_region_vm_ops = {
+ .fault = vfio_cxl_region_page_fault,
+};
+
+static int vfio_cxl_region_mmap(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region,
+ struct vm_area_struct *vma)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ unsigned long req_len;
+
+ if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+ return -EINVAL;
+
+ if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len))
+ return -EOVERFLOW;
+
+ if (req_len > cxl->region_size)
+ return -EINVAL;
+
+ /*
+ * Do not insert PTEs here (no remap_pfn_range). PTEs are inserted
+ * lazily on first fault via vfio_cxl_region_page_fault(). This
+ * allows vfio_cxl_zap_region_locked() to safely invalidate them
+ * during device reset without any userspace cooperation.
+ * Leave vm_page_prot at its default.
+ */
+
+ vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+ vma->vm_private_data = vdev;
+ vma->vm_ops = &vfio_cxl_region_vm_ops;
+
+ return 0;
+}
+
+/*
+ * vfio_cxl_zap_region_locked - Invalidate all DPA region PTEs.
+ *
+ * Must be called with vdev->memory_lock held for writing. Sets
+ * region_active=false before zapping so any fault racing with zap sees
+ * the inactive state and returns VM_FAULT_SIGBUS rather than inserting
+ * a stale PFN.
+ */
+void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_device *core_vdev = &vdev->vdev;
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ lockdep_assert_held_write(&vdev->memory_lock);
+
+ if (!cxl || cxl->dpa_region_idx < 0)
+ return;
+
+ WRITE_ONCE(cxl->region_active, false);
+ unmap_mapping_range(core_vdev->inode->i_mapping,
+ VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_NUM_REGIONS +
+ cxl->dpa_region_idx),
+ cxl->region_size, true);
+}
+
+/*
+ * vfio_cxl_reactivate_region - Re-enable DPA region after successful reset.
+ *
+ * Must be called with vdev->memory_lock held for writing. Re-reads the
+ * HDM decoder state from hardware (FLR cleared it) and sets region_active
+ * so that subsequent faults can re-insert PFNs without a new mmap.
+ */
+void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ lockdep_assert_held_write(&vdev->memory_lock);
+
+ if (!cxl)
+ return;
+}
+
+static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
+ char __user *buf, size_t count, loff_t *ppos,
+ bool iswrite)
+{
+ unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+ struct vfio_pci_cxl_state *cxl = core_dev->region[i].data;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+ guard(rwsem_read)(&core_dev->memory_lock);
+
+ if (!READ_ONCE(cxl->region_active))
+ return -EIO;
+
+ if (!count)
+ return 0;
+
+ return vfio_pci_core_do_io_rw(core_dev, false,
+ cxl->region_vaddr,
+ (char __user *)buf, pos, count,
+ 0, 0, iswrite, VFIO_PCI_IO_WIDTH_8);
+}
+
+static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+ struct vfio_pci_cxl_state *cxl = region->data;
+
+ if (cxl->region_vaddr) {
+ iounmap(cxl->region_vaddr);
+ cxl->region_vaddr = NULL;
+ }
+}
+
+static const struct vfio_pci_regops vfio_cxl_regops = {
+ .rw = vfio_cxl_region_rw,
+ .mmap = vfio_cxl_region_mmap,
+ .release = vfio_cxl_region_release,
+};
+
+int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 flags;
+ int ret;
+
+ if (!cxl)
+ return -ENODEV;
+
+ if (!cxl->region || cxl->region_vaddr)
+ return -ENODEV;
+
+ cxl->region_vaddr = ioremap_cache(cxl->region_hpa, cxl->region_size);
+ if (!cxl->region_vaddr)
+ return -ENOMEM;
+
+ flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+
+ ret = vfio_pci_core_register_dev_region(vdev,
+ PCI_VENDOR_ID_CXL |
+ VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+ VFIO_REGION_SUBTYPE_CXL,
+ &vfio_cxl_regops,
+ cxl->region_size, flags,
+ cxl);
+ if (ret) {
+ iounmap(cxl->region_vaddr);
+ cxl->region_vaddr = NULL;
+ return ret;
+ }
+
+ /*
+ * Cache the vdev->region[] index before activating the region.
+ * vfio_pci_core_register_dev_region() placed the new entry at
+ * vdev->region[num_regions - 1] and incremented num_regions.
+ * vfio_cxl_zap_region_locked() uses this to avoid scanning
+ * vdev->region[] on every FLR.
+ */
+ cxl->dpa_region_idx = vdev->num_regions - 1;
+ WRITE_ONCE(cxl->region_active, true);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_register_cxl_region);
+
+/**
+ * vfio_cxl_unregister_cxl_region - Undo vfio_cxl_register_cxl_region()
+ * @vdev: VFIO PCI device
+ *
+ * Marks the DPA region inactive so any racing fault returns VM_FAULT_SIGBUS
+ * and resets dpa_region_idx. Does NOT call release() or touch num_regions;
+ * vfio_pci_core_disable() will call the idempotent release() callback as
+ * normal during device close.
+ *
+ * Does NOT touch CXL subsystem state (cxl->region, cxl->cxled, cxl->cxlrd).
+ * The caller must call vfio_cxl_destroy_cxl_region() separately to release
+ * those objects.
+ */
+void vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || cxl->dpa_region_idx < 0)
+ return;
+
+ WRITE_ONCE(cxl->region_active, false);
+
+ cxl->dpa_region_idx = -1;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_unregister_cxl_region);
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 985680842a13..b870926bfb19 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -26,9 +26,11 @@ struct vfio_pci_cxl_state {
resource_size_t comp_reg_offset;
size_t comp_reg_size;
u32 hdm_count;
+ int dpa_region_idx;
u16 dvsec;
u8 comp_reg_bar;
bool precommitted;
+ bool region_active;
};
/*
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..d3138badeaa6 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -120,6 +120,15 @@ static int vfio_pci_open_device(struct vfio_device *core_vdev)
}
}
+ if (vdev->cxl) {
+ ret = vfio_cxl_register_cxl_region(vdev);
+ if (ret) {
+ pci_warn(pdev, "Failed to setup CXL region\n");
+ vfio_pci_core_disable(vdev);
+ return ret;
+ }
+ }
+
vfio_pci_core_finish_enable(vdev);
return 0;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b7364178e23d..48e0274c19aa 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1223,6 +1223,9 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
vfio_pci_zap_and_down_write_memory_lock(vdev);
+ /* Zap CXL DPA region PTEs before hardware reset clears HDM state */
+ vfio_cxl_zap_region_locked(vdev);
+
/*
* This function can be invoked while the power state is non-D0. If
* pci_try_reset_function() has been called while the power state is
@@ -1236,6 +1239,14 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
vfio_pci_dma_buf_move(vdev, true);
ret = pci_try_reset_function(vdev->pdev);
+
+ /*
+ * Re-enable DPA region if reset succeeded; fault handler will
+ * re-insert PFNs on next access without requiring a new mmap.
+ */
+ if (!ret)
+ vfio_cxl_reactivate_region(vdev);
+
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 818d99f098bf..441b4a47637a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -140,6 +140,10 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev,
resource_size_t size);
void vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev);
+int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev);
+void vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev);
+void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
+void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
#else
@@ -152,6 +156,15 @@ static inline int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev,
{ return 0; }
static inline void
vfio_cxl_destroy_cxl_region(struct vfio_pci_core_device *vdev) { }
+static inline int
+vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev)
+{ return 0; }
+static inline void
+vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev) { }
#endif /* CONFIG_VFIO_CXL_CORE */
--
2.25.1
^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: [PATCH 11/20] vfio/cxl: Expose DPA memory region to userspace with fault+zap mmap
2026-03-11 20:34 ` [PATCH 11/20] vfio/cxl: Expose DPA memory region to userspace with fault+zap mmap mhonap
@ 2026-03-13 17:07 ` Dave Jiang
2026-03-18 17:54 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-13 17:07 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
< --snip-- >
> +int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + u32 flags;
> + int ret;
> +
> + if (!cxl)
> + return -ENODEV;
> +
> + if (!cxl->region || cxl->region_vaddr)
> + return -ENODEV;
> +
> + cxl->region_vaddr = ioremap_cache(cxl->region_hpa, cxl->region_size);
Should this be using memremap_pages() family of call rather than ioremap() like how DAX does it? CXL mem regions are not MMIO regions.
DJ
^ permalink raw reply	[flat|nested] 54+ messages in thread
* RE: [PATCH 11/20] vfio/cxl: Expose DPA memory region to userspace with fault+zap mmap
2026-03-13 17:07 ` Dave Jiang
@ 2026-03-18 17:54 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 17:54 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang, Krishnakant Jaju,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 13 March 2026 22:38
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 11/20] vfio/cxl: Expose DPA memory region to userspace
> with fault+zap mmap
>
>
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> < --snip-- >
>
> > +int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + u32 flags;
> > + int ret;
> > +
> > + if (!cxl)
> > + return -ENODEV;
> > +
> > + if (!cxl->region || cxl->region_vaddr)
> > + return -ENODEV;
> > +
> > + cxl->region_vaddr = ioremap_cache(cxl->region_hpa,
> > + cxl->region_size);
>
> Should this be using memremap_pages() family of call rather than ioremap()
> like how DAX does it? CXL mem regions are not MMIO regions.
>
> DJ
Okay, I will check the DAX code to update this part.
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 12/20] vfio/pci: Export config access helpers
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (10 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 11/20] vfio/cxl: Expose DPA memory region to userspace with fault+zap mmap mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-11 20:34 ` [PATCH 13/20] vfio/cxl: Introduce HDM decoder register emulation framework mhonap
` (7 subsequent siblings)
19 siblings, 0 replies; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Promote vfio_raw_config_write() and vfio_raw_config_read() to non-static so
that the CXL DVSEC write handler in the next patch can call them.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/vfio_pci_config.c | 12 ++++++------
drivers/vfio/pci/vfio_pci_priv.h | 8 ++++++++
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index dc4e510e6e1b..79aaf270adb2 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -270,9 +270,9 @@ static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
}
/* Raw access skips any kind of virtualization */
-static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
- int count, struct perm_bits *perm,
- int offset, __le32 val)
+int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 val)
{
int ret;
@@ -283,9 +283,9 @@ static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
return count;
}
-static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
- int count, struct perm_bits *perm,
- int offset, __le32 *val)
+int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 *val)
{
int ret;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 441b4a47637a..8f440f9eaa0c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -37,6 +37,14 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
size_t count, loff_t *ppos, bool iswrite);
+int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 val);
+
+int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 *val);
+
ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
size_t count, loff_t *ppos, bool iswrite);
--
2.25.1
^ permalink raw reply related	[flat|nested] 54+ messages in thread
* [PATCH 13/20] vfio/cxl: Introduce HDM decoder register emulation framework
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (11 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 12/20] vfio/pci: Export config access helpers mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-13 19:05 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 14/20] vfio/cxl: Check media readiness and create CXL memdev mhonap
` (6 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Introduce an emulation framework to handle CXL MMIO register emulation
for CXL devices passed through to a VM.
A single compact __le32 array (comp_reg_virt) covers only the HDM
decoder register block (hdm_reg_size bytes, typically 256-512 bytes).
A new VFIO device region VFIO_REGION_SUBTYPE_CXL_COMP_REGS exposes
this array to userspace (QEMU) as a read-write region:
- Reads return the emulated state (comp_reg_virt[])
- Writes go through the HDM register write handlers and are
forwarded to hardware where appropriate
QEMU attaches a notify_change callback to this region. When the
COMMIT bit is written in a decoder CTRL register the callback
reads the BASE_LO/HI from the same region fd (emulated state) and
maps the DPA MemoryRegion at the correct GPA in system_memory.
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Makefile | 2 +-
drivers/vfio/pci/cxl/vfio_cxl_core.c | 36 ++-
drivers/vfio/pci/cxl/vfio_cxl_emu.c | 366 +++++++++++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 41 +++
drivers/vfio/pci/vfio_pci_priv.h | 7 +
5 files changed, 450 insertions(+), 2 deletions(-)
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index ecb0eacbc089..bef916495eae 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
+vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 03846bd11c8a..d2401871489d 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -45,6 +45,7 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
cxl = vdev->cxl;
cxl->dvsec = dvsec;
cxl->dpa_region_idx = -1;
+ cxl->comp_reg_region_idx = -1;
pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
&cap_word);
@@ -124,6 +125,10 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
cxl->comp_reg_offset = bar_offset;
cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
+ ret = vfio_cxl_setup_virt_regs(vdev);
+ if (ret)
+ return ret;
+
return 0;
}
@@ -281,12 +286,14 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
if (ret)
- goto failed;
+ goto regs_failed;
cxl->precommitted = true;
return;
+regs_failed:
+ vfio_cxl_clean_virt_regs(vdev);
failed:
devm_kfree(&pdev->dev, vdev->cxl);
vdev->cxl = NULL;
@@ -299,6 +306,7 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
if (!cxl || !cxl->region)
return;
+ vfio_cxl_clean_virt_regs(vdev);
vfio_cxl_destroy_cxl_region(vdev);
}
@@ -409,6 +417,32 @@ void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev)
if (!cxl)
return;
+
+ /*
+ * Re-initialise the emulated HDM comp_reg_virt[] from hardware.
+ * After FLR the decoder registers read as zero; mirror that in
+ * the emulated state so QEMU sees a clean slate.
+ */
+ vfio_cxl_reinit_comp_regs(vdev);
+
+ /*
+ * Only re-enable the DPA mmap if the hardware has actually
+ * re-committed decoder 0 after FLR. Read the COMMITTED bit from the
+ * freshly-re-snapshotted comp_reg_virt[] so we check the post-FLR
+ * hardware state, not stale pre-reset state.
+ *
+ * If COMMITTED is 0 (slow firmware re-commit path), leave
+ * region_active=false. Guest faults will return VM_FAULT_SIGBUS
+ * until the decoder is re-committed and the region is re-enabled.
+ */
+ if (cxl->precommitted && cxl->comp_reg_virt) {
+ u32 ctrl = le32_to_cpu(cxl->comp_reg_virt[
+ CXL_HDM_DECODER0_CTRL_OFFSET(0) /
+ CXL_REG_SIZE_DWORD]);
+
+ if (ctrl & CXL_HDM_DECODER_CTRL_COMMITTED_BIT)
+ WRITE_ONCE(cxl->region_active, true);
+ }
}
static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
new file mode 100644
index 000000000000..d5603c80fe51
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
@@ -0,0 +1,366 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/bitops.h>
+#include <linux/vfio_pci_core.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+/*
+ * comp_reg_virt[] layout:
+ * Index 0..N correspond to 32-bit registers at byte offset 0..hdm_reg_size-4
+ * within the HDM decoder capability block.
+ *
+ * Register layout within the HDM block (CXL spec 8.2.5.19):
+ * 0x00: HDM Decoder Capability
+ * 0x04: HDM Decoder Global Control
+ * 0x08: HDM Decoder Global Status
+ * 0x0c: (reserved)
+ * For each decoder N (N=0..hdm_count-1), at base 0x10 + N*0x20:
+ * +0x00: BASE_LO
+ * +0x04: BASE_HI
+ * +0x08: SIZE_LO
+ * +0x0c: SIZE_HI
+ * +0x10: CTRL
+ * +0x14: TARGET_LIST_LO
+ * +0x18: TARGET_LIST_HI
+ * +0x1c: (reserved)
+ */
+
+static inline __le32 *hdm_reg_ptr(struct vfio_pci_cxl_state *cxl, u32 off)
+{
+ /*
+ * off is byte offset within the HDM block; comp_reg_virt is indexed
+ * as an array of __le32.
+ */
+ return &cxl->comp_reg_virt[off / sizeof(__le32)];
+}
+
+static ssize_t virt_hdm_rev_reg_write(struct vfio_pci_core_device *vdev,
+ const __le32 *val32, u64 offset, u64 size)
+{
+ /* Discard writes to reserved registers. */
+ return size;
+}
+
+static ssize_t hdm_decoder_n_lo_write(struct vfio_pci_core_device *vdev,
+ const __le32 *val32, u64 offset, u64 size)
+{
+ u32 new_val = le32_to_cpu(*val32);
+
+ if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
+ return -EINVAL;
+
+ /* Bits [27:0] are reserved. */
+ new_val &= ~CXL_HDM_DECODER_BASE_LO_RESERVED_MASK;
+
+ *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
+
+ return size;
+}
+
+static ssize_t hdm_decoder_global_ctrl_write(struct vfio_pci_core_device *vdev,
+ const __le32 *val32, u64 offset, u64 size)
+{
+ u32 hdm_decoder_global_cap;
+ u32 new_val = le32_to_cpu(*val32);
+
+ if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
+ return -EINVAL;
+
+ /* Bits [31:2] are reserved. */
+ new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK;
+
+ /* Poison On Decode Error Enable bit is 0 and RO if not supported. */
+ hdm_decoder_global_cap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, 0));
+ if (!(hdm_decoder_global_cap & CXL_HDM_CAP_POISON_ON_DECODE_ERR_BIT))
+ new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT;
+
+ *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
+
+ return size;
+}
+
+/*
+ * hdm_decoder_n_ctrl_write - Write handler for HDM decoder CTRL register.
+ *
+ * The COMMIT bit (bit 9) is the key: setting it requests the hardware to
+ * lock the decoder. The emulated COMMITTED bit (bit 10) mirrors COMMIT
+ * immediately to allow QEMU's notify_change to detect the transition and
+ * map/unmap the DPA MemoryRegion in the guest address space.
+ *
+ * Note: the actual hardware HDM decoder programming (writing the real
+ * BASE/SIZE with host physical addresses) happens in the QEMU notify_change
+ * callback BEFORE this write reaches the hardware. This ordering is
+ * correct because vfio_region_write() calls notify_change() first.
+ */
+static ssize_t hdm_decoder_n_ctrl_write(struct vfio_pci_core_device *vdev,
+ const __le32 *val32, u64 offset, u64 size)
+{
+ u32 hdm_decoder_global_cap;
+ u32 ro_mask = CXL_HDM_DECODER_CTRL_RO_BITS_MASK;
+ u32 rev_mask = CXL_HDM_DECODER_CTRL_RESERVED_MASK;
+ u32 new_val = le32_to_cpu(*val32);
+ u32 cur_val;
+
+ if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
+ return -EINVAL;
+
+ cur_val = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, offset));
+ if (cur_val & CXL_HDM_DECODER_CTRL_COMMIT_LOCK_BIT)
+ return size;
+
+ hdm_decoder_global_cap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, 0));
+ ro_mask |= CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO;
+ rev_mask |= CXL_HDM_DECODER_CTRL_DEVICE_RESERVED;
+ if (!(hdm_decoder_global_cap & CXL_HDM_CAP_UIO_SUPPORTED_BIT))
+ rev_mask |= CXL_HDM_DECODER_CTRL_UIO_RESERVED;
+
+ new_val &= ~rev_mask;
+ cur_val &= ro_mask;
+ new_val = (new_val & ~ro_mask) | cur_val;
+
+ /*
+ * Mirror COMMIT → COMMITTED immediately in the emulated state.
+ * QEMU's notify_change (called before this write reaches hardware)
+ * reads COMMITTED from the region fd to detect commit transitions.
+ */
+ if (new_val & CXL_HDM_DECODER_CTRL_COMMIT_BIT)
+ new_val |= CXL_HDM_DECODER_CTRL_COMMITTED_BIT;
+ else
+ new_val &= ~CXL_HDM_DECODER_CTRL_COMMITTED_BIT;
+
+ *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
+
+ return size;
+}
+
+/*
+ * Dispatch table for COMP_REGS region writes. Indexed by byte offset within
+ * the HDM decoder block. Returns the appropriate write handler.
+ *
+ * Layout:
+ * 0x00 HDM Decoder Capability (RO)
+ * 0x04 HDM Global Control (RW with reserved masking)
+ * 0x08 HDM Global Status (RO)
+ * 0x0c (reserved) (ignored)
+ * Per decoder N, base = 0x10 + N*0x20:
+ * base+0x00 BASE_LO (RW, [27:0] reserved)
+ * base+0x04 BASE_HI (RW)
+ * base+0x08 SIZE_LO (RW, [27:0] reserved)
+ * base+0x0c SIZE_HI (RW)
+ * base+0x10 CTRL (RW, complex rules)
+ * base+0x14 TARGET_LIST_LO (ignored for Type-2)
+ * base+0x18 TARGET_LIST_HI (ignored for Type-2)
+ * base+0x1c (reserved) (ignored)
+ */
+static ssize_t comp_regs_dispatch_write(struct vfio_pci_core_device *vdev,
+ u32 off, const __le32 *val32, u32 size)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 dec_base, dec_off;
+
+ /* HDM Decoder Capability (0x00): RO */
+ if (off == 0x00)
+ return size;
+
+ /* HDM Global Control (0x04) */
+ if (off == CXL_HDM_DECODER_GLOBAL_CTRL_OFFSET)
+ return hdm_decoder_global_ctrl_write(vdev, val32, off, size);
+
+ /* HDM Global Status (0x08): RO */
+ if (off == 0x08)
+ return size;
+
+ /* Per-decoder registers start at 0x10, stride 0x20 */
+ if (off < CXL_HDM_DECODER_FIRST_BLOCK_OFFSET)
+ return size; /* reserved gap */
+
+ dec_base = CXL_HDM_DECODER_FIRST_BLOCK_OFFSET;
+ dec_off = (off - dec_base) % CXL_HDM_DECODER_BLOCK_STRIDE;
+
+ switch (dec_off) {
+ case CXL_HDM_DECODER_N_BASE_LOW_OFFSET: /* BASE_LO */
+ case CXL_HDM_DECODER_N_SIZE_LOW_OFFSET: /* SIZE_LO */
+ return hdm_decoder_n_lo_write(vdev, val32, off, size);
+ case CXL_HDM_DECODER_N_BASE_HIGH_OFFSET: /* BASE_HI */
+ case CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET: /* SIZE_HI */
+ /* Full 32-bit write, no reserved bits */
+ *hdm_reg_ptr(cxl, off) = *val32;
+ return size;
+ case CXL_HDM_DECODER_N_CTRL_OFFSET: /* CTRL */
+ return hdm_decoder_n_ctrl_write(vdev, val32, off, size);
+ case CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET:
+ case CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET:
+ case CXL_HDM_DECODER_N_REV_OFFSET:
+ return virt_hdm_rev_reg_write(vdev, val32, off, size);
+ default:
+ return size;
+ }
+}
+
+/*
+ * vfio_cxl_comp_regs_rw - regops rw handler for VFIO_REGION_SUBTYPE_CXL_COMP_REGS.
+ *
+ * Reads return the emulated HDM state (comp_reg_virt[]).
+ * Writes go through comp_regs_dispatch_write() for bit-field enforcement.
+ * Only naturally aligned 4-byte accesses are supported (hardware requirement).
+ */
+static ssize_t vfio_cxl_comp_regs_rw(struct vfio_pci_core_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ size_t done = 0;
+
+ if (!count)
+ return 0;
+
+ /* Clamp to region size */
+ if (pos >= cxl->hdm_reg_size)
+ return -EINVAL;
+ count = min(count, (size_t)(cxl->hdm_reg_size - pos));
+
+ while (done < count) {
+ u32 sz = min_t(u32, CXL_REG_SIZE_DWORD, count - done);
+ u32 off = pos + done;
+ __le32 v;
+
+ /* Enforce 4-byte alignment */
+ if (sz < CXL_REG_SIZE_DWORD || (off & 0x3))
+ return done ? (ssize_t)done : -EINVAL;
+
+ if (iswrite) {
+ if (copy_from_user(&v, buf + done, sizeof(v)))
+ return done ? (ssize_t)done : -EFAULT;
+ comp_regs_dispatch_write(vdev, off, &v, sizeof(v));
+ } else {
+ v = *hdm_reg_ptr(cxl, off);
+ if (copy_to_user(buf + done, &v, sizeof(v)))
+ return done ? (ssize_t)done : -EFAULT;
+ }
+ done += sizeof(v);
+ }
+
+ *ppos += done;
+ return done;
+}
+
+static void vfio_cxl_comp_regs_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+ /* comp_reg_virt is freed in vfio_cxl_clean_virt_regs(), not here. */
+}
+
+static const struct vfio_pci_regops vfio_cxl_comp_regs_ops = {
+ .rw = vfio_cxl_comp_regs_rw,
+ .release = vfio_cxl_comp_regs_release,
+};
+
+/*
+ * vfio_cxl_setup_virt_regs - Allocate emulated HDM register state.
+ *
+ * Allocates comp_reg_virt as a compact __le32 array covering only
+ * hdm_reg_size bytes of HDM decoder registers. The initial values
+ * are read from hardware via the BAR ioremap established by the caller.
+ *
+ * DVSEC state is accessed via vdev->vconfig (see the following patch).
+ */
+int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ size_t nregs;
+
+ if (WARN_ON(!cxl->hdm_reg_size))
+ return -EINVAL;
+
+ if (pci_resource_len(vdev->pdev, cxl->comp_reg_bar) <
+ cxl->comp_reg_offset + cxl->hdm_reg_offset + cxl->hdm_reg_size)
+ return -ENODEV;
+
+ nregs = cxl->hdm_reg_size / sizeof(__le32);
+ cxl->comp_reg_virt = kcalloc(nregs, sizeof(__le32), GFP_KERNEL);
+ if (!cxl->comp_reg_virt)
+ return -ENOMEM;
+
+ /* Establish persistent mapping; kept alive until vfio_cxl_clean_virt_regs(). */
+ cxl->hdm_iobase = ioremap(pci_resource_start(vdev->pdev, cxl->comp_reg_bar) +
+ cxl->comp_reg_offset + cxl->hdm_reg_offset,
+ cxl->hdm_reg_size);
+ if (!cxl->hdm_iobase) {
+ kfree(cxl->comp_reg_virt);
+ cxl->comp_reg_virt = NULL;
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+/*
+ * Called with memory_lock write side held (from vfio_cxl_reactivate_region).
+ * Uses the pre-established hdm_iobase rather than calling ioremap() under
+ * the lock; ioremap() can sleep, which would deadlock on PREEMPT_RT.
+ */
+void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ size_t i, nregs;
+
+ if (!cxl || !cxl->comp_reg_virt || !cxl->hdm_iobase)
+ return;
+
+ nregs = cxl->hdm_reg_size / sizeof(__le32);
+
+ for (i = 0; i < nregs; i++)
+ cxl->comp_reg_virt[i] =
+ cpu_to_le32(readl(cxl->hdm_iobase + i * sizeof(__le32)));
+}
+
+void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (cxl->hdm_iobase) {
+ iounmap(cxl->hdm_iobase);
+ cxl->hdm_iobase = NULL;
+ }
+ kfree(cxl->comp_reg_virt);
+ cxl->comp_reg_virt = NULL;
+}
+
+/*
+ * vfio_cxl_register_comp_regs_region - Register the COMP_REGS device region.
+ *
+ * Exposes the emulated HDM decoder register state as a VFIO device region
+ * with type VFIO_REGION_SUBTYPE_CXL_COMP_REGS. QEMU attaches a
+ * notify_change callback to this region to intercept HDM COMMIT writes
+ * and map the DPA MemoryRegion at the appropriate GPA.
+ *
+ * The region is read+write only (no mmap) to ensure all accesses pass
+ * through comp_regs_dispatch_write() for proper bit-field enforcement.
+ */
+int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ int ret;
+
+ if (!cxl || !cxl->comp_reg_virt)
+ return -ENODEV;
+
+ ret = vfio_pci_core_register_dev_region(vdev,
+ PCI_VENDOR_ID_CXL |
+ VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+ VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
+ &vfio_cxl_comp_regs_ops,
+ cxl->hdm_reg_size, flags, cxl);
+ if (!ret)
+ cxl->comp_reg_region_idx = vdev->num_regions - 1;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_register_comp_regs_region);
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index b870926bfb19..4f2637874e9d 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -25,14 +25,51 @@ struct vfio_pci_cxl_state {
size_t hdm_reg_size;
resource_size_t comp_reg_offset;
size_t comp_reg_size;
+ __le32 *comp_reg_virt;
+ void __iomem *hdm_iobase;
u32 hdm_count;
int dpa_region_idx;
+ int comp_reg_region_idx;
u16 dvsec;
u8 comp_reg_bar;
bool precommitted;
bool region_active;
};
+/* Register access sizes */
+#define CXL_REG_SIZE_WORD 2
+#define CXL_REG_SIZE_DWORD 4
+
+/* HDM Decoder - register offsets (CXL 2.0 8.2.5.19) */
+#define CXL_HDM_DECODER_GLOBAL_CTRL_OFFSET 0x4
+#define CXL_HDM_DECODER_FIRST_BLOCK_OFFSET 0x10
+#define CXL_HDM_DECODER_BLOCK_STRIDE 0x20
+#define CXL_HDM_DECODER_N_BASE_LOW_OFFSET 0x0
+#define CXL_HDM_DECODER_N_BASE_HIGH_OFFSET 0x4
+#define CXL_HDM_DECODER_N_SIZE_LOW_OFFSET 0x8
+#define CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET 0xc
+#define CXL_HDM_DECODER_N_CTRL_OFFSET 0x10
+#define CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET 0x14
+#define CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET 0x18
+#define CXL_HDM_DECODER_N_REV_OFFSET 0x1c
+
+/* HDM Decoder Global Capability / Control - bit definitions */
+#define CXL_HDM_CAP_POISON_ON_DECODE_ERR_BIT BIT(10)
+#define CXL_HDM_CAP_UIO_SUPPORTED_BIT BIT(13)
+
+/* HDM Decoder N Control */
+#define CXL_HDM_DECODER_CTRL_COMMIT_LOCK_BIT BIT(8)
+#define CXL_HDM_DECODER_CTRL_COMMIT_BIT BIT(9)
+#define CXL_HDM_DECODER_CTRL_COMMITTED_BIT BIT(10)
+#define CXL_HDM_DECODER_CTRL_RO_BITS_MASK (BIT(10) | BIT(11))
+#define CXL_HDM_DECODER_CTRL_RESERVED_MASK (BIT(15) | GENMASK(31, 28))
+#define CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO BIT(12)
+#define CXL_HDM_DECODER_CTRL_DEVICE_RESERVED (GENMASK(19, 16) | GENMASK(23, 20))
+#define CXL_HDM_DECODER_CTRL_UIO_RESERVED (BIT(14) | GENMASK(27, 24))
+#define CXL_HDM_DECODER_BASE_LO_RESERVED_MASK GENMASK(27, 0)
+#define CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK GENMASK(31, 2)
+#define CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT BIT(0)
+
/*
* CXL DVSEC for CXL Devices - register offsets within the DVSEC
* (CXL 2.0+ 8.1.3).
@@ -41,4 +78,8 @@ struct vfio_pci_cxl_state {
#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
#define CXL_DVSEC_MEM_CAPABLE BIT(2)
+int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev);
+void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev);
+void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
+
#endif /* __LINUX_VFIO_CXL_PRIV_H */
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 8f440f9eaa0c..f8db9a05c033 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -152,6 +152,8 @@ int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev);
void vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev);
void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
+int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
+void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
#else
@@ -173,6 +175,11 @@ static inline void
vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev) { }
static inline void
vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev) { }
+static inline int
+vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
+{ return 0; }
+static inline void
+vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev) { }
#endif /* CONFIG_VFIO_CXL_CORE */
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread

* Re: [PATCH 13/20] vfio/cxl: Introduce HDM decoder register emulation framework
2026-03-11 20:34 ` [PATCH 13/20] vfio/cxl: Introduce HDM decoder register emulation framework mhonap
@ 2026-03-13 19:05 ` Dave Jiang
2026-03-18 17:58 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-13 19:05 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> Introduce an emulation framework to handle CXL MMIO register emulation
> for CXL devices passed through to a VM.
>
> A single compact __le32 array (comp_reg_virt) covers only the HDM
> decoder register block (hdm_reg_size bytes, typically 256-512 bytes).
>
> A new VFIO device region VFIO_REGION_SUBTYPE_CXL_COMP_REGS exposes
> this array to userspace (QEMU) as a read-write region:
> - Reads return the emulated state (comp_reg_virt[])
> - Writes go through the HDM register write handlers and are
> forwarded to hardware where appropriate
>
> QEMU attaches a notify_change callback to this region. When the
> COMMIT bit is written in a decoder CTRL register the callback
> reads the BASE_LO/HI from the same region fd (emulated state) and
> maps the DPA MemoryRegion at the correct GPA in system_memory.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/vfio/pci/Makefile | 2 +-
> drivers/vfio/pci/cxl/vfio_cxl_core.c | 36 ++-
> drivers/vfio/pci/cxl/vfio_cxl_emu.c | 366 +++++++++++++++++++++++++++
> drivers/vfio/pci/cxl/vfio_cxl_priv.h | 41 +++
> drivers/vfio/pci/vfio_pci_priv.h | 7 +
> 5 files changed, 450 insertions(+), 2 deletions(-)
> create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
>
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index ecb0eacbc089..bef916495eae 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,7 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0-only
>
> vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> -vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
> +vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o
> vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
> obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> index 03846bd11c8a..d2401871489d 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> @@ -45,6 +45,7 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> cxl = vdev->cxl;
> cxl->dvsec = dvsec;
> cxl->dpa_region_idx = -1;
> + cxl->comp_reg_region_idx = -1;
>
> pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> &cap_word);
> @@ -124,6 +125,10 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
> cxl->comp_reg_offset = bar_offset;
> cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
>
> + ret = vfio_cxl_setup_virt_regs(vdev);
> + if (ret)
> + return ret;
> +
> return 0;
> }
>
> @@ -281,12 +286,14 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
>
> ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
> if (ret)
> - goto failed;
> + goto regs_failed;
>
> cxl->precommitted = true;
>
> return;
>
> +regs_failed:
> + vfio_cxl_clean_virt_regs(vdev);
> failed:
> devm_kfree(&pdev->dev, vdev->cxl);
> vdev->cxl = NULL;
> @@ -299,6 +306,7 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
> if (!cxl || !cxl->region)
> return;
>
> + vfio_cxl_clean_virt_regs(vdev);
> vfio_cxl_destroy_cxl_region(vdev);
> }
>
> @@ -409,6 +417,32 @@ void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev)
>
> if (!cxl)
> return;
> +
> + /*
> + * Re-initialise the emulated HDM comp_reg_virt[] from hardware.
> + * After FLR the decoder registers read as zero; mirror that in
> + * the emulated state so QEMU sees a clean slate.
> + */
> + vfio_cxl_reinit_comp_regs(vdev);
> +
> + /*
> + * Only re-enable the DPA mmap if the hardware has actually
> + * re-committed decoder 0 after FLR. Read the COMMITTED bit from the
> + * freshly-re-snapshotted comp_reg_virt[] so we check the post-FLR
> + * hardware state, not stale pre-reset state.
> + *
> + * If COMMITTED is 0 (slow firmware re-commit path), leave
> + * region_active=false. Guest faults will return VM_FAULT_SIGBUS
> + * until the decoder is re-committed and the region is re-enabled.
> + */
> + if (cxl->precommitted && cxl->comp_reg_virt) {
> + u32 ctrl = le32_to_cpu(cxl->comp_reg_virt[
> + CXL_HDM_DECODER0_CTRL_OFFSET(0) /
> + CXL_REG_SIZE_DWORD]);
> +
> + if (ctrl & CXL_HDM_DECODER_CTRL_COMMITTED_BIT)
> + WRITE_ONCE(cxl->region_active, true);
> + }
> }
>
> static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
> new file mode 100644
> index 000000000000..d5603c80fe51
> --- /dev/null
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
> @@ -0,0 +1,366 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/bitops.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "../vfio_pci_priv.h"
> +#include "vfio_cxl_priv.h"
> +
> +/*
> + * comp_reg_virt[] layout:
> + * Index 0..N correspond to 32-bit registers at byte offset 0..hdm_reg_size-4
> + * within the HDM decoder capability block.
> + *
> + * Register layout within the HDM block (CXL spec 8.2.5.19):
> + * 0x00: HDM Decoder Capability
> + * 0x04: HDM Decoder Global Control
> + * 0x08: HDM Decoder Global Status
> + * 0x0c: (reserved)
> + * For each decoder N (N=0..hdm_count-1), at base 0x10 + N*0x20:
> + * +0x00: BASE_LO
> + * +0x04: BASE_HI
> + * +0x08: SIZE_LO
> + * +0x0c: SIZE_HI
> + * +0x10: CTRL
> + * +0x14: TARGET_LIST_LO
> + * +0x18: TARGET_LIST_HI
> + * +0x1c: (reserved)
> + */
> +
> +static inline __le32 *hdm_reg_ptr(struct vfio_pci_cxl_state *cxl, u32 off)
> +{
> + /*
> + * off is byte offset within the HDM block; comp_reg_virt is indexed
> + * as an array of __le32.
> + */
> + return &cxl->comp_reg_virt[off / sizeof(__le32)];
> +}
> +
> +static ssize_t virt_hdm_rev_reg_write(struct vfio_pci_core_device *vdev,
> + const __le32 *val32, u64 offset, u64 size)
> +{
> + /* Discard writes on reserved registers. */
> + return size;
> +}
> +
> +static ssize_t hdm_decoder_n_lo_write(struct vfio_pci_core_device *vdev,
> + const __le32 *val32, u64 offset, u64 size)
> +{
> + u32 new_val = le32_to_cpu(*val32);
> +
> + if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
> + return -EINVAL;
> +
> + /* Bit [27:0] are reserved. */
> + new_val &= ~CXL_HDM_DECODER_BASE_LO_RESERVED_MASK;
> +
> + *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
> +
> + return size;
> +}
> +
> +static ssize_t hdm_decoder_global_ctrl_write(struct vfio_pci_core_device *vdev,
> + const __le32 *val32, u64 offset, u64 size)
Why offset? If the dispatch function already checked and confirmed this is the offset for the global ctrl register then there's no need to pass in the offset.
> +{
> + u32 hdm_decoder_global_cap;
> + u32 new_val = le32_to_cpu(*val32);
> +
> + if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
> + return -EINVAL;
> +
> + /* Bit [31:2] are reserved. */
> + new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK;
> +
> + /* Poison On Decode Error Enable bit is 0 and RO if not support. */
> + hdm_decoder_global_cap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, 0));
> + if (!(hdm_decoder_global_cap & CXL_HDM_CAP_POISON_ON_DECODE_ERR_BIT))
> + new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT;
> +
> + *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
> +
> + return size;
> +}
> +
> +/*
> + * hdm_decoder_n_ctrl_write - Write handler for HDM decoder CTRL register.
If we are going to start with kdoc style comment, may as well finish the kdoc block and provide parameters and return values
> + *
> + * The COMMIT bit (bit 9) is the key: setting it requests the hardware to
> + * lock the decoder. The emulated COMMITTED bit (bit 10) mirrors COMMIT
> + * immediately to allow QEMU's notify_change to detect the transition and
> + * map/unmap the DPA MemoryRegion in the guest address space.
> + *
> + * Note: the actual hardware HDM decoder programming (writing the real
> + * BASE/SIZE with host physical addresses) happens in the QEMU notify_change
> + * callback BEFORE this write reaches the hardware. This ordering is
> + * correct because vfio_region_write() calls notify_change() first.
> + */
> +static ssize_t hdm_decoder_n_ctrl_write(struct vfio_pci_core_device *vdev,
> + const __le32 *val32, u64 offset, u64 size)
> +{
> + u32 hdm_decoder_global_cap;
> + u32 ro_mask = CXL_HDM_DECODER_CTRL_RO_BITS_MASK;
> + u32 rev_mask = CXL_HDM_DECODER_CTRL_RESERVED_MASK;
> + u32 new_val = le32_to_cpu(*val32);
> + u32 cur_val;
> +
> + if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
> + return -EINVAL;
> +
> + cur_val = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, offset));
> + if (cur_val & CXL_HDM_DECODER_CTRL_COMMIT_LOCK_BIT)
> + return size;
> +
> + hdm_decoder_global_cap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, 0));
> + ro_mask |= CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO;
> + rev_mask |= CXL_HDM_DECODER_CTRL_DEVICE_RESERVED;
> + if (!(hdm_decoder_global_cap & CXL_HDM_CAP_UIO_SUPPORTED_BIT))
> + rev_mask |= CXL_HDM_DECODER_CTRL_UIO_RESERVED;
> +
> + new_val &= ~rev_mask;
> + cur_val &= ro_mask;
> + new_val = (new_val & ~ro_mask) | cur_val;
> +
> + /*
> + * Mirror COMMIT → COMMITTED immediately in the emulated state.
> + * QEMU's notify_change (called before this write reaches hardware)
> + * reads COMMITTED from the region fd to detect commit transitions.
> + */
> + if (new_val & CXL_HDM_DECODER_CTRL_COMMIT_BIT)
> + new_val |= CXL_HDM_DECODER_CTRL_COMMITTED_BIT;
> + else
> + new_val &= ~CXL_HDM_DECODER_CTRL_COMMITTED_BIT;
> +
> + *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
> +
> + return size;
> +}
> +
> +/*
> + * Dispatch table for COMP_REGS region writes. Indexed by byte offset within
> + * the HDM decoder block. Returns the appropriate write handler.
> + *
> + * Layout:
> + * 0x00 HDM Decoder Capability (RO)
> + * 0x04 HDM Global Control (RW with reserved masking)
> + * 0x08 HDM Global Status (RO)
> + * 0x0c (reserved) (ignored)
> + * Per decoder N, base = 0x10 + N*0x20:
> + * base+0x00 BASE_LO (RW, [27:0] reserved)
> + * base+0x04 BASE_HI (RW)
> + * base+0x08 SIZE_LO (RW, [27:0] reserved)
> + * base+0x0c SIZE_HI (RW)
> + * base+0x10 CTRL (RW, complex rules)
> + * base+0x14 TARGET_LIST_LO (ignored for Type-2)
> + * base+0x18 TARGET_LIST_HI (ignored for Type-2)
> + * base+0x1c (reserved) (ignored)
> + */
> +static ssize_t comp_regs_dispatch_write(struct vfio_pci_core_device *vdev,
> + u32 off, const __le32 *val32, u32 size)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + u32 dec_base, dec_off;
> +
> + /* HDM Decoder Capability (0x00): RO */
> + if (off == 0x00)
define magic number
> + return size;
> +
> + /* HDM Global Control (0x04) */
> + if (off == CXL_HDM_DECODER_GLOBAL_CTRL_OFFSET)
> + return hdm_decoder_global_ctrl_write(vdev, val32, off, size);
> +
> + /* HDM Global Status (0x08): RO */
> + if (off == 0x08)
define magic number
> + return size;
> +
> + /* Per-decoder registers start at 0x10, stride 0x20 */
> + if (off < CXL_HDM_DECODER_FIRST_BLOCK_OFFSET)
> + return size; /* reserved gap */
> +
> + dec_base = CXL_HDM_DECODER_FIRST_BLOCK_OFFSET;
> + dec_off = (off - dec_base) % CXL_HDM_DECODER_BLOCK_STRIDE;
Need a check here to make sure offset is within the number of supported decoders.
> +
> + switch (dec_off) {
> + case CXL_HDM_DECODER_N_BASE_LOW_OFFSET: /* BASE_LO */
> + case CXL_HDM_DECODER_N_SIZE_LOW_OFFSET: /* SIZE_LO */
> + return hdm_decoder_n_lo_write(vdev, val32, off, size);
> + case CXL_HDM_DECODER_N_BASE_HIGH_OFFSET: /* BASE_HI */
> + case CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET: /* SIZE_HI */
> + /* Full 32-bit write, no reserved bits */
> + *hdm_reg_ptr(cxl, off) = *val32;
> + return size;
> + case CXL_HDM_DECODER_N_CTRL_OFFSET: /* CTRL */
> + return hdm_decoder_n_ctrl_write(vdev, val32, off, size);
> + case CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET:
> + case CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET:
> + case CXL_HDM_DECODER_N_REV_OFFSET:
> + return virt_hdm_rev_reg_write(vdev, val32, off, size);
> + default:
> + return size;
> + }
> +}
> +
> +/*
> + * vfio_cxl_comp_regs_rw - regops rw handler for VFIO_REGION_SUBTYPE_CXL_COMP_REGS.
> + *
> + * Reads return the emulated HDM state (comp_reg_virt[]).
> + * Writes go through comp_regs_dispatch_write() for bit-field enforcement.
> + * Only 4-byte aligned 4-byte accesses are supported (hardware requirement).
> + */
> +static ssize_t vfio_cxl_comp_regs_rw(struct vfio_pci_core_device *vdev,
> + char __user *buf, size_t count,
> + loff_t *ppos, bool iswrite)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + size_t done = 0;
> +
> + if (!count)
> + return 0;
> +
> + /* Clamp to region size */
> + if (pos >= cxl->hdm_reg_size)
> + return -EINVAL;
> + count = min(count, (size_t)(cxl->hdm_reg_size - pos));
> +
> + while (done < count) {
> + u32 sz = min_t(u32, CXL_REG_SIZE_DWORD, count - done);
> + u32 off = pos + done;
> + __le32 v;
> +
> + /* Enforce 4-byte alignment */
> + if (sz < CXL_REG_SIZE_DWORD || (off & 0x3))
> + return done ? (ssize_t)done : -EINVAL;
> +
> + if (iswrite) {
> + if (copy_from_user(&v, buf + done, sizeof(v)))
> + return done ? (ssize_t)done : -EFAULT;
> + comp_regs_dispatch_write(vdev, off, &v, sizeof(v));
> + } else {
> + v = *hdm_reg_ptr(cxl, off);
> + if (copy_to_user(buf + done, &v, sizeof(v)))
> + return done ? (ssize_t)done : -EFAULT;
> + }
> + done += sizeof(v);
> + }
> +
> + *ppos += done;
> + return done;
> +}
> +
> +static void vfio_cxl_comp_regs_release(struct vfio_pci_core_device *vdev,
> + struct vfio_pci_region *region)
> +{
> + /* comp_reg_virt is freed in vfio_cxl_clean_virt_regs(), not here. */
> +}
> +
> +static const struct vfio_pci_regops vfio_cxl_comp_regs_ops = {
> + .rw = vfio_cxl_comp_regs_rw,
> + .release = vfio_cxl_comp_regs_release,
> +};
> +
> +/*
> + * vfio_cxl_setup_virt_regs - Allocate emulated HDM register state.
> + *
> + * Allocates comp_reg_virt as a compact __le32 array covering only
> + * hdm_reg_size bytes of HDM decoder registers. The initial values
> + * are read from hardware via the BAR ioremap established by the caller.
> + *
> + * DVSEC state is accessed via vdev->vconfig (see the following patch).
> + */
> +int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + size_t nregs;
> +
> + if (WARN_ON(!cxl->hdm_reg_size))
> + return -EINVAL;
> +
> + if (pci_resource_len(vdev->pdev, cxl->comp_reg_bar) <
> + cxl->comp_reg_offset + cxl->hdm_reg_offset + cxl->hdm_reg_size)
> + return -ENODEV;
> +
> + nregs = cxl->hdm_reg_size / sizeof(__le32);
> + cxl->comp_reg_virt = kcalloc(nregs, sizeof(__le32), GFP_KERNEL);
> + if (!cxl->comp_reg_virt)
> + return -ENOMEM;
> +
> + /* Establish persistent mapping; kept alive until vfio_cxl_clean_virt_regs(). */
> + cxl->hdm_iobase = ioremap(pci_resource_start(vdev->pdev, cxl->comp_reg_bar) +
> + cxl->comp_reg_offset + cxl->hdm_reg_offset,
> + cxl->hdm_reg_size);
> + if (!cxl->hdm_iobase) {
> + kfree(cxl->comp_reg_virt);
> + cxl->comp_reg_virt = NULL;
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Called with memory_lock write side held (from vfio_cxl_reactivate_region).
> + * Uses the pre-established hdm_iobase, no ioremap() under the lock,
> + * which would deadlock on PREEMPT_RT where ioremap() can sleep.
> + */
> +void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + size_t i, nregs;
> +
> + if (!cxl || !cxl->comp_reg_virt || !cxl->hdm_iobase)
> + return;
> +
> + nregs = cxl->hdm_reg_size / sizeof(__le32);
> +
> + for (i = 0; i < nregs; i++)
> + cxl->comp_reg_virt[i] =
> + cpu_to_le32(readl(cxl->hdm_iobase + i * sizeof(__le32)));
> +}
> +
> +void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> +
> + if (cxl->hdm_iobase) {
> + iounmap(cxl->hdm_iobase);
> + cxl->hdm_iobase = NULL;
> + }
> + kfree(cxl->comp_reg_virt);
> + cxl->comp_reg_virt = NULL;
> +}
> +
> +/*
> + * vfio_cxl_register_comp_regs_region - Register the COMP_REGS device region.
> + *
> + * Exposes the emulated HDM decoder register state as a VFIO device region
> + * with type VFIO_REGION_SUBTYPE_CXL_COMP_REGS. QEMU attaches a
> + * notify_change callback to this region to intercept HDM COMMIT writes
> + * and map the DPA MemoryRegion at the appropriate GPA.
> + *
> + * The region is read+write only (no mmap) to ensure all accesses pass
> + * through comp_regs_dispatch_write() for proper bit-field enforcement.
> + */
> +int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
> + int ret;
> +
> + if (!cxl || !cxl->comp_reg_virt)
> + return -ENODEV;
> +
> + ret = vfio_pci_core_register_dev_region(vdev,
> + PCI_VENDOR_ID_CXL |
> + VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> + VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
> + &vfio_cxl_comp_regs_ops,
> + cxl->hdm_reg_size, flags, cxl);
> + if (!ret)
> + cxl->comp_reg_region_idx = vdev->num_regions - 1;
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_cxl_register_comp_regs_region);
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> index b870926bfb19..4f2637874e9d 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> @@ -25,14 +25,51 @@ struct vfio_pci_cxl_state {
> size_t hdm_reg_size;
> resource_size_t comp_reg_offset;
> size_t comp_reg_size;
> + __le32 *comp_reg_virt;
> + void __iomem *hdm_iobase;
> u32 hdm_count;
> int dpa_region_idx;
> + int comp_reg_region_idx;
> u16 dvsec;
> u8 comp_reg_bar;
> bool precommitted;
> bool region_active;
> };
>
> +/* Register access sizes */
> +#define CXL_REG_SIZE_WORD 2
> +#define CXL_REG_SIZE_DWORD 4
> +
> +/* HDM Decoder - register offsets (CXL 2.0 8.2.5.19) */
> +#define CXL_HDM_DECODER_GLOBAL_CTRL_OFFSET 0x4
> +#define CXL_HDM_DECODER_FIRST_BLOCK_OFFSET 0x10
> +#define CXL_HDM_DECODER_BLOCK_STRIDE 0x20
> +#define CXL_HDM_DECODER_N_BASE_LOW_OFFSET 0x0
> +#define CXL_HDM_DECODER_N_BASE_HIGH_OFFSET 0x4
> +#define CXL_HDM_DECODER_N_SIZE_LOW_OFFSET 0x8
> +#define CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET 0xc
> +#define CXL_HDM_DECODER_N_CTRL_OFFSET 0x10
> +#define CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET 0x14
> +#define CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET 0x18
> +#define CXL_HDM_DECODER_N_REV_OFFSET 0x1c
> +
> +/* HDM Decoder Global Capability / Control - bit definitions */
> +#define CXL_HDM_CAP_POISON_ON_DECODE_ERR_BIT BIT(10)
> +#define CXL_HDM_CAP_UIO_SUPPORTED_BIT BIT(13)
> +
> +/* HDM Decoder N Control */
> +#define CXL_HDM_DECODER_CTRL_COMMIT_LOCK_BIT BIT(8)
> +#define CXL_HDM_DECODER_CTRL_COMMIT_BIT BIT(9)
> +#define CXL_HDM_DECODER_CTRL_COMMITTED_BIT BIT(10)
> +#define CXL_HDM_DECODER_CTRL_RO_BITS_MASK (BIT(10) | BIT(11))
> +#define CXL_HDM_DECODER_CTRL_RESERVED_MASK (BIT(15) | GENMASK(31, 28))
> +#define CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO BIT(12)
> +#define CXL_HDM_DECODER_CTRL_DEVICE_RESERVED (GENMASK(19, 16) | GENMASK(23, 20))
> +#define CXL_HDM_DECODER_CTRL_UIO_RESERVED (BIT(14) | GENMASK(27, 24))
> +#define CXL_HDM_DECODER_BASE_LO_RESERVED_MASK GENMASK(27, 0)
> +#define CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK GENMASK(31, 2)
> +#define CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT BIT(0)
Maybe the reg defines should go in include/cxl/regs.h? Or move shared definitions out of drivers/cxl/.
DJ
> +
> /*
> * CXL DVSEC for CXL Devices - register offsets within the DVSEC
> * (CXL 2.0+ 8.1.3).
> @@ -41,4 +78,8 @@ struct vfio_pci_cxl_state {
> #define CXL_DVSEC_CAPABILITY_OFFSET 0xa
> #define CXL_DVSEC_MEM_CAPABLE BIT(2)
>
> +int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev);
> +void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev);
> +void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
> +
> #endif /* __LINUX_VFIO_CXL_PRIV_H */
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 8f440f9eaa0c..f8db9a05c033 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -152,6 +152,8 @@ int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev);
> void vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev);
> void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
> void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
> +int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
> +void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
>
> #else
>
> @@ -173,6 +175,11 @@ static inline void
> vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev) { }
> static inline void
> vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev) { }
> +static inline int
> +vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
> +{ return 0; }
> +static inline void
> +vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev) { }
>
> #endif /* CONFIG_VFIO_CXL_CORE */
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* RE: [PATCH 13/20] vfio/cxl: Introduce HDM decoder register emulation framework
2026-03-13 19:05 ` Dave Jiang
@ 2026-03-18 17:58 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 17:58 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang, Krishnakant Jaju,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 14 March 2026 00:36
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 13/20] vfio/cxl: Introduce HDM decoder register
> emulation framework
>
> External email: Use caution opening links or attachments
>
>
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > Introduce an emulation framework to handle CXL MMIO register emulation
> > for CXL devices passed through to a VM.
> >
> > A single compact __le32 array (comp_reg_virt) covers only the HDM
> > decoder register block (hdm_reg_size bytes, typically 256-512 bytes).
> >
> > A new VFIO device region VFIO_REGION_SUBTYPE_CXL_COMP_REGS exposes
> > this array to userspace (QEMU) as a read-write region:
> > - Reads return the emulated state (comp_reg_virt[])
> > - Writes go through the HDM register write handlers and are
> > forwarded to hardware where appropriate
> >
> > QEMU attaches a notify_change callback to this region. When the COMMIT
> > bit is written in a decoder CTRL register the callback reads the
> > BASE_LO/HI from the same region fd (emulated state) and maps the DPA
> > MemoryRegion at the correct GPA in system_memory.
> >
> > Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> > Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> > ---
> > drivers/vfio/pci/Makefile | 2 +-
> > drivers/vfio/pci/cxl/vfio_cxl_core.c | 36 ++-
> > drivers/vfio/pci/cxl/vfio_cxl_emu.c | 366 +++++++++++++++++++++++++++
> > drivers/vfio/pci/cxl/vfio_cxl_priv.h | 41 +++
> > drivers/vfio/pci/vfio_pci_priv.h | 7 +
> > 5 files changed, 450 insertions(+), 2 deletions(-)
> > create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
> >
> > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> > index ecb0eacbc089..bef916495eae 100644
> > --- a/drivers/vfio/pci/Makefile
> > +++ b/drivers/vfio/pci/Makefile
> > @@ -1,7 +1,7 @@
> > # SPDX-License-Identifier: GPL-2.0-only
> >
> > vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> > -vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
> > +vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o
> > vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> > vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
> > obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o diff --git
> > a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > index 03846bd11c8a..d2401871489d 100644
> > --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > @@ -45,6 +45,7 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> > cxl = vdev->cxl;
> > cxl->dvsec = dvsec;
> > cxl->dpa_region_idx = -1;
> > + cxl->comp_reg_region_idx = -1;
> >
> > pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> > + &cap_word);
> > @@ -124,6 +125,10 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
> > cxl->comp_reg_offset = bar_offset;
> > cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
> >
> > + ret = vfio_cxl_setup_virt_regs(vdev);
> > + if (ret)
> > + return ret;
> > +
> > return 0;
> > }
> >
> > @@ -281,12 +286,14 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
> >
> > ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
> > if (ret)
> > - goto failed;
> > + goto regs_failed;
> >
> > cxl->precommitted = true;
> >
> > return;
> >
> > +regs_failed:
> > + vfio_cxl_clean_virt_regs(vdev);
> > failed:
> > devm_kfree(&pdev->dev, vdev->cxl);
> > vdev->cxl = NULL;
> > @@ -299,6 +306,7 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
> > if (!cxl || !cxl->region)
> > return;
> >
> > + vfio_cxl_clean_virt_regs(vdev);
> > vfio_cxl_destroy_cxl_region(vdev);
> > }
> >
> > @@ -409,6 +417,32 @@ void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev)
> >
> > if (!cxl)
> > return;
> > +
> > + /*
> > + * Re-initialise the emulated HDM comp_reg_virt[] from hardware.
> > + * After FLR the decoder registers read as zero; mirror that in
> > + * the emulated state so QEMU sees a clean slate.
> > + */
> > + vfio_cxl_reinit_comp_regs(vdev);
> > +
> > + /*
> > + * Only re-enable the DPA mmap if the hardware has actually
> > + * re-committed decoder 0 after FLR. Read the COMMITTED bit from the
> > + * freshly-re-snapshotted comp_reg_virt[] so we check the post-FLR
> > + * hardware state, not stale pre-reset state.
> > + *
> > + * If COMMITTED is 0 (slow firmware re-commit path), leave
> > + * region_active=false. Guest faults will return VM_FAULT_SIGBUS
> > + * until the decoder is re-committed and the region is re-enabled.
> > + */
> > + if (cxl->precommitted && cxl->comp_reg_virt) {
> > + u32 ctrl = le32_to_cpu(cxl->comp_reg_virt[
> > + CXL_HDM_DECODER0_CTRL_OFFSET(0) /
> > + CXL_REG_SIZE_DWORD]);
> > +
> > + if (ctrl & CXL_HDM_DECODER_CTRL_COMMITTED_BIT)
> > + WRITE_ONCE(cxl->region_active, true);
> > + }
> > }
> >
> > +static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
> > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
> > new file mode 100644
> > index 000000000000..d5603c80fe51
> > --- /dev/null
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
> > @@ -0,0 +1,366 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> > + */
> > +
> > +#include <linux/bitops.h>
> > +#include <linux/vfio_pci_core.h>
> > +
> > +#include "../vfio_pci_priv.h"
> > +#include "vfio_cxl_priv.h"
> > +
> > +/*
> > + * comp_reg_virt[] layout:
> > + * Index 0..N correspond to 32-bit registers at byte offset
> 0..hdm_reg_size-4
> > + * within the HDM decoder capability block.
> > + *
> > + * Register layout within the HDM block (CXL spec 8.2.5.19):
> > + * 0x00: HDM Decoder Capability
> > + * 0x04: HDM Decoder Global Control
> > + * 0x08: HDM Decoder Global Status
> > + * 0x0c: (reserved)
> > + * For each decoder N (N=0..hdm_count-1), at base 0x10 + N*0x20:
> > + * +0x00: BASE_LO
> > + * +0x04: BASE_HI
> > + * +0x08: SIZE_LO
> > + * +0x0c: SIZE_HI
> > + * +0x10: CTRL
> > + * +0x14: TARGET_LIST_LO
> > + * +0x18: TARGET_LIST_HI
> > + * +0x1c: (reserved)
> > + */
> > +
> > +static inline __le32 *hdm_reg_ptr(struct vfio_pci_cxl_state *cxl, u32 off)
> > +{
> > + /*
> > + * off is byte offset within the HDM block; comp_reg_virt is
> indexed
> > + * as an array of __le32.
> > + */
> > + return &cxl->comp_reg_virt[off / sizeof(__le32)];
> > +}
> > +
> > +static ssize_t virt_hdm_rev_reg_write(struct vfio_pci_core_device *vdev,
> > + const __le32 *val32, u64 offset,
> > + u64 size)
> > +{
> > + /* Discard writes on reserved registers. */
> > + return size;
> > +}
> > +
> > +static ssize_t hdm_decoder_n_lo_write(struct vfio_pci_core_device *vdev,
> > + const __le32 *val32, u64 offset,
> > + u64 size)
> > +{
> > + u32 new_val = le32_to_cpu(*val32);
> > +
> > + if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
> > + return -EINVAL;
> > +
> > + /* Bit [27:0] are reserved. */
> > + new_val &= ~CXL_HDM_DECODER_BASE_LO_RESERVED_MASK;
> > +
> > + *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
> > +
> > + return size;
> > +}
> > +
> > +static ssize_t hdm_decoder_global_ctrl_write(struct vfio_pci_core_device *vdev,
> > + const __le32 *val32, u64 offset,
> > + u64 size)
> Why offset? If the dispatch function already checked and confirmed this is
> the offset for the global ctrl register then there's no need to pass in
> the offset.
Okay, refactored this as per suggestion.
>
> > +{
> > + u32 hdm_decoder_global_cap;
> > + u32 new_val = le32_to_cpu(*val32);
> > +
> > + if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
> > + return -EINVAL;
> > +
> > + /* Bit [31:2] are reserved. */
> > + new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK;
> > +
> > + /* Poison On Decode Error Enable bit is 0 and RO if not supported. */
> > + hdm_decoder_global_cap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, 0));
> > + if (!(hdm_decoder_global_cap & CXL_HDM_CAP_POISON_ON_DECODE_ERR_BIT))
> > + new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT;
> > +
> > + *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * hdm_decoder_n_ctrl_write - Write handler for HDM decoder CTRL register.
>
> If we are going to start with kdoc style comment, may as well finish the
> kdoc block and provide parameters and return values
>
> > + *
> > + * The COMMIT bit (bit 9) is the key: setting it requests the hardware
> > + * to lock the decoder. The emulated COMMITTED bit (bit 10) mirrors
> > + * COMMIT immediately to allow QEMU's notify_change to detect the
> > + * transition and map/unmap the DPA MemoryRegion in the guest address
> > + * space.
> > + *
> > + * Note: the actual hardware HDM decoder programming (writing the real
> > + * BASE/SIZE with host physical addresses) happens in the QEMU
> > + * notify_change callback BEFORE this write reaches the hardware. This
> > + * ordering is correct because vfio_region_write() calls notify_change()
> > + * first.
> > + */
> > +static ssize_t hdm_decoder_n_ctrl_write(struct vfio_pci_core_device *vdev,
> > + const __le32 *val32, u64 offset,
> > + u64 size)
> > +{
> > + u32 hdm_decoder_global_cap;
> > + u32 ro_mask = CXL_HDM_DECODER_CTRL_RO_BITS_MASK;
> > + u32 rev_mask = CXL_HDM_DECODER_CTRL_RESERVED_MASK;
> > + u32 new_val = le32_to_cpu(*val32);
> > + u32 cur_val;
> > +
> > + if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
> > + return -EINVAL;
> > +
> > + cur_val = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, offset));
> > + if (cur_val & CXL_HDM_DECODER_CTRL_COMMIT_LOCK_BIT)
> > + return size;
> > +
> > + hdm_decoder_global_cap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, 0));
> > + ro_mask |= CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO;
> > + rev_mask |= CXL_HDM_DECODER_CTRL_DEVICE_RESERVED;
> > + if (!(hdm_decoder_global_cap & CXL_HDM_CAP_UIO_SUPPORTED_BIT))
> > + rev_mask |= CXL_HDM_DECODER_CTRL_UIO_RESERVED;
> > +
> > + new_val &= ~rev_mask;
> > + cur_val &= ro_mask;
> > + new_val = (new_val & ~ro_mask) | cur_val;
> > +
> > + /*
> > + * Mirror COMMIT → COMMITTED immediately in the emulated state.
> > + * QEMU's notify_change (called before this write reaches hardware)
> > + * reads COMMITTED from the region fd to detect commit transitions.
> > + */
> > + if (new_val & CXL_HDM_DECODER_CTRL_COMMIT_BIT)
> > + new_val |= CXL_HDM_DECODER_CTRL_COMMITTED_BIT;
> > + else
> > + new_val &= ~CXL_HDM_DECODER_CTRL_COMMITTED_BIT;
> > +
> > + *hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * Dispatch table for COMP_REGS region writes. Indexed by byte offset
> > + * within the HDM decoder block. Returns the appropriate write handler.
> > + * the HDM decoder block. Returns the appropriate write handler.
> > + *
> > + * Layout:
> > + * 0x00 HDM Decoder Capability (RO)
> > + * 0x04 HDM Global Control (RW with reserved masking)
> > + * 0x08 HDM Global Status (RO)
> > + * 0x0c (reserved) (ignored)
> > + * Per decoder N, base = 0x10 + N*0x20:
> > + * base+0x00 BASE_LO (RW, [27:0] reserved)
> > + * base+0x04 BASE_HI (RW)
> > + * base+0x08 SIZE_LO (RW, [27:0] reserved)
> > + * base+0x0c SIZE_HI (RW)
> > + * base+0x10 CTRL (RW, complex rules)
> > + * base+0x14 TARGET_LIST_LO (ignored for Type-2)
> > + * base+0x18 TARGET_LIST_HI (ignored for Type-2)
> > + * base+0x1c (reserved) (ignored)
> > + */
> > +static ssize_t comp_regs_dispatch_write(struct vfio_pci_core_device *vdev,
> > + u32 off, const __le32 *val32,
> > + u32 size)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + u32 dec_base, dec_off;
> > +
> > + /* HDM Decoder Capability (0x00): RO */
> > + if (off == 0x00)
>
> define magic number
>
> > + return size;
> > +
> > + /* HDM Global Control (0x04) */
> > + if (off == CXL_HDM_DECODER_GLOBAL_CTRL_OFFSET)
> > + return hdm_decoder_global_ctrl_write(vdev, val32, off,
> > + size);
> > +
> > + /* HDM Global Status (0x08): RO */
> > + if (off == 0x08)
>
> define magic number
Yes, removed these bare numbers and added macros for this.
>
> > + return size;
> > +
> > + /* Per-decoder registers start at 0x10, stride 0x20 */
> > + if (off < CXL_HDM_DECODER_FIRST_BLOCK_OFFSET)
> > + return size; /* reserved gap */
> > +
> > + dec_base = CXL_HDM_DECODER_FIRST_BLOCK_OFFSET;
> > + dec_off = (off - dec_base) % CXL_HDM_DECODER_BLOCK_STRIDE;
>
> Need a check here to make sure offset is within the number of supported
> decoders.
Added a check for this verification.
>
> > +
> > + switch (dec_off) {
> > + case CXL_HDM_DECODER_N_BASE_LOW_OFFSET: /* BASE_LO */
> > + case CXL_HDM_DECODER_N_SIZE_LOW_OFFSET: /* SIZE_LO */
> > + return hdm_decoder_n_lo_write(vdev, val32, off, size);
> > + case CXL_HDM_DECODER_N_BASE_HIGH_OFFSET: /* BASE_HI */
> > + case CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET: /* SIZE_HI */
> > + /* Full 32-bit write, no reserved bits */
> > + *hdm_reg_ptr(cxl, off) = *val32;
> > + return size;
> > + case CXL_HDM_DECODER_N_CTRL_OFFSET: /* CTRL */
> > + return hdm_decoder_n_ctrl_write(vdev, val32, off, size);
> > + case CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET:
> > + case CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET:
> > + case CXL_HDM_DECODER_N_REV_OFFSET:
> > + return virt_hdm_rev_reg_write(vdev, val32, off, size);
> > + default:
> > + return size;
> > + }
> > +}
> > +
> > +/*
> > + * vfio_cxl_comp_regs_rw - regops rw handler for
> VFIO_REGION_SUBTYPE_CXL_COMP_REGS.
> > + *
> > + * Reads return the emulated HDM state (comp_reg_virt[]).
> > + * Writes go through comp_regs_dispatch_write() for bit-field
> enforcement.
> > + * Only 4-byte aligned 4-byte accesses are supported (hardware
> requirement).
> > + */
> > +static ssize_t vfio_cxl_comp_regs_rw(struct vfio_pci_core_device *vdev,
> > + char __user *buf, size_t count,
> > + loff_t *ppos, bool iswrite)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> > + size_t done = 0;
> > +
> > + if (!count)
> > + return 0;
> > +
> > + /* Clamp to region size */
> > + if (pos >= cxl->hdm_reg_size)
> > + return -EINVAL;
> > + count = min(count, (size_t)(cxl->hdm_reg_size - pos));
> > +
> > + while (done < count) {
> > + u32 sz = min_t(u32, CXL_REG_SIZE_DWORD, count - done);
> > + u32 off = pos + done;
> > + __le32 v;
> > +
> > + /* Enforce 4-byte alignment */
> > + if (sz < CXL_REG_SIZE_DWORD || (off & 0x3))
> > + return done ? (ssize_t)done : -EINVAL;
> > +
> > + if (iswrite) {
> > + if (copy_from_user(&v, buf + done, sizeof(v)))
> > + return done ? (ssize_t)done : -EFAULT;
> > + comp_regs_dispatch_write(vdev, off, &v,
> sizeof(v));
> > + } else {
> > + v = *hdm_reg_ptr(cxl, off);
> > + if (copy_to_user(buf + done, &v, sizeof(v)))
> > + return done ? (ssize_t)done : -EFAULT;
> > + }
> > + done += sizeof(v);
> > + }
> > +
> > + *ppos += done;
> > + return done;
> > +}
> > +
> > +static void vfio_cxl_comp_regs_release(struct vfio_pci_core_device *vdev,
> > + struct vfio_pci_region *region)
> > +{
> > + /* comp_reg_virt is freed in vfio_cxl_clean_virt_regs(), not here. */
> > +}
> > +
> > +static const struct vfio_pci_regops vfio_cxl_comp_regs_ops = {
> > + .rw = vfio_cxl_comp_regs_rw,
> > + .release = vfio_cxl_comp_regs_release,
> > +};
> > +
> > +/*
> > + * vfio_cxl_setup_virt_regs - Allocate emulated HDM register state.
> > + *
> > + * Allocates comp_reg_virt as a compact __le32 array covering only
> > + * hdm_reg_size bytes of HDM decoder registers. The initial values
> > + * are read from hardware via the BAR ioremap established by the
> caller.
> > + *
> > + * DVSEC state is accessed via vdev->vconfig (see the following patch).
> > + */
> > +int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + size_t nregs;
> > +
> > + if (WARN_ON(!cxl->hdm_reg_size))
> > + return -EINVAL;
> > +
> > + if (pci_resource_len(vdev->pdev, cxl->comp_reg_bar) <
> > + cxl->comp_reg_offset + cxl->hdm_reg_offset + cxl->hdm_reg_size)
> > + return -ENODEV;
> > +
> > + nregs = cxl->hdm_reg_size / sizeof(__le32);
> > + cxl->comp_reg_virt = kcalloc(nregs, sizeof(__le32), GFP_KERNEL);
> > + if (!cxl->comp_reg_virt)
> > + return -ENOMEM;
> > +
> > + /* Establish persistent mapping; kept alive until vfio_cxl_clean_virt_regs(). */
> > + cxl->hdm_iobase = ioremap(pci_resource_start(vdev->pdev, cxl->comp_reg_bar) +
> > + cxl->comp_reg_offset + cxl->hdm_reg_offset,
> > + cxl->hdm_reg_size);
> > + if (!cxl->hdm_iobase) {
> > + kfree(cxl->comp_reg_virt);
> > + cxl->comp_reg_virt = NULL;
> > + return -ENOMEM;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Called with memory_lock write side held (from
> vfio_cxl_reactivate_region).
> > + * Uses the pre-established hdm_iobase, no ioremap() under the lock,
> > + * which would deadlock on PREEMPT_RT where ioremap() can sleep.
> > + */
> > +void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + size_t i, nregs;
> > +
> > + if (!cxl || !cxl->comp_reg_virt || !cxl->hdm_iobase)
> > + return;
> > +
> > + nregs = cxl->hdm_reg_size / sizeof(__le32);
> > +
> > + for (i = 0; i < nregs; i++)
> > + cxl->comp_reg_virt[i] =
> > + cpu_to_le32(readl(cxl->hdm_iobase + i * sizeof(__le32)));
> > +}
> > +
> > +void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > +
> > + if (cxl->hdm_iobase) {
> > + iounmap(cxl->hdm_iobase);
> > + cxl->hdm_iobase = NULL;
> > + }
> > + kfree(cxl->comp_reg_virt);
> > + cxl->comp_reg_virt = NULL;
> > +}
> > +
> > +/*
> > + * vfio_cxl_register_comp_regs_region - Register the COMP_REGS device
> region.
> > + *
> > + * Exposes the emulated HDM decoder register state as a VFIO device
> region
> > + * with type VFIO_REGION_SUBTYPE_CXL_COMP_REGS. QEMU attaches a
> > + * notify_change callback to this region to intercept HDM COMMIT
> > +writes
> > + * and map the DPA MemoryRegion at the appropriate GPA.
> > + *
> > + * The region is read+write only (no mmap) to ensure all accesses
> > +pass
> > + * through comp_regs_dispatch_write() for proper bit-field enforcement.
> > + */
> > +int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + u32 flags = VFIO_REGION_INFO_FLAG_READ |
> VFIO_REGION_INFO_FLAG_WRITE;
> > + int ret;
> > +
> > + if (!cxl || !cxl->comp_reg_virt)
> > + return -ENODEV;
> > +
> > + ret = vfio_pci_core_register_dev_region(vdev,
> > + PCI_VENDOR_ID_CXL |
> > +
> VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> > +
> VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
> > + &vfio_cxl_comp_regs_ops,
> > + cxl->hdm_reg_size, flags,
> cxl);
> > + if (!ret)
> > + cxl->comp_reg_region_idx = vdev->num_regions - 1;
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_cxl_register_comp_regs_region);
> > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> > b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> > index b870926bfb19..4f2637874e9d 100644
> > --- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> > @@ -25,14 +25,51 @@ struct vfio_pci_cxl_state {
> > size_t hdm_reg_size;
> > resource_size_t comp_reg_offset;
> > size_t comp_reg_size;
> > + __le32 *comp_reg_virt;
> > + void __iomem *hdm_iobase;
> > u32 hdm_count;
> > int dpa_region_idx;
> > + int comp_reg_region_idx;
> > u16 dvsec;
> > u8 comp_reg_bar;
> > bool precommitted;
> > bool region_active;
> > };
> >
> > +/* Register access sizes */
> > +#define CXL_REG_SIZE_WORD 2
> > +#define CXL_REG_SIZE_DWORD 4
> > +
> > +/* HDM Decoder - register offsets (CXL 2.0 8.2.5.19) */
> > +#define CXL_HDM_DECODER_GLOBAL_CTRL_OFFSET 0x4
> > +#define CXL_HDM_DECODER_FIRST_BLOCK_OFFSET 0x10
> > +#define CXL_HDM_DECODER_BLOCK_STRIDE 0x20
> > +#define CXL_HDM_DECODER_N_BASE_LOW_OFFSET 0x0
> > +#define CXL_HDM_DECODER_N_BASE_HIGH_OFFSET 0x4
> > +#define CXL_HDM_DECODER_N_SIZE_LOW_OFFSET 0x8
> > +#define CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET 0xc
> > +#define CXL_HDM_DECODER_N_CTRL_OFFSET 0x10
> > +#define CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET 0x14
> > +#define CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET 0x18
> > +#define CXL_HDM_DECODER_N_REV_OFFSET 0x1c
> > +
> > +/* HDM Decoder Global Capability / Control - bit definitions */
> > +#define CXL_HDM_CAP_POISON_ON_DECODE_ERR_BIT BIT(10)
> > +#define CXL_HDM_CAP_UIO_SUPPORTED_BIT BIT(13)
> > +
> > +/* HDM Decoder N Control */
> > +#define CXL_HDM_DECODER_CTRL_COMMIT_LOCK_BIT BIT(8)
> > +#define CXL_HDM_DECODER_CTRL_COMMIT_BIT BIT(9)
> > +#define CXL_HDM_DECODER_CTRL_COMMITTED_BIT BIT(10)
> > +#define CXL_HDM_DECODER_CTRL_RO_BITS_MASK (BIT(10) | BIT(11))
> > +#define CXL_HDM_DECODER_CTRL_RESERVED_MASK (BIT(15) | GENMASK(31,
> 28))
> > +#define CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO BIT(12)
> > +#define CXL_HDM_DECODER_CTRL_DEVICE_RESERVED (GENMASK(19, 16) |
> GENMASK(23, 20))
> > +#define CXL_HDM_DECODER_CTRL_UIO_RESERVED (BIT(14) | GENMASK(27,
> 24))
> > +#define CXL_HDM_DECODER_BASE_LO_RESERVED_MASK GENMASK(27, 0)
> > +#define CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK GENMASK(31, 2)
> > +#define CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT BIT(0)
>
> Maybe the reg defines should go in include/cxl/regs.h? Or move shared
> definitions out of drivers/cxl/.
Added in include/uapi/cxl/cxl_regs.h
>
> DJ
>
> > +
> > /*
> > * CXL DVSEC for CXL Devices - register offsets within the DVSEC
> > * (CXL 2.0+ 8.1.3).
> > @@ -41,4 +78,8 @@ struct vfio_pci_cxl_state {
> > #define CXL_DVSEC_CAPABILITY_OFFSET 0xa
> > #define CXL_DVSEC_MEM_CAPABLE BIT(2)
> >
> > +int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev);
> > +void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev);
> > +void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
> > +
> > #endif /* __LINUX_VFIO_CXL_PRIV_H */
> > diff --git a/drivers/vfio/pci/vfio_pci_priv.h
> > b/drivers/vfio/pci/vfio_pci_priv.h
> > index 8f440f9eaa0c..f8db9a05c033 100644
> > --- a/drivers/vfio/pci/vfio_pci_priv.h
> > +++ b/drivers/vfio/pci/vfio_pci_priv.h
> > @@ -152,6 +152,8 @@ int vfio_cxl_register_cxl_region(struct
> > vfio_pci_core_device *vdev); void
> > vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev);
> > void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
> > void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
> > +int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
> > +void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
> >
> > #else
> >
> > @@ -173,6 +175,11 @@ static inline void
> > vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev) { }
> > static inline void vfio_cxl_reactivate_region(struct
> > vfio_pci_core_device *vdev) { }
> > +static inline int
> > +vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
> > +{ return 0; }
> > +static inline void
> > +vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev) { }
> >
> > #endif /* CONFIG_VFIO_CXL_CORE */
> >
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 14/20] vfio/cxl: Check media readiness and create CXL memdev
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (12 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 13/20] vfio/cxl: Introduce HDM decoder register emulation framework mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-11 20:34 ` [PATCH 15/20] vfio/cxl: Introduce CXL DVSEC configuration space emulation mhonap
` (5 subsequent siblings)
19 siblings, 0 replies; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Check media readiness at probe time and create a CXL memdev for region
management.
The media/range-active check is performed at probe time to keep the
vfio-is-advertised-as-CXL behavior consistent. A pre-committed HDM
decoder already implies media is active, so set media_ready directly
instead of calling the potentially blocking cxl_await_range_active().
For memdev creation we need to determine capacity before calling
devm_cxl_add_memdev(). Read the committed decoder size directly from
HDM decoder hardware registers; the CXL core will see the same values
when it enumerates decoders inside add_memdev.
Handling of firmware-uncommitted decoders will be added in a later
commit.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 67 +++++++++++++++++++++++++++-
drivers/vfio/pci/cxl/vfio_cxl_emu.c | 48 ++++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 2 +
3 files changed, 115 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index d2401871489d..15b6c0d75d9e 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -132,6 +132,37 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev)
return 0;
}
+static int vfio_cxl_create_memdev(struct vfio_pci_core_device *vdev,
+ resource_size_t capacity)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+
+ ret = cxl_set_capacity(&cxl->cxlds, capacity);
+ if (ret) {
+ pci_err(pdev, "Failed to set capacity: %d\n", ret);
+ return ret;
+ }
+
+ pci_dbg(pdev, "Device capacity: %llu MB (from %s)\n",
+ capacity >> 20,
+ cxl->precommitted ? "committed decoder" : "sysfs");
+ pci_dbg(pdev,
+ "vfio_cxl: creating memdev: capacity=0x%llx bytes (%llu MiB)\n",
+ (unsigned long long)capacity,
+ (unsigned long long)(capacity >> 20));
+
+ cxl->cxlmd = devm_cxl_add_memdev(&cxl->cxlds, NULL);
+ if (IS_ERR(cxl->cxlmd)) {
+ pci_err(pdev, "Failed to add CXL memdev: %ld\n",
+ PTR_ERR(cxl->cxlmd));
+ return PTR_ERR(cxl->cxlmd);
+ }
+
+ return 0;
+}
+
int vfio_cxl_create_cxl_region(struct vfio_pci_core_device *vdev, resource_size_t size)
{
struct vfio_pci_cxl_state *cxl = vdev->cxl;
@@ -250,6 +281,7 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
{
struct pci_dev *pdev = vdev->pdev;
struct vfio_pci_cxl_state *cxl;
+ resource_size_t capacity = 0;
u16 dvsec;
int ret;
@@ -282,13 +314,44 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
goto failed;
}
+ cxl->cxlds.media_ready = !cxl_await_range_active(&cxl->cxlds);
+ if (!cxl->cxlds.media_ready) {
+ pci_disable_device(pdev);
+ pci_err(pdev, "CXL media not ready\n");
+ goto regs_failed;
+ }
+
+ /*
+ * Take the single authoritative HDM decoder snapshot now that
+ * MEM_ACTIVE is confirmed and BAR memory is still enabled. Using
+ * readl() per-dword ensures correct MMIO serialisation and captures
+ * the final firmware-written values for all fields including SIZE_HIGH,
+ * which firmware commits to the BAR at MEM_ACTIVE time.
+ */
+ vfio_cxl_reinit_comp_regs(vdev);
+
pci_disable_device(pdev);
- ret = vfio_cxl_create_region_helper(vdev, SZ_256M);
- if (ret)
+ capacity = vfio_cxl_read_committed_decoder_size(vdev);
+ if (capacity == 0) {
+ /*
+ * TODO: Add handling for devices which do not have
+ * firmware pre-committed decoders
+ */
+ pci_info(pdev, "Uncommitted region size must be configured via sysfs before bind\n");
goto regs_failed;
+ }
cxl->precommitted = true;
+ cxl->dpa_size = capacity;
+
+ ret = vfio_cxl_create_memdev(vdev, capacity);
+ if (ret)
+ goto regs_failed;
+
+ ret = vfio_cxl_create_region_helper(vdev, capacity);
+ if (ret)
+ goto regs_failed;
return;
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
index d5603c80fe51..178a42267642 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_emu.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
@@ -300,6 +300,54 @@ int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev)
return 0;
}
+/*
+ * vfio_cxl_read_committed_decoder_size - Extract committed DPA capacity from
+ * comp_reg_virt[].
+ *
+ * Called from probe context after vfio_cxl_reinit_comp_regs() has taken the
+ * post-MEM_ACTIVE readl() snapshot and patched SIZE_HIGH/SIZE_LOW from DVSEC.
+ * comp_reg_virt[] is already correct at this point; no hardware access needed.
+ *
+ * Returns the committed DPA capacity in bytes, or 0 if the decoder is not
+ * committed.
+ */
+resource_size_t
+vfio_cxl_read_committed_decoder_size(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct pci_dev *pdev = vdev->pdev;
+ resource_size_t capacity;
+ u32 ctrl, sz_hi, sz_lo;
+
+ if (WARN_ON(!cxl || !cxl->comp_reg_virt))
+ return 0;
+
+ ctrl = le32_to_cpu(cxl->comp_reg_virt[CXL_HDM_DECODER0_CTRL_OFFSET(0) /
+ CXL_REG_SIZE_DWORD]);
+ sz_hi = le32_to_cpu(cxl->comp_reg_virt[CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(0) /
+ CXL_REG_SIZE_DWORD]);
+ sz_lo = le32_to_cpu(cxl->comp_reg_virt[CXL_HDM_DECODER0_SIZE_LOW_OFFSET(0) /
+ CXL_REG_SIZE_DWORD]);
+
+ if (!(ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)) {
+ pci_dbg(pdev,
+ "vfio_cxl: decoder0 not committed: ctrl=0x%08x\n",
+ ctrl);
+ return 0;
+ }
+
+ capacity = ((resource_size_t)sz_hi << 32) | (sz_lo & GENMASK(31, 28));
+
+ pci_dbg(pdev,
+ "vfio_cxl: decoder0 committed: sz_hi=0x%08x sz_lo=0x%08x "
+ "capacity=0x%llx (%llu MiB)\n",
+ sz_hi, sz_lo,
+ (unsigned long long)capacity,
+ (unsigned long long)(capacity >> 20));
+
+ return capacity;
+}
+
/*
* Called with memory_lock write side held (from vfio_cxl_reactivate_region).
* Uses the pre-established hdm_iobase, no ioremap() under the lock,
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 4f2637874e9d..3ef8d923a7e8 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -26,6 +26,7 @@ struct vfio_pci_cxl_state {
resource_size_t comp_reg_offset;
size_t comp_reg_size;
__le32 *comp_reg_virt;
+ size_t dpa_size;
void __iomem *hdm_iobase;
u32 hdm_count;
int dpa_region_idx;
@@ -81,5 +82,6 @@ struct vfio_pci_cxl_state {
int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev);
void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev);
void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
+resource_size_t vfio_cxl_read_committed_decoder_size(struct vfio_pci_core_device *vdev);
#endif /* __LINUX_VFIO_CXL_PRIV_H */
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread

* [PATCH 15/20] vfio/cxl: Introduce CXL DVSEC configuration space emulation
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (13 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 14/20] vfio/cxl: Check media readiness and create CXL memdev mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-13 22:07 ` Dave Jiang
2026-03-11 20:34 ` [PATCH 16/20] vfio/pci: Expose CXL device and region info via VFIO ioctl mhonap
` (4 subsequent siblings)
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
CXL devices have CXL DVSEC registers in the configuration space.
Many of them affect the behaviors of the devices, e.g. enabling
CXL.io/CXL.mem/CXL.cache.
However, these configurations are owned by the host and a virtualization
policy should be applied when handling the access from the guest.
Introduce the emulation of CXL configuration space to handle the access
of the virtual CXL configuration space from the guest.
vfio-pci-core already allocates vdev->vconfig as the authoritative
virtual config space shadow. Directly use vdev->vconfig:
- DVSEC reads return data from vdev->vconfig (already populated by
vfio_config_init() via vfio_ecap_init())
- DVSEC writes go through new CXL-aware write handlers that update
vdev->vconfig in place
- The writable DVSEC registers are marked virtual in vdev->pci_config_map
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Makefile | 2 +-
drivers/vfio/pci/cxl/vfio_cxl_config.c | 304 +++++++++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_core.c | 4 +
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 38 +++-
drivers/vfio/pci/vfio_pci.c | 14 ++
drivers/vfio/pci/vfio_pci_config.c | 46 +++-
drivers/vfio/pci/vfio_pci_priv.h | 3 +
include/linux/vfio_pci_core.h | 8 +-
8 files changed, 415 insertions(+), 4 deletions(-)
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index bef916495eae..7c86b7845e8f 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o
+vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o cxl/vfio_cxl_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_config.c b/drivers/vfio/pci/cxl/vfio_cxl_config.c
new file mode 100644
index 000000000000..a9560661345c
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_config.c
@@ -0,0 +1,304 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * CXL DVSEC configuration space emulation for vfio-pci.
+ *
+ * Integrates into the existing vfio-pci-core ecap_perms[] framework using
+ * vdev->vconfig as the sole shadow buffer for DVSEC registers.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+/* Helpers to access vdev->vconfig at a DVSEC-relative offset */
+static inline u16 dvsec_virt_read16(struct vfio_pci_core_device *vdev,
+ u16 off)
+{
+ return le16_to_cpu(*(__le16 *)(vdev->vconfig +
+ vdev->cxl->dvsec + off));
+}
+
+static inline void dvsec_virt_write16(struct vfio_pci_core_device *vdev,
+ u16 off, u16 val)
+{
+ *(__le16 *)(vdev->vconfig + vdev->cxl->dvsec + off) = cpu_to_le16(val);
+}
+
+static inline u32 dvsec_virt_read32(struct vfio_pci_core_device *vdev,
+ u16 off)
+{
+ return le32_to_cpu(*(__le32 *)(vdev->vconfig +
+ vdev->cxl->dvsec + off));
+}
+
+static inline void dvsec_virt_write32(struct vfio_pci_core_device *vdev,
+ u16 off, u32 val)
+{
+ *(__le32 *)(vdev->vconfig + vdev->cxl->dvsec + off) = cpu_to_le32(val);
+}
+
+/* Individual DVSEC register write handlers */
+
+static void cxl_control_write(struct vfio_pci_core_device *vdev,
+ u16 abs_off, u16 new_val)
+{
+ u16 lock = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
+ u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
+ u16 rev_mask = CXL_CTRL_RESERVED_MASK;
+
+ if (lock & CXL_CTRL_LOCK_BIT)
+ return; /* CONFIG_LOCK set: register read-only until conventional reset */
+
+ if (!(cap3 & CXL_CAP3_P2P_BIT))
+ rev_mask |= CXL_CTRL_P2P_REV_MASK;
+
+ new_val &= ~rev_mask;
+ new_val |= CXL_CTRL_CXL_IO_ENABLE_BIT; /* CXL.io always enabled */
+
+ dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, new_val);
+}
+
+static void cxl_status_write(struct vfio_pci_core_device *vdev,
+ u16 abs_off, u16 new_val)
+{
+ u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_STATUS_OFFSET);
+
+ new_val &= ~CXL_STATUS_RESERVED_MASK;
+
+ /* RW1C: writing a 1 clears the bit; writing 0 leaves it unchanged */
+ if (new_val & CXL_STATUS_RW1C_BIT)
+ new_val &= ~CXL_STATUS_RW1C_BIT;
+ else
+ new_val = (new_val & ~CXL_STATUS_RW1C_BIT) |
+ (cur_val & CXL_STATUS_RW1C_BIT);
+
+ dvsec_virt_write16(vdev, CXL_DVSEC_STATUS_OFFSET, new_val);
+}
+
+static void cxl_control2_write(struct vfio_pci_core_device *vdev,
+ u16 abs_off, u16 new_val)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u16 cap2 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY2_OFFSET);
+ u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
+ u16 rev_mask = CXL_CTRL2_RESERVED_MASK;
+ u16 hw_bits = CXL_CTRL2_HW_BITS_MASK;
+ bool initiate_cxl_reset = new_val & CXL_CTRL2_INITIATE_CXL_RESET_BIT;
+
+ if (!(cap3 & CXL_CAP3_VOLATILE_HDM_BIT))
+ rev_mask |= CXL_CTRL2_VOLATILE_HDM_REV_MASK;
+ if (!(cap2 & CXL_CAP2_MODIFIED_COMPLETION_BIT))
+ rev_mask |= CXL_CTRL2_MODIFIED_COMP_REV_MASK;
+
+ new_val &= ~rev_mask;
+
+ /* Bits that go directly to hardware */
+ hw_bits &= new_val;
+
+ dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL2_OFFSET, new_val);
+
+ if (hw_bits)
+ pci_write_config_word(pdev, abs_off, hw_bits);
+
+ if (initiate_cxl_reset) {
+ /* TODO: invoke CXL protocol reset via cxl subsystem */
+ dev_warn(&pdev->dev, "vfio-cxl: CXL reset requested but not yet supported\n");
+ }
+}
+
+static void cxl_status2_write(struct vfio_pci_core_device *vdev,
+ u16 abs_off, u16 new_val)
+{
+ u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
+
+ /* RW1CS: write 1 to clear, but only if the capability is supported */
+ if ((cap3 & CXL_CAP3_VOLATILE_HDM_BIT) &&
+ (new_val & CXL_STATUS2_RW1CS_BIT))
+ pci_write_config_word(vdev->pdev, abs_off,
+ CXL_STATUS2_RW1CS_BIT);
+ /* STATUS2 is not mirrored in vconfig - reads go to hardware */
+}
+
+static void cxl_lock_write(struct vfio_pci_core_device *vdev,
+ u16 abs_off, u16 new_val)
+{
+ u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
+
+ /* Once the LOCK bit is set it can only be cleared by conventional reset */
+ if (cur_val & CXL_CTRL_LOCK_BIT)
+ return;
+
+ new_val &= ~CXL_LOCK_RESERVED_MASK;
+ dvsec_virt_write16(vdev, CXL_DVSEC_LOCK_OFFSET, new_val);
+}
+
+static void cxl_range_base_lo_write(struct vfio_pci_core_device *vdev,
+ u16 dvsec_off, u32 new_val)
+{
+ new_val &= ~CXL_BASE_LO_RESERVED_MASK;
+ dvsec_virt_write32(vdev, dvsec_off, new_val);
+}
+
+/*
+ * vfio_cxl_dvsec_readfn - per-device DVSEC read handler.
+ *
+ * Called via vfio_pci_dvsec_dispatch_read() for devices that have registered
+ * a dvsec_readfn. Returns shadow vconfig values for virtualized DVSEC
+ * registers (CONTROL, STATUS, CONTROL2, LOCK) so that userspace reads reflect
+ * the emulated state rather than the raw hardware value. All other DVSEC
+ * bytes are passed through to hardware via vfio_raw_config_read().
+ */
+static int vfio_cxl_dvsec_readfn(struct vfio_pci_core_device *vdev,
+ int pos, int count,
+ struct perm_bits *perm,
+ int offset, __le32 *val)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u16 dvsec_off;
+
+ if (!cxl || (u16)pos < cxl->dvsec ||
+ (u16)pos >= cxl->dvsec + cxl->dvsec_length)
+ return vfio_raw_config_read(vdev, pos, count, perm, offset, val);
+
+ dvsec_off = (u16)pos - cxl->dvsec;
+
+ switch (dvsec_off) {
+ case CXL_DVSEC_CONTROL_OFFSET:
+ case CXL_DVSEC_STATUS_OFFSET:
+ case CXL_DVSEC_CONTROL2_OFFSET:
+ case CXL_DVSEC_LOCK_OFFSET:
+ /* Return shadow vconfig value for virtualized registers */
+ memcpy(val, vdev->vconfig + pos, count);
+ return count;
+ default:
+ return vfio_raw_config_read(vdev, pos, count,
+ perm, offset, val);
+ }
+}
+
+/*
+ * vfio_cxl_dvsec_writefn - ecap_perms write handler for PCI_EXT_CAP_ID_DVSEC.
+ *
+ * Installed once into ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn by
+ * vfio_pci_init_perm_bits() when CONFIG_VFIO_CXL_CORE=y. Applies to every
+ * device opened under vfio-pci; the vdev->cxl NULL check distinguishes CXL
+ * devices from non-CXL devices that happen to expose a DVSEC capability.
+ *
+ * @pos: absolute byte position in config space
+ * @offset: byte offset within the capability structure
+ */
+static int vfio_cxl_dvsec_writefn(struct vfio_pci_core_device *vdev,
+ int pos, int count,
+ struct perm_bits *perm,
+ int offset, __le32 val)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u16 abs_off = (u16)pos;
+ u16 dvsec_off;
+ u16 wval16;
+ u32 wval32;
+
+ if (!cxl || (u16)pos < cxl->dvsec ||
+ (u16)pos >= cxl->dvsec + cxl->dvsec_length)
+ return vfio_raw_config_write(vdev, pos, count, perm,
+ offset, val);
+
+ pci_dbg(vdev->pdev,
+ "vfio_cxl: DVSEC write: abs=0x%04x dvsec_off=0x%04x "
+ "count=%d raw_val=0x%08x\n",
+ abs_off, abs_off - cxl->dvsec, count, le32_to_cpu(val));
+
+ dvsec_off = abs_off - cxl->dvsec;
+
+ /* Route to the appropriate per-register handler */
+ switch (dvsec_off) {
+ case CXL_DVSEC_CONTROL_OFFSET:
+ wval16 = (u16)le32_to_cpu(val);
+ cxl_control_write(vdev, abs_off, wval16);
+ break;
+ case CXL_DVSEC_STATUS_OFFSET:
+ wval16 = (u16)le32_to_cpu(val);
+ cxl_status_write(vdev, abs_off, wval16);
+ break;
+ case CXL_DVSEC_CONTROL2_OFFSET:
+ wval16 = (u16)le32_to_cpu(val);
+ cxl_control2_write(vdev, abs_off, wval16);
+ break;
+ case CXL_DVSEC_STATUS2_OFFSET:
+ wval16 = (u16)le32_to_cpu(val);
+ cxl_status2_write(vdev, abs_off, wval16);
+ break;
+ case CXL_DVSEC_LOCK_OFFSET:
+ wval16 = (u16)le32_to_cpu(val);
+ cxl_lock_write(vdev, abs_off, wval16);
+ break;
+ case CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET:
+ case CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET:
+ wval32 = le32_to_cpu(val);
+ dvsec_virt_write32(vdev, dvsec_off, wval32);
+ break;
+ case CXL_DVSEC_RANGE1_BASE_LOW_OFFSET:
+ case CXL_DVSEC_RANGE2_BASE_LOW_OFFSET:
+ wval32 = le32_to_cpu(val);
+ cxl_range_base_lo_write(vdev, dvsec_off, wval32);
+ break;
+ default:
+ /* RO registers: header, capability, range sizes - discard */
+ break;
+ }
+
+ return count;
+}
+
+/*
+ * vfio_cxl_setup_dvsec_perms - Install per-device CXL DVSEC read/write hooks.
+ *
+ * Called once per device open after vfio_config_init() has seeded vdev->vconfig
+ * from hardware. Registers vfio_cxl_dvsec_readfn and vfio_cxl_dvsec_writefn
+ * as the per-device DVSEC handlers. The global dispatch functions installed
+ * in ecap_perms[PCI_EXT_CAP_ID_DVSEC] at module init call these per-device
+ * hooks so that pci_config_map bytes remain PCI_EXT_CAP_ID_DVSEC throughout.
+ *
+ * vfio_cxl_dvsec_readfn: returns vconfig shadow for CONTROL/STATUS/CONTROL2/
+ * LOCK; passes all other DVSEC bytes through to hardware.
+ * vfio_cxl_dvsec_writefn: enforces per-register semantics (RW1C, forced
+ * IO_ENABLE, reserved-bit masking) and stores results in vconfig.
+ *
+ * Also forces CXL.io IO_ENABLE in the CONTROL vconfig shadow so the initial
+ * read returns 1 even before the first write.
+ */
+void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev)
+{
+ u16 ctrl = dvsec_virt_read16(vdev, CXL_DVSEC_CONTROL_OFFSET);
+
+ /*
+ * Register per-device DVSEC read/write handlers. The global
+ * ecap_perms[PCI_EXT_CAP_ID_DVSEC] dispatchers will call them.
+ *
+ * vfio_cxl_dvsec_readfn returns vconfig shadow values for the
+ * virtualized registers (CONTROL, STATUS, CONTROL2, LOCK) so that
+ * reads reflect emulated state rather than raw hardware.
+ *
+ * vfio_cxl_dvsec_writefn enforces per-register semantics (RW1C,
+ * forced IO_ENABLE, reserved-bit masking) and stores results in
+ * vconfig. Because ecap_perms[DVSEC].writefn dispatches to this
+ * handler, the pci_config_map bytes remain as PCI_EXT_CAP_ID_DVSEC;
+ * no PCI_CAP_ID_INVALID_VIRT marking is needed or wanted.
+ */
+ vdev->dvsec_readfn = vfio_cxl_dvsec_readfn;
+ vdev->dvsec_writefn = vfio_cxl_dvsec_writefn;
+
+ /*
+ * vconfig is seeded from hardware at open time. Force IO_ENABLE set
+ * in the CONTROL shadow so the initial read returns 1 even if the
+ * hardware reset value has it cleared. Subsequent writes are handled
+ * by cxl_control_write() which also forces this bit.
+ */
+ ctrl |= CXL_CTRL_CXL_IO_ENABLE_BIT;
+ dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, ctrl);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_setup_dvsec_perms);
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 15b6c0d75d9e..e18e992800f6 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -26,6 +26,7 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
struct vfio_pci_cxl_state *cxl;
bool cxl_mem_capable, is_cxl_type3;
u16 cap_word;
+ u32 hdr1;
/*
* The devm allocation for the CXL state remains for the entire time
@@ -47,6 +48,9 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
cxl->dpa_region_idx = -1;
cxl->comp_reg_region_idx = -1;
+ pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr1);
+ cxl->dvsec_length = PCI_DVSEC_HEADER1_LEN(hdr1);
+
pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
&cap_word);
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 3ef8d923a7e8..158fe4e67f98 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -31,6 +31,7 @@ struct vfio_pci_cxl_state {
u32 hdm_count;
int dpa_region_idx;
int comp_reg_region_idx;
+ size_t dvsec_length;
u16 dvsec;
u8 comp_reg_bar;
bool precommitted;
@@ -76,9 +77,44 @@ struct vfio_pci_cxl_state {
* (CXL 2.0+ 8.1.3).
* Offsets are relative to the DVSEC capability base (cxl->dvsec).
*/
-#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
+#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
+#define CXL_DVSEC_CONTROL_OFFSET 0xc
+#define CXL_DVSEC_STATUS_OFFSET 0xe
+#define CXL_DVSEC_CONTROL2_OFFSET 0x10
+#define CXL_DVSEC_STATUS2_OFFSET 0x12
+#define CXL_DVSEC_LOCK_OFFSET 0x14
+#define CXL_DVSEC_CAPABILITY2_OFFSET 0x16
+#define CXL_DVSEC_RANGE1_SIZE_HIGH_OFFSET 0x18
+#define CXL_DVSEC_RANGE1_SIZE_LOW_OFFSET 0x1c
+#define CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET 0x20
+#define CXL_DVSEC_RANGE1_BASE_LOW_OFFSET 0x24
+#define CXL_DVSEC_RANGE2_SIZE_HIGH_OFFSET 0x28
+#define CXL_DVSEC_RANGE2_SIZE_LOW_OFFSET 0x2c
+#define CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET 0x30
+#define CXL_DVSEC_RANGE2_BASE_LOW_OFFSET 0x34
+#define CXL_DVSEC_CAPABILITY3_OFFSET 0x38
+
#define CXL_DVSEC_MEM_CAPABLE BIT(2)
+/* CXL Control / Status / Lock - bit definitions */
+#define CXL_CTRL_LOCK_BIT BIT(0)
+#define CXL_CTRL_CXL_IO_ENABLE_BIT BIT(1)
+#define CXL_CTRL2_INITIATE_CXL_RESET_BIT BIT(2)
+#define CXL_CAP3_VOLATILE_HDM_BIT BIT(3)
+#define CXL_STATUS2_RW1CS_BIT BIT(3)
+#define CXL_CAP3_P2P_BIT BIT(4)
+#define CXL_CAP2_MODIFIED_COMPLETION_BIT BIT(6)
+#define CXL_STATUS_RW1C_BIT BIT(14)
+#define CXL_CTRL_RESERVED_MASK (BIT(13) | BIT(15))
+#define CXL_CTRL_P2P_REV_MASK BIT(12)
+#define CXL_STATUS_RESERVED_MASK (GENMASK(13, 0) | BIT(15))
+#define CXL_CTRL2_RESERVED_MASK GENMASK(15, 6)
+#define CXL_CTRL2_HW_BITS_MASK (BIT(0) | BIT(1) | BIT(3))
+#define CXL_CTRL2_VOLATILE_HDM_REV_MASK BIT(4)
+#define CXL_CTRL2_MODIFIED_COMP_REV_MASK BIT(5)
+#define CXL_LOCK_RESERVED_MASK GENMASK(15, 1)
+#define CXL_BASE_LO_RESERVED_MASK GENMASK(27, 0)
+
int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev);
void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev);
void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index d3138badeaa6..22cf9ea831f9 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -121,12 +121,26 @@ static int vfio_pci_open_device(struct vfio_device *core_vdev)
}
if (vdev->cxl) {
+ /*
+ * pci_config_map and vconfig are valid now (allocated by
+ * vfio_config_init() inside vfio_pci_core_enable() above).
+ */
+ vfio_cxl_setup_dvsec_perms(vdev);
+
ret = vfio_cxl_register_cxl_region(vdev);
if (ret) {
pci_warn(pdev, "Failed to setup CXL region\n");
vfio_pci_core_disable(vdev);
return ret;
}
+
+ ret = vfio_cxl_register_comp_regs_region(vdev);
+ if (ret) {
+ pci_warn(pdev, "Failed to register COMP_REGS region\n");
+ vfio_cxl_unregister_cxl_region(vdev);
+ vfio_pci_core_disable(vdev);
+ return ret;
+ }
}
vfio_pci_core_finish_enable(vdev);
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 79aaf270adb2..90e2c25381d6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1085,6 +1085,49 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
return 0;
}
+/*
+ * vfio_pci_dvsec_dispatch_read - per-device DVSEC read dispatcher.
+ *
+ * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn at module init.
+ * Calls vdev->dvsec_readfn when a shadow-read handler has been registered
+ * (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices), otherwise
+ * falls through to vfio_raw_config_read for hardware pass-through.
+ *
+ * This indirection allows per-device DVSEC reads from vconfig shadow
+ * without touching the global ecap_perms[] table.
+ */
+static int vfio_pci_dvsec_dispatch_read(struct vfio_pci_core_device *vdev,
+ int pos, int count,
+ struct perm_bits *perm,
+ int offset, __le32 *val)
+{
+ if (vdev->dvsec_readfn)
+ return vdev->dvsec_readfn(vdev, pos, count, perm, offset, val);
+ return vfio_raw_config_read(vdev, pos, count, perm, offset, val);
+}
+
+/*
+ * vfio_pci_dvsec_dispatch_write - per-device DVSEC write dispatcher.
+ *
+ * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn at module init.
+ * Calls vdev->dvsec_writefn when a handler has been registered for this
+ * device (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices),
+ * otherwise falls through to vfio_raw_config_write so that non-CXL
+ * devices with a DVSEC capability continue to pass writes to hardware.
+ *
+ * This indirection allows per-device DVSEC handlers to be registered
+ * without touching the global ecap_perms[] table.
+ */
+static int vfio_pci_dvsec_dispatch_write(struct vfio_pci_core_device *vdev,
+ int pos, int count,
+ struct perm_bits *perm,
+ int offset, __le32 val)
+{
+ if (vdev->dvsec_writefn)
+ return vdev->dvsec_writefn(vdev, pos, count, perm, offset, val);
+ return vfio_raw_config_write(vdev, pos, count, perm, offset, val);
+}
+
/*
* Initialize the shared permission tables
*/
@@ -1121,7 +1164,8 @@ int __init vfio_pci_init_perm_bits(void)
ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
- ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
+ ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn = vfio_pci_dvsec_dispatch_read;
+ ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_pci_dvsec_dispatch_write;
if (ret)
vfio_pci_uninit_perm_bits();
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index f8db9a05c033..d778107fa908 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -154,6 +154,7 @@ void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
+void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev);
#else
@@ -180,6 +181,8 @@ vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
{ return 0; }
static inline void
vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev) { }
#endif /* CONFIG_VFIO_CXL_CORE */
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index cd8ed98a82a3..aa159d0c8da7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -31,7 +31,7 @@ struct p2pdma_provider;
struct dma_buf_phys_vec;
struct dma_buf_attachment;
struct vfio_pci_cxl_state;
-
+struct perm_bits;
struct vfio_pci_eventfd {
struct eventfd_ctx *ctx;
@@ -141,6 +141,12 @@ struct vfio_pci_core_device {
struct list_head ioeventfds_list;
struct vfio_pci_vf_token *vf_token;
struct vfio_pci_cxl_state *cxl;
+ int (*dvsec_readfn)(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 *val);
+ int (*dvsec_writefn)(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 val);
struct list_head sriov_pfs_item;
struct vfio_pci_core_device *sriov_pf_core_dev;
struct notifier_block nb;
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread

* Re: [PATCH 15/20] vfio/cxl: Introduce CXL DVSEC configuration space emulation
2026-03-11 20:34 ` [PATCH 15/20] vfio/cxl: Introduce CXL DVSEC configuration space emulation mhonap
@ 2026-03-13 22:07 ` Dave Jiang
2026-03-18 18:41 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-13 22:07 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> CXL devices have CXL DVSEC registers in the configuration space.
> Many of them affect the behaviors of the devices, e.g. enabling
> CXL.io/CXL.mem/CXL.cache.
>
> However, these configurations are owned by the host and a virtualization
> policy should be applied when handling the access from the guest.
>
> Introduce the emulation of CXL configuration space to handle the access
> of the virtual CXL configuration space from the guest.
>
> vfio-pci-core already allocates vdev->vconfig as the authoritative
> virtual config space shadow. Directly use vdev->vconfig:
> - DVSEC reads return data from vdev->vconfig (already populated by
> vfio_config_init() via vfio_ecap_init())
> - DVSEC writes go through new CXL-aware write handlers that update
> vdev->vconfig in place
> - The writable DVSEC registers are marked virtual in vdev->pci_config_map
>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/vfio/pci/Makefile | 2 +-
> drivers/vfio/pci/cxl/vfio_cxl_config.c | 304 +++++++++++++++++++++++++
> drivers/vfio/pci/cxl/vfio_cxl_core.c | 4 +
> drivers/vfio/pci/cxl/vfio_cxl_priv.h | 38 +++-
> drivers/vfio/pci/vfio_pci.c | 14 ++
> drivers/vfio/pci/vfio_pci_config.c | 46 +++-
> drivers/vfio/pci/vfio_pci_priv.h | 3 +
> include/linux/vfio_pci_core.h | 8 +-
> 8 files changed, 415 insertions(+), 4 deletions(-)
> create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
>
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index bef916495eae..7c86b7845e8f 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,7 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0-only
>
> vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> -vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o
> +vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o cxl/vfio_cxl_config.o
> vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
> obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_config.c b/drivers/vfio/pci/cxl/vfio_cxl_config.c
> new file mode 100644
> index 000000000000..a9560661345c
> --- /dev/null
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_config.c
> @@ -0,0 +1,304 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * CXL DVSEC configuration space emulation for vfio-pci.
> + *
> + * Integrates into the existing vfio-pci-core ecap_perms[] framework using
> + * vdev->vconfig as the sole shadow buffer for DVSEC registers.
> + *
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "../vfio_pci_priv.h"
> +#include "vfio_cxl_priv.h"
> +
> +/* Helpers to access vdev->vconfig at a DVSEC-relative offset */
> +static inline u16 dvsec_virt_read16(struct vfio_pci_core_device *vdev,
> + u16 off)
> +{
> + return le16_to_cpu(*(u16 *)(vdev->vconfig +
> + vdev->cxl->dvsec + off));
> +}
> +
> +static inline void dvsec_virt_write16(struct vfio_pci_core_device *vdev,
> + u16 off, u16 val)
> +{
> + *(u16 *)(vdev->vconfig + vdev->cxl->dvsec + off) = cpu_to_le16(val);
> +}
> +
> +static inline u32 dvsec_virt_read32(struct vfio_pci_core_device *vdev,
> + u16 off)
> +{
> + return le32_to_cpu(*(u32 *)(vdev->vconfig +
> + vdev->cxl->dvsec + off));
> +}
> +
> +static inline void dvsec_virt_write32(struct vfio_pci_core_device *vdev,
> + u16 off, u32 val)
> +{
> + *(u32 *)(vdev->vconfig + vdev->cxl->dvsec + off) = cpu_to_le32(val);
> +}
> +
> +/* Individual DVSEC register write handlers */
> +
> +static void cxl_control_write(struct vfio_pci_core_device *vdev,
cxl_dvsec_control_write()
> + u16 abs_off, u16 new_val)
abs_off not needed?
> +{
> + u16 lock = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
> + u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
> + u16 rev_mask = CXL_CTRL_RESERVED_MASK;
> +
> + if (lock & CXL_CTRL_LOCK_BIT)
> + return; /* register is locked after first write */
> +
> + if (!(cap3 & CXL_CAP3_P2P_BIT))
> + rev_mask |= CXL_CTRL_P2P_REV_MASK;
> +
> + new_val &= ~rev_mask;
> + new_val |= CXL_CTRL_CXL_IO_ENABLE_BIT; /* CXL.io always enabled */
Can FIELD_MODIFY() be used here?
> +
> + dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, new_val);
> +}
> +
> +static void cxl_status_write(struct vfio_pci_core_device *vdev,
cxl_dvsec_status_write()
> + u16 abs_off, u16 new_val)
abs_off not needed
> +{
> + u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_STATUS_OFFSET);
> +
> + new_val &= ~CXL_STATUS_RESERVED_MASK;
> +
> + /* RW1C: writing a 1 clears the bit; writing 0 leaves it unchanged */
> + if (new_val & CXL_STATUS_RW1C_BIT)
> + new_val &= ~CXL_STATUS_RW1C_BIT;
> + else
> + new_val = (new_val & ~CXL_STATUS_RW1C_BIT) |
> + (cur_val & CXL_STATUS_RW1C_BIT);
Given there's only 1 bit we need to deal with in this register and everything else is reserved, can't we just do:
if (new_val & CXL_STATUS_RW1C_BIT)
new_val = 0;
else
new_val = old_val;
> +
> + dvsec_virt_write16(vdev, CXL_DVSEC_STATUS_OFFSET, new_val);
> +}
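[Editor's note: the RW1C rule discussed above, in the reviewer's simplified form (one writable bit, everything else reserved), reduces to "cleared or unchanged". A minimal standalone model — `DEMO_RW1C_BIT` mirrors CXL_STATUS_RW1C_BIT (bit 14) but this is an illustration, not the kernel code:]

```c
#include <stdint.h>

#define DEMO_RW1C_BIT (1u << 14)

/* Returns the register value after a guest write. */
uint16_t demo_rw1c_write(uint16_t cur_val, uint16_t new_val)
{
	if (new_val & DEMO_RW1C_BIT)
		return 0;	/* writing 1 clears the latched status */
	return cur_val;		/* writing 0 leaves it unchanged */
}
```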
> +
> +static void cxl_control2_write(struct vfio_pci_core_device *vdev,
> + u16 abs_off, u16 new_val)
abs_off not needed?
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u16 cap2 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY2_OFFSET);
> + u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
> + u16 rev_mask = CXL_CTRL2_RESERVED_MASK;
> + u16 hw_bits = CXL_CTRL2_HW_BITS_MASK;
> + bool initiate_cxl_reset = new_val & CXL_CTRL2_INITIATE_CXL_RESET_BIT;
> +
> + if (!(cap3 & CXL_CAP3_VOLATILE_HDM_BIT))
> + rev_mask |= CXL_CTRL2_VOLATILE_HDM_REV_MASK;
> + if (!(cap2 & CXL_CAP2_MODIFIED_COMPLETION_BIT))
> + rev_mask |= CXL_CTRL2_MODIFIED_COMP_REV_MASK;
> +
> + new_val &= ~rev_mask;
> +
> + /* Bits that go directly to hardware */
> + hw_bits &= new_val;
> +
Bit 1 and 2 are always read 0 by hardware. Probably should clear it before writing to the virtual register?
> + dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL2_OFFSET, new_val);
> +
> + if (hw_bits)
> + pci_write_config_word(pdev, abs_off, hw_bits);
> +
> + if (initiate_cxl_reset) {
> + /* TODO: invoke CXL protocol reset via cxl subsystem */
> + dev_warn(&pdev->dev, "vfio-cxl: CXL reset requested but not yet supported\n");
> + }
> +}
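[Editor's note: the CONTROL2 handler above splits a write into shadow-only bits and bits forwarded to real hardware. A hedged sketch of that split as posted (before the reviewer's bit-1/2 note is addressed); `DEMO_HW_MASK`/`DEMO_RESV_MASK` are illustrative stand-ins for CXL_CTRL2_HW_BITS_MASK and CXL_CTRL2_RESERVED_MASK:]

```c
#include <stdint.h>

#define DEMO_HW_MASK	((1u << 0) | (1u << 1) | (1u << 3))
#define DEMO_RESV_MASK	0xffc0u	/* bits 15:6 reserved */

/* Stores the masked value in the virtual register via *shadow_val and
 * returns only the bits that should reach hardware. */
uint16_t demo_ctrl2_write(uint16_t new_val, uint16_t *shadow_val)
{
	new_val &= (uint16_t)~DEMO_RESV_MASK;
	*shadow_val = new_val;
	return new_val & DEMO_HW_MASK;
}
```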
> +
> +static void cxl_status2_write(struct vfio_pci_core_device *vdev,
> + u16 abs_off, u16 new_val)
abs_off not needed
> +{
> + u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
> +
> + /* RW1CS: write 1 to clear, but only if the capability is supported */
> + if ((cap3 & CXL_CAP3_VOLATILE_HDM_BIT) &&
> + (new_val & CXL_STATUS2_RW1CS_BIT))
> + pci_write_config_word(vdev->pdev, abs_off,
> + CXL_STATUS2_RW1CS_BIT);
> + /* STATUS2 is not mirrored in vconfig - reads go to hardware */
> +}
> +
> +static void cxl_lock_write(struct vfio_pci_core_device *vdev,
> + u16 abs_off, u16 new_val)
abs_off not needed
> +{
> + u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
> +
> + /* Once the LOCK bit is set it can only be cleared by conventional reset */
> + if (cur_val & CXL_CTRL_LOCK_BIT)
> + return;
> +
> + new_val &= ~CXL_LOCK_RESERVED_MASK;
> + dvsec_virt_write16(vdev, CXL_DVSEC_LOCK_OFFSET, new_val);
> +}
> +
> +static void cxl_range_base_lo_write(struct vfio_pci_core_device *vdev,
> + u16 dvsec_off, u32 new_val)
> +{
> + new_val &= ~CXL_BASE_LO_RESERVED_MASK;
> + dvsec_virt_write32(vdev, dvsec_off, new_val);
> +}
> +
> +/*
> + * vfio_cxl_dvsec_readfn - per-device DVSEC read handler.
> + *
> + * Called via vfio_pci_dvsec_dispatch_read() for devices that have registered
> + * a dvsec_readfn. Returns shadow vconfig values for virtualized DVSEC
> + * registers (CONTROL, STATUS, CONTROL2, LOCK) so that userspace reads reflect
> + * the emulated state rather than the raw hardware value. All other DVSEC
> + * bytes are passed through to hardware via vfio_raw_config_read().
> + */
Provide proper kdoc function header
> +static int vfio_cxl_dvsec_readfn(struct vfio_pci_core_device *vdev,
> + int pos, int count,
> + struct perm_bits *perm,
> + int offset, __le32 *val)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + u16 dvsec_off;
> +
> + if (!cxl || (u16)pos < cxl->dvsec ||
> + (u16)pos >= cxl->dvsec + cxl->dvsec_length)
> + return vfio_raw_config_read(vdev, pos, count, perm, offset, val);
> +
> + dvsec_off = (u16)pos - cxl->dvsec;
> +
> + switch (dvsec_off) {
> + case CXL_DVSEC_CONTROL_OFFSET:
> + case CXL_DVSEC_STATUS_OFFSET:
> + case CXL_DVSEC_CONTROL2_OFFSET:
> + case CXL_DVSEC_LOCK_OFFSET:
> + /* Return shadow vconfig value for virtualized registers */
> + memcpy(val, vdev->vconfig + pos, count);
> + return count;
> + default:
> + return vfio_raw_config_read(vdev, pos, count,
> + perm, offset, val);
> + }
> +}
> +
> +/*
> + * vfio_cxl_dvsec_writefn - ecap_perms write handler for PCI_EXT_CAP_ID_DVSEC.
> + *
> + * Installed once into ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn by
> + * vfio_pci_init_perm_bits() when CONFIG_VFIO_CXL_CORE=y. Applies to every
> + * device opened under vfio-pci; the vdev->cxl NULL check distinguishes CXL
> + * devices from non-CXL devices that happen to expose a DVSEC capability.
> + *
> + * @pos: absolute byte position in config space
> + * @offset: byte offset within the capability structure
missing return value expectations
> + */
> +static int vfio_cxl_dvsec_writefn(struct vfio_pci_core_device *vdev,
> + int pos, int count,
> + struct perm_bits *perm,
> + int offset, __le32 val)
> +{
> + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> + u16 abs_off = (u16)pos;
> + u16 dvsec_off;
> + u16 wval16;
> + u32 wval32;
> +
> + if (!cxl || (u16)pos < cxl->dvsec ||
> + (u16)pos >= cxl->dvsec + cxl->dvsec_length)
> + return vfio_raw_config_write(vdev, pos, count, perm,
> + offset, val);
> +
> + pci_dbg(vdev->pdev,
> + "vfio_cxl: DVSEC write: abs=0x%04x dvsec_off=0x%04x "
> + "count=%d raw_val=0x%08x\n",
> + abs_off, abs_off - cxl->dvsec, count, le32_to_cpu(val));
> +
> + dvsec_off = abs_off - cxl->dvsec;
> +
> + /* Route to the appropriate per-register handler */
> + switch (dvsec_off) {
> + case CXL_DVSEC_CONTROL_OFFSET:
> + wval16 = (u16)le32_to_cpu(val);
> + cxl_control_write(vdev, abs_off, wval16);
> + break;
> + case CXL_DVSEC_STATUS_OFFSET:
> + wval16 = (u16)le32_to_cpu(val);
> + cxl_status_write(vdev, abs_off, wval16);
> + break;
> + case CXL_DVSEC_CONTROL2_OFFSET:
> + wval16 = (u16)le32_to_cpu(val);
> + cxl_control2_write(vdev, abs_off, wval16);
> + break;
> + case CXL_DVSEC_STATUS2_OFFSET:
> + wval16 = (u16)le32_to_cpu(val);
> + cxl_status2_write(vdev, abs_off, wval16);
> + break;
> + case CXL_DVSEC_LOCK_OFFSET:
> + wval16 = (u16)le32_to_cpu(val);
> + cxl_lock_write(vdev, abs_off, wval16);
> + break;
> + case CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET:
> + case CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET:
> + wval32 = le32_to_cpu(val);
> + dvsec_virt_write32(vdev, dvsec_off, wval32);
> + break;
> + case CXL_DVSEC_RANGE1_BASE_LOW_OFFSET:
> + case CXL_DVSEC_RANGE2_BASE_LOW_OFFSET:
> + wval32 = le32_to_cpu(val);
> + cxl_range_base_lo_write(vdev, dvsec_off, wval32);
> + break;
> + default:
> + /* RO registers: header, capability, range sizes - discard */
> + break;
> + }
> +
> + return count;
> +}
> +
> +/*
> + * vfio_cxl_setup_dvsec_perms - Install per-device CXL DVSEC read/write hooks.
> + *
> + * Called once per device open after vfio_config_init() has seeded vdev->vconfig
> + * from hardware. Registers vfio_cxl_dvsec_readfn and vfio_cxl_dvsec_writefn
> + * as the per-device DVSEC handlers. The global dispatch functions installed
> + * in ecap_perms[PCI_EXT_CAP_ID_DVSEC] at module init call these per-device
> + * hooks so that pci_config_map bytes remain PCI_EXT_CAP_ID_DVSEC throughout.
provide proper kdoc function header
> + *
> + * vfio_cxl_dvsec_readfn: returns vconfig shadow for CONTROL/STATUS/CONTROL2/
> + * LOCK; passes all other DVSEC bytes through to hardware.
> + * vfio_cxl_dvsec_writefn: enforces per-register semantics (RW1C, forced
> + * IO_ENABLE, reserved-bit masking) and stores results in vconfig.
> + *
> + * Also forces CXL.io IO_ENABLE in the CONTROL vconfig shadow so the initial
> + * read returns 1 even before the first write.
> + */
> +void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev)
> +{
> + u16 ctrl = dvsec_virt_read16(vdev, CXL_DVSEC_CONTROL_OFFSET);
> +
> + /*
> + * Register per-device DVSEC read/write handlers. The global
> + * ecap_perms[PCI_EXT_CAP_ID_DVSEC] dispatchers will call them.
> + *
> + * vfio_cxl_dvsec_readfn returns vconfig shadow values for the
> + * virtualized registers (CONTROL, STATUS, CONTROL2, LOCK) so that
> + * reads reflect emulated state rather than raw hardware.
> + *
> + * vfio_cxl_dvsec_writefn enforces per-register semantics (RW1C,
> + * forced IO_ENABLE, reserved-bit masking) and stores results in
> + * vconfig. Because ecap_perms[DVSEC].writefn dispatches to this
> + * handler, the pci_config_map bytes remain as PCI_EXT_CAP_ID_DVSEC
> + * - no PCI_CAP_ID_INVALID_VIRT marking is needed or wanted.
> + */
> + vdev->dvsec_readfn = vfio_cxl_dvsec_readfn;
> + vdev->dvsec_writefn = vfio_cxl_dvsec_writefn;
> +
> + /*
> + * vconfig is seeded from hardware at open time. Force IO_ENABLE set
> + * in the CONTROL shadow so the initial read returns 1 even if the
> + * hardware reset value has it cleared. Subsequent writes are handled
> + * by cxl_control_write() which also forces this bit.
> + */
> + ctrl |= CXL_CTRL_CXL_IO_ENABLE_BIT;
> + dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, ctrl);
> +}
> +EXPORT_SYMBOL_GPL(vfio_cxl_setup_dvsec_perms);
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> index 15b6c0d75d9e..e18e992800f6 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> @@ -26,6 +26,7 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> struct vfio_pci_cxl_state *cxl;
> bool cxl_mem_capable, is_cxl_type3;
> u16 cap_word;
> + u32 hdr1;
>
> /*
> * The devm allocation for the CXL state remains for the entire time
> @@ -47,6 +48,9 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> cxl->dpa_region_idx = -1;
> cxl->comp_reg_region_idx = -1;
>
> + pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr1);
> + cxl->dvsec_length = PCI_DVSEC_HEADER1_LEN(hdr1);
> +
> pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> &cap_word);
>
> diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> index 3ef8d923a7e8..158fe4e67f98 100644
> --- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> +++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> @@ -31,6 +31,7 @@ struct vfio_pci_cxl_state {
> u32 hdm_count;
> int dpa_region_idx;
> int comp_reg_region_idx;
> + size_t dvsec_length;
> u16 dvsec;
> u8 comp_reg_bar;
> bool precommitted;
> @@ -76,9 +77,44 @@ struct vfio_pci_cxl_state {
> * (CXL 2.0+ 8.1.3).
> * Offsets are relative to the DVSEC capability base (cxl->dvsec).
> */
> -#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
> +#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
> +#define CXL_DVSEC_CONTROL_OFFSET 0xc
> +#define CXL_DVSEC_STATUS_OFFSET 0xe
> +#define CXL_DVSEC_CONTROL2_OFFSET 0x10
> +#define CXL_DVSEC_STATUS2_OFFSET 0x12
> +#define CXL_DVSEC_LOCK_OFFSET 0x14
> +#define CXL_DVSEC_CAPABILITY2_OFFSET 0x16
> +#define CXL_DVSEC_RANGE1_SIZE_HIGH_OFFSET 0x18
> +#define CXL_DVSEC_RANGE1_SIZE_LOW_OFFSET 0x1c
> +#define CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET 0x20
> +#define CXL_DVSEC_RANGE1_BASE_LOW_OFFSET 0x24
> +#define CXL_DVSEC_RANGE2_SIZE_HIGH_OFFSET 0x28
> +#define CXL_DVSEC_RANGE2_SIZE_LOW_OFFSET 0x2c
> +#define CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET 0x30
> +#define CXL_DVSEC_RANGE2_BASE_LOW_OFFSET 0x34
> +#define CXL_DVSEC_CAPABILITY3_OFFSET 0x38
> +
> #define CXL_DVSEC_MEM_CAPABLE BIT(2)
>
> +/* CXL Control / Status / Lock - bit definitions */
> +#define CXL_CTRL_LOCK_BIT BIT(0)
CXL_CTRL_CONFIG_LOCK_BIT
> +#define CXL_CTRL_CXL_IO_ENABLE_BIT BIT(1)
> +#define CXL_CTRL2_INITIATE_CXL_RESET_BIT BIT(2)
> +#define CXL_CAP3_VOLATILE_HDM_BIT BIT(3)
> +#define CXL_STATUS2_RW1CS_BIT BIT(3)
CXL_STATUS2_VOL_HDM_PRSV_ERR_BIT
> +#define CXL_CAP3_P2P_BIT BIT(4)
> +#define CXL_CAP2_MODIFIED_COMPLETION_BIT BIT(6)
> +#define CXL_STATUS_RW1C_BIT BIT(14)
CXL_STATUS_VIRAL_STATUS_BIT
> +#define CXL_CTRL_RESERVED_MASK (BIT(13) | BIT(15))
> +#define CXL_CTRL_P2P_REV_MASK BIT(12)
> +#define CXL_STATUS_RESERVED_MASK (GENMASK(13, 0) | BIT(15))
> +#define CXL_CTRL2_RESERVED_MASK GENMASK(15, 6)
> +#define CXL_CTRL2_HW_BITS_MASK (BIT(0) | BIT(1) | BIT(3))
> +#define CXL_CTRL2_VOLATILE_HDM_REV_MASK BIT(4)
> +#define CXL_CTRL2_MODIFIED_COMP_REV_MASK BIT(5)
> +#define CXL_LOCK_RESERVED_MASK GENMASK(15, 1)
> +#define CXL_BASE_LO_RESERVED_MASK GENMASK(27, 0)
Move the CXL reg offset and bit defs to a common header. Also, please group the relevant bits together per register.
DJ
> +
> int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev);
> void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev);
> void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index d3138badeaa6..22cf9ea831f9 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -121,12 +121,26 @@ static int vfio_pci_open_device(struct vfio_device *core_vdev)
> }
>
> if (vdev->cxl) {
> + /*
> + * pci_config_map and vconfig are valid now (allocated by
> + * vfio_config_init() inside vfio_pci_core_enable() above).
> + */
> + vfio_cxl_setup_dvsec_perms(vdev);
> +
> ret = vfio_cxl_register_cxl_region(vdev);
> if (ret) {
> pci_warn(pdev, "Failed to setup CXL region\n");
> vfio_pci_core_disable(vdev);
> return ret;
> }
> +
> + ret = vfio_cxl_register_comp_regs_region(vdev);
> + if (ret) {
> + pci_warn(pdev, "Failed to register COMP_REGS region\n");
> + vfio_cxl_unregister_cxl_region(vdev);
> + vfio_pci_core_disable(vdev);
> + return ret;
> + }
> }
>
> vfio_pci_core_finish_enable(vdev);
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 79aaf270adb2..90e2c25381d6 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1085,6 +1085,49 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
> return 0;
> }
>
> +/*
> + * vfio_pci_dvsec_dispatch_read - per-device DVSEC read dispatcher.
> + *
> + * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn at module init.
> + * Calls vdev->dvsec_readfn when a shadow-read handler has been registered
> + * (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices), otherwise
> + * falls through to vfio_raw_config_read for hardware pass-through.
> + *
> + * This indirection allows per-device DVSEC reads from vconfig shadow
> + * without touching the global ecap_perms[] table.
> + */
> +static int vfio_pci_dvsec_dispatch_read(struct vfio_pci_core_device *vdev,
> + int pos, int count,
> + struct perm_bits *perm,
> + int offset, __le32 *val)
> +{
> + if (vdev->dvsec_readfn)
> + return vdev->dvsec_readfn(vdev, pos, count, perm, offset, val);
> + return vfio_raw_config_read(vdev, pos, count, perm, offset, val);
> +}
> +
> +/*
> + * vfio_pci_dvsec_dispatch_write - per-device DVSEC write dispatcher.
> + *
> + * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn at module init.
> + * Calls vdev->dvsec_writefn when a handler has been registered for this
> + * device (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices),
> + * otherwise falls through to vfio_raw_config_write so that non-CXL
> + * devices with a DVSEC capability continue to pass writes to hardware.
> + *
> + * This indirection allows per-device DVSEC handlers to be registered
> + * without touching the global ecap_perms[] table.
> + */
> +static int vfio_pci_dvsec_dispatch_write(struct vfio_pci_core_device *vdev,
> + int pos, int count,
> + struct perm_bits *perm,
> + int offset, __le32 val)
> +{
> + if (vdev->dvsec_writefn)
> + return vdev->dvsec_writefn(vdev, pos, count, perm, offset, val);
> + return vfio_raw_config_write(vdev, pos, count, perm, offset, val);
> +}
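[Editor's note: the indirection the two dispatchers above implement — a table-installed handler consulting an optional per-device hook before falling back to the raw path — can be sketched in isolation. Types and names below are simplified stand-ins, not the vfio API:]

```c
#include <stddef.h>

typedef int (*demo_write_fn)(int pos, int val);

struct demo_dev {
	demo_write_fn dvsec_writefn;	/* NULL on non-CXL devices */
};

int demo_raw_write(int pos, int val)
{
	(void)pos; (void)val;
	return 0;	/* stands in for the raw pass-through path */
}

int demo_cxl_write(int pos, int val)
{
	(void)pos; (void)val;
	return 1;	/* stands in for the per-device CXL handler */
}

/* Global dispatcher: per-device hook if registered, else raw fallback. */
int demo_dispatch_write(struct demo_dev *dev, int pos, int val)
{
	if (dev->dvsec_writefn)
		return dev->dvsec_writefn(pos, val);
	return demo_raw_write(pos, val);
}
```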
> +
> /*
> * Initialize the shared permission tables
> */
> @@ -1121,7 +1164,8 @@ int __init vfio_pci_init_perm_bits(void)
> ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
> ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
> ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
> - ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
> + ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn = vfio_pci_dvsec_dispatch_read;
> + ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_pci_dvsec_dispatch_write;
>
> if (ret)
> vfio_pci_uninit_perm_bits();
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index f8db9a05c033..d778107fa908 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -154,6 +154,7 @@ void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
> void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
> int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
> void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
> +void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev);
>
> #else
>
> @@ -180,6 +181,8 @@ vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
> { return 0; }
> static inline void
> vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev) { }
> +static inline void
> +vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev) { }
>
> #endif /* CONFIG_VFIO_CXL_CORE */
>
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index cd8ed98a82a3..aa159d0c8da7 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -31,7 +31,7 @@ struct p2pdma_provider;
> struct dma_buf_phys_vec;
> struct dma_buf_attachment;
> struct vfio_pci_cxl_state;
> -
> +struct perm_bits;
>
> struct vfio_pci_eventfd {
> struct eventfd_ctx *ctx;
> @@ -141,6 +141,12 @@ struct vfio_pci_core_device {
> struct list_head ioeventfds_list;
> struct vfio_pci_vf_token *vf_token;
> struct vfio_pci_cxl_state *cxl;
> + int (*dvsec_readfn)(struct vfio_pci_core_device *vdev, int pos,
> + int count, struct perm_bits *perm,
> + int offset, __le32 *val);
> + int (*dvsec_writefn)(struct vfio_pci_core_device *vdev, int pos,
> + int count, struct perm_bits *perm,
> + int offset, __le32 val);
> struct list_head sriov_pfs_item;
> struct vfio_pci_core_device *sriov_pf_core_dev;
> struct notifier_block nb;
^ permalink raw reply [flat|nested] 54+ messages in thread

* RE: [PATCH 15/20] vfio/cxl: Introduce CXL DVSEC configuration space emulation
2026-03-13 22:07 ` Dave Jiang
@ 2026-03-18 18:41 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 18:41 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang, Krishnakant Jaju,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 14 March 2026 03:37
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 15/20] vfio/cxl: Introduce CXL DVSEC configuration
> space emulation
>
> External email: Use caution opening links or attachments
>
>
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > CXL devices have CXL DVSEC registers in the configuration space.
> > Many of them affect the behaviors of the devices, e.g. enabling
> > CXL.io/CXL.mem/CXL.cache.
> >
> > However, these configurations are owned by the host and a
> > virtualization policy should be applied when handling the access from
> the guest.
> >
> > Introduce the emulation of CXL configuration space to handle the
> > access of the virtual CXL configuration space from the guest.
> >
> > vfio-pci-core already allocates vdev->vconfig as the authoritative
> > virtual config space shadow. Directly use vdev->vconfig:
> > - DVSEC reads return data from vdev->vconfig (already populated by
> > vfio_config_init() via vfio_ecap_init())
> > - DVSEC writes go through new CXL-aware write handlers that update
> > vdev->vconfig in place
> > - The writable DVSEC registers are marked virtual in
> > vdev->pci_config_map
> >
> > Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> > ---
> > drivers/vfio/pci/Makefile | 2 +-
> > drivers/vfio/pci/cxl/vfio_cxl_config.c | 304 +++++++++++++++++++++++++
> > drivers/vfio/pci/cxl/vfio_cxl_core.c | 4 +
> > drivers/vfio/pci/cxl/vfio_cxl_priv.h | 38 +++-
> > drivers/vfio/pci/vfio_pci.c | 14 ++
> > drivers/vfio/pci/vfio_pci_config.c | 46 +++-
> > drivers/vfio/pci/vfio_pci_priv.h | 3 +
> > include/linux/vfio_pci_core.h | 8 +-
> > 8 files changed, 415 insertions(+), 4 deletions(-) create mode
> > 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
> >
> > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> > index bef916495eae..7c86b7845e8f 100644
> > --- a/drivers/vfio/pci/Makefile
> > +++ b/drivers/vfio/pci/Makefile
> > @@ -1,7 +1,7 @@
> > # SPDX-License-Identifier: GPL-2.0-only
> >
> > vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o
> > vfio_pci_config.o
> > -vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
> > cxl/vfio_cxl_emu.o
> > +vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
> > +cxl/vfio_cxl_emu.o cxl/vfio_cxl_config.o
> > vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> > vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
> > obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o diff --git
> > a/drivers/vfio/pci/cxl/vfio_cxl_config.c
> > b/drivers/vfio/pci/cxl/vfio_cxl_config.c
> > new file mode 100644
> > index 000000000000..a9560661345c
> > --- /dev/null
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_config.c
> > @@ -0,0 +1,304 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * CXL DVSEC configuration space emulation for vfio-pci.
> > + *
> > + * Integrates into the existing vfio-pci-core ecap_perms[] framework
> > +using
> > + * vdev->vconfig as the sole shadow buffer for DVSEC registers.
> > + *
> > + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights
> > +reserved */
> > +
> > +#include <linux/pci.h>
> > +#include <linux/vfio_pci_core.h>
> > +
> > +#include "../vfio_pci_priv.h"
> > +#include "vfio_cxl_priv.h"
> > +
> > +/* Helpers to access vdev->vconfig at a DVSEC-relative offset */
> > +static inline u16 dvsec_virt_read16(struct vfio_pci_core_device *vdev,
> > + u16 off) {
> > + return le16_to_cpu(*(u16 *)(vdev->vconfig +
> > + vdev->cxl->dvsec + off)); }
> > +
> > +static inline void dvsec_virt_write16(struct vfio_pci_core_device
> *vdev,
> > + u16 off, u16 val) {
> > + *(u16 *)(vdev->vconfig + vdev->cxl->dvsec + off) =
> > +cpu_to_le16(val); }
> > +
> > +static inline u32 dvsec_virt_read32(struct vfio_pci_core_device *vdev,
> > + u16 off) {
> > + return le32_to_cpu(*(u32 *)(vdev->vconfig +
> > + vdev->cxl->dvsec + off)); }
> > +
> > +static inline void dvsec_virt_write32(struct vfio_pci_core_device
> *vdev,
> > + u16 off, u32 val) {
> > + *(u32 *)(vdev->vconfig + vdev->cxl->dvsec + off) =
> > +cpu_to_le32(val); }
> > +
> > +/* Individual DVSEC register write handlers */
> > +
> > +static void cxl_control_write(struct vfio_pci_core_device *vdev,
>
> cxl_dvsec_control_write()
Yes, renamed.
>
> > + u16 abs_off, u16 new_val)
>
> abs_off not needed?
Removed.
>
> > +{
> > + u16 lock = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
> > + u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
> > + u16 rev_mask = CXL_CTRL_RESERVED_MASK;
> > +
> > + if (lock & CXL_CTRL_LOCK_BIT)
> > + return; /* register is locked after first write */
> > +
> > + if (!(cap3 & CXL_CAP3_P2P_BIT))
> > + rev_mask |= CXL_CTRL_P2P_REV_MASK;
> > +
> > + new_val &= ~rev_mask;
> > + new_val |= CXL_CTRL_CXL_IO_ENABLE_BIT; /* CXL.io always enabled
> > + */
>
> Can FIELD_MODIFY() be used here?
I looked at include/linux/bitfield.h::FIELD_MODIFY(), but it requires the
mask to be a contiguous run of bits. For rev_mask = (BIT(13) | BIT(15)) the
build-time mask check fails.
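[Editor's note: for context, a userspace sketch of what a FIELD_MODIFY()-style helper does for a contiguous mask — replace only the selected field. The kernel macro enforces contiguity at build time, which is why a scattered mask such as (BIT(13) | BIT(15)) cannot use it. This is a hypothetical helper, not the bitfield.h implementation:]

```c
#include <stdint.h>

uint16_t demo_field_modify16(uint16_t reg, uint16_t mask, uint16_t field_val)
{
	/* shift = index of the lowest set bit in the (contiguous) mask */
	unsigned int shift = (unsigned int)__builtin_ctz(mask);

	return (uint16_t)((reg & ~mask) |
			  ((uint16_t)(field_val << shift) & mask));
}
```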
>
> > +
> > + dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, new_val); }
> > +
> > +static void cxl_status_write(struct vfio_pci_core_device *vdev,
>
> cxl_dvsec_status_write()
> > + u16 abs_off, u16 new_val)
>
> abs_off not needed
Removed.
>
>
> > +{
> > + u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_STATUS_OFFSET);
> > +
> > + new_val &= ~CXL_STATUS_RESERVED_MASK;
> > +
> > + /* RW1C: writing a 1 clears the bit; writing 0 leaves it unchanged
> */
> > + if (new_val & CXL_STATUS_RW1C_BIT)
> > + new_val &= ~CXL_STATUS_RW1C_BIT;
> > + else
> > + new_val = (new_val & ~CXL_STATUS_RW1C_BIT) |
> > + (cur_val & CXL_STATUS_RW1C_BIT);
>
> Given there's only 1 bit we need to deal with in this register and
> everything else is reserved, can't we just do:
>
> if (new_val & CXL_STATUS_RW1C_BIT)
> new_val = 0;
> else
> new_val = old_val;
Yes, updated.
>
> > +
> > + dvsec_virt_write16(vdev, CXL_DVSEC_STATUS_OFFSET, new_val); }
> > +
> > +static void cxl_control2_write(struct vfio_pci_core_device *vdev,
> > + u16 abs_off, u16 new_val)
>
> abs_off not needed?
Removed.
>
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u16 cap2 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY2_OFFSET);
> > + u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
> > + u16 rev_mask = CXL_CTRL2_RESERVED_MASK;
> > + u16 hw_bits = CXL_CTRL2_HW_BITS_MASK;
> > + bool initiate_cxl_reset = new_val &
> > +CXL_CTRL2_INITIATE_CXL_RESET_BIT;
> > +
> > + if (!(cap3 & CXL_CAP3_VOLATILE_HDM_BIT))
> > + rev_mask |= CXL_CTRL2_VOLATILE_HDM_REV_MASK;
> > + if (!(cap2 & CXL_CAP2_MODIFIED_COMPLETION_BIT))
> > + rev_mask |= CXL_CTRL2_MODIFIED_COMP_REV_MASK;
> > +
> > + new_val &= ~rev_mask;
> > +
> > + /* Bits that go directly to hardware */
> > + hw_bits &= new_val;
> > +
>
> Bit 1 and 2 are always read 0 by hardware. Probably should clear it before
> writing to the virtual register?
Refactored this part.
>
> > + dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL2_OFFSET, new_val);
> > +
> > + if (hw_bits)
> > + pci_write_config_word(pdev, abs_off, hw_bits);
> > +
> > + if (initiate_cxl_reset) {
> > + /* TODO: invoke CXL protocol reset via cxl subsystem */
> > + dev_warn(&pdev->dev, "vfio-cxl: CXL reset requested but
> not yet supported\n");
> > + }
> > +}
> > +
> > +static void cxl_status2_write(struct vfio_pci_core_device *vdev,
> > + u16 abs_off, u16 new_val)
>
> abs_off not needed
Removed.
>
> > +{
> > + u16 cap3 = dvsec_virt_read16(vdev,
> > +CXL_DVSEC_CAPABILITY3_OFFSET);
> > +
> > + /* RW1CS: write 1 to clear, but only if the capability is
> supported */
> > + if ((cap3 & CXL_CAP3_VOLATILE_HDM_BIT) &&
> > + (new_val & CXL_STATUS2_RW1CS_BIT))
> > + pci_write_config_word(vdev->pdev, abs_off,
> > + CXL_STATUS2_RW1CS_BIT);
> > + /* STATUS2 is not mirrored in vconfig - reads go to hardware */
> > +}
> > +
> > +static void cxl_lock_write(struct vfio_pci_core_device *vdev,
> > + u16 abs_off, u16 new_val)
>
> abs_off not needed
Removed.
>
> > +{
> > + u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
> > +
> > + /* Once the LOCK bit is set it can only be cleared by conventional
> reset */
> > + if (cur_val & CXL_CTRL_LOCK_BIT)
> > + return;
> > +
> > + new_val &= ~CXL_LOCK_RESERVED_MASK;
> > + dvsec_virt_write16(vdev, CXL_DVSEC_LOCK_OFFSET, new_val); }
> > +
> > +static void cxl_range_base_lo_write(struct vfio_pci_core_device *vdev,
> > + u16 dvsec_off, u32 new_val) {
> > + new_val &= ~CXL_BASE_LO_RESERVED_MASK;
> > + dvsec_virt_write32(vdev, dvsec_off, new_val); }
> > +
> > +/*
> > + * vfio_cxl_dvsec_readfn - per-device DVSEC read handler.
> > + *
> > + * Called via vfio_pci_dvsec_dispatch_read() for devices that have
> registered
> > + * a dvsec_readfn. Returns shadow vconfig values for virtualized DVSEC
> > + * registers (CONTROL, STATUS, CONTROL2, LOCK) so that userspace reads
> reflect
> > + * the emulated state rather than the raw hardware value. All other
> DVSEC
> > + * bytes are passed through to hardware via vfio_raw_config_read().
> > + */
>
> Provide proper kdoc function header
Updated.
>
> > +static int vfio_cxl_dvsec_readfn(struct vfio_pci_core_device *vdev,
> > + int pos, int count,
> > + struct perm_bits *perm,
> > + int offset, __le32 *val)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + u16 dvsec_off;
> > +
> > + if (!cxl || (u16)pos < cxl->dvsec ||
> > + (u16)pos >= cxl->dvsec + cxl->dvsec_length)
> > + return vfio_raw_config_read(vdev, pos, count, perm,
> offset, val);
> > +
> > + dvsec_off = (u16)pos - cxl->dvsec;
> > +
> > + switch (dvsec_off) {
> > + case CXL_DVSEC_CONTROL_OFFSET:
> > + case CXL_DVSEC_STATUS_OFFSET:
> > + case CXL_DVSEC_CONTROL2_OFFSET:
> > + case CXL_DVSEC_LOCK_OFFSET:
> > + /* Return shadow vconfig value for virtualized registers
> */
> > + memcpy(val, vdev->vconfig + pos, count);
> > + return count;
> > + default:
> > + return vfio_raw_config_read(vdev, pos, count,
> > + perm, offset, val);
> > + }
> > +}
> > +
> > +/*
> > + * vfio_cxl_dvsec_writefn - ecap_perms write handler for
> PCI_EXT_CAP_ID_DVSEC.
> > + *
> > + * Installed once into ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn by
> > + * vfio_pci_init_perm_bits() when CONFIG_VFIO_CXL_CORE=y. Applies to
> every
> > + * device opened under vfio-pci; the vdev->cxl NULL check distinguishes
> CXL
> > + * devices from non-CXL devices that happen to expose a DVSEC
> capability.
> > + *
> > + * @pos: absolute byte position in config space
> > + * @offset: byte offset within the capability structure
>
> missing return value expectations
Updated.
> > + */
> > +static int vfio_cxl_dvsec_writefn(struct vfio_pci_core_device *vdev,
> > + int pos, int count,
> > + struct perm_bits *perm,
> > + int offset, __le32 val)
> > +{
> > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > + u16 abs_off = (u16)pos;
> > + u16 dvsec_off;
> > + u16 wval16;
> > + u32 wval32;
> > +
> > + if (!cxl || (u16)pos < cxl->dvsec ||
> > + (u16)pos >= cxl->dvsec + cxl->dvsec_length)
> > + return vfio_raw_config_write(vdev, pos, count, perm,
> > + offset, val);
> > +
> > + pci_dbg(vdev->pdev,
> > + "vfio_cxl: DVSEC write: abs=0x%04x dvsec_off=0x%04x "
> > + "count=%d raw_val=0x%08x\n",
> > + abs_off, abs_off - cxl->dvsec, count, le32_to_cpu(val));
> > +
> > + dvsec_off = abs_off - cxl->dvsec;
> > +
> > + /* Route to the appropriate per-register handler */
> > + switch (dvsec_off) {
> > + case CXL_DVSEC_CONTROL_OFFSET:
> > + wval16 = (u16)le32_to_cpu(val);
> > + cxl_control_write(vdev, abs_off, wval16);
> > + break;
> > + case CXL_DVSEC_STATUS_OFFSET:
> > + wval16 = (u16)le32_to_cpu(val);
> > + cxl_status_write(vdev, abs_off, wval16);
> > + break;
> > + case CXL_DVSEC_CONTROL2_OFFSET:
> > + wval16 = (u16)le32_to_cpu(val);
> > + cxl_control2_write(vdev, abs_off, wval16);
> > + break;
> > + case CXL_DVSEC_STATUS2_OFFSET:
> > + wval16 = (u16)le32_to_cpu(val);
> > + cxl_status2_write(vdev, abs_off, wval16);
> > + break;
> > + case CXL_DVSEC_LOCK_OFFSET:
> > + wval16 = (u16)le32_to_cpu(val);
> > + cxl_lock_write(vdev, abs_off, wval16);
> > + break;
> > + case CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET:
> > + case CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET:
> > + wval32 = le32_to_cpu(val);
> > + dvsec_virt_write32(vdev, dvsec_off, wval32);
> > + break;
> > + case CXL_DVSEC_RANGE1_BASE_LOW_OFFSET:
> > + case CXL_DVSEC_RANGE2_BASE_LOW_OFFSET:
> > + wval32 = le32_to_cpu(val);
> > + cxl_range_base_lo_write(vdev, dvsec_off, wval32);
> > + break;
> > + default:
> > + /* RO registers: header, capability, range sizes - discard */
> > + break;
> > + }
> > +
> > + return count;
> > +}
> > +
> > +/*
> > + * vfio_cxl_setup_dvsec_perms - Install per-device CXL DVSEC read/write hooks.
> > + *
> > + * Called once per device open after vfio_config_init() has seeded vdev->vconfig
> > + * from hardware. Registers vfio_cxl_dvsec_readfn and vfio_cxl_dvsec_writefn
> > + * as the per-device DVSEC handlers. The global dispatch functions installed
> > + * in ecap_perms[PCI_EXT_CAP_ID_DVSEC] at module init call these per-device
> > + * hooks so that pci_config_map bytes remain PCI_EXT_CAP_ID_DVSEC throughout.
>
> provide proper kdoc function header
Updated.
>
> > + *
> > + * vfio_cxl_dvsec_readfn: returns vconfig shadow for CONTROL/STATUS/CONTROL2/
> > + * LOCK; passes all other DVSEC bytes through to hardware.
> > + * vfio_cxl_dvsec_writefn: enforces per-register semantics (RW1C, forced
> > + * IO_ENABLE, reserved-bit masking) and stores results in vconfig.
> > + *
> > + * Also forces CXL.io IO_ENABLE in the CONTROL vconfig shadow so the initial
> > + * read returns 1 even before the first write.
> > + */
> > +void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev)
> > +{
> > + u16 ctrl = dvsec_virt_read16(vdev, CXL_DVSEC_CONTROL_OFFSET);
> > +
> > + /*
> > + * Register per-device DVSEC read/write handlers. The global
> > + * ecap_perms[PCI_EXT_CAP_ID_DVSEC] dispatchers will call them.
> > + *
> > + * vfio_cxl_dvsec_readfn returns vconfig shadow values for the
> > + * virtualized registers (CONTROL, STATUS, CONTROL2, LOCK) so that
> > + * reads reflect emulated state rather than raw hardware.
> > + *
> > + * vfio_cxl_dvsec_writefn enforces per-register semantics (RW1C,
> > + * forced IO_ENABLE, reserved-bit masking) and stores results in
> > + * vconfig. Because ecap_perms[DVSEC].writefn dispatches to this
> > + * handler, the pci_config_map bytes remain as PCI_EXT_CAP_ID_DVSEC
> > + * - no PCI_CAP_ID_INVALID_VIRT marking is needed or wanted.
> > + */
> > + vdev->dvsec_readfn = vfio_cxl_dvsec_readfn;
> > + vdev->dvsec_writefn = vfio_cxl_dvsec_writefn;
> > +
> > + /*
> > + * vconfig is seeded from hardware at open time. Force IO_ENABLE set
> > + * in the CONTROL shadow so the initial read returns 1 even if the
> > + * hardware reset value has it cleared. Subsequent writes are handled
> > + * by cxl_control_write() which also forces this bit.
> > + */
> > + ctrl |= CXL_CTRL_CXL_IO_ENABLE_BIT;
> > + dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, ctrl);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_cxl_setup_dvsec_perms);
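Aside for readers following the emulation semantics above: the per-register rules (reserved-bit masking, forced CXL.io IO_ENABLE, RW1C status bits) can be modelled in a few lines of user-space C. This is an illustrative sketch using the bit definitions from this patch; the helper names and exact masking in the kernel code may differ:

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as defined in vfio_cxl_priv.h in this patch. */
#define CXL_CTRL_CXL_IO_ENABLE_BIT	(1u << 1)
#define CXL_CTRL_RESERVED_MASK		((1u << 13) | (1u << 15))
#define CXL_STATUS_RW1C_BIT		(1u << 14)

/* Model of cxl_control_write(): drop reserved bits, force IO_ENABLE. */
static uint16_t control_write(uint16_t shadow, uint16_t val)
{
	(void)shadow;			/* CONTROL is fully guest-writable here */
	val &= ~CXL_CTRL_RESERVED_MASK;
	val |= CXL_CTRL_CXL_IO_ENABLE_BIT;
	return val;
}

/* Model of cxl_status_write(): writing 1 to an RW1C bit clears it,
 * writing 0 leaves it untouched. */
static uint16_t status_write(uint16_t shadow, uint16_t val)
{
	return shadow & ~(val & CXL_STATUS_RW1C_BIT);
}
```

The model makes the commit-message claim concrete: a guest can never clear IO_ENABLE, and a status bit only clears when the guest explicitly writes 1 to it.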
> > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > index 15b6c0d75d9e..e18e992800f6 100644
> > --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > @@ -26,6 +26,7 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> > struct vfio_pci_cxl_state *cxl;
> > bool cxl_mem_capable, is_cxl_type3;
> > u16 cap_word;
> > + u32 hdr1;
> >
> > /*
> > * The devm allocation for the CXL state remains for the entire time
> > @@ -47,6 +48,9 @@ static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
> > cxl->dpa_region_idx = -1;
> > cxl->comp_reg_region_idx = -1;
> >
> > + pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr1);
> > + cxl->dvsec_length = PCI_DVSEC_HEADER1_LEN(hdr1);
> > +
> > pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> > &cap_word);
> >
> > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> > index 3ef8d923a7e8..158fe4e67f98 100644
> > --- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> > +++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
> > @@ -31,6 +31,7 @@ struct vfio_pci_cxl_state {
> > u32 hdm_count;
> > int dpa_region_idx;
> > int comp_reg_region_idx;
> > + size_t dvsec_length;
> > u16 dvsec;
> > u8 comp_reg_bar;
> > bool precommitted;
> > @@ -76,9 +77,44 @@ struct vfio_pci_cxl_state {
> > * (CXL 2.0+ 8.1.3).
> > * Offsets are relative to the DVSEC capability base (cxl->dvsec).
> > */
> > -#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
> > +#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
> > +#define CXL_DVSEC_CONTROL_OFFSET 0xc
> > +#define CXL_DVSEC_STATUS_OFFSET 0xe
> > +#define CXL_DVSEC_CONTROL2_OFFSET 0x10
> > +#define CXL_DVSEC_STATUS2_OFFSET 0x12
> > +#define CXL_DVSEC_LOCK_OFFSET 0x14
> > +#define CXL_DVSEC_CAPABILITY2_OFFSET 0x16
> > +#define CXL_DVSEC_RANGE1_SIZE_HIGH_OFFSET 0x18
> > +#define CXL_DVSEC_RANGE1_SIZE_LOW_OFFSET 0x1c
> > +#define CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET 0x20
> > +#define CXL_DVSEC_RANGE1_BASE_LOW_OFFSET 0x24
> > +#define CXL_DVSEC_RANGE2_SIZE_HIGH_OFFSET 0x28
> > +#define CXL_DVSEC_RANGE2_SIZE_LOW_OFFSET 0x2c
> > +#define CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET 0x30
> > +#define CXL_DVSEC_RANGE2_BASE_LOW_OFFSET 0x34
> > +#define CXL_DVSEC_CAPABILITY3_OFFSET 0x38
> > +
> > #define CXL_DVSEC_MEM_CAPABLE BIT(2)
> >
> > +/* CXL Control / Status / Lock - bit definitions */
> > +#define CXL_CTRL_LOCK_BIT BIT(0)
>
> CXL_CTRL_CONFIG_LOCK_BIT
>
> > +#define CXL_CTRL_CXL_IO_ENABLE_BIT BIT(1)
> > +#define CXL_CTRL2_INITIATE_CXL_RESET_BIT BIT(2)
> > +#define CXL_CAP3_VOLATILE_HDM_BIT BIT(3)
> > +#define CXL_STATUS2_RW1CS_BIT BIT(3)
>
> CXL_STATUS2_VOL_HDM_PRSV_ERR_BIT
>
> > +#define CXL_CAP3_P2P_BIT BIT(4)
> > +#define CXL_CAP2_MODIFIED_COMPLETION_BIT BIT(6)
> > +#define CXL_STATUS_RW1C_BIT BIT(14)
>
> CXL_STATUS_VIRAL_STATUS_BIT
>
> > +#define CXL_CTRL_RESERVED_MASK (BIT(13) | BIT(15))
> > +#define CXL_CTRL_P2P_REV_MASK BIT(12)
> > +#define CXL_STATUS_RESERVED_MASK (GENMASK(13, 0) | BIT(15))
> > +#define CXL_CTRL2_RESERVED_MASK GENMASK(15, 6)
> > +#define CXL_CTRL2_HW_BITS_MASK (BIT(0) | BIT(1) | BIT(3))
> > +#define CXL_CTRL2_VOLATILE_HDM_REV_MASK BIT(4)
> > +#define CXL_CTRL2_MODIFIED_COMP_REV_MASK BIT(5)
> > +#define CXL_LOCK_RESERVED_MASK GENMASK(15, 1)
> > +#define CXL_BASE_LO_RESERVED_MASK GENMASK(27, 0)
>
> Move the CXL reg offset and bit defs to a common header. Also, please
> group the relevant bits together per register.
Yes, updated as per this suggestion.
>
> DJ
>
> > +
> > int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev);
> > void vfio_cxl_clean_virt_regs(struct vfio_pci_core_device *vdev);
> > void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index d3138badeaa6..22cf9ea831f9 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -121,12 +121,26 @@ static int vfio_pci_open_device(struct vfio_device *core_vdev)
> > }
> >
> > if (vdev->cxl) {
> > + /*
> > + * pci_config_map and vconfig are valid now (allocated by
> > + * vfio_config_init() inside vfio_pci_core_enable() above).
> > + */
> > + vfio_cxl_setup_dvsec_perms(vdev);
> > +
> > ret = vfio_cxl_register_cxl_region(vdev);
> > if (ret) {
> > pci_warn(pdev, "Failed to setup CXL region\n");
> > vfio_pci_core_disable(vdev);
> > return ret;
> > }
> > +
> > + ret = vfio_cxl_register_comp_regs_region(vdev);
> > + if (ret) {
> > + pci_warn(pdev, "Failed to register COMP_REGS region\n");
> > + vfio_cxl_unregister_cxl_region(vdev);
> > + vfio_pci_core_disable(vdev);
> > + return ret;
> > + }
> > }
> >
> > vfio_pci_core_finish_enable(vdev);
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> > index 79aaf270adb2..90e2c25381d6 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -1085,6 +1085,49 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
> > return 0;
> > }
> >
> > +/*
> > + * vfio_pci_dvsec_dispatch_read - per-device DVSEC read dispatcher.
> > + *
> > + * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn at module init.
> > + * Calls vdev->dvsec_readfn when a shadow-read handler has been registered
> > + * (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices), otherwise
> > + * falls through to vfio_raw_config_read for hardware pass-through.
> > + *
> > + * This indirection allows per-device DVSEC reads from vconfig shadow
> > + * without touching the global ecap_perms[] table.
> > + */
> > +static int vfio_pci_dvsec_dispatch_read(struct vfio_pci_core_device *vdev,
> > + int pos, int count,
> > + struct perm_bits *perm,
> > + int offset, __le32 *val)
> > +{
> > + if (vdev->dvsec_readfn)
> > + return vdev->dvsec_readfn(vdev, pos, count, perm, offset,
> val);
> > + return vfio_raw_config_read(vdev, pos, count, perm, offset, val);
> > +}
> > +
> > +/*
> > + * vfio_pci_dvsec_dispatch_write - per-device DVSEC write dispatcher.
> > + *
> > + * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn at module init.
> > + * Calls vdev->dvsec_writefn when a handler has been registered for this
> > + * device (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices),
> > + * otherwise falls through to vfio_raw_config_write so that non-CXL
> > + * devices with a DVSEC capability continue to pass writes to hardware.
> > + *
> > + * This indirection allows per-device DVSEC handlers to be registered
> > + * without touching the global ecap_perms[] table.
> > + */
> > +static int vfio_pci_dvsec_dispatch_write(struct vfio_pci_core_device *vdev,
> > + int pos, int count,
> > + struct perm_bits *perm,
> > + int offset, __le32 val)
> > +{
> > + if (vdev->dvsec_writefn)
> > + return vdev->dvsec_writefn(vdev, pos, count, perm, offset,
> val);
> > + return vfio_raw_config_write(vdev, pos, count, perm, offset, val);
> > +}
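The dispatch indirection above is a plain function-pointer fallback pattern. A minimal stand-alone model (illustrative only; the real handlers take the full perm_bits signature) shows the behaviour for CXL vs non-CXL devices:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: the global table entry calls the per-device hook
 * when one is registered, otherwise the raw hardware path. */
struct vdev_model {
	int (*dvsec_writefn)(struct vdev_model *v, int val);
};

static int raw_write(struct vdev_model *v, int val)
{
	(void)v;
	return val;		/* pass through unchanged ("hardware") */
}

static int cxl_write(struct vdev_model *v, int val)
{
	(void)v;
	return val & 0xff;	/* emulated path masks some bits */
}

static int dispatch_write(struct vdev_model *v, int val)
{
	if (v->dvsec_writefn)
		return v->dvsec_writefn(v, val);
	return raw_write(v, val);
}
```

A non-CXL device (NULL hook) keeps raw semantics; registering the hook at open time switches that one device to the emulated path without touching the shared table.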
> > +
> > /*
> > * Initialize the shared permission tables
> > */
> > @@ -1121,7 +1164,8 @@ int __init vfio_pci_init_perm_bits(void)
> > ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
> > ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
> > ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
> > - ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
> > + ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn = vfio_pci_dvsec_dispatch_read;
> > + ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_pci_dvsec_dispatch_write;
> >
> > if (ret)
> > vfio_pci_uninit_perm_bits();
> > diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> > index f8db9a05c033..d778107fa908 100644
> > --- a/drivers/vfio/pci/vfio_pci_priv.h
> > +++ b/drivers/vfio/pci/vfio_pci_priv.h
> > @@ -154,6 +154,7 @@ void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
> > void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
> > int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
> > void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
> > +void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev);
> >
> > #else
> >
> > @@ -180,6 +181,8 @@ vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
> > { return 0; }
> > static inline void
> > vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev) { }
> > +static inline void
> > +vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev) { }
> >
> > #endif /* CONFIG_VFIO_CXL_CORE */
> >
> > diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> > index cd8ed98a82a3..aa159d0c8da7 100644
> > --- a/include/linux/vfio_pci_core.h
> > +++ b/include/linux/vfio_pci_core.h
> > @@ -31,7 +31,7 @@ struct p2pdma_provider;
> > struct dma_buf_phys_vec;
> > struct dma_buf_attachment;
> > struct vfio_pci_cxl_state;
> > -
> > +struct perm_bits;
> >
> > struct vfio_pci_eventfd {
> > struct eventfd_ctx *ctx;
> > @@ -141,6 +141,12 @@ struct vfio_pci_core_device {
> > struct list_head ioeventfds_list;
> > struct vfio_pci_vf_token *vf_token;
> > struct vfio_pci_cxl_state *cxl;
> > + int (*dvsec_readfn)(struct vfio_pci_core_device *vdev, int pos,
> > + int count, struct perm_bits *perm,
> > + int offset, __le32 *val);
> > + int (*dvsec_writefn)(struct vfio_pci_core_device *vdev, int pos,
> > + int count, struct perm_bits *perm,
> > + int offset, __le32 val);
> > struct list_head sriov_pfs_item;
> > struct vfio_pci_core_device *sriov_pf_core_dev;
> > struct notifier_block nb;
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 16/20] vfio/pci: Expose CXL device and region info via VFIO ioctl
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (14 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 15/20] vfio/cxl: Introduce CXL DVSEC configuration space emulation mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-11 20:34 ` [PATCH 17/20] vfio/cxl: Provide opt-out for CXL feature mhonap
` (3 subsequent siblings)
19 siblings, 0 replies; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Expose CXL device capability information through the VFIO device info
ioctl and hide the CXL component BAR from direct userspace access via
the standard region info path.
Add vfio_cxl_get_info() which fills a VFIO_DEVICE_INFO_CAP_CXL
capability structure with HDM register location, DPA size, commit
flags, and the region indices of the two CXL VFIO device regions (DPA
and COMP_REGS) so userspace does not need to scan all regions.
Add vfio_cxl_get_region_info() which intercepts BAR queries for the
component register BAR and returns size=0 to hide it, directing
userspace to use VFIO_REGION_SUBTYPE_CXL_COMP_REGS instead.
Hook both helpers into vfio_pci_ioctl_get_info() and
vfio_pci_ioctl_get_region_info() in vfio_pci_core.c.
The CXL component register BAR contains the HDM decoder MMIO registers.
Userspace must use the VFIO_REGION_SUBTYPE_CXL_COMP_REGS emulated region
instead of directly mapping or reading/writing this BAR, to ensure that
all accesses go through the emulation layer for correct bit-field
enforcement.
Reject mmap(), barmap setup, and BAR r/w for the CXL component BAR
index in vfio_pci_core_mmap(), vfio_pci_core_setup_barmap(), and
vfio_pci_bar_rw() respectively.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
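Note for userspace consumers of this patch: the capability added here is found by walking the standard VFIO capability chain returned by VFIO_DEVICE_GET_INFO. A sketch of that walk over a pre-filled buffer (the header layout matches struct vfio_info_cap_header in <linux/vfio.h>; the cap ID value 6 is taken from this series; real code must also handle argsz growth):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirror of struct vfio_info_cap_header: id, version, and the offset of
 * the next capability from the start of the info buffer (0 terminates). */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;
};

#define VFIO_DEVICE_INFO_CAP_CXL 6	/* per this patch series */

static const struct cap_header *find_cap(const uint8_t *info,
					 uint32_t cap_offset, uint16_t id)
{
	while (cap_offset) {
		const struct cap_header *h =
			(const struct cap_header *)(info + cap_offset);

		if (h->id == id)
			return h;
		cap_offset = h->next;
	}
	return NULL;
}
```

Once the header is located, the VMM casts it to vfio_device_info_cap_cxl and reads dpa_region_index / comp_regs_region_index directly instead of scanning every region.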
drivers/vfio/pci/cxl/vfio_cxl_core.c | 84 ++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 16 ++++++
drivers/vfio/pci/vfio_pci_priv.h | 19 +++++++
drivers/vfio/pci/vfio_pci_rdwr.c | 8 +++
4 files changed, 127 insertions(+)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index e18e992800f6..bda11f99746f 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -18,6 +18,90 @@
MODULE_IMPORT_NS("CXL");
+u8 vfio_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+ return vdev->cxl->comp_reg_bar;
+}
+
+int vfio_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl)
+ return -ENOTTY;
+
+ if (!info)
+ return -ENOTTY;
+
+ if (info->argsz < minsz)
+ return -EINVAL;
+
+ if (info->index != cxl->comp_reg_bar)
+ return -ENOTTY;
+
+ /*
+ * Hide the component BAR for CXL. Report size 0 so userspace
+ * uses only the VFIO_REGION_SUBTYPE_CXL_COMP_REGS device region
+ * for BAR MMIO (HDM) emulation.
+ */
+ info->argsz = sizeof(*info);
+ info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
+ info->size = 0;
+ info->flags = 0;
+ info->cap_offset = 0;
+
+ return 0;
+}
+
+int vfio_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct vfio_device_info_cap_cxl cxl_cap = {0};
+
+ if (!cxl)
+ return 0;
+
+ /*
+ * Region indices are set at open time after
+ * vfio_pci_core_register_dev_region() succeeds. If either is still
+ * -1, the device is not yet fully initialised; return EAGAIN so
+ * userspace knows to retry rather than receiving 0xFFFFFFFF.
+ */
+ if (cxl->dpa_region_idx < 0 || cxl->comp_reg_region_idx < 0)
+ return -EAGAIN;
+
+ /* Fill in from CXL device structure */
+ cxl_cap.header.id = VFIO_DEVICE_INFO_CAP_CXL;
+ cxl_cap.header.version = 1;
+ cxl_cap.hdm_count = cxl->hdm_count;
+ cxl_cap.hdm_regs_offset = cxl->comp_reg_offset + cxl->hdm_reg_offset;
+ cxl_cap.hdm_regs_size = cxl->hdm_reg_size;
+ cxl_cap.hdm_regs_bar_index = cxl->comp_reg_bar;
+ cxl_cap.dpa_size = cxl->dpa_size;
+
+ if (cxl->precommitted) {
+ cxl_cap.flags |= VFIO_CXL_CAP_COMMITTED |
+ VFIO_CXL_CAP_PRECOMMITTED;
+ }
+
+ /*
+ * Populate absolute VFIO region indices so userspace can query them
+ * directly with VFIO_DEVICE_GET_REGION_INFO. Custom device regions
+ * live at VFIO_PCI_NUM_REGIONS + local_idx (see vfio_pci_core.c:999).
+ * dpa_region_idx / comp_reg_region_idx are 0-based local indices, so
+ * add VFIO_PCI_NUM_REGIONS to get the index VFIO_DEVICE_GET_REGION_INFO
+ * expects.
+ */
+ cxl_cap.dpa_region_index = VFIO_PCI_NUM_REGIONS + cxl->dpa_region_idx;
+ cxl_cap.comp_regs_region_index = VFIO_PCI_NUM_REGIONS + cxl->comp_reg_region_idx;
+
+ return vfio_info_add_capability(caps, &cxl_cap.header, sizeof(cxl_cap));
+}
+
static int vfio_cxl_create_device_state(struct vfio_pci_core_device *vdev,
u16 dvsec)
{
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 48e0274c19aa..5352e7810fed 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -989,6 +989,13 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
if (vdev->reset_works)
info.flags |= VFIO_DEVICE_FLAGS_RESET;
+ if (vdev->cxl) {
+ ret = vfio_cxl_get_info(vdev, &caps);
+ if (ret)
+ return ret;
+ info.flags |= VFIO_DEVICE_FLAGS_CXL;
+ }
+
info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
info.num_irqs = VFIO_PCI_NUM_IRQS;
@@ -1034,6 +1041,12 @@ int vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
struct pci_dev *pdev = vdev->pdev;
int i, ret;
+ if (vdev->cxl) {
+ ret = vfio_cxl_get_region_info(vdev, info, caps);
+ if (ret != -ENOTTY)
+ return ret;
+ }
+
switch (info->index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
@@ -1756,6 +1769,9 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
}
if (index >= VFIO_PCI_ROM_REGION_INDEX)
return -EINVAL;
+ /* Reject mmap of CXL component BAR; use COMP_REGS region only. */
+ if (vdev->cxl && index == vfio_cxl_get_component_reg_bar(vdev))
+ return -EINVAL;
if (!vdev->bar_mmap_supported[index])
return -EINVAL;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index d778107fa908..c1befe7d028d 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -156,6 +156,13 @@ int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
void vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev);
void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev);
+int vfio_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps);
+int vfio_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps);
+u8 vfio_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev);
+
#else
static inline void
@@ -183,6 +190,18 @@ static inline void
vfio_cxl_reinit_comp_regs(struct vfio_pci_core_device *vdev) { }
static inline void
vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev) { }
+static inline int
+vfio_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{ return -ENOTTY; }
+static inline int
+vfio_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{ return -ENOTTY; }
+static inline u8
+vfio_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{ return U8_MAX; }
#endif /* CONFIG_VFIO_CXL_CORE */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index b38627b35c35..4f1f4882265a 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -207,6 +207,10 @@ int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
if (vdev->barmap[bar])
return 0;
+ /* Do not map the CXL component BAR; use COMP_REGS region only. */
+ if (vdev->cxl && bar == vfio_cxl_get_component_reg_bar(vdev))
+ return -EINVAL;
+
ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
if (ret)
return ret;
@@ -236,6 +240,10 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
ssize_t done;
enum vfio_pci_io_width max_width = VFIO_PCI_IO_WIDTH_8;
+ /* Reject BAR r/w for CXL component BAR; use COMP_REGS region only. */
+ if (vdev->cxl && bar == vfio_cxl_get_component_reg_bar(vdev))
+ return -EINVAL;
+
if (pci_resource_start(pdev, bar))
end = pci_resource_len(pdev, bar);
else if (bar == PCI_ROM_RESOURCE && pdev->rom && pdev->romlen)
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH 17/20] vfio/cxl: Provide opt-out for CXL feature
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (15 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 16/20] vfio/pci: Expose CXL device and region info via VFIO ioctl mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-11 20:34 ` [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
` (2 subsequent siblings)
19 siblings, 0 replies; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Provide an opt-out mechanism to disable CXL support in the vfio-pci
module, both at build time and at module load time.
The build-time option CONFIG_VFIO_CXL_CORE enables or disables CXL
support in the vfio-pci module.
To disable CXL support at runtime, use the disable_cxl module
parameter. It seeds a per-device opt-out flag on the core device;
variant drivers may set the same flag in their probe before
registration for per-device control.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
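For reviewers, the opt-out plumbing in this patch reduces to two small steps: the module parameter seeds the per-device flag at probe, and detection bails out early when the flag is set. An illustrative user-space model (names mirror the patch; not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

struct vdev { bool disable_cxl; bool cxl_enabled; };

/* module_param(disable_cxl, bool, 0444) in vfio_pci.c */
static bool module_disable_cxl;

/* vfio_pci_probe(): copy the global opt-out onto the device. */
static void probe(struct vdev *v)
{
	v->disable_cxl = module_disable_cxl;
}

/* vfio_pci_cxl_detect_and_init(): honor the user opt-out first. */
static void detect_and_init(struct vdev *v, bool pcie_is_cxl)
{
	if (v->disable_cxl)
		return;
	if (!pcie_is_cxl)
		return;
	v->cxl_enabled = true;
}
```

Because the flag lives on the device rather than in the detection path, a variant driver can set v->disable_cxl itself before registration and get per-device behaviour without the module parameter.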
drivers/vfio/pci/cxl/vfio_cxl_core.c | 4 ++++
drivers/vfio/pci/vfio_pci.c | 9 +++++++++
include/linux/vfio_pci_core.h | 1 +
3 files changed, 14 insertions(+)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index bda11f99746f..8b42ac05a110 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -373,6 +373,10 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
u16 dvsec;
int ret;
+ /* Honor the user opt-out decision */
+ if (vdev->disable_cxl)
+ return;
+
if (!pcie_is_cxl(pdev))
return;
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 22cf9ea831f9..a6b0fb882b9f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
module_param(disable_denylist, bool, 0444);
MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
+#if IS_ENABLED(CONFIG_VFIO_CXL_CORE)
+static bool disable_cxl;
+module_param(disable_cxl, bool, 0444);
+MODULE_PARM_DESC(disable_cxl, "Disable CXL Type-2 extensions for all devices bound to vfio-pci. Variant drivers may instead set vdev->disable_cxl in their probe for per-device control without needing this parameter.");
+#endif
+
static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
{
switch (pdev->vendor) {
@@ -189,6 +195,9 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return PTR_ERR(vdev);
dev_set_drvdata(&pdev->dev, vdev);
+#if IS_ENABLED(CONFIG_VFIO_CXL_CORE)
+ vdev->disable_cxl = disable_cxl;
+#endif
vdev->pci_ops = &vfio_pci_dev_ops;
ret = vfio_pci_core_register_device(vdev);
if (ret)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aa159d0c8da7..48dc69df52fa 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -130,6 +130,7 @@ struct vfio_pci_core_device {
bool needs_pm_restore:1;
bool pm_intx_masked:1;
bool pm_runtime_engaged:1;
+ bool disable_cxl:1;
struct pci_saved_state *pci_saved_state;
struct pci_saved_state *pm_save;
int ioeventfds_nr;
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (16 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 17/20] vfio/cxl: Provide opt-out for CXL feature mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-13 12:13 ` Jonathan Cameron
2026-03-11 20:34 ` [PATCH 19/20] selftests/vfio: Add CXL Type-2 passthrough tests mhonap
2026-03-11 20:34 ` [PATCH 20/20] selftests/vfio: Fix VLA initialisation in vfio_pci_irq_set() mhonap
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Add a driver-api document describing the architecture, interfaces, and
operational constraints of CXL Type-2 device passthrough via vfio-pci-core.
CXL Type-2 devices (cache-coherent accelerators such as GPUs with attached
device memory) present unique passthrough requirements not covered by the
existing vfio-pci documentation:
- The host kernel retains ownership of the HDM decoder hardware through
the CXL subsystem, so the guest cannot program decoders directly.
- Two additional VFIO device regions expose the emulated HDM register
state (COMP_REGS) and the DPA memory window (DPA region) to userspace.
- DVSEC configuration space writes are intercepted and virtualized so
that the guest cannot alter host-owned CXL.io / CXL.mem enable bits.
- Device reset (FLR) is coordinated through vfio_pci_ioctl_reset(): all
DPA PTEs are zapped before the reset and restored afterward.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
2 files changed, 217 insertions(+)
create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 1833e6a0687e..7ec661846f6b 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
vfio-mediated-device
vfio
vfio-pci-device-specific-driver-acceptance
+ vfio-pci-cxl
Bus-level documentation
=======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..f2cbe2fdb036
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+VFIO PCI CXL Type-2 Device Passthrough
+====================================================
+
+Overview
+--------
+
+CXL (Compute Express Link) Type-2 devices are cache-coherent PCIe accelerators
+and GPUs that attach their own volatile memory (Device Physical Address space,
+or DPA) to the host memory fabric via the CXL protocol. Examples include
+GPU/accelerator cards that expose coherent device memory to the host.
+
+When such a device is passed through to a virtual machine using ``vfio-pci``,
+the kernel CXL subsystem must remain in control of the Host-managed Device
+Memory (HDM) decoders that map the device's DPA into the host physical address
+(HPA) space. A VMM such as QEMU cannot program HDM decoders directly; instead
+it uses a set of VFIO-specific regions and UAPI extensions described here.
+
+This support is compiled in when ``CONFIG_VFIO_CXL_CORE=y``. It can be
+disabled at module load time for all devices bound to ``vfio-pci`` with::
+
+ modprobe vfio-pci disable_cxl=1
+
+Variant drivers can disable CXL extensions for individual devices by setting
+``vdev->disable_cxl = true`` in their probe function before registration.
+
+Device Detection
+----------------
+
+CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
+device that has:
+
+1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
+2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
+3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
+4. An HDM Decoder block discoverable via the Register Locator DVSEC.
+5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
+
+On successful detection ``VFIO_DEVICE_FLAGS_CXL`` is set in
+``vfio_device_info.flags`` alongside ``VFIO_DEVICE_FLAGS_PCI``.
+
+UAPI Extensions
+---------------
+
+VFIO_DEVICE_GET_INFO Capability: VFIO_DEVICE_INFO_CAP_CXL
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When ``VFIO_DEVICE_FLAGS_CXL`` is set the device info capability chain
+contains a ``vfio_device_info_cap_cxl`` structure (cap ID 6)::
+
+ struct vfio_device_info_cap_cxl {
+ struct vfio_info_cap_header header; /* id=6, version=1 */
+ __u8 hdm_count; /* number of HDM decoders */
+ __u8 hdm_regs_bar_index; /* PCI BAR containing component registers */
+ __u16 pad;
+ __u32 flags; /* VFIO_CXL_CAP_* flags */
+ __u64 hdm_regs_size; /* size in bytes of the HDM decoder block */
+ __u64 hdm_regs_offset; /* byte offset within the BAR to HDM block */
+ __u64 dpa_size; /* total DPA size in bytes */
+ __u32 dpa_region_index; /* index of the DPA device region */
+ __u32 comp_regs_region_index; /* index of the COMP_REGS device region */
+ };
+
+Flags:
+
+``VFIO_CXL_CAP_COMMITTED`` (bit 0)
+ The HDM decoder was committed by the kernel CXL subsystem.
+
+``VFIO_CXL_CAP_PRECOMMITTED`` (bit 1)
+ The HDM decoder was pre-committed by host firmware/BIOS. The VMM does
+ not need to allocate CXL HPA space; the mapping is already live.
+
+VFIO Regions
+~~~~~~~~~~~~~
+
+A CXL Type-2 device exposes two additional device regions beyond the standard
+PCI BAR regions. Their indices are reported in ``dpa_region_index`` and
+``comp_regs_region_index`` in the capability structure.
+
+**DPA Region** (subtype ``VFIO_REGION_SUBTYPE_CXL``)
+ Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP``
+
+ Represents the device's DPA memory mapped at the kernel-assigned HPA.
+ The VMM should map this region with mmap() to expose device memory to the
+ guest. Page faults are handled lazily; the kernel inserts PFNs on first
+ access rather than at mmap() time. During FLR/reset all PTEs are
+ invalidated and the region becomes inaccessible until the reset completes.
+
+ Read and write access via the region file descriptor is also supported and
+ routes through a kernel-managed virtual address established with
+ ``ioremap_cache()``.
+
+**COMP_REGS Region** (subtype ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+ Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE``
+ (no mmap).
+
+ An emulated region exposing the HDM decoder registers (``read()``/``write()`` access only).
+ The kernel shadows the hardware HDM register state and enforces all
+ bit-field rules (reserved bits, read-only bits, commit semantics) on every
+ write. Only 32-bit aligned, 32-bit wide accesses are permitted, matching
+ the hardware requirement.
+
+ The VMM uses this region to read and write HDM decoder BASE, SIZE, and
+ CTRL registers. Setting the COMMIT bit (bit 9) in a CTRL register causes
+ the kernel to immediately set the COMMITTED bit (bit 10) in the emulated
+ shadow state, allowing the VMM to detect the transition via a
+ ``notify_change`` callback.
+
+ The component register BAR itself (``hdm_regs_bar_index``) is hidden:
+ ``VFIO_DEVICE_GET_REGION_INFO`` for that BAR index returns ``size = 0``.
+ All HDM access must go through the COMP_REGS region.
+
+Region Type Identifiers::
+
+ /* type = PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE (0x80001e98) */
+ #define VFIO_REGION_SUBTYPE_CXL 1 /* DPA memory region */
+ #define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2 /* HDM register region */
+
+DVSEC Configuration Space Emulation
+-------------------------------------
+
+When ``CONFIG_VFIO_CXL_CORE=y`` the kernel installs a CXL-aware write handler
+for the ``PCI_EXT_CAP_ID_DVSEC`` (0x23) extended capability entry in the vfio-pci
+configuration space permission table. This handler runs for every device
+opened under ``vfio-pci``; for non-CXL devices it falls through to the
+hardware write path unchanged.
+
+For CXL devices, writes to the following DVSEC registers are intercepted and
+emulated in ``vdev->vconfig`` (the per-device shadow configuration space):
+
++--------------------+--------+-------------------------------------------+
+| Register | Offset | Emulation |
++====================+========+===========================================+
+| CXL Control | 0x0c | RWL semantics; IO_Enable forced to 1; |
+| | | locked after Lock register bit 0 is set. |
++--------------------+--------+-------------------------------------------+
+| CXL Status | 0x0e | Bit 14 (Viral_Status) is RW1CS. |
++--------------------+--------+-------------------------------------------+
+| CXL Control2 | 0x10 | Bits 0, 3 forwarded to hardware; bits |
+| | | 1 and 2 trigger subsystem actions. |
++--------------------+--------+-------------------------------------------+
+| CXL Status2 | 0x12 | Bit 3 (RW1CS) forwarded to hardware when |
+| | | Capability3 bit 3 is set. |
++--------------------+--------+-------------------------------------------+
+| CXL Lock | 0x14 | RWO; once set, Control becomes read-only |
+| | | until conventional reset. |
++--------------------+--------+-------------------------------------------+
+| Range Base High/Lo | varies | Stored in vconfig; Base Low [27:0] |
+| | | reserved bits cleared. |
++--------------------+--------+-------------------------------------------+
+
+Reads of these registers return the emulated vconfig values. Read-only
+registers (Capability, Size registers, range Size High/Low) are also served
+from vconfig, which was seeded from hardware at device open time.
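As a rough illustration of the RW1CS semantics in the table (the helper name and the ``shadow`` variable are hypothetical stand-ins for the vconfig copy; only the write-1-to-clear-sticky rule itself is taken from the text):

```c
#include <stdint.h>

/* Viral_Status is bit 14 of the CXL Status register (offset 0x0e). */
#define CXL_STATUS_VIRAL (1u << 14)

/* RW1CS emulation: writing 1 clears the sticky bit, writing 0 leaves it
 * unchanged. Bits outside rw1cs_mask are untouched by this helper. */
static uint16_t emulate_rw1cs(uint16_t shadow, uint16_t written,
			      uint16_t rw1cs_mask)
{
	return shadow & ~(written & rw1cs_mask);
}
```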
+
+FLR and Reset Behaviour
+-----------------------
+
+During Function Level Reset (FLR):
+
+1. ``vfio_cxl_zap_region_locked()`` is called under the write side of
+ ``memory_lock``. It sets ``region_active = false`` and calls
+ ``unmap_mapping_range()`` to invalidate all DPA region PTEs.
+
+2. Any concurrent page fault or ``read()``/``write()`` on the DPA region
+ sees ``region_active = false`` and returns ``VM_FAULT_SIGBUS`` or ``-EIO``
+ respectively.
+
+3. After reset completes, ``vfio_cxl_reactivate_region()`` re-reads the HDM
+ decoder state from hardware into ``comp_reg_virt[]`` (it will typically
+ be all-zeros after FLR). For pre-committed decoders, ``region_active``
+ is set back to ``true`` only if the COMMITTED bit is set in the freshly
+ re-read hardware state. The VMM may re-fault into the DPA region without
+ issuing a new ``mmap()`` call. Each newly faulted page is scrubbed via
+ ``memset_io()`` before the PFN is inserted.
+
+VMM Integration Notes
+---------------------
+
+A VMM integrating CXL Type-2 passthrough should:
+
+1. Issue ``VFIO_DEVICE_GET_INFO`` and check ``VFIO_DEVICE_FLAGS_CXL``.
+2. Walk the capability chain to find ``VFIO_DEVICE_INFO_CAP_CXL`` (id = 6).
+3. Record ``dpa_region_index``, ``comp_regs_region_index``, ``dpa_size``,
+ ``hdm_count``, ``hdm_regs_offset``, and ``hdm_regs_size``.
+4. Map the DPA region (``dpa_region_index``) with mmap() to a guest physical
+ address. The region supports ``PROT_READ | PROT_WRITE``.
+5. Open the COMP_REGS region (``comp_regs_region_index``) and attach a
+ ``notify_change`` callback to detect COMMIT transitions. When bit 10
+ (COMMITTED) transitions from 0 to 1 in a CTRL register read, the VMM
+ should expose the corresponding DPA range to the guest and map the
+ relevant slice of the DPA mmap.
+6. For pre-committed devices (``VFIO_CXL_CAP_PRECOMMITTED`` set) the entire
+ DPA is already mapped and the VMM need not wait for a guest COMMIT.
+7. Program the guest CXL DVSEC registers (via VFIO config space write) to
+ reflect the guest's view. The kernel emulates all register semantics
+ including the CONFIG_LOCK one-shot latch.
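Steps 1-3 above amount to a capability-chain walk. The structures below are local mirrors of the layout documented earlier, written out for a self-contained sketch; the authoritative definitions would live in ``<linux/vfio.h>`` once the series lands. The walk operates on the buffer returned by ``VFIO_DEVICE_GET_INFO``, following ``next`` offsets until cap id 6 is found:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal local mirrors of the uapi layout documented above. */
struct vfio_info_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;		/* offset of next capability, 0 if last */
};

struct vfio_device_info_cap_cxl {
	struct vfio_info_cap_header header;	/* id = 6 */
	uint8_t  hdm_count;
	uint8_t  hdm_regs_bar_index;
	uint16_t pad;
	uint32_t flags;
	uint64_t hdm_regs_size;
	uint64_t hdm_regs_offset;
	uint64_t dpa_size;
	uint32_t dpa_region_index;
	uint32_t comp_regs_region_index;
};

#define VFIO_DEVICE_INFO_CAP_CXL 6

/* Walk the capability chain starting at cap_offset within the buffer
 * returned by VFIO_DEVICE_GET_INFO (steps 1-2 of the list above). */
static const struct vfio_device_info_cap_cxl *
find_cxl_cap(const void *buf, uint32_t cap_offset)
{
	while (cap_offset) {
		const struct vfio_info_cap_header *hdr =
			(const void *)((const char *)buf + cap_offset);
		if (hdr->id == VFIO_DEVICE_INFO_CAP_CXL)
			return (const void *)hdr;
		cap_offset = hdr->next;
	}
	return NULL;
}
```

With the capability located, the VMM records the region indices and sizes (step 3) before touching the DPA or COMP_REGS regions.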
+
+Kernel Configuration
+--------------------
+
+``CONFIG_VFIO_CXL_CORE`` (bool)
+ Enable CXL Type-2 passthrough support in ``vfio-pci-core``.
+ Depends on ``CONFIG_VFIO_PCI_CORE``, ``CONFIG_CXL_BUS``, and
+ ``CONFIG_CXL_MEM``.
+
+References
+----------
+
+* CXL Specification 3.1, §8.1.3 — DVSEC for CXL Devices
+* CXL Specification 3.1, §8.2.4.20 — CXL HDM Decoder Capability Structure
+* ``include/uapi/linux/vfio.h`` — ``VFIO_DEVICE_INFO_CAP_CXL``,
+ ``VFIO_REGION_SUBTYPE_CXL``, ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``
--
2.25.1
^ permalink raw reply related [flat|nested] 54+ messages in thread

* Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
2026-03-11 20:34 ` [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
@ 2026-03-13 12:13 ` Jonathan Cameron
2026-03-17 21:24 ` Alex Williamson
0 siblings, 1 reply; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-13 12:13 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Thu, 12 Mar 2026 02:04:38 +0530
mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> Add a driver-api document describing the architecture, interfaces, and
> operational constraints of CXL Type-2 device passthrough via vfio-pci-core.
>
> CXL Type-2 devices (cache-coherent accelerators such as GPUs with attached
> device memory) present unique passthrough requirements not covered by the
> existing vfio-pci documentation:
>
> - The host kernel retains ownership of the HDM decoder hardware through
> the CXL subsystem, so the guest cannot program decoders directly.
> - Two additional VFIO device regions expose the emulated HDM register
> state (COMP_REGS) and the DPA memory window (DPA region) to userspace.
> - DVSEC configuration space writes are intercepted and virtualized so
> that the guest cannot alter host-owned CXL.io / CXL.mem enable bits.
> - Device reset (FLR) is coordinated through vfio_pci_ioctl_reset(): all
> DPA PTEs are zapped before the reset and restored afterward.
>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
Hi Manish.
Great to see this doc.
Provides a convenient place to talk about the restrictions on this
current patch set and how we resolve them.
My particular interest is in the region sizing, as I don't see using
a locked-down BIOS setup range as a comprehensive solution.
Shall we say, there is some awareness that the CXL spec doesn't require
enough information from type 2 devices, and it wasn't necessarily
understood that VFIO-type solutions can't rely on the
"It's an accelerator so it has a custom driver, no need for standards"
approach. It is a gap I'd like to close. Given it's being discussed in
public, we can prepare a Code First proposal to either add stuff to the
spec or develop some external guidance on what a device needs to do if we
aren't going to need either a variant driver or device-specific handling
in user space.
> ---
> Documentation/driver-api/index.rst | 1 +
> Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
> 2 files changed, 217 insertions(+)
> create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
>
> diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
> index 1833e6a0687e..7ec661846f6b 100644
> --- a/Documentation/driver-api/index.rst
> +++ b/Documentation/driver-api/index.rst
>
> Bus-level documentation
> =======================
> diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
> new file mode 100644
> index 000000000000..f2cbe2fdb036
> --- /dev/null
> +++ b/Documentation/driver-api/vfio-pci-cxl.rst
> +Device Detection
> +----------------
> +
> +CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
> +device that has:
> +
> +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
FWIW to be type 2 as opposed to a type 3 non class code device (e.g. the
compressed memory devices Gregory Price and others are using) you need
Cache_capable as well. Might be worth making this all about
CXL Type-2 and non class code Type-3.
> +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
This is the bit that we need to make more general. Otherwise you'll have
to have a bios upgrade for every type 2 device (and no native hotplug).
Note native hotplug is quite likely if anyone is switch based device
pooling.
I assume that you are doing this today to get something upstream
and presume it works for the type 2 device you have on the host you
care about. I'm not sure there are 'general' solutions but maybe
there are some heuristics or sufficient conditions for establishing the
size.
Type 2 might have any of:
- Conveniently preprogrammed HDM decoders (the case you use)
- Maximum of 2 HDM decoders + the same number of Range registers.
In general the problem with range registers is they are a legacy feature
and there are only 2 of them whereas a real device may have many more
DPA ranges. In this corner case though, is it enough to give us the
necessary sizes? I think it might be but would like others familiar
with the spec to confirm. (If needed I'll take this to the consortium
for an 'official' view).
- A DOE and table access protocol. CDAT should give us enough info to
be fairly sure what is needed.
- A CXL mailbox (maybe the version in the PCI spec now) and the spec defined
commands to query what is there. Reading the intro to 8.2.10.9 Memory
Device Command Sets, it's a little unclear on whether these are valid on
non class code devices but I believe having the appropriate Mailbox
type identifier is enough to say we expect to get them.
None of this is required though, and the mailboxes are non-trivial.
So personally I think we should propose a new DVSEC that provides any
info we need for generic passthrough. Starting with what we need
to get the regions right. Until something like that is in place we
will have to store this info somewhere.
There is (maybe) an alternative of doing the region allocation on demand.
That is emulate the HDM decoders in QEMU (on top of the emulation
here) and when settings corresponding to a region setup occur,
go request one from the CXL core. The problem is we can't guarantee
it will be available at that time. So we can 'guess' what to provide
to the VM in terms of CXL fixed memory windows, but short of heuristics
(either whole of the host offer, or divide it up based on devices present
vs what is in the VM) that is going to be prone to it not being available
later.
Where do people think this should be? We are going to end up with
a device list somewhere. Could be in kernel, or in QEMU or make it an
orchestrator problem (applying the 'someone else's problem' solution).
| locked after Lock register bit 0 is set. |
> +
> +VMM Integration Notes
> +---------------------
> +
> +A VMM integrating CXL Type-2 passthrough should:
> +
> +1. Issue ``VFIO_DEVICE_GET_INFO`` and check ``VFIO_DEVICE_FLAGS_CXL``.
> +2. Walk the capability chain to find ``VFIO_DEVICE_INFO_CAP_CXL`` (id = 6).
> +3. Record ``dpa_region_index``, ``comp_regs_region_index``, ``dpa_size``,
> + ``hdm_count``, ``hdm_regs_offset``, and ``hdm_regs_size``.
> +4. Map the DPA region (``dpa_region_index``) with mmap() to a guest physical
> + address. The region supports ``PROT_READ | PROT_WRITE``.
> +5. Open the COMP_REGS region (``comp_regs_region_index``) and attach a
> + ``notify_change`` callback to detect COMMIT transitions. When bit 10
> + (COMMITTED) transitions from 0 to 1 in a CTRL register read, the VMM
> + should expose the corresponding DPA range to the guest and map the
> + relevant slice of the DPA mmap.
> +6. For pre-committed devices (``VFIO_CXL_CAP_PRECOMMITTED`` set) the entire
> + DPA is already mapped and the VMM need not wait for a guest COMMIT.
> +7. Program the guest CXL DVSEC registers (via VFIO config space write) to
> + reflect the guest's view. The kernel emulates all register semantics
> + including the CONFIG_LOCK one-shot latch.
> +
Can you share an RFC for this flow in QEMU? Ideally also a type 2 model
(there have been a few posted in the past) that would allow testing this with
emulated qemu as the host, then KVM / VFIO on top of that?
If not I can probably find some time to hack something together.
Thanks,
Jonathan
* Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
2026-03-13 12:13 ` Jonathan Cameron
@ 2026-03-17 21:24 ` Alex Williamson
2026-03-19 16:06 ` Jonathan Cameron
0 siblings, 1 reply; 54+ messages in thread
From: Alex Williamson @ 2026-03-17 21:24 UTC (permalink / raw)
To: Jonathan Cameron
Cc: alex, mhonap, aniketa, ankita, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Fri, 13 Mar 2026 12:13:41 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> On Thu, 12 Mar 2026 02:04:38 +0530
> mhonap@nvidia.com wrote:
>
> > From: Manish Honap <mhonap@nvidia.com>
> > ---
> > Documentation/driver-api/index.rst | 1 +
> > Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
> > 2 files changed, 217 insertions(+)
> > create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> >
> > diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
> > index 1833e6a0687e..7ec661846f6b 100644
> > --- a/Documentation/driver-api/index.rst
> > +++ b/Documentation/driver-api/index.rst
>
> >
> > Bus-level documentation
> > =======================
> > diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
> > new file mode 100644
> > index 000000000000..f2cbe2fdb036
> > --- /dev/null
> > +++ b/Documentation/driver-api/vfio-pci-cxl.rst
>
> > +Device Detection
> > +----------------
> > +
> > +CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
> > +device that has:
> > +
> > +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> > +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
>
> FWIW to be type 2 as opposed to a type 3 non class code device (e.g. the
> compressed memory devices Gregory Price and others are using) you need
> Cache_capable as well. Might be worth making this all about
> CXL Type-2 and non class code Type-3.
>
> > +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> > +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> > +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
>
> This is the bit that we need to make more general. Otherwise you'll have
> to have a bios upgrade for every type 2 device (and no native hotplug).
> Note native hotplug is quite likely if anyone is switch based device
> pooling.
>
> I assume that you are doing this today to get something upstream
> and presume it works for the type 2 device you have on the host you
> care about. I'm not sure there are 'general' solutions but maybe
> there are some heuristics or sufficient conditions for establishing the
> size.
>
> Type 2 might have any of:
> - Conveniently preprogrammed HDM decoders (the case you use)
> - Maximum of 2 HDM decoders + the same number of Range registers.
> In general the problem with range registers is they are a legacy feature
> and there are only 2 of them whereas a real device may have many more
> DPA ranges. In this corner case though, is it enough to give us the
> necessary sizes? I think it might be but would like others familiar
> with the spec to confirm. (If needed I'll take this to the consortium
> for an 'official' view).
> - A DOE and table access protocol. CDAT should give us enough info to
> be fairly sure what is needed.
> - A CXL mailbox (maybe the version in the PCI spec now) and the spec defined
> commands to query what is there. Reading the intro to 8.2.10.9 Memory
> Device Command Sets, it's a little unclear on whether these are valid on
> non class code devices but I believe having the appropriate Mailbox
> type identifier is enough to say we expect to get them.
>
> None of this is required though and the mailboxes are non trivial.
> So personally I think we should propose a new DVSEC that provides any
> info we need for generic passthrough. Starting with what we need
> to get the regions right. Until something like that is in place we
> will have to store this info somewhere.
>
> There is (maybe) an alternative of doing the region allocation on demand.
> That is emulate the HDM decoders in QEMU (on top of the emulation
> here) and when settings corresponding to a region setup occur,
> go request one from the CXL core. The problem is we can't guarantee
> it will be available at that time. So we can 'guess' what to provide
> to the VM in terms of CXL fixed memory windows, but short of heuristics
> (either whole of the host offer, or divide it up based on devices present
> vs what is in the VM) that is going to be prone to it not being available
> later.
>
> Where do people think this should be? We are going to end up with
> a device list somewhere. Could be in kernel, or in QEMU or make it an
> orchestrator problem (applying the 'someone else's problem' solution).
That's the typical approach. That's what we did with resizable BARs.
If we cannot guarantee allocation on demand, we need to push the policy
to the device, via something that indicates the size to use, or to the
orchestration, via something that allows the size to be committed
out-of-band. As with REBAR, we then need to be able to restrict the
guest behavior to select only the configured option.
I imagine this means for the non-pre-allocated case, we need to develop
some sysfs attributes that allows that out-of-band sizing, which would
then appear as a fixed, pre-allocated configuration to the guest.
Thanks,
Alex
* Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
2026-03-17 21:24 ` Alex Williamson
@ 2026-03-19 16:06 ` Jonathan Cameron
2026-03-23 14:36 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Jonathan Cameron @ 2026-03-19 16:06 UTC (permalink / raw)
To: Alex Williamson
Cc: mhonap, aniketa, ankita, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On Tue, 17 Mar 2026 15:24:45 -0600
Alex Williamson <alex@shazbot.org> wrote:
> On Fri, 13 Mar 2026 12:13:41 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> > On Thu, 12 Mar 2026 02:04:38 +0530
> > mhonap@nvidia.com wrote:
> >
> > > From: Manish Honap <mhonap@nvidia.com>
> > > ---
> > > Documentation/driver-api/index.rst | 1 +
> > > Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
> > > 2 files changed, 217 insertions(+)
> > > create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> > >
> > > diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
> > > index 1833e6a0687e..7ec661846f6b 100644
> > > --- a/Documentation/driver-api/index.rst
> > > +++ b/Documentation/driver-api/index.rst
> >
> > >
> > > Bus-level documentation
> > > =======================
> > > diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
> > > new file mode 100644
> > > index 000000000000..f2cbe2fdb036
> > > --- /dev/null
> > > +++ b/Documentation/driver-api/vfio-pci-cxl.rst
> >
> > > +Device Detection
> > > +----------------
> > > +
> > > +CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
> > > +device that has:
> > > +
> > > +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> > > +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
> >
> > FWIW to be type 2 as opposed to a type 3 non class code device (e.g. the
> > compressed memory devices Gregory Price and others are using) you need
> > Cache_capable as well. Might be worth making this all about
> > CXL Type-2 and non class code Type-3.
> >
> > > +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> > > +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> > > +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
> >
> > This is the bit that we need to make more general. Otherwise you'll have
> > to have a bios upgrade for every type 2 device (and no native hotplug).
> > Note native hotplug is quite likely if anyone is switch based device
> > pooling.
> >
> > I assume that you are doing this today to get something upstream
> > and presume it works for the type 2 device you have on the host you
> > care about. I'm not sure there are 'general' solutions but maybe
> > there are some heuristics or sufficient conditions for establishing the
> > size.
> >
> > Type 2 might have any of:
> > - Conveniently preprogrammed HDM decoders (the case you use)
> > - Maximum of 2 HDM decoders + the same number of Range registers.
> > In general the problem with range registers is they are a legacy feature
> > and there are only 2 of them whereas a real device may have many more
> > DPA ranges. In this corner case though, is it enough to give us the
> > necessary sizes? I think it might be but would like others familiar
> > with the spec to confirm. (If needed I'll take this to the consortium
> > for an 'official' view).
> > - A DOE and table access protocol. CDAT should give us enough info to
> > be fairly sure what is needed.
> > - A CXL mailbox (maybe the version in the PCI spec now) and the spec defined
> > commands to query what is there. Reading the intro to 8.2.10.9 Memory
> > Device Command Sets, it's a little unclear on whether these are valid on
> > non class code devices but I believe having the appropriate Mailbox
> > type identifier is enough to say we expect to get them.
> >
> > None of this is required though and the mailboxes are non trivial.
> > So personally I think we should propose a new DVSEC that provides any
> > info we need for generic passthrough. Starting with what we need
> > to get the regions right. Until something like that is in place we
> > will have to store this info somewhere.
> >
> > There is (maybe) an alternative of doing the region allocation on demand.
> > That is emulate the HDM decoders in QEMU (on top of the emulation
> > here) and when settings corresponding to a region setup occur,
> > go request one from the CXL core. The problem is we can't guarantee
> > it will be available at that time. So we can 'guess' what to provide
> > to the VM in terms of CXL fixed memory windows, but short of heuristics
> > (either whole of the host offer, or divide it up based on devices present
> > vs what is in the VM) that is going to be prone to it not being available
> > later.
> >
> > Where do people think this should be? We are going to end up with
> > a device list somewhere. Could be in kernel, or in QEMU or make it an
> > orchestrator problem (applying the 'someone else's problem' solution).
>
> That's the typical approach. That's what we did with resizable BARs.
> If we cannot guarantee allocation on demand, we need to push the policy
> to the device, via something that indicates the size to use, or to the
> orchestration, via something that allows the size to be committed
> out-of-band. As with REBAR, we then need to be able to restrict the
> guest behavior to select only the configured option.
>
> I imagine this means for the non-pre-allocated case, we need to develop
> some sysfs attributes that allows that out-of-band sizing, which would
> then appear as a fixed, pre-allocated configuration to the guest.
> Thanks,
I did some reading, as I'm only vaguely familiar with how the resizable BAR
stuff was done. That approach should be fairly straightforward to adapt here.
Stash some config in struct pci_dev before binding vfio-pci/cxl via a sysfs
interface. Given that the association with the CXL infrastructure only
happens later (unlike bar config) it would then be the job of the
vfio-pci/cxl driver to see what was requested and attempt to set up the
CXL topology to deliver it at bind time.
Manish, would you mind hacking a small PoC on top of your existing code to see
if this approach shows up any problems? I don't have anything to test against
right now, though I could probably hack some emulation together fairly fast.
I'm thinking you'll get there faster! I'm mostly focused on this cycle
stuff at the moment, and I suspect we'll be discussing this for a while
yet + it has dependencies on other series that aren't in yet.
I'm not sure the PCI folk will like us stashing random stuff in their
structures just because we haven't bound anything yet though so have no
CXL structures to use. We should probably think about how VF CXL.mem
region/sub-region assignment might work as well.
Sticking to PF (well actually just function 0) passthrough for now...
For the guest, we can constrain things so there is only one right option
though it will limit what topologies we can build. Basically each device
passed through has it's own CXL fixed memory window, it's own host bridge,
it's own root port + no switches. The sizing it sees for the CFMWS
matches what we configured in the host. We could program that topology up
and lock it down but that means VM BIOS nastiness so I'd leave it to the
native linux code to bring it up. If anyone wants to do P2P it'll get
harder to do within the spec as we will have to prevent topologies that
contain foot guns like the ability to configure interleave.
This constrained approach is what we plan for the CXL class code type 3
device emulation used for DCD so we've been exploring it already.
It's still possible to do annoying things like zero size decoders +
skip. For now we can fail HDM decoder commits if they are particularly
nonsensical and we haven't handled them yet - ultimately we'll probably
want to minimize what we refuse to handle as I'm sure 'other OS' may not
do things the same as Linux.
P2P and the fun of a single device on multiple PCI hierarchies are to be solved
later. As an FYI, for bandwidth, people will be building devices that
interleave memory addresses over multiple root ports. Dan reminded
me of that challenge last night. See bundled ports in CXL 4.0, though
this particular part related to CXL.mem is actually possible prior to
that stuff for CXL.cache. Oh and don't get me started on TSP / coco challenges.
I take the view they are Dan's problem for now ;)
Jonathan
>
> Alex
* RE: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
2026-03-19 16:06 ` Jonathan Cameron
@ 2026-03-23 14:36 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-23 14:36 UTC (permalink / raw)
To: Jonathan Cameron, Alex Williamson
Cc: Aniket Agashe, Ankit Agrawal, Vikram Sethi, Jason Gunthorpe,
Matt Ochs, Shameer Kolothum Thodi, alejandro.lucero-palau@amd.com,
dave@stgolabs.net, dave.jiang@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com, jgg@ziepe.ca,
Yishai Hadas, kevin.tian@intel.com, Neo Jia, Tarun Gupta (SW-GPU),
Zhi Wang, Krishnakant Jaju, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 19 March 2026 21:37
> To: Alex Williamson <alex@shazbot.org>
> Cc: Manish Honap <mhonap@nvidia.com>; Aniket Agashe
> <aniketa@nvidia.com>; Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi
> <vsethi@nvidia.com>; Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> alejandro.lucero-palau@amd.com; dave@stgolabs.net; dave.jiang@intel.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com;
> ira.weiny@intel.com; dan.j.williams@intel.com; jgg@ziepe.ca; Yishai
> Hadas <yishaih@nvidia.com>; kevin.tian@intel.com; Neo Jia
> <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; Zhi Wang
> <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device
> passthrough
>
>
>
> On Tue, 17 Mar 2026 15:24:45 -0600
> Alex Williamson <alex@shazbot.org> wrote:
>
> > On Fri, 13 Mar 2026 12:13:41 +0000
> > Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >
> > > On Thu, 12 Mar 2026 02:04:38 +0530
> > > mhonap@nvidia.com wrote:
> > >
> > > > From: Manish Honap <mhonap@nvidia.com>
> > > > ---
> > > > Documentation/driver-api/index.rst | 1 +
> > > > Documentation/driver-api/vfio-pci-cxl.rst | 216
> > > > ++++++++++++++++++++++
> > > > 2 files changed, 217 insertions(+) create mode 100644
> > > > Documentation/driver-api/vfio-pci-cxl.rst
> > > >
> > > > diff --git a/Documentation/driver-api/index.rst
> > > > b/Documentation/driver-api/index.rst
> > > > index 1833e6a0687e..7ec661846f6b 100644
> > > > --- a/Documentation/driver-api/index.rst
> > > > +++ b/Documentation/driver-api/index.rst
> > >
> > > >
> > > > Bus-level documentation
> > > > =======================
> > > > diff --git a/Documentation/driver-api/vfio-pci-cxl.rst
> > > > b/Documentation/driver-api/vfio-pci-cxl.rst
> > > > new file mode 100644
> > > > index 000000000000..f2cbe2fdb036
> > > > --- /dev/null
> > > > +++ b/Documentation/driver-api/vfio-pci-cxl.rst
> > >
> > > > +Device Detection
> > > > +----------------
> > > > +
> > > > +CXL Type-2 detection happens automatically when ``vfio-pci``
> > > > +registers a device that has:
> > > > +
> > > > +1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
> > > > +2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
> > >
> > > FWIW to be type 2 as opposed to a type 3 non class code device (e.g.
> > > the compressed memory devices Gregory Price and others are using)
> > > you need Cache_capable as well. Might be worth making this all
> > > about CXL Type-2 and non class code Type-3.
> > >
> > > > +3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
> > > > +4. An HDM Decoder block discoverable via the Register Locator DVSEC.
> > > > +5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
> > >
> > > This is the bit that we need to make more general. Otherwise you'll
> > > have to have a BIOS upgrade for every type 2 device (and no native
> > > hotplug). Note native hotplug is quite likely if anyone is doing
> > > switch-based device pooling.
> > >
> > > I assume that you are doing this today to get something upstream and
> > > presume it works for the type 2 device you have on the host you care
> > > about. I'm not sure there are 'general' solutions but maybe there
> > > are some heuristics or sufficient conditions for establishing the
> > > size.
> > >
> > > Type 2 might have any of:
> > > - Conveniently preprogrammed HDM decoders (the case you use)
> > > - Maximum of 2 HDM decoders + the same number of Range registers.
> > > In general the problem with range registers is they are a legacy
> > > feature and there are only 2 of them whereas a real device may have
> > > many more DPA ranges. In this corner case though, is it enough to
> > > give us the necessary sizes? I think it might be but would like
> > > others familiar with the spec to confirm. (If needed I'll take this
> > > to the consortium for an 'official' view).
> > > - A DOE and table access protocol. CDAT should give us enough info
> > > to be fairly sure what is needed.
> > > - A CXL mailbox (maybe the version in the PCI spec now) and the
> > > spec defined commands to query what is there. Reading the intro to
> > > 8.2.10.9 Memory Device Command Sets, it's a little unclear whether
> > > these are valid on non class code devices but I believe having the
> > > appropriate Mailbox type identifier is enough to say we expect to
> > > get them.
> > >
> > > None of this is required though and the mailboxes are non trivial.
> > > So personally I think we should propose a new DVSEC that provides
> > > any info we need for generic passthrough. Starting with what we
> > > need to get the regions right. Until something like that is in
> > > place we will have to store this info somewhere.
> > >
> > > There is (maybe) an alternative of doing the region allocation on
> > > demand.
> > > That is emulate the HDM decoders in QEMU (on top of the emulation
> > > here) and when settings corresponding to a region setup occur, go
> > > request one from the CXL core. The problem is we can't guarantee it
> > > will be available at that time. So we can 'guess' what to provide to
> > > the VM in terms of CXL fixed memory windows, but short of heuristics
> > > (either whole of the host offer, or divide it up based on devices
> > > present vs what is in the VM) that is going to be prone to it not
> > > being available later.
> > >
> > > Where do people think this should be? We are going to end up with a
> > > device list somewhere. Could be in kernel, or in QEMU or make it an
> > > orchestrator problem (applying the 'someone else's problem'
> > > solution).
> >
> > That's the typical approach. That's what we did with resizable BARs.
> > If we cannot guarantee allocation on demand, we need to push the
> > policy to the device, via something that indicates the size to use, or
> > to the orchestration, via something that allows the size to be
> > committed out-of-band. As with REBAR, we then need to be able to
> > restrict the guest behavior to select only the configured option.
> >
> > I imagine this means for the non-pre-allocated case, we need to
> > develop some sysfs attributes that allows that out-of-band sizing,
> > which would then appear as a fixed, pre-allocated configuration to
> > the guest.
> > Thanks,
>
> I did some reading as I'm only vaguely familiar with how the resizable
> BAR stuff was done. That approach should be fairly straightforward to
> adapt here.
> Stash some config in struct pci_dev before binding vfio-pci/cxl via a
> sysfs interface. Given that the association with the CXL infrastructure
> only happens later (unlike BAR config), it would then be the job of the
> vfio-pci/cxl driver to see what was requested and attempt to set up the
> CXL topology to deliver it at bind time.
>
> Manish, would you mind hacking up a small PoC on top of your existing
> code to see if this approach shows up any problems? I don't have
> anything to test against right now, though I could probably hack some
> emulation together fairly fast. I'm thinking you'll get there faster!
> I'm mostly focused on this cycle's stuff at the moment, and I suspect
> we'll be discussing this for a while yet + it has dependencies on other
> series that aren't in yet.
>
> I'm not sure the PCI folk will like us stashing random stuff in their
> structures just because we haven't bound anything yet and therefore have
> no CXL structures to use. We should probably think about how VF CXL.mem
> region/sub-region assignment might work as well.
>
> Sticking to PF (well actually just function 0) passthrough for now...
> For the guest, we can constrain things so there is only one right option
> though it will limit what topologies we can build. Basically each
> device passed through has its own CXL fixed memory window, its own
> host bridge, its own root port + no switches. The sizing it sees for
> the CFMWS matches what we configured in the host. We could program that
> topology up and lock it down but that means VM BIOS nastiness so I'd
> leave it to the native linux code to bring it up. If anyone wants to do
> P2P it'll get harder to do within the spec as we will have to prevent
> topologies that contain foot guns like the ability to configure
> interleave.
>
> This constrained approach is what we plan for the CXL class code type 3
> device emulation used for DCD so we've been exploring it already.
> It's still possible to do annoying things like zero-size decoders +
> skip. For now we can fail HDM decoder commits if they are particularly
> nonsensical and we haven't handled them yet - ultimately we'll probably
> want to minimize what we refuse to handle, as I'm sure 'other OSes' may
> not do things the same way as Linux.
>
> P2P and the fun of a single device on multiple PCI hierarchies has to
> be solved later. As an FYI, for bandwidth, people will be building devices
> that interleave memory addresses over multiple root ports. Dan reminded
> me of that challenge last night. See bundled ports in CXL 4.0, though
> this particular part related to CXL.mem is actually possible prior to
> that stuff for CXL.cache. Oh and don't get me started on TSP / coco
> challenges.
> I take the view they are Dan's problem for now ;)
>
> Jonathan
>
Hello Alex, Jonathan,
I will shortly provide the updated vfio-cxl v2 patches and the QEMU RFC
per the suggestions mentioned here.
>
> >
> > Alex
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 19/20] selftests/vfio: Add CXL Type-2 passthrough tests
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (17 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-11 20:34 ` [PATCH 20/20] selftests/vfio: Fix VLA initialisation in vfio_pci_irq_set() mhonap
19 siblings, 0 replies; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Add a selftest suite exercising the CXL Type-2 device passthrough
interfaces introduced by the vfio-cxl patch series.
The tests are designed to run on a host with a CXL Type-2 device bound
to vfio-pci and CONFIG_VFIO_CXL_CORE=y. They verify the kernel ABI
without requiring a running QEMU instance or a guest OS.
Test cases:
cxl_test_device_info:
Open the VFIO device. Call VFIO_DEVICE_GET_INFO and verify:
- VFIO_DEVICE_FLAGS_CXL is set in info.flags
- VFIO_DEVICE_FLAGS_PCI is also set (CXL implies PCI)
- info.num_regions > VFIO_PCI_NUM_REGIONS (CXL adds extra regions)
Walk the capability chain and locate VFIO_DEVICE_INFO_CAP_CXL.
Verify:
- hdm_count >= 1
- hdm_regs_bar_index <= 5
- hdm_regs_size >= 0x20 * hdm_count (minimum: one decoder slot)
- dpa_size > 0 (pre-committed decoder present)
- dpa_region_index and comp_regs_region_index within bounds
cxl_test_component_bar_hidden:
Query VFIO_DEVICE_GET_REGION_INFO for the BAR index reported by
hdm_regs_bar_index. Verify info.size == 0, confirming the host has
hidden the component BAR from direct userspace access.
cxl_test_comp_regs_region:
Query VFIO_DEVICE_GET_REGION_INFO for comp_regs_region_index.
Verify:
- flags has READ and WRITE set, mmap NOT set
- size == hdm_regs_size
Open the region fd and read 4 bytes at offset 0 (HDM Decoder Cap).
Verify the read succeeds and returns a non-zero value.
Attempt a misaligned read (3-byte or offset 1) - verify EINVAL.
Attempt a 4-byte write to offset 0 (RO register) - verify it
silently succeeds (writes to RO registers are discarded without
error by design).
Write a known pattern to HDM Decoder 0 BASE_LO (offset 0x10) and
read back - verify the written value (with reserved bits cleared)
is returned.
cxl_test_dpa_region:
Query VFIO_DEVICE_GET_REGION_INFO for dpa_region_index. Verify:
- flags has READ, WRITE, MMAP set
- size == dpa_size (consistent with device info)
mmap() the full DPA region (MAP_SHARED). Verify mmap() succeeds.
Write a test pattern to offset 0 of the mapping (triggers first
page fault / PFN insertion). Read back and verify the value.
munmap() the region. Verify no crash.
cxl_test_bar_mmap_rejected:
Attempt to mmap() the component BAR directly via the standard
VFIO_PCI_BAR*_REGION_INDEX path. Verify EINVAL is returned.
cxl_test_bar_read_rejected:
Attempt to read() the component BAR region fd. Verify EINVAL.
cxl_test_disable_cxl_param:
(Requires root + module reload capability)
Reload vfio-pci with disable_cxl=1. Rebind the device. Call
VFIO_DEVICE_GET_INFO and verify VFIO_DEVICE_FLAGS_CXL is NOT set
and num_regions == VFIO_PCI_NUM_REGIONS (no extra CXL regions).
Reload vfio-pci without the parameter and rebind to restore state.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
tools/testing/selftests/vfio/Makefile | 1 +
.../selftests/vfio/vfio_cxl_type2_test.c | 816 ++++++++++++++++++
2 files changed, 817 insertions(+)
create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 3c796ca99a50..2cac98302609 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -4,6 +4,7 @@ TEST_GEN_PROGS += vfio_iommufd_setup_test
TEST_GEN_PROGS += vfio_pci_device_test
TEST_GEN_PROGS += vfio_pci_device_init_perf_test
TEST_GEN_PROGS += vfio_pci_driver_test
+TEST_GEN_PROGS += vfio_cxl_type2_test
TEST_FILES += scripts/cleanup.sh
TEST_FILES += scripts/lib.sh
diff --git a/tools/testing/selftests/vfio/vfio_cxl_type2_test.c b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
new file mode 100644
index 000000000000..44df62378749
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
@@ -0,0 +1,816 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vfio_cxl_type2_test - selftests for CXL Type-2 device passthrough via vfio-pci
+ *
+ * Tests the UAPI and emulation layer introduced by CONFIG_VFIO_CXL_CORE:
+ * - VFIO_DEVICE_INFO_CAP_CXL capability detection and field validation
+ * - Component BAR hiding (size=0 response for hdm_regs_bar_index)
+ * - DPA region presence, size, and mmap
+ * - COMP_REGS region presence, size, read/write semantics
+ * - HDM decoder emulation: reserved-bit masking, COMMIT→COMMITTED transition
+ * - DVSEC configuration space emulation: Control, Status, Lock, Control2
+ *
+ * Usage:
+ * ./vfio_cxl_type2_test <BDF>
+ * or set the environment variable VFIO_SELFTESTS_BDF before running.
+ *
+ * The device must be a CXL Type-2 device (e.g. a GPU with coherent memory)
+ * with a pre-committed HDM decoder. The test is skipped automatically on
+ * non-CXL devices.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/pci_regs.h>
+#include <linux/sizes.h>
+#include <linux/vfio.h>
+
+#include <libvfio.h>
+
+#include "kselftest_harness.h"
+
+/* Userspace equivalents of kernel helpers not available in user headers */
+#ifndef BIT
+#define BIT(n) (1u << (n))
+#endif
+#ifndef GENMASK
+#define GENMASK(h, l) (((~0u) >> (31 - (h))) & ((~0u) << (l)))
+#endif
+#define VFIO_PCI_INDEX_TO_OFFSET(idx) ((uint64_t)(idx) << 40)
+
+static const char *device_bdf;
+
+/* ------------------------------------------------------------------ */
+/* CXL UAPI constants (mirrors include/uapi/linux/vfio.h) */
+/* ------------------------------------------------------------------ */
+
+#define VFIO_DEVICE_FLAGS_CXL (1 << 9)
+
+#define VFIO_DEVICE_INFO_CAP_CXL 6
+
+#define VFIO_CXL_CAP_COMMITTED (1 << 0)
+#define VFIO_CXL_CAP_PRECOMMITTED (1 << 1)
+
+#define PCI_VENDOR_ID_CXL 0x1e98
+#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
+#ifndef VFIO_REGION_SUBTYPE_CXL
+#define VFIO_REGION_SUBTYPE_CXL 1
+#endif
+#ifndef VFIO_REGION_SUBTYPE_CXL_COMP_REGS
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2
+#endif
+
+/*
+ * HDM Decoder register layout within the component register block.
+ * Offsets relative to the start of the HDM decoder capability block
+ * (i.e. relative to the start of the COMP_REGS region).
+ */
+#define HDM_CAP_OFFSET 0x00
+#define HDM_GLOBAL_CTRL_OFFSET 0x04
+#define HDM_GLOBAL_STATUS_OFFSET 0x08
+#define HDM_DECODER_FIRST_OFFSET 0x10
+#define HDM_DECODER_STRIDE 0x20
+#define HDM_DECODER_BASE_LO 0x00
+#define HDM_DECODER_BASE_HI 0x04
+#define HDM_DECODER_SIZE_LO 0x08
+#define HDM_DECODER_SIZE_HI 0x0c
+#define HDM_DECODER_CTRL 0x10
+
+#define HDM_CTRL_COMMIT BIT(9)
+#define HDM_CTRL_COMMITTED BIT(10)
+#define HDM_CTRL_RESERVED_MASK (BIT(15) | GENMASK(31, 28))
+#define HDM_BASE_LO_RESERVED_MASK GENMASK(27, 0)
+#define HDM_GLOBAL_CTRL_RESERVED_MASK GENMASK(31, 2)
+
+/*
+ * CXL DVSEC register offsets relative to the DVSEC capability base.
+ * §8.1.3 of CXL 3.1 specification.
+ */
+#define CXL_DVSEC_CONTROL_OFFSET 0x0c
+#define CXL_DVSEC_STATUS_OFFSET 0x0e
+#define CXL_DVSEC_CONTROL2_OFFSET 0x10
+#define CXL_DVSEC_LOCK_OFFSET 0x14
+
+#define CXL_CTRL_IO_ENABLE BIT(1)
+#define CXL_STATUS_RW1C_BIT BIT(14)
+#define CXL_LOCK_BIT BIT(0)
+#define CXL_LOCK_RESERVED_MASK GENMASK(15, 1)
+
+/* ------------------------------------------------------------------ */
+/* Helpers */
+/* ------------------------------------------------------------------ */
+
+/*
+ * Walk the vfio_device_info capability chain embedded in @buf.
+ * Returns a pointer to the capability with the given @id, or NULL.
+ */
+static const struct vfio_info_cap_header *
+find_device_cap(const void *buf, size_t bufsz, uint16_t id)
+{
+ const struct vfio_device_info *info = buf;
+ const struct vfio_info_cap_header *cap;
+
+ if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS) || !info->cap_offset)
+ return NULL;
+
+ cap = (const struct vfio_info_cap_header *)
+ ((const char *)buf + info->cap_offset);
+
+ while (true) {
+ if (cap->id == id)
+ return cap;
+ if (!cap->next)
+ return NULL;
+ cap = (const struct vfio_info_cap_header *)
+ ((const char *)buf + cap->next);
+ if ((const char *)cap + sizeof(*cap) > (const char *)buf + bufsz)
+ return NULL;
+ }
+}
+
+/*
+ * Read a 32-bit value from the COMP_REGS region at @offset (HDM-relative).
+ */
+static uint32_t comp_regs_read32(struct vfio_pci_device *dev,
+ uint32_t region_idx, uint64_t offset)
+{
+ uint32_t val;
+ loff_t pos = (loff_t)VFIO_PCI_INDEX_TO_OFFSET(region_idx) + offset;
+ ssize_t r;
+
+ r = pread(dev->fd, &val, sizeof(val), pos);
+ if (r != sizeof(val))
+ return ~0u;
+ return val;
+}
+
+/*
+ * Write a 32-bit value to the COMP_REGS region at @offset.
+ */
+static void comp_regs_write32(struct vfio_pci_device *dev,
+ uint32_t region_idx, uint64_t offset,
+ uint32_t val)
+{
+ loff_t pos = (loff_t)VFIO_PCI_INDEX_TO_OFFSET(region_idx) + offset;
+ ssize_t r;
+
+ /* Best-effort write; failures surface via subsequent readback asserts */
+ r = pwrite(dev->fd, &val, sizeof(val), pos);
+ (void)r;
+}
+
+/*
+ * Find the CXL DVSEC capability base in config space.
+ * Walks the extended capability list (starting at 0x100).
+ * Returns the config-space offset of the DVSEC header, or 0.
+ */
+#define PCI_DVSEC_VENDOR_ID_CXL 0x1e98
+#define PCI_DVSEC_ID_CXL_DEVICE 0x0000
+#define PCI_EXT_CAP_ID_DVSEC 0x23
+
+static uint16_t find_cxl_dvsec(struct vfio_pci_device *dev)
+{
+ uint16_t pos = PCI_CFG_SPACE_SIZE; /* 0x100 */
+ int iter = 0;
+
+ while (pos && iter++ < 64) {
+ uint32_t hdr = vfio_pci_config_readl(dev, pos);
+ uint32_t hdr1, hdr2;
+ uint16_t cap_id = hdr & 0xffff;
+ uint16_t next = (hdr >> 20) & 0xffc;
+
+ if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
+ hdr1 = vfio_pci_config_readl(dev, pos + 4);
+ hdr2 = vfio_pci_config_readl(dev, pos + 8);
+ /*
+ * PCIe DVSEC Header 1 layout (Table 9-16):
+ * Bits [15: 0] = DVSEC Vendor ID
+ * Bits [19:16] = DVSEC Revision
+ * Bits [31:20] = DVSEC Length
+ * DVSEC Header 2 layout:
+ * Bits [15: 0] = DVSEC ID
+ */
+ if ((hdr1 & 0xffff) == PCI_DVSEC_VENDOR_ID_CXL &&
+ (hdr2 & 0xffff) == PCI_DVSEC_ID_CXL_DEVICE)
+ return pos;
+ }
+ pos = next;
+ }
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* Fixture */
+/* ------------------------------------------------------------------ */
+
+FIXTURE(cxl_type2) {
+ struct iommu *iommu;
+ struct vfio_pci_device *dev;
+
+ /* Filled in during FIXTURE_SETUP from the CXL cap */
+ struct vfio_device_info_cap_cxl cxl_cap;
+ uint16_t dvsec_base;
+
+ /* DPA mmap pointer (may be NULL if test skips mmap sub-tests) */
+ void *dpa_mmap;
+ size_t dpa_mmap_size;
+};
+
+FIXTURE_SETUP(cxl_type2)
+{
+ uint8_t infobuf[512] = {};
+ struct vfio_device_info *info = (void *)infobuf;
+ const struct vfio_device_info_cap_cxl *cap;
+
+ self->iommu = iommu_init(default_iommu_mode);
+ self->dev = vfio_pci_device_init(device_bdf, self->iommu);
+
+ /* Query device info with space for capability chain */
+ info->argsz = sizeof(infobuf);
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_INFO, info));
+
+ if (!(info->flags & VFIO_DEVICE_FLAGS_CXL)) {
+ printf("Device %s is not a CXL Type-2 device — skipping\n",
+ device_bdf);
+ SKIP(return, "not a CXL Type-2 device");
+ }
+
+ cap = (const struct vfio_device_info_cap_cxl *)
+ find_device_cap(infobuf, sizeof(infobuf),
+ VFIO_DEVICE_INFO_CAP_CXL);
+ ASSERT_NE(NULL, cap);
+ memcpy(&self->cxl_cap, cap, sizeof(*cap));
+
+ self->dvsec_base = find_cxl_dvsec(self->dev);
+ self->dpa_mmap = MAP_FAILED;
+ self->dpa_mmap_size = 0;
+}
+
+FIXTURE_TEARDOWN(cxl_type2)
+{
+ if (self->dpa_mmap != MAP_FAILED && self->dpa_mmap_size)
+ munmap(self->dpa_mmap, self->dpa_mmap_size);
+ vfio_pci_device_cleanup(self->dev);
+ iommu_cleanup(self->iommu);
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: VFIO_DEVICE_GET_INFO */
+/* ------------------------------------------------------------------ */
+
+/*
+ * CXL and PCI flags must both be set; CAPS must be set since we have a cap.
+ */
+TEST_F(cxl_type2, device_flags)
+{
+ uint8_t infobuf[512] = {};
+ struct vfio_device_info *info = (void *)infobuf;
+
+ info->argsz = sizeof(infobuf);
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_INFO, info));
+
+ ASSERT_TRUE(info->flags & VFIO_DEVICE_FLAGS_CXL);
+ ASSERT_TRUE(info->flags & VFIO_DEVICE_FLAGS_PCI);
+ ASSERT_TRUE(info->flags & VFIO_DEVICE_FLAGS_CAPS);
+
+ printf("device flags: 0x%x num_regions: %u\n",
+ info->flags, info->num_regions);
+}
+
+/*
+ * The CXL capability must report sane HDM and DPA values.
+ */
+TEST_F(cxl_type2, cxl_cap_fields)
+{
+ const struct vfio_device_info_cap_cxl *c = &self->cxl_cap;
+
+ ASSERT_EQ(VFIO_DEVICE_INFO_CAP_CXL, c->header.id);
+ ASSERT_EQ(1, c->header.version);
+
+ /* Must have at least one HDM decoder */
+ ASSERT_GT(c->hdm_count, 0);
+
+ /* DPA must be non-zero */
+ ASSERT_GT(c->dpa_size, 0ULL);
+
+ /* HDM region size must be non-zero and 4-byte aligned */
+ ASSERT_GT(c->hdm_regs_size, 0ULL);
+ ASSERT_EQ(0ULL, c->hdm_regs_size % 4);
+
+ /* Region indices must not be ~0U (sentinel for "not found") */
+ ASSERT_NE(~0U, c->dpa_region_index);
+ ASSERT_NE(~0U, c->comp_regs_region_index);
+
+ /* The two regions must be distinct */
+ ASSERT_NE(c->dpa_region_index, c->comp_regs_region_index);
+
+ /* For a pre-committed device both flags must be set */
+ if (c->flags & VFIO_CXL_CAP_PRECOMMITTED)
+ ASSERT_TRUE(c->flags & VFIO_CXL_CAP_COMMITTED);
+
+ printf("hdm_count=%u dpa_size=0x%llx hdm_regs_size=0x%llx "
+ "dpa_idx=%u comp_regs_idx=%u flags=0x%x\n",
+ c->hdm_count, (unsigned long long)c->dpa_size,
+ (unsigned long long)c->hdm_regs_size,
+ c->dpa_region_index, c->comp_regs_region_index, c->flags);
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: VFIO_DEVICE_GET_REGION_INFO */
+/* ------------------------------------------------------------------ */
+
+/*
+ * The component register BAR must be hidden: size=0 and no flags.
+ */
+TEST_F(cxl_type2, component_bar_hidden)
+{
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+ reg.index = self->cxl_cap.hdm_regs_bar_index;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+ ASSERT_EQ(0ULL, reg.size);
+ ASSERT_EQ(0U, reg.flags);
+
+ printf("component BAR %u: size=%llu flags=0x%x (hidden as expected)\n",
+ self->cxl_cap.hdm_regs_bar_index,
+ (unsigned long long)reg.size, reg.flags);
+}
+
+/*
+ * DPA region must be readable, writable, and mmappable.
+ * Its size must match dpa_size from the CXL capability.
+ */
+TEST_F(cxl_type2, dpa_region_info)
+{
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+ reg.index = self->cxl_cap.dpa_region_index;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+ ASSERT_EQ(self->cxl_cap.dpa_size, reg.size);
+ ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_READ);
+ ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_WRITE);
+ ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_MMAP);
+
+ printf("DPA region: size=0x%llx offset=0x%llx flags=0x%x\n",
+ (unsigned long long)reg.size,
+ (unsigned long long)reg.offset, reg.flags);
+}
+
+/*
+ * COMP_REGS region must be readable and writable but not mmappable.
+ * Its size must match hdm_regs_size from the CXL capability.
+ */
+TEST_F(cxl_type2, comp_regs_region_info)
+{
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+ reg.index = self->cxl_cap.comp_regs_region_index;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+ ASSERT_EQ(self->cxl_cap.hdm_regs_size, reg.size);
+ ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_READ);
+ ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_WRITE);
+ ASSERT_FALSE(reg.flags & VFIO_REGION_INFO_FLAG_MMAP);
+
+ printf("COMP_REGS region: size=0x%llx offset=0x%llx flags=0x%x\n",
+ (unsigned long long)reg.size,
+ (unsigned long long)reg.offset, reg.flags);
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: DPA region mmap */
+/* ------------------------------------------------------------------ */
+
+/*
+ * mmap() the DPA region and verify the first page can be read.
+ * The region uses lazy fault insertion so the first access triggers the
+ * vfio_cxl_region_page_fault path.
+ */
+TEST_F(cxl_type2, dpa_mmap_fault)
+{
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+ size_t map_size;
+ void *ptr;
+ volatile uint8_t *p;
+ uint8_t val;
+
+ reg.index = self->cxl_cap.dpa_region_index;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+ /* Map just the first 2MB or the full region, whichever is smaller */
+ map_size = (size_t)reg.size < (size_t)(2 * SZ_1M)
+ ? (size_t)reg.size : (size_t)(2 * SZ_1M);
+
+ ptr = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, self->dev->fd, (off_t)reg.offset);
+ ASSERT_NE(MAP_FAILED, ptr);
+
+ self->dpa_mmap = ptr;
+ self->dpa_mmap_size = map_size;
+
+ /* First access — triggers the page fault handler */
+ p = (volatile uint8_t *)ptr;
+ val = *p;
+ (void)val;
+
+ printf("DPA mmap: ptr=%p size=0x%zx first byte=0x%02x\n",
+ ptr, map_size, (unsigned)val);
+
+ /* Write a pattern and read it back */
+ *p = 0xab;
+ ASSERT_EQ(0xab, *p);
+}
+
+/*
+ * mmap() beyond dpa_size must fail with EINVAL.
+ */
+TEST_F(cxl_type2, dpa_mmap_overflow)
+{
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+ void *ptr;
+
+ reg.index = self->cxl_cap.dpa_region_index;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+ ptr = mmap(NULL, (size_t)reg.size + SZ_4K, PROT_READ,
+ MAP_SHARED, self->dev->fd, (off_t)reg.offset);
+ ASSERT_EQ(MAP_FAILED, ptr);
+
+ printf("mmap beyond dpa_size correctly failed (errno=%d)\n", errno);
+}
+
+/*
+ * mmap() of the COMP_REGS region (no MMAP flag) must fail.
+ */
+TEST_F(cxl_type2, comp_regs_no_mmap)
+{
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+ void *ptr;
+
+ reg.index = self->cxl_cap.comp_regs_region_index;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+ ptr = mmap(NULL, (size_t)reg.size, PROT_READ,
+ MAP_SHARED, self->dev->fd, (off_t)reg.offset);
+ ASSERT_EQ(MAP_FAILED, ptr);
+
+ printf("mmap of COMP_REGS correctly failed (errno=%d)\n", errno);
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: COMP_REGS region (HDM decoder emulation) */
+/* ------------------------------------------------------------------ */
+
+/*
+ * Reading HDM Capability (offset 0x00) must return a non-zero value
+ * consistent with at least one decoder being present.
+ * Bits [3:0] encode the HDM decoder count.
+ */
+TEST_F(cxl_type2, hdm_cap_read)
+{
+ uint32_t cap;
+ uint32_t idx = self->cxl_cap.comp_regs_region_index;
+
+ cap = comp_regs_read32(self->dev, idx, HDM_CAP_OFFSET);
+ ASSERT_NE(~0u, cap);
+
+ /* Lower 4 bits encode the number of decoders */
+ ASSERT_GE((int)(cap & 0xf), (int)self->cxl_cap.hdm_count - 1);
+
+ printf("HDM Capability register: 0x%08x decoder_count_field=%u\n",
+ cap, cap & 0xf);
+}
+
+/*
+ * Writing the HDM Capability register (RO) must be silently discarded.
+ */
+TEST_F(cxl_type2, hdm_cap_ro)
+{
+ uint32_t idx = self->cxl_cap.comp_regs_region_index;
+ uint32_t before, after;
+
+ before = comp_regs_read32(self->dev, idx, HDM_CAP_OFFSET);
+ comp_regs_write32(self->dev, idx, HDM_CAP_OFFSET, 0xdeadbeef);
+ after = comp_regs_read32(self->dev, idx, HDM_CAP_OFFSET);
+
+ ASSERT_EQ(before, after);
+ printf("HDM Capability: write discarded (before=0x%08x after=0x%08x)\n",
+ before, after);
+}
+
+/*
+ * Writing Global Control (offset 0x04) with reserved bits set must
+ * result in those bits being cleared in the stored value.
+ * HDM Decoder Enable (bit 1) is a genuine RW bit.
+ */
+TEST_F(cxl_type2, hdm_global_ctrl_reserved_bits)
+{
+ uint32_t idx = self->cxl_cap.comp_regs_region_index;
+ uint32_t readback;
+
+ /* Write all-ones — reserved bits GENMASK(31,2) must be cleared */
+ comp_regs_write32(self->dev, idx, HDM_GLOBAL_CTRL_OFFSET, 0xffffffff);
+ readback = comp_regs_read32(self->dev, idx, HDM_GLOBAL_CTRL_OFFSET);
+
+ ASSERT_EQ(0u, readback & HDM_GLOBAL_CTRL_RESERVED_MASK);
+ printf("HDM Global Control reserved bits cleared: 0x%08x\n", readback);
+
+ /* Restore to 0 */
+ comp_regs_write32(self->dev, idx, HDM_GLOBAL_CTRL_OFFSET, 0);
+}
+
+/*
+ * Writing decoder N BASE_LO with reserved bits [27:0] set must
+ * result in those bits being cleared. [31:28] are significant.
+ */
+TEST_F(cxl_type2, hdm_decoder_base_lo_reserved)
+{
+ uint32_t idx = self->cxl_cap.comp_regs_region_index;
+ uint64_t ctrl_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_BASE_LO;
+ uint32_t readback;
+
+ /* Writing 0xffffffff: only [31:28] should survive */
+ comp_regs_write32(self->dev, idx, ctrl_off, 0xffffffff);
+ readback = comp_regs_read32(self->dev, idx, ctrl_off);
+
+ ASSERT_EQ(0u, readback & HDM_BASE_LO_RESERVED_MASK);
+ ASSERT_EQ(0xf0000000u, readback);
+
+ printf("HDM decoder 0 BASE_LO: reserved bits cleared -> 0x%08x\n",
+ readback);
+
+ /* Clean up */
+ comp_regs_write32(self->dev, idx, ctrl_off, 0);
+}
+
+/*
+ * Unaligned (sub-dword) access to the COMP_REGS region must fail with EINVAL.
+ */
+TEST_F(cxl_type2, comp_regs_unaligned_access)
+{
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+ uint8_t byte_val = 0;
+ loff_t pos;
+ ssize_t r;
+
+ reg.index = self->cxl_cap.comp_regs_region_index;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+ /* Attempt 1-byte read at offset 1 (unaligned) */
+ pos = (loff_t)reg.offset + 1;
+ r = pread(self->dev->fd, &byte_val, 1, pos);
+ ASSERT_EQ(-1, (int)r);
+ ASSERT_EQ(EINVAL, errno);
+
+ printf("Unaligned COMP_REGS access correctly failed (errno=%d)\n",
+ errno);
+}
+
+/*
+ * Writing HDM decoder N CTRL with COMMIT=1 must cause COMMITTED to
+ * be set immediately in the emulated shadow state (no hardware wait).
+ * This models the QEMU notify_change path.
+ */
+TEST_F(cxl_type2, hdm_ctrl_commit_to_committed)
+{
+ uint32_t idx = self->cxl_cap.comp_regs_region_index;
+ /*
+ * Use decoder 0; write BASE/SIZE first so the decoder is in a
+ * plausible state before committing. Use 256MB alignment
+ * (bit 28 = 1 in SIZE_LO) to satisfy the reserved-bit rule.
+ */
+ uint64_t base_lo_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_BASE_LO;
+ uint64_t base_hi_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_BASE_HI;
+ uint64_t size_lo_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_SIZE_LO;
+ uint64_t size_hi_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_SIZE_HI;
+ uint64_t ctrl_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_CTRL;
+ uint32_t ctrl_readback;
+
+ /* Skip if COMMITTED is already set (pre-committed decoder) */
+ ctrl_readback = comp_regs_read32(self->dev, idx, ctrl_off);
+ if (ctrl_readback & HDM_CTRL_COMMITTED) {
+ printf("Decoder 0 already committed (ctrl=0x%08x); "
+ "skipping COMMIT test\n", ctrl_readback);
+ SKIP(return, "decoder already committed");
+ }
+
+ /* Set BASE and SIZE to plausible 256MB-aligned values */
+ comp_regs_write32(self->dev, idx, base_lo_off, 0x10000000); /* 256MB boundary */
+ comp_regs_write32(self->dev, idx, base_hi_off, 0);
+ comp_regs_write32(self->dev, idx, size_lo_off, 0x10000000); /* 256MB */
+ comp_regs_write32(self->dev, idx, size_hi_off, 0);
+
+ /* Write COMMIT=1 */
+ comp_regs_write32(self->dev, idx, ctrl_off, HDM_CTRL_COMMIT);
+
+ /* COMMITTED must be set immediately in the shadow */
+ ctrl_readback = comp_regs_read32(self->dev, idx, ctrl_off);
+ ASSERT_TRUE(ctrl_readback & HDM_CTRL_COMMITTED);
+
+ printf("HDM decoder 0 CTRL after COMMIT=1: 0x%08x (COMMITTED set)\n",
+ ctrl_readback);
+
+ /* Writing COMMIT=0 must clear COMMITTED */
+ comp_regs_write32(self->dev, idx, ctrl_off, 0);
+ ctrl_readback = comp_regs_read32(self->dev, idx, ctrl_off);
+ ASSERT_FALSE(ctrl_readback & HDM_CTRL_COMMITTED);
+
+ printf("HDM decoder 0 CTRL after COMMIT=0: 0x%08x (COMMITTED cleared)\n",
+ ctrl_readback);
+}
+
+/*
+ * Writing CTRL reserved bits must result in them being cleared.
+ */
+TEST_F(cxl_type2, hdm_ctrl_reserved_bits)
+{
+ uint32_t idx = self->cxl_cap.comp_regs_region_index;
+ uint64_t ctrl_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_CTRL;
+ uint32_t ctrl_before, ctrl_after;
+
+ ctrl_before = comp_regs_read32(self->dev, idx, ctrl_off);
+
+ /*
+ * Write all-ones; reserved bits (BIT(15) and GENMASK(31,28)) must
+ * be cleared in the readback. Skip if COMMIT_LOCK is already set.
+ */
+ if (ctrl_before & BIT(8)) {
+ printf("Decoder 0 COMMIT_LOCK set; skipping reserved-bit test\n");
+ SKIP(return, "COMMIT_LOCK set");
+ }
+
+ comp_regs_write32(self->dev, idx, ctrl_off, 0xffffffff);
+ ctrl_after = comp_regs_read32(self->dev, idx, ctrl_off);
+
+ ASSERT_EQ(0u, ctrl_after & HDM_CTRL_RESERVED_MASK);
+
+ printf("HDM CTRL reserved bits cleared: before=0x%08x after=0x%08x\n",
+ ctrl_before, ctrl_after);
+
+ /* Restore */
+ comp_regs_write32(self->dev, idx, ctrl_off, ctrl_before);
+}
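[Editor's note: the reserved-bit rule checked above amounts to masking the written value before it reaches the shadow register. A minimal sketch, assuming the reserved mask stated in the test's comment (BIT(15) and GENMASK(31,28)); `hdm_ctrl_mask_reserved` is a hypothetical name.]

```c
#include <assert.h>
#include <stdint.h>

/* Reserved bits per the test's comment: bit 15 and bits 31:28 */
#define HDM_CTRL_RESERVED_MASK ((1u << 15) | (0xFu << 28))

/* Hypothetical write filter: reserved bits are dropped on write */
static uint32_t hdm_ctrl_mask_reserved(uint32_t val)
{
	return val & ~HDM_CTRL_RESERVED_MASK;
}
```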
+
+/* ------------------------------------------------------------------ */
+/* Tests: DVSEC configuration space emulation */
+/* ------------------------------------------------------------------ */
+
+/*
+ * CXL Control (DVSEC offset 0x0c): IO_Enable (bit 1) must always read 1.
+ */
+TEST_F(cxl_type2, dvsec_control_io_enable_ro)
+{
+ uint16_t dvsec = self->dvsec_base;
+ uint16_t ctrl;
+
+ if (!dvsec)
+ SKIP(return, "CXL DVSEC not found in config space");
+
+ /* Read current value */
+ ctrl = vfio_pci_config_readw(self->dev, dvsec + CXL_DVSEC_CONTROL_OFFSET);
+ ASSERT_TRUE(ctrl & CXL_CTRL_IO_ENABLE);
+
+ /* Write with IO_Enable cleared — it must stay set */
+ vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_CONTROL_OFFSET,
+ ctrl & ~CXL_CTRL_IO_ENABLE);
+ ctrl = vfio_pci_config_readw(self->dev, dvsec + CXL_DVSEC_CONTROL_OFFSET);
+ ASSERT_TRUE(ctrl & CXL_CTRL_IO_ENABLE);
+
+ printf("DVSEC CXL Control: IO_Enable always 1 (ctrl=0x%04x)\n", ctrl);
+}
+
+/*
+ * CXL Status (DVSEC offset 0x0e): Bit 14 (Viral_Status) is RW1CS —
+ * writing 1 clears it; writing 0 leaves it unchanged.
+ */
+TEST_F(cxl_type2, dvsec_status_rw1cs)
+{
+ uint16_t dvsec = self->dvsec_base;
+ uint16_t status_before, status_after;
+
+ if (!dvsec)
+ SKIP(return, "CXL DVSEC not found in config space");
+
+ status_before = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_STATUS_OFFSET);
+ printf("DVSEC CXL Status before: 0x%04x\n", status_before);
+
+ /* Write 0 to RW1C bit — value must not change */
+ vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_STATUS_OFFSET,
+ status_before & ~CXL_STATUS_RW1C_BIT);
+ status_after = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_STATUS_OFFSET);
+ ASSERT_EQ(status_before & CXL_STATUS_RW1C_BIT,
+ status_after & CXL_STATUS_RW1C_BIT);
+
+ /* Write 1 to RW1C bit — bit must be cleared */
+ vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_STATUS_OFFSET,
+ CXL_STATUS_RW1C_BIT);
+ status_after = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_STATUS_OFFSET);
+ ASSERT_FALSE(status_after & CXL_STATUS_RW1C_BIT);
+
+ printf("DVSEC CXL Status RW1C cleared: 0x%04x -> 0x%04x\n",
+ status_before, status_after);
+}
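[Editor's note: the RW1CS semantics tested above (write-1-to-clear, write-0 leaves the bit alone) reduce to a single masking expression. This sketch is illustrative only; `rw1c_write` is a hypothetical helper, not part of the series.]

```c
#include <assert.h>
#include <stdint.h>

#define CXL_STATUS_VIRAL (1u << 14)  /* RW1CS bit exercised by the test */

/*
 * Hypothetical RW1C merge: bits set in the written value clear the
 * corresponding status bits; writing 0 leaves them unchanged.
 */
static uint16_t rw1c_write(uint16_t status, uint16_t val, uint16_t rw1c_mask)
{
	return status & ~(val & rw1c_mask);
}
```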
+
+/*
+ * CXL Lock (DVSEC offset 0x14):
+ * - Reserved bits GENMASK(15,1) must be cleared.
+ * - Once locked, CXL Control writes must be discarded.
+ */
+TEST_F(cxl_type2, dvsec_lock_semantics)
+{
+ uint16_t dvsec = self->dvsec_base;
+ uint16_t lock_val, ctrl_before, ctrl_after;
+
+ if (!dvsec)
+ SKIP(return, "CXL DVSEC not found in config space");
+
+ lock_val = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_LOCK_OFFSET);
+ if (lock_val & CXL_LOCK_BIT) {
+ printf("CXL Lock already set; skipping lock semantics test\n");
+ SKIP(return, "CONFIG_LOCK already set");
+ }
+
+ /* Write reserved bits — they must be cleared on readback */
+ vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_LOCK_OFFSET,
+ CXL_LOCK_RESERVED_MASK);
+ lock_val = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_LOCK_OFFSET);
+ ASSERT_EQ(0u, lock_val & CXL_LOCK_RESERVED_MASK);
+ ASSERT_FALSE(lock_val & CXL_LOCK_BIT);
+
+ printf("Lock reserved bits cleared correctly\n");
+
+ /*
+ * Save the current Control value, then set Lock.
+ * Any subsequent write to Control must be discarded.
+ */
+ ctrl_before = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_CONTROL_OFFSET);
+
+ vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_LOCK_OFFSET,
+ CXL_LOCK_BIT);
+ lock_val = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_LOCK_OFFSET);
+ ASSERT_TRUE(lock_val & CXL_LOCK_BIT);
+ printf("CXL Lock set (lock=0x%04x)\n", lock_val);
+
+ /* Attempt to modify Control after locking — must be discarded */
+ vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_CONTROL_OFFSET,
+ ctrl_before ^ 0x0001); /* flip Cache_Enable */
+ ctrl_after = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_CONTROL_OFFSET);
+ ASSERT_EQ(ctrl_before, ctrl_after);
+
+ printf("CXL Control write after lock discarded: "
+ "before=0x%04x after=0x%04x\n", ctrl_before, ctrl_after);
+}
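[Editor's note: the lock semantics verified above — once CONFIG_LOCK is set, Control writes are silently discarded — can be sketched as a small state machine. `struct dvsec_emul` and `dvsec_write_control` are hypothetical names for illustration, not the emulation in the series.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical emulation state for the CXL DVSEC Control register */
struct dvsec_emul {
	uint16_t control;
	bool locked;
};

/* Writes after CONFIG_LOCK is set are discarded, per the test above */
static void dvsec_write_control(struct dvsec_emul *d, uint16_t val)
{
	if (d->locked)
		return;
	d->control = val;
}
```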
+
+/*
+ * CXL Control reserved bits (13, 15) must be cleared on write.
+ */
+TEST_F(cxl_type2, dvsec_control_reserved_bits)
+{
+ uint16_t dvsec = self->dvsec_base;
+ uint16_t lock_val, ctrl_after;
+
+ if (!dvsec)
+ SKIP(return, "CXL DVSEC not found in config space");
+
+ lock_val = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_LOCK_OFFSET);
+ if (lock_val & CXL_LOCK_BIT)
+ SKIP(return, "CONFIG_LOCK already set; cannot test Control");
+
+ vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_CONTROL_OFFSET,
+ 0xffff);
+ ctrl_after = vfio_pci_config_readw(self->dev,
+ dvsec + CXL_DVSEC_CONTROL_OFFSET);
+
+ /* Bits 13 and 15 must be cleared */
+ ASSERT_FALSE(ctrl_after & BIT(13));
+ ASSERT_FALSE(ctrl_after & BIT(15));
+
+ printf("DVSEC Control reserved bits cleared: 0x%04x\n", ctrl_after);
+}
+
+/* ------------------------------------------------------------------ */
+/* main */
+/* ------------------------------------------------------------------ */
+
+int main(int argc, char *argv[])
+{
+ device_bdf = vfio_selftests_get_bdf(&argc, argv);
+ return test_harness_run(argc, argv);
+}
--
2.25.1
* [PATCH 20/20] selftests/vfio: Fix VLA initialisation in vfio_pci_irq_set()
* [PATCH 20/20] selftests/vfio: Fix VLA initialisation in vfio_pci_irq_set()
2026-03-11 20:34 [PATCH 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (18 preceding siblings ...)
2026-03-11 20:34 ` [PATCH 19/20] selftests/vfio: Add CXL Type-2 passthrough tests mhonap
@ 2026-03-11 20:34 ` mhonap
2026-03-13 22:23 ` Dave Jiang
19 siblings, 1 reply; 54+ messages in thread
From: mhonap @ 2026-03-11 20:34 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
C does not permit initialiser expressions on variable-length arrays.
vfio_pci_irq_set() declared
u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count] = {};
where count is a function parameter, making buf a VLA. GCC rejects
this with "variable-sized object may not be initialized".
Replace the initialiser with an explicit memset() immediately after
the declaration.
Fixes: 19faf6fd969c2 ("vfio: selftests: Add a helper library for VFIO selftests")
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
tools/testing/selftests/vfio/lib/vfio_pci_device.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
index fac4c0ecadef..3258e814f450 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
@@ -26,8 +26,10 @@
static void vfio_pci_irq_set(struct vfio_pci_device *device,
u32 index, u32 vector, u32 count, int *fds)
{
- u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count] = {};
+ u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count];
struct vfio_irq_set *irq = (void *)&buf;
+
+ memset(buf, 0, sizeof(buf));
int *irq_fds = (void *)&irq->data;
irq->argsz = sizeof(buf);
--
2.25.1
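[Editor's note: the constraint the commit message describes can be reproduced in isolation. The sketch below shows the accepted form — a VLA with an explicit memset() instead of the rejected `= {}` initialiser; `zeroed_vla_sum` is a made-up function for illustration.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * A VLA cannot carry an initialiser ("variable-sized object may not
 * be initialized"); zero it with memset() after the declaration.
 */
static size_t zeroed_vla_sum(unsigned int count)
{
	uint8_t buf[16 + sizeof(int) * count];  /* VLA: no "= {}" allowed */
	size_t i, sum = 0;

	memset(buf, 0, sizeof(buf));
	for (i = 0; i < sizeof(buf); i++)
		sum += buf[i];
	return sum;  /* all bytes are zero after memset() */
}
```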
* Re: [PATCH 20/20] selftests/vfio: Fix VLA initialisation in vfio_pci_irq_set()
2026-03-11 20:34 ` [PATCH 20/20] selftests/vfio: Fix VLA initialisation in vfio_pci_irq_set() mhonap
@ 2026-03-13 22:23 ` Dave Jiang
2026-03-18 18:07 ` Manish Honap
0 siblings, 1 reply; 54+ messages in thread
From: Dave Jiang @ 2026-03-13 22:23 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, targupta, zhiw, kjaju, linux-kernel, linux-cxl, kvm
On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> C does not permit initialiser expressions on variable-length arrays.
> vfio_pci_irq_set() declared
>
> u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count] = {};
>
> where count is a function parameter, making buf a VLA. GCC rejects
> this with "variable-sized object may not be initialized".
>
> Replace the initialiser with an explicit memset() immediately after
> the declaration.
>
> Fixes: 19faf6fd969c2 ("vfio: selftests: Add a helper library for VFIO selftests")
Should this fix be split out from the series and sent ahead? Does not seem to be tied to the current implementation.
DJ
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> tools/testing/selftests/vfio/lib/vfio_pci_device.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> index fac4c0ecadef..3258e814f450 100644
> --- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> +++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> @@ -26,8 +26,10 @@
> static void vfio_pci_irq_set(struct vfio_pci_device *device,
> u32 index, u32 vector, u32 count, int *fds)
> {
> - u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count] = {};
> + u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count];
> struct vfio_irq_set *irq = (void *)&buf;
> +
> + memset(buf, 0, sizeof(buf));
> int *irq_fds = (void *)&irq->data;
>
> irq->argsz = sizeof(buf);
* RE: [PATCH 20/20] selftests/vfio: Fix VLA initialisation in vfio_pci_irq_set()
2026-03-13 22:23 ` Dave Jiang
@ 2026-03-18 18:07 ` Manish Honap
0 siblings, 0 replies; 54+ messages in thread
From: Manish Honap @ 2026-03-18 18:07 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum Thodi,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Tarun Gupta (SW-GPU), Zhi Wang, Krishnakant Jaju,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 14 March 2026 03:54
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe <aniketa@nvidia.com>;
> Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer Kolothum
> Thodi <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> dan.j.williams@intel.com; jgg@ziepe.ca; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>; linux-
> kernel@vger.kernel.org; linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 20/20] selftests/vfio: Fix VLA initialisation in
> vfio_pci_irq_set()
>
> External email: Use caution opening links or attachments
>
>
> On 3/11/26 1:34 PM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > C does not permit initialiser expressions on variable-length arrays.
> > vfio_pci_irq_set() declared
> >
> > u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count] = {};
> >
> > where count is a function parameter, making buf a VLA. GCC rejects
> > this with "variable-sized object may not be initialized".
> >
> > Replace the initialiser with an explicit memset() immediately after
> > the declaration.
> >
> > Fixes: 19faf6fd969c2 ("vfio: selftests: Add a helper library for VFIO
> > selftests")
>
> Should this fix be split out from the series and sent ahead? Does not seem
> to be tied to the current implementation.
Yes, I have sent this out separately for review to David Matlack and linux-kselftest.
>
> DJ
>
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> > ---
> > tools/testing/selftests/vfio/lib/vfio_pci_device.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> > b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> > index fac4c0ecadef..3258e814f450 100644
> > --- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> > +++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> > @@ -26,8 +26,10 @@
> > static void vfio_pci_irq_set(struct vfio_pci_device *device,
> > u32 index, u32 vector, u32 count, int *fds)
> > {
> > - u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count] = {};
> > + u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count];
> > struct vfio_irq_set *irq = (void *)&buf;
> > +
> > + memset(buf, 0, sizeof(buf));
> > int *irq_fds = (void *)&irq->data;
> >
> > irq->argsz = sizeof(buf);