* [PATCH v14 0/5] Enable Remote GPIO over RPMSG on i.MX Platform
From: Shenwei Wang @ 2026-06-25 15:54 UTC
To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
Mathieu Poirier, Frank Li, Sascha Hauer
Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
Arnaud POULIQUEN, b-padhi, Andrew Lunn
From: Shenwei Wang <shenwei.wang@nxp.com>
Support GPIO controllers that reside on a remote processor and are
accessed over the RPMSG bus on i.MX platforms.
Changes in v14:
- Update gpio-rpmsg.rst per Mathieu’s feedback.
- Align the rpmsg-gpio driver with the revised gpio-rpmsg.rst.
- Modify rpmsg-core to enable prefix-based matching of RPMSG device IDs.
Changes in v13:
- drop the support for legacy NXP firmware.
- remove the fixed_up hooks from the rpmsg gpio driver.
- code cleanup.
Changes in v12:
- Fixed the "underline" warning reported by Randy.
Changes in v11:
- Expand RPMSG for the first time per Shuah's review comment.
Changes in v10:
- Update gpio-rpmsg.rst according to Daniel Baluta's review comments.
- Add a kernel CONFIG for fixed up handlers and only enable it on
i.MX products.
- Fixed bugs reported by kernel test robot.
Changes in v9:
- Reuse the gpio-virtio design for command and IRQ type definitions.
- Remove msg_id, version, and vendor fields from the generic protocol.
- Add fixed-up handlers to support legacy firmware.
Changes in v8:
- Add "depends on REMOTEPROC" in Kconfig to fix the build error reported
by the kernel test robot.
- Move the .rst patch before the .yaml patch.
- Handle the "ngpios" DT property based on Andrew's feedback.
Changes in v7:
- Reworked the driver to use the rpmsg_driver framework instead of
platform_driver, based on feedback from Bjorn and Arnaud.
- Updated gpio-rpmsg.yaml and imx_rproc.yaml according to comments from
Rob and Arnaud.
- Further refinements to gpio-rpmsg.yaml per Arnaud's feedback.
Changes in v6:
- make the driver more generic with the actions below:
rename the driver file to gpio-rpmsg.c
remove the imx related info in the function and variable names
rename the imx_rpmsg.h to rpdev_info.h
create a gpio-rpmsg.yaml and refer it in imx_rproc.yaml
- update the gpio-rpmsg.rst according to the feedback from Andrew and
move the source file to driver-api/gpio
- fix the bug reported by Zhongqiu Han
- remove the I2C related info
Changes in v5:
- move the gpio-rpmsg.rst from admin-guide to staging directory after
discussion with Randy Dunlap.
- add include files with some code improvements per Bartosz's comments.
Changes in v4:
- add documentation describing the transport protocol per Andrew's
comments.
- add a new handler to get the gpio direction.
Changes in v3:
- fix various formatting issues and return value checks per Peng's review
comments.
- add the logic to also populate the subnodes which are not in the
device map per Arnaud's request. (in imx_rproc.c)
- update the yaml per Frank's review comments.
Changes in v2:
- re-implemented the gpio driver per Linus Walleij's feedback by using
GPIOLIB_IRQCHIP helper library.
- fix various formatting issues per Mathieu's and Peng's review comments.
- update the yaml doc per Rob's feedback
Shenwei Wang (5):
docs: driver-api: gpio: rpmsg gpio driver over rpmsg bus
dt-bindings: remoteproc: imx_rproc: Add "rpmsg" subnode support
rpmsg: core: match rpmsg device IDs by prefix
gpio: rpmsg: add generic rpmsg GPIO driver
arm64: dts: imx8ulp: Add rpmsg node under imx_rproc
.../devicetree/bindings/gpio/gpio-rpmsg.yaml | 55 ++
.../bindings/remoteproc/fsl,imx-rproc.yaml | 53 ++
Documentation/driver-api/gpio/gpio-rpmsg.rst | 271 +++++++++
Documentation/driver-api/gpio/index.rst | 1 +
arch/arm64/boot/dts/freescale/imx8ulp.dtsi | 25 +
drivers/gpio/Kconfig | 17 +
drivers/gpio/Makefile | 1 +
drivers/gpio/gpio-rpmsg.c | 568 ++++++++++++++++++
drivers/rpmsg/rpmsg_core.c | 4 +-
9 files changed, 994 insertions(+), 1 deletion(-)
create mode 100644 Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml
create mode 100644 Documentation/driver-api/gpio/gpio-rpmsg.rst
create mode 100644 drivers/gpio/gpio-rpmsg.c
--
2.43.0
* [PATCH v14 1/5] docs: driver-api: gpio: rpmsg gpio driver over rpmsg bus
From: Shenwei Wang @ 2026-06-25 15:54 UTC
To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
Mathieu Poirier, Frank Li, Sascha Hauer
Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>
From: Shenwei Wang <shenwei.wang@nxp.com>
Describe the GPIO RPMSG transport protocol used over the rpmsg bus
between Linux and the remote system.
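
As an illustration only, the 8-byte request described in gpio-rpmsg.rst
could be packed as in the sketch below. The field widths (2-byte cmd,
2-byte line, 4-byte value) follow the message layout in the document and
the virtio_gpio_request structure used by the driver. The little-endian
byte order and the names gpio_rpmsg_pack_request() and GPIO_RPMSG_CMD_*
are assumptions made for this example and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command codes as listed in gpio-rpmsg.rst (enum names are hypothetical). */
enum {
    GPIO_RPMSG_CMD_GET_DIRECTION = 2,
    GPIO_RPMSG_CMD_SET_DIRECTION = 3,
    GPIO_RPMSG_CMD_GET_VALUE     = 4,
    GPIO_RPMSG_CMD_SET_VALUE     = 5,
    GPIO_RPMSG_CMD_SET_IRQ_TYPE  = 6,
    GPIO_RPMSG_CMD_SET_WAKEUP    = 16,
};

#define GPIO_RPMSG_REQ_SIZE 8

/*
 * Pack a request: cmd in bytes 0-1, line in bytes 2-3, value in bytes 4-7.
 * Little-endian order is an assumption borrowed from virtio-gpio.
 */
static void gpio_rpmsg_pack_request(uint8_t buf[GPIO_RPMSG_REQ_SIZE],
                                    uint16_t cmd, uint16_t line,
                                    uint32_t value)
{
    assert(buf != NULL);

    buf[0] = (uint8_t)(cmd & 0xff);
    buf[1] = (uint8_t)(cmd >> 8);
    buf[2] = (uint8_t)(line & 0xff);
    buf[3] = (uint8_t)(line >> 8);
    buf[4] = (uint8_t)(value & 0xff);
    buf[5] = (uint8_t)((value >> 8) & 0xff);
    buf[6] = (uint8_t)((value >> 16) & 0xff);
    buf[7] = (uint8_t)(value >> 24);
}
```

Under these assumptions, a SET_VALUE request that drives line 7 high
would be the bytes 05 00 07 00 01 00 00 00.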
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
Documentation/driver-api/gpio/gpio-rpmsg.rst | 271 +++++++++++++++++++
Documentation/driver-api/gpio/index.rst | 1 +
2 files changed, 272 insertions(+)
create mode 100644 Documentation/driver-api/gpio/gpio-rpmsg.rst
diff --git a/Documentation/driver-api/gpio/gpio-rpmsg.rst b/Documentation/driver-api/gpio/gpio-rpmsg.rst
new file mode 100644
index 000000000000..7d351ff0adb0
--- /dev/null
+++ b/Documentation/driver-api/gpio/gpio-rpmsg.rst
@@ -0,0 +1,271 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+GPIO RPMSG (Remote Processor Messaging) Protocol
+================================================
+
+The GPIO RPMSG transport protocol is used for communication and interaction
+with GPIO controllers on remote processors via the RPMSG bus.
+
+Message Format
+--------------
+
+The RPMSG message consists of an 8-byte packet with the following layout:
+
+.. code-block:: none
+
+ +------+------+------+------+------+------+------+------+
+ | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+ | cmd | line | value |
+ +------+------+------+------+------+------+------+------+
+
+- **cmd**: Command code, used for GPIO_RPMSG_SEND messages.
+
+- **line**: The GPIO line (pin) index of the port.
+
+- **value**: See details in the command description below.
+
+
+GPIO Commands
+-------------
+
+Commands are specified in the **cmd** field.
+
+The SEND message is always sent from Linux to the remote firmware. Each
+SEND corresponds to a single REPLY message. The GPIO driver should
+serialize messages and determine whether a REPLY message is required. If a
+REPLY message is expected but not received within the specified timeout
+period (currently 1 second in the Linux driver), the driver should return
+-ETIMEDOUT.
+
+GET_DIRECTION (Cmd=2)
+~~~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+ +------+------+------+------+------+------+------+------+
+ | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+ | 2 | line | 0 |
+ +------+------+------+------+------+------+------+------+
+
+**Reply:**
+
+.. code-block:: none
+
+ +------+--------+--------+
+ | 0x00 | 0x01 | 0x02 |
+ | 1 | status | value |
+ +------+--------+--------+
+
+- **status**:
+
+ - 0: Ok
+ - 1: Error
+
+- **value**: Direction.
+
+ - 0: None
+ - 1: Output
+ - 2: Input
+
+
+SET_DIRECTION (Cmd=3)
+~~~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+ +------+------+------+------+------+------+------+------+
+ | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+ | 3 | line | value |
+ +------+------+------+------+------+------+------+------+
+
+- **value**: Direction.
+
+ - 0: None
+ - 1: Output
+ - 2: Input
+
+**Reply:**
+
+.. code-block:: none
+
+ +------+--------+--------+
+ | 0x00 | 0x01 | 0x02 |
+ | 1 | status | 0 |
+ +------+--------+--------+
+
+- **status**:
+
+ - 0: Ok
+ - 1: Error
+
+
+GET_VALUE (Cmd=4)
+~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+ +------+------+------+------+------+------+------+------+
+ | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+ | 4 | line | 0 |
+ +------+------+------+------+------+------+------+------+
+
+**Reply:**
+
+.. code-block:: none
+
+ +------+--------+--------+
+ | 0x00 | 0x01 | 0x02 |
+ | 1 | status | value |
+ +------+--------+--------+
+
+- **status**:
+
+ - 0: Ok
+ - 1: Error
+
+- **value**: Level.
+
+ - 0: Low
+ - 1: High
+
+
+SET_VALUE (Cmd=5)
+~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+ +------+------+------+------+------+------+------+------+
+ | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+ | 5 | line | value |
+ +------+------+------+------+------+------+------+------+
+
+- **value**: Output level.
+
+ - 0: Low
+ - 1: High
+
+**Reply:**
+
+.. code-block:: none
+
+ +------+--------+--------+
+ | 0x00 | 0x01 | 0x02 |
+ | 1 | status | 0 |
+ +------+--------+--------+
+
+- **status**:
+
+ - 0: Ok
+ - 1: Error
+
+
+SET_IRQ_TYPE (Cmd=6)
+~~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+ +------+------+------+------+------+------+------+------+
+ | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+ | 6 | line | value |
+ +------+------+------+------+------+------+------+------+
+
+- **value**: IRQ types.
+
+ - 0: Interrupt disabled
+ - 1: Rising edge trigger
+ - 2: Falling edge trigger
+ - 3: Both edge trigger
+ - 4: High level trigger
+ - 8: Low level trigger
+
+**Reply:**
+
+.. code-block:: none
+
+ +------+--------+--------+
+ | 0x00 | 0x01 | 0x02 |
+ | 1 | status | 0 |
+ +------+--------+--------+
+
+- **status**:
+
+ - 0: Ok
+ - 1: Error
+
+SET_WAKEUP (Cmd=16)
+~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+ +------+------+------+------+------+------+------+------+
+ | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+ | 16 | line | value |
+ +------+------+------+------+------+------+------+------+
+
+- **value**: Wakeup enable.
+
+ The remote system should always aim to stay in a power-efficient state by
+ shutting down or clock-gating the GPIO blocks that aren't in use. Since
+ the remoteproc driver is responsible for managing the power states of the
+ remote firmware, the GPIO driver does not need to know the firmware's
+ running states.
+
+ When the wakeup bit is set, the remote firmware should configure the line
+ as a wakeup source. The firmware should send the notification message to
+ Linux after it is woken from the GPIO line.
+
+ - 0: Disable wakeup from GPIO
+ - 1: Enable wakeup from GPIO
+
+**Reply:**
+
+.. code-block:: none
+
+ +------+--------+--------+
+ | 0x00 | 0x01 | 0x02 |
+ | 1 | status | 0 |
+ +------+--------+--------+
+
+- **status**:
+
+ - 0: Ok
+ - 1: Error
+
+Notification Message
+--------------------
+
+Notifications are sent by the remote core and have
+**Type=2 (GPIO_RPMSG_NOTIFY)**.
+
+When a GPIO line asserts an interrupt on the remote processor, the firmware
+should immediately mask the corresponding interrupt source and send a
+notification message to Linux. Upon completion of the interrupt
+handling on the Linux side, the driver should issue a
+command **SET_IRQ_TYPE** to the firmware to unmask the interrupt.
+
+A Notification message can arrive between a SEND and its REPLY message,
+and the driver is expected to handle this scenario.
+
+.. code-block:: none
+
+ +------+------+--------+
+ | 0x00 | 0x01 | 0x02 |
+ | 2 | line | trigger|
+ +------+------+--------+
+
+- **line**: The GPIO line (pin) index of the port.
+
+- **trigger**: Optional parameter to indicate the trigger event type.
+
diff --git a/Documentation/driver-api/gpio/index.rst b/Documentation/driver-api/gpio/index.rst
index bee58f709b9a..e5eb1f82f01f 100644
--- a/Documentation/driver-api/gpio/index.rst
+++ b/Documentation/driver-api/gpio/index.rst
@@ -16,6 +16,7 @@ Contents:
drivers-on-gpio
bt8xxgpio
pca953x
+ gpio-rpmsg
Core
====
--
2.43.0
* [PATCH v14 2/5] dt-bindings: remoteproc: imx_rproc: Add "rpmsg" subnode support
From: Shenwei Wang @ 2026-06-25 15:54 UTC
To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
Mathieu Poirier, Frank Li, Sascha Hauer
Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>
From: Shenwei Wang <shenwei.wang@nxp.com>
Remote processors may announce multiple GPIO controllers over an RPMSG
channel. These GPIO controllers may require corresponding device tree
nodes, especially when acting as providers, to supply phandles for their
consumers.
Define an RPMSG node to work as a container for a group of RPMSG channels
under the imx_rproc node. Each subnode within "rpmsg" represents an
individual RPMSG channel. The name of each subnode corresponds to the
channel name as defined by the remote processor.
All remote devices associated with a given channel are defined as child
nodes under the corresponding channel node.
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
.../devicetree/bindings/gpio/gpio-rpmsg.yaml | 55 +++++++++++++++++++
.../bindings/remoteproc/fsl,imx-rproc.yaml | 53 ++++++++++++++++++
2 files changed, 108 insertions(+)
create mode 100644 Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml
diff --git a/Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml b/Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml
new file mode 100644
index 000000000000..6c78b6850321
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml
@@ -0,0 +1,55 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/gpio-rpmsg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Generic RPMSG GPIO Controller
+
+maintainers:
+ - Shenwei Wang <shenwei.wang@nxp.com>
+
+description:
+ On an AMP platform, some GPIO controllers are exposed by the remote processor
+ through the RPMSG bus. The RPMSG GPIO transport protocol defines the packet
+ structure and communication flow between Linux and the remote firmware. Those
+ controllers are managed via this transport protocol. For more details of the
+ protocol, check the document below.
+ Documentation/driver-api/gpio/gpio-rpmsg.rst
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - fsl,rpmsg-gpio
+ - const: rpmsg-gpio
+ - const: rpmsg-gpio
+
+ reg:
+ description:
+ The reg property represents the index of the GPIO controller. Since
+ the driver manages controllers on a remote system, this index tells
+ the remote system which controller to operate.
+ maxItems: 1
+
+ "#gpio-cells":
+ const: 2
+
+ gpio-controller: true
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 2
+
+required:
+ - compatible
+ - reg
+ - "#gpio-cells"
+ - "#interrupt-cells"
+
+allOf:
+ - $ref: /schemas/gpio/gpio.yaml#
+
+unevaluatedProperties: false
diff --git a/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml b/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml
index ce8ec0119469..aea33205a881 100644
--- a/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml
@@ -85,6 +85,34 @@ properties:
This property is to specify the resource id of the remote processor in SoC
which supports SCFW
+ rpmsg:
+ type: object
+ additionalProperties: false
+ description:
+ Represents the RPMSG bus between Linux and the remote system. Contains
+ a group of RPMSG channel devices running on the bus.
+
+ properties:
+ rpmsg-io:
+ type: object
+ additionalProperties: false
+ properties:
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+ patternProperties:
+ "gpio@[0-9a-f]+$":
+ type: object
+ $ref: /schemas/gpio/gpio-rpmsg.yaml#
+ unevaluatedProperties: false
+
+ required:
+ - '#address-cells'
+ - '#size-cells'
+
required:
- compatible
@@ -147,5 +175,30 @@ examples:
&mu 3 1>;
memory-region = <&vdev0buffer>, <&vdev0vring0>, <&vdev0vring1>, <&rsc_table>;
syscon = <&src>;
+
+ rpmsg {
+ rpmsg-io {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ gpio@0 {
+ compatible = "rpmsg-gpio";
+ reg = <0>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ #interrupt-cells = <2>;
+ interrupt-controller;
+ };
+
+ gpio@1 {
+ compatible = "rpmsg-gpio";
+ reg = <1>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ #interrupt-cells = <2>;
+ interrupt-controller;
+ };
+ };
+ };
};
...
--
2.43.0
* [PATCH v14 3/5] rpmsg: core: match rpmsg device IDs by prefix
From: Shenwei Wang @ 2026-06-25 15:54 UTC
To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
Mathieu Poirier, Frank Li, Sascha Hauer
Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>
From: Shenwei Wang <shenwei.wang@nxp.com>
The current rpmsg_id_match() implementation requires an exact
string match between the driver id_table entry and the rpmsg
device name using strncmp() with RPMSG_NAME_SIZE.
This makes it impossible for a driver to match a group of
rpmsg devices sharing a common prefix (e.g. dynamically
suffixed channel names).
Update the matching logic to compare only the length of the
id->name string, allowing id_table entries to act as prefixes.
This enables drivers to bind to devices whose names start with
the specified id->name.
The implementation is copied from a reply by Mathieu.
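
The difference between the old and new comparison can be shown with a
small userspace sketch. This is only an illustration: id_match_exact(),
id_match_prefix() and bounded_len() are hypothetical names, bounded_len()
stands in for the kernel's strnlen(), and NAME_SIZE mirrors
RPMSG_NAME_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NAME_SIZE 32 /* same value as the kernel's RPMSG_NAME_SIZE */

/* Bounded string length, standing in for the kernel's strnlen(). */
static size_t bounded_len(const char *s, size_t max)
{
    size_t n = 0;

    while (n < max && s[n] != '\0')
        n++;
    return n;
}

/* Old behaviour: compare up to NAME_SIZE bytes, i.e. an exact match. */
static bool id_match_exact(const char *id_name, const char *dev_name)
{
    assert(id_name != NULL && dev_name != NULL);
    return strncmp(id_name, dev_name, NAME_SIZE) == 0;
}

/* New behaviour: compare only strlen(id_name) bytes, i.e. a prefix match. */
static bool id_match_prefix(const char *id_name, const char *dev_name)
{
    assert(id_name != NULL && dev_name != NULL);
    return strncmp(id_name, dev_name,
                   bounded_len(id_name, NAME_SIZE)) == 0;
}
```

With an id_table entry of "rpmsg-io-", the exact comparison rejects a
device named "rpmsg-io-0" while the prefix comparison accepts it. The
sketch also shows a side effect: every existing id_table entry becomes a
prefix, so an entry "rpmsg-tty" would match a (hypothetical) device named
"rpmsg-tty-raw" as well.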
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
drivers/rpmsg/rpmsg_core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c
index e7f7831d37f8..f95bfc9965d4 100644
--- a/drivers/rpmsg/rpmsg_core.c
+++ b/drivers/rpmsg/rpmsg_core.c
@@ -414,7 +414,9 @@ ATTRIBUTE_GROUPS(rpmsg_dev);
static inline int rpmsg_id_match(const struct rpmsg_device *rpdev,
const struct rpmsg_device_id *id)
{
- return strncmp(id->name, rpdev->id.name, RPMSG_NAME_SIZE) == 0;
+ size_t len = strnlen(id->name, RPMSG_NAME_SIZE);
+
+ return strncmp(id->name, rpdev->id.name, len) == 0;
}
/* match rpmsg channel and rpmsg driver */
--
2.43.0
* [PATCH v14 4/5] gpio: rpmsg: add generic rpmsg GPIO driver
From: Shenwei Wang @ 2026-06-25 15:54 UTC
To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
Mathieu Poirier, Frank Li, Sascha Hauer
Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
Arnaud POULIQUEN, b-padhi, Andrew Lunn, Bartosz Golaszewski
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>
From: Shenwei Wang <shenwei.wang@nxp.com>
On an AMP platform, the system may include multiple processors:
- MCUs running an RTOS
- An MPU running Linux
These processors communicate via the RPMSG protocol.
The driver implements the standard GPIO interface, allowing
the Linux side to control GPIO controllers that reside on
the remote processor via the RPMSG protocol.
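
For orientation, the way an incoming 3-byte message is told apart (command
reply versus interrupt notification) can be sketched in plain C as below.
This is a userspace illustration, not the driver code: the names
gpio_rpmsg_decode() and struct gpio_rpmsg_decoded are hypothetical, and
the length check is an addition made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Message types received from the remote side, as defined by the driver. */
#define GPIO_RPMSG_REPLY  1
#define GPIO_RPMSG_NOTIFY 2

enum gpio_rpmsg_kind { GPIO_MSG_INVALID, GPIO_MSG_REPLY, GPIO_MSG_NOTIFY };

struct gpio_rpmsg_decoded {
    enum gpio_rpmsg_kind kind;
    uint8_t status;  /* reply only: 0 = ok, 1 = error */
    uint8_t value;   /* reply only: direction or level */
    uint8_t line;    /* notification only: GPIO line index */
    uint8_t trigger; /* notification only: optional trigger type */
};

/* Decode byte 0 as the type, then bytes 1-2 according to that type. */
static struct gpio_rpmsg_decoded gpio_rpmsg_decode(const uint8_t *buf,
                                                   size_t len)
{
    struct gpio_rpmsg_decoded out = { .kind = GPIO_MSG_INVALID };

    assert(buf != NULL);

    /* Illustrative guard: reject anything shorter than 3 bytes. */
    if (len < 3)
        return out;

    switch (buf[0]) {
    case GPIO_RPMSG_REPLY:
        out.kind = GPIO_MSG_REPLY;
        out.status = buf[1];
        out.value = buf[2];
        break;
    case GPIO_RPMSG_NOTIFY:
        out.kind = GPIO_MSG_NOTIFY;
        out.line = buf[1];
        out.trigger = buf[2];
        break;
    default:
        break; /* unknown type stays GPIO_MSG_INVALID */
    }

    return out;
}
```

In the driver itself, a reply completes the pending command and a
notification is forwarded to the IRQ domain for the reported line.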
Cc: Bartosz Golaszewski <brgl@bgdev.pl>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
drivers/gpio/Kconfig | 17 ++
drivers/gpio/Makefile | 1 +
drivers/gpio/gpio-rpmsg.c | 568 ++++++++++++++++++++++++++++++++++++++
3 files changed, 586 insertions(+)
create mode 100644 drivers/gpio/gpio-rpmsg.c
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 020e51e30317..4ad299fe3c6f 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -1917,6 +1917,23 @@ config GPIO_SODAVILLE
endmenu
+menu "RPMSG GPIO drivers"
+ depends on RPMSG
+
+config GPIO_RPMSG
+ tristate "Generic RPMSG GPIO support"
+ depends on OF && REMOTEPROC
+ select GPIOLIB_IRQCHIP
+ default REMOTEPROC
+ help
+ Say yes here to support the generic GPIO functions over the RPMSG
+ bus. Currently supported devices: i.MX7ULP, i.MX8ULP, i.MX8x, and
+ i.MX9x.
+
+ If unsure, say N.
+
+endmenu
+
menu "SPI GPIO expanders"
depends on SPI_MASTER
diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
index b267598b517d..ee75c0e65b8b 100644
--- a/drivers/gpio/Makefile
+++ b/drivers/gpio/Makefile
@@ -157,6 +157,7 @@ obj-$(CONFIG_GPIO_RDC321X) += gpio-rdc321x.o
obj-$(CONFIG_GPIO_REALTEK_OTTO) += gpio-realtek-otto.o
obj-$(CONFIG_GPIO_REG) += gpio-reg.o
obj-$(CONFIG_GPIO_ROCKCHIP) += gpio-rockchip.o
+obj-$(CONFIG_GPIO_RPMSG) += gpio-rpmsg.o
obj-$(CONFIG_GPIO_RTD) += gpio-rtd.o
obj-$(CONFIG_ARCH_SA1100) += gpio-sa1100.o
obj-$(CONFIG_GPIO_SAMA5D2_PIOBU) += gpio-sama5d2-piobu.o
diff --git a/drivers/gpio/gpio-rpmsg.c b/drivers/gpio/gpio-rpmsg.c
new file mode 100644
index 000000000000..332e2925a830
--- /dev/null
+++ b/drivers/gpio/gpio-rpmsg.c
@@ -0,0 +1,568 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2026 NXP
+ *
+ * The driver exports a standard gpiochip interface to control
+ * the GPIO controllers via RPMSG on a remote processor.
+ */
+
+#include <linux/completion.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/gpio/driver.h>
+#include <linux/init.h>
+#include <linux/irqdomain.h>
+#include <linux/mod_devicetable.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/of.h>
+#include <linux/of_device.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/remoteproc.h>
+#include <linux/rpmsg.h>
+#include <linux/virtio_gpio.h>
+
+#define GPIOS_PER_PORT_DEFAULT 32
+#define RPMSG_TIMEOUT 1000
+
+/* Additional commands beyond virtio-gpio */
+#define VIRTIO_GPIO_MSG_SET_WAKEUP 0x0010
+
+/* GPIO Receive MSG Type */
+#define GPIO_RPMSG_REPLY 1
+#define GPIO_RPMSG_NOTIFY 2
+
+#define CHAN_NAME_PREFIX "rpmsg-io-"
+#define GPIO_COMPAT_STR "rpmsg-gpio"
+
+struct rpmsg_gpio_response {
+ __u8 type;
+ union {
+ /* command reply */
+ struct {
+ __u8 status;
+ __u8 value;
+ };
+
+ /* interrupt notification */
+ struct {
+ __u8 line;
+ __u8 trigger; /* rising/falling/high/low */
+ };
+ };
+};
+
+struct rpmsg_gpio_line {
+ u8 irq_shutdown;
+ u8 irq_unmask;
+ u8 irq_mask;
+ u32 irq_wake_enable;
+ u32 irq_type;
+};
+
+struct rpmsg_gpio_port {
+ struct gpio_chip gc;
+ struct rpmsg_device *rpdev;
+ struct virtio_gpio_request *send_msg;
+ struct rpmsg_gpio_response *recv_msg;
+ struct completion cmd_complete;
+ struct mutex lock;
+ u32 ngpios;
+ u32 idx;
+ struct rpmsg_gpio_line lines[GPIOS_PER_PORT_DEFAULT];
+};
+
+static int rpmsg_gpio_send_message(struct rpmsg_gpio_port *port)
+{
+ int ret;
+
+ reinit_completion(&port->cmd_complete);
+
+ ret = rpmsg_send(port->rpdev->ept, port->send_msg, sizeof(*port->send_msg));
+ if (ret) {
+ dev_err(&port->rpdev->dev, "rpmsg_send failed: cmd=%d ret=%d\n",
+ port->send_msg->type, ret);
+ return ret;
+ }
+
+ ret = wait_for_completion_timeout(&port->cmd_complete,
+ msecs_to_jiffies(RPMSG_TIMEOUT));
+ if (ret == 0) {
+ dev_err(&port->rpdev->dev, "rpmsg_send timeout! cmd=%d\n",
+ port->send_msg->type);
+ return -ETIMEDOUT;
+ }
+
+ if (unlikely(port->recv_msg->status != VIRTIO_GPIO_STATUS_OK)) {
+ dev_err(&port->rpdev->dev, "remote core replied with an error: cmd=%d!\n",
+ port->send_msg->type);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct virtio_gpio_request *
+rpmsg_gpio_msg_prepare(struct rpmsg_gpio_port *port, u16 line, u16 cmd, u32 val)
+{
+ struct virtio_gpio_request *msg = port->send_msg;
+
+ msg->type = cmd;
+ msg->gpio = line;
+ msg->value = val;
+
+ return msg;
+}
+
+static int rpmsg_gpio_get(struct gpio_chip *gc, unsigned int line)
+{
+ struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+ int ret;
+
+ guard(mutex)(&port->lock);
+
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_GET_VALUE, 0);
+
+ ret = rpmsg_gpio_send_message(port);
+ return ret ? ret : port->recv_msg->value;
+}
+
+static int rpmsg_gpio_get_direction(struct gpio_chip *gc, unsigned int line)
+{
+ struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+ int ret;
+
+ guard(mutex)(&port->lock);
+
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_GET_DIRECTION, 0);
+
+ ret = rpmsg_gpio_send_message(port);
+ if (ret)
+ return ret;
+
+ switch (port->recv_msg->value) {
+ case VIRTIO_GPIO_DIRECTION_IN:
+ return GPIO_LINE_DIRECTION_IN;
+ case VIRTIO_GPIO_DIRECTION_OUT:
+ return GPIO_LINE_DIRECTION_OUT;
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
+static int rpmsg_gpio_direction_input(struct gpio_chip *gc, unsigned int line)
+{
+ struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+
+ guard(mutex)(&port->lock);
+
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_DIRECTION,
+ VIRTIO_GPIO_DIRECTION_IN);
+
+ return rpmsg_gpio_send_message(port);
+}
+
+static int rpmsg_gpio_set(struct gpio_chip *gc, unsigned int line, int val)
+{
+ struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+
+ guard(mutex)(&port->lock);
+
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_VALUE, val);
+
+ return rpmsg_gpio_send_message(port);
+}
+
+static int rpmsg_gpio_direction_output(struct gpio_chip *gc, unsigned int line, int val)
+{
+ struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+ int ret;
+
+ guard(mutex)(&port->lock);
+
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_DIRECTION,
+ VIRTIO_GPIO_DIRECTION_OUT);
+
+ ret = rpmsg_gpio_send_message(port);
+ if (ret)
+ return ret;
+
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_VALUE, val);
+
+ return rpmsg_gpio_send_message(port);
+}
+
+static int gpio_rpmsg_irq_set_type(struct irq_data *d, u32 type)
+{
+ struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+ u32 line = d->hwirq;
+
+ switch (type) {
+ case IRQ_TYPE_EDGE_RISING:
+ type = VIRTIO_GPIO_IRQ_TYPE_EDGE_RISING;
+ irq_set_handler_locked(d, handle_simple_irq);
+ break;
+ case IRQ_TYPE_EDGE_FALLING:
+ type = VIRTIO_GPIO_IRQ_TYPE_EDGE_FALLING;
+ irq_set_handler_locked(d, handle_simple_irq);
+ break;
+ case IRQ_TYPE_EDGE_BOTH:
+ type = VIRTIO_GPIO_IRQ_TYPE_EDGE_BOTH;
+ irq_set_handler_locked(d, handle_simple_irq);
+ break;
+ case IRQ_TYPE_LEVEL_LOW:
+ type = VIRTIO_GPIO_IRQ_TYPE_LEVEL_LOW;
+ irq_set_handler_locked(d, handle_level_irq);
+ break;
+ case IRQ_TYPE_LEVEL_HIGH:
+ type = VIRTIO_GPIO_IRQ_TYPE_LEVEL_HIGH;
+ irq_set_handler_locked(d, handle_level_irq);
+ break;
+ default:
+ dev_err(&port->rpdev->dev, "unsupported irq type: %u\n", type);
+ return -EINVAL;
+ }
+
+ port->lines[line].irq_type = type;
+
+ return 0;
+}
+
+static int gpio_rpmsg_irq_set_wake(struct irq_data *d, u32 enable)
+{
+ struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+ u32 line = d->hwirq;
+
+ port->lines[line].irq_wake_enable = enable;
+
+ return 0;
+}
+
+/*
+ * This unmask/mask function is invoked in two situations:
+ * - when an interrupt is being set up, and
+ * - after an interrupt has occurred.
+ *
+ * The GPIO driver does not access hardware registers directly.
+ * Instead, it caches all relevant information locally, and then sends
+ * the accumulated state to the remote system at this stage.
+ */
+static void gpio_rpmsg_unmask_irq(struct irq_data *d)
+{
+ struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+ u32 line = d->hwirq;
+
+ port->lines[line].irq_unmask = 1;
+}
+
+static void gpio_rpmsg_mask_irq(struct irq_data *d)
+{
+ struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+ u32 line = d->hwirq;
+
+ /*
+ * When an interrupt occurs, the remote system masks the interrupt
+ * and then sends a notification to Linux. After Linux processes
+ * that notification, it sends an RPMsg command back to the remote
+ * system to unmask the interrupt again.
+ */
+ port->lines[line].irq_mask = 1;
+}
+
+static void gpio_rpmsg_irq_shutdown(struct irq_data *d)
+{
+ struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+ u32 line = d->hwirq;
+
+ port->lines[line].irq_shutdown = 1;
+}
+
+static void gpio_rpmsg_irq_bus_lock(struct irq_data *d)
+{
+ struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+
+ mutex_lock(&port->lock);
+}
+
+static void gpio_rpmsg_irq_bus_sync_unlock(struct irq_data *d)
+{
+ struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+ u32 line = d->hwirq;
+
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_WAKEUP,
+ port->lines[line].irq_wake_enable);
+ rpmsg_gpio_send_message(port);
+
+ /*
+ * For mask irq, do nothing here.
+ * The remote system will mask interrupt after an interrupt occurs,
+ * and then send a notification to Linux system. After Linux system
+ * handles the notification, it sends an rpmsg back to the remote
+ * system to unmask this interrupt again.
+ */
+ if (port->lines[line].irq_mask && !port->lines[line].irq_unmask) {
+ port->lines[line].irq_mask = 0;
+ mutex_unlock(&port->lock);
+ return;
+ }
+
+ if (port->lines[line].irq_shutdown) {
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_IRQ_TYPE,
+ VIRTIO_GPIO_IRQ_TYPE_NONE);
+ port->lines[line].irq_shutdown = 0;
+ } else {
+ rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_IRQ_TYPE,
+ port->lines[line].irq_type);
+
+ if (port->lines[line].irq_unmask)
+ port->lines[line].irq_unmask = 0;
+ }
+
+ rpmsg_gpio_send_message(port);
+ mutex_unlock(&port->lock);
+}
+
+static const struct irq_chip gpio_rpmsg_irq_chip = {
+ .irq_mask = gpio_rpmsg_mask_irq,
+ .irq_unmask = gpio_rpmsg_unmask_irq,
+ .irq_set_wake = gpio_rpmsg_irq_set_wake,
+ .irq_set_type = gpio_rpmsg_irq_set_type,
+ .irq_shutdown = gpio_rpmsg_irq_shutdown,
+ .irq_bus_lock = gpio_rpmsg_irq_bus_lock,
+ .irq_bus_sync_unlock = gpio_rpmsg_irq_bus_sync_unlock,
+ .flags = IRQCHIP_IMMUTABLE,
+};
+
+static int rpmsg_gpiochip_register(struct rpmsg_device *rpdev,
+ struct device_node *np, const char *name)
+{
+ struct rpmsg_gpio_port *port;
+ struct gpio_irq_chip *girq;
+ struct gpio_chip *gc;
+ int ret;
+
+ port = devm_kzalloc(&rpdev->dev, sizeof(*port), GFP_KERNEL);
+ if (!port)
+ return -ENOMEM;
+
+ ret = of_property_read_u32(np, "reg", &port->idx);
+ if (ret)
+ return ret;
+
+ ret = devm_mutex_init(&rpdev->dev, &port->lock);
+ if (ret)
+ return ret;
+
+ ret = of_property_read_u32(np, "ngpios", &port->ngpios);
+ if (ret || port->ngpios > GPIOS_PER_PORT_DEFAULT)
+ port->ngpios = GPIOS_PER_PORT_DEFAULT;
+
+ port->send_msg = devm_kzalloc(&rpdev->dev,
+ sizeof(*port->send_msg),
+ GFP_KERNEL);
+
+ port->recv_msg = devm_kzalloc(&rpdev->dev,
+ sizeof(*port->recv_msg),
+ GFP_KERNEL);
+ if (!port->send_msg || !port->recv_msg)
+ return -ENOMEM;
+
+ init_completion(&port->cmd_complete);
+ port->rpdev = rpdev;
+
+ gc = &port->gc;
+ gc->owner = THIS_MODULE;
+ gc->parent = &rpdev->dev;
+ gc->fwnode = of_fwnode_handle(np);
+ gc->ngpio = port->ngpios;
+ gc->base = -1;
+ gc->label = devm_kasprintf(&rpdev->dev, GFP_KERNEL, "%s-gpio%u",
+ name, port->idx);
+
+ gc->direction_input = rpmsg_gpio_direction_input;
+ gc->direction_output = rpmsg_gpio_direction_output;
+ gc->get_direction = rpmsg_gpio_get_direction;
+ gc->get = rpmsg_gpio_get;
+ gc->set = rpmsg_gpio_set;
+
+ girq = &gc->irq;
+ gpio_irq_chip_set_chip(girq, &gpio_rpmsg_irq_chip);
+ girq->parent_handler = NULL;
+ girq->num_parents = 0;
+ girq->parents = NULL;
+ girq->chip->name = devm_kstrdup(&rpdev->dev, gc->label, GFP_KERNEL);
+
+ dev_set_drvdata(&rpdev->dev, port);
+
+ return devm_gpiochip_add_data(&rpdev->dev, gc, port);
+}
+
+static const char *rpmsg_get_rproc_node_name(struct rpmsg_device *rpdev)
+{
+ const char *name = NULL;
+ struct device_node *np;
+ struct rproc *rproc;
+
+ rproc = rproc_get_by_child(&rpdev->dev);
+ if (!rproc)
+ return NULL;
+
+ np = of_node_get(rproc->dev.of_node);
+ if (!np && rproc->dev.parent)
+ np = of_node_get(rproc->dev.parent->of_node);
+
+ if (np) {
+ name = devm_kstrdup(&rpdev->dev, np->name, GFP_KERNEL);
+ of_node_put(np);
+ }
+
+ return name;
+}
+
+static struct device_node *
+rpmsg_find_child_by_compat_reg(struct device_node *parent, const char *compat, u32 idx)
+{
+ struct device_node *child;
+ u32 reg;
+
+ for_each_available_child_of_node(parent, child) {
+ if (!of_device_is_compatible(child, compat))
+ continue;
+
+ if (of_property_read_u32(child, "reg", &reg))
+ continue;
+
+ if (reg == idx)
+ return child;
+ }
+
+ return NULL;
+}
+
+static struct device_node *
+rpmsg_get_channel_ofnode(struct rpmsg_device *rpdev, const char *compat, u32 idx)
+{
+ struct device_node *np_chan = NULL, *np;
+ struct rproc *rproc;
+
+ rproc = rproc_get_by_child(&rpdev->dev);
+ if (!rproc)
+ return NULL;
+
+ np = of_node_get(rproc->dev.of_node);
+ if (!np && rproc->dev.parent)
+ np = of_node_get(rproc->dev.parent->of_node);
+
+ if (np)
+ np_chan = rpmsg_find_child_by_compat_reg(np, compat, idx);
+
+ return np_chan;
+}
+
+static int rpmsg_get_gpio_index(const char *name, const char *prefix)
+{
+ const char *p;
+ int base = 10;
+ int val;
+
+ if (!name)
+ return -EINVAL;
+
+ /* Ensure correct prefix */
+ if (!str_has_prefix(name, prefix))
+ return -EINVAL;
+
+ /* Find last '-' */
+ p = strrchr(name, '-');
+
+ if (!p || *(p + 1) == '\0')
+ return -EINVAL;
+
+ if (p[1] == '0' && (p[2] == 'x' || p[2] == 'X'))
+ base = 16;
+
+ if (kstrtoint(p + 1, base, &val))
+ return -EINVAL;
+
+ return val;
+}
+
+static int rpmsg_gpio_channel_callback(struct rpmsg_device *rpdev, void *data,
+ int len, void *priv, u32 src)
+{
+ struct rpmsg_gpio_response *msg = data;
+ struct rpmsg_gpio_port *port = NULL;
+
+ port = dev_get_drvdata(&rpdev->dev);
+
+ if (!port) {
+ dev_err(&rpdev->dev, "port is null\n");
+ return -EINVAL;
+ }
+
+ if (msg->type == GPIO_RPMSG_REPLY) {
+ *port->recv_msg = *msg;
+ complete(&port->cmd_complete);
+ } else if (msg->type == GPIO_RPMSG_NOTIFY) {
+ generic_handle_domain_irq_safe(port->gc.irq.domain, msg->line);
+ } else {
+ dev_err(&rpdev->dev, "wrong message type (0x%x)\n", msg->type);
+ }
+
+ return 0;
+}
+
+static int rpmsg_gpio_channel_probe(struct rpmsg_device *rpdev)
+{
+ struct device *dev = &rpdev->dev;
+ struct device_node *np;
+ const char *rproc_name;
+ int idx;
+
+ idx = rpmsg_get_gpio_index(rpdev->id.name, CHAN_NAME_PREFIX);
+ if (idx < 0)
+ return -EINVAL;
+
+ if (!dev->of_node) {
+ np = rpmsg_get_channel_ofnode(rpdev, GPIO_COMPAT_STR, idx);
+ if (!np)
+ return -ENODEV;
+
+ dev->of_node = np;
+ set_primary_fwnode(dev, of_fwnode_handle(np));
+ return -EPROBE_DEFER;
+ }
+
+ rproc_name = rpmsg_get_rproc_node_name(rpdev);
+
+ return rpmsg_gpiochip_register(rpdev, dev->of_node, rproc_name);
+}
+
+static const struct of_device_id rpmsg_gpio_dt_ids[] = {
+ { .compatible = GPIO_COMPAT_STR },
+ { /* sentinel */ }
+};
+
+static struct rpmsg_device_id rpmsg_gpio_channel_id_table[] = {
+ { .name = CHAN_NAME_PREFIX },
+ { },
+};
+MODULE_DEVICE_TABLE(rpmsg, rpmsg_gpio_channel_id_table);
+
+static struct rpmsg_driver rpmsg_gpio_channel_client = {
+ .callback = rpmsg_gpio_channel_callback,
+ .id_table = rpmsg_gpio_channel_id_table,
+ .probe = rpmsg_gpio_channel_probe,
+ .drv = {
+ .name = KBUILD_MODNAME,
+ .of_match_table = rpmsg_gpio_dt_ids,
+ },
+};
+module_rpmsg_driver(rpmsg_gpio_channel_client);
+
+MODULE_AUTHOR("Shenwei Wang <shenwei.wang@nxp.com>");
+MODULE_DESCRIPTION("generic rpmsg gpio driver");
+MODULE_LICENSE("GPL");
--
2.43.0
* [PATCH v14 5/5] arm64: dts: imx8ulp: Add rpmsg node under imx_rproc
From: Shenwei Wang @ 2026-06-25 15:54 UTC (permalink / raw)
To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
Mathieu Poirier, Frank Li, Sascha Hauer
Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>
From: Shenwei Wang <shenwei.wang@nxp.com>
Add the RPMSG bus node along with its GPIO subnodes to the device
tree.
Enable remote device communication and GPIO control via RPMSG on
the i.MX platform.
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
arch/arm64/boot/dts/freescale/imx8ulp.dtsi | 25 ++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/arch/arm64/boot/dts/freescale/imx8ulp.dtsi b/arch/arm64/boot/dts/freescale/imx8ulp.dtsi
index 1de3ad60c6aa..f1b984eb1203 100644
--- a/arch/arm64/boot/dts/freescale/imx8ulp.dtsi
+++ b/arch/arm64/boot/dts/freescale/imx8ulp.dtsi
@@ -190,6 +190,31 @@ scmi_sensor: protocol@15 {
cm33: remoteproc-cm33 {
compatible = "fsl,imx8ulp-cm33";
status = "disabled";
+
+ rpmsg {
+ rpmsg-io {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ rpmsg_gpioa: gpio@0 {
+ compatible = "rpmsg-gpio";
+ reg = <0>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ #interrupt-cells = <2>;
+ interrupt-controller;
+ };
+
+ rpmsg_gpiob: gpio@1 {
+ compatible = "rpmsg-gpio";
+ reg = <1>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ #interrupt-cells = <2>;
+ interrupt-controller;
+ };
+ };
+ };
};
soc: soc@0 {
--
2.43.0
* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 16:03 UTC (permalink / raw)
To: Andrew Lunn
Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <dfee1484-fa2a-4b98-af5a-1e67ac716905@lunn.ch>
>
> Does it even make sense to advertise this when in HD? But i don't
> think we need to consider this now. I consider HD low priority, i
> doubt it is actually used very often. We should concentrate on FD
> testing.
That's fine by me as well, let's keep it simple, we may revisit that if
we really need to.
>
>> # ethtool -a eth2
>> Autonegotiate: on
>> RX: off
>> TX: off
>> RX negotiated: on
>> TX negotiated: on
>>
>>
>> Sure, pause and HD don't make sense, however what I find confusing to some
>> extent is that the only place we have information about the *actual* pause
>> settings is the "link is Up" log in dmesg.
>
> Maybe we should extend ksetting get to return the resolved pause
> parameters? But i'm not sure how much that actually gives us. Anything
> using phylink will just ask phylink to fill in the ksettings
> information, and it seems unlikely phylink gets it wrong. What we are
> really trying to test is drivers which don't use phylink, those are
> the ones which are generally broken, and they are not going to
> implement anything new in ksettings.
Correct yes. If the MAC driver uses phylink and a test fails, it very likely
means that the PHY driver is doing shady stuff (and some are/were for pause)
> So i think the test has to look
> at:
>
>> Advertised pause frame use: Symmetric Receive-only
>> Link partner advertised pause frame use: Symmetric Receive-only
>
> and check these match what we expect.
All good for me :) thanks for your feedback,
Maxime
* [PATCH v2 1/2] dt-bindings: hwmon: chipcap2: Add label property
From: Flaviu Nistor @ 2026-06-25 16:04 UTC (permalink / raw)
To: Guenter Roeck, Javier Carrasco, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Jonathan Corbet, Shuah Khan
Cc: Flaviu Nistor, linux-hwmon, linux-kernel, devicetree, linux-doc
Add support for an optional label property similar to other hwmon devices.
On boards with multiple CHIPCAP2 sensors, this allows a distinct name to be
assigned to each instance.
Signed-off-by: Flaviu Nistor <flaviu.nistor@gmail.com>
---
Changes in v2:
- Implement suggestion from Javier Carrasco as proposed by Krzysztof Kozlowski.
- Link to v1: https://lore.kernel.org/all/20260622122200.14245-1-flaviu.nistor@gmail.com/
.../devicetree/bindings/hwmon/amphenol,chipcap2.yaml | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml b/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
index 17351fdbefce..56b0cecfca5f 100644
--- a/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
+++ b/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
@@ -45,6 +45,8 @@ properties:
- const: low
- const: high
+ label: true
+
vdd-supply:
description:
Dedicated, controllable supply-regulator to reset the device and
@@ -55,6 +57,9 @@ required:
- reg
- vdd-supply
+allOf:
+ - $ref: hwmon-common.yaml#
+
additionalProperties: false
examples:
@@ -72,6 +77,7 @@ examples:
<5 IRQ_TYPE_EDGE_RISING>,
<6 IRQ_TYPE_EDGE_RISING>;
interrupt-names = "ready", "low", "high";
+ label = "Room";
 vdd-supply = <&reg_vdd>;
};
};
--
2.34.1
* [PATCH v2 2/2] hwmon: (chipcap2) Add support for label
From: Flaviu Nistor @ 2026-06-25 16:04 UTC (permalink / raw)
To: Guenter Roeck, Javier Carrasco, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Jonathan Corbet, Shuah Khan
Cc: Flaviu Nistor, linux-hwmon, linux-kernel, devicetree, linux-doc
In-Reply-To: <20260625160423.17882-1-flaviu.nistor@gmail.com>
Add support for label sysfs attribute similar to other hwmon devices.
This is particularly useful for systems with multiple sensors on the
same board, where identifying individual sensors is much easier since
labels can be defined via device tree.
Signed-off-by: Flaviu Nistor <flaviu.nistor@gmail.com>
---
Changes in v2:
- No change for this patch in the patch series.
- Link to v1: https://lore.kernel.org/all/20260622122200.14245-1-flaviu.nistor@gmail.com/
Documentation/hwmon/chipcap2.rst | 2 ++
drivers/hwmon/chipcap2.c | 25 +++++++++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/Documentation/hwmon/chipcap2.rst b/Documentation/hwmon/chipcap2.rst
index dc165becc64c..c38d87b91b69 100644
--- a/Documentation/hwmon/chipcap2.rst
+++ b/Documentation/hwmon/chipcap2.rst
@@ -70,4 +70,6 @@ humidity1_min_hyst: RW humidity low hystersis
humidity1_max_hyst: RW humidity high hystersis
humidity1_min_alarm: RO humidity low alarm indicator
humidity1_max_alarm: RO humidity high alarm indicator
+humidity1_label: RO descriptive name for the sensor
+temp1_label: RO descriptive name for the sensor
=============================== ======= ========================================
diff --git a/drivers/hwmon/chipcap2.c b/drivers/hwmon/chipcap2.c
index 4aecf463180f..086571d556b7 100644
--- a/drivers/hwmon/chipcap2.c
+++ b/drivers/hwmon/chipcap2.c
@@ -22,6 +22,8 @@
#include <linux/irq.h>
#include <linux/module.h>
#include <linux/regulator/consumer.h>
+#include <linux/mod_devicetable.h>
+#include <linux/property.h>
#define CC2_START_CM 0xA0
#define CC2_START_NOM 0x80
@@ -83,6 +85,7 @@ struct cc2_data {
struct i2c_client *client;
struct regulator *regulator;
const char *name;
+ const char *label;
int irq_ready;
int irq_low;
int irq_high;
@@ -449,6 +452,8 @@ static umode_t cc2_is_visible(const void *data, enum hwmon_sensor_types type,
switch (attr) {
case hwmon_humidity_input:
return 0444;
+ case hwmon_humidity_label:
+ return cc2->label ? 0444 : 0;
case hwmon_humidity_min_alarm:
return cc2->rh_alarm.low_alarm_visible ? 0444 : 0;
case hwmon_humidity_max_alarm:
@@ -466,6 +471,8 @@ static umode_t cc2_is_visible(const void *data, enum hwmon_sensor_types type,
switch (attr) {
case hwmon_temp_input:
return 0444;
+ case hwmon_temp_label:
+ return cc2->label ? 0444 : 0;
default:
return 0;
}
@@ -552,6 +559,16 @@ static int cc2_humidity_max_alarm_status(struct cc2_data *data, long *val)
return 0;
}
+static int cc2_read_string(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int channel, const char **str)
+{
+ struct cc2_data *data = dev_get_drvdata(dev);
+
+ *str = data->label;
+
+ return 0;
+}
+
static int cc2_read(struct device *dev, enum hwmon_sensor_types type, u32 attr,
int channel, long *val)
{
@@ -670,8 +687,9 @@ static int cc2_request_alarm_irqs(struct cc2_data *data, struct device *dev)
}
static const struct hwmon_channel_info *cc2_info[] = {
- HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT),
- HWMON_CHANNEL_INFO(humidity, HWMON_H_INPUT | HWMON_H_MIN | HWMON_H_MAX |
+ HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | HWMON_T_LABEL),
+ HWMON_CHANNEL_INFO(humidity, HWMON_H_INPUT | HWMON_H_LABEL |
+ HWMON_H_MIN | HWMON_H_MAX |
HWMON_H_MIN_HYST | HWMON_H_MAX_HYST |
HWMON_H_MIN_ALARM | HWMON_H_MAX_ALARM),
NULL
@@ -680,6 +698,7 @@ static const struct hwmon_channel_info *cc2_info[] = {
static const struct hwmon_ops cc2_hwmon_ops = {
.is_visible = cc2_is_visible,
.read = cc2_read,
+ .read_string = cc2_read_string,
.write = cc2_write,
};
@@ -710,6 +729,8 @@ static int cc2_probe(struct i2c_client *client)
return dev_err_probe(dev, PTR_ERR(data->regulator),
"Failed to get regulator\n");
+ device_property_read_string(dev, "label", &data->label);
+
ret = cc2_request_ready_irq(data, dev);
if (ret)
return dev_err_probe(dev, ret, "Failed to request ready irq\n");
--
2.34.1
* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Andrew Lunn @ 2026-06-25 16:12 UTC (permalink / raw)
To: Maxime Chevallier
Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <38bafe7e-d419-46f7-8fa7-87e9183e578c@bootlin.com>
> This isn't Sphinx, but I've come up with something like this for a
> test definition :
>
>
> @ksft_ethtool_needs_supported_anyof([Pause, Asym_Pause])
> def test_ethtool_pause_advertising(cfg, peer) -> None:
> """Pause advertisement
>
> Validate that changing pause params through the ETHTOOL_MSG_PAUSE command
> translates to a change in the advertised pause params, and that these
> parameters are correct w.r.t the supported pause params and requested pause
> params.
>
> This exercises the .set_pauseparams() ethtool ops for MAC configuration,
> as well as the reconfiguration of the PHY's advertising and negotiation.
>
> On non-phylink MACs, the MAC should call phy_set_sym_pause() to update the
> PHY's advertising, and restart a negotiation with phy_start_aneg() if
> need be. Failure to do so will result in the wrong advertising parameters.
>
> Pn phylink-enabled MACs, phylink deals with the PHY reconfiguration provided
On
> the MAC driver calls phylink_ethtool_set_pauseparam().
>
> Failing this test likely means that the PHY driver is not correctly advertising
> pause settings, either due to the MAC not triggering a PHY reconfiguration,
> a misconfiguration of the advertising registers by the PHY, or by
> mis-handling the phydev->advertising bitfield in the PHY driver directly.
>
> The validation is made by looking at the advertised modes locally, as well as
> what the peer's 'lp_advertising' values report.
>
> cfg -- local device's interface configuration
> peer -- peer device handle
Plain Sphinx can be made to pick up this method documentation and
include it in the generated documentation. You would use something like
.. automethod:: test_ethtool_pause_advertising
in the .rst file.
I've no idea if the kernel configuration of sphinx allows this. At the
moment, i would not spend too much time on getting sphinx to generate
documentation. I would say that is nice to have. The description
itself is more important.
> """
>
> # Initial conditions :
> # - Local interface is admin UP, and reports lowlayer link UP
> # - Remote interface is admin UP, and reports lowlayer link UP
> #
> # Test 1
> # - SKIP if supported doesn't contain "Pause"
> # - run 'ethtool -A ethX rx on tx on autoneg on'
> # - FAIL if the return isn't 0
> # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
> # "Pause" or contains "Asym_Pause"
> # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
> # "Asym_Pause"
> # - Succeed otherwise
> #
> # Test 2
> # - SKIP if supported doesn't contain both "Pause" and "Asym_Pause"
> # - run 'ethtool -A ethX rx on tx on autoneg on'
> # - FAIL if the return isn't 0
> # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
> # "Pause" or contains "Asym_Pause"
> # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
> # "Asym_Pause"
> #
> # ...
>
> The annotation defines the pre-requisites in terms of locally supported
> linkmodes, we have a docstring containing information for developers
> to debug their drivers, what I'm unsure about is the commented-out part
> below, so either one big function testing multiple adjacent scenarios
> or individual functions.
Sphinx follows Python's object-oriented structure. So you could have a
class test_ethtool_pause_advertising, with class documentation. And
then methods within the class which are individual tests. The
commented out section would then be method documentation.
However, i've no idea if the selftest code allows for classes of test
methods? It looks like ksft_run() takes a list of methods. So you can
probably instantiate the class, and then pass it methods from the
class?
I would say you are right about picking one of the simple test cases,
and playing with it, define and implement it, and see what comes out
at the end.
Andrew
* Re: [PATCH v2 7/8] dt-bindings: riscv: Add generic CBQRI controller binding
From: Conor Dooley @ 2026-06-25 16:19 UTC (permalink / raw)
To: Drew Fustini
Cc: Adrien Ricciardi, Alexandre Ghiti, Atish Kumar Patra, Atish Patra,
Babu Moger, Ben Horgan, Borislav Petkov, Chen Pei, Conor Dooley,
Conor Dooley, Dave Hansen, Dave Martin, Fenghua Yu, Gong Shuai,
Gong Shuai, guo.wenjia23, James Morse, Kornel Dulęba,
Krzysztof Kozlowski, liu.qingtao2, Liu Zhiwei, Palmer Dabbelt,
Paul Walmsley, Peter Newman, Radim Krčmář,
Reinette Chatre, Rob Herring, Samuel Holland,
Sebastian Andrzej Siewior, Tony Luck, Vasudevan Srinivasan,
Ved Shanbhogue, Weiwei Li, yunhui cui, linux-kernel, linux-riscv,
x86, devicetree, linux-rt-devel, linux-doc
In-Reply-To: <20260624-dfustini-atl-sc-cbqri-dt-v2-7-2f8049fd902b@kernel.org>
On Wed, Jun 24, 2026 at 06:38:35PM -0700, Drew Fustini wrote:
> Document the generic compatibles for capacity and bandwidth controllers
> that implement the RISC-V CBQRI specification. The binding also
> describes the common riscv,cbqri-rcid and riscv,cbqri-mcid properties,
> and the optional riscv,cbqri-cache phandle that links a capacity
> controller to the cache whose capacity it allocates.
>
> Assisted-by: Claude:claude-opus-4-8
> Co-developed-by: Adrien Ricciardi <aricciardi@baylibre.com>
> Signed-off-by: Adrien Ricciardi <aricciardi@baylibre.com>
> Signed-off-by: Drew Fustini <fustini@kernel.org>
> ---
> .../devicetree/bindings/riscv/riscv,cbqri.yaml | 97 ++++++++++++++++++++++
> MAINTAINERS | 1 +
> 2 files changed, 98 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml b/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
> new file mode 100644
> index 0000000000000000000000000000000000000000..5d6be645381780e187b39e60c3bb487fdf2cfb69
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
> @@ -0,0 +1,97 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/riscv/riscv,cbqri.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: RISC-V Capacity and Bandwidth QoS Register Interface (CBQRI) controller
> +
> +description: |
> + The RISC-V CBQRI specification defines capacity-controller and
> + bandwidth-controller register blocks that allocate cache capacity and memory
> + bandwidth to resource-control IDs (RCIDs) and monitor usage per
> + monitoring-counter ID (MCID):
> + https://github.com/riscv-non-isa/riscv-cbqri/blob/main/riscv-cbqri.pdf
> +
> + Allocation and monitoring share one register block, and a controller may
> + implement either or both. A driver discovers which at runtime from the
> + capabilities register, so the compatible names only the controller type. It
> + does not distinguish allocation-only, monitoring-only or combined
> + controllers, and no property declares monitoring support.
> +
> +maintainers:
> + - Drew Fustini <fustini@kernel.org>
> +
> +properties:
> + compatible:
> + oneOf:
> + - items:
> + - description: Tenstorrent Ascalon Shared Cache
> + const: tenstorrent,ascalon-sc-cbqri
> + - const: riscv,cbqri-capacity-controller
> + - enum:
> + - riscv,cbqri-capacity-controller
> + - riscv,cbqri-bandwidth-controller
Please modify this, as has been done for other riscv spec related
bindings, to let people get away without using device-specific
compatibles.
In this case, you can just delete the first entry from this enum, since
it already has a user and only have to implement this feedback for the
second entry.
pw-bot: changes-requested
> +
> + reg:
> + maxItems: 1
> + description:
> + The CBQRI controller register block.
> +
> + riscv,cbqri-rcid:
> + $ref: /schemas/types.yaml#/definitions/uint32
> + description:
> + The maximum number of RCIDs the controller supports. RCIDs are the
> + resource-control IDs that allocation operations target.
> +
> + riscv,cbqri-mcid:
> + $ref: /schemas/types.yaml#/definitions/uint32
> + description:
> + The maximum number of MCIDs the controller supports. MCIDs are the
> + monitoring-counter IDs that usage-monitoring operations target. Present
> + on controllers that implement monitoring.
> +
> + riscv,cbqri-cache:
> + $ref: /schemas/types.yaml#/definitions/phandle
> + description:
> + Phandle to the cache node whose capacity this controller allocates.
> + Applies to capacity controllers that back a CPU cache. The cache level
> + and the harts sharing it are taken from that node's cache topology.
Architecturally, is it impossible for a capacity controller to control
more than one cache?
> +
> +required:
> + - compatible
> + - reg
> +
> +allOf:
> + - if:
> + properties:
> + compatible:
> + contains:
> + const: tenstorrent,ascalon-sc-cbqri
> + then:
> + required:
> + - riscv,cbqri-rcid
> + - riscv,cbqri-cache
> +
> +additionalProperties: false
> +
> +examples:
> + - |
> + l2_cache: l2-cache {
> + compatible = "cache";
> + cache-level = <2>;
> + cache-unified;
> + cache-size = <0xc00000>;
> + cache-sets = <512>;
> + cache-block-size = <64>;
> + };
> +
> + cache-controller@a21a00c0 {
> + compatible = "tenstorrent,ascalon-sc-cbqri",
> + "riscv,cbqri-capacity-controller";
Is this or is this not a cache controller?
The compatible and fact that the property points to an actual cache
controller suggests that this is not.
Cheers,
Conor.
> + reg = <0xa21a00c0 0xf40>;
> + riscv,cbqri-rcid = <16>;
> + riscv,cbqri-cache = <&l2_cache>;
> + };
> +
> +...
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9e1092165046c773771b055869030bc1bdb64b16..64a95a4d795a57033d3f36200d98cfb4a013ab94 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23298,6 +23298,7 @@ M: Drew Fustini <fustini@kernel.org>
> R: yunhui cui <cuiyunhui@bytedance.com>
> L: linux-riscv@lists.infradead.org
> S: Supported
> +F: Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
> F: arch/riscv/include/asm/qos.h
> F: arch/riscv/include/asm/resctrl.h
> F: arch/riscv/kernel/qos.c
>
> --
> 2.34.1
>
* [PATCH v3 01/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
cxl_probe_component_regs() finds the HDM decoder block during device
probe and caches its location, but does not record the decoder count
and does not expose the result outside drivers/cxl/.
In-kernel cxl drivers (Type-2 accelerator drivers, vfio-cxl) need the
decoder count and the byte offset and size of the HDM block without
re-running the probe sequence.
Record decoder_cnt in rmap->count when parsing the HDM capability in
cxl_probe_component_regs(), extend struct cxl_reg_map with a count
member, and add cxl_get_hdm_info() to return offset, size, and count
from the cached map. Export under the CXL namespace.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/pci.c | 33 +++++++++++++++++++++++++++++++++
drivers/cxl/core/regs.c | 1 +
include/cxl/cxl.h | 4 ++++
3 files changed, 38 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 2bcd683aa286..c917608c16f9 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -449,6 +449,39 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
}
EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
+/**
+ * cxl_get_hdm_info - Get HDM decoder register block location and count
+ * @cxlds: CXL device state (must have component regs enumerated via
+ * cxl_probe_component_regs())
+ * @count: number of HDM decoders (from HDM Capability bits [3:0])
+ * @offset: byte offset of HDM decoder block within the component register BAR
+ * @size: size in bytes of the HDM decoder block
+ *
+ * Exported for cxl drivers (in-kernel accelerator drivers, vfio-cxl) that
+ * need HDM decoder metadata from the cached component-register map without
+ * re-running the probe sequence.
+ *
+ * Return: 0 on success, -EINVAL on a NULL output pointer, -ENODEV if no HDM block.
+ */
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+ resource_size_t *offset, resource_size_t *size)
+{
+ struct cxl_reg_map *hdm = &cxlds->reg_map.component_map.hdm_decoder;
+
+ if (WARN_ON(!count || !offset || !size))
+ return -EINVAL;
+
+ if (!hdm->valid)
+ return -ENODEV;
+
+ *count = hdm->count;
+ *offset = hdm->offset;
+ *size = hdm->size;
+
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, "CXL");
+
#define CXL_DOE_TABLE_ACCESS_REQ_CODE 0x000000ff
#define CXL_DOE_TABLE_ACCESS_REQ_CODE_READ 0
#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE 0x0000ff00
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 20c2d9fbcfe7..e828df0629d0 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -85,6 +85,7 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
decoder_cnt = cxl_hdm_decoder_count(hdr);
length = 0x20 * decoder_cnt + 0x10;
rmap = &map->hdm_decoder;
+ rmap->count = decoder_cnt;
break;
}
case CXL_CM_CAP_CAP_ID_RAS:
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 802b143de83d..440ab09c640e 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -75,6 +75,7 @@ struct cxl_reg_map {
int id;
unsigned long offset;
unsigned long size;
+ u8 count;
};
struct cxl_component_reg_map {
@@ -228,4 +229,7 @@ struct cxl_memdev *devm_cxl_probe_mem(struct cxl_dev_state *cxlds,
struct range *range);
int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
+
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+ resource_size_t *offset, resource_size_t *size);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
CXL Type-2 accelerators (CXL.mem-capable GPUs and similar) cannot be
passed through to virtual machines with stock vfio-pci because the
driver has no concept of HDM decoder management, HDM region exposure,
or component register virtualization. This series adds those three
pieces, sufficient for a guest to use the device's firmware-committed
coherent memory under UVM / ATS.
v3 is a rewrite of the v2 framework form, responding to Dan's request
in the v2 review for "less emulation, narrower interfaces, and a
closer mapping to the spec language."
In this release, cxl-core exposes four EXPORT_SYMBOL_GPL helpers behind
an opaque handle. vfio-pci becomes a thin transport on top of those.
Please see "Changes since v2" and "Reviewer feedback addressed" below for
the per-area summary.
Motivation
==========
A CXL Type-2 device exposes its HDM-mapped device memory through HDM
decoders that BIOS programs and commits at boot. To pass such a
device to a guest, vfio-pci has to do three things at once:
1. Surface the firmware-committed HDM-mapped HPA range as a guest-
mmappable region.
2. Surface a CXL-spec-compliant view of the CXL Device DVSEC body,
the HDM Decoder Capability block, and the CXL.cache/mem cap-array
prefix, so the guest's CXL driver enumerates the same topology
the host saw.
3. Keep the host's committed decoder configuration intact (the
physical decoder is never reprogrammed) while letting the guest
observe and manage a shadow that follows the per-field write
semantics in the spec.
The series builds on Alejandro Lucero-Palau's v28 work
applied on for-7.3/cxl-type2-enabling [1] (sfc is the in-tree consumer
today). vfio-pci becomes the second consumer.
Architecture
============
cxl-core owns the CXL semantics. A new file
drivers/cxl/core/passthrough.c (gated by hidden Kconfig
CXL_VFIO_PASSTHROUGH) provides four exported symbols:
struct cxl_passthrough *
devm_cxl_passthrough_create(struct device *dev,
struct cxl_dev_state *cxlds);
int cxl_passthrough_dvsec_rw(p, off, val, sz, write);
int cxl_passthrough_hdm_rw (p, off, val, write);
int cxl_passthrough_cm_rw (p, off, val, write);
cxl_passthrough is an opaque handle; vfio-pci sees no cxl-internal
struct pointers. The shadows are snapshotted at create time: the
DVSEC body from PCI config space dword by dword, the CM cap-array and
HDM block from the cxl-core MMIO mapping at cxlds->reg_map.base.
The per-field write semantics are as follows:
CXL r4.0 8.1.3 DVSEC:
- LOCK is RWO,
- CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
- STATUS/STATUS2 are RW1C,
- RANGE1 is HwInit, RANGE2 is RsvdZ
CXL r4.0 8.2.4.20 HDM:
- GLOBAL_CTRL RW,
- decoder CTRL implements COMMIT/COMMITTED,
- decoder BASE/SIZE RWL gated on COMMITTED or LOCK_ON_COMMIT,
- cap header HwInit.
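To make the access types listed above concrete, here is a minimal userspace sketch of how a shadow register write could be merged per bit class. It is not the code from this series: struct shadow_reg, its mask fields, and shadow_write() are hypothetical names, and the model covers only RW, RWL, RW1C, and read-only (HwInit / RsvdZ) bits. The write-once behaviour of LOCK and the COMMIT/COMMITTED handshake are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one 32-bit shadow register. */
struct shadow_reg {
	uint32_t val;       /* current shadow value */
	uint32_t rw_mask;   /* bits the guest may always write */
	uint32_t rwl_mask;  /* bits writable only while unlocked */
	uint32_t rw1c_mask; /* bits cleared by writing 1 */
	/* Bits covered by no mask are read-only (HwInit / RsvdZ). */
};

/* Apply a guest write to the shadow and return the new value. */
static uint32_t shadow_write(struct shadow_reg *r, uint32_t wval, bool locked)
{
	uint32_t writable = r->rw_mask;

	/* The three bit classes must not overlap. */
	assert((r->rw_mask & r->rwl_mask) == 0);
	assert(((r->rw_mask | r->rwl_mask) & r->rw1c_mask) == 0);

	/* RWL bits behave like RW bits until the lock is taken. */
	if (!locked)
		writable |= r->rwl_mask;

	/* Writable bits take the written value; all others keep theirs. */
	r->val = (r->val & ~writable) | (wval & writable);

	/* RW1C: a written 1 clears the bit, a written 0 leaves it alone. */
	r->val &= ~(wval & r->rw1c_mask);

	return r->val;
}
```

In this model the caller decides what "locked" means for a given register, for example CONFIG_LOCK for the DVSEC CONTROL registers or COMMITTED for a decoder's BASE/SIZE.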
vfio-pci becomes a thin transport. The new module
drivers/vfio/pci/cxl/ exposes two VFIO regions.
VFIO_REGION_SUBTYPE_CXL (HDM region): mmappable view of the
HDM-mapped HPA. The mmap fault handler calls vmf_insert_pfn() from
the physical HPA. pread/pwrite go through the memremap_wb() kva
captured at bind time.
VFIO_REGION_SUBTYPE_CXL_COMP_REGS (component register shadow):
pread/pwrite only, dword-aligned (-EINVAL on misalignment).
Each dword dispatches by offset to cxl_passthrough_cm_rw() or
cxl_passthrough_hdm_rw(). No shadow state on the vfio side; cxl-core
enforces the spec.
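The dispatch described above can be sketched roughly as follows. This is a userspace illustration under stated assumptions, not the driver code: struct comp_layout, enum comp_target, and comp_route_dword() are hypothetical, and both the -EINVAL return for out-of-range offsets and the block-relative offset handed to the HDM path are guesses rather than behaviour confirmed by the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical layout of the component register shadow region. */
struct comp_layout {
	size_t region_size; /* total size of the region in bytes */
	size_t hdm_start;   /* offset of the HDM decoder block */
	size_t hdm_size;    /* size of the HDM decoder block */
};

enum comp_target { COMP_TARGET_CM, COMP_TARGET_HDM };

/*
 * Decide which helper a single dword access at @off belongs to.
 * On success, *target names the helper and *rel holds the offset
 * to pass to it (assumed block-relative for the HDM path).
 */
static int comp_route_dword(const struct comp_layout *l, size_t off,
			    enum comp_target *target, size_t *rel)
{
	/* The HDM block must sit inside the region. */
	assert(l->hdm_start + l->hdm_size <= l->region_size);

	if (off % 4)
		return -EINVAL; /* misaligned access */
	if (off >= l->region_size)
		return -EINVAL; /* assumption: out of range is also -EINVAL */

	if (off >= l->hdm_start && off - l->hdm_start < l->hdm_size) {
		*target = COMP_TARGET_HDM;
		*rel = off - l->hdm_start;
	} else {
		*target = COMP_TARGET_CM;
		*rel = off;
	}
	return 0;
}
```

A multi-dword pread or pwrite would call such a routine once per dword. For a single decoder, the HDM block size computed in cxl_probe_component_regs() is 0x20 * 1 + 0x10 = 0x30 bytes.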
CXL DVSEC config-space accesses use a clipping shim in
vfio_pci_config_rw_single(). A config-space chunk that crosses the
DVSEC body boundary is split: header bytes go through the generic
perm-bits path, body bytes go through cxl_passthrough_dvsec_rw().
The shim replaces v2's approach of repointing ecap_perms[].
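The clipping idea can be illustrated with a small userspace sketch that splits one config-space access at the boundaries of the DVSEC body. It is only an illustration: struct cfg_chunk and cfg_clip() are hypothetical names, the offsets in any usage are made up, and the real shim in vfio_pci_config_rw_single() may be structured quite differently.

```c
#include <assert.h>
#include <stddef.h>

/* One piece of a config-space access after clipping (hypothetical). */
struct cfg_chunk {
	size_t off;  /* config-space offset of this piece */
	size_t len;  /* length of this piece in bytes */
	int in_body; /* 1: DVSEC body path, 0: generic perm-bits path */
};

/*
 * Split the access [off, off + len) against the DVSEC body
 * [body_start, body_end). At most three pieces result: before the
 * body, inside it, and after it. Returns the number of pieces.
 */
static size_t cfg_clip(size_t off, size_t len, size_t body_start,
		       size_t body_end, struct cfg_chunk out[3])
{
	size_t end = off + len;
	size_t n = 0;

	while (off < end) {
		int in_body = off >= body_start && off < body_end;
		size_t stop = end;

		/* Stop at the next boundary the access would cross. */
		if (in_body && body_end < stop)
			stop = body_end;
		else if (!in_body && off < body_start && body_start < stop)
			stop = body_start;

		out[n].off = off;
		out[n].len = stop - off;
		out[n].in_body = in_body;
		n++;
		off = stop;
	}
	return n;
}
```

Pieces flagged in_body would go to the DVSEC body helper, and the rest would follow the generic path, so neither path sees bytes that belong to the other.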
Sparse-mmap is exposed on the component BAR so userspace can mmap the
non-component portions directly; only the CXL component register
sub-range goes through pread/pwrite emulation. The CXL sub-range is
also skipped from vfio_pci-core's request_selected_regions() set
because cxl-core's devm_cxl_probe_mem() already holds a
request_mem_region() on it; the asymmetric skip is matched by an
asymmetric release on disable().
Scope and out-of-scope
======================
In scope (anything else is rejected at create time with -EOPNOTSUPP):
- Firmware-committed devices (HOST_FIRMWARE_COMMITTED set).
- Single HDM decoder (hdm_count == 1).
- No interleave (IW == 0).
Out of scope, deferred for follow-on work:
- Multi-decoder devices and interleave.
- Guest-driven (non-firmware-committed) HDM commit.
- Hotplug, FLR, and sibling-function reset of CXL Type-2 devices.
Changes since v2
================
This is a rewrite, not an incremental update. The structure of the
series changed (20 patches in v2 to 11 in v3) because v3 collapses
v2 patches 9-15 (detection, HDM emulation, media readiness, region
management, HDM region, DVSEC emulation) into one cxl-core helper
file and one vfio-pci consumer.
Framework replaced by narrow opaque-handle helpers (patches 6, 8)
v2 carried a generic register-emulation framework split across four
state-machine files in cxl-core.
v3 collapses it into one file: drivers/cxl/core/passthrough.c
exposing the four EXPORT_SYMBOL_GPL helpers above behind a struct
cxl_passthrough opaque handle.
Shadow ownership moved into cxl-core (patches 6, 8)
vfio-pci no longer keeps any per-field state. It forwards
(offset, value) into cxl-core, and cxl-core enforces the spec
(RWO, RWL, RW1C, HwInit, RsvdZ) with explicit CXL r4.0 section
references in the switch arms.
DVSEC config-space clipping shim (patch 8)
v2 repointed ecap_perms[] to redirect CXL DVSEC reads and writes.
v3 keeps ecap_perms[] untouched and clips per-config-access chunks
at the DVSEC body boundary in vfio_pci_config_rw_single(); header bytes
go through the generic perm-bits path, body bytes go through
cxl_passthrough_dvsec_rw(). The shim is local to the per-device
path.
CONFIG_VFIO_PCI_CXL gates the new module (patch 7)
v2 had a CONFIG_VFIO_CXL_CORE Kconfig stub; v3 renames it to
CONFIG_VFIO_PCI_CXL to match the vfio-pci naming convention.
The hidden CXL_VFIO_PASSTHROUGH selects the cxl-core helper file
on demand. With both disabled, the cxl-core size is unchanged.
UAPI rewritten with named fields (patch 5)
vfio_device_info_cap_cxl in v3 carries:
flags + HOST_FIRMWARE_COMMITTED bit
hdm_region_idx
comp_reg_region_idx
comp_reg_bar
comp_reg_offset
comp_reg_size
The DPA terminology is renamed to HDM region throughout.
CACHE_CAPABLE (HDM-DB indicator) is dropped;
it was informational only in v2 with no caller, and can be re-added
later by a series that actively plumbs CXL.cache.
Selftests trimmed (patch 9)
v2 carried selftests for device detection, capability parsing,
region enumeration, HDM register emulation, HDM mmap with
page-fault insertion, FLR invalidation, and DVSEC register
emulation. v3 keeps a smoke-test set of six focused tests:
device_is_cxl GET_INFO advertises FLAGS_CXL
and a populated CAP_CXL.
hdm_region_mmap_rw mmap one page, write+read back.
component_bar_sparse_mmap SPARSE_MMAP cap excludes the
CXL component register sub-range.
comp_regs_cm_cap_array_read pread of the CM cap-array
header at CXL_CM_OFFSET succeeds
(CAP_ID == 1).
dvsec_lock_byte_read pread of the DVSEC CONFIG_LOCK
byte through the clipping shim
succeeds.
hdm_decoder_commit_fsm COMMIT / COMMITTED state machine
and LOCK_ON_COMMIT behaviour.
FLR invalidation, page-fault insertion under load, and full
DVSEC field-by-field write coverage are deferred to a follow-on
selftest series. The current six are the minimal set that
exercises the kernel-side contract end-to-end.
cxl-core prep patches split (patches 1-4)
v3 keeps the cxl-side enablers from v2 patches 1-4, each as
a standalone change so the cxl maintainer can review the helper
API independently of the vfio consumer:
[1/11] cxl_get_hdm_info()
[2/11] cxl_await_range_active() split from media-ready wait
[3/11] cxl_register_map records BIR + BAR offset
[4/11] component/HDM register defines moved to uapi/cxl/cxl_regs.h
Reviewer feedback addressed
===========================
Dan
---
- VFIO exposes HDM/host-visible region, not raw DPA; docs/UAPI say HDM
region, DPA only inside cxl-core where appropriate.
- One vfio-pci device = one HDM region / one decoder, no interleave;
hdm_count != 1 → -EOPNOTSUPP.
- Global HDM on DVSEC Range Base treated as legacy; RANGE1/RANGE2
read-only snapshot, guest writes dropped.
- No guest/kernel lock games; DVSEC LOCK and HDM LOCK_ON_COMMIT RWO,
fixed at create from firmware snapshot.
- Opaque cxl_passthrough handle only; vfio gets HPA via memdev probe +
layout via cxl_get_hdm_info(), rw via helpers.
- No multi-region accelerator case in v3; single region enforced,
multi-region deferred.
- cxl_await_range_active stays in cxl-core probe; not exported, vfio does
not call it.
- No guest LOCK→0 reprogram; guest cannot clear LOCK to remap host HPA;
kernel uncommit tied to COMMIT, not LOCK alone.
Jason / Gregory / Dan
---------------------
- memremap(WB) + request_mem_region on HPA; conflicting direct-map/EFI use
fails probe with -EBUSY.
Jonathan
--------
- uapi/cxl/cxl_regs.h for register defines so VMMs need no private
kernel headers.
- __free() locals on cxl-core/passthrough error paths instead of
struct-owned temporaries.
- No "precommitted at probe" assumption; acquire checks COMMITTED in
HDM shadow and refuses if missing.
Dave
----
- memremap(MEMREMAP_WB) for HDM host mapping (not ioremap_cache).
- Renamed cap flag to VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED for clarity.
- __free() / DEFINE_FREE() cleanup in new passthrough.c create path.
Patch series
============
[1/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
[2/11] cxl: Split cxl_await_range_active() from media-ready wait
[3/11] cxl: Record BIR and BAR offset in cxl_register_map
[4/11] cxl: Move component/HDM register defines to
uapi/cxl/cxl_regs.h
[5/11] vfio: UAPI for CXL Type-2 device passthrough
[6/11] cxl: Add register-virtualization helpers for vfio Type-2
passthrough
[7/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
acquisition
[8/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping
shim
[9/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
[10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
[11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions
Dependencies
============
[1] [PATCH v28 0/5] Type2 device basic support
https://lore.kernel.org/linux-cxl/20260618181806.118745-1-alejandro.lucero-palau@amd.com/
[2] Previous version of this patch series
[PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/
[3] Companion QEMU series
[RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
https://lore.kernel.org/linux-cxl/20260427181235.3003865-1-mhonap@nvidia.com/
Manish Honap (11):
cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
cxl: Split cxl_await_range_active() from media-ready wait
cxl: Record BIR and BAR offset in cxl_register_map
cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
vfio: UAPI for CXL Type-2 device passthrough
cxl: Add register-virtualization helpers for vfio Type-2 passthrough
vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
acquisition
vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
selftests/vfio: Add CXL Type-2 device passthrough smoke test
docs: vfio-pci: Document CXL Type-2 device passthrough
vfio/pci: Provide opt-out for CXL Type-2 extensions
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++
drivers/cxl/Kconfig | 7 +
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/passthrough.c | 590 ++++++++++++
drivers/cxl/core/pci.c | 70 +-
drivers/cxl/core/regs.c | 35 +
drivers/cxl/cxl.h | 52 +-
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/cxl/Kconfig | 34 +
drivers/vfio/pci/cxl/Makefile | 2 +
drivers/vfio/pci/cxl/vfio_cxl_core.c | 889 ++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 71 ++
drivers/vfio/pci/vfio_pci.c | 9 +
drivers/vfio/pci/vfio_pci_config.c | 31 +
drivers/vfio/pci/vfio_pci_core.c | 68 +-
drivers/vfio/pci/vfio_pci_priv.h | 93 ++
drivers/vfio/pci/vfio_pci_rdwr.c | 17 +
include/cxl/cxl.h | 18 +
include/cxl/passthrough.h | 121 +++
include/linux/vfio_pci_core.h | 8 +
include/uapi/cxl/cxl_regs.h | 63 ++
include/uapi/linux/vfio.h | 46 +
tools/testing/selftests/vfio/Makefile | 1 +
.../selftests/vfio/lib/vfio_pci_device.c | 11 +-
.../selftests/vfio/vfio_cxl_type2_test.c | 350 +++++++
27 files changed, 2821 insertions(+), 52 deletions(-)
create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
create mode 100644 drivers/cxl/core/passthrough.c
create mode 100644 drivers/vfio/pci/cxl/Kconfig
create mode 100644 drivers/vfio/pci/cxl/Makefile
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
create mode 100644 include/cxl/passthrough.h
create mode 100644 include/uapi/cxl/cxl_regs.h
create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
base-commit: 90cf2e0d702c8a132ccbe72e7687f33c04c14658
--
2.25.1
* [PATCH v3 02/11] cxl: Split cxl_await_range_active() from media-ready wait
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Before accessing CXL device memory after reset or power-on, the
driver must ensure media is ready. Not every CXL device implements
the CXL Memory Device register group: many Type-2 devices do not.
cxl_await_media_ready() reads cxlds->regs.memdev. Access to memdev
registers on a Type-2 device that lacks them can result in a kernel
panic.
Split the HDM DVSEC range-active poll out of cxl_await_media_ready()
into a new helper cxl_await_range_active(). Type-2 cxl drivers
(vfio-cxl, in-kernel accelerator drivers) that lack the CXLMDEV
status register call this directly. cxl_await_media_ready() now
calls cxl_await_range_active() for the DVSEC poll, then reads the
memory device status as before.
The existing 60-second per-range timeout from cxl_await_media_ready()
(media_ready_timeout module param) still applies. Export the new helper
under the CXL namespace.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
include/cxl/cxl.h | 2 ++
2 files changed, 32 insertions(+), 5 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c917608c16f9..c44595447bd8 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
return 0;
}
-/*
- * Wait up to @media_ready_timeout for the device to report memory
- * active.
+/**
+ * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to be active
+ * @cxlds: CXL device state (DVSEC and HDM count must be valid)
+ *
+ * For each HDM decoder range reported in the CXL DVSEC capability, waits
+ * for the range to report MEM INFO VALID (up to 1s per range), then
+ * MEM ACTIVE (up to media_ready_timeout seconds per range, default 60s).
+ * Used by cxl_await_media_ready() and by cxl drivers that bind to Type-2
+ * devices without the memdev mailbox (e.g. vfio-cxl, accelerator drivers).
+ *
+ * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a
+ * timeout occurs, or a negative errno from config read on failure.
*/
-int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int cxl_await_range_active(struct cxl_dev_state *cxlds)
{
struct pci_dev *pdev = to_pci_dev(cxlds->dev);
int d = cxlds->cxl_dvsec;
int rc, i, hdm_count;
- u64 md_status;
u16 cap;
rc = pci_read_config_word(pdev,
@@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
return rc;
}
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
+
+/*
+ * Wait up to @media_ready_timeout for the device to report memory
+ * active.
+ */
+int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+{
+ u64 md_status;
+ int rc;
+
+ rc = cxl_await_range_active(cxlds);
+ if (rc)
+ return rc;
+
md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
if (!CXLMDEV_READY(md_status))
return -EIO;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 440ab09c640e..3dcc034360af 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -232,4 +232,6 @@ int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
resource_size_t *offset, resource_size_t *size);
+
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [PATCH v3 03/11] cxl: Record BIR and BAR offset in cxl_register_map
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
The Register Locator DVSEC (CXL r4.0 8.1.9) describes register blocks
by BAR index (BIR) and offset within the BAR. CXL core currently
only stores the resolved HPA (resource + offset) in struct
cxl_register_map, so callers that need pci_iomap() or want to report
the BAR to userspace must reverse-engineer the BAR from the HPA.
Add bar_index and bar_offset to struct cxl_register_map and fill
them in cxl_decode_regblock() when the regblock is BAR-backed
(BIR 0-5). Add cxl_regblock_get_bar_info() so cxl drivers
(vfio-cxl, in-kernel accelerator drivers) can read the values
without touching the struct internals. Export under the CXL
namespace.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/pci.c | 2 ++
drivers/cxl/core/regs.c | 34 ++++++++++++++++++++++++++++++++++
include/cxl/cxl.h | 12 ++++++++++++
3 files changed, 48 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c44595447bd8..9b9b17db9ee4 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -764,6 +764,8 @@ static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
*map = (struct cxl_register_map) {
.host = &pdev->dev,
.resource = CXL_RESOURCE_NONE,
+ .bar_index = 0xff,
+ .bar_offset = 0,
};
component_reg_phys = cxl_rcd_component_reg_phys(&pdev->dev, dport);
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e828df0629d0..6af5739aa776 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -285,12 +285,46 @@ static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
return false;
}
+ if (bar >= 0 && bar <= 5) {
+ map->bar_index = (u8)bar;
+ map->bar_offset = offset;
+ } else {
+ map->bar_index = 0xff;
+ map->bar_offset = 0;
+ }
+
map->reg_type = reg_type;
map->resource = pci_resource_start(pdev, bar) + offset;
map->max_size = pci_resource_len(pdev, bar) - offset;
return true;
}
+/**
+ * cxl_regblock_get_bar_info - read BAR index and offset for a regblock
+ * @map: regblock map produced by cxl_find_regblock()
+ * @bar_index: out, PCI BAR index (0-5)
+ * @bar_offset: out, byte offset of the regblock within the BAR
+ *
+ * Exported for cxl drivers (vfio-cxl, in-kernel accelerator drivers)
+ * that need to map the regblock via pci_iomap() or report the BAR to
+ * userspace.
+ *
+ * Return: 0 on success, -EINVAL if the regblock is not BAR-backed or
+ * if any out pointer is NULL.
+ */
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+ u8 *bar_index, resource_size_t *bar_offset)
+{
+ if (!map || !bar_index || !bar_offset)
+ return -EINVAL;
+ if (map->bar_index > 5)
+ return -EINVAL;
+ *bar_index = map->bar_index;
+ *bar_offset = map->bar_offset;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
+
/*
* __cxl_find_regblock_instance() - Locate a register block or count instances by type / index
* Use CXL_INSTANCES_COUNT for @index if counting instances.
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 3dcc034360af..3bcb71d80c91 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -100,9 +100,16 @@ struct cxl_pmu_reg_map {
* @resource: physical resource base of the register block
* @max_size: maximum mapping size to perform register search
* @reg_type: see enum cxl_regloc_type
+ * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xff otherwise
+ * @bar_offset: offset within the BAR; only valid when bar_index <= 5
* @component_map: cxl_reg_map for component registers
* @device_map: cxl_reg_maps for device registers
* @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
+ *
+ * When the register block is described by the Register Locator DVSEC with
+ * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so callers
+ * can use pci_iomap(pdev, bar_index, size) and base + bar_offset instead
+ * of ioremap(resource).
*/
struct cxl_register_map {
struct device *host;
@@ -110,6 +117,8 @@ struct cxl_register_map {
resource_size_t resource;
resource_size_t max_size;
u8 reg_type;
+ u8 bar_index;
+ resource_size_t bar_offset;
union {
struct cxl_component_reg_map component_map;
struct cxl_device_reg_map device_map;
@@ -234,4 +243,7 @@ int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
resource_size_t *offset, resource_size_t *size);
int cxl_await_range_active(struct cxl_dev_state *cxlds);
+
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+ u8 *bar_index, resource_size_t *bar_offset);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [PATCH v3 04/11] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
The CXL component register layout and the HDM Decoder Capability
Structure defines live in drivers/cxl/cxl.h, where userspace
consumers cannot include them without depending on kernel-only
headers. A VMM that owns a vfio-cxl COMP_REGS shadow region needs
these defines to interpret the shadow contents.
Move the spec-defined register layout, capability identifiers, and
HDM decoder field masks to a new public uapi header,
include/uapi/cxl/cxl_regs.h. Use __GENMASK() and _BITUL() (not
GENMASK() / BIT()) so the header is uapi-clean. Include
<asm/bitsperlong.h> for the __BITS_PER_LONG that __GENMASK() needs.
drivers/cxl/cxl.h now includes <uapi/cxl/cxl_regs.h>; the values
are identical, so kernel callers see no change. Static inline
helpers that use FIELD_GET stay in drivers/cxl/cxl.h.
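To illustrate what a userspace consumer could do with these defines, the
sketch below decodes an HDM decoder control dword. The bit positions and
the per-decoder offset formula are copied from the defines in this patch;
the short HDM_* macro names, struct hdm_ctrl_fields, and hdm_ctrl_decode()
are local, hypothetical stand-ins so that the example builds without the
new header installed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of values from this patch (normally from the uapi header). */
#define HDM_DECODER_CTRL_OFFSET(i)	(0x20 * (i) + 0x20)
#define HDM_CTRL_IG_MASK		0x0000000fu	/* bits 3:0 */
#define HDM_CTRL_IW_MASK		0x000000f0u	/* bits 7:4 */
#define HDM_CTRL_LOCK			(1u << 8)
#define HDM_CTRL_COMMIT			(1u << 9)
#define HDM_CTRL_COMMITTED		(1u << 10)
#define HDM_CTRL_COMMIT_ERROR		(1u << 11)

/* Decoded view of one decoder control register (illustrative only). */
struct hdm_ctrl_fields {
	uint8_t ig;		/* interleave granularity encoding */
	uint8_t iw;		/* interleave ways encoding */
	bool lock;
	bool committed;
	bool commit_error;
};

static void hdm_ctrl_decode(uint32_t ctrl, struct hdm_ctrl_fields *out)
{
	assert(out);

	out->ig = ctrl & HDM_CTRL_IG_MASK;
	out->iw = (ctrl & HDM_CTRL_IW_MASK) >> 4;
	out->lock = !!(ctrl & HDM_CTRL_LOCK);
	out->committed = !!(ctrl & HDM_CTRL_COMMITTED);
	out->commit_error = !!(ctrl & HDM_CTRL_COMMIT_ERROR);
}
```

A VMM would read the dword at HDM_DECODER_CTRL_OFFSET(i) relative to the
HDM block and pass it to a decoder of this kind; with the single-decoder,
no-interleave scope of this series, iw is expected to be 0.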
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/cxl.h | 52 +++++-------------------------
include/uapi/cxl/cxl_regs.h | 63 +++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+), 45 deletions(-)
create mode 100644 include/uapi/cxl/cxl_regs.h
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f43abd1903ce..583a27b6659e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -24,51 +24,13 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
* (port-driver, region-driver, nvdimm object-drivers... etc).
*/
-/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
-#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
-
-/* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
-#define CXL_CM_OFFSET 0x1000
-#define CXL_CM_CAP_HDR_OFFSET 0x0
-#define CXL_CM_CAP_HDR_ID_MASK GENMASK(15, 0)
-#define CM_CAP_HDR_CAP_ID 1
-#define CXL_CM_CAP_HDR_VERSION_MASK GENMASK(19, 16)
-#define CM_CAP_HDR_CAP_VERSION 1
-#define CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK GENMASK(23, 20)
-#define CM_CAP_HDR_CACHE_MEM_VERSION 1
-#define CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
-#define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
-
-#define CXL_CM_CAP_CAP_ID_RAS 0x2
-#define CXL_CM_CAP_CAP_ID_HDM 0x5
-#define CXL_CM_CAP_CAP_HDM_VERSION 1
-
-/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
-#define CXL_HDM_DECODER_CAP_OFFSET 0x0
-#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
-#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
-#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
-#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
-#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
-#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
-#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
-#define CXL_HDM_DECODER_ENABLE BIT(1)
-#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
-#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
-#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
-#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
-#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
-#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
-#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
-#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
-#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
-#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
-#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
-#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
-#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
-#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
-#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
-#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+/*
+ * Spec-defined CXL component register layout and HDM Decoder
+ * Capability Structure constants live in <uapi/cxl/cxl_regs.h> so a
+ * userspace VMM that owns a vfio-cxl COMP_REGS shadow region can
+ * consume them without depending on kernel-only headers.
+ */
+#include <uapi/cxl/cxl_regs.h>
/* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
#define CXL_DECODER_MIN_GRANULARITY 256
diff --git a/include/uapi/cxl/cxl_regs.h b/include/uapi/cxl/cxl_regs.h
new file mode 100644
index 000000000000..b284b7ad2d42
--- /dev/null
+++ b/include/uapi/cxl/cxl_regs.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
+/*
+ * CXL component register layout and HDM Decoder Capability Structure
+ * defines. Userspace consumers (e.g. a VMM that owns a vfio-cxl
+ * COMP_REGS shadow region) need these without kernel-only header
+ * dependencies.
+ *
+ * Spec references: CXL r4.0 sections 8.2.3 and 8.2.4.20.
+ */
+#ifndef _UAPI_CXL_REGS_H_
+#define _UAPI_CXL_REGS_H_
+
+#include <asm/bitsperlong.h> /* __BITS_PER_LONG; needed by __GENMASK() */
+#include <linux/const.h> /* _BITUL(), _BITULL() */
+#include <linux/bits.h> /* __GENMASK() */
+
+/* CXL r4.0 8.2.3 CXL Component Register Layout and Definition */
+#define CXL_COMPONENT_REG_BLOCK_SIZE 0x00010000
+
+/* CXL r4.0 8.2.4 CXL.cache and CXL.mem Registers */
+#define CXL_CM_OFFSET 0x1000
+#define CXL_CM_CAP_HDR_OFFSET 0x0
+#define CXL_CM_CAP_HDR_ID_MASK __GENMASK(15, 0)
+#define CM_CAP_HDR_CAP_ID 1
+#define CXL_CM_CAP_HDR_VERSION_MASK __GENMASK(19, 16)
+#define CM_CAP_HDR_CAP_VERSION 1
+#define CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK __GENMASK(23, 20)
+#define CM_CAP_HDR_CACHE_MEM_VERSION 1
+#define CXL_CM_CAP_HDR_ARRAY_SIZE_MASK __GENMASK(31, 24)
+#define CXL_CM_CAP_PTR_MASK __GENMASK(31, 20)
+
+#define CXL_CM_CAP_CAP_ID_RAS 0x2
+#define CXL_CM_CAP_CAP_ID_HDM 0x5
+#define CXL_CM_CAP_CAP_HDM_VERSION 1
+
+/* HDM decoders, CXL r4.0 8.2.4.20 */
+#define CXL_HDM_DECODER_CAP_OFFSET 0x0
+#define CXL_HDM_DECODER_COUNT_MASK __GENMASK(3, 0)
+#define CXL_HDM_DECODER_TARGET_COUNT_MASK __GENMASK(7, 4)
+#define CXL_HDM_DECODER_INTERLEAVE_11_8 _BITUL(8)
+#define CXL_HDM_DECODER_INTERLEAVE_14_12 _BITUL(9)
+#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY _BITUL(11)
+#define CXL_HDM_DECODER_INTERLEAVE_16_WAY _BITUL(12)
+#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
+#define CXL_HDM_DECODER_ENABLE _BITUL(1)
+#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
+#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
+#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
+#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
+#define CXL_HDM_DECODER0_CTRL_IG_MASK __GENMASK(3, 0)
+#define CXL_HDM_DECODER0_CTRL_IW_MASK __GENMASK(7, 4)
+#define CXL_HDM_DECODER0_CTRL_LOCK _BITUL(8)
+#define CXL_HDM_DECODER0_CTRL_COMMIT _BITUL(9)
+#define CXL_HDM_DECODER0_CTRL_COMMITTED _BITUL(10)
+#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR _BITUL(11)
+#define CXL_HDM_DECODER0_CTRL_HOSTONLY _BITUL(12)
+#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
+#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
+#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
+#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+
+#endif /* _UAPI_CXL_REGS_H_ */
--
2.25.1
* [PATCH v3 05/11] vfio: UAPI for CXL Type-2 device passthrough
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Add the user-visible interface that exposes a CXL Type-2 device to a
VMM through vfio-pci:
VFIO_DEVICE_FLAGS_CXL (bit 9) on vfio_device_info::flags marks the
device as CXL.
VFIO_DEVICE_INFO_CAP_CXL (id 6) is the capability that carries the
HDM-backed memory region index, the CXL component register region
index, and the layout of the component register block within the
containing PCI BAR.
VFIO_REGION_SUBTYPE_CXL identifies the HDM memory region.
VFIO_REGION_SUBTYPE_CXL_COMP_REGS identifies the CXL component
register shadow.
Only the HOST_FIRMWARE_COMMITTED flag is exposed. Other CXL device
states stay invisible to userspace at this stage.
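A VMM would typically locate the new capability by walking the info
capability chain, where each header's next field is a byte offset from the
start of the info buffer and 0 terminates the chain. The sketch below
models that walk on a plain byte buffer. struct cap_header is a local
mirror of struct vfio_info_cap_header, and find_cap() and CAP_ID_CXL are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAP_ID_CXL 6	/* VFIO_DEVICE_INFO_CAP_CXL as proposed in this patch */

/* Local mirror of struct vfio_info_cap_header. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next cap from buffer start, 0 = end */
};

/*
 * Return the byte offset of the capability with @id inside @info, or 0
 * if it is absent or the chain points outside the buffer.
 */
static uint32_t find_cap(const uint8_t *info, size_t len,
			 uint32_t cap_offset, uint16_t id)
{
	uint32_t off = cap_offset;
	size_t hops = 0;

	assert(info);

	/* Bound the walk so a looping chain cannot spin forever. */
	while (off != 0 && hops++ < len) {
		struct cap_header h;

		if (off > len || len - off < sizeof(h))
			return 0;
		memcpy(&h, info + off, sizeof(h));
		if (h.id == id)
			return off;
		off = h.next;
	}
	return 0;
}
```

In real use the buffer would come from VFIO_DEVICE_GET_INFO with
VFIO_DEVICE_FLAGS_CAPS set, and cap_offset from vfio_device_info; the
struct vfio_device_info_cap_cxl fields would then be read at the returned
offset.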
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
include/uapi/linux/vfio.h | 46 +++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..3707d53c4de5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -215,6 +215,7 @@ struct vfio_device_info {
#define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6) /* vfio-fsl-mc device */
#define VFIO_DEVICE_FLAGS_CAPS (1 << 7) /* Info supports caps */
#define VFIO_DEVICE_FLAGS_CDX (1 << 8) /* vfio-cdx device */
+#define VFIO_DEVICE_FLAGS_CXL (1 << 9) /* vfio-cxl Type-2 device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
__u32 cap_offset; /* Offset within info struct of first cap */
@@ -257,6 +258,36 @@ struct vfio_device_info_cap_pci_atomic_comp {
__u32 reserved;
};
+/*
+ * VFIO_DEVICE_INFO capability for CXL Type-2 passthrough devices.
+ * Present when VFIO_DEVICE_FLAGS_CXL is set on vfio_device_info::flags.
+ *
+ * @flags: VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED indicates the host CXL
+ * subsystem committed the endpoint HDM decoder.
+ * @hdm_region_idx: VFIO region index for the HDM memory region
+ * (subtype VFIO_REGION_SUBTYPE_CXL).
+ * @comp_reg_region_idx: VFIO region index for the CXL Component
+ * Register shadow (subtype VFIO_REGION_SUBTYPE_CXL_COMP_REGS).
+ * @comp_reg_bar: PCI BAR index that contains the CXL component
+ * register block. Get-region-info on this BAR returns a
+ * VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL block.
+ * @comp_reg_offset: byte offset of the CXL component register block
+ * within @comp_reg_bar.
+ * @comp_reg_size: byte size of the CXL component register block.
+ */
+#define VFIO_DEVICE_INFO_CAP_CXL 6
+struct vfio_device_info_cap_cxl {
+ struct vfio_info_cap_header header;
+ __u32 flags;
+#define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED (1 << 0)
+ __u32 hdm_region_idx;
+ __u32 comp_reg_region_idx;
+ __u32 comp_reg_bar;
+ __u32 __resv;
+ __u64 comp_reg_offset;
+ __u64 comp_reg_size;
+};
+
/**
* VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
* struct vfio_region_info)
@@ -425,6 +456,21 @@ struct vfio_region_gfx_edid {
#define VFIO_REGION_SUBTYPE_CCW_SCHIB (2)
#define VFIO_REGION_SUBTYPE_CCW_CRW (3)
+/*
+ * sub-types for VFIO_REGION_TYPE_PCI_VENDOR (vendor id 1e98 reserved
+ * for the CXL Consortium); used by vfio-cxl Type-2 device passthrough.
+ *
+ * VFIO_REGION_SUBTYPE_CXL exposes the HDM-backed device memory range
+ * as a mappable region. The range is allocated by the host CXL
+ * subsystem and the VMM is expected to mmap() it.
+ * VFIO_REGION_SUBTYPE_CXL_COMP_REGS exposes the CXL Component Register
+ * block (read-write via pread()/pwrite() only, no mmap()). The VMM
+ * reads and writes HDM Decoder Capability registers through this
+ * shadow region instead of touching hardware directly.
+ */
+#define VFIO_REGION_SUBTYPE_CXL (1)
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2)
+
/* sub-types for VFIO_REGION_TYPE_MIGRATION */
#define VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED (1)
--
2.25.1
* [PATCH v3 06/11] cxl: Add register-virtualization helpers for vfio Type-2 passthrough
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
vfio-pci needs the CXL Device DVSEC body, the HDM Decoder Capability
block, and the CXL.cache/mem cap-array prefix to be virtualized
toward a KVM guest in a CXL-spec-compliant way.
Introduce a narrow helper API owned by cxl-core:
struct cxl_passthrough *
devm_cxl_passthrough_create(struct device *dev,
struct cxl_dev_state *cxlds);
int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off,
u32 *val, size_t sz, bool write);
int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off,
u32 *val, bool write);
int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off,
u32 *val, bool write);
Each helper takes a per-device mutex covering the DVSEC + HDM shadows
(the CM cap-array snapshot is immutable after create) and dispatches
by offset to a hand-written write handler against CXL r4.0 §8.1.3
(DVSEC: LOCK is RWO, CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
STATUS/STATUS2 are RW1C, RANGE1 is HwInit, RANGE2 is RsvdZ) and
§8.2.4.20 (HDM: GLOBAL_CTRL RW, decoder CTRL implements
COMMIT/COMMITTED, decoder BASE/SIZE RWL gated on COMMITTED or
LOCK_ON_COMMIT, cap header HwInit).
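The per-field write semantics can be pictured with a small standalone
model. This is a sketch of one reading of the attributes (RWO as
write-one-to-lock, RWL as writable until the lock is set, RW1C as
write-one-to-clear) and should be checked against the CXL r4.0 attribute
definitions. struct dvsec_shadow, enum dvsec_field, LOCK_BIT, and
shadow_write() are hypothetical and do not reflect the layout or names in
passthrough.c.

```c
#include <assert.h>
#include <stdint.h>

#define LOCK_BIT (1u << 0)	/* illustrative position of CONFIG_LOCK */

/* Hypothetical three-field shadow, one field per attribute type. */
struct dvsec_shadow {
	uint32_t lock;		/* RWO */
	uint32_t control;	/* RWL, gated on lock */
	uint32_t status;	/* RW1C */
};

enum dvsec_field { F_LOCK, F_CONTROL, F_STATUS };

static void shadow_write(struct dvsec_shadow *s, enum dvsec_field f,
			 uint32_t val)
{
	assert(s);

	switch (f) {
	case F_LOCK:
		/* RWO: once set, the bit cannot be cleared by a write. */
		s->lock |= val & LOCK_BIT;
		break;
	case F_CONTROL:
		/* RWL: writes are dropped while the lock is held. */
		if (!(s->lock & LOCK_BIT))
			s->control = val;
		break;
	case F_STATUS:
		/* RW1C: each 1 written clears the matching status bit. */
		s->status &= ~val;
		break;
	}
}
```

Reads in such a model simply return the shadow field, which matches the
description of reads being served from the snapshot while writes update it
according to the attribute of the field.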
Writes to the CM cap-array are silently discarded because the
cap-array headers are RO per CXL r4.0 §8.2.4; the write parameter is
kept on the rw API to make the drop policy explicit at the call site.
The shadows are snapshotted at create time: the DVSEC body from PCI
config space dword-at-a-time, the CM cap-array and HDM block from
the cxl-core MMIO mapping at cxlds->reg_map.base. This preserves
firmware-committed values so the guest reads what the host BIOS
committed, while writes update the shadow per the per-field write
semantics above.
The file is gated by the hidden Kconfig CXL_VFIO_PASSTHROUGH so the
passthrough code stays out of cxl_core when no vfio consumer is configured.
Scope: firmware-committed, single-decoder, no-interleave Type-2
passthrough. Multi-decoder, interleave, and hotplug are
out-of-scope and rejected at create time (-EOPNOTSUPP for
hdm_count != 1).
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/Kconfig | 7 +
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/passthrough.c | 590 +++++++++++++++++++++++++++++++++
include/cxl/passthrough.h | 121 +++++++
4 files changed, 719 insertions(+)
create mode 100644 drivers/cxl/core/passthrough.c
create mode 100644 include/cxl/passthrough.h
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 80aeb0d556bd..7c874d486a9c 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -19,6 +19,13 @@ menuconfig CXL_BUS
if CXL_BUS
+config CXL_VFIO_PASSTHROUGH
+ bool
+ # Hidden symbol selected by VFIO_PCI_CXL to pull
+ # drivers/cxl/core/passthrough.c into cxl_core when a vfio
+ # Type-2 passthrough consumer is configured. Keep silent: no
+ # help text, no default, no user-visible prompt.
+
config CXL_PCI
tristate "PCI manageability"
default CXL_BUS
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..0cc80bd35a88 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -22,3 +22,4 @@ cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
cxl_core-$(CONFIG_CXL_RAS) += ras.o
cxl_core-$(CONFIG_CXL_RAS) += ras_rch.o
cxl_core-$(CONFIG_CXL_ATL) += atl.o
+cxl_core-$(CONFIG_CXL_VFIO_PASSTHROUGH) += passthrough.o
diff --git a/drivers/cxl/core/passthrough.c b/drivers/cxl/core/passthrough.c
new file mode 100644
index 000000000000..b89829586024
--- /dev/null
+++ b/drivers/cxl/core/passthrough.c
@@ -0,0 +1,590 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * vfio-pci Type-2 device passthrough — CXL register virtualization.
+ *
+ * Owns the CXL spec-defined virtualization semantics for the
+ * - CXL Device DVSEC capability body (CXL r4.0 §8.1.3)
+ * - HDM Decoder Capability block (CXL r4.0 §8.2.4.20)
+ * - CXL.cache/mem (CM) cap-array (CXL r4.0 §8.2.4)
+ *
+ * vfio-pci is the only caller. This file is NOT a generic emulation
+ * framework: every register the guest may touch has a hand-written
+ * write handler against the spec. Reads serve from a shadow
+ * snapshotted at create time; writes update the shadow per the spec
+ * attribute mode for that field.
+ *
+ * Scope: firmware-committed, single-decoder, no-interleave Type-2
+ * passthrough. Multi-decoder, interleave, and hotplug are
+ * out-of-scope and rejected at create time.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cleanup.h>
+#include <linux/device.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_ids.h>
+#include <linux/pci_regs.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/unaligned.h>
+
+#include <uapi/cxl/cxl_regs.h>
+
+#include <cxlpci.h>
+#include <cxlmem.h>
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+
+#include "core.h"
+
+/* DVSEC CXL Device body offsets — relative to DVSEC capability start.
+ * Body begins at PCI_DVSEC_CXL_CAP (0x0a); preceding bytes are the PCI
+ * ext-cap header and DVSEC headers handled by the generic vfio
+ * perm-bits path.
+ */
+#define DVSEC_OFF_CAPABILITY PCI_DVSEC_CXL_CAP /* 0x0a, u16 */
+#define DVSEC_OFF_CONTROL PCI_DVSEC_CXL_CTRL /* 0x0c, u16 */
+#define DVSEC_OFF_STATUS 0x0e /* u16 */
+#define DVSEC_OFF_CONTROL2 0x10 /* u16 */
+#define DVSEC_OFF_STATUS2 0x12 /* u16 */
+#define DVSEC_OFF_LOCK 0x14 /* u16 */
+#define DVSEC_OFF_RANGE1_SIZE_HI 0x18 /* u32 */
+#define DVSEC_OFF_RANGE1_SIZE_LO 0x1c
+#define DVSEC_OFF_RANGE1_BASE_HI 0x20
+#define DVSEC_OFF_RANGE1_BASE_LO 0x24
+#define DVSEC_OFF_RANGE2_SIZE_HI 0x28
+#define DVSEC_OFF_RANGE2_SIZE_LO 0x2c
+#define DVSEC_OFF_RANGE2_BASE_HI 0x30
+#define DVSEC_OFF_RANGE2_BASE_LO 0x34
+#define DVSEC_BODY_END 0x38
+
+#define DVSEC_LOCK_CONFIG_LOCK BIT(0)
+
+/* HDM Decoder Capability block offsets — relative to HDM block base.
+ * Decoder N register set starts at 0x10 + N * 0x20.
+ */
+#define HDM_OFF_CAP_HEADER 0x00
+#define HDM_OFF_GLOBAL_CTRL 0x04
+#define HDM_DEC_BASE 0x10
+#define HDM_DEC_STRIDE 0x20
+#define HDM_DEC_OFF_BASE_LO(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x00)
+#define HDM_DEC_OFF_BASE_HI(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x04)
+#define HDM_DEC_OFF_SIZE_LO(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x08)
+#define HDM_DEC_OFF_SIZE_HI(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x0c)
+#define HDM_DEC_OFF_CTRL(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x10)
+
+/* HDM Decoder CTRL bits per CXL r4.0 §8.2.4.20.5. */
+#define HDM_CTRL_LOCK_ON_COMMIT BIT(8)
+#define HDM_CTRL_COMMIT BIT(9)
+#define HDM_CTRL_COMMITTED BIT(10)
+#define HDM_CTRL_ERR_NOT_COMMITTED BIT(11)
+
+struct cxl_passthrough {
+ struct cxl_dev_state *cxlds;
+
+ /* DVSEC body shadow. Byte-indexed by (off - PCI_DVSEC_CXL_CAP).
+ * Allocated rounded up to a dword so dword reads at the tail
+ * never overrun.
+ */
+ u8 *dvsec_shadow;
+ u16 dvsec_size; /* full DVSEC cap length, incl. headers */
+ bool dvsec_config_locked;
+
+ /* HDM block shadow. Byte-indexed; size = hdm_reg_size. */
+ u8 *hdm_shadow;
+ resource_size_t hdm_reg_size;
+
+ /* CM cap-array snapshot. Dword-indexed by (off / 4) where off
+ * is the byte offset from CXL_CM_OFFSET. Read-only after create.
+ */
+ __le32 *cm_snapshot;
+ size_t cm_snapshot_dwords;
+
+ /* Covers dvsec_shadow + dvsec_config_locked + hdm_shadow.
+ * cm_snapshot is immutable after create; no lock needed. Leaf-
+ * level: no entry point holding this mutex calls into cxl-bus or
+ * vfio.
+ */
+ struct mutex lock;
+};
+
+/* ------------------------------------------------------------------ */
+/* Snapshot helpers */
+/* ------------------------------------------------------------------ */
+
+/* Read the DVSEC body bytes [PCI_DVSEC_CXL_CAP, dvsec_size) from PCI
+ * config space into the shadow.
+ *
+ * The body starts at PCI_DVSEC_CXL_CAP (0x0a), which is word-aligned but
+ * NOT dword-aligned, and CXL r4.0 §8.1.3 places six 16-bit descriptors
+ * (CAPABILITY through LOCK) at offsets 0x0a..0x14 before any 32-bit
+ * field. Strict-alignment PCIe host bridges (e.g. ARM64 ECAM) reject
+ * misaligned dword config accesses with PCIBIOS_BAD_REGISTER_NUMBER;
+ * snapshot at the natural granularity of the body's 16-bit descriptors
+ * (2-byte stride) so every offset in the range is naturally aligned.
+ */
+static int snapshot_dvsec_body(struct cxl_passthrough *p)
+{
+ struct pci_dev *pdev = to_pci_dev(p->cxlds->dev);
+ u16 dvsec = p->cxlds->cxl_dvsec;
+ u16 off;
+ u16 word;
+ int rc;
+
+ for (off = PCI_DVSEC_CXL_CAP; off < p->dvsec_size; off += 2) {
+ rc = pci_read_config_word(pdev, dvsec + off, &word);
+ if (rc)
+ return -EIO;
+ put_unaligned_le16(word, p->dvsec_shadow +
+ (off - PCI_DVSEC_CXL_CAP));
+ }
+ return 0;
+}
+
+/* Read the CM cap-array prefix [CXL_CM_OFFSET, hdm_reg_offset) from
+ * MMIO into cm_snapshot, and the HDM block [hdm_reg_offset,
+ * hdm_reg_offset + hdm_reg_size) into hdm_shadow.
+ *
+ * @base is a short-lived kva for the component register block,
+ * established by the caller via ioremap() against cxlds->reg_map.resource.
+ * cxl_setup_regs() drops its own ioremap (clears reg_map.base) after the
+ * cap-array probe completes, so this function cannot rely on
+ * cxlds->reg_map.base being valid; the caller passes a fresh mapping
+ * here and releases it once snapshot data has been copied into the
+ * in-memory shadows.
+ */
+static void snapshot_cm_and_hdm(struct cxl_passthrough *p,
+ void __iomem *base,
+ resource_size_t hdm_off)
+{
+ size_t i;
+
+ for (i = 0; i < p->cm_snapshot_dwords; i++)
+ p->cm_snapshot[i] = cpu_to_le32(readl(base + CXL_CM_OFFSET +
+ i * 4));
+
+ for (i = 0; i < p->hdm_reg_size / 4; i++)
+ put_unaligned_le32(readl(base + hdm_off + i * 4),
+ p->hdm_shadow + i * 4);
+}
+
+/* ------------------------------------------------------------------ */
+/* devres */
+/* ------------------------------------------------------------------ */
+
+static void cxl_passthrough_release(struct device *dev, void *res)
+{
+ struct cxl_passthrough *p = *(struct cxl_passthrough **)res;
+
+ kfree(p->dvsec_shadow);
+ kfree(p->hdm_shadow);
+ kfree(p->cm_snapshot);
+ mutex_destroy(&p->lock);
+ kfree(p);
+}
+
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds)
+{
+ struct cxl_passthrough **dres;
+ struct cxl_passthrough *p;
+ struct pci_dev *pdev;
+ resource_size_t hdm_off, hdm_size;
+ size_t dvsec_shadow_size;
+ u8 hdm_count;
+ u32 hdr;
+ int rc;
+
+ /*
+ * cxl_setup_regs() releases its short-lived ioremap before returning,
+ * so reg_map.base is NULL by the time we run. Validate the persistent
+ * fields (resource address and size) instead; the local ioremap
+ * established further below covers the snapshot reads.
+ */
+ if (!dev || !cxlds || !cxlds->dev || !cxlds->cxl_dvsec ||
+ !cxlds->reg_map.resource || !cxlds->reg_map.max_size)
+ return ERR_PTR(-EINVAL);
+
+ pdev = to_pci_dev(cxlds->dev);
+
+ rc = cxl_get_hdm_info(cxlds, &hdm_count, &hdm_off, &hdm_size);
+ if (rc)
+ return ERR_PTR(rc);
+ if (hdm_count != 1 || !hdm_size || hdm_off <= CXL_CM_OFFSET ||
+ !IS_ALIGNED(hdm_size, 4))
+ return ERR_PTR(-EOPNOTSUPP);
+
+ p = kzalloc_obj(*p, GFP_KERNEL);
+ if (!p)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&p->lock);
+ p->cxlds = cxlds;
+ p->hdm_reg_size = hdm_size;
+
+ /* Full DVSEC length (headers included) from DVSEC Header 1. */
+ rc = pci_read_config_dword(pdev, cxlds->cxl_dvsec + PCI_DVSEC_HEADER1,
+ &hdr);
+ if (rc) {
+ rc = -EIO;
+ goto err;
+ }
+ p->dvsec_size = PCI_DVSEC_HEADER1_LEN(hdr);
+ if (p->dvsec_size < DVSEC_BODY_END) {
+ rc = -EINVAL;
+ goto err;
+ }
+
+ dvsec_shadow_size = round_up(p->dvsec_size - PCI_DVSEC_CXL_CAP, 4);
+ p->dvsec_shadow = kzalloc(dvsec_shadow_size, GFP_KERNEL);
+ if (!p->dvsec_shadow) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ p->cm_snapshot_dwords = (hdm_off - CXL_CM_OFFSET) / 4;
+ p->cm_snapshot = kcalloc(p->cm_snapshot_dwords, sizeof(__le32),
+ GFP_KERNEL);
+ if (!p->cm_snapshot) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ p->hdm_shadow = kzalloc(hdm_size, GFP_KERNEL);
+ if (!p->hdm_shadow) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ rc = snapshot_dvsec_body(p);
+ if (rc)
+ goto err;
+
+ {
+ void __iomem *base;
+
+ /*
+ * Bind-time-only ioremap. cxl_setup_regs() has already
+ * released the cxl-core ioremap (see comment on the entry
+ * gate). Take a fresh, short-lived mapping for the
+ * snapshot, then release it; all subsequent reads serve
+ * from the in-memory shadows.
+ */
+ base = ioremap(cxlds->reg_map.resource,
+ cxlds->reg_map.max_size);
+ if (!base) {
+ rc = -ENOMEM;
+ goto err;
+ }
+ snapshot_cm_and_hdm(p, base, hdm_off);
+ iounmap(base);
+ }
+
+ dres = devres_alloc(cxl_passthrough_release, sizeof(*dres),
+ GFP_KERNEL);
+ if (!dres) {
+ rc = -ENOMEM;
+ goto err;
+ }
+ *dres = p;
+ devres_add(dev, dres);
+ return p;
+
+err:
+ kfree(p->dvsec_shadow);
+ kfree(p->cm_snapshot);
+ kfree(p->hdm_shadow);
+ mutex_destroy(&p->lock);
+ kfree(p);
+ return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_passthrough_create, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* DVSEC write semantics */
+/* ------------------------------------------------------------------ */
+
+static u16 dvsec_shadow_get_u16(struct cxl_passthrough *p, u16 off)
+{
+ return get_unaligned_le16(p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+static void dvsec_shadow_set_u16(struct cxl_passthrough *p, u16 off, u16 val)
+{
+ put_unaligned_le16(val, p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+/* Apply a write to a single DVSEC field at @off, with the field's
+ * native width (2 for descriptors, 4 for RANGE entries). @width is
+ * the field's spec width; @new is the merged value to apply. Caller
+ * holds p->lock.
+ */
+static void dvsec_apply_write(struct cxl_passthrough *p, u16 off, size_t width,
+ u32 new)
+{
+ u16 cur16;
+
+ switch (off) {
+ case DVSEC_OFF_CAPABILITY:
+ /* HwInit — drop. */
+ return;
+ case DVSEC_OFF_CONTROL:
+ case DVSEC_OFF_CONTROL2:
+ /* RWL — gated on CONFIG_LOCK. */
+ if (p->dvsec_config_locked)
+ return;
+ dvsec_shadow_set_u16(p, off, (u16)new);
+ return;
+ case DVSEC_OFF_STATUS:
+ case DVSEC_OFF_STATUS2:
+ /* RW1C — clear bits where the guest wrote 1. */
+ cur16 = dvsec_shadow_get_u16(p, off);
+ dvsec_shadow_set_u16(p, off, cur16 & ~(u16)new);
+ return;
+ case DVSEC_OFF_LOCK:
+ /* RWO — first 1-write latches CONFIG_LOCK; subsequent
+ * writes are ignored.
+ */
+ cur16 = dvsec_shadow_get_u16(p, off);
+ if (cur16 & DVSEC_LOCK_CONFIG_LOCK)
+ return;
+ if (new & DVSEC_LOCK_CONFIG_LOCK) {
+ dvsec_shadow_set_u16(p, off,
+ cur16 | DVSEC_LOCK_CONFIG_LOCK);
+ p->dvsec_config_locked = true;
+ }
+ return;
+ case DVSEC_OFF_RANGE1_SIZE_HI:
+ case DVSEC_OFF_RANGE1_SIZE_LO:
+ case DVSEC_OFF_RANGE1_BASE_HI:
+ case DVSEC_OFF_RANGE1_BASE_LO:
+ /* HwInit — drop. */
+ return;
+ case DVSEC_OFF_RANGE2_SIZE_HI:
+ case DVSEC_OFF_RANGE2_SIZE_LO:
+ case DVSEC_OFF_RANGE2_BASE_HI:
+ case DVSEC_OFF_RANGE2_BASE_LO:
+ /* RsvdZ — drop. */
+ return;
+ default:
+ /* Reserved offsets inside the modelled body: drop. */
+ (void)width;
+ return;
+ }
+}
+
+/* Map a byte offset @off inside the DVSEC body to the natural-width
+ * field that contains it: returns the field's base offset (16-bit
+ * aligned for descriptors, 32-bit aligned for RANGE entries) and width.
+ * Returns false if @off lies outside any modelled field.
+ */
+static bool dvsec_field_at(u16 off, u16 *field_off, size_t *width)
+{
+ if (off >= DVSEC_OFF_CAPABILITY && off < DVSEC_OFF_RANGE1_SIZE_HI) {
+ *field_off = ALIGN_DOWN(off, 2);
+ *width = 2;
+ return true;
+ }
+ if (off >= DVSEC_OFF_RANGE1_SIZE_HI && off < DVSEC_BODY_END) {
+ *field_off = ALIGN_DOWN(off, 4);
+ *width = 4;
+ return true;
+ }
+ return false;
+}
+
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ size_t sz, bool write)
+{
+ u8 *shadow;
+ u16 field_off;
+ size_t field_width;
+ u32 cur, merged;
+ u32 sub_shift;
+ u32 width_mask;
+
+ if (!p || !val)
+ return -EINVAL;
+ if (sz != 1 && sz != 2 && sz != 4)
+ return -EINVAL;
+ if (off < PCI_DVSEC_CXL_CAP || off + sz > p->dvsec_size)
+ return -EINVAL;
+
+ guard(mutex)(&p->lock);
+
+ shadow = p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP);
+
+ if (!write) {
+ switch (sz) {
+ case 1:
+ *val = *shadow;
+ break;
+ case 2:
+ *val = get_unaligned_le16(shadow);
+ break;
+ case 4:
+ *val = get_unaligned_le32(shadow);
+ break;
+ }
+ return 0;
+ }
+
+ if (!dvsec_field_at(off, &field_off, &field_width))
+ return 0; /* outside any modelled field: drop */
+
+ /* Read-modify-merge the field at its natural width. */
+ if (field_width == 2)
+ cur = dvsec_shadow_get_u16(p, field_off);
+ else
+ cur = get_unaligned_le32(p->dvsec_shadow +
+ (field_off - PCI_DVSEC_CXL_CAP));
+
+ width_mask = (sz == 4) ? 0xffffffff : (sz == 2 ? 0xffff : 0xff);
+ sub_shift = (off - field_off) * 8;
+ merged = cur & ~(width_mask << sub_shift);
+ merged |= (*val & width_mask) << sub_shift;
+
+ dvsec_apply_write(p, field_off, field_width, merged);
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_dvsec_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* HDM write semantics */
+/* ------------------------------------------------------------------ */
+
+static u32 hdm_shadow_get(struct cxl_passthrough *p, u32 off)
+{
+ return get_unaligned_le32(p->hdm_shadow + off);
+}
+
+static void hdm_shadow_set(struct cxl_passthrough *p, u32 off, u32 val)
+{
+ put_unaligned_le32(val, p->hdm_shadow + off);
+}
+
+/* Decoder index for a per-decoder register offset. */
+static u32 hdm_decoder_of(u32 off)
+{
+ return (off - HDM_DEC_BASE) / HDM_DEC_STRIDE;
+}
+
+static u32 hdm_decoder_field(u32 off)
+{
+ return (off - HDM_DEC_BASE) % HDM_DEC_STRIDE;
+}
+
+static void hdm_decoder_ctrl_write(struct cxl_passthrough *p, u32 off, u32 val)
+{
+ u32 cur = hdm_shadow_get(p, off);
+ u32 next;
+
+ /* Once COMMITTED, only the COMMIT toggle is honoured. Releasing
+ * COMMIT clears COMMITTED and Lock-on-Commit per CXL r4.0
+ * §8.2.4.20.5.
+ */
+ if (cur & HDM_CTRL_COMMITTED) {
+ next = (cur & ~HDM_CTRL_COMMIT) | (val & HDM_CTRL_COMMIT);
+ if (!(val & HDM_CTRL_COMMIT)) {
+ next &= ~HDM_CTRL_COMMITTED;
+ next &= ~HDM_CTRL_LOCK_ON_COMMIT;
+ }
+ hdm_shadow_set(p, off, next);
+ return;
+ }
+
+ next = val & ~(HDM_CTRL_COMMITTED | HDM_CTRL_ERR_NOT_COMMITTED);
+ if (val & HDM_CTRL_COMMIT)
+ next |= HDM_CTRL_COMMITTED;
+ hdm_shadow_set(p, off, next);
+}
+
+static void hdm_decoder_basesize_write(struct cxl_passthrough *p, u32 off,
+ u32 val)
+{
+ u32 n = hdm_decoder_of(off);
+ u32 ctrl = hdm_shadow_get(p, HDM_DEC_OFF_CTRL(n));
+
+ /* RWL — BASE/SIZE locked when the decoder is committed or
+ * lock-on-commit has been latched.
+ */
+ if (ctrl & (HDM_CTRL_COMMITTED | HDM_CTRL_LOCK_ON_COMMIT))
+ return;
+ hdm_shadow_set(p, off, val);
+}
+
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write)
+{
+ u32 field;
+
+ if (!p || !val)
+ return -EINVAL;
+ if (!IS_ALIGNED(off, 4) || off + 4 > p->hdm_reg_size)
+ return -EINVAL;
+
+ guard(mutex)(&p->lock);
+
+ if (!write) {
+ *val = hdm_shadow_get(p, off);
+ return 0;
+ }
+
+ switch (off) {
+ case HDM_OFF_CAP_HEADER:
+ /* HwInit — drop. */
+ return 0;
+ case HDM_OFF_GLOBAL_CTRL:
+ /* RW — shadow. */
+ hdm_shadow_set(p, off, *val);
+ return 0;
+ }
+
+ if (off < HDM_DEC_BASE)
+ return 0; /* gap before per-decoder regs: drop */
+
+ field = hdm_decoder_field(off);
+ switch (field) {
+ case 0x00: case 0x04: /* BASE_LO / BASE_HI */
+ case 0x08: case 0x0c: /* SIZE_LO / SIZE_HI */
+ hdm_decoder_basesize_write(p, off, *val);
+ return 0;
+ case 0x10: /* CTRL */
+ hdm_decoder_ctrl_write(p, off, *val);
+ return 0;
+ default:
+ /* TARGET_LIST_{LO,HI} and other per-decoder bytes are
+ * accepted as plain RW shadow for the firmware-committed
+ * scope; multi-decoder / interleave behaviour is
+ * out-of-scope.
+ */
+ hdm_shadow_set(p, off, *val);
+ return 0;
+ }
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_hdm_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* CM cap-array snapshot */
+/* ------------------------------------------------------------------ */
+
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write)
+{
+ if (!p || !val)
+ return -EINVAL;
+ if (!IS_ALIGNED(off, 4) || off / 4 >= p->cm_snapshot_dwords)
+ return -EINVAL;
+
+ if (write)
+ return 0; /* cap-array headers are RO; drop. */
+
+ *val = le32_to_cpu(p->cm_snapshot[off / 4]);
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_cm_rw, "CXL");
diff --git a/include/cxl/passthrough.h b/include/cxl/passthrough.h
new file mode 100644
index 000000000000..43214b0d34f6
--- /dev/null
+++ b/include/cxl/passthrough.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * CXL register virtualization helpers for vfio-pci Type-2 passthrough.
+ *
+ * See Documentation/driver-api/vfio-pci-cxl.rst for the ownership
+ * contract. In short: cxl-core owns the per-device DVSEC body, HDM
+ * Decoder block, and CM cap-array shadows; vfio-pci is a transport
+ * that forwards guest reads and writes through the helpers below.
+ *
+ * The helpers are not a generic emulation framework. Each register
+ * is hand-coded against CXL r4.0 §8.1.3 and §8.2.4.20. Adding a new
+ * field is "add a case", not "add a mode".
+ */
+#ifndef __CXL_PASSTHROUGH_H__
+#define __CXL_PASSTHROUGH_H__
+
+#include <linux/types.h>
+
+struct cxl_dev_state;
+struct cxl_passthrough;
+struct device;
+
+/**
+ * devm_cxl_passthrough_create - snapshot a Type-2 device's DVSEC + HDM +
+ * CM cap-array shadows and return the opaque handle the rw helpers
+ * operate on.
+ *
+ * @dev: device whose devres lifetime bounds the returned handle.
+ * @cxlds: CXL device state with cxlds->cxl_dvsec populated and
+ * cxlds->reg_map.resource and cxlds->reg_map.max_size describing
+ * the component register block. cxlds->reg_map.base is NOT
+ * required; cxl_pci_setup_regs() releases its short-lived
+ * ioremap before returning, so this helper takes a local
+ * bind-time ioremap against cxlds->reg_map.resource for the
+ * duration of the snapshot.
+ *
+ * On success the returned handle is bound to @dev's devres so unwind
+ * happens automatically when @dev is unbound. The handle must not be
+ * freed by the caller.
+ *
+ * Return: a valid &struct cxl_passthrough on success, ERR_PTR(-errno)
+ * on failure.
+ */
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds);
+
+/**
+ * cxl_passthrough_dvsec_rw - read or write the CXL Device DVSEC body shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the start of the DVSEC capability. Must be
+ * >= PCI_DVSEC_CXL_CAP and (off + sz) must lie inside the DVSEC.
+ * Accesses to the PCI ext-cap header bytes (off < PCI_DVSEC_CXL_CAP)
+ * are the caller's responsibility; they belong on the generic
+ * perm-bits path, not here.
+ * @val: pointer to a u32 holding the read result or the write value.
+ * The low @sz bytes of *val are the payload; upper bytes ignored
+ * for writes and zero for reads.
+ * @sz: 1, 2, or 4. Other values return -EINVAL.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow. Writes update the shadow per the spec
+ * attribute mode for the addressed field (LOCK is RWO, CONTROL/CONTROL2
+ * are RWL gated on CONFIG_LOCK, STATUS/STATUS2 are RW1C, RANGE1 is
+ * HwInit, RANGE2 is RsvdZ; HwInit/RsvdZ/Reserved writes are dropped).
+ *
+ * Known limitation: a 4-byte write whose @off straddles a 16-bit DVSEC
+ * field boundary (CONTROL/STATUS at 0x0c/0x0e, CONTROL2/STATUS2 at
+ * 0x10/0x12) applies only the field containing the first byte of the
+ * access; the adjacent 16-bit field is not updated by the same write.
+ * Standard CXL register-access patterns issue separate 2-byte accesses
+ * to CONTROL, STATUS, CONTROL2 and STATUS2, so this corner case is
+ * documented rather than handled.
+ *
+ * Return: 0 on success; -EINVAL on out-of-range or bad size.
+ */
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ size_t sz, bool write);
+
+/**
+ * cxl_passthrough_hdm_rw - read or write the HDM Decoder block shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the HDM block base; must be 4-byte aligned and
+ * (off + 4) <= hdm_reg_size. Sub-dword access is not supported on
+ * HDM registers per CXL r4.0 §8.2.4.
+ * @val: pointer to a u32 holding the read result or the write value.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow. Writes implement the per-decoder
+ * COMMIT/COMMITTED handshake (CTRL) and the RWL gating on BASE/SIZE
+ * imposed by COMMITTED|LOCK_ON_COMMIT. GLOBAL_CTRL is RW; the cap
+ * header is HwInit (writes dropped); other offsets in the per-decoder
+ * stride are RW shadow.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write);
+
+/**
+ * cxl_passthrough_cm_rw - read or write the CXL.cache/mem cap-array snapshot.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from CXL_CM_OFFSET (the start of the CM cap-array
+ * header in the component register block); must be 4-byte aligned
+ * and (off / 4) < cm_snapshot_dwords.
+ * @val: pointer to a u32 holding the read result; ignored on write.
+ * @write: false for read. Writes to the cap-array are silently dropped
+ * (the array headers are RO per CXL r4.0 §8.2.4); the @write
+ * parameter is present only to keep the API symmetric with the
+ * other rw helpers and to make the drop policy explicit at the
+ * call site.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write);
+
+#endif /* __CXL_PASSTHROUGH_H__ */
--
2.25.1
* [PATCH v3 07/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2 acquisition
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Wire vfio-pci-core to acquire CXL Type-2 device state at PCI bind
and release it at PCI unbind, mirroring the existing vfio_pci_zdev_*
integration model. Four lifecycle hooks are introduced:
vfio_pci_cxl_acquire / _release / _open / _close, with
!CONFIG_VFIO_PCI_CXL stubs that return -ENODEV / 0 / 0 / no-op
respectively so vfio-pci behaviour is unchanged when
CONFIG_VFIO_PCI_CXL=n.
vfio_pci_cxl_acquire() implements the bind sequence:
- pcie_is_cxl() and CXL Device DVSEC discovery (-ENODEV if absent
or if MEM_CAPABLE clear — caller falls back to plain vfio-pci)
- devm_cxl_dev_state_create() with struct vfio_pci_cxl_state
embedding cxl_dev_state at offset 0 (required by the 7-arg
macro's static_assert in include/cxl/cxl.h)
- pci_enable_device_mem(), cxl_pci_setup_regs(), cxl_get_hdm_info()
(rejecting hdm_count != 1), cxl_regblock_get_bar_info(),
cxl_await_range_active()
- devm_cxl_passthrough_create() to snapshot the DVSEC body, HDM
block, and CM cap-array shadows owned by cxl-core
- pci_disable_device() — clears PCI_COMMAND_MASTER but NOT
PCI_COMMAND_MEMORY, so cxl-core MMIO accesses from the next step
still succeed
- devm_cxl_probe_mem() to register the cxl_memdev, enumerate the
endpoint port, and attach the firmware-committed autoregion
- request_mem_region() + memremap_wb() of the autoregion's HPA so
the HDM VFIO region can serve guest accesses through it
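
The "embedding at offset 0" requirement in the list above can be
illustrated with a compile-time check. This is a sketch only: the real
static_assert lives in the devm_cxl_dev_state_create() macro, which is
not shown in this patch, and struct model_core_state, struct
model_vfio_state and model_to_vfio() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the core-owned state (cxl_dev_state). */
struct model_core_state {
	int id;
};

/* Hypothetical stand-in for the driver-private wrapper. */
struct model_vfio_state {
	struct model_core_state core;	/* must stay the first member */
	int extra;
};

/* Build-time guard: the embedded core state sits at offset 0. */
static_assert(offsetof(struct model_vfio_state, core) == 0,
	      "core state must be the first member");

/* With offset 0 guaranteed, both pointers share the same address. */
static struct model_vfio_state *model_to_vfio(struct model_core_state *core)
{
	return (struct model_vfio_state *)core;
}
```

A check of this kind lets an allocator that only knows about the core
type hand back memory sized for the wrapper, since converting between
the two pointers needs no offset arithmetic.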
The sequence is fail-closed for confirmed-CXL devices: -ENODEV maps
to plain vfio-pci fall-through; any other negative errno aborts the
vfio-pci bind so the guest never sees a half-initialised CXL device.
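
The fail-closed policy can be written down as a small decision
function. This is a sketch of the policy as the commit message
describes it, not the vfio-pci-core call site itself: enum
bind_outcome and model_bind_policy() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of the bind-time acquire step. */
enum bind_outcome {
	BIND_CXL,		/* acquire succeeded: CXL Type-2 passthrough */
	BIND_PLAIN_VFIO,	/* not a usable CXL device: plain vfio-pci */
	BIND_ABORT,		/* confirmed-CXL probe failure: refuse to bind */
};

/* Map the acquire return code (0 or a negative errno) to an outcome. */
static enum bind_outcome model_bind_policy(int acquire_rc)
{
	assert(acquire_rc <= 0);

	if (acquire_rc == 0)
		return BIND_CXL;
	if (acquire_rc == -ENODEV)
		return BIND_PLAIN_VFIO;
	return BIND_ABORT;
}
```

Only -ENODEV is treated as "not a CXL device"; every other error is
kept distinct so that a partially probed device is never offered to
the guest.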
vfio_pci_cxl_open() / _close() are present as stable call sites for
the region-registration hooks that follow.
Selects CXL_VFIO_PASSTHROUGH so cxl-core's per-device
register-virtualization helpers (drivers/cxl/core/passthrough.c) are
built.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/cxl/Kconfig | 34 +++
drivers/vfio/pci/cxl/Makefile | 2 +
drivers/vfio/pci/cxl/vfio_cxl_core.c | 369 +++++++++++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 71 ++++++
drivers/vfio/pci/vfio_pci_core.c | 24 ++
drivers/vfio/pci/vfio_pci_priv.h | 21 ++
include/linux/vfio_pci_core.h | 7 +
9 files changed, 531 insertions(+)
create mode 100644 drivers/vfio/pci/cxl/Kconfig
create mode 100644 drivers/vfio/pci/cxl/Makefile
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 296bf01e185e..4cd6acd36053 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -58,6 +58,8 @@ config VFIO_PCI_ZDEV_KVM
config VFIO_PCI_DMABUF
def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
+source "drivers/vfio/pci/cxl/Kconfig"
+
source "drivers/vfio/pci/mlx5/Kconfig"
source "drivers/vfio/pci/ism/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 6138f1bf241d..ac26e7494f0a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -3,6 +3,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
+include $(srctree)/$(src)/cxl/Makefile
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/cxl/Kconfig b/drivers/vfio/pci/cxl/Kconfig
new file mode 100644
index 000000000000..5d88999e1256
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Kconfig
@@ -0,0 +1,34 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config VFIO_PCI_CXL
+ bool "VFIO support for CXL Type-2 device passthrough"
+ depends on VFIO_PCI_CORE
+ depends on CXL_BUS
+ depends on CXL_REGION
+ depends on CXL_MEM
+ # CXL providers are tristate; refuse a builtin vfio-pci-core
+ # against modular cxl-core (would fail to link the per-device
+ # helpers in drivers/cxl/core/passthrough.c).
+ depends on CXL_BUS=y || VFIO_PCI_CORE=m
+ depends on CXL_REGION=y || VFIO_PCI_CORE=m
+ depends on CXL_MEM=y || VFIO_PCI_CORE=m
+ select CXL_VFIO_PASSTHROUGH
+ help
+ Support CXL Type-2 (HDM-D, HDM-DB) accelerator device passthrough
+ to a KVM guest. When this option is enabled, vfio-pci-core
+ probes the CXL Register Locator DVSEC at PCI bind time, acquires
+ a cxl_memdev and autoregion via devm_cxl_probe_mem(), and
+ exposes two additional VFIO regions to userspace: a mappable
+ HDM memory region for the device's HPA range, and a COMP_REGS
+ shadow region forwarding HDM Decoder Capability accesses
+ through the cxl-core register-virtualization helpers added by
+ drivers/cxl/core/passthrough.c.
+
+ Devices that do not advertise a CXL Device DVSEC fall back to
+ plain vfio-pci behaviour. Confirmed-CXL devices whose host
+ firmware did not commit an HDM decoder, or whose cxl-core probe
+ otherwise fails, do not bind to vfio-pci at all so the guest is
+ never offered a half-initialised CXL device.
+
+ Scope: firmware-committed, single-decoder, no-interleave.
+
+ Say Y to support CXL Type-2 device passthrough.
diff --git a/drivers/vfio/pci/cxl/Makefile b/drivers/vfio/pci/cxl/Makefile
new file mode 100644
index 000000000000..35e952fe1858
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+vfio-pci-core-$(CONFIG_VFIO_PCI_CXL) += cxl/vfio_cxl_core.o
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
new file mode 100644
index 000000000000..42cd00bbe869
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -0,0 +1,369 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * vfio-pci CXL Type-2 device passthrough — core entry points.
+ *
+ * Four lifecycle hooks are inserted into vfio-pci-core: acquire and
+ * release run at PCI bind / unbind, open and close run on VFIO fd
+ * open / close. This mirrors the existing vfio_pci_zdev_* integration
+ * model.
+ *
+ * vfio_pci_cxl_acquire() runs at PCI bind time. It performs the CXL
+ * register-locator probe and HDM decoder discovery under a brief
+ * pci_enable_device_mem() / pci_disable_device() bracket, then asks
+ * cxl-core to register a cxl_memdev and auto-attach the
+ * firmware-committed region via devm_cxl_probe_mem(). pci_disable_device()
+ * clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+ * do_pci_disable_device() in drivers/pci/pci.c), so the cxl-core
+ * MMIO accesses performed by devm_cxl_probe_mem() after the disable
+ * still succeed even with vfio-pci's PCI enable refcount returned to
+ * zero. The refcount is re-taken cleanly by vfio_pci_core_enable()
+ * at first VFIO fd open.
+ *
+ * Acquisition is fail-closed for confirmed-CXL devices. Devices that
+ * do not advertise a CXL Device DVSEC, and CXL devices whose
+ * MEM_CAPABLE bit is clear, return -ENODEV so the caller falls back
+ * to plain vfio-pci behaviour. Any other negative errno from
+ * acquire() is a confirmed-CXL probe failure (locator missing, HDM
+ * not single-decoder, range-active timeout, passthrough shadow
+ * snapshot failure, devm_cxl_probe_mem() refusal, HDM HPA range busy)
+ * and aborts the vfio-pci bind so the guest never sees a CXL device
+ * with half-initialised cxl-core state.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/range.h>
+#include <linux/vfio_pci_core.h>
+
+#include <uapi/cxl/cxl_regs.h>
+#include <uapi/linux/pci_regs.h>
+#include <uapi/linux/vfio.h>
+
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+#include <cxl/pci.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+MODULE_IMPORT_NS("CXL");
+
+#define VFIO_PCI_CXL_HDM_RES_NAME "vfio-cxl-hdm"
+
+/* ------------------------------------------------------------------ */
+/* Bind-time setup helpers */
+/* ------------------------------------------------------------------ */
+
+static struct vfio_pci_cxl_state *
+vfio_cxl_create_device_state(struct pci_dev *pdev, u16 dvsec)
+{
+ struct vfio_pci_cxl_state *cxl;
+ u32 hdr1;
+ u16 cap;
+ int rc;
+
+ cxl = devm_cxl_dev_state_create(&pdev->dev, CXL_DEVTYPE_DEVMEM,
+ pci_get_dsn(pdev), dvsec,
+ struct vfio_pci_cxl_state,
+ cxlds, false);
+ if (!cxl)
+ return ERR_PTR(-ENOMEM);
+
+ cxl->pdev = pdev;
+
+ rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr1);
+ if (rc) {
+ devm_kfree(&pdev->dev, cxl);
+ return ERR_PTR(-EIO);
+ }
+ cxl->info.dvsec_offset = dvsec;
+ cxl->info.dvsec_size = PCI_DVSEC_HEADER1_LEN(hdr1);
+
+ rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap);
+ if (rc) {
+ devm_kfree(&pdev->dev, cxl);
+ return ERR_PTR(-EIO);
+ }
+ if (!(cap & PCI_DVSEC_CXL_MEM_CAPABLE)) {
+ devm_kfree(&pdev->dev, cxl);
+ return ERR_PTR(-ENODEV);
+ }
+
+ return cxl;
+}
+
+static int vfio_cxl_probe_regs(struct vfio_pci_cxl_state *cxl)
+{
+ struct cxl_dev_state *cxlds = &cxl->cxlds;
+ resource_size_t hdm_off, hdm_size, bar_off;
+ u8 hdm_count, bir;
+ int rc;
+
+ if (WARN_ON_ONCE(!pci_is_enabled(cxl->pdev)))
+ return -EINVAL;
+
+ rc = cxl_pci_setup_regs(cxl->pdev, CXL_REGLOC_RBI_COMPONENT,
+ &cxlds->reg_map);
+ if (rc)
+ return rc;
+
+ rc = cxl_get_hdm_info(cxlds, &hdm_count, &hdm_off, &hdm_size);
+ if (rc)
+ return rc;
+ if (hdm_count != 1) {
+ pci_err(cxl->pdev,
+ "vfio-cxl: hdm_count=%u, only 1 supported\n",
+ hdm_count);
+ return -EOPNOTSUPP;
+ }
+
+ rc = cxl_regblock_get_bar_info(&cxlds->reg_map, &bir, &bar_off);
+ if (rc)
+ return rc;
+
+ cxl->info.hdm_count = hdm_count;
+ cxl->info.hdm_reg_offset = hdm_off;
+ cxl->info.hdm_reg_size = hdm_size;
+ cxl->info.comp_reg_bir = bir;
+ cxl->info.comp_reg_offset = bar_off;
+ cxl->info.comp_reg_size = cxlds->reg_map.max_size;
+ cxl->info.host_firmware_committed = true;
+
+ /*
+ * Range-active polls a config-space bit in the CXL DVSEC, not
+ * MMIO, so it is safe inside or outside the memory-decode
+ * bracket. Keep it here so cxlds->media_ready is set before the
+ * caller drops the PCI enable refcount.
+ */
+ rc = cxl_await_range_active(cxlds);
+ if (rc)
+ return rc;
+ cxlds->media_ready = true;
+ return 0;
+}
+
+static int vfio_cxl_create_memdev(struct vfio_pci_cxl_state *cxl)
+{
+ struct range hpa_range;
+ struct cxl_memdev *cxlmd;
+
+ /*
+ * devm_cxl_probe_mem() runs synchronously: it registers a
+ * cxl_memdev which triggers cxl_mem_probe(), endpoint port
+ * creation, and autoregion attach. Endpoint port probe reads
+ * HDM decoder MMIO via devm_cxl_setup_hdm(); the device must
+ * therefore still be memory-decoded. pci_disable_device() only
+ * clears PCI_COMMAND_MASTER (not _MEMORY), so the paired enable
+ * / disable done by the caller leaves the decode bit asserted
+ * and these reads succeed even with the vfio refcount at zero.
+ */
+ cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &hpa_range);
+ if (IS_ERR(cxlmd))
+ return PTR_ERR(cxlmd);
+
+ cxl->cxlmd = cxlmd;
+ cxl->info.hpa_base = hpa_range.start;
+ cxl->info.hpa_size = range_len(&hpa_range);
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* HDM HPA mapping */
+/* ------------------------------------------------------------------ */
+
+static int vfio_cxl_map_hdm(struct vfio_pci_cxl_state *cxl)
+{
+ phys_addr_t base = cxl->info.hpa_base;
+ u64 size = cxl->info.hpa_size;
+
+ if (!size)
+ return -EINVAL;
+
+ cxl->hdm_res = request_mem_region(base, size,
+ VFIO_PCI_CXL_HDM_RES_NAME);
+ if (!cxl->hdm_res) {
+ pci_err(cxl->pdev,
+ "vfio-cxl: HDM HPA %pa size 0x%llx busy; check firmware mappings\n",
+ &base, size);
+ return -EBUSY;
+ }
+
+ cxl->hdm_kva = memremap(base, size, MEMREMAP_WB);
+ if (!cxl->hdm_kva) {
+ release_mem_region(base, size);
+ cxl->hdm_res = NULL;
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void vfio_cxl_unmap_hdm(struct vfio_pci_cxl_state *cxl)
+{
+ if (cxl->hdm_kva) {
+ memunmap(cxl->hdm_kva);
+ cxl->hdm_kva = NULL;
+ }
+ if (cxl->hdm_res) {
+ release_mem_region(cxl->info.hpa_base, cxl->info.hpa_size);
+ cxl->hdm_res = NULL;
+ }
+}
+
+/* ------------------------------------------------------------------ */
+/* Lifecycle hooks */
+/* ------------------------------------------------------------------ */
+
+int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_pci_cxl_state *cxl;
+ u16 dvsec;
+ int rc;
+
+ if (!pcie_is_cxl(pdev))
+ return -ENODEV;
+
+ dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ return -ENODEV;
+
+ cxl = vfio_cxl_create_device_state(pdev, dvsec);
+ if (IS_ERR(cxl)) {
+ rc = PTR_ERR(cxl);
+ if (rc == -ENODEV)
+ return -ENODEV; /* MEM_CAPABLE clear: treat as non-CXL. */
+ pci_warn(pdev, "vfio-cxl: state alloc failed (%d)\n", rc);
+ return rc;
+ }
+
+ rc = pci_enable_device_mem(pdev);
+ if (rc) {
+ pci_warn(pdev, "vfio-cxl: pci_enable_device_mem failed (%d)\n",
+ rc);
+ goto err_free;
+ }
+
+ rc = vfio_cxl_probe_regs(cxl);
+ if (rc) {
+ pci_disable_device(pdev);
+ pci_warn(pdev, "vfio-cxl: register probe failed (%d)\n", rc);
+ goto err_free;
+ }
+
+ /*
+ * Allocate the cxl-core passthrough handle (DVSEC/HDM/CM
+ * shadows) BEFORE devm_cxl_probe_mem() so that a -ENOMEM or
+ * snapshot -EIO here is recoverable: devm_kfree() the
+ * containing state and let devres unwind cxlds. After
+ * devm_cxl_probe_mem() publishes the memdev, no devm_kfree() is
+ * possible because cxlmd->cxlds points into the state.
+ */
+ cxl->cxlpt = devm_cxl_passthrough_create(&pdev->dev, &cxl->cxlds);
+ if (IS_ERR(cxl->cxlpt)) {
+ rc = PTR_ERR(cxl->cxlpt);
+ cxl->cxlpt = NULL;
+ pci_disable_device(pdev);
+ pci_warn(pdev,
+ "vfio-cxl: passthrough shadow snapshot failed (%d)\n",
+ rc);
+ goto err_free;
+ }
+
+ /*
+ * Drop the PCI enable refcount before publishing the cxl_memdev:
+ * vfio_pci_core_enable() will take a fresh refcount at first VFIO
+ * fd open. PCI_COMMAND_MEMORY stays asserted (see file header).
+ */
+ pci_disable_device(pdev);
+
+ /*
+ * Populate the DPA partition tree on cxlds before
+ * devm_cxl_probe_mem() runs. The endpoint port probe will try to
+ * reserve the firmware-committed HDM decoder range as a DPA
+ * resource child of cxlds->dpa_res; without an explicit
+ * cxl_set_capacity() call dpa_res is zero-sized and the
+ * reservation fails with -EBUSY (see __cxl_dpa_reserve() in
+ * drivers/cxl/core/hdm.c). Read the decoder's SIZE from the
+ * snapshot we just took and size dpa_res to cover it.
+ */
+ {
+ u32 size_lo = 0, size_hi = 0;
+ u64 dpa_size;
+
+ cxl_passthrough_hdm_rw(cxl->cxlpt,
+ CXL_HDM_DECODER0_SIZE_LOW_OFFSET(0),
+ &size_lo, false);
+ cxl_passthrough_hdm_rw(cxl->cxlpt,
+ CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(0),
+ &size_hi, false);
+ dpa_size = ((u64)size_hi << 32) | size_lo;
+
+ rc = cxl_set_capacity(&cxl->cxlds, dpa_size);
+ if (rc) {
+ pci_warn(pdev,
+ "vfio-cxl: cxl_set_capacity(0x%llx) failed (%d)\n",
+ dpa_size, rc);
+ goto err_free;
+ }
+ }
+
+ rc = vfio_cxl_create_memdev(cxl);
+ if (rc) {
+ pci_warn(pdev,
+ "vfio-cxl: memdev/region creation failed (%d)\n", rc);
+ goto err_free;
+ }
+
+ /*
+ * Once devm_cxl_probe_mem() has published a cxl_memdev that
+ * holds a pointer into cxl->cxlds, the state must NOT be
+ * devm_kfree'd. A failure from vfio_cxl_map_hdm() aborts the
+ * bind; the state stays allocated for the lifetime of
+ * the PCI device, and devres unwinds it when the pdev is
+ * removed.
+ */
+ rc = vfio_cxl_map_hdm(cxl);
+ if (rc) {
+ pci_warn(pdev, "vfio-cxl: HDM HPA mapping failed (%d)\n", rc);
+ return rc;
+ }
+
+ vdev->cxl = cxl;
+ pci_info(pdev,
+ "vfio-cxl: acquired (hpa=%pa/0x%llx hdm@0x%llx/0x%llx BAR%u@0x%llx/0x%llx)\n",
+ &cxl->info.hpa_base, cxl->info.hpa_size,
+ cxl->info.hdm_reg_offset, cxl->info.hdm_reg_size,
+ cxl->info.comp_reg_bir,
+ cxl->info.comp_reg_offset, cxl->info.comp_reg_size);
+ return 0;
+
+err_free:
+ devm_kfree(&pdev->dev, cxl);
+ return rc;
+}
+
+void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (cxl)
+ vfio_cxl_unmap_hdm(cxl);
+ vdev->cxl = NULL;
+}
+
+int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
+{
+ /*
+ * Region registration (HDM, COMP_REGS) is added by the next
+ * patch in this series. This hook exists so vfio-pci-core's
+ * fd-open path has a stable call site.
+ */
+ return 0;
+}
+
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+{
+}
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
new file mode 100644
index 000000000000..4ce8f88f8d3d
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved. */
+#ifndef __VFIO_PCI_CXL_PRIV_H__
+#define __VFIO_PCI_CXL_PRIV_H__
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+
+/**
+ * struct vfio_pci_cxl_state - per-device CXL Type-2 passthrough state
+ *
+ * Anchored to a vfio-pci-core device via @vdev->cxl. Allocated by
+ * devm_cxl_dev_state_create() so its lifetime is bound to the PCI
+ * device; the cxl_memdev acquired via devm_cxl_probe_mem() and the
+ * cxl_passthrough handle returned by devm_cxl_passthrough_create()
+ * are similarly devres-anchored.
+ *
+ * @cxlds: CXL device state. MUST be the first member (enforced by
+ * devm_cxl_dev_state_create()'s static_assert).
+ * @pdev: backpointer to the PCI device.
+ * @cxlmd: cxl_memdev acquired at PCI bind via devm_cxl_probe_mem().
+ * @cxlpt: register-virtualization handle owned by cxl-core; vfio
+ * forwards DVSEC config-space, COMP_REGS region, and HDM
+ * block accesses through this opaque pointer. See
+ * Documentation/driver-api/vfio-pci-cxl.rst.
+ * @info: snapshot of cxl-side metadata describing the device's CXL
+ * layout. Filled in during vfio_pci_cxl_acquire() and used
+ * by the VMM-facing helpers (CAP_CXL builder, region info,
+ * COMP_REGS dispatch boundary).
+ * @hdm_region_idx, @comp_reg_region_idx: VFIO region indices.
+ * Assigned by vfio_pci_cxl_open() when the regions are
+ * registered; zero on a device whose fd has never been
+ * opened.
+ * @hdm_res: request_mem_region cookie for the HPA range.
+ * @hdm_kva: memremap(MEMREMAP_WB) mapping of the HPA range. Used
+ * for the HDM region's pread/pwrite path. The mmap fault
+ * handler does vmf_insert_pfn from the physical HPA so the
+ * guest gets the same backing memory the host sees.
+ */
+struct vfio_pci_cxl_state {
+ /* MUST be first member - see devm_cxl_dev_state_create() macro. */
+ struct cxl_dev_state cxlds;
+
+ struct pci_dev *pdev;
+ struct cxl_memdev *cxlmd;
+ struct cxl_passthrough *cxlpt;
+
+ struct {
+ u16 dvsec_offset;
+ u16 dvsec_size;
+ phys_addr_t hpa_base;
+ u64 hpa_size;
+ u8 comp_reg_bir;
+ u64 comp_reg_offset;
+ u64 comp_reg_size;
+ u8 hdm_count;
+ u64 hdm_reg_offset;
+ u64 hdm_reg_size;
+ bool host_firmware_committed;
+ } info;
+
+ u32 hdm_region_idx;
+ u32 comp_reg_region_idx;
+ struct resource *hdm_res;
+ void *hdm_kva;
+};
+
+#endif /* __VFIO_PCI_CXL_PRIV_H__ */
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..05ab4ae59157 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -602,10 +602,25 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
vdev->has_vga = true;
+ /*
+ * Register CXL VFIO regions before mapping BARs. CXL region
+ * registration only list-appends to vdev->region[]; it has no
+ * dependency on vdev->barmap[] being populated. Running it
+ * first means a failure here unwinds through out_free_config
+ * without leaking BAR ioremaps or selected-region requests
+ * (those are released by vfio_pci_core_disable(), which is not
+ * called for a failed open).
+ */
+ ret = vfio_pci_cxl_open(vdev);
+ if (ret)
+ goto out_free_config;
+
vfio_pci_core_map_bars(vdev);
return 0;
+out_free_config:
+ vfio_config_free(vdev);
out_free_zdev:
vfio_pci_zdev_close_device(vdev);
out_free_state:
@@ -699,6 +714,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
vdev->needs_reset = true;
+ vfio_pci_cxl_close(vdev);
vfio_pci_zdev_close_device(vdev);
/*
@@ -2222,6 +2238,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
if (ret)
goto out_vf;
+ ret = vfio_pci_cxl_acquire(vdev);
+ if (ret && ret != -ENODEV)
+ goto out_vga;
+
vfio_pci_probe_power_state(vdev);
/*
@@ -2250,6 +2270,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
pm_runtime_get_noresume(dev);
pm_runtime_forbid(dev);
+ vfio_pci_cxl_release(vdev);
+out_vga:
+ vfio_pci_vga_uninit(vdev);
out_vf:
vfio_pci_vf_uninit(vdev);
return ret;
@@ -2264,6 +2287,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
vfio_pci_vf_uninit(vdev);
vfio_pci_vga_uninit(vdev);
+ vfio_pci_cxl_release(vdev);
if (!disable_idle_d3)
pm_runtime_get_noresume(&vdev->pdev->dev);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..94bf7c6a8548 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -109,6 +109,27 @@ static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
{}
#endif
+#ifdef CONFIG_VFIO_PCI_CXL
+int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev);
+int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev);
+#else
+static inline int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
+{
+ return -ENODEV;
+}
+
+static inline void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev) { }
+
+static inline int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
+{
+ return 0;
+}
+
+static inline void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev) { }
+#endif
+
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
{
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 89165b769e5c..541c1911e090 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -142,6 +142,13 @@ struct vfio_pci_core_device {
struct notifier_block nb;
struct rw_semaphore memory_lock;
struct list_head dmabufs;
+ /*
+ * Opaque pointer to struct vfio_pci_cxl_state (defined in
+ * drivers/vfio/pci/cxl/vfio_cxl_priv.h). Set by
+ * vfio_pci_cxl_acquire() at PCI bind; NULL on non-CXL devices
+ * and when CONFIG_VFIO_PCI_CXL=n.
+ */
+ void *cxl;
};
enum vfio_pci_io_width {
--
2.25.1
^ permalink raw reply related
* [PATCH v3 08/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Complete the vfio-pci-core integration of CXL Type-2 device
passthrough by exposing two VFIO regions to userspace, wiring DVSEC
config-space accesses through cxl-core's register-virtualization
helpers, and reserving the CXL component register block from BAR
mmap and BAR resource claim.
HDM region (VFIO_REGION_SUBTYPE_CXL):
- mmappable view of the device's firmware-committed HPA range
- mmap fault handler calls vmf_insert_pfn() from the physical HPA
so the guest gets the same backing memory the host sees
- pread/pwrite go through the memremap(MEMREMAP_WB) kva captured at
bind time by vfio_cxl_map_hdm()
COMP_REGS region (VFIO_REGION_SUBTYPE_CXL_COMP_REGS):
- pread/pwrite only, dword-aligned (-EINVAL on misalignment)
- thin transport: each dword dispatches by offset to
cxl_passthrough_cm_rw() (CM cap-array snapshot) or
cxl_passthrough_hdm_rw() (HDM Decoder block). No shadow buffer
on the vfio side; all per-field semantics live in cxl-core.
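The COMP_REGS dispatch described above can be sketched in user space as
follows. This is only an illustration, not the driver code: the struct
comp_regs_layout type, both function names, and the layout values used to
exercise them are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of one dword offset in the region. */
enum comp_regs_target { TARGET_NONE, TARGET_CM, TARGET_HDM };

/* Hypothetical layout description; all values are region-relative. */
struct comp_regs_layout {
	uint64_t cm_off;      /* start of the CM cap-array window */
	uint64_t hdm_start;   /* start of the HDM Decoder block */
	uint64_t hdm_size;    /* size of the HDM Decoder block */
	uint64_t region_size; /* total size of the COMP_REGS region */
};

/* Accept only in-bounds, dword-aligned accesses (overflow-safe check). */
static bool comp_regs_access_ok(const struct comp_regs_layout *l,
				uint64_t pos, uint64_t count)
{
	if (pos > l->region_size || count > l->region_size - pos)
		return false;
	return (pos % 4 == 0) && (count % 4 == 0);
}

/* Decide which helper a dword at @pos would be routed to. */
static enum comp_regs_target
comp_regs_classify(const struct comp_regs_layout *l, uint64_t pos)
{
	assert(l->hdm_start >= l->cm_off); /* precondition of this sketch */

	if (pos >= l->cm_off && pos < l->hdm_start)
		return TARGET_CM;
	if (pos >= l->hdm_start && pos < l->hdm_start + l->hdm_size)
		return TARGET_HDM;
	return TARGET_NONE; /* reads as zero, writes are dropped */
}
```

Offsets that classify as TARGET_NONE correspond to the zero-fill / drop
behaviour; no helper is called for them.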
DVSEC config-space access:
- vfio_pci_cxl_config_boundary() clips a chunk at the CXL Device
DVSEC body edge in vfio_pci_config_rw_single() so the generic
perm-bits path handles the DVSEC header bytes and the CXL hook
handles the body bytes. The clipping shim is used instead of
re-pointing the ecap_perms[] readfn/writefn (which would mutate
a module-init static and race across multiple CXL devices).
- vfio_pci_cxl_config_rw() forwards clipped accesses to
cxl_passthrough_dvsec_rw(); cxl-core enforces the per-field
write semantics (LOCK/RWO, CONTROL/RWL, STATUS/RW1C,
RANGE1/HwInit, RANGE2/RsvdZ).
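The clipping rule can be modelled in a few lines of stand-alone C. This
is a sketch under stated assumptions: DVSEC_BODY_REL stands in for the
offset of the first body byte within the DVSEC (assumed to be 0x0a here),
and dvsec_clip() and clip_count() are hypothetical names, not functions
from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed offset of the first DVSEC body byte, relative to the DVSEC. */
#define DVSEC_BODY_REL 0x0a

/*
 * Return how many bytes starting at @pos may be handled as one chunk
 * without crossing either edge of the DVSEC body, or SIZE_MAX when
 * @pos lies past the body and no clip is needed.
 */
static size_t dvsec_clip(uint32_t dvsec_off, uint32_t dvsec_size,
			 uint32_t pos)
{
	uint32_t body_start = dvsec_off + DVSEC_BODY_REL;
	uint32_t body_end = dvsec_off + dvsec_size;

	assert(dvsec_size >= DVSEC_BODY_REL); /* precondition of this sketch */

	if (pos < body_start)
		return body_start - pos; /* stop at the header/body edge */
	if (pos < body_end)
		return body_end - pos;   /* stop at the end of the body */
	return SIZE_MAX;
}

/* Apply the boundary to a requested byte count. */
static size_t clip_count(size_t count, size_t boundary)
{
	return count < boundary ? count : boundary;
}
```

With a DVSEC at 0x100 of size 0x3c, a 4-byte access at 0x108 is clipped
to 2 bytes, so the header bytes and the body bytes are handled by
separate chunks.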
GET_INFO / GET_REGION_INFO:
- VFIO_DEVICE_INFO_CAP_CXL advertises the two region indices, the
component BAR layout, and HOST_FIRMWARE_COMMITTED.
- GET_REGION_INFO on the component BAR returns a sparse-mmap cap
that excludes [comp_reg_offset, comp_reg_offset+comp_reg_size).
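The sparse-mmap exclusion amounts to splitting the BAR into at most two
page-aligned areas around the component register window. A minimal
user-space sketch follows, assuming a 4 KiB page size; struct mmap_area
and sparse_mmap_areas() are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096ULL /* assumed page size for the illustration */

struct mmap_area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Fill @areas with the mmappable parts of a BAR of @bar_len bytes that
 * lie outside [comp_start, comp_start + comp_size), rounded outward to
 * page boundaries. Returns the number of areas written (0..2).
 */
static unsigned int sparse_mmap_areas(uint64_t bar_len, uint64_t comp_start,
				      uint64_t comp_size,
				      struct mmap_area areas[2])
{
	uint64_t comp_end = comp_start + comp_size;
	uint64_t before_end = comp_start & ~(PAGE_SZ - 1);
	uint64_t after_start = (comp_end + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
	unsigned int n = 0;

	if (before_end > 0) {
		areas[n].offset = 0;
		areas[n].size = before_end;
		n++;
	}
	if (after_start < bar_len) {
		areas[n].offset = after_start;
		areas[n].size = bar_len - after_start;
		n++;
	}
	assert(n <= 2);
	return n;
}
```

When the window covers the whole BAR, the function returns 0 and the BAR
would be reported without any mmappable area.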
BAR resource handling:
- cxl-core holds request_mem_region() on the CXL component
register sub-range from devm_cxl_probe_mem(), so vfio-pci-core's
pci_request_selected_regions() on the full BAR would collide.
map_bars() skips the request for the component BAR (still iomaps
it; vfio holds the BAR via driver binding); disable() mirrors
the asymmetric skip.
- mmap of the component BAR refuses any range overlapping the CXL
sub-range via vfio_pci_cxl_mmap_overlaps_comp_regs().
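The refusal rule for component BAR mmap is a half-open interval overlap
test. A stand-alone sketch, where range_overlaps() is a hypothetical name
and the window values in the checks are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when [req_start, req_start + req_len) intersects
 * [win_start, win_start + win_len). Empty ranges never overlap.
 */
static bool range_overlaps(uint64_t req_start, uint64_t req_len,
			   uint64_t win_start, uint64_t win_len)
{
	/* This sketch assumes neither range wraps around 2^64. */
	assert(win_start + win_len >= win_start);
	assert(req_start + req_len >= req_start);

	if (!req_len || !win_len)
		return false;
	return req_start < win_start + win_len &&
	       req_start + req_len > win_start;
}
```

A request that merely touches the window edge, for example one that ends
exactly at the window start, does not count as an overlap.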
vfio_pci_cxl_open() now registers both VFIO regions; close()
unregisters them. Raw BAR rw redirect into the CXL sub-range is
intentionally not implemented: VMMs use the COMP_REGS region
directly.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 521 ++++++++++++++++++++++++++-
drivers/vfio/pci/vfio_pci_config.c | 31 ++
drivers/vfio/pci/vfio_pci_core.c | 44 ++-
drivers/vfio/pci/vfio_pci_priv.h | 72 ++++
drivers/vfio/pci/vfio_pci_rdwr.c | 17 +
5 files changed, 679 insertions(+), 6 deletions(-)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 42cd00bbe869..8a00b776d7c7 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -123,12 +123,24 @@ static int vfio_cxl_probe_regs(struct vfio_pci_cxl_state *cxl)
if (rc)
return rc;
+ /*
+ * The CXL Component Register block is a fixed 64 KiB area (CXL r4.0
+ * §8.2.3). cxl_pci_setup_regs() records the remaining BAR length
+ * after the regblock offset in reg_map.max_size, which is an upper
+ * bound, not the spec-defined size. Bail if the BAR does not have
+ * room for a full component register block at the recorded offset,
+ * and publish the spec size so the UAPI, sparse-mmap exclusion, and
+ * COMP_REGS region all agree on the same window.
+ */
+ if (cxlds->reg_map.max_size < CXL_COMPONENT_REG_BLOCK_SIZE)
+ return -ENXIO;
+
cxl->info.hdm_count = hdm_count;
cxl->info.hdm_reg_offset = hdm_off;
cxl->info.hdm_reg_size = hdm_size;
cxl->info.comp_reg_bir = bir;
cxl->info.comp_reg_offset = bar_off;
- cxl->info.comp_reg_size = cxlds->reg_map.max_size;
+ cxl->info.comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
cxl->info.host_firmware_committed = true;
/*
@@ -354,16 +366,515 @@ void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev)
vdev->cxl = NULL;
}
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev);
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev);
+
int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ int rc;
+
+ if (!cxl)
+ return 0; /* plain vfio-pci device */
+
+ rc = vfio_pci_cxl_register_comp_regs(vdev);
+ if (rc) {
+ pci_warn(vdev->pdev,
+ "vfio-cxl: COMP_REGS region register failed (%d)\n",
+ rc);
+ return rc;
+ }
+
+ rc = vfio_pci_cxl_register_hdm(vdev);
+ if (rc) {
+ pci_warn(vdev->pdev,
+ "vfio-cxl: HDM region register failed (%d)\n", rc);
+ /*
+ * COMP_REGS already registered above. vfio core does not
+ * call close_device() when open_device() returns an error,
+ * so roll back the COMP_REGS dynamic region here to avoid
+ * a leaked half-registered open state.
+ */
+ vfio_pci_cxl_close(vdev);
+ return rc;
+ }
+ return 0;
+}
+
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ unsigned int i;
+
+ if (!cxl)
+ return;
+
+ for (i = vdev->num_regions; i > 0; i--) {
+ struct vfio_pci_region *r = &vdev->region[i - 1];
+
+ if (r->data != cxl)
+ break;
+ if (r->ops->release)
+ r->ops->release(vdev, r);
+ vdev->num_regions--;
+ }
+}
+
+/* ------------------------------------------------------------------ */
+/* HDM region: mmappable view of the device's HPA range */
+/* ------------------------------------------------------------------ */
+
+static vm_fault_t hdm_region_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct vfio_pci_cxl_state *cxl = vma->vm_private_data;
+ unsigned long off = (vmf->address - vma->vm_start) +
+ (vma->vm_pgoff << PAGE_SHIFT);
+ phys_addr_t pa;
+
+ if (!cxl || !cxl->info.hpa_size)
+ return VM_FAULT_SIGBUS;
+ if (off >= cxl->info.hpa_size)
+ return VM_FAULT_SIGBUS;
+
+ pa = cxl->info.hpa_base + off;
+ return vmf_insert_pfn(vma, vmf->address, PHYS_PFN(pa));
+}
+
+static const struct vm_operations_struct hdm_region_vm_ops = {
+ .fault = hdm_region_fault,
+};
+
+static int hdm_region_mmap(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region,
+ struct vm_area_struct *vma)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ pgoff_t pgoff;
+ u64 req_start, req_len;
+
+ if (!cxl || !cxl->info.hpa_size)
+ return -ENODEV;
+
/*
- * Region registration (HDM, COMP_REGS) is added by the next
- * patch in this series. This hook exists so vfio-pci-core's
- * fd-open path has a stable call site.
+ * vfio_pci_core_mmap() forwards the VMA with vm_pgoff still
+ * carrying the VFIO region index in the high bits. Mask it off
+ * so req_start is the in-region offset; also overwrite vm_pgoff
+ * with the normalised value so the fault handler computes the
+ * physical address from a clean offset.
*/
+ pgoff = vma->vm_pgoff &
+ ((1ULL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+ req_start = (u64)pgoff << PAGE_SHIFT;
+ req_len = vma->vm_end - vma->vm_start;
+ if (req_start > cxl->info.hpa_size ||
+ req_len > cxl->info.hpa_size - req_start)
+ return -EINVAL;
+
+ vma->vm_pgoff = pgoff;
+ vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+ vma->vm_ops = &hdm_region_vm_ops;
+ vma->vm_private_data = cxl;
return 0;
}
-void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+static ssize_t hdm_region_rw(struct vfio_pci_core_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ void *kva;
+
+ if (!cxl || !cxl->hdm_kva)
+ return -EINVAL;
+ if (pos < 0 || (u64)pos > cxl->info.hpa_size ||
+ count > cxl->info.hpa_size - (u64)pos)
+ return -EINVAL;
+
+ kva = (u8 *)cxl->hdm_kva + pos;
+ if (iswrite) {
+ if (copy_from_user(kva, buf, count))
+ return -EFAULT;
+ } else {
+ if (copy_to_user(buf, kva, count))
+ return -EFAULT;
+ }
+
+ *ppos += count;
+ return count;
+}
+
+static void hdm_region_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_hdm_ops = {
+ .rw = hdm_region_rw,
+ .mmap = hdm_region_mmap,
+ .release = hdm_region_release,
+};
+
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+ u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+ int rc;
+
+ rc = vfio_pci_core_register_dev_region(vdev, region_type,
+ VFIO_REGION_SUBTYPE_CXL,
+ &vfio_pci_cxl_hdm_ops,
+ cxl->info.hpa_size,
+ region_flags, cxl);
+ if (rc)
+ return rc;
+
+ cxl->hdm_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* COMP_REGS region: thin transport to cxl-core register helpers */
+/* ------------------------------------------------------------------ */
+
+/*
+ * COMP_REGS exposes the CXL component register sub-range of the
+ * device's component BAR as a pread/pwrite-only VFIO region. Access
+ * is dword-only (4-byte aligned); sub-dword access returns -EINVAL.
+ * Each dword is dispatched by offset to a cxl-core rw helper or zero-filled:
+ *
+ * pos < CXL_CM_OFFSET → zero-fill / drop
+ * CXL_CM_OFFSET <= pos < hdm_reg_offset → cxl_passthrough_cm_rw
+ * hdm_reg_offset <= pos < hdm_reg_offset+size → cxl_passthrough_hdm_rw
+ * pos >= hdm_reg_offset + hdm_reg_size → zero-fill / drop
+ *
+ * vfio holds no shadow buffer of its own; the per-field write
+ * semantics live entirely in cxl-core.
+ */
+static ssize_t comp_regs_rw(struct vfio_pci_core_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ resource_size_t cm_off, hdm_start, hdm_end;
+ size_t done = 0;
+
+ if (!cxl || !cxl->cxlpt)
+ return -EINVAL;
+ if (pos < 0 || (u64)pos > cxl->info.comp_reg_size ||
+ count > cxl->info.comp_reg_size - (u64)pos)
+ return -EINVAL;
+ if (!IS_ALIGNED(pos, 4) || !IS_ALIGNED(count, 4))
+ return -EINVAL;
+
+ cm_off = CXL_CM_OFFSET;
+ hdm_start = cxl->info.hdm_reg_offset;
+ hdm_end = hdm_start + cxl->info.hdm_reg_size;
+
+ while (done < count) {
+ __le32 le = 0;
+ u32 v32 = 0;
+ int rc;
+
+ if (iswrite) {
+ if (copy_from_user(&le, buf + done, 4))
+ return done ?: -EFAULT;
+ v32 = le32_to_cpu(le);
+ }
+
+ if (pos >= cm_off && pos < hdm_start) {
+ rc = cxl_passthrough_cm_rw(cxl->cxlpt,
+ (u32)(pos - cm_off),
+ &v32, iswrite);
+ if (rc)
+ return done ?: rc;
+ } else if (pos >= hdm_start && pos < hdm_end) {
+ rc = cxl_passthrough_hdm_rw(cxl->cxlpt,
+ (u32)(pos - hdm_start),
+ &v32, iswrite);
+ if (rc)
+ return done ?: rc;
+ } else if (!iswrite) {
+ v32 = 0; /* outside modelled ranges: read 0 */
+ }
+ /* writes outside modelled ranges are silently dropped */
+
+ if (!iswrite) {
+ le = cpu_to_le32(v32);
+ if (copy_to_user(buf + done, &le, 4))
+ return done ?: -EFAULT;
+ }
+
+ pos += 4;
+ done += 4;
+ }
+
+ *ppos += done;
+ return done;
+}
+
+static void comp_regs_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_comp_regs_ops = {
+ .rw = comp_regs_rw,
+ .release = comp_regs_release,
+};
+
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+ u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+ int rc;
+
+ rc = vfio_pci_core_register_dev_region(vdev, region_type,
+ VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
+ &vfio_pci_cxl_comp_regs_ops,
+ cxl->info.comp_reg_size,
+ region_flags, cxl);
+ if (rc)
+ return rc;
+
+ cxl->comp_reg_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* DVSEC config-space clipping shim */
+/* ------------------------------------------------------------------ */
+
+/*
+ * vfio_pci_cxl_config_boundary - clip a config-rw chunk at the DVSEC body edge
+ *
+ * Returns the maximum byte count the caller may pass through the
+ * generic chunker without straddling the CXL Device DVSEC body
+ * boundary, or SIZE_MAX when no clip is required. Used by
+ * vfio_pci_config_rw_single() so the DVSEC header bytes stay on the
+ * generic perm-bits path and the body bytes reach the CXL hook.
+ */
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+ loff_t pos)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 body_start, body_end;
+
+ if (!cxl)
+ return SIZE_MAX;
+
+ body_start = cxl->info.dvsec_offset + PCI_DVSEC_CXL_CAP;
+ body_end = cxl->info.dvsec_offset + cxl->info.dvsec_size;
+
+ if (pos < body_start)
+ return body_start - pos;
+ if (pos < body_end)
+ return body_end - pos;
+ return SIZE_MAX;
+}
+
+/*
+ * vfio_pci_cxl_config_rw - forward CXL DVSEC config accesses to cxl-core
+ *
+ * Returns the number of bytes processed on success, -ENOENT if the
+ * access lies entirely outside the CXL Device DVSEC body (caller
+ * takes the standard perm-bits path), or another negative errno on
+ * hard failure. vfio_pci_config_rw_single() applies
+ * vfio_pci_cxl_config_boundary() before width selection, so any
+ * access that reaches here was already clipped to lie entirely inside
+ * the DVSEC body.
+ */
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+ loff_t pos, size_t count, __le32 *val,
+ bool iswrite)
{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 dvsec_off, body_start, body_end, off;
+ u32 host_val;
+ int rc;
+
+ if (!cxl || !cxl->cxlpt)
+ return -ENOENT;
+
+ dvsec_off = cxl->info.dvsec_offset;
+ body_start = dvsec_off + PCI_DVSEC_CXL_CAP;
+ body_end = dvsec_off + cxl->info.dvsec_size;
+
+ if (pos + count <= body_start || pos >= body_end)
+ return -ENOENT;
+ if (WARN_ON_ONCE(pos < body_start || pos + count > body_end))
+ return -EINVAL; /* caller failed to clip at body boundary */
+
+ off = (u32)(pos - dvsec_off);
+ host_val = iswrite ? le32_to_cpu(*val) : 0;
+
+ rc = cxl_passthrough_dvsec_rw(cxl->cxlpt, off, &host_val, count,
+ iswrite);
+ if (rc)
+ return rc;
+
+ if (!iswrite)
+ *val = cpu_to_le32(host_val);
+ return count;
+}
+
+/* ------------------------------------------------------------------ */
+/* GET_INFO / GET_REGION_INFO / mmap helpers */
+/* ------------------------------------------------------------------ */
+
+u8 vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ return cxl ? cxl->info.comp_reg_bir : U8_MAX;
+}
+
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size)
+ return false;
+
+ *start = cxl->info.comp_reg_offset;
+ *end = cxl->info.comp_reg_offset + cxl->info.comp_reg_size;
+ return true;
+}
+
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size)
+ return false;
+
+ return req_start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+ req_start + req_len > cxl->info.comp_reg_offset;
+}
+
+/*
+ * vfio_pci_cxl_bar_overlaps_comp_regs - check whether a BAR-relative access
+ * overlaps the CXL component register sub-range.
+ *
+ * Returns true when @bar is the component BAR and the [@start, @start + @len)
+ * window overlaps [comp_reg_offset, comp_reg_offset + comp_reg_size). Used
+ * by the raw BAR read/write and ioeventfd paths to reject accesses that
+ * would bypass the COMP_REGS region and reach the physical component
+ * registers directly, sidestepping cxl-core's shadow and per-field write
+ * semantics.
+ */
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size || !len)
+ return false;
+ if (bar != cxl->info.comp_reg_bir)
+ return false;
+
+ return start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+ start + len > cxl->info.comp_reg_offset;
+}
+
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct vfio_device_info_cap_cxl cap = { };
+
+ if (!cxl)
+ return 0;
+
+ cap.header.id = VFIO_DEVICE_INFO_CAP_CXL;
+ cap.header.version = 1;
+ if (cxl->info.host_firmware_committed)
+ cap.flags |= VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED;
+ cap.hdm_region_idx = cxl->hdm_region_idx;
+ cap.comp_reg_region_idx = cxl->comp_reg_region_idx;
+ cap.comp_reg_bar = cxl->info.comp_reg_bir;
+ cap.comp_reg_offset = cxl->info.comp_reg_offset;
+ cap.comp_reg_size = cxl->info.comp_reg_size;
+
+ return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+/*
+ * Build a VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL
+ * component register block from the mmappable areas of the
+ * component BAR. Returns -ENOTTY when the request is not for the
+ * component BAR or the component BAR is not mmappable; the caller
+ * (vfio_pci_ioctl_get_region_info) then continues with the standard
+ * BAR path.
+ */
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ u64 bar_len, comp_start, comp_end;
+ u64 before_end, after_start;
+ struct vfio_region_sparse_mmap_area areas[2];
+ u32 nr_areas = 0, cap_size;
+ int ret;
+
+ if (!cxl)
+ return -ENOTTY;
+ if (info->index != cxl->info.comp_reg_bir)
+ return -ENOTTY;
+ if (!cxl->info.comp_reg_size)
+ return -ENOTTY;
+ if (!vdev->bar_mmap_supported[info->index])
+ return -ENOTTY;
+
+ bar_len = pci_resource_len(vdev->pdev, info->index);
+ comp_start = cxl->info.comp_reg_offset;
+ comp_end = comp_start + cxl->info.comp_reg_size;
+
+ before_end = round_down(comp_start, PAGE_SIZE);
+ after_start = round_up(comp_end, PAGE_SIZE);
+
+ if (before_end > 0) {
+ areas[nr_areas].offset = 0;
+ areas[nr_areas].size = before_end;
+ nr_areas++;
+ }
+ if (after_start < bar_len) {
+ areas[nr_areas].offset = after_start;
+ areas[nr_areas].size = bar_len - after_start;
+ nr_areas++;
+ }
+
+ info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
+ info->size = bar_len;
+ info->flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+ if (!nr_areas)
+ return 0;
+
+ info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+
+ cap_size = struct_size(sparse, areas, nr_areas);
+ sparse = kzalloc(cap_size, GFP_KERNEL);
+ if (!sparse)
+ return -ENOMEM;
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+ memcpy(sparse->areas, areas, nr_areas * sizeof(areas[0]));
+
+ ret = vfio_info_add_capability(caps, &sparse->header, cap_size);
+ kfree(sparse);
+ return ret;
}
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a10ed733f0e3..b9f30a33515a 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1898,8 +1898,15 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
/*
* Chop accesses into aligned chunks containing no more than a
* single capability. Caller increments to the next chunk.
+ *
+ * For CXL Type-2 devices also clip at the CXL Device DVSEC body
+ * boundary so the generic perm-bits path handles the DVSEC
+ * header bytes and the CXL hook handles the body bytes; without
+ * this clip a 32-bit access at dvsec + 0x08 would span the
+ * generic Header2 word and the CXL CAPABILITY word.
*/
count = min(count, vfio_pci_cap_remaining_dword(vdev, *ppos));
+ count = min(count, vfio_pci_cxl_config_boundary(vdev, *ppos));
if (count >= 4 && !(*ppos % 4))
count = 4;
else if (count >= 2 && !(*ppos % 2))
@@ -1909,6 +1916,30 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
ret = count;
+ /*
+ * Give the CXL Type-2 hook first claim on this access: if the
+ * range lies inside the CXL Device DVSEC body, forward it to
+ * cxl-core's register-virtualization helpers instead of the
+ * standard perm-bits path. -ENOENT means "not for me; use the
+ * default path"; any other negative value is a hard error.
+ */
+ if (vdev->cxl) {
+ __le32 le_val = 0;
+ ssize_t cxl_ret;
+
+ if (iswrite && copy_from_user(&le_val, buf, count))
+ return -EFAULT;
+ cxl_ret = vfio_pci_cxl_config_rw(vdev, *ppos, count, &le_val,
+ iswrite);
+ if (cxl_ret >= 0) {
+ if (!iswrite && copy_to_user(buf, &le_val, count))
+ return -EFAULT;
+ return cxl_ret;
+ }
+ if (cxl_ret != -ENOENT)
+ return cxl_ret;
+ }
+
cap_id = vdev->pci_config_map[*ppos];
if (cap_id == PCI_CAP_ID_INVALID) {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 05ab4ae59157..2d2dae278d1e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -501,6 +501,23 @@ static void vfio_pci_core_map_bars(struct vfio_pci_core_device *vdev)
if (!pci_resource_len(pdev, i))
continue;
+ /*
+ * cxl-core already holds request_mem_region() on the CXL
+ * component register sub-range of this BAR. Skip the
+ * full-BAR request so we do not collide with that
+ * sub-region; vfio still owns the BAR via the driver
+ * binding and the iomap below succeeds without a region
+ * claim.
+ */
+ if (vdev->cxl && bar == vfio_pci_cxl_get_component_reg_bar(vdev)) {
+ vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+ if (!vdev->barmap[bar]) {
+ pci_dbg(pdev, "Failed to iomap region %d\n", bar);
+ vdev->barmap[bar] = IOMEM_ERR_PTR(-ENOMEM);
+ }
+ continue;
+ }
+
if (pci_request_selected_regions(pdev, 1 << bar, "vfio")) {
pci_dbg(pdev, "Failed to reserve region %d\n", bar);
vdev->barmap[bar] = IOMEM_ERR_PTR(-EBUSY);
@@ -701,7 +718,10 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
if (IS_ERR_OR_NULL(vdev->barmap[bar]))
continue;
pci_iounmap(pdev, vdev->barmap[bar]);
- pci_release_selected_regions(pdev, 1 << bar);
+ /* Mirror the asymmetric setup-time skip in map_bars(). */
+ if (!(vdev->cxl &&
+ bar == vfio_pci_cxl_get_component_reg_bar(vdev)))
+ pci_release_selected_regions(pdev, 1 << bar);
vdev->barmap[bar] = NULL;
}
@@ -1051,6 +1071,16 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
info.num_irqs = VFIO_PCI_NUM_IRQS;
+ if (vdev->cxl) {
+ ret = vfio_pci_cxl_get_info(vdev, &caps);
+ if (ret) {
+ pci_warn(vdev->pdev,
+ "Failed to add CXL info capability\n");
+ return ret;
+ }
+ info.flags |= VFIO_DEVICE_FLAGS_CXL;
+ }
+
ret = vfio_pci_info_zdev_add_caps(vdev, &caps);
if (ret && ret != -ENODEV) {
pci_warn(vdev->pdev,
@@ -1093,6 +1123,12 @@ int vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
struct pci_dev *pdev = vdev->pdev;
int i, ret;
+ if (vdev->cxl) {
+ ret = vfio_pci_cxl_get_region_info(vdev, info, caps);
+ if (ret != -ENOTTY)
+ return ret;
+ }
+
switch (info->index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
@@ -1811,6 +1847,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
if (req_start + req_len > phys_len)
return -EINVAL;
+ /* Block mmap of the CXL component register block. */
+ if (vdev->cxl &&
+ index == vfio_pci_cxl_get_component_reg_bar(vdev) &&
+ vfio_pci_cxl_mmap_overlaps_comp_regs(vdev, req_start, req_len))
+ return -EINVAL;
+
/*
* Even though we don't make use of the barmap for the mmap,
* we need to request the region and the barmap tracks that.
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 94bf7c6a8548..88b89da6dd5a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -114,6 +114,23 @@ int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev);
void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev);
int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev);
void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev);
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+ loff_t pos);
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+ loff_t pos, size_t count, __le32 *val,
+ bool iswrite);
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps);
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps);
+u8 vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev);
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end);
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len);
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len);
#else
static inline int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
{
@@ -128,6 +145,61 @@ static inline int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
}
static inline void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev) { }
+
+static inline size_t
+vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev, loff_t pos)
+{
+ return SIZE_MAX;
+}
+
+static inline ssize_t
+vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev, loff_t pos,
+ size_t count, __le32 *val, bool iswrite)
+{
+ return -ENOENT;
+}
+
+static inline int
+vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{
+ return 0;
+}
+
+static inline int
+vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ return -ENOTTY;
+}
+
+static inline u8
+vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+ return U8_MAX;
+}
+
+static inline bool
+vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end)
+{
+ return false;
+}
+
+static inline bool
+vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len)
+{
+ return false;
+}
+
+static inline bool
+vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len)
+{
+ return false;
+}
#endif
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 3bfbb879a005..a856f29a3c94 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -236,6 +236,15 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
count = min(count, (size_t)(end - pos));
+ /*
+ * Reject raw BAR access that would land inside the CXL component
+ * register sub-range. cxl-core owns the per-field shadow and
+ * spec-defined write semantics; userspace must use the dedicated
+ * COMP_REGS VFIO region for that range.
+ */
+ if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+ return -EINVAL;
+
if (bar == PCI_ROM_RESOURCE) {
/*
* The ROM can fill less space than the BAR, so we start the
@@ -437,6 +446,14 @@ int vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
pos >= vdev->msix_offset + vdev->msix_size))
return -EINVAL;
+ /*
+ * Disallow arming ioeventfds against the CXL component register
+ * sub-range; that area is fronted by cxl-core's shadow and must
+ * not be reached through the raw BAR map.
+ */
+ if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+ return -EINVAL;
+
if (count == 8)
return -EINVAL;
--
2.25.1
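For readers following the sparse-mmap logic in vfio_pci_cxl_get_region_info() above, the page-granular carve-out can be sketched in standalone C as below. This is only an illustration under stated assumptions: struct area, align_down(), align_up(), sparse_areas() and overlaps() are hypothetical names rather than the patch's API, and the page size is passed in explicitly instead of using PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vfio_region_sparse_mmap_area. */
struct area {
	uint64_t offset;
	uint64_t size;
};

/* Rounding helpers; align must be a non-zero power of two. */
static uint64_t align_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Fill out[] with the mmappable areas of a BAR [0, bar_len) that stay
 * clear of [comp_off, comp_off + comp_size) at page granularity.
 * Returns the number of areas written (0, 1 or 2).
 */
static unsigned int sparse_areas(uint64_t bar_len, uint64_t comp_off,
				 uint64_t comp_size, uint64_t page,
				 struct area out[2])
{
	uint64_t before_end, after_start;
	unsigned int n = 0;

	assert(page && !(page & (page - 1)));

	before_end = align_down(comp_off, page);
	after_start = align_up(comp_off + comp_size, page);

	if (before_end > 0) {
		out[n].offset = 0;
		out[n].size = before_end;
		n++;
	}
	if (after_start < bar_len) {
		out[n].offset = after_start;
		out[n].size = bar_len - after_start;
		n++;
	}
	return n;
}

/* Half-open interval test: does [start, start + len) touch the block? */
static int overlaps(uint64_t start, uint64_t len,
		    uint64_t comp_off, uint64_t comp_size)
{
	return len && start < comp_off + comp_size && start + len > comp_off;
}
```

With a 64 KiB BAR and a 4 KiB component register block at offset 0x1000, the sketch yields two areas, [0, 0x1000) and [0x2000, 0x10000), and neither overlaps the block. Rounding the excluded range outward to page boundaries is what keeps a page that is only partly covered by component registers out of every advertised area.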
* [PATCH v3 09/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Exercise the user-visible contract added by CONFIG_VFIO_PCI_CXL:
  device_is_cxl                GET_INFO returns VFIO_DEVICE_FLAGS_CXL
                               and a populated VFIO_DEVICE_INFO_CAP_CXL.

  hdm_region_mmap_rw           mmap() one page of the HDM region,
                               write a pattern, read it back. Proves
                               the mmap fault handler's vmf_insert_pfn
                               path and the firmware-committed HPA
                               mapping.

  component_bar_sparse_mmap    GET_REGION_INFO on the component BAR
                               advertises a SPARSE_MMAP cap, and every
                               advertised mmappable area lies outside
                               [comp_reg_offset, +comp_reg_size).

  comp_regs_cm_cap_array_read  pread() of the COMP_REGS region at
                               CXL_CM_OFFSET returns a valid CM
                               cap-array header (CAP_ID == 1,
                               ARRAY_SIZE > 0). Proves the
                               cxl_passthrough_cm_rw() dispatch is
                               wired.

  dvsec_lock_byte_read         pread() of the DVSEC CONFIG_LOCK byte
                               through the config-rw clipping shim
                               succeeds. Proves the
                               cxl_passthrough_dvsec_rw() path is
                               wired.
DVSEC LOCK latch behaviour is out of scope for this smoke test. The
per-decoder COMMIT/COMMITTED state machine is exercised separately by
the hdm_decoder_commit_fsm test. No debugfs dependency.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
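As a reading aid for the cap-array checks described above, here is a minimal sketch of decoding the CM cap-array header and locating the HDM entry from an in-memory copy of the dwords. The shifts (24 for ARRAY_SIZE, 20 for the entry pointer) follow what the selftest does; the 16-bit ID mask and the HDM capability ID value of 5 are assumptions meant to mirror the kernel's CXL_CM_CAP_* definitions, and find_hdm_offset() is a hypothetical helper, not part of the patch.

```c
#include <stdint.h>

/* Assumed field layout of the CM cap-array dwords (see lead-in). */
#define CM_HDR_ID_MASK		0x0000ffffu
#define CM_HDR_ARRAY_SIZE_SHIFT	24
#define CM_ENTRY_PTR_SHIFT	20
#define CM_CAP_ID_ARRAY		1u	/* CAP_ID expected in dword 0 */
#define CM_CAP_ID_HDM		5u	/* assumed HDM decoder capability ID */

/*
 * regs[] holds the cap-array dwords starting at the array header;
 * nregs is how many dwords are valid. Returns the HDM block offset
 * relative to the start of the array, or 0 if the header is not a
 * cap-array header or no HDM entry is present.
 */
static uint32_t find_hdm_offset(const uint32_t *regs, uint32_t nregs)
{
	uint32_t array_size, i;

	if (!nregs || (regs[0] & CM_HDR_ID_MASK) != CM_CAP_ID_ARRAY)
		return 0;

	array_size = regs[0] >> CM_HDR_ARRAY_SIZE_SHIFT;

	/* Entries start at dword 1; never read past the caller's buffer. */
	for (i = 1; i <= array_size && i < nregs; i++) {
		if ((regs[i] & CM_HDR_ID_MASK) == CM_CAP_ID_HDM)
			return regs[i] >> CM_ENTRY_PTR_SHIFT;
	}
	return 0;
}
```

The selftest performs the same walk with one pread() per dword against the COMP_REGS region; working on a buffer here just keeps the sketch free of any device dependency.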
tools/testing/selftests/vfio/Makefile | 1 +
.../selftests/vfio/lib/vfio_pci_device.c | 11 +-
.../selftests/vfio/vfio_cxl_type2_test.c | 350 ++++++++++++++++++
3 files changed, 361 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 0684932d91bf..25f2a9420ef6 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -12,6 +12,7 @@ TEST_GEN_PROGS += vfio_iommufd_setup_test
TEST_GEN_PROGS += vfio_pci_device_test
TEST_GEN_PROGS += vfio_pci_device_init_perf_test
TEST_GEN_PROGS += vfio_pci_driver_test
+TEST_GEN_PROGS += vfio_cxl_type2_test
TEST_FILES += scripts/cleanup.sh
TEST_FILES += scripts/lib.sh
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
index fc75e04ef010..d2150129d854 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
@@ -281,7 +281,16 @@ static void vfio_pci_device_setup(struct vfio_pci_device *device)
struct vfio_pci_bar *bar = device->bars + i;
vfio_pci_region_get(device, i, &bar->info);
- if (bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP)
+ /*
+ * Skip auto-mmap when the BAR advertises region-info caps
+ * (e.g. VFIO_REGION_INFO_CAP_SPARSE_MMAP). Such BARs are
+ * only partially mmappable; the kernel rejects full-BAR
+ * mmaps and the caller must walk the sparse-area cap and
+ * mmap each advertised area separately. Tests that need
+ * access to such a BAR handle the per-area mmap themselves.
+ */
+ if ((bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP) &&
+ !(bar->info.flags & VFIO_REGION_INFO_FLAG_CAPS))
vfio_pci_bar_map(device, i);
}
diff --git a/tools/testing/selftests/vfio/vfio_cxl_type2_test.c b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
new file mode 100644
index 000000000000..bc98a29f90ad
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vfio_cxl_type2_test - smoke + dispatch tests for CXL Type-2 device
+ * passthrough through vfio-pci.
+ *
+ * Exercises the user-visible surface gated by CONFIG_VFIO_PCI_CXL:
+ * - GET_INFO returns VFIO_DEVICE_FLAGS_CXL + a populated CAP_CXL.
+ * - The HDM-backed VFIO region can be mmap'd and read/written.
+ * - The component BAR exposes a SPARSE_MMAP cap that excludes the
+ * CXL component register sub-range.
+ * - The COMP_REGS region serves CM cap-array dwords from cxl-core's
+ * snapshot (proves the cxl_passthrough_cm_rw() path is wired).
+ * - DVSEC body reads through the config-rw clipping shim return the
+ * cxl-core shadow (proves cxl_passthrough_dvsec_rw() is wired).
+ *
+ * Usage:
+ * ./vfio_cxl_type2_test <BDF>
+ * or export VFIO_SELFTESTS_BDF=<BDF> before running. The device must
+ * be bound to vfio-pci and the kernel must have CONFIG_VFIO_PCI_CXL=y.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/pci_regs.h>
+#include <linux/sizes.h>
+#include <linux/vfio.h>
+
+#include <cxl/cxl_regs.h>
+
+#include <libvfio.h>
+
+#include "kselftest_harness.h"
+
+#define PCI_DVSEC_VENDOR_ID_CXL 0x1e98
+#define PCI_DVSEC_ID_CXL_DEVICE 0x0000
+
+/*
+ * vfio-pci's region offset packing (kernel-internal in
+ * include/linux/vfio_pci_core.h, not exposed via UAPI as of writing).
+ * Provide local definitions so the selftest builds against the bare
+ * UAPI vfio.h. The guards let a future kernel hoist these to UAPI
+ * without breaking this test.
+ */
+#ifndef VFIO_PCI_OFFSET_SHIFT
+#define VFIO_PCI_OFFSET_SHIFT 40
+#endif
+#ifndef VFIO_PCI_INDEX_TO_OFFSET
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
+#endif
+
+static const char *device_bdf;
+
+/* Find a struct vfio_device_info capability by id in a GET_INFO buffer. */
+static const struct vfio_info_cap_header *
+find_device_cap(const void *buf, size_t bufsz, uint16_t id)
+{
+ const struct vfio_device_info *info = buf;
+ const struct vfio_info_cap_header *cap;
+ size_t off = info->cap_offset;
+
+ while (off && off < bufsz) {
+ cap = (const void *)((const char *)buf + off);
+ if (cap->id == id)
+ return cap;
+ off = cap->next;
+ }
+ return NULL;
+}
+
+/* Walk PCI extended capability list for the CXL Device DVSEC. */
+static uint16_t find_cxl_dvsec(struct vfio_pci_device *dev)
+{
+ uint16_t pos = PCI_CFG_SPACE_SIZE;
+ int iter = 0;
+
+ while (pos && iter++ < 64) {
+ uint32_t hdr = vfio_pci_config_readl(dev, pos);
+ uint16_t cap_id = hdr & 0xffff;
+ uint16_t next = (hdr >> 20) & 0xffc;
+ uint32_t hdr1, hdr2;
+
+ if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
+ hdr1 = vfio_pci_config_readl(dev, pos + 4);
+ hdr2 = vfio_pci_config_readl(dev, pos + 8);
+ if ((hdr1 & 0xffff) == PCI_DVSEC_VENDOR_ID_CXL &&
+ (hdr2 & 0xffff) == PCI_DVSEC_ID_CXL_DEVICE)
+ return pos;
+ }
+ pos = next;
+ }
+ return 0;
+}
+
+FIXTURE(cxl_type2) {
+ struct iommu *iommu;
+ struct vfio_pci_device *dev;
+
+ struct vfio_device_info_cap_cxl cxl_cap;
+ uint16_t dvsec_base;
+
+ uint64_t hdm_region_size;
+ uint64_t comp_regs_size;
+};
+
+FIXTURE_SETUP(cxl_type2)
+{
+ uint8_t infobuf[512] = {};
+ struct vfio_device_info *info = (void *)infobuf;
+ const struct vfio_device_info_cap_cxl *cap;
+ struct vfio_region_info ri = { .argsz = sizeof(ri) };
+
+ self->iommu = iommu_init(default_iommu_mode);
+ self->dev = vfio_pci_device_init(device_bdf, self->iommu);
+
+ info->argsz = sizeof(infobuf);
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_INFO, info));
+
+ if (!(info->flags & VFIO_DEVICE_FLAGS_CXL))
+ SKIP(return, "not a CXL Type-2 device");
+
+ cap = (const void *)find_device_cap(infobuf, sizeof(infobuf),
+ VFIO_DEVICE_INFO_CAP_CXL);
+ ASSERT_NE(NULL, cap);
+ memcpy(&self->cxl_cap, cap, sizeof(*cap));
+
+ ri.index = self->cxl_cap.hdm_region_idx;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &ri));
+ self->hdm_region_size = ri.size;
+
+ ri.argsz = sizeof(ri);
+ ri.index = self->cxl_cap.comp_reg_region_idx;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &ri));
+ self->comp_regs_size = ri.size;
+
+ self->dvsec_base = find_cxl_dvsec(self->dev);
+}
+
+FIXTURE_TEARDOWN(cxl_type2)
+{
+ vfio_pci_device_cleanup(self->dev);
+ iommu_cleanup(self->iommu);
+}
+
+TEST_F(cxl_type2, device_is_cxl)
+{
+ const struct vfio_device_info_cap_cxl *c = &self->cxl_cap;
+
+ ASSERT_EQ(VFIO_DEVICE_INFO_CAP_CXL, c->header.id);
+ ASSERT_EQ(1, c->header.version);
+ ASSERT_NE(c->hdm_region_idx, c->comp_reg_region_idx);
+ ASSERT_GE(c->hdm_region_idx, VFIO_PCI_NUM_REGIONS);
+ ASSERT_GE(c->comp_reg_region_idx, VFIO_PCI_NUM_REGIONS);
+ ASSERT_LT(c->comp_reg_bar, PCI_STD_NUM_BARS);
+ ASSERT_GT(c->comp_reg_size, 0ULL);
+ ASSERT_EQ(c->comp_reg_size, self->comp_regs_size);
+}
+
+TEST_F(cxl_type2, hdm_region_mmap_rw)
+{
+ uint64_t off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+ self->cxl_cap.hdm_region_idx);
+ uint32_t pattern = 0xdeadbeefU;
+ uint32_t readback = 0;
+ void *map;
+
+ if (self->hdm_region_size < SZ_4K)
+ SKIP(return, "HDM region < 4K");
+
+ map = mmap(NULL, SZ_4K, PROT_READ | PROT_WRITE, MAP_SHARED,
+ self->dev->fd, off);
+ ASSERT_NE(MAP_FAILED, map);
+
+ *(volatile uint32_t *)map = pattern;
+ readback = *(volatile uint32_t *)map;
+ ASSERT_EQ(pattern, readback);
+
+ ASSERT_EQ(0, munmap(map, SZ_4K));
+}
+
+TEST_F(cxl_type2, component_bar_sparse_mmap)
+{
+ const uint8_t bar = self->cxl_cap.comp_reg_bar;
+ uint8_t buf[512] = {};
+ struct vfio_region_info *ri = (void *)buf;
+ const struct vfio_region_info_cap_sparse_mmap *sp;
+ const struct vfio_info_cap_header *hdr;
+ size_t off;
+ uint32_t i;
+
+ ri->argsz = sizeof(buf);
+ ri->index = bar;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, ri));
+
+ ASSERT_TRUE(ri->flags & VFIO_REGION_INFO_FLAG_CAPS);
+ off = ri->cap_offset;
+ hdr = NULL;
+ while (off && off < sizeof(buf)) {
+ hdr = (const void *)(buf + off);
+ if (hdr->id == VFIO_REGION_INFO_CAP_SPARSE_MMAP)
+ break;
+ off = hdr->next;
+ hdr = NULL;
+ }
+ ASSERT_NE(NULL, hdr);
+ sp = (const void *)hdr;
+ ASSERT_GE(sp->nr_areas, 1U);
+ for (i = 0; i < sp->nr_areas; i++) {
+ uint64_t a_start = sp->areas[i].offset;
+ uint64_t a_end = a_start + sp->areas[i].size;
+
+ ASSERT_TRUE(a_end <= self->cxl_cap.comp_reg_offset ||
+ a_start >= self->cxl_cap.comp_reg_offset +
+ self->cxl_cap.comp_reg_size);
+ }
+}
+
+TEST_F(cxl_type2, comp_regs_cm_cap_array_read)
+{
+ uint64_t off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+ self->cxl_cap.comp_reg_region_idx) + CXL_CM_OFFSET;
+ uint32_t hdr = 0;
+ uint16_t cap_id;
+ uint8_t array_size;
+
+ ASSERT_EQ((ssize_t)sizeof(hdr),
+ pread(self->dev->fd, &hdr, sizeof(hdr), off));
+
+ cap_id = hdr & CXL_CM_CAP_HDR_ID_MASK;
+ array_size = (hdr & CXL_CM_CAP_HDR_ARRAY_SIZE_MASK) >> 24;
+ ASSERT_EQ(cap_id, CM_CAP_HDR_CAP_ID);
+ ASSERT_GT(array_size, 0);
+}
+
+TEST_F(cxl_type2, dvsec_lock_byte_read)
+{
+ uint8_t v;
+
+ if (!self->dvsec_base)
+ SKIP(return, "CXL Device DVSEC not found");
+
+ v = vfio_pci_config_readb(self->dev,
+ self->dvsec_base + 0x14); /* CONFIG_LOCK */
+ /*
+ * Snapshot value is host-firmware-dependent; just assert that the
+ * read succeeds (no SIGBUS, no -EIO).
+ */
+ (void)v;
+}
+
+/*
+ * Exercise the per-decoder COMMIT/COMMITTED state machine in
+ * cxl_passthrough_hdm_rw() (cxl-core). Steps:
+ *
+ * - Walk the CM cap-array via COMP_REGS reads to locate the HDM block.
+ * - Read decoder 0 CTRL; for a firmware-committed Type-2 device both
+ * COMMIT (bit 9) and COMMITTED (bit 10) are expected to be set.
+ * - Release COMMIT by writing CTRL with bit 9 cleared.
+ * Expected FSM transition: COMMITTED -> 0, LOCK_ON_COMMIT (bit 8) -> 0.
+ * - Re-set COMMIT. Expected: COMMITTED -> 1 (auto-set by the handler).
+ * - Restore the original CTRL value so subsequent test runs see the
+ * firmware-committed state.
+ *
+ * The CTRL writes touch the cxl-core shadow only — they do not reach
+ * the device — so the operation is safe to run repeatedly.
+ */
+TEST_F(cxl_type2, hdm_decoder_commit_fsm)
+{
+ uint64_t comp_off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+ self->cxl_cap.comp_reg_region_idx);
+ uint32_t cm_hdr = 0, entry = 0;
+ uint64_t hdm_reg_offset = 0;
+ uint64_t ctrl_off;
+ uint32_t ctrl_orig, ctrl_test;
+ uint32_t array_size;
+ uint32_t i;
+
+ /* Discover HDM block offset via CM cap-array walk. */
+ ASSERT_EQ((ssize_t)sizeof(cm_hdr),
+ pread(self->dev->fd, &cm_hdr, sizeof(cm_hdr),
+ comp_off + CXL_CM_OFFSET));
+ ASSERT_EQ(CM_CAP_HDR_CAP_ID, cm_hdr & CXL_CM_CAP_HDR_ID_MASK);
+ array_size = (cm_hdr & CXL_CM_CAP_HDR_ARRAY_SIZE_MASK) >> 24;
+ ASSERT_GT(array_size, 0);
+
+ for (i = 1; i <= array_size; i++) {
+ ASSERT_EQ((ssize_t)sizeof(entry),
+ pread(self->dev->fd, &entry, sizeof(entry),
+ comp_off + CXL_CM_OFFSET + i * 4));
+ if ((entry & CXL_CM_CAP_HDR_ID_MASK) == CXL_CM_CAP_CAP_ID_HDM) {
+ hdm_reg_offset = CXL_CM_OFFSET +
+ ((entry & CXL_CM_CAP_PTR_MASK) >> 20);
+ break;
+ }
+ }
+ ASSERT_NE(0, hdm_reg_offset);
+
+ /* Read decoder 0 CTRL. */
+ ctrl_off = comp_off + hdm_reg_offset +
+ CXL_HDM_DECODER0_CTRL_OFFSET(0);
+ ASSERT_EQ((ssize_t)sizeof(ctrl_orig),
+ pread(self->dev->fd, &ctrl_orig, sizeof(ctrl_orig),
+ ctrl_off));
+
+ /* Firmware-committed Type-2 device: COMMIT + COMMITTED both set. */
+ ASSERT_TRUE(ctrl_orig & BIT(9)); /* COMMIT */
+ ASSERT_TRUE(ctrl_orig & BIT(10)); /* COMMITTED */
+
+ /* Release COMMIT; FSM clears COMMITTED and LOCK_ON_COMMIT. */
+ ctrl_test = ctrl_orig & ~BIT(9);
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pwrite(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pread(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_FALSE(ctrl_test & BIT(9)); /* COMMIT cleared */
+ ASSERT_FALSE(ctrl_test & BIT(10)); /* COMMITTED auto-cleared */
+ ASSERT_FALSE(ctrl_test & BIT(8)); /* LOCK_ON_COMMIT auto-cleared */
+
+ /* Re-set COMMIT, keeping the other CTRL fields; FSM auto-sets COMMITTED. */
+ ctrl_test |= BIT(9);
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pwrite(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pread(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_TRUE(ctrl_test & BIT(9)); /* COMMIT */
+ ASSERT_TRUE(ctrl_test & BIT(10)); /* COMMITTED auto-set */
+
+ /* Restore the original CTRL value. */
+ ASSERT_EQ((ssize_t)sizeof(ctrl_orig),
+ pwrite(self->dev->fd, &ctrl_orig, sizeof(ctrl_orig),
+ ctrl_off));
+}
+
+int main(int argc, char *argv[])
+{
+ device_bdf = vfio_selftests_get_bdf(&argc, argv);
+ return test_harness_run(argc, argv);
+}
--
2.25.1
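The transitions that hdm_decoder_commit_fsm expects from the shadow can be modelled in a few lines. The sketch below is a toy model reduced to what the test comment states (bit 8 LOCK_ON_COMMIT, bit 9 COMMIT, bit 10 COMMITTED); hdm_ctrl_shadow_write() is a hypothetical name, and the real cxl_passthrough_hdm_rw() handler in cxl-core may apply further rules that are not modelled here.

```c
#include <stdint.h>

#define CTRL_LOCK_ON_COMMIT	(1u << 8)
#define CTRL_COMMIT		(1u << 9)
#define CTRL_COMMITTED		(1u << 10)

/*
 * Return the value the shadow would hold after a guest writes val to
 * decoder CTRL, under the simplified rules from the test comment:
 *   - COMMITTED is never taken from the written value;
 *   - COMMIT set     -> COMMITTED is set;
 *   - COMMIT cleared -> COMMITTED and LOCK_ON_COMMIT are cleared.
 */
static uint32_t hdm_ctrl_shadow_write(uint32_t val)
{
	uint32_t next = val & ~CTRL_COMMITTED;

	if (next & CTRL_COMMIT)
		next |= CTRL_COMMITTED;
	else
		next &= ~CTRL_LOCK_ON_COMMIT;

	return next;
}
```

Because the model depends only on the written value, repeating the release and re-commit steps always lands in the same two states, which matches the test comment's claim that the sequence is safe to run repeatedly.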
* [PATCH v3 10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Capture the ownership model, bind sequence, region layout, and the
DVSEC + HDM + CM cap-array virtualization contract for vfio-pci
Type-2 device passthrough in Documentation/driver-api/vfio-pci-cxl.rst.
cxl-core owns the CXL register virtualization through
devm_cxl_passthrough_create() and the cxl_passthrough_*_rw()
helpers; vfio-pci is a transport that forwards guest reads and
writes through them. The HDM HPA range is mapped by vfio for the
mmappable HDM region. Topology constraints and host-bridge decoder
limitations are listed under Known limitations.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
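The COMP_REGS "dispatch by offset" table in the document added below can be sketched as a small classifier. This is an illustration only: enum comp_regs_target and comp_regs_dispatch() are hypothetical names, and the offsets are passed as parameters rather than read from vfio_pci_cxl_state.

```c
#include <stdint.h>

/* Which handler serves a given offset inside the COMP_REGS region. */
enum comp_regs_target {
	COMP_REGS_ZERO_FILL,	/* reserved range, reads as zero */
	COMP_REGS_CM,		/* cxl_passthrough_cm_rw() in the table */
	COMP_REGS_HDM,		/* cxl_passthrough_hdm_rw() in the table */
};

/*
 * Classify off according to the four rows of the table:
 *   off < cm_off                         -> zero-fill
 *   cm_off <= off < hdm_off              -> CM cap-array
 *   hdm_off <= off < hdm_off + hdm_size  -> HDM decoder block
 *   off >= hdm_off + hdm_size            -> zero-fill
 */
static enum comp_regs_target
comp_regs_dispatch(uint64_t off, uint64_t cm_off,
		   uint64_t hdm_off, uint64_t hdm_size)
{
	if (off < cm_off)
		return COMP_REGS_ZERO_FILL;
	if (off < hdm_off)
		return COMP_REGS_CM;
	if (off < hdm_off + hdm_size)
		return COMP_REGS_HDM;
	return COMP_REGS_ZERO_FILL;
}
```

For example, with hypothetical values cm_off = 0x1000, hdm_off = 0x1200 and hdm_size = 0x100, offset 0x11fc is served by the CM path, 0x1200 by the HDM path, and 0x1300 falls back to zero-fill.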
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++++++++++++++++++
2 files changed, 283 insertions(+)
create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..52f0c06a376a 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
vfio-mediated-device
vfio
vfio-pci-device-specific-driver-acceptance
+ vfio-pci-cxl
Bus-level documentation
=======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..1527b7dd85d0
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,282 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===========================================
+VFIO-PCI: CXL Type-2 device passthrough
+===========================================
+
+:Author: Manish Honap <mhonap@nvidia.com>
+
+Overview
+========
+
+vfio-pci-core, when built with ``CONFIG_VFIO_PCI_CXL=y``, passes a
+CXL Type-2 accelerator (CXL r4.0, HDM-D / HDM-DB) through to a KVM
+guest. The host firmware commits the endpoint's HDM decoder before
+vfio-pci binds; the guest sees a CXL Type-2 device whose CXL.mem
+range is already programmed and locked. The guest may inspect the
+HDM Decoder Capability block and DVSEC Device capability via
+spec-defined paths, and access the device's CXL.mem range as
+mmap'd memory.
+
+Scope
+=====
+
+The supported scope is intentionally narrow:
+
+* One CXL endpoint per host bridge.
+* The endpoint exposes exactly one HDM decoder (decoder 0).
+* No interleave.
+* Host firmware has committed the endpoint HDM decoder before
+ vfio-pci probes. Devices whose HDM decoder is *uncommitted* fail
+ vfio-pci bind cleanly.
+* The host bridge is in single-RP-passthrough mode (the CXL host
+ bridge's own HDM decoder is not used; CFMWS-to-RP decode flows
+ implicitly). This assumption is currently *not enforced* by
+ vfio-pci-core; it is a known limitation, see the Known
+ limitations section.
+
+Multi-decoder, interleave, FLR / reset state-machine integration,
+and host-bridge HDM decoder programming are explicitly out of scope.
+Any of them can be added later on top of the contract described
+below.
+
+Driver model
+============
+
+There is no dedicated ``vfio-cxl`` PCI driver. vfio-pci is the only
+driver that binds to the host PCI device. When built with
+``CONFIG_VFIO_PCI_CXL=y``, vfio-pci-core calls into the cxl subsystem
+to do five things at bind time:
+
+1. ``devm_cxl_dev_state_create()`` — allocate per-device CXL state
+ embedded in ``struct vfio_pci_cxl_state``.
+2. ``cxl_pci_setup_regs()`` + ``cxl_get_hdm_info()`` — probe the
+ Register Locator DVSEC and harvest the HDM block's BAR-relative
+ offset and size.
+3. ``cxl_await_range_active()`` — wait for the firmware-committed
+ range to become live.
+4. ``devm_cxl_passthrough_create()`` — snapshot the CXL Device DVSEC
+ body, the HDM Decoder block, and the CXL.cache/mem cap-array
+ prefix into shadows owned by cxl-core. All subsequent
+ register-virtualization happens inside ``drivers/cxl/core/passthrough.c``.
+5. ``devm_cxl_probe_mem()`` — register a ``cxl_memdev``, enumerate
+ the endpoint port, and auto-attach the firmware-committed
+ region. cxl_mem binds to the memdev as it would for any other
+ Type-2 accelerator.
+
+Ownership split
+===============
+
+Each device-visible surface is owned by exactly one subsystem:
+
+============================================ ==============================================
+Surface Owner
+============================================ ==============================================
+PCI config (non-DVSEC, non-CXL) vfio-pci-core ``vconfig`` (existing perm-bits)
+CXL Device DVSEC body cxl-core ``cxl_passthrough_dvsec_rw()``
+HDM Decoder Capability block cxl-core ``cxl_passthrough_hdm_rw()``
+CM cap-array (read-only snapshot) cxl-core ``cxl_passthrough_cm_rw()``
+``cxl_memdev`` / endpoint port / autoregion cxl-core ``devm_cxl_probe_mem()``
+HDM HPA range mapping vfio-pci ``request_mem_region`` + ``memremap``
+Sparse mmap layout for the component BAR vfio-pci
+============================================ ==============================================
+
+The vfio side holds no shadow buffer of its own. ``vfio_pci_cxl_state``
+caches small scalars (DVSEC offset/size, HDM offset/size, component
+BAR layout) for dispatch decisions; the actual virtualization
+semantics live in cxl-core.
+
+Bind sequence
+=============
+
+``vfio_pci_cxl_acquire()`` is called from
+``vfio_pci_core_register_device()`` at PCI bind time. The sequence::
+
+ 0. devm_cxl_dev_state_create(parent, CXL_DEVTYPE_DEVMEM, dsn,
+ dvsec_off, vfio_pci_cxl_state, cxlds,
+ /*mbox=*/false)
+
+ 1. pcie_is_cxl() and pci_find_dvsec_capability(CXL_DEVICE)
+ -> -ENODEV if either is absent
+ -> -ENODEV if the DVSEC's MEM_CAPABLE bit is clear
+
+ 2. pci_enable_device_mem()
+
+ 2a. cxl_pci_setup_regs(CXL_REGLOC_RBI_COMPONENT)
+ 2b. cxl_get_hdm_info() — REJECT hdm_count != 1 with -EOPNOTSUPP
+ 2c. cxl_regblock_get_bar_info()
+ 2d. cxl_await_range_active()
+ 2e. devm_cxl_passthrough_create(&pdev->dev, &cxlds)
+
+ 3. pci_disable_device()
+ Clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+ do_pci_disable_device() in drivers/pci/pci.c). Subsequent
+ MMIO from step 4 still succeeds.
+
+ 4. devm_cxl_probe_mem(&cxlds, &hpa_range)
+ Registers the memdev, enumerates the endpoint port, attaches
+ the firmware-committed autoregion.
+
+ 5. request_mem_region(hpa_base, hpa_size) + memremap_wb()
+
+ 6. vdev->cxl = cxl (state published; HDM and COMP_REGS regions
+ are registered later when the VFIO fd is opened)
+
+Fail-closed semantics
+---------------------
+
+Three conditions are mapped to "not a CXL device; caller falls back to
+plain vfio-pci": ``pcie_is_cxl()`` returns false, the DVSEC is absent,
+or ``MEM_CAPABLE`` is clear. Each returns ``-ENODEV`` from
+``vfio_pci_cxl_acquire()``; the caller treats it as a silent
+fall-through.
+
+Any other negative errno from the bind sequence aborts the vfio-pci
+bind entirely. The guest never sees a half-initialised CXL device.
+Once ``devm_cxl_probe_mem()`` has succeeded the published memdev
+holds a pointer into the embedded ``cxl_dev_state``; a failure in
+``vfio_cxl_map_hdm()`` after that point cannot ``devm_kfree(cxl)``
+and leaves the state allocated for the lifetime of the PCI device
+(devres unwinds it at pdev removal).
+
+VFIO regions exposed
+====================
+
+When the VFIO fd is first opened, ``vfio_pci_cxl_open()`` registers
+two additional regions on top of the standard vfio-pci BARs / config
+region:
+
+HDM region (``VFIO_REGION_SUBTYPE_CXL``)
+  Mappable view of the device's firmware-committed HPA range.
+
+  * ``mmap``: fault handler does
+    ``vmf_insert_pfn(vma, addr, PHYS_PFN(hpa_base + off))``. The
+    guest gets the same backing physical memory the host sees.
+  * ``pread`` / ``pwrite``: served from the ``memremap_wb()`` kva
+    captured at bind time.
+
+COMP_REGS region (``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+  Shadow of the CXL component register sub-range. ``pread`` /
+  ``pwrite`` only; ``mmap`` is intentionally not supported (the VMM
+  uses this region instead of mmapping the BAR). Dword-aligned
+  access only; sub-dword accesses return ``-EINVAL``.
+
+  Dispatch by offset:
+
+  ============================================ =================================
+  Offset range                                 cxl-core helper
+  ============================================ =================================
+  ``< CXL_CM_OFFSET``                          zero-fill (reserved)
+  ``CXL_CM_OFFSET .. hdm_reg_offset``          ``cxl_passthrough_cm_rw()``
+  ``hdm_reg_offset .. +hdm_reg_size``          ``cxl_passthrough_hdm_rw()``
+  ``>= hdm_reg_offset + hdm_reg_size``         zero-fill (reserved)
+  ============================================ =================================
+
+DVSEC virtualization contract
+=============================
+
+The CXL Device DVSEC body is reached through the standard PCI
+config-space path. ``vfio_pci_config_rw_single()`` clips chunks at
+the DVSEC body boundary via ``vfio_pci_cxl_config_boundary()`` and
+forwards body bytes to ``vfio_pci_cxl_config_rw()``, which in turn
+calls ``cxl_passthrough_dvsec_rw()``.
+
+Per-field write semantics (CXL r4.0 §8.1.3):
+
+============================================ ==============================================
+Field (offset from DVSEC cap base)           Spec attribute / behaviour
+============================================ ==============================================
+CAPABILITY (0x0a)                            HwInit — writes dropped
+CONTROL (0x0c)                               RWL — gated on DVSEC CONFIG_LOCK
+STATUS (0x0e)                                RW1C
+CONTROL2 (0x10)                              RWL — gated on DVSEC CONFIG_LOCK
+STATUS2 (0x12)                               RW1C
+LOCK (0x14)                                  RWO — first 1-write latches CONFIG_LOCK
+Range1 SIZE_HI/LO BASE_HI/LO (0x18..0x27)    HwInit — writes dropped
+Range2 SIZE_HI/LO BASE_HI/LO (0x28..0x37)    RsvdZ — writes dropped
+============================================ ==============================================
+
+HDM virtualization contract
+===========================
+
+Per CXL r4.0 §8.2.4.20, on the single firmware-committed decoder:
+
+============================================ ==============================================
+Field (offset from HDM block base)           Spec attribute / behaviour
+============================================ ==============================================
+HDM Decoder Capability Header (0x00)         HwInit — writes dropped
+HDM Decoder Global Control (0x04)            RW — shadow
+Decoder 0 BASE_LO / BASE_HI                  RWL — gated on COMMITTED or LOCK_ON_COMMIT
+Decoder 0 SIZE_LO / SIZE_HI                  RWL — same gate
+Decoder 0 CTRL                               Implements COMMIT → COMMITTED handshake; once
+                                             COMMITTED, only COMMIT toggles are honoured
+============================================ ==============================================
+
+CM cap-array
+============
+
+The CM cap-array (CXL r4.0 §8.2.4) prefix is snapshotted from the
+device's component register MMIO at bind time and served read-only
+through ``cxl_passthrough_cm_rw()``. Guest writes to the cap-array
+are silently dropped.
+
+UAPI: CAP_CXL
+=============
+
+``VFIO_DEVICE_GET_INFO`` returns ``VFIO_DEVICE_FLAGS_CXL`` and a
+``VFIO_DEVICE_INFO_CAP_CXL`` capability::
+
+  struct vfio_device_info_cap_cxl {
+          struct vfio_info_cap_header header;
+          __u32 flags;
+  #define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED (1 << 0)
+          __u32 hdm_region_idx;
+          __u32 comp_reg_region_idx;
+          __u32 comp_reg_bar;
+          __u32 __resv;
+          __u64 comp_reg_offset;
+          __u64 comp_reg_size;
+  };
+
+``VFIO_DEVICE_GET_REGION_INFO`` on the component BAR returns a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` that excludes
+``[comp_reg_offset, comp_reg_offset + comp_reg_size)`` from the
+mmappable areas.
+
+Known limitations
+=================
+
+* Host bridge HDM decoder programming is not driven by this driver.
+  The driver silently assumes a single-RP-passthrough topology (the
+  CXL host bridge's own HDM decoder is not used). Two remediations
+  are possible: either refuse to bind when the topology is not
+  single-RP-passthrough, or extend the kernel ABI so a host-bridge
+  HDM decoder programmer can attest the lock before vfio bind. The
+  former leaves the existing contract intact; the latter adds a
+  single boolean to CAP_CXL.
+
+* Function-level reset (FLR) does not re-snapshot the shadows.
+  Guests that issue FLR will see stale HDM and DVSEC state after
+  the reset.
+
+* Multi-decoder devices return ``-EOPNOTSUPP`` at bind.
+
+* Hotplug while the device is held by vfio is not supported.
+
+* Raw BAR read/write into the CXL component register sub-range is
+  unsupported. VMMs must use the COMP_REGS region.
+
+Selftest
+========
+
+``tools/testing/selftests/vfio/vfio_cxl_type2_test`` exercises the
+five surfaces:
+
+* ``device_is_cxl`` — GET_INFO returns FLAGS_CXL + CAP_CXL.
+* ``hdm_region_mmap_rw`` — mmap + read/write pattern.
+* ``component_bar_sparse_mmap`` — SPARSE_MMAP cap excludes the CXL
+ block.
+* ``comp_regs_cm_cap_array_read`` — CM cap-array header is served
+ from the cxl-core snapshot.
+* ``dvsec_lock_byte_read`` -- DVSEC config-rw clipping shim is wired.
--
2.25.1
* [PATCH v3 11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Add an opt-out so users can keep vfio-pci's CXL extensions out of the
path for individual devices or for an entire vfio-pci instance. The
build-time gate is CONFIG_VFIO_PCI_CXL; the runtime gates are:
- Module parameter vfio_pci.disable_cxl (bool, 0444). Setting
disable_cxl=1 at modprobe time makes vfio_pci_probe() set
vdev->disable_cxl on every device it binds.
- Variant drivers (mlx5, pds, hisi, nvgrace, xe, etc.) may set
vdev->disable_cxl=true in their own probe for per-device control
without needing the module parameter. The bit lives on
struct vfio_pci_core_device so it's reachable from any variant.
vfio_pci_cxl_acquire() consults vdev->disable_cxl as the very first
check and returns -ENODEV when set, which makes vfio-pci-core treat
the device as a plain (non-CXL) PCI passthrough — no CAP_CXL, no HDM
or COMP_REGS VFIO regions, no DVSEC clipping shim.
This mirrors the long-standing disable_denylist opt-out shape.
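The gate order described above can be modelled outside the kernel
roughly as follows. This is only an illustrative sketch: struct
model_dev, model_param_disable_cxl, model_probe() and
model_cxl_acquire() are hypothetical stand-ins, not symbols from this
patch, and the real vfio_pci_cxl_acquire() goes on to do the DVSEC
checks and the rest of the bind sequence.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two facts the gate looks at. */
struct model_dev {
	bool disable_cxl;	/* per-device opt-out bit */
	bool is_cxl;		/* models the result of pcie_is_cxl() */
};

/* Models the vfio_pci.disable_cxl module parameter. */
static bool model_param_disable_cxl;

/* Models vfio_pci_probe() copying the parameter into the device. */
static void model_probe(struct model_dev *d)
{
	assert(d);
	d->disable_cxl = model_param_disable_cxl;
}

/* Opt-out is checked first; only then is CXL detection consulted. */
static int model_cxl_acquire(const struct model_dev *d)
{
	assert(d);
	if (d->disable_cxl)
		return -ENODEV;	/* caller falls back to plain vfio-pci */
	if (!d->is_cxl)
		return -ENODEV;
	return 0;
}
```

In this model a variant driver would simply set d->disable_cxl itself
before calling model_cxl_acquire(), without touching the parameter.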
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 9 +++++++++
drivers/vfio/pci/vfio_pci.c | 9 +++++++++
include/linux/vfio_pci_core.h | 1 +
3 files changed, 19 insertions(+)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 8a00b776d7c7..905f74f4e725 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -234,6 +234,15 @@ int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
 	u16 dvsec;
 	int rc;
 
+	/*
+	 * Honour the per-device opt-out (set by vfio-pci's module
+	 * parameter disable_cxl, or by a variant driver before
+	 * registration). Returning -ENODEV here makes the caller
+	 * treat this device as plain vfio-pci.
+	 */
+	if (vdev->disable_cxl)
+		return -ENODEV;
+
 	if (!pcie_is_cxl(pdev))
 		return -ENODEV;
 
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..fd226cb65d8b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
 module_param(disable_denylist, bool, 0444);
 MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
 
+#if IS_ENABLED(CONFIG_VFIO_PCI_CXL)
+static bool disable_cxl;
+module_param(disable_cxl, bool, 0444);
+MODULE_PARM_DESC(disable_cxl, "Disable CXL Type-2 extensions for all devices bound to vfio-pci. Variant drivers may instead set vdev->disable_cxl in their probe for per-device control without needing this parameter.");
+#endif
+
 static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
 {
 	switch (pdev->vendor) {
@@ -166,6 +172,9 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return PTR_ERR(vdev);
 
 	dev_set_drvdata(&pdev->dev, vdev);
+#if IS_ENABLED(CONFIG_VFIO_PCI_CXL)
+	vdev->disable_cxl = disable_cxl;
+#endif
 	vdev->pci_ops = &vfio_pci_dev_ops;
 	ret = vfio_pci_core_register_device(vdev);
 	if (ret)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 541c1911e090..20e9599b3bd7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -127,6 +127,7 @@ struct vfio_pci_core_device {
 	bool needs_pm_restore:1;
 	bool pm_intx_masked:1;
 	bool pm_runtime_engaged:1;
+	bool disable_cxl:1;
 	struct pci_saved_state *pci_saved_state;
 	struct pci_saved_state *pm_save;
 	int ioeventfds_nr;
--
2.25.1
* Re: [PATCH v5 13/24] virt/steal_monitor: Add documentation
From: Randy Dunlap @ 2026-06-25 17:00 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, mingo, peterz, juri.lelli,
vincent.guittot, yury.norov, kprateek.nayak, iii, corbet
Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
dietmar.eggemann, maddy, srikar, hdanton, chleroy, vineeth,
frederic, arighi, pauld, christian.loehle, tj, tommaso.cucinotta,
maz, rafael, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-14-sshegde@linux.ibm.com>
Hi,
On 6/25/26 5:46 AM, Shrikanth Hegde wrote:
> Document this module named steal_monitor and its parameters.
>
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v4-v5:
> - new patch
>
> Please let me know if the placing is not right.
>
> Documentation/driver-api/index.rst | 1 +
> Documentation/driver-api/steal-monitor.rst | 93 ++++++++++++++++++++++
> 2 files changed, 94 insertions(+)
> create mode 100644 Documentation/driver-api/steal-monitor.rst
> diff --git a/Documentation/driver-api/steal-monitor.rst b/Documentation/driver-api/steal-monitor.rst
> new file mode 100644
> index 000000000000..997a22d0812c
> --- /dev/null
> +++ b/Documentation/driver-api/steal-monitor.rst
> @@ -0,0 +1,93 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +=============
> +Steal Monitor
> +=============
> +
> +:Author: Shrikanth Hegde
> +
> +Introduction:
> +=============
Nit:
Kernel heading adornment style does not include an ending ':' character
(4 places).
> +
> +Steal monitor is a driver aimed at solving the Noisy Neighbour problem
> +in virtualized environments. I.e performance of workload
> +running in one VM gets affected significantly due to other VMs and
> +combined they make slower forward progress.
--
~Randy
* [PATCH 00/19] crypto: cmh - add CRI CryptoManager Hub driver
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC (permalink / raw)
To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
Shuah Khan
Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
linux-crypto, linux-doc, linux-kernel, linux-kselftest,
linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
From: Alex Ousherovitch <aousherovitch@rambus.com>
crypto: cmh - add CRI CryptoManager Hub hardware crypto accelerator
This series adds a driver for the CRI CryptoManager Hub (CMH), a
hardware cryptographic accelerator IP from Cryptography Research at
Rambus Inc. (https://www.rambus.com/cryptographyresearch/).
CMH provides a broad set of symmetric, asymmetric, and post-quantum
cryptographic algorithms accelerated in hardware, accessed via a
mailbox-based Virtual Command Queue (VCQ) interface.
The hardware is a platform device matched via device tree
(compatible = "cri,cmh"). It exposes a single MMIO register region
(SIC) with per-mailbox doorbell, status, and command registers.
Each mailbox has DMA-coherent queue memory for VCQ command
submission and completion.
Driver architecture:
  In-kernel users                   /dev/cmh_mgmt (ioctl)
  (dm-crypt, IPsec, kTLS, fscrypt)  (key management)
          |                           |
          v                           v
  +----------------------------------------------------+
  |        Kernel Crypto API + hwrng (72 total)        |
  | ahash | skcipher | aead | akcipher | sig | kpp     |
  +----------------------------------------------------+
          |                           |
          v                           v
  +------------------+    +------------------------+
  | Transaction Mgr  |--->| Key / Mgmt subsystem   |
  | (kthread, CMQ)   |    | (datastore, ioctl ops) |
  +------------------+    +------------------------+
          |
          v
  +------------------+     +-------------------+
  | MQI (VCQ pack,   |---->| Response Handler  |
  |  DMA map, submit)|     | (threaded IRQ,    |
  +------------------+     |  watchdog, unmap) |
          |                +-------------------+
          v                        ^
  +-----------+              +-----------+
  | Hardware  |--- IRQ ----->| Hardware  |
  | (mailbox) |              | (mailbox) |
  +-----------+              +-----------+
The transaction manager runs as a dedicated kthread that pulls
requests from a central command queue, packs VCQ entries, maps DMA
buffers, and submits to the least-loaded mailbox. Completion is
handled by per-mailbox threaded IRQs. The driver returns
-EINPROGRESS for async crypto requests and supports the
CRYPTO_TFM_REQ_MAY_BACKLOG flag for queue-full backpressure.
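As a rough illustration of the "least-loaded mailbox" choice, the
selection step might look like the sketch below. The cover letter does
not say how load is measured, so this assumes it is the number of
in-flight commands per mailbox; struct mbox_model and
pick_least_loaded() are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mailbox bookkeeping (assumed, not from the driver). */
struct mbox_model {
	unsigned int in_flight;	/* submitted but not yet completed */
	unsigned int depth;	/* queue capacity */
};

/*
 * Return the index of the mailbox with the fewest in-flight commands
 * that still has room, or -1 when every mailbox is full. Ties go to
 * the lowest index.
 */
static int pick_least_loaded(const struct mbox_model *mb, size_t n)
{
	int best = -1;
	size_t i;

	assert(mb != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (mb[i].in_flight >= mb[i].depth)
			continue;	/* full, cannot accept more */
		if (best < 0 || mb[i].in_flight < mb[best].in_flight)
			best = (int)i;
	}
	return best;
}
```

A return of -1 corresponds to the queue-full case, where a caller in
this model would either backlog the request or report that the queues
are busy.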
Registered algorithms (72 total):
  Type         Count  Algorithms
  -----------  -----  --------------------------------------------------
  ahash           15  SHA-{224,256,384,512}, SHA3-{224,256,384,512},
                      SHAKE-{128,256}, cSHAKE-{128,256},
                      KMAC-{128,256}, SM3
  ahash(HMAC)      8  HMAC-SHA-{224,256,384,512},
                      HMAC-SHA3-{224,256,384,512}
  ahash(MAC)       4  CMAC(AES), CMAC(SM4), XCBC(SM4), Poly1305
  skcipher        11  AES-{ECB,CBC,CTR,CFB,XTS},
                      SM4-{ECB,CBC,CTR,CFB,XTS}, ChaCha20
  aead             6  AES-{GCM,CCM}, SM4-{GCM,CCM},
                      rfc7539(chacha20,poly1305),
                      rfc7539esp(chacha20,poly1305)
  akcipher         1  RSA (2048--4096 bit; 512/1024 legacy/test)
  sig             23  ECDSA P-{256,384,521}, SM2 (verify-only),
                      ML-DSA-{44,65,87},
                      SLH-DSA (12 parameter sets),
                      LMS, LMS-HSS, XMSS, XMSS-MT
  kpp              3  ECDH P-{256,384}, X25519
  hwrng            1  DRBG-backed /dev/hwrng
Ioctl-only algorithms (not registered with the crypto API at all):
- EdDSA (Ed25519, Ed448): sign and verify
- ML-KEM (ML-KEM-512/768/1024): no standard kernel KEM API exists
The driver also exposes /dev/cmh_mgmt, a misc device providing 44
ioctl commands. Relative to the in-kernel crypto API these fall into
two groups; the distinction matters because some commands name the
same primitives the driver also registers, and that overlap is
deliberate and bounded:
(1) Operations with no crypto API representation - the large
majority. The crypto API has no transform type or verb for
these, so a character device is the only available UAPI:
- hardware key lifecycle: create, import, export, derive,
destroy, enumerate (keystore CRUD) - no keystore verb
- KIC key derivation (HKDF, AES-CMAC-KDF, DKEK)
- asymmetric key generation (RSA, EC, EdDSA, ML-DSA, SLH-DSA)
and public-key derivation - the crypto API has no keygen verb
- ML-KEM encapsulate/decapsulate - no kernel KEM API exists
- SM2 encrypt/decrypt and key exchange (multi-step GM/T 0003)
- EdDSA sign/verify - not registered with the crypto API
- EAC Chip Authentication and DRBG (re)configuration
(2) Hardware-held-key operations on algorithms that ARE also
registered (RSA decrypt, ECDSA/ML-DSA/SLH-DSA sign, ECDH). These
name the same primitives as the registered akcipher/sig/kpp
transforms, but the crypto API's set_priv_key()/set_secret()
accept only raw key bytes supplied by the caller; they cannot
reference a private key that is generated inside, and never
leaves, the hardware datastore - the central security property of
this device. The ioctl path keeps the private key
hardware-resident, while the registered transforms serve raw-key
in-kernel users. The two paths are complementary, not redundant.
The device requires CAP_SYS_ADMIN.
/dev/cmh_mgmt is built conditionally on CONFIG_CRYPTO_DEV_CMH_MGMT
(default n); when disabled the ioctl interface is absent while all
kernel crypto API algorithms remain registered.
The ML-DSA sig algorithms are registered at priority 5001. The
kernel's crypto/mldsa.c registers at priority 5000 with verify-only
(sign returns -EOPNOTSUPP). Our driver provides full HW-accelerated
sign + verify, so the higher priority ensures the hardware
implementation is preferred when the driver is loaded.
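The effect of the priority values can be sketched with a small
standalone model: among implementations registered under the same
algorithm name, the one with the highest priority is chosen. The
struct, the function and the driver-name strings below are
illustrative assumptions, not the crypto API's own types or the names
this series registers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one registered implementation. */
struct alg_model {
	const char *name;	/* algorithm name looked up by users */
	const char *driver;	/* implementation name (illustrative) */
	int priority;
};

/* Pick the highest-priority implementation of @name, or NULL. */
static const struct alg_model *pick_alg(const struct alg_model *algs,
					size_t n, const char *name)
{
	const struct alg_model *best = NULL;
	size_t i;

	assert(name != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(algs[i].name, name) != 0)
			continue;
		if (!best || algs[i].priority > best->priority)
			best = &algs[i];
	}
	return best;
}
```

With a software entry at 5000 and a hardware entry at 5001 under the
same name, this model returns the hardware entry; removing it (driver
not loaded) leaves the software entry as the only match.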
Power management uses DEFINE_SIMPLE_DEV_PM_OPS. On suspend the
transaction manager drains in-flight requests (configurable 10s
timeout, returns -ECANCELED on timeout), stops the kthread, and
masks IRQs. On resume it re-verifies SIC/boot status and restarts
the kthread.
Dependencies:
- Kernel 7.1+ (based on Herbert Xu's cryptodev-2.6 tree, 7.1.0-rc2)
- sig_alg backend (upstream since 6.13)
- CRYPTO_AHASH_REQ_VIRT (native support, no fallback needed)
- CMH eSW loaded independently by hardware before driver probe
The driver registers all algorithms through the standard in-kernel
crypto API; in-kernel users (dm-crypt, fscrypt, IPsec, etc.) consume
them directly. Key provisioning and hardware-held-key operations are
exposed to user space via /dev/cmh_mgmt ioctls.
Public hardware documentation:
Product brief: https://go.rambus.com/ch-7xx-and-cc-7xx-product-brief
No public datasheets are currently available. The driver was
developed against the CRI CryptoManager Hub Hardware Reference
Manual (Rambus Inc. confidential). Detailed hardware reference is
available under NDA from Rambus Inc.; contact the maintainers listed
in MAINTAINERS for access during review.
Tested on RISC-V and ARM64 QEMU emulation with the CMH hardware
model (QEMU TCG, 512 MiB RAM). Also exercised on Xilinx VMK180
FPGA board with real CMH IP.
- testmgr: 41 CMH algorithm registrations matched by upstream
test vectors, all pass; 30 names report "No test for" (PQC
families, KMAC, cSHAKE - no upstream vectors yet).
- kselftest tools/testing/selftests/drivers/crypto/cmh:
6 pass, 0 fail.
checkpatch.pl --strict: 0 errors, 0 warnings, 0 checks on all
files (the only output is the expected per-file "does MAINTAINERS
need updating?" reminder, satisfied by the MAINTAINERS patch).
sparse (C=2): 0 warnings.
W=1 -Werror: clean.
make dt_binding_check: clean (dtschema validates the
cri,cmh.yaml binding).
Tested with the following debug options enabled simultaneously
(submit-checklist "Test your code" item 1):
CONFIG_PROVE_LOCKING, CONFIG_PROVE_RCU, CONFIG_DEBUG_LOCK_ALLOC,
CONFIG_DEBUG_OBJECTS_RCU_HEAD, CONFIG_SLUB_DEBUG,
CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES, CONFIG_DEBUG_SPINLOCK,
CONFIG_DEBUG_PREEMPT, CONFIG_DEBUG_ATOMIC_SLEEP.
Result: no lockdep warnings, no ODEBUG splats, no slab corruption.
Additionally tested (separate passes - mutually exclusive configs):
- CONFIG_KASAN + CONFIG_UBSAN + CONFIG_DEBUG_KMEMLEAK + CONFIG_KFENCE:
no sanitizer findings; KMEMLEAK scan reports 0 unreferenced objects.
- CONFIG_KCSAN (arm64; riscv64 lacks HAVE_ARCH_KCSAN):
0 data-race reports attributed to the driver.
Stack usage: worst-case under 1 KB on both riscv64 and arm64
(scripts/checkstack.pl). Hardware command buffers live in
per-request context (heap-allocated by the crypto framework).
Alex Ousherovitch (19):
dt-bindings: crypto: add Rambus CryptoManager Hub
crypto: cmh - add core platform driver
crypto: cmh - add key provisioning and management
crypto: cmh - add SHA-2/SHA-3/SHAKE ahash
crypto: cmh - add HMAC ahash
crypto: cmh - add CSHAKE/KMAC ahash
crypto: cmh - add SM3 ahash
crypto: cmh - add AES skcipher/aead/cmac
crypto: cmh - add SM4 skcipher/aead/cmac/xcbc
crypto: cmh - add ChaCha20-Poly1305
crypto: cmh - add DRBG hwrng
crypto: cmh - add RSA akcipher
crypto: cmh - add ECDSA/SM2 sig
crypto: cmh - add ECDH/X25519 kpp
crypto: cmh - add ML-KEM/ML-DSA (QSE)
crypto: cmh - add SLH-DSA/LMS/XMSS (HCQ)
Documentation: ioctl: add CMH ioctl documentation and register 'J'
selftests: crypto: cmh - add kselftest for management ioctl
MAINTAINERS: add Rambus CryptoManager Hub (CMH)
base-commit: 6ea0ce3a19f9c37a014099e2b0a46b27fa164564
--
2.43.7