Linux Documentation


* [PATCH v3 06/11] cxl: Add register-virtualization helpers for vfio Type-2 passthrough
From: mhonap @ 2026-06-25 16:54 UTC
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

vfio-pci needs the CXL Device DVSEC body, the HDM Decoder Capability
block, and the CXL.cache/mem (CM) cap-array prefix to be virtualized
for a KVM guest in a CXL-spec-compliant way.

Introduce a narrow helper API owned by cxl-core:

  struct cxl_passthrough *
  devm_cxl_passthrough_create(struct device *dev,
                              struct cxl_dev_state *cxlds);

  int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off,
                               u32 *val, size_t sz, bool write);
  int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off,
                             u32 *val, bool write);
  int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off,
                            u32 *val, bool write);

The DVSEC and HDM helpers take a per-device mutex covering the DVSEC
and HDM shadows; the CM helper takes no lock because the cap-array
snapshot is immutable after create.  Writes are dispatched by offset
to hand-written handlers that follow the CXL r4.0 register attributes.

DVSEC (§8.1.3):
  - LOCK is RWO
  - CONTROL/CONTROL2 are RWL, gated on CONFIG_LOCK
  - STATUS/STATUS2 are RW1C
  - RANGE1 is HwInit
  - RANGE2 is RsvdZ

HDM (§8.2.4.20):
  - GLOBAL_CTRL is RW
  - decoder CTRL implements COMMIT/COMMITTED
  - decoder BASE/SIZE are RWL, gated on COMMITTED or LOCK_ON_COMMIT
  - the cap header is HwInit
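The DVSEC attribute modes named above can be modelled in a few lines of
userspace C.  This is only a sketch of the semantics, not the patch's
code: struct dvsec_model and the write_*() names are hypothetical, and
the model keeps one field per attribute instead of a byte shadow.

```c
#include <assert.h>
#include <stdint.h>

#define CONFIG_LOCK	0x1u	/* bit 0 of the LOCK register */

/* Hypothetical three-register model, one field per attribute mode. */
struct dvsec_model {
	uint16_t control;	/* RWL: writable until CONFIG_LOCK latches */
	uint16_t status;	/* RW1C: writing 1 clears the bit */
	uint16_t lock;		/* RWO: CONFIG_LOCK can be set exactly once */
};

/* RWL: drop the write once the lock bit has latched. */
static void write_control(struct dvsec_model *m, uint16_t val)
{
	assert(m);
	if (m->lock & CONFIG_LOCK)
		return;
	m->control = val;
}

/* RW1C: every 1 bit in @val clears the matching status bit. */
static void write_status(struct dvsec_model *m, uint16_t val)
{
	assert(m);
	m->status &= (uint16_t)~val;
}

/* RWO: the first write of 1 latches; later writes are ignored. */
static void write_lock(struct dvsec_model *m, uint16_t val)
{
	assert(m);
	if (m->lock & CONFIG_LOCK)
		return;
	if (val & CONFIG_LOCK)
		m->lock |= CONFIG_LOCK;
}
```

HwInit and RsvdZ fields need no handler in such a model, since every
guest write to them is dropped.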

Writes to the CM cap-array are silently discarded because the
cap-array headers are RO per CXL r4.0 §8.2.4; the write parameter is
kept on the rw API to make the drop policy explicit at the call site.

The shadows are snapshotted at create time: the DVSEC body from PCI
config space one 16-bit word at a time, the CM cap-array and HDM
block through a short-lived ioremap() of cxlds->reg_map.resource
(cxlds->reg_map.base is no longer valid by then).  This preserves
firmware-committed values so the guest reads what the host BIOS
committed, while writes update the shadow per the per-field write
semantics above.
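As a rough userspace illustration of the snapshot-then-serve idea, the
sketch below fills a little-endian byte shadow two bytes at a time and
then answers 1-, 2- and 4-byte reads from it.  The config-space reader
is a stand-in (fake_cfg_read16() is hypothetical and just returns
0x1000 + offset), and none of the function names come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical config-space reader used only for this sketch. */
static uint16_t fake_cfg_read16(unsigned int cfg_off)
{
	return (uint16_t)(0x1000 + cfg_off);
}

/* Store a 16-bit value little-endian at an arbitrary byte address. */
static void put_le16(uint16_t v, uint8_t *p)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/* Copy [body_start, cap_len) into the shadow with a 2-byte stride. */
static void snapshot_body(uint8_t *shadow, unsigned int body_start,
			  unsigned int cap_len,
			  uint16_t (*rd)(unsigned int))
{
	unsigned int off;

	for (off = body_start; off < cap_len; off += 2)
		put_le16(rd(off), shadow + (off - body_start));
}

/* Serve a 1-, 2- or 4-byte little-endian read from the shadow. */
static uint32_t shadow_read(const uint8_t *shadow, size_t off, size_t sz)
{
	uint32_t v = 0;
	size_t i;

	assert(sz == 1 || sz == 2 || sz == 4);
	for (i = 0; i < sz; i++)
		v |= (uint32_t)shadow[off + i] << (8 * i);
	return v;
}
```

Storing the shadow as bytes in a fixed endianness is what lets a single
buffer serve all three access widths without alignment concerns.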

The file is gated by the hidden Kconfig CXL_VFIO_PASSTHROUGH so the
passthrough code stays out of cxl_core when no vfio consumer is configured.

Scope: firmware-committed, single-decoder, no-interleave Type-2
passthrough.  Multi-decoder, interleave, and hotplug are
out-of-scope and rejected at create time (-EOPNOTSUPP for
hdm_count != 1).
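For the single, firmware-committed decoder in scope, the COMMIT and
COMMITTED handshake and the BASE/SIZE gating can be sketched in
userspace C as below.  The bit positions are the ones this patch
defines for the decoder CTRL register; struct decoder_model and the
function names are hypothetical, and only BASE_LO is modelled.

```c
#include <assert.h>
#include <stdint.h>

#define CTRL_LOCK_ON_COMMIT	(1u << 8)
#define CTRL_COMMIT		(1u << 9)
#define CTRL_COMMITTED		(1u << 10)
#define CTRL_ERR_NOT_COMMITTED	(1u << 11)

/* Hypothetical one-decoder model. */
struct decoder_model {
	uint32_t base_lo;
	uint32_t ctrl;
};

static void decoder_write_ctrl(struct decoder_model *d, uint32_t val)
{
	assert(d);
	if (d->ctrl & CTRL_COMMITTED) {
		/* Only COMMIT is honoured; releasing it uncommits. */
		uint32_t next = (d->ctrl & ~CTRL_COMMIT) | (val & CTRL_COMMIT);

		if (!(val & CTRL_COMMIT))
			next &= ~(CTRL_COMMITTED | CTRL_LOCK_ON_COMMIT);
		d->ctrl = next;
		return;
	}
	/* The guest cannot set the status bits directly. */
	d->ctrl = val & ~(CTRL_COMMITTED | CTRL_ERR_NOT_COMMITTED);
	if (val & CTRL_COMMIT)
		d->ctrl |= CTRL_COMMITTED;
}

/* RWL: BASE is frozen while committed or lock-on-commit is set. */
static void decoder_write_base_lo(struct decoder_model *d, uint32_t val)
{
	assert(d);
	if (d->ctrl & (CTRL_COMMITTED | CTRL_LOCK_ON_COMMIT))
		return;
	d->base_lo = val;
}
```

In this model a guest that wants to reprogram the decoder has to clear
COMMIT first, write BASE, and then set COMMIT again.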

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/Kconfig            |   7 +
 drivers/cxl/core/Makefile      |   1 +
 drivers/cxl/core/passthrough.c | 590 +++++++++++++++++++++++++++++++++
 include/cxl/passthrough.h      | 121 +++++++
 4 files changed, 719 insertions(+)
 create mode 100644 drivers/cxl/core/passthrough.c
 create mode 100644 include/cxl/passthrough.h

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 80aeb0d556bd..7c874d486a9c 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -19,6 +19,13 @@ menuconfig CXL_BUS
 
 if CXL_BUS
 
+config CXL_VFIO_PASSTHROUGH
+	bool
+	# Hidden symbol selected by VFIO_PCI_CXL to pull
+	# drivers/cxl/core/passthrough.c into cxl_core when a vfio
+	# Type-2 passthrough consumer is configured.  Keep silent: no
+	# help text, no default, no user-visible prompt.
+
 config CXL_PCI
 	tristate "PCI manageability"
 	default CXL_BUS
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..0cc80bd35a88 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -22,3 +22,4 @@ cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
 cxl_core-$(CONFIG_CXL_RAS) += ras.o
 cxl_core-$(CONFIG_CXL_RAS) += ras_rch.o
 cxl_core-$(CONFIG_CXL_ATL) += atl.o
+cxl_core-$(CONFIG_CXL_VFIO_PASSTHROUGH) += passthrough.o
diff --git a/drivers/cxl/core/passthrough.c b/drivers/cxl/core/passthrough.c
new file mode 100644
index 000000000000..b89829586024
--- /dev/null
+++ b/drivers/cxl/core/passthrough.c
@@ -0,0 +1,590 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * vfio-pci Type-2 device passthrough — CXL register virtualization.
+ *
+ * Owns the CXL spec-defined virtualization semantics for the
+ *   - CXL Device DVSEC capability body  (CXL r4.0 §8.1.3)
+ *   - HDM Decoder Capability block      (CXL r4.0 §8.2.4.20)
+ *   - CXL.cache/mem (CM) cap-array      (CXL r4.0 §8.2.4)
+ *
+ * vfio-pci is the only caller.  This file is NOT a generic emulation
+ * framework: every register the guest may touch has a hand-written
+ * write handler against the spec.  Reads serve from a shadow
+ * snapshotted at create time; writes update the shadow per the spec
+ * attribute mode for that field.
+ *
+ * Scope: firmware-committed, single-decoder, no-interleave Type-2
+ * passthrough.  Multi-decoder, interleave, and hotplug are
+ * out-of-scope and rejected at create time.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cleanup.h>
+#include <linux/device.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_ids.h>
+#include <linux/pci_regs.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/unaligned.h>
+
+#include <uapi/cxl/cxl_regs.h>
+
+#include <cxlpci.h>
+#include <cxlmem.h>
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+
+#include "core.h"
+
+/* DVSEC CXL Device body offsets — relative to DVSEC capability start.
+ * Body begins at PCI_DVSEC_CXL_CAP (0x0a); preceding bytes are the PCI
+ * ext-cap header and DVSEC headers handled by the generic vfio
+ * perm-bits path.
+ */
+#define DVSEC_OFF_CAPABILITY		PCI_DVSEC_CXL_CAP	/* 0x0a, u16 */
+#define DVSEC_OFF_CONTROL		PCI_DVSEC_CXL_CTRL	/* 0x0c, u16 */
+#define DVSEC_OFF_STATUS		0x0e			/* u16 */
+#define DVSEC_OFF_CONTROL2		0x10			/* u16 */
+#define DVSEC_OFF_STATUS2		0x12			/* u16 */
+#define DVSEC_OFF_LOCK			0x14			/* u16 */
+#define DVSEC_OFF_RANGE1_SIZE_HI	0x18			/* u32 */
+#define DVSEC_OFF_RANGE1_SIZE_LO	0x1c
+#define DVSEC_OFF_RANGE1_BASE_HI	0x20
+#define DVSEC_OFF_RANGE1_BASE_LO	0x24
+#define DVSEC_OFF_RANGE2_SIZE_HI	0x28
+#define DVSEC_OFF_RANGE2_SIZE_LO	0x2c
+#define DVSEC_OFF_RANGE2_BASE_HI	0x30
+#define DVSEC_OFF_RANGE2_BASE_LO	0x34
+#define DVSEC_BODY_END			0x38
+
+#define DVSEC_LOCK_CONFIG_LOCK		BIT(0)
+
+/* HDM Decoder Capability block offsets — relative to HDM block base.
+ * Decoder N register set starts at 0x10 + N * 0x20.
+ */
+#define HDM_OFF_CAP_HEADER		0x00
+#define HDM_OFF_GLOBAL_CTRL		0x04
+#define HDM_DEC_BASE			0x10
+#define HDM_DEC_STRIDE			0x20
+#define HDM_DEC_OFF_BASE_LO(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x00)
+#define HDM_DEC_OFF_BASE_HI(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x04)
+#define HDM_DEC_OFF_SIZE_LO(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x08)
+#define HDM_DEC_OFF_SIZE_HI(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x0c)
+#define HDM_DEC_OFF_CTRL(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x10)
+
+/* HDM Decoder CTRL bits per CXL r4.0 §8.2.4.20.5. */
+#define HDM_CTRL_LOCK_ON_COMMIT		BIT(8)
+#define HDM_CTRL_COMMIT			BIT(9)
+#define HDM_CTRL_COMMITTED		BIT(10)
+#define HDM_CTRL_ERR_NOT_COMMITTED	BIT(11)
+
+struct cxl_passthrough {
+	struct cxl_dev_state *cxlds;
+
+	/* DVSEC body shadow.  Byte-indexed by (off - PCI_DVSEC_CXL_CAP).
+	 * Allocated rounded up to a dword so dword reads at the tail
+	 * never overrun.
+	 */
+	u8 *dvsec_shadow;
+	u16 dvsec_size;			/* full DVSEC cap length, incl. headers */
+	bool dvsec_config_locked;
+
+	/* HDM block shadow.  Byte-indexed; size = hdm_reg_size. */
+	u8 *hdm_shadow;
+	resource_size_t hdm_reg_size;
+
+	/* CM cap-array snapshot.  Dword-indexed by (off / 4) where off
+	 * is the byte offset from CXL_CM_OFFSET.  Read-only after create.
+	 */
+	__le32 *cm_snapshot;
+	size_t cm_snapshot_dwords;
+
+	/* Covers dvsec_shadow + dvsec_config_locked + hdm_shadow.
+	 * cm_snapshot is immutable after create; no lock needed.  Leaf-
+	 * level: no entry point holding this mutex calls into cxl-bus or
+	 * vfio.
+	 */
+	struct mutex lock;
+};
+
+/* ------------------------------------------------------------------ */
+/* Snapshot helpers                                                    */
+/* ------------------------------------------------------------------ */
+
+/* Read the DVSEC body bytes [PCI_DVSEC_CXL_CAP, dvsec_size) from PCI
+ * config space into the shadow.
+ *
+ * The body starts at PCI_DVSEC_CXL_CAP (0x0a), which is word-aligned but
+ * NOT dword-aligned, and CXL r4.0 §8.1.3 places six 16-bit descriptors
+ * (CAPABILITY through LOCK) at offsets 0x0a..0x14 before any 32-bit
+ * field.  pci_read_config_dword() rejects any offset that is not
+ * dword-aligned with PCIBIOS_BAD_REGISTER_NUMBER, so snapshot at the
+ * natural granularity of the body's 16-bit descriptors (2-byte
+ * stride), where every offset in the range is naturally aligned.
+ */
+static int snapshot_dvsec_body(struct cxl_passthrough *p)
+{
+	struct pci_dev *pdev = to_pci_dev(p->cxlds->dev);
+	u16 dvsec = p->cxlds->cxl_dvsec;
+	u16 off;
+	u16 word;
+	int rc;
+
+	for (off = PCI_DVSEC_CXL_CAP; off < p->dvsec_size; off += 2) {
+		rc = pci_read_config_word(pdev, dvsec + off, &word);
+		if (rc)
+			return -EIO;
+		put_unaligned_le16(word, p->dvsec_shadow +
+				   (off - PCI_DVSEC_CXL_CAP));
+	}
+	return 0;
+}
+
+/* Read the CM cap-array prefix [CXL_CM_OFFSET, hdm_reg_offset) from
+ * MMIO into cm_snapshot, and the HDM block [hdm_reg_offset,
+ * hdm_reg_offset + hdm_reg_size) into hdm_shadow.
+ *
+ * @base is a short-lived kva for the component register block,
+ * established by the caller via ioremap() against cxlds->reg_map.resource.
+ * cxl_setup_regs() drops its own ioremap (clears reg_map.base) after the
+ * cap-array probe completes, so this function cannot rely on
+ * cxlds->reg_map.base being valid; the caller passes a fresh mapping
+ * here and releases it once snapshot data has been copied into the
+ * in-memory shadows.
+ */
+static void snapshot_cm_and_hdm(struct cxl_passthrough *p,
+				void __iomem *base,
+				resource_size_t hdm_off)
+{
+	size_t i;
+
+	for (i = 0; i < p->cm_snapshot_dwords; i++)
+		p->cm_snapshot[i] = cpu_to_le32(readl(base + CXL_CM_OFFSET +
+						      i * 4));
+
+	for (i = 0; i < p->hdm_reg_size / 4; i++)
+		put_unaligned_le32(readl(base + hdm_off + i * 4),
+				   p->hdm_shadow + i * 4);
+}
+
+/* ------------------------------------------------------------------ */
+/* devres                                                              */
+/* ------------------------------------------------------------------ */
+
+static void cxl_passthrough_release(struct device *dev, void *res)
+{
+	struct cxl_passthrough *p = *(struct cxl_passthrough **)res;
+
+	kfree(p->dvsec_shadow);
+	kfree(p->hdm_shadow);
+	kfree(p->cm_snapshot);
+	mutex_destroy(&p->lock);
+	kfree(p);
+}
+
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds)
+{
+	struct cxl_passthrough **dres;
+	struct cxl_passthrough *p;
+	struct pci_dev *pdev;
+	resource_size_t hdm_off, hdm_size;
+	size_t dvsec_shadow_size;
+	u8 hdm_count;
+	u32 hdr;
+	int rc;
+
+	/*
+	 * cxl_setup_regs() releases its short-lived ioremap before returning,
+	 * so reg_map.base is NULL by the time we run.  Validate the persistent
+	 * fields (resource address and size) instead; the local ioremap
+	 * established further below covers the snapshot reads.
+	 */
+	if (!dev || !cxlds || !cxlds->dev || !cxlds->cxl_dvsec ||
+	    !cxlds->reg_map.resource || !cxlds->reg_map.max_size)
+		return ERR_PTR(-EINVAL);
+
+	pdev = to_pci_dev(cxlds->dev);
+
+	rc = cxl_get_hdm_info(cxlds, &hdm_count, &hdm_off, &hdm_size);
+	if (rc)
+		return ERR_PTR(rc);
+	if (hdm_count != 1 || !hdm_size || hdm_off <= CXL_CM_OFFSET ||
+	    !IS_ALIGNED(hdm_size, 4))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	p = kzalloc_obj(*p, GFP_KERNEL);
+	if (!p)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&p->lock);
+	p->cxlds = cxlds;
+	p->hdm_reg_size = hdm_size;
+
+	/* Full DVSEC length (headers included) from DVSEC Header 1. */
+	rc = pci_read_config_dword(pdev, cxlds->cxl_dvsec + PCI_DVSEC_HEADER1,
+				   &hdr);
+	if (rc) {
+		rc = -EIO;
+		goto err;
+	}
+	p->dvsec_size = PCI_DVSEC_HEADER1_LEN(hdr);
+	if (p->dvsec_size < DVSEC_BODY_END) {
+		rc = -EINVAL;
+		goto err;
+	}
+
+	dvsec_shadow_size = round_up(p->dvsec_size - PCI_DVSEC_CXL_CAP, 4);
+	p->dvsec_shadow = kzalloc(dvsec_shadow_size, GFP_KERNEL);
+	if (!p->dvsec_shadow) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	p->cm_snapshot_dwords = (hdm_off - CXL_CM_OFFSET) / 4;
+	p->cm_snapshot = kcalloc(p->cm_snapshot_dwords, sizeof(__le32),
+				 GFP_KERNEL);
+	if (!p->cm_snapshot) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	p->hdm_shadow = kzalloc(hdm_size, GFP_KERNEL);
+	if (!p->hdm_shadow) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	rc = snapshot_dvsec_body(p);
+	if (rc)
+		goto err;
+
+	{
+		void __iomem *base;
+
+		/*
+		 * Bind-time-only ioremap.  cxl_setup_regs() has already
+		 * released the cxl-core ioremap (see comment on the entry
+		 * gate).  Take a fresh, short-lived mapping for the
+		 * snapshot, then release it; all subsequent reads serve
+		 * from the in-memory shadows.
+		 */
+		base = ioremap(cxlds->reg_map.resource,
+			       cxlds->reg_map.max_size);
+		if (!base) {
+			rc = -ENOMEM;
+			goto err;
+		}
+		snapshot_cm_and_hdm(p, base, hdm_off);
+		iounmap(base);
+	}
+
+	dres = devres_alloc(cxl_passthrough_release, sizeof(*dres),
+			    GFP_KERNEL);
+	if (!dres) {
+		rc = -ENOMEM;
+		goto err;
+	}
+	*dres = p;
+	devres_add(dev, dres);
+	return p;
+
+err:
+	kfree(p->dvsec_shadow);
+	kfree(p->cm_snapshot);
+	kfree(p->hdm_shadow);
+	mutex_destroy(&p->lock);
+	kfree(p);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_passthrough_create, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* DVSEC write semantics                                               */
+/* ------------------------------------------------------------------ */
+
+static u16 dvsec_shadow_get_u16(struct cxl_passthrough *p, u16 off)
+{
+	return get_unaligned_le16(p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+static void dvsec_shadow_set_u16(struct cxl_passthrough *p, u16 off, u16 val)
+{
+	put_unaligned_le16(val, p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+/* Apply a write to a single DVSEC field at @off, with the field's
+ * native width (2 for descriptors, 4 for RANGE entries).  @width is
+ * the field's spec width; @new is the merged value to apply.  Caller
+ * holds p->lock.
+ */
+static void dvsec_apply_write(struct cxl_passthrough *p, u16 off, size_t width,
+			      u32 new)
+{
+	u16 cur16;
+
+	switch (off) {
+	case DVSEC_OFF_CAPABILITY:
+		/* HwInit — drop. */
+		return;
+	case DVSEC_OFF_CONTROL:
+	case DVSEC_OFF_CONTROL2:
+		/* RWL — gated on CONFIG_LOCK. */
+		if (p->dvsec_config_locked)
+			return;
+		dvsec_shadow_set_u16(p, off, (u16)new);
+		return;
+	case DVSEC_OFF_STATUS:
+	case DVSEC_OFF_STATUS2:
+		/* RW1C — clear bits where the guest wrote 1. */
+		cur16 = dvsec_shadow_get_u16(p, off);
+		dvsec_shadow_set_u16(p, off, cur16 & ~(u16)new);
+		return;
+	case DVSEC_OFF_LOCK:
+		/* RWO — first 1-write latches CONFIG_LOCK; subsequent
+		 * writes are ignored.
+		 */
+		cur16 = dvsec_shadow_get_u16(p, off);
+		if (cur16 & DVSEC_LOCK_CONFIG_LOCK)
+			return;
+		if (new & DVSEC_LOCK_CONFIG_LOCK) {
+			dvsec_shadow_set_u16(p, off,
+					     cur16 | DVSEC_LOCK_CONFIG_LOCK);
+			p->dvsec_config_locked = true;
+		}
+		return;
+	case DVSEC_OFF_RANGE1_SIZE_HI:
+	case DVSEC_OFF_RANGE1_SIZE_LO:
+	case DVSEC_OFF_RANGE1_BASE_HI:
+	case DVSEC_OFF_RANGE1_BASE_LO:
+		/* HwInit — drop. */
+		return;
+	case DVSEC_OFF_RANGE2_SIZE_HI:
+	case DVSEC_OFF_RANGE2_SIZE_LO:
+	case DVSEC_OFF_RANGE2_BASE_HI:
+	case DVSEC_OFF_RANGE2_BASE_LO:
+		/* RsvdZ — drop. */
+		return;
+	default:
+		/* Reserved offsets inside the modelled body: drop. */
+		(void)width;
+		return;
+	}
+}
+
+/* Map a byte offset @off inside the DVSEC body to the natural-width
+ * field that contains it: returns the field's base offset (16-bit
+ * aligned for descriptors, 32-bit aligned for RANGE entries) and width.
+ * Returns false if @off lies outside any modelled field.
+ */
+static bool dvsec_field_at(u16 off, u16 *field_off, size_t *width)
+{
+	if (off >= DVSEC_OFF_CAPABILITY && off < DVSEC_OFF_RANGE1_SIZE_HI) {
+		*field_off = ALIGN_DOWN(off, 2);
+		*width = 2;
+		return true;
+	}
+	if (off >= DVSEC_OFF_RANGE1_SIZE_HI && off < DVSEC_BODY_END) {
+		*field_off = ALIGN_DOWN(off, 4);
+		*width = 4;
+		return true;
+	}
+	return false;
+}
+
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			     size_t sz, bool write)
+{
+	u8 *shadow;
+	u16 field_off;
+	size_t field_width;
+	u32 cur, merged;
+	u32 sub_shift;
+	u32 width_mask;
+
+	if (!p || !val)
+		return -EINVAL;
+	if (sz != 1 && sz != 2 && sz != 4)
+		return -EINVAL;
+	if (off < PCI_DVSEC_CXL_CAP || off + sz > p->dvsec_size)
+		return -EINVAL;
+
+	guard(mutex)(&p->lock);
+
+	shadow = p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP);
+
+	if (!write) {
+		switch (sz) {
+		case 1:
+			*val = *shadow;
+			break;
+		case 2:
+			*val = get_unaligned_le16(shadow);
+			break;
+		case 4:
+			*val = get_unaligned_le32(shadow);
+			break;
+		}
+		return 0;
+	}
+
+	if (!dvsec_field_at(off, &field_off, &field_width))
+		return 0;	/* outside any modelled field: drop */
+
+	/* Read-modify-merge the field at its natural width. */
+	if (field_width == 2)
+		cur = dvsec_shadow_get_u16(p, field_off);
+	else
+		cur = get_unaligned_le32(p->dvsec_shadow +
+					 (field_off - PCI_DVSEC_CXL_CAP));
+
+	width_mask = (sz == 4) ? 0xffffffff : (sz == 2 ? 0xffff : 0xff);
+	sub_shift = (off - field_off) * 8;
+	merged = cur & ~(width_mask << sub_shift);
+	merged |= (*val & width_mask) << sub_shift;
+
+	dvsec_apply_write(p, field_off, field_width, merged);
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_dvsec_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* HDM write semantics                                                 */
+/* ------------------------------------------------------------------ */
+
+static u32 hdm_shadow_get(struct cxl_passthrough *p, u32 off)
+{
+	return get_unaligned_le32(p->hdm_shadow + off);
+}
+
+static void hdm_shadow_set(struct cxl_passthrough *p, u32 off, u32 val)
+{
+	put_unaligned_le32(val, p->hdm_shadow + off);
+}
+
+/* Decoder index for a per-decoder register offset. */
+static u32 hdm_decoder_of(u32 off)
+{
+	return (off - HDM_DEC_BASE) / HDM_DEC_STRIDE;
+}
+
+static u32 hdm_decoder_field(u32 off)
+{
+	return (off - HDM_DEC_BASE) % HDM_DEC_STRIDE;
+}
+
+static void hdm_decoder_ctrl_write(struct cxl_passthrough *p, u32 off, u32 val)
+{
+	u32 cur = hdm_shadow_get(p, off);
+	u32 next;
+
+	/* Once COMMITTED, only the COMMIT toggle is honoured.  Releasing
+	 * COMMIT clears COMMITTED and Lock-on-Commit per CXL r4.0
+	 * §8.2.4.20.5.
+	 */
+	if (cur & HDM_CTRL_COMMITTED) {
+		next = (cur & ~HDM_CTRL_COMMIT) | (val & HDM_CTRL_COMMIT);
+		if (!(val & HDM_CTRL_COMMIT)) {
+			next &= ~HDM_CTRL_COMMITTED;
+			next &= ~HDM_CTRL_LOCK_ON_COMMIT;
+		}
+		hdm_shadow_set(p, off, next);
+		return;
+	}
+
+	next = val & ~(HDM_CTRL_COMMITTED | HDM_CTRL_ERR_NOT_COMMITTED);
+	if (val & HDM_CTRL_COMMIT)
+		next |= HDM_CTRL_COMMITTED;
+	hdm_shadow_set(p, off, next);
+}
+
+static void hdm_decoder_basesize_write(struct cxl_passthrough *p, u32 off,
+				       u32 val)
+{
+	u32 n = hdm_decoder_of(off);
+	u32 ctrl = hdm_shadow_get(p, HDM_DEC_OFF_CTRL(n));
+
+	/* RWL — BASE/SIZE locked when the decoder is committed or
+	 * lock-on-commit has been latched.
+	 */
+	if (ctrl & (HDM_CTRL_COMMITTED | HDM_CTRL_LOCK_ON_COMMIT))
+		return;
+	hdm_shadow_set(p, off, val);
+}
+
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			   bool write)
+{
+	u32 field;
+
+	if (!p || !val)
+		return -EINVAL;
+	if (!IS_ALIGNED(off, 4) || off + 4 > p->hdm_reg_size)
+		return -EINVAL;
+
+	guard(mutex)(&p->lock);
+
+	if (!write) {
+		*val = hdm_shadow_get(p, off);
+		return 0;
+	}
+
+	switch (off) {
+	case HDM_OFF_CAP_HEADER:
+		/* HwInit — drop. */
+		return 0;
+	case HDM_OFF_GLOBAL_CTRL:
+		/* RW — shadow. */
+		hdm_shadow_set(p, off, *val);
+		return 0;
+	}
+
+	if (off < HDM_DEC_BASE)
+		return 0;	/* gap before per-decoder regs: drop */
+
+	field = hdm_decoder_field(off);
+	switch (field) {
+	case 0x00: case 0x04:	/* BASE_LO / BASE_HI */
+	case 0x08: case 0x0c:	/* SIZE_LO / SIZE_HI */
+		hdm_decoder_basesize_write(p, off, *val);
+		return 0;
+	case 0x10:		/* CTRL */
+		hdm_decoder_ctrl_write(p, off, *val);
+		return 0;
+	default:
+		/* TARGET_LIST_{LO,HI} and other per-decoder bytes are
+		 * accepted as plain RW shadow for the firmware-committed
+		 * scope; multi-decoder / interleave behaviour is
+		 * out-of-scope.
+		 */
+		hdm_shadow_set(p, off, *val);
+		return 0;
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_hdm_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* CM cap-array snapshot                                               */
+/* ------------------------------------------------------------------ */
+
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			  bool write)
+{
+	if (!p || !val)
+		return -EINVAL;
+	if (!IS_ALIGNED(off, 4) || off / 4 >= p->cm_snapshot_dwords)
+		return -EINVAL;
+
+	if (write)
+		return 0;	/* cap-array headers are RO; drop. */
+
+	*val = le32_to_cpu(p->cm_snapshot[off / 4]);
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_cm_rw, "CXL");
diff --git a/include/cxl/passthrough.h b/include/cxl/passthrough.h
new file mode 100644
index 000000000000..43214b0d34f6
--- /dev/null
+++ b/include/cxl/passthrough.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * CXL register virtualization helpers for vfio-pci Type-2 passthrough.
+ *
+ * See Documentation/driver-api/vfio-pci-cxl.rst for the ownership
+ * contract.  In short: cxl-core owns the per-device DVSEC body, HDM
+ * Decoder block, and CM cap-array shadows; vfio-pci is a transport
+ * that forwards guest reads and writes through the helpers below.
+ *
+ * The helpers are not a generic emulation framework.  Each register
+ * is hand-coded against CXL r4.0 §8.1.3 and §8.2.4.20.  Adding a new
+ * field is "add a case", not "add a mode".
+ */
+#ifndef __CXL_PASSTHROUGH_H__
+#define __CXL_PASSTHROUGH_H__
+
+#include <linux/types.h>
+
+struct cxl_dev_state;
+struct cxl_passthrough;
+struct device;
+
+/**
+ * devm_cxl_passthrough_create - snapshot a Type-2 device's DVSEC + HDM +
+ * CM cap-array shadows and return the opaque handle the rw helpers
+ * operate on.
+ *
+ * @dev: device whose devres lifetime bounds the returned handle.
+ * @cxlds: CXL device state with cxlds->cxl_dvsec populated and
+ *	   cxlds->reg_map.resource and cxlds->reg_map.max_size describing
+ *	   the component register block.  cxlds->reg_map.base is NOT
+ *	   required; cxl_pci_setup_regs() releases its short-lived
+ *	   ioremap before returning, so this helper takes a local
+ *	   bind-time ioremap against cxlds->reg_map.resource for the
+ *	   duration of the snapshot.
+ *
+ * On success the returned handle is bound to @dev's devres so unwind
+ * happens automatically when @dev is unbound.  The handle must not be
+ * freed by the caller.
+ *
+ * Return: a valid &struct cxl_passthrough on success, ERR_PTR(-errno)
+ * on failure.
+ */
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds);
+
+/**
+ * cxl_passthrough_dvsec_rw - read or write the CXL Device DVSEC body shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the start of the DVSEC capability.  Must be
+ *	 >= PCI_DVSEC_CXL_CAP and (off + sz) must lie inside the DVSEC.
+ *	 Accesses to the PCI ext-cap header bytes (off < PCI_DVSEC_CXL_CAP)
+ *	 are the caller's responsibility; they belong on the generic
+ *	 perm-bits path, not here.
+ * @val: pointer to a u32 holding the read result or the write value.
+ *	 The low @sz bytes of *val are the payload; upper bytes ignored
+ *	 for writes and zero for reads.
+ * @sz: 1, 2, or 4.  Other values return -EINVAL.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow.  Writes update the shadow per the spec
+ * attribute mode for the addressed field (LOCK is RWO, CONTROL/CONTROL2
+ * are RWL gated on CONFIG_LOCK, STATUS/STATUS2 are RW1C, RANGE1 is
+ * HwInit, RANGE2 is RsvdZ; writes to reserved offsets are dropped).
+ *
+ * Known limitation: a 4-byte write whose @off straddles a 16-bit DVSEC
+ * field boundary (CONTROL/STATUS at 0x0c/0x0e, CONTROL2/STATUS2 at
+ * 0x10/0x12) applies only the field containing the first byte of the
+ * access; the adjacent 16-bit field is not updated by the same write.
+ * Standard CXL register-access patterns issue separate 2-byte accesses
+ * to CONTROL, STATUS, CONTROL2 and STATUS2, so this corner case is
+ * documented rather than handled.
+ *
+ * Return: 0 on success; -EINVAL on out-of-range or bad size.
+ */
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			     size_t sz, bool write);
+
+/**
+ * cxl_passthrough_hdm_rw - read or write the HDM Decoder block shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the HDM block base; must be 4-byte aligned and
+ *	 (off + 4) <= hdm_reg_size.  Sub-dword access is not supported on
+ *	 HDM registers per CXL r4.0 §8.2.4.
+ * @val: pointer to a u32 holding the read result or the write value.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow.  Writes implement the per-decoder
+ * COMMIT/COMMITTED handshake (CTRL) and the RWL gating on BASE/SIZE
+ * imposed by COMMITTED|LOCK_ON_COMMIT.  GLOBAL_CTRL is RW; the cap
+ * header is HwInit (writes dropped); other offsets in the per-decoder
+ * stride are RW shadow.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			   bool write);
+
+/**
+ * cxl_passthrough_cm_rw - read or write the CXL.cache/mem cap-array snapshot.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from CXL_CM_OFFSET (the start of the CM cap-array
+ *	 header in the component register block); must be 4-byte aligned
+ *	 and (off + 4) <= cm_snapshot_size.
+ * @val: pointer to a u32 holding the read result; ignored on write.
+ * @write: false for read.  Writes to the cap-array are silently dropped
+ *	   (the array headers are RO per CXL r4.0 §8.2.4); the @write
+ *	   parameter is present only to keep the API symmetric with the
+ *	   other rw helpers and to make the drop policy explicit at the
+ *	   call site.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			  bool write);
+
+#endif /* __CXL_PASSTHROUGH_H__ */
-- 
2.25.1



* [PATCH v3 05/11] vfio: UAPI for CXL Type-2 device passthrough
From: mhonap @ 2026-06-25 16:54 UTC
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Add the user-visible interface that exposes a CXL Type-2 device to a
VMM through vfio-pci:

  VFIO_DEVICE_FLAGS_CXL (bit 9) on vfio_device_info::flags marks the
  device as CXL.

  VFIO_DEVICE_INFO_CAP_CXL (id 6) is the capability that carries the
  HDM-backed memory region index, the CXL component register region
  index, and the layout of the component register block within the
  containing PCI BAR.

  VFIO_REGION_SUBTYPE_CXL identifies the HDM memory region.
  VFIO_REGION_SUBTYPE_CXL_COMP_REGS identifies the CXL component
  register shadow.

Only the HOST_FIRMWARE_COMMITTED flag is exposed.  Other CXL device
states stay invisible to userspace at this stage.
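A VMM would typically locate the new capability by walking the info
capability chain.  The sketch below does that over a plain byte buffer
in userspace C.  It assumes the usual VFIO convention that each
header's next field is an offset from the start of the info buffer and
that 0 ends the chain; struct cap_header is a local mirror of
struct vfio_info_cap_header, and find_cap() is a hypothetical helper,
not part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;
};

#define CAP_ID_CXL	6	/* VFIO_DEVICE_INFO_CAP_CXL in this patch */

/* Return the offset of the capability with @id, or 0 if absent. */
static uint32_t find_cap(const uint8_t *buf, size_t len, uint32_t first,
			 uint16_t id)
{
	uint32_t off = first;

	assert(buf);
	while (off != 0 && (size_t)off + sizeof(struct cap_header) <= len) {
		struct cap_header hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		if (hdr.next <= off)	/* end of chain, or a bad link */
			break;
		off = hdr.next;
	}
	return 0;
}
```

Once the offset is known, the caller can copy out the capability body
and read hdm_region_idx, comp_reg_region_idx and the comp_reg_* layout
fields from it.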

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 include/uapi/linux/vfio.h | 46 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..3707d53c4de5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -215,6 +215,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6)	/* vfio-fsl-mc device */
 #define VFIO_DEVICE_FLAGS_CAPS	(1 << 7)	/* Info supports caps */
 #define VFIO_DEVICE_FLAGS_CDX	(1 << 8)	/* vfio-cdx device */
+#define VFIO_DEVICE_FLAGS_CXL	(1 << 9)	/* vfio-cxl Type-2 device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
@@ -257,6 +258,36 @@ struct vfio_device_info_cap_pci_atomic_comp {
 	__u32 reserved;
 };
 
+/*
+ * VFIO_DEVICE_INFO capability for CXL Type-2 passthrough devices.
+ * Present when VFIO_DEVICE_FLAGS_CXL is set on vfio_device_info::flags.
+ *
+ * @flags: VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED indicates the host CXL
+ *	subsystem committed the endpoint HDM decoder.
+ * @hdm_region_idx: VFIO region index for the HDM memory region
+ *	(subtype VFIO_REGION_SUBTYPE_CXL).
+ * @comp_reg_region_idx: VFIO region index for the CXL Component
+ *	Register shadow (subtype VFIO_REGION_SUBTYPE_CXL_COMP_REGS).
+ * @comp_reg_bar: PCI BAR index that contains the CXL component
+ *	register block.  Get-region-info on this BAR returns a
+ *	VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL block.
+ * @comp_reg_offset: byte offset of the CXL component register block
+ *	within @comp_reg_bar.
+ * @comp_reg_size: byte size of the CXL component register block.
+ */
+#define VFIO_DEVICE_INFO_CAP_CXL		6
+struct vfio_device_info_cap_cxl {
+	struct vfio_info_cap_header header;
+	__u32 flags;
+#define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED	(1 << 0)
+	__u32 hdm_region_idx;
+	__u32 comp_reg_region_idx;
+	__u32 comp_reg_bar;
+	__u32 __resv;
+	__u64 comp_reg_offset;
+	__u64 comp_reg_size;
+};
+
 /**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
  *				       struct vfio_region_info)
@@ -425,6 +456,21 @@ struct vfio_region_gfx_edid {
 #define VFIO_REGION_SUBTYPE_CCW_SCHIB		(2)
 #define VFIO_REGION_SUBTYPE_CCW_CRW		(3)
 
+/*
+ * sub-types for VFIO_REGION_TYPE_PCI_VENDOR (vendor id 1e98 reserved
+ * for the CXL Consortium); used by vfio-cxl Type-2 device passthrough.
+ *
+ * VFIO_REGION_SUBTYPE_CXL exposes the HDM-backed device memory range
+ *   as a mappable region.  The range is allocated by the host CXL
+ *   subsystem and the VMM is expected to mmap() it.
+ * VFIO_REGION_SUBTYPE_CXL_COMP_REGS exposes the CXL Component Register
+ *   block (read-write via pread()/pwrite() only, no mmap()).  The VMM
+ *   reads and writes HDM Decoder Capability registers through this
+ *   shadow region instead of touching hardware directly.
+ */
+#define VFIO_REGION_SUBTYPE_CXL			(1)
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS	(2)
+
 /* sub-types for VFIO_REGION_TYPE_MIGRATION */
 #define VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED (1)
 
-- 
2.25.1



* [PATCH v3 04/11] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
From: mhonap @ 2026-06-25 16:54 UTC
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

The CXL component register layout and the HDM Decoder Capability
Structure defines live in drivers/cxl/cxl.h, where userspace
consumers cannot include them without depending on kernel-only
headers.  A VMM that owns a vfio-cxl COMP_REGS shadow region needs
these defines to interpret the shadow contents.

Move the spec-defined register layout, capability identifiers, and
HDM decoder field masks to a new public uapi header,
include/uapi/cxl/cxl_regs.h.  Use __GENMASK() and _BITUL() (not
GENMASK() / BIT()) so the header is uapi-clean.  Include
<asm/bitsperlong.h> for the __BITS_PER_LONG that __GENMASK() needs.

drivers/cxl/cxl.h now includes <uapi/cxl/cxl_regs.h>; the values
are identical, so kernel callers see no change.  Static inline
helpers that use FIELD_GET stay in drivers/cxl/cxl.h.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
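As a rough illustration of how a VMM might consume the CM capability
header layout moved by this patch, the sketch below splits one header
dword into its fields.  It is not part of the patch: the struct and
helper names are hypothetical, and the field positions are written as
plain shifts (mirroring the CXL_CM_CAP_HDR_*_MASK ranges) so that it
builds without the new uapi header.

```c
#include <stdint.h>

/*
 * Local stand-ins for the CXL_CM_CAP_HDR_*_MASK fields in this patch,
 * spelled as shifts so the sketch builds without <uapi/cxl/cxl_regs.h>.
 */
struct cm_cap_hdr {
	uint16_t cap_id;		/* bits 15:0 */
	uint8_t  cap_version;		/* bits 19:16 */
	uint8_t  cache_mem_version;	/* bits 23:20 */
	uint8_t  array_size;		/* bits 31:24 */
};

/* Hypothetical helper: split one CM capability header dword into fields. */
struct cm_cap_hdr cm_cap_hdr_decode(uint32_t hdr)
{
	struct cm_cap_hdr h = {
		.cap_id            = (uint16_t)(hdr & 0xffffu),
		.cap_version       = (uint8_t)((hdr >> 16) & 0xfu),
		.cache_mem_version = (uint8_t)((hdr >> 20) & 0xfu),
		.array_size        = (uint8_t)((hdr >> 24) & 0xffu),
	};

	return h;
}
```

A consumer would typically compare cap_id against CM_CAP_HDR_CAP_ID (1)
before walking array_size capability entries.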
 drivers/cxl/cxl.h           | 52 +++++-------------------------
 include/uapi/cxl/cxl_regs.h | 63 +++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 45 deletions(-)
 create mode 100644 include/uapi/cxl/cxl_regs.h

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f43abd1903ce..583a27b6659e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -24,51 +24,13 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
  * (port-driver, region-driver, nvdimm object-drivers... etc).
  */
 
-/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
-#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
-
-/* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
-#define CXL_CM_OFFSET 0x1000
-#define CXL_CM_CAP_HDR_OFFSET 0x0
-#define   CXL_CM_CAP_HDR_ID_MASK GENMASK(15, 0)
-#define     CM_CAP_HDR_CAP_ID 1
-#define   CXL_CM_CAP_HDR_VERSION_MASK GENMASK(19, 16)
-#define     CM_CAP_HDR_CAP_VERSION 1
-#define   CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK GENMASK(23, 20)
-#define     CM_CAP_HDR_CACHE_MEM_VERSION 1
-#define   CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
-#define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
-
-#define   CXL_CM_CAP_CAP_ID_RAS 0x2
-#define   CXL_CM_CAP_CAP_ID_HDM 0x5
-#define   CXL_CM_CAP_CAP_HDM_VERSION 1
-
-/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
-#define CXL_HDM_DECODER_CAP_OFFSET 0x0
-#define   CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
-#define   CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
-#define   CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
-#define   CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
-#define   CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
-#define   CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
-#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
-#define   CXL_HDM_DECODER_ENABLE BIT(1)
-#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
-#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
-#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
-#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
-#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
-#define   CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
-#define   CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
-#define   CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
-#define   CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
-#define   CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
-#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
-#define   CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
-#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
-#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
-#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
-#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+/*
+ * Spec-defined CXL component register layout and HDM Decoder
+ * Capability Structure constants live in <uapi/cxl/cxl_regs.h> so a
+ * userspace VMM that owns a vfio-cxl COMP_REGS shadow region can
+ * consume them without depending on kernel-only headers.
+ */
+#include <uapi/cxl/cxl_regs.h>
 
 /* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
 #define CXL_DECODER_MIN_GRANULARITY 256
diff --git a/include/uapi/cxl/cxl_regs.h b/include/uapi/cxl/cxl_regs.h
new file mode 100644
index 000000000000..b284b7ad2d42
--- /dev/null
+++ b/include/uapi/cxl/cxl_regs.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
+/*
+ * CXL component register layout and HDM Decoder Capability Structure
+ * defines.  Userspace consumers (e.g. a VMM that owns a vfio-cxl
+ * COMP_REGS shadow region) need these without kernel-only header
+ * dependencies.
+ *
+ * Spec references: CXL r4.0 sections 8.2.3 and 8.2.4.20.
+ */
+#ifndef _UAPI_CXL_REGS_H_
+#define _UAPI_CXL_REGS_H_
+
+#include <asm/bitsperlong.h>	/* __BITS_PER_LONG; needed by __GENMASK() */
+#include <linux/const.h>	/* _BITUL(), _BITULL() */
+#include <linux/bits.h>		/* __GENMASK() */
+
+/* CXL r4.0 8.2.3 CXL Component Register Layout and Definition */
+#define CXL_COMPONENT_REG_BLOCK_SIZE		0x00010000
+
+/* CXL r4.0 8.2.4 CXL.cache and CXL.mem Registers */
+#define CXL_CM_OFFSET				0x1000
+#define CXL_CM_CAP_HDR_OFFSET			0x0
+#define   CXL_CM_CAP_HDR_ID_MASK		__GENMASK(15, 0)
+#define     CM_CAP_HDR_CAP_ID			1
+#define   CXL_CM_CAP_HDR_VERSION_MASK		__GENMASK(19, 16)
+#define     CM_CAP_HDR_CAP_VERSION		1
+#define   CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK	__GENMASK(23, 20)
+#define     CM_CAP_HDR_CACHE_MEM_VERSION	1
+#define   CXL_CM_CAP_HDR_ARRAY_SIZE_MASK	__GENMASK(31, 24)
+#define CXL_CM_CAP_PTR_MASK			__GENMASK(31, 20)
+
+#define   CXL_CM_CAP_CAP_ID_RAS			0x2
+#define   CXL_CM_CAP_CAP_ID_HDM			0x5
+#define   CXL_CM_CAP_CAP_HDM_VERSION		1
+
+/* HDM decoders, CXL r4.0 8.2.4.20 */
+#define CXL_HDM_DECODER_CAP_OFFSET		0x0
+#define   CXL_HDM_DECODER_COUNT_MASK		__GENMASK(3, 0)
+#define   CXL_HDM_DECODER_TARGET_COUNT_MASK	__GENMASK(7, 4)
+#define   CXL_HDM_DECODER_INTERLEAVE_11_8	_BITUL(8)
+#define   CXL_HDM_DECODER_INTERLEAVE_14_12	_BITUL(9)
+#define   CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY	_BITUL(11)
+#define   CXL_HDM_DECODER_INTERLEAVE_16_WAY	_BITUL(12)
+#define CXL_HDM_DECODER_CTRL_OFFSET		0x4
+#define   CXL_HDM_DECODER_ENABLE		_BITUL(1)
+#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i)	(0x20 * (i) + 0x10)
+#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i)	(0x20 * (i) + 0x14)
+#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i)	(0x20 * (i) + 0x18)
+#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i)	(0x20 * (i) + 0x1c)
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i)		(0x20 * (i) + 0x20)
+#define   CXL_HDM_DECODER0_CTRL_IG_MASK		__GENMASK(3, 0)
+#define   CXL_HDM_DECODER0_CTRL_IW_MASK		__GENMASK(7, 4)
+#define   CXL_HDM_DECODER0_CTRL_LOCK		_BITUL(8)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT		_BITUL(9)
+#define   CXL_HDM_DECODER0_CTRL_COMMITTED	_BITUL(10)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR	_BITUL(11)
+#define   CXL_HDM_DECODER0_CTRL_HOSTONLY	_BITUL(12)
+#define CXL_HDM_DECODER0_TL_LOW(i)		(0x20 * (i) + 0x24)
+#define CXL_HDM_DECODER0_TL_HIGH(i)		(0x20 * (i) + 0x28)
+#define CXL_HDM_DECODER0_SKIP_LOW(i)		CXL_HDM_DECODER0_TL_LOW(i)
+#define CXL_HDM_DECODER0_SKIP_HIGH(i)		CXL_HDM_DECODER0_TL_HIGH(i)
+
+#endif /* _UAPI_CXL_REGS_H_ */
-- 
2.25.1



* [PATCH v3 03/11] cxl: Record BIR and BAR offset in cxl_register_map
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

The Register Locator DVSEC (CXL r4.0 8.1.9) describes register blocks
by BAR index (BIR) and offset within the BAR.  CXL core currently
only stores the resolved HPA (resource + offset) in struct
cxl_register_map, so callers that need pci_iomap() or want to report
the BAR to userspace must reverse-engineer the BAR from the HPA.

Add bar_index and bar_offset to struct cxl_register_map and fill
them in cxl_decode_regblock() when the regblock is BAR-backed
(BIR 0-5).  Add cxl_regblock_get_bar_info() so cxl drivers
(vfio-cxl, in-kernel accelerator drivers) can read the values
without touching the struct internals.  Export under the CXL
namespace.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/core/pci.c  |  2 ++
 drivers/cxl/core/regs.c | 34 ++++++++++++++++++++++++++++++++++
 include/cxl/cxl.h       | 12 ++++++++++++
 3 files changed, 48 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c44595447bd8..9b9b17db9ee4 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -764,6 +764,8 @@ static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
 	*map = (struct cxl_register_map) {
 		.host = &pdev->dev,
 		.resource = CXL_RESOURCE_NONE,
+		.bar_index = 0xff,
+		.bar_offset = 0,
 	};
 
 	component_reg_phys = cxl_rcd_component_reg_phys(&pdev->dev, dport);
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e828df0629d0..6af5739aa776 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -285,12 +285,46 @@ static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
 		return false;
 	}
 
+	if (bar >= 0 && bar <= 5) {
+		map->bar_index = (u8)bar;
+		map->bar_offset = offset;
+	} else {
+		map->bar_index = 0xff;
+		map->bar_offset = 0;
+	}
+
 	map->reg_type = reg_type;
 	map->resource = pci_resource_start(pdev, bar) + offset;
 	map->max_size = pci_resource_len(pdev, bar) - offset;
 	return true;
 }
 
+/**
+ * cxl_regblock_get_bar_info - read BAR index and offset for a regblock
+ * @map: regblock map produced by cxl_find_regblock()
+ * @bar_index: out, PCI BAR index (0-5)
+ * @bar_offset: out, byte offset of the regblock within the BAR
+ *
+ * Exported for cxl drivers (vfio-cxl, in-kernel accelerator drivers)
+ * that need to map the regblock via pci_iomap() or report the BAR to
+ * userspace.
+ *
+ * Return: 0 on success, -EINVAL if the regblock is not BAR-backed or
+ * if @map or any out pointer is NULL.
+ */
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+			      u8 *bar_index, resource_size_t *bar_offset)
+{
+	if (!map || !bar_index || !bar_offset)
+		return -EINVAL;
+	if (map->bar_index > 5)
+		return -EINVAL;
+	*bar_index = map->bar_index;
+	*bar_offset = map->bar_offset;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
+
 /*
  * __cxl_find_regblock_instance() - Locate a register block or count instances by type / index
  * Use CXL_INSTANCES_COUNT for @index if counting instances.
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 3dcc034360af..3bcb71d80c91 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -100,9 +100,16 @@ struct cxl_pmu_reg_map {
  * @resource: physical resource base of the register block
  * @max_size: maximum mapping size to perform register search
  * @reg_type: see enum cxl_regloc_type
+ * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xff otherwise
+ * @bar_offset: offset within the BAR; only valid when bar_index <= 5
  * @component_map: cxl_reg_map for component registers
  * @device_map: cxl_reg_maps for device registers
  * @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
+ *
+ * When the register block is described by the Register Locator DVSEC with
+ * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so callers
+ * can use pci_iomap(pdev, bar_index, size) and base + bar_offset instead
+ * of ioremap(resource).
  */
 struct cxl_register_map {
 	struct device *host;
@@ -110,6 +117,8 @@ struct cxl_register_map {
 	resource_size_t resource;
 	resource_size_t max_size;
 	u8 reg_type;
+	u8 bar_index;
+	resource_size_t bar_offset;
 	union {
 		struct cxl_component_reg_map component_map;
 		struct cxl_device_reg_map device_map;
@@ -234,4 +243,7 @@ int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
 		     resource_size_t *offset, resource_size_t *size);
 
 int cxl_await_range_active(struct cxl_dev_state *cxlds);
+
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+			      u8 *bar_index, resource_size_t *bar_offset);
 #endif /* __CXL_CXL_H__ */
-- 
2.25.1



* [PATCH v3 02/11] cxl: Split cxl_await_range_active() from media-ready wait
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Before accessing CXL device memory after reset or power-on, the
driver must ensure media is ready.  Not every CXL device implements
the CXL Memory Device register group: many Type-2 devices do not.
cxl_await_media_ready() reads cxlds->regs.memdev.  Access to memdev
registers on a Type-2 device that lacks them can result in a kernel
panic.

Split the HDM DVSEC range-active poll out of cxl_await_media_ready()
into a new helper cxl_await_range_active().  Type-2 cxl drivers
(vfio-cxl, in-kernel accelerator drivers) that lack the CXLMDEV
status register call this directly.  cxl_await_media_ready() now
calls cxl_await_range_active() for the DVSEC poll, then reads the
memory device status as before.

The existing per-range timeout from cxl_await_media_ready()
(media_ready_timeout module parameter, 60 seconds by default) still
applies.  Export under the CXL namespace.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
 include/cxl/cxl.h      |  2 ++
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c917608c16f9..c44595447bd8 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
 	return 0;
 }
 
-/*
- * Wait up to @media_ready_timeout for the device to report memory
- * active.
+/**
+ * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to be active
+ * @cxlds: CXL device state (DVSEC and HDM count must be valid)
+ *
+ * For each HDM decoder range reported in the CXL DVSEC capability, waits
+ * for the range to report MEM INFO VALID (up to 1s per range), then
+ * MEM ACTIVE (up to media_ready_timeout seconds per range, default 60s).
+ * Used by cxl_await_media_ready() and by cxl drivers that bind to Type-2
+ * devices without the memdev mailbox (e.g. vfio-cxl, accelerator drivers).
+ *
+ * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a
+ * timeout occurs, or a negative errno from config read on failure.
  */
-int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int cxl_await_range_active(struct cxl_dev_state *cxlds)
 {
 	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
 	int d = cxlds->cxl_dvsec;
 	int rc, i, hdm_count;
-	u64 md_status;
 	u16 cap;
 
 	rc = pci_read_config_word(pdev,
@@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
 			return rc;
 	}
 
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
+
+/*
+ * Wait up to @media_ready_timeout for the device to report memory
+ * active.
+ */
+int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+{
+	u64 md_status;
+	int rc;
+
+	rc = cxl_await_range_active(cxlds);
+	if (rc)
+		return rc;
+
 	md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
 	if (!CXLMDEV_READY(md_status))
 		return -EIO;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 440ab09c640e..3dcc034360af 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -232,4 +232,6 @@ int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
 
 int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
 		     resource_size_t *offset, resource_size_t *size);
+
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
 #endif /* __CXL_CXL_H__ */
-- 
2.25.1



* [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest

From: Manish Honap <mhonap@nvidia.com>

CXL Type-2 accelerators (CXL.mem-capable GPUs and similar) cannot be
passed through to virtual machines with stock vfio-pci because the
driver has no concept of HDM decoder management, HDM region exposure,
or component register virtualization.  This series adds those three
pieces, sufficient for a guest to use the device's firmware-committed
coherent memory under UVM / ATS.

v3 is a rewrite of the v2 framework form, responding to Dan's request
in the v2 review for "less emulation, narrower interfaces, and a
closer mapping to the spec language."
In this release, cxl-core exposes four EXPORT_SYMBOL_GPL helpers behind
an opaque handle.  vfio-pci becomes a thin transport on top of those.
Please see "Changes since v2" and "Reviewer feedback addressed" below for
the per-area summary.

Motivation
==========

A CXL Type-2 device exposes its HDM-mapped device memory through HDM
decoders that BIOS programs and commits at boot.  To pass such a
device to a guest, vfio-pci has to do three things at once:

  1. Surface the firmware-committed HDM-mapped HPA range as a guest-
     mmappable region.

  2. Surface a CXL-spec-compliant view of the CXL Device DVSEC body,
     the HDM Decoder Capability block, and the CXL.cache/mem cap-array
     prefix, so the guest's CXL driver enumerates the same topology
     the host saw.

  3. Keep the host's committed decoder configuration intact (the
     physical decoder is never reprogrammed) while letting the guest
     observe and manage a shadow that follows the per-field write
     semantics in the spec.

The series builds on Alejandro Lucero-Palau's v28 work
applied on for-7.3/cxl-type2-enabling [1] (sfc is the in-tree consumer
today). vfio-pci becomes the second consumer.

Architecture
============

cxl-core owns the CXL semantics.  A new file
drivers/cxl/core/passthrough.c (gated by hidden Kconfig
CXL_VFIO_PASSTHROUGH) provides four exported symbols:

    struct cxl_passthrough *
    devm_cxl_passthrough_create(struct device *dev,
                                struct cxl_dev_state *cxlds);

    int cxl_passthrough_dvsec_rw(p, off, val, sz, write);
    int cxl_passthrough_hdm_rw  (p, off, val,      write);
    int cxl_passthrough_cm_rw   (p, off, val,      write);

cxl_passthrough is an opaque handle; vfio-pci sees no cxl-internal
struct pointers.  The shadows are snapshotted at create time: the
DVSEC body from PCI config space dword by dword, the CM cap-array and
HDM block from the cxl-core MMIO mapping at cxlds->reg_map.base.
Per-field write semantics:

CXL r4.0 8.1.3 DVSEC:
- LOCK is RWO,
- CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
- STATUS/STATUS2 are RW1C,
- RANGE1 is HwInit, RANGE2 is RsvdZ.

CXL r4.0 8.2.4.20 HDM:
- GLOBAL_CTRL is RW,
- decoder CTRL implements COMMIT/COMMITTED,
- decoder BASE/SIZE are RWL gated on COMMITTED or LOCK_ON_COMMIT,
- cap header is HwInit.
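
The access types named above can be sketched as pure functions on one
shadow dword.  This is an illustration only, not the code in
passthrough.c: the function names and mask parameters are hypothetical,
and RWO is modelled under the assumption that a written 1 sticks and
cannot be cleared by later writes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each helper returns the new shadow value after a guest write of @val. */

/* RW1C: writing 1 clears the bit, writing 0 leaves it unchanged. */
uint32_t shadow_write_rw1c(uint32_t cur, uint32_t val, uint32_t mask)
{
	return cur & ~(val & mask);
}

/* RWL: writable bits take the written value unless the lock is held. */
uint32_t shadow_write_rwl(uint32_t cur, uint32_t val, uint32_t mask,
			  bool locked)
{
	if (locked)
		return cur;
	return (cur & ~mask) | (val & mask);
}

/* RWO (assumed model): a written 1 sticks; writes never clear the bit. */
uint32_t shadow_write_rwo(uint32_t cur, uint32_t val, uint32_t mask)
{
	return cur | (val & mask);
}

/* HwInit: fixed at snapshot time, so guest writes are dropped. */
uint32_t shadow_write_hwinit(uint32_t cur, uint32_t val)
{
	(void)val;
	return cur;
}
```

Keeping each rule as a function of (current, written, mask) makes it
straightforward to test one access type at a time.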

vfio-pci becomes a thin transport.  The new module
drivers/vfio/pci/cxl/ exposes two VFIO regions.

  VFIO_REGION_SUBTYPE_CXL (HDM region): mmappable view of the
  HDM-mapped HPA. The mmap fault handler calls vmf_insert_pfn() with
  the PFN of the physical HPA. pread/pwrite go through the
  memremap(MEMREMAP_WB) kernel virtual address captured at bind time.

  VFIO_REGION_SUBTYPE_CXL_COMP_REGS (component register shadow):
  pread/pwrite only, dword-aligned (-EINVAL on misalignment).
  Each dword dispatches by offset to cxl_passthrough_cm_rw() or
  cxl_passthrough_hdm_rw(). No shadow state on the vfio side; cxl-core
  enforces the spec.

CXL DVSEC config-space accesses use a clipping shim in
vfio_pci_config_rw_single(). A config-space chunk that crosses the
DVSEC body boundary is split: header bytes go through the generic
perm-bits path, body bytes go through cxl_passthrough_dvsec_rw().
The shim replaces v2's approach of repointing ecap_perms[].
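
The clipping step described above can be sketched as follows.  This is
a minimal illustration under assumed names, not the code added to
vfio_pci_config_rw_single(): struct cfg_chunk and clip_chunk() are
hypothetical, and the DVSEC body is taken to be the half-open byte
range [body_start, body_end).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cfg_chunk {
	size_t len;	/* bytes to handle in this step */
	bool in_body;	/* true: route to the DVSEC body handler */
};

/*
 * Clip an access of @count bytes at @pos so that the returned chunk
 * lies entirely inside or entirely outside the DVSEC body.  The caller
 * advances @pos by the returned length and calls again for the rest.
 */
struct cfg_chunk clip_chunk(uint32_t pos, size_t count,
			    uint32_t body_start, uint32_t body_end)
{
	struct cfg_chunk c = { .len = count, .in_body = false };

	if (pos >= body_start && pos < body_end) {
		/* Starts in the body: stop at the end of the body. */
		c.in_body = true;
		if (count > (size_t)(body_end - pos))
			c.len = body_end - pos;
	} else if (pos < body_start && count > (size_t)(body_start - pos)) {
		/* Starts in the header: stop at the start of the body. */
		c.len = body_start - pos;
	}

	return c;
}
```

With an illustrative body at [0x10a, 0x13c), a 4-byte access at 0x108
yields a 2-byte header chunk followed by a 2-byte body chunk.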

Sparse-mmap is exposed on the component BAR so userspace can mmap the
non-component portions directly; only the CXL component register
sub-range goes through pread/pwrite emulation. The CXL sub-range is
also skipped from vfio-pci-core's pci_request_selected_regions() set
because cxl-core's devm_cxl_probe_mem() already holds a
request_mem_region() on it; the asymmetric skip is matched by an
asymmetric release on disable().
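
As a sketch of the sparse-mmap layout described above, the helper below
derives the mappable areas of the component BAR from the position of
the CXL component register block.  The names are hypothetical, page
alignment of the resulting areas is assumed rather than enforced, and
the real SPARSE_MMAP capability is built by the kernel, not by the
VMM.

```c
#include <stdint.h>

struct mmap_area {
	uint64_t offset;	/* byte offset within the BAR */
	uint64_t size;		/* byte length of the mappable area */
};

/*
 * Fill @out with the parts of a BAR of @bar_size bytes that lie outside
 * the CXL block [cxl_off, cxl_off + cxl_size).  Returns the number of
 * areas written (0, 1 or 2), or -1 if the block does not fit the BAR.
 */
int sparse_areas(uint64_t bar_size, uint64_t cxl_off, uint64_t cxl_size,
		 struct mmap_area out[2])
{
	uint64_t cxl_end;
	int n = 0;

	if (cxl_off > bar_size || cxl_size > bar_size - cxl_off)
		return -1;
	cxl_end = cxl_off + cxl_size;

	if (cxl_off > 0) {
		out[n].offset = 0;
		out[n].size = cxl_off;
		n++;
	}
	if (cxl_end < bar_size) {
		out[n].offset = cxl_end;
		out[n].size = bar_size - cxl_end;
		n++;
	}

	return n;
}
```

A VMM could compare areas computed this way from comp_reg_offset and
comp_reg_size against the SPARSE_MMAP capability it reads back.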

Scope and out-of-scope
======================

In scope (rejected at create time with -EOPNOTSUPP otherwise):

  - Firmware-committed devices (HOST_FIRMWARE_COMMITTED set).
  - Single HDM decoder (hdm_count == 1).
  - No interleave (IW == 0).

Out of scope, deferred for follow-on work:

  - Multi-decoder devices and interleave.
  - Guest-driven (non-firmware-committed) HDM commit.
  - Hotplug, FLR, and sibling-function reset of CXL Type-2 devices.

Changes since v2
================

This is a rewrite, not an incremental update.  The structure of the
series changed (20 patches in v2 to 11 in v3) because v3 collapses
v2 patches 9-15 (detection, HDM emulation, media readiness, region
management, HDM region, DVSEC emulation) into one cxl-core helper
file and one vfio-pci consumer.

Framework replaced by narrow opaque-handle helpers (patches 6, 8)

  v2 carried a generic register-emulation framework split across four
  state-machine files in cxl-core.
  v3 collapses it into one file: drivers/cxl/core/passthrough.c
  exposing the four EXPORT_SYMBOL_GPL helpers above behind a struct
  cxl_passthrough opaque handle.

Shadow ownership moved into cxl-core (patches 6, 8)

  vfio-pci no longer keeps any per-field state. It forwards
  (offset, value) into cxl-core, and cxl-core enforces the spec
  (RWO, RWL, RW1C, HwInit, RsvdZ) with explicit CXL r4.0 section
  references in the switch arms.

DVSEC config-space clipping shim (patch 8)

  v2 repointed ecap_perms[] to redirect CXL DVSEC reads and writes.
  v3 keeps ecap_perms[] untouched and clips per-config-access chunks
  at the DVSEC body boundary in vfio_pci_config_rw_single(); header bytes
  go through the generic perm-bits path, body bytes go through
  cxl_passthrough_dvsec_rw(). The shim is local to the per-device
  path.

CONFIG_VFIO_PCI_CXL gates the new module (patch 7)

  v2 had a CONFIG_VFIO_CXL_CORE Kconfig stub; v3 renames it to
  CONFIG_VFIO_PCI_CXL to match the vfio-pci naming convention.
  The hidden CXL_VFIO_PASSTHROUGH selects the cxl-core helper file
  on demand. With both disabled, the cxl-core size is unchanged.

UAPI rewritten with named fields (patch 5)

  vfio_device_info_cap_cxl in v3 carries:
    flags + HOST_FIRMWARE_COMMITTED bit
    hdm_region_idx
    comp_reg_region_idx
    comp_reg_bar
    comp_reg_offset
    comp_reg_size
  The DPA terminology is renamed to HDM region throughout.
  CACHE_CAPABLE (HDM-DB indicator) is dropped; it was informational
  only in v2 with no caller, and can be re-added later by a series
  that actually plumbs CXL.cache.

Selftests trimmed (patch 9)

  v2 carried selftests for device detection, capability parsing,
  region enumeration, HDM register emulation, HDM mmap with
  page-fault insertion, FLR invalidation, and DVSEC register
  emulation. v3 keeps a smoke-test set of six focused tests:

    device_is_cxl                  GET_INFO advertises FLAGS_CXL
                                   and a populated CAP_CXL.
    hdm_region_mmap_rw             mmap one page, write+read back.
    component_bar_sparse_mmap      SPARSE_MMAP cap excludes the
                                   CXL component register sub-range.
    comp_regs_cm_cap_array_read    pread of the CM cap-array
                                   header at CXL_CM_OFFSET succeeds
                                   (CAP_ID == 1).
    dvsec_lock_byte_read           pread of the DVSEC CONFIG_LOCK
                                   byte through the clipping shim
                                   succeeds.
    hdm_decoder_commit_fsm         COMMIT / COMMITTED state machine
                                   and LOCK_ON_COMMIT behaviour.

  FLR invalidation, page-fault insertion under load, and full
  DVSEC field-by-field write coverage are deferred to a follow-on
  selftest series. The current six are the minimal set that
  exercises the kernel-side contract end-to-end.

cxl-core prep patches split (patches 1-4)

  v3 keeps the cxl-side enablers from v2 patches 1-4 but each as
  a standalone change so the cxl maintainer can review the helper
  API independently of the vfio consumer:

    [1/11] cxl_get_hdm_info()
    [2/11] cxl_await_range_active() split from media-ready wait
    [3/11] cxl_register_map records BIR + BAR offset
    [4/11] component/HDM register defines moved to uapi/cxl/cxl_regs.h

Reviewer feedback addressed
===========================

Dan
---

- VFIO exposes HDM/host-visible region, not raw DPA; docs/UAPI say HDM
  region, DPA only inside cxl-core where appropriate.
- One vfio-pci device = one HDM region / one decoder, no interleave;
  hdm_count != 1 → -EOPNOTSUPP.
- Global HDM on DVSEC Range Base treated as legacy; RANGE1/RANGE2
  read-only snapshot, guest writes dropped.
- No guest/kernel lock games; DVSEC LOCK and HDM LOCK_ON_COMMIT RWO,
  fixed at create from firmware snapshot.
- Opaque cxl_passthrough handle only; vfio gets HPA via memdev probe +
  layout via cxl_get_hdm_info(), rw via helpers.
- No multi-region accelerator case in v3; single region enforced,
  multi-region deferred.
- cxl_await_range_active stays in cxl-core probe; not exported, vfio does
  not call it.
- No guest LOCK→0 reprogram; guest cannot clear LOCK to remap host HPA;
  kernel uncommit tied to COMMIT, not LOCK alone.

Jason / Gregory / Dan
---------------------

- memremap(WB) + request_mem_region on HPA; conflicting direct-map/EFI use
  fails probe with -EBUSY.

Jonathan
--------

- uapi/cxl/cxl_regs.h for register defines so VMMs need no private
  kernel headers.
- __free() locals on cxl-core/passthrough error paths instead of
  struct-owned temporaries.
- No "precommitted at probe" assumption; acquire checks COMMITTED in
  HDM shadow and refuses if missing.

Dave
----

- memremap(MEMREMAP_WB) for HDM host mapping (not ioremap_cache).
- Renamed cap flag to VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED for clarity.
- __free() / DEFINE_FREE() cleanup in new passthrough.c create path.

Patch series
============

 [1/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
 [2/11] cxl: Split cxl_await_range_active() from media-ready wait
 [3/11] cxl: Record BIR and BAR offset in cxl_register_map
 [4/11] cxl: Move component/HDM register defines to
        uapi/cxl/cxl_regs.h
 [5/11] vfio: UAPI for CXL Type-2 device passthrough
 [6/11] cxl: Add register-virtualization helpers for vfio Type-2
        passthrough
 [7/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
        acquisition
 [8/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping
        shim
 [9/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
[10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
[11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions

Dependencies
============

[1] [PATCH v28 0/5] Type2 device basic support
https://lore.kernel.org/linux-cxl/20260618181806.118745-1-alejandro.lucero-palau@amd.com/

[2] Previous version of this patch series
[PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/

[3] Companion QEMU series
[RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
https://lore.kernel.org/linux-cxl/20260427181235.3003865-1-mhonap@nvidia.com/

Manish Honap (11):
  cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
  cxl: Split cxl_await_range_active() from media-ready wait
  cxl: Record BIR and BAR offset in cxl_register_map
  cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  vfio: UAPI for CXL Type-2 device passthrough
  cxl: Add register-virtualization helpers for vfio Type-2 passthrough
  vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
    acquisition
  vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
  selftests/vfio: Add CXL Type-2 device passthrough smoke test
  docs: vfio-pci: Document CXL Type-2 device passthrough
  vfio/pci: Provide opt-out for CXL Type-2 extensions

 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst     | 282 ++++++
 drivers/cxl/Kconfig                           |   7 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/passthrough.c                | 590 ++++++++++++
 drivers/cxl/core/pci.c                        |  70 +-
 drivers/cxl/core/regs.c                       |  35 +
 drivers/cxl/cxl.h                             |  52 +-
 drivers/vfio/pci/Kconfig                      |   2 +
 drivers/vfio/pci/Makefile                     |   1 +
 drivers/vfio/pci/cxl/Kconfig                  |  34 +
 drivers/vfio/pci/cxl/Makefile                 |   2 +
 drivers/vfio/pci/cxl/vfio_cxl_core.c          | 889 ++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h          |  71 ++
 drivers/vfio/pci/vfio_pci.c                   |   9 +
 drivers/vfio/pci/vfio_pci_config.c            |  31 +
 drivers/vfio/pci/vfio_pci_core.c              |  68 +-
 drivers/vfio/pci/vfio_pci_priv.h              |  93 ++
 drivers/vfio/pci/vfio_pci_rdwr.c              |  17 +
 include/cxl/cxl.h                             |  18 +
 include/cxl/passthrough.h                     | 121 +++
 include/linux/vfio_pci_core.h                 |   8 +
 include/uapi/cxl/cxl_regs.h                   |  63 ++
 include/uapi/linux/vfio.h                     |  46 +
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |  11 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 350 +++++++
 27 files changed, 2821 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
 create mode 100644 drivers/cxl/core/passthrough.c
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/Makefile
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
 create mode 100644 include/cxl/passthrough.h
 create mode 100644 include/uapi/cxl/cxl_regs.h
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c

base-commit: 90cf2e0d702c8a132ccbe72e7687f33c04c14658
-- 
2.25.1


* [PATCH v3 01/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

cxl_probe_component_regs() finds the HDM decoder block during device
probe and caches its location, but does not record the decoder count
and does not expose the result outside drivers/cxl/.

In-kernel cxl drivers (Type-2 accelerator drivers, vfio-cxl) need the
decoder count and the byte offset and size of the HDM block without
re-running the probe sequence.

Record decoder_cnt in rmap->count when parsing the HDM capability in
cxl_probe_component_regs(), extend struct cxl_reg_map with a count
member, and add cxl_get_hdm_info() to return offset, size, and count
from the cached map.  Export under the CXL namespace.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
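For readers consuming the offset and size returned by
cxl_get_hdm_info(), the sketch below restates the HDM block arithmetic
in standalone form.  The block length follows the 0x20 * decoder_cnt +
0x10 computation in cxl_probe_component_regs(), and the CTRL offset
follows the existing CXL_HDM_DECODER0_CTRL_OFFSET(i) define.  The
helper names are hypothetical and the code is not part of this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Block length as computed in cxl_probe_component_regs(). */
uint32_t hdm_block_len(uint32_t decoder_cnt)
{
	return 0x20u * decoder_cnt + 0x10u;
}

/* Offset of decoder @i's CTRL register within the HDM block. */
uint32_t hdm_decoder_ctrl_off(uint32_t i)
{
	return 0x20u * i + 0x20u;
}

/* Hypothetical check: is @off an aligned dword inside the block? */
bool hdm_dword_in_block(uint32_t off, uint32_t decoder_cnt)
{
	uint32_t len = hdm_block_len(decoder_cnt);

	return (off % 4u) == 0 && off <= len - 4u;
}
```

For a single decoder the block is 0x30 bytes long and the decoder's
CTRL register sits at offset 0x20.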
 drivers/cxl/core/pci.c  | 33 +++++++++++++++++++++++++++++++++
 drivers/cxl/core/regs.c |  1 +
 include/cxl/cxl.h       |  4 ++++
 3 files changed, 38 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 2bcd683aa286..c917608c16f9 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -449,6 +449,39 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
 
+/**
+ * cxl_get_hdm_info - Get HDM decoder register block location and count
+ * @cxlds: CXL device state (must have component regs enumerated via
+ *	   cxl_probe_component_regs())
+ * @count:  number of HDM decoders (from HDM Capability bits [3:0])
+ * @offset: byte offset of HDM decoder block within the component register BAR
+ * @size:   size in bytes of the HDM decoder block
+ *
+ * Exported for cxl drivers (in-kernel accelerator drivers, vfio-cxl) that
+ * need HDM decoder metadata from the cached component-register map without
+ * re-running the probe sequence.
+ *
+ * Return: 0 on success. -ENODEV if the HDM decoder block is not present.
+ */
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+		     resource_size_t *offset, resource_size_t *size)
+{
+	struct cxl_reg_map *hdm = &cxlds->reg_map.component_map.hdm_decoder;
+
+	if (WARN_ON(!count || !offset || !size))
+		return -EINVAL;
+
+	if (!hdm->valid)
+		return -ENODEV;
+
+	*count	= hdm->count;
+	*offset = hdm->offset;
+	*size	= hdm->size;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, "CXL");
+
 #define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
 #define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
 #define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 20c2d9fbcfe7..e828df0629d0 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -85,6 +85,7 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			decoder_cnt = cxl_hdm_decoder_count(hdr);
 			length = 0x20 * decoder_cnt + 0x10;
 			rmap = &map->hdm_decoder;
+			rmap->count = decoder_cnt;
 			break;
 		}
 		case CXL_CM_CAP_CAP_ID_RAS:
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 802b143de83d..440ab09c640e 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -75,6 +75,7 @@ struct cxl_reg_map {
 	int id;
 	unsigned long offset;
 	unsigned long size;
+	u8 count;
 };
 
 struct cxl_component_reg_map {
@@ -228,4 +229,7 @@ struct cxl_memdev *devm_cxl_probe_mem(struct cxl_dev_state *cxlds,
 				      struct range *range);
 
 int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
+
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+		     resource_size_t *offset, resource_size_t *size);
 #endif /* __CXL_CXL_H__ */
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v2 7/8] dt-bindings: riscv: Add generic CBQRI controller binding
From: Conor Dooley @ 2026-06-25 16:19 UTC (permalink / raw)
  To: Drew Fustini
  Cc: Adrien Ricciardi, Alexandre Ghiti, Atish Kumar Patra, Atish Patra,
	Babu Moger, Ben Horgan, Borislav Petkov, Chen Pei, Conor Dooley,
	Conor Dooley, Dave Hansen, Dave Martin, Fenghua Yu, Gong Shuai,
	Gong Shuai, guo.wenjia23, James Morse, Kornel Dulęba,
	Krzysztof Kozlowski, liu.qingtao2, Liu Zhiwei, Palmer Dabbelt,
	Paul Walmsley, Peter Newman, Radim Krčmář,
	Reinette Chatre, Rob Herring, Samuel Holland,
	Sebastian Andrzej Siewior, Tony Luck, Vasudevan Srinivasan,
	Ved Shanbhogue, Weiwei Li, yunhui cui, linux-kernel, linux-riscv,
	x86, devicetree, linux-rt-devel, linux-doc
In-Reply-To: <20260624-dfustini-atl-sc-cbqri-dt-v2-7-2f8049fd902b@kernel.org>

On Wed, Jun 24, 2026 at 06:38:35PM -0700, Drew Fustini wrote:
> Document the generic compatibles for capacity and bandwidth controllers
> that implement the RISC-V CBQRI specification. The binding also
> describes the common riscv,cbqri-rcid and riscv,cbqri-mcid properties,
> and the optional riscv,cbqri-cache phandle that links a capacity
> controller to the cache whose capacity it allocates.
> 
> Assisted-by: Claude:claude-opus-4-8
> Co-developed-by: Adrien Ricciardi <aricciardi@baylibre.com>
> Signed-off-by: Adrien Ricciardi <aricciardi@baylibre.com>
> Signed-off-by: Drew Fustini <fustini@kernel.org>
> ---
>  .../devicetree/bindings/riscv/riscv,cbqri.yaml     | 97 ++++++++++++++++++++++
>  MAINTAINERS                                        |  1 +
>  2 files changed, 98 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml b/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
> new file mode 100644
> index 0000000000000000000000000000000000000000..5d6be645381780e187b39e60c3bb487fdf2cfb69
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
> @@ -0,0 +1,97 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/riscv/riscv,cbqri.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: RISC-V Capacity and Bandwidth QoS Register Interface (CBQRI) controller
> +
> +description: |
> +  The RISC-V CBQRI specification defines capacity-controller and
> +  bandwidth-controller register blocks that allocate cache capacity and memory
> +  bandwidth to resource-control IDs (RCIDs) and monitor usage per
> +  monitoring-counter ID (MCID):
> +  https://github.com/riscv-non-isa/riscv-cbqri/blob/main/riscv-cbqri.pdf
> +
> +  Allocation and monitoring share one register block, and a controller may
> +  implement either or both. A driver discovers which at runtime from the
> +  capabilities register, so the compatible names only the controller type. It
> +  does not distinguish allocation-only, monitoring-only or combined
> +  controllers, and no property declares monitoring support.
> +
> +maintainers:
> +  - Drew Fustini <fustini@kernel.org>
> +
> +properties:
> +  compatible:
> +    oneOf:
> +      - items:
> +          - description: Tenstorrent Ascalon Shared Cache
> +            const: tenstorrent,ascalon-sc-cbqri
> +          - const: riscv,cbqri-capacity-controller
> +      - enum:
> +          - riscv,cbqri-capacity-controller
> +          - riscv,cbqri-bandwidth-controller

Please modify this, as has been done for other riscv spec related
bindings, to let people get away without using device-specific
compatibles.

In this case, you can just delete the first entry from this enum, since
it already has a user, and you only have to implement this feedback for
the second entry.

pw-bot: changes-requested

> +
> +  reg:
> +    maxItems: 1
> +    description:
> +      The CBQRI controller register block.
> +
> +  riscv,cbqri-rcid:
> +    $ref: /schemas/types.yaml#/definitions/uint32
> +    description:
> +      The maximum number of RCIDs the controller supports. RCIDs are the
> +      resource-control IDs that allocation operations target.
> +
> +  riscv,cbqri-mcid:
> +    $ref: /schemas/types.yaml#/definitions/uint32
> +    description:
> +      The maximum number of MCIDs the controller supports. MCIDs are the
> +      monitoring-counter IDs that usage-monitoring operations target. Present
> +      on controllers that implement monitoring.
> +
> +  riscv,cbqri-cache:
> +    $ref: /schemas/types.yaml#/definitions/phandle
> +    description:
> +      Phandle to the cache node whose capacity this controller allocates.
> +      Applies to capacity controllers that back a CPU cache. The cache level
> +      and the harts sharing it are taken from that node's cache topology.

Architecturally, is it impossible for a capacity controller to control
more than one cache?

> +
> +required:
> +  - compatible
> +  - reg
> +
> +allOf:
> +  - if:
> +      properties:
> +        compatible:
> +          contains:
> +            const: tenstorrent,ascalon-sc-cbqri
> +    then:
> +      required:
> +        - riscv,cbqri-rcid
> +        - riscv,cbqri-cache
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    l2_cache: l2-cache {
> +        compatible = "cache";
> +        cache-level = <2>;
> +        cache-unified;
> +        cache-size = <0xc00000>;
> +        cache-sets = <512>;
> +        cache-block-size = <64>;
> +    };
> +
> +    cache-controller@a21a00c0 {
> +        compatible = "tenstorrent,ascalon-sc-cbqri",
> +                     "riscv,cbqri-capacity-controller";

Is this or is this not a cache controller?
The compatible and the fact that the property points to an actual cache
controller suggest that this is not.

Cheers,
Conor.

> +        reg = <0xa21a00c0 0xf40>;
> +        riscv,cbqri-rcid = <16>;
> +        riscv,cbqri-cache = <&l2_cache>;
> +    };
> +
> +...
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9e1092165046c773771b055869030bc1bdb64b16..64a95a4d795a57033d3f36200d98cfb4a013ab94 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23298,6 +23298,7 @@ M:	Drew Fustini <fustini@kernel.org>
>  R:	yunhui cui <cuiyunhui@bytedance.com>
>  L:	linux-riscv@lists.infradead.org
>  S:	Supported
> +F:	Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
>  F:	arch/riscv/include/asm/qos.h
>  F:	arch/riscv/include/asm/resctrl.h
>  F:	arch/riscv/kernel/qos.c
> 
> -- 
> 2.34.1
> 

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Andrew Lunn @ 2026-06-25 16:12 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <38bafe7e-d419-46f7-8fa7-87e9183e578c@bootlin.com>

> This isn't Sphinx, but I've come up with something like this for a
> test definition:
> 
> 
> @ksft_ethtool_needs_supported_anyof([Pause, Asym_Pause])
> def test_ethtool_pause_advertising(cfg, peer) -> None:
>     """Pause advertisement
> 
>     Validate that changing pause params through the ETHTOOL_MSG_PAUSE command
>     translates to a change in the advertised pause params, and that these
>     parameters are correct w.r.t the supported pause params and requested pause
>     params.
>     
>     This exercises the .set_pauseparams() ethtool ops for MAC configuration,
>     as well as the reconfiguration of the PHY's advertising and negotiation.
>     
>     On non-phylink MACs, the MAC should call phy_set_sym_pause() to update the
>     PHY's advertising, and restart a negotiation with phy_start_aneg() if
>     need be. Failure to do so will result in the wrong advertising parameters.
>     
>     Pn phylink-enabled MACs, phylink deals with the PHY reconfiguration provided

On 

>     the MAC driver calls phylink_ethtool_set_pauseparam().
>     
>     Failing this test likely means that the PHY driver is not correctly advertising
>     pause settings, either due to the MAC not triggering a PHY reconfiguration,
>     a misconfiguration of the advertising registers by the PHY, or by
>     mis-handling the phydev->advertising bitfield in the PHY driver directly.
>     
>     The validation is made by looking at the advertised modes locally, as well as
>     what the peer's 'lp_advertising' values report.
> 
>     cfg -- local device's interface configuration
>     peer -- peer device handle

Plain Sphinx can be made to pick up this method documentation and
include it in the generated documentation. You would use something like

.. automethod:: test_ethtool_pause_advertising

in the .rst file.

I've no idea if the kernel configuration of sphinx allows this. At the
moment, i would not spend too much time on getting sphinx to generate
documentation. I would say that is nice to have. The description
itself is more important.

>     """
> 
>     # Initial conditions :
>     # - Local interface is admin UP, and reports lowlayer link UP
>     # - Remote interface is admin UP, and reports lowlayer link UP
>     #
>     # Test 1
>     # - SKIP if supported doesn't contain "Pause"
>     # - run 'ethtool -A ethX rx on tx on autoneg on'
>     # - FAIL if the return isn't 0
>     # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
>     #   "Pause" or contains "Asym_Pause"
>     # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
>     #   "Asym_Pause"
>     # - Succeed otherwise
>     #
>     # Test 2
>     # - SKIP if supported doesn't contain both "Pause" and "Asym_Pause"
>     # - run 'ethtool -A ethX rx on tx on autoneg on'
>     # - FAIL if the return isn't 0
>     # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
>     #   "Pause" or contains "Asym_Pause"
>     # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
>     #   "Asym_Pause"
>     #
>     # ...
>    
> The annotation defines the pre-requisites in terms of locally supported
> linkmodes, we have a docstring containing information for developers
> to debug their drivers, what I'm unsure about is the commented-out part
> below, so either one big function testing multiple adjacent scenarios
> or individual functions.

Sphinx follows Python's object-oriented structure. So you could have a
class test_ethtool_pause_advertising, with class documentation. And
then methods within the class which are individual tests.  The
commented out section would then be method documentation.

However, i've no idea if the selftest code allows for classes of test
methods? It looks like ksft_run() takes a list of methods. So you can
probably instantiate the class, and then pass it methods from the
class?

I would say you are right about picking one of the simple test cases,
playing with it, defining and implementing it, and seeing what comes
out at the end.

	Andrew

^ permalink raw reply

* [PATCH v2 2/2] hwmon: (chipcap2) Add support for label
From: Flaviu Nistor @ 2026-06-25 16:04 UTC (permalink / raw)
  To: Guenter Roeck, Javier Carrasco, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: Flaviu Nistor, linux-hwmon, linux-kernel, devicetree, linux-doc
In-Reply-To: <20260625160423.17882-1-flaviu.nistor@gmail.com>

Add support for label sysfs attribute similar to other hwmon devices.
This is particularly useful for systems with multiple sensors on the
same board, where identifying individual sensors is much easier since
labels can be defined via device tree.

Signed-off-by: Flaviu Nistor <flaviu.nistor@gmail.com>
---
Changes in v2:
- No change for this patch in the patch series.
- Link to v1: https://lore.kernel.org/all/20260622122200.14245-1-flaviu.nistor@gmail.com/

 Documentation/hwmon/chipcap2.rst |  2 ++
 drivers/hwmon/chipcap2.c         | 25 +++++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)
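
A minimal userspace sketch of how the new attributes might be consumed, assuming the sensor is registered as hwmon0 (the index and the read_label() helper are hypothetical, for illustration only). It reads a one-line sysfs attribute such as /sys/class/hwmon/hwmon0/temp1_label and strips the trailing newline. When no label is provided the attribute is not visible, so the open is expected to fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Read a one-line sysfs attribute into buf and strip the trailing newline.
 * Returns 0 on success or a negative errno value.
 */
static int read_label(const char *path, char *buf, size_t len)
{
	FILE *f;

	if (!path || !buf || len == 0)
		return -EINVAL;

	f = fopen(path, "r");
	if (!f)
		return -errno;	/* e.g. -ENOENT when the label is hidden */

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

With the device tree example from patch 1/2 (label = "Room"), calling read_label() on temp1_label or humidity1_label should fill buf with "Room", since both attributes return the same string.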

diff --git a/Documentation/hwmon/chipcap2.rst b/Documentation/hwmon/chipcap2.rst
index dc165becc64c..c38d87b91b69 100644
--- a/Documentation/hwmon/chipcap2.rst
+++ b/Documentation/hwmon/chipcap2.rst
@@ -70,4 +70,6 @@ humidity1_min_hyst:             RW      humidity low hystersis
 humidity1_max_hyst:             RW      humidity high hystersis
 humidity1_min_alarm:            RO      humidity low alarm indicator
 humidity1_max_alarm:            RO      humidity high alarm indicator
+humidity1_label:                RO      descriptive name for the sensor
+temp1_label:                    RO      descriptive name for the sensor
 =============================== ======= ========================================
diff --git a/drivers/hwmon/chipcap2.c b/drivers/hwmon/chipcap2.c
index 4aecf463180f..086571d556b7 100644
--- a/drivers/hwmon/chipcap2.c
+++ b/drivers/hwmon/chipcap2.c
@@ -22,6 +22,8 @@
 #include <linux/irq.h>
 #include <linux/module.h>
 #include <linux/regulator/consumer.h>
+#include <linux/mod_devicetable.h>
+#include <linux/property.h>
 
 #define CC2_START_CM			0xA0
 #define CC2_START_NOM			0x80
@@ -83,6 +85,7 @@ struct cc2_data {
 	struct i2c_client *client;
 	struct regulator *regulator;
 	const char *name;
+	const char *label;
 	int irq_ready;
 	int irq_low;
 	int irq_high;
@@ -449,6 +452,8 @@ static umode_t cc2_is_visible(const void *data, enum hwmon_sensor_types type,
 		switch (attr) {
 		case hwmon_humidity_input:
 			return 0444;
+		case hwmon_humidity_label:
+			return cc2->label ? 0444 : 0;
 		case hwmon_humidity_min_alarm:
 			return cc2->rh_alarm.low_alarm_visible ? 0444 : 0;
 		case hwmon_humidity_max_alarm:
@@ -466,6 +471,8 @@ static umode_t cc2_is_visible(const void *data, enum hwmon_sensor_types type,
 		switch (attr) {
 		case hwmon_temp_input:
 			return 0444;
+		case hwmon_temp_label:
+			return cc2->label ? 0444 : 0;
 		default:
 			return 0;
 		}
@@ -552,6 +559,16 @@ static int cc2_humidity_max_alarm_status(struct cc2_data *data, long *val)
 	return 0;
 }
 
+static int cc2_read_string(struct device *dev, enum hwmon_sensor_types type,
+			   u32 attr, int channel, const char **str)
+{
+	struct cc2_data *data = dev_get_drvdata(dev);
+
+	*str = data->label;
+
+	return 0;
+}
+
 static int cc2_read(struct device *dev, enum hwmon_sensor_types type, u32 attr,
 		    int channel, long *val)
 {
@@ -670,8 +687,9 @@ static int cc2_request_alarm_irqs(struct cc2_data *data, struct device *dev)
 }
 
 static const struct hwmon_channel_info *cc2_info[] = {
-	HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT),
-	HWMON_CHANNEL_INFO(humidity, HWMON_H_INPUT | HWMON_H_MIN | HWMON_H_MAX |
+	HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | HWMON_T_LABEL),
+	HWMON_CHANNEL_INFO(humidity, HWMON_H_INPUT | HWMON_H_LABEL |
+			   HWMON_H_MIN | HWMON_H_MAX |
 			   HWMON_H_MIN_HYST | HWMON_H_MAX_HYST |
 			   HWMON_H_MIN_ALARM | HWMON_H_MAX_ALARM),
 	NULL
@@ -680,6 +698,7 @@ static const struct hwmon_channel_info *cc2_info[] = {
 static const struct hwmon_ops cc2_hwmon_ops = {
 	.is_visible = cc2_is_visible,
 	.read = cc2_read,
+	.read_string = cc2_read_string,
 	.write = cc2_write,
 };
 
@@ -710,6 +729,8 @@ static int cc2_probe(struct i2c_client *client)
 		return dev_err_probe(dev, PTR_ERR(data->regulator),
 				     "Failed to get regulator\n");
 
+	device_property_read_string(dev, "label", &data->label);
+
 	ret = cc2_request_ready_irq(data, dev);
 	if (ret)
 		return dev_err_probe(dev, ret, "Failed to request ready irq\n");
-- 
2.34.1


^ permalink raw reply related

* [PATCH v2 1/2] dt-bindings: hwmon: chipcap2: Add label property
From: Flaviu Nistor @ 2026-06-25 16:04 UTC (permalink / raw)
  To: Guenter Roeck, Javier Carrasco, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: Flaviu Nistor, linux-hwmon, linux-kernel, devicetree, linux-doc

Add support for an optional label property similar to other hwmon devices.
On boards with multiple CHIPCAP2 sensors, this allows a distinct name to
be assigned to each instance.

Signed-off-by: Flaviu Nistor <flaviu.nistor@gmail.com>
---
Changes in v2:
- Implement suggestion from Javier Carrasco as proposed by Krzysztof Kozlowski.
- Link to v1: https://lore.kernel.org/all/20260622122200.14245-1-flaviu.nistor@gmail.com/

 .../devicetree/bindings/hwmon/amphenol,chipcap2.yaml        | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml b/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
index 17351fdbefce..56b0cecfca5f 100644
--- a/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
+++ b/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
@@ -45,6 +45,8 @@ properties:
       - const: low
       - const: high
 
+  label: true
+
   vdd-supply:
     description:
       Dedicated, controllable supply-regulator to reset the device and
@@ -55,6 +57,9 @@ required:
   - reg
   - vdd-supply
 
+allOf:
+  - $ref: hwmon-common.yaml#
+
 additionalProperties: false
 
 examples:
@@ -72,6 +77,7 @@ examples:
                          <5 IRQ_TYPE_EDGE_RISING>,
                          <6 IRQ_TYPE_EDGE_RISING>;
             interrupt-names = "ready", "low", "high";
+            label = "Room";
             vdd-supply = <&reg_vdd>;
         };
     };
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 16:03 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <dfee1484-fa2a-4b98-af5a-1e67ac716905@lunn.ch>


> 
> Does it even make sense to advertise this when in HD? But i don't
> think we need to consider this now. I consider HD low priority, i
> doubt it is actually used very often. We should concentrate on FD
> testing.

That's fine by me as well, let's keep it simple, we may revisit that if
we really need to.

> 
>> # ethtool -a eth2
>> Autonegotiate:	on
>> RX:		off
>> TX:		off
>> RX negotiated: on
>> TX negotiated: on
>>
>>
>> Sure, pause and HD don't make sense, however what I find confusing to some
>> extent is that the only place we have information about the *actual* pause
>> settings is the "link is Up" log in dmesg.
> 
> Maybe we should extend ksetting get to return the resolved pause
> parameters? But i'm not sure how much that actually gives us. Anything
> using phylink will just ask phylink to fill in the ksettings
> information, and it seems unlikely phylink gets it wrong. What we are
> really trying to test is drivers which don't use phylink, those are
> the ones which are generally broken, and they are not going to
> implement anything new in ksettings.

Correct yes. If the MAC driver uses phylink and a test fails, it very likely
means that the PHY driver is doing shady stuff (and some are/were for pause)

> So i think the test has to look
> at:
> 
>> 	Advertised pause frame use: Symmetric Receive-only
>> 	Link partner advertised pause frame use: Symmetric Receive-only
> 
> and check these match what we expect.

All good for me :) thanks for your feedback,

Maxime
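
To make the expected values concrete, here is a small sketch of the check discussed in this thread. It assumes the usual encoding applied by linkmode_set_pause() (Pause = rx, Asym_Pause = rx XOR tx) and the strings ethtool prints for each combination. The pause_adv struct and both helper names are hypothetical, for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the two pause advertisement bits. */
struct pause_adv {
	bool pause;
	bool asym_pause;
};

/* Map the requested rx/tx pause settings to advertisement bits. */
static struct pause_adv pause_adv_from_rxtx(bool rx, bool tx)
{
	struct pause_adv adv = {
		.pause = rx,
		.asym_pause = rx != tx,
	};

	return adv;
}

/* String ethtool shows for "Advertised pause frame use". */
static const char *pause_adv_str(struct pause_adv adv)
{
	if (adv.pause)
		return adv.asym_pause ? "Symmetric Receive-only" : "Symmetric";
	return adv.asym_pause ? "Transmit-only" : "No";
}
```

With "rx on tx on" this gives Pause without Asym_Pause, which is what Test 1 of the proposed plan expects to see both locally and in the peer's lp_advertising.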

^ permalink raw reply

* [PATCH v14 5/5] arm64: dts: imx8ulp: Add rpmsg node under imx_rproc
From: Shenwei Wang @ 2026-06-25 15:54 UTC (permalink / raw)
  To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>

From: Shenwei Wang <shenwei.wang@nxp.com>

Add the RPMSG bus node along with its GPIO subnodes to the device
tree.

Enable remote device communication and GPIO control via RPMSG on
the i.MX platform.

Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
 arch/arm64/boot/dts/freescale/imx8ulp.dtsi | 25 ++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/arm64/boot/dts/freescale/imx8ulp.dtsi b/arch/arm64/boot/dts/freescale/imx8ulp.dtsi
index 1de3ad60c6aa..f1b984eb1203 100644
--- a/arch/arm64/boot/dts/freescale/imx8ulp.dtsi
+++ b/arch/arm64/boot/dts/freescale/imx8ulp.dtsi
@@ -190,6 +190,31 @@ scmi_sensor: protocol@15 {
 	cm33: remoteproc-cm33 {
 		compatible = "fsl,imx8ulp-cm33";
 		status = "disabled";
+
+		rpmsg {
+			rpmsg-io {
+				#address-cells = <1>;
+				#size-cells = <0>;
+
+				rpmsg_gpioa: gpio@0 {
+					compatible = "rpmsg-gpio";
+					reg = <0>;
+					gpio-controller;
+					#gpio-cells = <2>;
+					#interrupt-cells = <2>;
+					interrupt-controller;
+				};
+
+				rpmsg_gpiob: gpio@1 {
+					compatible = "rpmsg-gpio";
+					reg = <1>;
+					gpio-controller;
+					#gpio-cells = <2>;
+					#interrupt-cells = <2>;
+					interrupt-controller;
+				};
+			};
+		};
 	};
 
 	soc: soc@0 {
-- 
2.43.0


^ permalink raw reply related

* [PATCH v14 4/5] gpio: rpmsg: add generic rpmsg GPIO driver
From: Shenwei Wang @ 2026-06-25 15:54 UTC (permalink / raw)
  To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn, Bartosz Golaszewski
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>

From: Shenwei Wang <shenwei.wang@nxp.com>

On an AMP platform, the system may include multiple processors:
	- MCUs running an RTOS
	- An MPU running Linux

These processors communicate via the RPMSG protocol.
The driver implements the standard GPIO interface, allowing
the Linux side to control GPIO controllers which reside in
the remote processor via RPMSG protocol.

Cc: Bartosz Golaszewski <brgl@bgdev.pl>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
 drivers/gpio/Kconfig      |  17 ++
 drivers/gpio/Makefile     |   1 +
 drivers/gpio/gpio-rpmsg.c | 568 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 586 insertions(+)
 create mode 100644 drivers/gpio/gpio-rpmsg.c
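
For reviewers unfamiliar with the message flow, here is a minimal userspace sketch of the request/reply exchange, with the remote processor modelled as a plain function. The gpio_req and gpio_reply layouts follow the shape of the request and reply used by the driver, but the MSG_*, STATUS_* and DIR_* constants, fake_remote(), gpio_xfer() and gpio_get_direction() are hypothetical stand-ins, not the real header values or driver functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative command, status and direction codes (not the UAPI values). */
enum { MSG_GET_DIRECTION = 1, MSG_SET_DIRECTION = 2 };
enum { STATUS_OK = 0, STATUS_ERR = 1 };
enum { DIR_NONE = 0, DIR_OUT = 1, DIR_IN = 2 };

struct gpio_req {
	uint16_t type;
	uint16_t gpio;
	uint32_t value;
};

struct gpio_reply {
	uint8_t status;
	uint8_t value;
};

#define NLINES 32
static uint8_t remote_dir[NLINES];	/* per-line state held by the fake remote */

/* Stand-in for the remote processor: handle one request, fill the reply. */
static void fake_remote(const struct gpio_req *req, struct gpio_reply *rep)
{
	rep->status = STATUS_OK;
	rep->value = 0;

	if (req->gpio >= NLINES) {
		rep->status = STATUS_ERR;
		return;
	}

	switch (req->type) {
	case MSG_SET_DIRECTION:
		remote_dir[req->gpio] = (uint8_t)req->value;
		break;
	case MSG_GET_DIRECTION:
		rep->value = remote_dir[req->gpio];
		break;
	default:
		rep->status = STATUS_ERR;
	}
}

/* Prepare a request, "send" it, then check the reply status. */
static int gpio_xfer(uint16_t line, uint16_t cmd, uint32_t val,
		     struct gpio_reply *rep)
{
	struct gpio_req req = { .type = cmd, .gpio = line, .value = val };

	fake_remote(&req, rep);
	return rep->status == STATUS_OK ? 0 : -EINVAL;
}

/* Returns 1 for input, 0 for output, negative errno otherwise. */
static int gpio_get_direction(uint16_t line)
{
	struct gpio_reply rep;
	int ret = gpio_xfer(line, MSG_GET_DIRECTION, 0, &rep);

	if (ret)
		return ret;
	if (rep.value == DIR_IN)
		return 1;
	if (rep.value == DIR_OUT)
		return 0;
	return -EINVAL;
}
```

The sketch leaves out what the real driver adds around this exchange: a per-port mutex that serializes requests, and a completion with a timeout that turns a missing reply into -ETIMEDOUT.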

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 020e51e30317..4ad299fe3c6f 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -1917,6 +1917,23 @@ config GPIO_SODAVILLE
 
 endmenu
 
+menu "RPMSG GPIO drivers"
+	depends on RPMSG
+
+config GPIO_RPMSG
+	tristate "Generic RPMSG GPIO support"
+	depends on OF && REMOTEPROC
+	select GPIOLIB_IRQCHIP
+	default REMOTEPROC
+	help
+	  Say yes here to support the generic GPIO functions over the RPMSG
+	  bus. Currently supported devices: i.MX7ULP, i.MX8ULP, i.MX8x, and
+	  i.MX9x.
+
+	  If unsure, say N.
+
+endmenu
+
 menu "SPI GPIO expanders"
 	depends on SPI_MASTER
 
diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
index b267598b517d..ee75c0e65b8b 100644
--- a/drivers/gpio/Makefile
+++ b/drivers/gpio/Makefile
@@ -157,6 +157,7 @@ obj-$(CONFIG_GPIO_RDC321X)		+= gpio-rdc321x.o
 obj-$(CONFIG_GPIO_REALTEK_OTTO)		+= gpio-realtek-otto.o
 obj-$(CONFIG_GPIO_REG)			+= gpio-reg.o
 obj-$(CONFIG_GPIO_ROCKCHIP)	+= gpio-rockchip.o
+obj-$(CONFIG_GPIO_RPMSG)		+= gpio-rpmsg.o
 obj-$(CONFIG_GPIO_RTD)			+= gpio-rtd.o
 obj-$(CONFIG_ARCH_SA1100)		+= gpio-sa1100.o
 obj-$(CONFIG_GPIO_SAMA5D2_PIOBU)	+= gpio-sama5d2-piobu.o
diff --git a/drivers/gpio/gpio-rpmsg.c b/drivers/gpio/gpio-rpmsg.c
new file mode 100644
index 000000000000..332e2925a830
--- /dev/null
+++ b/drivers/gpio/gpio-rpmsg.c
@@ -0,0 +1,568 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2026 NXP
+ *
+ * The driver exports a standard gpiochip interface to control
+ * the GPIO controllers via RPMSG on a remote processor.
+ */
+
+#include <linux/completion.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/gpio/driver.h>
+#include <linux/init.h>
+#include <linux/irqdomain.h>
+#include <linux/mod_devicetable.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/of.h>
+#include <linux/of_device.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/remoteproc.h>
+#include <linux/rpmsg.h>
+#include <linux/virtio_gpio.h>
+
+#define GPIOS_PER_PORT_DEFAULT		32
+#define RPMSG_TIMEOUT			1000
+
+/* Additional commands beyond virtio-gpio */
+#define VIRTIO_GPIO_MSG_SET_WAKEUP	0x0010
+
+/* GPIO Receive MSG Type */
+#define GPIO_RPMSG_REPLY	1
+#define GPIO_RPMSG_NOTIFY	2
+
+#define CHAN_NAME_PREFIX	"rpmsg-io-"
+#define GPIO_COMPAT_STR		"rpmsg-gpio"
+
+struct rpmsg_gpio_response {
+	__u8 type;
+	union {
+		/* command reply */
+		struct {
+			__u8 status;
+			__u8 value;
+		};
+
+		/* interrupt notification */
+		struct {
+			__u8 line;
+			__u8 trigger; /* rising/falling/high/low */
+		};
+	};
+};
+
+struct rpmsg_gpio_line {
+	u8 irq_shutdown;
+	u8 irq_unmask;
+	u8 irq_mask;
+	u32 irq_wake_enable;
+	u32 irq_type;
+};
+
+struct rpmsg_gpio_port {
+	struct gpio_chip gc;
+	struct rpmsg_device *rpdev;
+	struct virtio_gpio_request *send_msg;
+	struct rpmsg_gpio_response *recv_msg;
+	struct completion cmd_complete;
+	struct mutex lock;
+	u32 ngpios;
+	u32 idx;
+	struct rpmsg_gpio_line lines[GPIOS_PER_PORT_DEFAULT];
+};
+
+static int rpmsg_gpio_send_message(struct rpmsg_gpio_port *port)
+{
+	int ret;
+
+	reinit_completion(&port->cmd_complete);
+
+	ret = rpmsg_send(port->rpdev->ept, port->send_msg, sizeof(*port->send_msg));
+	if (ret) {
+		dev_err(&port->rpdev->dev, "rpmsg_send failed: cmd=%d ret=%d\n",
+			port->send_msg->type, ret);
+		return ret;
+	}
+
+	ret = wait_for_completion_timeout(&port->cmd_complete,
+					  msecs_to_jiffies(RPMSG_TIMEOUT));
+	if (ret == 0) {
+		dev_err(&port->rpdev->dev, "rpmsg_send timeout! cmd=%d\n",
+			port->send_msg->type);
+		return -ETIMEDOUT;
+	}
+
+	if (unlikely(port->recv_msg->status != VIRTIO_GPIO_STATUS_OK)) {
+		dev_err(&port->rpdev->dev, "remote core replies an error: cmd=%d!\n",
+			port->send_msg->type);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct virtio_gpio_request *
+rpmsg_gpio_msg_prepare(struct rpmsg_gpio_port *port, u16 line, u16 cmd, u32 val)
+{
+	struct virtio_gpio_request *msg = port->send_msg;
+
+	msg->type = cmd;
+	msg->gpio = line;
+	msg->value = val;
+
+	return msg;
+}
+
+static int rpmsg_gpio_get(struct gpio_chip *gc, unsigned int line)
+{
+	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+	int ret;
+
+	guard(mutex)(&port->lock);
+
+	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_GET_VALUE, 0);
+
+	ret = rpmsg_gpio_send_message(port);
+	return ret ? ret : port->recv_msg->value;
+}
+
+static int rpmsg_gpio_get_direction(struct gpio_chip *gc, unsigned int line)
+{
+	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+	int ret;
+
+	guard(mutex)(&port->lock);
+
+	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_GET_DIRECTION, 0);
+
+	ret = rpmsg_gpio_send_message(port);
+	if (ret)
+		return ret;
+
+	switch (port->recv_msg->value) {
+	case VIRTIO_GPIO_DIRECTION_IN:
+		return GPIO_LINE_DIRECTION_IN;
+	case VIRTIO_GPIO_DIRECTION_OUT:
+		return GPIO_LINE_DIRECTION_OUT;
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+static int rpmsg_gpio_direction_input(struct gpio_chip *gc, unsigned int line)
+{
+	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+
+	guard(mutex)(&port->lock);
+
+	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_DIRECTION,
+			       VIRTIO_GPIO_DIRECTION_IN);
+
+	return rpmsg_gpio_send_message(port);
+}
+
+static int rpmsg_gpio_set(struct gpio_chip *gc, unsigned int line, int val)
+{
+	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+
+	guard(mutex)(&port->lock);
+
+	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_VALUE, val);
+
+	return rpmsg_gpio_send_message(port);
+}
+
+static int rpmsg_gpio_direction_output(struct gpio_chip *gc, unsigned int line, int val)
+{
+	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
+	int ret;
+
+	guard(mutex)(&port->lock);
+
+	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_DIRECTION,
+			       VIRTIO_GPIO_DIRECTION_OUT);
+
+	ret = rpmsg_gpio_send_message(port);
+	if (ret)
+		return ret;
+
+	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_VALUE, val);
+
+	return rpmsg_gpio_send_message(port);
+}
+
+static int gpio_rpmsg_irq_set_type(struct irq_data *d, u32 type)
+{
+	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+	u32 line = d->hwirq;
+
+	switch (type) {
+	case IRQ_TYPE_EDGE_RISING:
+		type = VIRTIO_GPIO_IRQ_TYPE_EDGE_RISING;
+		irq_set_handler_locked(d, handle_simple_irq);
+		break;
+	case IRQ_TYPE_EDGE_FALLING:
+		type = VIRTIO_GPIO_IRQ_TYPE_EDGE_FALLING;
+		irq_set_handler_locked(d, handle_simple_irq);
+		break;
+	case IRQ_TYPE_EDGE_BOTH:
+		type = VIRTIO_GPIO_IRQ_TYPE_EDGE_BOTH;
+		irq_set_handler_locked(d, handle_simple_irq);
+		break;
+	case IRQ_TYPE_LEVEL_LOW:
+		type = VIRTIO_GPIO_IRQ_TYPE_LEVEL_LOW;
+		irq_set_handler_locked(d, handle_level_irq);
+		break;
+	case IRQ_TYPE_LEVEL_HIGH:
+		type = VIRTIO_GPIO_IRQ_TYPE_LEVEL_HIGH;
+		irq_set_handler_locked(d, handle_level_irq);
+		break;
+	default:
+		dev_err(&port->rpdev->dev, "unsupported irq type: %u\n", type);
+		return -EINVAL;
+	}
+
+	port->lines[line].irq_type = type;
+
+	return 0;
+}
+
+static int gpio_rpmsg_irq_set_wake(struct irq_data *d, u32 enable)
+{
+	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+	u32 line = d->hwirq;
+
+	port->lines[line].irq_wake_enable = enable;
+
+	return 0;
+}
+
+/*
+ * This unmask/mask function is invoked in two situations:
+ *   - when an interrupt is being set up, and
+ *   - after an interrupt has occurred.
+ *
+ * The GPIO driver does not access hardware registers directly.
+ * Instead, it caches all relevant information locally, and then sends
+ * the accumulated state to the remote system at this stage.
+ */
+static void gpio_rpmsg_unmask_irq(struct irq_data *d)
+{
+	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+	u32 line = d->hwirq;
+
+	port->lines[line].irq_unmask = 1;
+}
+
+static void gpio_rpmsg_mask_irq(struct irq_data *d)
+{
+	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+	u32 line = d->hwirq;
+
+	/*
+	 * When an interrupt occurs, the remote system masks the interrupt
+	 * and then sends a notification to Linux. After Linux processes
+	 * that notification, it sends an RPMsg command back to the remote
+	 * system to unmask the interrupt again.
+	 */
+	port->lines[line].irq_mask = 1;
+}
+
+static void gpio_rpmsg_irq_shutdown(struct irq_data *d)
+{
+	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+	u32 line = d->hwirq;
+
+	port->lines[line].irq_shutdown = 1;
+}
+
+static void gpio_rpmsg_irq_bus_lock(struct irq_data *d)
+{
+	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+
+	mutex_lock(&port->lock);
+}
+
+static void gpio_rpmsg_irq_bus_sync_unlock(struct irq_data *d)
+{
+	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
+	u32 line = d->hwirq;
+
+	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_WAKEUP,
+			       port->lines[line].irq_wake_enable);
+	rpmsg_gpio_send_message(port);
+
+	/*
+	 * For a mask-only request, send nothing further.
+	 * The remote system masks the interrupt itself when it fires and
+	 * then sends a notification to Linux. After Linux has handled the
+	 * notification, it sends an RPMsg command back to the remote
+	 * system to unmask the interrupt again.
+	 */
+	if (port->lines[line].irq_mask && !port->lines[line].irq_unmask) {
+		port->lines[line].irq_mask = 0;
+		mutex_unlock(&port->lock);
+		return;
+	}
+
+	if (port->lines[line].irq_shutdown) {
+		rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_IRQ_TYPE,
+				       VIRTIO_GPIO_IRQ_TYPE_NONE);
+		port->lines[line].irq_shutdown = 0;
+	} else {
+		rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_IRQ_TYPE,
+				       port->lines[line].irq_type);
+
+		if (port->lines[line].irq_unmask)
+			port->lines[line].irq_unmask = 0;
+	}
+
+	rpmsg_gpio_send_message(port);
+	mutex_unlock(&port->lock);
+}
+
+static const struct irq_chip gpio_rpmsg_irq_chip = {
+	.irq_mask = gpio_rpmsg_mask_irq,
+	.irq_unmask = gpio_rpmsg_unmask_irq,
+	.irq_set_wake = gpio_rpmsg_irq_set_wake,
+	.irq_set_type = gpio_rpmsg_irq_set_type,
+	.irq_shutdown = gpio_rpmsg_irq_shutdown,
+	.irq_bus_lock = gpio_rpmsg_irq_bus_lock,
+	.irq_bus_sync_unlock = gpio_rpmsg_irq_bus_sync_unlock,
+	.flags = IRQCHIP_IMMUTABLE,
+};
+
+static int rpmsg_gpiochip_register(struct rpmsg_device *rpdev,
+				   struct device_node *np, const char *name)
+{
+	struct rpmsg_gpio_port *port;
+	struct gpio_irq_chip *girq;
+	struct gpio_chip *gc;
+	int ret;
+
+	port = devm_kzalloc(&rpdev->dev, sizeof(*port), GFP_KERNEL);
+	if (!port)
+		return -ENOMEM;
+
+	ret = of_property_read_u32(np, "reg", &port->idx);
+	if (ret)
+		return ret;
+
+	ret = devm_mutex_init(&rpdev->dev, &port->lock);
+	if (ret)
+		return ret;
+
+	ret = of_property_read_u32(np, "ngpios", &port->ngpios);
+	if (ret || port->ngpios > GPIOS_PER_PORT_DEFAULT)
+		port->ngpios = GPIOS_PER_PORT_DEFAULT;
+
+	port->send_msg = devm_kzalloc(&rpdev->dev,
+				      sizeof(*port->send_msg),
+				      GFP_KERNEL);
+
+	port->recv_msg = devm_kzalloc(&rpdev->dev,
+				      sizeof(*port->recv_msg),
+				      GFP_KERNEL);
+	if (!port->send_msg || !port->recv_msg)
+		return -ENOMEM;
+
+	init_completion(&port->cmd_complete);
+	port->rpdev = rpdev;
+
+	gc = &port->gc;
+	gc->owner = THIS_MODULE;
+	gc->parent = &rpdev->dev;
+	gc->fwnode = of_fwnode_handle(np);
+	gc->ngpio = port->ngpios;
+	gc->base = -1;
+	gc->label = devm_kasprintf(&rpdev->dev, GFP_KERNEL, "%s-gpio%d",
+				   name, port->idx);
+
+	gc->direction_input = rpmsg_gpio_direction_input;
+	gc->direction_output = rpmsg_gpio_direction_output;
+	gc->get_direction = rpmsg_gpio_get_direction;
+	gc->get = rpmsg_gpio_get;
+	gc->set = rpmsg_gpio_set;
+
+	girq = &gc->irq;
+	gpio_irq_chip_set_chip(girq, &gpio_rpmsg_irq_chip);
+	girq->parent_handler = NULL;
+	girq->num_parents = 0;
+	girq->parents = NULL;
+	girq->chip->name = devm_kstrdup(&rpdev->dev, gc->label, GFP_KERNEL);
+
+	dev_set_drvdata(&rpdev->dev, port);
+
+	return devm_gpiochip_add_data(&rpdev->dev, gc, port);
+}
+
+static const char *rpmsg_get_rproc_node_name(struct rpmsg_device *rpdev)
+{
+	const char *name = NULL;
+	struct device_node *np;
+	struct rproc *rproc;
+
+	rproc = rproc_get_by_child(&rpdev->dev);
+	if (!rproc)
+		return NULL;
+
+	np = of_node_get(rproc->dev.of_node);
+	if (!np && rproc->dev.parent)
+		np = of_node_get(rproc->dev.parent->of_node);
+
+	if (np) {
+		name = devm_kstrdup(&rpdev->dev, np->name, GFP_KERNEL);
+		of_node_put(np);
+	}
+
+	return name;
+}
+
+static struct device_node *
+rpmsg_find_child_by_compat_reg(struct device_node *parent, const char *compat, u32 idx)
+{
+	struct device_node *child;
+	u32 reg;
+
+	for_each_available_child_of_node(parent, child) {
+		if (!of_device_is_compatible(child, compat))
+			continue;
+
+		if (of_property_read_u32(child, "reg", &reg))
+			continue;
+
+		if (reg == idx)
+			return child;
+	}
+
+	return NULL;
+}
+
+static struct device_node *
+rpmsg_get_channel_ofnode(struct rpmsg_device *rpdev, const char *compat, u32 idx)
+{
+	struct device_node *np_chan = NULL, *np;
+	struct rproc *rproc;
+
+	rproc = rproc_get_by_child(&rpdev->dev);
+	if (!rproc)
+		return NULL;
+
+	np = of_node_get(rproc->dev.of_node);
+	if (!np && rproc->dev.parent)
+		np = of_node_get(rproc->dev.parent->of_node);
+
+	if (np)
+		np_chan = rpmsg_find_child_by_compat_reg(np, compat, idx);
+
+	return np_chan;
+}
+
+static int rpmsg_get_gpio_index(const char *name, const char *prefix)
+{
+	const char *p;
+	int base = 10;
+	int val;
+
+	if (!name)
+		return -EINVAL;
+
+	/* Ensure correct prefix */
+	if (!str_has_prefix(name, prefix))
+		return -EINVAL;
+
+	/* Find last '-' */
+	p = strrchr(name, '-');
+
+	if (!p || *(p + 1) == '\0')
+		return -EINVAL;
+
+	if (p[1] == '0' && (p[2] == 'x' || p[2] == 'X'))
+		base = 16;
+
+	if (kstrtoint(p + 1, base, &val))
+		return -EINVAL;
+
+	return val;
+}
+
+static int rpmsg_gpio_channel_callback(struct rpmsg_device *rpdev, void *data,
+				       int len, void *priv, u32 src)
+{
+	struct rpmsg_gpio_response *msg = data;
+	struct rpmsg_gpio_port *port = NULL;
+
+	port = dev_get_drvdata(&rpdev->dev);
+
+	if (!port) {
+		dev_err(&rpdev->dev, "port is null\n");
+		return -EINVAL;
+	}
+
+	if (msg->type == GPIO_RPMSG_REPLY) {
+		*port->recv_msg = *msg;
+		complete(&port->cmd_complete);
+	} else if (msg->type == GPIO_RPMSG_NOTIFY) {
+		generic_handle_domain_irq_safe(port->gc.irq.domain, msg->line);
+	} else {
+		dev_err(&rpdev->dev, "wrong message type (0x%x)\n", msg->type);
+	}
+
+	return 0;
+}
+
+static int rpmsg_gpio_channel_probe(struct rpmsg_device *rpdev)
+{
+	struct device *dev = &rpdev->dev;
+	struct device_node *np;
+	const char *rproc_name;
+	int idx;
+
+	idx = rpmsg_get_gpio_index(rpdev->id.name, CHAN_NAME_PREFIX);
+	if (idx < 0)
+		return -EINVAL;
+
+	if (!dev->of_node) {
+		np = rpmsg_get_channel_ofnode(rpdev, GPIO_COMPAT_STR, idx);
+		if (!np)
+			return -ENODEV;
+
+		dev->of_node = np;
+		set_primary_fwnode(dev, of_fwnode_handle(np));
+		return -EPROBE_DEFER;
+	}
+
+	rproc_name = rpmsg_get_rproc_node_name(rpdev);
+
+	return rpmsg_gpiochip_register(rpdev, dev->of_node, rproc_name);
+}
+
+static const struct of_device_id rpmsg_gpio_dt_ids[] = {
+	{ .compatible = GPIO_COMPAT_STR },
+	{ /* sentinel */ }
+};
+
+static struct rpmsg_device_id rpmsg_gpio_channel_id_table[] = {
+	{ .name = CHAN_NAME_PREFIX },
+	{ },
+};
+MODULE_DEVICE_TABLE(rpmsg, rpmsg_gpio_channel_id_table);
+
+static struct rpmsg_driver rpmsg_gpio_channel_client = {
+	.callback	= rpmsg_gpio_channel_callback,
+	.id_table	= rpmsg_gpio_channel_id_table,
+	.probe		= rpmsg_gpio_channel_probe,
+	.drv		= {
+		.name	= KBUILD_MODNAME,
+		.of_match_table = rpmsg_gpio_dt_ids,
+	},
+};
+module_rpmsg_driver(rpmsg_gpio_channel_client);
+
+MODULE_AUTHOR("Shenwei Wang <shenwei.wang@nxp.com>");
+MODULE_DESCRIPTION("generic rpmsg gpio driver");
+MODULE_LICENSE("GPL");
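
The deferred-update scheme used by the irq_bus_lock() and
irq_bus_sync_unlock() callbacks above can be modelled in plain userspace
C. This is only a sketch of the decision logic, not the driver code: the
struct, enum and function names are hypothetical, and "sending" is
replaced by recording the message in an output array.

```c
#include <stdbool.h>
#include <stdint.h>

/* Command codes taken from gpio-rpmsg.rst; the names are illustrative. */
enum { SKETCH_CMD_IRQ_TYPE = 6, SKETCH_CMD_SET_WAKEUP = 16 };
enum { SKETCH_IRQ_TYPE_NONE = 0 };

/* Hypothetical stand-in for the per-line state cached by the irq callbacks. */
struct line_state {
	uint32_t irq_type;	/* trigger type in protocol encoding */
	bool irq_wake_enable;
	bool irq_mask;
	bool irq_unmask;
	bool irq_shutdown;
};

struct sent_msg {
	uint16_t cmd;
	uint32_t value;
};

/*
 * Flush the cached state, mirroring the order of operations in the
 * sync-unlock callback. Returns the number of messages written to out[].
 */
static int sketch_sync_unlock(struct line_state *s, struct sent_msg out[2])
{
	int n = 0;

	/* The wakeup setting is always sent first. */
	out[n++] = (struct sent_msg){ SKETCH_CMD_SET_WAKEUP,
				      s->irq_wake_enable ? 1u : 0u };

	/* Mask without unmask: the remote side already masked the line. */
	if (s->irq_mask && !s->irq_unmask) {
		s->irq_mask = false;
		return n;
	}

	if (s->irq_shutdown) {
		out[n++] = (struct sent_msg){ SKETCH_CMD_IRQ_TYPE,
					      SKETCH_IRQ_TYPE_NONE };
		s->irq_shutdown = false;
	} else {
		/* Re-sending the trigger type is what unmasks the line. */
		out[n++] = (struct sent_msg){ SKETCH_CMD_IRQ_TYPE, s->irq_type };
		s->irq_unmask = false;
	}

	return n;
}
```

In this model an unmask produces two messages (wakeup, then the trigger
type), while a mask-only update produces just the wakeup message.
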
-- 
2.43.0


^ permalink raw reply related

* [PATCH v14 3/5] rpmsg: core: match rpmsg device IDs by prefix
From: Shenwei Wang @ 2026-06-25 15:54 UTC (permalink / raw)
  To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>

From: Shenwei Wang <shenwei.wang@nxp.com>

The current rpmsg_id_match() implementation requires an exact
string match between the driver id_table entry and the rpmsg
device name using strncmp() with RPMSG_NAME_SIZE.

This makes it impossible for a driver to match a group of
rpmsg devices sharing a common prefix (e.g. dynamically
suffixed channel names).

Update the matching logic to compare only the first
strnlen(id->name, RPMSG_NAME_SIZE) characters, so that id_table
entries act as prefixes. This enables drivers to bind to devices
whose names start with the specified id->name.
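
A minimal userspace sketch of the difference between the two checks,
not the kernel code itself: NAME_SIZE stands in for RPMSG_NAME_SIZE
(assumed to be 32 here), and the function names and the channel names
used with them are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#include <string.h>

#define NAME_SIZE 32	/* assumption: mirrors RPMSG_NAME_SIZE */

/* Old behaviour: the whole name must be identical. */
static int id_match_exact(const char *id_name, const char *dev_name)
{
	return strncmp(id_name, dev_name, NAME_SIZE) == 0;
}

/* New behaviour: only the first strnlen(id_name) characters are compared. */
static int id_match_prefix(const char *id_name, const char *dev_name)
{
	size_t len = strnlen(id_name, NAME_SIZE);

	return strncmp(id_name, dev_name, len) == 0;
}
```

With an id of "rpmsg-io", the exact check rejects a device named
"rpmsg-io-1" while the prefix check accepts it. The prefix check also
accepts every other name that begins with the id, so id_table entries
should be specific enough not to catch unrelated channels.
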

The implementation is copied from a reply by Mathieu.

Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
 drivers/rpmsg/rpmsg_core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c
index e7f7831d37f8..f95bfc9965d4 100644
--- a/drivers/rpmsg/rpmsg_core.c
+++ b/drivers/rpmsg/rpmsg_core.c
@@ -414,7 +414,9 @@ ATTRIBUTE_GROUPS(rpmsg_dev);
 static inline int rpmsg_id_match(const struct rpmsg_device *rpdev,
 				  const struct rpmsg_device_id *id)
 {
-	return strncmp(id->name, rpdev->id.name, RPMSG_NAME_SIZE) == 0;
+	size_t len = strnlen(id->name, RPMSG_NAME_SIZE);
+
+	return strncmp(id->name, rpdev->id.name, len) == 0;
 }
 
 /* match rpmsg channel and rpmsg driver */
-- 
2.43.0


^ permalink raw reply related

* [PATCH v14 2/5] dt-bindings: remoteproc: imx_rproc: Add "rpmsg" subnode support
From: Shenwei Wang @ 2026-06-25 15:54 UTC (permalink / raw)
  To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>

From: Shenwei Wang <shenwei.wang@nxp.com>

Remote processors may announce multiple GPIO controllers over an RPMSG
channel. These GPIO controllers may require corresponding device tree
nodes, especially when acting as providers, to supply phandles for their
consumers.

Define an RPMSG node to work as a container for a group of RPMSG channels
under the imx_rproc node. Each subnode within "rpmsg" represents an
individual RPMSG channel. The name of each subnode corresponds to the
channel name as defined by the remote processor.

All remote devices associated with a given channel are defined as child
nodes under the corresponding channel node.

Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
 .../devicetree/bindings/gpio/gpio-rpmsg.yaml  | 55 +++++++++++++++++++
 .../bindings/remoteproc/fsl,imx-rproc.yaml    | 53 ++++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml

diff --git a/Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml b/Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml
new file mode 100644
index 000000000000..6c78b6850321
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml
@@ -0,0 +1,55 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/gpio-rpmsg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Generic RPMSG GPIO Controller
+
+maintainers:
+  - Shenwei Wang <shenwei.wang@nxp.com>
+
+description:
+  On an AMP platform, some GPIO controllers are exposed by the remote processor
+  through the RPMSG bus. The RPMSG GPIO transport protocol defines the packet
+  structure and communication flow between Linux and the remote firmware. Those
+  controllers are managed via this transport protocol. For more details of the
+  protocol, check the document below.
+  Documentation/driver-api/gpio/gpio-rpmsg.rst
+
+properties:
+  compatible:
+    oneOf:
+      - items:
+          - enum:
+              - fsl,rpmsg-gpio
+          - const: rpmsg-gpio
+      - const: rpmsg-gpio
+
+  reg:
+    description:
+      The reg property represents the index of the GPIO controller. Since
+      the driver manages controllers on a remote system, this index tells
+      the remote system which controller to operate.
+    maxItems: 1
+
+  "#gpio-cells":
+    const: 2
+
+  gpio-controller: true
+
+  interrupt-controller: true
+
+  "#interrupt-cells":
+    const: 2
+
+required:
+  - compatible
+  - reg
+  - "#gpio-cells"
+  - "#interrupt-cells"
+
+allOf:
+  - $ref: /schemas/gpio/gpio.yaml#
+
+unevaluatedProperties: false
diff --git a/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml b/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml
index ce8ec0119469..aea33205a881 100644
--- a/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml
@@ -85,6 +85,34 @@ properties:
       This property is to specify the resource id of the remote processor in SoC
       which supports SCFW
 
+  rpmsg:
+    type: object
+    additionalProperties: false
+    description:
+      Represents the RPMSG bus between Linux and the remote system. Contains
+      a group of RPMSG channel devices running on the bus.
+
+    properties:
+      rpmsg-io:
+        type: object
+        additionalProperties: false
+        properties:
+          '#address-cells':
+            const: 1
+
+          '#size-cells':
+            const: 0
+
+        patternProperties:
+          "gpio@[0-9a-f]+$":
+            type: object
+            $ref: /schemas/gpio/gpio-rpmsg.yaml#
+            unevaluatedProperties: false
+
+        required:
+          - '#address-cells'
+          - '#size-cells'
+
 required:
   - compatible
 
@@ -147,5 +175,30 @@ examples:
                 &mu 3 1>;
       memory-region = <&vdev0buffer>, <&vdev0vring0>, <&vdev0vring1>, <&rsc_table>;
       syscon = <&src>;
+
+      rpmsg {
+        rpmsg-io {
+          #address-cells = <1>;
+          #size-cells = <0>;
+
+          gpio@0 {
+            compatible = "rpmsg-gpio";
+            reg = <0>;
+            gpio-controller;
+            #gpio-cells = <2>;
+            #interrupt-cells = <2>;
+            interrupt-controller;
+          };
+
+          gpio@1 {
+            compatible = "rpmsg-gpio";
+            reg = <1>;
+            gpio-controller;
+            #gpio-cells = <2>;
+            #interrupt-cells = <2>;
+            interrupt-controller;
+          };
+        };
+      };
     };
 ...
-- 
2.43.0


^ permalink raw reply related

* [PATCH v14 1/5] docs: driver-api: gpio: rpmsg gpio driver over rpmsg bus
From: Shenwei Wang @ 2026-06-25 15:54 UTC (permalink / raw)
  To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn
In-Reply-To: <20260625155432.815185-1-shenwei.wang@oss.nxp.com>

From: Shenwei Wang <shenwei.wang@nxp.com>

Describe the GPIO RPMSG transport protocol used over the RPMSG bus
between the remote system and Linux.

Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
---
 Documentation/driver-api/gpio/gpio-rpmsg.rst | 271 +++++++++++++++++++
 Documentation/driver-api/gpio/index.rst      |   1 +
 2 files changed, 272 insertions(+)
 create mode 100644 Documentation/driver-api/gpio/gpio-rpmsg.rst
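
As an illustration of the 8-byte request layout that gpio-rpmsg.rst
defines (2-byte cmd, 2-byte line, 4-byte value), a sender could pack a
message roughly as follows. The document does not state a byte order, so
little-endian is assumed here, and the enum and function names are
hypothetical.

```c
#include <stdint.h>

/* Command codes as listed in gpio-rpmsg.rst; the names are illustrative. */
enum gpio_rpmsg_cmd {
	GPIO_CMD_GET_DIRECTION = 2,
	GPIO_CMD_SET_DIRECTION = 3,
	GPIO_CMD_GET_VALUE     = 4,
	GPIO_CMD_SET_VALUE     = 5,
	GPIO_CMD_SET_IRQ_TYPE  = 6,
	GPIO_CMD_SET_WAKEUP    = 16,
};

#define GPIO_RPMSG_REQ_LEN 8

/*
 * Pack one request: bytes 0-1 hold cmd, bytes 2-3 hold line and
 * bytes 4-7 hold value.
 * ASSUMPTION: multi-byte fields are little-endian.
 */
static void gpio_rpmsg_pack_req(uint8_t buf[GPIO_RPMSG_REQ_LEN], uint16_t cmd,
				uint16_t line, uint32_t value)
{
	buf[0] = (uint8_t)(cmd & 0xff);
	buf[1] = (uint8_t)(cmd >> 8);
	buf[2] = (uint8_t)(line & 0xff);
	buf[3] = (uint8_t)(line >> 8);
	buf[4] = (uint8_t)(value & 0xff);
	buf[5] = (uint8_t)((value >> 8) & 0xff);
	buf[6] = (uint8_t)((value >> 16) & 0xff);
	buf[7] = (uint8_t)(value >> 24);
}
```

Packing byte by byte keeps the wire format independent of the host's
native endianness and of struct padding.
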

diff --git a/Documentation/driver-api/gpio/gpio-rpmsg.rst b/Documentation/driver-api/gpio/gpio-rpmsg.rst
new file mode 100644
index 000000000000..7d351ff0adb0
--- /dev/null
+++ b/Documentation/driver-api/gpio/gpio-rpmsg.rst
@@ -0,0 +1,271 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+GPIO RPMSG (Remote Processor Messaging) Protocol
+================================================
+
+The GPIO RPMSG transport protocol is used for communication and interaction
+with GPIO controllers on remote processors via the RPMSG bus.
+
+Message Format
+--------------
+
+The RPMSG message consists of an 8-byte packet with the following layout:
+
+.. code-block:: none
+
+   +------+------+------+------+------+------+------+------+
+   | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+   |     cmd     |    line     |           value           |
+   +------+------+------+------+------+------+------+------+
+
+- **cmd**: Command code, used for GPIO_RPMSG_SEND messages.
+
+- **line**: The GPIO line (pin) index of the port.
+
+- **value**: See details in the command description below.
+
+
+GPIO Commands
+-------------
+
+Commands are specified in the **cmd** field.
+
+The SEND message is always sent from Linux to the remote firmware. Each
+SEND corresponds to a single REPLY message. The GPIO driver should
+serialize messages and determine whether a REPLY message is required. If a
+REPLY message is expected but not received within the specified timeout
+period (currently 1 second in the Linux driver), the driver should return
+-ETIMEDOUT.
+
+GET_DIRECTION (Cmd=2)
+~~~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+   +------+------+------+------+------+------+------+------+
+   | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+   |      2      |    line     |             0             |
+   +------+------+------+------+------+------+------+------+
+
+**Reply:**
+
+.. code-block:: none
+
+   +------+--------+--------+
+   | 0x00 |  0x01  |  0x02  |
+   |   1  | status | value  |
+   +------+--------+--------+
+
+- **status**:
+
+  - 0: Ok
+  - 1: Error
+
+- **value**: Direction.
+
+  - 0: None
+  - 1: Output
+  - 2: Input
+
+
+SET_DIRECTION (Cmd=3)
+~~~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+   +------+------+------+------+------+------+------+------+
+   | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+   |      3      |    line     |           value           |
+   +------+------+------+------+------+------+------+------+
+
+- **value**: Direction.
+
+  - 0: None
+  - 1: Output
+  - 2: Input
+
+**Reply:**
+
+.. code-block:: none
+
+   +------+--------+--------+
+   | 0x00 |  0x01  |  0x02  |
+   |   1  | status |    0   |
+   +------+--------+--------+
+
+- **status**:
+
+  - 0: Ok
+  - 1: Error
+
+
+GET_VALUE (Cmd=4)
+~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+   +------+------+------+------+------+------+------+------+
+   | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+   |      4      |    line     |             0             |
+   +------+------+------+------+------+------+------+------+
+
+**Reply:**
+
+.. code-block:: none
+
+   +------+--------+--------+
+   | 0x00 |  0x01  |  0x02  |
+   |   1  | status | value  |
+   +------+--------+--------+
+
+- **status**:
+
+  - 0: Ok
+  - 1: Error
+
+- **value**: Level.
+
+  - 0: Low
+  - 1: High
+
+
+SET_VALUE (Cmd=5)
+~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+   +------+------+------+------+------+------+------+------+
+   | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+   |      5      |    line     |           value           |
+   +------+------+------+------+------+------+------+------+
+
+- **value**: Output level.
+
+  - 0: Low
+  - 1: High
+
+**Reply:**
+
+.. code-block:: none
+
+   +------+--------+--------+
+   | 0x00 |  0x01  |  0x02  |
+   |   1  | status |    0   |
+   +------+--------+--------+
+
+- **status**:
+
+  - 0: Ok
+  - 1: Error
+
+
+SET_IRQ_TYPE (Cmd=6)
+~~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+   +------+------+------+------+------+------+------+------+
+   | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+   |      6      |    line     |           value           |
+   +------+------+------+------+------+------+------+------+
+
+- **value**: IRQ types.
+
+  - 0: Interrupt disabled
+  - 1: Rising edge trigger
+  - 2: Falling edge trigger
+  - 3: Both edge trigger
+  - 4: High level trigger
+  - 8: Low level trigger
+
+**Reply:**
+
+.. code-block:: none
+
+   +------+--------+--------+
+   | 0x00 |  0x01  |  0x02  |
+   |   1  | status |    0   |
+   +------+--------+--------+
+
+- **status**:
+
+  - 0: Ok
+  - 1: Error
+
+SET_WAKEUP (Cmd=16)
+~~~~~~~~~~~~~~~~~~~
+
+**Request:**
+
+.. code-block:: none
+
+   +------+------+------+------+------+------+------+------+
+   | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 |
+   |     16      |    line     |           value           |
+   +------+------+------+------+------+------+------+------+
+
+- **value**: Wakeup enable.
+
+  The remote system should always aim to stay in a power-efficient state by
+  shutting down or clock-gating the GPIO blocks that aren't in use. Since
+  the remoteproc driver is responsible for managing the power states of the
+  remote firmware, the GPIO driver does not need to know the firmware's
+  running states.
+
+  When the wakeup bit is set, the remote firmware should configure the line
+  as a wakeup source. The firmware should send the notification message to
+  Linux after it is woken from the GPIO line.
+
+  - 0: Disable wakeup from GPIO
+  - 1: Enable wakeup from GPIO
+
+**Reply:**
+
+.. code-block:: none
+
+   +------+--------+--------+
+   | 0x00 |  0x01  |  0x02  |
+   |   1  | status |    0   |
+   +------+--------+--------+
+
+- **status**:
+
+  - 0: Ok
+  - 1: Error
+
+Notification Message
+--------------------
+
+Notifications are sent by the remote core and carry
+**Type=2 (GPIO_RPMSG_NOTIFY)**.
+
+When a GPIO line asserts an interrupt on the remote processor, the firmware
+should immediately mask the corresponding interrupt source and send a
+notification message to Linux. Upon completion of the interrupt
+handling on the Linux side, the driver should issue a
+**SET_IRQ_TYPE** command to the firmware to unmask the interrupt.
+
+A Notification message can arrive between a SEND and its REPLY message,
+and the driver is expected to handle this scenario.
+
+.. code-block:: none
+
+   +------+------+--------+
+   | 0x00 | 0x01 |  0x02  |
+   |   2  | line | trigger|
+   +------+------+--------+
+
+- **line**: The GPIO line (pin) index of the port.
+
+- **trigger**: Optional parameter to indicate the trigger event type.
+
diff --git a/Documentation/driver-api/gpio/index.rst b/Documentation/driver-api/gpio/index.rst
index bee58f709b9a..e5eb1f82f01f 100644
--- a/Documentation/driver-api/gpio/index.rst
+++ b/Documentation/driver-api/gpio/index.rst
@@ -16,6 +16,7 @@ Contents:
    drivers-on-gpio
    bt8xxgpio
    pca953x
+   gpio-rpmsg
 
 Core
 ====
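
The incoming side of the protocol documented above (3-byte REPLY and
NOTIFY messages, distinguished by the type in byte 0) could be decoded
along these lines. This is a sketch under the layouts shown in
gpio-rpmsg.rst only; the struct, constant and function names are
hypothetical and do not come from the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Message types as shown in byte 0 of the documented layouts. */
enum { GPIO_MSG_REPLY = 1, GPIO_MSG_NOTIFY = 2 };

#define GPIO_RPMSG_IN_LEN 3

/* Decoded view of an incoming message; unused fields are left at zero. */
struct gpio_rpmsg_in {
	uint8_t type;
	uint8_t status;		/* REPLY: 0 = Ok, 1 = Error */
	uint8_t value;		/* REPLY: command-specific result */
	uint8_t line;		/* NOTIFY: GPIO line index */
	uint8_t trigger;	/* NOTIFY: optional trigger event type */
};

/* Returns 0 on success, -1 for a short buffer or an unknown type. */
static int gpio_rpmsg_decode(const uint8_t *buf, size_t len,
			     struct gpio_rpmsg_in *out)
{
	if (!buf || !out || len < GPIO_RPMSG_IN_LEN)
		return -1;

	memset(out, 0, sizeof(*out));
	out->type = buf[0];

	switch (buf[0]) {
	case GPIO_MSG_REPLY:
		out->status = buf[1];
		out->value = buf[2];
		return 0;
	case GPIO_MSG_NOTIFY:
		out->line = buf[1];
		out->trigger = buf[2];
		return 0;
	default:
		return -1;
	}
}
```

Checking the length before touching the payload matters because a
NOTIFY can arrive at any time, including between a SEND and its REPLY.
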
-- 
2.43.0


^ permalink raw reply related

* [PATCH v14 0/5] Enable Remote GPIO over RPMSG on i.MX Platform
From: Shenwei Wang @ 2026-06-25 15:54 UTC (permalink / raw)
  To: Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn

From: Shenwei Wang <shenwei.wang@nxp.com>

Support the remote devices on the remote processor via the RPMSG bus on
the i.MX platform.

Changes in v14:
 - Update gpio-rpmsg.rst per Mathieu’s feedback.
 - Align the rpmsg-gpio driver with the revised gpio-rpmsg.rst.
 - Modify rpmsg-core to enable prefix-based matching of RPMSG device IDs.

Changes in v13:
 - drop the support for legacy NXP firmware.
 - remove the fixed_up hooks from the rpmsg gpio driver.
 - code cleanup.

Changes in v12:
 - Fixed the "underline" warning reported by Randy.

Changes in v11:
 - Expand RPMSG for the first time per Shuah's review comment.

Changes in v10:
 - Update gpio-rpmsg.rst according to Daniel Baluta's review comments.
 - Add a kernel CONFIG for fixed up handlers and only enable it on
   i.MX products.
 - Fixed bugs reported by kernel test robot.

Changes in v9:
 - Reuse the gpio-virtio design for command and IRQ type definitions.
 - Remove msg_id, version, and vendor fields from the generic protocol.
 - Add fixed-up handlers to support legacy firmware.

Changes in v8:
 - Add "depends on REMOTEPROC" in Kconfig to fix the build error reported
   by the kernel test robot.
 - Move the .rst patch before the .yaml patch.
 - Handle the "ngpios" DT property based on Andrew's feedback.

Changes in v7:
 - Reworked the driver to use the rpmsg_driver framework instead of
   platform_driver, based on feedback from Bjorn and Arnaud.
 - Updated gpio-rpmsg.yaml and imx_rproc.yaml according to comments from
   Rob and Arnaud.
 - Further refinements to gpio-rpmsg.yaml per Arnaud's feedback.

Changes in v6:
 - make the driver more generic with the actions below:
     rename the driver file to gpio-rpmsg.c
     remove the imx related info in the function and variable names
     rename the imx_rpmsg.h to rpdev_info.h
     create a gpio-rpmsg.yaml and refer it in imx_rproc.yaml
 - update the gpio-rpmsg.rst according to the feedback from Andrew and
   move the source file to driver-api/gpio
 - fix the bug reported by Zhongqiu Han
 - remove the I2C related info

Changes in v5:
 - move the gpio-rpmsg.rst from admin-guide to staging directory after
   discussion with Randy Dunlap.
 - add include files with some code improvements per Bartosz's comments.

Changes in v4:
 - add documentation describing the transport protocol per Andrew's
   comments.
 - add a new handler to get the gpio direction.

Changes in v3:
 - fix various format issues and return value checks per Peng's review
   comments.
 - add the logic to also populate the subnodes which are not in the
   device map per Arnaud's request. (in imx_rproc.c)
 - update the yaml per Frank's review comments.

Changes in v2:
 - re-implemented the gpio driver per Linus Walleij's feedback by using
   GPIOLIB_IRQCHIP helper library.
 - fix various format issues per Mathieu/Peng's review comments.
 - update the yaml doc per Rob's feedback

Shenwei Wang (5):
  docs: driver-api: gpio: rpmsg gpio driver over rpmsg bus
  dt-bindings: remoteproc: imx_rproc: Add "rpmsg" subnode support
  rpmsg: core: match rpmsg device IDs by prefix
  gpio: rpmsg: add generic rpmsg GPIO driver
  arm64: dts: imx8ulp: Add rpmsg node under imx_rproc

 .../devicetree/bindings/gpio/gpio-rpmsg.yaml  |  55 ++
 .../bindings/remoteproc/fsl,imx-rproc.yaml    |  53 ++
 Documentation/driver-api/gpio/gpio-rpmsg.rst  | 271 +++++++++
 Documentation/driver-api/gpio/index.rst       |   1 +
 arch/arm64/boot/dts/freescale/imx8ulp.dtsi    |  25 +
 drivers/gpio/Kconfig                          |  17 +
 drivers/gpio/Makefile                         |   1 +
 drivers/gpio/gpio-rpmsg.c                     | 568 ++++++++++++++++++
 drivers/rpmsg/rpmsg_core.c                    |   4 +-
 9 files changed, 994 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/gpio/gpio-rpmsg.yaml
 create mode 100644 Documentation/driver-api/gpio/gpio-rpmsg.rst
 create mode 100644 drivers/gpio/gpio-rpmsg.c

--
2.43.0


^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25 15:45 UTC (permalink / raw)
  To: ljs
  Cc: akpm, alex.aring, allen.lkml, arnd, brauner, chuck.lever, corbet,
	david, ebiederm, j.granados, jack, jackzxcui1989, jlayton,
	juri.lelli, keescook, liam, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, mcgrof, mingo, mjguzik, peterz, pfalcato,
	vincent.guittot, viro
In-Reply-To: <aj0cUrwdXYKIicC-@lucifer>

On Thu, 25 Jun 2026 13:48:10 +0100 Lorenzo Stoakes <ljs@kernel.org> wrote:

> +cc missing maintainers, lists.
> 
> NAK.
> 
> This is un-upstreamable for numerous reasons.
> 
> The stuff you're doing in mm is broken, wrong and invasive and you've not
> even bothered to cc- mm people. I'm annoyed by this.
> 
> You're also doing incredibly silly mistakes at v4 of something that should have
> been an RFC.
> 
> You don't seem to understand the concept of patch _series_ (break it up into
> smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
> you're radically alterting.
> 
> I'm annoyed as you have a history where you were told not to add insane hacks
> before ([0], my reply at [1]).
> 
> [0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
> [1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/
> 
> Was I wasting my time there? Am I wasting my time responding now?
> 
> And how hard is it to run a simple perl script?
> 
> Let me run it for you for _just_ the maintainers:

I probably shouldn't reply to this email to waste more of your time, but I
can't help but respond because your comments have been very beneficial to
me, and I enjoy the process.

The v4 version has changed too much compared to the v3 version. I should
have re-executed the "get maintainer" script, but I mistakenly copied the
previous email list and sent it out. I sincerely apologize for that.

There are quite a few issues now, and I haven't come up with a good
overall solution. I actually want to resolve the problems we encountered
in our project with minimal kernel modifications, but I can't think of a
good way to do it. It seems that the v4 version has turned out to be a
complete disaster of a patch, and I sincerely hope that my example won't
be used as a counterexample in the future. Thank you for that.

Suddenly, I have some thoughts about this issue, but I even question
whether I should have these ideas. Let me sit down and sort things out
properly. I hope the v5 version won't be a disaster.

Thanks
Xin Zhao

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Andrew Lunn @ 2026-06-25 15:46 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <7a88fee8-bbb3-480f-9c93-677b7270a940@bootlin.com>

On Thu, Jun 25, 2026 at 12:46:44PM +0200, Maxime Chevallier wrote:
> Hi Andrew,
> 
> On 5/29/26 14:59, Andrew Lunn wrote:
> 
> (This discussion was a while ago, but this bit of context should be enough)
> 
> > But we also need to consider that for some APIs, we have decided that
> > a configuration can be set now, which does not actually apply in our
> > current conditions, but it will be stored away for when conditions
> > change and it is applicable. The half duplex case could fit that. When
> > the link is currently half duplex, you can configure pause, but you
> > don't expect it to actually change the current behaviour. It only
> > kicks in when the link renegotiates to full duplex sometime in the
> > future. We have to also consider this the other way around. The link
> > is full duplex and pause is configured by the user. Something happens
> > with the LP and the link renegotiates to half duplex. The local end
> > should not throw away the configuration, it simply cannot apply it
> > given the current situation.
> 
> I'm writing the test description for HD with a better formatting, so the
> HD test wouldn't be about "are we using pause stuff while in HD" as it
> doesn't make sense, but rather "do we correctly store the pause settings
> aside for later".

O.K.

> I'm realising that we don't really have an API to report the *true* in-use pause
> settings. Taking HD as an example :
> 
> # ethtool -s eth2 duplex half
> 
> [588209.379363] mvpp2 f4000000.ethernet eth2: Link is Up - 100Mbps/Half - flow control off
> 
> # ethtool eth2
> 	[...]
> 	Supported pause frame use: Symmetric Receive-only
> 	Advertised pause frame use: Symmetric Receive-only
> 	Link partner advertised pause frame use: Symmetric Receive-only

Does it even make sense to advertise this when in HD? But i don't
think we need to consider this now. I consider HD low priority, i
doubt it is actually used very often. We should concentrate on FD
testing.

> # ethtool -a eth2
> Autonegotiate:	on
> RX:		off
> TX:		off
> RX negotiated: on
> TX negotiated: on
> 
> 
> Sure, pause and HD don't make sense, however what I find confusing to some
> extent is that the only place we have information about the *actual* pause
> settings is the "link is Up" log in dmesg.

Maybe we should extend ksetting get to return the resolved pause
parameters? But i'm not sure how much that actually gives us. Anything
using phylink will just ask phylink to fill in the ksettings
information, and it seems unlikely phylink gets it wrong. What we are
really trying to test is drivers which don't use phylink, those are
the ones which are generally broken, and they are not going to
implement anything new in ksettings. So i think the test has to look
at:

> 	Advertised pause frame use: Symmetric Receive-only
> 	Link partner advertised pause frame use: Symmetric Receive-only

and check these match what we expect.
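
For illustration only, here is a minimal sketch of that comparison. It is
not code from this thread. It assumes the usual 802.3 encoding of an rx/tx
request into advertisement bits (Pause = rx, Asym_Pause = rx xor tx) and
the standard resolution of two advertisements. The function names and the
set-of-strings representation are made up for the example, and parsing of
the ethtool output is not shown.

```python
from __future__ import annotations

PAUSE = "Pause"
ASYM_PAUSE = "Asym_Pause"
PAUSE_BITS = {PAUSE, ASYM_PAUSE}


def expected_advertising(rx: bool, tx: bool) -> set[str]:
    """Pause bits expected in the advertisement for an rx/tx request.

    Assumption: Pause = rx, Asym_Pause = rx xor tx.
    """
    adv = set()
    if rx:
        adv.add(PAUSE)
    if rx != tx:
        adv.add(ASYM_PAUSE)
    return adv


def resolve_pause(local: set[str], partner: set[str]) -> tuple[bool, bool]:
    """Return (tx_pause, rx_pause) resolved from both advertisements."""
    common = local & partner
    if PAUSE in common:
        return True, True
    if ASYM_PAUSE in common:
        return PAUSE in partner, PAUSE in local
    return False, False


def check_advertising(rx: bool, tx: bool, local_adv: set[str],
                      peer_lp_adv: set[str]) -> list[str]:
    """Compare observed pause bits with the expectation; return failures."""
    want = expected_advertising(rx, tx)
    failures = []
    for name, seen in (("local advertising", local_adv),
                       ("peer lp_advertising", peer_lp_adv)):
        got = set(seen) & PAUSE_BITS
        if got != want:
            failures.append(
                f"{name}: expected {sorted(want)}, got {sorted(got)}")
    return failures
```

An empty list from check_advertising() means both the local advertisement
and what the peer sees match the request. resolve_pause() is only there to
show what the link should end up using once both sides have advertised.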

    Andrew


* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-25 15:40 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <6ed7d12a-c3a1-4572-8385-754e6d5b8b44@kernel.org>

On Thu, Jun 25, 2026, David Hildenbrand (Arm) wrote:
> On 6/25/26 02:35, Sean Christopherson wrote:
> > One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
> > was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
> > something into folio_may_be_lru_cached().  But due to taking a per-lru lock,
> > that would penalize the relatively hot path and definitely common operation of
> > faulting in guest memory.  On the other hand, memory conversion is already a
> > relatively slow operation and is relatively uncommon compared to page faults,
> > (and likely very uncommon for real world setups).  I.e. having to drain all
> > caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
> > path.
> 
> Yeah, the lru_add_drain_all is rather messy.
> 
> We have similar code in
> 
> collect_longterm_unpinnable_folios(), where we first try a lru_add_drain(), to
> then escalate to a lru_add_drain_all().
> 
> Maybe we could factor that (suboptimal code) out to not have to reinvent the
> same thing multiple times?

As discussed in the guest_memfd call, we should do this straightaway, i.e. instead
of merging this series as-is, so that we don't export lru_add_drain_all() only to
drop the export a kernel or two later, and can instead export the helper to drain
any batches for a folio (or set of folios/pages).



* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 15:29 UTC (permalink / raw)
  To: Andrew Lunn, Jakub Kicinski
  Cc: davem, Eric Dumazet, Paolo Abeni, Simon Horman, Russell King,
	Heiner Kallweit, Jonathan Corbet, Shuah Khan, Oleksij Rempel,
	Vladimir Oltean, Florian Fainelli, thomas.petazzoni, netdev,
	linux-kernel, linux-doc
In-Reply-To: <5cb8e2b4-8eb6-4446-9b90-1cd4c7964cd9@lunn.ch>

Hi Andrew,

On 5/27/26 04:47, Andrew Lunn wrote:
> On Tue, May 26, 2026 at 05:24:47PM -0700, Jakub Kicinski wrote:
>> On Fri, 22 May 2026 19:51:06 +0200 Maxime Chevallier (Netdev
>> Foundation) wrote:
>>>  Documentation/networking/pause_test_plan.rst | 556 +++++++++++++++++++
>>
>> It'd be great to hear from others but IMHO in the current form this is
>> not suitable for Documentation/networking/ We can commit the "knowledge"
>> part but enumerating the test cases seems odd for Documentation/.
> 
> Sorry, not looked too deeply at the actual content yet.
> 
> What i was thinking was a python file, which sphinx can ingest to
> produce documentation, and placeholders where code would be added to
> implement the actual test during the next phase.
> 
> This is how i've done testing in the past. I would be the evil one who
> thought up the tests and described them in detail using sphinx markup
> in a python test template file. After some review they got passed off
> to a python developer for implementation. And when they got run and
> failed, sometimes the feature developer, the test developer and myself
> got together to figure who made the error.
> 
> I'm not sure we even need sphinx. What i find important is that the
> test is documented. What kAPI calls should be made with what
> parameters. What results are expected and why? So that when a test
> fails, a developer has the information they need to fix their
> code. The Why? is important, and often missing from the kernel tests.

This isn't Sphinx, but I've come up with something like this for a
test definition :


@ksft_ethtool_needs_supported_anyof([Pause, Asym_Pause])
def test_ethtool_pause_advertising(cfg, peer) -> None:
    """Pause advertisement

    Validate that changing pause params through the ETHTOOL_MSG_PAUSE command
    translates to a change in the advertised pause params, and that these
    parameters are correct w.r.t the supported pause params and requested pause
    params.
    
    This exercises the .set_pauseparam() ethtool op for MAC configuration,
    as well as the reconfiguration of the PHY's advertising and negotiation.

    On non-phylink MACs, the MAC should call phy_set_sym_pause() to update the
    PHY's advertising, and restart a negotiation with phy_start_aneg() if
    need be. Failure to do so will result in the wrong advertising parameters.

    On phylink-enabled MACs, phylink deals with the PHY reconfiguration provided
    the MAC driver calls phylink_ethtool_set_pauseparam().

    Failing this test likely means that the PHY driver is not correctly advertising
    pause settings, either due to the MAC not triggering a PHY reconfiguration,
    a misconfiguration of the advertising registers by the PHY, or by
    mis-handling the phydev->advertising bitfield in the PHY driver directly.
    
    The validation is made by looking at the advertised modes locally, as well as
    what the peer's 'lp_advertising' values report.

    cfg -- local device's interface configuration
    peer -- peer device handle
    """

    # Initial conditions :
    # - Local interface is admin UP, and reports lowlayer link UP
    # - Remote interface is admin UP, and reports lowlayer link UP
    #
    # Test 1
    # - SKIP if supported doesn't contain "Pause"
    # - run 'ethtool -A ethX rx on tx on autoneg on'
    # - FAIL if the return isn't 0
    # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values do not contain
    #   "Pause" or contain "Asym_Pause"
    # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
    #   "Asym_Pause"
    # - Succeed otherwise
    #
    # Test 2
    # - SKIP if supported doesn't contain both "Pause" and "Asym_Pause"
    # - run 'ethtool -A ethX rx on tx on autoneg on'
    # - FAIL if the return isn't 0
    # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values do not contain
    #   "Pause" or contain "Asym_Pause"
    # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
    #   "Asym_Pause"
    #
    # ...
   
The annotation defines the prerequisites in terms of locally supported
linkmodes, and the docstring contains information for developers to debug
their drivers. What I'm unsure about is the commented-out part below it:
either one big function testing multiple adjacent scenarios, or
individual functions.

We could also use annotations to enumerate the various combinations of
modes to test.
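
To make that idea concrete, here is a rough sketch, not part of the
proposed suite: a decorator that attaches every rx/tx/autoneg combination
to a test function, and a small runner that calls it once per combination.
pause_variants(), run_variants() and ethtool_pause_cmd() are hypothetical
names invented for this example, and the test body only builds the
'ethtool -A' command line instead of running it.

```python
import itertools


def pause_variants(func):
    """Hypothetical decorator: attach all rx/tx/autoneg combinations."""
    func.variants = [
        {"rx": rx, "tx": tx, "autoneg": autoneg}
        for rx, tx, autoneg in itertools.product((False, True), repeat=3)
    ]
    return func


def run_variants(func, *args):
    """Call func once per attached variant; return (variant, result) pairs."""
    return [(variant, func(*args, **variant)) for variant in func.variants]


def onoff(flag):
    """Render a boolean the way ethtool expects it on the command line."""
    return "on" if flag else "off"


@pause_variants
def ethtool_pause_cmd(ifname, rx, tx, autoneg):
    """Build the 'ethtool -A' command for one variant (not executed here)."""
    return (f"ethtool -A {ifname} rx {onoff(rx)} tx {onoff(tx)} "
            f"autoneg {onoff(autoneg)}")


if __name__ == "__main__":
    for variant, cmd in run_variants(ethtool_pause_cmd, "eth0"):
        print(variant, "->", cmd)
```

With this shape each combination can still be reported as its own result,
while the docstring explaining the "why" is written only once on the
decorated function.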

That's just an extract of the full test suite for Pause, but before
writing the whole thing down i figure it's better to iterate on a single
test's design.

What do you think ?

Maxime


* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Sean Christopherson @ 2026-06-25 14:36 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aj0Jf30PS2f7x1nt@yzhao56-desk.sh.intel.com>

On Thu, Jun 25, 2026, Yan Zhao wrote:
> On Thu, Jun 25, 2026 at 09:51:01AM +0800, Yan Zhao wrote:
> > On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> > > On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > > > This means this module parameter only enables per-gmem memory attribute and does
> > > > > not guarantee that gmem in-place conversion will actually occur.
> > > 
> > > KVM module params are pretty much always about what KVM supports, not what is
> > > guaranteed to happen.
> > > 
> > >   - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> > >     because maybe the guest never accesses emulated MMIO.
> > >   - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> > >     not to advertise one.
> > >   - and so on and so forth...
> > > 
> > > Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> > > to "I need to set memory attributes on the guest_memfd instance, not the VM",
> > > but I don't see that as a big hurdle, certainly not in the long term.  And once
> > > the VMM code is written, I really do think most people are going to care about
> > > whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> > Sorry, I just saw this mail after posting my reply in [1].
> > 
> > I'm ok with gmem_in_place_conversion=true just means KVM supports in-place
> > conversion, while we can still create VMs with shared memory not from gmem.
> Or what about "allow_gmem_in_place_conversion" ?

No, because turning on the param also disallows setting PRIVATE in the VM-scoped
KVM_SET_MEMORY_ATTRIBUTES ioctl.

> > Though it still feels a bit odd to require TDX huge pages to depend on
> > gmem_in_place_conversion=true when shared memory is not currently allocated
> > from gmem, 

I fully expect that to be a transient state, and in all likelihood not something
that is *ever* shipped in production.  Landing TDX hugepages without guest_memfd
hugepage support is all about avoiding unnecessary serialization of series and
features that aren't strictly dependent on each other.

> > it should become more natural over time once gmem supports in-place
> > conversions for huge page.

Yes, and I want to prioritize the steady state for end users, not the in-progress
state for developers.  Once all of this settles out, I fully expect the majority
of deployments to only support in-place conversion, at which point the end user
is only going to care whether or not in-place conversion is enabled in KVM, not
the subtle detail that it's still possible to do out-of-place conversions (and
that will always hold true, it's not like VMA-based memslots are being deprecated).

> > Besides my current usage, there may be other scenarios where gmem memory
> > attributes are preferred without allocating shared memory from gmem.
> > (e.g., PAGE.ADD from a temp extra shared source memory).
> > 
> > For such use cases, I'm concerned that the admins may find it confusing if they
> > enable gmem_in_place_conversion but still observe extra memory consumptions for
> > shared memory.

KVM can help with documentation, but beyond that, it's not KVM's problem to solve.
If a VMM *and* platform owner chooses to deploy a setup that utilizes out-of-place
conversions, then it's on the VMM and/or platform owner to understand and communicate
the implications to the end user.

And I'm not remotely convinced that prepending allow_ to the param will help
end users diagnose "unexpected" memory consumption, in quotes because anyone that
is deploying a stack that utilizes out-of-place conversion absolutely needs to
understand and plan for the additional memory consumption.  I.e. if the memory
consumption is "unexpected" to the end user, they likely have far bigger problems.


* Re: [PATCH v6 03/12] PCI: liveupdate: Track incoming preserved PCI devices
From: Pratyush Yadav @ 2026-06-25 14:35 UTC (permalink / raw)
  To: David Matlack
  Cc: Pasha Tatashin, kexec, linux-doc, linux-kernel, linux-mm,
	linux-pci, Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Mike Rapoport, Parav Pandit, Pranjal Shrivastava,
	Pratyush Yadav, Saeed Mahameed, Samiullah Khawaja, Shuah Khan,
	Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <ajBgj_aSuzMZG47e@google.com>

Hi David,

On Mon, Jun 15 2026, David Matlack wrote:

> On 2026-06-14 01:38 PM, Pasha Tatashin wrote:
>> On Fri, 22 May 2026 20:24:01 +0000, David Matlack <dmatlack@google.com> wrote:
[...]
>> > +	}
>> > +
>> > +	pci_info(dev, "Device was preserved by previous kernel across Live Update\n");
>> > +	dev->liveupdate.incoming = dev_ser;
>> > +
>> > +	/*
>> > +	 * Hold the ref on the incoming FLB until pci_liveupdate_finish() so
>> > +	 * that dev->liveupdate.incoming does not get freed while it is in use.
>> > +	 */
>> 
>> How would that work? If finish is not called FLB stays around until the 
>> next reboot.
>
> True... I think if the PCI core trusts drivers to call
> pci_liveupdate_finish() then we don't need to hold onto the incoming
> reference here.

That was my point when I was arguing against refcounts on outgoing FLBs.
This is very easy to abuse, especially when we are talking about device
drivers. And this refcounting mechanism makes the FLB no longer
file-lifecycle-bound, since now it is entirely up to drivers to decide
the lifecycle of this data.

I have been thinking about this a bit more in the last couple days, and
I wonder if we are doing this right. Here's an idea I have been thinking
of.

We should make live update a first class citizen in PCI. Instead of
patching in liveupdate via the liveupdate.incoming field, and letting
drivers figure out when to use it, we should separate out probe and
retrieve paths entirely.

Probe and retrieve are fundamentally different operations. While they
may share some common initialization logic for the _software_ state, how
they interface with the hardware is completely different. I think mixing
the two will result in driver code being more spaghetti by having
liveupdate checks sprayed out all over.

This series doesn't add support for any drivers, but looking at some of
the code we have downstream, I see this problem. The liveupdate code is
all over the place in the driver and it is very hard to wrap one's head
around how the device is actually retrieved.

So I think PCI core should track preserved devices, and if the device is
preserved, it should skip the probe and wait for retrieve. Retrieve does
the full initialization of the device. This fits in with the LUO model
as well. You can make retrieve a callback of struct pci_driver and do
some wrappers to talk with LUO, so device drivers don't directly
interface with LUO at all.

We should do similar things on the shutdown path. Shutdown is a
fundamentally different operation from freeze, and so we should separate
them out as well.

This solves the lifetime problem as well. When PCI core is initializing,
it knows for sure that no retrievals are going to happen. That's because
none of the drivers have registered yet. So it can safely access the FLB
and initialize its state. After that, drivers can register themselves
and start accepting retrieve() calls. Once the last driver goes away,
the FLB is freed automatically.

I am sorry for suggesting a big refactor at v6, but the early versions
looked good to me at the time, and I only thought more deeply about this
when trying to figure out how we can make the lifetimes cleaner.

What do you think? Does this make sense?

-- 
Regards,
Pratyush Yadav


* [RFC PATCH v1.2 01/11] Docs/mm/damon/design: update for DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP
From: SeongJae Park @ 2026-06-25 14:23 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
	linux-kernel, linux-mm
In-Reply-To: <20260625142357.103500-1-sj@kernel.org>

Commit 9138e27a3bc3 ("mm/damon: add node_eligible_mem_bp goal metric")
introduced DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP but did not update the DAMON
design document accordingly.  Update it.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/mm/damon/design.rst | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 2da7ca0d3d17a..9dbace087a329 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -686,9 +686,11 @@ mechanism tries to make ``current_value`` of ``target_metric`` be same to
   (1/10,000).
 - ``inactive_mem_bp``: Inactive to active + inactive (LRU) memory size ratio in
   bp (1/10,000).
+- ``node_eligible_mem_bp``: Scheme target access pattern-eligible memory ratio
+  of a node in bp (1/10,000).
 
-``nid`` is optionally required for only ``node_mem_used_bp``,
-``node_mem_free_bp``, ``node_memcg_used_bp`` and ``node_memcg_free_bp`` to
+``nid`` is optionally required for ``node_mem_used_bp``, ``node_mem_free_bp``,
+``node_memcg_used_bp``, ``node_memcg_free_bp`` and ``node_eligible_mem_bp`` to
 point the specific NUMA node.
 
 ``path`` is optionally required for only ``node_memcg_used_bp`` and
-- 
2.47.3

