Linux Documentation


* [PATCH 04/19] crypto: cmh - add SHA-2/SHA-3/SHAKE ahash
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC
  To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-1-skrishnamoorthy@rambus.com>

From: Alex Ousherovitch <aousherovitch@rambus.com>

Register ahash algorithms for SHA-224, SHA-256, SHA-384, SHA-512,
SHA3-224, SHA3-256, SHA3-384, SHA3-512, SHAKE128, and SHAKE256
using the CMH hash core (core ID 0x02).

Supports incremental update/finup/final, init/export/import for
request cloning, and the CRYPTO_AHASH_REQ_VIRT flag for input passed
as a virtual address (copied into a linear DMA buffer).

Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
Reviewed-by: Thi Nguyen <thin@rambus.com>
---
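The holdback arithmetic used by .update() can be sketched in plain C as
below. This is only an illustration of the split described in the
cmh_hash.c header comment, not the driver code: split_update() and
struct split are hypothetical names, and the precondition that the
holdback is always shorter than one block is an assumption drawn from
how the tail is stored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; the patch computes these values inline. */
struct split {
	uint32_t full_len; /* bytes submitted to HW, a multiple of block_size */
	uint32_t tail_len; /* bytes kept in the holdback buffer */
	uint32_t from_src; /* bytes of new input that go into the submission */
};

/*
 * Returns 0 when the data is only buffered, 1 when at least one full
 * block must be submitted. Assumes buf_len < block_size on entry.
 */
int split_update(uint32_t buf_len, uint32_t nbytes,
		 uint32_t block_size, struct split *s)
{
	uint32_t total = buf_len + nbytes;

	assert(block_size > 0 && buf_len < block_size);

	if (total < block_size) {
		/* Not enough for a block: everything stays in holdback. */
		s->full_len = 0;
		s->tail_len = total;
		s->from_src = 0;
		return 0;
	}

	s->full_len = total - total % block_size;
	s->tail_len = total - s->full_len;
	s->from_src = s->full_len - buf_len;
	return 1;
}
```

For example, with 10 bytes held back, 150 new bytes, and a 64-byte
block (SHA-256), the split is full_len = 128, tail_len = 32 and
from_src = 118.
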
 drivers/crypto/cmh/Makefile           |   3 +-
 drivers/crypto/cmh/cmh_hash.c         | 860 ++++++++++++++++++++++++++
 drivers/crypto/cmh/cmh_main.c         |   9 +
 drivers/crypto/cmh/include/cmh_hash.h |  26 +
 4 files changed, 897 insertions(+), 1 deletion(-)
 create mode 100644 drivers/crypto/cmh/cmh_hash.c
 create mode 100644 drivers/crypto/cmh/include/cmh_hash.h
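
The block sizes in the cmh_hash_algs_info table for SHA-3, and the
SHAKE rates quoted in its comment, follow from the Keccak sponge
parameters: the state is 1600 bits and the capacity is twice the digest
size (SHA-3) or twice the security strength (SHAKE). A minimal sketch
for checking those numbers, using hypothetical helper names that do not
appear in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Keccak-f[1600] state width in bytes. */
#define KECCAK_STATE_BYTES 200u

/* Rate in bytes for a fixed-output SHA-3 digest of digest_bits bits. */
uint32_t sha3_rate_bytes(uint32_t digest_bits)
{
	assert(digest_bits % 8 == 0 && digest_bits <= 512);
	/* capacity = 2 * digest size; rate = state - capacity */
	return KECCAK_STATE_BYTES - 2u * (digest_bits / 8u);
}

/* Rate in bytes for SHAKE with the given security strength in bits. */
uint32_t shake_rate_bytes(uint32_t security_bits)
{
	assert(security_bits == 128 || security_bits == 256);
	/* capacity = 2 * security strength; rate = state - capacity */
	return KECCAK_STATE_BYTES - 2u * (security_bits / 8u);
}
```

sha3_rate_bytes(224) gives 144, which is also the value used for
CMH_HASH_MAX_BLOCK; shake_rate_bytes(128) gives 168.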

diff --git a/drivers/crypto/cmh/Makefile b/drivers/crypto/cmh/Makefile
index 1492e575598c..c0531f416229 100644
--- a/drivers/crypto/cmh/Makefile
+++ b/drivers/crypto/cmh/Makefile
@@ -14,7 +14,8 @@ cmh-y := \
        cmh_dma.o \
        cmh_sysfs.o \
        cmh_key.o \
-       cmh_sys.o
+       cmh_sys.o \
+       cmh_hash.o

 # Management ioctl device (/dev/cmh_mgmt): key lifecycle, PKE, PQC ioctls.
 cmh-$(CONFIG_CRYPTO_DEV_CMH_MGMT) += \
diff --git a/drivers/crypto/cmh/cmh_hash.c b/drivers/crypto/cmh/cmh_hash.c
new file mode 100644
index 000000000000..2256bf4314c3
--- /dev/null
+++ b/drivers/crypto/cmh/cmh_hash.c
@@ -0,0 +1,860 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Cryptography Research, Inc. (CRI).
+ * CMH LKM -- Kernel Crypto API Hash Driver
+ *
+ * Registers asynchronous hash (ahash) algorithms with the Linux crypto
+ * subsystem.  Implements SHA-2 (224/256/384/512), SHA-3
+ * (224/256/384/512), and SHAKE (128/256) families using the CMH Hash
+ * Core (HC).
+ *
+ * Incremental HW update model -- each .update() with enough data for
+ * at least one full block submits a self-contained VCQ transaction:
+ *
+ *   .init()   -> software-only: zero per-request context
+ *   .update() -> buffer data in holdback; when >= block_size bytes:
+ *                INIT [+ RESTORE] + UPDATE(full blocks) + SAVE + FLUSH
+ *                -> return -EINPROGRESS  (else return 0, data in holdback)
+ *   .final()  -> INIT [+ RESTORE] [+ UPDATE(residual)] + FINAL + FLUSH
+ *   .finup()  -> linearise holdback + new data, then final path
+ *   .digest() -> .init() + .finup() (single-shot; input is linearised)
+ *   .export() -> software-only: copy checkpoint + holdback to out
+ *   .import() -> software-only: restore checkpoint + holdback from in
+ *
+ * The FLUSH after each .update() releases the HC core, so no lockout.
+ * Two hash sessions interleave fine on the same MBX -- each saves its
+ * own state via SAVE and restores via RESTORE on the next call.
+ *
+ * Export/import is purely software (no HW interaction), enabling
+ * crypto API transform clone for all plain-hash algorithms.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/crypto.h>
+#include <crypto/internal/hash.h>
+#include <crypto/scatterwalk.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include "cmh_hash.h"
+#include "cmh_vcq.h"
+#include "cmh_txn.h"
+#include "cmh_dma.h"
+
+/* Algorithm Table */
+
+struct cmh_hash_alg_info {
+       u32         hc_algo;        /* HC_ALGO_* (SHA2, SHA3, SHAKE) */
+       u32         digest_size;    /* bytes */
+       u32         block_size;     /* cra_blocksize for Linux crypto API */
+       const char *alg_name;       /* Linux crypto name: "sha256" */
+       const char *drv_name;       /* driver name: "cri-cmh-sha256" */
+};
+
+static const struct cmh_hash_alg_info cmh_hash_algs_info[] = {
+       /* SHA-2 family */
+       {
+               .hc_algo     = HC_ALGO_SHA2_224,
+               .digest_size = CMH_SHA224_DIGEST_SIZE,
+               .block_size  = 64,
+               .alg_name    = "sha224",
+               .drv_name    = "cri-cmh-sha224",
+       },
+       {
+               .hc_algo     = HC_ALGO_SHA2_256,
+               .digest_size = CMH_SHA256_DIGEST_SIZE,
+               .block_size  = 64,
+               .alg_name    = "sha256",
+               .drv_name    = "cri-cmh-sha256",
+       },
+       {
+               .hc_algo     = HC_ALGO_SHA2_384,
+               .digest_size = CMH_SHA384_DIGEST_SIZE,
+               .block_size  = 128,
+               .alg_name    = "sha384",
+               .drv_name    = "cri-cmh-sha384",
+       },
+       {
+               .hc_algo     = HC_ALGO_SHA2_512,
+               .digest_size = CMH_SHA512_DIGEST_SIZE,
+               .block_size  = 128,
+               .alg_name    = "sha512",
+               .drv_name    = "cri-cmh-sha512",
+       },
+       /* SHA-3 family */
+       {
+               .hc_algo     = HC_ALGO_SHA3_224,
+               .digest_size = CMH_SHA3_224_DIGEST_SIZE,
+               .block_size  = 144,  /* rate = 1600/8 - 2*224/8 = 144 */
+               .alg_name    = "sha3-224",
+               .drv_name    = "cri-cmh-sha3-224",
+       },
+       {
+               .hc_algo     = HC_ALGO_SHA3_256,
+               .digest_size = CMH_SHA3_256_DIGEST_SIZE,
+               .block_size  = 136,  /* rate = 1600/8 - 2*256/8 = 136 */
+               .alg_name    = "sha3-256",
+               .drv_name    = "cri-cmh-sha3-256",
+       },
+       {
+               .hc_algo     = HC_ALGO_SHA3_384,
+               .digest_size = CMH_SHA3_384_DIGEST_SIZE,
+               .block_size  = 104,  /* rate = 1600/8 - 2*384/8 = 104 */
+               .alg_name    = "sha3-384",
+               .drv_name    = "cri-cmh-sha3-384",
+       },
+       {
+               .hc_algo     = HC_ALGO_SHA3_512,
+               .digest_size = CMH_SHA3_512_DIGEST_SIZE,
+               .block_size  = 72,   /* rate = 1600/8 - 2*512/8 = 72 */
+               .alg_name    = "sha3-512",
+               .drv_name    = "cri-cmh-sha3-512",
+       },
+       /*
+        * SHAKE (XOF) family -- fixed-output ahash registration.
+        *
+        * cra_blocksize = 1: SHAKE is a sponge/XOF, not Merkle-Damgaard.
+        * The Keccak rate (168 for SHAKE-128, 136 for SHAKE-256) exceeds
+        * MAX_ALGAPI_BLOCKSIZE (160) on Linux <=6.7.  Using 1 signals
+        * "byte-oriented" which is correct for XOF consumers.  The kernel
+        * raised the limit to 208 in 6.8 (commit 2f3a22704889).
+        */
+       {
+               .hc_algo     = HC_ALGO_SHAKE128,
+               .digest_size = CMH_SHAKE128_DIGEST_SIZE,
+               .block_size  = 1,    /* XOF: no meaningful block for crypto API */
+               .alg_name    = "shake128",
+               .drv_name    = "cri-cmh-shake128",
+       },
+       {
+               .hc_algo     = HC_ALGO_SHAKE256,
+               .digest_size = CMH_SHAKE256_DIGEST_SIZE,
+               .block_size  = 1,    /* XOF: no meaningful block for crypto API */
+               .alg_name    = "shake256",
+               .drv_name    = "cri-cmh-shake256",
+       },
+};
+
+#define CMH_HASH_ALG_COUNT  ARRAY_SIZE(cmh_hash_algs_info)
+
+/* Per-Request State */
+
+/* Maximum cra_blocksize across all registered algorithms (SHA3-224) */
+#define CMH_HASH_MAX_BLOCK     144
+
+/*
+ * Exported hash state -- serialised by .export(), deserialised by
+ * .import().  This is what statesize advertises to the crypto subsystem.
+ */
+struct cmh_hash_export_state {
+       u8  checkpoint[HC_CONTEXT_SIZE]; /* HC context from last SAVE */
+       u8  buf[CMH_HASH_MAX_BLOCK];    /* holdback buffer */
+       u32 buf_len;                     /* valid bytes in buf[] */
+       u32 hw_started;                  /* non-zero if checkpoint valid */
+};
+
+/*
+ * Maximum payload commands any hash transaction can produce:
+ *   INIT + RESTORE + UPDATE + SAVE/FINAL + FLUSH = 5
+ * Worst-case packed output (stride=7, 1 payload per VCQ):
+ *   5 VCQs x 2 entries = 10
+ */
+#define CMH_HASH_MAX_PAYLOAD    5
+#define CMH_HASH_MAX_PACKED     (CMH_HASH_MAX_PAYLOAD * 2)
+
+/*
+ * Stored in ahash_request_ctx().  Tracks the algorithm, a holdback
+ * buffer for partial blocks, an HC context checkpoint from the last
+ * SAVE, and DMA state for the current in-flight async operation.
+ *
+ * The checkpoint is embedded inline rather than heap-allocated because
+ * the kernel ahash API has no per-request destructor.  If a request is
+ * abandoned without .final() (e.g. transform freed early), a heap
+ * checkpoint would leak unconditionally.
+ */
+struct cmh_hash_reqctx {
+       const struct cmh_hash_alg_info *info;
+       int    error;
+       u32    hw_started;      /* non-zero after first HW submission */
+       u32    buf_len;         /* bytes in holdback buf[] */
+       u32    has_checkpoint;  /* non-zero if checkpoint[] valid */
+       /* DMA state for current async operation */
+       dma_addr_t ckpt_dma;   /* RESTORE input */
+       dma_addr_t save_dma;   /* SAVE output (update only) */
+       dma_addr_t data_dma;   /* UPDATE input */
+       dma_addr_t digest_dma; /* FINAL output (final/digest only) */
+       u8    *save_buf;       /* SAVE output buffer */
+       u8    *data_buf;       /* linearised data for DMA */
+       u32    data_len;       /* bytes in data_buf */
+       u8    *digest_buf;     /* digest output buffer */
+       u8     buf[CMH_HASH_MAX_BLOCK]; /* holdback for partial block */
+       u8     checkpoint[HC_CONTEXT_SIZE]; /* HC context from last SAVE */
+       struct vcq_cmd packed[CMH_HASH_MAX_PACKED];
+};
+
+/* VCQ Builders (HC-specific; shared builders in cmh_hc_abi.h / cmh_vcq.h) */
+
+/* Add an HC_CMD_UPDATE entry */
+static void vcq_add_hc_update(struct vcq_cmd *slot, u32 core_id, u64 input_phys, u32 len)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, HC_CMD_UPDATE);
+       slot->hwc.hc.cmd_update.input = input_phys;
+       slot->hwc.hc.cmd_update.inlen = len;
+}
+
+/* Add an HC_CMD_SAVE entry */
+static void vcq_add_hc_save(struct vcq_cmd *slot, u32 core_id, u64 output_phys, u32 outlen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, HC_CMD_SAVE);
+       slot->hwc.hc.cmd_save.output = output_phys;
+       slot->hwc.hc.cmd_save.outlen = outlen;
+}
+
+/* Add an HC_CMD_RESTORE entry */
+static void vcq_add_hc_restore(struct vcq_cmd *slot, u32 core_id, u64 input_phys, u32 inlen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, HC_CMD_RESTORE);
+       slot->hwc.hc.cmd_restore.input = input_phys;
+       slot->hwc.hc.cmd_restore.inlen = inlen;
+}
+
+/* Request Context Cleanup */
+
+static void cmh_hash_free_reqctx(struct cmh_hash_reqctx *rctx)
+{
+       rctx->has_checkpoint = 0;
+}
+
+/* VCQ Packing + Submit */
+
+/* ahash Operations */
+
+/*
+ * Wrapper struct: embeds ahash_alg + a pointer to our alg_info table
+ * entry so we can recover it in the tfm callbacks.
+ */
+struct cmh_hash_alg_drv {
+       struct ahash_alg                 alg;
+       const struct cmh_hash_alg_info  *info;
+};
+
+/*
+ * Recover the cmh_hash_alg_info for a transform.  The info pointer is
+ * stored in the cmh_hash_alg_drv wrapper that embeds the registered
+ * ahash_alg; it is set at registration time (see cmh_hash_register).
+ */
+static const struct cmh_hash_alg_info *
+cmh_hash_get_info(struct crypto_ahash *tfm)
+{
+       struct ahash_alg *alg = crypto_ahash_alg(tfm);
+
+       return container_of(alg, struct cmh_hash_alg_drv, alg)->info;
+}
+
+static int cmh_hash_init(struct ahash_request *req)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+
+       memset(rctx, 0, sizeof(*rctx));
+       rctx->info = cmh_hash_get_info(tfm);
+       return 0;
+}
+
+/*
+ * Update completion -- called from threaded IRQ after SAVE completes.
+ * Copies save_buf into the inline checkpoint, then frees save_buf.
+ */
+static void cmh_hash_update_complete(void *data, int error)
+{
+       struct ahash_request *req = data;
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       /* Unmap DMA buffers */
+       if (rctx->has_checkpoint)
+               cmh_dma_unmap_single(rctx->ckpt_dma, HC_CONTEXT_SIZE,
+                                    DMA_TO_DEVICE);
+       cmh_dma_unmap_single(rctx->save_dma, HC_CONTEXT_SIZE,
+                            DMA_FROM_DEVICE);
+       cmh_dma_unmap_single(rctx->data_dma, rctx->data_len,
+                            DMA_TO_DEVICE);
+
+       if (!error) {
+               memcpy(rctx->checkpoint, rctx->save_buf, HC_CONTEXT_SIZE);
+               rctx->has_checkpoint = 1;
+               kfree(rctx->save_buf);
+               rctx->save_buf = NULL;
+               rctx->hw_started = 1;
+       } else {
+               kfree(rctx->save_buf);
+               rctx->save_buf = NULL;
+               rctx->error = error;
+       }
+
+       kfree(rctx->data_buf);
+       rctx->data_buf = NULL;
+       rctx->data_len = 0;
+
+       cmh_complete(&req->base, error);
+}
+
+/*
+ * .update -- buffer incoming data, submit full blocks to HW.
+ *
+ * Maintains a partial-block holdback buffer in rctx->buf[].  When
+ * enough data is available for at least one full block, the full
+ * blocks are linearised and submitted as:
+ *   INIT [+ RESTORE] + UPDATE(full_blocks) + SAVE + FLUSH
+ *
+ * The tail (< block_size) stays in the holdback for the next call.
+ * Returns -EINPROGRESS on HW submission, 0 if only buffering.
+ */
+static int cmh_hash_update(struct ahash_request *req)
+{
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+       const struct cmh_hash_alg_info *info = rctx->info;
+       struct vcq_cmd cmds[CMH_HASH_MAX_PAYLOAD];
+       struct core_dispatch d;
+       u32 block_size = info->block_size;
+       u32 total_avail, full_len, tail_len, from_src;
+       u32 idx;
+       int ret;
+       gfp_t gfp;
+
+       if (rctx->error)
+               return rctx->error;
+
+       if (!req->nbytes)
+               return 0;
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       total_avail = rctx->buf_len + req->nbytes;
+
+       /* Not enough for a full block -- just buffer */
+       if (total_avail < block_size) {
+               if (req->base.flags & CRYPTO_AHASH_REQ_VIRT)
+                       memcpy(rctx->buf + rctx->buf_len,
+                              req->svirt, req->nbytes);
+               else
+                       scatterwalk_map_and_copy(rctx->buf + rctx->buf_len,
+                                                req->src, 0,
+                                                req->nbytes, 0);
+               rctx->buf_len = total_avail;
+               return 0;
+       }
+
+       /* Have at least one full block -- submit to HW */
+       full_len = total_avail - total_avail % block_size;
+       tail_len = total_avail - full_len;
+       from_src = full_len - rctx->buf_len;
+
+       /* Linearise: holdback prefix + full blocks from scatterlist */
+       rctx->data_buf = kmalloc(full_len, gfp);
+       if (!rctx->data_buf)
+               return -ENOMEM;
+
+       if (rctx->buf_len > 0)
+               memcpy(rctx->data_buf, rctx->buf, rctx->buf_len);
+
+       if (from_src > 0) {
+               if (req->base.flags & CRYPTO_AHASH_REQ_VIRT)
+                       memcpy(rctx->data_buf + rctx->buf_len,
+                              req->svirt, from_src);
+               else
+                       scatterwalk_map_and_copy(rctx->data_buf + rctx->buf_len,
+                                                req->src, 0,
+                                                from_src, 0);
+       }
+
+       /* Move tail to holdback */
+       if (tail_len > 0) {
+               if (req->base.flags & CRYPTO_AHASH_REQ_VIRT)
+                       memcpy(rctx->buf, req->svirt + from_src,
+                              tail_len);
+               else
+                       scatterwalk_map_and_copy(rctx->buf, req->src,
+                                                from_src, tail_len,
+                                                0);
+       }
+       rctx->buf_len = tail_len;
+       rctx->data_len = full_len;
+
+       /* Allocate SAVE output buffer */
+       rctx->save_buf = kzalloc(HC_CONTEXT_SIZE, gfp);
+       if (!rctx->save_buf) {
+               ret = -ENOMEM;
+               goto err_free;
+       }
+
+       /* DMA map data, save output, and checkpoint */
+       rctx->data_dma = cmh_dma_map_single(rctx->data_buf, full_len,
+                                           DMA_TO_DEVICE);
+       if (cmh_dma_map_error(rctx->data_dma)) {
+               ret = -ENOMEM;
+               goto err_free;
+       }
+
+       rctx->save_dma = cmh_dma_map_single(rctx->save_buf, HC_CONTEXT_SIZE,
+                                           DMA_FROM_DEVICE);
+       if (cmh_dma_map_error(rctx->save_dma)) {
+               ret = -ENOMEM;
+               goto err_unmap_data;
+       }
+
+       rctx->ckpt_dma = DMA_MAPPING_ERROR;
+       if (rctx->has_checkpoint) {
+               rctx->ckpt_dma = cmh_dma_map_single(rctx->checkpoint,
+                                                   HC_CONTEXT_SIZE,
+                                                   DMA_TO_DEVICE);
+               if (cmh_dma_map_error(rctx->ckpt_dma)) {
+                       ret = -ENOMEM;
+                       goto err_unmap_save;
+               }
+       }
+
+       /* Build VCQ: INIT [+ RESTORE] + UPDATE + SAVE + FLUSH */
+       d = cmh_core_select_instance(CMH_CORE_HC);
+       idx = 0;
+
+       vcq_add_hc_init(&cmds[idx++], d.core_id, info->hc_algo);
+
+       if (rctx->has_checkpoint)
+               vcq_add_hc_restore(&cmds[idx++], d.core_id,
+                                  (u64)rctx->ckpt_dma, HC_CONTEXT_SIZE);
+
+       vcq_add_hc_update(&cmds[idx++], d.core_id,
+                         (u64)rctx->data_dma, full_len);
+
+       vcq_add_hc_save(&cmds[idx++], d.core_id,
+                       (u64)rctx->save_dma, HC_CONTEXT_SIZE);
+
+       vcq_add_flush(&cmds[idx++], d.core_id);
+
+       ret = cmh_vcq_pack_and_submit_async(cmds, idx, rctx->packed,
+                                           CMH_HASH_MAX_PACKED,
+                                           d.mbx_idx,
+                                           cmh_hash_update_complete, req,
+                                           !!(req->base.flags &
+                                              CRYPTO_TFM_REQ_MAY_BACKLOG),
+                                           cmh_tm_async_timeout_jiffies());
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (ret)
+               goto err_unmap_ckpt;
+
+       return -EINPROGRESS;
+
+err_unmap_ckpt:
+       if (rctx->has_checkpoint)
+               cmh_dma_unmap_single(rctx->ckpt_dma, HC_CONTEXT_SIZE,
+                                    DMA_TO_DEVICE);
+err_unmap_save:
+       cmh_dma_unmap_single(rctx->save_dma, HC_CONTEXT_SIZE,
+                            DMA_FROM_DEVICE);
+err_unmap_data:
+       cmh_dma_unmap_single(rctx->data_dma, full_len, DMA_TO_DEVICE);
+err_free:
+       kfree(rctx->save_buf);
+       rctx->save_buf = NULL;
+       kfree(rctx->data_buf);
+       rctx->data_buf = NULL;
+       rctx->data_len = 0;
+       return ret;
+}
+
+/*
+ * Final completion -- unmap all DMA, copy digest, signal done.
+ */
+static void cmh_hash_final_complete(void *data, int error)
+{
+       struct ahash_request *req = data;
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       if (rctx->has_checkpoint)
+               cmh_dma_unmap_single(rctx->ckpt_dma, HC_CONTEXT_SIZE,
+                                    DMA_TO_DEVICE);
+       if (rctx->data_buf)
+               cmh_dma_unmap_single(rctx->data_dma, rctx->data_len,
+                                    DMA_TO_DEVICE);
+       cmh_dma_unmap_single(rctx->digest_dma, rctx->info->digest_size,
+                            DMA_FROM_DEVICE);
+
+       if (!error)
+               memcpy(req->result, rctx->digest_buf,
+                      rctx->info->digest_size);
+
+       kfree(rctx->digest_buf);
+       rctx->digest_buf = NULL;
+       kfree(rctx->data_buf);
+       rctx->data_buf = NULL;
+       cmh_hash_free_reqctx(rctx);
+       cmh_complete(&req->base, error);
+}
+
+/*
+ * Submit the final VCQ transaction:
+ *   INIT [+ RESTORE] [+ UPDATE(residual)] + FINAL + FLUSH
+ *
+ * @data_buf: linearised residual data, or NULL for empty-hash.
+ *            Ownership transferred -- callback frees it.
+ * @data_len: bytes in data_buf.
+ */
+static int cmh_hash_submit_final(struct ahash_request *req,
+                                u8 *data_buf, u32 data_len)
+{
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+       const struct cmh_hash_alg_info *info = rctx->info;
+       struct vcq_cmd cmds[CMH_HASH_MAX_PAYLOAD];
+       struct core_dispatch d;
+       u32 idx;
+       int ret;
+       gfp_t gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+                  GFP_KERNEL : GFP_ATOMIC;
+
+       rctx->data_buf = data_buf;
+       rctx->data_len = data_len;
+
+       /* Allocate digest output buffer */
+       rctx->digest_buf = kzalloc(info->digest_size, gfp);
+       if (!rctx->digest_buf) {
+               ret = -ENOMEM;
+               goto err_free_data;
+       }
+
+       rctx->digest_dma = cmh_dma_map_single(rctx->digest_buf,
+                                             info->digest_size,
+                                             DMA_FROM_DEVICE);
+       if (cmh_dma_map_error(rctx->digest_dma)) {
+               ret = -ENOMEM;
+               goto err_free_digest;
+       }
+
+       /* Map residual data for UPDATE */
+       rctx->data_dma = DMA_MAPPING_ERROR;
+       if (data_buf && data_len > 0) {
+               rctx->data_dma = cmh_dma_map_single(data_buf, data_len,
+                                                   DMA_TO_DEVICE);
+               if (cmh_dma_map_error(rctx->data_dma)) {
+                       ret = -ENOMEM;
+                       goto err_unmap_digest;
+               }
+       }
+
+       /* Map checkpoint for RESTORE */
+       rctx->ckpt_dma = DMA_MAPPING_ERROR;
+       if (rctx->has_checkpoint) {
+               rctx->ckpt_dma = cmh_dma_map_single(rctx->checkpoint,
+                                                   HC_CONTEXT_SIZE,
+                                                   DMA_TO_DEVICE);
+               if (cmh_dma_map_error(rctx->ckpt_dma)) {
+                       ret = -ENOMEM;
+                       goto err_unmap_data;
+               }
+       }
+
+       /* Build VCQ: INIT [+ RESTORE] [+ UPDATE] + FINAL + FLUSH */
+       d = cmh_core_select_instance(CMH_CORE_HC);
+       idx = 0;
+
+       vcq_add_hc_init(&cmds[idx++], d.core_id, info->hc_algo);
+
+       if (rctx->has_checkpoint)
+               vcq_add_hc_restore(&cmds[idx++], d.core_id,
+                                  (u64)rctx->ckpt_dma, HC_CONTEXT_SIZE);
+
+       if (data_buf && data_len > 0)
+               vcq_add_hc_update(&cmds[idx++], d.core_id,
+                                 (u64)rctx->data_dma, data_len);
+
+       vcq_add_hc_final(&cmds[idx++], d.core_id,
+                        (u64)rctx->digest_dma, info->digest_size);
+
+       vcq_add_flush(&cmds[idx++], d.core_id);
+
+       ret = cmh_vcq_pack_and_submit_async(cmds, idx, rctx->packed,
+                                           CMH_HASH_MAX_PACKED,
+                                           d.mbx_idx,
+                                           cmh_hash_final_complete, req,
+                                           !!(req->base.flags &
+                                              CRYPTO_TFM_REQ_MAY_BACKLOG),
+                                           cmh_tm_async_timeout_jiffies());
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (ret)
+               goto err_unmap_ckpt;
+
+       return -EINPROGRESS;
+
+err_unmap_ckpt:
+       if (rctx->has_checkpoint)
+               cmh_dma_unmap_single(rctx->ckpt_dma, HC_CONTEXT_SIZE,
+                                    DMA_TO_DEVICE);
+err_unmap_data:
+       if (data_buf && data_len > 0)
+               cmh_dma_unmap_single(rctx->data_dma, data_len,
+                                    DMA_TO_DEVICE);
+err_unmap_digest:
+       cmh_dma_unmap_single(rctx->digest_dma, info->digest_size,
+                            DMA_FROM_DEVICE);
+err_free_digest:
+       kfree(rctx->digest_buf);
+       rctx->digest_buf = NULL;
+err_free_data:
+       kfree(data_buf);
+       rctx->data_buf = NULL;
+       cmh_hash_free_reqctx(rctx);
+       return ret;
+}
+
+static int cmh_hash_final(struct ahash_request *req)
+{
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+       u8 *data_buf = NULL;
+       u32 data_len = 0;
+       gfp_t gfp;
+
+       if (rctx->error)
+               return rctx->error;
+
+       if (rctx->buf_len > 0) {
+               gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+                     GFP_KERNEL : GFP_ATOMIC;
+               data_buf = kmalloc(rctx->buf_len, gfp);
+               if (!data_buf)
+                       return -ENOMEM;
+               memcpy(data_buf, rctx->buf, rctx->buf_len);
+               data_len = rctx->buf_len;
+               rctx->buf_len = 0;
+       }
+
+       return cmh_hash_submit_final(req, data_buf, data_len);
+}
+
+static int cmh_hash_finup(struct ahash_request *req);
+
+/*
+ * One-shot digest -- delegates to init + finup so that all data is
+ * linearised and mapped through cmh_dma_map_single(), which is the
+ * only DMA mapping path aware of all supported DMA backends.
+ */
+static int cmh_hash_digest(struct ahash_request *req)
+{
+       int ret;
+
+       ret = cmh_hash_init(req);
+       if (ret)
+               return ret;
+       return cmh_hash_finup(req);
+}
+
+/*
+ * .finup -- update + final combined into a single transaction.
+ *
+ * Linearises the holdback buffer + new data and submits everything
+ * through the final path.  Avoids the kernel's ahash_def_finup()
+ * which would allocate a subrequest and clone via export/import.
+ */
+static int cmh_hash_finup(struct ahash_request *req)
+{
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+       u32 data_len;
+       u8 *data_buf;
+       gfp_t gfp;
+
+       if (rctx->error)
+               return rctx->error;
+
+       data_len = rctx->buf_len + req->nbytes;
+
+       if (data_len == 0)
+               return cmh_hash_submit_final(req, NULL, 0);
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       data_buf = kmalloc(data_len, gfp);
+       if (!data_buf)
+               return -ENOMEM;
+
+       if (rctx->buf_len > 0)
+               memcpy(data_buf, rctx->buf, rctx->buf_len);
+
+       if (req->nbytes > 0) {
+               if (req->base.flags & CRYPTO_AHASH_REQ_VIRT)
+                       memcpy(data_buf + rctx->buf_len,
+                              req->svirt, req->nbytes);
+               else
+                       scatterwalk_map_and_copy(data_buf + rctx->buf_len,
+                                                req->src, 0,
+                                                req->nbytes, 0);
+       }
+
+       rctx->buf_len = 0;
+       return cmh_hash_submit_final(req, data_buf, data_len);
+}
+
+/*
+ * Export -- purely software.
+ *
+ * Serialise the HC checkpoint (if any) and holdback buffer into the
+ * export state structure.  No HW interaction needed because the
+ * incremental model keeps checkpoint up-to-date after each .update().
+ */
+static int cmh_hash_export(struct ahash_request *req, void *out)
+{
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+       struct cmh_hash_export_state *state = out;
+
+       if (rctx->hw_started && rctx->has_checkpoint)
+               memcpy(state->checkpoint, rctx->checkpoint, HC_CONTEXT_SIZE);
+       else
+               memset(state->checkpoint, 0, HC_CONTEXT_SIZE);
+
+       if (rctx->buf_len > 0)
+               memcpy(state->buf, rctx->buf, rctx->buf_len);
+
+       state->buf_len = rctx->buf_len;
+       state->hw_started = rctx->hw_started;
+
+       return 0;
+}
+
+/*
+ * Import -- purely software.
+ *
+ * Restore checkpoint and holdback from a previously exported state.
+ * The next .update() or .final() will RESTORE the checkpoint into HW.
+ */
+static int cmh_hash_import(struct ahash_request *req, const void *in)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
+       struct cmh_hash_reqctx *rctx = ahash_request_ctx(req);
+       const struct cmh_hash_export_state *state = in;
+
+       memset(rctx, 0, sizeof(*rctx));
+       rctx->info = cmh_hash_get_info(tfm);
+
+       if (state->buf_len > CMH_HASH_MAX_BLOCK)
+               return -EINVAL;
+
+       rctx->hw_started = state->hw_started;
+       rctx->buf_len = state->buf_len;
+       memcpy(rctx->buf, state->buf, state->buf_len);
+
+       if (state->hw_started) {
+               memcpy(rctx->checkpoint, state->checkpoint, HC_CONTEXT_SIZE);
+               rctx->has_checkpoint = 1;
+       }
+
+       return 0;
+}
+
+/* Transform init (cra_init) -- set per-request context size */
+
+static int cmh_hash_cra_init(struct crypto_tfm *tfm)
+{
+       crypto_ahash_set_reqsize(__crypto_ahash_cast(tfm),
+                                sizeof(struct cmh_hash_reqctx));
+       return 0;
+}
+
+/* Registration */
+
+static struct cmh_hash_alg_drv cmh_hash_drvs[CMH_HASH_ALG_COUNT];
+
+/**
+ * cmh_hash_register() - Register SHA-2, SHA-3, and SHAKE ahash algorithms
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int cmh_hash_register(void)
+{
+       unsigned int i;
+       int ret;
+
+       for (i = 0; i < CMH_HASH_ALG_COUNT; i++) {
+               const struct cmh_hash_alg_info *info = &cmh_hash_algs_info[i];
+               struct cmh_hash_alg_drv *drv = &cmh_hash_drvs[i];
+               struct ahash_alg *alg = &drv->alg;
+
+               drv->info = info;
+
+               alg->init   = cmh_hash_init;
+               alg->update = cmh_hash_update;
+               alg->final  = cmh_hash_final;
+               alg->finup  = cmh_hash_finup;
+               alg->digest = cmh_hash_digest;
+               alg->export = cmh_hash_export;
+               alg->import = cmh_hash_import;
+
+               alg->halg.digestsize = info->digest_size;
+               alg->halg.statesize  = sizeof(struct cmh_hash_export_state);
+
+               strscpy(alg->halg.base.cra_name, info->alg_name,
+                       CRYPTO_MAX_ALG_NAME);
+               strscpy(alg->halg.base.cra_driver_name, info->drv_name,
+                       CRYPTO_MAX_ALG_NAME);
+               alg->halg.base.cra_priority    = 300;
+               alg->halg.base.cra_flags       = CRYPTO_ALG_KERN_DRIVER_ONLY |
+                                                CRYPTO_ALG_NO_FALLBACK |
+                                                CRYPTO_ALG_ASYNC |
+                                                CRYPTO_ALG_REQ_VIRT;
+               alg->halg.base.cra_blocksize   = info->block_size;
+               alg->halg.base.cra_ctxsize     = 0;
+               alg->halg.base.cra_init        = cmh_hash_cra_init;
+               alg->halg.base.cra_module      = THIS_MODULE;
+
+               ret = crypto_register_ahash(alg);
+               if (ret) {
+                       dev_err(cmh_dev(), "hash: failed to register %s (rc=%d)\n",
+                               info->drv_name, ret);
+                       /* Unregister any already-registered algorithms */
+                       while (i--)
+                               crypto_unregister_ahash(&cmh_hash_drvs[i].alg);
+                       return ret;
+               }
+
+               dev_dbg(cmh_dev(), "hash: registered %s (priority 300)\n",
+                       info->drv_name);
+       }
+
+       dev_info(cmh_dev(), "hash: %zu algorithm(s) registered\n",
+                CMH_HASH_ALG_COUNT);
+       return 0;
+}
+
+/**
+ * cmh_hash_unregister() - Unregister all CMH ahash algorithms from the crypto framework
+ */
+void cmh_hash_unregister(void)
+{
+       unsigned int i;
+
+       for (i = 0; i < CMH_HASH_ALG_COUNT; i++) {
+               crypto_unregister_ahash(&cmh_hash_drvs[i].alg);
+               dev_dbg(cmh_dev(), "hash: unregistered %s\n",
+                       cmh_hash_algs_info[i].drv_name);
+       }
+
+       dev_info(cmh_dev(), "hash: cleaned up\n");
+}
diff --git a/drivers/crypto/cmh/cmh_main.c b/drivers/crypto/cmh/cmh_main.c
index 307bd7dd304b..e8e30b893932 100644
--- a/drivers/crypto/cmh/cmh_main.c
+++ b/drivers/crypto/cmh/cmh_main.c
@@ -29,6 +29,7 @@
 #include "cmh_mqi.h"
 #include "cmh_txn.h"
 #include "cmh_rh.h"
+#include "cmh_hash.h"
 #include "cmh_mgmt.h"
 #include "cmh_registers.h"
 #include "cmh_debugfs.h"
@@ -191,6 +192,11 @@ static int cmh_probe(struct platform_device *pdev)
        if (ret)
                goto err_rh_init;

+       /* Register hash algorithms with the kernel crypto API */
+       ret = cmh_hash_register();
+       if (ret)
+               goto err_hash_register;
+
        /* Register key management device (/dev/cmh_mgmt) */
        ret = cmh_mgmt_register();
        if (ret)
@@ -203,6 +209,8 @@ static int cmh_probe(struct platform_device *pdev)
        return 0;

 err_mgmt_register:
+       cmh_hash_unregister();
+err_hash_register:
        cmh_rh_cleanup(cfg);
 err_rh_init:
        cmh_tm_cleanup();
@@ -229,6 +237,7 @@ static void cmh_remove(struct platform_device *pdev)
        cfg = &dev->config;

        cmh_mgmt_unregister();
+       cmh_hash_unregister();
        cmh_rh_cleanup(cfg);
        cmh_tm_cleanup();
        cmh_mqi_cleanup(cfg);
diff --git a/drivers/crypto/cmh/include/cmh_hash.h b/drivers/crypto/cmh/include/cmh_hash.h
new file mode 100644
index 000000000000..bf17d3af7787
--- /dev/null
+++ b/drivers/crypto/cmh/include/cmh_hash.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2026 Cryptography Research, Inc. (CRI).
+ * CMH LKM -- Kernel Crypto API Hash Driver
+ *
+ * Registers ahash algorithms (SHA-2, SHA-3, and SHAKE families) with the
+ * Linux crypto subsystem.  Uses an incremental HW update model:
+ *
+ *   .init()   -> software-only: zero per-request context
+ *   .update() -> holdback partial blocks; submit full blocks via
+ *                INIT [+ RESTORE] + UPDATE + SAVE + FLUSH
+ *   .final()  -> INIT [+ RESTORE] [+ UPDATE(residual)] + FINAL + FLUSH
+ *   .digest() -> INIT + UPDATE + FINAL + FLUSH (single-shot)
+ *   .export() -> software-only: copy checkpoint + holdback
+ *   .import() -> software-only: restore checkpoint + holdback
+ */
+
+#ifndef CMH_HASH_H
+#define CMH_HASH_H
+
+#include "cmh_config.h"
+
+int  cmh_hash_register(void);
+void cmh_hash_unregister(void);
+
+#endif /* CMH_HASH_H */
--
2.43.7
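
Note on the incremental model described in the cmh_hash.h header comment:
the holdback step in .update() can be pictured with a small userspace
sketch. Everything in it is hypothetical (struct holdback,
holdback_update(), sketch_submit(), SKETCH_BLOCK) and it is not taken from
cmh_hash.c. It assumes the simplest policy, where only a trailing partial
block is kept in software, and it replaces the hardware submission with a
byte counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK 64	/* hypothetical block size (SHA-256 uses 64 bytes) */

/* Hypothetical per-request state: bytes that do not yet fill a block. */
struct holdback {
	uint8_t buf[SKETCH_BLOCK];
	size_t len;		/* bytes currently held back, < SKETCH_BLOCK */
	size_t submitted;	/* stand-in for "bytes sent to hardware" */
};

/* Stand-in for a hardware UPDATE: it only counts the bytes it is given. */
void sketch_submit(struct holdback *hb, const uint8_t *data, size_t n)
{
	(void)data;
	hb->submitted += n;
}

/* Feed len bytes: submit whole blocks, keep the trailing partial block. */
void holdback_update(struct holdback *hb, const uint8_t *data, size_t len)
{
	size_t full;

	assert(hb->len < SKETCH_BLOCK);
	if (!len)
		return;

	/* Top up a previously held partial block first. */
	if (hb->len) {
		size_t take = SKETCH_BLOCK - hb->len;

		if (take > len)
			take = len;
		memcpy(hb->buf + hb->len, data, take);
		hb->len += take;
		data += take;
		len -= take;
		if (hb->len < SKETCH_BLOCK)
			return;	/* still partial: nothing to submit */
		sketch_submit(hb, hb->buf, SKETCH_BLOCK);
		hb->len = 0;
	}

	/* Submit whole blocks straight from the caller's buffer. */
	full = len - (len % SKETCH_BLOCK);
	if (full)
		sketch_submit(hb, data, full);

	/* Hold back whatever is left over. */
	memcpy(hb->buf, data + full, len - full);
	hb->len = len - full;
}
```

In this sketch, feeding 100 bytes submits one 64-byte block and holds 36;
a further 28 bytes complete the held block and leave nothing behind.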


** This message and any attachments are for the sole use of the intended recipient(s). It may contain information that is confidential and privileged. If you are not the intended recipient of this message, you are prohibited from printing, copying, forwarding or saving it. Please delete the message and attachments and notify the sender immediately. **

Rambus Inc.<http://www.rambus.com>


* [PATCH 18/19] selftests: crypto: cmh - add kselftest for management ioctl
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC (permalink / raw)
  To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-1-skrishnamoorthy@rambus.com>

From: Alex Ousherovitch <aousherovitch@rambus.com>

Add a minimal kselftest exercising the /dev/cmh_mgmt ioctl interface:

  - open/close the device node
  - invalid ioctl fails with errno ENOTTY
  - bad version field fails with errno EINVAL
  - KEY_NEW + KEY_DELETE lifecycle
  - KIC HKDF1 key derivation
  - ML-KEM-768 keygen via hardware RNG

Tests use the kselftest_harness.h fixture framework and output TAP.
Tests that require hardware features not present on the device under
test are gracefully skipped (SKIP).

Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
Reviewed-by: Thi Nguyen <thin@rambus.com>
---
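Note on the error checks in this test: a driver's negative return value
reaches userspace as ioctl() == -1 with errno set, which is why the tests
assert on errno. The sketch below shows that convention without the cmh
module. It decodes the same bogus command word the invalid_ioctl test
builds and issues it on a caller-supplied path. The names
(SKETCH_BOGUS_CMD, sketch_check_cmd_fields(), sketch_bogus_ioctl_errno())
are hypothetical, and /dev/null in the usage note is an assumption chosen
only because it exists on any Linux system and implements no ioctls.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Same bogus command word the invalid_ioctl test builds. */
#define SKETCH_BOGUS_CMD _IOC(_IOC_READ, 'J', 0xFF, 4)

/* Decode the command word back into the fields it was built from. */
void sketch_check_cmd_fields(void)
{
	assert(_IOC_DIR(SKETCH_BOGUS_CMD) == _IOC_READ);
	assert(_IOC_TYPE(SKETCH_BOGUS_CMD) == 'J');
	assert(_IOC_NR(SKETCH_BOGUS_CMD) == 0xFF);
	assert(_IOC_SIZE(SKETCH_BOGUS_CMD) == 4);
}

/*
 * Issue the bogus command on path.  Returns the errno that ioctl()
 * reported, 0 if the call did not fail, or -1 if path could not be
 * opened.
 */
int sketch_bogus_ioctl_errno(const char *path)
{
	int fd, ret, err;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	errno = 0;
	ret = ioctl(fd, SKETCH_BOGUS_CMD, NULL);
	err = errno;	/* save before close() can overwrite it */
	close(fd);

	/* Failure shows up as -1 plus errno, not as a negative errno. */
	return ret == -1 ? err : 0;
}
```

Calling sketch_bogus_ioctl_errno("/dev/null") should yield ENOTTY, since
no handler claims the command.
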
 .../selftests/drivers/crypto/cmh/Makefile     |   6 +
 .../drivers/crypto/cmh/cmh_mgmt_test.c        | 183 ++++++++++++++++++
 .../selftests/drivers/crypto/cmh/config       |   1 +
 3 files changed, 190 insertions(+)
 create mode 100644 tools/testing/selftests/drivers/crypto/cmh/Makefile
 create mode 100644 tools/testing/selftests/drivers/crypto/cmh/cmh_mgmt_test.c
 create mode 100644 tools/testing/selftests/drivers/crypto/cmh/config

diff --git a/tools/testing/selftests/drivers/crypto/cmh/Makefile b/tools/testing/selftests/drivers/crypto/cmh/Makefile
new file mode 100644
index 000000000000..86cb63839b27
--- /dev/null
+++ b/tools/testing/selftests/drivers/crypto/cmh/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+TEST_GEN_PROGS := cmh_mgmt_test
+
+CFLAGS += -Wall -Wno-misleading-indentation -O2 $(KHDR_INCLUDES)
+
+include ../../../lib.mk
diff --git a/tools/testing/selftests/drivers/crypto/cmh/cmh_mgmt_test.c b/tools/testing/selftests/drivers/crypto/cmh/cmh_mgmt_test.c
new file mode 100644
index 000000000000..4514b5a1349a
--- /dev/null
+++ b/tools/testing/selftests/drivers/crypto/cmh/cmh_mgmt_test.c
@@ -0,0 +1,183 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Kselftest for /dev/cmh_mgmt ioctl interface.
+ *
+ * Tests basic ioctl operations on the CRI CryptoManager Hub management
+ * device.  Requires the cmh module loaded on real or emulated hardware.
+ *
+ * Run:  ./cmh_mgmt_test
+ * Output: TAP format (compatible with kselftest harness)
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+
+#include "kselftest_harness.h"
+#include <linux/cmh_mgmt_ioctl.h>
+
+#define CMH_DEV "/dev/cmh_mgmt"
+
+FIXTURE(cmh_mgmt)
+{
+       int fd;
+};
+
+FIXTURE_SETUP(cmh_mgmt)
+{
+       self->fd = open(CMH_DEV, O_RDWR);
+       if (self->fd < 0 && errno == ENOENT)
+               SKIP(return, "Device " CMH_DEV " not present (module not loaded?)");
+       if (self->fd < 0 && errno == EACCES)
+               SKIP(return, "Permission denied -- run as root or with CAP_SYS_ADMIN");
+       ASSERT_GE(self->fd, 0);
+}
+
+FIXTURE_TEARDOWN(cmh_mgmt)
+{
+       if (self->fd >= 0)
+               close(self->fd);
+}
+
+/*
+ * Test 1: open and close succeed.
+ * If we get here, FIXTURE_SETUP already validated the open.
+ */
+TEST_F(cmh_mgmt, open_close)
+{
+       ASSERT_GE(self->fd, 0);
+}
+
+/*
+ * Test 2: an invalid ioctl number fails with errno ENOTTY.
+ */
+TEST_F(cmh_mgmt, invalid_ioctl)
+{
+       int ret;
+       unsigned long bogus_cmd = _IOC(_IOC_READ, 'J', 0xFF, 4);
+
+       ret = ioctl(self->fd, bogus_cmd, NULL);
+       ASSERT_EQ(ret, -1);
+       ASSERT_EQ(errno, ENOTTY);
+}
+
+/*
+ * Test 3: KEY_NEW with a bad version field fails with errno EINVAL.
+ */
+TEST_F(cmh_mgmt, bad_version)
+{
+       struct cmh_ioctl_key_new req;
+       int ret;
+
+       memset(&req, 0, sizeof(req));
+       req.version = 0; /* invalid */
+       req.ds_type = CMH_DS_AES_KEY;
+       req.len = 32;
+       req.flags = CMH_FLAG_PT;
+       req.cid = 0xDEAD;
+
+       ret = ioctl(self->fd, CMH_IOCTL_KEY_NEW, &req);
+       ASSERT_EQ(ret, -1);
+       ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test 4: KEY_NEW creates a key, KEY_DELETE destroys it.
+ */
+TEST_F(cmh_mgmt, key_new_delete)
+{
+       struct cmh_ioctl_key_new new_req;
+       struct cmh_ioctl_key_grant del_req;
+       int ret;
+
+       memset(&new_req, 0, sizeof(new_req));
+       new_req.version = CMH_MGMT_V1;
+       new_req.ds_type = CMH_DS_AES_KEY;
+       new_req.len = 32;
+       new_req.flags = CMH_FLAG_PT;
+       new_req.cid = 0x5E1F7E57ULL; /* "SELFTEST" */
+
+       ret = ioctl(self->fd, CMH_IOCTL_KEY_NEW, &new_req);
+       ASSERT_EQ(ret, 0);
+       ASSERT_NE(new_req.ref, (uint64_t)0);
+
+       /* Delete the key */
+       memset(&del_req, 0, sizeof(del_req));
+       del_req.version = CMH_MGMT_V1;
+       del_req.ref = new_req.ref;
+
+       ret = ioctl(self->fd, CMH_IOCTL_KEY_DELETE, &del_req);
+       ASSERT_EQ(ret, 0);
+}
+
+/*
+ * Test 5: KIC HKDF1 key derivation from hardware base key.
+ * Requires at least one KIC base key provisioned (KIC_KEY1).
+ */
+TEST_F(cmh_mgmt, kic_hkdf1)
+{
+       struct cmh_ioctl_kic_hkdf1 req;
+       static const char label[] = "kselftest-label";
+       int ret;
+
+       memset(&req, 0, sizeof(req));
+       req.version = CMH_MGMT_V1;
+       req.key_len = 32;
+       req.base_key = CMH_KIC_KEY1;
+       req.cid = 0x4B534C46ULL; /* "KSLF" */
+       req.label = (uint64_t)(uintptr_t)label;
+       req.label_len = sizeof(label) - 1;
+       req.flags = CMH_KIC_FLAG_TEMP;
+
+       ret = ioctl(self->fd, CMH_IOCTL_KIC_HKDF1, &req);
+       if (ret < 0 && errno == EIO)
+               SKIP(return, "KIC base key 1 not provisioned on this device");
+       ASSERT_EQ(ret, 0);
+       ASSERT_NE(req.ref, (uint64_t)0);
+}
+
+/*
+ * Test 6: ML-KEM-768 keygen using hardware RNG.
+ * Verifies the PQC keygen path end-to-end.
+ */
+TEST_F(cmh_mgmt, ml_kem_keygen)
+{
+       struct cmh_ioctl_ml_kem_keygen req;
+       /* ML-KEM-768: ek = 384*3+32 = 1184, dk = 768*3+96 = 2400 */
+       uint8_t ek[1184];
+       uint8_t dk[2400];
+       int ret;
+
+       memset(&req, 0, sizeof(req));
+       req.version = CMH_MGMT_V1;
+       req.k = 3; /* ML-KEM-768 */
+       req.flags = CMH_QSE_FLAG_HW_RNG;
+       req.seed = 0; /* HW RNG */
+       req.z = 0;    /* HW RNG */
+       req.ek = (uint64_t)(uintptr_t)ek;
+       req.dk = (uint64_t)(uintptr_t)dk;
+       req.dk_cid = 0;
+       req.dk_ref = 0;
+
+       memset(ek, 0, sizeof(ek));
+       memset(dk, 0, sizeof(dk));
+
+       ret = ioctl(self->fd, CMH_IOCTL_ML_KEM_KEYGEN, &req);
+       if (ret < 0 && errno == ENODEV)
+               SKIP(return, "QSE core not available on this hardware");
+       ASSERT_EQ(ret, 0);
+
+       /* Sanity check: a real key will not have 64 leading zero bytes in ek */
+       {
+               int i, nonzero = 0;
+
+               for (i = 0; i < 64; i++)
+                       nonzero += (ek[i] != 0);
+               ASSERT_GT(nonzero, 0);
+       }
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/drivers/crypto/cmh/config b/tools/testing/selftests/drivers/crypto/cmh/config
new file mode 100644
index 000000000000..063c1dd0e23b
--- /dev/null
+++ b/tools/testing/selftests/drivers/crypto/cmh/config
@@ -0,0 +1 @@
+CONFIG_CRYPTO_DEV_CMH=m
--
2.43.7
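
Note on the buffer sizes in the ml_kem_keygen test: the comment there
computes ek = 384*3+32 and dk = 768*3+96 for ML-KEM-768. The same FIPS 203
formulas generalize over the module rank k, as the sketch below shows. The
helper names ml_kem_ek_len() and ml_kem_dk_len() are hypothetical and are
not part of the cmh UAPI.

```c
#include <assert.h>
#include <stddef.h>

/* Encapsulation key: k packed polynomials (384 bytes each) + 32-byte seed. */
size_t ml_kem_ek_len(unsigned int k)
{
	assert(k >= 2 && k <= 4);	/* ML-KEM-512, -768, -1024 */
	return 384u * k + 32;
}

/* Decapsulation key: dk_pke (384k) || ek (384k + 32) || H(ek) (32) || z (32). */
size_t ml_kem_dk_len(unsigned int k)
{
	assert(k >= 2 && k <= 4);
	return 768u * k + 96;
}
```

For k = 3 this gives 1184 and 2400 bytes, matching the ek[] and dk[] arrays
in the test. k = 2 gives 800 and 1632, and k = 4 gives 1568 and 3168.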




* [PATCH 14/19] crypto: cmh - add ECDH/X25519 kpp
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC (permalink / raw)
  To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-1-skrishnamoorthy@rambus.com>

From: Alex Ousherovitch <aousherovitch@rambus.com>

Register ECDH and X25519 kpp algorithms using the CMH PKE core.
Supports P-256, P-384, and Curve25519 for key agreement.

Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
Reviewed-by: Thi Nguyen <thin@rambus.com>
---
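Note on the NIST set_secret format: the buffer is a kpp_secret header, a
u16 key_size, then the key bytes. The userspace sketch below mirrors the
length checks in cmh_ecdh_set_secret_nist() so they can be tried outside
the kernel. struct sketch_kpp_secret is a local stand-in for the kernel's
struct kpp_secret, the value 2 for the ECDH secret type is assumed from
the enum order in crypto/kpp.h, and sketch_decode_ecdh_secret() is a
hypothetical helper rather than driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for struct kpp_secret: type, then total encoded length. */
struct sketch_kpp_secret {
	unsigned short type;
	unsigned short len;
};

#define SKETCH_SECRET_TYPE_ECDH	2	/* assumed CRYPTO_KPP_SECRET_TYPE_ECDH */
#define SKETCH_MIN_SIZE \
	(sizeof(struct sketch_kpp_secret) + sizeof(unsigned short))

/*
 * Parse header + u16 key_size + key bytes for a curve whose coordinate
 * length is clen.  On success returns 0 and sets *key and *key_size;
 * key_size 0 means "generate a key" and leaves *key NULL.
 */
int sketch_decode_ecdh_secret(const void *buf, size_t len, size_t clen,
			      const unsigned char **key,
			      unsigned short *key_size)
{
	const unsigned char *ptr = buf;
	struct sketch_kpp_secret secret;

	assert(key && key_size);
	if (!buf || len < SKETCH_MIN_SIZE)
		return -EINVAL;

	/* memcpy rather than a cast: buf may be unaligned. */
	memcpy(&secret, ptr, sizeof(secret));
	ptr += sizeof(secret);
	if (secret.type != SKETCH_SECRET_TYPE_ECDH || len < secret.len)
		return -EINVAL;

	memcpy(key_size, ptr, sizeof(*key_size));
	ptr += sizeof(*key_size);

	/* A non-empty key must be exactly one coordinate long. */
	if (*key_size != 0 && *key_size != clen)
		return -EINVAL;
	/* The declared total must equal header + size field + key bytes. */
	if (secret.len != SKETCH_MIN_SIZE + *key_size)
		return -EINVAL;

	*key = *key_size ? ptr : NULL;
	return 0;
}
```

With a 32-byte P-256 scalar the encoded buffer is 38 bytes long: 4 bytes of
header, 2 bytes of key_size, and 32 bytes of key.
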
 drivers/crypto/cmh/Makefile       |   3 +-
 drivers/crypto/cmh/cmh_main.c     |   8 +
 drivers/crypto/cmh/cmh_pke_ecdh.c | 698 ++++++++++++++++++++++++++++++
 3 files changed, 708 insertions(+), 1 deletion(-)
 create mode 100644 drivers/crypto/cmh/cmh_pke_ecdh.c
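
Note on the key_size == 0 path in set_secret: the generated scalar comes
back from the ECC library as an array of u64 digits, least significant
digit first, and is converted to a big-endian byte string before being
stored. The sketch below is a portable, byte-wise illustration of that
conversion. sketch_vli_to_be_bytes() is a hypothetical name, and this is
not the kernel's ecc_swap_digits() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert an integer stored as 64-bit digits, least significant digit
 * first, into a big-endian byte string.  out must hold ndigits * 8 bytes.
 */
void sketch_vli_to_be_bytes(const uint64_t *vli, size_t ndigits, uint8_t *out)
{
	size_t i, b;

	assert(vli && out);
	for (i = 0; i < ndigits; i++) {
		/* The most significant digit goes first in the output. */
		uint64_t d = vli[ndigits - 1 - i];

		/* Emit each digit most significant byte first. */
		for (b = 0; b < 8; b++)
			out[i * 8 + b] = (uint8_t)(d >> (56 - 8 * b));
	}
}
```

For P-256 the driver's ndigits is 4 (clen 32 / 8), so the output is a
32-byte big-endian scalar.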

diff --git a/drivers/crypto/cmh/Makefile b/drivers/crypto/cmh/Makefile
index fdbf66b13628..a4cea0a56fc1 100644
--- a/drivers/crypto/cmh/Makefile
+++ b/drivers/crypto/cmh/Makefile
@@ -32,7 +32,8 @@ cmh-y := \
        cmh_rng.o \
        cmh_pke_common.o \
        cmh_pke_rsa.o \
-       cmh_pke_ecdsa.o
+       cmh_pke_ecdsa.o \
+       cmh_pke_ecdh.o

 # Management ioctl device (/dev/cmh_mgmt): key lifecycle, PKE, PQC ioctls.
 cmh-$(CONFIG_CRYPTO_DEV_CMH_MGMT) += \
diff --git a/drivers/crypto/cmh/cmh_main.c b/drivers/crypto/cmh/cmh_main.c
index 939ff5007755..ea0f32b941f5 100644
--- a/drivers/crypto/cmh/cmh_main.c
+++ b/drivers/crypto/cmh/cmh_main.c
@@ -286,6 +286,11 @@ static int cmh_probe(struct platform_device *pdev)
        if (ret)
                goto err_pke_ecdsa_register;

+       /* Register PKE ECDH/X25519 kpp */
+       ret = cmh_pke_ecdh_register();
+       if (ret)
+               goto err_pke_ecdh_register;
+
        /* Register key management device (/dev/cmh_mgmt) */
        ret = cmh_mgmt_register();
        if (ret)
@@ -298,6 +303,8 @@ static int cmh_probe(struct platform_device *pdev)
        return 0;

 err_mgmt_register:
+       cmh_pke_ecdh_unregister();
+err_pke_ecdh_register:
        cmh_pke_ecdsa_unregister();
 err_pke_ecdsa_register:
        cmh_pke_rsa_unregister();
@@ -358,6 +365,7 @@ static void cmh_remove(struct platform_device *pdev)
        cfg = &dev->config;

        cmh_mgmt_unregister();
+       cmh_pke_ecdh_unregister();
        cmh_pke_ecdsa_unregister();
        cmh_pke_rsa_unregister();
        cmh_ccp_poly_unregister();
diff --git a/drivers/crypto/cmh/cmh_pke_ecdh.c b/drivers/crypto/cmh/cmh_pke_ecdh.c
new file mode 100644
index 000000000000..d8b821cc4217
--- /dev/null
+++ b/drivers/crypto/cmh/cmh_pke_ecdh.c
@@ -0,0 +1,698 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Cryptography Research, Inc. (CRI).
+ * CMH LKM -- ECDH / X25519 kpp Driver
+ *
+ * Registers "ecdh-nist-p256", "ecdh-nist-p384", and "curve25519"
+ * kpp algorithms with priority 300.
+ *
+ * - set_secret: decodes private key from kpp_secret + ecdh struct
+ *   (NIST curves) or raw 32-byte scalar (Curve25519).
+ *   Stores in cmh_key_ctx: raw keys written via SYS_REF_TEMP.
+ *   Datastore-referenced keys are only reachable through the ioctl
+ *   path (cmh_mgmt.c).
+ *
+ * - generate_public_key: PKE_CMD_ECDH_KEYGEN -> outputs X coordinate
+ *   (NIST Weierstrass) or full public key (Edwards/Montgomery).
+ *   For NIST curves, we generate X||Y by calling ECDSA_PUBGEN instead,
+ *   matching the kernel ecdh.c pattern that outputs uncompressed X||Y.
+ *
+ * - compute_shared_secret: PKE_CMD_ECDH -> shared secret X coordinate.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/scatterlist.h>
+#include <crypto/kpp.h>
+#include <crypto/ecdh.h>
+#include <crypto/internal/kpp.h>
+#include <crypto/internal/ecc.h>
+
+#include "cmh_pke.h"
+#include "cmh_sys.h"
+#include "cmh_sys_abi.h"
+#include "cmh_txn.h"
+#include "cmh_dma.h"
+#include "cmh_key.h"
+
+/*
+ * ECDH key format: kpp_secret header + key_size(u16) + key data.
+ * We decode this inline to avoid depending on CONFIG_CRYPTO_ECDH.
+ */
+#define ECDH_KPP_SECRET_MIN_SIZE (sizeof(struct kpp_secret) + sizeof(unsigned short))
+
+struct cmh_ecdh_tfm_ctx {
+       struct cmh_key_ctx key;
+       u32 curve;              /* PKE_CURVE_* */
+       u32 clen;               /* coordinate length in bytes */
+};
+
+static inline struct cmh_ecdh_tfm_ctx *cmh_ecdh_ctx(struct crypto_kpp *tfm)
+{
+       return kpp_tfm_ctx(tfm);
+}
+
+/*
+ * Per-request context for ECDH/X25519 operations.
+ *
+ * generate_public_key: single-phase async VCQ.
+ * compute_shared_secret: 2-phase async VCQ with callback chaining.
+ *   Phase 1: sys_write(sk) + sys_new(ref) + ecdh(peer) + pflush
+ *            -> phase1 callback reads ref, submits Phase 2.
+ *   Phase 2: sys_data(ref, ss_dma) + sys_flush
+ *            -> phase2 callback extracts shared secret, completes req.
+ *
+ * Both phases target the same mbx_idx so the DS reference remains
+ * valid, since DS objects are MBX-scoped.
+ */
+struct cmh_ecdh_reqctx {
+       /* Buffers */
+       u8 *pk_buf;             /* keygen: output public key */
+       u8 *sk_buf;             /* private key copy */
+       u8 *peer_buf;           /* compute: peer public key */
+       u8 *ss_buf;             /* compute: shared secret output */
+       u64 *ref_buf;           /* compute: DS ref from Phase 1 */
+       /* DMA handles */
+       dma_addr_t pk_dma;
+       dma_addr_t sk_dma;
+       dma_addr_t peer_dma;
+       dma_addr_t ss_dma;
+       dma_addr_t ref_dma;
+       /* Sizes and params for Phase 2 re-submit */
+       u32 out_len;            /* keygen: public key size */
+       u32 clen;
+       u32 peer_len;
+       u32 sk_len;
+       u32 dma_swap;
+       int mbx_idx;            /* pinned MBX for Phase 2 */
+};
+
+/*
+ * set_secret: NIST curves decode kpp_secret + u16 key_size + raw scalar.
+ * Curve25519 uses raw 32-byte scalar directly.
+ */
+static int cmh_ecdh_set_secret_nist(struct crypto_kpp *tfm,
+                                   const void *buf, unsigned int len)
+{
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+       const u8 *ptr = buf;
+       struct kpp_secret secret;
+       unsigned short key_size;
+       int ret;
+
+       if (!buf || len < ECDH_KPP_SECRET_MIN_SIZE)
+               return -EINVAL;
+
+       memcpy(&secret, ptr, sizeof(secret));
+       ptr += sizeof(secret);
+
+       if (secret.type != CRYPTO_KPP_SECRET_TYPE_ECDH)
+               return -EINVAL;
+       if (len < secret.len)
+               return -EINVAL;
+
+       memcpy(&key_size, ptr, sizeof(key_size));
+       ptr += sizeof(key_size);
+
+       if (key_size == 0) {
+               /*
+                * key_size == 0: generate a validated random private key.
+                * Uses the kernel ECC library (FIPS 186-5 A.2.2) to ensure
+                * the scalar is in the valid range [2, n-3] for the curve.
+                */
+               u64 priv[ECC_MAX_DIGITS];
+               unsigned int ndigits = ctx->clen / sizeof(u64);
+               unsigned int curve_id;
+               u8 *rnd;
+
+               if (secret.len != ECDH_KPP_SECRET_MIN_SIZE)
+                       return -EINVAL;
+               if (ndigits > ECC_MAX_DIGITS)
+                       return -EINVAL;
+               /* Reject non-limb-aligned clen to prevent ndigits truncation */
+               if (ctx->clen % sizeof(u64))
+                       return -EINVAL;
+
+               if (ctx->curve == PKE_CURVE_P256)
+                       curve_id = ECC_CURVE_NIST_P256;
+               else if (ctx->curve == PKE_CURVE_P384)
+                       curve_id = ECC_CURVE_NIST_P384;
+               else
+                       return -EINVAL;
+
+               ret = ecc_gen_privkey(curve_id, ndigits, priv);
+               if (ret) {
+                       memzero_explicit(priv, sizeof(priv));
+                       return ret;
+               }
+
+               rnd = kmalloc(ctx->clen, GFP_KERNEL);
+               if (!rnd) {
+                       memzero_explicit(priv, sizeof(priv));
+                       return -ENOMEM;
+               }
+
+               /* Convert VLI (native LE-digit-order) to big-endian bytes */
+               ecc_swap_digits(priv, (u64 *)rnd, ndigits);
+               memzero_explicit(priv, sizeof(priv));
+
+               ret = cmh_key_setkey_raw(&ctx->key, rnd, ctx->clen,
+                                        CORE_ID_PKE);
+               kfree_sensitive(rnd);
+               return ret;
+       }
+
+       if (key_size != ctx->clen)
+               return -EINVAL;
+
+       if (secret.len != ECDH_KPP_SECRET_MIN_SIZE + key_size)
+               return -EINVAL;
+
+       return cmh_key_setkey_raw(&ctx->key, ptr, key_size, CORE_ID_PKE);
+}
+
+static int cmh_ecdh_set_secret_x25519(struct crypto_kpp *tfm,
+                                     const void *buf, unsigned int len)
+{
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+
+       if (len != pke_curve_clen(PKE_CURVE_25519))
+               return -EINVAL;
+
+       return cmh_key_setkey_raw(&ctx->key, buf, len, CORE_ID_PKE);
+}
+
+static void cmh_ecdh_keygen_complete(void *data, int error)
+{
+       struct kpp_request *req = data;
+       struct cmh_ecdh_reqctx *rctx = kpp_request_ctx(req);
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       if (!cmh_dma_map_error(rctx->sk_dma))
+               cmh_dma_unmap_single(rctx->sk_dma, rctx->sk_len,
+                                    DMA_TO_DEVICE);
+       if (!cmh_dma_map_error(rctx->pk_dma))
+               cmh_dma_unmap_single(rctx->pk_dma, rctx->out_len,
+                                    DMA_FROM_DEVICE);
+
+       if (!error) {
+               int nents;
+
+               nents = sg_nents_for_len(req->dst, rctx->out_len);
+               if (nents < 0 ||
+                   sg_copy_from_buffer(req->dst, nents,
+                                       rctx->pk_buf,
+                                       rctx->out_len) != rctx->out_len)
+                       error = -EINVAL;
+               else
+                       req->dst_len = rctx->out_len;
+       }
+
+       kfree_sensitive(rctx->sk_buf);
+       rctx->sk_buf = NULL;
+       kfree(rctx->pk_buf);
+       rctx->pk_buf = NULL;
+       cmh_complete(&req->base, error);
+}
+
+/*
+ * generate_public_key: For NIST curves, ECDH_KEYGEN outputs only the
+ * public key X-coordinate, but the kernel kpp interface expects
+ * uncompressed X||Y, so we use ECDSA_PUBGEN instead, which gives (X,Y).
+ * For Curve25519, ECDH_KEYGEN gives us the Montgomery u-coordinate,
+ * which is the full public key.
+ */
+static int cmh_ecdh_generate_public_key(struct kpp_request *req)
+{
+       struct crypto_kpp *tfm = crypto_kpp_reqtfm(req);
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+       struct cmh_ecdh_reqctx *rctx = kpp_request_ctx(req);
+       u32 clen = ctx->clen;
+       bool is_25519 = (ctx->curve == PKE_CURVE_25519);
+       u32 out_len = is_25519 ? clen : 2 * clen;
+       struct vcq_cmd vcq[PKE_VCQ_CMDS_MAX];
+       struct core_dispatch dd;
+       u32 swap, dma_swap;
+       int ret, idx;
+       gfp_t gfp;
+
+       if (ctx->key.mode != CMH_KEY_RAW)
+               return -EINVAL;
+       if (req->dst_len < out_len)
+               return -EINVAL;
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       memset(rctx, 0, sizeof(*rctx));
+       rctx->out_len = out_len;
+       rctx->sk_len = ctx->key.raw.len;
+       rctx->pk_dma = DMA_MAPPING_ERROR;
+       rctx->sk_dma = DMA_MAPPING_ERROR;
+
+       rctx->pk_buf = kzalloc(out_len, gfp);
+       if (!rctx->pk_buf)
+               return -ENOMEM;
+
+       rctx->pk_dma = cmh_dma_map_single(rctx->pk_buf, out_len,
+                                         DMA_FROM_DEVICE);
+       if (cmh_dma_map_error(rctx->pk_dma)) {
+               ret = -ENOMEM;
+               goto out_free;
+       }
+
+       swap = PKE_SWAP_FLAGS;
+       dma_swap = pke_swap_flags(ctx->curve);
+
+       dd = cmh_core_select_instance(CMH_CORE_PKE);
+
+       rctx->sk_buf = kmemdup(ctx->key.raw.data, ctx->key.raw.len, gfp);
+       if (!rctx->sk_buf) {
+               ret = -ENOMEM;
+               goto out_unmap;
+       }
+       rctx->sk_dma = cmh_dma_map_single(rctx->sk_buf, ctx->key.raw.len,
+                                         DMA_TO_DEVICE);
+       if (cmh_dma_map_error(rctx->sk_dma)) {
+               ret = -ENOMEM;
+               goto out_unmap;
+       }
+
+       vcq_set_header(&vcq[0], PKE_VCQ_CMDS_MAX);
+       idx = 1;
+       vcq_add_sys_write(&vcq[idx], SYS_REF_TEMP, rctx->sk_dma,
+                         SYS_REF_NONE, ctx->key.raw.len,
+                         ctx->key.raw.sys_type);
+       vcq[idx].id |= dma_swap;
+       idx++;
+       if (is_25519)
+               vcq_add_pke_ecdh_keygen(&vcq[idx++], dd.core_id, ctx->curve,
+                                       clen, rctx->pk_dma, SYS_REF_TEMP,
+                                       swap);
+       else
+               vcq_add_pke_ecdsa_pubgen(&vcq[idx++], dd.core_id,
+                                        ctx->curve, clen, rctx->pk_dma,
+                                        SYS_REF_TEMP, swap);
+       vcq_add_pke_flush(&vcq[idx++], dd.core_id);
+
+       ret = cmh_tm_submit_async(vcq, PKE_VCQ_CMDS_MAX, 1, dd.mbx_idx,
+                                 cmh_ecdh_keygen_complete, req,
+                                 !!(req->base.flags &
+                                    CRYPTO_TFM_REQ_MAY_BACKLOG), 0);
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (!ret)
+               return -EINPROGRESS;
+
+out_unmap:
+       if (!cmh_dma_map_error(rctx->sk_dma))
+               cmh_dma_unmap_single(rctx->sk_dma, ctx->key.raw.len,
+                                    DMA_TO_DEVICE);
+       if (!cmh_dma_map_error(rctx->pk_dma))
+               cmh_dma_unmap_single(rctx->pk_dma, out_len,
+                                    DMA_FROM_DEVICE);
+
+out_free:
+       kfree_sensitive(rctx->sk_buf);
+       kfree(rctx->pk_buf);
+       return ret;
+}
+
+static void cmh_ecdh_ss_phase2_complete(void *data, int error)
+{
+       struct kpp_request *req = data;
+       struct cmh_ecdh_reqctx *rctx = kpp_request_ctx(req);
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       if (!cmh_dma_map_error(rctx->ss_dma))
+               cmh_dma_unmap_single(rctx->ss_dma, rctx->clen,
+                                    DMA_FROM_DEVICE);
+
+       if (!error) {
+               int nents;
+
+               nents = sg_nents_for_len(req->dst, rctx->clen);
+               if (nents < 0 ||
+                   sg_copy_from_buffer(req->dst, nents,
+                                       rctx->ss_buf,
+                                       rctx->clen) != rctx->clen)
+                       error = -EINVAL;
+               else
+                       req->dst_len = rctx->clen;
+       }
+
+       kfree(rctx->ref_buf);
+       rctx->ref_buf = NULL;
+       kfree_sensitive(rctx->ss_buf);
+       rctx->ss_buf = NULL;
+       cmh_complete(&req->base, error);
+}
+
+static void cmh_ecdh_ss_phase1_complete(void *data, int error)
+{
+       struct kpp_request *req = data;
+       struct cmh_ecdh_reqctx *rctx = kpp_request_ctx(req);
+       struct vcq_cmd vcq[3];
+       int ret;
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       /* Phase 1-only resources: sk, peer -- always clean up */
+       if (!cmh_dma_map_error(rctx->sk_dma))
+               cmh_dma_unmap_single(rctx->sk_dma, rctx->sk_len,
+                                    DMA_TO_DEVICE);
+       kfree_sensitive(rctx->sk_buf);
+       rctx->sk_buf = NULL;
+
+       if (!cmh_dma_map_error(rctx->peer_dma))
+               cmh_dma_unmap_single(rctx->peer_dma, rctx->peer_len,
+                                    DMA_TO_DEVICE);
+       kfree(rctx->peer_buf);
+       rctx->peer_buf = NULL;
+
+       if (error)
+               goto out_cleanup;
+
+       /* Read the DS reference written by Phase 1 */
+       cmh_dma_sync_for_cpu(rctx->ref_dma, sizeof(u64), DMA_FROM_DEVICE);
+       cmh_dma_unmap_single(rctx->ref_dma, sizeof(u64), DMA_FROM_DEVICE);
+       rctx->ref_dma = DMA_MAPPING_ERROR;
+
+       /* Phase 2: extract shared secret from DS */
+       vcq_set_header(&vcq[0], 3);
+       vcq_add_sys_data(&vcq[1], *rctx->ref_buf, rctx->ss_dma,
+                        rctx->clen);
+       vcq[1].id |= rctx->dma_swap;
+       vcq_add_sys_flush(&vcq[2]);
+
+       ret = cmh_tm_submit_async(vcq, 3, 1, rctx->mbx_idx,
+                                 cmh_ecdh_ss_phase2_complete, req,
+                                 true, 0);
+       if (ret == -EBUSY || !ret)
+               return;
+
+       error = ret;
+
+out_cleanup:
+       if (!cmh_dma_map_error(rctx->ref_dma))
+               cmh_dma_unmap_single(rctx->ref_dma, sizeof(u64),
+                                    DMA_FROM_DEVICE);
+       if (!cmh_dma_map_error(rctx->ss_dma))
+               cmh_dma_unmap_single(rctx->ss_dma, rctx->clen,
+                                    DMA_FROM_DEVICE);
+       kfree(rctx->ref_buf);
+       rctx->ref_buf = NULL;
+       kfree_sensitive(rctx->ss_buf);
+       rctx->ss_buf = NULL;
+       cmh_complete(&req->base, error);
+}
+
+/*
+ * compute_shared_secret: PKE_CMD_ECDH.
+ *
+ * req->src = peer public key (X||Y for NIST, raw 32B for Curve25519).
+ * Output = shared secret X coordinate (clen bytes).
+ *
+ * The CMH ECDH command stores the shared secret in a DS object,
+ * not directly to DMA.  We create a DS slot with SYS_CMD_NEW,
+ * reference it via SYS_REF_LAST, then extract the result with a
+ * second VCQ submission using SYS_CMD_DATA with the actual ref.
+ */
+static int cmh_ecdh_compute_shared_secret(struct kpp_request *req)
+{
+       struct crypto_kpp *tfm = crypto_kpp_reqtfm(req);
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+       struct cmh_ecdh_reqctx *rctx = kpp_request_ctx(req);
+       u32 clen = ctx->clen;
+       bool is_25519 = (ctx->curve == PKE_CURVE_25519);
+       u32 peer_len = is_25519 ? clen : 2 * clen;
+       u32 ss_type = SYS_TYPE_SET(SYS_TYPE_FLAG_PT, CORE_ID_PKE);
+       struct vcq_cmd vcq[5];
+       struct core_dispatch dd;
+       u32 swap, dma_swap;
+       int ret, idx, nents;
+       gfp_t gfp;
+
+       if (ctx->key.mode != CMH_KEY_RAW)
+               return -EINVAL;
+       if (req->src_len < peer_len || req->dst_len < clen)
+               return -EINVAL;
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       memset(rctx, 0, sizeof(*rctx));
+       rctx->clen = clen;
+       rctx->peer_len = peer_len;
+       rctx->sk_len = ctx->key.raw.len;
+       rctx->pk_dma = DMA_MAPPING_ERROR;
+       rctx->sk_dma = DMA_MAPPING_ERROR;
+       rctx->peer_dma = DMA_MAPPING_ERROR;
+       rctx->ss_dma = DMA_MAPPING_ERROR;
+       rctx->ref_dma = DMA_MAPPING_ERROR;
+
+       rctx->peer_buf = kmalloc(peer_len, gfp);
+       rctx->ss_buf = kzalloc(clen, gfp);
+       rctx->ref_buf = kzalloc_obj(u64, gfp);
+       if (!rctx->peer_buf || !rctx->ss_buf || !rctx->ref_buf) {
+               ret = -ENOMEM;
+               goto out_free;
+       }
+
+       nents = sg_nents_for_len(req->src, peer_len);
+       if (nents < 0 ||
+           sg_pcopy_to_buffer(req->src, nents, rctx->peer_buf,
+                              peer_len, 0) != peer_len) {
+               ret = -EINVAL;
+               goto out_free;
+       }
+
+       rctx->peer_dma = cmh_dma_map_single(rctx->peer_buf, peer_len,
+                                           DMA_TO_DEVICE);
+       rctx->ss_dma = cmh_dma_map_single(rctx->ss_buf, clen,
+                                         DMA_FROM_DEVICE);
+       rctx->ref_dma = cmh_dma_map_single(rctx->ref_buf, sizeof(u64),
+                                          DMA_FROM_DEVICE);
+
+       if (cmh_dma_map_error(rctx->peer_dma) ||
+           cmh_dma_map_error(rctx->ss_dma) ||
+           cmh_dma_map_error(rctx->ref_dma)) {
+               ret = -ENOMEM;
+               goto out_unmap;
+       }
+
+       swap = PKE_SWAP_FLAGS;
+       dma_swap = pke_swap_flags(ctx->curve);
+       rctx->dma_swap = dma_swap;
+
+       dd = cmh_core_select_instance(CMH_CORE_PKE);
+       rctx->mbx_idx = dd.mbx_idx;
+
+       rctx->sk_buf = kmemdup(ctx->key.raw.data, ctx->key.raw.len, gfp);
+       if (!rctx->sk_buf) {
+               ret = -ENOMEM;
+               goto out_unmap;
+       }
+       rctx->sk_dma = cmh_dma_map_single(rctx->sk_buf, ctx->key.raw.len,
+                                         DMA_TO_DEVICE);
+       if (cmh_dma_map_error(rctx->sk_dma)) {
+               ret = -ENOMEM;
+               goto out_unmap;
+       }
+
+       vcq_set_header(&vcq[0], 5);
+       idx = 1;
+       vcq_add_sys_write(&vcq[idx], SYS_REF_TEMP, rctx->sk_dma,
+                         SYS_REF_NONE, ctx->key.raw.len,
+                         ctx->key.raw.sys_type);
+       vcq[idx].id |= dma_swap;
+       idx++;
+       vcq_add_sys_new(&vcq[idx++], 0, rctx->ref_dma, clen);
+       vcq_add_pke_ecdh(&vcq[idx++], dd.core_id, ctx->curve, clen,
+                        clen, ss_type, rctx->peer_dma,
+                        SYS_REF_TEMP, SYS_REF_LAST, swap);
+       vcq_add_pke_flush(&vcq[idx++], dd.core_id);
+
+       ret = cmh_tm_submit_async(vcq, 5, 1, dd.mbx_idx,
+                                 cmh_ecdh_ss_phase1_complete, req,
+                                 !!(req->base.flags &
+                                    CRYPTO_TFM_REQ_MAY_BACKLOG), 0);
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (!ret)
+               return -EINPROGRESS;
+
+out_unmap:
+       if (!cmh_dma_map_error(rctx->sk_dma))
+               cmh_dma_unmap_single(rctx->sk_dma, rctx->sk_len,
+                                    DMA_TO_DEVICE);
+       if (!cmh_dma_map_error(rctx->ss_dma))
+               cmh_dma_unmap_single(rctx->ss_dma, clen,
+                                    DMA_FROM_DEVICE);
+       if (!cmh_dma_map_error(rctx->ref_dma))
+               cmh_dma_unmap_single(rctx->ref_dma, sizeof(u64),
+                                    DMA_FROM_DEVICE);
+       if (!cmh_dma_map_error(rctx->peer_dma))
+               cmh_dma_unmap_single(rctx->peer_dma, peer_len,
+                                    DMA_TO_DEVICE);
+
+out_free:
+       kfree_sensitive(rctx->sk_buf);
+       kfree(rctx->ref_buf);
+       kfree_sensitive(rctx->ss_buf);
+       kfree(rctx->peer_buf);
+       return ret;
+}
+
+static unsigned int cmh_ecdh_max_size(struct crypto_kpp *tfm)
+{
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+
+       /* Max output = X||Y for generate_public_key (NIST) */
+       return 2 * ctx->clen;
+}
+
+static unsigned int cmh_x25519_max_size(struct crypto_kpp *tfm)
+{
+       return pke_curve_clen(PKE_CURVE_25519); /* single coordinate */
+}
+
+static int cmh_ecdh_p256_init(struct crypto_kpp *tfm)
+{
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+
+       memset(ctx, 0, sizeof(*ctx));
+       ctx->curve = PKE_CURVE_P256;
+       ctx->clen = pke_curve_clen(PKE_CURVE_P256);
+       tfm->reqsize = sizeof(struct cmh_ecdh_reqctx);
+       return 0;
+}
+
+static int cmh_ecdh_p384_init(struct crypto_kpp *tfm)
+{
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+
+       memset(ctx, 0, sizeof(*ctx));
+       ctx->curve = PKE_CURVE_P384;
+       ctx->clen = pke_curve_clen(PKE_CURVE_P384);
+       tfm->reqsize = sizeof(struct cmh_ecdh_reqctx);
+       return 0;
+}
+
+static int cmh_x25519_init(struct crypto_kpp *tfm)
+{
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+
+       memset(ctx, 0, sizeof(*ctx));
+       ctx->curve = PKE_CURVE_25519;
+       ctx->clen = pke_curve_clen(PKE_CURVE_25519);
+       tfm->reqsize = sizeof(struct cmh_ecdh_reqctx);
+       return 0;
+}
+
+static void cmh_ecdh_exit(struct crypto_kpp *tfm)
+{
+       struct cmh_ecdh_tfm_ctx *ctx = cmh_ecdh_ctx(tfm);
+
+       cmh_key_destroy(&ctx->key);
+}
+
+static struct kpp_alg cmh_ecdh_algs[] = {
+       {
+               .set_secret             = cmh_ecdh_set_secret_nist,
+               .generate_public_key    = cmh_ecdh_generate_public_key,
+               .compute_shared_secret  = cmh_ecdh_compute_shared_secret,
+               .max_size               = cmh_ecdh_max_size,
+               .init                   = cmh_ecdh_p256_init,
+               .exit                   = cmh_ecdh_exit,
+               .base = {
+                       .cra_name         = "ecdh-nist-p256",
+                       .cra_driver_name  = "cri-cmh-ecdh-nist-p256",
+                       .cra_priority     = 300,
+                       .cra_flags        = CRYPTO_ALG_ASYNC,
+                       .cra_module       = THIS_MODULE,
+                       .cra_ctxsize      = sizeof(struct cmh_ecdh_tfm_ctx),
+               },
+       },
+       {
+               .set_secret             = cmh_ecdh_set_secret_nist,
+               .generate_public_key    = cmh_ecdh_generate_public_key,
+               .compute_shared_secret  = cmh_ecdh_compute_shared_secret,
+               .max_size               = cmh_ecdh_max_size,
+               .init                   = cmh_ecdh_p384_init,
+               .exit                   = cmh_ecdh_exit,
+               .base = {
+                       .cra_name         = "ecdh-nist-p384",
+                       .cra_driver_name  = "cri-cmh-ecdh-nist-p384",
+                       .cra_priority     = 300,
+                       .cra_flags        = CRYPTO_ALG_ASYNC,
+                       .cra_module       = THIS_MODULE,
+                       .cra_ctxsize      = sizeof(struct cmh_ecdh_tfm_ctx),
+               },
+       },
+       {
+               .set_secret             = cmh_ecdh_set_secret_x25519,
+               .generate_public_key    = cmh_ecdh_generate_public_key,
+               .compute_shared_secret  = cmh_ecdh_compute_shared_secret,
+               .max_size               = cmh_x25519_max_size,
+               .init                   = cmh_x25519_init,
+               .exit                   = cmh_ecdh_exit,
+               .base = {
+                       .cra_name         = "curve25519",
+                       .cra_driver_name  = "cri-cmh-curve25519",
+                       .cra_priority     = 300,
+                       .cra_flags        = CRYPTO_ALG_ASYNC,
+                       .cra_module       = THIS_MODULE,
+                       .cra_ctxsize      = sizeof(struct cmh_ecdh_tfm_ctx),
+               },
+       },
+};
+
+/**
+ * cmh_pke_ecdh_register() - Register ECDH kpp algorithms with the crypto framework
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int cmh_pke_ecdh_register(void)
+{
+       int ret, i;
+
+       for (i = 0; i < ARRAY_SIZE(cmh_ecdh_algs); i++) {
+               ret = crypto_register_kpp(&cmh_ecdh_algs[i]);
+               if (ret) {
+                       dev_err(cmh_dev(), "cmh: failed to register %s (%d)\n",
+                               cmh_ecdh_algs[i].base.cra_name, ret);
+                       goto err_unregister;
+               }
+       }
+
+       return 0;
+
+err_unregister:
+       while (i--)
+               crypto_unregister_kpp(&cmh_ecdh_algs[i]);
+       return ret;
+}
+
+/**
+ * cmh_pke_ecdh_unregister() - Unregister ECDH kpp algorithms from the crypto framework
+ */
+void cmh_pke_ecdh_unregister(void)
+{
+       int i = ARRAY_SIZE(cmh_ecdh_algs);
+
+       while (i--)
+               crypto_unregister_kpp(&cmh_ecdh_algs[i]);
+}
--
2.43.7


** This message and any attachments are for the sole use of the intended recipient(s). It may contain information that is confidential and privileged. If you are not the intended recipient of this message, you are prohibited from printing, copying, forwarding or saving it. Please delete the message and attachments and notify the sender immediately. **

Rambus Inc.<http://www.rambus.com>


* [PATCH 19/19] MAINTAINERS: add Rambus CryptoManager Hub (CMH)
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC (permalink / raw)
  To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-1-skrishnamoorthy@rambus.com>

From: Alex Ousherovitch <aousherovitch@rambus.com>

Add MAINTAINERS entry for the CRI CryptoManager Hub (CMH) hardware
crypto accelerator driver under drivers/crypto/cmh/.

Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
Reviewed-by: Thi Nguyen <thin@rambus.com>
---
 MAINTAINERS | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 90034eb7874e..ecb389795e3d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6797,6 +6797,25 @@ F:       kernel/cred.c
 F:     rust/kernel/cred.rs
 F:     Documentation/security/credentials.rst

+CRI CRYPTOMANAGER HUB (CMH) HARDWARE CRYPTO ACCELERATOR
+M:     Alex Ousherovitch <aousherovitch@rambus.com>
+M:     Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
+R:     Joel Wittenauer <Joel.Wittenauer@cryptography.com>
+R:     Thi Nguyen <thin@rambus.com>
+L:     linux-crypto@vger.kernel.org
+L:     sipsupport@rambus.com (moderated for non-subscribers)
+S:     Maintained
+T:     git https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
+F:     Documentation/ABI/testing/cmh-mgmt
+F:     Documentation/ABI/testing/debugfs-driver-cmh
+F:     Documentation/ABI/testing/sysfs-driver-cmh
+F:     Documentation/crypto/device_drivers/cmh.rst
+F:     Documentation/devicetree/bindings/crypto/cri,cmh.yaml
+F:     Documentation/userspace-api/ioctl/cmh_mgmt.rst
+F:     drivers/crypto/cmh/
+F:     include/uapi/linux/cmh_mgmt_ioctl.h
+F:     tools/testing/selftests/drivers/crypto/cmh/
+
 INTEL CRPS COMMON REDUNDANT PSU DRIVER
 M:     Ninad Palsule <ninad@linux.ibm.com>
 L:     linux-hwmon@vger.kernel.org
--
2.43.7



* [PATCH 08/19] crypto: cmh - add AES skcipher/aead/cmac
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC (permalink / raw)
  To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-1-skrishnamoorthy@rambus.com>

From: Alex Ousherovitch <aousherovitch@rambus.com>

Register AES algorithms using the CMH AES core (core ID 0x03):
- skcipher: AES-ECB, AES-CBC, AES-CTR, AES-XTS, AES-CFB
- aead: AES-GCM, AES-CCM
- ahash: AES-CMAC

Supports 128, 192, and 256-bit keys.  AEAD algorithms handle
associated data, payload, and authentication tag with correct
encrypt/decrypt separation.

Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
Reviewed-by: Thi Nguyen <thin@rambus.com>
---
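Note on the CTR IV update: the CTR branch of cmh_aes_update_iv() in
cmh_aes.c advances req->iv by treating it as a 128-bit big-endian counter
and adding ceil(cryptlen / 16) to it. The following is a minimal userspace
sketch of that carry loop, not the driver code itself. The names
ctr128_add(), ctr128_demo() and AES_BLOCK_BYTES are hypothetical and exist
only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define AES_BLOCK_BYTES 16 /* hypothetical; the patch uses CMH_AES_BLOCK_SIZE */

/*
 * Add nblocks to a 128-bit big-endian counter held in iv[0..15].
 * Starts at the least-significant byte (index 15) and stops as soon as
 * there is nothing left to add or carry. Overflow past byte 0 wraps
 * modulo 2^128.
 */
static void ctr128_add(uint8_t iv[AES_BLOCK_BYTES], uint32_t nblocks)
{
	int i;

	for (i = AES_BLOCK_BYTES - 1; i >= 0 && nblocks; i--) {
		uint32_t sum = (uint32_t)iv[i] + (nblocks & 0xff);

		iv[i] = (uint8_t)sum;
		/* Remaining addend bytes plus the carry out of this byte. */
		nblocks = (nblocks >> 8) + (sum >> 8);
	}
}

/* Self-check: adding 1 to ...00ff must carry into the next byte. */
static void ctr128_demo(void)
{
	uint8_t iv[AES_BLOCK_BYTES] = { 0 };
	uint8_t want[AES_BLOCK_BYTES] = { 0 };

	iv[15] = 0xff;
	want[14] = 0x01;
	ctr128_add(iv, 1);
	assert(memcmp(iv, want, sizeof(iv)) == 0);
}
```

Calling ctr128_demo() from a test harness exercises the single-byte carry
case. In the patch, cryptlen is capped at CMH_AES_MAX_CRYPTLEN (SZ_32M), so
the block count passed to the loop is at most 2^21 and a 32-bit addend is
ample.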
 drivers/crypto/cmh/Makefile          |   5 +-
 drivers/crypto/cmh/cmh_aes.c         | 736 ++++++++++++++++++++
 drivers/crypto/cmh/cmh_aes_aead.c    | 987 +++++++++++++++++++++++++++
 drivers/crypto/cmh/cmh_aes_cmac.c    | 537 +++++++++++++++
 drivers/crypto/cmh/cmh_main.c        |  25 +
 drivers/crypto/cmh/include/cmh_aes.h |  24 +
 6 files changed, 2313 insertions(+), 1 deletion(-)
 create mode 100644 drivers/crypto/cmh/cmh_aes.c
 create mode 100644 drivers/crypto/cmh/cmh_aes_aead.c
 create mode 100644 drivers/crypto/cmh/cmh_aes_cmac.c
 create mode 100644 drivers/crypto/cmh/include/cmh_aes.h

diff --git a/drivers/crypto/cmh/Makefile b/drivers/crypto/cmh/Makefile
index b3018fbcf211..ced8d1748e6c 100644
--- a/drivers/crypto/cmh/Makefile
+++ b/drivers/crypto/cmh/Makefile
@@ -19,7 +19,10 @@ cmh-y := \
        cmh_hmac.o \
        cmh_cshake.o \
        cmh_kmac.o \
-       cmh_sm3.o
+       cmh_sm3.o \
+       cmh_aes.o \
+       cmh_aes_aead.o \
+       cmh_aes_cmac.o

 # Management ioctl device (/dev/cmh_mgmt): key lifecycle, PKE, PQC ioctls.
 cmh-$(CONFIG_CRYPTO_DEV_CMH_MGMT) += \
diff --git a/drivers/crypto/cmh/cmh_aes.c b/drivers/crypto/cmh/cmh_aes.c
new file mode 100644
index 000000000000..b36295763e33
--- /dev/null
+++ b/drivers/crypto/cmh/cmh_aes.c
@@ -0,0 +1,736 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Cryptography Research, Inc. (CRI).
+ * CMH LKM -- Kernel Crypto API AES (skcipher) Driver
+ *
+ * Registers skcipher algorithms with the Linux crypto subsystem:
+ *   ecb(aes), cbc(aes), ctr(aes), cfb(aes), xts(aes)
+ *
+ * Uses the CMH AES Core via VCQ commands:
+ *   [SYS_CMD_WRITE] + AES_CMD_INIT + [AES_CMD_UPDATE] + AES_CMD_FINAL
+ *   + VCQ_CMD_FLUSH
+ *
+ * The AES core requires bidirectional DMA -- both input and output
+ * buffers are mapped and passed in a single AES_CMD_FINAL command.
+ *
+ * Raw-key atomicity: SYS_CMD_WRITE to SYS_REF_TEMP is packed into
+ * the same VCQ as AES commands (see cmh_key.h for details).
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/crypto.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/aes.h>
+#include <crypto/algapi.h>
+#include <crypto/xts.h>
+#include <crypto/scatterwalk.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/unaligned.h>
+
+#include "cmh_aes.h"
+#include "cmh_vcq.h"
+#include "cmh_aes_abi.h"
+#include "cmh_sys_abi.h"
+#include "cmh_sys.h"
+#include "cmh_txn.h"
+#include "cmh_dma.h"
+#include "cmh_key.h"
+
+/* Algorithm Table */
+
+struct cmh_aes_alg_info {
+       u32         aes_mode;   /* AES_MODE_* */
+       u32         ivsize;             /* bytes (0 for ECB) */
+       u32         min_keysize;        /* minimum key bytes */
+       u32         max_keysize;        /* maximum key bytes */
+       const char *alg_name;   /* Linux crypto name, e.g. "ecb(aes)" */
+       const char *drv_name;   /* driver name, e.g. "cri-cmh-ecb-aes" */
+};
+
+static const struct cmh_aes_alg_info aes_algs[] = {
+       { AES_MODE_ECB, 0,                AES_KEYSIZE_128, AES_KEYSIZE_256,
+         "ecb(aes)", "cri-cmh-ecb-aes" },
+       { AES_MODE_CBC, CMH_AES_IV_SIZE,  AES_KEYSIZE_128, AES_KEYSIZE_256,
+         "cbc(aes)", "cri-cmh-cbc-aes" },
+       { AES_MODE_CTR, CMH_AES_IV_SIZE,  AES_KEYSIZE_128, AES_KEYSIZE_256,
+         "ctr(aes)", "cri-cmh-ctr-aes" },
+       { AES_MODE_CFB, CMH_AES_IV_SIZE,  AES_KEYSIZE_128, AES_KEYSIZE_256,
+         "cfb(aes)", "cri-cmh-cfb-aes" },
+       { AES_MODE_XTS, CMH_AES_IV_SIZE,  2 * AES_KEYSIZE_128, 2 * AES_KEYSIZE_256,
+         "xts(aes)", "cri-cmh-xts-aes" },
+};
+
+/* Per-transform context (allocated by crypto framework) */
+
+struct cmh_aes_tfm_ctx {
+       struct cmh_key_ctx key;
+};
+
+/* Per-request context (lives in skcipher_request::__ctx) */
+
+/*
+ * Maximum payload commands:
+ *   [SYS_CMD_WRITE] + AES_CMD_INIT + [AES_CMD_UPDATE] + AES_CMD_FINAL
+ *   + VCQ_CMD_FLUSH = 5
+ * UPDATE is used for XTS data > 2 blocks (see cmh_aes_crypt).
+ */
+#define CMH_AES_MAX_PAYLOAD    5
+#define CMH_AES_MAX_PACKED     (CMH_AES_MAX_PAYLOAD * 2)
+
+struct cmh_aes_reqctx {
+       dma_addr_t in_dma;
+       dma_addr_t out_dma;
+       dma_addr_t iv_dma;
+       dma_addr_t iv2_dma;
+       dma_addr_t key_dma;
+       u8 *in_buf;
+       u8 *out_buf;
+       u8 *iv_buf;
+       u8 *iv2_buf;
+       u32 cryptlen;
+       u32 ivsize;
+       u32 keylen;
+       u32 aes_mode;
+       u32 aes_op;
+       /* CTR counter-wrap split state */
+       u32 ctr_chunk1_len;
+       u32 core_id;
+       s32 target_mbx;
+       u64 key_ref;
+       struct vcq_cmd packed[CMH_AES_MAX_PACKED];
+};
+
+/* VCQ Builders -- AES-specific */
+
+static void vcq_add_aes_init(struct vcq_cmd *slot, u32 core_id, u64 key_ref, u64 iv_dma,
+                            u32 keylen, u32 ivlen, u32 mode, u32 op,
+                            u32 iolen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_INIT);
+       slot->hwc.aes.cmd_init.key = key_ref;
+       slot->hwc.aes.cmd_init.iv = iv_dma;
+       slot->hwc.aes.cmd_init.keylen = keylen;
+       slot->hwc.aes.cmd_init.ivlen = ivlen;
+       slot->hwc.aes.cmd_init.mode = mode;
+       slot->hwc.aes.cmd_init.op = op;
+       slot->hwc.aes.cmd_init.aadlen = 0;
+       slot->hwc.aes.cmd_init.iolen = iolen;
+       slot->hwc.aes.cmd_init.taglen = 0;
+}
+
+static void vcq_add_aes_update(struct vcq_cmd *slot, u32 core_id, u64 input_dma,
+                              u64 output_dma, u32 iolen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_UPDATE);
+       slot->hwc.aes.cmd_update.input = input_dma;
+       slot->hwc.aes.cmd_update.output = output_dma;
+       slot->hwc.aes.cmd_update.iolen = iolen;
+}
+
+static void vcq_add_aes_final(struct vcq_cmd *slot, u32 core_id, u64 input_dma,
+                             u64 output_dma, u32 iolen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_FINAL);
+       slot->hwc.aes.cmd_final.input = input_dma;
+       slot->hwc.aes.cmd_final.output = output_dma;
+       slot->hwc.aes.cmd_final.iolen = iolen;
+       slot->hwc.aes.cmd_final.tag = 0;
+       slot->hwc.aes.cmd_final.taglen = 0;
+}
+
+/*
+ * We wrap each skcipher_alg with its info pointer in a compound struct,
+ * then use container_of() in cmh_aes_get_info() to recover it.
+ * This is the same pattern used by hash, hmac, cshake, kmac.
+ */
+struct cmh_aes_alg_drv {
+       struct skcipher_alg             alg;
+       const struct cmh_aes_alg_info  *info;
+};
+
+static bool aes_is_stream_mode(u32 mode)
+{
+       return mode == AES_MODE_CTR || mode == AES_MODE_CFB;
+}
+
+/*
+ * Update req->iv after a successful encrypt/decrypt.
+ *
+ * The Linux skcipher API contract requires that req->iv is updated to
+ * reflect the state needed to continue processing in a chained call:
+ *   CBC encrypt: IV <- last ciphertext block
+ *   CBC decrypt: IV <- last ciphertext block of the *input*
+ *   CTR:         IV <- counter incremented by ceil(cryptlen / blocksize)
+ *   CFB:         IV <- last ciphertext block
+ */
+static void cmh_aes_update_iv(struct skcipher_request *req, u32 mode,
+                             u32 op, const u8 *in_buf, const u8 *out_buf)
+{
+       u32 bs = CMH_AES_BLOCK_SIZE;
+       u32 nblocks;
+
+       switch (mode) {
+       case AES_MODE_CBC:
+               if (op == AES_OP_ENCRYPT)
+                       memcpy(req->iv, out_buf + req->cryptlen - bs, bs);
+               else
+                       memcpy(req->iv, in_buf + req->cryptlen - bs, bs);
+               break;
+       case AES_MODE_CTR:
+               /*
+                * Arithmetic big-endian 128-bit counter increment.
+                * Process from the least-significant byte (index 15)
+                * upward, carrying as needed.
+                */
+               nblocks = DIV_ROUND_UP(req->cryptlen, bs);
+               {
+                       u8 *iv = req->iv;
+                       int i;
+
+                       for (i = bs - 1; i >= 0 && nblocks; i--) {
+                               u32 sum = (u32)iv[i] + (nblocks & 0xff);
+
+                               iv[i] = (u8)sum;
+                               nblocks = (nblocks >> 8) + (sum >> 8);
+                       }
+               }
+               break;
+       case AES_MODE_CFB:
+               /*
+                * CFB-128 chains on the last ciphertext block.  On encrypt,
+                * that is out_buf; on decrypt, it is in_buf.
+                *
+                * For sub-block requests (cryptlen < 16), there is no
+                * complete ciphertext block to chain, so the IV is left
+                * unchanged -- CFB-128 has no defined chaining semantic
+                * for partial blocks (shift-register CFB-n is a different
+                * mode).  Without this guard the pointer arithmetic
+                * underflows and reads before the buffer.
+                */
+               if (req->cryptlen >= bs) {
+                       if (op == AES_OP_ENCRYPT)
+                               memcpy(req->iv, out_buf + req->cryptlen - bs,
+                                      bs);
+                       else
+                               memcpy(req->iv, in_buf + req->cryptlen - bs,
+                                      bs);
+               }
+               break;
+       default:
+               break;
+       }
+}
+
+/* skcipher Operations */
+
+static const struct cmh_aes_alg_info *
+cmh_aes_get_info(struct crypto_skcipher *tfm)
+{
+       struct skcipher_alg *alg = crypto_skcipher_alg(tfm);
+
+       return container_of(alg, struct cmh_aes_alg_drv, alg)->info;
+}
+
+static int cmh_aes_setkey(struct crypto_skcipher *tfm, const u8 *key,
+                         unsigned int keylen)
+{
+       struct cmh_aes_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
+       const struct cmh_aes_alg_info *info = cmh_aes_get_info(tfm);
+
+       if (info->aes_mode == AES_MODE_XTS) {
+               int err;
+
+               /* XTS: double key (32, 48, or 64 bytes) */
+               if (keylen != 2 * AES_KEYSIZE_128 &&
+                   keylen != 2 * AES_KEYSIZE_192 &&
+                   keylen != 2 * AES_KEYSIZE_256)
+                       return -EINVAL;
+               err = xts_verify_key(tfm, key, keylen);
+               if (err)
+                       return err;
+       } else {
+               /* Standard: 16, 24, or 32 bytes */
+               if (keylen != AES_KEYSIZE_128 &&
+                   keylen != AES_KEYSIZE_192 &&
+                   keylen != AES_KEYSIZE_256)
+                       return -EINVAL;
+       }
+
+       return cmh_key_setkey_raw(&tctx->key, key, keylen, CORE_ID_AES);
+}
+
+static int cmh_aes_init_tfm(struct crypto_skcipher *tfm)
+{
+       struct cmh_aes_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
+
+       memset(tctx, 0, sizeof(*tctx));
+       crypto_skcipher_set_reqsize(tfm, sizeof(struct cmh_aes_reqctx));
+       return 0;
+}
+
+static void cmh_aes_exit_tfm(struct crypto_skcipher *tfm)
+{
+       struct cmh_aes_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
+
+       cmh_key_destroy(&tctx->key);
+}
+
+#define CMH_AES_MAX_CRYPTLEN   SZ_32M
+
+/* DMA unmap helper */
+static void cmh_aes_unmap_dma(struct cmh_aes_reqctx *rctx)
+{
+       if (rctx->iv2_buf)
+               cmh_dma_unmap_single(rctx->iv2_dma, rctx->ivsize,
+                                    DMA_TO_DEVICE);
+       if (rctx->ivsize > 0)
+               cmh_dma_unmap_single(rctx->iv_dma, rctx->ivsize,
+                                    DMA_TO_DEVICE);
+       cmh_dma_unmap_single(rctx->out_dma, rctx->cryptlen, DMA_FROM_DEVICE);
+       cmh_dma_unmap_single(rctx->in_dma, rctx->cryptlen, DMA_TO_DEVICE);
+}
+
+static void cmh_aes_free_bufs(struct cmh_aes_reqctx *rctx)
+{
+       kfree(rctx->iv2_buf);
+       rctx->iv2_buf = NULL;
+       kfree(rctx->iv_buf);
+       rctx->iv_buf = NULL;
+       kfree_sensitive(rctx->out_buf);
+       rctx->out_buf = NULL;
+       kfree_sensitive(rctx->in_buf);
+       rctx->in_buf = NULL;
+}
+
+static void cmh_aes_complete(void *data, int error);
+
+/*
+ * Submit the second CTR chunk after the first completes.
+ * Called from cmh_aes_complete() when ctr_chunk1_len > 0.
+ */
+static int cmh_aes_ctr_submit_chunk2(struct skcipher_request *req)
+{
+       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+       struct cmh_aes_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
+       struct cmh_aes_reqctx *rctx = skcipher_request_ctx(req);
+       struct vcq_cmd cmds[CMH_AES_MAX_PAYLOAD];
+       u32 chunk1 = rctx->ctr_chunk1_len;
+       u32 chunk2 = rctx->cryptlen - chunk1;
+       u64 key_ref;
+       u32 keylen;
+       u32 idx = 0;
+
+       /* Clear split flag so next completion is final */
+       rctx->ctr_chunk1_len = 0;
+
+       vcq_add_sys_write(&cmds[idx++], SYS_REF_TEMP,
+                         (u64)rctx->key_dma, SYS_REF_NONE,
+                         tctx->key.raw.len,
+                         tctx->key.raw.sys_type);
+       key_ref = SYS_REF_TEMP;
+       keylen = tctx->key.raw.len;
+
+       vcq_add_aes_init(&cmds[idx++], rctx->core_id, key_ref,
+                        (u64)rctx->iv2_dma, keylen, rctx->ivsize,
+                        rctx->aes_mode, rctx->aes_op, 0);
+       vcq_add_aes_final(&cmds[idx++], rctx->core_id,
+                         (u64)(rctx->in_dma + chunk1),
+                         (u64)(rctx->out_dma + chunk1), chunk2);
+       vcq_add_flush(&cmds[idx++], rctx->core_id);
+
+       return cmh_vcq_pack_and_submit_async(cmds, idx, rctx->packed,
+                                            CMH_AES_MAX_PACKED,
+                                            rctx->target_mbx,
+                                            cmh_aes_complete, req,
+                                            !!(req->base.flags &
+                                               CRYPTO_TFM_REQ_MAY_BACKLOG),
+                                            cmh_tm_async_timeout_jiffies());
+}
+
+/*
+ * Async completion callback -- fires from RH threaded IRQ context.
+ *
+ * Unmaps DMA buffers, copies output to req->dst scatterlist,
+ * updates the IV state, frees temporaries, and completes the request.
+ *
+ * For CTR counter-wrap splits, the first chunk completion chains
+ * into a second VCQ submission rather than finalizing immediately.
+ */
+static void cmh_aes_complete(void *data, int error)
+{
+       struct skcipher_request *req = data;
+       struct cmh_aes_reqctx *rctx = skcipher_request_ctx(req);
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       /*
+        * CTR counter-wrap: first chunk completed, submit second.
+        * DMA mappings remain valid (they cover the full buffer).
+        *
+        * Recursion depth bounded: chunk2 clears ctr_chunk1_len before
+        * submission, so the second cmh_aes_complete invocation sees 0
+        * and finalizes (max depth = 2).
+        */
+       if (rctx->ctr_chunk1_len && !error) {
+               int ret;
+
+               ret = cmh_aes_ctr_submit_chunk2(req);
+
+               if (!ret || ret == -EBUSY)
+                       return;
+               /* Submission failed; clean up below */
+               error = ret;
+       }
+
+       cmh_aes_unmap_dma(rctx);
+
+       if (!error) {
+               scatterwalk_map_and_copy(rctx->out_buf, req->dst,
+                                        0, rctx->cryptlen, 1);
+               cmh_aes_update_iv(req, rctx->aes_mode, rctx->aes_op,
+                                 rctx->in_buf, rctx->out_buf);
+       }
+
+       cmh_aes_free_bufs(rctx);
+       cmh_complete(&req->base, error);
+}
+
+/*
+ * Core encrypt/decrypt -- builds a VCQ transaction and submits async.
+ *
+ * Returns -EINPROGRESS on successful submission (completion callback
+ * will fire later).  Returns 0 for trivial cases (zero-length).
+ * Returns negative errno on pre-submission errors.
+ */
+static int cmh_aes_crypt(struct skcipher_request *req, u32 aes_op)
+{
+       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+       struct cmh_aes_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
+       const struct cmh_aes_alg_info *info = cmh_aes_get_info(tfm);
+       struct cmh_aes_reqctx *rctx = skcipher_request_ctx(req);
+       struct vcq_cmd cmds[CMH_AES_MAX_PAYLOAD];
+       u64 key_ref;
+       u32 keylen;
+       struct core_dispatch d;
+       s32 target_mbx;
+       u32 core_id;
+       u32 idx;
+       int ret;
+       gfp_t gfp;
+
+       if (tctx->key.mode == CMH_KEY_NONE)
+               return -ENOKEY;
+
+       if (!req->cryptlen)
+               return 0;
+
+       if (req->cryptlen > CMH_AES_MAX_CRYPTLEN)
+               return -EINVAL;
+
+       switch (info->aes_mode) {
+       case AES_MODE_CTR:
+       case AES_MODE_CFB:
+               break;
+       case AES_MODE_XTS:
+               if (req->cryptlen < CMH_AES_BLOCK_SIZE)
+                       return -EINVAL;
+               break;
+       default:
+               if (req->cryptlen & (CMH_AES_BLOCK_SIZE - 1))
+                       return -EINVAL;
+               break;
+       }
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       /* Initialise reqctx */
+       memset(rctx, 0, sizeof(*rctx));
+       rctx->cryptlen = req->cryptlen;
+       rctx->ivsize = info->ivsize;
+       rctx->aes_mode = info->aes_mode;
+       rctx->aes_op = aes_op;
+       rctx->iv2_buf = NULL;
+
+       /* Linearise input from scatterlist */
+       rctx->in_buf = kmalloc(req->cryptlen, gfp);
+       if (!rctx->in_buf)
+               return -ENOMEM;
+
+       scatterwalk_map_and_copy(rctx->in_buf, req->src, 0, req->cryptlen, 0);
+
+       rctx->in_dma = cmh_dma_map_single(rctx->in_buf, req->cryptlen,
+                                         DMA_TO_DEVICE);
+       if (cmh_dma_map_error(rctx->in_dma)) {
+               ret = -ENOMEM;
+               goto out_free_in;
+       }
+
+       /* Allocate and map output buffer */
+       rctx->out_buf = kmalloc(req->cryptlen, gfp);
+       if (!rctx->out_buf) {
+               ret = -ENOMEM;
+               goto out_unmap_in;
+       }
+
+       rctx->out_dma = cmh_dma_map_single(rctx->out_buf, req->cryptlen,
+                                          DMA_FROM_DEVICE);
+       if (cmh_dma_map_error(rctx->out_dma)) {
+               ret = -ENOMEM;
+               goto out_free_out;
+       }
+
+       /* Map IV if required */
+       if (info->ivsize > 0) {
+               rctx->iv_buf = kmemdup(req->iv, info->ivsize, gfp);
+               if (!rctx->iv_buf) {
+                       ret = -ENOMEM;
+                       goto out_unmap_out;
+               }
+               rctx->iv_dma = cmh_dma_map_single(rctx->iv_buf, info->ivsize,
+                                                 DMA_TO_DEVICE);
+               if (cmh_dma_map_error(rctx->iv_dma)) {
+                       ret = -ENOMEM;
+                       goto out_free_iv;
+               }
+       }
+
+       /* Resolve key reference */
+       idx = 0;
+
+       rctx->key_dma = tctx->key.raw.dma;
+       rctx->keylen = tctx->key.raw.len;
+       vcq_add_sys_write(&cmds[idx++], SYS_REF_TEMP,
+                         (u64)rctx->key_dma, SYS_REF_NONE,
+                         tctx->key.raw.len,
+                         tctx->key.raw.sys_type);
+       key_ref = SYS_REF_TEMP;
+       keylen = tctx->key.raw.len;
+       d = cmh_core_select_instance(CMH_CORE_AES);
+       target_mbx = d.mbx_idx;
+       core_id = d.core_id;
+
+       /*
+        * iolen in INIT: XTS needs total length upfront for tweak
+        * computation; all other modes use 0 (streaming).
+        */
+       vcq_add_aes_init(&cmds[idx++], core_id, key_ref, (u64)rctx->iv_dma,
+                        keylen, info->ivsize, info->aes_mode, aes_op,
+                        info->aes_mode == AES_MODE_XTS ?
+                        req->cryptlen : 0);
+
+       if (info->aes_mode == AES_MODE_XTS &&
+           req->cryptlen > 2 * CMH_AES_BLOCK_SIZE) {
+               u32 final_len, update_len;
+
+               if (req->cryptlen & (CMH_AES_BLOCK_SIZE - 1))
+                       final_len = CMH_AES_BLOCK_SIZE +
+                                   (req->cryptlen & (CMH_AES_BLOCK_SIZE - 1));
+               else
+                       final_len = 2 * CMH_AES_BLOCK_SIZE;
+
+               update_len = req->cryptlen - final_len;
+
+               vcq_add_aes_update(&cmds[idx++], core_id,
+                                  (u64)rctx->in_dma,
+                                  (u64)rctx->out_dma, update_len);
+               vcq_add_aes_final(&cmds[idx++], core_id,
+                                 (u64)(rctx->in_dma + update_len),
+                                 (u64)(rctx->out_dma + update_len),
+                                 final_len);
+       } else if (info->aes_mode == AES_MODE_CTR) {
+               /*
+                * CTR counter-wrap workaround:
+                * The AES-SCA hardware uses a 64-bit block counter.
+                * If the lower 64 bits of the IV would wrap during
+                * this operation, split into two separate VCQ
+                * transactions -- the completion callback for the
+                * first chunk submits the second.
+                */
+               u64 lower64 = get_unaligned_be64(rctx->iv_buf + 8);
+               u32 nblocks = DIV_ROUND_UP(req->cryptlen,
+                                         CMH_AES_BLOCK_SIZE);
+               u64 bwrap = lower64 ? (~lower64 + 1ULL) : U64_MAX;
+
+               if (nblocks > bwrap) {
+                       u32 chunk1 = (u32)bwrap * CMH_AES_BLOCK_SIZE;
+                       u64 upper64;
+
+                       /* Prepare second IV for chained submission */
+                       rctx->iv2_buf = kmalloc(info->ivsize, gfp);
+                       if (!rctx->iv2_buf) {
+                               ret = -ENOMEM;
+                               goto out_unmap_iv;
+                       }
+                       upper64 = get_unaligned_be64(rctx->iv_buf);
+                       put_unaligned_be64(upper64 + 1, rctx->iv2_buf);
+                       put_unaligned_be64(0, rctx->iv2_buf + 8);
+
+                       rctx->iv2_dma =
+                               cmh_dma_map_single(rctx->iv2_buf,
+                                                  info->ivsize,
+                                                  DMA_TO_DEVICE);
+                       if (cmh_dma_map_error(rctx->iv2_dma)) {
+                               ret = -ENOMEM;
+                               goto out_free_iv2;
+                       }
+
+                       /* Store state for the chained second submission */
+                       rctx->ctr_chunk1_len = chunk1;
+                       rctx->core_id = core_id;
+                       rctx->target_mbx = target_mbx;
+                       rctx->key_ref = key_ref;
+
+                       /* First transaction: only chunk1 */
+                       vcq_add_aes_final(&cmds[idx++], core_id,
+                                         (u64)rctx->in_dma,
+                                         (u64)rctx->out_dma, chunk1);
+               } else {
+                       /* No wrap: single FINAL with all data */
+                       vcq_add_aes_final(&cmds[idx++], core_id,
+                                         (u64)rctx->in_dma,
+                                         (u64)rctx->out_dma,
+                                         req->cryptlen);
+               }
+       } else {
+               vcq_add_aes_final(&cmds[idx++], core_id,
+                                 (u64)rctx->in_dma,
+                                 (u64)rctx->out_dma, req->cryptlen);
+       }
+
+       vcq_add_flush(&cmds[idx++], core_id);
+
+       ret = cmh_vcq_pack_and_submit_async(cmds, idx, rctx->packed,
+                                           CMH_AES_MAX_PACKED, target_mbx,
+                                           cmh_aes_complete, req,
+                                           !!(req->base.flags &
+                                              CRYPTO_TFM_REQ_MAY_BACKLOG),
+                                           cmh_tm_async_timeout_jiffies());
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (ret)
+               goto out_cleanup_all;
+
+       return -EINPROGRESS;
+
+out_cleanup_all:
+       if (rctx->iv2_buf) {
+               cmh_dma_unmap_single(rctx->iv2_dma, info->ivsize,
+                                    DMA_TO_DEVICE);
+       }
+out_free_iv2:
+       kfree(rctx->iv2_buf);
+out_unmap_iv:
+       if (info->ivsize > 0)
+               cmh_dma_unmap_single(rctx->iv_dma, info->ivsize,
+                                    DMA_TO_DEVICE);
+out_free_iv:
+       kfree(rctx->iv_buf);
+out_unmap_out:
+       cmh_dma_unmap_single(rctx->out_dma, req->cryptlen, DMA_FROM_DEVICE);
+out_free_out:
+       kfree_sensitive(rctx->out_buf);
+out_unmap_in:
+       cmh_dma_unmap_single(rctx->in_dma, req->cryptlen, DMA_TO_DEVICE);
+out_free_in:
+       kfree_sensitive(rctx->in_buf);
+       return ret;
+}
+
+static int cmh_aes_encrypt(struct skcipher_request *req)
+{
+       return cmh_aes_crypt(req, AES_OP_ENCRYPT);
+}
+
+static int cmh_aes_decrypt(struct skcipher_request *req)
+{
+       return cmh_aes_crypt(req, AES_OP_DECRYPT);
+}
+
+/* Registration */
+
+static struct cmh_aes_alg_drv aes_drv_algs[ARRAY_SIZE(aes_algs)];
+
+/**
+ * cmh_aes_register() - Register AES-CBC/CTR/ECB/XTS skcipher algorithms with the crypto framework
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int cmh_aes_register(void)
+{
+       unsigned int i;
+       int ret;
+
+       for (i = 0; i < ARRAY_SIZE(aes_algs); i++) {
+               const struct cmh_aes_alg_info *info = &aes_algs[i];
+               struct cmh_aes_alg_drv *drv = &aes_drv_algs[i];
+               struct skcipher_alg *alg = &drv->alg;
+
+               drv->info = info;
+
+               memset(alg, 0, sizeof(*alg));
+
+               alg->setkey      = cmh_aes_setkey;
+               alg->encrypt     = cmh_aes_encrypt;
+               alg->decrypt     = cmh_aes_decrypt;
+               alg->init        = cmh_aes_init_tfm;
+               alg->exit        = cmh_aes_exit_tfm;
+               alg->min_keysize = info->min_keysize;
+               alg->max_keysize = info->max_keysize;
+               alg->ivsize      = info->ivsize;
+
+               strscpy(alg->base.cra_name, info->alg_name,
+                       CRYPTO_MAX_ALG_NAME);
+               strscpy(alg->base.cra_driver_name, info->drv_name,
+                       CRYPTO_MAX_ALG_NAME);
+               alg->base.cra_priority  = 300;
+               alg->base.cra_flags     = CRYPTO_ALG_KERN_DRIVER_ONLY |
+                                         CRYPTO_ALG_ASYNC;
+               alg->base.cra_blocksize = aes_is_stream_mode(info->aes_mode)
+                                         ? 1 : CMH_AES_BLOCK_SIZE;
+               alg->base.cra_ctxsize  = sizeof(struct cmh_aes_tfm_ctx);
+               alg->base.cra_module   = THIS_MODULE;
+
+               ret = crypto_register_skcipher(alg);
+               if (ret) {
+                       dev_err(cmh_dev(), "cmh_aes: failed to register %s (rc=%d)\n",
+                               info->alg_name, ret);
+                       goto err_unregister;
+               }
+
+               dev_dbg(cmh_dev(), "cmh_aes: registered %s\n", info->alg_name);
+       }
+
+       return 0;
+
+err_unregister:
+       while (i--)
+               crypto_unregister_skcipher(&aes_drv_algs[i].alg);
+       return ret;
+}
+
+/**
+ * cmh_aes_unregister() - Unregister AES skcipher algorithms from the crypto framework
+ */
+void cmh_aes_unregister(void)
+{
+       unsigned int i;
+
+       for (i = 0; i < ARRAY_SIZE(aes_algs); i++) {
+               crypto_unregister_skcipher(&aes_drv_algs[i].alg);
+               dev_dbg(cmh_dev(), "cmh_aes: unregistered %s\n", aes_algs[i].alg_name);
+       }
+}
diff --git a/drivers/crypto/cmh/cmh_aes_aead.c b/drivers/crypto/cmh/cmh_aes_aead.c
new file mode 100644
index 000000000000..0b59c5f7d474
--- /dev/null
+++ b/drivers/crypto/cmh/cmh_aes_aead.c
@@ -0,0 +1,987 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Cryptography Research, Inc. (CRI).
+ * CMH LKM -- Kernel Crypto API AES AEAD Driver (GCM/CCM)
+ *
+ * Registers AEAD algorithms with the Linux crypto subsystem:
+ *   gcm(aes), ccm(aes)
+ *
+ * GCM: AES_CMD_INIT(mode=GCM) + [AAD_FINAL] + AES_CMD_FINAL + FLUSH
+ *   - Standard 12-byte IV (nonce), 16-byte tag
+ *   - AES_CMD_INIT carries aadlen/iolen/taglen
+ *   - AES_CMD_FINAL carries tag DMA for encrypt (produce) / decrypt (verify)
+ *
+ * CCM: AES_CMD_CCM_INIT + [AAD_FINAL] + AES_CMD_FINAL + FLUSH
+ *   - Variable nonce (7--13 bytes), variable tag (4--16 bytes)
+ *   - Uses AES_CMD_CCM_INIT (0x0A) with aes_cmd_init struct
+ *   - Nonce passed via IV field, taglen in init
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/crypto.h>
+#include <crypto/internal/aead.h>
+#include <crypto/internal/cipher.h>
+#include <crypto/scatterwalk.h>
+#include <crypto/utils.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include "cmh_aes.h"
+#include "cmh_vcq.h"
+#include "cmh_aes_abi.h"
+#include "cmh_sys_abi.h"
+#include "cmh_sys.h"
+#include "cmh_txn.h"
+#include "cmh_dma.h"
+#include "cmh_key.h"
+
+/*
+ * GCM IV contract:
+ *
+ * The AES core requires exactly 16 bytes loaded into its IV register.
+ * For standard 96-bit nonce GCM, the driver passes:
+ *
+ *   IV[0..11]  = user-supplied 12-byte nonce
+ *   IV[12..15] = 0x00000000
+ *
+ * The hardware internally sets the last 32 bits to the big-endian
+ * counter value 1 (forming J0 = nonce || 0x00000001) before
+ * processing AAD.  The driver must NOT pre-set the counter.
+ *
+ * If the IV format is incorrect, GCM authentication will fail
+ * (encrypt produces wrong ciphertext/tag, decrypt rejects).
+ */
+#define AES_GCM_IV_SIZE                12U     /* GCM nonce size (standard) */
+#define AES_GCM_HW_IV_SIZE     16U     /* HW requires 16-byte IV buffer */
+#define AES_GCM_TAG_SIZE       16U
+
+/*
+ * CCM: callers pass a 16-byte IV in RFC 3610 format: iv[0] = L-1, then the
+ * nonce (14 - iv[0] bytes, range 7..13), then the counter field (zeroed).
+ */
+#define AES_CCM_IV_SIZE        16U
+
+enum cmh_aes_aead_type {
+       CMH_AES_AEAD_GCM,
+       CMH_AES_AEAD_CCM,
+};
+
+struct cmh_aes_aead_info {
+       enum cmh_aes_aead_type type;
+       u32         aes_mode;   /* AES_MODE_GCM or AES_MODE_CCM */
+       u32         ivsize;
+       u32         maxauthsize;
+       const char *alg_name;
+       const char *drv_name;
+};
+
+static const struct cmh_aes_aead_info aes_aead_algs[] = {
+       { CMH_AES_AEAD_GCM, AES_MODE_GCM, AES_GCM_IV_SIZE,
+         AES_GCM_TAG_SIZE, "gcm(aes)", "cri-cmh-gcm-aes" },
+       { CMH_AES_AEAD_CCM, AES_MODE_CCM, AES_CCM_IV_SIZE,
+         AES_GCM_TAG_SIZE, "ccm(aes)", "cri-cmh-ccm-aes" },
+};
+
+struct cmh_aes_aead_tfm_ctx {
+       struct cmh_key_ctx key;
+       u32 authsize;           /* tag length set by setauthsize */
+       struct crypto_cipher *sw_cipher;        /* CCM empty-input fallback */
+       struct crypto_aead *fallback;   /* CCM authsize=10 fallback */
+};
+
+/* Per-request context (lives in aead_request::__ctx) */
+
+/*
+ * Maximum payload commands:
+ *   [SYS_CMD_WRITE] + AES_CMD_INIT + AAD_FINAL + AES_CMD_FINAL + FLUSH = 5
+ */
+#define CMH_AES_AEAD_MAX_PAYLOAD       5
+#define CMH_AES_AEAD_MAX_PACKED                (CMH_AES_AEAD_MAX_PAYLOAD * 2)
+
+struct cmh_aes_aead_reqctx {
+       dma_addr_t in_dma;
+       dma_addr_t out_dma;
+       dma_addr_t iv_dma;
+       dma_addr_t key_dma;
+       dma_addr_t aad_dma;
+       dma_addr_t tag_dma;
+       u8 *in_buf;
+       u8 *out_buf;
+       u8 *iv_buf;
+       u8 *aad_buf;
+       u8 *tag_buf;
+       u32 cryptlen;
+       u32 assoclen;
+       u32 authsize;
+       u32 iv_map_len;
+       u32 keylen;
+       bool encrypting;
+       bool empty_gcm_fallback;
+       struct vcq_cmd packed[CMH_AES_AEAD_MAX_PACKED];
+};
+
+struct cmh_aes_aead_drv {
+       struct aead_alg                  alg;
+       const struct cmh_aes_aead_info  *info;
+};
+
+static const struct cmh_aes_aead_info *
+cmh_aes_aead_get_info(struct crypto_aead *tfm)
+{
+       struct aead_alg *alg = crypto_aead_alg(tfm);
+
+       return container_of(alg, struct cmh_aes_aead_drv, alg)->info;
+}
+
+/* VCQ Builders -- AEAD-specific */
+
+static void vcq_add_aes_aead_init(struct vcq_cmd *slot, u32 core_id, u64 key_ref,
+                                 u64 iv_dma, u32 keylen, u32 ivlen,
+                                 u32 mode, u32 op, u32 aadlen, u32 iolen,
+                                 u32 taglen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_INIT);
+       slot->hwc.aes.cmd_init.key = key_ref;
+       slot->hwc.aes.cmd_init.iv = iv_dma;
+       slot->hwc.aes.cmd_init.keylen = keylen;
+       slot->hwc.aes.cmd_init.ivlen = ivlen;
+       slot->hwc.aes.cmd_init.mode = mode;
+       slot->hwc.aes.cmd_init.op = op;
+       slot->hwc.aes.cmd_init.aadlen = aadlen;
+       slot->hwc.aes.cmd_init.iolen = iolen;
+       slot->hwc.aes.cmd_init.taglen = taglen;
+}
+
+static void vcq_add_aes_ccm_init(struct vcq_cmd *slot, u32 core_id, u64 key_ref,
+                                u64 nonce_dma, u32 keylen, u32 noncelen,
+                                u32 op, u32 aadlen, u32 iolen, u32 taglen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_CCM_INIT);
+       slot->hwc.aes.cmd_init.key = key_ref;
+       slot->hwc.aes.cmd_init.iv = nonce_dma;
+       slot->hwc.aes.cmd_init.keylen = keylen;
+       slot->hwc.aes.cmd_init.ivlen = noncelen;
+       slot->hwc.aes.cmd_init.mode = AES_MODE_CCM;
+       slot->hwc.aes.cmd_init.op = op;
+       slot->hwc.aes.cmd_init.aadlen = aadlen;
+       slot->hwc.aes.cmd_init.iolen = iolen;
+       slot->hwc.aes.cmd_init.taglen = taglen;
+}
+
+static void vcq_add_aes_aad_final(struct vcq_cmd *slot, u32 core_id, u64 aad_dma,
+                                 u32 aadlen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_AAD_FINAL);
+       slot->hwc.aes.cmd_aad_final.data = aad_dma;
+       slot->hwc.aes.cmd_aad_final.datalen = aadlen;
+}
+
+static void vcq_add_aes_aead_final(struct vcq_cmd *slot, u32 core_id, u64 input_dma,
+                                  u64 output_dma, u64 tag_dma,
+                                  u32 iolen, u32 taglen)
+{
+       memset(slot, 0, sizeof(*slot));
+       slot->magic = VCQ_CMD_MAGIC;
+       slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_FINAL);
+       slot->hwc.aes.cmd_final.input = input_dma;
+       slot->hwc.aes.cmd_final.output = output_dma;
+       slot->hwc.aes.cmd_final.tag = tag_dma;
+       slot->hwc.aes.cmd_final.iolen = iolen;
+       slot->hwc.aes.cmd_final.taglen = taglen;
+}
+
+/* setkey */
+static int cmh_aes_aead_setkey(struct crypto_aead *tfm, const u8 *key,
+                              unsigned int keylen)
+{
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+       int ret;
+
+       if (keylen != 16 && keylen != 24 && keylen != 32)
+               return -EINVAL;
+
+       /* Keep SW fallback ciphers in sync for CCM edge cases */
+       if (tctx->sw_cipher) {
+               ret = crypto_cipher_setkey(tctx->sw_cipher, key, keylen);
+               if (ret)
+                       return ret;
+       }
+       if (tctx->fallback) {
+               ret = crypto_aead_setkey(tctx->fallback, key, keylen);
+               if (ret)
+                       return ret;
+       }
+
+       ret = cmh_key_setkey_raw(&tctx->key, key, keylen, CORE_ID_AES);
+
+       return ret;
+}
+
+static int cmh_aes_aead_setauthsize(struct crypto_aead *tfm,
+                                   unsigned int authsize)
+{
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+       const struct cmh_aes_aead_info *info = cmh_aes_aead_get_info(tfm);
+       int ret;
+
+       if (info->type == CMH_AES_AEAD_GCM) {
+               /* GCM: accept 4, 8, 12, 13, 14, 15, 16 per NIST SP 800-38D */
+               if (authsize < 4 || authsize > 16 ||
+                   (authsize > 4 && authsize < 8) ||
+                   (authsize > 8 && authsize < 12))
+                       return -EINVAL;
+       } else {
+               /* CCM: accept all RFC 3610 values {4,6,8,10,12,14,16} */
+               if (authsize < 4 || authsize > 16 || (authsize & 1))
+                       return -EINVAL;
+               /* Keep SW fallback in sync; HW lacks authsize=10 */
+               if (tctx->fallback) {
+                       ret = crypto_aead_setauthsize(tctx->fallback,
+                                                     authsize);
+                       if (ret)
+                               return ret;
+               }
+       }
+
+       tctx->authsize = authsize;
+       return 0;
+}
+
+static int cmh_aes_aead_init_tfm(struct crypto_aead *tfm)
+{
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+       const struct cmh_aes_aead_info *info = cmh_aes_aead_get_info(tfm);
+
+       memset(tctx, 0, sizeof(*tctx));
+       tctx->authsize = info->maxauthsize;
+
+       if (info->type == CMH_AES_AEAD_CCM) {
+               struct crypto_aead *fb;
+               struct crypto_cipher *ci;
+
+               ci = crypto_alloc_cipher("aes", 0, 0);
+               if (IS_ERR(ci))
+                       return PTR_ERR(ci);
+               tctx->sw_cipher = ci;
+
+               fb = crypto_alloc_aead("ccm(aes)", 0,
+                                      CRYPTO_ALG_NEED_FALLBACK);
+               if (IS_ERR(fb)) {
+                       crypto_free_cipher(ci);
+                       tctx->sw_cipher = NULL;
+                       return PTR_ERR(fb);
+               }
+               tctx->fallback = fb;
+
+               /*
+                * Subreq lives at (rctx + 1).  Alignment is guaranteed
+                * by the crypto framework's __ctx ALIGN mechanism.
+                */
+               crypto_aead_set_reqsize(tfm,
+                                       sizeof(struct cmh_aes_aead_reqctx) +
+                                       sizeof(struct aead_request) +
+                                       crypto_aead_reqsize(fb));
+       } else {
+               crypto_aead_set_reqsize(tfm,
+                                       sizeof(struct cmh_aes_aead_reqctx));
+       }
+
+       return 0;
+}
+
+static void cmh_aes_aead_exit_tfm(struct crypto_aead *tfm)
+{
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+
+       if (tctx->fallback)
+               crypto_free_aead(tctx->fallback);
+       if (tctx->sw_cipher)
+               crypto_free_cipher(tctx->sw_cipher);
+       cmh_key_destroy(&tctx->key);
+}
+
+/* DMA unmap helper */
+static void cmh_aes_aead_unmap_dma(struct cmh_aes_aead_reqctx *rctx)
+{
+       u32 tag_map_len;
+
+       cmh_dma_unmap_single(rctx->iv_dma, rctx->iv_map_len, DMA_TO_DEVICE);
+       /*
+        * The empty-GCM fallback maps a full AES block (16 bytes) for the
+        * ECB output regardless of authsize, so unmap with the mapped size.
+        */
+       tag_map_len = rctx->empty_gcm_fallback ?
+                     AES_GCM_HW_IV_SIZE : rctx->authsize;
+       cmh_dma_unmap_single(rctx->tag_dma, tag_map_len,
+                            (rctx->encrypting || rctx->empty_gcm_fallback) ?
+                             DMA_FROM_DEVICE : DMA_TO_DEVICE);
+       if (rctx->cryptlen > 0) {
+               cmh_dma_unmap_single(rctx->out_dma, rctx->cryptlen,
+                                    DMA_FROM_DEVICE);
+               cmh_dma_unmap_single(rctx->in_dma, rctx->cryptlen,
+                                    DMA_TO_DEVICE);
+       }
+       if (rctx->assoclen > 0)
+               cmh_dma_unmap_single(rctx->aad_dma, rctx->assoclen,
+                                    DMA_TO_DEVICE);
+}
+
+static void cmh_aes_aead_free_bufs(struct cmh_aes_aead_reqctx *rctx)
+{
+       kfree(rctx->iv_buf);
+       rctx->iv_buf = NULL;
+       kfree(rctx->tag_buf);
+       rctx->tag_buf = NULL;
+       kfree_sensitive(rctx->out_buf);
+       rctx->out_buf = NULL;
+       kfree_sensitive(rctx->in_buf);
+       rctx->in_buf = NULL;
+       kfree(rctx->aad_buf);
+       rctx->aad_buf = NULL;
+}
+
+static void cmh_aes_aead_complete(void *data, int error)
+{
+       struct aead_request *req = data;
+       struct cmh_aes_aead_reqctx *rctx = aead_request_ctx(req);
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       cmh_aes_aead_unmap_dma(rctx);
+
+       /*
+        * Map HW error on decrypt to -EBADMSG.  The eSW AES core uses a
+        * single error code (-EIO) for both authentication failures and
+        * other core errors (e.g. DMA timeout), so we cannot distinguish
+        * them from the MBX_STATUS alone.  In practice the only error
+        * during a well-formed AEAD decrypt is auth-tag mismatch; a DMA
+        * timeout would indicate a fatal HW problem where -EBADMSG vs
+        * -EIO is moot.  The kernel crypto API requires -EBADMSG for
+        * AEAD authentication failures.
+        */
+       if (error == -EIO && !rctx->encrypting)
+               error = -EBADMSG;
+
+       if (!error) {
+               /* GCM empty-input decrypt: compare computed tag with expected */
+               if (rctx->empty_gcm_fallback && !rctx->encrypting) {
+                       if (crypto_memneq(rctx->tag_buf, rctx->in_buf,
+                                         rctx->authsize))
+                               error = -EBADMSG;
+               }
+               if (!error && rctx->cryptlen > 0)
+                       scatterwalk_map_and_copy(rctx->out_buf, req->dst,
+                                                req->assoclen,
+                                                rctx->cryptlen, 1);
+               if (!error && rctx->encrypting)
+                       scatterwalk_map_and_copy(rctx->tag_buf, req->dst,
+                                                req->assoclen +
+                                                rctx->cryptlen,
+                                                rctx->authsize, 1);
+       }
+
+       cmh_aes_aead_free_bufs(rctx);
+       cmh_complete(&req->base, error);
+}
+
+/*
+ * GCM empty-input fallback.
+ *
+ * When both AAD and plaintext are empty, GCM reduces to:
+ *   tag = E(K, J0) where J0 = nonce || 0x00000001
+ *
+ * The eSW GCM engine rejects this degenerate case, so we compute it
+ * via a single ECB block encryption of J0.
+ *
+ * VCQ: [SYS_CMD_WRITE] + AES_CMD_INIT(ECB) + AES_CMD_FINAL + FLUSH
+ */
+static int cmh_aes_gcm_empty(struct aead_request *req, u32 aes_op)
+{
+       struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+       struct cmh_aes_aead_reqctx *rctx = aead_request_ctx(req);
+       struct vcq_cmd cmds[CMH_AES_AEAD_MAX_PAYLOAD];
+       u64 key_ref;
+       u32 keylen, authsize;
+       struct core_dispatch d;
+       s32 target_mbx;
+       u32 core_id;
+       u32 idx;
+       int ret;
+       gfp_t gfp;
+
+       authsize = tctx->authsize;
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       memset(rctx, 0, sizeof(*rctx));
+       rctx->cryptlen = 0;
+       rctx->assoclen = 0;
+       rctx->authsize = authsize;
+       rctx->encrypting = (aes_op == AES_OP_ENCRYPT);
+       rctx->empty_gcm_fallback = true;
+
+       /* Build J0 = nonce || 0x00000001 in iv_buf */
+       rctx->iv_buf = kzalloc(AES_GCM_HW_IV_SIZE, gfp);
+       if (!rctx->iv_buf)
+               return -ENOMEM;
+       memcpy(rctx->iv_buf, req->iv, AES_GCM_IV_SIZE);
+       rctx->iv_buf[15] = 0x01; /* big-endian counter = 1 */
+       rctx->iv_map_len = AES_GCM_HW_IV_SIZE;
+
+       rctx->iv_dma = cmh_dma_map_single(rctx->iv_buf, AES_GCM_HW_IV_SIZE,
+                                         DMA_TO_DEVICE);
+       if (cmh_dma_map_error(rctx->iv_dma)) {
+               ret = -ENOMEM;
+               goto out_free_iv;
+       }
+
+       /* Tag buffer -- receives E(K, J0) output */
+       rctx->tag_buf = kzalloc(AES_GCM_HW_IV_SIZE, gfp);
+       if (!rctx->tag_buf) {
+               ret = -ENOMEM;
+               goto out_unmap_iv;
+       }
+       rctx->tag_dma = cmh_dma_map_single(rctx->tag_buf, AES_GCM_HW_IV_SIZE,
+                                          DMA_FROM_DEVICE);
+       if (cmh_dma_map_error(rctx->tag_dma)) {
+               ret = -ENOMEM;
+               goto out_free_tag;
+       }
+
+       /* For decrypt: read expected tag from request for later comparison */
+       if (!rctx->encrypting) {
+               rctx->in_buf = kmalloc(authsize, gfp);
+               if (!rctx->in_buf) {
+                       ret = -ENOMEM;
+                       goto out_unmap_tag;
+               }
+               scatterwalk_map_and_copy(rctx->in_buf, req->src, 0,
+                                        authsize, 0);
+       }
+
+       /* Resolve key */
+       idx = 0;
+       rctx->key_dma = tctx->key.raw.dma;
+       vcq_add_sys_write(&cmds[idx++], SYS_REF_TEMP,
+                         (u64)rctx->key_dma, SYS_REF_NONE,
+                         tctx->key.raw.len,
+                         tctx->key.raw.sys_type);
+       key_ref = SYS_REF_TEMP;
+       keylen = tctx->key.raw.len;
+       d = cmh_core_select_instance(CMH_CORE_AES);
+       target_mbx = d.mbx_idx;
+       core_id = d.core_id;
+
+       /* ECB INIT: single block encryption of J0 */
+       vcq_add_aes_aead_init(&cmds[idx++], core_id, key_ref,
+                             0, keylen, 0, AES_MODE_ECB, AES_OP_ENCRYPT,
+                             0, AES_GCM_HW_IV_SIZE, 0);
+
+       /* FINAL: J0 in, E(K,J0) out */
+       vcq_add_aes_aead_final(&cmds[idx++], core_id,
+                              (u64)rctx->iv_dma, (u64)rctx->tag_dma,
+                              0, AES_GCM_HW_IV_SIZE, 0);
+
+       vcq_add_flush(&cmds[idx++], core_id);
+
+       ret = cmh_vcq_pack_and_submit_async(cmds, idx, rctx->packed,
+                                           CMH_AES_AEAD_MAX_PACKED,
+                                           target_mbx,
+                                           cmh_aes_aead_complete, req,
+                                           !!(req->base.flags &
+                                              CRYPTO_TFM_REQ_MAY_BACKLOG),
+                                           cmh_tm_async_timeout_jiffies());
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (ret)
+               goto out_free_in;
+
+       return -EINPROGRESS;
+
+out_free_in:
+       kfree_sensitive(rctx->in_buf);
+out_unmap_tag:
+       cmh_dma_unmap_single(rctx->tag_dma, AES_GCM_HW_IV_SIZE,
+                            DMA_FROM_DEVICE);
+out_free_tag:
+       kfree(rctx->tag_buf);
+out_unmap_iv:
+       cmh_dma_unmap_single(rctx->iv_dma, AES_GCM_HW_IV_SIZE, DMA_TO_DEVICE);
+out_free_iv:
+       kfree(rctx->iv_buf);
+       return ret;
+}
+
+/*
+ * CCM empty-input fallback.
+ *
+ * When both AAD and plaintext are empty, CCM reduces to:
+ *   T  = E(K, B0)    -- CBC-MAC of the single formatting block
+ *   S0 = E(K, A0)    -- CTR block zero
+ *   tag = (T XOR S0)[0..authsize-1]
+ *
+ * The eSW rejects this degenerate case, so the driver computes it
+ * synchronously via two crypto_cipher single-block encryptions.
+ */
+static int cmh_aes_ccm_empty(struct aead_request *req, u32 aes_op)
+{
+       struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+       u32 authsize = tctx->authsize;
+       u8 b0[CMH_AES_BLOCK_SIZE], a0[CMH_AES_BLOCK_SIZE];
+       u8 t[CMH_AES_BLOCK_SIZE], s0[CMH_AES_BLOCK_SIZE];
+       u8 tag[CMH_AES_BLOCK_SIZE];
+       u8 L;
+       u32 i;
+
+       /* Defense-in-depth: iv[0] = L-1, valid L is 2..8 per RFC 3610 S2.1 */
+       if (WARN_ON_ONCE(req->iv[0] < 1 || req->iv[0] > 7))
+               return -EINVAL;
+
+       L = req->iv[0] + 1;
+
+       if (tctx->key.mode != CMH_KEY_RAW)
+               return -EOPNOTSUPP;
+
+       /* B0: flags || nonce || Q(=0).  Adata=0, t=authsize, q=L. */
+       memset(b0, 0, CMH_AES_BLOCK_SIZE);
+       b0[0] = (u8)(8 * ((authsize - 2) / 2) + (L - 1));
+       memcpy(&b0[1], &req->iv[1], 15 - L);
+
+       /* A0: (L-1) || nonce || counter(=0) */
+       memset(a0, 0, CMH_AES_BLOCK_SIZE);
+       a0[0] = (u8)(L - 1);
+       memcpy(&a0[1], &req->iv[1], 15 - L);
+
+       crypto_cipher_encrypt_one(tctx->sw_cipher, t, b0);
+       crypto_cipher_encrypt_one(tctx->sw_cipher, s0, a0);
+
+       for (i = 0; i < authsize; i++)
+               tag[i] = t[i] ^ s0[i];
+
+       if (aes_op == AES_OP_ENCRYPT) {
+               scatterwalk_map_and_copy(tag, req->dst,
+                                        req->assoclen, authsize, 1);
+       } else {
+               u8 expected[CMH_AES_BLOCK_SIZE];
+
+               scatterwalk_map_and_copy(expected, req->src,
+                                        req->assoclen, authsize, 0);
+               if (crypto_memneq(tag, expected, authsize))
+                       return -EBADMSG;
+       }
+
+       return 0;
+}
+
+/*
+ * CCM authsize=10 fallback.
+ *
+ * The eSW AES CCM core does not support authsize=10 (valid per RFC 3610).
+ * Forward the entire request to the generic CCM implementation.
+ */
+static void cmh_aes_ccm_fb_done(void *data, int err)
+{
+       struct aead_request *req = data;
+
+       cmh_complete(&req->base, err);
+}
+
+static int cmh_aes_ccm_fallback(struct aead_request *req, u32 aes_op)
+{
+       struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+       struct cmh_aes_aead_reqctx *rctx = aead_request_ctx(req);
+       struct aead_request *subreq = (void *)(rctx + 1);
+
+       aead_request_set_tfm(subreq, tctx->fallback);
+       aead_request_set_callback(subreq, req->base.flags,
+                                 cmh_aes_ccm_fb_done, req);
+       aead_request_set_crypt(subreq, req->src, req->dst,
+                              req->cryptlen, req->iv);
+       aead_request_set_ad(subreq, req->assoclen);
+
+       return (aes_op == AES_OP_ENCRYPT) ?
+               crypto_aead_encrypt(subreq) : crypto_aead_decrypt(subreq);
+}
+
+/*
+ * Core AEAD encrypt/decrypt -- async path.
+ *
+ * Encrypt: plaintext -> ciphertext + tag appended
+ * Decrypt: ciphertext + tag -> plaintext (tag verified by eSW)
+ *
+ * VCQ: [SYS_CMD_WRITE] + INIT/CCM_INIT + [AAD_FINAL] + FINAL + FLUSH
+ */
+static int cmh_aes_aead_crypt(struct aead_request *req, u32 aes_op)
+{
+       struct crypto_aead *tfm = crypto_aead_reqtfm(req);
+       struct cmh_aes_aead_tfm_ctx *tctx = crypto_aead_ctx(tfm);
+       const struct cmh_aes_aead_info *info = cmh_aes_aead_get_info(tfm);
+       struct cmh_aes_aead_reqctx *rctx = aead_request_ctx(req);
+       struct vcq_cmd cmds[CMH_AES_AEAD_MAX_PAYLOAD];
+       u64 key_ref;
+       u32 keylen, authsize, cryptlen;
+       struct core_dispatch d;
+       s32 target_mbx;
+       u32 core_id;
+       u32 idx;
+       int ret;
+       gfp_t gfp;
+
+       if (tctx->key.mode == CMH_KEY_NONE)
+               return -ENOKEY;
+
+       authsize = tctx->authsize;
+
+       if (aes_op == AES_OP_ENCRYPT) {
+               cryptlen = req->cryptlen;
+       } else {
+               if (req->cryptlen < authsize)
+                       return -EINVAL;
+               cryptlen = req->cryptlen - authsize;
+       }
+
+       /*
+        * Validate CCM IV format early -- the empty-input fallback and
+        * nonce extraction both depend on iv[0] being in range [1,7].
+        */
+       if (info->type == CMH_AES_AEAD_CCM) {
+               if (req->iv[0] < 1 || req->iv[0] > 7)
+                       return -EINVAL;
+       }
+
+       /*
+        * The CMH eSW rejects GCM/CCM when both aadlen and iolen are zero.
+        * For GCM, the tag is simply E(K, J0) -- handle with ECB fallback.
+        * For CCM, compute tag = E(K,B0) XOR E(K,A0) in software.
+        */
+       if (cryptlen == 0 && req->assoclen == 0) {
+               if (info->type == CMH_AES_AEAD_GCM)
+                       return cmh_aes_gcm_empty(req, aes_op);
+               return cmh_aes_ccm_empty(req, aes_op);
+       }
+
+       /*
+        * HW does not support authsize=10 for CCM.  Forward the entire
+        * request to the generic CCM implementation.
+        */
+       if (info->type == CMH_AES_AEAD_CCM && authsize == 10)
+               return cmh_aes_ccm_fallback(req, aes_op);
+
+       /*
+        * HW uses a proprietary LLI scatter-gather format that is
+        * incompatible with struct scatterlist, so the payload is
+        * linearised into contiguous buffers for DMA.  Cap total
+        * size to prevent excessive memory consumption.
+        */
+       if ((u64)cryptlen + req->assoclen > SZ_1M)
+               return -EINVAL;
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       memset(rctx, 0, sizeof(*rctx));
+       rctx->cryptlen = cryptlen;
+       rctx->assoclen = req->assoclen;
+       rctx->authsize = authsize;
+       rctx->encrypting = (aes_op == AES_OP_ENCRYPT);
+
+       /* Linearise AAD */
+       if (req->assoclen > 0) {
+               rctx->aad_buf = kmalloc(req->assoclen, gfp);
+               if (!rctx->aad_buf)
+                       return -ENOMEM;
+               scatterwalk_map_and_copy(rctx->aad_buf, req->src,
+                                        0, req->assoclen, 0);
+               rctx->aad_dma = cmh_dma_map_single(rctx->aad_buf,
+                                                  req->assoclen,
+                                                   DMA_TO_DEVICE);
+               if (cmh_dma_map_error(rctx->aad_dma)) {
+                       ret = -ENOMEM;
+                       goto out_free_aad;
+               }
+       }
+
+       /* Linearise input */
+       if (cryptlen > 0) {
+               rctx->in_buf = kmalloc(cryptlen, gfp);
+               if (!rctx->in_buf) {
+                       ret = -ENOMEM;
+                       goto out_unmap_aad;
+               }
+               scatterwalk_map_and_copy(rctx->in_buf, req->src,
+                                        req->assoclen, cryptlen, 0);
+               rctx->in_dma = cmh_dma_map_single(rctx->in_buf, cryptlen,
+                                                 DMA_TO_DEVICE);
+               if (cmh_dma_map_error(rctx->in_dma)) {
+                       ret = -ENOMEM;
+                       goto out_free_in;
+               }
+       }
+
+       /* Allocate output buffer */
+       if (cryptlen > 0) {
+               rctx->out_buf = kmalloc(cryptlen, gfp);
+               if (!rctx->out_buf) {
+                       ret = -ENOMEM;
+                       goto out_unmap_in;
+               }
+               rctx->out_dma = cmh_dma_map_single(rctx->out_buf, cryptlen,
+                                                  DMA_FROM_DEVICE);
+               if (cmh_dma_map_error(rctx->out_dma)) {
+                       ret = -ENOMEM;
+                       goto out_free_out;
+               }
+       }
+
+       /* Tag buffer */
+       rctx->tag_buf = kmalloc(authsize, gfp);
+       if (!rctx->tag_buf) {
+               ret = -ENOMEM;
+               goto out_unmap_out;
+       }
+
+       if (!rctx->encrypting) {
+               scatterwalk_map_and_copy(rctx->tag_buf, req->src,
+                                        req->assoclen + cryptlen,
+                                       authsize, 0);
+       } else {
+               memset(rctx->tag_buf, 0, authsize);
+       }
+
+       rctx->tag_dma = cmh_dma_map_single(rctx->tag_buf, authsize,
+                                          rctx->encrypting ?
+                                           DMA_FROM_DEVICE : DMA_TO_DEVICE);
+       if (cmh_dma_map_error(rctx->tag_dma)) {
+               ret = -ENOMEM;
+               goto out_free_tag;
+       }
+
+       /* Map IV/nonce */
+       if (info->type == CMH_AES_AEAD_GCM) {
+               rctx->iv_buf = kzalloc(AES_GCM_HW_IV_SIZE, gfp);
+               if (!rctx->iv_buf) {
+                       ret = -ENOMEM;
+                       goto out_unmap_tag;
+               }
+               memcpy(rctx->iv_buf, req->iv, AES_GCM_IV_SIZE);
+               rctx->iv_map_len = AES_GCM_HW_IV_SIZE;
+               rctx->iv_dma = cmh_dma_map_single(rctx->iv_buf,
+                                                 rctx->iv_map_len,
+                                                  DMA_TO_DEVICE);
+       } else {
+               u32 noncelen;
+
+               if (req->iv[0] < 1 || req->iv[0] > 7) {
+                       ret = -EINVAL;
+                       goto out_unmap_tag;
+               }
+               noncelen = 14 - req->iv[0];
+
+               rctx->iv_buf = kmemdup(req->iv + 1, noncelen, gfp);
+               if (!rctx->iv_buf) {
+                       ret = -ENOMEM;
+                       goto out_unmap_tag;
+               }
+               rctx->iv_map_len = noncelen;
+               rctx->iv_dma = cmh_dma_map_single(rctx->iv_buf,
+                                                 rctx->iv_map_len,
+                                                  DMA_TO_DEVICE);
+       }
+       if (cmh_dma_map_error(rctx->iv_dma)) {
+               ret = -ENOMEM;
+               goto out_free_iv;
+       }
+
+       /* Resolve key reference */
+       idx = 0;
+
+       rctx->key_dma = tctx->key.raw.dma;
+       rctx->keylen = tctx->key.raw.len;
+       vcq_add_sys_write(&cmds[idx++], SYS_REF_TEMP,
+                         (u64)rctx->key_dma, SYS_REF_NONE,
+                         tctx->key.raw.len,
+                         tctx->key.raw.sys_type);
+       key_ref = SYS_REF_TEMP;
+       keylen = tctx->key.raw.len;
+       d = cmh_core_select_instance(CMH_CORE_AES);
+       target_mbx = d.mbx_idx;
+       core_id = d.core_id;
+
+       /* Build INIT command */
+       if (info->type == CMH_AES_AEAD_CCM) {
+               vcq_add_aes_ccm_init(&cmds[idx++], core_id, key_ref,
+                                    (u64)rctx->iv_dma, keylen,
+                                    rctx->iv_map_len, aes_op,
+                                    req->assoclen, cryptlen, authsize);
+       } else {
+               vcq_add_aes_aead_init(&cmds[idx++], core_id, key_ref,
+                                     (u64)rctx->iv_dma, keylen,
+                                     AES_GCM_HW_IV_SIZE, info->aes_mode,
+                                     aes_op, req->assoclen, cryptlen,
+                                     authsize);
+       }
+
+       if (req->assoclen > 0)
+               vcq_add_aes_aad_final(&cmds[idx++], core_id,
+                                     (u64)rctx->aad_dma, req->assoclen);
+
+       vcq_add_aes_aead_final(&cmds[idx++], core_id,
+                              cryptlen > 0 ? (u64)rctx->in_dma : 0,
+                              cryptlen > 0 ? (u64)rctx->out_dma : 0,
+                              (u64)rctx->tag_dma, cryptlen, authsize);
+
+       vcq_add_flush(&cmds[idx++], core_id);
+
+       ret = cmh_vcq_pack_and_submit_async(cmds, idx, rctx->packed,
+                                           CMH_AES_AEAD_MAX_PACKED,
+                                           target_mbx,
+                                           cmh_aes_aead_complete, req,
+                                           !!(req->base.flags &
+                                              CRYPTO_TFM_REQ_MAY_BACKLOG),
+                                           cmh_tm_async_timeout_jiffies());
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (ret)
+               goto out_cleanup_all;
+
+       return -EINPROGRESS;
+
+out_cleanup_all:
+       cmh_dma_unmap_single(rctx->iv_dma, rctx->iv_map_len, DMA_TO_DEVICE);
+out_free_iv:
+       kfree(rctx->iv_buf);
+out_unmap_tag:
+       cmh_dma_unmap_single(rctx->tag_dma, authsize,
+                            rctx->encrypting ? DMA_FROM_DEVICE :
+                                              DMA_TO_DEVICE);
+out_free_tag:
+       kfree(rctx->tag_buf);
+out_unmap_out:
+       if (cryptlen > 0)
+               cmh_dma_unmap_single(rctx->out_dma, cryptlen, DMA_FROM_DEVICE);
+out_free_out:
+       kfree_sensitive(rctx->out_buf);
+out_unmap_in:
+       if (cryptlen > 0)
+               cmh_dma_unmap_single(rctx->in_dma, cryptlen, DMA_TO_DEVICE);
+out_free_in:
+       kfree_sensitive(rctx->in_buf);
+out_unmap_aad:
+       if (req->assoclen > 0)
+               cmh_dma_unmap_single(rctx->aad_dma, req->assoclen,
+                                    DMA_TO_DEVICE);
+out_free_aad:
+       kfree(rctx->aad_buf);
+       return ret;
+}
+
+static int cmh_aes_aead_encrypt(struct aead_request *req)
+{
+       return cmh_aes_aead_crypt(req, AES_OP_ENCRYPT);
+}
+
+static int cmh_aes_aead_decrypt(struct aead_request *req)
+{
+       return cmh_aes_aead_crypt(req, AES_OP_DECRYPT);
+}
+
+/* Registration */
+
+static struct cmh_aes_aead_drv aes_aead_drv_algs[ARRAY_SIZE(aes_aead_algs)];
+
+/**
+ * cmh_aes_aead_register() - Register AES-GCM/CCM AEAD algorithms with the crypto framework
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int cmh_aes_aead_register(void)
+{
+       unsigned int i;
+       int ret;
+
+       for (i = 0; i < ARRAY_SIZE(aes_aead_algs); i++) {
+               const struct cmh_aes_aead_info *info = &aes_aead_algs[i];
+               struct cmh_aes_aead_drv *drv = &aes_aead_drv_algs[i];
+               struct aead_alg *alg = &drv->alg;
+
+               drv->info = info;
+
+               memset(alg, 0, sizeof(*alg));
+
+               alg->setkey      = cmh_aes_aead_setkey;
+               alg->setauthsize = cmh_aes_aead_setauthsize;
+               alg->encrypt     = cmh_aes_aead_encrypt;
+               alg->decrypt     = cmh_aes_aead_decrypt;
+               alg->init        = cmh_aes_aead_init_tfm;
+               alg->exit        = cmh_aes_aead_exit_tfm;
+               alg->ivsize      = info->ivsize;
+               alg->maxauthsize = info->maxauthsize;
+
+               strscpy(alg->base.cra_name, info->alg_name,
+                       CRYPTO_MAX_ALG_NAME);
+               strscpy(alg->base.cra_driver_name, info->drv_name,
+                       CRYPTO_MAX_ALG_NAME);
+               alg->base.cra_priority  = 300;
+               alg->base.cra_flags     = CRYPTO_ALG_KERN_DRIVER_ONLY |
+                                         CRYPTO_ALG_ASYNC;
+               if (info->type == CMH_AES_AEAD_CCM) {
+                       alg->base.cra_flags |= CRYPTO_ALG_NEED_FALLBACK;
+                       /*
+                        * Bump priority above 300 so we beat the generic
+                        * ccm_base template instance.  That template inherits
+                        * priority (ctr + cbcmac) / 2 = 300 when both
+                        * constituents are at 300, and list ordering would
+                        * otherwise let it shadow our driver.
+                        */
+                       alg->base.cra_priority = 301;
+               }
+               alg->base.cra_blocksize = 1;
+               alg->base.cra_ctxsize  = sizeof(struct cmh_aes_aead_tfm_ctx);
+               alg->base.cra_module   = THIS_MODULE;
+
+               ret = crypto_register_aead(alg);
+               if (ret) {
+                       dev_err(cmh_dev(), "cmh_aes_aead: failed to register %s (rc=%d)\n",
+                               info->alg_name, ret);
+                       goto err_unregister;
+               }
+
+               dev_dbg(cmh_dev(), "cmh_aes_aead: registered %s\n", info->alg_name);
+       }
+
+       return 0;
+
+err_unregister:
+       while (i--)
+               crypto_unregister_aead(&aes_aead_drv_algs[i].alg);
+       return ret;
+}
+
+/**
+ * cmh_aes_aead_unregister() - Unregister AES AEAD algorithms from the crypto framework
+ */
+void cmh_aes_aead_unregister(void)
+{
+       unsigned int i;
+
+       for (i = 0; i < ARRAY_SIZE(aes_aead_algs); i++) {
+               crypto_unregister_aead(&aes_aead_drv_algs[i].alg);
+               dev_dbg(cmh_dev(), "cmh_aes_aead: unregistered %s\n",
+                       aes_aead_algs[i].alg_name);
+       }
+}
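
Review note on the empty-input CCM path: cmh_aes_ccm_empty() derives the tag as E(K, B0) XOR E(K, A0), and both blocks come straight from the request IV. The block layout can be sketched in standalone C as below. This is an illustration only, not part of the patch: `ccm_empty_blocks()` and `CCM_BLOCK_SIZE` are hypothetical names, the AES step is left out, and the range checks on the tag size are an assumption taken from RFC 3610 rather than from the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CCM_BLOCK_SIZE 16	/* hypothetical stand-in for CMH_AES_BLOCK_SIZE */

/*
 * Build B0 and A0 for a CCM request with no AAD and no payload.
 * iv[0] carries L-1 (the kernel's ccm IV convention) and iv[1..15-L]
 * carries the nonce.  Returns 0 on success, -1 on a bad iv[0] or tag size.
 */
static int ccm_empty_blocks(const uint8_t *iv, unsigned int authsize,
			    uint8_t *b0, uint8_t *a0)
{
	unsigned int l;

	assert(iv && b0 && a0);

	if (iv[0] < 1 || iv[0] > 7)
		return -1;
	/* RFC 3610 allows even tag sizes from 4 to 16 bytes. */
	if (authsize < 4 || authsize > 16 || (authsize & 1))
		return -1;

	l = iv[0] + 1u;

	/* B0: flags || nonce || message length (zero here), Adata = 0. */
	memset(b0, 0, CCM_BLOCK_SIZE);
	b0[0] = (uint8_t)(8 * ((authsize - 2) / 2) + (l - 1));
	memcpy(&b0[1], &iv[1], 15 - l);

	/* A0: (L-1) || nonce || counter (zero). */
	memset(a0, 0, CCM_BLOCK_SIZE);
	a0[0] = (uint8_t)(l - 1);
	memcpy(&a0[1], &iv[1], 15 - l);

	return 0;
}
```

Encrypting each block with the raw key and XORing the first authsize bytes then gives the tag, which is what the two crypto_cipher_encrypt_one() calls in the patch do.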
diff --git a/drivers/crypto/cmh/cmh_aes_cmac.c b/drivers/crypto/cmh/cmh_aes_cmac.c
new file mode 100644
index 000000000000..a711c575398d
--- /dev/null
+++ b/drivers/crypto/cmh/cmh_aes_cmac.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Cryptography Research, Inc. (CRI).
+ * CMH LKM -- Kernel Crypto API AES-CMAC (ahash) Driver
+ *
+ * Registers cmac(aes) as an ahash algorithm.
+ *
+ * CMAC produces a 16-byte tag (MAC) from a key and message.
+ * VCQ sequence: [SYS_CMD_WRITE] + AES_CMD_INIT(CMAC) +
+ *               AES_CMD_AAD_FINAL_AUTH + FLUSH
+ *
+ * The ahash interface accumulates data in a kernel buffer via .update(),
+ * then .final() builds and submits the VCQ asynchronously.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/crypto.h>
+#include <crypto/internal/hash.h>
+#include <crypto/scatterwalk.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include "cmh_aes.h"
+#include "cmh_vcq.h"
+#include "cmh_aes_abi.h"
+#include "cmh_sys_abi.h"
+#include "cmh_sys.h"
+#include "cmh_txn.h"
+#include "cmh_dma.h"
+#include "cmh_key.h"
+
+#define AES_CMAC_DIGEST_SIZE   16U
+#define AES_CMAC_BLOCK_SIZE    16U
+
+/*
+ * Maximum accumulated data for CMAC -- driver-imposed, not HW.
+ *
+ * The AES core does not expose external save/restore VCQ commands,
+ * so the driver must accumulate all data in kernel memory via
+ * .update() and submit it atomically in .final().  This cap limits
+ * the per-request kernel allocation.
+ */
+#define AES_CMAC_MAX_DATA      (64 * 1024)
+
+/* Per-transform context */
+struct cmh_aes_cmac_tfm_ctx {
+       struct cmh_key_ctx key;
+       spinlock_t         chunk_lock;  /* protects all_chunks */
+       struct list_head   all_chunks;  /* orphan-safe chunk tracking */
+};
+
+/* One chunk per .update() call -- data is embedded via flexible array */
+struct cmh_aes_cmac_chunk {
+       struct list_head list;
+       struct list_head tfm_node; /* per-tfm orphan tracking */
+       u32 len;
+       u8 data[];
+};
+
+/* Per-request context (lives in ahash_request::__ctx) */
+
+/*
+ * Maximum payload commands:
+ *   [SYS_CMD_WRITE] + AES_CMD_INIT + AES_CMD_AAD_FINAL_AUTH + FLUSH = 4
+ */
+#define CMH_AES_CMAC_MAX_PAYLOAD       4
+#define CMH_AES_CMAC_MAX_PACKED                (CMH_AES_CMAC_MAX_PAYLOAD * 2)
+
+struct cmh_aes_cmac_reqctx {
+       struct list_head chunks;
+       u32  total_len;
+       u8  *buf;       /* linearised in final() for DMA */
+       /* DMA state for async final */
+       dma_addr_t key_dma;
+       dma_addr_t in_dma;
+       dma_addr_t tag_dma;
+       u8 *tag_buf;
+       u32 keylen;
+       struct vcq_cmd packed[CMH_AES_CMAC_MAX_PACKED];
+};
+
+/* Flat state for export/import -- holds accumulated input data only */
+struct cmh_aes_cmac_export_state {
+       u32 total_len;
+       u8  data[];
+};
+
+/*
+ * Flat state buffer for export/import.  The CMH AES core does not
+ * support save/restore of intermediate CMAC state, so this driver
+ * accumulates input in SW and serialises the buffer on export.
+ *
+ * The 4096-byte state size caps the exportable accumulated-data window.
+ * Full-range export is not feasible because the crypto subsystem
+ * pre-allocates statesize bytes per request.  Export returns -ENOSPC
+ * if the caller has accumulated more than CMH_AES_CMAC_EXPORT_MAX.
+ */
+#define CMH_AES_CMAC_STATE_SIZE 4096
+#define CMH_AES_CMAC_EXPORT_MAX \
+       (CMH_AES_CMAC_STATE_SIZE - sizeof(struct cmh_aes_cmac_export_state))
+
+/*
+ * Export/import of intermediate HW state: not supported.
+ *
+ * The AES core lacks external save/restore VCQ commands, so there is
+ * no way to checkpoint intermediate CMAC state to host memory.
+ * Pending eSW ABI extension to add save/restore for the AES core.
+ */
+
+static int cmh_aes_cmac_setkey(struct crypto_ahash *tfm, const u8 *key,
+                              unsigned int keylen)
+{
+       struct cmh_aes_cmac_tfm_ctx *tctx = crypto_ahash_ctx(tfm);
+
+       if (keylen != 16 && keylen != 24 && keylen != 32)
+               return -EINVAL;
+
+       return cmh_key_setkey_raw(&tctx->key, key, keylen, CORE_ID_AES);
+}
+
+static void cmh_aes_cmac_free_chunks(struct cmh_aes_cmac_reqctx *rctx,
+                                    struct cmh_aes_cmac_tfm_ctx *tctx)
+{
+       struct cmh_aes_cmac_chunk *c, *tmp;
+
+       spin_lock_bh(&tctx->chunk_lock);
+       list_for_each_entry_safe(c, tmp, &rctx->chunks, list) {
+               list_del(&c->list);
+               list_del(&c->tfm_node);
+               kfree_sensitive(c);
+       }
+       spin_unlock_bh(&tctx->chunk_lock);
+       rctx->total_len = 0;
+}
+
+static int cmh_aes_cmac_init(struct ahash_request *req)
+{
+       struct cmh_aes_cmac_reqctx *rctx = ahash_request_ctx(req);
+
+       memset(rctx, 0, sizeof(*rctx));
+       INIT_LIST_HEAD(&rctx->chunks);
+       return 0;
+}
+
+static int cmh_aes_cmac_update(struct ahash_request *req)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
+       struct cmh_aes_cmac_tfm_ctx *tctx = crypto_ahash_ctx(tfm);
+       struct cmh_aes_cmac_reqctx *rctx = ahash_request_ctx(req);
+       struct cmh_aes_cmac_chunk *chunk;
+       gfp_t gfp;
+       int ret;
+
+       if (!req->nbytes)
+               return 0;
+
+       if (req->nbytes > AES_CMAC_MAX_DATA - rctx->total_len) {
+               ret = -EINVAL;
+               goto err_free_chunks;
+       }
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       chunk = kmalloc(sizeof(*chunk) + req->nbytes, gfp);
+       if (!chunk) {
+               ret = -ENOMEM;
+               goto err_free_chunks;
+       }
+
+       chunk->len = req->nbytes;
+       if (req->base.flags & CRYPTO_AHASH_REQ_VIRT)
+               memcpy(chunk->data, req->svirt, req->nbytes);
+       else
+               scatterwalk_map_and_copy(chunk->data, req->src,
+                                        0, req->nbytes, 0);
+
+       list_add_tail(&chunk->list, &rctx->chunks);
+       spin_lock_bh(&tctx->chunk_lock);
+       list_add_tail(&chunk->tfm_node, &tctx->all_chunks);
+       spin_unlock_bh(&tctx->chunk_lock);
+       rctx->total_len += req->nbytes;
+       return 0;
+
+err_free_chunks:
+       /*
+        * Terminal error -- free all previously accumulated chunks.
+        * Callers may not call .final() on error, so chunks would leak.
+        */
+       cmh_aes_cmac_free_chunks(rctx, tctx);
+       return ret;
+}
+
+static void cmh_aes_cmac_complete(void *data, int error)
+{
+       struct ahash_request *req = data;
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
+       struct cmh_aes_cmac_tfm_ctx *tctx = crypto_ahash_ctx(tfm);
+       struct cmh_aes_cmac_reqctx *rctx = ahash_request_ctx(req);
+
+       if (error == -EINPROGRESS) {
+               cmh_complete(&req->base, error);
+               return;
+       }
+
+       /* Unmap DMA */
+       if (rctx->total_len > 0)
+               cmh_dma_unmap_single(rctx->in_dma, rctx->total_len,
+                                    DMA_TO_DEVICE);
+       cmh_dma_unmap_single(rctx->tag_dma, AES_CMAC_DIGEST_SIZE,
+                            DMA_FROM_DEVICE);
+
+       if (!error)
+               memcpy(req->result, rctx->tag_buf, AES_CMAC_DIGEST_SIZE);
+
+       kfree(rctx->tag_buf);
+       rctx->tag_buf = NULL;
+       kfree_sensitive(rctx->buf);
+       rctx->buf = NULL;
+       cmh_aes_cmac_free_chunks(rctx, tctx);
+       cmh_complete(&req->base, error);
+}
+
+static int cmh_aes_cmac_final(struct ahash_request *req)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
+       struct cmh_aes_cmac_tfm_ctx *tctx = crypto_ahash_ctx(tfm);
+       struct cmh_aes_cmac_reqctx *rctx = ahash_request_ctx(req);
+       struct vcq_cmd cmds[CMH_AES_CMAC_MAX_PAYLOAD];
+       u64 key_ref;
+       u32 keylen;
+       struct core_dispatch d;
+       s32 target_mbx;
+       u32 core_id;
+       u32 idx;
+       int ret;
+       gfp_t gfp;
+
+       if (tctx->key.mode == CMH_KEY_NONE) {
+               ret = -ENOKEY;
+               goto out_free_buf;
+       }
+
+       gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
+             GFP_KERNEL : GFP_ATOMIC;
+
+       /* Linearise accumulated chunks into a contiguous buffer for DMA */
+       if (rctx->total_len > 0) {
+               struct cmh_aes_cmac_chunk *c;
+               u32 off = 0;
+
+               rctx->buf = kmalloc(rctx->total_len, gfp);
+               if (!rctx->buf) {
+                       ret = -ENOMEM;
+                       goto out_free_chunks;
+               }
+               list_for_each_entry(c, &rctx->chunks, list) {
+                       memcpy(rctx->buf + off, c->data, c->len);
+                       off += c->len;
+               }
+       }
+
+       /* Tag output buffer */
+       rctx->tag_buf = kzalloc(AES_CMAC_DIGEST_SIZE, gfp);
+       if (!rctx->tag_buf) {
+               ret = -ENOMEM;
+               goto out_free_buf;
+       }
+
+       rctx->tag_dma = cmh_dma_map_single(rctx->tag_buf,
+                                          AES_CMAC_DIGEST_SIZE,
+                                           DMA_FROM_DEVICE);
+       if (cmh_dma_map_error(rctx->tag_dma)) {
+               ret = -ENOMEM;
+               goto out_free_tag;
+       }
+
+       /* Map input data (may be zero-length for empty CMAC) */
+       if (rctx->total_len > 0) {
+               rctx->in_dma = cmh_dma_map_single(rctx->buf, rctx->total_len,
+                                                 DMA_TO_DEVICE);
+               if (cmh_dma_map_error(rctx->in_dma)) {
+                       ret = -ENOMEM;
+                       goto out_unmap_tag;
+               }
+       }
+
+       /* Resolve key */
+       idx = 0;
+
+       rctx->key_dma = tctx->key.raw.dma;
+       rctx->keylen = tctx->key.raw.len;
+       vcq_add_sys_write(&cmds[idx++], SYS_REF_TEMP,
+                         (u64)rctx->key_dma, SYS_REF_NONE,
+                         tctx->key.raw.len,
+                         tctx->key.raw.sys_type);
+       key_ref = SYS_REF_TEMP;
+       keylen = tctx->key.raw.len;
+       d = cmh_core_select_instance(CMH_CORE_AES);
+       target_mbx = d.mbx_idx;
+       core_id = d.core_id;
+
+       /*
+        * INIT: mode=CMAC, op=ENCRYPT (CMAC always "encrypts")
+        * CMAC data goes through the AAD path:
+        *   aadlen = total data length, iolen = 0
+        */
+       {
+               struct vcq_cmd *slot = &cmds[idx++];
+
+               memset(slot, 0, sizeof(*slot));
+               slot->magic = VCQ_CMD_MAGIC;
+               slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_INIT);
+               slot->hwc.aes.cmd_init.key = key_ref;
+               slot->hwc.aes.cmd_init.iv = 0;
+               slot->hwc.aes.cmd_init.keylen = keylen;
+               slot->hwc.aes.cmd_init.ivlen = 0;
+               slot->hwc.aes.cmd_init.mode = AES_MODE_CMAC;
+               slot->hwc.aes.cmd_init.op = AES_OP_ENCRYPT;
+               slot->hwc.aes.cmd_init.aadlen = rctx->total_len;
+               slot->hwc.aes.cmd_init.iolen = 0;
+               slot->hwc.aes.cmd_init.taglen = AES_CMAC_DIGEST_SIZE;
+       }
+
+       /* AAD_FINAL_AUTH: final AAD + tag extraction in one atomic step */
+       {
+               struct vcq_cmd *slot = &cmds[idx++];
+
+               memset(slot, 0, sizeof(*slot));
+               slot->magic = VCQ_CMD_MAGIC;
+               slot->id = VCQ_CMD_ID(core_id, 0, 1, AES_CMD_AAD_FINAL_AUTH);
+               slot->hwc.aes.cmd_aad_final_auth.data =
+                       rctx->total_len > 0 ? (u64)rctx->in_dma : 0;
+               slot->hwc.aes.cmd_aad_final_auth.datalen = rctx->total_len;
+               slot->hwc.aes.cmd_aad_final_auth.tag = (u64)rctx->tag_dma;
+               slot->hwc.aes.cmd_aad_final_auth.taglen = AES_CMAC_DIGEST_SIZE;
+       }
+
+       vcq_add_flush(&cmds[idx++], core_id);
+
+       ret = cmh_vcq_pack_and_submit_async(cmds, idx, rctx->packed,
+                                           CMH_AES_CMAC_MAX_PACKED,
+                                           target_mbx,
+                                           cmh_aes_cmac_complete, req,
+                                           !!(req->base.flags &
+                                              CRYPTO_TFM_REQ_MAY_BACKLOG),
+                                           cmh_tm_async_timeout_jiffies());
+       /* -EBUSY = backlogged; ownership transferred to callback. */
+       if (ret == -EBUSY)
+               return -EBUSY;
+       if (ret)
+               goto out_cleanup_all;
+
+       return -EINPROGRESS;
+
+out_cleanup_all:
+       if (rctx->total_len > 0 && !cmh_dma_map_error(rctx->in_dma))
+               cmh_dma_unmap_single(rctx->in_dma, rctx->total_len,
+                                    DMA_TO_DEVICE);
+out_unmap_tag:
+       cmh_dma_unmap_single(rctx->tag_dma, AES_CMAC_DIGEST_SIZE,
+                            DMA_FROM_DEVICE);
+out_free_tag:
+       kfree(rctx->tag_buf);
+out_free_buf:
+out_free_chunks:
+       cmh_aes_cmac_free_chunks(rctx, tctx);
+       kfree_sensitive(rctx->buf);
+       rctx->buf = NULL;
+       rctx->total_len = 0;
+       return ret;
+}
+
+/*
+ * ahash .export()/.import(): serialize/deserialize the software
+ * accumulation buffer.  No HW state is involved -- the AES core
+ * does not support save/restore, so only the input queue is exported.
+ */
+
+static int cmh_aes_cmac_export(struct ahash_request *req, void *out)
+{
+       struct cmh_aes_cmac_reqctx *rctx = ahash_request_ctx(req);
+       struct cmh_aes_cmac_export_state *state = out;
+       struct cmh_aes_cmac_chunk *chunk;
+       u32 offset = 0;
+
+       if (rctx->total_len > CMH_AES_CMAC_EXPORT_MAX)
+               return -ENOSPC;
+
+       state->total_len = rctx->total_len;
+       list_for_each_entry(chunk, &rctx->chunks, list) {
+               memcpy(state->data + offset, chunk->data, chunk->len);
+               offset += chunk->len;
+       }
+       return 0;
+}
+
+static int cmh_aes_cmac_import(struct ahash_request *req, const void *in)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
+       struct cmh_aes_cmac_tfm_ctx *tctx = crypto_ahash_ctx(tfm);
+       struct cmh_aes_cmac_reqctx *rctx = ahash_request_ctx(req);
+       const struct cmh_aes_cmac_export_state *state = in;
+       struct cmh_aes_cmac_chunk *chunk;
+
+       /*
+        * Do NOT call free_chunks() here: the crypto API does not
+        * guarantee the request context is in a valid state before
+        * import(), so the list pointers may be stale or invalid.
+        * Re-initialize from scratch instead.  Any pre-existing chunks
+        * are tracked on tctx->all_chunks and freed in exit_tfm.
+        */
+       memset(rctx, 0, sizeof(*rctx));
+       INIT_LIST_HEAD(&rctx->chunks);
+
+       if (state->total_len > CMH_AES_CMAC_EXPORT_MAX)
+               return -EINVAL;
+
+       if (state->total_len) {
+               chunk = kmalloc(sizeof(*chunk) + state->total_len, GFP_KERNEL);
+               if (!chunk)
+                       return -ENOMEM;
+               chunk->len = state->total_len;
+               memcpy(chunk->data, state->data, state->total_len);
+               list_add_tail(&chunk->list, &rctx->chunks);
+               spin_lock_bh(&tctx->chunk_lock);
+               list_add_tail(&chunk->tfm_node, &tctx->all_chunks);
+               spin_unlock_bh(&tctx->chunk_lock);
+               rctx->total_len = state->total_len;
+       }
+       return 0;
+}
+
+static int cmh_aes_cmac_finup(struct ahash_request *req)
+{
+       int err;
+
+       err = cmh_aes_cmac_update(req);
+       if (err)
+               return err;
+       return cmh_aes_cmac_final(req);
+}
+
+static int cmh_aes_cmac_digest(struct ahash_request *req)
+{
+       int err;
+
+       err = cmh_aes_cmac_init(req);
+       if (err)
+               return err;
+       return cmh_aes_cmac_finup(req);
+}
+
+static int cmh_aes_cmac_init_tfm(struct crypto_ahash *tfm)
+{
+       struct cmh_aes_cmac_tfm_ctx *tctx = crypto_ahash_ctx(tfm);
+
+       memset(tctx, 0, sizeof(*tctx));
+       spin_lock_init(&tctx->chunk_lock);
+       INIT_LIST_HEAD(&tctx->all_chunks);
+       crypto_ahash_set_reqsize(tfm, sizeof(struct cmh_aes_cmac_reqctx));
+       return 0;
+}
+
+static void cmh_aes_cmac_exit_tfm(struct crypto_ahash *tfm)
+{
+       struct cmh_aes_cmac_tfm_ctx *tctx = crypto_ahash_ctx(tfm);
+       struct cmh_aes_cmac_chunk *c, *tmp;
+
+       /* Free any orphaned chunks (e.g. testmgr export/reimport poison) */
+       spin_lock_bh(&tctx->chunk_lock);
+       list_for_each_entry_safe(c, tmp, &tctx->all_chunks, tfm_node) {
+               list_del(&c->tfm_node);
+               kfree_sensitive(c);
+       }
+       spin_unlock_bh(&tctx->chunk_lock);
+
+       cmh_key_destroy(&tctx->key);
+}
+
+static struct ahash_alg cmh_aes_cmac_alg = {
+       .init           = cmh_aes_cmac_init,
+       .update         = cmh_aes_cmac_update,
+       .final          = cmh_aes_cmac_final,
+       .finup          = cmh_aes_cmac_finup,
+       .digest         = cmh_aes_cmac_digest,
+       .export         = cmh_aes_cmac_export,
+       .import         = cmh_aes_cmac_import,
+       .setkey         = cmh_aes_cmac_setkey,
+       .init_tfm       = cmh_aes_cmac_init_tfm,
+       .exit_tfm       = cmh_aes_cmac_exit_tfm,
+       .halg           = {
+               .digestsize     = AES_CMAC_DIGEST_SIZE,
+               .statesize      = CMH_AES_CMAC_STATE_SIZE,
+               .base           = {
+                       .cra_name        = "cmac(aes)",
+                       .cra_driver_name = "cri-cmh-cmac-aes",
+                       .cra_priority    = 300,
+                       .cra_flags       = CRYPTO_ALG_KERN_DRIVER_ONLY |
+                                          CRYPTO_ALG_NO_FALLBACK |
+                                          CRYPTO_ALG_ASYNC |
+                                          CRYPTO_ALG_REQ_VIRT,
+                       .cra_blocksize   = AES_CMAC_BLOCK_SIZE,
+                       .cra_ctxsize     = sizeof(struct cmh_aes_cmac_tfm_ctx),
+                       .cra_module      = THIS_MODULE,
+               },
+       },
+};
+
+/**
+ * cmh_aes_cmac_register() - Register AES-CMAC hash algorithm with the crypto framework
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int cmh_aes_cmac_register(void)
+{
+       int ret;
+
+       ret = crypto_register_ahash(&cmh_aes_cmac_alg);
+       if (ret)
+               dev_err(cmh_dev(), "cmh_aes_cmac: failed to register cmac(aes) (rc=%d)\n",
+                       ret);
+       else
+               dev_dbg(cmh_dev(), "cmh_aes_cmac: registered cmac(aes)\n");
+
+       return ret;
+}
+
+/**
+ * cmh_aes_cmac_unregister() - Unregister AES-CMAC hash algorithm from the crypto framework
+ */
+void cmh_aes_cmac_unregister(void)
+{
+       crypto_unregister_ahash(&cmh_aes_cmac_alg);
+       dev_dbg(cmh_dev(), "cmh_aes_cmac: unregistered cmac(aes)\n");
+}
diff --git a/drivers/crypto/cmh/cmh_main.c b/drivers/crypto/cmh/cmh_main.c
index 56541e0d4219..1edd8d14c666 100644
--- a/drivers/crypto/cmh/cmh_main.c
+++ b/drivers/crypto/cmh/cmh_main.c
@@ -34,6 +34,7 @@
 #include "cmh_cshake.h"
 #include "cmh_kmac.h"
 #include "cmh_sm3.h"
+#include "cmh_aes.h"
 #include "cmh_mgmt.h"
 #include "cmh_registers.h"
 #include "cmh_debugfs.h"
@@ -221,6 +222,21 @@ static int cmh_probe(struct platform_device *pdev)
        if (ret)
                goto err_sm3_register;

+       /* Register AES skcipher algorithms */
+       ret = cmh_aes_register();
+       if (ret)
+               goto err_aes_register;
+
+       /* Register AES AEAD algorithms (GCM, CCM) */
+       ret = cmh_aes_aead_register();
+       if (ret)
+               goto err_aes_aead_register;
+
+       /* Register AES CMAC algorithm */
+       ret = cmh_aes_cmac_register();
+       if (ret)
+               goto err_aes_cmac_register;
+
        /* Register key management device (/dev/cmh_mgmt) */
        ret = cmh_mgmt_register();
        if (ret)
@@ -233,6 +249,12 @@ static int cmh_probe(struct platform_device *pdev)
        return 0;

 err_mgmt_register:
+       cmh_aes_cmac_unregister();
+err_aes_cmac_register:
+       cmh_aes_aead_unregister();
+err_aes_aead_register:
+       cmh_aes_unregister();
+err_aes_register:
        cmh_sm3_unregister();
 err_sm3_register:
        cmh_kmac_unregister();
@@ -269,6 +291,9 @@ static void cmh_remove(struct platform_device *pdev)
        cfg = &dev->config;

        cmh_mgmt_unregister();
+       cmh_aes_cmac_unregister();
+       cmh_aes_aead_unregister();
+       cmh_aes_unregister();
        cmh_sm3_unregister();
        cmh_kmac_unregister();
        cmh_cshake_unregister();
diff --git a/drivers/crypto/cmh/include/cmh_aes.h b/drivers/crypto/cmh/include/cmh_aes.h
new file mode 100644
index 000000000000..591afaa36f85
--- /dev/null
+++ b/drivers/crypto/cmh/include/cmh_aes.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2026 Cryptography Research, Inc. (CRI).
+ * CMH LKM -- AES Crypto API Drivers
+ *
+ * Registers AES algorithms with the Linux crypto subsystem:
+ *   skcipher: ecb/cbc/ctr/cfb/xts(aes)
+ *   aead:     gcm/ccm(aes)
+ *   ahash:    cmac(aes)
+ */
+
+#ifndef CMH_AES_H
+#define CMH_AES_H
+
+int  cmh_aes_register(void);
+void cmh_aes_unregister(void);
+
+int  cmh_aes_aead_register(void);
+void cmh_aes_aead_unregister(void);
+
+int  cmh_aes_cmac_register(void);
+void cmh_aes_cmac_unregister(void);
+
+#endif /* CMH_AES_H */
--
2.43.7


^ permalink raw reply related

* [PATCH 01/19] dt-bindings: crypto: add Rambus CryptoManager Hub
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC (permalink / raw)
  To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-1-skrishnamoorthy@rambus.com>

From: Alex Ousherovitch <aousherovitch@rambus.com>

Add device tree binding schema for the CRI CryptoManager Hub (CMH)
hardware crypto accelerator.  The binding covers the parent SoC-level
node with register region, interrupt, DMA properties, and per-core
child nodes identified by compatible string and unit address.

Register the 'cri' vendor prefix for Cryptography Research, Inc.

Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
Reviewed-by: Thi Nguyen <thin@rambus.com>
---
 .../devicetree/bindings/crypto/cri,cmh.yaml   | 222 ++++++++++++++++++
 .../devicetree/bindings/vendor-prefixes.yaml  |   2 +
 2 files changed, 224 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/crypto/cri,cmh.yaml

diff --git a/Documentation/devicetree/bindings/crypto/cri,cmh.yaml b/Documentation/devicetree/bindings/crypto/cri,cmh.yaml
new file mode 100644
index 000000000000..db41132e0591
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/cri,cmh.yaml
@@ -0,0 +1,222 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/cri,cmh.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: CRI CryptoManager Hub (CMH) Hardware Crypto Accelerator
+
+maintainers:
+  - Alex Ousherovitch <aousherovitch@rambus.com>
+  - Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
+  - Joel Wittenauer <Joel.Wittenauer@cryptography.com>
+
+description: |
+  The CRI CryptoManager Hub (CMH) is a hardware cryptographic accelerator accessed
+  via a mailbox-based VCQ (Virtual Command Queue) interface.  The host
+  writes VCQ command sequences into per-mailbox DMA queue buffers and
+  rings a doorbell; the CMH eSW processes them and signals completion
+  via interrupt.
+
+  Supported algorithm families: SHA-2, SHA-3, SM3, AES, SM4,
+  ChaCha20-Poly1305, RSA, ECDSA, EdDSA, ECDH, SM2, ML-KEM, ML-DSA,
+  SLH-DSA, LMS, XMSS, DRBG.
+
+properties:
+  compatible:
+    const: cri,cmh
+
+  reg:
+    maxItems: 1
+    description:
+      SIC (System Interface Controller) MMIO region.  Mailbox instance
+      registers are at offsets N * 0x1000 within this region.
+
+  interrupts:
+    minItems: 1
+    maxItems: 64
+    description:
+      Per-mailbox completion/error interrupts from the CryptoManager Hub,
+      matching the real CMH ch_sys_interrupt_mbx[N-1:0] topology.
+      Entry i corresponds to MBX instance i.  The driver maps each
+      configured mailbox (cri,mbx-instances) to its DT interrupt
+      index and registers a separate threaded IRQ handler per MBX.
+
+  interrupt-names:
+    minItems: 1
+    maxItems: 64
+    items:
+      pattern: '^mbx[0-9]+$'
+    description:
+      Names for each mailbox interrupt, matching the interrupts array.
+      Format is "mbxN" where N is the mailbox instance index.
+
+  cri,mbx-instances:
+    $ref: /schemas/types.yaml#/definitions/uint32-array
+    minItems: 1
+    maxItems: 64
+    description:
+      Array of 0-based mailbox instance indices to configure.
+      Each index N maps to register offset N * 0x1000 within the
+      SIC region.  If absent, defaults to instances 0 and 1.
+
+  cri,mbx-slots-log2:
+    $ref: /schemas/types.yaml#/definitions/uint32-array
+    minItems: 1
+    maxItems: 64
+    description:
+      Per-mailbox slot count as log2.  Valid range 1..15.
+      Array length must match cri,mbx-instances.
+      Default is 5 (32 slots).
+
+  cri,mbx-strides-log2:
+    $ref: /schemas/types.yaml#/definitions/uint32-array
+    minItems: 1
+    maxItems: 64
+    description:
+      Per-mailbox stride (bytes per slot) as log2.  Valid range 7..10.
+      Array length must match cri,mbx-instances.
+      Default is 7 (128 bytes per slot).
+
+  "#address-cells":
+    const: 1
+
+  "#size-cells":
+    const: 0
+
+patternProperties:
+  "^(hc|aes|sm4|sm3|hcq|qse|pke|drbg|ccp)@[0-9a-f]+$":
+    type: object
+    description:
+      Per-core-type child nodes.  Each child represents one crypto core
+      instance available in the hardware.  The driver enumerates these at
+      probe to discover which algorithm families are present.
+
+    properties:
+      reg:
+        maxItems: 1
+        description:
+          Hardware core ID for this core type (e.g. 0x02 for HC, 0x03 for AES).
+          Must match the CORE_ID_* values defined by the CMH hardware.
+
+      cri,mbx:
+        $ref: /schemas/types.yaml#/definitions/uint32
+        description:
+          Pin this core instance to a specific mailbox instance index.
+          Multiple child nodes of the same core type may each specify a
+          different cri,mbx value to spread instances across mailboxes.
+          When absent, the driver auto-assigns a mailbox via round-robin
+          across the instances listed in cri,mbx-instances.
+
+    required:
+      - reg
+
+    additionalProperties: false
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - "#address-cells"
+  - "#size-cells"
+
+additionalProperties: false
+
+examples:
+  - |
+    soc {
+        #address-cells = <2>;
+        #size-cells = <2>;
+
+        crypto@a4800000 {
+            compatible = "cri,cmh";
+            reg = <0x0 0xa4800000 0x0 0x41000>;
+            interrupts = <1 2>;
+            interrupt-names = "mbx0", "mbx1";
+            cri,mbx-instances = <0 1>;
+            cri,mbx-slots-log2 = <5 5>;
+            cri,mbx-strides-log2 = <7 7>;
+            #address-cells = <1>;
+            #size-cells = <0>;
+
+            hc@2 {
+                reg = <0x02>;
+            };
+
+            aes@3 {
+                reg = <0x03>;
+            };
+
+            sm4@4 {
+                reg = <0x04>;
+            };
+
+            sm3@5 {
+                reg = <0x05>;
+            };
+
+            hcq@8 {
+                reg = <0x08>;
+            };
+
+            qse@9 {
+                reg = <0x09>;
+            };
+
+            pke@a {
+                reg = <0x0a>;
+                cri,mbx = <1>;
+            };
+
+            drbg@f {
+                reg = <0x0f>;
+            };
+
+            ccp@18 {
+                reg = <0x18>;
+            };
+        };
+    };
+
+  - |
+    /* Multi-instance: two AES cores on separate MBXes (future eSW support) */
+    soc {
+        #address-cells = <2>;
+        #size-cells = <2>;
+
+        crypto@a4800000 {
+            compatible = "cri,cmh";
+            reg = <0x0 0xa4800000 0x0 0x41000>;
+            interrupts = <1 2>;
+            interrupt-names = "mbx0", "mbx1";
+            cri,mbx-instances = <0 1>;
+            cri,mbx-slots-log2 = <5 5>;
+            cri,mbx-strides-log2 = <7 7>;
+            #address-cells = <1>;
+            #size-cells = <0>;
+
+            hc@2 {
+                reg = <0x02>;
+            };
+
+            aes@3 {
+                reg = <0x03>;
+                cri,mbx = <0>;
+            };
+
+            /* Second AES instance at core ID 0x06, pinned to MBX 1 */
+            aes@6 {
+                reg = <0x06>;
+                cri,mbx = <1>;
+            };
+
+            pke@a {
+                reg = <0x0a>;
+                cri,mbx = <1>;
+            };
+
+            drbg@f {
+                reg = <0x0f>;
+            };
+        };
+    };
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml b/Documentation/devicetree/bindings/vendor-prefixes.yaml
index 28784d66ae7b..3402adba3e49 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.yaml
+++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml
@@ -375,6 +375,8 @@ patternProperties:
     description: Crane Connectivity Solutions
   "^creative,.*":
     description: Creative Technology Ltd
+  "^cri,.*":
+    description: Cryptography Research, Inc.
   "^crystalfontz,.*":
     description: Crystalfontz America, Inc.
   "^csky,.*":
--
2.43.7


^ permalink raw reply related

* [PATCH 00/19] crypto: cmh - add CRI CryptoManager Hub driver
From: Saravanakrishnan Krishnamoorthy @ 2026-06-25 17:33 UTC (permalink / raw)
  To: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Saravanakrishnan Krishnamoorthy,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen

From: Alex Ousherovitch <aousherovitch@rambus.com>

crypto: cmh - add CRI CryptoManager Hub hardware crypto accelerator

This series adds a driver for the CRI CryptoManager Hub (CMH), a
hardware cryptographic accelerator IP from Cryptography Research at
Rambus Inc. (https://www.rambus.com/cryptographyresearch/).
CMH provides a broad set of symmetric, asymmetric, and post-quantum
cryptographic algorithms accelerated in hardware, accessed via a
mailbox-based Virtual Command Queue (VCQ) interface.

The hardware is a platform device matched via device tree
(compatible = "cri,cmh").  It exposes a single MMIO register region
(SIC) with per-mailbox doorbell, status, and command registers.
Each mailbox has DMA-coherent queue memory for VCQ command
submission and completion.
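
As a rough illustration of the mailbox geometry (not the driver's
actual code), the sketch below derives a mailbox's register offset and
queue size from the device tree properties.  The 0x1000 register
stride and the log2 ranges come from the cri,cmh.yaml binding in patch
01; the function and macro names are hypothetical, and treating the
queue size as slots * stride is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-mailbox register stride, as stated in the cri,cmh.yaml binding. */
#define CMH_MBX_REG_STRIDE 0x1000u

/* Hypothetical helper: register offset of mailbox instance n in the SIC region. */
static inline uint32_t cmh_mbx_reg_offset(uint32_t n)
{
	/* Assumption: indices stay below 64, the binding's maxItems. */
	assert(n < 64);
	return n * CMH_MBX_REG_STRIDE;
}

/*
 * Hypothetical helper: bytes of queue memory for one mailbox.
 * Assumes the queue is simply (1 << slots_log2) slots of
 * (1 << stride_log2) bytes each.
 */
static inline size_t cmh_mbx_queue_bytes(uint32_t slots_log2, uint32_t stride_log2)
{
	assert(slots_log2 >= 1 && slots_log2 <= 15);	/* cri,mbx-slots-log2 range */
	assert(stride_log2 >= 7 && stride_log2 <= 10);	/* cri,mbx-strides-log2 range */
	return (size_t)1 << (slots_log2 + stride_log2);
}
```

With the binding defaults (slots log2 = 5, stride log2 = 7) this gives
32 slots of 128 bytes, i.e. 4096 bytes per mailbox under the stated
assumption, and mailbox 1 sits at register offset 0x1000.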

Driver architecture:

  In-kernel users                       /dev/cmh_mgmt (ioctl)
  (dm-crypt, IPsec, kTLS, fscrypt)      (key management)
       |                                        |
       v                                        v
  +----------------------------------------------------+
  |        Kernel Crypto API + hwrng (72 total)        |
  |   ahash | skcipher | aead | akcipher | sig | kpp   |
  +----------------------------------------------------+
       |                                           |
       v                                           v
  +------------------+    +------------------------+
  | Transaction Mgr  |--->| Key / Mgmt subsystem   |
  | (kthread, CMQ)   |    | (datastore, ioctl ops) |
  +------------------+    +------------------------+
       |
       v
  +------------------+     +-------------------+
  | MQI (VCQ pack,   |---->| Response Handler  |
  |  DMA map, submit)|     | (threaded IRQ,    |
  +------------------+     |  watchdog, unmap) |
       |                   +-------------------+
       v                          ^
  +-----------+              +-----------+
  | Hardware  |--- IRQ ----->| Hardware  |
  | (mailbox) |              | (mailbox) |
  +-----------+              +-----------+

The transaction manager runs as a dedicated kthread that pulls
requests from a central command queue, packs VCQ entries, maps DMA
buffers, and submits to the least-loaded mailbox.  Completion is
handled by per-mailbox threaded IRQs.  The driver returns
-EINPROGRESS for async crypto requests and supports the
CRYPTO_TFM_REQ_MAY_BACKLOG flag for queue-full backpressure.
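
The "least-loaded mailbox" choice can be pictured with the following
user-space sketch.  It is only an illustration: struct mbx_load and
pick_least_loaded() are hypothetical names, and using the in-flight
count as the load metric is an assumption, since the metric the driver
actually uses is not described here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mailbox bookkeeping for the sketch. */
struct mbx_load {
	unsigned int in_flight;	/* commands submitted but not yet completed */
	unsigned int capacity;	/* number of slots in the mailbox queue */
};

/*
 * Return the index of the mailbox with the fewest in-flight commands
 * that still has a free slot, or -1 if every mailbox is full.
 */
static int pick_least_loaded(const struct mbx_load *mbx, size_t n)
{
	int best = -1;

	assert(mbx || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (mbx[i].in_flight >= mbx[i].capacity)
			continue;	/* no room in this mailbox */
		if (best < 0 || mbx[i].in_flight < mbx[best].in_flight)
			best = (int)i;
	}
	return best;
}
```

In this sketch a return value of -1 is the queue-full case, which is
where backlog handling for requests that set
CRYPTO_TFM_REQ_MAY_BACKLOG would come into play.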

Registered algorithms (72 total):

  Type       Count  Algorithms
  ---------  -----  --------------------------------------------------
  ahash         15  SHA-{224,256,384,512}, SHA3-{224,256,384,512},
                     SHAKE-{128,256}, cSHAKE-{128,256},
                     KMAC-{128,256}, SM3
  ahash(HMAC)    8  HMAC-SHA-{224,256,384,512},
                     HMAC-SHA3-{224,256,384,512}
  ahash(MAC)     4  CMAC(AES), CMAC(SM4), XCBC(SM4), Poly1305
  skcipher      11  AES-{ECB,CBC,CTR,CFB,XTS},
                     SM4-{ECB,CBC,CTR,CFB,XTS}, ChaCha20
  aead           6  AES-{GCM,CCM}, SM4-{GCM,CCM},
                     rfc7539(chacha20,poly1305),
                     rfc7539esp(chacha20,poly1305)
  akcipher       1  RSA (2048--4096 bit; 512/1024 legacy/test)
  sig           23  ECDSA P-{256,384,521}, SM2 (verify-only),
                     ML-DSA-{44,65,87},
                     SLH-DSA (12 parameter sets),
                     LMS, LMS-HSS, XMSS, XMSS-MT
  kpp            3  ECDH P-{256,384}, X25519
  hwrng          1  DRBG-backed /dev/hwrng

Ioctl-only algorithms (not registered with the crypto API at all):
  - EdDSA (Ed25519, Ed448): sign and verify
  - ML-KEM (ML-KEM-512/768/1024): no standard kernel KEM API exists

The driver also exposes /dev/cmh_mgmt, a misc device providing 44
ioctl commands.  Relative to the in-kernel crypto API these fall into
two groups; the distinction matters because some commands name the
same primitives the driver also registers, and that overlap is
deliberate and bounded:

(1) Operations with no crypto API representation - the large
    majority.  The crypto API has no transform type or verb for
    these, so a character device is the only available UAPI:
      - hardware key lifecycle: create, import, export, derive,
        destroy, enumerate (keystore CRUD) - no keystore verb
      - KIC key derivation (HKDF, AES-CMAC-KDF, DKEK)
      - asymmetric key generation (RSA, EC, EdDSA, ML-DSA, SLH-DSA)
        and public-key derivation - the crypto API has no keygen verb
      - ML-KEM encapsulate/decapsulate - no kernel KEM API exists
      - SM2 encrypt/decrypt and key exchange (multi-step GM/T 0003)
      - EdDSA sign/verify - not registered with the crypto API
      - EAC Chip Authentication and DRBG (re)configuration

(2) Hardware-held-key operations on algorithms that ARE also
    registered (RSA decrypt, ECDSA/ML-DSA/SLH-DSA sign, ECDH).  These
    name the same primitives as the registered akcipher/sig/kpp
    transforms, but the crypto API's set_priv_key()/set_secret()
    accept only raw key bytes supplied by the caller; they cannot
    reference a private key that is generated inside, and never
    leaves, the hardware datastore - the central security property of
    this device.  The ioctl path keeps the private key
    hardware-resident, while the registered transforms serve raw-key
    in-kernel users.  The two paths are complementary, not redundant.

Access to /dev/cmh_mgmt requires CAP_SYS_ADMIN.

/dev/cmh_mgmt is built conditionally on CONFIG_CRYPTO_DEV_CMH_MGMT
(default n); when disabled the ioctl interface is absent while all
kernel crypto API algorithms remain registered.

The ML-DSA sig algorithms are registered at priority 5001.  The
kernel's crypto/mldsa.c registers at priority 5000 with verify-only
(sign returns -EOPNOTSUPP).  Our driver provides full HW-accelerated
sign + verify, so the higher priority ensures the hardware
implementation is preferred when the driver is loaded.
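
The effect of the priority values can be sketched in user space as
below.  This is a simplified model, not the crypto API's lookup code:
it ignores type/mask filtering and lookups by driver name, and the
struct, the function, and the "mldsa65", "generic-sw" and "cmh-hw"
strings are hypothetical placeholders.  Only the 5000 and 5001
priorities are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a registered algorithm. */
struct alg_entry {
	const char *name;	/* generic algorithm name */
	const char *driver;	/* implementation (driver) name */
	int priority;		/* higher wins among equal names */
};

/* Return the highest-priority entry whose name matches, or NULL. */
static const struct alg_entry *pick_alg(const struct alg_entry *algs,
					size_t n, const char *name)
{
	const struct alg_entry *best = NULL;

	assert(name);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(algs[i].name, name) != 0)
			continue;
		if (!best || algs[i].priority > best->priority)
			best = &algs[i];
	}
	return best;
}
```

Given one software entry at priority 5000 and one hardware entry at
priority 5001 under the same name, pick_alg() returns the hardware
entry; with the driver unloaded only the software entry remains.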

Power management uses DEFINE_SIMPLE_DEV_PM_OPS.  On suspend the
transaction manager drains in-flight requests (configurable 10s
timeout, returns -ECANCELED on timeout), stops the kthread, and
masks IRQs.  On resume it re-verifies SIC/boot status and restarts
the kthread.

Dependencies:
  - Kernel 7.1+ (based on Herbert Xu's cryptodev-2.6 tree, 7.1.0-rc2)
  - sig_alg backend (upstream since 6.13)
  - CRYPTO_AHASH_REQ_VIRT (native support, no fallback needed)
  - CMH eSW loaded independently by hardware before driver probe

The driver registers all algorithms through the standard in-kernel
crypto API; in-kernel users (dm-crypt, fscrypt, IPsec, etc.) consume
them directly.  Key provisioning and hardware-held-key operations are
exposed to user space via /dev/cmh_mgmt ioctls.

Public hardware documentation:
  Product brief: https://go.rambus.com/ch-7xx-and-cc-7xx-product-brief
  No public datasheets are currently available.  The driver was
  developed against the CRI CryptoManager Hub Hardware Reference
  Manual (Rambus Inc. confidential).  Detailed hardware reference is
  available under NDA from Rambus Inc.; contact the maintainers listed
  in MAINTAINERS for access during review.

Tested on RISC-V and ARM64 QEMU emulation with the CMH hardware
model (QEMU TCG, 512 MiB RAM).  Also exercised on Xilinx VMK180
FPGA board with real CMH IP.

  - testmgr: 41 CMH algorithm registrations matched by upstream
    test vectors, all pass; 30 names report "No test for" (PQC
    families, KMAC, cSHAKE - no upstream vectors yet).
  - kselftest tools/testing/selftests/drivers/crypto/cmh:
    6 pass, 0 fail.

checkpatch.pl --strict: 0 errors, 0 warnings, 0 checks on all
files (the only output is the expected per-file "does MAINTAINERS
need updating?" reminder, satisfied by the MAINTAINERS patch).
sparse (C=2): 0 warnings.
W=1 -Werror: clean.
make dt_binding_check: clean (dtschema validates the
cri,cmh.yaml binding).

Tested with the following debug options enabled simultaneously
(submit-checklist "Test your code" item 1):
  CONFIG_PROVE_LOCKING, CONFIG_PROVE_RCU, CONFIG_DEBUG_LOCK_ALLOC,
  CONFIG_DEBUG_OBJECTS_RCU_HEAD, CONFIG_SLUB_DEBUG,
  CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES, CONFIG_DEBUG_SPINLOCK,
  CONFIG_DEBUG_PREEMPT, CONFIG_DEBUG_ATOMIC_SLEEP.
  Result: no lockdep warnings, no ODEBUG splats, no slab corruption.

Additionally tested (separate passes - mutually exclusive configs):
  - CONFIG_KASAN + CONFIG_UBSAN + CONFIG_DEBUG_KMEMLEAK + CONFIG_KFENCE:
    no sanitizer findings; KMEMLEAK scan reports 0 unreferenced objects.
  - CONFIG_KCSAN (arm64; riscv64 lacks HAVE_ARCH_KCSAN):
    0 data-race reports attributed to the driver.

Stack usage: worst-case under 1 KB on both riscv64 and arm64
(scripts/checkstack.pl).  Hardware command buffers live in
per-request context (heap-allocated by the crypto framework).

Alex Ousherovitch (19):
  dt-bindings: crypto: add Rambus CryptoManager Hub
  crypto: cmh - add core platform driver
  crypto: cmh - add key provisioning and management
  crypto: cmh - add SHA-2/SHA-3/SHAKE ahash
  crypto: cmh - add HMAC ahash
  crypto: cmh - add CSHAKE/KMAC ahash
  crypto: cmh - add SM3 ahash
  crypto: cmh - add AES skcipher/aead/cmac
  crypto: cmh - add SM4 skcipher/aead/cmac/xcbc
  crypto: cmh - add ChaCha20-Poly1305
  crypto: cmh - add DRBG hwrng
  crypto: cmh - add RSA akcipher
  crypto: cmh - add ECDSA/SM2 sig
  crypto: cmh - add ECDH/X25519 kpp
  crypto: cmh - add ML-KEM/ML-DSA (QSE)
  crypto: cmh - add SLH-DSA/LMS/XMSS (HCQ)
  Documentation: ioctl: add CMH ioctl documentation and register 'J'
  selftests: crypto: cmh - add kselftest for management ioctl
  MAINTAINERS: add Rambus CryptoManager Hub (CMH)

base-commit: 6ea0ce3a19f9c37a014099e2b0a46b27fa164564
--
2.43.7

^ permalink raw reply

* Re: [PATCH v5 13/24] virt/steal_monitor: Add documentation
From: Randy Dunlap @ 2026-06-25 17:00 UTC (permalink / raw)
  To: Shrikanth Hegde, linux-kernel, mingo, peterz, juri.lelli,
	vincent.guittot, yury.norov, kprateek.nayak, iii, corbet
  Cc: tglx, gregkh, pbonzini, seanjc, vschneid, huschle, rostedt,
	dietmar.eggemann, maddy, srikar, hdanton, chleroy, vineeth,
	frederic, arighi, pauld, christian.loehle, tj, tommaso.cucinotta,
	maz, rafael, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-14-sshegde@linux.ibm.com>

Hi,

On 6/25/26 5:46 AM, Shrikanth Hegde wrote:
> Document this module named steal_monitor and its parameters.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v4-v5:
> - new patch
> 
> Please let me know if the placing is not right.
> 
>  Documentation/driver-api/index.rst         |  1 +
>  Documentation/driver-api/steal-monitor.rst | 93 ++++++++++++++++++++++
>  2 files changed, 94 insertions(+)
>  create mode 100644 Documentation/driver-api/steal-monitor.rst


> diff --git a/Documentation/driver-api/steal-monitor.rst b/Documentation/driver-api/steal-monitor.rst
> new file mode 100644
> index 000000000000..997a22d0812c
> --- /dev/null
> +++ b/Documentation/driver-api/steal-monitor.rst
> @@ -0,0 +1,93 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +=============
> +Steal Monitor
> +=============
> +
> +:Author: Shrikanth Hegde
> +
> +Introduction:
> +=============

Nit:
Kernel heading adornment style does not include an ending ':' character
(4 places).

> +
> +Steal monitor is a driver aimed at solving the Noisy Neighbour problem
> +in virtualized environments. I.e performance of workload
> +running in one VM gets affected significantly due to other VMs and
> +combined they make slower forward progress.


-- 
~Randy


^ permalink raw reply

* [PATCH v3 11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Add an opt-out so users can keep vfio-pci's CXL extensions out of the
path for individual devices or for an entire vfio-pci instance.  The
build-time gate is CONFIG_VFIO_PCI_CXL; the runtime gates are:

  - Module parameter vfio_pci.disable_cxl (bool, 0444).  Setting
    disable_cxl=1 at modprobe time makes vfio_pci_probe() set
    vdev->disable_cxl on every device it binds.

  - Variant drivers (mlx5, pds, hisi, nvgrace, xe, etc.) may set
    vdev->disable_cxl=true in their own probe for per-device control
    without needing the module parameter.  The bit lives on
    struct vfio_pci_core_device so it's reachable from any variant.

vfio_pci_cxl_acquire() consults vdev->disable_cxl as the very first
check and returns -ENODEV when set, which makes vfio-pci-core treat
the device as a plain (non-CXL) PCI passthrough — no CAP_CXL, no HDM
or COMP_REGS VFIO regions, no DVSEC clipping shim.

This mirrors the long-standing disable_denylist opt-out shape.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 9 +++++++++
 drivers/vfio/pci/vfio_pci.c          | 9 +++++++++
 include/linux/vfio_pci_core.h        | 1 +
 3 files changed, 19 insertions(+)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 8a00b776d7c7..905f74f4e725 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -234,6 +234,15 @@ int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
 	u16 dvsec;
 	int rc;
 
+	/*
+	 * Honour the per-device opt-out (set by vfio-pci's module
+	 * parameter disable_cxl, or by a variant driver before
+	 * registration).  Returning -ENODEV here makes the caller
+	 * treat this device as plain vfio-pci.
+	 */
+	if (vdev->disable_cxl)
+		return -ENODEV;
+
 	if (!pcie_is_cxl(pdev))
 		return -ENODEV;
 
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..fd226cb65d8b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
 module_param(disable_denylist, bool, 0444);
 MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
 
+#if IS_ENABLED(CONFIG_VFIO_PCI_CXL)
+static bool disable_cxl;
+module_param(disable_cxl, bool, 0444);
+MODULE_PARM_DESC(disable_cxl, "Disable CXL Type-2 extensions for all devices bound to vfio-pci. Variant drivers may instead set vdev->disable_cxl in their probe for per-device control without needing this parameter.");
+#endif
+
 static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
 {
 	switch (pdev->vendor) {
@@ -166,6 +172,9 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return PTR_ERR(vdev);
 
 	dev_set_drvdata(&pdev->dev, vdev);
+#if IS_ENABLED(CONFIG_VFIO_PCI_CXL)
+	vdev->disable_cxl = disable_cxl;
+#endif
 	vdev->pci_ops = &vfio_pci_dev_ops;
 	ret = vfio_pci_core_register_device(vdev);
 	if (ret)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 541c1911e090..20e9599b3bd7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -127,6 +127,7 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore:1;
 	bool			pm_intx_masked:1;
 	bool			pm_runtime_engaged:1;
+	bool			disable_cxl:1;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.25.1


^ permalink raw reply related

* [PATCH v3 10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Capture the ownership model, bind sequence, region layout, and the
DVSEC + HDM + CM cap-array virtualization contract for vfio-pci
Type-2 device passthrough in Documentation/driver-api/vfio-pci-cxl.rst.

cxl-core owns the CXL register virtualization through
devm_cxl_passthrough_create() and the cxl_passthrough_*_rw()
helpers; vfio-pci is a transport that forwards guest reads and
writes through them.  The HDM HPA range is mapped by vfio for the
mmappable HDM region.  Topology constraints and host-bridge decoder
limitations are listed under Known limitations.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 Documentation/driver-api/index.rst        |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++++++++++++++++++
 2 files changed, 283 insertions(+)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..52f0c06a376a 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
    vfio-mediated-device
    vfio
    vfio-pci-device-specific-driver-acceptance
+   vfio-pci-cxl
 
 Bus-level documentation
 =======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..1527b7dd85d0
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,282 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===========================================
+VFIO-PCI: CXL Type-2 device passthrough
+===========================================
+
+:Author: Manish Honap <mhonap@nvidia.com>
+
+Overview
+========
+
+vfio-pci-core, when built with ``CONFIG_VFIO_PCI_CXL=y``, passes a
+CXL Type-2 accelerator (CXL r4.0, HDM-D / HDM-DB) through to a KVM
+guest.  The host firmware commits the endpoint's HDM decoder before
+vfio-pci binds; the guest sees a CXL Type-2 device whose CXL.mem
+range is already programmed and locked.  The guest may inspect the
+HDM Decoder Capability block and DVSEC Device capability via
+spec-defined paths, and access the device's CXL.mem range as
+mmap'd memory.
+
+Scope
+=====
+
+The supported scope is intentionally narrow:
+
+* One CXL endpoint per host bridge.
+* The endpoint exposes exactly one HDM decoder (decoder 0).
+* No interleave.
+* Host firmware has committed the endpoint HDM decoder before
+  vfio-pci probes.  Devices whose HDM decoder is *uncommitted* fail
+  vfio-pci bind cleanly.
+* The host bridge is in single-RP-passthrough mode (the CXL host
+  bridge's own HDM decoder is not used; CFMWS-to-RP decode flows
+  implicitly).  This assumption is currently *not enforced* by
+  vfio-pci-core; it is a known limitation, see the Known
+  limitations section.
+
+Multi-decoder, interleave, FLR / reset state-machine integration,
+and host-bridge HDM decoder programming are explicitly out of scope.
+Any of them can be added later without changing the contract
+described below.
+
+Driver model
+============
+
+There is no dedicated ``vfio-cxl`` PCI driver.  vfio-pci is the only
+driver that binds to the host PCI device.  When built with
+``CONFIG_VFIO_PCI_CXL=y``, vfio-pci-core calls into the cxl subsystem
+to do five things at bind time:
+
+1. ``devm_cxl_dev_state_create()`` — allocate per-device CXL state
+   embedded in ``struct vfio_pci_cxl_state``.
+2. ``cxl_pci_setup_regs()`` + ``cxl_get_hdm_info()`` — probe the
+   Register Locator DVSEC and harvest the HDM block's BAR-relative
+   offset and size.
+3. ``cxl_await_range_active()`` — wait for the firmware-committed
+   range to become live.
+4. ``devm_cxl_passthrough_create()`` — snapshot the CXL Device DVSEC
+   body, the HDM Decoder block, and the CXL.cache/mem cap-array
+   prefix into shadows owned by cxl-core.  All subsequent
+   register-virtualization happens inside ``drivers/cxl/core/passthrough.c``.
+5. ``devm_cxl_probe_mem()`` — register a ``cxl_memdev``, enumerate
+   the endpoint port, and auto-attach the firmware-committed
+   region.  cxl_mem binds to the memdev as it would for any other
+   Type-2 accelerator.
+
+Ownership split
+===============
+
+Each device-visible surface is owned by exactly one subsystem:
+
+============================================  ==============================================
+Surface                                       Owner
+============================================  ==============================================
+PCI config (non-DVSEC, non-CXL)               vfio-pci-core ``vconfig`` (existing perm-bits)
+CXL Device DVSEC body                         cxl-core ``cxl_passthrough_dvsec_rw()``
+HDM Decoder Capability block                  cxl-core ``cxl_passthrough_hdm_rw()``
+CM cap-array (read-only snapshot)             cxl-core ``cxl_passthrough_cm_rw()``
+``cxl_memdev`` / endpoint port / autoregion   cxl-core ``devm_cxl_probe_mem()``
+HDM HPA range mapping                         vfio-pci ``request_mem_region`` + ``memremap``
+Sparse mmap layout for the component BAR      vfio-pci
+============================================  ==============================================
+
+The vfio side holds no shadow buffer of its own.  ``vfio_pci_cxl_state``
+caches small scalars (DVSEC offset/size, HDM offset/size, component
+BAR layout) for dispatch decisions; the actual virtualization
+semantics live in cxl-core.
+
+Bind sequence
+=============
+
+``vfio_pci_cxl_acquire()`` is called from
+``vfio_pci_core_register_device()`` at PCI bind time.  The sequence::
+
+  0. devm_cxl_dev_state_create(parent, CXL_DEVTYPE_DEVMEM, dsn,
+                               dvsec_off, vfio_pci_cxl_state, cxlds,
+                               /*mbox=*/false)
+
+  1. pcie_is_cxl() and pci_find_dvsec_capability(CXL_DEVICE)
+     -> -ENODEV if either is absent
+     -> -ENODEV if the DVSEC's MEM_CAPABLE bit is clear
+
+  2. pci_enable_device_mem()
+
+     2a. cxl_pci_setup_regs(CXL_REGLOC_RBI_COMPONENT)
+     2b. cxl_get_hdm_info() — REJECT hdm_count != 1 with -EOPNOTSUPP
+     2c. cxl_regblock_get_bar_info()
+     2d. cxl_await_range_active()
+     2e. devm_cxl_passthrough_create(&pdev->dev, &cxlds)
+
+  3. pci_disable_device()
+     Clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+     do_pci_disable_device() in drivers/pci/pci.c).  Subsequent
+     MMIO from step 4 still succeeds.
+
+  4. devm_cxl_probe_mem(&cxlds, &hpa_range)
+     Registers the memdev, enumerates the endpoint port, attaches
+     the firmware-committed autoregion.
+
+  5. request_mem_region(hpa_base, hpa_size) + memremap_wb()
+
+  6. vdev->cxl = cxl  (state published; HDM and COMP_REGS regions
+     are registered later when the VFIO fd is opened)
+
+Fail-closed semantics
+---------------------
+
+Three conditions are mapped to "not a CXL device; caller falls back to
+plain vfio-pci": ``pcie_is_cxl()`` false, DVSEC absent, ``MEM_CAPABLE``
+clear.  All three return ``-ENODEV`` from
+``vfio_pci_cxl_acquire()``; the caller treats them as a silent
+fall-through.
+
+Any other negative errno from the bind sequence aborts the vfio-pci
+bind entirely.  The guest never sees a half-initialised CXL device.
+Once ``devm_cxl_probe_mem()`` has succeeded the published memdev
+holds a pointer into the embedded ``cxl_dev_state``; a failure in
+``vfio_cxl_map_hdm()`` after that point cannot ``devm_kfree(cxl)``
+and leaves the state allocated for the lifetime of the PCI device
+(devres unwinds it at pdev removal).
+
+VFIO regions exposed
+====================
+
+When the VFIO fd is first opened, ``vfio_pci_cxl_open()`` registers
+two additional regions on top of the standard vfio-pci BARs / config
+region:
+
+HDM region (``VFIO_REGION_SUBTYPE_CXL``)
+  Mappable view of the device's firmware-committed HPA range.
+
+  * ``mmap``: fault handler does
+    ``vmf_insert_pfn(vma, addr, PHYS_PFN(hpa_base + off))``.  The
+    guest gets the same backing physical memory the host sees.
+  * ``pread`` / ``pwrite``: served from the ``memremap_wb()`` kva
+    captured at bind time.
+
+COMP_REGS region (``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+  Shadow of the CXL component register sub-range.  ``pread`` /
+  ``pwrite`` only; ``mmap`` is intentionally not supported (the VMM
+  uses this region instead of mmapping the BAR).  Dword-aligned
+  access only; sub-dword accesses return ``-EINVAL``.
+
+  Dispatch by offset:
+
+  ============================================  =================================
+  Offset range                                  cxl-core helper
+  ============================================  =================================
+  ``< CXL_CM_OFFSET``                           zero-fill (reserved)
+  ``CXL_CM_OFFSET .. hdm_reg_offset``           ``cxl_passthrough_cm_rw()``
+  ``hdm_reg_offset .. +hdm_reg_size``           ``cxl_passthrough_hdm_rw()``
+  ``>= hdm_reg_offset + hdm_reg_size``          zero-fill (reserved)
+  ============================================  =================================
+
+DVSEC virtualization contract
+=============================
+
+The CXL Device DVSEC body is reached through the standard PCI
+config-space path.  ``vfio_pci_config_rw_single()`` clips chunks at
+the DVSEC body boundary via ``vfio_pci_cxl_config_boundary()`` and
+forwards body bytes to ``vfio_pci_cxl_config_rw()``, which in turn
+calls ``cxl_passthrough_dvsec_rw()``.
+
+Per-field write semantics (CXL r4.0 §8.1.3):
+
+============================================  ==============================================
+Field (offset from DVSEC cap base)            Spec attribute / behaviour
+============================================  ==============================================
+CAPABILITY        (0x0a)                      HwInit — writes dropped
+CONTROL           (0x0c)                      RWL — gated on DVSEC CONFIG_LOCK
+STATUS            (0x0e)                      RW1C
+CONTROL2          (0x10)                      RWL — gated on DVSEC CONFIG_LOCK
+STATUS2           (0x12)                      RW1C
+LOCK              (0x14)                      RWO — first 1-write latches CONFIG_LOCK
+Range1 SIZE_HI/LO BASE_HI/LO  (0x18..0x27)    HwInit — writes dropped
+Range2 SIZE_HI/LO BASE_HI/LO  (0x28..0x37)    RsvdZ — writes dropped
+============================================  ==============================================
+
+HDM virtualization contract
+===========================
+
+Per CXL r4.0 §8.2.4.20, on the single firmware-committed decoder:
+
+============================================  ==============================================
+Field (offset from HDM block base)            Spec attribute / behaviour
+============================================  ==============================================
+HDM Decoder Capability Header (0x00)          HwInit — writes dropped
+HDM Decoder Global Control    (0x04)          RW — shadow
+Decoder 0 BASE_LO / BASE_HI                   RWL — gated on COMMITTED or LOCK_ON_COMMIT
+Decoder 0 SIZE_LO / SIZE_HI                   RWL — same gate
+Decoder 0 CTRL                                Implements COMMIT → COMMITTED handshake; once
+                                              COMMITTED, only COMMIT toggles are honoured
+============================================  ==============================================
+
+CM cap-array
+============
+
+The CM cap-array (CXL r4.0 §8.2.4) prefix is snapshotted from the
+device's component register MMIO at bind time and served read-only
+through ``cxl_passthrough_cm_rw()``.  Guest writes to the cap-array
+are silently dropped.
+
+UAPI: CAP_CXL
+=============
+
+``VFIO_DEVICE_GET_INFO`` returns ``VFIO_DEVICE_FLAGS_CXL`` and a
+``VFIO_DEVICE_INFO_CAP_CXL`` capability::
+
+    struct vfio_device_info_cap_cxl {
+        struct vfio_info_cap_header header;
+        __u32 flags;
+        #define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED (1 << 0)
+        __u32 hdm_region_idx;
+        __u32 comp_reg_region_idx;
+        __u32 comp_reg_bar;
+        __u32 __resv;
+        __u64 comp_reg_offset;
+        __u64 comp_reg_size;
+    };
+
+``VFIO_DEVICE_GET_REGION_INFO`` on the component BAR returns a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` that excludes
+``[comp_reg_offset, comp_reg_offset + comp_reg_size)`` from the
+mmappable areas.
+
+Known limitations
+=================
+
+* Host bridge HDM decoder programming is not driven by this driver.
+  The driver silently assumes single-RP-passthrough topology (the
+  CXL host bridge's own HDM decoder is not used).  Two remediations
+  are possible: either refuse to bind when the topology is not
+  single-RP-passthrough, or extend the kernel ABI so a host-bridge
+  HDM decoder programmer can attest the lock before vfio bind.
+  Neither breaks the existing contract; at most one adds a single
+  boolean to CAP_CXL.
+
+* Function-level reset (FLR) does not re-snapshot the shadows.
+  Guests that issue FLR will see stale HDM and DVSEC state after
+  the reset.
+
+* Multi-decoder devices return ``-EOPNOTSUPP`` at bind.
+
+* Hotplug while the device is held by vfio is not supported.
+
+* Raw BAR read/write into the CXL component register sub-range is
+  unsupported.  VMMs must use the COMP_REGS region.
+
+Selftest
+========
+
+``tools/testing/selftests/vfio/vfio_cxl_type2_test`` exercises the
+five surfaces:
+
+* ``device_is_cxl`` — GET_INFO returns FLAGS_CXL + CAP_CXL.
+* ``hdm_region_mmap_rw`` — mmap + read/write pattern.
+* ``component_bar_sparse_mmap`` — SPARSE_MMAP cap excludes the CXL
+  block.
+* ``comp_regs_cm_cap_array_read`` — CM cap-array header is served
+  from the cxl-core snapshot.
+* ``dvsec_lock_byte_read`` -- DVSEC config-rw clipping shim is wired.
-- 
2.25.1



* [PATCH v3 09/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Exercise the user-visible contract added by CONFIG_VFIO_PCI_CXL:

  device_is_cxl                 GET_INFO returns VFIO_DEVICE_FLAGS_CXL
                                and a populated VFIO_DEVICE_INFO_CAP_CXL.

  hdm_region_mmap_rw            mmap() one page of the HDM region,
                                write a pattern, read it back.  Proves
                                the mmap fault handler's vmf_insert_pfn
                                path and the firmware-committed HPA
                                mapping.

  component_bar_sparse_mmap     GET_REGION_INFO on the component BAR
                                advertises a SPARSE_MMAP cap, and every
                                advertised mmappable area lies outside
                                [comp_reg_offset, +comp_reg_size).

  comp_regs_cm_cap_array_read   pread() of the COMP_REGS region at
                                CXL_CM_OFFSET returns a valid CM
                                cap-array header (CAP_ID == 1,
                                ARRAY_SIZE > 0).  Proves the
                                cxl_passthrough_cm_rw() dispatch is
                                wired.

  dvsec_lock_byte_read          pread() of the DVSEC CONFIG_LOCK byte
                                through the config-rw clipping shim
                                succeeds.  Proves the
                                cxl_passthrough_dvsec_rw() path is
                                wired.

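  hdm_decoder_commit_fsm        Clears and re-sets COMMIT on decoder 0
                                CTRL through the COMP_REGS region and
                                checks that COMMITTED (and, on release,
                                LOCK_ON_COMMIT) follow in the cxl-core
                                shadow.

DVSEC LOCK latch behaviour is out of scope for this smoke test.  No
debugfs dependency.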

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
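Note: the CM cap-array walk that the tests perform with pread() can be
shown on a plain in-memory array.  The sketch below is illustrative
only: ex_find_hdm_ptr() and the EX_* constants are hypothetical names,
the field positions follow the shifts used in the test (ID in bits
15:0, array size in bits 31:24, pointer in bits 31:20), and the HDM
capability ID of 0x5 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_CAP_ID_MASK		0x0000ffffu	/* bits 15:0 */
#define EX_ARRAY_SIZE_SHIFT	24		/* header bits 31:24 */
#define EX_CAP_PTR_SHIFT	20		/* entry bits 31:20 */
#define EX_CAP_ID_CM		0x1u		/* cap-array header ID */
#define EX_CAP_ID_HDM		0x5u		/* assumed HDM decoder ID */

/*
 * regs[0] is the cap-array header, regs[1..] are the entries.
 * Returns the HDM block offset relative to the cap-array base,
 * or 0 when the header is invalid or no HDM entry is present.
 */
uint32_t ex_find_hdm_ptr(const uint32_t *regs, size_t ndwords)
{
	uint32_t array_size, i;

	assert(regs != NULL);

	if (ndwords == 0 || (regs[0] & EX_CAP_ID_MASK) != EX_CAP_ID_CM)
		return 0;

	array_size = regs[0] >> EX_ARRAY_SIZE_SHIFT;
	for (i = 1; i <= array_size && i < ndwords; i++) {
		if ((regs[i] & EX_CAP_ID_MASK) == EX_CAP_ID_HDM)
			return regs[i] >> EX_CAP_PTR_SHIFT;
	}
	return 0;
}
```

The test adds CXL_CM_OFFSET to the returned pointer to obtain an offset
relative to the start of the COMP_REGS region.
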
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |  11 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 350 ++++++++++++++++++
 3 files changed, 361 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
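
Note: component_bar_sparse_mmap asserts that every advertised area lies
outside the half-open window [comp_reg_offset, +comp_reg_size).  The
same condition can be written as an overlap predicate.  This is a
sketch; ex_area_overlaps() is a hypothetical helper, not part of the
selftest library.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Nonzero when [a_start, a_start + a_size) intersects
 * [ex_off, ex_off + ex_size).  Both ranges are half-open, so ranges
 * that merely touch do not overlap.
 */
int ex_area_overlaps(uint64_t a_start, uint64_t a_size,
		     uint64_t ex_off, uint64_t ex_size)
{
	uint64_t a_end = a_start + a_size;
	uint64_t ex_end = ex_off + ex_size;

	/* Precondition: neither range wraps around 2^64. */
	assert(a_end >= a_start && ex_end >= ex_off);

	return a_start < ex_end && ex_off < a_end;
}
```

By De Morgan's law, the negation of this predicate is the expression
the test asserts: a_end <= ex_off || a_start >= ex_end.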

diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 0684932d91bf..25f2a9420ef6 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -12,6 +12,7 @@ TEST_GEN_PROGS += vfio_iommufd_setup_test
 TEST_GEN_PROGS += vfio_pci_device_test
 TEST_GEN_PROGS += vfio_pci_device_init_perf_test
 TEST_GEN_PROGS += vfio_pci_driver_test
+TEST_GEN_PROGS += vfio_cxl_type2_test
 
 TEST_FILES += scripts/cleanup.sh
 TEST_FILES += scripts/lib.sh
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
index fc75e04ef010..d2150129d854 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
@@ -281,7 +281,16 @@ static void vfio_pci_device_setup(struct vfio_pci_device *device)
 		struct vfio_pci_bar *bar = device->bars + i;
 
 		vfio_pci_region_get(device, i, &bar->info);
-		if (bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP)
+		/*
+		 * Skip auto-mmap when the BAR advertises region-info caps
+		 * (e.g. VFIO_REGION_INFO_CAP_SPARSE_MMAP).  Such BARs are
+		 * only partially mmappable; the kernel rejects full-BAR
+		 * mmaps and the caller must walk the sparse-area cap and
+		 * mmap each advertised area separately.  Tests that need
+		 * access to such a BAR handle the per-area mmap themselves.
+		 */
+		if ((bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP) &&
+		    !(bar->info.flags & VFIO_REGION_INFO_FLAG_CAPS))
 			vfio_pci_bar_map(device, i);
 	}
 
diff --git a/tools/testing/selftests/vfio/vfio_cxl_type2_test.c b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
new file mode 100644
index 000000000000..bc98a29f90ad
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vfio_cxl_type2_test - smoke + dispatch tests for CXL Type-2 device
+ * passthrough through vfio-pci.
+ *
+ * Exercises the user-visible surface gated by CONFIG_VFIO_PCI_CXL:
+ *  - GET_INFO returns VFIO_DEVICE_FLAGS_CXL + a populated CAP_CXL.
+ *  - The HDM-backed VFIO region can be mmap'd and read/written.
+ *  - The component BAR exposes a SPARSE_MMAP cap that excludes the
+ *    CXL component register sub-range.
+ *  - The COMP_REGS region serves CM cap-array dwords from cxl-core's
+ *    snapshot (proves the cxl_passthrough_cm_rw() path is wired).
+ *  - DVSEC body reads through the config-rw clipping shim return the
+ *    cxl-core shadow (proves cxl_passthrough_dvsec_rw() is wired).
+ *
+ * Usage:
+ *   ./vfio_cxl_type2_test <BDF>
+ * or export VFIO_SELFTESTS_BDF=<BDF> before running.  The device must
+ * be bound to vfio-pci and the kernel must have CONFIG_VFIO_PCI_CXL=y.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/pci_regs.h>
+#include <linux/sizes.h>
+#include <linux/vfio.h>
+
+#include <cxl/cxl_regs.h>
+
+#include <libvfio.h>
+
+#include "kselftest_harness.h"
+
+#define PCI_DVSEC_VENDOR_ID_CXL		0x1e98
+#define PCI_DVSEC_ID_CXL_DEVICE		0x0000
+
+/*
+ * vfio-pci's region offset packing (kernel-internal in
+ * include/linux/vfio_pci_core.h, not exposed via UAPI as of writing).
+ * Provide local definitions so the selftest builds against the bare
+ * UAPI vfio.h.  The guards let a future kernel hoist these to UAPI
+ * without breaking this test.
+ */
+#ifndef VFIO_PCI_OFFSET_SHIFT
+#define VFIO_PCI_OFFSET_SHIFT		40
+#endif
+#ifndef VFIO_PCI_INDEX_TO_OFFSET
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
+#endif
+
+static const char *device_bdf;
+
+/* Find a struct vfio_device_info capability by id in a GET_INFO buffer. */
+static const struct vfio_info_cap_header *
+find_device_cap(const void *buf, size_t bufsz, uint16_t id)
+{
+	const struct vfio_device_info *info = buf;
+	const struct vfio_info_cap_header *cap;
+	size_t off = info->cap_offset;
+
+	while (off && off < bufsz) {
+		cap = (const void *)((const char *)buf + off);
+		if (cap->id == id)
+			return cap;
+		off = cap->next;
+	}
+	return NULL;
+}
+
+/* Walk PCI extended capability list for the CXL Device DVSEC. */
+static uint16_t find_cxl_dvsec(struct vfio_pci_device *dev)
+{
+	uint16_t pos = PCI_CFG_SPACE_SIZE;
+	int iter = 0;
+
+	while (pos && iter++ < 64) {
+		uint32_t hdr = vfio_pci_config_readl(dev, pos);
+		uint16_t cap_id = hdr & 0xffff;
+		uint16_t next   = (hdr >> 20) & 0xffc;
+		uint32_t hdr1, hdr2;
+
+		if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
+			hdr1 = vfio_pci_config_readl(dev, pos + 4);
+			hdr2 = vfio_pci_config_readl(dev, pos + 8);
+			if ((hdr1 & 0xffff) == PCI_DVSEC_VENDOR_ID_CXL &&
+			    (hdr2 & 0xffff) == PCI_DVSEC_ID_CXL_DEVICE)
+				return pos;
+		}
+		pos = next;
+	}
+	return 0;
+}
+
+FIXTURE(cxl_type2) {
+	struct iommu *iommu;
+	struct vfio_pci_device *dev;
+
+	struct vfio_device_info_cap_cxl cxl_cap;
+	uint16_t dvsec_base;
+
+	uint64_t hdm_region_size;
+	uint64_t comp_regs_size;
+};
+
+FIXTURE_SETUP(cxl_type2)
+{
+	uint8_t infobuf[512] = {};
+	struct vfio_device_info *info = (void *)infobuf;
+	const struct vfio_device_info_cap_cxl *cap;
+	struct vfio_region_info ri = { .argsz = sizeof(ri) };
+
+	self->iommu = iommu_init(default_iommu_mode);
+	self->dev   = vfio_pci_device_init(device_bdf, self->iommu);
+
+	info->argsz = sizeof(infobuf);
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_INFO, info));
+
+	if (!(info->flags & VFIO_DEVICE_FLAGS_CXL))
+		SKIP(return, "not a CXL Type-2 device");
+
+	cap = (const void *)find_device_cap(infobuf, sizeof(infobuf),
+					    VFIO_DEVICE_INFO_CAP_CXL);
+	ASSERT_NE(NULL, cap);
+	memcpy(&self->cxl_cap, cap, sizeof(*cap));
+
+	ri.index = self->cxl_cap.hdm_region_idx;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &ri));
+	self->hdm_region_size = ri.size;
+
+	ri.argsz = sizeof(ri);
+	ri.index = self->cxl_cap.comp_reg_region_idx;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &ri));
+	self->comp_regs_size = ri.size;
+
+	self->dvsec_base = find_cxl_dvsec(self->dev);
+}
+
+FIXTURE_TEARDOWN(cxl_type2)
+{
+	vfio_pci_device_cleanup(self->dev);
+	iommu_cleanup(self->iommu);
+}
+
+TEST_F(cxl_type2, device_is_cxl)
+{
+	const struct vfio_device_info_cap_cxl *c = &self->cxl_cap;
+
+	ASSERT_EQ(VFIO_DEVICE_INFO_CAP_CXL, c->header.id);
+	ASSERT_EQ(1, c->header.version);
+	ASSERT_NE(c->hdm_region_idx, c->comp_reg_region_idx);
+	ASSERT_GE(c->hdm_region_idx,    VFIO_PCI_NUM_REGIONS);
+	ASSERT_GE(c->comp_reg_region_idx, VFIO_PCI_NUM_REGIONS);
+	ASSERT_LT(c->comp_reg_bar, PCI_STD_NUM_BARS);
+	ASSERT_GT(c->comp_reg_size, 0ULL);
+	ASSERT_EQ(c->comp_reg_size, self->comp_regs_size);
+}
+
+TEST_F(cxl_type2, hdm_region_mmap_rw)
+{
+	uint64_t off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+		self->cxl_cap.hdm_region_idx);
+	uint32_t pattern = 0xdeadbeefU;
+	uint32_t readback = 0;
+	void *map;
+
+	if (self->hdm_region_size < SZ_4K)
+		SKIP(return, "HDM region < 4K");
+
+	map = mmap(NULL, SZ_4K, PROT_READ | PROT_WRITE, MAP_SHARED,
+		   self->dev->fd, off);
+	ASSERT_NE(MAP_FAILED, map);
+
+	*(volatile uint32_t *)map = pattern;
+	readback = *(volatile uint32_t *)map;
+	ASSERT_EQ(pattern, readback);
+
+	ASSERT_EQ(0, munmap(map, SZ_4K));
+}
+
+TEST_F(cxl_type2, component_bar_sparse_mmap)
+{
+	const uint8_t bar = self->cxl_cap.comp_reg_bar;
+	uint8_t buf[512] = {};
+	struct vfio_region_info *ri = (void *)buf;
+	const struct vfio_region_info_cap_sparse_mmap *sp;
+	const struct vfio_info_cap_header *hdr;
+	size_t off;
+	uint32_t i;
+
+	ri->argsz = sizeof(buf);
+	ri->index = bar;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, ri));
+
+	ASSERT_TRUE(ri->flags & VFIO_REGION_INFO_FLAG_CAPS);
+	off = ri->cap_offset;
+	hdr = NULL;
+	while (off && off < sizeof(buf)) {
+		hdr = (const void *)(buf + off);
+		if (hdr->id == VFIO_REGION_INFO_CAP_SPARSE_MMAP)
+			break;
+		off = hdr->next;
+		hdr = NULL;
+	}
+	ASSERT_NE(NULL, hdr);
+	sp = (const void *)hdr;
+	ASSERT_GE(sp->nr_areas, 1U);
+	for (i = 0; i < sp->nr_areas; i++) {
+		uint64_t a_start = sp->areas[i].offset;
+		uint64_t a_end   = a_start + sp->areas[i].size;
+
+		ASSERT_TRUE(a_end <= self->cxl_cap.comp_reg_offset ||
+			    a_start >= self->cxl_cap.comp_reg_offset +
+				       self->cxl_cap.comp_reg_size);
+	}
+}
+
+TEST_F(cxl_type2, comp_regs_cm_cap_array_read)
+{
+	uint64_t off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+		self->cxl_cap.comp_reg_region_idx) + CXL_CM_OFFSET;
+	uint32_t hdr = 0;
+	uint16_t cap_id;
+	uint8_t  array_size;
+
+	ASSERT_EQ((ssize_t)sizeof(hdr),
+		  pread(self->dev->fd, &hdr, sizeof(hdr), off));
+
+	cap_id     = hdr & CXL_CM_CAP_HDR_ID_MASK;
+	array_size = (hdr & CXL_CM_CAP_HDR_ARRAY_SIZE_MASK) >> 24;
+	ASSERT_EQ(cap_id, CM_CAP_HDR_CAP_ID);
+	ASSERT_GT(array_size, 0);
+}
+
+TEST_F(cxl_type2, dvsec_lock_byte_read)
+{
+	uint8_t v;
+
+	if (!self->dvsec_base)
+		SKIP(return, "CXL Device DVSEC not found");
+
+	v = vfio_pci_config_readb(self->dev,
+				  self->dvsec_base + 0x14);	/* CONFIG_LOCK */
+	/* Snapshot value is host-firmware-dependent; just assert read
+	 * succeeds (no SIGBUS, no -EIO).
+	 */
+	(void)v;
+}
+
+/*
+ * Exercise the per-decoder COMMIT/COMMITTED state machine in
+ * cxl_passthrough_hdm_rw() (cxl-core).  Steps:
+ *
+ *   - Walk the CM cap-array via COMP_REGS reads to locate the HDM block.
+ *   - Read decoder 0 CTRL; for a firmware-committed Type-2 device both
+ *     COMMIT (bit 9) and COMMITTED (bit 10) are expected to be set.
+ *   - Release COMMIT by writing CTRL with bit 9 cleared.
+ *     Expected FSM transition: COMMITTED -> 0, LOCK_ON_COMMIT (bit 8) -> 0.
+ *   - Re-set COMMIT.  Expected: COMMITTED -> 1 (auto-set by the handler).
+ *   - Restore the original CTRL value so subsequent test runs see the
+ *     firmware-committed state.
+ *
+ * The CTRL writes touch the cxl-core shadow only — they do not reach
+ * the device — so the operation is safe to run repeatedly.
+ */
+TEST_F(cxl_type2, hdm_decoder_commit_fsm)
+{
+	uint64_t comp_off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+		self->cxl_cap.comp_reg_region_idx);
+	uint32_t cm_hdr = 0, entry = 0;
+	uint64_t hdm_reg_offset = 0;
+	uint64_t ctrl_off;
+	uint32_t ctrl_orig, ctrl_test;
+	uint32_t array_size;
+	uint32_t i;
+
+	/* Discover HDM block offset via CM cap-array walk. */
+	ASSERT_EQ((ssize_t)sizeof(cm_hdr),
+		  pread(self->dev->fd, &cm_hdr, sizeof(cm_hdr),
+			comp_off + CXL_CM_OFFSET));
+	ASSERT_EQ(CM_CAP_HDR_CAP_ID, cm_hdr & CXL_CM_CAP_HDR_ID_MASK);
+	array_size = (cm_hdr & CXL_CM_CAP_HDR_ARRAY_SIZE_MASK) >> 24;
+	ASSERT_GT(array_size, 0);
+
+	for (i = 1; i <= array_size; i++) {
+		ASSERT_EQ((ssize_t)sizeof(entry),
+			  pread(self->dev->fd, &entry, sizeof(entry),
+				comp_off + CXL_CM_OFFSET + i * 4));
+		if ((entry & CXL_CM_CAP_HDR_ID_MASK) == CXL_CM_CAP_CAP_ID_HDM) {
+			hdm_reg_offset = CXL_CM_OFFSET +
+					 ((entry & CXL_CM_CAP_PTR_MASK) >> 20);
+			break;
+		}
+	}
+	ASSERT_NE(0, hdm_reg_offset);
+
+	/* Read decoder 0 CTRL. */
+	ctrl_off = comp_off + hdm_reg_offset +
+		   CXL_HDM_DECODER0_CTRL_OFFSET(0);
+	ASSERT_EQ((ssize_t)sizeof(ctrl_orig),
+		  pread(self->dev->fd, &ctrl_orig, sizeof(ctrl_orig),
+			ctrl_off));
+
+	/* Firmware-committed Type-2 device: COMMIT + COMMITTED both set. */
+	ASSERT_TRUE(ctrl_orig & BIT(9));	/* COMMIT */
+	ASSERT_TRUE(ctrl_orig & BIT(10));	/* COMMITTED */
+
+	/* Release COMMIT; FSM clears COMMITTED and LOCK_ON_COMMIT. */
+	ctrl_test = ctrl_orig & ~BIT(9);
+	ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+		  pwrite(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+			 ctrl_off));
+	ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+		  pread(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+			ctrl_off));
+	ASSERT_FALSE(ctrl_test & BIT(9));	/* COMMIT cleared */
+	ASSERT_FALSE(ctrl_test & BIT(10));	/* COMMITTED auto-cleared */
+	ASSERT_FALSE(ctrl_test & BIT(8));	/* LOCK_ON_COMMIT auto-cleared */
+
+	/* Re-set COMMIT; FSM auto-sets COMMITTED. */
+	ctrl_test = BIT(9);
+	ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+		  pwrite(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+			 ctrl_off));
+	ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+		  pread(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+			ctrl_off));
+	ASSERT_TRUE(ctrl_test & BIT(9));	/* COMMIT */
+	ASSERT_TRUE(ctrl_test & BIT(10));	/* COMMITTED auto-set */
+
+	/* Restore the original CTRL value. */
+	ASSERT_EQ((ssize_t)sizeof(ctrl_orig),
+		  pwrite(self->dev->fd, &ctrl_orig, sizeof(ctrl_orig),
+			 ctrl_off));
+}
+
+int main(int argc, char *argv[])
+{
+	device_bdf = vfio_selftests_get_bdf(&argc, argv);
+	return test_harness_run(argc, argv);
+}
-- 
2.25.1



* [PATCH v3 08/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Complete the vfio-pci-core integration of CXL Type-2 device
passthrough by exposing two VFIO regions to userspace, wiring DVSEC
config-space accesses through cxl-core's register-virtualization
helpers, and reserving the CXL component register block from BAR
mmap and BAR resource claim.

HDM region (VFIO_REGION_SUBTYPE_CXL):
  - mmappable view of the device's firmware-committed HPA range
  - mmap fault handler calls vmf_insert_pfn() from the physical HPA
    so the guest gets the same backing memory the host sees
  - pread/pwrite go through the memremap_wb() kva captured at
    bind time by vfio_cxl_map_hdm()

COMP_REGS region (VFIO_REGION_SUBTYPE_CXL_COMP_REGS):
  - pread/pwrite only, dword-aligned (-EINVAL on misalignment)
  - thin transport: each dword dispatches by offset to
    cxl_passthrough_cm_rw() (CM cap-array snapshot) or
    cxl_passthrough_hdm_rw() (HDM Decoder block).  No shadow buffer
    on the vfio side; all per-field semantics live in cxl-core.

DVSEC config-space access:
  - vfio_pci_cxl_config_boundary() clips a chunk at the CXL Device
    DVSEC body edge in vfio_pci_config_rw_single() so the generic
    perm-bits path handles the DVSEC header bytes and the CXL hook
    handles the body bytes.  The clipping shim is used instead of
    re-pointing the ecap_perms[] readfn/writefn (which would mutate
    a module-init static and race across multiple CXL devices).
  - vfio_pci_cxl_config_rw() forwards clipped accesses to
    cxl_passthrough_dvsec_rw(); cxl-core enforces the per-field
    write semantics (LOCK/RWO, CONTROL/RWL, STATUS/RW1C,
    RANGE1/HwInit, RANGE2/RsvdZ).

GET_INFO / GET_REGION_INFO:
  - VFIO_DEVICE_INFO_CAP_CXL advertises the two region indices, the
    component BAR layout, and HOST_FIRMWARE_COMMITTED.
  - GET_REGION_INFO on the component BAR returns a sparse-mmap cap
    that excludes [comp_reg_offset, comp_reg_offset+comp_reg_size).

BAR resource handling:
  - cxl-core holds request_mem_region() on the CXL component
    register sub-range from devm_cxl_probe_mem(), so vfio-pci-core's
    pci_request_selected_regions() on the full BAR would collide.
    map_bars() skips the request for the component BAR (still iomaps
    it; vfio holds the BAR via driver binding); disable() mirrors
    the asymmetric skip.
  - mmap of the component BAR refuses any range overlapping the CXL
    sub-range via vfio_pci_cxl_mmap_overlaps_comp_regs().

vfio_pci_cxl_open() now registers both VFIO regions; close()
unregisters them.  Raw BAR rw redirect into the CXL sub-range is
intentionally not implemented: VMMs use the COMP_REGS region
directly.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
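Note: the clipping described above amounts to shortening a config-space
chunk so that it never straddles an edge of the DVSEC body.  The sketch
below illustrates the idea only; ex_clip_at_body() and ex_in_body() are
hypothetical stand-ins, not vfio_pci_cxl_config_boundary() itself, and
the body is assumed to start at DVSEC base + 0x0a (the first body field
in the documented table).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when pos lies inside the half-open body [start, end). */
int ex_in_body(uint64_t pos, uint64_t body_start, uint64_t body_end)
{
	return pos >= body_start && pos < body_end;
}

/*
 * Return how many of the count bytes at pos may be handled in one go
 * without crossing the start or the end of the DVSEC body.
 */
size_t ex_clip_at_body(uint64_t pos, size_t count,
		       uint64_t body_start, uint64_t body_end)
{
	assert(body_start <= body_end);

	/* Chunk begins before the body and runs into it. */
	if (pos < body_start && pos + count > body_start)
		return (size_t)(body_start - pos);

	/* Chunk begins inside the body and runs past its end. */
	if (ex_in_body(pos, body_start, body_end) && pos + count > body_end)
		return (size_t)(body_end - pos);

	return count;
}
```

With a DVSEC at 0x100 and a body of [0x10a, 0x138), a 4-byte access at
0x108 is clipped to 2 bytes for the generic perm-bits path; the caller
then retries at 0x10a, where ex_in_body() routes the remainder to the
CXL hook.
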
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 521 ++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_config.c   |  31 ++
 drivers/vfio/pci/vfio_pci_core.c     |  44 ++-
 drivers/vfio/pci/vfio_pci_priv.h     |  72 ++++
 drivers/vfio/pci/vfio_pci_rdwr.c     |  17 +
 5 files changed, 679 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 42cd00bbe869..8a00b776d7c7 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -123,12 +123,24 @@ static int vfio_cxl_probe_regs(struct vfio_pci_cxl_state *cxl)
 	if (rc)
 		return rc;
 
+	/*
+	 * The CXL Component Register block is a fixed 64 KiB area (CXL r4.0
+	 * §8.2.3).  cxl_pci_setup_regs() records the remaining BAR length
+	 * after the regblock offset in reg_map.max_size, which is an upper
+	 * bound, not the spec-defined size.  Bail if the BAR does not have
+	 * room for a full component register block at the recorded offset,
+	 * and publish the spec size so the UAPI, sparse-mmap exclusion, and
+	 * COMP_REGS region all agree on the same window.
+	 */
+	if (cxlds->reg_map.max_size < CXL_COMPONENT_REG_BLOCK_SIZE)
+		return -ENXIO;
+
 	cxl->info.hdm_count               = hdm_count;
 	cxl->info.hdm_reg_offset          = hdm_off;
 	cxl->info.hdm_reg_size            = hdm_size;
 	cxl->info.comp_reg_bir            = bir;
 	cxl->info.comp_reg_offset         = bar_off;
-	cxl->info.comp_reg_size           = cxlds->reg_map.max_size;
+	cxl->info.comp_reg_size           = CXL_COMPONENT_REG_BLOCK_SIZE;
 	cxl->info.host_firmware_committed = true;
 
 	/*
@@ -354,16 +366,515 @@ void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev)
 	vdev->cxl = NULL;
 }
 
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev);
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev);
+
 int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
 {
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	int rc;
+
+	if (!cxl)
+		return 0;	/* plain vfio-pci device */
+
+	rc = vfio_pci_cxl_register_comp_regs(vdev);
+	if (rc) {
+		pci_warn(vdev->pdev,
+			 "vfio-cxl: COMP_REGS region register failed (%d)\n",
+			 rc);
+		return rc;
+	}
+
+	rc = vfio_pci_cxl_register_hdm(vdev);
+	if (rc) {
+		pci_warn(vdev->pdev,
+			 "vfio-cxl: HDM region register failed (%d)\n", rc);
+		/*
+		 * COMP_REGS already registered above.  vfio core does not
+		 * call close_device() when open_device() returns an error,
+		 * so roll back the COMP_REGS dynamic region here to avoid
+		 * a leaked half-registered open state.
+		 */
+		vfio_pci_cxl_close(vdev);
+		return rc;
+	}
+	return 0;
+}
+
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	unsigned int i;
+
+	if (!cxl)
+		return;
+
+	for (i = vdev->num_regions; i > 0; i--) {
+		struct vfio_pci_region *r = &vdev->region[i - 1];
+
+		if (r->data != cxl)
+			break;
+		if (r->ops->release)
+			r->ops->release(vdev, r);
+		vdev->num_regions--;
+	}
+}
+
+/* ------------------------------------------------------------------ */
+/* HDM region: mmappable view of the device's HPA range               */
+/* ------------------------------------------------------------------ */
+
+static vm_fault_t hdm_region_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_cxl_state *cxl = vma->vm_private_data;
+	unsigned long off = (vmf->address - vma->vm_start) +
+			    (vma->vm_pgoff << PAGE_SHIFT);
+	phys_addr_t pa;
+
+	if (!cxl || !cxl->info.hpa_size)
+		return VM_FAULT_SIGBUS;
+	if (off >= cxl->info.hpa_size)
+		return VM_FAULT_SIGBUS;
+
+	pa = cxl->info.hpa_base + off;
+	return vmf_insert_pfn(vma, vmf->address, PHYS_PFN(pa));
+}
+
+static const struct vm_operations_struct hdm_region_vm_ops = {
+	.fault = hdm_region_fault,
+};
+
+static int hdm_region_mmap(struct vfio_pci_core_device *vdev,
+			   struct vfio_pci_region *region,
+			   struct vm_area_struct *vma)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	pgoff_t pgoff;
+	u64 req_start, req_len;
+
+	if (!cxl || !cxl->info.hpa_size)
+		return -ENODEV;
+
 	/*
-	 * Region registration (HDM, COMP_REGS) is added by the next
-	 * patch in this series.  This hook exists so vfio-pci-core's
-	 * fd-open path has a stable call site.
+	 * vfio_pci_core_mmap() forwards the VMA with vm_pgoff still
+	 * carrying the VFIO region index in the high bits.  Mask it off
+	 * so req_start is the in-region offset; also overwrite vm_pgoff
+	 * with the normalised value so the fault handler computes the
+	 * physical address from a clean offset.
 	 */
+	pgoff = vma->vm_pgoff &
+		((1ULL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = (u64)pgoff << PAGE_SHIFT;
+	req_len   = vma->vm_end - vma->vm_start;
+	if (req_start > cxl->info.hpa_size ||
+	    req_len > cxl->info.hpa_size - req_start)
+		return -EINVAL;
+
+	vma->vm_pgoff = pgoff;
+	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+	vma->vm_ops = &hdm_region_vm_ops;
+	vma->vm_private_data = cxl;
 	return 0;
 }
 
-void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+static ssize_t hdm_region_rw(struct vfio_pci_core_device *vdev,
+			     char __user *buf, size_t count,
+			     loff_t *ppos, bool iswrite)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	void *kva;
+
+	if (!cxl || !cxl->hdm_kva)
+		return -EINVAL;
+	if (pos < 0 || (u64)pos > cxl->info.hpa_size ||
+	    count > cxl->info.hpa_size - (u64)pos)
+		return -EINVAL;
+
+	kva = (u8 *)cxl->hdm_kva + pos;
+	if (iswrite) {
+		if (copy_from_user(kva, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, kva, count))
+			return -EFAULT;
+	}
+
+	*ppos += count;
+	return count;
+}
+
+static void hdm_region_release(struct vfio_pci_core_device *vdev,
+			       struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_hdm_ops = {
+	.rw	 = hdm_region_rw,
+	.mmap	 = hdm_region_mmap,
+	.release = hdm_region_release,
+};
+
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+	u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+			   VFIO_REGION_INFO_FLAG_WRITE |
+			   VFIO_REGION_INFO_FLAG_MMAP;
+	int rc;
+
+	rc = vfio_pci_core_register_dev_region(vdev, region_type,
+					       VFIO_REGION_SUBTYPE_CXL,
+					       &vfio_pci_cxl_hdm_ops,
+					       cxl->info.hpa_size,
+					       region_flags, cxl);
+	if (rc)
+		return rc;
+
+	cxl->hdm_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+	return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* COMP_REGS region: thin transport to cxl-core register helpers       */
+/* ------------------------------------------------------------------ */
+
+/*
+ * COMP_REGS exposes the CXL component register sub-range of the
+ * device's component BAR as a pread/pwrite-only VFIO region.  Access
+ * is dword-only (4-byte aligned); sub-dword access returns -EINVAL.
+ * The dispatch maps each dword to one of cxl-core's three rw helpers:
+ *
+ *   pos < CXL_CM_OFFSET                          → zero-fill / drop
+ *   CXL_CM_OFFSET <= pos < hdm_reg_offset         → cxl_passthrough_cm_rw
+ *   hdm_reg_offset <= pos < hdm_reg_offset+size   → cxl_passthrough_hdm_rw
+ *   pos >= hdm_reg_offset + hdm_reg_size          → zero-fill / drop
+ *
+ * vfio holds no shadow buffer of its own; the per-field write
+ * semantics live entirely in cxl-core.
+ */
+static ssize_t comp_regs_rw(struct vfio_pci_core_device *vdev,
+			    char __user *buf, size_t count,
+			    loff_t *ppos, bool iswrite)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	resource_size_t cm_off, hdm_start, hdm_end;
+	size_t done = 0;
+
+	if (!cxl || !cxl->cxlpt)
+		return -EINVAL;
+	if (pos < 0 || (u64)pos > cxl->info.comp_reg_size ||
+	    count > cxl->info.comp_reg_size - (u64)pos)
+		return -EINVAL;
+	if (!IS_ALIGNED(pos, 4) || !IS_ALIGNED(count, 4))
+		return -EINVAL;
+
+	cm_off    = CXL_CM_OFFSET;
+	hdm_start = cxl->info.hdm_reg_offset;
+	hdm_end   = hdm_start + cxl->info.hdm_reg_size;
+
+	while (done < count) {
+		__le32 le = 0;
+		u32 v32 = 0;
+		int rc;
+
+		if (iswrite) {
+			if (copy_from_user(&le, buf + done, 4))
+				return done ?: -EFAULT;
+			v32 = le32_to_cpu(le);
+		}
+
+		if (pos >= cm_off && pos < hdm_start) {
+			rc = cxl_passthrough_cm_rw(cxl->cxlpt,
+						   (u32)(pos - cm_off),
+						   &v32, iswrite);
+			if (rc)
+				return done ?: rc;
+		} else if (pos >= hdm_start && pos < hdm_end) {
+			rc = cxl_passthrough_hdm_rw(cxl->cxlpt,
+						    (u32)(pos - hdm_start),
+						    &v32, iswrite);
+			if (rc)
+				return done ?: rc;
+		} else if (!iswrite) {
+			v32 = 0;	/* outside modelled ranges: read 0 */
+		}
+		/* writes outside modelled ranges are silently dropped */
+
+		if (!iswrite) {
+			le = cpu_to_le32(v32);
+			if (copy_to_user(buf + done, &le, 4))
+				return done ?: -EFAULT;
+		}
+
+		pos  += 4;
+		done += 4;
+	}
+
+	*ppos += done;
+	return done;
+}
+
+static void comp_regs_release(struct vfio_pci_core_device *vdev,
+			      struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_comp_regs_ops = {
+	.rw	 = comp_regs_rw,
+	.release = comp_regs_release,
+};
+
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+	u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+			   VFIO_REGION_INFO_FLAG_WRITE;
+	int rc;
+
+	rc = vfio_pci_core_register_dev_region(vdev, region_type,
+					       VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
+					       &vfio_pci_cxl_comp_regs_ops,
+					       cxl->info.comp_reg_size,
+					       region_flags, cxl);
+	if (rc)
+		return rc;
+
+	cxl->comp_reg_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+	return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* DVSEC config-space clipping shim                                    */
+/* ------------------------------------------------------------------ */
+
+/*
+ * vfio_pci_cxl_config_boundary - clip a config-rw chunk at the DVSEC body edge
+ *
+ * Returns the maximum byte count the caller may pass through the
+ * generic chunker without straddling the CXL Device DVSEC body
+ * boundary, or SIZE_MAX when no clip is required.  Used by
+ * vfio_pci_config_rw_single() so the DVSEC header bytes stay on the
+ * generic perm-bits path and the body bytes reach the CXL hook.
+ */
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+				    loff_t pos)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u32 body_start, body_end;
+
+	if (!cxl)
+		return SIZE_MAX;
+
+	body_start = cxl->info.dvsec_offset + PCI_DVSEC_CXL_CAP;
+	body_end   = cxl->info.dvsec_offset + cxl->info.dvsec_size;
+
+	if (pos < body_start)
+		return body_start - pos;
+	if (pos < body_end)
+		return body_end - pos;
+	return SIZE_MAX;
+}
+
+/*
+ * vfio_pci_cxl_config_rw - forward CXL DVSEC config accesses to cxl-core
+ *
+ * Returns the number of bytes processed on success, -ENOENT if the
+ * access lies entirely outside the CXL Device DVSEC body (caller
+ * takes the standard perm-bits path), or another negative errno on
+ * hard failure.  vfio_pci_config_rw_single() applies
+ * vfio_pci_cxl_config_boundary() before width selection, so any
+ * access that reaches here was already clipped to lie entirely inside
+ * the DVSEC body.
+ */
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+			       loff_t pos, size_t count, __le32 *val,
+			       bool iswrite)
 {
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u32 dvsec_off, body_start, body_end, off;
+	u32 host_val;
+	int rc;
+
+	if (!cxl || !cxl->cxlpt)
+		return -ENOENT;
+
+	dvsec_off  = cxl->info.dvsec_offset;
+	body_start = dvsec_off + PCI_DVSEC_CXL_CAP;
+	body_end   = dvsec_off + cxl->info.dvsec_size;
+
+	if (pos + count <= body_start || pos >= body_end)
+		return -ENOENT;
+	if (WARN_ON_ONCE(pos < body_start || pos + count > body_end))
+		return -EINVAL;	/* caller failed to clip at body boundary */
+
+	off = (u32)(pos - dvsec_off);
+	host_val = iswrite ? le32_to_cpu(*val) : 0;
+
+	rc = cxl_passthrough_dvsec_rw(cxl->cxlpt, off, &host_val, count,
+				      iswrite);
+	if (rc)
+		return rc;
+
+	if (!iswrite)
+		*val = cpu_to_le32(host_val);
+	return count;
+}
+
+/* ------------------------------------------------------------------ */
+/* GET_INFO / GET_REGION_INFO / mmap helpers                           */
+/* ------------------------------------------------------------------ */
+
+u8 vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	return cxl ? cxl->info.comp_reg_bir : U8_MAX;
+}
+
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+				     size_t *start, size_t *end)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	if (!cxl || !cxl->info.comp_reg_size)
+		return false;
+
+	*start = cxl->info.comp_reg_offset;
+	*end   = cxl->info.comp_reg_offset + cxl->info.comp_reg_size;
+	return true;
+}
+
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+					  u64 req_start, u64 req_len)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	if (!cxl || !cxl->info.comp_reg_size)
+		return false;
+
+	return req_start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+	       req_start + req_len > cxl->info.comp_reg_offset;
+}
+
+/*
+ * vfio_pci_cxl_bar_overlaps_comp_regs - check whether a BAR-relative access
+ * overlaps the CXL component register sub-range.
+ *
+ * Returns true when @bar is the component BAR and the [@start, @start + @len)
+ * window overlaps [comp_reg_offset, comp_reg_offset + comp_reg_size).  Used
+ * by the raw BAR read/write and ioeventfd paths to reject accesses that
+ * would bypass the COMP_REGS region and reach the physical component
+ * registers directly, sidestepping cxl-core's shadow and per-field write
+ * semantics.
+ */
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+					 int bar, u64 start, u64 len)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	if (!cxl || !cxl->info.comp_reg_size || !len)
+		return false;
+	if (bar != cxl->info.comp_reg_bir)
+		return false;
+
+	return start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+	       start + len > cxl->info.comp_reg_offset;
+}
+
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+			  struct vfio_info_cap *caps)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	struct vfio_device_info_cap_cxl cap = { };
+
+	if (!cxl)
+		return 0;
+
+	cap.header.id      = VFIO_DEVICE_INFO_CAP_CXL;
+	cap.header.version = 1;
+	if (cxl->info.host_firmware_committed)
+		cap.flags |= VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED;
+	cap.hdm_region_idx      = cxl->hdm_region_idx;
+	cap.comp_reg_region_idx = cxl->comp_reg_region_idx;
+	cap.comp_reg_bar        = cxl->info.comp_reg_bir;
+	cap.comp_reg_offset     = cxl->info.comp_reg_offset;
+	cap.comp_reg_size       = cxl->info.comp_reg_size;
+
+	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+/*
+ * Build a VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL
+ * component register block from the mmappable areas of the
+ * component BAR.  Returns -ENOTTY when the request is not for the
+ * component BAR or the component BAR is not mmappable; the caller
+ * (vfio_pci_ioctl_get_region_info) then continues with the standard
+ * BAR path.
+ */
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+				 struct vfio_region_info *info,
+				 struct vfio_info_cap *caps)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	struct vfio_region_info_cap_sparse_mmap *sparse;
+	u64 bar_len, comp_start, comp_end;
+	u64 before_end, after_start;
+	struct vfio_region_sparse_mmap_area areas[2];
+	u32 nr_areas = 0, cap_size;
+	int ret;
+
+	if (!cxl)
+		return -ENOTTY;
+	if (info->index != cxl->info.comp_reg_bir)
+		return -ENOTTY;
+	if (!cxl->info.comp_reg_size)
+		return -ENOTTY;
+	if (!vdev->bar_mmap_supported[info->index])
+		return -ENOTTY;
+
+	bar_len    = pci_resource_len(vdev->pdev, info->index);
+	comp_start = cxl->info.comp_reg_offset;
+	comp_end   = comp_start + cxl->info.comp_reg_size;
+
+	before_end  = round_down(comp_start, PAGE_SIZE);
+	after_start = round_up(comp_end, PAGE_SIZE);
+
+	if (before_end > 0) {
+		areas[nr_areas].offset = 0;
+		areas[nr_areas].size   = before_end;
+		nr_areas++;
+	}
+	if (after_start < bar_len) {
+		areas[nr_areas].offset = after_start;
+		areas[nr_areas].size   = bar_len - after_start;
+		nr_areas++;
+	}
+
+	info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
+	info->size   = bar_len;
+	info->flags  = VFIO_REGION_INFO_FLAG_READ |
+		       VFIO_REGION_INFO_FLAG_WRITE;
+	if (!nr_areas)
+		return 0;
+
+	info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+
+	cap_size = struct_size(sparse, areas, nr_areas);
+	sparse = kzalloc(cap_size, GFP_KERNEL);
+	if (!sparse)
+		return -ENOMEM;
+
+	sparse->header.id      = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+	sparse->header.version = 1;
+	sparse->nr_areas       = nr_areas;
+	memcpy(sparse->areas, areas, nr_areas * sizeof(areas[0]));
+
+	ret = vfio_info_add_capability(caps, &sparse->header, cap_size);
+	kfree(sparse);
+	return ret;
 }
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a10ed733f0e3..b9f30a33515a 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1898,8 +1898,15 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
 	/*
 	 * Chop accesses into aligned chunks containing no more than a
 	 * single capability.  Caller increments to the next chunk.
+	 *
+	 * For CXL Type-2 devices also clip at the CXL Device DVSEC body
+	 * boundary so the generic perm-bits path handles the DVSEC
+	 * header bytes and the CXL hook handles the body bytes; without
+	 * this clip a 32-bit access at dvsec + 0x08 would span the
+	 * generic Header2 word and the CXL CAPABILITY word.
 	 */
 	count = min(count, vfio_pci_cap_remaining_dword(vdev, *ppos));
+	count = min(count, vfio_pci_cxl_config_boundary(vdev, *ppos));
 	if (count >= 4 && !(*ppos % 4))
 		count = 4;
 	else if (count >= 2 && !(*ppos % 2))
@@ -1909,6 +1916,30 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
 
 	ret = count;
 
+	/*
+	 * Give the CXL Type-2 hook first claim on this access: if the
+	 * range lies inside the CXL Device DVSEC body, forward it to
+	 * cxl-core's register-virtualization helpers instead of the
+	 * standard perm-bits path.  -ENOENT means "not for me; use the
+	 * default path"; any other negative value is a hard error.
+	 */
+	if (vdev->cxl) {
+		__le32 le_val = 0;
+		ssize_t cxl_ret;
+
+		if (iswrite && copy_from_user(&le_val, buf, count))
+			return -EFAULT;
+		cxl_ret = vfio_pci_cxl_config_rw(vdev, *ppos, count, &le_val,
+						 iswrite);
+		if (cxl_ret >= 0) {
+			if (!iswrite && copy_to_user(buf, &le_val, count))
+				return -EFAULT;
+			return cxl_ret;
+		}
+		if (cxl_ret != -ENOENT)
+			return cxl_ret;
+	}
+
 	cap_id = vdev->pci_config_map[*ppos];
 
 	if (cap_id == PCI_CAP_ID_INVALID) {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 05ab4ae59157..2d2dae278d1e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -501,6 +501,23 @@ static void vfio_pci_core_map_bars(struct vfio_pci_core_device *vdev)
 		if (!pci_resource_len(pdev, i))
 			continue;
 
+		/*
+		 * cxl-core already holds request_mem_region() on the CXL
+		 * component register sub-range of this BAR.  Skip the
+		 * full-BAR request so we do not collide with that
+		 * sub-region; vfio still owns the BAR via the driver
+		 * binding and the iomap below succeeds without a region
+		 * claim.
+		 */
+		if (vdev->cxl && bar == vfio_pci_cxl_get_component_reg_bar(vdev)) {
+			vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+			if (!vdev->barmap[bar]) {
+				pci_dbg(pdev, "Failed to iomap region %d\n", bar);
+				vdev->barmap[bar] = IOMEM_ERR_PTR(-ENOMEM);
+			}
+			continue;
+		}
+
 		if (pci_request_selected_regions(pdev, 1 << bar, "vfio")) {
 			pci_dbg(pdev, "Failed to reserve region %d\n", bar);
 			vdev->barmap[bar] = IOMEM_ERR_PTR(-EBUSY);
@@ -701,7 +718,10 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 		if (IS_ERR_OR_NULL(vdev->barmap[bar]))
 			continue;
 		pci_iounmap(pdev, vdev->barmap[bar]);
-		pci_release_selected_regions(pdev, 1 << bar);
+		/* Mirror the asymmetric setup-time skip in map_bars(). */
+		if (!(vdev->cxl &&
+		      i == vfio_pci_cxl_get_component_reg_bar(vdev)))
+			pci_release_selected_regions(pdev, 1 << bar);
 		vdev->barmap[bar] = NULL;
 	}
 
@@ -1051,6 +1071,16 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
 	info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
 	info.num_irqs = VFIO_PCI_NUM_IRQS;
 
+	if (vdev->cxl) {
+		ret = vfio_pci_cxl_get_info(vdev, &caps);
+		if (ret) {
+			pci_warn(vdev->pdev,
+				 "Failed to add CXL info capability\n");
+			return ret;
+		}
+		info.flags |= VFIO_DEVICE_FLAGS_CXL;
+	}
+
 	ret = vfio_pci_info_zdev_add_caps(vdev, &caps);
 	if (ret && ret != -ENODEV) {
 		pci_warn(vdev->pdev,
@@ -1093,6 +1123,12 @@ int vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
 	struct pci_dev *pdev = vdev->pdev;
 	int i, ret;
 
+	if (vdev->cxl) {
+		ret = vfio_pci_cxl_get_region_info(vdev, info, caps);
+		if (ret != -ENOTTY)
+			return ret;
+	}
+
 	switch (info->index) {
 	case VFIO_PCI_CONFIG_REGION_INDEX:
 		info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
@@ -1811,6 +1847,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	if (req_start + req_len > phys_len)
 		return -EINVAL;
 
+	/* Block mmap of the CXL component register block. */
+	if (vdev->cxl &&
+	    index == vfio_pci_cxl_get_component_reg_bar(vdev) &&
+	    vfio_pci_cxl_mmap_overlaps_comp_regs(vdev, req_start, req_len))
+		return -EINVAL;
+
 	/*
 	 * Even though we don't make use of the barmap for the mmap,
 	 * we need to request the region and the barmap tracks that.
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 94bf7c6a8548..88b89da6dd5a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -114,6 +114,23 @@ int  vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev);
 void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev);
 int  vfio_pci_cxl_open(struct vfio_pci_core_device *vdev);
 void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev);
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+				    loff_t pos);
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+			       loff_t pos, size_t count, __le32 *val,
+			       bool iswrite);
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+			  struct vfio_info_cap *caps);
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+				 struct vfio_region_info *info,
+				 struct vfio_info_cap *caps);
+u8   vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev);
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+				     size_t *start, size_t *end);
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+					  u64 req_start, u64 req_len);
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+					 int bar, u64 start, u64 len);
 #else
 static inline int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
 {
@@ -128,6 +145,61 @@ static inline int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
 }
 
 static inline void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev) { }
+
+static inline size_t
+vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev, loff_t pos)
+{
+	return SIZE_MAX;
+}
+
+static inline ssize_t
+vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev, loff_t pos,
+		       size_t count, __le32 *val, bool iswrite)
+{
+	return -ENOENT;
+}
+
+static inline int
+vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+		      struct vfio_info_cap *caps)
+{
+	return 0;
+}
+
+static inline int
+vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+			     struct vfio_region_info *info,
+			     struct vfio_info_cap *caps)
+{
+	return -ENOTTY;
+}
+
+static inline u8
+vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+	return U8_MAX;
+}
+
+static inline bool
+vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+				size_t *start, size_t *end)
+{
+	return false;
+}
+
+static inline bool
+vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+				     u64 req_start, u64 req_len)
+{
+	return false;
+}
+
+static inline bool
+vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+				    int bar, u64 start, u64 len)
+{
+	return false;
+}
 #endif
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 3bfbb879a005..a856f29a3c94 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -236,6 +236,15 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 
 	count = min(count, (size_t)(end - pos));
 
+	/*
+	 * Reject raw BAR access that would land inside the CXL component
+	 * register sub-range.  cxl-core owns the per-field shadow and
+	 * spec-defined write semantics; userspace must use the dedicated
+	 * COMP_REGS VFIO region for that range.
+	 */
+	if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+		return -EINVAL;
+
 	if (bar == PCI_ROM_RESOURCE) {
 		/*
 		 * The ROM can fill less space than the BAR, so we start the
@@ -437,6 +446,14 @@ int vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
 	      pos >= vdev->msix_offset + vdev->msix_size))
 		return -EINVAL;
 
+	/*
+	 * Disallow arming ioeventfds against the CXL component register
+	 * sub-range; that area is fronted by cxl-core's shadow and must
+	 * not be reached through the raw BAR map.
+	 */
+	if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+		return -EINVAL;
+
 	if (count == 8)
 		return -EINVAL;
 
-- 
2.25.1



* [PATCH v3 07/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2 acquisition
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Wire vfio-pci-core to acquire CXL Type-2 device state at PCI bind
and release it at PCI unbind, mirroring the existing vfio_pci_zdev_*
integration model.  Four lifecycle hooks are introduced:
vfio_pci_cxl_acquire / _release / _open / _close, with
CONFIG_VFIO_PCI_CXL=n stubs that return -ENODEV / no-op / 0 / no-op
respectively, so vfio-pci behaviour is unchanged when the option is
disabled.

vfio_pci_cxl_acquire() implements the bind sequence:

  - pcie_is_cxl() and CXL Device DVSEC discovery (-ENODEV if absent
    or if MEM_CAPABLE clear — caller falls back to plain vfio-pci)
  - devm_cxl_dev_state_create() with struct vfio_pci_cxl_state
    embedding cxl_dev_state at offset 0 (required by the 7-arg
    macro's static_assert in include/cxl/cxl.h)
  - pci_enable_device_mem(), cxl_pci_setup_regs(), cxl_get_hdm_info()
    (rejecting hdm_count != 1), cxl_regblock_get_bar_info(),
    cxl_await_range_active()
  - devm_cxl_passthrough_create() to snapshot the DVSEC body, HDM
    block, and CM cap-array shadows owned by cxl-core
  - pci_disable_device() — clears PCI_COMMAND_MASTER but NOT
    PCI_COMMAND_MEMORY, so cxl-core MMIO accesses from the next step
    still succeed
  - devm_cxl_probe_mem() to register the cxl_memdev, enumerate the
    endpoint port, and attach the firmware-committed autoregion
  - request_mem_region() + memremap_wb() of the autoregion's HPA so
    the HDM VFIO region can serve guest accesses through it
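
The offset-0 embedding requirement from the second step can be
illustrated outside the kernel.  The struct names below are hypothetical
stand-ins for cxl_dev_state and vfio_pci_cxl_state; the sketch only
shows why a compile-time offset check makes a plain pointer cast safe:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the core-owned base state. */
struct base_state {
	int id;
};

/* Hypothetical stand-in for the driver state that embeds the base. */
struct wrapper_state {
	struct base_state base;	/* must stay the first member */
	int extra;
};

/* Fails the build if someone moves 'base' away from offset 0. */
_Static_assert(offsetof(struct wrapper_state, base) == 0,
	       "base_state must sit at offset 0");

/* With the base at offset 0, a base pointer converts back with a cast. */
static struct wrapper_state *to_wrapper(struct base_state *b)
{
	return (struct wrapper_state *)b;
}
```

An allocator that only knows the base type can then hand back memory
sized for the wrapper, and the caller recovers its own type without
pointer arithmetic.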

The sequence is fail-closed for confirmed-CXL devices: -ENODEV maps
to plain vfio-pci fall-through; any other negative errno aborts the
vfio-pci bind so the guest never sees a half-initialised CXL device.
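
The fail-closed policy reduces to a three-way decision on the value
returned by vfio_pci_cxl_acquire().  A minimal sketch, with an enum and
function name invented for illustration; the actual call site in
vfio-pci-core may be structured differently:

```c
#include <assert.h>
#include <errno.h>

enum bind_outcome {
	BIND_AS_CXL,		/* acquire succeeded: CXL Type-2 passthrough */
	BIND_AS_PLAIN_VFIO,	/* -ENODEV: not a CXL.mem device, fall through */
	BIND_ABORTED,		/* any other error: refuse the bind */
};

/* Map the acquire() return code to what the bind path does next. */
static enum bind_outcome bind_policy(int acquire_rc)
{
	if (acquire_rc == 0)
		return BIND_AS_CXL;
	if (acquire_rc == -ENODEV)
		return BIND_AS_PLAIN_VFIO;
	return BIND_ABORTED;
}
```

Only -ENODEV is treated as "not a CXL device"; every other failure is
assumed to come from a confirmed-CXL device and stops the bind.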

vfio_pci_cxl_open() / _close() are present as stable call sites for
the region-registration hooks that follow.

Selects CXL_VFIO_PASSTHROUGH so cxl-core's per-device
register-virtualization helpers (drivers/cxl/core/passthrough.c) are
built.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/Kconfig             |   2 +
 drivers/vfio/pci/Makefile            |   1 +
 drivers/vfio/pci/cxl/Kconfig         |  34 +++
 drivers/vfio/pci/cxl/Makefile        |   2 +
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 369 +++++++++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h |  71 ++++++
 drivers/vfio/pci/vfio_pci_core.c     |  24 ++
 drivers/vfio/pci/vfio_pci_priv.h     |  21 ++
 include/linux/vfio_pci_core.h        |   7 +
 9 files changed, 531 insertions(+)
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/Makefile
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 296bf01e185e..4cd6acd36053 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -58,6 +58,8 @@ config VFIO_PCI_ZDEV_KVM
 config VFIO_PCI_DMABUF
 	def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
 
+source "drivers/vfio/pci/cxl/Kconfig"
+
 source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/ism/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 6138f1bf241d..ac26e7494f0a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -3,6 +3,7 @@
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
+include $(srctree)/$(src)/cxl/Makefile
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/cxl/Kconfig b/drivers/vfio/pci/cxl/Kconfig
new file mode 100644
index 000000000000..5d88999e1256
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Kconfig
@@ -0,0 +1,34 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config VFIO_PCI_CXL
+	bool "VFIO support for CXL Type-2 device passthrough"
+	depends on VFIO_PCI_CORE
+	depends on CXL_BUS
+	depends on CXL_REGION
+	depends on CXL_MEM
+	# CXL providers are tristate; refuse a builtin vfio-pci-core
+	# against modular cxl-core (would fail to link the per-device
+	# helpers in drivers/cxl/core/passthrough.c).
+	depends on CXL_BUS=y    || VFIO_PCI_CORE=m
+	depends on CXL_REGION=y || VFIO_PCI_CORE=m
+	depends on CXL_MEM=y    || VFIO_PCI_CORE=m
+	select CXL_VFIO_PASSTHROUGH
+	help
+	  Support CXL Type-2 (HDM-D, HDM-DB) accelerator device passthrough
+	  to a KVM guest.  When this option is enabled, vfio-pci-core
+	  probes the CXL Register Locator DVSEC at PCI bind time, acquires
+	  a cxl_memdev and autoregion via devm_cxl_probe_mem(), and
+	  exposes two additional VFIO regions to userspace: a mappable
+	  HDM memory region for the device's HPA range, and a COMP_REGS
+	  shadow region forwarding HDM Decoder Capability accesses
+	  through the cxl-core register-virtualization helpers added by
+	  drivers/cxl/core/passthrough.c.
+
+	  Devices that do not advertise a CXL Device DVSEC fall back to
+	  plain vfio-pci behaviour.  Confirmed-CXL devices whose host
+	  firmware did not commit an HDM decoder, or whose cxl-core probe
+	  otherwise fails, do not bind to vfio-pci at all so the guest is
+	  never offered a half-initialised CXL device.
+
+	  Scope: firmware-committed, single-decoder, no-interleave.
+
+	  Say Y to support CXL Type-2 device passthrough.
diff --git a/drivers/vfio/pci/cxl/Makefile b/drivers/vfio/pci/cxl/Makefile
new file mode 100644
index 000000000000..35e952fe1858
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+vfio-pci-core-$(CONFIG_VFIO_PCI_CXL) += cxl/vfio_cxl_core.o
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
new file mode 100644
index 000000000000..42cd00bbe869
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -0,0 +1,369 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * vfio-pci CXL Type-2 device passthrough — core entry points.
+ *
+ * Four lifecycle hooks are inserted into vfio-pci-core: acquire and
+ * release run at PCI bind / unbind, open and close run on VFIO fd
+ * open / close.  This mirrors the existing vfio_pci_zdev_* integration
+ * model.
+ *
+ * vfio_pci_cxl_acquire() runs at PCI bind time.  It performs the CXL
+ * register-locator probe and HDM decoder discovery under a brief
+ * pci_enable_device_mem() / pci_disable_device() bracket, then asks
+ * cxl-core to register a cxl_memdev and auto-attach the
+ * firmware-committed region via devm_cxl_probe_mem().  pci_disable_device()
+ * clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+ * do_pci_disable_device() in drivers/pci/pci.c), so the cxl-core
+ * MMIO accesses performed by devm_cxl_probe_mem() after the disable
+ * still succeed even with vfio-pci's PCI enable refcount returned to
+ * zero.  The refcount is re-taken cleanly by vfio_pci_core_enable()
+ * at first VFIO fd open.
+ *
+ * Acquisition is fail-closed for confirmed-CXL devices.  Devices that
+ * do not advertise a CXL Device DVSEC, and CXL devices whose
+ * MEM_CAPABLE bit is clear, return -ENODEV so the caller falls back
+ * to plain vfio-pci behaviour.  Any other negative errno from
+ * acquire() is a confirmed-CXL probe failure (locator missing, HDM
+ * not single-decoder, range-active timeout, passthrough shadow
+ * snapshot failure, devm_cxl_probe_mem() refusal, HDM HPA range busy)
+ * and aborts the vfio-pci bind so the guest never sees a CXL device
+ * with half-initialised cxl-core state.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/range.h>
+#include <linux/vfio_pci_core.h>
+
+#include <uapi/cxl/cxl_regs.h>
+#include <uapi/linux/pci_regs.h>
+#include <uapi/linux/vfio.h>
+
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+#include <cxl/pci.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+MODULE_IMPORT_NS("CXL");
+
+#define VFIO_PCI_CXL_HDM_RES_NAME	"vfio-cxl-hdm"
+
+/* ------------------------------------------------------------------ */
+/* Bind-time setup helpers                                             */
+/* ------------------------------------------------------------------ */
+
+static struct vfio_pci_cxl_state *
+vfio_cxl_create_device_state(struct pci_dev *pdev, u16 dvsec)
+{
+	struct vfio_pci_cxl_state *cxl;
+	u32 hdr1;
+	u16 cap;
+	int rc;
+
+	cxl = devm_cxl_dev_state_create(&pdev->dev, CXL_DEVTYPE_DEVMEM,
+					pci_get_dsn(pdev), dvsec,
+					struct vfio_pci_cxl_state,
+					cxlds, false);
+	if (!cxl)
+		return ERR_PTR(-ENOMEM);
+
+	cxl->pdev = pdev;
+
+	rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr1);
+	if (rc) {
+		devm_kfree(&pdev->dev, cxl);
+		return ERR_PTR(-EIO);
+	}
+	cxl->info.dvsec_offset = dvsec;
+	cxl->info.dvsec_size   = PCI_DVSEC_HEADER1_LEN(hdr1);
+
+	rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap);
+	if (rc) {
+		devm_kfree(&pdev->dev, cxl);
+		return ERR_PTR(-EIO);
+	}
+	if (!(cap & PCI_DVSEC_CXL_MEM_CAPABLE)) {
+		devm_kfree(&pdev->dev, cxl);
+		return ERR_PTR(-ENODEV);
+	}
+
+	return cxl;
+}
+
+static int vfio_cxl_probe_regs(struct vfio_pci_cxl_state *cxl)
+{
+	struct cxl_dev_state *cxlds = &cxl->cxlds;
+	resource_size_t hdm_off, hdm_size, bar_off;
+	u8 hdm_count, bir;
+	int rc;
+
+	if (WARN_ON_ONCE(!pci_is_enabled(cxl->pdev)))
+		return -EINVAL;
+
+	rc = cxl_pci_setup_regs(cxl->pdev, CXL_REGLOC_RBI_COMPONENT,
+				&cxlds->reg_map);
+	if (rc)
+		return rc;
+
+	rc = cxl_get_hdm_info(cxlds, &hdm_count, &hdm_off, &hdm_size);
+	if (rc)
+		return rc;
+	if (hdm_count != 1) {
+		pci_err(cxl->pdev,
+			"vfio-cxl: hdm_count=%u, only 1 supported\n",
+			hdm_count);
+		return -EOPNOTSUPP;
+	}
+
+	rc = cxl_regblock_get_bar_info(&cxlds->reg_map, &bir, &bar_off);
+	if (rc)
+		return rc;
+
+	cxl->info.hdm_count               = hdm_count;
+	cxl->info.hdm_reg_offset          = hdm_off;
+	cxl->info.hdm_reg_size            = hdm_size;
+	cxl->info.comp_reg_bir            = bir;
+	cxl->info.comp_reg_offset         = bar_off;
+	cxl->info.comp_reg_size           = cxlds->reg_map.max_size;
+	cxl->info.host_firmware_committed = true;
+
+	/*
+	 * Range-active polls a config-space bit in the CXL DVSEC, not
+	 * MMIO, so it is safe inside or outside the memory-decode
+	 * bracket.  Keep it here so cxlds->media_ready is set before the
+	 * caller drops the PCI enable refcount.
+	 */
+	rc = cxl_await_range_active(cxlds);
+	if (rc)
+		return rc;
+	cxlds->media_ready = true;
+	return 0;
+}
+
+static int vfio_cxl_create_memdev(struct vfio_pci_cxl_state *cxl)
+{
+	struct range hpa_range;
+	struct cxl_memdev *cxlmd;
+
+	/*
+	 * devm_cxl_probe_mem() runs synchronously: it registers a
+	 * cxl_memdev which triggers cxl_mem_probe(), endpoint port
+	 * creation, and autoregion attach.  Endpoint port probe reads
+	 * HDM decoder MMIO via devm_cxl_setup_hdm(); the device must
+	 * therefore still be memory-decoded.  pci_disable_device() only
+	 * clears PCI_COMMAND_MASTER (not _MEMORY), so the paired enable
+	 * / disable done by the caller leaves the decode bit asserted
+	 * and these reads succeed even with the vfio refcount at zero.
+	 */
+	cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &hpa_range);
+	if (IS_ERR(cxlmd))
+		return PTR_ERR(cxlmd);
+
+	cxl->cxlmd          = cxlmd;
+	cxl->info.hpa_base  = hpa_range.start;
+	cxl->info.hpa_size  = range_len(&hpa_range);
+	return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* HDM HPA mapping                                                     */
+/* ------------------------------------------------------------------ */
+
+static int vfio_cxl_map_hdm(struct vfio_pci_cxl_state *cxl)
+{
+	phys_addr_t base = cxl->info.hpa_base;
+	u64 size = cxl->info.hpa_size;
+
+	if (!size)
+		return -EINVAL;
+
+	cxl->hdm_res = request_mem_region(base, size,
+					  VFIO_PCI_CXL_HDM_RES_NAME);
+	if (!cxl->hdm_res) {
+		pci_err(cxl->pdev,
+			"vfio-cxl: HDM HPA %pa (size 0x%llx) busy; check firmware mappings\n",
+			&base, size);
+		return -EBUSY;
+	}
+
+	cxl->hdm_kva = memremap(base, size, MEMREMAP_WB);
+	if (!cxl->hdm_kva) {
+		release_mem_region(base, size);
+		cxl->hdm_res = NULL;
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static void vfio_cxl_unmap_hdm(struct vfio_pci_cxl_state *cxl)
+{
+	if (cxl->hdm_kva) {
+		memunmap(cxl->hdm_kva);
+		cxl->hdm_kva = NULL;
+	}
+	if (cxl->hdm_res) {
+		release_mem_region(cxl->info.hpa_base, cxl->info.hpa_size);
+		cxl->hdm_res = NULL;
+	}
+}
+
+/* ------------------------------------------------------------------ */
+/* Lifecycle hooks                                                     */
+/* ------------------------------------------------------------------ */
+
+int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_cxl_state *cxl;
+	u16 dvsec;
+	int rc;
+
+	if (!pcie_is_cxl(pdev))
+		return -ENODEV;
+
+	dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+					  PCI_DVSEC_CXL_DEVICE);
+	if (!dvsec)
+		return -ENODEV;
+
+	cxl = vfio_cxl_create_device_state(pdev, dvsec);
+	if (IS_ERR(cxl)) {
+		rc = PTR_ERR(cxl);
+		if (rc == -ENODEV)
+			return -ENODEV;	/* MEM_CAPABLE clear: treat as non-CXL. */
+		pci_warn(pdev, "vfio-cxl: state setup failed (%d)\n", rc);
+		return rc;
+	}
+
+	rc = pci_enable_device_mem(pdev);
+	if (rc) {
+		pci_warn(pdev, "vfio-cxl: pci_enable_device_mem failed (%d)\n",
+			 rc);
+		goto err_free;
+	}
+
+	rc = vfio_cxl_probe_regs(cxl);
+	if (rc) {
+		pci_disable_device(pdev);
+		pci_warn(pdev, "vfio-cxl: register probe failed (%d)\n", rc);
+		goto err_free;
+	}
+
+	/*
+	 * Allocate the cxl-core passthrough handle (DVSEC/HDM/CM
+	 * shadows) BEFORE devm_cxl_probe_mem() so that a -ENOMEM or
+	 * snapshot -EIO here is recoverable: devm_kfree() the
+	 * containing state and let devres unwind cxlds.  After
+	 * devm_cxl_probe_mem() publishes the memdev, no devm_kfree() is
+	 * possible because cxlmd->cxlds points into the state.
+	 */
+	cxl->cxlpt = devm_cxl_passthrough_create(&pdev->dev, &cxl->cxlds);
+	if (IS_ERR(cxl->cxlpt)) {
+		rc = PTR_ERR(cxl->cxlpt);
+		cxl->cxlpt = NULL;
+		pci_disable_device(pdev);
+		pci_warn(pdev,
+			 "vfio-cxl: passthrough shadow snapshot failed (%d)\n",
+			 rc);
+		goto err_free;
+	}
+
+	/*
+	 * Drop the PCI enable refcount before publishing the cxl_memdev:
+	 * vfio_pci_core_enable() will take a fresh refcount at first VFIO
+	 * fd open.  PCI_COMMAND_MEMORY stays asserted (see file header).
+	 */
+	pci_disable_device(pdev);
+
+	/*
+	 * Populate the DPA partition tree on cxlds before
+	 * devm_cxl_probe_mem() runs.  The endpoint port probe will try to
+	 * reserve the firmware-committed HDM decoder range as a DPA
+	 * resource child of cxlds->dpa_res; without an explicit
+	 * cxl_set_capacity() call dpa_res is zero-sized and the
+	 * reservation fails with -EBUSY (see __cxl_dpa_reserve() in
+	 * drivers/cxl/core/hdm.c).  Read the decoder's SIZE from the
+	 * snapshot we just took and size dpa_res to cover it.
+	 */
+	{
+		u32 size_lo = 0, size_hi = 0;
+		u64 dpa_size;
+
+		cxl_passthrough_hdm_rw(cxl->cxlpt,
+				       CXL_HDM_DECODER0_SIZE_LOW_OFFSET(0),
+				       &size_lo, false);
+		cxl_passthrough_hdm_rw(cxl->cxlpt,
+				       CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(0),
+				       &size_hi, false);
+		dpa_size = ((u64)size_hi << 32) | size_lo;
+
+		rc = cxl_set_capacity(&cxl->cxlds, dpa_size);
+		if (rc) {
+			pci_warn(pdev,
+				 "vfio-cxl: cxl_set_capacity(0x%llx) failed (%d)\n",
+				 dpa_size, rc);
+			goto err_free;
+		}
+	}
+
+	rc = vfio_cxl_create_memdev(cxl);
+	if (rc) {
+		pci_warn(pdev,
+			 "vfio-cxl: memdev/region creation failed (%d)\n", rc);
+		goto err_free;
+	}
+
+	/*
+	 * Once devm_cxl_probe_mem() has published a cxl_memdev that
+	 * holds a pointer into cxl->cxlds, the state must NOT be
+	 * devm_kfree'd.  A failure from vfio_cxl_map_hdm() aborts the
+	 * vfio-pci bind; the state stays allocated for the lifetime of
+	 * the PCI device, and devres unwinds it when the pdev is
+	 * removed.
+	 */
+	rc = vfio_cxl_map_hdm(cxl);
+	if (rc) {
+		pci_warn(pdev, "vfio-cxl: HDM HPA mapping failed (%d)\n", rc);
+		return rc;
+	}
+
+	vdev->cxl = cxl;
+	pci_info(pdev,
+		 "vfio-cxl: acquired (hpa=%pa/0x%llx hdm@0x%llx/0x%llx BAR%u@0x%llx/0x%llx)\n",
+		 &cxl->info.hpa_base, cxl->info.hpa_size,
+		 cxl->info.hdm_reg_offset, cxl->info.hdm_reg_size,
+		 cxl->info.comp_reg_bir,
+		 cxl->info.comp_reg_offset, cxl->info.comp_reg_size);
+	return 0;
+
+err_free:
+	devm_kfree(&pdev->dev, cxl);
+	return rc;
+}
+
+void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	if (cxl)
+		vfio_cxl_unmap_hdm(cxl);
+	vdev->cxl = NULL;
+}
+
+int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
+{
+	/*
+	 * Region registration (HDM, COMP_REGS) is added by the next
+	 * patch in this series.  This hook exists so vfio-pci-core's
+	 * fd-open path has a stable call site.
+	 */
+	return 0;
+}
+
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+{
+}
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
new file mode 100644
index 000000000000..4ce8f88f8d3d
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved. */
+#ifndef __VFIO_PCI_CXL_PRIV_H__
+#define __VFIO_PCI_CXL_PRIV_H__
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+
+/**
+ * struct vfio_pci_cxl_state - per-device CXL Type-2 passthrough state
+ *
+ * Anchored to a vfio-pci-core device via @vdev->cxl.  Allocated by
+ * devm_cxl_dev_state_create() so its lifetime is bound to the PCI
+ * device; the cxl_memdev acquired via devm_cxl_probe_mem() and the
+ * cxl_passthrough handle returned by devm_cxl_passthrough_create()
+ * are similarly devres-anchored.
+ *
+ * @cxlds:	CXL device state.  MUST be the first member (enforced by
+ *		devm_cxl_dev_state_create()'s static_assert).
+ * @pdev:	backpointer to the PCI device.
+ * @cxlmd:	cxl_memdev acquired at PCI bind via devm_cxl_probe_mem().
+ * @cxlpt:	register-virtualization handle owned by cxl-core; vfio
+ *		forwards DVSEC config-space, COMP_REGS region, and HDM
+ *		block accesses through this opaque pointer.  See
+ *		Documentation/driver-api/vfio-pci-cxl.rst.
+ * @info:	snapshot of cxl-side metadata describing the device's CXL
+ *		layout.  Filled in during vfio_pci_cxl_acquire() and used
+ *		by the VMM-facing helpers (CAP_CXL builder, region info,
+ *		COMP_REGS dispatch boundary).
+ * @hdm_region_idx, @comp_reg_region_idx: VFIO region indices.
+ *		Assigned by vfio_pci_cxl_open() when the regions are
+ *		registered; zero on a device whose fd has never been
+ *		opened.
+ * @hdm_res:	request_mem_region cookie for the HPA range.
+ * @hdm_kva:	memremap(MEMREMAP_WB) mapping of the HPA range.  Used
+ *		for the HDM region's pread/pwrite path.  The mmap fault
+ *		handler does vmf_insert_pfn from the physical HPA so the
+ *		guest gets the same backing memory the host sees.
+ */
+struct vfio_pci_cxl_state {
+	/* MUST be first member - see devm_cxl_dev_state_create() macro. */
+	struct cxl_dev_state		cxlds;
+
+	struct pci_dev		       *pdev;
+	struct cxl_memdev	       *cxlmd;
+	struct cxl_passthrough	       *cxlpt;
+
+	struct {
+		u16		dvsec_offset;
+		u16		dvsec_size;
+		phys_addr_t	hpa_base;
+		u64		hpa_size;
+		u8		comp_reg_bir;
+		u64		comp_reg_offset;
+		u64		comp_reg_size;
+		u8		hdm_count;
+		u64		hdm_reg_offset;
+		u64		hdm_reg_size;
+		bool		host_firmware_committed;
+	} info;
+
+	u32				hdm_region_idx;
+	u32				comp_reg_region_idx;
+	struct resource		       *hdm_res;
+	void			       *hdm_kva;
+};
+
+#endif /* __VFIO_PCI_CXL_PRIV_H__ */
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..05ab4ae59157 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -602,10 +602,25 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
 
+	/*
+	 * Register CXL VFIO regions before mapping BARs.  CXL region
+	 * registration only list-appends to vdev->region[]; it has no
+	 * dependency on vdev->barmap[] being populated.  Running it
+	 * first means a failure here unwinds through out_free_config
+	 * without leaking BAR ioremaps or selected-region requests
+	 * (those are released by vfio_pci_core_disable(), which is not
+	 * called for a failed open).
+	 */
+	ret = vfio_pci_cxl_open(vdev);
+	if (ret)
+		goto out_free_config;
+
 	vfio_pci_core_map_bars(vdev);
 
 	return 0;
 
+out_free_config:
+	vfio_config_free(vdev);
 out_free_zdev:
 	vfio_pci_zdev_close_device(vdev);
 out_free_state:
@@ -699,6 +714,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 
 	vdev->needs_reset = true;
 
+	vfio_pci_cxl_close(vdev);
 	vfio_pci_zdev_close_device(vdev);
 
 	/*
@@ -2222,6 +2238,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 	if (ret)
 		goto out_vf;
 
+	ret = vfio_pci_cxl_acquire(vdev);
+	if (ret && ret != -ENODEV)
+		goto out_vga;
+
 	vfio_pci_probe_power_state(vdev);
 
 	/*
@@ -2250,6 +2270,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 		pm_runtime_get_noresume(dev);
 
 	pm_runtime_forbid(dev);
+	vfio_pci_cxl_release(vdev);
+out_vga:
+	vfio_pci_vga_uninit(vdev);
 out_vf:
 	vfio_pci_vf_uninit(vdev);
 	return ret;
@@ -2264,6 +2287,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 
 	vfio_pci_vf_uninit(vdev);
 	vfio_pci_vga_uninit(vdev);
+	vfio_pci_cxl_release(vdev);
 
 	if (!disable_idle_d3)
 		pm_runtime_get_noresume(&vdev->pdev->dev);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..94bf7c6a8548 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -109,6 +109,27 @@ static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {}
 #endif
 
+#ifdef CONFIG_VFIO_PCI_CXL
+int  vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev);
+int  vfio_pci_cxl_open(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev);
+#else
+static inline int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
+{
+	return -ENODEV;
+}
+
+static inline void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev) { }
+
+static inline int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
+{
+	return 0;
+}
+
+static inline void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev) { }
+#endif
+
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 89165b769e5c..541c1911e090 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -142,6 +142,13 @@ struct vfio_pci_core_device {
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
 	struct list_head	dmabufs;
+	/*
+	 * Opaque pointer to struct vfio_pci_cxl_state (defined in
+	 * drivers/vfio/pci/cxl/vfio_cxl_priv.h).  Set by
+	 * vfio_pci_cxl_acquire() at PCI bind; NULL on non-CXL devices
+	 * and when CONFIG_VFIO_PCI_CXL=n.
+	 */
+	void			*cxl;
 };
 
 enum vfio_pci_io_width {
-- 
2.25.1



* [PATCH v3 06/11] cxl: Add register-virtualization helpers for vfio Type-2 passthrough
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

vfio-pci needs the CXL Device DVSEC body, the HDM Decoder Capability
block, and the CXL.cache/mem cap-array prefix to be virtualized for a
KVM guest in a CXL-spec-compliant way.

Introduce a narrow helper API owned by cxl-core:

  struct cxl_passthrough *
  devm_cxl_passthrough_create(struct device *dev,
                              struct cxl_dev_state *cxlds);

  int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off,
                               u32 *val, size_t sz, bool write);
  int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off,
                             u32 *val, bool write);
  int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off,
                            u32 *val, bool write);

Each helper takes a per-device mutex covering the DVSEC + HDM shadows
(the CM cap-array snapshot is immutable after create) and dispatches
by offset to a hand-written write handler against CXL r4.0 §8.1.3
(DVSEC: LOCK is RWO, CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
STATUS/STATUS2 are RW1C, RANGE1 is HwInit, RANGE2 is RsvdZ) and
§8.2.4.20 (HDM: GLOBAL_CTRL RW, decoder CTRL implements
COMMIT/COMMITTED, decoder BASE/SIZE RWL gated on COMMITTED or
LOCK_ON_COMMIT, cap header HwInit).
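
As a rough illustration of the DVSEC write attributes listed above, the
following userspace sketch models one RWL field, one RW1C field, and the
RWO lock bit.  It is not the code in passthrough.c: struct dvsec_model,
MODEL_CONFIG_LOCK, and the write_*() helpers are hypothetical names
chosen for this example, and only the attribute behaviour is taken from
the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for bit 0 of the DVSEC LOCK register. */
#define MODEL_CONFIG_LOCK 0x0001u

/* Illustrative three-register model; not the driver's shadow layout. */
struct dvsec_model {
	uint16_t control;	/* RWL: writable until CONFIG_LOCK is set */
	uint16_t status;	/* RW1C: writing 1 clears the bit */
	uint16_t lock;		/* RWO: first 1-write latches CONFIG_LOCK */
};

/* RWL: the write lands only while CONFIG_LOCK is clear. */
static inline void write_control(struct dvsec_model *d, uint16_t v)
{
	if (d->lock & MODEL_CONFIG_LOCK)
		return;
	d->control = v;
}

/* RW1C: every bit written as 1 is cleared, bits written as 0 are kept. */
static inline void write_status(struct dvsec_model *d, uint16_t v)
{
	d->status &= (uint16_t)~v;
}

/* RWO: once latched, later writes (including 0) are ignored. */
static inline void write_lock(struct dvsec_model *d, uint16_t v)
{
	if (d->lock & MODEL_CONFIG_LOCK)
		return;
	if (v & MODEL_CONFIG_LOCK)
		d->lock |= MODEL_CONFIG_LOCK;
}
```

In this model a CONTROL write issued after the lock has latched leaves
the previous value in place, which is the ordering a guest would need
to respect.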

Writes to the CM cap-array are silently discarded because the
cap-array headers are RO per CXL r4.0 §8.2.4; the write parameter is
kept on the rw API to make the drop policy explicit at the call site.
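
A minimal sketch of that drop policy, assuming a dword-indexed
read-only snapshot: the helper accepts a write flag but never modifies
the snapshot.  struct cm_model and cm_model_rw() are hypothetical names
for this example, and the bounds and alignment checks are an assumption
rather than a description of cxl_passthrough_cm_rw().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative read-only snapshot, indexed by dword. */
struct cm_model {
	const uint32_t *snap;	/* immutable after creation */
	size_t ndwords;
};

/*
 * Read or "write" the dword at byte offset @off.  Writes succeed but
 * are discarded, so the caller sees the read-only policy explicitly.
 */
static inline int cm_model_rw(const struct cm_model *m, uint32_t off,
			      uint32_t *val, bool write)
{
	if (!m || !val || (off & 3) || off / 4 >= m->ndwords)
		return -EINVAL;
	if (write)
		return 0;	/* RO register: accept and drop */
	*val = m->snap[off / 4];
	return 0;
}
```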

The shadows are snapshotted at create time: the DVSEC body from PCI
config space one 16-bit word at a time (the body starts at the
word-aligned offset 0x0a, so dword reads would be misaligned), and the
CM cap-array and HDM block through a short-lived ioremap() of
cxlds->reg_map.resource, because cxl_setup_regs() has already dropped
reg_map.base.  This preserves firmware-committed values, so the guest
reads what the host BIOS committed, while writes update the shadow per
the per-field write semantics above.
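
The byte-indexed, little-endian shadow can be pictured with the
userspace sketch below, which fills a buffer from 16-bit reads the way
snapshot_dvsec_body() in this patch walks the DVSEC body.  All names
here (put_le16(), get_le32(), snapshot_words(), demo_read16()) are
hypothetical, and demo_read16() merely stands in for a config-space
read.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 16-bit value in little-endian byte order. */
static inline void put_le16(uint16_t v, uint8_t *p)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/* Load a 32-bit little-endian value from any byte offset. */
static inline uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical register source standing in for a config-space read. */
typedef uint16_t (*read16_fn)(uint16_t off);

static inline uint16_t demo_read16(uint16_t off)
{
	return (uint16_t)(0x1000 + off);
}

/* Snapshot [start, end) at a 2-byte stride; shadow[0] maps to @start. */
static inline void snapshot_words(uint8_t *shadow, uint16_t start,
				  uint16_t end, read16_fn rd)
{
	for (uint16_t off = start; off < end; off += 2)
		put_le16(rd(off), shadow + (off - start));
}
```

Because the shadow is byte-indexed, a later dword read at an offset
that is only word-aligned can be assembled from the buffer without
touching the device again.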

The file is gated by the hidden Kconfig CXL_VFIO_PASSTHROUGH so the
passthrough code stays out of cxl_core when no vfio consumer is configured.

Scope: firmware-committed, single-decoder, no-interleave Type-2
passthrough.  Multi-decoder, interleave, and hotplug are
out-of-scope and rejected at create time (-EOPNOTSUPP for
hdm_count != 1).

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/Kconfig            |   7 +
 drivers/cxl/core/Makefile      |   1 +
 drivers/cxl/core/passthrough.c | 590 +++++++++++++++++++++++++++++++++
 include/cxl/passthrough.h      | 121 +++++++
 4 files changed, 719 insertions(+)
 create mode 100644 drivers/cxl/core/passthrough.c
 create mode 100644 include/cxl/passthrough.h

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 80aeb0d556bd..7c874d486a9c 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -19,6 +19,13 @@ menuconfig CXL_BUS
 
 if CXL_BUS
 
+config CXL_VFIO_PASSTHROUGH
+	bool
+	# Hidden symbol selected by VFIO_PCI_CXL to pull
+	# drivers/cxl/core/passthrough.c into cxl_core when a vfio
+	# Type-2 passthrough consumer is configured.  Keep silent: no
+	# help text, no default, no user-visible prompt.
+
 config CXL_PCI
 	tristate "PCI manageability"
 	default CXL_BUS
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..0cc80bd35a88 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -22,3 +22,4 @@ cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
 cxl_core-$(CONFIG_CXL_RAS) += ras.o
 cxl_core-$(CONFIG_CXL_RAS) += ras_rch.o
 cxl_core-$(CONFIG_CXL_ATL) += atl.o
+cxl_core-$(CONFIG_CXL_VFIO_PASSTHROUGH) += passthrough.o
diff --git a/drivers/cxl/core/passthrough.c b/drivers/cxl/core/passthrough.c
new file mode 100644
index 000000000000..b89829586024
--- /dev/null
+++ b/drivers/cxl/core/passthrough.c
@@ -0,0 +1,590 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * vfio-pci Type-2 device passthrough — CXL register virtualization.
+ *
+ * Owns the CXL spec-defined virtualization semantics for the
+ *   - CXL Device DVSEC capability body  (CXL r4.0 §8.1.3)
+ *   - HDM Decoder Capability block      (CXL r4.0 §8.2.4.20)
+ *   - CXL.cache/mem (CM) cap-array      (CXL r4.0 §8.2.4)
+ *
+ * vfio-pci is the only caller.  This file is NOT a generic emulation
+ * framework: every register the guest may touch has a hand-written
+ * write handler against the spec.  Reads serve from a shadow
+ * snapshotted at create time; writes update the shadow per the spec
+ * attribute mode for that field.
+ *
+ * Scope: firmware-committed, single-decoder, no-interleave Type-2
+ * passthrough.  Multi-decoder, interleave, and hotplug are
+ * out-of-scope and rejected at create time.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cleanup.h>
+#include <linux/device.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_ids.h>
+#include <linux/pci_regs.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/unaligned.h>
+
+#include <uapi/cxl/cxl_regs.h>
+
+#include <cxlpci.h>
+#include <cxlmem.h>
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+
+#include "core.h"
+
+/* DVSEC CXL Device body offsets — relative to DVSEC capability start.
+ * Body begins at PCI_DVSEC_CXL_CAP (0x0a); preceding bytes are the PCI
+ * ext-cap header and DVSEC headers handled by the generic vfio
+ * perm-bits path.
+ */
+#define DVSEC_OFF_CAPABILITY		PCI_DVSEC_CXL_CAP	/* 0x0a, u16 */
+#define DVSEC_OFF_CONTROL		PCI_DVSEC_CXL_CTRL	/* 0x0c, u16 */
+#define DVSEC_OFF_STATUS		0x0e			/* u16 */
+#define DVSEC_OFF_CONTROL2		0x10			/* u16 */
+#define DVSEC_OFF_STATUS2		0x12			/* u16 */
+#define DVSEC_OFF_LOCK			0x14			/* u16 */
+#define DVSEC_OFF_RANGE1_SIZE_HI	0x18			/* u32 */
+#define DVSEC_OFF_RANGE1_SIZE_LO	0x1c
+#define DVSEC_OFF_RANGE1_BASE_HI	0x20
+#define DVSEC_OFF_RANGE1_BASE_LO	0x24
+#define DVSEC_OFF_RANGE2_SIZE_HI	0x28
+#define DVSEC_OFF_RANGE2_SIZE_LO	0x2c
+#define DVSEC_OFF_RANGE2_BASE_HI	0x30
+#define DVSEC_OFF_RANGE2_BASE_LO	0x34
+#define DVSEC_BODY_END			0x38
+
+#define DVSEC_LOCK_CONFIG_LOCK		BIT(0)
+
+/* HDM Decoder Capability block offsets — relative to HDM block base.
+ * Decoder N register set starts at 0x10 + N * 0x20.
+ */
+#define HDM_OFF_CAP_HEADER		0x00
+#define HDM_OFF_GLOBAL_CTRL		0x04
+#define HDM_DEC_BASE			0x10
+#define HDM_DEC_STRIDE			0x20
+#define HDM_DEC_OFF_BASE_LO(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x00)
+#define HDM_DEC_OFF_BASE_HI(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x04)
+#define HDM_DEC_OFF_SIZE_LO(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x08)
+#define HDM_DEC_OFF_SIZE_HI(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x0c)
+#define HDM_DEC_OFF_CTRL(n)		(HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x10)
+
+/* HDM Decoder CTRL bits per CXL r4.0 §8.2.4.20.5. */
+#define HDM_CTRL_LOCK_ON_COMMIT		BIT(8)
+#define HDM_CTRL_COMMIT			BIT(9)
+#define HDM_CTRL_COMMITTED		BIT(10)
+#define HDM_CTRL_ERR_NOT_COMMITTED	BIT(11)
+
+struct cxl_passthrough {
+	struct cxl_dev_state *cxlds;
+
+	/* DVSEC body shadow.  Byte-indexed by (off - PCI_DVSEC_CXL_CAP).
+	 * Allocated rounded up to a dword so dword reads at the tail
+	 * never overrun.
+	 */
+	u8 *dvsec_shadow;
+	u16 dvsec_size;			/* full DVSEC cap length, incl. headers */
+	bool dvsec_config_locked;
+
+	/* HDM block shadow.  Byte-indexed; size = hdm_reg_size. */
+	u8 *hdm_shadow;
+	resource_size_t hdm_reg_size;
+
+	/* CM cap-array snapshot.  Dword-indexed by (off / 4) where off
+	 * is the byte offset from CXL_CM_OFFSET.  Read-only after create.
+	 */
+	__le32 *cm_snapshot;
+	size_t cm_snapshot_dwords;
+
+	/* Covers dvsec_shadow + dvsec_config_locked + hdm_shadow.
+	 * cm_snapshot is immutable after create; no lock needed.  Leaf-
+	 * level: no entry point holding this mutex calls into cxl-bus or
+	 * vfio.
+	 */
+	struct mutex lock;
+};
+
+/* ------------------------------------------------------------------ */
+/* Snapshot helpers                                                    */
+/* ------------------------------------------------------------------ */
+
+/* Read the DVSEC body bytes [PCI_DVSEC_CXL_CAP, dvsec_size) from PCI
+ * config space into the shadow.
+ *
+ * The body starts at PCI_DVSEC_CXL_CAP (0x0a), which is word-aligned but
+ * NOT dword-aligned, and CXL r4.0 §8.1.3 places six 16-bit descriptors
+ * (CAPABILITY through LOCK) at offsets 0x0a..0x14 before any 32-bit
+ * field.  Strict-alignment PCIe host bridges (e.g. ARM64 ECAM) reject
+ * misaligned dword config accesses with PCIBIOS_BAD_REGISTER_NUMBER;
+ * snapshot at the natural granularity of the body's 16-bit descriptors
+ * (2-byte stride) so every offset in the range is naturally aligned.
+ */
+static int snapshot_dvsec_body(struct cxl_passthrough *p)
+{
+	struct pci_dev *pdev = to_pci_dev(p->cxlds->dev);
+	u16 dvsec = p->cxlds->cxl_dvsec;
+	u16 off;
+	u16 word;
+	int rc;
+
+	for (off = PCI_DVSEC_CXL_CAP; off < p->dvsec_size; off += 2) {
+		rc = pci_read_config_word(pdev, dvsec + off, &word);
+		if (rc)
+			return -EIO;
+		put_unaligned_le16(word, p->dvsec_shadow +
+				   (off - PCI_DVSEC_CXL_CAP));
+	}
+	return 0;
+}
+
+/* Read the CM cap-array prefix [CXL_CM_OFFSET, hdm_reg_offset) from
+ * MMIO into cm_snapshot, and the HDM block [hdm_reg_offset,
+ * hdm_reg_offset + hdm_reg_size) into hdm_shadow.
+ *
+ * @base is a short-lived kva for the component register block,
+ * established by the caller via ioremap() against cxlds->reg_map.resource.
+ * cxl_setup_regs() drops its own ioremap (clears reg_map.base) after the
+ * cap-array probe completes, so this function cannot rely on
+ * cxlds->reg_map.base being valid; the caller passes a fresh mapping
+ * here and releases it once snapshot data has been copied into the
+ * in-memory shadows.
+ */
+static void snapshot_cm_and_hdm(struct cxl_passthrough *p,
+				void __iomem *base,
+				resource_size_t hdm_off)
+{
+	size_t i;
+
+	for (i = 0; i < p->cm_snapshot_dwords; i++)
+		p->cm_snapshot[i] = cpu_to_le32(readl(base + CXL_CM_OFFSET +
+						      i * 4));
+
+	for (i = 0; i < p->hdm_reg_size / 4; i++)
+		put_unaligned_le32(readl(base + hdm_off + i * 4),
+				   p->hdm_shadow + i * 4);
+}
+
+/* ------------------------------------------------------------------ */
+/* devres                                                              */
+/* ------------------------------------------------------------------ */
+
+static void cxl_passthrough_release(struct device *dev, void *res)
+{
+	struct cxl_passthrough *p = *(struct cxl_passthrough **)res;
+
+	kfree(p->dvsec_shadow);
+	kfree(p->hdm_shadow);
+	kfree(p->cm_snapshot);
+	mutex_destroy(&p->lock);
+	kfree(p);
+}
+
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds)
+{
+	struct cxl_passthrough **dres;
+	struct cxl_passthrough *p;
+	struct pci_dev *pdev;
+	resource_size_t hdm_off, hdm_size;
+	size_t dvsec_shadow_size;
+	u8 hdm_count;
+	u32 hdr;
+	int rc;
+
+	/*
+	 * cxl_setup_regs() releases its short-lived ioremap before returning,
+	 * so reg_map.base is NULL by the time we run.  Validate the persistent
+	 * fields (resource address and size) instead; the local ioremap
+	 * established further below covers the snapshot reads.
+	 */
+	if (!dev || !cxlds || !cxlds->dev || !cxlds->cxl_dvsec ||
+	    !cxlds->reg_map.resource || !cxlds->reg_map.max_size)
+		return ERR_PTR(-EINVAL);
+
+	pdev = to_pci_dev(cxlds->dev);
+
+	rc = cxl_get_hdm_info(cxlds, &hdm_count, &hdm_off, &hdm_size);
+	if (rc)
+		return ERR_PTR(rc);
+	if (hdm_count != 1 || !hdm_size || hdm_off <= CXL_CM_OFFSET ||
+	    !IS_ALIGNED(hdm_size, 4))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	p = kzalloc_obj(*p, GFP_KERNEL);
+	if (!p)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&p->lock);
+	p->cxlds = cxlds;
+	p->hdm_reg_size = hdm_size;
+
+	/* Full DVSEC length, headers included, from DVSEC Header 1. */
+	rc = pci_read_config_dword(pdev, cxlds->cxl_dvsec + PCI_DVSEC_HEADER1,
+				   &hdr);
+	if (rc) {
+		rc = -EIO;
+		goto err;
+	}
+	p->dvsec_size = PCI_DVSEC_HEADER1_LEN(hdr);
+	if (p->dvsec_size < DVSEC_BODY_END) {
+		rc = -EINVAL;
+		goto err;
+	}
+
+	dvsec_shadow_size = round_up(p->dvsec_size - PCI_DVSEC_CXL_CAP, 4);
+	p->dvsec_shadow = kzalloc(dvsec_shadow_size, GFP_KERNEL);
+	if (!p->dvsec_shadow) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	p->cm_snapshot_dwords = (hdm_off - CXL_CM_OFFSET) / 4;
+	p->cm_snapshot = kcalloc(p->cm_snapshot_dwords, sizeof(__le32),
+				 GFP_KERNEL);
+	if (!p->cm_snapshot) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	p->hdm_shadow = kzalloc(hdm_size, GFP_KERNEL);
+	if (!p->hdm_shadow) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	rc = snapshot_dvsec_body(p);
+	if (rc)
+		goto err;
+
+	{
+		void __iomem *base;
+
+		/*
+		 * Bind-time-only ioremap.  cxl_setup_regs() has already
+		 * released the cxl-core ioremap (see comment on the entry
+		 * gate).  Take a fresh, short-lived mapping for the
+		 * snapshot, then release it; all subsequent reads serve
+		 * from the in-memory shadows.
+		 */
+		base = ioremap(cxlds->reg_map.resource,
+			       cxlds->reg_map.max_size);
+		if (!base) {
+			rc = -ENOMEM;
+			goto err;
+		}
+		snapshot_cm_and_hdm(p, base, hdm_off);
+		iounmap(base);
+	}
+
+	dres = devres_alloc(cxl_passthrough_release, sizeof(*dres),
+			    GFP_KERNEL);
+	if (!dres) {
+		rc = -ENOMEM;
+		goto err;
+	}
+	*dres = p;
+	devres_add(dev, dres);
+	return p;
+
+err:
+	kfree(p->dvsec_shadow);
+	kfree(p->cm_snapshot);
+	kfree(p->hdm_shadow);
+	mutex_destroy(&p->lock);
+	kfree(p);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_passthrough_create, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* DVSEC write semantics                                               */
+/* ------------------------------------------------------------------ */
+
+static u16 dvsec_shadow_get_u16(struct cxl_passthrough *p, u16 off)
+{
+	return get_unaligned_le16(p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+static void dvsec_shadow_set_u16(struct cxl_passthrough *p, u16 off, u16 val)
+{
+	put_unaligned_le16(val, p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+/* Apply a write to a single DVSEC field at @off, with the field's
+ * native width (2 for descriptors, 4 for RANGE entries).  @width is
+ * the field's spec width; @new is the merged value to apply.  Caller
+ * holds p->lock.
+ */
+static void dvsec_apply_write(struct cxl_passthrough *p, u16 off, size_t width,
+			      u32 new)
+{
+	u16 cur16;
+
+	switch (off) {
+	case DVSEC_OFF_CAPABILITY:
+		/* HwInit — drop. */
+		return;
+	case DVSEC_OFF_CONTROL:
+	case DVSEC_OFF_CONTROL2:
+		/* RWL — gated on CONFIG_LOCK. */
+		if (p->dvsec_config_locked)
+			return;
+		dvsec_shadow_set_u16(p, off, (u16)new);
+		return;
+	case DVSEC_OFF_STATUS:
+	case DVSEC_OFF_STATUS2:
+		/* RW1C — clear bits where the guest wrote 1. */
+		cur16 = dvsec_shadow_get_u16(p, off);
+		dvsec_shadow_set_u16(p, off, cur16 & ~(u16)new);
+		return;
+	case DVSEC_OFF_LOCK:
+		/* RWO — first 1-write latches CONFIG_LOCK; subsequent
+		 * writes are ignored.
+		 */
+		cur16 = dvsec_shadow_get_u16(p, off);
+		if (cur16 & DVSEC_LOCK_CONFIG_LOCK)
+			return;
+		if (new & DVSEC_LOCK_CONFIG_LOCK) {
+			dvsec_shadow_set_u16(p, off,
+					     cur16 | DVSEC_LOCK_CONFIG_LOCK);
+			p->dvsec_config_locked = true;
+		}
+		return;
+	case DVSEC_OFF_RANGE1_SIZE_HI:
+	case DVSEC_OFF_RANGE1_SIZE_LO:
+	case DVSEC_OFF_RANGE1_BASE_HI:
+	case DVSEC_OFF_RANGE1_BASE_LO:
+		/* HwInit — drop. */
+		return;
+	case DVSEC_OFF_RANGE2_SIZE_HI:
+	case DVSEC_OFF_RANGE2_SIZE_LO:
+	case DVSEC_OFF_RANGE2_BASE_HI:
+	case DVSEC_OFF_RANGE2_BASE_LO:
+		/* RsvdZ — drop. */
+		return;
+	default:
+		/* Reserved offsets inside the modelled body: drop. */
+		(void)width;
+		return;
+	}
+}
+
+/* Map a byte offset @off inside the DVSEC body to the natural-width
+ * field that contains it: returns the field's base offset (16-bit
+ * aligned for descriptors, 32-bit aligned for RANGE entries) and width.
+ * Returns false if @off lies outside any modelled field.
+ */
+static bool dvsec_field_at(u16 off, u16 *field_off, size_t *width)
+{
+	if (off >= DVSEC_OFF_CAPABILITY && off < DVSEC_OFF_RANGE1_SIZE_HI) {
+		*field_off = ALIGN_DOWN(off, 2);
+		*width = 2;
+		return true;
+	}
+	if (off >= DVSEC_OFF_RANGE1_SIZE_HI && off < DVSEC_BODY_END) {
+		*field_off = ALIGN_DOWN(off, 4);
+		*width = 4;
+		return true;
+	}
+	return false;
+}
+
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			     size_t sz, bool write)
+{
+	u8 *shadow;
+	u16 field_off;
+	size_t field_width;
+	u32 cur, merged;
+	u32 sub_shift;
+	u32 width_mask;
+
+	if (!p || !val)
+		return -EINVAL;
+	if (sz != 1 && sz != 2 && sz != 4)
+		return -EINVAL;
+	if (off < PCI_DVSEC_CXL_CAP || off + sz > p->dvsec_size)
+		return -EINVAL;
+
+	guard(mutex)(&p->lock);
+
+	shadow = p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP);
+
+	if (!write) {
+		switch (sz) {
+		case 1:
+			*val = *shadow;
+			break;
+		case 2:
+			*val = get_unaligned_le16(shadow);
+			break;
+		case 4:
+			*val = get_unaligned_le32(shadow);
+			break;
+		}
+		return 0;
+	}
+
+	if (!dvsec_field_at(off, &field_off, &field_width))
+		return 0;	/* outside any modelled field: drop */
+
+	/* Read-modify-merge the field at its natural width. */
+	if (field_width == 2)
+		cur = dvsec_shadow_get_u16(p, field_off);
+	else
+		cur = get_unaligned_le32(p->dvsec_shadow +
+					 (field_off - PCI_DVSEC_CXL_CAP));
+
+	width_mask = (sz == 4) ? 0xffffffff : (sz == 2 ? 0xffff : 0xff);
+	sub_shift = (off - field_off) * 8;
+	merged = cur & ~(width_mask << sub_shift);
+	merged |= (*val & width_mask) << sub_shift;
+
+	dvsec_apply_write(p, field_off, field_width, merged);
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_dvsec_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* HDM write semantics                                                 */
+/* ------------------------------------------------------------------ */
+
+static u32 hdm_shadow_get(struct cxl_passthrough *p, u32 off)
+{
+	return get_unaligned_le32(p->hdm_shadow + off);
+}
+
+static void hdm_shadow_set(struct cxl_passthrough *p, u32 off, u32 val)
+{
+	put_unaligned_le32(val, p->hdm_shadow + off);
+}
+
+/* Decoder index for a per-decoder register offset. */
+static u32 hdm_decoder_of(u32 off)
+{
+	return (off - HDM_DEC_BASE) / HDM_DEC_STRIDE;
+}
+
+static u32 hdm_decoder_field(u32 off)
+{
+	return (off - HDM_DEC_BASE) % HDM_DEC_STRIDE;
+}
+
+static void hdm_decoder_ctrl_write(struct cxl_passthrough *p, u32 off, u32 val)
+{
+	u32 cur = hdm_shadow_get(p, off);
+	u32 next;
+
+	/* Once COMMITTED, only the COMMIT toggle is honoured.  Releasing
+	 * COMMIT clears COMMITTED and Lock-on-Commit per CXL r4.0
+	 * §8.2.4.20.5.
+	 */
+	if (cur & HDM_CTRL_COMMITTED) {
+		next = (cur & ~HDM_CTRL_COMMIT) | (val & HDM_CTRL_COMMIT);
+		if (!(val & HDM_CTRL_COMMIT)) {
+			next &= ~HDM_CTRL_COMMITTED;
+			next &= ~HDM_CTRL_LOCK_ON_COMMIT;
+		}
+		hdm_shadow_set(p, off, next);
+		return;
+	}
+
+	next = val & ~(HDM_CTRL_COMMITTED | HDM_CTRL_ERR_NOT_COMMITTED);
+	if (val & HDM_CTRL_COMMIT)
+		next |= HDM_CTRL_COMMITTED;
+	hdm_shadow_set(p, off, next);
+}
+
+static void hdm_decoder_basesize_write(struct cxl_passthrough *p, u32 off,
+				       u32 val)
+{
+	u32 n = hdm_decoder_of(off);
+	u32 ctrl = hdm_shadow_get(p, HDM_DEC_OFF_CTRL(n));
+
+	/* RWL — BASE/SIZE locked when the decoder is committed or
+	 * lock-on-commit has been latched.
+	 */
+	if (ctrl & (HDM_CTRL_COMMITTED | HDM_CTRL_LOCK_ON_COMMIT))
+		return;
+	hdm_shadow_set(p, off, val);
+}
+
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			   bool write)
+{
+	u32 field;
+
+	if (!p || !val)
+		return -EINVAL;
+	if (!IS_ALIGNED(off, 4) || off + 4 > p->hdm_reg_size)
+		return -EINVAL;
+
+	guard(mutex)(&p->lock);
+
+	if (!write) {
+		*val = hdm_shadow_get(p, off);
+		return 0;
+	}
+
+	switch (off) {
+	case HDM_OFF_CAP_HEADER:
+		/* HwInit — drop. */
+		return 0;
+	case HDM_OFF_GLOBAL_CTRL:
+		/* RW — shadow. */
+		hdm_shadow_set(p, off, *val);
+		return 0;
+	}
+
+	if (off < HDM_DEC_BASE)
+		return 0;	/* gap before per-decoder regs: drop */
+
+	field = hdm_decoder_field(off);
+	switch (field) {
+	case 0x00: case 0x04:	/* BASE_LO / BASE_HI */
+	case 0x08: case 0x0c:	/* SIZE_LO / SIZE_HI */
+		hdm_decoder_basesize_write(p, off, *val);
+		return 0;
+	case 0x10:		/* CTRL */
+		hdm_decoder_ctrl_write(p, off, *val);
+		return 0;
+	default:
+		/* TARGET_LIST_{LO,HI} and other per-decoder bytes are
+		 * accepted as plain RW shadow for the firmware-committed
+		 * scope; multi-decoder / interleave behaviour is
+		 * out-of-scope.
+		 */
+		hdm_shadow_set(p, off, *val);
+		return 0;
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_hdm_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* CM cap-array snapshot                                               */
+/* ------------------------------------------------------------------ */
+
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			  bool write)
+{
+	if (!p || !val)
+		return -EINVAL;
+	if (!IS_ALIGNED(off, 4) || off / 4 >= p->cm_snapshot_dwords)
+		return -EINVAL;
+
+	if (write)
+		return 0;	/* cap-array headers are RO; drop. */
+
+	*val = le32_to_cpu(p->cm_snapshot[off / 4]);
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_cm_rw, "CXL");
diff --git a/include/cxl/passthrough.h b/include/cxl/passthrough.h
new file mode 100644
index 000000000000..43214b0d34f6
--- /dev/null
+++ b/include/cxl/passthrough.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * CXL register virtualization helpers for vfio-pci Type-2 passthrough.
+ *
+ * See Documentation/driver-api/vfio-pci-cxl.rst for the ownership
+ * contract.  In short: cxl-core owns the per-device DVSEC body, HDM
+ * Decoder block, and CM cap-array shadows; vfio-pci is a transport
+ * that forwards guest reads and writes through the helpers below.
+ *
+ * The helpers are not a generic emulation framework.  Each register
+ * is hand-coded against CXL r4.0 §8.1.3 and §8.2.4.20.  Adding a new
+ * field is "add a case", not "add a mode".
+ */
+#ifndef __CXL_PASSTHROUGH_H__
+#define __CXL_PASSTHROUGH_H__
+
+#include <linux/types.h>
+
+struct cxl_dev_state;
+struct cxl_passthrough;
+struct device;
+
+/**
+ * devm_cxl_passthrough_create - snapshot a Type-2 device's DVSEC + HDM +
+ * CM cap-array shadows and return the opaque handle the rw helpers
+ * operate on.
+ *
+ * @dev: device whose devres lifetime bounds the returned handle.
+ * @cxlds: CXL device state with cxlds->cxl_dvsec populated and
+ *	   cxlds->reg_map.resource and cxlds->reg_map.max_size describing
+ *	   the component register block.  cxlds->reg_map.base is NOT
+ *	   required; cxl_pci_setup_regs() releases its short-lived
+ *	   ioremap before returning, so this helper takes a local
+ *	   bind-time ioremap against cxlds->reg_map.resource for the
+ *	   duration of the snapshot.
+ *
+ * On success the returned handle is bound to @dev's devres so unwind
+ * happens automatically when @dev is unbound.  The handle must not be
+ * freed by the caller.
+ *
+ * Return: a valid &struct cxl_passthrough on success, ERR_PTR(-errno)
+ * on failure.
+ */
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds);
+
+/**
+ * cxl_passthrough_dvsec_rw - read or write the CXL Device DVSEC body shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the start of the DVSEC capability.  Must be
+ *	 >= PCI_DVSEC_CXL_CAP and (off + sz) must lie inside the DVSEC.
+ *	 Accesses to the PCI ext-cap header bytes (off < PCI_DVSEC_CXL_CAP)
+ *	 are the caller's responsibility; they belong on the generic
+ *	 perm-bits path, not here.
+ * @val: pointer to a u32 holding the read result or the write value.
+ *	 The low @sz bytes of *val are the payload; the upper bytes are
+ *	 ignored on writes and zero on reads.
+ * @sz: 1, 2, or 4.  Other values return -EINVAL.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow.  Writes update the shadow per the spec
+ * attribute mode for the addressed field (LOCK is RWO, CONTROL/CONTROL2
+ * are RWL gated on CONFIG_LOCK, STATUS/STATUS2 are RW1C, RANGE1/2 are
+ * HwInit, Reserved/RsvdZ silently consumed).
+ *
+ * Known limitation: a 4-byte write whose @off straddles a 16-bit DVSEC
+ * field boundary (CONTROL/STATUS at 0x0c/0x0e, CONTROL2/STATUS2 at
+ * 0x10/0x12) applies only the field containing the first byte of the
+ * access; the adjacent 16-bit field is not updated by the same write.
+ * Standard CXL register-access patterns issue separate 2-byte accesses
+ * to CONTROL, STATUS, CONTROL2 and STATUS2, so this corner case is
+ * documented rather than handled.
+ *
+ * Return: 0 on success; -EINVAL on out-of-range or bad size.
+ */
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			     size_t sz, bool write);
+
+/**
+ * cxl_passthrough_hdm_rw - read or write the HDM Decoder block shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the HDM block base; must be 4-byte aligned and
+ *	 (off + 4) <= hdm_reg_size.  Sub-dword access is not supported on
+ *	 HDM registers per CXL r4.0 §8.2.4.
+ * @val: pointer to a u32 holding the read result or the write value.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow.  Writes implement the per-decoder
+ * COMMIT/COMMITTED handshake (CTRL) and the RWL gating on BASE/SIZE
+ * imposed by COMMITTED|LOCK_ON_COMMIT.  GLOBAL_CTRL is RW; the cap
+ * header is HwInit (writes dropped); other offsets in the per-decoder
+ * stride are RW shadow.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			   bool write);
+
+/**
+ * cxl_passthrough_cm_rw - read or write the CXL.cache/mem cap-array snapshot.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from CXL_CM_OFFSET (the start of the CM cap-array
+ *	 header in the component register block); must be 4-byte aligned
+ *	 and (off + 4) <= cm_snapshot_size.
+ * @val: pointer to a u32 holding the read result; ignored on write.
+ * @write: false for read.  Writes to the cap-array are silently dropped
+ *	   (the array headers are RO per CXL r4.0 §8.2.4); the @write
+ *	   parameter is present only to keep the API symmetric with the
+ *	   other rw helpers and to make the drop policy explicit at the
+ *	   call site.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+			  bool write);
+
+#endif /* __CXL_PASSTHROUGH_H__ */
-- 
2.25.1
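
The sub-field merge and the RW1C / RWO attribute handling in the DVSEC
write path above can be modelled in userspace.  This is a minimal
sketch, not the kernel code: merge_subfield(), apply_rw1c() and
apply_rwo_lock() are hypothetical names that mirror the arithmetic in
cxl_passthrough_dvsec_rw() and dvsec_apply_write().

```c
#include <assert.h>
#include <stdint.h>

/* Fold the low @sz bytes of @val into @cur at byte @byte_in_field. */
static uint32_t merge_subfield(uint32_t cur, uint32_t val,
			       unsigned int sz, unsigned int byte_in_field)
{
	uint32_t width_mask = (sz == 4) ? 0xffffffffu :
			      (sz == 2) ? 0xffffu : 0xffu;
	unsigned int shift = byte_in_field * 8;

	cur &= ~(width_mask << shift);
	return cur | ((val & width_mask) << shift);
}

/* RW1C: every 1 bit in @written clears the matching bit in @cur. */
static uint16_t apply_rw1c(uint16_t cur, uint16_t written)
{
	return (uint16_t)(cur & ~written);
}

/* RWO lock bit: the first write of 1 latches it; later writes are ignored. */
static uint16_t apply_rwo_lock(uint16_t cur, uint16_t written,
			       uint16_t lock_bit)
{
	if (cur & lock_bit)
		return cur;
	if (written & lock_bit)
		return (uint16_t)(cur | lock_bit);
	return cur;
}

/* Small usage example for the three helpers. */
static void dvsec_model_selfcheck(void)
{
	assert(merge_subfield(0x1234, 0xab, 1, 1) == 0xab34);
	assert(apply_rw1c(0x4001, 0x0001) == 0x4000);
	assert(apply_rwo_lock(0x0000, 0x0001, 0x0001) == 0x0001);
	assert(apply_rwo_lock(0x0001, 0x0000, 0x0001) == 0x0001);
}
```

In this model a one-byte write of 0x00 to the high byte of an RW1C
field holding 0x4001 merges to 0x0001, so feeding the merged value to
apply_rw1c() clears bit 0 as well.  A caller that wants byte-granular
RW1C would apply the clear to the written bytes only.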



* [PATCH v3 05/11] vfio: UAPI for CXL Type-2 device passthrough
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Add the user-visible interface that exposes a CXL Type-2 device to a
VMM through vfio-pci:

  VFIO_DEVICE_FLAGS_CXL (bit 9) on vfio_device_info::flags marks the
  device as CXL.

  VFIO_DEVICE_INFO_CAP_CXL (id 6) is the capability that carries the
  HDM-backed memory region index, the CXL component register region
  index, and the layout of the component register block within the
  containing PCI BAR.

  VFIO_REGION_SUBTYPE_CXL identifies the HDM memory region.
  VFIO_REGION_SUBTYPE_CXL_COMP_REGS identifies the CXL component
  register shadow.

Only the HOST_FIRMWARE_COMMITTED flag is exposed.  Other CXL device
states stay invisible to userspace at this stage.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
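
For illustration, a VMM would locate the new capability by walking the
vfio_device_info capability chain.  The sketch below assumes the usual
VFIO convention that cap_offset and each header's next field are byte
offsets from the start of the info buffer, with 0 ending the chain.
struct cap_header is a local mirror of struct vfio_info_cap_header, and
find_cap() and CAP_ID_CXL are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of the next cap from buffer start; 0 ends */
};

#define CAP_ID_CXL 6	/* VFIO_DEVICE_INFO_CAP_CXL as proposed in this patch */

/*
 * Walk the chain in an info buffer of @len bytes starting at @cap_offset.
 * Returns the offset of the capability with @id, or 0 if it is absent or
 * the chain is malformed (out of bounds or looping).
 */
static uint32_t find_cap(const uint8_t *info, size_t len,
			 uint32_t cap_offset, uint16_t id)
{
	uint32_t off = cap_offset;
	size_t hops = 0;

	while (off && hops++ < len / sizeof(struct cap_header)) {
		struct cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return 0;
		memcpy(&hdr, info + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}

/* Build a two-entry chain (id 3 at 16, id 6 at 32) and look both ways. */
static void capchain_selfcheck(void)
{
	uint8_t info[64] = { 0 };
	struct cap_header a = { .id = 3, .version = 1, .next = 32 };
	struct cap_header b = { .id = CAP_ID_CXL, .version = 1, .next = 0 };

	memcpy(info + 16, &a, sizeof(a));
	memcpy(info + 32, &b, sizeof(b));
	assert(find_cap(info, sizeof(info), 16, CAP_ID_CXL) == 32);
	assert(find_cap(info, sizeof(info), 16, 9) == 0);
}
```

The sketch reads only the header.  With five __u32 fields following the
8-byte header in struct vfio_device_info_cap_cxl, comp_reg_offset lands
at byte 28, so ABIs that align __u64 to 8 bytes insert 4 bytes of
padding there while i386 does not.
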
 include/uapi/linux/vfio.h | 46 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..3707d53c4de5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -215,6 +215,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6)	/* vfio-fsl-mc device */
 #define VFIO_DEVICE_FLAGS_CAPS	(1 << 7)	/* Info supports caps */
 #define VFIO_DEVICE_FLAGS_CDX	(1 << 8)	/* vfio-cdx device */
+#define VFIO_DEVICE_FLAGS_CXL	(1 << 9)	/* vfio-cxl Type-2 device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
@@ -257,6 +258,36 @@ struct vfio_device_info_cap_pci_atomic_comp {
 	__u32 reserved;
 };
 
+/*
+ * VFIO_DEVICE_INFO capability for CXL Type-2 passthrough devices.
+ * Present when VFIO_DEVICE_FLAGS_CXL is set on vfio_device_info::flags.
+ *
+ * @flags: VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED indicates the host CXL
+ *	subsystem committed the endpoint HDM decoder.
+ * @hdm_region_idx: VFIO region index for the HDM memory region
+ *	(subtype VFIO_REGION_SUBTYPE_CXL).
+ * @comp_reg_region_idx: VFIO region index for the CXL Component
+ *	Register shadow (subtype VFIO_REGION_SUBTYPE_CXL_COMP_REGS).
+ * @comp_reg_bar: PCI BAR index that contains the CXL component
+ *	register block.  Get-region-info on this BAR returns a
+ *	VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL block.
+ * @comp_reg_offset: byte offset of the CXL component register block
+ *	within @comp_reg_bar.
+ * @comp_reg_size: byte size of the CXL component register block.
+ */
+#define VFIO_DEVICE_INFO_CAP_CXL		6
+struct vfio_device_info_cap_cxl {
+	struct vfio_info_cap_header header;
+	__u32 flags;
+#define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED	(1 << 0)
+	__u32 hdm_region_idx;
+	__u32 comp_reg_region_idx;
+	__u32 comp_reg_bar;
+	__u32 __resv;
+	__u64 comp_reg_offset;
+	__u64 comp_reg_size;
+};
+
 /**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
  *				       struct vfio_region_info)
@@ -425,6 +456,21 @@ struct vfio_region_gfx_edid {
 #define VFIO_REGION_SUBTYPE_CCW_SCHIB		(2)
 #define VFIO_REGION_SUBTYPE_CCW_CRW		(3)
 
+/*
+ * sub-types for VFIO_REGION_TYPE_PCI_VENDOR (vendor id 1e98 reserved
+ * for the CXL Consortium); used by vfio-cxl Type-2 device passthrough.
+ *
+ * VFIO_REGION_SUBTYPE_CXL exposes the HDM-backed device memory range
+ *   as a mappable region.  The range is allocated by the host CXL
+ *   subsystem and the VMM is expected to mmap() it.
+ * VFIO_REGION_SUBTYPE_CXL_COMP_REGS exposes the CXL Component Register
+ *   block (read-write via pread()/pwrite() only, no mmap()).  The VMM
+ *   reads and writes HDM Decoder Capability registers through this
+ *   shadow region instead of touching hardware directly.
+ */
+#define VFIO_REGION_SUBTYPE_CXL			(1)
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS	(2)
+
 /* sub-types for VFIO_REGION_TYPE_MIGRATION */
 #define VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED (1)
 
-- 
2.25.1



* [PATCH v3 04/11] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

The CXL component register layout and the HDM Decoder Capability
Structure defines live in drivers/cxl/cxl.h, where userspace
consumers cannot include them without depending on kernel-only
headers.  A VMM that owns a vfio-cxl COMP_REGS shadow region needs
these defines to interpret the shadow contents.

Move the spec-defined register layout, capability identifiers, and
HDM decoder field masks to a new public uapi header,
include/uapi/cxl/cxl_regs.h.  Use __GENMASK() and _BITUL() (not
GENMASK() / BIT()) so the header is uapi-clean.  Include
<asm/bitsperlong.h> for the __BITS_PER_LONG that __GENMASK() needs.

drivers/cxl/cxl.h now includes <uapi/cxl/cxl_regs.h>; the values
are identical, so kernel callers see no change.  Static inline
helpers that use FIELD_GET stay in drivers/cxl/cxl.h.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
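
As a rough sketch of what a userspace consumer does with these
defines, the code below decodes one per-decoder CTRL dword read from a
COMP_REGS shadow.  The macros are local copies of the values in the
proposed header, shortened for the example; struct hdm_ctrl and
hdm_ctrl_decode() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of values from the proposed <uapi/cxl/cxl_regs.h>. */
#define HDM_DECODER0_CTRL_OFFSET(i)	(0x20 * (i) + 0x20)
#define HDM_CTRL_IG_MASK		0x0000000fu	/* bits 3:0 */
#define HDM_CTRL_IW_MASK		0x000000f0u	/* bits 7:4 */
#define HDM_CTRL_LOCK			(1u << 8)
#define HDM_CTRL_COMMIT			(1u << 9)
#define HDM_CTRL_COMMITTED		(1u << 10)
#define HDM_CTRL_COMMIT_ERROR		(1u << 11)

/* Hypothetical decoded view of one CTRL register. */
struct hdm_ctrl {
	unsigned int ig;	/* encoded interleave granularity */
	unsigned int iw;	/* encoded interleave ways */
	int locked;
	int commit;
	int committed;
	int commit_error;
};

static struct hdm_ctrl hdm_ctrl_decode(uint32_t ctrl)
{
	struct hdm_ctrl c = {
		.ig = ctrl & HDM_CTRL_IG_MASK,
		.iw = (ctrl & HDM_CTRL_IW_MASK) >> 4,
		.locked = !!(ctrl & HDM_CTRL_LOCK),
		.commit = !!(ctrl & HDM_CTRL_COMMIT),
		.committed = !!(ctrl & HDM_CTRL_COMMITTED),
		.commit_error = !!(ctrl & HDM_CTRL_COMMIT_ERROR),
	};

	return c;
}

/* Decoder 1's CTRL sits one 0x20 stride after decoder 0's. */
static void hdm_ctrl_selfcheck(void)
{
	struct hdm_ctrl c = hdm_ctrl_decode(0x00000612u);

	assert(HDM_DECODER0_CTRL_OFFSET(1) == 0x40);
	assert(c.ig == 2 && c.iw == 1);
	assert(c.commit && c.committed && !c.locked && !c.commit_error);
}
```
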
 drivers/cxl/cxl.h           | 52 +++++-------------------------
 include/uapi/cxl/cxl_regs.h | 63 +++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 45 deletions(-)
 create mode 100644 include/uapi/cxl/cxl_regs.h

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f43abd1903ce..583a27b6659e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -24,51 +24,13 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
  * (port-driver, region-driver, nvdimm object-drivers... etc).
  */
 
-/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
-#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
-
-/* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
-#define CXL_CM_OFFSET 0x1000
-#define CXL_CM_CAP_HDR_OFFSET 0x0
-#define   CXL_CM_CAP_HDR_ID_MASK GENMASK(15, 0)
-#define     CM_CAP_HDR_CAP_ID 1
-#define   CXL_CM_CAP_HDR_VERSION_MASK GENMASK(19, 16)
-#define     CM_CAP_HDR_CAP_VERSION 1
-#define   CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK GENMASK(23, 20)
-#define     CM_CAP_HDR_CACHE_MEM_VERSION 1
-#define   CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
-#define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
-
-#define   CXL_CM_CAP_CAP_ID_RAS 0x2
-#define   CXL_CM_CAP_CAP_ID_HDM 0x5
-#define   CXL_CM_CAP_CAP_HDM_VERSION 1
-
-/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
-#define CXL_HDM_DECODER_CAP_OFFSET 0x0
-#define   CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
-#define   CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
-#define   CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
-#define   CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
-#define   CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
-#define   CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
-#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
-#define   CXL_HDM_DECODER_ENABLE BIT(1)
-#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
-#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
-#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
-#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
-#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
-#define   CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
-#define   CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
-#define   CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
-#define   CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
-#define   CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
-#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
-#define   CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
-#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
-#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
-#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
-#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+/*
+ * Spec-defined CXL component register layout and HDM Decoder
+ * Capability Structure constants live in <uapi/cxl/cxl_regs.h> so a
+ * userspace VMM that owns a vfio-cxl COMP_REGS shadow region can
+ * consume them without depending on kernel-only headers.
+ */
+#include <uapi/cxl/cxl_regs.h>
 
 /* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
 #define CXL_DECODER_MIN_GRANULARITY 256
diff --git a/include/uapi/cxl/cxl_regs.h b/include/uapi/cxl/cxl_regs.h
new file mode 100644
index 000000000000..b284b7ad2d42
--- /dev/null
+++ b/include/uapi/cxl/cxl_regs.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
+/*
+ * CXL component register layout and HDM Decoder Capability Structure
+ * defines.  Userspace consumers (e.g. a VMM that owns a vfio-cxl
+ * COMP_REGS shadow region) need these without kernel-only header
+ * dependencies.
+ *
+ * Spec references: CXL r4.0 sections 8.2.3 and 8.2.4.20.
+ */
+#ifndef _UAPI_CXL_REGS_H_
+#define _UAPI_CXL_REGS_H_
+
+#include <asm/bitsperlong.h>	/* __BITS_PER_LONG; needed by __GENMASK() */
+#include <linux/const.h>	/* _BITUL(), _BITULL() */
+#include <linux/bits.h>		/* __GENMASK() */
+
+/* CXL r4.0 8.2.3 CXL Component Register Layout and Definition */
+#define CXL_COMPONENT_REG_BLOCK_SIZE		0x00010000
+
+/* CXL r4.0 8.2.4 CXL.cache and CXL.mem Registers */
+#define CXL_CM_OFFSET				0x1000
+#define CXL_CM_CAP_HDR_OFFSET			0x0
+#define   CXL_CM_CAP_HDR_ID_MASK		__GENMASK(15, 0)
+#define     CM_CAP_HDR_CAP_ID			1
+#define   CXL_CM_CAP_HDR_VERSION_MASK		__GENMASK(19, 16)
+#define     CM_CAP_HDR_CAP_VERSION		1
+#define   CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK	__GENMASK(23, 20)
+#define     CM_CAP_HDR_CACHE_MEM_VERSION	1
+#define   CXL_CM_CAP_HDR_ARRAY_SIZE_MASK	__GENMASK(31, 24)
+#define CXL_CM_CAP_PTR_MASK			__GENMASK(31, 20)
+
+#define   CXL_CM_CAP_CAP_ID_RAS			0x2
+#define   CXL_CM_CAP_CAP_ID_HDM			0x5
+#define   CXL_CM_CAP_CAP_HDM_VERSION		1
+
+/* HDM decoders, CXL r4.0 8.2.4.20 */
+#define CXL_HDM_DECODER_CAP_OFFSET		0x0
+#define   CXL_HDM_DECODER_COUNT_MASK		__GENMASK(3, 0)
+#define   CXL_HDM_DECODER_TARGET_COUNT_MASK	__GENMASK(7, 4)
+#define   CXL_HDM_DECODER_INTERLEAVE_11_8	_BITUL(8)
+#define   CXL_HDM_DECODER_INTERLEAVE_14_12	_BITUL(9)
+#define   CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY	_BITUL(11)
+#define   CXL_HDM_DECODER_INTERLEAVE_16_WAY	_BITUL(12)
+#define CXL_HDM_DECODER_CTRL_OFFSET		0x4
+#define   CXL_HDM_DECODER_ENABLE		_BITUL(1)
+#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i)	(0x20 * (i) + 0x10)
+#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i)	(0x20 * (i) + 0x14)
+#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i)	(0x20 * (i) + 0x18)
+#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i)	(0x20 * (i) + 0x1c)
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i)		(0x20 * (i) + 0x20)
+#define   CXL_HDM_DECODER0_CTRL_IG_MASK		__GENMASK(3, 0)
+#define   CXL_HDM_DECODER0_CTRL_IW_MASK		__GENMASK(7, 4)
+#define   CXL_HDM_DECODER0_CTRL_LOCK		_BITUL(8)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT		_BITUL(9)
+#define   CXL_HDM_DECODER0_CTRL_COMMITTED	_BITUL(10)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR	_BITUL(11)
+#define   CXL_HDM_DECODER0_CTRL_HOSTONLY	_BITUL(12)
+#define CXL_HDM_DECODER0_TL_LOW(i)		(0x20 * (i) + 0x24)
+#define CXL_HDM_DECODER0_TL_HIGH(i)		(0x20 * (i) + 0x28)
+#define CXL_HDM_DECODER0_SKIP_LOW(i)		CXL_HDM_DECODER0_TL_LOW(i)
+#define CXL_HDM_DECODER0_SKIP_HIGH(i)		CXL_HDM_DECODER0_TL_HIGH(i)
+
+#endif /* _UAPI_CXL_REGS_H_ */
-- 
2.25.1



* [PATCH v3 03/11] cxl: Record BIR and BAR offset in cxl_register_map
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

The Register Locator DVSEC (CXL r4.0 8.1.9) describes register blocks
by BAR index (BIR) and offset within the BAR.  CXL core currently
only stores the resolved HPA (resource + offset) in struct
cxl_register_map, so callers that need pci_iomap() or want to report
the BAR to userspace must reverse-engineer the BAR from the HPA.

Add bar_index and bar_offset to struct cxl_register_map and fill
them in cxl_decode_regblock() when the regblock is BAR-backed
(BIR 0-5).  Add cxl_regblock_get_bar_info() so cxl drivers
(vfio-cxl, in-kernel accelerator drivers) can read the values
without touching the struct internals.  Export under the CXL
namespace.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
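
The bookkeeping added here can be sketched in userspace as follows.
The field layout of the Register Locator entry (BIR in bits 2:0, block
identifier in bits 15:8, offset low in bits 31:16) is an assumption
taken from how cxl_decode_regblock() is understood to unpack reg_lo and
reg_hi.  struct regblock, regblock_decode() and regblock_get_bar_info()
are hypothetical stand-ins for the kernel types and helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define REGLOC_BIR_MASK		0x00000007u	/* bits 2:0 */
#define REGLOC_BLOCK_ID_MASK	0x0000ff00u	/* bits 15:8 */
#define REGLOC_OFF_LOW_MASK	0xffff0000u	/* bits 31:16 */
#define BAR_INDEX_NONE		0xff		/* sentinel used by the patch */

struct regblock {
	uint8_t bar_index;	/* 0-5, or BAR_INDEX_NONE */
	uint8_t reg_type;
	uint64_t bar_offset;	/* only meaningful when bar_index <= 5 */
};

static struct regblock regblock_decode(uint32_t reg_lo, uint32_t reg_hi)
{
	unsigned int bir = reg_lo & REGLOC_BIR_MASK;
	uint64_t offset = ((uint64_t)reg_hi << 32) |
			  (reg_lo & REGLOC_OFF_LOW_MASK);
	struct regblock rb = {
		.reg_type = (uint8_t)((reg_lo & REGLOC_BLOCK_ID_MASK) >> 8),
		.bar_index = BAR_INDEX_NONE,
		.bar_offset = 0,
	};

	if (bir <= 5) {
		rb.bar_index = (uint8_t)bir;
		rb.bar_offset = offset;
	}
	return rb;
}

/* Same contract as the new accessor: -EINVAL unless BAR-backed. */
static int regblock_get_bar_info(const struct regblock *rb,
				 uint8_t *bar_index, uint64_t *bar_offset)
{
	if (!rb || !bar_index || !bar_offset)
		return -EINVAL;
	if (rb->bar_index > 5)
		return -EINVAL;
	*bar_index = rb->bar_index;
	*bar_offset = rb->bar_offset;
	return 0;
}

static void regblock_selfcheck(void)
{
	struct regblock rb = regblock_decode(0x00010102u, 0);
	uint8_t bar;
	uint64_t off;

	assert(regblock_get_bar_info(&rb, &bar, &off) == 0);
	assert(bar == 2 && off == 0x10000 && rb.reg_type == 1);
}
```
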
 drivers/cxl/core/pci.c  |  2 ++
 drivers/cxl/core/regs.c | 34 ++++++++++++++++++++++++++++++++++
 include/cxl/cxl.h       | 12 ++++++++++++
 3 files changed, 48 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c44595447bd8..9b9b17db9ee4 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -764,6 +764,8 @@ static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
 	*map = (struct cxl_register_map) {
 		.host = &pdev->dev,
 		.resource = CXL_RESOURCE_NONE,
+		.bar_index = 0xff,
+		.bar_offset = 0,
 	};
 
 	component_reg_phys = cxl_rcd_component_reg_phys(&pdev->dev, dport);
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e828df0629d0..6af5739aa776 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -285,12 +285,46 @@ static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
 		return false;
 	}
 
+	if (bar >= 0 && bar <= 5) {
+		map->bar_index = (u8)bar;
+		map->bar_offset = offset;
+	} else {
+		map->bar_index = 0xff;
+		map->bar_offset = 0;
+	}
+
 	map->reg_type = reg_type;
 	map->resource = pci_resource_start(pdev, bar) + offset;
 	map->max_size = pci_resource_len(pdev, bar) - offset;
 	return true;
 }
 
+/**
+ * cxl_regblock_get_bar_info - read BAR index and offset for a regblock
+ * @map: regblock map produced by cxl_find_regblock()
+ * @bar_index: out, PCI BAR index (0-5)
+ * @bar_offset: out, byte offset of the regblock within the BAR
+ *
+ * Exported for cxl drivers (vfio-cxl, in-kernel accelerator drivers)
+ * that need to map the regblock via pci_iomap() or report the BAR to
+ * userspace.
+ *
+ * Return: 0 on success, -EINVAL if the regblock is not BAR-backed or
+ * if any out pointer is NULL.
+ */
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+			      u8 *bar_index, resource_size_t *bar_offset)
+{
+	if (!map || !bar_index || !bar_offset)
+		return -EINVAL;
+	if (map->bar_index > 5)
+		return -EINVAL;
+	*bar_index = map->bar_index;
+	*bar_offset = map->bar_offset;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
+
 /*
  * __cxl_find_regblock_instance() - Locate a register block or count instances by type / index
  * Use CXL_INSTANCES_COUNT for @index if counting instances.
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 3dcc034360af..3bcb71d80c91 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -100,9 +100,16 @@ struct cxl_pmu_reg_map {
  * @resource: physical resource base of the register block
  * @max_size: maximum mapping size to perform register search
  * @reg_type: see enum cxl_regloc_type
+ * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xff otherwise
+ * @bar_offset: offset within the BAR; only valid when bar_index <= 5
  * @component_map: cxl_reg_map for component registers
  * @device_map: cxl_reg_maps for device registers
  * @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
+ *
+ * When the register block is described by the Register Locator DVSEC with
+ * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so callers
+ * can use pci_iomap(pdev, bar_index, size) and base + bar_offset instead
+ * of ioremap(resource).
  */
 struct cxl_register_map {
 	struct device *host;
@@ -110,6 +117,8 @@ struct cxl_register_map {
 	resource_size_t resource;
 	resource_size_t max_size;
 	u8 reg_type;
+	u8 bar_index;
+	resource_size_t bar_offset;
 	union {
 		struct cxl_component_reg_map component_map;
 		struct cxl_device_reg_map device_map;
@@ -234,4 +243,7 @@ int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
 		     resource_size_t *offset, resource_size_t *size);
 
 int cxl_await_range_active(struct cxl_dev_state *cxlds);
+
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+			      u8 *bar_index, resource_size_t *bar_offset);
 #endif /* __CXL_CXL_H__ */
-- 
2.25.1



* [PATCH v3 02/11] cxl: Split cxl_await_range_active() from media-ready wait
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Before accessing CXL device memory after reset or power-on, the
driver must ensure media is ready.  Not every CXL device implements
the CXL Memory Device register group: many Type-2 devices do not.
cxl_await_media_ready() reads cxlds->regs.memdev unconditionally, so
calling it on a Type-2 device that lacks those registers can result
in a kernel panic.

Split the HDM DVSEC range-active poll out of cxl_await_media_ready()
into a new helper cxl_await_range_active().  Type-2 cxl drivers
(vfio-cxl, in-kernel accelerator drivers) that lack the CXLMDEV
status register call this directly.  cxl_await_media_ready() now
calls cxl_await_range_active() for the DVSEC poll, then reads the
memory device status as before.

The per-range timeout that cxl_await_media_ready() already used
(media_ready_timeout module parameter, 60 seconds by default) still
applies.  Export the new helper under the CXL namespace.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
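
The control flow after the split can be modelled in userspace as
below.  This is a sketch only: struct fake_dev and both functions are
hypothetical, and the real helper polls the DVSEC range registers with
a timeout rather than sampling a flag once.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model: up to two DVSEC ranges, optional memdev. */
struct fake_dev {
	int hdm_count;
	bool range_active[2];
	bool has_memdev;	/* false for many Type-2 devices */
	bool memdev_ready;
};

/* Models cxl_await_range_active(): DVSEC ranges only, no memdev access. */
static int await_range_active(const struct fake_dev *d)
{
	for (int i = 0; i < d->hdm_count; i++) {
		if (!d->range_active[i])
			return -ETIMEDOUT;
	}
	return 0;
}

/* Models cxl_await_media_ready(): range poll first, then memdev status. */
static int await_media_ready(const struct fake_dev *d)
{
	int rc = await_range_active(d);

	if (rc)
		return rc;
	/* The kernel reads regs.memdev here, so the block must exist. */
	assert(d->has_memdev);
	return d->memdev_ready ? 0 : -EIO;
}
```

A Type-2 driver in this model calls await_range_active() directly and
never reaches the memdev read; a driver for a device with the memory
device registers keeps calling await_media_ready().
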
 drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
 include/cxl/cxl.h      |  2 ++
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c917608c16f9..c44595447bd8 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
 	return 0;
 }
 
-/*
- * Wait up to @media_ready_timeout for the device to report memory
- * active.
+/**
+ * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to be active
+ * @cxlds: CXL device state (DVSEC and HDM count must be valid)
+ *
+ * For each HDM decoder range reported in the CXL DVSEC capability, waits
+ * for the range to report MEM INFO VALID (up to 1s per range), then
+ * MEM ACTIVE (up to media_ready_timeout seconds per range, default 60s).
+ * Used by cxl_await_media_ready() and by cxl drivers that bind to Type-2
+ * devices without the memdev mailbox (e.g. vfio-cxl, accelerator drivers).
+ *
+ * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a
+ * timeout occurs, or a negative errno from config read on failure.
  */
-int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int cxl_await_range_active(struct cxl_dev_state *cxlds)
 {
 	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
 	int d = cxlds->cxl_dvsec;
 	int rc, i, hdm_count;
-	u64 md_status;
 	u16 cap;
 
 	rc = pci_read_config_word(pdev,
@@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
 			return rc;
 	}
 
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
+
+/*
+ * Wait up to @media_ready_timeout for the device to report memory
+ * active.
+ */
+int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+{
+	u64 md_status;
+	int rc;
+
+	rc = cxl_await_range_active(cxlds);
+	if (rc)
+		return rc;
+
 	md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
 	if (!CXLMDEV_READY(md_status))
 		return -EIO;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 440ab09c640e..3dcc034360af 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -232,4 +232,6 @@ int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
 
 int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
 		     resource_size_t *offset, resource_size_t *size);
+
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
 #endif /* __CXL_CXL_H__ */
-- 
2.25.1


^ permalink raw reply related

* [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest

From: Manish Honap <mhonap@nvidia.com>

CXL Type-2 accelerators (CXL.mem-capable GPUs and similar) cannot be
passed through to virtual machines with stock vfio-pci because the
driver has no concept of HDM decoder management, HDM region exposure,
or component register virtualization.  This series adds those three
pieces, sufficient for a guest to use the device's firmware-committed
coherent memory under UVM / ATS.

v3 is a rewrite of the v2 framework form, responding to Dan's request
in the v2 review for "less emulation, narrower interfaces, and a
closer mapping to the spec language."
In this release, cxl-core exposes four EXPORT_SYMBOL_GPL helpers behind
an opaque handle.  vfio-pci becomes a thin transport on top of those.
Please see "Changes since v2" and "Reviewer feedback addressed" below for
the per-area summary.

Motivation
==========

A CXL Type-2 device exposes its HDM-mapped device memory through HDM
decoders that BIOS programs and commits at boot.  To pass such a
device to a guest, vfio-pci has to do three things at once:

  1. Surface the firmware-committed HDM-mapped HPA range as a guest-
     mmappable region.

  2. Surface a CXL-spec-compliant view of the CXL Device DVSEC body,
     the HDM Decoder Capability block, and the CXL.cache/mem cap-array
     prefix, so the guest's CXL driver enumerates the same topology
     the host saw.

  3. Keep the host's committed decoder configuration intact (the
     physical decoder is never reprogrammed) while letting the guest
     observe and manage a shadow that follows the per-field write
     semantics in the spec.

The series builds on Alejandro Lucero-Palau's v28 work
applied on for-7.3/cxl-type2-enabling [1] (sfc is the in-tree consumer
today). vfio-pci becomes the second consumer.

Architecture
============

cxl-core owns the CXL semantics.  A new file
drivers/cxl/core/passthrough.c (gated by hidden Kconfig
CXL_VFIO_PASSTHROUGH) provides four exported symbols:

    struct cxl_passthrough *
    devm_cxl_passthrough_create(struct device *dev,
                                struct cxl_dev_state *cxlds);

    int cxl_passthrough_dvsec_rw(p, off, val, sz, write);
    int cxl_passthrough_hdm_rw  (p, off, val,      write);
    int cxl_passthrough_cm_rw   (p, off, val,      write);

cxl_passthrough is an opaque handle; vfio-pci sees no cxl-internal
struct pointers.  The shadows are snapshotted at create time: the
DVSEC body from PCI config space dword by dword, the CM cap-array and
HDM block from the cxl-core MMIO mapping at cxlds->reg_map.base.
Per-field write semantics are as follows.

CXL r4.0 8.1.3 DVSEC:
- LOCK is RWO,
- CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
- STATUS/STATUS2 are RW1C,
- RANGE1 is HwInit, RANGE2 is RsvdZ.

CXL r4.0 8.2.4.20 HDM:
- GLOBAL_CTRL is RW,
- decoder CTRL implements COMMIT/COMMITTED,
- decoder BASE/SIZE are RWL gated on COMMITTED or LOCK_ON_COMMIT,
- cap header is HwInit.
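
As a rough illustration of what these access types mean for a shadow
register, the sketch below merges a guest write into a shadow dword one
field at a time. It is not the code in passthrough.c: struct
shadow_field and shadow_write() are hypothetical, and RWO is modelled as
"writing 1 sets the bit and it can never be cleared again", which is an
assumption about how the spec attribute is interpreted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access types named in the cover letter. */
enum field_attr {
	ATTR_RW,	/* plain read-write */
	ATTR_RWO,	/* modelled as set-only: writing 1 sticks, 0 is ignored */
	ATTR_RWL,	/* read-write until the gating lock condition is set */
	ATTR_RW1C,	/* writing 1 clears the bit, writing 0 leaves it alone */
	ATTR_HWINIT,	/* fixed by hardware/firmware, guest writes dropped */
	ATTR_RSVDZ,	/* reserved, kept at zero */
};

/* Hypothetical description of one field inside a 32-bit shadow register. */
struct shadow_field {
	enum field_attr attr;
	uint32_t mask;	/* bits of the dword owned by this field */
};

/*
 * Return the new shadow value after a guest write of @val.  @locked is
 * the gating condition for RWL fields (e.g. CONFIG_LOCK or COMMITTED).
 */
static uint32_t shadow_write(const struct shadow_field *f, uint32_t shadow,
			     uint32_t val, bool locked)
{
	switch (f->attr) {
	case ATTR_RW:
		return (shadow & ~f->mask) | (val & f->mask);
	case ATTR_RWO:
		return shadow | (val & f->mask);
	case ATTR_RWL:
		if (locked)
			return shadow;
		return (shadow & ~f->mask) | (val & f->mask);
	case ATTR_RW1C:
		return shadow & ~(val & f->mask);
	case ATTR_HWINIT:
		return shadow;
	case ATTR_RSVDZ:
		return shadow & ~f->mask;
	}
	return shadow;
}
```

Keeping the rule per field rather than per register lets one dword mix,
for example, RW1C status bits with HwInit bits.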

vfio-pci becomes a thin transport.  The new module
drivers/vfio/pci/cxl/ exposes two VFIO regions.

  VFIO_REGION_SUBTYPE_CXL (HDM region): mmappable view of the
  HDM-mapped HPA. The mmap fault handler calls vmf_insert_pfn() from
  the physical HPA. pread/pwrite go through the memremap_wb() kva
  captured at bind time.

  VFIO_REGION_SUBTYPE_CXL_COMP_REGS (component register shadow):
  pread/pwrite only, dword-aligned (-EINVAL on misalignment).
  Each dword dispatches by offset to cxl_passthrough_cm_rw() or
  cxl_passthrough_hdm_rw(). No shadow state on the vfio side; cxl-core
  enforces the spec.
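
A minimal sketch of the per-dword dispatch described for the COMP_REGS
region, assuming the HDM block location is known from
cxl_get_hdm_info(). struct comp_layout, enum comp_target and
comp_regs_route() are hypothetical names, and routing every offset
outside the HDM block to the CM handler is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Which cxl-core helper would handle the dword. */
enum comp_target {
	TARGET_CM,	/* cxl_passthrough_cm_rw() in the series */
	TARGET_HDM,	/* cxl_passthrough_hdm_rw() in the series */
};

/* Hypothetical layout of the component register shadow region. */
struct comp_layout {
	uint64_t hdm_offset;	/* start of the HDM block within the region */
	uint64_t hdm_size;	/* size of the HDM block in bytes */
};

/* Decide which handler owns the dword at @off; -EINVAL if misaligned. */
static int comp_regs_route(const struct comp_layout *l, uint64_t off,
			   enum comp_target *target)
{
	if (off & 0x3)
		return -EINVAL;

	if (off >= l->hdm_offset && off - l->hdm_offset < l->hdm_size)
		*target = TARGET_HDM;
	else
		*target = TARGET_CM;

	return 0;
}
```

A multi-dword pread/pwrite would call this once per dword and stop at the
first error.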

CXL DVSEC config-space accesses use a clipping shim in
vfio_pci_config_rw_single(). A config-space chunk that crosses the
DVSEC body boundary is split: header bytes go through the generic
perm-bits path, body bytes go through cxl_passthrough_dvsec_rw().
The shim replaces v2's approach of repointing ecap_perms[].
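
The clipping step can be pictured with the sketch below, which computes
how many bytes of an access can be handled by a single path before the
DVSEC body boundary is crossed. clip_chunk() and its parameters are
hypothetical and do not mirror the shim's actual signature; the body is
treated as the half-open range [body_start, body_end).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * For an access of @count bytes starting at @pos, return the number of
 * bytes that stay on one side of the DVSEC body boundary, and report in
 * @in_body whether those bytes belong to the body (cxl-core path) or not
 * (generic perm-bits path).
 */
static size_t clip_chunk(size_t pos, size_t count, size_t body_start,
			 size_t body_end, bool *in_body)
{
	size_t limit;

	*in_body = pos >= body_start && pos < body_end;

	if (*in_body)
		limit = body_end;	/* stop at the end of the body */
	else if (pos < body_start)
		limit = body_start;	/* stop where the body begins */
	else
		return count;		/* past the body: nothing to clip */

	return count < limit - pos ? count : limit - pos;
}
```

The caller would loop, advancing pos by the returned length, until the
whole access has been consumed. With a body starting at a hypothetical
offset 0x10a, a 4-byte access at 0x108 is split into two 2-byte chunks.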

Sparse-mmap is exposed on the component BAR so userspace can mmap the
non-component portions directly; only the CXL component register
sub-range goes through pread/pwrite emulation. The CXL sub-range is
also skipped from vfio_pci-core's request_selected_regions() set
because cxl-core's devm_cxl_probe_mem() already holds a
request_mem_region() on it; the asymmetric skip is matched by an
asymmetric release on disable().

Scope and out-of-scope
======================

In scope (rejected at create time with -EOPNOTSUPP otherwise):

  - Firmware-committed devices (HOST_FIRMWARE_COMMITTED set).
  - Single HDM decoder (hdm_count == 1).
  - No interleave (IW == 0).
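
Expressed as code, the three in-scope conditions amount to a check along
these lines. struct scope_input and passthrough_check_scope() are
hypothetical; only the conditions and the -EOPNOTSUPP result come from
the list above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what is inspected at create time. */
struct scope_input {
	bool firmware_committed;	/* HOST_FIRMWARE_COMMITTED set */
	uint8_t hdm_count;		/* number of HDM decoders */
	uint8_t iw;			/* encoded interleave-ways field */
};

/* Return 0 if the device is in scope, -EOPNOTSUPP otherwise. */
static int passthrough_check_scope(const struct scope_input *in)
{
	if (!in->firmware_committed)
		return -EOPNOTSUPP;
	if (in->hdm_count != 1)
		return -EOPNOTSUPP;
	if (in->iw != 0)
		return -EOPNOTSUPP;
	return 0;
}
```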

Out of scope, deferred for follow-on work:

  - Multi-decoder devices and interleave.
  - Guest-driven (non-firmware-committed) HDM commit.
  - Hotplug, FLR, and sibling-function reset of CXL Type-2 devices.

Changes since v2
================

This is a rewrite, not an incremental update.  The structure of the
series changed (20 patches in v2 to 11 in v3) because v3 collapses
v2 patches 9-15 (detection, HDM emulation, media readiness, region
management, HDM region, DVSEC emulation) into one cxl-core helper
file and one vfio-pci consumer.

Framework replaced by narrow opaque-handle helpers (patches 6, 8)

  v2 carried a generic register-emulation framework split across four
  state-machine files in cxl-core.
  v3 collapses it into one file: drivers/cxl/core/passthrough.c
  exposing the four EXPORT_SYMBOL_GPL helpers above behind a struct
  cxl_passthrough opaque handle.

Shadow ownership moved into cxl-core (patches 6, 8)

  vfio-pci no longer keeps any per-field state. It forwards
  (offset, value) into cxl-core, and cxl-core enforces the spec
  (RWO, RWL, RW1C, HwInit, RsvdZ) with explicit CXL r4.0 section
  references in the switch arms.

DVSEC config-space clipping shim (patch 8)

  v2 repointed ecap_perms[] to redirect CXL DVSEC reads and writes.
  v3 keeps ecap_perms[] untouched and clips per-config-access chunks
  at the DVSEC body boundary in vfio_pci_config_rw_single(); header bytes
  go through the generic perm-bits path, body bytes go through
  cxl_passthrough_dvsec_rw(). The shim is local to the per-device
  path.

CONFIG_VFIO_PCI_CXL gates the new module (patch 7)

  v2 had a CONFIG_VFIO_CXL_CORE Kconfig stub; v3 renames it to
  CONFIG_VFIO_PCI_CXL to match the vfio-pci naming convention.
  The hidden CXL_VFIO_PASSTHROUGH selects the cxl-core helper file
  on demand. With both disabled, the cxl-core size is unchanged.

UAPI rewritten with named fields (patch 5)

  vfio_device_info_cap_cxl in v3 carries:
    flags + HOST_FIRMWARE_COMMITTED bit
    hdm_region_idx
    comp_reg_region_idx
    comp_reg_bar
    comp_reg_offset
    comp_reg_size
  The DPA terminology is renamed to HDM region throughout.
  CACHE_CAPABLE (HDM-DB indicator) is dropped;
  it was informational only in v2 with no caller, and it can be
  re-added later by a series that actually plumbs CXL.cache.

Selftests trimmed (patch 9)

  v2 carried selftests for device detection, capability parsing,
  region enumeration, HDM register emulation, HDM mmap with
  page-fault insertion, FLR invalidation, and DVSEC register
  emulation. v3 keeps a smoke-test set of six focused tests:

    device_is_cxl                  GET_INFO advertises FLAGS_CXL
                                   and a populated CAP_CXL.
    hdm_region_mmap_rw             mmap one page, write+read back.
    component_bar_sparse_mmap      SPARSE_MMAP cap excludes the
                                   CXL component register sub-range.
    comp_regs_cm_cap_array_read    pread of the CM cap-array
                                   header at CXL_CM_OFFSET succeeds
                                   (CAP_ID == 1).
    dvsec_lock_byte_read           pread of the DVSEC CONFIG_LOCK
                                   byte through the clipping shim
                                   succeeds.
    hdm_decoder_commit_fsm         COMMIT / COMMITTED state machine
                                   and LOCK_ON_COMMIT behaviour.

  FLR invalidation, page-fault insertion under load, and full
  DVSEC field-by-field write coverage are deferred to a follow-on
  selftest series. The current six are the minimal set that
  exercises the kernel-side contract end-to-end.

cxl-core prep patches split (patches 1-4)

  v3 keeps the cxl-side enablers from v2 patches 1-4 but each as
  a standalone change so the cxl maintainer can review the helper
  API independently of the vfio consumer:

    [1/11] cxl_get_hdm_info()
    [2/11] cxl_await_range_active() split from media-ready wait
    [3/11] cxl_register_map records BIR + BAR offset
    [4/11] component/HDM register defines moved to uapi/cxl/cxl_regs.h

Reviewer feedback addressed
===========================

Dan
---

- VFIO exposes HDM/host-visible region, not raw DPA; docs/UAPI say HDM
  region, DPA only inside cxl-core where appropriate.
- One vfio-pci device = one HDM region / one decoder, no interleave;
  hdm_count != 1 → -EOPNOTSUPP.
- Global HDM on DVSEC Range Base treated as legacy; RANGE1/RANGE2
  read-only snapshot, guest writes dropped.
- No guest/kernel lock games; DVSEC LOCK and HDM LOCK_ON_COMMIT RWO,
  fixed at create from firmware snapshot.
- Opaque cxl_passthrough handle only; vfio gets HPA via memdev probe +
  layout via cxl_get_hdm_info(), rw via helpers.
- No multi-region accelerator case in v3; single region enforced,
  multi-region deferred.
- cxl_await_range_active stays in cxl-core probe; not exported, vfio does
  not call it.
- No guest LOCK→0 reprogram; guest cannot clear LOCK to remap host HPA;
  kernel uncommit tied to COMMIT, not LOCK alone.

Jason / Gregory / Dan
---------------------

- memremap(WB) + request_mem_region on HPA; conflicting direct-map/EFI use
  fails probe with -EBUSY.

Jonathan
--------

- uapi/cxl/cxl_regs.h for register defines so VMMs need no private
  kernel headers.
- __free() locals on cxl-core/passthrough error paths instead of
  struct-owned temporaries.
- No "precommitted at probe" assumption; acquire checks COMMITTED in
  HDM shadow and refuses if missing.

Dave
----

- memremap(MEMREMAP_WB) for HDM host mapping (not ioremap_cache).
- Renamed cap flag to VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED for clarity.
- __free() / DEFINE_FREE() cleanup in new passthrough.c create path.

Patch series
============

 [1/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
 [2/11] cxl: Split cxl_await_range_active() from media-ready wait
 [3/11] cxl: Record BIR and BAR offset in cxl_register_map
 [4/11] cxl: Move component/HDM register defines to
        uapi/cxl/cxl_regs.h
 [5/11] vfio: UAPI for CXL Type-2 device passthrough
 [6/11] cxl: Add register-virtualization helpers for vfio Type-2
        passthrough
 [7/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
        acquisition
 [8/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping
        shim
 [9/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
[10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
[11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions

Dependencies
============

[1] [PATCH v28 0/5] Type2 device basic support
https://lore.kernel.org/linux-cxl/20260618181806.118745-1-alejandro.lucero-palau@amd.com/

[2] Previous version of this patch series
[PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/

[3] Companion QEMU series
[RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
https://lore.kernel.org/linux-cxl/20260427181235.3003865-1-mhonap@nvidia.com/

Manish Honap (11):
  cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
  cxl: Split cxl_await_range_active() from media-ready wait
  cxl: Record BIR and BAR offset in cxl_register_map
  cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  vfio: UAPI for CXL Type-2 device passthrough
  cxl: Add register-virtualization helpers for vfio Type-2 passthrough
  vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
    acquisition
  vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
  selftests/vfio: Add CXL Type-2 device passthrough smoke test
  docs: vfio-pci: Document CXL Type-2 device passthrough
  vfio/pci: Provide opt-out for CXL Type-2 extensions

 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst     | 282 ++++++
 drivers/cxl/Kconfig                           |   7 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/passthrough.c                | 590 ++++++++++++
 drivers/cxl/core/pci.c                        |  70 +-
 drivers/cxl/core/regs.c                       |  35 +
 drivers/cxl/cxl.h                             |  52 +-
 drivers/vfio/pci/Kconfig                      |   2 +
 drivers/vfio/pci/Makefile                     |   1 +
 drivers/vfio/pci/cxl/Kconfig                  |  34 +
 drivers/vfio/pci/cxl/Makefile                 |   2 +
 drivers/vfio/pci/cxl/vfio_cxl_core.c          | 889 ++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h          |  71 ++
 drivers/vfio/pci/vfio_pci.c                   |   9 +
 drivers/vfio/pci/vfio_pci_config.c            |  31 +
 drivers/vfio/pci/vfio_pci_core.c              |  68 +-
 drivers/vfio/pci/vfio_pci_priv.h              |  93 ++
 drivers/vfio/pci/vfio_pci_rdwr.c              |  17 +
 include/cxl/cxl.h                             |  18 +
 include/cxl/passthrough.h                     | 121 +++
 include/linux/vfio_pci_core.h                 |   8 +
 include/uapi/cxl/cxl_regs.h                   |  63 ++
 include/uapi/linux/vfio.h                     |  46 +
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |  11 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 350 +++++++
 27 files changed, 2821 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
 create mode 100644 drivers/cxl/core/passthrough.c
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/Makefile
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
 create mode 100644 include/cxl/passthrough.h
 create mode 100644 include/uapi/cxl/cxl_regs.h
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c

base-commit: 90cf2e0d702c8a132ccbe72e7687f33c04c14658
-- 
2.25.1

^ permalink raw reply

* [PATCH v3 01/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
From: mhonap @ 2026-06-25 16:53 UTC (permalink / raw)
  To: djbw, alex, jgg, jic23, dave.jiang, ankita,
	alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
	ira.weiny
  Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
	linux-kernel, linux-kselftest
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

cxl_probe_component_regs() finds the HDM decoder block during device
probe and caches its location, but does not record the decoder count
and does not expose the result outside drivers/cxl/.

In-kernel cxl drivers (Type-2 accelerator drivers, vfio-cxl) need the
decoder count and the byte offset and size of the HDM block without
re-running the probe sequence.

Record decoder_cnt in rmap->count when parsing the HDM capability in
cxl_probe_component_regs(), extend struct cxl_reg_map with a count
member, and add cxl_get_hdm_info() to return offset, size, and count
from the cached map.  Export under the CXL namespace.
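
For illustration, a userspace model of the cached map and the lookup.
The names with a _model suffix and hdm_block_len() are hypothetical; the
length formula is the one cxl_probe_component_regs() already uses
(0x20 bytes per decoder plus a 0x10 byte header), so a single-decoder
device has a 0x30 byte HDM block.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for struct cxl_reg_map with the new count member. */
struct reg_map_model {
	bool valid;
	unsigned long offset;
	unsigned long size;
	uint8_t count;
};

/* HDM block length as computed during component register probe. */
static unsigned long hdm_block_len(uint8_t decoder_cnt)
{
	return 0x20UL * decoder_cnt + 0x10;
}

/* Mirrors the shape of cxl_get_hdm_info() against the model above. */
static int get_hdm_info_model(const struct reg_map_model *hdm, uint8_t *count,
			      unsigned long *offset, unsigned long *size)
{
	if (!count || !offset || !size)
		return -EINVAL;

	if (!hdm->valid)
		return -ENODEV;

	*count = hdm->count;
	*offset = hdm->offset;
	*size = hdm->size;

	return 0;
}
```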

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/core/pci.c  | 33 +++++++++++++++++++++++++++++++++
 drivers/cxl/core/regs.c |  1 +
 include/cxl/cxl.h       |  4 ++++
 3 files changed, 38 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 2bcd683aa286..c917608c16f9 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -449,6 +449,39 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
 
+/**
+ * cxl_get_hdm_info - Get HDM decoder register block location and count
+ * @cxlds: CXL device state (must have component regs enumerated via
+ *	   cxl_probe_component_regs())
+ * @count:  number of HDM decoders (from HDM Capability bits [3:0])
+ * @offset: byte offset of HDM decoder block within the component register BAR
+ * @size:   size in bytes of the HDM decoder block
+ *
+ * Exported for cxl drivers (in-kernel accelerator drivers, vfio-cxl) that
+ * need HDM decoder metadata from the cached component-register map without
+ * re-running the probe sequence.
+ *
+ * Return: 0 on success. -ENODEV if the HDM decoder block is not present.
+ */
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+		     resource_size_t *offset, resource_size_t *size)
+{
+	struct cxl_reg_map *hdm = &cxlds->reg_map.component_map.hdm_decoder;
+
+	if (WARN_ON(!count || !offset || !size))
+		return -EINVAL;
+
+	if (!hdm->valid)
+		return -ENODEV;
+
+	*count	= hdm->count;
+	*offset = hdm->offset;
+	*size	= hdm->size;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, "CXL");
+
 #define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
 #define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
 #define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 20c2d9fbcfe7..e828df0629d0 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -85,6 +85,7 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			decoder_cnt = cxl_hdm_decoder_count(hdr);
 			length = 0x20 * decoder_cnt + 0x10;
 			rmap = &map->hdm_decoder;
+			rmap->count = decoder_cnt;
 			break;
 		}
 		case CXL_CM_CAP_CAP_ID_RAS:
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 802b143de83d..440ab09c640e 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -75,6 +75,7 @@ struct cxl_reg_map {
 	int id;
 	unsigned long offset;
 	unsigned long size;
+	u8 count;
 };
 
 struct cxl_component_reg_map {
@@ -228,4 +229,7 @@ struct cxl_memdev *devm_cxl_probe_mem(struct cxl_dev_state *cxlds,
 				      struct range *range);
 
 int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
+
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+		     resource_size_t *offset, resource_size_t *size);
 #endif /* __CXL_CXL_H__ */
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v2 7/8] dt-bindings: riscv: Add generic CBQRI controller binding
From: Conor Dooley @ 2026-06-25 16:19 UTC (permalink / raw)
  To: Drew Fustini
  Cc: Adrien Ricciardi, Alexandre Ghiti, Atish Kumar Patra, Atish Patra,
	Babu Moger, Ben Horgan, Borislav Petkov, Chen Pei, Conor Dooley,
	Conor Dooley, Dave Hansen, Dave Martin, Fenghua Yu, Gong Shuai,
	Gong Shuai, guo.wenjia23, James Morse, Kornel Dulęba,
	Krzysztof Kozlowski, liu.qingtao2, Liu Zhiwei, Palmer Dabbelt,
	Paul Walmsley, Peter Newman, Radim Krčmář,
	Reinette Chatre, Rob Herring, Samuel Holland,
	Sebastian Andrzej Siewior, Tony Luck, Vasudevan Srinivasan,
	Ved Shanbhogue, Weiwei Li, yunhui cui, linux-kernel, linux-riscv,
	x86, devicetree, linux-rt-devel, linux-doc
In-Reply-To: <20260624-dfustini-atl-sc-cbqri-dt-v2-7-2f8049fd902b@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5686 bytes --]

On Wed, Jun 24, 2026 at 06:38:35PM -0700, Drew Fustini wrote:
> Document the generic compatibles for capacity and bandwidth controllers
> that implement the RISC-V CBQRI specification. The binding also
> describes the common riscv,cbqri-rcid and riscv,cbqri-mcid properties,
> and the optional riscv,cbqri-cache phandle that links a capacity
> controller to the cache whose capacity it allocates.
> 
> Assisted-by: Claude:claude-opus-4-8
> Co-developed-by: Adrien Ricciardi <aricciardi@baylibre.com>
> Signed-off-by: Adrien Ricciardi <aricciardi@baylibre.com>
> Signed-off-by: Drew Fustini <fustini@kernel.org>
> ---
>  .../devicetree/bindings/riscv/riscv,cbqri.yaml     | 97 ++++++++++++++++++++++
>  MAINTAINERS                                        |  1 +
>  2 files changed, 98 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml b/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
> new file mode 100644
> index 0000000000000000000000000000000000000000..5d6be645381780e187b39e60c3bb487fdf2cfb69
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
> @@ -0,0 +1,97 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/riscv/riscv,cbqri.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: RISC-V Capacity and Bandwidth QoS Register Interface (CBQRI) controller
> +
> +description: |
> +  The RISC-V CBQRI specification defines capacity-controller and
> +  bandwidth-controller register blocks that allocate cache capacity and memory
> +  bandwidth to resource-control IDs (RCIDs) and monitor usage per
> +  monitoring-counter ID (MCID):
> +  https://github.com/riscv-non-isa/riscv-cbqri/blob/main/riscv-cbqri.pdf
> +
> +  Allocation and monitoring share one register block, and a controller may
> +  implement either or both. A driver discovers which at runtime from the
> +  capabilities register, so the compatible names only the controller type. It
> +  does not distinguish allocation-only, monitoring-only or combined
> +  controllers, and no property declares monitoring support.
> +
> +maintainers:
> +  - Drew Fustini <fustini@kernel.org>
> +
> +properties:
> +  compatible:
> +    oneOf:
> +      - items:
> +          - description: Tenstorrent Ascalon Shared Cache
> +            const: tenstorrent,ascalon-sc-cbqri
> +          - const: riscv,cbqri-capacity-controller
> +      - enum:
> +          - riscv,cbqri-capacity-controller
> +          - riscv,cbqri-bandwidth-controller

Please modify this, as has been done for other riscv spec related
bindings, to let people get away without using device-specific
compatibles.

In this case, you can just delete the first entry from this enum, since
it already has a user, and you only have to implement this feedback for
the second entry.

pw-bot: changes-requested

> +
> +  reg:
> +    maxItems: 1
> +    description:
> +      The CBQRI controller register block.
> +
> +  riscv,cbqri-rcid:
> +    $ref: /schemas/types.yaml#/definitions/uint32
> +    description:
> +      The maximum number of RCIDs the controller supports. RCIDs are the
> +      resource-control IDs that allocation operations target.
> +
> +  riscv,cbqri-mcid:
> +    $ref: /schemas/types.yaml#/definitions/uint32
> +    description:
> +      The maximum number of MCIDs the controller supports. MCIDs are the
> +      monitoring-counter IDs that usage-monitoring operations target. Present
> +      on controllers that implement monitoring.
> +
> +  riscv,cbqri-cache:
> +    $ref: /schemas/types.yaml#/definitions/phandle
> +    description:
> +      Phandle to the cache node whose capacity this controller allocates.
> +      Applies to capacity controllers that back a CPU cache. The cache level
> +      and the harts sharing it are taken from that node's cache topology.

Architecturally, is it impossible for a capacity controller to control
more than one cache?

> +
> +required:
> +  - compatible
> +  - reg
> +
> +allOf:
> +  - if:
> +      properties:
> +        compatible:
> +          contains:
> +            const: tenstorrent,ascalon-sc-cbqri
> +    then:
> +      required:
> +        - riscv,cbqri-rcid
> +        - riscv,cbqri-cache
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    l2_cache: l2-cache {
> +        compatible = "cache";
> +        cache-level = <2>;
> +        cache-unified;
> +        cache-size = <0xc00000>;
> +        cache-sets = <512>;
> +        cache-block-size = <64>;
> +    };
> +
> +    cache-controller@a21a00c0 {
> +        compatible = "tenstorrent,ascalon-sc-cbqri",
> +                     "riscv,cbqri-capacity-controller";

Is this or is this not a cache controller?
The compatible and the fact that the property points to an actual cache
controller suggest that it is not.

Cheers,
Conor.

> +        reg = <0xa21a00c0 0xf40>;
> +        riscv,cbqri-rcid = <16>;
> +        riscv,cbqri-cache = <&l2_cache>;
> +    };
> +
> +...
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9e1092165046c773771b055869030bc1bdb64b16..64a95a4d795a57033d3f36200d98cfb4a013ab94 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23298,6 +23298,7 @@ M:	Drew Fustini <fustini@kernel.org>
>  R:	yunhui cui <cuiyunhui@bytedance.com>
>  L:	linux-riscv@lists.infradead.org
>  S:	Supported
> +F:	Documentation/devicetree/bindings/riscv/riscv,cbqri.yaml
>  F:	arch/riscv/include/asm/qos.h
>  F:	arch/riscv/include/asm/resctrl.h
>  F:	arch/riscv/kernel/qos.c
> 
> -- 
> 2.34.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Andrew Lunn @ 2026-06-25 16:12 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <38bafe7e-d419-46f7-8fa7-87e9183e578c@bootlin.com>

> This isn't sphynx, but I've come-up with something like this for a
> test definition :
> 
> 
> @ksft_ethtool_needs_supported_anyof([Pause, Asym_Pause])
> def test_ethtool_pause_advertising(cfg, peer) -> None:
>     """Pause advertisement
> 
>     Validate that changing pause params through the ETHTOOL_MSG_PAUSE command
>     translates to a change in the advertised pause params, and that these
>     parameters are correct w.r.t the supported pause params and requested pause
>     params.
>     
>     This exercises the .set_pauseparams() ethtool ops for MAC configuration,
>     as well as the reconfiguration of the PHY's advertising and negociation.
>     
>     On non-phylink MACs, the MAC should call phy_set_sym_pause() to update the
>     PHY's advertising, and restart a negotiation with phy_start_aneg() if
>     need be. Failure to do so will result on the wrong advertising parameters.
>     
>     Pn phylink-enabled MACs, phylink deals with the PHY reconfiguration provided

On 

>     the MAC driver calls phylink_ethtool_set_pauseparam().
>     
>     Failing this test likely means that the PHY driver is not correctly advertising
>     pause settings, either due to the MAC not triggering a PHY reconfiguration,
>     a misconficonfiguration of the advertising registers by the PHY, or by
>     mis-handling the phydev->advertising bitfield in the PHY driver directly.
>     
>     The validation is made by looking at the advertised modes locally, as well as
>     what the peer's 'lp_advertising' values report.
> 
>     cfg -- local device's interface configuration
>     peer -- peer device handle

Plain Sphinx can be made to pick up this method documentation and
include it in the generated documentation. You would use something like

.. automethod:: test_ethtool_pause_advertising

in the .rst file.

I've no idea if the kernel configuration of Sphinx allows this. At the
moment, I would not spend too much time on getting Sphinx to generate
documentation. I would say that is nice to have. The description
itself is more important.

>     """
> 
>     # Initial conditions :
>     # - Local interface is admin UP, and reports lowlayer link UP
>     # - Remote interface is adming UP, and reports lowlayer link UP
>     #
>     # Test 1
>     # - SKIP if supported doesn't contain "Pause"
>     # - run 'ethtool -A ethX rx on tx on autoneg on'
>     # - FAIL if the return isn't 0
>     # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
>     #   "Pause" or contains "Asym_Pause"
>     # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
>     #   "Asym_Pause"
>     # - Succeed otherwise
>     #
>     # Test 2
>     # - SKIP uif supported doesn't contain both "Pause" and "Asym_Pause"
>     # - run 'ethtool -A ethX rx on tx on autoneg on'
>     # - FAIL if the return isn't 0
>     # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
>     #   "Pause" or contains "Asym_Pause"
>     # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
>     #   "Asym_Pause"
>     #
>     # ...
>    
> The annotation defines the pre-requisites in terms of locally supported
> linkmodes, we have a docstring containing information for developpers
> to debug their drivers, what I'm unsure about is the commented-out part
> below, so either one big function testing multiple adjacent scenarios
> or indivitual functions.

Sphinx follows Python's object-oriented structure. So you could have a
class test_ethtool_pause_advertising, with class documentation. And
then methods within the class which are individual tests.  The
commented out section would then be method documentation.

However, I've no idea if the selftest code allows for classes of test
methods? It looks like ksft_run() takes a list of methods. So you can
probably instantiate the class, and then pass it methods from the
class?

I would say you are right about picking one of the simple test cases,
and playing with it, define and implement it, and see what comes out
at the end.

	Andrew

^ permalink raw reply

* [PATCH v2 2/2] hwmon: (chipcap2) Add support for label
From: Flaviu Nistor @ 2026-06-25 16:04 UTC (permalink / raw)
  To: Guenter Roeck, Javier Carrasco, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: Flaviu Nistor, linux-hwmon, linux-kernel, devicetree, linux-doc
In-Reply-To: <20260625160423.17882-1-flaviu.nistor@gmail.com>

Add support for label sysfs attribute similar to other hwmon devices.
This is particularly useful for systems with multiple sensors on the
same board, where identifying individual sensors is much easier since
labels can be defined via device tree.

Signed-off-by: Flaviu Nistor <flaviu.nistor@gmail.com>
---
Changes in v2:
- No change for this patch in the patch series. 
- Link to v1: https://lore.kernel.org/all/20260622122200.14245-1-flaviu.nistor@gmail.com/

 Documentation/hwmon/chipcap2.rst |  2 ++
 drivers/hwmon/chipcap2.c         | 25 +++++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/Documentation/hwmon/chipcap2.rst b/Documentation/hwmon/chipcap2.rst
index dc165becc64c..c38d87b91b69 100644
--- a/Documentation/hwmon/chipcap2.rst
+++ b/Documentation/hwmon/chipcap2.rst
@@ -70,4 +70,6 @@ humidity1_min_hyst:             RW      humidity low hystersis
 humidity1_max_hyst:             RW      humidity high hystersis
 humidity1_min_alarm:            RO      humidity low alarm indicator
 humidity1_max_alarm:            RO      humidity high alarm indicator
+humidity1_label:                RO      descriptive name for the sensor
+temp1_label:                    RO      descriptive name for the sensor
 =============================== ======= ========================================
diff --git a/drivers/hwmon/chipcap2.c b/drivers/hwmon/chipcap2.c
index 4aecf463180f..086571d556b7 100644
--- a/drivers/hwmon/chipcap2.c
+++ b/drivers/hwmon/chipcap2.c
@@ -22,6 +22,8 @@
 #include <linux/irq.h>
 #include <linux/module.h>
 #include <linux/regulator/consumer.h>
+#include <linux/mod_devicetable.h>
+#include <linux/property.h>
 
 #define CC2_START_CM			0xA0
 #define CC2_START_NOM			0x80
@@ -83,6 +85,7 @@ struct cc2_data {
 	struct i2c_client *client;
 	struct regulator *regulator;
 	const char *name;
+	const char *label;
 	int irq_ready;
 	int irq_low;
 	int irq_high;
@@ -449,6 +452,8 @@ static umode_t cc2_is_visible(const void *data, enum hwmon_sensor_types type,
 		switch (attr) {
 		case hwmon_humidity_input:
 			return 0444;
+		case hwmon_humidity_label:
+			return cc2->label ? 0444 : 0;
 		case hwmon_humidity_min_alarm:
 			return cc2->rh_alarm.low_alarm_visible ? 0444 : 0;
 		case hwmon_humidity_max_alarm:
@@ -466,6 +471,8 @@ static umode_t cc2_is_visible(const void *data, enum hwmon_sensor_types type,
 		switch (attr) {
 		case hwmon_temp_input:
 			return 0444;
+		case hwmon_temp_label:
+			return cc2->label ? 0444 : 0;
 		default:
 			return 0;
 		}
@@ -552,6 +559,16 @@ static int cc2_humidity_max_alarm_status(struct cc2_data *data, long *val)
 	return 0;
 }
 
+static int cc2_read_string(struct device *dev, enum hwmon_sensor_types type,
+			   u32 attr, int channel, const char **str)
+{
+	struct cc2_data *data = dev_get_drvdata(dev);
+
+	*str = data->label;
+
+	return 0;
+}
+
 static int cc2_read(struct device *dev, enum hwmon_sensor_types type, u32 attr,
 		    int channel, long *val)
 {
@@ -670,8 +687,9 @@ static int cc2_request_alarm_irqs(struct cc2_data *data, struct device *dev)
 }
 
 static const struct hwmon_channel_info *cc2_info[] = {
-	HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT),
-	HWMON_CHANNEL_INFO(humidity, HWMON_H_INPUT | HWMON_H_MIN | HWMON_H_MAX |
+	HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | HWMON_T_LABEL),
+	HWMON_CHANNEL_INFO(humidity, HWMON_H_INPUT | HWMON_H_LABEL |
+			   HWMON_H_MIN | HWMON_H_MAX |
 			   HWMON_H_MIN_HYST | HWMON_H_MAX_HYST |
 			   HWMON_H_MIN_ALARM | HWMON_H_MAX_ALARM),
 	NULL
@@ -680,6 +698,7 @@ static const struct hwmon_channel_info *cc2_info[] = {
 static const struct hwmon_ops cc2_hwmon_ops = {
 	.is_visible = cc2_is_visible,
 	.read = cc2_read,
+	.read_string = cc2_read_string,
 	.write = cc2_write,
 };
 
@@ -710,6 +729,8 @@ static int cc2_probe(struct i2c_client *client)
 		return dev_err_probe(dev, PTR_ERR(data->regulator),
 				     "Failed to get regulator\n");
 
+	device_property_read_string(dev, "label", &data->label);
+
 	ret = cc2_request_ready_irq(data, dev);
 	if (ret)
 		return dev_err_probe(dev, ret, "Failed to request ready irq\n");
-- 
2.34.1



* [PATCH v2 1/2] dt-bindings: hwmon: chipcap2: Add label property
From: Flaviu Nistor @ 2026-06-25 16:04 UTC (permalink / raw)
  To: Guenter Roeck, Javier Carrasco, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: Flaviu Nistor, linux-hwmon, linux-kernel, devicetree, linux-doc

Add support for an optional label property, similar to other hwmon
devices. On boards with multiple CHIPCAP2 sensors, this allows a
distinct name to be assigned to each instance.

Signed-off-by: Flaviu Nistor <flaviu.nistor@gmail.com>
---
Changes in v2:
- Implement suggestion from Javier Carrasco as proposed by Krzysztof Kozlowski.
- Link to v1: https://lore.kernel.org/all/20260622122200.14245-1-flaviu.nistor@gmail.com/

 .../devicetree/bindings/hwmon/amphenol,chipcap2.yaml        | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml b/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
index 17351fdbefce..56b0cecfca5f 100644
--- a/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
+++ b/Documentation/devicetree/bindings/hwmon/amphenol,chipcap2.yaml
@@ -45,6 +45,8 @@ properties:
       - const: low
       - const: high
 
+  label: true
+
   vdd-supply:
     description:
       Dedicated, controllable supply-regulator to reset the device and
@@ -55,6 +57,9 @@ required:
   - reg
   - vdd-supply
 
+allOf:
+  - $ref: hwmon-common.yaml#
+
 additionalProperties: false
 
 examples:
@@ -72,6 +77,7 @@ examples:
                          <5 IRQ_TYPE_EDGE_RISING>,
                          <6 IRQ_TYPE_EDGE_RISING>;
             interrupt-names = "ready", "low", "high";
+            label = "Room";
             vdd-supply = <&reg_vdd>;
         };
     };
-- 
2.34.1



* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 16:03 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <dfee1484-fa2a-4b98-af5a-1e67ac716905@lunn.ch>


> 
> Does it even make sense to advertise this when in HD? But i don't
> think we need to consider this now. I consider HD low priority, i
> doubt it is actually used very often. We should concentrate on FD
> testing.

That's fine by me as well, let's keep it simple, we may revisit that if
we really need to.

> 
>> # ethtool -a eth2
>> Autonegotiate:	on
>> RX:		off
>> TX:		off
>> RX negotiated: on
>> TX negotiated: on
>>
>>
>> Sure, pause and HD don't make sense, however what I find confusing to some
>> extent is that the only place we have information about the *actual* pause
>> settings is the "link is Up" log in dmesg.
> 
> Maybe we should extend ksetting get to return the resolved pause
> parameters? But i'm not sure how much that actually gives us. Anything
> using phylink will just ask phylink to fill in the ksettings
> information, and it seems unlikely phylink gets it wrong. What we are
> really trying to test is drivers which don't use phylink, those are
> the ones which are generally broken, and they are not going to
> implement anything new in ksettings.

Correct, yes. If the MAC driver uses phylink and a test fails, it very likely
means that the PHY driver is doing shady stuff (and some are/were for pause).

> So i think the test has to look
> at:
> 
>> 	Advertised pause frame use: Symmetric Receive-only
>> 	Link partner advertised pause frame use: Symmetric Receive-only
> 
> and check these match what we expect.
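
A minimal sketch of what that check could look like, assuming the two
lines quoted above are scraped from plain `ethtool <dev>` output. The
function names are made up for illustration. The string-to-bit table is my
understanding of how ethtool renders the Pause/Asym_Pause bits, and the
resolution follows the usual IEEE 802.3 Annex 28B rules, so both are
assumptions to double-check rather than a reference.

```python
import re

# Assumed ethtool rendering of the (Pause, Asym_Pause) advertisement bits.
PAUSE_BITS = {
    "No": (False, False),
    "Symmetric": (True, False),
    "Transmit-only": (False, True),
    "Symmetric Receive-only": (True, True),
}

PAUSE_LINE = re.compile(
    r"\s*(Link partner advertised|Advertised) pause frame use:\s*(.+?)\s*$"
)


def parse_pause_adv(ethtool_output):
    """Extract local and partner (pause, asym) bits from `ethtool <dev>` text.

    Raises KeyError if ethtool prints a pause string not listed above.
    """
    found = {}
    for line in ethtool_output.splitlines():
        match = PAUSE_LINE.match(line)
        if match:
            side = "partner" if match.group(1).startswith("Link") else "local"
            found[side] = PAUSE_BITS[match.group(2)]
    return found


def resolve_pause(local, partner):
    """Return the expected (tx_pause, rx_pause) for the local end."""
    local_pause, local_asym = local
    partner_pause, partner_asym = partner
    if local_pause and partner_pause:
        # Both sides advertise Pause: symmetric flow control.
        return True, True
    if local_asym and partner_asym:
        # Asymmetric case: whoever advertises Pause is the receiver.
        return partner_pause, local_pause
    return False, False


if __name__ == "__main__":
    sample = (
        "\tAdvertised pause frame use: Symmetric Receive-only\n"
        "\tLink partner advertised pause frame use: Symmetric Receive-only\n"
    )
    adv = parse_pause_adv(sample)
    print(resolve_pause(adv["local"], adv["partner"]))
```

A test would compare the parsed bits against what it configured, and could
use resolve_pause() to state the expected outcome alongside them.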

All good for me :) thanks for your feedback,

Maxime

