[PATCH v2 0/5] cxl: Sashiko bug fixes

Linux CXL
 help / color / mirror / Atom feed

* [PATCH v2 0/5] cxl: Sashiko bug fixes
@ 2026-07-02  9:08 Richard Cheng
  2026-07-02  9:08 ` [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field Richard Cheng
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Richard Cheng @ 2026-07-02  9:08 UTC (permalink / raw)
  To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
	danwilliams
  Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
	newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng

Five independent, pre-existing bugs in the CXL core, reported by sashiko.

Patch 1: Get/Set Feature stored offset + transfer-size into a 16-bit
field via cpu_to_le16() with no bounds check, so a large offset/count
from the fwctl interface silently wrapped and steered the device to the
wrong feature offset. Reject offset + size > U16_MAX up front.

Patch 2: cxl_get_poison_unmapped() aborted its whole partition sweep on
the first fully-mapped partition, silently skipping unmapped poison in
all later partitions. Skip that partition instead.

Patch 3: the same function tolerated the -EFAULT a RAM partition returns
for Get Poison List but left it in rc, so a benign fault on the last
scanned partition surfaced as a spurious read failure. Clear rc, as
poison_by_decoder() already does.

Patch 4: the same function also ignored the ctx->offset handoff from
poison_by_decoder() and derived its scan start from the highest DPA
allocation, so the DPA of allocated-but-uncommitted decoders was never
scanned by either phase. Resume the sweep at ctx->offset.

Patch 5: cxl_get_poison_by_memdev() overwrote rc on each partition
query, so an earlier partition's failure was masked by a later success
and unscanned poison was reported as a clean list. Stop on any error
not tolerated as a RAM -EFAULT.

Changes since v1 [1]:
- Patch 1: write the bounds checks as size > U16_MAX - offset so the
  check itself cannot wrap on 32-bit architectures (sashiko)
- Patch 2: commit message wording fix (Dave)
- New patches 4 and 5, fixing the pre-existing issues sashiko raised on
  the v1 patch 3 thread [2]

[1]:
https://lore.kernel.org/linux-cxl/20260630074657.43077-1-icheng@nvidia.com/
[2]:
https://lore.kernel.org/linux-cxl/20260630100022.A621A1F000E9@smtp.kernel.org/

Richard Cheng (5):
  cxl/features: Reject feature offset that overflows 16-bit field
  cxl/region: Scan all partitions for unmapped poison
  cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan
  cxl/region: Start unmapped poison scan at the committed decoder
    boundary
  cxl/memdev: Don't overwrite the error from an earlier partition poison
    query

 drivers/cxl/core/features.c |  6 ++++++
 drivers/cxl/core/memdev.c   |  2 ++
 drivers/cxl/core/region.c   | 13 ++++++-------
 3 files changed, 14 insertions(+), 7 deletions(-)

base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
-- 
2.43.0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field
  2026-07-02  9:08 [PATCH v2 0/5] cxl: Sashiko bug fixes Richard Cheng
@ 2026-07-02  9:08 ` Richard Cheng
  2026-07-02 11:22   ` sashiko-bot
  2026-07-02  9:08 ` [PATCH v2 2/5] cxl/region: Scan all partitions for unmapped poison Richard Cheng
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Richard Cheng @ 2026-07-02  9:08 UTC (permalink / raw)
  To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
	danwilliams
  Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
	newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng

cxl_get_feature() and cxl_set_feature() build the mailbox command's
offset as cpu_to_le16(offset + data_rcvd_size/data_sent_size), but never
check the sum fits in the 16-bit field. Via fwctl, a user-supplied
offset plus count/op_size summing over 65535 silently wraps, steering
the device to the wrong feature offset.

Fixes: 5e5ac21f629d ("cxl/mbox: Add GET_FEATURE mailbox command")
Fixes: 14d502cc2718 ("cxl/mbox: Add SET_FEATURE mailbox command")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:

v1->v2:
    - refactor the guard to "size > U16_MAX - offset", the addition is
      performed in size_t, so on 32-bit arch a large user-supplied size
wrpas the  sum and bypasses the check. The substraction form can't
misbehave since offset is a u16, making U16_MAX - offset always
well-defined.
---
 drivers/cxl/core/features.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
index 85185af46b72..c3d5f88a4e04 100644
--- a/drivers/cxl/core/features.c
+++ b/drivers/cxl/core/features.c
@@ -237,6 +237,9 @@ size_t cxl_get_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
 	if (!feat_out || !feat_out_size)
 		return 0;
 
+	if (feat_out_size > U16_MAX - offset)
+		return 0;
+
 	size_out = min(feat_out_size, cxl_mbox->payload_size);
 	uuid_copy(&pi.uuid, feat_uuid);
 	pi.selection = selection;
@@ -287,6 +290,9 @@ int cxl_set_feature(struct cxl_mailbox *cxl_mbox,
 	if (return_code)
 		*return_code = CXL_MBOX_CMD_RC_INPUT;
 
+	if (feat_data_size > U16_MAX - offset)
+		return -EINVAL;
+
 	struct cxl_mbox_set_feat_in *pi __free(kfree) =
 			kzalloc(cxl_mbox->payload_size, GFP_KERNEL);
 	if (!pi)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/5] cxl/region: Scan all partitions for unmapped poison
  2026-07-02  9:08 [PATCH v2 0/5] cxl: Sashiko bug fixes Richard Cheng
  2026-07-02  9:08 ` [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field Richard Cheng
@ 2026-07-02  9:08 ` Richard Cheng
  2026-07-02  9:08 ` [PATCH v2 3/5] cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan Richard Cheng
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Richard Cheng @ 2026-07-02  9:08 UTC (permalink / raw)
  To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
	danwilliams
  Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
	newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng

cxl_get_poison_unmapped() sweeps the unmapped tail of each partition
from ctx->part onward. A fully-mapped partition has no unmapped tail,
it's a normal per-partition state, but the loop treated it with break,
aborting the whole sweep and silently skipping unmapped poison in all
later partition. Use continue so a fully-mapped partition is skipped and
later partitions are still scanned.

Fixes: be5cbd0840275 ("cxl: Kill enum cxl_decoder_mode")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:

v1->v2:
    - Tweak commit message
---
 drivers/cxl/core/region.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 1e211542b6b6..be246fb09c99 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2931,7 +2931,7 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
 			offset = res->start;
 		length = res->end - offset + 1;
 		if (!length)
-			break;
+			continue;
 		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
 		if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM)
 			continue;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 3/5] cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan
  2026-07-02  9:08 [PATCH v2 0/5] cxl: Sashiko bug fixes Richard Cheng
  2026-07-02  9:08 ` [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field Richard Cheng
  2026-07-02  9:08 ` [PATCH v2 2/5] cxl/region: Scan all partitions for unmapped poison Richard Cheng
@ 2026-07-02  9:08 ` Richard Cheng
  2026-07-02  9:20   ` sashiko-bot
  2026-07-02  9:08 ` [PATCH v2 4/5] cxl/region: Start unmapped poison scan at the committed decoder boundary Richard Cheng
  2026-07-02  9:08 ` [PATCH v2 5/5] cxl/memdev: Don't overwrite the error from an earlier partition poison query Richard Cheng
  4 siblings, 1 reply; 8+ messages in thread
From: Richard Cheng @ 2026-07-02  9:08 UTC (permalink / raw)
  To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
	danwilliams
  Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
	newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng

cxl_get_poison_unmapped() tolerates the -EFAULT a RAM partition returns
for Get Poison List by skipping that partition, but left rc holding the
error. If the tolerated RAM fault was the last poison query before the
loop ended, the function returned a spurious -EFAULT and the poison-list
read failed even though enumeration succeeded. Reset rc to 0 when
tolerating the fault, matching poison_by_decoder().

Fixes: be5cbd0840275 ("cxl: Kill enum cxl_decoder_mode")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
 drivers/cxl/core/region.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index be246fb09c99..52ba8e9e4288 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2933,8 +2933,10 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
 		if (!length)
 			continue;
 		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
-		if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM)
+		if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM) {
+			rc = 0;
 			continue;
+		}
 		if (rc)
 			break;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 4/5] cxl/region: Start unmapped poison scan at the committed decoder boundary
  2026-07-02  9:08 [PATCH v2 0/5] cxl: Sashiko bug fixes Richard Cheng
                   ` (2 preceding siblings ...)
  2026-07-02  9:08 ` [PATCH v2 3/5] cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan Richard Cheng
@ 2026-07-02  9:08 ` Richard Cheng
  2026-07-02  9:08 ` [PATCH v2 5/5] cxl/memdev: Don't overwrite the error from an earlier partition poison query Richard Cheng
  4 siblings, 0 replies; 8+ messages in thread
From: Richard Cheng @ 2026-07-02  9:08 UTC (permalink / raw)
  To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
	danwilliams
  Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
	newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng

poison_by_decoder() stops at the last committed decoder and records the
handoff in ctx->offset, but cxl_get_poison_unmapped() ignores it and
starts after the highest DPA allocation instead. Allocation exist for
uncommitted decoders too, so their DPA is skipped by both phases and
poison there is never reported. Resume the scan at ctx->offset, and scan
later partitions in full, restoring the pre-rewrite behavior.

Fixes: be5cbd084027 ("cxl: Kill enum cxl_decoder_mode")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:

v1->v2:
    - New added patch ( sashiko's report )
---
 drivers/cxl/core/region.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 52ba8e9e4288..ba77416055f4 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2910,7 +2910,6 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
 {
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	const struct resource *res;
-	struct resource *p, *last;
 	u64 offset, length;
 	int rc = 0;
 
@@ -2923,10 +2922,8 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
 	 */
 	for (int i = ctx->part; i < cxlds->nr_partitions; i++) {
 		res = &cxlds->part[i].res;
-		for (p = res->child, last = NULL; p; p = p->sibling)
-			last = p;
-		if (last)
-			offset = last->end + 1;
+		if (i == ctx->part)
+			offset = ctx->offset;
 		else
 			offset = res->start;
 		length = res->end - offset + 1;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 5/5] cxl/memdev: Don't overwrite the error from an earlier partition poison query
  2026-07-02  9:08 [PATCH v2 0/5] cxl: Sashiko bug fixes Richard Cheng
                   ` (3 preceding siblings ...)
  2026-07-02  9:08 ` [PATCH v2 4/5] cxl/region: Start unmapped poison scan at the committed decoder boundary Richard Cheng
@ 2026-07-02  9:08 ` Richard Cheng
  4 siblings, 0 replies; 8+ messages in thread
From: Richard Cheng @ 2026-07-02  9:08 UTC (permalink / raw)
  To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
	danwilliams
  Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
	newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng

cxl_get_poison_by_memdev() queries Get Poison List per partition but
never checks the result inside the loop, so a later partition's success
overwrites an earlier partition's failure and the whole scan reports
success while that partition's poison went unlisted. Before the loop
conversion the PMEM query returned early on error. Stop the loop on any
error not already tolerated as a RAM -EFAULT.

Fixes: be5cbd084027 ("cxl: Kill enum cxl_decoder_mode")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:

v1->v2:
    - New added patch ( sashiko-bot's report )
---
 drivers/cxl/core/memdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 33a3d2e7b13a..8718964b9c5e 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -231,6 +231,8 @@ static int cxl_get_poison_by_memdev(struct cxl_memdev *cxlmd)
 		 */
 		if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM)
 			rc = 0;
+		if (rc)
+			break;
 	}
 	return rc;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 3/5] cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan
  2026-07-02  9:08 ` [PATCH v2 3/5] cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan Richard Cheng
@ 2026-07-02  9:20   ` sashiko-bot
  0 siblings, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-07-02  9:20 UTC (permalink / raw)
  To: Richard Cheng; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Potential infinite loop in `cxl_mem_get_poison()` if hardware returns `count == 0` with `_FLAG_MORE`.
--

commit f920ce49af56dd4b8cd88e19b4d4bf7ba33be7da
Author: Richard Cheng <icheng@nvidia.com>

cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan

cxl_get_poison_unmapped() tolerates the -EFAULT a RAM partition returns
for Get Poison List by skipping that partition, but left rc holding the
error. This patch resets rc to 0 when tolerating the fault to prevent
returning a spurious -EFAULT.

> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index be246fb09c999..52ba8e9e42885 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2933,8 +2933,10 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
>  		if (!length)
>  			continue;
>  		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);

[Severity: High]
This is a pre-existing issue, but can this call result in an infinite loop
if the hardware is misbehaving or malicious?

Looking at cxl_mem_get_poison() in drivers/cxl/core/mbox.c, the loop
termination depends on the hardware providing a non-zero count to
increment nr_records:

	nr_records = nr_records + le16_to_cpu(po->count);
	if (nr_records >= mds->poison.max_errors) {
		dev_dbg(&cxlmd->dev, "Max Error Records reached: %d\n",
			nr_records);
		break;
	}
} while (po->flags & CXL_POISON_FLAG_MORE);

If the device returns a payload with po->count == 0 and leaves the
CXL_POISON_FLAG_MORE flag set, would this bypass the max_errors breakout
condition and loop indefinitely?

> -		if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM)
> +		if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM) {
> +			rc = 0;
>  			continue;
> +		}
>  		if (rc)
>  			break;
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702090849.47501-1-icheng@nvidia.com?part=3

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field
  2026-07-02  9:08 ` [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field Richard Cheng
@ 2026-07-02 11:22   ` sashiko-bot
  0 siblings, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-07-02 11:22 UTC (permalink / raw)
  To: Richard Cheng; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:

New issues:
- [Medium] The newly introduced boundary checks contain an off-by-one arithmetic error that wrongly rejects valid CXL mailbox commands, preventing accesses up to the exact 64KB maximum boundary.

Pre-existing issues:
- [High] Heap buffer overflow in `cxl_get_feature` due to re-using a constant `size_out` for multiple mailbox commands, which can exceed the remaining buffer space if a device returns more data than requested.
- [High] In `cxlctl_get_feature()`, the output buffer `rpc_out` is allocated based on user-provided `*out_len`, but the function copies `count` bytes into it without validating that the buffer is large enough.
- [High] `cxlctl_fw_rpc()` and its command handlers blindly read from `inbuf` (sized `in_len`) without validating that `in_len` is large enough to contain the structures and payload specified by `rpc_in->op_size`.
- [High] Multi-part CXL mailbox feature transfers lack serialization, leading to interleaved commands and hardware state corruption.
--

commit 85b87615c810272287769d45a80ec38b15c23133
Author: Richard Cheng <icheng@nvidia.com>

cxl/features: Reject feature offset that overflows 16-bit field

cxl_get_feature() and cxl_set_feature() build the mailbox command's
offset as cpu_to_le16(offset + data_rcvd_size/data_sent_size), but never
check the sum fits in the 16-bit field. Via fwctl, a user-supplied
offset plus count/op_size summing over 65535 silently wraps, steering
the device to the wrong feature offset.

> diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
> index 85185af46b72d..c3d5f88a4e045 100644
> --- a/drivers/cxl/core/features.c
> +++ b/drivers/cxl/core/features.c
> @@ -237,6 +237,9 @@ size_t cxl_get_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
>  	if (!feat_out || !feat_out_size)
>  		return 0;
>  
> +	if (feat_out_size > U16_MAX - offset)
> +		return 0;
> +

[Severity: Medium]
Will this off-by-one arithmetic error reject valid boundary cases?

If a caller requests a full 64KB transfer (offset=0, size=65536) or a 1-byte
transfer at the highest offset (offset=65535, size=1), both evaluate as 
greater than 65535 (U16_MAX) and will incorrectly fail this check.

Should this check be feat_out_size > 65536 - offset instead?

[Severity: High]
This is a pre-existing issue, but can this function overflow the heap buffer?

Looking at cxl_get_feature() in drivers/cxl/core/features.c, if
feat_out_size requires multiple iterations and the remaining capacity on the
final iteration is smaller than the statically computed size_out,
mbox_cmd.size_out is incorrectly assigned size_out instead of the dynamically
bounded data_to_rd_size:

drivers/cxl/core/features.c:cxl_get_feature() {
	...
	do {
		data_to_rd_size = min(feat_out_size - data_rcvd_size,
				      cxl_mbox->payload_size);
		...
		mbox_cmd = (struct cxl_mbox_cmd) {
			...
			.size_out = size_out,
			...
		};
		rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
		...
	} while (data_rcvd_size < feat_out_size);
	...
}

If a hardware device responds with more bytes than the final chunk requires,
will the mailbox subsystem write up to mbox_cmd.size_out bytes and overflow
the remaining space in feat_out?

[Severity: High]
Also a pre-existing issue, but does this multi-part transfer lack
serialization?

Since the global mbox_mutex is dropped between loop iterations inside
cxl_internal_send_cmd(), if two threads concurrently initiate feature
transfers, will their mailbox commands unpredictably interleave and corrupt
the CXL device's multi-part transfer state machine?

>  	size_out = min(feat_out_size, cxl_mbox->payload_size);
>  	uuid_copy(&pi.uuid, feat_uuid);

[ ... ]

> @@ -287,6 +290,9 @@ int cxl_set_feature(struct cxl_mailbox *cxl_mbox,
>  	if (return_code)
>  		*return_code = CXL_MBOX_CMD_RC_INPUT;
>  
> +	if (feat_data_size > U16_MAX - offset)
> +		return -EINVAL;
> +

[Severity: Medium]
Does this bounds check also wrongly reject valid boundary limits?

Similar to cxl_get_feature(), if feat_data_size + offset equals 65536,
it evaluates as greater than 65535 (U16_MAX) and fails.

Would changing this to feat_data_size > 65536 - offset permit exact
maximum boundaries?

[Severity: High]
This is a pre-existing issue, but are multi-part transfers lacking
serialization here as well?

If two threads concurrently initiate feature transfers, will their INITIATE,
CONTINUE, and FINISH mailbox commands interleave because a higher-level lock
isn't held across the loop iterations?

>  	struct cxl_mbox_set_feat_in *pi __free(kfree) =
>  			kzalloc(cxl_mbox->payload_size, GFP_KERNEL);

[ ... ]

[Severity: High]
This is a pre-existing issue, but does cxlctl_get_feature() have a heap
buffer overflow when writing to rpc_out?

drivers/cxl/core/features.c:cxlctl_get_feature() {
	...
	out_size = cxl_get_feature(cxl_mbox, &feat_in->uuid,
				   feat_in->selection, rpc_out->payload,
				   count, offset, &return_code);
	...
}

If a user issues the fwctl ioctl with a small out_len but a large
get_feat_in.count, cxl_get_feature() is instructed to write count bytes into
the buffer allocated for out_len. Can this lead to memory corruption?

[Severity: High]
This is a pre-existing issue, but does cxlctl_set_feature() perform an
out-of-bounds read?

drivers/cxl/core/features.c:cxlctl_set_feature() {
	...
	data_size = rpc_in->op_size - sizeof(feat_in->hdr);
	rc = cxl_set_feature(cxl_mbox, &feat_in->uuid,
			     feat_in->version, feat_in->feat_data,
			     data_size, flags, offset, &return_code);
	...
}

If a user provides a valid but small in_len (e.g., just enough for the
header) but sets a large op_size inside the CXL payload header,
cxlctl_fw_rpc() blindly reads data_size bytes past the end of the inbuf
allocation. Could this leak kernel memory to the hardware device?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702090849.47501-1-icheng@nvidia.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-07-02 11:22 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-07-02  9:08 [PATCH v2 0/5] cxl: Sashiko bug fixes Richard Cheng
2026-07-02  9:08 ` [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field Richard Cheng
2026-07-02 11:22   ` sashiko-bot
2026-07-02  9:08 ` [PATCH v2 2/5] cxl/region: Scan all partitions for unmapped poison Richard Cheng
2026-07-02  9:08 ` [PATCH v2 3/5] cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan Richard Cheng
2026-07-02  9:20   ` sashiko-bot
2026-07-02  9:08 ` [PATCH v2 4/5] cxl/region: Start unmapped poison scan at the committed decoder boundary Richard Cheng
2026-07-02  9:08 ` [PATCH v2 5/5] cxl/memdev: Don't overwrite the error from an earlier partition poison query Richard Cheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox