* [PATCH v2 0/5] cxl: Sashiko bug fixes
@ 2026-07-02 9:08 Richard Cheng
2026-07-02 9:08 ` [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field Richard Cheng
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Richard Cheng @ 2026-07-02 9:08 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
danwilliams
Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng
This series fixes five independent, pre-existing bugs in the CXL core, reported by sashiko.
Patch 1: Get/Set Feature stored offset + transfer-size into a 16-bit
field via cpu_to_le16() with no bounds check, so a large offset/count
from the fwctl interface silently wrapped and steered the device to the
wrong feature offset. Reject offset + size > U16_MAX up front.
Patch 2: cxl_get_poison_unmapped() aborted its whole partition sweep on
the first fully-mapped partition, silently skipping unmapped poison in
all later partitions. Skip that partition instead.
Patch 3: the same function tolerated the -EFAULT a RAM partition returns
for Get Poison List but left it in rc, so a benign fault on the last
scanned partition surfaced as a spurious read failure. Clear rc, as
poison_by_decoder() already does.
Patch 4: the same function also ignored the ctx->offset handoff from
poison_by_decoder() and derived its scan start from the highest DPA
allocation, so the DPA of allocated-but-uncommitted decoders was never
scanned by either phase. Resume the sweep at ctx->offset.
Patch 5: cxl_get_poison_by_memdev() overwrote rc on each partition
query, so an earlier partition's failure was masked by a later success
and unscanned poison was reported as a clean list. Stop on any error
not tolerated as a RAM -EFAULT.
Changes since v1 [1]:
- Patch 1: write the bounds checks as size > U16_MAX - offset so the
check itself cannot wrap on 32-bit architectures (sashiko)
- Patch 2: commit message wording fix (Dave)
- New patches 4 and 5, fixing the pre-existing issues sashiko raised on
the v1 patch 3 thread [2]
[1]:
https://lore.kernel.org/linux-cxl/20260630074657.43077-1-icheng@nvidia.com/
[2]:
https://lore.kernel.org/linux-cxl/20260630100022.A621A1F000E9@smtp.kernel.org/
Richard Cheng (5):
cxl/features: Reject feature offset that overflows 16-bit field
cxl/region: Scan all partitions for unmapped poison
cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan
cxl/region: Start unmapped poison scan at the committed decoder
boundary
cxl/memdev: Don't overwrite the error from an earlier partition poison
query
drivers/cxl/core/features.c | 6 ++++++
drivers/cxl/core/memdev.c | 2 ++
drivers/cxl/core/region.c | 13 ++++++-------
3 files changed, 14 insertions(+), 7 deletions(-)
base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
--
2.43.0
* [PATCH v2 1/5] cxl/features: Reject feature offset that overflows 16-bit field
From: Richard Cheng @ 2026-07-02 9:08 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
danwilliams
Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng
cxl_get_feature() and cxl_set_feature() build the mailbox command's
offset as cpu_to_le16(offset + data_rcvd_size/data_sent_size), but never
check the sum fits in the 16-bit field. Via fwctl, a user-supplied
offset plus count/op_size summing over 65535 silently wraps, steering
the device to the wrong feature offset.
Fixes: 5e5ac21f629d ("cxl/mbox: Add GET_FEATURE mailbox command")
Fixes: 14d502cc2718 ("cxl/mbox: Add SET_FEATURE mailbox command")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:
v1->v2:
- Refactor the guard to "size > U16_MAX - offset". In the previous
"offset + size > U16_MAX" form the addition is performed in size_t, so
on a 32-bit arch a large user-supplied size wraps the sum and bypasses
the check. The subtraction form can't misbehave since offset is a u16,
making U16_MAX - offset always well-defined.
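
For illustration only, a minimal userspace sketch of the two guard forms described above. The helper names and the size32_t typedef are hypothetical and not part of the patch; uint32_t stands in for a 32-bit size_t so the wrap is reproducible on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for size_t on a 32-bit architecture. */
typedef uint32_t size32_t;

/* Addition form: the sum is evaluated in 32 bits and can wrap to a small value. */
static bool rejects_sum_form(uint16_t offset, size32_t size)
{
	return (size32_t)(offset + size) > UINT16_MAX;
}

/* Subtraction form: UINT16_MAX - offset always lies in [0, 65535]. */
static bool rejects_sub_form(uint16_t offset, size32_t size)
{
	return size > (size32_t)(UINT16_MAX - offset);
}
```

With offset = 16 and size = 0xFFFFFFF8 the 32-bit sum wraps to 8, so the addition form lets the request through while the subtraction form rejects it.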
---
drivers/cxl/core/features.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
index 85185af46b72..c3d5f88a4e04 100644
--- a/drivers/cxl/core/features.c
+++ b/drivers/cxl/core/features.c
@@ -237,6 +237,9 @@ size_t cxl_get_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
if (!feat_out || !feat_out_size)
return 0;
+ if (feat_out_size > U16_MAX - offset)
+ return 0;
+
size_out = min(feat_out_size, cxl_mbox->payload_size);
uuid_copy(&pi.uuid, feat_uuid);
pi.selection = selection;
@@ -287,6 +290,9 @@ int cxl_set_feature(struct cxl_mailbox *cxl_mbox,
if (return_code)
*return_code = CXL_MBOX_CMD_RC_INPUT;
+ if (feat_data_size > U16_MAX - offset)
+ return -EINVAL;
+
struct cxl_mbox_set_feat_in *pi __free(kfree) =
kzalloc(cxl_mbox->payload_size, GFP_KERNEL);
if (!pi)
--
2.43.0
* [PATCH v2 2/5] cxl/region: Scan all partitions for unmapped poison
From: Richard Cheng @ 2026-07-02 9:08 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
danwilliams
Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng
cxl_get_poison_unmapped() sweeps the unmapped tail of each partition
from ctx->part onward. A fully-mapped partition has no unmapped tail,
which is a normal per-partition state, but the loop handled it with
break, aborting the whole sweep and silently skipping unmapped poison in
all later partitions. Use continue so a fully-mapped partition is
skipped and later partitions are still scanned.
Fixes: be5cbd084027 ("cxl: Kill enum cxl_decoder_mode")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:
v1->v2:
- Tweak commit message
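
For illustration only, a small userspace model of the loop shape, not the kernel code. Each array entry is a hypothetical unmapped-tail length for one partition, and the return value counts how many partitions would actually be queried.

```c
#include <stddef.h>
#include <stdint.h>

/* Old shape: the first fully-mapped partition ends the whole sweep. */
static size_t scan_with_break(const uint64_t *tail_len, size_t nr)
{
	size_t scanned = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!tail_len[i])
			break;		/* later partitions are never reached */
		scanned++;
	}
	return scanned;
}

/* New shape: a fully-mapped partition is skipped, the sweep goes on. */
static size_t scan_with_continue(const uint64_t *tail_len, size_t nr)
{
	size_t scanned = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!tail_len[i])
			continue;	/* nothing to scan in this partition */
		scanned++;
	}
	return scanned;
}
```

With tail lengths { 0, 4096 }, the break form queries no partition at all, while the continue form still queries the second one.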
---
drivers/cxl/core/region.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 1e211542b6b6..be246fb09c99 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2931,7 +2931,7 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
offset = res->start;
length = res->end - offset + 1;
if (!length)
- break;
+ continue;
rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM)
continue;
--
2.43.0
* [PATCH v2 3/5] cxl/region: Don't leak tolerated RAM -EFAULT from unmapped poison scan
From: Richard Cheng @ 2026-07-02 9:08 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
danwilliams
Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng
cxl_get_poison_unmapped() tolerates the -EFAULT a RAM partition returns
for Get Poison List by skipping that partition, but left rc holding the
error. If the tolerated RAM fault was the last poison query before the
loop ended, the function returned a spurious -EFAULT and the poison-list
read failed even though enumeration succeeded. Reset rc to 0 when
tolerating the fault, matching poison_by_decoder().
Fixes: be5cbd084027 ("cxl: Kill enum cxl_decoder_mode")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
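For illustration only, a userspace model of the return-code handling, not the kernel code. struct part_result and sweep() are hypothetical names; each entry records what the poison query for one partition would return and whether that partition is RAM.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one per-partition poison query. */
struct part_result {
	int rc;
	bool is_ram;
};

/* clear_rc selects the fixed behavior; false models the old loop. */
static int sweep(const struct part_result *p, size_t nr, bool clear_rc)
{
	int rc = 0;

	for (size_t i = 0; i < nr; i++) {
		rc = p[i].rc;
		if (rc == -EFAULT && p[i].is_ram) {
			if (clear_rc)
				rc = 0;	/* tolerated fault must not escape */
			continue;
		}
		if (rc)
			break;
	}
	return rc;
}
```

When the tolerated RAM fault is the last query, the old shape returns -EFAULT and the fixed shape returns 0. An -EFAULT from a non-RAM partition is still reported in both cases.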
drivers/cxl/core/region.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index be246fb09c99..52ba8e9e4288 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2933,8 +2933,10 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
if (!length)
continue;
rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
- if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM)
+ if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM) {
+ rc = 0;
continue;
+ }
if (rc)
break;
}
--
2.43.0
* [PATCH v2 4/5] cxl/region: Start unmapped poison scan at the committed decoder boundary
From: Richard Cheng @ 2026-07-02 9:08 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
danwilliams
Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng
poison_by_decoder() stops at the last committed decoder and records the
handoff in ctx->offset, but cxl_get_poison_unmapped() ignores it and
starts after the highest DPA allocation instead. Allocations exist for
uncommitted decoders too, so their DPA is skipped by both phases and
poison there is never reported. Resume the scan at ctx->offset, and scan
later partitions in full, restoring the pre-rewrite behavior.
Fixes: be5cbd084027 ("cxl: Kill enum cxl_decoder_mode")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:
v1->v2:
- Newly added patch (sashiko's report)
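
For illustration only, a userspace model of how the scan window is chosen after this change, not the kernel code. struct dpa_range and unmapped_len() are hypothetical, and the sketch assumes ctx->offset holds the first DPA not covered by the decoder phase.

```c
#include <stdint.h>

/* Hypothetical inclusive DPA range of one partition. */
struct dpa_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Length of the window to scan in partition i. The partition where the
 * decoder phase stopped resumes at ctx_offset; later partitions are
 * scanned from their start.
 */
static uint64_t unmapped_len(const struct dpa_range *res, int i,
			     int ctx_part, uint64_t ctx_offset)
{
	uint64_t offset = (i == ctx_part) ? ctx_offset : res->start;

	return res->end - offset + 1;
}
```

A partition covering 0x0 to 0xFFF with a handoff at 0x400 yields a 0xC00-byte window; a handoff at 0x1000 yields length 0, so the partition is skipped.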
---
drivers/cxl/core/region.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 52ba8e9e4288..ba77416055f4 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2910,7 +2910,6 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
{
struct cxl_dev_state *cxlds = cxlmd->cxlds;
const struct resource *res;
- struct resource *p, *last;
u64 offset, length;
int rc = 0;
@@ -2923,10 +2922,8 @@ static int cxl_get_poison_unmapped(struct cxl_memdev *cxlmd,
*/
for (int i = ctx->part; i < cxlds->nr_partitions; i++) {
res = &cxlds->part[i].res;
- for (p = res->child, last = NULL; p; p = p->sibling)
- last = p;
- if (last)
- offset = last->end + 1;
+ if (i == ctx->part)
+ offset = ctx->offset;
else
offset = res->start;
length = res->end - offset + 1;
--
2.43.0
* [PATCH v2 5/5] cxl/memdev: Don't overwrite the error from an earlier partition poison query
From: Richard Cheng @ 2026-07-02 9:08 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, vishal.l.verma, djbw,
danwilliams
Cc: iweiny, ming.li, gourry, rrichter, linux-cxl, linux-kernel, kees,
newtonl, kristinc, mochs, kaihengf, kobak, Richard Cheng
cxl_get_poison_by_memdev() queries Get Poison List per partition but
never checks the result inside the loop, so a later partition's success
overwrites an earlier partition's failure and the whole scan reports
success while that partition's poison goes unlisted. Before the loop
conversion the PMEM query returned early on error. Stop the loop on any
error not already tolerated as a RAM -EFAULT.
Fixes: be5cbd084027 ("cxl: Kill enum cxl_decoder_mode")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
Changelog:
v1->v2:
- Newly added patch (sashiko-bot's report)
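
For illustration only, a userspace model of the masking problem, not the kernel code. struct part_query and query_all() are hypothetical names; each entry is the result a per-partition Get Poison List query would return.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one per-partition query. */
struct part_query {
	int rc;
	bool is_ram;
};

/* stop_on_error selects the fixed behavior; false models the old loop. */
static int query_all(const struct part_query *q, size_t nr, bool stop_on_error)
{
	int rc = 0;

	for (size_t i = 0; i < nr; i++) {
		rc = q[i].rc;
		if (rc == -EFAULT && q[i].is_ram)
			rc = 0;		/* tolerated RAM fault */
		if (rc && stop_on_error)
			break;		/* keep the first real error */
	}
	return rc;
}
```

With a failing first partition followed by a successful one, the old shape returns 0 and the fixed shape returns the first partition's error.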
---
drivers/cxl/core/memdev.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 33a3d2e7b13a..8718964b9c5e 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -231,6 +231,8 @@ static int cxl_get_poison_by_memdev(struct cxl_memdev *cxlmd)
*/
if (rc == -EFAULT && cxlds->part[i].mode == CXL_PARTMODE_RAM)
rc = 0;
+ if (rc)
+ break;
}
return rc;
}
--
2.43.0