Linux CXL
 help / color / mirror / Atom feed
* [PATCH 0/2] cxl: Fix endpoint access issues with CXL MCE notifier handler
@ 2026-06-16  0:40 Dave Jiang
  2026-06-16  0:40 ` [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce() Dave Jiang
  2026-06-16  0:40 ` [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown Dave Jiang
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Jiang @ 2026-06-16  0:40 UTC (permalink / raw)
  To: linux-cxl; +Cc: djbw, dave, jic23, alison.schofield, vishal.l.verma, flavien

Fix the issues reported by Flavien to the Linux security mailing list.

First patch addresses accessing possible invalid pointers for cxlmd and
cxmd->endpoint.

The second patch adddresses the endpoint lifetime that is shorter than
mds and needs appropriate guardrails to access.


Dave Jiang (2):
  cxl/mce: Validate memdev and endpoint before dereference in
    cxl_handle_mce()
  cxl/mce: Serialize the MCE handler against endpoint teardown

 drivers/cxl/core/mce.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)


base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
-- 
2.54.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce()
  2026-06-16  0:40 [PATCH 0/2] cxl: Fix endpoint access issues with CXL MCE notifier handler Dave Jiang
@ 2026-06-16  0:40 ` Dave Jiang
  2026-06-16  0:54   ` sashiko-bot
  2026-06-16 17:43   ` Dan Williams (nvidia)
  2026-06-16  0:40 ` [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown Dave Jiang
  1 sibling, 2 replies; 8+ messages in thread
From: Dave Jiang @ 2026-06-16  0:40 UTC (permalink / raw)
  To: linux-cxl
  Cc: djbw, dave, jic23, alison.schofield, vishal.l.verma, flavien,
	stable

cxlmd and endpoint are both used in cxl_handle_mce() without proper
validation, which can lead to NULL pointer dereference or invalid pointer
dereference. The notifier is registered in cxl_memdev_state_create()
when the CXL PCI driver first binds, before the memdev is published and
before it is attached to a CXL topology.

Add checks to cxlmd and endpoint to ensure they are valid before usage.

Reported-by: Flavien Solt <flavien@nus.edu.sg>
Fixes: 516e5bd0b6bf ("cxl: Add mce notifier to emit aliased address for extended linear cache")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/mce.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/mce.c b/drivers/cxl/core/mce.c
index ff8d078c6ca1..47566015eb00 100644
--- a/drivers/cxl/core/mce.c
+++ b/drivers/cxl/core/mce.c
@@ -13,7 +13,7 @@ static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
 	struct cxl_memdev_state *mds = container_of(nb, struct cxl_memdev_state,
 						    mce_notifier);
 	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
-	struct cxl_port *endpoint = cxlmd->endpoint;
+	struct cxl_port *endpoint;
 	struct mce *mce = data;
 	u64 spa, spa_alias;
 	unsigned long pfn;
@@ -21,7 +21,11 @@ static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
 
-	if (!endpoint)
+	if (!cxlmd)
+		return NOTIFY_DONE;
+
+	endpoint = cxlmd->endpoint;
+	if (IS_ERR_OR_NULL(endpoint))
 		return NOTIFY_DONE;
 
 	spa = mce->addr & MCI_ADDR_PHYSADDR;
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown
  2026-06-16  0:40 [PATCH 0/2] cxl: Fix endpoint access issues with CXL MCE notifier handler Dave Jiang
  2026-06-16  0:40 ` [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce() Dave Jiang
@ 2026-06-16  0:40 ` Dave Jiang
  2026-06-16  1:03   ` sashiko-bot
  2026-06-16 17:48   ` Dan Williams (nvidia)
  1 sibling, 2 replies; 8+ messages in thread
From: Dave Jiang @ 2026-06-16  0:40 UTC (permalink / raw)
  To: linux-cxl
  Cc: djbw, dave, jic23, alison.schofield, vishal.l.verma, flavien,
	stable

CXL endpoint has a shorter lifetime than CXL memdev state (mds) and
the MCE notifier is part of the mds. The MCE handler needs to take
a reference on the endpoint in order to keep it alive while operating
on it. Take the cxlmd lock to verify the endpoint is still valid and
take a reference on it before accessing it.

Reported-by: Flavien Solt <flavien@nus.edu.sg>
Fixes: 516e5bd0b6bf ("cxl: Add mce notifier to emit aliased address for extended linear cache")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/mce.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/core/mce.c b/drivers/cxl/core/mce.c
index 47566015eb00..e684e411921b 100644
--- a/drivers/cxl/core/mce.c
+++ b/drivers/cxl/core/mce.c
@@ -7,13 +7,27 @@
 #include <cxlmem.h>
 #include "mce.h"
 
+static struct device *cxlmd_get_endpoint_dev(struct cxl_memdev *cxlmd)
+{
+	struct cxl_port *endpoint;
+
+	if (!cxlmd)
+		return NULL;
+
+	guard(device)(&cxlmd->dev);
+	endpoint = cxlmd->endpoint;
+	if (IS_ERR_OR_NULL(endpoint))
+		return NULL;
+
+	return get_device(&endpoint->dev);
+}
+
 static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
 			  void *data)
 {
 	struct cxl_memdev_state *mds = container_of(nb, struct cxl_memdev_state,
 						    mce_notifier);
 	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
-	struct cxl_port *endpoint;
 	struct mce *mce = data;
 	u64 spa, spa_alias;
 	unsigned long pfn;
@@ -24,8 +38,13 @@ static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
 	if (!cxlmd)
 		return NOTIFY_DONE;
 
-	endpoint = cxlmd->endpoint;
-	if (IS_ERR_OR_NULL(endpoint))
+	/*
+	 * With the cxlmd device lock held, check the cxlmd->endpoint pointer
+	 * and then take a reference of the device in order to keep it alive
+	 * while accessing it.
+	 */
+	struct device *dev __free(put_device) = cxlmd_get_endpoint_dev(cxlmd);
+	if (!dev)
 		return NOTIFY_DONE;
 
 	spa = mce->addr & MCI_ADDR_PHYSADDR;
@@ -34,7 +53,7 @@ static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
 	if (!pfn_valid(pfn))
 		return NOTIFY_DONE;
 
-	spa_alias = cxl_port_get_spa_cache_alias(endpoint, spa);
+	spa_alias = cxl_port_get_spa_cache_alias(to_cxl_port(dev), spa);
 	if (spa_alias == ~0ULL)
 		return NOTIFY_DONE;
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce()
  2026-06-16  0:40 ` [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce() Dave Jiang
@ 2026-06-16  0:54   ` sashiko-bot
  2026-06-16 17:43   ` Dan Williams (nvidia)
  1 sibling, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-06-16  0:54 UTC (permalink / raw)
  To: Dave Jiang; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] The validation added for `cxlmd` fails to account for device teardown, where `cxlds->cxlmd` becomes a dangling pointer, leading to an incomplete fix and Use-After-Free.
--

commit d4d366d587ddef82637af37fbf8cd28754d28024
Author: Dave Jiang <dave.jiang@intel.com>

cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce()

cxlmd and endpoint are both used in cxl_handle_mce() without proper
validation, which can lead to NULL pointer dereference or invalid pointer
dereference. The notifier is registered in cxl_memdev_state_create()
when the CXL PCI driver first binds.

> diff --git a/drivers/cxl/core/mce.c b/drivers/cxl/core/mce.c
> index ff8d078c6ca1e..47566015eb006 100644
> --- a/drivers/cxl/core/mce.c
> +++ b/drivers/cxl/core/mce.c
[ ... ]
> @@ -21,7 +21,11 @@ static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
>  	if (!mce || !mce_usable_address(mce))
>  		return NOTIFY_DONE;
>  
> -	if (!endpoint)
> +	if (!cxlmd)
> +		return NOTIFY_DONE;

[Severity: High]
This isn't a bug introduced by this patch, but does this check adequately
protect against a use-after-free during device teardown?

The MCE notifier is registered via devm_cxl_register_mce_notifier() early in
the PCI probe sequence. Later, cxl_memdev_autoremove() registers the
cxl_memdev_unregister() teardown action via devm_add_action_or_reset().
Because devm actions execute in LIFO order, cxl_memdev_unregister() runs
before the MCE notifier is unregistered.

During cxl_memdev_unregister(), put_device(&cxlmd->dev) is called, which can
free the cxlmd structure. However, neither cxl_memdev_unregister() nor
cxl_memdev_shutdown() clears the cxlds->cxlmd pointer.

This leaves mds->cxlds.cxlmd as a dangling pointer. If an MCE occurs during
this teardown window, cxl_handle_mce() will find a non-NULL cxlmd and pass it
to cxlmd_get_endpoint_dev(), which calls guard(device)(&cxlmd->dev), resulting
in a use-after-free.

> +
> +	endpoint = cxlmd->endpoint;
> +	if (IS_ERR_OR_NULL(endpoint))
>  		return NOTIFY_DONE;
>  
>  	spa = mce->addr & MCI_ADDR_PHYSADDR;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616004007.4186004-1-dave.jiang@intel.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown
  2026-06-16  0:40 ` [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown Dave Jiang
@ 2026-06-16  1:03   ` sashiko-bot
  2026-06-16 17:48   ` Dan Williams (nvidia)
  1 sibling, 0 replies; 8+ messages in thread
From: sashiko-bot @ 2026-06-16  1:03 UTC (permalink / raw)
  To: Dave Jiang; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [Critical] The MCE notifier accesses a freed `cxlmd` structure and attempts to lock its mutex if an MCE occurs during device teardown, leading to a Use-After-Free and kernel crash.
--

commit a9bb7894eaa94a0fe2d0d4c6dd69b86cf2d1fe0f
Author: Dave Jiang <dave.jiang@intel.com>

cxl/mce: Serialize the MCE handler against endpoint teardown

This commit modifies the MCE handler to take the cxlmd lock before verifying
the endpoint pointer. It then takes a reference on the endpoint device to
ensure it remains alive during access.

> diff --git a/drivers/cxl/core/mce.c b/drivers/cxl/core/mce.c
> index 47566015eb006..e684e411921b9 100644
> --- a/drivers/cxl/core/mce.c
> +++ b/drivers/cxl/core/mce.c
> @@ -7,13 +7,27 @@
>  #include <cxlmem.h>
>  #include "mce.h"
>  
> +static struct device *cxlmd_get_endpoint_dev(struct cxl_memdev *cxlmd)
> +{
> +	struct cxl_port *endpoint;
> +
> +	if (!cxlmd)
> +		return NULL;
> +
> +	guard(device)(&cxlmd->dev);

[Severity: Critical]
This is a pre-existing issue, but does accessing cxlmd here lead to a
use-after-free if a Machine Check Exception occurs during device teardown?

During device teardown or driver unbind, devm actions execute in reverse
order. The MCE notifier is registered before the cxl_memdev structure
is allocated, which means cxl_memdev is unregistered and its memory freed
before the MCE notifier is deactivated.

While cxl_memdev_shutdown() clears cxlmd->cxlds, it does not clear
mds->cxlds.cxlmd. This leaves mds->cxlds.cxlmd as a dangling pointer.

If a Machine Check Exception happens during this window, cxl_handle_mce()
will read the freed cxlmd pointer and pass it here. Attempting to lock a
mutex inside freed memory will cause memory corruption or a crash.

> +	endpoint = cxlmd->endpoint;

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616004007.4186004-1-dave.jiang@intel.com?part=2

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce()
  2026-06-16  0:40 ` [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce() Dave Jiang
  2026-06-16  0:54   ` sashiko-bot
@ 2026-06-16 17:43   ` Dan Williams (nvidia)
  2026-06-16 17:44     ` Dave Jiang
  1 sibling, 1 reply; 8+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-16 17:43 UTC (permalink / raw)
  To: Dave Jiang, linux-cxl
  Cc: djbw, dave, jic23, alison.schofield, vishal.l.verma, flavien,
	stable

Dave Jiang wrote:
> cxlmd and endpoint are both used in cxl_handle_mce() without proper
> validation, which can lead to NULL pointer dereference or invalid pointer
> dereference. The notifier is registered in cxl_memdev_state_create()
> when the CXL PCI driver first binds, before the memdev is published and
> before it is attached to a CXL topology.
> 
> Add checks to cxlmd and endpoint to ensure they are valid before usage.

This looks to be trying to band-aid the original mistake of having
cxl_memdev_state_create() register a region-relative callback.

Move the mce notifier registration to be per-region and all the lookup
lifetime problems disappear.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce()
  2026-06-16 17:43   ` Dan Williams (nvidia)
@ 2026-06-16 17:44     ` Dave Jiang
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Jiang @ 2026-06-16 17:44 UTC (permalink / raw)
  To: Dan Williams (nvidia), linux-cxl
  Cc: dave, jic23, alison.schofield, vishal.l.verma, flavien, stable



On 6/16/26 10:43 AM, Dan Williams (nvidia) wrote:
> Dave Jiang wrote:
>> cxlmd and endpoint are both used in cxl_handle_mce() without proper
>> validation, which can lead to NULL pointer dereference or invalid pointer
>> dereference. The notifier is registered in cxl_memdev_state_create()
>> when the CXL PCI driver first binds, before the memdev is published and
>> before it is attached to a CXL topology.
>>
>> Add checks to cxlmd and endpoint to ensure they are valid before usage.
> 
> This looks to be trying to band-aid the original mistake of having
> cxl_memdev_state_create() register a region-relative callback.
> 
> Move the mce notifier registration to be per-region and all the lookup
> lifetime problems disappear.

ok that makes sense. 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown
  2026-06-16  0:40 ` [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown Dave Jiang
  2026-06-16  1:03   ` sashiko-bot
@ 2026-06-16 17:48   ` Dan Williams (nvidia)
  1 sibling, 0 replies; 8+ messages in thread
From: Dan Williams (nvidia) @ 2026-06-16 17:48 UTC (permalink / raw)
  To: Dave Jiang, linux-cxl
  Cc: djbw, dave, jic23, alison.schofield, vishal.l.verma, flavien,
	stable

Dave Jiang wrote:
> CXL endpoint has a shorter lifetime than CXL memdev state (mds) and
> the MCE notifier is part of the mds. The MCE handler needs to take
> a reference on the endpoint in order to keep it alive while operating
> on it. Take the cxlmd lock to verify the endpoint is still valid and
> take a reference on it before accessing it.

The only way to synchronize against the removal of cxlmd would be to
lock its parent device which is moving in the wrong direction.

This lifetime problem disappears with a region-relative mce
notification.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-16 17:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-16  0:40 [PATCH 0/2] cxl: Fix endpoint access issues with CXL MCE notifier handler Dave Jiang
2026-06-16  0:40 ` [PATCH 1/2] cxl/mce: Validate memdev and endpoint before dereference in cxl_handle_mce() Dave Jiang
2026-06-16  0:54   ` sashiko-bot
2026-06-16 17:43   ` Dan Williams (nvidia)
2026-06-16 17:44     ` Dave Jiang
2026-06-16  0:40 ` [PATCH 2/2] cxl/mce: Serialize the MCE handler against endpoint teardown Dave Jiang
2026-06-16  1:03   ` sashiko-bot
2026-06-16 17:48   ` Dan Williams (nvidia)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox