From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AFBF12D94B5; Thu, 25 Jun 2026 18:32:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782412325; cv=none; b=PqKJ8OKpTBbUzPkZ28l6u5M+9UgfBkaOzdErTO/Ryv8UPAwcSESzOGZSYDMZR0kvlrNmpALfVpjV82AponHBusreMNs+wp2XDRhSzAL/M4WARsPqvVKeVrYkzT9eb7rB9fX2srjhZTrIalo6dKDZe4AzvqXGXqNDicJ7Zw5rmoc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782412325; c=relaxed/simple; bh=htYnye3sj1OGyyzJUooewtWOEq+Z9JA+lwBaLPTNmz4=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=ZKwHMDTWQMysqRUXJCOGYFb2/Zd/8YjNy4JQdibpHxFeyB9ElYnxtjXBxBBON2l1EWJ07d1Yf9L13/Nkbxbp9pgowbNB4C1IVnjD7bNVdLRvzfhAdZ5npkAi2f7ZF5+NmofkH3Bh7+Cmry2UH4QEZBHfvZlFy/BTGupgPw+uSi8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=FkqqGZd4; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="FkqqGZd4" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3042C1F000E9; Thu, 25 Jun 2026 18:32:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1782412323; bh=hr60yuCQ3FRSkQwbbsQ5sn5cTZrrHJUjI2Ulxr58Wuc=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=FkqqGZd4ALp7QFHZQ2V6YoNzscq0zIZv4Cwo54JJ9xROPMhIFNwEVSuMFUsiyuQFE zGXvtVw4xc8cGNAJM9KDZgBPTNAEIMlRKhDGMy+9Big9Cx/hOWOaCAspfpp+goiOYj 77c/sXmY33l+RX6qJyUvag6fz1vE/xBNiu/egkQHkvXHMTNIRU8AYQnKCwIeX8Z0XR 0E5ZlsQuBdywHmgoezj/uOi/tUPakcQ30pRveje9DbWOUHl4lEtWbcTb+8Qs0mDox5 HREWuGUDtxGi9T7d0illxkQNH5qZ99GLXs8dc/2kvuUofAGkAPXFUF4o8ZudR8+bfT 7H0HWK6d/9Ehg== From: sashiko-bot@kernel.org Subject: Re: [PATCH v11 27/31] cxl/region: Read existing extents on region creation Reply-To: sashiko-reviews@lists.linux.dev To: "Anisa Su" Cc: nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org In-Reply-To: <20260625112638.550691-28-anisa.su@samsung.com> References: <20260625112638.550691-1-anisa.su@samsung.com> <20260625112638.550691-28-anisa.su@samsung.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Thu, 25 Jun 2026 18:32:02 +0000 Message-Id: <20260625183203.3042C1F000E9@smtp.kernel.org> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 8 potential issue(= s) to consider: New issues: - [High] Cross-region lock assertion failure when processing existing exten= ts. - [High] Out-of-bounds read due to unvalidated `nr_returned` payload size. - [High] Runtime DC add events are permanently dropped and lost during regi= on probe, contradicting the design intent. - [High] Permanent device capacity leak when existing extents are dropped. - [High] AB-BA deadlock between `mds->add_ctx.lock` and `cxlr_dax->dev` loc= k. - [High] Shared `pending_extents` list corruption aborts in-flight runtime = events. - [Medium] Mixing `goto`-based error handling with `__free()` and `guard()`= violates subsystem cleanup guidelines. Pre-existing issues: - [High] Integer overflow in extent boundary calculation bypasses range val= idation. -- commit ae344fab96ea05c143bd5e8ab47ce8059fc778b6 Author: Ira Weiny Subject: cxl/region: Read existing extents on region creation Dynamic capacity device extents may be left in an accepted state on a device due to an unexpected host crash. This patch reads the 'devices extent list' during the asynchronous cxl_dax_region probe to reveal previously accepted extents and add them to the new region, avoiding races with probe. > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c > --- a/drivers/cxl/core/extent.c > +++ b/drivers/cxl/core/extent.c > @@ -261,7 +261,7 @@ static int cxl_validate_extent(struct cxl_memdev_stat= e *mds, > struct cxl_extent *extent, > struct cxl_endpoint_decoder **out_cxled, > struct cxl_dax_region **out_cxlr_dax, > - struct range *out_ext_range) > + struct range *out_ext_range, bool existing) > { > u64 start_dpa =3D le64_to_cpu(extent->start_dpa); [Severity: High] This is a pre-existing issue, but could the extent boundary calculation in cxl_validate_extent() overflow? A few lines down from this signature change, the end boundary is calculated: struct range ext_range =3D (struct range) { .start =3D start_dpa, .end =3D start_dpa + le64_to_cpu(extent->length) - 1, }; If a device provides a massive start_dpa and a small length, could this calculation wrap around and produce a small ext_range.end? If so, the bounds check later in the function: if (!range_contains(&ed_range, &ext_range)) { would evaluate r1->start <=3D r2->start && r1->end >=3D r2->end. With a wra= pped near-zero ext_range.end, the check might incorrectly pass, allowing an out-of-bounds extent. > [ ... ] > @@ -300,6 +300,13 @@ static int cxl_validate_extent(struct cxl_memdev_sta= te *mds, > if (!cxlr || !cxlr->cxlr_dax) > return -ENXIO; > =20 > + /* > + * Pre-existing extents must be read before any new extent is added so a > + * tag already in use is never added twice; defer new adds until then. > + */ > + if (!existing && !smp_load_acquire(&cxlr->cxlr_dax->extents_scanned)) > + return -EBUSY; > + [Severity: High] Does returning -EBUSY here cause runtime Add Capacity events to be permanen= tly lost? If cxl_validate_extent() returns -EBUSY for a new runtime event, its caller cxl_realize_group() treats it as a failure, logging an error and deleting t= he extent node: if (cxl_add_extent(mds, pos->extent, seq_num, existing) < 0) { dev_dbg(dev, "Tag %pUb: failed to add extent ..."); delete_extent_node(pos); continue; } Since the extent is destroyed rather than deferred, it seems it will be omi= tted from the ADD_DC_RESPONSE, meaning the host officially rejects it. Does this prevent the runtime capacity from being added? > [ ... ] > @@ -376,16 +383,22 @@ dc_extent_build(struct cxl_endpoint_decoder *cxled, > return dc_extent; > } > =20 > -int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event, > - struct cxl_dc_tag_group *group) > +/* > + * Core notify: the caller must hold device_lock(&cxlr->cxlr_dax->dev). = Used by > + * the existing-extent path that runs inside cxl_dax_region_probe(), whe= re the > + * async device-attach already holds the dax_region's device_lock =E2=80= =94 taking it > + * again (as cxlr_notify_extent() does) would deadlock the probe against= itself. > + */ > +int __cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event, > + struct cxl_dc_tag_group *group) > { > struct device *dev =3D &cxlr->cxlr_dax->dev; > struct cxl_notify_data notify_data; > struct cxl_driver *driver; > =20 > - dev_dbg(dev, "Trying notify: type %d tag %pUb\n", event, &group->uuid); > + device_lock_assert(dev); [Severity: High] Could this lock assertion fail when processing extents from other regions? When __cxl_process_extent_list() queries the device for existing extents, the device returns all extents for all partitions, which may belong to different regions. For an extent belonging to a different region, cxl_realize_group() will call __cxlr_notify_extent() targeting that other region. However, the probe thre= ad only holds the device lock for the region currently being probed. Will this cause the assertion to fail for extents on other regions? > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -1503,7 +1503,7 @@ static int cxl_realize_group(struct cxl_memdev_stat= e *mds, const uuid_t *tag, > else > seq_num++; > =20 > - if (cxl_add_extent(mds, pos->extent, seq_num) < 0) { > + if (cxl_add_extent(mds, pos->extent, seq_num, existing) < 0) { [Severity: High] If cxl_add_extent() fails for an existing extent, could this cause a perman= ent capacity leak? If existing=3Dtrue, but cxl_validate_extent() or cxl_add_extent() fails (e.g., due to an overlap or memory allocation failure), the host drops the extent by calling delete_extent_node(pos) inside cxl_realize_group(). Because existing=3Dtrue, cxl_add_pending() skips sending an ADD_DC_RESPONSE. Since the device considers the extent accepted and the host never sends a Release-DC command for the dropped extent, does the capacity remain leaked until reboot? > [ ... ] > @@ -2199,6 +2209,145 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox, > } > EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL"); > =20 > +/* Return -EAGAIN if the extent list changes while reading */ > +static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled) > +{ > + u32 current_index, total_read, total_expected, initial_gen_num; > + struct cxl_memdev_state *mds =3D cxled_to_mds(cxled); > + struct cxl_mailbox *cxl_mbox =3D &mds->cxlds.cxl_mbox; > + struct device *dev =3D mds->cxlds.dev; > + struct cxl_mbox_cmd mbox_cmd; > + u32 max_extent_count; > + int rc =3D 0; > + bool first =3D true; > + > + struct cxl_mbox_get_extent_out *extents __free(kvfree) =3D > + kvmalloc(cxl_mbox->payload_size, GFP_KERNEL); > + if (!extents) > + return -ENOMEM; > + > + /* > + * Build and consume add_ctx.pending_extents under add_ctx.lock, the > + * same lock the DC event path (handle_add_event()) holds, so the two > + * cannot corrupt the shared pending list. > + */ > + guard(mutex)(&mds->add_ctx.lock); [Severity: High] Could this lock acquisition lead to an AB-BA deadlock between the mds->add_ctx.lock and the region device lock? In __cxl_process_extent_list(), the driver core is probing the region, so the region's device lock is already held. Here we acquire mds->add_ctx.l= ock, establishing a (device_lock -> add_ctx.lock) ordering. However, in the asynchronous event path (e.g., handle_add_event()), the dri= ver holds mds->add_ctx.lock, and then cxl_realize_group() calls cxlr_notify_ext= ent() which acquires the device lock: guard(device)(&cxlr->cxlr_dax->dev); This establishes an (add_ctx.lock -> device_lock) ordering. If a DC event arrives while a region is being probed, could these paths deadlock? [Severity: High] Could appending to mds->add_ctx.pending_extents here disrupt in-flight multi-part runtime Add Capacity events? Runtime DC Add Capacity events with the 'More' flag set accumulate partial extents in mds->add_ctx.pending_extents across multiple interrupts, with the lock released in between. If __cxl_process_extent_list() is called concurrently during a region probe, it unconditionally appends to this list. Later, it clears the list: out: clear_pending_extents(mds); If there was an in-flight runtime event, would its accumulated extents be processed as existing=3Dtrue (meaning no response sent) and then destroyed, leaving the device stalled waiting for a response? > + > + total_read =3D 0; > + current_index =3D 0; > + total_expected =3D 0; > + max_extent_count =3D (cxl_mbox->payload_size - sizeof(*extents)) / > + sizeof(struct cxl_extent); > + do { > + u32 nr_returned, current_total, current_gen_num; > + struct cxl_mbox_get_extent_in get_extent; > + > + get_extent =3D (struct cxl_mbox_get_extent_in) { > + .extent_cnt =3D cpu_to_le32(max_extent_count), > + .start_extent_index =3D cpu_to_le32(current_index), > + }; > + > + mbox_cmd =3D (struct cxl_mbox_cmd) { > + .opcode =3D CXL_MBOX_OP_GET_DC_EXTENT_LIST, > + .payload_in =3D &get_extent, > + .size_in =3D sizeof(get_extent), > + .size_out =3D cxl_mbox->payload_size, > + .payload_out =3D extents, > + .min_out =3D 1, > + }; > + > + rc =3D cxl_internal_send_cmd(cxl_mbox, &mbox_cmd); > + if (rc < 0) > + goto out; > + > + /* Save initial data */ > + if (first) { > + total_expected =3D le32_to_cpu(extents->total_extent_count); > + initial_gen_num =3D le32_to_cpu(extents->generation_num); > + first =3D false; > + } > + > + nr_returned =3D le32_to_cpu(extents->returned_extent_count); [Severity: High] Could a malformed device response cause an out-of-bounds read here? Because min_out is hardcoded to 1, truncated mailbox responses are accepted. The code then extracts nr_returned: nr_returned =3D le32_to_cpu(extents->returned_extent_count); and iterates over it: for (int i =3D 0; i < nr_returned ; i++) { struct cxl_extent *extent =3D &extents->extent[i]; Since nr_returned is not validated against max_extent_count or the actual payload size returned by the hardware, could a large nr_returned value cause the loop to read past the end of the kvmalloc buffer into unrelated kernel heap memory? > [ ... ] > + } while (total_expected > total_read); > + > + if (!list_empty(&mds->add_ctx.pending_extents)) { > + /* > + * Reached only on the success path (every error does goto out), > + * so rc is 0 here. These extents are already accepted on the > + * device (recovered from a prior boot). Pass existing=3Dtrue so > + * they are not re-reported in an Add-DC-Response (the device > + * would reject a DPA already added by a prior response), and so > + * a failed online releases them rather than silently dropping > + * them. > + */ > + rc =3D cxl_add_pending(mds, true); > + } > +out: > + clear_pending_extents(mds); [Severity: Medium] Does mixing goto statements with scope-based cleanup violate the cleanup subsystem guidelines? This function declares resources using scope-based cleanup: struct cxl_mbox_get_extent_out *extents __free(kvfree) =3D ... guard(mutex)(&mds->add_ctx.lock); But it also uses explicit goto out statements that jump to a manual cleanup label: out: clear_pending_extents(mds); The cleanup subsystem documentation expects that usage of "goto" and cleanup helpers is never mixed in the same function to avoid confusing ownership semantics. Should this be refactored? > + > + return rc; > +} --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260625112638.5506= 91-1-anisa.su@samsung.com?part=3D27