From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2F4F9E555; Thu, 25 Jun 2026 18:15:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782411346; cv=none; b=L9qVtddSMYJVn6qEMzMLRxlyDA4nSQf6YzyTKZf2meNNovWyuIQYbd1DZkONH0r454rYh/trPMc4zbx2ISP7RP6W+3OXilxFh9ZeRP0daVm5NlSCzMCPgtPBqiEafrJzV+GIYbGKrKEGrJpxAV+8pj4nkuMUZ/tU1SNkZAuzGq4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782411346; c=relaxed/simple; bh=Vy0mz5S0xGeZhiPhMD75YSiZqn1KDDwMRCwt3twshMg=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=XPQSvRKl4z9loAQLQlSJaa9fkLJu+lBh+4dmNpaS2aENczJkNFFZTfIC8eeivC5ADkbqQV/6d8J4Z7tyd5dB6gFNoT1eQNLnNpujhGP0vhLtzf7u37spCt2hk2FgsQrlezJZMinz5aQIOc6TonfOMjspzoS5daEORlJJ7lf97A0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=ln3g1MRE; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="ln3g1MRE" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 67B691F000E9; Thu, 25 Jun 2026 18:15:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1782411344; bh=ZxoYy/8q8oKT4Ormr6MYWOzEDEp1/6aaNQIxhYA4rfo=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=ln3g1MREeTrCFZWXcg6aaGUWNsI9hmJGLcvC8X65CJbD+YDOpaBm6vvjjJdokS1Rt bHSfEvYpHHiX00HoE0JtLyrpkkkjD0cDgJcGdbOT5DPPR3NDoNcH5aD/Em7GBhuyAL t6nyES8OnRkZ4xhcNxNpOk6FNJJvJnZRkIfJP/zm2uiqAocTr2aDPteL821V20ktTN QDtqT56Ob4LX1OdKtHERLHKawTdT3VM5IHMlNjrLfwl0dHpjxR5m7/SrKQIaN3zhbt gSNj0wRel5SJ6Fm7ilTUSVPBcpAxa2eCucsj6xjs8t9DZ9iuW+bVZhh07MhigyImqL iwzJmxWPgl2jg== From: sashiko-bot@kernel.org Subject: Re: [PATCH v11 13/31] cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains Reply-To: sashiko-reviews@lists.linux.dev To: "Anisa Su" Cc: linux-cxl@vger.kernel.org, nvdimm@lists.linux.dev In-Reply-To: <20260625112638.550691-14-anisa.su@samsung.com> References: <20260625112638.550691-1-anisa.su@samsung.com> <20260625112638.550691-14-anisa.su@samsung.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Thu, 25 Jun 2026 18:15:43 +0000 Message-Id: <20260625181544.67B691F000E9@smtp.kernel.org> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 2 potential issue(= s) to consider: - [High] Error paths in handle_add_event fail to abort the DCD event chain = properly, leading to a stalled device, leaked watchdog timers, and partial = chain acceptance. - [High] Watchdog timeout clears the pending list without transitioning to = a discard state, leading to a double-response protocol violation if the cha= in completes. -- commit f4b4a95fd0c56d91f06baa4811e9a05ea46012ee Author: Anisa Su cxl/mem: Add 20 second timeout for stalled DC_ADD_CAPACITY chains A DC_ADD_CAPACITY event can span multiple event records. This patch adds a delayed-work watchdog to prevent indefinitely stalled chains and enforces a maximum limit on the number of pending extents. > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 7dd40fb8d613c..4e887b5cdc3ed 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c [ ... ] > @@ -1208,15 +1208,78 @@ static void clear_pending_extents(void *_mds) > =20 > list_for_each_entry_safe(pos, tmp, &mds->add_ctx.pending_extents, list) > delete_extent_node(pos); > + mds->add_ctx.nr_pending =3D 0; > mds->add_ctx.group =3D NULL; > } [ ... ] > +static void cxl_dc_add_timeout(struct work_struct *work) > +{ > + struct pending_add_ctx *ctx =3D container_of(to_delayed_work(work), > + struct pending_add_ctx, > + timeout_work); [ ... ] > + if (cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE, > + &ctx->pending_extents, 0)) > + dev_dbg(dev, "Failed to send empty ADD_DC_RESPONSE on timeout\n"); > + > + clear_pending_extents(mds); > + ctx->armed =3D false; > +} [Severity: High] If the device is just slow and later sends the remainder of the chain (including the closing More=3D0 event), will the driver treat the delayed events as a fresh chain? Because cxl_dc_add_timeout() resets ctx->armed =3D false and nr_pending =3D= 0 without marking the chain as discarded, it appears handle_add_event() would process a delayed tail of the chain as a new chain. When the More=3D0 event is processed, wouldn't the driver issue a second ADD_DC_RESPONSE for this single event chain, violating the CXL protocol? [ ... ] > @@ -1239,10 +1303,20 @@ static int add_to_pending_list(struct list_head *= pending_list, > static int handle_add_event(struct cxl_memdev_state *mds, > struct cxl_event_dcd *event) > { > + struct pending_add_ctx *ctx =3D &mds->add_ctx; > struct device *dev =3D mds->cxlds.dev; > int rc; > =20 > - rc =3D add_to_pending_list(&mds->add_ctx.pending_extents, &event->exten= t); > + guard(mutex)(&ctx->lock); > + > + if (ctx->nr_pending >=3D CXL_DC_MAX_PENDING_EXTENTS) { > + dev_warn(dev, "DC add chain exceeds %u extents; dropping (firmware bug= )\n", > + CXL_DC_MAX_PENDING_EXTENTS); > + clear_pending_extents(mds); > + return -ENOSPC; > + } > + > + rc =3D add_to_pending_list(&ctx->pending_extents, &event->extent); > if (rc) { > clear_pending_extents(mds); > return rc; [Severity: High] Do these early error returns fail to abort the event chain properly? If the limit is exceeded or memory allocation fails, the code calls clear_pending_extents() and returns an error without sending an ADD_DC_RESPONSE. If this happens on the first extent, wouldn't the device never receive a response and stall indefinitely? If it occurs on a subsequent extent, does this leave the watchdog timer armed to fire spuriously later? Additionally, because clear_pending_extents() resets nr_pending to 0, if the device continues sending More=3D1 events for the same chain, would the driver incorrectly process them as a new chain and potentially accept the tail of a chain after dropping the head? --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260625112638.5506= 91-1-anisa.su@samsung.com?part=3D13