From: Jakub Kicinski
To: michael.chan@broadcom.com
Cc: Jakub Kicinski, davem@davemloft.net, netdev@vger.kernel.org,
	edumazet@google.com, pabeni@redhat.com, andrew+netdev@lunn.ch,
	pavan.chebbi@broadcom.com, andrew.gospodarek@broadcom.com
Subject: Re: [PATCH net-next v2 09/15] bnxt_en: Add infrastructure for crypto key context IDs
Date: Fri, 15 May 2026 18:10:16 -0700
Message-ID: <20260516011017.1857586-1-kuba@kernel.org>
In-Reply-To: <20260512212105.3488258-10-michael.chan@broadcom.com>
References: <20260512212105.3488258-10-michael.chan@broadcom.com>
X-Mailing-List: netdev@vger.kernel.org

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

---

Subject: bnxt_en: Add infrastructure for crypto key context IDs

This patch adds a kid_info data structure plus APIs to allocate and free
crypto key context IDs (KIDs) for kTLS in the bnxt_en driver. Batches of
KIDs are obtained from the firmware and tracked in an RCU-walked linked
list with a per-batch bitmap.

Reviewing the changes raised a few questions, inline below.
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 926665c8bb79..4f5de41d4e86 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -17219,6 +17219,9 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	rc = bnxt_dl_register(bp);
>  	if (rc)
>  		goto init_err_dl;
> +	rc = bnxt_crypto_init(bp);
> +	if (rc)
> +		bnxt_free_crypto_info(bp);

[Low] The other init steps in this function uniformly do an
"if (rc) goto init_err_*;" with at least a netdev_warn() logged. Here,
on failure rc is silently overwritten by the next assignment
(rc = register_netdev(dev) further down), so a transient HWRM error or
DMA-slice ENOMEM during probe ends up leaving the device up with kTLS
permanently disabled and no message in dmesg. Additionally, if the TX
HWRM_FUNC_KEY_CTX_ALLOC succeeded but the RX one failed, the in-file
comment

	/* Note that the driver does not free the key contexts. They are freed
	 * by the FW during FLR and HWRM_FUNC_RESET.
	 */

implies the firmware-side TX KIDs remain allocated until the next
FLR/FUNC_RESET, while the driver has dropped its bookkeeping for them.
Would it be reasonable to log a netdev_warn() and/or short-circuit the
probe path here?

>
>  	INIT_LIST_HEAD(&bp->usr_fltr_list);
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> index ee154f1e4e19..4ea3e67be9f5 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c

[ ... ]

> +void bnxt_clear_crypto(struct bnxt *bp)
> +{
> +	struct bnxt_crypto_info *crypto = bp->crypto_info;
> +	struct bnxt_kid_info *kid, *tmp;
> +	struct bnxt_kctx *kctx;
> +	int i;
> +
> +	if (!crypto)
> +		return;
> +
> +	/* Shutting down or FW reset, no need to protect the lists.
> +	 */
> +	for (i = 0; i < BNXT_MAX_CRYPTO_KEY_TYPE; i++) {
> +		kctx = &crypto->kctx[i];
> +		list_for_each_entry_safe(kid, tmp, &kctx->list, list) {
> +			list_del(&kid->list);
> +			kfree(kid);
> +		}
> +		kctx->total_alloc = 0;
> +		kctx->epoch++;
> +	}
> +}

[Medium] The kid nodes are published into kctx->list with
list_add_tail_rcu() in bnxt_key_ctx_store() and read by
bnxt_kid_valid(), bnxt_alloc_one_kctx() and bnxt_free_one_kctx() under
list_for_each_entry_rcu() inside rcu_read_lock(). Here they are torn
down with plain list_del() followed by an immediate kfree(), with no
list_del_rcu(), no synchronize_rcu() and no kfree_rcu()
(struct bnxt_kid_info has no rcu_head). The comment says "no need to
protect the lists" because the function "assumes serialization (called
during shutdown)", but the kerneldoc above also lists "firmware reset"
as a use case. If a future caller invokes bnxt_clear_crypto() from a
path where an RCU reader is concurrently walking kctx->list, list_del()
will turn kid->next into LIST_POISON1 and the immediate kfree() will
return the memory the reader is about to dereference. Would it be safer
to use list_del_rcu() plus kfree_rcu() (adding an rcu_head to
bnxt_kid_info) so the API is self-contained, regardless of caller
serialization assumptions?

[Low] Also on the epoch bump: kctx->epoch is a u8, and
BNXT_KID_EPOCH_MASK is 0xff000000 (8 bits). Each call to
bnxt_clear_crypto() does kctx->epoch++, so after 256 shutdown/reset
cycles the value wraps back to a previously-used epoch. The kerneldoc
states that the increment is used to "invalidate any outstanding key
references", but a stale KID encoding the wrapped epoch will silently
pass the "epoch != kctx->epoch" check in bnxt_kid_valid() and be
matched against a freshly-allocated KID range. Should the epoch field
be wider, given that there are 24 free bits in the KID encoding?

> +
> +/**
> + * bnxt_free_crypto_info - Free crypto offload resources
> + * @bp: pointer to bnxt device

[ ... ]
> +static int bnxt_key_ctx_store(struct bnxt_kctx *kctx, __le32 *key_buf, u32 num,
> +			      bool contig, u8 kind, u32 *id)
> +{
> +	struct bnxt_kid_info *kid;
> +	u32 i;
> +
> +	for (i = 0; i < num; ) {
> +		kid = kzalloc_obj(*kid);
> +		if (!kid)
> +			return -ENOMEM;

[Low] On a kzalloc failure partway through a non-contiguous batch, this
returns -ENOMEM with kids from earlier iterations already linked into
kctx->list. The remaining keys in key_buf[i..num-1], and in the
contig=true case all "num" keys, were already allocated by the
firmware, but the driver now has no record of them. Per the comment

	/* Note that the driver does not free the key contexts. They are
	 * freed by the FW during FLR and HWRM_FUNC_RESET.
	 */

these IDs stay allocated on the firmware side until the next reset. Is
there a way to either record them or release them via HWRM on the error
path so the usable KID pool isn't permanently shrunk under sustained
ENOMEM during kTLS bring-up?

> +		kid->start_id = le32_to_cpu(key_buf[i]);
> +		kid->type = kctx->type;
> +		kid->kind = kind;
> +		if (contig)
> +			kid->count = num;
> +		else
> +			kid->count = 1;
> +		bitmap_set(kid->ids, 0, kid->count);
> +		if (id && !i) {
> +			clear_bit(0, kid->ids);
> +			*id = BNXT_SET_KID(kctx, kid->start_id);
> +		}
> +		spin_lock(&kctx->lock);
> +		list_add_tail_rcu(&kid->list, &kctx->list);
> +		kctx->total_alloc += kid->count;
> +		spin_unlock(&kctx->lock);
> +		i += kid->count;
> +	}
> +	return 0;
> +}
> +
> +/* Note that the driver does not free the key contexts. They are freed
> + * by the FW during FLR and HWRM_FUNC_RESET.
> + */
> +static int bnxt_hwrm_key_ctx_alloc(struct bnxt *bp, struct bnxt_kctx *kctx,
> +				   u8 kind, u32 num, u32 *id)
> +{

[ ... ]

> +	num_alloc = le16_to_cpu(resp->num_key_ctxs_allocated);
> +	if (num_alloc > num)
> +		netdev_warn(bp->dev,
> +			    "FW allocated more keys (%d) than requested (%d)\n",
> +			    num_alloc, num);
> +	else
> +		num = num_alloc;
> +	contig = resp->flags &
> +		 FUNC_KEY_CTX_ALLOC_RESP_FLAGS_KEY_CTXS_CONTIGUOUS;
> +	rc = bnxt_key_ctx_store(kctx, key_buf, num, contig, kind, id);

[Medium] What happens if the firmware returns success with
num_key_ctxs_allocated == 0? The "else num = num_alloc;" branch sets
num to 0 and then calls bnxt_key_ctx_store(kctx, key_buf, 0, contig,
kind, id). Inside bnxt_key_ctx_store():

	for (i = 0; i < num; ) {
		...
		if (id && !i) {
			clear_bit(0, kid->ids);
			*id = BNXT_SET_KID(kctx, kid->start_id);
		}
		...
	}
	return 0;

the loop body never executes, so *id is never written and the function
still returns 0. bnxt_hwrm_key_ctx_alloc() then falls through to
key_alloc_exit_wake with rc==0, and bnxt_key_ctx_alloc_one() returns 0
on the "if (!rc) return 0;" path. Would it be appropriate to either
treat num_alloc == 0 as an error when an id was requested, or guarantee
*id is initialised before the function returns success?

> +
> +key_alloc_exit_wake:
> +	if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
> +		wake_up_all(&kctx->alloc_pending_wq);
> +key_alloc_exit:
> +	hwrm_req_drop(bp, req);
> +	return rc;
> +}

[ ... ]

> +void bnxt_free_one_kctx(struct bnxt_kctx *kctx, u32 id)
> +{
> +	struct bnxt_kid_info *kid;
> +
> +	id = BNXT_KID_HW(id);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(kid, &kctx->list, list) {
> +		if (id >= kid->start_id && id < kid->start_id + kid->count) {
> +			set_bit(id - kid->start_id, kid->ids);
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +}

[Low] bnxt_kid_valid() rejects stale IDs via "epoch != kctx->epoch",
but bnxt_free_one_kctx() strips the epoch with BNXT_KID_HW(id) without
any such check. If a caller passes an id captured before a
bnxt_clear_crypto() epoch bump, set_bit() will mark a currently in-use
slot as free, and a subsequent bnxt_alloc_one_kctx() can hand the same
hardware KID to a second kTLS connection, which for AES-GCM means
(key, nonce) reuse. The single in-tree consumer (introduced later in
the series) does call bnxt_kid_valid() before bnxt_free_one_kctx(), so
this is not currently reachable, but would a defensive epoch check
inside this function close the API hazard?

> +
> +#define BNXT_KCTX_ALLOC_RETRY_MAX	3
> +
> +int bnxt_key_ctx_alloc_one(struct bnxt *bp, struct bnxt_kctx *kctx, u8 kind,
> +			   u32 *id)
> +{
> +	int rc, retry = 0;
> +
> +	while (retry++ < BNXT_KCTX_ALLOC_RETRY_MAX) {
> +		rc = bnxt_alloc_one_kctx(kctx, kind, id);
> +		if (!rc)
> +			return 0;
> +
> +		if ((kctx->total_alloc + BNXT_KID_BATCH_SIZE) > kctx->max_ctx)
> +			return -ENOSPC;
> +
> +		if (!BNXT_KCTX_ALLOC_OK(kctx)) {
> +			wait_event(kctx->alloc_pending_wq,
> +				   BNXT_KCTX_ALLOC_OK(kctx));
> +			continue;
> +		}
> +		rc = bnxt_hwrm_key_ctx_alloc(bp, kctx, kind,
> +					     BNXT_KID_BATCH_SIZE, id);
> +		if (!rc)
> +			return 0;
> +	}
> +	return -EAGAIN;
> +}

[Medium] The struct definition documents kctx->lock as

	/* to serialize update to the linked list and total_alloc */
	spinlock_t lock;

and bnxt_key_ctx_store() does

	spin_lock(&kctx->lock);
	list_add_tail_rcu(&kid->list, &kctx->list);
	kctx->total_alloc += kid->count;
	spin_unlock(&kctx->lock);

but here total_alloc is read with no lock and no READ_ONCE(). Multiple
callers can each observe a stale value and proceed to issue their own
HWRM allocation, which combined with the BNXT_KCTX_ALLOC_PENDING_MAX
gate (8 in flight) can ask the firmware for up to
8 * BNXT_KID_BATCH_SIZE = 1024 IDs beyond max_ctx before any of them
update the counter. Should this read be serialised under kctx->lock, or
otherwise made race-free with respect to the writers?
[Low] On the BNXT_KCTX_ALLOC_OK gate itself, the check-then-act here is
non-atomic: BNXT_KCTX_ALLOC_OK reads alloc_pending and
bnxt_hwrm_key_ctx_alloc() does a separate atomic_inc_return(). Many
threads can simultaneously observe alloc_pending < 8 and then all
proceed to inc, raising it arbitrarily above the cap. In addition, the
wake_up_all() in bnxt_hwrm_key_ctx_alloc() depends on its own
pending_count (post-inc) being >= BNXT_KCTX_ALLOC_PENDING_MAX:

	pending_count = atomic_inc_return(&kctx->alloc_pending);
	rc = hwrm_req_send(bp, req);
	atomic_dec(&kctx->alloc_pending);
	...
	key_alloc_exit_wake:
	if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
		wake_up_all(&kctx->alloc_pending_wq);

If a burst of callers all happen to record pending_count values < 8 by
each racing past the check, none of them wake the wait queue, and a
waiter parked in wait_event(BNXT_KCTX_ALLOC_OK) only gets released when
some later caller happens to increment to >= 8. Was this latency
tolerated intentionally, or is the gate intended to be exact?

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> index e090491006db..ecdf18ba6d83 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h

[ ... ]

> +#define BNXT_SET_KID(kctx, kid)					\
> +	((kid) | ((kctx)->epoch << BNXT_KID_EPOCH_SHIFT))

[Low] kctx->epoch is u8 and is promoted to (signed) int before the
shift. For epoch >= 128, "(int)epoch << 24" sets the sign bit of int,
which is undefined behaviour per C99/C11 6.5.7p4. In combination with
the 8-bit epoch wraparound noted earlier on bnxt_clear_crypto(),
epoch >= 128 is reachable, and a UBSan-instrumented kernel will splat
here. Would casting epoch to u32 before the shift be acceptable?

> +
> +#define BNXT_KCTX_ALLOC_OK(kctx)				\
> +	(atomic_read(&((kctx)->alloc_pending)) < BNXT_KCTX_ALLOC_PENDING_MAX)
> +

[ ... ]