From: Jakub Kicinski <kuba@kernel.org>
To: michael.chan@broadcom.com
Cc: Jakub Kicinski <kuba@kernel.org>,
davem@davemloft.net, netdev@vger.kernel.org, edumazet@google.com,
pabeni@redhat.com, andrew+netdev@lunn.ch,
pavan.chebbi@broadcom.com, andrew.gospodarek@broadcom.com
Subject: Re: [PATCH net-next v2 09/15] bnxt_en: Add infrastructure for crypto key context IDs
Date: Fri, 15 May 2026 18:10:16 -0700 [thread overview]
Message-ID: <20260516011017.1857586-1-kuba@kernel.org> (raw)
In-Reply-To: <20260512212105.3488258-10-michael.chan@broadcom.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: bnxt_en: Add infrastructure for crypto key context IDs
This patch adds a kid_info data structure plus APIs to allocate and free
crypto key context IDs (KIDs) for kTLS in the bnxt_en driver. Batches
of KIDs are obtained from the firmware and tracked in an RCU-walked
linked list with a per-batch bitmap.
Reviewing the changes raised a few questions, inline below.
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 926665c8bb79..4f5de41d4e86 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -17219,6 +17219,9 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> rc = bnxt_dl_register(bp);
> if (rc)
> goto init_err_dl;
> + rc = bnxt_crypto_init(bp);
> + if (rc)
> + bnxt_free_crypto_info(bp);
[Low]
The other init steps in this function uniformly do an "if (rc) goto
init_err_*;" with at least a netdev_warn() logged.
Here, on failure rc is silently overwritten by the next assignment
(rc = register_netdev(dev) further down), so a transient HWRM error or
DMA-slice ENOMEM during probe ends up leaving the device up with kTLS
permanently disabled and no message in dmesg.
Additionally, if the TX HWRM_FUNC_KEY_CTX_ALLOC succeeded but the RX one
failed, the in-file comment
/* Note that the driver does not free the key contexts. They are freed
* by the FW during FLR and HWRM_FUNC_RESET.
*/
implies the firmware-side TX KIDs remain allocated until the next
FLR/FUNC_RESET, while the driver has dropped its bookkeeping for them.
Would it be reasonable to log a netdev_warn() and/or short-circuit the
probe path here?
>
> INIT_LIST_HEAD(&bp->usr_fltr_list);
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> index ee154f1e4e19..4ea3e67be9f5 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
[ ... ]
> +void bnxt_clear_crypto(struct bnxt *bp)
> +{
> + struct bnxt_crypto_info *crypto = bp->crypto_info;
> + struct bnxt_kid_info *kid, *tmp;
> + struct bnxt_kctx *kctx;
> + int i;
> +
> + if (!crypto)
> + return;
> +
> + /* Shutting down or FW reset, no need to protect the lists. */
> + for (i = 0; i < BNXT_MAX_CRYPTO_KEY_TYPE; i++) {
> + kctx = &crypto->kctx[i];
> + list_for_each_entry_safe(kid, tmp, &kctx->list, list) {
> + list_del(&kid->list);
> + kfree(kid);
> + }
> + kctx->total_alloc = 0;
> + kctx->epoch++;
> + }
> +}
[Medium]
The kid nodes are published into kctx->list with list_add_tail_rcu() in
bnxt_key_ctx_store() and read by bnxt_kid_valid(), bnxt_alloc_one_kctx()
and bnxt_free_one_kctx() under list_for_each_entry_rcu() inside
rcu_read_lock().
Here they are torn down with plain list_del() followed by an immediate
kfree(), with no list_del_rcu(), no synchronize_rcu() and no kfree_rcu()
(struct bnxt_kid_info has no rcu_head). The comment says "no need to
protect the lists" because the function "assumes serialization (called
during shutdown)", but the kerneldoc above also lists "firmware reset"
as a use case.
If a future caller invokes bnxt_clear_crypto() from a path where an
RCU reader is concurrently walking kctx->list, list_del() will turn
kid->next into LIST_POISON1 and the immediate kfree() will return the
memory the reader is about to dereference.
Would it be safer to use list_del_rcu() plus kfree_rcu() (adding an
rcu_head to bnxt_kid_info) so the API is self-contained, regardless of
caller serialization assumptions?
[Low]
Also on the epoch bump: kctx->epoch is a u8, and BNXT_KID_EPOCH_MASK is
0xff000000 (8 bits). Each call to bnxt_clear_crypto() does
kctx->epoch++, so after 256 shutdown/reset cycles the value wraps back
to a previously-used epoch.
The kerneldoc states that the increment is used to "invalidate any
outstanding key references", but a stale KID encoding the wrapped
epoch will silently pass the "epoch != kctx->epoch" check in
bnxt_kid_valid() and be matched against a freshly-allocated KID range.
Should the epoch field be wider, given that there are 24 free bits in
the KID encoding?
> +
> /**
> * bnxt_free_crypto_info - Free crypto offload resources
> * @bp: pointer to bnxt device
[ ... ]
> +static int bnxt_key_ctx_store(struct bnxt_kctx *kctx, __le32 *key_buf, u32 num,
> + bool contig, u8 kind, u32 *id)
> +{
> + struct bnxt_kid_info *kid;
> + u32 i;
> +
> + for (i = 0; i < num; ) {
> + kid = kzalloc_obj(*kid);
> + if (!kid)
> + return -ENOMEM;
[Low]
On a kzalloc failure partway through a non-contiguous batch, this
returns -ENOMEM with kids from earlier iterations already linked into
kctx->list. The remaining keys in key_buf[i..num-1], and in the
contig=true case all "num" keys, were already allocated by the
firmware, but the driver now has no record of them.
Per the comment
/* Note that the driver does not free the key contexts. They are
* freed by the FW during FLR and HWRM_FUNC_RESET.
*/
these IDs stay allocated on the firmware side until the next reset.
Is there a way to either record them or release them via HWRM on the
error path so the usable KID pool isn't permanently shrunk under
sustained ENOMEM during kTLS bring-up?
> + kid->start_id = le32_to_cpu(key_buf[i]);
> + kid->type = kctx->type;
> + kid->kind = kind;
> + if (contig)
> + kid->count = num;
> + else
> + kid->count = 1;
> + bitmap_set(kid->ids, 0, kid->count);
> + if (id && !i) {
> + clear_bit(0, kid->ids);
> + *id = BNXT_SET_KID(kctx, kid->start_id);
> + }
> + spin_lock(&kctx->lock);
> + list_add_tail_rcu(&kid->list, &kctx->list);
> + kctx->total_alloc += kid->count;
> + spin_unlock(&kctx->lock);
> + i += kid->count;
> + }
> + return 0;
> +}
> +
> +/* Note that the driver does not free the key contexts. They are freed
> + * by the FW during FLR and HWRM_FUNC_RESET.
> + */
> +static int bnxt_hwrm_key_ctx_alloc(struct bnxt *bp, struct bnxt_kctx *kctx,
> + u8 kind, u32 num, u32 *id)
> +{
[ ... ]
> + num_alloc = le16_to_cpu(resp->num_key_ctxs_allocated);
> + if (num_alloc > num)
> + netdev_warn(bp->dev,
> + "FW allocated more keys (%d) than requested (%d)\n",
> + num_alloc, num);
> + else
> + num = num_alloc;
> + contig = resp->flags &
> + FUNC_KEY_CTX_ALLOC_RESP_FLAGS_KEY_CTXS_CONTIGUOUS;
> + rc = bnxt_key_ctx_store(kctx, key_buf, num, contig, kind, id);
[Medium]
What happens if the firmware returns success with
num_key_ctxs_allocated == 0?
The "else num = num_alloc;" branch sets num to 0 and then calls
bnxt_key_ctx_store(kctx, key_buf, 0, contig, kind, id). Inside
bnxt_key_ctx_store():
for (i = 0; i < num; ) {
...
if (id && !i) {
clear_bit(0, kid->ids);
*id = BNXT_SET_KID(kctx, kid->start_id);
}
...
}
return 0;
the loop body never executes, so *id is never written and the function
still returns 0. bnxt_hwrm_key_ctx_alloc() then falls through to
key_alloc_exit_wake with rc==0, and bnxt_key_ctx_alloc_one() returns 0
on the "if (!rc) return 0;" path.
Would it be appropriate to either treat num_alloc == 0 as an error
when an id was requested, or guarantee *id is initialised before the
function returns success?
> +
> +key_alloc_exit_wake:
> + if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
> + wake_up_all(&kctx->alloc_pending_wq);
> +key_alloc_exit:
> + hwrm_req_drop(bp, req);
> + return rc;
> +}
[ ... ]
> +void bnxt_free_one_kctx(struct bnxt_kctx *kctx, u32 id)
> +{
> + struct bnxt_kid_info *kid;
> +
> + id = BNXT_KID_HW(id);
> + rcu_read_lock();
> + list_for_each_entry_rcu(kid, &kctx->list, list) {
> + if (id >= kid->start_id && id < kid->start_id + kid->count) {
> + set_bit(id - kid->start_id, kid->ids);
> + break;
> + }
> + }
> + rcu_read_unlock();
> +}
[Low]
bnxt_kid_valid() rejects stale IDs via "epoch != kctx->epoch", but
bnxt_free_one_kctx() strips the epoch with BNXT_KID_HW(id) without any
such check.
If a caller passes an id captured before a bnxt_clear_crypto() epoch
bump, set_bit() will mark a currently in-use slot as free, and a
subsequent bnxt_alloc_one_kctx() can hand the same hardware KID to a
second kTLS connection — which for AES-GCM means (key, nonce) reuse.
The single in-tree consumer (introduced later in the series) does call
bnxt_kid_valid() before bnxt_free_one_kctx(), so this is not currently
reachable, but would a defensive epoch check inside this function close
the API hazard?
> +
> +#define BNXT_KCTX_ALLOC_RETRY_MAX 3
> +
> +int bnxt_key_ctx_alloc_one(struct bnxt *bp, struct bnxt_kctx *kctx, u8 kind,
> + u32 *id)
> +{
> + int rc, retry = 0;
> +
> + while (retry++ < BNXT_KCTX_ALLOC_RETRY_MAX) {
> + rc = bnxt_alloc_one_kctx(kctx, kind, id);
> + if (!rc)
> + return 0;
> +
> + if ((kctx->total_alloc + BNXT_KID_BATCH_SIZE) > kctx->max_ctx)
> + return -ENOSPC;
> +
> + if (!BNXT_KCTX_ALLOC_OK(kctx)) {
> + wait_event(kctx->alloc_pending_wq,
> + BNXT_KCTX_ALLOC_OK(kctx));
> + continue;
> + }
> + rc = bnxt_hwrm_key_ctx_alloc(bp, kctx, kind,
> + BNXT_KID_BATCH_SIZE, id);
> + if (!rc)
> + return 0;
> + }
> + return -EAGAIN;
> +}
[Medium]
The struct definition documents kctx->lock as
/* to serialize update to the linked list and total_alloc */
spinlock_t lock;
and bnxt_key_ctx_store() does
spin_lock(&kctx->lock);
list_add_tail_rcu(&kid->list, &kctx->list);
kctx->total_alloc += kid->count;
spin_unlock(&kctx->lock);
but here total_alloc is read with no lock and no READ_ONCE(). Multiple
callers can each observe a stale value and proceed to issue their own
HWRM allocation, which combined with the BNXT_KCTX_ALLOC_PENDING_MAX
gate (8 in flight) can ask the firmware for up to
8 * BNXT_KID_BATCH_SIZE = 1024 IDs beyond max_ctx before any of them
update the counter.
Should this read be serialised under kctx->lock, or otherwise made
race-free with respect to the writers?
[Low]
On the BNXT_KCTX_ALLOC_OK gate itself, the check-then-act here is
non-atomic: BNXT_KCTX_ALLOC_OK reads alloc_pending and
bnxt_hwrm_key_ctx_alloc() does a separate atomic_inc_return(). Many
threads can simultaneously observe alloc_pending < 8 and then all
proceed to inc, raising it arbitrarily above the cap.
In addition, the wake_up_all() in bnxt_hwrm_key_ctx_alloc() depends on
its own pending_count (post-inc) being >= BNXT_KCTX_ALLOC_PENDING_MAX:
pending_count = atomic_inc_return(&kctx->alloc_pending);
rc = hwrm_req_send(bp, req);
atomic_dec(&kctx->alloc_pending);
...
key_alloc_exit_wake:
if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
wake_up_all(&kctx->alloc_pending_wq);
If a burst of callers all happen to record pending_count values < 8 by
each racing past the check, none of them wake the wait queue, and a
waiter parked in wait_event(BNXT_KCTX_ALLOC_OK) only gets released when
some later caller happens to increment to >= 8.
Was this latency tolerated intentionally, or is the gate intended to
be exact?
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> index e090491006db..ecdf18ba6d83 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
[ ... ]
> +#define BNXT_SET_KID(kctx, kid) \
> + ((kid) | ((kctx)->epoch << BNXT_KID_EPOCH_SHIFT))
[Low]
kctx->epoch is u8 and is promoted to (signed) int before the shift.
For epoch >= 128, "(int)epoch << 24" sets the sign bit of int, which
is undefined behaviour per C99/C11 6.5.7p4.
In combination with the 8-bit epoch wraparound noted earlier on
bnxt_clear_crypto(), epoch >= 128 is reachable, and a UBSan-instrumented
kernel will splat here.
Would casting epoch to u32 before the shift be acceptable?
> +
> +#define BNXT_KCTX_ALLOC_OK(kctx) \
> + (atomic_read(&((kctx)->alloc_pending)) < BNXT_KCTX_ALLOC_PENDING_MAX)
> +
[ ... ]
next prev parent reply other threads:[~2026-05-16 1:10 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-12 21:20 [PATCH net-next v2 00/15] bnxt_en: Add kTLS TX offload support Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 01/15] bnxt_en: Add Midpath channel information Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 02/15] bnxt_en: Account for the MPC TX and CP rings Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 03/15] bnxt_en: Set default MPC ring count Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
2026-05-12 21:20 ` [PATCH net-next v2 04/15] bnxt_en: Rename xdp_tx_lock to tx_lock Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 05/15] bnxt_en: Allocate and free MPC software structures Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 06/15] bnxt_en: Allocate and free MPC channels from firmware Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 07/15] bnxt_en: Allocate crypto structure and backing store Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
2026-05-12 21:20 ` [PATCH net-next v2 08/15] bnxt_en: Reserve crypto RX and TX key contexts on a PF Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
2026-05-12 21:20 ` [PATCH net-next v2 09/15] bnxt_en: Add infrastructure for crypto key context IDs Michael Chan
2026-05-16 1:10 ` Jakub Kicinski [this message]
2026-05-12 21:21 ` [PATCH net-next v2 10/15] bnxt_en: Add MPC transmit and completion functions Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 11/15] bnxt_en: Add crypto MPC transmit/completion infrastructure Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 12/15] bnxt_en: Support kTLS TX offload by implementing .tls_dev_add/del() Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 13/15] bnxt_en: Implement kTLS TX normal path Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 14/15] bnxt_en: Add support for inline transmit BDs Michael Chan
2026-05-12 21:21 ` [PATCH net-next v2 15/15] bnxt_en: Add kTLS retransmission support Michael Chan
2026-05-16 1:10 ` Jakub Kicinski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260516011017.1857586-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=andrew+netdev@lunn.ch \
--cc=andrew.gospodarek@broadcom.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=michael.chan@broadcom.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=pavan.chebbi@broadcom.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox