Netdev List
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: michael.chan@broadcom.com
Cc: Jakub Kicinski <kuba@kernel.org>,
	davem@davemloft.net, netdev@vger.kernel.org, edumazet@google.com,
	pabeni@redhat.com, andrew+netdev@lunn.ch,
	pavan.chebbi@broadcom.com, andrew.gospodarek@broadcom.com
Subject: Re: [PATCH net-next v2 09/15] bnxt_en: Add infrastructure for crypto key context IDs
Date: Fri, 15 May 2026 18:10:16 -0700	[thread overview]
Message-ID: <20260516011017.1857586-1-kuba@kernel.org> (raw)
In-Reply-To: <20260512212105.3488258-10-michael.chan@broadcom.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: bnxt_en: Add infrastructure for crypto key context IDs

This patch adds a kid_info data structure plus APIs to allocate and free
crypto key context IDs (KIDs) for kTLS in the bnxt_en driver.  Batches
of KIDs are obtained from the firmware and tracked in an RCU-walked
linked list with a per-batch bitmap.

Reviewing the changes raised a few questions, inline below.

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 926665c8bb79..4f5de41d4e86 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -17219,6 +17219,9 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	rc = bnxt_dl_register(bp);
>  	if (rc)
>  		goto init_err_dl;
> +	rc = bnxt_crypto_init(bp);
> +	if (rc)
> +		bnxt_free_crypto_info(bp);

[Low]
The other init steps in this function uniformly do an "if (rc) goto
init_err_*;" with at least a netdev_warn() logged.

Here, on failure rc is silently overwritten by the next assignment
(rc = register_netdev(dev) further down), so a transient HWRM error or
DMA-slice ENOMEM during probe ends up leaving the device up with kTLS
permanently disabled and no message in dmesg.

Additionally, if the TX HWRM_FUNC_KEY_CTX_ALLOC succeeded but the RX one
failed, the in-file comment

  /* Note that the driver does not free the key contexts.  They are freed
   * by the FW during FLR and HWRM_FUNC_RESET.
   */

implies the firmware-side TX KIDs remain allocated until the next
FLR/FUNC_RESET, while the driver has dropped its bookkeeping for them.
Would it be reasonable to log a netdev_warn() and/or short-circuit the
probe path here?

>  
>  	INIT_LIST_HEAD(&bp->usr_fltr_list);
>  
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> index ee154f1e4e19..4ea3e67be9f5 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c

[ ... ]

> +void bnxt_clear_crypto(struct bnxt *bp)
> +{
> +	struct bnxt_crypto_info *crypto = bp->crypto_info;
> +	struct bnxt_kid_info *kid, *tmp;
> +	struct bnxt_kctx *kctx;
> +	int i;
> +
> +	if (!crypto)
> +		return;
> +
> +	/* Shutting down or FW reset, no need to protect the lists. */
> +	for (i = 0; i < BNXT_MAX_CRYPTO_KEY_TYPE; i++) {
> +		kctx = &crypto->kctx[i];
> +		list_for_each_entry_safe(kid, tmp, &kctx->list, list) {
> +			list_del(&kid->list);
> +			kfree(kid);
> +		}
> +		kctx->total_alloc = 0;
> +		kctx->epoch++;
> +	}
> +}

[Medium]
The kid nodes are published into kctx->list with list_add_tail_rcu() in
bnxt_key_ctx_store() and read by bnxt_kid_valid(), bnxt_alloc_one_kctx()
and bnxt_free_one_kctx() under list_for_each_entry_rcu() inside
rcu_read_lock().

Here they are torn down with plain list_del() followed by an immediate
kfree(), with no list_del_rcu(), no synchronize_rcu() and no kfree_rcu()
(struct bnxt_kid_info has no rcu_head).  The comment says "no need to
protect the lists" because the function "assumes serialization (called
during shutdown)", but the kerneldoc above also lists "firmware reset"
as a use case.

If a future caller invokes bnxt_clear_crypto() from a path where an
RCU reader is concurrently walking kctx->list, list_del() will turn
kid->next into LIST_POISON1 and the immediate kfree() will return the
memory the reader is about to dereference.

Would it be safer to use list_del_rcu() plus kfree_rcu() (adding an
rcu_head to bnxt_kid_info) so the API is self-contained, regardless of
caller serialization assumptions?

[Low]
Also on the epoch bump: kctx->epoch is a u8, and BNXT_KID_EPOCH_MASK is
0xff000000 (8 bits).  Each call to bnxt_clear_crypto() does
kctx->epoch++, so after 256 shutdown/reset cycles the value wraps back
to a previously-used epoch.

The kerneldoc states that the increment is used to "invalidate any
outstanding key references", but a stale KID encoding the wrapped
epoch will silently pass the "epoch != kctx->epoch" check in
bnxt_kid_valid() and be matched against a freshly-allocated KID range.

Should the epoch field be wider, given that there are 24 free bits in
the KID encoding?

> +
>  /**
>   * bnxt_free_crypto_info - Free crypto offload resources
>   * @bp: pointer to bnxt device

[ ... ]

> +static int bnxt_key_ctx_store(struct bnxt_kctx *kctx, __le32 *key_buf, u32 num,
> +			      bool contig, u8 kind, u32 *id)
> +{
> +	struct bnxt_kid_info *kid;
> +	u32 i;
> +
> +	for (i = 0; i < num; ) {
> +		kid = kzalloc_obj(*kid);
> +		if (!kid)
> +			return -ENOMEM;

[Low]
On a kzalloc failure partway through a non-contiguous batch, this
returns -ENOMEM with kids from earlier iterations already linked into
kctx->list.  The remaining keys in key_buf[i..num-1], and in the
contig=true case all "num" keys, were already allocated by the
firmware, but the driver now has no record of them.

Per the comment

  /* Note that the driver does not free the key contexts.  They are
   * freed by the FW during FLR and HWRM_FUNC_RESET.
   */

these IDs stay allocated on the firmware side until the next reset.

Is there a way to either record them or release them via HWRM on the
error path so the usable KID pool isn't permanently shrunk under
sustained ENOMEM during kTLS bring-up?

> +		kid->start_id = le32_to_cpu(key_buf[i]);
> +		kid->type = kctx->type;
> +		kid->kind = kind;
> +		if (contig)
> +			kid->count = num;
> +		else
> +			kid->count = 1;
> +		bitmap_set(kid->ids, 0, kid->count);
> +		if (id && !i) {
> +			clear_bit(0, kid->ids);
> +			*id = BNXT_SET_KID(kctx, kid->start_id);
> +		}
> +		spin_lock(&kctx->lock);
> +		list_add_tail_rcu(&kid->list, &kctx->list);
> +		kctx->total_alloc += kid->count;
> +		spin_unlock(&kctx->lock);
> +		i += kid->count;
> +	}
> +	return 0;
> +}
> +
> +/* Note that the driver does not free the key contexts.  They are freed
> + * by the FW during FLR and HWRM_FUNC_RESET.
> + */
> +static int bnxt_hwrm_key_ctx_alloc(struct bnxt *bp, struct bnxt_kctx *kctx,
> +				   u8 kind, u32 num, u32 *id)
> +{

[ ... ]

> +	num_alloc = le16_to_cpu(resp->num_key_ctxs_allocated);
> +	if (num_alloc > num)
> +		netdev_warn(bp->dev,
> +			    "FW allocated more keys (%d) than requested (%d)\n",
> +			    num_alloc, num);
> +	else
> +		num = num_alloc;
> +	contig = resp->flags &
> +		 FUNC_KEY_CTX_ALLOC_RESP_FLAGS_KEY_CTXS_CONTIGUOUS;
> +	rc = bnxt_key_ctx_store(kctx, key_buf, num, contig, kind, id);

[Medium]
What happens if the firmware returns success with
num_key_ctxs_allocated == 0?

The "else num = num_alloc;" branch sets num to 0 and then calls
bnxt_key_ctx_store(kctx, key_buf, 0, contig, kind, id).  Inside
bnxt_key_ctx_store():

	for (i = 0; i < num; ) {
		...
		if (id && !i) {
			clear_bit(0, kid->ids);
			*id = BNXT_SET_KID(kctx, kid->start_id);
		}
		...
	}
	return 0;

the loop body never executes, so *id is never written and the function
still returns 0.  bnxt_hwrm_key_ctx_alloc() then falls through to
key_alloc_exit_wake with rc==0, and bnxt_key_ctx_alloc_one() returns 0
on the "if (!rc) return 0;" path.

Would it be appropriate to either treat num_alloc == 0 as an error
when an id was requested, or guarantee *id is initialised before the
function returns success?

> +
> +key_alloc_exit_wake:
> +	if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
> +		wake_up_all(&kctx->alloc_pending_wq);
> +key_alloc_exit:
> +	hwrm_req_drop(bp, req);
> +	return rc;
> +}

[ ... ]

> +void bnxt_free_one_kctx(struct bnxt_kctx *kctx, u32 id)
> +{
> +	struct bnxt_kid_info *kid;
> +
> +	id = BNXT_KID_HW(id);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(kid, &kctx->list, list) {
> +		if (id >= kid->start_id && id < kid->start_id + kid->count) {
> +			set_bit(id - kid->start_id, kid->ids);
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +}

[Low]
bnxt_kid_valid() rejects stale IDs via "epoch != kctx->epoch", but
bnxt_free_one_kctx() strips the epoch with BNXT_KID_HW(id) without any
such check.

If a caller passes an id captured before a bnxt_clear_crypto() epoch
bump, set_bit() will mark a currently in-use slot as free, and a
subsequent bnxt_alloc_one_kctx() can hand the same hardware KID to a
second kTLS connection — which for AES-GCM means (key, nonce) reuse.

The single in-tree consumer (introduced later in the series) does call
bnxt_kid_valid() before bnxt_free_one_kctx(), so this is not currently
reachable, but would a defensive epoch check inside this function close
the API hazard?

> +
> +#define BNXT_KCTX_ALLOC_RETRY_MAX	3
> +
> +int bnxt_key_ctx_alloc_one(struct bnxt *bp, struct bnxt_kctx *kctx, u8 kind,
> +			   u32 *id)
> +{
> +	int rc, retry = 0;
> +
> +	while (retry++ < BNXT_KCTX_ALLOC_RETRY_MAX) {
> +		rc = bnxt_alloc_one_kctx(kctx, kind, id);
> +		if (!rc)
> +			return 0;
> +
> +		if ((kctx->total_alloc + BNXT_KID_BATCH_SIZE) > kctx->max_ctx)
> +			return -ENOSPC;
> +
> +		if (!BNXT_KCTX_ALLOC_OK(kctx)) {
> +			wait_event(kctx->alloc_pending_wq,
> +				   BNXT_KCTX_ALLOC_OK(kctx));
> +			continue;
> +		}
> +		rc = bnxt_hwrm_key_ctx_alloc(bp, kctx, kind,
> +					     BNXT_KID_BATCH_SIZE, id);
> +		if (!rc)
> +			return 0;
> +	}
> +	return -EAGAIN;
> +}

[Medium]
The struct definition documents kctx->lock as

	/* to serialize update to the linked list and total_alloc */
	spinlock_t		lock;

and bnxt_key_ctx_store() does

	spin_lock(&kctx->lock);
	list_add_tail_rcu(&kid->list, &kctx->list);
	kctx->total_alloc += kid->count;
	spin_unlock(&kctx->lock);

but here total_alloc is read with no lock and no READ_ONCE().  Multiple
callers can each observe a stale value and proceed to issue their own
HWRM allocation, which combined with the BNXT_KCTX_ALLOC_PENDING_MAX
gate (8 in flight) can ask the firmware for up to
8 * BNXT_KID_BATCH_SIZE = 1024 IDs beyond max_ctx before any of them
update the counter.

Should this read be serialised under kctx->lock, or otherwise made
race-free with respect to the writers?

[Low]
On the BNXT_KCTX_ALLOC_OK gate itself, the check-then-act here is
non-atomic: BNXT_KCTX_ALLOC_OK reads alloc_pending and
bnxt_hwrm_key_ctx_alloc() does a separate atomic_inc_return().  Many
threads can simultaneously observe alloc_pending < 8 and then all
proceed to inc, raising it arbitrarily above the cap.

In addition, the wake_up_all() in bnxt_hwrm_key_ctx_alloc() depends on
its own pending_count (post-inc) being >= BNXT_KCTX_ALLOC_PENDING_MAX:

	pending_count = atomic_inc_return(&kctx->alloc_pending);
	rc = hwrm_req_send(bp, req);
	atomic_dec(&kctx->alloc_pending);
	...
key_alloc_exit_wake:
	if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
		wake_up_all(&kctx->alloc_pending_wq);

If a burst of callers all happen to record pending_count values < 8 by
each racing past the check, none of them wake the wait queue, and a
waiter parked in wait_event(BNXT_KCTX_ALLOC_OK) only gets released when
some later caller happens to increment to >= 8.

Was this latency tolerated intentionally, or is the gate intended to
be exact?

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> index e090491006db..ecdf18ba6d83 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h

[ ... ]

> +#define BNXT_SET_KID(kctx, kid)						\
> +	((kid) | ((kctx)->epoch << BNXT_KID_EPOCH_SHIFT))

[Low]
kctx->epoch is u8 and is promoted to (signed) int before the shift.
For epoch >= 128, "(int)epoch << 24" sets the sign bit of int, which
is undefined behaviour per C99/C11 6.5.7p4.

In combination with the 8-bit epoch wraparound noted earlier on
bnxt_clear_crypto(), epoch >= 128 is reachable, and a UBSan-instrumented
kernel will splat here.

Would casting epoch to u32 before the shift be acceptable?

> +
> +#define BNXT_KCTX_ALLOC_OK(kctx)	\
> +	(atomic_read(&((kctx)->alloc_pending)) < BNXT_KCTX_ALLOC_PENDING_MAX)
> +

[ ... ]

  reply	other threads:[~2026-05-16  1:10 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-12 21:20 [PATCH net-next v2 00/15] bnxt_en: Add kTLS TX offload support Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 01/15] bnxt_en: Add Midpath channel information Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 02/15] bnxt_en: Account for the MPC TX and CP rings Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 03/15] bnxt_en: Set default MPC ring count Michael Chan
2026-05-16  1:10   ` Jakub Kicinski
2026-05-12 21:20 ` [PATCH net-next v2 04/15] bnxt_en: Rename xdp_tx_lock to tx_lock Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 05/15] bnxt_en: Allocate and free MPC software structures Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 06/15] bnxt_en: Allocate and free MPC channels from firmware Michael Chan
2026-05-12 21:20 ` [PATCH net-next v2 07/15] bnxt_en: Allocate crypto structure and backing store Michael Chan
2026-05-16  1:10   ` Jakub Kicinski
2026-05-12 21:20 ` [PATCH net-next v2 08/15] bnxt_en: Reserve crypto RX and TX key contexts on a PF Michael Chan
2026-05-16  1:10   ` Jakub Kicinski
2026-05-12 21:20 ` [PATCH net-next v2 09/15] bnxt_en: Add infrastructure for crypto key context IDs Michael Chan
2026-05-16  1:10   ` Jakub Kicinski [this message]
2026-05-12 21:21 ` [PATCH net-next v2 10/15] bnxt_en: Add MPC transmit and completion functions Michael Chan
2026-05-16  1:10   ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 11/15] bnxt_en: Add crypto MPC transmit/completion infrastructure Michael Chan
2026-05-16  1:10   ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 12/15] bnxt_en: Support kTLS TX offload by implementing .tls_dev_add/del() Michael Chan
2026-05-16  1:10   ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 13/15] bnxt_en: Implement kTLS TX normal path Michael Chan
2026-05-16  1:10   ` Jakub Kicinski
2026-05-12 21:21 ` [PATCH net-next v2 14/15] bnxt_en: Add support for inline transmit BDs Michael Chan
2026-05-12 21:21 ` [PATCH net-next v2 15/15] bnxt_en: Add kTLS retransmission support Michael Chan
2026-05-16  1:10   ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260516011017.1857586-1-kuba@kernel.org \
    --to=kuba@kernel.org \
    --cc=andrew+netdev@lunn.ch \
    --cc=andrew.gospodarek@broadcom.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=michael.chan@broadcom.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pavan.chebbi@broadcom.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox