From: Jakub Kicinski <kuba@kernel.org>
To: michael.chan@broadcom.com
Cc: Jakub Kicinski <kuba@kernel.org>,
	davem@davemloft.net, netdev@vger.kernel.org, edumazet@google.com,
	pabeni@redhat.com, andrew+netdev@lunn.ch,
	pavan.chebbi@broadcom.com, andrew.gospodarek@broadcom.com
Subject: Re: [PATCH net-next v2 09/15] bnxt_en: Add infrastructure for crypto key context IDs
Date: Fri, 15 May 2026 18:10:16 -0700	[thread overview]
Message-ID: <20260516011017.1857586-1-kuba@kernel.org> (raw)
In-Reply-To: <20260512212105.3488258-10-michael.chan@broadcom.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: bnxt_en: Add infrastructure for crypto key context IDs

This patch adds a kid_info data structure plus APIs to allocate and free
crypto key context IDs (KIDs) for kTLS in the bnxt_en driver.  Batches
of KIDs are obtained from the firmware and tracked in an RCU-walked
linked list with a per-batch bitmap.

Reviewing the changes raised a few questions, inline below.

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 926665c8bb79..4f5de41d4e86 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -17219,6 +17219,9 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	rc = bnxt_dl_register(bp);
>  	if (rc)
>  		goto init_err_dl;
> +	rc = bnxt_crypto_init(bp);
> +	if (rc)
> +		bnxt_free_crypto_info(bp);

[Low]
The other init steps in this function uniformly follow the "if (rc)
goto init_err_*;" pattern, each logging at least a netdev_warn().

Here, on failure rc is silently overwritten by the next assignment
(rc = register_netdev(dev) further down), so a transient HWRM error or
DMA-slice ENOMEM during probe ends up leaving the device up with kTLS
permanently disabled and no message in dmesg.

Additionally, if the TX HWRM_FUNC_KEY_CTX_ALLOC succeeded but the RX one
failed, the in-file comment

  /* Note that the driver does not free the key contexts.  They are freed
   * by the FW during FLR and HWRM_FUNC_RESET.
   */

implies the firmware-side TX KIDs remain allocated until the next
FLR/FUNC_RESET, while the driver has dropped its bookkeeping for them.
Would it be reasonable to log a netdev_warn() and/or short-circuit the
probe path here?

>  
>  	INIT_LIST_HEAD(&bp->usr_fltr_list);
>  
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> index ee154f1e4e19..4ea3e67be9f5 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.c

[ ... ]

> +void bnxt_clear_crypto(struct bnxt *bp)
> +{
> +	struct bnxt_crypto_info *crypto = bp->crypto_info;
> +	struct bnxt_kid_info *kid, *tmp;
> +	struct bnxt_kctx *kctx;
> +	int i;
> +
> +	if (!crypto)
> +		return;
> +
> +	/* Shutting down or FW reset, no need to protect the lists. */
> +	for (i = 0; i < BNXT_MAX_CRYPTO_KEY_TYPE; i++) {
> +		kctx = &crypto->kctx[i];
> +		list_for_each_entry_safe(kid, tmp, &kctx->list, list) {
> +			list_del(&kid->list);
> +			kfree(kid);
> +		}
> +		kctx->total_alloc = 0;
> +		kctx->epoch++;
> +	}
> +}

[Medium]
The kid nodes are published into kctx->list with list_add_tail_rcu() in
bnxt_key_ctx_store() and read by bnxt_kid_valid(), bnxt_alloc_one_kctx()
and bnxt_free_one_kctx() under list_for_each_entry_rcu() inside
rcu_read_lock().

Here they are torn down with plain list_del() followed by an immediate
kfree(), with no list_del_rcu(), no synchronize_rcu() and no kfree_rcu()
(struct bnxt_kid_info has no rcu_head).  The comment says "no need to
protect the lists" because the function "assumes serialization (called
during shutdown)", but the kerneldoc above also lists "firmware reset"
as a use case.

If a future caller invokes bnxt_clear_crypto() from a path where an
RCU reader is concurrently walking kctx->list, list_del() will turn
kid->next into LIST_POISON1 and the immediate kfree() will return the
memory the reader is about to dereference.

Would it be safer to use list_del_rcu() plus kfree_rcu() (adding an
rcu_head to bnxt_kid_info) so the API is self-contained, regardless of
caller serialization assumptions?

[Low]
Also on the epoch bump: kctx->epoch is a u8, and BNXT_KID_EPOCH_MASK is
0xff000000 (8 bits).  Each call to bnxt_clear_crypto() does
kctx->epoch++, so after 256 shutdown/reset cycles the value wraps back
to a previously-used epoch.

The kerneldoc states that the increment is used to "invalidate any
outstanding key references", but a stale KID encoding the wrapped
epoch will silently pass the "epoch != kctx->epoch" check in
bnxt_kid_valid() and be matched against a freshly-allocated KID range.

Should the epoch field be wider, given that there are 24 free bits in
the KID encoding?

> +
>  /**
>   * bnxt_free_crypto_info - Free crypto offload resources
>   * @bp: pointer to bnxt device

[ ... ]

> +static int bnxt_key_ctx_store(struct bnxt_kctx *kctx, __le32 *key_buf, u32 num,
> +			      bool contig, u8 kind, u32 *id)
> +{
> +	struct bnxt_kid_info *kid;
> +	u32 i;
> +
> +	for (i = 0; i < num; ) {
> +		kid = kzalloc_obj(*kid);
> +		if (!kid)
> +			return -ENOMEM;

[Low]
On a kzalloc failure partway through a non-contiguous batch, this
returns -ENOMEM with kids from earlier iterations already linked into
kctx->list.  The remaining keys in key_buf[i..num-1], and in the
contig=true case all "num" keys, were already allocated by the
firmware, but the driver now has no record of them.

Per the comment

  /* Note that the driver does not free the key contexts.  They are
   * freed by the FW during FLR and HWRM_FUNC_RESET.
   */

these IDs stay allocated on the firmware side until the next reset.

Is there a way to either record them or release them via HWRM on the
error path so the usable KID pool isn't permanently shrunk under
sustained ENOMEM during kTLS bring-up?

> +		kid->start_id = le32_to_cpu(key_buf[i]);
> +		kid->type = kctx->type;
> +		kid->kind = kind;
> +		if (contig)
> +			kid->count = num;
> +		else
> +			kid->count = 1;
> +		bitmap_set(kid->ids, 0, kid->count);
> +		if (id && !i) {
> +			clear_bit(0, kid->ids);
> +			*id = BNXT_SET_KID(kctx, kid->start_id);
> +		}
> +		spin_lock(&kctx->lock);
> +		list_add_tail_rcu(&kid->list, &kctx->list);
> +		kctx->total_alloc += kid->count;
> +		spin_unlock(&kctx->lock);
> +		i += kid->count;
> +	}
> +	return 0;
> +}
> +
> +/* Note that the driver does not free the key contexts.  They are freed
> + * by the FW during FLR and HWRM_FUNC_RESET.
> + */
> +static int bnxt_hwrm_key_ctx_alloc(struct bnxt *bp, struct bnxt_kctx *kctx,
> +				   u8 kind, u32 num, u32 *id)
> +{

[ ... ]

> +	num_alloc = le16_to_cpu(resp->num_key_ctxs_allocated);
> +	if (num_alloc > num)
> +		netdev_warn(bp->dev,
> +			    "FW allocated more keys (%d) than requested (%d)\n",
> +			    num_alloc, num);
> +	else
> +		num = num_alloc;
> +	contig = resp->flags &
> +		 FUNC_KEY_CTX_ALLOC_RESP_FLAGS_KEY_CTXS_CONTIGUOUS;
> +	rc = bnxt_key_ctx_store(kctx, key_buf, num, contig, kind, id);

[Medium]
What happens if the firmware returns success with
num_key_ctxs_allocated == 0?

The "else num = num_alloc;" branch sets num to 0 and then calls
bnxt_key_ctx_store(kctx, key_buf, 0, contig, kind, id).  Inside
bnxt_key_ctx_store():

	for (i = 0; i < num; ) {
		...
		if (id && !i) {
			clear_bit(0, kid->ids);
			*id = BNXT_SET_KID(kctx, kid->start_id);
		}
		...
	}
	return 0;

the loop body never executes, so *id is never written and the function
still returns 0.  bnxt_hwrm_key_ctx_alloc() then falls through to
key_alloc_exit_wake with rc==0, and bnxt_key_ctx_alloc_one() returns 0
on the "if (!rc) return 0;" path.

Would it be appropriate to either treat num_alloc == 0 as an error
when an id was requested, or guarantee *id is initialised before the
function returns success?

> +
> +key_alloc_exit_wake:
> +	if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
> +		wake_up_all(&kctx->alloc_pending_wq);
> +key_alloc_exit:
> +	hwrm_req_drop(bp, req);
> +	return rc;
> +}

[ ... ]

> +void bnxt_free_one_kctx(struct bnxt_kctx *kctx, u32 id)
> +{
> +	struct bnxt_kid_info *kid;
> +
> +	id = BNXT_KID_HW(id);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(kid, &kctx->list, list) {
> +		if (id >= kid->start_id && id < kid->start_id + kid->count) {
> +			set_bit(id - kid->start_id, kid->ids);
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +}

[Low]
bnxt_kid_valid() rejects stale IDs via "epoch != kctx->epoch", but
bnxt_free_one_kctx() strips the epoch with BNXT_KID_HW(id) without any
such check.

If a caller passes an id captured before a bnxt_clear_crypto() epoch
bump, set_bit() will mark a currently in-use slot as free, and a
subsequent bnxt_alloc_one_kctx() can hand the same hardware KID to a
second kTLS connection — which for AES-GCM means (key, nonce) reuse.

The single in-tree consumer (introduced later in the series) does call
bnxt_kid_valid() before bnxt_free_one_kctx(), so this is not currently
reachable, but would a defensive epoch check inside this function close
the API hazard?

> +
> +#define BNXT_KCTX_ALLOC_RETRY_MAX	3
> +
> +int bnxt_key_ctx_alloc_one(struct bnxt *bp, struct bnxt_kctx *kctx, u8 kind,
> +			   u32 *id)
> +{
> +	int rc, retry = 0;
> +
> +	while (retry++ < BNXT_KCTX_ALLOC_RETRY_MAX) {
> +		rc = bnxt_alloc_one_kctx(kctx, kind, id);
> +		if (!rc)
> +			return 0;
> +
> +		if ((kctx->total_alloc + BNXT_KID_BATCH_SIZE) > kctx->max_ctx)
> +			return -ENOSPC;
> +
> +		if (!BNXT_KCTX_ALLOC_OK(kctx)) {
> +			wait_event(kctx->alloc_pending_wq,
> +				   BNXT_KCTX_ALLOC_OK(kctx));
> +			continue;
> +		}
> +		rc = bnxt_hwrm_key_ctx_alloc(bp, kctx, kind,
> +					     BNXT_KID_BATCH_SIZE, id);
> +		if (!rc)
> +			return 0;
> +	}
> +	return -EAGAIN;
> +}

[Medium]
The struct definition documents kctx->lock as

	/* to serialize update to the linked list and total_alloc */
	spinlock_t		lock;

and bnxt_key_ctx_store() does

	spin_lock(&kctx->lock);
	list_add_tail_rcu(&kid->list, &kctx->list);
	kctx->total_alloc += kid->count;
	spin_unlock(&kctx->lock);

but here total_alloc is read with no lock and no READ_ONCE().  Multiple
callers can each observe a stale value and proceed to issue their own
HWRM allocation, which combined with the BNXT_KCTX_ALLOC_PENDING_MAX
gate (8 in flight) can ask the firmware for up to
8 * BNXT_KID_BATCH_SIZE = 1024 IDs beyond max_ctx before any of them
update the counter.

Should this read be serialised under kctx->lock, or otherwise made
race-free with respect to the writers?

[Low]
On the BNXT_KCTX_ALLOC_OK gate itself, the check-then-act here is
non-atomic: BNXT_KCTX_ALLOC_OK reads alloc_pending and
bnxt_hwrm_key_ctx_alloc() does a separate atomic_inc_return().  Many
threads can simultaneously observe alloc_pending < 8 and then all
proceed to inc, raising it arbitrarily above the cap.

In addition, the wake_up_all() in bnxt_hwrm_key_ctx_alloc() depends on
its own pending_count (post-inc) being >= BNXT_KCTX_ALLOC_PENDING_MAX:

	pending_count = atomic_inc_return(&kctx->alloc_pending);
	rc = hwrm_req_send(bp, req);
	atomic_dec(&kctx->alloc_pending);
	...
key_alloc_exit_wake:
	if (pending_count >= BNXT_KCTX_ALLOC_PENDING_MAX)
		wake_up_all(&kctx->alloc_pending_wq);

If a burst of callers all happen to record pending_count values < 8 by
each racing past the check, none of them wake the wait queue, and a
waiter parked in wait_event(BNXT_KCTX_ALLOC_OK) only gets released when
some later caller happens to increment to >= 8.

Is this wake-up latency an accepted trade-off, or is the gate intended
to be exact?

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> index e090491006db..ecdf18ba6d83 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_crypto.h

[ ... ]

> +#define BNXT_SET_KID(kctx, kid)						\
> +	((kid) | ((kctx)->epoch << BNXT_KID_EPOCH_SHIFT))

[Low]
kctx->epoch is u8 and is promoted to (signed) int before the shift.
For epoch >= 128, "(int)epoch << 24" sets the sign bit of int, which
is undefined behaviour per C99/C11 6.5.7p4.

In combination with the 8-bit epoch wraparound noted earlier on
bnxt_clear_crypto(), epoch >= 128 is reachable, and a UBSan-instrumented
kernel will splat here.

Would casting epoch to u32 before the shift be acceptable?

> +
> +#define BNXT_KCTX_ALLOC_OK(kctx)	\
> +	(atomic_read(&((kctx)->alloc_pending)) < BNXT_KCTX_ALLOC_PENDING_MAX)
> +

[ ... ]
