From: Jakub Kicinski <kuba@kernel.org>
To: michael.chan@broadcom.com
Cc: Jakub Kicinski <kuba@kernel.org>,
	davem@davemloft.net, netdev@vger.kernel.org, edumazet@google.com,
	pabeni@redhat.com, andrew+netdev@lunn.ch,
	pavan.chebbi@broadcom.com, andrew.gospodarek@broadcom.com
Subject: Re: [PATCH net-next v2 13/15] bnxt_en: Implement kTLS TX normal path
Date: Fri, 15 May 2026 18:10:23 -0700	[thread overview]
Message-ID: <20260516011023.1857772-1-kuba@kernel.org> (raw)
In-Reply-To: <20260512212105.3488258-14-michael.chan@broadcom.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: bnxt_en: Implement kTLS TX normal path

This patch offloads TLS encryption of TX packets to hardware when the
TCP sequence number matches what the driver expects, otherwise falling
back to software encryption.  It also adds the bnxt_ktls_xmit() helper
and basic kTLS counters in ethtool -S.  A few questions inline below.

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index e26980814fe7..7cae7e3024ab 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -488,6 +488,7 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct bnxt_sw_tx_bd *tx_buf;
>  	__le32 lflags = 0;
>  	skb_frag_t *frag;
> +	u32 kid = 0;
>  
>  	i = skb_get_queue_mapping(skb);
>  	if (unlikely(i >= bp->tx_nr_rings)) {
> @@ -527,6 +528,10 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  			return NETDEV_TX_BUSY;
>  	}
>  
> +	skb = bnxt_ktls_xmit(bp, txr, skb, &lflags, &kid);
> +	if (unlikely(!skb))
> +		return NETDEV_TX_OK;
> +

[Low]
Should this drop path also bump the device tx_dropped counter?  The
adjacent drop paths in bnxt_start_xmit() (queue out of range, the
skb_linearize() failure, and tx_kick_pending) all call
dev_core_stats_tx_dropped_inc(dev) before returning NETDEV_TX_OK, but
this new branch returns silently.  tls_sw_fallback() has already
freed the original skb when it returns NULL (kmalloc(GFP_ATOMIC) of
sg_in, alloc_skb() of nskb, or fill_sg_in() can each fail), so these
drops would otherwise be invisible in tx_dropped — exactly the
counter that operators look at when memory pressure causes TLS fallback to
fail.
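
For illustration, a minimal version of that (untested, against the
hunk above), matching the neighbouring drop paths:

    skb = bnxt_ktls_xmit(bp, txr, skb, &lflags, &kid);
    if (unlikely(!skb)) {
        dev_core_stats_tx_dropped_inc(dev);
        return NETDEV_TX_OK;
    }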

>  	length = skb->len;
>  	len = skb_headlen(skb);
>  	last_frag = skb_shinfo(skb)->nr_frags;
> @@ -675,7 +680,7 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	prod = NEXT_TX(prod);
>  	txbd1 = bnxt_init_ext_bd(bp, txr, prod, lflags, vlan_tag_flags,
> -				 cfa_action);
> +				 cfa_action, kid);
>  
>  	if (skb_is_gso(skb)) {
>  		bool udp_gso = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4);
> @@ -698,7 +703,8 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  					TX_BD_FLAGS_T_IPID |
>  					(hdr_len << (TX_BD_HSIZE_SHIFT - 1)));
>  		length = skb_shinfo(skb)->gso_size;
> -		txbd1->tx_bd_mss = cpu_to_le32(length);
> +		txbd1->tx_bd_kid_mss = cpu_to_le32(BNXT_TX_KID_HI(kid) |
> +						   length);

[Low]
Should length be masked with TX_BD_MSS here?  The patch redefines

    #define TX_BD_MSS                                       0x7fff
    #define TX_BD_KID_HI                                    (0x1ffff << 15)

so the MSS field is now only the low 15 bits of the word, with the KID
HI bits sharing the same 32-bit slot.  BNXT_TX_KID_LO/HI both apply
defensive masks (TX_BD_KID_LO_MASK, TX_BD_KID_HI_MASK) but the
length term in this OR does not.  If gso_size ever has bit 15 or
higher set, those bits spill into KID_HI and corrupt the crypto
context lookup on a TLS packet.  Would something like

    txbd1->tx_bd_kid_mss = cpu_to_le32(BNXT_TX_KID_HI(kid) |
                                       (length & TX_BD_MSS));

or a BUILD_BUG_ON-style guard be appropriate?
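
A true BUILD_BUG_ON doesn't quite fit since gso_size is a runtime
value, but a one-time runtime warning on top of the mask would at
least make an overflow visible.  Rough, untested sketch:

    if (unlikely(length & ~TX_BD_MSS))
        netdev_warn_once(dev, "gso_size %u overflows TX_BD_MSS\n",
                         length);
    txbd1->tx_bd_kid_mss = cpu_to_le32(BNXT_TX_KID_HI(kid) |
                                       (length & TX_BD_MSS));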

>  		length += hdr_len;
>  	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
>  		txbd1->tx_bd_hsize_lflags |=
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index a947b9420a7a..ab3a86634a20 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -101,10 +101,18 @@ struct tx_bd_ext {
>  	#define TX_BD_FLAGS_LSO					(1 << 5)
>  	#define TX_BD_FLAGS_IPID_FMT				(1 << 6)
>  	#define TX_BD_FLAGS_T_IPID				(1 << 7)
> +	#define TX_BD_FLAGS_CRYPTO_EN				(1 << 15)
>  	#define TX_BD_HSIZE					(0xff << 16)
>  	 #define TX_BD_HSIZE_SHIFT				 16

[Low]
Does placing TX_BD_FLAGS_CRYPTO_EN at bit 15 collide with the LSO
hdr_len encoding in bnxt_start_xmit()?  The LSO path writes:

    txbd1->tx_bd_hsize_lflags |= cpu_to_le32(TX_BD_FLAGS_LSO |
                                TX_BD_FLAGS_T_IPID |
                                (hdr_len << (TX_BD_HSIZE_SHIFT - 1)));

with TX_BD_HSIZE_SHIFT == 16, so this is hdr_len << 15.  The LSB of
hdr_len lands directly on bit 15 — the new CRYPTO_EN flag.  Today all
hdr_len sources are even (TCP doff*4, IPv4 base 20, IPv6 40, UDP 8,
Ethernet 14, VLAN 4) so bit 0 is always 0, but the encoding contract
is now fragile — any future encapsulation that yields an odd hdr_len
or a refactor that changes the shift would silently set CRYPTO_EN
with kid=0 on a non-TLS LSO segment.  Would encoding as
((hdr_len >> 1) << TX_BD_HSIZE_SHIFT) (or masking with TX_BD_HSIZE)
make this less prone to future breakage?
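
Combining the two variants above, something like this (untested) keeps
bit 15 clear regardless of hdr_len parity and also bounds the HSIZE
field:

    txbd1->tx_bd_hsize_lflags |= cpu_to_le32(TX_BD_FLAGS_LSO |
                                TX_BD_FLAGS_T_IPID |
                                (((hdr_len >> 1) << TX_BD_HSIZE_SHIFT) &
                                 TX_BD_HSIZE));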

> -
> -	__le32 tx_bd_mss;
> +	#define TX_BD_KID_LO					(0x7f << 25)
> +	 #define TX_BD_KID_LO_MASK				 0x7f
> +	 #define TX_BD_KID_LO_SHIFT				 25
> +
> +	__le32 tx_bd_kid_mss;
> +	#define TX_BD_MSS					0x7fff
> +	#define TX_BD_KID_HI					(0x1ffff << 15)
> +	 #define TX_BD_KID_HI_MASK				 0xffff80
> +	 #define TX_BD_KID_HI_SHIFT				 8

[ ... ]

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> index 11cb1b841359..66b323e94140 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> @@ -653,6 +682,16 @@ static void bnxt_get_ethtool_stats(struct net_device *dev,
>  	for (i = 0; i < BNXT_NUM_RING_DRV_STATS; i++, j++, curr++, prev++)
>  		buf[j] = *curr + *prev;
>  
> +	if (bp->ktls_info) {
> +		struct bnxt_tls_info *ktls = bp->ktls_info;
> +		struct bnxt_tls_sw_stats tls_stats = {};
> +
> +		bnxt_get_ring_tls_stats(bp, &tls_stats);
> +		for (i = 0; i < ARRAY_SIZE(bnxt_ktls_data_stats); i++, j++)
> +			buf[j] = tls_stats.counters[i];
> +		for (i = 0; i < ARRAY_SIZE(bnxt_ktls_ctrl_stats); i++, j++)
> +			buf[j] = atomic64_read(&ktls->counters[i]);
> +	}

[High]
Can this block dereference a freed bp->bnapi when ethtool -S runs on
a closed interface?  The function starts with:

    if (!bp->bnapi) {
        j += bnxt_get_num_ring_stats(bp);
        goto skip_ring_stats;
    }

skip_ring_stats:
    [ ... existing ring drv stats accumulator ... ]

    if (bp->ktls_info) {
        ...
        bnxt_get_ring_tls_stats(bp, &tls_stats);

bp->bnapi is freed and cleared in bnxt_free_mem() during interface
close, but bp->ktls_info has a longer lifetime — it is allocated at
probe and freed only at remove.  The new ktls block sits after the
skip_ring_stats label, so it executes even when the !bp->bnapi guard
fired.  bnxt_get_ring_tls_stats() then loops bp->cp_nr_rings (which
is not zeroed on close) and dereferences bp->bnapi[i]:

    for (i = 0; i < bp->cp_nr_rings; i++) {
        ring_stats = &bp->bnapi[i]->cp_ring.sw_stats->tls;

Should the ktls block be guarded by the same bp->bnapi check, and/or
should bnxt_get_ring_tls_stats() bail out when bp->bnapi is NULL?
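
A minimal fix might be to have the helper itself bail out, which also
keeps the buf[] index accounting in bnxt_get_ethtool_stats() intact
(tls_stats is zero-initialised and the ctrl counters only touch
bp->ktls_info, not bp->bnapi):

    if (!bp->ktls_info || !bp->bnapi)
        return;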

>  	if (bp->flags & BNXT_FLAG_PORT_STATS) {
>  		u64 *port_stats = bp->port_stats.sw_stats;
>  
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ktls.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ktls.c
> index ac5d927ded1c..263b075af621 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ktls.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ktls.c
> @@ -287,3 +288,51 @@ int bnxt_ktls_init(struct bnxt *bp)
>  	dev->features |= NETIF_F_HW_TLS_TX;
>  	return 0;
>  }
> +
> +struct sk_buff *bnxt_ktls_xmit(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
> +			       struct sk_buff *skb, __le32 *lflags, u32 *kid)
> +{
> +	struct bnxt_sw_stats *sw_stats = txr->tx_cpr->sw_stats;
> +	struct bnxt_tls_info *ktls = bp->ktls_info;
> +	struct bnxt_ktls_offload_ctx_tx *kctx_tx;
> +	struct tls_context *tls_ctx;
> +	u32 seq, payload_len;
> +
> +	if (!IS_ENABLED(CONFIG_TLS_DEVICE) || !ktls ||
> +	    !tls_is_skb_tx_device_offloaded(skb))
> +		return skb;
> +
> +	seq = ntohl(tcp_hdr(skb)->seq);
> +	tls_ctx = tls_get_ctx(skb->sk);
> +	kctx_tx = __tls_driver_ctx(tls_ctx, TLS_OFFLOAD_CTX_DIR_TX);
> +	payload_len = skb->len - skb_tcp_all_headers(skb);
> +	if (!payload_len)
> +		return skb;
> +	if (kctx_tx->tcp_seq_no == seq) {
> +		kctx_tx->tcp_seq_no += payload_len;
> +		*kid = BNXT_KID_HW(kctx_tx->kid);
> +		*lflags |= cpu_to_le32(TX_BD_FLAGS_CRYPTO_EN |
> +				       BNXT_TX_KID_LO(*kid));
> +		sw_stats->tls.counters[BNXT_KTLS_TX_PKTS]++;
> +		sw_stats->tls.counters[BNXT_KTLS_TX_BYTES] += payload_len;

[High]
Can advancing kctx_tx->tcp_seq_no here desynchronise driver and HW
state if the packet is dropped later in bnxt_start_xmit()?  The
sequence and stats are bumped before the BD is built, but the caller
can still drop the skb after this returns — skb_pad() failure goes to
tx_kick_pending, dma_map_single() failure to tx_free, the
bnxt_lhint_arr oversize check and skb_frag_dma_map() failure to
tx_dma_error.  On any of those drops the BD never reaches HW, so the
HW-side TCP sequence variable remains at the old position, while the
driver-tracked kctx_tx->tcp_seq_no has already advanced by
payload_len.

The next aligned TLS data send will then hit:

    if (kctx_tx->tcp_seq_no == seq) {
        ...
        *lflags |= cpu_to_le32(TX_BD_FLAGS_CRYPTO_EN | ...);

with seq matching the inflated tracker, so the driver flags it as
in-order and HW encrypts from a stale TLS record state.  The follow-up
patch that adds bnxt_ktls_tx_ooo() handles seq < kctx_tx->tcp_seq_no
via tls_get_record(), but does not appear to detect this driver-ahead
case.

Would it be safer to defer the kctx_tx->tcp_seq_no advance and the
counter updates until after the BD is committed (near the doorbell),
or to roll them back on the failure paths?
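
For the rollback flavor, roughly something like the helper below could
work (purely illustrative -- bnxt_ktls_xmit_abort() is a hypothetical
name, and the caller would need to invoke it on the tx_dma_error /
tx_free / tx_kick_pending paths only when CRYPTO_EN was actually set
in lflags):

    static void bnxt_ktls_xmit_abort(struct bnxt *bp, struct sk_buff *skb)
    {
        struct bnxt_ktls_offload_ctx_tx *kctx_tx;
        struct tls_context *tls_ctx;

        if (!bp->ktls_info || !tls_is_skb_tx_device_offloaded(skb))
            return;

        tls_ctx = tls_get_ctx(skb->sk);
        kctx_tx = __tls_driver_ctx(tls_ctx, TLS_OFFLOAD_CTX_DIR_TX);
        /* undo the advance made in bnxt_ktls_xmit() for a BD that
         * never reached the hardware
         */
        kctx_tx->tcp_seq_no -= skb->len - skb_tcp_all_headers(skb);
    }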

[Medium]
Are these u64 increments safe against concurrent xmit on different TX
queues that share the same sw_stats?  bnxt_alloc_cp_rings() aliases
all sub-CP rings of a bnapi:

    cpr2->sw_stats = cpr->sw_stats;

so multiple bnxt_tx_ring_info entries in one bnapi have distinct
txr->tx_cpr but the same txr->tx_cpr->sw_stats.  Different netdev TX
queues hold different __netif_tx_lock instances, so two CPUs can
transmit concurrently on TX queues mapped to the same bnapi and race
on these counters with plain ++ and +=.  The reader
bnxt_get_ring_tls_stats() also reads them from process context with
no synchronisation, and on 32-bit the BNXT_KTLS_TX_BYTES u64 update
can tear on both the read and write side.  The pre-existing
bnxt_sw_stats fields are only written from a single NAPI context per
bnapi, which is why the
aliasing was previously safe; the new TLS counters introduce a new
multi-writer pattern.

Would u64_stats_sync (or per-CPU + SNMP_ADD_STATS64-style helpers) be
more appropriate here, in line with the networking guidance for
shared per-bnapi 64-bit counters?
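
As a rough illustration of what I mean -- struct and field names below
are hypothetical, not from this series -- one option is to make the TX
TLS counters per TX ring (single writer, serialised by that queue's
__netif_tx_lock) and use u64_stats_sync only to protect the reader
against tearing on 32-bit:

    /* needs <linux/u64_stats_sync.h>; hypothetical per-TX-ring layout */
    struct bnxt_tls_ring_stats {
        struct u64_stats_sync   syncp;
        u64                     tx_pkts;
        u64                     tx_bytes;
    };

    /* writer, in bnxt_ktls_xmit(), one instance per TX ring */
    u64_stats_update_begin(&txr->tls_stats.syncp);
    txr->tls_stats.tx_pkts++;
    txr->tls_stats.tx_bytes += payload_len;
    u64_stats_update_end(&txr->tls_stats.syncp);

    /* reader, in bnxt_get_ring_tls_stats(), process context */
    do {
        start = u64_stats_fetch_begin(&txr->tls_stats.syncp);
        pkts = txr->tls_stats.tx_pkts;
        bytes = txr->tls_stats.tx_bytes;
    } while (u64_stats_fetch_retry(&txr->tls_stats.syncp, start));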

> +	} else {
> +		skb = tls_encrypt_skb(skb);
> +		if (!skb)
> +			return NULL;
> +	}
> +	return skb;
> +}
> +
> +void bnxt_get_ring_tls_stats(struct bnxt *bp, struct bnxt_tls_sw_stats *stats)
> +{
> +	struct bnxt_tls_sw_stats *ring_stats;
> +	int i, j;
> +
> +	if (!bp->ktls_info)
> +		return;
> +	for (i = 0; i < bp->cp_nr_rings; i++) {
> +		ring_stats = &bp->bnapi[i]->cp_ring.sw_stats->tls;
> +		for (j = 0; j < BNXT_KTLS_MAX_DATA_COUNTERS; j++)
> +			stats->counters[j] += ring_stats->counters[j];
> +	}
> +}

[ ... ]

