public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Tariq Toukan <ttoukan.linux@gmail.com>
To: Finn Dayton <Finnius.Dayton@spacex.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	Tariq Toukan <tariqt@nvidia.com>,
	Mark Bloch <mbloch@nvidia.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	Eric Dumazet <edumazet@google.com>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net] net/mlx5e: Precompute xdpsq assignments for mlx5e_xdp_xmit()
Date: Tue, 24 Feb 2026 09:33:57 +0200	[thread overview]
Message-ID: <08e815f1-5c50-4bf9-8289-71c0202ee674@gmail.com> (raw)
In-Reply-To: <5D5DF105-22C6-42C5-A9B9-14C0DF9685EA@spacex.com>



On 24/02/2026 6:32, Finn Dayton wrote:
> 
> 
> On 2/23/26, 5:46 AM, "Tariq Toukan" <ttoukan.linux@gmail.com> wrote:
> 
> On 23/02/2026 2:05, Finn Dayton wrote:
>> mlx5e_xdp_xmit() selects an XDP SQ (Send Queue) using smp_processor_id()
>> (CPU ID). When doing XDP_REDIRECT from a CPU whose ID is
>> >= priv->channels.num, mlx5e_xdp_xmit() returns -ENXIO and the
>> redirect fails.
>>
>> Previous discussion proposed using modulo in mlx5e_xdp_xmit() to map
>> CPU IDs into the channel range, but modulo/division is too costly in
>> the hot path.
>>
> 
> 
> That discussion reached an agreement to use a while loop with
> subtraction. It's expected to be fast, and optimizes the common case.
> 
> 
>> Instead, this solution precomputes per-cpu priv->xdpsq assignments when
>> channels are (re)configured and does a single lookup in mlx5e_xdp_xmit().
>>
>> Because multiple CPUs map to the same xdpsq when CPU count exceeds
>> channel count, serialize xdp_xmit on the ring with xdp_tx_lock.
>>
> 
> 
> What's the advantage of this solution over the one we already agreed to?
> 
> Subtraction in a while loop is great for the common case (cpu_id < channels.num).
> But in the worst case, i.e. CPUs >> XDP SQs, the while-subtraction approach
> is O(cpu_id / channels.num) iterations. With 256 CPUs and 8 channels, cpu_id=255 does
> 31 subtractions; with 2 channels it can be 127 iterations in the mlx5e_xdp_xmit() hot path.
> Precomputing moves the expensive operation out of the hot path into (re)configuration
> time and makes runtime lookups constant. The tradeoff is added per-cpu state that needs
> to be rebuilt whenever channels are reconfigured.
> 
> Given modern systems where CPU count can greatly exceed the number of XDP SQs, constant-
> time lookups avoid a large/variable loop in the hot path.
> 

Valid point.
The while loop solution is significantly simpler.
We can go with this one, but we must make sure we take care of all the
needed synchronization, for safety.

See comments on original mail.

Also, is the performance impact measurable?
Do you see the regression in your tests?

> Thank you,
> Finn
> 
>> Fixes: 58b99ee3e3eb ("net/mlx5e: Add support for XDP_REDIRECT in device-out side")
>> Link: https://lore.kernel.org/netdev/20251031231038.1092673-1-zijianzhang@bytedance.com/
>> Link: https://lore.kernel.org/netdev/44f69955-b566-4fb1-904d-f551046ff2d4@gmail.com
>> Cc: stable@vger.kernel.org # 6.12+
>> Signed-off-by: Finn Dayton <finnius.dayton@spacex.com>
>> ---
>> Testing:
>> - XDP forwarding / XDP_REDIRECT verified with both low CPU ids and
>> CPU ids greater than the number of send queues.
>> - No -ENXIO observed, successful forwarding.
>>
>> drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 +++
>> .../net/ethernet/mellanox/mlx5/core/en/xdp.c | 16 +++++++----
>> .../net/ethernet/mellanox/mlx5/core/en_main.c | 28 +++++++++++++++++++
>> 3 files changed, 43 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> index ea2cd1f5d1d0..387954201640 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> @@ -519,6 +519,8 @@ struct mlx5e_xdpsq {
>> /* control path */
>> struct mlx5_wq_ctrl wq_ctrl;
>> struct mlx5e_channel *channel;
>> + /* serialize writes by multiple CPUs to this send queue */
>> + spinlock_t xdp_tx_lock;
>> } ____cacheline_aligned_in_smp;
>>
>> struct mlx5e_xdp_buff {
>> @@ -909,6 +911,8 @@ struct mlx5e_priv {
>> struct mlx5e_rq drop_rq;
>>
>> struct mlx5e_channels channels;
>> + /* selects the xdpsq during mlx5e_xdp_xmit() */
>> + int __percpu *send_queue_idx_ptr;
>> struct mlx5e_rx_res *rx_res;
>> u32 *tx_rates;
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> index 80f9fc10877a..2dd44ad873a1 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> @@ -845,7 +845,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>> struct mlx5e_priv *priv = netdev_priv(dev);
>> struct mlx5e_xdpsq *sq;
>> int nxmit = 0;
>> - int sq_num;
>> + int send_queue_idx = 0;
>> int i;
>>
>> /* this flag is sufficient, no need to test internal sq state */
>> @@ -855,13 +855,19 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>> if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>> return -EINVAL;
>>
>> - sq_num = smp_processor_id();
>>
>> - if (unlikely(sq_num >= priv->channels.num))
>> + if (unlikely(!priv->send_queue_idx_ptr))
>> return -ENXIO;
>>
>> - sq = priv->channels.c[sq_num]->xdpsq;
>> + send_queue_idx = *this_cpu_ptr(priv->send_queue_idx_ptr);
>> + if (unlikely(send_queue_idx >= priv->channels.num || send_queue_idx < 0))
>> + return -ENXIO;
>>
>> + sq = priv->channels.c[send_queue_idx]->xdpsq;
>> + /* The number of queues configured on a netdev may be smaller than the
>> + * CPU pool, so multiple CPUs may map to the same queue; serialize writes.
>> + */
>> + spin_lock(&sq->xdp_tx_lock);
>> for (i = 0; i < n; i++) {
>> struct mlx5e_xmit_data_frags xdptxdf = {};
>> struct xdp_frame *xdpf = frames[i];
>> @@ -941,7 +947,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>>
>> if (flags & XDP_XMIT_FLUSH)
>> mlx5e_xmit_xdp_doorbell(sq);
>> -
>> + spin_unlock(&sq->xdp_tx_lock);
>> return nxmit;
>> }
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index 7eb691c2a1bd..adef35d06b89 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -1492,6 +1492,7 @@ static int mlx5e_alloc_xdpsq(struct mlx5e_channel *c,
>> sq->pdev = c->pdev;
>> sq->mkey_be = c->mkey_be;
>> sq->channel = c;
>> + spin_lock_init(&sq->xdp_tx_lock);
>> sq->uar_map = c->bfreg->map;
>> sq->min_inline_mode = params->tx_min_inline_mode;
>> sq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu) - ETH_FCS_LEN;
>> @@ -3283,10 +3284,30 @@ static void mlx5e_build_txq_maps(struct mlx5e_priv *priv)
>> smp_wmb();
>> }
>>
>> +static void build_priv_to_xdpsq_associations(struct mlx5e_priv *priv)
>> +{
>> + /*
>> + * Build the mapping from CPU to XDP send queue index for priv.
>> + * This is used by mlx5e_xdp_xmit() to determine which xdpsq (send queue)
>> + * should handle the xdptx data, based on the CPU running mlx5e_xdp_xmit()
>> + * and the target priv (netdev).
>> + */
>> + int send_queue_idx, cpu;
>> +
>> + if (unlikely(priv->channels.num == 0))
>> + return;
>> +
>> + for_each_possible_cpu(cpu) {
>> + send_queue_idx = cpu % priv->channels.num;
>> + *per_cpu_ptr(priv->send_queue_idx_ptr, cpu) = send_queue_idx;
>> + }
>> +}
>> +
>> void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
>> {
>> mlx5e_build_txq_maps(priv);
>> mlx5e_activate_channels(priv, &priv->channels);
>> + build_priv_to_xdpsq_associations(priv);
>> mlx5e_xdp_tx_enable(priv);
>>
>> /* dev_watchdog() wants all TX queues to be started when the carrier is
>> @@ -6263,8 +6284,14 @@ int mlx5e_priv_init(struct mlx5e_priv *priv,
>> if (!priv->fec_ranges)
>> goto err_free_channel_stats;
>>
>> + priv->send_queue_idx_ptr = alloc_percpu(int);
>> + if (!priv->send_queue_idx_ptr)
>> + goto err_free_fec_ranges;
>> +
>> return 0;
>>
>> +err_free_fec_ranges:
>> + kfree(priv->fec_ranges);
>> err_free_channel_stats:
>> kfree(priv->channel_stats);
>> err_free_tx_rates:
>> @@ -6295,6 +6322,7 @@ void mlx5e_priv_cleanup(struct mlx5e_priv *priv)
>> for (i = 0; i < priv->stats_nch; i++)
>> kvfree(priv->channel_stats[i]);
>> kfree(priv->channel_stats);
>> + free_percpu(priv->send_queue_idx_ptr);
>> kfree(priv->tx_rates);
>> kfree(priv->txq2sq_stats);
>> kfree(priv->txq2sq);
> 


  reply	other threads:[~2026-02-24  7:33 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-23  0:05 [PATCH net] net/mlx5e: Precompute xdpsq assignments for mlx5e_xdp_xmit() Finn Dayton
2026-02-23 13:46 ` Tariq Toukan
2026-02-24  4:32   ` Finn Dayton
2026-02-24  7:33     ` Tariq Toukan [this message]
2026-02-24  7:32 ` Tariq Toukan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=08e815f1-5c50-4bf9-8289-71c0202ee674@gmail.com \
    --to=ttoukan.linux@gmail.com \
    --cc=Finnius.Dayton@spacex.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeedm@nvidia.com \
    --cc=sdf@fomichev.me \
    --cc=stable@vger.kernel.org \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox