public inbox for linux-kernel@vger.kernel.org
From: Jakub Kicinski <kuba@kernel.org>
To: Tariq Toukan <tariqt@nvidia.com>
Cc: Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>, Mark Bloch <mbloch@nvidia.com>,
	<netdev@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, Gal Pressman <gal@nvidia.com>,
	Moshe Shemesh <moshe@nvidia.com>,
	Dragos Tatulea <dtatulea@nvidia.com>
Subject: Re: [PATCH net-next 3/3] net/mlx5e: Speed up channel creation by initializing MKEY entries via UMR WQE
Date: Mon, 23 Mar 2026 17:39:55 -0700	[thread overview]
Message-ID: <20260323173955.39c8c902@kernel.org> (raw)
In-Reply-To: <20260319074338.24265-4-tariqt@nvidia.com>

On Thu, 19 Mar 2026 09:43:38 +0200 Tariq Toukan wrote:
> Initializing all UMR MKEY entries as part of the CREATE_MKEY firmware
> command is relatively slow. Since this operation is performed per RQ,
> the cumulative latency becomes significant with a large number of
> queues.
> 
> Move the entries initialization out of the CREATE_MKEY command and
> perform it in the fast path by posting an appropriate UMR WQE on the
> ICOSQ.
> 
> The UMR WQE is prepared and written to the ICOSQ before activation,
> making it safe without additional locking, as it does not race with NOP
> postings or early NAPI refills.
> 
> Performance results:
> Setup: 248 channels, MTU 9000, RX/TX ring size 8K.

AI says:

> @@ -397,57 +396,61 @@ static u8 mlx5e_mpwrq_access_mode(enum mlx5e_mpwrq_umr_mode umr_mode)
> +static int mlx5e_rq_umr_mkey_data_alloc(struct mlx5e_rq *rq, u32 npages,
> +					struct mlx5_wqe_data_seg *dseg)
> +{
> +	dma_addr_t data_addr;
> +	int data_sz;
> +	void *data;
>  
> -	octwords = mlx5e_mpwrq_umr_octowords(npages, umr_mode);
> +	data_sz = mlx5e_mpwrq_umr_octowords(npages, rq->mpwqe.umr_mode) *
> +		MLX5_OCTWORD;
> +	rq->mpwqe_sp.init_data.p_unaligned =
> +		kzalloc(data_sz + MLX5_UMR_ALIGN - 1, GFP_KERNEL);
Is this allocation safe for large setups?
For setups with large ring sizes like 8K WQEs in OVERSIZED mode, the data size
can reach 256KB to 1MB. Requesting physically contiguous high-order allocations
across many channels during interface bring-up might fail on memory-fragmented
systems.
Should this continue to fall back to vmalloc like the previous kvzalloc code
did?
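
If the fallback is worth keeping, something along the lines of the old kvzalloc() pairing might work; note that dma_map_single() cannot map vmalloc'ed memory, so the mapping path would have to cope as well. Rough, untested sketch (field names taken from the hunk above):

```c
	/* kvzalloc() falls back to vmalloc() for large sizes, avoiding
	 * high-order allocation failures on fragmented systems.  But a
	 * vmalloc'ed buffer is not physically contiguous, so the later
	 * dma_map_single() would have to become a per-page / sg mapping.
	 */
	rq->mpwqe_sp.init_data.p_unaligned =
		kvzalloc(data_sz + MLX5_UMR_ALIGN - 1, GFP_KERNEL);
	if (!rq->mpwqe_sp.init_data.p_unaligned)
		return -ENOMEM;
```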

> @@ -455,48 +458,152 @@ static int mlx5e_create_umr_mkey(struct mlx5_core_dev *mdev,
> +static int mlx5e_rq_umr_mkey_data_init(struct mlx5e_rq *rq, u32 npages)
> +{
> +	struct mlx5_wqe_ctrl_seg      *cseg;
> +	struct mlx5_wqe_umr_ctrl_seg *ucseg;
> +	struct mlx5e_icosq *sq = rq->icosq;
> +	struct mlx5e_umr_wqe *umr_wqe;
> +	u16 pi, num_wqebbs, octowords;
> +	u8 ds_cnt;
> +	int err;
> +
> +	/* + 1 for the data segment */
> +	ds_cnt = 1 + DIV_ROUND_UP(offsetof(struct mlx5e_umr_wqe, dseg),
> +				  MLX5_SEND_WQE_DS);
> +	num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
> +	pi = mlx5e_icosq_get_next_pi(sq, num_wqebbs);
> +	umr_wqe = mlx5_wq_cyc_get_wqe(&sq->wq, pi);
> +	memset(umr_wqe, 0, num_wqebbs * MLX5_SEND_WQE_BB);
> +
> +	cseg = &umr_wqe->hdr.ctrl;
> +	ucseg = &umr_wqe->hdr.uctrl;
> +
> +	cseg->opmod_idx_opcode =
> +		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> +			    MLX5_OPCODE_UMR);
> +	cseg->qpn_ds = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> +				   ds_cnt);
> +	cseg->umr_mkey = rq->mpwqe_sp.umr_mkey_be;
> +
> +	octowords = mlx5e_mpwrq_umr_octowords(npages, rq->mpwqe.umr_mode);
> +	ucseg->xlt_octowords = cpu_to_be16(octowords);
> +	ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> +
> +	err = mlx5e_rq_umr_mkey_data_alloc(rq, npages, umr_wqe->dseg);
> +	if (err)
> +		return err;
> +
> +	mlx5e_rq_umr_mkey_data_fill(rq, npages);

Could this sequence violate DMA coherency rules?
The buffer is mapped to the device with dma_map_single inside
mlx5e_rq_umr_mkey_data_alloc, but then mlx5e_rq_umr_mkey_data_fill writes the
MKEY translation data into the buffer.
On architectures without hardware cache coherency, the CPU writes might
remain in the cache, causing the NIC to read stale memory instead of the
actual MKEY data.
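
One way to make the ordering explicit would be a streaming-DMA sync after the CPU fill (a no-op on cache-coherent platforms); untested sketch, device pointer and size names guessed from the hunk:

```c
	mlx5e_rq_umr_mkey_data_fill(rq, npages);

	/* Flush the CPU writes out of the cache before the NIC can
	 * fetch the translation entries.
	 */
	dma_sync_single_for_device(rq->pdev, data_addr, data_sz,
				   DMA_TO_DEVICE);
```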

> +
> +	sq->db.wqe_info[pi] = (struct mlx5e_icosq_wqe_info) {
> +		.wqe_type   = MLX5E_ICOSQ_WQE_UMR_RX_INIT,
> +		.num_wqebbs = num_wqebbs,
> +		.umr.rq     = rq,
> +	};
> +
> +	sq->pc += num_wqebbs;
> +
> +	sq->doorbell_cseg = cseg;

Can this leave a stale pointer in sq->doorbell_cseg?
If a NOP WQE is later pushed to the ICOSQ but fails to clear this pointer, and
a subsequent allocation error occurs during the first iteration of
mlx5e_post_rx_mpwqes, the error path might use this stale doorbell_cseg.
Modifying an already-executed WQE could desynchronize the queue and stall the
ICOSQ.
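
Clearing the pointer once the doorbell has actually been rung would close that window; a rough, untested sketch:

```c
	/* After ringing the doorbell, drop the reference so that later
	 * error paths cannot re-ring an already-executed WQE.
	 */
	mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, sq->doorbell_cseg);
	sq->doorbell_cseg = NULL;
```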

> @@ -505,7 +612,6 @@ static int mlx5e_create_umr_mkey(struct mlx5_core_dev *mdev,
>  
>  static int mlx5e_create_rq_umr_mkey(struct mlx5_core_dev *mdev, struct mlx5e_rq *rq)
>  {
> -	u32 xsk_chunk_size = rq->xsk_pool ? rq->xsk_pool->chunk_size : 0;
>  	u32 wq_size = mlx5_wq_ll_get_size(&rq->mpwqe.wq);
>  	u32 num_entries, max_num_entries;
>  	u32 umr_mkey;
> @@ -522,9 +628,16 @@ static int mlx5e_create_rq_umr_mkey(struct mlx5_core_dev *mdev, struct mlx5e_rq
>  			      max_num_entries);
>  
>  	err = mlx5e_create_umr_mkey(mdev, num_entries, rq->mpwqe.page_shift,
> -				    &umr_mkey, rq->wqe_overflow.addr,
> -				    rq->mpwqe.umr_mode, xsk_chunk_size);
> +				    &umr_mkey, rq->mpwqe.umr_mode);
> +	if (err)
> +		return err;
> +
>  	rq->mpwqe_sp.umr_mkey_be = cpu_to_be32(umr_mkey);
> +
> +	err = mlx5e_rq_umr_mkey_data_init(rq, num_entries);
> +	if (err)
> +		mlx5_core_destroy_mkey(mdev, umr_mkey);
> +
>  	return err;
>  }

Does this code leak the initialization data and DMA mapping if a failure
happens later during queue creation?
If mlx5e_create_rq_umr_mkey completes successfully, but the broader
mlx5e_alloc_rq fails on a subsequent step like page pool creation, the teardown
seems to miss calling mlx5e_rq_umr_mkey_data_free for the newly allocated
buffer.
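
The unwind in mlx5e_alloc_rq() would presumably need an extra label for this; hypothetical sketch (label names made up):

```c
err_free_mkey_data:
	mlx5e_rq_umr_mkey_data_free(rq);
err_destroy_mkey:
	mlx5_core_destroy_mkey(mdev,
			       be32_to_cpu(rq->mpwqe_sp.umr_mkey_be));
```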

> @@ -1275,8 +1389,11 @@ int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time)
>  	u16 min_wqes = mlx5_min_rx_wqes(rq->wq_type, mlx5e_rqwq_get_size(rq));
>  
>  	do {
> -		if (mlx5e_rqwq_get_cur_sz(rq) >= min_wqes)
> +		if (mlx5e_rqwq_get_cur_sz(rq) >= min_wqes) {
> +			/* memory usage completed, can be freed already */
> +			mlx5e_rq_umr_mkey_data_free(rq);
>  			return 0;
> +		}
>  
>  		msleep(20);
>  	} while (time_before(jiffies, exp_time));

Does this prematurely free the initialization buffer?
The check that cur_sz has reached min_wqes tracks the CPU side (the NAPI
thread posting RX WQEs to the RQ), but it does not appear to guarantee that
the hardware has actually finished executing the asynchronous
MLX5E_ICOSQ_WQE_UMR_RX_INIT WQE on the ICOSQ.
If unmapped and freed while the NIC DMA engine is still fetching the data,
could this result in a DMA use-after-free and potential IOMMU faults?
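
Gating the free on the ICOSQ completion rather than on the RQ fill level would avoid that race; hypothetical sketch, assuming a flag set from the MLX5E_ICOSQ_WQE_UMR_RX_INIT case in mlx5e_poll_ico_cq() (the "umr_init_done" field is made up):

```c
		if (mlx5e_rqwq_get_cur_sz(rq) >= min_wqes) {
			/* Only free once the ICOSQ CQE for the init UMR
			 * WQE has been consumed, i.e. the HW is done
			 * reading the buffer.
			 */
			if (READ_ONCE(rq->mpwqe_sp.init_data.umr_init_done))
				mlx5e_rq_umr_mkey_data_free(rq);
			return 0;
		}
```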

Thread overview: 8+ messages
2026-03-19  7:43 [PATCH net-next 0/3] net/mlx5e: Improve channel creation time via fast-path MKEY initialization Tariq Toukan
2026-03-19  7:43 ` [PATCH net-next 1/3] net/mlx5e: Move RX MPWQE slowpath fields into a separate struct Tariq Toukan
2026-03-20 23:02   ` Joe Damato
2026-03-19  7:43 ` [PATCH net-next 2/3] net/mlx5e: RX, Pre-calculate pad value in MPWQE Tariq Toukan
2026-03-20 23:02   ` Joe Damato
2026-03-19  7:43 ` [PATCH net-next 3/3] net/mlx5e: Speed up channel creation by initializing MKEY entries via UMR WQE Tariq Toukan
2026-03-24  0:39   ` Jakub Kicinski [this message]
2026-03-20 11:56 ` [PATCH net-next 0/3] net/mlx5e: Improve channel creation time via fast-path MKEY initialization Simon Horman
