public inbox for linux-rdma@vger.kernel.org
From: Christoph Hellwig <hch@infradead.org>
To: Chuck Lever <cel@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>, Christoph Hellwig <hch@lst.de>,
	linux-rdma@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
Date: Tue, 10 Mar 2026 06:42:14 -0700	[thread overview]
Message-ID: <abAftjplHdwdwrkd@infradead.org> (raw)
In-Reply-To: <20260310034621.5799-1-cel@kernel.org>

On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
> Under NFS WRITE workloads the server performs RDMA READs to
> pull data from the client. With the inflated MR demand, the
> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
> a DMA mapping failure, closes the connection, and the client
> reconnects -- producing a cycle of 71% RPC retransmissions and
> ~100 reconnections per test run. RDMA WRITEs (NFS READ
> direction) are unaffected because DMA_TO_DEVICE never triggers
> the max_sgl_rd check.

So this changelog extensively describes the problem, but it doesn't
actually say how you fix it.

>  	/*
> +	 * IOVA not available; map each bvec individually. Do not
> +	 * check max_sgl_rd here: nr_bvec is a raw page count that
> +	 * overstates DMA entry demand and exhausts the MR pool.

It fails to explain why that is fine?

> +	 *
> +	 * TODO: A bulk DMA mapping API for bvecs analogous to
> +	 * dma_map_sgtable() would provide a proper post-DMA-
> +	 * coalescing segment count here, enabling the map_wrs
> +	 * path in more cases.

This isn't really something the DMA layer can easily do without getting
as inefficient as the sgtable based path.  What the block layer does
here is to simply keep a higher level count of merged segments.  The
other option would be to not create multiple bvecs for contiguous
regions, which is what modern file systems do in general, and why the
above block layer nr_phys_segments based optimization isn't actually
used all that much these days.

Why can't NFS send a single bvec for contiguous ranges?
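To make the suggestion concrete: coalescing at build time means that when
the next page is physically adjacent to the tail of the previous bvec, you
extend that entry instead of appending a new one.  A minimal userspace
sketch of that idea (bvec_add_page is a hypothetical helper, and struct
page here is a stand-in carrying just a pfn, not the kernel types or any
existing API):

```c
#include <assert.h>

/* Minimal stand-ins for kernel types; illustrative only. */
struct page { unsigned long pfn; };	/* page frame number */
struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

#define PAGE_SIZE 4096u

/*
 * Append a page to a bvec array, merging with the previous entry when
 * the new page is physically contiguous with its end.  Returns the new
 * entry count.  Hypothetical helper, not an existing kernel function.
 */
static unsigned int bvec_add_page(struct bio_vec *bv, unsigned int nr,
				  struct page *pg, unsigned int len,
				  unsigned int off)
{
	if (nr) {
		struct bio_vec *prev = &bv[nr - 1];
		unsigned int prev_end = prev->bv_offset + prev->bv_len;
		unsigned long next_pfn = prev->bv_page->pfn +
					 prev_end / PAGE_SIZE;

		/* Merge only if prev ends on a page boundary and pg follows. */
		if (prev_end % PAGE_SIZE == 0 && off == 0 &&
		    pg->pfn == next_pfn) {
			prev->bv_len += len;
			return nr;
		}
	}
	bv[nr].bv_page = pg;
	bv[nr].bv_len = len;
	bv[nr].bv_offset = off;
	return nr + 1;
}
```

With this, a contiguous run of sink pages collapses into one bvec entry,
so the entry count the RDMA R/W API sees reflects actual segments rather
than a raw page count.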

> -	if (rdma_rw_io_needs_mr(dev, port_num, dir, nr_bvec))
> -		return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvecs,
> -						nr_bvec, &iter, remote_addr,
> -						rkey, dir);
> -
>  	return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
>  			remote_addr, rkey, dir);

First I thought this breaks iWarp reads, but they are handled earlier,
which might have been useful to mention in the commit log.


Thread overview: 8+ messages
2026-03-10  3:46 [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-10 13:42 ` Christoph Hellwig [this message]
2026-03-10 14:36   ` Chuck Lever
2026-03-10 18:37     ` Leon Romanovsky
2026-03-10 18:49       ` Chuck Lever
2026-03-10 19:31         ` Leon Romanovsky
2026-03-10 19:56           ` [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers Chuck Lever
2026-03-10 20:27             ` Leon Romanovsky
