All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Leon Romanovsky <leon@kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH v1] RDMA/core: Fix check_flush_dependency splat on addr_wq
Date: Mon, 29 Aug 2022 15:26:51 -0300	[thread overview]
Message-ID: <Yw0E68fl/FcvUSnO@nvidia.com> (raw)
In-Reply-To: <904E5390-9D6B-4772-AAE6-DBF33F31698C@oracle.com>

On Mon, Aug 29, 2022 at 06:15:28PM +0000, Chuck Lever III wrote:
> 
> 
> > On Aug 29, 2022, at 1:22 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:
> >> 
> >> 
> >>> On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>> 
> >>> On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
> >>>> The connect APIs would be a place to start. In the meantime, though...
> >>>> 
> >>>> Two or three years ago I spent some effort to ensure that closing
> >>>> an RDMA connection leaves a client-side RPC/RDMA transport with no
> >>>> RDMA resources associated with it. It releases the CQs, QP, and all
> >>>> the MRs. That makes initial connect and reconnect both behave exactly
> >>>> the same, and guarantees that a reconnect does not get stuck with
> >>>> an old CQ that is no longer working or a QP that is in TIMEWAIT.
> >>>> 
> >>>> However that does mean that substantial resource allocation is
> >>>> done on every reconnect.
> >>> 
> >>> And if the resource allocations fail then what happens? The storage
> >>> ULP retries forever and is effectively deadlocked?
> >> 
> >> The reconnection attempt fails, and any resources allocated during
> >> that attempt are released. The ULP waits a bit then tries again
> >> until it works or is interrupted.
> >> 
> >> A deadlock might occur if one of those allocations triggers
> >> additional reclaim activity.
> > 
> > No, you are deadlocked now.
> 
> GFP_KERNEL can and will give up eventually, in which case
> the connection attempt fails and any previously allocated
> memory is released. Something else can then make progress.

Something else, maybe for a time, but likely the storage stack is
forever stuck.

> Single page allocation nearly always succeeds. It's the
> larger-order allocations that can block for long periods,
> and that's not necessarily because memory is low -- it can
> happen when one NUMA node's memory is heavily fragmented.

We've done a lot of work in the rdma stack and drivers to avoid
multi-page allocations. But we might need a lot of them, and I'm
skeptical about this claim they always succeed.

> This issue seems to be addressed in the socket stack, so I
> don't believe there's _no_ solution for RDMA. Usually the
> trick is to communicate the memalloc_noio settings somehow
> to other allocating threads.

And how do you do that when the other threads may have already started
their work before a reclaim writeback is triggered? We may already be
blocked inside GFP_KERNEL - heck we may already be inside reclaim from
within one of our own threads!

> If nothing else we can talk with the MM folks about planning
> improvements. We've just gone through this with NFS on the
> socket stack.

I'm not aware of any work in this area.. 
 
> > Even a simple case like mlx5 may cause the NIC to trigger a host
> > memory allocation, which is done in another thread and done as a
> > normal GFP_KERNEL. This memory allocation must progress before a
> > CQ/QP/MR/etc can be created. So now we are deadlocked again.
> 
> That sounds to me like a bug in mlx5. The driver is supposed
> to respect the caller's GFP settings. Again, if the request
> is small, it's likely to succeed anyway, but larger requests
> are not reliable and need to fail quickly so the system can
> move onto other fishing spots.

It is a design artifact, the FW is the one requesting the memory and
it has no idea about kernel GFP flags. As above a FW thread could have
already started requesting memory for some other purpose and we may
already be inside the mlx5 FW page request thread under a GFP_KERNEL
allocation doing reclaim. How can this ever be fixed?

> I would like to at least get rid of the check_flush_dependency
> splat, which will fire a lot more often than we will get stuck
> in a reclaim allocation corner. I'm testing a patch that
> converts rpcrdma not to use MEM_RECLAIM work queues and notes
> how extensive the problem actually is.

Ok

Jason

  reply	other threads:[~2022-08-29 18:26 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-22 15:30 [PATCH v1] RDMA/core: Fix check_flush_dependency splat on addr_wq Chuck Lever
2022-08-23  8:09 ` Leon Romanovsky
2022-08-23 13:58   ` Chuck Lever III
2022-08-24  9:20     ` Leon Romanovsky
2022-08-24 14:09       ` Chuck Lever III
2022-08-26 13:29         ` Jason Gunthorpe
2022-08-26 14:02           ` Chuck Lever III
2022-08-26 14:08             ` Jason Gunthorpe
2022-08-26 19:57               ` Chuck Lever III
2022-08-29 16:45                 ` Jason Gunthorpe
2022-08-29 17:14                   ` Chuck Lever III
2022-08-29 17:22                     ` Jason Gunthorpe
2022-08-29 18:15                       ` Chuck Lever III
2022-08-29 18:26                         ` Jason Gunthorpe [this message]
2022-08-29 19:31                           ` Chuck Lever III

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yw0E68fl/FcvUSnO@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=chuck.lever@oracle.com \
    --cc=leon@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.