From: Jason Gunthorpe <jgg@ziepe.ca>
To: Christoph Hellwig <hch@lst.de>
Cc: Leon Romanovsky <leon@kernel.org>,
Sagi Grimberg <sagi@grimberg.me>,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
Keith Busch <kbusch@kernel.org>,
Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>,
linux-rdma@vger.kernel.org
Subject: Re: [PATCH] blk-mq-rdma: remove queue mapping helper for rdma devices
Date: Sun, 26 Mar 2023 21:56:48 -0300
Message-ID: <ZCDp0KY9ISj9haV8@ziepe.ca>
In-Reply-To: <20230326231622.GA19436@lst.de>
On Mon, Mar 27, 2023 at 01:16:22AM +0200, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 10:03:25AM -0300, Jason Gunthorpe wrote:
> > > > > Given that nvme-rdma was the only consumer, do you prefer this goes from
> > > > > the nvme tree?
> > > >
> > > > Sure, it is probably fine
> > >
> > > I tried to do it two+ years ago:
> > > https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org
> >
> > Christoph's points make sense, but I think we should still purge this
> > code.
>
> Given that we don't keep dead code around in the kernel as a matter
> of policy, we should probably remove it. That being said, I'm really
> sad about this, as I think what the RDMA code does here right now is
> pretty broken.
I don't know nvme well, but any affinity scheme that relies on using
/proc/interrupts to set the affinity of queues is incompatible with
RDMA HW and more broadly incompatible with mlx5 HW.
The fundamental issue is that this class of HW can have more queues
than there are MSI interrupts available. To make this work it has to
mux N queues onto each MSI interrupt.
So, I think the only way it can really make sense is if the MSI
interrupt is per-cpu and the muxing is adjusted according to affinity
and hotplug needs.
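To sketch the shape of the problem (made-up code, not from any real
driver): once nr_queues > nr_vectors the driver ends up folding
several queues onto each vector, and steering the vector from
userspace moves all of those queues at once:

/*
 * Made-up illustration: with more completion queues than MSI vectors
 * the driver has to mux several queues onto each vector, e.g. a
 * simple round-robin.  Changing the vector's affinity then moves
 * every queue hanging off it, so per-queue placement cannot be
 * expressed through the interrupt at all.
 */
static unsigned int queue_to_vector(unsigned int qid,
                                    unsigned int nr_vectors)
{
        return qid % nr_vectors;
}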
Thus, we'd need to see a scheme where something like nvme-cli directs
the affinity on the queue so it can flow down that way, as there is no
other obvious API that can manipulate a queue multiplexed onto an MSI.
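To be clear I mean something totally made up, e.g.

  nvme connect ... --queue-cpus=0-3,8-11

(no such option exists today), where the user states the placement and
the transport pushes it down to each queue/CQ.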
IIRC netdev sort of has this already: you can set the number of
queues, and by default the queues are laid out according to CPU
number. So requesting N-4 queues effectively reserves the last 4 CPUs,
for the isolation sort of thinking.
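Concretely that's roughly the ethtool -L model, e.g. something like

  ethtool -L eth0 combined $(($(nproc) - 4))

(eth0 just as an example name) if you want the last four CPUs left
without a NIC queue.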
> > If we want to do proper managed affinity the right RDMA API is to
> > directly ask for the desired CPU binding when creating the CQ, and
> > optionally a way to change the CPU binding of the CQ at runtime.
>
> Changing the bindings causes a lot of nasty interactions with CPU
> hotplug. The managed affinity and the way blk-mq interacts with it
> is designed around the hotunplug notifier quiescing the queues,
> and I'm not sure we can get everything right without a strict
> binding to a set of CPUs.
Yeah, I think the netdev version is that the queues can't change until
the netdev is brought down and the queues destroyed. So from that view
what you want is a tunable, provided at creation/startup time, for the
CPU mask you'd like the logical device to occupy.
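To make that concrete, I'm picturing something along these lines
(completely hypothetical sketch; today's ib_alloc_cq() only takes a
comp_vector and none of these names exist):

/*
 * Hypothetical, not a real verbs API: the ULP hands in the cpumask it
 * wants completion work to run on when the CQ is created, and the
 * core/driver picks (or muxes) a vector to match.  Re-binding later
 * would be a separate, optional call.
 */
struct ib_cq *ib_alloc_cq_affine(struct ib_device *dev, void *private,
                                 int nr_cqe, const struct cpumask *cpus,
                                 enum ib_poll_context poll_ctx);

int ib_cq_rebind_cpus(struct ib_cq *cq, const struct cpumask *cpus);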
I can imagine people isolating containers to certain CPU cores and
then having netdevs and nvmef devices that were created only for that
container wanting them to follow the same CPU assignment.
So affinity at creation time does cover a good whack of the use cases.
Jason