Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity

From: Matthew Wilcox <willy@linux.intel.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>,
	ksummit-discuss@lists.linuxfoundation.org,
	linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nvme@lists.infradead.org,
	Keith Busch <keith.busch@intel.com>,
	Bart Van Assche <bart.vanassche@sandisk.com>
Subject: Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
Date: Wed, 15 Jul 2015 14:48:00 -0400
Message-ID: <20150715184800.GL13681@linux.intel.com>
In-Reply-To: <55A697A3.3090305@kernel.dk>

On Wed, Jul 15, 2015 at 11:25:55AM -0600, Jens Axboe wrote:
> On 07/15/2015 11:19 AM, Keith Busch wrote:
> >On Wed, 15 Jul 2015, Bart Van Assche wrote:
> >>* With blk-mq and scsi-mq optimal performance can only be achieved if
> >> the relationship between MSI-X vector and NUMA node does not change
> >> over time. This is necessary to allow a blk-mq/scsi-mq driver to
> >> ensure that interrupts are processed on the same NUMA node as the
> >> node on which the data structures for a communication channel have
> >> been allocated. However, today there is no API that allows
> >> blk-mq/scsi-mq drivers and irqbalanced to exchange information
> >> about the relationship between MSI-X vector ranges and NUMA nodes.
> >
> >We could have low-level drivers provide blk-mq the controller's irq
> >associated with a particular h/w context, and the block layer can provide
> >the context's cpumask to irqbalance with the smp affinity hint.
> >
> >The nvme driver already uses the hwctx cpumask to set hints, but this
> >doesn't seem like it should be a driver responsibility. It currently
> >doesn't work correctly anyway with hot-cpu since blk-mq could rebalance
> >the h/w contexts without syncing with the low-level driver.
> >
> >If we can add this to blk-mq, one additional case to consider is if the
> >same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
> >assignment needs to be aware of this to prevent sharing a vector across
> >NUMA nodes.
> 
> Exactly. I may have promised to do just that at the last LSF/MM conference,
> just haven't done it yet. The point is to share the mask, I'd ideally like
> to take it all the way where the driver just asks for a number of vecs
> through a nice API that takes care of all this. Lots of duplicated code in
> drivers for this these days, and it's a mess.

Yes.  I think the fundamental problem is that our MSI-X API is so funky.
We have this incredibly flexible scheme where each MSI-X vector could
have its own interrupt handler, but that's not what drivers want.
They want to say "Give me eight MSI-X vectors spread across the CPUs,
and use this interrupt handler for all of them".  That is, instead of
the current scheme where each MSI-X vector gets its own Linux interrupt,
we should have one interrupt handler (of the per-cpu interrupt type),
which shows up with N bits set in its CPU mask.
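
To illustrate the request described above ("eight MSI-X vectors spread
across the CPUs"), here is a minimal userspace sketch in C. It is not
kernel code and does not correspond to any existing kernel API: the names
spread_vectors() and vector_for_cpu() are hypothetical, a 64-bit integer
stands in for a cpumask, and the sketch assumes CPUs are numbered
contiguously within each NUMA node, so that handing out contiguous blocks
of CPUs tends to keep each vector on one node.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): map a CPU to the vector that
 * serves it. CPUs are divided into nvec contiguous blocks.
 * Requires 1 <= nvec <= ncpus and cpu < ncpus.
 */
unsigned int vector_for_cpu(unsigned int cpu, unsigned int nvec,
			    unsigned int ncpus)
{
	return cpu * nvec / ncpus;
}

/*
 * Hypothetical helper (illustration only): spread up to nvec vectors
 * across ncpus CPUs (at most 64 here, since a uint64_t is the mask).
 * On return, masks[v] has one bit set for every CPU served by vector v.
 * Returns the number of vectors actually used, which is capped at ncpus
 * so that no vector ends up with an empty mask.
 */
unsigned int spread_vectors(unsigned int nvec, unsigned int ncpus,
			    uint64_t *masks)
{
	unsigned int cpu, v;

	assert(ncpus >= 1 && ncpus <= 64);

	if (nvec == 0)
		return 0;
	if (nvec > ncpus)
		nvec = ncpus;

	for (v = 0; v < nvec; v++)
		masks[v] = 0;

	for (cpu = 0; cpu < ncpus; cpu++)
		masks[vector_for_cpu(cpu, nvec, ncpus)] |= UINT64_C(1) << cpu;

	return nvec;
}
```

With 16 CPUs and 8 requested vectors, vector 0 ends up with mask 0x3
(CPUs 0 and 1) and vector 7 with mask 0xC000 (CPUs 14 and 15). A single
shared handler could then call vector_for_cpu() on the CPU it is running
on to find the queue it should service.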
