public inbox for linux-kernel@vger.kernel.org
From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	"David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Jesse Barnes <jesse.barnes@intel.com>,
	"Rafael J. Wysocki" <rjw@sisk.pl>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andreas Schwab <schwab@suse.de>, Len Brown <lenb@kernel.org>
Subject: Re: Reworking suspend-resume sequence (was: Re: PCI PM: Restore standard config registers of all devices early)
Date: Tue, 3 Feb 2009 21:57:27 +0100
Message-ID: <20090203205727.GA4460@elte.hu>
In-Reply-To: <alpine.LFD.2.00.0902031209280.3247@localhost.localdomain>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I wouldn't worry too much. I think this is interesting mostly from a 
> performance standpoint - MSI interrupts are supposed to be fast, and under 
> heavy interrupt load I could easily see something like
> 
>  - cpu1: handles interrupt, has acked it, calls down to the handler
> 
>  - the handler clears the original irq source, but another packet (or disk 
>    completion) happens almost immediately
> 
>  - cpu2 takes the second interrupt, but it's still IRQ_INPROGRESS, so it 
>    masks.
> 
>  - cpu1 gets back and unmasks etc and now really handles it because of 
>    IRQ_PENDING.
> 
> Note how the mask/unmask were all just costly extra overhead over the PCI 
> bus. If we're talking something like high-performance 10Gbit ethernet (or 
> even maybe fast SSD disks), driver writers actually do count PCI cycles, 
> because a single PCI read can be several hundred ns, and if you take a 
> thousand interrupts per second, it does add up.
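
The sequence quoted above can be sketched as a small user-space model. This 
is only an illustration, not the kernel's flow handler: the SIM_* flags and 
the sim_* functions are hypothetical names modelled on IRQ_INPROGRESS and 
IRQ_PENDING, and mask_ops merely counts the mask/unmask bus operations that 
the scenario describes as overhead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags, modelled on IRQ_INPROGRESS / IRQ_PENDING. */
enum {
	SIM_IRQ_INPROGRESS = 1 << 0,
	SIM_IRQ_PENDING    = 1 << 1,
	SIM_IRQ_MASKED     = 1 << 2,
};

struct sim_irq {
	unsigned int status;
	unsigned int handled;	/* how often the device handler ran */
	unsigned int mask_ops;	/* mask + unmask operations on the bus */
};

/*
 * An interrupt arrives on some CPU. Returns true if that CPU should run
 * the handler, false if another CPU is already in the handler and the
 * interrupt was masked and marked pending instead.
 */
bool sim_irq_arrive(struct sim_irq *d)
{
	assert(d != NULL);
	if (d->status & SIM_IRQ_INPROGRESS) {
		d->status |= SIM_IRQ_PENDING | SIM_IRQ_MASKED;
		d->mask_ops++;			/* the mask */
		return false;
	}
	d->status |= SIM_IRQ_INPROGRESS;
	return true;
}

/*
 * The owning CPU finished one handler run. Returns true if it has to run
 * the handler again because an interrupt became pending in the meantime.
 */
bool sim_irq_handler_done(struct sim_irq *d)
{
	assert(d != NULL);
	d->handled++;
	if (d->status & SIM_IRQ_PENDING) {
		d->status &= ~(SIM_IRQ_PENDING | SIM_IRQ_MASKED);
		d->mask_ops++;			/* the unmask */
		return true;
	}
	d->status &= ~SIM_IRQ_INPROGRESS;
	return false;
}
```

Replaying the quoted scenario (cpu1 arrives, cpu2 arrives, cpu1 finishes 
twice) runs the handler twice in this model and costs one mask plus one 
unmask. With both interrupts taken on the same CPU, it costs none.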

In practice MSI (and in particular MSI-X) irq sources tend to be bound to a 
single CPU on modern x86 hardware. The kernel does not do IRQ balancing 
anymore, nor does the hardware. We have a slow irq-balancer daemon 
(irqbalance) in user-space. So singular IRQ sources, especially when they 
are MSI, tend to be 99.9% on the same CPU. Changing affinity is possible and 
has to always work reliably, but it is a performance slowpath.
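
For reference, a hedged sketch of the user-space side of an affinity change: 
the kernel exposes a hexadecimal CPU bitmask per IRQ in 
/proc/irq/<N>/smp_affinity. The helper name below is made up for 
illustration, and it assumes CPUs 0-31 only (a single 32-bit mask word).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the hex bitmask that binds an IRQ to exactly
 * one CPU. Only CPUs 0..31 are handled here.
 * Returns the number of characters written, or -1 on bad input.
 */
int format_single_cpu_mask(char *buf, size_t len, unsigned int cpu)
{
	assert(buf != NULL);
	if (cpu > 31 || len < 9)	/* up to 8 hex digits plus NUL */
		return -1;
	return snprintf(buf, len, "%x", 1u << cpu);
}
```

A privileged process would write the resulting string to 
/proc/irq/<N>/smp_affinity, with N taken from /proc/interrupts. That write 
is the slowpath mentioned above.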

An increasing trend is to have multiple irqs per device (multiple descriptor 
rings, split rx and tx rings with separate irq sources): and each IRQ can 
get balanced to a separate CPU. But those irqs cannot interact on a ->mask() 
level as each IRQ has its separate irq_desc.

The most advanced way of balancing IRQs is not widespread yet: it is where 
devices actually interpret the payload and send completions dynamically to 
differing CPUs - depending on things like the TCP/IP hash value or an 
in-descriptor "target CPU". That way we could get completion on the CPU 
where the work was submitted from. (and where the data structures are the 
most cache-localized)
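
A minimal sketch of that selection logic, under loud assumptions: the struct 
and function names are hypothetical, and FNV-1a is only a stand-in for 
whatever hash a real device would compute over the TCP/IP headers. An 
explicit in-descriptor target CPU wins; otherwise the flow hash picks the 
CPU, so a given flow always completes on the same one.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flow identifier: the TCP/IP 4-tuple. */
struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* FNV-1a over the tuple bytes; a stand-in, not a real device's hash. */
uint32_t flow_hash(const struct flow_key *k)
{
	uint32_t words[3];
	uint32_t h = 2166136261u;
	int i, b;

	assert(k != NULL);
	words[0] = k->saddr;
	words[1] = k->daddr;
	words[2] = ((uint32_t)k->sport << 16) | k->dport;

	for (i = 0; i < 3; i++) {
		for (b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xffu;
			h *= 16777619u;
		}
	}
	return h;
}

/*
 * Pick the CPU that should receive the completion. target_cpu < 0 means
 * the descriptor carried no explicit target, so fall back to the hash.
 */
unsigned int pick_completion_cpu(const struct flow_key *k, int target_cpu,
				 unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	if (target_cpu >= 0 && (unsigned int)target_cpu < nr_cpus)
		return (unsigned int)target_cpu;
	return flow_hash(k) % nr_cpus;
}
```

The explicit target corresponds to the "completion where the work was 
submitted from" case. The hash fallback only guarantees per-flow stability, 
not locality with the submitter.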

That principle works both for networking and for other IO transports - but 
we have little support for it yet. It would work really well for workloads 
where one physical device is shared by many CPUs.

(A lesser method that approximates this is the use of lots of 
submission/completion rings per device and their binding to cpus - but the 
number of rings can never really approach the number of CPUs possible in a 
system.)
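
To illustrate the limitation with a sketch (the modulo binding and both 
function names are assumptions chosen for illustration, not how any 
particular driver does it): once there are more CPUs than rings, several 
CPUs necessarily share a ring, and with it a single IRQ source.

```c
#include <assert.h>

/* Hypothetical static binding: CPU n uses ring n modulo the ring count. */
unsigned int ring_for_cpu(unsigned int cpu, unsigned int nr_rings)
{
	assert(nr_rings > 0);
	return cpu % nr_rings;
}

/* Count how many of nr_cpus CPUs share the given ring under that binding. */
unsigned int cpus_sharing_ring(unsigned int ring, unsigned int nr_cpus,
			       unsigned int nr_rings)
{
	unsigned int cpu, count = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (ring_for_cpu(cpu, nr_rings) == ring)
			count++;
	return count;
}
```

With 8 rings and 8 CPUs every CPU gets its own ring. With 8 rings and 64 
CPUs each ring is shared by 8 CPUs, so completions for 7 of them land on a 
CPU other than the submitter.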

In this most advanced mode of MSI IRQs, if MSI devices had the ability to 
direct IRQs to a specific CPU (they don't have that right now AFAICT), we'd 
run into the overhead scenarios you describe above, and your edge-triggered 
flow would be the most performant one.

	Ingo
