public inbox for linux-kernel@vger.kernel.org
* Notes from LPC PCI/MSI BoF session
@ 2008-09-22 19:29 Jesse Barnes
  2008-09-24  5:51 ` Grant Grundler
  2008-10-01 15:00 ` Matthew Wilcox
  0 siblings, 2 replies; 7+ messages in thread
From: Jesse Barnes @ 2008-09-22 19:29 UTC (permalink / raw)
  To: linux-pci, linux-kernel

Matthew was kind enough to set up a BoF for those of us interested in PCI and 
MSI issues at this year's LPC.  We went over several issues there:  MSI, PCI 
hotplug, PCIe virtualization, VGA arbitration and PCI address space 
management.

MSI
---
This was probably the biggest topic; we spent almost an hour on it as I 
recall.  The main takeaways were these:

Issue:
Need to add something like James' lost interrupt code so we can better     
detect platforms with buggy MSI support (along with more general     
interrupt routing issues).
Owner:
Me.  I'll be handling this by integrating James' patches into the PCI tree for     
2.6.28.

Issue:
Storage drivers need to use the functionality provided above and print an oops 
message or similar so we can pick it up and blacklist the platform.
Owner:
Matthew (I think?)

Issue:
smp_affinity and MSI are currently incompatible if MSI masking and INTX 
mapping are unavailable.  Need to return -EINVAL from smp_affinity changes in 
this case.
Owner:
Me (happy to receive patches for this though).

Issue:
MSI API improvements.  Need ways to allocate MSIs at runtime, and perhaps on a 
per-CPU level and get affinity information.  And of course we need a way to 
get more than one legacy MSI allocated.
Owner:
Matthew is handling the first pass of this (the more than one legacy MSI 
addition).  I think we'll need some detailed requirements from the driver 
guys to make further improvements (e.g. per-CPU or affinity stuff).

Issue:
Need better MSI platform blacklists.
Owner:
Me.  I'll see if I can gather the list of issues the Fedora guys saw last time 
they tried to enable MSI unconditionally.  We can also improve the list 
through the added debugging described above.

PCI hotplug
-----------
PCI hotplug has seen a lot of churn lately, mainly due to long overdue fixes 
from Alex & Kenji-san.  Now that Kristen is done with LPC related planning 
and implementation, she says she'll have time to review more patches, but she 
also said PCI hotplug doesn't need a dedicated maintainer.  Assuming there 
are no objections, I'll apply a patch to remove the associated entry from 
MAINTAINERS for 2.6.28.

I'm really happy with the level of review we've been getting on hotplug 
related patches these days, I hope we can keep it up.  One of the main 
remaining issues that I can see is adding some better detection code for our 
hotplug slot drivers.  It seems like we're missing something given that we 
see so many systems with duplicate slot IDs; maybe we're not calling the 
right ACPI methods or are calling them wrong somehow?

PCIe virtualization
-------------------
Briefly discussed the SR-IOV patches Yu has been posting recently.  Except for 
a couple of changes requested by Alex & Matthew, it looks like this patch set 
is in pretty good shape.  I'll wait for Matthew & Alex to ack the next set, 
then I'll apply to my linux-next branch.

VGA arbitration
---------------
There's some code for handling legacy VGA routing floating around.  Ben put it 
together a while back, and Tiago Vignatti has been looking after it lately.  
It's a good interface to have for multi-seat configurations that require 
multiple VGA cards to be POSTed; it mainly needs to be re-posted to the 
mailing list for final review at this point (and Ben mentioned one more API 
is needed to allow drivers to exclude themselves from arbitration once they 
no longer need legacy VGA space).

PCI address space management
----------------------------
TJ has a bunch of code to improve address space management in Linux.  We 
talked for a few minutes about this at the BoF; at this point I'm just 
waiting for TJ to post his stuff so we can start integrating it.  Hopefully 
we can start merging small pieces of it (like the multiple PCI gap stuff) for 
2.6.28, and get some more eyes on the more aggressive PCI-DMAR stuff he's 
been talking about soon.

Well that's all I have in the way of notes.  Feel free to add your own if I 
missed anything or correct me if I mischaracterized things.

Thanks,
Jesse
    


* Re: Notes from LPC PCI/MSI BoF session
  2008-09-22 19:29 Notes from LPC PCI/MSI BoF session Jesse Barnes
@ 2008-09-24  5:51 ` Grant Grundler
  2008-09-24  6:47   ` David Miller
  2008-09-24 15:44   ` Matthew Wilcox
  2008-10-01 15:00 ` Matthew Wilcox
  1 sibling, 2 replies; 7+ messages in thread
From: Grant Grundler @ 2008-09-24  5:51 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-pci, linux-kernel

On Mon, Sep 22, 2008 at 12:29:18PM -0700, Jesse Barnes wrote:
> Matthew was kind enough to set up a BoF for those of us interested in PCI and 
> MSI issues at this year's LPC.  We went over several issues there:  MSI, PCI 
> hotplug, PCIe virtualization, VGA arbitration and PCI address space 
> management.

Jesse,
thanks for summarizing and posting this... let me use it as an opportunity
to write up the MSI API proposal I promised.

> Issue:
> MSI API improvements.  Need ways to allocate MSIs at runtime, and perhaps
> on a per-CPU level and get affinity information.  And of course we need
> a way to get more than one legacy MSI allocated.
> Owner:
> Matthew is handling the first pass of this (the more than one legacy MSI 
> addition).  I think we'll need some detailed requirements from the driver 
> guys to make further improvements (e.g. per-CPU or affinity stuff).

Being one of the "driver guys", let me add my thoughts.
For the following discussion, I think we can treat MSI and MSI-X the
same and will just say "MSI". The issue is smp_affinity and how
drivers want to bind MSIs to specific CPUs based on topology/architecture
for optimal performance.

"queue pairs" means command/completion queues.
"multiple queues" means more than one such pair.

The problem is multi-queue capable devices want to bind MSIs
to specific queues. How those queues are bound to each MSI depends
on how the device uses the queues. I can think of three cases:
1) 1:1 mapping between queue pairs and MSI.
2) 1:N mapping of MSI to multiple queues - e.g. different classes of service.
3) N:1 mapping of MSI to a queue pair - e.g. different event types
   (error vs good status). 

"classes of service" could be 1:N or N:1.
"event types" case would typically be 1 command queue with
multiple completion queues and one MSI per completion queue.

Dave Miller (and others) have clearly stated they don't want to see
CPU affinity handled in the device drivers and want irqbalanced
to handle interrupt distribution. The problem with this is irqbalanced
needs to know how each device driver is binding multiple MSIs to its queues.
Some devices could prefer several MSI go to the same processor and
others want each MSI bound to a different "node" (NUMA).

Without any additional API, this means the device driver has to
update irqbalanced for each device it supports. We thought pci_ids.h
was a PITA...that would be trivial compared to maintaining this.

Initially, at the BoF, I proposed "pci_enable_msix_for_nodes()"
to spread MSIs across multiple NUMA nodes by default. CPU cores which
share a cache were my definition of a "NUMA node" for the purpose of
this discussion, but each arch would have to define that. The device
driver would also need an API to map each "node" to a queue pair as well.

In retrospect, I think this API would only work well for smaller systems
and simple 1:1 MSI/queue mappings, and we'd still have to teach
irqbalanced not to touch MSIs which are already "optimally" allocated.

A second solution I thought of later might be for the device driver to
export (sysfs?) to irqbalanced which MSIs the driver instance owns and
how many "domains" those MSIs can serve.  irqbalanced can then write
back into the same (sysfs?) the mapping of MSI to domains and update
the smp_affinity mask for each of those MSI.

The driver could then do a quick reverse-map lookup from CPUs to "domains".
When a process attempts to start an IO, the driver wants to know which
queue pair the IO should be placed on so the completion event will
be handled in the same "domain". The result is IOs could start and complete
on the same (now warm) CPU cache with minimal spinlock bouncing.

I'm not clear on the details right now. I believe this would allow
irqbalanced to manage IRQs in an optimal way without having to
have device-specific code in it. Unfortunately, I'm not in a position
to propose patches due to current work/family commitments. It would
be fun to work on. *sigh*


I suspect the same thing could be implemented without irqbalanced since
I believe process management knows about the same NUMA attributes we care
about here...maybe it's time for PM to start dealing with interrupt
"scheduling" (kthreads like the RT folks want?) as well?

Ok... maybe I should stop before my asbestos underwear is no longer sufficient. :)

hth,
grant


* Re: Notes from LPC PCI/MSI BoF session
  2008-09-24  5:51 ` Grant Grundler
@ 2008-09-24  6:47   ` David Miller
  2008-09-25 15:53     ` Grant Grundler
  2008-09-24 15:44   ` Matthew Wilcox
  1 sibling, 1 reply; 7+ messages in thread
From: David Miller @ 2008-09-24  6:47 UTC (permalink / raw)
  To: grundler; +Cc: jbarnes, linux-pci, linux-kernel

From: Grant Grundler <grundler@parisc-linux.org>
Date: Tue, 23 Sep 2008 23:51:16 -0600

> Dave Miller (and others) have clearly stated they don't want to see
> CPU affinity handled in the device drivers and want irqbalanced
> to handle interrupt distribution. The problem with this is irqbalanced
> needs to know how each device driver is binding multiple MSI to it's queues.
> Some devices could prefer several MSI go to the same processor and
> others want each MSI bound to a different "node" (NUMA).
> 
> Without any additional API, this means the device driver has to
> update irqbalanced for each device it supports. We thought pci_ids.h
> was a PITA...that would be trivial compared to maintaining this.

We just need a consistent naming scheme for the IRQs to disseminate
this information to irqbalanced, then there is one change to irqbalanced
rather than one for each and every driver as you seem to suggest.

Anything that's complicated and takes more than a paragraph or two
to describe is not what we want.


* Re: Notes from LPC PCI/MSI BoF session
  2008-09-24  5:51 ` Grant Grundler
  2008-09-24  6:47   ` David Miller
@ 2008-09-24 15:44   ` Matthew Wilcox
  2008-09-25 16:15     ` Grant Grundler
  1 sibling, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2008-09-24 15:44 UTC (permalink / raw)
  To: Grant Grundler; +Cc: Jesse Barnes, linux-pci, linux-kernel

On Tue, Sep 23, 2008 at 11:51:16PM -0600, Grant Grundler wrote:
> Being one of the "driver guys", let me add my thoughts.
> For the following discussion, I think we can treat MSI and MSI-X the
> same and will just say "MSI".

I really don't think so.  MSI suffers from numerous problems, including,
on x86, the need to have all interrupts targeted at the same CPU.  You
effectively can't reprogram the number of MSIs allocated while the device
is active.  So I would say this discussion applies *only* to MSI-X.

> Dave Miller (and others) have clearly stated they don't want to see
> CPU affinity handled in the device drivers and want irqbalanced
> to handle interrupt distribution. The problem with this is irqbalanced
> needs to know how each device driver is binding multiple MSI to it's queues.
> Some devices could prefer several MSI go to the same processor and
> others want each MSI bound to a different "node" (NUMA).

But that's *policy*.  It's not what the device wants, it's what the
sysadmin wants.

> A second solution I thought of later might be for the device driver to
> export (sysfs?) to irqbalanced which MSIs the driver instance owns and
> how many "domains" those MSIs can serve.  irqbalanced can then write
> back into the same (sysfs?) the mapping of MSI to domains and update
> the smp_affinity mask for each of those MSI.
> 
> The driver could quickly look up the reverse map CPUs to "domains".
> When a process attempts to start an IO, driver wants to know which
> queue pair the IO should be placed on so the completion event will
> be handled in the same "domain". The result is IOs could start/complete
> on the same (now warm) "CPU cache" with minimal spinlock bouncing.
> 
> I'm not clear on details right now. I belive this would allow
> irqbalanced to manage IRQs in an optimal way without having to
> have device specific code in it. Unfortunately, I'm not in a position
> propose patches due to current work/family commitments. It would
> be fun to work on. *sigh*

I think looking at this in terms of MSIs is the wrong level.  The driver
needs to be instructed how many and what type of *queues* to create.
Then allocation of MSIs falls out naturally from that.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: Notes from LPC PCI/MSI BoF session
  2008-09-24  6:47   ` David Miller
@ 2008-09-25 15:53     ` Grant Grundler
  0 siblings, 0 replies; 7+ messages in thread
From: Grant Grundler @ 2008-09-25 15:53 UTC (permalink / raw)
  To: David Miller; +Cc: grundler, jbarnes, linux-pci, linux-kernel

On Tue, Sep 23, 2008 at 11:47:05PM -0700, David Miller wrote:
> From: Grant Grundler <grundler@parisc-linux.org>
> Date: Tue, 23 Sep 2008 23:51:16 -0600
> 
> > Dave Miller (and others) have clearly stated they don't want to see
> > CPU affinity handled in the device drivers and want irqbalanced
> > to handle interrupt distribution. The problem with this is irqbalanced
> > needs to know how each device driver is binding multiple MSI to it's queues.
> > Some devices could prefer several MSI go to the same processor and
> > others want each MSI bound to a different "node" (NUMA).
> > 
> > Without any additional API, this means the device driver has to
> > update irqbalanced for each device it supports. We thought pci_ids.h
> > was a PITA...that would be trivial compared to maintaining this.
> 
> We just need a consistent naming scheme for the IRQs to disseminate
> this information to irqbalanced, then there is one change to irqbalanced
> rather than one for each and every driver as you seem to suggest.

That's sort of what I proposed at the end of my email:
| A second solution I thought of later might be for the device driver to
| export (sysfs?) to irqbalanced which MSIs the driver instance owns and
| how many "domains" those MSIs can serve.  irqbalanced can then write
| back into the same (sysfs?) the mapping of MSI to domains and update
| the smp_affinity mask for each of those MSI.

> Anything that's complicated and takes more than a paragraph or two
> to describe is not what we want.

I agree. The discussion will need more thought, though.

thanks,
grant



* Re: Notes from LPC PCI/MSI BoF session
  2008-09-24 15:44   ` Matthew Wilcox
@ 2008-09-25 16:15     ` Grant Grundler
  0 siblings, 0 replies; 7+ messages in thread
From: Grant Grundler @ 2008-09-25 16:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Grant Grundler, Jesse Barnes, linux-pci, linux-kernel

On Wed, Sep 24, 2008 at 09:44:40AM -0600, Matthew Wilcox wrote:
> On Tue, Sep 23, 2008 at 11:51:16PM -0600, Grant Grundler wrote:
> > Being one of the "driver guys", let me add my thoughts.
> > For the following discussion, I think we can treat MSI and MSI-X the
> > same and will just say "MSI".
> 
> I really don't think so.  MSI suffers from numerous problems, including
> on x86 the need to have all interrupts targetted at the same CPU.  You
> effectively can't reprogram the number of MSI allocated while the device
> is active.  So I would say this discussion applies *only* to MSI-X.

I would entirely agree with this, but we have the "N:1" case that I described
(multiple vectors which by design should target one CPU).
In any case, MSI-X is clearly more interesting for this discussion.

> > Dave Miller (and others) have clearly stated they don't want to see
> > CPU affinity handled in the device drivers and want irqbalanced
> > to handle interrupt distribution. The problem with this is irqbalanced
> > needs to know how each device driver is binding multiple MSI to it's queues.
> > Some devices could prefer several MSI go to the same processor and
> > others want each MSI bound to a different "node" (NUMA).
> 
> But that's *policy*.  It's not what the device wants, it's what the
> sysadmin wants.

That sounds remarkably close to saying the sysadmin has to know about
each device's attributes. If interpreted that way, I'll argue that's
not realistic in 99% of cases and certainly not how sysadmins
want to spend their time (frobbing irqbalanced policy).

> 
> > A second solution I thought of later might be for the device driver to
> > export (sysfs?) to irqbalanced which MSIs the driver instance owns and
> > how many "domains" those MSIs can serve.  irqbalanced can then write
> > back into the same (sysfs?) the mapping of MSI to domains and update
> > the smp_affinity mask for each of those MSI.
> > 
> > The driver could quickly look up the reverse map CPUs to "domains".
> > When a process attempts to start an IO, driver wants to know which
> > queue pair the IO should be placed on so the completion event will
> > be handled in the same "domain". The result is IOs could start/complete
> > on the same (now warm) "CPU cache" with minimal spinlock bouncing.
> > 
> > I'm not clear on details right now. I belive this would allow
> > irqbalanced to manage IRQs in an optimal way without having to
> > have device specific code in it. Unfortunately, I'm not in a position
> > propose patches due to current work/family commitments. It would
> > be fun to work on. *sigh*
> 
> I think looking at this in terms of MSIs is the wrong level.  The driver
> needs to be instructed how many and what type of *queues* to create.
> Then allocation of MSIs falls out naturally from that.

Yes, good point. That's certainly a better approach and could precede
the "second proposal" above: i.e., the driver queries how many domains it
should plan for, sets up that many queues, and requests the same number
of MSI-X vectors.

That still leaves open which code is going to export the queue attribute
information to irqbalanced. My guess is the driver query could provide
a table which could be exported. But it would make more sense to export
it when the MSI-X vectors are allocated, since we want to associate the
attributes with actually allocated vectors.

thanks,
grant


* Re: Notes from LPC PCI/MSI BoF session
  2008-09-22 19:29 Notes from LPC PCI/MSI BoF session Jesse Barnes
  2008-09-24  5:51 ` Grant Grundler
@ 2008-10-01 15:00 ` Matthew Wilcox
  1 sibling, 0 replies; 7+ messages in thread
From: Matthew Wilcox @ 2008-10-01 15:00 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-pci, linux-kernel

On Mon, Sep 22, 2008 at 12:29:18PM -0700, Jesse Barnes wrote:
> PCI address space management
> ----------------------------
> TJ has a bunch of code to improve address space management in Linux.  We 
> talked for a few minutes about this at the BoF; at this point I'm just 
> waiting for TJ to post his stuff so we can start integrating it.  Hopefully 
> we can start merging small pieces of it (like the multiple PCI gap stuff) for 
> 2.6.28, and get some more eyes on the more aggressive PCI-DMAR stuff he's 
> been talking about soon.
> 
> Well that's all I have in the way of notes.  Feel free to add your own if I 
> missed anything or correct me if I mischaracterized things.

One thing that wasn't resolved at the meeting was the question of 64-bit
addressing and PCI-PCI bridges (primarily because nobody had their
PCI-SIG login with them).  I've now downloaded the latest PCI-PCI Bridge
spec and there are still no facilities to forward addresses >4GB (other
than the prefetchable region that was already in PPB 1.1).

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

