public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: + msi-refactor-and-move-the-msi-irq_chip-into-the-arch-code.patch added to -mm tree
       [not found] <200609272215.k8RMFbuH018967@shell0.pdx.osdl.net>
@ 2006-09-27 22:50 ` David Miller
  2006-09-27 23:09   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 4+ messages in thread
From: David Miller @ 2006-09-27 22:50 UTC (permalink / raw)
  To: linux-kernel, akpm; +Cc: ebiederm, ak, benh, greg, mingo, tglx, tony.luck

From: akpm@osdl.org
Date: Wed, 27 Sep 2006 15:15:37 -0700

> Subject: msi: refactor and move the msi irq_chip into the arch code
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> It turns out msi_ops was simply not enough to abstract the architecture
> specific details of msi.  So I have moved the resposibility of constructing
> the struct irq_chip to the architectures, and have two architecture specific
> functions arch_setup_msi_irq, and arch_teardown_msi_irq.
> 
> For simple architectures those functions can do all of the work.  For
> architectures with platform dependencies they can call into the appropriate
> platform code.
> 
> With this msi.c is finally free of assuming you have an apic, and this
> actually takes less code.
> 
> The helpers for the architecture specific code are declared in the linux/msi.h
> to keep them separate from the msi functions used by drivers in linux/pci.h
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Andi Kleen <ak@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Greg KH <greg@kroah.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Andrew Morton <akpm@osdl.org>

Eric, thanks so much for doing this work.

Once this goes in I'll try to add support for MSI on sparc64
Niagara boxes.  I suppose the PowerPC folks can make use of
this as well.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: + msi-refactor-and-move-the-msi-irq_chip-into-the-arch-code.patch added to -mm tree
  2006-09-27 22:50 ` + msi-refactor-and-move-the-msi-irq_chip-into-the-arch-code.patch added to -mm tree David Miller
@ 2006-09-27 23:09   ` Benjamin Herrenschmidt
  2006-09-28  5:50     ` Eric W. Biederman
  0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-27 23:09 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, akpm, ebiederm, ak, greg, mingo, tglx, tony.luck,
	Michael Ellerman


> Eric, thanks so much for doing this work.
> 
> Once this goes in I'll try to add support for MSI on sparc64
> Niagara boxes.  I suppose the PowerPC folks can make use of
> this as well.

Well, it's definitely a great step in the right direction. But I have
some issues with one basic element of the design which is the allocation
of linux irqs being made generic which I can't see how it can apply to
us in any shape or form unfortunately without yet another significant
rework of the ppc interrupt management code (ugh).

We have our allocation mecanism for interrupts, which involves binding
them to a controller among other things. 

Unfortunately, we actually wrote something for MSIs, but Michael didn't
finish yet and didn't post it to lkml yet. We kept the idea of msi_ops
though that isn't necessary (we could hide them in the arch if
necessary) because we need different mecanisms for different busses in
the same machine. We had a pci_get_msi_ops(pdev) provided by the arch
returning the ops for a given device, and the ops callbacks we had were
along the lines of:

 - alloc: allocates linux irqs for MSIs, (on powerpc, that includes
binds them to a controller) and setup the irq_desc / irq_chip for them.
This got passed the array of MSI(X) from the pci_enable_msi/x call (we
use a one entry array for MSI-nonX iirc).

 - setup: returns for one given MSI the address/data to write to the
device. Could be better named I agree as it doesn't actually setup
anything, maybe get_data ?

 - free.

The important thing here is that alloc is arch specific. We don't
absolutely need the msi_ops mecanism exposed generically, it could be
instead something like pcibios_alloc_msis() and pcibios_setup_msi()
etc... and internally, powerpc implementation could use per-bus function
pointers, but the base idea is that allocation is something that is left
to the platform.

Also, another important thing we have to do to be compatible with our
hypervisor based platforms is to have the option of setup being NULL in
the ops (which could also be done differently using a special result
code from alloc). In this case, alloc does everything and is essentially
a re-implementation at the top level of pci_enable_msi(x). We need that
because on IBM PAPR at least, the hypervisor does everything: allocating
hardware vectors, configuring the device, etc... so all the platform
code does is to allocate linux irq numbers and bind them to the vectors
returned by HV.

There are additional "nits" with thus patch as I see it. He puts the
msi_desc in the irq_desc->chip_data. That cannot work for us. MSIs can
be routed in hardware to various chips like the MPIC, the XICS, etc...
and thus their irq_chip (as setup by our alloc callback) will be our
normal MPIC or XICS irq_chips... those controller implementations
already use irq_desc->chip_data for their own use and having the MSI
layer hijack it will just blow things up. (Typically we can have
multiple PIC instances and we put the instance pointer in there).

I think the msi_desc (or whatever data structure that holds the state of
the MSI(X), backup LSI, etc... for one device) shall be hanging off
struct pci_dev instead.

Michael, can you post your current patch somewhere where it can be seen
as an example of our approach ? Even if it is not complete yet in the
sense that we haven't tested it with a backend yet, we were in the
process of doing so...

It should certainly be possible to modify Eric approach to fit our
needs, but that probably involves not having the irq allocation mecanism
made generic.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: + msi-refactor-and-move-the-msi-irq_chip-into-the-arch-code.patch added to -mm tree
  2006-09-27 23:09   ` Benjamin Herrenschmidt
@ 2006-09-28  5:50     ` Eric W. Biederman
  2006-09-28  6:16       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 4+ messages in thread
From: Eric W. Biederman @ 2006-09-28  5:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Miller, linux-kernel, akpm, ebiederm, ak, greg, mingo, tglx,
	tony.luck, Michael Ellerman

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> Eric, thanks so much for doing this work.
>> 
>> Once this goes in I'll try to add support for MSI on sparc64
>> Niagara boxes.  I suppose the PowerPC folks can make use of
>> this as well.
>
> Well, it's definitely a great step in the right direction. But I have
> some issues with one basic element of the design which is the allocation
> of linux irqs being made generic which I can't see how it can apply to
> us in any shape or form unfortunately without yet another significant
> rework of the ppc interrupt management code (ugh).

So for those architectures where there are not weird ties I think
it makes sense.  One function to allocate a linux irq number and
another function to use it for something.  I just looked and it looks
relatively easy to refactor that part of the code so we do the work
with one architecture call and not too.  Both calls always happen
within close proximity to each other.  So that fact I use two
functions appears to be incidental instead of fundamental. 

> We have our allocation mecanism for interrupts, which involves binding
> them to a controller among other things. 
>
> Unfortunately, we actually wrote something for MSIs, but Michael didn't
> finish yet and didn't post it to lkml yet. We kept the idea of msi_ops
> though that isn't necessary (we could hide them in the arch if
> necessary) because we need different mecanisms for different busses in
> the same machine. We had a pci_get_msi_ops(pdev) provided by the arch
> returning the ops for a given device, and the ops callbacks we had were
> along the lines of:

My apologies for the conflict.  I had hoped to catch you at OLS so
I could get a better understanding of how things look in ppc land but
we never ran into each other.  I also posted a precursor to this design
with the hypertransport irqs and asked for comments.

>  - alloc: allocates linux irqs for MSIs, (on powerpc, that includes
> binds them to a controller) and setup the irq_desc / irq_chip for them.
> This got passed the array of MSI(X) from the pci_enable_msi/x call (we
> use a one entry array for MSI-nonX iirc).
>
>  - setup: returns for one given MSI the address/data to write to the
> device. Could be better named I agree as it doesn't actually setup
> anything, maybe get_data ?
>
>  - free.
>
> The important thing here is that alloc is arch specific. We don't
> absolutely need the msi_ops mecanism exposed generically, it could be
> instead something like pcibios_alloc_msis() and pcibios_setup_msi()
> etc... and internally, powerpc implementation could use per-bus function
> pointers, but the base idea is that allocation is something that is left
> to the platform.

Sure, and I left it to the architecture, so there should be no problem
there.

Mostly it looks like to meet your needs more of msi.c needs to be
turned into library functions that most architectures can use but your
weird cases on ppc can skip.

> Also, another important thing we have to do to be compatible with our
> hypervisor based platforms is to have the option of setup being NULL in
> the ops (which could also be done differently using a special result
> code from alloc). In this case, alloc does everything and is essentially
> a re-implementation at the top level of pci_enable_msi(x). We need that
> because on IBM PAPR at least, the hypervisor does everything: allocating
> hardware vectors, configuring the device, etc... so all the platform
> code does is to allocate linux irq numbers and bind them to the vectors
> returned by HV.

I knew there was a hypervisor issue but I don't have enough visibility
there so I didn't even try. 

I have a weird concern coming from working with the infinipath driver,
and that is what happens if someone almost implements msi and puts the
registers in the wrong location.

> There are additional "nits" with thus patch as I see it. He puts the
> msi_desc in the irq_desc->chip_data. That cannot work for us. MSIs can
> be routed in hardware to various chips like the MPIC, the XICS, etc...
> and thus their irq_chip (as setup by our alloc callback) will be our
> normal MPIC or XICS irq_chips... those controller implementations
> already use irq_desc->chip_data for their own use and having the MSI
> layer hijack it will just blow things up. (Typically we can have
> multiple PIC instances and we put the instance pointer in there).

The problem is simply we have 2 irq controllers in play.  The
one on the chip and the one your the msi is targeted at.  It might
be easiest to add an extra pointer in struct irq to handle the case
of irq controllers on plug in devices.

> I think the msi_desc (or whatever data structure that holds the state of
> the MSI(X), backup LSI, etc... for one device) shall be hanging off
> struct pci_dev instead.

That doesn't feel right when you can have up to 4K irqs per device.
Using using the msi_desc array is a reasonable short term solution
but I am hoping at some point to kill all of the NR_IRQ arrays
so we can more efficiently have a sparsely populated irq space.
But I would really like something like a slot I can use in struct
irq_desc.

> Michael, can you post your current patch somewhere where it can be seen
> as an example of our approach ? Even if it is not complete yet in the
> sense that we haven't tested it with a backend yet, we were in the
> process of doing so...
>
> It should certainly be possible to modify Eric approach to fit our
> needs, but that probably involves not having the irq allocation mecanism
> made generic.

Sounds good to me.  One small step at a time.  As long as we keep
things in terms of linux irq numbers in the generic code.  I really
don't much care.

Let's just take this one small step at a time and fix the issues that
we see and understand well.   All I know is that the original code was
so horrible that when I replicated the code across 3 different
architectures the total number of lines went down.

There are things in that code like msi_lock that I think are totally
unnecessary but haven't been annoying enough for me to kill.

I'm just glad things are now close enough people can imagine
how to go from where we are to where the code needs to be.

Eric

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: + msi-refactor-and-move-the-msi-irq_chip-into-the-arch-code.patch added to -mm tree
  2006-09-28  5:50     ` Eric W. Biederman
@ 2006-09-28  6:16       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-28  6:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, linux-kernel, akpm, ak, greg, mingo, tglx,
	tony.luck, Michael Ellerman


> My apologies for the conflict.  I had hoped to catch you at OLS so
> I could get a better understanding of how things look in ppc land but
> we never ran into each other.  I also posted a precursor to this design
> with the hypertransport irqs and asked for comments.

No worries. We just both got sidetracked and Michael forgot to send you
his preliminary code... anyway, it's not that fundamentally alien to our
needs. I'm tempted to let Michael finish and push what he has so we have
a solution for 2.6.19 (we have some emergency there) and in a second
step, look into reconciling the 2 approaches.

> > The important thing here is that alloc is arch specific. We don't
> > absolutely need the msi_ops mecanism exposed generically, it could be
> > instead something like pcibios_alloc_msis() and pcibios_setup_msi()
> > etc... and internally, powerpc implementation could use per-bus function
> > pointers, but the base idea is that allocation is something that is left
> > to the platform.
> 
> Sure, and I left it to the architecture, so there should be no problem
> there.

Ok. Good.

> Mostly it looks like to meet your needs more of msi.c needs to be
> turned into library functions that most architectures can use but your
> weird cases on ppc can skip.

Sort-of... We didn't that much use the library model though with our
implementation as I didn't expect alloc to be generic on some archs, but
you are right, it could be made that way. (I sort of assume that all
archs have different means of allocating HW vectors and matching them to
linux IRQs, or at least discovering which linux IRQs are not in
potential use by hardware...).

> I knew there was a hypervisor issue but I don't have enough visibility
> there so I didn't even try. 
> 
> I have a weird concern coming from working with the infinipath driver,
> and that is what happens if someone almost implements msi and puts the
> registers in the wrong location.

Heh... well, our hypervisor is nasty in the sense that it will do
everything, and doesn't give us the choice, on MSI-X, of what individual
MSI-X to enable. Only how many (starting with the first one). That
doesn't quite match the linux API, though looking at how that API is
used, I haven't found somebody requesting a discontiguous range of MSIs
to be enabled ...

> The problem is simply we have 2 irq controllers in play.  The
> one on the chip and the one your the msi is targeted at.  It might
> be easiest to add an extra pointer in struct irq to handle the case
> of irq controllers on plug in devices.

Well, our current approach on PPC hides the "MSI controller" part in the
code for the PIC that receives the MSI. In practice, that works fairly
well because we almost never have to handle different combinations: the
Apple U4's PCIe is always driven by MPIC, the HT controllers (HT2000
typically is what we use) too, thus we just added code to the MPIC
driver to also handle MSI for both types of controllers. The Cell stuff
is separate and just a self-contained cascaded controller itself (the
MSI part), etc...

In some cases (like the HT and Apple PCIe), the MSI logic is comletely
transparent, it's just a small decoder that turns MSI writes to a magic
address or HT interrupts into inputs on the MPIC. Thus the irq_chip is
really only the "standard" MPIC irq_chip... Same with hypervisor, where
they just appear as just some more irqs to the existing virtual IRQ
controller, and thus have the exact same irq_chip... those (at least
MPIC does) need to use the chip_data already for other means.

Thus it's the alloc function which is routed to the appropriate
controller code for a given device, which allocates the low level
hardware vectors, binds them to linux irqs, and sets up an appropriate
irq_desc/irq_chip for it. That's all local to the alloc function, and
those alloc functions are currently provided by PIC code.

> > I think the msi_desc (or whatever data structure that holds the state of
> > the MSI(X), backup LSI, etc... for one device) shall be hanging off
> > struct pci_dev instead.
> 
> That doesn't feel right when you can have up to 4K irqs per device.

Why ? I didn't say it shall be embedded in the pci_dev, but hanging off
it (a pointer).

> Using using the msi_desc array is a reasonable short term solution
> but I am hoping at some point to kill all of the NR_IRQ arrays
> so we can more efficiently have a sparsely populated irq space.
> But I would really like something like a slot I can use in struct
> irq_desc.

Might be useful too yes... I think our current implementation carries a
mapping table of IRQs to pci_dev to get to the pointer in pci_dev iirc
(at least that's how we discussed it a while ago with Michael, I don't
know if he actually followed that). It might be better indeed to just
have a pointer off irq_desc for exclusive use by MSIs ... 

> Sounds good to me.  One small step at a time.  As long as we keep
> things in terms of linux irq numbers in the generic code.  I really
> don't much care.

Yes, we all agree there. Hardware number are not visible to non-arch
code.


> Let's just take this one small step at a time and fix the issues that
> we see and understand well.   All I know is that the original code was
> so horrible that when I replicated the code across 3 different
> architectures the total number of lines went down.

Yes, you work as I said is definitely a good step in the right
direction. I don't think the divergences are that fundamental.

> There are things in that code like msi_lock that I think are totally
> unnecessary but haven't been annoying enough for me to kill.
> 
> I'm just glad things are now close enough people can imagine
> how to go from where we are to where the code needs to be.

Yeah :)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2006-09-28  6:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <200609272215.k8RMFbuH018967@shell0.pdx.osdl.net>
2006-09-27 22:50 ` + msi-refactor-and-move-the-msi-irq_chip-into-the-arch-code.patch added to -mm tree David Miller
2006-09-27 23:09   ` Benjamin Herrenschmidt
2006-09-28  5:50     ` Eric W. Biederman
2006-09-28  6:16       ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox