* [git patch] free_irq() fixes
@ 2008-04-22 22:17 Jeff Garzik
2008-04-22 22:25 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Jeff Garzik @ 2008-04-22 22:17 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds; +Cc: LKML, rmk
Please pull from 'irq-fixes' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git irq-fixes
to receive the following updates:
arch/arm/mach-integrator/time.c | 4 ++--
drivers/char/mwave/tp3780i.c | 13 +++++++++----
2 files changed, 11 insertions(+), 6 deletions(-)
Jeff Garzik (1):
arm/mach-integrator, char/mwave: fix request_irq/free_irq mismatch
diff --git a/arch/arm/mach-integrator/time.c b/arch/arm/mach-integrator/time.c
index 5235f64..28726ee 100644
--- a/arch/arm/mach-integrator/time.c
+++ b/arch/arm/mach-integrator/time.c
@@ -137,7 +137,7 @@ static int rtc_probe(struct amba_device *dev, void *id)
return 0;
irq_out:
- free_irq(dev->irq[0], dev);
+ free_irq(dev->irq[0], NULL);
map_out:
iounmap(rtc_base);
rtc_base = NULL;
@@ -153,7 +153,7 @@ static int rtc_remove(struct amba_device *dev)
writel(0, rtc_base + RTC_CR);
- free_irq(dev->irq[0], dev);
+ free_irq(dev->irq[0], NULL);
unregister_rtc(&rtc_ops);
iounmap(rtc_base);
diff --git a/drivers/char/mwave/tp3780i.c b/drivers/char/mwave/tp3780i.c
index 37fe80d..5681c01 100644
--- a/drivers/char/mwave/tp3780i.c
+++ b/drivers/char/mwave/tp3780i.c
@@ -274,7 +274,8 @@ int tp3780I_ReleaseResources(THINKPAD_BD_DATA * pBDData)
release_region(pSettings->usDspBaseIO & (~3), 16);
if (pSettings->bInterruptClaimed) {
- free_irq(pSettings->usDspIrq, NULL);
+ free_irq(pSettings->usDspIrq,
+ (void *)(unsigned long) pSettings->usDspIrq);
pSettings->bInterruptClaimed = FALSE;
}
@@ -365,12 +366,14 @@ int tp3780I_EnableDSP(THINKPAD_BD_DATA * pBDData)
pSettings->bPllBypass = TP_CFG_PllBypass;
pSettings->usChipletEnable = TP_CFG_ChipletEnable;
+ /* FIXME: this is a racy way to verify irq conflict does not exist */
if (request_irq(pSettings->usUartIrq, &UartInterrupt, 0, "mwave_uart",
(void *)(unsigned long) pSettings->usUartIrq)) {
PRINTK_ERROR(KERN_ERR_MWAVE "tp3780i::tp3780I_EnableDSP: Error: Could not get UART IRQ %x\n", pSettings->usUartIrq);
goto exit_cleanup;
} else { /* no conflict just release */
- free_irq(pSettings->usUartIrq, NULL);
+ free_irq(pSettings->usUartIrq,
+ (void *)(unsigned long) pSettings->usUartIrq);
}
if (request_irq(pSettings->usDspIrq, &DspInterrupt, 0, "mwave_3780i",
@@ -411,7 +414,8 @@ exit_cleanup:
if (bDSPPoweredUp)
smapi_set_DSP_power_state(FALSE);
if (bInterruptAllocated) {
- free_irq(pSettings->usDspIrq, NULL);
+ free_irq(pSettings->usDspIrq,
+ (void *)(unsigned long) pSettings->usDspIrq);
pSettings->bInterruptClaimed = FALSE;
}
return -EIO;
@@ -428,7 +432,8 @@ int tp3780I_DisableDSP(THINKPAD_BD_DATA * pBDData)
if (pBDData->bDSPEnabled) {
dsp3780I_DisableDSP(&pBDData->rDspSettings);
if (pSettings->bInterruptClaimed) {
- free_irq(pSettings->usDspIrq, NULL);
+ free_irq(pSettings->usDspIrq,
+ (void *)(unsigned long) pSettings->usDspIrq);
pSettings->bInterruptClaimed = FALSE;
}
smapi_set_DSP_power_state(FALSE);
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-22 22:17 [git patch] free_irq() fixes Jeff Garzik
@ 2008-04-22 22:25 ` Linus Torvalds
2008-04-22 22:59 ` Jeff Garzik
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-04-22 22:25 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Andrew Morton, LKML, rmk
On Tue, 22 Apr 2008, Jeff Garzik wrote:
>
> diff --git a/arch/arm/mach-integrator/time.c b/arch/arm/mach-integrator/time.c
> index 5235f64..28726ee 100644
> --- a/arch/arm/mach-integrator/time.c
> +++ b/arch/arm/mach-integrator/time.c
> @@ -137,7 +137,7 @@ static int rtc_probe(struct amba_device *dev, void *id)
> return 0;
>
> irq_out:
> - free_irq(dev->irq[0], dev);
> + free_irq(dev->irq[0], NULL);
> map_out:
> iounmap(rtc_base);
> rtc_base = NULL;
> @@ -153,7 +153,7 @@ static int rtc_remove(struct amba_device *dev)
>
> writel(0, rtc_base + RTC_CR);
>
> - free_irq(dev->irq[0], dev);
> + free_irq(dev->irq[0], NULL);
> unregister_rtc(&rtc_ops);
>
> iounmap(rtc_base);
Quite frankly, I'd actually prefer to just reinstate "dev" to the irq
registration instead.
Why? Because that field is how we are able to track multiple interrupt
registrations that share an IRQ. Now, the "request_irq()" logic has a
special rule that actually tests that NULL is a valid cookie if the
IRQF_SHARED bit isn't set, but isn't it a nice thing to double-check
regardless?
> index 37fe80d..5681c01 100644
> --- a/drivers/char/mwave/tp3780i.c
> +++ b/drivers/char/mwave/tp3780i.c
> @@ -274,7 +274,8 @@ int tp3780I_ReleaseResources(THINKPAD_BD_DATA * pBDData)
> release_region(pSettings->usDspBaseIO & (~3), 16);
>
> if (pSettings->bInterruptClaimed) {
> - free_irq(pSettings->usDspIrq, NULL);
> + free_irq(pSettings->usDspIrq,
> + (void *)(unsigned long) pSettings->usDspIrq);
This, in contrast, *really* sucks as a cookie. In fact, it's useless. If
there are multiple tp3780i instances on the same irq, they will always
have the same cookie.
That's why we pass in a "device" pointer or similar - exactly to make sure
that the cookie is unique for that irq resource!
So it would be much nicer to fix the cookie to be a real device pointer
(why not "pSettings" itself?) that is stable across the whole series of
request_irq/free_irq. And that also would tend to get rid of the
butt-ugly casts.
Willing to fix those up?
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-22 22:25 ` Linus Torvalds
@ 2008-04-22 22:59 ` Jeff Garzik
2008-04-22 23:20 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Jeff Garzik @ 2008-04-22 22:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, LKML, rmk
[-- Attachment #1: Type: text/plain, Size: 974 bytes --]
Linus Torvalds wrote:
> Quite frankly, I'd actually prefer to just reinstate "dev" to the irq
> registration instead.
>
> Why? Because that field is how we are able to track multiple interrupt
> registrations that share an IRQ. Now, the "request_irq()" logic has a
> special rule that actually tests that NULL is a valid cookie if the
> IRQF_SHARED bit isn't set, but isn't it a nice thing to double-check
> regardless?
[...]
> This, in contrast, *really* sucks as a cookie. In fact, it's useless. If
> there are multiple tp3780i instances on the same irq, they will always
> have the same cookie.
AFAICS that will never happen for mwave hardware.
Nonetheless, it's a good point that NULL fails for disambiguation, so I
created the attached, which should do what you want. I'll push it after
the compiler gives it a green light (unless it needs further revisions)
(note, for mwave I couldn't use pSettings, since that might fail the
ambiguity test)
Jeff
[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 3646 bytes --]
diff --git a/arch/arm/mach-integrator/time.c b/arch/arm/mach-integrator/time.c
index 5235f64..8508a0d 100644
--- a/arch/arm/mach-integrator/time.c
+++ b/arch/arm/mach-integrator/time.c
@@ -124,8 +124,11 @@ static int rtc_probe(struct amba_device *dev, void *id)
xtime.tv_sec = __raw_readl(rtc_base + RTC_DR);
+ /* note that 'dev' is merely used for irq disambiguation;
+ * it is not actually referenced in the irq handler
+ */
ret = request_irq(dev->irq[0], arm_rtc_interrupt, IRQF_DISABLED,
- "rtc-pl030", NULL);
+ "rtc-pl030", dev);
if (ret)
goto map_out;
diff --git a/drivers/char/mwave/tp3780i.c b/drivers/char/mwave/tp3780i.c
index 37fe80d..35e1723 100644
--- a/drivers/char/mwave/tp3780i.c
+++ b/drivers/char/mwave/tp3780i.c
@@ -97,16 +97,17 @@ static void EnableSRAM(THINKPAD_BD_DATA * pBDData)
static irqreturn_t UartInterrupt(int irq, void *dev_id)
{
- int irqno = (int)(unsigned long) dev_id;
+ unsigned short *irqno = dev_id;
PRINTK_3(TRACE_TP3780I,
- "tp3780i::UartInterrupt entry irq %x dev_id %p\n", irqno, dev_id);
+ "tp3780i::UartInterrupt entry irq %hx dev_id %p\n",
+ *irqno, dev_id);
return IRQ_HANDLED;
}
static irqreturn_t DspInterrupt(int irq, void *dev_id)
{
- int irqno = (int)(unsigned long) dev_id;
+ unsigned short *irqno = dev_id;
pMWAVE_DEVICE_DATA pDrvData = &mwave_s_mdd;
DSP_3780I_CONFIG_SETTINGS *pSettings = &pDrvData->rBDData.rDspSettings;
@@ -114,7 +115,8 @@ static irqreturn_t DspInterrupt(int irq, void *dev_id)
unsigned short usIPCSource = 0, usIsolationMask, usPCNum;
PRINTK_3(TRACE_TP3780I,
- "tp3780i::DspInterrupt entry irq %x dev_id %p\n", irqno, dev_id);
+ "tp3780i::DspInterrupt entry irq %hx dev_id %p\n",
+ *irqno, dev_id);
if (dsp3780I_GetIPCSource(usDspBaseIO, &usIPCSource) == 0) {
PRINTK_2(TRACE_TP3780I,
@@ -274,7 +276,7 @@ int tp3780I_ReleaseResources(THINKPAD_BD_DATA * pBDData)
release_region(pSettings->usDspBaseIO & (~3), 16);
if (pSettings->bInterruptClaimed) {
- free_irq(pSettings->usDspIrq, NULL);
+ free_irq(pSettings->usDspIrq, &pSettings->usDspIrq);
pSettings->bInterruptClaimed = FALSE;
}
@@ -366,15 +368,15 @@ int tp3780I_EnableDSP(THINKPAD_BD_DATA * pBDData)
pSettings->usChipletEnable = TP_CFG_ChipletEnable;
if (request_irq(pSettings->usUartIrq, &UartInterrupt, 0, "mwave_uart",
- (void *)(unsigned long) pSettings->usUartIrq)) {
+ &pSettings->usUartIrq)) {
PRINTK_ERROR(KERN_ERR_MWAVE "tp3780i::tp3780I_EnableDSP: Error: Could not get UART IRQ %x\n", pSettings->usUartIrq);
goto exit_cleanup;
} else { /* no conflict just release */
- free_irq(pSettings->usUartIrq, NULL);
+ free_irq(pSettings->usUartIrq, &pSettings->usUartIrq);
}
if (request_irq(pSettings->usDspIrq, &DspInterrupt, 0, "mwave_3780i",
- (void *)(unsigned long) pSettings->usDspIrq)) {
+ &pSettings->usDspIrq)) {
PRINTK_ERROR("tp3780i::tp3780I_EnableDSP: Error: Could not get 3780i IRQ %x\n", pSettings->usDspIrq);
goto exit_cleanup;
} else {
@@ -411,7 +413,7 @@ exit_cleanup:
if (bDSPPoweredUp)
smapi_set_DSP_power_state(FALSE);
if (bInterruptAllocated) {
- free_irq(pSettings->usDspIrq, NULL);
+ free_irq(pSettings->usDspIrq, &pSettings->usDspIrq);
pSettings->bInterruptClaimed = FALSE;
}
return -EIO;
@@ -428,7 +430,7 @@ int tp3780I_DisableDSP(THINKPAD_BD_DATA * pBDData)
if (pBDData->bDSPEnabled) {
dsp3780I_DisableDSP(&pBDData->rDspSettings);
if (pSettings->bInterruptClaimed) {
- free_irq(pSettings->usDspIrq, NULL);
+ free_irq(pSettings->usDspIrq, &pSettings->usDspIrq);
pSettings->bInterruptClaimed = FALSE;
}
smapi_set_DSP_power_state(FALSE);
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-22 22:59 ` Jeff Garzik
@ 2008-04-22 23:20 ` Linus Torvalds
2008-04-22 23:49 ` Jeff Garzik
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-04-22 23:20 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Andrew Morton, LKML, rmk
On Tue, 22 Apr 2008, Jeff Garzik wrote:
>
> (note, for mwave I couldn't use pSettings, since that might fail the ambiguity
> test)
Ok, so using the pointer to inside a specific pSettings field is fine.
But can you also explain to me why that insane driver does this:
static irqreturn_t UartInterrupt(int irq, void *dev_id)
{
- int irqno = (int)(unsigned long) dev_id;
+ unsigned short *irqno = dev_id;
...
*irqno, dev_id);
instead of just ignoring "dev_id" entirely, and then just using that "irq"
argument directly?
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-22 23:20 ` Linus Torvalds
@ 2008-04-22 23:49 ` Jeff Garzik
2008-04-22 23:52 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Jeff Garzik @ 2008-04-22 23:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, LKML, rmk
Linus Torvalds wrote:
>
> On Tue, 22 Apr 2008, Jeff Garzik wrote:
>> (note, for mwave I couldn't use pSettings, since that might fail the ambiguity
>> test)
>
> Ok, so using the pointer to inside a specific pSettings field is fine.
>
> But can you also explain to me why that insane driver does this:
>
> static irqreturn_t UartInterrupt(int irq, void *dev_id)
> {
> - int irqno = (int)(unsigned long) dev_id;
> + unsigned short *irqno = dev_id;
> ...
> *irqno, dev_id);
>
> instead of just ignoring "dev_id" entirely, and then just using that "irq"
> argument directly?
That was noted briefly in the push email:
> In my review of every single interrupt handler in the Linux, while
> working on another project (jgarzik/misc-2.6.git#irq-remove), I've
[...]
> Since the #irq-remove project involves removal of the 'irq' argument
> from interrupt handlers (unused 99.8% of the time),
[...]
After going over every irq handler (read: almost every driver in the
kernel, plus arch code), my #irq-remove branch has confirmed what my gut
already knew -- the 'irq' argument is completely unused for almost every
driver. So I was taking that line of thought as far as it went.
I found less than 10 cases (out of ~1100) that actually did something
useful with the value _and_ did not have the value already stashed
somewhere in a reached data structure.
Those cases are easily handled a la pt_regs change -- via a
get_irqfunc_irq() -- as a quick fix, or the preferred cleanup would be
to pass info properly via the standard method for passing info to irq
handlers: dev_id
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-22 23:49 ` Jeff Garzik
@ 2008-04-22 23:52 ` Linus Torvalds
2008-04-23 0:05 ` Adrian Bunk
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-04-22 23:52 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Andrew Morton, LKML, rmk
On Tue, 22 Apr 2008, Jeff Garzik wrote:
> [...]
> > Since the #irq-remove project involves removal of the 'irq' argument
> > from interrupt handlers (unused 99.8% of the time),
> [...]
>
> After going over every irq handler (read: almost every driver in the kernel,
> plus arch code), my #irq-remove branch has confirmed what my gut already knew
> -- the 'irq' argument is completely unused for almost every driver. So I was
> taking that line of thought as far as it went.
Ok, that's just not going to happen.
What's the upside? Really?
I can tell you the downsides:
- tons of huge patches with ugly churn
- total disaster when it comes to any drivers that are maintained over
multiple versions and/or out-of-tree like the DRI stuff
so those upsides had better be really big.
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-22 23:52 ` Linus Torvalds
@ 2008-04-23 0:05 ` Adrian Bunk
2008-04-23 0:16 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Adrian Bunk @ 2008-04-23 0:05 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff Garzik, Andrew Morton, LKML, rmk
On Tue, Apr 22, 2008 at 04:52:44PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 22 Apr 2008, Jeff Garzik wrote:
> > [...]
> > > Since the #irq-remove project involves removal of the 'irq' argument
> > > from interrupt handlers (unused 99.8% of the time),
> > [...]
> >
> > After going over every irq handler (read: almost every driver in the kernel,
> > plus arch code), my #irq-remove branch has confirmed what my gut already knew
> > -- the 'irq' argument is completely unused for almost every driver. So I was
> > taking that line of thought as far as it went.
>
> Ok, that's just not going to happen.
>
> What's the upside? Really?
>
> I can tell you the downsides:
> - tons of huge patches with ugly churn
>...
If it goes like the regs removal in one big patch around -rc1 into your
tree this shouldn't be a problem.
> Linus
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-23 0:05 ` Adrian Bunk
@ 2008-04-23 0:16 ` Linus Torvalds
2008-04-23 13:51 ` Rene Herman
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-04-23 0:16 UTC (permalink / raw)
To: Adrian Bunk; +Cc: Jeff Garzik, Andrew Morton, LKML, rmk
On Wed, 23 Apr 2008, Adrian Bunk wrote:
>
> If it goes like the regs removal in one big patch around -rc1 into your
> tree this shouldn't be a problem.
Well, the regs removal had a real upside (it wasn't even sensible for all
irq types), and really nobody used it apart from "system users" (ie Sysrq
etc).
I'm still waiting for anybody mentioning any upside at _all_ on removing
"irq".
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-23 0:16 ` Linus Torvalds
@ 2008-04-23 13:51 ` Rene Herman
2008-04-24 2:10 ` Jeff Garzik
0 siblings, 1 reply; 31+ messages in thread
From: Rene Herman @ 2008-04-23 13:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Adrian Bunk, Jeff Garzik, Andrew Morton, LKML, rmk,
Eric W. Biederman, Thomas Gleixner, Ingo Molnar
On 23-04-08 02:16, Linus Torvalds wrote:
> On Wed, 23 Apr 2008, Adrian Bunk wrote:
>> If it goes like the regs removal in one big patch around -rc1 into your
>> tree this shouldn't be a problem.
>
> Well, the regs removal had a real upside (it wasn't even sensible for all
> irq types), and really nobody used it apart from "system users" (ie Sysrq
> etc).
>
> I'm still waiting for anybody mentioning any upside at _all_ on removing
> "irq".
Saves another 4 bytes of stack? :-/ Seriously, Jeff can probably better
answer himself but when this was posted before:
http://lkml.org/lkml/2007/5/19/23
Eric Biederman said it fit nicely into his "nefarious plan of making
everything use a struct irq pointer". A later mention:
http://lkml.org/lkml/2007/10/19/66
got strong ACKs from Thomas Gleixner, Ingo Molnar and Greg KH. Remember due
to working on a local driver at the time and deleting the "irq" argument
usage from its handler (unneccesarily used in a debugging printk) from it in
response.
My own view is that if it's not really overly painful this does make for a
nice API cleanliness thing -- the IRQ level is only relevant to the lower
level generic handler code not the "driver endpoint handler" and not passing
it in the first place thereby is in keeping with this layering.
Rene.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-23 13:51 ` Rene Herman
@ 2008-04-24 2:10 ` Jeff Garzik
2008-04-24 2:19 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Jeff Garzik @ 2008-04-24 2:10 UTC (permalink / raw)
To: Rene Herman, Linus Torvalds
Cc: Adrian Bunk, Andrew Morton, LKML, rmk, Eric W. Biederman,
Thomas Gleixner, Ingo Molnar
Rene Herman wrote:
> On 23-04-08 02:16, Linus Torvalds wrote:
>
>> On Wed, 23 Apr 2008, Adrian Bunk wrote:
>>> If it goes like the regs removal in one big patch around -rc1 into
>>> your tree this shouldn't be a problem.
>>
>> Well, the regs removal had a real upside (it wasn't even sensible for
>> all irq types), and really nobody used it apart from "system users"
>> (ie Sysrq etc).
>>
>> I'm still waiting for anybody mentioning any upside at _all_ on
>> removing "irq".
>
> Saves another 4 bytes of stack? :-/ Seriously, Jeff can probably better
> answer himself but when this was posted before:
>
> http://lkml.org/lkml/2007/5/19/23
>
> Eric Biederman said it fit nicely into his "nefarious plan of making
> everything use a struct irq pointer". A later mention:
>
> http://lkml.org/lkml/2007/10/19/66
>
> got strong ACKs from Thomas Gleixner, Ingo Molnar and Greg KH. Remember
> due to working on a local driver at the time and deleting the "irq"
> argument usage from its handler (unneccesarily used in a debugging
> printk) from it in response.
Thanks. I was hoping that some of the people who expressed interest in
prior threads would appear.
Answering Linus's question, the things I tend to think of are
* it's not used in overwhelming majority of cases
* irq number has morphed over time with MSI-X and APICs and such from a
direct "reference" to a hardware line to a more abstract cookie value.
* the need for a struct [pci_]device everywhere means drivers have ready
access to irq number _anyway_
* it has clearly led to many helpful cleanups and bug fixes, by both me
and others [and yes, for the sake of argument I'm excluding those
discussed in this thread]
* it helps clean up abuses like HPET where it is used to encode data
(ignoring dev_id unnecessarily... I posted a patch to fix this):
if (rtc_int_flag) {
rtc_int_flag |= (RTC_IRQF | (RTC_NUM_INTS << 8));
if (irq_handler)
irq_handler(rtc_int_flag, dev_id);
}
["irq_handler" is a function passed to request_irq, as well as being
called here]
dev_id exists for passing various data to the irq_handler... with some
drivers abusing the 'irq' argument to pass data, that potential opens
holes for bugs whenever the irq numbering (aka cookie) scheme is changed
-- because changing the cookie scheme could potentially trigger code like
if (irq == MAGIC_NUMBER)
this is an internal self-call, do some polling
else
handle real hardware-raised interrupt
When drivers make assumptions about system irq numbering, particularly
on x86, IMO the situation is fragile.
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 2:10 ` Jeff Garzik
@ 2008-04-24 2:19 ` Linus Torvalds
2008-04-24 5:59 ` Eric W. Biederman
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-04-24 2:19 UTC (permalink / raw)
To: Jeff Garzik
Cc: Rene Herman, Adrian Bunk, Andrew Morton, LKML, rmk,
Eric W. Biederman, Thomas Gleixner, Ingo Molnar
On Wed, 23 Apr 2008, Jeff Garzik wrote:
>
> When drivers make assumptions about system irq numbering, particularly on x86,
> IMO the situation is fragile.
And when people make changes to long-standing and stable infrastructure,
the situation also gets fragile.
The fact is, stability of interfaces is a really worthy goal in itself.
Making a change for its own sake is not a good thing. This fixes
*nothing*, and the driver changes I objected to I objected to because they
were ugly as sin.
And I want to point out that your patches made it *much* uglier.
So "cleanup" it sure as hell wasn't. That irq number may not be worth all
that much in itself, but it has no subtle implementation problems (we
_need_ that irq number for registration and irq handler lookup anyway, so
it is meaningful from a driver perspective, and is well-defined from a irq
core standpoint as well).
I don't mind cleanups, but this is "churn". Change for its own sake. If it
doesn't lead to any _improvement_, it's pointless.
If drivers don't need it, let them ignore it. But let them ignore it in
ways that work across versions, and in ways that don't cause ridiculous
and ugly work-arounds for when they do want it (even if it's just for a
printk() or similar).
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 2:19 ` Linus Torvalds
@ 2008-04-24 5:59 ` Eric W. Biederman
2008-04-24 10:53 ` Jeff Garzik
0 siblings, 1 reply; 31+ messages in thread
From: Eric W. Biederman @ 2008-04-24 5:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff Garzik, Rene Herman, Adrian Bunk, Andrew Morton, LKML, rmk,
Thomas Gleixner, Ingo Molnar
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Wed, 23 Apr 2008, Jeff Garzik wrote:
>>
>> When drivers make assumptions about system irq numbering, particularly on x86,
>> IMO the situation is fragile.
>
> And when people make changes to long-standing and stable infrastructure,
> the situation also gets fragile.
>
> The fact is, stability of interfaces is a really worthy goal in itself.
> Making a change for its own sake is not a good thing. This fixes
> *nothing*, and the driver changes I objected to I objected to because they
> were ugly as sin.
>
> And I want to point out that your patches made it *much* uglier.
>
> So "cleanup" it sure as hell wasn't. That irq number may not be worth all
> that much in itself, but it has no subtle implementation problems (we
> _need_ that irq number for registration and irq handler lookup anyway, so
> it is meaningful from a driver perspective, and is well-defined from a irq
> core standpoint as well).
>
> I don't mind cleanups, but this is "churn". Change for its own sake. If it
> doesn't lead to any _improvement_, it's pointless.
>
> If drivers don't need it, let them ignore it. But let them ignore it in
> ways that work across versions, and in ways that don't cause ridiculous
> and ugly work-arounds for when they do want it (even if it's just for a
> printk() or similar).
I haven't looked at what Jeff's patches in particular so I can not
comment there. I do remember looking at the drivers in question and
yes there were indeed bugs with the handful of drivers that used
the irq parameter. So fixing and cleaning up those drivers so they
use the same idioms as the rest of the kernel should be a maintenance
win. Even if we do keep the irq parameter to the interrupt handler.
I can comment on where there seems to be a real need for change.
The hard coded NR_IRQS parameter and the arrays of size NR_IRQS are a
kernel scaling bottle neck. They prevents us from building one kernel
that works well on a large ranges of machines sizes. Having a single
array prevents us from allocating the irq structures with NUMA
affinity which slows down irq processing. Having a small number for
NR_IRQS to keep the table compact keeps the irq number from being
readable/useful in the case of MSI and occasionally in the case of
IO_APICs.
...
Since there does seem to be a real win it is my intention to kill all
arrays of NR_IRQs size in the generic code. I did a quick search for
[NR_IRQS] and have only found about 7 of them.
Then modify the genirq code to allow architectures to provide an
alternative to todays irq_desc array.
To export the struct irq_desc to the rest of the kernel generically as
a struct irq, with a new API irq_request,irq_free etc.
To build the existing API on top of the new API with a potentially
slow function that maps an irq number to the struct irq. Irq actions
are not fast path so we should be ok with even if we do a linear walk
through the set of irqs.
After that I will see about removing irq number references from
the bulk of kernel drivers. Essentially transferring the drivers to
the new API in some stepwise fashion. Which is where I see synergy
with Jeff's effort.
For old ISA drivers the irq number has real meaning and I don't
intend to loose that, and I don't fundamentally intend to loose
the irq number, instead reserving it for talking to users. Which
will allow us to use a stable mapping for MSI irqs where we simply
do not have enough bits below NR_IRQS today.
My goal is for an API upgrade that is a slight variation of todays
API that will be stable into the future as machines grow larger
and larger and we get more and more irq sources. I believe the trend
line is for the number of irqs in a machine to grow linearly with
the number of cores, which means I expect them to start going
exponential. Not having an integer which can and has been abused
in some hideous ways but instead a structure pointer is likely to keep
things more robust into the future.
Eric
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 5:59 ` Eric W. Biederman
@ 2008-04-24 10:53 ` Jeff Garzik
2008-04-24 15:16 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Jeff Garzik @ 2008-04-24 10:53 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
Eric W. Biederman wrote:
> I haven't looked at what Jeff's patches in particular so I can not
> comment there. I do remember looking at the drivers in question and
> yes there were indeed bugs with the handful of drivers that used
> the irq parameter. So fixing and cleaning up those drivers so they
> use the same idioms as the rest of the kernel should be a maintenance
> win. Even if we do keep the irq parameter to the interrupt handler.
>
> I can comment on where there seems to be a real need for change.
> The hard coded NR_IRQS parameter and the arrays of size NR_IRQS are a
> kernel scaling bottle neck. They prevents us from building one kernel
> that works well on a large ranges of machines sizes. Having a single
> array prevents us from allocating the irq structures with NUMA
> affinity which slows down irq processing. Having a small number for
> NR_IRQS to keep the table compact keeps the irq number from being
> readable/useful in the case of MSI and occasionally in the case of
> IO_APICs.
Honestly, one thing I was thinking was perhaps a change from
irqreturn_t foo_handler(int irq, void *dev_id)
to
irqreturn_t foo_handler(struct irq_info *ii, void *dev_id)
which would IMO make the first parameter useful again, by enabling
passing of information like MSI message info, or more flexible
platform-specific irq info that a platform driver may want. Or direct
access to irq_desc or irq_chip info.
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 10:53 ` Jeff Garzik
@ 2008-04-24 15:16 ` Linus Torvalds
2008-04-24 15:40 ` Jeff Garzik
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-04-24 15:16 UTC (permalink / raw)
To: Jeff Garzik
Cc: Eric W. Biederman, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
On Thu, 24 Apr 2008, Jeff Garzik wrote:
>
> Honestly, one thing I was thinking was perhaps a change from
>
> irqreturn_t foo_handler(int irq, void *dev_id)
> to
> irqreturn_t foo_handler(struct irq_info *ii, void *dev_id)
>
> which would IMO make the first parameter useful again, by enabling passing of
> information like MSI message info, or more flexible platform-specific irq info
> that a platform driver may want. Or direct access to irq_desc or irq_chip
> info.
So I *really* hate that idea. It's much *much* worse than what we have
now.
Why?
The absolutely _only_ piece of reliably information we have that is
architecture- and irq-controller neutral is the exact information we pass
in to "request_irq()". That is: irq number, the name, and the device
cookie thing. Nothing more.
And of the three things, they have the following pattern:
- "irq number" is some random cookie, but it is a cookie that the
*system* forces on the driver, and that is totally independent of how
the irq is delivered or what kind of irq it is (device, system, PCI,
ISA, whatever). The driver doesn't get to choose it, but the system and
the driver have to agree on it some way (ie regardless of whether it's
a PCI driver or a Super-Integrated-bus-of-year-2025, the driver will
have to get it from some system resource, ie the pci_dev or whatever)
IOW, the irq number *does* have meaning, but it is very much by design
something that is _purely_ a cookie. You cannot look into it - it's not
a pointer to any data.
- "the name". There really is no point to passing this around, because
it's purely for show, and purely so that the generic irq layer can tell
the user something in /proc/irq etc. Passing it back to the driver
would be entirely pointless, because it is designed purely to be a
driver->system informational thing.
- the "device cookie". This is the thing that the system itself doesn't
care about, and is _entirely_ under control of the driver, so the
driver can pass its own interrupt controller some helpful instance
pointers.
So of the three, "device cookie" is the one that we absolutely have to
have. The irq number is not necessary, but it does actually have some
meaning especially for legacy devices (eg ISA), and it is at least
_sensible_ to pass around (ie it has no downsides, and it's not
fundamentally broken). And the name would be just stupid.
EVERYTHING else would be architecture-specific. And that is exactly what
we do not want. EVER.
Passing in some context that contains bus information is absolutely the
*last* thing we want. We do not want to have irq handlers that know about
the interrupt controller details.
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 15:55 ` Linus Torvalds
@ 2008-04-24 15:37 ` Alan Cox
2008-04-24 16:20 ` Jeff Garzik
2008-04-24 16:16 ` Jeff Garzik
2008-04-24 16:48 ` Eric W. Biederman
2 siblings, 1 reply; 31+ messages in thread
From: Alan Cox @ 2008-04-24 15:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff Garzik, Eric W. Biederman, Rene Herman, Adrian Bunk,
Andrew Morton, LKML, rmk, Thomas Gleixner, Ingo Molnar
> But then you'd have to have some way to "printk" the information, which is
> a very common requirement (and the printk still needs to be a number,
> because you want to match up 'dmesg' output with the '/proc/interrupts'
> file etc).
>
> That, in turn, would effectively force a whole new function, and then
> you'd have things like
>
> printk(.. irq %d .., irq_number(irqcookie) ..)
>
> which while not ugly isn't really all that clean either. And I guarantee
> that people would misuse that "irq_number(cookie)" exactly in the same
> ways they'd misuse "irq" (ie not very much).
Sparc32 had this and it was very ugly. However if you don't pass in the
IRQ then people will store the irq value privately and things like
request_irq can deal with numeric interrupts and the like as before while
new interfaces for MSI can deal in MSI objects whatever they end up like.
Kill off the *use* of the int irq passed into functions as Jeff has
basically done and you can later replace it with a structure that contains
useful info, some private some not. For now all it has to do is happen to
be a masked int as you suggest.
Alan
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 15:16 ` Linus Torvalds
@ 2008-04-24 15:40 ` Jeff Garzik
2008-04-24 15:55 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Jeff Garzik @ 2008-04-24 15:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
Linus Torvalds wrote:
> The absolutely _only_ piece of reliably information we have that is
> architecture- and irq-controller neutral is the exact information we pass
> in to "request_irq()". That is: irq number, the name, and the device
> cookie thing. Nothing more.
Agreed.
However, it does not follow that an int is what _must_ be passed around.
We already have design patterns like
cookie_pointer = ioremap(raw bus resource)
Not that I am the one pushing for that, just noting.
Overall this is all wild-assed speculation based on a thought
exploration (#irq-remove) that a several kernel hackers seemed to like.
> - the "device cookie". This is the thing that the system itself doesn't
> care about, and is _entirely_ under control of the driver, so the
> driver can pass its own interrupt controller some helpful instance
> pointers.
>
> So of the three, "device cookie" is the one that we absolutely have to
> have. The irq number is not necessary, but it does actually have some
> meaning especially for legacy devices (eg ISA), and it is at least
> _sensible_ to pass around (ie it has no downsides, and it's not
> fundamentally broken). And the name would be just stupid.
Agreed.
> EVERYTHING else would be architecture-specific. And that is exactly what
> we do not want. EVER.
Not true -- you have metadata/OOB data like MSI messages, where you are
passed a value from the PCI hardware in the PCI message, not just an
"interrupt asserted" condition. Or s/value/values/ if you enable PCI
MSI's multiple message support.
The PCI devices themselves are moving from sending a single bit of
information ("irq!") to sending actual messages.
That is not arch-specific at all, but a new model for "interrupt" (i.e.
event) notification being pushed upon us.
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 15:40 ` Jeff Garzik
@ 2008-04-24 15:55 ` Linus Torvalds
2008-04-24 15:37 ` Alan Cox
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Linus Torvalds @ 2008-04-24 15:55 UTC (permalink / raw)
To: Jeff Garzik
Cc: Eric W. Biederman, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
On Thu, 24 Apr 2008, Jeff Garzik wrote:
>
> However, it does not follow that an int is what _must_ be passed around. We
> already have design patterns like
>
> cookie_pointer = ioremap(raw bus resource)
>
> Not that I am the one pushing for that, just noting.
I do agree that we could use something more type-safe.
So a "pointer" to a structure that doesn't actually exist would be fine
and would give us some C type checking.
But then you'd have to have some way to "printk" the information, which is
a very common requirement (and the printk still needs to be a number,
because you want to match up 'dmesg' output with the '/proc/interrupts'
file etc).
That, in turn, would effectively force a whole new function, and then
you'd have things like
printk(.. irq %d .., irq_number(irqcookie) ..)
which while not ugly isn't really all that clean either. And I guarantee
that people would misuse that "irq_number(cookie)" exactly in the same
ways they'd misuse "irq" (ie not very much).
Quite frankly, I'd much prefer a
typedef int __bitwise irq_t;
and then we can use sparse to do this testing, without breaking any
existing use at all (because it will still be an "int" to gcc, but sparse
will make "irq_t" a type of its own and make sure that people pass it
around as such and not do arithmetic ops on it etc).
> > EVERYTHING else would be architecture-specific. And that is exactly what we
> > do not want. EVER.
>
> Not true -- you have metadata/OOB data like MSI messages, where you are passed
> a value from the PCI hardware in the PCI message, not just an "interrupt
> asserted" condition. Or s/value/values/ if you enable PCI MSI's multiple
> message support.
The point is, MSI *is* architecture-specific. In fact, it's even
motherboard-specific, in that you are going to have (for the forseeable
future) drivers that have to work with or witgout MSI even on the same
architecture.
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 15:55 ` Linus Torvalds
2008-04-24 15:37 ` Alan Cox
@ 2008-04-24 16:16 ` Jeff Garzik
2008-04-24 16:48 ` Eric W. Biederman
2 siblings, 0 replies; 31+ messages in thread
From: Jeff Garzik @ 2008-04-24 16:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
Linus Torvalds wrote:
>
> On Thu, 24 Apr 2008, Jeff Garzik wrote:
>> However, it does not follow that an int is what _must_ be passed around. We
>> already have design patterns like
>>
>> cookie_pointer = ioremap(raw bus resource)
>>
>> Not that I am the one pushing for that, just noting.
>
> I do agree that we could use something more type-safe.
>
> So a "pointer" to a structure that doesn't actually exist would be fine
> and would give us some C type checking.
>
> But then you'd have to have some way to "printk" the information, which is
> a very common requirement (and the printk still needs to be a number,
> because you want to match up 'dmesg' output with the '/proc/interrupts'
> file etc).
>
> That, in turn, would effectively force a whole new function, and then
> you'd have things like
>
> printk(.. irq %d .., irq_number(irqcookie) ..)
>
> which while not ugly isn't really all that clean either. And I guarantee
> that people would misuse that "irq_number(cookie)" exactly in the same
> ways they'd misuse "irq" (ie not very much).
>
> Quite frankly, I'd much prefer a
>
> typedef int __bitwise irq_t;
>
> and then we can use sparse to do this testing, without breaking any
> existing use at all (because it will still be an "int" to gcc, but sparse
> will make "irq_t" a type of its own and make sure that people pass it
> around as such and not do arithmetic ops on it etc).
sparse checking sure would be nice for drivers......
As an aside it would be nice to associate a struct device with an irq at
request_irq() time. I wonder if we could perform this transition in
situ, by creating a dev_request_irq() that used the new irq_t, while
leaving all existing drivers to be converted as time permits.
>>> EVERYTHING else would be architecture-specific. And that is exactly what we
>>> do not want. EVER.
>> Not true -- you have metadata/OOB data like MSI messages, where you are passed
>> a value from the PCI hardware in the PCI message, not just an "interrupt
>> asserted" condition. Or s/value/values/ if you enable PCI MSI's multiple
>> message support.
>
> The point is, MSI *is* architecture-specific. In fact, it's even
> motherboard-specific, in that you are going to have (for the forseeable
> future) drivers that have to work with or witgout MSI even on the same
> architecture.
Sure, so what would a driver API look like that works for both situations?
We have two models:
For each interrupt event delivered:
old model: "irq asserted" datum is passed to driver implicitly,
via the fact that your irq handler is called.
new model: we have a dword, or two or three, passed to driver
explicitly, in addition to implicit "irq asserted" fact
It's easier to handle both models if you could pass something other than
an int to the driver --for each interrupt event--.
Having something pointer-sized rather than int-sized would be flexible
enough to cover all situations, now and in the future, presumably.
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 15:37 ` Alan Cox
@ 2008-04-24 16:20 ` Jeff Garzik
0 siblings, 0 replies; 31+ messages in thread
From: Jeff Garzik @ 2008-04-24 16:20 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Eric W. Biederman, Rene Herman, Adrian Bunk,
Andrew Morton, LKML, rmk, Thomas Gleixner, Ingo Molnar
Alan Cox wrote:
> Sparc32 had this and it was very ugly. However if you don't pass in the
> IRQ then people will store the irq value privately and things like
> request_irq can deal with numeric interrupts and the like as before while
> new interfaces for MSI can deal in MSI objects whatever they end up like.
Yes, and on a related note...
_Today_ drivers _already_ store the irq value privately, because they must:
Logic dictates they must do so because all other functions in the driver
do not have an 'irq' argument, but do need to call things (free_irq,
disable_irq) that take an irq number argument.
That is one of my key design objections to passing 'int' to an irq handler:
Every modern driver _must_ store the irq value anyway -- and typically
this is done automatically in struct device or struct pci_dev resources,
so the driver writer need not bother with storing it themselves.
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 15:55 ` Linus Torvalds
2008-04-24 15:37 ` Alan Cox
2008-04-24 16:16 ` Jeff Garzik
@ 2008-04-24 16:48 ` Eric W. Biederman
2008-04-24 16:58 ` Linus Torvalds
2008-04-24 17:30 ` Jeff Garzik
2 siblings, 2 replies; 31+ messages in thread
From: Eric W. Biederman @ 2008-04-24 16:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff Garzik, Rene Herman, Adrian Bunk, Andrew Morton, LKML, rmk,
Thomas Gleixner, Ingo Molnar
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, 24 Apr 2008, Jeff Garzik wrote:
>>
>> However, it does not follow that an int is what _must_ be passed around. We
>> already have design patterns like
>>
>> cookie_pointer = ioremap(raw bus resource)
>>
>> Not that I am the one pushing for that, just noting.
>
> I do agree that we could use something more type-safe.
>
> So a "pointer" to a structure that doesn't actually exist would be fine
> and would give us some C type checking.
>
> But then you'd have to have some way to "printk" the information, which is
> a very common requirement (and the printk still needs to be a number,
> because you want to match up 'dmesg' output with the '/proc/interrupts'
> file etc).
>
> That, in turn, would effectively force a whole new function, and then
> you'd have things like
>
> printk(.. irq %d .., irq_number(irqcookie) ..)
>
> which while not ugly isn't really all that clean either. And I guarantee
> that people would misuse that "irq_number(cookie)" exactly in the same
> ways they'd misuse "irq" (ie not very much).
>
> Quite frankly, I'd much prefer a
>
> typedef int __bitwise irq_t;
Well that had better be.
typedef unsigned int __bitwise irq_t;
Or else we have changed the historical type.
The one real advantage of a pointer cookie type is that we can remove
the confusion of dealing with irq 0. Which is a valid irq but no
driver can use it, or our if (!irq) tests fail.
If irq 0 does not have a NULL cookie all of those problems disappear.
> and then we can use sparse to do this testing, without breaking any
> existing use at all (because it will still be an "int" to gcc, but sparse
> will make "irq_t" a type of its own and make sure that people pass it
> around as such and not do arithmetic ops on it etc).
My only real concern is that it doesn't get too hard for the
implementation to go from the irq cookie to the struct irq_desc when
we kill the arrays.
Right now the irq number is very much an array index and that
property is limiting.
Even if we never export them to drivers we will need to implement
in genirq functions like:
int __must_check irq_request(struct irq_desc *irq, irq_handler_t handler,
unsigned long irqflags, const char *devname, void *devid);
int __must_check request_irq(unsigned int irq, irq_handler_t handler,
unsigned long irqflags, const char *devname, void *devid)
{
return irq_request(cookie_to_desc(irq), handler, irqflags, devname, devid);
}
Or at the very least do the mapping from cookie to irq_desc at the
start of all of the genirq functions. One valid implementation of
that cookie to desc will be the current array lookup. But for x86
we need something less limiting.
>> > EVERYTHING else would be architecture-specific. And that is exactly what we
>> > do not want. EVER.
>>
>> Not true -- you have metadata/OOB data like MSI messages, where you are passed
>> a value from the PCI hardware in the PCI message, not just an "interrupt
>> asserted" condition. Or s/value/values/ if you enable PCI MSI's multiple
>> message support.
>
> The point is, MSI *is* architecture-specific. In fact, it's even
> motherboard-specific, in that you are going to have (for the forseeable
> future) drivers that have to work with or witgout MSI even on the same
> architecture.
And on x86 at least the hardware maps the MSI write into an interrupt.
So there is not an opportunity to get any metdata/OOB data from the
MSI message. Instead we just potentially get a boatload more irq
sources. Which is one of the things making a static NR_IRQS painful.
To be safe we have to make NR_IRQS 10x+ or so bigger then people use
today. Just in case they decide to plug in some really irq hungry
cards.
Eric
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 16:48 ` Eric W. Biederman
@ 2008-04-24 16:58 ` Linus Torvalds
2008-04-24 18:15 ` Eric W. Biederman
2008-04-24 17:30 ` Jeff Garzik
1 sibling, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2008-04-24 16:58 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeff Garzik, Rene Herman, Adrian Bunk, Andrew Morton, LKML, rmk,
Thomas Gleixner, Ingo Molnar
On Thu, 24 Apr 2008, Eric W. Biederman wrote:
>
> Right now the irq number is very much an array index and that
> property is limiting.
That's simply not true.
You are looking at some implementation detail, but other architectures
have used other notions.
The number is just that: a number. Nothing else. If you think it's an
array index, you are looking at it in all the wrong ways.
So you're now complaining about somethign totally different than the type,
you're talking about simple implementation issues.
And the reason it's generally implemented as a vector is very simple: it's
basic to the whole notion that the hardware gives us some cookie (which
itself is _often_ a small number, but sometimes is something that we can
program), and we want to look up that cookie into our internal irq
handling data structures.
So even when we can program the hw cookie arbitrarily (eg MSI), we
actually tend to *want* to program it with something we can use as an
array index, simply because that's often the best way to then map the
hardware cookie into the internal "struct irq_desc" things.
So your argument MAKES ABSOLUTELY NO SENSE!
The reason we use something that _looks_ like an array index is that we
ourself (and hardware) MADE IT SO.
It has absolutely nothing to do with the type of "irq". Even if "irq" was
a pointer to a struct, we'd *still* want to have that array-index-like
thing, and the only thing you could do would be to then use an extra level
of indirection to turn it into something else!
So "irq" looks like an array index because we _avoid_ the indirection
(except for the lookup of "irq_desc", which we absolutely do NOT want to
expose to drivers!)
Your argument is utter crap, in other words. It makes no sense.
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 16:48 ` Eric W. Biederman
2008-04-24 16:58 ` Linus Torvalds
@ 2008-04-24 17:30 ` Jeff Garzik
2008-04-25 2:53 ` Eric W. Biederman
1 sibling, 1 reply; 31+ messages in thread
From: Jeff Garzik @ 2008-04-24 17:30 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
Eric W. Biederman wrote:
> Even if we never export them to drivers we will need to implement
> in genirq functions like:
>
> int __must_check irq_request(struct irq_desc *irq, irq_handler_t handler,
> unsigned long irqflags, const char *devname, void *devid);
>
> int __must_check request_irq(unsigned int irq, irq_handler_t handler,
> unsigned long irqflags, const char *devname, void *devid)
> {
> return irq_request(cookie_to_desc(irq), handler, irqflags, devname, devid);
> }
>
> Or at the very least do the mapping from cookie to irq_desc at the
> start of all of the genirq functions. One valid implementation of
> that cookie to desc will be the current array lookup. But for x86
> we need something less limiting.
[...]
> And on x86 at least the hardware maps the MSI write into an interrupt.
> So there is not an opportunity to get any metdata/OOB data from the
> MSI message. Instead we just potentially get a boatload more irq
> sources. Which is one of the things making a static NR_IRQS painful.
>
> To be safe we have to make NR_IRQS 10x+ or so bigger then people use
> today. Just in case they decide to plug in some really irq hungry
> cards.
Just to be clear, irq_chip/irq_desc and metadata/OOB data are two very
different beasts. irq_chip/irq_desc is more a system attribute as Linus
notes. Also, it doesn't change very often.
metadata/OOB, on the other hand, is different -for each interrupt-, and
is highly relevant to drivers. Thus should be part of the driver API
somehow.
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 16:58 ` Linus Torvalds
@ 2008-04-24 18:15 ` Eric W. Biederman
0 siblings, 0 replies; 31+ messages in thread
From: Eric W. Biederman @ 2008-04-24 18:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff Garzik, Rene Herman, Adrian Bunk, Andrew Morton, LKML, rmk,
Thomas Gleixner, Ingo Molnar
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, 24 Apr 2008, Eric W. Biederman wrote:
>>
>> Right now the irq number is very much an array index and that
>> property is limiting.
>
> That's simply not true.
>
> You are looking at some implementation detail, but other architectures
> have used other notions.
>
> The number is just that: a number. Nothing else. If you think it's an
> array index, you are looking at it in all the wrong ways.
*laughs*
I don't think it is fundamentally an array index. I just think we
have assumptions in the implementation that are a bit hard to
get out keeping it an array index. Some of that has to do
with the cookie we pass to drivers. So I am concerned there.
My preference would be a cookie based on struct irq_desc * so we
could recover the pointer to the irq_desc with a constant time operation.
However I really don't care as long as the driver irq cookie doesn't
limit us.
> So you're now complaining about somethign totally different than the type,
> you're talking about simple implementation issues.
>
> And the reason it's generally implemented as a vector is very simple: it's
> basic to the whole notion that the hardware gives us some cookie (which
> itself is _often_ a small number, but sometimes is something that we can
> program), and we want to look up that cookie into our internal irq
> handling data structures.
Yes.
In some sense we have 3 cookies. The hw cookie the driver cookie
and the human cookie (the irq number).
The hw cookie is so abstracted and hidden that most people don't
even realize we are doing really weird things there today.
The driver cookie is currently the same as the human cookie. Not
really a problem, but it does sometimes mean drivers make assumptions
that are not future proof.
The human cookie the irq number is for sysadmins and other people
trying to understand the system. So the human cookie should be
stable (the same from boot to boot) and have some meaning if
we can (The Nth vector IOAPIC vector, the ISA irq,
domain/bus/slot/msivec). Of course humans also tend to deal with
small numbers better. So we have an ugly trade off between stability
over time human readability with msi vectors.
> So even when we can program the hw cookie arbitrarily (eg MSI), we
> actually tend to *want* to program it with something we can use as an
> array index, simply because that's often the best way to then map the
> hardware cookie into the internal "struct irq_desc" things.
Yes. For the hw cookie it is important that we can look it up quickly
in a small array. That is a purely architecture detail and drivers
and even the genirq layer don't see that as they don't see the hw
cookie.
Given that many (most?) architectures have their own hw cookie that
is completely different from the irq number, there is little need
for the struct irq_desc to be stored in an array. The only
thing really driving that is inertia and the fact that the generic irq
layer uses a number to communicate with drivers.
I do agree that the number we communicate to drivers with can easily
be something else.
> The reason we use something that _looks_ like an array index is that we
> ourself (and hardware) MADE IT SO.
The linux irq number has NOTHING implementation wise to do with the
hardware. The connection with the hardware is not an implementation
detail but is a deliberate thing so HUMANS can understand the
connection with the hardware.
> It has absolutely nothing to do with the type of "irq". Even if "irq" was
> a pointer to a struct, we'd *still* want to have that array-index-like
> thing, and the only thing you could do would be to then use an extra level
> of indirection to turn it into something else!
There are two possible benefits that I see from having the irq driver
cookie be separate from the human cookie.
- irq 0 can stop being special and non-magical drivers can use it.
As we can separate the driver cookie and the human cookie.
Making it so that "if (!irqcookie)" doesn't trigger for irq 0.
Which has the potential to simplify things.
- We won't need an extra data structure eventually to map from the
driver cookie to struct irq_desc.
We could do something like:
struct irq_desc *irq_to_desc(struct irq *irq)
{
struct irq_desc *desc;
unsigned long val;
val = (unsigned long )irq;
val = ~val;
val = val - 3;
desc = (struct irq_desc *)val;
BUG_ON(desc->magic != IRQDESC_MAGIC)
return desc;
}
Neither of which appears particularly persuasive to me in the short
term, because there will be a transition period where I will need some
intermediate data structure. So we don't need a flag if we change
the type of the cookie.
> So "irq" looks like an array index because we _avoid_ the indirection
> (except for the lookup of "irq_desc", which we absolutely do NOT want to
> expose to drivers!)
I can see the point of not exposing irq_desc directly to drivers.
That seems to make it more future proof. I'm still not quite
convinced that we might not want to have a:
struct irqdesc {
struct irq generic_irq_info;
}
And pass that generic_irq_info to drivers, but I don't see any real
advantage in it either.
> Your argument is utter crap, in other words. It makes no sense.
Seems we are on different wave lengths then.
Eric
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [git patch] free_irq() fixes
2008-04-24 17:30 ` Jeff Garzik
@ 2008-04-25 2:53 ` Eric W. Biederman
2008-04-25 3:33 ` MSI, fun for the whole family (was Re: [git patch] free_irq() fixes) Jeff Garzik
0 siblings, 1 reply; 31+ messages in thread
From: Eric W. Biederman @ 2008-04-25 2:53 UTC (permalink / raw)
To: Jeff Garzik
Cc: Linus Torvalds, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
Jeff Garzik <jeff@garzik.org> writes:
> Eric W. Biederman wrote:
>> And on x86 at least the hardware maps the MSI write into an interrupt.
>> So there is not an opportunity to get any metdata/OOB data from the
>> MSI message. Instead we just potentially get a boatload more irq
>> sources. Which is one of the things making a static NR_IRQS painful.
>>
>> To be safe we have to make NR_IRQS 10x+ or so bigger then people use
>> today. Just in case they decide to plug in some really irq hungry
>> cards.
>
>
> Just to be clear, irq_chip/irq_desc and metadata/OOB data are two very different
> beasts. irq_chip/irq_desc is more a system attribute as Linus notes. Also, it
> doesn't change very often.
>
> metadata/OOB, on the other hand, is different -for each interrupt-, and is
> highly relevant to drivers. Thus should be part of the driver API somehow.
I'm not certain I follow so I will ask.
Do you mean information that is different each time an interrupt is fires?
Or do you mean information that differs for each different interrupt?
Something like the current dev_id?
To my knowledge there is not any information that varies each time an
interrupt fires.
Eric
^ permalink raw reply [flat|nested] 31+ messages in thread
* MSI, fun for the whole family (was Re: [git patch] free_irq() fixes)
2008-04-25 2:53 ` Eric W. Biederman
@ 2008-04-25 3:33 ` Jeff Garzik
2008-04-25 3:57 ` MSI, fun for the whole family Roland Dreier
2008-04-25 5:08 ` Eric W. Biederman
0 siblings, 2 replies; 31+ messages in thread
From: Jeff Garzik @ 2008-04-25 3:33 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Rene Herman, Adrian Bunk, Andrew Morton, LKML,
rmk, Thomas Gleixner, Ingo Molnar
Eric W. Biederman wrote:
> Jeff Garzik <jeff@garzik.org> writes:
>
>> Eric W. Biederman wrote:
>>> And on x86 at least the hardware maps the MSI write into an interrupt.
>>> So there is not an opportunity to get any metdata/OOB data from the
>>> MSI message. Instead we just potentially get a boatload more irq
>>> sources. Which is one of the things making a static NR_IRQS painful.
>>>
>>> To be safe we have to make NR_IRQS 10x+ or so bigger then people use
>>> today. Just in case they decide to plug in some really irq hungry
>>> cards.
>>
>> Just to be clear, irq_chip/irq_desc and metadata/OOB data are two very different
>> beasts. irq_chip/irq_desc is more a system attribute as Linus notes. Also, it
>> doesn't change very often.
>>
>> metadata/OOB, on the other hand, is different -for each interrupt-, and is
>> highly relevant to drivers. Thus should be part of the driver API somehow.
>
> I'm not certain I follow so I will ask.
>
> Do you mean information that is different each time an interrupt is fires?
>
> Or do you mean information that differs for each different interrupt?
> Something like the current dev_id?
>
> To my knowledge there is not any information that varies each time an
> interrupt fires.
Absolutely there is! This is why MSI is so cool.
You get a tiny chunk of data from the hardware, across the PCI bus in a
single PCI transaction, sent [well, basically...] straight to the driver
__for each MSI interrupt__. Rather than having a separate interrupt
line -- really an ugly OOB mechanism -- you get a bus transaction as God
intended, a bus transaction just like all the others going across the
PCI bus.
Let's illustrate with a real world example, with hardware you probably
already have in your hands today.
Download AHCI 1.1 SATA controller specification from
http://www.intel.com/technology/serialata/ahci.htm
and check out Section 2.3 and MSI-related bits of Section 10.6.2 for the
usage of those PCI MSI registers on the PCI device.
An AHCI PCI device uses MSI messages to inform the driver which <mask>
of 32 SATA ports have asserted an activity indication.
This MSI message varies _for each interrupt_, and replaces the standard
driver idiom of reading a hardware Interrupt-Status register.
Thus you can see increased performance with MSI messages because the
hardware "pushes" useful information to the driver, using an in-band
mechanism (PCI bus transaction) rather than an out-of-band mechanism ($N
SATA ports sharing a single interrupt line).
This is the reverse of the standard model, where the driver receives the
knowledge "your interrupt line asserted... maybe" and it must deduce
activity from there by reading an Interrupt-Status register.
That is one fundamental of MSI messages: they carry data. To
illustrate with "kernel pseudocode", this equates to
irqreturn_t irq_handler(int irq, void *dev_id,
const void *metadata,
size_t metadata_len)
You have a fundamentally new model for interrupt handling with MSI...
You are no longer managing an interrupt line that is asserted and
cleared. It is now an asynchronous flow of data blobs from hardware to
various per-driver "mailboxes".
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: MSI, fun for the whole family
2008-04-25 3:33 ` MSI, fun for the whole family (was Re: [git patch] free_irq() fixes) Jeff Garzik
@ 2008-04-25 3:57 ` Roland Dreier
2008-04-25 4:19 ` David Miller
2008-04-25 4:35 ` Jeff Garzik
2008-04-25 5:08 ` Eric W. Biederman
1 sibling, 2 replies; 31+ messages in thread
From: Roland Dreier @ 2008-04-25 3:57 UTC (permalink / raw)
To: Jeff Garzik
Cc: Eric W. Biederman, Linus Torvalds, Rene Herman, Adrian Bunk,
Andrew Morton, LKML, rmk, Thomas Gleixner, Ingo Molnar
> > To my knowledge there is not any information that varies each time an
> > interrupt fires.
>
> Absolutely there is! This is why MSI is so cool.
>
> You get a tiny chunk of data from the hardware, across the PCI bus in
> a single PCI transaction, sent [well, basically...] straight to the
> driver __for each MSI interrupt__. Rather than having a separate
> interrupt line -- really an ugly OOB mechanism -- you get a bus
> transaction as God intended, a bus transaction just like all the
> others going across the PCI bus.
I think you've fundamentally misunderstood what the PCI spec for MSI
multi message means. It is true that if the whole system agrees, then
an MSI-capable device that supports multiple messages might be allocated
a range of vectors (MSI is kind of stupid because it only allows
multiple messages to be generated by varying the low order bits -- MSI-X
fixes this limitation). However, the way that these different messages
are handled is that they are all independent interrupt vectors.
Now, it is true that the kernel could do something crazy and collapse
all these interrupt vectors into a single "IRQ" and then tell the
interrupt handler which vector it was by passing some "metadata" in, but
why not just give each MSI message it's own IRQ?
By the way, for MSI on Linux this is theoretical, since MSI messages
have to be consecutive, and the kernel (on standard x86 at least) can't
really allocate a chunk of consecutive vectors. However, Linux has
quite a few device drivers that handle devices that generate multiple
MSI-X messages, and the model of having each message be a different IRQ
works quite well.
You are correct that MSI/MSI-X has the advantage of being an in-band
transaction that obeys PCI ordering rules, rather than going through an
asynchronous separate wire that might reach the CPU before the DMA it is
signaling.
> An AHCI PCI device uses MSI messages to inform the driver which <mask>
> of 32 SATA ports have asserted an activity indication.
Actually from the spec it looks like it uses multiple MSI messages to
give each port its own interrupt. There aren't enough bits available to
| together a bitwise port mask, and it wouldn't be sane to have eg 3
separate interrupt handlers for port 1 interrupts, port 2 interrupts,
and port 1 *and* 2 interrupts.
- R.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: MSI, fun for the whole family
2008-04-25 3:57 ` MSI, fun for the whole family Roland Dreier
@ 2008-04-25 4:19 ` David Miller
2008-04-25 4:35 ` Jeff Garzik
1 sibling, 0 replies; 31+ messages in thread
From: David Miller @ 2008-04-25 4:19 UTC (permalink / raw)
To: rdreier
Cc: jeff, ebiederm, torvalds, rene.herman, bunk, akpm, linux-kernel,
rmk, tglx, mingo
From: Roland Dreier <rdreier@cisco.com>
Date: Thu, 24 Apr 2008 20:57:48 -0700
> Now, it is true that the kernel could do something crazy and collapse
> all these interrupt vectors into a single "IRQ" and then tell the
> interrupt handler which vector it was by passing some "metadata" in, but
> why not just give each MSI message it's own IRQ?
Actually, it doesn't make any sense to have more MSI, or "MSI queue"
interrupts than you have cpus.
Non-x86 PCI-E controller impelemntations that I am familiar with collect
MSI and MSI-X interrupts into "queues", these queues being non-empty
is what actually triggers an interrupt to the CPU. And, there are
enough MSI queue instances such that you can direct each one to a
unique cpu.
The MSI queue interrupt simply scans the ring buffer of pending MSI
interrupts and dispatches them to the device.
You can handle PCI-E frabric error messages the same way, and in fact
that's what the controllers I am familiar with do.
A Linux implementation of support for this kind of setup can be seen
in arch/sparc64/kernel/pci_msi.c:sparc64_msiq_interrupt(). It's
very generic and doesn't care whether it's talking to real PCI
controller hardware or a hypervisor based interface.
Besides the obvious extra indirection overhead, our IRQ layer is very
much capable of supporting multi-level dispatch like this correctly.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: MSI, fun for the whole family
2008-04-25 3:57 ` MSI, fun for the whole family Roland Dreier
2008-04-25 4:19 ` David Miller
@ 2008-04-25 4:35 ` Jeff Garzik
2008-04-25 5:48 ` Eric W. Biederman
2008-04-25 22:44 ` Roland Dreier
1 sibling, 2 replies; 31+ messages in thread
From: Jeff Garzik @ 2008-04-25 4:35 UTC (permalink / raw)
To: Roland Dreier
Cc: Eric W. Biederman, Linus Torvalds, Rene Herman, Adrian Bunk,
Andrew Morton, LKML, rmk, Thomas Gleixner, Ingo Molnar
Roland Dreier wrote:
> I think you've fundamentally misunderstood what the PCI spec for MSI
> multi message means. It is true that if the whole system agrees, then
You seem to be ignoring a key usage...
> an MSI-capable device that supports multiple messages might be allocated
> a range of vectors (MSI is kind of stupid because it only allows
> multiple messages to be generated by varying the low order bits -- MSI-X
> fixes this limitation). However, the way that these different messages
> are handled is that they are all independent interrupt vectors.
>
> Now, it is true that the kernel could do something crazy and collapse
> all these interrupt vectors into a single "IRQ" and then tell the
> interrupt handler which vector it was by passing some "metadata" in, but
> why not just give each MSI message it's own IRQ?
The answer is: the driver might prefer to see the message as it
arrived, rather than dividing it up into independent vectors. The
message itself is a unit of data consistency, and there is value in
letting the driver see the bounds of that unit.
As it stands now, we only a spray of $N function calls for each message,
with no notion of "we started processing this set of messages" and "we
ended processing[...]"
Additionally the bitmask-friendly multi-port architecture of these SATA
controllers matches nicely with an activity (event) status mask we
already obtain in almost every driver.
Don't assume that the way Linux supports this stuff today is the best,
or the only way to do things. It's not "collapsing all these interrupt
vectors" -- remember that an expansion occurred, and /avoiding
expansion/ into multiple vectors for multiple messages may be an optimal
path for a specific driver application.
Jeff
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: MSI, fun for the whole family
2008-04-25 3:33 ` MSI, fun for the whole family (was Re: [git patch] free_irq() fixes) Jeff Garzik
2008-04-25 3:57 ` MSI, fun for the whole family Roland Dreier
@ 2008-04-25 5:08 ` Eric W. Biederman
1 sibling, 0 replies; 31+ messages in thread
From: Eric W. Biederman @ 2008-04-25 5:08 UTC (permalink / raw)
To: Jeff Garzik
Cc: Eric W. Biederman, Linus Torvalds, Rene Herman, Adrian Bunk,
Andrew Morton, LKML, rmk, Thomas Gleixner, Ingo Molnar
Jeff Garzik <jeff@garzik.org> writes:
> Eric W. Biederman wrote:
>> Jeff Garzik <jeff@garzik.org> writes:
>>
>>> Eric W. Biederman wrote:
>>>> And on x86 at least the hardware maps the MSI write into an interrupt.
>>>> So there is not an opportunity to get any metdata/OOB data from the
>>>> MSI message. Instead we just potentially get a boatload more irq
>>>> sources. Which is one of the things making a static NR_IRQS painful.
>>>>
>>>> To be safe we have to make NR_IRQS 10x+ or so bigger then people use
>>>> today. Just in case they decide to plug in some really irq hungry
>>>> cards.
>>>
>>> Just to be clear, irq_chip/irq_desc and metadata/OOB data are two very
> different
>>> beasts. irq_chip/irq_desc is more a system attribute as Linus notes. Also,
> it
>>> doesn't change very often.
>>>
>>> metadata/OOB, on the other hand, is different -for each interrupt-, and is
>>> highly relevant to drivers. Thus should be part of the driver API somehow.
>>
>> I'm not certain I follow so I will ask.
>>
>> Do you mean information that is different each time an interrupt is fires?
>>
>> Or do you mean information that differs for each different interrupt?
>> Something like the current dev_id?
>>
>> To my knowledge there is not any information that varies each time an
>> interrupt fires.
>
> Absolutely there is! This is why MSI is so cool.
>
> You get a tiny chunk of data from the hardware, across the PCI bus in a single
> PCI transaction, sent [well, basically...] straight to the driver __for each MSI
> interrupt__. Rather than having a separate interrupt line -- really an ugly OOB
> mechanism -- you get a bus transaction as God intended, a bus transaction just
> like all the others going across the PCI bus.
Correct.
You get a write of 16 bits of data to a 32bit address on x86.
Encoded in that write is a cpu number, an 8 bit vector number,
and various encoding modes. That 8 bit vector is encoded in the
low 8 bits of the 16bit data word.
> Let's illustrate with a real world example, with hardware you probably already
> have in your hands today.
>
>
> Download AHCI 1.1 SATA controller specification from
> http://www.intel.com/technology/serialata/ahci.htm
>
> and check out Section 2.3 and MSI-related bits of Section 10.6.2 for the usage
> of those PCI MSI registers on the PCI device.
>
> An AHCI PCI device uses MSI messages to inform the driver which <mask> of 32
> SATA ports have asserted an activity indication.
Not a mask of 32 SATA ports. The low 5 bits of the 16 bit data word vary.
> This MSI message varies _for each interrupt_, and replaces the standard driver
> idiom of reading a hardware Interrupt-Status register.
>
> Thus you can see increased performance with MSI messages because the hardware
> "pushes" useful information to the driver, using an in-band mechanism (PCI bus
> transaction) rather than an out-of-band mechanism ($N SATA ports sharing a
> single interrupt line).
The effect is the same but the principle of operation is slightly different.
In practice you have a limited set of messages that a card may generate.
A maximum of 32 different messages in the case of a plain MSI capability
and a maximum of 4096 different messages in the case of a MSI-X capability.
The hardware encoding on x86 ensures each of those different messages
maps to a different system interrupt. The irq layer then maps each
of those interrupts into a different linux irq number. And those
interrupts may never be shared.
The support code for all of that is already implemented.
Now I do have some bad news for you. We do not support
using the multiple message capability of the AHCI. The linux API
currently requires that we be able to migrate irqs individually to
different CPUS and that we be able mask individual irqs, and
the only the MSI-X capability allows us to implement that.
> This is the reverse of the standard model, where the driver receives the
> knowledge "your interrupt line asserted... maybe" and it must deduce activity
> from there by reading an Interrupt-Status register.
> That is one fundamental of MSI messages: they carry data. To illustrate with
> "kernel pseudocode", this equates to
>
> irqreturn_t irq_handler(int irq, void *dev_id,
> const void *metadata,
> size_t metadata_len)
>
> You have a fundamentally new model for interrupt handling with MSI...
Close.
> You are no longer managing an interrupt line that is asserted and cleared. It
> is now an asynchronous flow of data blobs from hardware to various per-driver
> "mailboxes".
Pretty much. Although a large chunk of that comes from simply having
edge triggered interrupts.
I can almost see handling the irqs for the msi capability that way.
As one interrupt with a data blob. As that gets around the migration
and masking issues that are otherwise present. The need to allocate
several vectors continuously is a pain and hard to do portably. So
if those kinds of cards take off and there is a real win I won't
object.
For now I figure cards like that get one MSI interrupt, and if
they want more or to use the in-band data they may implement MSI-X
which provides for a completely separate address and data item
for each message. Making the separate messages much more useful.
Eric
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: MSI, fun for the whole family
2008-04-25 4:35 ` Jeff Garzik
@ 2008-04-25 5:48 ` Eric W. Biederman
2008-04-25 22:44 ` Roland Dreier
1 sibling, 0 replies; 31+ messages in thread
From: Eric W. Biederman @ 2008-04-25 5:48 UTC (permalink / raw)
To: Jeff Garzik
Cc: Roland Dreier, Linus Torvalds, Rene Herman, Adrian Bunk,
Andrew Morton, LKML, rmk, Thomas Gleixner, Ingo Molnar
Jeff Garzik <jeff@garzik.org> writes:
> Roland Dreier wrote:
>> I think you've fundamentally misunderstood what the PCI spec for MSI
>> multi message means. It is true that if the whole system agrees, then
>
> You seem to be ignoring a key usage...
>
>
>> an MSI-capable device that supports multiple messages might be allocated
>> a range of vectors (MSI is kind of stupid because it only allows
>> multiple messages to be generated by varying the low order bits -- MSI-X
>> fixes this limitation). However, the way that these different messages
>> are handled is that they are all independent interrupt vectors.
>>
>> Now, it is true that the kernel could do something crazy and collapse
>> all these interrupt vectors into a single "IRQ" and then tell the
>> interrupt handler which vector it was by passing some "metadata" in, but
>> why not just give each MSI message it's own IRQ?
>
> The answer is: the driver might prefer to see the message as it arrived, rather
> than dividing it up into independent vectors. The message itself is a unit of
> data consistency, and there is value in letting the driver see the bounds of
> that unit.
On x86 the driver very much sees the message as it arrived at the cpu.
If you want the message as it was sent we would need to deduce it from
which cpu we are on and which interrupt descriptors function was called.
As for the bounds of data consistency currently no information is lost.
> As it stands now, we only a spray of $N function calls for each message, with no
> notion of "we started processing this set of messages" and "we ended
> processing[...]"
Hopefully my other messages explains it better but there is no
spray of function calls. Nor can there be. There is not a bitmask in
your MSI message there is instead a base 2 number encoding which port
had traffic. You don't have 32 bits but 5.
> Don't assume that the way Linux supports this stuff today is the best, or the
> only way to do things. It's not "collapsing all these interrupt vectors" --
> remember that an expansion occurred, and /avoiding expansion/ into multiple
> vectors for multiple messages may be an optimal path for a specific driver
> application.
If we were receiving a bitmask I would find this more persuasive.
To implement what you suggest currently feels like a lot of work digging
through each architectures implementation of MSI and seeing if it possible
and maintainable on each architecture, and refactoring a lot of the interrupt
handling infrastructure to get there. All to support hardware vendors
that were too lazy to either implement MSI-X or implement a DMAing a
status value to a known spot before sending the MSI.
I don't see it being worth it at this time.
Eric
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: MSI, fun for the whole family
2008-04-25 4:35 ` Jeff Garzik
2008-04-25 5:48 ` Eric W. Biederman
@ 2008-04-25 22:44 ` Roland Dreier
1 sibling, 0 replies; 31+ messages in thread
From: Roland Dreier @ 2008-04-25 22:44 UTC (permalink / raw)
To: Jeff Garzik
Cc: Eric W. Biederman, Linus Torvalds, Rene Herman, Adrian Bunk,
Andrew Morton, LKML, rmk, Thomas Gleixner, Ingo Molnar
> The answer is: the driver might prefer to see the message as it
> arrived, rather than dividing it up into independent vectors. The
> message itself is a unit of data consistency, and there is value in
> letting the driver see the bounds of that unit.
I can't really understand what this is about. As things stand now, each
MSI message results in an interrupt, which results in calling the
interrupt handler. What information from "the message as it arrived" is
being lost other than the precise contents of the PCI transaction, which
is system dependent and of no more interest to the driver than what ball
number a PCI wire interrupt is hooked up to?
> Additionally the bitmask-friendly multi-port architecture of these
> SATA controllers matches nicely with an activity (event) status mask
> we already obtain in almost every driver.
I also don't know what you mean about "bitmask-friendly". The AHCI spec
that you mentioned pretty clearly talks about having an MSI message per
port, rather than putting a bitmask of ports in the message. Given that
AHCI supports 32 ports, and MSI messages only are 16 bits long (of which
only 5 bits can be changed by the device), I don't see how you expect
this bitmask to be encoded.
It seems to me that if the kernel supported multiple MSI messages in the
same way it handles multiple MSI-X messages, by giving a different IRQ
number for each message, this would be ideal for a driver... you would
basically do
for (i = 0; i < dev->num_ports; i++)
request_irq(dev->msi_vector[i],
port_irq_handler, 0,
"name", &dev->port[i]);
and then your interrupt handler could look like
irqreturn_t port_irq_handler(int irq, void *port_ptr)
{
struct my_port *port = port_ptr;
//...
which is very clean and saves you from having to look up any mask of
ports.
So it might be nice to handle multiple MSI messages, but it's fairly
ugly (both because of the current Linux interrupt handling structure and
also because the MSI spec is somewhat broken), but given that most
devices are moving to MSI-X, it's not clear that it's worth working on
this. And I don't see any reason why MSI would require changes to the
IRQ handler prototype.
> Don't assume that the way Linux supports this stuff today is the best,
> or the only way to do things. It's not "collapsing all these
> interrupt vectors" -- remember that an expansion occurred, and
> /avoiding expansion/ into multiple vectors for multiple messages may
> be an optimal path for a specific driver application.
What expansion occurrs? On x86, having a different vector for each MSI
message it pretty fundamental to the way the hardware handles interrupts.
- R.
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2008-04-25 22:44 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-22 22:17 [git patch] free_irq() fixes Jeff Garzik
2008-04-22 22:25 ` Linus Torvalds
2008-04-22 22:59 ` Jeff Garzik
2008-04-22 23:20 ` Linus Torvalds
2008-04-22 23:49 ` Jeff Garzik
2008-04-22 23:52 ` Linus Torvalds
2008-04-23 0:05 ` Adrian Bunk
2008-04-23 0:16 ` Linus Torvalds
2008-04-23 13:51 ` Rene Herman
2008-04-24 2:10 ` Jeff Garzik
2008-04-24 2:19 ` Linus Torvalds
2008-04-24 5:59 ` Eric W. Biederman
2008-04-24 10:53 ` Jeff Garzik
2008-04-24 15:16 ` Linus Torvalds
2008-04-24 15:40 ` Jeff Garzik
2008-04-24 15:55 ` Linus Torvalds
2008-04-24 15:37 ` Alan Cox
2008-04-24 16:20 ` Jeff Garzik
2008-04-24 16:16 ` Jeff Garzik
2008-04-24 16:48 ` Eric W. Biederman
2008-04-24 16:58 ` Linus Torvalds
2008-04-24 18:15 ` Eric W. Biederman
2008-04-24 17:30 ` Jeff Garzik
2008-04-25 2:53 ` Eric W. Biederman
2008-04-25 3:33 ` MSI, fun for the whole family (was Re: [git patch] free_irq() fixes) Jeff Garzik
2008-04-25 3:57 ` MSI, fun for the whole family Roland Dreier
2008-04-25 4:19 ` David Miller
2008-04-25 4:35 ` Jeff Garzik
2008-04-25 5:48 ` Eric W. Biederman
2008-04-25 22:44 ` Roland Dreier
2008-04-25 5:08 ` Eric W. Biederman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox