Flexible SFF interrupt handling

linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Flexible SFF interrupt handling
@ 2007-11-28 13:45 Jeff Garzik
  2007-11-28 14:29 ` Alan Cox
  2007-11-28 14:33 ` Mark Lord
  0 siblings, 2 replies; 9+ messages in thread
From: Jeff Garzik @ 2007-11-28 13:45 UTC (permalink / raw)
  To: IDE/ATA development list

This has been bubbling on my brain for a while.  I blathered on about 
this on IRC to Tejun, but figured I might as well post it here and get 
it archived.

In general, I think we should adopt a flexible or "loose" model for 
acking interrupts on SFF controllers.

(a) whenever we are in bus-idle (qc == NULL), and get an interrupt, go 
ahead and read Status.

(b) if we are expecting an interrupt, and receive one, check Status (or 
AltStatus if DMAing).

(c) if condition "(b)" indicates busy, initiate status polling every 
250ms until timeout occurs or BSY clears.

(d) if N seconds (4?) elapses without an interrupt, initiate polling. 
keep a history of such "fail-over" events, and note each fail-over'd 
command's eventual success via polling, success via interrupt, or 
timeout.  Use that history to decide to switch to 100% polling mode 
(i.e. reach conclusion that interrupt delivery is broken, via observation)

That should cover no-interrupts, lost interrupts, early interrupts, 
screaming interrupts, insane devices, and of course normal operation.

The model could be summarized as "interrupt as a hint" operation.

	Jeff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-28 13:45 Flexible SFF interrupt handling Jeff Garzik
@ 2007-11-28 14:29 ` Alan Cox
  2007-11-28 16:09   ` Jeff Garzik
  2007-11-28 14:33 ` Mark Lord
  1 sibling, 1 reply; 9+ messages in thread
From: Alan Cox @ 2007-11-28 14:29 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: IDE/ATA development list

> In general, I think we should adopt a flexible or "loose" model for 
> acking interrupts on SFF controllers.

Agreed - especially as the IRQ is often essentially the drive output not
under any kind of sane control of ours.

> (a) whenever we are in bus-idle (qc == NULL), and get an interrupt, go 
> ahead and read Status.

Please call into the driver. Quite a few PATA drivers have multiple IRQ
sources, and SATA many. 

> (b) if we are expecting an interrupt, and receive one, check Status (or 
> AltStatus if DMAing).

Providing we are not mid data transfer (which is why we need to get into
enable/disable_irq for some controllers). Right now its a problem that
can't occur but on some controllers reading status mid PIO xfer causes
joyous things like silent corruption.

> (c) if condition "(b)" indicates busy, initiate status polling every 
> 250ms until timeout occurs or BSY clears.

Yep.

> (d) if N seconds (4?) elapses without an interrupt, initiate polling. 
> keep a history of such "fail-over" events, and note each fail-over'd 
> command's eventual success via polling, success via interrupt, or 
> timeout.  Use that history to decide to switch to 100% polling mode 
> (i.e. reach conclusion that interrupt delivery is broken, via observation)

N = 8 sounds good to me (7 being the normal maximum command timeout)

> That should cover no-interrupts, lost interrupts, early interrupts, 
> screaming interrupts, insane devices, and of course normal operation.

Should we also consider resetting the device as one of the strategies (at
least once off)

Might also want to think at that point about the case of 

	command
	....
	timeout

where old IDE checks with the controller to spot lost IRQ cases where a
command finished and stuff vanished. Old IDE doesn't do much with it but
we could use that as a good hint that we want to switch to polling mode
and tell the user their computer sucks.

Alan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-28 13:45 Flexible SFF interrupt handling Jeff Garzik
  2007-11-28 14:29 ` Alan Cox
@ 2007-11-28 14:33 ` Mark Lord
  2007-11-28 15:58   ` Jeff Garzik
  1 sibling, 1 reply; 9+ messages in thread
From: Mark Lord @ 2007-11-28 14:33 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: IDE/ATA development list

Jeff Garzik wrote:
> This has been bubbling on my brain for a while.  I blathered on about 
> this on IRC to Tejun, but figured I might as well post it here and get 
> it archived.
> 
> In general, I think we should adopt a flexible or "loose" model for 
> acking interrupts on SFF controllers.
> 
> (a) whenever we are in bus-idle (qc == NULL), and get an interrupt, go 
> ahead and read Status.
> 
> (b) if we are expecting an interrupt, and receive one, check Status (or 
> AltStatus if DMAing).
> 
> (c) if condition "(b)" indicates busy, initiate status polling every 
> 250ms until timeout occurs or BSY clears.
> 
> (d) if N seconds (4?) elapses without an interrupt, initiate polling. 
> keep a history of such "fail-over" events, and note each fail-over'd 
> command's eventual success via polling, success via interrupt, or 
> timeout.  Use that history to decide to switch to 100% polling mode 
> (i.e. reach conclusion that interrupt delivery is broken, via observation)
> 
> That should cover no-interrupts, lost interrupts, early interrupts, 
> screaming interrupts, insane devices, and of course normal operation.
> 
> The model could be summarized as "interrupt as a hint" operation.
..

The only question is, under which conditions do we return IRQ "handled=1",
and which times should we return 0 ?

Definitely when a real IRQ wakes us up and we see (qc != NULL && drive_ready),
essentially exactly as we currently do it.

But things might be trickier once polling is introduced, unless we also mask
the device interrupt before initiating the polling.

For example, we decide to begin polling according to rule (b) above,
and our poll routine happens to later find/service the IRQ before the
hardware IRQ handler runs.  Shortly after, the hardware IRQ handler does run,
but sees no active qc.  If it returns handled=1, it is lying.  If it returns
handled=0, we may eventually find our IRQ being disabled for "spurious" incidents.

Tricky, that.

-ml

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-28 14:33 ` Mark Lord
@ 2007-11-28 15:58   ` Jeff Garzik
  2007-11-28 16:48     ` Mark Lord
  2007-11-30  1:08     ` Tejun Heo
  0 siblings, 2 replies; 9+ messages in thread
From: Jeff Garzik @ 2007-11-28 15:58 UTC (permalink / raw)
  To: Mark Lord; +Cc: IDE/ATA development list

Mark Lord wrote:
> Jeff Garzik wrote:
>> This has been bubbling on my brain for a while.  I blathered on about 
>> this on IRC to Tejun, but figured I might as well post it here and get 
>> it archived.
>>
>> In general, I think we should adopt a flexible or "loose" model for 
>> acking interrupts on SFF controllers.
>>
>> (a) whenever we are in bus-idle (qc == NULL), and get an interrupt, go 
>> ahead and read Status.
>>
>> (b) if we are expecting an interrupt, and receive one, check Status 
>> (or AltStatus if DMAing).
>>
>> (c) if condition "(b)" indicates busy, initiate status polling every 
>> 250ms until timeout occurs or BSY clears.
>>
>> (d) if N seconds (4?) elapses without an interrupt, initiate polling. 
>> keep a history of such "fail-over" events, and note each fail-over'd 
>> command's eventual success via polling, success via interrupt, or 
>> timeout.  Use that history to decide to switch to 100% polling mode 
>> (i.e. reach conclusion that interrupt delivery is broken, via 
>> observation)
>>
>> That should cover no-interrupts, lost interrupts, early interrupts, 
>> screaming interrupts, insane devices, and of course normal operation.
>>
>> The model could be summarized as "interrupt as a hint" operation.
> ..
> 
> The only question is, under which conditions do we return IRQ "handled=1",
> and which times should we return 0 ?
> 
> Definitely when a real IRQ wakes us up and we see (qc != NULL && 
> drive_ready),
> essentially exactly as we currently do it.
> 
> But things might be trickier once polling is introduced, unless we also 
> mask
> the device interrupt before initiating the polling.

Actually no, and that is a key benefit of this scheme:  if we ensure 
that the polling paths are resilient even in case where interrupts are 
being delivered -- as we must do anyway -- then we don't have to worry 
about interrupt masking, either on the interrupt controller or on the 
device[1].

If we do get an interrupt, ack it ASAP.  That covers normal operation 
and screaming interrupts.
If we don't get an interrupt, we will notice after a spell and poll 
Status to ensure progress occurs.

Note that this polling is a different sort of polling than running an 
entire ATA command via a kernel thread.  In this case, we're talking 
about periodic Status (or AltStatus or LLD-specific-register status) 
polling only.

A lot of fiddling with irq masking is getting around ugliness that I am 
instead trying to eliminate altogether.  A truly robust system follows 
the spec WRT nIEN and other interrupt masking.....  but then prepares 
for the case where hw decides to send an interrupt anyway.

On SFF controllers, we should consider interrupts to be unreliable 
messages delivered on a best effort basis by hardware.  If we get them, 
great, ack and act.  If we lack them, make sure progress occurs.

Regards,

	Jeff

[1] well, there -are- exceptions, such as when we are bitbanging the ATA 
Data register

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-28 14:29 ` Alan Cox
@ 2007-11-28 16:09   ` Jeff Garzik
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff Garzik @ 2007-11-28 16:09 UTC (permalink / raw)
  To: Alan Cox; +Cc: IDE/ATA development list

Alan Cox wrote:
>> In general, I think we should adopt a flexible or "loose" model for 
>> acking interrupts on SFF controllers.
> 
> Agreed - especially as the IRQ is often essentially the drive output not
> under any kind of sane control of ours.

Good point (I had not thought of looking at it that way).


>> (a) whenever we are in bus-idle (qc == NULL), and get an interrupt, go 
>> ahead and read Status.
> 
> Please call into the driver. Quite a few PATA drivers have multiple IRQ
> sources, and SATA many. 

Done :)  This should simply be a new behavior coded into the existing 
interrupt handlers.

Thus you can choose per-driver whether to do this or not.


>> (b) if we are expecting an interrupt, and receive one, check Status (or 
>> AltStatus if DMAing).
> 
> Providing we are not mid data transfer (which is why we need to get into
> enable/disable_irq for some controllers). Right now its a problem that
> can't occur but on some controllers reading status mid PIO xfer causes
> joyous things like silent corruption.

True..


>> (c) if condition "(b)" indicates busy, initiate status polling every 
>> 250ms until timeout occurs or BSY clears.
> 
> Yep.
> 
>> (d) if N seconds (4?) elapses without an interrupt, initiate polling. 
>> keep a history of such "fail-over" events, and note each fail-over'd 
>> command's eventual success via polling, success via interrupt, or 
>> timeout.  Use that history to decide to switch to 100% polling mode 
>> (i.e. reach conclusion that interrupt delivery is broken, via observation)
> 
> N = 8 sounds good to me (7 being the normal maximum command timeout)
> 
>> That should cover no-interrupts, lost interrupts, early interrupts, 
>> screaming interrupts, insane devices, and of course normal operation.
> 
> Should we also consider resetting the device as one of the strategies (at
> least once off)
> 
> Might also want to think at that point about the case of 
> 
> 	command
> 	....
> 	timeout
> 
> where old IDE checks with the controller to spot lost IRQ cases where a
> command finished and stuff vanished. Old IDE doesn't do much with it but
> we could use that as a good hint that we want to switch to polling mode
> and tell the user their computer sucks.

That's basically where I wanted to go with "(d)".  Being able to both 
handle interrupts _and_ fall back to polling makes it easy to notice 
when interrupts are getting lost.  If more than a couple rescues of this 
nature occur, do as you describe.

	Jeff



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-28 15:58   ` Jeff Garzik
@ 2007-11-28 16:48     ` Mark Lord
  2007-11-28 17:00       ` Mark Lord
  2007-11-30  1:08     ` Tejun Heo
  1 sibling, 1 reply; 9+ messages in thread
From: Mark Lord @ 2007-11-28 16:48 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: IDE/ATA development list

Jeff Garzik wrote:
> Mark Lord wrote:
>> Jeff Garzik wrote:
>>> This has been bubbling on my brain for a while.  I blathered on about 
>>> this on IRC to Tejun, but figured I might as well post it here and 
>>> get it archived.
>>>
>>> In general, I think we should adopt a flexible or "loose" model for 
>>> acking interrupts on SFF controllers.
>>>
>>> (a) whenever we are in bus-idle (qc == NULL), and get an interrupt, 
>>> go ahead and read Status.
>>>
>>> (b) if we are expecting an interrupt, and receive one, check Status 
>>> (or AltStatus if DMAing).
>>>
>>> (c) if condition "(b)" indicates busy, initiate status polling every 
>>> 250ms until timeout occurs or BSY clears.
>>>
>>> (d) if N seconds (4?) elapses without an interrupt, initiate polling. 
>>> keep a history of such "fail-over" events, and note each fail-over'd 
>>> command's eventual success via polling, success via interrupt, or 
>>> timeout.  Use that history to decide to switch to 100% polling mode 
>>> (i.e. reach conclusion that interrupt delivery is broken, via 
>>> observation)
>>>
>>> That should cover no-interrupts, lost interrupts, early interrupts, 
>>> screaming interrupts, insane devices, and of course normal operation.
>>>
>>> The model could be summarized as "interrupt as a hint" operation.
>> ..
>>
>> The only question is, under which conditions do we return IRQ 
>> "handled=1",
>> and which times should we return 0 ?
>>
>> Definitely when a real IRQ wakes us up and we see (qc != NULL && 
>> drive_ready),
>> essentially exactly as we currently do it.
>>
>> But things might be trickier once polling is introduced, unless we 
>> also mask
>> the device interrupt before initiating the polling.
> 
> Actually no, and that is a key benefit of this scheme:  if we ensure 
> that the polling paths are resilient even in case where interrupts are 
> being delivered -- as we must do anyway -- then we don't have to worry 
> about interrupt masking, either on the interrupt controller or on the 
> device[1].
> 
> If we do get an interrupt, ack it ASAP.  That covers normal operation 
> and screaming interrupts.
..

I was considering a shared IRQ environment, where the screamer
might be a different device on the same IRQ..


> If we don't get an interrupt, we will notice after a spell and poll 
> Status to ensure progress occurs.
> 
> Note that this polling is a different sort of polling than running an 
> entire ATA command via a kernel thread.  In this case, we're talking 
> about periodic Status (or AltStatus or LLD-specific-register status) 
> polling only.
> 
> A lot of fiddling with irq masking is getting around ugliness that I am 
> instead trying to eliminate altogether.  A truly robust system follows 
> the spec WRT nIEN and other interrupt masking.....  but then prepares 
> for the case where hw decides to send an interrupt anyway.
> 
> On SFF controllers, we should consider interrupts to be unreliable 
> messages delivered on a best effort basis by hardware.  If we get them, 
> great, ack and act.  If we lack them, make sure progress occurs.
> 
> Regards,
> 
>     Jeff
> 
> 
> [1] well, there -are- exceptions, such as when we are bitbanging the ATA 
> Data register


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-28 16:48     ` Mark Lord
@ 2007-11-28 17:00       ` Mark Lord
  0 siblings, 0 replies; 9+ messages in thread
From: Mark Lord @ 2007-11-28 17:00 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: IDE/ATA development list

Mark Lord wrote:
> Jeff Garzik wrote:
>> Mark Lord wrote:
>>> Jeff Garzik wrote:
...
>>> The only question is, under which conditions do we return IRQ 
>>> "handled=1",
>>> and which times should we return 0 ?
>>>
>>> Definitely when a real IRQ wakes us up and we see (qc != NULL && 
>>> drive_ready),
>>> essentially exactly as we currently do it.
>>>
>>> But things might be trickier once polling is introduced, unless we 
>>> also mask
>>> the device interrupt before initiating the polling.
>>
>> Actually no, and that is a key benefit of this scheme:  if we ensure 
>> that the polling paths are resilient even in case where interrupts are 
>> being delivered -- as we must do anyway -- then we don't have to worry 
>> about interrupt masking, either on the interrupt controller or on the 
>> device[1].
>>
>> If we do get an interrupt, ack it ASAP.  That covers normal operation 
>> and screaming interrupts.
> ..
> 
> I was considering a shared IRQ environment, where the screamer
> might be a different device on the same IRQ..
...

To clarify:  If we are sharing an IRQ line with other device(s),
then it is not good to always return "handled=1" from our IRQ handler
in cases where the interrupt could not be confirmed to be ours.

Because if we did, then the babbling-IRQ detection ("too many spurious")
elsewhere in the kernel will be unable to function for this particular IRQ line.

If we really have no other good option, then fine --> sata_qstor does this.
And I suppose that's the intent --> only fall back to this kind of stuff
when we are dealing with hardware that is known to be unreliable w.r.t. IRQs.

Cheers

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-28 15:58   ` Jeff Garzik
  2007-11-28 16:48     ` Mark Lord
@ 2007-11-30  1:08     ` Tejun Heo
  2007-11-30  1:11       ` Tejun Heo
  1 sibling, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2007-11-30  1:08 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Mark Lord, IDE/ATA development list

Jeff Garzik wrote:
> Actually no, and that is a key benefit of this scheme:  if we ensure
> that the polling paths are resilient even in case where interrupts are
> being delivered -- as we must do anyway -- then we don't have to worry
> about interrupt masking, either on the interrupt controller or on the
> device[1].
> 
> [1] well, there -are- exceptions, such as when we are bitbanging the ATA
> Data register

Yeah, that one exception is exactly why we need to plug the IRQ during
polling PIO or to push PIO data xfer to a workqueue and doing so by nIEN
is unreliable on several controllers and unreliable by nature in many
early SFF SATA controllers when paired with certain SATA devices as
noted in appendix C of SATA 2.5 spec.

And when reading off Status when idle, we'll need to return 0 from
irq_handler.  That will prevent IRQ screaming IRQ lock and nobody cared
detection code is exactly there to watch how often such idle IRQs occur
and disable it if necessary.

Maybe we can make irqpoll on-demand such that nobody cared turns it and
driver can also request polling on the IRQ.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Flexible SFF interrupt handling
  2007-11-30  1:08     ` Tejun Heo
@ 2007-11-30  1:11       ` Tejun Heo
  0 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2007-11-30  1:11 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Mark Lord, IDE/ATA development list

Tejun Heo wrote:
> And when reading off Status when idle, we'll need to return 0 from
> irq_handler.  That will prevent IRQ screaming IRQ lock and nobody cared

That sentence was "That will prevent IRQ screaming from causing IRQ live
lock..." in my mind.  Some packets were lost while transmitting it via
my neck, arms and finally through my fingertips and my fingers foolishly
didn't ask for retransmission.  :-P

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-11-30  1:11 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-28 13:45 Flexible SFF interrupt handling Jeff Garzik
2007-11-28 14:29 ` Alan Cox
2007-11-28 16:09   ` Jeff Garzik
2007-11-28 14:33 ` Mark Lord
2007-11-28 15:58   ` Jeff Garzik
2007-11-28 16:48     ` Mark Lord
2007-11-28 17:00       ` Mark Lord
2007-11-30  1:08     ` Tejun Heo
2007-11-30  1:11       ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).