public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* BUG: [2.4.18+] IDE Race Condition
@ 2003-01-27 17:11 Ross Biro
  2003-01-27 17:34 ` Ross Biro
  2003-01-30 17:34 ` Alan Cox
  0 siblings, 2 replies; 10+ messages in thread
From: Ross Biro @ 2003-01-27 17:11 UTC (permalink / raw)
  To: linux-kernel


There is at least one more IDE race condition in 2.4.18 and 2.4.21-pre3. 
 Basically the interrupt for the controller being serviced is left on 
while setting up the next command.  I'm not sure how much trouble it can 
cause but it does lead to some interesting stack traces.

The condition
if (masked_irq && hwif->irq != masked_irq)
in ide_do_request should be replaced with
if (!masked_irq || hwif->irq != masked_irq)
in two places.

This doesn't totally eliminate the race conditions, but it does minimize 
them some more.  I can still see a race in 2.4.18.  I'll say more about 
it once I've tracked it down.

    Ross



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-27 17:11 BUG: [2.4.18+] IDE Race Condition Ross Biro
@ 2003-01-27 17:34 ` Ross Biro
  2003-01-28  2:46   ` Andre Hedrick
  2003-01-30 17:34 ` Alan Cox
  1 sibling, 1 reply; 10+ messages in thread
From: Ross Biro @ 2003-01-27 17:34 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-kernel

The net effect of this race condition and the other one I spotted is 
that you may see some interesting messages in your log file and you can 
detect the race condition if you look for it hard enough.  I don't 
currently see any bad effects.

    Ross

Ross Biro wrote:

>
> There is at least one more IDE race condition in 2.4.18 and 
> 2.4.21-pre3. Basically the interrupt for the controller being serviced 
> is left on while setting up the next command.  I'm not sure how much 
> trouble it can cause but it does lead to some interesting stack traces.
>
> The condition
> if (masked_irq && hwif->irq != masked_irq)
> in ide_do_request should be replaced with
> if (!masked_irq || hwif->irq != masked_irq)
> in two places.
>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-27 17:34 ` Ross Biro
@ 2003-01-28  2:46   ` Andre Hedrick
  2003-01-28 16:48     ` Ross Biro
  0 siblings, 1 reply; 10+ messages in thread
From: Andre Hedrick @ 2003-01-28  2:46 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-kernel


Okay, how do you reproduce it to see the effects?

On Mon, 27 Jan 2003, Ross Biro wrote:

> The net effect of this race condition and the other one I spotted is 
> that you may see some interesting messages in your log file and you can 
> detect the race condition if you look for it hard enough.  I don't 
> currently see any bad effects.
> 
>     Ross
> 
> Ross Biro wrote:
> 
> >
> > There is at least one more IDE race condition in 2.4.18 and 
> > 2.4.21-pre3. Basically the interrupt for the controller being serviced 
> > is left on while setting up the next command.  I'm not sure how much 
> > trouble it can cause but it does lead to some interesting stack traces.
> >
> > The condition
> > if (masked_irq && hwif->irq != masked_irq)
> > in ide_do_request should be replaced with
> > if (!masked_irq || hwif->irq != masked_irq)
> > in two places.
> >
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Andre Hedrick
LAD Storage Consulting Group


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-28  2:46   ` Andre Hedrick
@ 2003-01-28 16:48     ` Ross Biro
  2003-01-28 17:28       ` Andre Hedrick
  0 siblings, 1 reply; 10+ messages in thread
From: Ross Biro @ 2003-01-28 16:48 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: linux-kernel


Easy, it happens all the time, there are just no tests in place to see it.

We are keeping a histogram of how long every IDE DMA transfer takes 
place.  In ide_intr we record the time and set the start time in 
ide_drive_t to 0.  In ide_dma_proc, ide_dma_begin right AFTER activating 
the dma, we store the current value of jiffies in start time in ide_drive_t.

In both those places we check to make sure that the value of start_time 
is sensible.  In ide_dma_begin, we make sure it's 0 and in ide_dma_intr, 
we make sure its non-zero.  Because of this race condition, we often saw 
DMAs finish before they began.

In the normal kernel, the only thing I can see that could go wrong would 
be that the printk

printk("%s: ide_intr: huh? expected NULL handler on exit\n", drive->name);

in ide_intr  could be triggered.  I've never seen it happen, but I 
believe with enough effort, it could be made to occur.

    Ross


Andre Hedrick wrote:

>Okay, how do you reproduce it to see the effects?
>
>On Mon, 27 Jan 2003, Ross Biro wrote:
>
>  
>
>>The net effect of this race condition and the other one I spotted is 
>>that you may see some interesting messages in your log file and you can 
>>detect the race condition if you look for it hard enough.  I don't 
>>currently see any bad effects.
>>
>>    Ross
>>
>>Ross Biro wrote:
>>
>>    
>>
>>>There is at least one more IDE race condition in 2.4.18 and 
>>>2.4.21-pre3. Basically the interrupt for the controller being serviced 
>>>is left on while setting up the next command.  I'm not sure how much 
>>>trouble it can cause but it does lead to some interesting stack traces.
>>>
>>>The condition
>>>if (masked_irq && hwif->irq != masked_irq)
>>>in ide_do_request should be replaced with
>>>if (!masked_irq || hwif->irq != masked_irq)
>>>in two places.
>>>
>>>      
>>>
>>-
>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>the body of a message to majordomo@vger.kernel.org
>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>Please read the FAQ at  http://www.tux.org/lkml/
>>
>>    
>>
>
>Andre Hedrick
>LAD Storage Consulting Group
>
>  
>




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-28 16:48     ` Ross Biro
@ 2003-01-28 17:28       ` Andre Hedrick
  2003-01-28 17:42         ` Ross Biro
  0 siblings, 1 reply; 10+ messages in thread
From: Andre Hedrick @ 2003-01-28 17:28 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-kernel


Ross,

How did you create the sub-microsecond timers to profile the device/driver
behavior?  I had been working on this for a while but little success.
This is one of the key methods to predict device failure.

One of the goals of the prebuilder it find slam prebuild commands down the
pipes to force breakages and races to show up.

So you have a possible solution?

Cheers,

Andre Hedrick
LAD Storage Consulting Group


On Tue, 28 Jan 2003, Ross Biro wrote:

> 
> Easy, it happens all the time, there are just no tests in place to see it.
> 
> We are keeping a histogram of how long every IDE DMA transfer takes 
> place.  In ide_intr we record the time and set the start time in 
> ide_drive_t to 0.  In ide_dma_proc, ide_dma_begin right AFTER activating 
> the dma, we store the current value of jiffies in start time in ide_drive_t.
> 
> In both those places we check to make sure that the value of start_time 
> is sensible.  In ide_dma_begin, we make sure it's 0 and in ide_dma_intr, 
> we make sure its non-zero.  Because of this race condition, we often saw 
> DMAs finish before they began.
> 
> In the normal kernel, the only thing I can see that could go wrong would 
> be that the printk
> 
> printk("%s: ide_intr: huh? expected NULL handler on exit\n", drive->name);
> 
> in ide_intr  could be triggered.  I've never seen it happen, but I 
> believe with enough effort, it could be made to occur.
> 
>     Ross
> 
> 
> Andre Hedrick wrote:
> 
> >Okay, how do you reproduce it to see the effects?
> >
> >On Mon, 27 Jan 2003, Ross Biro wrote:
> >
> >  
> >
> >>The net effect of this race condition and the other one I spotted is 
> >>that you may see some interesting messages in your log file and you can 
> >>detect the race condition if you look for it hard enough.  I don't 
> >>currently see any bad effects.
> >>
> >>    Ross
> >>
> >>Ross Biro wrote:
> >>
> >>    
> >>
> >>>There is at least one more IDE race condition in 2.4.18 and 
> >>>2.4.21-pre3. Basically the interrupt for the controller being serviced 
> >>>is left on while setting up the next command.  I'm not sure how much 
> >>>trouble it can cause but it does lead to some interesting stack traces.
> >>>
> >>>The condition
> >>>if (masked_irq && hwif->irq != masked_irq)
> >>>in ide_do_request should be replaced with
> >>>if (!masked_irq || hwif->irq != masked_irq)
> >>>in two places.
> >>>
> >>>      
> >>>
> >>-
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to majordomo@vger.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at  http://www.tux.org/lkml/
> >>
> >>    
> >>
> >
> >Andre Hedrick
> >LAD Storage Consulting Group
> >
> >  
> >


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-28 17:28       ` Andre Hedrick
@ 2003-01-28 17:42         ` Ross Biro
  2003-01-28 18:01           ` Andre Hedrick
  0 siblings, 1 reply; 10+ messages in thread
From: Ross Biro @ 2003-01-28 17:42 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: linux-kernel


Sorry, we are not using sub-microsecond timers, we are just using the 
latest-greatest-bugiest hard drives and lots of them.  We can see 
problems like this because we can run on so many machines at once.  Even 
something that has a small percentage chance of showing up, we can see 
pretty quick.

We do have a bus analyzer that can time things down to about 4ns.  That 
has been invaluable for tracking down some of these issues.

    Ross

Andre Hedrick wrote:

>Ross,
>
>How did you create the sub-microsecond timers to profile the device/driver
>behavior?  I had been working on this for a while but little success.
>This is one of the key methods to predict device failure.
>
>One of the goals of the prebuilder it find slam prebuild commands down the
>pipes to force breakages and races to show up.
>
>So you have a possible solution?
>
>Cheers,
>
>Andre Hedrick
>LAD Storage Consulting Group
>
>
>On Tue, 28 Jan 2003, Ross Biro wrote:
>
>  
>
>>Easy, it happens all the time, there are just no tests in place to see it.
>>
>>We are keeping a histogram of how long every IDE DMA transfer takes 
>>place.  In ide_intr we record the time and set the start time in 
>>ide_drive_t to 0.  In ide_dma_proc, ide_dma_begin right AFTER activating 
>>the dma, we store the current value of jiffies in start time in ide_drive_t.
>>
>>In both those places we check to make sure that the value of start_time 
>>is sensible.  In ide_dma_begin, we make sure it's 0 and in ide_dma_intr, 
>>we make sure its non-zero.  Because of this race condition, we often saw 
>>DMAs finish before they began.
>>
>>In the normal kernel, the only thing I can see that could go wrong would 
>>be that the printk
>>
>>printk("%s: ide_intr: huh? expected NULL handler on exit\n", drive->name);
>>
>>in ide_intr  could be triggered.  I've never seen it happen, but I 
>>believe with enough effort, it could be made to occur.
>>
>>    Ross
>>
>>
>>Andre Hedrick wrote:
>>
>>    
>>
>>>Okay, how do you reproduce it to see the effects?
>>>
>>>On Mon, 27 Jan 2003, Ross Biro wrote:
>>>
>>> 
>>>
>>>      
>>>
>>>>The net effect of this race condition and the other one I spotted is 
>>>>that you may see some interesting messages in your log file and you can 
>>>>detect the race condition if you look for it hard enough.  I don't 
>>>>currently see any bad effects.
>>>>
>>>>   Ross
>>>>
>>>>Ross Biro wrote:
>>>>
>>>>   
>>>>
>>>>        
>>>>
>>>>>There is at least one more IDE race condition in 2.4.18 and 
>>>>>2.4.21-pre3. Basically the interrupt for the controller being serviced 
>>>>>is left on while setting up the next command.  I'm not sure how much 
>>>>>trouble it can cause but it does lead to some interesting stack traces.
>>>>>
>>>>>The condition
>>>>>if (masked_irq && hwif->irq != masked_irq)
>>>>>in ide_do_request should be replaced with
>>>>>if (!masked_irq || hwif->irq != masked_irq)
>>>>>in two places.
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>-
>>>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>>>the body of a message to majordomo@vger.kernel.org
>>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>Please read the FAQ at  http://www.tux.org/lkml/
>>>>
>>>>   
>>>>
>>>>        
>>>>
>>>Andre Hedrick
>>>LAD Storage Consulting Group
>>>
>>> 
>>>
>>>      
>>>
>
>  
>




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-28 17:42         ` Ross Biro
@ 2003-01-28 18:01           ` Andre Hedrick
  0 siblings, 0 replies; 10+ messages in thread
From: Andre Hedrick @ 2003-01-28 18:01 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-kernel



Well if someone would drop a $16,000->$27,000 USD bus analyzer with
switchable PODS, I could do wonders!  Well I am not sure how to take this
on the chin.  Either "nice try, but you didn't get there" or "nice try, it
is not possible to get there without more tools"; regardless, it needs to
be fixed.  So if you have an answer and a trace to support it ... bring it
on!

Cheers,

Andre Hedrick
LAD Storage Consulting Group


On Tue, 28 Jan 2003, Ross Biro wrote:

> 
> Sorry, we are not using sub-microsecond timers, we are just using the 
> latest-greatest-bugiest hard drives and lots of them.  We can see 
> problems like this because we can run on so many machines at once.  Even 
> something that has a small percentage chance of showing up, we can see 
> pretty quick.
> 
> We do have a bus analyzer that can time things down to about 4ns.  That 
> has been invaluable for tracking down some of these issues.
> 
>     Ross
> 
> Andre Hedrick wrote:
> 
> >Ross,
> >
> >How did you create the sub-microsecond timers to profile the device/driver
> >behavior?  I had been working on this for a while but little success.
> >This is one of the key methods to predict device failure.
> >
> >One of the goals of the prebuilder it find slam prebuild commands down the
> >pipes to force breakages and races to show up.
> >
> >So you have a possible solution?
> >
> >Cheers,
> >
> >Andre Hedrick
> >LAD Storage Consulting Group
> >
> >
> >On Tue, 28 Jan 2003, Ross Biro wrote:
> >
> >  
> >
> >>Easy, it happens all the time, there are just no tests in place to see it.
> >>
> >>We are keeping a histogram of how long every IDE DMA transfer takes 
> >>place.  In ide_intr we record the time and set the start time in 
> >>ide_drive_t to 0.  In ide_dma_proc, ide_dma_begin right AFTER activating 
> >>the dma, we store the current value of jiffies in start time in ide_drive_t.
> >>
> >>In both those places we check to make sure that the value of start_time 
> >>is sensible.  In ide_dma_begin, we make sure it's 0 and in ide_dma_intr, 
> >>we make sure its non-zero.  Because of this race condition, we often saw 
> >>DMAs finish before they began.
> >>
> >>In the normal kernel, the only thing I can see that could go wrong would 
> >>be that the printk
> >>
> >>printk("%s: ide_intr: huh? expected NULL handler on exit\n", drive->name);
> >>
> >>in ide_intr  could be triggered.  I've never seen it happen, but I 
> >>believe with enough effort, it could be made to occur.
> >>
> >>    Ross
> >>
> >>
> >>Andre Hedrick wrote:
> >>
> >>    
> >>
> >>>Okay, how do you reproduce it to see the effects?
> >>>
> >>>On Mon, 27 Jan 2003, Ross Biro wrote:
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>>The net effect of this race condition and the other one I spotted is 
> >>>>that you may see some interesting messages in your log file and you can 
> >>>>detect the race condition if you look for it hard enough.  I don't 
> >>>>currently see any bad effects.
> >>>>
> >>>>   Ross
> >>>>
> >>>>Ross Biro wrote:
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>There is at least one more IDE race condition in 2.4.18 and 
> >>>>>2.4.21-pre3. Basically the interrupt for the controller being serviced 
> >>>>>is left on while setting up the next command.  I'm not sure how much 
> >>>>>trouble it can cause but it does lead to some interesting stack traces.
> >>>>>
> >>>>>The condition
> >>>>>if (masked_irq && hwif->irq != masked_irq)
> >>>>>in ide_do_request should be replaced with
> >>>>>if (!masked_irq || hwif->irq != masked_irq)
> >>>>>in two places.
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>-
> >>>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>>>the body of a message to majordomo@vger.kernel.org
> >>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>Please read the FAQ at  http://www.tux.org/lkml/
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>Andre Hedrick
> >>>LAD Storage Consulting Group
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >
> >  
> >
> 
> 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-30 17:34 ` Alan Cox
@ 2003-01-30 16:58   ` Ross Biro
  2003-01-30 18:01     ` Alan Cox
  0 siblings, 1 reply; 10+ messages in thread
From: Ross Biro @ 2003-01-30 16:58 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List

Alan Cox wrote:

>>st should be replaced with
>>if (!masked_irq || hwif->irq != masked_irq)
>>in two places.
>>    
>>
>
>Which would also shorten to if (hwif->irq != masked_irq) it seems
>
>  
>
With the assumption that hwif->irq != 0 which is implicit now.  Perhaps 
we should make the assumption explicit,

if (unlikely(hwif->irq == 0)) {
    BUG();
}
if (hwif->irq != masked_irq) ....

    Ross




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-27 17:11 BUG: [2.4.18+] IDE Race Condition Ross Biro
  2003-01-27 17:34 ` Ross Biro
@ 2003-01-30 17:34 ` Alan Cox
  2003-01-30 16:58   ` Ross Biro
  1 sibling, 1 reply; 10+ messages in thread
From: Alan Cox @ 2003-01-30 17:34 UTC (permalink / raw)
  To: Ross Biro; +Cc: Linux Kernel Mailing List

On Mon, 2003-01-27 at 17:11, Ross Biro wrote:
> The condition
> if (masked_irq && hwif->irq != masked_irq)
> in ide_do_request should be replaced with
> if (!masked_irq || hwif->irq != masked_irq)
> in two places.

Which would also shorten to if (hwif->irq != masked_irq) it seems


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: BUG: [2.4.18+] IDE Race Condition
  2003-01-30 16:58   ` Ross Biro
@ 2003-01-30 18:01     ` Alan Cox
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Cox @ 2003-01-30 18:01 UTC (permalink / raw)
  To: Ross Biro; +Cc: Linux Kernel Mailing List

On Thu, 2003-01-30 at 16:58, Ross Biro wrote:
> With the assumption that hwif->irq != 0 which is implicit now.  Perhaps 
> we should make the assumption explicit,
> 
> if (unlikely(hwif->irq == 0)) {
>     BUG();
> }
> if (hwif->irq != masked_irq) ....
> 

Actually I fixed the IRQ 0 assumption in that code while I was at it 8)


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2003-01-30 16:57 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-01-27 17:11 BUG: [2.4.18+] IDE Race Condition Ross Biro
2003-01-27 17:34 ` Ross Biro
2003-01-28  2:46   ` Andre Hedrick
2003-01-28 16:48     ` Ross Biro
2003-01-28 17:28       ` Andre Hedrick
2003-01-28 17:42         ` Ross Biro
2003-01-28 18:01           ` Andre Hedrick
2003-01-30 17:34 ` Alan Cox
2003-01-30 16:58   ` Ross Biro
2003-01-30 18:01     ` Alan Cox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox