[RFT] hpt366: reset DMA state machine on timeouts


* [RFT] hpt366: reset DMA state machine on timeouts
@ 2007-06-21 17:54 Sergei Shtylyov
  2007-06-21 19:31 ` Linas Vepstas
  2007-06-22 15:13 ` Linas Vepstas
  0 siblings, 2 replies; 11+ messages in thread
From: Sergei Shtylyov @ 2007-06-21 17:54 UTC (permalink / raw)
  To: linas; +Cc: linux-ide

Reset HPT36x's DMA state machine on a DMA timeout the way it's done for HPT370.

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>

---
Linas, here's what I've come up with -- this should apply against 2.6.21.y.
Compile-tested only, not for merging.

 drivers/ide/pci/hpt366.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/ide/pci/hpt366.c
===================================================================
--- linux-2.6.orig/drivers/ide/pci/hpt366.c
+++ linux-2.6/drivers/ide/pci/hpt366.c
@@ -752,6 +752,26 @@ static int hpt366_config_drive_xfer_rate
  * This is specific to the HPT366 UDMA chipset
  * by HighPoint|Triones Technologies, Inc.
  */
+static int hpt366_ide_dma_timeout(ide_drive_t *drive)
+{
+	ide_hwif_t *hwif	= HWIF(drive);
+	struct pci_dev *dev	= hwif->pci_dev;
+	u8 mcr1 = 0, mcr2	= 0;
+	u8 dma_cmd		= inb(hwif->dma_command);
+
+	/* Stop DMA */
+	outb(dma_cmd & ~0x01, hwif->dma_command);
+
+	/* Clear bus master state machine and S/G counter */
+	pci_read_config_byte (dev, 0x51, &mcr2);
+	pci_write_config_byte(dev, 0x51, (mcr2 | 0x03));
+	/* Clear buffers 0/1 */
+	pci_read_config_byte (dev, 0x50, &mcr1);
+	pci_write_config_byte(dev, 0x50, (mcr1 | 0x0c));
+
+	return __ide_dma_timeout(drive);
+}
+
 static int hpt366_ide_dma_lostirq(ide_drive_t *drive)
 {
 	struct pci_dev *dev = HWIF(drive)->pci_dev;
@@ -1368,8 +1388,10 @@ static void __devinit init_hwif_hpt366(i
 		hwif->dma_start 	= &hpt370_ide_dma_start;
 		hwif->ide_dma_end	= &hpt370_ide_dma_end;
 		hwif->ide_dma_timeout	= &hpt370_ide_dma_timeout;
-	} else
+	} else {
+		hwif->ide_dma_timeout	= &hpt366_ide_dma_timeout;
 		hwif->ide_dma_lostirq	= &hpt366_ide_dma_lostirq;
+	}
 
 	if (!noautodma)
 		hwif->autodma = 1;


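To follow the config-space writes in the new hpt366_ide_dma_timeout() without the hardware at hand, here is a minimal user-space sketch. The `fake_cfg` array and the `cfg_read()`/`cfg_write()` helpers are hypothetical stand-ins for PCI config space and pci_read_config_byte()/pci_write_config_byte(); only the offsets (0x50, 0x51) and the masks (0x0c, 0x03) are taken from the patch. It says nothing about how the real chip reacts to those bits.

```c
#include <stdint.h>

/* Hypothetical stand-in for the chip's 256-byte PCI config space. */
static uint8_t fake_cfg[256];

static void cfg_read(int where, uint8_t *val) { *val = fake_cfg[where]; }
static void cfg_write(int where, uint8_t val) { fake_cfg[where] = val; }

/* Mirrors the order of the config writes in the patch. */
static void hpt366_reset_sketch(void)
{
	uint8_t mcr1 = 0, mcr2 = 0;

	/* Offset 0x51, mask 0x03: bus master state machine and S/G counter */
	cfg_read(0x51, &mcr2);
	cfg_write(0x51, mcr2 | 0x03);

	/* Offset 0x50, mask 0x0c: buffers 0/1 */
	cfg_read(0x50, &mcr1);
	cfg_write(0x50, mcr1 | 0x0c);
}
```

Because each write is a read-modify-write, the other bits in both registers keep whatever value they had before the timeout.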

* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-21 17:54 [RFT] hpt366: reset DMA state machine on timeouts Sergei Shtylyov
@ 2007-06-21 19:31 ` Linas Vepstas
  2007-06-22 15:13 ` Linas Vepstas
  1 sibling, 0 replies; 11+ messages in thread
From: Linas Vepstas @ 2007-06-21 19:31 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linux-ide

On Thu, Jun 21, 2007 at 09:54:47PM +0400, Sergei Shtylyov wrote:
> ---
> Linas, here's what I've come up with -- this should apply against 2.6.21.y.
> Compile-tested only, not for merging.

Thanks,

I'll test tonight. Meanwhile, under separate cover, I'll post
the libata debug info.

--linas



* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-21 17:54 [RFT] hpt366: reset DMA state machine on timeouts Sergei Shtylyov
  2007-06-21 19:31 ` Linas Vepstas
@ 2007-06-22 15:13 ` Linas Vepstas
  2007-06-22 15:32   ` Sergei Shtylyov
  2007-06-22 15:54   ` Alan Cox
  1 sibling, 2 replies; 11+ messages in thread
From: Linas Vepstas @ 2007-06-22 15:13 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linux-ide

On Thu, Jun 21, 2007 at 09:54:47PM +0400, Sergei Shtylyov wrote:
> Reset HPT36x's DMA state machine on a DMA timeout the way it's done for HPT370.
> 
> Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
> 
> ---
> Linas, here's what I've come up with -- this should apply against 2.6.21.y.
> Compile-tested only, not for merging.
> 
>  drivers/ide/pci/hpt366.c |   24 +++++++++++++++++++++++-

This worked great!  The patch is good. But it raises another interesting
issue, one of those akpm ZFS "violates boundaries" issues.

However.. When raid goes to reconstruct the partition, I get one
of the Drive Ready Seek Complete etc. messages.  Your handler recovers 
from it (I put in a printk to verify this). And so these printk's
try to get logged into /var/log/messages ... which trigger more 
errors. At a very high rate ... sometimes hundreds a second, sometimes
less.  The system remains usable, but at one point, it hit 60% cpu usage
spewing these messages to the screen.  

I'd like to see several things.

1) This patch should go in.  It converts a system that hangs into
   one that doesn't hang.

2) There needs to be a way of failing the disk when there's a high
   number of errors. e.g. if there are more than 100 errors per minute
   then the disk needs to be marked "failed" in the raid array.

   Note it should be stopped only if the rate is high: if there is 
   only 1 error per minute, this might be very annoying, but acceptable,
   esp. if one is just trying to copy data off the disk.

   I'm not sure what to do if this had been the only disk in the system.
   Maybe if the error rate exceeds 100/minute, then dma is turned off 
   permanently?

--linas
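
The "fail only when the rate is high" idea in point 2 could be sketched roughly as below. This is an illustration, not existing IDE or md code: `struct err_window`, `err_window_hit()` and both constants are hypothetical names, the 100-per-minute threshold is simply the figure suggested above, and `now` is assumed to be a non-decreasing timestamp in seconds.

```c
#include <stdbool.h>
#include <stddef.h>

#define ERR_WINDOW_SECS   60   /* window length: one minute */
#define ERR_WINDOW_LIMIT  100  /* threshold suggested in the message */

/* Hypothetical per-drive tracker: one counter per second of the window. */
struct err_window {
	unsigned int  bucket[ERR_WINDOW_SECS];
	unsigned long last_sec;   /* second in which the last error was seen */
	bool          started;
};

/*
 * Record one error at time `now` (in seconds). Returns true when more than
 * ERR_WINDOW_LIMIT errors fall within the last ERR_WINDOW_SECS seconds.
 */
static bool err_window_hit(struct err_window *w, unsigned long now)
{
	unsigned int total = 0;
	unsigned long s;
	size_t i;

	if (!w->started || now - w->last_sec >= ERR_WINDOW_SECS) {
		/* First error, or the whole window has expired: start over. */
		for (i = 0; i < ERR_WINDOW_SECS; i++)
			w->bucket[i] = 0;
		w->started = true;
	} else {
		/* Expire the buckets for the seconds that have since passed. */
		for (s = w->last_sec + 1; s <= now; s++)
			w->bucket[s % ERR_WINDOW_SECS] = 0;
	}
	w->last_sec = now;
	w->bucket[now % ERR_WINDOW_SECS]++;

	for (i = 0; i < ERR_WINDOW_SECS; i++)
		total += w->bucket[i];
	return total > ERR_WINDOW_LIMIT;
}
```

With this shape, a drive that produces one error per minute never trips the limit, while a burst of more than 100 errors inside one minute does. What the caller does on a hit (fail the array member, change the transfer mode) is a separate policy question.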


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-22 15:13 ` Linas Vepstas
@ 2007-06-22 15:32   ` Sergei Shtylyov
  2007-06-22 16:36     ` Linas Vepstas
  2007-06-22 15:54   ` Alan Cox
  1 sibling, 1 reply; 11+ messages in thread
From: Sergei Shtylyov @ 2007-06-22 15:32 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linux-ide

Hello.

Linas Vepstas wrote:

>>Reset HPT36x's DMA state machine on a DMA timeout the way it's done for HPT370.

>>Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>

>>---
>>Linas, here's what I've come up with -- this should apply against 2.6.21.y.
>>Compile-tested only, not for merging.

>> drivers/ide/pci/hpt366.c |   24 +++++++++++++++++++++++-

> This worked great!  The patch is good. But it raises another interesting
> issue, one of those akpm ZFS "violates boundaries" issues.

> However.. When raid goes to reconstruct the partition, I get one
> of the Drive Ready Seek Complete etc. messages.  Your handler recovers 

    I hope you meant those messages were preceded by DMA timeouts (otherwise 
this code wouldn't come into action).

> from it (I put in a printk to verify this).

    You mean into my ide_dma_timeout() method?

> And so these printk's
> try to get logged into /var/log/messages ... which trigger more 
> errors. At a very high rate ... sometimes hundreds a second, sometimes
> less.  The system remains usable, but at one point, it hit 60% cpu usage
> spewing these messages to the screen.  

    Hm...

> I'd like to see several things.

> 1) This patch should go in.  It converts a system that hangs into
>    one that doesn't hang.

    What's strange is that it never seemed to be necessary before your great 
new drive... ;-)
    So, providing its data certainly wouldn't hurt -- perhaps we just should 
blacklist it instead -- maybe there's a UDMA speed at which this wouldn't 
happen, and we could just limit the drive to it.

> 2) There needs to be a way of failing the disk when there's a high
>    number of errors. e.g. if there are more than 100 errors per minute
>    then the disk needs to be marked "failed" in the raid array.

>    Note it should be stopped only if the rate is high: if there is 
>    only 1 error per minute, this might be very annoying, but acceptable,
>    esp. if one is just trying to copy data off the disk.

>    I'm not sure what to do if this had been the only disk in the system.
>    Maybe if the error rate exceeds 100/minute, then dma is turned off 
>    permanently?

    In fact, it should be turned off after 3 DMA errors (causing PIO retries).

> --linas

MBR, Sergei


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-22 15:13 ` Linas Vepstas
  2007-06-22 15:32   ` Sergei Shtylyov
@ 2007-06-22 15:54   ` Alan Cox
  2007-06-22 16:03     ` Linas Vepstas
  1 sibling, 1 reply; 11+ messages in thread
From: Alan Cox @ 2007-06-22 15:54 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Sergei Shtylyov, linux-ide

>    I'm not sure what to do if this had been the only disk in the system.
>    Maybe if the error rate exceeds 100/minute, then dma is turned off 
>    permanently?

Turn off UDMA and you turn off cable side error protection, so if your
error is caused by noise you just did the electronic version of a long
walk off a short pier.

Degradation policies for IDE are tricky things.

Alan


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-22 15:54   ` Alan Cox
@ 2007-06-22 16:03     ` Linas Vepstas
  2007-06-22 16:33       ` Alan Cox
  0 siblings, 1 reply; 11+ messages in thread
From: Linas Vepstas @ 2007-06-22 16:03 UTC (permalink / raw)
  To: Alan Cox; +Cc: Sergei Shtylyov, linux-ide

On Fri, Jun 22, 2007 at 04:54:31PM +0100, Alan Cox wrote:
> >    I'm not sure what to do if this had been the only disk in the system.
> >    Maybe if the error rate exceeds 100/minute, then dma is turned off 
> >    permanently?
> 
> Turn off UDMA and you turn off cable side error protection, so if your
> error is caused by noise you just did the electronic version of a long
> walk off a short pier.

Right. Yes. Clearly, I must be debugging with half a brain, here.

--linas


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-22 16:03     ` Linas Vepstas
@ 2007-06-22 16:33       ` Alan Cox
  0 siblings, 0 replies; 11+ messages in thread
From: Alan Cox @ 2007-06-22 16:33 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Sergei Shtylyov, linux-ide

> > Turn off UDMA and you turn off cable side error protection, so if your
> > error is caused by noise you just did the electronic version of a long
> > walk off a short pier.
> 
> Right. Yes. Clearly, I must be debugging with half a brain, here.

Half a brain each so together we can manage ;)

I've had a look at your other traces and while I don't yet know why the
first DMA timed out with the libata driver the rest of the sequence looks
like the same hang as with the old driver. I'll propagate Sergei's fix
across at some point.

Thanks
Alan


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-22 15:32   ` Sergei Shtylyov
@ 2007-06-22 16:36     ` Linas Vepstas
  2007-06-23 18:10       ` Sergei Shtylyov
  0 siblings, 1 reply; 11+ messages in thread
From: Linas Vepstas @ 2007-06-22 16:36 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linux-ide

On Fri, Jun 22, 2007 at 07:32:44PM +0400, Sergei Shtylyov wrote:
> 
> >>Reset HPT36x's DMA state machine on a DMA timeout the way it's done for 
> >>HPT370.
> >>drivers/ide/pci/hpt366.c |   24 +++++++++++++++++++++++-
> 
> >This worked great!  
>
>    I hope you meant those messages were preceded by DMA timeouts 
>    (otherwise this code wouldn't come into action).

Oops, I was wrong ... 

Scads of 
Jun 21 20:22:30 localhost kernel: [  434.574301] hdc: task_out_intr:
status=0x50 { DriveReady SeekComplete }
Jun 21 20:22:30 localhost kernel: [  434.574318] ide: failed opcode was:
unknown

> >from it (I put in a printk to verify this).
> 
>    You mean into my ide_dma_timeout() method?

Ooops, I lied. I have so many printk's in there, that I got confused.
No, in fact, it looks like I did NOT see your handler run. 

Per Alan Cox, I have to go back and see if dropping the
UDMA speeds and/or replacing the cable will help.

>    What's strange is that it never seemed to be necessary before your great 
> new drive... ;-)

At $70 for 320GB, how can one say "no"?  Fry's had a mound of them,
shoulder-high.

 MAXTOR STM3320620A

>    So, providing its data certainly wouldn't hurt -- perhaps we just should 
> blacklist it instead -- maybe there's a UDMA speed at which this wouldn't 
> happen, and we could just limit the drive to it.

I'll experiment with the UDMA settings.

>    In fact, it should be turned off after 3 DMA errors (causing PIO 
>    retries).

I'd like to see this turned into a rate. If the system gets one error
a month, and has been up for 3 months, the third error should not shut
things down. The room that this is in is hot; the machine might be
occasionally bumped. A low error rate is acceptable; it's more acceptable
than a mysterious slow-down of performance after 3 months. 

--linas


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-22 16:36     ` Linas Vepstas
@ 2007-06-23 18:10       ` Sergei Shtylyov
  2007-06-25 21:44         ` Linas Vepstas
  0 siblings, 1 reply; 11+ messages in thread
From: Sergei Shtylyov @ 2007-06-23 18:10 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linux-ide

Hello.

Linas Vepstas wrote:

>>>>Reset HPT36x's DMA state machine on a DMA timeout the way it's done for 
>>>>HPT370.
>>>>drivers/ide/pci/hpt366.c |   24 +++++++++++++++++++++++-

>>>This worked great!  

>>   I hope you meant those messages were preceded by DMA timeouts 
>>   (otherwise this code wouldn't come into action).

> Oops, I was wrong ... 

> Scads of 
> Jun 21 20:22:30 localhost kernel: [  434.574301] hdc: task_out_intr:
> status=0x50 { DriveReady SeekComplete }
> Jun 21 20:22:30 localhost kernel: [  434.574318] ide: failed opcode was:
> unknown

>>>from it (I put in a printk to verify this).

>>   You mean into my ide_dma_timeout() method?

> Ooops, I lied. I have so many printk's in there, that I got confused.
> No, in fact, it looks like I did NOT see your handler run. 

    Now I'm confused too -- did you get any DMA timeouts this time?

> --linas

MBR, Sergei


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-23 18:10       ` Sergei Shtylyov
@ 2007-06-25 21:44         ` Linas Vepstas
  2007-06-26 13:57           ` Sergei Shtylyov
  0 siblings, 1 reply; 11+ messages in thread
From: Linas Vepstas @ 2007-06-25 21:44 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linux-ide

On Sat, Jun 23, 2007 at 10:10:34PM +0400, Sergei Shtylyov wrote:
> 
>    Now I'm confused too -- did you get any DMA timeouts this time?

No. And yes, this is confusing, as the initial hang that I was seeing 
was preceded by DMA timeout messages on the screen (as posted in the
initial email). Now, with the patched kernel, I'm not seeing any
DMA timeouts; and, instead, I'm seeing lots of hdc: task_out_intr:
status=0x50 { DriveReady SeekComplete } I don't know why.

However, I have now learned how to make all of my problems go
away: lower the UDMA level.  So it seems that this has been
a humbling experience in (re-)learning about bus glitches.

I'm currently doing an /sbin/hdparm -X 67 /dev/hdc to drop
the udma mode to mode3 == 44 MB/s (-X udma_mode + 64) from
the default of mode4 for this controller (the hard drive itself is
supposedly udma100 capable, according to the box).  This has cured 
all of the ide driver problems.
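
The "-X udma_mode + 64" arithmetic can be spelled out as a small sketch. The helper name `udma_to_xfer_mode()` and the table `udma_rate_x10` are made up for illustration; the rates are the nominal peak figures for UDMA modes 0 through 6, stored in tenths of MB/s to avoid floating point.

```c
/*
 * hdparm -X takes a transfer-mode value; for UltraDMA it is 64 + mode,
 * as noted in the message above. Returns -1 for an undefined mode.
 */
static int udma_to_xfer_mode(int udma_mode)
{
	if (udma_mode < 0 || udma_mode > 6)
		return -1;
	return 64 + udma_mode;
}

/* Nominal peak rates for UDMA modes 0..6, in tenths of MB/s. */
static const int udma_rate_x10[] = { 167, 250, 333, 444, 667, 1000, 1333 };
```

So mode 3 maps to -X 67 at a nominal 44.4 MB/s, and the controller's default of mode 4 maps to -X 68 at 66.7 MB/s.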

The disk that caused all the problems was: 
Device Model:     MAXTOR STM3320620A

A quick look through ide-dma.c shows that there is no way to
blacklist a drive to the "highest supported" level; the blacklist
is UDMA on or UDMA off.

===============
I have not yet tried playing any udma mode games with libata yet,
to see if I could get that working.
===============

I'd like to propose that a system seeing a fair number of
drive errors should perhaps automatically lower the mode,
in the hope of clearing up the problem.

--linas


* Re: [RFT] hpt366: reset DMA state machine on timeouts
  2007-06-25 21:44         ` Linas Vepstas
@ 2007-06-26 13:57           ` Sergei Shtylyov
  0 siblings, 0 replies; 11+ messages in thread
From: Sergei Shtylyov @ 2007-06-26 13:57 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linux-ide

Hello.

Linas Vepstas wrote:

>>   Now I'm confused too -- did you get any DMA timeouts this time?

> No. And yes, this is confusing, as the initial hang that I was seeing 
> was preceded by DMA timeout messages on the screen (as posted in the
> initial email). Now, with the patched kernel, I'm not seeing any
> DMA timeouts; and, instead, I'm seeing lots of hdc: task_out_intr:
> status=0x50 { DriveReady SeekComplete } I don't know why.

    Looks like the state machine reset didn't do it much good... but wait -- 
you're saying that it didn't even occur. :-)

> However, I have now learned how to make all of my problems go
> away: lower the UDMA level.  So it seems that this has been
> a humbling experience in (re-)learning about bus glitches.

    Also, this is not just glitches... normally, UDMA errors introduced by 
cabling should manifest themselves as CRC errors.

> I'm currently doing an /sbin/hdparm -X 67 /dev/hdc to drop
> the udma mode to mode3 == 44 MB/s (-X udma_mode + 64) from
> the default of mode4 for this controller (the hard drive itself is
> supposedly udma100 capable, according to the box).  This has cured 
> all of the ide driver problems.

> The disk that caused all the problems was: 
> Device Model:     MAXTOR STM3320620A

    Good, time to prepare a patch then...

> A quick look through ide-dma.c shows that there is no way to
> blacklist a drive to the "highest supported" level; the blacklist
> is UDMA on or UDMA off.

    Actually, there are DMA white/black lists to force all DMA on/off.
    But hpt366.c has its own UltraDMA specific blacklists. :-)

> ===============
> I have not yet tried playing any udma mode games with libata yet,
> to see if I could get that working.
> ===============

> I'd like to propose that a system seeing a fair number of
> drive errors should perhaps automatically lower the mode,
> in the hope of clearing up the problem.

    It already does so for UDMA CRC errors. Maybe it's worth considering doing 
this for DMA timeouts as well...

> --linas

MBR, Sergei
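
The automatic speed step-down discussed above might look roughly like the following. This is a sketch under stated assumptions, not the kernel's actual CRC-error fallback: `struct udma_policy`, `udma_policy_error()` and the per-mode error threshold are all hypothetical.

```c
/* Hypothetical per-drive state for an automatic step-down policy. */
struct udma_policy {
	int mode;        /* current UDMA mode, 0..6 */
	int errors;      /* errors seen since the last mode change */
	int threshold;   /* errors tolerated before stepping down */
};

/*
 * Record one error. Returns the UDMA mode to use from now on, or -1 when
 * there is no slower UDMA mode left to fall back to.
 */
static int udma_policy_error(struct udma_policy *p)
{
	if (++p->errors < p->threshold)
		return p->mode;     /* still within tolerance */
	p->errors = 0;
	if (p->mode == 0)
		return -1;          /* already at the slowest UDMA mode */
	return --p->mode;
}
```

The -1 case is deliberately left to the caller: as noted earlier in the thread, dropping out of UDMA altogether also drops the cable-side CRC protection, so falling back to PIO is not automatically the safe choice.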
