CK804 SATA Errors (still got them)

* CK804 SATA Errors (still got them)
@ 2007-03-01 13:39 Alistair John Strachan
  2007-03-01 14:45 ` Robert Hancock
  0 siblings, 1 reply; 12+ messages in thread
From: Alistair John Strachan @ 2007-03-01 13:39 UTC
  To: Robert Hancock; +Cc: Jeff Garzik, linux-kernel

Hi Robert,

Despite all the work that went into making these less frequent with ADMA,
they're still possible to trigger.

alistair@damocles:~$ cat /proc/version
Linux version 2.6.21-rc2-damocles (root@damocles) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Wed Feb 28 21:58:41 GMT 2007

alistair@damocles:~$ dmesg | tail -n 13
ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x500 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:38:ae:08:c2/00:00:00:00:00/e0 tag 0 cdb 0x0 data 28672 out
         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

These cause the same ~30 second stalls. Machine was not under load.

No 3rd party modules were loaded.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


* Re: CK804 SATA Errors (still got them)
  2007-03-01 13:39 CK804 SATA Errors (still got them) Alistair John Strachan
@ 2007-03-01 14:45 ` Robert Hancock
  2007-03-01 15:13   ` Alistair John Strachan
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Hancock @ 2007-03-01 14:45 UTC
  To: Alistair John Strachan; +Cc: Jeff Garzik, linux-kernel

Alistair John Strachan wrote:
> Hi Robert,
> 
> Despite all the work that went into making these less frequent with ADMA,
> they're still possible to trigger.
> 
> alistair@damocles:~$ cat /proc/version
> Linux version 2.6.21-rc2-damocles (root@damocles) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Wed Feb 28 21:58:41 GMT 2007
> 
> alistair@damocles:~$ dmesg | tail -n 13
> ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x500 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:38:ae:08:c2/00:00:00:00:00/e0 tag 0 cdb 0x0 data 28672 out
>          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1: soft resetting port
> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata1.00: configured for UDMA/133
> ata1: EH complete
> SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
> sda: Write Protect is off
> sda: Mode Sense: 00 3a 00 00
> SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> These cause the same ~30 second stalls. Machine was not under load.
> 
> No 3rd party modules were loaded.

This one seems a bit different. This time it's not related to NCQ vs. 
non-NCQ (this is a non-NCQ write here), it's in ADMA mode (so it's 
presumably not related to switching between ADMA and register mode, 
unless perhaps a flush cache or something executed just before), and 
from the CPB data it appears the command completed but the controller's 
registers aren't indicating that it has. Not sure if I've seen one like 
that before.
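
For anyone decoding the report by hand, here is a minimal sketch. It assumes the field order libata prints on the "cmd" line (command/feature:nsect:lbal:lbam:lbah/five hob fields/device) and CPB flag bit positions believed to match sata_nv.c; both are assumptions to verify against the source, and the struct and function names are made up for illustration.

```c
#include <stdio.h>

/* Taskfile fields in the order libata prints them on the "cmd" line. */
struct tf_fields {
	unsigned int command, feature, nsect, lbal, lbam, lbah;
	unsigned int hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah;
	unsigned int device;
};

/* CPB flag bits: assumed values, check against drivers/ata/sata_nv.c. */
enum {
	CPB_CTL_CPB_VALID  = 1 << 0,
	CPB_CTL_QUEUE      = 1 << 1,
	CPB_CTL_APRD_VALID = 1 << 2,
	CPB_CTL_IEN        = 1 << 3,
	CPB_CTL_FPDMA      = 1 << 4,	/* set for NCQ commands */
	CPB_RESP_DONE      = 1 << 0	/* controller marked the CPB complete */
};

/* Parse e.g. "ca/00:38:ae:08:c2/00:00:00:00:00/e0"; returns 0 on success. */
int parse_tf(const char *s, struct tf_fields *tf)
{
	int n = sscanf(s, "%x/%x:%x:%x:%x:%x/%x:%x:%x:%x:%x/%x",
		       &tf->command, &tf->feature, &tf->nsect,
		       &tf->lbal, &tf->lbam, &tf->lbah,
		       &tf->hob_feature, &tf->hob_nsect, &tf->hob_lbal,
		       &tf->hob_lbam, &tf->hob_lbah, &tf->device);
	return n == 12 ? 0 : -1;
}

/* 28-bit LBA: the low nibble of the device register holds bits 27:24. */
unsigned long tf_lba28(const struct tf_fields *tf)
{
	return ((unsigned long)(tf->device & 0x0f) << 24) |
	       ((unsigned long)tf->lbah << 16) |
	       ((unsigned long)tf->lbam << 8) | tf->lbal;
}

/* Transfer size in bytes; a sector count of 0 means 256 for 28-bit commands. */
unsigned long tf_bytes(const struct tf_fields *tf)
{
	unsigned int sectors = tf->nsect ? tf->nsect : 256;
	return sectors * 512UL;
}

/* Name only the two opcodes that appear in this thread. */
const char *tf_cmd_name(const struct tf_fields *tf)
{
	switch (tf->command) {
	case 0xc8: return "READ DMA";
	case 0xca: return "WRITE DMA";
	default:   return "other";
	}
}

int cpb_is_ncq(unsigned int ctl_flags)
{
	return (ctl_flags & CPB_CTL_FPDMA) != 0;
}

int cpb_is_done(unsigned int resp_flags)
{
	return (resp_flags & CPB_RESP_DONE) != 0;
}
```

Fed the "cmd" line from the report, this gives WRITE DMA for 56 sectors, i.e. 28672 bytes, which agrees with "data 28672 out". Under the assumed bit values, ctl_flags 0xd has the FPDMA bit clear (non-NCQ) and resp_flags 0x1 has the done bit set.
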

How easily can you reproduce this?

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/




* Re: CK804 SATA Errors (still got them)
  2007-03-01 14:45 ` Robert Hancock
@ 2007-03-01 15:13   ` Alistair John Strachan
  2007-03-02  1:20     ` Alistair John Strachan
  0 siblings, 1 reply; 12+ messages in thread
From: Alistair John Strachan @ 2007-03-01 15:13 UTC
  To: Robert Hancock; +Cc: Jeff Garzik, linux-kernel

On Thursday 01 March 2007 14:45, Robert Hancock wrote:
> This one seems a bit different. This time it's not related to NCQ vs.
> non-NCQ (this is a non-NCQ write here), it's in ADMA mode (so it's
> presumably not related to switching between ADMA and register mode,
> unless perhaps a flush cache or something executed just before), and
> from the CPB data it appears the command completed but the controller's
> registers aren't indicating that it has. Not sure if I've seen one like
> that before..
>
> How easily can you reproduce this?

It's the first one since -rc2, so apparently not easily. I'm more than willing 
to find loads that expose it, though, so I might try that this afternoon.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


* Re: CK804 SATA Errors (still got them)
  2007-03-01 15:13   ` Alistair John Strachan
@ 2007-03-02  1:20     ` Alistair John Strachan
  2007-03-02  2:40       ` Robert Hancock
  0 siblings, 1 reply; 12+ messages in thread
From: Alistair John Strachan @ 2007-03-02  1:20 UTC
  To: Robert Hancock; +Cc: Jeff Garzik, linux-kernel

On Thursday 01 March 2007 15:13, Alistair John Strachan wrote:
> On Thursday 01 March 2007 14:45, Robert Hancock wrote:
> > This one seems a bit different. This time it's not related to NCQ vs.
> > non-NCQ (this is a non-NCQ write here), it's in ADMA mode (so it's
> > presumably not related to switching between ADMA and register mode,
> > unless perhaps a flush cache or something executed just before), and
> > from the CPB data it appears the command completed but the controller's
> > registers aren't indicating that it has. Not sure if I've seen one like
> > that before..
> >
> > How easily can you reproduce this?
>
> It's the first one since -rc2, so apparently not easily. I'm more than
> willing to find loads that expose it, though, so I might try that this
> afternoon.

Got another:

ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x500 next cpb count 0x0 next cpb idx 0x0
ata2: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: cmd c8/00:80:85:c4:ed/00:00:00:00:00/e3 tag 0 cdb 0x0 data 65536 in
         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata2: soft resetting port
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete
SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Different HD, similar problem.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


* Re: CK804 SATA Errors (still got them)
  2007-03-02  1:20     ` Alistair John Strachan
@ 2007-03-02  2:40       ` Robert Hancock
  2007-03-02 15:47         ` Alistair John Strachan
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Hancock @ 2007-03-02  2:40 UTC
  To: Alistair John Strachan; +Cc: Jeff Garzik, linux-kernel

Alistair John Strachan wrote:
> On Thursday 01 March 2007 15:13, Alistair John Strachan wrote:
>> On Thursday 01 March 2007 14:45, Robert Hancock wrote:
>>> This one seems a bit different. This time it's not related to NCQ vs.
>>> non-NCQ (this is a non-NCQ write here), it's in ADMA mode (so it's
>>> presumably not related to switching between ADMA and register mode,
>>> unless perhaps a flush cache or something executed just before), and
>>> from the CPB data it appears the command completed but the controller's
>>> registers aren't indicating that it has. Not sure if I've seen one like
>>> that before..
>>>
>>> How easily can you reproduce this?
>> It's the first one since -rc2, so apparently not easily. I'm more than
>> willing to find loads that expose it, though, so I might try that this
>> afternoon.
> 
> Got another:
> 
> ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x500 next cpb count 0x0 next cpb idx 0x0
> ata2: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata2.00: cmd c8/00:80:85:c4:ed/00:00:00:00:00/e3 tag 0 cdb 0x0 data 65536 in
>          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata2: soft resetting port
> ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata2.00: configured for UDMA/133
> ata2: EH complete
> SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
> sdb: Write Protect is off
> sdb: Mode Sense: 00 3a 00 00
> SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> Different HD, similar problem.

Can you try reverting commit 721449bf0d51213fe3abf0ac3e3561ef9ea7827a 
(link below) and see what effect that has?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=721449bf0d51213fe3abf0ac3e3561ef9ea7827a

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/




* Re: CK804 SATA Errors (still got them)
  2007-03-02  2:40       ` Robert Hancock
@ 2007-03-02 15:47         ` Alistair John Strachan
  2007-03-04 23:25           ` Robert Hancock
  0 siblings, 1 reply; 12+ messages in thread
From: Alistair John Strachan @ 2007-03-02 15:47 UTC
  To: Robert Hancock; +Cc: Jeff Garzik, linux-kernel

On Friday 02 March 2007 02:40, Robert Hancock wrote:
> Alistair John Strachan wrote:
> > On Thursday 01 March 2007 15:13, Alistair John Strachan wrote:
> >> On Thursday 01 March 2007 14:45, Robert Hancock wrote:
> >>> This one seems a bit different. This time it's not related to NCQ vs.
> >>> non-NCQ (this is a non-NCQ write here), it's in ADMA mode (so it's
> >>> presumably not related to switching between ADMA and register mode,
> >>> unless perhaps a flush cache or something executed just before), and
> >>> from the CPB data it appears the command completed but the controller's
> >>> registers aren't indicating that it has. Not sure if I've seen one like
> >>> that before..
> >>>
> >>> How easily can you reproduce this?
> >>
> >> It's the first one since -rc2, so apparently not easily. I'm more than
> >> willing to find loads that expose it, though, so I might try that this
> >> afternoon.
> >
> > Got another:
> >
> > ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x500 next cpb count 0x0 next cpb idx 0x0
> > ata2: CPB 0: ctl_flags 0xd, resp_flags 0x1
> > ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata2.00: cmd c8/00:80:85:c4:ed/00:00:00:00:00/e3 tag 0 cdb 0x0 data 65536 in
> >          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> > ata2: soft resetting port
> > ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > ata2.00: configured for UDMA/133
> > ata2: EH complete
> > SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
> > sdb: Write Protect is off
> > sdb: Mode Sense: 00 3a 00 00
> > SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> >
> > Different HD, similar problem.
>
> Can you try reverting commit 721449bf0d51213fe3abf0ac3e3561ef9ea7827a
> (link below) and see what effect that has?
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=721449bf0d51213fe3abf0ac3e3561ef9ea7827a

Obviously, I'll let you know if it happens again, but I've reverted this 
commit and transferred 22.5GB over 45 minutes onto a RAID5 with 4 HDs on an 
NVIDIA sata controller, and this error hasn't appeared.

So I'm inclined to (very unscientifically) say that this brings it back to 
2.6.20's level of stability.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


* Re: CK804 SATA Errors (still got them)
  2007-03-02 15:47         ` Alistair John Strachan
@ 2007-03-04 23:25           ` Robert Hancock
  2007-03-04 23:41             ` Alistair John Strachan
                               ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Robert Hancock @ 2007-03-04 23:25 UTC
  To: Alistair John Strachan; +Cc: Jeff Garzik, linux-kernel

Alistair John Strachan wrote:
>> Can you try reverting commit 721449bf0d51213fe3abf0ac3e3561ef9ea7827a
>> (link below) and see what effect that has?
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=721449bf0d51213fe3abf0ac3e3561ef9ea7827a
> 
> Obviously, I'll let you know if it happens again, but I've reverted this 
> commit and transferred 22.5GB over 45 minutes onto a RAID5 with 4 HDs on an 
> NVIDIA sata controller, and this error hasn't appeared.
> 
> So I'm inclined to (very unscientifically) say that this brings it back to 
> 2.6.20's level of stability.

Interesting. Can you try un-reverting that patch, and applying this one?

The reading of the status register is something that was part of the original
NVidia code, and I'm not really sure why it is there. Given that reading
the status register clears the drive's interrupt status, that might be
causing some weird interaction with the ADMA controller. Also, I added in
a printk for cases where notifiers are triggered but the command doesn't
indicate completion - if you still get problems, let me know if you see
that message.
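
For context, the check being commented out folds the drive's status byte into qc->err_mask via ac_err_mask(). Below is a rough standalone model of that mapping. It assumes the helper behaves as it is understood to in libata.h (BSY or DRQ still set means a host state machine error, ERR or DF means a device error); the names are local stand-ins, not the kernel's, and the error-mask bit values are assumptions.

```c
#include <stdint.h>

/* ATA status register bits (standard values). */
enum { ST_ERR = 0x01, ST_DRQ = 0x08, ST_DF = 0x20, ST_DRDY = 0x40, ST_BUSY = 0x80 };

/* Error-mask bits, assumed; 0x4 would match "Emask 0x4 (timeout)". */
enum { ERR_DEV = 1 << 0, ERR_HSM = 1 << 1, ERR_TIMEOUT = 1 << 2 };

/* Model of the status-to-error mapping applied on command completion. */
unsigned int sketch_err_mask(uint8_t status)
{
	if (status & (ST_BUSY | ST_DRQ))
		return ERR_HSM;		/* device still busy or expecting data */
	if (status & (ST_ERR | ST_DF))
		return ERR_DEV;		/* device reported an error or fault */
	return 0;			/* e.g. DRDY alone: no error */
}
```

In this model a status of 0x40 (DRDY only), the value shown in the "res 40/..." lines of the reports, maps to no error bits at all.
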

--- linux-2.6.21-rc2-git3/drivers/ata/sata_nv.c	2007-03-04 14:44:05.000000000 -0600
+++ linux-2.6.21-rc2-git3edit/drivers/ata/sata_nv.c	2007-03-04 17:09:06.000000000 -0600
@@ -745,10 +745,10 @@
 			/* Grab the ATA port status for non-NCQ commands.
 			   For NCQ commands the current status may have nothing to do with
 			   the command just completed. */
-			if (qc->tf.protocol != ATA_PROT_NCQ) {
+/*			if (qc->tf.protocol != ATA_PROT_NCQ) {
 				u8 ata_status = readb(pp->ctl_block + (ATA_REG_STATUS * 4));
 				qc->err_mask |= ac_err_mask(ata_status);
-			}
+			}*/
 			DPRINTK("Completing qc from tag %d with err_mask %u\n",cpb_num,
 				qc->err_mask);
 			ata_qc_complete(qc);
@@ -764,6 +764,9 @@
 			ata_port_freeze(ap);
 			return 1;
 		}
+	} else {
+		ata_port_printk(ap, KERN_WARNING, "notifier for tag %d but not complete?\n",
+			cpb_num);
 	}
 	return 0;
 }


* Re: CK804 SATA Errors (still got them)
  2007-03-04 23:25           ` Robert Hancock
@ 2007-03-04 23:41             ` Alistair John Strachan
  2007-03-04 23:49               ` Robert Hancock
  2007-03-04 23:50               ` Jeff Garzik
  2007-03-04 23:46             ` Jeff Garzik
  2007-03-05  3:52             ` Alistair John Strachan
  2 siblings, 2 replies; 12+ messages in thread
From: Alistair John Strachan @ 2007-03-04 23:41 UTC
  To: Robert Hancock; +Cc: Jeff Garzik, linux-kernel

On Sunday 04 March 2007 23:25, Robert Hancock wrote:
> Alistair John Strachan wrote:
> >> Can you try reverting commit 721449bf0d51213fe3abf0ac3e3561ef9ea7827a
> >> (link below) and see what effect that has?
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=721449bf0d51213fe3abf0ac3e3561ef9ea7827a
> >
> > Obviously, I'll let you know if it happens again, but I've reverted this
> > commit and transferred 22.5GB over 45 minutes onto a RAID5 with 4 HDs on
> > an NVIDIA sata controller, and this error hasn't appeared.
> >
> > So I'm inclined to (very unscientifically) say that this brings it back
> > to 2.6.20's level of stability.
>
> Interesting. Can you try un-reverting that patch, and applying this one?

Sorry for the newbie question, but is it adequate to do a:

	git reset --hard v2.6.21-rc2

To ensure a patch is "unreverted" (I reverted it with "git revert"), before 
applying your patch?

I've done so now, assuming this _will_ work. The reason I ask is that your 
diff was offset by 12 lines versus -rc2.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


* Re: CK804 SATA Errors (still got them)
  2007-03-04 23:41             ` Alistair John Strachan
@ 2007-03-04 23:49               ` Robert Hancock
  2007-03-04 23:50               ` Jeff Garzik
  1 sibling, 0 replies; 12+ messages in thread
From: Robert Hancock @ 2007-03-04 23:49 UTC
  To: Alistair John Strachan; +Cc: Jeff Garzik, linux-kernel

Alistair John Strachan wrote:
>> Interesting. Can you try un-reverting that patch, and applying this one?
> 
> Sorry for the newbie question, but is it adequate to do a:
> 
> 	git reset --hard v2.6.21-rc2
> 
> To ensure a patch is "unreverted" (I reverted it with "git revert"), before 
> applying your patch?
> 
> I've done so now, assuming this _will_ work. The reason I ask is that your 
> diff was offset by 12 lines versus -rc2.

I assume it's OK, though I'm not a git expert. I diffed against rc2-git3,
which has some CONFIG_PM ifdef changes; those shouldn't matter, though.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: CK804 SATA Errors (still got them)
  2007-03-04 23:41             ` Alistair John Strachan
  2007-03-04 23:49               ` Robert Hancock
@ 2007-03-04 23:50               ` Jeff Garzik
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff Garzik @ 2007-03-04 23:50 UTC
  To: Alistair John Strachan; +Cc: Robert Hancock, linux-kernel

Alistair John Strachan wrote:
> On Sunday 04 March 2007 23:25, Robert Hancock wrote:
>> Alistair John Strachan wrote:
>>>> Can you try reverting commit 721449bf0d51213fe3abf0ac3e3561ef9ea7827a
>>>> (link below) and see what effect that has?
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=721449bf0d51213fe3abf0ac3e3561ef9ea7827a
>>> Obviously, I'll let you know if it happens again, but I've reverted this
>>> commit and transferred 22.5GB over 45 minutes onto a RAID5 with 4 HDs on
>>> an NVIDIA sata controller, and this error hasn't appeared.
>>>
>>> So I'm inclined to (very unscientifically) say that this brings it back
>>> to 2.6.20's level of stability.
>> Interesting. Can you try un-reverting that patch, and applying this one?
> 
> Sorry for the newbie question, but is it adequate to do a:
> 
> 	git reset --hard v2.6.21-rc2
> 
> To ensure a patch is "unreverted" (I reverted it with "git revert"), before 
> applying your patch?
> 
> I've done so now, assuming this _will_ work. The reason I ask is that your 
> diff was offset by 12 lines versus -rc2.

If you committed the revert to the repository, it's probably easiest to blow
it away and re-clone.  Generally, with git, you want to keep a pristine,
never-touched-except-for-pulling kernel repository around, and then when 
doing compiles and experiments and such, run

	git-clone --reference my-vanilla-2.6-repo $URL

The --reference argument will ensure that you don't haul around multiple 
copies of the repository objects, with each clone.


Otherwise, if you have committed nothing to the repository, this will 
undo all your not-committed changes:

	git checkout -f

Regards,

	Jeff





* Re: CK804 SATA Errors (still got them)
  2007-03-04 23:25           ` Robert Hancock
  2007-03-04 23:41             ` Alistair John Strachan
@ 2007-03-04 23:46             ` Jeff Garzik
  2007-03-05  3:52             ` Alistair John Strachan
  2 siblings, 0 replies; 12+ messages in thread
From: Jeff Garzik @ 2007-03-04 23:46 UTC
  To: Robert Hancock
  Cc: Alistair John Strachan, linux-kernel, IDE/ATA development list

Robert Hancock wrote:
> Alistair John Strachan wrote:
>>> Can you try reverting commit 721449bf0d51213fe3abf0ac3e3561ef9ea7827a
>>> (link below) and see what effect that has?
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=721449bf0d51213fe3abf0ac3e3561ef9ea7827a
>>
>> Obviously, I'll let you know if it happens again, but I've reverted 
>> this commit and transferred 22.5GB over 45 minutes onto a RAID5 with 4 
>> HDs on an NVIDIA sata controller, and this error hasn't appeared.
>>
>> So I'm inclined to (very unscientifically) say that this brings it 
>> back to 2.6.20's level of stability.
> 
> Interesting. Can you try un-reverting that patch, and applying this one?
> 
> The reading of the status register is something that was part of the
> original NVidia code, and I'm not really sure why it is there. Given that
> reading the status register clears the drive's interrupt status, that might
> be causing some weird interaction with the ADMA controller. Also, I added
> in a printk for cases where notifiers are triggered but the command doesn't
> indicate completion - if you still get problems, let me know if you see
> that message.


AFAICS, when in ADMA mode, you absolutely should not touch the ATA 
shadow registers at all.

This is normal for all controllers with both a "legacy mode" and an 
"enhanced DMA mode" of some sort:  the internal silicon state machines 
"own" the ATA shadow registers while in enhanced DMA mode.  Reading or 
writing the ATA shadow registers while in enhanced DMA mode can lead to 
undefined results, running the gamut from no-op to data corruption and 
hardware lock-ups.

You may only access the ATA shadow registers when NV_ADMA_CTL_GO is 
cleared, and then NV_ADMA_STAT_LEGACY is set, indicating the NVIDIA chip 
is in register mode (aka legacy mode).
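
As a rough illustration of that rule, here is a sketch of the check a driver would want to pass before touching the shadow registers. The bit positions are assumptions believed to match sata_nv.c and should be verified there; the SK_ names and the function are hypothetical, and a real driver would read the control and status words from the port's ADMA register block rather than take them as arguments.

```c
#include <stdint.h>

/* Assumed bit positions; verify against drivers/ata/sata_nv.c. */
enum {
	SK_ADMA_CTL_GO       = 1 << 7,
	SK_ADMA_STAT_IDLE    = 1 << 8,
	SK_ADMA_STAT_LEGACY  = 1 << 9,
	SK_ADMA_STAT_STOPPED = 1 << 10
};

/*
 * Returns nonzero only when GO is clear in the control word and the
 * controller reports legacy (register) mode in the status word.
 */
int shadow_regs_accessible(uint16_t adma_ctl, uint16_t adma_status)
{
	return !(adma_ctl & SK_ADMA_CTL_GO) &&
	       (adma_status & SK_ADMA_STAT_LEGACY) != 0;
}
```

With these assumed values, the "status 0x500" printed in the reports would decode to idle plus stopped without the legacy bit, so this check would refuse shadow register access in that state.
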

	Jeff





* Re: CK804 SATA Errors (still got them)
  2007-03-04 23:25           ` Robert Hancock
  2007-03-04 23:41             ` Alistair John Strachan
  2007-03-04 23:46             ` Jeff Garzik
@ 2007-03-05  3:52             ` Alistair John Strachan
  2 siblings, 0 replies; 12+ messages in thread
From: Alistair John Strachan @ 2007-03-05  3:52 UTC
  To: Robert Hancock; +Cc: Jeff Garzik, linux-kernel

On Sunday 04 March 2007 23:25, Robert Hancock wrote:
> Alistair John Strachan wrote:
> >> Can you try reverting commit 721449bf0d51213fe3abf0ac3e3561ef9ea7827a
> >> (link below) and see what effect that has?
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=721449bf0d51213fe3abf0ac3e3561ef9ea7827a
> >
> > Obviously, I'll let you know if it happens again, but I've reverted this
> > commit and transferred 22.5GB over 45 minutes onto a RAID5 with 4 HDs on
> > an NVIDIA sata controller, and this error hasn't appeared.
> >
> > So I'm inclined to (very unscientifically) say that this brings it back
> > to 2.6.20's level of stability.
>
> Interesting. Can you try un-reverting that patch, and applying this one?
>
> The reading of the status register is something that was part of the
> original NVidia code, and I'm not really sure why it is there. Given that
> reading the status register clears the drive's interrupt status, that might
> be causing some weird interaction with the ADMA controller. Also, I added
> in a printk for cases where notifiers are triggered but the command doesn't
> indicate completion - if you still get problems, let me know if you see
> that message.

Didn't take long to observe the problem again, so I'm guessing that this isn't
it. I was definitely using a kernel compiled with your patch:

alistair@damocles:~$ uname -v
#1 SMP Sun Mar 4 23:39:56 GMT 2007

I got the following in dmesg:

ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x500 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd c8/00:08:37:77:61/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Your debugging message did not appear in dmesg, however.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
