* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
       [not found] <56020.194.255.108.253.1192004925.squirrel@root.dusted.dk>
@ 2007-10-13 0:41 ` Andrew Morton
  2007-10-13 5:49   ` Steen Eugen Poulsen
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2007-10-13 0:41 UTC (permalink / raw)
  To: lists; +Cc: linux-kernel, linux-ide

On Wed, 10 Oct 2007 10:28:45 +0200 (CEST)
lists@dusted.dk wrote:

> I get this on brand new hardware, 2xHitachi Deathstar 320gb SATA2
> (sata_via driver)
>
> I get this a lot, the disk makes some sound after heavy IO and then the
> system hangs for a few seconds, then this comes up:
>
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd 25/00:00:3f:76:30/00:04:00:00:00/e0 tag 0 cdb 0x0 data 524288 in
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1: port is slow to respond, please be patient (Status 0xd0)
> ata1: soft resetting port
> ata1.00: configured for UDMA/133
> ata1: EH complete
> sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> raid1: Disk failure on sdb1, disabling device.
>
> This is on kernel 2.6.23

(added linux-ide)

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2007-10-13 0:41 ` exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen Andrew Morton
@ 2007-10-13 5:49   ` Steen Eugen Poulsen
  2007-10-23 9:55     ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Steen Eugen Poulsen @ 2007-10-13 5:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lists, linux-kernel, linux-ide

Andrew Morton wrote:
> On Wed, 10 Oct 2007 10:28:45 +0200 (CEST)
> lists@dusted.dk wrote:
>
>> I get this on brand new hardware, 2xHitachi Deathstar 320gb SATA2
>> (sata_via driver)

Sep 28 04:32:40 locker ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 28 04:32:40 locker ata1.00: cmd b0/d2:f1:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 123392 in
Sep 28 04:32:40 locker res 50/00:f1:00:4f:c2/00:00:00:00:00/00 Emask 0x202 (HSM violation)
Sep 28 04:32:41 locker current size: 625140335 sectors
Sep 28 04:32:41 locker native size: 625142448 sectors
Sep 28 04:32:41 locker current size: 625140335 sectors
Sep 28 04:32:41 locker native size: 625142448 sectors

Another machine:

Sep 28 03:47:55 dragonslair ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 28 03:47:55 dragonslair ata1.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 126976 in
Sep 28 03:47:55 dragonslair res 50/00:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x202 (HSM violation)
Sep 28 03:47:55 dragonslair ata1: soft resetting port
Sep 28 03:47:55 dragonslair ata1.00: configured for UDMA/133
Sep 28 03:47:55 dragonslair ata1: EH complete
Sep 28 03:47:55 dragonslair sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 28 03:47:55 dragonslair sd 0:0:0:0: [sda] 156250000 512-byte hardware sectors (80000 MB)
Sep 28 03:47:55 dragonslair sd 0:0:0:0: [sda] Write Protect is off
Sep 28 03:47:55 dragonslair sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Sep 28 03:47:55 dragonslair sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

And yet another:

Sep 28 04:33:52 liferaft kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 28 04:33:55 liferaft kernel: ata1.00: cmd b0/d2:f1:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 123392 in
Sep 28 04:33:55 liferaft kernel: res 50/00:f1:00:4f:c2/00:00:00:00:00/00 Emask 0x202 (HSM violation)
Sep 28 04:33:55 liferaft kernel: ata1: soft resetting port
Sep 28 04:33:55 liferaft kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 28 04:33:55 liferaft kernel: ata1.00: configured for UDMA/133
Sep 28 04:33:55 liferaft kernel: ata1: EH complete

Here are a few more cases, taken from semi-random locations in my log to
show the different data; maybe some of it can help.

Weirdness 1: I have 3 machines that decide to spew this garbage within
the same second? (It is around the hour that smartd would run, but it's
only this one day, Sep 28, that it was this horribly bad.)

Note 1: Bad/failing hardware creates these types of errors.

Note 2: The hardware didn't freeze for me and I believe the freeze is
due to swap breaking because of the errors.

Note 3: dragonslair's hard disk actually crashed; the kernel didn't die,
it just remounted read-only. After a reboot the disk was missing; after
more reboots the machine started with all disks running again, and it
has been stable since the 28th of Sep. (knock on wood)

Note 4: I've changed hardware and kernel in an uncontrolled manner, so I
was waiting for another case of these errors where I would be able to
write down the kernel config.

I'm not sure, but I do believe the key factors here are SMP and 2.6.22:
older kernels don't seem to trigger this, and non-SMP seems to avoid it
on 2.6.22. But I can't trigger the error on demand, so there is no way
of knowing whether that conclusion can be trusted.

Dragonslair:
P4 x2 3.0 GHz
Chips Intel 865GV & ICH5
32bit SMP kernel (2.6.22)
2 SATA disks WDC WD800JD-75MSA3
(I'm guessing this one has a physically bad disk, since it's the only
machine where the disk has physically failed and the only one with a
worrying SMART error:
  1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 5)

Locker:
AMD 64x2
Chips Nvidia 570
32bit SMP kernel (2.6.22)
6 SATA disks
2xWD3200YS-01PGB0
4xWD3200AAKS-00TMA0

Liferaft:
AMD 64x2
Chips Nvidia 590
32bit SMP kernel (vserver 2.6.22 based)
1 SATA disk WD2500KS-00MJB0
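The Emask values in these reports (0x202, 0x4) are bit masks. A minimal sketch of decoding them follows; the bit table is written from memory of the AC_ERR_* enum in include/linux/libata.h and should be treated as an assumption to check against your own kernel source. `decode_emask` is a hypothetical helper name, not a kernel or smartmontools function.

```python
# Bit positions as recalled from the AC_ERR_* enum in include/linux/libata.h.
# This table is an assumption; verify it against the kernel you are running.
AC_ERR_BITS = {
    0: "device error",
    1: "HSM violation",
    2: "timeout",
    3: "media error",
    4: "ATA bus error",
    5: "host bus error",
    6: "system error",
    7: "invalid argument",
    8: "unknown error",
    9: "nodev hint",
    10: "NCQ marker",
}


def decode_emask(emask):
    """Return the names of all bits set in a libata Emask value."""
    return [name for bit, name in sorted(AC_ERR_BITS.items())
            if emask & (1 << bit)]


print(decode_emask(0x202))  # ['HSM violation', 'nodev hint']
print(decode_emask(0x4))    # ['timeout']
```

Under that table, the 0x202 lines carry the HSM violation bit plus one extra hint bit, while the 0x4 lines in the other reports are plain timeouts, which matches the text libata prints in parentheses.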
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2007-10-13 5:49   ` Steen Eugen Poulsen
@ 2007-10-23 9:55     ` Tejun Heo
       [not found]       ` <471F2647.4030304@lix-world.net>
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2007-10-23 9:55 UTC (permalink / raw)
  To: Steen Eugen Poulsen; +Cc: Andrew Morton, lists, linux-kernel, linux-ide

Hello,

Steen Eugen Poulsen wrote:
> Sep 28 04:32:40 locker ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Sep 28 04:32:40 locker ata1.00: cmd b0/d2:f1:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 123392 in
> Sep 28 04:32:40 locker res 50/00:f1:00:4f:c2/00:00:00:00:00/00 Emask 0x202 (HSM violation)
[--snip--]
> Another machine:
>
> Sep 28 03:47:55 dragonslair ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Sep 28 03:47:55 dragonslair ata1.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 126976 in
>
> Sep 28 03:47:55 dragonslair res 50/00:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x202 (HSM violation)
[--snip--]
> Sep 28 04:33:52 liferaft kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Sep 28 04:33:55 liferaft kernel: ata1.00: cmd b0/d2:f1:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 123392 in
> Sep 28 04:33:55 liferaft kernel: res 50/00:f1:00:4f:c2/00:00:00:00:00/00 Emask 0x202 (HSM violation)
> Sep 28 04:33:55 liferaft kernel: ata1: soft resetting port
> Sep 28 04:33:55 liferaft kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Sep 28 04:33:55 liferaft kernel: ata1.00: configured for UDMA/133
> Sep 28 04:33:55 liferaft kernel: ata1: EH complete

All these are caused by smartd. Updating should fix the problem.

> Note 2: The hardware didn't freeze for me and I believe the freeze is due
> to swap breaking due to the errors.

The above HSM violations should be harmless apart from the messages
themselves. libata resets the device and should just carry on.

> Note 3: dragonslair's harddisk actually crashed, kernel didn't die, it
> just remounted read only. Reboot and the disk was missing, more reboot
> and the machine started with all disks running again, been stable since
> the 28th Sep. (knock on wood)

libata EH can't really recover from actual hardware failures, but some
drives come back on if you hot unplug and then replug them.

--
tejun
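For readers decoding these reports by hand, here is a minimal sketch of splitting a libata `cmd` line into taskfile fields. It assumes the printed order is command/feature:nsect:lbal:lbam:lbah/five HOB bytes/device. The name `parse_taskfile` and the small opcode table are illustrative only, not part of any kernel or smartmontools API. It shows why the b0/... commands quoted above point at a SMART tool such as smartd: 0xb0 is the ATA SMART opcode, and 4f:c2 in the LBA mid/high positions is the SMART signature.

```python
import re

# Illustrative subset of ATA opcodes that appear in this thread (not exhaustive).
OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0x61: "WRITE FPDMA QUEUED",
    0xB0: "SMART",
    0xC8: "READ DMA",
    0xEC: "IDENTIFY DEVICE",
}

# cmd CC/FF:NN:LL:MM:HH/hf:hn:hl:hm:hh/DD  (assumed libata print order)
TASKFILE_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def parse_taskfile(line):
    """Split the 'cmd' part of a libata exception report into named fields."""
    m = TASKFILE_RE.search(line)
    if m is None:
        raise ValueError("no taskfile found in line")
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hob = [int(b, 16) for b in m.group(3).split(":")]
    return {
        "command": command,
        "name": OPCODES.get(command, "unknown"),
        "feature": feature,
        "nsect": nsect,
        "lbal": lbal,
        "lbam": lbam,
        "lbah": lbah,
        "hob": hob,
        "device": int(m.group(4), 16),
        # SMART commands carry the signature 0x4f/0xc2 in LBA mid/high.
        "is_smart": command == 0xB0 and lbam == 0x4F and lbah == 0xC2,
    }


tf = parse_taskfile(
    "ata1.00: cmd b0/d2:f1:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 123392 in"
)
print(tf["name"], hex(tf["feature"]), tf["is_smart"])  # SMART 0xd2 True
```

The feature byte (0xd2, 0xdb in the reports above) selects the SMART subcommand, so it is the field to compare against what the installed smartd version is issuing.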
[parent not found: <471F2647.4030304@lix-world.net>]
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
       [not found]       ` <471F2647.4030304@lix-world.net>
@ 2007-10-26 1:40         ` Tejun Heo
  2007-10-26 4:02           ` Jim Paris
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2007-10-26 1:40 UTC (permalink / raw)
  To: Steen Eugen Poulsen
  Cc: Andrew Morton, lists, Linux Kernel, IDE/ATA development list, bruce.allen

[please don't drop cc. restored]

Steen Eugen Poulsen wrote:
> Tejun Heo wrote:
>> All these are caused by smartd. Updating should fix the problem.
>
> Okay, but there is no newer smartd than what I'm using. (5.37)

Bruce? Original thread can be read from...

  http://thread.gmane.org/gmane.linux.kernel/588972

--
tejun
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2007-10-26 1:40         ` Tejun Heo
@ 2007-10-26 4:02           ` Jim Paris
  2007-11-06 9:49             ` Bruce Allen
  0 siblings, 1 reply; 11+ messages in thread
From: Jim Paris @ 2007-10-26 4:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Steen Eugen Poulsen, Andrew Morton, lists, Linux Kernel,
	IDE/ATA development list, bruce.allen

Tejun Heo wrote:
> [please don't drop cc. restored]
>
> Steen Eugen Poulsen wrote:
> >Tejun Heo wrote:
> >>All these are caused by smartd. Updating should fix the problem.
> >
> >Okay, but there is no newer smartd than what I'm using. (5.37)
>
> Bruce? Original thread can be read from...
>
> http://thread.gmane.org/gmane.linux.kernel/588972

The fixes were added in smartmontools CVS, but there hasn't been a
release since then.

-jim
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2007-10-26 4:02           ` Jim Paris
@ 2007-11-06 9:49             ` Bruce Allen
  0 siblings, 0 replies; 11+ messages in thread
From: Bruce Allen @ 2007-11-06 9:49 UTC (permalink / raw)
  To: Jim Paris
  Cc: Tejun Heo, Steen Eugen Poulsen, Andrew Morton, lists, Linux Kernel,
	IDE/ATA development list, bruce.allen

>>>> All these are caused by smartd. Updating should fix the problem.
>>>
>>> Okay, but there is no newer smartd than what I'm using. (5.37)
>>
>> Bruce? Original thread can be read from...
>>
>> http://thread.gmane.org/gmane.linux.kernel/588972
>
> The fixes were added in smartmontools CVS, but there hasn't been a
> release since then.

I think we'll do a new smartmontools release fairly soon.

Cheers,
Bruce
[parent not found: <29a863790710101217g607b5425g6a3374c2be1d75a5@mail.gmail.com>]
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
       [not found] <29a863790710101217g607b5425g6a3374c2be1d75a5@mail.gmail.com>
@ 2007-10-13 0:54 ` Andrew Morton
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2007-10-13 0:54 UTC (permalink / raw)
  To: Greg Cormier; +Cc: linux-kernel, linux-ide

On Wed, 10 Oct 2007 15:17:19 -0400
"Greg Cormier" <gcormier@gmail.com> wrote:

> I'd like to hop in on this, and add my similar problem. This is my
> first post so please excuse me if I'm doing something wrong.

Please cc linux-ide@vger.kernel.org on ide, sata and pata reports.

A "hard freeze" is fairly serious.

> I've been having issues recently (couple of weeks?) with my server. I
> have three WD5000YS (500gb) drives in RAID5, on an Asus A8N
> motherboard which is nForce 4. I've even RMA'd one of the drives, but
> now I'm thinking the drives are fine.
>
> The drive seems to have issues under heavy to moderate IO. I unmounted
> my raid, and forced an e2fsck. e2fsck didn't even print anything out,
> I got this.
>
> Oct 10 14:50:40 zeus kernel: ata3: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
> Oct 10 14:50:40 zeus kernel: ata3: CPB 0: ctl_flags 0x1f, resp_flags 0x2
> Oct 10 14:50:40 zeus kernel: ata3: timeout waiting for ADMA IDLE, stat=0x400
> Oct 10 14:50:40 zeus kernel: ata3: timeout waiting for ADMA LEGACY, stat=0x400
> Oct 10 14:50:40 zeus kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x1c00000 action 0x2 frozen
> Oct 10 14:50:40 zeus kernel: ata3.00: cmd 61/08:00:bf:4b:38/00:00:3a:00:00/40 tag 0 cdb 0x0 data 4096 out
> Oct 10 14:50:40 zeus kernel: res 40/00:f2:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Oct 10 14:50:40 zeus kernel: ata3: soft resetting port
> Oct 10 14:50:40 zeus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Oct 10 14:50:40 zeus kernel: ata3.00: configured for UDMA/133
> Oct 10 14:50:40 zeus kernel: ata3: EH complete
> Oct 10 14:50:40 zeus kernel: sd 2:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
> Oct 10 14:50:40 zeus kernel: sd 2:0:0:0: [sdb] Write Protect is off
> Oct 10 14:50:40 zeus kernel: sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Oct 10 14:51:40 zeus kernel: ata3: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
> Oct 10 14:51:40 zeus kernel: ata3: CPB 0: ctl_flags 0x1f, resp_flags 0x2
> Oct 10 14:51:40 zeus kernel: ata3: timeout waiting for ADMA IDLE, stat=0x400
> Oct 10 14:51:40 zeus kernel: ata3: timeout waiting for ADMA LEGACY, stat=0x400
> Oct 10 14:51:40 zeus kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x400000 action 0x2 frozen
> Oct 10 14:51:40 zeus kernel: ata3.00: cmd 61/08:00:bf:4b:38/00:00:3a:00:00/40 tag 0 cdb 0x0 data 4096 out
> Oct 10 14:51:40 zeus kernel: res 40/00:f2:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Oct 10 14:51:41 zeus kernel: ata3: soft resetting port
> Oct 10 14:51:41 zeus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Oct 10 14:51:41 zeus kernel: ata3.00: configured for UDMA/133
> Oct 10 14:51:41 zeus kernel: ata3: EH complete
> Oct 10 14:51:41 zeus kernel: sd 2:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
> Oct 10 14:51:41 zeus kernel: sd 2:0:0:0: [sdb] Write Protect is off
> Oct 10 14:51:41 zeus kernel: sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Oct 10 14:52:19 zeus kernel: device eth0 left promiscuous mode
> Oct 10 14:52:41 zeus kernel: ata3: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
> Oct 10 14:52:41 zeus kernel: ata3: CPB 0: ctl_flags 0x1f, resp_flags 0x2
> Oct 10 14:52:41 zeus kernel: ata3: timeout waiting for ADMA IDLE, stat=0x400
> Oct 10 14:52:41 zeus kernel: ata3: timeout waiting for ADMA LEGACY, stat=0x400
> Oct 10 14:52:41 zeus kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x400000 action 0x2 frozen
> Oct 10 14:52:41 zeus kernel: ata3.00: cmd 61/08:00:bf:4b:38/00:00:3a:00:00/40 tag 0 cdb 0x0 data 4096 out
> Oct 10 14:52:41 zeus kernel: res 40/00:f2:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Oct 10 14:52:41 zeus kernel: ata3: soft resetting port
> Oct 10 14:52:42 zeus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Oct 10 14:52:42 zeus kernel: ata3.00: configured for UDMA/133
> Oct 10 14:52:42 zeus kernel: ata3: EH complete
> Oct 10 14:52:42 zeus kernel: sd 2:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
> Oct 10 14:52:42 zeus kernel: sd 2:0:0:0: [sdb] Write Protect is off
> Oct 10 14:52:42 zeus kernel: sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> These errors have been happening on various .22 kernels, and this
> message is from the hot-off-the-press .23 kernel. This message is
> followed by a hard freeze.
>
> I'm in the process of figuring out why netconsole isn't quite working,
> so hopefully I can provide more information soon. The server is
> currently frozen, when I get home I can perhaps provide more
> information? lspci?
>
> Looks like another rebuild of the array when I get home.
>
>
> Thanks,
> Greg
[parent not found: <48B91198.3020803@xms.se>]
[parent not found: <alpine.DEB.1.10.0808300708330.19513@p34.internal.lan>]
[parent not found: <48B9AE15.7010605@xms.se>]
[parent not found: <alpine.DEB.1.10.0808301710380.11166@p34.internal.lan>]
[parent not found: <48B9C261.4010609@xms.se>]
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
       [not found]         ` <48B9C261.4010609@xms.se>
@ 2008-08-30 22:12           ` Justin Piszcz
  2008-08-31 10:00             ` Jonas Petersson
  2008-09-02 13:39             ` Owen Martin
  0 siblings, 2 replies; 11+ messages in thread
From: Justin Piszcz @ 2008-08-30 22:12 UTC (permalink / raw)
  To: Jonas Petersson; +Cc: linux-ide, smartmontools-support, linux-kernel

On Sat, 30 Aug 2008, Jonas Petersson wrote:

> Justin Piszcz wrote:
>> On Sat, 30 Aug 2008, Jonas Petersson wrote:
>>> [...]
>> smartctl -a would be useful (#1)
>
> # smartctl -a /dev/sda
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/

I have the same controller in my host as well, but it does not appear to
matter whether it happens on the ICH8 controller or other controllers.

I have noticed on Velociraptors I seem to get the same/similar error that
you do as well, and I ran all the same tests as you, to no avail as to getting
any closer to finding the root cause/problem.
(.. more so than the regular old raptor150s)

Besides the annoying messages in the kernel log/syslog/dmesg, does it
affect your system stability in any way as of yet?

I must add a very important note here though, you are using an ICH8 chipset
and so am I, we both have same/similar problems-- however, I also have
another machine setup VERY similarly (except different HDDs) for the RAID5
but the RAID1 is the same as one of my ICH8 boxes (dual raptor150s)--
and to date it has never? or rarely thrown the frozen error except when a disk
actually failed (or when NCQ is enabled for a WD drive), (NCQ+Linux for WD) is
broken.

I have disks in a raid set (both raid1 and raid5) that get same/similar
warnings as I mentioned above and so far it has not had any impact that I
have noticed in relation to these specific errors.

I think for now we just have to live with them, I am not sure what else to
say here..

CC'ing linux-ide and linux-kernel with your original error from the start
of this e-mail thread:

Here is a snippet from this morning - this time it came back to life:

[46874.898690] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[46874.898703] ata3.00: cmd c8/00:08:90:3c:59/00:00:00:00:00/ef tag 0 dma 4096 in
[46874.898705] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[46874.898709] ata3.00: status: { DRDY }
[46879.643962] ata3: port is slow to respond, please be patient (Status 0xd0)
[46884.473195] ata3: device not ready (errno=-16), forcing hardreset
[46884.473202] ata3: soft resetting link
[46912.740010] ata3.00: qc timeout (cmd 0xec)
[46912.740020] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[46912.740023] ata3.00: revalidation failed (errno=-5)
[46912.740028] ata3: failed to recover some devices, retrying in 5 secs
[46917.458070] ata3: soft resetting link
[46917.636464] ata3.00: configured for UDMA/100
[46917.636482] ata3: EH complete
[46917.699224] sd 2:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
[46917.699257] sd 2:0:0:0: [sda] Write Protect is off
[46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
[46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Here is an example from my host (same/similar issue):

Aug 23 20:00:32 p34 kernel: [189770.219773] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 23 20:00:32 p34 kernel: [189770.219784] ata1.00: cmd 35/00:40:9a:d9:7a/00:00:12:00:00/e0 tag 0 dma 32768 out
Aug 23 20:00:32 p34 kernel: [189770.219786] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 23 20:00:32 p34 kernel: [189770.219790] ata1.00: status: { DRDY }
Aug 23 20:00:32 p34 kernel: [189770.219795] ata1: hard resetting link
Aug 23 20:00:32 p34 kernel: [189770.524770] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 23 20:00:32 p34 kernel: [189770.543960] ata1.00: configured for UDMA/133
Aug 23 20:00:32 p34 kernel: [189770.543977] ata1: EH complete
Aug 23 20:00:32 p34 kernel: [189770.544810] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Write Protect is off
Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Aug 23 20:00:32 p34 kernel: [189770.863810] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

What is the root cause of this? It still seems to be a mystery to most as far
as I can tell, but the one thing in common is we are both using ICH8 chipsets,
which, just may happen to be part of the problem?

Justin.
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2008-08-30 22:12           ` Justin Piszcz
@ 2008-08-31 10:00             ` Jonas Petersson
  2008-09-02 13:39             ` Owen Martin
  1 sibling, 0 replies; 11+ messages in thread
From: Jonas Petersson @ 2008-08-31 10:00 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-ide, smartmontools-support, linux-kernel

Hi again Justin,

Justin Piszcz wrote:
> On Sat, 30 Aug 2008, Jonas Petersson wrote:
>> Justin Piszcz wrote:
>>> On Sat, 30 Aug 2008, Jonas Petersson wrote:
>>>> [...]
>>> smartctl -a would be useful (#1)
>> # smartctl -a /dev/sda
>> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
>> Home page is http://smartmontools.sourceforge.net/
>
> I have the same controller in my host as well, but it does not appear to
> matter whether it happens on the ICH8 controller or other controllers.
>
> I have noticed on Velociraptors I seem to get the same/similar error that
> you do as well, and I ran all the same tests as you, to no avail as to getting
> any closer to finding the root cause/problem.
> (.. more so than the regular old raptor150s)
>
> Besides the annoying messages in the kernel log/syslog/dmesg, does it
> affect your system stability in any way as of yet?

Very much so, yes. At best, all disk access will hang for a while and
then resume after the reset has worked out - this often happens a couple
of times per day now. At worst, the reset will not work and the disk is
remounted read-only and I can sort of use the system a bit this way. It
seems somewhat random how much still works: Up until today I could at
least always use dmesg and tail various logs to try to hunt down what
happened, but this morning dmesg could not be found and I got I/O errors
when accessing anything in /var/log. Rebooting helped as usual. This
fatal variant has happened about every second day lately.

The first two weeks I had the system showed nothing at all like this: I
have log files since July 26 and the first recorded (resettable) glitch
is from Aug 16. Obviously, any non-resettable problem would have been
easy to spot.

> I must add a very important note here though, you are using an ICH8 chipset
> and so am I, we both have same/similar problems-- however, I also have
> another machine setup VERY similarly (except different HDDs) for the RAID5
> but the RAID1 is the same as one of my ICH8 boxes (dual raptor150s)--
> and to date it has never? or rarely thrown the frozen error except when a disk
> actually failed (or when NCQ is enabled for a WD drive), (NCQ+Linux for WD) is
> broken.

Yes, I would not point fingers at the ICH8 chipset either: The other
MacBookPro I have experimented with now is a 2,2 (ATI based) and has
ICH7, but I'm 99.9% sure my previous MacBookPro 3,1 (nvidia based) was
ICH8 and it worked flawlessly (I saw no reason to swap for the 4,1
version, but it was stolen from me in June). As far as I know the
significant differences with my current MBP are just: higher screen
resolution, multitouch ("iphone") touchpad and more memory. Alas, I
didn't keep an lshw dump.

> [...]
> CC'ing linux-ide and linux-kernel with your original error from the start
> of this e-mail thread:
>
> Here is a snippet from this morning - this time it came back to life:
>
> [46874.898690] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> [46874.898703] ata3.00: cmd c8/00:08:90:3c:59/00:00:00:00:00/ef tag 0 dma 4096 in
> [46874.898705] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [46874.898709] ata3.00: status: { DRDY }
> [46879.643962] ata3: port is slow to respond, please be patient (Status 0xd0)
> [46884.473195] ata3: device not ready (errno=-16), forcing hardreset
> [46884.473202] ata3: soft resetting link
> [46912.740010] ata3.00: qc timeout (cmd 0xec)
> [46912.740020] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [46912.740023] ata3.00: revalidation failed (errno=-5)
> [46912.740028] ata3: failed to recover some devices, retrying in 5 secs
> [46917.458070] ata3: soft resetting link
> [46917.636464] ata3.00: configured for UDMA/100
> [46917.636482] ata3: EH complete
> [46917.699224] sd 2:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
> [46917.699257] sd 2:0:0:0: [sda] Write Protect is off
> [46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
> [46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I'll just clarify that the errno after "revalidation failed" is not
always -5. When it ends up fatal I've also seen -3 and possibly
something else too. I would have taken a screen shot this morning if
only dmesg had worked. :-(

> What is the root cause of this? It still seems to be a mystery to most as far
> as I can tell, but the one thing in common is we are both using ICH8 chipsets,
> which, just may happen to be part of the problem?

For the record: My current theory is that it is some kind of hardware
problem - either in the disk or on the motherboard - so I have persuaded
my local AppleStore to swap the harddisk on Monday and then they will
run their full hardware stress test (4+ hours according to him). The
stress test was apparently suggested by the central repair people (who
have no idea I run Linux on it - the local techie knows, but has no
problem with it as long as I keep a small OSX partition) so I guess this
sort of hints that they are aware of hardware issues.

(Note: I've had the same techie replace a broken motherboard in the past
when the Linux messages were at least as clear as the OSX ones - in that
case drives would in the end only show up in the boot menu when the
system had cooled down for at least 20 minutes. To be on the safe side,
I've upped the minimum fan speed by 50% to ensure all sensors give me
happy readings all the time - luckily the 4,1 fans are very silent
compared to the 2,2)

I hope to have everything back in shape on Wednesday and I'll let you
know how it fares.

BTW: For a while I displayed the hddtemp sensor all the time along with
coretemp etc, but I now understand that this is also SMART based so I've
turned it off in the past weeks' experimentation. Again, it seemed to
work flawlessly for months on my previous (stolen) MBP 3,1.

Best / Jonas
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2008-08-30 22:12           ` Justin Piszcz
  2008-08-31 10:00             ` Jonas Petersson
@ 2008-09-02 13:39             ` Owen Martin
  2008-09-02 15:49               ` [smartmontools-support] " Jonas Petersson
  1 sibling, 1 reply; 11+ messages in thread
From: Owen Martin @ 2008-09-02 13:39 UTC (permalink / raw)
  To: Justin Piszcz, Jonas Petersson
  Cc: linux-ide, smartmontools-support, linux-kernel

This looks like a timeout during a read command:

ata3.00: cmd c8/00:08:90:3c:59
Read dma of 8 blocks from 0x903c59

Next time it happens, see if it is the same LBA. The fact that the drive
came back after the bus reset makes me think it was probably in error
recovery for an extended amount of time.

Sorry, but I am new to using smartmontools for decoding SMART
attributes. Your previous email showed:

Device is: Not in smartctl database [for details use: -P showall]

Does that imply the tool will not know the exact meaning of all the
attributes? I am not familiar with Fujitsu's implementation.

From the data you sent about the attributes before, it looks like the
pending and reallocated sector counts are zero, so the block must not
have failed recovery.

Can you try to dump the sector using hdparm-8.9 to see if it reproduces?

hdparm --read-sector 9452633 /dev/sda

What is the timeout set to?

cat /sys/block/sda/device/timeout

Maybe try to increase that. You want to be sure that it is not a drive
issue by verifying the block is readable and the raw values from the
pending, uncorrectable or reallocated sector attributes don't change.

I was seeing the exact same thing when I was trying to run the SMART
selftest in captive mode (not using smartmon). When I increased the
timeout it was able to complete.

Aug 27 02:36:42 spu0201 user.err kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Aug 27 02:36:42 spu0201 user.err kernel: ata1.00: cmd b0/d4:00:83:4f:c2/00:00:00:00:00/00 tag 0
Aug 27 02:36:42 spu0201 user.warn kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Justin's error is from a write:

ata1.00: cmd 35/00:40:9a:d9:7a/00:00:12:00:00/e0 tag 0 dma 32768 out

That typically only happens in a high vibration environment. Since the
write is open loop, typically the only thing that can prevent it from
completing is position error. It might be a PHY issue, but without a bus
analyzer, it is hard to tell. The new Seagate drives have attribute 199,
SATA R-err count, which might help to identify the issue, if you think
it is related to the chipset/PHY.

-Owen
2:0:0:0: [sda] Write Protect is off [46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00 [46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Here is an example from my host (same/similar issue): Aug 23 20:00:32 p34 kernel: [189770.219773] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Aug 23 20:00:32 p34 kernel: [189770.219784] ata1.00: cmd 35/00:40:9a:d9:7a/00:00:12:00:00/e0 tag 0 dma 32768 out Aug 23 20:00:32 p34 kernel: [189770.219786] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Aug 23 20:00:32 p34 kernel: [189770.219790] ata1.00: status: { DRDY } Aug 23 20:00:32 p34 kernel: [189770.219795] ata1: hard resetting link Aug 23 20:00:32 p34 kernel: [189770.524770] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Aug 23 20:00:32 p34 kernel: [189770.543960] ata1.00: configured for UDMA/133 Aug 23 20:00:32 p34 kernel: [189770.543977] ata1: EH complete Aug 23 20:00:32 p34 kernel: [189770.544810] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB) Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Write Protect is off Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 Aug 23 20:00:32 p34 kernel: [189770.863810] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA What is the root cause of this? It still seems to be a mystery to most as far as I can tell, but the one thing in common is we are both using ICH8 chipsets, which, just may happen to be part of the problem? Justin. 
------------------------------------------------------------------------ - This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/ _______________________________________________ Smartmontools-support mailing list Smartmontools-support@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/smartmontools-support ------------------------------------------------------------------------- This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/ ^ permalink raw reply [flat|nested] 11+ messages in thread
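The manual reading of the `cmd` lines in the message above can also be done mechanically. What follows is only a minimal Python sketch, assuming the field order that libata's error handler prints (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device). The function name `decode_taskfile` and the small `RW_OPCODES` table are made up for illustration and cover only the four DMA read/write opcodes; anything else is returned undecoded. Under this assumption the three LBA bytes are printed low byte first, so the result may differ from a left-to-right reading of those bytes and is worth double-checking before passing a sector number to `hdparm --read-sector`.

```python
import re

# Illustrative, non-exhaustive table: opcode -> (name, uses 48-bit addressing).
RW_OPCODES = {
    0xC8: ("READ DMA", False),
    0xCA: ("WRITE DMA", False),
    0x25: ("READ DMA EXT", True),
    0x35: ("WRITE DMA EXT", True),
}

# Five colon-separated hex bytes, e.g. "00:08:90:3c:59".
_HEX5 = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
TASKFILE_RE = re.compile(
    r"cmd ([0-9a-f]{2})/" + _HEX5 + "/" + _HEX5 + r"/([0-9a-f]{2})",
    re.IGNORECASE,
)


def decode_taskfile(line):
    """Decode a complete libata 'cmd xx/..:../..:../xx' log line (assumed layout)."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no complete 'cmd' taskfile found in line")

    command = int(match.group(1), 16)
    # Assumed order: feature, nsect, lbal, lbam, lbah, then the same for HOB.
    _feature, nsect, lbal, lbam, lbah = (
        int(b, 16) for b in match.group(2).split(":")
    )
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in match.group(3).split(":")
    )
    device = int(match.group(4), 16)

    if command not in RW_OPCODES:
        # Not a read/write opcode from the table: leave the fields undecoded.
        return {"command": command, "name": None, "sectors": None, "lba": None}

    name, is_ext = RW_OPCODES[command]
    low24 = (lbah << 16) | (lbam << 8) | lbal
    if is_ext:
        # 48-bit command: the HOB bytes carry LBA bits 24..47.
        lba = (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24) | low24
        sectors = ((hob_nsect << 8) | nsect) or 65536  # a count of 0 means 65536
    else:
        # 28-bit command: LBA bits 24..27 sit in the low nibble of the device byte.
        lba = ((device & 0x0F) << 24) | low24
        sectors = nsect or 256  # a count of 0 means 256
    return {"command": command, "name": name, "sectors": sectors, "lba": lba}
```

For the ata3.00 line quoted in this thread the sketch reports 8 sectors, which agrees with the `dma 4096 in` field (8 x 512 bytes), and for the ata1.00 write it reports 64 sectors, matching `dma 32768 out`. The truncated form `cmd c8/00:08:90:3c:59` is rejected because the HOB and device bytes are needed for a full address.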
* Re: [smartmontools-support] exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2008-09-02 13:39 ` Owen Martin
@ 2008-09-02 15:49 ` Jonas Petersson
  2008-09-07 20:48 ` Jonas Petersson
  0 siblings, 1 reply; 11+ messages in thread
From: Jonas Petersson @ 2008-09-02 15:49 UTC (permalink / raw)
  To: Owen Martin; +Cc: Justin Piszcz, linux-ide, smartmontools-support, linux-kernel

Hi Owen,

Owen Martin wrote:
> This looks like a timeout during a read command:
>
> ata3.00: cmd c8/00:08:90:3c:59
>
> Read dma of 8 blocks from 0x903c59
>
> Next time it happens, see if it is the same LBA. Since the drive came
> back after the bus reset makes me think it was probably in error
> recovery for an extended amount of time.

Sounds like a good idea. However, I had the drive swapped yesterday and
have now reinstalled on a (seemingly) identical one which so far seems
to be free from these messages. Hence, I keep my fingers crossed that
this was indeed a hw error. As it was on warranty I was not allowed to
keep the bad drive for further experiments.

> Sorry, but I am new to using smartmontools for decoding SMART
> attributes. Your previous email showed:
>
> Device is: Not in smartctl database [for details use: -P showall]
>
> Does that imply the tool will not know the exact meaning of all the
> attributes? I am not familiar with Fujitsu's implementation.

I believe you are correct.

> From the data you sent about the attributes before, it looks like the
> pending and reallocated sector counts are zero, so the block must have
> not failed recovery. Can you try to dump the sector using hdparm-8.9 to
> see if it reproduces?
>
> hdparm --read-sector 9452633 /dev/sda

Would if I could... The messages I sent were indeed only from cases
where the driver succeeded in writing to the disk in the end (extracts
from /var/log/messages). In the failure cases I did not make a hard copy.

> What is the timeout set to?
>
> cat /sys/block/sda/device/timeout

30

> Maybe try to increase that. You want to be sure that it is not a drive
> issue by verifying the block is readable and the raw values from the
> pending, uncorrectable or reallocated sector attributes don't change.

Will do if I ever see it again.

Best / Jonas

^ permalink raw reply [flat|nested] 11+ messages in thread
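Checking and raising the command timeout discussed in the message above can be scripted. The following is a minimal Python sketch, assuming the sysfs path quoted in the thread (`/sys/block/<disk>/device/timeout`) and that the value is a whole number of seconds. The function names and the `sysfs_root` parameter are hypothetical, added only so the helpers can be tried against a scratch directory.

```python
from pathlib import Path


def timeout_path(disk, sysfs_root="/sys"):
    """Build the path of the per-device command timeout file (assumed layout)."""
    return Path(sysfs_root) / "block" / disk / "device" / "timeout"


def read_timeout(disk, sysfs_root="/sys"):
    """Return the current command timeout in seconds."""
    return int(timeout_path(disk, sysfs_root).read_text().strip())


def write_timeout(disk, seconds, sysfs_root="/sys"):
    """Set a new command timeout in seconds."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_path(disk, sysfs_root).write_text(f"{int(seconds)}\n")
```

On a real system, `read_timeout("sda")` corresponds to the `cat` command quoted above. Writing needs root privileges, and a value written through sysfs is not kept across a reboot, so it suits a one-off experiment such as a long captive self-test rather than a permanent setting.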
* Re: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  2008-09-02 15:49 ` [smartmontools-support] " Jonas Petersson
@ 2008-09-07 20:48 ` Jonas Petersson
  0 siblings, 0 replies; 11+ messages in thread
From: Jonas Petersson @ 2008-09-07 20:48 UTC (permalink / raw)
  Cc: linux-ide, smartmontools-support, linux-kernel

For the record:

Jonas Petersson wrote:
> Owen Martin wrote:
> > This looks like a timeout during a read command:
> >
> > ata3.00: cmd c8/00:08:90:3c:59
> >
> > Read dma of 8 blocks from 0x903c59
> >
> > Next time it happens, see if it is the same LBA. Since the drive came
> > back after the bus reset makes me think it was probably in error
> > recovery for an extended amount of time.
>
> Sounds like a good idea. However, I had the drive swapped yesterday and
> have now reinstalled on a (seemingly) identical one which so far seems
> to be free from these messages. Hence, I keep my fingers crossed that
> this was indeed a hw error.
> [...]

I've now stressed the new disk for almost a week and seen no indication
at all of the previous error. Everything else is the same as before - I
even installed from the very same DVD. My conclusion is therefore that I
really had a disk that was broken in a way that normal tests will not
detect.

Hence, my tip to anyone having a similar experience: Don't blame the
driver, nor the motherboard/chipset - just replace the drive.

It would of course be even nicer if the error message could spell this
out somewhat more clearly too, but I guess the "I/O error" in the middle
is a fair hint in retrospect.

Best / Jonas

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2008-09-07 20:48 UTC | newest]
Thread overview: 11+ messages
[not found] <56020.194.255.108.253.1192004925.squirrel@root.dusted.dk>
2007-10-13 0:41 ` exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen Andrew Morton
2007-10-13 5:49 ` Steen Eugen Poulsen
2007-10-23 9:55 ` Tejun Heo
[not found] ` <471F2647.4030304@lix-world.net>
2007-10-26 1:40 ` Tejun Heo
2007-10-26 4:02 ` Jim Paris
2007-11-06 9:49 ` Bruce Allen
[not found] <29a863790710101217g607b5425g6a3374c2be1d75a5@mail.gmail.com>
2007-10-13 0:54 ` Andrew Morton
[not found] <48B91198.3020803@xms.se>
[not found] ` <alpine.DEB.1.10.0808300708330.19513@p34.internal.lan>
[not found] ` <48B9AE15.7010605@xms.se>
[not found] ` <alpine.DEB.1.10.0808301710380.11166@p34.internal.lan>
[not found] ` <48B9C261.4010609@xms.se>
2008-08-30 22:12 ` Justin Piszcz
2008-08-31 10:00 ` Jonas Petersson
2008-09-02 13:39 ` Owen Martin
2008-09-02 15:49 ` [smartmontools-support] " Jonas Petersson
2008-09-07 20:48 ` Jonas Petersson