From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Andrew Lyon"
Subject: Re: Scary Intel SATA problem: "frozen"
Date: Wed, 6 Dec 2006 18:45:10 +0000
Message-ID:
References: <456CB72A.3010004@local.se> <456CDB06.40806@gmail.com> <457704BA.7090001@local.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from ug-out-1314.google.com ([66.249.92.171]:50966 "EHLO
        ug-out-1314.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S937101AbWLFSpM (ORCPT );
        Wed, 6 Dec 2006 13:45:12 -0500
Received: by ug-out-1314.google.com with SMTP id 44so230220uga for ;
        Wed, 06 Dec 2006 10:45:11 -0800 (PST)
In-Reply-To: <457704BA.7090001@local.se>
Content-Disposition: inline
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: jonas@local.se
Cc: Tejun Heo , linux-ide@vger.kernel.org

On 12/6/06, Jonas Lundgren wrote:
> Tejun Heo wrote:
> [--snip--]
>
> >> IF the system does recover, I start getting
> >> the extremely low disk write speeds that I reported above, and only a
> >> reboot will get the performance back to regular.
> >
> > Please full dmesg after your computer got really slow. I suspect libata
> > decided to switch to PIO mode.
> Here's the relevant part, if you want the whole dmesg look at:
> http://pastebin.ca/269581
>
> [--snip--]
> [82048.255126] can't create port
> [85055.578172] reiser4[unrar(30787)]: disable_write_barrier (fs/reiser4/wander.c:234)[zam-1055]:
> [85055.578174] NOTICE: md5 does not support write barriers, using synchronous write instead.
> [87825.501998] can't create port
> [89520.019538] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> [89520.019545] ata2.00: cmd c8/00:08:fe:68:df/00:00:00:00:00/e1 tag 0 data 4096 in
> [89520.019547] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [89520.322292] ata2: soft resetting port
> [89527.515891] ata2: port is slow to respond, please be patient (Status 0xd0)
> [89550.457913] ata2: port failed to respond (30 secs, Status 0xd0)
> [89550.457917] ata2: softreset failed (device not ready)
> [89550.457921] ata2: softreset failed, retrying in 5 secs
> [89555.454103] ata2: hard resetting port
> [89562.799693] ata2: port is slow to respond, please be patient (Status 0x80)
> [89585.740239] ata2: port failed to respond (30 secs, Status 0x80)
> [89585.740242] ata2: COMRESET failed (device not ready)
> [89585.740245] ata2: hardreset failed, retrying in 5 secs
> [89590.736978] ata2: hard resetting port
> [89598.081854] ata2: port is slow to respond, please be patient (Status 0x80)
> [89617.604742] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [89617.611034] ata2.00: configured for UDMA/100
> [89617.611042] ata2: EH complete
> [89617.623426] SCSI device sdb: 145226112 512-byte hdwr sectors (74356 MB)
> [89617.633551] sdb: Write Protect is off
> [89617.633553] sdb: Mode Sense: 00 3a 00 00
> [89617.637765] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> >
> >> I don't know what causes it, but most of the times when I've gotten it
> >> my system has been under heavy load (compiling, downloading torrents in
> >> 11mb/sec etc). Please let me know if you want any additional info, want
> >> me to try something out, or whatever. My recent hardware upgrade for
> >> around $1200 (to a core2duo system, i965 mobo) is just going to waste
> >> because of this problem. :/
> >
> > Heh, nice machine you got there.
> > When you look at the dmesg, do the
> > error messages occur only on one of the two drives? Or are both
> > affected? If only one is affected,
> >
> > 1. swap the two. you'll probably have to dance a little bit with boot
> > loader but md should handle that fine once the kernel is loaded. does
> > the errors persist? on which device do they occur? do they follow the
> > drive or stay on the mobo port?
> It follows the drive. (Hardware problem?)
>
> >
> > 2. try different cable / port. if you change port, again, you need to
> > dance w/ boot loader. who's carrying the error messages with it?
> Read above.
>
> >
> > 3. try different power plug from different power lane.
> I've got a really good power supply, which can handle max 560W on the +12
> / -12 V rail alone.
>
> >
> >> I just got so glad when I saw the post of this on linux-ide, I've been
> >> searching like crazy to find another person having the same problem (and
> >> possibly a solution) for the past 2-3 weeks or so.
> >
> > My first guess is frequent transmission errors. Please report the test
> > results. Thanks.
> >
>
> I guess it could only be a hardware problem since the error follows the
> drive, and both the drives are identical, so it can't be a firmware
> problem. Correct me if I'm wrong.
>
> I just checked the smart status, and the drive passes, but it seems like
> it's going down though, on the other hand I might misread the results.
>
> smartctl -d ata -A /dev/sdb
> smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   113   111   021    Pre-fail  Always       -       4875
>   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       237
>   5 Reallocated_Sector_Ct   0x0033   153   153   140    Pre-fail  Always       -       747
>   7 Seek_Error_Rate         0x000b   100   253   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       18117
>  10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
> 194 Temperature_Celsius     0x0022   117   108   000    Old_age   Always       -       33
> 196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       639
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0009   200   179   051    Pre-fail  Offline      -       0
>
>
> The "Reallocated_Sector_Ct" and "Reallocated_Event_Count" worries me..
> Should I be worried?

Yes, they are a sign that the drive is wearing out!

Andy

> --
> -Jonas
>
> Name: Jonas Lundgren
> ICQ#: 52064961
> Mail: jonas@local.se
> IRC: neon / neonman @ EFnet, Undernet, Quakenet, freenode
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
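
For anyone who wants to read the SMART table quoted above mechanically, here is a rough sketch in Python. It is only an illustration, not part of smartmontools: the names SMART_TABLE, WEAR_ATTRIBUTES, parse_rows and flag_wear are made up for this example, and the choice of which attributes count as "wear indicators" is my own assumption. It parses the attribute rows, picks out the reallocation-related ones with a non-zero raw count, and reports how far the normalized VALUE still sits above THRESH.

```python
# Attribute rows copied from the "smartctl -d ata -A /dev/sdb" output quoted above.
SMART_TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   113   111   021    Pre-fail  Always       -       4875
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       237
  5 Reallocated_Sector_Ct   0x0033   153   153   140    Pre-fail  Always       -       747
  7 Seek_Error_Rate         0x000b   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       18117
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
194 Temperature_Celsius     0x0022   117   108   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       639
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   179   051    Pre-fail  Offline      -       0
"""

# Assumption: attributes treated as wear indicators in this sketch.
WEAR_ATTRIBUTES = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_rows(text):
    """Turn smartctl attribute rows into dicts; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of a row.
        fields = line.lstrip("> ").split()
        # Attribute rows have exactly 10 columns and start with a numeric ID.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized value
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": int(fields[9]),     # raw counter (plain integer in this table)
        })
    return rows


def flag_wear(rows):
    """Return (name, raw count, VALUE - THRESH) for wear attributes with raw > 0."""
    return [
        (r["name"], r["raw"], r["value"] - r["thresh"])
        for r in rows
        if r["name"] in WEAR_ATTRIBUTES and r["raw"] > 0
    ]


if __name__ == "__main__":
    for name, raw, margin in flag_wear(parse_rows(SMART_TABLE)):
        print(f"{name}: raw={raw}, normalized value is {margin} above threshold")
```

On this table it picks out Reallocated_Sector_Ct (raw 747, VALUE 153 against THRESH 140) and Reallocated_Event_Count (raw 639). smartctl only marks an attribute as failed once VALUE drops to or below THRESH, which would explain why the drive still "passes" even though the raw counts say it has been remapping sectors.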