From mboxrd@z Thu Jan 1 00:00:00 1970
From: Danilo Godec
Subject: Re: SATA errors?
Date: Wed, 01 Oct 2008 23:36:13 +0200
Message-ID: <48E3ED4D.6000409@agenda.si>
References: <48E34221.1000008@agenda.si> <20081001100833.3253724847@gemini.denx.de> <48E353D5.1060908@agenda.si> <48E36A61.4020903@dgreaves.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: David Lethe
Cc: Linux RAID Mailing List
List-Id: linux-raid.ids

I don't want to start any holy wars, but I'm not using a RAID
controller. It's just a plain old on-board SATA controller (at least
that's what I think it is):

00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage Controller AHCI (rev 09)

David Lethe wrote:
> If this was my system, I would ...
> 1) First check into upgrading firmware/bios/drivers of disk controller.
> 2) Look at cron jobs and see if anything that needs capacity runs around
> the time the errors are reported. Something has to run to start this
> off, so you need to find it.
> 3) Use logger and a shell script to try to catch system in the 15 second
> window when you have this problem, and see what programs are running.
> 4) Actually, if this was my system, and if I/O wasn't actually being
> suspended during those 15 seconds, then I probably would do step 1 only,
> and if everything is current, then I would move on and not worry about
> it. Even if you find the offending program, then that doesn't mean
> that the author of the program has or will make an acceptable change in
> their code.
>

1. I will get a new server in a couple of days, and then I'll be able
to move the Xen VMs off the 'problematic' server. Then I'll see what
can be updated/upgraded.

2. The errors are pretty much random and there is nothing in cron at
all. I don't think the Xen VMs can do anything with the physical drive,
so their cron jobs shouldn't be relevant.

3. It's not really a problem that we (the users) would notice. It's
just the logs that got me worried (I don't like unexplained hard drive
errors).

4. As I said before, I changed the scripts to run 'smartctl' against one
of the other drives. So far it seems better: there hasn't been a single
error in 12 hours.

> The problem you have from SCSI perspective is the bozo who wrote this
> chunk of code did it the wrong way. The CORRECT way to determine
> addressable blocks is to send out the READCAP10, look for return value
> of FFFFFFFFh blocks, then issue a 16-byte (0x9e 0x10) READ CAPACITY,
> because you have > FFFFFFFE blocks on the disk. This architect never
> imagined that the READCAP10 would have to deal with large disks, and
> assumed if there was a problem, then the disk needs to be reset.

If it turns out that 'smartctl' was causing this, I'll report it to the
'smartmontools' guys.

Thanks for the help,

Danilo
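
For reference, the READ CAPACITY sequence described in the quoted text
could be sketched roughly as below in Python. This is only an
illustration of the decision logic, not code from smartmontools or the
kernel. The two callables send_readcap10 and send_readcap16 are
hypothetical stand-ins for whatever actually issues the commands and
returns the raw response bytes. The layouts assumed are the standard
ones: READ CAPACITY(10) returns a 4-byte last LBA plus a 4-byte block
length, and READ CAPACITY(16) (opcode 0x9e, service action 0x10)
returns an 8-byte last LBA plus a 4-byte block length, all big-endian.

```python
import struct
from typing import Callable, Tuple

# READ CAPACITY(10) reports this last-LBA value when the real capacity
# does not fit in 32 bits; it is a signal to retry with the 16-byte form.
READCAP10_OVERFLOW = 0xFFFFFFFF


def parse_readcap10(data: bytes) -> Tuple[int, int]:
    """Return (last_lba, block_length) from a READ CAPACITY(10) response."""
    return struct.unpack(">II", data[:8])


def parse_readcap16(data: bytes) -> Tuple[int, int]:
    """Return (last_lba, block_length) from a READ CAPACITY(16) response."""
    return struct.unpack(">QI", data[:12])


def addressable_blocks(
    send_readcap10: Callable[[], bytes],  # hypothetical transport helper
    send_readcap16: Callable[[], bytes],  # hypothetical transport helper
) -> Tuple[int, int]:
    """Return (number_of_blocks, block_length) for a disk."""
    last_lba, block_len = parse_readcap10(send_readcap10())
    if last_lba == READCAP10_OVERFLOW:
        # Large disk: the 10-byte command cannot express the size, so
        # ask again with the 16-byte command instead of treating the
        # answer as an error.
        last_lba, block_len = parse_readcap16(send_readcap16())
    return last_lba + 1, block_len
```

The point of the sketch is that FFFFFFFFh from the 10-byte command is
treated as "ask again with the 16-byte command", not as a fault that
calls for resetting the disk.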