From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Morton
Subject: Re: Strange freezes (seems like SATA related)
Date: Thu, 1 Nov 2007 16:40:46 -0700
Message-ID: <20071101164046.462f40f0.akpm@linux-foundation.org>
References: <47261043.5020907@qualcomm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
Received: from smtp2.linux-foundation.org ([207.189.120.14]:36502 "EHLO
	smtp2.linux-foundation.org" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1753817AbXKAXkv (ORCPT );
	Thu, 1 Nov 2007 19:40:51 -0400
In-Reply-To: <47261043.5020907@qualcomm.com>
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: Max Krasnyansky
Cc: linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org

On Mon, 29 Oct 2007 09:54:27 -0700 Max Krasnyansky wrote:

> A couple of HP xw9300 machines (dual Opterons) started freezing up.
> We're running on 2.6.22.1 on them. Freezes are somewhat weird. VGA console is alive
> (I can switch vts, etc) but everything else is dead (network, etc).
> Unfortunately SYSRQ was not enabled and I could not get backtraces and stuff.
>
> Hooked up serial console and the only error that shows up is this.
>
> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Descriptor sense data with sense descriptors (in hex):
> end_request: I/O error, dev sda, sector 8388695
> Buffer I/O error on device sda1, logical block 1048579
> lost page write due to I/O error on sda1
> sd 0:0:0:0: [sda] Write Protect is off
>
> I see a bunch of those and then the box just sits there spewing this periodically
>
> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>
> SMART selftest on the drive passed without errors.
>
> Here is what this machine looks like
>
> ...

So this happens on more than one machine?

The kernel shouldn't freeze, so even if both machines have magically
identical hardware faults, there's a kernel bug there somewhere.

I guess it would be useful to test a 2.6.23 kernel if poss. We've seen a
very large number of reports like this one in recent months (many of which
have not been responded to, btw) and perhaps someone has done something
about them.
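
As an aside, the taskfile on the "cmd" lines in the quoted log can be
decoded by hand to see which sectors the timed-out writes were aimed at.
Below is a rough Python sketch of that, not a definitive tool. It assumes
the field order that libata's error handler prints
(command/feature:nsect:lbal:lbam:lbah, then the five HOB bytes, then
device), that command 0xca is an LBA28 WRITE DMA, and that sectors are 512
bytes. The name decode_lba28_cmd is made up for illustration.

```python
import re

# Assumed layout of the libata error-handler "cmd" printout:
#   command/feature:nsect:lbal:lbam:lbah/hob_feature:...:hob_lbah/device
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_lba28_cmd(line):
    """Decode an LBA28 taskfile from a libata 'cmd' log line (hypothetical helper)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata 'cmd' taskfile found in line")

    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    device = int(m.group(4), 16)

    # LBA28: bits 27..24 live in the low nibble of the device register.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal

    # In LBA28 commands a sector count of 0 means 256 sectors.
    sectors = nsect if nsect != 0 else 256

    return {
        "command": command,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
    }


if __name__ == "__main__":
    first = "ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out"
    print(decode_lba28_cmd(first))
```

For the first failure this gives LBA 8388695, which agrees with the
"end_request: I/O error, dev sda, sector 8388695" line, and 8 sectors of
512 bytes matches "data 4096 out". So the error being reported is a plain
4k write that timed out.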