Re: raid5 failure + libata irq: nobody cared

From: Bill Davidsen <davidsen@tmr.com>
To: "Vincze, Tamas" <vincze@neb.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: raid5 failure + libata irq: nobody cared
Date: Mon, 19 Nov 2007 09:16:41 -0500
Message-ID: <47419AC9.8080105@tmr.com>
In-Reply-To: <473DCDE3.5060904@neb.com>

Vincze, Tamas wrote:
> Hi,
>
> Last night a drive failed in my RAID5 array and it was kicked
> out of the array, continuing with 3 drives as expected.
>
> However a few minutes later this was logged:
>
> irq 18: nobody cared (try booting with the "irqpoll" option)
> Call Trace: <IRQ> <ffffffff8015b930>{__report_bad_irq+48}
>    <ffffffff8015bb2e>{note_interrupt+433} <ffffffff8015b444>{__do_IRQ+191}
>
> IRQ 18 belongs to the SATA controller where all 4 drives are connected.

The troubling thing is that the controller was still in use, so a
handler should have been registered for IRQ 18 and should have claimed
the interrupt instead of leaving it to be reported as "nobody cared".
It sounds as if the failed drive was never marked to be ignored (or
logged and then ignored). I'd love to know what generated the IRQ in
the first place.
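
For anyone unfamiliar with the message, here is a rough Python model of
how I understand the kernel's spurious-interrupt accounting. It is a
sketch, not the kernel code: the window of 100,000 interrupts and the
"more than 99,900 unhandled" cutoff are taken from my reading of
note_interrupt() in kernel/irq/spurious.c and may differ between
versions, and IrqLine and interrupts_until_disabled are names made up
for illustration.

```python
# Simplified model of the kernel's spurious-interrupt accounting.
# Thresholds are assumed from note_interrupt(); all names are hypothetical.

WINDOW = 100_000          # interrupts counted per evaluation window
STUCK_THRESHOLD = 99_900  # unhandled count above which the line is disabled


class IrqLine:
    def __init__(self):
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        """Record one interrupt; return True if this call disabled the line."""
        if self.disabled:
            return False
        if not handled:
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < WINDOW:
            return False
        # End of window: decide whether the line is stuck, then reset.
        stuck = self.irqs_unhandled > STUCK_THRESHOLD
        self.irq_count = 0
        self.irqs_unhandled = 0
        if stuck:
            # This is the point where "irq N: nobody cared" would be
            # printed and the interrupt line shut off.
            self.disabled = True
        return stuck


def interrupts_until_disabled(line, handled_every=None, limit=1_000_000):
    """Feed interrupts to `line`; every `handled_every`-th one is claimed.

    Returns the number of interrupts delivered when the line was
    disabled, or None if it survived `limit` interrupts.
    """
    for n in range(1, limit + 1):
        handled = handled_every is not None and n % handled_every == 0
        if line.note_interrupt(handled):
            return n
    return None
```

With no handler ever claiming the interrupt,
interrupts_until_disabled(IrqLine()) disables the line at the end of
the first window. Once that happens on a line shared by all four
drives, none of them can be serviced, which matches the silence in the
log after the trace.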
>
> Nothing more was logged, probably because the interrupt got disabled,
> making it impossible to talk to the drives anymore. It's bad because
> I ended up with a dirty degraded array the second time this year.
>
> How would a RAID-6 handle a crash when a drive is missing?
> Would that also lead to possible silent corruptions?
> Or is a battery-backed hardware controller the only way to
> avoid silent corruption?
>
>
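
To make the "silent corruption" concern concrete, here is a minimal
Python sketch of one RAID-5 stripe that is written while degraded and
interrupted by a crash. It only assumes that RAID-5 parity is the XOR
of the data blocks; the block contents and the helper names xor_blocks
and reconstruct are invented for the example. It does not model the md
driver and says nothing about how RAID-6 behaves.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_data, parity):
    """Rebuild the one missing data block from the survivors and parity."""
    return xor_blocks(parity, *surviving_data)


# One stripe on a 4-drive RAID-5: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# The drive holding d0 fails. While parity is consistent, d0 is recoverable.
recovered_before_crash = reconstruct([d1, d2], parity)

# A degraded write updates d1 on disk, but the machine goes down before
# the matching parity update is written.
d1_new = b"bbbb"
stale_parity = parity

# After restart, d0 is rebuilt from the new d1 and the stale parity.
d0_after_crash = reconstruct([d1_new, d2], stale_parity)

print(recovered_before_crash == d0)  # True
print(d0_after_crash)                # b'aaaa'
```

The block that comes back wrong is d0, which was never written to, and
nothing in the stripe flags it as bad. That is why a dirty degraded
array is treated with suspicion.
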
> Kernel is 2.6.16-1.2133_FC5

There have been a lot of improvements in the raid code since 2.6.16.
>
> Here's the full log:
>
> Nov 16 00:43:10 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
> Nov 16 00:43:10 p4 kernel: ata1: status=0xd0 { Busy }
> Nov 16 00:43:10 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
> Nov 16 00:43:10 p4 last message repeated 2 times
> Nov 16 01:30:06 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
> Nov 16 01:30:06 p4 kernel: ata1: status=0xd0 { Busy }
> Nov 16 01:30:06 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
> Nov 16 01:30:06 p4 last message repeated 2 times
> Nov 16 01:34:13 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
> Nov 16 01:34:13 p4 kernel: ata1: status=0xd0 { Busy }
> Nov 16 01:34:13 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
> Nov 16 01:34:13 p4 last message repeated 2 times
> Nov 16 01:35:13 p4 kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
> Nov 16 01:35:13 p4 kernel: ata1: status=0xd0 { Busy }
> Nov 16 01:35:13 p4 kernel: sd 0:0:0:0: SCSI error: return code = 0x8000002
> Nov 16 01:35:13 p4 kernel: sda: Current: sense key: Aborted Command
> Nov 16 01:35:13 p4 kernel:     Additional sense: Scsi parity error
> Nov 16 01:35:13 p4 kernel: end_request: I/O error, dev sda, sector 781015848
> Nov 16 01:35:43 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
> Nov 16 01:35:44 p4 last message repeated 2 times
> Nov 16 01:35:44 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
> Nov 16 01:35:44 p4 kernel: ata1: status=0xd0 { Busy }
> Nov 16 01:35:44 p4 kernel: raid5: Disk failure on sda3, disabling device. Operation continuing on 3 devices
> Nov 16 01:35:44 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
> Nov 16 01:35:44 p4 kernel: RAID5 conf printout:
> Nov 16 01:35:44 p4 kernel:  --- rd:4 wd:3 fd:1
> Nov 16 01:35:44 p4 kernel:  disk 0, o:0, dev:sda3
> Nov 16 01:35:44 p4 kernel:  disk 1, o:1, dev:sdc3
> Nov 16 01:35:44 p4 kernel:  disk 2, o:1, dev:sdb3
> Nov 16 01:35:44 p4 kernel:  disk 3, o:1, dev:sdd3
> Nov 16 01:35:44 p4 kernel: RAID5 conf printout:
> Nov 16 01:35:44 p4 kernel:  --- rd:4 wd:3 fd:1
> Nov 16 01:35:44 p4 kernel:  disk 1, o:1, dev:sdc3
> Nov 16 01:35:44 p4 kernel:  disk 2, o:1, dev:sdb3
> Nov 16 01:35:44 p4 kernel:  disk 3, o:1, dev:sdd3
> Nov 16 01:37:36 p4 kernel: irq 18: nobody cared (try booting with the "irqpoll" option)
> Nov 16 01:37:36 p4 kernel:
> Nov 16 01:37:36 p4 kernel: Call Trace: <IRQ> <ffffffff8015b930>{__report_bad_irq+48}
> Nov 16 01:37:36 p4 kernel:        <ffffffff8015bb2e>{note_interrupt+433} <ffffffff8015b444>{__do_IRQ+191}
>
>
> -Tamas


-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
