From: Jens Axboe <axboe@suse.de>
To: Paul Clements <paul.clements@steeleye.com>
Cc: Mark Rustad <MRustad@aol.com>,
	linux-raid@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: PROBLEM: kernel crashes on RAID1 drive error
Date: Thu, 21 Oct 2004 16:02:37 +0200	[thread overview]
Message-ID: <20041021140236.GO10531@suse.de> (raw)
In-Reply-To: <4177C13C.10007@steeleye.com>

On Thu, Oct 21 2004, Paul Clements wrote:
> Jens Axboe wrote:
> >On Thu, Oct 21 2004, Paul Clements wrote:
> >
> >>Jens Axboe wrote:
> >>
> >>>On Wed, Oct 20 2004, Mark Rustad wrote:
> >>>
> >>>
> >>>>Folks,
> >>>>
> >>>>I have been having trouble with kernel crashes resulting from RAID1 
> >>>>component device failures. I have been testing the robustness of an 
> >>>>embedded system and have been using a drive that is known to fail after 
> >>>>a time under load. When this device returns a media error, I always 
> >>>>wind up with either a kernel hang or reboot. In this environment, each 
> >>>>drive has four partitions, each of which is part of a RAID1 with its 
> >>>>partner on the other device. Swap is on md2 so even it should be 
> >>>>robust.
> >>>>
> >>>>I have gotten this result with the SuSE standard i386 smp kernels 
> >>>>2.6.5-7.97 and 2.6.5-7.108. I also get these failures with the 
> >>>>kernel.org kernels 2.6.8.1, 2.6.9-rc4 and 2.6.9.
> >>>>
> >>>>The hardware setup is a two cpu Nacona with an Adaptec 7902 SCSI 
> >>>>controller with two Seagate drives on a SAF-TE bus. I run three or four 
> >>>>dd commands copying /dev/md0 to /dev/null to provide the activity that 
> >>>>stimulates the failure.
> >>>>
> >>>>I suspect that something is going wrong in the retry of the failed I/O 
> >>>>operations, but I'm really not familiar with any of this area of the 
> >>>>kernel at all.
> >>>>
> >>>>In one failure, I get the following messages from kernel 2.6.9:
> >>>>
> >>>>raid1: Disk failure on sdb1, disabling device.
> >>>>raid1: sdb1: rescheduling sector 176
> >>>>raid1: sda1: redirecting sector 176 to another mirror
> >>>>raid1: sdb1: rescheduling sector 184
> >>>>raid1: sda1: redirecting sector 184 to another mirror
> >>>>Incorrect number of segments after building list
> >>>>counted 2, received 1
> >>>>req nr_sec 0, cur_nr_sec 7
> >>>
> >>>
> >>>This should be fixed by this patch, can you test it?
> >>
> >>There may well be two problems here, but the original problem you're 
> >>seeing (infinite read retries, and failures) is due to a bug in raid1. 
> >>Basically the bio handling on read error retry was not quite right. Neil 
> >>Brown just posted the patch to correct this a couple of days ago:
> >>
> >>http://marc.theaimsgroup.com/?l=linux-raid&m=109824318202358&w=2
> >>
> >>Please try that. (If you need a patch that applies to SUSE 2.6.5, I also 
> >>have a version of the patch which should apply to that).
> >
> >
> >Is 2.6.9 not uptodate wrt those raid1 patches?!
> 
> Unfortunately, no. This latest problem (the one he's reporting) is not 
> fixed in mainline. I discovered the problem a month or so ago while 
> testing with SLES 9. I posted a patch and Neil expanded on it (to 
> include raid10, which is now in mainline, and also suffers from the same 
> problem). Neil just posted the patch two days ago to linux-raid, so I 
> expect it's in -mm now.

Irk, that's too bad. So we are now looking at probably a month before
mainline has a stable release with that fixed too :/
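
For reference, the retry path Paul is describing conceptually does something
like the sketch below: on a media error the failing mirror is marked faulty
and the read is "rescheduled" to another in-sync mirror instead of being
failed back up to the filesystem, and the bug was in the bio handling on that
resend. This is an illustrative model only, not the drivers/md/raid1.c code
or Neil's patch:

/*
 * Very rough model of the RAID1 read-error retry ("rescheduling") logic.
 * Illustrative only -- the real code works on struct bio/r1bio, and the
 * bug being discussed is in how the original bio is prepared before it
 * is resent to the other mirror.
 */
#include <stdio.h>
#include <stdbool.h>

#define MIRRORS 2

struct mirror {
	const char *name;
	bool faulty;
	bool (*read_sector)(long sector);	/* false == media error */
};

static bool read_fails(long sector) { (void)sector; return false; }
static bool read_works(long sector) { (void)sector; return true; }

static struct mirror mirrors[MIRRORS] = {
	{ "sdb1", false, read_fails },	/* the drive that develops errors */
	{ "sda1", false, read_works },
};

static bool raid1_read(long sector)
{
	bool retry = false;

	for (int i = 0; i < MIRRORS; i++) {
		struct mirror *m = &mirrors[i];

		if (m->faulty)
			continue;
		if (retry)
			printf("raid1: %s: redirecting sector %ld to another mirror\n",
			       m->name, sector);
		if (m->read_sector(sector))
			return true;

		/* Media error: kick this mirror out of the array and retry
		 * on another mirror instead of failing the read. */
		printf("raid1: Disk failure on %s, disabling device.\n", m->name);
		m->faulty = true;
		printf("raid1: %s: rescheduling sector %ld\n", m->name, sector);
		retry = true;
	}
	return false;	/* no working mirror left */
}

int main(void)
{
	raid1_read(176);	/* prints the failure/reschedule/redirect sequence */
	raid1_read(184);	/* sdb1 already faulty, read served by sda1 */
	return 0;
}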

> >>Please be aware that there are several other bugs in the SUSE 2.6.5-7.97 
> >>kernel in md and raid1 (basically it's a matter of that kernel being 
> >>somewhat behind mainline, where most of these bugs are now fixed). I've 
> >>sent several patches to SUSE to fix these issues, which will hopefully 
> >>get into their SP1 release that should be forthcoming soon...
> >
> >
> >-97 is the release kernel, -111 is the current update kernel. And it has
> >those raid1 issues fixed already, at least the ones that are known. The
> >scsi segment issue is not, however.
> 
> Thanks. Good to know that. -111 is currently available to customers? We 
> may recommend that our customers use that, rather than patching -97 
> ourselves.

Yes, it is; it's generally available through the online updates.
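
For completeness, the "Incorrect number of segments after building list"
message in Mark's log is a consistency check: when a request is mapped to a
scatter-gather list, physically contiguous pieces get merged, and the merged
count is compared against the segment count recorded in the request. A rough,
self-contained sketch of what that kind of check amounts to (illustrative
only, not the actual 2.6 SCSI/block-layer code):

/*
 * Simplified illustration of the segment-count sanity check.
 */
#include <stdio.h>

struct segment {
	unsigned long addr;	/* physical address of this piece */
	unsigned int len;	/* length in bytes */
};

/* Count segments after merging physically contiguous neighbours. */
static int count_sg_segments(const struct segment *segs, int n)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		if (i && segs[i - 1].addr + segs[i - 1].len == segs[i].addr)
			continue;	/* merges into the previous segment */
		count++;
	}
	return count;
}

int main(void)
{
	/* Two pieces that are not physically contiguous -> 2 real segments. */
	struct segment segs[] = {
		{ 0x1000, 512 },
		{ 0x3000, 3 * 512 },
	};
	int nr_phys_segments = 1;	/* stale count carried by the request */
	int counted = count_sg_segments(segs, 2);

	if (counted > nr_phys_segments) {
		printf("Incorrect number of segments after building list\n");
		printf("counted %d, received %d\n", counted, nr_phys_segments);
	}
	return 0;
}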

-- 
Jens Axboe


Thread overview: 13+ messages
2004-10-20 22:08 PROBLEM: kernel crashes on RAID1 drive error Mark Rustad
2004-10-21  8:45 ` Jens Axboe
2004-10-21 13:52   ` Paul Clements
2004-10-21 13:55     ` Jens Axboe
2004-10-21 14:01       ` Paul Clements
2004-10-21 14:02         ` Jens Axboe [this message]
2004-10-22 16:00           ` Mark Rustad
2004-10-28 19:35             ` Mark Rustad
2004-11-04 18:56               ` Mark Rustad
2004-11-16 15:51                 ` Lars Marowsky-Bree
2004-11-16 16:40                   ` Mark Rustad
2004-10-21 16:31   ` Mark Rustad
