Re: Re[2]: RAID1 submirror failure causes reboot?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jens Axboe <jens.axboe@oracle.com>
To: Neil Brown <neilb@suse.de>
Cc: Jim Klimov <klimov@2ka.mipt.ru>, Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: Re[2]: RAID1 submirror failure causes reboot?
Date: Mon, 13 Nov 2006 21:11:05 +0100	[thread overview]
Message-ID: <20061113201105.GD15031@kernel.dk> (raw)
In-Reply-To: <17752.7205.373640.220014@cse.unsw.edu.au>

On Mon, Nov 13 2006, Neil Brown wrote:
> On Friday November 10, klimov@2ka.mipt.ru wrote:
> > Hello Neil,
> > 
> > >> [87398.531579] blk: request botched
> > NB>                  ^^^^^^^^^^^^^^^^^^^^
> > 
> > NB> That looks bad.  Possible some bug in the IDE controller or elsewhere
> > NB> in the block layer.  Jens: What might cause that?
> > 
> > NB> --snip--
> > 
> > NB> That doesn't look like raid was involved.  If it was you would expect
> > NB> to see raid1_end_write_request or raid1_end_read_request in that
> > NB> trace.
> > So that might be the hard or soft part of IDE layer failing the
> > system, or a PCI problem for example?
> 
> What I think is happening here (and Jens: if you could tell me how
> impossible this is, that would be good) is this:
> 
> Some error handling somewhere in the low-level ide driver is getting
> confused and somehow one the sector counts in the 'struct request' is
> getting set wrongly.  blk_recalc_rq_sectors notices this and says
> "blk: request botched".  It tries to auto-correct by increasing
> rq->nr_sectors to be consistent with other counts.
> I'm *guessing* this is the wrong thing to do, and that it has a
> side-effect but bi_end_io is getting called on the Bi twice.
> The second time the bio has been freed and reused and the wrong
> b_end_io is called and it does the wrong thing.

It doesn't sound at all unreasonable. It's most likely either a bug in
the ide driver, or a "bad" bio being passed to the block layer (and
later on to the request and driver). By "bad" I mean one that isn't
entirely consistent, which could be a bug in eg md. The "request
botched" error is usually a sign of something being severly screwed up.
As you mention further down, get slab and page debugging enabled to
potentially catch this earlier. It could be a sign of a freed bio or
request with corrupt contents.

> This sounds a bit far-fetched, but it is the only explanation I can
> come up with for the observed back trace which is:
> 
> [87403.706012]  [<c0103871>] error_code+0x39/0x40
> [87403.710794]  [<c0180e0a>] mpage_end_io_read+0x5e/0x72
> [87403.716154]  [<c0164af9>] bio_endio+0x56/0x7b
> [87403.720798]  [<c0256778>] __end_that_request_first+0x1e0/0x301
> [87403.726985]  [<c02568a4>] end_that_request_first+0xb/0xd
> [87403.732699]  [<c02bd73c>] __ide_end_request+0x54/0xe1
> [87403.738214]  [<c02bd807>] ide_end_request+0x3e/0x5c
> [87403.743382]  [<c02c35df>] task_error+0x5b/0x97
> [87403.748113]  [<c02c36fa>] task_in_intr+0x6e/0xa2
> [87403.753120]  [<c02bf19e>] ide_intr+0xaf/0x12c
> [87403.757815]  [<c013e5a7>] handle_IRQ_event+0x23/0x57
> [87403.763135]  [<c013e66f>] __do_IRQ+0x94/0xfd
> [87403.767802]  [<c0105192>] do_IRQ+0x32/0x68
> [87403.772278]  [<c010372e>] common_interrupt+0x1a/0x20
> 
> i.e. bio_endio goes straight to mpage_end_io despite the face that the
> filesystem is mounted over md/raid1.

What is the workload? Is io to the real device mixed with io that came
through md as well?

> Is the kernel compiled with CONFIG_DEBUG_SLAB=y and
> CONFIG_DEBUG_PAGEALLOC=y ??
> They might help trigger the error earlier and so make the problem more
> obvious.

Agree, that would be a good plan to enable. Other questions: are you
seeing timeouts at any point? The ide timeout code has some request/bio
"resetting" code which might be worrisome.
> 
> NeilBrown

-- 
Jens Axboe

next prev parent reply	other threads:[~2006-11-13 20:11 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-11-10  8:17 RAID1 submirror failure causes reboot? Jim Klimov
2006-11-10  8:41 ` Neil Brown
2006-11-10 12:53   ` Re[2]: " Jim Klimov
2006-11-13  7:17     ` Neil Brown
2006-11-13 20:11       ` Jens Axboe [this message]
2006-11-13 22:05         ` Neil Brown
2006-11-14  7:28           ` Jens Axboe
2006-11-14 10:36             ` Re[4]: " Jim Klimov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20061113201105.GD15031@kernel.dk \
    --to=jens.axboe@oracle.com \
    --cc=klimov@2ka.mipt.ru \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.