meta: should i chase this down?

Linux RAID subsystem development

* meta: should i chase this down?
@ 2011-12-07  0:02 Keith Keller
  2011-12-07  0:47 ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Keith Keller @ 2011-12-07  0:02 UTC (permalink / raw)
  To: linux-raid

Hi all,

A little while back, I had a strange issue, where reshaping a RAID6 to
add a disk, then performing significant write activity (in this case, an
rsnapshot), would cause a kernel crash.  I only attempted this twice,
and neglected to write down the kernel oops errors, but I saw a few
calls that seemed to imply that the md driver might be involved.  (Doing
the same write activity during a rebuild is fine, which is another
reason I suspected the reshape code in the md driver.  If it's of
interest, I'm using kernel 2.6.39-4.el5.elrepo from ELRepo on a CentOS
5.7 box.)  It's certainly possible that I have a hardware issue, but not
being able to reliably replicate the issue outside a reshape complicates
debugging.

My question is, should I try to hunt down the actual source of this
crash, and if so, what would be the best way to go about that?  I am
decidedly not a kernel developer, and am not familiar with how to obtain
debugging information in that environment.  I'm happy enough for this
machine to suffer crashes, but I prefer not to work with the existing
RAID6 if possible, and would want a more reliable way of collecting the
kernel's debug output beyond writing it down on paper.  :)

Thanks,

--keith

-- 
kkeller@wombat.san-francisco.ca.us


* Re: meta: should i chase this down?
  2011-12-07  0:02 meta: should i chase this down? Keith Keller
@ 2011-12-07  0:47 ` NeilBrown
  2011-12-07  4:39   ` Keith Keller
  0 siblings, 1 reply; 4+ messages in thread
From: NeilBrown @ 2011-12-07  0:47 UTC (permalink / raw)
  To: Keith Keller; +Cc: linux-raid


On Tue, 06 Dec 2011 16:02:44 -0800 Keith Keller
<kkeller@wombat.san-francisco.ca.us> wrote:

> Hi all,
> 
> A little while back, I had a strange issue, where reshaping a RAID6 to
> add a disk, then performing significant write activity (in this case, an
> rsnapshot), would cause a kernel crash.  I only attempted this twice,
> and neglected to write down the kernel oops errors, but I saw a few
> calls that seemed to imply that the md driver might be involved.  (Doing
> the same write activity during a rebuild is fine, which is another
> reason I suspected the reshape code in the md driver.  If it's of
> interest, I'm using kernel 2.6.39-4.el5.elrepo from ELRepo on a CentOS
> 5.7 box.)  It's certainly possible that I have a hardware issue, but not
> being able to reliably replicate the issue outside a reshape complicates
> debugging.
> 
> My question is, should I try to hunt down the actual source of this
> crash, and if so, what would be the best way to go about that?  I am
> decidedly not a kernel developer, and am not familiar with how to obtain
> debugging information in that environment.  I'm happy enough for this
> machine to suffer crashes, but I prefer not to work with the existing
> RAID6 if possible, and would want a more reliable way of collecting the
> kernel's debug output beyond writing it down on paper.  :)
> 

I'm always happy to receive detailed crash reports.  However, I cannot measure
how much your time is worth, nor can I guarantee that what you find won't
already have been fixed (though 2.6.39 is quite recent and I don't recall any
recent kernel-crash-during-reshape bugs, nor can I find any in a quick scan
through the logs).
So I cannot advise you on whether it is "worth the effort".  I would
appreciate it though.

The best way I have found to catch kernel messages is using netconsole.
See Documentation/networking/netconsole.txt

You need a wired network port and another machine on the same network that
can capture the messages.
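
As a rough illustration of the capturing side, here is a minimal Python
sketch of a receiver.  It assumes the crashing machine sends its console
output as UDP datagrams to port 6666 (netconsole's default target port);
the function names and the log file name are hypothetical, and any UDP
listener would serve equally well.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is assumed to be the netconsole target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, log_path, max_datagrams=None):
    """Append every received datagram to log_path and return the count.

    max_datagrams=None keeps receiving until interrupted; a number stops
    after that many datagrams.
    """
    received = 0
    with open(log_path, "ab") as log:
        while max_datagrams is None or received < max_datagrams:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()  # keep the file current if the sender dies mid-oops
            received += 1
    return received
```

On the capture machine, calling capture(open_listener(), "netconsole.log")
and leaving it running until the crash (Ctrl-C to stop) should leave the
oops text in the log file.  The sending side is configured as described in
netconsole.txt.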

You almost certainly need some disks to make the RAID6 out of.  You could try
loop-back devices over files but the timing is likely to be very different
and so the chance of reproducing the bug correspondingly small.

But if you do manage to get a crash message I would be very happy to
interpret it and work to fix the bug that causes it.

Thanks,
NeilBrown


* Re: meta: should i chase this down?
  2011-12-07  0:47 ` NeilBrown
@ 2011-12-07  4:39   ` Keith Keller
  2011-12-12 21:13     ` Keith Keller
  0 siblings, 1 reply; 4+ messages in thread
From: Keith Keller @ 2011-12-07  4:39 UTC (permalink / raw)
  To: linux-raid

On 2011-12-07, NeilBrown <neilb@suse.de> wrote:
>
> The best way I have found to catch kernel messages is using netconsole.
> See Documentation/networking/netconsole.txt

That's exactly what I was looking for!  It looks like it's working
perfectly so far.  It sure beats the Back In My Day (TM) method of
needing a serial console.

> You almost certainly need some disks to make the RAID6 out of.  You could try
> loop-back devices over files but the timing is likely to be very different
> and so the chance of reproducing the bug correspondingly small.

I do have some drive bays that I should be able to free up, and I think
that I have some free disks lying around I can use, which I think should
replicate a more real-life scenario.  I can't honestly say that this
will be a high priority for me :), but I do want to try to be helpful if
there really is a bug, or help myself if there's a problem in my
hardware.  It probably can't hurt to make a first pass with some
loopback devices--if it does crash the kernel, great (for debugging,
anyway); if not, I hunt down some real disks.  (Would using multiple
partitions on one or two disks be a reasonable test case, or should I
not bother and stick with real disks?)

Thanks for your help!

--keith

-- 
kkeller@wombat.san-francisco.ca.us


* Re: meta: should i chase this down?
  2011-12-07  4:39   ` Keith Keller
@ 2011-12-12 21:13     ` Keith Keller
  0 siblings, 0 replies; 4+ messages in thread
From: Keith Keller @ 2011-12-12 21:13 UTC (permalink / raw)
  To: linux-raid

Hi all,

On 2011-12-07, Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote:
>
> I do have some drive bays that I should be able to free up, and I think
> that I have some free disks lying around I can use, which I think should
> replicate a more real-life scenario.

Just a short followup: I've yet to be able to replicate the kernel
crash.  I added some 400GB disks, split them into two 200GB partitions
each, built a 4-part RAID6, added a fifth partition and started a
reshape, then started some disk IO (both reads and writes).  So far,
nothing bad has happened, even with a bad disk in the mix (though
perhaps that's masking the issue in a bizarre way).
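
For anyone wanting to repeat the test, the sequence above can be sketched
roughly as follows.  This Python snippet only builds and prints the mdadm
command lines; it does not run them.  The array name and partition names
are hypothetical placeholders, not the ones actually used, and the helper
function is made up for illustration.

```python
def reshape_test_commands(array, members, extra):
    """Return argv lists: create a RAID6 from members, then grow it by one."""
    if len(members) < 4:
        raise ValueError("RAID6 needs at least four member devices")
    grown = len(members) + 1
    return [
        # Build the initial array from the existing partitions.
        ["mdadm", "--create", array, "--level=6",
         "--raid-devices=%d" % len(members)] + list(members),
        # Add the extra partition; it joins as a spare at first.
        ["mdadm", "--add", array, extra],
        # Start the reshape onto the larger device count.  It may be
        # necessary to wait for the initial sync to finish first.
        ["mdadm", "--grow", array, "--raid-devices=%d" % grown],
    ]


# Hypothetical device names: two partitions on each of two disks, plus one.
commands = reshape_test_commands(
    "/dev/md9",
    ["/dev/sdb1", "/dev/sdb2", "/dev/sdc1", "/dev/sdc2"],
    "/dev/sdd1",
)
for argv in commands:
    print(" ".join(argv))
```

The disk IO would then be started by hand once the reshape is under way.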

I may make another attempt using one partition on all the disks instead
of two, but I'm not terribly optimistic that this will result in the
crash.  So I am afraid that my likeliest next step is to watch the next
time I need to grow the "real" RAID6 where this happened the first
time.  At least this time I will have netconsole working if it does
happen again, and will be able to make a more informed report.

--keith

-- 
kkeller@wombat.san-francisco.ca.us
