* out of sync raid 5 + xfs = kernel startup problem
@ 2005-04-13  1:59 Robey Holderith
  2005-04-13  4:10 ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: Robey Holderith @ 2005-04-13  1:59 UTC (permalink / raw)
  To: linux-raid

My raid5 system recently went through a sequence of power outages.  When 
everything came back on the drives were out of sync.  No big issue... 
just sync them back up again.  But something is going wrong.  Any help 
is appreciated.  dmesg provides the following (the network stuff is 
mixed in):

md: raid5 personality registered as nr 4
raid5: automatically using best checksumming function: generic_sse
   generic_sse:  2444.000 MB/sec
raid5: using function: generic_sse (2444.000 MB/sec)
md: md driver 0.90.1 MAX_MD_DEVS=256, MD_SB_DISKS=27
NET: Registered protocol family 2
IP: routing cache hash table of 8192 buckets, 64Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
NET: Registered protocol family 1
NET: Registered protocol family 10
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 164k freed
md: raidstart(pid 220) used deprecated START_ARRAY ioctl. This will not 
be supported beyond 2.6
md: could not bd_claim sdf2.
md: autorun ...
md: considering sdd2 ...
md:  adding sdd2 ...
md:  adding sde2 ...
md:  adding sdf2 ...
md:  adding sdc2 ...
md:  adding sdb2 ...
md:  adding sda2 ...
md: created md0
md: bind<sda2>
md: bind<sdb2>
md: bind<sdc2>
md: bind<sdf2>
md: bind<sde2>
md: bind<sdd2>
md: running: <sdd2><sde2><sdf2><sdc2><sdb2><sda2>
md: kicking non-fresh sdd2 from array!
md: unbind<sdd2>
md: export_rdev(sdd2)
md: md0: raid array is not clean -- starting background reconstruction
raid5: device sde2 operational as raid disk 4
raid5: device sdf2 operational as raid disk 3
raid5: device sdc2 operational as raid disk 2
raid5: device sdb2 operational as raid disk 1
raid5: device sda2 operational as raid disk 0
raid5: cannot start dirty degraded array for md0
RAID5 conf printout:
 --- rd:6 wd:5 fd:1
 disk 0, o:1, dev:sda2
 disk 1, o:1, dev:sdb2
 disk 2, o:1, dev:sdc2
 disk 3, o:1, dev:sdf2
 disk 4, o:1, dev:sde2
raid5: failed to run raid set md0
md: pers->run() failed ...
md: do_md_run() returned -22
md: md0 stopped.
md: unbind<sde2>
md: export_rdev(sde2)
md: unbind<sdf2>
md: export_rdev(sdf2)
md: unbind<sdc2>
md: export_rdev(sdc2)
md: unbind<sdb2>
md: export_rdev(sdb2)
md: unbind<sda2>
md: export_rdev(sda2)
md: ... autorun DONE.
XFS: SB read failed
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff802c4d5d>{raid5_unplug_device+13}
PML4 3f691067 PGD 3f6ad067 PMD 0
Oops: 0000 [1]
CPU 0
Pid: 226, comm: mount Not tainted 2.6.10
RIP: 0010:[<ffffffff802c4d5d>] <ffffffff802c4d5d>{raid5_unplug_device+13}
RSP: 0018:000001003f66dab8  EFLAGS: 00010216
RAX: ffffffff802c4d50 RBX: 000001003f66daa0 RCX: 000001003f66dad8
RDX: 000001003f66dad8 RSI: 0000000000000000 RDI: 000001003fcacd10
RBP: 0000000000000000 R08: 0000000000000034 R09: 0000010002134b00
R10: 0000000000000200 R11: ffffffff802c4d50 R12: 000001003f66dad8
R13: 0000000000000001 R14: 000001003f440640 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8042d300(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process mount (pid: 226, threadinfo 000001003f66c000, task 000001003f42eef0)
Stack: 0000000000000001 000001003f66daa0 000001003f66daa0 ffffffff8023c91a
       000001003f66dad8 000001003f66dad8 0000000000000005 000001003f6b7000
       0000000000000005 000001003f6b7800
Call Trace:<ffffffff8023c91a>{xfs_flush_buftarg+442} <ffffffff80231511>{xfs_mount+2465}
       <ffffffff80242560>{linvfs_fill_super+0} <ffffffff80242560>{linvfs_fill_super+0}
       <ffffffff80242613>{linvfs_fill_super+179} <ffffffff80242560>{linvfs_fill_super+0}
       <ffffffff802502d3>{snprintf+131} <ffffffff8024f40e>{strlcpy+78}
       <ffffffff801613fa>{sget+730} <ffffffff801619f0>{set_bdev_super+0}
       <ffffffff80242560>{linvfs_fill_super+0} <ffffffff80161b50>{get_sb_bdev+272}
       <ffffffff80161def>{do_kern_mount+111} <ffffffff8017596c>{do_mount+1548}
       <ffffffff80142c3e>{find_get_page+14} <ffffffff801438bc>{filemap_nopage+396}
       <ffffffff8015086c>{do_no_page+972} <ffffffff80150a00>{handle_mm_fault+320}
       <ffffffff801198b7>{do_page_fault+583} <ffffffff80146bdf>{__get_free_pages+31}
       <ffffffff80175d47>{sys_mount+151} <ffffffff8010cfaa>{system_call+126}

Code: 48 8b 5d 00 9c 8f 04 24 fa e8 b5 00 fb ff 85 c0 74 61 8b 43
RIP <ffffffff802c4d5d>{raid5_unplug_device+13} RSP <000001003f66dab8>
CR2: 0000000000000000
 <6>eth1: link up, 100Mbps, full-duplex, lpa 0x41E1
eth1: no IPv6 routers present

This may be posted to the wrong group... or perhaps it needs to be 
posted to both raid and xfs.  Any insights are welcomed.

-Robey



* Re: out of sync raid 5 + xfs = kernel startup problem
  2005-04-13  1:59 out of sync raid 5 + xfs = kernel startup problem Robey Holderith
@ 2005-04-13  4:10 ` Neil Brown
  2005-04-14  2:07   ` Robey Holderith
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Brown @ 2005-04-13  4:10 UTC (permalink / raw)
  To: Robey Holderith; +Cc: linux-raid

On Tuesday April 12, robey@flaminglunchbox.net wrote:
> My raid5 system recently went through a sequence of power outages.  When 
> everything came back on the drives were out of sync.  No big issue... 
> just sync them back up again.  But something is going wrong.  Any help 
> is appreciated.  dmesg provides the following (the network stuff is 
> mixed in):
> 
..
> md: raidstart(pid 220) used deprecated START_ARRAY ioctl. This will not 
> be supported beyond 2.6

First hint.  Don't use 'raidstart'.  It works OK when everything is
working, but when things aren't working, raidstart makes it worse.
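
For the common case the mdadm equivalent is roughly this (an untested
sketch; the device names are just the six partitions from your log, and
the config-file route is optional):

  # assemble by reading the md superblocks off the named devices
  mdadm --assemble /dev/md0 /dev/sd[a-f]2

  # or describe the array once in /etc/mdadm.conf, e.g.
  #   DEVICE /dev/sd[a-f]2
  #   ARRAY /dev/md0 devices=/dev/sda2,/dev/sdb2,/dev/sdc2,/dev/sdd2,/dev/sde2,/dev/sdf2
  # and then simply
  mdadm --assemble --scan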

> md: could not bd_claim sdf2.

That's odd... Maybe it is trying to 'claim' it twice, because it
certainly seems to have got it below..

> md: autorun ...
> md: considering sdd2 ...
> md:  adding sdd2 ...
> md:  adding sde2 ...
> md:  adding sdf2 ...
> md:  adding sdc2 ...
> md:  adding sdb2 ...
> md:  adding sda2 ...
> md: created md0
> md: bind<sda2>
> md: bind<sdb2>
> md: bind<sdc2>
> md: bind<sdf2>
> md: bind<sde2>
> md: bind<sdd2>
> md: running: <sdd2><sde2><sdf2><sdc2><sdb2><sda2>
> md: kicking non-fresh sdd2 from array!

So sdd2 is not fresh.  Must have been missing at one stage, so it
probably has old data.


> md: unbind<sdd2>
> md: export_rdev(sdd2)
> md: md0: raid array is not clean -- starting background reconstruction
> raid5: device sde2 operational as raid disk 4
> raid5: device sdf2 operational as raid disk 3
> raid5: device sdc2 operational as raid disk 2
> raid5: device sdb2 operational as raid disk 1
> raid5: device sda2 operational as raid disk 0
> raid5: cannot start dirty degraded array for md0


Here's the main problem.

You've got a degraded, unclean array.  i.e. one drive is
failed/missing and md isn't confident that all the parity blocks are
correct due to an unclean shutdown (could have been in the middle of a
write). 
This means you could have undetectable data corruption.

md wants you to know this and not assume that everything is perfectly
OK.

You can still start the array, but you will need to use
  mdadm --assemble --force
which means you need to boot first ... got a boot CD?

I should add a "raid=force-start" or similar boot option, but I
haven't yet.

So, boot somehow, and
  mdadm --assemble /dev/md0 --force /dev/sd[a-f]2

  mdadm /dev/md0 -a /dev/sdd2

 wait for sync to complete (not absolutely needed; see below for how
 to watch it).

Reboot.
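
While the resync runs, something like this shows its progress, and the
state of each member afterwards (read-only commands; nothing is assumed
beyond the /dev/md0 name above):

  # progress of the background resync
  cat /proc/mdstat

  # per-device state once assembly (and the sync) is done
  mdadm --detail /dev/md0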

> XFS: SB read failed
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> <ffffffff802c4d5d>{raid5_unplug_device+13}

Hmm.. This is a bit of a worry.. I should be doing
	mddev->queue->unplug_fn = raid5_unplug_device;
	mddev->queue->issue_flush_fn = raid5_issue_flush;
a bit later in drivers/md/raid5.c(run), after the last 'goto
abort'... I'll have to think through it a bit though to be sure.

NeilBrown


* Re: out of sync raid 5 + xfs = kernel startup problem
  2005-04-13  4:10 ` Neil Brown
@ 2005-04-14  2:07   ` Robey Holderith
  2005-04-14  2:29     ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: Robey Holderith @ 2005-04-14  2:07 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

Neil Brown wrote:

>On Tuesday April 12, robey@flaminglunchbox.net wrote:
>  
>
>>My raid5 system recently went through a sequence of power outages.  When 
>>everything came back on the drives were out of sync.  No big issue... 
>>just sync them back up again.  But something is going wrong.  Any help 
>>is appreciated.  dmesg provides the following (the network stuff is 
>>mixed in):
>>
>>    
>>
>
>Here's the main problem.
>
>You've got a degraded, unclean array.  i.e. one drive is
>failed/missing and md isn't confident that all the parity blocks are
>correct due to an unclean shutdown (could have been in the middle of a
>write). 
>This means you could have undetectable data corruption.
>
>md wants you to know this and not assume that everything is perfectly
>OK.
>
>You can still start the array, but you will need to use
>  mdadm --assemble --force
>which means you need to boot first ... got a boot CD?
>
>I should add a "raid=force-start" or similar boot option, but I
>haven't yet.
>
>So, boot somehow, and
>  mdadm --assemble /dev/md0 --force /dev/sd[a-f]2
>
>  mdadm /dev/md0 -a /dev/sdd2
>
> wait for sync to complete (not absolutely needed).
>
>Reboot.
>  
>
Thanks for the help.  I rebooted using a rescue partition and used the
two commands.  After about 2 hours of syncing, the array decided that
sdf had failed and stopped the sync.  I restarted and then tried to
assemble the array once again.  sdd2 and sdf2 are now both marked as
spares and the array has only 4/6 partitions... dead.  Can I force the
device numbers within the array?  I know that sdd2 was position 5 and
sdf2 was position 3.  I'd like to save what I can... most of the data on
the array can be reproduced, but it takes so much time.

If anyone is interested: during my attempts to force the array to run I
got a segfault in mdadm.  I'll post a snippet here... ignore it if it's
old news.

md: pers->run() failed ...
Unable to handle kernel NULL pointer dereference at 0000000000000030 RIP:
<ffffffff802c9350>{md_error+64}

Again... thanks for any and all help.

-Robey


* Re: out of sync raid 5 + xfs = kernel startup problem
  2005-04-14  2:07   ` Robey Holderith
@ 2005-04-14  2:29     ` Neil Brown
  0 siblings, 0 replies; 4+ messages in thread
From: Neil Brown @ 2005-04-14  2:29 UTC (permalink / raw)
  To: Robey Holderith; +Cc: linux-raid

On Wednesday April 13, robey@flaminglunchbox.net wrote:
> >
> Thanks for the help.  I rebooted using a rescue partition and used the
> two commands.  After about 2 hours of syncing, the array decided that
> sdf had failed and stopped the sync.  I restarted and then tried to
> assemble the array once again.  sdd2 and sdf2 are now both marked as
> spares and the array has only 4/6 partitions... dead.  Can I force the
> device numbers within the array?  I know that sdd2 was position 5 and
> sdf2 was position 3.  I'd like to save what I can... most of the data on
> the array can be reproduced, but it takes so much time.

Just do the "mdadm --assemble --force" again, but this time don't
--add any device that mdadm doesn't include.
This will leave you with a degraded array consisting of the "best"
drives that mdadm can find.
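
Before deciding what to --add back, it is worth looking at what the
superblocks actually say (a read-only check; same device names as
before):

  # the Events count and Update Time show which members mdadm
  # considers freshest and which it will only treat as spares
  mdadm --examine /dev/sd[a-f]2

  # then force-assemble as before; mdadm picks the best set, and the
  # leftovers should not be --add'ed back until the data is copied off
  mdadm --assemble /dev/md0 --force /dev/sd[a-f]2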

Then copy off the data you want.

Then seriously look at your hardware.  If sdd and sdf have appeared to
fail, there is a good chance that it is the SCSI cabling rather than
the drives that is causing the problem.
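
A quick first check is the kernel log and the drives' own error logs
(this assumes smartmontools is installed; sdd and sdf are the suspects
from your report):

  # any SCSI resets, timeouts or medium errors logged?
  dmesg | grep -i scsi

  # SMART health and error log for the two suspect drives
  smartctl -a /dev/sdd
  smartctl -a /dev/sdf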

NeilBrown


> 
> If anyone is interested: during my attempts to force the array to run I
> got a segfault in mdadm.  I'll post a snippet here... ignore it if it's
> old news.
> 
> md: pers->run() failed ...
> Unable to handle kernel NULL pointer dereference at 0000000000000030 RIP:
> <ffffffff802c9350>{md_error+64}
> 
> Again... thanks for any and all help.
> 
> -Robey

