* Hung rebuilding
@ 2005-06-30 13:45 Sandro Dentella
2005-06-30 15:38 ` Sandro Dentella
2005-06-30 16:25 ` Gil
0 siblings, 2 replies; 4+ messages in thread
From: Sandro Dentella @ 2005-06-30 13:45 UTC (permalink / raw)
To: Linux Raid List
Sorry for this pretty long mail; there are several different questions
related to the same problem:
1. stuck rebuilding speed=0K/sec
2. disk added as spare instead of normal
3. kernel messages I don't know how to handle
A school's server had a problem this morning: 2 RAID1 arrays got out of
sync. The first one was fixed just with "mdadm /dev/md2 --add ..."; the other
says it is rebuilding but really hangs:
md3 : active raid1 ide/host0/bus0/target0/lun0/part7[2] ide/host0/bus1/target0/lun0/part7[1]
74340672 blocks [2/1] [_U]
[>....................] recovery = 0.0% (192/74340672) finish=308306.9min speed=0K/sec
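The speed=0K/sec field is what makes me say it hangs. A small sketch (parsing a copy of the mdstat line above, and assuming the usual `speed=<n>K/sec` format) shows how one could spot the stall automatically:

```shell
# Recovery line copied verbatim from /proc/mdstat above.
line='[>....................]  recovery =  0.0% (192/74340672) finish=308306.9min speed=0K/sec'

# Extract the resync speed in K/sec; a value of 0 means no progress at all.
speed=$(echo "$line" | sed -n 's/.*speed=\([0-9]*\)K\/sec.*/\1/p')

if [ "$speed" -eq 0 ]; then
  status=stalled
else
  status=progressing
fi
echo "$status"
```

With the line above this prints "stalled", which matches what the finish estimate (over 300000 minutes) already suggests.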
Following a similar thread from some days ago, I used hdparm:
srv-ornago:~# hdparm -Tt /dev/hdc
/dev/hdc:
Timing cached reads: 1000 MB in 2.00 seconds = 500.00 MB/sec
Timing buffered disk reads: 172 MB in 3.02 seconds = 56.95 MB/sec
srv-ornago:~# hdparm -Tt /dev/hda
/dev/hda:
Timing cached reads: 944 MB in 2.00 seconds = 472.00 MB/sec
That seems reasonable to me.
What is strange is the output of mdadm -D:
srv-ornago:~# mdadm -D /dev/md3
/dev/md3:
Version : 00.90.00
Creation Time : Wed Dec 8 12:28:15 2004
Raid Level : raid1
Array Size : 74340672 (70.90 GiB 76.12 GB)
Device Size : 74340672 (70.90 GiB 76.12 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 3
Persistence : Superblock is persistent
Update Time : Thu Jun 30 12:53:14 2005
State : dirty, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Rebuild Status : 0% complete
UUID : 1ea38e0e:050ac659:7e84e367:2d256edd
Events : 0.171
Number Major Minor RaidDevice State
0 0 0 0 faulty removed
1 22 7 1 active sync /dev/ide/host0/bus1/target0/lun0/part7
2 3 7 2 spare rebuilding /dev/ide/host0/bus0/target0/lun0/part7
I added the disk back with "mdadm /dev/md3 --add /dev/ide/host0/bus0/target0/lun0/part7"
and it became a "spare". What did I do wrong?
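(If I understand the md behaviour correctly, a hot-added disk is normally listed as a spare while it rebuilds, and is only promoted to active sync when the resync finishes, so maybe the label itself is expected. A quick grep over a copy of the device table above shows how md classified each slot:)

```shell
# Device table copied verbatim from the mdadm -D output above.
detail='   0       0        0        0      faulty removed
   1      22        7        1      active sync   /dev/ide/host0/bus1/target0/lun0/part7
   2       3        7        2      spare rebuilding   /dev/ide/host0/bus0/target0/lun0/part7'

# Count how many slots md currently classifies as rebuilding spares.
spares=$(echo "$detail" | grep -c 'spare rebuilding')
echo "rebuilding spares: $spares"
```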
On /dev/md3 there is a reiserfs filesystem that I know is corrupted (I can't
mount it). My plan was to first sync the array and then try to fix the
filesystem. Would it generally be better to do the opposite (so as to keep
the *bad* array as a "backup")?
TIA
sandro
*:-)
PS: kern.log now shows something that may help with understanding the problem:
Jun 30 14:33:48 srv-ornago kernel: RAID1 conf printout:
Jun 30 14:33:48 srv-ornago kernel: --- wd:1 rd:2 nd:2
Jun 30 14:33:48 srv-ornago kernel: disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev 00:00]
Jun 30 14:33:48 srv-ornago kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:ide/host0/bus1/target0/lun0/part7
Jun 30 14:33:48 srv-ornago kernel: disk 2, s:1, o:1, n:2 rd:2 us:1 dev:ide/host0/bus0/target0/lun0/part7
Jun 30 14:33:48 srv-ornago kernel:  (... disks 3-26, all: s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00] ...)
Jun 30 14:33:48 srv-ornago kernel: RAID1 conf printout:
(... same printout repeated ...)
Jun 30 14:33:48 srv-ornago kernel: md: cannot remove active disk ide/host0/bus0/target0/lun0/part7 from md3 ...
Jun 30 14:40:30 srv-ornago kernel: reiserfs: found format "3.6" with standard journal
Jun 30 14:40:31 srv-ornago kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 14:40:31 srv-ornago kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=15584452, sector=4194304
Jun 30 14:40:31 srv-ornago kernel: end_request: I/O error, dev 03:07 (hda), sector 4194304
Jun 30 14:40:31 srv-ornago kernel: ide0(3,7):sh-2029: reiserfs read_bitmaps: bitmap block (#524288) reading failed
Jun 30 14:40:31 srv-ornago kernel: ide0(3,7):sh-2014: reiserfs_read_super: unable to read bitmap
Jun 30 14:40:33 srv-ornago kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 14:40:33 srv-ornago kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=16108740, sector=4718592
Jun 30 14:40:33 srv-ornago kernel: end_request: I/O error, dev 03:07 (hda), sector 4718592
--
Sandro Dentella *:-)
e-mail: sandro@e-den.it
http://www.tksql.org TkSQL Home page - My GPL work
* Re: Hung rebuilding
2005-06-30 13:45 Hung rebuilding Sandro Dentella
@ 2005-06-30 15:38 ` Sandro Dentella
2005-06-30 16:25 ` Gil
1 sibling, 0 replies; 4+ messages in thread
From: Sandro Dentella @ 2005-06-30 15:38 UTC (permalink / raw)
To: Linux Raid List
Following up on my previous mail: I really need to recover the data in this
array. I didn't think it was safe to fsck during a stuck reconstruction, so I
marked as faulty the disk that is probably failing:
mdadm /dev/md3 --fail /dev/hda7 (the devfs path, really...)
mdadm /dev/md3 --remove /dev/hda7 (the devfs path, really...)
Now I read:
srv-ornago:~# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
(...)
md3 : active raid1 ide/host0/bus1/target0/lun0/part7[1]
74340672 blocks [2/1] [_U]
[>....................] recovery = 0.0% (192/74340672) finish=328392.5min speed=0K/sec
unused devices: <none>
So it is marked [_U] but it is still shown as recovering!!!
Is it safe to fsck now (at least treating it as a single disk)?
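(For my own sanity: the "[2/1] [_U]" field should mean 1 of 2 member devices is active. A quick parse of a copy of that mdstat line, assuming the usual [total/active] layout of the bracketed field, confirms it:)

```shell
# Status field copied from the mdstat output above; [total/active] layout assumed.
line='74340672 blocks [2/1] [_U]'

total=$(echo "$line" | sed -n 's/.*\[\([0-9][0-9]*\)\/\([0-9][0-9]*\)\].*/\1/p')
active=$(echo "$line" | sed -n 's/.*\[\([0-9][0-9]*\)\/\([0-9][0-9]*\)\].*/\2/p')
echo "$active of $total members active"
```

This prints "1 of 2 members active", i.e. a degraded mirror running on hdc alone.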
TIA
sandro
*:-)
--
Sandro Dentella *:-)
e-mail: sandro@e-den.it
http://www.tksql.org TkSQL Home page - My GPL work
* Re: Hung rebuilding
2005-06-30 13:45 Hung rebuilding Sandro Dentella
2005-06-30 15:38 ` Sandro Dentella
@ 2005-06-30 16:25 ` Gil
2005-06-30 17:42 ` Sandro Dentella
1 sibling, 1 reply; 4+ messages in thread
From: Gil @ 2005-06-30 16:25 UTC (permalink / raw)
To: Sandro Dentella; +Cc: Linux Raid List
> Jun 30 14:40:31 srv-ornago kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Jun 30 14:40:31 srv-ornago kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=15584452, sector=4194304
> Jun 30 14:40:31 srv-ornago kernel: end_request: I/O error, dev 03:07 (hda), sector 4194304
This sequence usually indicates a bad block on the media. If you
have enabled SMART on your disks, you can confirm this with
smartctl -l error /dev/hda
You'll see a bunch of UNC errors.
> Jun 30 14:40:31 srv-ornago kernel: ide0(3,7):sh-2029: reiserfs read_bitmaps: bitmap block (#524288) reading failed
> Jun 30 14:40:31 srv-ornago kernel: ide0(3,7):sh-2014: reiserfs_read_super: unable to read bitmap
Worse yet, it appears that the bad block contains part of the
reiserfs superblock, which would be why you can't mount the filesystem.
At first blush it would appear that the bad block is preventing the
resync from happening, but I'm no expert at reading mdadm -D.
> srv-ornago:~# mdadm -D /dev/md3
> /dev/md3:
> Version : 00.90.00
> Creation Time : Wed Dec 8 12:28:15 2004
> Raid Level : raid1
> Array Size : 74340672 (70.90 GiB 76.12 GB)
> Device Size : 74340672 (70.90 GiB 76.12 GB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 3
> Persistence : Superblock is persistent
>
> Update Time : Thu Jun 30 12:53:14 2005
> State : dirty, degraded, recovering
> Active Devices : 1
> Working Devices : 2
> Failed Devices : 0
> Spare Devices : 1
>
> Rebuild Status : 0% complete
>
> UUID : 1ea38e0e:050ac659:7e84e367:2d256edd
> Events : 0.171
>
> Number Major Minor RaidDevice State
> 0 0 0 0 faulty removed
> 1 22 7 1 active sync /dev/ide/host0/bus1/target0/lun0/part7
> 2 3 7 2 spare rebuilding /dev/ide/host0/bus0/target0/lun0/part7
I'm confused by this because your I/O error is on hda according to
the kernel output, but hda should be the disk onto which the rebuild
would be writing.
> Jun 30 14:33:48 srv-ornago kernel: md: cannot remove active disk ide/host0/bus0/target0/lun0/part7 from md3 ...
However, this error message seems to say that hda is, in fact, a
part of the array.
Someone with better mdadm -D kung-fu: what are your thoughts?
--Gil
* Re: Hung rebuilding
2005-06-30 16:25 ` Gil
@ 2005-06-30 17:42 ` Sandro Dentella
0 siblings, 0 replies; 4+ messages in thread
From: Sandro Dentella @ 2005-06-30 17:42 UTC (permalink / raw)
To: Gil
> smartctl -l error /dev/hda
You are right... what is worse is that both hda and hdc have errors:
hda: ATA Error Count: 6 (device log contains only the most recent five errors)
hdc: ATA Error Count: 49 (device log contains only the most recent five errors)
(should they be needed: www.e-den.it/misc/smart-a and smart-c)
> Worse yet it appears that your bad block contains a part of the
> reiserfs superblock which would be why you can't mount the filesystem.
But does this mean that it failed reading from *both* disks? Shouldn't it
read from both disks, so that one can provide the information in case the
other fails?
I'm really puzzled. Last year the same server lost 2 disks in a day, in such
a way that they were not even recognizable by the BIOS...
Here are the log entries that refer to reiserfs; I wanted to understand
whether it is possible that the superblock sits on bad blocks on both disks...
Jun 30 11:47:56 srv-ornago kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 11:47:56 srv-ornago kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=11390276, sector=128
Jun 30 11:47:56 srv-ornago kernel: end_request: I/O error, dev 16:07 (hdc), sector 128
Jun 30 11:47:56 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: rescheduling block 128
Jun 30 11:47:56 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: unrecoverable I/O read error for block 128
Jun 30 11:47:56 srv-ornago kernel: sh-2007: reiserfs read_super_block: bread failed (dev 09:03, block 64, size 1024)
Jun 30 11:47:56 srv-ornago kernel: sh-2021: reiserfs_read_super: can not find reiserfs on md(9,3)
Jun 30 12:00:15 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: rescheduling block 134
Jun 30 12:00:15 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: unrecoverable I/O read error for block 134
Jun 30 12:00:15 srv-ornago kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 12:00:15 srv-ornago kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=11390276, sector=128
Jun 30 12:00:15 srv-ornago kernel: end_request: I/O error, dev 16:07 (hdc), sector 128
Jun 30 12:00:15 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: rescheduling block 128
Jun 30 12:00:15 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: unrecoverable I/O read error for block 128
Jun 30 12:00:15 srv-ornago kernel: sh-2006: reiserfs read_super_block: bread failed (dev 09:03, block 64, size 1024)
Jun 30 12:00:15 srv-ornago kernel: sh-2021: reiserfs_read_super: can not find reiserfs on md(9,3)
Jun 30 12:25:41 srv-ornago kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 12:25:41 srv-ornago kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=11390276, sector=128
Jun 30 12:25:41 srv-ornago kernel: end_request: I/O error, dev 16:07 (hdc), sector 128
Jun 30 12:25:41 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: rescheduling block 128
Jun 30 12:25:41 srv-ornago kernel: raid1: ide/host0/bus1/target0/lun0/part7: unrecoverable I/O read error for block 128
Jun 30 12:25:41 srv-ornago kernel: sh-2006: reiserfs read_super_block: bread failed (dev 09:03, block 64, size 1024)
Jun 30 12:25:41 srv-ornago kernel: sh-2021: reiserfs_read_super: can not find reiserfs on md(9,3)
Jun 30 14:33:48 srv-ornago kernel: md: cannot remove active disk ide/host0/bus0/target0/lun0/part7 from md3 ...
Jun 30 14:40:30 srv-ornago kernel: reiserfs: found format "3.6" with standard journal
Jun 30 14:40:31 srv-ornago kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 14:40:31 srv-ornago kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=15584452, sector=4194304
Jun 30 14:40:31 srv-ornago kernel: end_request: I/O error, dev 03:07 (hda), sector 4194304
Jun 30 14:40:31 srv-ornago kernel: ide0(3,7):sh-2029: reiserfs read_bitmaps: bitmap block (#524288) reading failed
Jun 30 14:40:31 srv-ornago kernel: ide0(3,7):sh-2014: reiserfs_read_super: unable to read bitmap
Jun 30 14:40:33 srv-ornago kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 14:40:33 srv-ornago kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=16108740, sector=4718592
Jun 30 14:40:33 srv-ornago kernel: end_request: I/O error, dev 03:07 (hda), sector 4718592
Jun 30 17:30:56 srv-ornago kernel: md: trying to remove ide/host0/bus0/target0/lun0/part7 from md3 ...
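If I do the arithmetic, the failing sectors line up exactly with the reiserfs structures named in the logs: the superblock probe reads filesystem "block 64, size 1024" (a 1 KiB block), and the bitmap is "block (#524288)", assuming the default 4 KiB reiserfs block size:

```shell
sector_size=512  # kernel log sectors are 512 bytes

# "reiserfs read_super_block: bread failed (dev 09:03, block 64, size 1024)"
sb_sector=$(( 64 * 1024 / sector_size ))

# "reiserfs read_bitmaps: bitmap block (#524288) reading failed",
# assuming 4 KiB filesystem blocks (the reiserfs default)
bm_sector=$(( 524288 * 4096 / sector_size ))

echo "superblock at sector $sb_sector, bitmap block at sector $bm_sector"
```

Sector 128 is exactly where hdc fails and sector 4194304 is exactly where hda fails, so each disk seems to have its own bad spot, each under a different reiserfs structure.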
Thanks for any possible hint
sandro
*%-|
--
Sandro Dentella *:-)
e-mail: sandro@e-den.it
http://www.tksql.org TkSQL Home page - My GPL work