linux-raid.vger.kernel.org archive mirror
From: David Greaves <david@dgreaves.com>
To: Mitchell Laks <mlaks@verizon.net>
Cc: linux-raid@vger.kernel.org
Subject: Re: disaster. raid1 drive failure rsync=DELAYED why?? please help
Date: Sun, 13 Mar 2005 15:49:02 +0000
Message-ID: <423460EE.9070602@dgreaves.com>
In-Reply-To: <200503122351.23430.mlaks@verizon.net>

Mitchell Laks wrote:

>Hi,
>I have a remote system with a raid1 of a data disk. I got a call from  the 
>person using the system that the application that writes to the data disk was 
>not working.
>
>system drive is /dev/hda with separate partitions /, /var, /home, /tmp.
>data drive is linux software raid1 /dev/md0 with /dev/hdc1, /dev/hde1.
>
>I logged in remotely and discovered that the /var partition was full because
>of many write errors from /dev/hde1 logged in /var/log/syslog.
>
>When I looked at cat /proc/mdstat I discovered that /dev/md0 was degraded
>because /dev/hdc1 had failed (there was an F there) and /dev/hde1 was
>carrying the load.
>
>I shut down the applications in background. I emptied out /var/log/syslog. I 
>then removed /dev/hdc1 from the array /dev/md0. 
>
>I had another pair of drives on the system that was part of another mirrored 
>array /dev/md1 with no useful information stored on them. 
>
>/dev/md1  /dev/hdf1 /dev/hdh1 
>
>I thought ok, let me detach /dev/hdf1 from the other array /dev/md1 and try to
>attach it to /dev/md0 and rebuild the array /dev/md0. That way I would rescue
>the data on the threatening drive /dev/hde1 which is spewing out error
>messages to my /var/log/syslog and threatening to die!
>
>So stupidly (probably), I did
>
>mdadm /dev/md1  --fail /dev/hdf1 --remove /dev/hdf1
>  
>
OK.
What does mdadm --detail /dev/md1 show?

>then i did 
>mdadm /dev/md0 --add /dev/hdf1
>  
>
hmm - I don't know. I would have zeroed it :)

>Now when i did 
>cat /proc/mdstat I see:
>
>md0 : active raid1 hdf1[2] hde1[0]
>      244195904 blocks [2/1] [U_]
>        resync=DELAYED
>
>I don't see any rebuilding action going on.
>  
>
I see the full /proc/mdstat appears later...

From the source (md.c):
    /* we overload curr_resync somewhat here.
     * 0 == not engaged in resync at all
     * 2 == checking that there is no conflict with another sync
     * 1 == like 2, but have yielded to allow conflicting resync to
     *        commense
     * other == active in resync - this many blocks
     *
     * Before starting a resync we must have set curr_resync to
     * 2, and then checked that every "conflicting" array has curr_resync
     * less than ours.  When we find one that is the same or higher
     * we wait on resync_wait.  To avoid deadlock, we reduce curr_resync
     * to 1 if we choose to yield (based arbitrarily on address of mddev
     * structure).
     * This will mean we have to start checking from the beginning again.

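You are in state 1 or 2: md0 thinks its resync conflicts with a resync on
another array, so it is waiting (or has yielded) rather than rebuilding.
hmmm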
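
To make the state encoding in that comment concrete, here is a minimal Python
sketch. It is not kernel code, and the names describe_curr_resync and
is_delayed are made up for illustration. It only maps a curr_resync value to
the meaning the comment gives it; on the reading above, states 1 and 2 are the
ones that go with the resync=DELAYED line in the mdstat output.

```python
def describe_curr_resync(curr_resync: int) -> str:
    """Map a curr_resync value to the meaning given in the md.c comment.

    Hypothetical helper for illustration only; the kernel has no such function.
    """
    if curr_resync < 0:
        raise ValueError("negative values are not described by the comment")
    if curr_resync == 0:
        return "not engaged in resync at all"
    if curr_resync == 2:
        return "checking for a conflicting resync (delayed)"
    if curr_resync == 1:
        return "yielded to a conflicting resync (delayed)"
    # Any other value means the resync is running; the value is a block count.
    return f"active in resync, {curr_resync} blocks"


def is_delayed(curr_resync: int) -> bool:
    """True for the two waiting states (1 and 2) described in the comment."""
    return curr_resync in (1, 2)
```

So an array whose value sits at 1 or 2 for days, as here, is stuck waiting and
has not copied a single block yet.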


next email:

Mitchell Laks wrote:

>1) I tried to add the new spare device to /dev/md0 on friday afternoon.  It
>still has not rebuilt.
>
problem 1.

> I am also unable to do "ls" of the directory of the 
>drive.
>
problem 2 - this shouldn't be happening

>2) I had another idea. Why not umount the drive and then run fsck.ext3 on the 
>drive. Maybe it needs fsck? When I tried that I got the message:
>  
>
Nope - rebuilding happens at the block layer, deep underneath the filesystem,
so fsck won't make it start.

>A1:~# umount /home/big0
>umount: /home/big0: device is busy
>umount: /home/big0: device is busy
>
>(/dev/md0 is mounted on /home/big0).
>  
>
This just means that some process has a filehandle open on /home/big0
(or has its working directory there).
lsof + grep can help to find candidate processes.
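
If lsof is not to hand, roughly the same information can be read from /proc.
The following is a minimal Python sketch, assuming a Linux /proc and enough
privilege (root) to read other processes' entries. The function name
pids_using is made up for illustration, and it only looks at working
directories and open file descriptors, so it can miss other ways of keeping a
mount busy (memory-mapped files, for example).

```python
import os


def pids_using(path, proc_root="/proc"):
    """Return {pid: [paths]} for processes whose cwd or open files lie under path.

    Hypothetical helper; it inspects /proc/<pid>/cwd and /proc/<pid>/fd/* only.
    """
    target = os.path.realpath(path)
    prefix = target.rstrip("/") + "/"
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd")]
        try:
            fd_dir = os.path.join(base, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we may not look at its descriptors
        for link in links:
            try:
                dest = os.readlink(link)
            except OSError:
                continue
            if dest == target or dest.startswith(prefix):
                holders.setdefault(int(entry), []).append(dest)
    return holders
```

Calling pids_using("/home/big0") as root would list the candidate processes to
stop before trying umount again.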

>A1:~# cat /proc/mdstat
>Personalities : [raid1]
>md0 : active raid1 hdi1[2] hdg1[0]
>      244195904 blocks [2/1] [U_]
>        resync=DELAYED
>md1 : active raid1 hdc1[1]
>      244195904 blocks [2/1] [_U]
>
>md2 : active raid1 hde1[1]
>      244195904 blocks [2/1] [_U]
>
>unused devices: <none>

next email:

>I had some more bright ideas and here is what happened:
>
>I am unable to even do ls on the directory mounted on this raid device.
>
>So, I said, maybe the problem is that I need to run fsck.ext3 on the drive
>first. So I tried to umount it and I got the error message:
>
>A1:~# umount /home/big0
>umount: /home/big0: device is busy
>umount: /home/big0: device is busy
>
>So I said maybe the problem is the resyncing. So maybe an idea is to fail the
>newly added device /dev/hdi1 and then remove /dev/hdi1, move back to degraded
>mode. Do an umount of the drive, then do an fsck.ext3 on the drive and then I
>can do a reboot and then add the drive back in.
>
>Hey why not?
>  
>
'cos I can't figure out what's going on!

>Ok. So I tried. Here is the transcript of the session:
>
>A1:~# cat /proc/mdstat
>Personalities : [raid1]
>md0 : active raid1 hdi1[2] hdg1[0]
>      244195904 blocks [2/1] [U_]
>        resync=DELAYED
>md1 : active raid1 hdc1[1]
>      244195904 blocks [2/1] [_U]
>
>md2 : active raid1 hde1[1]
>      244195904 blocks [2/1] [_U]
>
>unused devices: <none>
>A1:~# umount /home/big0
>umount: /home/big0: device is busy
>umount: /home/big0: device is busy
>A1:~# whoami
>root
>A1:~# mdadm /dev/md0 -fail /dev/hdi1 --remove /dev/hdi1
>mdadm: hot add failed for /dev/hdi1: Invalid argument
>
>A1:~# cat /proc/mdstat
>Personalities : [raid1]
>md0 : active raid1 hdi1[2] hdg1[0]
>      244195904 blocks [2/1] [U_]
>        resync=DELAYED
>md1 : active raid1 hdc1[1]
>      244195904 blocks [2/1] [_U]
>
>md2 : active raid1 hde1[1]
>      244195904 blocks [2/1] [_U]
>
>unused devices: <none>
>A1:~# mdadm --manage --set-faulty /dev/md0  /dev/hdi1
>mdadm: set /dev/hdi1 faulty in /dev/md0
>A1:~# mdadm --detail /dev/md0
>/dev/md0:
>        Version : 00.90.01
>  Creation Time : Wed Jan 12 14:19:21 2005
>     Raid Level : raid1
>     Array Size : 244195904 (232.88 GiB 250.06 GB)
>    Device Size : 244195904 (232.88 GiB 250.06 GB)
>   Raid Devices : 2
>  Total Devices : 2
>Preferred Minor : 0
>    Persistence : Superblock is persistent
>
>    Update Time : Sun Mar 13 01:28:06 2005
>          State : clean, degraded
> Active Devices : 1
>Working Devices : 1
> Failed Devices : 1
>  Spare Devices : 0
>
>           UUID : 6b8b4567:327b23c6:643c9869:66334873
>         Events : 0.343413
>
>    Number   Major   Minor   RaidDevice State
>       0      34        1        0      active sync   /dev/hdg1
>       1       0        0        -      removed
>
>       2      56        1        1      faulty   /dev/hdi1
>A1:~# mdadm /dev/md0 -r /dev/hdi1
>mdadm: hot remove failed for /dev/hdi1: Device or resource busy
>  
>
Could this be an mdadm 1.8.1 issue? It seemed like the right thing to do.

>A1:~# cat /proc/mdstat
>Personalities : [raid1]
>md0 : active raid1 hdi1[2](F) hdg1[0]
>      244195904 blocks [2/1] [U_]
>        resync=DELAYED
>md1 : active raid1 hdc1[1]
>      244195904 blocks [2/1] [_U]
>
>md2 : active raid1 hde1[1]
>      244195904 blocks [2/1] [_U]
>
>unused devices: <none>
>A1:~# mdadm /dev/md0 -r /dev/hdi1
>mdadm: hot remove failed for /dev/hdi1: Device or resource busy
>A1:~#                                                                 
>
>Any ideas on what I can do now?
>  
>
Upgrade mdadm and try the remove again.

next email:

>One more bit of information:
>
>this was a bit of info from 
>
>tail /var/log/kern.log
>
>Mar 11 04:42:11 A1 kernel:
>Mar 11 04:42:11 A1 kernel: hdg: drive not ready for command
>Mar 11 04:42:11 A1 kernel: raid1: hdg1: rescheduling sector 215908496
>Mar 11 04:42:11 A1 kernel: raid1: hdg1: redirecting sector 215908496 to anotherr
>Mar 11 04:42:11 A1 kernel: hdg: status error: status=0x58 { DriveReady SeekComp}
>Mar 11 04:42:11 A1 kernel:
>Mar 11 04:42:11 A1 kernel: hdg: drive not ready for command
>Mar 11 04:42:11 A1 kernel: raid1: hdg1: rescheduling sector 215908496
>Mar 11 04:42:11 A1 kernel: raid1: hdg1: redirecting sector 215908496 to 
>
>but that all was from Mar11 and today is Mar13....
>  
>
well, it may explain why things went bad.


I think you need to:
* Upgrade mdadm.
* Then run cat /proc/mdstat.
* Then run mdadm --detail on all md devices.
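
The mdstat output can also be checked mechanically. Below is a minimal Python
sketch, assuming the /proc/mdstat layout seen in this thread. The function name
parse_mdstat and the dictionary keys are made up for illustration, and other
layouts (inactive arrays, for instance) are not handled.

```python
import re

# Matches lines such as "md0 : active raid1 hdi1[2] hdg1[0]".
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")
# Matches the "[2/1] [U_]" part of the blocks line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")

SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 hdi1[2] hdg1[0]
      244195904 blocks [2/1] [U_]
        resync=DELAYED
md1 : active raid1 hdc1[1]
      244195904 blocks [2/1] [_U]

md2 : active raid1 hde1[1]
      244195904 blocks [2/1] [_U]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: info} with members, degraded and delayed flags."""
    arrays = {}
    current = None
    for line in text.splitlines():
        match = ARRAY_RE.match(line)
        if match:
            name, state, level, members = match.groups()
            current = {"state": state, "level": level,
                       "members": members.split(),
                       "degraded": False, "delayed": False}
            arrays[name] = current
            continue
        if current is None:
            continue  # header lines before the first array
        status = STATUS_RE.search(line)
        if status:
            # An underscore marks a missing mirror half.
            current["degraded"] = "_" in status.group(3)
        if "resync=DELAYED" in line:
            current["delayed"] = True
    return arrays
```

On the box itself the input would be the contents of /proc/mdstat rather than
SAMPLE, which is copied from the transcript in this thread.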

Then note which md devices are 'important'.

Also:
What does mount say?
Is the filesystem on /dev/md0 usable? (It should be fine.)

Is the box safe to reboot?

When you reply to my inline questions, remove all the context to trim
the mail right down :)

David
