From: David Greaves <david@dgreaves.com>
To: Jon Lewis <jlewis@lewis.org>, neilb@cse.unsw.edu.au
Cc: Guy <bugzilla@watkins-home.com>,
linux-raid@vger.kernel.org, aaron@america.com
Subject: BUG: mdadm --fail makes the kernel lose count (was Re: raid5 won't resync)
Date: Tue, 31 Aug 2004 10:22:40 +0100
Message-ID: <41344360.8060206@dgreaves.com>
In-Reply-To: <Pine.LNX.4.58.0408310401070.1949@web1.mmaero.com>
Neil,
I've copied you as I think there's a bug in resync behaviour (kernel.org 2.6.6).
Summary: No data loss. A resync in progress doesn't stop when mdadm
fails the device being resynced, and the kernel loses count of the devices.
When the resync completes:
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[3] sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/4] [UUUUU]
That should be [5/5] shouldn't it?
Apologies if this is known and fixed in a later kernel.
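A minimal sketch of how the mismatch in that status line could be checked automatically. It assumes the usual /proc/mdstat layout of "[raid disks/active disks]" followed by a map with one U per working member; the function name check_mdstat_counts is made up for illustration.

```python
import re

# Matches the "[n/m] [UU_U]" tail of an md status line in /proc/mdstat.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def check_mdstat_counts(line):
    """Return (raid_disks, reported_active, up_in_map, consistent)."""
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError("no '[n/m] [U_]' status found in line")
    raid_disks = int(m.group(1))
    reported_active = int(m.group(2))
    # Count the members shown as up in the device map.
    up_in_map = m.group(3).count("U")
    return raid_disks, reported_active, up_in_map, reported_active == up_in_map


if __name__ == "__main__":
    line = "980446208 blocks level 5, 128k chunk, algorithm 2 [5/4] [UUUUU]"
    print(check_mdstat_counts(line))  # (5, 4, 5, False)
```

For the line above, the reported count is 4 while the map shows 5 members up, so the check flags the array as inconsistent.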
Jon Lewis wrote:
>Since the recovery had stopped making progress, I decided to fail the
>drive it had brought in as the spare with mdadm /dev/md2 -f /dev/sdf1.
>That worked as expected. mdadm /dev/md2 -r /dev/sdf1 seems to have hung.
>It's in state D and I can't terminate it. Trying to add a new spare,
>mdadm can't get a lock on /dev/md2 because the previous one is stuck.
>
>I suspect at this point, we're going to have to just reboot again.
>
>
Jon,
I had a similar problem (manually 'failing' a device during resync;
I have a 5-device RAID5 with no spares),
so I thought I'd ask whether you noticed anything like this at all?
David
PS full story, messages etc below
Whilst having my own problems the other day I had the following odd
behaviour:
Disk sdd1 failed (I think from a single spurious bad block read).
/proc/mdstat and --detail showed it marked faulty.
I mdadm-removed it from the array.
I checked it and found no errors.
I mdadm-added it and a resync started.
I then realised I'd made a mistake: I had checked the partition and not the disk.
I looked to see what was happening with mdadm --detail /dev/md0:
--
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sun Aug 29 21:08:35 2004
State : clean, degraded, recovering
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 128K
Rebuild Status : 0% complete
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 33 1 active sync /dev/sdc1
2 8 17 2 active sync /dev/sdb1
3 0 0 -1 removed
4 3 65 4 active sync /dev/hdb1
5 8 49 3 spare /dev/sdd1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1979229
--
I mdadm-failed the device _whilst it was syncing_.
The kernel reported "Operation continuing on 3 devices" (not 4).
[I thought at this point that I'd lost the lot!
The kernel not counting properly is not confidence-inspiring.]
At this point I had:
--
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[5](F) sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
[>....................] recovery = 0.3% (920724/245111552)
finish=349.5min s
--
Not nice looking at all!!!
Another mdadm --detail /dev/md0
--
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sun Aug 29 21:09:06 2004
State : clean, degraded, recovering
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Rebuild Status : 0% complete
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 33 1 active sync /dev/sdc1
2 8 17 2 active sync /dev/sdb1
3 0 0 -1 removed
4 3 65 4 active sync /dev/hdb1
5 8 49 3 faulty /dev/sdd1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1979246
--
Now mdadm reports the drive faulty but:
mdadm /dev/md0 --remove /dev/sdd1
mdadm: hot remove failed for /dev/sdd1: Device or resource busy
OK, fail the drive again and try and remove it.
Nope.
Oh-oh.
I figured leaving it was the safest thing at this point.
Later that night it finished.
Aug 30 01:37:55 cu kernel: md: md0: sync done.
Aug 30 01:37:55 cu kernel: RAID5 conf printout:
Aug 30 01:37:55 cu kernel: --- rd:5 wd:3 fd:1
Aug 30 01:37:55 cu kernel: disk 0, o:1, dev:sda1
Aug 30 01:37:55 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 01:37:55 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 01:37:55 cu kernel: disk 3, o:0, dev:sdd1
Aug 30 01:37:55 cu kernel: disk 4, o:1, dev:hdb1
Aug 30 01:37:55 cu kernel: RAID5 conf printout:
Aug 30 01:37:55 cu kernel: --- rd:5 wd:3 fd:1
Aug 30 01:37:55 cu kernel: disk 0, o:1, dev:sda1
Aug 30 01:37:55 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 01:37:55 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 01:37:55 cu kernel: disk 3, o:0, dev:sdd1
Aug 30 01:37:55 cu kernel: disk 4, o:1, dev:hdb1
Aug 30 01:37:55 cu kernel: RAID5 conf printout:
Aug 30 01:37:55 cu kernel: --- rd:5 wd:3 fd:1
Aug 30 01:37:55 cu kernel: disk 0, o:1, dev:sda1
Aug 30 01:37:55 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 01:37:55 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 01:37:55 cu kernel: disk 4, o:1, dev:hdb1
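A rough sketch of a cross-check on these printouts, under two assumptions that are mine rather than confirmed here: that "wd" in the header is the kernel's working-disk count, and that "o:1" marks a disk the kernel considers operational. The helper name check_conf_printout is hypothetical.

```python
import re

HEADER_RE = re.compile(r"--- rd:(\d+) wd:(\d+) fd:(\d+)")
DISK_RE = re.compile(r"disk (\d+), o:(\d), dev:(\S+)")


def check_conf_printout(lines):
    """Return (wd_from_header, disks_flagged_o1) for one conf printout."""
    wd = None
    operational = 0
    for line in lines:
        header = HEADER_RE.search(line)
        if header:
            wd = int(header.group(2))
            continue
        disk = DISK_RE.search(line)
        # Assumption: o:1 means the disk is operational.
        if disk and disk.group(2) == "1":
            operational += 1
    if wd is None:
        raise ValueError("no 'rd: wd: fd:' header found")
    return wd, operational


log = """\
Aug 30 01:37:55 cu kernel: --- rd:5 wd:3 fd:1
Aug 30 01:37:55 cu kernel: disk 0, o:1, dev:sda1
Aug 30 01:37:55 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 01:37:55 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 01:37:55 cu kernel: disk 3, o:0, dev:sdd1
Aug 30 01:37:55 cu kernel: disk 4, o:1, dev:hdb1
"""

if __name__ == "__main__":
    print(check_conf_printout(log.splitlines()))  # (3, 4)
```

On the first printout this gives a header count of 3 against 4 disks flagged o:1, the same off-by-one as "Operation continuing on 3 devices".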
Next morning:
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[5](F) sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
unused devices: <none>
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Aug 30 08:45:35 2004
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 33 1 active sync /dev/sdc1
2 8 17 2 active sync /dev/sdb1
3 0 0 -1 removed
4 3 65 4 active sync /dev/hdb1
5 8 49 -1 faulty /dev/sdd1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1986057
I don't know why it was still (F). As if the last fail and remove were
'queued'?
Finally I did mdadm /dev/md0 --remove /dev/sdd1
mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Aug 30 08:54:28 2004
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 33 1 active sync /dev/sdc1
2 8 17 2 active sync /dev/sdb1
3 0 0 -1 removed
4 3 65 4 active sync /dev/hdb1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1986058
cu:/var/cache/apt-cacher# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
unused devices: <none>
mdadm /dev/md0 --add /dev/sdd1
cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[5] sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
[>....................] recovery = 0.0% (161328/245111552)
finish=252.9min speed=16132K/sec
unused devices: <none>
Eventually:
Aug 30 17:24:07 cu kernel: md: md0: sync done.
Aug 30 17:24:07 cu kernel: RAID5 conf printout:
Aug 30 17:24:07 cu kernel: --- rd:5 wd:4 fd:0
Aug 30 17:24:07 cu kernel: disk 0, o:1, dev:sda1
Aug 30 17:24:07 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 17:24:07 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 17:24:07 cu kernel: disk 3, o:1, dev:sdd1
Aug 30 17:24:07 cu kernel: disk 4, o:1, dev:hdb1
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[3] sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/4] [UUUUU]
unused devices: <none>
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Aug 30 17:24:07 2004
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 33 1 active sync /dev/sdc1
2 8 17 2 active sync /dev/sdb1
3 8 49 3 active sync /dev/sdd1
4 3 65 4 active sync /dev/hdb1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.2014548
So back to normal and happy, but I guess the md0 device needs a restart
now, which is bad.
David