From: David Greaves <david@dgreaves.com>
To: Jon Lewis <jlewis@lewis.org>, neilb@cse.unsw.edu.au
Cc: Guy <bugzilla@watkins-home.com>,
linux-raid@vger.kernel.org, aaron@america.com
Subject: BUG: mdadm --fail makes the kernel lose count (was Re: raid5 won't resync)
Date: Tue, 31 Aug 2004 10:22:40 +0100
Message-ID: <41344360.8060206@dgreaves.com>
In-Reply-To: <Pine.LNX.4.58.0408310401070.1949@web1.mmaero.com>
Neil,
I've copied you as I think there's a bug in resync behaviour (kernel.org 2.6.6).
Summary: no data loss, but a resync in progress doesn't stop when mdadm
fails the device being resynced, and the kernel loses count of its working devices.
When the resync completed:
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[3] sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/4] [UUUUU]
That should be [5/5], shouldn't it?
Apologies if this is known and fixed in a later kernel.
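For reference, the sequence that triggers it here is roughly as follows (a
sketch using my device names, not a tested recipe):
--
# mdadm /dev/md0 --fail /dev/sdd1      (device fails, or is failed by hand)
# mdadm /dev/md0 --remove /dev/sdd1
# mdadm /dev/md0 --add /dev/sdd1       (resync starts)
# mdadm /dev/md0 --fail /dev/sdd1      (fail it again _during_ the resync)
--
The resync carries on regardless, and the working-device count comes out
wrong from then on.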
Jon Lewis wrote:
>Since the recovery had stopped making progress, I decided to fail the
>drive it had brought in as the spare with mdadm /dev/md2 -f /dev/sdf1.
>That worked as expected. mdadm /dev/md2 -r /dev/sdf1 seems to have hung.
>It's in state D and I can't terminate it. Trying to add a new spare,
>mdadm can't get a lock on /dev/md2 because the previous one is stuck.
>
>I suspect at this point, we're going to have to just reboot again.
>
>
Jon,
Since I had a similar problem (manually 'failing' a device during a resync
- I have a 5-device RAID5 with no spares),
I thought I'd ask whether you noticed anything like this at all?
David
PS: full story, log messages etc. below.
Whilst dealing with my own problems the other day, I saw the following odd
behaviour:
Disk sdd1 failed (I think a single spurious bad-block read).
/proc/mdstat and --detail showed it marked faulty.
I mdadm-removed it from the array.
I checked it and found no errors.
I mdadm-added it and a resync started.
I realised I'd made a mistake: I'd checked the partition and not the whole disk.
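(For anyone repeating the check: it was a read-only surface scan, something
like the badblocks runs below - the exact invocation is from memory, so
treat it as illustrative:
--
# badblocks -sv /dev/sdd1      (what I actually scanned - just the partition)
# badblocks -sv /dev/sdd       (what I should have scanned - the whole disk)
--
)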
I looked to see what was happening:
I did an mdadm --detail /dev/md0
--
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sun Aug 29 21:08:35 2004
State : clean, degraded, recovering
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 128K
Rebuild Status : 0% complete
    Number   Major   Minor   RaidDevice   State
       0       8       1         0        active sync   /dev/sda1
       1       8      33         1        active sync   /dev/sdc1
       2       8      17         2        active sync   /dev/sdb1
       3       0       0        -1        removed
       4       3      65         4        active sync   /dev/hdb1
       5       8      49         3        spare         /dev/sdd1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1979229
--
I mdadm-failed the device _whilst it was syncing_.
The kernel reported "Operation continuing on 3 devices" (when four were
still working).
[I thought at this point that I'd lost the lot!
A kernel that can't count its devices is not confidence-inspiring.]
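(You can cross-check the kernel's claim against mdadm's own view by counting
the members still marked 'active sync':
--
# mdadm --detail /dev/md0 | grep -c 'active sync'
4
--
which matches the --detail output further down, not the kernel's "3".)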
At this point I had:
--
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[5](F) sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
[>....................] recovery = 0.3% (920724/245111552)
finish=349.5min s
--
Not nice looking at all!!!
Another mdadm --detail /dev/md0
--
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sun Aug 29 21:09:06 2004
State : clean, degraded, recovering
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Rebuild Status : 0% complete
    Number   Major   Minor   RaidDevice   State
       0       8       1         0        active sync   /dev/sda1
       1       8      33         1        active sync   /dev/sdc1
       2       8      17         2        active sync   /dev/sdb1
       3       0       0        -1        removed
       4       3      65         4        active sync   /dev/hdb1
       5       8      49         3        faulty        /dev/sdd1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1979246
--
Now mdadm reports the drive faulty but:
# mdadm /dev/md0 --remove /dev/sdd1
mdadm: hot remove failed for /dev/sdd1: Device or resource busy
OK, fail the drive again and try to remove it.
Nope.
Oh-oh.
I figured leaving it was the safest thing at this point.
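(If you can't afford to wait, a gentler option than rebooting might be to
keep retrying the remove until the kernel lets go - an untested sketch:
--
# until mdadm /dev/md0 --remove /dev/sdd1; do sleep 60; done
--
This simply reissues the hot-remove once a minute until it stops failing
with "Device or resource busy".)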
Later that night it finished.
Aug 30 01:37:55 cu kernel: md: md0: sync done.
Aug 30 01:37:55 cu kernel: RAID5 conf printout:
Aug 30 01:37:55 cu kernel: --- rd:5 wd:3 fd:1
Aug 30 01:37:55 cu kernel: disk 0, o:1, dev:sda1
Aug 30 01:37:55 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 01:37:55 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 01:37:55 cu kernel: disk 3, o:0, dev:sdd1
Aug 30 01:37:55 cu kernel: disk 4, o:1, dev:hdb1
Aug 30 01:37:55 cu kernel: RAID5 conf printout:
Aug 30 01:37:55 cu kernel: --- rd:5 wd:3 fd:1
Aug 30 01:37:55 cu kernel: disk 0, o:1, dev:sda1
Aug 30 01:37:55 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 01:37:55 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 01:37:55 cu kernel: disk 3, o:0, dev:sdd1
Aug 30 01:37:55 cu kernel: disk 4, o:1, dev:hdb1
Aug 30 01:37:55 cu kernel: RAID5 conf printout:
Aug 30 01:37:55 cu kernel: --- rd:5 wd:3 fd:1
Aug 30 01:37:55 cu kernel: disk 0, o:1, dev:sda1
Aug 30 01:37:55 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 01:37:55 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 01:37:55 cu kernel: disk 4, o:1, dev:hdb1
Next morning:
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[5](F) sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
unused devices: <none>
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Aug 30 08:45:35 2004
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
    Number   Major   Minor   RaidDevice   State
       0       8       1         0        active sync   /dev/sda1
       1       8      33         1        active sync   /dev/sdc1
       2       8      17         2        active sync   /dev/sdb1
       3       0       0        -1        removed
       4       3      65         4        active sync   /dev/hdb1
       5       8      49        -1        faulty        /dev/sdd1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1986057
I don't know why it was still shown as (F) - as if the last fail and remove
had been 'queued'?
Finally I did mdadm /dev/md0 --remove /dev/sdd1
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Aug 30 08:54:28 2004
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
    Number   Major   Minor   RaidDevice   State
       0       8       1         0        active sync   /dev/sda1
       1       8      33         1        active sync   /dev/sdc1
       2       8      17         2        active sync   /dev/sdb1
       3       0       0        -1        removed
       4       3      65         4        active sync   /dev/hdb1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.1986058
cu:/var/cache/apt-cacher# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
unused devices: <none>
# mdadm /dev/md0 --add /dev/sdd1
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[5] sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/3] [UUU_U]
[>....................] recovery = 0.0% (161328/245111552)
finish=252.9min speed=16132K/sec
unused devices: <none>
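(Incidentally, something like
--
# watch -n 60 cat /proc/mdstat
--
is a handy way to keep an eye on a rebuild like this without polling by hand.)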
Eventually:
Aug 30 17:24:07 cu kernel: md: md0: sync done.
Aug 30 17:24:07 cu kernel: RAID5 conf printout:
Aug 30 17:24:07 cu kernel: --- rd:5 wd:4 fd:0
Aug 30 17:24:07 cu kernel: disk 0, o:1, dev:sda1
Aug 30 17:24:07 cu kernel: disk 1, o:1, dev:sdc1
Aug 30 17:24:07 cu kernel: disk 2, o:1, dev:sdb1
Aug 30 17:24:07 cu kernel: disk 3, o:1, dev:sdd1
Aug 30 17:24:07 cu kernel: disk 4, o:1, dev:hdb1
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid6]
md0 : active raid5 sdd1[3] sdc1[1] sdb1[2] sda1[0] hdb1[4]
980446208 blocks level 5, 128k chunk, algorithm 2 [5/4] [UUUUU]
unused devices: <none>
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.01
Creation Time : Sat Jun 5 18:13:04 2004
Raid Level : raid5
Array Size : 980446208 (935.03 GiB 1003.98 GB)
Device Size : 245111552 (233.76 GiB 250.99 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Aug 30 17:24:07 2004
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
    Number   Major   Minor   RaidDevice   State
       0       8       1         0        active sync   /dev/sda1
       1       8      33         1        active sync   /dev/sdc1
       2       8      17         2        active sync   /dev/sdb1
       3       8      49         3        active sync   /dev/sdd1
       4       3      65         4        active sync   /dev/hdb1
UUID : 19779db7:1b41c34b:f70aa853:062c9fe5
Events : 0.2014548
So we're back to normal and happy - but I guess the md0 device needs a
restart now to put the count right, which is bad.
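(By 'restart' I mean stopping and reassembling the array so the kernel's
device count comes back right - presumably something like the following,
though I haven't tried it, so treat it as a sketch:
--
# umount /dev/md0
# mdadm --stop /dev/md0
# mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/hdb1
--
)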
David