In testing this some more, I've determined that (always with this
raid10.c patch, sometimes without) the kernel is not recognizing
marked-faulty drives when they're added back to the array. It appears
to be some bit that is flagged and (I assume) normally cleared when that
drive is re-added as an array member.

If I zero the device (I'm assuming it's the wiping of the mdadm
superblock), it will be marked upon issuing 'mdadm /dev/md0 -a
/dev/dm-0' as "spare" instead of "faulty-spare".

This behaviour has been erratic for a while, and I'm not sure if I'm
seeing a bug or if I am working under the wrong presumption with
inappropriate actions on my part.

When a drive is either manually marked "failed" or is automatically
tagged during a failure, is the expected user action to zero the
(original or replacement) drive before doing an 'add'? Should the
kernel recognize that the drive was removed, and that the 'add' should
clear any "faulty" or "failed" state?

/eli

PS - In the process of figuring out when this occurs and how to work
around it, I just hacked up this shell script that takes care of
removing the device from the array, zeroing it, re-reading/scanning the
disk and adding it back in, depending on the function that is called.

Eli Stair wrote:
>
> Thanks Neil,
>
> I just gave this patched module a shot on four systems. So far, I
> haven't seen the device number inappropriately increment, though as per
> a mail I sent a short while ago that seemed remedied by using the 1.2
> superblock, for some reason.
> However, it appears to have introduced a
> new issue, and another is unresolved by it:
>
>
> // BUG 1
> The single-command syntax to fail and remove a drive is still failing, I
> do not know if this is somehow contributing to the further (new) issues
> below:
>
> [root@gtmp06 tmp]# mdadm /dev/md0 --fail /dev/dm-0 --remove /dev/dm-0
> mdadm: set /dev/dm-0 faulty in /dev/md0
> mdadm: hot remove failed for /dev/dm-0: Device or resource busy
>
> [root@gtmp06 tmp]# mdadm /dev/md0 --remove /dev/dm-0
> mdadm: hot removed /dev/dm-0
>
>
> // BUG 2
> Now, upon adding or re-adding a "fail...remove"'d drive, it is not used
> for resync. I realized previously that added drives weren't re-synced
> until the existing array build was done, then they were grabbed. This
> however is a clean/active array that is rejecting the drive.
>
> I've performed this identically on both a clean & active array, as well
> as a newly-created (resync'ing) array, to the same effect. Even after
> rebuild or reboot, the removed drive isn't taken back and remains listed
> as a "faulty spare", with dmesg indicating that it is "non-fresh".
>
>
> // DMESG:
>
> md: kicking non-fresh dm-0 from array!
>
>
> // ARRAY status 'mdadm -D /dev/md0'
>
>           State : active, degraded
>  Active Devices : 13
> Working Devices : 13
>  Failed Devices : 1
>   Spare Devices : 0
>
>          Layout : near=1, offset=2
>      Chunk Size : 512K
>
>            Name : 0
>            UUID : 05c2faf4:facfcad3:ba33b140:100f428a
>          Events : 22
>
>     Number   Major   Minor   RaidDevice State
>        0     253        1        0      active sync   /dev/dm-1
>        1     253        2        1      active sync   /dev/dm-2
>        2     253        5        2      active sync   /dev/dm-5
>        3     253        4        3      active sync   /dev/dm-4
>        4     253        6        4      active sync   /dev/dm-6
>        5     253        3        5      active sync   /dev/dm-3
>        6     253       13        6      active sync   /dev/dm-13
>        7       0        0        7      removed
>        8     253        7        8      active sync   /dev/dm-7
>        9     253        8        9      active sync   /dev/dm-8
>       10     253        9       10      active sync   /dev/dm-9
>       11     253       11       11      active sync   /dev/dm-11
>       12     253       10       12      active sync   /dev/dm-10
>       13     253       12       13      active sync   /dev/dm-12
>
>        7     253        0        -      faulty spare   /dev/dm-0
>
>
> Let me know what more I can do to help track this down. I'm reverting
> this patch, since it is behaving less-well than before. Will be happy
> to try others.
>
> Attached are typescript of the drive remove/add sessions and all output.
>
>
> /eli
>
>
> Neil Brown wrote:
> > On Friday October 6, estair@ilm.com wrote:
> > >
> > > This patch has resolved the immediate issue I was having on 2.6.18 with
> > > RAID10. Previous to this change, after removing a device from the array
> > > (with mdadm --remove), physically pulling the device and
> > > changing/re-inserting, the "Number" of the new device would be
> > > incremented on top of the highest-present device in the array. Now, it
> > > resumes its previous place.
> > >
> > > Does this look to be 'correct' output for a 14-drive array, which dev 8
> > > was failed/removed from then "add"'ed? I'm trying to determine why the
> > > device doesn't get pulled back into the active configuration and
> > > re-synced. Any comments?
> >
> > Does this patch help?
> >
> >
> > Fix count of degraded drives in raid10.
> >
> >
> > Signed-off-by: Neil Brown
> >
> > ### Diffstat output
> >  ./drivers/md/raid10.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
> > --- .prev/drivers/md/raid10.c   2006-10-09 14:18:00.000000000 +1000
> > +++ ./drivers/md/raid10.c       2006-10-05 20:10:07.000000000 +1000
> > @@ -2079,7 +2079,7 @@ static int run(mddev_t *mddev)
> >                 disk = conf->mirrors + i;
> >
> >                 if (!disk->rdev ||
> > -                   !test_bit(In_sync, &rdev->flags)) {
> > +                   !test_bit(In_sync, &disk->rdev->flags)) {
> >                         disk->head_position = 0;
> >                         mddev->degraded++;
> >                 }
> >
> >
> > NeilBrown
> >
>
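
For reference, the remove / zero / re-add cycle described in the PS at
the top of this message can be sketched as a small shell script. This is
not the script referred to there, only a minimal illustration under
assumptions: the function names, the MD and DRY_RUN variables and the
dispatch keywords are all hypothetical; only the mdadm options used
(--fail, --remove, --zero-superblock, --add) are real. The disk
re-read/rescan step is left out, since the message does not say how it
was done. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: cycle one member device out of and back into an md array.
# MD and DRY_RUN are hypothetical knobs for this example, not mdadm features.
MD="${MD:-/dev/md0}"
DRY_RUN="${DRY_RUN:-1}"   # default: print commands only; DRY_RUN=0 executes them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

md_remove() {
    # Fail and remove as two separate invocations, because the combined
    # "--fail ... --remove ..." form returned "Device or resource busy".
    run mdadm "$MD" --fail "$1" || return 1
    run mdadm "$MD" --remove "$1"
}

md_zero() {
    # Wipe the md superblock so the device carries no stale member state.
    run mdadm --zero-superblock "$1"
}

md_add() {
    # Add the device back to the array.
    run mdadm "$MD" --add "$1"
}

md_cycle() {
    # Full sequence; stops at the first step that fails.
    md_remove "$1" && md_zero "$1" && md_add "$1"
}

# Dispatch only when called with arguments: <remove|zero|add|cycle> <device>
if [ "$#" -ge 2 ]; then
    case "$1" in
        remove) md_remove "$2" ;;
        zero)   md_zero "$2" ;;
        add)    md_add "$2" ;;
        cycle)  md_cycle "$2" ;;
        *)      echo "usage: $0 {remove|zero|add|cycle} <device>" >&2; exit 2 ;;
    esac
fi
```

Saved under a name of your choosing and run as, e.g.,
'DRY_RUN=0 sh md-cycle.sh cycle /dev/dm-0' (the file name is made up),
it would zero the superblock before the add, which is the step that
made the device come back as "spare" rather than "faulty spare" here.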