* Wrong array assembly on boot?
From: Dark Penguin @ 2017-07-22 18:39 UTC
To: linux-raid

Greetings!

I have a mirror RAID with two devices (sdc1 and sde1). It's not a root
partition, just a RAID with some data for services running on this
server. (I'm running Debian Jessie x86_64 with a 4.1.18 kernel.) The
RAID is listed in /etc/mdadm, and it has an external bitmap in /RAID .

One of the devices in the RAID (sdc1) "fell off" - it disappeared from
the system for some reason. Well, I thought, I have to reboot to get the
drive back, and then re-add it. That's what I did. After the reboot, I
saw a degraded array with one drive missing, so I found out which one
and re-added it.

Later, I noticed that I was missing some data, and thinking about the
situation led me to understand what had happened. After the reboot, the
system tried to assemble my arrays; it found sdc1 first (the one that
had disappeared), assembled a degraded array with only that drive, and
started it. When I re-added the second drive, I overwrote everything
that had been written between those events.

Now I'm trying to understand why this happened and what I am supposed to
do in this situation to handle it properly. So now I have a lot of
questions boiling down to "how should booting with degraded arrays be
handled?"

- Why did mdadm not notice that the second drive is "newer"? I thought
  there were timestamps in the devices and even in the bitmap!..

- Why did it START this array?! I thought if a degraded array is found
  at boot, it's supposed to be assembled but not started?.. At least I
  think that's how it used to be in Wheezy (before systemd?).

- Googling revealed that if a degraded array is detected, the system
  should stop booting and ask for a confirmation in the console. (Only
  for root partitions? And only before systemd?..)

- My services are not going to be happy either way. If the array is
  assembled but not run, they will have data missing. If the array is
  assembled and run, it's even worse - they will start with outdated
  data! How is this even supposed to be handled?.. Should I add
  dependencies on mounting a specific mountpoint in each service
  definition?..

Am I wrong in thinking that mdadm should have detected that the second
drive is "newer" and assembled the array just as it was before, thus
avoiding all those problems easily?.. Especially considering that the
array on the "new" drive already consists of only one drive, which is
"not as degraded" and would be fine to run, compared to the array on
the "old" drive, which was not stopped properly and only now learns
about one of the drives missing? Maybe this behaviour has already been
changed in newer versions?..

--
darkpenguin
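(For the "which drive is newer?" question above: the v1.2 superblock on
each member carries an event counter and an update time that md compares
during assembly. A minimal way to inspect them - a sketch using the
device names from this report; the grep pattern and exact field layout
may differ between mdadm versions:

$ sudo mdadm --examine /dev/sdc1 | grep -E 'Events|Update Time'
$ sudo mdadm --examine /dev/sde1 | grep -E 'Events|Update Time'

The member that kept running while the other was missing should show the
higher Events count and the later Update Time.)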
* Re: Wrong array assembly on boot?
From: Wols Lists @ 2017-07-24 14:48 UTC
To: Dark Penguin, linux-raid

On 22/07/17 19:39, Dark Penguin wrote:
> Greetings!
>
> I have a mirror RAID with two devices (sdc1 and sde1). It's not a root
> partition, just a RAID with some data for services running on this
> server. (I'm running Debian Jessie x86_64 with a 4.1.18 kernel.) The
> RAID is listed in /etc/mdadm, and it has an external bitmap in /RAID .

As an absolute minimum, can you please give us your version of mdadm.
And the output of "mdadm --display" of your arrays. (I think I've got
that right, I think --examine is the disk ...)

Cheers,
Wol
* Re: Wrong array assembly on boot?
From: Dark Penguin @ 2017-07-24 15:27 UTC
To: Wols Lists, linux-raid

On 24/07/17 17:48, Wols Lists wrote:
> On 22/07/17 19:39, Dark Penguin wrote:
>> Greetings!
>>
>> I have a mirror RAID with two devices (sdc1 and sde1). It's not a root
>> partition, just a RAID with some data for services running on this
>> server. (I'm running Debian Jessie x86_64 with a 4.1.18 kernel.) The
>> RAID is listed in /etc/mdadm, and it has an external bitmap in /RAID .
>
> As an absolute minimum, can you please give us your version of mdadm.

Oh, right, sorry. I thought the "absolute minimum" would be the kernel
version and the distribution. :)

mdadm - v3.3.2 - 21st August 2014

> And the output of "mdadm --display" of your arrays. (I think I've got
> that right, I think --examine is the disk ...)

It's "mdadm --detail --scan" for all arrays or "mdadm --detail /dev/md0"
for md0. I have 8 arrays on this server, and the only one that's
relevant is this one. (The rest of them are set up exactly the same way,
but with different names and UUIDs.) So, to avoid clutter:

$ sudo mdadm --detail /dev/md/RAID
/dev/md/RAID:
        Version : 1.2
  Creation Time : Thu Oct  6 23:15:56 2016
     Raid Level : raid1
     Array Size : 244066432 (232.76 GiB 249.92 GB)
  Used Dev Size : 244066432 (232.76 GiB 249.92 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : /RAID

    Update Time : Mon Jul 24 17:59:53 2017
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : BAAL:RAID  (local to host BAAL)
           UUID : 8b5f18f0:54f655b7:8bfcc60d:4db6e6c8
         Events : 5000

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       33        1      active sync writemostly   /dev/sdc1

And the /etc/mdadm/mdadm.conf entry is:

ARRAY /dev/md/RAID metadata=1.2 name=BAAL:RAID bitmap=/RAID UUID=8b5f18f0:54f655b7:8bfcc60d:4db6e6c8

I don't use the device names here because they change often in a server
with 8 arrays and 20 drives (sometimes I connect a new one or remove an
old one...). The UUID is here, the bitmap file is here, so it just looks
for all drives with this UUID and assembles the array.

As I understand it, it found the first device (/dev/sdc1, which was
outdated) and immediately added it to the array. Then it found the
second device (/dev/sde1, the up-to-date one), noticed an inconsistency
and did not add it. The question is: why did it start the array, why did
it not halt the boot process, and why did it not realize that the second
device is newer (when it even already knows about the disappearance of
the first one!)...

--
darkpenguin
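(As a cross-check of that UUID-based discovery, a verbose examine scan
lists which partitions currently carry each array UUID - a sketch only;
output formatting varies by mdadm version:

$ sudo mdadm --examine --scan --verbose

Each ARRAY line in that output should include a devices=... list, which
makes it easy to see which partitions carry
UUID 8b5f18f0:54f655b7:8bfcc60d:4db6e6c8 before assembly.)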
* Re: Wrong array assembly on boot?
From: Wols Lists @ 2017-07-24 19:36 UTC
To: Dark Penguin, linux-raid

On 24/07/17 16:27, Dark Penguin wrote:
> On 24/07/17 17:48, Wols Lists wrote:
>> On 22/07/17 19:39, Dark Penguin wrote:
>>> Greetings!
>>> [...]
>>
>> As an absolute minimum, can you please give us your version of mdadm.
>
> Oh, right, sorry. I thought the "absolute minimum" would be the kernel
> version and the distribution. :)
>
> mdadm - v3.3.2 - 21st August 2014
>
I was afraid it might be that ...

You've hit a known bug in mdadm. It doesn't always successfully assemble
a mirror. I had exactly that problem - I created one mirror and when I
rebooted I had two ...

Can't offer any advice about how to fix your damaged mirror, but you
need to upgrade mdadm! That's two minor versions out of date - 3.4 and 4.0.

Cheers,
Wol
* Re: Wrong array assembly on boot?
From: Dark Penguin @ 2017-07-24 19:58 UTC
To: Wols Lists, linux-raid

On 24/07/17 22:36, Wols Lists wrote:
> On 24/07/17 16:27, Dark Penguin wrote:
>> On 24/07/17 17:48, Wols Lists wrote:
>>> On 22/07/17 19:39, Dark Penguin wrote:
>>>> Greetings!
>>>> [...]
>>>
>>> As an absolute minimum, can you please give us your version of mdadm.
>>
>> Oh, right, sorry. I thought the "absolute minimum" would be the kernel
>> version and the distribution. :)
>>
>> mdadm - v3.3.2 - 21st August 2014
>>
> I was afraid it might be that ...
>
> You've hit a known bug in mdadm. It doesn't always successfully assemble
> a mirror. I had exactly that problem - I created one mirror and when I
> rebooted I had two ...
>
> Can't offer any advice about how to fix your damaged mirror, but you
> need to upgrade mdadm! That's two minor versions out of date - 3.4 and 4.0.
>
> Cheers,
> Wol

My mirror is not damaged anymore - it's quite healthy and cleanly
missing some information I've overwritten. :) Of course, there's no way
to help that now - that's what backups are for. I just wanted to learn
how to avoid this situation in the future, and to learn how such things
are really supposed to be handled.

Is this bug fixed in the newer mdadm? Or is it "known, but not fixed yet"?

--
darkpenguin
* Re: Wrong array assembly on boot?
From: Wols Lists @ 2017-07-24 20:20 UTC
To: Dark Penguin, linux-raid

On 24/07/17 20:58, Dark Penguin wrote:
> On 24/07/17 22:36, Wols Lists wrote:
>> On 24/07/17 16:27, Dark Penguin wrote:
>>> On 24/07/17 17:48, Wols Lists wrote:
>>>> On 22/07/17 19:39, Dark Penguin wrote:
>>>>> Greetings!
>>>>> [...]
>>>>
>>>> As an absolute minimum, can you please give us your version of mdadm.
>>>
>>> Oh, right, sorry. I thought the "absolute minimum" would be the kernel
>>> version and the distribution. :)
>>>
>>> mdadm - v3.3.2 - 21st August 2014
>>>
>> I was afraid it might be that ...
>>
>> You've hit a known bug in mdadm. It doesn't always successfully assemble
>> a mirror. I had exactly that problem - I created one mirror and when I
>> rebooted I had two ...
>>
>> Can't offer any advice about how to fix your damaged mirror, but you
>> need to upgrade mdadm! That's two minor versions out of date - 3.4 and 4.0.
>>
>> Cheers,
>> Wol
>
> My mirror is not damaged anymore - it's quite healthy and cleanly
> missing some information I've overwritten. :) Of course, there's no way
> to help that now - that's what backups are for. I just wanted to learn
> how to avoid this situation in the future, and to learn how such things
> are really supposed to be handled.
>
> Is this bug fixed in the newer mdadm? Or is it "known, but not fixed yet"?
>
Long fixed :-)

Cheers,
Wol
* Re: Wrong array assembly on boot?
From: Dark Penguin @ 2017-12-16 12:40 UTC
To: Wols Lists, linux-raid

On 24/07/17 23:20, Wols Lists wrote:
> On 24/07/17 20:58, Dark Penguin wrote:
>> On 24/07/17 22:36, Wols Lists wrote:
>>> On 24/07/17 16:27, Dark Penguin wrote:
>>>> [...]
>>>> mdadm - v3.3.2 - 21st August 2014
>>>>
>>> I was afraid it might be that ...
>>>
>>> You've hit a known bug in mdadm. It doesn't always successfully assemble
>>> a mirror. I had exactly that problem - I created one mirror and when I
>>> rebooted I had two ...

I think this is not the same problem (see below).

>>> Can't offer any advice about how to fix your damaged mirror, but you
>>> need to upgrade mdadm! That's two minor versions out of date - 3.4 and 4.0.

It's 3.4-4 in Ubuntu 17.10 and 3.4-4 in Debian Stretch, so I assume 4.0
must be "not there yet"...

>> My mirror is not damaged anymore - it's quite healthy and cleanly
>> missing some information I've overwritten. :) Of course, there's no way
>> to help that now - that's what backups are for. I just wanted to learn
>> how to avoid this situation in the future, and to learn how such things
>> are really supposed to be handled.
>>
>> Is this bug fixed in the newer mdadm? Or is it "known, but not fixed yet"?
>>
> Long fixed :-)

No, this is still not fixed in Ubuntu Artful (17.10) with mdadm v3.4-4 .

My problem is the following (tested just now on Ubuntu 17.10):

- I create a RAID1 on two devices: /dev/sda1 and /dev/sdb1 (writemostly)
- I use it
- I pull /dev/sda1 out (bad cable, exactly the same situation as I had)
- I continue using the degraded array:

$ sudo mdadm --detail /dev/md0
/dev/md0:
<...>
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       17        1      active sync writemostly   /dev/sdb1

- I shut down the machine and replace the cable, then boot it up again
- I see the following:

mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed
mdadm: /dev/md/0 has been started with 1 drive (out of 2).
mdadm: Found some drive for an array that is already active: /dev/md/0
mdadm: giving up.

$ sudo mdadm --detail /dev/md0
/dev/md0:
<...>
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       -       0        0        1      removed

So, when assembling the arrays, mdadm sees two devices:
- one that fell off and reports a clean array
- one that knows that the first one fell off and reports it as faulty

And it decides to use the one that obviously fell off, which it knows
about from the second device.

Seriously? Is there a reason for this chosen behaviour, "ignoring the
device that knows about problems"? It seems obviously wrong, but they
know about it and even put in a message to explain what's going on!
There must be a reason that makes this "the lesser evil", but I can't
imagine that situation.

--
darkpenguin
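(The asymmetry described above can be seen directly in the members'
superblocks: the member that dropped out still believes both slots are
active, while the surviving member has marked slot 0 as missing. A
sketch of how to check this in the test scenario above - the Array State
line is from v1.2-metadata --examine output, and the sample lines shown
are illustrative; exact wording varies between mdadm versions:

$ sudo mdadm --examine /dev/sda1 | grep 'Array State'
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
$ sudo mdadm --examine /dev/sdb1 | grep 'Array State'
   Array State : .A ('A' == active, '.' == missing, 'R' == replacing)

The "ignoring /dev/sdb1 as it reports /dev/sda1 as failed" message
corresponds to that '.' in sdb1's view of slot 0.)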
* Re: Wrong array assembly on boot?
From: Wol's lists @ 2017-12-16 20:27 UTC
To: Dark Penguin, linux-raid

On 16/12/17 12:40, Dark Penguin wrote:
> On 24/07/17 23:20, Wols Lists wrote:
>> On 24/07/17 20:58, Dark Penguin wrote:
>>> On 24/07/17 22:36, Wols Lists wrote:
>>>> [...]
>>>> Can't offer any advice about how to fix your damaged mirror, but you
>>>> need to upgrade mdadm! That's two minor versions out of date - 3.4 and 4.0.
>
> It's 3.4-4 in Ubuntu 17.10 and 3.4-4 in Debian Stretch, so I assume 4.0
> must be "not there yet"...
>
https://raid.wiki.kernel.org/index.php/Linux_Raid#Help_wanted

mdadm 4.0 is nearly a year old ...

>>> [...]
>>> Is this bug fixed in the newer mdadm? Or is it "known, but not fixed yet"?
>>>
>> Long fixed :-)
>
> No, this is still not fixed in Ubuntu Artful (17.10) with mdadm v3.4-4 .
>
> My problem is the following (tested just now on Ubuntu 17.10):
>
> - I create a RAID1 on two devices: /dev/sda1 and /dev/sdb1 (writemostly)
> - I use it
> - I pull /dev/sda1 out (bad cable, exactly the same situation as I had)
> - I continue using the degraded array:
>
> $ sudo mdadm --detail /dev/md0
> /dev/md0:
> <...>
>     Number   Major   Minor   RaidDevice State
>        -       0        0        0      removed
>        1       8       17        1      active sync writemostly   /dev/sdb1
>
> - I shut down the machine and replace the cable, then boot it up again
> - I see the following:
>
> mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed
> mdadm: /dev/md/0 has been started with 1 drive (out of 2).
> mdadm: Found some drive for an array that is already active: /dev/md/0
> mdadm: giving up.
>
> $ sudo mdadm --detail /dev/md0
> /dev/md0:
> <...>
>     Number   Major   Minor   RaidDevice State
>        0       8        1        0      active sync   /dev/sda1
>        -       0        0        1      removed
>
> So, when assembling the arrays, mdadm sees two devices:
> - one that fell off and reports a clean array
> - one that knows that the first one fell off and reports it as faulty
>
> And it decides to use the one that obviously fell off, which it knows
> about from the second device.

Except that it does NOT know about the second device !!! (At least, not
to start with.)

> Seriously? Is there a reason for this chosen behaviour, "ignoring the
> device that knows about problems"? It seems obviously wrong, but they
> know about it and even put in a message to explain what's going on!
> There must be a reason that makes this "the lesser evil", but I can't
> imagine that situation.
>
Read the mdadm manual, especially about booting and "mdadm --assemble
--incremental".

udev detects sda, and passes it to mdadm, which starts building the array.

udev then detects sdb, and passes it to mdadm, WHICH HITS A BUG IN 3.4
AND MESSES UP THE ASSEMBLY.

Standard advice for fixing any problems is always "upgrade to the latest
version and see if you can reproduce the problem". I don't remember
which version(s) of mdadm had this bug, but I know there were a LOT of
fixes like this that went into v4.

Cheers,
Wol
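(For reference, the boot-time flow described here amounts to udev
calling incremental assembly once per member as it appears - roughly the
following sequence, sketched with the device names from the test above
rather than the literal Debian/Ubuntu udev rule:

mdadm --incremental /dev/sda1   # first member seen: the array may be started degraded from it
mdadm --incremental /dev/sdb1   # second member seen: rejected, because /dev/md0 is already active

so whichever member udev happens to report first can win, unless mdadm
correctly compares the two superblocks before starting the array.)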
* Re: Wrong array assembly on boot?
From: Dark Penguin @ 2017-12-17 11:38 UTC
To: Wol's lists, linux-raid

On 16/12/17 23:27, Wol's lists wrote:
> On 16/12/17 12:40, Dark Penguin wrote:
>> [...]
>> So, when assembling the arrays, mdadm sees two devices:
>> - one that fell off and reports a clean array
>> - one that knows that the first one fell off and reports it as faulty
>>
>> And it decides to use the one that obviously fell off, which it knows
>> about from the second device.
>
> Except that it does NOT know about the second device !!! (At least, not
> to start with.)
>
>> Seriously? Is there a reason for this chosen behaviour, "ignoring the
>> device that knows about problems"? It seems obviously wrong, but they
>> know about it and even put in a message to explain what's going on!
>> There must be a reason that makes this "the lesser evil", but I can't
>> imagine that situation.
>>
> Read the mdadm manual, especially about booting and "mdadm --assemble
> --incremental".
>
> udev detects sda, and passes it to mdadm, which starts building the array.
>
> udev then detects sdb, and passes it to mdadm, WHICH HITS A BUG IN 3.4
> AND MESSES UP THE ASSEMBLY.
>
> Standard advice for fixing any problems is always "upgrade to the latest
> version and see if you can reproduce the problem". I don't remember
> which version(s) of mdadm had this bug, but I know there were a LOT of
> fixes like this that went into v4.
>
> Cheers,
> Wol

I was wrong - I was actually testing it on Ubuntu 16.10, which has mdadm
3.4-4 (I assumed "long fixed" meant more than a year ago). Now I have
tried it on 17.10, which of course has mdadm 4.0-2. And the problem is
still there. But I gathered more data this time, including situations in
which the problem goes away.

To reproduce the problem:

- Boot into an Ubuntu 17.10 LiveCD and install mdadm (4.0-2 in the repos).
- Create a RAID1 array from two drives and wait for the rebuild.
  * The first one MUST be earlier in alphabetical order, i.e. sda1 and
    sdb1, NOT sdb1 and sda1 !
  * The second device (sdb1) MUST be write-mostly!
- Create a filesystem, mount the array and put something on it.
- Disconnect the SATA cable on the FIRST device (NOT the write-mostly one).
- Put more data on the array (to easily see if it's there later).
- Shut down the machine, connect the cable, boot back up, install mdadm.
- Do mdadm --assemble --scan:

$ sudo mdadm --assemble --scan
mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed
mdadm: /dev/md/0 has been started with 1 drive (out of 2).

You can confirm that your "new" data is NOT on the array. WHAT'S MORE,
now do:

$ mdadm --add /dev/md0 --write-mostly /dev/sdb1
mdadm: *re-added* /dev/sdb1

"Re-added"?! But there is no write-intent bitmap!..

Experimenting with different situations yields more results. For
example, I had a situation where mdadm automatically re-added the device
to the array, so after a reboot I got a "clean" array (I don't remember
whether it was assembled correctly or wrongly). If the second device is
not write-mostly, the problem goes away. If you disconnect the second
device and not the first one, the problem goes away.

What I don't understand is the logic in ignoring the device that reports
others as faulty. In what situation could it possibly be sane to ignore
it instead of using it and ignoring all the others? On the other hand, I
don't see this message when the "faulty" drive happens to be the second
one - the array just assembles without any errors, with the device that
reports others as faulty.

--
darkpenguin
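(For anyone who ends up with the array started from the stale member as
described above, one possible manual recovery path is to stop it and
assemble it again from the member with the newer superblock before
re-adding the stale one. This is an outline only, using the device names
from the test above; compare the Events counts with --examine first and
make sure backups exist:

$ sudo mdadm --stop /dev/md0
$ sudo mdadm --assemble --run /dev/md0 /dev/sdb1   # start from the up-to-date member only
$ sudo mdadm --manage /dev/md0 --add /dev/sda1     # then let the stale member resync from it

Resyncing the stale member from the running array, rather than the other
way around, is what avoids the silent overwrite described earlier in the
thread.)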