Subject: Re: mdadm: Assemble.c: "force-one" update conflicts with the split-brain protection logic
From: Alexander Lyakas @ 2012-08-28 7:45 UTC
To: linux-raid, NeilBrown
Hi Neil,
Yet another issue I see with the "force-one" update is that it does
not increment the event count in the bitmap superblock of the device
it promotes.
Here is a scenario that I hit:
# raid5 with 4 drives: A,B,C,D
# drive A fails, then drive B fails
# force-assembly is performed
# drive B has a higher event count than A, so it is selected for the
"force-one" update. However, "force-one" does not update the bitmap
event counter. As a result, the following happens:
# array is started in the kernel
# bitmap_read_sb() is called and calls read_sb_page()
# read_sb_page() loops through the devices and picks the first one
that is In_sync. In our case this is drive B, so the bitmap
superblock is read from drive B. That superblock still carries the
stale event count that "force-one" never updated, so the bitmap is
considered stale and marked BITMAP_STALE.
# Because of BITMAP_STALE, bitmap->events_cleared is set to
mddev->events (and the bitmap is set to all 1's)
# Later, when drive A is re-added, its event count is lower than
events_cleared, because events_cleared has been bumped up, so drive A
is rejected by re-add (see the toy model below).
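
To make the sequence concrete, here is a small self-contained toy
model of it. This is not mdadm or kernel code; the device names and
event counts are invented, and only read_sb_page(), BITMAP_STALE and
events_cleared refer to the real names mentioned above:

/*
 * Toy model of the scenario above -- NOT mdadm or kernel code.
 * It only shows how leaving the bitmap event count behind during
 * "force-one" leads to BITMAP_STALE and a rejected re-add.
 */
#include <stdio.h>
#include <stdbool.h>

struct dev {
        const char *name;
        unsigned long long sb_events;     /* main superblock events */
        unsigned long long bitmap_events; /* bitmap superblock events */
        bool in_sync;
};

int main(void)
{
        /* event counts are invented; only their ordering matters */
        struct dev A = { "A", 100, 100, false }; /* failed first  */
        struct dev B = { "B", 110, 110, false }; /* failed second */
        struct dev C = { "C", 120, 120, true  };
        struct dev D = { "D", 120, 120, true  };

        /* "force-one": B's main superblock is bumped to match C/D,
         * but its bitmap superblock is left untouched */
        B.sb_events = C.sb_events;
        B.in_sync = true;
        unsigned long long array_events = C.sb_events;

        /* kernel side (as described above): the bitmap superblock is
         * taken from the first In_sync device -- here drive B */
        struct dev *devs[] = { &B, &C, &D };
        struct dev *first = NULL;
        for (int i = 0; i < 3 && !first; i++)
                if (devs[i]->in_sync)
                        first = devs[i];

        bool stale = first->bitmap_events < array_events;
        unsigned long long events_cleared =
                stale ? array_events : first->bitmap_events;
        printf("bitmap read from %s, stale=%d, events_cleared=%llu\n",
               first->name, stale, events_cleared);

        /* re-add of A: rejected because its events lag events_cleared */
        if (A.sb_events < events_cleared)
                printf("re-add of %s rejected (%llu < %llu)\n",
                       A.name, A.sb_events, events_cleared);
        return 0;
}
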
The workaround in this case is to wipe the superblock on A and add it
back as a fresh drive.
Thanks,
Alex.
On Wed, Aug 22, 2012 at 8:50 PM, Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> Hi Neil,
> I see the following issue:
>
> # I have a raid5 with drives a,b,c,d. Drive a fails, and then drive b
> fails, and so the whole array fails.
> # Superblocks of c and d show a and b as failed (via 0xfffe in the
> dev_roles[] array).
> # Now I perform --assemble --force
> # Since b has a higher event count than a, b's event count is bumped to
> match the event count of c and d ("force-one")
> # However, something goes wrong and assembly is aborted
> # Now assembly is restarted (--force doesn't matter now)
>
> At this point, drive b is chosen as "most_recent", since it comes
> first and has the highest event count (equal to that of c and d).
> However, when drives c and d are inspected, they are rejected by the
> following split-brain protection code:
>         if (j != most_recent &&
>             content->array.raid_disks > 0 &&
>             devices[most_recent].i.disk.raid_disk >= 0 &&
>             devmap[j * content->array.raid_disks +
>                    devices[most_recent].i.disk.raid_disk] == 0) {
>                 if (c->verbose > -1)
>                         pr_err("ignoring %s as it reports %s as failed\n",
>                                devices[j].devname, devices[most_recent].devname);
>                 best[i] = -1;
>                 continue;
>         }
>
> because the dev_roles[] arrays of c and d show b as failed (because b
> really had failed while c and d were operational).
>
> So I was thinking that the "force-one" update should also somehow
> align the dev_roles[] arrays of all devices that it affects. More
> precisely, if we decide to promote a device via the "force-one" path,
> we must update dev_roles[] on all "good" devices to say that the
> promoted device is not 0xfffe but has a valid role. Does this make
> sense? What do you think?
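
A minimal self-contained toy model of that alignment, just to make the
idea concrete. This is not actual Assemble.c code; the role tables and
the fix-up loop below are invented for illustration:

/*
 * Toy model of the dev_roles[] alignment proposed above -- NOT mdadm
 * code.  It only shows why the split-brain check stops rejecting c and
 * d once their recorded role for the promoted device is fixed up.
 */
#include <stdio.h>

#define FAULTY 0xfffe               /* dev_roles[] value for a failed slot */
#define NDEVS  4                    /* a=0, b=1, c=2, d=3 */

int main(void)
{
        /* roles[i][j]: role of device j as recorded in device i's
         * superblock.  c and d saw both a and b fail, so they record
         * them as FAULTY; b still holds its view from before it failed. */
        int roles[NDEVS][NDEVS] = {
                { 0, 1, 2, 3 },           /* a (stale, not used here) */
                { 0, 1, 2, 3 },           /* b                        */
                { FAULTY, FAULTY, 2, 3 }, /* c                        */
                { FAULTY, FAULTY, 2, 3 }, /* d                        */
        };
        int promoted = 1;                 /* b is promoted by "force-one" */
        int most_recent = promoted;       /* b now has the highest events */

        /* proposed alignment: every "good" device learns that the
         * promoted device has a valid role again */
        for (int i = 2; i < NDEVS; i++)   /* c and d */
                roles[i][promoted] = promoted;

        /* split-brain-style check: a device is rejected if its own
         * superblock records most_recent as failed */
        for (int j = 2; j < NDEVS; j++) {
                if (roles[j][most_recent] == FAULTY)
                        printf("ignoring dev %d as it reports dev %d as failed\n",
                               j, most_recent);
                else
                        printf("dev %d is accepted\n", j);
        }
        return 0;
}

Commenting out the alignment loop reproduces the "ignoring ... as it
reports ... as failed" rejection from the snippet above.
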
>
> And I also think that the split-brain protection logic you added
> should be made a bit more explicit. Currently, the first device with
> the highest event count is selected as "most_recent", and split-brain
> protection is enforced with respect to that device, so the outcome
> can depend on the order of the devices passed to "assemble". I
> already pitched a proposal for dealing with this in the past. Do you
> want me to go over it and pitch it again?
>
> Thanks!
> Alex.