* [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 @ 2009-11-18 12:09 Mikulas Patocka 2009-11-18 12:10 ` [PATCH 1/7] Explicitly initialize bio lists Mikulas Patocka 2009-11-30 16:46 ` [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 Takahiro Yasui 0 siblings, 2 replies; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:09 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel Hi Here is the serie of 7 patches to hold write bios on dm-raid1 until dmeventd does its job. It fixes bug https://bugzilla.redhat.com/show_bug.cgi?id=502927 . The first 6 patches are preparatory, they just move the code around, the last patch does the fix. I tested the thing, I managed to reproduce the bug (by manually stopping dmeventd with STOP signal, failing primary mirror leg and writing to the device) and I also verified that the patches fix the bug. For non-dmeventd operation, the current behavior is wrong and I just keep it as wrong as it was. There is no easy fix. It is just assume that if the user doesn't use dmeventd, he can't activate failed disks again. Mikulas ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 1/7] Explicitly initialize bio lists 2009-11-18 12:09 [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 Mikulas Patocka @ 2009-11-18 12:10 ` Mikulas Patocka 2009-11-18 12:11 ` [PATCH 2/7] A framework for holding bios until suspend Mikulas Patocka 2009-11-30 16:46 ` [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 Takahiro Yasui 1 sibling, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:10 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel Explicitly initialize bio lists. Currently, zeroing the list with kzalloc already does the initialization, bit if we ever switch to another bio list implementation, it could be a problem. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.31-fast-new/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.31-fast-new.orig/drivers/md/dm-raid1.c 2009-10-06 18:15:09.000000000 +0200 +++ linux-2.6.31-fast-new/drivers/md/dm-raid1.c 2009-10-06 18:15:10.000000000 +0200 @@ -791,6 +791,9 @@ static struct mirror_set *alloc_context( } spin_lock_init(&ms->lock); + bio_list_init(&ms->reads); + bio_list_init(&ms->writes); + bio_list_init(&ms->failures); ms->ti = ti; ms->nr_mirrors = nr_mirrors; ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2/7] A framework for holding bios until suspend 2009-11-18 12:10 ` [PATCH 1/7] Explicitly initialize bio lists Mikulas Patocka @ 2009-11-18 12:11 ` Mikulas Patocka 2009-11-18 12:11 ` [PATCH 3/7] Use the hold framework in do_failures Mikulas Patocka 2009-11-28 18:02 ` [PATCH 2/7] A framework for holding bios until suspend Takahiro Yasui 0 siblings, 2 replies; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:11 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel A framework for holding bios until suspend and then resubmitting them with either DM_ENDIO_REQUEUE (if the suspend was noflush) or terminating them with -EIO. The bio can be held with the function hold_bio. So far it is unused, it will be used in later patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) Index: linux-2.6.31-fast-new/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.31-fast-new.orig/drivers/md/dm-raid1.c 2009-10-06 18:15:10.000000000 +0200 +++ linux-2.6.31-fast-new/drivers/md/dm-raid1.c 2009-10-06 18:15:14.000000000 +0200 @@ -57,6 +57,7 @@ struct mirror_set { struct bio_list reads; struct bio_list writes; struct bio_list failures; + struct bio_list hold; /* bios are waiting until suspend */ struct dm_region_hash *rh; struct dm_kcopyd_client *kcopyd_client; @@ -415,6 +416,20 @@ static void map_region(struct dm_io_regi io->count = bio->bi_size >> 9; } +static void hold_bio(struct mirror_set *ms, struct bio *bio) +{ + if (atomic_read(&ms->suspend)) { + if (dm_noflush_suspending(ms->ti)) + bio_endio(bio, DM_ENDIO_REQUEUE); + else + bio_endio(bio, -EIO); + } else { + spin_lock_irq(&ms->lock); + bio_list_add(&ms->hold, bio); + spin_unlock_irq(&ms->lock); + } +} + /*----------------------------------------------------------------- * Reads *---------------------------------------------------------------*/ @@ -760,6 +775,7 @@ static void do_mirror(struct work_struct bio_list_init(&ms->reads); bio_list_init(&ms->writes); bio_list_init(&ms->failures); + bio_list_init(&ms->hold); spin_unlock_irqrestore(&ms->lock, flags); dm_rh_update_states(ms->rh, errors_handled(ms)); @@ -1192,6 +1208,9 @@ static void mirror_presuspend(struct dm_ struct mirror_set *ms = (struct mirror_set *) ti->private; struct dm_dirty_log *log = dm_rh_dirty_log(ms->rh); + struct bio_list hold; + struct bio *bio; + atomic_set(&ms->suspend, 1); /* @@ -1214,6 +1233,23 @@ static void mirror_presuspend(struct dm_ * we know that all of our I/O has been pushed. */ flush_workqueue(ms->kmirrord_wq); + + /* + * Once we set ms->suspend and flush the workqueue, + * no more entries are being added to ms->hold list. + * Process the list now. + * + * (note that bios can still arive concurrently with + * or after presuspend, but they are never put to + * hold list because of ms->suspend != 0). + */ + spin_lock_irq(&ms->lock); + hold = ms->hold; + bio_list_init(&ms->hold); + spin_unlock_irq(&ms->lock); + + while ((bio = bio_list_pop(&hold))) + hold_bio(ms, bio); } static void mirror_postsuspend(struct dm_target *ti) ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 3/7] Use the hold framework in do_failures 2009-11-18 12:11 ` [PATCH 2/7] A framework for holding bios until suspend Mikulas Patocka @ 2009-11-18 12:11 ` Mikulas Patocka 2009-11-18 12:12 ` [PATCH 4/7] Don't optimize for failure case Mikulas Patocka 2009-11-28 18:02 ` [PATCH 2/7] A framework for holding bios until suspend Takahiro Yasui 1 sibling, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:11 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel Use the hold framework in do_failures. This patch doesn't change any logic in processing bios, it just simplifies failure handling and avoids periodically polling the failures list. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 34 +++++++++------------------------- 1 file changed, 9 insertions(+), 25 deletions(-) Index: linux-2.6.31-fast-new/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.31-fast-new.orig/drivers/md/dm-raid1.c 2009-10-06 18:15:14.000000000 +0200 +++ linux-2.6.31-fast-new/drivers/md/dm-raid1.c 2009-10-06 18:15:17.000000000 +0200 @@ -703,20 +703,12 @@ static void do_failures(struct mirror_se { struct bio *bio; - if (!failures->head) + if (likely(!failures->head)) return; - if (!ms->log_failure) { - while ((bio = bio_list_pop(failures))) { - ms->in_sync = 0; - dm_rh_mark_nosync(ms->rh, bio, bio->bi_size, 0); - } - return; - } - /* * If the log has failed, unattempted writes are being - * put on the failures list. We can't issue those writes + * put on the hold list. We can't issue those writes * until a log has been marked, so we must store them. * * If a 'noflush' suspend is in progress, we can requeue @@ -731,23 +723,15 @@ static void do_failures(struct mirror_se * for us to treat them the same and requeue them * as well. */ - if (dm_noflush_suspending(ms->ti)) { - while ((bio = bio_list_pop(failures))) - bio_endio(bio, DM_ENDIO_REQUEUE); - return; - } - if (atomic_read(&ms->suspend)) { - while ((bio = bio_list_pop(failures))) - bio_endio(bio, -EIO); - return; + while ((bio = bio_list_pop(failures))) { + if (!ms->log_failure) { + ms->in_sync = 0; + dm_rh_mark_nosync(ms->rh, bio, bio->bi_size, 0); + } else { + hold_bio(ms, bio); + } } - - spin_lock_irq(&ms->lock); - bio_list_merge(&ms->failures, failures); - spin_unlock_irq(&ms->lock); - - delayed_wake(ms); } static void trigger_event(struct work_struct *work) ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 4/7] Don't optimize for failure case 2009-11-18 12:11 ` [PATCH 3/7] Use the hold framework in do_failures Mikulas Patocka @ 2009-11-18 12:12 ` Mikulas Patocka 2009-11-18 12:13 ` [PATCH 5/7] Move a logic to get a valid mirror leg to a function Mikulas Patocka 0 siblings, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:12 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel Don't optimize for failure case. should_wake is just an optimization to avoid waking the thread up when there's already something to do. Don't do this optimization in failure handling code. It is totally pointless to optimize failures. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) Index: linux-2.6.31-fast-new/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.31-fast-new.orig/drivers/md/dm-raid1.c 2009-10-06 18:15:17.000000000 +0200 +++ linux-2.6.31-fast-new/drivers/md/dm-raid1.c 2009-10-06 18:15:20.000000000 +0200 @@ -529,7 +529,6 @@ static void write_callback(unsigned long struct bio *bio = (struct bio *) context; struct mirror_set *ms; int uptodate = 0; - int should_wake = 0; unsigned long flags; ms = bio_get_m(bio)->ms; @@ -561,12 +560,9 @@ static void write_callback(unsigned long * the main thread. */ spin_lock_irqsave(&ms->lock, flags); - if (!ms->failures.head) - should_wake = 1; bio_list_add(&ms->failures, bio); spin_unlock_irqrestore(&ms->lock, flags); - if (should_wake) - wakeup_mirrord(ms); + wakeup_mirrord(ms); return; } out: ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 5/7] Move a logic to get a valid mirror leg to a function 2009-11-18 12:12 ` [PATCH 4/7] Don't optimize for failure case Mikulas Patocka @ 2009-11-18 12:13 ` Mikulas Patocka 2009-11-18 12:18 ` [PATCH 6/7] Move bio completion from dm_rh_mark_nosync to its caller Mikulas Patocka 0 siblings, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:13 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel Move a logic to get a valid mirror leg to a function. It will be reused later. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) Index: linux-2.6.31-rc3-devel/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.31-rc3-devel.orig/drivers/md/dm-raid1.c 2009-07-20 20:46:49.000000000 +0200 +++ linux-2.6.31-rc3-devel/drivers/md/dm-raid1.c 2009-07-20 20:47:44.000000000 +0200 @@ -180,6 +180,15 @@ static void set_default_mirror(struct mi atomic_set(&ms->default_mirror, m - m0); } +static struct mirror *get_valid_mirror(struct mirror_set *ms) +{ + struct mirror *m; + for (m = ms->mirror; m < ms->mirror + ms->nr_mirrors; m++) + if (!atomic_read(&m->error_count)) + return m; + return NULL; +} + /* fail_mirror * @m: mirror device to fail * @error_type: one of the enum's, DM_RAID1_*_ERROR @@ -225,13 +234,10 @@ static void fail_mirror(struct mirror *m goto out; } - for (new = ms->mirror; new < ms->mirror + ms->nr_mirrors; new++) - if (!atomic_read(&new->error_count)) { - set_default_mirror(new); - break; - } - - if (unlikely(new == ms->mirror + ms->nr_mirrors)) + new = get_valid_mirror(ms); + if (new) + set_default_mirror(new); + else DMWARN("All sides of mirror have failed."); out: ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 6/7] Move bio completion from dm_rh_mark_nosync to its caller 2009-11-18 12:13 ` [PATCH 5/7] Move a logic to get a valid mirror leg to a function Mikulas Patocka @ 2009-11-18 12:18 ` Mikulas Patocka 2009-11-18 12:19 ` [PATCH 7/7] Hold all write bios when errors are handled Mikulas Patocka 0 siblings, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:18 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel Move bio completion from dm_rh_mark_nosync to its caller. It is more understandable this way. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 3 ++- drivers/md/dm-region-hash.c | 6 +----- include/linux/dm-region-hash.h | 3 +-- 3 files changed, 4 insertions(+), 8 deletions(-) Index: linux-2.6.31.6-fast/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.31.6-fast.orig/drivers/md/dm-raid1.c 2009-11-18 10:51:37.000000000 +0100 +++ linux-2.6.31.6-fast/drivers/md/dm-raid1.c 2009-11-18 10:53:00.000000000 +0100 @@ -729,7 +729,8 @@ static void do_failures(struct mirror_se while ((bio = bio_list_pop(failures))) { if (!ms->log_failure) { ms->in_sync = 0; - dm_rh_mark_nosync(ms->rh, bio, bio->bi_size, 0); + dm_rh_mark_nosync(ms->rh, bio); + bio_endio(bio, 0); } else { hold_bio(ms, bio); } Index: linux-2.6.31.6-fast/drivers/md/dm-region-hash.c =================================================================== --- linux-2.6.31.6-fast.orig/drivers/md/dm-region-hash.c 2009-11-18 10:53:10.000000000 +0100 +++ linux-2.6.31.6-fast/drivers/md/dm-region-hash.c 2009-11-18 10:59:17.000000000 +0100 @@ -392,8 +392,6 @@ static void complete_resync_work(struct /* dm_rh_mark_nosync * @ms * @bio - * @done - * @error * * The bio was written on some mirror(s) but failed on other mirror(s). * We can successfully endio the bio but should avoid the region being @@ -401,8 +399,7 @@ static void complete_resync_work(struct * * This function is _not_ safe in interrupt context! */ -void dm_rh_mark_nosync(struct dm_region_hash *rh, - struct bio *bio, unsigned done, int error) +void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio) { unsigned long flags; struct dm_dirty_log *log = rh->log; @@ -439,7 +436,6 @@ void dm_rh_mark_nosync(struct dm_region_ BUG_ON(!list_empty(®->list)); spin_unlock_irqrestore(&rh->region_lock, flags); - bio_endio(bio, error); if (recovering) complete_resync_work(reg, 0); } Index: linux-2.6.31.6-fast/include/linux/dm-region-hash.h =================================================================== --- linux-2.6.31.6-fast.orig/include/linux/dm-region-hash.h 2009-11-18 10:53:39.000000000 +0100 +++ linux-2.6.31.6-fast/include/linux/dm-region-hash.h 2009-11-18 10:53:45.000000000 +0100 @@ -78,8 +78,7 @@ void dm_rh_dec(struct dm_region_hash *rh /* Delay bios on regions. */ void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio); -void dm_rh_mark_nosync(struct dm_region_hash *rh, - struct bio *bio, unsigned done, int error); +void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio); /* * Region recovery control. ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 7/7] Hold all write bios when errors are handled 2009-11-18 12:18 ` [PATCH 6/7] Move bio completion from dm_rh_mark_nosync to its caller Mikulas Patocka @ 2009-11-18 12:19 ` Mikulas Patocka 2009-11-23 5:58 ` malahal 0 siblings, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-18 12:19 UTC (permalink / raw) To: Alasdair G Kergon; +Cc: dm-devel Hold all write bios when errors are handled. The patch changes these things: - previously, the failures list was used only when handling errors with dmeventd. Now, it is used for all bios. Even when not using dmeventd, the regions where some writes failed must be marked as nosync. This can only be done in process context (i.e. in raid1 workqueue), not in the write_callback function. - previously, the write would succeed if writing to at least one leg succeeded. This is wrong because data from the failed leg may be replicated to the correct leg. Now, if using dmeventd, the write with some failures will fail be held until dmeventd does its job and reconfigures the array. If not using dmeventd, the write still succeeds if at least one leg succeeds, it is wrong but it is consistent with current behavior. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 54 ++++++++++++++++++++++++++------------------------ 1 file changed, 29 insertions(+), 25 deletions(-) Index: linux-2.6.31.6-fast/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.31.6-fast.orig/drivers/md/dm-raid1.c 2009-11-18 10:53:00.000000000 +0100 +++ linux-2.6.31.6-fast/drivers/md/dm-raid1.c 2009-11-18 12:40:35.000000000 +0100 @@ -534,7 +534,6 @@ static void write_callback(unsigned long unsigned i, ret = 0; struct bio *bio = (struct bio *) context; struct mirror_set *ms; - int uptodate = 0; unsigned long flags; ms = bio_get_m(bio)->ms; @@ -546,33 +545,23 @@ static void write_callback(unsigned long * This way we handle both writes to SYNC and NOSYNC * regions with the same code. */ - if (likely(!error)) - goto out; + if (likely(!error)) { + bio_endio(bio, ret); + return; + } for (i = 0; i < ms->nr_mirrors; i++) if (test_bit(i, &error)) fail_mirror(ms->mirror + i, DM_RAID1_WRITE_ERROR); - else - uptodate = 1; - if (unlikely(!uptodate)) { - DMERR("All replicated volumes dead, failing I/O"); - /* None of the writes succeeded, fail the I/O. */ - ret = -EIO; - } else if (errors_handled(ms)) { - /* - * Need to raise event. Since raising - * events can block, we need to do it in - * the main thread. - */ - spin_lock_irqsave(&ms->lock, flags); - bio_list_add(&ms->failures, bio); - spin_unlock_irqrestore(&ms->lock, flags); - wakeup_mirrord(ms); - return; - } -out: - bio_endio(bio, ret); + /* + * In either case we must mark the region as NOSYNC. + * That would block, so do it in the thread. + */ + spin_lock_irqsave(&ms->lock, flags); + bio_list_add(&ms->failures, bio); + spin_unlock_irqrestore(&ms->lock, flags); + wakeup_mirrord(ms); } static void do_write(struct mirror_set *ms, struct bio *bio) @@ -730,10 +719,25 @@ static void do_failures(struct mirror_se if (!ms->log_failure) { ms->in_sync = 0; dm_rh_mark_nosync(ms->rh, bio); + } + /* + * If all the legs are dead, fail the I/O. + * + * If we are not using dmeventd, we pretend that the I/O + * succeeded. This is wrong (the failed leg might come online + * again after reboot and it would be replicated back to + * the good leg) but it is consistent with current behavior. + * For proper behavior, dm-raid1 shouldn't be used without + * dmeventd at all. + * + * If we use dmeventd, hold the bio until dmeventd does its job. + */ + if (!get_valid_mirror(ms)) + bio_endio(bio, -EIO); + else if (!errors_handled(ms)) bio_endio(bio, 0); - } else { + else hold_bio(ms, bio); - } } } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-18 12:19 ` [PATCH 7/7] Hold all write bios when errors are handled Mikulas Patocka @ 2009-11-23 5:58 ` malahal 2009-11-23 17:54 ` Takahiro Yasui 2009-11-24 11:51 ` Mikulas Patocka 0 siblings, 2 replies; 29+ messages in thread From: malahal @ 2009-11-23 5:58 UTC (permalink / raw) To: dm-devel Mikulas Patocka [mpatocka@redhat.com] wrote: > Index: linux-2.6.31.6-fast/drivers/md/dm-raid1.c > - if (unlikely(!uptodate)) { > - DMERR("All replicated volumes dead, failing I/O"); > - /* None of the writes succeeded, fail the I/O. */ > - ret = -EIO; > - } else if (errors_handled(ms)) { > - /* > - * Need to raise event. Since raising > - * events can block, we need to do it in > - * the main thread. > - */ > - spin_lock_irqsave(&ms->lock, flags); > - bio_list_add(&ms->failures, bio); > - spin_unlock_irqrestore(&ms->lock, flags); > - wakeup_mirrord(ms); > - return; > - } > -out: > - bio_endio(bio, ret); > + /* > + * In either case we must mark the region as NOSYNC. > + * That would block, so do it in the thread. > + */ What exactly you mean by "either case' here? "errors_handled" case or not? Need to adjust the comment as we don't do that check here anymore. > @@ -730,10 +719,25 @@ static void do_failures(struct mirror_se > if (!ms->log_failure) { > ms->in_sync = 0; > dm_rh_mark_nosync(ms->rh, bio); > + } > + /* > + * If all the legs are dead, fail the I/O. > + * > + * If we are not using dmeventd, we pretend that the I/O > + * succeeded. This is wrong (the failed leg might come online > + * again after reboot and it would be replicated back to > + * the good leg) but it is consistent with current behavior. > + * For proper behavior, dm-raid1 shouldn't be used without > + * dmeventd at all. > + * > + * If we use dmeventd, hold the bio until dmeventd does its job. > + */ > + if (!get_valid_mirror(ms)) > + bio_endio(bio, -EIO); > + else if (!errors_handled(ms)) > bio_endio(bio, 0); > - } else { > + else > hold_bio(ms, bio); You seem to be holding the I/O that has failed, but what about writes that are issued after this failure. They seem to go through just fine. Did I miss something? I think do_writes() needs to check for a leg failure (just like it does for log_failure already), and put them on failure/hold list, right? Also, we do need to do the above work only if "primary" leg fails. We can continue to work just like the old code if "secondary" legs fail, right? Not sure if this is worth optimizing though, but I would like to see it implemented as it is just a few extra checks. We can have primary_failure field like log_failure field. Thanks, Malahal. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-23 5:58 ` malahal @ 2009-11-23 17:54 ` Takahiro Yasui 2009-11-24 11:51 ` Mikulas Patocka 1 sibling, 0 replies; 29+ messages in thread From: Takahiro Yasui @ 2009-11-23 17:54 UTC (permalink / raw) To: dm-devel On 11/23/09 00:58, malahal@us.ibm.com wrote: > Mikulas Patocka [mpatocka@redhat.com] wrote: >> @@ -730,10 +719,25 @@ static void do_failures(struct mirror_se >> if (!ms->log_failure) { >> ms->in_sync = 0; >> dm_rh_mark_nosync(ms->rh, bio); >> + } >> + /* >> + * If all the legs are dead, fail the I/O. >> + * >> + * If we are not using dmeventd, we pretend that the I/O >> + * succeeded. This is wrong (the failed leg might come online >> + * again after reboot and it would be replicated back to >> + * the good leg) but it is consistent with current behavior. >> + * For proper behavior, dm-raid1 shouldn't be used without >> + * dmeventd at all. >> + * >> + * If we use dmeventd, hold the bio until dmeventd does its job. >> + */ >> + if (!get_valid_mirror(ms)) >> + bio_endio(bio, -EIO); >> + else if (!errors_handled(ms)) >> bio_endio(bio, 0); >> - } else { >> + else >> hold_bio(ms, bio); > > You seem to be holding the I/O that has failed, but what about writes > that are issued after this failure. They seem to go through just fine. > Did I miss something? I think do_writes() needs to check for a leg > failure (just like it does for log_failure already), and put them on > failure/hold list, right? On 09/09/09 16:18, Takahiro Yasui wrote: > - As I mentioned before, bios which are sent to out-of-sync regions also > need to be blocked because bios to out-of-sync regions are processed by > generic_make_request() and would return the "success" to upper layer > without dm-raid1 notices it. This might cause data corruption. > https://www.redhat.com/archives/dm-devel/2009-July/msg00118.html I think you are right. I also commented above two months ago, but I haven't got comments from Mikulas. > Also, we do need to do the above work only if "primary" leg fails. We > can continue to work just like the old code if "secondary" legs fail, > right? Not sure if this is worth optimizing though, but I would like to > see it implemented as it is just a few extra checks. We can have > primary_failure field like log_failure field. Good idea. This data corruption we are talking could happen only if the primary leg fails. Keeping system running in case of non-primary legs looks reasonable. Taka ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-23 5:58 ` malahal 2009-11-23 17:54 ` Takahiro Yasui @ 2009-11-24 11:51 ` Mikulas Patocka 2009-11-24 19:17 ` malahal 1 sibling, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-24 11:51 UTC (permalink / raw) To: malahal; +Cc: dm-devel > > -out: > > - bio_endio(bio, ret); > > + /* > > + * In either case we must mark the region as NOSYNC. > > + * That would block, so do it in the thread. > > + */ > > What exactly you mean by "either case' here? "errors_handled" case or not? > Need to adjust the comment as we don't do that check here anymore. Yes. I mean both "errors_handled" case and "errors_not_handled". > You seem to be holding the I/O that has failed, but what about writes > that are issued after this failure. They seem to go through just fine. > Did I miss something? I think do_writes() needs to check for a leg > failure (just like it does for log_failure already), and put them on > failure/hold list, right? Yes, writes after the failed request are processed, but it is not a problem --- if the write succeeded on all legs, it is returned as success --- in this case, resychronization can't corrupt written data. If the write succeeded only on some legs, it is held again. So in practice, if some leg fails completely, all writes will be held. > Also, we do need to do the above work only if "primary" leg fails. We > can continue to work just like the old code if "secondary" legs fail, > right? Not sure if this is worth optimizing though, but I would like to > see it implemented as it is just a few extra checks. We can have > primary_failure field like log_failure field. > > Thanks, Malahal. I thought about it too, but concluded that we need to hold bios even if the primary leg fails. Imagine this scenario: * secondary leg fails * write fails on the secondaty leg and succeeds on the primary leg and is successfully complete * the computer crashes * after a reboot, the primary leg is inaccessible and the secondary leg is back online --- now raid1 would be returning stale data. In transaction processing applications, errorneous revert of committed transaction is worse problem than computer crash. If we hold the bios if the secondary leg fails (as the patch does), one of these two scenarios happen: * secondary leg fails * write succeeds on the primary leg and is held * the computer crashes * after a reboot, the primary leg is inaccessible and the secondary leg is back online --- but we haven't completed the write, so the transaction wasn't reported as committed or * secondary leg fails * write succeeds on the primary leg and is held * dmeventd removes the secondary leg and the write succeeds * the computer crashes * after a reboot, the primary leg is inaccessible, the secondary leg was already removed by dmeventd, so the array is considered inaccessible. So it doesn't work but at least it doesn't revert already committed transaction. Mikulas ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-24 11:51 ` Mikulas Patocka @ 2009-11-24 19:17 ` malahal 2009-11-25 13:19 ` Mikulas Patocka 0 siblings, 1 reply; 29+ messages in thread From: malahal @ 2009-11-24 19:17 UTC (permalink / raw) To: dm-devel Mikulas Patocka [mpatocka@redhat.com] wrote: > Yes, writes after the failed request are processed, but it is not a > problem --- if the write succeeded on all legs, it is returned as success > --- in this case, resychronization can't corrupt written data. If the > write succeeded only on some legs, it is held again. > > So in practice, if some leg fails completely, all writes will be held. I need to look at the code again, but I thought any new writes to a failed region go to a surviving leg. In that case, we end up returning I/O's to the application after writing to a single leg. > > Also, we do need to do the above work only if "primary" leg fails. We > > can continue to work just like the old code if "secondary" legs fail, > > right? Not sure if this is worth optimizing though, but I would like to > > see it implemented as it is just a few extra checks. We can have > > primary_failure field like log_failure field. > I thought about it too, but concluded that we need to hold bios even if > the primary leg fails. > > Imagine this scenario: > * secondary leg fails > * write fails on the secondaty leg and succeeds on the primary leg > and is successfully complete > * the computer crashes > * after a reboot, the primary leg is inaccessible and the secondary leg is > back online --- now raid1 would be returning stale data. The software can detect this case. We can fail this completely or use the data from the secondary that could be "stale" with help from admin. Let us call this method 1. > If we hold the bios if the secondary leg fails (as the patch does), one of > these two scenarios happen: > > * secondary leg fails > * write succeeds on the primary leg and is held > * the computer crashes > * after a reboot, the primary leg is inaccessible and the secondary leg is > back online --- but we haven't completed the write, so the transaction > wasn't reported as committed > > or > > * secondary leg fails > * write succeeds on the primary leg and is held > * dmeventd removes the secondary leg and the write succeeds > * the computer crashes > * after a reboot, the primary leg is inaccessible, the secondary leg was > already removed by dmeventd, so the array is considered inaccessible. So > it doesn't work but at least it doesn't revert already committed > transaction. How is this latter case (it doesn't need a crash anyway) different/better from the case where we detect that 'primary' is missing and ask admin if he wants to use the data on the secondary or not. At least, the admin has a choice with "method 1" and this doesn't have that choice. Thanks, Malahal. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-24 19:17 ` malahal @ 2009-11-25 13:19 ` Mikulas Patocka 2009-11-25 15:43 ` Takahiro Yasui 2009-11-25 20:23 ` [PATCH 7/7] Hold all write bios when errors are handled malahal 0 siblings, 2 replies; 29+ messages in thread From: Mikulas Patocka @ 2009-11-25 13:19 UTC (permalink / raw) To: malahal; +Cc: dm-devel On Tue, 24 Nov 2009, malahal@us.ibm.com wrote: > I need to look at the code again, but I thought any new writes to a > failed region go to a surviving leg. In that case, we end up returning > I/O's to the application after writing to a single leg. Writes always go to all the legs, see do_write(). Anyway, dmeventd removes the failed leg soon. > > > Also, we do need to do the above work only if "primary" leg fails. We > > > can continue to work just like the old code if "secondary" legs fail, > > > right? Not sure if this is worth optimizing though, but I would like to > > > see it implemented as it is just a few extra checks. We can have > > > primary_failure field like log_failure field. > > > I thought about it too, but concluded that we need to hold bios even if > > the primary leg fails. > > > > Imagine this scenario: > > * secondary leg fails > > * write fails on the secondaty leg and succeeds on the primary leg > > and is successfully complete > > * the computer crashes > > * after a reboot, the primary leg is inaccessible and the secondary leg is > > back online --- now raid1 would be returning stale data. > > The software can detect this case. We can fail this completely or use > the data from the secondary that could be "stale" with help from admin. > Let us call this method 1. You can't detect it because the computer crashed *before* you write the information that the secondary leg failed to the metadata. So, after a reboot, you can't tell if any mirror leg failed some requests before the crash. > > If we hold the bios if the secondary leg fails (as the patch does), one of > > these two scenarios happen: > > > > * secondary leg fails > > * write succeeds on the primary leg and is held > > * the computer crashes > > * after a reboot, the primary leg is inaccessible and the secondary leg is > > back online --- but we haven't completed the write, so the transaction > > wasn't reported as committed > > > > or > > > > * secondary leg fails > > * write succeeds on the primary leg and is held > > * dmeventd removes the secondary leg and the write succeeds > > * the computer crashes > > * after a reboot, the primary leg is inaccessible, the secondary leg was > > already removed by dmeventd, so the array is considered inaccessible. So > > it doesn't work but at least it doesn't revert already committed > > transaction. > > How is this latter case (it doesn't need a crash anyway) > different/better from the case where we detect that 'primary' is missing > and ask admin if he wants to use the data on the secondary or not. At > least, the admin has a choice with "method 1" and this doesn't have that > choice. If you ask the admin always if primary leg failed and wait for his action, you lose fault-tolerance --- the computer would wait until the admin does an action. The requirements are: * if one of legs fail or log fails, you must automatically continue without human intervention * if both legs fail, you must shut it down and not pretend that something was written when it wasn't (this would break durability requirement of transactions). Mikulas > Thanks, Malahal. > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 13:19 ` Mikulas Patocka @ 2009-11-25 15:43 ` Takahiro Yasui 2009-11-25 20:44 ` malahal 2009-11-26 17:54 ` [PATCH 8/7] Hold all write bios in nosync region Mikulas Patocka 2009-11-25 20:23 ` [PATCH 7/7] Hold all write bios when errors are handled malahal 1 sibling, 2 replies; 29+ messages in thread From: Takahiro Yasui @ 2009-11-25 15:43 UTC (permalink / raw) To: dm-devel On 11/25/09 08:19, Mikulas Patocka wrote: > > > On Tue, 24 Nov 2009, malahal@us.ibm.com wrote: > >> I need to look at the code again, but I thought any new writes to a >> failed region go to a surviving leg. In that case, we end up returning >> I/O's to the application after writing to a single leg. > > Writes always go to all the legs, see do_write(). Anyway, dmeventd removes > the failed leg soon. Is it correct? When the region is in the state of out of sync (NOSYNC), I/Os are not processed by do_write() but generic_make_request() in the do_writes(). >>>> Also, we do need to do the above work only if "primary" leg fails. We >>>> can continue to work just like the old code if "secondary" legs fail, >>>> right? Not sure if this is worth optimizing though, but I would like to >>>> see it implemented as it is just a few extra checks. We can have >>>> primary_failure field like log_failure field. >> >>> I thought about it too, but concluded that we need to hold bios even if >>> the primary leg fails. >>> >>> Imagine this scenario: >>> * secondary leg fails >>> * write fails on the secondaty leg and succeeds on the primary leg >>> and is successfully complete >>> * the computer crashes >>> * after a reboot, the primary leg is inaccessible and the secondary leg is >>> back online --- now raid1 would be returning stale data. >> >> The software can detect this case. We can fail this completely or use >> the data from the secondary that could be "stale" with help from admin. >> Let us call this method 1. > > You can't detect it because the computer crashed *before* you write the > information that the secondary leg failed to the metadata. > > So, after a reboot, you can't tell if any mirror leg failed some requests > before the crash. > >>> If we hold the bios if the secondary leg fails (as the patch does), one of >>> these two scenarios happen: >>> >>> * secondary leg fails >>> * write succeeds on the primary leg and is held >>> * the computer crashes >>> * after a reboot, the primary leg is inaccessible and the secondary leg is >>> back online --- but we haven't completed the write, so the transaction >>> wasn't reported as committed >>> >>> or >>> >>> * secondary leg fails >>> * write succeeds on the primary leg and is held >>> * dmeventd removes the secondary leg and the write succeeds >>> * the computer crashes >>> * after a reboot, the primary leg is inaccessible, the secondary leg was >>> already removed by dmeventd, so the array is considered inaccessible. So >>> it doesn't work but at least it doesn't revert already committed >>> transaction. >> >> How is this latter case (it doesn't need a crash anyway) >> different/better from the case where we detect that 'primary' is missing >> and ask admin if he wants to use the data on the secondary or not. At >> least, the admin has a choice with "method 1" and this doesn't have that >> choice. > > If you ask the admin always if primary leg failed and wait for his action, > you lose fault-tolerance --- the computer would wait until the admin does > an action. > > The requirements are: > * if one of legs fail or log fails, you must automatically continue > without human intervention > * if both legs fail, you must shut it down and not pretend that something > was written when it wasn't (this would break durability requirement of > transactions). I agree with this point. lvm mirror could be used on filesystems such as ext3 and each filesystem and application needs to take care those situation to prevent data corruption. I don't think that it is realistic, and the underlying layer should prevent data corruption. I now understand primary and secondary disks need to be blocked. Thanks, Taka ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 15:43 ` Takahiro Yasui @ 2009-11-25 20:44 ` malahal 2009-11-25 22:50 ` Takahiro Yasui 2009-11-26 17:56 ` Mikulas Patocka 2009-11-26 17:54 ` [PATCH 8/7] Hold all write bios in nosync region Mikulas Patocka 1 sibling, 2 replies; 29+ messages in thread From: malahal @ 2009-11-25 20:44 UTC (permalink / raw) To: dm-devel Takahiro Yasui [tyasui@redhat.com] wrote: > > The requirements are: > > * if one of legs fail or log fails, you must automatically continue > > without human intervention > > * if both legs fail, you must shut it down and not pretend that something > > was written when it wasn't (this would break durability requirement of > > transactions). > > I agree with this point. lvm mirror could be used on filesystems such as > ext3 and each filesystem and application needs to take care those situation > to prevent data corruption. I don't think that it is realistic, and the > underlying layer should prevent data corruption. I now understand primary > and secondary disks need to be blocked. If you think that I/O needs to be blocked for first or second leg failures, then I am afraid that there is no way to keep the failed mirror in the system for re-integrating failed devices. I would like to keep the mirror as mirror when one of the legs fails as opposed to making it linear now by dmeventd. This is to re-integrate a failed leg into the mirror to handle transient device failures. Here is what I am planning to do, let me know if you find any issues: 1. Secondary leg failure. dmeventd will NOT change the meta data at all. It will call "lvchange --refresh" that will start resynchronization. This will be done a few times. If the device still fails to re-integrate, it is left the way it is and no further re-integrations attempts are done. The number of attempts can be done at configurable intervals. 2. First leg failure (aka Primary leg failure): The kernel would stop handling any further I/O. The dmeventd will change mirror meta data [it will re-order the legs ] so that the secondary now becomes the primary. I will try to re-integrate the failed device few times before giving up just as in the case "1" above. The kernel module will never progress with a first leg failure as you see above. Thanks, Malahal. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 20:44 ` malahal @ 2009-11-25 22:50 ` Takahiro Yasui 2009-11-26 17:56 ` Mikulas Patocka 1 sibling, 0 replies; 29+ messages in thread From: Takahiro Yasui @ 2009-11-25 22:50 UTC (permalink / raw) To: dm-devel On 11/25/09 15:44, malahal@us.ibm.com wrote: > Takahiro Yasui [tyasui@redhat.com] wrote: >>> The requirements are: >>> * if one of legs fail or log fails, you must automatically continue >>> without human intervention >>> * if both legs fail, you must shut it down and not pretend that something >>> was written when it wasn't (this would break durability requirement of >>> transactions). >> >> I agree with this point. lvm mirror could be used on filesystems such as >> ext3 and each filesystem and application needs to take care those situation >> to prevent data corruption. I don't think that it is realistic, and the >> underlying layer should prevent data corruption. I now understand primary >> and secondary disks need to be blocked. > > If you think that I/O needs to be blocked for first or second leg > failures, then I am afraid that there is no way to keep the failed > mirror in the system for re-integrating failed devices. > > I would like to keep the mirror as mirror when one of the legs fails as > opposed to making it linear now by dmeventd. This is to re-integrate a > failed leg into the mirror to handle transient device failures. Here is > what I am planning to do, let me know if you find any issues: > > 1. Secondary leg failure. dmeventd will NOT change the meta data at all. > It will call "lvchange --refresh" that will start resynchronization. > This will be done a few times. If the device still fails to > re-integrate, it is left the way it is and no further re-integrations > attempts are done. The number of attempts can be done at configurable > intervals. > > 2. First leg failure (aka Primary leg failure): The kernel would stop > handling any further I/O. The dmeventd will change mirror meta data > [it will re-order the legs ] so that the secondary now becomes the > primary. I will try to re-integrate the failed device few times before > giving up just as in the case "1" above. > > The kernel module will never progress with a first leg failure as you > see above. It looks very good idea for both increasing system availability and handling transient errors. I hope we could solve the issue I commented in the email: https://www.redhat.com/archives/dm-devel/2009-November/msg00253.html and adopt this method. Thanks, Taka ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 20:44 ` malahal 2009-11-25 22:50 ` Takahiro Yasui @ 2009-11-26 17:56 ` Mikulas Patocka 1 sibling, 0 replies; 29+ messages in thread From: Mikulas Patocka @ 2009-11-26 17:56 UTC (permalink / raw) To: malahal; +Cc: dm-devel On Wed, 25 Nov 2009, malahal@us.ibm.com wrote: > Takahiro Yasui [tyasui@redhat.com] wrote: > > > The requirements are: > > > * if one of legs fail or log fails, you must automatically continue > > > without human intervention > > > * if both legs fail, you must shut it down and not pretend that something > > > was written when it wasn't (this would break durability requirement of > > > transactions). > > > > I agree with this point. lvm mirror could be used on filesystems such as > > ext3 and each filesystem and application needs to take care those situation > > to prevent data corruption. I don't think that it is realistic, and the > > underlying layer should prevent data corruption. I now understand primary > > and secondary disks need to be blocked. > > If you think that I/O needs to be blocked for first or second leg > failures, then I am afraid that there is no way to keep the failed > mirror in the system for re-integrating failed devices. > > I would like to keep the mirror as mirror when one of the legs fails as > opposed to making it linear now by dmeventd. This is to re-integrate a > failed leg into the mirror to handle transient device failures. Here is > what I am planning to do, let me know if you find any issues: > > 1. Secondary leg failure. dmeventd will NOT change the meta data at all. > It will call "lvchange --refresh" that will start resynchronization. > This will be done a few times. If the device still fails to > re-integrate, it is left the way it is and no further re-integrations > attempts are done. The number of attempts can be done at configurable > intervals. > > 2. First leg failure (aka Primary leg failure): The kernel would stop > handling any further I/O. The dmeventd will change mirror meta data > [it will re-order the legs ] so that the secondary now becomes the > primary. I will try to re-integrate the failed device few times before > giving up just as in the case "1" above. > > The kernel module will never progress with a first leg failure as you > see above. > > Thanks, Malahal. This is possible but it would need some redesign. The purpose of my patch was just to fix the race, not redesign it for transient failures. But yes --- in the long term, it could be done. Mikulas ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 8/7] Hold all write bios in nosync region 2009-11-25 15:43 ` Takahiro Yasui 2009-11-25 20:44 ` malahal @ 2009-11-26 17:54 ` Mikulas Patocka 1 sibling, 0 replies; 29+ messages in thread From: Mikulas Patocka @ 2009-11-26 17:54 UTC (permalink / raw) To: Takahiro Yasui; +Cc: dm-devel, Alasdair G Kergon On Wed, 25 Nov 2009, Takahiro Yasui wrote: > On 11/25/09 08:19, Mikulas Patocka wrote: > > > > > > On Tue, 24 Nov 2009, malahal@us.ibm.com wrote: > > > >> I need to look at the code again, but I thought any new writes to a > >> failed region go to a surviving leg. In that case, we end up returning > >> I/O's to the application after writing to a single leg. > > > > Writes always go to all the legs, see do_write(). Anyway, dmeventd removes > > the failed leg soon. > > Is it correct? When the region is in the state of out of sync (NOSYNC), > I/Os are not processed by do_write() but generic_make_request() in the > do_writes(). This is good point. Here I'm sending patch 8 (to be applied on the top of the rest of patches) that fixes it. Mikulas --- Hold all write bios when leg fails and errors are handled When using dmeventd to handle errors, we must be held until dmeventd does its job. This patch prevents the following race: * primary leg fails * write "1" fail, the write is held, secondary leg is set default * write "2" goes straight to the secondary leg *** crash *** (before dmeventd does its job) * after a reboot, primary leg goes back online, it is resynchronized to the secondary leg, write "2" is reverted although it already completed Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-raid1.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) Index: linux-2.6.32-rc8-devel/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.32-rc8-devel.orig/drivers/md/dm-raid1.c 2009-11-26 17:26:26.000000000 +0100 +++ linux-2.6.32-rc8-devel/drivers/md/dm-raid1.c 2009-11-26 18:30:48.000000000 +0100 @@ -68,6 +68,7 @@ struct mirror_set { region_t nr_regions; int in_sync; int log_failure; + int leg_failure; atomic_t suspend; atomic_t default_mirror; /* Default mirror */ @@ -210,6 +211,8 @@ static void fail_mirror(struct mirror *m struct mirror_set *ms = m->ms; struct mirror *new; + ms->leg_failure = 1; + /* * error_count is used for nothing more than a * simple way to tell if a device has encountered @@ -694,8 +697,12 @@ static void do_writes(struct mirror_set dm_rh_delay(ms->rh, bio); while ((bio = bio_list_pop(&nosync))) { - map_bio(get_default_mirror(ms), bio); - generic_make_request(bio); + if (unlikely(ms->leg_failure) && errors_handled(ms)) + hold_bio(ms, bio); + else { + map_bio(get_default_mirror(ms), bio); + generic_make_request(bio); + } } } @@ -816,6 +823,7 @@ static struct mirror_set *alloc_context( ms->nr_regions = dm_sector_div_up(ti->len, region_size); ms->in_sync = 0; ms->log_failure = 0; + ms->leg_failure = 0; atomic_set(&ms->suspend, 0); atomic_set(&ms->default_mirror, DEFAULT_MIRROR); ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 13:19 ` Mikulas Patocka 2009-11-25 15:43 ` Takahiro Yasui @ 2009-11-25 20:23 ` malahal 2009-11-25 22:47 ` Takahiro Yasui 2009-11-26 17:58 ` Mikulas Patocka 1 sibling, 2 replies; 29+ messages in thread From: malahal @ 2009-11-25 20:23 UTC (permalink / raw) To: dm-devel Mikulas Patocka [mpatocka@redhat.com] wrote: > > > > Imagine this scenario: > > > * secondary leg fails > > > * write fails on the secondaty leg and succeeds on the primary leg > > > and is successfully complete > > > * the computer crashes > > > * after a reboot, the primary leg is inaccessible and the secondary leg is > > > back online --- now raid1 would be returning stale data. > > > > The software can detect this case. We can fail this completely or use > > the data from the secondary that could be "stale" with help from admin. > > Let us call this method 1. > > You can't detect it because the computer crashed *before* you write the > information that the secondary leg failed to the metadata. > > So, after a reboot, you can't tell if any mirror leg failed some requests > before the crash. My definition of 'primary' is the first leg. Now on, I will use "first leg" to avoid confusion. On a reboot, LVM can find if its first leg is missing. If it is missing, it can ask the admin whether to use the 'second' leg or not. When I said, "software" can detect, I really meant that LVM can detect that the "first leg" is missing. > > > If we hold the bios if the secondary leg fails (as the patch does), one of > > > these two scenarios happen: > > > > > > * secondary leg fails > > > * write succeeds on the primary leg and is held > > > * the computer crashes > > > * after a reboot, the primary leg is inaccessible and the secondary leg is > > > back online --- but we haven't completed the write, so the transaction > > > wasn't reported as committed > > > > > > or > > > > > > * secondary leg fails > > > * write succeeds on the primary leg and is held > > > * dmeventd removes the secondary leg and the write succeeds > > > * the computer crashes > > > * after a reboot, the primary leg is inaccessible, the secondary leg was > > > already removed by dmeventd, so the array is considered inaccessible. So > > > it doesn't work but at least it doesn't revert already committed > > > transaction. > > > > How is this latter case (it doesn't need a crash anyway) > > different/better from the case where we detect that 'primary' is missing > > and ask admin if he wants to use the data on the secondary or not. At > > least, the admin has a choice with "method 1" and this doesn't have that > > choice. > > If you ask the admin always if primary leg failed and wait for his action, > you lose fault-tolerance --- the computer would wait until the admin does > an action. Well, we are talking one time admin help at reboot [actually activation time] if and only if the "state of the mirror" changed during reboot! This is not applicable at run time. Most of the time, dmeventd will have a chance to change the mirror to linear anyway. > The requirements are: > * if one of legs fail or log fails, you must automatically continue > without human intervention This is satisfied. I was saying for the admin input only at reboot time. What happens if first leg fails now? I think, activation fails and needs to specify "--partial" or so. > * if both legs fail, you must shut it down and not pretend that something > was written when it wasn't (this would break durability requirement of > transactions). True. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 20:23 ` [PATCH 7/7] Hold all write bios when errors are handled malahal @ 2009-11-25 22:47 ` Takahiro Yasui 2009-11-25 23:20 ` malahal 2009-11-26 17:58 ` Mikulas Patocka 1 sibling, 1 reply; 29+ messages in thread From: Takahiro Yasui @ 2009-11-25 22:47 UTC (permalink / raw) To: dm-devel On 11/25/09 15:23, malahal@us.ibm.com wrote: > Mikulas Patocka [mpatocka@redhat.com] wrote: >> >>>> Imagine this scenario: >>>> * secondary leg fails >>>> * write fails on the secondaty leg and succeeds on the primary leg >>>> and is successfully complete >>>> * the computer crashes >>>> * after a reboot, the primary leg is inaccessible and the secondary leg is >>>> back online --- now raid1 would be returning stale data. >>> >>> The software can detect this case. We can fail this completely or use >>> the data from the secondary that could be "stale" with help from admin. >>> Let us call this method 1. >> >> You can't detect it because the computer crashed *before* you write the >> information that the secondary leg failed to the metadata. >> >> So, after a reboot, you can't tell if any mirror leg failed some requests >> before the crash. > > My definition of 'primary' is the first leg. Now on, I will use "first > leg" to avoid confusion. On a reboot, LVM can find if its first leg is > missing. If it is missing, it can ask the admin whether to use the > 'second' leg or not. When I said, "software" can detect, I really meant > that LVM can detect that the "first leg" is missing. I think again the scenario which Mikulas pointed. It looks double failures (fails happened on two legs), and human intervention would be acceptable. However, how do we know if the second leg contains valid data? There might be two cases. 1) System crashed during write operations without any disk failures, and the first leg fails at the next boot. We can use the secondary leg because data in the secondary leg is valid. 2) System crashed after the secondary leg failed, and the first leg fails and the secondary leg gets back online at the next boot. We can't use the secondary leg because data might be stale. I haven't checked the contents of log disk, but I guess we can't differentiate these cases from log disks. Another possibility I thought was error messages. If any error messages for the secondary leg are recorded, we can judge that the secondary leg contains stale data, but I suspect that it is not a secure way because syslog might not be written in disk before system crash. I would like to enhance system availability by keep system running when the secondary leg fails, but we need to confirm this case. I appreciate your comments. Thanks, Taka ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 22:47 ` Takahiro Yasui @ 2009-11-25 23:20 ` malahal 2009-11-25 23:50 ` Takahiro Yasui 0 siblings, 1 reply; 29+ messages in thread From: malahal @ 2009-11-25 23:20 UTC (permalink / raw) To: dm-devel Takahiro Yasui [tyasui@redhat.com] wrote: > I think again the scenario which Mikulas pointed. It looks double failures > (fails happened on two legs), and human intervention would be acceptable. > However, how do we know if the second leg contains valid data? > > There might be two cases. > > 1) System crashed during write operations without any disk failures, and > the first leg fails at the next boot. > > We can use the secondary leg because data in the secondary leg is valid. > > 2) System crashed after the secondary leg failed, and the first leg fails > and the secondary leg gets back online at the next boot. > > We can't use the secondary leg because data might be stale. > > I haven't checked the contents of log disk, but I guess we can't > differentiate these cases from log disks. There were plans to add a new region state to make sure that all the mirror legs have same data after a "crash". Currently your best bet is a complete resync after a crash! I am not sure if the state is written to log disk though. It may be possible to distinguish the above two cases with this... Or just have LVM meta data that records a device failure. Suspend writes [for any kind of leg] and record device failure in the LVM meta data and restart writes. This requires LVM meta data change though! > Another possibility I thought was error messages. > If any error messages for the secondary leg are recorded, we can judge that > the secondary leg contains stale data, but I suspect that it is not a secure > way because syslog might not be written in disk before system crash. We should be able to fix it in the LVM meta data! ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 23:20 ` malahal @ 2009-11-25 23:50 ` Takahiro Yasui 2009-11-26 0:30 ` malahal 0 siblings, 1 reply; 29+ messages in thread From: Takahiro Yasui @ 2009-11-25 23:50 UTC (permalink / raw) To: dm-devel On 11/25/09 18:20, malahal@us.ibm.com wrote: > Takahiro Yasui [tyasui@redhat.com] wrote: >> I think again the scenario which Mikulas pointed. It looks double failures >> (fails happened on two legs), and human intervention would be acceptable. >> However, how do we know if the second leg contains valid data? >> >> There might be two cases. >> >> 1) System crashed during write operations without any disk failures, and >> the first leg fails at the next boot. >> >> We can use the secondary leg because data in the secondary leg is valid. >> >> 2) System crashed after the secondary leg failed, and the first leg fails >> and the secondary leg gets back online at the next boot. >> >> We can't use the secondary leg because data might be stale. >> >> I haven't checked the contents of log disk, but I guess we can't >> differentiate these cases from log disks. > > There were plans to add a new region state to make sure that all the > mirror legs have same data after a "crash". Currently your best bet is a > complete resync after a crash! Please let me clarify this. There are two legs and system crash happens. Then, how can we resync? We have only one leg (secondary) after boot. When we use "mirror", we expect the last device to contain valid data, don't we? > Or just have LVM meta data that records a device failure. Suspend writes > [for any kind of leg] and record device failure in the LVM meta data and > restart writes. This requires LVM meta data change though! Do you mean that write I/Os need to be blocked when the secondary leg fails in order to update LVM meta data by lvm commands called by dmeventd? Thanks, Taka ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 23:50 ` Takahiro Yasui @ 2009-11-26 0:30 ` malahal 0 siblings, 0 replies; 29+ messages in thread From: malahal @ 2009-11-26 0:30 UTC (permalink / raw) To: dm-devel Takahiro Yasui [tyasui@redhat.com] wrote: > >> I haven't checked the contents of log disk, but I guess we can't > >> differentiate these cases from log disks. > > > > There were plans to add a new region state to make sure that all the > > mirror legs have same data after a "crash". Currently your best bet is a > > complete resync after a crash! > > Please let me clarify this. There are two legs and system crash happens. > Then, how can we resync? We have only one leg (secondary) after boot. > When we use "mirror", we expect the last device to contain valid data, > don't we? I was actually referring to a problem that I mentioned in a DM/LVM meeting. Mikulas posted it here and there were some discussions. http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/5392 If I remember, the proposed solution would add another state bit. If it is implemented, then we can use that information to differentiate the cases you mentioned. Essentially, if there are only 'dirty' regions, we can safely use the secondary. On the other hand, if th log has some 'out of sync' regions, then we can't use the secondary. You can ignore my reference to "complete resynchronization". I was referring to an existing problem with system crashes! > > Or just have LVM meta data that records a device failure. Suspend writes > > [for any kind of leg] and record device failure in the LVM meta data and > > restart writes. This requires LVM meta data change though! > > Do you mean that write I/Os need to be blocked when the secondary leg fails > in order to update LVM meta data by lvm commands called by dmeventd? Yes, that is correct. We need to stop write I/O for any leg failure if we chose LVM meta data method. Thanks, Malahal. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-25 20:23 ` [PATCH 7/7] Hold all write bios when errors are handled malahal 2009-11-25 22:47 ` Takahiro Yasui @ 2009-11-26 17:58 ` Mikulas Patocka 2009-11-26 22:22 ` malahal 1 sibling, 1 reply; 29+ messages in thread From: Mikulas Patocka @ 2009-11-26 17:58 UTC (permalink / raw) To: device-mapper development > > If you ask the admin always if primary leg failed and wait for his action, > > you lose fault-tolerance --- the computer would wait until the admin does > > an action. > > Well, we are talking one time admin help at reboot [actually activation > time] if and only if the "state of the mirror" changed during reboot! > This is not applicable at run time. Most of the time, dmeventd will have > a chance to change the mirror to linear anyway. You can ask the admin about failed on reboot --- but why do it if it can continue on its own? (unless both legs fail) Mikulas ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7] Hold all write bios when errors are handled 2009-11-26 17:58 ` Mikulas Patocka @ 2009-11-26 22:22 ` malahal 0 siblings, 0 replies; 29+ messages in thread From: malahal @ 2009-11-26 22:22 UTC (permalink / raw) To: dm-devel Mikulas Patocka [mpatocka@redhat.com] wrote: > > > If you ask the admin always if primary leg failed and wait for his > > > action, you lose fault-tolerance --- the computer would wait until > > > the admin does an action. > > > > Well, we are talking one time admin help at reboot [actually activation > > time] if and only if the "state of the mirror" changed during reboot! > > This is not applicable at run time. Most of the time, dmeventd will have > > a chance to change the mirror to linear anyway. > > You can ask the admin about failed on reboot --- but why do it if it can > continue on its own? (unless both legs fail) Agreed, it is better to stop on either failure. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/7] A framework for holding bios until suspend 2009-11-18 12:11 ` [PATCH 2/7] A framework for holding bios until suspend Mikulas Patocka 2009-11-18 12:11 ` [PATCH 3/7] Use the hold framework in do_failures Mikulas Patocka @ 2009-11-28 18:02 ` Takahiro Yasui 2009-11-30 2:55 ` malahal 2009-11-30 9:41 ` Alasdair G Kergon 1 sibling, 2 replies; 29+ messages in thread From: Takahiro Yasui @ 2009-11-28 18:02 UTC (permalink / raw) To: dm-devel On 11/18/09 07:11, Mikulas Patocka wrote: > @@ -760,6 +775,7 @@ static void do_mirror(struct work_struct > bio_list_init(&ms->reads); > bio_list_init(&ms->writes); > bio_list_init(&ms->failures); > + bio_list_init(&ms->hold); > spin_unlock_irqrestore(&ms->lock, flags); Initializing the hold list in do_mirror() is a problem. Some bios might already in it and they will never be processed. This makes device-mapper "stuck" when the device goes suspend. The hold list should be inizialized only when the mirror device is created in alloc_context(). alloc_context() spin_lock_init(&ms->lock); bio_list_init(&ms->reads); bio_list_init(&ms->writes); bio_list_init(&ms->failures); + bio_list_init(&ms->hold); Thanks, Taka ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/7] A framework for holding bios until suspend 2009-11-28 18:02 ` [PATCH 2/7] A framework for holding bios until suspend Takahiro Yasui @ 2009-11-30 2:55 ` malahal 2009-11-30 9:41 ` Alasdair G Kergon 1 sibling, 0 replies; 29+ messages in thread From: malahal @ 2009-11-30 2:55 UTC (permalink / raw) To: dm-devel Takahiro Yasui [tyasui@redhat.com] wrote: > On 11/18/09 07:11, Mikulas Patocka wrote: > > @@ -760,6 +775,7 @@ static void do_mirror(struct work_struct > > bio_list_init(&ms->reads); > > bio_list_init(&ms->writes); > > bio_list_init(&ms->failures); > > + bio_list_init(&ms->hold); > > spin_unlock_irqrestore(&ms->lock, flags); > > Initializing the hold list in do_mirror() is a problem. Some bios might > already in it and they will never be processed. This makes device-mapper > "stuck" when the device goes suspend. The hold list should be inizialized > only when the mirror device is created in alloc_context(). > > alloc_context() > spin_lock_init(&ms->lock); > bio_list_init(&ms->reads); > bio_list_init(&ms->writes); > bio_list_init(&ms->failures); > + bio_list_init(&ms->hold); > > Thanks, > Taka Good catch! ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/7] A framework for holding bios until suspend 2009-11-28 18:02 ` [PATCH 2/7] A framework for holding bios until suspend Takahiro Yasui 2009-11-30 2:55 ` malahal @ 2009-11-30 9:41 ` Alasdair G Kergon 1 sibling, 0 replies; 29+ messages in thread From: Alasdair G Kergon @ 2009-11-30 9:41 UTC (permalink / raw) To: Takahiro Yasui; +Cc: dm-devel On Sat, Nov 28, 2009 at 01:02:15PM -0500, Takahiro Yasui wrote: > Initializing the hold list in do_mirror() is a problem. Some bios might > already in it and they will never be processed. This makes device-mapper > "stuck" when the device goes suspend. The hold list should be inizialized > only when the mirror device is created in alloc_context(). I fixed that as I merged the patch into my tree and didn't even notice the original patch was wrong! (I assumed it was a fuzzy patching error at my end.) http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-raid1-add-framework-to-hold-bios-during-suspend.patch http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-raid1-explicitly-initialise-bio_lists.patch Alasdair ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 2009-11-18 12:09 [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 Mikulas Patocka 2009-11-18 12:10 ` [PATCH 1/7] Explicitly initialize bio lists Mikulas Patocka @ 2009-11-30 16:46 ` Takahiro Yasui 1 sibling, 0 replies; 29+ messages in thread From: Takahiro Yasui @ 2009-11-30 16:46 UTC (permalink / raw) To: dm-devel On 11/18/09 07:09, Mikulas Patocka wrote: > Hi > > Here is the serie of 7 patches to hold write bios on dm-raid1 until > dmeventd does its job. It fixes bug > https://bugzilla.redhat.com/show_bug.cgi?id=502927 . The first 6 patches > are preparatory, they just move the code around, the last patch does the > fix. > > I tested the thing, I managed to reproduce the bug (by manually stopping > dmeventd with STOP signal, failing primary mirror leg and writing to the > device) and I also verified that the patches fix the bug. > > For non-dmeventd operation, the current behavior is wrong and I just keep > it as wrong as it was. There is no easy fix. It is just assume that if the > user doesn't use dmeventd, he can't activate failed disks again. > > Mikulas > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel I reviewed and tested your patch set and looks good as a kernel side. Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> However, there are two issues found related to #8 and requires improvements of dmeventd and lvm commands. This patch set is based on the idea that dmeventd and lvm commands (lvconvert and vgreduce) fix device failures and release blocked write I/Os. However, the blocked write I/Os won't be released forever in some cases. * Case 1: Medium error When medium errors are detected for a write I/O and reported to dmeventd, lvconvert is kicked from dmeventd, but nothing is done. Therefore, write I/Os will be blocked forever. lvm commands will result in as follows: # dmsetup status vg00-lv00 0 24576 mirror 2 253:1 253:2 23/24 1 DA 3 disk 253:0 A # lvconvert --config devices{ignore_suspended_devices=1} --repair vg00/lv00 The mirror is consistent, nothing to repair. # vgreduce --removemissing vg00 Volume group "vg00" is already consistent # /usr/sbin/lvm version LVM version: 2.02.57(1)-cvs (2009-11-24) Library version: 1.02.41-cvs (2009-11-24) Driver version: 4.15.0 # uname -mr 2.6.31.6 i686 * Case 2: Sync error dmeventd doesn't handle sync error ('S' showed by status) which happens during recovery. When write I/Os are issued on out-of-sync region, they are blocked, but dmeventd won't handle sync error and release blocked I/Os. The error on the primary leg during recovery won't help and we need to accept the system stop because of no valid leg. The error on the secondary leg can be handled as the regular write error. We can fix this issue by changing the error flag from DM_RAID1_SYNC_ERROR to DM_RAID1_WRITE_ERROR so that this error can be handled by dmeventd. Other idea is to change dmeventd so that it can handle sync error ('S') on the secondary error. It is also easy to make a small patch. Thanks, Taka ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2009-11-30 16:46 UTC | newest] Thread overview: 29+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-11-18 12:09 [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 Mikulas Patocka 2009-11-18 12:10 ` [PATCH 1/7] Explicitly initialize bio lists Mikulas Patocka 2009-11-18 12:11 ` [PATCH 2/7] A framework for holding bios until suspend Mikulas Patocka 2009-11-18 12:11 ` [PATCH 3/7] Use the hold framework in do_failures Mikulas Patocka 2009-11-18 12:12 ` [PATCH 4/7] Don't optimize for failure case Mikulas Patocka 2009-11-18 12:13 ` [PATCH 5/7] Move a logic to get a valid mirror leg to a function Mikulas Patocka 2009-11-18 12:18 ` [PATCH 6/7] Move bio completion from dm_rh_mark_nosync to its caller Mikulas Patocka 2009-11-18 12:19 ` [PATCH 7/7] Hold all write bios when errors are handled Mikulas Patocka 2009-11-23 5:58 ` malahal 2009-11-23 17:54 ` Takahiro Yasui 2009-11-24 11:51 ` Mikulas Patocka 2009-11-24 19:17 ` malahal 2009-11-25 13:19 ` Mikulas Patocka 2009-11-25 15:43 ` Takahiro Yasui 2009-11-25 20:44 ` malahal 2009-11-25 22:50 ` Takahiro Yasui 2009-11-26 17:56 ` Mikulas Patocka 2009-11-26 17:54 ` [PATCH 8/7] Hold all write bios in nosync region Mikulas Patocka 2009-11-25 20:23 ` [PATCH 7/7] Hold all write bios when errors are handled malahal 2009-11-25 22:47 ` Takahiro Yasui 2009-11-25 23:20 ` malahal 2009-11-25 23:50 ` Takahiro Yasui 2009-11-26 0:30 ` malahal 2009-11-26 17:58 ` Mikulas Patocka 2009-11-26 22:22 ` malahal 2009-11-28 18:02 ` [PATCH 2/7] A framework for holding bios until suspend Takahiro Yasui 2009-11-30 2:55 ` malahal 2009-11-30 9:41 ` Alasdair G Kergon 2009-11-30 16:46 ` [PATCH 0/7] patches: fix dm-raid1 race, bug 502927 Takahiro Yasui
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.