* [PATCH] handle transient errors in lvconvert --repair @ 2010-04-27 15:14 Petr Rockai 2010-04-27 18:03 ` Petr Rockai 2010-04-28 19:37 ` Takahiro Yasui 0 siblings, 2 replies; 21+ messages in thread From: Petr Rockai @ 2010-04-27 15:14 UTC (permalink / raw) To: lvm-devel Hi, this is a first iteration of the transient-error handling in lvconvert. This works by marking LVs as partial even though there are no PVs missing -- whenever a mirror marks a leg or log as failed (in the status string). The downside of this approach is that we never figure which PV is failing, but that would require kernel-level support for IO error tracking. Yours, Petr. PS: The test is rather rudimentary. On one hand, testing log failures seems impossible because that runs into kernel lockups. I will add more test scenarios for leg failures later, and also for leg replacement. -------------- next part -------------- A non-text attachment was scrubbed... Name: lvconvert-repair-transient.diff Type: text/x-diff Size: 24993 bytes Desc: not available URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100427/a839fbb6/attachment.bin> ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-04-27 15:14 [PATCH] handle transient errors in lvconvert --repair Petr Rockai @ 2010-04-27 18:03 ` Petr Rockai 2010-05-06 17:30 ` Petr Rockai 2010-04-28 19:37 ` Takahiro Yasui 1 sibling, 1 reply; 21+ messages in thread From: Petr Rockai @ 2010-04-27 18:03 UTC (permalink / raw) To: lvm-devel A couple issues slipped into last patch, this is a fixed version (no regex debris, fixed testcase). -------------- next part -------------- A non-text attachment was scrubbed... Name: lvconvert-repair-transient.diff Type: text/x-diff Size: 22931 bytes Desc: not available URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100427/55914445/attachment.bin> ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-04-27 18:03 ` Petr Rockai @ 2010-05-06 17:30 ` Petr Rockai 2010-05-14 22:52 ` Takahiro Yasui 0 siblings, 1 reply; 21+ messages in thread From: Petr Rockai @ 2010-05-06 17:30 UTC (permalink / raw) To: lvm-devel Hi, I have rebased the patch against current CVS. Also, I have added a guard to _mirrored_transient_status to ensure that we alloca() only very modest (and, more importantly, bounded) amount of memory. -------------- next part -------------- A non-text attachment was scrubbed... Name: lvconvert-repair-transient.diff Type: text/x-diff Size: 24041 bytes Desc: not available URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100506/a5c0eb27/attachment.bin> -------------- next part -------------- Yours, Petr. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-06 17:30 ` Petr Rockai @ 2010-05-14 22:52 ` Takahiro Yasui 2010-05-17 23:01 ` Takahiro Yasui 0 siblings, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-05-14 22:52 UTC (permalink / raw) To: lvm-devel Hi Petr, > I have rebased the patch against current CVS. Also, I have added a guard > to _mirrored_transient_status to ensure that we alloca() only very > modest (and, more importantly, bounded) amount of memory. This patch is just a rebase against current CVS, so it doesn't support mirrored log, yet. Do you have a plan to support mirrored log? I also tested this patch for a lvm mirror with core/disk log. When a mirror log failed, the mirror log was removed from a mirror volume, but a log voluem is not removed from its volume group. This always happens both on a transient and persistent error. Here is an output of lvs command in my testing environment. lv00_mlog should be removed from vg00. # dmsetup ls vg00-lv00_mimage_1 (253, 2) vg00-lv00_mimage_0 (253, 1) vg00-lv00 (253, 3) # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert lv00 vg00 mwi-a- 12.00M 100.00 lv00_mlog vg00 -wi--- 4.00M I will report if I get the reason of this. Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-14 22:52 ` Takahiro Yasui @ 2010-05-17 23:01 ` Takahiro Yasui 2010-05-19 12:06 ` Petr Rockai 0 siblings, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-05-17 23:01 UTC (permalink / raw) To: lvm-devel Hi Petr, On 05/14/10 18:52, Takahiro Yasui wrote: >> I have rebased the patch against current CVS. Also, I have added a guard >> to _mirrored_transient_status to ensure that we alloca() only very >> modest (and, more importantly, bounded) amount of memory. > > This patch is just a rebase against current CVS, so it doesn't support > mirrored log, yet. Do you have a plan to support mirrored log? > > I also tested this patch for a lvm mirror with core/disk log. When > a mirror log failed, the mirror log was removed from a mirror volume, > but a log voluem is not removed from its volume group. This always > happens both on a transient and persistent error. This issue seems related to this part of your patch. @@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str ... - /* - * Remove all failed_pvs - */ - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs, - lp->mirrors, new_log_count)) - return 0; + if (failed_mirrors) { + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, + _is_partial_lv, NULL, 0)) + return 0; + + if (!_reload_lv(cmd, lv)) + return 0; + } When I removed this modification, a log volume was removed as expected. And also other my test cases also passed. I hope this information would be helpful. Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-17 23:01 ` Takahiro Yasui @ 2010-05-19 12:06 ` Petr Rockai 2010-05-19 15:06 ` Takahiro Yasui 2010-05-19 16:44 ` Takahiro Yasui 0 siblings, 2 replies; 21+ messages in thread From: Petr Rockai @ 2010-05-19 12:06 UTC (permalink / raw) To: lvm-devel Hi Taka, Takahiro Yasui <tyasui@redhat.com> writes: > On 05/14/10 18:52, Takahiro Yasui wrote: >> I also tested this patch for a lvm mirror with core/disk log. When >> a mirror log failed, the mirror log was removed from a mirror volume, >> but a log voluem is not removed from its volume group. This always >> happens both on a transient and persistent error. > > This issue seems related to this part of your patch. > > @@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str > ... > - /* > - * Remove all failed_pvs > - */ > - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs, > - lp->mirrors, new_log_count)) > - return 0; > + if (failed_mirrors) { > + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, > + _is_partial_lv, NULL, 0)) > + return 0; > + > + if (!_reload_lv(cmd, lv)) > + return 0; > + } > > When I removed this modification, a log volume was removed as expected. > And also other my test cases also passed. The catch is that this won't work correctly in other cases, especially with transient errors. I suspect the real problem is in not calling _lv_update_log_type in the new code path -- but see below: I cannot reliably fix this without having a reproducer. Also, I would very much like to have the tests you had failing on our regression suite, to avoid similar problem in the future. > I hope this information would be helpful. Yes, it is indeed quite helpful. Unfortunately, I still cannot reproduce the problem -- I have written a few testcases that only fail the log, or fail a log and some other things and I can't seem to trigger the bug. I have tried with both normal and cluster locking. It would be very useful if you could provide more specific instructions on how to trigger this. I have tried these, all of them remove the log as expected, both with and without my patch (at least for me): aux prepare_vg 5 lvcreate -m 2 --ig -L 1 -n 3way $vg disable_dev $dev1 $dev2 echo n | lvconvert --repair $vg/3way check linear $vg 3way lvs -a -o +devices | not grep unknown lvs -a -o +devices | not grep mlog dmsetup ls | not grep mlog vgreduce --removemissing $vg enable_dev $dev1 $dev2 check linear $vg 3way aux prepare_vg 5 lvcreate -m 2 --ig -L 1 -n 3way $vg $dev1 $dev2 $dev3 $dev4:0 disable_dev $dev4 echo n | lvconvert --repair $vg/3way check mirror $vg 3way core lvs -a -o +devices | not grep unknown lvs -a -o +devices | not grep mlog dmsetup ls | not grep mlog vgreduce --removemissing $vg enable_dev $dev4 aux prepare_vg 5 lvcreate -m 1 --ig -L 1 -n 2way $vg $dev1 $dev2 $dev3:0 disable_dev $dev3 echo n | lvconvert --repair $vg/2way check mirror $vg 2way core lvs -a -o +devices | not grep unknown lvs -a -o +devices | not grep mlog vgreduce --removemissing $vg enable_dev $dev3 Some further analysis: During a call to lv_remove_mirrors above, we call through to _remove_mirror_images, with remove_log = 1. We have this: ... if (remove_log) detached_log_lv = detach_mirror_log(mirrored_seg); ... if (detached_log_lv && !_delete_lv(lv, detached_log_lv)) return_0; So the log *should* be gone after this is finished. Since you see the log hanging around, I suspect that this code has some bugs (this part of the code is known to be problematic, unfortunately). Apart from actual steps to reproduce the problem, the output from lvconvert doing the repair would be helpful. It should be printing things like "Mirror status" and "Mirror log status", please paste these. Thanks! Yours, Petr. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 12:06 ` Petr Rockai @ 2010-05-19 15:06 ` Takahiro Yasui 2010-05-19 18:31 ` Petr Rockai 2010-05-19 16:44 ` Takahiro Yasui 1 sibling, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-05-19 15:06 UTC (permalink / raw) To: lvm-devel Hi Petr, On 05/19/10 08:06, Petr Rockai wrote: > Takahiro Yasui <tyasui@redhat.com> writes: > The catch is that this won't work correctly in other cases, especially > with transient errors. I suspect the real problem is in not calling > _lv_update_log_type in the new code path -- but see below: I cannot > reliably fix this without having a reproducer. Also, I would very much > like to have the tests you had failing on our regression suite, to avoid > similar problem in the future. > ... > Unfortunately, I still cannot reproduce the problem -- I have written a > few testcases that only fail the log, or fail a log and some other > things and I can't seem to trigger the bug. I have tried with both > normal and cluster locking. > > It would be very useful if you could provide more specific instructions > on how to trigger this. Here is the instruction. I used 2.02.65 but I also reproduced it using 2.02.66, too. 0. environment # lvm version LVM version: 2.02.66(2)-cvs (2010-05-17) Library version: 1.02.49-cvs (2010-05-17) Driver version: 4.16.0 # grep mirror_log_fault_policy /etc/lvm/lvm.conf # 'mirror_image_fault_policy' and 'mirror_log_fault_policy' define mirror_log_fault_policy = "remove" 1. create vg and lv # vgcreate vg00 /dev/sd[c-e]; lvcreate --ig -m1 -L12m -nlv00 vg00 Volume group "vg00" successfully created Logical volume "lv00" created 2. disable log device (/dev/sde in my environment) # echo offline > /sys/block/sde/device/state 3. run 'lvconvert --repair' # lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg00/lv00 Mirrored transient status: "2 253:1 253:2 24/24 1 AA 3 disk 253:0 D" Mirror log status: 1 of 1 images failed - switching to core WARNING: Failed to replace 1 of 1 logs in volume lv00 4. check logical volumes # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert lv00 vg00 mwi-a- 12.00M 100.00 lv00_mlog vg00 -wi--- 4.00M > aux prepare_vg 5 > lvcreate -m 1 --ig -L 1 -n 2way $vg $dev1 $dev2 $dev3:0 > disable_dev $dev3 > echo n | lvconvert --repair $vg/2way > check mirror $vg 2way core > lvs -a -o +devices | not grep unknown > lvs -a -o +devices | not grep mlog > vgreduce --removemissing $vg > enable_dev $dev3 This issue didn't occurred with your test case in my environment, either. So, the differences in our test cases seems 'policy.' I used the same options for lvconvert as ones in dmeventd. Thanks, Taka > During a call to lv_remove_mirrors above, we call through to > _remove_mirror_images, with remove_log = 1. We have this: > > ... if (remove_log) > detached_log_lv = detach_mirror_log(mirrored_seg); > > ... > > if (detached_log_lv && !_delete_lv(lv, detached_log_lv)) > return_0; > > So the log *should* be gone after this is finished. Since you see the > log hanging around, I suspect that this code has some bugs (this part of > the code is known to be problematic, unfortunately). Apart from actual > steps to reproduce the problem, the output from lvconvert doing the > repair would be helpful. It should be printing things like "Mirror > status" and "Mirror log status", please paste these. Yes, see step 4. Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 15:06 ` Takahiro Yasui @ 2010-05-19 18:31 ` Petr Rockai 2010-05-19 19:46 ` Takahiro Yasui 2010-05-20 0:49 ` Takahiro Yasui 0 siblings, 2 replies; 21+ messages in thread From: Petr Rockai @ 2010-05-19 18:31 UTC (permalink / raw) To: lvm-devel Hi again, Takahiro Yasui <tyasui@redhat.com> writes: > 3. run 'lvconvert --repair' > > # lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg00/lv00 > Mirrored transient status: "2 253:1 253:2 24/24 1 AA 3 disk 253:0 D" > Mirror log status: 1 of 1 images failed - switching to core > WARNING: Failed to replace 1 of 1 logs in volume lv00 > > 4. check logical volumes > > # lvs > LV VG Attr LSize Origin Snap% Move Log Copy% Convert > lv00 vg00 mwi-a- 12.00M 100.00 > lv00_mlog vg00 -wi--- 4.00M I have tracked the problem down to a silly branching bug in lvconvert.c. The LV was only being written to disk if mirrors changed, or log re-allocation was attempted. I have fixed that, and also added both mine and your testcases to the suite. The attached patch is on top of the previous one -- it should apply cleanly on top of current CVS + the lvconvert-repair-transient.diff I have sent previously. It also fixes a bug in messages, where it would claim that it failed to replace things even if it never tried to in the first place. Please test and let me know if you run into any other issues. Thanks, Petr. -------------- next part -------------- A non-text attachment was scrubbed... Name: lvconvert-repair-transient-inter.diff Type: text/x-diff Size: 5799 bytes Desc: not available URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100519/e6a38379/attachment.bin> ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 18:31 ` Petr Rockai @ 2010-05-19 19:46 ` Takahiro Yasui 2010-05-19 20:19 ` Petr Rockai 2010-05-20 0:49 ` Takahiro Yasui 1 sibling, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-05-19 19:46 UTC (permalink / raw) To: lvm-devel Hi Petr, >+++ cvs-upstream/tools/lvconvert.c 2010-05-19 20:18:40.000000000 +0200 >@@ -1225,11 +1225,11 @@ > if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, > _is_partial_lv, NULL, 0)) > return 0; As I posted in the previous mail, the argument, new_log_count, is correct here? The argument is used to check if a log should be removed. I think that failed_log comes here instead of new_log_count. Or the value such as nlc in _lvconvert_mirrors_aux() comes. uint32_t nlc = (!new_log_count || lp->mirrors == 1) ? 1U : 0U; ... } else if (!lv_remove_mirrors(cmd, lv, nmc, nlc, is_mirror_image_removable, operable_pvs, 0)) Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 19:46 ` Takahiro Yasui @ 2010-05-19 20:19 ` Petr Rockai 2010-05-19 20:31 ` Alasdair G Kergon 2010-05-19 20:35 ` Takahiro Yasui 0 siblings, 2 replies; 21+ messages in thread From: Petr Rockai @ 2010-05-19 20:19 UTC (permalink / raw) To: lvm-devel Takahiro Yasui <tyasui@redhat.com> writes: >>+++ cvs-upstream/tools/lvconvert.c 2010-05-19 20:18:40.000000000 +0200 >>@@ -1225,11 +1225,11 @@ >> if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, >> _is_partial_lv, NULL, 0)) >> return 0; > > As I posted in the previous mail, the argument, new_log_count, is > correct here? The argument is used to check if a log should be removed. > I think that failed_log comes here instead of new_log_count. Oh *jeez*. int lv_remove_mirrors(struct cmd_context *cmd __attribute((unused)), struct logical_volume *lv, uint32_t mirrors, uint32_t log_count, int (*is_removable)(struct logical_volume *, void *), void *removable_baton, uint64_t status_mask) Notice how the parameter is named "log_count" -- I have assumed, given that elsewhere it is called "remove_log" that this has a reverse meaning. Wrong, it actually means the same thing as remove_log. Thanks for noticing this. So yes, it should spell if (!lv_remove_mirrors(cmd, lv, failed_mirrors, failed_log, _is_partial_lv, NULL, 0)) ... now this was probably not being noticed, since in most cases, the mistakenly removed log would be re-created right away by later code. I'll fix this in my local patch. > Or the value such as nlc in _lvconvert_mirrors_aux() comes. > > uint32_t nlc = (!new_log_count || lp->mirrors == 1) ? 1U : 0U; > ... > } else if (!lv_remove_mirrors(cmd, lv, nmc, nlc, > is_mirror_image_removable, operable_pvs, 0)) Ugh, nlc sounds like short for new_log_count to me, which is *extremely confusing*, since it's actually quite opposite: nlc = 1 means remove log, nlc = 0 means retain the log. Is that right? Yours, Petr. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 20:19 ` Petr Rockai @ 2010-05-19 20:31 ` Alasdair G Kergon 2010-05-19 20:35 ` Takahiro Yasui 1 sibling, 0 replies; 21+ messages in thread From: Alasdair G Kergon @ 2010-05-19 20:31 UTC (permalink / raw) To: lvm-devel On Wed, May 19, 2010 at 10:19:41PM +0200, Peter Rockai wrote: > Notice how the parameter is named "log_count" -- I have assumed, given > that elsewhere it is called "remove_log" that this has a reverse > meaning. Wrong, it actually means the same thing as remove_log. Our variable names are meant to be global. > Ugh, nlc sounds like short for new_log_count to me, which is *extremely Fix abbreviations you find like that too, please:) Alasdair ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 20:19 ` Petr Rockai 2010-05-19 20:31 ` Alasdair G Kergon @ 2010-05-19 20:35 ` Takahiro Yasui 1 sibling, 0 replies; 21+ messages in thread From: Takahiro Yasui @ 2010-05-19 20:35 UTC (permalink / raw) To: lvm-devel On 05/19/10 16:19, Petr Rockai wrote: > Takahiro Yasui <tyasui@redhat.com> writes: > >>> +++ cvs-upstream/tools/lvconvert.c 2010-05-19 20:18:40.000000000 +0200 >>> @@ -1225,11 +1225,11 @@ >>> if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, >>> _is_partial_lv, NULL, 0)) >>> return 0; >> >> As I posted in the previous mail, the argument, new_log_count, is >> correct here? The argument is used to check if a log should be removed. >> I think that failed_log comes here instead of new_log_count. ... > ... now this was probably not being noticed, since in most cases, the > mistakenly removed log would be re-created right away by later > code. I'll fix this in my local patch. Yeah, that valuable name is very confusing. :-< Anyway thanks to fixing this. >> Or the value such as nlc in _lvconvert_mirrors_aux() comes. >> >> uint32_t nlc = (!new_log_count || lp->mirrors == 1) ? 1U : 0U; >> ... >> } else if (!lv_remove_mirrors(cmd, lv, nmc, nlc, >> is_mirror_image_removable, operable_pvs, 0)) > > Ugh, nlc sounds like short for new_log_count to me, which is *extremely > confusing*, since it's actually quite opposite: nlc = 1 means remove > log, nlc = 0 means retain the log. Is that right? Yes, it is. It is similarly used as the previous parameter, nmc. nmc tells the number of mirrors which will be removed. So as for nlc, it is easier to think nlc the log number which should be removed. (0: not removed, 1: removed) Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 18:31 ` Petr Rockai 2010-05-19 19:46 ` Takahiro Yasui @ 2010-05-20 0:49 ` Takahiro Yasui 2010-05-20 7:40 ` Petr Rockai 1 sibling, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-05-20 0:49 UTC (permalink / raw) To: lvm-devel Hi Petr, On 05/19/10 14:31, Petr Rockai wrote: > The attached patch is on top of the previous one -- it should apply > cleanly on top of current CVS + the lvconvert-repair-transient.diff I > have sent previously. It also fixes a bug in messages, where it would > claim that it failed to replace things even if it never tried to in the > first place. > > Please test and let me know if you run into any other issues. I tested your two patches and all my test cases for 'disk' log passed. As for 'mirrored' log, we need to check the state for log_lv. Here is the sample code. _lvconvert_mirrors_repair() ... lv_check_transient(lv); /* TODO check this in lib for all commands? */ - if (!(lv->status & PARTIAL_LV)) { + log_lv = first_seg(lv)->log_lv; + if (log_lv && log_lv->status & MIRRORED) + lv_check_transient(log_lv); + + if (!(lv->status & PARTIAL_LV) && !log_lv && + !(log_lv->status & PARTIAL_LV)) { ... new_log_count = old_log_count; - log_lv = first_seg(lv)->log_lv; if (log_lv) { Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-20 0:49 ` Takahiro Yasui @ 2010-05-20 7:40 ` Petr Rockai 2010-05-20 11:54 ` Takahiro Yasui 0 siblings, 1 reply; 21+ messages in thread From: Petr Rockai @ 2010-05-20 7:40 UTC (permalink / raw) To: lvm-devel Hi Taka! Takahiro Yasui <tyasui@redhat.com> writes: >> Please test and let me know if you run into any other issues. > > I tested your two patches and all my test cases for 'disk' log > passed. As for 'mirrored' log, we need to check the state for > log_lv. Here is the sample code. So, do you think it is reasonable to check this in in its current form, and work on --repair support for mirrored logs in a separate patch? I think there are currently other things broken wrt. mirrored log and --repair, other than just the transient support, and I'll have to sit down and write tests first to see what works and what doesn't and then fix things up. Yours, Petr. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-20 7:40 ` Petr Rockai @ 2010-05-20 11:54 ` Takahiro Yasui 0 siblings, 0 replies; 21+ messages in thread From: Takahiro Yasui @ 2010-05-20 11:54 UTC (permalink / raw) To: lvm-devel Hi Petr, Petr Rockai wrote: > Takahiro Yasui <tyasui@redhat.com> writes: >>> Please test and let me know if you run into any other issues. >> I tested your two patches and all my test cases for 'disk' log >> passed. As for 'mirrored' log, we need to check the state for >> log_lv. Here is the sample code. > > So, do you think it is reasonable to check this in in its current form, > and work on --repair support for mirrored logs in a separate patch? That sounds reasonable to me. > I think there are currently other things broken wrt. mirrored log and > --repair, other than just the transient support, and I'll have to sit > down and write tests first to see what works and what doesn't and then > fix things up. I will review & test a patch is ready. Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 12:06 ` Petr Rockai 2010-05-19 15:06 ` Takahiro Yasui @ 2010-05-19 16:44 ` Takahiro Yasui 2010-05-19 17:15 ` Takahiro Yasui 1 sibling, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-05-19 16:44 UTC (permalink / raw) To: lvm-devel Hi Petr, On 05/19/10 08:06, Petr Rockai wrote: > Takahiro Yasui <tyasui@redhat.com> writes: >> On 05/14/10 18:52, Takahiro Yasui wrote: >>> I also tested this patch for a lvm mirror with core/disk log. When >>> a mirror log failed, the mirror log was removed from a mirror volume, >>> but a log voluem is not removed from its volume group. This always >>> happens both on a transient and persistent error. >> >> This issue seems related to this part of your patch. >> >> @@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str >> ... >> - /* >> - * Remove all failed_pvs >> - */ >> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs, >> - lp->mirrors, new_log_count)) >> - return 0; >> + if (failed_mirrors) { >> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, >> + _is_partial_lv, NULL, 0)) >> + return 0; >> + >> + if (!_reload_lv(cmd, lv)) >> + return 0; >> + } >> >> When I removed this modification, a log volume was removed as expected. >> And also other my test cases also passed. > > The catch is that this won't work correctly in other cases, especially > with transient errors. I suspect the real problem is in not calling > _lv_update_log_type in the new code path -- but see below: I cannot > reliably fix this without having a reproducer. Also, I would very much > like to have the tests you had failing on our regression suite, to avoid > similar problem in the future. I checked this code, but the value of failed_mirrors is '0' when no mirror leg has error. So this code path was not executed in my test. This is the reason why log was not removed with mirror_log_fault_policy = "remove". mirror_log_fault_policy = "allocate," on the other hand, the log was removed in the following path. while (replace_mirrors || replace_log) { ... if (_lvconvert_mirrors_aux(cmd, lv, lp, NULL, lp->mirrors, log_count)) break; We need to change the condition so that lv_remove_mirrors() is executed. >> - * Remove all failed_pvs >> - */ >> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs, >> - lp->mirrors, new_log_count)) >> - return 0; >> + if (failed_mirrors) { Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 16:44 ` Takahiro Yasui @ 2010-05-19 17:15 ` Takahiro Yasui 2010-05-19 20:26 ` Petr Rockai 0 siblings, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-05-19 17:15 UTC (permalink / raw) To: lvm-devel Hi Peter, >>> + if (failed_mirrors) { >>> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, >>> + _is_partial_lv, NULL, 0)) + if (failed_mirrors || failed_log) { + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, failed_log, + _is_partial_lv, NULL, 0)) { My test cases passed with this modification. Thanks, Taka On 05/19/10 12:44, Takahiro Yasui wrote: > Hi Petr, > > On 05/19/10 08:06, Petr Rockai wrote: >> Takahiro Yasui <tyasui@redhat.com> writes: >>> On 05/14/10 18:52, Takahiro Yasui wrote: >>>> I also tested this patch for a lvm mirror with core/disk log. When >>>> a mirror log failed, the mirror log was removed from a mirror volume, >>>> but a log voluem is not removed from its volume group. This always >>>> happens both on a transient and persistent error. >>> >>> This issue seems related to this part of your patch. >>> >>> @@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str >>> ... >>> - /* >>> - * Remove all failed_pvs >>> - */ >>> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs, >>> - lp->mirrors, new_log_count)) >>> - return 0; >>> + if (failed_mirrors) { >>> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, >>> + _is_partial_lv, NULL, 0)) >>> + return 0; >>> + >>> + if (!_reload_lv(cmd, lv)) >>> + return 0; >>> + } >>> >>> When I removed this modification, a log volume was removed as expected. >>> And also other my test cases also passed. >> >> The catch is that this won't work correctly in other cases, especially >> with transient errors. I suspect the real problem is in not calling >> _lv_update_log_type in the new code path -- but see below: I cannot >> reliably fix this without having a reproducer. Also, I would very much >> like to have the tests you had failing on our regression suite, to avoid >> similar problem in the future. > > I checked this code, but the value of failed_mirrors is '0' when > no mirror leg has error. So this code path was not executed in > my test. This is the reason why log was not removed with > mirror_log_fault_policy = "remove". > > mirror_log_fault_policy = "allocate," on the other hand, the log > was removed in the following path. > > while (replace_mirrors || replace_log) { > ... > if (_lvconvert_mirrors_aux(cmd, lv, lp, NULL, > lp->mirrors, log_count)) > break; > > We need to change the condition so that lv_remove_mirrors() is > executed. > >>> - * Remove all failed_pvs >>> - */ >>> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs, >>> - lp->mirrors, new_log_count)) >>> - return 0; >>> + if (failed_mirrors) { > > Thanks, > Taka > > -- > lvm-devel mailing list > lvm-devel at redhat.com > https://www.redhat.com/mailman/listinfo/lvm-devel ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-19 17:15 ` Takahiro Yasui @ 2010-05-19 20:26 ` Petr Rockai 0 siblings, 0 replies; 21+ messages in thread From: Petr Rockai @ 2010-05-19 20:26 UTC (permalink / raw) To: lvm-devel Hi Taka, Takahiro Yasui <tyasui@redhat.com> writes: >>>> + if (failed_mirrors) { >>>> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count, >>>> + _is_partial_lv, NULL, 0)) > > + if (failed_mirrors || failed_log) { > + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, failed_log, > + _is_partial_lv, NULL, 0)) { > if (failed_mirrors || failed_log) is actually the same (well, should be) as if (1) at this point... Anyway, if you have seen my previous mail, the thing that actually is required to be called every time is _reload_lv, which does metadata writing. The lv_remove_mirrors call is only needed if failed_mirrors is true, since _lv_update_log_type should have already dealt with log. If failed_log is true, then new_log_count is 0 for disklog mirrors in the _lv_update_log_type call. This is all still broken for mirrored logs, but this is not completely easy to fix, so I'll defer that for now. Yours, Petr. PS: failed_mirrors would probably be better called failed_images... ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-04-27 15:14 [PATCH] handle transient errors in lvconvert --repair Petr Rockai 2010-04-27 18:03 ` Petr Rockai @ 2010-04-28 19:37 ` Takahiro Yasui 2010-05-05 7:46 ` Petr Rockai 1 sibling, 1 reply; 21+ messages in thread From: Takahiro Yasui @ 2010-04-28 19:37 UTC (permalink / raw) To: lvm-devel On 04/27/10 11:14, Petr Rockai wrote: > this is a first iteration of the transient-error handling in > lvconvert. This works by marking LVs as partial even though there are no > PVs missing -- whenever a mirror marks a leg or log as failed (in the > status string). The downside of this approach is that we never figure > which PV is failing, but that would require kernel-level support for IO > error tracking. Hello Petr, It is very grad that the patch to fix this issue was being developed. I have not thoroughly read it, but I found one issue. It doesn't work for a "mirrored" log, while it looks good for a "core" or "disk" log. A logical mirror device with a mirrored log has a 'layered' structure for its log. Therefore, we need to check the status of its log, too. As for the following example, we need to check the status string of vg00-lv00_mlog. # dmsetup ls --tree -o ascii vg00-lv00 (253:5) |-vg00-lv00_mimage_1 (253:4) | `- (8:48) |-vg00-lv00_mimage_0 (253:3) | `- (8:32) `-vg00-lv00_mlog (253:2) |-vg00-lv00_mlog_mimage_1 (253:1) | `- (8:80) `-vg00-lv00_mlog_mimage_0 (253:0) `- (8:64) # dmsetup status ... vg00-lv00_mlog: 0 8192 mirror 2 253:0 253:1 8/8 1 AA 1 core vg00-lv00: 0 24576 mirror 2 253:3 253:4 24/24 1 AA 3 disk 253:2 A The current code calls lv_check_transient() at the beginning of _lvconvert_mirrors_repair() but I think that lv_check_transient() needs to be called in functions which repairs a mirror log, such as _lv_update_log_type(), too. Also this patch works well not only for a transient error but also a medium error, therefore, function names with '_transient' could be '_status' or '_status_string'. (e.g. lv_check_status()) I appreciate your work on this issue. Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-04-28 19:37 ` Takahiro Yasui @ 2010-05-05 7:46 ` Petr Rockai 2010-05-05 15:31 ` Takahiro Yasui 0 siblings, 1 reply; 21+ messages in thread From: Petr Rockai @ 2010-05-05 7:46 UTC (permalink / raw) To: lvm-devel Hi Taka, Takahiro Yasui <tyasui@redhat.com> writes: > It is very grad that the patch to fix this issue was being developed. > I have not thoroughly read it, but I found one issue. It doesn't work > for a "mirrored" log, while it looks good for a "core" or "disk" log. [snip] this is not really an issue with the new code per se. What needs to be done is that lvconvert --repair itself needs to be changed to handle mirrored logs correctly. What needs to be done is to call the repair recursively on the log device if it is a mirror. > The current code calls lv_check_transient() at the beginning of > _lvconvert_mirrors_repair() but I think that lv_check_transient() > needs to be called in functions which repairs a mirror log, > such as _lv_update_log_type(), too. (see above) > Also this patch works well not only for a transient error but > also a medium error, therefore, function names with '_transient' > could be '_status' or '_status_string'. (e.g. lv_check_status()) I chose transient not because of "transient error", but because of the relation to the transient, in-kernel, status of the device -- as opposed to persistent, i.e. in metadata. Maybe something with "active" or "activation" in it would be better though... And, thanks for looking through the patch. Yours, Petr. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair 2010-05-05 7:46 ` Petr Rockai @ 2010-05-05 15:31 ` Takahiro Yasui 0 siblings, 0 replies; 21+ messages in thread From: Takahiro Yasui @ 2010-05-05 15:31 UTC (permalink / raw) To: lvm-devel Hi Petr, On 05/05/10 03:46, Petr Rockai wrote: > Takahiro Yasui <tyasui@redhat.com> writes: >> It is very grad that the patch to fix this issue was being developed. >> I have not thoroughly read it, but I found one issue. It doesn't work >> for a "mirrored" log, while it looks good for a "core" or "disk" log. > [snip] > > this is not really an issue with the new code per se. What needs to be > done is that lvconvert --repair itself needs to be changed to handle > mirrored logs correctly. > > What needs to be done is to call the repair recursively on the log > device if it is a mirror. I understand. I will wait and test your updated patch. >> The current code calls lv_check_transient() at the beginning of >> _lvconvert_mirrors_repair() but I think that lv_check_transient() >> needs to be called in functions which repairs a mirror log, >> such as _lv_update_log_type(), too. > (see above) > >> Also this patch works well not only for a transient error but >> also a medium error, therefore, function names with '_transient' >> could be '_status' or '_status_string'. (e.g. lv_check_status()) > I chose transient not because of "transient error", but because of the > relation to the transient, in-kernel, status of the device -- as opposed > to persistent, i.e. in metadata. Maybe something with "active" or > "activation" in it would be better though... I see your point. I just thought 'a medium error' is persistent and commented, but I will follow your decision. Thanks, Taka ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2010-05-20 11:54 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-27 15:14 [PATCH] handle transient errors in lvconvert --repair Petr Rockai 2010-04-27 18:03 ` Petr Rockai 2010-05-06 17:30 ` Petr Rockai 2010-05-14 22:52 ` Takahiro Yasui 2010-05-17 23:01 ` Takahiro Yasui 2010-05-19 12:06 ` Petr Rockai 2010-05-19 15:06 ` Takahiro Yasui 2010-05-19 18:31 ` Petr Rockai 2010-05-19 19:46 ` Takahiro Yasui 2010-05-19 20:19 ` Petr Rockai 2010-05-19 20:31 ` Alasdair G Kergon 2010-05-19 20:35 ` Takahiro Yasui 2010-05-20 0:49 ` Takahiro Yasui 2010-05-20 7:40 ` Petr Rockai 2010-05-20 11:54 ` Takahiro Yasui 2010-05-19 16:44 ` Takahiro Yasui 2010-05-19 17:15 ` Takahiro Yasui 2010-05-19 20:26 ` Petr Rockai 2010-04-28 19:37 ` Takahiro Yasui 2010-05-05 7:46 ` Petr Rockai 2010-05-05 15:31 ` Takahiro Yasui
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.