* [PATCH] handle transient errors in lvconvert --repair
@ 2010-04-27 15:14 Petr Rockai
2010-04-27 18:03 ` Petr Rockai
2010-04-28 19:37 ` Takahiro Yasui
0 siblings, 2 replies; 21+ messages in thread
From: Petr Rockai @ 2010-04-27 15:14 UTC (permalink / raw)
To: lvm-devel
Hi,
this is a first iteration of the transient-error handling in
lvconvert. This works by marking LVs as partial even though there are no
PVs missing -- whenever a mirror marks a leg or log as failed (in the
status string). The downside of this approach is that we never figure
which PV is failing, but that would require kernel-level support for IO
error tracking.
Yours,
Petr.
PS: The test is rather rudimentary. On one hand, testing log failures
seems impossible because that runs into kernel lockups. I will add more
test scenarios for leg failures later, and also for leg replacement.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lvconvert-repair-transient.diff
Type: text/x-diff
Size: 24993 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100427/a839fbb6/attachment.bin>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-04-27 15:14 [PATCH] handle transient errors in lvconvert --repair Petr Rockai
@ 2010-04-27 18:03 ` Petr Rockai
2010-05-06 17:30 ` Petr Rockai
2010-04-28 19:37 ` Takahiro Yasui
1 sibling, 1 reply; 21+ messages in thread
From: Petr Rockai @ 2010-04-27 18:03 UTC (permalink / raw)
To: lvm-devel
A couple issues slipped into last patch, this is a fixed version (no
regex debris, fixed testcase).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lvconvert-repair-transient.diff
Type: text/x-diff
Size: 22931 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100427/55914445/attachment.bin>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-04-27 15:14 [PATCH] handle transient errors in lvconvert --repair Petr Rockai
2010-04-27 18:03 ` Petr Rockai
@ 2010-04-28 19:37 ` Takahiro Yasui
2010-05-05 7:46 ` Petr Rockai
1 sibling, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-04-28 19:37 UTC (permalink / raw)
To: lvm-devel
On 04/27/10 11:14, Petr Rockai wrote:
> this is a first iteration of the transient-error handling in
> lvconvert. This works by marking LVs as partial even though there are no
> PVs missing -- whenever a mirror marks a leg or log as failed (in the
> status string). The downside of this approach is that we never figure
> which PV is failing, but that would require kernel-level support for IO
> error tracking.
Hello Petr,
It is very grad that the patch to fix this issue was being developed.
I have not thoroughly read it, but I found one issue. It doesn't work
for a "mirrored" log, while it looks good for a "core" or "disk" log.
A logical mirror device with a mirrored log has a 'layered' structure
for its log. Therefore, we need to check the status of its log, too.
As for the following example, we need to check the status string of
vg00-lv00_mlog.
# dmsetup ls --tree -o ascii
vg00-lv00 (253:5)
|-vg00-lv00_mimage_1 (253:4)
| `- (8:48)
|-vg00-lv00_mimage_0 (253:3)
| `- (8:32)
`-vg00-lv00_mlog (253:2)
|-vg00-lv00_mlog_mimage_1 (253:1)
| `- (8:80)
`-vg00-lv00_mlog_mimage_0 (253:0)
`- (8:64)
# dmsetup status
...
vg00-lv00_mlog: 0 8192 mirror 2 253:0 253:1 8/8 1 AA 1 core
vg00-lv00: 0 24576 mirror 2 253:3 253:4 24/24 1 AA 3 disk 253:2 A
The current code calls lv_check_transient() at the beginning of
_lvconvert_mirrors_repair() but I think that lv_check_transient()
needs to be called in functions which repairs a mirror log,
such as _lv_update_log_type(), too.
Also this patch works well not only for a transient error but
also a medium error, therefore, function names with '_transient'
could be '_status' or '_status_string'. (e.g. lv_check_status())
I appreciate your work on this issue.
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-04-28 19:37 ` Takahiro Yasui
@ 2010-05-05 7:46 ` Petr Rockai
2010-05-05 15:31 ` Takahiro Yasui
0 siblings, 1 reply; 21+ messages in thread
From: Petr Rockai @ 2010-05-05 7:46 UTC (permalink / raw)
To: lvm-devel
Hi Taka,
Takahiro Yasui <tyasui@redhat.com> writes:
> It is very grad that the patch to fix this issue was being developed.
> I have not thoroughly read it, but I found one issue. It doesn't work
> for a "mirrored" log, while it looks good for a "core" or "disk" log.
[snip]
this is not really an issue with the new code per se. What needs to be
done is that lvconvert --repair itself needs to be changed to handle
mirrored logs correctly.
What needs to be done is to call the repair recursively on the log
device if it is a mirror.
> The current code calls lv_check_transient() at the beginning of
> _lvconvert_mirrors_repair() but I think that lv_check_transient()
> needs to be called in functions which repairs a mirror log,
> such as _lv_update_log_type(), too.
(see above)
> Also this patch works well not only for a transient error but
> also a medium error, therefore, function names with '_transient'
> could be '_status' or '_status_string'. (e.g. lv_check_status())
I chose transient not because of "transient error", but because of the
relation to the transient, in-kernel, status of the device -- as opposed
to persistent, i.e. in metadata. Maybe something with "active" or
"activation" in it would be better though...
And, thanks for looking through the patch.
Yours,
Petr.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-05 7:46 ` Petr Rockai
@ 2010-05-05 15:31 ` Takahiro Yasui
0 siblings, 0 replies; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-05 15:31 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
On 05/05/10 03:46, Petr Rockai wrote:
> Takahiro Yasui <tyasui@redhat.com> writes:
>> It is very grad that the patch to fix this issue was being developed.
>> I have not thoroughly read it, but I found one issue. It doesn't work
>> for a "mirrored" log, while it looks good for a "core" or "disk" log.
> [snip]
>
> this is not really an issue with the new code per se. What needs to be
> done is that lvconvert --repair itself needs to be changed to handle
> mirrored logs correctly.
>
> What needs to be done is to call the repair recursively on the log
> device if it is a mirror.
I understand. I will wait and test your updated patch.
>> The current code calls lv_check_transient() at the beginning of
>> _lvconvert_mirrors_repair() but I think that lv_check_transient()
>> needs to be called in functions which repairs a mirror log,
>> such as _lv_update_log_type(), too.
> (see above)
>
>> Also this patch works well not only for a transient error but
>> also a medium error, therefore, function names with '_transient'
>> could be '_status' or '_status_string'. (e.g. lv_check_status())
> I chose transient not because of "transient error", but because of the
> relation to the transient, in-kernel, status of the device -- as opposed
> to persistent, i.e. in metadata. Maybe something with "active" or
> "activation" in it would be better though...
I see your point. I just thought 'a medium error' is persistent
and commented, but I will follow your decision.
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-04-27 18:03 ` Petr Rockai
@ 2010-05-06 17:30 ` Petr Rockai
2010-05-14 22:52 ` Takahiro Yasui
0 siblings, 1 reply; 21+ messages in thread
From: Petr Rockai @ 2010-05-06 17:30 UTC (permalink / raw)
To: lvm-devel
Hi,
I have rebased the patch against current CVS. Also, I have added a guard
to _mirrored_transient_status to ensure that we alloca() only very
modest (and, more importantly, bounded) amount of memory.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lvconvert-repair-transient.diff
Type: text/x-diff
Size: 24041 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100506/a5c0eb27/attachment.bin>
-------------- next part --------------
Yours,
Petr.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-06 17:30 ` Petr Rockai
@ 2010-05-14 22:52 ` Takahiro Yasui
2010-05-17 23:01 ` Takahiro Yasui
0 siblings, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-14 22:52 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
> I have rebased the patch against current CVS. Also, I have added a guard
> to _mirrored_transient_status to ensure that we alloca() only very
> modest (and, more importantly, bounded) amount of memory.
This patch is just a rebase against current CVS, so it doesn't support
mirrored log, yet. Do you have a plan to support mirrored log?
I also tested this patch for a lvm mirror with core/disk log. When
a mirror log failed, the mirror log was removed from a mirror volume,
but a log voluem is not removed from its volume group. This always
happens both on a transient and persistent error.
Here is an output of lvs command in my testing environment. lv00_mlog
should be removed from vg00.
# dmsetup ls
vg00-lv00_mimage_1 (253, 2)
vg00-lv00_mimage_0 (253, 1)
vg00-lv00 (253, 3)
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
lv00 vg00 mwi-a- 12.00M 100.00
lv00_mlog vg00 -wi--- 4.00M
I will report if I get the reason of this.
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-14 22:52 ` Takahiro Yasui
@ 2010-05-17 23:01 ` Takahiro Yasui
2010-05-19 12:06 ` Petr Rockai
0 siblings, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-17 23:01 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
On 05/14/10 18:52, Takahiro Yasui wrote:
>> I have rebased the patch against current CVS. Also, I have added a guard
>> to _mirrored_transient_status to ensure that we alloca() only very
>> modest (and, more importantly, bounded) amount of memory.
>
> This patch is just a rebase against current CVS, so it doesn't support
> mirrored log, yet. Do you have a plan to support mirrored log?
>
> I also tested this patch for a lvm mirror with core/disk log. When
> a mirror log failed, the mirror log was removed from a mirror volume,
> but a log voluem is not removed from its volume group. This always
> happens both on a transient and persistent error.
This issue seems related to this part of your patch.
@@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str
...
- /*
- * Remove all failed_pvs
- */
- if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs,
- lp->mirrors, new_log_count))
- return 0;
+ if (failed_mirrors) {
+ if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
+ _is_partial_lv, NULL, 0))
+ return 0;
+
+ if (!_reload_lv(cmd, lv))
+ return 0;
+ }
When I removed this modification, a log volume was removed as expected.
And also other my test cases also passed.
I hope this information would be helpful.
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-17 23:01 ` Takahiro Yasui
@ 2010-05-19 12:06 ` Petr Rockai
2010-05-19 15:06 ` Takahiro Yasui
2010-05-19 16:44 ` Takahiro Yasui
0 siblings, 2 replies; 21+ messages in thread
From: Petr Rockai @ 2010-05-19 12:06 UTC (permalink / raw)
To: lvm-devel
Hi Taka,
Takahiro Yasui <tyasui@redhat.com> writes:
> On 05/14/10 18:52, Takahiro Yasui wrote:
>> I also tested this patch for a lvm mirror with core/disk log. When
>> a mirror log failed, the mirror log was removed from a mirror volume,
>> but a log voluem is not removed from its volume group. This always
>> happens both on a transient and persistent error.
>
> This issue seems related to this part of your patch.
>
> @@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str
> ...
> - /*
> - * Remove all failed_pvs
> - */
> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs,
> - lp->mirrors, new_log_count))
> - return 0;
> + if (failed_mirrors) {
> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
> + _is_partial_lv, NULL, 0))
> + return 0;
> +
> + if (!_reload_lv(cmd, lv))
> + return 0;
> + }
>
> When I removed this modification, a log volume was removed as expected.
> And also other my test cases also passed.
The catch is that this won't work correctly in other cases, especially
with transient errors. I suspect the real problem is in not calling
_lv_update_log_type in the new code path -- but see below: I cannot
reliably fix this without having a reproducer. Also, I would very much
like to have the tests you had failing on our regression suite, to avoid
similar problem in the future.
> I hope this information would be helpful.
Yes, it is indeed quite helpful.
Unfortunately, I still cannot reproduce the problem -- I have written a
few testcases that only fail the log, or fail a log and some other
things and I can't seem to trigger the bug. I have tried with both
normal and cluster locking.
It would be very useful if you could provide more specific instructions
on how to trigger this.
I have tried these, all of them remove the log as expected, both with
and without my patch (at least for me):
aux prepare_vg 5
lvcreate -m 2 --ig -L 1 -n 3way $vg
disable_dev $dev1 $dev2
echo n | lvconvert --repair $vg/3way
check linear $vg 3way
lvs -a -o +devices | not grep unknown
lvs -a -o +devices | not grep mlog
dmsetup ls | not grep mlog
vgreduce --removemissing $vg
enable_dev $dev1 $dev2
check linear $vg 3way
aux prepare_vg 5
lvcreate -m 2 --ig -L 1 -n 3way $vg $dev1 $dev2 $dev3 $dev4:0
disable_dev $dev4
echo n | lvconvert --repair $vg/3way
check mirror $vg 3way core
lvs -a -o +devices | not grep unknown
lvs -a -o +devices | not grep mlog
dmsetup ls | not grep mlog
vgreduce --removemissing $vg
enable_dev $dev4
aux prepare_vg 5
lvcreate -m 1 --ig -L 1 -n 2way $vg $dev1 $dev2 $dev3:0
disable_dev $dev3
echo n | lvconvert --repair $vg/2way
check mirror $vg 2way core
lvs -a -o +devices | not grep unknown
lvs -a -o +devices | not grep mlog
vgreduce --removemissing $vg
enable_dev $dev3
Some further analysis:
During a call to lv_remove_mirrors above, we call through to
_remove_mirror_images, with remove_log = 1. We have this:
... if (remove_log)
detached_log_lv = detach_mirror_log(mirrored_seg);
...
if (detached_log_lv && !_delete_lv(lv, detached_log_lv))
return_0;
So the log *should* be gone after this is finished. Since you see the
log hanging around, I suspect that this code has some bugs (this part of
the code is known to be problematic, unfortunately). Apart from actual
steps to reproduce the problem, the output from lvconvert doing the
repair would be helpful. It should be printing things like "Mirror
status" and "Mirror log status", please paste these.
Thanks!
Yours,
Petr.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 12:06 ` Petr Rockai
@ 2010-05-19 15:06 ` Takahiro Yasui
2010-05-19 18:31 ` Petr Rockai
2010-05-19 16:44 ` Takahiro Yasui
1 sibling, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-19 15:06 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
On 05/19/10 08:06, Petr Rockai wrote:
> Takahiro Yasui <tyasui@redhat.com> writes:
> The catch is that this won't work correctly in other cases, especially
> with transient errors. I suspect the real problem is in not calling
> _lv_update_log_type in the new code path -- but see below: I cannot
> reliably fix this without having a reproducer. Also, I would very much
> like to have the tests you had failing on our regression suite, to avoid
> similar problem in the future.
> ...
> Unfortunately, I still cannot reproduce the problem -- I have written a
> few testcases that only fail the log, or fail a log and some other
> things and I can't seem to trigger the bug. I have tried with both
> normal and cluster locking.
>
> It would be very useful if you could provide more specific instructions
> on how to trigger this.
Here is the instruction. I used 2.02.65 but I also reproduced it using
2.02.66, too.
0. environment
# lvm version
LVM version: 2.02.66(2)-cvs (2010-05-17)
Library version: 1.02.49-cvs (2010-05-17)
Driver version: 4.16.0
# grep mirror_log_fault_policy /etc/lvm/lvm.conf
# 'mirror_image_fault_policy' and 'mirror_log_fault_policy' define
mirror_log_fault_policy = "remove"
1. create vg and lv
# vgcreate vg00 /dev/sd[c-e]; lvcreate --ig -m1 -L12m -nlv00 vg00
Volume group "vg00" successfully created
Logical volume "lv00" created
2. disable log device (/dev/sde in my environment)
# echo offline > /sys/block/sde/device/state
3. run 'lvconvert --repair'
# lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg00/lv00
Mirrored transient status: "2 253:1 253:2 24/24 1 AA 3 disk 253:0 D"
Mirror log status: 1 of 1 images failed - switching to core
WARNING: Failed to replace 1 of 1 logs in volume lv00
4. check logical volumes
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
lv00 vg00 mwi-a- 12.00M 100.00
lv00_mlog vg00 -wi--- 4.00M
> aux prepare_vg 5
> lvcreate -m 1 --ig -L 1 -n 2way $vg $dev1 $dev2 $dev3:0
> disable_dev $dev3
> echo n | lvconvert --repair $vg/2way
> check mirror $vg 2way core
> lvs -a -o +devices | not grep unknown
> lvs -a -o +devices | not grep mlog
> vgreduce --removemissing $vg
> enable_dev $dev3
This issue didn't occurred with your test case in my environment, either.
So, the differences in our test cases seems 'policy.' I used the same
options for lvconvert as ones in dmeventd.
Thanks,
Taka
> During a call to lv_remove_mirrors above, we call through to
> _remove_mirror_images, with remove_log = 1. We have this:
>
> ... if (remove_log)
> detached_log_lv = detach_mirror_log(mirrored_seg);
>
> ...
>
> if (detached_log_lv && !_delete_lv(lv, detached_log_lv))
> return_0;
>
> So the log *should* be gone after this is finished. Since you see the
> log hanging around, I suspect that this code has some bugs (this part of
> the code is known to be problematic, unfortunately). Apart from actual
> steps to reproduce the problem, the output from lvconvert doing the
> repair would be helpful. It should be printing things like "Mirror
> status" and "Mirror log status", please paste these.
Yes, see step 4.
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 12:06 ` Petr Rockai
2010-05-19 15:06 ` Takahiro Yasui
@ 2010-05-19 16:44 ` Takahiro Yasui
2010-05-19 17:15 ` Takahiro Yasui
1 sibling, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-19 16:44 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
On 05/19/10 08:06, Petr Rockai wrote:
> Takahiro Yasui <tyasui@redhat.com> writes:
>> On 05/14/10 18:52, Takahiro Yasui wrote:
>>> I also tested this patch for a lvm mirror with core/disk log. When
>>> a mirror log failed, the mirror log was removed from a mirror volume,
>>> but a log voluem is not removed from its volume group. This always
>>> happens both on a transient and persistent error.
>>
>> This issue seems related to this part of your patch.
>>
>> @@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str
>> ...
>> - /*
>> - * Remove all failed_pvs
>> - */
>> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs,
>> - lp->mirrors, new_log_count))
>> - return 0;
>> + if (failed_mirrors) {
>> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
>> + _is_partial_lv, NULL, 0))
>> + return 0;
>> +
>> + if (!_reload_lv(cmd, lv))
>> + return 0;
>> + }
>>
>> When I removed this modification, a log volume was removed as expected.
>> And also other my test cases also passed.
>
> The catch is that this won't work correctly in other cases, especially
> with transient errors. I suspect the real problem is in not calling
> _lv_update_log_type in the new code path -- but see below: I cannot
> reliably fix this without having a reproducer. Also, I would very much
> like to have the tests you had failing on our regression suite, to avoid
> similar problem in the future.
I checked this code, but the value of failed_mirrors is '0' when
no mirror leg has error. So this code path was not executed in
my test. This is the reason why log was not removed with
mirror_log_fault_policy = "remove".
mirror_log_fault_policy = "allocate," on the other hand, the log
was removed in the following path.
while (replace_mirrors || replace_log) {
...
if (_lvconvert_mirrors_aux(cmd, lv, lp, NULL,
lp->mirrors, log_count))
break;
We need to change the condition so that lv_remove_mirrors() is
executed.
>> - * Remove all failed_pvs
>> - */
>> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs,
>> - lp->mirrors, new_log_count))
>> - return 0;
>> + if (failed_mirrors) {
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 16:44 ` Takahiro Yasui
@ 2010-05-19 17:15 ` Takahiro Yasui
2010-05-19 20:26 ` Petr Rockai
0 siblings, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-19 17:15 UTC (permalink / raw)
To: lvm-devel
Hi Peter,
>>> + if (failed_mirrors) {
>>> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
>>> + _is_partial_lv, NULL, 0))
+ if (failed_mirrors || failed_log) {
+ if (!lv_remove_mirrors(cmd, lv, failed_mirrors, failed_log,
+ _is_partial_lv, NULL, 0)) {
My test cases passed with this modification.
Thanks,
Taka
On 05/19/10 12:44, Takahiro Yasui wrote:
> Hi Petr,
>
> On 05/19/10 08:06, Petr Rockai wrote:
>> Takahiro Yasui <tyasui@redhat.com> writes:
>>> On 05/14/10 18:52, Takahiro Yasui wrote:
>>>> I also tested this patch for a lvm mirror with core/disk log. When
>>>> a mirror log failed, the mirror log was removed from a mirror volume,
>>>> but a log voluem is not removed from its volume group. This always
>>>> happens both on a transient and persistent error.
>>>
>>> This issue seems related to this part of your patch.
>>>
>>> @@ -1139,6 +1163,8 @@ static int _lvconvert_mirrors_repair(str
>>> ...
>>> - /*
>>> - * Remove all failed_pvs
>>> - */
>>> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs,
>>> - lp->mirrors, new_log_count))
>>> - return 0;
>>> + if (failed_mirrors) {
>>> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
>>> + _is_partial_lv, NULL, 0))
>>> + return 0;
>>> +
>>> + if (!_reload_lv(cmd, lv))
>>> + return 0;
>>> + }
>>>
>>> When I removed this modification, a log volume was removed as expected.
>>> And also other my test cases also passed.
>>
>> The catch is that this won't work correctly in other cases, especially
>> with transient errors. I suspect the real problem is in not calling
>> _lv_update_log_type in the new code path -- but see below: I cannot
>> reliably fix this without having a reproducer. Also, I would very much
>> like to have the tests you had failing on our regression suite, to avoid
>> similar problem in the future.
>
> I checked this code, but the value of failed_mirrors is '0' when
> no mirror leg has error. So this code path was not executed in
> my test. This is the reason why log was not removed with
> mirror_log_fault_policy = "remove".
>
> mirror_log_fault_policy = "allocate," on the other hand, the log
> was removed in the following path.
>
> while (replace_mirrors || replace_log) {
> ...
> if (_lvconvert_mirrors_aux(cmd, lv, lp, NULL,
> lp->mirrors, log_count))
> break;
>
> We need to change the condition so that lv_remove_mirrors() is
> executed.
>
>>> - * Remove all failed_pvs
>>> - */
>>> - if (!_lvconvert_mirrors_aux(cmd, lv, lp, failed_pvs,
>>> - lp->mirrors, new_log_count))
>>> - return 0;
>>> + if (failed_mirrors) {
>
> Thanks,
> Taka
>
> --
> lvm-devel mailing list
> lvm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/lvm-devel
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 15:06 ` Takahiro Yasui
@ 2010-05-19 18:31 ` Petr Rockai
2010-05-19 19:46 ` Takahiro Yasui
2010-05-20 0:49 ` Takahiro Yasui
0 siblings, 2 replies; 21+ messages in thread
From: Petr Rockai @ 2010-05-19 18:31 UTC (permalink / raw)
To: lvm-devel
Hi again,
Takahiro Yasui <tyasui@redhat.com> writes:
> 3. run 'lvconvert --repair'
>
> # lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg00/lv00
> Mirrored transient status: "2 253:1 253:2 24/24 1 AA 3 disk 253:0 D"
> Mirror log status: 1 of 1 images failed - switching to core
> WARNING: Failed to replace 1 of 1 logs in volume lv00
>
> 4. check logical volumes
>
> # lvs
> LV VG Attr LSize Origin Snap% Move Log Copy% Convert
> lv00 vg00 mwi-a- 12.00M 100.00
> lv00_mlog vg00 -wi--- 4.00M
I have tracked the problem down to a silly branching bug in
lvconvert.c. The LV was only being written to disk if mirrors changed,
or log re-allocation was attempted. I have fixed that, and also added
both mine and your testcases to the suite.
The attached patch is on top of the previous one -- it should apply
cleanly on top of current CVS + the lvconvert-repair-transient.diff I
have sent previously. It also fixes a bug in messages, where it would
claim that it failed to replace things even if it never tried to in the
first place.
Please test and let me know if you run into any other issues.
Thanks,
Petr.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lvconvert-repair-transient-inter.diff
Type: text/x-diff
Size: 5799 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/lvm-devel/attachments/20100519/e6a38379/attachment.bin>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 18:31 ` Petr Rockai
@ 2010-05-19 19:46 ` Takahiro Yasui
2010-05-19 20:19 ` Petr Rockai
2010-05-20 0:49 ` Takahiro Yasui
1 sibling, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-19 19:46 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
>+++ cvs-upstream/tools/lvconvert.c 2010-05-19 20:18:40.000000000 +0200
>@@ -1225,11 +1225,11 @@
> if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
> _is_partial_lv, NULL, 0))
> return 0;
As I posted in the previous mail, the argument, new_log_count, is
correct here? The argument is used to check if a log should be removed.
I think that failed_log comes here instead of new_log_count.
Or the value such as nlc in _lvconvert_mirrors_aux() comes.
uint32_t nlc = (!new_log_count || lp->mirrors == 1) ? 1U : 0U;
...
} else if (!lv_remove_mirrors(cmd, lv, nmc, nlc,
is_mirror_image_removable, operable_pvs, 0))
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 19:46 ` Takahiro Yasui
@ 2010-05-19 20:19 ` Petr Rockai
2010-05-19 20:31 ` Alasdair G Kergon
2010-05-19 20:35 ` Takahiro Yasui
0 siblings, 2 replies; 21+ messages in thread
From: Petr Rockai @ 2010-05-19 20:19 UTC (permalink / raw)
To: lvm-devel
Takahiro Yasui <tyasui@redhat.com> writes:
>>+++ cvs-upstream/tools/lvconvert.c 2010-05-19 20:18:40.000000000 +0200
>>@@ -1225,11 +1225,11 @@
>> if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
>> _is_partial_lv, NULL, 0))
>> return 0;
>
> As I posted in the previous mail, the argument, new_log_count, is
> correct here? The argument is used to check if a log should be removed.
> I think that failed_log comes here instead of new_log_count.
Oh *jeez*.
int lv_remove_mirrors(struct cmd_context *cmd __attribute((unused)),
struct logical_volume *lv,
uint32_t mirrors, uint32_t log_count,
int (*is_removable)(struct logical_volume *, void *),
void *removable_baton,
uint64_t status_mask)
Notice how the parameter is named "log_count" -- I have assumed, given
that elsewhere it is called "remove_log" that this has a reverse
meaning. Wrong, it actually means the same thing as remove_log.
Thanks for noticing this. So yes, it should spell
if (!lv_remove_mirrors(cmd, lv, failed_mirrors, failed_log,
_is_partial_lv, NULL, 0))
... now this was probably not being noticed, since in most cases, the
mistakenly removed log would be re-created right away by later
code. I'll fix this in my local patch.
> Or the value such as nlc in _lvconvert_mirrors_aux() comes.
>
> uint32_t nlc = (!new_log_count || lp->mirrors == 1) ? 1U : 0U;
> ...
> } else if (!lv_remove_mirrors(cmd, lv, nmc, nlc,
> is_mirror_image_removable, operable_pvs, 0))
Ugh, nlc sounds like short for new_log_count to me, which is *extremely
confusing*, since it's actually quite opposite: nlc = 1 means remove
log, nlc = 0 means retain the log. Is that right?
Yours,
Petr.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 17:15 ` Takahiro Yasui
@ 2010-05-19 20:26 ` Petr Rockai
0 siblings, 0 replies; 21+ messages in thread
From: Petr Rockai @ 2010-05-19 20:26 UTC (permalink / raw)
To: lvm-devel
Hi Taka,
Takahiro Yasui <tyasui@redhat.com> writes:
>>>> + if (failed_mirrors) {
>>>> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
>>>> + _is_partial_lv, NULL, 0))
>
> + if (failed_mirrors || failed_log) {
> + if (!lv_remove_mirrors(cmd, lv, failed_mirrors, failed_log,
> + _is_partial_lv, NULL, 0)) {
>
if (failed_mirrors || failed_log) is actually the same (well, should be)
as if (1) at this point... Anyway, if you have seen my previous mail,
the thing that actually is required to be called every time is
_reload_lv, which does metadata writing. The lv_remove_mirrors call is
only needed if failed_mirrors is true, since _lv_update_log_type should
have already dealt with log. If failed_log is true, then new_log_count
is 0 for disklog mirrors in the _lv_update_log_type call.
This is all still broken for mirrored logs, but this is not completely
easy to fix, so I'll defer that for now.
Yours,
Petr.
PS: failed_mirrors would probably be better called failed_images...
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 20:19 ` Petr Rockai
@ 2010-05-19 20:31 ` Alasdair G Kergon
2010-05-19 20:35 ` Takahiro Yasui
1 sibling, 0 replies; 21+ messages in thread
From: Alasdair G Kergon @ 2010-05-19 20:31 UTC (permalink / raw)
To: lvm-devel
On Wed, May 19, 2010 at 10:19:41PM +0200, Peter Rockai wrote:
> Notice how the parameter is named "log_count" -- I have assumed, given
> that elsewhere it is called "remove_log" that this has a reverse
> meaning. Wrong, it actually means the same thing as remove_log.
Our variable names are meant to be global.
> Ugh, nlc sounds like short for new_log_count to me, which is *extremely
Fix abbreviations you find like that too, please:)
Alasdair
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 20:19 ` Petr Rockai
2010-05-19 20:31 ` Alasdair G Kergon
@ 2010-05-19 20:35 ` Takahiro Yasui
1 sibling, 0 replies; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-19 20:35 UTC (permalink / raw)
To: lvm-devel
On 05/19/10 16:19, Petr Rockai wrote:
> Takahiro Yasui <tyasui@redhat.com> writes:
>
>>> +++ cvs-upstream/tools/lvconvert.c 2010-05-19 20:18:40.000000000 +0200
>>> @@ -1225,11 +1225,11 @@
>>> if (!lv_remove_mirrors(cmd, lv, failed_mirrors, new_log_count,
>>> _is_partial_lv, NULL, 0))
>>> return 0;
>>
>> As I posted in the previous mail, the argument, new_log_count, is
>> correct here? The argument is used to check if a log should be removed.
>> I think that failed_log comes here instead of new_log_count.
...
> ... now this was probably not being noticed, since in most cases, the
> mistakenly removed log would be re-created right away by later
> code. I'll fix this in my local patch.
Yeah, that valuable name is very confusing. :-<
Anyway thanks to fixing this.
>> Or the value such as nlc in _lvconvert_mirrors_aux() comes.
>>
>> uint32_t nlc = (!new_log_count || lp->mirrors == 1) ? 1U : 0U;
>> ...
>> } else if (!lv_remove_mirrors(cmd, lv, nmc, nlc,
>> is_mirror_image_removable, operable_pvs, 0))
>
> Ugh, nlc sounds like short for new_log_count to me, which is *extremely
> confusing*, since it's actually quite opposite: nlc = 1 means remove
> log, nlc = 0 means retain the log. Is that right?
Yes, it is. It is similarly used as the previous parameter, nmc.
nmc tells the number of mirrors which will be removed. So as for
nlc, it is easier to think nlc the log number which should be removed.
(0: not removed, 1: removed)
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-19 18:31 ` Petr Rockai
2010-05-19 19:46 ` Takahiro Yasui
@ 2010-05-20 0:49 ` Takahiro Yasui
2010-05-20 7:40 ` Petr Rockai
1 sibling, 1 reply; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-20 0:49 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
On 05/19/10 14:31, Petr Rockai wrote:
> The attached patch is on top of the previous one -- it should apply
> cleanly on top of current CVS + the lvconvert-repair-transient.diff I
> have sent previously. It also fixes a bug in messages, where it would
> claim that it failed to replace things even if it never tried to in the
> first place.
>
> Please test and let me know if you run into any other issues.
I tested your two patches and all my test cases for 'disk' log
passed. As for 'mirrored' log, we need to check the state for
log_lv. Here is the sample code.
_lvconvert_mirrors_repair()
...
lv_check_transient(lv); /* TODO check this in lib for all commands? */
- if (!(lv->status & PARTIAL_LV)) {
+ log_lv = first_seg(lv)->log_lv;
+ if (log_lv && log_lv->status & MIRRORED)
+ lv_check_transient(log_lv);
+
+ if (!(lv->status & PARTIAL_LV) && !log_lv &&
+ !(log_lv->status & PARTIAL_LV)) {
...
new_log_count = old_log_count;
- log_lv = first_seg(lv)->log_lv;
if (log_lv) {
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-20 0:49 ` Takahiro Yasui
@ 2010-05-20 7:40 ` Petr Rockai
2010-05-20 11:54 ` Takahiro Yasui
0 siblings, 1 reply; 21+ messages in thread
From: Petr Rockai @ 2010-05-20 7:40 UTC (permalink / raw)
To: lvm-devel
Hi Taka!
Takahiro Yasui <tyasui@redhat.com> writes:
>> Please test and let me know if you run into any other issues.
>
> I tested your two patches and all my test cases for 'disk' log
> passed. As for 'mirrored' log, we need to check the state for
> log_lv. Here is the sample code.
So, do you think it is reasonable to check this in in its current form,
and work on --repair support for mirrored logs in a separate patch?
I think there are currently other things broken wrt. mirrored log and
--repair, other than just the transient support, and I'll have to sit
down and write tests first to see what works and what doesn't and then
fix things up.
Yours,
Petr.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] handle transient errors in lvconvert --repair
2010-05-20 7:40 ` Petr Rockai
@ 2010-05-20 11:54 ` Takahiro Yasui
0 siblings, 0 replies; 21+ messages in thread
From: Takahiro Yasui @ 2010-05-20 11:54 UTC (permalink / raw)
To: lvm-devel
Hi Petr,
Petr Rockai wrote:
> Takahiro Yasui <tyasui@redhat.com> writes:
>>> Please test and let me know if you run into any other issues.
>> I tested your two patches and all my test cases for 'disk' log
>> passed. As for 'mirrored' log, we need to check the state for
>> log_lv. Here is the sample code.
>
> So, do you think it is reasonable to check this in in its current form,
> and work on --repair support for mirrored logs in a separate patch?
That sounds reasonable to me.
> I think there are currently other things broken wrt. mirrored log and
> --repair, other than just the transient support, and I'll have to sit
> down and write tests first to see what works and what doesn't and then
> fix things up.
I will review & test a patch is ready.
Thanks,
Taka
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2010-05-20 11:54 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-27 15:14 [PATCH] handle transient errors in lvconvert --repair Petr Rockai
2010-04-27 18:03 ` Petr Rockai
2010-05-06 17:30 ` Petr Rockai
2010-05-14 22:52 ` Takahiro Yasui
2010-05-17 23:01 ` Takahiro Yasui
2010-05-19 12:06 ` Petr Rockai
2010-05-19 15:06 ` Takahiro Yasui
2010-05-19 18:31 ` Petr Rockai
2010-05-19 19:46 ` Takahiro Yasui
2010-05-19 20:19 ` Petr Rockai
2010-05-19 20:31 ` Alasdair G Kergon
2010-05-19 20:35 ` Takahiro Yasui
2010-05-20 0:49 ` Takahiro Yasui
2010-05-20 7:40 ` Petr Rockai
2010-05-20 11:54 ` Takahiro Yasui
2010-05-19 16:44 ` Takahiro Yasui
2010-05-19 17:15 ` Takahiro Yasui
2010-05-19 20:26 ` Petr Rockai
2010-04-28 19:37 ` Takahiro Yasui
2010-05-05 7:46 ` Petr Rockai
2010-05-05 15:31 ` Takahiro Yasui
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.