[RFC][PATCH 0/5] dmeventd device filtering

* [RFC][PATCH 0/5] dmeventd device filtering
@ 2009-09-30  0:28 Takahiro Yasui
  2009-09-30 20:50 ` Petr Rockai
  0 siblings, 1 reply; 5+ messages in thread
From: Takahiro Yasui @ 2009-09-30  0:28 UTC (permalink / raw)
  To: lvm-devel

Hi,

This is a prototype patch set that adds a device filtering function to
dmeventd. I would appreciate any comments on the idea or the implementation.

PATCH SET
=========
    1/5: support command string with space
    2/5: add device list registering interface
    3/5: add filtering function to dmeventd
    4/5: dmeventd filtering failed devices
    5/5: update device lists by lvm commands

BACKGROUND
==========

Most of the error recovery for an LVM mirror is done in userspace,
mainly by dmeventd. dmeventd calls lvconvert and vgreduce internally,
and those lvm commands remove the failed devices.

However, an lvm command scans all devices managed by lvm every time it
is executed, which takes a long time if there are many devices in the
system.

In addition, the failed device that triggered the error recovery is also
accessed. When the error is a timeout, accessing the failed device may
cause another timeout, and the error recovery could take a long time.
The error recovery time is also affected by failed devices that do not
belong to the volume group containing the mirror volume.

FYI: This issue is also described in the following post.
   Introduce metadata cache feature
   https://www.redhat.com/archives/lvm-devel/2009-April/msg00014.html

SOLUTION
========

A device filtering feature is added to dmeventd so that dmeventd calls
an LVM command with a filter option that limits which devices are
accessed, as follows:

   - Allow access to devices associated with the volume group
   - Deny access to the failed devices which triggered the error recovery

For example, when mimage0 breaks in the following environment, the current
implementation accesses all devices (pv0 ... pv8), but access to pv1 and
pv2 is enough to remove mimage0.

    vg0 { pv0, pv1, pv2 }, vg1 { pv3, pv4, pv5 }, vg2 { pv6, pv7, pv8 }

        lv0(mirror) --+-- mimage0 { pv0 }
                      +-- mimage1 { pv1 }
                      +-- mlog    { pv2 }

This patch set limits devices to be accessed during error recovery.
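
As a rough illustration of the example above (a sketch, not code from the
patch set), the devices that recovery needs are the devices of the mirror's
VG minus the failed ones. The dictionary layout and the function name
devices_needed are assumptions made only for this example.

```python
# Hypothetical layout taken from the example: three VGs with three PVs each.
VGS = {
    "vg0": ["pv0", "pv1", "pv2"],
    "vg1": ["pv3", "pv4", "pv5"],
    "vg2": ["pv6", "pv7", "pv8"],
}


def devices_needed(vgs, vg_name, failed):
    """Return the devices of vg_name that are not in the failed set."""
    failed = set(failed)
    return [pv for pv in vgs[vg_name] if pv not in failed]


# mimage0 lives on pv0, so pv0 is the failed device.
print(devices_needed(VGS, "vg0", ["pv0"]))  # ['pv1', 'pv2']
```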

DESIGN OVERVIEW
===============

The key idea is to execute lvconvert and vgreduce from dmeventd with a
"filter" option that overrides the filtering rule defined in the config
file (lvm.conf). When an error is reported to dmeventd, dmeventd
automatically generates the filter option and calls the lvm commands with
it, as follows.

   vgreduce --removemissing --config \
     devices{filter=["a|/dev/sda|", "a|/dev/sdb|", ...,"r|.*|"]} VG/LV
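
A minimal sketch of how such a filter option could be assembled, written in
Python for brevity (dmeventd itself is C, and this is not the code from the
patches). The function names are hypothetical. Anchoring each pattern with
^ and $ is a choice of this sketch, and it assumes device paths contain no
regex metacharacters and no "|" character.

```python
def build_filter(vg_devices, failed_devices):
    """Build filter patterns: reject failed, accept the rest of the VG, reject all else."""
    failed = set(failed_devices)
    # Explicit rejects are redundant with the final "r|.*|" but document intent.
    patterns = [f"r|^{dev}$|" for dev in sorted(failed)]
    patterns += [f"a|^{dev}$|" for dev in vg_devices if dev not in failed]
    patterns.append("r|.*|")
    return patterns


def build_config_arg(patterns):
    """Format the patterns as the value passed to --config."""
    quoted = ", ".join(f'"{p}"' for p in patterns)
    return f"devices{{filter=[{quoted}]}}"


patterns = build_filter(["/dev/sda", "/dev/sdb", "/dev/sdc"], ["/dev/sda"])
print(build_config_arg(patterns))
# devices{filter=["r|^/dev/sda$|", "a|^/dev/sdb$|", "a|^/dev/sdc$|", "r|.*|"]}
```

The first matching pattern decides whether a device is accepted, so the
order of the entries matters.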

To generate the filter option, dmeventd requires a list of the devices
included in the VG. When an LV is registered for monitoring, the device
list of its VG is passed to dmeventd. This information needs to be updated
when the VG structure is changed by adding devices to or removing devices
from the VG with vgextend, vgreduce, or other lvm commands, so that
dmeventd gets a new device list.

The failed device list is generated when an error is reported. dmeventd gets
the devices included in the failed mirror leg or log from the kernel through
the device-mapper interface.
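
The bookkeeping described above could look roughly like the following
sketch. The class DeviceRegistry and its method names are hypothetical and
only model the idea; the real interface is the one added in patches 2/5 and
5/5.

```python
class DeviceRegistry:
    """Hypothetical per-VG device list cache, as dmeventd might keep it."""

    def __init__(self):
        self._devices = {}  # VG name -> list of device paths

    def register(self, vg_name, devices):
        """Store the device list passed in when an LV of vg_name is registered."""
        self._devices[vg_name] = list(devices)

    def update(self, vg_name, devices):
        """Replace the list after vgextend, vgreduce or similar changed the VG."""
        if vg_name not in self._devices:
            raise KeyError(f"{vg_name} is not registered")
        self._devices[vg_name] = list(devices)

    def usable(self, vg_name, failed):
        """Devices of the VG minus the failed ones reported by the kernel."""
        failed = set(failed)
        return [d for d in self._devices[vg_name] if d not in failed]


reg = DeviceRegistry()
reg.register("vg0", ["/dev/sda", "/dev/sdb", "/dev/sdc"])
reg.update("vg0", ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sde"])  # after vgextend
print(reg.usable("vg0", ["/dev/sda"]))  # ['/dev/sdb', '/dev/sdc', '/dev/sde']
```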

FUTURE WORK
===========

  - Make the filtering function configurable (e.g. in lvm.conf)
  - More tests
  - Code cleanup (including adjusting the size of the static array)
  - Evaluation together with mirroredlog, which malahal posted

Regards,
-- 
Takahiro Yasui
Hitachi Computer Products (America), Inc.

* [RFC][PATCH 0/5] dmeventd device filtering
  2009-09-30  0:28 [RFC][PATCH 0/5] dmeventd device filtering Takahiro Yasui
@ 2009-09-30 20:50 ` Petr Rockai
  2009-10-07  1:58   ` malahal
  2009-10-16 15:46   ` Takahiro Yasui
  0 siblings, 2 replies; 5+ messages in thread
From: Petr Rockai @ 2009-09-30 20:50 UTC (permalink / raw)
  To: lvm-devel

Takahiro Yasui <tyasui@redhat.com> writes:

> BACKGROUND
Agreed.

> SOLUTION
> ========
>
> Device filtering feature is added to dmeventd so that dmeventd calls
> a LVM command with a filter option to limit accessing devices as follows:
>
>    - Allow access to devices associated with the volume group
>    - Deny access to the failed devices which triggered the error recovery
>
> For example, when mimage0 broke in the following environment, the current
> implementation accesses all devices (pv0 ... pv8), but access to pv1 and
> pv2 are enough to remove mimage0.
>
>     vg0 { pv0, pv1, pv2 }, vg1 { pv3, pv4, pv5 }, vg2 { pv6, pv7, pv8 }
>
>         lv0(mirror) --+-- mimage0 { pv0 }
>                       +-- mimage1 { pv1 }
>                       +-- mlog    { pv2 }
>
> This patch set limits devices to be accessed during error recovery.
Interesting idea.

> DESIGN OVERVIEW
> ===============
>
> The key idea is executing lvconvert and vgreduce with "filter" options
> from dmeventd and override filtering rule defined in the config file
> (lvm.conf). When an error is reported to dmeventd, dmeventd automatically
> generates filtering option and call lvm commands with it as follows.
>
>    vgreduce --removemissing --config \
>      devices{filter=["a|/dev/sda", "a|/dev/sdb", ...,"r|.*|"]} VG/LV
>
Sounds like a good interim solution. Eventually, we may want to switch away
from using lvm2cmd for dmeventd plugins, but I agree that this is still far on
the horizon.

> To generate filter option, dmeventd requires a list of devices included
> in the VG. When a LV is registered as a monitoring device, a device list
> of the VG are passed to dmeventd. This information needs to be updated if
> the VG structure is changed by adding or removing devices to/from the VG
> by vgextend, vgreduce or other lvm commands, dmeventd gets a new device
> list.
>
> A failed device list is generated when an error is notified. dmeventd gets
> devices included in failed mirror leg or log from kernel through device-mapper
> interface.
Hmm. Does this introduce some race conditions? When a bad sequence of metadata
edits and failures happens, could this lead to bad behaviour? I have skimmed
the patches and I think the following may happen:

- vgextend a volume group (adding say /dev/sde)
- metadata is written and committed
- dmeventd notices a failure, but its device list is out of date 
- lvconvert does its job, but when writing metadata, it marks the /dev/sde PV
  as missing, since it can't find it
- dmeventd triggers vgreduce, which removes /dev/sde from the volume group

It is not a fatal problem, but definitely surprising. Maybe we could fix it,
although I'm not entirely sure how.
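
The sequence above can be modelled with a toy example. This is only an
illustration with assumed names; it does not call LVM and is not taken from
the patches. A PV that is in the on-disk metadata but absent from the cached
accept list is hidden by the filter, so the command sees it as missing.

```python
def missing_pvs(metadata_pvs, cached_list):
    """PVs named in the metadata that an accept-only-cached filter would hide."""
    allowed = set(cached_list)
    return [pv for pv in metadata_pvs if pv not in allowed]


# dmeventd cached the list before vgextend added /dev/sde.
cached = ["/dev/sda", "/dev/sdb", "/dev/sdc"]
on_disk = cached + ["/dev/sde"]

print(missing_pvs(on_disk, cached))  # ['/dev/sde']
```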

Also, I'm a little worried that this is something that may rather easily go out
of sync -- keeping a cached copy of data like this around is always
dangerous. Fortunately, the worst that should happen is that an automatic
recovery fails or that empty PVs are removed from the volume group (like above)
-- it shouldn't be possible to trick dmeventd into clobbering any data this
way. Either way -- I am not sure it is a showstopper, but it's definitely not
very nice. Thoughts?

Yours,
   Petr.

PS: Another thing crossed my mind -- how safe is it to use device node names
here? Would it make more sense to use major/minor numbers? If device nodes get
re-arranged between registration and a failure, this could cause some woes as
well. The gap could easily be many months. Maybe not likely, but definitely not
impossible...
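
For reference, a small sketch of how a device node can be mapped to its
major/minor pair with the Python standard library. The helper names are
made up for this example, and whether the filter could actually be keyed on
these numbers is left open here.

```python
import os


def device_numbers(rdev):
    """Split a raw device number (st_rdev) into (major, minor)."""
    return os.major(rdev), os.minor(rdev)


def device_numbers_for_path(path):
    """Look up the (major, minor) pair of a device node such as /dev/sdb."""
    return device_numbers(os.stat(path).st_rdev)


# Round trip on a synthetic device number, so no real device node is needed.
print(device_numbers(os.makedev(8, 16)))  # (8, 16)
```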



* [RFC][PATCH 0/5] dmeventd device filtering
  2009-09-30 20:50 ` Petr Rockai
@ 2009-10-07  1:58   ` malahal
  2009-10-16 18:54     ` Takahiro Yasui
  2009-10-16 15:46   ` Takahiro Yasui
  1 sibling, 1 reply; 5+ messages in thread
From: malahal @ 2009-10-07  1:58 UTC (permalink / raw)
  To: lvm-devel

Petr Rockai [prockai at redhat.com] wrote:
> Hmm. Does this introduce some race conditions? When a bad sequence of metadata
> edits and failures happens, could this lead to bad behaviour? I have skimmed
> the patches and I think following may happen:
> 
> - vgextend a volume group (adding say /dev/sde)
> - metadata is written and committed
> - dmeventd notices a failure, but its device list is out of date 
> - lvconvert does its job, but when writing metadata, it marks the /dev/sde PV
>   as missing, since it can't find it
> - dmeventd triggers vgreduce, which removes /dev/sde from the volume group
> 
> It is not a fatal problem, but definitely surprising. Maybe we could fix it,
> although I'm not entirely sure how.
> 
> Also, I'm a little worried that this is something that may rather easily go out
> of sync -- keeping a cached copy of data like this around is always
> dangerous. Fortunately, the worst that should happen is that an automatic
> recovery fails or that empty PVs are removed from the volume group (like above)
> -- it shouldn't be possible to trick dmeventd into clobbering any data this
> way. Either way -- I am not sure it is a showstopper, but it's definitely not
> very nice. Thoughts?
> 
> Yours,
>    Petr.

How about having vgreduce skip scanning only the failed devices? It would
still scan /dev/sde in the above case. Multiple device failures at the same
time (not uncommon) are going to be a problem, though. :-(
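
A sketch of what such a filter might look like, with a hypothetical function
name and the same assumptions about device paths as a plain regex pattern
(no metacharacters, no "|"). Only the failed devices are rejected and every
other device is accepted, so a newly added PV stays visible, at the cost of
scanning all remaining devices again.

```python
def reject_only_failed(failed_devices):
    """Filter patterns that hide the failed devices and accept everything else."""
    patterns = [f"r|^{dev}$|" for dev in failed_devices]
    patterns.append("a|.*|")
    return patterns


print(reject_only_failed(["/dev/sda"]))  # ['r|^/dev/sda$|', 'a|.*|']
```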



* [RFC][PATCH 0/5] dmeventd device filtering
  2009-09-30 20:50 ` Petr Rockai
  2009-10-07  1:58   ` malahal
@ 2009-10-16 15:46   ` Takahiro Yasui
  1 sibling, 0 replies; 5+ messages in thread
From: Takahiro Yasui @ 2009-10-16 15:46 UTC (permalink / raw)
  To: lvm-devel

>> To generate filter option, dmeventd requires a list of devices included
>> in the VG. When a LV is registered as a monitoring device, a device list
>> of the VG are passed to dmeventd. This information needs to be updated if
>> the VG structure is changed by adding or removing devices to/from the VG
>> by vgextend, vgreduce or other lvm commands, dmeventd gets a new device
>> list.
>>
>> A failed device list is generated when an error is notified. dmeventd gets
>> devices included in failed mirror leg or log from kernel through device-mapper
>> interface.
> Hmm. Does this introduce some race conditions? When a bad sequence of metadata
> edits and failures happens, could this lead to bad behaviour? I have skimmed
> the patches and I think following may happen:
> 
> - vgextend a volume group (adding say /dev/sde)
> - metadata is written and committed
> - dmeventd notices a failure, but its device list is out of date 
> - lvconvert does its job, but when writing metadata, it marks the /dev/sde PV
>   as missing, since it can't find it
> - dmeventd triggers vgreduce, which removes /dev/sde from the volume group
> 
> It is not a fatal problem, but definitely surprising. Maybe we could fix it,
> although I'm not entirely sure how.
> 
> Also, I'm a little worried that this is something that may rather easily go out
> of sync -- keeping a cached copy of data like this around is always
> dangerous. Fortunately, the worst that should happen is that an automatic
> recovery fails or that empty PVs are removed from the volume group (like above)
> -- it shouldn't be possible to trick dmeventd into clobbering any data this
> way. Either way -- I am not sure it is a showstopper, but it's definitely not
> very nice. Thoughts?

I'm very sorry for my late response. You are right. This method needs to
keep the device list in dmeventd consistent with the lvm metadata on disk,
and the sequence you described should be handled in some way.

I don't have a perfect solution right now, but stopping monitoring during a
VG update would be one option. Stopping and restarting monitoring is not
perfect, but it is the same as what is done when an LV is changed. Anyway,
I will look into solutions, and I would appreciate any ideas you could give
me.
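
The stop/start idea can be sketched as follows. The Monitor class and the
monitoring_paused helper are stand-ins invented for this example and do not
correspond to real dmeventd or lvm interfaces.

```python
from contextlib import contextmanager


class Monitor:
    """Stand-in for dmeventd monitoring of one VG; purely illustrative."""

    def __init__(self, devices):
        self.active = True
        self.devices = list(devices)


@contextmanager
def monitoring_paused(monitor):
    """Turn monitoring off for the duration of a VG update, then back on."""
    monitor.active = False
    try:
        yield monitor
    finally:
        monitor.active = True


mon = Monitor(["/dev/sda", "/dev/sdb"])
with monitoring_paused(mon):
    # The metadata update and the device list refresh happen while
    # monitoring is off, so no recovery runs with a stale list.
    mon.devices = mon.devices + ["/dev/sde"]

print(mon.active, mon.devices)  # True ['/dev/sda', '/dev/sdb', '/dev/sde']
```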

> PS: Another thing crossed my mind -- how safe it is to use device node names
> here? Would it make more sense to use major/minor numbers? If device nodes get
> re-arranged between registration and a failure, this could cause some woes as
> well. The gap could easily be many months. Maybe not likely, but definitely not
> impossible...

I understand your point. It is not impossible, but it is not likely. I believe
the current filtering method and the device cache are built on the same
assumption.

Thanks,
Taka



* [RFC][PATCH 0/5] dmeventd device filtering
  2009-10-07  1:58   ` malahal
@ 2009-10-16 18:54     ` Takahiro Yasui
  0 siblings, 0 replies; 5+ messages in thread
From: Takahiro Yasui @ 2009-10-16 18:54 UTC (permalink / raw)
  To: lvm-devel

On 10/06/09 21:58, malahal at us.ibm.com wrote:
> Petr Rockai [prockai at redhat.com] wrote:
>> Hmm. Does this introduce some race conditions? When a bad sequence of metadata
>> edits and failures happens, could this lead to bad behaviour? I have skimmed
>> the patches and I think following may happen:
>>
>> - vgextend a volume group (adding say /dev/sde)
>> - metadata is written and committed
>> - dmeventd notices a failure, but its device list is out of date 
>> - lvconvert does its job, but when writing metadata, it marks the /dev/sde PV
>>   as missing, since it can't find it
>> - dmeventd triggers vgreduce, which removes /dev/sde from the volume group
>>
>> It is not a fatal problem, but definitely surprising. Maybe we could fix it,
>> although I'm not entirely sure how.
>>
>> Also, I'm a little worried that this is something that may rather easily go out
>> of sync -- keeping a cached copy of data like this around is always
>> dangerous. Fortunately, the worst that should happen is that an automatic
>> recovery fails or that empty PVs are removed from the volume group (like above)
>> -- it shouldn't be possible to trick dmeventd into clobbering any data this
>> way. Either way -- I am not sure it is a showstopper, but it's definitely not
>> very nice. Thoughts?
>>
>> Yours,
>>    Petr.
> 
> How about vgreduce only not scanning the failed devices. It will scan
> /dev/sde in the above case. Multiple device failures at the same (not
> uncommon) is going to be a problem though. :-(

If there is some way to make vgreduce/lvconvert avoid scanning, just
filtering out failed devices would work. One way is the metadata cache
feature:

Introduce metadata cache feature
https://www.redhat.com/archives/lvm-devel/2009-April/msg00014.html

Another way is to not save metadata on each device but to save it
in a directory, using the lvm configuration:

  metadata/pvmetadatacopies=0
  metadata/dirs=[<directory>]

Thanks,
Taka


