[RFC 0/6] Waiting for the missing device in mirror


From: Zdenek Kabelac <zkabelac@redhat.com>
To: lvm-devel@redhat.com
Subject: [RFC 0/6] Waiting for the missing device in mirror
Date: Tue, 09 Jun 2015 09:12:35 +0200
Message-ID: <557691E3.5050805@redhat.com>
In-Reply-To: <5576CBDB020000E10000FE8B@relay2.provo.novell.com>

On 9.6.2015 at 05:19, Lidong Zhong wrote:
>>> On 6/8/2015 at 04:38 PM, in message <55755485.2080802@redhat.com>, Zdenek
> Kabelac <zkabelac@redhat.com> wrote:
>> On 8.6.2015 at 09:48, Lidong Zhong wrote:
>>> Hi List,
>>>
>>> The implementation here is trying to add another policy for the
>>> missing leg/log device in mirror. We want to wait for the device for some
>>> time in case of a temporary device failure, especially a network
>>> disconnection for clvmd, to avoid a full disk recovery.
>>>
>>> This version is kind of a draft. There are many immature places to improve.
>>> So comments and suggestions are welcome.
>>>
>>> The corresponding kernel part is here:
>>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=ed63287dd670f8e9d2412a913de7fdc50a689831
>>
>> Hi
>>
> Hi Zdenek,
>
> Thanks for your reply.
>> I think you should first start with a very precise description of what
>> you are trying to achieve/fix - then we should discuss how to reach the
>> desired goal.
>>
>
> Sorry, my fault. Here is the situation:
> If one leg of the mirror fails, then according to the current implementation the failed leg
> will either be removed or be replaced. However, if it is a temporary failure (such as a
> network failure in clvmd), we have to do a full sync of the disk when we re-add it to the mirror,
> which takes a long time. So we plan to add another policy for the missing device: wait
> for the device for a configurable time. Then we only need an incremental sync covering
> the writes made while the device was gone.
>
> What I do in the patch series is:
> Add a new feature to the mirror target that allows bios to still be written to the remaining
> mirror devices while the bitmap is kept. The kernel implementation has been done.
> We add a KEEP_LOG feature, which depends on the current HANDLE_ERRORS feature. In
> userspace, we should add the parameter --trackchanges when we create a dm-mirror device to
> enable this feature.

Before we start thinking about enhancing the old mirror, which really cannot
easily track multiple lost legs the way the new 'raid1' target can:

Does the user need to activate the mirror on multiple nodes at once (using
cmirrord and gfs)?

For an exclusively activated mirror I'd advise switching to the superior new
'--type raid1', which already provides a 'tracking' feature.
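
To make the 'tracking' idea concrete, here is a minimal toy model of why tracking changes avoids a full resync. It is only a sketch under stated assumptions and not the dm-raid or dm-mirror implementation: the class `TrackedMirror`, the constant `REGION_SIZE` and all method names are hypothetical and exist only for illustration.

```python
REGION_SIZE = 4  # blocks per region; hypothetical value for this sketch


class TrackedMirror:
    """Toy two-leg mirror that records dirty regions while one leg is absent."""

    def __init__(self, num_blocks):
        self.legs = [[0] * num_blocks, [0] * num_blocks]
        self.missing = None   # index of the absent leg, or None
        self.dirty = set()    # regions written while a leg was absent

    def leg_lost(self, idx):
        self.missing = idx

    def write(self, block, value):
        # Writes go to every leg that is still present.
        for idx, leg in enumerate(self.legs):
            if idx != self.missing:
                leg[block] = value
        # While a leg is absent, remember which region was touched.
        if self.missing is not None:
            self.dirty.add(block // REGION_SIZE)

    def leg_returned(self):
        """Copy only dirty regions to the returned leg; return blocks copied."""
        src = self.legs[1 - self.missing]
        dst = self.legs[self.missing]
        copied = 0
        for region in sorted(self.dirty):
            start = region * REGION_SIZE
            end = min(start + REGION_SIZE, len(src))
            dst[start:end] = src[start:end]
            copied += end - start
        self.dirty.clear()
        self.missing = None
        return copied


mirror = TrackedMirror(num_blocks=16)
mirror.write(0, 7)        # both legs present
mirror.leg_lost(1)
mirror.write(5, 9)        # only leg 0 gets this write; region 1 becomes dirty
copied = mirror.leg_returned()
print(copied, mirror.legs[0] == mirror.legs[1])   # prints: 4 True
```

In this toy run only 4 of the 16 blocks are copied when the leg comes back, which is the difference between an incremental and a full sync.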


> When dmeventd gets a device failure event, it will call lvconvert according to the policy set in

So the 'failures' are not short-term - the device is really lost and then
reappears?


> 1\ It will create a temporary file named after the UUID of the device under /tmp, in case there
> are two or more failed devices and the daemons wait for the same one.

You can't use things in /tmp - you need to have some device prepared
(like the _pmspare we already introduced for repair of thin pools).


> 2\ The major:minor of the missing device probably changes when it comes back. So I put the original
> device number into metadata.(As already pointed out, it does not fit the rule.)

Devices are simply always mapped by PV UUID - never ever by major:minor - and
they are discovered by udev and stored in lvmetad - this is basically the
'vgextend --restoremissing' operation.
You also have numerous filtering rules (host & guest disks on a single box).
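
A small sketch of why keying on the PV UUID makes the device number irrelevant. This is a toy stand-in and not the lvmetad API: the class `DeviceCache`, its methods and the UUID string are all made up for the example.

```python
class DeviceCache:
    """Toy scan cache keyed by PV UUID (hypothetical, not the lvmetad API)."""

    def __init__(self):
        self._by_uuid = {}

    def device_seen(self, pv_uuid, major, minor):
        # Called whenever a scan finds a PV; the latest device number wins.
        self._by_uuid[pv_uuid] = (major, minor)

    def device_gone(self, pv_uuid):
        self._by_uuid.pop(pv_uuid, None)

    def lookup(self, pv_uuid):
        # Returns the current (major, minor), or None while the PV is missing.
        return self._by_uuid.get(pv_uuid)


cache = DeviceCache()
uuid = "hypothetical-pv-uuid-0001"   # made-up identifier for the example
cache.device_seen(uuid, 8, 16)
cache.device_gone(uuid)
missing = cache.lookup(uuid)         # PV is missing at this point
cache.device_seen(uuid, 8, 48)       # same PV returns with a new device number
found = cache.lookup(uuid)
```

The metadata only ever needs to name the UUID; whatever major:minor the device has after it returns is found again by the scan.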

>> #1 - Never store any device major:minor in lvm2 metadata - everything is
>> strictly PV UUID oriented (there are number of daemons these days)
>>
>
> I thought about storing this info into lvmetad. But if lvmetad service is not running,
> then what should we do.

Simply forget about major:minor - you don't need them.

>> #2 - Activation layer & Command layer are 2 separate entities - so your
>> command may run on a different node than the one where the actual activation
>> happens (unless you do a local activation) - the layer separator is ATM the
>> 'lock' - the code before the lock and after the lock do not share any data -
>> and the 'activation' layer knows only what is in the written metadata on disk
>> (just for optimization purposes there is some internal mechanism of caching
>> and reusing of some existing data).
>>
>
> I don't quite understand this part. I guess it's related to the replacing table info and
> starting sync in my code. I will look deep into this part. Thanks.

There is an 'extra' interface through which a '/tools' command can manipulate
the dm table. It's currently represented by a lock, and you can imagine 'clvmd'
as an activation daemon which understands 4 simple commands:

activate, deactivate, suspend, resume

As parameters it gets the LV-UUID and a few extra bits (unfortunately we ran
out of free bits years ago and it's hard to extend the protocol without
breaking compatibility).

Nothing else gets passed through - and this activation 'side' accesses the
on-disk metadata and does the actual activation (in parallel on multiple nodes
if needed). When it's all running in a single command there are some 'caching
methods' for speed-up.
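
The shape of that interface can be sketched roughly as follows. This is a toy model only, not the clvmd protocol: `Command`, `ActivationRequest`, `handle` and the flag layout are hypothetical names, and the resume behaviour shown is a simplification chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Command(Enum):
    ACTIVATE = "activate"
    DEACTIVATE = "deactivate"
    SUSPEND = "suspend"
    RESUME = "resume"


@dataclass(frozen=True)
class ActivationRequest:
    command: Command
    lv_uuid: str
    flags: int = 0   # the "few extra bits"; their layout is not modelled here


def handle(request, metadata, state):
    """Act on a request using only committed metadata, nothing from the caller."""
    table = metadata[request.lv_uuid]   # KeyError if the LV is not committed
    if request.command is Command.ACTIVATE:
        state[request.lv_uuid] = {"table": table, "suspended": False}
    elif request.command is Command.DEACTIVATE:
        state.pop(request.lv_uuid, None)
    elif request.command is Command.SUSPEND:
        state[request.lv_uuid]["suspended"] = True
    elif request.command is Command.RESUME:
        # In this toy model, resume re-reads the committed metadata.
        state[request.lv_uuid] = {"table": table, "suspended": False}


metadata = {"lv-uuid-1": "mirror, 2 legs"}   # stands in for on-disk metadata
state = {}                                    # stands in for active dm tables
handle(ActivationRequest(Command.ACTIVATE, "lv-uuid-1"), metadata, state)
handle(ActivationRequest(Command.SUSPEND, "lv-uuid-1"), metadata, state)
metadata["lv-uuid-1"] = "mirror, 1 leg"       # command side commits a change
handle(ActivationRequest(Command.RESUME, "lv-uuid-1"), metadata, state)
```

The point of the sketch is that the request carries no table, no device number and no temporary file name; the only way the command side influences activation is through what it has committed to metadata.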

>
>> #3 - There is no 'hidden' data exchange channel via /tmp for activation -
>> everything goes strictly via written and committed metadata, and for every
>> such metadata state there needs to be some clear recovery path (e.g. what
>> happens after 'power-off' with each committed lvm2 metadata state)
>>
>
> You mean I should put the waiting device info into metadata?

Figuring out a proper setup for an old mirror may get complex (since the old
mirror does not support a separate tracking device for an individual leg).
So it will be something like 'pvmove':
you 'create' another mirror layer and you pass a new 'temporary' log device to
it. But when you consider that you want a universal solution and would need to
be able to track changes for e.g. a 16-legged mirror - it may get seriously scary.

But if you really do not need parallel activation - I'd recommend switching to
raid1 mirroring - so first check whether this would resolve your problem.
(Old mirrors are seen as 'obsolete'.)

Zdenek

Thread overview: 9+ messages
2015-06-08  7:48 [RFC 0/6] Waiting for the missing device in mirror Lidong Zhong
2015-06-08  7:48 ` [RFC 1/6] Enable the keep_log feature while creating a mirror device Lidong Zhong
2015-06-08  7:48 ` [RFC 3/6] Mark a device if already being waited Lidong Zhong
2015-06-08  7:48 ` [RFC 4/6] Write the device number into metadata Lidong Zhong
2015-06-08  7:48 ` [RFC 5/6] Add another policy for the missing device -- wait Lidong Zhong
2015-06-08  7:48 ` [RFC 6/6] lvconvert: implement the wait policy Lidong Zhong
2015-06-08  8:38 ` [RFC 0/6] Waiting for the missing device in mirror Zdenek Kabelac
2015-06-09  3:19   ` Lidong Zhong
2015-06-09  7:12     ` Zdenek Kabelac [this message]
