From: agk@sourceware.org <agk@sourceware.org>
To: lvm-devel@redhat.com
Subject: LVM2 doc/lvmetad_design.txt daemons/lvmetad/DESIGN
Date: 8 Jul 2011 18:55:29 -0000
Message-ID: <20110708185529.17980.qmail@sourceware.org>

CVSROOT:	/cvs/lvm2
Module name:	LVM2
Changes by:	agk at sourceware.org	2011-07-08 18:55:29

Added files:
	doc            : lvmetad_design.txt 
Removed files:
	daemons/lvmetad: DESIGN 

Log message:
	move doc to doc dir

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/LVM2/doc/lvmetad_design.txt.diff?cvsroot=lvm2&r1=NONE&r2=1.1
http://sourceware.org/cgi-bin/cvsweb.cgi/LVM2/daemons/lvmetad/DESIGN.diff?cvsroot=lvm2&r1=1.2&r2=NONE

/cvs/lvm2/LVM2/doc/lvmetad_design.txt,v  -->  standard output
revision 1.1
--- LVM2/doc/lvmetad_design.txt
+++ -	2011-07-08 18:55:29.519267000 +0000
@@ -0,0 +1,197 @@
+The design of LVMetaD
+=====================
+
+Invocation and setup
+--------------------
+
+The daemon should be started automatically by the first LVM command issued on
+the system, when needed. Use of the daemon should be configurable in
+lvm.conf, probably in its own section, for example:
+
+    lvmetad {
+        enabled = 1 # default
+        autostart = 1 # default
+        socket = "/path/to/socket" # defaults to /var/run/lvmetad or such
+    }
+
+Library integration
+-------------------
+
+When a command needs to access metadata, it currently needs to perform a scan
+of the physical devices available in the system. This can be quite an
+expensive operation, especially if many devices are attached to the system. In
+most cases, LVM needs a complete image of the system's PVs to operate
+correctly, so all devices need to be read, to at least determine presence (and
+content) of a PV label. Additional IO is done to obtain or write metadata
+areas, but this is only marginally related and addressed by Dave's
+metadata-balancing work.
+
+In the existing scanning code, a cache layer exists, under
+lib/cache/lvmcache.[hc]. This layer keeps a textual copy of the metadata
+for a given volume group, in format_text form, as a character string. We can
+plug the lvmetad interface in at this level: in lvmcache_get_vg, which is
+responsible for looking up metadata in a local cache, we can, if the metadata
+is not available in the local cache, query lvmetad. Currently, when a VG is
+not cached yet, this lookup fails and prompts the caller to perform a scan.
+With lvmetad enabled, this would never happen: the fall-through would only
+be taken when lvmetad is disabled, which would lead to the local cache being
+populated as usual through a locally executed scan.
+
+Therefore, existing stand-alone (i.e. no lvmetad) functionality of the tools
+would not be compromised by adding lvmetad. With lvmetad enabled, however,
+significant portions of the code would be short-circuited.
+
+Scanning
+--------
+
+Initially (at least), lvmetad will not be allowed to read disks: it will
+rely on an external program to provide the metadata. In the ideal case, this
+will be triggered by udev. The role of lvmetad is then to collect and maintain
+an accurate (up to the data it has received) image of the VGs available in the
+system. I imagine we could extend the pvscan command (or add a new one, say
+lvmetad_client, if pvscan is found to be inappropriate):
+
+    $ pvscan --lvmetad /dev/foo
+    $ pvscan --lvmetad --remove /dev/foo
+
+These commands would simply read the label and the MDA (if applicable) from the
+given PV and feed that data to the running lvmetad, using
+lvmetad_{add,remove}_pv (see lvmetad_client.h).
+
+We however need to ensure a couple of things here:
+
+1) only LVM commands ever touch PV labels and VG metadata
+2) when a device is added or removed, udev fires a rule to notify lvmetad
+
+While the latter is straightforward, there are issues with the first. We
+*might* want to invoke the dreaded "watch" udev rule in this case, however it
+ends up being implemented. Of course, we can also rely on the sysadmin to be
+reasonable and not write over existing LVM metadata without first telling LVM
+to let go of the respective device(s).
+
+Even if we simply ignore the problem, metadata writes should fail in these
+cases, so the admin should be unable to do substantial damage to the system. If
+there were active LVs on top of the vanished PV, they are in trouble no matter
+what happens there.
+
+Incremental scan
+----------------
+
+There are some new issues arising with the "udev" scan mode. Namely, the
+devices of a volume group will be appearing one by one. The behaviour in this
+case will be very similar to the current behaviour when devices are missing:
+the volume group, until *all* its physical volumes have been discovered and
+announced by udev, will be in a state with some of its devices flagged as
+MISSING_PV. This means that the volume group will be, for most purposes,
+read-only until it is complete and LVs residing on yet-unknown PVs won't
+activate without --partial. Under usual circumstances, this is not a problem
+and the current code for dealing with MISSING_PVs should be adequate.
+
+However, the code for reading volume groups from disks will need to be adapted,
+since it currently does not work incrementally. Such support will need to track
+metadata-less PVs that have been encountered so far and to provide a way to
+update an existing volume group. When the first PV with metadata of a given VG
+is encountered, the VG is created in lvmetad (probably in the form of "struct
+volume_group") and it is assigned any previously cached metadata-less PVs it is
+referencing. Any PVs that were not yet encountered will be marked as MISSING_PV
+in the "struct volume_group". Upon scanning a new PV, if it belongs to any
+already-known volume group, this PV is checked for consistency with the already
+cached metadata (in a case of mismatch, the VG needs to be recovered or
+declared conflicted), and is subsequently unmarked MISSING_PV. Care must be
+taken not to unmark MISSING_PV on PVs that have this flag in their persistent
+metadata, though.
+
+The most problematic aspect of the whole design may be orphan PVs. At any given
+point, a metadata-less PV may appear orphaned, if a PV of its VG with metadata
+has not been scanned yet. Eventually, we will have to decide that this PV is
+really an orphan and enable its usage for creating or extending VGs. In
+practice, the decision might be governed by a timeout or assumed immediately --
+the former case is a little safer, the latter is probably more transparent. I
+am not very keen on using timeouts and we can probably assume that the admin
+won't blindly try to re-use devices in a way that would trip up LVM in this
+respect. I would be in favour of just assuming that metadata-less PVs with no
+known referencing VGs are orphans -- after all, this is the same approach as we
+use today. The metadata balancing support may stress this a bit more than the
+usual contemporary setups do, though.
+
+Automatic activation
+--------------------
+
+It may also be prudent to provide a command that will block until a volume
+group is complete, so that scripts can reliably activate/mount LVs and such. Of
+course, some PVs may never appear, so a timeout is necessary. Again, this is
+something not handled by current tools, but may become more important in
+future. It probably does not need to be implemented right away though.
+
+The other aspect of the progressive VG assembly is automatic activation. The
+only problem with that at present is that we would like to avoid having
+activation code in lvmetad, so we would prefer to fire an event of some sort
+and let someone else handle the activation and whatnot.
+
+Cluster support
+---------------
+
+When working in a cluster, clvmd integration will be necessary: clvmd will need
+to instruct lvmetad to re-read metadata as appropriate due to writes on remote
+hosts. Overall, this is not hard, but the devil is in the details. I would
+possibly disable lvmetad for clustered volume groups in the first phase and
+only proceed when the local mode is robust and well tested.
+
+Protocol & co.
+--------------
+
+I expect a simple text-based protocol executed on top of a Unix Domain Socket
+to be the communication interface for lvmetad. Ideally, the requests and
+replies will be well-formed "config file" style strings, so we can re-use
+existing parsing infrastructure.
+
+Since we already have two daemons, I would probably look into factoring some
+common code for daemon-y things, like sockets, communication (including thread
+management) and maybe logging and re-using it in all the daemons (clvmd,
+dmeventd and lvmetad). This shared infrastructure should live under
+daemons/common, and the existing daemons shall be gradually migrated to the
+shared code.
+
+Future extensions
+-----------------
+
+The above should basically cover the use of lvmetad as a cache-only
+daemon. Writes could still be executed locally, and the new metadata version
+can be provided to lvmetad through the socket the usual way. This is fairly
+natural and in my opinion reasonable. The lvmetad acts like a cache that will
+hold metadata, no more, no less.
+
+Beyond this, there are a couple of things that could be worked on later, when the
+above basic design is finished and implemented.
+
+_Metadata writing_: We may want to support writing new metadata through
+lvmetad. This may or may not be a better design, but the write itself should be
+more or less orthogonal to the rest of the story outlined above.
+
+_Locking_: Other than directing metadata writes through lvmetad, one could
+conceivably also track VG/LV locking through the same.
+
+_Clustering_: A deeper integration of lvmetad with clvmd might be possible and
+maybe desirable. Since clvmd communicates over the network with other clvmd
+instances, this could be extended to metadata exchange between lvmetad instances,
+further cutting down scanning costs. This would combine well with the
+write-through-lvmetad approach.
+
+Testing
+-------
+
+Since (at least bare-bones) lvmetad has no disk interaction and is fed metadata
+externally, it should be very amenable to automated testing. We need to provide
+a client that can feed arbitrary, synthetic metadata to the daemon and request
+the data back, providing reasonable (nearly unit-level) testing infrastructure.
+
+Battle plan & code layout
+=========================
+
+- config_tree from lib/config needs to move to libdm/
+- daemons/common *client* code can go to libdm/ as well (say
+  libdm/libdm-daemon.{h,c} or such)
+- daemons/common *server* code stays, is built in daemons/ toplevel as a static
+  library, say libdaemon-common.a
+- daemons/lvmetad *client* code goes to lib/lvmetad
+- daemons/lvmetad *server* code stays (links in daemons/libdaemon-common.a)

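A minimal Python sketch of the lookup order described under "Library
integration" in the patch above: local cache first, then lvmetad when it is
enabled, and a local scan only when it is not. This is an illustration, not
LVM code. The names get_vg, query_lvmetad and scan_devices are hypothetical
stand-ins (the real hook would sit in lvmcache_get_vg), and metadata is
modelled as plain strings keyed by VG name.

```python
def get_vg(name, local_cache, query_lvmetad=None, scan_devices=None):
    """Return the metadata text for VG `name`, or None if nobody knows it.

    local_cache   -- dict mapping VG name to metadata text
    query_lvmetad -- callable(name) -> text or None; None means lvmetad is disabled
    scan_devices  -- callable() -> dict of every VG found by a full device scan
    """
    # 1. The local cache always wins.
    if name in local_cache:
        return local_cache[name]

    # 2. With lvmetad enabled, ask the daemon and never fall through to a scan.
    if query_lvmetad is not None:
        text = query_lvmetad(name)
        if text is not None:
            local_cache[name] = text
        return text

    # 3. lvmetad disabled: populate the local cache through a local scan.
    if scan_devices is not None:
        local_cache.update(scan_devices())
    return local_cache.get(name)
```

With query_lvmetad left as None the function behaves like the stand-alone
tools: the first miss triggers one scan and later lookups are served from the
cache.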

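The bookkeeping described under "Incremental scan" in the patch above can be
sketched roughly as follows. This is a toy model under stated assumptions,
not the daemon's data structures: the VG and MetadataCache classes, the
add_pv method and the dict layout of `metadata` ("vg", "pvs",
"persistent_missing") are all hypothetical, PVs are identified by plain
strings, and a metadata mismatch simply raises instead of attempting
recovery.

```python
from dataclasses import dataclass, field


@dataclass
class VG:
    """Toy stand-in for "struct volume_group"."""
    name: str
    pv_ids: set                                            # every PV the metadata references
    missing: set = field(default_factory=set)              # PVs currently flagged MISSING_PV
    persistent_missing: set = field(default_factory=set)   # flagged in the metadata itself

    def mark_present(self, pv_id):
        # Never clear MISSING_PV on a PV that carries the flag persistently.
        if pv_id not in self.persistent_missing:
            self.missing.discard(pv_id)

    def complete(self):
        return not self.missing


class MetadataCache:
    def __init__(self):
        self.vgs = {}            # VG name -> VG
        self.unassigned = set()  # metadata-less PVs not yet claimed by any VG

    def add_pv(self, pv_id, metadata=None):
        """Record one scanned PV; `metadata` is None for a metadata-less PV."""
        if metadata is not None:
            name, pvs = metadata["vg"], set(metadata["pvs"])
            vg = self.vgs.get(name)
            if vg is None:
                # First PV with metadata: create the VG with every PV missing,
                # then adopt metadata-less PVs that were seen earlier.
                vg = VG(name, pv_ids=pvs, missing=set(pvs),
                        persistent_missing=set(metadata.get("persistent_missing", ())))
                self.vgs[name] = vg
                for seen in self.unassigned & pvs:
                    vg.mark_present(seen)
                self.unassigned -= pvs
            elif pvs != vg.pv_ids:
                # Stand-in for "recover or declare conflicted".
                raise ValueError("metadata mismatch for VG " + name)
        for vg in self.vgs.values():
            if pv_id in vg.pv_ids:
                vg.mark_present(pv_id)
                return vg
        self.unassigned.add(pv_id)   # candidate orphan, for now
        return None
```

A PV left in `unassigned` corresponds to the "apparent orphan" case: the
sketch takes no position on whether it becomes a real orphan immediately or
after a timeout.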

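To make the "config file" style messages from the "Protocol & co." section
of the patch above more concrete, here is a small Python sketch that
serialises and parses flat key/value requests. The design does not fix a
wire format, so everything here is an assumption: the function names
format_message and parse_message, the field names in the usage example, and
the restriction to unquoted integers and double-quoted strings without
nesting. In LVM itself the existing config parser would be reused instead.

```python
def format_message(fields):
    """Serialise a flat dict of str/int values as 'key = value' lines."""
    lines = []
    for key, value in fields.items():
        if isinstance(value, int):
            lines.append(f"{key} = {value}")
        else:
            # Keep the sketch simple: no escaping, so reject awkward strings.
            if '"' in value or "\n" in value:
                raise ValueError("unsupported character in value for " + key)
            lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"


def parse_message(text):
    """Parse the output of format_message back into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected 'key = value', got: " + line)
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            fields[key] = value[1:-1]      # quoted string
        else:
            fields[key] = int(value)       # bare integer
    return fields
```

A request announcing a new PV might then look like this (the field names are
invented for illustration):

```python
text = format_message({"request": "pv_add", "device": "/dev/foo", "seqno": 3})
assert parse_message(text) == {"request": "pv_add", "device": "/dev/foo", "seqno": 3}
```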