From: Edward Shishkin <edward.shishkin@gmail.com>
To: ReiserFS development mailing list <reiserfs-devel@vger.kernel.org>
Subject: Re: [ANNOUNCE] Reiser4 Logical Volumes. Mirrors and Failover
Date: Mon, 26 Sep 2016 12:43:38 +0200
Message-ID: <8f93b6ff-df06-9e45-8a02-76caa334db51@gmail.com>
In-Reply-To: <57E7026B.20001@gmail.com>



On 09/25/2016 12:47 AM, Edward Shishkin wrote:
> Logical Volumes
>
>
> Reiser4 will support logical (compound) volumes. For now we have
> implemented the simplest ones - mirrors. As a supplement to the
> existing checksums, mirrors provide failover - an important feature
> that will reduce the number of cases when your volume needs to be
> repaired by fsck.
>
> A Reiser4 subvolume is a component of a logical volume. A subvolume
> is always associated with a physical or logical (built by means of
> RAID, LVM, etc.) block device. Every subvolume possesses:
>
> . volume ID;
> . subvolume ID;
> . mirror ID;
> . number of replicas.
>
> The mirror ID is a serial number from 0 to 65535. The subvolume with
> mirror ID 0 has a special name - the original. The others are called
> replicas. We say "original A has a replica B" (or, equivalently,
> "B replicates A") iff A and B possess the same subvolume ID.
> An original together with all its replicas is called "mirrors".
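
The identity rules quoted above can be modelled in a few lines. This
is only a sketch: the class and field names are hypothetical, not the
on-disk layout, and the extra check that both subvolumes carry the
same volume ID is my assumption (the text only requires equal
subvolume IDs).

```python
from dataclasses import dataclass

MAX_MIRROR_ID = 65535  # mirror IDs run from 0 to 65535


@dataclass(frozen=True)
class Subvolume:
    """Hypothetical in-memory view of the per-subvolume properties."""
    volume_id: str
    subvolume_id: int
    mirror_id: int
    num_replicas: int

    def __post_init__(self):
        if not 0 <= self.mirror_id <= MAX_MIRROR_ID:
            raise ValueError(f"mirror ID {self.mirror_id} out of range")

    @property
    def is_original(self) -> bool:
        # Mirror ID 0 marks the original; any other ID marks a replica.
        return self.mirror_id == 0


def replicates(b: Subvolume, a: Subvolume) -> bool:
    """True iff replica b replicates original a (same subvolume ID)."""
    return (a.is_original and not b.is_original
            and a.volume_id == b.volume_id      # assumption, not in the text
            and a.subvolume_id == b.subvolume_id)
```

With an original `a` (mirror ID 0) and a replica `b` (mirror ID 1)
sharing a subvolume ID, `replicates(b, a)` holds and `replicates(a, b)`
does not.
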
>
> For subvolumes we have introduced a special disk format plugin,
> "format41". In accordance with the Reiser4 development model this
> means forward incompatibility. We have introduced it intentionally,
> for protection: for obvious reasons users must not be able to
> RW-mount separate replicas (without their originals).
> The multi-device extension is backward compatible: all volumes of the
> old format (format40) are supported as logical volumes composed of
> only one (original) subvolume.
>
>
>            Registration and activation of subvolumes
>
>
> For now every Reiser4 logical volume has only one original subvolume.
> The number of replicas can be 0 or more. A logical volume can be
> mounted with the usual mount command: simply specify any of its
> subvolumes (the original or any of its replicas). The only condition
> is that the original and all its replicas must be registered in the
> system. If the original or any of its replicas is not registered,
> mount will fail with a corresponding kernel message.
>
> Currently there is no tool to register a specified subvolume (TBD).
> However, the mount command always tries to register the specified
> device. The registration policy is "sticky": your device won't be
> unregistered after umount, or after a failed mount. (You will be
> able to unregister it forcibly with a special tool - TBD).
>
> The registration procedure reads the master super-block of the
> subvolume and puts the subvolume header on a special list of
> registered subvolumes.
>
> Mounting a logical volume activates all its registered components.
> The activation procedure reads the format super-block of the
> subvolume and performs other actions such as initialization of space
> maps, transaction replay, etc., as specified by the ->init_format()
> method of the respective disk format plugin. A pointer to an
> activated subvolume is placed in a special table of active subvolumes.
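
The registration and activation rules quoted above can be sketched as
a toy model. Everything here is hypothetical (the class name, the
header tuple, the error type); it only mirrors the described
behaviour: mount registers the device first, registration is sticky,
and activation succeeds only once the original and all its replicas
are registered.

```python
class MountError(Exception):
    """Raised when some mirror of the volume is not registered yet."""


class SubvolumeRegistry:
    """Toy model of the list of registered subvolumes (names hypothetical)."""

    def __init__(self):
        # device path -> (subvolume_id, mirror_id, num_replicas)
        self.registered = {}
        # table of active subvolumes, filled on a successful mount
        self.active = []

    def mount(self, device, header):
        # mount always tries to register the specified device first;
        # the entry is never dropped here, which models "sticky".
        self.registered[device] = header
        subvolume_id, _, num_replicas = header
        mirrors = {dev: hdr for dev, hdr in self.registered.items()
                   if hdr[0] == subvolume_id}
        mirror_ids = {hdr[1] for hdr in mirrors.values()}
        # Need the original (mirror ID 0) plus num_replicas replicas.
        if 0 not in mirror_ids or len(mirror_ids) != num_replicas + 1:
            raise MountError(f"{device}: not all mirrors are registered")
        self.active = sorted(mirrors)
        return self.active
```

In this model, mounting an original that has one replica fails on the
first attempt but leaves the original registered, so the following
mount of the replica activates both devices.
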
>
>
>                        Mirror operations
>
>
> So original and mirrors actually represent RAID0 on the filesystem
> level.


Err.. RAID1, of course, not RAID0.
Instead of RAID0 (striping) Reiser4 will offer something more interesting..

Edward.

>
> COMMENT. We aren't engaged in marketing fraud by collecting all the
> features of the block layer's RAID and LVM. Reiser4 mirrors implement
> a failover that the block layer's RAID0 is not able to provide.
>
> It will be possible to "upgrade" or "downgrade" a reiser4 array of
> mirrors by attaching / detaching one or more replicas online with
> special user-space tools (mirror.reiser4, TBD). With those tools it
> will also be possible to swap the original with any of its replicas,
> or to make a new original from any replica if the old one is lost
> for some reason.
>
> Fsck will refuse to check/repair a replica. Fsck is supposed to work
> only with original subvolumes. After an fsck-ed original is mounted,
> the kernel will automatically run a special on-line background
> procedure (scrub) in order to synchronize the repaired original with
> all its replicas.
>
> From time to time the user should check the array of mirrors by
> running scrub in background mode.
>
> WARNING: Bear in mind once and forever: Replica is not a backup!!!
>
>
>                        Technical Notes
>
>
> 1. The Reiser4 Transaction Design document carries over to logical
> volumes without modification, but with one small addition: an atom is
> now composed of per-subvolume components.
>
> 2. By design all mirrors differ only in their mirror IDs, which are
> stored in the master super-block. The format super-blocks of mirrors
> are identical. This approach provides the best performance and full
> parallelism in issuing IO requests to mirrors. The downside is a
> small compromise in design: the master super-block doesn't
> participate in transactions. It means that the mirror operations
> (upgrading, downgrading, swapping) cannot spawn usual transactions,
> which could be committed and (re)played using the existing
> transaction manager. That is, mirror operations won't survive a
> system crash. If a system crash happens during a mirror operation,
> the mirror structure should be checked/fixed offline by the mirror
> tools (the kernel will refuse to mount an unchecked array of
> mirrors). Fortunately, all critical mirror operations issue a small
> number of IO requests, so the probability of their interruption is
> close to zero.
>
> 3. We don't commit transactions on all mirrors, only on the original
> subvolume (this is the only functional difference between the
> original and its replicas). Transaction (re)play, of course, is
> performed on all mirrors using the wandering maps/blocks of the
> original subvolume.
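
Note 3 can be illustrated with a heavily simplified model. Devices
are plain dicts (block number -> data) and the wandering map is a dict
(target block -> wandered block); both representations, and the
function names, are assumptions made for the sketch and say nothing
about the real journal format.

```python
def commit(original, wander_map, payloads):
    """Commit on the original only: write the wandered copies there.

    wander_map maps target block -> wandered block (assumed layout);
    payloads maps target block -> new contents.
    """
    for target, wandered in wander_map.items():
        original[wandered] = payloads[target]


def replay(original, mirrors, wander_map):
    """Replay on every mirror, reading wandered blocks from the original."""
    for target, wandered in wander_map.items():
        data = original[wandered]
        for device in mirrors:
            device[target] = data
```

After `commit()` only the original holds the wandered blocks; after
`replay()` the target blocks are identical on the original and on
every replica passed in `mirrors`.
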
>
>
>                    How to test the new features
>
>
> Check out the branch "format41" of the upstream reiser4 and
> reiser4progs git repos at https://github.com/edward6 and build and
> install as usual.
>
> Mirrors can be created with the mkfs.reiser4 option -m. If this
> option is specified, the first listed device will be the original
> and the other ones replicas. All devices of an array should have the
> same size. We'll lift that restriction later.
>
> IMPORTANT: when creating mirrors, specify the node41 plugin (with
> checksum support). Otherwise, your mirrors won't be more useful than
> the block layer's RAID0.
>
> Register all your mirrors by trying to "mount" them one by one in any
> order. If you have N mirrors (i.e. one original and N-1 replicas),
> then the first N-1 mount commands will fail. Of course, this is not
> too graceful, but it is a temporary solution. The N-th "attempt"
> should succeed. Have fun. Unmount as usual.
>
>
>                            Example
>
>
> Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal
> size. Let's create an array of 2 mirrors:
>
> # mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8
>
> Take a look at original subvolume:
>
> # debugfs.reiser4 /dev/sda7
>
> Take a look at replica:
>
> # debugfs.reiser4 /dev/sda8
>
> Find differences ;)
>
> Register the original subvolume
>
> # mount /dev/sda7 /mnt
> mount: wrong fs type, bad option, bad superblock blablabla....
> # dmesg
> reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]:
> WARNING: /dev/sda7 requires replicas, which are not registered.
>
> Register the replica and mount the array:
>
> # mount /dev/sda8 /mnt
> # dmesg
>
> reiser4: registered subvolume (/dev/sda8)
> reiser4 (sda8): found disk format 4.0.1.
> reiser4 (/dev/sda7): using Hybrid Transaction Model.
>
> Let's copy a file /etc/services to our array of mirrors:
>
> # cp /etc/services /mnt/.
>
> Unmount the array:
>
> # umount /mnt
>
> Find the root block: it comes first in the tree dump:
>
> # debugfs.reiser4 -t /dev/sda7
>
> In our case the root block has block number 79.
>
> Let's now take a look at how our failover works. The death-defying
> act: we erase the root block of the original subvolume:
>
> # dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79
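
For reference, the dd command above overwrites one 4096-byte block
starting at byte offset 79 * 4096 = 323584. A minimal Python
equivalent, meant to be tried on a scratch image file rather than a
real device (the function name is hypothetical):

```python
BLOCK_SIZE = 4096  # matches bs=4096 in the dd command


def zero_block(path, block_nr, block_size=BLOCK_SIZE):
    """Overwrite one block with zeros in place; return its byte offset."""
    offset = block_nr * block_size
    # "r+b" updates the file in place without truncating it.
    with open(path, "r+b") as image:
        image.seek(offset)
        image.write(b"\0" * block_size)
    return offset
```

Calling `zero_block(image_path, 79)` touches only bytes 323584 through
327679 of the image and leaves the neighbouring blocks intact.
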
>
> We know that the mount procedure loads the root block. Let's try to
> mount our array with the corrupted root block:
>
> # mount /dev/sda8 /mnt
>
> Everything works..
> Take a look at kernel messages:
>
> # dmesg
> reiser4[mount(21224)]: parse_node41 (fs/reiser4/plugin/node/node41.c:79)[edward-1645]:
> WARNING: block 79 (/dev/sda7): bad checksum. Please, scrub the volume.
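
The behaviour seen above (bad checksum on the original, mount still
succeeds) can be sketched as a read path that verifies each copy and
falls back to the next mirror. This is an illustration only: the
4-byte checksum prefix and zlib.crc32 are placeholders, since the
announcement does not describe the node41 checksum algorithm or
layout, and all function names are hypothetical.

```python
import zlib


def make_block(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte checksum (placeholder layout)."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def checksum_ok(block: bytes) -> bool:
    stored = int.from_bytes(block[:4], "little")
    return zlib.crc32(block[4:]) == stored


def read_with_failover(mirrors, block_nr):
    """Return (payload, warnings) from the first mirror whose copy verifies.

    mirrors is a list of (name, device) pairs, original first; each
    device is a dict mapping block number -> raw block bytes.
    """
    warnings = []
    for name, device in mirrors:
        block = device[block_nr]
        if checksum_ok(block):
            return block[4:], warnings
        warnings.append(f"block {block_nr} ({name}): bad checksum")
    raise IOError(f"block {block_nr}: no mirror holds a valid copy")
```

With a zeroed block 79 on the original and an intact copy on the
replica, the read returns the replica's payload together with one
warning naming the original, much like the kernel message above.
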
>
>
>                              TODO
>
>
> 1) Mirror tools (upgrade/downgrade a mirror array, swap original and
>     specified replica, convert replica to an original, visualization
>     of mirror arrays, etc.);
> 2) Scrub (online background checking and synchronization of mirrors);
> 3) Checksumming format super-block;
> 4) Issuing discard requests for replicas on SSD devices.
>
> All items are very simple to implement. If anyone cares, then I'll
> provide details.
>
> Thanks,
> Edward.

