* [RFC][PATCH] Multiple mount protection
@ 2007-05-21 19:52 Kalpak Shah
2007-05-22 7:15 ` Manoj Joseph
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Kalpak Shah @ 2007-05-21 19:52 UTC (permalink / raw)
To: linux-ext4; +Cc: Andreas Dilger
Hi,
There have been reported instances of a filesystem having been mounted in 2 places at the same time, causing a lot of damage to the filesystem. This patch reserves superblock fields and an INCOMPAT flag for adding multiple mount protection (MMP) support within the ext4 filesystem itself.
The superblock will have a block number (s_mmp_block) pointing to a block which holds an MMP structure. That structure has a sequence number which a mounted filesystem updates every 5 seconds. Whenever a filesystem is mounted, it will wait for s_mmp_interval seconds to make sure that the MMP sequence does not change. To make further sure, we write a random sequence number into the MMP block and wait for another s_mmp_interval seconds. If the sequence number doesn't change, the mount will succeed. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated will be displayed. tune2fs can be used to set s_mmp_interval as desired.
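The mount-time check described above can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: struct mmp_dev, mmp_wait_fn, mmp_mount_check() and the two wait helpers are hypothetical names that do not appear in the patch, the "device" is a plain struct in memory, and the wait is a callback instead of a real sleep so that a competing mounter can be simulated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the on-disk MMP block: only the sequence number. */
struct mmp_dev {
	uint32_t seq;
};

/*
 * Hypothetical callback standing in for "sleep for s_mmp_interval seconds".
 * A simulation can use it to model another node touching the block meanwhile.
 */
typedef void (*mmp_wait_fn)(struct mmp_dev *dev, unsigned interval_secs);

/* Returns true if the mount may proceed, false if another user was detected. */
static bool mmp_mount_check(struct mmp_dev *dev, unsigned interval_secs,
			    uint32_t random_seq, mmp_wait_fn wait)
{
	uint32_t seen;

	assert(dev && wait);

	seen = dev->seq;		/* remember the current sequence */
	wait(dev, interval_secs);	/* first wait */
	if (dev->seq != seen)
		return false;		/* somebody else is updating the block */

	dev->seq = random_seq;		/* write our own random value */
	wait(dev, interval_secs);	/* second wait */
	return dev->seq == random_seq;	/* unchanged means no competing mounter */
}

/* Simulated wait during which nobody else touches the block. */
static void wait_idle(struct mmp_dev *dev, unsigned secs)
{
	(void)dev;
	(void)secs;
}

/* Simulated wait during which another mounted node bumps the sequence. */
static void wait_other_node_active(struct mmp_dev *dev, unsigned secs)
{
	(void)secs;
	dev->seq++;
}
```

With wait_idle the check succeeds and leaves the random value in the block; with wait_other_node_active it fails at the first comparison.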
It will also protect against running e2fsck on a mounted filesystem by adding similar logic to ext2fs_open().
Any comments or views are welcome!
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Index: e2fsprogs-1.40/lib/ext2fs/ext2_fs.h
===================================================================
--- e2fsprogs-1.40.orig/lib/ext2fs/ext2_fs.h
+++ e2fsprogs-1.40/lib/ext2fs/ext2_fs.h
@@ -568,8 +568,9 @@ struct ext2_super_block {
__u16 s_want_extra_isize; /* New inodes should reserve # bytes */
__u32 s_flags; /* Miscellaneous flags */
__u16 s_raid_stride; /* RAID stride */
- __u16 s_pad; /* Padding */
- __u32 s_reserved[166]; /* Padding to the end of the block */
+ __u16 s_mmp_interval; /* Wait for # seconds in MMP checking */
+ __u64 s_mmp_block; /* Block for multi-mount protection */
+ __u32 s_reserved[164]; /* Padding to the end of the block */
};
/*
@@ -631,10 +632,12 @@ struct ext2_super_block {
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
+#define EXT4_FEATURE_INCOMPAT_MMP 0x0100
#define EXT2_FEATURE_COMPAT_SUPP 0
-#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE)
+#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+ EXT4_FEATURE_INCOMPAT_MMP)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
EXT2_FEATURE_RO_COMPAT_BTREE_DIR)
Thanks,
Kalpak.
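The size bookkeeping in the hunk above (s_pad reused for s_mmp_interval, two s_reserved words given up for the 64-bit s_mmp_block) can be checked with a small compile-time sketch. The struct below is a hypothetical model, not the real header: everything before s_flags is collapsed into an opaque byte array, and the 0x160 offset of s_flags is taken from the /*160*/ comment in the kernel patch posted later in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the superblock tail; earlier fields are opaque. */
struct sb_tail_model {
	uint8_t  head[0x160];		/* all fields preceding s_flags */
	uint32_t s_flags;		/* Miscellaneous flags */
	uint16_t s_raid_stride;		/* RAID stride */
	uint16_t s_mmp_interval;	/* occupies the old s_pad slot */
	uint64_t s_mmp_block;		/* replaces two s_reserved words */
	uint32_t s_reserved[164];	/* was 166 before the patch */
};

/* The new fields must not shift anything or change the 1024-byte size. */
static_assert(offsetof(struct sb_tail_model, s_mmp_interval) == 0x166,
	      "s_mmp_interval should sit where s_pad was");
static_assert(offsetof(struct sb_tail_model, s_mmp_block) == 0x168,
	      "s_mmp_block should start where s_reserved used to");
static_assert(sizeof(struct sb_tail_model) == 1024,
	      "superblock must stay 1024 bytes");
```

Because 0x168 is a multiple of 8, the 64-bit field needs no implicit padding in this model.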
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-21 19:52 [RFC][PATCH] Multiple mount protection Kalpak Shah
@ 2007-05-22 7:15 ` Manoj Joseph
2007-05-22 7:34 ` Kalpak Shah
2007-05-25 14:39 ` Theodore Tso
2007-06-01 8:46 ` Andi Kleen
2 siblings, 1 reply; 21+ messages in thread
From: Manoj Joseph @ 2007-05-22 7:15 UTC (permalink / raw)
To: Kalpak Shah; +Cc: linux-ext4, Andreas Dilger
Kalpak Shah wrote:
> Hi,
>
> There have been reported instances of a filesystem having been
> mounted at 2 places at the same time causing a lot of damage to the
> filesystem. This patch reserves superblock fields and an INCOMPAT
> flag for adding multiple mount protection(MMP) support within the
> ext4 filesystem itself. The superblock will have a block number
> (s_mmp_block) which will hold a MMP structure which has a sequence
> number which will be periodically updated every 5 seconds by a
> mounted filesystem. Whenever a filesystem will be mounted it will
> wait for s_mmp_interval seconds to make sure that the MMP sequence
> does not change. To further make sure, we write a random sequence
> number into the MMP block and wait for another s_mmp_interval secs.
> If the sequence no. doesn't change then the mount will succeed. In
> case of failure, the nodename, bdevname and the time at which the MMP
> block was last updated will be displayed. tune2fs can be used to set
> s_mmp_interval as desired.
What would the default value of s_mmp_interval be? 5 seconds? more?
If I am not reading this wrong a mount will take more than
's_mmp_interval' seconds to complete. Wouldn't this be too much of a
penalty during boot up if the system has many 'mount at boot' filesystems?
Also, I am curious about this. Is there a test case for mounting the
same filesystem multiple times? Does this use different paths to reach
the device? Or is there a race? Or does it happen on a device shared by
multiple hosts?
-Manoj
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-22 7:15 ` Manoj Joseph
@ 2007-05-22 7:34 ` Kalpak Shah
2007-05-22 7:53 ` Manoj Joseph
2007-05-24 23:25 ` Karel Zak
0 siblings, 2 replies; 21+ messages in thread
From: Kalpak Shah @ 2007-05-22 7:34 UTC (permalink / raw)
To: Manoj Joseph; +Cc: linux-ext4, Andreas Dilger
On Tue, 2007-05-22 at 12:45 +0530, Manoj Joseph wrote:
> Kalpak Shah wrote:
> > Hi,
> >
> > There have been reported instances of a filesystem having been
> > mounted at 2 places at the same time causing a lot of damage to the
> > filesystem. This patch reserves superblock fields and an INCOMPAT
> > flag for adding multiple mount protection(MMP) support within the
> > ext4 filesystem itself. The superblock will have a block number
> > (s_mmp_block) which will hold a MMP structure which has a sequence
> > number which will be periodically updated every 5 seconds by a
> > mounted filesystem. Whenever a filesystem will be mounted it will
> > wait for s_mmp_interval seconds to make sure that the MMP sequence
> > does not change. To further make sure, we write a random sequence
> > number into the MMP block and wait for another s_mmp_interval secs.
> > If the sequence no. doesn't change then the mount will succeed. In
> > case of failure, the nodename, bdevname and the time at which the MMP
> > block was last updated will be displayed. tune2fs can be used to set
> > s_mmp_interval as desired.
>
> What would the default value of s_mmp_interval be? 5 seconds? more?
I have set the default value to 6 seconds. Depending on specific
conditions (hardware, etc.) it can be increased using tune2fs.
>
> If I am not reading this wrong a mount will take more than
> 's_mmp_interval' seconds to complete. Wouldn't this be too much of a
> penalty during boot up if the system has many 'mount at boot' filesystems?
Yes, it may take a maximum of s_mmp_interval*2 seconds to mount a
filesystem which has the INCOMPAT_MMP feature set. It's up to the user
whether to use this feature; if the penalty is too large, the feature
can simply be left disabled. It will mostly be used for filesystems in
failover scenarios.
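As a rough illustration of that cost, the worst case can be computed as below. This is a back-of-the-envelope sketch, assuming the two wait phases described in the original posting and filesystems mounted one after another; MMP_WAIT_PHASES and mmp_worst_case_delay() are hypothetical names.

```c
#include <assert.h>

/* One wait before and one wait after writing the random sequence. */
enum { MMP_WAIT_PHASES = 2 };

/* Worst-case added mount time, in seconds, for sequentially mounted filesystems. */
static unsigned mmp_worst_case_delay(unsigned interval_secs,
				     unsigned nr_filesystems)
{
	assert(interval_secs > 0);
	return MMP_WAIT_PHASES * interval_secs * nr_filesystems;
}
```

With the 6 second default this gives 12 seconds for one filesystem. Note that the WIP kernel patch posted later in this thread sets wait_interval to 2 * s_mmp_interval and sleeps for it twice, so that implementation may wait up to four intervals.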
>
> Also, I am curious about this. Is there a test case for mounting the
> same filesystem multiple times? Does this use different paths to reach
> the device? Or is there a race? Or does it happen on a device shared by
> multiple hosts?
>
If you are using some HA software, there is the possibility of a race.
Yes it can happen on a device shared by multiple hosts.
A simple test case for this will be:
$ dd if=/dev/zero of=img0 bs=1M count=256
$ mke2fs -F -j img0
$ ln img0 img1
$ losetup /dev/loop0 img0
$ losetup /dev/loop1 img1
$ mount /dev/loop0 /mnt/loop0
$ mount /dev/loop1 /mnt/loop1
This succeeds currently causing a multiple mount.
Thanks,
Kalpak.
> -Manoj
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-22 7:34 ` Kalpak Shah
@ 2007-05-22 7:53 ` Manoj Joseph
2007-05-22 8:06 ` Kalpak Shah
2007-05-24 23:25 ` Karel Zak
1 sibling, 1 reply; 21+ messages in thread
From: Manoj Joseph @ 2007-05-22 7:53 UTC (permalink / raw)
To: Kalpak Shah; +Cc: linux-ext4, Andreas Dilger
Kalpak Shah wrote:
>> Also, I am curious about this. Is there a test case for mounting the
>> same filesystem multiple times? Does this use different paths to reach
>> the device? Or is there a race? Or does it happen on a device shared by
>> multiple hosts?
>>
>
> If you are using some HA software, there is the possibility of a race.
> Yes it can happen on a device shared by multiple hosts.
Ah, if the HA-software doesn't deal with multiple mounts for filesystems
it is managing, then I would claim that the software is flawed. :)
But yes, turning on MMP would help.
It might also help to make the frequency at which sequence number gets
updated (currently 5 sec) tunable. Would making that also a field in the
super block be a bad idea (set only by mkfs/tunefs)?
It might also be worthwhile to write the dev_t, the path of the device
and the hostname to the s_mmp_block, along with the random sequence. (I
assume there is enough space.) If the mount is being failed because of a
multiple mount scenario, these fields could be used to provide useful
diagnostics.
My $ 0.02. :)
Regards,
Manoj
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-22 7:53 ` Manoj Joseph
@ 2007-05-22 8:06 ` Kalpak Shah
0 siblings, 0 replies; 21+ messages in thread
From: Kalpak Shah @ 2007-05-22 8:06 UTC (permalink / raw)
To: Manoj Joseph; +Cc: linux-ext4, Andreas Dilger
On Tue, 2007-05-22 at 13:23 +0530, Manoj Joseph wrote:
> Kalpak Shah wrote:
>
> >> Also, I am curious about this. Is there a test case for mounting the
> >> same filesystem multiple times? Does this use different paths to reach
> >> the device? Or is there a race? Or does it happen on a device shared by
> >> multiple hosts?
> >>
> >
> > If you are using some HA software, there is the possibility of a race.
> > Yes it can happen on a device shared by multiple hosts.
>
> Ah, if the HA-software doesn't deal with multiple mounts for filesystems
> it is managing, then I would claim that the software is flawed. :)
Well, it is known to happen so it wouldn't be bad to make sure.
>
> But yes, turning on MMP would help.
>
> It might also help to make the frequency at which sequence number gets
> updated (currently 5 sec) tunable. Would making that also a field in the
> super block be a bad idea (set only by mkfs/tunefs)?
Updating the MMP sequence too often would hurt the filesystem
performance.
>
> It might also be worthwhile to write the dev_t, the path of the device
> and the hostname to the s_mmp_block, along with the random sequence. (I
> assume there is enough space.) If the mount is being failed because of a
> multiple mount scenario, these fields could be used to provide useful
> diagnostics.
Yes, the dev_t, host name, the sequence and the time last updated would
all be printed. There is lots of space since we have an entire block.
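A minimal sketch of such a diagnostic, assuming the fields have already been read from the MMP block and converted to host byte order. struct mmp_info and mmp_format_report() are hypothetical names, and the buffer sizes (64 and 32 bytes) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical host-endian copy of the diagnostic fields in the MMP block. */
struct mmp_info {
	uint32_t seq;		/* last sequence number written */
	uint64_t time;		/* seconds since the epoch at last update */
	char     nodename[64];	/* host that last updated the block */
	char     bdevname[32];	/* block device name on that host */
};

/* Format a one-line report into buf; returns what snprintf() returns. */
static int mmp_format_report(char *buf, size_t len, const struct mmp_info *mi)
{
	return snprintf(buf, len,
			"MMP: seq=%u last_update=%llu node=%s bdev=%s",
			(unsigned)mi->seq, (unsigned long long)mi->time,
			mi->nodename, mi->bdevname);
}
```

The strings are assumed to be NUL-terminated; code reading them from disk would need to enforce that before printing.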
> My $ 0.02. :)
Thanks. :)
>
> Regards,
> Manoj
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-22 7:34 ` Kalpak Shah
2007-05-22 7:53 ` Manoj Joseph
@ 2007-05-24 23:25 ` Karel Zak
2007-05-25 6:44 ` Kalpak Shah
1 sibling, 1 reply; 21+ messages in thread
From: Karel Zak @ 2007-05-24 23:25 UTC (permalink / raw)
To: Kalpak Shah; +Cc: Manoj Joseph, linux-ext4, Andreas Dilger
On Tue, May 22, 2007 at 01:04:42PM +0530, Kalpak Shah wrote:
> On Tue, 2007-05-22 at 12:45 +0530, Manoj Joseph wrote:
> > Kalpak Shah wrote:
> > > Hi,
> > >
> > > There have been reported instances of a filesystem having been
> > > mounted at 2 places at the same time causing a lot of damage to the
> > > filesystem. This patch reserves superblock fields and an INCOMPAT
> > > flag for adding multiple mount protection(MMP) support within the
> > > ext4 filesystem itself. The superblock will have a block number
> > > (s_mmp_block) which will hold a MMP structure which has a sequence
> > > number which will be periodically updated every 5 seconds by a
> > > mounted filesystem. Whenever a filesystem will be mounted it will
> > > wait for s_mmp_interval seconds to make sure that the MMP sequence
> > > does not change. To further make sure, we write a random sequence
> > > number into the MMP block and wait for another s_mmp_interval secs.
> > > If the sequence no. doesn't change then the mount will succeed. In
> > > case of failure, the nodename, bdevname and the time at which the MMP
> > > block was last updated will be displayed. tune2fs can be used to set
> > > s_mmp_interval as desired.
Frankly, I don't understand why we need this feature. The filesystem
limitations (=not ready for clusters) should be described in docs.
That's enough from my POV...
> >
> > What would the default value of s_mmp_interval be? 5 seconds? more?
>
> I have set the default value to 6 seconds. Depending on specific
> conditions (hardware, etc.) it can be increased using tunefs.
> >
> > If I am not reading this wrong a mount will take more than
> > 's_mmp_interval' seconds to complete. Wouldn't this be too much of a
> > penalty during boot up if the system has many 'mount at boot' filesystems?
>
> Yes it may take a maximum of s_mmp_interval*2 seconds to mount a
> filesystem which has INCOMPAT_MMP feature set. Its up to the user to use
> this feature, if he finds the penalty is too large, he can do away with
> this feature. This feature will mostly be used for filesystems used in
> failover scenarios.
I hope the feature will be disabled by default. It sounds strange
that I have to wait 6 secs to mount a FS if I (and 99% of Linux users)
don't need to share the same FS between two mountpoints.
I have 5 filesystems on my workstation = 30 secs penalty during boot?!
> > Also, I am curious about this. Is there a test case for mounting the
> > same filesystem multiple times? Does this use different paths to reach
> > the device? Or is there a race? Or does it happen on a device shared by
> > multiple hosts?
>
> If you are using some HA software, there is the possibility of a race.
> Yes it can happen on a device shared by multiple hosts.
That's the reason why people use OCFS or GFS.
> A simple test case for this will be:
> $ dd if=/dev/zero of=img0 bs=1M count=256
> $ mke2fs -F -j img0
> $ ln img0 img1
> $ losetup /dev/loop0 img0
> $ losetup /dev/loop1 img1
> $ mount /dev/loop0 /mnt/loop0
> $ mount /dev/loop1 /mnt/loop1
>
> This succeeds currently causing a multiple mount.
And what? That's wrong FS usage.
Karel
--
Karel Zak <kzak@redhat.com>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-24 23:25 ` Karel Zak
@ 2007-05-25 6:44 ` Kalpak Shah
0 siblings, 0 replies; 21+ messages in thread
From: Kalpak Shah @ 2007-05-25 6:44 UTC (permalink / raw)
To: Karel Zak; +Cc: Manoj Joseph, linux-ext4, Andreas Dilger
On Fri, 2007-05-25 at 01:25 +0200, Karel Zak wrote:
> Frankly, I don't understand why we need this feature. The filesystem
> limitations (=not ready for clusters) should be described in docs.
> That's enough from my POV...
It is strongly recommended that an ext3/4 filesystem not be mounted in
more than one place at once. This feature just makes doubly sure of
that, and only if the user asks for it.
> > >
> > > What would the default value of s_mmp_interval be? 5 seconds? more?
> >
> > I have set the default value to 6 seconds. Depending on specific
> > conditions (hardware, etc.) it can be increased using tunefs.
> > >
> > > If I am not reading this wrong a mount will take more than
> > > 's_mmp_interval' seconds to complete. Wouldn't this be too much of a
> > > penalty during boot up if the system has many 'mount at boot' filesystems?
> >
> > Yes it may take a maximum of s_mmp_interval*2 seconds to mount a
> > filesystem which has INCOMPAT_MMP feature set. Its up to the user to use
> > this feature, if he finds the penalty is too large, he can do away with
> > this feature. This feature will mostly be used for filesystems used in
> > failover scenarios.
>
> I hope the feature will be disabled by default. It sounds strange
> that I have to way 6 secs to mount a FS if I (and 99% of Linux users)
> needn't to share same FS between two mountpoint.
>
> I have 5 filesystems on my workstation = 30 secs penality during boot?!
This feature won't be enabled by default. It's entirely at the user's
discretion whether to enable it. It can be set by tune2fs
and can be disabled without unmounting the filesystem. So you won't have
to waste time during mounting unless you choose to.
>
> > > Also, I am curious about this. Is there a test case for mounting the
> > > same filesystem multiple times? Does this use different paths to reach
> > > the device? Or is there a race? Or does it happen on a device shared by
> > > multiple hosts?
> >
> > If you are using some HA software, there is the possibility of a race.
> > Yes it can happen on a device shared by multiple hosts.
>
> That's reason why people use OCFS or GFS.
OCFS and GFS are clustered file systems and hence provide read-write
support at multiple mount points.
Note that the MMP feature will also ensure that you can't run e2fsck on a
mounted filesystem. So in short the filesystem cannot be opened in
read-write mode by more than one entity.
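One way to picture how a mount and e2fsck can exclude each other is through reserved sequence values. The sketch below borrows the two reserved constants and the wrap-around rule from the WIP kernel patch posted later in this thread; the enum and the function names mmp_classify() and mmp_next_seq() are hypothetical and only meant as an illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reserved mmp_seq values, as defined in the WIP kernel patch. */
#define MMP_SEQ_CLEAN	0xFF4D4D50u	/* written on clean unmount */
#define MMP_SEQ_FSCK_ON	0xE24D4D50u	/* written while fsck is running */

enum mmp_state {
	MMP_STATE_CLEAN,	/* safe to skip the first wait */
	MMP_STATE_FSCK,		/* refuse: fsck owns the filesystem */
	MMP_STATE_MAYBE_ACTIVE	/* must wait and re-read to find out */
};

/* Decide what a freshly read sequence number means to a would-be mounter. */
static enum mmp_state mmp_classify(uint32_t seq)
{
	if (seq == MMP_SEQ_CLEAN)
		return MMP_STATE_CLEAN;
	if (seq == MMP_SEQ_FSCK_ON)
		return MMP_STATE_FSCK;
	return MMP_STATE_MAYBE_ACTIVE;
}

/* Periodic update: advance the sequence but never reach the reserved range. */
static uint32_t mmp_next_seq(uint32_t seq)
{
	if (++seq >= MMP_SEQ_FSCK_ON)
		seq = 1;
	return seq;
}
```

Any ordinary sequence value only tells the reader that the filesystem might be in use, which is why the wait-and-compare step is still needed.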
>
> > A simple test case for this will be:
> > $ dd if=/dev/zero of=img0 bs=1M count=256
> > $ mke2fs -F -j img0
> > $ ln img0 img1
> > $ losetup /dev/loop0 img0
> > $ losetup /dev/loop1 img1
> > $ mount /dev/loop0 /mnt/loop0
> > $ mount /dev/loop1 /mnt/loop1
> >
> > This succeeds currently causing a multiple mount.
>
> And what? That's wrong FS usage.
Here I had just described a test case for reproducing multiple mounts.
Thanks,
Kalpak.
>
> Karel
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-21 19:52 [RFC][PATCH] Multiple mount protection Kalpak Shah
2007-05-22 7:15 ` Manoj Joseph
@ 2007-05-25 14:39 ` Theodore Tso
2007-05-25 19:31 ` Jim Garlick
2007-05-25 21:36 ` Kalpak Shah
2007-06-01 8:46 ` Andi Kleen
2 siblings, 2 replies; 21+ messages in thread
From: Theodore Tso @ 2007-05-25 14:39 UTC (permalink / raw)
To: Kalpak Shah; +Cc: linux-ext4, Andreas Dilger
Hi Kalpak,
On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
> It will also protect against running e2fsck on a mounted filesystem
> by adding similar logic to ext2fs_open().
Your patch didn't add this logic to ext2fs_open(); it just reserved
the space in the superblock.
I don't mind reserving the space so we don't have to worry about
conflicting superblock uses, but I'm still on the fence about actually
adding this functionality (a) into e2fsprogs, and (b) into the ext4
kernel code. I guess it depends on how complicated/icky the
implementation code is. The question as before is whether
the complexity is worth it, given that someone who is actually going
to be subject to accidentally mounting an ext3/4 filesystem on
multiple systems needs to be using an HA system anyway. So basically
this is just to protect against (a) a bug/failure in the HA subsystem,
and (b) the idiotic user that failed to realize he/she needed to set
up an HA subsystem in the first place. Granted, the universe is going
to create idiots at a faster rate than we can deal with it, but that's
why I'm still not 100% convinced the complexity is worth it.
To be fair, if I was on a L3 support team having to deal with these
idiots, I'd probably feel differently. :-)
- Ted
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-25 14:39 ` Theodore Tso
@ 2007-05-25 19:31 ` Jim Garlick
2007-05-25 21:36 ` Kalpak Shah
1 sibling, 0 replies; 21+ messages in thread
From: Jim Garlick @ 2007-05-25 19:31 UTC (permalink / raw)
To: Theodore Tso; +Cc: Kalpak Shah, linux-ext4, Andreas Dilger
Hi Ted,
For what it's worth, we have several petabytes of data residing in
ext3 file systems, a large staff of mainly non-idiots, and HA s/w,
and I still feel strongly that multi-mount protection is a good idea.
People, software, and hardware all malfunction in myriad ways, and the
more you have, the greater the odds (or so it seems to us). This
relatively simple safeguard at the fs level has high value IMHO.
Regards,
Jim
On Fri, 25 May 2007, Theodore Tso wrote:
> Hi Kalpak,
>
> On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
>> It will also protect against running e2fsck on a mounted filesystem
>> by adding similar logic to ext2fs_open().
>
> Your patch didn't add this logic to ext2fs_open(); it just reserved
> the space in the superblock.
>
> I don't mind reserving the space so we don't have to worry about
> conflicting superblock uses, but I'm still on the fence about actually
> adding this functionality (a) into e2fsprogs, and (b) into the ext4
> kernel code. I guess it depends on how complicated/icky the
> implementation code is, I guess. The question as before is whether
> the complexity is worth it, given that someone who is actually going
> to be subject to accidentally mounting an ext3/4 filesystem on
> multiple systems needs to be using an HA system anyway. So basically
> this is just to protect against (a) a bug/failure in the HA subsystem,
> and (b) the idiotic user that failed to realized he/she needed to set
> up an HA subsystem in the first place. Granted, the universe is going
> to create idiots at a faster rate that we can deal with it, but that's
> why I'm still not 100% convinced the complexity is worth it.
>
> To be fair, if I was on a L3 support team having to deal with these
> idiots, I'd probably feel differently. :-)
>
> - Ted
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC][PATCH] Multiple mount protection
2007-05-25 14:39 ` Theodore Tso
2007-05-25 19:31 ` Jim Garlick
@ 2007-05-25 21:36 ` Kalpak Shah
2007-05-30 20:58 ` Kalpak Shah
1 sibling, 1 reply; 21+ messages in thread
From: Kalpak Shah @ 2007-05-25 21:36 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4, Andreas Dilger
[-- Attachment #1: Type: text/plain, Size: 2040 bytes --]
Hi Ted,
On Fri, 2007-05-25 at 10:39 -0400, Theodore Tso wrote:
> Hi Kalpak,
>
> On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
> > It will also protect against running e2fsck on a mounted filesystem
> > by adding similar logic to ext2fs_open().
>
> Your patch didn't add this logic to ext2fs_open(); it just reserved
> the space in the superblock.
Yeah, the earlier patch was just for reserving the fields.
>
> I don't mind reserving the space so we don't have to worry about
> conflicting superblock uses, but I'm still on the fence about actually
> adding this functionality (a) into e2fsprogs, and (b) into the ext4
> kernel code. I guess it depends on how complicated/icky the
> implementation code is, I guess.
I am attaching the kernel and e2fsprogs patches so that you can point out
any shortcomings in the implementation. These patches are still a WIP.
> The question as before is whether
> the complexity is worth it, given that someone who is actually going
> to be subject to accidentally mounting an ext3/4 filesystem on
> multiple systems needs to be using an HA system anyway. So basically
> this is just to protect against (a) a bug/failure in the HA subsystem,
> and (b) the idiotic user that failed to realized he/she needed to set
> up an HA subsystem in the first place. Granted, the universe is going
> to create idiots at a faster rate that we can deal with it, but that's
> why I'm still not 100% convinced the complexity is worth it.
Given the amount of damage that multiple mounts can cause to the
filesystem, it would be desirable to make doubly sure. Also the MMP
feature is quite uncomplicated and absolutely tunable.
Thanks for your views.
- Kalpak.
>
> To be fair, if I was on a L3 support team having to deal with these
> idiots, I'd probably feel differently. :-)
>
> - Ted
[-- Attachment #2: mmp.patch --]
[-- Type: text/x-patch, Size: 10099 bytes --]
Index: linux-2.6.19/fs/ext4/super.c
===================================================================
--- linux-2.6.19.orig/fs/ext4/super.c
+++ linux-2.6.19/fs/ext4/super.c
@@ -35,6 +35,8 @@
#include <linux/namei.h>
#include <linux/quotaops.h>
#include <linux/seq_file.h>
+#include <linux/kthread.h>
+#include <linux/utsname.h>
#include <asm/uaccess.h>
@@ -481,6 +483,9 @@ static void ext4_put_super (struct super
invalidate_bdev(sbi->journal_bdev, 0);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mmp_tsk)
+ kthread_stop(sbi->s_mmp_tsk);
+
sb->s_fs_info = NULL;
kfree(sbi);
return;
@@ -1441,6 +1446,223 @@ static ext4_fsblk_t descriptor_loc(struc
return (has_super + ext4_group_first_block_no(sb, bg));
}
+static inline
+int write_mmp_block(struct super_block *sb, struct buffer_head *bh,
+ const char *bdev_name)
+{
+ int retval;
+
+ mark_buffer_dirty(bh);
+ retval = sync_dirty_buffer(bh);
+ if (retval)
+ ext4_error(sb, "write_mmp_block",
+ "Error writing to MMP block.");
+
+ return retval;
+}
+
+static inline
+int read_mmp_block(struct super_block *sb, struct buffer_head **bh,
+ ext4_fsblk_t mmp_block)
+{
+ if (*bh)
+ clear_buffer_uptodate(*bh);
+
+ *bh = sb_bread(sb, mmp_block);
+ if (!*bh) {
+ ext4_warning(sb, "read_mmp_block",
+ "Error while reading MMP block %llu", mmp_block);
+ return -1;
+ }
+
+ return 0;
+}
+
+/*
+ * kmmpd will update the MMP sequence every s_mmp_interval seconds
+ */
+static int kmmpd(void *data)
+{
+ struct super_block *sb = (struct super_block *) data;
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+ struct buffer_head *bh = NULL;
+ struct mmp_struct *mmp;
+ ext4_fsblk_t mmp_block;
+ u32 seq = 0;
+ unsigned long failed_writes = 0;
+ int retval;
+ int mmp_interval = le16_to_cpu(es->s_mmp_interval);
+
+ mmp_block = le64_to_cpu(es->s_mmp_block);
+ retval = read_mmp_block(sb, &bh, mmp_block);
+ if (retval)
+ goto failed;
+
+ mmp = (struct mmp_struct *)(bh->b_data);
+ mmp->mmp_magic = cpu_to_le32(EXT4_MMP_MAGIC);
+ mmp->mmp_time = cpu_to_le64(get_seconds());
+ mmp->mmp_interval = cpu_to_le16(mmp_interval);
+ bdevname(bh->b_bdev, mmp->mmp_bdevname);
+
+ down_read(&uts_sem);
+ memcpy(mmp->mmp_nodename, init_uts_ns.name.nodename, 64);
+ up_read(&uts_sem);
+
+ while (!kthread_should_stop()) {
+ if (++seq >= EXT4_MMP_FSCK_ON)
+ seq = 1;
+
+ mmp->mmp_seq = cpu_to_le32(seq);
+ mmp->mmp_time = cpu_to_le64(get_seconds());
+
+ retval = write_mmp_block(sb, bh, mmp->mmp_bdevname);
+ /*
+ * Don't spew too many error messages. Print one every
+ * (s_mmp_interval * 60) seconds.
+ */
+ if (retval) {
+ if ((failed_writes++ % 60) == 0)
+ ext4_warning(sb, "kmmpd",
+ "Error writing to MMP block");
+ }
+
+ if (!(le32_to_cpu(es->s_feature_incompat) &
+ EXT4_FEATURE_INCOMPAT_MMP)) {
+ ext4_warning(sb, "kmmpd", "kmmpd being stopped "
+ "since MMP feature has been "
+ "disabled.");
+ goto failed;
+ }
+
+ if (sb->s_flags & MS_RDONLY) {
+ ext4_warning(sb, "kmmpd", "kmmpd being stopped since "
+ "filesystem has been remounted as readonly.");
+ goto failed;
+ }
+
+ schedule_timeout_interruptible(mmp_interval * HZ);
+ }
+
+ /* Unmount seems to be clean */
+ mmp->mmp_seq = cpu_to_le32(EXT4_MMP_CLEAN);
+ mmp->mmp_time = cpu_to_le64(get_seconds());
+
+ retval = write_mmp_block(sb, bh, mmp->mmp_bdevname);
+
+failed:
+ brelse(bh);
+ return 0;
+}
+
+void dump_mmp_msg(struct super_block *sb, struct mmp_struct *mmp,
+ const char *function, const char *msg)
+{
+ ext4_warning(sb, function, msg);
+ ext4_warning(sb, function, "Dumping MMP information:\n"
+ "Time last updated: %llu\n"
+ "Last node which updated MMP: %s\n"
+ "Last block device which updated MMP: %s\n",
+ le64_to_cpu(mmp->mmp_time), mmp->mmp_nodename,
+ mmp->mmp_bdevname);
+}
+
+/*
+ * Protect the filesystem from being mounted more than once.
+ */
+static int ext4_multi_mount_protect(struct super_block *sb,
+ ext4_fsblk_t mmp_block)
+{
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+ struct buffer_head *bh = NULL;
+ struct mmp_struct *mmp = NULL;
+ u32 seq;
+ unsigned int wait_interval = 2 * le16_to_cpu(es->s_mmp_interval);
+ int retval;
+
+ if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
+ mmp_block >= ext4_blocks_count(EXT4_SB(sb)->s_es)) {
+ ext4_warning(sb, "ext4_multi_mount_protect",
+ "Invalid MMP block in superblock");
+ goto failed;
+ }
+
+ retval = read_mmp_block(sb, &bh, mmp_block);
+ if (retval)
+ goto failed;
+
+ mmp = (struct mmp_struct *)(bh->b_data);
+ if (le32_to_cpu(mmp->mmp_magic) != EXT4_MMP_MAGIC) {
+ ext4_error(sb, "ext4_multi_mount_protect",
+ "Invalid magic number in MMP block");
+ goto failed;
+ }
+
+ if (le16_to_cpu(es->s_mmp_interval) == 0)
+ es->s_mmp_interval = cpu_to_le16(EXT4_MMP_DEF_INTERVAL);
+
+ seq = le32_to_cpu(mmp->mmp_seq);
+ if (seq == EXT4_MMP_CLEAN)
+ goto skip;
+
+ if (seq == EXT4_MMP_FSCK_ON) {
+ dump_mmp_msg(sb, mmp, "ext4_multi_mount_protect",
+ "fsck is running on the filesystem");
+ goto failed;
+ }
+
+ /* wait for MMP interval and check seq again */
+ schedule_timeout_uninterruptible(HZ * wait_interval);
+
+ retval = read_mmp_block(sb, &bh, mmp_block);
+ if (retval)
+ goto failed;
+ mmp = (struct mmp_struct *)(bh->b_data);
+ if (seq != le32_to_cpu(mmp->mmp_seq)) {
+ dump_mmp_msg(sb, mmp, "ext4_multi_mount_protect",
+ "Device is already active on another node.");
+ goto failed;
+ }
+
+skip:
+ /* write a new random sequence number */
+ get_random_bytes(&seq, sizeof(u32));
+ mmp->mmp_seq = cpu_to_le32(seq);
+ retval = write_mmp_block(sb, bh, sb->s_id);
+ if (retval)
+ goto failed;
+
+ /* wait for MMP interval and check seq again */
+ schedule_timeout_uninterruptible(HZ * wait_interval);
+
+ retval = read_mmp_block(sb, &bh, mmp_block);
+ if (retval)
+ goto failed;
+ mmp = (struct mmp_struct *)(bh->b_data);
+ if (seq != le32_to_cpu(mmp->mmp_seq)) {
+ dump_mmp_msg(sb, mmp, "ext4_multi_mount_protect",
+ "Device is already active on another node.");
+ goto failed;
+ }
+
+ /* Start a kernel thread to update the MMP block periodically */
+ EXT4_SB(sb)->s_mmp_tsk = kthread_run(kmmpd, sb, "kmmpd-%02x:%02x",
+ MAJOR(sb->s_dev), MINOR(sb->s_dev));
+ if (IS_ERR(EXT4_SB(sb)->s_mmp_tsk)) {
+ EXT4_SB(sb)->s_mmp_tsk = NULL;
+ ext4_warning(sb, "ext4_multi_mount_protect",
+ "Unable to create kmmpd thread for %s.", sb->s_id);
+ goto failed;
+ }
+
+ brelse(bh);
+ return 0;
+
+failed:
+ brelse(bh);
+
+ return 1;
+}
+
static int ext4_fill_super (struct super_block *sb, void *data, int silent)
{
@@ -1770,6 +1992,10 @@ static int ext4_fill_super (struct super
EXT4_HAS_INCOMPAT_FEATURE(sb,
EXT4_FEATURE_INCOMPAT_RECOVER));
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_MMP))
+ if (ext4_multi_mount_protect(sb, le64_to_cpu(es->s_mmp_block)))
+ goto failed_mount2;
+
/*
* The first inode we look at is the journal inode. Don't try
* root first: it may be modified in the journal!
Index: linux-2.6.19/include/linux/ext4_fs_sb.h
===================================================================
--- linux-2.6.19.orig/include/linux/ext4_fs_sb.h
+++ linux-2.6.19/include/linux/ext4_fs_sb.h
@@ -90,6 +90,8 @@ struct ext4_sb_info {
unsigned long s_ext_extents;
#endif
unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */
+
+ struct task_struct * s_mmp_tsk; /* Kernel thread for multiple mount protection */
};
#endif /* _LINUX_EXT4_FS_SB */
Index: linux-2.6.19/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.19.orig/include/linux/ext4_fs.h
+++ linux-2.6.19/include/linux/ext4_fs.h
@@ -578,10 +578,11 @@ struct ext4_super_block {
__le32 s_free_blocks_count_hi; /* Free blocks count */
__le16 s_min_extra_isize; /* All inodes have at least # bytes */
__le16 s_want_extra_isize; /* New inodes should reserve # bytes */
- __le32 s_flags; /* Miscellaneous flags */
+/*160*/ __le32 s_flags; /* Miscellaneous flags */
__le16 s_raid_stride; /* RAID stride */
- __le16 s_pad; /* Padding */
- __le32 s_reserved[166]; /* Padding to the end of the block */
+ __le16 s_mmp_interval; /* Wait for # seconds in MMP checking */
+ __le64 s_mmp_block; /* Block for multi-mount protection */
+ __le32 s_reserved[164]; /* Padding to the end of the block */
};
#ifdef __KERNEL__
@@ -680,13 +681,15 @@ static inline int ext4_valid_inum(struct
#define EXT4_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT4_FEATURE_INCOMPAT_EXTENTS 0x0040 /* extents support */
#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
+#define EXT4_FEATURE_INCOMPAT_MMP 0x0100
#define EXT4_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT4_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \
EXT4_FEATURE_INCOMPAT_RECOVER| \
EXT4_FEATURE_INCOMPAT_META_BG| \
EXT4_FEATURE_INCOMPAT_EXTENTS| \
- EXT4_FEATURE_INCOMPAT_64BIT)
+ EXT4_FEATURE_INCOMPAT_64BIT| \
+ EXT4_FEATURE_INCOMPAT_MMP)
#define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE| \
@@ -850,6 +853,30 @@ void ext4_get_group_no_and_offset(struct
unsigned long *blockgrpp, ext4_grpblk_t *offsetp);
/*
+ * This structure will be used for multiple mount protection. It will be
+ * written into the block number saved in the s_mmp_block field in the
+ * superblock.
+ */
+#define EXT4_MMP_MAGIC 0x004D4D50 /* ASCII of MMP */
+#define EXT4_MMP_CLEAN 0xFF4D4D50 /* Value of mmp_seq for clean unmount */
+#define EXT4_MMP_FSCK_ON 0xE24D4D50 /* Value of mmp_seq when being fscked */
+struct mmp_struct {
+ __le32 mmp_magic;
+ __le32 mmp_seq;
+ __le64 mmp_time;
+ char mmp_nodename[64];
+ char mmp_bdevname[BDEVNAME_SIZE];
+ __le16 mmp_interval;
+ __le16 mmp_pad1;
+ __le32 mmp_pad2;
+};
+
+/*
+ * Interval in number of seconds to update the MMP sequence number.
+ */
+#define EXT4_MMP_DEF_INTERVAL 5
+
+/*
* Function prototypes
*/
[-- Attachment #3: e2fsprogs-mmp.patch --]
[-- Type: text/x-patch, Size: 24051 bytes --]
Index: e2fsprogs-1.39/lib/e2p/feature.c
===================================================================
--- e2fsprogs-1.39.orig/lib/e2p/feature.c
+++ e2fsprogs-1.39/lib/e2p/feature.c
@@ -67,6 +67,8 @@ static struct feature feature_list[] = {
"extent" },
{ E2P_FEATURE_INCOMPAT, EXT4_FEATURE_INCOMPAT_64BIT,
"64bit" },
+ { E2P_FEATURE_INCOMPAT, EXT4_FEATURE_INCOMPAT_MMP,
+ "mmp" },
{ 0, 0, 0 },
};
Index: e2fsprogs-1.39/lib/ext2fs/ext2_fs.h
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/ext2_fs.h
+++ e2fsprogs-1.39/lib/ext2fs/ext2_fs.h
@@ -570,8 +570,9 @@ struct ext2_super_block {
__u16 s_want_extra_isize; /* New inodes should reserve # bytes */
__u32 s_flags; /* Miscellaneous flags */
__u16 s_raid_stride; /* RAID stride */
- __u16 s_pad; /* Padding */
- __u32 s_reserved[166]; /* Padding to the end of the block */
+ __u16 s_mmp_interval; /* Wait for # seconds in MMP checking */
+ __u64 s_mmp_block; /* Block for multi-mount protection */
+ __u32 s_reserved[164]; /* Padding to the end of the block */
};
/*
@@ -633,10 +634,12 @@ struct ext2_super_block {
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
+#define EXT4_FEATURE_INCOMPAT_MMP 0x0100
#define EXT2_FEATURE_COMPAT_SUPP 0
-#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE)
+#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+ EXT4_FEATURE_INCOMPAT_MMP)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
EXT4_FEATURE_RO_COMPAT_DIR_NLINK| \
@@ -712,4 +715,28 @@ struct ext2_dir_entry_2 {
#define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) & \
~EXT2_DIR_ROUND)
+/*
+ * This structure will be used for multiple mount protection. It will be
+ * written into the block number saved in the s_mmp_block field in the
+ * superblock.
+ */
+#define EXT2_MMP_MAGIC 0x004D4D50 /* ASCII for MMP */
+#define EXT2_MMP_CLEAN 0xFF4D4D50 /* Value of mmp_seq for clean unmount */
+#define EXT2_MMP_FSCK_ON 0xE24D4D50 /* Value of mmp_seq when being fscked */
+struct mmp_struct {
+ __u32 mmp_magic;
+ __u32 mmp_seq;
+ __u64 mmp_time;
+ char mmp_nodename[64];
+ char mmp_bdevname[32];
+ __u16 mmp_interval;
+ __u16 mmp_pad1;
+ __u32 mmp_pad2;
+};
+
+/*
+ * Interval in number of seconds to update the MMP sequence number.
+ */
+#define EXT2_MMP_DEF_INTERVAL 5
+
#endif /* _LINUX_EXT2_FS_H */
Index: e2fsprogs-1.39/lib/ext2fs/ext2fs.h
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/ext2fs.h
+++ e2fsprogs-1.39/lib/ext2fs/ext2fs.h
@@ -190,6 +190,7 @@ typedef struct ext2_file *ext2_file_t;
#define EXT2_FLAG_IMAGE_FILE 0x2000
#define EXT2_FLAG_EXCLUSIVE 0x4000
#define EXT2_FLAG_SOFTSUPP_FEATURES 0x8000
+#define EXT2_FLAG_SKIP_MMP 0x10000
/*
* Special flag in the ext2 inode i_flag field that means that this is
@@ -462,7 +463,8 @@ typedef struct ext2_icount *ext2_icount_
EXT3_FEATURE_INCOMPAT_JOURNAL_DEV|\
EXT2_FEATURE_INCOMPAT_META_BG|\
EXT3_FEATURE_INCOMPAT_RECOVER|\
- EXT3_FEATURE_INCOMPAT_EXTENTS)
+ EXT3_FEATURE_INCOMPAT_EXTENTS|\
+ EXT4_FEATURE_INCOMPAT_MMP)
#endif
#define EXT2_LIB_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER|\
EXT2_FEATURE_RO_COMPAT_LARGE_FILE|\
@@ -991,6 +993,7 @@ extern void ext2fs_swap_inode(ext2_filsy
extern void ext2fs_swap_extent_header(struct ext3_extent_header *eh);
extern void ext2fs_swap_extent_index(struct ext3_extent_idx *ix);
extern void ext2fs_swap_extent(struct ext3_extent *ex);
+extern void ext2fs_swap_mmp(struct mmp_struct *mmp);
/* valid_blk.c */
extern int ext2fs_inode_has_valid_blocks(struct ext2_inode *inode);
Index: e2fsprogs-1.39/misc/tune2fs.c
===================================================================
--- e2fsprogs-1.39.orig/misc/tune2fs.c
+++ e2fsprogs-1.39/misc/tune2fs.c
@@ -60,7 +60,7 @@ char * device_name;
char * new_label, *new_last_mounted, *new_UUID;
char * io_options;
static int c_flag, C_flag, e_flag, f_flag, g_flag, i_flag, l_flag, L_flag;
-static int m_flag, M_flag, r_flag, s_flag = -1, u_flag, U_flag, T_flag;
+static int m_flag, M_flag, r_flag, s_flag = -1, u_flag, U_flag, T_flag, p_flag;
static time_t last_check_time;
static int print_label;
static int max_mount_count, mount_count, mount_flags;
@@ -71,6 +71,7 @@ static unsigned short errors;
static int open_flag;
static char *features_cmd;
static char *mntopts_cmd;
+static unsigned long mmp_interval;
int journal_size, journal_flags;
char *journal_device;
@@ -86,7 +87,8 @@ static void usage(void)
"[-g group]\n"
"\t[-i interval[d|m|w]] [-j] [-J journal_options]\n"
"\t[-l] [-s sparse_flag] [-m reserved_blocks_percent]\n"
- "\t[-o [^]mount_options[,...]] [-r reserved_blocks_count]\n"
+ "\t[-o [^]mount_options[,...]] [-p mmp_interval] "
+ "[-r reserved_blocks_count]\n"
"\t[-u user] [-C mount_count] [-L volume_label] "
"[-M last_mounted_dir]\n"
"\t[-O [^]feature[,...]] [-T last_check_time] [-U UUID]"
@@ -97,7 +99,8 @@ static void usage(void)
static __u32 ok_features[3] = {
EXT3_FEATURE_COMPAT_HAS_JOURNAL |
EXT2_FEATURE_COMPAT_DIR_INDEX, /* Compat */
- EXT2_FEATURE_INCOMPAT_FILETYPE, /* Incompat */
+ EXT2_FEATURE_INCOMPAT_FILETYPE | /* Incompat */
+ EXT4_FEATURE_INCOMPAT_MMP,
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER | /* R/O compat */
EXT4_FEATURE_RO_COMPAT_GDT_CSUM
};
@@ -286,8 +289,10 @@ static void update_feature_set(ext2_fils
{
int sparse, old_sparse, filetype, old_filetype;
int journal, old_journal, dxdir, old_dxdir, uninit, old_uninit;
+ int mmp, old_mmp;
struct ext2_super_block *sb= fs->super;
__u32 old_compat, old_incompat, old_ro_compat;
+ int error;
old_compat = sb->s_feature_compat;
old_ro_compat = sb->s_feature_ro_compat;
@@ -303,6 +308,8 @@ static void update_feature_set(ext2_fils
EXT2_FEATURE_COMPAT_DIR_INDEX;
old_uninit = sb->s_feature_ro_compat &
EXT4_FEATURE_RO_COMPAT_GDT_CSUM;
+ old_mmp = sb->s_feature_incompat &
+ EXT4_FEATURE_INCOMPAT_MMP;
if (e2p_edit_feature(features, &sb->s_feature_compat,
ok_features)) {
fprintf(stderr, _("Invalid filesystem option set: %s\n"),
@@ -319,6 +326,8 @@ static void update_feature_set(ext2_fils
EXT2_FEATURE_COMPAT_DIR_INDEX;
uninit = sb->s_feature_ro_compat &
EXT4_FEATURE_RO_COMPAT_GDT_CSUM;
+ mmp = sb->s_feature_incompat &
+ EXT4_FEATURE_INCOMPAT_MMP;
if (old_journal && !journal) {
if ((mount_flags & EXT2_MF_MOUNTED) &&
!(mount_flags & EXT2_MF_READONLY)) {
@@ -359,6 +368,124 @@ static void update_feature_set(ext2_fils
if (uuid_is_null((unsigned char *) sb->s_hash_seed))
uuid_generate((unsigned char *) sb->s_hash_seed);
}
+ if (!old_mmp && mmp) {
+ blk_t mmp_block;
+ char *buf;
+ struct mmp_struct *mmp_s;
+
+ if ((mount_flags & EXT2_MF_MOUNTED) ||
+ (mount_flags & EXT2_MF_READONLY)) {
+ fputs(_("The multiple mount protection feature cannot\n"
+ "be set if the filesystem is mounted or \n"
+ "read-only.\n"), stderr);
+ exit(1);
+ }
+
+ error = ext2fs_read_bitmaps(fs);
+ if (error) {
+ fputs(_("Error while reading bitmaps\n"), stderr);
+ exit(1);
+ }
+
+ error = ext2fs_new_block(fs, 0, 0, &mmp_block);
+ if (error) {
+ fputs(_("Error allocating block required for setting "
+ "MMP feature.\n"), stderr);
+ exit(1);
+ }
+ ext2fs_block_alloc_stats(fs, mmp_block, +1);
+ sb->s_mmp_block = mmp_block;
+
+ error = ext2fs_get_mem(fs->blocksize, &buf);
+ if (error) {
+ fputs(_("Error allocating memory.\n"), stderr);
+ exit(1);
+ }
+ error = io_channel_read_blk(fs->io, mmp_block, 1, buf);
+ if (error) {
+ fputs(_("Error reading MMP block.\n"), stderr);
+ exit(1);
+ }
+
+ mmp_s = (struct mmp_struct *) buf;
+ mmp_s->mmp_magic = EXT2_MMP_MAGIC;
+ mmp_s->mmp_seq = EXT2_MMP_CLEAN;
+ mmp_s->mmp_time = 0;
+ mmp_s->mmp_nodename[0] = '\0';
+ mmp_s->mmp_bdevname[0] = '\0';
+ mmp_s->mmp_interval = EXT2_MMP_DEF_INTERVAL;
+
+#ifdef EXT2FS_ENABLE_SWAPFS
+ if (sb->s_magic == ext2fs_swab16(EXT2_SUPER_MAGIC))
+ ext2fs_swap_mmp(mmp_s);
+#endif
+ error = io_channel_write_blk(fs->io, mmp_block, 1, buf);
+ if (error) {
+ fputs(_("Error writing to MMP block.\n"), stderr);
+ exit(1);
+ }
+ if (buf)
+ ext2fs_free_mem(&buf);
+
+ sb->s_mmp_interval = EXT2_MMP_DEF_INTERVAL;
+ }
+
+ if (old_mmp && !mmp) {
+ blk_t mmp_block;
+ struct mmp_struct *mmp_s;
+ char *buf;
+
+ if ((mount_flags & EXT2_MF_MOUNTED) ||
+ (mount_flags & EXT2_MF_READONLY)) {
+ fputs(_("The multiple mount protection feature cannot\n"
+ "be disabled if the filesystem is mounted or\n"
+ "read-only.\n"), stderr);
+ exit(1);
+ }
+
+ error = ext2fs_read_bitmaps(fs);
+ if (error) {
+ fputs(_("Error while reading bitmaps\n"), stderr);
+ exit(1);
+ }
+
+ mmp_block = sb->s_mmp_block;
+ if ((mmp_block < sb->s_first_data_block) ||
+ (mmp_block >= sb->s_blocks_count)) {
+ fputs(_("MMP block number beyond filesystem range.\n"),
+ stderr);
+ exit(1);
+ }
+
+ error = ext2fs_get_mem(fs->blocksize, &buf);
+ if (error) {
+ fputs(_("Error allocating memory.\n"), stderr);
+ exit(1);
+ }
+ error = io_channel_read_blk(fs->io, mmp_block, 1, buf);
+ if (error) {
+ fputs(_("Error reading MMP block.\n"), stderr);
+ exit(1);
+ }
+
+ mmp_s = (struct mmp_struct *) buf;
+#ifdef EXT2FS_ENABLE_SWAPFS
+ if (sb->s_magic == ext2fs_swab16(EXT2_SUPER_MAGIC))
+ ext2fs_swap_mmp(mmp_s);
+#endif
+ if (mmp_s->mmp_magic != EXT2_MMP_MAGIC) {
+ fputs(_("Magic number in MMP block does not match. MMP "
+ "block number in superblock may be corrupted.\n"),
+ stderr);
+ exit(1);
+ }
+
+ ext2fs_unmark_block_bitmap(fs->block_map, mmp_block);
+ ext2fs_mark_bb_dirty(fs);
+
+ sb->s_mmp_block = 0;
+ sb->s_mmp_interval = 0;
+ }
if (sb->s_rev_level == EXT2_GOOD_OLD_REV &&
(sb->s_feature_compat || sb->s_feature_ro_compat ||
@@ -515,7 +642,7 @@ static void parse_tune2fs_options(int ar
struct passwd * pw;
printf("tune2fs %s (%s)\n", E2FSPROGS_VERSION, E2FSPROGS_DATE);
- while ((c = getopt(argc, argv, "c:e:fg:i:jlm:o:r:s:u:C:J:L:M:O:T:U:")) != EOF)
+ while ((c = getopt(argc, argv, "c:e:fg:i:jlm:o:p:r:s:u:C:J:L:M:O:T:U:")) != EOF)
switch (c)
{
case 'c':
@@ -666,6 +793,20 @@ static void parse_tune2fs_options(int ar
features_cmd = optarg;
open_flag = EXT2_FLAG_RW;
break;
+ case 'p':
+ mmp_interval = strtoul (optarg, &tmp, 0);
+ if (*tmp && mmp_interval != 0 &&
+ mmp_interval < EXT2_MMP_DEF_INTERVAL) {
+ com_err (program_name, 0,
+ _("multi-mount interval of %s"
+ " seconds may negatively "
+ "impact filesystem performance"),
+ optarg);
+ usage();
+ }
+ p_flag = 1;
+ open_flag = EXT2_FLAG_RW;
+ break;
case 'r':
reserved_blocks = strtoul (optarg, &tmp, 0);
if (*tmp) {
@@ -780,6 +921,9 @@ int main (int argc, char ** argv)
#else
io_ptr = unix_io_manager;
#endif
+ if (open_flag == EXT2_FLAG_RW && f_flag)
+ open_flag |= EXT2_FLAG_SKIP_MMP;
+
retval = ext2fs_open2(device_name, io_options, open_flag,
0, 0, io_ptr, &fs);
if (retval) {
@@ -840,6 +984,12 @@ int main (int argc, char ** argv)
printf (_("Setting reserved blocks percentage to %g%% (%u blocks)\n"),
reserved_ratio, sb->s_r_blocks_count);
}
+ if (p_flag) {
+ sb->s_mmp_interval = mmp_interval;
+ ext2fs_mark_super_dirty(fs);
+ printf (_("Setting multiple mount protection interval to %lu "
+ "seconds\n"), mmp_interval);
+ }
if (r_flag) {
if (reserved_blocks >= sb->s_blocks_count/2) {
com_err (program_name, 0,
Index: e2fsprogs-1.39/e2fsck/pass1.c
===================================================================
--- e2fsprogs-1.39.orig/e2fsck/pass1.c
+++ e2fsprogs-1.39/e2fsck/pass1.c
@@ -466,6 +466,39 @@ extern void e2fsck_setup_tdb_icount(e2fs
*ret = 0;
}
+/*
+ * Marks a block as in use, setting the dup_map if it's been set
+ * already. Called by process_block and process_bad_block.
+ *
+ * WARNING: Assumes checks have already been done to make sure block
+ * is valid. This is true in both process_block and process_bad_block.
+ */
+static void mark_block_used(e2fsck_t ctx, blk_t block)
+{
+ struct problem_context pctx;
+
+ clear_problem_context(&pctx);
+
+ if (ext2fs_fast_test_block_bitmap(ctx->block_found_map, block)) {
+ if (!ctx->block_dup_map) {
+ pctx.errcode = ext2fs_allocate_block_bitmap(ctx->fs,
+ _("multiply claimed block map"),
+ &ctx->block_dup_map);
+ if (pctx.errcode) {
+ pctx.num = 3;
+ fix_problem(ctx, PR_1_ALLOCATE_BBITMAP_ERROR,
+ &pctx);
+ /* Should never get here */
+ ctx->flags |= E2F_FLAG_ABORT;
+ return;
+ }
+ }
+ ext2fs_fast_mark_block_bitmap(ctx->block_dup_map, block);
+ } else {
+ ext2fs_fast_mark_block_bitmap(ctx->block_found_map, block);
+ }
+}
+
void e2fsck_pass1(e2fsck_t ctx)
{
int i;
@@ -1021,6 +1054,9 @@ void e2fsck_pass1(e2fsck_t ctx)
ctx->block_ea_map = 0;
}
+ if (fs->super->s_feature_incompat & EXT4_FEATURE_INCOMPAT_MMP)
+ mark_block_used(ctx, fs->super->s_mmp_block);
+
if (ctx->flags & E2F_FLAG_RESIZE_INODE) {
ext2fs_block_bitmap save_bmap;
@@ -1227,39 +1263,6 @@ static void alloc_imagic_map(e2fsck_t ct
}
/*
- * Marks a block as in use, setting the dup_map if it's been set
- * already. Called by process_block and process_bad_block.
- *
- * WARNING: Assumes checks have already been done to make sure block
- * is valid. This is true in both process_block and process_bad_block.
- */
-static _INLINE_ void mark_block_used(e2fsck_t ctx, blk_t block)
-{
- struct problem_context pctx;
-
- clear_problem_context(&pctx);
-
- if (ext2fs_fast_test_block_bitmap(ctx->block_found_map, block)) {
- if (!ctx->block_dup_map) {
- pctx.errcode = ext2fs_allocate_block_bitmap(ctx->fs,
- _("multiply claimed block map"),
- &ctx->block_dup_map);
- if (pctx.errcode) {
- pctx.num = 3;
- fix_problem(ctx, PR_1_ALLOCATE_BBITMAP_ERROR,
- &pctx);
- /* Should never get here */
- ctx->flags |= E2F_FLAG_ABORT;
- return;
- }
- }
- ext2fs_fast_mark_block_bitmap(ctx->block_dup_map, block);
- } else {
- ext2fs_fast_mark_block_bitmap(ctx->block_found_map, block);
- }
-}
-
-/*
* Adjust the extended attribute block's reference counts at the end
* of pass 1, either by subtracting out references for EA blocks that
* are still referenced in ctx->refcount, or by adding references for
Index: e2fsprogs-1.39/e2fsck/unix.c
===================================================================
--- e2fsprogs-1.39.orig/e2fsck/unix.c
+++ e2fsprogs-1.39/e2fsck/unix.c
@@ -1055,6 +1055,18 @@ restart:
"to do a read-only\n"
"check of the device.\n"));
#endif
+ else if (retval == ERANGE) {
+ if (fix_problem(ctx, PR_0_MMP_INVALID_BLK, &pctx)) {
+ fs->super->s_mmp_block = 0;
+ ext2fs_mark_super_dirty(fs);
+ }
+ }
+ else if (retval == EXT2_ET_MMP_FAILED)
+ printf(_("Dump MMP info\n"));
+ else if (retval == EXT2_ET_MMP_FSCK_ON)
+ printf(_("If you are sure that e2fsck is not running "
+ "then use \"tune2fs -O ^mmp device\" "
+ "followed by \"tune2fs -O mmp device\"\n"));
else
fix_problem(ctx, PR_0_SB_CORRUPT, &pctx);
fatal_error(ctx, 0);
@@ -1331,6 +1343,43 @@ no_journal:
!(ctx->options & E2F_OPT_READONLY))
ext2fs_set_gdt_csum(ctx->fs);
+ if ((flags & EXT2_FLAG_RW) &&
+ (fs->super->s_feature_incompat & EXT4_FEATURE_INCOMPAT_MMP)) {
+ blk_t mmp_blk = fs->super->s_mmp_block;
+ char *buf = NULL;
+ struct mmp_struct *mmp_s;
+ int error;
+
+ error = ext2fs_get_mem(fs->blocksize, &buf);
+ if (error) {
+ printf(_("Error allocating memory.\n"));
+ goto mmp_error2;
+ }
+
+ error = io_channel_read_blk(fs->io, mmp_blk, 1, buf);
+ if (error) {
+ printf(_("Error reading MMP block.\n"));
+ goto mmp_error2;
+ }
+
+ mmp_s = (struct mmp_struct *) buf;
+ if (mmp_s->mmp_magic != EXT2_MMP_MAGIC) {
+ printf(_("Invalid magic number in MMP block.\n"));
+ goto mmp_error2;
+ }
+
+ mmp_s->mmp_seq = EXT2_MMP_CLEAN;
+ error = io_channel_write_blk(fs->io, mmp_blk, 1, buf);
+ if (error) {
+ printf(_("Error writing to MMP block.\n"));
+ goto mmp_error2;
+ }
+
+mmp_error2:
+ if (buf)
+ ext2fs_free_mem(&buf);
+ }
+
e2fsck_write_bitmaps(ctx);
ext2fs_close(fs);
Index: e2fsprogs-1.39/e2fsck/problem.c
===================================================================
--- e2fsprogs-1.39.orig/e2fsck/problem.c
+++ e2fsprogs-1.39/e2fsck/problem.c
@@ -376,6 +376,11 @@ static struct e2fsck_problem problem_tab
N_("last @g @b @B uninitialized. "),
PROMPT_FIX, PR_PREEN_OK },
+ /* Superblock contains an invalid MMP block number */
+ { PR_0_MMP_INVALID_BLK,
+ N_("@S has invalid MMP block. "),
+ PROMPT_CLEAR, PR_PREEN_OK },
+
/* Pass 1 errors */
/* Pass 1: Checking inodes, blocks, and sizes */
Index: e2fsprogs-1.39/e2fsck/problem.h
===================================================================
--- e2fsprogs-1.39.orig/e2fsck/problem.h
+++ e2fsprogs-1.39/e2fsck/problem.h
@@ -212,6 +212,9 @@ struct problem_context {
/* Last group block bitmap is uninitialized. */
#define PR_0_BB_UNINIT_LAST 0x000039
+/* The MMP block in the superblock is invalid. */
+#define PR_0_MMP_INVALID_BLK 0x00003A
+
/*
* Pass 1 errors
*/
Index: e2fsprogs-1.39/lib/ext2fs/swapfs.c
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/swapfs.c
+++ e2fsprogs-1.39/lib/ext2fs/swapfs.c
@@ -70,6 +70,8 @@ void ext2fs_swap_super(struct ext2_super
sb->s_min_extra_isize = ext2fs_swab16(sb->s_min_extra_isize);
sb->s_want_extra_isize = ext2fs_swab16(sb->s_want_extra_isize);
sb->s_flags = ext2fs_swab32(sb->s_flags);
+ sb->s_mmp_interval = ext2fs_swab16(sb->s_mmp_interval);
+ sb->s_mmp_block = ext2fs_swab64(sb->s_mmp_block);
for (i=0; i < 4; i++)
sb->s_hash_seed[i] = ext2fs_swab32(sb->s_hash_seed[i]);
for (i=0; i < 17; i++)
@@ -274,4 +276,12 @@ void ext2fs_swap_inode(ext2_filsys fs, s
sizeof(struct ext2_inode));
}
+void ext2fs_swap_mmp(struct mmp_struct *mmp)
+{
+ mmp->mmp_magic = ext2fs_swab32(mmp->mmp_magic);
+ mmp->mmp_seq = ext2fs_swab32(mmp->mmp_seq);
+ mmp->mmp_time = ext2fs_swab64(mmp->mmp_time);
+ mmp->mmp_interval = ext2fs_swab16(mmp->mmp_interval);
+}
+
#endif
Index: e2fsprogs-1.39/lib/ext2fs/openfs.c
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/openfs.c
+++ e2fsprogs-1.39/lib/ext2fs/openfs.c
@@ -22,6 +22,9 @@
#if HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
+#ifdef HAVE_ERRNO_H
+#include <errno.h>
+#endif
#include "ext2_fs.h"
@@ -68,6 +71,107 @@ errcode_t ext2fs_open(const char *name,
}
/*
+ * Make sure that the fs is not mounted or under fsck while opening the fs.
+ */
+int ext2fs_multiple_mount_protect(ext2_filsys fs)
+{
+ blk_t mmp_blk = fs->super->s_mmp_block;
+ char *buf = NULL;
+ struct mmp_struct *mmp_s;
+ unsigned long seq;
+ int retval = 0;
+
+ if ((mmp_blk < fs->super->s_first_data_block) ||
+ (mmp_blk >= fs->super->s_blocks_count)) {
+ return ERANGE;
+ }
+
+ retval = ext2fs_get_mem(fs->blocksize * 5, &buf);
+ if (retval)
+ goto mmp_error;
+
+ retval = io_channel_read_blk(fs->io, mmp_blk, 1, buf);
+ if (retval)
+ goto mmp_error;
+
+ mmp_s = (struct mmp_struct *) buf;
+#ifdef EXT2FS_ENABLE_SWAPFS
+ if (fs->flags & EXT2_FLAG_SWAP_BYTES)
+ ext2fs_swap_mmp(mmp_s);
+#endif
+
+ if (mmp_s->mmp_magic != EXT2_MMP_MAGIC) {
+ retval = EXT2_ET_MMP_MAGIC_INVALID;
+ goto mmp_error;
+ }
+
+ if (fs->super->s_mmp_interval == 0)
+ fs->super->s_mmp_interval = EXT2_MMP_DEF_INTERVAL;
+
+ seq = mmp_s->mmp_seq;
+ if (seq == EXT2_MMP_CLEAN)
+ goto clean_seq;
+
+ if (seq == EXT2_MMP_FSCK_ON) {
+ retval = EXT2_ET_MMP_FSCK_ON;
+ goto mmp_error;
+ }
+
+ sleep(2 * fs->super->s_mmp_interval);
+
+ /*
+ * Make sure that we read directly from disk by reading only
+ * sizeof(struct mmp_struct) bytes.
+ */
+ retval = io_channel_read_blk(fs->io, mmp_blk,
+ -sizeof(struct mmp_struct), buf);
+ if (retval)
+ goto mmp_error;
+
+ if (seq != mmp_s->mmp_seq) {
+ retval = EXT2_ET_MMP_FAILED;
+ goto mmp_error;
+ }
+
+clean_seq:
+ mmp_s->mmp_seq = seq = rand();
+ retval = io_channel_write_blk(fs->io, mmp_blk, 1, buf);
+ if (retval)
+ goto mmp_error;
+
+ io_channel_flush(fs->io);
+ sleep(2 * fs->super->s_mmp_interval);
+ retval = io_channel_read_blk(fs->io, mmp_blk,
+ -sizeof(struct mmp_struct), buf);
+ if (retval)
+ goto mmp_error;
+
+ if (seq != mmp_s->mmp_seq) {
+ retval = EXT2_ET_MMP_FAILED;
+ goto mmp_error;
+ }
+
+ mmp_s->mmp_seq = EXT2_MMP_FSCK_ON;
+ retval = io_channel_write_blk(fs->io, mmp_blk, 1, buf);
+ if (retval)
+ goto mmp_error;
+
+ if (buf)
+ ext2fs_free_mem(&buf);
+
+ return 0;
+
+mmp_error:
+ if (buf)
+ ext2fs_free_mem(&buf);
+
+ return retval;
+
+
+ return 0;
+}
+
+/*
* Note: if superblock is non-zero, block-size must also be non-zero.
* Superblock and block_size can be zero to use the default size.
*
@@ -77,6 +181,7 @@ errcode_t ext2fs_open(const char *name,
* EXT2_FLAG_FORCE - Open the filesystem even if some of the
* features aren't supported.
* EXT2_FLAG_JOURNAL_DEV_OK - Open an ext3 journal device
+ * EXT2_FLAG_SKIP_MMP - Open without multi-mount protection check.
*/
errcode_t ext2fs_open2(const char *name, const char *io_options,
int flags, int superblock,
@@ -317,6 +422,13 @@ errcode_t ext2fs_open2(const char *name,
*ret_fs = fs;
+ if ((fs->super->s_feature_incompat & EXT4_FEATURE_INCOMPAT_MMP) &&
+ (flags & EXT2_FLAG_RW) && !(flags & EXT2_FLAG_SKIP_MMP)) {
+ retval = ext2fs_multiple_mount_protect(fs);
+ if (retval)
+ goto cleanup;
+ }
+
return 0;
cleanup:
ext2fs_free(fs);
Index: e2fsprogs-1.39/lib/ext2fs/ext2_err.et.in
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/ext2_err.et.in
+++ e2fsprogs-1.39/lib/ext2fs/ext2_err.et.in
@@ -338,5 +338,14 @@ ec EXT2_ET_EXTENT_LEAF_BAD,
ec EXT2_ET_EXTENT_NO_SPACE,
"No free space in extent map"
+ec EXT2_ET_MMP_MAGIC_INVALID,
+ "MMP: Invalid magic number in MMP block"
+
+ec EXT2_ET_MMP_FAILED,
+ "MMP: Device already active on another node"
+
+ec EXT2_ET_MMP_FSCK_ON,
+ "MMP: Seems as if fsck is already being run on the filesystem."
+
end
Index: e2fsprogs-1.39/lib/ext2fs/closefs.c
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/closefs.c
+++ e2fsprogs-1.39/lib/ext2fs/closefs.c
@@ -359,12 +359,61 @@ errout:
return retval;
}
+errcode_t write_mmp_clean(ext2_filsys fs)
+{
+ blk_t mmp_blk = fs->super->s_mmp_block;
+ char *buf = NULL;
+ struct mmp_struct *mmp_s;
+ int error;
+
+ error = ext2fs_get_mem(fs->blocksize, &buf);
+ if (error)
+ goto mmp_error;
+
+ error = io_channel_read_blk(fs->io, mmp_blk, 1, buf);
+ if (error)
+ goto mmp_error;
+
+ mmp_s = (struct mmp_struct *) buf;
+#ifdef EXT2FS_ENABLE_SWAPFS
+ if (fs->flags & EXT2_FLAG_SWAP_BYTES)
+ ext2fs_swap_mmp(mmp_s);
+#endif
+ if (mmp_s->mmp_magic != EXT2_MMP_MAGIC) {
+ error = EXT2_ET_MMP_MAGIC_INVALID;
+ goto mmp_error;
+ }
+
+ mmp_s->mmp_seq = EXT2_MMP_CLEAN;
+#ifdef EXT2FS_ENABLE_SWAPFS
+ if (fs->flags & EXT2_FLAG_SWAP_BYTES)
+ ext2fs_swap_mmp(mmp_s);
+#endif
+ error = io_channel_write_blk(fs->io, mmp_blk, 1, buf);
+ if (error)
+ goto mmp_error;
+
+mmp_error:
+ if (buf)
+ ext2fs_free_mem(&buf);
+
+ return error;
+}
+
+
errcode_t ext2fs_close(ext2_filsys fs)
{
errcode_t retval;
EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
+ if ((fs->flags & EXT2_FLAG_RW) &&
+ (fs->super->s_feature_incompat & EXT4_FEATURE_INCOMPAT_MMP)) {
+ retval = write_mmp_clean(fs);
+ if (retval)
+ return retval;
+ }
+
if (fs->flags & EXT2_FLAG_DIRTY) {
retval = ext2fs_flush(fs);
if (retval)
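The sequence-number handshake that ext2fs_multiple_mount_protect() performs in the patch above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: struct mmp_sim_block, mmp_wait_fn, mmp_try_claim() and the two wait helpers are hypothetical names that do not appear in the patch, the "wait" hook stands in for sleeping and re-reading the block from disk, and the final step of writing EXT2_MMP_FSCK_ON is left out.

```c
#include <stdint.h>

/* Special sequence values, taken from the patch. */
#define MMP_SEQ_CLEAN   0xFF4D4D50u
#define MMP_SEQ_FSCK_ON 0xE24D4D50u

enum mmp_result { MMP_OK, MMP_BUSY, MMP_FSCK_RUNNING };

/* Hypothetical in-memory stand-in for the on-disk MMP block. */
struct mmp_sim_block {
	uint32_t seq;
};

/*
 * Hypothetical hook standing in for "sleep, then re-read from disk".
 * A peer that still has the filesystem open would bump the sequence here.
 */
typedef void (*mmp_wait_fn)(struct mmp_sim_block *blk);

/* No other node is active: the block is left untouched. */
static void mmp_wait_idle(struct mmp_sim_block *blk) { (void)blk; }

/* Another node is active: it updates the sequence while we wait. */
static void mmp_wait_active_peer(struct mmp_sim_block *blk) { blk->seq++; }

static enum mmp_result mmp_try_claim(struct mmp_sim_block *blk,
				     uint32_t my_seq, mmp_wait_fn wait)
{
	uint32_t seen = blk->seq;

	if (seen == MMP_SEQ_FSCK_ON)
		return MMP_FSCK_RUNNING;

	/* Not cleanly closed: watch for a live peer before writing. */
	if (seen != MMP_SEQ_CLEAN) {
		wait(blk);
		if (blk->seq != seen)
			return MMP_BUSY;
	}

	/* Publish our own sequence and check that nobody overwrites it. */
	blk->seq = my_seq;
	wait(blk);
	if (blk->seq != my_seq)
		return MMP_BUSY;

	return MMP_OK;
}
```

In this sketch a block holding MMP_SEQ_CLEAN is claimed after a single wait, a stale block left by a crashed node is claimed after two waits, and a block that a peer keeps updating is reported as busy.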
* Re: [RFC][PATCH] Multiple mount protection
2007-05-25 21:36 ` Kalpak Shah
@ 2007-05-30 20:58 ` Kalpak Shah
2007-05-31 16:16 ` Theodore Tso
0 siblings, 1 reply; 21+ messages in thread
From: Kalpak Shah @ 2007-05-30 20:58 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4, Andreas Dilger
On Sat, 2007-05-26 at 03:06 +0530, Kalpak Shah wrote:
> Hi Ted,
>
> On Fri, 2007-05-25 at 10:39 -0400, Theodore Tso wrote:
> > Hi Kalpak,
> >
> > On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
> > > It will also protect against running e2fsck on a mounted filesystem
> > > by adding similar logic to ext2fs_open().
> >
> > Your patch didn't add this logic to ext2fs_open(); it just reserved
> > the space in the superblock.
>
> Yeah, the earlier patch was just for reserving the fields.
>
> >
> > I don't mind reserving the space so we don't have to worry about
> > conflicting superblock uses, but I'm still on the fence about actually
> > adding this functionality (a) into e2fsprogs, and (b) into the ext4
> > kernel code. I guess it depends on how complicated/icky the
> > implementation code is.
>
Hi Ted,
So can I assume that the INCOMPAT_MMP flag and the s_mmp_interval and
s_mmp_block superblock fields will be reserved regardless of whether the
patches go into ext4? I had attached the patches in the last mail so you
can share your views on them.
Thanks,
Kalpak.
* Re: [RFC][PATCH] Multiple mount protection
2007-05-30 20:58 ` Kalpak Shah
@ 2007-05-31 16:16 ` Theodore Tso
2007-05-31 21:09 ` Kalpak Shah
0 siblings, 1 reply; 21+ messages in thread
From: Theodore Tso @ 2007-05-31 16:16 UTC (permalink / raw)
To: Kalpak Shah; +Cc: linux-ext4, Andreas Dilger
On Thu, May 31, 2007 at 02:28:33AM +0530, Kalpak Shah wrote:
>
> So can I assume that the INCOMPAT_MMP flag and the s_mmp_interval and
> s_mmp_block superblock fields will be reserved regardless of whether the
> patches go into ext4? I had attached the patches in the last mail so you
> can share your views on them.
Yes, I've reserved the code point and superblock fields. I'm not
going to add the INCOMPAT_MMP flag to the supported file until I get and
integrate the patch to ext2fs_open() that actually tests for the flag,
though, since that would be a bit silly.
I assume the patch will add a flag to ext2fs_open which skips the MMP
checking. After all, tune2fs is allowed to make changes to the
superblock while the filesystem is mounted. So it needs to be able to
open the filesystem read/only even if it is mounted.
Regards,
- Ted
* Re: [RFC][PATCH] Multiple mount protection
2007-05-31 16:16 ` Theodore Tso
@ 2007-05-31 21:09 ` Kalpak Shah
0 siblings, 0 replies; 21+ messages in thread
From: Kalpak Shah @ 2007-05-31 21:09 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4, Andreas Dilger
On Thu, 2007-05-31 at 12:16 -0400, Theodore Tso wrote:
> On Thu, May 31, 2007 at 02:28:33AM +0530, Kalpak Shah wrote:
> >
> > So can I assume that the INCOMPAT_MMP flag and the s_mmp_interval and
> > s_mmp_block superblock fields will be reserved regardless of whether the
> > patches go into ext4? I had attached the patches in the last mail so you
> > can share your views on them.
>
> Yes, I've reserved the code point and superblock fields.
Thanks.
> I'm not going to add the INCOMPAT_MMP flag to the supported file until I get and
> integrate the patch to ext2fs_open() that actually tests for the flag,
> though, since that would be a bit silly.
>
> I assume the patch will add a flag to ext2fs_open which skips the MMP
> checking.
Yes, I have added an EXT2_FLAG_SKIP_MMP flag to ext2fs_open() to bypass
MMP; it is set when tune2fs is run with the -f option. Also, the MMP check
is not run if the filesystem is being opened read-only.
Thanks,
Kalpak.
> After all, tune2fs is allowed to make changes to the
> superblock while the filesystem is mounted. So it needs to be able to
> open the filesystem read/only even if it is mounted.
>
> Regards,
>
> - Ted
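The rule discussed in this message (the check runs only for read-write opens of an MMP filesystem, and only when it has not been explicitly skipped) can be written as a single predicate. This is a minimal sketch mirroring the condition added to ext2fs_open2() in the patch; mmp_check_needed() is a hypothetical helper, and the OPEN_FLAG_* values are illustrative stand-ins rather than the library's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEATURE_INCOMPAT_MMP 0x0100u   /* value from the patch */
#define OPEN_FLAG_RW         0x0001u   /* illustrative stand-in for EXT2_FLAG_RW */
#define OPEN_FLAG_SKIP_MMP   0x10000u  /* illustrative stand-in for EXT2_FLAG_SKIP_MMP */

/* Returns true when an open should go through the MMP handshake. */
static bool mmp_check_needed(uint32_t feature_incompat, uint32_t open_flags)
{
	if (!(feature_incompat & FEATURE_INCOMPAT_MMP))
		return false;	/* feature not enabled on this filesystem */
	if (!(open_flags & OPEN_FLAG_RW))
		return false;	/* read-only opens are not checked */
	if (open_flags & OPEN_FLAG_SKIP_MMP)
		return false;	/* caller asked to bypass, e.g. tune2fs -f */
	return true;
}
```

Note that the skip flag must be a bit of its own: if its value shared a bit with another open flag, setting that other flag would silently disable the check.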
* Re: [RFC][PATCH] Multiple mount protection
2007-06-01 8:46 ` Andi Kleen
@ 2007-06-01 8:27 ` Kalpak Shah
2007-06-01 9:14 ` Andreas Dilger
2007-06-01 11:41 ` Theodore Tso
2 siblings, 0 replies; 21+ messages in thread
From: Kalpak Shah @ 2007-06-01 8:27 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-ext4, Andreas Dilger
On Fri, 2007-06-01 at 10:46 +0200, Andi Kleen wrote:
> Kalpak Shah <kalpak@clusterfs.com> writes:
>
> > Hi,
> >
> > There have been reported instances of a filesystem having been mounted at 2 places at the same time causing a lot of damage to the filesystem. This patch reserves superblock fields and an INCOMPAT flag for adding multiple mount protection(MMP) support within the ext4 filesystem itself. The superblock will have a block number (s_mmp_block) which will hold a MMP structure which has a sequence number which will be periodically updated every 5 seconds by a mounted filesystem. Whenever a filesystem will be mounted it will wait for s_mmp_interval seconds to make sure that the MMP sequence does not change. To further make sure, we write a random sequence number into the MMP block and wait for another s_mmp_interval secs. If the sequence no. doesn't change then the mount will succeed. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated will be displayed. tune2fs can be used to set s_mmp_interval as desired.
>
> That will make laptop users very unhappy if you spin up their disks every 5 seconds. And
> even on other systems it might reduce the MTBF if you write the super block much more
> often than before. It might be better to set it up in some way to only increase
> that number when the super block is written for some other reason anyways.
The superblock only stores the block number of the MMP block. So the
superblock itself is not updated; only the contents of the MMP block are
updated every 5 seconds.
Any user who is unhappy with the 5-second interval can increase it, with
the caveat that mounting the filesystem will then take 2*mmp_interval
seconds.
Thanks,
Kalpak.
>
> -Andi
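The trade-off described in this reply can be made concrete with a little arithmetic. The sketch below assumes the rule quoted above (a mount waits 2*mmp_interval seconds) and the fallback used in ext2fs_multiple_mount_protect() in the patch, where an s_mmp_interval of 0 is treated as the 5-second default. Both helper names are hypothetical.

```c
/* Default update interval in seconds, as defined in the patch. */
#define MMP_DEF_INTERVAL 5u

/* An interval of 0 in the superblock falls back to the default. */
static unsigned int mmp_effective_interval(unsigned int s_mmp_interval)
{
	return s_mmp_interval ? s_mmp_interval : MMP_DEF_INTERVAL;
}

/* Seconds spent waiting at mount time, assuming the 2*interval rule. */
static unsigned int mmp_mount_delay(unsigned int s_mmp_interval)
{
	return 2u * mmp_effective_interval(s_mmp_interval);
}
```

With the default interval the wait is 10 seconds; raising the interval to 30 seconds cuts the MMP block writes to one sixth, at the price of a 60-second wait at mount time.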
* Re: [RFC][PATCH] Multiple mount protection
2007-05-21 19:52 [RFC][PATCH] Multiple mount protection Kalpak Shah
2007-05-22 7:15 ` Manoj Joseph
2007-05-25 14:39 ` Theodore Tso
@ 2007-06-01 8:46 ` Andi Kleen
2007-06-01 8:27 ` Kalpak Shah
` (2 more replies)
2 siblings, 3 replies; 21+ messages in thread
From: Andi Kleen @ 2007-06-01 8:46 UTC (permalink / raw)
To: Kalpak Shah; +Cc: linux-ext4, Andreas Dilger
Kalpak Shah <kalpak@clusterfs.com> writes:
> Hi,
>
> There have been reported instances of a filesystem having been mounted at 2 places at the same time causing a lot of damage to the filesystem. This patch reserves superblock fields and an INCOMPAT flag for adding multiple mount protection(MMP) support within the ext4 filesystem itself. The superblock will have a block number (s_mmp_block) which will hold a MMP structure which has a sequence number which will be periodically updated every 5 seconds by a mounted filesystem. Whenever a filesystem will be mounted it will wait for s_mmp_interval seconds to make sure that the MMP sequence does not change. To further make sure, we write a random sequence number into the MMP block and wait for another s_mmp_interval secs. If the sequence no. doesn't change then the mount will succeed. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated will be displayed. tune2fs can be used to set s_mmp_interval as desired.
That will make laptop users very unhappy if you spin up their disks every 5 seconds. And
even on other systems it might reduce the MTBF if you write the super block much more
often than before. It might be better to set it up in some way to only increase
that number when the super block is written for some other reason anyways.
-Andi
* Re: [RFC][PATCH] Multiple mount protection
2007-06-01 8:46 ` Andi Kleen
2007-06-01 8:27 ` Kalpak Shah
@ 2007-06-01 9:14 ` Andreas Dilger
2007-06-01 10:56 ` Andi Kleen
2007-06-01 11:41 ` Theodore Tso
2 siblings, 1 reply; 21+ messages in thread
From: Andreas Dilger @ 2007-06-01 9:14 UTC (permalink / raw)
To: Andi Kleen; +Cc: Kalpak Shah, linux-ext4
On Jun 01, 2007 10:46 +0200, Andi Kleen wrote:
> Kalpak Shah <kalpak@clusterfs.com> writes:
> > There have been reported instances of a filesystem having been
> > mounted at 2 places at the same time causing a lot of damage to the
> > filesystem.... The superblock will have a block number (s_mmp_block)
> > which will hold a MMP structure which has a sequence number which will be
> > periodically updated every 5 seconds by a mounted filesystem.
>
> That will make laptop users very unhappy if you spin up their disks every
> 5 seconds. And even on other systems it might reduce the MTBF if you
> write the super block much more often than before. It might be better to
> set it up in some way to only increase that number when the super block is
> written for some other reason anyways.
It was mentioned before but deserves mentioning again that this will
be an optional feature, mostly for use on SANs, iSCSI, etc where a disk
might be accessed by multiple nodes at the same time. That means there
will not be any impact for desktop users waiting 10s for each of their
filesystems to mount.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [RFC][PATCH] Multiple mount protection
2007-06-01 9:14 ` Andreas Dilger
@ 2007-06-01 10:56 ` Andi Kleen
0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2007-06-01 10:56 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Andi Kleen, Kalpak Shah, linux-ext4
> It was mentioned before but deserves mentioning again that this will
> be an optional feature, mostly for use on SANs, iSCSI, etc where a disk
> might be accessed by multiple nodes at the same time. That means there
> will not be any impact for desktop users waiting 10s for each of their
> filesystems to mount.
A safety feature that is optional? Doesn't sound very safe to me.
If the safety is needed it should probably be the default; otherwise
it isn't needed.
-Andi
* Re: [RFC][PATCH] Multiple mount protection
2007-06-01 8:46 ` Andi Kleen
2007-06-01 8:27 ` Kalpak Shah
2007-06-01 9:14 ` Andreas Dilger
@ 2007-06-01 11:41 ` Theodore Tso
2007-06-01 12:13 ` Andi Kleen
2 siblings, 1 reply; 21+ messages in thread
From: Theodore Tso @ 2007-06-01 11:41 UTC (permalink / raw)
To: Andi Kleen; +Cc: Kalpak Shah, linux-ext4, Andreas Dilger
On Fri, Jun 01, 2007 at 10:46:19AM +0200, Andi Kleen wrote:
>
> That will make laptop users very unhappy if you spin up their disks
> every 5 seconds. And even on other systems it might reduce the MTBF
> if you write the super block much more often than before. It might
> be better to set it up in some way to only increase that number when
> the super block is written for some other reason anyways.
You would never want to use this feature on a laptop; it would buy no
benefit for its costs, since with (all common) laptops, their hard
drives can't be shared with other machines in a cluster.
Unfortunately, it's not possible to do what you suggest, since one of
the whole points of increasing the sequence number every 5 seconds is
to act as a keep-alive, so another machine trying to access the shared
hard drive can tell whether or not the machine which currently had the
hard drive mounted is still alive or not.
This is why I and others have been a little worried about implementing
this feature, since it adds complexity which has to be in a proper HA
system anyway, and what is there isn't really an optimal HA solution
(since it lacks STONITH) and so you have to implement the
functionality again _anyway_ using a proper HA solution.
The argument on the other side is that it protects against failed HA
solutions, and against users who are too stupid to know that they need
an HA solution. It does do the first; the second would only apply if
the users who were too stupid to realize they needed an HA solution,
were smart enough to enable the MMP feature --- and because of its
many costs, including keeping the disk spun up on laptops, and
delaying the time required to mount the disk by 10 seconds, I don't
think it will ever be enabled by default. Hence, I don't really think
it helps the idiotic user problem.
But apparently a belt-and-suspenders approach to HA is comforting to
some users, and so I don't mind reserving the space. The code to
implement it still seems like more complexity than what should be in
the kernel. My suggestion would be to put it in a separate file, and
make it be something which has to be explicitly configured to enable
it, possibly as a module (but that may add too much extra hair). I
really don't think the save-the-stupid-user argument holds water, but
the belt-and-suspenders argument IFF you are using a shared-disk setup
is valid, although that is probably not a common setup.
Regards,
- Ted
* Re: [RFC][PATCH] Multiple mount protection
2007-06-01 11:41 ` Theodore Tso
@ 2007-06-01 12:13 ` Andi Kleen
2007-06-01 13:52 ` Theodore Tso
0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2007-06-01 12:13 UTC (permalink / raw)
To: Theodore Tso; +Cc: Andi Kleen, Kalpak Shah, linux-ext4, Andreas Dilger
> Unfortunately, it's not possible to do what you suggest, since one of
> the whole points of increasing the sequence number every 5 seconds is
> to act as a keep-alive, so another machine trying to access the shared
Clusters usually have other ways to do this, haven't they?
Typically they have STONITH too. It's probably too simple minded
to just replace a real cluster setup which also handles split
brain and other conditions. So it's purely against mistakes.
Besides relying on it would seem dangerous because it is not synchronous
and you could do a lot of damage in 5 seconds.
The rationale for lazy writing would be that, at least in usual
operation, the super block should be written regularly anyway, and with
the 5 second delay being only a not fully reliable heuristic, it probably
doesn't matter if the possible delay is a little longer.
-Andi
* Re: [RFC][PATCH] Multiple mount protection
2007-06-01 12:13 ` Andi Kleen
@ 2007-06-01 13:52 ` Theodore Tso
2007-06-01 18:00 ` Andreas Dilger
0 siblings, 1 reply; 21+ messages in thread
From: Theodore Tso @ 2007-06-01 13:52 UTC (permalink / raw)
To: Andi Kleen; +Cc: Kalpak Shah, linux-ext4, Andreas Dilger
On Fri, Jun 01, 2007 at 02:13:39PM +0200, Andi Kleen wrote:
> > Unfortunately, it's not possible to do what you suggest, since one of
> > the whole points of increasing the sequence number every 5 seconds is
> > to act as a keep-alive, so another machine trying to access the shared
>
> Clusters usually have other ways to do this, haven't they?
> Typically they have STONITH too. It's probably too simple minded
> to just replace a real cluster setup which also handles split
> brain and other conditions. So it's purely against mistakes.
Yes, its only real value is to protect against Cluster-HA
malfunctions or misconfiguration.
> Besides relying on it would seem dangerous because it is not synchronous
> and you could do a lot of damage in 5 seconds.
Well, the MMP feature is assigned an incompatible feature bit, so a
kernel that doesn't know about MMP will refuse to touch it; and a
kernel which does follow the MMP protocol will check the MMP block
(delaying the mount by 10 seconds) to make sure no other system is
using the block.
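The incompat-bit behaviour can be sketched as a mask test: the mount is
refused if the filesystem sets any incompatible feature bit the kernel
does not list as supported. The names and bit values below are
illustrative assumptions for this sketch only, not the definitions used
by the patch or by the kernel headers.

```c
#include <stdint.h>

/* Illustrative feature bits; names and values are assumptions. */
#define SKETCH_INCOMPAT_FILETYPE 0x0002u
#define SKETCH_INCOMPAT_MMP      0x0100u

/*
 * Returns 1 if a kernel supporting the incompat bits in kernel_supported
 * may mount a filesystem whose superblock advertises fs_incompat,
 * and 0 if it must refuse because of an unknown incompat feature.
 */
static int sketch_can_mount(uint32_t fs_incompat, uint32_t kernel_supported)
{
    return (fs_incompat & ~kernel_supported) == 0;
}
```

A kernel that only knows SKETCH_INCOMPAT_FILETYPE gets 0 for a
filesystem with SKETCH_INCOMPAT_MMP set, so it never writes to a disk
whose MMP block it would not be updating.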
So aside from being !@#!@ annoying (which is why it will never be the
default), it does work, modulo the problem that without STONITH or any
kind of I/O fencing, we do risk the other system coming back to life
and then modifying the filesystem in parallel. So as everyone has
said, this is not a solution that works in isolation, but is really only
a backup.
The question of whether the complexity and the 10 second mount delay
for what is only a backup solution is worth it is obviously going to
be a very subjective one --- and as I've said previously, I'm on the
fence on this.
- Ted
* Re: [RFC][PATCH] Multiple mount protection
2007-06-01 13:52 ` Theodore Tso
@ 2007-06-01 18:00 ` Andreas Dilger
0 siblings, 0 replies; 21+ messages in thread
From: Andreas Dilger @ 2007-06-01 18:00 UTC (permalink / raw)
To: Theodore Tso; +Cc: Andi Kleen, Kalpak Shah, linux-ext4
On Jun 01, 2007 09:52 -0400, Theodore Tso wrote:
> On Fri, Jun 01, 2007 at 02:13:39PM +0200, Andi Kleen wrote:
> > Clusters usually have other ways to do this, haven't they?
> > Typically they have STONITH too. It's probably too simple minded
> > to just replace a real cluster setup which also handles split
> > brain and other conditions. So it's purely against mistakes.
>
> Yes, its only real value is to protect against Cluster-HA
> malfunctions or misconfiguration.
While I agree that HA systems _should_ be enough for this, in our
experience even with an HA system some people get it wrong (e.g.
manually mounting and bypassing HA, HA itself is broken, comms failure,
STONITH failure, etc).
I agree it is not intended to be a replacement for an HA/STONITH
solution, just belt & suspenders that would have saved hundreds of
TB of user data in several cases if it were available. We will
enable it by default on all of our filesystems, and of course I'd
advise anyone in a SAN environment (whether they _intend_ to have
shared disk access or not) to enable it also.
> > Besides relying on it would seem dangerous because it is not synchronous
> > and you could do a lot of damage in 5 seconds.
>
> Well, the MMP feature is assigned an incompatible feature bit, so a
> kernel that doesn't know about MMP will refuse to touch it; and a
> kernel which does follow the MMP protocol will check the MMP block
> (delaying the mount by 10 seconds) to make sure no other system is
> using the block.
Correct. There is a "fast path" where it will wait a shorter time
during mount if the fs is reported cleanly unmounted. We can't skip
the check entirely, because 2 systems might be mounting at the same
time.
> So aside from being !@#!@ annoying (which is why it will never be the
> default), it does work, modulo the problem that without STONITH or any
> kind of I/O fencing, we do risk the other system coming back to life
> and then modifying the filesystem in parallel. So as everyone has
> said, this is not a solution that works in isolation, but is really only
> a backup.
If kmmpd is not scheduled for more than 10s then it will re-read the
block to ensure that the local system is still the one in control. If
not, it will ext3_error() and (in our case at least) this will make the
client fs read-only. Even if there is some IO leakage from the local
client, this is far better than to continue running with 2 systems writing
to the same disk.
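As a rough sketch of that decision, assuming the sequence number is the
only thing compared: kmmpd_sketch_step(), enum kmmpd_action and the
parameter names are hypothetical, and the real thread would re-read the
MMP block from disk and call ext3_error() where this returns a value.

```c
#include <assert.h>
#include <stdint.h>

enum kmmpd_action {
    KMMPD_UPDATE,        /* still the owner: write the next sequence number */
    KMMPD_ABORT_RDONLY   /* someone else wrote the block: error out */
};

/*
 * One wakeup of the update thread. If more than max_gap_secs passed
 * since our last update, the block is re-read and its sequence number
 * compared with the one we wrote last; a mismatch means another node
 * took over while we were stalled.
 */
static enum kmmpd_action kmmpd_sketch_step(uint32_t last_written_seq,
                                           uint32_t on_disk_seq,
                                           unsigned secs_since_update,
                                           unsigned max_gap_secs)
{
    assert(max_gap_secs > 0);

    if (secs_since_update > max_gap_secs &&
        on_disk_seq != last_written_seq)
        return KMMPD_ABORT_RDONLY;

    return KMMPD_UPDATE;
}
```

With max_gap_secs = 10, a wakeup after 12s that finds a foreign sequence
number returns KMMPD_ABORT_RDONLY; an on-time wakeup skips the comparison
in this sketch and simply updates.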
Ideally there would also be block-layer functionality to fence the IO
on the local system (e.g. plug the elevator output; I don't think that
there is anything that could be done about IO already submitted to the
device), but the function I thought did this (set_device_rdonly()) is
only checked at mount time and is useless for this purpose.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Thread overview: 21+ messages
2007-05-21 19:52 [RFC][PATCH] Multiple mount protection Kalpak Shah
2007-05-22 7:15 ` Manoj Joseph
2007-05-22 7:34 ` Kalpak Shah
2007-05-22 7:53 ` Manoj Joseph
2007-05-22 8:06 ` Kalpak Shah
2007-05-24 23:25 ` Karel Zak
2007-05-25 6:44 ` Kalpak Shah
2007-05-25 14:39 ` Theodore Tso
2007-05-25 19:31 ` Jim Garlick
2007-05-25 21:36 ` Kalpak Shah
2007-05-30 20:58 ` Kalpak Shah
2007-05-31 16:16 ` Theodore Tso
2007-05-31 21:09 ` Kalpak Shah
2007-06-01 8:46 ` Andi Kleen
2007-06-01 8:27 ` Kalpak Shah
2007-06-01 9:14 ` Andreas Dilger
2007-06-01 10:56 ` Andi Kleen
2007-06-01 11:41 ` Theodore Tso
2007-06-01 12:13 ` Andi Kleen
2007-06-01 13:52 ` Theodore Tso
2007-06-01 18:00 ` Andreas Dilger