* [RFC] store RAID stride in superblock
@ 2007-05-12 2:02 Andreas Dilger
2007-05-12 2:21 ` Eric Sandeen
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Andreas Dilger @ 2007-05-12 2:02 UTC (permalink / raw)
To: linux-ext4
It is possible to specify the RAID stride to mke2fs to allow it to optimize
the layout of the bitmaps. With the new mballoc it is also possible to
tell it via a mount option to do large allocations aligned on the RAID
stride (by default it aligns on 1MB boundaries from the start of the LUN).
What would be rather convenient is to store the RAID stride value in the
superblock. That would spare a lot of hassle on the part of the admin
to tune the filesystem optimally for the underlying storage. There is
also a library used in the XFS tools that knows how to probe various
kinds of block devices (e.g. MD RAID, LVM/DM, etc) to get their storage
layout that would avoid the need for the user to specify anything.
Any thoughts on this?
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] store RAID stride in superblock
From: Eric Sandeen @ 2007-05-12 2:21 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-ext4
Andreas Dilger wrote:
> It is possible to specify the RAID stride to mke2fs allow it to optimize
> the layout of the bitmaps. With the new mballoc it is also possible to
> tell it via a mount option to do large allocations aligned on the RAID
> stride (by default it aligns on 1MB boundaries from the start of the LUN).
>
> What would be rather convenient is to store the RAID stride value in the
> superblock. That would spare a lot of hassle on the part of the admin
> to tune the filesystem optimally for the underlying storage. There is
> also a library used in the XFS tools that knows how to probe various
> kinds of block devices (e.g. MD RAID, LVM/DM, etc) to get their storage
> layout that would avoid the need for the user to specify anything.
>
> Any thoughts on this?
I think it sounds great. I think ext4 would benefit greatly from knowing
a bit more about the underlying device geometry & allocating accordingly...
-Eric
* Re: [RFC] store RAID stride in superblock
From: Eric @ 2007-05-12 8:11 UTC (permalink / raw)
To: linux-ext4
On Fri, 2007-05-11 at 19:02 -0700, Andreas Dilger wrote:
> What would be rather convenient is to store the RAID stride value in the
> superblock.
> There is also a library used in the XFS tools that knows how to probe various
> kinds of block devices (e.g. MD RAID, LVM/DM, etc) to get their storage
> layout that would avoid the need for the user to specify anything.
>
> Any thoughts on this?
It's late at night here so my thoughts are a little fuzzy. Nonetheless:
The concept is really tempting. RAID is good, and not asking the user
for information that the system can find out for itself is good too.
In the unlikely event that the RAID stride were to change, I think the
autodetect-each-time method would be superior to the store-in-superblock
method. Doubly so if the code to detect MD and LVM stride is lean and
clean.
I wonder if, in a RAID 0 configuration, deliberately misaligning data
structures smaller than (size of stride * number of disks in array)
would yield a performance benefit.
Cheers,
Eric
* Re: [RFC] store RAID stride in superblock
From: Alex Tomas @ 2007-05-12 8:33 UTC (permalink / raw)
To: Eric; +Cc: linux-ext4
Eric wrote:
> The concept is really tempting. RAID is good, and not asking the user
> for information that the system can find out for itself is good too.
>
> In the unlikely event that the RAID stride were to change, I think the
> autodetect-each-time method would be superior to the store-in-superblock
> method. Doubly so if the code to detect MD and LVM stride is lean and
> clean.
True, but in some cases (hardware RAID, SAN, etc.) there is no easy way
to learn that other than asking the user.
thanks, Alex
* Re: [RFC] store RAID stride in superblock
From: Eric @ 2007-05-12 9:32 UTC (permalink / raw)
To: linux-ext4
On Sat, 2007-05-12 at 12:33 +0400, Alex Tomas wrote:
> > In the unlikely event that the RAID stride were to change, I think the
> > autodetect-each-time method would be superior to the store-in-superblock
> > method.
>
> true, but in some cases (hardware raid, SAN, etc) there is no easy way
> to learn that other than asking user.
That hadn't occurred to me. Perhaps the filesystem driver or mkfs could
probe for the stride in those cases? If the code asks for, say, 10MiB of
data from the block device and it gets back sectors that are spaced
128KiB apart before it gets the rest of the data, it can make an
intelligent guess about the stride.
I wonder what penalties would come from a bad guess due to a cache in
between the block device driver and the disk platters, or other load on
a SAN...
Eric
* Re: [RFC] store RAID stride in superblock
From: Alex Tomas @ 2007-05-12 9:38 UTC (permalink / raw)
To: Eric; +Cc: linux-ext4
I don't quite follow. How would you "probe"? For example, there is a DDN
array which writes well only with 1MB aligned/sized requests, so mballoc
tries to align allocation requests with respect to this constraint. Do you
mean incorporating a storage benchmark into the mount procedure?
thanks, Alex
Eric wrote:
> That hadn't occurred to me. Perhaps the filesystem driver or mkfs could
> probe for the stride in those cases? If the code asks for, say, 10MiB of
> data from the block device and it gets back sectors that are spaced
> 128KiB apart before it gets the rest of the data, it can make an
> intelligent guess about the stride.
>
> I wonder what penalties would come from a bad guess due to a cache in
> between the block device driver and the disk platters, or other load on
> a SAN...
>
> Eric
* Re: [RFC] store RAID stride in superblock
From: Eric @ 2007-05-12 16:14 UTC (permalink / raw)
To: linux-ext4
> > Perhaps the filesystem driver or mkfs could
> > probe for the stride in those cases? If the code asks for, say, 10MiB of
> > data from the block device and it gets back sectors that are spaced
> > 128KiB apart before it gets the rest of the data, it can make an
> > intelligent guess about the stride.
>
> do you mean incorporation
> storage benchmark in the mount procedure?
Yes. If the benefits of automatically aligning on-disk data structures
to the stride of the array are great enough, then a storage
mini-benchmark may be of use.
For example, suppose we have an array with a stride of 1MiB and the
filesystem driver requests 10MiB of contiguous data from the start of
the block device. Then the data at +0MiB from the start of the device,
the data at +1MiB, the data at +2MiB, and so on ought to arrive earlier
than the data at, say, +0.5MiB, +1.5MiB and +2.5MiB.
This would allow the filesystem driver to detect the stride even when
the striping isn't being done by the MD or LVM/DM drivers in Linux
(which, apparently, have well-defined interfaces for discovering the
stride in software). I imagine this would work well for a
run-of-the-mill hardware RAID card in a PC.
However, as you pointed out in your original email, there are SANs to be
considered. If another host is putting load on the SAN, it could throw
off the read timings and cause the filesystem driver to make a bad
guess.
Cheers,
Eric
* Re: [RFC] store RAID stride in superblock
From: Andreas Dilger @ 2007-05-12 15:26 UTC (permalink / raw)
To: Eric; +Cc: linux-ext4
On May 12, 2007 01:11 -0700, Eric wrote:
> The concept is really tempting. RAID is good, and not asking the user
> for information that the system can find out for itself is good too.
>
> In the unlikely event that the RAID stride were to change, I think the
> autodetect-each-time method would be superior to the store-in-superblock
> method. Doubly so if the code to detect MD and LVM stride is lean and
> clean.
I've asked the block layer folks a couple of times if it would be possible
to have an interface for this in the kernel, but so far I've had little
success in getting them to do it and I don't have time for it myself.
I agree that auto-detection is best (would need a userspace interface too)
but a lot can be done with a format-time detection.
It is unlikely that the RAID striping will change under the filesystem,
and if it does then the stripe size is usually kept the same (e.g. RAID 5
restriping to add a disk). Even if the striping does change, the current
alignment of bitmaps is about the worst possible case for power-of-two
stride sizes because a single disk has all of the bitmaps (using the terms
"stripe = N * stride" for N+1 RAID5 or N+2 RAID6 - if anyone knows the
"more correct" terms please speak up).
It would also be possible to use tune2fs to change the stride + stripe size
in the superblock to at least tune the mballoc allocation even if we can't
move the bitmaps around very easily.
> I wonder if, in a RAID 0 configuration, deliberately misaligning data
> structures smaller than (size of stride * number of disks in array)
> would yield a performance benefit.
Yes, that would definitely be something to do. If you have N-disk RAID0,
each disk having "stride" blocks at a time, then offsetting the bitmaps by
"stride" blocks each is exactly what "mke2fs -E stride=" does. The mballoc
"stripe" option tries to put large allocations covering the whole stripe
to avoid parity read-modify-write if possible.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
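The bitmap placement described in the message above can be illustrated with a small C model. This is only a sketch under stated assumptions, not the mke2fs code: it assumes a plain RAID 0 layout in which filesystem block b lives on disk (b / stride) % ndisks, and the helper names `disk_of_block` and `bitmap_block`, as well as the placement rule itself, are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: index of the disk that holds a filesystem block in
 * an N-disk RAID 0 whose per-disk chunk ("stride") is `stride` blocks. */
static unsigned disk_of_block(uint64_t block, uint32_t stride, unsigned ndisks)
{
	assert(stride > 0 && ndisks > 0);
	return (unsigned)((block / stride) % ndisks);
}

/* Hypothetical placement rule: the bitmap of group `group` sits at the
 * first block of the group, shifted by group * stride blocks (wrapping
 * inside the group). stride == 0 means "no offset at all". */
static uint64_t bitmap_block(uint32_t group, uint32_t blocks_per_group,
			     uint32_t stride)
{
	assert(blocks_per_group > 0);
	uint64_t start = (uint64_t)group * blocks_per_group;
	uint64_t shift = ((uint64_t)group * stride) % blocks_per_group;
	return start + shift;
}
```

With 32768 blocks per group, a stride of 16 blocks and 4 disks, the unshifted bitmaps all map to disk 0 in this model (32768 / 16 = 2048 is a multiple of 4), which is the "single disk has all of the bitmaps" case. With the per-group shift, group g maps to disk g % 4.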
* Re: [RFC] store RAID stride in superblock
From: Theodore Tso @ 2007-05-19 2:08 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-ext4
On Fri, May 11, 2007 at 07:02:48PM -0700, Andreas Dilger wrote:
> It is possible to specify the RAID stride to mke2fs allow it to optimize
> the layout of the bitmaps. With the new mballoc it is also possible to
> tell it via a mount option to do large allocations aligned on the RAID
> stride (by default it aligns on 1MB boundaries from the start of the LUN).
You asked for it, you got it.
- Ted
# HG changeset patch
# User tytso@mit.edu
# Date 1179540413 14400
# Node ID 2afd1c039d26aaa66c55ede30770df1990392f84
# Parent f95d161b454ec94e8974946d38e3e94c612f2cd2
Store the RAID stride value in the superblock and take advantage of it

Store the RAID stride value when a filesystem is created with a
requested RAID stride, and then use it automatically in resize2fs.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff -r f95d161b454e -r 2afd1c039d26 lib/ext2fs/ChangeLog
--- a/lib/ext2fs/ChangeLog Fri May 18 21:44:29 2007 -0400
+++ b/lib/ext2fs/ChangeLog Fri May 18 22:06:53 2007 -0400
@@ -1,3 +1,10 @@ 2007-05-08 Eric Sandeen <sandeen@redhat
+2007-05-18 Theodore Tso <tytso@mit.edu>
+
+	* openfs.c (ext2fs_open2): Set fs->stride from the superblock's
+		s_raid_stride value.
+
+	* ext2_fs.h: Allocate space for RAID stride in the superblock.
+
 2007-05-08 Eric Sandeen <sandeen@redhat.com>
 
 	* ext2_fs.h (inode_uid, inode_gid): The inode_uid() and
diff -r f95d161b454e -r 2afd1c039d26 lib/ext2fs/ext2_fs.h
--- a/lib/ext2fs/ext2_fs.h Fri May 18 21:44:29 2007 -0400
+++ b/lib/ext2fs/ext2_fs.h Fri May 18 22:06:53 2007 -0400
@@ -573,7 +573,9 @@ struct ext2_super_block {
 	__u16	s_min_extra_isize;	/* All inodes have at least # bytes */
 	__u16	s_want_extra_isize;	/* New inodes should reserve # bytes */
 	__u32	s_flags;		/* Miscellaneous flags */
-	__u32	s_reserved[167];	/* Padding to the end of the block */
+	__u16	s_raid_stride;		/* RAID stride */
+	__u16	s_pad;			/* Padding */
+	__u32	s_reserved[166];	/* Padding to the end of the block */
 };
 
 /*
diff -r f95d161b454e -r 2afd1c039d26 lib/ext2fs/openfs.c
--- a/lib/ext2fs/openfs.c Fri May 18 21:44:29 2007 -0400
+++ b/lib/ext2fs/openfs.c Fri May 18 22:06:53 2007 -0400
@@ -297,6 +297,8 @@ errcode_t ext2fs_open2(const char *name,
 		dest += fs->blocksize;
 	}
 
+	fs->stride = fs->super->s_raid_stride;
+
 	*ret_fs = fs;
 	return 0;
 cleanup:
diff -r f95d161b454e -r 2afd1c039d26 misc/ChangeLog
--- a/misc/ChangeLog Fri May 18 21:44:29 2007 -0400
+++ b/misc/ChangeLog Fri May 18 22:06:53 2007 -0400
@@ -1,4 +1,6 @@ 2007-05-18 Theodore Tso <tytso@mit.edu
 2007-05-18 Theodore Tso <tytso@mit.edu>
+
+	* mke2fs.c (main): Save the raid stride to the superblock
 
 	* blkid.c (main): Add -g option to blkid which will garbage
 		collect the cache.
diff -r f95d161b454e -r 2afd1c039d26 misc/mke2fs.c
--- a/misc/mke2fs.c Fri May 18 21:44:29 2007 -0400
+++ b/misc/mke2fs.c Fri May 18 22:06:53 2007 -0400
@@ -1611,7 +1611,7 @@ int main (int argc, char *argv[])
 		test_disk(fs, &bb_list);
 
 	handle_bad_blocks(fs, bb_list);
-	fs->stride = fs_stride;
+	fs->stride = fs->super->s_raid_stride = fs_stride;
 	retval = ext2fs_allocate_tables(fs);
 	if (retval) {
 		com_err(program_name, retval,
diff -r f95d161b454e -r 2afd1c039d26 resize/ChangeLog
--- a/resize/ChangeLog Fri May 18 21:44:29 2007 -0400
+++ b/resize/ChangeLog Fri May 18 22:06:53 2007 -0400
@@ -1,3 +1,9 @@ 2007-03-18 Theodore Tso <tytso@mit.edu
+2007-05-18 Theodore Tso <tytso@mit.edu>
+
+	* main.c (determine_fs_stride): Use the superblock s_raid_stride
+		if it is set; save the heuristically determined stride to
+		the superblock if it is not set.
+
 2007-03-18 Theodore Tso <tytso@mit.edu>
 
 	* resize2fs.c (check_and_change_inodes): Check to make sure the
diff -r f95d161b454e -r 2afd1c039d26 resize/main.c
--- a/resize/main.c Fri May 18 21:44:29 2007 -0400
+++ b/resize/main.c Fri May 18 22:06:53 2007 -0400
@@ -101,6 +101,8 @@ static void determine_fs_stride(ext2_fil
 	unsigned int	has_sb, prev_has_sb, num;
 	int		i_stride, b_stride;
 
+	if (fs->stride)
+		return;
 	num = 0; sum = 0;
 	for (group = 0; group < fs->group_desc_count; group++) {
 		has_sb = ext2fs_bg_has_super(fs, group);
@@ -131,6 +133,9 @@ static void determine_fs_stride(ext2_fil
 		fs->stride = sum / num;
 	else
 		fs->stride = 0;
+
+	fs->super->s_raid_stride = fs->stride;
+	ext2fs_mark_super_dirty(fs);
 
 #if 0
 	if (fs->stride)
@@ -348,7 +353,8 @@ int main (int argc, char ** argv)
 				_("Invalid stride length"));
 			exit(1);
 		}
-		fs->stride = use_stride;
+		fs->stride = fs->super->s_raid_stride = use_stride;
+		ext2fs_mark_super_dirty(fs);
 	} else
 		determine_fs_stride(fs);
 
* Re: [RFC] store RAID stride in superblock
From: Andreas Dilger @ 2007-05-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: Theodore Ts'o
On May 22, 2007 01:22 +0530, Kalpak Shah wrote:
> 	__u16	s_raid_stride;		/* RAID stride */
> -	__u16	s_pad;			/* Padding */
> +	__u16	s_mmp_interval;		/* Wait for # seconds in MMP checking */
> +	__u64	s_mmp_block;		/* Block for multi-mount protection */
Ted, I just noticed this updated patch w.r.t. your recent s_raid_stride
addition. I also want to have a separate parameter for "s_raid_stripe_width"
which is normally N * s_raid_stride, where N is the number of disks in a
RAID 5 N+1 (or RAID 6 N+2) parity stripe. This is for delalloc+mballoc to
allow it to align and size new allocations so that writes do not impose
read-modify-write overhead on the RAID stripes.
My understanding from the code is that s_raid_stride is to put the bitmaps
for different groups on different disks to avoid always having a single
disk busy with bitmap updates.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
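The relationship between stride and stripe width described in the message above, and the reason an allocator wants stripe-aligned, stripe-sized writes, can be sketched in a few lines of C. This is a minimal illustration, not the mballoc implementation: the function names `stripe_width`, `align_up` and `is_full_stripe_write` are hypothetical, and all quantities are assumed to be in filesystem blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stripe width in filesystem blocks: stride times the number of data
 * disks (N for RAID 5 N+1 or RAID 6 N+2). */
static uint32_t stripe_width(uint32_t stride, uint32_t data_disks)
{
	return stride * data_disks;
}

/* Round `block` up to the next multiple of `width`. */
static uint64_t align_up(uint64_t block, uint32_t width)
{
	assert(width > 0);
	return (block + width - 1) / width * width;
}

/* True when a write of `len` blocks starting at `start` covers only whole
 * stripes, so parity can be computed from the new data alone and no
 * read-modify-write of old data or parity is needed. */
static bool is_full_stripe_write(uint64_t start, uint64_t len, uint32_t width)
{
	assert(width > 0);
	return len > 0 && start % width == 0 && len % width == 0;
}
```

For example, with a stride of 16 blocks on a RAID 5 4+1 array the stripe width is 64 blocks; an allocation that would start at block 100 is better moved to `align_up(100, 64)`, i.e. block 128, and sized in multiples of 64 blocks.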
* Re: [RFC] store RAID stride in superblock
From: Rupesh Thakare @ 2007-05-24 14:15 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4
Hello,
I've added an "s_raid_stripe_width" parameter to the superblock.
I've also incorporated the "s_raid_stride" and "s_raid_stripe_width"
parameters in tune2fs.
The new options can be specified using '-E options' in both mke2fs and
tune2fs.
Both man pages (mke2fs and tune2fs) are updated accordingly.
The patch is attached.
Thanks,
Rupesh.
Andreas Dilger wrote:
> On May 22, 2007 01:22 +0530, Kalpak Shah wrote:
>> 	__u16	s_raid_stride;		/* RAID stride */
>> -	__u16	s_pad;			/* Padding */
>> +	__u16	s_mmp_interval;		/* Wait for # seconds in MMP checking */
>> +	__u64	s_mmp_block;		/* Block for multi-mount protection */
>
> Ted, I just noticed this updated patch w.r.t. your recent s_raid_stride
> addition. I also want to have a separate parameter for "s_raid_stripe_width"
> which is normally N * s_raid_stride, where N is the number of disks in a
> RAID 5 N+1 (or RAID 6 N+2) parity stripe. This is for delalloc+mballoc to
> allow it to align and size new allocations so that writes do not impose
> read-modify-write overhead on the RAID stripes.
>
> My understanding from the code is that s_raid_stride is to put the bitmaps
> for different groups on different disks to avoid always having a single
> disk busy with bitmap updates.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
[-- Attachment #2: e2fsprogs_stride.patch --]
Index: e2fsprogs-upstream/lib/ext2fs/ext2_fs.h
===================================================================
--- e2fsprogs-upstream.orig/lib/ext2fs/ext2_fs.h
+++ e2fsprogs-upstream/lib/ext2fs/ext2_fs.h
@@ -573,9 +573,13 @@ struct ext2_super_block {
 	__u16	s_min_extra_isize;	/* All inodes have at least # bytes */
 	__u16	s_want_extra_isize;	/* New inodes should reserve # bytes */
 	__u32	s_flags;		/* Miscellaneous flags */
+	/*
+	 * RAID chunksize and stripe width support
+	 */
 	__u16	s_raid_stride;		/* RAID stride */
 	__u16	s_pad;			/* Padding */
-	__u32	s_reserved[166];	/* Padding to the end of the block */
+	__u32	s_raid_stripe_width;	/* blocks on all data disks (N*stride)*/
+	__u32	s_reserved[165];	/* Padding to the end of the block */
 };
 
 /*
Index: e2fsprogs-upstream/lib/ext2fs/initialize.c
===================================================================
--- e2fsprogs-upstream.orig/lib/ext2fs/initialize.c
+++ e2fsprogs-upstream/lib/ext2fs/initialize.c
@@ -156,6 +156,8 @@ errcode_t ext2fs_initialize(const char *
 	set_field(s_feature_incompat, 0);
 	set_field(s_feature_ro_compat, 0);
 	set_field(s_first_meta_bg, 0);
+	set_field(s_raid_stride, 0);		/* default stride size: 0 */
+	set_field(s_raid_stripe_width, 0);	/* default stripe width: 0 */
 	if (super->s_feature_incompat & ~EXT2_LIB_FEATURE_INCOMPAT_SUPP) {
 		retval = EXT2_ET_UNSUPP_FEATURE;
 		goto cleanup;
Index: e2fsprogs-upstream/misc/mke2fs.c
===================================================================
--- e2fsprogs-upstream.orig/misc/mke2fs.c
+++ e2fsprogs-upstream/misc/mke2fs.c
@@ -100,7 +100,7 @@ static void usage(void)
 	"\t[-N number-of-inodes] [-m reserved-blocks-percentage] "
 	"[-o creator-os]\n\t[-g blocks-per-group] [-L volume-label] "
 	"[-M last-mounted-directory]\n\t[-O feature[,...]] "
-	"[-r fs-revision] [-R options] [-qvSV]\n\tdevice [blocks-count]\n"),
+	"[-r fs-revision] [-E options] [-qvSV]\n\tdevice [blocks-count]\n"),
 		program_name);
 	exit(1);
 }
@@ -785,14 +785,27 @@ static void parse_extended_opts(struct e
 				r_usage++;
 				continue;
 			}
-			fs_stride = strtoul(arg, &p, 0);
-			if (*p || (fs_stride == 0)) {
+			param->s_raid_stride = strtoul(arg, &p, 0);
+			if (*p || (param->s_raid_stride == 0)) {
 				fprintf(stderr,
 					_("Invalid stride parameter: %s\n"),
 					arg);
 				r_usage++;
 				continue;
 			}
+		} else if (strcmp(token, "stripe-width") == 0) {
+			if (!arg) {
+				r_usage++;
+				continue;
+			}
+			param->s_raid_stripe_width = strtoul(arg, &p, 0);
+			if (*p || (param->s_raid_stripe_width == 0)) {
+				fprintf(stderr,
+					_("Invalid stripe-width parameter: %s\n"),
+					arg);
+				r_usage++;
+				continue;
+			}
 		} else if (!strcmp(token, "resize")) {
 			unsigned long resize, bpg, rsv_groups;
 			unsigned long group_desc_count, desc_blocks;
@@ -858,6 +871,7 @@ static void parse_extended_opts(struct e
 			"\tis set off by an equals ('=') sign.\n\n"
 			"Valid extended options are:\n"
 			"\tstride=<stride length in blocks>\n"
+			"\tstripe-width=<stripe width in blocks>\n"
 			"\tresize=<resize maximum size in blocks>\n\n"));
 		exit(1);
 	}
@@ -1625,7 +1639,7 @@ int main (int argc, char *argv[])
 		test_disk(fs, &bb_list);
 
 	handle_bad_blocks(fs, bb_list);
-	fs->stride = fs->super->s_raid_stride = fs_stride;
+	fs->stride = fs_stride = fs->super->s_raid_stride;
 	retval = ext2fs_allocate_tables(fs);
 	if (retval) {
 		com_err(program_name, retval,
Index: e2fsprogs-upstream/misc/tune2fs.c
===================================================================
--- e2fsprogs-upstream.orig/misc/tune2fs.c
+++ e2fsprogs-upstream/misc/tune2fs.c
@@ -71,6 +71,8 @@ static unsigned short errors;
 static int open_flag;
 static char *features_cmd;
 static char *mntopts_cmd;
+static int stride, stripe_width;
+static int stride_set, stripe_width_set;
 
 int journal_size, journal_flags;
 char *journal_device;
@@ -87,9 +89,9 @@ static void usage(void)
 		  "\t[-i interval[d|m|w]] [-j] [-J journal_options]\n"
 		  "\t[-l] [-s sparse_flag] [-m reserved_blocks_percent]\n"
 		  "\t[-o [^]mount_options[,...]] [-r reserved_blocks_count]\n"
-		  "\t[-u user] [-C mount_count] [-L volume_label] "
-		  "[-M last_mounted_dir]\n"
-		  "\t[-O [^]feature[,...]] [-T last_check_time] [-U UUID]"
+		  "\t[-u user] [-C mount_count] [-E options] [-L volume_label]"
+		  "\n\t[-M last_mounted_dir] [-O [^]feature[,...]]\n"
+		  "\t[-T last_check_time] [-U UUID]"
 		  " device\n"), program_name);
 	exit (1);
 }
@@ -497,15 +499,86 @@ static time_t parse_time(char *str)
 	return (mktime(&ts));
 }
 
+static void parse_extended_opts(const char *opts)
+{
+	char	*buf, *token, *next, *p, *arg;
+	int	len;
+	int	r_usage = 0;
+
+	len = strlen(opts);
+	buf = malloc(len+1);
+	if (!buf) {
+		fprintf(stderr,
+			_("Couldn't allocate memory to parse options!\n"));
+		exit(1);
+	}
+	strcpy(buf, opts);
+	for (token = buf; token && *token; token = next) {
+		p = strchr(token, ',');
+		next = 0;
+		if (p) {
+			*p = 0;
+			next = p+1;
+		}
+		arg = strchr(token, '=');
+		if (arg) {
+			*arg = 0;
+			arg++;
+		}
+		if (strcmp(token, "stride") == 0) {
+			if (!arg) {
+				r_usage++;
+				continue;
+			}
+			stride = strtoul(arg, &p, 0);
+			if (*p || (stride == 0)) {
+				fprintf(stderr,
+					_("Invalid stride parameter: %s\n"),
+					arg);
+				r_usage++;
+				continue;
+			}
+			stride_set = 1;
+		} else if (strcmp(token, "stripe-width") == 0) {
+			if (!arg) {
+				r_usage++;
+				continue;
+			}
+			stripe_width = strtoul(arg, &p, 0);
+			if (*p || (stripe_width == 0)) {
+				fprintf(stderr,
+					_("Invalid stripe-width parameter: %s\n"),
+					arg);
+				r_usage++;
+				continue;
+			}
+			stripe_width_set = 1;
+		} else
+			r_usage++;
+	}
+	if (r_usage) {
+		fprintf(stderr, _("\nBad options specified.\n\n"
+			"Extended options are separated by commas, "
+			"and may take an argument which\n"
+			"\tis set off by an equals ('=') sign.\n\n"
+			"Valid extended options are:\n"
+			"\tstride=<stride length in blocks>\n"
+			"\tstripe-width=<stripe width in blocks>\n"));
+		exit(1);
+	}
+
+}
+
 static void parse_tune2fs_options(int argc, char **argv)
 {
 	int c;
 	char * tmp;
+	char * extended_opts = NULL;
 	struct group * gr;
 	struct passwd * pw;
 
 	printf("tune2fs %s (%s)\n", E2FSPROGS_VERSION, E2FSPROGS_DATE);
-	while ((c = getopt(argc, argv, "c:e:fg:i:jlm:o:r:s:u:C:J:L:M:O:T:U:")) != EOF)
+	while ((c = getopt(argc, argv, "c:e:fg:i:jlm:o:r:s:u:C:E:J:L:M:O:T:U:")) != EOF)
 		switch (c)
 		{
 			case 'c':
@@ -548,6 +621,10 @@ static void parse_tune2fs_options(int ar
 				e_flag = 1;
 				open_flag = EXT2_FLAG_RW;
 				break;
+			case 'E':
+				extended_opts = optarg;
+				parse_extended_opts(extended_opts);
+				break;
 			case 'f': /* Force */
 				f_flag = 1;
 				break;
@@ -921,6 +998,16 @@ int main (int argc, char ** argv)
 
 	if (l_flag)
 		list_super (sb);
+	if (stride_set) {
+		sb->s_raid_stride = stride;
+		ext2fs_mark_super_dirty(fs);
+		printf(_("Setting stride size to %d\n"), stride);
+	}
+	if (stripe_width_set) {
+		sb->s_raid_stripe_width = stripe_width;
+		ext2fs_mark_super_dirty(fs);
+		printf(_("Setting stripe width to %d\n"), stripe_width);
+	}
 	remove_error_table(&et_ext2_error_table);
 	return (ext2fs_close (fs) ? 1 : 0);
 }
Index: e2fsprogs-upstream/misc/mke2fs.8.in
===================================================================
--- e2fsprogs-upstream.orig/misc/mke2fs.8.in
+++ e2fsprogs-upstream/misc/mke2fs.8.in
@@ -179,10 +179,23 @@ option is still accepted for backwards c
 following extended options are supported:
 .RS 1.2i
 .TP
-.BI stride= stripe-size
+.BI stride= stride-size
 Configure the filesystem for a RAID array with
-.I stripe-size
-filesystem blocks per stripe.
+.I stride-size
+filesystem blocks. This is the number of blocks read or written to disk
+before moving to the next disk. This mostly affects placement of filesystem
+metadata like bitmaps at
+.BR mke2fs (8)
+time to avoid placing them on a single disk, which can hurt performance.
+It may also be used by the block allocator.
+.TP
+.BI stripe-width= stripe-width
+Configure the filesystem for a RAID array with
+.I stripe-width
+filesystem blocks per stripe. This is typically stride-size * N, where
+N is the number of data disks in the RAID (e.g. RAID 5 N+1, RAID 6 N+2).
+This allows the block allocator to prevent read-modify-write of the
+parity in a RAID stripe if possible when the data is written.
 .TP
 .BI resize= max-online-resize
 Reserve enough space so that the block group descriptor table can grow
Index: e2fsprogs-upstream/misc/tune2fs.8.in
===================================================================
--- e2fsprogs-upstream.orig/misc/tune2fs.8.in
+++ e2fsprogs-upstream/misc/tune2fs.8.in
@@ -61,6 +61,10 @@ tune2fs \- adjust tunable filesystem par
 .I mount-count
 ]
 [
+.B \-E
+.I extended-options
+]
+[
 .B \-L
 .I volume-name
 ]
@@ -144,6 +148,31 @@ Remount filesystem read-only.
 Cause a kernel panic.
 .RE
 .TP
+.BI \-E " extended-options"
+Set extended options for the filesystem. Extended options are comma
+separated, and may take an argument using the equals ('=') sign.
+The following extended options are supported:
+.RS 1.2i
+.TP
+.BI stride= stride-size
+Configure the filesystem for a RAID array with
+.I stride-size
+filesystem blocks. This is the number of blocks read or written to disk
+before moving to the next disk. This mostly affects placement of filesystem
+metadata like bitmaps at
+.BR mke2fs (8)
+time to avoid placing them on a single disk, which can hurt performance.
+It may also be used by the block allocator.
+.TP
+.BI stripe-width= stripe-width
+Configure the filesystem for a RAID array with
+.I stripe-width
+filesystem blocks per stripe. This is typically stride-size * N, where
+N is the number of data disks in the RAID (e.g. RAID 5 N+1, RAID 6 N+2).
+This allows the block allocator to prevent read-modify-write of the
+parity in a RAID stripe if possible when the data is written.
+.RE
+.TP
 .B \-f
 Force the tune2fs operation to complete even in the face of errors.
 This option is useful when removing the
* Re: [RFC] store RAID stride in superblock
From: Theodore Tso @ 2007-05-31 16:21 UTC (permalink / raw)
To: Rupesh Thakare; +Cc: Andreas Dilger, linux-ext4
On Thu, May 24, 2007 at 07:45:32PM +0530, Rupesh Thakare wrote:
> Hello,
> I've added "s_raid_stripe_width" parameter in superblock.
> I've also incorporated "s_raid_stride" and "s_raid_stripe_width"
> parameters in tune2fs.
> The new options can be specified using '-E options' in both mke2fs and
> tune2fs.
> Both the Man pages (mke2fs and tune2fs) are updated accordingly.
> Patch is attached herewith.
Thanks. I've used a different offset for the raid_stripe_width, to
avoid conflicting with Kalpak's mmp patch.
Could you send me a signed-off-by for your patch? Thanks,
- Ted
* Re: [RFC] store RAID stride in superblock
From: Andreas Dilger @ 2007-05-31 20:19 UTC (permalink / raw)
To: Theodore Tso; +Cc: Rupesh Thakare, linux-ext4, Kalpak Shah
On May 31, 2007 12:21 -0400, Theodore Tso wrote:
> On Thu, May 24, 2007 at 07:45:32PM +0530, Rupesh Thakare wrote:
> > I've added "s_raid_stripe_width" parameter in superblock.
> > I've also incorporated "s_raid_stride" and "s_raid_stripe_width"
> > parameters in tune2fs.
> > The new options can be specified using '-E options' in both mke2fs and
> > tune2fs.
> > Both the Man pages (mke2fs and tune2fs) are updated accordingly.
> > Patch is attached herewith.
>
> Thanks. I've used a different offset for the raid_stripe_width, to
> avoid conflicting with Kalpak's mmp patch.
Ah, we've been doing it the other way around here. It makes sense to keep
the s_raid_stripe_width fields together. I think this code is preliminary
enough that nobody has actually started using it yet. Can you please post
what the end of ext2_super_block looks like (whether you decide to reorder
the fields or not).
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [RFC] store RAID stride in superblock
From: Kalpak Shah @ 2007-05-31 21:02 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Theodore Tso, Rupesh Thakare, linux-ext4
On Thu, 2007-05-31 at 14:19 -0600, Andreas Dilger wrote:
> On May 31, 2007 12:21 -0400, Theodore Tso wrote:
> > On Thu, May 24, 2007 at 07:45:32PM +0530, Rupesh Thakare wrote:
> > > I've added "s_raid_stripe_width" parameter in superblock.
> > > I've also incorporated "s_raid_stride" and "s_raid_stripe_width"
> > > parameters in tune2fs.
> > > The new options can be specified using '-E options' in both mke2fs and
> > > tune2fs.
> > > Both the Man pages (mke2fs and tune2fs) are updated accordingly.
> > > Patch is attached herewith.
> >
> > Thanks. I've used a different offset for the raid_stripe_width, to
> > avoid conflicting with Kalpak's mmp patch.
>
> Ah, we've been doing it the other way around here. It makes sense to keep
> the s_raid_stripe_width fields together. I think this code is preliminary
> enough that nobody has actually started using it yet. Can you please post
> what the end of ext2_super_block looks like (whether you decide to reorder
> the fields or not).
I can update the MMP patches when I actually send them for inclusion. So
I think it makes sense to keep the s_raid_* fields together.
Thanks,
Kalpak.

* Re: [RFC] store RAID stride in superblock
2007-05-31 20:19 ` Andreas Dilger
2007-05-31 21:02 ` Kalpak Shah
@ 2007-05-31 21:33 ` Theodore Tso
2007-05-31 22:01 ` Eric Sandeen
2007-05-31 22:03 ` Andreas Dilger
1 sibling, 2 replies; 17+ messages in thread
From: Theodore Tso @ 2007-05-31 21:33 UTC
To: Andreas Dilger; +Cc: Rupesh Thakare, linux-ext4, Kalpak Shah

On Thu, May 31, 2007 at 02:19:02PM -0600, Andreas Dilger wrote:
> Ah, we've been doing it the other way around here. It makes sense to keep
> the s_raid_stripe_width fields together. I think this code is preliminary
> enough that nobody has actually started using it yet. Can you please post
> what the end of ext2_super_block looks like (whether you decide to reorder
> the fields or not).

Oops, I just pushed a set of bugfixes to Linux that included the
superblock field reservations.

I was going back and forth about whether to keep them together, or
whether to keep the extra u16 s_pad and then have to reserve another
u16 field plus another u16 field for MMP seconds field. Since you guys
had been talking about the MMP code for a longer period of time (I think
you first made the proposal a few months ago), I had assumed it had
precedence (and had possibly already been in use at some customer
somewhere), so I used Kalpak's original MMP superblock field
reservations.

I don't think it's worth changing at this point. (If no one is using
it yet, it won't be too hard to switch around so we're all doing the
same thing. :-)

What is in the e2fsprogs hg repository as well as the for_linus branch
of ext4.git is:

        ..
        __u16   s_raid_stride;          /* RAID stride */
        __u16   s_mmp_interval;         /* # seconds to wait in MMP checking */
        __u64   s_mmp_block;            /* Block for multi-mount protection */
        __u32   s_raid_stripe_width;    /* blocks on all data disks (N*stride)*/
        __u32   s_reserved[163];        /* Padding to the end of the block */
};

One question which does come to mind; is there any reason why we might
want to know the RAID level and/or the number of disks (as opposed to
just the stripe width)? And has anyone investigated where there are
magic ioctls or libdevmapper APIs so we can get the RAID parameters
automatically? If so, patches so that mke2fs can get the information
automatically (as opposed to forcing the user to have to specify lots
of annoying options) would be most welcome....

- Ted
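
For readers who want to check the arithmetic of the structure tail quoted above, here is a minimal C sketch. It rests on two assumptions that are not stated in the thread: that the whole superblock is 1024 bytes, and that the fields before s_raid_stride occupy 356 bytes, which is the head size that lets the listed fields reach 1024 bytes with no implicit padding. The names sb_tail_sketch, head and sb_new_fields_bytes are hypothetical, and the stdint types stand in for the kernel's __u16/__u32/__u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the end of ext2_super_block as posted above.
 * ASSUMPTION: all earlier fields are modelled as 356 opaque bytes. */
struct sb_tail_sketch {
    uint8_t  head[356];            /* placeholder for the earlier fields */
    uint16_t s_raid_stride;        /* RAID stride */
    uint16_t s_mmp_interval;       /* # seconds to wait in MMP checking */
    uint64_t s_mmp_block;          /* Block for multi-mount protection */
    uint32_t s_raid_stripe_width;  /* blocks on all data disks (N*stride) */
    uint32_t s_reserved[163];      /* Padding to the end of the block */
};

/* With the assumed head size, the 64-bit field is naturally aligned and
 * the structure fills exactly one 1024-byte superblock. */
_Static_assert(offsetof(struct sb_tail_sketch, s_mmp_block) % 8 == 0,
               "s_mmp_block should be 8-byte aligned");
_Static_assert(sizeof(struct sb_tail_sketch) == 1024,
               "sketch should be exactly 1024 bytes");

/* Bytes taken by the four fields between the head and s_reserved. */
static size_t sb_new_fields_bytes(void)
{
    return offsetof(struct sb_tail_sketch, s_reserved) -
           offsetof(struct sb_tail_sketch, s_raid_stride);
}
```

Under these assumptions the four fields take 16 bytes and s_reserved[163] accounts for the remaining 652. Keeping both RAID fields adjacent, as discussed earlier in the thread, would change the offsets but not the total.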

* Re: [RFC] store RAID stride in superblock
2007-05-31 21:33 ` Theodore Tso
@ 2007-05-31 22:01 ` Eric Sandeen
2007-05-31 22:03 ` Andreas Dilger
1 sibling, 0 replies; 17+ messages in thread
From: Eric Sandeen @ 2007-05-31 22:01 UTC
To: Theodore Tso; +Cc: Andreas Dilger, Rupesh Thakare, linux-ext4, Kalpak Shah

Theodore Tso wrote:
> And has anyone investigated where there are
> magic ioctls or libdevmapper APIs so we can get the RAID parameters
> automatically? If so, patches so that mke2fs can get the information
> automatically (as opposed to forcing the user to have to specify lots
> of annoying options) would be most welcome....

xfsprogs has a libdisk which does this for evms, lvm, md, dm, and
xvm(!); see for example md_get_subvol_stripe() in xfsprogs.

-Eric
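
As a rough illustration of what an automatic probe would still have to compute once it has the array geometry, the sketch below turns a chunk size, a disk count and a RAID level into the two superblock values. Everything here is an assumption for illustration: the function and struct names are hypothetical, the parity-disk table covers only RAID 0, 4, 5 and 6, and this is not the xfsprogs libdisk code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: both values are in filesystem blocks. */
struct raid_geom_sketch {
    uint32_t stride;        /* blocks per chunk on one disk */
    uint32_t stripe_width;  /* blocks across all data disks (N*stride) */
};

/* Derive stride and stripe width from a chunk size and array shape.
 * Returns 0 on success, -1 if the geometry is unsupported or would not
 * fit the field widths discussed in the thread (16-bit stride). */
static int raid_geom_from_chunk(uint32_t chunk_bytes, uint32_t fs_block_bytes,
                                int raid_level, uint32_t nr_disks,
                                struct raid_geom_sketch *out)
{
    uint32_t parity;
    uint64_t stride, width;

    assert(out != NULL);

    switch (raid_level) {
    case 0:         parity = 0; break;
    case 4: case 5: parity = 1; break;
    case 6:         parity = 2; break;
    default:        return -1;   /* other levels are not modelled here */
    }

    if (fs_block_bytes == 0 || chunk_bytes < fs_block_bytes ||
        chunk_bytes % fs_block_bytes != 0 || nr_disks <= parity)
        return -1;

    stride = chunk_bytes / fs_block_bytes;
    width = stride * (uint64_t)(nr_disks - parity);
    if (stride > UINT16_MAX || width > UINT32_MAX)
        return -1;

    out->stride = (uint32_t)stride;
    out->stripe_width = (uint32_t)width;
    return 0;
}
```

For example, a 4-disk RAID 5 array with 64 KiB chunks and 4 KiB filesystem blocks would give a stride of 16 blocks and a stripe width of 48 blocks under this model. The RAID level is needed only to find the number of data disks; once the two values are known, the level itself no longer matters.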

* Re: [RFC] store RAID stride in superblock
2007-05-31 21:33 ` Theodore Tso
2007-05-31 22:01 ` Eric Sandeen
@ 2007-05-31 22:03 ` Andreas Dilger
1 sibling, 0 replies; 17+ messages in thread
From: Andreas Dilger @ 2007-05-31 22:03 UTC
To: Theodore Tso; +Cc: Rupesh Thakare, linux-ext4, Kalpak Shah

On May 31, 2007 17:33 -0400, Theodore Tso wrote:
> Oops, I just pushed a set of bugfixes to Linux that included the
> superblock field reservations.

Oh well.

> What is in the e2fsprogs hg repository ... is:
>
> ..
> __u16 s_raid_stride; /* RAID stride */
> __u16 s_mmp_interval; /* # seconds to wait in MMP checking */
> __u64 s_mmp_block; /* Block for multi-mount protection */
> __u32 s_raid_stripe_width; /* blocks on all data disks (N*stride)*/
> __u32 s_reserved[163]; /* Padding to the end of the block */
> };

We're updating our patches to be based on the new HG code.

> One question which does come to mind; is there any reason why we might
> want to know the RAID level and/or the number of disks (as opposed to
> just the stripe width)?

Not so far. The raid_stride is for bitmap placement (and could also be
used for alignment of random IOs to avoid making 2 disks busy when 1
would do). The raid_stripe_width is the amount that delalloc+mballoc
will use for allocations+writes to avoid read-modify-write of RAID
stripes. It doesn't really matter what the RAID level is.

> And has anyone investigated where there are
> magic ioctls or libdevmapper APIs so we can get the RAID parameters
> automatically? If so, patches so that mke2fs can get the information
> automatically (as opposed to forcing the user to have to specify lots
> of annoying options) would be most welcome....

For now we will specify this via mke2fs or tune2fs for existing
filesystems. The XFS folks mentioned they have a library to extract
this info for Linux devices (e.g. DM, MD, etc), but of course that still
won't work for e.g. external RAID devices.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
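
The two uses of the fields described in this message can be sketched in a few lines of C. The helper names are hypothetical and this is not the mballoc code; the sketch assumes block numbers are counted from the start of the RAID device, that data is laid out in stride-sized chunks, and that both values are non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round a starting block up to the next multiple of stripe_width, so a
 * large allocation begins on a stripe boundary. */
static uint64_t align_to_stripe(uint64_t block, uint32_t stripe_width)
{
    assert(stripe_width != 0);
    uint64_t rem = block % stripe_width;
    return rem ? block + (stripe_width - rem) : block;
}

/* True if the blocks [start, start + len) all fall inside one
 * stride-sized chunk, i.e. the IO keeps a single disk busy. */
static bool fits_in_one_chunk(uint64_t start, uint32_t len, uint32_t stride)
{
    assert(stride != 0 && len != 0);
    return start / stride == (start + len - 1) / stride;
}

/* True if a write covers only whole stripes, which is the case that
 * lets a parity RAID skip the read-modify-write cycle. */
static bool is_full_stripe_write(uint64_t start, uint64_t len,
                                 uint32_t stripe_width)
{
    assert(stripe_width != 0);
    return len != 0 && start % stripe_width == 0 && len % stripe_width == 0;
}
```

With a stride of 16 and a stripe width of 48, a 2-block IO starting at block 14 stays on one disk while the same IO starting at block 15 touches two, and an allocation that would have started at block 50 is pushed to block 96. None of these helpers need the RAID level, which matches the point made above.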

Thread overview: 17+ messages
2007-05-12 2:02 [RFC] store RAID stride in superblock Andreas Dilger
2007-05-12 2:21 ` Eric Sandeen
2007-05-12 8:11 ` Eric
2007-05-12 8:33 ` Alex Tomas
2007-05-12 9:32 ` Eric
2007-05-12 9:38 ` Alex Tomas
2007-05-12 16:14 ` Eric
2007-05-12 15:26 ` Andreas Dilger
2007-05-19 2:08 ` Theodore Tso
2007-05-24 11:44 ` Andreas Dilger
2007-05-24 14:15 ` Rupesh Thakare
2007-05-31 16:21 ` Theodore Tso
2007-05-31 20:19 ` Andreas Dilger
2007-05-31 21:02 ` Kalpak Shah
2007-05-31 21:33 ` Theodore Tso
2007-05-31 22:01 ` Eric Sandeen
2007-05-31 22:03 ` Andreas Dilger