[PATCH] mk2fs lazy_journal

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mk2fs lazy_journal_init option
@ 2010-02-10 11:44 Andreas Dilger
  2010-02-16 22:15 ` Andreas Dilger
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2010-02-10 11:44 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ext4 development

[-- Attachment #1: Type: text/plain, Size: 1047 bytes --]

Attached is a patch to skip zeroing of the journal if the "-E  
lazy_journal_init" option is given to mke2fs (named to complement the  
"-E lazy_itable_init" option).  This can speed up format time  
significantly for large journal devices.  There is only a short-term  
risk of problems with the uninitialized journal, until such a time  
that the journal has been overwritten once.

To have any kind of problem with the uninitialized journal, the  
filesystem would have to have been newly reformatted twice nearly in a  
row, and then the node crashes before filling the journal even once,  
and the previously-written content would have to line up precisely on  
disk and also have the same transaction ID.

Patch has been lightly tested, showing mke2fs times steady at 14s for  
a 40GB filesystem, regardless of journal size, while previously it  
took up to 45s for an internal 2GB journal.

Signed-off-by: Andreas Dilger <adilger@sun.com>

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

[-- Attachment #2: e2fsprogs-lazy_journal_init.patch --]
[-- Type: application/octet-stream, Size: 6087 bytes --]

Index: e2fsprogs-1.41.9/lib/ext2fs/ext2fs.h
===================================================================
--- e2fsprogs-1.41.9.orig/lib/ext2fs/ext2fs.h
+++ e2fsprogs-1.41.9/lib/ext2fs/ext2fs.h
@@ -181,10 +181,9 @@ typedef struct ext2_file *ext2_file_t;
 
 /*
  * Flags for mkjournal
- *
- * EXT2_MKJOURNAL_V1_SUPER	Make a (deprecated) V1 journal superblock
  */
-#define EXT2_MKJOURNAL_V1_SUPER	0x0000001
+#define EXT2_MKJOURNAL_V1_SUPER	0x0000001 /* create V1 superblock (deprecated) */
+#define EXT2_MKJOURNAL_LAZYINIT	0x0000002 /* don't zero journal inode before use*/
 
 struct struct_ext2_filsys {
 	errcode_t			magic;
Index: e2fsprogs-1.41.9/lib/ext2fs/mkjournal.c
===================================================================
--- e2fsprogs-1.41.9.orig/lib/ext2fs/mkjournal.c
+++ e2fsprogs-1.41.9/lib/ext2fs/mkjournal.c
@@ -103,7 +103,7 @@ static errcode_t write_journal_file(ext2
 	/* Open the device or journal file */
 	if ((fd = open(filename, O_WRONLY)) < 0) {
 		retval = errno;
-		goto errout;
+		goto errfree;
 	}
 
 	/* Write the superblock out */
@@ -117,6 +117,9 @@ static errcode_t write_journal_file(ext2
 		goto errout;
 	memset(buf, 0, fs->blocksize);
 
+	if (flags & EXT2_MKJOURNAL_LAZYINIT)
+		goto success;
+
 	for (i = 1; i < size; i++) {
 		ret_size = write(fd, buf, fs->blocksize);
 		if (ret_size < 0) {
@@ -126,10 +129,12 @@ static errcode_t write_journal_file(ext2
 		if (ret_size != (int) fs->blocksize)
 			goto errout;
 	}
-	close(fd);
 
+success:
 	retval = 0;
 errout:
+	close(fd);
+errfree:
 	ext2fs_free_mem(&buf);
 	return retval;
 }
@@ -201,6 +206,7 @@ struct mkjournal_struct {
 	blk_t		goal;
 	blk_t		blk_to_zero;
 	int		zero_count;
+	int		flags;
 	char		*buf;
 	errcode_t	err;
 };
@@ -232,7 +238,7 @@ static int mkjournal_proc(ext2_filsys	fs
 	retval = 0;
 	if (blockcnt <= 0)
 		retval = io_channel_write_blk(fs->io, new_blk, 1, es->buf);
-	else {
+	else if (!(es->flags & EXT2_MKJOURNAL_LAZYINIT)) {
 		if (es->zero_count) {
 			if ((es->blk_to_zero + es->zero_count == new_blk) &&
 			    (es->zero_count < 1024))
@@ -296,6 +302,7 @@ static errcode_t write_journal_inode(ext
 	es.newblocks = 0;
 	es.buf = buf;
 	es.err = 0;
+	es.flags = flags;
 	es.zero_count = 0;
 
 	if (fs->super->s_feature_incompat & EXT3_FEATURE_INCOMPAT_EXTENTS) {
@@ -490,6 +497,13 @@ errcode_t ext2fs_add_journal_inode(ext2_
 		if ((fd = open(jfile, O_CREAT|O_WRONLY, 0600)) < 0)
 			return errno;
 
+		/* Note that we can't do lazy journal initialization for mounted
+		 * filesystems, since the zero writing is also allocating the
+		 * journal blocks.  We could use fallocate, but not all kernels
+		 * support that, and creating a journal on a mounted ext2
+		 * filesystems is extremely rare these days...  Skip it. */
+		flags &= ~EXT2_MKJOURNAL_LAZYINIT;
+
 		if ((retval = write_journal_file(fs, jfile, size, flags)))
 			goto errout;
 
Index: e2fsprogs-1.41.9/misc/mke2fs.8.in
===================================================================
--- e2fsprogs-1.41.9.orig/misc/mke2fs.8.in
+++ e2fsprogs-1.41.9/misc/mke2fs.8.in
@@ -232,7 +232,15 @@ This speeds up filesystem
 initialization noticeably, but it requires the kernel to finish
 initializing the filesystem in the background when the filesystem is
 first mounted.  If the option value is omitted, it defaults to 1 to
-enable lazy inode table initialization.
+enable lazy inode table zeroing.
+.TP
+.B lazy_journal_init\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled, the journal inode will not be fully zeroed out by
+.BR mke2fs .
+This speeds up filesystem initialization noticeably, but carries some
+small risk if the system crashes before the journal has been overwritten
+entirely one time.  If the option value is omitted, it defaults to 1 to
+enable lazy journal inode zeroing.
 .TP
 .B test_fs
 Set a flag in the filesystem superblock indicating that it may be
Index: e2fsprogs-1.41.9/misc/mke2fs.c
===================================================================
--- e2fsprogs-1.41.9.orig/misc/mke2fs.c
+++ e2fsprogs-1.41.9/misc/mke2fs.c
@@ -551,6 +551,10 @@ static void create_journal_dev(ext2_fils
 			_("while initializing journal superblock"));
 		exit(1);
 	}
+
+	if (journal_flags & EXT2_MKJOURNAL_LAZYINIT)
+		goto write_superblock;
+
 	if (quiet)
 		memset(&progress, 0, sizeof(progress));
 	else
@@ -578,6 +582,8 @@ static void create_journal_dev(ext2_fils
 	}
 	ext2fs_zero_blocks(0, 0, 0, 0, 0);
 
+	progress_close(&progress);
+write_superblock:
 	retval = io_channel_write_blk(fs->io,
 				      fs->super->s_first_data_block+1,
 				      1, buf);
@@ -586,7 +592,6 @@ static void create_journal_dev(ext2_fils
 			_("while writing journal superblock"));
 		exit(1);
 	}
-	progress_close(&progress);
 }
 
 static void show_stats(ext2_filsys fs)
@@ -795,6 +800,12 @@ static void parse_extended_opts(struct e
 			}
 		} else if (!strcmp(token, "test_fs")) {
 			param->s_flags |= EXT2_FLAGS_TEST_FILESYS;
+		} else if (!strcmp(token, "lazy_journal_init")) {
+			if (arg)
+				journal_flags |= strtoul(arg, &p, 0) ?
+						EXT2_MKJOURNAL_LAZYINIT : 0;
+			else
+				journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
 		} else if (!strcmp(token, "lazy_itable_init")) {
 			if (arg)
 				lazy_itable_init = strtoul(arg, &p, 0);
@@ -815,6 +826,7 @@ static void parse_extended_opts(struct e
 			"\tstripe-width=<RAID stride * data disks in blocks>\n"
 			"\tresize=<resize maximum size in blocks>\n"
 			"\tlazy_itable_init=<0 to disable, 1 to enable>\n"
+			"\tlazy_journal_init=<0 to disable, 1 to enable>\n"
 			"\ttest_fs\n\n"),
 			badopt ? badopt : "");
 		free(buf);
@@ -1637,6 +1649,9 @@ got_size:
 
 	lazy_itable_init = get_bool_from_profile(fs_types,
 						 "lazy_itable_init", 0);
+	journal_flags |= get_bool_from_profile(fs_types,
+					       "lazy_journal_init", 0) ?
+						EXT2_MKJOURNAL_LAZYINIT : 0;
 
 	/* Get options from profile */
 	for (cpp = fs_types; *cpp; cpp++) {

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mk2fs lazy_journal_init option
  2010-02-10 11:44 [PATCH] mk2fs lazy_journal_init option Andreas Dilger
@ 2010-02-16 22:15 ` Andreas Dilger
  2010-02-16 23:11   ` Eric Sandeen
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2010-02-16 22:15 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ext4 development, Shuichi Ihara

On 2010-02-10, at 04:44, Andreas Dilger wrote:
> Attached is a patch to skip zeroing of the journal if the
> "-E lazy_journal_init" option is given to mke2fs (named to
> complement the "-E lazy_itable_init" option).  This can
> speed up format time significantly for large journal devices.
> There's only a short-term risk of problems with uninitialized
> journal, until the journal has been overwritten once.
>
> Patch has been lightly tested, showing mke2fs times steady
> at 14s for a 40GB filesystem, regardless of journal size,
> while previously it took up to 45s for an internal 2GB journal.

While testing this patch more thoroughly, we uncovered a bug
in the mke2fs/libext2fs code.  It seems that when running:

mke2fs -J size=X -O extents /dev/XXX

for any size > 512 the journal creation time is growing
exponentially:

no journal-> 12s
size=128 ->  14s
size=256 ->  16s
size=512 ->  21s
size=768 -> 143s
size=1024-> 298s
size=1280-> 663s

We wanted originally to use size=4000, but this took so
long we thought it was hung, and started investigating.

This happens even without the "-E lazy_itable_init" option.
Running ltrace on mke2fs shows lots of zero writes (to be
expected for journal zeroing) followed by a single read
(completes quickly) and many thousands of memcpy() calls.
The mke2fs program is completely CPU bound (99.9% user).

Running with the "-E lazy_itable_init" the writes/reads go
away, and all that is left is an endless stream of memcpy().

It seems to loop in ext2fs_block_iterate2->mkjournal_proc()
forever:

426            for (blockcnt = extent.e_lblk, j = 0;
427                 j < extent.e_len;
428                 blk++, blockcnt++, j++) {
429                    new_blk = blk;
430                    r = (*ctx.func)(fs, &new_blk, blockcnt,
431                                    0, 0, priv_data);


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mk2fs lazy_journal_init option
  2010-02-16 22:15 ` Andreas Dilger
@ 2010-02-16 23:11   ` Eric Sandeen
  2010-02-16 23:21     ` Eric Sandeen
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Sandeen @ 2010-02-16 23:11 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Ts'o, ext4 development, Shuichi Ihara

Andreas Dilger wrote:
> On 2010-02-10, at 04:44, Andreas Dilger wrote:
>> Attached is a patch to skip zeroing of the journal if the
>> "-E lazy_journal_init" option is given to mke2fs (named to
>> complement the "-E lazy_itable_init" option).  This can
>> speed up format time significantly for large journal devices.
>> There's only a short-term risk of problems with uninitialized
>> journal, until the journal has been overwritten once.
>>
>> Patch has been lightly tested, showing mke2fs times steady
>> at 14s for a 40GB filesystem, regardless of journal size,
>> while previously it took up to 45s for an internal 2GB journal.
> 
> While testing this patch more thoroughly, we uncovered a bug
> in the mke2fs/libext2fs code.  It seems that when running:
> 
> mke2fs -J size=X -O extents /dev/XXX
> 
> for any size > 512 the journal creation time is growing
> exponentially:
> 
> no journal-> 12s
> size=128 ->  14s
> size=256 ->  16s
> size=512 ->  21s
> size=768 -> 143s
> size=1024-> 298s
> size=1280-> 663s
> 
> We wanted originally to use size=4000, but this took so
> long we thought it was hung, and started investigating.
> 
> This happens even without the "-E lazy_itable_init" option.
> Running ltrace on mke2fs shows lots of zero writes (to be
> expected for journal zeroing) followed by a single read
> (completes quickly) and many thousands of memcpy() calls.
> The mke2fs program is completely CPU bound (99.9% user).
> 
> Running with the "-E lazy_itable_init" the writes/reads go
> away, and all that is left is an endless stream of memcpy().
> 
> It seems to loop in ext2fs_block_iterate2->mkjournal_proc()
> forever:
> 
> 426            for (blockcnt = extent.e_lblk, j = 0;
> 427                 j < extent.e_len;
> 428                 blk++, blockcnt++, j++) {
> 429                    new_blk = blk;
> 430                    r = (*ctx.func)(fs, &new_blk, blockcnt,
> 431                                    0, 0, priv_data);
> 

Haven't dug all the way through it but I think this is another
in the saga of blk_t vs. blk64_t.  This seems to fix (?) it:

--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -218,7 +218,7 @@ struct mkjournal_struct {
 };

 static int mkjournal_proc(ext2_filsys  fs,
-                          blk_t        *blocknr,
+                          blk64_t      *blocknr,
                           e2_blkcnt_t  blockcnt,
                           blk_t        ref_block EXT2FS_ATTR((unused)),
                           int          ref_offset EXT2FS_ATTR((unused)),

though I doubt that is correct/complete.

-Eric

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mk2fs lazy_journal_init option
  2010-02-16 23:11   ` Eric Sandeen
@ 2010-02-16 23:21     ` Eric Sandeen
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Sandeen @ 2010-02-16 23:21 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Ts'o, ext4 development, Shuichi Ihara

Eric Sandeen wrote:

> Haven't dug all the way through it but I think this is another
> in the saga of blk_t vs. blk64_t.  This seems to fix (?) it:
> 
> --- a/lib/ext2fs/mkjournal.c
> +++ b/lib/ext2fs/mkjournal.c
> @@ -218,7 +218,7 @@ struct mkjournal_struct {
>  };
> 
>  static int mkjournal_proc(ext2_filsys  fs,
> -                          blk_t        *blocknr,
> +                          blk64_t      *blocknr,
>                            e2_blkcnt_t  blockcnt,
>                            blk_t        ref_block EXT2FS_ATTR((unused)),
>                            int          ref_offset EXT2FS_ATTR((unused)),
> 
> though I doubt that is correct/complete.

Humm my git tree is not what I thought it was, the above might be a
wild goose chase, sorry.

-Eric

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mk2fs lazy_journal_init option
@ 2010-02-19 19:53 number9652
  2010-02-24  5:30 ` Andreas Dilger
  0 siblings, 1 reply; 7+ messages in thread
From: number9652 @ 2010-02-19 19:53 UTC (permalink / raw)
  To: Andreas Dilger, Eric Sandeen
  Cc: Theodore Ts'o, ext4 development, Shuichi Ihara

I hope you decide to continue unconditionally zeroing the journal at mkfs time.

In this case, block_iterate2 was being used to create a file, but when the file became big enough to have an extent leaf, ext2fs_get_extent was returning an extent when it shouldn't have been.  That caused ext2fs_block_iterate2 to go wrong, eventually calling mkjournal_proc millions of times more than it needed to (during which it would immediately return).  A patch which resolves this is below.  It shows normal mkfs times for me with a 512 MB journal and passes make check.

Signed-off-by Nic Case <number9652@yahoo.com>

--- lib/ext2fs/extent-orig.c    2010-02-05 08:58:41.000000000 -0600
+++ lib/ext2fs/extent.c 2010-02-19 13:37:32.000000000 -0600
@@ -307,16 +307,12 @@
                                op = EXT2_EXTENT_DOWN;
                        } else if (path->left > 0)
                                op = EXT2_EXTENT_NEXT_SIB;
-                       else if (handle->level > 0)
-                               op = EXT2_EXTENT_UP;
                        else
                                return EXT2_ET_EXTENT_NO_NEXT;
                } else {
                        /* leaf node */
                        if (path->left > 0)
                                op = EXT2_EXTENT_NEXT_SIB;
-                       else if (handle->level > 0)
-                               op = EXT2_EXTENT_UP;
                        else
                                return EXT2_ET_EXTENT_NO_NEXT;
                }




      


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mk2fs lazy_journal_init option
  2010-02-19 19:53 number9652
@ 2010-02-24  5:30 ` Andreas Dilger
  2010-03-02 19:36   ` Nic Case
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2010-02-24  5:30 UTC (permalink / raw)
  To: number9652
  Cc: Eric Sandeen, Theodore Ts'o, ext4 development, Shuichi Ihara

On 2010-02-19, at 12:53, number9652 wrote:
> I hope you decide to continue unconditionally zeroing the journal at  
> mkfs time.

That will continue to be the default for the time being, unless we can  
be sure that there is no possibility of error.  That would be possible  
in conjunction with the journal checksum patch, if the checksum  
function is changed slightly to include the journal UUID.

> In this case, block_iterate2 was being used to create a file, but  
> when the file became big enough to have an extent leaf,  
> ext2fs_get_extent was returning an extent when it shouldn't have  
> been.  That caused ext2fs_block_iterate2 to go wrong, eventually  
> calling mkjournal_proc millions of times more than it needed to  
> (during which it would immediately return).  A patch which resolves  
> this is below.  It shows normal mkfs times for me with a 512 MB  
> journal and passes make check.
>
> Signed-off-by Nic Case <number9652@yahoo.com>

I haven't looked at this from a logic POV, but I tested this with a  
1GB journal and it ran the same time with/without specifying "-O  
extents" (10s vs. 650s without this patch).  I also ran our additional  
e2fsck extents tests without errors, and on a couple of my extent- 
based filesystems.

Tested-by: Andreas Dilger <adilger@sun.com>

> --- lib/ext2fs/extent-orig.c    2010-02-05 08:58:41.000000000 -0600
> +++ lib/ext2fs/extent.c 2010-02-19 13:37:32.000000000 -0600
> @@ -307,16 +307,12 @@
>                                op = EXT2_EXTENT_DOWN;
>                        } else if (path->left > 0)
>                                op = EXT2_EXTENT_NEXT_SIB;
> -                       else if (handle->level > 0)
> -                               op = EXT2_EXTENT_UP;
>                        else
>                                return EXT2_ET_EXTENT_NO_NEXT;
>                } else {
>                        /* leaf node */
>                        if (path->left > 0)
>                                op = EXT2_EXTENT_NEXT_SIB;
> -                       else if (handle->level > 0)
> -                               op = EXT2_EXTENT_UP;
>                        else
>                                return EXT2_ET_EXTENT_NO_NEXT;
>                }
>
>
>
>
>
>


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mk2fs lazy_journal_init option
  2010-02-24  5:30 ` Andreas Dilger
@ 2010-03-02 19:36   ` Nic Case
  0 siblings, 0 replies; 7+ messages in thread
From: Nic Case @ 2010-03-02 19:36 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Eric Sandeen, Theodore Ts'o, ext4 development, Shuichi Ihara

--- On Tue, 2/23/10, Andreas Dilger wrote:
> I haven't looked at this from a logic POV, but I tested
> this with a 1GB journal and it ran the same time
> with/without specifying "-O extents" (10s vs. 650s without
> this patch).  I also ran our additional e2fsck extents
> tests without errors, and on a couple of my extent-based
> filesystems.

Thanks for testing this patch, but after looking at it again,
it is clear it is not the right thing to do.  If there are at
least two levels of the extent tree with more than one entry,
iterating through with EXTENT_NEXT will return NO_EXTENT before
reaching the end of the file.  I sent another patch to the ext4
mailing list (with ext2fs_extent_get in the title) that I believe
is the right way to fix the problem.

Nic

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2010-03-02 19:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-02-10 11:44 [PATCH] mk2fs lazy_journal_init option Andreas Dilger
2010-02-16 22:15 ` Andreas Dilger
2010-02-16 23:11   ` Eric Sandeen
2010-02-16 23:21     ` Eric Sandeen
  -- strict thread matches above, loose matches on Subject: below --
2010-02-19 19:53 number9652
2010-02-24  5:30 ` Andreas Dilger
2010-03-02 19:36   ` Nic Case

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).