[RFC 0/2] ext4: zero uninitialized inode tables

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/2] ext4: zero uninitialized inode tables
@ 2008-11-21 10:23 Solofo.Ramangalahy
  2008-11-21 10:23 ` [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag Solofo.Ramangalahy
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Solofo.Ramangalahy @ 2008-11-21 10:23 UTC (permalink / raw)
  To: linux-ext4

The time to format a filesystem is mostly linear with filesystem size.

Exact time spent on formating depends on hardware and software, but
this is mainly explained by the zeroing of some blocks (inode, block
bitmaps and inodes tables).
While the mkfs time can be considered negligible (for example compared
to RAID formatting of disk arrays), it is significant compared
to the formating time of others filesystems.
This is noticeable when conducting performance comparison tests, or
testing involving multiple formatting of the same device.
This may become prohibitive for large disks (arrays).

For some measurements, see:
http://www.bullopensource.org/ext4/20080909-mkfs-speed-lazy_itable_init/
http://www.bullopensource.org/ext4/20080911-mkfs-speed-lazy_itable_init/
http://www.bullopensource.org/ext4/20080912-mkfs-speed-lazy_itable_init/
so far it is under one hour, further measurements would be needed,
like for 16TB filesystems.

It is possible to skip the initialization of the inode tables blocks
with the mkfs option "lazy_itable_init" (mkfs.ext4(8)).
However, this option is not safe with respect to fsck, as there is no
way to distinguish between an unitialized block filled with old bits
and a corrupted one.
(The use of lazy_itable_init could be considered safe in the case where
the blocks of the disk, in particular those used by the inode tables,
are prefilled with zeros.)

These patches (try to) initialize the inode tables after mount via a
kernel thread launched by module loading. The goal is to find a
tradeoff between speed and safety.

Apart from use in testing, another use case could be a distribution
installation: since device size rises faster than system size, the
percentage of the formating time during the installation will
increase. Since the system will use a fragment of the full device (say
10GB for system installation on a 1TB disk), it would not be strictly
necessary to initialize all the inode tables before starting the
installation, for example for the home partition.

So far, I've only been able to initialize some small filesystems with
this code (using 2.6.28-rc4).
For example, like this:

. dd if=/dev/zero of=/tmp/ext4fs.img bs=1M count=1024
. losetup /dev/loop0 /tmp/ext4fs.img
. mkfs.ext4 -O^resize_inode -Elazy_itable_init /dev/loop0
. mount /dev/loop0 /mnt/test-ext4
. [dumpe2fs /dev/loop0]
. modprobe ext4_itable_init
. [dumpe2fs /dev/loop0  # here check the ITABLE_ZEROED]
. umount /mnt/test-ext4
. [dumpe2fs /dev/loop0]
. [fsck /dev/loop0]

But I also hitted several bugs and managed to somehow screw up my
machine. So be _extremly_ careful if ever you try the code!

TODO:
. fix the resize inode case
. fix the observed soft lockup
. decide whether to keep it a module.
  If not, decide how/when run the kernel thread
. initialize some blocks (for example the non-empty ones) at mount
  time, or somewhere else.
. non-empty group case
. feature interactions? (for example inode zeroing vs. resize)
. multiple threads (based on cpu/disks)
. other ?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-21 10:23 [RFC 0/2] ext4: zero uninitialized inode tables Solofo.Ramangalahy
@ 2008-11-21 10:23 ` Solofo.Ramangalahy
  2008-11-24 23:25   ` Andreas Dilger
  2008-11-27  4:50   ` Theodore Tso
  2008-11-21 10:23 ` [RFC 2/2] ext4: module to initialize the inode table when using mkfs option lazy_itable_init Solofo.Ramangalahy
  2008-11-25  5:32 ` [RFC 0/2] ext4: zero uninitialized inode tables Theodore Tso
  2 siblings, 2 replies; 15+ messages in thread
From: Solofo.Ramangalahy @ 2008-11-21 10:23 UTC (permalink / raw)
  To: linux-ext4

[-- Attachment #1: SR-ext4-resize-mark-new-group-EXT_BG_INODE_ZEROED.patch --]
[-- Type: text/plain, Size: 896 bytes --]

The inode table has been zeroed in setup_new_group_blocks().
Mark it as such in ext4_group_add().

As a side note, online resize and inode zeroing are "dual".

In order to obtain a filesystem with faster formating times one can
do:
. either format a smaller fs and then resize it,
. or format the fs with lazy_itable_init

---
 fs/ext4/resize.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.28-rc4-itable_init/fs/ext4/resize.c
===================================================================
--- linux-2.6.28-rc4-itable_init.orig/fs/ext4/resize.c
+++ linux-2.6.28-rc4-itable_init/fs/ext4/resize.c
@@ -865,7 +865,7 @@ int ext4_group_add(struct super_block *s
 	gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
 	gdp->bg_free_inodes_count = cpu_to_le16(EXT4_INODES_PER_GROUP(sb));
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC 2/2] ext4: module to initialize the inode table when using mkfs option lazy_itable_init
  2008-11-21 10:23 [RFC 0/2] ext4: zero uninitialized inode tables Solofo.Ramangalahy
  2008-11-21 10:23 ` [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag Solofo.Ramangalahy
@ 2008-11-21 10:23 ` Solofo.Ramangalahy
  2008-11-25  5:32 ` [RFC 0/2] ext4: zero uninitialized inode tables Theodore Tso
  2 siblings, 0 replies; 15+ messages in thread
From: Solofo.Ramangalahy @ 2008-11-21 10:23 UTC (permalink / raw)
  To: linux-ext4

[-- Attachment #1: SR-ext4_init_table_module.patch --]
[-- Type: text/plain, Size: 11014 bytes --]

The idea is to reuse the code from resize since initializing a group
(hence the inode table) is also done during resize.

The code of the main function zero_itable_blocks() is extracted from
setup_new_group_blocks().

---
 fs/ext4/Makefile           |    2 
 fs/ext4/balloc.c           |    2 
 fs/ext4/ext4_itable_init.c |  190 +++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/ext4_jbd2.c        |    3 
 fs/ext4/resize.c           |    8 -
 fs/ext4/super.c            |    6 +
 6 files changed, 205 insertions(+), 6 deletions(-)

Index: linux-2.6.28-rc4-itable_init/fs/ext4/ext4_itable_init.c
===================================================================
--- /dev/null
+++ linux-2.6.28-rc4-itable_init/fs/ext4/ext4_itable_init.c
@@ -0,0 +1,190 @@
+/*
+ * ext4_itable_init: module to initialize the inode tables for
+ * filesystem formatted with the lazy_itable_init option.
+ */
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/nsproxy.h>
+#include <linux/mount.h>
+#include <linux/mnt_namespace.h>
+#include <linux/fs.h>
+#include <linux/kthread.h>
+#include "ext4_jbd2.h"
+#include "group.h"
+#include "ext4.h"
+
+#define MAX_ITABLE_INIT_THREADS  1
+struct task_struct *init_itable_thread[MAX_ITABLE_INIT_THREADS];
+static int threads_nb;
+
+/* Functions from resize.c */
+struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
+			   ext4_fsblk_t blk);
+int extend_or_restart_transaction(handle_t *handle, int thresh,
+				  struct buffer_head *bh);
+__le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
+			    struct ext4_group_desc *gdp);
+
+/* Zero out all of the inode table blocks */
+/* FIXME: refactor this function with resize.c::setup_new_group_blocks()? */
+/* checkpatch: ERROR: do not use assignment in if condition */
+static int ext4_zero_itable_blocks(ext4_group_t group, ext4_fsblk_t inode_table,
+			      struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_fsblk_t block;
+	ext4_grpblk_t bit;
+	int i;
+	int err = 0;
+	int err2 = 0;
+	ext4_fsblk_t start = ext4_group_first_block_no(sb, group);
+	handle_t *handle;
+	struct buffer_head *bh;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *gdp_bh;
+
+printk(KERN_INFO "SR: ext4_zero_itable_blocks(ext4_group_t group: %lu, ext4_fsblk_t inode_table: %llu, super_block: %p)\n", group, inode_table, sb);
+	handle = ext4_journal_start_sb(sb, EXT4_MAX_TRANS_DATA);
+	lock_super(sb);
+	if (IS_ERR(bh = bclean(handle, sb, inode_table))) {
+		err = PTR_ERR(bh);
+		goto exit_journal;
+	}
+	for (i = 0, block = inode_table+1, bit = block - start;
+	     i < sbi->s_itb_per_group; i++, bit++, block++) {
+		struct buffer_head *it;
+
+		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
+		if ((err = extend_or_restart_transaction(handle, 1, bh)))
+			goto exit_bh;
+		if (IS_ERR(it = bclean(handle, sb, block))) {
+			err = PTR_ERR(it);
+			goto exit_bh;
+		}
+		ext4_journal_dirty_metadata(handle, it);
+		brelse(it);
+		ext4_set_bit(bit, bh->b_data);
+	}
+	if ((err = extend_or_restart_transaction(handle, 2, bh)))
+		goto exit_bh;
+	ext4_journal_dirty_metadata(handle, bh);
+	/* Flag the group EXT4_BG_INODE_ZEROED */
+	gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+
+	spin_lock(sb_bgl_lock(sbi, group));
+	if ((err = ext4_journal_get_write_access(handle, gdp_bh)))
+		goto exit_journal;
+	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
+	spin_unlock(sb_bgl_lock(sbi, group));
+
+	err = ext4_journal_dirty_metadata(handle, gdp_bh);
+exit_bh:
+	brelse(bh);
+exit_journal:
+	unlock_super(sb);
+	if ((err2 = ext4_journal_stop(handle)) && !err)
+		err = err2;
+	return err;
+}
+
+/* Detect empty groups for which inode tables can be simply zeroed */
+/* use inode bitmap instead? */
+static bool has_no_inode(struct super_block *sb, struct ext4_sb_info *sbi,
+			 struct ext4_group_desc *gdp)
+{
+	bool result;
+	result = (EXT4_BLOCK_SIZE(sb) * sbi->s_itb_per_group ==
+		  gdp->bg_free_inodes_count * sbi->s_inode_size);
+	return result;
+}
+/* Function launched by the threads. */
+static int ext4_thread_init_itable(void *data)
+{
+	struct super_block *sb = data;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_group_t i;
+	int err = 0;
+	int groups_count = sbi->s_groups_count;
+
+	for (i = 0; i < groups_count; i++) {
+		struct ext4_group_desc *gdp =
+			ext4_get_group_desc(sb, i, NULL);
+		ext4_fsblk_t inode_table = ext4_inode_table(sb, gdp);
+		if (kthread_should_stop())
+			break;
+		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
+		    && has_no_inode(sb, sbi, gdp)) {
+			err = ext4_zero_itable_blocks(i, inode_table, sb);
+		} else {
+			/* TODO: non-empty groups  */
+			/* Do it elsewhere?:       */
+			/* . mount-time?           */
+			/* . group (first) access? */
+		}
+	}
+	return err;
+}
+/* Get the ext4 file_system_type to walk its instances. */
+static struct file_system_type *ext4_fs_type(void)
+{
+	struct vfsmount *mnt;
+	struct list_head list;
+	int i = 0;
+
+	list =  current->nsproxy->mnt_ns->root->mnt_list;
+	list_for_each_entry(mnt, &list, mnt_list) {
+		if (mnt && mnt->mnt_sb && mnt->mnt_sb->s_type) {
+			struct file_system_type *t = mnt->mnt_sb->s_type;
+
+			if (!strcmp(t->name, "ext4"))
+				return t;
+		} else {
+			/* FIXME: NULL: which valid case is this? */
+		}
+		i++;
+		if (i > 15) /* FIXME: which case does it loop? */
+			break;
+	}
+	return NULL;
+}
+
+/*
+ * itable_init_init  the init function, called when the module is loaded.
+ * Returns zero if successfully loaded, nonzero otherwise.
+ */
+static int itable_init_init(void)
+{
+	struct file_system_type *t = ext4_fs_type();
+	struct super_block *sb;
+	struct list_head list;
+
+	printk(KERN_INFO "EXT4: starting inode tables initialization.\n");
+
+	threads_nb = 0;
+	if (t != NULL) {
+		list = t->fs_supers;
+		list_for_each_entry(sb, &list, s_instances) {
+printk(KERN_INFO "itable_init_init() sb: %p name: %s.\n", sb, sb->s_type->name);
+			init_itable_thread[threads_nb] =
+				kthread_run(ext4_thread_init_itable,
+					    sb, "ext4_thread_init");
+			threads_nb++;
+			if (threads_nb >= MAX_ITABLE_INIT_THREADS)
+				break;
+		}
+	}
+	return 0;
+}
+/*
+ * itable_init_exit the exit function, called when the module is removed.
+ */
+static void itable_init_exit(void)
+{
+//	kthread_stop(init_itable_thread); //FIXME
+	printk(KERN_INFO "EXT4: end of inode table initialization.\n");
+}
+module_init(itable_init_init);
+module_exit(itable_init_exit);
+MODULE_LICENSE("GPL");
Index: linux-2.6.28-rc4-itable_init/fs/ext4/Makefile
===================================================================
--- linux-2.6.28-rc4-itable_init.orig/fs/ext4/Makefile
+++ linux-2.6.28-rc4-itable_init/fs/ext4/Makefile
@@ -2,6 +2,8 @@
 # Makefile for the linux ext4-filesystem routines.
 #
 
+obj-m += ext4_itable_init.o
+
 obj-$(CONFIG_EXT4_FS) += ext4.o
 
 ext4-y	:= balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
Index: linux-2.6.28-rc4-itable_init/fs/ext4/balloc.c
===================================================================
--- linux-2.6.28-rc4-itable_init.orig/fs/ext4/balloc.c
+++ linux-2.6.28-rc4-itable_init/fs/ext4/balloc.c
@@ -21,6 +21,7 @@
 #include "ext4_jbd2.h"
 #include "group.h"
 
+
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -237,6 +238,7 @@ struct ext4_group_desc * ext4_get_group_
 		*bh = sbi->s_group_desc[group_desc];
 	return desc;
 }
+EXPORT_SYMBOL_GPL(ext4_get_group_desc);
 
 static int ext4_valid_block_bitmap(struct super_block *sb,
 					struct ext4_group_desc *desc,
Index: linux-2.6.28-rc4-itable_init/fs/ext4/super.c
===================================================================
--- linux-2.6.28-rc4-itable_init.orig/fs/ext4/super.c
+++ linux-2.6.28-rc4-itable_init/fs/ext4/super.c
@@ -47,6 +47,7 @@
 #include "namei.h"
 #include "group.h"
 
+
 struct proc_dir_entry *ext4_proc_root;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
@@ -92,6 +93,7 @@ ext4_fsblk_t ext4_inode_table(struct sup
 		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
 		(ext4_fsblk_t)le32_to_cpu(bg->bg_inode_table_hi) << 32 : 0);
 }
+EXPORT_SYMBOL_GPL(ext4_inode_table);
 
 void ext4_block_bitmap_set(struct super_block *sb,
 			   struct ext4_group_desc *bg, ext4_fsblk_t blk)
@@ -144,7 +146,7 @@ handle_t *ext4_journal_start_sb(struct s
 
 	return jbd2_journal_start(journal, nblocks);
 }
-
+EXPORT_SYMBOL_GPL(ext4_journal_start_sb);
 /*
  * The only special thing we need to do here is to make sure that all
  * jbd2_journal_stop calls result in the superblock being marked dirty, so
@@ -167,6 +169,7 @@ int __ext4_journal_stop(const char *wher
 		__ext4_std_error(sb, where, err);
 	return err;
 }
+EXPORT_SYMBOL_GPL(__ext4_journal_stop);
 
 void ext4_journal_abort_handle(const char *caller, const char *err_fn,
 		struct buffer_head *bh, handle_t *handle, int err)
@@ -1511,6 +1514,7 @@ __le16 ext4_group_desc_csum(struct ext4_
 
 	return cpu_to_le16(crc);
 }
+EXPORT_SYMBOL_GPL(ext4_group_desc_csum);
 
 int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
 				struct ext4_group_desc *gdp)
Index: linux-2.6.28-rc4-itable_init/fs/ext4/ext4_jbd2.c
===================================================================
--- linux-2.6.28-rc4-itable_init.orig/fs/ext4/ext4_jbd2.c
+++ linux-2.6.28-rc4-itable_init/fs/ext4/ext4_jbd2.c
@@ -21,7 +21,7 @@ int __ext4_journal_get_write_access(cons
 		ext4_journal_abort_handle(where, __func__, bh, handle, err);
 	return err;
 }
-
+EXPORT_SYMBOL_GPL(__ext4_journal_get_write_access);
 int __ext4_journal_forget(const char *where, handle_t *handle,
 				struct buffer_head *bh)
 {
@@ -57,3 +57,4 @@ int __ext4_journal_dirty_metadata(const 
 		ext4_journal_abort_handle(where, __func__, bh, handle, err);
 	return err;
 }
+EXPORT_SYMBOL_GPL(__ext4_journal_dirty_metadata);
Index: linux-2.6.28-rc4-itable_init/fs/ext4/resize.c
===================================================================
--- linux-2.6.28-rc4-itable_init.orig/fs/ext4/resize.c
+++ linux-2.6.28-rc4-itable_init/fs/ext4/resize.c
@@ -117,7 +117,7 @@ static int verify_group_input(struct sup
 	return err;
 }
 
-static struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
+struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
 				  ext4_fsblk_t blk)
 {
 	struct buffer_head *bh;
@@ -138,13 +138,13 @@ static struct buffer_head *bclean(handle
 
 	return bh;
 }
-
+EXPORT_SYMBOL_GPL(bclean);
 /*
  * If we have fewer than thresh credits, extend by EXT4_MAX_TRANS_DATA.
  * If that fails, restart the transaction & regain write access for the
  * buffer head which is used for block_bitmap modifications.
  */
-static int extend_or_restart_transaction(handle_t *handle, int thresh,
+int extend_or_restart_transaction(handle_t *handle, int thresh,
 					 struct buffer_head *bh)
 {
 	int err;
@@ -164,7 +164,7 @@ static int extend_or_restart_transaction
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-21 10:23 ` [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag Solofo.Ramangalahy
@ 2008-11-24 23:25   ` Andreas Dilger
  2008-11-25 11:27     ` Solofo.Ramangalahy
  2008-11-27  4:50   ` Theodore Tso
  1 sibling, 1 reply; 15+ messages in thread
From: Andreas Dilger @ 2008-11-24 23:25 UTC (permalink / raw)
  To: Solofo.Ramangalahy; +Cc: linux-ext4

On Nov 21, 2008  11:23 +0100, Solofo.Ramangalahy@bull.net wrote:
> The inode table has been zeroed in setup_new_group_blocks().
> Mark it as such in ext4_group_add().
> 
> As a side note, online resize and inode zeroing are "dual".
> 
> In order to obtain a filesystem with faster formating times one can
> do:
> . either format a smaller fs and then resize it,
> . or format the fs with lazy_itable_init

As discussed on the concall, it probably makes more sense to have the
resize code work by marking the inode tables UNINIT (if GDT_CSUM feature
is enabled) and then start the "itable zeroing" thread, if it isn't
already running, to do the zeroing of the itable.

If GDT_CSUM isn't set then the below fix is the right solution.

> Index: linux-2.6.28-rc4-itable_init/fs/ext4/resize.c
> ===================================================================
> --- linux-2.6.28-rc4-itable_init.orig/fs/ext4/resize.c
> +++ linux-2.6.28-rc4-itable_init/fs/ext4/resize.c
> @@ -865,7 +865,7 @@ int ext4_group_add(struct super_block *s
>  	gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
>  	gdp->bg_free_inodes_count = cpu_to_le16(EXT4_INODES_PER_GROUP(sb));
>  	gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);
> -
> +	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
>  	/*
>  	 * We can allocate memory for mb_alloc based on the new group
>  	 * descriptor

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 0/2] ext4: zero uninitialized inode tables
  2008-11-21 10:23 [RFC 0/2] ext4: zero uninitialized inode tables Solofo.Ramangalahy
  2008-11-21 10:23 ` [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag Solofo.Ramangalahy
  2008-11-21 10:23 ` [RFC 2/2] ext4: module to initialize the inode table when using mkfs option lazy_itable_init Solofo.Ramangalahy
@ 2008-11-25  5:32 ` Theodore Tso
  2008-11-25  8:35   ` Andreas Dilger
  2008-11-25 12:28   ` Solofo.Ramangalahy
  2 siblings, 2 replies; 15+ messages in thread
From: Theodore Tso @ 2008-11-25  5:32 UTC (permalink / raw)
  To: Solofo.Ramangalahy; +Cc: linux-ext4

On Fri, Nov 21, 2008 at 11:23:09AM +0100, Solofo.Ramangalahy@bull.net wrote:
> . decide whether to keep it a module.
>   If not, decide how/when run the kernel thread
> . multiple threads (based on cpu/disks)

I would *not* do it as a module.  It's more than a little
aesthetically unclean that this has to be a module --- there are
people who by choice decide not to use modules, for example.  If
you're clever, doing it as a module allows you to shorten your
compile-edit-debug cycle, I suppose, so maybe it's a justification for
doing it that way, but if that's the main reason, I'd choose using
user mode linux or KVM as my main development vehicle to speed up the
development cycle....

Instead, what I would do is to have the mount system call, if the
filesystem is mounted read/write, and there are uninitialized block
groups, to create a kernel thread responsible for initializing the
filesystem in question.  Once it is complete, it can exit.

> . initialize some blocks (for example the non-empty ones) at mount
>   time, or somewhere else.
> . non-empty group case

I'm not sure why you are treating the non-empty group case any
different from the empty-group case.  The only difference is where you
start zeroing the inode table.  In both cases you do need to worry
about locking issues --- what happens if the filesystem tries to
allocate a new inode just as you are starting to zero the filesystem?

In your current patch, you are checking to see if the block group has
no inodes in ext4_thread_init_itable(), by calling has_no_inode(), but
there is no locking between when you check to see that a particular
part of the inode table is not in use, and when you call
ext4_zero_itable_blocks().  If an inode does get allocated, either
between the call to has_no_inode, and ext4_zero_itable_blocks(), or
while ext4_zero_itable_blocks() is running, the inode could get
zero'ed out, causing data loss.  So locking is critical.

My suggestion for how to do locking is is to add a field in
ext4_group_info, a data structure which was defined in mballoc.h, but
is going to be moved to ext4.h as part of a patch that is currently in
the ext4 patch queue.  This field would be protected by bg_sgl_lock()
(like the other fields in the bg descriptor), and would be called
inode_bg_unavail.  If non-zero, and the relative inode number (i.e.,
the inode number modulo the number of inodes per blockgroup) selected
by the inode allocator is greater than or equal to inode_bg_unavail,
the inode allocator should try to find another block group to allocate
the inode.

Now what the inode table initialization thread can do is for each
block group where EXT4_BG_INODE_ZERO is not set, it should first set
inode_bg_unavail to bg_itable_unused.  This will prevent the inode
allocator allocating any new inodes in that block group.  Since we are
going to zero out inode table blocks, being paranoid is a good thing;
we should check to make sure the bg_checksum is valid, and that inode
bitmap does not show any blocks past bg_unavail as being in use.  If
there are, the filesystem must have gotten corrupted and we should
call ext4_error() in that case.

We we start zero'ing the inode table, I would recommend doing so
without using the journal, and doing direct block i/o from the zero
page.  The ext4_ext_zeroout function does most of what you need,
although I'd generalize it so it doesn't take inode and and
ext4_extent structure, but rather a struct sb, starting physical block
number, and number of blocks.  This function would wait for the I/O to
complete, and once it completes you know the blocks are on disk and
you can make changes to the filesystem metadata that would be
journaled.  I would recommend doing the first 32k of the inode table
first, and once it completes, you can update inode_bg_unavaile so that
an additional (32k / EXT4_INODE_SIZE(sb)) inodes are available.  This
allows the inode allocator to allocate inodes in the block group while
the itable initializer is initializing the rest of the inode table in
that block group.  Once the entire block group's inode table has been
initialized, the itable initializer can then set the BG_ITABLE_ZERO
flag (and this would be a journaled update), and then move on to the
next block group.

Does that make sense?

In terms of how quickly the itable initializer should work, in between
each block group, as we discussed on the call, the simplest thing for
it do is to wait for some time period to go by (say, 5 seconds) before
working on the next block group.  The next, slightly more complicated
scheme would be to set a "last ext4 operation time" field in
EXT4_SB(sb) which is set any time the ext4 code paths are entered
(basically, any function in ext4's inode operations, super operations
or file operations).  The itable initalizer would sample that time,
and before starting to initialize the next block group where
BG_ITABLE_ZERO is not set, it would check the last ext4 operation time
field, and if there had been an ext4 operation in the last 5 seconds,
it would sleep 5 seconds and check again.  This would prevent the
itable initializer from running if the filesystem is in use, although
it will not detect the case where there is a lot of mmap'ed I/O going
on, but no other ext4 operations.

In the long run, we would really want some kind of I/O activity
indication from the block device elevator, but that would require
changes to the core kernel, and the last ext4 operation time is almost
just as good.

> . fix the resize inode case

Not sure what problem you were having here?

							- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 0/2] ext4: zero uninitialized inode tables
  2008-11-25  5:32 ` [RFC 0/2] ext4: zero uninitialized inode tables Theodore Tso
@ 2008-11-25  8:35   ` Andreas Dilger
  2008-11-25 12:28   ` Solofo.Ramangalahy
  1 sibling, 0 replies; 15+ messages in thread
From: Andreas Dilger @ 2008-11-25  8:35 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Solofo.Ramangalahy, linux-ext4

On Nov 25, 2008  00:32 -0500, Theodore Ts'o wrote:
> I would recommend doing the first 32k of the inode table
> first, and once it completes, you can update inode_bg_unavaile so that
> an additional (32k / EXT4_INODE_SIZE(sb)) inodes are available.

I agree with everything Ted says, though I would zero the itable in
chunks of 64kB or even 128kB.  Two reasons are because 64kB is the
maximum blocksize for the filesystem, and it doesn't make sense to zero
less than a whole block at once.  Secondly, 64kB is more likely to
match with the internal track size of spinning disks, and 128kB is more
likely to match the erase block size of SSDs.

> In terms of how quickly the itable initializer should work, in between
> each block group, as we discussed on the call, the simplest thing for
> it do is to wait for some time period to go by (say, 5 seconds) before
> working on the next block group.  The next, slightly more complicated
> scheme would be to set a "last ext4 operation time" field in
> EXT4_SB(sb) which is set any time the ext4 code paths are entered

That would be "s_wtime" already in the on-disk superblock.  It wouldn't
kill us to update this occasionally in ext4, though not on disk all
the time.

> (basically, any function in ext4's inode operations, super operations
> or file operations).  The itable initalizer would sample that time,
> and before starting to initialize the next block group where
> BG_ITABLE_ZERO is not set, it would check the last ext4 operation time
> field, and if there had been an ext4 operation in the last 5 seconds,
> it would sleep 5 seconds and check again.

Well, I'd say if it has slept 5s then it should submit a block regardless
of whether the filesystem was in use or not.  Otherwise the itable may
never be zeroed out if the filesystem is always in use.  Adding a rare
64kB write to disk is unlikely to hurt anything, and if people REALLY care
about it they can avoid formatting with "lazy_itable_init".

> This would prevent the itable initializer from running if the filesystem
> is in use, although it will not detect the case where there is a lot
> of mmap'ed I/O going on, but no other ext4 operations.

Wouldn't even mmap operations cause some ext4 methods to be called?

> In the long run, we would really want some kind of I/O activity
> indication from the block device elevator, but that would require
> changes to the core kernel, and the last ext4 operation time is almost
> just as good.

Alternately we could check the journal tid?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-24 23:25   ` Andreas Dilger
@ 2008-11-25 11:27     ` Solofo.Ramangalahy
  2008-11-25 21:18       ` Andreas Dilger
  0 siblings, 1 reply; 15+ messages in thread
From: Solofo.Ramangalahy @ 2008-11-25 11:27 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Solofo.Ramangalahy, linux-ext4

Andreas Dilger writes:
 > > As a side note, online resize and inode zeroing are "dual".
 > > 
 > > In order to obtain a filesystem with faster formating times one can
 > > do:
 > > . either format a smaller fs and then resize it,
 > > . or format the fs with lazy_itable_init
 > 
 > As discussed on the concall, it probably makes more sense to have the
 > resize code work by marking the inode tables UNINIT (if GDT_CSUM feature
 > is enabled)

If I understand correctly, this is already the case:
#define EXT4_BG_INODE_UNINIT    0x0001 /* Inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT    0x0002 /* Block bitmap not in use */
#define EXT4_BG_INODE_ZEROED    0x0004 /* On-disk itable initialized to zero */

As the EXT4_BG_INODE_ZEROED is not present, the inode table is UNINIT.

By the way, is there any reason the #defines are like this, instead of:
#define EXT4_BG_INODE_UNINIT    0x0001 /* Inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT    0x0002 /* Block bitmap not in use */
#define EXT4_BG_ITABLE_UNINIT   0x0004 /* On-disk itable not initialized */
?

 > and then start the "itable zeroing" thread, if it isn't
 > already running, to do the zeroing of the itable.

Yes.

Is there other resize changes you could think of?
While working on this, I noted this checkpatch error 
"ERROR: do not use assignment in if condition"
(but I am not sure of the exact justification).

Thanks,
-- 
solofo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 0/2] ext4: zero uninitialized inode tables
  2008-11-25  5:32 ` [RFC 0/2] ext4: zero uninitialized inode tables Theodore Tso
  2008-11-25  8:35   ` Andreas Dilger
@ 2008-11-25 12:28   ` Solofo.Ramangalahy
  2008-11-25 18:52     ` Theodore Tso
  2008-11-25 21:10     ` Andreas Dilger
  1 sibling, 2 replies; 15+ messages in thread
From: Solofo.Ramangalahy @ 2008-11-25 12:28 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Solofo.Ramangalahy, linux-ext4

Theodore Tso writes:
 > Does that make sense?

Yes, thanks for the guidance!

 > doing it as a module allows you to shorten your
 > compile-edit-debug cycle, I suppose, so maybe it's a justification for
 > doing it that way,

Yes it is.

 > but if that's the main reason, I'd choose using
 > user mode linux or KVM as my main development vehicle to speed up the
 > development cycle....

I did both kvm and module :-)
Other reasons were:
. this resize comment:
  * This could probably be made into a module, because it is not often in use.
  and this sentence from the OLS'02 paper
  "Since the online resizing code is only used very rarely, it would
  be possible to put the bulk of this code into a separate module that
  is only loaded when a resize operation is done."
  Inode zeroing is done only once in a filesystem lifetime (and each
  time it is resized).
. the fact that did not have a clear idea of when to fire the thread.
. some consideration for memory usage (you mentionned NAS boxes in
  another thread).

 > I'm not sure why you are treating the non-empty group case any
 > different from the empty-group case.

Because of simplicity. I wanted to have "something" to start the
discussion.

I was also thinking that there may be other places to do it. For
example, zeroing the inode table where the inode bitmap is initialized
(ext4_init_inode_bitmap() called only once in
ext4_read_inode_bitmap()).

The reasoning would have been to zero as soon as it is known to be
needed:
. without deferring it to the threads,
. decreasing the probability of zeroing competing with other code
. decreasing the "window of vulnerability" (the time between formating
  and end of zeroing where it is known that fsck is not safe).

I don't know if it would have been sufficient to guarantee that all
the groups are eventually itable zeroed.

 > > . fix the resize inode case
 > 
 > Not sure what problem you were having here?

With resize inode, the obtained filesystem is corrupted, fsck says
"Resize inode not valid.  Recreate?"
as well as: 
"Free blocks count wrong for group #0 (6788, counted=6789)."

Appart from the data structures change you mentionned, these changes
were discussed:
. a mount option to disable the threads when doing testing/performance
  benchmarking
. a flag in s_flags field of struct ext4_super_block to indicate that
  the zeroing has been done on all the groups. Possibly reset with
  resize.
Do they sound reasonable?

-- 
solofo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 0/2] ext4: zero uninitialized inode tables
  2008-11-25 12:28   ` Solofo.Ramangalahy
@ 2008-11-25 18:52     ` Theodore Tso
  2008-11-25 21:10     ` Andreas Dilger
  1 sibling, 0 replies; 15+ messages in thread
From: Theodore Tso @ 2008-11-25 18:52 UTC (permalink / raw)
  To: Solofo.Ramangalahy; +Cc: linux-ext4

On Tue, Nov 25, 2008 at 01:28:47PM +0100, Solofo.Ramangalahy@bull.net wrote:
>   * This could probably be made into a module, because it is not often in use.
>   and this sentence from the OLS'02 paper
>   "Since the online resizing code is only used very rarely, it would
>   be possible to put the bulk of this code into a separate module that
>   is only loaded when a resize operation is done."
>   Inode zeroing is done only once in a filesystem lifetime (and each
>   time it is resized).

Sure, but (a) zeroing the inode table should not be much code;
especially since we need to zero contiguous block ranges in extents.c
to deal with uninitialized extents.  Also (b) modules have their own
cost; they waste on average PAGE_SIZE/2 worth of memory per module due
to internal fragmentation, and cause extra entries in the TLB cache
(on an x86, the entire kernel uses a single TLB entry, but each 4k
page used by a module burns a separate TLB entry), and (c) loading
modules is slow and serializes the kernel at boot time.  In addition,
(d) some users simply prohibit modules for policy reasons.

So I could imagine adding module support, but it has to work built
into the kernel is critical, and this will probably the primary way it
will be compiled.  And you need to make sure the kernel tries to
automatically load the module when a resize or a new filesystem is
mounted, if it is compiled as a module.

> I was also thinking that there may be other places to do it. For
> example, zeroing the inode table where the inode bitmap is initialized
> (ext4_init_inode_bitmap() called only once in
> ext4_read_inode_bitmap()).

When we first allocate an inode in the inode table and when we need to
zero it is largely unrelated.  The problem is that e2fsck doesn't want
to trust the inode bitmap as being accurate; nor can it necessarily
trust the bg_inodes_unused field --- note particularly that this is
not updated in the backup group descriptors.  Hence the window of
vulnerability has nothing to do with whether or not we have started
using a particular part of the inode table, but because the inode
table has not been initialized.

So we want to get the inode table fully initialized as soon as
possible, although we have to balance this with not impacting the
system's performance.

> The reasoning would have been to zero as soon as it is known to be
> needed:
> . without deferring it to the threads,
> . decreasing the probability of zeroing competing with other code

A block group's inode table can be 2-4 megabytes; and zeroing out that
many disk blocks can take a noticeable amount of time, so doing it
synchronously with an inode creation doesn't seem like a great idea to
me.....

>  > > . fix the resize inode case
>  > 
>  > Not sure what problem you were having here?
> 
> With resize inode, the obtained filesystem is corrupted, fsck says
> "Resize inode not valid.  Recreate?"
> as well as: 
> "Free blocks count wrong for group #0 (6788, counted=6789)."

Something really bogus must be happening; the resize inode is in block
group 0, which always has some inodes (and therefore would never be
touched by your patch, which only initializes completely empty inode
tables).  So if it managed to corrupt the resize inode, that's
especially worrisome.

> Appart from the data structures change you mentionned, these changes
> were discussed:
> . a mount option to disable the threads when doing testing/performance
>   benchmarking

Yes, this makes sense.  The administrator can always remount the filesystem
with a mount option to re-enable itable initialization if the
sysadmin wants to zero the inode table later on.

> . a flag in s_flags field of struct ext4_super_block to indicate that
>   the zeroing has been done on all the groups. Possibly reset with
>   resize.

This doesn't make that much sense to me.  It's not that hard to
iterate through all of the block groups descriptors checking for the
EXT4_BG_INODE_ZEROED flag.

							- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 0/2] ext4: zero uninitialized inode tables
  2008-11-25 12:28   ` Solofo.Ramangalahy
  2008-11-25 18:52     ` Theodore Tso
@ 2008-11-25 21:10     ` Andreas Dilger
  1 sibling, 0 replies; 15+ messages in thread
From: Andreas Dilger @ 2008-11-25 21:10 UTC (permalink / raw)
  To: Solofo.Ramangalahy; +Cc: Theodore Tso, linux-ext4

On Nov 25, 2008  13:28 +0100, Solofo.Ramangalahy@bull.net wrote:
> Appart from the data structures change you mentionned, these changes
> were discussed:
> . a mount option to disable the threads when doing testing/performance
>   benchmarking

Sure.

> . a flag in s_flags field of struct ext4_super_block to indicate that
>   the zeroing has been done on all the groups. Possibly reset with
>   resize.

I was thinking that it makes sense to have this same thread do checking
of all the block group metadata as it traverses the filesystem.  That
includes validating the GDT checksums, checking the existing block and
inode bitmaps (and possibly checksums for them, when that is implemented),
along with zeroing the inode table.

The only requirement is that there only be a single such thread running
on the filesystem at one time, and that if the filesystem is unmounted
that the thread be killed before the unmount is completed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-25 11:27     ` Solofo.Ramangalahy
@ 2008-11-25 21:18       ` Andreas Dilger
  0 siblings, 0 replies; 15+ messages in thread
From: Andreas Dilger @ 2008-11-25 21:18 UTC (permalink / raw)
  To: Solofo.Ramangalahy; +Cc: linux-ext4

On Nov 25, 2008  12:27 +0100, Solofo.Ramangalahy@bull.net wrote:
> Andreas Dilger writes:
>  > As discussed on the concall, it probably makes more sense to have the
>  > resize code work by marking the inode tables UNINIT (if GDT_CSUM feature
>  > is enabled)
> 
> If I understand correctly, this is already the case:
> #define EXT4_BG_INODE_UNINIT    0x0001 /* Inode table/bitmap not in use */
> #define EXT4_BG_BLOCK_UNINIT    0x0002 /* Block bitmap not in use */
> #define EXT4_BG_INODE_ZEROED    0x0004 /* On-disk itable initialized to zero */

> As the EXT4_BG_INODE_ZEROED is not present, the inode table is UNINIT.

Ah, I suppose you are correct.

> By the way, is there any reason the #defines are like this, instead of:
> #define EXT4_BG_INODE_UNINIT    0x0001 /* Inode table/bitmap not in use */
> #define EXT4_BG_BLOCK_UNINIT    0x0002 /* Block bitmap not in use */
> #define EXT4_BG_ITABLE_UNINIT   0x0004 /* On-disk itable not initialized */
> ?

No particular reason, that is just how the implementation was done.  In
hindsight that probably would make more sense...

> While working on this, I noted this checkpatch error 
> "ERROR: do not use assignment in if condition"
> (but I am not sure of the exact justification).

I'm not sure what you are asking.  The reason not to use "assignment in if"
is because of possible coding error like:

	if (x = 6) {
		/* do something if x is 6 */
	}

when coder actually meant to write:

	if (x == 6) {
		/* do something if x is 6 */
	}

The first one will now generate a warning in GCC unless written as:

	if ((x = 6)) {
		/* do something if x is 6 */
	}

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-21 10:23 ` [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag Solofo.Ramangalahy
  2008-11-24 23:25   ` Andreas Dilger
@ 2008-11-27  4:50   ` Theodore Tso
  2008-11-27  9:30     ` Solofo.Ramangalahy
  1 sibling, 1 reply; 15+ messages in thread
From: Theodore Tso @ 2008-11-27  4:50 UTC (permalink / raw)
  To: Solofo.Ramangalahy; +Cc: linux-ext4

On Fri, Nov 21, 2008 at 11:23:10AM +0100, Solofo.Ramangalahy@bull.net wrote:
> The inode table has been zeroed in setup_new_group_blocks().
> Mark it as such in ext4_group_add().

This patch makes sense to apply right away.  However:

1) You didn't include a Developer's Certification of Origin (i.e., a
"Signed-off-by" header).  Since this is a one line patch, and it seems
pretty clear your intention is to submit this to Linus, I added one on
your behalf so you can see how it should be done.  However, in general
you should never add a signoff for someone else, so I need an explicit
OK from you that you agree with the terms of the Developer's
Certification of Origin as found in the Linux kernel source code,
Documentation/SubmittingPatches before I can push this patch to Linus.
This is very important legally.

2) You need to set the flag *before* you calculate the block group
checksum, not afterwards.

So the corrected patch should look like this....

							- Ted

ext4: When resizing set the EXT4_BG_INODE_ZEROED flag for new block groups

From: Solofo.Ramangalahy@bull.net

The inode table has been zeroed in setup_new_group_blocks().  Mark it as
such in ext4_group_add().  Since we are currently clearing inode table
for the new block group, we should set the EXT4_BG_INODE_ZEROED flag.
If at some point in the future we don't immediately zero out the inode
table as part of the resize operation, then obviously we shouldn't do
this.

Signed-off-by: Solofo.Ramangalahy@bull.net
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index b6ec184..d448eb1 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -864,6 +864,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 	ext4_inode_table_set(sb, gdp, input->inode_table); /* LV FIXME */
 	gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
 	gdp->bg_free_inodes_count = cpu_to_le16(EXT4_INODES_PER_GROUP(sb));
+	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);

 	/*

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-27  4:50   ` Theodore Tso
@ 2008-11-27  9:30     ` Solofo.Ramangalahy
  2008-11-27 22:35       ` Theodore Tso
  0 siblings, 1 reply; 15+ messages in thread
From: Solofo.Ramangalahy @ 2008-11-27  9:30 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Solofo.Ramangalahy, linux-ext4

Hi Ted,

Theodore Tso writes:
 > 2) You need to set the flag *before* you calculate the block group
 > checksum, not afterwards.

Sorry about this. I forgot to do the quilt refresh (and check that the
code I submit is the same that the code I run).

 > 1) You didn't include a Developer's Certification of Origin (i.e., a
 > "Signed-off-by" header).  Since this is a one line patch, and it seems
 > pretty clear your intention is to submit this to Linus,

This was really an RFC, as you also pointed out.
Regarding this patch, the discussion raised the question of whether
EXT4_BG_INODE_UNINIT or EXT4_BG_ITABLE_UNINIT would be more coherent
than EXT4_BG_INODE_ZEROED wrt. EXT4_BG_INODE_UNINIT and
EXT4_BG_BLOCK_UNINIT.
This was also used as an example for the discussion about doing the
initialization outside of an init thread (which turned up not to be a
good idea).
This is also the first use of EXT4_BG_INODE_ZEROED in the kernel, so
an occasion to revisit the name. 
I did not look carefully in the progs (EXT2_BG_INODE_ZEROED) to see if
it is desirable and easy to change it. cscope indicates that it may be
easy (4 instances).

 > So the corrected patch should look like this....

Thank you, that's settled then,
-- 
solofo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-27  9:30     ` Solofo.Ramangalahy
@ 2008-11-27 22:35       ` Theodore Tso
  2008-11-27 23:09         ` Andreas Dilger
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Tso @ 2008-11-27 22:35 UTC (permalink / raw)
  To: Solofo.Ramangalahy; +Cc: linux-ext4

On Thu, Nov 27, 2008 at 10:30:31AM +0100, Solofo.Ramangalahy@bull.net wrote:
> This was really an RFC, as you also pointed out.
> Regarding this patch, the discussion raised the question of whether
> EXT4_BG_INODE_UNINIT or EXT4_BG_ITABLE_UNINIT would be more coherent
> than EXT4_BG_INODE_ZEROED wrt. EXT4_BG_INODE_UNINIT and
> EXT4_BG_BLOCK_UNINIT.

EXT2_BG_ITABLE_UNINIT (or EXT2_BG_ITABLE_PARTIALLY_UNINIT, to be more
correct) would have been better, yes.  That way legacy filesystems
that didn't enable uninit_bg would have bg_flags == 0, and we would
know that inode table was properly initialized.  Unfortunately we did
it the other way, where EXT2_BG_INODE_ZEROED is set when the inode
table is initialized, instead of the other way around.

> This is also the first use of EXT4_BG_INODE_ZEROED in the kernel, so
> an occasion to revisit the name.

Unfortunately, we've been shipping mke2fs in e2fsprogs that sets the
EXT4_BG_INODE_ZERO for newly created filesystem, and if the
lazy_itable_init configuration parameter is set, it doesn't initialize
the inode table and leaves bg_flags set to EXT2_BG_INODE_UNINIT and
EXT2_BG_BLOCK_UNINIT. 

Distributions are already shipping e2fsprogs with this, and there are
ext4 filesystems out there in the wild, so it is indeed probably way
too late to change this.

					- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag
  2008-11-27 22:35       ` Theodore Tso
@ 2008-11-27 23:09         ` Andreas Dilger
  0 siblings, 0 replies; 15+ messages in thread
From: Andreas Dilger @ 2008-11-27 23:09 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Solofo.Ramangalahy, linux-ext4

On Nov 27, 2008  17:35 -0500, Theodore Ts'o wrote:
> Unfortunately, we've been shipping mke2fs in e2fsprogs that sets the
> EXT4_BG_INODE_ZERO for newly created filesystem, and if the
> lazy_itable_init configuration parameter is set, it doesn't initialize
> the inode table and leaves bg_flags set to EXT2_BG_INODE_UNINIT and
> EXT2_BG_BLOCK_UNINIT. 
> 
> Distributions are already shipping e2fsprogs with this, and there are
> ext4 filesystems out there in the wild, so it is indeed probably way
> too late to change this.

I suppose it would be possible to add a new flag and deprecate the old
INODE_ZERO usage after a couple of years.  Until we start doing anything
with the kernel we've done everything with mke2fs to zero the inode table,
so that would be a reasonable assumption for the kernel to make.

I agree that having the flag with the opposite meaning would have been
better, but hindsight is 20-20.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2008-11-27 23:09 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-21 10:23 [RFC 0/2] ext4: zero uninitialized inode tables Solofo.Ramangalahy
2008-11-21 10:23 ` [RFC 1/2] ext4 resize: Mark the added group with EXT4_BG_INODE_ZEROED flag Solofo.Ramangalahy
2008-11-24 23:25   ` Andreas Dilger
2008-11-25 11:27     ` Solofo.Ramangalahy
2008-11-25 21:18       ` Andreas Dilger
2008-11-27  4:50   ` Theodore Tso
2008-11-27  9:30     ` Solofo.Ramangalahy
2008-11-27 22:35       ` Theodore Tso
2008-11-27 23:09         ` Andreas Dilger
2008-11-21 10:23 ` [RFC 2/2] ext4: module to initialize the inode table when using mkfs option lazy_itable_init Solofo.Ramangalahy
2008-11-25  5:32 ` [RFC 0/2] ext4: zero uninitialized inode tables Theodore Tso
2008-11-25  8:35   ` Andreas Dilger
2008-11-25 12:28   ` Solofo.Ramangalahy
2008-11-25 18:52     ` Theodore Tso
2008-11-25 21:10     ` Andreas Dilger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).